Automating Web exploration

Here's how to create a Web search engine service

Say you want to set up a Web search engine service that competes with Lycos or Yahoo. How would you go about constructing your database of Web-page URLs? You could knock on the aforementioned services' doors and ask the people in charge if they'll sell you their databases. "Fat chance," I bet you're thinking. You could email all your friends and ask them to mail in their favorite URLs. Not very practical either. So why not mine the Web itself and extract from it the data you need? All that's required is the equivalent of a Lunar Rover -- adapted to cyberspace.

This article will look at some of the issues involved in designing and implementing such a Web crawler-type program.

Before we tackle the design of the program, let's review what Java provides in the field of network programming, since the program will necessarily delve into technical networking matters. Java happens to be wonderfully suited for all but the lowest-level TCP/IP programming tasks: The standard java.net package features a small but functional set of classes encapsulating TCP/IP in a way that greatly reduces potential network programming pitfalls. The java.net classes really let you concentrate on your application, rather than on myriad protocol programming details.

The main java.net classes used for doing any TCP/IP programming are:

  • Socket
  • InetAddress
  • URL


Using these three simple classes, you can write a host (no pun intended) of interesting Internet client software. Class Socket is your main gateway to the Internet: Given some host's address in the form of an InetAddress instance, and a port number to connect to, class Socket will hand you two I/O streams to communicate with your chosen server in full duplex (via one input stream and one output stream). Class URL allows both addressing and accessing of Web resources. This class therefore can be considered a higher-level combination of Socket and InetAddress but with a bias toward Web applications.
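
As a small illustration of the addressing side of class URL, the sketch below splits a URL string into its parts without opening any connection. The class name UrlPartsSketch, the method describe, and the example address are made up for this illustration. Note that recent JDKs mark the URL constructors as deprecated, although they still work.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlPartsSketch {
    /** Returns "protocol|host|port|path" for a URL string, or "malformed". */
    static String describe(String spec) {
        try {
            URL url = new URL(spec); // parses only; no network access happens here
            // getPort() returns -1 when the URL names no explicit port,
            // so fall back to the protocol's default (80 for http).
            int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
            return url.getProtocol() + "|" + url.getHost() + "|" + port + "|" + url.getPath();
        } catch (MalformedURLException e) {
            return "malformed";
        }
    }

    public static void main(String[] args) {
        // Hypothetical address, used purely as sample input.
        System.out.println(describe("http://www.example.com/docs/index.html"));
    }
}
```

The host and port recovered this way are what a Socket would need in order to reach the server that the URL names.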

If you need to write server programs, need finer control over HTTP (WWW) exchanges, or need to use the User Datagram Protocol (UDP), then the following additional classes will help you in your endeavors:

  • ServerSocket
  • URLConnection
  • DatagramPacket and DatagramSocket (for UDP applications)
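
To make the client and server roles concrete, here is a minimal sketch in which a ServerSocket and a client Socket talk to each other over the loopback interface inside one process. The class name LoopbackEchoSketch, the method roundTrip, and the tiny "reply in uppercase" protocol are inventions for this example only. It assumes the machine allows loopback connections.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class LoopbackEchoSketch {
    /** Sends one line to a throwaway local server and returns the server's reply. */
    static String roundTrip(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread serverThread = new Thread(() -> {
                try (Socket peer = server.accept();
                     BufferedReader in = reader(peer);
                     PrintWriter out = writer(peer)) {
                    String line = in.readLine();
                    out.println(line == null ? "" : line.toUpperCase(Locale.ROOT));
                } catch (IOException e) {
                    // On failure the client simply sees end-of-stream.
                }
            });
            serverThread.start();

            // Client side: one Socket yields both an input and an output stream.
            try (Socket client = new Socket(loopback, server.getLocalPort());
                 PrintWriter out = writer(client);
                 BufferedReader in = reader(client)) {
                out.println(message);
                return in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // autoflush = true so println() pushes the line onto the wire immediately
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello, server"));
    }
}
```

A real server would keep accepting connections in a loop; this one handles a single client and then shuts down, which is enough to show how the two socket classes fit together.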


Before starting any network project, bear in mind that Java applications and Java applets are granted different levels of access to these classes. Applets are severely restricted when it comes to networking: For security reasons, an applet may communicate only with the server it was downloaded from. The Web exploration program we're going to write needs full access to the Web, so by necessity it will be a Java application and not an applet.

Conceptually, our Web exploration program is very similar to a textbook (recursive) file system lister (phew, what a mouthful), just like the ubiquitous DIR or ls commands we all know. But instead of a directory specification, our program takes a Web page as its sole argument. This page acts as the root of a hypertext graph to be explored. Every link (embedded in an HTML <A> anchor tag) in a page leads to more pages that need to be visited. While the recursive algorithm in programs like DIR or ls is purely sequential in nature, we have an opportunity to make our Web explorer more efficient by harnessing the inherent parallelism of the Internet. Like all Web browsers, a Web explorer can download several Web pages at the same time, thus making fuller use of the bandwidth available on its network connection.
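
The traversal idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the program this article goes on to build: the class WebExplorerSketch, its methods, and the example.test pages are all hypothetical. Pages come from a caller-supplied fetcher function, so the demo runs on an in-memory map instead of the live Web. It also swaps plain recursion for a work queue plus a set of visited URLs, because Web pages, unlike directories, commonly link back to one another. It visits pages one at a time and leaves out parallel downloading.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WebExplorerSketch {
    // Simplified pattern for <A HREF="..."> anchors; real-world HTML needs a proper parser.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http/https links found in the given page. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            return links; // cannot resolve relative links without a valid base
        }
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) href = href.substring(0, hash); // drop in-page fragments
            if (href.isEmpty()) continue;
            try {
                URL target = new URL(base, href); // resolves relative links against the page
                String protocol = target.getProtocol();
                if (protocol.equals("http") || protocol.equals("https")) {
                    links.add(target.toExternalForm());
                }
            } catch (MalformedURLException e) {
                // unsupported or broken link; skip it
            }
        }
        return links;
    }

    /** Explores pages breadth-first from rootUrl, visiting at most maxPages pages. */
    static Set<String> explore(String rootUrl, Function<String, String> fetcher, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(rootUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String current = frontier.poll();
            if (!visited.add(current)) continue; // already seen: links can form cycles
            String html = fetcher.apply(current); // null means the page was unavailable
            if (html == null) continue;
            for (String link : extractLinks(current, html)) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny made-up "Web" held in memory.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.test/", "<a href=\"a.html\">A</a> <A HREF='/b.html'>B</A>");
        pages.put("http://example.test/a.html", "<a href=\"/\">home</a>");
        pages.put("http://example.test/b.html", "no links here");
        for (String url : explore("http://example.test/", pages::get, 10)) {
            System.out.println(url);
        }
    }
}
```

Passing the fetcher as a parameter keeps the traversal logic separate from the networking, so a version that really downloads pages, or one that downloads several at once, can be plugged in later without touching explore.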
