Newsletter sign-up
View all newsletters

Sign up for our technology specific newsletters.

Enterprise Java
Email Address:

Automating Web exploration

Here's how to create a Web search engine service

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone

Page 2 of 5

Downloading Web pages at the same time calls for Java's multithreading capability. Every page visited can be processed by a separate thread, which in turn spawns new threads to deal with every page referenced by the page it is handling. This approach sounds very elegant, but actually it hides a flaw: Any Java virtual machine only has so many threads it will support -- which is unspecified by the language specification itself.

Also, the use of multiple threads quickly becomes counter-productive once the available bandwidth is saturated with concurrent page requests and downloads. (This is why browsers like Netscape Navigator have a "Number of Connections" option that users can tune to their particular set-ups and Internet providers.) The design of our Web explorer therefore calls for an additional mechanism to transparently limit the exponential growth potential of the recursive algorithm. By transparent I mean that this mechanism shouldn't require us to modify the elegance and simplicity of the basic recursive algorithm; it should just keep it on a (very) short lead to prevent the runaway creation of threads.

The pseudo-code for the program so far is simply:

load page
extract all URLs
for all found URLs
spawn new thread of self


Click here to download a zip file of the WebCrawler source code.

At the highest level, the recursion we rely on is very similar to that used by a DIR program. Unlike most hierarchical filing systems, though, the Web is not a pure tree. Therefore we need to take steps to avoid going around in endless loops, revisiting pages over and over again. A simple global database of visited pages will allow the program to decide whether to go down a link or not. This database should be efficient in terms of look-up speed, and flexible enough to deal with the infiniteness of the Web. Java's Hashtable class easily fulfills our requirements -- at least for the kind of prototype we're building here.

Writing the program

Let's work in a top-down fashion and write the program frame for WebExplorer first:

Listing 1. Top-level class WebExplorer.

The program takes a single argument -- the URL of an existing Web page (for example, http://www.telework.demon.co.uk, which is this author's home page). Notice the structure of the program: Its static main() method does not contain any of the program's logic; instead it instantiates an object of its own class and hands all control over to that object. This approach is preferable to the common C "technique" of cramming everything in main().

If you use Java as an implementation language, you will quickly find that this C idiom doesn't work very well in Java. Java's main() has to be declared static, which means that it can only directly call fellow static methods in the same class. This situation becomes unbearable after budding off just one or two subroutines from your main program. In Java, therefore, a much better approach is to simply do object-oriented work from the start, by instantiating an object and handing control over to that object. As an instance of the class, it can then invoke static and instance methods at its leisure.

  • Digg
  • Reddit
  • SlashDot
  • Stumble
  • del.icio.us
  • Technorati
  • dzone
Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a JavaWorld account? Log in here. Register now for a free account.
Resources