Page 2 of 5
Downloading Web pages at the same time calls for Java's multithreading capability. Every page visited can be processed by a separate thread, which in turn spawns new threads to deal with every page referenced by the page it is handling. This approach sounds very elegant, but actually it hides a flaw: Any Java virtual machine only has so many threads it will support -- which is unspecified by the language specification itself.
Also, the use of multiple threads quickly becomes counter-productive once the available bandwidth is saturated with concurrent page requests and downloads. (This is why browsers like Netscape Navigator have a "Number of Connections" option that users can tune to their particular set-ups and Internet providers.) The design of our Web explorer therefore calls for an additional mechanism to transparently limit the exponential growth potential of the recursive algorithm. By transparent I mean that this mechanism shouldn't require us to modify the elegance and simplicity of the basic recursive algorithm; it should just keep it on a (very) short lead to prevent the runaway creation of threads.
The pseudo-code for the program so far is simply:
load page extract all URLs for all found URLs spawn new thread of self
Click here to download a zip file of the WebCrawler source code.
At the highest level, the recursion we rely on is very similar to that used by a DIR program. Unlike most hierarchical filing
systems, though, the Web is not a pure tree. Therefore we need to take steps to avoid going around in endless loops, revisiting
pages over and over again. A simple global database of visited pages will allow the program to decide whether to go down a
link or not. This database should be efficient in terms of look-up speed, and flexible enough to deal with the infiniteness
of the Web. Java's Hashtable class easily fulfills our requirements -- at least for the kind of prototype we're building here.
Let's work in a top-down fashion and write the program frame for WebExplorer first:
Listing 1. Top-level class WebExplorer.
The program takes a single argument -- the URL of an existing Web page (for example, http://www.telework.demon.co.uk, which
is this author's home page). Notice the structure of the program: Its static main() method does not contain any of the program's logic; instead it instantiates an object of its own class and hands all control
over to that object. This approach is preferable to the common C "technique" of cramming everything in main().
If you use Java as an implementation language, you will quickly find that this C idiom doesn't work very well in Java. Java's main() has to be declared static, which means that it can only directly call fellow static methods in the same class. This situation becomes unbearable after budding off just one or two subroutines from your main program. In Java, therefore, a much better approach is to simply do object-oriented work from the start, by instantiating an object and handing control over to that object. As an instance of the class, it can then invoke static and instance methods at its leisure.