Automating Web exploration

Here's how to create a Web search engine service

Say you want to set up a Web search engine service that competes with Lycos or Yahoo. How would you go about constructing your database of Web-page URLs? You could knock on the aforementioned services' doors and ask the people in charge if they'll sell you their databases. "Fat chance," I bet you're thinking. You could email all your friends and ask them to mail in their favorite URLs. Not very practical either. So why not mine the Web itself and extract from it the data you need? All that's required is the equivalent of a Lunar Rover -- adapted to cyberspace.

This article will look at some of the issues involved in designing and implementing such a Web crawler-type program.

Before we tackle the design of the program, let's review what Java provides in the field of network programming, since the program will necessarily delve into technical networking matters. Java happens to be wonderfully suited for all but the lowest-level TCP/IP programming tasks: The standard java.net package features a small but functional set of classes encapsulating TCP/IP in a way that greatly reduces potential network programming pitfalls. The java.net classes really let you concentrate on your application, rather than on myriad protocol programming details.

The main java.net classes used for doing any TCP/IP programming are:

  • Socket
  • InetAddress
  • URL

Using these three simple classes, you can write a host (no pun intended) of interesting Internet client software. Class Socket is your main gateway to the Internet: Given some host's address in the form of an InetAddress instance, and a port number to connect to, class Socket will hand you two I/O streams to communicate with your chosen server in full duplex (via one input stream and one output stream). Class URL allows both addressing and accessing of Web resources. This class therefore can be considered a higher-level combination of Socket and InetAddress but with a bias toward Web applications.
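To make that concrete, here is a minimal sketch of the Socket/InetAddress pairing (the host name and port below are arbitrary examples, not taken from the crawler itself):

import java.io.*;
import java.net.*;

public class SocketDemo {
    public static void main(String[] args) throws IOException {
        // Resolve a host name into an InetAddress (example.com is just a placeholder)
        InetAddress host = InetAddress.getByName("example.com");

        // Connect to port 80 (HTTP) and obtain the two streams for full-duplex I/O
        Socket socket = new Socket(host, 80);
        OutputStream out = socket.getOutputStream();  // send requests to the server
        InputStream in = socket.getInputStream();     // read the server's replies

        // ... exchange data with the server here ...

        socket.close();
    }
}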

If you need to write server programs, need finer control over HTTP (WWW) exchanges, or need to use the User Datagram Protocol (UDP), then the following additional classes will help you in your endeavors:

  • ServerSocket
  • URLConnection
  • DatagramPacket and DatagramSocket (for UDP applications)

Before starting any network projects, bear in mind that Java applications and Java applets get different licenses to use these classes. As you probably know by now, applets are severely crippled when it comes to networking capabilities: For security reasons applets are restricted to communicating only with the server that produced (served) the applet itself. The Web exploration program we're going to write needs full access to the Web, so by necessity it will be a Java application and not an applet.

Conceptually, our Web exploration program is very similar to a textbook (recursive) file system lister (phew, what a mouthful), just like the ubiquitous DIR and ls commands we all know. But instead of a directory specification, our program takes any Web page as its sole argument. This page acts as the root of a hypertext graph to be explored. Every link (embedded in HTML <A> anchor tags) in a page leads to more pages that need to be visited. While the recursive algorithm in programs like DIR or ls is purely sequential in nature, there's an opportunity to design our Web explorer to be more efficient by harnessing the inherent parallelism of the Internet. Like all Web browsers, a Web explorer can download several Web pages at the same time, thus exploiting the full bandwidth available over any given link.

Downloading Web pages at the same time calls for Java's multithreading capability. Every page visited can be processed by a separate thread, which in turn spawns new threads to deal with every page referenced by the page it is handling. This approach sounds very elegant, but it hides a flaw: any Java virtual machine can support only so many threads -- and that limit is left unspecified by the language specification itself.

Also, the use of multiple threads quickly becomes counter-productive once the available bandwidth is saturated with concurrent page requests and downloads. (This is why browsers like Netscape Navigator have a "Number of Connections" option that users can tune to their particular set-ups and Internet providers.) The design of our Web explorer therefore calls for an additional mechanism to transparently limit the exponential growth potential of the recursive algorithm. By transparent I mean that this mechanism shouldn't require us to modify the elegance and simplicity of the basic recursive algorithm; it should just keep it on a (very) short lead to prevent the runaway creation of threads.

The pseudo-code for the program so far is simply:

load page
extract all URLs
for all found URLs
    spawn new thread of self


At the highest level, the recursion we rely on is very similar to that used by a DIR program. Unlike most hierarchical filing systems, though, the Web is not a pure tree. Therefore we need to take steps to avoid going around in endless loops, revisiting pages over and over again. A simple global database of visited pages will allow the program to decide whether to go down a link or not. This database should be efficient in terms of look-up speed, and flexible enough to deal with the infiniteness of the Web. Java's Hashtable class easily fulfills our requirements -- at least for the kind of prototype we're building here.
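A visited-pages check built on java.util.Hashtable can be as simple as the sketch below (the class and method names here are mine, not from the listings):

import java.util.Hashtable;

// A small visited-pages database: fast look-up, grows as needed.
class VisitedPages {
    private Hashtable visited = new Hashtable();

    // Returns true if the page still needs visiting, and records it as seen.
    synchronized boolean markIfUnvisited(String url) {
        if (visited.containsKey(url)) {
            return false;            // already seen -- don't follow this link again
        }
        visited.put(url, url);       // Hashtable needs a non-null value; the key itself will do
        return true;
    }
}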

Writing the program

Let's work in a top-down fashion and write the program frame for WebExplorer first:

Listing 1. Top-level class WebExplorer.

The program takes a single argument -- the URL of an existing Web page (for example, http://www.telework.demon.co.uk, which is this author's home page). Notice the structure of the program: Its static main() method does not contain any of the program's logic; instead it instantiates an object of its own class and hands all control over to that object. This approach is preferable to the common C "technique" of cramming everything in main().

If you use Java as an implementation language, you will quickly find that this C idiom doesn't work very well. Java's main() has to be declared static, which means it can directly call only fellow static methods in the same class. This situation becomes unbearable after splitting off just one or two subroutines from your main program. In Java, therefore, a much better approach is to work in an object-oriented way from the start, by instantiating an object and handing control over to it. As an instance of the class, that object can then invoke static and instance methods at its leisure.

The business end of the WebExplorer class is the triggering of the Web crawl process. This is achieved by instantiating a PageVisitor thread object (which immediately starts running) and passing it the seed Web page URL received as the command-line argument.
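Listing 1 is not reproduced inline here, but the frame just described might look roughly like this (method names other than main() are mine, and the exact PageVisitor constructor signature is an assumption):

public class WebExplorer {

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java WebExplorer <start-page-URL>");
            System.exit(1);
        }
        // Hand control over to an instance as soon as possible.
        new WebExplorer().explore(args[0]);
    }

    void explore(String startUrl) {
        // Kick off the crawl: the PageVisitor thread starts running immediately.
        new PageVisitor(startUrl);
    }
}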

The next class to examine, therefore, is the PageVisitor class, which is the algorithmic heart of the program:

Listing 2. class PageVisitor.

Before looking at PageVisitor's constructor, we need to look at the class initializations that occur right after class PageVisitor is first loaded into the JVM. Class PageVisitor contains two class (static) objects that are central to the functioning of the program: the global database of encountered pages and an object of type CrowdController that is used to limit the number of threads spawned by our recursive algorithm. Because both objects are declared static, the many thread instances will all access the same database and the same CrowdController object. The database is implemented as a java.util.Hashtable and is initially empty (class Hashtable was chosen for the database instead of, say, Vector because a Hashtable can find stored items fast). The CrowdController object is initialized to manage a "crowd" of at most MAX_THREADS participants (that is, PageVisitor threads). Class CrowdController is discussed later.
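In code, those two class-wide objects might be declared along the following lines (the value chosen for MAX_THREADS is an arbitrary example; CrowdController itself is sketched near the end of this article):

import java.util.Hashtable;

class PageVisitor extends Thread {

    // Maximum number of PageVisitor threads allowed to run at any one time.
    static final int MAX_THREADS = 4;

    // Shared by every PageVisitor instance: the pages already encountered...
    static Hashtable visitedPages = new Hashtable();

    // ...and the object that keeps the number of live threads in check.
    static CrowdController crowd = new CrowdController(MAX_THREADS);

    // (constructor and run() are discussed below)
}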

The constructor for class PageVisitor converts the argument page URL (in String form) into a URL object. Being a thread, it labels itself with the page's URL; this gives the thread an identity that can be used later for identification or debugging. The constructor then starts the thread running, which means the logic continues in this class's run() method. The body of the thread, as implemented by run(), is also the heart of this program, so the pseudo-code given earlier should be your guide. You can ignore the statements at the start and end that deal with limiting the number of running threads, for the moment.
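Continuing the PageVisitor sketch above, a constructor and run() skeleton matching that description might look something like this (loadPage() and extractHyperTextLinks() are sketched below, and getTicket()/returnTicket() are the thread-limiting calls explained shortly):

// (assumes imports of java.net.* and java.util.*)
private URL url;

PageVisitor(String pageUrl) {
    super(pageUrl);                      // label the thread with the page's URL
    try {
        url = new URL(pageUrl);
    } catch (MalformedURLException e) {
        System.err.println("Bad URL: " + pageUrl);
    }
    start();                             // logic continues in run()
}

public void run() {
    if (url == null) return;                     // the constructor failed to parse the URL

    int ticket = crowd.getTicket();              // blocks if the "crowd" is already full

    String page = loadPage(url);                 // fetch the raw HTML
    Vector links = extractHyperTextLinks(page);
    for (Enumeration e = links.elements(); e.hasMoreElements(); ) {
        String link = (String) e.nextElement();
        if (!visitedPages.containsKey(link)) {   // skip pages already seen (good enough for a prototype)
            visitedPages.put(link, link);
            new PageVisitor(link);               // recurse: one new thread per link
        }
    }

    crowd.returnTicket(ticket);                  // let a waiting thread take our place
}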

The main methods supporting the recursive page-visiting algorithm are loadPage() and extractHyperTextLinks(). Method loadPage() relies heavily on an underlying class -- class HTTP -- to hide the nitty-gritty details of talking to a Web server and requesting (and getting) a Web page. Once a page is loaded and stored in a single string for convenience, the hypertext links can be extracted from it and collected in a vector. (I use a vector simply as a convenient "bag" to collect links in; none of the array-like properties of class Vector are used by the program.) The vector's elements are then enumerated (the links are stored as separate strings), and the entire process is repeated for each link (load page, extract links, and so on).
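The article's HTTP helper class is not shown here, so as a stand-in for loadPage() the sketch below simply reads the page through java.net.URL's openStream(); the real listing talks to the Web server at a lower level:

// Stand-in for the article's loadPage(): read the whole page into one String.
// (needs java.io.*)
String loadPage(URL pageUrl) {
    StringBuffer page = new StringBuffer();
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(pageUrl.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            page.append(line).append('\n');
        }
        in.close();
    } catch (IOException e) {
        System.err.println("Could not load " + pageUrl + ": " + e);
    }
    return page.toString();
}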

A simple method extractHyperTextLinks() is implemented here in order to demonstrate a functional Web crawler-type program. Its extraction algorithm is very brittle; it will fail to extract some links and will extract others incorrectly, all depending on the syntax used by the HTML author or authoring tool. Currently, it uses String.indexOf() to hunt down consecutive occurrences of the case-sensitive substring "http://", which it regards as the start of a legal reference. It also assumes that this string is embedded in an <A ...> anchor tag. Both premises would have to be discarded in a real-life, robust implementation, in which the HTML structure would probably have to be parsed in full in order to pinpoint real "hot" links embedded in <A> tags, as opposed to plain-text (quoted) URL strings. (This very article, for a start, would throw off the current prototype implementation, because of the non-link occurrences of the string "http://" above!)
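In the same naive spirit (and therefore just as brittle), the extraction step could be sketched as follows; here the end of each link is taken to be the next double quote, which is itself an assumption about the HTML at hand:

// Hunt down every occurrence of "http://" and keep everything up to the next
// double quote -- deliberately naive, exactly as discussed above.
Vector extractHyperTextLinks(String page) {
    Vector links = new Vector();
    int pos = 0;
    while ((pos = page.indexOf("http://", pos)) != -1) {
        int end = page.indexOf('"', pos);   // assume the URL is quoted inside an <A ...> tag
        if (end == -1) {
            break;                          // unquoted or malformed -- give up on the rest
        }
        links.addElement(page.substring(pos, end));
        pos = end;                          // continue the hunt after this link
    }
    return links;
}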

If you now look back at the beginning of the run() method, you'll see my solution for controlling the potential explosion of spawned threads: class CrowdController and its two pillar methods, getTicket() and returnTicket(). Once a CrowdController object has been initialized to "manage" a crowd, it maintains a collection of tracked "tickets" that it can issue to any clients wishing to be limited in numbers (as in being part of a finite crowd). Clients request a ticket, go about their business, and then release their ticket. Because there is a finite number of tickets available, any client requesting a ticket while all tickets are in use will, without realizing it, be forced to block (that is, sleep) until a ticket is returned by some other client (in another thread).

Put in concrete terms, if we ask our CrowdController object to limit the crowd of threads to four, then only four tickets can ever be in circulation. Any newly created thread that calls getTicket() while all four tickets are already issued will not be allowed to run; it is held back by being put in a wait() state. When one of the four active threads is about to terminate, it first releases its grip on its ticket by calling returnTicket(ticketId). This method recycles the ticket and calls notifyAll() to wake up the threads sleeping inside getTicket(); one of them will then grab the recycled ticket and proceed.

Methods wait() and notifyAll() are from class Object, so all Java objects can invoke these methods. But when using them you need to be very careful: You can only call them from within a synchronized block. I declared both getTicket() and returnTicket() to be synchronized methods, thus both methods are synchronized blocks in their entirety. Although previous JavaWorld articles in the Nuts and Bolts section have delved into Java thread synchronization issues, it might be appropriate to reiterate the basics.

A synchronized block of code is entered only when a lock (or monitor) is obtained on some object. In the case of synchronized instance methods (as in our CrowdController example), the object locked is the instance on which the methods were invoked. Only one thread at a time can lock an object; other threads attempting to lock an object that is already locked are put in a wait state until the object becomes unlocked. Class CrowdController contains instance variables that need to be manipulated in an atomic fashion; hence both getTicket() and returnTicket() are declared synchronized, so that the entire object becomes locked while these methods execute.
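Pulling the description together, a CrowdController along the following lines would behave as described; the ticket bookkeeping (a simple array of in-use flags) is my own guess at the details:

// Limits how many threads may be "in the crowd" at once. A thread calls
// getTicket() before doing its work and returnTicket() when it is done.
class CrowdController {

    private boolean[] ticketInUse;   // one slot per ticket

    CrowdController(int crowdSize) {
        ticketInUse = new boolean[crowdSize];
    }

    // Hand out a free ticket, or block (wait) until one is returned.
    synchronized int getTicket() {
        while (true) {
            for (int i = 0; i < ticketInUse.length; i++) {
                if (!ticketInUse[i]) {
                    ticketInUse[i] = true;
                    return i;                    // the ticket id
                }
            }
            try {
                wait();                          // all tickets issued -- sleep until notified
            } catch (InterruptedException e) {
                // ignore and try again in this sketch
            }
        }
    }

    // Recycle a ticket and wake up any threads blocked inside getTicket().
    synchronized void returnTicket(int ticketId) {
        ticketInUse[ticketId] = false;
        notifyAll();
    }
}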
