Automating Web exploration

Here's how to create a Web search engine service


Page 3 of 5

The business end of the WebExplorer class is the triggering of the Web crawl process. This is achieved by instantiating a PageVisitor thread object (which immediately starts running) and passing it the seed Web page URL received as a command-line argument.

The next class to examine, therefore, is the PageVisitor class, which is the algorithmic heart of the program:

Listing 2. class PageVisitor.

Before looking at PageVisitor's constructor, we need to look at the class initializations that occur right after class PageVisitor is first loaded into the JVM. Class PageVisitor contains two class (static) objects that are central to the functioning of the program: the global database of encountered pages and an object of type CrowdController that is used to limit the number of threads spawned by our recursive algorithm. Because both objects are declared static, the many thread instances all access the same database and the same CrowdController object. The database is implemented as a java.util.Hashtable and is initially empty (I chose class Hashtable over, say, Vector because a Hashtable can find stored items fast). The CrowdController object is initialized to manage a "crowd" of at most MAX_THREADS participants (that is, PageVisitor threads). Class CrowdController will be discussed later.
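To make the role of the static database concrete, here is a minimal sketch of my own, not the code from Listing 2. The class name SharedDatabaseSketch and the helper markVisited() are hypothetical, and the CrowdController is left out entirely. It relies only on the fact that Hashtable.put() is synchronized and returns the previous value for a key, or null if there was none.

```java
import java.util.Hashtable;

// Hypothetical stand-in for PageVisitor; shows only the shared static database.
class SharedDatabaseSketch extends Thread {

    // One table per class, not per instance: every thread sees the same entries.
    static final Hashtable<String, Boolean> database = new Hashtable<String, Boolean>();

    private final String page;

    SharedDatabaseSketch(String page) {
        this.page = page;
    }

    // Records the page; returns true only for the first caller to record it.
    static boolean markVisited(String page) {
        return database.put(page, Boolean.TRUE) == null;
    }

    @Override
    public void run() {
        markVisited(page);
    }

    public static void main(String[] args) throws InterruptedException {
        String[] pages = {
            "http://example.com/a", "http://example.com/b", "http://example.com/a"
        };
        Thread[] threads = new Thread[pages.length];
        for (int i = 0; i < pages.length; i++) {
            threads[i] = new SharedDatabaseSketch(pages[i]);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // Three threads, but only two distinct pages in the shared table.
        System.out.println(database.size()); // prints 2
    }
}
```

Had the table been an instance field, each thread would have filled its own private copy and the duplicate page would have gone unnoticed.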

The constructor for class PageVisitor converts the argument page URL (in String form) to an instance of a URL object. Being a thread, it labels itself with the page's URL; this gives the thread an identity that can be used later in the program or for debugging. The constructor then starts the thread running, which means the logic continues in this class's run() method. The body of the thread, as implemented by run(), is also the heart of this program, so the pseudo-code given earlier should be your guide. For the moment, you can ignore the statements at the start and end of run() that deal with limiting the number of running threads.
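The constructor steps just described can be sketched roughly as follows. This is an illustration under my own assumptions rather than the listing itself: the class name PageVisitorSketch, the field names, and the placeholder body of run() are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of a self-starting visitor thread.
class PageVisitorSketch extends Thread {

    private final URL pageUrl;
    volatile boolean visited = false; // set by run(); here only to observe the thread

    PageVisitorSketch(String pageAddress) throws MalformedURLException {
        // String -> URL; a malformed address is rejected before any thread starts.
        // (new URL(String) is the classic constructor; newer JDKs flag it as deprecated.)
        pageUrl = new URL(pageAddress);
        setName(pageAddress); // label the thread with the page's URL
        start();              // execution continues in run() on the new thread
    }

    @Override
    public void run() {
        // Placeholder for the real work: load pageUrl, extract links, recurse.
        visited = true;
    }

    public static void main(String[] args) throws Exception {
        // The seed URL comes from the command line, with a fallback for the demo.
        String seed = args.length > 0 ? args[0] : "http://example.com/index.html";
        PageVisitorSketch visitor = new PageVisitorSketch(seed);
        visitor.join();
        System.out.println(visitor.getName());
    }
}
```

Starting a thread from its own constructor keeps the calling code short, but it is safe here only because every field is assigned before start() is called.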

The main methods supporting the recursive page-visiting algorithm are loadPage() and extractHyperTextLinks(). Method loadPage() relies heavily on an underlying class -- class HTTP -- to hide the nitty-gritty details of talking to a Web server and requesting (and getting) a Web page. Once a page is loaded and stored in a single string for convenience, the hypertext links can be extracted from it and collected in a vector. (I use a vector simply as a convenient "bag" to collect links in; none of the array-like properties of class Vector are used by the program.) The vector's elements are then enumerated (the links are stored as separate strings), and the entire process is repeated for each link (load page, extract links, and so on).
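The load-extract-repeat cycle can be tried out without a network connection. The sketch below is a simplification of my own: class CrawlLoopSketch is hypothetical, loadPage() reads from an in-memory map instead of going through class HTTP, the link extraction is a crude whitespace split, and the recursion runs in a single thread rather than spawning one thread per link.

```java
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.Vector;

// Hypothetical, single-threaded model of the page-visiting recursion.
class CrawlLoopSketch {

    // Stand-in for the Web: URL -> page text.
    static final Map<String, String> fakeWeb = new HashMap<String, String>();

    // Database of encountered pages.
    static final Hashtable<String, Boolean> database = new Hashtable<String, Boolean>();

    // Stand-in for the real loadPage(): returns the whole page as one string.
    static String loadPage(String url) {
        String body = fakeWeb.get(url);
        return body == null ? "" : body;
    }

    // Crude stand-in extraction: any whitespace-separated token starting with "http://".
    static Vector<String> extractHyperTextLinks(String page) {
        Vector<String> links = new Vector<String>();
        for (String token : page.split("\\s+")) {
            if (token.startsWith("http://")) {
                links.addElement(token);
            }
        }
        return links;
    }

    static void visit(String url) {
        if (database.put(url, Boolean.TRUE) != null) {
            return; // already encountered; this is what stops cycles
        }
        Vector<String> links = extractHyperTextLinks(loadPage(url));
        // The vector is used purely as a bag: enumerate it and repeat the process.
        for (Enumeration<String> e = links.elements(); e.hasMoreElements();) {
            visit(e.nextElement());
        }
    }

    public static void main(String[] args) {
        fakeWeb.put("http://a.example/", "http://b.example/ http://c.example/");
        fakeWeb.put("http://b.example/", "http://a.example/ http://c.example/");
        fakeWeb.put("http://c.example/", "no links here");
        visit("http://a.example/");
        System.out.println(database.size()); // prints 3
    }
}
```

Pages a and b link to each other, yet the crawl terminates because the database check rejects any page that has already been recorded.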

A simple extractHyperTextLinks() method is implemented here in order to demonstrate a functional Web crawler-type program. Its extraction algorithm is very brittle; it will fail to extract some links and will extract others incorrectly, depending on the syntax used by the HTML author or authoring tool. Currently, it uses String.indexOf() to hunt down consecutive occurrences of the case-sensitive substring "http://", which it regards as the start of a legal reference. It also assumes that this string is embedded in an <A.. > anchor tag. Both premises would have to be discarded in a real-life, robust implementation, which would probably have to parse the HTML structure in full to pinpoint real "hot" links embedded in <A> tags, as opposed to plain-text (quoted) URL strings. (This article, for a start, will throw off the current prototype implementation because of the above non-link instance of the string "http://"!)
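For readers who want to experiment with this kind of scan, here is a minimal sketch of an indexOf()-based extractor. It is my reconstruction of the idea, not the article's listing: the class name LinkExtractorSketch and the helper isTerminator() are hypothetical, and the rule for where a URL ends (at the first quote, angle bracket, or whitespace) is an assumption.

```java
import java.util.Vector;

// Hypothetical sketch of the brittle "http://" scan.
class LinkExtractorSketch {

    // Treats every case-sensitive "http://" as the start of a link.
    static Vector<String> extractHyperTextLinks(String page) {
        Vector<String> links = new Vector<String>();
        int start = page.indexOf("http://");
        while (start != -1) {
            // Assumption: the URL runs until the first terminator character.
            int end = start;
            while (end < page.length() && !isTerminator(page.charAt(end))) {
                end++;
            }
            links.addElement(page.substring(start, end));
            start = page.indexOf("http://", end); // hunt for the next occurrence
        }
        return links;
    }

    static boolean isTerminator(char c) {
        return c == '"' || c == '\'' || c == '<' || c == '>' || Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://example.com/a.html\">A</A> see http://example.com/b too";
        // The second match is plain text, not a link, but the scan cannot tell.
        System.out.println(extractHyperTextLinks(html));
        // prints [http://example.com/a.html, http://example.com/b]
    }
}
```

The demo shows both weaknesses at once: the plain-text URL is reported as a link, and an upper-case "HTTP://" or a relative reference would be missed altogether.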
