Create intelligent Web spiders

How to use Java network objects and HTML objects

This article demonstrates how to create an intelligent Web spider based on standard Java network objects. The heart of this spider is a recursive routine that can perform depth-first Web searches based on keyword/phrase criteria and Webpage characteristics. Search progress displays graphically using a JTree structure. I address issues such as resolving relative URLs, avoiding reference loops, and monitoring memory/stack usage. In addition, I demonstrate the proper use of Java network objects used in accessing and parsing remote Webpages.

Spider demonstration program

The demonstration program consists of the user interface class SpiderControl; the Web-searching class Spider; the two classes used to build a JTree showing the results, UrlTreeNode and UrlNodeRenderer; and two classes to help verify integer input into the user interface, IntegerVerifier and VerifierListener. See Resources for a link to the full source code and documentation.

The SpiderControl interface is composed of three tabs, one to set the search parameters, another to display the resulting search tree (JTree), and a third to display error and status messages—see Figure 1.

Figure 1. Search parameters tab. Click on thumbnail to view full-sized image.

Search parameters include the maximum number of sites to visit, the search's maximum depth (links to links to links), a list of keywords/phrases, the root-level domains to search, and the starting Website or portal. Once the user has entered the search parameters and pressed the Start button, the Web search will start, and the second tab (Figure 2) displays to show the search's progress.

Figure 2. Search tree. Click on thumbnail to view full-sized image.

An instance of the Spider class running in a separate thread conducts the Web search. Separate threads are used so that the SpiderControl module can continually update the search tree's display and process the Stop Search button. As the Spider runs, it continually adds nodes (UrlTreeNode) to the JTree displayed in the second tab. Search tree nodes that contain keywords and phrases appear in blue (UrlNodeRenderer).

When the search completes, the user can view the site's vital statistics and view the site itself in an external Web browser (the program defaults to Internet Explorer, located in the Program Files folder). The vital statistics include the keywords present, total text characters, total images, and total links.

The Spider class

The Spider class is responsible for searching the Web given a starting point (portal), a list of keywords and domains, and limits on the search's depth and size. Spider inherits Thread so it can run in a separate thread. This allows the SpiderControl module to continually update the search tree's display and process the Stop Search button.

The constructor method is passed the search parameters along with a reference to an empty JTree and an empty JTextArea. The JTree is used to create a hierarchical record of the sites visited as the search progress. This provides visual feedback to the user and helps the Spider track where it has been to prevent circular searches. The JTextArea posts error and progress messages.

The constructor stores its parameters in class variables and initializes the JTree to render nodes using the UrlNodeRenderer class. The search will not start until SpiderControl calls the run() method.

The run() method starts execution in a separate thread. It first determines whether the portal site is a Web reference (starting with http, ftp, or www) or a local file reference. It then ensures the portal site has the proper notation, resets the run statistics, and calls searchWeb() to begin the search:

    public void run()
       DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel(); // get our model
       DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
       String urllc = startSite.toLowerCase();
       if(!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
          startSite = "file:///"+startSite;   // Note you must have 3 slashes !
         else // Http missing ?
            startSite = "http://"+startSite; // Tack on http://  
        startSite = startSite.replace('\\', '/'); // Fix bad slashes
       sitesFound = 0;
       sitesSearched = 0;
       searchWeb(root,startSite); // Search the Web

searchWeb() is a recursive method that accepts as parameters a parent node in the search tree and a Web address to search. searchWeb() first verifies that the given Website has not already been visited and that depth and site limits have not been exceeded. searchWeb() then yields to allow the SpiderControl thread to run (updating the screen and checking for Stop Search button presses). If all is in order, searchWeb() continues, if not, it returns.

Before searchWeb() begins reading and parsing the Website, it first verifies that the site is of the proper type and domain by creating a URL object based on the Website. The URL's protocol is checked to ensure it is either an HTML address or a file address (no need to search for "mailto:" and other protocols). Then the file extension (if present) is checked to ensure that it is an HTML file (no need to parse pdf or gif files). Once that is done, the domain is checked against the list specified by the user with the isDomainOk() method:

 ...URL url = new URL(urlstr); // Create the URL object from a string.
   String protocol = url.getProtocol(); // Ask the URL for its protocol
   if(!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
      messageArea.append("    Skipping : "+urlstr+" not a http site\n\n");
   String path = url.getPath();  // Ask the URL for its path
   int lastdot = path.lastIndexOf("."); // Check for file extension
   if(lastdot > 0)
      String extension = path.substring(lastdot);  // Just the file extension
      if(!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
      return;  // Skip everything but html files
      messageArea.append("    Skipping : "+urlstr+" not in domain list\n\n");

At this point, searchWeb() is fairly certain it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, opens an input stream, and parses the file. The following sections provide more details on parsing HTML files, resolving relative URLs, and controlling recursion.

Parsing HTML files

There are two ways to parse (pick apart) an HTML file to look for the A HREF = tags—a hard way and an easy way.

If you choose the hard way, you would create your own parsing algorithm using Java's StreamTokenizer class. With this technique, you'd have to specify the word and white-space characters for the StreamTokenizer object, then pick off the < and > symbols to find the tags, the attributes, and separate the text between tags. A lot of work.

The easy way is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator is a three-step process. First, create an InputStreamReader object from your URL; then, create an instance of a ParserCallback object; finally, create an instance of the ParserDelegator object and call its one public method parse():

  UrlTreeNode newnode = new UrlTreeNode(url); // Create the data node 
   InputStream in = url.openStream(); // Ask the URL object to create an input stream
   InputStreamReader isr = new InputStreamReader(in); // Convert the stream to a reader
   DefaultMutableTreeNode treenode = addNode(parentnode, newnode);  
   SpiderParserCallback cb = new SpiderParserCallback(treenode); // Create a callback object
   ParserDelegator pd = new ParserDelegator(); // Create the delegator
   pd.parse(isr,cb,true); // Parse the stream
   isr.close();  // Close the stream

parse() is passed an InputStreamReader, an instance of a ParseCallback object, and a flag for specifying whether the CharSet tags should be ignored. parse() then reads and decodes the HTML file, calling methods in the ParserCallback object each time it has completely decoded a tag or HTML element.

In the demonstration code, I implemented my ParserCallback as an inner class of Spider. Doing so allows ParseCallback to access Spider's methods and variables. Classes based on ParserCallback can override the following methods:

  • handleStartTag(): Called when a starting HTML tag is encountered, e.g., >A <
  • handleEndTag(): Called when an end HTML tag is encountered, e.g., >/A<
  • handleSimpleTag(): Called when a HTML tag that has no matching end tag is encountered
  • handleText(): Called when text between tags is encountered

In the demonstration program, I overrode the handleSimpleTag(), handleStartTag(), handleEndTag(), and handleTextTag() methods.

I overrode handleSimpleTag() so that my code can process HTML BASE and IMG tags. BASE tags tell what URL to use when resolving relative URL references. If no BASE tag is present, then the current URL is used to resolve relative references. handleSimpleTag() is passed three parameters, an HTML.Tag object, a MutableAttributeSet that contains all the tag's attributes, and relative position within the file. My code checks the tag to see if it is a BASE object instance; if it is, then the HREF attribute is retrieved and stored in the page's data node. This attribute is used later when resolving URL addresses to linked Websites. Each time an IMG tag is encountered, that page's image count is updated.

I overrode handleStartTag() so that the program can process HTML A and TITLE tags. The method tests to see if the t parameter is in fact an A tag, if it is, then the HREF attribute is retrieved.

fixHref() is called to clean up sloppy references (changes back slashes to forward slashes, adds missing final slashes). The link's URL is resolved by creating a URL object instance using the base URL and the one referenced. Then, a recursive call to searchWeb() processes this link. If the method encounters a TITLE tag, it clears the variable storing the last text encountered so that the title's end tag is assured of having the proper value (sometimes, a Webpage will have title tags with no title between them).

I overrode handleEndTag() so the HTML TITLE end tags can be processed. This end tag indicates that the previous text (stored in lastText) is the page's title text. This text is then stored in the page's data node. Since adding the title information to the data node will change the display of the data node in the tree, the nodeChanged() method must be called so the tree can adjust its layout.

I overrode handleText() so that the HTML page's text can be checked for any of the keywords or phrases being searched. handleText() is passed an array of characters and the position of the characters within the file. handleText() first converts the character array to a String object, converting to all uppercase in the process. Then each keyword/phrase in the search list is checked against the String object using the indexOf() method. If indexOf() returns a non-negative result, then the keyword/phrase is present in the page's text. If the keyword/phrase is present, the match is recorded in the node's match list and run statistics are updated:

1 2 Page 1