Create intelligent Web spiders

How to use Java network objects and HTML objects

The constructor stores its parameters in class variables and initializes the JTree to render nodes using the UrlNodeRenderer class. The search will not start until SpiderControl calls the run() method.

The run() method executes in a separate thread. It first determines whether the portal site is a Web reference (starting with http, ftp, or www) or a local file reference. It then normalizes the address: local paths get a file:/// prefix, addresses beginning with www get an http:// prefix, and backslashes become forward slashes. Finally, it resets the run statistics and calls searchWeb() to begin the search:

    public void run()
    {
        DefaultTreeModel treeModel = (DefaultTreeModel) searchTree.getModel(); // get our model
        DefaultMutableTreeNode root = (DefaultMutableTreeNode) treeModel.getRoot();
        String urllc = startSite.toLowerCase();
        if (!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
            !urllc.startsWith("www."))
        {
            startSite = "file:///" + startSite; // Note: you must have three slashes!
        }
        else if (urllc.startsWith("www.")) // http:// missing?
        {
            startSite = "http://" + startSite; // Tack on http://
        }

        startSite = startSite.replace('\\', '/'); // Fix bad slashes

        sitesFound = 0;
        sitesSearched = 0;
        updateStats();
        searchWeb(root, startSite); // Search the Web
        messageArea.append("Done!\n\n");
    }
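
To try the address normalization outside the Swing application, the same rules can be pulled into a small standalone helper. This is only a sketch: the class name StartSiteNormalizer and its normalize() method are hypothetical and do not appear in the article's source, which performs these steps inline in run().

```java
import java.util.Locale;

// Hypothetical helper; the article performs these steps inline in run().
final class StartSiteNormalizer {

    // Applies the same three rules as run(): keep full Web addresses,
    // prefix "www." addresses with http://, and treat anything else as a file.
    static String normalize(String startSite) {
        String lower = startSite.toLowerCase(Locale.ROOT);
        String result;
        if (lower.startsWith("http://") || lower.startsWith("ftp://")) {
            result = startSite;               // already a full Web address
        } else if (lower.startsWith("www.")) {
            result = "http://" + startSite;   // add the missing scheme
        } else {
            result = "file:///" + startSite;  // local file: three slashes
        }
        return result.replace('\\', '/');     // backslashes to forward slashes
    }

    public static void main(String[] args) {
        System.out.println(normalize("www.example.com"));      // http://www.example.com
        System.out.println(normalize("C:\\docs\\index.html")); // file:///C:/docs/index.html
    }
}
```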



searchWeb() is a recursive method that accepts as parameters a parent node in the search tree and a Web address to search. It first verifies that the given Website has not already been visited and that the depth and site limits have not been exceeded. It then yields to allow the SpiderControl thread to run (updating the screen and checking for Stop Search button presses). If all is in order, searchWeb() continues; if not, it returns.
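
The guard logic described above can be sketched independently of the network code. The example below is an illustration under stated assumptions, not the article's implementation: the class GuardedSearchSketch, its fields, and the in-memory link map are all hypothetical, and links are looked up in a map instead of being parsed from downloaded pages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the spider: links come from an in-memory map
// rather than from parsed pages, so only the guard logic is shown.
final class GuardedSearchSketch {
    private final Map<String, List<String>> links; // page -> links found on it
    private final int maxDepth;                    // assumed depth limit
    private final int maxSites;                    // assumed site limit
    private final Set<String> visited = new HashSet<>();
    private final List<String> searched = new ArrayList<>();
    private volatile boolean stopRequested;        // a Stop button would set this

    GuardedSearchSketch(Map<String, List<String>> links, int maxDepth, int maxSites) {
        this.links = links;
        this.maxDepth = maxDepth;
        this.maxSites = maxSites;
    }

    void requestStop() { stopRequested = true; }

    // Runs the guarded recursive search and returns pages in the order searched.
    List<String> search(String startSite) {
        searchWeb(startSite, 0);
        return searched;
    }

    private void searchWeb(String url, int depth) {
        // Stop if a limit is reached or the page was already visited.
        if (depth > maxDepth || visited.size() >= maxSites || !visited.add(url)) {
            return;
        }
        Thread.yield();          // give the controlling thread a chance to run
        if (stopRequested) {
            return;              // the user asked to stop the search
        }
        searched.add(url);
        for (String link : links.getOrDefault(url, Collections.<String>emptyList())) {
            searchWeb(link, depth + 1);
        }
    }
}
```

With a link map of a to b and c, b to a and d, c to d, and d to e, a depth limit of 2, and a site limit of 10, search("a") returns a, b, d, c. The link from b back to a is rejected as already visited, and e is never reached because it lies beyond the depth limit.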

Before searchWeb() begins reading and parsing the Website, it verifies that the site is of the proper type and domain by creating a URL object from the address. It checks the URL's protocol to ensure it is either an HTTP address or a file address (no need to search "mailto:" and other protocols). It then checks the file extension (if present) to ensure that it is an HTML file (no need to parse pdf or gif files). Finally, the isDomainOk() method checks the domain against the list specified by the user:

 ...URL url = new URL(urlstr); // Create the URL object from a string.
    String protocol = url.getProtocol(); // Ask the URL for its protocol
    if (!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
    {
        messageArea.append("    Skipping : " + urlstr + " not a http site\n\n");
        return;
    }
    String path = url.getPath(); // Ask the URL for its path
    int lastdot = path.lastIndexOf('.'); // Check for file extension
    if (lastdot > path.lastIndexOf('/')) // Ignore dots in directory names
    {
        String extension = path.substring(lastdot); // Just the file extension
        if (!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
            return; // Skip everything but html files
    }
    if (!isDomainOk(url))
    {
        messageArea.append("    Skipping : " + urlstr + " not in domain list\n\n");
        return;
    }
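
The listing calls isDomainOk() without showing its body. One plausible shape for such a check is sketched below; it is an assumption, not the article's code. The class DomainFilterSketch, the suffix-matching rule, the decision to accept all file URLs, and the accepts() convenience method are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical domain filter; the article's isDomainOk() may work differently.
final class DomainFilterSketch {
    private final List<String> allowedDomains; // suffixes such as ".com" or ".org"

    DomainFilterSketch(String... allowedDomains) {
        this.allowedDomains = Arrays.asList(allowedDomains);
    }

    // Returns true if the URL's host ends with one of the allowed suffixes.
    boolean isDomainOk(URL url) {
        if (url.getProtocol().equalsIgnoreCase("file")) {
            return true; // assumption: local files have no domain to check
        }
        String host = url.getHost().toLowerCase(Locale.ROOT);
        for (String domain : allowedDomains) {
            if (host.endsWith(domain.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Convenience wrapper: a string that is not a valid URL is rejected.
    boolean accepts(String urlstr) {
        try {
            return isDomainOk(new URL(urlstr));
        } catch (MalformedURLException e) {
            return false;
        }
    }
}
```

For example, a filter built with new DomainFilterSketch(".com", ".org") accepts http://www.example.com/index.html and rejects http://www.example.net/.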



At this point, searchWeb() is fairly certain it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, opens an input stream, and parses the file. The following sections provide more details on parsing HTML files, resolving relative URLs, and controlling recursion.
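
Adding a node to the search tree can be sketched with the standard Swing tree classes. This is a minimal illustration, assuming the node's user object is simply the address string; the class TreeNodeSketch and its addChild() helper are hypothetical names, and the article's own node contents may differ.

```java
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;

// Hypothetical helper showing how a found site could be attached to its parent.
final class TreeNodeSketch {

    // Creates a node for the given user object and appends it as the last
    // child of parent. Going through the model (rather than parent.add())
    // notifies any listening JTree so the display can refresh.
    static DefaultMutableTreeNode addChild(DefaultTreeModel model,
                                           DefaultMutableTreeNode parent,
                                           Object userObject) {
        DefaultMutableTreeNode child = new DefaultMutableTreeNode(userObject);
        model.insertNodeInto(child, parent, parent.getChildCount());
        return child;
    }

    public static void main(String[] args) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Root");
        DefaultTreeModel model = new DefaultTreeModel(root);
        DefaultMutableTreeNode site = addChild(model, root, "http://www.example.com/");
        addChild(model, site, "http://www.example.com/about.html");
        System.out.println(root.getChildCount()); // 1
        System.out.println(site.getChildCount()); // 1
    }
}
```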
