This article demonstrates how to create an intelligent Web spider based on standard Java network objects. The heart of this spider is a recursive routine that performs depth-first Web searches based on keyword/phrase criteria and Webpage characteristics. Search progress is displayed graphically in a JTree structure. I address issues such as resolving relative URLs, avoiding reference loops, and monitoring memory/stack usage. In addition, I demonstrate the proper use of the Java network objects used to access and parse remote Webpages.
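To make the recursive idea concrete, here is a minimal sketch of a depth-first search that resolves relative links and skips pages it has already seen. It is not the Spider class from the source code: the class name DepthFirstSketch, the in-memory link map (a stand-in for pages fetched over the network), and the method names are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the link map stands in for pages fetched from the Web.
class DepthFirstSketch {
    private final Map<String, List<String>> linksByPage;
    private final int maxDepth;
    private final int maxSites;
    private final Set<String> visited = new HashSet<>();
    final List<String> visitOrder = new ArrayList<>();

    DepthFirstSketch(Map<String, List<String>> linksByPage, int maxDepth, int maxSites) {
        this.linksByPage = linksByPage;
        this.maxDepth = maxDepth;
        this.maxSites = maxSites;
    }

    // Resolve an href (possibly relative) against the page it appeared on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).normalize().toString();
    }

    // Depth-first: follow each link fully before moving on to the next sibling.
    void search(String url, int depth) {
        if (depth > maxDepth || visitOrder.size() >= maxSites) {
            return; // depth or size limit reached
        }
        if (!visited.add(url)) {
            return; // already seen, so a reference loop ends here
        }
        visitOrder.add(url);
        for (String href : linksByPage.getOrDefault(url, List.of())) {
            search(resolve(url, href), depth + 1);
        }
    }
}
```

Calling search(startUrl, 0) fills visitOrder with the pages in the order they were reached. Because each level of links adds a stack frame, the depth limit also bounds how deep the recursion can go.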
The demonstration program consists of the user interface class SpiderControl; the Web-searching class Spider; the two classes used to build a JTree showing the results, UrlTreeNode and UrlNodeRenderer; and two classes to help verify integer input into the user interface, IntegerVerifier and VerifierListener. See Resources for a link to the full source code and documentation.
The SpiderControl interface is composed of three tabs: one to set the search parameters, another to display the resulting search tree (JTree), and a third to display error and status messages (see Figure 1).
Figure 1. Search parameters tab.
Search parameters include the maximum number of sites to visit, the search's maximum depth (links to links to links), a list of keywords/phrases, the root-level domains to search, and the starting Website or portal. Once the user has entered the search parameters and pressed the Start button, the Web search starts, and the second tab (Figure 2) opens to show the search's progress.
Figure 2. Search tree.
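One way to picture these parameters is as a small value object that is validated once and then handed to the search. The sketch below is an assumption for illustration only: the class name SearchParams, its fields, and the allowsHost helper do not come from the article's source code, and the domain check assumes that root-level domains are entered as suffixes such as com or .org.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical holder for the values entered on the search parameters tab.
class SearchParams {
    final int maxSites;
    final int maxDepth;
    final List<String> keywords;
    final List<String> domains;   // assumed to be suffixes such as "com" or ".org"
    final String startUrl;

    SearchParams(int maxSites, int maxDepth, List<String> keywords,
                 List<String> domains, String startUrl) {
        if (maxSites <= 0 || maxDepth <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxSites = maxSites;
        this.maxDepth = maxDepth;
        this.keywords = keywords;
        this.domains = domains;
        this.startUrl = startUrl;
    }

    // True when the host ends with one of the permitted root-level domains.
    boolean allowsHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        for (String d : domains) {
            String suffix = d.startsWith(".") ? d.substring(1) : d;
            if (h.endsWith("." + suffix.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Rejecting non-positive limits in the constructor mirrors the role the integer-verifying classes play in the user interface: bad numbers are caught before a search begins.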
An instance of the Spider class running in a separate thread conducts the Web search. Separate threads are used so that the SpiderControl module can continually update the search tree's display and process the Stop Search button. As the Spider runs, it continually adds nodes (UrlTreeNode) to the JTree displayed in the second tab. Search tree nodes that contain keywords and phrases appear in blue (UrlNodeRenderer).
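The node and renderer roles can be sketched with the standard Swing tree classes. The names PageNode, MatchRenderer, and TreeUpdates below are hypothetical stand-ins, not the article's UrlTreeNode and UrlNodeRenderer, and the keyword flag is an assumed field.

```java
import java.awt.Color;
import java.awt.Component;
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeCellRenderer;
import javax.swing.tree.DefaultTreeModel;

// Hypothetical stand-in for a search tree node: a URL plus a match flag.
class PageNode extends DefaultMutableTreeNode {
    final boolean hasKeywordMatch;

    PageNode(String url, boolean hasKeywordMatch) {
        super(url);
        this.hasKeywordMatch = hasKeywordMatch;
    }
}

// Hypothetical renderer: draws matching nodes in blue, others in the default color.
class MatchRenderer extends DefaultTreeCellRenderer {
    @Override
    public Component getTreeCellRendererComponent(JTree tree, Object value,
            boolean selected, boolean expanded, boolean leaf, int row, boolean hasFocus) {
        // The superclass resets the foreground on every call.
        super.getTreeCellRendererComponent(tree, value, selected, expanded, leaf, row, hasFocus);
        if (!selected && value instanceof PageNode && ((PageNode) value).hasKeywordMatch) {
            setForeground(Color.BLUE);
        }
        return this;
    }
}

class TreeUpdates {
    // insertNodeInto notifies the model's listeners, so a JTree showing it refreshes.
    // From a worker thread, this call would normally be wrapped in SwingUtilities.invokeLater.
    static void addChild(DefaultTreeModel model, PageNode parent, PageNode child) {
        model.insertNodeInto(child, parent, parent.getChildCount());
    }
}
```

A tree would pick up the coloring with tree.setCellRenderer(new MatchRenderer()). Going through the model rather than calling add on the node directly is what lets the display keep up while the search is still running.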
When the search completes, the user can view the site's vital statistics and view the site itself in an external Web browser (the program defaults to Internet Explorer, located in the Program Files folder). The vital statistics include the keywords present, total text characters, total images, and total links.
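As an illustration of how such statistics might be gathered, the sketch below counts links, images, and text characters with the HTML parser that ships in javax.swing.text.html. The class name PageStats and its fields are assumptions; the article's own counting code may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical collector of per-page statistics.
class PageStats extends HTMLEditorKit.ParserCallback {
    int links;
    int images;
    int textChars;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Count only anchors that actually carry an href.
        if (tag == HTML.Tag.A && attrs.getAttribute(HTML.Attribute.HREF) != null) {
            links++;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // IMG has no closing tag, so the parser reports it as a simple tag.
        if (tag == HTML.Tag.IMG) {
            images++;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        textChars += data.length;
    }

    // Parse an HTML string and return the collected counts.
    static PageStats of(String html) {
        PageStats stats = new PageStats();
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, stats, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stats;
    }
}
```

The same callback could be fed from a reader over a remote page's input stream instead of a string; a string is used here only to keep the sketch self-contained.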
The Spider class is responsible for searching the Web given a starting point (portal), a list of keywords and domains, and limits on the search's depth and size. Spider extends Thread so it can run in a separate thread. This allows the SpiderControl module to continually update the search tree's display and process the Stop Search button.
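A common way to let a button stop a worker thread is a volatile flag that the thread checks between units of work. The sketch below shows that pattern under assumed names (StoppableWorker, requestStop); it is not taken from the Spider source, which may signal a stop differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker that runs until asked to stop.
class StoppableWorker extends Thread {
    // volatile so a write from the UI thread is visible to the worker thread
    private volatile boolean stopRequested;
    final AtomicInteger steps = new AtomicInteger();

    // Called from another thread, for example a Stop button's action listener.
    void requestStop() {
        stopRequested = true;
    }

    @Override
    public void run() {
        while (!stopRequested) {
            steps.incrementAndGet(); // placeholder for visiting one page
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                return; // treat interruption as a request to stop
            }
        }
    }
}
```

The worker finishes its current step before it notices the flag, so a stop takes effect at the next check rather than instantly.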
The constructor is passed the search parameters along with a reference to an empty JTree and an empty JTextArea. The JTree is used to create a hierarchical record of the sites visited as the search progresses. This provides visual feedback to the user and helps the Spider track where it has been to prevent circular searches. The JTextArea posts error and progress messages.
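Using the tree itself as the record of visited sites could look roughly like the sketch below, which walks every node and compares its stored URL with a candidate. The class and method names are hypothetical, and the sketch assumes each node stores its URL as a String user object.

```java
import java.util.Enumeration;
import javax.swing.tree.DefaultMutableTreeNode;

// Hypothetical helper: is this URL already recorded somewhere in the tree?
class VisitedCheck {
    static boolean alreadyInTree(DefaultMutableTreeNode root, String url) {
        Enumeration<?> nodes = root.depthFirstEnumeration();
        while (nodes.hasMoreElements()) {
            Object node = nodes.nextElement();
            if (node instanceof DefaultMutableTreeNode
                    && url.equals(((DefaultMutableTreeNode) node).getUserObject())) {
                return true;
            }
        }
        return false;
    }
}
```

This scan visits every node on each check, which is fine for a search capped at a modest number of sites. For larger searches, keeping a HashSet of URLs alongside the tree would make the lookup much cheaper.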