This article demonstrates how to create an intelligent Web spider based on standard Java network objects. The heart of this spider is a recursive routine that can perform depth-first Web searches based on keyword/phrase criteria and Webpage characteristics. Search progress displays graphically using a JTree structure. I address issues such as resolving relative URLs, avoiding reference loops, and monitoring memory/stack usage. In addition, I demonstrate the proper use of Java network objects for accessing and parsing remote Webpages.
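Resolving relative URLs is one of the issues mentioned above, and the standard java.net.URL class handles it directly: its two-argument constructor interprets a link relative to the page it appeared on. A minimal sketch (the class name UrlResolver and the example URLs are illustrative, not from the article's source):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolver {
    // Resolve a possibly relative link against the page it appeared on.
    // The two-argument URL constructor handles relative paths, "..",
    // root-relative links, and fully qualified links uniformly.
    public static URL resolve(URL base, String link) throws MalformedURLException {
        return new URL(base, link);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.example.com/articles/index.html");
        System.out.println(resolve(base, "spider.html"));
        // http://www.example.com/articles/spider.html
        System.out.println(resolve(base, "/images/logo.gif"));
        // http://www.example.com/images/logo.gif
    }
}
```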
The demonstration program consists of the user interface class SpiderControl; the Web-searching class Spider; two classes used to build a JTree showing the results, UrlTreeNode and UrlNodeRenderer; and two classes (one of which is VerifierListener) that help verify integer input into the user interface. See Resources for a link to the full source code and documentation.
The SpiderControl interface is composed of three tabs: one to set the search parameters, another to display the resulting search tree (a JTree), and a third to display error and status messages—see Figure 1.
Figure 1. Search parameters tab. Click on thumbnail to view full-sized image.
Search parameters include the maximum number of sites to visit, the search's maximum depth (links to links to links), a list of keywords/phrases, the root-level domains to search, and the starting Website or portal. Once the user has entered the search parameters and pressed the Start button, the Web search will start, and the second tab (Figure 2) displays to show the search's progress.
Figure 2. Search tree. Click on thumbnail to view full-sized image.
An instance of the Spider class running in a separate thread conducts the Web search. A separate thread is used so that the SpiderControl module can continually update the search tree's display and process the Stop Search button. As the Spider runs, it continually adds nodes (UrlTreeNode) to the JTree displayed in the second tab. Search tree nodes that contain keywords and phrases appear in blue (rendered by UrlNodeRenderer).
When the search completes, the user can view the site's vital statistics and view the site itself in an external Web browser (the program defaults to Internet Explorer, located in the Program Files folder). The vital statistics include the keywords present, total text characters, total images, and total links.
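Because the Spider runs on its own thread while the JTree belongs to Swing's event-dispatch thread, node insertions should be queued with SwingUtilities.invokeLater. A minimal sketch of the idea, using a plain DefaultMutableTreeNode in place of the article's UrlTreeNode (the helper name addNode is mine, not the article's API):

```java
import javax.swing.JTree;
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;
import javax.swing.tree.TreePath;

public class TreeUpdater {
    // Add a child node to the search tree from the spider's worker thread.
    // Swing components must be mutated on the event-dispatch thread, so
    // the update is queued with SwingUtilities.invokeLater.
    public static void addNode(JTree tree, DefaultMutableTreeNode parent,
                               DefaultMutableTreeNode child) {
        SwingUtilities.invokeLater(() -> {
            DefaultTreeModel model = (DefaultTreeModel) tree.getModel();
            model.insertNodeInto(child, parent, parent.getChildCount());
            // Keep the newest node visible as the search tree grows.
            tree.scrollPathToVisible(new TreePath(child.getPath()));
        });
    }
}
```

insertNodeInto (rather than parent.add) makes the model fire the change events the JTree listens for, so the display refreshes without manual reload calls.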
The Spider class is responsible for searching the Web given a starting point (portal), a list of keywords and domains, and limits on the search's depth and size. Spider extends Thread so it can run in a separate thread. This allows the SpiderControl module to continually update the search tree's display and process the Stop Search button.
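One common way to wire this up is a volatile stop flag that the recursive search checks on each call. The sketch below is illustrative, not the article's exact code: the constructor parameters and the requestStop method name are assumptions.

```java
// Hypothetical skeleton of a Spider that extends Thread and honors
// a Stop Search button via a cooperative, volatile stop flag.
public class Spider extends Thread {
    private volatile boolean stopRequested = false;

    private final String startUrl;
    private final int maxDepth;

    public Spider(String startUrl, int maxDepth) {
        this.startUrl = startUrl;
        this.maxDepth = maxDepth;
    }

    // Called by the UI thread when the user presses Stop Search.
    public void requestStop() {
        stopRequested = true;
    }

    @Override
    public void run() {
        search(startUrl, 0);
    }

    // Recursive depth-first search; bails out when stopped or too deep.
    private void search(String url, int depth) {
        if (stopRequested || depth > maxDepth) {
            return;
        }
        // ... fetch the page, add a UrlTreeNode, recurse into each link ...
    }
}
```

Because the flag is volatile, a write from the Swing event thread is promptly visible to the spider thread without further synchronization.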
The constructor method is passed the search parameters along with a reference to an empty JTree and an empty JTextArea. The JTree is used to create a hierarchical record of the sites visited as the search progresses. This provides visual feedback to the user and helps the Spider track where it has been to prevent circular searches. The JTextArea posts error and progress messages.
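The loop-prevention idea can be shown with a plain Set of normalized URL strings. The article's Spider uses the JTree itself as its record of visited sites; this VisitTracker class is a hypothetical stand-in that demonstrates the same technique:

```java
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Track visited URLs so the recursive depth-first search never follows
// the same link twice, which would otherwise cause circular searches.
public class VisitTracker {
    private final Set<String> visited = new HashSet<>();

    // Returns true the first time a URL is seen, false on every
    // subsequent attempt; callers skip URLs that return false.
    public boolean markVisited(URL url) {
        // Normalize to the external string form so equivalent URL
        // objects compare equal in the set.
        return visited.add(url.toExternalForm());
    }
}
```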