URL(URL context, String spec) accepts the link in the spec parameter and the base URL in the context parameter. If spec is a relative link, the constructor builds the complete reference from context. URL expects all URL specifications to be in the strict (Unix) format. Using backslashes, which the Microsoft Windows world allows, instead of forward slashes results in bad references. Also, if spec or context refers to a directory (containing index.html or default.html) rather than an HTML file, it must end with a slash. The method fixHref() checks for these sloppy references and fixes them:
public static String fixHref(String href)
{
    String newhref = href.replace('\\', '/'); // Fix sloppy Web references
    int lastdot = newhref.lastIndexOf('.');
    int lastslash = newhref.lastIndexOf('/');
    if (lastslash > lastdot)
    {
        if (newhref.charAt(newhref.length() - 1) != '/')
            newhref = newhref + "/"; // Add missing /
    }
    return newhref;
}
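To see the pieces working together, here is a minimal sketch of how a cleaned-up reference might be resolved against a base URL with the two-argument constructor. The example URLs and the UrlDemo class name are my own illustrations, not part of the article's Spider code:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlDemo {
    public static void main(String[] args) throws MalformedURLException {
        // Base (context) URL of the page currently being parsed
        URL context = new URL("http://www.example.com/docs/");
        // A sloppy reference using Windows-style backslashes
        String spec = fixHref("images\\logo.gif");
        // The two-argument constructor resolves spec against context
        URL resolved = new URL(context, spec);
        System.out.println(resolved);
        // prints http://www.example.com/docs/images/logo.gif
    }

    // Same helper as shown above
    public static String fixHref(String href) {
        String newhref = href.replace('\\', '/');
        int lastdot = newhref.lastIndexOf('.');
        int lastslash = newhref.lastIndexOf('/');
        if (lastslash > lastdot && newhref.charAt(newhref.length() - 1) != '/')
            newhref = newhref + "/";
        return newhref;
    }
}
```

Because the context URL ends with a slash, the relative spec is appended below the docs/ directory rather than replacing it.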
searchWeb() is initially called to search the starting Web address specified by the user. It then calls itself whenever an HTML link is found. This forms the basis of the "depth-first" search and can lead to two types of problems. First, and most critical, memory/stack overflow can result from too many recursive calls. These occur if there is a circular Web reference; that is, one Web page links to another that links back to the first, a common occurrence on the World Wide Web. To prevent this, searchWeb() checks the search tree (via the urlHasBeenVisited() method) to see whether the referenced page already exists. If it does, the link is ignored. If you implement a spider without a search tree, you must still maintain a list of visited sites (in a Vector or array, for instance) so you can determine whether you are revisiting a site.
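If you do forgo the search tree, a hash-based set of visited URLs is a simple guard against circular references. The following sketch is my own stand-in for the article's urlHasBeenVisited(); the class and method names are hypothetical, and the lowercase normalization is a simplifying assumption:

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedTracker {
    // Normalized URL strings that have already been searched
    private final Set<String> visited = new HashSet<>();

    // Returns true if the URL was seen before; records it otherwise.
    // Set.add() returns false when the element is already present,
    // so one call both tests and marks the URL.
    public boolean urlAlreadyVisited(String url) {
        return !visited.add(url.toLowerCase());
    }
}
```

Calling urlAlreadyVisited() before each recursive descent breaks any cycle, because the second arrival at a page returns true and the link is skipped.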
The second problem with recursion results from the nature of a depth-first search and the World Wide Web's structure. Depending
on the chosen portal, a depth-first search could result in hundreds of recursive calls before the original link on the original
page is completely processed. This has two undesirable consequences: first, memory/stack overflow could occur, and second,
the pages being searched may be too far removed from the original portal to give meaningful results. To control this, I added a "maximum search depth" setting to the spider. The user can select how many levels deep the search may go (links to links to links); as each link is entered, the current depth is checked via a call to the depthLimitExceeded() method, and if the limit is exceeded, the link is ignored. This test simply checks the level of the node in the JTree.
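Since DefaultMutableTreeNode already reports its distance from the root via getLevel(), the depth test reduces to a single comparison. This sketch assumes a hypothetical DepthCheck wrapper; the article's depthLimitExceeded() presumably lives inside the Spider class itself:

```java
import javax.swing.tree.DefaultMutableTreeNode;

public class DepthCheck {
    private final int maxDepth; // user-selected "maximum search depth"

    public DepthCheck(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // A node's level is its distance from the root node, so the
    // depth test is just a comparison against the user's limit.
    public boolean depthLimitExceeded(DefaultMutableTreeNode node) {
        return node.getLevel() >= maxDepth;
    }
}
```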
The demonstration program also adds a site limit, specified by the user, that stops the search after the specified number of URLs has been examined, thus ensuring the program will eventually stop! The site limit is controlled via a simple integer counter, sitesSearched, that is updated and checked after each call to searchWeb().
UrlTreeNode and UrlTreeNodeRenderer are classes that create the custom tree nodes added to the JTree in the SpiderControl user interface. UrlTreeNode holds the statistics and URL information for each searched Web site; it is stored in the JTree as the "user object" attribute of the standard DefaultMutableTreeNode objects. The data includes the keywords present in the node, the node's URL, the node's base URL, the number of links, images, and text characters, and whether the node matches the search criteria.
UrlTreeNodeRenderer is a subclass of DefaultTreeCellRenderer. It causes a node to display in blue text when the node contains matching keywords, and it supplies a custom icon for the tree nodes. Custom display is achieved by overriding the getTreeCellRendererComponent() method (see below). This method returns the Component object used to display the node. Most of the Component's attributes are set by the superclass; UrlTreeNodeRenderer changes only the text color (foreground) and the icons:
public Component getTreeCellRendererComponent(
        JTree tree,
        Object value,
        boolean sel,
        boolean expanded,
        boolean leaf,
        int row,
        boolean hasFocus) {
    super.getTreeCellRendererComponent(
        tree, value, sel, expanded, leaf, row, hasFocus);
    UrlTreeNode node =
        (UrlTreeNode)(((DefaultMutableTreeNode)value).getUserObject());
    if (node.isMatch()) // Set color
        setForeground(Color.blue);
    else
        setForeground(Color.black);
    if (icon != null) // Set a custom icon
    {
        setOpenIcon(icon);
        setClosedIcon(icon);
        setLeafIcon(icon);
    }
    return this;
}
This article has shown you how to create a Web spider and a user interface to control it. The user interface employs a JTree to track the spider's progress and record the sites visited. Alternatively, you could use a Vector to record the sites visited and display the progress using a simple counter. Other enhancements could include an interface
to a database to record keywords and sites, adding the ability to search through multiple portals, screening sites with too
much or too little text content, and giving the search engine synonym-search capabilities.
The Spider class shown in this article uses recursive calls to a search procedure. Alternatively, a separate thread running a new spider could be launched for each link encountered. This has the benefit of allowing connections to remote URLs to proceed concurrently, speeding execution. However, note that JTree and its supporting objects, such as DefaultMutableTreeNode, are not thread-safe, so programmers must perform their own synchronization.
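A minimal sketch of that threaded variant might look like the following. The ThreadedSpider class, the pool size, and the use of a synchronized set for the visited list are all my own assumptions; the key point is that any JTree mutation must be queued onto the Swing event-dispatch thread via SwingUtilities.invokeLater():

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.swing.SwingUtilities;

public class ThreadedSpider {
    // Thread-safe visited set shared by all worker threads
    private final Set<String> visited =
            Collections.synchronizedSet(new HashSet<>());
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void search(String url) {
        // Skip URLs another thread has already claimed
        if (!visited.add(url))
            return;
        pool.submit(() -> {
            // ... fetch the page and extract its links here ...
            // Swing components are single-threaded, so any JTree
            // update must run on the event-dispatch thread:
            SwingUtilities.invokeLater(() -> {
                // e.g., treeModel.insertNodeInto(newNode, parent, index);
            });
        });
    }

    // Shut down the pool and wait for outstanding fetches to finish
    public boolean awaitCompletion(long seconds) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(seconds, TimeUnit.SECONDS);
    }

    public int sitesSearched() {
        return visited.size();
    }
}
```

Claiming the URL in the visited set before submitting the task doubles as the synchronization point, so two threads can never fetch the same page.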