The constructor stores its parameters in instance variables and configures the JTree to render nodes using the UrlNodeRenderer class. The search does not start until SpiderControl calls the run() method.
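The constructor itself is not shown here; a minimal sketch of what it might look like is below. The field and parameter names (startSite, searchTree, messageArea) are assumptions, not the article's actual code, and UrlNodeRenderer is stubbed in for compilation:

```java
import javax.swing.JTextArea;
import javax.swing.JTree;
import javax.swing.tree.DefaultTreeCellRenderer;

// Stand-in for the article's UrlNodeRenderer class
class UrlNodeRenderer extends DefaultTreeCellRenderer { }

// Hypothetical sketch of the constructor described above; field names
// are assumptions, not the article's actual code
public class Spider implements Runnable
{
    protected String startSite;      // Portal site the search begins from
    protected JTree searchTree;      // Tree that displays the sites found
    protected JTextArea messageArea; // Status and log output
    protected int sitesFound;
    protected int sitesSearched;

    public Spider(String startSite, JTree searchTree, JTextArea messageArea)
    {
        // Store the parameters in instance variables
        this.startSite = startSite;
        this.searchTree = searchTree;
        this.messageArea = messageArea;
        // Render tree nodes with the custom renderer
        searchTree.setCellRenderer(new UrlNodeRenderer());
    }

    public void run() { /* search starts here, once SpiderControl calls it */ }
}
```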
The run() method starts execution in a separate thread. It first determines whether the portal site is a Web reference (starting with
http, ftp, or www) or a local file reference. It then ensures the portal site has the proper notation, resets the run statistics,
and calls searchWeb() to begin the search:
public void run()
{
    DefaultTreeModel treeModel = (DefaultTreeModel)searchTree.getModel(); // Get our model
    DefaultMutableTreeNode root = (DefaultMutableTreeNode)treeModel.getRoot();
    String urllc = startSite.toLowerCase();
    if(!urllc.startsWith("http://") && !urllc.startsWith("ftp://") &&
       !urllc.startsWith("www."))
    {
        startSite = "file:///" + startSite; // Note you must have 3 slashes!
    }
    else if(urllc.startsWith("www.")) // http:// missing?
    {
        startSite = "http://" + startSite; // Tack on http://
    }
    startSite = startSite.replace('\\', '/'); // Fix bad slashes
    sitesFound = 0;
    sitesSearched = 0;
    updateStats();
    searchWeb(root, startSite); // Search the Web
    messageArea.append("Done!\n\n");
}
searchWeb() is a recursive method that accepts as parameters a parent node in the search tree and a Web address to search. It first verifies that the given Web site has not already been visited and that the depth and site limits have not been exceeded. searchWeb() then yields, allowing the SpiderControl thread to run (updating the screen and checking for Stop Search button presses). If all is in order, searchWeb() continues; if not, it returns.
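Those entry checks and the yield can be sketched as follows. The visited set and the limit names (maxDepth, maxSites) are assumptions for illustration, not the article's actual fields:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of searchWeb()'s entry checks; visited, maxDepth,
// and maxSites are assumed names, not the article's actual fields
public class SearchLimits
{
    Set<String> visited = new HashSet<String>();
    int maxDepth = 5;    // Deepest level of recursion allowed
    int maxSites = 100;  // Most sites the search may visit

    // Returns true if the search may continue with this URL at this depth
    boolean shouldSearch(String urlstr, int depth)
    {
        if (visited.contains(urlstr))   // Already visited this site?
            return false;
        if (depth > maxDepth)           // Depth limit exceeded?
            return false;
        if (visited.size() >= maxSites) // Site limit exceeded?
            return false;
        visited.add(urlstr);
        // Yield so the SpiderControl (Swing) thread can repaint the screen
        // and notice a Stop Search button press
        Thread.yield();
        return true;
    }
}
```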
Before searchWeb() begins reading and parsing the Web site, it first verifies that the site is of the proper type and domain by creating a URL object from the site's address. The URL's protocol is checked to ensure it is either an HTTP address or a file address (no need to search "mailto:" and other protocols). Then the file extension (if present) is checked to ensure it is an HTML file (no need to parse .pdf or .gif files). Once that is done, the domain is checked against the user-specified list with the isDomainOk() method:
...
URL url = new URL(urlstr);           // Create the URL object from a string
String protocol = url.getProtocol(); // Ask the URL for its protocol
if(!protocol.equalsIgnoreCase("http") && !protocol.equalsIgnoreCase("file"))
{
    messageArea.append("   Skipping : "+urlstr+" not a http site\n\n");
    return;
}
String path = url.getPath();         // Ask the URL for its path
int lastdot = path.lastIndexOf("."); // Check for file extension
if(lastdot > 0)
{
    String extension = path.substring(lastdot); // Just the file extension
    if(!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
        return; // Skip everything but html files
}
if(!isDomainOk(url))
{
    messageArea.append("   Skipping : "+urlstr+" not in domain list\n\n");
    return;
}
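As a rough sketch of what isDomainOk() might do (the domainList contents, the file-URL handling, and the helper method here are all assumptions for illustration), it could compare the URL's host against the user's list of acceptable domains:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a domain filter like isDomainOk();
// domainList is an assumed field, not the article's actual code
public class DomainFilter
{
    List<String> domainList = Arrays.asList(".com", ".org");

    boolean isDomainOk(URL url)
    {
        String host = url.getHost().toLowerCase();
        if (host.length() == 0)          // Local file: URLs have no host
            return true;
        for (String domain : domainList)
            if (host.endsWith(domain))   // e.g. "www.sun.com" ends with ".com"
                return true;
        return false;
    }

    // Convenience wrapper that absorbs the checked exception from new URL()
    static boolean ok(String urlstr)
    {
        try { return new DomainFilter().isDomainOk(new URL(urlstr)); }
        catch (MalformedURLException e) { return false; }
    }
}
```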
At this point, searchWeb() is fairly certain it has a URL worth searching, so it creates a new node for the search tree, adds it to the tree, opens
an input stream, and parses the file. The following sections provide more details on parsing HTML files, resolving relative
URLs, and controlling recursion.
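Inserting the new node through the tree's model, rather than directly into the parent node, notifies the JTree so it repaints automatically. A minimal sketch of that step (the class and method names here are assumptions, not the article's code):

```java
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;

// Hypothetical sketch of attaching a newly found site to the search tree
public class NodeAdder
{
    // Create a node for urlstr and insert it under parent via the model,
    // so the JTree is notified of the change and repaints
    static DefaultMutableTreeNode addSite(JTree tree,
                                          DefaultMutableTreeNode parent,
                                          String urlstr)
    {
        DefaultTreeModel model = (DefaultTreeModel) tree.getModel();
        DefaultMutableTreeNode node = new DefaultMutableTreeNode(urlstr);
        model.insertNodeInto(node, parent, parent.getChildCount());
        return node;
    }
}
```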