I overrode handleStartTag() so that the program can process HTML A and TITLE tags. The method tests whether the t parameter is an A tag; if it is, the HREF attribute is retrieved. fixHref() is called to clean up sloppy references (it changes back slashes to forward slashes and adds missing final slashes). The link's URL is then resolved by creating a URL object from the base URL and the referenced one, and a recursive call to searchWeb() processes the link. If the method encounters a TITLE tag, it clears the variable storing the last text encountered, so that the variable holds the proper value when the title's end tag arrives (sometimes a Webpage has title tags with no title between them).
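The cleanup step described above can be sketched on its own. The body of fixHref() is not shown in this section, so the class below is a hypothetical stand-in rather than the article's method: the name cleanHref() and the rule used to decide when a final slash is missing (no dot after the last slash) are assumptions made for illustration.

```java
class HrefCleanupSketch {
    /**
     * Hypothetical stand-in for fixHref(): converts back slashes to forward
     * slashes and appends a final slash to links that look like directories.
     */
    static String cleanHref(String href) {
        // Change back slashes to forward slashes
        String cleaned = href.replace('\\', '/');
        int lastSlash = cleaned.lastIndexOf('/');
        int lastDot = cleaned.lastIndexOf('.');
        // Assumed heuristic: no dot after the final slash means the last
        // segment has no file extension, so treat it as a directory
        boolean looksLikeDirectory = lastSlash > lastDot;
        if (looksLikeDirectory && !cleaned.endsWith("/")) {
            cleaned = cleaned + "/";
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(cleanHref("docs\\guide"));      // docs/guide/
        System.out.println(cleanHref("docs/index.html"));  // docs/index.html
    }
}
```

The heuristic is deliberately rough. A link with a query string or an extensionless file name would also gain a slash, so a real spider would need a more careful rule.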
I overrode handleEndTag() so that HTML TITLE end tags can be processed. This end tag indicates that the previous text (stored in lastText) is the page's title text, which is then stored in the page's data node. Because adding the title changes how the data node is displayed in the tree, the nodeChanged() method must be called so the tree can adjust its layout.
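To see why the nodeChanged() call is needed, the following minimal sketch changes a node's user object and counts the events the tree model fires. It assumes a plain String user object in place of the article's UrlTreeNode, and the class and method names are made up for this illustration.

```java
import javax.swing.event.TreeModelEvent;
import javax.swing.event.TreeModelListener;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;

class NodeChangedSketch {
    /** Returns how many "nodes changed" events nodeChanged() fired for one title update. */
    static int updateTitleAndCountEvents(String newTitle) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Search root");
        DefaultMutableTreeNode page = new DefaultMutableTreeNode("http://www.example.com/");
        root.add(page);
        DefaultTreeModel model = new DefaultTreeModel(root);

        final int[] changes = {0};
        model.addTreeModelListener(new TreeModelListener() {
            public void treeNodesChanged(TreeModelEvent e) { changes[0]++; }
            public void treeNodesInserted(TreeModelEvent e) { }
            public void treeNodesRemoved(TreeModelEvent e) { }
            public void treeStructureChanged(TreeModelEvent e) { }
        });

        // Changing the node's data does not notify any listener by itself
        page.setUserObject(newTitle);
        int before = changes[0];

        // nodeChanged() tells listeners (a JTree in the real program) to refresh
        model.nodeChanged(page);
        return changes[0] - before;
    }

    public static void main(String[] args) {
        System.out.println(updateTitleAndCountEvents("Example page")); // 1
    }
}
```

In the spider, the listener is the JTree displaying the search results, which uses the event to recompute the node's size and repaint it.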
I overrode handleText() so that the HTML page's text can be checked for any of the keywords or phrases being searched. handleText() is passed an array of characters and the position of the characters within the file. It first converts the character array to a String object and then makes an all-uppercase copy of it. Each keyword/phrase in the search list is then checked against the uppercase copy using the indexOf() method. If indexOf() returns a non-negative result, the keyword/phrase is present in the page's text. In that case the match is recorded in the node's match list, the run statistics are updated if this is the node's first match, and the method returns without checking the remaining keywords:
/**
 * Inner class used to handle HTML parser callbacks
 */
public class SpiderParserCallback extends HTMLEditorKit.ParserCallback {
    /** URL node being parsed */
    private UrlTreeNode node;
    /** Tree node */
    private DefaultMutableTreeNode treenode;
    /** Contents of last text element */
    private String lastText = "";

    /**
     * Creates a new instance of SpiderParserCallback
     * @param atreenode search tree node that is being parsed
     */
    public SpiderParserCallback(DefaultMutableTreeNode atreenode) {
        treenode = atreenode;
        node = (UrlTreeNode)treenode.getUserObject();
    }

    /**
     * Handle HTML tags that don't have a start and end tag
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos Position within file
     */
    public void handleSimpleTag(HTML.Tag t,
                                MutableAttributeSet a,
                                int pos)
    {
        if(t.equals(HTML.Tag.IMG))
        {
            node.addImages(1);
            return;
        }
        if(t.equals(HTML.Tag.BASE))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if(value != null)
                node.setBase(fixHref(value.toString()));
        }
    }

    /**
     * Take care of start tags
     * @param t HTML tag
     * @param a HTML attributes
     * @param pos Position within file
     */
    public void handleStartTag(HTML.Tag t,
                               MutableAttributeSet a,
                               int pos)
    {
        if(t.equals(HTML.Tag.TITLE))
        {
            // Clear stale text so an empty title is not filled with earlier text
            lastText="";
            return;
        }
        if(t.equals(HTML.Tag.A))
        {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if(value != null)
            {
                node.addLinks(1);
                String href = value.toString();
                href = fixHref(href);
                try{
                    // Resolve the link against the page's base URL, then recurse
                    URL referencedURL = new URL(node.getBase(),href);
                    searchWeb(treenode, referencedURL.getProtocol()+"://"+referencedURL.getHost()+referencedURL.getPath());
                }
                catch (MalformedURLException e)
                {
                    messageArea.append(" Bad URL encountered : "+href+"\n\n");
                    return;
                }
            }
        }
    }

    /**
     * Take care of end tags
     * @param t HTML tag
     * @param pos Position within file
     */
    public void handleEndTag(HTML.Tag t,
                             int pos)
    {
        if(t.equals(HTML.Tag.TITLE) && lastText != null)
        {
            node.setTitle(lastText.trim());
            // Tell the tree that the node's displayed text has changed
            DefaultTreeModel tm = (DefaultTreeModel)searchTree.getModel();
            tm.nodeChanged(treenode);
        }
    }

    /**
     * Take care of text between tags, check against keyword list for matches, if
     * match found, set the node match status to true
     * @param data Text between tags
     * @param pos position of text within Webpage
     */
    public void handleText(char[] data, int pos)
    {
        lastText = new String(data);
        node.addChars(lastText.length());
        String text = lastText.toUpperCase();
        for(int i = 0; i < keywordList.length; i++)
        {
            if(text.indexOf(keywordList[i]) >= 0)
            {
                // Count the site only the first time this node matches
                if(!node.isMatch())
                {
                    sitesFound++;
                    updateStats();
                }
                node.setMatch(keywordList[i]);
                return;
            }
        }
    }
}
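Because SpiderParserCallback is an inner class that depends on the rest of the spider (UrlTreeNode, searchWeb(), keywordList), it cannot be run by itself. The following stripped-down sketch shows the same callback pattern as a standalone class driven by the Swing ParserDelegator. The class name, its fields, and the assumption that keywords are supplied in uppercase are illustrative choices, not part of the article's code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class MiniCallbackSketch extends HTMLEditorKit.ParserCallback {
    final List<String> hrefs = new ArrayList<String>();
    String title = "";
    String matchedKeyword;            // stays null until a keyword is found
    private String lastText = "";
    private final String[] keywords;  // assumed to be uppercase already

    MiniCallbackSketch(String[] upperCaseKeywords) {
        keywords = upperCaseKeywords;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t.equals(HTML.Tag.TITLE)) {
            lastText = "";            // an empty title must not reuse old text
        } else if (t.equals(HTML.Tag.A)) {
            Object value = a.getAttribute(HTML.Attribute.HREF);
            if (value != null) {
                hrefs.add(value.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t.equals(HTML.Tag.TITLE)) {
            title = lastText.trim();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        lastText = new String(data);
        if (matchedKeyword != null) {
            return;                   // keep only the first match
        }
        String text = lastText.toUpperCase();
        for (String keyword : keywords) {
            if (text.indexOf(keyword) >= 0) {
                matchedKeyword = keyword;
                return;
            }
        }
    }

    /** Parses an HTML string and returns the callback holding what was found. */
    static MiniCallbackSketch parse(String html, String[] upperCaseKeywords) {
        MiniCallbackSketch callback = new MiniCallbackSketch(upperCaseKeywords);
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Example page </title></head><body>"
                + "<p>Read the <a href=\"docs/tutorial.html\">Java tutorial</a>.</p>"
                + "</body></html>";
        MiniCallbackSketch result = parse(html, new String[] {"JAVA"});
        System.out.println(result.title + " | " + result.hrefs + " | " + result.matchedKeyword);
    }
}
```

Unlike the spider, this sketch only collects the links it sees. It does not resolve them or follow them recursively.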
When relative links to Webpages are encountered, you must build complete links based on their base URLs. Base URLs can be explicitly defined in a Webpage via the BASE tag or implicitly defined as the URL of the page holding the link. The Java URL object provides a constructor that handles the resolution for you, provided you give it links structured to its liking.
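As a small illustration of that constructor, the sketch below resolves a few relative links with new URL(URL context, String spec). The class name, the helper method, and the www.example.com addresses are placeholders chosen for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlResolveSketch {
    /** Resolves href against base and returns the complete link as text. */
    static String resolve(String base, String href) {
        try {
            URL baseUrl = new URL(base);
            // The two-argument constructor treats href as relative to baseUrl
            return new URL(baseUrl, href).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + href, e);
        }
    }

    public static void main(String[] args) {
        // Link relative to the page holding it
        System.out.println(resolve("http://www.example.com/docs/index.html", "guide/intro.html"));
        // prints http://www.example.com/docs/guide/intro.html

        // Parent-directory reference
        System.out.println(resolve("http://www.example.com/docs/index.html", "../images/logo.gif"));
        // prints http://www.example.com/images/logo.gif

        // A base directory without its final slash is treated as a file name
        System.out.println(resolve("http://www.example.com/docs", "guide/"));
        // prints http://www.example.com/guide/
    }
}
```

The last case shows why the structure of the base matters: without the final slash, "docs" is taken to be a file and is dropped during resolution, which is the kind of sloppy reference that needs cleaning up before the URL is constructed.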