|
|
Optimize with a SATA RAID Storage Solution
Range of capacities as low as $1250 per TB. Ideal if you currently rely on servers/disks/JBODs
Page 3 of 5
There are two ways to parse (pick apart) an HTML file to look for the A HREF = tags—a hard way and an easy way.
If you choose the hard way, you would create your own parsing algorithm using Java's StreamTokenizer class. With this technique, you'd have to specify the word and white-space characters for the StreamTokenizer object, then pick off the < and > symbols to find the tags, the attributes, and separate the text between tags. A lot of work.
The easy way is to use the built-in ParserDelegator class, a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator is a three-step process. First, create an InputStreamReader object from your URL; then, create an instance of a ParserCallback object; finally, create an instance of the ParserDelegator object and call its one public method parse():
UrlTreeNode newnode = new UrlTreeNode(url); // Create the data node
InputStream in = url.openStream(); // Ask the URL object to create an input stream
InputStreamReader isr = new InputStreamReader(in); // Convert the stream to a reader
DefaultMutableTreeNode treenode = addNode(parentnode, newnode);
SpiderParserCallback cb = new SpiderParserCallback(treenode); // Create a callback object
ParserDelegator pd = new ParserDelegator(); // Create the delegator
pd.parse(isr,cb,true); // Parse the stream
isr.close(); // Close the stream
parse() is passed an InputStreamReader, an instance of a ParseCallback object, and a flag for specifying whether the CharSet tags should be ignored. parse() then reads and decodes the HTML file, calling methods in the ParserCallback object each time it has completely decoded a tag or HTML element.
In the demonstration code, I implemented my ParserCallback as an inner class of Spider. Doing so allows ParseCallback to access Spider's methods and variables. Classes based on ParserCallback can override the following methods:
handleStartTag(): Called when a starting HTML tag is encountered, e.g., >A <handleEndTag(): Called when an end HTML tag is encountered, e.g., >/A<handleSimpleTag(): Called when a HTML tag that has no matching end tag is encountered
handleText(): Called when text between tags is encountered
In the demonstration program, I overrode the handleSimpleTag(), handleStartTag(), handleEndTag(), and handleTextTag() methods.
I overrode handleSimpleTag() so that my code can process HTML BASE and IMG tags. BASE tags tell what URL to use when resolving relative URL references. If no BASE tag is present, then the current URL is used to resolve relative references. handleSimpleTag() is passed three parameters, an HTML.Tag object, a MutableAttributeSet that contains all the tag's attributes, and relative position within the file. My code checks the tag to see if it is a BASE object instance; if it is, then the HREF attribute is retrieved and stored in the page's data node. This attribute is used
later when resolving URL addresses to linked Websites. Each time an IMG tag is encountered, that page's image count is updated.
Archived Discussions (Read only)