Create intelligent Web spiders

How to use Java network objects and HTML objects

Parsing HTML files

There are two ways to parse (pick apart) an HTML file to look for the A HREF tags: a hard way and an easy way.

If you choose the hard way, you would create your own parsing algorithm using Java's StreamTokenizer class. With this technique, you'd have to specify the word and white-space characters for the StreamTokenizer object, then pick off the < and > symbols to find the tags and their attributes, and separate out the text between tags. That is a lot of work.
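To give a feel for the hard way, here is a minimal sketch that only pulls out tag names with StreamTokenizer. The class TagNameScanner and its tagNames() method are hypothetical names invented for this illustration. The sketch also assumes well-behaved input: it does not handle comments, scripts, or attribute values that contain a > character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the article's Spider code.
class TagNameScanner {

    // Returns the lowercased name of every tag in the given HTML, in document order.
    static List<String> tagNames(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();            // Start with every character "ordinary"
        st.wordChars(33, 255);       // Printable characters form words
        st.whitespaceChars(0, ' ');  // Control characters and spaces separate words
        st.ordinaryChar('<');        // Report < and > as single-character tokens
        st.ordinaryChar('>');

        List<String> names = new ArrayList<>();
        boolean expectName = false;  // True only for the token right after a '<'
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == '<') {
                    expectName = true;
                } else {
                    if (expectName && st.ttype == StreamTokenizer.TT_WORD) {
                        names.add(st.sval.toLowerCase());
                    }
                    expectName = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(tagNames("<html><a href=\"a.html\">Link</a></html>"));
    }
}
```

Even this stripped-down version leaves attribute parsing untouched: the href="a.html" part arrives as one undivided word that would still have to be split and unquoted by hand.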

The easy way is to use the built-in ParserDelegator class (in the javax.swing.text.html.parser package), a subclass of the HTMLEditorKit.Parser abstract class. These classes are not well documented in the Java documentation. Using ParserDelegator is a three-step process. First, create an InputStreamReader object from your URL; then, create an instance of a ParserCallback object; finally, create an instance of the ParserDelegator object and call its one public method, parse():

  UrlTreeNode newnode = new UrlTreeNode(url);  // Create the data node
  InputStream in = url.openStream();  // Ask the URL object to create an input stream
  InputStreamReader isr = new InputStreamReader(in);  // Convert the stream to a reader
  DefaultMutableTreeNode treenode = addNode(parentnode, newnode);  // Attach the data node under its parent
  SpiderParserCallback cb = new SpiderParserCallback(treenode);  // Create a callback object
  ParserDelegator pd = new ParserDelegator();  // Create the delegator
  pd.parse(isr, cb, true);  // Parse the stream, ignoring CharSet tags
  isr.close();  // Close the stream



parse() is passed an InputStreamReader, an instance of a ParserCallback object, and a flag specifying whether CharSet tags should be ignored. parse() then reads and decodes the HTML file, calling methods in the ParserCallback object each time it has completely decoded a tag or HTML element.
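The three steps can be tried without a network connection by feeding parse() a StringReader instead of a URL stream. The sketch below is a simplified stand-in for the article's SpiderParserCallback, not the actual demonstration code: the class name HrefCollector and the extractLinks() helper are assumptions made for this example, and the callback does nothing beyond collecting A HREF values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for SpiderParserCallback; collects A HREF values only.
class HrefCollector extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                // Copy the value right away; the parser may reuse the attribute set
                links.add(href.toString());
            }
        }
    }

    // Runs the three steps: reader, callback, delegator.
    static List<String> extractLinks(String html) {
        HrefCollector cb = new HrefCollector();           // Step 2: the callback
        try (Reader reader = new StringReader(html)) {    // Step 1: the reader
            new ParserDelegator().parse(reader, cb, true); // Step 3: parse
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.links;
    }
}
```

Calling HrefCollector.extractLinks(html) returns the HREF values in document order; named anchors without an HREF attribute are skipped.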

In the demonstration code, I implemented my ParserCallback as an inner class of Spider. Doing so allows the ParserCallback to access Spider's methods and variables. Classes based on ParserCallback can override the following methods:

  • handleStartTag(): Called when a starting HTML tag is encountered, e.g., <A>
  • handleEndTag(): Called when an end HTML tag is encountered, e.g., </A>
  • handleSimpleTag(): Called when an HTML tag that has no matching end tag is encountered
  • handleText(): Called when text between tags is encountered
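To see which of these methods fires for which piece of markup, a callback can simply record every call it receives. The following is a rough sketch under that idea; EventLogger and its log() helper are hypothetical names for this example only. Note that the parser may also report tags it infers on its own (such as HEAD), so the recorded list can contain more entries than the markup literally shows.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records each parser event as a short string.
class EventLogger extends HTMLEditorKit.ParserCallback {
    final List<String> events = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        events.add("start:" + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        events.add("end:" + tag);
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        events.add("simple:" + tag);
    }

    @Override
    public void handleText(char[] data, int pos) {
        events.add("text:" + new String(data));
    }

    // Parses the given HTML and returns the recorded events in call order.
    static List<String> log(String html) {
        EventLogger cb = new EventLogger();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb.events;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello<br><a href=\"x.html\">link</a></p></body></html>";
        for (String event : log(html)) {
            System.out.println(event);
        }
    }
}
```

For the sample page in main(), the A element arrives through handleStartTag() and handleEndTag(), the BR element through handleSimpleTag(), and the word "link" through handleText().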


In the demonstration program, I overrode the handleSimpleTag(), handleStartTag(), handleEndTag(), and handleText() methods.

I overrode handleSimpleTag() so that my code can process HTML BASE and IMG tags. A BASE tag tells which URL to use when resolving relative URL references. If no BASE tag is present, the current URL is used to resolve relative references. handleSimpleTag() is passed three parameters: an HTML.Tag object, a MutableAttributeSet that contains all the tag's attributes, and the tag's relative position within the file. My code checks whether the tag is the BASE tag; if it is, the HREF attribute is retrieved and stored in the page's data node. This attribute is used later when resolving URL addresses to linked Websites. Each time an IMG tag is encountered, that page's image count is updated.
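The BASE and IMG handling described above might look roughly like the sketch below. It is not the demonstration program's code: PageInfoCallback, its fields, and the scan() and resolve() helpers are hypothetical, and plain fields stand in for the page's data node. The resolve() method uses java.net.URI to show how a stored BASE value could later be applied to a relative link.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback; fields stand in for the article's page data node.
class PageInfoCallback extends HTMLEditorKit.ParserCallback {
    String baseHref;   // Stays null when the page has no BASE tag
    int imageCount;    // Number of IMG tags seen so far

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BASE) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                baseHref = href.toString();
            }
        } else if (tag == HTML.Tag.IMG) {
            imageCount++;
        }
    }

    // Resolves a link against the BASE href if one was found,
    // otherwise against the URL of the page itself.
    String resolve(String pageUrl, String link) {
        String base = (baseHref != null) ? baseHref : pageUrl;
        return URI.create(base).resolve(link).toString();
    }

    // Parses the given HTML and returns the populated callback.
    static PageInfoCallback scan(String html) {
        PageInfoCallback cb = new PageInfoCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb;
    }
}
```

With a page whose head contains a BASE tag pointing at http://example.com/docs/, resolve() would turn the relative link page2.html into http://example.com/docs/page2.html regardless of where the page itself was fetched from.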
