Matchmaking with regular expressions

Use the power of regular expressions to ease text parsing and processing

After initializing the strings, instantiate the PatternCompiler object and create a Pattern object by using the PatternCompiler to compile the regular expression:

        // compile() throws MalformedPatternException if the expression is invalid
        PatternCompiler compiler=new Perl5Compiler();
        Pattern pattern=compiler.compile(regexp);


Now, create the PatternMatcher object and call the contains() method in the PatternMatcher interface to see if you have a match:

        PatternMatcher matcher=new Perl5Matcher();
        if (matcher.contains(logEntry,pattern)) {
            MatchResult result=matcher.getMatch();
            System.out.println("IP: "+result.group(1));
            System.out.println("Timestamp: "+result.group(2));
        }


When contains() returns true, the code prints the matched groups using the MatchResult object returned by the matcher's getMatch() method. Since the logEntry string contains the pattern to be matched, you can expect the following output:

        IP: 172.26.155.241
        Timestamp: 26/Feb/2001:10:56:03 -0500
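
For comparison, here is a minimal sketch of the same compile, match, and extract steps using the JDK's java.util.regex package rather than the Jakarta ORO classes shown above. The class name LogEntryParser, the regular expression, and the sample log line are assumptions made for illustration; only the IP address and timestamp values are taken from the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryParser {
    // Hypothetical expression: group 1 is a dotted IPv4 address,
    // group 2 is the timestamp found between square brackets.
    private static final Pattern LOG_ENTRY = Pattern.compile(
        "(\\d{1,3}(?:\\.\\d{1,3}){3})\\s+\\S+\\s+\\S+\\s+\\[([^\\]]+)\\]");

    /** Returns {ip, timestamp}, or null when the line does not match. */
    static String[] parse(String logEntry) {
        Matcher matcher = LOG_ENTRY.matcher(logEntry);
        // find() searches anywhere in the input, much like ORO's contains()
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        // Hypothetical log line; the request, status, and size fields are made up.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
            + "\"GET /index.html HTTP/1.0\" 200 3225";
        String[] fields = parse(logEntry);
        if (fields != null) {
            System.out.println("IP: " + fields[0]);
            System.out.println("Timestamp: " + fields[1]);
        }
    }
}
```

Here Pattern.compile() plays the role of the PatternCompiler, and the Matcher object serves as both matcher and match result, so group(1) and group(2) are read directly from it.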


HTML processing

Your next task is to churn through your company's HTML pages and analyze the attributes of every font tag. The typical font tag in your HTML looks like this:

        <font face="Arial, Serif" size="+2" color="red">


Your program will print out the attributes for every font tag encountered in the following format:

        face=Arial, Serif
        size=+2
        color=red


In this case, I suggest that you use two regular expressions. The first, shown in Figure 11, extracts face="Arial, Serif" size="+2" color="red" from the font tag:

Figure 11. Matches: The all-attribute part of the font tag

The second regular expression, shown in Figure 12, breaks down each individual attribute into a name-value pair:

Figure 12. Matches: Each individual attribute, broken down into a name-value pair

Figure 12 breaks into:

        face    Arial, Serif
        size    +2
        color   red


Let's now discuss the code to achieve this. First, create the two regular expression strings and compile each into its own Pattern object using the Perl5Compiler. Pass the Perl5Compiler.CASE_INSENSITIVE_MASK option when compiling so that both expressions match regardless of case.

Next, create a Perl5Matcher object to perform matching:

        String regexpForFontTag="<\\s*font\\s+([^>]*)\\s*>";
        String regexpForFontAttrib="([a-z]+)\\s*=\\s*\"([^\"]+)\"";
        PatternCompiler compiler=new Perl5Compiler();
        Pattern patternForFontTag=compiler.compile(regexpForFontTag,Perl5Compiler.CASE_INSENSITIVE_MASK);
        Pattern patternForFontAttrib=compiler.compile(regexpForFontAttrib,Perl5Compiler.CASE_INSENSITIVE_MASK);
        PatternMatcher matcher=new Perl5Matcher();


Assume you have a variable called html of type String that represents a line in the HTML file. If the content of the html string contains the font tag, the matcher will return true, and you'll use the MatchResult object returned from the matcher object to get your first group, which includes all of your font attributes:

        if (matcher.contains(html,patternForFontTag)) {
            MatchResult result=matcher.getMatch();
            String attribs=result.group(1);
            PatternMatcherInput input=new PatternMatcherInput(attribs);
            while (matcher.contains(input,patternForFontAttrib)) {
                result=matcher.getMatch();
                System.out.println(result.group(1)+"="+result.group(2));
            }
        }


Inside the if block, the code creates a PatternMatcherInput object by passing in the string to be matched. As previously mentioned, this object lets you continue matching from where the last match was found in the string; thus, it's perfect for extracting the font tag's name-value pairs. The matcher instance then extracts each font attribute as it is encountered, by repeatedly calling the contains() method of the PatternMatcher object with the PatternMatcherInput object instead of a string. Every successful match advances a pointer within the PatternMatcherInput object, so the next test starts where the previous one left off.
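
The same two-pass approach can be sketched with the JDK's java.util.regex package, where repeated calls to Matcher.find() stand in for the PatternMatcherInput loop. This is a minimal sketch and not the Jakarta ORO code discussed above: the class name FontAttributeScanner and its extract() method are hypothetical, while the two regular expressions are the ones from the listing earlier in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributeScanner {
    // The article's two expressions, compiled for case-insensitive matching.
    private static final Pattern FONT_TAG =
        Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern FONT_ATTRIB =
        Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns "name=value" strings for the first font tag in html, or an empty list. */
    static List<String> extract(String html) {
        List<String> pairs = new ArrayList<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            // Group 1 holds everything between "font" and ">"
            Matcher attrib = FONT_ATTRIB.matcher(tag.group(1));
            // Each find() resumes where the previous match ended
            while (attrib.find()) {
                pairs.add(attrib.group(1) + "=" + attrib.group(2));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (String pair : extract(html)) {
            System.out.println(pair);
        }
    }
}
```

Run against the sample font tag, the sketch prints face=Arial, Serif, then size=+2, then color=red, one per line. Like the ORO version, it only handles attribute values enclosed in double quotes.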
