Matchmaking with regular expressions

Use the power of regular expressions to ease text parsing and processing

Page 2 of 2
        PatternMatcher matcher=new Perl5Matcher();
        if (matcher.contains(logEntry,pattern)) {
            MatchResult result=matcher.getMatch();
            System.out.println("IP: "+result.group(1));
            System.out.println("Timestamp: "+result.group(2));
        }

Next, print out the matched groups using the MatchResult object returned by the PatternMatcher. Because the logEntry string contains the pattern to be matched, you can expect the following output:

        IP: 172.26.155.241
        Timestamp: 26/Feb/2001:10:56:03 -0500
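
For readers on a current JDK, the same extraction can be sketched with the standard java.util.regex package rather than Jakarta-ORO. This is a minimal sketch, not the article's code: the log line, the pattern, and the class name LogEntryParser are assumptions chosen to reproduce the output above, since the original logEntry and pattern are defined on the previous page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryParser {
    // Assumed pattern: group 1 is a dotted IPv4 address,
    // group 2 is the text between the square brackets.
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "(\\d{1,3}(?:\\.\\d{1,3}){3})\\s+-\\s+-\\s+\\[([^\\]]+)\\]");

    /** Returns {ip, timestamp}, or null when the line does not match. */
    static String[] parse(String logEntry) {
        Matcher m = LOG_PATTERN.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Hypothetical log line in common log format (an assumption).
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
            + "\"GET /index.html HTTP/1.0\" 200 15";
        String[] fields = parse(logEntry);
        if (fields != null) {
            System.out.println("IP: " + fields[0]);
            System.out.println("Timestamp: " + fields[1]);
        }
    }
}
```

Here Matcher.find() plays the role of PatternMatcher.contains(), and Matcher.group(n) plays the role of MatchResult.group(n).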

HTML processing

Your next task is to churn through your company's HTML pages and analyze the attributes of every font tag. The typical font tag in your HTML looks like this:

        <font face="Arial, Serif" size="+2" color="red">

Your program will print out the attributes for every font tag encountered in the following format:

        face: Arial, Serif
        size: +2
        color: red

In this case, I suggest using two regular expressions. The first, shown in Figure 11, extracts face="Arial, Serif" size="+2" color="red" from the font tag:

Figure 11. Matches: The all-attribute part of the font tag

The second regular expression, shown in Figure 12, breaks down each individual attribute into a name-value pair:

Figure 12. Matches: Each individual attribute, broken down into a name-value pair

Figure 12 breaks into:

        face    Arial, Serif
        size    +2
        color   red

Let's now discuss the code to achieve this. First, create the two regular expression strings and compile each into a Pattern object using the Perl5Compiler. Pass the Perl5Compiler.CASE_INSENSITIVE_MASK option when compiling so that the match is case insensitive.

Next, create a Perl5Matcher object to perform matching:

        String regexpForFontTag="<\\s*font\\s+([^>]*)\\s*>";
        String regexpForFontAttrib="([a-z]+)\\s*=\\s*\"([^\"]+)\"";
        PatternCompiler compiler=new Perl5Compiler();
        Pattern patternForFontTag=compiler.compile(regexpForFontTag,Perl5Compiler.CASE_INSENSITIVE_MASK);
        Pattern patternForFontAttrib=compiler.compile(regexpForFontAttrib,Perl5Compiler.CASE_INSENSITIVE_MASK);
        PatternMatcher matcher=new Perl5Matcher();

Assume you have a variable called html of type String that represents a line in the HTML file. If the content of the html string contains the font tag, the matcher will return true, and you'll use the MatchResult object returned from the matcher object to get your first group, which includes all of your font attributes:

        if (matcher.contains(html,patternForFontTag)) {
            MatchResult result=matcher.getMatch();
            String attribs=result.group(1);
            PatternMatcherInput input=new PatternMatcherInput(attribs);
            while (matcher.contains(input,patternForFontAttrib)) {
                result=matcher.getMatch();
                System.out.println(result.group(1)+": "+result.group(2));
            }
        }

Next, create a PatternMatcherInput object. As previously mentioned, this object lets you continue matching from where the last match was found in the string; thus, it's perfect for extracting the font tag's name-value pairs. Create a PatternMatcherInput object by passing in the string to be matched. Then, use the matcher instance to extract each font attribute as it is encountered. This is done by repeatedly calling the contains() method of the PatternMatcher object with the PatternMatcherInput object instead of a string. Each successful call advances a pointer within the PatternMatcherInput object, so the next test starts where the previous one left off.

The output of the example is as follows:

        face: Arial, Serif
        size: +2
        color: red
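
As a point of comparison, the two-pattern approach can be sketched with the JDK's java.util.regex package, which is swapped in here for Jakarta-ORO so the example runs without extra libraries. The regular expressions are the article's; the class name FontAttributes and the method extract() are hypothetical. Matcher.find() keeps its own position between calls, so it fills the role that PatternMatcherInput plays above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributes {
    // Same expressions as in the article, compiled case-insensitively.
    private static final Pattern FONT_TAG =
        Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern FONT_ATTRIB =
        Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns one "name: value" string per attribute of each font tag found. */
    static List<String> extract(String html) {
        List<String> pairs = new ArrayList<>();
        Matcher tag = FONT_TAG.matcher(html);
        // Unlike the article's if-test, this loop visits every font tag on the line.
        while (tag.find()) {
            // Group 1 holds everything between "<font " and ">".
            Matcher attrib = FONT_ATTRIB.matcher(tag.group(1));
            // Each find() resumes where the previous match ended.
            while (attrib.find()) {
                pairs.add(attrib.group(1) + ": " + attrib.group(2));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (String pair : extract(html)) {
            System.out.println(pair);
        }
    }
}
```

Note that the attribute pattern only recognizes double-quoted values, as the article's expression does; unquoted or single-quoted attributes would be skipped.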

More HTML processing

Let's continue with another HTML example. This time, imagine that your Web server has moved from widgets.acme.com to newserver.acme.com. You'll need to change the links on some of your Web pages from:

        <a href="http://widgets.acme.com/interface.html#How_To_Buy">
        <a href="http://widgets.acme.com/interface.html#How_To_Sell">
        etc.

to

        <a href="http://newserver.acme.com/interface.html#How_To_Buy">
        <a href="http://newserver.acme.com/interface.html#How_To_Sell">
        etc.

The regular expression to perform the search is shown in Figure 13.

Figure 13. Matches: The link "http://widgets.acme.com/interface.html#(any anchor)"

If this regular expression matches, you can replace the matched link with the following substitution expression:

        <a href="http://newserver.acme.com/interface.html#$1">

Notice that you use $1 after the # character. Perl regular expression syntax uses $1, $2, and so forth to represent groups that have been matched and extracted. The substitution expression therefore appends whatever text the expression in Figure 13 matched and extracted as Group 1 to the new link.

Now, back to Java. As usual, you must create your testing strings, the necessary object for compiling the regular expression into a Pattern object, and a PatternMatcher object:

        String link="<a href=\"http://widgets.acme.com/interface.html#How_To_Trade\">";
        String regexpForLink="<\\s*a\\s+href\\s*=\\s*\"http://widgets.acme.com/interface.html#([^\"]+)\">";
        PatternCompiler compiler=new Perl5Compiler();
        Pattern patternForLink=compiler.compile(regexpForLink,Perl5Compiler.CASE_INSENSITIVE_MASK);
        PatternMatcher matcher=new Perl5Matcher();

Next, use the static substitute() method of the Util class in the com.oroinc.text.regex package (org.apache.oro.text.regex in Jakarta-ORO 2.x releases) to perform the substitution, and print out the resulting string:

        String result=Util.substitute(matcher,
                                      patternForLink,
                                      new Perl5Substitution(
                                        "<a href=\"http://newserver.acme.com/interface.html#$1\">"),
                                      link,
                                      Util.SUBSTITUTE_ALL);
        System.out.println(result);

The syntax of the Util.substitute() method is as follows:

        public static String substitute(PatternMatcher matcher,
                                        Pattern pattern,
                                        Substitution sub,
                                        String input,
                                        int numSubs)

The first two parameters for this call are the PatternMatcher and Pattern objects created earlier. The input for the third parameter is a Substitution object that determines how the substitution is to be performed. In this case, use the Perl5Substitution object, which lets you use a Perl 5-style substitution. The fourth parameter is the actual string on which you wish to perform the substitution, and the last parameter lets you specify whether you wish to substitute on every occurrence of the pattern found (Util.SUBSTITUTE_ALL) or only substitute a specified number of times.
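
For comparison, the same rewrite can be sketched with the JDK's java.util.regex package, used here in place of Jakarta-ORO so the example is self-contained. The class name LinkRewriter and the method rewrite() are hypothetical, and the dots in the host and file name are escaped, a small tightening of the article's expression. Matcher.replaceAll() understands $1 in the replacement string in the same way as the Perl 5-style substitution described above.

```java
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1 captures the anchor name that follows the # character.
    private static final Pattern LINK = Pattern.compile(
        "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\">",
        Pattern.CASE_INSENSITIVE);

    /** Rewrites every matching link to point at newserver.acme.com. */
    static String rewrite(String html) {
        // $1 is replaced with the text captured by group 1.
        return LINK.matcher(html).replaceAll(
            "<a href=\"http://newserver.acme.com/interface.html#$1\">");
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://widgets.acme.com/interface.html#How_To_Trade\">";
        System.out.println(rewrite(link));
    }
}
```

Calling replaceAll() corresponds to passing Util.SUBSTITUTE_ALL; Matcher.replaceFirst() would be the closest counterpart to limiting the substitution to a single occurrence.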

Express yourself

In this article, I've shown you the powerful features of regular expressions. When used appropriately, they can help a great deal in string extraction and text changes. I have also shown how you can incorporate regular expressions into your Java application using the open source Jakarta-ORO library. Now, it's up to you to decide whether the old string manipulation approach (using StringTokenizer, charAt, or substring) or a regular expression library, like Jakarta-ORO, works for you.

Benedict Chng is a Sun-certified developer currently consulting in the Boston area. He hails from sunny and tropical Singapore and has been working in the software development field for close to four years. His current interests include writing applications for Palm devices and sightseeing in the New England region.
