
Figure 7. Matches: All social security numbers of the form 123-12-1234
Many open source regular expression libraries are available for Java programmers, and many support the Perl 5-compatible regular expression syntax. I use the Jakarta-ORO regular expression library because it is one of the most comprehensive APIs available and is fully compatible with Perl 5 regular expressions. It is also one of the most optimized APIs around.
The Jakarta-ORO library was formerly known as OROMatcher and has been kindly donated to the Jakarta Project by Daniel Savarese. You can download the package from a link in the Resources section below.
I'll start by briefly describing the objects you need to create and access in order to use this library, and then I will show how you use the Jakarta-ORO API.
First, create an instance of the Perl5Compiler class and assign it to the PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface and lets you compile a regular expression string into a Pattern object used for matching:
PatternCompiler compiler=new Perl5Compiler();
To create a Pattern object, call the compile() method of the compiler object, passing in the regular expression. For example, you can compile the regular expression "t[aeio]n" like so:
Pattern pattern=null;
try {
pattern=compiler.compile("t[aeio]n");
} catch (MalformedPatternException e) {
e.printStackTrace();
}
By default, the compiler creates a case-sensitive pattern, so the above setup matches only "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, call the compiler's compile() method with an additional mask:
pattern=compiler.compile("t[aeio]n",Perl5Compiler.CASE_INSENSITIVE_MASK);
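The effect of the case-insensitive mask can be tried out without downloading anything. The following is a minimal sketch that uses the JDK's built-in java.util.regex package as a stand-in for Jakarta-ORO (an assumption made only so the example runs on a plain JDK); Pattern.CASE_INSENSITIVE plays the role of Perl5Compiler.CASE_INSENSITIVE_MASK here. The class name CaseSensitivityDemo and the helper matchesExactly() are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not Jakarta-ORO.
class CaseSensitivityDemo {

    // Compiles the regex with the given flags and tests the whole input.
    static boolean matchesExactly(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "t[aeio]n";
        for (String word : new String[] {"tin", "Tin", "taN", "tun"}) {
            boolean sensitive = matchesExactly(regex, 0, word);
            boolean insensitive = matchesExactly(regex, Pattern.CASE_INSENSITIVE, word);
            System.out.println(word + ": case-sensitive=" + sensitive
                    + ", case-insensitive=" + insensitive);
        }
    }
}
```

With the default flags only "tin" matches; with the case-insensitive flag "Tin" and "taN" match as well, while "tun" matches in neither case because "u" is not in the character class.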
Once you've created the Pattern object, you can use it for pattern matching with the PatternMatcher class.
The PatternMatcher object tests for a match based on the Pattern object and a string. You instantiate a Perl5Matcher class and assign it to the PatternMatcher interface. The Perl5Matcher class is an implementation of the PatternMatcher interface and matches patterns based on the Perl 5 regular expression syntax:
PatternMatcher matcher=new Perl5Matcher();
You can obtain a match using the PatternMatcher object in one of several ways, with the string to be matched against the regular expression passed in as the first parameter:
boolean matches(String input, Pattern pattern): Used if the input string and the regular expression should match exactly; in other words, the regular expression should totally describe the string input
boolean matchesPrefix(String input, Pattern pattern): Used if the regular expression should match the beginning of the input string
boolean contains(String input, Pattern pattern): Used if the regular expression should match part of the input string (i.e., should be a substring)
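The difference between the three kinds of match is easiest to see side by side. The sketch below again substitutes the JDK's java.util.regex for Jakarta-ORO (an assumption, so that it runs without the external library): Matcher.matches(), Matcher.lookingAt(), and Matcher.find() behave roughly like the matches(), matchesPrefix(), and contains() methods described above. The class MatchModesDemo and its three helper methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for Jakarta-ORO.
class MatchModesDemo {
    static final Pattern PATTERN = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean matchesWhole(String input) {
        return PATTERN.matcher(input).matches();
    }

    // The pattern must match at the beginning of the input.
    static boolean matchesPrefix(String input) {
        return PATTERN.matcher(input).lookingAt();
    }

    // The pattern may match anywhere inside the input.
    static boolean contains(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"tan", "tank", "stand", "tun"}) {
            System.out.println(input + ": whole=" + matchesWhole(input)
                    + ", prefix=" + matchesPrefix(input)
                    + ", contains=" + contains(input));
        }
    }
}
```

Here "tan" satisfies all three tests, "tank" only the prefix and substring tests, "stand" only the substring test, and "tun" none of them.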
You could also pass in a PatternMatcherInput object instead of a String object to the above three method calls; if you did so, you could continue matching from the point at which the last match was found in the string. This is useful when you have many substrings that are likely to be matched by a given regular expression.
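The idea of resuming from the previous match can be sketched with the JDK's java.util.regex package, used here as a stand-in for Jakarta-ORO (an assumption, so that the code runs without the external jar). A java.util.regex.Matcher remembers where its last match ended, so calling find() in a loop walks through every match in turn, which is roughly what a PatternMatcherInput makes possible with contains(). The class RepeatedMatchDemo and the helper findAll() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for Jakarta-ORO.
class RepeatedMatchDemo {

    // Collects every substring of the input that the regex matches.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each find() call resumes just after the previous match.
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a tin can, a ten spot, and a ton of tan sand";
        System.out.println(findAll("t[aeio]n", input));
    }
}
```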
The method signatures with the PatternMatcherInput object instead of String are as follows:
boolean matches(PatternMatcherInput input, Pattern pattern)
boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
boolean contains(PatternMatcherInput input, Pattern pattern)
Your job: analyze a Web server log file and determine how long each user spends on the Website. An entry from a typical BEA WebLogic log file looks like this:
172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "GET /IsAlive.htm HTTP/1.0" 200 15
After analyzing this entry, you'll realize that you need to extract two things from the log file: the IP address and a page's access time. You can use the grouping notation (parentheses) to extract the IP address field and the timestamp field from the log entry.
Let's first discuss the IP address. It consists of 4 bytes, each with values between 0 and 255; each byte is separated from the others by a period. Thus, in each individual byte in the IP address, you have at least one and at most three digits. You can see the regular expression for this field in Figure 8:

Figure 8. Matches: IP addresses that consist of 4 bytes, each with values between 0 and 255
You need to escape the period character because you literally want it to be there; you do not want it read in terms of its special meaning in regular expression syntax, which I explained earlier.
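To see why the escape matters, the following sketch compares an escaped and an unescaped version of the IP address expression. It assumes the expression is the one that appears later in the Java code, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}, and it uses the JDK's java.util.regex rather than Jakarta-ORO so that it runs as is. The class IpPatternDemo and its constant names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for Jakarta-ORO.
class IpPatternDemo {
    // Period escaped: only a literal "." separates the groups of digits.
    static final Pattern ESCAPED = Pattern.compile(
            "[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}");

    // Period unescaped: "." matches any character (except a line terminator).
    static final Pattern UNESCAPED = Pattern.compile(
            "[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}");

    static boolean isMatch(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"172.26.155.241", "172x26x155x241"}) {
            System.out.println(input + ": escaped=" + isMatch(ESCAPED, input)
                    + ", unescaped=" + isMatch(UNESCAPED, input));
        }
    }
}
```

The unescaped version accepts "172x26x155x241", which is clearly not an IP address. Note also that {1,3} only limits the number of digits: a value such as "999.999.999.999" still matches, so the expression checks the shape of the field rather than the 0 to 255 range.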
The log entry's timestamp part is surrounded by square brackets. You can extract whatever is within these brackets by first searching for the opening square bracket character ("[") and then matching every character that is not a closing square bracket ("]"), continuing until you reach the closing square bracket. Figure 9 shows the regular expression for this:

Figure 9. Matches: At least one character until "]" is found
Now you combine these two regular expressions into a single expression with grouping notation (parentheses) for extraction of your IP address and timestamp. Notice that "\s-\s-\s" is added in the middle so that matching occurs, although you won't extract that. You can see the complete regular expression in Figure 10.
Figure 10. Matches: The IP address and timestamp by combining two regular expressions.
Now that you've formulated this regular expression, you can begin writing Java code using the regular expression library.
String logEntry="172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] \"GET /IsAlive.htm HTTP/1.0\" 200 15 ";
String regexp="([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]";
The regular expression used here is nearly identical to the one found in Figure 10, with only one difference: in a Java string literal, you need to escape every backslash ("\"). Figure 10 shows the raw regular expression rather than Java source, so each backslash must be doubled in the string literal to avoid a compilation error. Unfortunately, this process is prone to error and you must do it carefully. You can type in the regular expression first without escaping the backslashes, and then visually scan the string from left to right and replace every occurrence of the "\" character with "\\". To double check, print out the resulting string to the console.
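As a quick check of the escaping, the sketch below prints the string and applies it to the sample log entry. It uses the JDK's java.util.regex as a stand-in for Jakarta-ORO (an assumption, so that it runs without the external library); the regular expression and log entry are the ones given above, while the class LogEntryDemo and the helper extract() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex stands in for Jakarta-ORO.
class LogEntryDemo {
    // Same string literal as above: every backslash is doubled.
    static final String REGEXP =
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]";

    // Returns {ipAddress, timestamp}, or null if the entry does not match.
    static String[] extract(String logEntry) {
        Matcher matcher = Pattern.compile(REGEXP).matcher(logEntry);
        if (!matcher.find()) {
            return null;
        }
        // Group 1 is the first pair of parentheses, group 2 the second.
        return new String[] {matcher.group(1), matcher.group(2)};
    }

    public static void main(String[] args) {
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15 ";

        // The printed string should contain single backslashes only.
        System.out.println(REGEXP);

        String[] fields = extract(logEntry);
        if (fields != null) {
            System.out.println("IP: " + fields[0]);
            System.out.println("Timestamp: " + fields[1]);
        }
    }
}
```

The first line printed is the expression as the regular expression engine sees it, with each "\\" collapsed to a single "\", so it can be compared directly against Figure 10.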