Matchmaking with regular expressions

Use the power of regular expressions to ease text parsing and processing

If you've programmed in Perl or any other language with built-in regular-expression capabilities, then you probably know how much easier regular expressions make text processing and pattern matching. If you're unfamiliar with the term, a regular expression is simply a string of characters that defines a pattern used to search for a matching string.

Many languages, including Perl, PHP, Python, JavaScript, and JScript, now support regular expressions for text processing, and some text editors use regular expressions for powerful search-and-replace functionality. What about Java? At the time of this writing, a Java Specification Request that includes a regular expression library for text processing has been approved; you can expect to see it in a future version of the JDK.

But what if you need a regular expression library now? Luckily, you can download the open source Jakarta-ORO library from Apache.org. In this article, I'll first give you a short primer on regular expressions, and then I'll show you how to use them with the Jakarta-ORO API.

Regular expressions 101

Let's start simple. Suppose you want to search for a string that contains the word "cat"; your regular expression would simply be "cat". Because the pattern can match anywhere in a string, "catalog" and "sophisticated" also match, and if your search is case-insensitive, so does "Catherine":

Regular expression: cat
Matches: cat, catalog, Catherine, sophisticated
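
As a minimal sketch of this search in Java, the following assumes the java.util.regex package that ships with later JDKs rather than the Jakarta-ORO API covered in this article. The class name CatSearch and the method containsCat are hypothetical names chosen for illustration:

```java
import java.util.regex.Pattern;

class CatSearch {
    // The literal pattern "cat", compiled for case-insensitive searching.
    static final Pattern CAT = Pattern.compile("cat", Pattern.CASE_INSENSITIVE);

    static boolean containsCat(String input) {
        // find() succeeds if the pattern occurs anywhere in the input,
        // unlike matches(), which requires the whole input to match.
        return CAT.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] words = {"cat", "catalog", "Catherine", "sophisticated", "dog"};
        for (String word : words) {
            System.out.println(word + " -> " + containsCat(word));
        }
    }
}
```

Without the CASE_INSENSITIVE flag, "Catherine" would no longer match, while "catalog" and "sophisticated" still would.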


The period notation

Imagine you are playing Scrabble and need a three-letter word starting with the letter "t" and ending with the letter "n". Imagine also that you have an English dictionary and will search through its entire contents for a match using a regular expression. To form such a regular expression, you would use a wildcard notation -- the period (.) character. The regular expression would then be "t.n"; in a case-insensitive search it would match "tan", "Ten", "tin", and "ton". It would also match "t#n", "tpn", and even "t n", as well as many other nonsensical words. This is because the period matches any single character, including the space and the tab character. (In Perl 5 style engines, the period matches a line break only when single-line mode is enabled.)

Regular expression: t.n
Matches: tan, Ten, tin, ton, t n, t#n, tpn, etc.
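
The period wildcard can be tried out with a short sketch like the one below. It again assumes java.util.regex rather than Jakarta-ORO, and PeriodWildcard, isMatch, and isMatchDotAll are hypothetical names. In java.util.regex, at least, the period does not match a line terminator unless the DOTALL flag is set, so the sketch compiles the pattern both ways:

```java
import java.util.regex.Pattern;

class PeriodWildcard {
    // "t.n": a 't', then any single character, then an 'n' (case-insensitive).
    static final Pattern T_ANY_N =
            Pattern.compile("t.n", Pattern.CASE_INSENSITIVE);

    // Same pattern, but DOTALL lets the period match line terminators too.
    static final Pattern T_ANY_N_DOTALL =
            Pattern.compile("t.n", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean isMatch(String word) {
        // matches() requires the entire word to fit the pattern.
        return T_ANY_N.matcher(word).matches();
    }

    static boolean isMatchDotAll(String word) {
        return T_ANY_N_DOTALL.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "t n", "t#n", "tpn", "toon"};
        for (String word : words) {
            System.out.println("\"" + word + "\" -> " + isMatch(word));
        }
    }
}
```

Here "toon" is rejected because the period stands for exactly one character, while "t n" and "t#n" are accepted.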


The bracket notation

To solve the problem of the period's indiscriminate matches, you can list the characters you consider meaningful inside a bracket ("[]") expression, so that only those characters match. Thus, "t[aeio]n" would match only "tan", "Ten", "tin", and "ton" (again assuming a case-insensitive search). "Toon" would not match because a bracket expression matches exactly one character:

Regular expression: t[aeio]n
Matches: tan, Ten, tin, ton
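
A brief sketch of the bracket notation follows, under the same assumptions as before: java.util.regex stands in for Jakarta-ORO, and BracketMatch and isMatch are hypothetical names:

```java
import java.util.regex.Pattern;

class BracketMatch {
    // "t[aeio]n": a 't', then exactly one of a, e, i, or o, then an 'n'.
    static final Pattern T_VOWEL_N =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean isMatch(String word) {
        // matches() requires the entire word to fit the pattern.
        return T_VOWEL_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "tin", "ton", "tpn", "t#n", "toon"};
        for (String word : words) {
            System.out.println(word + " -> " + isMatch(word));
        }
    }
}
```

The nonsense strings "tpn" and "t#n" that the period accepted are now rejected, and so is "toon", since the brackets consume only one character.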


The OR operator

If you want to match "toon" in addition to all the words matched in the previous section, you can use the "|" notation, which acts as an OR operator between alternatives. To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use the bracket notation here because it matches only a single character. Instead, use parentheses -- "()". You can also use parentheses for groupings (more on that later):

Regular expression: t(a|e|i|o|oo)n
Matches: tan, Ten, tin, ton, toon
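
The alternation can be sketched as follows, once more assuming java.util.regex instead of Jakarta-ORO, with OrOperator, isMatch, and middle as hypothetical names. Because the parentheses also form a group, the sketch reads back which alternative was chosen:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrOperator {
    // "t(a|e|i|o|oo)n": a 't', then one of the listed alternatives, then an 'n'.
    static final Pattern T_ALT_N =
            Pattern.compile("t(a|e|i|o|oo)n", Pattern.CASE_INSENSITIVE);

    static boolean isMatch(String word) {
        // matches() requires the entire word to fit the pattern.
        return T_ALT_N.matcher(word).matches();
    }

    // Returns the text captured by the parentheses, or null if there is no match.
    static String middle(String word) {
        Matcher m = T_ALT_N.matcher(word);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "tin", "ton", "toon", "tooon"};
        for (String word : words) {
            System.out.println(word + " -> " + isMatch(word)
                    + ", middle = " + middle(word));
        }
    }
}
```

For "toon" the engine first tries the single "o" alternative, fails to find the closing "n", and then backtracks to "oo", so the captured group is "oo".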
