Many languages, including Perl, PHP, Python, JavaScript, and JScript, now support regular expressions for text processing, and some text editors use regular expressions for powerful search-and-replace functionality. What about Java? At the time of this writing, a Java Specification Request that includes a regular expression library for text processing has been approved; you can expect to see it in a future version of the JDK.
But what if you need a regular expression library now? Luckily, you can download the open source Jakarta ORO library from Apache.org. In this article, I'll first give you a short primer on regular expressions, and then I'll show you how to use regular expressions with the Jakarta ORO API.
Let's start simple. Suppose you want to search for a string with the word "cat" in it; your regular expression would simply be "cat". The words "catalog" and "sophisticated" would also match, because each contains "cat"; if your search is case-insensitive, "Catherine" would match as well:
Regular expression: cat
Matches: cat, catalog, Catherine, sophisticated
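As a rough sketch of the search described above, the following uses java.util.regex, the JDK's built-in package, as a stand-in for Jakarta ORO. The class name CatSearchSketch and the method wordsContainingCat are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Jakarta ORO.
public class CatSearchSketch {
    // Returns the words that contain "cat" anywhere, ignoring case.
    public static List<String> wordsContainingCat(List<String> words) {
        Pattern cat = Pattern.compile("cat", Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String word : words) {
            // find() succeeds if the pattern occurs anywhere in the word.
            if (cat.matcher(word).find()) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("cat", "catalog", "Catherine", "sophisticated", "dog");
        System.out.println(wordsContainingCat(words));
    }
}
```

Dropping the Pattern.CASE_INSENSITIVE flag would leave "Catherine" out of the result, since its "Cat" begins with an uppercase letter.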
Imagine you are playing Scrabble and need a three-letter word starting with the letter "t" and ending with the letter "n". Imagine also that you have an English dictionary and will search through its entire contents for a match using a regular expression. To form such a regular expression, you would use a wildcard notation -- the period (.) character. The regular expression would then be "t.n"; searched case-insensitively, it would match "tan", "Ten", "tin", and "ton". It would also match "t#n", "tpn", and even "t n", as well as many other nonsensical words. This is because the period character matches any single character, including the space and the tab character; in Perl 5 syntax, the only character it skips is the line break, unless single-line mode is enabled:
Regular expression: t.n
Matches: tan, Ten, tin, ton, t n, t#n, tpn, etc.
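To see the wildcard in action, here is a small sketch that again assumes java.util.regex in place of Jakarta ORO; WildcardSketch and isMatch are hypothetical names. In java.util.regex the period does not match a line terminator by default, so the sketch also compiles a second pattern with the DOTALL flag to show the difference.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Jakarta ORO.
public class WildcardSketch {
    // Case-insensitive, so "Ten" is accepted along with "tan", "tin", and "ton".
    public static final Pattern T_DOT_N =
        Pattern.compile("t.n", Pattern.CASE_INSENSITIVE);

    // DOTALL additionally lets the period match line terminators.
    public static final Pattern T_DOT_N_DOTALL =
        Pattern.compile("t.n", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // matches() requires the whole candidate to fit the pattern.
    public static boolean isMatch(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"tan", "Ten", "t n", "t#n", "toon", "t\nn"};
        for (String candidate : candidates) {
            System.out.println(isMatch(T_DOT_N, candidate)
                + " / with DOTALL: " + isMatch(T_DOT_N_DOTALL, candidate));
        }
    }
}
```

Only the last candidate, which has a line break between "t" and "n", behaves differently under the two patterns.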
To solve the problem of the period's indiscriminate matches, you can specify characters you consider meaningful with the bracket ("[]") expression, so that only those characters would match the regular expression. Thus, "t[aeio]n" would just match "tan", "Ten", "tin", and "ton". "Toon" would not match because you can only match a single character within the bracket notation:
Regular expression: t[aeio]n
Matches: tan, Ten, tin, ton
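The bracket expression can be tried out the same way. This is a minimal sketch assuming java.util.regex rather than Jakarta ORO, with CharClassSketch and isMatch as hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Jakarta ORO.
public class CharClassSketch {
    // [aeio] stands for exactly one character out of a, e, i, or o.
    private static final Pattern T_VOWEL_N =
        Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    // matches() requires the whole word to fit the pattern.
    public static boolean isMatch(String word) {
        return T_VOWEL_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "tin", "ton", "t#n", "t n", "toon"};
        for (String word : words) {
            System.out.println("\"" + word + "\": " + isMatch(word));
        }
    }
}
```

The nonsensical candidates "t#n" and "t n" are now rejected, and so is "toon", because the brackets consume only one character.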
If you want to match "toon" in addition to all the words matched in the previous section, you can use the "|" notation, which is basically an OR operator. To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use the bracket notation here because it matches only a single character. Instead, wrap the alternatives in parentheses -- "()". You can also use parentheses for groupings (more on that later):
Regular expression: t(a|e|i|o|oo)n
Matches: tan, Ten, tin, ton, toon
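The OR operator and the parentheses can be sketched as follows, once more assuming java.util.regex in place of Jakarta ORO; AlternationSketch and middlePart are hypothetical names. Because the parentheses also form a group, the sketch reads back whichever alternative was matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Jakarta ORO.
public class AlternationSketch {
    // The parentheses bound the alternatives and also capture the one that matched.
    private static final Pattern T_ALT_N =
        Pattern.compile("t(a|e|i|o|oo)n", Pattern.CASE_INSENSITIVE);

    // Returns the text matched between "t" and "n", or null if the word does not match.
    public static String middlePart(String word) {
        Matcher matcher = T_ALT_N.matcher(word);
        // matches() requires the whole word to fit the pattern.
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String[] words = {"tan", "Ten", "tin", "ton", "toon", "tun"};
        for (String word : words) {
            System.out.println(word + " -> " + middlePart(word));
        }
    }
}
```

For "toon" the engine first tries the alternative "o", fails to find "n" next, and backtracks to "oo", so the captured group is "oo".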