Matchmaking with regular expressions

Use the power of regular expressions to ease text parsing and processing

Regular expression: t(a|e|i|o|oo)n
Matches: tan, ten, tin, ton, toon
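As a quick check of the alternation above, here is a minimal sketch. It uses the JDK's java.util.regex package rather than the ORO API this article covers later, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // t, then one of a / e / i / o / oo, then n.
    private static final Pattern T_VOWEL_N = Pattern.compile("t(a|e|i|o|oo)n");

    // True only when the whole word matches the pattern (matching is case sensitive).
    static boolean matches(String word) {
        return T_VOWEL_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"tan", "ten", "tin", "ton", "toon", "Ten", "tun"};
        for (String word : words) {
            System.out.println(word + " -> " + matches(word));
        }
    }
}
```

Because matching is case sensitive by default, "Ten" with a capital T is rejected unless the pattern is compiled with a case-insensitive flag.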


The quantifier notations

Table 1 shows the quantifier notations, which determine how many times the notation immediately to the left of the quantifier should repeat:

Table 1. Quantifier notations
Notation   Number of times
*          0 or more times
+          1 or more times
?          0 or 1 time
{n}        Exactly n times
{n,m}      n to m times
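To see each quantifier in action, the following sketch applies five small patterns to the same set of inputs. The patterns ("ab*c" and so on) and the class name are illustrative choices, and the code relies on java.util.regex from the JDK rather than the ORO API.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true when the entire input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Each pattern quantifies the letter b, which sits just left of the quantifier.
        String[] regexes = {"ab*c", "ab+c", "ab?c", "ab{2}c", "ab{2,3}c"};
        String[] inputs = {"ac", "abc", "abbc", "abbbc", "abbbbc"};
        for (String regex : regexes) {
            StringBuilder line = new StringBuilder(regex).append(" matches:");
            for (String input : inputs) {
                if (fullMatch(regex, input)) {
                    line.append(' ').append(input);
                }
            }
            System.out.println(line);
        }
    }
}
```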


Let's say you want to search for a social security number in a text file. The format for US social security numbers is 999-99-9999. The regular expression you would use to match this is shown in Figure 1. In regular expressions, the hyphen ("-") notation has special meaning inside brackets: it indicates a range, so "[0-9]" matches any digit from 0 to 9. To keep the literal hyphens in a social security number from being read as range notation, escape the "-" character with a backslash ("\").

Figure 1. Matches: All social security numbers of the form 123-12-1234
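The figure itself is not reproduced here, so the pattern below is a reconstruction from the description above and may differ from the original. It is a sketch using java.util.regex (not the ORO API), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SsnMatcher {
    // Three digits, a literal hyphen, two digits, a literal hyphen, four digits.
    // In Java source each regex backslash is doubled, so "\\-" is the regex \-.
    private static final Pattern SSN =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // True only when the whole text has the form 999-99-9999.
    static boolean isSsn(String text) {
        return SSN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234"));
        System.out.println(isSsn("123121234"));
    }
}
```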

If, in your search, you wish to make the hyphen optional -- if, say, you consider both 999-99-9999 and 999999999 acceptable formats -- you can use the "?" quantifier notation. Figure 2 shows that regular expression:

Figure 2. Matches: All social security numbers of the forms 123-12-1234 and 123121234
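A sketch of the optional-hyphen variant follows, again reconstructed from the description rather than copied from the figure, and again using java.util.regex with hypothetical names.

```java
import java.util.regex.Pattern;

class OptionalHyphenSsn {
    // The ? after each escaped hyphen makes that hyphen optional (0 or 1 time).
    private static final Pattern SSN =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    // True when the whole text is nine digits with optional hyphens after digits 3 and 5.
    static boolean isSsn(String text) {
        return SSN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234"));
        System.out.println(isSsn("123121234"));
        System.out.println(isSsn("123-121234"));
    }
}
```

Each hyphen is optional on its own, so a mixed form such as 123-121234 is accepted as well. If only the two forms in the caption should pass, a stricter expression would be needed.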

Let's take a look at another example. One format for US car plate numbers consists of four numeric characters followed by two letters. The regular expression first comprises the numeric part, "[0-9]{4}", followed by the textual part, "[A-Z]{2}". Figure 3 shows the complete regular expression:

Figure 3. Matches: Typical US car plate numbers, such as 8836KV
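Joining the two parts named in the text gives "[0-9]{4}[A-Z]{2}". The sketch below tries it with java.util.regex; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class PlateMatcher {
    // Four digits followed by two uppercase letters.
    private static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    // True only when the whole text is a plate of that form.
    static boolean isPlate(String text) {
        return PLATE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlate("8836KV"));
        System.out.println(isPlate("8836kv"));
    }
}
```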

The NOT notation

The "^" notation is also called the NOT notation. If used as the first character inside brackets, "^" indicates the characters you don't want to match. For example, the expression in Figure 4 matches all words except those starting with the letter X.

Figure 4. Matches: All words except those that start with the letter X
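Since the figure is not reproduced here, the following sketch uses a pattern of its own, "[^X][A-Za-z]*", to illustrate the NOT notation. It is an assumption, not necessarily the article's expression, and it runs on java.util.regex with a made-up class name.

```java
import java.util.regex.Pattern;

class NotNotationDemo {
    // [^X] matches any single character except uppercase X;
    // [A-Za-z]* then matches the rest of the word.
    private static final Pattern NOT_X_WORD = Pattern.compile("[^X][A-Za-z]*");

    // True when the whole word matches, that is, when it does not start with X.
    static boolean matches(String word) {
        return NOT_X_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("Yacht"));
        System.out.println(matches("Xylophone"));
        System.out.println(matches("xylophone"));
    }
}
```

Note that "[^X]" excludes only the uppercase letter, so a word starting with lowercase x still matches. It also accepts a non-letter as the first character.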

The parentheses and space notations

Say you're trying to extract the birth month from a person's birthdate. The typical birthdate is in the following format: June 26, 1951. The regular expression to match the string would be like the one in Figure 5:

Figure 5. Matches: All dates with the format of Month DD, YYYY

The "\s" notation, new in this expression, is the space notation and matches any whitespace character, including blank spaces and tabs. If the string matches, how do you extract the month field? You simply put parentheses around the month field, creating a group, and later retrieve the value using the ORO API (discussed in a following section). The appropriate regular expression is in Figure 6:

Figure 6. Matches: All dates with the format Month DD, YYYY, and extracts Month field as Group 1
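The article retrieves groups through the ORO API in a later section. As a stand-in, this sketch shows the same idea with java.util.regex. The pattern is a reconstruction from the description (a month name, whitespace, a one- or two-digit day, a comma, and a four-digit year), and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthExtractor {
    // The parentheses around the month name create group 1.
    private static final Pattern BIRTHDATE =
            Pattern.compile("([A-Za-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}");

    // Returns the month, or null when the input is not in Month DD, YYYY form.
    static String extractMonth(String date) {
        Matcher matcher = BIRTHDATE.matcher(date);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```

Group 0 always refers to the entire match, so the first pair of parentheses is numbered 1.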

Other miscellaneous notations

To make life easier, some shorthand notations for commonly used regular expressions have been created, as shown in Table 2:

Table 2. Commonly used notations
Notation   Equivalent notation
\d         [0-9]
\D         [^0-9]
\w         [a-zA-Z0-9_]
\W         [^a-zA-Z0-9_]
\s         [ \t\n\r\f]
\S         [^ \t\n\r\f]
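Two of these shorthands are handy outside of pure matching. The sketch below, which assumes java.util.regex and uses made-up helper names, strips non-digits with "\D" and splits on whitespace with "\s".

```java
import java.util.Arrays;
import java.util.List;

class ShorthandDemo {
    // \D matches any non-digit, so deleting every run of \D keeps only the digits.
    static String digitsOnly(String text) {
        return text.replaceAll("\\D+", "");
    }

    // \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    static List<String> fields(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("123-12-1234"));
        System.out.println(fields("June 26,\t1951"));
    }
}
```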


To illustrate, we can use "\d" for all instances of "[0-9]" we used before, as was the case with our social security number expressions. The revised regular expression is in Figure 7:
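The figure is not reproduced here, so the sketch below shows one plausible form of the revised expression alongside the bracket form it replaces. Both patterns are reconstructions, the code uses java.util.regex, and the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class ShorthandSsn {
    // Social security number written with the \d shorthand.
    private static final Pattern WITH_SHORTHAND =
            Pattern.compile("\\d{3}\\-\\d{2}\\-\\d{4}");

    // The same expression written with explicit [0-9] ranges.
    private static final Pattern WITH_BRACKETS =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // True when the whole text has the form 999-99-9999.
    static boolean isSsn(String text) {
        return WITH_SHORTHAND.matcher(text).matches();
    }

    // True when both spellings give the same answer for this text.
    static boolean agree(String text) {
        return isSsn(text) == WITH_BRACKETS.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234"));
        System.out.println(agree("123-12-1234") && agree("abc-de-fghi"));
    }
}
```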
