Regular expressions simplify pattern-matching code

Discover the elegance of regular expressions in text-processing scenarios that involve pattern matching

Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides such shortcuts with predefined character classes, which Table 1 presents. Use predefined character classes to simplify your regexes and minimize regex syntax errors.

Table 1. Predefined character classes

Predefined character class   Description
\d                           A digit. Equivalent to [0-9].
\D                           A nondigit. Equivalent to [^0-9].
\s                           A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S                           A nonwhitespace character. Equivalent to [^\s].
\w                           A word character. Equivalent to [a-zA-Z_0-9].
\W                           A nonword character. Equivalent to [^\w].


The following command-line example uses the \w predefined character class to identify all word characters in its text command-line argument:

java RegexDemo \w "aZ.8 _"


The command line above produces the following output, which shows that the period and space characters are not considered word characters:

Regex = \w
Text = aZ.8 _
Found a
  starting at index 0 and ending at index 1
Found Z
  starting at index 1 and ending at index 2
Found 8
  starting at index 3 and ending at index 4
Found _
  starting at index 5 and ending at index 6
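
The same check can be made directly against the java.util.regex API. The following is a minimal sketch, not the article's RegexDemo program: the class name PredefinedClassDemo and the helper findAll() are illustrative names chosen here. Note that a backslash must be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not part of the article's RegexDemo.
class PredefinedClassDemo {
    // Returns every substring of text that matches regex, in match order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "aZ.8 _";
        System.out.println(findAll("\\w", text)); // [a, Z, 8, _]
        System.out.println(findAll("\\d", text)); // [8]
        System.out.println(findAll("\\W", text)); // [.,  ]  (the period and the space)
    }
}
```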


Note
Pattern's SDK documentation refers to the period metacharacter as a predefined character class that matches any character except for a line terminator—a one- or two-character sequence identifying the end of a text line—unless dotall mode (discussed later) is in effect. Pattern recognizes the following line terminators:
  • The carriage-return character (\r)
  • The new-line (line feed) character (\n)
  • The carriage-return character immediately followed by the new-line character (\r\n)
  • The next-line character (\u0085)
  • The line-separator character (\u2028)
  • The paragraph-separator character (\u2029)
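
To see the period's treatment of line terminators, the following sketch counts how many characters a lone period matches with and without the Pattern.DOTALL flag. The class name DotDemo and the helper countDotMatches() are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the names are not from the article.
class DotDemo {
    // Counts the matches of a single period in text under the given flags.
    static int countDotMatches(String text, int flags) {
        Matcher m = Pattern.compile(".", flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Six characters: a, \n, b, \r, \n, c
        String text = "a\nb\r\nc";
        System.out.println(countDotMatches(text, 0));              // 3 (a, b, c only)
        System.out.println(countDotMatches(text, Pattern.DOTALL)); // 6 (line terminators too)
    }
}
```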


Capturing groups

Pattern supports a regex construct called a capturing group that saves a match's characters for later recall during pattern matching; that construct is a character sequence surrounded by parentheses metacharacters (( )). All characters within that capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in text. Each match replaces the previous match's saved Java characters with the next match's Java characters.

Capturing groups can nest inside other capturing groups. For example, in (Java( language)), the ( language) group nests inside the outer group. Each capturing group, nested or not, receives its own number. Numbering starts at 1 and proceeds from left to right in the order of the groups' opening parentheses. In the example, (Java( language)) is capturing group number 1, and ( language) is capturing group number 2. In (a)(b), (a) is capturing group number 1, and (b) is capturing group number 2.
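
The numbering can be confirmed through Matcher's group(int) and groupCount() methods. This is a small sketch under the assumption that the regex is applied to the text The Java language; the class name GroupNumberDemo and the helper capture() are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the names are not from the article.
class GroupNumberDemo {
    // Returns what the numbered group captured in the first match, or null if nothing matches.
    static String capture(String regex, String text, int group) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String regex = "(Java( language))";
        String text = "The Java language";
        // groupCount() reports how many capturing groups the pattern declares.
        System.out.println(Pattern.compile(regex).matcher(text).groupCount()); // 2
        System.out.println("[" + capture(regex, text, 1) + "]"); // [Java language]
        System.out.println("[" + capture(regex, text, 2) + "]"); // [ language]
    }
}
```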

Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:

java RegexDemo "(Java( language)\2)" "The Java language language"


The example uses the (Java( language)\2) regex to search the text The Java language language for a grammatical error, where Java immediately precedes two consecutive occurrences of language. That regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2's saved match, which allows the matcher to search for a second occurrence of a space character followed by language, which immediately follows the first occurrence of the space character and language. The following output shows what RegexDemo's matcher finds:

Regex = (Java( language)\2)
Text = The Java language language
Found Java language language
  starting at index 4 and ending at index 26
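
The same back reference can be exercised in code. The sketch below assumes a hypothetical helper, firstMatch(), in an illustrative class named BackReferenceDemo; it is not the RegexDemo source. It also shows that the regex finds nothing when language appears only once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the names are not from the article.
class BackReferenceDemo {
    // Returns {start, end} of the first match, or null when nothing matches.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        // \2 is written \\2 inside a Java string literal.
        String regex = "(Java( language)\\2)";
        int[] hit = firstMatch(regex, "The Java language language");
        System.out.println(hit[0] + " to " + hit[1]);                       // 4 to 26
        System.out.println(firstMatch(regex, "The Java language") == null); // true
    }
}
```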


Quantifiers

Quantifiers are probably the most confusing regex constructs to understand. Part of that confusion comes from trying to grasp Pattern's 18 quantifier categories (organized as three major categories of six fundamental quantifier categories). Another part of that confusion comes from trying to decipher the concept of zero-length matches. Once you understand that concept and those 18 categories, much (if not all) of the confusion disappears.

Note
For brevity, this section discusses only the basics of the 18 quantifier categories and the zero-length match concept. Study The Java Tutorial's "Quantifiers" section for a more detailed discussion and more examples.


A quantifier is a regex construct that implicitly or explicitly binds a numeric value to a pattern. That numeric value determines how many times to match a pattern. Pattern's six fundamental quantifiers match a pattern once or not at all, zero or more times, one or more times, an exact number of times, at least x times, and at least x times but no more than y times.

The six fundamental quantifier categories replicate in each of three major categories: greedy, reluctant, and possessive. Greedy quantifiers attempt to find the longest match. In contrast, reluctant quantifiers attempt to find the shortest match. Possessive quantifiers also try to find the longest match. However, they differ from greedy quantifiers in how they work. Although greedy and possessive quantifiers force a matcher to read in the entire text prior to attempting a first match, greedy quantifiers often cause a matcher to make multiple attempts to find a match, whereas possessive quantifiers cause a matcher to attempt a match only once, without giving back any of the characters already consumed.
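
A quick way to compare the three major categories is to apply the same pattern in each form to one text. The sketch below borrows the well-known .*foo comparison from The Java Tutorial; the class name QuantifierKindDemo and the helper findAll() are illustrative names, not part of this article's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the names are not from the article.
class QuantifierKindDemo {
    // Returns every substring of text that matches regex, in match order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until foo fits.
        System.out.println(findAll(".*foo", text));  // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible before each foo.
        System.out.println(findAll(".*?foo", text)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes everything and never backs off, so foo cannot match.
        System.out.println(findAll(".*+foo", text)); // []
    }
}
```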

The following examples illustrate the behavior of the six fundamental quantifiers in the greedy category, and the behavior of a single fundamental quantifier in each of the reluctant and possessive categories. These examples also introduce the zero-length match concept:

  1. java RegexDemo a? abaa: uses a greedy quantifier to match a in abaa once or not at all. The following output results:

    Regex = a?
    Text = abaa
    Found a
      starting at index 0 and ending at index 1
    Found 
      starting at index 1 and ending at index 1
    Found a
      starting at index 2 and ending at index 3
    Found a
      starting at index 3 and ending at index 4
    Found 
      starting at index 4 and ending at index 4
    


    The output reveals five matches. The first, third, and fourth matches come as no surprise because they reveal the positions of the three a characters in abaa. The second and fifth matches are probably surprising, because they seem to indicate that a? matches b and also the text's end. However, that is not the case. a? does not look for b or the text's end. Instead, it looks for either the presence or the lack of a. When a? fails to find a, it reports that fact as a zero-length match: a match whose start and end indexes are the same. Zero-length matches occur in empty text, after the last text character, or between any two text characters.

  2. java RegexDemo a* abaa: uses a greedy quantifier to match a in abaa zero or more times. The following output results:

    Regex = a*
    Text = abaa
    Found a
      starting at index 0 and ending at index 1
    Found 
      starting at index 1 and ending at index 1
    Found aa
      starting at index 2 and ending at index 4
    Found 
      starting at index 4 and ending at index 4
    


    The output reveals four matches. As with a?, a* produces zero-length matches. The third match, where a* matches aa, is interesting. Unlike a?, a* matches either no a or every consecutive a.

  3. java RegexDemo a+ abaa: uses a greedy quantifier to match a in abaa one or more times. The following output results:

    Regex = a+
    Text = abaa
    Found a
      starting at index 0 and ending at index 1
    Found aa
      starting at index 2 and ending at index 4
    


    The output reveals two matches. Unlike a? and a*, a+ does not match the absence of a. Thus, no zero-length matches result. Like a*, a+ matches every consecutive a.

  4. java RegexDemo a{2} aababbaaaab: uses a greedy quantifier to match every aa sequence in aababbaaaab. The following output results:

    Regex = a{2}
    Text = aababbaaaab
    Found aa
      starting at index 0 and ending at index 2
    Found aa
      starting at index 6 and ending at index 8
    Found aa
      starting at index 8 and ending at index 10
    
  5. java RegexDemo a{2,} aababbaaaab: uses a greedy quantifier to match two or more consecutive a characters in aababbaaaab. The following output results:

    Regex = a{2,}
    Text = aababbaaaab
    Found aa
      starting at index 0 and ending at index 2
    Found aaaa
      starting at index 6 and ending at index 10
    
  6. java RegexDemo a{1,3} aababbaaaab: uses a greedy quantifier to match every a, aa, or aaa in aababbaaaab. The following output results:

    Regex = a{1,3}
    Text = aababbaaaab
    Found aa
      starting at index 0 and ending at index 2
    Found a
      starting at index 3 and ending at index 4
    Found aaa
      starting at index 6 and ending at index 9
    Found a
      starting at index 9 and ending at index 10
    
  7. java RegexDemo a+? abaa: uses a reluctant quantifier to match a in abaa one or more times. The following output results:

    Regex = a+?
    Text = abaa
    Found a
      starting at index 0 and ending at index 1
    Found a
      starting at index 2 and ending at index 3
    Found a
      starting at index 3 and ending at index 4
    


    Unlike its greedy variant in the third example, the reluctant example produces three matches of a single a because the reluctant quantifier tries to find the shortest match.
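
The index pairs reported in the examples above can be reproduced with a few lines of Java. In this sketch, the class name QuantifierDemo and the helper spans() are illustrative; spans() reports each match as start-end, so a zero-length match shows up as a pair of equal indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not the article's RegexDemo.
class QuantifierDemo {
    // Returns each match as "start-end"; equal indexes mark a zero-length match.
    static List<String> spans(String regex, String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(spans("a?", "abaa"));            // [0-1, 1-1, 2-3, 3-4, 4-4]
        System.out.println(spans("a*", "abaa"));            // [0-1, 1-1, 2-4, 4-4]
        System.out.println(spans("a+", "abaa"));            // [0-1, 2-4]
        System.out.println(spans("a+?", "abaa"));           // [0-1, 2-3, 3-4]
        System.out.println(spans("a{1,3}", "aababbaaaab")); // [0-2, 3-4, 6-9, 9-10]
        // Empty text still yields one zero-length match for a?.
        System.out.println(spans("a?", ""));                // [0-0]
    }
}
```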
