Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides such shortcuts with predefined character classes, which Table 1 presents. Use predefined character classes to simplify
your regexes and minimize regex syntax errors.
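Table 1. Predefined character classes
| Predefined character class | Description |
|---|---|
| \d | A digit. Equivalent to [0-9]. |
| \D | A nondigit. Equivalent to [^0-9]. |
| \s | A whitespace character. Equivalent to [ \t\n\x0B\f\r]. |
| \S | A nonwhitespace character. Equivalent to [^\s]. |
| \w | A word character. Equivalent to [a-zA-Z_0-9]. |
| \W | A nonword character. Equivalent to [^\w]. |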
The following command-line example uses the \w predefined character class to identify all word characters in its text command-line argument:
java RegexDemo \w "aZ.8 _"
The command line above produces the following output, which shows that the period and space characters are not considered word characters:
Regex = \w
Text = aZ.8 _
Found a
starting at index 0 and ending at index 1
Found Z
starting at index 1 and ending at index 2
Found 8
starting at index 3 and ending at index 4
Found _
starting at index 5 and ending at index 6
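RegexDemo's source is not part of this section, so the following is only a minimal sketch of how such a tool might report matches, assuming it loops over Matcher.find(). The class name WordCharSketch and the helper findAll() are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for RegexDemo; the names here are illustrative only.
class WordCharSketch {
    // Collects one report line per match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add("Found " + m.group() + " starting at index " + m.start()
                    + " and ending at index " + m.end());
        }
        return found;
    }

    public static void main(String[] args) {
        // In Java source, "\\w" is the two-character regex \w.
        for (String line : findAll("\\w", "aZ.8 _")) {
            System.out.println(line);
        }
    }
}
```

Note that a regex typed on the command line needs no doubled backslash, whereas a Java string literal does.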
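| Note |
|---|
| Pattern's SDK documentation refers to the period metacharacter as a predefined character class that matches any character except for a line terminator (a one- or two-character sequence identifying the end of a text line) unless dotall mode (discussed later) is in effect. Pattern recognizes the following line terminators: the newline character (\n), a carriage return followed immediately by a newline (\r\n), a standalone carriage return (\r), the next-line character (\u0085), the line-separator character (\u2028), and the paragraph-separator character (\u2029). |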
Pattern supports a regex construct called a capturing group that saves a match's characters for later recall during pattern matching; that construct is a character sequence surrounded
by parentheses metacharacters (( )). All characters within that capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in text. Each match replaces the previous match's saved Java characters with the next match's Java characters.
Capturing groups can nest inside other capturing groups. For example, in (Java( language)), ( language) nests inside the outer group that begins with (Java. Each nested or nonnested capturing group receives its own number. Numbering starts at 1 and proceeds from left to right, in the order of the groups' opening parentheses. In the example, (Java( language)) is capturing group number 1, and ( language) is capturing group number 2. In (a)(b), (a) is capturing group number 1, and (b) is capturing group number 2.
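The numbering rule can be checked with a short sketch. It assumes the standard Matcher.groupCount() and Matcher.group(int) methods, where group 0 always denotes the entire match; the class name GroupNumberSketch and the helper groups() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: lists what each numbered capturing group saved.
class GroupNumberSketch {
    // Returns group 0 (the whole match) followed by groups 1..n of the
    // first match, or an empty array when nothing matches.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] saved = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            saved[i] = m.group(i);
        }
        return saved;
    }

    public static void main(String[] args) {
        String[] saved = groups("(Java( language))", "The Java language");
        for (int i = 0; i < saved.length; i++) {
            System.out.println("group " + i + " = [" + saved[i] + "]");
        }
    }
}
```

Here group 1 holds Java language and group 2 holds language preceded by a space, which matches the left-to-right numbering described above.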
Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:
java RegexDemo "(Java( language)\2)" "The Java language language"
The example uses the (Java( language)\2) regex to search the text The Java language language for a grammatical error, where Java immediately precedes two consecutive occurrences of language. That regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2's saved match, which allows the matcher to search for a second occurrence of a space character
followed by language, which immediately follows the first occurrence of the space character and language. The following output shows what RegexDemo's matcher finds:
Regex = (Java( language)\2)
Text = The Java language language
Found Java language language
starting at index 4 and ending at index 26
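The same back reference can be exercised directly through the java.util.regex API. This is a minimal sketch; the class name BackReferenceSketch and the helper firstSpan() are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows \2 forcing a repeat of group 2's saved match.
class BackReferenceSketch {
    // Returns "start-end" for the first match, or "no match".
    static String firstSpan(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        // "\\2" in Java source is the regex back reference \2.
        String regex = "(Java( language)\\2)";
        // The doubled word satisfies the back reference.
        System.out.println(firstSpan(regex, "The Java language language"));
        // Without the repeated word, \2 has nothing to match.
        System.out.println(firstSpan(regex, "The Java language"));
    }
}
```

The second call finds nothing because \2 requires a second copy of whatever group 2 captured, and the text contains only one.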
Quantifiers are probably the most confusing regex constructs to understand. Part of that confusion comes from trying to grasp
Pattern's 18 quantifier categories (organized as three major categories of six fundamental quantifier categories). Another part of
that confusion comes from trying to decipher the concept of zero-length matches. Once you understand that concept and those
18 categories, much (if not all) of the confusion disappears.
| Note |
|---|
| For brevity, this section discusses only the basics of the 18 quantifier categories and the zero-length match concept. Study The Java Tutorial's "Quantifiers" section for a more detailed discussion and more examples. |
A quantifier is a regex construct that implicitly or explicitly binds a numeric value to a pattern. That numeric value determines how
many times to match a pattern. Pattern's six fundamental quantifiers match a pattern once or not at all, zero or more times, one or more times, an exact number
of times, at least x times, and at least x times but no more than y times.
The six fundamental quantifier categories replicate in each of three major categories: greedy, reluctant, and possessive. Greedy quantifiers attempt to find the longest match. In contrast, reluctant quantifiers attempt to find the shortest match. Possessive quantifiers also try to find the longest match, but they differ from greedy quantifiers in how they work. Both greedy and possessive quantifiers begin by consuming as much text as they can. If the rest of the regex then fails to match, a greedy quantifier backs off one character at a time and lets the matcher try again, so it often leads to multiple match attempts. A possessive quantifier never gives back what it has consumed, so the matcher attempts a match only once.
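The difference between the three major categories shows up most clearly when a quantified pattern is followed by more regex. The sketch below applies .*foo, .*?foo, and .*+foo to one sample text; the text, the class name QuantifierModeSketch, and the helper firstMatch() are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares greedy, reluctant, and possessive forms of .*
class QuantifierModeSketch {
    // Returns the first matched substring, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: consumes everything, then backs off until foo fits.
        System.out.println("greedy     = " + firstMatch(".*foo", text));
        // Reluctant: grows one character at a time until foo fits.
        System.out.println("reluctant  = " + firstMatch(".*?foo", text));
        // Possessive: consumes everything and never backs off, so foo
        // has nothing left to match.
        System.out.println("possessive = " + firstMatch(".*+foo", text));
    }
}
```

The greedy form matches the whole text, the reluctant form stops at the first foo, and the possessive form fails outright.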
The following examples illustrate the behavior of the six fundamental quantifiers in the greedy category, and the behavior of a single fundamental quantifier in each of the reluctant and possessive categories. These examples also introduce the zero-length match concept:
java RegexDemo a? abaa: uses a greedy quantifier to match a in abaa once or not at all. The following output results:
Regex = a?
Text = abaa
Found a
starting at index 0 and ending at index 1
Found
starting at index 1 and ending at index 1
Found a
starting at index 2 and ending at index 3
Found a
starting at index 3 and ending at index 4
Found
starting at index 4 and ending at index 4
The output reveals five matches. The first, third, and fourth matches come as no surprise because they reveal the positions of the three a characters in abaa, but the second and fifth matches are probably surprising. Those matches seem to indicate that a? matches b and also the text's end. However, that is not the case: a? does not look for b or the text's end. Instead, it looks for either the presence or the absence of a. When a? fails to find a, it reports that fact as a zero-length match, a match whose start and end indexes are the same. Zero-length matches can occur in empty text, after the last text character, or between any two text characters.
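Zero-length matches can be detected in code by comparing Matcher.start() with Matcher.end(). The following minimal sketch counts them; the class name ZeroLengthSketch and the helper countZeroLength() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: counts matches whose start and end indexes coincide.
class ZeroLengthSketch {
    static int countZeroLength(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            if (m.start() == m.end()) {
                count++; // nothing was consumed by this match
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Zero-length matches before b and at the text's end.
        System.out.println(countZeroLength("a?", "abaa"));
        // Empty text still yields a single zero-length match.
        System.out.println(countZeroLength("a?", ""));
        // a+ requires at least one a, so it never matches nothing.
        System.out.println(countZeroLength("a+", "abaa"));
    }
}
```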
java RegexDemo a* abaa: uses a greedy quantifier to match a in abaa zero or more times. The following output results:
Regex = a*
Text = abaa
Found a
starting at index 0 and ending at index 1
Found
starting at index 1 and ending at index 1
Found aa
starting at index 2 and ending at index 4
Found
starting at index 4 and ending at index 4
The output reveals four matches. As with a?, a* produces zero-length matches. The third match, where a* matches aa, is interesting: unlike a?, a* matches either no a or all consecutive a characters.
java RegexDemo a+ abaa: uses a greedy quantifier to match a in abaa one or more times. The following output results:
Regex = a+
Text = abaa
Found a
starting at index 0 and ending at index 1
Found aa
starting at index 2 and ending at index 4
The output reveals two matches. Unlike a? and a*, a+ does not match the absence of a. Thus, no zero-length matches result. Like a*, a+ matches all consecutive a characters.
java RegexDemo a{2} aababbaaaab: uses a greedy quantifier to match every aa sequence in aababbaaaab. The following output results:
Regex = a{2}
Text = aababbaaaab
Found aa
starting at index 0 and ending at index 2
Found aa
starting at index 6 and ending at index 8
Found aa
starting at index 8 and ending at index 10
java RegexDemo a{2,} aababbaaaab: uses a greedy quantifier to match two or more consecutive a characters in aababbaaaab. The following output results:
Regex = a{2,}
Text = aababbaaaab
Found aa
starting at index 0 and ending at index 2
Found aaaa
starting at index 6 and ending at index 10
java RegexDemo a{1,3} aababbaaaab: uses a greedy quantifier to match every a, aa, or aaa in aababbaaaab. The following output results:
Regex = a{1,3}
Text = aababbaaaab
Found aa
starting at index 0 and ending at index 2
Found a
starting at index 3 and ending at index 4
Found aaa
starting at index 6 and ending at index 9
Found a
starting at index 9 and ending at index 10
java RegexDemo a+? abaa: uses a reluctant quantifier to match a in abaa one or more times. The following output results:
Regex = a+?
Text = abaa
Found a
starting at index 0 and ending at index 1
Found a
starting at index 2 and ending at index 3
Found a
starting at index 3 and ending at index 4
Unlike its greedy variant in the third example, the reluctant example produces three matches of a single a because the reluctant quantifier tries to find the shortest match.
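The counted and reluctant quantifier examples above can also be reproduced without RegexDemo. The sketch below prints each match as a start-end pair; the class name QuantifierSpanSketch and the helper spans() are hypothetical, and the a{1,3}? case is an extra illustration that does not appear among the examples above.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: lists every match of a regex as "start-end".
class QuantifierSpanSketch {
    static String spans(String regex, String text) {
        StringJoiner joined = new StringJoiner(" ");
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            joined.add(m.start() + "-" + m.end());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String text = "aababbaaaab";
        System.out.println("a{2}    : " + spans("a{2}", text));
        System.out.println("a{2,}   : " + spans("a{2,}", text));
        System.out.println("a{1,3}  : " + spans("a{1,3}", text));
        // Reluctant variant: settles for the minimum of one a per match.
        System.out.println("a{1,3}? : " + spans("a{1,3}?", text));
        // Greedy versus reluctant one-or-more on the shorter text.
        System.out.println("a+      : " + spans("a+", "abaa"));
        System.out.println("a+?     : " + spans("a+?", "abaa"));
    }
}
```

Because nothing follows the quantified a in these regexes, a reluctant quantifier has no reason to extend a match beyond its minimum, which is why a+? and a{1,3}? report single characters.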