Java 101: Regular expressions in Java, Part 1

Use the Regex API to discover and describe patterns in your Java programs


Subtraction character class

A subtraction character class starts from a set of characters and removes those indicated in nested negation character classes; it matches whatever remains. For example, [a-z&&[^m-p]] matches characters a through l and q through z. Subtractions can also be chained, as in the following command:

java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg

This example matches d and f with their counterparts in abcdefg:

regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5
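The same two subtraction classes can be tried directly from Java code. What follows is a minimal sketch, not RegexDemo itself: the class name SubtractionClassSketch and the findAll() helper are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubtractionClassSketch {
    // Hypothetical helper: collects every substring of input that regex matches.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // a through z, minus m through p: only l and q survive in this input.
        System.out.println(findAll("[a-z&&[^m-p]]", "lmnopq"));       // [l, q]
        // a through f, minus a through c, minus e: d and f remain.
        System.out.println(findAll("[a-f&&[^a-c]&&[^e]]", "abcdefg")); // [d, f]
    }
}
```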

Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides predefined character classes as these shortcuts. Use them to simplify your regexes and minimize syntax errors.

Several categories of predefined character classes are provided: standard, POSIX, java.lang.Character, and Unicode script/block/category/binary property. The following list describes only the standard category:

  • \d: A digit. Equivalent to [0-9].
  • \D: A nondigit. Equivalent to [^0-9].
  • \s: A whitespace character. Equivalent to [ \t\n\x0B\f\r].
  • \S: A nonwhitespace character. Equivalent to [^\s].
  • \w: A word character. Equivalent to [a-zA-Z_0-9].
  • \W: A nonword character. Equivalent to [^\w].

This example uses the \w predefined character class to identify all word characters in the input text:

java RegexDemo \w "aZ.8 _"

You should observe the following output, which shows that the period and space characters are not considered word characters:

regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5
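When a predefined character class appears in a Java string literal rather than on the command line, its backslash must be doubled. The sketch below illustrates this with the same input text; the class name PredefinedClassSketch and the matchedChars() helper are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassSketch {
    // Hypothetical helper: concatenates every match of regex found in input.
    static String matchedChars(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "aZ.8 _";
        // "\\w" in Java source is the two-character regex \w.
        System.out.println("[" + matchedChars("\\w", input) + "]"); // [aZ8_]
        System.out.println("[" + matchedChars("\\W", input) + "]"); // [. ]
        System.out.println("[" + matchedChars("\\d", input) + "]"); // [8]
        System.out.println("[" + matchedChars("\\s", input) + "]"); // [ ]
    }
}
```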

Capturing groups

A capturing group saves a match's characters for later recall during pattern matching; this construct is a character sequence surrounded by parentheses metacharacters ( ( ) ). All characters within the capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in the input text. Each match replaces the previous match's saved Java characters with the next match's Java characters.

Capturing groups can be nested inside other capturing groups. For example, in the (Java( language)) regex, ( language) nests inside the outer (Java( language)) group. Each capturing group, nested or not, receives its own number; numbering starts at 1 and proceeds from left to right according to each group's opening parenthesis. In the example, (Java( language)) belongs to capturing group number 1, and ( language) belongs to capturing group number 2. In (a)(b), (a) belongs to capturing group number 1, and (b) belongs to capturing group number 2.

Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match, and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:

java RegexDemo "(Java( language)\2)" "The Java language language"

The example uses the (Java( language)\2) regex to search the input text "The Java language language" for a grammatical error, where Java immediately precedes two consecutive occurrences of language. The regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2's saved match, so the matcher looks for a second occurrence of a space character followed by language immediately after the first occurrence. The output below shows what RegexDemo's matcher finds:

regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25
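To see what each numbered group saved, you can ask the matcher directly. The following sketch relies on Matcher's group(int) and groupCount() methods; the class name BackReferenceSketch and the groupsOfFirstMatch() helper are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceSketch {
    // Hypothetical helper: returns group 0 (the whole match) followed by each
    // numbered capturing group of the first match, or an empty array if none.
    static String[] groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String regex = "(Java( language)\\2)"; // \2 recalls group 2's text
        // Groups 0 and 1 hold "Java language language"; group 2 holds " language".
        System.out.println(Arrays.toString(
            groupsOfFirstMatch(regex, "The Java language language")));
        // Without the repeated word, the back reference cannot match: prints []
        System.out.println(Arrays.toString(
            groupsOfFirstMatch(regex, "The Java language")));
    }
}
```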

Boundary matchers

We sometimes want to match patterns at the beginning of lines, at word boundaries, at the end of text, and so on. You can accomplish this task by using one of Pattern's boundary matchers, which are regex constructs that identify match locations:

  • ^: The beginning of a line
  • $: The end of a line
  • \b: A word boundary
  • \B: A non-word boundary
  • \A: The beginning of the text
  • \G: The end of the previous match
  • \Z: The end of the text, except for the final line terminator (if any)
  • \z: The end of the text

The following example uses the ^ boundary matcher metacharacter to ensure that a line begins with The followed by zero or more word characters:

java RegexDemo "^The\w*" Therefore

The ^ character indicates that the first three input text characters must match the pattern's subsequent T, h, and e characters. Any number of word characters may follow. Here is the output:

regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8

Suppose you change the command line to java RegexDemo "^The\w*" " Therefore". What happens? No match is found because a space character precedes Therefore.
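The effect of a boundary matcher is easy to check with Matcher's find() method. This is a minimal sketch under the assumption that a yes/no answer is all you need; the class name BoundaryMatcherSketch and the found() helper are hypothetical.

```java
import java.util.regex.Pattern;

class BoundaryMatcherSketch {
    // Hypothetical helper: reports whether regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // ^ anchors the match to the beginning of the text.
        System.out.println(found("^The\\w*", "Therefore"));  // true
        System.out.println(found("^The\\w*", " Therefore")); // false: leading space
        // \b requires a word boundary on each side of "is".
        System.out.println(found("\\bis\\b", "Java is"));    // true
        System.out.println(found("\\bis\\b", "This"));       // false: "is" is inside a word
    }
}
```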

Zero-length matches

You'll occasionally encounter zero-length matches when working with boundary matchers. A zero-length match is a match with no characters. It occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.

Consider the following example:

java RegexDemo \b\b "Java is"

This example matches two consecutive word boundaries and generates the following output:

regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6

The output reveals several zero-length matches. Each ending index is one less than its starting index because I specified end() - 1 in RegexDemo's source code (Listing 1).
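Because Matcher's end() method returns the offset just past the last matched character, a zero-length match is one where start() and end() are equal. The sketch below uses that test to list the positions of such matches; the class name ZeroLengthMatchSketch and the zeroLengthMatchIndexes() helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthMatchSketch {
    // Hypothetical helper: returns the index of every zero-length match.
    static List<Integer> zeroLengthMatchIndexes(String regex, String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // An empty match starts and ends at the same offset.
            if (m.start() == m.end()) {
                indexes.add(m.start());
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Word boundaries surround "Java" and "is".
        System.out.println(zeroLengthMatchIndexes("\\b\\b", "Java is")); // [0, 4, 5, 7]
    }
}
```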

Quantifiers

A quantifier is a regex construct that explicitly or implicitly binds a numeric value to a pattern. The numeric value determines how many times to match the pattern. Quantifiers are categorized as greedy, reluctant, or possessive:

  • A greedy quantifier (?, *, or +) attempts to find the longest match. Specify X? to find one or no occurrences of X, X* to find zero or more occurrences of X, X+ to find one or more occurrences of X, X{n} to find n occurrences of X, X{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but no more than m occurrences of X.
  • A reluctant quantifier (??, *?, or +?) attempts to find the shortest match. Specify X?? to find one or no occurrences of X, X*? to find zero or more occurrences of X, X+? to find one or more occurrences of X, X{n}? to find n occurrences of X, X{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but no more than m occurrences of X.
  • A possessive quantifier (?+, *+, or ++) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+ to find one or no occurrences of X, X*+ to find zero or more occurrences of X, X++ to find one or more occurrences of X, X{n}+ to find n occurrences of X, X{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but no more than m occurrences of X.

The following example demonstrates a greedy quantifier:

java RegexDemo .*ox "fox box pox"

Here's the output:

regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10

The greedy quantifier (.*) matches the longest sequence of characters that terminates in ox. It starts by consuming all of the input text and then is forced to back off until it discovers that the input text terminates with these characters.

Now consider a reluctant quantifier:

java RegexDemo .*?ox "fox box pox"

Here's its output:

regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10

The reluctant quantifier (.*?) matches the shortest sequence of characters that terminates in ox. It begins by consuming nothing and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.

Finally, we have the possessive quantifier:

java RegexDemo .*+ox "fox box pox"

And here's its output:

regex = .*+ox
input = fox box pox

The possessive quantifier (.*+) doesn't detect a match because it consumes the entire input text, leaving nothing to match ox at the end of the regex. Unlike a greedy quantifier, a possessive quantifier doesn't back off.
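The three quantifier categories can be compared side by side in code. Here is a minimal sketch; the class name QuantifierSketch and the findAll() helper are hypothetical, and the a{2,3} patterns are extra illustrations of the brace forms that do not appear in the commands above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierSketch {
    // Hypothetical helper: collects every substring of input that regex matches.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "fox box pox";
        // Greedy: one match spanning the whole input.
        System.out.println(findAll(".*ox", input).size());  // 1
        // Reluctant: three shorter matches, each ending in ox.
        System.out.println(findAll(".*?ox", input).size()); // 3
        // Possessive: .*+ keeps everything it consumed, so ox never matches.
        System.out.println(findAll(".*+ox", input).size()); // 0
        // Brace forms follow the same rules.
        System.out.println(findAll("a{2,3}", "aaaa"));      // [aaa]
        System.out.println(findAll("a{2,3}?", "aaaa"));     // [aa, aa]
    }
}
```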

Zero-length matches

You'll occasionally encounter zero-length matches when working with quantifiers. For example, the following greedy quantifier produces several zero-length matches:

java RegexDemo a? abaa

This example produces the following output:

regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3

The output reveals five matches. Although the first, third, and fourth matches come as no surprise (in that they reveal the positions of the three a's in abaa), you might be surprised by the second and fifth matches. They seem to indicate that a matches b and also matches the text's end, but that isn't the case. Regex a? doesn't look for b or the text's end. Instead, it looks for either the presence or lack of a. When a? fails to find a, it reports that fact as a zero-length match.

Embedded flag expressions

Matchers assume certain defaults that can be overridden when compiling a regex into a pattern, something we'll discuss more in Part 2. A regex can override any default by including an embedded flag expression. This regex construct is specified as parentheses metacharacters surrounding a question mark metacharacter (?), which is followed by a specific lowercase letter. Pattern recognizes the following embedded flag expressions:

  • (?i): enables case-insensitive pattern matching. For example, java RegexDemo (?i)tree Treehouse matches tree with Tree. Case-sensitive pattern matching is the default.
  • (?x): permits whitespace and comments beginning with the # metacharacter to appear in a pattern. A matcher ignores both. For example, java RegexDemo ".at(?x)#match hat, cat, and so on" matter matches .at with mat. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
  • (?s): enables dotall mode in which the period metacharacter matches line terminators in addition to any other character. For example, java RegexDemo (?s). \n matches new-line. Non-dotall mode is the default: line-terminator characters don't match. For example, java RegexDemo . \n doesn't match new-line.
  • (?m): enables multiline mode in which ^ matches the beginning of every line and $ matches the end of every line. For example, java RegexDemo "(?m)^abc$" abc\nabc matches both abcs in the input text. Non-multiline mode is the default: ^ matches the beginning of the entire input text and $ matches the end of the entire input text. For example, java RegexDemo "^abc$" abc\nabc reports no matches.
  • (?u): enables Unicode-aware case folding. This flag works with (?i) to perform case-insensitive matching in a manner consistent with the Unicode Standard. The default setting is case-insensitive matching that assumes only characters in the US-ASCII character set match.
  • (?d): enables Unix lines mode in which a matcher recognizes only the \n line terminator in the context of the ., ^, and $ metacharacters. Non-Unix lines mode is the default: a matcher recognizes all terminators in the context of the aforementioned metacharacters.

Embedded flag expressions resemble capturing groups because they surround their characters with parentheses metacharacters. Unlike a capturing group, an embedded flag expression doesn't capture a match's characters. Instead, an embedded flag expression is an example of a noncapturing group, which is a regex construct that doesn't capture text characters. A noncapturing group is also surrounded by parentheses metacharacters, but its opening parenthesis is immediately followed by a question mark.
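Several of the embedded flag expressions can be exercised from Java code by counting matches with and without the flag. This is only a sketch: the class name EmbeddedFlagSketch and the count() helper are hypothetical, and the (?x) pattern is a variation on the command shown in the list, with the flag placed first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedFlagSketch {
    // Hypothetical helper: counts how many times regex matches in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching.
        System.out.println(count("tree", "Treehouse"));       // 0
        System.out.println(count("(?i)tree", "Treehouse"));   // 1
        // (?s): the period also matches a line terminator.
        System.out.println(count(".", "\n"));                 // 0
        System.out.println(count("(?s).", "\n"));             // 1
        // (?m): ^ and $ match at the beginning and end of every line.
        System.out.println(count("^abc$", "abc\nabc"));       // 0
        System.out.println(count("(?m)^abc$", "abc\nabc"));   // 2
        // (?x): whitespace and #-comments in the pattern are ignored.
        System.out.println(count("(?x) .at  # match hat, cat, and so on", "matter")); // 1
    }
}
```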

Conclusion

As you've probably realized by now, regular expressions are incredibly useful, and become more useful as you master the nuances of their syntax. So far I've introduced the basics of regular expressions and the Pattern class. In Part 2, we'll go much deeper into the Regex API, exploring methods associated with the Pattern, Matcher, and PatternSyntaxException classes. I'll also demonstrate two practical applications of the Regex API, which you will be able to immediately apply in your own programs.
