Apr 13, 2017 12:29 PM PT

Java 101: Regular expressions in Java, Part 1

Use the Regex API to discover and describe patterns in your Java programs

Kyle McDonald (CC BY 2.0)

Java's character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we'll unpack the three powerful classes residing in the java.util.regex package, then we'll explore the Pattern class and its sophisticated pattern-matching constructs.

download
Get the complete source code for this article's demo application. Created by Jeff Friesen for JavaWorld.

What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.

Pattern matching is the process of searching text to identify matches, or strings that match a regex's pattern. Java supports pattern matching via its Regex API. The API consists of three classes--Pattern, Matcher, and PatternSyntaxException--all located in the java.util.regex package:

  • Pattern objects, also known as patterns, are compiled regexes.
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe illegal regex patterns.

Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex) returns true only if the entire invoking string matches regex.
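
To make the whole-string requirement concrete, here is a minimal sketch that contrasts String's matches() method with a Matcher-based search. The MatchesVsFind class and its two helper methods are names I've made up for illustration; they are not part of the demo application.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class MatchesVsFind
{
   // True only if the ENTIRE input matches the regex.
   static boolean wholeMatch(String regex, String input)
   {
      return input.matches(regex);
   }

   // True if ANY portion of the input matches the regex.
   static boolean partialMatch(String regex, String input)
   {
      return Pattern.compile(regex).matcher(input).find();
   }

   public static void main(String[] args)
   {
      System.out.println(wholeMatch("apple", "applet"));   // false
      System.out.println(partialMatch("apple", "applet")); // true
      System.out.println(wholeMatch("apple", "apple"));    // true
   }
}
```

The difference matters when you validate input (where a whole-string match is usually what you want) versus when you search text (where a partial match is enough).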

RegexDemo

I've created the RegexDemo application to demonstrate Java's regular expressions and the various methods located in the Pattern, Matcher, and PatternSyntaxException classes. Here's the source code for the demo:

Listing 1. Demonstrating regexes

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}

The first thing RegexDemo's main() method does is validate its command line, which requires two arguments: the first is a regex, and the second is the input text to be matched against the regex.

You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an n character. main() converts this character sequence to Unicode value 10.

The bulk of RegexDemo's code is located in the try-catch construct. The try block first outputs the specified regex and input text and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.
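
If you'd like to see the catch block's behavior in isolation, the following sketch compiles a regex and reports a short diagnostic when compilation fails. The BadRegexSketch class and its diagnose() helper are hypothetical names chosen for this example, and the exact wording of the description text depends on the Java runtime, so I don't show it here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration class; not part of the article's RegexDemo.
class BadRegexSketch
{
   // Returns null if the regex compiles; otherwise returns a diagnostic
   // built from the exception's description, index, and pattern.
   static String diagnose(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return pse.getDescription() + " (index " + pse.getIndex() +
                ") in " + pse.getPattern();
      }
   }

   public static void main(String[] args)
   {
      System.out.println(diagnose("[a-z]+")); // null: the regex is legal
      System.out.println(diagnose("[abc"));   // diagnostic: the class is never closed
   }
}
```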

You don't need to know more about the source code's workings at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:

javac RegexDemo.java

Pattern and its constructs

Pattern, the first of three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern's SDK documentation describes various regex constructs, but unless you're already an avid regex user, you might be confused by parts of the documentation. What are quantifiers and what's the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I'll answer these questions and more in the next sections.

Literal strings

The simplest regex construct is the literal string. Some portion of the input text must match this construct's pattern in order to have a successful pattern match. Consider the following example:

java RegexDemo apple applet

This example attempts to discover if there is a match for the apple pattern in the applet input text. The following output reveals the match:

regex = apple
input = applet
Found [apple] starting at 0 and ending at 4

The output shows us the regex and input text, then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.

Now suppose we specify the following command line:

java RegexDemo apple crabapple

This time, we get the following match with different starting and ending indexes:

regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8

The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.
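
The same three searches can be run from inside a program. The sketch below mirrors Listing 1's loop but collects the results in a list; the MatchIndexes class and its locate() method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class MatchIndexes
{
   // Collects each match as "text@start-end", where end is the index of
   // the match's last character (Matcher.end() minus 1), as in Listing 1.
   static List<String> locate(String regex, String input)
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         found.add(m.group() + "@" + m.start() + "-" + (m.end() - 1));
      return found;
   }

   public static void main(String[] args)
   {
      System.out.println(locate("apple", "applet"));    // [apple@0-4]
      System.out.println(locate("apple", "crabapple")); // [apple@4-8]
      System.out.println(locate("applet", "apple"));    // []
   }
}
```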

Metacharacters

More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b. Consider the following example:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:

regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41

The output reveals two matches: fox and ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:

java RegexDemo . "The quick brown fox jumps over the lazy ox."

Because the period metacharacter matches any character (other than a line terminator, by default), RegexDemo outputs a match for each character (including the terminating period character) in the input text:

regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42

Character classes

We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of these below.

Simple character class

The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters a, b, and c.

Consider the following example:

java RegexDemo [csw] cave

This example matches only c with its counterpart in cave, as shown in the following output:

regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0

Negation character class

The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except a, b, and c.

Consider this example:

java RegexDemo "[^csw]" cave

Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.

This example matches a, v, and e with their counterparts in cave, as shown here:

regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3

Range character class

The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase alphabetic characters. It's equivalent to specifying [abcdefghijklmnopqrstuvwxyz].

Consider the following example:

java RegexDemo [a-c] clown

This example matches only c with its counterpart in clown, as shown:

regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0

Union character class

The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.

Consider the following example:

java RegexDemo [ab[c-e]] abcdef

This example matches a, b, c, d, and e with their counterparts in abcdef:

regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4

Intersection character class

The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]] matches characters d, e, and f.

Consider the following example:

java RegexDemo "[aeiouy&&[y]]" party

Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.

This example matches only y with its counterpart in party:

regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4

Subtraction character class

The subtraction character class consists of all characters except for those indicated in nested negation character classes and matches the remaining characters. For example, [a-z&&[^m-p]] matches characters a through l and q through z:

java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg

This example matches d and f with their counterparts in abcdefg:

regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5
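
To compare the six kinds of character class side by side, you could run them through one small helper. This is only a sketch: the CharClassSketch class and its matchedChars() method are hypothetical names, and the regexes and inputs are the ones used in the command lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class CharClassSketch
{
   // Concatenates every character of input that the character class matches.
   static String matchedChars(String charClass, String input)
   {
      StringBuilder kept = new StringBuilder();
      Matcher m = Pattern.compile(charClass).matcher(input);
      while (m.find())
         kept.append(m.group());
      return kept.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(matchedChars("[csw]", "cave"));                  // simple: c
      System.out.println(matchedChars("[^csw]", "cave"));                 // negation: ave
      System.out.println(matchedChars("[a-c]", "clown"));                 // range: c
      System.out.println(matchedChars("[ab[c-e]]", "abcdef"));            // union: abcde
      System.out.println(matchedChars("[aeiouy&&[y]]", "party"));         // intersection: y
      System.out.println(matchedChars("[a-f&&[^a-c]&&[^e]]", "abcdefg")); // subtraction: df
   }
}
```

Note that no shell quoting is needed here: characters such as ^ and & are special to some command shells, not to Java string literals.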

Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides predefined character classes as these shortcuts. Use them to simplify your regexes and minimize syntax errors.

Several categories of predefined character classes are provided: standard, POSIX, java.lang.Character, and Unicode script/block/category/binary property. The following list describes only the standard category:

  • \d: A digit. Equivalent to [0-9].
  • \D: A nondigit. Equivalent to [^0-9].
  • \s: A whitespace character. Equivalent to [ \t\n\x0B\f\r].
  • \S: A nonwhitespace character. Equivalent to [^\s].
  • \w: A word character. Equivalent to [a-zA-Z_0-9].
  • \W: A nonword character. Equivalent to [^\w].

This example uses the \w predefined character class to identify all word characters in the input text:

java RegexDemo \w "aZ.8 _"

You should observe the following output, which shows that the period and space characters are not considered word characters:

regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5
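
When a predefined character class appears in Java source code rather than on the command line, its backslash must itself be escaped in the string literal, so \w is written "\\w". The following sketch shows this; the PredefinedClassSketch class and its extract() method are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class PredefinedClassSketch
{
   // Concatenates all matches of regex found in input.
   static String extract(String regex, String input)
   {
      StringBuilder kept = new StringBuilder();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         kept.append(m.group());
      return kept.toString();
   }

   public static void main(String[] args)
   {
      String input = "aZ.8 _";
      // Each regex backslash is doubled inside a Java string literal.
      System.out.println(extract("\\w", input)); // aZ8_
      System.out.println(extract("\\d", input)); // 8
      System.out.println(extract("\\S", input)); // aZ.8_
   }
}
```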

Capturing groups

A capturing group saves a match's characters for later recall during pattern matching; this construct is a character sequence surrounded by parentheses metacharacters ( ( ) ). All characters within the capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in the input text. Each match replaces the previous match's saved Java characters with the next match's Java characters.

Capturing groups can be nested inside other capturing groups. For example, in the (Java( language)) regex, ( language) nests inside (Java( language)). Each capturing group, nested or not, receives its own number; numbering starts at 1 and proceeds from left to right according to the position of each group's opening parenthesis. In the example, (Java( language)) is capturing group number 1, and ( language) is capturing group number 2. In (a)(b), (a) is capturing group number 1, and (b) is capturing group number 2.

Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match, and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:

java RegexDemo "(Java( language)\2)" "The Java language language"

The example uses the (Java( language)\2) regex to search the input text "The Java language language" for a grammatical error, where Java immediately precedes two consecutive occurrences of language. The regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2's saved match, which allows the matcher to search for a second occurrence of a space character followed by language, which immediately follows the first occurrence of the space character and language. The output below shows what RegexDemo's matcher finds:

regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25
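
RegexDemo prints only the overall match, but a matcher also lets you retrieve what each numbered group captured. Here is a minimal sketch of that idea, using Matcher's groupCount() and group(int) methods; the BackReferenceSketch class and its groups() helper are hypothetical names of my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class BackReferenceSketch
{
   // Returns the first match's groups: index 0 is the entire match, and
   // indexes 1..n are the capturing groups. Returns an empty array when
   // there is no match.
   static String[] groups(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.find())
         return new String[0];
      String[] result = new String[m.groupCount() + 1];
      for (int i = 0; i <= m.groupCount(); i++)
         result[i] = m.group(i);
      return result;
   }

   public static void main(String[] args)
   {
      String[] g = groups("(Java( language)\\2)", "The Java language language");
      for (int i = 0; i < g.length; i++)
         System.out.println("group " + i + " = [" + g[i] + "]");
      // group 0 = [Java language language]
      // group 1 = [Java language language]
      // group 2 = [ language]
   }
}
```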

Boundary matchers

We sometimes want to match patterns at the beginning of lines, at word boundaries, at the end of text, and so on. You can accomplish this task by using one of Pattern's boundary matchers, which are regex constructs that identify match locations:

  • ^: The beginning of a line
  • $: The end of a line
  • \b: A word boundary
  • \B: A non-word boundary
  • \A: The beginning of the text
  • \G: The end of the previous match
  • \Z: The end of the text, except for the final line terminator (if any)
  • \z: The end of the text

The following example uses the ^ boundary matcher metacharacter to ensure that a line begins with The followed by zero or more word characters:

java RegexDemo "^The\w*" Therefore

The ^ character indicates that the first three input text characters must match the pattern's subsequent T, h, and e characters. Any number of word characters may follow. Here is the output:

regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8

Suppose you change the command line to java RegexDemo "^The\w*" " Therefore". What happens? No match is found because a space character precedes Therefore.
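
The sketch below repeats both ^ experiments and adds a \b example of my own (the cat regex and its input are not from the demo) to show how a word boundary keeps a search from matching inside longer words. The BoundarySketch class and its count() helper are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class BoundarySketch
{
   // Counts how many matches of regex occur in input.
   static int count(String regex, String input)
   {
      int n = 0;
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         n++;
      return n;
   }

   public static void main(String[] args)
   {
      System.out.println(count("^The\\w*", "Therefore"));  // 1
      System.out.println(count("^The\\w*", " Therefore")); // 0: a space comes first
      // Only the standalone word matches; "concat" and "cats" do not.
      System.out.println(count("\\bcat\\b", "cat concat cats")); // 1
   }
}
```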

Zero-length matches

You'll occasionally encounter zero-length matches when working with boundary matchers. A zero-length match is a match with no characters. It occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.

Consider the following example:

java RegexDemo \b\b "Java is"

This example matches two consecutive word boundaries and generates the following output:

regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6

The output reveals several zero-length matches. Each ending index is one less than its starting index because Listing 1 reports end() - 1 rather than end().
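
In your own code, one way to detect a zero-length match is to compare the matcher's start() and end() values directly, without subtracting 1. The following is a sketch under that assumption; the ZeroLengthSketch class and its emptyMatchPositions() method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class ZeroLengthSketch
{
   // Returns the index of every zero-length match (start() == end()).
   static List<Integer> emptyMatchPositions(String regex, String input)
   {
      List<Integer> positions = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         if (m.start() == m.end())
            positions.add(m.start());
      return positions;
   }

   public static void main(String[] args)
   {
      System.out.println(emptyMatchPositions("\\b\\b", "Java is")); // [0, 4, 5, 7]
   }
}
```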

Quantifiers

A quantifier is a regex construct that explicitly or implicitly binds a numeric value to a pattern. The numeric value determines how many times to match the pattern. Quantifiers are categorized as greedy, reluctant, or possessive:

  • A greedy quantifier (?, *, or +) attempts to find the longest match. Specify X? to find one or no occurrences of X, X* to find zero or more occurrences of X, X+ to find one or more occurrences of X, X{n} to find n occurrences of X, X{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but no more than m occurrences of X.
  • A reluctant quantifier (??, *?, or +?) attempts to find the shortest match. Specify X?? to find one or no occurrences of X, X*? to find zero or more occurrences of X, X+? to find one or more occurrences of X, X{n}? to find n occurrences of X, X{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but no more than m occurrences of X.
  • A possessive quantifier (?+, *+, or ++) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+ to find one or no occurrences of X, X*+ to find zero or more occurrences of X, X++ to find one or more occurrences of X, X{n}+ to find n occurrences of X, X{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but no more than m occurrences of X.

The following example demonstrates a greedy quantifier:

java RegexDemo .*ox "fox box pox"

Here's the output:

regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10

The greedy quantifier (.*) matches the longest sequence of characters that terminates in ox. It starts by consuming all of the input text and then is forced to back off until it discovers that the input text terminates with these characters.

Now consider a reluctant quantifier:

java RegexDemo .*?ox "fox box pox"

Here's its output:

regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10

The reluctant quantifier (.*?) matches the shortest sequence of characters that terminates in ox. It begins by consuming nothing and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.

Finally, we have the possessive quantifier:

java RegexDemo .*+ox "fox box pox"

And here's its output:

regex = .*+ox
input = fox box pox

The possessive quantifier (.*+) doesn't detect a match because it consumes the entire input text, leaving nothing to match ox at the end of the regex. Unlike a greedy quantifier, a possessive quantifier doesn't back off.
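
The three behaviors are easiest to compare when you look only at the first match each regex produces. Here is a small sketch that does so; the QuantifierSketch class and its firstMatch() helper are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class QuantifierSketch
{
   // Returns the text of the first match, or null if there is no match.
   static String firstMatch(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      return m.find() ? m.group() : null;
   }

   public static void main(String[] args)
   {
      String input = "fox box pox";
      System.out.println(firstMatch(".*ox", input));  // greedy: fox box pox
      System.out.println(firstMatch(".*?ox", input)); // reluctant: fox
      System.out.println(firstMatch(".*+ox", input)); // possessive: null
   }
}
```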

Zero-length matches

You'll occasionally encounter zero-length matches when working with quantifiers. For example, the following greedy quantifier produces several zero-length matches:

java RegexDemo a? abaa

This example produces the following output:

regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3

The output reveals five matches. Although the first, third, and fourth matches come as no surprise (in that they reveal the positions of the three a's in abaa), you might be surprised by the second and fifth matches. They seem to indicate that a matches b and also matches the text's end, but that isn't the case. Regex a? doesn't look for b or the text's end. Instead, it looks for either the presence or lack of a. When a? fails to find a, it reports that fact as a zero-length match.
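
If zero-length matches get in the way, you can count or skip them by checking whether the matched text is empty. The sketch below tallies both kinds of match; the OptionalMatchSketch class and its tally() method are hypothetical names, and the a+ comparison is my own addition rather than something the demo runs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class OptionalMatchSketch
{
   // Returns a two-element array: { total matches, zero-length matches }.
   static int[] tally(String regex, String input)
   {
      int total = 0;
      int empty = 0;
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
      {
         total++;
         if (m.group().isEmpty())
            empty++;
      }
      return new int[] { total, empty };
   }

   public static void main(String[] args)
   {
      int[] optional = tally("a?", "abaa");
      System.out.println(optional[0] + " matches, " + optional[1] + " empty"); // 5 matches, 2 empty
      int[] required = tally("a+", "abaa");
      System.out.println(required[0] + " matches, " + required[1] + " empty"); // 2 matches, 0 empty
   }
}
```

Because a+ requires at least one a, it can never produce a zero-length match; whether that is the right substitute depends on what you are searching for.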

Embedded flag expressions

Matchers assume certain defaults that can be overridden when compiling a regex into a pattern--something we'll discuss more in Part 2. A regex can override any default by including an embedded flag expression. This regex construct is specified as parentheses metacharacters surrounding a question mark metacharacter (?), which is followed by a specific lowercase letter. Pattern recognizes the following embedded flag expressions:

  • (?i): enables case-insensitive pattern matching. For example, java RegexDemo (?i)tree Treehouse matches tree with Tree. Case-sensitive pattern matching is the default.
  • (?x): permits whitespace and comments beginning with the # metacharacter to appear in a pattern. A matcher ignores both. For example, java RegexDemo ".at(?x)#match hat, cat, and so on" matter matches .at with mat. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
  • (?s): enables dotall mode in which the period metacharacter matches line terminators in addition to any other character. For example, java RegexDemo (?s). \n matches new-line. Non-dotall mode is the default: line-terminator characters don't match. For example, java RegexDemo . \n doesn't match new-line.
  • (?m): enables multiline mode in which ^ matches the beginning of every line and $ matches the end of every line. For example, java RegexDemo "(?m)^abc$" abc\nabc matches both abcs in the input text. Non-multiline mode is the default: ^ matches the beginning of the entire input text and $ matches the end of the entire input text. For example, java RegexDemo "^abc$" abc\nabc reports no matches.
  • (?u): enables Unicode-aware case folding. This flag works with (?i) to perform case-insensitive matching in a manner consistent with the Unicode Standard. The default setting is case-insensitive matching that assumes only characters in the US-ASCII character set match.
  • (?d): enables Unix lines mode in which a matcher recognizes only the \n line terminator in the context of the ., ^, and $ metacharacters. Non-Unix lines mode is the default: a matcher recognizes all terminators in the context of the aforementioned metacharacters.

Embedded flag expressions resemble capturing groups because they surround their characters with parentheses metacharacters. Unlike a capturing group, an embedded flag expression doesn't capture a match's characters. Instead, an embedded flag expression is an example of a noncapturing group, which is a regex construct that doesn't capture text characters. It's specified as a character sequence surrounded by parentheses metacharacters.
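
Several of the flag examples listed above can be checked with a few lines of code. This sketch counts matches with and without each embedded flag expression; the FlagSketch class and its count() helper are hypothetical names, and the Java string literal "abc\nabc" contains a real new-line character, which is what RegexDemo produces after converting \n on the command line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the article's RegexDemo.
class FlagSketch
{
   // Counts how many matches of regex occur in input.
   static int count(String regex, String input)
   {
      int n = 0;
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         n++;
      return n;
   }

   public static void main(String[] args)
   {
      System.out.println(count("tree", "Treehouse"));      // 0: case-sensitive by default
      System.out.println(count("(?i)tree", "Treehouse"));  // 1
      System.out.println(count(".", "\n"));                // 0: non-dotall by default
      System.out.println(count("(?s).", "\n"));            // 1
      System.out.println(count("^abc$", "abc\nabc"));      // 0: non-multiline by default
      System.out.println(count("(?m)^abc$", "abc\nabc"));  // 2
   }
}
```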

Conclusion

As you've probably realized by now, regular expressions are incredibly useful, and become more useful as you master the nuances of their syntax. So far I've introduced the basics of regular expressions and the Pattern class. In Part 2, we'll go much deeper into the Regex API, exploring methods associated with the Pattern, Matcher, and PatternSyntaxException classes. I'll also demonstrate two practical applications of the Regex API, which you will be able to immediately apply in your own programs.