Java 101: Regular expressions in Java, Part 1

Use the Regex API to discover and describe patterns in your Java programs


Java's character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we'll unpack the three powerful classes residing in the java.util.regex package, then we'll explore the Pattern class and its sophisticated pattern-matching constructs.

download
Get the complete source code for this article's demo application. Created by Jeff Friesen for JavaWorld.

What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.

Pattern matching is the process of searching text to identify matches, or strings that match a regex's pattern. Java supports pattern matching via its Regex API. The API consists of three classes--Pattern, Matcher, and PatternSyntaxException--all located in the java.util.regex package:

  • Pattern objects, also known as patterns, are compiled regexes.
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe illegal regex patterns.

Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex) returns true only if the entire invoking string matches the pattern described by regex.
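The difference between a whole-string match and a search for any matching portion can be sketched as follows. This is a minimal illustration rather than part of the article's demo; the class name MatchesVersusFind and its helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class MatchesVersusFind
{
   // String.matches() succeeds only when the entire string matches the regex.
   static boolean wholeMatch(String regex, String input)
   {
      return input.matches(regex);
   }

   // Matcher.find() succeeds when any portion of the string matches the regex.
   static boolean partialMatch(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      return m.find();
   }

   public static void main(String[] args)
   {
      System.out.println(wholeMatch("cat", "cat"));           // true
      System.out.println(wholeMatch("cat", "concatenate"));   // false
      System.out.println(partialMatch("cat", "concatenate")); // true
   }
}
```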

RegexDemo

I've created the RegexDemo application to demonstrate Java's regular expressions and the various methods located in the Pattern, Matcher, and PatternSyntaxException classes. Here's the source code for the demo:

Listing 1. Demonstrating regexes

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}

The first thing RegexDemo's main() method does is validate its command line, which requires two arguments: the first argument is a regex, and the second argument is input text to be matched against the regex.

You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an n character. main() converts this character sequence to Unicode value 10.
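The conversion can be tried in isolation. The sketch below reuses the replaceAll() call from Listing 1; the class name NewlineConversion and the sample text are assumptions made for illustration.

```java
// Hypothetical class name, used only for this illustration.
class NewlineConversion
{
   // In Java source, "\\\\n" denotes the regex \\n, which matches a
   // backslash followed by n. The replacement "\n" is a new-line character.
   static String convert(String arg)
   {
      return arg.replaceAll("\\\\n", "\n");
   }

   public static void main(String[] args)
   {
      String raw = "first\\nsecond"; // backslash followed by n, as typed on the command line
      String converted = convert(raw);
      System.out.println(raw.length());       // 13
      System.out.println(converted.length()); // 12
      System.out.println(converted);          // prints first and second on separate lines
   }
}
```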

The bulk of RegexDemo's code is located in the try-catch construct. The try block first outputs the specified regex and input text and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.
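To see the catch path on its own, the following sketch compiles a deliberately malformed regex and inspects the resulting exception. The class name BadRegexSketch and its tryCompile() helper are hypothetical. The exact description and index text depend on the Java runtime, so no expected values are shown for them.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical class name, used only for this illustration.
class BadRegexSketch
{
   // Returns null when the regex compiles, or the exception when it does not.
   static PatternSyntaxException tryCompile(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return pse;
      }
   }

   public static void main(String[] args)
   {
      // "[abc" lacks a closing square bracket, so compilation fails.
      PatternSyntaxException pse = tryCompile("[abc");
      System.out.println("Description: " + pse.getDescription());
      System.out.println("Index: " + pse.getIndex());
      System.out.println("Incorrect pattern: " + pse.getPattern());
      System.out.println(tryCompile("[abc]") == null); // true
   }
}
```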

You don't need to know more about the source code's workings at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:

javac RegexDemo.java

Pattern and its constructs

Pattern, the first of three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern's SDK documentation describes various regex constructs, but unless you're already an avid regex user, you might be confused by parts of the documentation. What are quantifiers and what's the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I'll answer these questions and more in the next sections.

Literal strings

The simplest regex construct is the literal string. Some portion of the input text must match this construct's pattern in order to have a successful pattern match. Consider the following example:

java RegexDemo apple applet

This example attempts to discover if there is a match for the apple pattern in the applet input text. The following output reveals the match:

regex = apple
input = applet
Found [apple] starting at 0 and ending at 4

The output shows us the regex and input text, then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match. (Matcher's end() method returns the index just past the last matched character, which is why Listing 1 subtracts 1 before printing it.)

Now suppose we specify the following command line:

java RegexDemo apple crabapple

This time, we get the following match with different starting and ending indexes:

regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8

The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.
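The three literal-string scenarios above can also be checked programmatically. This is a small sketch that mirrors RegexDemo's matching loop; the class name LiteralMatchSketch and the text@start-end result format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class LiteralMatchSketch
{
   // Returns one "text@start-end" entry per match. Like RegexDemo, it
   // reports an inclusive ending index (m.end() - 1).
   static List<String> findAll(String regex, String input)
   {
      List<String> results = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         results.add(m.group() + "@" + m.start() + "-" + (m.end() - 1));
      return results;
   }

   public static void main(String[] args)
   {
      System.out.println(findAll("apple", "applet"));    // [apple@0-4]
      System.out.println(findAll("apple", "crabapple")); // [apple@4-8]
      System.out.println(findAll("applet", "apple"));    // []
   }
}
```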

Metacharacters

More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b (by default, any character other than a line terminator). Consider the following example:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:

regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41

The output reveals two matches: fox and ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:

java RegexDemo . "The quick brown fox jumps over the lazy ox."

Because the period metacharacter matches any character, RegexDemo outputs a match for each character (including the terminating period character) in the input text:

regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
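One caveat is worth a quick experiment: by default, the period metacharacter does not match line terminators, so input containing a new-line character yields one fewer match. The sketch below counts matches with and without the Pattern.DOTALL flag; the class name DotSketch and its countMatches() helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class DotSketch
{
   // Counts how many times regex matches in input, compiling with the given flags.
   static int countMatches(String regex, String input, int flags)
   {
      Matcher m = Pattern.compile(regex, flags).matcher(input);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String input = "a\nb"; // three characters: a, new-line, b
      System.out.println(countMatches(".", input, 0));              // 2
      System.out.println(countMatches(".", input, Pattern.DOTALL)); // 3
   }
}
```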

Character classes

We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of these below.

Simple character class

The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters a, b, and c.

Consider the following example:

java RegexDemo [csw] cave

This example matches only c with its counterpart in cave, as shown in the following output:

regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0
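A simple character class is often more useful when combined with literal characters. As a brief sketch, the regex [csw]ave accepts cave, save, and wave but rejects other words ending in ave. The class name SimpleClassSketch is hypothetical, and the static Pattern.matches() method used here requires the entire input to match.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class SimpleClassSketch
{
   // True when word consists of c, s, or w followed by the literal text ave.
   static boolean isAveWord(String word)
   {
      return Pattern.matches("[csw]ave", word);
   }

   public static void main(String[] args)
   {
      System.out.println(isAveWord("cave")); // true
      System.out.println(isAveWord("wave")); // true
      System.out.println(isAveWord("gave")); // false
   }
}
```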

Negation character class

The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except a, b, and c.

Consider this example:

java RegexDemo "[^csw]" cave

Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.

This example matches a, v, and e with their counterparts in cave, as shown here:

regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
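Another way to visualize a negation character class is to replace every character it matches. The following sketch masks each character that is not c, s, or w with an asterisk, using String's replaceAll() method; the class name NegationSketch and the choice of mask character are assumptions made for illustration.

```java
// Hypothetical class name, used only for this illustration.
class NegationSketch
{
   // Replaces every character other than c, s, and w with an asterisk.
   static String mask(String input)
   {
      return input.replaceAll("[^csw]", "*");
   }

   public static void main(String[] args)
   {
      System.out.println(mask("cave"));  // c***
      System.out.println(mask("swiss")); // sw*ss
   }
}
```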

Range character class

The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase letters from a through z. It's equivalent to specifying [abcdefghijklmnopqrstuvwxyz].

Consider the following example:

java RegexDemo [a-c] clown

This example matches only c with its counterpart in clown, as shown:

regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0
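A range's boundaries are inclusive, and matching is case-sensitive by default. The sketch below tests individual characters against [a-c] to show both points; the class name RangeSketch and its inRange() helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class RangeSketch
{
   // True when c falls within the range a through c, inclusive.
   static boolean inRange(char c)
   {
      return Pattern.matches("[a-c]", String.valueOf(c));
   }

   public static void main(String[] args)
   {
      System.out.println(inRange('a')); // true
      System.out.println(inRange('c')); // true
      System.out.println(inRange('d')); // false
      System.out.println(inRange('B')); // false (uppercase is outside the range)
   }
}
```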

Union character class

The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.

Consider the following example:

java RegexDemo [ab[c-e]] abcdef

This example matches a, b, c, d, and e with their counterparts in abcdef:

regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
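Because a union simply merges its member classes, [ab[c-e]] should behave like the single range [a-e]. The following sketch checks that claim character by character over a sample string; the class name UnionSketch and its sameMatches() helper are hypothetical, and agreement on one sample is a demonstration rather than a proof.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class UnionSketch
{
   // True when both single-character regexes accept or reject
   // every character of sample in the same way.
   static boolean sameMatches(String regexA, String regexB, String sample)
   {
      Pattern a = Pattern.compile(regexA);
      Pattern b = Pattern.compile(regexB);
      for (int i = 0; i < sample.length(); i++)
      {
         String ch = String.valueOf(sample.charAt(i));
         if (a.matcher(ch).matches() != b.matcher(ch).matches())
            return false;
      }
      return true;
   }

   public static void main(String[] args)
   {
      System.out.println(sameMatches("[ab[c-e]]", "[a-e]", "abcdefXYZ")); // true
      System.out.println(sameMatches("[ab[c-e]]", "[a-f]", "abcdefXYZ")); // false (f differs)
   }
}
```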

Intersection character class

The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]] matches characters d, e, and f.

Consider the following example:

java RegexDemo "[aeiouy&&[y]]" party

Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.

This example matches only y with its counterpart in party:

regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
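Inside a Java program, no shell quoting is needed for the && operator. As a minimal sketch, the code below collects every character that an intersection character class matches; the class name IntersectionSketch and its matchedChars() helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class IntersectionSketch
{
   // Concatenates every match of regex found in input.
   static String matchedChars(String regex, String input)
   {
      StringBuilder sb = new StringBuilder();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         sb.append(m.group());
      return sb.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(matchedChars("[a-z&&[d-f]]", "abcdefg")); // def
      System.out.println(matchedChars("[aeiouy&&[y]]", "party"));  // y
   }
}
```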