Java's character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we'll unpack the three powerful classes residing in the java.util.regex package, then we'll explore the Pattern class and its sophisticated pattern-matching constructs.
What are regular expressions?
A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.
Pattern matching is the process of searching text to identify matches, or strings that match a regex's pattern. Java supports pattern matching via its Regex API. The API consists of three classes (Pattern, Matcher, and PatternSyntaxException), all located in the java.util.regex package:
Pattern objects, also known as patterns, are compiled regexes.
Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
PatternSyntaxException objects describe illegal regex patterns.
Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex) returns true only if the entire invoking string matches the regular expression passed as regex.
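To make the whole-string behavior of matches() concrete, here is a minimal sketch that contrasts it with a search for a match anywhere in the text. The class name MatchesSketch and its two helper methods are hypothetical names chosen for illustration; only String.matches(), Pattern.compile(), Pattern.matcher(), and Matcher.find() come from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of the Regex API or of RegexDemo.
class MatchesSketch
{
   // True only if the whole input matches the regex (String.matches()).
   static boolean wholeMatch(String regex, String input)
   {
      return input.matches(regex);
   }

   // True if any part of the input matches the regex (Matcher.find()).
   static boolean partialMatch(String regex, String input)
   {
      return Pattern.compile(regex).matcher(input).find();
   }

   public static void main(String[] args)
   {
      System.out.println(wholeMatch("apple", "apple"));    // true
      System.out.println(wholeMatch("apple", "applet"));   // false
      System.out.println(partialMatch("apple", "applet")); // true
   }
}
```

The second call returns false because the trailing t is left over once apple has been matched.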
RegexDemo
I've created the RegexDemo application to demonstrate Java's regular expressions and the various methods located in the Pattern, Matcher, and PatternSyntaxException classes. Here's the source code for the demo:
Listing 1. Demonstrating regexes
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         // m.end() is one past the match's last character, so subtract 1.
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}
The first thing RegexDemo's main() method does is validate its command line. The program requires two arguments: the first argument is a regex, and the second argument is input text to be matched against the regex.
You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an n character. main() converts this two-character sequence to a single new-line character (Unicode value 10).
The bulk of RegexDemo's code is located in the try-catch construct. The try block first outputs the specified regex and input text and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.
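The catch block's behavior can be tried in isolation with a deliberately malformed regex. The following is a minimal sketch: the class name BadRegexSketch, the check() helper, and the sample regex [abc (a character class with no closing bracket) are my own illustrative choices. The exact description text and index reported may vary between Java versions, so I don't show them here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch class; it is not part of RegexDemo.
class BadRegexSketch
{
   // Returns null if the regex compiles, or a one-line report otherwise.
   static String check(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return "Description: " + pse.getDescription()
                + ", index: " + pse.getIndex()
                + ", pattern: " + pse.getPattern();
      }
   }

   public static void main(String[] args)
   {
      System.out.println(check("apple")); // null, because the regex is legal
      System.out.println(check("[abc"));  // a report on the unclosed class
   }
}
```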
You don't need to know more about the source code's workings at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:
javac RegexDemo.java
Pattern and its constructs
Pattern, the first of three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern's SDK documentation describes various regex constructs, but unless you're already an avid regex user, you might be confused by parts of the documentation. What are quantifiers and what's the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I'll answer these questions and more in the next sections.
Literal strings
The simplest regex construct is the literal string. Some portion of the input text must match this construct's pattern in order to have a successful pattern match. Consider the following example:
java RegexDemo apple applet
This example attempts to discover if there is a match for the apple pattern in the applet input text. The following output reveals the match:
regex = apple
input = applet
Found [apple] starting at 0 and ending at 4
The output shows us the regex and input text, then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.
Now suppose we specify the following command line:
java RegexDemo apple crabapple
This time, we get the following match with different starting and ending indexes:
regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8
The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.
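The three literal-string cases above can be reproduced without the command line. This is a minimal sketch under my own naming (LiteralSketch and firstMatch() are hypothetical); it reports the ending index the same way RegexDemo does, by subtracting 1 from the value returned by Matcher.end(), which is the index just past the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class LiteralSketch
{
   // Returns "start-end" (inclusive end) for the first match, or "no match".
   static String firstMatch(String regex, String input)
   {
      Matcher m = Pattern.compile(regex).matcher(input);
      if (!m.find())
         return "no match";
      // m.end() points one past the last matched character.
      return m.start() + "-" + (m.end() - 1);
   }

   public static void main(String[] args)
   {
      System.out.println(firstMatch("apple", "applet"));    // 0-4
      System.out.println(firstMatch("apple", "crabapple")); // 4-8
      System.out.println(firstMatch("applet", "apple"));    // no match
   }
}
```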
Metacharacters
More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any single character that appears between a and b (by default, any character except a line terminator). Consider the following example:
java RegexDemo .ox "The quick brown fox jumps over the lazy ox."
This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:
regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41
The output reveals two matches: fox and  ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.
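The same search can be sketched as a small helper that collects every match into a list. The class name DotSketch, the findAll() helper, and the second sample input are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class DotSketch
{
   // Collects the text of every match of regex found in input.
   static List<String> findAll(String regex, String input)
   {
      List<String> matches = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         matches.add(m.group());
      return matches;
   }

   public static void main(String[] args)
   {
      String text = "The quick brown fox jumps over the lazy ox.";
      // Brackets make the leading space in the second match visible.
      for (String match : findAll(".ox", text))
         System.out.println("[" + match + "]");
      // a.b needs exactly one character between a and b, so "ab" is skipped.
      for (String match : findAll("a.b", "axb a-b ab"))
         System.out.println("[" + match + "]");
   }
}
```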
What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:
java RegexDemo . "The quick brown fox jumps over the lazy ox."
Because the period metacharacter matches any character, RegexDemo outputs a match for each character (including the terminating period character) in the input text:
regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
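One caveat applies if the input contains a new-line character, which RegexDemo allows via \n: by default the period does not match line terminators. The sketch below counts matches of . with and without the Pattern.DOTALL flag, which makes the period match line terminators as well. DotNewlineSketch and countMatches() are hypothetical names, and using the flag here is my own illustration rather than something RegexDemo does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class DotNewlineSketch
{
   // Counts how many times the lone period regex matches in input.
   // Pass 0 for default behavior or Pattern.DOTALL to match line terminators.
   static int countMatches(String input, int flags)
   {
      Matcher m = Pattern.compile(".", flags).matcher(input);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String input = "ab\ncd"; // five characters, one of them a new-line
      System.out.println(countMatches(input, 0));              // 4
      System.out.println(countMatches(input, Pattern.DOTALL)); // 5
   }
}
```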
Character classes
We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We'll look at all of these below.
Simple character class
The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters a, b, and c.
Consider the following example:
java RegexDemo [csw] cave
This example matches only c with its counterpart in cave, as shown in the following output:
regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0
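Returning to the vowel search mentioned earlier, a simple character class is enough to locate each vowel. The sketch below records the starting index of every match; the class name SimpleClassSketch and the positions() helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class SimpleClassSketch
{
   // Returns the starting index of every match of regex in input.
   static List<Integer> positions(String regex, String input)
   {
      List<Integer> result = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         result.add(m.start());
      return result;
   }

   public static void main(String[] args)
   {
      System.out.println(positions("[csw]", "cave"));     // [0]
      System.out.println(positions("[aeiou]", "applet")); // [0, 4]
   }
}
```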
Negation character class
The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except a, b, and c.
Consider this example:
java RegexDemo "[^csw]" cave
Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.
This example matches a, v, and e with their counterparts in cave, as shown here:
regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
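A class and its negation split the input between them, which is easy to see by deleting whatever each one matches. This minimal sketch uses String.replaceAll() for that purpose; NegationSketch and strip() are hypothetical names. The last call shows that ^ negates only when it is the first character inside the brackets; elsewhere it is an ordinary member of the class.

```java
// Hypothetical sketch class; it is not part of RegexDemo.
class NegationSketch
{
   // Removes every character of input that the character class matches.
   static String strip(String charClass, String input)
   {
      return input.replaceAll(charClass, "");
   }

   public static void main(String[] args)
   {
      System.out.println(strip("[^csw]", "cave")); // c
      System.out.println(strip("[csw]", "cave"));  // ave
      System.out.println(strip("[c^]", "c^a"));    // a
   }
}
```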
Range character class
The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase alphabetic characters. It's equivalent to specifying [abcdefghijklmnopqrstuvwxyz].
Consider the following example:
java RegexDemo [a-c] clown
This example matches only c with its counterpart in clown, as shown:
regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0
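The claimed equivalence between a range and the spelled-out list can be checked mechanically. The following sketch tests two character classes against every ASCII character and reports whether they always agree; RangeSketch and sameOnAscii() are hypothetical names, and limiting the check to ASCII is a simplifying assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class RangeSketch
{
   // True if both regexes accept exactly the same single ASCII characters.
   static boolean sameOnAscii(String regexA, String regexB)
   {
      Pattern a = Pattern.compile(regexA);
      Pattern b = Pattern.compile(regexB);
      for (char c = 0; c < 128; c++)
      {
         String s = String.valueOf(c);
         if (a.matcher(s).matches() != b.matcher(s).matches())
            return false;
      }
      return true;
   }

   public static void main(String[] args)
   {
      System.out.println(sameOnAscii("[a-z]", "[abcdefghijklmnopqrstuvwxyz]")); // true
      System.out.println(sameOnAscii("[a-c]", "[abc]"));                        // true
      System.out.println(sameOnAscii("[a-c]", "[abd]"));                        // false
   }
}
```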
Union character class
The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.
Consider the following example:
java RegexDemo [ab[c-e]] abcdef
This example matches a, b, c, d, and e with their counterparts in abcdef:
regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
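To see which characters a union admits, one option is to run it over the whole lowercase alphabet and keep only the matches. In this sketch, UnionSketch and keep() are hypothetical names. The second call uses [a-dm-p], two ranges written side by side in one class, which yields the same set as the nested form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class UnionSketch
{
   // Concatenates every match of regex found in input, in order.
   static String keep(String regex, String input)
   {
      StringBuilder sb = new StringBuilder();
      Matcher m = Pattern.compile(regex).matcher(input);
      while (m.find())
         sb.append(m.group());
      return sb.toString();
   }

   public static void main(String[] args)
   {
      String alphabet = "abcdefghijklmnopqrstuvwxyz";
      System.out.println(keep("[a-d[m-p]]", alphabet)); // abcdmnop
      System.out.println(keep("[a-dm-p]", alphabet));   // abcdmnop
      System.out.println(keep("[ab[c-e]]", "abcdef"));  // abcde
   }
}
```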
Intersection character class
The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]] matches characters d, e, and f.
Consider the following example:
java RegexDemo "[aeiouy&&[y]]" party
Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.
This example matches only y with its counterpart in party:
regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
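The "common characters" rule can be verified directly: a character should match the intersection exactly when it matches both operand classes. The sketch below checks this for every lowercase letter; IntersectionSketch, contains(), and agrees() are hypothetical names, and restricting the check to a through z is a simplifying assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch class; it is not part of RegexDemo.
class IntersectionSketch
{
   // True if the single character c matches the character class.
   static boolean contains(String charClass, char c)
   {
      return Pattern.matches(charClass, String.valueOf(c));
   }

   // True if, for each lowercase letter, membership in the intersection
   // equals membership in both the left and the right class.
   static boolean agrees(String both, String left, String right)
   {
      for (char c = 'a'; c <= 'z'; c++)
         if (contains(both, c) != (contains(left, c) && contains(right, c)))
            return false;
      return true;
   }

   public static void main(String[] args)
   {
      System.out.println(agrees("[a-z&&[d-f]]", "[a-z]", "[d-f]"));    // true
      System.out.println(agrees("[aeiouy&&[y]]", "[aeiouy]", "[y]"));  // true
      System.out.println(contains("[aeiouy&&[y]]", 'a'));              // false
   }
}
```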