Java 101: Regular expressions in Java, Part 2

Simplify common coding tasks with the Regex API


The first half of this tutorial introduced you to regular expressions and the Regex API. You learned about the Pattern class, then worked through examples demonstrating regex constructs, from basic pattern matching with literal strings to more complex matches using ranges, boundary matchers, and quantifiers.

In Part 2 we'll pick up where we left off, exploring methods associated with the Pattern, Matcher, and PatternSyntaxException classes. You'll also be introduced to two tools that use regular expressions to simplify common coding tasks. The first extracts comments from code for documentation purposes. The second is a reusable library for performing lexical analysis, which is an essential component of assemblers, compilers, and similar software.

Explore the Regex API

Pattern, Matcher, and PatternSyntaxException are the three classes that comprise the Regex API. Each class offers methods that you can use to integrate regexes into your code.

Pattern methods

An instance of the Pattern class describes a compiled regex, also known as a pattern. Regexes are compiled to increase performance during pattern-matching operations. The following static methods support compilation.

  • Pattern compile(String regex) compiles regex's contents into an intermediate representation stored in a new Pattern object. This method either returns the object's reference upon success, or throws PatternSyntaxException if it detects invalid syntax in the regex. Any Matcher object used by or returned from this Pattern object adheres to various default settings, such as case-sensitive searching. As an example, Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores a compiled representation of the regex for matching all lines starting with a period character.
  • Pattern compile(String regex, int flags) accomplishes the same task as Pattern compile(String regex), but is able to account for flags: a bitwise-inclusive ORed set of flag constant bit values. Pattern declares CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES constants that can be bitwise ORed together (e.g., CASE_INSENSITIVE | DOTALL) and passed to flags.

    Except for CANON_EQ, LITERAL, and UNICODE_CHARACTER_CLASS, these constants are an alternative to embedded flag expressions, which were demonstrated in Part 1. The Pattern compile(String regex, int flags) method throws java.lang.IllegalArgumentException when it detects a flag constant other than those defined by Pattern constants. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example, where the Pattern.MULTILINE constant and the (?m) embedded flag expression accomplish the same task.
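The two compile() overloads can be compared side by side. The following is a minimal sketch rather than code from the tutorial's own source: the CompileDemo class, its helper methods, and the sample text are all hypothetical names chosen for illustration. It counts lines that start with a period using both the embedded flag expression and the flag constant, and shows that invalid syntax surfaces as PatternSyntaxException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the Regex API.
class CompileDemo
{
   // Counts how many matches the given pattern finds in text.
   static int countMatches(Pattern p, String text)
   {
      Matcher m = p.matcher(text);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   // Returns true when regex compiles, false when its syntax is invalid.
   static boolean isValid(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return true;
      }
      catch (PatternSyntaxException pse)
      {
         return false;
      }
   }

   public static void main(String[] args)
   {
      String text = ".first\nsecond\n.third";
      Pattern embedded = Pattern.compile("(?m)^\\.");
      Pattern flagged = Pattern.compile("^\\.", Pattern.MULTILINE);
      System.out.println(countMatches(embedded, text)); // 2
      System.out.println(countMatches(flagged, text));  // 2
      System.out.println(isValid("(unclosed"));         // false
   }
}
```

Both patterns report the same count, which is what you would expect if the flag constant and the embedded flag expression are interchangeable here.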

At times you will need to obtain a copy of an original regex string that has been compiled into a Pattern object, along with the flags it is using. You can do this by calling the following methods:

  • String pattern() returns the original regex string that was compiled into the Pattern object.
  • int flags() returns the Pattern object's flags.
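As a brief illustration of these two accessors, the sketch below recovers the regex string and tests individual flag bits. The PatternInfoDemo class and its hasFlag() helper are hypothetical names introduced only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class PatternInfoDemo
{
   // Reports whether the given flag constant is set in the pattern's flags.
   static boolean hasFlag(Pattern p, int flag)
   {
      return (p.flags() & flag) != 0;
   }

   public static void main(String[] args)
   {
      Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
      System.out.println(p.pattern());                   // ^\.
      System.out.println(hasFlag(p, Pattern.MULTILINE)); // true
      System.out.println(hasFlag(p, Pattern.DOTALL));    // false
   }
}
```

Because flags() returns a bit set, a bitwise AND against a single constant is one straightforward way to test for that flag.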

After obtaining a Pattern object, you'll typically use it to obtain a Matcher object, so that you can perform pattern-matching operations. The Matcher matcher(CharSequence input) method creates a Matcher object that matches the provided input text against the Pattern object's compiled regex. When called, it returns a reference to this Matcher object. For example, Matcher m = p.matcher(args[1]); returns a Matcher for the Pattern object referenced by variable p.

Splitting text

Most developers have written code to break input text into its component parts, such as converting a text-based employee record into a set of fields. Pattern offers a quicker way to handle this tedium, via a pair of text-splitting methods:

  • String[] split(CharSequence text, int limit) splits text around matches of the Pattern object's pattern and returns the results in an array. Each entry specifies a text sequence that's separated from the next text sequence by a pattern match (or the text's end). All array entries are stored in the same order as they appear in the text.

    In this method, the number of array entries depends on limit, which also controls the number of matches that occur:
    • A positive value means that at most limit - 1 matches are considered and the array's length is no greater than limit.
    • A negative value means all possible matches are considered, and the array can be of any length.
    • A zero means all possible matches are considered, the array can have any length, and trailing empty strings are discarded.
  • String[] split(CharSequence text) invokes the previous method with zero as the limit and returns the method call's result.
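The effect of limit is easiest to see with text that ends in consecutive delimiters. The sketch below is an illustration under assumed inputs: the SplitLimitDemo class, its split() helper, and the sample string are hypothetical and chosen only to exercise the three cases described above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class SplitLimitDemo
{
   // Splits text around commas, passing limit through to Pattern.split().
   static String[] split(String text, int limit)
   {
      return Pattern.compile(",").split(text, limit);
   }

   public static void main(String[] args)
   {
      String text = "a,b,c,,";
      // Positive limit: at most one match is applied, so two entries result.
      System.out.println(Arrays.toString(split(text, 2)));  // [a, b,c,,]
      // Negative limit: every match is applied and trailing empty strings are kept.
      System.out.println(Arrays.toString(split(text, -1))); // [a, b, c, , ]
      // Zero limit: every match is applied and trailing empty strings are discarded.
      System.out.println(Arrays.toString(split(text, 0)));  // [a, b, c]
   }
}
```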

Here's how split(CharSequence text) handles the task of splitting an employee record into its field components of name, age, street address, and salary:

Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);

The above code specifies a regex that matches a comma character immediately followed by a single whitespace character. Here's the output:

John Doe
47
Hillsboro Road
32000

Pattern predicates and the Streams API

Java 8 introduced the Predicate<String> asPredicate() method to Pattern. This method creates a predicate (Boolean-valued function) that's used for pattern matching. The code below demonstrates asPredicate():

List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol",
                                       "java", "javascript", "perl", "python");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);

This code creates a list of programming language names, then compiles a pattern for matching all of the names that start with the lowercase letter c. The last line above obtains a sequential stream with the list as its source. It installs a filter that uses asPredicate()'s Boolean function, which returns true when a name begins with c, and iterates over the stream, outputting matched names to the standard output.

That last line is equivalent to the following traditional loop, which you might remember from the RegexDemo application in Part 1:

for (String progLang: progLangs)
   if (p.matcher(progLang).find())
      System.out.println(progLang);
Matcher methods

An instance of the Matcher class describes an engine that performs match operations on a character sequence by interpreting a Pattern's compiled regex. Matcher objects support different kinds of pattern-matching operations:

  • boolean find() scans input text for the next match. This method starts its scan either at the beginning of the given text, or at the first character following the previous match. The latter option is only possible when the previous method invocation has returned true and the matcher hasn't been reset. In either case, Boolean true is returned when a match is found. You will find an example of this method in the RegexDemo from Part 1.
  • boolean find(int start) resets the matcher and scans text for the next match. The scan begins at the index specified by start. Boolean true is returned when a match is found. For example, m.find(1); scans text beginning at index 1. (Index 0 is ignored.) If start contains a negative value or a value exceeding the length of the matcher's text, this method throws java.lang.IndexOutOfBoundsException.
  • boolean matches() attempts to match the entire text against the pattern. This method returns true when the entire text matches. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the ! symbol isn't a word character.
  • boolean lookingAt() attempts to match the given text against the pattern, starting at the beginning of the text. This method returns true when a prefix of the text matches. Unlike matches(), the entire text doesn't need to be matched. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true because the beginning of the abc! text consists of word characters only.
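The differences between these operations can be summarized in a small experiment. The sketch below is illustrative only: the MatchKindsDemo class, its helper methods, and the sample strings are hypothetical. It also relies on Matcher's group() method to retrieve the matched text.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class MatchKindsDemo
{
   // Returns the results of matches(), lookingAt(), and find(), in that order.
   static boolean[] matchKinds(String regex, String text)
   {
      Matcher m = Pattern.compile(regex).matcher(text);
      boolean whole = m.matches();    // the entire text must match
      boolean prefix = m.lookingAt(); // the match must start at index 0
      m.reset();                      // make sure find() scans from the start
      boolean anywhere = m.find();    // the match may start anywhere
      return new boolean[] { whole, prefix, anywhere };
   }

   // Returns the first match found at or after start, or null if there is none.
   static String findFrom(String regex, String text, int start)
   {
      Matcher m = Pattern.compile(regex).matcher(text);
      return m.find(start) ? m.group() : null;
   }

   public static void main(String[] args)
   {
      String text = "42 apples, 7 pears";
      System.out.println(Arrays.toString(matchKinds("\\d+", text)));        // [false, true, true]
      System.out.println(Arrays.toString(matchKinds("\\d+", "apples 42"))); // [false, false, true]
      System.out.println(findFrom("\\d+", text, 0)); // 42
      System.out.println(findFrom("\\d+", text, 1)); // 2
      System.out.println(findFrom("\\d+", text, 3)); // 7
   }
}
```

Note how find(1) skips index 0 and therefore reports 2 rather than 42.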

Unlike Pattern objects, Matcher objects record state information. Occasionally, you might want to reset a matcher to clear that information after performing a pattern match. The following methods reset a matcher:

  • Matcher reset() resets a matcher's state, including the matcher's append position (which is cleared to zero). The next pattern-match operation begins at the start of the matcher's text. A reference to the current Matcher object is returned. For example, m.reset(); resets the matcher referenced by m.
  • Matcher reset(CharSequence text) resets a matcher's state and sets the matcher's text to text. The next pattern-match operation begins at the start of the matcher's new text. A reference to the current Matcher object is returned. For example, m.reset("new text"); resets the m-referenced matcher and also specifies new text as the matcher's new text.
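To see the recorded state and the effect of resetting it, consider the following sketch. The ResetDemo class and its nextStart() helper are hypothetical, and the example uses Matcher's start() method to report where each match begins.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class ResetDemo
{
   // Returns the start index of the next match, or -1 if there is none.
   static int nextStart(Matcher m)
   {
      return m.find() ? m.start() : -1;
   }

   public static void main(String[] args)
   {
      Matcher m = Pattern.compile("cat").matcher("cat cat cat");
      System.out.println(nextStart(m));                // 0
      System.out.println(nextStart(m));                // 4: continues after the previous match
      System.out.println(nextStart(m.reset()));        // 0: scanning restarts
      System.out.println(nextStart(m.reset("a cat"))); // 2: scanning restarts on the new text
   }
}
```

Because both reset methods return the current Matcher, the calls can be chained as shown.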

Appending text

A matcher's append position identifies the start of the matcher's text that's appended to a java.lang.StringBuffer object. The following methods use the append position:

  • Matcher appendReplacement(StringBuffer sb, String replacement) reads the matcher's text characters and appends them to the sb-referenced StringBuffer object. This method stops reading after the last character preceding the previous pattern match. Next, the method appends the characters in the replacement-referenced String object to the StringBuffer object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters ($) and capturing group numbers.) Finally, the method sets the matcher's append position to the index of the last matched character plus one, then returns a reference to the current matcher.

    The Matcher appendReplacement(StringBuffer sb, String replacement) method throws java.lang.IllegalStateException when the matcher hasn't yet made a match, or when the previous match attempt has failed. It throws IndexOutOfBoundsException when replacement specifies a capturing group that doesn't exist in the pattern.
  • StringBuffer appendTail(StringBuffer sb) appends all of the matcher's remaining text, starting at the append position, to the StringBuffer object and returns that object's reference. Following a final call to the appendReplacement(StringBuffer sb, String replacement) method, call appendTail(StringBuffer sb) to copy the remaining text to the StringBuffer object.

The following code calls appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb) to replace all occurrences of cat with caterpillar in the provided text:

Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);

Placing a capturing group and a reference to the capturing group in the replacement text instructs the program to insert erpillar after each cat match. The above code results in the following output:

one caterpillar, two caterpillars, or three caterpillars on a fence
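The append methods are also handy when the replacement text has to be computed from each match. The sketch below illustrates that idea under stated assumptions: the AppendDemo class and its doubleNumbers() helper are hypothetical, and the code assumes that every number in the text is small enough to fit in an int.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class AppendDemo
{
   // Doubles every integer found in text and leaves all other characters alone.
   // Assumes each number fits in an int; larger values would make parseInt() throw.
   static String doubleNumbers(String text)
   {
      Matcher m = Pattern.compile("\\d+").matcher(text);
      StringBuffer sb = new StringBuffer();
      while (m.find())
      {
         int value = Integer.parseInt(m.group());
         // Copies the text that precedes this match, then the computed replacement.
         m.appendReplacement(sb, Integer.toString(value * 2));
      }
      m.appendTail(sb); // copies whatever follows the last match
      return sb.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(doubleNumbers("3 cats and 12 dogs")); // 6 cats and 24 dogs
   }
}
```

The replacement here contains only digits. If a computed replacement could contain dollar-sign or backslash characters, passing it through Matcher.quoteReplacement() first would keep them from being interpreted as group references or escapes.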

Replacing text

Matcher provides a pair of text-replacement methods that complement appendReplacement(StringBuffer sb, String replacement). These methods let you replace either the first match or all matches:

  • String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all of the matcher's text characters (up to the first match) to the string, appends the replacement characters to the string, copies remaining characters to the string, and returns the String object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters and capturing-group numbers.)
  • String replaceAll(String replacement) operates similarly to replaceFirst(String replacement), but replaces all matches with replacement's characters.
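The following sketch contrasts the two methods and shows capturing-group references in the replacement string. The ReplaceDemo class, its helper methods, and the sample text are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Regex API.
class ReplaceDemo
{
   // Matches name@place and captures each side in its own group.
   private static final Pattern PAIR = Pattern.compile("(\\w+)@(\\w+)");

   // Swaps the two sides of the first name@place pair only.
   static String swapFirst(String text)
   {
      return PAIR.matcher(text).replaceFirst("$2:$1");
   }

   // Swaps the two sides of every name@place pair.
   static String swapAll(String text)
   {
      return PAIR.matcher(text).replaceAll("$2:$1");
   }

   public static void main(String[] args)
   {
      String text = "ann@home, bob@work";
      System.out.println(swapFirst(text)); // home:ann, bob@work
      System.out.println(swapAll(text));   // home:ann, work:bob
   }
}
```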

The \s+ regex detects one or more occurrences of whitespace characters in the input text. Below, we use this regex and call the replaceAll(String replacement) method to remove duplicate whitespace:

Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Remove     the \t\t duplicate whitespace.   ");
System.out.println(m.replaceAll(" "));

Here is the output:

Remove the duplicate whitespace.
