Date: 2017-05-16
The first half of this tutorial introduced you to regular expressions and the Regex API. You learned about the
Pattern class, then worked through examples demonstrating regex constructs, from basic pattern matching with literal strings to more complex matches using ranges, boundary matchers, and quantifiers.
In Part 2 we'll pick up where we left off, exploring methods associated with the
Pattern,
Matcher, and
PatternSyntaxException classes. You'll also be introduced to two tools that use regular expressions to simplify common coding tasks. The first extracts comments from code for documentation purposes. The second is a reusable library for performing lexical analysis, which is an essential component of assemblers, compilers, and similar software.
Explore the Regex API
Pattern, Matcher, and PatternSyntaxException are the three classes that make up the Regex API. Each class offers methods that you can use to integrate regexes into your code.
Pattern methods
An instance of the
Pattern class describes a compiled regex, also known as a pattern. Regexes are compiled to increase performance during pattern-matching operations. The following
static methods support compilation.
Pattern compile(String regex) compiles regex's contents into an intermediate representation stored in a new Pattern object. This method either returns the object's reference upon success, or throws PatternSyntaxException if it detects invalid syntax in the regex. Any Matcher object used by or returned from this Pattern object adheres to various default settings, such as case-sensitive searching. As an example, Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores a compiled representation of the regex for matching all lines starting with a period character.
Pattern compile(String regex, int flags) accomplishes the same task as Pattern compile(String regex), but also accounts for flags: a bitwise-inclusive OR of flag constant bit values. Pattern declares CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES constants that can be bitwise ORed together (e.g., CASE_INSENSITIVE | DOTALL) and passed to flags.
Except for CANON_EQ and LITERAL, these constants are an alternative to embedded flag expressions such as those demonstrated in Part 1 (UNICODE_CHARACTER_CLASS corresponds to (?U)). The Pattern compile(String regex, int flags) method throws java.lang.IllegalArgumentException when flags contains bit values other than those defined by Pattern's flag constants. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example, where the Pattern.MULTILINE constant and the (?m) embedded flag expression accomplish the same task.
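The equivalence between a flag constant and its embedded flag expression can be checked with a short program. The following is a minimal sketch rather than code from the original tutorial: the class name CompileFlagsDemo, the helper methods, and the sample text are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CompileFlagsDemo {
    // Counts how many matches the given pattern finds in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns true when the regex compiles, false when its syntax is invalid.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = ".first\nsecond\n.third"; // hypothetical sample input

        // Embedded flag expression and flag constant select the same behavior.
        Pattern embedded = Pattern.compile("(?m)^\\.");
        Pattern constant = Pattern.compile("^\\.", Pattern.MULTILINE);
        System.out.println(countMatches(embedded, text)); // 2
        System.out.println(countMatches(constant, text)); // 2

        // Flag constants can be combined with bitwise OR.
        Pattern combined = Pattern.compile("^\\.THIRD",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        System.out.println(countMatches(combined, text)); // 1

        // An unclosed group is rejected with PatternSyntaxException.
        System.out.println(isValid("(unclosed")); // false
    }
}
```

Without MULTILINE, the ^ boundary matcher would match only at the very start of the text, so the same sample would yield a single match.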
At times you will need to obtain a copy of an original regex string that has been compiled into a
Pattern object, along with the flags it is using. You can do this by calling the following methods:
String pattern() returns the original regex string that was compiled into the Pattern object.
int flags() returns the Pattern object's flags.
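As a rough illustration of these two accessors, the sketch below compiles a pattern with two flags and then reads the regex string and flag bits back. The class name PatternInfoDemo and the hasFlag helper are hypothetical names introduced for this example only.

```java
import java.util.regex.Pattern;

public class PatternInfoDemo {
    // Reports whether a specific flag bit is set on the pattern.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE | Pattern.DOTALL);

        // pattern() returns the regex exactly as it was passed to compile().
        System.out.println(p.pattern()); // ^\.

        // flags() returns the OR-ed bit values, so test individual bits with &.
        System.out.println(hasFlag(p, Pattern.MULTILINE));        // true
        System.out.println(hasFlag(p, Pattern.CASE_INSENSITIVE)); // false
    }
}
```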
After obtaining a
Pattern object, you'll typically use it to obtain a
Matcher object, so that you can perform pattern-matching operations. The
Matcher matcher(CharSequence input) method creates a
Matcher object that matches provided
input text against a given
Pattern object's compiled regex. When called, it returns a reference to this
Matcher object. For example,
Matcher m = p.matcher(args[1]); returns a
Matcher for the
Pattern object referenced by variable
p.
Splitting text
Most developers have written code to break input text into its component parts, such as converting a text-based employee record into a set of fields.
Pattern offers a quicker way to handle this tedium, via a pair of text-splitting methods:
String[] split(CharSequence text, int limit) splits text around matches of the Pattern object's pattern and returns the results in an array. Each entry specifies a text sequence that's separated from the next text sequence by a pattern match (or the text's end). All array entries are stored in the same order as they appear in the text.
In this method, the number of array entries depends on limit, which also controls the number of matches that occur:
- A positive value means that at most limit - 1 matches are considered and the array's length is no greater than limit entries.
- A negative value means all possible matches are considered, and the array can be of any length.
- A zero means all possible matches are considered, the array can have any length, and trailing empty strings are discarded.
String[] split(CharSequence text) invokes the previous method with zero as the limit and returns the method call's result.
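The effect of the three kinds of limit value is easiest to see side by side. This is a small sketch under assumed inputs: the class name SplitLimitDemo and the sample text (chosen to end with empty fields) are not part of the original tutorial.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimitDemo {
    // Splits the text around commas using the given limit.
    static String[] split(String text, int limit) {
        return Pattern.compile(",").split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,c,,"; // hypothetical input with trailing empty fields

        // Positive limit: at most limit - 1 matches, so at most 2 entries.
        System.out.println(Arrays.toString(split(text, 2)));  // [a, b,c,,]

        // Negative limit: every match is used and trailing empty strings are kept.
        System.out.println(Arrays.toString(split(text, -1))); // [a, b, c, , ]

        // Zero limit: every match is used and trailing empty strings are discarded.
        System.out.println(Arrays.toString(split(text, 0)));  // [a, b, c]
    }
}
```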
Here's how
split(CharSequence text) handles the task of splitting an employee record into its field components of name, age, street address, and salary:
Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);
The above code specifies a regex that matches a comma character immediately followed by a single whitespace character. Here's the output:
John Doe
47
Hillsboro Road
32000
Pattern predicates and the Streams API
Java 8 introduced the
Predicate<String> asPredicate() method to
Pattern. This method creates a predicate (Boolean-valued function) that's used for pattern matching. The code below demonstrates
asPredicate():
List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol",
                                       "java", "javascript", "perl", "python",
                                       "scala");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);
This code creates a list of programming language names, then compiles a pattern for matching all of the names that start with the lowercase letter
c. The last line above obtains a sequential stream with the list as its source. It installs a filter that uses
asPredicate()'s Boolean function, which returns true when a name begins with
c, and iterates over the stream, outputting matched names to the standard output.
That last line is equivalent to the following traditional loop, which you might remember from the
RegexDemo application in Part 1:
for (String progLang: progLangs)
   if (p.matcher(progLang).find())
      System.out.println(progLang);
Matcher methods
An instance of the
Matcher class describes an engine that performs match operations on a character sequence by interpreting a
Pattern's compiled regex.
Matcher objects support different kinds of pattern-matching operations:
boolean find() scans input text for the next match. This method starts its scan either at the beginning of the given text, or at the first character following the previous match. The latter option is only possible when the previous method invocation has returned true and the matcher hasn't been reset. In either case, Boolean true is returned when a match is found. You will find an example of this method in the RegexDemo application from Part 1.
boolean find(int start) resets the matcher and scans text for the next match. The scan begins at the index specified by start. Boolean true is returned when a match is found. For example, m.find(1); scans text beginning at index 1. (Index 0 is ignored.) If start contains a negative value or a value exceeding the length of the matcher's text, this method throws java.lang.IndexOutOfBoundsException.
boolean matches() attempts to match the entire text against the pattern. This method returns true when the entire text matches. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the ! symbol isn't a word character.
boolean lookingAt() attempts to match the given text against the pattern, starting at the beginning of the text. This method returns true when a prefix of the text matches. Unlike matches(), the entire text doesn't need to be matched. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true because the beginning of the abc! text consists of word characters only.
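To contrast the match operations on a single input, here is a brief sketch. The class name MatchKindsDemo, the firstMatchFrom helper, and the sample text are assumptions for illustration. The sketch also uses Matcher's group() method to retrieve the matched text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchKindsDemo {
    // Returns the first match found at or after index start, or null if there is none.
    static String firstMatchFrom(Pattern p, String text, int start) {
        Matcher m = p.matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d+");
        Matcher m = p.matcher("42 apples, 7 pears"); // hypothetical input

        System.out.println(m.matches());   // false: the whole text is not digits
        System.out.println(m.lookingAt()); // true: the text starts with digits
        System.out.println(m.find(2));     // true: scanning from index 2 finds "7"
        System.out.println(m.group());     // 7
        System.out.println(m.find());      // false: no further matches
    }
}
```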
Unlike
Pattern objects,
Matcher objects record state information. Occasionally, you might want to reset a matcher to clear that information after performing a pattern match. The following methods reset a matcher:
Matcher reset() resets a matcher's state, including the matcher's append position (which is cleared to zero). The next pattern-match operation begins at the start of the matcher's text. A reference to the current Matcher object is returned. For example, m.reset(); resets the matcher referenced by m.
Matcher reset(CharSequence text) resets a matcher's state and sets the matcher's text to text. The next pattern-match operation begins at the start of the matcher's new text. A reference to the current Matcher object is returned. For example, m.reset("new text"); resets the m-referenced matcher and also specifies new text as the matcher's new text.
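The following sketch shows how resetting affects where the next scan begins. It is illustrative only: the class name ResetDemo, the countRemaining helper, and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResetDemo {
    // Counts matches from the matcher's current position to the end of its text.
    static int countRemaining(Matcher m) {
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("cat").matcher("cat cat cat");

        m.find(); // consume the first match
        System.out.println(countRemaining(m)); // 2: scanning resumed after the first match

        m.reset(); // back to the start of the same text
        System.out.println(countRemaining(m)); // 3

        m.reset("one cat"); // same pattern, new text
        System.out.println(countRemaining(m)); // 1
    }
}
```

Reusing one Matcher through reset(CharSequence text) avoids creating a new matcher for every input when the same pattern is applied to many strings.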
Appending text
A matcher's append position identifies the start of the matcher's text that's appended to a
java.lang.StringBuffer object. The following methods use the append position:
Matcher appendReplacement(StringBuffer sb, String replacement) reads the matcher's text characters and appends them to the sb-referenced StringBuffer object. This method stops reading after the last character preceding the previous pattern match. Next, the method appends the characters in the replacement-referenced String object to the StringBuffer object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters ($) and capturing group numbers.) Finally, the method sets the matcher's append position to the index of the last matched character plus one, then returns a reference to the current matcher.
The Matcher appendReplacement(StringBuffer sb, String replacement) method throws java.lang.IllegalStateException when the matcher hasn't yet made a match, or when the previous match attempt has failed. It throws IndexOutOfBoundsException when replacement specifies a capturing group that doesn't exist in the pattern.
StringBuffer appendTail(StringBuffer sb) appends all remaining text, starting at the append position, to the StringBuffer object and returns that object's reference. Following a final call to the appendReplacement(StringBuffer sb, String replacement) method, call appendTail(StringBuffer sb) to copy remaining text to the StringBuffer object.
The following code calls
appendReplacement(StringBuffer sb, String replacement) and
appendTail(StringBuffer sb) to replace all occurrences of
cat with
caterpillar in the provided text:
Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);
Placing a capturing group and a reference to the capturing group in the replacement text instructs the program to insert
erpillar after each
cat match. The above code results in the following output:
one caterpillar, two caterpillars, or three caterpillars on a fence
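Because appendReplacement() is called once per match, the replacement text can also be computed from each match, which a fixed replacement string cannot do. The sketch below illustrates this idea and the IllegalStateException case described earlier. The class name AppendDemo, both helper methods, and the sample text are assumptions introduced for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendDemo {
    // Doubles every integer found in the text, leaving other characters untouched.
    static String doubleNumbers(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb); // copies whatever follows the last match
        return sb.toString();
    }

    // Returns true if appendReplacement() fails when no match has been made yet.
    static boolean throwsBeforeMatch() {
        Matcher m = Pattern.compile("cat").matcher("cat");
        try {
            m.appendReplacement(new StringBuffer(), "dog");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 cats and 12 dogs")); // 6 cats and 24 dogs
        System.out.println(throwsBeforeMatch());                 // true
    }
}
```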
Replacing text
Matcher provides a pair of text-replacement methods that complement
appendReplacement(StringBuffer sb, String replacement). These methods let you replace either the first match or all matches:
String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all of the matcher's text characters (up to the first match) to the string, appends the replacement characters to the string, copies remaining characters to the string, and returns the String object. (The replacement string may contain references to text sequences captured during the match, via dollar-sign characters and capturing-group numbers.)
String replaceAll(String replacement) operates similarly to replaceFirst(String replacement), but replaces all matches with replacement's characters.
The
\s+ regex detects one or more occurrences of whitespace characters in the input text. Below, we use this regex and call the
replaceAll(String replacement) method to remove duplicate whitespace:
Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Remove the \t\t duplicate whitespace. ");
System.out.println(m.replaceAll(" "));
Here is the output:
Remove the duplicate whitespace.
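To compare the two replacement methods and their capturing-group references, consider this minimal sketch. The class name ReplaceDemo, the helper methods, and the sample text are hypothetical.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Hypothetical pattern: two words separated by an at sign, each captured in a group.
    private static final Pattern PAIR = Pattern.compile("(\\w+)@(\\w+)");

    // Swaps the two captured words in the first match only.
    static String swapFirst(String text) {
        return PAIR.matcher(text).replaceFirst("$2@$1");
    }

    // Swaps the two captured words in every match.
    static String swapAll(String text) {
        return PAIR.matcher(text).replaceAll("$2@$1");
    }

    public static void main(String[] args) {
        String text = "a@b c@d"; // hypothetical sample input
        System.out.println(swapFirst(text)); // b@a c@d
        System.out.println(swapAll(text));   // b@a d@c
    }
}
```

In the replacement string, $1 and $2 refer to the first and second capturing groups of whichever match is currently being replaced.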