Regular expressions simplify pattern-matching code

Discover the elegance of regular expressions in text-processing scenarios that involve pattern matching


Regex = ^The\w*
Text = Therefore
Found Therefore
  starting at index 0 and ending at index 9


Change the command line to java RegexDemo ^The\w* " Therefore". What happens? No match is found because a space character precedes Therefore.

Embedded flag expressions

Matchers assume certain defaults, such as case-sensitive pattern matching. A program may override any default by using an embedded flag expression, that is, a regex construct specified as parentheses metacharacters surrounding a question mark metacharacter (?) followed by a specific lowercase letter. Pattern recognizes the following embedded flag expressions:

  • (?i): enables case-insensitive pattern matching. Example: java RegexDemo (?i)tree Treehouse matches tree with Tree. Case-sensitive pattern matching is the default.
  • (?x): permits whitespace and comments beginning with the # metacharacter to appear in a pattern. A matcher ignores both. Example: java RegexDemo ".at(?x)#match hat, cat, and so on" matter matches .at with mat. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
  • (?s): enables dotall mode. In that mode, the period metacharacter matches line terminators in addition to any other character. Example: java RegexDemo (?s). \n matches . with \n. Nondotall mode is the default: the period does not match line-terminator characters.
  • (?m): enables multiline mode. In multiline mode, ^ and $ match just after or just before (respectively) a line terminator or the text's end. Example: java RegexDemo (?m)^.ake make\rlake\n\rtake matches .ake with make, lake, and take. Non-multiline mode is the default: ^ and $ match only at the beginning and end of the entire text.
  • (?u): enables Unicode-aware case folding. This flag works with (?i) to perform case-insensitive matching in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are matched.
  • (?d): enables Unix lines mode. In that mode, a matcher recognizes only the \n line terminator in the context of the ., ^, and $ metacharacters. Non-Unix lines mode is the default: a matcher recognizes all terminators in the context of the aforementioned metacharacters.
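The following sketch runs several of the list's examples from Java code instead of the RegexDemo command line. The class name EmbeddedFlagsDemo and the findAll() helper are hypothetical names introduced only for illustration; the regexes and texts come from the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmbeddedFlagsDemo
{
   // Hypothetical helper: collects every match of regex found in text.
   public static List<String> findAll (String regex, String text)
   {
      List<String> matches = new ArrayList<> ();
      Matcher m = Pattern.compile (regex).matcher (text);
      while (m.find ())
         matches.add (m.group ());
      return matches;
   }

   public static void main (String [] args)
   {
      // (?i): tree matches Tree regardless of case.
      System.out.println (findAll ("(?i)tree", "Treehouse"));

      // (?x): the comment that follows the flag is ignored, so only .at is matched.
      System.out.println (findAll (".at(?x)#match hat, cat, and so on", "matter"));

      // (?s): the period also matches the \n line terminator (one match).
      System.out.println (findAll ("(?s).", "\n").size ());

      // Without (?s), the period does not match \n (no matches).
      System.out.println (findAll (".", "\n").size ());

      // (?m): ^ matches at the start of each line, not only at the text's start.
      System.out.println (findAll ("(?m)^.ake", "make\rlake\n\rtake"));
   }
}
```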


Embedded flag expressions resemble capturing groups because both regex constructs surround their characters with parentheses metacharacters. Unlike a capturing group, an embedded flag expression does not capture a match's characters. Thus, an embedded flag expression is an example of a noncapturing group, that is, a regex construct that does not capture text characters; it's specified as a character sequence surrounded by parentheses metacharacters. Several kinds of noncapturing groups appear in Pattern's SDK documentation.
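One way to confirm that an embedded flag expression captures nothing is to ask a matcher how many capturing groups its pattern contains. The sketch below does that with Matcher's groupCount() method; the class name NoncapturingDemo and the groups() helper are hypothetical.

```java
import java.util.regex.Pattern;

public class NoncapturingDemo
{
   // Hypothetical helper: returns the number of capturing groups in regex.
   public static int groups (String regex)
   {
      return Pattern.compile (regex).matcher ("").groupCount ();
   }

   public static void main (String [] args)
   {
      // A parenthesized subpattern is a capturing group.
      System.out.println (groups ("(tree)"));     // 1

      // The embedded flag expression does not count as a capturing group.
      System.out.println (groups ("(?i)tree"));   // 0

      // Only (tree) is counted here.
      System.out.println (groups ("(?i)(tree)")); // 1
   }
}
```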

Tip
To specify multiple embedded flag expressions in a regex, either place them side by side (e.g., (?m)(?i)) or place their lowercase letters side by side (e.g., (?mi)).
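As a quick check of the tip, the sketch below applies both spellings to the same text and compares the results. The class name CombinedFlagsDemo, the findAll() helper, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedFlagsDemo
{
   // Hypothetical helper: collects every match of regex found in text.
   public static List<String> findAll (String regex, String text)
   {
      List<String> matches = new ArrayList<> ();
      Matcher m = Pattern.compile (regex).matcher (text);
      while (m.find ())
         matches.add (m.group ());
      return matches;
   }

   public static void main (String [] args)
   {
      String text = "Make\nLAKE\ntake";

      // Flags placed side by side.
      List<String> sideBySide = findAll ("(?m)(?i)^.ake$", text);

      // Flag letters placed side by side in one expression.
      List<String> merged = findAll ("(?mi)^.ake$", text);

      System.out.println (sideBySide);
      System.out.println (sideBySide.equals (merged));
   }
}
```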


Explore the java.util.regex classes' methods

java.util.regex's three classes offer various methods to help us write more robust regex source code and create powerful tools to manipulate text. Our exploration of those methods begins in the Pattern class.

Note
You might also want to explore the CharSequence interface's methods, which you implement when you create a new character sequence class. At the time of writing (J2SE 1.4), the only classes implementing CharSequence are java.nio.CharBuffer, String, and StringBuffer; later releases add others, such as StringBuilder.


Pattern methods

A regex is useless until code compiles that string into a Pattern object. Accomplish that task with either of the following compilation methods:

  • public static Pattern compile(String regex): compiles regex's contents into a tree-structured object representation stored in a new Pattern object, and returns a reference to that object. Example: Pattern p = Pattern.compile ("(?m)^\\."); creates a Pattern object that stores a compiled representation of the regex for matching all lines starting with a period character.
  • public static Pattern compile(String regex, int flags): accomplishes the same task as the previous method, but also takes into account the flag constants that flags specifies as a bitwise inclusive OR of their bit values. Flag constants are declared in Pattern and, with the exception of the canonical equivalence flag (CANON_EQ), serve as an alternative to embedded flag expressions. Example: Pattern p = Pattern.compile ("^\\.", Pattern.MULTILINE); is equivalent to the previous example because the Pattern.MULTILINE constant and the (?m) embedded flag expression accomplish the same task. (Consult the SDK's Pattern documentation to learn about other constants.) This method throws an IllegalArgumentException object if flags contains bit values other than those that Pattern's constants represent.
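The sketch below compares the two compilation methods on the same text and shows two flag constants combined with the bitwise inclusive OR operator. The class name CompileFlagsDemo, the countMatches() helper, and the sample texts are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompileFlagsDemo
{
   // Hypothetical helper: counts the matches of p found in text.
   public static int countMatches (Pattern p, String text)
   {
      Matcher m = p.matcher (text);
      int count = 0;
      while (m.find ())
         count++;
      return count;
   }

   public static void main (String [] args)
   {
      String text = ".first\nsecond\n.third";

      // The embedded flag expression and the flag constant behave the same.
      Pattern embedded = Pattern.compile ("(?m)^\\.");
      Pattern constant = Pattern.compile ("^\\.", Pattern.MULTILINE);
      System.out.println (countMatches (embedded, text)); // 2
      System.out.println (countMatches (constant, text)); // 2

      // Without multiline mode, ^ matches only at the start of the text.
      System.out.println (countMatches (Pattern.compile ("^\\."), text)); // 1

      // Several constants are combined with the bitwise inclusive OR operator.
      Pattern both = Pattern.compile ("^\\.t",
                                      Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
      System.out.println (countMatches (both, ".first\nsecond\n.Third")); // 1
   }
}
```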


When needed, obtain a copy of a Pattern object's flags and the original regex that compiled into that object by calling the following methods:

  • public int flags(): returns the flags that were specified when the Pattern object's regex was compiled. Example: System.out.println (p.flags ()); outputs the flags associated with the Pattern object that p references.
  • public String pattern(): returns the original regex that compiled into the Pattern object. Example: System.out.println (p.pattern ()); outputs the regex corresponding to the Pattern that p references. (The Matcher class includes a similar Pattern pattern() method that returns a matcher's associated Pattern object.)
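A short sketch of both accessor methods follows. It tests a flag with a bitwise AND rather than printing the raw integer, which is one common way to interpret the value that flags() returns. The class name PatternInfoDemo and the hasFlag() helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternInfoDemo
{
   // Hypothetical helper: true if flag is set in p's flags.
   public static boolean hasFlag (Pattern p, int flag)
   {
      return (p.flags () & flag) != 0;
   }

   public static void main (String [] args)
   {
      Pattern p = Pattern.compile ("^\\.", Pattern.MULTILINE);

      // flags() returns the flags passed to compile().
      System.out.println (hasFlag (p, Pattern.MULTILINE));        // true
      System.out.println (hasFlag (p, Pattern.CASE_INSENSITIVE)); // false

      // pattern() returns the original regex string.
      System.out.println (p.pattern ());

      // Matcher's pattern() returns the Pattern that created the matcher.
      Matcher m = p.matcher (".profile");
      System.out.println (m.pattern () == p); // true
   }
}
```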


After creating a Pattern object, you commonly obtain a Matcher object from it by calling Pattern's public Matcher matcher(CharSequence text) method. That method requires a single text object argument, whose class implements the CharSequence interface. The obtained matcher scans the characters in the text object during a pattern match operation. Example: Pattern p = Pattern.compile ("[^aeiouy]"); Matcher m = p.matcher ("This is a test."); obtains a matcher that matches nonvowel characters in the text This is a test. (including its final period).
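To show what the obtained matcher does with the text, the sketch below walks through every match by calling Matcher's find() and group() methods. The class name NonvowelDemo and the nonvowels() helper are hypothetical; the regex and the first text come from the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonvowelDemo
{
   // Hypothetical helper: returns text with every vowel (a, e, i, o, u, y) left out.
   public static String nonvowels (CharSequence text)
   {
      Pattern p = Pattern.compile ("[^aeiouy]");
      Matcher m = p.matcher (text);
      StringBuilder found = new StringBuilder ();
      while (m.find ())
         found.append (m.group ());
      return found.toString ();
   }

   public static void main (String [] args)
   {
      // Spaces and the period are nonvowel characters too, so they match.
      System.out.println (nonvowels ("This is a test."));

      // Any CharSequence implementation is accepted, not only String.
      System.out.println (nonvowels (new StringBuffer ("queue")));
   }
}
```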

Creating Pattern and Matcher objects is bothersome when you only wish to check whether a pattern completely matches a text sequence. Fortunately, Pattern offers a convenience method for that task: public static boolean matches(String regex, CharSequence text). That static method returns true if and only if the entire text character sequence matches regex's pattern. Example: System.out.println (Pattern.matches ("[a-z[\\s]]*", "all lowercase letters and whitespace only")); outputs true, indicating that only whitespace characters and lowercase letters appear in all lowercase letters and whitespace only.
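The sketch below repeats that example and contrasts it with a text that fails the check. It also contrasts matches(), which must match the entire text, with Matcher's find(), which accepts a partial match. The class name MatchesDemo and the isLowercaseOrWhitespace() helper are hypothetical.

```java
import java.util.regex.Pattern;

public class MatchesDemo
{
   // Hypothetical helper: true if text holds only lowercase letters and whitespace.
   public static boolean isLowercaseOrWhitespace (CharSequence text)
   {
      return Pattern.matches ("[a-z[\\s]]*", text);
   }

   public static void main (String [] args)
   {
      System.out.println (isLowercaseOrWhitespace (
                          "all lowercase letters and whitespace only")); // true

      // The uppercase N prevents a complete match.
      System.out.println (isLowercaseOrWhitespace ("Not all lowercase")); // false

      // matches() must consume the entire text; find() accepts a partial match.
      System.out.println (Pattern.matches ("[a-z]+", "abc 123"));                  // false
      System.out.println (Pattern.compile ("[a-z]+").matcher ("abc 123").find ()); // true
   }
}
```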

Writing code to break text into its component parts (such as a text file's employee record into a set of fields) is a task many developers find tedious. Pattern relieves that tedium by providing a pair of text-splitting methods:

  • public String [] split(CharSequence text, int limit): splits text around matches of the current Pattern object's pattern. This method returns an array in which each entry specifies a text sequence separated from the next text sequence by a pattern match (or the text's end). Array entries are stored in the same order as they appear in the text. The number of array entries depends on limit, which also controls the number of matches that occur. A positive value means that, at most, limit-1 matches are considered and the array holds no more than limit entries. A negative value means all possible matches are considered and the array can have any length. A zero value means all possible matches are considered, the array can have any length, and trailing empty strings are discarded.
  • public String [] split(CharSequence text): invokes the previous method with zero as the limit and returns the method call's result.
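The effect of limit is easiest to see on a text that ends with empty fields. The sketch below splits the same text with a zero, a negative, and a positive limit. The class name SplitLimitDemo, the split() helper, and the sample text are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimitDemo
{
   // Hypothetical helper: splits text around matches of regex, honoring limit.
   public static String [] split (String regex, String text, int limit)
   {
      return Pattern.compile (regex).split (text, limit);
   }

   public static void main (String [] args)
   {
      String text = "a,b,c,,"; // Two empty fields trail the c.

      // Zero: all matches considered, trailing empty strings discarded (3 entries).
      System.out.println (Arrays.toString (split (",", text, 0)));

      // Negative: all matches considered, trailing empty strings kept (5 entries).
      System.out.println (split (",", text, -1).length);

      // Positive: at most limit-1 matches considered (2 entries: a and b,c,,).
      System.out.println (Arrays.toString (split (",", text, 2)));

      // The one-argument method behaves like a zero limit.
      System.out.println (Pattern.compile (",").split (text).length);
   }
}
```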


Suppose you want to split an employee record, consisting of name, age, street address, and salary, into its field components. The following code fragment uses split(CharSequence text) to accomplish that task:

// Split on a comma followed by one whitespace character.
Pattern p = Pattern.compile (",\\s");
String [] fields = p.split ("John Doe, 47, Hillsboro Road, 32000");
// Print each field on its own line.
for (int i = 0; i < fields.length; i++)
     System.out.println (fields [i]);


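The code fragment above specifies a regex that matches a comma character immediately followed by a single whitespace character (\s matches any whitespace character, not only a space). It outputs each field on its own line: John Doe, 47, Hillsboro Road, and 32000.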
