Regular expressions simplify pattern-matching code

Discover the elegance of regular expressions in text-processing scenarios that involve pattern matching

Text processing frequently requires code to match text against patterns. That capability makes possible text searches, email header validation, custom text creation from generic text (e.g., "Dear Mr. Smith" instead of "Dear Customer"), and so on. Java supports pattern matching via its character and assorted string classes. Because that low-level support commonly leads to complex pattern-matching code, Java also offers regular expressions to help you write simpler code.

Regular expressions often confuse newcomers. However, this article dispels much of that confusion. After introducing regular expression terminology, the java.util.regex package's classes, and a program that demonstrates regular expression constructs, I explore many of the regular expression constructs that the Pattern class supports. I also examine the methods comprising Pattern and other java.util.regex classes. A practical application of regular expressions concludes my discussion.

Note
Regular expressions' long history begins in the theoretical computer science fields of automata theory and formal language theory. That history continues with Unix and Unix-like operating systems, whose utilities often use regular expressions. Examples include awk (a programming language for sophisticated text analysis and manipulation, named after its creators Aho, Weinberger, and Kernighan), emacs (a developer's editor), and grep (a program that matches regular expressions in one or more text files; its name stands for global regular expression print).

What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines what strings belong to the set, and consists of literal characters and metacharacters, characters that have special meaning instead of a literal meaning. The process of searching text to identify matches—strings that match a regex's pattern—is pattern matching.

Java's java.util.regex package supports pattern matching via its Pattern, Matcher, and PatternSyntaxException classes:

  • Pattern objects, also known as patterns, are compiled regexes
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences, objects whose classes implement the java.lang.CharSequence interface and serve as text sources
  • PatternSyntaxException objects describe illegal regex patterns

Listing 1 introduces those classes:

Listing 1. RegexDemo.java

// RegexDemo.java
import java.util.regex.*;
class RegexDemo
{
   public static void main (String [] args)
   {
      if (args.length != 2)
      {
          System.err.println ("java RegexDemo regex text");
          return;
      }
      Pattern p;
      try
      {
         p = Pattern.compile (args [0]);
      }
      catch (PatternSyntaxException e)
      {
         System.err.println ("Regex syntax error: " + e.getMessage ());
         System.err.println ("Error description: " + e.getDescription ());
         System.err.println ("Error index: " + e.getIndex ());
         System.err.println ("Erroneous pattern: " + e.getPattern ());
         return;
      }
      String s = cvtLineTerminators (args [1]);
      Matcher m = p.matcher (s);
      System.out.println ("Regex = " + args [0]);
      System.out.println ("Text = " + s);
      System.out.println ();
      while (m.find ())
      {
         System.out.println ("Found " + m.group ());
         System.out.println ("  starting at index " + m.start () +
                             " and ending at index " + m.end ());
         System.out.println ();
      }
   }
   // Convert \n and \r character sequences to their single character
   // equivalents
   static String cvtLineTerminators (String s)
   {
      StringBuffer sb = new StringBuffer (80);
      int oldindex = 0, newindex;
      while ((newindex = s.indexOf ("\\n", oldindex)) != -1)
      {
         sb.append (s.substring (oldindex, newindex));
         oldindex = newindex + 2;
         sb.append ('\n');
      }
      sb.append (s.substring (oldindex));
      s = sb.toString ();
      sb = new StringBuffer (80);
      oldindex = 0;
      while ((newindex = s.indexOf ("\\r", oldindex)) != -1)
      {
         sb.append (s.substring (oldindex, newindex));
         oldindex = newindex + 2;
         sb.append ('\r');
      }
      sb.append (s.substring (oldindex));
      return sb.toString ();
   }
}

RegexDemo's public static void main(String [] args) method validates two command-line arguments: one that identifies a regex and another that identifies text. The method compiles the regex into a pattern; if the regex is illegal, it reports the PatternSyntaxException's details and exits. It then converts all of the text argument's new-line and carriage-return character sequences to the actual line-terminator characters. For example, a new-line character sequence (a backslash (\) followed by n) converts to one new-line character (represented numerically as 10). Next, main(String [] args) creates a matcher from the pattern, outputs the regex and the converted text, and uses the matcher to find all matches. For each match, the method writes the match's characters and the indexes at which the match starts and ends to the standard output device.
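The conversion step can be sketched more compactly. The following is only an illustration under the assumption that String's replace(CharSequence, CharSequence) method is available (Java 5 and later); it is not the code Listing 1 uses, and the class name LineTerminatorDemo and method name convert are hypothetical.

```java
class LineTerminatorDemo {
    // Hypothetical helper: turns the two-character sequences \n and \r
    // (a backslash followed by a letter) into single line-terminator
    // characters, in the same order as cvtLineTerminators in Listing 1.
    static String convert(String s) {
        return s.replace("\\n", "\n").replace("\\r", "\r");
    }

    public static void main(String[] args) {
        // In Java source, "\\n" is a backslash followed by n (two characters).
        String raw = "line1\\nline2";
        String converted = convert(raw);
        System.out.println(raw.length());               // 12
        System.out.println(converted.length());         // 11
        System.out.println((int) converted.charAt(5));  // 10 (new-line)
    }
}
```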

To accomplish pattern matching, RegexDemo calls various methods in java.util.regex's classes. Don't concern yourself with understanding those methods right now; we'll explore them later in this article. More importantly, compile Listing 1: you need RegexDemo.class to explore Pattern's regex constructs.

Explore Pattern's regex constructs

Pattern's SDK documentation presents a section on regular expression constructs. Unless you're an avid regex user, an initial examination of that section might confuse you. What are quantifiers and the differences among greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? To answer those and other questions, we explore many of the regex constructs, or regex pattern categories, that Pattern recognizes. We begin with the simplest regex construct: literal strings.

Caution
Do not assume that Pattern's and Perl 5's regex constructs are identical. Although they have much in common, they also differ in places, from the constructs they support to their treatment of dangling metacharacters. (For more information, examine your SDK documentation on the Pattern class, which you should have on your platform.)

Literal strings

You specify the literal string regex construct whenever you type a literal string in the search text field of your word processor's search dialog box. Execute the following RegexDemo command line to see this regex construct in action:

java RegexDemo apple applet

The command line above identifies apple as a literal string regex construct that consists of literal characters a, p, p, l, and e (in that order). The command line also identifies applet as text for pattern-matching purposes. After executing the command line, observe the following output:

Regex = apple
Text = applet
Found apple
  starting at index 0 and ending at index 5

The output identifies the regex and text command-line arguments, indicates a successful match of apple within applet, and presents the starting and ending indexes of that match: 0 and 5, respectively. The starting index identifies the first text location where a pattern match occurs, and the ending index identifies the first text location after the match. In other words, the range of matching text is inclusive of the starting index and exclusive of the ending index.
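The inclusive-start, exclusive-end convention is the same one String's substring(int, int) method uses, so the two indexes can be passed to it directly. The sketch below illustrates that relationship; the class name IndexDemo and the helper firstMatchBySubstring are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexDemo {
    // Hypothetical helper: cuts the first match out of the text by using
    // the matcher's start and end indexes; returns null if nothing matches.
    static String firstMatchBySubstring(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        return text.substring(m.start(), m.end());
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("apple").matcher("applet");
        if (m.find()) {
            System.out.println(m.start());            // 0
            System.out.println(m.end());              // 5
            System.out.println(m.end() - m.start());  // 5, the match length
            // The substring taken from start to end equals the matched text.
            System.out.println("applet".substring(m.start(), m.end()).equals(m.group())); // true
        }
    }
}
```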

Metacharacters

Although literal string regex constructs are useful, more powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b. To see the period metacharacter in action, execute the following command line:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

The command line above specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the text command-line argument. RegexDemo searches the text for matches that begin with any character and end with ox, and produces the following output:

Regex = .ox
Text = The quick brown fox jumps over the lazy ox.
Found fox
  starting at index 16 and ending at index 19
Found  ox
  starting at index 39 and ending at index 42

The output reveals two matches: fox and ox (with a leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens if we replace .ox with the period metacharacter? That is, what outputs when we specify java RegexDemo . "The quick brown fox jumps over the lazy ox."? Because the period metacharacter matches any character, RegexDemo outputs a match for each character in its text command-line argument, including the terminating period character.
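Both period experiments can also be run from code rather than from the command line. The following is a minimal sketch; the class name PeriodDemo and the helper findAll are hypothetical, and the text contains no line terminators, so every character is matched by a lone period.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PeriodDemo {
    // Hypothetical helper: collects the text of every match, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        // Two matches: "fox" and " ox" (with a leading space).
        System.out.println(findAll(".ox", text));                       // [fox,  ox]
        // A lone period produces one match per character of this text.
        System.out.println(findAll(".", text).size() == text.length()); // true
    }
}
```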

Tip
To specify . or any metacharacter as a literal character in a regex construct, quote—convert from meta status to literal status—the metacharacter in one of two ways:
  • Precede the metacharacter with a backslash character.
  • Place the metacharacter between \Q and \E (e.g., \Q.\E).
In either scenario, don't forget to double each backslash character (as in \\. or \\Q.\\E) that appears in a string literal (e.g., String regex = "\\.";). Do not double the backslash character when it appears as part of a command-line argument.
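The two quoting styles look as follows inside Java string literals. This sketch is an illustration only; the class name QuoteDemo and the helper countMatches are hypothetical. The last call uses Pattern.quote, a library method (Java 5 and later) that the tip above does not mention and that returns a literal pattern string for its argument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: counts how many matches the regex finds in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a.b";
        // Unquoted: the period is a metacharacter and matches every character.
        System.out.println(countMatches(".", text));                 // 3
        // Backslash quoting; the backslash is doubled in the string literal.
        System.out.println(countMatches("\\.", text));               // 1
        // \Q...\E quoting; both backslashes are doubled.
        System.out.println(countMatches("\\Q.\\E", text));           // 1
        // Pattern.quote builds the quoted form for us.
        System.out.println(countMatches(Pattern.quote("."), text));  // 1
    }
}
```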

Character classes

We sometimes limit those characters that produce matches to a specific set of characters. For example, we might search text for vowels a, e, i, o, and u, where any occurrence of any vowel indicates a match. A character class, a regex construct that identifies a set of characters between open and close square bracket metacharacters ([ ]), helps us accomplish that task. Pattern supports the following character classes:

  • Simple: consists of characters placed side by side and matches only those characters. Example: [abc] matches characters a, b, and c. The following command line offers a second example:

    java RegexDemo [csw] cave

    java RegexDemo [csw] cave matches c in [csw] with c in cave. No other matches exist.

  • Negation: begins with the ^ metacharacter and matches only those characters not in that class. Example: [^abc] matches all characters except a, b, and c. The following command line offers a second example:

    java RegexDemo [^csw] cave

    java RegexDemo [^csw] cave matches a, v, and e with their counterparts in cave. No other matches exist.

  • Range: consists of all characters beginning with the character on the left of a hyphen metacharacter (-) and ending with the character on the right of the hyphen metacharacter, matching only those characters in that range. Example: [a-z] matches all lowercase alphabetic characters. The following command line offers a second example:

    java RegexDemo [a-c] clown

    java RegexDemo [a-c] clown matches c in [a-c] with c in clown. No other matches exist.

    Tip
    Combine multiple ranges within the same range character class by placing them side by side. Example: [a-zA-Z] matches all lowercase and uppercase alphabetic characters.

  • Union: consists of multiple nested character classes and matches all characters that belong to the resulting union. Example: [a-d[m-p]] matches characters a through d and m through p. The following command line offers a second example:

    java RegexDemo [ab[c-e]] abcdef

    java RegexDemo [ab[c-e]] abcdef matches a, b, c, d, and e with their counterparts in abcdef. No other matches exist.

  • Intersection: consists of characters common to all nested classes and matches only common characters. Example: [a-z&&[d-f]] matches characters d, e, and f. The following command line offers a second example:

    java RegexDemo [aeiouy&&[y]] party

    java RegexDemo [aeiouy&&[y]] party matches y in [aeiouy&&[y]] with y in party. No other matches exist.

  • Subtraction: consists of all characters except for those indicated in nested negation character classes and matches the remaining characters. Example: [a-z&&[^m-p]] matches characters a through l and q through z. The following command line offers a second example:

    java RegexDemo [a-f&&[^a-c]&&[^e]] abcdefg

    java RegexDemo [a-f&&[^a-c]&&[^e]] abcdefg matches d and f with their counterparts in abcdefg. No other matches exist.
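The six command lines in the list above can be reproduced in a single program. The sketch below is one way to do that, not part of the original listings; the class name CharClassDemo and the helper matchedChars are hypothetical. The helper concatenates every match, which is convenient here because each character class matches one character at a time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: concatenates the text of all matches, in order.
    static String matchedChars(String regex, String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(matchedChars("[csw]", "cave"));                  // simple: c
        System.out.println(matchedChars("[^csw]", "cave"));                 // negation: ave
        System.out.println(matchedChars("[a-c]", "clown"));                 // range: c
        System.out.println(matchedChars("[ab[c-e]]", "abcdef"));            // union: abcde
        System.out.println(matchedChars("[aeiouy&&[y]]", "party"));         // intersection: y
        System.out.println(matchedChars("[a-f&&[^a-c]&&[^e]]", "abcdefg")); // subtraction: df
    }
}
```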

Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides such shortcuts with predefined character classes, which Table 1 presents. Use predefined character classes to simplify your regexes and minimize regex syntax errors.

Table 1. Predefined character classes

Predefined character class   Description
\d                           A digit. Equivalent to [0-9].
\D                           A nondigit. Equivalent to [^0-9].
\s                           A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S                           A nonwhitespace character. Equivalent to [^\s].
\w                           A word character. Equivalent to [a-zA-Z_0-9].
\W                           A nonword character. Equivalent to [^\w].

The following command-line example uses the \w predefined character class to identify all word characters in its text command-line argument:

java RegexDemo \w "aZ.8 _"

The command line above reports matches for a, Z, 8, and _ only, which shows that the period and space characters are not considered word characters.
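The same check can be made from code. The following sketch is only an illustration; the class name PredefinedClassDemo and the helper matchedChars are hypothetical. Note that each backslash in a predefined character class is doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: concatenates the text of all matches, in order.
    static String matchedChars(String regex, String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "aZ.8 _";
        // Word characters: letters, digits, and the underscore.
        System.out.println(matchedChars("\\w", text));             // aZ8_
        // Nonword characters: the period and the space (shown in brackets).
        System.out.println("[" + matchedChars("\\W", text) + "]"); // [. ]
        // Digits only.
        System.out.println(matchedChars("\\d", text));             // 8
    }
}
```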
