Regular expressions simplify pattern-matching code

Discover the elegance of regular expressions in text-processing scenarios that involve pattern matching

Text processing frequently requires code to match text against patterns. That capability makes possible text searches, email header validation, custom text creation from generic text (e.g., "Dear Mr. Smith" instead of "Dear Customer"), and so on. Java supports pattern matching via its character and assorted string classes. Because that low-level support commonly leads to complex pattern-matching code, Java also offers regular expressions to help you write simpler code.
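To make that contrast concrete, here is a minimal sketch that finds the first run of ASCII digits in a string twice: once by scanning characters by hand and once with a regex. The class name DigitRunDemo, its method names, and the sample text are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitRunDemo
{
   // Low-level approach: scan character by character for the first run of
   // ASCII digits. Returns null when the text contains no digit.
   static String firstDigitRunByHand(String text)
   {
      int start = -1;
      for (int i = 0; i < text.length(); i++)
      {
         char c = text.charAt(i);
         boolean isDigit = c >= '0' && c <= '9';
         if (isDigit && start == -1)
            start = i;                        // a run begins here
         else if (!isDigit && start != -1)
            return text.substring(start, i);  // the run ended just before i
      }
      return start == -1 ? null : text.substring(start);
   }

   // Regex approach: the pattern [0-9]+ describes one or more ASCII digits.
   static String firstDigitRunByRegex(String text)
   {
      Matcher m = Pattern.compile("[0-9]+").matcher(text);
      return m.find() ? m.group() : null;
   }

   public static void main(String[] args)
   {
      String text = "Order 1234 shipped in 2 boxes";
      System.out.println(firstDigitRunByHand(text));   // prints 1234
      System.out.println(firstDigitRunByRegex(text));  // prints 1234
   }
}
```

Both methods return the same result, but the regex version states what to find rather than how to scan for it.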

Regular expressions often confuse newcomers, and this article aims to dispel much of that confusion. After introducing regular expression terminology, the java.util.regex package's classes, and a program that demonstrates regular expression constructs, I explore many of the regular expression constructs that the Pattern class supports. I also examine the methods that make up Pattern and the other java.util.regex classes. A practical application of regular expressions concludes my discussion.

Note
Regular expressions have a long history that begins in the theoretical computer science fields of automata theory and formal language theory. That history continues with Unix and other operating systems, whose utilities often use regular expressions. Examples include awk (a programming language for sophisticated text analysis and manipulation, named after its creators Aho, Weinberger, and Kernighan), emacs (a developer's editor), and grep (a program that matches regular expressions in one or more text files; its name stands for global regular expression print).


What are regular expressions?

A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. It consists of literal characters and metacharacters, which are characters that have a special meaning instead of a literal meaning. Pattern matching is the process of searching text to identify matches, the strings that match a regex's pattern.
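The difference between literal characters and metacharacters can be sketched with a small example. It assumes the period metacharacter, which by default matches any single character other than a line terminator. The class name MetacharDemo and the helper inSet are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class MetacharDemo
{
   // Returns true when the entire text belongs to the set of strings
   // that the regex describes.
   static boolean inSet(String regex, String text)
   {
      return Pattern.matches(regex, text);
   }

   public static void main(String[] args)
   {
      // Literal characters only: the set contains the single string "cat".
      System.out.println(inSet("cat", "cat"));     // true
      System.out.println(inSet("cat", "cot"));     // false

      // The period is a metacharacter: the set contains every
      // three-character string that starts with c and ends with t.
      System.out.println(inSet("c.t", "cat"));     // true
      System.out.println(inSet("c.t", "cot"));     // true
      System.out.println(inSet("c.t", "cart"));    // false

      // A backslash removes the special meaning, so the set contains
      // only the string "c.t". (The Java string literal doubles the
      // backslash.)
      System.out.println(inSet("c\\.t", "c.t"));   // true
      System.out.println(inSet("c\\.t", "cat"));   // false
   }
}
```

Pattern.matches() requires the whole text to match, which makes it convenient for asking whether one string is a member of the set that a pattern describes.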

Java's java.util.regex package supports pattern matching via its Pattern, Matcher, and PatternSyntaxException classes:

  • Pattern objects, also known as patterns, are compiled regexes
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects that serve as text sources and whose classes implement the java.lang.CharSequence interface)
  • PatternSyntaxException objects describe illegal regex patterns


Listing 1 introduces those classes:

Listing 1. RegexDemo.java

// RegexDemo.java
import java.util.regex.*;
class RegexDemo
{
   public static void main (String [] args)
   {
      if (args.length != 2)
      {
          System.err.println ("java RegexDemo regex text");
          return;
      }
      Pattern p;
      try
      {
         p = Pattern.compile (args [0]);
      }
      catch (PatternSyntaxException e)
      {
         System.err.println ("Regex syntax error: " + e.getMessage ());
         System.err.println ("Error description: " + e.getDescription ());
         System.err.println ("Error index: " + e.getIndex ());
         System.err.println ("Erroneous pattern: " + e.getPattern ());
         return;
      }
      String s = cvtLineTerminators (args [1]);
      Matcher m = p.matcher (s);
      System.out.println ("Regex = " + args [0]);
      System.out.println ("Text = " + s);
      System.out.println ();
      while (m.find ())
      {
         System.out.println ("Found " + m.group ());
         System.out.println ("  starting at index " + m.start () +
                             " and ending at index " + m.end ());
         System.out.println ();
      }
   }
   // Convert each two-character sequence \n or \r (a backslash followed by
   // n or r) to the single new-line or carriage-return character it denotes
   static String cvtLineTerminators (String s)
   {
      StringBuffer sb = new StringBuffer (80);
      int oldindex = 0, newindex;
      while ((newindex = s.indexOf ("\\n", oldindex)) != -1)
      {
         sb.append (s.substring (oldindex, newindex));
         oldindex = newindex + 2;
         sb.append ('\n');
      }
      sb.append (s.substring (oldindex));
      s = sb.toString ();
      sb = new StringBuffer (80);
      oldindex = 0;
      while ((newindex = s.indexOf ("\\r", oldindex)) != -1)
      {
         sb.append (s.substring (oldindex, newindex));
         oldindex = newindex + 2;
         sb.append ('\r');
      }
      sb.append (s.substring (oldindex));
      return sb.toString ();
   }
}


RegexDemo's public static void main(String [] args) method first checks that exactly two command-line arguments are present: one that identifies a regex and another that identifies text. After compiling the regex into a pattern (and reporting the details of any PatternSyntaxException), this method converts each new-line and carriage-return character sequence in the text argument to the line-terminator character it represents. For example, a new-line character sequence (a backslash (\) followed by n) converts to one new-line character (represented numerically as 10). The method then creates a matcher from the pattern, outputs the regex and the converted text, and uses the matcher to find all matches. For each match, the method writes the matched characters and the location of the match in the text to the standard output device.
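The following sketch isolates the two steps that Listing 1 performs on the text argument: converting line-terminator character sequences and looping over matches. FindLoopDemo and its method names are hypothetical. The conversion uses String.replace() as an assumed shorthand for cvtLineTerminators(), not the listing's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo
{
   // Shorthand for Listing 1's cvtLineTerminators(): replace each
   // backslash-n pair with a new-line character, then each backslash-r
   // pair with a carriage-return character.
   static String convertLineTerminators(String s)
   {
      return s.replace("\\n", "\n").replace("\\r", "\r");
   }

   // Collects one "match[start,end)" entry per match, using the same
   // find(), group(), start(), and end() calls as Listing 1.
   static List<String> describeMatches(String regex, String text)
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile(regex).matcher(text);
      while (m.find())
         found.add(m.group() + "[" + m.start() + "," + m.end() + ")");
      return found;
   }

   public static void main(String[] args)
   {
      // The four-character argument a\nb becomes three characters, and
      // the middle one is the new-line character (10).
      String converted = convertLineTerminators("a\\nb");
      System.out.println(converted.length());         // prints 3
      System.out.println((int) converted.charAt(1));  // prints 10

      // prints [apple[0,5), apple[14,19)]
      System.out.println(describeMatches("apple", "applesauce or apple pie"));
   }
}
```

Note that end() returns the offset just past the last matched character, so the value that Listing 1 labels as the ending index is exclusive: the first apple occupies indexes 0 through 4.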
