Reading textual data: Fun with streams

Find out how to extend and customize the character-stream classes to easily read textual data

Never let it be said that I'm happy to rush out a simple article (or book) just to meet a publisher's deadline. What started out as a basic, inefficient character-stream filter, designed to read a stream of digits and parse it into a number, has gradually ballooned into a small stream library sporting unnecessary features and go-faster stripes. Not only has the evolving character-stream library discarded many genetically-inferior siblings along the way, it's behind schedule! (This brings back memories of that 300-page manuscript I discarded in favor of starting anew. I think I'm beginning to see a destructive pattern emerging...)

I initially set out to design a data-reading character-stream filter class. Analogous to the byte-stream filter DataInputStream, this character-stream filter was intended to provide the capability to read textual data from a character stream (namely the output of a human or the PrintWriter println() method).

Let me now, in retrospect, describe what I actually implemented.

First, I created an UndoReader class. This character-stream filter supports three special methods:

  • checkpoint()
  • commit()
  • rollback()

As you read characters through the stream, you have the option to checkpoint the stream -- that is, save the stream's current state and put it into a mode in which it stores all data subsequently read through it. After any amount of reading, committing the stream discards the stored data, after which reading proceeds undisturbed. Alternatively, rolling back the stream rewinds it to the position at which you asserted the checkpoint -- just as if you hadn't yet read anything. The stream also supports two related methods, undo() and peek(), described below.

Next, I implemented the DataReader class, the data-reading character-stream filter I originally set out to write. This class makes use of the UndoReader class and provides methods to read all the primitive Java types (readInt(), readFloat(), readBoolean(), and so on). What is special about this class is that if you attempt to read a primitive from the stream and it turns out the stream data is incorrect -- if, for example, you attempt to read a Boolean and the next token in the stream is truthfulness -- it rolls back, un-reading any characters read during the erroneous operation, and throws an exception. The stream also supports a feature whereby it can read data one line at a time, signaling each time the end of a line is reached (among other wonders).

The classes I developed will only work in JDK 1.1-plus. Adapting them to work as InputStreams, usable under JDK 1.0.2, should be quite easy, however.

Justification for UndoReader

In the interest of brevity, I'll spare you an introduction to character streams. Todd Sundsted's November 1997 How-To Java column should serve that purpose quite adequately; if you want an introduction to byte streams, check out Todd's October 1997 column. For further details of the Java stream classes, I refer you to Java Network Programming, Second Edition, which I coauthored with Michael Shoffner and Derek Hamner, and which is due out any day now. (See Resources.)

What I should perhaps explain is my justification for the UndoReader class. At the most abstract level, I want the ability to undo a series of read operations, because otherwise my DataReader class will violate the basic law of propriety: It would be improper for an erroneous attempt at reading an int to end up consuming a Boolean. Furthermore, the behavior of my class wouldn't necessarily be clearly defined in the presence of such an error: the amount of erroneous input consumed would be implementation-dependent, and exposing implementation-dependent details of this nature simply invites abuse.

Those readers familiar with the stream classes may then ask about the mark() and reset() methods, or indeed the PushbackReader class: Do these basic features of the stream API not already address my needs? Indeed, use of the mark() and reset() methods does allow a sequence of read operations to be undone, and the PushbackReader unread() methods can be used to the same effect. However, both of these options are bounded. Therefore you must, in each case, declare ahead of time the maximum volume of data you will un-read. In our situation, no such limit exists for textual data: "00...01" is a valid integer, just as "00...0z" is not. I cannot, simply to avoid writing an extra class, presume to impose arbitrary limitations on the data I will process.
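To make the bound concrete, here is a minimal sketch using the standard PushbackReader. The class and method names (BoundedPushbackDemo, canUnread) are my own inventions for illustration; only the java.io calls are real. It reads a run of characters and then tries to un-read them all, which fails as soon as the run is longer than the pushback limit declared up front.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BoundedPushbackDemo {
  /**
   * Reads up to count characters through a PushbackReader whose pushback
   * buffer holds limit characters, then tries to un-read all of them.
   * Returns true if the un-read succeeded, false if the buffer overflowed.
   */
  static boolean canUnread (String text, int limit, int count) {
    try {
      PushbackReader in = new PushbackReader (new StringReader (text), limit);
      char[] chars = new char[count];
      int n = in.read (chars, 0, count);
      if (n < 0)
        return true;                  // nothing was read, so nothing to undo
      try {
        in.unread (chars, 0, n);      // throws IOException if n exceeds limit
        return true;
      } catch (IOException overflow) {
        return false;
      }
    } catch (IOException ex) {
      throw new UncheckedIOException (ex);
    }
  }
}
```

With the input "0000000z", a limit of 8 lets all eight characters be un-read, whereas a limit of 4 does not; no fixed limit is safe for arbitrarily long numbers.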

Thus rationalized, we can now proceed with the code.

Class UndoReader

The UndoReader class is a character-stream filter that provides unbounded checkpoint, commit, and rollback operations and an additional undo facility.

Figure 1. Using methods checkpoint(), commit(), rollback(), and undo()
  • When checkpoint() is used, the stream proceeds to store all data read through it in an internal, expanding buffer

  • When commit() is used, the stored contents of the checkpoint buffer are discarded and further reads are no longer stored

  • When rollback() is used, reading reverts to data stored in the internal buffer; when this is used up, reading proceeds as normal

  • Any number of reads performed since a checkpoint can be undone without a full rollback by partially reverting in the internal buffer

  • It is an error to checkpoint a stream that has already been checkpointed or to commit, rollback, or undo a noncheckpointed stream

  • We must support the case where a checkpoint is placed while we're still reading out of the internal buffer

The class definition

We'll start by looking at our class definition:

package org.merlin.io;
import java.io.*;
public class UndoReader extends Reader {
  protected static final int INITIAL_UNDO_SIZE = 64;
  
  protected Reader reader;
  protected int[] buffer;
  protected int index, stored, capacity;
  protected boolean storing, restoring, closed;
  
  public UndoReader (Reader reader) {
    this (reader, INITIAL_UNDO_SIZE);
  }
  public UndoReader (Reader reader, int capacity) {
    super (reader);
    this.reader = reader;
    this.capacity = capacity;
    buffer = new int[capacity];
  }
  ...
}

Ordinarily, a character-stream filter would inherit from FilterReader; in this case, however, we must override the entire function of the Reader superclass, so FilterReader would be of no benefit.

The constructors we provide allow the attached stream, and optionally the initial buffer capacity, to be specified. As the internal buffer can grow as necessary, this will only affect efficiency. In the constructor, we initialize the buffer variable, in which checkpoint data are stored, and the capacity variable, which defines the current size of our buffer.

You may notice that the checkpoint buffer is, in fact, an int array and not a char array. The reason: I wish to be able to store not only normal character data, but the end of file (EOF) [-1] in this buffer. Although there may be other ways to implement this, my purpose is served by this solution.

Other variables related to this buffer are stored, for the amount of checkpoint data that have been stored; and index, for the current read index when data are read from the buffer. The storing flag indicates that a stream has been checkpointed, so data should be stored in the checkpoint buffer; restoring indicates a rollback, so data should be read from the checkpoint buffer; and closed indicates the stream has been closed.

Figure 2. Internals of the UndoReader class

The read() methods

We now look at the operation of reading a single character:

  public int read () throws IOException {
    synchronized (lock) {
      int result;
      if (!storing && !restoring) {
        result = reader.read ();
      } else if (restoring) {
        result = buffer[index ++];
        if (index >= stored)
          restoring = false;
      } else {
        result = reader.read ();
        if (stored + 1 > capacity)
          ensureCapacity (1);
        buffer[stored ++] = result;
      }
      return result;
    }
  }
  protected void ensureCapacity (int amount) {
    capacity = capacity * 2 + amount;
    int[] copy = new int[capacity];
    System.arraycopy (buffer, 0, copy, 0, stored);  // keep everything stored so far
    buffer = copy;
  }

We start by synchronizing on lock. This variable, inherited from the superclass, is the attached stream (reader) we passed to the superconstructor. It is more efficient for us to synchronize on reader than on this, because if we subsequently call synchronized methods on the attached stream, those synchronization calls will not incur any significant costs.

Figure 3. Synchronization efficiency
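As a small illustration of where lock comes from, the following sketch (LockDemo and its helper methods are hypothetical names, not part of the library) passes the attached stream to the protected Reader(Object lock) constructor and then confirms that the inherited lock field is that very object. Because Java monitors are reentrant, a thread that already holds this lock can enter the attached stream's own synchronized code without contending for a second monitor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LockDemo extends Reader {
  private final Reader reader;

  LockDemo (Reader reader) {
    super (reader);              // Reader(Object lock): lock on the attached stream
    this.reader = reader;
  }

  /** Exposes the inherited protected lock field, for demonstration only. */
  Object lockObject () {
    return lock;
  }

  public int read (char[] dst, int offset, int length) throws IOException {
    synchronized (lock) {        // same monitor the attached stream uses
      return reader.read (dst, offset, length);
    }
  }

  public void close () throws IOException {
    reader.close ();
  }

  /** True if the filter and its attached stream share one lock object. */
  static boolean sharesLock () {
    Reader source = new StringReader ("abc");
    return new LockDemo (source).lockObject () == source;
  }
}
```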

If we have neither checkpointed the stream nor rolled it back, we can simply read a character from the attached stream. If we have rolled it back, we must instead return a character from the internal checkpoint buffer; once we reach the end of the stored data, we can clear restoring. Otherwise, the stream has been checkpointed, so we must read a character from the attached stream and store it in the checkpoint buffer.

Before inserting data into the checkpoint buffer, we check whether it is full and, if so, use the ensureCapacity() method to enlarge the array to hold the extra datum. Whenever we perform this enlargement, we more than double the buffer's size. This means it will grow very rapidly to accommodate our needs. (It could, for instance, grow from a single element to more than a million elements in just 20 expansions.)
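The growth claim is easy to check with a few lines of arithmetic. This sketch (GrowthDemo is a name I made up) applies the same capacity * 2 + amount rule, with amount fixed at 1, and counts the expansions needed to reach a target size.

```java
class GrowthDemo {
  /** Counts the expansions needed for capacity to reach at least target. */
  static int expansions (int capacity, int target) {
    int count = 0;
    while (capacity < target) {
      capacity = capacity * 2 + 1;   // same rule as ensureCapacity (1)
      ++ count;
    }
    return count;
  }
}
```

Starting from a capacity of 1, expansions (1, 1 << 20) yields 20; starting from the default of 64, expansions (64, 1 << 20) yields 14.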

Next, we read multiple characters in one call:

  public int read (char[] dst, int offset, int length) throws IOException {
    synchronized (lock) {
      int result;
      if (!storing && !restoring) {
        result = reader.read (dst, offset, length);
      } else if (restoring) {
        if (buffer[index] < 0) {
          result = buffer[index ++];
        } else {
          result = (length < stored - index) ? length : stored - index;
          for (int i = 0; i < result; ++ i)
            dst[offset + i] = (char) buffer[index ++];
        }
        if (index >= stored)
          restoring = false;
      } else {
        result = reader.read (dst, offset, length);
        if (result < 0) {
          if (stored + 1 > capacity)
            ensureCapacity (1);
          buffer[stored ++] = result;
        } else {
          if (stored + result > capacity)
            ensureCapacity (result);
          for (int i = 0; i < result; ++ i)
            buffer[stored ++] = dst[offset + i];
        }
      }
      return result;
    }
  }

This method essentially follows the same logic as before. Note, however, the extra code to handle the end of file.

Finally, we test to see if the stream is ready:

  public boolean ready () throws IOException {
    synchronized (lock) {
      return restoring || reader.ready ();
    }
  }

This method returns true if an attempt at reading data from this stream will retrieve data immediately and without blocking (that is to say if we currently are reading from the checkpoint buffer, or else if the attached stream is ready).

The close() method

Here, we look at closing the stream:

  public void close () throws IOException {
    try {
      reader.close ();
    } finally {
      synchronized (lock) {
        storing = restoring = false;
        closed = true;
      }
    }
  }

To close our stream, we first close the attached stream, then synchronize on lock, and finally set all flags appropriately. Consider what might happen if we instead synchronized on the stream before closing it. If one thread calls read() and there are no data available, it will block while holding the synchronization lock. We cannot then close the stream from another thread to wake up the blocked reader, as the close() method would first need to obtain the synchronization lock itself. Synchronizing first, in this manner, would be improper.

The checkpoint() method

Here, we implement the checkpoint operation:

  public void checkpoint () throws IOException {
    synchronized (lock) {
      if (closed)
        throw new IOException ("Stream closed");
      else if (storing)
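        throw new IOException ("Already checkpointed");
      if (restoring) {
        // Move the data not yet re-read to the start of the buffer.
        System.arraycopy (buffer, index, buffer, 0, stored - index);
        stored -= index;
      } else {
        // Nothing is pending, so any previously stored data are stale.
        stored = 0;
      }
      index = 0;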
      storing = true;
    }
  }

In this situation, we must verify that our state is OK and assert the storing flag. Additional logic is required, however, if we currently are restoring from the checkpoint buffer. In that case, we must copy the remaining checkpoint data back to the start of the buffer and update our indices.
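The copy-back step relies on System.arraycopy() behaving correctly when the source and destination regions overlap within one array, which its specification guarantees. A minimal sketch of just that step follows; CompactDemo and compact() are illustrative names rather than part of the class.

```java
class CompactDemo {
  /**
   * Moves the not-yet-re-read region [index, stored) to the front of buffer
   * and returns the new number of stored entries.
   */
  static int compact (int[] buffer, int index, int stored) {
    // Source and destination overlap here; arraycopy copies as if through
    // a temporary array, so no entry is clobbered before it is moved.
    System.arraycopy (buffer, index, buffer, 0, stored - index);
    return stored - index;
  }
}
```

For a buffer holding 10, 20, 30, 40, 50 with index 1 and stored 5, the call returns 4 and the buffer then starts 20, 30, 40, 50.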

The commit() method

Here, we execute the commit operation:

  public void commit () throws IOException {
    synchronized (lock) {
      if (closed)
        throw new IOException ("Stream closed");
      else if (!storing)
        throw new IOException ("Commit without checkpoint");
      storing = false;
    }
  }

All we need do is verify that our state is okay and clear the storing flag.

The rollback() method

Here, we implement the rollback operation:

  public void rollback () throws IOException {
    synchronized (lock) {
      if (closed)
        throw new IOException ("Stream closed");
      else if (!storing)
        throw new IOException ("Rollback without checkpoint");
      storing = false;
      index = 0;
      restoring = (index < stored);
    }
  }

The rollback() method verifies that our state is okay, clears the storing flag, and asserts the restoring flag if data remain in the checkpoint buffer.

The undo() method

Next we perform the undo operation:

  public void undo (int undo) throws IOException {
    synchronized (lock) {
      if (closed)
        throw new IOException ("Stream closed");
      else if (!storing)
        throw new IOException ("Undo without checkpoint");
      else if (undo < 0)
        throw new IOException ("Negative undo (" + undo + ")");
      if (restoring) {
        if (undo > index)
          throw new IOException ("Undo overflow");
        index -= undo;
      } else {
        if (undo > stored)
          throw new IOException ("Undo overflow");
        index = stored - undo;
        restoring = (index < stored);
      }
    }
  }

This operation reverts partially into the checkpoint buffer without performing a full rollback. The logic is mildly complicated by the need to handle two cases: we may or may not already be reading from the checkpoint buffer.

The peek() method

Here, we implement a peek operation:

  public int peek () throws IOException {
    synchronized (lock) {
      if (closed)
        throw new IOException ("Stream closed");
      else if (!storing)
        throw new IOException ("Peek without checkpoint");
      int result = read ();
      undo (1);
      return result;
    }
  }

This operation lets you peek ahead a single character into the future. We implement it in the obvious manner, with a call to read() followed by a call to undo().

The other methods

Other methods inherited from the superclass include the following:

  • The full-array read() method simply calls our overridden subarray read() method.

  • The skip() method is implemented in the superclass as a series of calls to read(). Note: We cannot provide a significantly more efficient implementation, as we may still have to store the data in our checkpoint buffer.

  • The mark(), markSupported(), and reset() methods inherit from the superclass their default unimplemented status.

Using this stream

A simple example of the use of this stream follows:

public static void gibber (String filename) throws IOException {
  int chr;
  FileReader reader = new FileReader (filename);
  UndoReader undo = new UndoReader (reader);
  undo.checkpoint ();
  for (int i = 0; i < 8; ++ i)
    System.out.print ((char) undo.read ());
  undo.undo (4);
  for (int i = 0; i < 4; ++ i)
    System.out.print ((char) undo.read ());
  undo.rollback ();
  for (int i = 0; i < 8; ++ i)
    System.out.print ((char) undo.read ());
  undo.checkpoint ();
  for (int i = 0; i < 4; ++ i)
    System.out.print ((char) undo.read ());
  undo.commit ();
  while ((chr = undo.read ()) >= 0)
    System.out.print ((char) chr);
  undo.close ();
  System.out.flush ();
}

Here, we use the UndoReader class to gibber out the contents of a file.

To do so, we open the file, create an UndoReader with the default initial buffer size, and immediately checkpoint the stream. We then read and print out eight characters, realize we were partially in error, and undo four of the reads. Next we perform four more reads (rereading what we read before) and, realizing we were totally wrong, roll back the stream. We read out the eight characters we've already seen and checkpoint the stream again. We then read out four more characters, decide this is correct, and commit the reads. Finally, we read out the rest of the file and close our stream.
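To see what that sequence produces, here is a self-contained trace. Note that MiniUndoReader is a deliberately simplified stand-in that I wrote for this sketch, not the UndoReader class itself: it is single-threaded, supports only single-character reads, does no error checking, and does not store the EOF. The trace() method performs the same sequence of calls as gibber() but reads from a string and collects the output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class GibberTrace {
  /** Simplified stand-in for UndoReader (assumption: no EOF storage, no checks). */
  static class MiniUndoReader {
    private final Reader reader;
    private final StringBuilder buffer = new StringBuilder ();
    private int index;            // next replay position in buffer
    private boolean storing;

    MiniUndoReader (Reader reader) { this.reader = reader; }

    int read () throws IOException {
      if (index < buffer.length ())
        return buffer.charAt (index ++);   // replay stored data
      int chr = reader.read ();
      if (storing && chr >= 0) {
        buffer.append ((char) chr);        // remember for a later undo
        ++ index;
      }
      return chr;
    }
    void checkpoint () {
      buffer.delete (0, index);            // keep only data not yet re-read
      index = 0;
      storing = true;
    }
    void commit () { storing = false; }
    void rollback () { storing = false; index = 0; }
    void undo (int count) { index -= count; }
  }

  /** Runs the same call sequence as gibber() and returns what it would print. */
  static String trace (String input) {
    try {
      MiniUndoReader undo = new MiniUndoReader (new StringReader (input));
      StringBuilder out = new StringBuilder ();
      undo.checkpoint ();
      for (int i = 0; i < 8; ++ i) out.append ((char) undo.read ());
      undo.undo (4);
      for (int i = 0; i < 4; ++ i) out.append ((char) undo.read ());
      undo.rollback ();
      for (int i = 0; i < 8; ++ i) out.append ((char) undo.read ());
      undo.checkpoint ();
      for (int i = 0; i < 4; ++ i) out.append ((char) undo.read ());
      undo.commit ();
      int chr;
      while ((chr = undo.read ()) >= 0) out.append ((char) chr);
      return out.toString ();
    } catch (IOException ex) {
      throw new UncheckedIOException (ex);
    }
  }
}
```

Given the input "0123456789abcdefghij", trace() returns "0123456745670123456789abcdefghij": the first eight characters, the last four of those again, all eight once more, and then the remainder of the input exactly once.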

Class DataReader

The DataReader class is a character-stream filter that provides capabilities for reading all the primitive Java types from a textual data source. It has two modes of operation: default mode and line mode. In default mode, it reads data from the source in a continuous stream. In line mode, however, it reports each time it reaches the end of a line of input.

Figure 4. Using class DataReader
  • Internally, an UndoReader is used to support data un-reading.

  • To read any of the primitive datatypes, we skip any immediate whitespace, then read a sequence of nonwhitespace characters, then parse the result as the appropriate data type. If the data is invalid, we undo the reads we performed and throw the resulting exception.

  • We provide additional methods to read a word (a sequence of nonwhitespace characters) and to read a line.

  • If, while reading any of the primitive datatypes, we reach the end of the file without reading any nonwhitespace characters, we throw an EOFException.

  • If, in line mode, we reach the end of a line without reading any nonwhitespace characters, we then consume the end-of-line characters and throw an end of line (EOL) exception (EOLException -- a specialization of the EOFException). Future reads will proceed from the next line.

The class definition

We start by looking at our class definition:

package org.merlin.io;
import java.io.*;
import java.util.*;
import org.merlin.lang.*;
public class DataReader extends FilterReader {
  protected static final int EOF = -1, CR = '\r', LF = '\n';
  
  protected UndoReader undo;
  protected int whitespace;
  protected boolean lineMode;
  
  public DataReader (Reader reader) {
    this (reader, false);
  }
  public DataReader (Reader reader, boolean lineMode) {
    super (new UndoReader (reader));
    lock = reader;
    undo = (UndoReader) in;
    setLineMode (lineMode);
  }
  public void setLineMode (boolean lineMode) {
    synchronized (lock) {
      this.lineMode = lineMode;
      if (lineMode) {
        whitespace = WHITESPACE_NOT_CRLF;
      } else {
        whitespace = WHITESPACE;
      }
    }
  }
  ...
}

Because this character-stream filter adds capabilities to Reader without modifying its basic functions, we inherit from FilterReader. We thus inherit implementations of all the standard Reader methods, each of which invokes the corresponding method of the UndoReader undo.

You will note, however, that we reassign lock to have the value reader (the underlying attached stream). The problem is, the Reader locking mechanism doesn't extend well to long chains of Readers. Consider if we didn't do this: Our synchronization lock would be undo, whereas both undo and reader would use reader as a lock. It clearly is more efficient if we also synchronize on reader. Admittedly, we could simply ignore the lock variable. However, it's more obvious to do things this way.

Figure 5. Synchronization in long FilterReader chains

The two constructors we provide allow line mode to be specified, calling through to the public setLineMode() method. As it turns out, line mode is simply a matter of what we consider whitespace to be, so we assign the whitespace variable accordingly. The magic will become clear later on.

The readLine() method

We now look at the operation of reading a single line of text:

  public String readLine () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String line = readExp (~CRLF, ZERO_OR_MORE);
        int chr = undo.read ();
        if ((chr == CR) || (chr == LF)) {
          if ((chr == CR) && (undo.read () != LF))
            undo.undo (1);
        } else if ((chr == EOF) && "".equals (line)) {
          throw new EOFException ("EOF reading line");
        } else {
          undo.undo (1);
        }
        undo.commit ();
        return line;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }

The readLine() method starts out by synchronizing on lock and checkpointing our UndoReader. We then call on the readExp() method to read a line of text. This method -- details still to come -- reads a specified volume of a specified type of character.

We define a line of text as a character sequence up to, but not including, the EOF or EOL. We thus specify the character type ~CRLF (not carriage return or line feed), and the character volume ZERO_OR_MORE (as many as possible). The result of this method will be a line of text.

If it turns out that we've actually reached the EOF without reading anything, we just throw an EOFException. We must otherwise consume the EOL. There are three end-of-line sequences to handle: LF (Unix), CR/LF (DOS) and CR (MacOS). We consume any of these combinations with a sequence of read and/or undo operations: If the immediate character is a carriage return, we check to see if it is followed by a line feed; if so, we eat two characters, otherwise just one.
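The end-of-line rule can be tried in isolation. The sketch below is not readLine() itself: it uses a standard PushbackReader with a one-character pushback in place of UndoReader, and the names EolDemo and lines() are hypothetical. It applies the same rule: after a carriage return, look at the next character and consume it only if it is a line feed.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class EolDemo {
  /** Splits text into lines, accepting LF, CR/LF, and CR as line endings. */
  static List<String> lines (String text) {
    try {
      PushbackReader in = new PushbackReader (new StringReader (text), 1);
      List<String> result = new ArrayList<> ();
      StringBuilder line = new StringBuilder ();
      int chr;
      while ((chr = in.read ()) >= 0) {
        if ((chr == '\r') || (chr == '\n')) {
          if (chr == '\r') {
            int next = in.read ();           // peek past the carriage return
            if ((next != '\n') && (next >= 0))
              in.unread (next);              // lone CR: give the character back
          }
          result.add (line.toString ());
          line.setLength (0);
        } else {
          line.append ((char) chr);
        }
      }
      if (line.length () > 0)
        result.add (line.toString ());       // final line without a terminator
      return result;
    } catch (IOException ex) {
      throw new UncheckedIOException (ex);
    }
  }
}
```

For example, lines ("a\nb\r\nc\rd") yields the four lines a, b, c, and d.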

Now, this isn't the proper way to do this (yeah, yeah). On poor MacOS we will always end up reading one extra character, which can cause unwanted artifacts when reading from the keyboard. The proper implementation would be to assert a flag instructing us to ignore the next character if it is a line feed. Doing so, however, involves relatively complex interactions with UndoReader (which I'm unable to fathom at this particular moment).

Otherwise, after reading our line and consuming the EOL, we commit our reads and return our result -- then we're done. If, however, an IOException occurs, we roll back the reads and rethrow the exception. This means we can try again later to read our datum if a transient error occurs, such as an InterruptedIOException resulting from a Socket timeout (presuming that bytesTransferred is zero). Magic! All the justification a pedant could possibly want for the UndoReader class!

The readWord() method

We now look at the operation of reading a single word of text:

  public String readWord () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String value = readWordString ("word");
        undo.commit ();
        return value;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  protected String readWordString (String what) throws IOException {
    prepare (what);
    return readExp (~WHITESPACE, ZERO_OR_MORE);
  }
  
  protected void prepare (String what) throws IOException {
    readExp (whitespace, ZERO_OR_MORE);
    int chr = undo.read ();
    if (chr == EOF) {
      throw new EOFException ("EOF reading " + what);
    } else if ((chr == CR) || (chr == LF)) {
      if ((chr == CR) && (undo.read () != LF))
        undo.undo (1);
      throw new EOLException ("EOL reading " + what);
    }
    undo.undo (1);
  }

Figure 6. Internals of the DataReader class

Our readWord() method is broadly similar to readLine() but relies on another helper method, readWordString(), to do the reading.

We define reading a word as: skip any immediate whitespace, and then read the following sequence of nonwhitespace characters. So, the readWordString() method calls prepare() to skip any immediate whitespace and then calls the readExp() "wonder method" to read as many ~WHITESPACE (nonwhitespace) characters as it can.

The prepare() method is interesting, if I do say so myself. We skip leading whitespace using the readExp() method (is there nothing readExp() can't do?). Recall our definition of whitespace, however. In stream mode, it is WHITESPACE (any whitespace); but in line mode, it is WHITESPACE_NOT_CRLF (any whitespace but the end of line). So, in stream mode we'll skip all whitespace, but in line mode, we'll stop at the end of a line.

We can therefore peek at the next character and determine how to proceed. If the next character is the EOF, we've reached the end of the file. This method is called to prepare for reading a datum, and the EOF means there will be nothing to read, so we accordingly throw an EOFException. Otherwise, if the next character is an end-of-line character (this can only occur in line mode) then we must consume the EOL (recall our earlier discussion of EOL consumption) and throw an EOLException (a specialization of EOFException). After all that, if nothing has happened, the next character is a valid, nonwhitespace character suitable for use by the caller. We therefore unread it and return.

Now, take a gander back at the readWord() method: If an EOFException occurs, it commits the read (we want to approve the reads that located and consumed the end of file or line) before rethrowing the exception.

The readInteger() methods

We now look at the operation of reading integer values of various widths:

  public long readLong () throws IOException, NumberFormatException {
    return readInteger ("long", Long.MIN_VALUE, Long.MAX_VALUE);
  }
  public int readInt () throws IOException, NumberFormatException {
    return (int) readInteger ("int", Integer.MIN_VALUE, Integer.MAX_VALUE);
  }
  public short readShort () throws IOException, NumberFormatException {
    return (short) readInteger ("short", Short.MIN_VALUE, Short.MAX_VALUE);
  }
  public byte readByte () throws IOException, NumberFormatException {
    return (byte) readInteger ("byte", Byte.MIN_VALUE, Byte.MAX_VALUE);
  }
  protected long readInteger (String what, long min, long max)
      throws IOException, NumberFormatException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String value = readIntegerString (what);
        long result = Long.parseLong (value);
        if ((result < min) || (result > max))
          throw new NumberFormatException (value);
        undo.commit ();
        return result;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      } catch (NumberFormatException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  protected String readIntegerString (String what) throws IOException {
    return readWordString (what);
  }

The readLong(), readInt(), readShort(), and readByte() methods all call on the readInteger() method to read an integer value within a specified range.

The readInteger() method has a similar form to readWord(), but uses the helper method readIntegerString() to read the textual number and the Long class to parse the result. If a NumberFormatException occurs as a result of parsing the number, or if we must throw a NumberFormatException because the resulting value is out of range for the width we're reading, then we roll back our reads before passing on the exception. A possible enhancement would be a custom number parser that throws a specialization of NumberFormatException in case of a range error.
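The enhancement mentioned above might look something like the following sketch. NumberRangeException, parseInRange(), and classify() are names invented here for illustration; nothing like them exists in the library as described. The point is simply that a caller can then tell a syntactically bad number from a well-formed one that doesn't fit.

```java
class RangeCheckDemo {
  /** Hypothetical specialization of NumberFormatException for range errors. */
  static class NumberRangeException extends NumberFormatException {
    NumberRangeException (String value) { super (value); }
  }

  /** Parses value as a long and checks that it lies within [min, max]. */
  static long parseInRange (String value, long min, long max) {
    long result = Long.parseLong (value);   // NumberFormatException on bad syntax
    if ((result < min) || (result > max))
      throw new NumberRangeException (value);
    return result;
  }

  /** Reports how value fares when read as a byte: "ok", "range", or "syntax". */
  static String classify (String value) {
    try {
      parseInRange (value, Byte.MIN_VALUE, Byte.MAX_VALUE);
      return "ok";
    } catch (NumberRangeException ex) {     // must precede its superclass
      return "range";
    } catch (NumberFormatException ex) {
      return "syntax";
    }
  }
}
```

Here classify ("127") reports "ok", classify ("128") reports "range", and classify ("12x") reports "syntax".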

The readIntegerString() method simply calls through to readWordString() to read the textual number. However, I propose an alternative:

  protected String readIntegerString (String what) throws IOException {
    prepare (what);
    StringBuffer buffer = new StringBuffer ();
    buffer.append (readExp ('-', ZERO_OR_ONE));
    buffer.append (readExp (DIGITS, ZERO_OR_MORE));
    return buffer.toString ();
  }

My alternative addresses the fact that the API's integer-parsing methods understand only numbers consisting of an optional leading minus sign followed by a sequence of digits. This implementation of readIntegerString() therefore restricts what it will read to exactly this format (using the wondrous readExp() method). The advantage? It will read the leading 10 from "10hello". The disadvantage? It will read the leading 10 from "10.3". Overall, not a good idea, except for certain limited circumstances.
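The trade-off is easy to demonstrate on plain strings. This sketch (LeadingIntegerDemo is a made-up name, and I am assuming DIGITS means the ASCII digits 0 through 9) extracts the same prefix the alternative readIntegerString() would consume: an optional minus sign followed by as many digits as possible.

```java
class LeadingIntegerDemo {
  /** Returns the longest prefix of text made of an optional '-' and ASCII digits. */
  static String leadingInteger (String text) {
    int end = 0;
    if ((end < text.length ()) && (text.charAt (end) == '-'))
      ++ end;                                  // zero or one minus sign
    while ((end < text.length ())
        && (text.charAt (end) >= '0') && (text.charAt (end) <= '9'))
      ++ end;                                  // zero or more digits
    return text.substring (0, end);
  }
}
```

Both leadingInteger ("10hello") and leadingInteger ("10.3") return "10", which is exactly the advantage and the disadvantage described above.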

The readCardinal() methods

Next, we look at the operation of reading cardinal values of various widths:

  public long readUnsignedInt () throws IOException, NumberFormatException {
    return readCardinal ("unsigned int", (1L << 32) - 1);  // largest 32-bit unsigned value
  }
  
  public int readUnsignedShort () throws IOException, NumberFormatException {
    return (int) readCardinal ("unsigned short", (1 << 16) - 1);  // 65535
  }
  public short readUnsignedByte () throws IOException, NumberFormatException {
    return (short) readCardinal ("unsigned byte", (1 << 8) - 1);  // 255
  }
  protected long readCardinal (String what, long max)
      throws IOException, NumberFormatException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String value = readCardinalString (what);
        long result = Long.parseLong (value);
        if ((result < 0) || (result > max))
          throw new NumberFormatException (value);
        undo.commit ();
        return result;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      } catch (NumberFormatException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  protected String readCardinalString (String what) throws IOException {
    return readWordString (what);
  }

The readUnsignedInt(), readUnsignedShort(), and readUnsignedByte() methods all have a similar form to the integer methods, but use the helper method readCardinal() to decode and range check the value.

And, as before, the readCardinal() method calls through to readCardinalString(), which calls through to readWordString() to read the textual number. The alternative would be as follows:

  protected String readCardinalString (String what) throws IOException {
    prepare (what);
    StringBuffer buffer = new StringBuffer ();
    buffer.append (readExp (DIGITS, ZERO_OR_MORE));
    return buffer.toString ();
  }

The readDecimal() methods

We now look at the operation of reading decimal values of various widths:

  public double readDouble () throws IOException, NumberFormatException {
    return readDecimal ("double");
  }
  public float readFloat () throws IOException, NumberFormatException {
    return (float) readDecimal ("float");
  }
  
  protected double readDecimal (String what) throws IOException, NumberFormatException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String value = readDecimalString (what);
        double result = Double.valueOf (value).doubleValue ();
        undo.commit ();
        return result;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      } catch (NumberFormatException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  protected String readDecimalString (String what) throws IOException {
    return readWordString (what);
  }

The readDouble() and readFloat() methods are similar to all the above methods, but use the helper method readDecimal() to read and decode the number. Note that we don't need to do range checking here because range errors are represented as infinities by the floating-point datatypes.
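A quick check of that last point, using only standard calls (the class and method names are mine): a textual value too large for a double parses to infinity rather than raising a NumberFormatException, and narrowing a large but finite double to a float likewise yields infinity.

```java
class OverflowDemo {
  /** Parses value the same way readDecimal() does. */
  static double parse (String value) {
    return Double.valueOf (value).doubleValue ();
  }

  /** Parses value and narrows the result to a float, as readFloat() does. */
  static float narrow (String value) {
    return (float) parse (value);
  }
}
```

So parse ("1e999") is positive infinity, while parse ("1e39") is an ordinary double that becomes infinite only once narrow ("1e39") casts it to a float.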

Again, the readDecimal() method calls through to readDecimalString(), which in turn calls through to readWordString() to read the textual number.

The alternative would be as follows:

  protected String readDecimalString (String what) throws IOException {
    prepare (what);
    StringBuffer buffer = new StringBuffer ();
    buffer.append (readExp (SIGN, ZERO_OR_ONE));
    buffer.append (readExp (DIGITS, ZERO_OR_MORE));
    buffer.append (readExp ('.', ZERO_OR_ONE));
    buffer.append (readExp (DIGITS, ZERO_OR_MORE));
    if (buffer.length () > 0) {
      int chr = undo.read ();
      if ((chr == 'e') || (chr == 'E')) {
        buffer.append ((char) chr);
        buffer.append (readExp (SIGN, ZERO_OR_ONE));
        buffer.append (readExp (DIGITS, ZERO_OR_MORE));
      } else {
        undo.undo (1);
      }
    }
    return buffer.toString ();
  }

Here, we handle an optional leading sign (plus or minus) followed by digits, a period, digits, and an optional exponent -- this being the format permitted by Double and Float.
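
To see the shape of that scan in isolation, here is a minimal sketch that applies the same sequence of steps to a plain String instead of an UndoReader. The class DecimalScanSketch and its methods are hypothetical names chosen for this illustration.

```java
class DecimalScanSketch {
  // Returns the longest prefix of text of the form:
  //   [sign] digits* ['.'] digits* [ ('e' | 'E') [sign] digits* ]
  static String scanDecimal (String text) {
    int pos = skipSign (text, 0);
    pos = skipDigits (text, pos);
    if ((pos < text.length ()) && (text.charAt (pos) == '.'))
      ++ pos;
    pos = skipDigits (text, pos);
    // Only look for an exponent if something has been read so far.
    if ((pos > 0) && (pos < text.length ())) {
      char chr = text.charAt (pos);
      if ((chr == 'e') || (chr == 'E')) {
        pos = skipSign (text, pos + 1);
        pos = skipDigits (text, pos);
      }
    }
    return text.substring (0, pos);
  }

  static int skipSign (String text, int pos) {
    if ((pos < text.length ()) && ((text.charAt (pos) == '-') || (text.charAt (pos) == '+')))
      ++ pos;
    return pos;
  }

  static int skipDigits (String text, int pos) {
    while ((pos < text.length ()) && Character.isDigit (text.charAt (pos)))
      ++ pos;
    return pos;
  }

  public static void main (String[] args) {
    System.out.println (scanDecimal ("-12.5e3 rest"));
  }
}
```

Like the listing above, this scan is deliberately loose: it will happily return a lone sign or period, and it is the later Double.valueOf() call that rejects such strings with a NumberFormatException.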

The readBoolean() method

We now look at Boolean values:

  protected static final Hashtable booleans = new Hashtable ();
  
  static {
    booleans.put ("true", Boolean.TRUE);
    booleans.put ("false", Boolean.FALSE);
  }
  public boolean readBoolean () throws IOException, BooleanFormatException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        String value = readWordString ("boolean");
        Boolean result = (Boolean) booleans.get (value.toLowerCase ());
        if (result == null)
          throw new BooleanFormatException (value);
        undo.commit ();
        return result.booleanValue ();
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      } catch (BooleanFormatException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }

The readBoolean() method is unsurprising, throwing a BooleanFormatException (a specialization of NumberFormatException) in case of an invalid datum. Note that we use a Hashtable to store the valid (case-insensitive) textual Boolean representations. A possible additional feature: The ability to specify alternative textual values, on a per-stream basis.
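
One way such a per-stream feature might look is sketched below. The BooleanWords class is purely hypothetical and is not part of the DataReader described here; it simply shows an instance-level, case-insensitive table that starts with "true" and "false" and accepts extra entries.

```java
import java.util.Hashtable;

class BooleanWords {
  // One table per instance, so each stream could carry its own vocabulary.
  private final Hashtable<String, Boolean> words = new Hashtable<String, Boolean> ();

  BooleanWords () {
    add ("true", true);
    add ("false", false);
  }

  // Registers an alternative textual value; lookups are case-insensitive.
  void add (String word, boolean value) {
    words.put (word.toLowerCase (), Boolean.valueOf (value));
  }

  // Returns the Boolean for a word, or null if the word is not recognized.
  Boolean lookup (String word) {
    return words.get (word.toLowerCase ());
  }

  public static void main (String[] args) {
    BooleanWords table = new BooleanWords ();
    table.add ("yes", true);
    table.add ("no", false);
    System.out.println (table.lookup ("YES"));
    System.out.println (table.lookup ("maybe"));
  }
}
```

A null result plays the same role as in readBoolean(): it is the caller's cue to roll back and throw a BooleanFormatException.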

The readChar() methods

In addition to the standard read() method, we want to expose some type-specific readChar() methods:

  public char readChar () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        prepare ("char");
        int chr = undo.read ();
        undo.commit ();
        return (char) chr;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  public char readAnyChar () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        int chr = undo.read ();
        if (chr == EOF) {
          throw new EOFException ("EOF reading any char");
        } else if (lineMode && ((chr == CR) || (chr == LF))) {
          if ((chr == CR) && (undo.read () != LF))
            undo.undo (1);
          throw new EOLException ("EOL reading any char");
        }
        undo.commit ();
        return (char) chr;
      } catch (EOFException ex) {
        undo.commit ();
        throw ex;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }

The readChar() method uses the techniques outlined above to read the next nonwhitespace character. We choose to skip leading whitespace because doing so is most consistent with the other data-reading methods. In case the caller wants to read whitespace, however, we provide a method readAnyChar() that returns the next character, whether or not it is whitespace. This method needs extra logic to handle the EOL and EOF correctly in line mode.
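
The difference between the two behaviors can be sketched against a plain Reader. The class CharReadSketch and the method readNonWhitespace() are hypothetical names for this illustration, and the sketch ignores line mode and the undo machinery entirely.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CharReadSketch {
  // Returns the next nonwhitespace character, or -1 at the EOF.
  static int readNonWhitespace (Reader in) throws IOException {
    int chr;
    do {
      chr = in.read ();
    } while ((chr >= 0) && Character.isWhitespace ((char) chr));
    return chr;
  }

  public static void main (String[] args) throws IOException {
    Reader in = new StringReader ("  x y");
    System.out.println ((char) readNonWhitespace (in)); // skips the leading spaces
    System.out.println (in.read () == ' ');             // a raw read sees the space
    System.out.println ((char) readNonWhitespace (in));
  }
}
```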

The peek() methods

With our extensive undo capabilities, we can expose a useful peek operation that allows the caller to look ahead and see what's coming up next. This is frequently useful for scanning through data:

  public int peek () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        readExp (whitespace, ZERO_OR_MORE);
        int chr = undo.peek ();
        undo.rollback ();
        return chr;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  public int peekAny () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        int chr = undo.peek ();
        undo.rollback ();
        return chr;
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  public boolean peekEOL () throws IOException {
    int chr = peek ();
    return (chr == LF) || (chr == CR) || (chr == EOF);
  }
  public boolean peekEOF () throws IOException {
    int chr = peek ();
    return (chr == EOF);
  }

As above, we provide two variants: The peek() method returns the next nonwhitespace character -- this may be the EOF, or, in line mode, CR or LF. The peekAny() method returns the next character, whether or not it is whitespace.

For convenience, we also provide the methods peekEOF() and peekEOL(). The former returns whether the EOF is coming up; the latter returns whether the EOF or, in line mode, the EOL is coming up. Both of these methods ignore leading whitespace, so you can use them to determine whether your next attempt at reading a datum would result in an EOFException (or EOLException).
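
For comparison, a single-character peek can be approximated on any standard Reader that supports marking. The sketch below swaps in mark() and reset() for the checkpoint() and rollback() calls used above, so it is a different technique from the UndoReader one; the class PeekSketch is a hypothetical name.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PeekSketch {
  // Looks at the next character without consuming it; returns -1 at the EOF.
  // Assumes the supplied Reader supports mark() and reset().
  static int peekAny (Reader in) throws IOException {
    in.mark (1);
    int chr = in.read ();
    in.reset ();
    return chr;
  }

  public static void main (String[] args) throws IOException {
    Reader in = new StringReader ("ab");
    System.out.println ((char) peekAny (in)); // looks at 'a'
    System.out.println ((char) in.read ());   // still reads 'a'
    System.out.println ((char) peekAny (in)); // now looks at 'b'
  }
}
```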

The skip() methods

The ability to skip over high-level data is also useful:

  public boolean skipWhitespace () throws IOException {
    synchronized (lock) {
      undo.checkpoint ();
      try {
        readExp (whitespace, ZERO_OR_MORE);
        int chr = undo.peek ();
        undo.commit ();
        return (chr != LF) && (chr != CR) && (chr != EOF);
      } catch (IOException ex) {
        undo.rollback ();
        throw ex;
      }
    }
  }
  
  public boolean skipLine () throws IOException {
    synchronized (lock) {
      try {
       readLine ();
      } catch (EOFException eof) {
        return false;
      }
      return true;
    }
  }
  public boolean skipEOL () throws IOException {
    synchronized (lock) {
      if (peekEOL ()) {
        skipLine ();
        return true;
      } else {
        return false;
      }
    }
  }

The skipWhitespace() method calls readExp() to skip any immediate whitespace and then returns true if the next character is neither the EOF nor, in line mode, the EOL; in other words, it returns true if something useful could be read next.

The skipLine() method calls readLine() to discard the immediate line of text and then returns true if data were read successfully, or else false at the EOF. Similarly, the skipEOL() method calls peekEOL() to determine if we are at the end of a line. If so, it skips the line and returns true; otherwise, it simply returns false.

The readExp() method

Finally, here's the real workhorse of this class:

  protected static final int
    WHITESPACE = 0x10000,
    WHITESPACE_NOT_CRLF = 0x10001,
    CRLF = 0x10002,
    DIGITS = 0x10003,
    SIGN = 0x10004;
  
  protected static final int
    ZERO_OR_MORE = 0,
    ZERO_OR_ONE = 1;
  protected final char[] buffer = new char[16];
  
  protected String readExp (int type, int volume) throws IOException {
    StringBuffer result = new StringBuffer ();
    int index, amount, max = (volume == ZERO_OR_ONE) ? 1 : buffer.length;
    boolean invert = (type < 0);
    if (invert)
      type = ~type;
    do {
      index = 0;
      amount = undo.read (buffer, 0, max);
      while ((index < amount) && (invert ^ isMatch (buffer[index], type)))
        ++ index;
      result.append (buffer, 0, index);
    } while ((amount >= 0) && (index >= amount) && (volume == ZERO_OR_MORE));
    
    undo.undo ((amount < 0) ? 1 : amount - index);
    
    return result.toString ();
  }
  protected boolean isMatch (char chr, int type) {
    boolean result;
    switch (type) {
      case WHITESPACE:
        result = Character.isWhitespace (chr);
        break;
      case WHITESPACE_NOT_CRLF:
        result = (chr != CR) && (chr != LF) && Character.isWhitespace (chr);
        break;
      case CRLF:
        result = (chr == CR) || (chr == LF);
        break;
      case DIGITS:
        result = Character.isDigit (chr);
        break;
      case SIGN:
        result = (chr == '-') || (chr == '+');
        break;
      default:
        result = (type == chr);
        break;
    }
    return result;
  }

The readExp() method reads a specified volume of a specified character type. The volume can be either ZERO_OR_ONE or ZERO_OR_MORE. The type can be either an individual character or one of the various defined types. To match everything except a given type or character, specify ~type.

To accomplish this, we use a straightforward loop that reads through a buffer for efficiency until we either encounter the EOF or a character that is not what we want, or we've read a sufficient volume. When the loop finishes, we undo any unneeded reads. To test for character/type equality, we use a helper method, isMatch(), that itself makes use of the Character class.
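
The ~type convention works because every valid type is a nonnegative int, so its bitwise complement is always negative and can be told apart by sign alone. The following standalone sketch isolates that decoding step; InvertSketch and matches() are hypothetical names, and isMatch() is cut down to a single defined type.

```java
class InvertSketch {
  static final int DIGITS = 0x10003; // same value as in the listing above

  // Reduced version of isMatch(): one defined type plus literal characters.
  static boolean isMatch (char chr, int type) {
    return (type == DIGITS) ? Character.isDigit (chr) : (type == chr);
  }

  // A negative type means "anything but"; complementing it again recovers the type.
  static boolean matches (char chr, int type) {
    boolean invert = (type < 0);
    if (invert)
      type = ~type;
    return invert ^ isMatch (chr, type);
  }

  public static void main (String[] args) {
    System.out.println (matches ('7', DIGITS));
    System.out.println (matches ('7', ~DIGITS));
    System.out.println (matches ('a', ~','));
  }
}
```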

How to use this stream

A simple example of this stream's utility follows:

import java.io.*;
import org.merlin.io.*;
public class DataReaderTest {
  public static void main (String[] args) throws IOException {
    DataReader data = new DataReader (new FileReader (FileDescriptor.in));
    boolean lineMode = data.readBoolean ();
    data.setLineMode (lineMode);
    while (!data.peekEOF ()) {
      while (!data.skipEOL ()) {
        int value = data.readInt ();
        System.out.print (value + " ");
      }
      System.out.println ("EOL");
    }
    System.out.println ("EOF");
  }
}

In this case we use the DataReader class to read input from the user.

To start, we read a Boolean and set the DataReader line mode to this value. We then sit in an outer loop until we locate the EOF. We then sit in an inner loop until we locate (and skip) the EOL. Inside this inner loop, we read integers and print them out -- magic! In line mode, this will repeat every line of numbers you enter followed by "EOL", followed ultimately by "EOF" (when you type ^D [Unix] or ^Z [DOS]). In stream mode it will not notice the EOL.

The Exception classes

Finally, here's the code for our two Exception classes:

Class EOLException

This is a specialization of EOFException to indicate that the end of the line has been reached:

package org.merlin.io;
import java.io.*;
public class EOLException extends EOFException {
  public EOLException () {
  }
  
  public EOLException (String detail) {
    super (detail);
  }
}

Class BooleanFormatException

This is a specialization of NumberFormatException to indicate that there was an error parsing a Boolean:

package org.merlin.lang;
public class BooleanFormatException extends NumberFormatException {
  public BooleanFormatException () {
  }
  public BooleanFormatException (String detail) {
    super (detail);
  }
}
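
Because EOLException specializes EOFException, a caller can treat the two separately or together simply by choosing which catch clauses to write. The sketch below shows the ordering that distinguishes them. It declares its own nested stand-in for EOLException so that it compiles without the org.merlin.io package; EolCatchSketch and classify() are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;

class EolCatchSketch {
  // Stand-in for org.merlin.io.EOLException, declared here for self-containment.
  static class EOLException extends EOFException {
    EOLException (String detail) {
      super (detail);
    }
  }

  // The most specific catch clause must come first.
  static String classify (IOException ex) {
    try {
      throw ex;
    } catch (EOLException eol) {
      return "EOL";
    } catch (EOFException eof) {
      return "EOF";
    } catch (IOException io) {
      return "IO";
    }
  }

  public static void main (String[] args) {
    System.out.println (classify (new EOLException ("end of line")));
    System.out.println (classify (new EOFException ("end of file")));
  }
}
```

Dropping the first clause would route an EOLException to the EOFException handler, which is what happens in the catch blocks of readAnyChar() shown earlier.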

Alternative approaches, and why they're no good

As is my wont, I have turgidly solved a problem that could have been addressed in other ways. Let me enumerate (some of) those alternative solutions:

Alternative 1

StreamTokenizer tokenizer = new StreamTokenizer (reader);
tokenizer.eolIsSignificant (true); // without this, TT_EOL is never reported
int tokenType = tokenizer.nextToken ();
switch (tokenType) {
  case StreamTokenizer.TT_EOF:
    // EOF
    break;
  case StreamTokenizer.TT_EOL:
    // EOL
    break;
  case StreamTokenizer.TT_NUMBER:
    double value = tokenizer.nval;
    // ...
    break;
  case StreamTokenizer.TT_WORD:
    String word = tokenizer.sval;
    // ...
    break;
}

Does it work? Yes. Is it what we want? No. The StreamTokenizer class is oriented towards parsing Java source files, and it's very good at that job. However, it isn't a stream filter and so doesn't present the intuitive streams interface we all know and love, nor does it perform all the wondrous tasks we want.
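
To make the comparison concrete, here is a small runnable sketch that drives a StreamTokenizer over a string and records what it reports. The class TokenizerSketch and the method describe() are hypothetical names; the StreamTokenizer calls are the standard java.io ones.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

class TokenizerSketch {
  // Returns a one-line description of every token found in text.
  static String describe (String text) throws IOException {
    StreamTokenizer tokenizer = new StreamTokenizer (new StringReader (text));
    tokenizer.eolIsSignificant (true); // report line ends as TT_EOL
    StringBuffer out = new StringBuffer ();
    int tokenType;
    while ((tokenType = tokenizer.nextToken ()) != StreamTokenizer.TT_EOF) {
      switch (tokenType) {
        case StreamTokenizer.TT_EOL:
          out.append ("EOL ");
          break;
        case StreamTokenizer.TT_NUMBER:
          out.append ("number:" + tokenizer.nval + " ");
          break;
        case StreamTokenizer.TT_WORD:
          out.append ("word:" + tokenizer.sval + " ");
          break;
        default: // an ordinary character, returned as its own token type
          out.append ("char:" + (char) tokenType + " ");
          break;
      }
    }
    return out.append ("EOF").toString ();
  }

  public static void main (String[] args) throws IOException {
    System.out.println (describe ("12 abc\n3.5"));
  }
}
```

Note that every number arrives as a double in nval, so the caller cannot ask for an int, a short, or a float the way the readInt() and readFloat() methods allow.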

Alternative 2

public class CrudeDataReader extends FilterReader {
  public CrudeDataReader (Reader reader) {
    super (reader);
  }
  public int readInt () throws IOException {
    StringBuffer buffer = new StringBuffer ();
    int chr;
    while ((chr = read ()) > ' ')
      buffer.append ((char) chr);
    return Integer.parseInt (buffer.toString ());
  }
  // ...
}

Does it work? Yes. Is it a stream filter? Yes. Can it do everything we want? Yes. Is it efficient? Moderately so. But, is it proper? No, I say! It doesn't handle EOL or EOF properly. Okay, with a bit of work it could. But it will consume a single whitespace after each word. Okay, with PushbackReader (or mark() and reset()) we can fix that. But it still isn't thread safe. Done. And it will consume invalid data. Gotcha!
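
The last complaint is easy to reproduce. The sketch below restates the crude reader under the hypothetical name CrudeReaderSketch, feeds it an invalid token followed by a valid one, and shows that the invalid token is gone by the time the exception reaches the caller.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CrudeReaderSketch extends FilterReader {
  CrudeReaderSketch (Reader reader) {
    super (reader);
  }

  // Same logic as the crude readInt() above: read up to whitespace, then parse.
  int readInt () throws IOException {
    StringBuffer buffer = new StringBuffer ();
    int chr;
    while ((chr = read ()) > ' ')
      buffer.append ((char) chr);
    return Integer.parseInt (buffer.toString ());
  }

  // Attempts to read an int from "abc 42", then reports what the next read sees.
  static String demo () throws IOException {
    CrudeReaderSketch in = new CrudeReaderSketch (new StringReader ("abc 42"));
    try {
      in.readInt ();
      return "unexpected success";
    } catch (NumberFormatException ex) {
      // "abc" and the space after it have already been consumed.
    }
    return "next=" + in.readInt ();
  }

  public static void main (String[] args) throws IOException {
    System.out.println (demo ());
  }
}
```

With the checkpoint and rollback approach, the failed read would leave "abc" in the stream for the caller to inspect or skip deliberately.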

Conclusion: Final thoughts

I guess it all boils down to one question: Do you do it hastily or do you do it properly?

Propriety being, of course, a subjective quality.

Many of us live in the real world, where financial concerns sometimes force us to do things hastily. However, I would like to argue that if you are developing the plumbing for an application -- and a stream of this nature is part of that plumbing -- you must do it properly. Your code will be used and reused over the course of time; if you skimp now the drains will surely back up one day and someone will have to lay the plumbing all over again. I can guarantee that had I implemented this class improperly, someone (maybe me), someday, would have cursed its inadequacies. After all, every one of us curses the occasional improprieties of the JDK.

Now that this article is finished, we have two proper (my definition) classes that serve useful, reusable functions. I hope you'll find them serviceable. I believe they address basic needs that the standard JDK classes don't.

Merlin appears to be at once the prophet, the solitary who lives in the woods, and the bastard child who knows the first principles of the world. Or at least, so it is written. He's also lead author of Java Network Programming, Second Edition, due out shortly.

Learn more about this topic

  • Download the complete source code for this article as a zip file http://www.javaworld.com/jw-04-1999/step/jw-04-step.zip
  • "Use the two 'R's of Java 1.1 -- Readers and Writers" by Todd Sundsted (JavaWorld November 1997) http://www.javaworld.com/jw-11-1997/jw-11-howto.html
  • "Waging war on electronic junk mail" by Todd Sundsted (JavaWorld October 1997) http://rwanda.wpi.com/javaworld/jw-10-1997/jw-10-howto.html
  • Merlin Hughes, Michael Shoffner, and Derek Hamner's Java Network Programming, Second Edition covers the stream classes in detail http://www.manning.com/Hughes/007.html
  • The Java Developer Connection(SM) TechTips often cover I/O-related issues http://developer.java.sun.com/developer/javaInDepth/TechTips/
  • The JDK 1.1 documentation includes coverage of the character streams http://java.sun.com/products/jdk/1.1/docs/guide/io/
  • Read Merlin's previous Java Step by Step columns http://www.javaworld.com/topicalindex/jw-ti-step.html