Waging war on electronic junk mail

Put Java on the front line in the war against electronic junk mail

Sound familiar? These are but a few of the numerous (and often offensive) unsolicited e-mails I received this past week, and they inspired me to write this column.

For those of you lucky enough to have avoided the electronic junk mail epidemic, let me tell you, it's a real problem. And this month, we're going to tackle it head-on -- with Java.

Just as in columns past, we'll begin with a quick look at the problem and discuss its solution. Then I'll introduce you to the parts of the Java class library that we'll use to implement the solution. Finally, we'll work through the solution.

Staking out the enemy

There's no escaping the reality of electronic junk mail, so let's take a moment to think about how we can minimize its intrusion into our lives.

The best, most efficient solution would simply be to stop people from sending us unwanted electronic mail. Unfortunately, something called the First Amendment (at least here in the U.S.) prevents us from taking this approach, so we must consider another angle. We must focus on getting rid of junk electronic mail before we ever set eyes on it. The question is how?

One reasonably effective method involves examining a piece of electronic mail and deciding whether to keep it or reject it based on its content. This is, after all, what we do when we read a piece of electronic mail.

Consider how we go about filtering electronic mail now. We scan a piece of mail -- character by character and line by line -- looking for words we recognize. If the mail contains the word "Java," we keep it; if it contains the phrase "Make Money Fast," we send it to the bit-bucket.

But why go through the trouble? Let's see if we can make a computer program suffer this task for us.

Tactical assessment

I'm going to take a step back from the problem at hand and look at the classes in the I/O package of the Java class library. I/O stands for input/output: the information that goes into and comes out of a program, and the parts of the program that handle that information.

The Java class library input and output classes are based on a very simple, but very powerful model -- the stream.

The stream model, which is shown in the following figure, presents information as flowing from one point to another, as if it were in a stream or pipe. From a vantage point at any position along the flow, an observer sees pieces of information pass by, a piece at a time, in sequence.

A stream passes information from one point to another

The model fits many types of real-world information. Whether it is keycodes coming from a computer keyboard, audio data coming from an audio file, or line after line of text coming from a text file, all appear to be streams of information.

An important tool for working on streams is the filter. Filters take information arriving at their upstream side, filter or process it in some way, and send it out their downstream side. The figure below shows how a filter works.

A filter interrupts the flow of information for processing
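To make the idea concrete, here is a minimal sketch of a filter written against the standard java.io.FilterInputStream class. The class name UpperCaseFilterSketch and the helper filterAll() are made up for illustration, and the sketch assumes plain ASCII text.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example filter: converts ASCII lowercase letters to uppercase.
class UpperCaseFilterSketch extends FilterInputStream {
    UpperCaseFilterSketch(InputStream upstream) {
        super(upstream);
    }

    // Takes one byte from the upstream side, processes it, and hands the
    // result downstream. The end-of-stream value (-1) passes through untouched.
    public int read() throws IOException {
        int b = in.read();
        if (b >= 'a' && b <= 'z') {
            return b - ('a' - 'A');
        }
        return b;
    }

    // Route bulk reads through read() so that they are filtered too.
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + n] = (byte) b;
            n++;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Convenience helper for trying the filter out on a string (ASCII assumed).
    static String filterAll(String text) throws IOException {
        InputStream in = new UpperCaseFilterSketch(
                new ByteArrayInputStream(text.getBytes("US-ASCII")));
        StringBuffer out = new StringBuffer();
        int b;
        while ((b = in.read()) != -1) {
            out.append((char) b);
        }
        return out.toString();
    }
}
```

Calling UpperCaseFilterSketch.filterAll("make money fast") returns "MAKE MONEY FAST". The source of the bytes is irrelevant to the filter, which is exactly what makes filters reusable.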

The key to the stream model's power is the ability to chain together very simple individual filters to create more powerful compound filters, as shown in the following figure.

A cascade of filters
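In Java, a cascade like this is typically built by wrapping one stream object in another. The sketch below chains three standard java.io classes; the class name FilterChainSketch and the sample data are assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class FilterChainSketch {
    // Source -> buffering filter -> data-decoding filter.
    // Each layer adds one capability to the stream beneath it.
    static int readIntThroughChain(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(
                        new ByteArrayInputStream(data)));
        try {
            // readInt() combines four bytes, most significant byte first.
            return in.readInt();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { 0, 0, 1, 0 };
        System.out.println(readIntThroughChain(data)); // prints 256
    }
}
```

None of the three classes knows anything about the others. Each one simply reads from whatever InputStream it was handed.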

The Java class library breaks streams up into two types -- input and output. Such a distinction is not necessary in theory, but is useful in practice.

Input streams generally have as their ultimate source some device or file, and are involved in taking data from that source and bringing it into the domain of the program. The input stream is often filtered in the process.

Output streams generally have as their ultimate destination some device or file, and are involved in taking data from the domain of the program and sending it to that destination. The output stream is often filtered in the process.
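The two kinds of stream meet inside the program. As a rough sketch (the class and method names here are made up), this is the classic loop that pulls bytes from an input stream and pushes them to an output stream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class CopySketch {
    // Moves every byte from the input stream to the output stream
    // and returns the number of bytes moved.
    static long copy(InputStream in, OutputStream out) throws IOException {
        long count = 0;
        int b;
        while ((b = in.read()) != -1) {  // -1 signals the end of the input
            out.write(b);
            count++;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Copies standard input to standard output.
        copy(System.in, System.out);
    }
}
```

Copying a byte at a time is slow on unbuffered streams, so in practice one would wrap both ends in buffering filters or copy through a byte array instead.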

We will use the stream classes of the Java class library in the solution to our electronic junk mail problem for two reasons:

  1. It's easy to think of electronic mail as flowing line by line and word by word into our computer.

  2. We want to examine the mail, line by line and word by word, as it arrives at our computer to see if it matches any of the patterns we specify.

Our arsenal -- The stream classes in detail

The Java 1.1 specification describes two nearly identical sets of input and output stream classes. One set is byte oriented, the other is character oriented. The byte-oriented stream classes were present, with only minor differences, in Java 1.0.2. The character-oriented stream classes are entirely new with the 1.1 spec.

This month we'll look at the byte-oriented stream classes. We'll do this for two reasons. First, this will allow those of you who do not yet have access to Java 1.1 to make use of this material. Second, it will allow me to point out some problem areas with the Java 1.0.2 class libraries that were fixed in Java 1.1.

Recall that streams can be divided into two broad categories: input streams and output streams. In Java, all byte-oriented input stream classes are subclasses of the abstract class InputStream. Class InputStream defines the basic suite of methods an input stream class must provide. Likewise, all byte-oriented output stream classes are subclasses of the abstract class OutputStream. Class OutputStream defines the basic suite of methods an output stream class must provide.
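Only the single-byte read() method of InputStream is abstract, so a new kind of input stream can be surprisingly small. Here is a minimal sketch; the class CountdownInputStreamSketch is invented for illustration and is not part of the class library.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stream that produces the byte values n, n-1, ..., 1 and then ends.
// The starting value is assumed to be no larger than 255.
class CountdownInputStreamSketch extends InputStream {
    private int remaining;

    CountdownInputStreamSketch(int start) {
        remaining = start;
    }

    // The one method a subclass must supply. Everything else
    // (array reads, skip, and so on) has a default built on top of it.
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;  // end of stream
        }
        return remaining--;
    }
}
```

A caller can immediately use the inherited array form of read() on this class, even though the class never mentions it.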

Common input stream methods

Let's take a look at the methods common to all input streams. Following each method declaration, I'll list the tasks the method performs.

public int read() throws IOException

  • Reads a single byte from the input stream and returns it.

  • Returns -1 if the end of the input stream has been reached.

  • Blocks (or waits) until data is available, if necessary.

  • Throws IOException if an error occurs during the read operation.
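The return type is int rather than byte so that the end-of-stream value -1 can never be confused with real data: every byte comes back as a value from 0 to 255. A small sketch (class and method names are made up) shows the usual loop:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SingleByteReadSketch {
    // Adds up the values of all bytes in the stream.
    static int sumBytes(InputStream in) throws IOException {
        int sum = 0;
        int b;
        while ((b = in.read()) != -1) {
            sum += b;  // b is always in the range 0..255 here
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        // The byte 0xFF is -1 as a Java byte, but read() returns it as 255.
        byte[] data = { (byte) 0xFF, 1 };
        System.out.println(sumBytes(new ByteArrayInputStream(data))); // prints 256
    }
}
```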

public int read(byte [] rgb) throws IOException

  • Reads a sequence of bytes from the input stream and places them in the specified array.

  • Returns the number of bytes read.

  • Returns -1 if the end of the input stream has been reached.

  • Blocks (or waits) until data is available, if necessary.

  • Throws IOException if an error occurs during the read operation.

public int read(byte [] rgb, int nOff, int nLen) throws IOException

  • Reads a sequence of bytes of the specified length from the input stream and places them in the specified array at the specified offset.

  • Returns the number of bytes read.

  • Returns -1 if the end of the input stream has been reached.

  • Blocks (or waits) until data is available, if necessary.

  • Throws IOException if an error occurs during the read operation.
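Both array forms may return fewer bytes than the array could hold, so code that needs a full buffer has to loop. The following is a sketch of that loop under made-up names (BulkReadSketch, readFully):

```java
import java.io.IOException;
import java.io.InputStream;

class BulkReadSketch {
    // Keeps reading until the buffer is full or the stream ends.
    // Returns the number of bytes actually placed in the buffer.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            // Ask only for the space that is still free, starting at 'total'.
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;  // end of stream before the buffer filled up
            }
            total += n;
        }
        return total;
    }
}
```

The offset and length arguments are what make the loop possible: each pass fills the next unused region of the same array.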

public long skip(long n) throws IOException

  • Skips over the specified number of bytes.

  • Returns the number of bytes skipped.

  • May skip fewer bytes than requested (possibly zero), for example when the end of the input stream is reached.

  • Throws IOException if an error occurs during the skip operation.
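In practice skip() is allowed to skip fewer bytes than it was asked to, so a careful caller checks the return value. Here is one way to do that, as a sketch with invented names (SkipSketch, skipFully):

```java
import java.io.IOException;
import java.io.InputStream;

class SkipSketch {
    // Tries to skip exactly n bytes. Returns the number actually skipped,
    // which is smaller than n only if the stream ended first.
    static long skipFully(InputStream in, long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            long s = in.skip(n - skipped);
            if (s <= 0) {
                // Nothing was skipped: read one byte to find out
                // whether the stream has really ended.
                if (in.read() == -1) {
                    break;
                }
                s = 1;
            }
            skipped += s;
        }
        return skipped;
    }
}
```

This is handy for stepping over a fixed-size header before reading the part of the data we care about.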

public int available() throws IOException

  • Returns the number of the bytes that can be read from the input stream without the read operation blocking.

  • Throws IOException if an error occurs during the operation.

public void close() throws IOException

  • Closes the input stream and releases any resources (operating system file handles, for example) associated with the input stream.

  • Throws IOException if an error occurs during the operation.

public void mark(int nReadLimit)

  • Marks the current position in the input stream. Subsequent calls to reset() will reposition the input stream to this position.

  • Specifies the number of bytes that may be read past the mark before the mark is invalidated.

public void reset() throws IOException

  • Repositions the input stream to the last marked position.

  • Throws IOException if the stream has not been marked, or if the mark has been invalidated.

public boolean markSupported()

  • Indicates whether or not this input stream supports the mark and reset operations.
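Together, these three methods let a program peek ahead and then back up. The sketch below (the names MarkResetSketch and startsWith are made up) checks whether a stream begins with a given prefix without consuming any data:

```java
import java.io.IOException;
import java.io.InputStream;

class MarkResetSketch {
    // Reports whether the next bytes of the stream match the prefix,
    // leaving the stream positioned where it started.
    static boolean startsWith(InputStream in, byte[] prefix) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by this stream");
        }
        in.mark(prefix.length);  // we will read at most this many bytes
        boolean matches = true;
        try {
            for (int i = 0; i < prefix.length && matches; i++) {
                matches = (in.read() == (prefix[i] & 0xff));
            }
        } finally {
            in.reset();  // rewind to the marked position
        }
        return matches;
    }
}
```

Not every stream supports marking. Wrapping a stream in a java.io.BufferedInputStream is a common way to add the capability.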

Common output stream methods

Let's take a look at the methods common to all output streams. As with the previous section, I'll list the tasks the method performs following each method declaration.

public void write(int b) throws IOException

  • Writes a single byte to the output stream.

  • Blocks (or waits) until the data is actually written.

  • Throws IOException if an error occurs during the write operation.

public void write(byte [] rgb) throws IOException

  • Writes a sequence of bytes to the output stream.

  • Blocks (or waits) until the data is actually written.

  • Throws IOException if an error occurs during the write operation.

public void write(byte [] rgb, int nOff, int nLen) throws IOException

  • Writes a sequence of bytes of the specified length to the output stream, beginning at the specified offset.

  • Blocks (or waits) until the data is actually written.

  • Throws IOException if an error occurs during the write operation.
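Here are the three write() forms side by side, as a small sketch that writes into an in-memory java.io.ByteArrayOutputStream. The class and method names are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class WriteFormsSketch {
    static byte[] writeAllForms() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] text = { 'J', 'a', 'v', 'a', '!' };

        out.write('>');         // a single byte
        out.write(text);        // the whole array: "Java!"
        out.write(text, 0, 4);  // four bytes starting at offset 0: "Java"

        // write(int) uses only the low eight bits of its argument,
        // so 0x141 is written as the byte 0x41, the letter 'A'.
        out.write(0x141);

        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new String(writeAllForms(), "US-ASCII")); // prints >Java!JavaA
    }
}
```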

public void flush() throws IOException

  • Flushes the output stream, immediately writing any buffered data.

  • Throws IOException if an error occurs during the operation.
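The effect of flush() is easiest to see with a buffering filter in the chain. In this sketch (the class name FlushSketch is made up), a byte sits in the buffer of a java.io.BufferedOutputStream until flush() pushes it downstream:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class FlushSketch {
    // Returns the size of the destination before and after the flush.
    static int[] sizesBeforeAndAfterFlush() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = new BufferedOutputStream(sink, 64);

        out.write('x');
        int before = sink.size();  // the byte is still held in the buffer

        out.flush();
        int after = sink.size();   // now it has reached the destination

        return new int[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = sizesBeforeAndAfterFlush();
        System.out.println(sizes[0] + " then " + sizes[1]); // prints 0 then 1
    }
}
```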

public void close() throws IOException

  • Closes the output stream and releases any resources (operating system file handles, for example) associated with the output stream.

  • Throws IOException if an error occurs during the operation.

Our plan of attack

The code this month comes in three different flavors. Here's why.

Byte-to-char conversions in Java 1.0.2 were fundamentally flawed (making the language's Unicode support essentially useless). To support internationalization, the flaws were fixed in Java 1.1. The result is two almost identical APIs that differ only in the methods they provide for converting bytes to chars.

In order to provide all of you with workable code, and to show how the APIs have changed, I've provided code in three packages. The energetic among you are encouraged to download them all and compare them. The rest of you should download the package appropriate for your platform.

The first package works with Java 1.0.2. It is available as a gzipped tar file and as a zip file.

The second package works with Java 1.1. It is also available as a gzipped tar file and as a zip file.

The third package works with both Java 1.0.2 and Java 1.1. This package doesn't use the conflicting parts of the APIs; it does the work itself. Those of you interested in portability might want to look here. Of course, you can download the package either as a gzipped tar file or as a zip file.

The code this month doesn't run as an applet, so you'll need access to the Java Development Kit or a similar command-line environment.

First, using a method appropriate for your system, unpack the downloaded file.

Next, from the command line, execute the Java runtime as follows:

 % java Main [keyword] [keyword] ... < [email file]

You may specify any number of keywords on the command line. The program builds a filter for each of the keywords and links the filters together into a stream. The input is expected to arrive on standard input. The program reads from standard input, sends the data through the stream, and writes to standard output. If any filter detects a match on its keyword, it raises an exception, which stops the program.
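To give a feel for how such a program might be put together, here is a rough sketch of a keyword filter. It is not the code from the downloadable packages: the class name KeywordFilterSketch, the exception type, and the matching strategy (an exact, case-sensitive match on ASCII keywords) are all assumptions made for illustration.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class KeywordFilterSketch extends FilterInputStream {
    // Hypothetical exception raised when the keyword shows up in the stream.
    static class KeywordFoundException extends IOException {
        KeywordFoundException(String keyword) {
            super("rejected: found \"" + keyword + "\"");
        }
    }

    private final String keyword;
    private final byte[] pattern;  // the keyword as bytes (ASCII assumed)
    private final byte[] window;   // circular buffer of the most recent bytes
    private long seen;             // number of bytes read so far

    KeywordFilterSketch(InputStream upstream, String keyword) {
        super(upstream);
        this.keyword = keyword;
        this.pattern = new byte[keyword.length()];
        for (int i = 0; i < pattern.length; i++) {
            pattern[i] = (byte) keyword.charAt(i);
        }
        this.window = new byte[pattern.length];
    }

    // Passes each byte along unchanged, but remembers the last few bytes
    // and raises an exception as soon as they spell out the keyword.
    public int read() throws IOException {
        int b = in.read();
        if (b != -1 && pattern.length > 0) {
            window[(int) (seen % window.length)] = (byte) b;
            seen++;
            if (seen >= pattern.length && windowMatches()) {
                throw new KeywordFoundException(keyword);
            }
        }
        return b;
    }

    // Route bulk reads through read() so that they are examined too.
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + n] = (byte) b;
            n++;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    private boolean windowMatches() {
        int oldest = (int) (seen % window.length);  // index of the oldest byte
        for (int i = 0; i < pattern.length; i++) {
            if (window[(oldest + i) % window.length] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Links one filter per keyword into a single stream.
    static InputStream chain(InputStream source, String[] keywords) {
        InputStream in = source;
        for (int i = 0; i < keywords.length; i++) {
            in = new KeywordFilterSketch(in, keywords[i]);
        }
        return in;
    }

    // Reads the mail to the end; true means some filter rejected it.
    static boolean isJunk(InputStream mail, String[] keywords) throws IOException {
        InputStream in = chain(mail, keywords);
        try {
            while (in.read() != -1) {
                // nothing to do; the filters do the examining
            }
            return false;
        } catch (KeywordFoundException e) {
            return true;
        }
    }

    // Copies standard input to standard output through the filters.
    // An uncaught KeywordFoundException stops the program.
    public static void main(String[] args) throws IOException {
        InputStream in = chain(System.in, args);
        int b;
        while ((b = in.read()) != -1) {
            System.out.write(b);
        }
        System.out.flush();
    }
}
```

Because each filter only watches for its own keyword, adding another keyword means adding another link to the chain rather than changing any existing code.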

Very simple, yet extremely useful and very effective.

Conclusion

Next month we'll dive right in and take a look at the character-oriented input and output streams: class Reader and class Writer. I'll show you how they work alongside class InputStream and class OutputStream. I'll also talk briefly about character encodings in preparation for a glance at Java internationalization.

See you next month.

Todd Sundsted has been writing programs since computers became available in convenient desktop models. Though originally interested in building distributed applications in C++, Todd moved to the Java programming language when it became the obvious choice for that sort of thing. Todd is co-author of the Java Language API SuperBible. In addition to writing, Todd is president of Etcee, which offers Java-centric training, mentoring, and consulting.