Talking Java!

Add speech capability to your Java 1.3 applications and applets

Why would you want to make your applications talk? For a start, it's fun, and suitable for fun applications like games. And there's a more serious accessibility side. I'm thinking here not just of those naturally disadvantaged when using a visual interface, but also those situations where it's impossible -- or even illegal -- to take your eyes off what you're doing.

Recently I've been working with some technologies to take HTML and XML information from the Web [see "Access the World's Biggest Database with Web DataBase Connectivity" (JavaWorld, March 2001)]. It occurred to me that I could plug that work and this idea together to build a talking Web browser. Such a browser would prove useful for listening to snippets of information from your favorite sites -- news headlines, for example -- just like listening to the radio while out walking your dog or driving to work. Of course, with current technology you'd have to carry around your laptop computer with your mobile phone attached, but that impractical scenario could well change in the near future with the arrival of Java-enabled smart phones like the Nokia 9210 (9290 in the US).

Perhaps more useful in the short term would be an email reader, also possible thanks to the JavaMail API. This application would check your inbox periodically, and your attention would be attracted by a voice from nowhere proclaiming "You have new mail, would you like me to read it to you?" In a similar vein, consider a talking reminder -- connected with your diary application -- that shouts out "Don't forget your meeting with the boss in 10 minutes!"

Assuming you're sold on those ideas, or have some good ideas of your own, we'll move on. I'll start by showing how to put my supplied zip file to work so you can get up and running straight away; you can skip the implementation details if you think they're too much hard work.

Test drive the speech engine

To use the speech engine, you'll need to include the supplied zip file in your CLASSPATH and run the com.lotontech.speech.Talker class from the command line or from within a Java program.

To run it from the command line, type:

java com.lotontech.speech.Talker "h|e|l|oo"

To run it from a Java program, simply include two lines of code:

com.lotontech.speech.Talker talker=new com.lotontech.speech.Talker();
talker.sayPhoneWord("h|e|l|oo");

At this point you probably wonder about the format of the "h|e|l|oo" string you supply on the command line or provide to the sayPhoneWord(...) method. Let me explain.

The speech engine works by concatenating short sound samples that represent the smallest units of human -- in this case English -- speech. Those sound samples, called allophones, are labeled with a one-, two-, or three-letter identifier. Some identifiers are obvious and some not so obvious, as you can see from the phonetic representation of the word "hello."

  • h -- sounds as you would expect
  • e -- sounds as you would expect
  • l -- sounds as you would expect, but notice that I've reduced a double "l" to a single one
  • oo -- is the sound for "hello," not for "bot," and not for "too"

Here is a list of the available allophones:

  • a -- as in cat
  • b -- as in cab
  • c -- as in cat
  • d -- as in dot
  • e -- as in bet
  • f -- as in frog
  • g -- as in frog
  • h -- as in hog
  • i -- as in pig
  • j -- as in jig
  • k -- as in keg
  • l -- as in leg
  • m -- as in met
  • n -- as in begin
  • o -- as in not
  • p -- as in pot
  • r -- as in rot
  • s -- as in sat
  • t -- as in sat
  • u -- as in put
  • v -- as in have
  • w -- as in wet
  • y -- as in yet
  • z -- as in zoo
  • aa -- as in fake
  • ay -- as in hay
  • ee -- as in bee
  • ii -- as in high
  • oo -- as in go
  • bb -- variation of b with different emphasis
  • dd -- variation of d with different emphasis
  • ggg -- variation of g with different emphasis
  • hh -- variation of h with different emphasis
  • ll -- variation of l with different emphasis
  • nn -- variation of n with different emphasis
  • rr -- variation of r with different emphasis
  • tt -- variation of t with different emphasis
  • yy -- variation of y with different emphasis
  • ar -- as in car
  • aer -- as in care
  • ch -- as in which
  • ck -- as in check
  • ear -- as in beer
  • er -- as in later
  • err -- as in later (longer sound)
  • ng -- as in feeding
  • or -- as in law
  • ou -- as in zoo
  • ouu -- as in zoo (longer sound)
  • ow -- as in cow
  • oy -- as in boy
  • sh -- as in shut
  • th -- as in thing
  • dth -- as in this
  • uh -- variation of u
  • wh -- as in where
  • zh -- as in Asian
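
Because a mistyped identifier has no sound sample behind it, it can help to check a phonetic word against the list above before speaking it. The following is a minimal sketch, not part of the supplied zip file: the class name AllophoneCheck and the method unknownAllophones(...) are hypothetical, and the set simply copies the lower-case identifiers listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

public class AllophoneCheck {
    // Lower-case identifiers copied from the list of available allophones.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
        "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
        "o", "p", "r", "s", "t", "u", "v", "w", "y", "z",
        "aa", "ay", "ee", "ii", "oo",
        "bb", "dd", "ggg", "hh", "ll", "nn", "rr", "tt", "yy",
        "ar", "aer", "ch", "ck", "ear", "er", "err", "ng", "or", "ou", "ouu",
        "ow", "oy", "sh", "th", "dth", "uh", "wh", "zh"));

    // Returns, in order, the "|"-separated tokens that are not in the list.
    public static List<String> unknownAllophones(String phoneWord) {
        List<String> unknown = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(phoneWord, "|", false);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!KNOWN.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        System.out.println(unknownAllophones("h|e|l|oo")); // nothing unknown
        System.out.println(unknownAllophones("h|e|x|oo")); // "x" is not listed
    }
}
```

The check uses the same "|" tokenizing as the command-line format, so a string that passes here splits into the same tokens the engine would see.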

In human speech the pitch of words rises and falls throughout any spoken sentence. This intonation makes the speech sound more natural, more emotive, and allows questions to be distinguished from statements. If you've ever heard Stephen Hawking's synthetic voice, you understand what I'm talking about. Consider these two sentences:

  • It is fake -- f|aa|k
  • Is it fake? -- f|AA|k

As you might have guessed, the way to raise the intonation is to use capital letters. You need to experiment with this a little, and my hint is that you should concentrate on the long vowel sounds.
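
If you'd rather not capitalize by hand, a small helper can do it for you. This is a sketch under stated assumptions: the class Intonation and the method asQuestion(...) are hypothetical names, and treating aa, ay, ee, ii, and oo as "the long vowel sounds" is my reading of the hint above rather than anything the engine defines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Intonation {
    // Assumed set of long vowel allophones, taken from the allophone list.
    private static final Set<String> LONG_VOWELS =
        new HashSet<>(Arrays.asList("aa", "ay", "ee", "ii", "oo"));

    // Capitalizes every long vowel token, leaving other tokens untouched.
    public static String asQuestion(String phoneWord) {
        StringBuilder out = new StringBuilder();
        for (String token : phoneWord.split("\\|")) {
            if (out.length() > 0) {
                out.append('|');
            }
            out.append(LONG_VOWELS.contains(token) ? token.toUpperCase() : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(asQuestion("f|aa|k")); // prints f|AA|k
    }
}
```

Raising every long vowel is a blunt rule; in practice you would probably raise only the last stressed word of a question, so treat this as a starting point for experiments.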

That's all you need to know to use the software, but if you're interested in what's going on under the hood, read on.

Implement the speech engine

The speech engine needs just one class, with four methods plus a main(...) entry point. It uses the Java Sound API included with J2SE 1.3. I won't provide a comprehensive tutorial on the Java Sound API, but you'll learn by example. You'll find there's not much to it, and the comments tell you what you need to know.

Here's the basic definition of the Talker class:

package com.lotontech.speech;
import javax.sound.sampled.*;
import java.net.URL;
import java.util.*;
public class Talker
{
  private SourceDataLine line=null;
  // -- The methods shown in the rest of this section go here --
}

If you run Talker from the command line, the main(...) method below will serve as the entry point. It takes the first command line argument, if one exists, and passes it to the sayPhoneWord(...) method:

/* This method speaks a phonetic word specified on the command line. */
public static void main(String args[])
{
  Talker player=new Talker();
  if (args.length>0) player.sayPhoneWord(args[0]);
}

The sayPhoneWord(...) method is called by main(...) above, or it may be called directly from your Java application or plug-in supported applet. It looks more complicated than it is. Essentially, it simply steps through the word's allophones -- separated by "|" symbols in the input text -- and plays them one by one through a sound-output channel. To make it sound more natural, I merge the end of each sound sample with the beginning of the next one:

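/* This method speaks the given phonetic word. */
public void sayPhoneWord(String word)
{
  // -- Set up a dummy byte array for the previous sound --
  byte[] previousSound=null;
  // -- Split the input string into separate allophones --
  StringTokenizer st=new StringTokenizer(word,"|",false);
  while (st.hasMoreTokens())
  {
    // -- Construct a file name for the allophone --
    // -- (resource naming assumed here: <allophone>.au beside the class) --
    String thisPhoneFile=st.nextToken()+".au";
    // -- Get the data from the file --
    byte[] thisSound=getSound(thisPhoneFile);
    if (previousSound!=null)
    {
      // -- Merge the previous allophone with this one, if we can --
      int mergeCount=0;
      if (previousSound.length>=500 && thisSound.length>=500)
        mergeCount=500;
      for (int i=0; i<mergeCount;i++)
      {
        // -- Average the overlapping bytes (a simple blend, assumed) --
        int p=previousSound.length-mergeCount+i;
        previousSound[p]=(byte)((previousSound[p]+thisSound[i])/2);
      }
      // -- Play the previous allophone --
      playSound(previousSound);
      // -- Set the truncated current allophone as previous --
      byte[] newSound=new byte[thisSound.length-mergeCount];
      for (int ii=0; ii<newSound.length; ii++)
        newSound[ii]=thisSound[ii+mergeCount];
      previousSound=newSound;
    }
    else
      previousSound=thisSound;
  }
  // -- Play the final sound and drain the sound channel --
  if (previousSound!=null) playSound(previousSound);
  drain();
}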
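
The merge step is easier to follow on a tiny example. The sketch below isolates it from the sound code: the class MergeSketch and its two methods are hypothetical names, and averaging the overlapping bytes is only one simple way to blend two samples, assumed here for illustration.

```java
import java.util.Arrays;

public class MergeSketch {
    // Returns a copy of 'previous' whose last mergeCount bytes are averaged
    // with the first mergeCount bytes of 'current'.
    public static byte[] blendTail(byte[] previous, byte[] current, int mergeCount) {
        byte[] out = previous.clone();
        int start = out.length - mergeCount;
        for (int i = 0; i < mergeCount; i++) {
            out[start + i] = (byte) ((out[start + i] + current[i]) / 2);
        }
        return out;
    }

    // Returns 'current' without the first mergeCount bytes, which have
    // already been folded into the previous sample.
    public static byte[] dropHead(byte[] current, int mergeCount) {
        return Arrays.copyOfRange(current, mergeCount, current.length);
    }

    public static void main(String[] args) {
        byte[] previous = {10, 20, 30, 40};
        byte[] current = {0, 0, 9, 9};
        System.out.println(Arrays.toString(blendTail(previous, current, 2))); // [10, 20, 15, 20]
        System.out.println(Arrays.toString(dropHead(current, 2)));            // [9, 9]
    }
}
```

Both methods expect mergeCount to be no larger than either array, which is why sayPhoneWord(...) only merges when both samples hold at least 500 bytes.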

At the end of sayPhoneWord(), you'll see it calls playSound(...) to output an individual sound sample (an allophone), and it calls drain(...) to flush the sound channel. Here's the code for playSound(...):

/* This method plays a sound sample. */
private void playSound(byte[] data)
{
  if (data.length>0) line.write(data, 0, data.length);
}

And for drain(...):

/* This method flushes the sound channel. */
private void drain()
{
  if (line!=null) line.drain();
  try {Thread.sleep(100);} catch (Exception e) {}
}

Now, if you look back at the sayPhoneWord(...) method, you'll see there's one method I've not yet covered: getSound(...).

getSound(...) reads in a prerecorded sound sample, as byte data, from an au file. When I say a file, I mean a resource held within the supplied zip file. I draw the distinction because you get hold of such a resource through the getResource(...) method, which works differently from opening an ordinary file, and that difference is not obvious.
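
To make the distinction concrete, here is a minimal sketch that looks up the same name both ways. The class ResourceLookup, its method names, and the file name are hypothetical; only Class.getResource(...) and java.io.File are real APIs.

```java
import java.io.File;
import java.net.URL;

public class ResourceLookup {
    // Searches the classpath (including any zip or JAR on it). A name without
    // a leading "/" is resolved relative to this class's package. The result
    // is null when no such resource exists.
    public static URL fromClasspath(String name) {
        return ResourceLookup.class.getResource(name);
    }

    // Refers to the file system instead, relative to the working directory.
    // Creating the File object does not check that the file exists.
    public static File fromDisk(String name) {
        return new File(name);
    }

    public static void main(String[] args) {
        String name = "no-such-allophone.au"; // hypothetical file name
        System.out.println("classpath: " + fromClasspath(name));
        System.out.println("disk:      " + fromDisk(name).getAbsolutePath());
    }
}
```

The practical consequence is that a sample packed inside the zip file is invisible to java.io.File but is found by getResource(...) as soon as the zip is on the CLASSPATH.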

For a blow-by-blow account of reading the data, converting the sound format, instantiating a sound output line (why they call it a SourceDataLine, I don't know), and assembling the byte data, I refer you to the comments in the code that follows:

/*
 * This method reads the file for a single allophone and
 * constructs a byte vector.
 */
private byte[] getSound(String fileName)
{
  try
  {
    URL url=Talker.class.getResource(fileName);
    AudioInputStream stream = AudioSystem.getAudioInputStream(url);
    AudioFormat format = stream.getFormat();
    // -- Convert an ALAW/ULAW sound to PCM for playback --
    if ((format.getEncoding() == AudioFormat.Encoding.ULAW) ||
     (format.getEncoding() == AudioFormat.Encoding.ALAW))
    {
      // -- Big-endian signed PCM is assumed for the target format --
      AudioFormat tmpFormat = new AudioFormat(
       AudioFormat.Encoding.PCM_SIGNED,
       format.getSampleRate(),
       format.getSampleSizeInBits() * 2,
       format.getChannels(),
       format.getFrameSize() * 2,
       format.getFrameRate(),
       true);
      stream = AudioSystem.getAudioInputStream(tmpFormat, stream);
      format = tmpFormat;
    }
    if (line==null)
    {
      // -- Output line not instantiated yet --
      // -- Can we find a suitable kind of line? --
      DataLine.Info outInfo = new DataLine.Info(SourceDataLine.class,
       format);
      if (!AudioSystem.isLineSupported(outInfo))
      {
        System.out.println("Line matching " + outInfo + " not supported.");
        throw new Exception("Line matching " + outInfo + " not supported.");
      }
      // -- Open the source data line (the output line) --
      line = (SourceDataLine) AudioSystem.getLine(outInfo);
      line.open(format, 50000);
      line.start();
    }
    // -- Some size calculations --
    int frameSizeInBytes = format.getFrameSize();
    int bufferLengthInFrames = line.getBufferSize() / 8;
    int bufferLengthInBytes = bufferLengthInFrames * frameSizeInBytes;
    byte[] data=new byte[bufferLengthInBytes];
    // -- Read the data bytes and count them --
    int numBytesRead = stream.read(data);
    if (numBytesRead < 0) numBytesRead = 0;
    stream.close();
    // -- Truncate the byte array to the correct size --
    byte[] newData=new byte[numBytesRead];
    for (int i=0; i<numBytesRead;i++)
      newData[i]=data[i];
    return newData;
  }
  catch (Exception e)
  {
    return new byte[0];
  }
}
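
The format conversion is the least obvious part of getSound(...), so here it is in isolation. This sketch converts u-law bytes held in memory rather than in an au file, so it needs no sound hardware. The class UlawToPcm and the method toPcm(...) are hypothetical names, and the mono 8-bit source format and big-endian target are assumptions chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class UlawToPcm {
    // Converts 8-bit mono u-law samples to 16-bit signed PCM bytes.
    public static byte[] toPcm(byte[] ulaw, float sampleRate) {
        try {
            // Assumed source format: one byte per frame, one channel.
            AudioFormat source = new AudioFormat(AudioFormat.Encoding.ULAW,
                sampleRate, 8, 1, 1, sampleRate, false);
            // The stream length is given in frames; here one frame is one byte.
            AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(ulaw), source, ulaw.length);
            // Double the sample size and frame size, as getSound(...) does.
            AudioFormat target = new AudioFormat(AudioFormat.Encoding.PCM_SIGNED,
                source.getSampleRate(), source.getSampleSizeInBits() * 2,
                source.getChannels(), source.getFrameSize() * 2,
                source.getFrameRate(), true);
            AudioInputStream pcm = AudioSystem.getAudioInputStream(target, in);
            // Keep reading until the converted stream is exhausted.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = pcm.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("u-law conversion failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] pcm = toPcm(new byte[100], 8000f);
        System.out.println(pcm.length + " PCM bytes from 100 u-law bytes");
    }
}
```

Each u-law byte expands to a two-byte PCM sample, which is why both the sample size and the frame size are multiplied by two.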

So, that's it. A speech synthesizer in about 150 lines of code, including comments. But it's not quite over.

Text-to-speech conversion

Specifying words phonetically might seem a bit tedious, so if you intend to build one of the example applications I suggested in the introduction, you'll want to provide ordinary text as the input to be spoken.

After looking into the issue, I've provided an experimental text-to-speech conversion class in the zip file. When you run it, the output will give you insight into what it does.

You can run the text-to-speech converter with a command like this:

java com.lotontech.speech.Converter "hello there"

What you'll see as output looks something like:

hello -> h|e|l|oo
there -> dth|aer

Or, how about running it like:

java com.lotontech.speech.Converter "I like to read JavaWorld"

to see (and hear) this:

i -> ii
like -> l|ii|k
to -> t|ouu
read -> r|ee|a|d
java -> j|a|v|a
world -> w|err|l|d

If you're wondering how it works, I can tell you that my approach is quite simple, consisting of a set of text replacement rules applied in a certain order. Here are some example rules that you might like to apply mentally, in order, for the words "ant," "want," "wanted," "unwanted," and "unique":

  1. Replace "*unique*" with "|y|ou|n|ee|k|"
  2. Replace "*want*" with "|w|o|n|t|"
  3. Replace "*a*" with "|a|"
  4. Replace "*e*" with "|e|"
  5. Replace "*d*" with "|d|"
  6. Replace "*n*" with "|n|"
  7. Replace "*u*" with "|u|"
  8. Replace "*t*" with "|t|"

For "unwanted" the sequence would be thus:

un[|w|o|n|t|]ed (rule 2)
[|u|][|n|][|w|o|n|t|][|e|][|d|] (rules 4, 5, 6, 7)
u|n|w|o|n|t|e|d (with surplus characters removed)

You should see how words containing the letters "want" will be spoken differently from words containing the letters "ant." You should also see how the special-case rule for the complete word "unique" takes precedence over the other rules, so that this word is spoken as y|ou... rather than u|n....
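
The rule ordering described above can be sketched in a few lines of Java. This is not the Converter class from the zip file: the class RuleSketch, its convert(...) method, and the Piece helper are hypothetical, and the sketch holds only the eight example rules, treating each as a plain substring match. Once a rule has converted a piece of the word, that piece is marked as done so that later rules cannot touch it, which is the job the square brackets do in the worked example.

```java
import java.util.ArrayList;
import java.util.List;

public class RuleSketch {
    // A piece of the word: raw text still awaiting rules, or finished output.
    private static final class Piece {
        final String text;
        final boolean done;
        Piece(String text, boolean done) { this.text = text; this.done = done; }
    }

    // The eight example rules in order: {letters to find, allophones to emit}.
    private static final String[][] RULES = {
        {"unique", "y|ou|n|ee|k"},
        {"want", "w|o|n|t"},
        {"a", "a"}, {"e", "e"}, {"d", "d"}, {"n", "n"}, {"u", "u"}, {"t", "t"}
    };

    public static String convert(String word) {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece(word.toLowerCase(), false));
        for (String[] rule : RULES) {
            List<Piece> next = new ArrayList<>();
            for (Piece p : pieces) {
                if (p.done) { next.add(p); continue; }
                // Split the raw text around every match of this rule.
                String rest = p.text;
                int at;
                while ((at = rest.indexOf(rule[0])) >= 0) {
                    if (at > 0) next.add(new Piece(rest.substring(0, at), false));
                    next.add(new Piece(rule[1], true));
                    rest = rest.substring(at + rule[0].length());
                }
                if (!rest.isEmpty()) next.add(new Piece(rest, false));
            }
            pieces = next;
        }
        // Join the pieces with single separators (surplus characters removed).
        StringBuilder out = new StringBuilder();
        for (Piece p : pieces) {
            if (out.length() > 0) out.append('|');
            out.append(p.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ant", "want", "wanted", "unwanted", "unique"}) {
            System.out.println(w + " -> " + convert(w));
        }
    }
}
```

For "unwanted" this prints u|n|w|o|n|t|e|d, matching the sequence worked through above. Letters that no rule covers are passed through unchanged, so a real rule set would need an entry for every letter.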
