Learn to speak Jamaican

Introducing Jamaica, a JVM macro assembler language

Most Java programmers, at one time or another, have wondered how the JVM works. Programming in Java bytecode offers considerable insight into the JVM and helps developers write better Java. The ability to produce bytecode at runtime is also a great asset that opens up new possibilities. Historically, various language systems invented their own runtime systems; today, many want to switch to Java for a good reason: to leverage the freely available work of others who port and optimize the virtual machine on numerous platforms. For pure-Java system software, dynamically generated bytecode may provide performance that is otherwise impossible. For example, suppose an RDBMS (relational database management system) supports functions in queries like this: SELECT GetLastName(name) FROM emp WHERE CalcAge(birthday) > 30. It is possible and desirable for the database engine to generate and call a genuine Java method, compiled to bytecode, rather than simply interpret the expression.

A JVM is a simple stack-based CPU with no general-purpose programmable registers. Its instruction set is called bytecode. Code and data are organized in JVM classes (to which Java classes are mapped), but the JVM does not support all Java language features directly. The JVM Specification also defines many verification and security rules. Even so, programming at the bytecode level proves error-prone and risky, as I personally witnessed in Jamaica testing. Also, bytecode programs tend to be longer than other CPU assembly programs because JVM instructions mostly operate on the stack top. Jamaica tries to address some of these issues, adopting a Java-ish approach: it uses Java syntax for class infrastructure declaration and symbolic names in instructions for references to variables, fields, and labels. Moreover, Jamaica has defined numerous macros for common patterns, making it much easier to read and write JVM assembly programs.

Because the JVM Specification does not define an assembly language, a few efforts have been made to create one; the best-known thus far is Jasmin, and Jamaica is the latest. This article introduces the Jamaica language with many examples, details the instruction set's more complicated instructions, and elaborates on all of Jamaica's macros. An equally important part of Jamaica is the underlying abstract API for creating JVM classes; this API and its close relationship with Jamaica are introduced in this article. Assembly programming is closely tied to the CPU architecture, but this article does not cover JVM architecture extensively. It closes with a summary of Jamaica's benefits and limitations.

Let's start by looking at an example:

public class CHelloWorld {
  public static void main(String[] args) {
    getstatic System.out PrintStream
    ldc "Hello, World!"
    invokevirtual PrintStream.println(String)void
  }
}

The code above looks quite familiar to Java programmers, except for the method body, where executable code is written in the Jamaica bytecode instruction format. All class names are in Java format rather than JVM format. As a "Jamaican convenience," Java classes in java.lang, java.io, and java.util are automatically imported, so you can use the class names directly without package prefixes. If we use macros, we can reduce the code above to a single statement and easily do more:

public class CHelloWorld {
  public static void main(String[] args) {
    %println "Hello, World!"
    %println <out> "Hello, World!"
    %println <err> "This is NOT an error!"
  }
}

The %println is probably the most used macro for debugging purposes. It prints to either System.out (by default) or System.err. With macros, reading (and writing) JVM assembly code is much easier. The following example is slightly juicier:

public class CHelloWorld {
  public static void main(String[] args) {
    Date d;
    %set d = %object Date
    %println "Hello, World!\nIt is ", d, '.'
  }
}

Let's compile and test this program. First, download and install Jamaica. Then run this:

% java com.judoscript.jamaica.Main CHelloWorld.ja

If everything goes well, a file named CHelloWorld.class is generated in the current directory. (If the class belongs to a package, you need to move it to the appropriate directory.) To verify it, run the class with java (even if the class does not have a main() method). If the Java verifier reports problems, use javap -c, a commonly used tool that disassembles a Java class and prints its content, including bytecode instructions. Javap's output format differs from Jamaica's syntax but is close enough. In fact, the easiest way to do JVM assembly programming is to reverse-engineer Java classes with javap or similar tools.

Experienced JVM bytecode programmers might have found a "fraud" in the above examples: there are no return instructions at the end of those methods. In the Java language, explicit return statements are not required for methods returning void; at the JVM level, however, they are required. Jamaica checks the code and automatically inserts a return instruction if needed.

Now that you have a sense of what Jamaica programs look like, let's move on to the specifics.

Define classes and interfaces

Java identifiers, keywords, and comments are also Jamaica's. In Jamaica's method bodies, bytecode instruction mnemonics are considered reserved words and should not be used as variable or label names.

To define a JVM class and interface in Jamaica, use the exact Java syntax, including the package statement for the class package prefix, extends, and/or implements clauses. The import statements can be used to introduce Java class-name shortcuts.

Java class names used in programs are in Java format (e.g., java.lang.String) rather than JVM format (e.g., java/lang/String). Inner class names use a dollar sign ($) between their own names and their enclosing class names. Class names also follow the Java import rules, and, as mentioned above, java.lang.*, java.io.*, and java.util.* are implicitly imported (at the end of the import list).
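
To make the two naming formats concrete, here is a small Java sketch. It is not part of Jamaica; the class and method names are made up for illustration. It converts a Java-format class name to JVM format and shows that the Java runtime itself reports a nested class name with a dollar sign.

```java
import java.util.Map;

// Illustrative helper (hypothetical, not part of Jamaica).
final class ClassNameFormats {
    // Java format (java.lang.String) -> JVM format (java/lang/String).
    static String toJvmFormat(String javaName) {
        return javaName.replace('.', '/');
    }

    public static void main(String[] args) {
        System.out.println(toJvmFormat("java.lang.String"));
        // Class.getName() puts '$' between an enclosing class and its nested class.
        System.out.println(Map.Entry.class.getName());
    }
}
```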

Fields and symbolic constants

Class data fields are declared in Jamaica as they are in Java, but they cannot be initialized in the declaration. The one exception is static final fields of primitive types, which must be initialized. All other initialization must happen either within constructors (for nonstatic members) or within class initialization blocks (for static ones).

Static final primitive fields are initialized with constant values. A constant value can be a number, a string, or a symbolic constant defined either explicitly through the %const statement or as a static final value of this or another class. Symbolic constants are enclosed in { } in the code. They can be used anywhere a constant may occur and are converted to the intended types. For example:

%const MAX_COUNT = 10000
%const MONTH_FLD = java.text.DateFormat.MONTH_FIELD
public class ConstantTest extends java.sql.Types {
  static final double MY_MAX_COUNT = { MAX_COUNT };
  public static void main(String[] args) {
    long   var  = { MAX_COUNT };
    double dvar = { MY_MAX_COUNT };
    %println "var = ", var
    %println "dvar = ", dvar
    %println "java.text.DateFormat.YEAR_FIELD = ",
             { java.text.DateFormat.YEAR_FIELD }
    %println "java.text.DateFormat.MONTH_FIELD = ", { MONTH_FLD }
    %println "java.sql.Types.ARRAY = ", { ARRAY } // in parent class
  }
}
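
For comparison, the following is a rough plain-Java counterpart of the listing above. It is only a sketch: the class name ConstantTestJava is made up, the %const symbols are modeled as static final fields, and java.sql.Types.ARRAY is referenced by its qualified name rather than inherited from a parent class. It shows the same widening of an integer constant to long and double that the { } notation performs.

```java
import java.sql.Types;
import java.text.DateFormat;

// Hypothetical Java-source counterpart of the Jamaica ConstantTest listing.
final class ConstantTestJava {
    static final int MAX_COUNT = 10000;                   // stands in for %const MAX_COUNT
    static final int MONTH_FLD = DateFormat.MONTH_FIELD;  // stands in for %const MONTH_FLD
    static final double MY_MAX_COUNT = MAX_COUNT;         // int constant widened to double

    public static void main(String[] args) {
        long var = MAX_COUNT;        // int constant widened to long
        double dvar = MY_MAX_COUNT;
        System.out.println("var = " + var);
        System.out.println("dvar = " + dvar);
        System.out.println("java.text.DateFormat.YEAR_FIELD = " + DateFormat.YEAR_FIELD);
        System.out.println("java.text.DateFormat.MONTH_FIELD = " + MONTH_FLD);
        System.out.println("java.sql.Types.ARRAY = " + Types.ARRAY);
    }
}
```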

Methods and exception handling

Methods, including constructors and class initialization blocks, are declared in Jamaica using Java syntax. Within the method bodies, local variables are declared with Java syntax; they can take constant initializations. Executable code is written with bytecode instructions. Instructions are not terminated with any character such as ;, and multiple instructions can appear on the same line, although one instruction per line is recommended. Each instruction has a mnemonic and its own format, and the operands follow a consistent convention. Instructions can be prefixed with labels. Variable declarations and bytecode instructions can intermingle.

At the method's end, exception catch clauses can be added. Jamaica does not have an explicit finally mechanism as in Java because the JVM doesn't have one either. For example:

public class ExceptionTest
{
  public static void main(String[] args) {
    Writer w;
    PrintWriter out;
  label_start:
    %set w = %object FileWriter(String) (args[0])
    %set out = %object PrintWriter(Writer) (w)
    aload out
    invokevirtual PrintWriter.close()void
    goto label_finally
  label_io:
    %println "Caught IOException."
    invokevirtual Exception.printStackTrace()void
    goto label_finally
  label_any:
    %println "Caught an Exception."
    invokevirtual Exception.printStackTrace()void
  label_finally:
    %println "Finally."
    catch IOException (label_start label_io) label_io
    catch Exception (label_start label_io) label_any
  }
}

Every catch clause uses three labels. Enclosed in parentheses are the start label (inclusive) and end label (exclusive) of the block in which the specified exception is caught; the trailing label marks the handling code.
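
The way such clauses are consulted at runtime can be modeled in a few lines of Java. The sketch below is a simplified model, not Jamaica's or the JVM's actual code: the class name is made up, and the offsets 0, 20, and 30 are invented stand-ins for label_start, label_io, and label_any. It illustrates the inclusive start, the exclusive end, and the fact that entries are tried in order, which is why the IOException clause is listed before the Exception clause.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a method's exception table.
final class ExceptionTableSketch {
    private static final class Entry {
        final int start;    // first covered offset (inclusive)
        final int end;      // end of covered range (exclusive)
        final int handler;  // offset of the handling code
        final Class<? extends Throwable> type;

        Entry(int start, int end, int handler, Class<? extends Throwable> type) {
            this.start = start;
            this.end = end;
            this.handler = handler;
            this.type = type;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(int start, int end, int handler, Class<? extends Throwable> type) {
        entries.add(new Entry(start, end, handler, type));
    }

    // Returns the handler offset for an exception thrown at pc, or -1 if no entry applies.
    // Entries are tried in the order they were added.
    int findHandler(int pc, Throwable thrown) {
        for (Entry e : entries) {
            if (pc >= e.start && pc < e.end && e.type.isInstance(thrown)) {
                return e.handler;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ExceptionTableSketch table = new ExceptionTableSketch();
        table.add(0, 20, 20, IOException.class);  // catch IOException (label_start label_io) label_io
        table.add(0, 20, 30, Exception.class);    // catch Exception (label_start label_io) label_any
        System.out.println(table.findHandler(5, new FileNotFoundException("missing")));
        System.out.println(table.findHandler(5, new IllegalStateException()));
        System.out.println(table.findHandler(20, new IOException())); // offset 20 is outside the range
    }
}
```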

Default constructor

If a class needs a default constructor simply to call the superclass's constructor, a class-level macro, %default_constructor, does the trick easily. It can be followed by <public>, <protected>, or <private>. Here is an example:

class Block extends HashMap
{
  %default_constructor <public>
  // ...
}
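
For readers more at home in Java source, the class above presumably corresponds to something like the following sketch. This is an assumption based on the macro's description, not output from Jamaica; type parameters are added only to keep the Java compiler from warning about raw types.

```java
import java.util.HashMap;

// Assumed Java-source counterpart of the Jamaica Block class (a sketch for comparison).
class Block extends HashMap<Object, Object> {
    public Block() {
        super();  // does nothing except call the superclass's no-argument constructor
    }
}
```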

Bytecode programming and instructions

To do JVM bytecode assembly programming, you must understand the static structure of JVM classes and the runtime method invocation. (You do not need to concern yourself with thread execution and synchronization. The JVM has two instructions for this, monitorenter and monitorexit, and they are all you can use at the bytecode level.)

Each Java class has a constant pool, which holds all the class's constant parts, including constant numbers and strings, Java class identifiers, method identifiers, field identifiers, and so on. Bytecode instructions use constant pool indices to reference those entries. Jamaica uses Java syntax to define a Java class structure and symbolic names in instructions, so the constant pool and other bits and pieces are completely hidden.

Each running thread has a "frame" stack; when a method is called, the JVM allocates a new frame on the frame stack to store state information during the method execution; the frame is popped and discarded when the method returns. The frame maintains information such as local variables and the operand stack. JVM instructions receive values, return results, and pass parameters to method calls on the operand stack. The operand stack is one word (32 bits) wide; values of type long and double occupy two entries.
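
A toy model may help picture how instructions communicate through the operand stack. The Java sketch below is not a real JVM; the class and its methods are invented for illustration and handle int values only. It evaluates (2 + 3) * 4 the way a bytecode sequence of pushes followed by iadd and imul would: each arithmetic instruction pops its operands and pushes its result.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy, hypothetical model of an operand stack holding int values only.
final class OperandStackSketch {
    private final Deque<Integer> stack = new ArrayDeque<>();

    // Models a constant-loading instruction such as ldc or bipush.
    void push(int value) {
        stack.push(value);
    }

    // Models iadd: pops two ints and pushes their sum.
    void iadd() {
        int right = stack.pop();
        int left = stack.pop();
        stack.push(left + right);
    }

    // Models imul: pops two ints and pushes their product.
    void imul() {
        int right = stack.pop();
        int left = stack.pop();
        stack.push(left * right);
    }

    int pop() {
        return stack.pop();
    }

    int depth() {
        return stack.size();
    }

    public static void main(String[] args) {
        OperandStackSketch s = new OperandStackSketch();
        s.push(2);
        s.push(3);
        s.iadd();
        s.push(4);
        s.imul();
        System.out.println(s.pop()); // result of (2 + 3) * 4
    }
}
```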

Parameters, variables, and "this"

When a method is invoked, parameters are added as initialized local variables, with the this reference as the first if the method is not static. Local variables are one-word (32-bit) slots, which fit most JVM data values, including object references; values of types long and double take two slots. In Jamaica, most instructions use names to reference variables, but a few instructions can reference variables via their indices, such as aload_0 and istore_2. Look at this example:

public void foo(long a, int b) {
  aload_0  // Loads 'this' on to the stack
  lload_1  // Loads the long value of 'a' on to the stack
  iload_3  // Loads the int value of 'b' on to the stack
}

In Jamaica, there is little reason to use those instructions. The following is recommended instead:

public void foo(long a, int b) {
  aload this  // Becomes aload_0
  lload a     // Becomes lload_1
  iload b     // Becomes iload_3
}
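
The slot numbers in these examples follow mechanically from the rules stated earlier. As a minimal sketch (the class and method names are hypothetical and this is not how Jamaica is necessarily implemented), the following Java code assigns slots to parameters: this takes slot 0 in a nonstatic method, and long and double parameters take two slots each.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assigns local variable slots to method parameters.
final class LocalSlotSketch {
    // names[i] is a parameter name and types[i] its Java type name.
    static Map<String, Integer> assignSlots(boolean isStatic, String[] names, String[] types) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int next = 0;
        if (!isStatic) {
            slots.put("this", next++);  // 'this' is always the first local variable
        }
        for (int i = 0; i < names.length; i++) {
            slots.put(names[i], next);
            boolean twoSlots = types[i].equals("long") || types[i].equals("double");
            next += twoSlots ? 2 : 1;
        }
        return slots;
    }

    public static void main(String[] args) {
        // Models: public void foo(long a, int b)
        System.out.println(assignSlots(false,
                new String[] {"a", "b"}, new String[] {"long", "int"}));
    }
}
```

For foo(long a, int b), this gets slot 0, a gets slot 1 (and occupies slot 2 as well), and b gets slot 3.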

Bytecode instruction basics

Jamaica supports most bytecode instructions in the JVM Specification except for the quick and debug instructions; wide is simply ignored. Some instructions have "wide" versions (such as ldc_w and goto_w); you can use their short forms instead. For a complete description of all instructions, refer to the language user's guide. Here I discuss only those whose syntax is not obvious from the JVM Specification.

Constant loading instructions

In the JVM, small numeric values and null can be loaded onto the stack directly via bipush, sipush, and the various xconst_n instructions. Other constant values, including strings, are stored in the constant pool and loaded onto the stack by ldc and its variations, ldc_w and ldc2_w (the "_w" suffix indicates a wide, two-byte index). In Jamaica, there is no direct access to the constant pool; instead, ldc is the universal instruction for adding and loading any constant:

ldc 129832      // Integer
ldc (long)232   // Long and becomes ldc2_w
ldc 5.5         // Double and becomes ldc2_w
ldc (float)5.5  // Float
ldc "ABCD"
ldc "ABCD"      // Only one entry for "ABCD" in the constant pool
ldc 1234        // Jamaica optimizes this to "sipush 1234"
ldc 123         // Jamaica optimizes this to "bipush 123"
ldc 2           // Jamaica optimizes this to "iconst_2"
ldc -1          // Jamaica optimizes this to "iconst_m1"
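
The choice of instruction for an int constant can be sketched in Java as follows. This is an assumption about the kind of selection an assembler would make, based on the operand ranges in the JVM Specification (iconst_m1 through iconst_5 for -1 to 5, a signed byte for bipush, a signed 16-bit value for sipush); it is not taken from Jamaica's source, and the class name is made up.

```java
// Hypothetical sketch of choosing the shortest instruction for an int constant.
final class IntConstantSketch {
    static String select(int value) {
        if (value == -1) {
            return "iconst_m1";
        }
        if (value >= 0 && value <= 5) {
            return "iconst_" + value;
        }
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush " + value;   // operand is a signed 8-bit value
        }
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush " + value;   // operand is a signed 16-bit value
        }
        return "ldc " + value;          // everything else goes through the constant pool
    }

    public static void main(String[] args) {
        int[] samples = {129832, 1234, 100, 2, -1};
        for (int v : samples) {
            System.out.println("ldc " + v + " -> " + select(v));
        }
    }
}
```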

Field access instructions

In the JVM, class fields are accessed via their indices by these instructions: getfield, putfield, getstatic, and putstatic. For security reasons, these instructions also take a field descriptor, containing class name, field name, and field type, which may seem redundant. In Jamaica, these instructions take this format:

getstatic System.out PrintStream
putfield myFld int
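
In the class file itself, a field's type is recorded as a descriptor string in JVM format, whereas Jamaica lets you write the Java type name (PrintStream, int). The Java sketch below shows the mapping for primitive types and fully qualified class names. The helper is hypothetical, it does not handle array types, and it assumes the class name has already been resolved to its fully qualified form.

```java
// Hypothetical helper: Java type name -> JVM field descriptor (arrays not handled).
final class DescriptorSketch {
    static String toDescriptor(String javaType) {
        switch (javaType) {
            case "boolean": return "Z";
            case "byte":    return "B";
            case "char":    return "C";
            case "short":   return "S";
            case "int":     return "I";
            case "long":    return "J";
            case "float":   return "F";
            case "double":  return "D";
            default:
                // Class types become L<name in JVM format>;
                return "L" + javaType.replace('.', '/') + ";";
        }
    }

    public static void main(String[] args) {
        System.out.println(toDescriptor("java.io.PrintStream"));
        System.out.println(toDescriptor("int"));
    }
}
```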