JVM Advent

The JVM Programming Advent Calendar

My First Compiler

As a Java developer you probably spend a lot of time writing Java source code and executing a Java compiler to convert that human readable Java source into machine readable bytecode stored in Java class files.

If you’ve ever wondered how a compiler works or how a Java compiler creates Java class files, then keep reading! We’ll work through writing a compiler for a simple programming language that compiles to Java bytecode.

We’ll implement a compiler for the esoteric programming language Brainf*ck, which is simple enough that it doesn’t require much code to create a working compiler.


What is Brainf*ck?

Brainf*ck is a programming language that consists of just 8 operators which operate on an array of memory cells. Even though there are only 8 operators, the language is Turing complete and so, in theory, you can write any program you can think of; whether you’d want to is another question!

The 8 operators use and manipulate the values in the memory, with the current memory cell pointed to by a data pointer. Implementations of the language typically have a memory size of at least 30,000 cells and the cells are typically 1 byte in size but this can vary.

These are the 8 operators:

>Moves the data pointer to the right
<Moves the data pointer to the left
+Increment the value in memory pointed to by the data pointer
Decrement the value in memory pointed to by the data pointer
[Jump to the location after the corresponding ] if the current value is zero
]Jump to the corresponding [ if the current value is not zero
,Read 1 character (1 byte) from the standard input
.Print 1 character to standard out

Any character not in this list is considered a comment. At the beginning of a Brainf*ck program all the values are zero and the data pointer points to the left-most block:

[0][0][0][0][0]…[0]
|
DP

Given the following program, +>++>++++-, the memory will end up looking like this:

[1][2][3][0][0]…[0]
       |
       DP

Compiling to Java Bytecode

We’ll write a Java program that will read Brainf*ck input files and produce Java class files in a jar as output: then you’ll be able to execute the jar with the java command (java -jar output.jar).

Java class files contain Java bytecode: these are the instructions that are executed by a Java virtual machine. A Java virtual machine is a stack-based machine: many of the instructions deal with pushing and popping from the operand stack. For example, the instruction iconst_0 is used to push a constant 0 onto the stack and the pop instruction will pop the top value from the stack.

So, how do we create a class file? Technically, a class file is just a bunch of bytes so we could just start writing out a stream of bytes but that’ll get hard pretty quickly so we’d benefit from a higher-level API to help us out.

ProGuardCORE

ProGuardCORE is a Java bytecode manipulation & analysis library that contains the tools required to read, write and manipulate Java class files and their bytecode. It abstracts away some of the details and provides model classes, editors and builders for all things class file related. It’s not the only such library out there: ASM, Byte Buddy and the new JEP457 are other examples.

ProGuardCORE is published to Maven Central, so you can simply create a new Java project and add a dependency to start using it. For example, a Gradle build.gradle file could look like the following:

plugins {
  id 'java'
}

repositories {
  mavenCentral()
}

dependencies {
  implementation 'com.guardsquare:proguard-core:9.1.1'
}

A Brainf*ck compiler

We’ll start with some boiler plate code for setting up the input and output arguments, creating a class with a main method and writing the class to a jar.

We can use the ClassBuilder utility to create a class, which takes as parameters the class file version (we’ll use Java 1.6), the access flags (public), the name (BF) and the super class (java/lang/Object).

Adding a method using the ClassBuilder is easy with the addMethod builder method: we first need to provide the access flags (public, static), the name (main) and the descriptor of the method (([Ljava/lang/String;)V). 

The final parameter of addMethod takes a function that allows building code with a CodeBuilder. The CodeBuilder interface declares a single method compose that provides a CompactCodeAttributeComposer parameter: this is how we’ll generate the specific bytecode instructions needed to implement Brainf*ck logic. For now, we’ll just add a return instruction.

Finally, the utility method writeJar can be used to write our class to a jar.

public class BfCompiler {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) throw new RuntimeException("Expected input and output arguments");
 
    var input = Files.readString(Path.of(args[0]));
    var output = args[1];

    var bfClass = new ClassBuilder(CLASS_VERSION_1_6, PUBLIC, "BF", NAME_JAVA_LANG_OBJECT)
        .addMethod(PUBLIC | STATIC, "main", "([Ljava/lang/String;)V", 65_535, composer -> {

         // TODO: Generate code here

         // Generate the return instruction
         composer.return_();

    }).getProgramClass();

    IOUtil.writeJar(new ClassPool(bfClass), output, externalClassName(bfClass.getName()));
  }
}

If you run BfCompiler, it will generate a jar file containing a class called BF, which contains a main method that does nothing but return! Try it out and see nothing for yourself:

$ java -jar output.jar

You can though use the javap command to look at the generated bytecode where you’ll see the main method which contains the return instruction at offset 0.

$ javap -cp output.jar -c -v -p BF
public class BF
  minor version: 0
  major version: 50
  flags: (0x0001) ACC_PUBLIC
  this_class: #2                          // BF
  super_class: #4                         // java/lang/Object
  interfaces: 0, fields: 0, methods: 1, attributes: 0
Constant pool:
  #1 = Utf8               BF
  #2 = Class              #1              // BF
  #3 = Utf8               java/lang/Object
  #4 = Class              #3              // java/lang/Object
  #5 = Utf8               main
  #6 = Utf8               ([Ljava/lang/String;)V
  #7 = Utf8               Code
{
  public static void main(java.lang.String[]);
    descriptor: ([Ljava/lang/String;)V
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=0, locals=1, args_size=1
         0: return
}

Next we’ll need to generate the code that implements the logic of the the input Brainf*ck program.


Memory

A Brainf*ck program operates on a block of memory with a typical implementation size of 30,000. Some implementations automatically expand the memory size if it’s exceeded but for our implementation we’ll use a fixed-size byte array to represent the memory; and to keep it simple we won’t handle out of bounds access.

So, the first code we’ll need to generate in our main method is to declare the array, which will require 3 instructions:

composer
   .sipush(30_000)
   .newarray(arrayTypeFromInternalType(BYTE))
   .astore(MEMORY);

The sipush (short integer) instruction pushes the desired size of the array, 30000, onto the stack which the newarray instruction will pop from the stack. The operand for the newarray instruction is the type of array: in this case a byte array. The actual value here is 8 but we use the utility function arrayTypeFromInternalType and the BYTE constant to make the code more readable.

Then we store the array that was just pushed onto the stack into a local variable slot with an astore instruction. The astore instruction pops a reference value from the stack and stores it in the specified local variable slot.

MEMORY is a static final field int that contains the local variable slot number, which you should add to the BfCompiler class:

private static final int MEMORY = 1;

Data Pointer

We’ll also need to keep track of the data pointer value: we’ll use another local variable slot for that and generate some code to initialise the value to zero. The iconst_0 instruction pushes the integer 0 onto the stack and then the istore instruction will pop the value from the stack and store it in the specified local variable slot.

composer
       .iconst_0()
       .istore(DATA_POINTER);

DATA_POINTER is a static final int field that contains the local variable slot number, which you should add to the BfCompiler class:

private static final int DATA_POINTER = 0;

Now we’ve initialised the memory and the data pointer; next we need to generate the code for each of the Brainf*ck operators in the input.


Parsing

We don’t need a complicated or fancy parser since the language is so simple. We can simply iterate over the characters in the input string, generating the corresponding code for each operator and then continue to the next one:

input.chars().forEach(c -> {
  switch (c) {
    case '>' -> move(composer,       1); // Move right by 1
    case '<' -> move(composer,      -1); // Move left by 1
    case '+' -> increment(composer,  1); // Increment by 1
    case '-' -> increment(composer, -1); // Decrement by 1
    case ',' -> printChar(composer); 
    case '.' -> readChar(composer);
    case '[' -> loopBegin(composer);
    case ']' -> loopEnd(composer);
    default -> {
      // Ignore other characters.
    }
   }
});

This will call a corresponding method to generate the code for each operator and ignore any non-operator characters.

We’ll take a look at each code generation method one by one.


The move operators < >

The < and > operators move the data pointer left or right by 1: in our generated Java program this corresponds to incrementing or decrementing the data pointer by 1.

Out of all of the Brainf*ck operators, this is the easiest to generate code for as there is a single Java bytecode instruction that does exactly what we need: iinc.

The iinc instruction takes as operands a local variable index and a signed byte value; and when executed the local variable will be incremented by that value.

private static void move(CompactCodeAttributeComposer composer, int amount) {
    composer.iinc(DATA_POINTER, amount);
}

The increment / decrement operators + –

The increment and decrement operators increment or decrement the value in the current memory cell by 1. In our generated Java bytecode this corresponds to incrementing or decrementing the value in the array at the data pointer index i.e. in Java source code it would look like MEMORY[DATA_POINTER]++.

The baload instruction can be used to load a byte value from a byte array. It pops its two operands from the stack: the array reference and the array index. So to load a value from the memory array requires 3 instructions:

  • aload to push the memory array onto the stack
  • iload to push the data pointer array index onto the stack
  • baload to load the value from the memory array at the data pointer index

There is also a corresponding bastore instruction which pops from the stack the array reference, the array index and the value to store into the array.

Since we’re going to need the memory array and data pointer on the stack later for using with bastore we duplicate the top two entries on the stack, using dup2, before the baload instruction pops them:

composer
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .dup2()
    .baload()

After executing these instructions the JVM stack will look something like this:

MEMORY[DATA_POINTER]
DATA_POINTER
MEMORY

We’ll then add 1 or -1 to the value on the top of the stack by pushing 1 or -1 and using the iadd instruction to add them together. The iadd instruction pops two integers from the stack, adds them together and then pushes the result back onto the stack.

composer
  .aload(MEMORY)
  .iload(DATA_POINTER)
  .dup2()
  .baload()
  .iconst(amount)
  .iadd()

The JVM stack will then look something like this:

MEMORY[DATA_POINTER] + (-)1
DATA_POINTER
MEMORY

We now have all the operands in place to use the bastore instruction to store the incremented value back into the memory array. The bastore instruction will pop the memory array, data pointer array index and the value to store into the memory array.

The full increment code generation method looks like this:

private static void increment(CompactCodeAttributeComposer composer, int amount) {
  composer
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .dup2()
    .baload()
    .iconst(amount)
    .iadd()
    .bastore();
}

So far we can move the data pointer and modify the values in memory but we can’t yet observe what’s happening: next we’ll generate code for the print operator.


The print operator .

If you’re a Java or Kotlin developer, you’ll know that to print something to standard out you’ll use the System.out.print method (or println if you want a newline printed). We’ll generate the code to invoke the same print method to print a character from the Brainf*ck memory.

The code to print a character from memory will need the following instructions:

  • getstatic to push the System.out reference onto the stack
  • aload / iload / baload to load the byte from the memory array
  • i2c to convert the byte to a char
  • invokevirtual to invoke the print method on System.out

The getstatic instruction takes the fully qualified class name, field name and descriptor as parameters and pushes the field value onto the stack:

composer
  .getstatic("java/lang/System", "out", "Ljava/io/PrintStream;")

The code to load the byte from the Brainf*ck memory is familiar, since it’s the same 3 instructions as for the increment operator:

composer
  .getstatic("java/lang/System", "out", "Ljava/io/PrintStream;")
  .aload(MEMORY)
  .iload(DATA_POINTER)
  .baload()

We’ll then add the i2c instruction to convert the byte to a char:

composer
  .getstatic("java/lang/System", "out", "Ljava/io/PrintStream;")
  .aload(MEMORY)
  .iload(DATA_POINTER)
  .baload()
  .i2c()

After executing these instructions the JVM stack will look something like this:

(char)MEMORY[DATA_POINTER]
System.out

The stack is now set up for executing an invokevirtual instruction for the print method. The object on which to execute the method is popped from the stack after the method parameters, in this case the single character.

The invokevirtual instruction requires the fully qualified class name (java/io/PrintStream), the method name (print) and the method descriptor ((C)V).

The full printChar code generation method looks like this:

private static void printChar(CompactCodeAttributeComposer composer) {
  composer
    .getstatic("java/lang/System", "out", "Ljava/io/PrintStream;")
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .baload()
    .i2c()
    .invokevirtual("java/io/PrintStream", "print", "(C)V");
}

We’ll now be able to see the results of the increment operations on the memory by printing out values. If you run the compiler with the following input and execute the resulting jar, what do you see?

+++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++.

The read operator ,

We can now print characters using System.out.println and next we’ll implement the read operator with System.in.

As with the print operator, we’ll start by pushing System.in onto the stack:

composer
  .getstatic("java/lang/System", "in", "Ljava/io/InputStream;")

To read a byte from the input we can use the read(byte[] array, int offset, int length) method of the System.in input stream. This method takes as parameters a byte array, a destination offset within the array and the length of bytes to read from the stream which will then be stored in the array. The method returns the number of bytes that were actually read from the stream.

Our Brainf*ck memory is a byte array and the data pointer points to the position in the array where the read byte should be stored; so these can be directly used as parameters to the read method. The Brainf*ck read operator only reads a single byte at a time so the length parameter to the read method is a constant integer 1. 

This means we need to push the memory array, the data pointer and constant integer 1 onto the stack before the invokevirtual instruction to invoke the read method:

composer
  .getstatic("java/lang/System", "in", "Ljava/io/InputStream;")
  .aload(MEMORY)
  .iload(DATA_POINTER)
  .iconst_1()
  .invokevirtual("java/io/InputStream", "read", "([BII)I")

Remember that the read method returns the number of bytes read? The invokevirtual instruction will push that return value onto the stack. We don’t care about it so we can discard it with a pop instruction. 

The full readChar code generation method looks like this:

private static void readChar(CompactCodeAttributeComposer composer) {
  composer
    .getstatic("java/lang/System", "in", "Ljava/io/InputStream;")
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .iconst_1()
    .invokevirtual("java/io/InputStream", "read", "([BII)I")
    .pop();
}

Now try compiling and running ,+.; type a character, press enter and what do you see?

We’re almost done but we’re still missing the [ and ] operators.


The loop operators [ ]

So far we’ve implemented 6 out of the 8 operators and have been able to compile and execute some simple Brainf*ck programs. But the two remaining operators are required to allow us to run more complicated programs.

Let’s remind ourselves of what the remaining two operators do:

[Jump to the location after the corresponding ] if the current value is zero
]Jump to the corresponding [ if the current value is not zero

For our compiler, this means that the generated code for the [ operator will need to conditionally jump forward to a location in code which we have not yet generated for the ]; and the code for the ] operator will need to conditionally jump back to the location of the corresponding [ operator.

To implement this in Java bytecode we’ll use the ifeq and ifne instructions that jump to a specified offset if the value on the stack is 0 or not 0 respectively. When we’re parsing the [ operator to generate the ifeq we’ll need to already know the location to jump forward to even though we haven’t yet generated it.

ProGuardCORE solves this problem with labels: the CompactCodeAttributeComposer has a createLabel() method that will create a label that can be used as a parameter for the label pseudo-instruction like this:

var exampleLabel = composer.createLabel();
composer
  .iconst_0()
  .ifeq(exampleLabel)
  ...
  .label(exampleLabel)
  ....

We’ll need to keep track of the labels that we create because we’ll create the labels when generating code for the [ and then need to use the labels later when generating the code for the corresponding ].

We’ll do this with a stack of pairs of labels: one label for the loop body and another for the loop exit. We use a stack because loops can be nested and this allows us to match the corresponding [ and ]. 

The following record and field should now be added to our BfCompiler class:

private record LoopInfo(Label body, Label exit) {}
private static final Stack<LoopInfo> loops = new Stack<>();

For the beginLoop code generation method, we start by creating labels and pushing them onto the stack:

var loopInfo = loops.push(new LoopInfo(composer.createLabel(), composer.createLabel()));

Then we need to generate the code for the [ operator with the following instructions:

  • aload / iload / baload to load the byte from the memory
  • ifeq to conditionally jump to the label loopInfo.exit
  • label to mark the position in the code with the label loopInfo.body

The full beginLoop code generation method then looks like this:

private static void loopBegin(CompactCodeAttributeComposer composer) {
  var loopInfo = loops.push(new LoopInfo(composer.createLabel(), composer.createLabel()));

  composer
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .baload()
    .ifeq(loopInfo.exit)   // jump to loop exit if zero
    .label(loopInfo.body); // mark the start of loop body
}

The code generation for the ] operator works in a similar way except it pops the labels from the loops stack; then it generates the code for the ] operator with the following instructions:

  • aload / iload / baload to load the byte from the memory
  • ifne to conditionally jump to the label loopInfo.body
  • label to mark the position in the code with the label loopInfo.exit

We expect that the [ and ] operators are correctly balanced when popping from the array but in the case of a syntax error in the input this may not be the case. For example, the following input would throw an EmptyStackException if we try to pop labels from the loops stack: ++[--]].

We can provide a nicer error message by checking if the stack is empty before we try to pop the labels from the loops stack.

The full endLoop code generation method then looks like this:

private static void loopEnd(CompactCodeAttributeComposer composer) {
  if (loops.empty()) throw new RuntimeException("Unexpected ']'");

  var loopInfo = loops.pop();

  composer
    .aload(MEMORY)
    .iload(DATA_POINTER)
    .baload()
    .ifne(loopInfo.body)
    .label(loopInfo.exit);
}

As a final improvement, we can also detect when there are too many [ operators compared with the number of ] operators and provide a nicer error message. If the loops stack is not empty at the end of our code generation we know that we’ve pushed more loop labels onto the stack than we have popped:

// Add just before composer.return_();
if (!loops.empty()) throw new RuntimeException("Too many '['");

Now our compiler is ready for anything (almost)!


Finally, a working compiler

Congratulations! The compiler can now run almost any Brainf*ck program. Try the following, what does it do?

-[------->+<]>-.-[->+++++<]>
++.+++++++..+++.[--->+<]>-----
.+++++[->++<]>.>+[--->++
<]>.---------.-[-->+<]>-----.[--->+<]>-.

The full compiler code can be found over on GitHub which you can easily run via Gradle:

$ ./gradlew run --args "examples/hellojvm.bf build/output.jar"
$ java -jar buid/output.jar

Why Almost?

There are some limitations in our implementation that prevent us from running every possible Brainf*ck program. For one, the memory is limited to a fixed size of 30,000 cells: some Brainf*ck programs may require more than this.

There are also some JVM limitations: we currently generate code in a single method. A single method in the JVM is limited to 65,535 bytes which we could exceed. In fact, if you compile this Mandelbrot Brainf*ck program we already reach 43,658 bytes.

One improvement we can make to reduce code size could be to generate helper methods for each of our operations rather than generating all the code in a single method.

But a bigger improvement would be to optimise the generation of the code for the move (< >) and add (+ -) operators.

Optimizations

Consider the following Brainf*ck program: >>>>>

Our compiler will call the move code generation method 5 times, generating 5 iinc instructions; when instead we could generate a single iinc instruction with 5 as the increment amount. 

The move and increment methods are already set up to pass in increment amounts other than 1 or -1: can you modify the compiler to optimise the generation of consecutive move and increment operations?

Here’s a nice blog post that describes this and other optimizations; can you implement them in our compiler?


Next steps

We’ve written a JVM compiler for a real, albeit esoteric language using a small amount of Java code and just 19 bytecode instructions; but there are many more bytecode instructions — take a look at the specification.

For your next steps with ProGuardCORE, take a look at the manual, examples or the source code for ProGuard which uses ProGuardCORE to shrink & optimise Java bytecode.

ProGuardCORE is also not the only library that provides the tools to generate Java bytecode: other examples are ASM and Byte Buddy; and the functionality to generate Java bytecode will even be included in the Java standard library soon with JEP 457.


Author: James Hamilton

I’m a compiler engineer working at Guardsquare on JVM/Android related tools & libraries including ProGuardCORE, ProGuard and DexGuard.

Next Post

Previous Post

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

© 2024 JVM Advent | Powered by steinhauer.software Logosteinhauer.software

Theme by Anders Norén