Transpiling a Forth dialect to LLVM IR

This article is an offshoot of a series of articles about a baremetal ARM Forth dialect, see part 1 / part 2.


Introduction



This is a write-up about improving the performance / portability of my Forth dialect by adding an LLVM IR code generator.

The primary reason for this was to leverage a modern optimizer to produce fast code tailored to a CPU target, the secondary reason was to target many other architectures, and a background goal was to remain small and simple.

Adding LLVM IR code generation led to the implementation of a fairly generic transpiler extension to the compiler; this "transpiler" is able to translate the Forth dialect code to other languages in the same way that another architecture would be targeted with the compiler (through a tailored dictionary).

The transpiler implementation is fairly small, with ~400 lines of ARM code plus a tailored LLVM IR dictionary of ~400 lines as well; I also have a quickly hacked-up JavaScript (p5.js environment) dictionary derived from the LLVM one.

The LLVM IR code generator is not yet complete as it has limitations and some words are still unsupported, but it works for a large subset of the dialect, which is sufficient for computer graphics prototyping.

The transpiler can also be used to check program correctness, a useful side feature since errors are not handled by the compiler (to avoid any bloat); the transpiler adds some features that can be used to do stack checks, and stack-leaking programs also generate invalid LLVM IR code.

Translating Forth code to LLVM IR

The challenge of converting Forth code to LLVM IR mainly relates to translating stack-based code into register-based code and generating the required SSA form, where variables are assigned exactly once.

A stack can be used in LLVM IR (the simplest model for Forth !) but my wish was to go all registers instead to get the most out of the optimizer (and the register-based CPUs all around).
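
To give a rough idea of the difference (a sketch only, with made-up register indices, not the transpiler's literal output), a memory-stack model would keep the Forth data stack in an allocated buffer and turn every word into load / store traffic, while the all-registers model turns each stack cell into a fresh SSA value :

; stack model (not used here) : the data stack lives in memory,
; words become getelementptr + load / store traffic on it
%dstack = alloca [64 x i32]

; register model (used here) : each push creates a fresh SSA value
%r00000014 = add i32 %r00000012, %r00000013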

Generating code in SSA form has analysis / optimization advantages later on (if needed) and also allows an easy way to translate my dialect to other languages.

Forth words could be defined as LLVM IR functions but I chose to generate all the IR code into the main function to keep it simple / single pass.
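
The overall output therefore looks roughly like the skeleton below (the function name / signature is illustrative, not necessarily what the transpiler emits) :

define i32 @main() {
entry:
  ; all transpiled word code is emitted inline here, in program order
  ret i32 0
}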

My transpiler generates the LLVM IR textual / human-readable form (debugging felt easier); the binary form might be more adequate in size-constrained contexts.

Helpful resources

Some Forth to LLVM IR proof-of-concept projects exist, such as movForth, but they use a different approach; some Forths probably also include LLVM IR support so they may be worth checking out for a more standardized Forth.

Tools / Debugging

CPUlator was again very useful to quickly iterate on / debug the compiler code.

godbolt.org was useful to debug the generated LLVM IR code; I kept an LLVM IR source tab alongside llc (compiler) + opt (optimizer / analyzer) output tabs.

The main platform target still remains the early ARM CPUs (ARM2 etc.) so none of the modern ARM code is used in the compiler; I use these arguments to llc to generate unoptimized vanilla (early 90s) ARM code : -march=arm -mcpu=strongarm -O=0

The code generator


There are various approaches to generating LLVM IR; my aim was to keep it simple and preserve the existing compiler code.

The initial idea for the LLVM IR code generator was to simply reuse the Forth dictionary model by defining words whose contents consist of LLVM IR code.

My early solution was to build a lightweight virtual machine (the transpiler core) in combination with raw ASCII strings in word definitions to represent unresolved LLVM IR code.

Special ASCII codes are used as placeholders or to manipulate the local template-level context; some of these codes also simulate Forth stack operations on the data / return stack to help track registers / labels, a requirement for SSA code generation.

Forth to SSA / registers

In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is a type of intermediate representation (IR) where each variable is assigned exactly once. SSA is used in most high-quality optimizing compilers for imperative languages.
A straightforward way to convert Forth code into SSA form is to treat push instructions as the creation of a new symbol (register name or label name) at compile time; this symbol is then pushed onto a register stack, and each subsequent pop returns a symbol that can be used in one or more emitted instructions.

For simplicity, I chose to represent symbols as 32-bit values (referred to as indices); such a value represents a virtual register and is expanded into a padded hexadecimal string when emitted.
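
As an illustration of the scheme (register indices are made up, and I assume here that dup simply duplicates the top index without emitting code), transpiling dup + when the compile-time register stack holds index 0x12 on top goes roughly like this :

; Forth : dup +   -- double the value on top of the stack
; register stack before : ... %r00000012
; dup : push the same symbol again, nothing is emitted
; +   : pop twice, create a new symbol and emit the add
%r00000013 = add i32 %r00000012, %r00000012
; register stack after  : ... %r00000013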

The transpile routine

The transpile routine is called by the compiler at specific points and takes a template ASCII string as argument, which is then interpreted. The routine updates its internal representation of the data / return stacks as it encounters various instructions; it also generates the necessary register / label names and outputs the resolved LLVM IR code.

Transpiler context has :
Some notes :
Here is an example of how the transpile routine processes the + word :

forth_word "+"
    .ascii "\x0f\x0f\x0e"
    .ascii "%r\x16 = add i32 %r\x15, %r\x14\n"
    .asciz "\x08"
Note : register names are prefixed by r and followed by a register index formatted as a zero-padded 32-bit hexadecimal value.
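
Resolved, with illustrative indices (assuming 0x12 and 0x13 are the two topmost entries of the register stack), the template above produces something like :

%r00000014 = add i32 %r00000013, %r00000012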

Most words of the dictionary follow the same concept; here is another example with the > word :

forth_word ">"
    .ascii "\x0f\x0f\x10\x10\x0e"
    .ascii "%r\x16 = icmp sgt i32 %r\x15, %r\x14\n"
    .ascii "%r\x17 = zext i1 %r\x16 to i32\n"
    .ascii "%r\x18 = sub i32 0, %r\x17\n"

Same as before, with additional ASCII codes to get temporary register names for the first two instructions (temporary meaning they are only used within the template code : a new symbol is created but the stack is untouched).
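
With illustrative indices again, the resolved output looks something like the following, turning the i1 comparison result into a Forth-style flag (-1 for true, 0 for false) :

%r00000015 = icmp sgt i32 %r00000014, %r00000013
%r00000016 = zext i1 %r00000015 to i32
%r00000017 = sub i32 0, %r00000016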

Here is a similar example taken from the JavaScript dictionary for the word <> :

forth_word "<>"
    .ascii "\x0f\x0f\x0e"
    .ascii "var v\x16 = (v\x15 != v\x14) ? -1 : 0\n"

Flow words use more ASCII codes (e.g. for the return stack) and different prefixes but follow the same concept with way more verbosity due to word complexity; there are also special templates for built-in constructs such as literals, variables etc.

Note that the first line of each word definition could be moved into a macro (and placed in a separate file) as it likely remains unchanged across many target languages; I don't do this as I like that the template context remains visible.

Template strings provide high flexibility but they are also a bit redundant in terms of content (e.g. the first line); this method is still satisfying as it has low complexity and matches the original Forth dictionary concept (it avoids more structured representations).

Template codes

Here is a list of all the instructions that can be used within a template string :

CODE DESCRIPTION
0x00 end of template code
0x04 output string from context stack (TOS - 4)
0x05 set current block index from context stack TOS
0x06 push frame; next byte is a block marker (e.g. if or else)
0x07 pop frame and add phi nodes
0x08 reset template-level stack
0x0B get data stack TOS value
0x0D push r12 onto context stack; used to expose literals to template code
0x0E emulates Forth push (data stack)
0x0F emulates Forth pop (data stack)
0x10 get temporary register
0x11 emulates Forth push (return stack)
0x12 emulates Forth pop (return stack)
0x13 get temporary label
0x14 output hex literal at context TOS - 0
0x15 output hex literal at context TOS - 4
0x16 output hex literal at context TOS - 8
0x17 output hex literal at context TOS - 12
0x18 output hex literal at context TOS - 16
0x19 output hex literal at context TOS - 20
0x1A output hex literal at context TOS - 24
0x1B output hex literal at context TOS - 28
0x1C output hex literal at context TOS - 32
0x1D output hex literal at context TOS - 36
0x1E output hex literal at context TOS - 40
0x1F output hex literal at TOS - N, where N is the next byte

Some notes :

phi nodes generation

The code generator described so far is quite simple and sufficient to handle most Forth primitives, but the conditional ones (if, else, then) require register context tracking (block / label) to be able to generate phi nodes (SSA form again; they are used to resolve values from different code paths into a single one).

The way I solved this is pretty hacky (and not very RAM efficient) but it works :
  • a template instruction was added to indicate a block change; the register index was also paired with its corresponding block / label index on the register stack, these features allow the transpiler to know which label / block a virtual register belongs to
  • two template instructions were added to manage stack frames (push / pop), the pop instruction triggers phi nodes generation
The frame push instruction begins a new context by cloning the current register stack; the pop instruction then compares the register stacks from the two preceding blocks and a phi node is generated when a register name differs. The associated register and block-level information is then used to correctly emit the phi node in the resulting LLVM IR.
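
As a sketch of what this produces (label names and register indices are illustrative, not the transpiler's exact naming), an if / else where each branch leaves a different value on the stack ends up merged like this :

then_00000001:
  %r00000021 = add i32 %r00000020, 1
  br label %endif_00000001
else_00000001:
  %r00000022 = sub i32 %r00000020, 1
  br label %endif_00000001
endif_00000001:
  %r00000023 = phi i32 [ %r00000021, %then_00000001 ], [ %r00000022, %else_00000001 ]
  ; ...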

Note that phi nodes are not actually required, since IR stack / temporary variables can be used in combination with the mem2reg LLVM pass to generate them automatically; it is still unclear to me whether this method is simpler than phi nodes generation... it may be simpler / more adequate as a generic solution though, as it probably doesn't have the same limitation (my phi nodes generation has a constraint : the stack level must match within different code paths).
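
For reference, that alternative would look roughly like this (a sketch only, not what the transpiler currently emits) : each conditional result goes through a temporary slot allocated in the entry block, and running opt with the mem2reg pass promotes the slot to registers and inserts the phi nodes automatically :

; in the entry block
%slot = alloca i32
; in the then-branch
store i32 %r00000021, ptr %slot
; in the else-branch
store i32 %r00000022, ptr %slot
; after the merge point, mem2reg rewrites this load into a phi node
%r00000023 = load i32, ptr %slot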

p5js output mode

p5.js output was added to test the integration of other languages; it was straightforward to do compared to LLVM IR due to many low-level constructs being abstracted away (labels) and the use of global variables.

Phi nodes are also used to resolve conditionals with p5.js but the implementation is simpler as it does not require register context (there are no labels); the merge is simply done by using the most recently defined variable with the nullish coalescing operator, so phi nodes look like this (JavaScript code) :

var v0000004D = ...
if (...) {
    // drop instruction; v0000004D is dropped
    var v00000051 = ... // TOS is now this
}
var v00000052 = v00000051 ?? v0000004D // will take the most recently defined variable
// avoid issues on next run (due to being global)
var v00000031 = undefined
var v00000035 = undefined

The generated code works in a contained environment roughly similar to the RPI 0 1.3 : a memory buffer is defined and the Forth words work exclusively with this buffer; there is added code at the top of the draw() function to copy the virtual framebuffer area into the p5.js canvas.

There are some limitations :
  • JavaScript has stricter variable naming rules so a word name like my_variable? may trigger errors (the transpiler doesn't check this)
  • the code is wrapped into the draw() function due to being used for graphics; an infinite loop (which is necessary on the RPI) may bring issues as the draw() function is already called repeatedly
  • only the gett (timer) and puts (text drawing) API words are implemented, they emulate the U-Boot API although in a much simpler way (no ANSI escape codes support)

Early transpiler design issues

Here are some early design issues I encountered from blindly (no prior knowledge) attempting to convert the dialect code to LLVM IR :
  • not keeping track of the Forth stack; I initially used simple register increment / decrement, which quickly broke the SSA form :)
  • not keeping track of the Forth return stack (the separation is helpful for IR clarity !)
  • LLVM IR template strings were initially resolved as I transpiled user word contents; this caused many SSA / logic issues since the compiler produces inline code by default. The fix was easy : the code should stay unresolved until it is processed in the entry point; only literals (and variables) can be safely resolved as encountered since they are static values
  • not keeping track of register context; this turned out to be important for introducing the phi nodes needed by working conditionals / loops, and it was aggravated by generating all the code into a single main function
  • variables didn't use the LLVM IR global variables scheme (to meet the code simplicity goal) so performance was somewhat disappointing at first (sitting between GCC -O2 and -O3); this was added as an option with the minor gotcha that definitions must be added manually at the top of the LLVM file after code generation (see the sketch below)
  • not adding any platform ABI specification to the LLVM IR file; this alone provided a nice performance boost (x2) with the cache off (also shown in the sketch below)
Keeping track of register context for phi nodes generation was the hardest part as I needed a few days of intensive debugging due to it being quickly hacked in on top of the initial data structures; the solution I ended up with is okay but not very RAM efficient.
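
A sketch of what those last two fixes look like (the variable name and the triple are illustrative, not what my programs actually use) : the target specification and the global definitions go at the top of the generated .ll file, and variable accesses then become load / store on the globals :

; platform ABI specification, an early ARM target triple
; (a matching target datalayout line can be copied from clang output for the same target)
target triple = "arm-none-eabi"

; a Forth variable emitted as an LLVM global, added manually after code generation
@var_counter = global i32 0

; accesses then become load / store on the global
%r00000030 = load i32, ptr @var_counter
store i32 %r00000031, ptr @var_counter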

Benchmark

Speed

The benchmark was done on a Raspberry Pi Zero 1.3 :


The cache-on result is pretty good as it beats Clang at max optimization level, but I don't know why the transpiled code sits between GCC -O1 and -O2 with the cache off; it is perhaps due to some parts of the generated IR code (load / store specifically : inttoptr, lack of GEP ?).
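
To illustrate what that refers to (made-up names, not necessarily my generated code) : address arithmetic done on integers and converted with inttoptr hides the pointer provenance from the optimizer, whereas getelementptr keeps it explicit :

; integer address arithmetic followed by inttoptr
%r00000040 = add i32 %r0000003E, %r0000003F
%r00000041 = inttoptr i32 %r00000040 to ptr
%r00000042 = load i32, ptr %r00000041

; the getelementptr equivalent keeps pointer provenance visible to alias analysis
%r00000043 = getelementptr i32, ptr %base, i32 %r0000003F
%r00000044 = load i32, ptr %r00000043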

Note : the p5.js benchmarks were done on my desktop (i7 6700 3.4 GHz); the transpiled code uses typed arrays, which may explain the difference between manual and transpiled, and perhaps it also generates more pleasant code for the JavaScript optimizer. Quite fun overall that it is just ~4x slower than a 2003 1 GHz CPU running raw ARM code.

Size

As for binary size, a 60-byte binary (RPI 0 1.3 target) of a short graphics program was produced with LLVM size optimizations; it is about 2 instructions more than a handcrafted tiny binary. Near-optimal tiny binaries like this are possible as long as they don't use any data (variables, strings etc.) or API words.

handcrafted program (left) vs a program transpiled to LLVM (right)

Conclusion


There is still some room to simplify / improve the transpiler, such as refactoring the phi nodes generation routine, adding automatic support for emitting global variable definitions, generating better load / store code and implementing quotation support for completeness.

All in all, the transpiler was quite fun to do and a good compact addition to the compiler, with surprising end results for a few days of code.

tiny binary running on RPI 0 1.3 (with U-Boot)

Licence Creative Commons