Transpiling a Forth dialect to LLVM IR
This article is
an offshoot of a series of articles about a baremetal ARM Forth
dialect, see
part 1 / part
2.
Introduction
This is a write-up about enhancing the performance / portability of my Forth dialect by adding an LLVM IR code generator.
The primary reason for this was to leverage a modern optimizer to produce fast code tailored for a CPU target, the secondary reason was to target many other architectures, and a background goal was to remain small and simple.
Adding LLVM IR code generation led to the implementation of a fairly generic transpiler extension to the compiler; this "transpiler" is able to translate the Forth dialect code to other languages in the same way that the compiler would target another architecture (through a tailored dictionary).
The transpiler implementation is fairly small, with ~400 lines of ARM code plus a tailored LLVM IR dictionary of ~400 lines as well; I also have a quickly hacked-up JavaScript (p5.js environment) dictionary derived from the LLVM one.
The LLVM IR code generator is not yet complete, as it has limitations and some words are still unsupported, but it works for a large subset of the dialect, which is sufficient for computer graphics prototyping.
The transpiler can also be used to check program correctness, a useful side feature since errors are not handled by the compiler to avoid any bloat; the transpiler adds some features that can be used to do stack checks, and stack-leaking programs also generate wrong LLVM IR code.
Translating Forth code to LLVM IR
The challenge of converting Forth code to LLVM IR mainly relates to the translation of stack-based code to register-based code and to generating the required SSA form, where variables are assigned exactly once.
A stack can be used in LLVM IR (the simplest model for Forth !) but my wish was to go all registers instead to get the most out of the optimizer (and of the register-based CPUs all around).
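To illustrate the difference, here is a minimal sketch (not my transpiler's actual output; the function names and the typed-pointer syntax are mine, newer LLVM versions would use ptr) of the same addition expressed through a memory stack versus pure SSA registers :

; memory (alloca) model : every value goes through load / store
define i32 @add_stack(i32 %a, i32 %b) {
  %slot0 = alloca i32
  %slot1 = alloca i32
  store i32 %a, i32* %slot0
  store i32 %b, i32* %slot1
  %t0 = load i32, i32* %slot1
  %t1 = load i32, i32* %slot0
  %t2 = add i32 %t1, %t0
  ret i32 %t2
}

; all-registers / SSA model : the optimizer directly sees plain values
define i32 @add_regs(i32 %a, i32 %b) {
  %t0 = add i32 %a, %b
  ret i32 %t0
}

The first form relies on passes such as mem2reg to recover SSA values, the second one is already in the shape the optimizer wants.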
Generating code in SSA form has analysis / optimization advantages later on (if needed) and also allows an easy way to translate my dialect to other languages.
Forth words could be defined as LLVM IR functions but I chose to generate all the IR code into the main function to keep it simple / single-pass.
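Since everything lands in the entry point, the emitted module roughly has the following shape (a sketch : the exact prologue / epilogue, return type and target lines my generator emits may differ) :

; optional target datalayout / triple lines go here (see the ABI note further down)
define i32 @main() {
  ; ...the whole transpiled program, fully inlined...
  ret i32 0
}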
My transpiler generates the LLVM IR textual / human-readable form (debugging felt easier); the binary form might be more adequate in a size-constrained context.
Helpful resources
- LLVM reference documentation
- Mapping High Level Constructs to LLVM IR gitbook
Some Forth-to-LLVM-IR proof-of-concept projects exist, such as movForth, but they use a different approach; some Forths probably also include LLVM IR support, so they may be worth checking out for a more standardized Forth.
Tools / Debugging
CPUlator was again very useful to quickly iterate on / debug the compiler code.
godbolt.org was useful to debug the generated LLVM IR code; I kept an LLVM IR source tab with an llc (compiler) + opt (optimizer / analyzer) output tab.
The main platform target still remains the early ARM CPUs (ARM2 etc.) so none of the modern ARM code is used by the compiler; I use these llc arguments to generate unoptimized vanilla (early 90s) ARM code : -march=arm -mcpu=strongarm -O=0
The code generator
There are various approaches to generating LLVM IR; my aim was to keep it simple and preserve the existing compiler code.
The initial idea for the LLVM IR code generator was to simply reuse the Forth dictionary model by defining words whose contents consist of LLVM IR code.
My early solution was to build a lightweight virtual machine (the transpiler core) in combination with raw ASCII strings in word definitions to represent unresolved LLVM IR code.
Special ASCII codes are used as placeholders or to manipulate the local template-level context; some of these codes also simulate Forth stack operations on the data / return stacks to help track registers / labels, a requirement for SSA code generation.
Forth to SSA / registers
In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is a type of intermediate representation (IR) where each variable is assigned exactly once. SSA is used in most high-quality optimizing compilers for imperative languages.
A straightforward way to convert Forth code into SSA form is to treat push instructions as the creation of a new symbol (register name or label name) at compile time; this symbol is then pushed onto a register stack, and each subsequent pop returns a symbol that can be used in one or more emitted instructions.
For simplicity, I chose to represent symbols as 32-bit values (referred to as indices); such a value represents a virtual register and is expanded into a padded hexadecimal string when emitted.
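For example, a fragment that sums three values already sitting on the stack (register numbers are illustrative, they would have been created by whatever pushed those values) creates one new symbol per push and consumes symbols on each pop :

; register stack holds %r00000001 %r00000002 %r00000003 (top)
; first +  : pops symbols 3 and 2, creates and pushes symbol 4
%r00000004 = add i32 %r00000002, %r00000003
; second + : pops symbols 4 and 1, creates and pushes symbol 5
%r00000005 = add i32 %r00000001, %r00000004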
The transpile routine
The transpile routine is called by the compiler at specific points and takes a template ASCII string as argument, which is then interpreted. The routine updates its internal representation of the data / return stacks as it encounters various instructions; it also generates the necessary register / label names and outputs the resolved LLVM IR code.
The transpiler context has :
- a 32-bit register counter, incremented on each data stack push operation (i.e. the creation of a new symbol)
- a register stack containing pairs of values : block index and register index
- a 32-bit label counter, incremented on each return stack push operation
- a label stack
- a 32-bit block index representing the current LLVM IR block (defined by LLVM IR labels)
- a template-level stack, cleared at the end of each string template (which is required due to inlining); it holds temporary values used to fill placeholders
Some notes :
- the register / label stacks could be merged into a single stack (same for the associated counters) but I chose to keep them separate for clarity in template code / IR output
- the register stack has redundant information (the block index) due to a quick fix for an early design issue (I didn't keep track of register context at first...)
Here is an example of how the transpile routine processes the + word :

forth_word "+"
.ascii "\x0f\x0f\x0e"
.ascii "%r\x16 = add i32 %r\x15, %r\x14\n"
.asciz "\x08"
- the first string emulates the Forth data stack : it does two pops and one push, and these three instructions each push a value (a register index) onto the template-level stack
- the second string is the LLVM IR instruction with getters (placeholders); these getters fetch a value from the template-level stack and output it at their location
- the third string is null-terminated and holds an instruction to reset the template-level stack, which allows the next processed template string to start from a clean context (this last line is usually generated automatically by the forth_word macro); the resolved output is sketched below
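For instance, if the register stack currently held %r00000041 (below TOS) and %r00000042 (TOS) and the register counter stood at 0x43 (illustrative numbers), the + template could resolve to :

%r00000043 = add i32 %r00000041, %r00000042

with %r00000043 left on the register stack for the next word.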
Note : register names are prefixed by r and followed by a register index formatted as a 32-bit hexadecimal value with padding.
Most words of the dictionary follow the same concept, here is another example with the > word :

forth_word ">"
.ascii "\x0f\x0f\x10\x10\x0e"
.ascii "%r\x16 = icmp sgt i32 %r\x15, %r\x14\n"
.ascii "%r\x17 = zext i1 %r\x16 to i32\n"
.ascii "%r\x18 = sub i32 0, %r\x17\n"
Same as before, with additional ASCII codes to get temporary register names for the first two instructions (temporary means that they are only used within the template code : a new symbol is created but the stack is untouched).
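With the same illustrative stack content as the + example above, the > template could resolve to :

%r00000043 = icmp sgt i32 %r00000041, %r00000042
%r00000044 = zext i1 %r00000043 to i32
%r00000045 = sub i32 0, %r00000044

The zext / sub pair converts the i1 comparison result into a Forth-style flag (0 or -1); only %r00000045 ends up on the register stack.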
Here is a similar example taken from the JavaScript dictionary for the <> word :

forth_word "<>"
.ascii "\x0f\x0f\x0e"
.ascii "var v\x16 = (v\x15 != v\x14) ? -1 : 0\n"
Flow words use more ASCII codes (e.g. for the return stack) and different prefixes but follow the same concept with way more verbosity due to word complexity; there are also special templates for built-in constructs such as literals, variables etc.
Note that the first line of each word definition could be moved into a macro (and placed in a separate file) as it likely remains unchanged across many target languages; I don't do this as I like that the template context remains visible.
Template strings provide high flexibility but they are also a bit redundant in terms of content (e.g. the first line); this method is still satisfying as it has low complexity and matches the original Forth dictionary concept (it avoids more structured representations).
Template codes
Here is a list of all the instructions that can be used within
a template string :
CODE | Description
---|---
0x00 | end of template code
0x04 | output string from context stack (TOS - 4)
0x05 | set current block index from context stack TOS
0x06 | push frame; next byte is a block marker (e.g. if or else)
0x07 | pop frame and add phi nodes
0x08 | reset template-level stack
0x0B | get data stack TOS value
0x0D | push r12 onto context stack; used to expose literals to template code
0x0E | emulates Forth push (data stack)
0x0F | emulates Forth pop (data stack)
0x10 | get temporary register
0x11 | emulates Forth push (return stack)
0x12 | emulates Forth pop (return stack)
0x13 | get temporary label
0x14 | output hex literal at context TOS - 0
0x15 | output hex literal at context TOS - 4
0x16 | output hex literal at context TOS - 8
0x17 | output hex literal at context TOS - 12
0x18 | output hex literal at context TOS - 16
0x19 | output hex literal at context TOS - 20
0x1A | output hex literal at context TOS - 24
0x1B | output hex literal at context TOS - 28
0x1C | output hex literal at context TOS - 32
0x1D | output hex literal at context TOS - 36
0x1E | output hex literal at context TOS - 40
0x1F | output hex literal at TOS - N, where N is the next byte
Some notes :
- the instructions evolved organically as I implemented code generation; the list could be way cleaner / have fewer instructions with a bit of refactoring and thoughtful design
- all the instructions from 0x14 to 0x1E can be emulated with 0x1F but I kept them for clarity in template code
- the 0x05 to 0x07 instructions are solely used to handle conditionals / loops (phi node generation)
phi nodes generation
The code generator described so far is quite simple and sufficient to handle most Forth primitives, but the conditional ones (if, else, then) require register context tracking (block / label) to be able to generate phi nodes (SSA form again; used to resolve values from different code paths into a single one).
The way I solved this is pretty hacky (and not very RAM efficient) but it works :
- a template instruction was added to indicate a block change, and the register index was also paired with its corresponding block / label index in the register stack; these features allow the transpiler to know which label / block a virtual register belongs to
- two template instructions were added to manage stack frames (push / pop), the pop instruction triggering phi node generation
The frame push instruction begins a new context by cloning the current register stack; the pop instruction then compares the register stacks from the two preceding blocks, and a phi node is generated when a register name differs. The associated register and block-level information is then used to correctly emit the phi node in the resulting LLVM IR.
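For example, a conditional where each branch leaves a different register as TOS could produce IR along these lines (label / register names and the flag test are illustrative, not the exact output of my generator) :

  %r00000021 = icmp ne i32 %r00000020, 0          ; if consumes a Forth flag
  br i1 %r00000021, label %L00000001, label %L00000002
L00000001:                                        ; if branch, leaves %r00000022 as TOS
  %r00000022 = add i32 %r00000010, 1
  br label %L00000003
L00000002:                                        ; else branch, leaves %r00000023 as TOS
  %r00000023 = sub i32 %r00000010, 1
  br label %L00000003
L00000003:                                        ; frame pop : the register stacks differ, so a phi is emitted
  %r00000024 = phi i32 [ %r00000022, %L00000001 ], [ %r00000023, %L00000002 ]

This also shows the constraint mentioned below : both code paths must leave the stack at the same depth, otherwise there is no one-to-one register pairing to build the phi nodes from.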
Note that phi nodes are actually not required, since IR stack / temporary variables can be used in combination with the mem2reg LLVM pass to generate them automatically. It is still unclear to me whether this method is simpler than phi node generation... it may be simpler / more adequate as a generic solution though, as it probably doesn't have the same limitation (my phi node generation has a constraint : the stack level must match across different code paths).
p5js output mode
p5.js output was added to test the integration of other languages; it was straightforward to do compared to LLVM IR due to many low-level constructs being abstracted away (labels) and the usage of global variables.
Phi nodes are also used to resolve conditionals with p5.js, but the implementation is simpler as it does not require register context (there are no labels); the merge is simply done by using the most recently defined variable with the nullish coalescing operator, so phi nodes look like this (JavaScript code) :
var v0000004D = ...
if (...) {
  // drop instruction; v0000004D is dropped
  var v00000051 = ... // TOS is now this
}
var v00000052 = v00000051 ?? v0000004D // will take the most recently defined variable
// avoid issues on next run (due to being global)
var v00000031 = undefined
var v00000035 = undefined
The generated code works in a contained environment roughly similar to the RPI 0 1.3 : a memory buffer is defined and the Forth words exclusively work with this buffer; there is added code at the top of the draw() function to copy the virtual framebuffer area into the p5.js canvas.
There are some limitations :
- JavaScript has stricter variable naming so a word like my_variable? may trigger errors (the transpiler doesn't check this)
- the code is wrapped in the draw() function due to being used for graphics; an infinite loop (which is necessary on the RPI) may bring issues as the draw() function is already called repeatedly
- only the gett (timer) and puts (text drawing) API words are implemented; they emulate the U-Boot API although much simpler (no ANSI escape codes support)
Early transpiler design issues
Here are some early design issues I encountered from blindly (no prior knowledge) attempting to convert the dialect code to LLVM IR :
- not keeping track of the Forth stack; I initially used a simple register increment / decrement, which quickly led to breaking the SSA form :)
- not keeping track of the Forth return stack (the separation is helpful for IR clarity !)
- LLVM IR template strings were initially resolved as I transpiled user word contents; this had many SSA / logic issues since the compiler produces inline code by default. The fix was easy : the code should stay unresolved until it is processed in the entry point, and only literals (and variables) can be safely resolved as encountered since they are static values
- not keeping track of the register context; it turns out this was important to introduce phi nodes for working conditionals / loops, and all this was aggravated by generating all the code into a single main function
- variables didn't use the LLVM IR global variables scheme (to meet the code simplicity goal) so performance was kind of disappointing at first (sitting between GCC -O2 and -O3); this was added as an option, with the minor gotcha that the definitions must be added manually at the top of the LLVM file after code generation (see the sketch after this list)
- didn't add any platform ABI specification to the LLVM IR file; this alone provided a nice performance boost (x2) with the cache off
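As a sketch of what these manual additions look like at the top of the generated file (the variable name, the triple value and the typed-pointer syntax are assumptions, not the exact lines I use) :

target triple = "armv6-unknown-none-eabi"   ; platform ABI specification
@my_var = global i32 0                      ; one definition per Forth variable

; variable accesses then become plain loads / stores :
; %r00000030 = load i32, i32* @my_var
; store i32 %r00000031, i32* @my_var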
Keeping track of the register context for phi node generation was the hardest part, as I needed a few days of intensive debugging due to it being quickly hacked in on top of the initial data structures; the solution I ended up with is okay but not very RAM efficient.
Benchmark
Speed
Benchmark done on a Raspberry
PI Zero 1.3 :


The cache-on result is pretty good as it beats Clang at max optimization level, but I don't know why the transpiled code sits between GCC -O1 and -O2 with the cache off; it is perhaps due to some parts of the generated IR code : load / store specifically, inttoptr, lack of GEP ?
Note : the p5.js benchmarks were done on my desktop (i7 6700 3.4 GHz); the transpiled code uses typed arrays, which may explain the difference between manual vs transpiled, and perhaps it also generates more pleasant code for the JavaScript optimizer. Quite fun overall that it is just ~4x slower than a 2003 1 GHz CPU running raw ARM code.
Size
As for binary size, a 60-byte binary (RPI 0 1.3 target) of a short graphics program was produced with LLVM size optimizations; it is about 2 instructions more than a handcrafted tiny binary. Near-optimal tiny binaries like this are possible as long as they don't use any data (variables, strings etc.) or API words.


handcrafted program (left) vs a program
transpiled to LLVM (right)
Conclusion
There is still some room to simplify / improve the transpiler, such as some refactoring of the phi node generation routine, adding automatic support to emit global variable definitions, generating better load / store code and implementing quotation support for completeness.
All in all the transpiler was quite fun to do and a good
compact addition to the compiler with surprising end results for a
few days of code.

tiny binary running on RPI 0 1.3 (with
U-Boot)


