Transpiling a Forth dialect to LLVM IR
This article is
an offshoot of a series of articles about a baremetal ARM Forth
dialect, see
part 1 / part
2.
Introduction
This is a write-up about enhancing the performance / portability of my Forth dialect by adding an LLVM IR code generator.
The primary reason for this was to leverage a modern optimizer to produce fast code tailored for a CPU target, the secondary reason was to target many other architectures, and a background goal was to remain small and simple.
Adding LLVM IR code generation led to the implementation of a fairly generic transpiler extension to the compiler; this "transpiler" is able to translate the Forth dialect code to other languages in the same way that the compiler would target another architecture (through a tailored dictionary).
The transpiler implementation is fairly small, with ~400 lines of ARM code plus a tailored LLVM IR dictionary of ~400 lines as well; I also have a quickly hacked-up JavaScript (p5.js environment) dictionary derived from the LLVM one.
The LLVM IR code generator is not yet complete, as it has limitations and some words are still unsupported, but it works for a large subset of the dialect, which is sufficient for computer graphics prototyping.
The transpiler can also be used to check program correctness, a useful side feature since errors are not handled by the compiler to avoid any bloat; the transpiler adds some features that can be used to do stack checks, and stack-leaking programs also generate wrong LLVM IR code.
Translating Forth code to LLVM IR
The challenge of converting Forth code to LLVM IR mainly relates to the translation of stack-based code to register-based code and to generating the required SSA form, where variables are assigned exactly once.
A stack can be used in LLVM IR (the simplest model for Forth !) but my wish was to go all registers instead to get the most out of the optimizer (and of the register-based CPUs all around).
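To illustrate the difference, here is a minimal sketch (not my transpiler's actual output; the function names and the typed-pointer syntax are mine, newer LLVM versions would use ptr) of the same addition expressed through a memory stack versus pure SSA registers :

; memory (alloca) model : every value goes through load / store
define i32 @add_stack(i32 %a, i32 %b) {
  %slot0 = alloca i32
  %slot1 = alloca i32
  store i32 %a, i32* %slot0
  store i32 %b, i32* %slot1
  %t0 = load i32, i32* %slot1
  %t1 = load i32, i32* %slot0
  %t2 = add i32 %t1, %t0
  ret i32 %t2
}

; all-registers / SSA model : the optimizer directly sees plain values
define i32 @add_regs(i32 %a, i32 %b) {
  %t0 = add i32 %a, %b
  ret i32 %t0
}

The first form relies on passes such as mem2reg to recover SSA values, the second one is already in the shape the optimizer wants.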
Generating code in SSA form has analysis / optimization advantages later on (if needed) and also allows an easy way to translate my dialect to other languages.
Forth words could be defined as LLVM IR functions but I chose to generate all the IR code into the main function to keep it simple / single-pass.
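Since everything lands in the entry point, the emitted module roughly has the following shape (a sketch : the exact prologue / epilogue, return type and target lines my generator emits may differ) :

; optional target datalayout / triple lines go here (see the ABI note further down)
define i32 @main() {
  ; ...the whole transpiled program, fully inlined...
  ret i32 0
}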
My transpiler generates the LLVM IR textual / human-readable form (debugging felt easier); the binary form might be more adequate in a size-constrained context.
Helpful resources
- LLVM reference documentation
- Mapping High Level Constructs to LLVM IR gitbook
Some Forth-to-LLVM-IR proof-of-concept projects exist, such as movForth, but they use a different approach; some Forths probably also include LLVM IR support, so they may be worth checking out for a more standardized Forth.
Tools / Debugging
CPUlator was again very useful to quickly iterate on / debug the compiler code.
godbolt.org was useful to debug the generated LLVM IR code; I kept an LLVM IR source tab with an llc (compiler) + opt (optimizer / analyzer) output tab.
The main platform target still remains the early ARM CPUs (ARM2 etc.) so none of the modern ARM code is used by the compiler; I use these llc arguments to generate unoptimized vanilla (early 90s) ARM code : -march=arm -mcpu=strongarm -O=0
The code generator
There are various approaches to generating LLVM IR; my aim was to keep it simple and preserve the existing compiler code.
The initial idea for the LLVM IR code generator was to simply reuse the Forth dictionary model by defining words whose contents consist of LLVM IR code.
My early solution was to build a lightweight virtual machine (the transpiler core) in combination with raw ASCII strings in word definitions to represent unresolved LLVM IR code.
Special ASCII codes are used as placeholders or to manipulate the local template-level context; some of these codes also simulate Forth stack operations on the data / return stacks to help track registers / labels, a requirement for SSA code generation.
Forth to SSA / registers
In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is a type of intermediate representation (IR) where each variable is assigned exactly once. SSA is used in most high-quality optimizing compilers for imperative languages.
A straightforward way to convert Forth code into SSA form is to treat push instructions as the creation of a new symbol (register name or label name) at compile time; this symbol is then pushed onto a register stack, and each subsequent pop returns a symbol that can be used in one or more emitted instructions.
For simplicity, I chose to represent symbols as 32-bit values (referred to as indices); such a value represents a virtual register and is expanded into a padded hexadecimal string when emitted.
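For example, a fragment that sums three values already sitting on the stack (register numbers are illustrative, they would have been created by whatever pushed those values) creates one new symbol per push and consumes symbols on each pop :

; register stack holds %r00000001 %r00000002 %r00000003 (top)
; first +  : pops symbols 3 and 2, creates and pushes symbol 4
%r00000004 = add i32 %r00000002, %r00000003
; second + : pops symbols 4 and 1, creates and pushes symbol 5
%r00000005 = add i32 %r00000001, %r00000004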
The transpile routine
The transpile routine is called by the compiler at specific points and takes a template ASCII string as argument, which is then interpreted. The routine updates its internal representation of the data / return stacks as it encounters various instructions; it also generates the necessary register / label names and outputs the resolved LLVM IR code.
The transpiler context has :
- a 32-bit register counter, incremented on each data stack push operation (i.e. the creation of a new symbol)
- a register stack containing pairs of values : block index and register index
- a 32-bit label counter, incremented on each return stack push operation
- a label stack
- a 32-bit block index representing the current LLVM IR block (defined by LLVM IR labels)
- a template-level stack, cleared at the end of each string template (which is required due to inlining); it holds temporary values used to fill placeholders
Some notes :
- the register / label stacks could be merged into a single stack (same for the associated counters) but I chose to keep them separate for clarity in template code / IR output
- the register stack has redundant information (the block index) due to a quick fix for an early design issue (I didn't keep track of register context at first...)
Here is an example of how the transpile routine processes the + word :

forth_word "+"
.ascii "\x0f\x0f\x0e"
.ascii "%r\x16 = add i32 %r\x15, %r\x14\n"
.asciz "\x08"
- the first string emulates the Forth data stack : it does two pops and one push, and these three instructions each push a value (a register index) onto the template-level stack
- the second string is the LLVM IR instruction with getters (placeholders); these getters fetch a value from the template-level stack and output it at their location
- the third string is null-terminated and holds an instruction to reset the template-level stack, which allows the next processed template string to start from a clean context (this last line is usually generated automatically by the forth_word macro); the resolved output is sketched below
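For instance, if the register stack currently held %r00000041 (below TOS) and %r00000042 (TOS) and the register counter stood at 0x43 (illustrative numbers), the + template could resolve to :

%r00000043 = add i32 %r00000041, %r00000042

with %r00000043 left on the register stack for the next word.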
Note : register names are prefixed by r and followed by a register index formatted as a 32-bit hexadecimal value with padding.
Most words of the dictionary follow the same concept, here is another example with the > word :

forth_word ">"
.ascii "\x0f\x0f\x10\x10\x0e"
.ascii "%r\x16 = icmp sgt i32 %r\x15, %r\x14\n"
.ascii "%r\x17 = zext i1 %r\x16 to i32\n"
.ascii "%r\x18 = sub i32 0, %r\x17\n"
Same as before, with additional ASCII codes to get temporary register names for the first two instructions (temporary means that they are only used within the template code : a new symbol is created but the stack is untouched).
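With the same illustrative stack content as the + example above, the > template could resolve to :

%r00000043 = icmp sgt i32 %r00000041, %r00000042
%r00000044 = zext i1 %r00000043 to i32
%r00000045 = sub i32 0, %r00000044

The zext / sub pair converts the i1 comparison result into a Forth-style flag (0 or -1); only %r00000045 ends up on the register stack.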
Here is a similar example taken from the JavaScript dictionary for the <> word :

forth_word "<>"
.ascii "\x0f\x0f\x0e"
.ascii "var v\x16 = (v\x15 != v\x14) ? -1 : 0\n"
Flow words use more ASCII codes (e.g. for the return stack) and different prefixes but follow the same concept with way more verbosity due to word complexity; there are also special templates for built-in constructs such as literals, variables etc.
Note that the first line of each word definition could be moved into a macro (and placed in a separate file) as it likely remains unchanged across many target languages; I don't do this as I like that the template context remains visible.
Template strings provide high flexibility but they are also a bit redundant in terms of content (e.g. the first line); this method is still satisfying as it has low complexity and matches the original Forth dictionary concept (it avoids more structured representations).
Template codes
Here is a list of all the instructions that can be used within
a template string :
CODE | Description
---|---
0x00 | end of template code
0x04 | output string from context stack (TOS - 4)
0x05 | set current block index from context stack TOS
0x06 | push frame; next byte is a block marker (e.g. if or else)
0x07 | pop frame and add phi nodes
0x08 | reset template-level stack
0x0B | get data stack TOS value
0x0D | push r12 onto context stack; used to expose literals to template code
0x0E | emulates Forth push (data stack)
0x0F | emulates Forth pop (data stack)
0x10 | get temporary register
0x11 | emulates Forth push (return stack)
0x12 | emulates Forth pop (return stack)
0x13 | get temporary label
0x14 | output hex literal at context TOS - 0
0x15 | output hex literal at context TOS - 4
0x16 | output hex literal at context TOS - 8
0x17 | output hex literal at context TOS - 12
0x18 | output hex literal at context TOS - 16
0x19 | output hex literal at context TOS - 20
0x1A | output hex literal at context TOS - 24
0x1B | output hex literal at context TOS - 28
0x1C | output hex literal at context TOS - 32
0x1D | output hex literal at context TOS - 36
0x1E | output hex literal at context TOS - 40
0x1F | output hex literal at TOS - N, where N is the next byte
Some notes :
- the instructions evolved organically as I implemented code generation; the list could be way cleaner / have fewer instructions with a bit of refactoring and thoughtful design
- all the instructions from 0x14 to 0x1E can be emulated with 0x1F but I kept them for clarity in template code
- the 0x05 to 0x07 instructions are solely used to handle conditionals / loops (phi node generation)
phi nodes generation
The code generator described so far is quite simple and sufficient to handle most Forth primitives, but the conditional ones (if, else, then) require register context tracking (block / label) to be able to generate phi nodes (SSA form again; used to resolve values from different code paths into a single one).
The way I solved this is pretty hacky (and not very RAM efficient) but it works :
- a template instruction was added to indicate a block change, and the register index was also paired with its corresponding block / label index in the register stack; these features allow the transpiler to know which label / block a virtual register belongs to
- two template instructions were added to manage stack frames (push / pop), the pop instruction triggering phi node generation
The frame push instruction begins a new context by cloning the current register stack; the pop instruction then compares the register stacks from the two preceding blocks, and a phi node is generated when a register name differs. The associated register and block-level information is then used to correctly emit the phi node in the resulting LLVM IR.
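For example, a conditional where each branch leaves a different register as TOS could produce IR along these lines (label / register names and the flag test are illustrative, not the exact output of my generator) :

  %r00000021 = icmp ne i32 %r00000020, 0          ; if consumes a Forth flag
  br i1 %r00000021, label %L00000001, label %L00000002
L00000001:                                        ; if branch, leaves %r00000022 as TOS
  %r00000022 = add i32 %r00000010, 1
  br label %L00000003
L00000002:                                        ; else branch, leaves %r00000023 as TOS
  %r00000023 = sub i32 %r00000010, 1
  br label %L00000003
L00000003:                                        ; frame pop : the register stacks differ, so a phi is emitted
  %r00000024 = phi i32 [ %r00000022, %L00000001 ], [ %r00000023, %L00000002 ]

This also shows the constraint mentioned below : both code paths must leave the stack at the same depth, otherwise there is no one-to-one register pairing to build the phi nodes from.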
Note that phi nodes are actually not required, since IR stack / temporary variables can be used in combination with the mem2reg LLVM pass to generate them automatically. It is still unclear to me whether this method is simpler than phi node generation... it may be simpler / more adequate as a generic solution though, as it probably doesn't have the same limitation (my phi node generation has a constraint : the stack level must match across different code paths).
p5js output mode
p5.js output was added to test the integration of other languages; it was straightforward to do compared to LLVM IR due to many low-level constructs being abstracted away (labels) and the usage of global variables.
Phi nodes are also used to resolve conditionals with p5.js, but the implementation is simpler as it does not require register context (there are no labels); the merge is simply done by using the most recently defined variable with the nullish coalescing operator, so phi nodes look like this (JavaScript code) :
var v0000004D = ...
if (...) {
  // drop instruction; v0000004D is dropped
  var v00000051 = ... // TOS is now this
}
var v00000052 = v00000051 ?? v0000004D // will take the most recently defined variable
// avoid issues on next run (due to being global)
var v00000031 = undefined
var v00000035 = undefined
The generated code works in a contained environment roughly similar to the RPI 0 1.3 : a memory buffer is defined and the Forth words exclusively work with this buffer; there is added code at the top of the draw() function to copy the virtual framebuffer area into the p5.js canvas.
There are some limitations :
- JavaScript has stricter variable naming so a word like my_variable? may trigger errors (the transpiler doesn't check this)
- the code is wrapped in the draw() function due to being used for graphics; an infinite loop (which is necessary on the RPI) may bring issues as the draw() function is already called repeatedly
- only the gett (timer) and puts (text drawing) API words are implemented; they emulate the U-Boot API although much simpler (no ANSI escape codes support)
Early transpiler design issues
Here are some early design issues I encountered from blindly (no prior knowledge) attempting to convert the dialect code to LLVM IR :
- not keeping track of the Forth stack; I initially used a simple register increment / decrement, which quickly led to breaking the SSA form :)
- not keeping track of the Forth return stack (the separation is helpful for IR clarity !)
- LLVM IR template strings were initially resolved as I transpiled user word contents; this had many SSA / logic issues since the compiler produces inline code by default. The fix was easy : the code should stay unresolved until it is processed in the entry point, and only literals (and variables) can be safely resolved as encountered since they are static values
- not keeping track of the register context; it turns out this was important to introduce phi nodes for working conditionals / loops, and all this was aggravated by generating all the code into a single main function
- variables didn't use the LLVM IR global variables scheme (to meet the code simplicity goal) so performance was kind of disappointing at first (sitting between GCC -O2 and -O3); this was added as an option, with the minor gotcha that the definitions must be added manually at the top of the LLVM file after code generation (see the sketch after this list)
- didn't add any platform ABI specification to the LLVM IR file; this alone provided a nice performance boost (x2) with the cache off
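As a sketch of what these manual additions look like at the top of the generated file (the variable name, the triple value and the typed-pointer syntax are assumptions, not the exact lines I use) :

target triple = "armv6-unknown-none-eabi"   ; platform ABI specification
@my_var = global i32 0                      ; one definition per Forth variable

; variable accesses then become plain loads / stores :
; %r00000030 = load i32, i32* @my_var
; store i32 %r00000031, i32* @my_var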
Keeping track of the register context for phi node generation was the hardest part, as I needed a few days of intensive debugging due to it being quickly hacked in on top of the initial data structures; the solution I ended up with is okay but not very RAM efficient.
Benchmark
Speed
Benchmark done on a Raspberry
PI Zero 1.3 :


The cache-on result is pretty good as it beats Clang at max optimization level, but I don't know why the transpiled code sits between GCC -O1 and -O2 with the cache off; it is perhaps due to some parts of the generated IR code : load / store specifically, inttoptr, lack of GEP ?
Note : the p5.js benchmarks were done on my desktop (i7 6700 3.4 GHz); the transpiled code uses typed arrays, which may explain the difference between manual vs transpiled, and perhaps it also generates more pleasant code for the JavaScript optimizer. Quite fun overall that it is just ~4x slower than a 2003 1 GHz CPU running raw ARM code.
Size
As for binary size, a 60-byte binary (RPI 0 1.3 target) of a short graphics program was produced with LLVM size optimizations; it is about 2 instructions more than a handcrafted tiny binary. Near-optimal tiny binaries like this are possible as long as they don't use any data (variables, strings etc.) or API words.


handcrafted program (left) vs a program
transpiled to LLVM (right)
Conclusion
There is still some room to simplify / improve the transpiler, such as some refactoring of the phi node generation routine, adding automatic support to emit global variable definitions, generating better load / store code and implementing quotation support for completeness.
All in all the transpiler was quite fun to do and a good
compact addition to the compiler with surprising end results for a
few days of code.

tiny binary running on RPI 0 1.3 (with
U-Boot)


