Writing a small Forth-based RPI OS : Part 2 (ARM Forth dialect implementation)
This article focus on the ARM implementation of a Forth
dialect, see
part 1 for a full Forth introduction and U-Boot setup, part 3
is coming soon.
Contents
- Introduction
- Initial interpreter implementation: A lightweight ARM Forth core
- Simplifying through unifying; a compile only Forth
- More Forth words: a base dictionary for a general purpose dialect
- Syntax customization with a small parser (a mistake ?)
- Reduced stack juggling with variables / registers
- Data area and adding strings, arrays and quotations
- U-Boot I/O primitives
- Optimization: Constant load for 8-bits number
- Optimization: Inline expansion
- Full example program
- Conclusion
Introduction
In
part 1 i explained how to setup a Raspberry PI Zero
1.3 with Das
U-Boot to load a custom binary and use the U-Boot API from ARM
code for easy I/O.
In this article i will details an ARM implementation of a
low level Forth interpreter called ARM-ForthLite.
I will then details various improvements to revamp it into a
compiler called GnosTh with syntactic
changes to suit my needs. (see my mastodon feed for the progress
history)
This Forth is actually a low level follow up of
a prototype made in JavaScript.
My initial idea was to use this dialect to serve as a shell
and as a general toolbox for a simple OS (development, running
software etc.) but i decided to simplify it further (will show in
part 3 !) by going back to Forth roots (self contained; IDE
/ OS like) by rewriting the dialect in itself as a tokenized
version with a simple environment built around. A decision which
was heavily inspired by colorForth.
Initial interpreter implementation: A lightweight ARM Forth
core
The initial version of my bare metal ARM Forth interpreter can
be found here, it is a
minimalist ~440 bytes Forth skeleton which use Subroutine Threaded Code, it was used as a base to
tinker with Forth and bootstrap a Forth compiler for a Raspberry PI
0.
Here is some notes about the implementation :
- it use Subroutine Threaded Code which seemed
the most attractive choice to expand it later into a
compiler
- parse hexadecimal numbers on unknown words which is a bit dodgy
for many reasons but i like it and it is very simple to change (or
mix) radix
anyway
- stack top is stored into a register (r4), implementation use all available registers for additional speed
- target is a RPI Zero 1.3 (ARM1176JZF-S), it target ARMv6 architecture but probably works on any ARM that support conditional instructions, side goal was ARMv2 (Acorn Archimedes) support but didn't test it
- no REPL nor much primitives bundled by default, idea is to stay
flexible so it serve as a base and a REPL can be bootstrapped
around with a chosen set of primitives
Implementation is quite small even if the focus wasn't code
size, i chose some shortcuts that made the code simpler such as
:
- doesn't trim extra white-spaces; should always be exactly one white-space between words
- no negative numbers parsing (can be built easily)
- no errors handling such as stack underflow although i
implemented a rudimentary one in earlier commits (note : same in my
dialect, it is quite raw !)
- no unknown words, they are parsed as number (base 16)
Source
code is fully commented.
The implementation is not cache friendly, the cache isn't
invalidated on code generation so ARM independent instruction /
data cache should be disabled on some CPU for full
reliability.
Interpreter context
The interpreter context is made of data structures (data
stack, return stack, dictionary) with their associated pointers
stored in dedicated registers along with top data stack value (for
speed) and interpreter mode as a binary flag.
These registers are dedicated to the interpreter so they
should never change between interpret calls (except r1) to keep the
context.
Keeping almost all the interpreter state in registers makes
interpreter context switching easy, it is just a matter of saving /
restoring these registers after delimiting RAM area for each
interpreters and it also result in a straightforward implementation
code that is easy to read.
@ ============= FORTH INTERPRETER
@
data stack addr.: sp
@ return
stack addr.: r0
@ input
buffer addr.: r1
@ dict last word addr.:
r2
@
compile mode: r3
@
stack top value: r4
@
dict. end addr. :r14
@ ===============================
The difference between
the high level implementation and this one is the introduction
of the return stack which was abstracted away in the high level
interpreter, it is necessary here as a way to return from sub
routines (word call) and avoid cluttering the data stack.
Dictionary pointers could be squeezed to a single pointer
since dictionary end address can be computed from the last
dictionary word but keeping these two pointers into registers makes
the implementation simpler.
Entry point (parsing
words)
The entry point of the Forth interpreter is essentially a
Forth word parser, a Forth word is printable
characters delimited by a white-space (or NUL)
character, this code iterate on the input buffer and continue to
iterate until it hit a ASCII non printable characters, it then
compute the word length from the word boundaries addresses then
branch to the find_word label or jump back to the
interpreter caller when length is 0.
I made some shortcuts to make it as simple as it can be so it
doesn't skip extra spaces and non printable characters are all
considered the same way so it is just a matter of printable (words)
versus non printable characters.
forth:
@ =================== READ A WORD ; any printable characters
delimited by ' '
@ ===================== ON RETURN ; does not skip / trim extra
spaces !
@
end addr: r1
@
word length: r5
@
word start addr: r9
@
cond. flags updated
@ ===================== CLOBBERED
@
r5, r8, r9
@ ===============================
read_word:
mov r9, r1
0:
ldrb r8, [r1], #1
subs r8, #' '
bgt 0b
subs r5, r1,
r9 @ get word
length; update condition flags
subgts r5,
#1 @
adjust if not empty
bne
find_word
@ find word if len > 0
ldr pc, [pc,
#-4] @
jump to return addr.
forth_retn_addr:
.word 0
Note that i don't use any
branch and link relative instructions since i make use of all
registers including r14 which is used as a dictionary end pointer
so the interpreter return address must be stored at
forth_retn_addr location before the call which is a fair
inconvenience. (could be skipped if the interpreter code is
wrapped)
Word dictionary lookup
The word dictionary is a linked list with the following data
:
- pointer to previous element or 0 when no previous
elements
- word flag along with word length (as a byte in my case so maximum word length is 255 characters)
- word identifier as NUL terminated string
- a data section which hold the word code in my case (always end
with a RET macro call which is an instruction which pop the return
stack directly into PC)
This code search the parsed word within the dictionary by
first comparing their lengths then comparing individual characters,
it jump to eval_word when the word is found or fall through
parse_number if not.
@ =================== FIND A WORD
@ last dict. word addr:
r2
@
word length: r5
@
word start addr: r9
@ ===================== ON RETURN
@ found word dict
addr:r12
@ found word name end addr:r10
@ ===================== CLOBBERED
@ r6, r7, r8, r10, r11, r12
@ ===============================
find_word:
mov r12, r2
0:
ldrb r7, [r12,
#5] @ get dict. word len.
cmp r5,
r7
@ word length match ?
bne
2f
@ skip word if not
add r10, r12,
#6 @ get dict word
addr.
mov r11,
r9
@ get word addr.
1:
ldrb r6, [r11], #1 @ word char.
ldrb r8, [r10], #1 @ dict. word char.
cmp r6, r8
bne
2f
@ skip word on != char.
subs r7, #1
bne 1b
b eval_word @
found
2:
ldr r12,
[r12] @ get
previous dict. word addr.
cmp r12, #0
bne
0b
@ continue until no words to search for
Number parsing
Unknown words are always parsed as hexadecimal number in my
interpreter to simplify the implementation, the number is either
pushed immediately to the data stack or compiled into the current
definition on compile mode.
@ ================== PARSE NUMBER
@
compile mode: r3
@
word length: r5
@
word start addr: r9
@
0: r10
@ ===================== ON RETURN
@ ===================== CLOBBERED
@
r4, r5, r7, r10, r12
@ ===============================
parse_number:
0:
ldrb r7, [r9],
#1 @ get char.
subs r10, r7,
#87 @ get char. numeric value
(a-f)
sublts r10, r7,
#'0' @ get char. numeric value (0-9)
addge r12,r10,r12,LSL
#4@ n * 16 + v
subs r5, #1
bgt 0b
cmp r3, #FORTH_IMM_MODE
@ immediate mode ?
pusheq { r4
} @
immediate mode: push old value to Forth stack
moveq r4,
r12 @
immediate mode: push new value to Forth stack
beq read_word
1:
@ else: compile mode
adr r10, 2f
b compile
2:
@ generated code (compile mode)
.word
0xe52d4004 @ opcode: push
{ r4 }
.word
0xe59f4000 @ opcode: ldr
r4, [pc, #0]
.word
0xe28ff000 @ opcode: add
pc, #0 @ value is stored after this instruction, it is stored by
"compile" from r12
Word definition (compilation) and evaluation
A word is evaluated immediately or compiled once it is
found.
This code check the interpreter mode and the word flag to
decide whether the word is evaluated or compiled, it put
read_word as return address and call the word code on
evaluation, it generate a return stack push instruction and a call
instruction on compilation.
@ =============== EVALUATE A WORD
@
compile mode: r3
@ found word name end addr:r10 ; unaligned ok,
will be aligned later
@ found word dict
addr:r12
@ ===================== ON RETURN
@ ===================== CLOBBERED
@ r0, r5, r6, r8,
r9, r10
@ ===============================
eval_word:
add r10,
#4
@ adjust word name end addr for alignment
cmp r3, #FORTH_IMM_MODE
@ immediate mode ?
ldrb r8, [r12,
#4] @ get
word flag
andnes r9, r8,
#0xff @ in compile mode :
is an immediate word ?
bne compile_word
adr r5, read_word
stmdb r0!, { r5
} @ push return address
bic pc, r10,
#3 @ align (point
to code addr.) and jump to word code
compile_word:
bic r12, r10,
#3 @ align (point to code
addr.)
adr r10, 1f
b compile
1:
@ generated code (compile mode)
.word
0xe28f5008
@ opcode: add r5, pc, #8
.word
0xe9200020
@ opcode: stmdb r0!, {r5}
.word
0xe51ff004
@ opcode: ldr pc, [pc, #-4] @ word code addr. to jump to, stored
after this, stored by "compile" from r12
@ ======================= COMPILE
@ generated code
addr.:r10
@
value to store:r12
@ current def. code addr. :r14
@ ===================== ON RETURN
@ ===================== CLOBBERED
@
r9, r10, r11, r14
@ ===============================
compile:
ldmia r10,
{r9-r11} @ load
generated code
stmia r14!,
{r9-r12} @ store
generated code at current definition code addr.
Note that there is alignment
instructions (due to the variable length word name) so that the
word code pointer fall within a 4 byte boundary which is a
requirement on early ARM.
Primitives
A minimal set of primitive words is bundled in :
immediate
which set the previously defined word flag to immediate:
which is the character used to specify a new word definition;
which is the character that ends a word definition+
which is an operator word that adds the two value on the stack and push back the result
These primitives are static data, they constitute the base
dictionary that will be used by the interpreter, i use a
forth_word
macro as a convenient way to define them so i just needs to
provide the word name, flag and the word code body, this data is
inserted statically into the dictionary, the word length is
automatically computed by the macro.
I also use macros as a less obscure way to manipulate the
return stack. (RET instruction)
The new word definition is the densest primitive as it must
create a new dictionary entry, it have its own word parser that
takes the word that follow
:
and copy it into
the dictionary along with the word length, flag and link to the
previous dictionary entry, it also switch the interpreter global
state to compile mode so that subsequent words will be compiled
instead of evaluated. (unless a word flag is set to
immediate).include "dict_utils.inc"
forth_word "immediate"
mov r5, #FORTH_WORD_FLAG_IMMEDIATE
strb r5, [r2, #4]
ret_opcode:
ret
forth_word ":" FORTH_WORD_FLAG_IMMEDIATE
str r2,
[r14]
@ store previous word addr.
mov r2,
r14
@ update last dict. word addr.
add r14,
#6
@ point to name field
0:
ldrb r8, [r1], #1
cmp r8, #' '
strgtb r8, [r14],
#1 @ copy name
addgt r7,
#1
@ update length (note: r7 implied to be 0 before loop which is the
case after "find_word")
bgt 0b
strb r3,
[r14]
@ append \0 (note: use r3 since mode is 0 at this point; nested
defs don't make sens)
strb r7, [r2,
#5] @
store word length
mov r3, #FORTH_COM_MODE
@ switch to compile mode
strb r3, [r2,
#4] @
store FORTH_WORD_FLAG_DEFAULT flag (note: use r3 since COM_MODE is
same constant)
add r14, #4
bic r14,
#3
@ align to get to end of dict. addr.
ret
forth_word ";" FORTH_WORD_FLAG_IMMEDIATE
mov r3, #FORTH_IMM_MODE
@ switch to immediate mode
adr r7, ret_opcode
ldr r7,
[r7]
@ get opcode for RET instruction
str r7, [r14],
#4 @
store RET opcode at the end of the newly defined word
ret
forth_last_word_addr:
forth_word "+"
pop { r5 }
add r4, r5
ret
forth_dict_end_addr:
The power of Forth is that all the primitives can be redefined
easily by modifying the dictionary as above or by redefining them
at runtime to create a different dialect, for example adding a line
comment to the dialect just require to choose the characters (for
example
//
) and write the code that will
skip the characters from this point on until the next line, this
word should likely have an immediate flag as its shouldn't
be compiled. The only inconvenience of defining a line comment that
way is that //
should be delimited by
space as it is still parsed as a Forth word.In other Forth the word parsing code may share the same
routine as the main parser code to alleviate redundancy, it may be
bundled as a primitive as well, i ignored most reflection features
in my Forth but they are highly powerful feature to have. Many
chunk of code are shared and implemented as a primitive in most
advanced Forth implementation.
Bare metal usage sample
The sample code is a bare metal usage example of the Forth
interpreter, it initialize an interpreter context (memory layout,
state) and evaluate a line of Forth code.
Here is the line that it interpret :
: 1e f f +
; 1e
What it does is define a word called
1e
containing the operations of a simple addition between two numbers
15+15
then call this newly defined
word.The result is placed at the top of the data stack (r4) and is
(in base 10, after the execution) : 30
@ FORTH MEMORY LAYOUT CONSTANTS
.equ FORTH_DATA_STACK_SIZE, 1024 * 1000 @ always multiple of 4
.equ FORTH_RETN_STACK_SIZE, 1024 * 1000 @ always multiple of 4
.equ FORTH_DICT_SIZE, 1024 *
1000 @ always multiple of 4
.global _start
_start:
@ = FORTH SETUP
@ data stack
adr r8, forth_data_stack_addr
ldr sp, [r8]
@ return stack
adr r8, forth_retn_stack_addr
ldr r0, [r8]
@ input code
adr r1, forth_code
@ dict last word
adr r2, forth_last_word_addr
@ compile flag
mov r3, #FORTH_IMM_MODE
@ dict end
adr lr, forth_dict_end_addr
@ save return addr.
adr r9, forth_retn_addr
str pc, [r9]
b forth
@ r4 has value 1e at this point
0:
b 0b
.include "forth.s"
forth_data_stack_addr:
.word (_end +
FORTH_DICT_SIZE + FORTH_DATA_STACK_SIZE)
forth_retn_stack_addr:
.word (_end +
FORTH_DICT_SIZE + FORTH_DATA_STACK_SIZE +
FORTH_RETN_STACK_SIZE)
@ code to evaluate :
forth_code:
.asciz ": 1e f f + ;
1e"
.align 2
@ = FORTH DICTIONARY
.include "dict.inc"
_end:
Simplifying through unifying; a compile only Forth
As i wrote the Forth interpreter i realized that it was not
too far from a compiler already, i just needed to drop the two
modes and makes a compile only Forth ! The slight difference is the
need to jump at the compiled code after running through the
input.
To simplify further i dropped the ';' word later on.
The realization came from trying to simplify the dialect and
optimizing for speed, the Forth interpreter wasn't necessary for my
use case.
Peoples also realized this 30
years from now !
The program code was the last word of the dictionary in the
first implementation, it works but adds several constrains so i
made it a bit more flexible by separating the program code from the
dictionary, i also added a primitive to mark any definitions as the
entry point (entrypoint word), it was more flexible that way
and swapping entry point can be handy for debugging or prototyping.
The entry point word switch the compiler code output address from
dictionary to program output address so that the definition code is
put into the program area instead of the dictionary.
I also dropped immediate definitions by design to improve
dialect overall cohesion, this makes the Forth much less reflective
by default so much less powerful, it still support reflection
through immediate primitives and literals.
It may be possible to gain some reflection back by making some
words (such as stack ones) works in mixed mode (with immediate and
default constructs) without having to redefine an immediate version
by having the same stack implementation on compile / runtime which
isn't the actual case. (the compiler stack does not have its top
value stored in a register)
More Forth words: a base dictionary for a general purpose
dialect
The basic Forth above isn't a full language yet, it is the
base skeleton by which a Forth dialect can be defined and further
refined, the dictionary is too simple as it lacks many primitives /
words to be useful, the power of Forth is that this set of words
can be very simple / bootstrapped (see sectorforth or
milliForth)
or dense (Gforth), either general
purpose or task focused ones, they just have to be written !
Words
My general purpose Forth dialect at the time of writing this
article include these words :
- stack manipulation (dup, drop, swap, over, rot)
- memory access (@ !) also have the equivalent for byte access (db@ db!)
- data allocation (allot), a way to get data length (dlen) and a
way to align data (align)
- logic (0= 0<> <> = < > <= => ~ & | ^
<< >> >>>)
- flow controls (loop, again, until, for, do, next, if, else,
then, call, case)
- arithmetic (* / % */ - + negate >>+ >>-)
- registers access (A! A@ B! B@ C! C@ D! D@)
- I/O mostly provided by Das U-Boot API (getc, putc, tstc, pix etc.)
- also have some special immediate words (entrypoint, #!)
- Das U-Boot specifics (use extended U-Boot API) :
cache_care
My Forth dialect has ~70 primitives, it is inspired
by Chuck
Moore 2000s Forth and also heavily inspired by C as i am
accommodated to some of the concise symbols that C use, Joy,
Factor, r3 and DSSP were also
inspirational although i didn't borrow much from them.
It should be seen as a prototype / transitional step
towards a colorForth environment.
I don't expose the return stack and it only handle
integer arithmetic, fractional numbers can be represented as
fixed-point.
I will not details the implementation of most of the
primitives, many are trivial and nearly map one to one to the
underlying instructions set but i will still details the non
trivial constructs such as loops and conditionals.
The language support include
directive in the comments, they are preprocessed at build time
by tools such as AWK and the build process
convert an ASCII source to a set of .ascii gas assembly
directives.
Note that the primitives are tailored for a Forth in
compile only mode (it might works slightly differently in the Forth
presented above), i also use r5 as stack top instead of r4 to
reflect the latest version of my dialect, also note that i omit the
return instruction at the end of each primitives since my dialect
now insert it automatically.
Loops
Here is the code for the loop, again
and until construct (again is an infinite loop,
until is a conditional loop, note that i use loop
instead of begin which is unconventional) :
forth_word "loop"
mov r6, pc
stmdb r0!, { r6 }
forth_word "again"
ldr pc, [r0]
forth_word "until"
cmp r5, #0
pop { r5 }
ldreq pc, [r0]
ldmia r0!, { r6 }
Not too difficult to implement, it use the return stack to
store the address to jump back to, the return stack is convenient
but may become less readable with complex constructs because the
stack is shared with subroutine calls, some Forth adds a loop
control stack to simplify this at the cost of introducing yet
another stack.
I have two other loop construct, for and
do, they are handled with the same end word which is
next, for iterate backward from n (user
provided value) to 0 (included), do mimic a C for loop and
is able to iterate from n to another given value at a given
step size, the direction of the iteration is automatically detected
:
forth_word "for"
mov r6,
r5 @ start value
mov r5,
#1 @ step
mov r7,
#0 @ end value
rpush r5-r7
@ push loop parameters on return stack
rpush
pc @ push
start address of the loop body
mov r5,
r7 @ "push" iteration
value on data stack (stack top is r5)
forth_word "do" @ same as for
but takes 3 parameters instead of 1
pop { r6-r7 }
rpush r5-r7
rpush pc
mov r5, r7
forth_word "next"
rpop r6-r9
cmp r8,
r9 @ determine iteration
direction
addgt r9, r9, r7 @ positive iteration
cmpgt r8, r9 @
check stop condition
sublt r9, r9, r7 @ negative iteration
cmplt r9, r8 @
check stop condition
addle pc, pc, #12 @ jump out of the loop when
stop condition is reached
rpush r6-r9
@ otherwise push parameters back
push { r5 }
mov r5,
r9 @ "push" iteration
value on data stack
mov pc,
r6 @ jump to start
address of the loop body
The return stack is used as temporary storage for the loop
index.
Unlike conventional Forth my dialect doesn't expose the
iteration value (loop index) through other primitives such as "i"
and "j", the iteration value is instead directly pushed on the data
stack after for or do and it is up to the
user to either drop it or use it later.
A way to break out of a loop at any given moment may be handy
but is not as simple to implement because the code between "leave"
and "next" must be skipped and making it work for all loops end
word adds difficulty, i used to have a constrained "goto" but this
was an ugly hack so my dialect doesn't support this right now, a
way to implement a "leave" word would be to :
- introduce another stack or use a stack that is untouched by any
other immediate words
- allow words to have a mix of immediate code and compile code
(this step might add a bit more complexity to the
compiler)
- all loop words definition should have a mix of immediate and
compile code; compile code will stay as is with added code to skip
entire loop body (including loop end block code; it must also clean
the stacks), the added code is just a forward jump of yet
unresolved length, immediate code push the location of the forward
jump length on the newly introduced stack so the length can be
resolved later on
- loop end words must also have a mix of immediate and compile
code; compile code stay as is, immediate code pop the newly
introduced stack to get the skip length location, it then compute
the skip length which is stored at the location
- "leave" primitive code use the return stack to retrieve the
loop start location and jump at the newly added skip loop body
logic
Conditionals
Conditionals implementation also have some complexity because
the interpreter doesn't know where to jump to when a conditional is
hit, this information is only available when a then word is
encountered later on. (i.e. there is references to resolve later
on)
My implementation generate the conditional code with
room for a jump on the if word, i compute a jump length
instead of resolving a jump address, this jump length is
computed later on by else or then, i use the return
stack once again to hold information about the location of the jump
length that must be patched :
forth_word "if" WORD_IMMEDIATE
adr r5, forth_if_code
ldmia r5,
{r5-r10} @
load generated code
stmia r14!,
{r5-r10} @ copy generated
code
rpop r5
rpush
r14
@ push body length addr.
mov pc, r5
forth_if_code:
cmp r5, #0
pop { r5 }
addne pc,
#8
@ jump to if body
forth_skip_over_code:
ldr r6,
[pc]
@ get body length
add pc, pc,
r6 @ jump to
either "else" body or "then"
.word
0
@ if body length
forth_word "else" WORD_IMMEDIATE
adr r5, forth_skip_over_code
ldmia r5,
{r5-r7}
@ load generated code
stmia r14!,
{r5-r7} @ copy
generated code
rpop r5-r6
sub r8, r14,
r6
@ compute body length
str r8, [r6,
#-4] @
update if body length
rpush
r14
@ push body length addr.
mov pc, r5
forth_word "then" WORD_IMMEDIATE
rpop r5-r6
sub r8, r14,
r6
@ compute body length
str r8, [r6,
#-4] @
update "if" or "else" body length
mov pc, r5
Switch statement
My dialect leverage the implementation of quotations
(lambdas / anonymous functions) for switch
statements, it only have a case word which takes two
values to compare to plus a quotation evaluated on a successful
compare, the word drop the last two values on stack on a failure so
other cases can be evaluated , the last code of the quotation is
the default case, the word is implemented like this :
forth_word "case"
pop { r6-r7 }
cmp r6, r7
moveq r9,
r5
@ get case blocks addr. (when cond. is true)
popeq { r5
}
@ restore stack
moveq pc,
r9
@ call case blocks
mov r5,
r7
@ continue with the value to compare to otherwise
And used like this :
getc {
'a' {
"getc result: 'a'"
puts
} case
'b' {
"getc result: 'b'"
puts
} case
"unhandled getc value: " puts putc ; default
case
} call
Quotation evaluation is done through the call word
which is a jump to the address pushed on return stack by "{...}"
construct plus a return stack push to get out of the
quotation :
forth_word "call"
pop { r6 } @ see below
mov r7, r5
mov r5, r6 @ update TOS after pop
(only due to the usage of a register for stack top value)
rpush pc @ go back to
after call once the quotation is evaluated
mov pc, r7 @ jump to quotation code
start address
Syntax customization with a small parser (a mistake ?)
A lot can be done with Forth minimal syntax (words delimited
by spaces) but i still wanted some small amount of conventional
syntactic sugar to define variables or have conventional strings
definition. ("..." instead of s"
...")
So i started to deviate from Forth minimal parser (as seen
above) by going for a small parser.
Was it worth ?
Not sure, it add flexibility perhaps while remaining fairly
lightweight; it contribute to a smaller binary rather than having
specific words in the dictionary but it is only worth for a tiny
amount of constructs in my case... (variables, immediate number and
verbatim value) so it could have been avoided entirely by making
some small edge cases on the initial parse loop... it is also more
errors prone to get right, i also understood later that all the
edge cases the parser solve vanish when the ASCII input is replaced
by a tokenized input. (the colorForth approach)
Syntactic features
of my Forth dialect
- strings : "..."
- arrays definition : [0 1 2 3 ...]
- character
literal : 'a'
- quotations : {...}
- variables definition : >my_variable
- immediate number (push a number on compiler stack) :
#42
- verbatim value (compile verbatim value) :
$42
- line comment : ;
Reduced
stack juggling with variables / registers
Stack juggling (also called "stack noise") is related to all
the stack manipulation that must be done to manipulate stack data
values, it is not transparent in Forth unlike C due to its stack
machine model; data values must sometimes be reordered by the user
which lead to high usage of stack manipulation words which might
lead to highly verbose code and slightly less readable code.
There is some rules that one may apply to reduce stack noise
(factor into small definitions, keep up to 3 values at most on
stack) but it is not entirely avoidable as is.
One of the first feature added to my dialect to reduce stack
juggling and improve readability was registers access and named
parameter, i quickly found out that it was not sufficient to
reduce stack juggling as much as i wanted (especially in loops) so
i quickly fused it into a "variables" concept instead. I still kept
registers access as it is a good way to speed up a
program.
The usage of variables is slow but effectively eliminate stack
juggling, i like the versatility of having variables (slow), stack
words (fast), registers access (very fast) with readability
variations, all of this blend easily with the base features of the
language and doesn't add much complexity.
A variable in my dialect is defined with a ">" character
immediately followed by a word which basically means that whatever
is on stack is stored into that word, the value can be retrieved on
the stack through the word. Variables name must starts with an
ASCII character higher than 64. (this prevent words such as '>='
and '>>' etc. to be interpreted as a variable
definition)
early implementation of variables to avoid
stack juggling
Data
area and adding strings, arrays and quotations
Variables were initially stored in the base dictionary but i
realized that many form of anonymous data might be stored there
later on (strings, arrays etc.) thus i introduced a separate area
to store all data (similar to the data segment) so
it doesn't mess with the base dictionary. Also tried to place data
where they were defined but this had performances impact due to
generated jump instruction to skip over the data, some features
were also harder to add due to this later on.
Variables
Variables are defined like a word definition but are stored in
the data area, they are parts of the dictionary (linked with it) so
the compiler find their symbol through a regular dictionary search,
the compiler distinguish between regular word and variables through
a word type flag.
A local variables implementation could be introduced easily by
limiting the word search on a variable definition to the current
definition, local variables actually help a lot for things like
recursion. Perhaps a better implementation is to store them on the
return stack !
String type
My dialect support a multi-line conventional "..." string
syntax, it support null-terminated
string and
length-prefixed string at the same time. The implementation
generate code that push the string address on stack.
The string length can be retrieved through the dlen
primitive.
Array
The dialect also support the definition of integer arrays with
"[0 1 2 3]" syntax, the implementation is similar to the string
implementation (it is length-prefixed) although the parser state
switch to "array" mode; parse_number routine is then reused to
directly put parsed values as a linear series of numbers in the
data section, the parser state is reset on "]".
Quotations
Quotations
is supported in my dialect with "{ ... usual code... }" syntax. It
enable the definition of anonymous block of code. (lambdas /
anonymous functions)
Quotations does not use the data area, code is generated
inline.
The implementation is straightforward :
- upon "{" the compiler generate code that push the quotation start address on stack and also generate a jump to skip over the quotation body (jump length is not yet resolved)
- upon "}" the compiled generate a RET instruction (which use the
Forth return stack) and resolve the previous jump length (it
compute the code body length from the start / end address)
The call primitive can be used to "call" a quotation.
(see code above)
Quotations are powerful as they can simplify all "blocks" /
control structures such as conditionals or switch statement, an
example in this dialect is the implementation of the "switch"
statement which has a quotations implementation.
The array statement can be replaced by using a quotation with
verbatim literals :
{
$42 $deadc0de ; verbatim literals
} ; data start address is now on stack
; quotation could also be called to put all the data on stack when
the data is defined as regular values (not
verbatim)
See also how the DSSP programming
language use blocks to simplify some constructs, it has some
similarity to quotations although they are not anonymous.
Data allocation word:
allot
The array construct is handy but a way to programmatically
allocate arrays is missing so the allot primitive was
implemented :
forth_word "allot" WORD_IMMEDIATE
pop { r5 }
add r10, r4,
#4
@ assign data addr. (+4 to skip length)
adr r7, number_code_generic @ reuse generic
number load code
ldmia r7,
{r7-r9}
@ load generated code
stmia r14!,
{r7-r10} @ copy generated
code
str r5, [r4],
#4
@ store length
add r4, r4,
r5
@ offset data
add r4, #4
bic r4,
#3
@ align
This word is immediate, it is called when the code is
compiled thus i also needed a way to define immediate numbers (due
to a
design issue) which is done by prefixing a number with
#.
The first use of allot was to define a static buffer
for a number to string conversion routine :
number to string conversion in my
Forth dialect
I also added an "align" word later since
a way to align the next alloted data was missing, alignment is a
requirement in some bare metal cases. (mailbox CPU / GPU interface
requirement)
U-Boot I/O primitives
I/O was implemented with U-Boot standalone API at first (see
first article) then U-Boot API (which has storage device access),
it was quite time consuming as it required to tinker with U-Boot a
lot and constantly gathering knowledge on how to interface with it
at this level.
The result is a set of assembly macros that save / restore
U-Boot context (stack and global_data address) plus some macros for
standalone API calls and a lengthy macro for the main API
interface.
See this
file for the U-Boot macros.
Optimization: Constant load for 8-bits number
An easy optimization to do was to detect 8-bits number and
generate a constant load instruction which is way faster than a
data load; code that directly load the number into the stack top
register.
Optimization: Inline
expansion
My dialect inline code by
default for performance reasons, it was fairly easy to tweak the
compiler to generate inline code from the use of Subroutine Threaded Code by directly copying the
word body code instead of generating a call instruction.
An early implementation issue was to determine the word body
length which was solved by going for a doubly linked
list dictionary implementation.
Also had issues with the code generation of some primitives
(conditionals etc.) as i used to patch by address initially, i
switched to patching by jump length instead which is actually way
more flexible.
Inline expansion brought some other issues such as broken
recursion which was addressed by detecting self calling words, it
default to a STC model in this case. I later found out that this
sort of recursion in a traditional Forth use a "recurse" word
which has a simpler implementation than allowing self calling
words. (less edge cases)
There is no mutual
recursion support in my dialect due to lacks of forward
declaration, recursion is also a bit dodgy with global
variables.
Inline expansion can bring easier optimization later on such
as reducing a set of words to a smaller set of words.
Early benchmark
I did a crude (visuals only; no measurements done) benchmark
with / without inline expansion which demonstrate that inline
expansion made the code between two to three times faster.
The benchmark is a graphics program that shows some procedural
logo gfx, the program run on a bare metal RPI Zero 1.3.
crude benchmark; left is STC model, right is
inline expansion optimization (ARM cache off)
Early dialect benchmarks against C / GCC optimization levels
I then did a C version of the same program and did some
measurements of the program compiled with my Forth dialect against
C / GCC optimization levels :
my Forth dialect vs GCC optimization levels
(ARM cache off)
my Forth dialect vs GCC optimization levels
(ARM cache on)
C code vs Forth dialect code side by side
(high
res)
Was not too bad as it approach GCC -O0
level.
Primitives were not optimized at this time, it
might be closer to -O0 level now with optimized primitives. (an
early round of primitives optimization brought the Forth down to
~262ms)
Reaching GCC -O0 level speed
Variables usage is slow in my dialect so most variables were
replaced by stack manipulation words (the Forth way !) + registers
and "pix" word was implemented to do optimized pixels access by
using multiply accumulate + base / offset store ARM instruction
:
a quickly optimized variant (note : was still
using some variables !)
updated benchmark
Speed is now between GCC -O0 and -O1, not even
fully optimized ! (note : for some unknown reasons the C code has a
20ms gain when it is compiled for ARMv4 instead of ARMv6)
Full example program
Here is a full sample program of my dialect that print RPI
Zero 1.3 I/O device info (SD device) to screen and read / write to
it through U-Boot API calls :
Conclusion
It was an incredible journey for me to go from a high level
Forth to a low level minimal prototype then build a complete
dialect / compiler out of it, most of the development, debug and
testing was done on bare machine (SD
swap !) with occasional help from cpulator, the compiler code is
about ~500 lines of assembly (total) which is still quite small for
all the features it pack, it can serve as a base to go
further.
There is some improvements here and there such as using the
same stack design for the compiler / runtime, doing tail-call
optimization or forward declaration or going back to a simpler
parser as i am not sure the actual parser is even needed.
the benchmark graphics program running on RPI
Zero 1.3
What is next ?
My next goal is to rewrite the dialect in itself but in a
tokenized way to make it even simpler, i also want to build a whole
prototyping environment around it. (colorForth
style)
back to top