Home
Admin | Edit

Writing a small Forth-based RPI OS : Part 2 (ARM Forth dialect implementation)

This article focus on the ARM implementation of a Forth dialect, see part 1 for a full Forth introduction and U-Boot setup, part 3 is coming soon.

Contents

Introduction


In part 1 i explained how to setup a Raspberry PI Zero 1.3 with Das U-Boot to load a custom binary and use the U-Boot API from ARM code for easy I/O.

In this article i will details an ARM implementation of a low level Forth interpreter called ARM-ForthLite.

I will then details various improvements to revamp it into a compiler called GnosTh with syntactic changes to suit my needs. (see my mastodon feed for the progress history)

This Forth is actually a low level follow up of a prototype made in JavaScript.

My initial idea was to use this dialect to serve as a shell and as a general toolbox for a simple OS (development, running software etc.) but i decided to simplify it further (will show in part 3 !) by going back to Forth roots (self contained; IDE / OS like) by rewriting the dialect in itself as a tokenized version with a simple environment built around. A decision which was heavily inspired by colorForth.

Initial interpreter implementation: A lightweight ARM Forth core


The initial version of my bare metal ARM Forth interpreter can be found here, it is a minimalist ~440 bytes Forth skeleton which use Subroutine Threaded Code, it was used as a base to tinker with Forth and bootstrap a Forth compiler for a Raspberry PI 0.

Here is some notes about the implementation :
Implementation is quite small even if the focus wasn't code size, i chose some shortcuts that made the code simpler such as :
  • doesn't trim extra white-spaces; should always be exactly one white-space between words
  • no negative numbers parsing (can be built easily)
  • no errors handling such as stack underflow although i implemented a rudimentary one in earlier commits (note : same in my dialect, it is quite raw !)
  • no unknown words, they are parsed as number (base 16)
Source code is fully commented.

The implementation is not cache friendly, the cache isn't invalidated on code generation so ARM independent instruction / data cache should be disabled on some CPU for full reliability.

Interpreter context

The interpreter context is made of data structures (data stack, return stack, dictionary) with their associated pointers stored in dedicated registers along with top data stack value (for speed) and interpreter mode as a binary flag.

These registers are dedicated to the interpreter so they should never change between interpret calls (except r1) to keep the context.

Keeping almost all the interpreter state in registers makes interpreter context switching easy, it is just a matter of saving / restoring these registers after delimiting RAM area for each interpreters and it also result in a straightforward implementation code that is easy to read.

@ ============= FORTH INTERPRETER
@            data stack addr.: sp
@          return stack addr.: r0
@          input buffer addr.: r1
@        dict last word addr.: r2
@                compile mode: r3
@             stack top value: r4
@            dict. end addr. :r14
@ ===============================

The difference between the high level implementation and this one is the introduction of the return stack which was abstracted away in the high level interpreter, it is necessary here as a way to return from sub routines (word call) and avoid cluttering the data stack.

Dictionary pointers could be squeezed to a single pointer since dictionary end address can be computed from the last dictionary word but keeping these two pointers into registers makes the implementation simpler.

Entry point (parsing words)

The entry point of the Forth interpreter is essentially a Forth word parser, a Forth word is printable characters delimited by a white-space (or NUL) character, this code iterate on the input buffer and continue to iterate until it hit a ASCII non printable characters, it then compute the word length from the word boundaries addresses then branch to the find_word label or jump back to the interpreter caller when length is 0.

I made some shortcuts to make it as simple as it can be so it doesn't skip extra spaces and non printable characters are all considered the same way so it is just a matter of printable (words) versus non printable characters.

forth:
@ =================== READ A WORD ; any printable characters delimited by ' '
@ ===================== ON RETURN ; does not skip / trim extra spaces !
@                    end addr: r1
@                 word length: r5
@             word start addr: r9
@             cond. flags updated
@ ===================== CLOBBERED
@                      r5, r8, r9
@ ===============================
    read_word:
        mov r9, r1
        0:
            ldrb r8, [r1], #1
            subs r8, #' '
            bgt 0b
        subs r5, r1, r9         @ get word length; update condition flags
        subgts r5, #1           @ adjust if not empty
        bne find_word           @ find word if len > 0
    ldr pc, [pc, #-4]           @ jump to return addr.

forth_retn_addr:
    .word 0

Note that i don't use any branch and link relative instructions since i make use of all registers including r14 which is used as a dictionary end pointer so the interpreter return address must be stored at forth_retn_addr location before the call which is a fair inconvenience. (could be skipped if the interpreter code is wrapped)

Word dictionary lookup

The word dictionary is a linked list with the following data :
  • pointer to previous element or 0 when no previous elements
  • word flag along with word length (as a byte in my case so maximum word length is 255 characters)
  • word identifier as NUL terminated string
  • a data section which hold the word code in my case (always end with a RET macro call which is an instruction which pop the return stack directly into PC)
This code search the parsed word within the dictionary by first comparing their lengths then comparing individual characters, it jump to eval_word when the word is found or fall through parse_number if not.

@ =================== FIND A WORD
@        last dict. word addr: r2
@                 word length: r5
@             word start addr: r9
@ ===================== ON RETURN
@        found word dict addr:r12
@    found word name end addr:r10
@ ===================== CLOBBERED
@       r6, r7, r8, r10, r11, r12
@ ===============================
find_word:
    mov r12, r2
    0:
        ldrb r7, [r12, #5]      @ get dict. word len.
        cmp r5, r7              @ word length match ?
        bne 2f                  @ skip word if not
        add r10, r12, #6        @ get dict word addr.
        mov r11, r9             @ get word addr.
        1:
            ldrb r6, [r11], #1  @ word char.
            ldrb r8, [r10], #1  @ dict. word char.
            cmp r6, r8
            bne 2f              @ skip word on != char.
            subs r7, #1
            bne 1b
            b eval_word         @ found
        2:
        ldr r12, [r12]          @ get previous dict. word addr.
        cmp r12, #0
        bne 0b                  @ continue until no words to search for

Number parsing

Unknown words are always parsed as hexadecimal number in my interpreter to simplify the implementation, the number is either pushed immediately to the data stack or compiled into the current definition on compile mode.

@ ================== PARSE NUMBER
@                compile mode: r3
@                 word length: r5
@             word start addr: r9
@                          0: r10
@ ===================== ON RETURN
@ ===================== CLOBBERED
@            r4, r5, r7, r10, r12
@ ===============================
parse_number:
    0:
        ldrb r7, [r9], #1       @ get char.
        subs r10, r7, #87       @ get char. numeric value (a-f)
        sublts r10, r7, #'0'    @ get char. numeric value (0-9)
        addge r12,r10,r12,LSL #4@ n * 16 + v
        subs r5, #1
        bgt 0b
        cmp r3, #FORTH_IMM_MODE @ immediate mode ?
        pusheq { r4 }           @ immediate mode: push old value to Forth stack
        moveq r4, r12           @ immediate mode: push new value to Forth stack
        beq read_word
    1:                          @ else: compile mode
        adr r10, 2f
        b compile
    2:                          @ generated code (compile mode)
        .word 0xe52d4004        @ opcode: push { r4 }
        .word 0xe59f4000        @ opcode: ldr r4, [pc, #0]
        .word 0xe28ff000        @ opcode: add pc, #0 @ value is stored after this instruction, it is stored by "compile" from r12

Word definition (compilation) and evaluation

A word is evaluated immediately or compiled once it is found.

This code check the interpreter mode and the word flag to decide whether the word is evaluated or compiled, it put read_word as return address and call the word code on evaluation, it generate a return stack push instruction and a call instruction on compilation.

@ =============== EVALUATE A WORD
@                compile mode: r3
@    found word name end addr:r10 ; unaligned ok, will be aligned later
@        found word dict addr:r12
@ ===================== ON RETURN
@ ===================== CLOBBERED
@         r0, r5, r6, r8, r9, r10
@ ===============================
eval_word:
    add r10, #4                 @ adjust word name end addr for alignment
    cmp r3, #FORTH_IMM_MODE     @ immediate mode ?
    ldrb r8, [r12, #4]          @ get word flag
    andnes r9, r8, #0xff        @ in compile mode : is an immediate word ?
    bne compile_word
        adr r5, read_word
        stmdb r0!, { r5 }       @ push return address
        bic pc, r10, #3         @ align (point to code addr.) and jump to word code
    compile_word:
        bic r12, r10, #3        @ align (point to code addr.)
        adr r10, 1f
        b compile
    1:                          @ generated code (compile mode)
    .word 0xe28f5008            @ opcode: add r5, pc, #8
    .word 0xe9200020            @ opcode: stmdb r0!, {r5}
    .word 0xe51ff004            @ opcode: ldr pc, [pc, #-4] @ word code addr. to jump to, stored after this, stored by "compile" from r12

@ ======================= COMPILE
@        generated code addr.:r10
@              value to store:r12
@    current def. code addr. :r14
@ ===================== ON RETURN
@ ===================== CLOBBERED
@               r9, r10, r11, r14
@ ===============================
compile:
    ldmia r10, {r9-r11}         @ load generated code
    stmia r14!, {r9-r12}        @ store generated code at current definition code addr.

Note that there is alignment instructions (due to the variable length word name) so that the word code pointer fall within a 4 byte boundary which is a requirement on early ARM.

Primitives

A minimal set of primitive words is bundled in :
  • immediate which set the previously defined word flag to immediate
  • : which is the character used to specify a new word definition
  • ; which is the character that ends a word definition
  • + which is an operator word that adds the two value on the stack and push back the result
These primitives are static data, they constitute the base dictionary that will be used by the interpreter, i use a forth_word macro as a convenient way to define them so i just needs to provide the word name, flag and the word code body, this data is inserted statically into the dictionary, the word length is automatically computed by the macro.

I also use macros as a less obscure way to manipulate the return stack. (RET instruction)

The new word definition is the densest primitive as it must create a new dictionary entry, it have its own word parser that takes the word that follow : and copy it into the dictionary along with the word length, flag and link to the previous dictionary entry, it also switch the interpreter global state to compile mode so that subsequent words will be compiled instead of evaluated. (unless a word flag is set to immediate)

.include "dict_utils.inc"

forth_word "immediate"
    mov r5, #FORTH_WORD_FLAG_IMMEDIATE
    strb r5, [r2, #4]
    ret_opcode:
    ret

forth_word ":" FORTH_WORD_FLAG_IMMEDIATE
    str r2, [r14]               @ store previous word addr.
    mov r2, r14                 @ update last dict. word addr.
    add r14, #6                 @ point to name field
    0:
        ldrb r8, [r1], #1
        cmp r8, #' '
        strgtb r8, [r14], #1    @ copy name
        addgt r7, #1            @ update length (note: r7 implied to be 0 before loop which is the case after "find_word")
        bgt 0b
    strb r3, [r14]              @ append \0 (note: use r3 since mode is 0 at this point; nested defs don't make sens)
    strb r7, [r2, #5]           @ store word length
    mov r3, #FORTH_COM_MODE     @ switch to compile mode
    strb r3, [r2, #4]           @ store FORTH_WORD_FLAG_DEFAULT flag (note: use r3 since COM_MODE is same constant)
    add r14, #4
    bic r14, #3                 @ align to get to end of dict. addr.
    ret

forth_word ";" FORTH_WORD_FLAG_IMMEDIATE
    mov r3, #FORTH_IMM_MODE     @ switch to immediate mode
    adr r7, ret_opcode
    ldr r7, [r7]                @ get opcode for RET instruction   
    str r7, [r14], #4           @ store RET opcode at the end of the newly defined word
    ret

forth_last_word_addr:
forth_word "+"
    pop { r5 }
    add r4, r5
    ret
forth_dict_end_addr:


The power of Forth is that all the primitives can be redefined easily by modifying the dictionary as above or by redefining them at runtime to create a different dialect, for example adding a line comment to the dialect just require to choose the characters (for example // ) and write the code that will skip the characters from this point on until the next line, this word should likely have an immediate flag as its shouldn't be compiled. The only inconvenience of defining a line comment that way is that // should be delimited by space as it is still parsed as a Forth word.

In other Forth the word parsing code may share the same routine as the main parser code to alleviate redundancy, it may be bundled as a primitive as well, i ignored most reflection features in my Forth but they are highly powerful feature to have. Many chunk of code are shared and implemented as a primitive in most advanced Forth implementation.

Bare metal usage sample

The sample code is a bare metal usage example of the Forth interpreter, it initialize an interpreter context (memory layout, state) and evaluate a line of Forth code.

Here is the line that it interpret : : 1e f f + ; 1e

What it does is define a word called 1e containing the operations of a simple addition between two numbers 15+15 then call this newly defined word.

The result is placed at the top of the data stack (r4) and is (in base 10, after the execution) : 30

@ FORTH MEMORY LAYOUT CONSTANTS
.equ FORTH_DATA_STACK_SIZE, 1024 * 1000 @ always multiple of 4
.equ FORTH_RETN_STACK_SIZE, 1024 * 1000 @ always multiple of 4
.equ FORTH_DICT_SIZE, 1024 * 1000       @ always multiple of 4

.global _start

_start:
    @ = FORTH SETUP
    @ data stack
    adr r8, forth_data_stack_addr
    ldr sp, [r8]
    @ return stack
    adr r8, forth_retn_stack_addr
    ldr r0, [r8]
    @ input code
    adr r1, forth_code
    @ dict last word
    adr r2, forth_last_word_addr
    @ compile flag
    mov r3, #FORTH_IMM_MODE
    @ dict end
    adr lr, forth_dict_end_addr
    @ save return addr.
    adr r9, forth_retn_addr
    str pc, [r9]
    b forth

    @ r4 has value 1e at this point
    0:
        b 0b

    .include "forth.s"

    forth_data_stack_addr:
        .word (_end + FORTH_DICT_SIZE + FORTH_DATA_STACK_SIZE)
    forth_retn_stack_addr:
        .word (_end + FORTH_DICT_SIZE + FORTH_DATA_STACK_SIZE + FORTH_RETN_STACK_SIZE)
    @ code to evaluate :
    forth_code:
        .asciz ": 1e f f + ; 1e"
        .align 2

    @ = FORTH DICTIONARY
    .include "dict.inc"
_end:

Simplifying through unifying; a compile only Forth


As i wrote the Forth interpreter i realized that it was not too far from a compiler already, i just needed to drop the two modes and makes a compile only Forth ! The slight difference is the need to jump at the compiled code after running through the input.

To simplify further i dropped the ';' word later on.

The realization came from trying to simplify the dialect and optimizing for speed, the Forth interpreter wasn't necessary for my use case.

Peoples also realized this 30 years from now !

The program code was the last word of the dictionary in the first implementation, it works but adds several constrains so i made it a bit more flexible by separating the program code from the dictionary, i also added a primitive to mark any definitions as the entry point (entrypoint word), it was more flexible that way and swapping entry point can be handy for debugging or prototyping. The entry point word switch the compiler code output address from dictionary to program output address so that the definition code is put into the program area instead of the dictionary.

I also dropped immediate definitions by design to improve dialect overall cohesion, this makes the Forth much less reflective by default so much less powerful, it still support reflection through immediate primitives and literals.

It may be possible to gain some reflection back by making some words (such as stack ones) works in mixed mode (with immediate and default constructs) without having to redefine an immediate version by having the same stack implementation on compile / runtime which isn't the actual case. (the compiler stack does not have its top value stored in a register)

More Forth words: a base dictionary for a general purpose dialect


The basic Forth above isn't a full language yet, it is the base skeleton by which a Forth dialect can be defined and further refined, the dictionary is too simple as it lacks many primitives / words to be useful, the power of Forth is that this set of words can be very simple / bootstrapped (see sectorforth or milliForth) or dense (Gforth), either general purpose or task focused ones, they just have to be written !

Words

My general purpose Forth dialect at the time of writing this article include these words :
My Forth dialect has ~70 primitives, it is inspired by Chuck Moore 2000s Forth and also heavily inspired by C as i am accommodated to some of the concise symbols that C use, Joy, Factor, r3 and DSSP were also inspirational although i didn't borrow much from them.

It should be seen as a prototype / transitional step towards a colorForth environment.

I don't expose the return stack and it only handle integer arithmetic, fractional numbers can be represented as fixed-point.

I will not details the implementation of most of the primitives, many are trivial and nearly map one to one to the underlying instructions set but i will still details the non trivial constructs such as loops and conditionals.

The language support include directive in the comments, they are preprocessed at build time by tools such as AWK and the build process convert an ASCII source to a set of .ascii gas assembly directives.

Note that the primitives are tailored for a Forth in compile only mode (it might works slightly differently in the Forth presented above), i also use r5 as stack top instead of r4 to reflect the latest version of my dialect, also note that i omit the return instruction at the end of each primitives since my dialect now insert it automatically.

Loops

Here is the code for the loop, again and until construct (again is an infinite loop, until is a conditional loop, note that i use loop instead of begin which is unconventional) :

forth_word "loop"
    mov r6, pc
    stmdb r0!, { r6 }

forth_word "again"
    ldr pc, [r0]

forth_word "until"
    cmp r5, #0
    pop { r5 }
    ldreq pc, [r0]
    ldmia r0!, { r6 }

Not too difficult to implement, it use the return stack to store the address to jump back to, the return stack is convenient but may become less readable with complex constructs because the stack is shared with subroutine calls, some Forth adds a loop control stack to simplify this at the cost of introducing yet another stack.

I have two other loop construct, for and do, they are handled with the same end word which is next, for iterate backward from n (user provided value) to 0 (included), do mimic a C for loop and is able to iterate from n to another given value at a given step size, the direction of the iteration is automatically detected :

forth_word "for"
    mov r6, r5        @ start value
    mov r5, #1        @ step
    mov r7, #0        @ end value
    rpush r5-r7       @ push loop parameters on return stack
    rpush pc          @ push start address of the loop body
    mov r5, r7        @ "push" iteration value on data stack (stack top is r5)

forth_word "do"       @ same as for but takes 3 parameters instead of 1
    pop { r6-r7 }
    rpush r5-r7
    rpush pc
    mov r5, r7

forth_word "next"
    rpop r6-r9
    cmp r8, r9        @ determine iteration direction
    addgt r9, r9, r7  @ positive iteration
    cmpgt r8, r9      @ check stop condition
    sublt r9, r9, r7  @ negative iteration
    cmplt r9, r8      @ check stop condition
    addle pc, pc, #12 @ jump out of the loop when stop condition is reached
    rpush r6-r9       @ otherwise push parameters back
    push { r5 }
    mov r5, r9        @ "push" iteration value on data stack
    mov pc, r6        @ jump to start address of the loop body

The return stack is used as temporary storage for the loop index.

Unlike conventional Forth my dialect doesn't expose the iteration value (loop index) through other primitives such as "i" and "j", the iteration value is instead directly pushed on the data stack after for or do and it is up to the user to either drop it or use it later.

A way to break out of a loop at any given moment may be handy but is not as simple to implement because the code between "leave" and "next" must be skipped and making it work for all loops end word adds difficulty, i used to have a constrained "goto" but this was an ugly hack so my dialect doesn't support this right now, a way to implement a "leave" word would be to :

Conditionals

Conditionals implementation also have some complexity because the interpreter doesn't know where to jump to when a conditional is hit, this information is only available when a then word is encountered later on. (i.e. there is references to resolve later on)

My implementation generate the conditional code with room for a jump on the if word, i compute a jump length instead of resolving a jump address, this jump length is computed later on by else or then, i use the return stack once again to hold information about the location of the jump length that must be patched :

forth_word "if" WORD_IMMEDIATE
    adr r5, forth_if_code
    ldmia r5, {r5-r10}          @ load generated code
    stmia r14!, {r5-r10}        @ copy generated code
    rpop r5
    rpush r14                   @ push body length addr.
    mov pc, r5
    forth_if_code:
        cmp r5, #0
        pop { r5 }
        addne pc, #8            @ jump to if body
        forth_skip_over_code:
        ldr r6, [pc]            @ get body length
        add pc, pc, r6          @ jump to either "else" body or "then"
        .word 0                 @ if body length

forth_word "else" WORD_IMMEDIATE
    adr r5, forth_skip_over_code
    ldmia r5, {r5-r7}           @ load generated code
    stmia r14!, {r5-r7}         @ copy generated code
    rpop r5-r6
    sub r8, r14, r6             @ compute body length
    str r8, [r6, #-4]           @ update if body length
    rpush r14                   @ push body length addr.
    mov pc, r5

forth_word "then" WORD_IMMEDIATE
    rpop r5-r6
    sub r8, r14, r6             @ compute body length
    str r8, [r6, #-4]           @ update "if" or "else" body length
    mov pc, r5

Switch statement

My dialect leverage the implementation of quotations (lambdas / anonymous functions) for switch statements, it only have a case word which takes two values to compare to plus a quotation evaluated on a successful compare, the word drop the last two values on stack on a failure so other cases can be evaluated , the last code of the quotation is the default case, the word is implemented like this :

forth_word "case"
    pop { r6-r7 }
    cmp r6, r7
    moveq r9, r5                @ get case blocks addr. (when cond. is true)
    popeq { r5 }                @ restore stack
    moveq pc, r9                @ call case blocks
    mov r5, r7                  @ continue with the value to compare to otherwise

And used like this :

getc {
    'a' {
        "getc result: 'a'" puts
    } case
    'b' {
        "getc result: 'b'" puts
    } case
    "unhandled getc value: " puts putc ; default case
} call

Quotation evaluation is done through the call word which is a jump to the address pushed on return stack by "{...}" construct plus a return stack push to get out of the quotation  :

forth_word "call"
    pop { r6 }   @ see below
    mov r7, r5
    mov r5, r6   @ update TOS after pop (only due to the usage of a register for stack top value)
    rpush pc     @ go back to after call once the quotation is evaluated
    mov pc, r7   @ jump to quotation code start address

Syntax customization with a small parser (a mistake ?)


A lot can be done with Forth minimal syntax (words delimited by spaces) but i still wanted some small amount of conventional syntactic sugar to define variables or have conventional strings definition. ("..." instead of s" ...")

So i started to deviate from Forth minimal parser (as seen above) by going for a small parser.

Was it worth ?

Not sure, it add flexibility perhaps while remaining fairly lightweight; it contribute to a smaller binary rather than having specific words in the dictionary but it is only worth for a tiny amount of constructs in my case... (variables, immediate number and verbatim value) so it could have been avoided entirely by making some small edge cases on the initial parse loop... it is also more errors prone to get right, i also understood later that all the edge cases the parser solve vanish when the ASCII input is replaced by a tokenized input. (the colorForth approach)

Syntactic features of my Forth dialect

  • strings : "..."
  • arrays definition : [0 1 2 3 ...]
  • character literal : 'a'
  • quotations : {...}
  • variables definition : >my_variable
  • immediate number (push a number on compiler stack) : #42
  • verbatim value (compile verbatim value) : $42
  • line comment : ;

Reduced stack juggling with variables / registers


Stack juggling (also called "stack noise") is related to all the stack manipulation that must be done to manipulate stack data values, it is not transparent in Forth unlike C due to its stack machine model; data values must sometimes be reordered by the user which lead to high usage of stack manipulation words which might lead to highly verbose code and slightly less readable code.

There is some rules that one may apply to reduce stack noise (factor into small definitions, keep up to 3 values at most on stack) but it is not entirely avoidable as is.

One of the first feature added to my dialect to reduce stack juggling and improve readability was registers access and named parameter, i quickly found out that it was not sufficient to reduce stack juggling as much as i wanted (especially in loops) so i quickly fused it into a "variables" concept instead. I still kept registers access as it is a good way to speed up a program.

The usage of variables is slow but effectively eliminate stack juggling, i like the versatility of having variables (slow), stack words (fast), registers access (very fast) with readability variations, all of this blend easily with the base features of the language and doesn't add much complexity.

A variable in my dialect is defined with a ">" character immediately followed by a word which basically means that whatever is on stack is stored into that word, the value can be retrieved on the stack through the word. Variables name must starts with an ASCII character higher than 64. (this prevent words such as '>=' and '>>' etc. to be interpreted as a variable definition)

early implementation of variables to avoid stack juggling

Data area and adding strings, arrays and quotations


Variables were initially stored in the base dictionary but i realized that many form of anonymous data might be stored there later on (strings, arrays etc.) thus i introduced a separate area to store all data (similar to the data segment) so it doesn't mess with the base dictionary. Also tried to place data where they were defined but this had performances impact due to generated jump instruction to skip over the data, some features were also harder to add due to this later on.

Variables

Variables are defined like a word definition but are stored in the data area, they are parts of the dictionary (linked with it) so the compiler find their symbol through a regular dictionary search, the compiler distinguish between regular word and variables through a word type flag.

A local variables implementation could be introduced easily by limiting the word search on a variable definition to the current definition, local variables actually help a lot for things like recursion. Perhaps a better implementation is to store them on the return stack !

String type

My dialect support a multi-line conventional "..." string syntax, it support null-terminated string and length-prefixed string at the same time. The implementation generate code that push the string address on stack.

The string length can be retrieved through the dlen primitive.

Array

The dialect also support the definition of integer arrays with "[0 1 2 3]" syntax, the implementation is similar to the string implementation (it is length-prefixed) although the parser state switch to "array" mode; parse_number routine is then reused to directly put parsed values as a linear series of numbers in the data section, the parser state is reset on "]".

Quotations

Quotations is supported in my dialect with "{ ... usual code... }" syntax. It enable the definition of anonymous block of code. (lambdas / anonymous functions)

Quotations does not use the data area, code is generated inline.

The implementation is straightforward :
  • upon "{" the compiler generate code that push the quotation start address on stack and also generate a jump to skip over the quotation body (jump length is not yet resolved)
  • upon "}" the compiled generate a RET instruction (which use the Forth return stack) and resolve the previous jump length (it compute the code body length from the start / end address)
The call primitive can be used to "call" a quotation. (see code above)

Quotations are powerful as they can simplify all "blocks" / control structures such as conditionals or switch statement, an example in this dialect is the implementation of the "switch" statement which has a quotations implementation.

The array statement can be replaced by using a quotation with verbatim literals :

{
    $42 $deadc0de ; verbatim literals
} ; data start address is now on stack
; quotation could also be called to put all the data on stack when the data is defined as regular values (not verbatim)

See also how the DSSP programming language use blocks to simplify some constructs, it has some similarity to quotations although they are not anonymous.

Data allocation word: allot

The array construct is handy but a way to programmatically allocate arrays is missing so the allot primitive was implemented :

forth_word "allot" WORD_IMMEDIATE
    pop { r5 }
    add r10, r4, #4             @ assign data addr. (+4 to skip length)
    adr r7, number_code_generic @ reuse generic number load code
    ldmia r7, {r7-r9}           @ load generated code
    stmia r14!, {r7-r10}        @ copy generated code
    str r5, [r4], #4            @ store length
    add r4, r4, r5              @ offset data
    add r4, #4
    bic r4, #3                  @ align

This word is immediate, it is called when the code is compiled thus i also needed a way to define immediate numbers (due to a design issue) which is done by prefixing a number with #.

The first use of allot was to define a static buffer for a number to string conversion routine :

number to string conversion in my Forth dialect

I also added an "align" word later since a way to align the next alloted data was missing, alignment is a requirement in some bare metal cases. (mailbox CPU / GPU interface requirement)

U-Boot I/O primitives


I/O was implemented with U-Boot standalone API at first (see first article) then U-Boot API (which has storage device access), it was quite time consuming as it required to tinker with U-Boot a lot and constantly gathering knowledge on how to interface with it at this level.

The result is a set of assembly macros that save / restore U-Boot context (stack and global_data address) plus some macros for standalone API calls and a lengthy macro for the main API interface.

See this file for the U-Boot macros.

Optimization: Constant load for 8-bits number


An easy optimization to do was to detect 8-bits number and generate a constant load instruction which is way faster than a data load; code that directly load the number into the stack top register.

Optimization: Inline expansion


My dialect inline code by default for performance reasons, it was fairly easy to tweak the compiler to generate inline code from the use of Subroutine Threaded Code by directly copying the word body code instead of generating a call instruction.

An early implementation issue was to determine the word body length which was solved by going for a doubly linked list dictionary implementation.

Also had issues with the code generation of some primitives (conditionals etc.) as i used to patch by address initially, i switched to patching by jump length instead which is actually way more flexible.

Inline expansion brought some other issues such as broken recursion which was addressed by detecting self calling words, it default to a STC model in this case. I later found out that this sort of recursion in a traditional Forth use a "recurse" word which has a simpler implementation than allowing self calling words. (less edge cases)

There is no mutual recursion support in my dialect due to lacks of forward declaration, recursion is also a bit dodgy with global variables.

Inline expansion can bring easier optimization later on such as reducing a set of words to a smaller set of words.

Early benchmark

I did a crude (visuals only; no measurements done) benchmark with / without inline expansion which demonstrate that inline expansion made the code between two to three times faster.

The benchmark is a graphics program that shows some procedural logo gfx, the program run on a bare metal RPI Zero 1.3.

crude benchmark; left is STC model, right is inline expansion optimization (ARM cache off)

Early dialect benchmarks against C / GCC optimization levels

I then did a C version of the same program and did some measurements of the program compiled with my Forth dialect against C / GCC optimization levels :

my Forth dialect vs GCC optimization levels (ARM cache off)

my Forth dialect vs GCC optimization levels (ARM cache on)

C code vs Forth dialect code side by side (high res)

Was not too bad as it approach GCC -O0 level.

Primitives were not optimized at this time, it might be closer to -O0 level now with optimized primitives. (an early round of primitives optimization brought the Forth down to ~262ms)

Reaching GCC -O0 level speed

Variables usage is slow in my dialect so most variables were replaced by stack manipulation words (the Forth way !) + registers and "pix" word was implemented to do optimized pixels access by using multiply accumulate + base / offset store ARM instruction :

a quickly optimized variant (note : was still using some variables !)

updated benchmark

Speed is now between GCC -O0 and -O1, not even fully optimized ! (note : for some unknown reasons the C code has a 20ms gain when it is compiled for ARMv4 instead of ARMv6)

Full example program


Here is a full sample program of my dialect that print RPI Zero 1.3 I/O device info (SD device) to screen and read / write to it through U-Boot API calls :

Conclusion


It was an incredible journey for me to go from a high level Forth to a low level minimal prototype then build a complete dialect / compiler out of it, most of the development, debug and testing was done on bare machine (SD swap !) with occasional help from cpulator, the compiler code is about ~500 lines of assembly (total) which is still quite small for all the features it pack, it can serve as a base to go further.

There is some improvements here and there such as using the same stack design for the compiler / runtime, doing tail-call optimization or forward declaration or going back to a simpler parser as i am not sure the actual parser is even needed.

the benchmark graphics program running on RPI Zero 1.3

What is next ?

My next goal is to rewrite the dialect in itself but in a tokenized way to make it even simpler, i also want to build a whole prototyping environment around it. (colorForth style)

back to topLicence Creative Commons