ARM2+ LZ4 decompression routine

Few days ago i ported to ARM2 a LZ4 decompression routine which was made for ARM Cortex-M0. It took me a few hours to get it working, it did not use much modern ARM instructions.

LZ4 is a lossless general data compression algorithm focused on compression / decompression speed. It belongs to the LZ77 compression schemes. This sort of compression was actually used on some old-school platforms like Atari ST or Apple II because it was fast and good.

The point was to compress my Acorn Archimedes binaries (because ARM code density isn't great) and associated data like bitmaps. The later idea is to have some kind of loader which take care of that dynamically, it must be fast since it will decode data at runtime.

At first i was going for the old LZW compression algorithm which is very simple to implement (compression isn't great though) but stumbled upon LZ4 later on which had great compression / speed ratio and is also relatively simple.

The decompression speed and simplicity was the most appealing to me, you don't want to make peoples wait on some logos when you do a demo that use a lot of optimizations like loops unrolling, code generation and various lookup data.

The code probably works on many old-school ARM CPU (ARM2, ARM250, ARM3 and maybe newer), it is completely independent as i made it rely on its own stack. It is also the most recent version. (the author could not modify its own post after some times and some peoples pushed fixes in the comments section of the ARM online board, those fixes were integrated below)

It does not parse any headers so you may have to remove the compression header if you compress your data with some LZ4 tools. You must also append a 16-bit value on top of the compressed data, this value is the length of the compressed data. Since it is 16-bit the compressed data must fit into 64KB.

The original author uploaded a simple Perl tool to compress (with the official compressor), remove the compression header and append the 16-bit length at the top. I modified it a bit for my needs since it produced too many files (like some C code) and i wanted the files to have Arculator hostfs style file-type/extension. The updated Perl script can be found below after the decompression routine. The original filename of that tool was lz4cut, just save it somewhere, do a chmod +x lz4cut and compress your files with lz4cut your_filename the program will produce two files, the compressed one with headers (end with lz4) and the one compatible with the decompression routine below (start with lz4). You must have the lz4 binary somewhere in your system. (you can compile it from sources)

The performances are good (my data is loaded in few seconds) although the code could probably be much more optimized. As for the size it compress my 629KB binary to 38KB so it is very efficient! For some bitmap data it compress 131KB into 25Kb which is also not bad. Here is some write-up about LZ4 on 8088/8086 CPU which can give some hints about what this algorithm is capable of.

Usage

ldr r0,lz4DataAddr

ldr r1,dstDataAddr

bl unlz4

ARM2+ LZ4 decompression routine

 dcd 0

 dcd 0

 dcd 0

 dcd 0

.unlz4_stack ; stack goes upward



; LZ4 decompression routine ported to ARM2 by grz-

; credits ->
community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/lz4-decompression-routine-for-cortex-m0-and-later


; make sure to remove compressed data header if you use a
compression tool

; make sure to append a 16-bit value at the top of your compressed
file which represent the compressed data length

; compressed data should thus fit into 64kb

; commented lines is the original version

; r0 = address of the compressed data

; r1 = RAM address of the decompressed data

; r4,r5,r6 = preserved

.unlz4

 ;ldrh r2,[r0] ; get length of compressed data

 ldr r2,[r0]

 mov r2,r2, ROR #16

 mov r2,r2, LSR #16

 adds r0,r0,#2 ; advance source pointer



 ;push {r4-r6,lr} ;save r4, r5, r6 and return-address

.unlz4_len

 adr r13,unlz4_stack

 stmfd r13!,{r4-r6,r14}

 adds r5,r2,r0 ; point r5 to end of compressed data

.unlz4_getToken

 ldrb r6,[r0] ; get token

 adds r0,r0,#1 ; advance source pointer

 ;lsrs r4,r6,#4 ; get literal length, keep token in r6

 movs r4,r6, LSR #4

 beq unlz4_getOffset ; jump forward if there are no
literals

 bl unlz4_getLength ; get length of literals

 movs r2,r0 ; point r2 to literals

 bl unlz4_copyData ; copy literals (r2=src, r1=dst,
r4=len)

 movs r0,r2 ; update source pointer

.unlz4_getOffset

 cmp r0,r5

 bge unlz4_bye

 ldrb r3,[r0,#0] ; get match offset's low byte

 subs r2,r1,r3 ; subtract from destination; this will become
the match position

 ldrb r3,[r0,#1] ; get match offset's high byte

 ;lsls r3,r3,#8 ; shift to high byte

 movs r3,r3, LSL #8

 subs r2,r2,r3 ; subtract from match position

 adds r0,r0,#2 ; advance source pointer

 ;lsls r4,r6,#28 ; get rid of token's high 28 bits

 movs r4,r6, LSL #28

 ;lsrs r4,r4,#28 ; move the 4 low bits back where they
were

 movs r4,r4, LSR #28

 bl unlz4_getLength ; get length of match data

 adds r4,r4,#4 ; minimum match length is 4 bytes

 bl unlz4_copyData ; copy match data (r2=src, r1=dst,
r4=len)

 cmp r0,r5 ; check if we've reached the end of the compressed
data

 blt unlz4_getToken ; if not, go get the next token

 ;pop {r4-r6,pc} ; restore r4, r5 and r6, then return

 ldmfd r13!,{r4-r6,r15}



.unlz4_getLength

 cmp r4,#15 ; if length is 15, then more length info
follows

 bne unlz4_gotLength ; jump forward if we have the complete
length

.unlz4_getLengthLoop

 ldrb r3,[r0] ; read another byte

 adds r0,r0,#1 ; advance source pointer

 adds r4,r4,r3 ; add byte to length

 cmp r3,#255 ; check if end reached

 beq unlz4_getLengthLoop ; if not, go round loop

.unlz4_gotLength

 ;bx lr

 mov r15,r14 ; return

.unlz4_copyData

 rsbs r4,r4,#0 ; index = -length

 subs r2,r2,r4 ; point to end of source

 subs r1,r1,r4 ; point to end of destination

.unlz4_copyDataLoop

 ldrb r3,[r2,r4] ; read byte from source_end[-index]

 strb r3,[r1,r4] ; store byte in destination_end[-index]

 adds r4,r4,#1 ; increment index

 bne unlz4_copyDataLoop ; keep going until index wraps to
0

 ;bx lr ; return

 mov r15,r14

.unlz4_bye

 ldmfd r13!,{r4-r6,pc}

Perl script (lz4cut)

#!/usr/bin/perl

use strict;



my $infile=@ARGV[0] or die("no input file.");



my ($file, $name)=removeExtension("$infile");

my $lz4file="$file.lz4";

my $binfile="$file.bin";

my $cfile="$file.c";

($name) = "$name" =~ /^([^\.]+)/;

my $array = "s" . uc(substr("$name", 0, 1)) . substr("$name",
1);



my $e = system("lz4", "-9", "$infile", "$lz4file");

if(-1 == $e){ die("failed compressing $infile to $lz4file."); }



my $bin;

if(open(LZ4FILE, "<", "$lz4file"))

{

    my $b;

    if(read(LZ4FILE, $b, 11))

    {

        my ($i, $f, $j, $l) =
unpack("H8CA2V", $b);

        if("$i" eq "04224d18")

        {

           
seek(LZ4FILE, 1, $f & 8);

            my $cl =
read(LZ4FILE, $bin, $l);

            if($cl ==
$l)

            {

           
    print("Extracted $l bytes from $lz4file\n");

           
    print("prepending 16-bit length.\n");

           
    $bin = pack("v", $cl) . "$bin";

            }

            else

            {

           
    die("$lz4file is garbled.");

            }

        }

    }

    close(LZ4FILE);

}



if(length($bin))

{

    if(open(BINFILE, ">", "$binfile"))

    {

        print(BINFILE "$bin");

        close(BINFILE);

    }

    if(open(CFILE, ">", "$cfile"))

    {

        print(CFILE "static const
uint8_t ${array}[] = {\n");

        my $data;

        while("$bin")

        {

            $data =
substr($bin, 0, 16);

            $bin =
substr("$bin", length($data));

            $data =~
s/(.)/sprintf("0x%02x, ", ord($1))/seg;

           
print(CFILE "\t$data\n");

        }

        print(CFILE "};\n");

        close(CFILE);

    }

}



sub removeExtension(pathname)

{

    my @path = split(/\//, "$infile");

    my $fname = pop(@path);

    my @fname = split(/\./, "$fname");

    scalar(@fname) > 1 && pop(@fname);

    push(@path, join(".", @fname));

    (join("/", @path), "$fname");

}