Home
Admin | Edit

Archimedes ARM2 ~50 FPS Scaled dots 3D landscape / fake voxel 3D height-field

This is a technical write up about my first graphics program for the ARM2 8MHz Archimedes: a 3D height-field at 50 FPS.

Few weeks ago i started programming for the ARM2, i decided to do some graphical effects after being accustomed to the tools. (BBC BASIC 5 inline assembly, RISC OS API)

I decided to do a height-field, making some rough prototype in p5js and then do the ARM2 implementation:

Acorn Archimedes, ARM2, 8MHz real-time landscape (first version, ~900 onscreen 16x11 blocks)

The initial idea was to see how fast the ARM2 CPU was by using a brute-force height-field rendering approach, how many animated blocks making up a clean 3D landscape could i push at ~50 FPS, 320x256 resolution and 256 colors ? The answer is more than ~950 real-time onscreen 16x11 blocks on a 1987 Archimedes with a 8MHz ARM2 CPU! Note that there is also some screen clearing. (+ FPS / OS calls)

Second version is slightly faster, ~1100 onscreen 16x11 with bitmap background on ARM2 MEMC1

Benchmark video of the landscape running on ARM2 8 MHz, ARM250 (12 MHz, 16 MHz)

Height is altered / blocks are clipped in real-time. (clipping is done in selected fixed width blocks when they overflow on the left or right)

It is possible to move in any directions on the landscape although there is no code for yaw rotation.

This is quite heavily optimized for the cache-less ARM2 CPU with heavy loops unrolling and duplicate maps data resulting in a ~500kB binary for 128x128 maps. The color-map and height-map come from Comanche: Maximum Overkill. Some data are compressed with LZ4.

ARM3 with its cache would have a hard time with all the loops unrolling so the code is not adapted to ARM3+ CPU, it can be converted easily though.

Here is a summary of all the optimization methods i applied:
All of this result in a very simple rendering code albeit with massive code generation and code for all the preprocessing stages.

same as above with different map (prototype version)

16 colors version, 8x10, ~2000 rectangles, first version

The 16 colors version is about twice as fast / use 2x less memory, it writes a single word which amount to 8 pixels at once, it also fetch 5 points at once. It can draw ~2000 rectangles. Screenshots don't look that good due to raw conversion of the assets to 16 colors, it also uses early code with less accurate perspective.

Memory usage

Program memory usage of the 256 colors version is ~500kB with 128x128 maps. For simplicity there is no data loaders so the program binary embed all the data (even compressed ones) which result in about ~50kB overhead, ideally the data would be loaded / freed dynamically as needed.

There may be many improvements for memory usage, about 100kB can be gained at the expense of some CPU cycles by using a single landscape map data, i don't think this would affect performances much. Maybe the amount of generated code can be improved as well.

Memory usage would be much lower (more than 100kB) without loops unrolling. ARM2 code density is not great.

With 256x256 maps it goes to more than 1MB uncompressed.

Scaled dots landscape ? Fake voxel heightfield ?

Dots landscape ?

A dots landscape is a simple well-known retro graphical effect where dots are positioned on a plane (with 3D projection) and the points Y position is altered by a height-map, coloring the points based on their height (as seen below) or a color-map is also possible to make it nicer, this was popular on 80s computers like the Atari ST or Amiga, i actually started by implementing a dots landscape:

the points generator with points Y position altered by a noise function (p5js prototype)

same with more dots and animation (p5js prototype)

As you may see the only change is the points vertical position and shading, there is no changes on the horizontal axis, this can be used as an optimization by precomputing the grid.

Scaled ? Fake voxel ?

A brute-force idea to make it 'voxel-like' is to 'scale' each dots (i.e., turning them into squares/rectangles) so there is no gaps between them. This is the core idea of the algorithm. It is a brute-force approach because high amount of (small) rectangles is needed for high quality landscape.

Same as above with 8x12 rectangles instead of dots, height based shading (left) and color-map based shading (right), iteration step: 2 (p5js prototype)

The issue of this rendering method is the gaps which may appear between rectangles, especially noticeable when the height difference between points is high and the rectangle is too small.

There is several ways to hide the gap such as increasing the points density or rectangles size, the computing power required is higher however. Height range can be decreased as well.

Some gap are actually okay if the screen is left uncleared for the parts of which the landscape cover, gaps will be filled and most visual artifacts will be unnoticeable.

Another way would be to draw adequately sized rectangle / vertical rectangles / lines, this is how Voxel Space rendering engine did it back in the 90s, there is some advantages of doing it that way mainly related to optimizations.

This rendering method can draw textured planes as-is easily by dropping the height component. This could be useful in 3D platform games for old-school platforms.

Textured plane rendered with the same method (p5js prototype)

The rendering method presented in this article use rectangles of fixed size and was chosen because it is quite effective on the Acorn Archimedes due to fast ARM block copy instructions (ldmia / stmia), block copy instructions are able to load / store the ARM 32-bit registers in a single instruction. This means that drawing a 16x16 pixels square in 256 colors mode can be done in 31 instructions which equals to ~127 CPU cycles. (in 256 colors mode a pixel = one byte thus storing a register = 4 pixels at once and 8 pixels at once for 16 colors mode etc.)

Main disadvantage of fixed size blocks is that many cycles might be wasted on overdraw.

Rendering / ARM2 optimizations

Weak perspective projection

The first version of the grid algorithm was built with fixed point arithmetic as i thought of computing the grid on the ARM2 at first, i decided to switch to preprocessed points data later on to match my real-time performances requirements, i was also using a custom projection at first but switched to weak perspective projection later on as it is actually nicer and flexible in most cases and required less points in the end.

The camera setup can get pretty close to the one in a game like Comanche: Maximum Overkill (1992).

Here is the basis of the points generator algorithm (p5js):

function setup() {
  createCanvas(320, 256)
 
  rectMode(CENTER)

  noStroke()
  fill(255)
}

function draw() {
  background(0)
 
  let points = 0
 
  let stepx = 4
  let stepy = 4

  // might have to play with the extent of the loops here to fill the screen
  for (let y = 0; y < height / 2; y += stepy) {
    for (let x = -width / 2; x < width / 2; x += stepx) {
      let h = 1 // replace by noise function or height-map sampling

      let xx = x / (height - y) * height
      let yy = y / (height - y) * height

      // color the rectangle based on a color-map by using x/y as u/v coordinates like:
      //  const mapIndex = (floor(x + width / 2) + floor((y + frameCount) % colormap.height) * colormap.width) * 4
      //  const red = colormap.pixels[mapIndex + 0] // get red color component
      //  etc.
      // the same index can be used for sampling the height-map
      // and multiplying, dividing x and y in the map lookup allow to adjust the 'camera' FOV
      
      rect(width / 2 + xx, height / 2 + yy / 2 + h, 2, 2)
      
      points += 1
    }
  }
 
  fill(255)
  text(points, 16, 16)
}

Visibility

All points of the grid are preprocessed in a p5js tool, the tool basically run the landscape code for some frames and store all the rectangles which will be visible at some point in an array along with metadata like UV, position and width of a block. The generated data files are downloadable and are used by the Archimedes landscape rendering code.

Sides clipping

Fixed size rectangles may overlap the edges of the screen and may produce visual artifacts as can be seen below.

16px width blocks overlapping the left side piles up on the right side

This was solved by detecting the blocks that overlap the edges and fixing their position to a 4px grid in the tool that generate the points data, the points on the Archimedes are then drawn using different width (4, 8, 12 or 16px). 4 is the smallest block width because it fits into a word on the Archimedes.

The first implementation was real-time and a bit unreadable as can be seen here, it used a branch table and modified the PC directly to jump to the appropriate block code.

Points horizontal position never change though so all the code of the first implementation is unnecessary... clipping can be done efficiently by selecting / generating the right block draw code at compile time for each points, this result in a much simpler / faster / smaller implementation with CPU cycles gain compared to a no clipping approach !

Culling / clipping bottom blocks

Culling / clipping bottom blocks is needed because blocks can also overflow the bottom part of the screen which actually result in a crash unless many screen banks are used.

The sides method wouldn't work here because the rectangles Y position is altered in real-time.

The first implementation had a simple culling / clipping combo which was implemented as two instructions per rectangle, checking the Y position against the screen boundary and jumping to the next rectangle when the block overflow. I was using 4 screen banks for clipping, this was a cheap trick to make the bottom blocks appear clipped by allowing the ones partly visible to overflow the bottom part of the screen.

Found a solution later on to only use 3 screen banks, the first bank would have thrown some blocks to the second ones but this could have been mitigated by clearing the top part of the screen (or drawing the sky) before rendering the second bank, it was the same result with very small performance penalty and better memory usage.

Proper clipping of the bottom blocks

I was not satisfied with the clipping method of the first implementation, it had the advantage of being very simple code-wise at the expense of high memory usage (4 or 3 screen banks instead of 2) and wasting some precious CPU cycles.

The current implementation does proper clipping (lines outside the screen are discarded) and don't use additional banks, it use a duplicate of the block draw code with 2 instructions added after each lines to check if the current line overflow.

The clipping code use a subroutine call to make the code simpler, it is good enough as-is but could be optimized further at readability expense. I don't think the cycles would be worth though.

Skipping screen rows

The second version can skip screen rows, this provide a huge boost to performances at the expense of looking more blocky if the step is high, it is generally fine for a step of 2 and begin to be very noticeable at a step of 4.

The way it works is by fixing all the blocks to a grid (vertically) which is the size of the wanted step then i just skip one of two (or more) lines of each rectangles and copy the lines of the screen as needed outside the main rendering loop.

It is fast because the lines copy uses unrolled loop plus ARM block copy instructions (ldmia / stmia).

Going further with optimizations

Ideas that may provide some boost:
  • the height / color lookup data could be provided as 'data as code', there would be no memory lookup to gather the maps information, just code like some jump and `mov`, still unsure how much gain this would provide since the actual lookup is already efficient (done with a block instruction) so it doesn't require much cycles to execute (~6 cycles), this idea may also be severely limited by the ARM2 instructions set due to constrains on immediate value encoding
  • drawing the bitmap background can be slightly more efficient by taking into account screen rows skipping
  • there may be some optimization to do with rows skipping such as copying one frame on two, this may produce visual artifacts though
Some more ideas on how to optimize further if the view / camera path is constrained:
  • don't generate code for points elevation if the point never does for the chosen view / camera path, require to generate some kind of points height list and skip the instructions as needed (~2 cycles gain / rectangle)
  • don't generate code for points culling if the point never overflow the bottom part of the screen for the chosen view / path, require to generate some kind of list and skip the instructions as needed (~2 cycles gain / rectangle)

How efficient it is ?

At 50Hz 256 colors mode, the second version of the code can draw ~1100 16x11 rectangles per frame on the 1987 Acorn Archimedes with MEMC1. (skipping one of two screen row)

The theoretical maximum of cycles per frame on a 1987 Acorn Archimedes is a bit less than 160000 cycles.

The most critical part (and the ones i consider the most efficient) is the rectangle rendering code which alone consume on average 98 cycles (for 16x11 block) plus ~38 cycles added by the DRAM controller (every 4 words).

So, if we scale this up we can theoretically render ~1160 rectangles / frame at 50Hz with that code alone.

For a more accurate answer i made some simple test with that code and came up with ~1200 rectangles / frame at 50Hz. This give an estimate of the theoretical limit at 50Hz.

The rectangle rendering code alone is of course not sufficient to draw a real-time landscape, there is at least one lookup for the points data, texture map and elevation data plus some instructions in between, this amount to approximately 20 more cycles / rectangle, clearing half the screen efficiently is about 10k cycles as well, double with a bitmap and skipping/copying lines of the screen is about 20k cycles as well.

So, it is nearly at its peak efficiency already.

The numbers are a rough estimate and don't take into account clipping, culling and main code. Clipping may add a small cost sometimes because of the use of a subroutine.

A slide i like to refer to when it comes to timings on the ARM2 (from ABug 05 Sarah Walker talk)

Render loop code


Below is the BBC BASIC function which generate the ARM2 render loop code (first implementation with multiple screen banks) that is executed each frames. It is small with ~20 instructions for each points, about half of which is duplicate block copy instructions to render each lines of the rectangle.

Would be simpler without code generation!

The code with improved clipping (final version) is slightly more dense but is about the same for fully visible blocks.

Download


Most recent ARM2 assembly sources + binary and associated JavaScript files to generate the data can be found here.

Just put all the files into hostfs Arculator directory and either run the binary directly (!bin file) or run !assemble to build the program. This include !bass which is my BBC BASIC V aided framework / assembler.

On real hardware you may want to use this archive (soon) which has preserved metadata.

Sources for the 16 colors version can be found here (first version)

First 256 colors version sources / binary can be found here.

Tools

Two p5js (JavaScript) tools are included, one is used to preview / generate data and the other is a simple 256 colors bitmap exporter which is used for the background.

Use p5js for the tools, you can just edit/run the code within the online editor.

The landscape generator is quite flexible: everything can be done inside the tool without having to fiddle with assembly, this include all parameters which control the landscape quality / resolution and geometry.

The data generated by the generator can be used directly by overwriting the existing files and assembling again.

Conclusion


This was fun to optimize, it was all done within Arculator (Archimedes emulator) with bundled RISC OS apps (!edit and BBC BASIC) on 1987 emulated hardware, like a real 80s development environment (also included slow compile time and difficult to debug bugs !), doing the prototyping in JavaScript helped to iterate a lot faster though.

The landscape code is quite simple and flexible and could be used for games / demos. The advantage of doing the brute-force landscape approach is the amount of parameters (rectangles count, rectangles size, grid layout etc.) to tweak to improve the quality of the landscape (can make trade-off easily), it is also an efficient approach for the ARM2 due to block copy instructions.

It could be used to do textured planes made of tiles (a bit like Star Fighter 3000 game but more restricted) by dropping the height parameter.

May also be used to draw objects with some changes, bit like point cloud.

For games, decals or height modifying stuff (Magic Carpet anyone ?) can be done easily and fast by poking into the maps.

Also wonder how textured blocks would looks, wouldn't be hard to adapt, maybe it would look like those 80s/90s 3D games with many scaled sprites.

There is not much cycles left as-is on ARM2 to stay at 50 FPS (without tweaking / changing screen mode or going for lower quality), on ARM250 (12MHz) there would be more cycles available though.

back to topLicence Creative Commons