AVX2 depth buffer check and 32-bit to 24-bit data conversion

I have been tinkering with my C pseudo-REYES software renderer again lately. In particular I worked on SIMD optimizations to composite multiple render buffers into a single buffer with a depth check. The code relies heavily on Intel x86 / x86-64 AVX2 instructions, which operate on 256-bit data; in my case it processes 8 pixels at a time, branchless, which greatly improves the performance of my software renderer.

The idea was to replace this compositing loop (the depth buffer is a series of float values, the input/output buffers are series of bytes):

// bpp = bytes per pixel (3 for RGB, 4 for RGBA); len = buffer length in bytes
int di = 0; // pixel index into the depth buffers
for (int i = 0; i < len; i += bpp) {
    float depth = input_depth_buffer[di];

    if (depth > output_depth_buffer[di]) { // compare z value
        output_buffer[i+0] = input_buffer[i+0];
        output_buffer[i+1] = input_buffer[i+1];
        output_buffer[i+2] = input_buffer[i+2];
        if (bpp == 4) {
            output_buffer[i+3] = input_buffer[i+3];
        }

        output_depth_buffer[di] = depth; // copy depth
    }

    di += 1;
}

with an optimized version that works on 8 pixels at a time. My idea was to use the AVX2 compare instruction (vcmpps), which produces a mask, and feed that mask to a masked store instruction (vpmaskmovd), which writes data selectively based on the mask value.
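
To make the mask concrete, here is a small standalone snippet (an illustration, not code from the renderer) showing what vcmpps produces: each 32-bit lane becomes all ones or all zeros, which is exactly the form vpmaskmovd expects as a write mask:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    __m256 in_depth  = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
    __m256 out_depth = _mm256_set1_ps(4.5f);

    // vcmpps: each lane becomes 0xFFFFFFFF (greater) or 0x00000000
    __m256 mask = _mm256_cmp_ps(in_depth, out_depth, _CMP_GT_OS);

    unsigned int lanes[8];
    _mm256_storeu_si256((__m256i_u *)lanes, _mm256_castps_si256(mask));
    for (int i = 0; i < 8; i++)
        printf("pixel %d: 0x%08X\n", i, lanes[i]); // pixels 0-3: zeros, pixels 4-7: all ones
    return 0;
}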

My buffers can be 32-bit or 24-bit, so I needed to make the code work for both types of data. The 32-bit code took me an evening to figure out; the 24-bit version took two days and proved harder, because some AVX2 instructions are limited (vpshufb for instance) and there is no mask store instruction for 8-bit data. All of this was added in AVX-512, but alas my CPU (i7 7600) doesn't support that instruction set.
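
For reference, here is an untested sketch of what one 8-pixel step of the 24-bit case could look like with those AVX-512 instructions (AVX512F/BW/VL, plus BMI2 for _pdep_u32). The function name and layout are mine, and I could not test it since my CPU lacks these extensions:

#include <immintrin.h>

// untested AVX-512 sketch; names mirror those used in the code later in the post
static void composite8_rgb_avx512(unsigned char *output_buffer,
                                  const unsigned char *input_buffer,
                                  float *output_depth_buffer,
                                  const float *input_depth_buffer)
{
    __m256 z_in  = _mm256_loadu_ps(input_depth_buffer);
    __m256 z_out = _mm256_loadu_ps(output_depth_buffer);
    __mmask8 k = _mm256_cmp_ps_mask(z_in, z_out, _CMP_GT_OS); // 1 bit per pixel

    // expand each pixel bit to the 3 bytes of its RGB triple (24 mask bits total)
    __mmask32 kb = (__mmask32)(_pdep_u32(k, 0x249249u) * 7u);

    // loads 32 bytes for 24 bytes of pixel data, like the AVX2 loop below
    __m256i px = _mm256_loadu_si256((const __m256i_u *)input_buffer);
    _mm256_mask_storeu_epi8(output_buffer, kb, px); // the missing 8-bit mask store
    _mm256_mask_storeu_ps(output_depth_buffer, k, z_in);
}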

32-bit (4 bpp; RGBA) C code


Here is the 4 bpp version using Intel intrinsics:

// to use Intel intrinsics with GCC: #include <x86intrin.h> and compile with -mavx2
const int vec_step = 8;               // 8 pixels per 256-bit iteration
const int loop_step = bpp * vec_step; // bytes consumed per iteration
for (int i = 0; i < len; i += loop_step) {
    // compare z value
    __m256 ymm0 = _mm256_loadu_ps((const float *)input_depth_buffer);
    __m256 ymm1 = _mm256_loadu_ps((const float *)output_depth_buffer);
    __m256 ymm2 = _mm256_cmp_ps(ymm0, ymm1, _CMP_GT_OS);

    // copy input buffer following mask value
    __m256i ymm3 = _mm256_loadu_si256((const __m256i_u *)input_buffer);
    _mm256_maskstore_epi32((int *)output_buffer, _mm256_castps_si256(ymm2), ymm3);

    // copy depth following mask value
    _mm256_maskstore_ps(output_depth_buffer, _mm256_castps_si256(ymm2), ymm0);

    input_depth_buffer += vec_step;
    output_depth_buffer += vec_step;

    input_buffer += loop_step;
    output_buffer += loop_step;
}

This version is a straightforward execution of the idea above; it works pretty well because the instructions map cleanly onto 32-bit data.

24-bit (3 bpp; RGB) C code


Here is a working 3 bpp version:

// _mm256_shuffle_epi8 but shuffles across the two 128-bit lanes
// source : https://github.com/clausecker/24puzzle/blob/master/transposition.c#L111C1-L124C2
static __m256i avxShuffle(__m256i p, __m256i q) {
    __m256i fifteen = _mm256_set1_epi8(15), sixteen = _mm256_set1_epi8(16);
    __m256i plo, phi, qlo, qhi;

    // broadcast the low / high 128-bit lane of p to both lanes
    plo = _mm256_permute2x128_si256(p, p, 0x00);
    phi = _mm256_permute2x128_si256(p, p, 0x11);

    // indices > 15 become negative here, so vpshufb zeroes them in the low-lane pass
    qlo = _mm256_or_si256(q, _mm256_cmpgt_epi8(q, fifteen));
    // indices <= 15 become negative here, so they are zeroed in the high-lane pass
    qhi = _mm256_sub_epi8(q, sixteen);

    // each byte comes from exactly one of the two shuffles; OR merges them
    return (_mm256_or_si256(_mm256_shuffle_epi8(plo, qlo), _mm256_shuffle_epi8(phi, qhi)));
}

const int vec_step = 8;               // 8 pixels per 256-bit iteration
const int loop_step = bpp * vec_step; // bytes consumed per iteration
// expands each 32-bit mask lane into 3 identical mask bytes (indices 0,0,0 / 4,4,4 / ...); -1 zeroes the unused tail
const __m256i mask_24bpp = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 28, 28, 28, 24, 24, 24, 20, 20, 20, 16, 16, 16, 12, 12, 12, 8, 8, 8, 4, 4, 4, 0, 0, 0);
for (int i = 0; i < len; i += loop_step) {
    // compare z value
    __m256 ymm0 = _mm256_loadu_ps((const float *)input_depth_buffer);
    __m256 ymm1 = _mm256_loadu_ps((const float *)output_depth_buffer);
    __m256 ymm2 = _mm256_cmp_ps(ymm0, ymm1, _CMP_GT_OS);

    __m256i ymm11 = avxShuffle(_mm256_castps_si256(ymm2), mask_24bpp);

    // copy input buffer following mask value
    __m256i ymm3 = _mm256_loadu_si256((const __m256i_u *)input_buffer);
    __m256i ymm12 = _mm256_loadu_si256((const __m256i_u *)output_buffer);
    __m256i ymm13 = _mm256_blendv_epi8(ymm12, ymm3, ymm11);
    _mm256_storeu_si256((__m256i_u *)output_buffer, ymm13);

    // copy depth
    _mm256_maskstore_ps(output_depth_buffer, _mm256_castps_si256(ymm2), ymm0);

    input_depth_buffer += vec_step;
    output_depth_buffer += vec_step;

    input_buffer += loop_step;
    output_buffer += loop_step;
}

It is slightly more complex than the 32-bit version and took me a while to figure out; maybe there is a simpler variant, but this one works and doesn't add that much.

The added complexity comes from the lack of an 8-bit mask store and from the fact that the depth mask holds 8 x 32-bit values: vpmaskmovd cannot be used, because the pixel data is 24-bit while the mask elements are 32-bit.

The missing mask store was worked around easily, at the cost of a few more instructions, with vpblendvb, which blends at byte level; this also required two extra vmovdqu instructions, one to load the destination bytes for vpblendvb and one to store the blended result.

vpshufb issue

My first attempt at solving the depth mask type issue was to use the vpshufb instruction to reorder the depth mask into 8 x 24-bit values, with the remaining bytes zeroed (-1 in the shuffle mask). This produced plenty of visual artifacts: vpshufb shuffles within 128-bit lanes, so part of the result was wrong and this instruction could not be used on its own.
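
Here is a small standalone snippet (illustration only) showing the lane limitation: asking vpshufb for byte 16 from the low lane silently wraps back to byte 0 instead of reaching into the upper lane:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    unsigned char src[32], idx[32] = { 16 }; // ask for byte 16; remaining indices are 0
    for (int i = 0; i < 32; i++) src[i] = (unsigned char)i;

    __m256i v = _mm256_loadu_si256((const __m256i_u *)src);
    __m256i q = _mm256_loadu_si256((const __m256i_u *)idx);
    __m256i r = _mm256_shuffle_epi8(v, q); // vpshufb: indices stay inside each 128-bit lane

    unsigned char out[32];
    _mm256_storeu_si256((__m256i_u *)out, r);
    printf("byte 0 = %d\n", out[0]); // prints 0, not 16: index 16 wrapped to 0
    return 0;
}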

The solution was to spend a couple more instructions to get around the 128-bit lane limitation, which is what the avxShuffle function does (it replaced vpshufb). This function comes from the 24puzzle repository, which I found through a Stack Overflow post about the bpp conversion issue.
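
As a sanity check, here is a small test (illustration only; it assumes the avxShuffle function from above is in scope) showing the depth mask being expanded into byte triples:

#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    // simulated vcmpps result: pixels 1, 4 and 6 pass the depth test
    __m256i depth_mask = _mm256_setr_epi32(0, -1, 0, 0, -1, 0, -1, 0);
    const __m256i mask_24bpp = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 28, 28, 28, 24, 24, 24, 20, 20, 20, 16, 16, 16, 12, 12, 12, 8, 8, 8, 4, 4, 4, 0, 0, 0);
    __m256i byte_mask = avxShuffle(depth_mask, mask_24bpp);

    unsigned char out[32];
    _mm256_storeu_si256((__m256i_u *)out, byte_mask);
    for (int i = 0; i < 32; i++)
        printf("%02X ", out[i]); // FF FF FF at byte triples 1, 4 and 6; 00 elsewhere
    printf("\n");
    return 0;
}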

Final C code (3 bpp and 4 bpp + arbitrary buffer lengths)


Note: the depth buffer contains float data; the AVX2 instructions must be adapted for other types of data.

The code above is restricted to buffer sizes divisible by 24 or 32 depending on the bpp, but this version handles arbitrary buffer lengths by finishing the remainder with a scalar tail loop. For example, with bpp = 3 and a 102-byte buffer, loop_step = 24, so the vector loop covers 96 bytes and the scalar tail copies the last 6 bytes (2 pixels).

#if (defined(__x86_64__) || defined(__i386__))
    const int vec_step = 8; // 8 x 32-bit floats per 256-bit register
    const int loop_step = bpp * vec_step;
    const int loop_rest = buffer_size % loop_step;
    const int loop_len = buffer_size - loop_rest;
    // expands each 32-bit mask lane into 3 identical mask bytes; -1 zeroes the unused tail
    const __m256i mask_24bpp = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 28, 28, 28, 24, 24, 24, 20, 20, 20, 16, 16, 16, 12, 12, 12, 8, 8, 8, 4, 4, 4, 0, 0, 0);
    for (int i = 0; i < loop_len; i += loop_step) {
        // compare z value
        __m256 ymm0 = _mm256_loadu_ps((const float *)input_depth_buffer);
        __m256 ymm1 = _mm256_loadu_ps((const float *)output_depth_buffer);
        __m256 ymm2 = _mm256_cmp_ps(ymm0, ymm1, _CMP_GT_OS);
#if RY_BPP == 4
        // copy render buffer following mask value
        __m256i ymm3 = _mm256_loadu_si256((const __m256i_u *)input_buffer);
        _mm256_maskstore_epi32((int *)output_buffer, _mm256_castps_si256(ymm2), ymm3);
#else
        __m256i ymm11 = avxShuffle(_mm256_castps_si256(ymm2), mask_24bpp);

        // copy render buffer following mask value
        __m256i ymm3 = _mm256_loadu_si256((const __m256i_u *)input_buffer);
        __m256i ymm12 = _mm256_loadu_si256((const __m256i_u *)output_buffer);
        __m256i ymm13 = _mm256_blendv_epi8(ymm12, ymm3, ymm11);
        _mm256_storeu_si256((__m256i_u *)output_buffer, ymm13);
#endif
        // copy depth
        _mm256_maskstore_ps(output_depth_buffer, _mm256_castps_si256(ymm2), ymm0);

        input_depth_buffer += vec_step;
        output_depth_buffer += vec_step;

        input_buffer += loop_step;
        output_buffer += loop_step;
    }

    // scalar tail: copies the rest in case the length is not divisible by loop_step (24 or 32)
    int di = 0;
    for (int i = 0; i < loop_rest; i += bpp) {
        float depth = input_depth_buffer[di];

        if (depth > output_depth_buffer[di]) {
            output_buffer[i+0] = input_buffer[i+0];
            output_buffer[i+1] = input_buffer[i+1];
            output_buffer[i+2] = input_buffer[i+2];
            if (bpp == 4) {
                output_buffer[i+3] = input_buffer[i+3];
            }

            output_depth_buffer[di] = depth;
        }

        di += 1;
    }
#endif
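
For context, here is a hypothetical sketch (all names are mine, not from the renderer) of how such a routine can be driven, compositing one render buffer per worker into the final frame; the portable scalar loop stands in for the AVX2 path:

#include <string.h>

typedef struct {
    unsigned char *buffer; // bpp bytes per pixel
    float *depth;          // one float per pixel
} render_layer;

// a pixel replaces the destination when its depth value is greater,
// matching the scalar loop at the top of the post
static void composite_layer(unsigned char *output_buffer, const unsigned char *input_buffer,
                            float *output_depth_buffer, const float *input_depth_buffer,
                            int len, int bpp) {
    int di = 0;
    for (int i = 0; i < len; i += bpp) {
        if (input_depth_buffer[di] > output_depth_buffer[di]) {
            memcpy(&output_buffer[i], &input_buffer[i], (size_t)bpp);
            output_depth_buffer[di] = input_depth_buffer[di];
        }
        di += 1;
    }
}

// usage sketch: composite every worker's layer into the frame
// for (int w = 0; w < worker_count; w++)
//     composite_layer(frame, layers[w].buffer, frame_depth, layers[w].depth, len, bpp);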

32-bit (RGBA) to 24-bit (RGB)


The 24 bpp mask above must be adapted when converting actual bitmap data: it replicates the first byte of each 32-bit element three times, which on RGBA pixels would amount to a grayscale-like reordering (R, R, R) instead of keeping R, G, B.

Here is the mask to be used with avxShuffle for bitmap data:

const __m256i mask_24bpp = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 30, 29, 28, 26, 25, 24, 22, 21, 20, 18, 17, 16, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);
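
Put together, a minimal conversion step could look like this (a sketch with assumed names, reusing the avxShuffle function from above):

// converts 8 RGBA pixels (32 bytes) into 24 RGB bytes; the shuffle result also
// writes 8 zero bytes past the useful data, so dst needs that much slack
static void rgba_to_rgb_8px(const unsigned char *src, unsigned char *dst) {
    const __m256i mask = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, 30, 29, 28, 26, 25, 24, 22, 21, 20, 18, 17, 16, 14, 13, 12, 10, 9, 8, 6, 5, 4, 2, 1, 0);
    __m256i px = _mm256_loadu_si256((const __m256i_u *)src);
    _mm256_storeu_si256((__m256i_u *)dst, avxShuffle(px, mask));
}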

Conclusion


This was a fun foray into AVX2 optimization. As far as I know there are not many resources available for this sort of conversion right now (only hints at what the avxShuffle function does, and perhaps code buried in some repositories). It works pretty well for both 4 bpp and 3 bpp and helped increase the frame rate of my software renderer.

7 CPU cores render this scene; 7 render buffers are composited (with depth buffer) in this screenshot
