Vectorized ZX Spectrum Screen Rendering
A vectorized method for rendering the ZX Spectrum screen on modern devices.
In the previous post we came up with a routine to render the ZX Spectrum screen in C#. Let's see if we can speed it up using some SIMD vector operations.
SIMD operations will come into play when converting a row. SIMD operations work on raw number types, not our PixelColour enumeration, so we will change some of the previous code to work with bytes by using MemoryMarshal to cast our PixelColour array to bytes and then pass that around:
[Pure]
public static PixelColour[] Convert(ReadOnlySpan<byte> screen)
{
    var attributes = screen[6144..];
    var result = new PixelColour[256 * 192];
    var bytes = MemoryMarshal.Cast<PixelColour, byte>(result.AsSpan());
    ConvertThird(bytes, 0, screen, attributes);
    screen = screen[2048..];
    attributes = attributes[256..];
    ConvertThird(bytes, 64 * 256, screen, attributes);
    screen = screen[2048..];
    attributes = attributes[256..];
    ConvertThird(bytes, 128 * 256, screen, attributes);
    return result;
}
private static void ConvertThird(Span<byte> result, nuint resultIndex, ReadOnlySpan<byte> screen, ReadOnlySpan<byte> attributes)
{
    for (var row = 0; row < 8; row++)
    {
        ConvertRow(result, resultIndex, screen, attributes);
        screen = screen[32..];
        attributes = attributes[32..];
        resultIndex += 8 * 256;
    }
}
private static void ConvertRow(Span<byte> result, nuint resultIndex, ReadOnlySpan<byte> screen, ReadOnlySpan<byte> attributes)...Firstly we'll want to get reference to the result byte array; this will be needed later to write the output:
ref var resultReference = ref MemoryMarshal.GetReference(result);A ZX Spectrum screen row has 32 attributes, each of a byte, which fits perfectly into a Vector256<byte>. We can create a vector with the row's attributes easily enough:
var attributeVector = Vector256.Create(attributes);Previously we used we used bit masks and shifts to extract the bright, ink and paper values from one attribute at a time. We can do the same with vectors and extract all the bright, ink and paper values for the row in one go. To get bright we used a mask to extract the bit and then shifted to bit three to give a value of 8 for bright, 0 for normal. Vectors have a Create overload that copies the same value to each item in the vector, this will give us our mask. We can then use the vector equivalents of & and >>:
var brightMask = Vector256.Create<byte>(0b01000000);
var brightOffsetVector = Vector256.ShiftRightLogical(Vector256.BitwiseAnd(attributeVector, brightMask), 3);Ink was extracted with a simple mask, paper was another mask and shift. We then combined them with the bright value. In vector terms:
var inkMask = Vector256.Create<byte>(0b00000111);
var inkVector = Vector256.Add(brightOffsetVector, Vector256.BitwiseAnd(attributeVector, inkMask));
var paperMask = Vector256.Create<byte>(0b00111000);
var paperVector = Vector256.Add(brightOffsetVector, Vector256.ShiftRightLogical(Vector256.BitwiseAnd(attributeVector, paperMask), 3));We have now isolated the attributes, next the pixels. Recall that pixels are stored in bytes, with a bit of 1 representing the ink colour and 0 the paper. Our result is a byte representing a colour, and we can fit 32 of those into a Vector256. This means we can process 32 pixels at a time, or 32 / 8 = 4 bytes of the ZX Spectrum layout at a time.
Firstly we need to extract each bit from each of those 4 bytes. We will need to define a vector of bit masks to do this:
var bitMaskVector = Vector256.Create(
    0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
    0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
    0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
    0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001);For each four byte section we will need to get the corresponding ink and paper attributes. We can do this using a vector shuffle operation. Shuffles take values from an input vector and place them in a different position in the output vector, based on a set of indices held in another vector. For the first section we will need to put attribute 0 in the first 8 positions, attribute 1 in the second 8 and so on. The section section will put attribute 4 in the first 8 positions, attribute 5 in the second 8 and so on. This gives us the following set of 8 shuffle vectors for the entire row:
ReadOnlySpan<Vector256<byte>> shuffleVectors =
[
    Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3),
    Vector256.Create((byte)4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7),
    Vector256.Create((byte)8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11),
    Vector256.Create((byte)12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15),
    Vector256.Create((byte)16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19),
    Vector256.Create((byte)20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23),
    Vector256.Create((byte)24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27),
    Vector256.Create((byte)28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31)
];We can then start to loop over the 8 sections after defining a screenIndex variable to track our position in the pixels bytes in the Spectrum screen. First step of the loop is to use our shuffle vectors to get the ink and paper attributes:
for (var section = 0; section < 8; section++)
{
    var shuffleVector = shuffleVectors[section];
    var inkForSection = Vector256.Shuffle(inkVector, shuffleVector);
    var paperForSection = Vector256.Shuffle(paperVector, shuffleVector);As before we process 8 lines of pixels as each attribute covers an 8x8 square. First step is to load these 4 bytes which we can do with Vector256.Create. This will load more bytes than 4, however we will discard the rest:
for (var line = 0; line < 8; line++)
{
    var pixels = Vector256.Create(screen[screenIndex..]);We now use a shuffle to make 8 copies of those 4 bytes so we can use our masks against them. Luckily we already have the shuffle vector we need in shuffleVectors:
pixels = Vector256.Shuffle(pixels, shuffleVectors[0]);Next we extract the bits using bitMaskVector and then compare it with the same vector; this will give all 1s if the extracted bit was a 1 and all 0s if it was a 0:
pixels = Vector256.BitwiseAnd(pixels, bitMaskVector);
pixels = Vector256.Equals(pixels, bitMaskVector);Lastly we use the ConditionalSelect method to pick either ink or paper from our attributes. This method takes two other vectors, and for each position takes its value from the first vector if the source has all 1s in that position, otherwise it takes a value from the second vector:
pixels = Vector256.ConditionalSelect(pixels, inkForSection, paperForSection);We now have our pixels in a vector, we just need to write them to the output using the reference to the result we defined at the start:
pixels.StoreUnsafe(ref resultReference, resultIndex);All that remains is some juggling to get our indices into the correct positions for the next loops.
How much faster is it though? Benchmark.NET can help us there:
| Method     | Mean     | Error    | StdDev   | Ratio |
|----------- |---------:|---------:|---------:|------:|
| Fallback   | 63.54 us | 0.218 us | 0.193 us |  1.00 |
| Vectorized | 56.05 us | 0.055 us | 0.046 us |  0.88 |Not an amazing improvement, given the increase in complexity, but it is an improvement. There are probably some tweaks we could do to grind out another few microseconds but it's good enough for me.
For reference the full method looks something like this:
protected override void ConvertRow(Span<byte> result, nuint resultIndex, ReadOnlySpan<byte> screen, ReadOnlySpan<byte> attributes)
{
    ref var resultReference = ref MemoryMarshal.GetReference(result);
    // Copy all 32 attributes for the row into a vector.
    var attributeVector = Vector256.Create(attributes);
    // Extract the bright attribute and then shift right by 3 to get the number 8 for bright, 0 otherwise.
    var brightMask = Vector256.Create<byte>(0b01000000);
    var brightOffsetVector = Vector256.ShiftRightLogical(Vector256.BitwiseAnd(attributeVector, brightMask), 3);
    // Extract the ink value then add the bright to get a PixelColour value.
    var inkMask = Vector256.Create<byte>(0b00000111);
    var inkVector = Vector256.Add(brightOffsetVector, Vector256.BitwiseAnd(attributeVector, inkMask));
    // Extract the paper value then add the bright to get a PixelColour value.
    var paperMask = Vector256.Create<byte>(0b00111000);
    var paperVector = Vector256.Add(brightOffsetVector, Vector256.ShiftRightLogical(Vector256.BitwiseAnd(attributeVector, paperMask), 3));
    // The bit mask vector. This is used to take four bytes and then extract each bit from each byte.
    var bitMaskVector = Vector256.Create(
        0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
        0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
        0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001,
        0b10000000, 0b01000000, 0b00100000, 0b00010000, 0b00001000, 0b00000100, 0b00000010, 0b00000001);
    // Shuffle vectors. These are used to take the eight bits of four bytes and map them to four ink/paper values. The first vector maps
    // the first four bytes in a column, the second the next four bytes and so on.
    ReadOnlySpan<Vector256<byte>> shuffleVectors =
    [
        Vector256.Create((byte)0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3),
        Vector256.Create((byte)4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7),
        Vector256.Create((byte)8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11),
        Vector256.Create((byte)12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15),
        Vector256.Create((byte)16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 19),
        Vector256.Create((byte)20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 23),
        Vector256.Create((byte)24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27, 27, 27, 27),
        Vector256.Create((byte)28, 28, 28, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 29, 29, 30, 30, 30, 30, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31)
    ];
    // Get the index in the screen to read from.
    var screenIndex = 0;
    // Loop over the 8 sections of 4 bytes in this row.
    for (var section = 0; section < 8; section++)
    {
        var shuffleVector = shuffleVectors[section];
        var inkForSection = Vector256.Shuffle(inkVector, shuffleVector);
        var paperForSection = Vector256.Shuffle(paperVector, shuffleVector);
        for (var line = 0; line < 8; line++)
        {
            // Take the first 4 bytes. This will actually load 32 bytes, but we will discard the ones we don't care about.
            // We don't trim the length of screen (it still has attributes at the end) so we don't need to worry about
            // going out of bounds.
            var pixels = Vector256.Create(screen[screenIndex..]);
            // Make 8 copies of each of the 4 bytes we care about.
            pixels = Vector256.Shuffle(pixels, shuffleVectors[0]);
            // Extract each bit from the 4 bytes.
            pixels = Vector256.BitwiseAnd(pixels, bitMaskVector);
            // Compare each position to the bit mask vector; this will give all ones if the original bit was a 1, and all zeroes if it was a 0.
            pixels = Vector256.Equals(pixels, bitMaskVector);
            // Select either ink or paper; all ones will take ink, all zeros will take paper.
            pixels = Vector256.ConditionalSelect(pixels, inkForSection, paperForSection);
            // Write to the result.
            pixels.StoreUnsafe(ref resultReference, resultIndex);
            // Move our indices by one row; this is 256 for screen due to the Spectrum's funky layout, and 8 * 32 pixels for the result.
            screenIndex += 256;
            resultIndex += 8 * 32;
        }
        // Move our indices to the next section. For the screen we need to move along by 4 bytes from our original position. For the result
        // we need to move 32 bytes from our original position.
        screenIndex = screenIndex - 256 * 8 + 4;
        resultIndex = resultIndex - 256 * 8 + 32;
    }
}