MrKWatkins

Evaluating Poker Hands

Kevin Watkins — Sat, 18 Jun 2022 14:56:02 GMT

Using our set of cards and our bit representations we can now evaluate poker hands using excessive bit twiddling. The general approach is to shuffle the bits for each suit around and combine them with various logical operations to work out what hand we have. We will work with bit masks where the ace is in the high position (bit 13 as opposed to bit 0) because in the vast majority of hands we need the ace to be high; only straight/straight flushes of ace to five need the ace to be low.

The first step is to horizontally reduce our bit mask of cards with OR. This will give us all the unique ranks in our hand. From this we can narrow the potential hands down. For example, if there are two unique ranks then we either have four-of-a-kind (four of one rank plus a high card) or a full house (two of one rank and three of another). We check the counts in order of how common they are to improve performance, e.g. we check for five unique ranks first as this corresponds to a straight, a flush or a high card, which is the majority of hands.
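As a rough sketch of that first step, assuming the clubs > diamonds > hearts > spades layout with 16 bits per suit. The helper names HorizontalOr16 and Card are mine for illustration and not necessarily what the library uses:

```csharp
using System;
using System.Numerics;

// Hypothetical helper: OR the four 16 bit suit sections together into the low 16 bits.
static ulong HorizontalOr16(ulong value)
{
    value |= value >> 32;
    return (value | (value >> 16)) & 0xFFFF;
}

// Hypothetical helper: rankBit 1 = two ... 12 = king, 13 = high ace; suit 0 = spades ... 3 = clubs.
static ulong Card(int rankBit, int suit) => 1UL << (rankBit + 16 * suit);

// Kings full of eights.
ulong hand = Card(12, 0) | Card(12, 1) | Card(12, 2) | Card(7, 0) | Card(7, 3);

ulong ranks = HorizontalOr16(hand);
int uniqueRanks = BitOperations.PopCount(ranks);

Console.WriteLine(uniqueRanks); // 2, so four-of-a-kind or a full house
```

The pop count of the reduction is what decides which group of hands to look at next.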

Two Unique Ranks - Four-of-a-Kind or a Full House

To determine which hand we have we can horizontally reduce our bit mask of cards with AND. As four-of-a-kind has the same rank in every suit this will leave us with the rank of the four. A full house will be zero because it doesn't have the same rank in each suit.

Four-of-a-Kind

We now have our OR reduction with bits for both ranks and our AND reduction with a bit for the four. Combine these with a XOR and we will be left with a bit for the rank of the high card which is all we need to return our PokerHand.

Full House

We have our OR reduction with a bit for the two cards and a bit for the three. If we perform a horizontal XOR reduction of the original hand then this will give us a bit for the three; combining two cards of the same rank will give 0, so then combining with the third will leave a bit for the three cards. We can then combine the OR and XOR reductions in a similar way to four-of-a-kind to isolate the rank of the two cards and we're done.
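Both two-rank cases can be sketched together. Reduce and Card are hypothetical helpers, and the reductions are written the long way (pulling each suit section out first) rather than with shifts, purely for readability:

```csharp
using System;

// Hypothetical helper: OR, AND and XOR reductions of the four 16 bit suit sections.
static (ulong Or, ulong And, ulong Xor) Reduce(ulong hand)
{
    ulong s = hand & 0xFFFF, h = (hand >> 16) & 0xFFFF, d = (hand >> 32) & 0xFFFF, c = hand >> 48;
    return (s | h | d | c, s & h & d & c, s ^ h ^ d ^ c);
}

// Hypothetical helper: rankBit 1 = two ... 12 = king, 13 = high ace; suit 0 = spades ... 3 = clubs.
static ulong Card(int rankBit, int suit) => 1UL << (rankBit + 16 * suit);

// Four kings with an eight.
var quads = Reduce(Card(12, 0) | Card(12, 1) | Card(12, 2) | Card(12, 3) | Card(7, 0));
ulong fourRank = quads.And;              // non-zero, so four-of-a-kind
ulong highCard = quads.Or ^ quads.And;   // what is left is the other card

// Kings full of eights.
var full = Reduce(Card(12, 0) | Card(12, 1) | Card(12, 2) | Card(7, 0) | Card(7, 3));
ulong threeRank = full.Xor;              // full.And is zero, so a full house
ulong pairRank = full.Or ^ full.Xor;

Console.WriteLine(fourRank == 1UL << 12 && highCard == 1UL << 7);                   // True
Console.WriteLine(full.And == 0 && threeRank == 1UL << 12 && pairRank == 1UL << 7); // True
```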

Three Unique Ranks - Three-of-a-Kind or Two Pair

A horizontal XOR reduction will come in handy again here. We will be left with a 1 when there is an odd number of the same rank and a 0 when there is an even number. For three-of-a-kind we have three of one rank and then two other ranks, which will leave us with three set bits in total. A two pair has two of the same rank twice plus the single high card giving us one set bit in total. So performing the XOR reduction and counting the bits tells us which hand we have.

Three-of-a-Kind

Isolating the three from the two single cards is a little tricky; a XOR won't help us here because all the ranks are an odd number. There are four possible combinations of the three suits, spades/hearts/diamonds, spades/hearts/clubs, spades/diamonds/clubs and hearts/diamonds/clubs. All of these have either spades and diamonds or hearts and clubs. As our hand masks are laid out clubs > diamonds > hearts > spades we can shift our hand mask two suits (32 bits) to the right and combine with the original hand mask via an AND.

So what do we have now? We either have a bit corresponding to the three card rank in bits 0 -> 15 (spades/diamonds) or in bits 16 -> 31 (hearts/clubs). If we shift this right 16 bits and OR it with the unshifted value we will then have the three card rank in bits 0 -> 15. We might however still have a bit in 16 -> 31 so we will need to clear that out with a suitable AND mask. Now we've isolated the three we can XOR it with the original OR reduction to get the ranks of the high cards.
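Those shifts can be sketched as below. IsolateThree and Card are hypothetical names, and the hand is assumed to already be known to be three-of-a-kind:

```csharp
using System;

// Hypothetical helper: rankBit 1 = two ... 12 = king, 13 = high ace; suit 0 = spades ... 3 = clubs.
static ulong Card(int rankBit, int suit) => 1UL << (rankBit + 16 * suit);

static ulong IsolateThree(ulong hand)
{
    // Line diamonds up with spades and clubs up with hearts, keeping ranks present in both.
    ulong paired = hand & (hand >> 32);

    // Fold the hearts/clubs result down onto the spades/diamonds one, then clear the leftover upper bits.
    return (paired | (paired >> 16)) & 0xFFFF;
}

// Three nines (hearts, diamonds, clubs) with a king and a queen.
ulong hand = Card(8, 1) | Card(8, 2) | Card(8, 3) | Card(12, 0) | Card(11, 2);

ulong ranks = (hand | (hand >> 16) | (hand >> 32) | (hand >> 48)) & 0xFFFF;
ulong three = IsolateThree(hand);
ulong highCards = ranks ^ three;

Console.WriteLine(three == 1UL << 8);                          // True
Console.WriteLine(highCards == ((1UL << 12) | (1UL << 11)));   // True
```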

Two Pair

This one is fairly straightforward as our horizontal XOR reduction contains just the high card rank so we can just XOR it with our OR reduction to get the ranks of the two pairs.
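A minimal sketch of the XOR count check and the two pair extraction, again with hypothetical helper names and the reductions written out in full:

```csharp
using System;
using System.Numerics;

static ulong HorizontalOr16(ulong v) => (v | (v >> 16) | (v >> 32) | (v >> 48)) & 0xFFFF;
static ulong HorizontalXor16(ulong v) => (v ^ (v >> 16) ^ (v >> 32) ^ (v >> 48)) & 0xFFFF;

// Hypothetical helper: rankBit 1 = two ... 12 = king, 13 = high ace; suit 0 = spades ... 3 = clubs.
static ulong Card(int rankBit, int suit) => 1UL << (rankBit + 16 * suit);

// Queens and jacks with a king.
ulong hand = Card(11, 0) | Card(11, 1) | Card(10, 2) | Card(10, 3) | Card(12, 0);

ulong odd = HorizontalXor16(hand);                    // ranks that appear an odd number of times
bool isTwoPair = BitOperations.PopCount(odd) == 1;    // three set bits would mean three-of-a-kind
ulong pairs = HorizontalOr16(hand) ^ odd;

Console.WriteLine(isTwoPair);                              // True
Console.WriteLine(pairs == ((1UL << 11) | (1UL << 10)));   // True
```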

Four Unique Ranks - Pair

Only one possibility for four unique ranks; a pair. One bit for the pair and one bit for each of the three other cards. Using a horizontal XOR reduction again will isolate the ranks of the three other cards, and we can then XOR this with our horizontal OR reduction to get the rank of the pair.

Five Unique Ranks - Flush, Straight or High Card

We first start with a special case. As we are treating aces as high, a horizontal OR reduction of 10000000011110 means we have a straight or a straight flush from ace to five. To differentiate between the two hands we need to determine if all the cards are in a single suit or not. This can be done by isolating each 16 bit section via a suitable AND mask and seeing if the value is the same as the hand mask overall, i.e.

    private static bool IsSingleSuit(ulong handMask) =>
        (handMask & Bits0To15) == handMask |
        (handMask & Bits16To31) == handMask |
        (handMask & Bits32To47) == handMask |
        (handMask & Bits48To63) == handMask;

Note that we're not using the short circuit || operator here! It turns out that doing that slightly reduces performance as the overhead introduced by the branching is greater than the cost of just performing all four comparisons.

Straight

We need to determine if all five bits are sequential. An easy way to do this is to take our horizontal OR reduction, shift it one bit to the right and AND it with the original value. This will give us a 1 bit everywhere that there is a run of two cards, with the bit being the lower of the two cards. We don't have to worry about bit zero being shifted off to the right as that would only matter for an ace to five straight and we've already handled that with our special case. For a run of five we must have four sequential and overlapping runs of two. Therefore if our number of 1 bits is four we have a straight; if they weren't sequential and overlapping we would get a smaller number.
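As a small sketch of that check. IsStraight is a hypothetical name, and it assumes it is only ever given the OR reduction of a hand with five unique ranks:

```csharp
using System;
using System.Numerics;

// Assumes ranks is the OR reduction of a hand with five unique ranks, ace high at bit 13.
static bool IsStraight(ulong ranks) => BitOperations.PopCount(ranks & (ranks >> 1)) == 4;

Console.WriteLine(IsStraight(0b11111000000000)); // ten to ace: True
Console.WriteLine(IsStraight(0b00110010011000)); // Q J 8 5 4: False
Console.WriteLine(IsStraight(0b10000000011110)); // ace to five: False, hence the special case
```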

What if all the cards are the same suit? If so we have a straight flush instead of a straight. We can use the same method as for the special case above to determine this. One last check is then needed for a straight flush - if it includes an ace (which is easy to check with an AND of the ace high rank) then we have a royal flush.

Flush or High Card

We now either have a flush or a high card. If it's a flush then they must all be the same suit, so we can use the same single suit check from above to determine which hand we have.

Performance

So how quick is it? I wanted it to be as fast as possible as that enables us to use brute force to evaluate lots of hands to calculate things like probability rather than having to perform the maths. Using BenchmarkDotNet to test evaluating every one of the 2,598,960 possible five card hands using a single thread on my 3.20GHz Intel Core i7-8700 gives:

|               Method |     Mean |   Error |  StdDev |
|--------------------- |---------:|--------:|--------:|
| EvaluateFiveCardHand | 193.3 ms | 1.36 ms | 1.20 ms |

So under 200ms for all possible five card hands, or roughly 74ns per hand. I'll take that!

Find the source code at https://github.com/MrKWatkins/cards-cs.

More Bit Representations

Kevin Watkins — Sat, 05 Mar 2022 14:39:01 GMT

Card Bit Mask

We already have one representation of cards using bit indices but there is another representation we could use that would be easier to work with when evaluating poker hands. A long can be split into 4 × 16 bit sections, each of which can represent a suit. 16 bits is enough to represent each of the 13 ranks with a few bits left over. Poker can treat the ace as both high and low depending on the other cards so we can use the 14th bit to represent a high ace. This gives us the following layout:

|    Clubs     ||   Diamonds   ||    Hearts    ||    Spades    |
--AKQJT98765432A--AKQJT98765432A--AKQJT98765432A--AKQJT98765432A

This layout is designed to make horizontal reductions easier. What do I mean by a horizontal reduction? I'm taking inspiration from the various horizontal instructions CPUs have for the naming, such as _mm_hadd_pi16 which adds together 16 bit sections from a 64 bit number. In our case we are going to be combining the 16 bit sections with bitwise logical operators. So reducing using AND would look something like:

public static ulong HorizontalAnd16(this ulong value)
{
    value &= value >> 32;

    return value & (value >> 16);
}

What does this give us? Four of a kind requires us to have a card of the same rank in each suit. If we therefore combine the sections with AND then we will be left with one bit set which will be the rank of the four of a kind. Other logical operators will let us extract other information about the hand during evaluation.

We could of course do exactly the same using 14 bits for each suit and 8 high bits unused. However 16 bits is a short which might come in handy, it's more symmetrical, the shifts are all multiples of 16 and I've spent a lot of my years programming 8 and 16 bit machines so it just feels natural...

Creating a bit mask from our card is simple; shift 1 left by the value of rank to get the rank bit mask, then shift that by 16 × value of the suit. How do we go the other way? We could check each section in turn for a set bit and extract that. This would involve if conditions and branching though; a quicker way would be just to extract the sections, shift them into the correct place and then use the fast lookup by bit index we already have. Ignoring high aces that would look something like this:
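The card to bit mask direction is only a line or so. This sketch assumes rank runs from 0 (ace) to 12 (king) and that the suit values are 0 = spades, 1 = hearts, 2 = diamonds, 3 = clubs to match the layout above; ToBitMask is a hypothetical name:

```csharp
using System;

// Assumes rank 0 = ace ... 12 = king and suit 0 = spades, 1 = hearts, 2 = diamonds, 3 = clubs.
static ulong ToBitMask(int rank, int suit) => (1UL << rank) << (16 * suit);

ulong queenOfHearts = ToBitMask(11, 1);

Console.WriteLine(queenOfHearts == 1UL << 27); // True, bit 11 of the hearts section
```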

public static ulong BitMaskToBitIndex(ulong bitMask)
{
    // Bit indices: Spades = 0 -> 12, Hearts = 13 -> 25, Diamonds = 26 -> 38, Clubs = 39 -> 51.
    // Bit masks: Spades = 0 -> 12, Hearts = 16 -> 28, Diamonds = 32 -> 44, Clubs = 48 -> 60.
    var spades = bitMask & 0x000000000000FFFF;
    var hearts = (bitMask & 0x00000000FFFF0000) >> 3;
    var diamonds = (bitMask & 0x0000FFFF00000000) >> 6;
    var clubs = (bitMask & 0xFFFF000000000000) >> 9;
    return spades | hearts | diamonds | clubs;
}

This is an example of an extract operation, sometimes called a compress (once again see the excellent Hacker's Delight for an in-depth discussion). We extract bits from certain positions in the source value and ignore the rest. We then compress those bits into the low end of the result. Modern CPUs have a built in instruction to do this for us, _pext_u64. This takes a bit mask for the bits we want to extract and performs the extraction in a single operation. We can use .NET hardware intrinsics to do this:

internal static ulong BitMaskToBitIndex(ulong bitMask)
{
    const ulong mask = 0b0001111111111111_0001111111111111_0001111111111111_0001111111111111;
    return Bmi2.X64.ParallelBitExtract(bitMask, mask);
}

Comparing these two methods along with an implementation of the branching mentioned above using BenchmarkDotNet we can see no branching is better than branching and intrinsics are even better still:

|      Method |     Mean |   Error |  StdDev | Ratio |
|------------ |---------:|--------:|--------:|------:|
|   Branching | 378.3 ns | 3.13 ns | 2.93 ns |  1.00 |
| NoBranching | 352.4 ns | 0.70 ns | 0.65 ns |  0.93 |
|   Intrinsic | 328.1 ns | 1.53 ns | 1.44 ns |  0.87 |

Poker Hand Representation

When we run our poker evaluation it will work out what type of hand we have (four of a kind, straight, etc.) and give us some information about the ranks involved, e.g. if we have two four of a kinds we also need to know the rank of the four cards so we can state that four aces beats four kings. If we twiddle our bits in the right way we can reduce this information to a single number where a higher number means a better hand.

Firstly we will represent the hand type as an enum. I am going to treat a royal flush as its own type rather than treating it as a straight flush to ace; if you're ever lucky enough to get a royal flush I think it's only right you're treated specially. I am also going to include five of a kind in case I ever get around to implementing wild cards. This gives a total of 11 hand types which fits nicely into 4 bits. We will use these as the most significant bits in our number so that the hand with the highest numbered type is always greater than a lower numbered type.

Next we need the rank information. This will be used to differentiate between two hands of the same type. For some hands we only need a single rank. For example with four of a kind you only need to know about the rank of the four cards to work out who wins against another four of a kind; the rank of the other card is irrelevant. For two high card hands you need to know the rank of every card, because if the highest cards are the same the winner is the one with the next highest card; if those are the same too it is the third highest, and so on.

We cannot however just use a single bit mask of ranks. This would work for some hands such as high card; the numbers represented by the five bits will be in the correct order to give a higher number for the winning hand, e.g.:

Loser:  Q J 8 5 4 -> 00110010011000 -> 3,224
Winner: K Q 5 4 3 -> 01100000011100 -> 6,172

Loser:  Q J 8 5 4 -> 00110010011000 -> 3,224
Winner: Q J 9 8 4 -> 00110110001000 -> 3,464

But for other hands such as two pair, full house or three of a kind this falls down because the pair/triple should take precedence over the other cards, e.g.:

Loser:  K K 8 8 8 -> 01000010000000 -> 4,224
Winner: K K K 8 8 -> 01000010000000 -> 4,224

Loser:  9 9 9 K Q -> 01100100000000 -> 6,400
Winner: J J J 9 8 -> 00010110000000 -> 1,408

Loser:  Q Q J J K -> 01110000000000 -> 7,168
Winner: K K 3 3 2 -> 01000000000110 -> 4,102

We will therefore have two rank bit masks, primary and secondary. For our full house the primary will be the three, the secondary the two. For three of a kind the primary is the three, the secondary the other cards. And for a two pair the primary is the pairs, the secondary the other cards. Using the primary as the high bits gives us numbers in the correct order:

Loser:  K K 8 8 8 -> 00000010000000 01000000000000 ->  2,101,248
Winner: K K K 8 8 -> 01000000000000 00000010000000 -> 67,108,992

Loser:  9 9 9 K Q -> 00000100000000 01100000000000 ->  4,200,448
Winner: J J J 9 8 -> 00010000000000 00000110000000 -> 16,777,600

Loser:  Q Q J J K -> 00110000000000 01000000000000 -> 50,335,744
Winner: K K 3 3 2 -> 01000000000100 00000000000010 -> 67,174,402

There are 14 bits needed for ranks if we have both high and low ace. The secondary rank will only ever have ace high so we can represent that as 13 bits. Although the primary rank does need low aces for ace to five straights/straight flushes we could represent straights just by the high card and the comparisons would still work as expected. I have decided to use all five cards for the ranks in straights however as to extract the high card (as we will see in a later post) would require extra operations.

So in total we have 4 + 14 + 13 = 31 bits, which handily fits into an int without using the most significant bit, so we will never have a negative number to worry about (negative numbers in two's complement have the most significant bit set). This satisfies our goal of representing hands by a single number where a higher number means a better hand.
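One way the packing could look, as a sketch only. The Pack helper, the hand type numbers and the exact bit positions are my assumptions for illustration; the post only fixes the sizes (4, 14 and 13 bits) and the order of significance:

```csharp
using System;

// Hypothetical layout: hand type in bits 27-30, primary ranks in bits 13-26, secondary ranks in bits 0-12.
static int Pack(int handType, int primaryRanks, int secondaryRanks) =>
    (handType << 27) | (primaryRanks << 13) | secondaryRanks;

const int Flush = 5;       // hypothetical numbering; only the ordering matters
const int FullHouse = 6;

// Primary masks use bit 12 for a king and bit 7 for an eight.
// Secondary masks are assumed to drop the low ace bit, so each rank moves down by one.
int kingsFullOfEights = Pack(FullHouse, 1 << 12, 1 << 6);
int eightsFullOfKings = Pack(FullHouse, 1 << 7, 1 << 11);
int aceHighFlush = Pack(Flush, 0b11110100000000, 0);

Console.WriteLine(kingsFullOfEights > eightsFullOfKings); // True
Console.WriteLine(eightsFullOfKings > aceHighFlush);      // True
```

Because the type sits in the top bits, any full house outranks any flush before the rank bits are even looked at.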

We now have all the tools we need to start evaluating hands in the next post.

Card Combinations

Kevin Watkins — Sat, 19 Feb 2022 16:20:42 GMT

Before we can start evaluating five card poker hands we actually need to get some five card poker hands. So we need some code to produce five card hand combinations from a full deck of cards.

Combinations can be built recursively. Start with an empty combination and add cards from your full deck until you have reached the desired size. That's the first combination. You can then remove the last card in the combination and replace it with the other remaining cards in the deck for the next set of combinations. Once you've tried every card in the last place in the hand, backtrack to the last place but one and replace that card with the next one, then proceed as before. Keep backtracking and trying each card until you're done. Make sense? Probably not as it turns out it's quite hard to explain recursive algorithms. So let's try an example - build all combinations of 2 letters from the set of 4 letters A, B, C and D:

Start at index 0 and add the first letter:
[A
Move on to index 1 and try each remaining letter in turn:
[A, B] [A, C] [A, D]
Backtrack to index 0 and try the next letter:
[B
Carry on as before trying the remaining letters:
[B, C] [B, D]
Backtrack again:
[C
And try the remaining letter:
[C, D]
Backtrack again:
[D
No further letters to try; we're done.

Note that we only use letters after the one we've just added, i.e. when we were on [B we did not then proceed to add A to give [B, A] - that would give us permutations, not combinations.

The C# code for the above would look something like this:

public static IEnumerable<IImmutableSet<Card>> Combinations(IReadOnlyList<Card> source, int combinationSize) =>
    Combinations(source, combinationSize, ImmutableCardSet.Empty, 0);

private static IEnumerable<IImmutableSet<Card>> Combinations(IReadOnlyList<Card> source, int combinationSize, IImmutableSet<Card> currentCombination, int startSourceIndex)
{
    // Reached the desired size; return the combination.
    if (combinationSize == currentCombination.Count)
    {
        yield return currentCombination;
        yield break;
    }

    // Start from the current index in our source. We will add each card in turn from that index onwards to the combination.
    for (var f = startSourceIndex; f < source.Count; f++)
    {
        // Add the card and proceed recursively from the next index to fill up the combination.
        foreach (var result in Combinations(source, combinationSize, currentCombination.Add(source[f]), f + 1))
        {
            yield return result;
        }
    }
}

Whilst that works there is an overhead from recursion. Every recursive call needs to push a stack frame onto the call stack, make the method call and restore the stack afterwards. To avoid this overhead we can make the algorithm iterative by managing the stack aspect ourselves. This should be faster and use less memory, at the cost of more complicated code. Our recursive algorithm has two parameters we need to store on our stack, currentCombination and startSourceIndex, giving us an iterative version that looks something like this:

public static IEnumerable<ImmutableCardSet> Iterative_Stack(IReadOnlyList<Card> source, int combinationSize)
{
    // Stack of the current combination and the next index to start with for that combination.
    var stack = new Stack<(ImmutableCardSet CurrentCombination, int StartSourceIndex)>(combinationSize);
    stack.Push((ImmutableCardSet.Empty, 0));

    // Whilst the stack isn't empty we have more cards to process.
    while (stack.Count > 0)
    {
        // Pop the stack to get the next index to process.
        var (combination, startSourceIndex) = stack.Pop();

        // Loop over the remaining cards to add them to the combination.
        while (startSourceIndex < source.Count)
        {
            // Add the next card.
            var toAdd = source[startSourceIndex];
            startSourceIndex++;
            
            // Push the combination and the position of the next card to add onto the stack so we come back to them later.
            stack.Push((combination, startSourceIndex));

            combination = combination.Add(toAdd);
            
            // If we've reached the desired size return the combination.
            if (stack.Count == combinationSize) 
            {
                yield return combination;
                break;
            }
        }
    }
}

So a lot more complex but hopefully quicker. We can test with BenchmarkDotNet to be sure by generating all 2,598,960 distinct five card hands for each method and comparing:

|    Method |      Mean |    Error |   StdDev | Ratio |      Gen 0 | Allocated |
|---------- |----------:|---------:|---------:|------:|-----------:|----------:|
| Recursive | 218.83 ms | 1.070 ms | 0.948 ms |  1.00 | 55333.3333 |    331 MB |
| Iterative |  51.72 ms | 0.316 ms | 0.280 ms |  0.24 |  9900.0000 |     59 MB |

Yup quite a bit quicker! It also allocates a lot less memory, which is nice. Can we grind out a bit more performance still? There are a few things that might slow the above down:

.NET's Stack class has a few overheads. For example it tracks a version number so it can give you the dreaded "Collection was modified; enumeration operation may not execute." exception. It also needs to check if a resize is required, but in our case it never is as we have a fixed size. We could therefore replace it with a simple array and an integer to keep track of our position in the array.
We are storing tuples in our stack, and there will be a small overhead with creating them and dereferencing the items. Instead we could store two arrays, one for the cards and one for the indices.
We create card sets for incomplete combinations. We could work with the underlying bit indices instead to avoid this overhead, along with any other overheads the sets might add.

Each of these incremental improvements gives us another speed boost:

|                         Method |      Mean |    Error |   StdDev | Ratio | RatioSD |
|------------------------------- |----------:|---------:|---------:|------:|--------:|
|                      Recursive | 218.25 ms | 0.984 ms | 0.920 ms |  4.18 |    0.03 |
|                      Iterative |  52.18 ms | 0.259 ms | 0.216 ms |  1.00 |    0.00 |
|                Iterative_Array |  45.12 ms | 0.540 ms | 0.505 ms |  0.87 |    0.01 |
|            Iterative_TwoArrays |  42.23 ms | 0.161 ms | 0.142 ms |  0.81 |    0.00 |
| Iterative_TwoArrays_BitIndices |  39.37 ms | 0.122 ms | 0.114 ms |  0.75 |    0.00 |
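To give a flavour of what an array-only version can look like, here is a sketch of a different but commonly used formulation: a single array of source indices that is advanced in place, rather than the separate card and index arrays described above. It is not the code from the repository, and it assumes 0 < k <= n:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Yields every k-sized combination of the indices 0..n-1 in lexicographic order. Assumes 0 < k <= n.
static IEnumerable<int[]> Combinations(int n, int k)
{
    var indices = new int[k];
    for (var i = 0; i < k; i++) indices[i] = i;

    while (true)
    {
        // Copy so callers can hold on to the result while we keep mutating the array.
        yield return (int[])indices.Clone();

        // Find the rightmost position that can still move on to a later index.
        var position = k - 1;
        while (position >= 0 && indices[position] == n - k + position) position--;
        if (position < 0) yield break;

        // Advance it and reset everything to its right to the indices that follow.
        indices[position]++;
        for (var j = position + 1; j < k; j++) indices[j] = indices[j - 1] + 1;
    }
}

Console.WriteLine(string.Join(" ", Combinations(4, 2).Select(c => $"[{c[0]}, {c[1]}]")));
// [0, 1] [0, 2] [0, 3] [1, 2] [1, 3] [2, 3]
Console.WriteLine(Combinations(52, 5).Count()); // 2598960
```

The first line of output matches the A, B, C, D walk-through earlier with the letters replaced by their indices.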

The final code along with benchmarks can be found at https://github.com/MrKWatkins/cards-cs.

A Set of Cards

Kevin Watkins — Sun, 23 Jan 2022 18:46:54 GMT

Many years ago I worked at a company that used a C library for evaluating poker hands. Never really understood how it worked; it used a lot of hard to read bit twiddling. A few years back I decided to see if I could come up with something similar in C#. I got something working but never did anything with it. I've now decided to dust off the code, update it to .NET 6.0 taking advantage of some new features on the way, and then make a Rust port to practice my Rust.

Representing Cards in Code

Playing cards have two attributes, rank (ace, five, queen, etc.) and suit (spades, clubs, etc.). They're simple to represent by enums in C#. I've chosen rank to start from zero rather than use the rank as the number for the enum, i.e. ace = 0, two = 1, rather than ace = 1, two = 2, etc. Whilst a little unusual it means I don't have to worry about checking for a value of zero all over the place, which is especially important as I'm going to make the Card object a readonly struct and using default(Card) would give a value of 0 for rank.

Bit Indices

Another way to represent a card is via bits. There are 52 cards in a deck and most CPUs these days are 64 bits. By using a bit to represent each card we can fit all the cards into a 64 bit long. It's simple to convert to this format. Multiply the suit value by 13 (as there are 13 ranks) and add the rank to give an index. This index is the position of the card in a full ordered deck of cards. Then left shift the value 1 by the index to get what I'm calling the bit index of the card.

To convert back we can take the bit index and work out the count of trailing zeros to get the index. .NET has the BitOperations.TrailingZeroCount method to get this for us. Most processors these days have a built in trailing zero count instruction making it nice and fast, but if they don't then that method has software fallbacks. There are various ways to work out the trailing zero count; the excellent book Hacker's Delight covers several of them. Once we have the index we can get the suit with index / 13 and the rank with index % 13. However as I'm keeping a full deck cached in a static field I can just look up into that instead.
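A sketch of the round trip using plain ints for rank and suit. ToBitIndex and FromBitIndex are hypothetical names, and this version does the arithmetic rather than the cached deck lookup:

```csharp
using System;
using System.Numerics;

// Assumes rank 0 = ace ... 12 = king and suit values 0 ... 3.
static ulong ToBitIndex(int rank, int suit) => 1UL << (suit * 13 + rank);

static (int Rank, int Suit) FromBitIndex(ulong bitIndex)
{
    // The position of the single set bit is the card's index in a full ordered deck.
    int index = BitOperations.TrailingZeroCount(bitIndex);
    return (index % 13, index / 13);
}

ulong queen = ToBitIndex(11, 2);

Console.WriteLine(queen == 1UL << 37);   // True
Console.WriteLine(FromBitIndex(queen));  // (11, 2)
```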

Set Operations

Why bother with these bit indices? Well they make set operations easy. As each card is a single bit we can use a single long to represent a set of unique cards and we can use some bit twiddling for all the set operations we need in O(1). Some examples:

Operation	Code
Except	`x & ~y`
Intersect	`x & y`
IsSubsetOf	`(x & y) == x`
Overlaps	`(x & y) != 0UL`
SymmetricExcept	`x ^ y`

Working with a single card is no different from working with a set, so adding a card is just a union, removing is an except, etc.

The count of items in the set is just the number of set bits which can be obtained by the BitOperations.PopCount method. As for trailing zeros most modern processors have a dedicated instruction, but that method has software fallbacks if not. Again see Hacker's Delight for various methods.
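Putting a few of those together on two small sets, built here from raw deck indices purely for illustration:

```csharp
using System;
using System.Numerics;

// Hypothetical sets built from raw deck indices 0..51.
ulong x = (1UL << 0) | (1UL << 5) | (1UL << 20);
ulong y = (1UL << 5) | (1UL << 40);

ulong combined = x | y;     // union
ulong without = x & ~y;     // except
ulong common = x & y;       // intersect
bool isSubset = (x & y) == x;
bool overlaps = (x & y) != 0UL;

Console.WriteLine(BitOperations.PopCount(combined)); // 4
Console.WriteLine(BitOperations.PopCount(without));  // 2
Console.WriteLine(common == 1UL << 5);               // True
Console.WriteLine(isSubset);                         // False
Console.WriteLine(overlaps);                         // True
```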

Enumerating a Set

How do we enumerate over our set of cards? Again with bit twiddling. We can get the lowest bit in the set and convert it to a card for the current item in the enumeration. To move on to the next item we can reset that bit, and then carry on until all bits have been reset. These two operations are simple enough; x & -x will extract the lowest set bit and x & (x - 1) will reset the lowest set bit. However we can do slightly better as some processors have built in instructions for both of these (_blsi_u64 and _blsr_u64) which we can access via the .NET hardware intrinsics that were added with .NET Core 3.0. The System.Runtime.Intrinsics namespace gives methods to test if a given instruction is available on the CPU running your code and methods to call it if it is. The pattern is to check for availability, use the instruction if it's available or provide a fallback if not. The checks are treated as constants by the JIT, meaning the check isn't performed at runtime but when the JIT compiles the code, so there is no overhead of checking each time. Resetting the lowest set bit then becomes:

Bmi1.X64.IsSupported ? Bmi1.X64.ResetLowestSetBit(x) : x & (x - 1);

We can do a similar thing for extracting the lowest set bit.
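The whole loop might look something like this sketch, which yields deck indices rather than Card values to stay self-contained. CardIndices is a hypothetical name, and it uses the trailing zero count directly rather than extracting the lowest bit first:

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static IEnumerable<int> CardIndices(ulong set)
{
    while (set != 0)
    {
        // The lowest set bit is the current item.
        yield return BitOperations.TrailingZeroCount(set);

        // Reset it to move on, using the hardware instruction where it is available.
        set = Bmi1.X64.IsSupported ? Bmi1.X64.ResetLowestSetBit(set) : set & (set - 1);
    }
}

ulong cards = (1UL << 0) | (1UL << 13) | (1UL << 51);

Console.WriteLine(string.Join(", ", CardIndices(cards))); // 0, 13, 51
```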

The Source Code

You can find the source code for all of this at https://github.com/MrKWatkins/cards-cs.

Spherical Lights

Kevin Watkins — Wed, 20 Jan 2021 22:15:00 GMT

As mentioned in the previous post I've decided to start looking at rendering again to help me learn a new programming language, Rust. I initially started with just point light sources but I have recently added spherical light sources too. Why bother? What is the difference?

Point vs. Spherical Lights

In the real world there are no point light sources, everything has a volume. If the source is small or far away then it is approximately a point source. When a point source projects light it hits everywhere with the same amount of light (ignoring attenuation). This means when an object casts a shadow the part in shade gets no light at all, the part in the light gets the full amount.

A point light source casting a shadow. A given point on the backboard is either in or out of the shadow.

However when a light source has a volume that is no longer true. Each part of the surface will be emitting light. This alters shadows; the amount of light received at each point is now going to be related to the amount of light source that is visible. If the source is totally obscured then no light will be received. If the source is totally visible then the full amount will be received. In between a variable amount will be received. This leads to soft shadows - instead of the shadow going sharply from 0% light to 100% there will be a smooth gradient instead.

A spherical light source casting a shadow. A given point on the backboard is either in, out or partially in the shadow.

Adding to the Ray Tracer

How do we add this to the ray tracer? We could calculate the amount of the light visible at each point, however this could get very complicated very quickly, especially for multiple complex shapes in front of the light. An easier approach is to sample points on the surface of the light and treat each one as a point light. We can then average the results from each separate point to get our approximation. Adding this to our ray tracer gives us the soft shadows we're looking for:

How do we sample the points on the surface of the sphere? I've added two methods, random and uniform. With both methods we will want to ensure that the points are evenly distributed over the surface.

Random Sampling

For the random method we can use spherical co-ordinates to ensure the even distribution, provided we already have a generator that can give us random numbers in the interval \([0,1)\). Spherical co-ordinates are defined in terms of radius \(r\), azimuthal angle \(\theta\) and polar angle \(\phi\), where \(\theta \in [0, 2\pi)\) and \(\phi \in [0, \pi]\). These map to normal Cartesian co-ordinates via the formulae \(x = r \sin \phi \cos \theta\), \(y = r \sin \phi \sin \theta\) and \(z = r \cos \phi\).

Whilst we can map our uniform numbers to \(\theta\) quite easily we cannot do the same to \(\phi\). We can think of our spherical co-ordinates as giving us circles around the sphere parallel to the \(xy\) plane. As we move towards the poles these circles become smaller, so including the same number of points for each circle would cause the points to be packed more densely nearer the poles. Instead we apply our uniform distribution to \(z\) over the range \([-r,r]\), and map back to \(\phi\) using the formula above. This gives us the following formulae for two random numbers \(A\) and \(B\):
\[\theta = 2 \pi A
\\ \phi = \cos^{-1}(2 B - 1)\]
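In code the two sampling formulae come out as below. This is a C# sketch rather than the Rust from the ray tracer, RandomPointOnSphere is a hypothetical name, and it takes \(\theta\) as the angle around the \(z\) axis and \(\phi\) as the angle measured from it, to match the ranges the two formulae produce:

```csharp
using System;

static (double X, double Y, double Z) RandomPointOnSphere(Random random, double r)
{
    // NextDouble gives a value in [0, 1), playing the part of A and then B.
    double theta = 2.0 * Math.PI * random.NextDouble();
    double phi = Math.Acos(2.0 * random.NextDouble() - 1.0);

    return (r * Math.Sin(phi) * Math.Cos(theta),
            r * Math.Sin(phi) * Math.Sin(theta),
            r * Math.Cos(phi));
}

var point = RandomPointOnSphere(new Random(), 1.0);
double length = Math.Sqrt(point.X * point.X + point.Y * point.Y + point.Z * point.Z);

Console.WriteLine(Math.Abs(length - 1.0) < 1e-9); // True, the point lies on the unit sphere
```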

Uniform Sampling

There are many methods to uniformly map points over a sphere but one of the most popular is the Fibonacci Sphere algorithm. This utilises a property of the Golden Angle (which is defined from the Golden Ratio and therefore strongly related to the Fibonacci Sequence, giving the algorithm its name) often found in nature: if you step around a circle incrementing by the Golden Angle each time then the overall distribution of points will be approximately even, with 2 or 3 points being added each revolution.

Once we have our distribution around a circle for \(n\) points we can easily map to a sphere using the same approach for mapping to \(z\) above, except this time we uniformly distribute the points. This gives us the following formulae for point number \(i\) when generating a total of \(n\) points, where \(G_a\) is the Golden Angle:
\[\theta = G_a i
\\ \phi = \cos^{-1}\left(1 - \frac{2i}{n}\right)\]
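A C# sketch of those two formulae, with the Golden Angle taken as \(\pi(3 - \sqrt{5})\) radians. FibonacciSphere is a hypothetical name, and as before \(\theta\) is treated as the angle around the \(z\) axis and \(\phi\) as the angle from it:

```csharp
using System;

static (double X, double Y, double Z)[] FibonacciSphere(int n, double r)
{
    double goldenAngle = Math.PI * (3.0 - Math.Sqrt(5.0));
    var points = new (double X, double Y, double Z)[n];

    for (var i = 0; i < n; i++)
    {
        double theta = goldenAngle * i;              // step around the circle
        double phi = Math.Acos(1.0 - 2.0 * i / n);   // spread z evenly from r downwards

        points[i] = (r * Math.Sin(phi) * Math.Cos(theta),
                     r * Math.Sin(phi) * Math.Sin(theta),
                     r * Math.Cos(phi));
    }

    return points;
}

var samples = FibonacciSphere(100, 1.0);

Console.WriteLine(samples.Length); // 100
```

Unlike the random method the same \(n\) always produces the same points.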

Number of Samples

This sampling increases the time it takes to render a picture. For each ray intersection we are now having to trace multiple rays back to each spherical light rather than just a single one. The more samples we do the more accurate our picture will be, but the longer it will take to render. The following images show a section of the scene above rendered with the two different methods for a range of sample counts:

Top row random, bottom row uniform. From left to right 1, 10, 100 and 1000 samples.

A few things to note:

Both are pretty much the same with 1,000 samples.
For fewer samples random gives a speckled pattern, whereas uniform gives a banded pattern. This is because uniform locks the points in place giving the same positions for every intersection, whereas random changes them each time. If you cache the random positions and use the same set each time then that produces bands too.
A single uniform point is equivalent to a point light source, however random sampling with a single point still gives some soft shadow due to the position of the random point being different for every intersection.
The size of the specular highlight also increases with more samples.

Potential Improvements

If the light was directly in view of the camera it wouldn't show up in the image as I'm treating lights separately to objects. A workaround would be to add a spherical object at the same place that doesn't have shading and just gives the colour of its light on intersection.
The number of samples could be dynamic depending on how far away the light is from the intersection and its attenuation to reduce rendering times. After all stars are massive but appear as points to us.
I'm currently sampling over the entire sphere meaning the light from the far intersection is also being included as if it passed clean through the light. Really it should be the hemisphere facing towards the intersection only.

For the interested the source code can be found at https://github.com/MrKWatkins/rust-rendering.

Rust Rendering

Kevin Watkins — Tue, 12 Jan 2021 12:38:04 GMT

I've decided to start looking at rendering again to help me learn a new programming language, Rust. I haven't learnt a new language for a while and Rust sounds like a good one to learn. It has some nice concepts, especially around enforcing lifetimes, that I've been wanting from a language for a while now. A lot of the concepts are related to the memory model I've been designing for Oakley too so it would be good to see how another language does things to get ideas. And it might give me a kick up the behind to get back into regular blogging as I won't be in a position to write new blog posts about Oakley for a while, given I'm rebuilding it from scratch...

I've started again with ray tracing, however this time I'm using libraries (nalgebra and ncollide) for all the geometry and maths rather than hand rolling it. I'm not following the same progress path as my previous C++ version so I can try out a few new things early on. For example I currently have no reflections but I have implemented spherical lights, which I'll write a blog post on soon.

The source code can be found at https://github.com/MrKWatkins/rust-rendering for anyone interested, and here is a quick teaser image showing the progress so far which demonstrates the soft shadows caused by using spherical lights instead of point lights:

Topological Ordering

Kevin Watkins — Mon, 24 Feb 2020 22:33:00 GMT

Topological ordering is ordering dependent items in such a way that items you depend upon come first in the ordering. For example if x depends on y and y depends on z then the topological order would be z, y, x. Maybe dependency ordering would be a better name; it took me a long time to find details of an algorithm online simply because I didn't know it was called 'topological ordering'... My particular use case was making sure types in the language I'm developing were compiled in a sensible order, i.e. if a type X has a field of type Y, compile type Y first.

One way to produce a topological ordering is to think of the items and their dependencies as nodes in a graph. We can utilise a depth first traversal of the graph to help us order things correctly. The deepest node will have no dependencies so should be the first to be returned. The next nodes will all depend directly on the deepest only so are the next to be returned, and so on. We have to make sure not to return the same node multiple times, given a node can have multiple nodes that depend on it. We also need to make sure a node doesn't depend on itself either directly (x depends on x) or indirectly (x depends on y which depends on x) as then we cannot possibly topologically order the nodes. A lazy version in C# might look something like this:

public static IEnumerable<T> TopologicalOrder<T, TKey>(
   this IEnumerable<T> source,
   Func<T, IEnumerable<T>> dependentOnSelector,
   Func<T, TKey> keySelector,
   IEqualityComparer<TKey> keyComparer)
{
   // Keep track of the nodes we've visited. An entry means we've
   // visited it. If the value stored against the node is true then
   // we are currently processing that node and the nodes it depends
   // on. false means we've seen it already but it's not in the
   // current path through the graph.
   var visited = new Dictionary<TKey, bool>(keyComparer);

   return source.SelectMany(
      item => Visit(item, dependentOnSelector, keySelector, visited));
}

private static IEnumerable<T> Visit<T, TKey>(
   T node,
   Func<T, IEnumerable<T>> dependentOnSelector,
   Func<T, TKey> keySelector,
   Dictionary<TKey, bool> visited)
{
   var key = keySelector(node);

   // Have we already visited this node?
   if (visited.TryGetValue(key, out var inProcess))
   {
      // Yes we have. Are we still processing that node, i.e. still
      // working through the nodes it depends on?
      if (inProcess)
      {
         // Yes we are. That means we have a cycle, as this node
         // is dependent upon itself.
         throw new InvalidOperationException(
            "Cyclic dependency found.");
      }
      
      // If we reach here we have visited the node already so don't
      // need to do anything further.
      yield break;
   }

   // Not yet visited this node. Mark it as currently being
   // processed.
   visited[key] = true;

   // Find all the nodes that this one depends on.
   var nodesDependentOn = dependentOnSelector(node);

   // Topologically order those nodes and return them.
   foreach (var nodeDependentOn in nodesDependentOn.SelectMany(
      c => Visit(c, dependentOnSelector, keySelector, visited)))
   {
      yield return nodeDependentOn;
   }

   // We have now returned all the nodes this one depends on so we
   // are safe to return this one.
   yield return node;

   // Mark it as having been visited, but we're not currently
   // processing it or the nodes it depends on.
   visited[key] = false;
}

That's it. Not super complicated but can be a little tricky to get your head around at first. You can find it in my sample data structures and algorithms repository; the version there has niceties like null checking and including details of the nodes found in a cycle in the exception to make debugging easier.
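
For a quick way to experiment with the idea, here is a minimal sketch of the same depth first approach in Python, run against the x, y, z example from the start of the post. It is not the version from the repository: the names are made up for illustration, and it assumes the items are hashable so they can act as their own keys instead of using a key selector.

```python
def topological_order(items, depends_on):
    """Yield items so that each one comes after the items it depends on.

    depends_on(item) returns the items that `item` depends on.
    """
    # Maps item -> True while it is on the current path through the
    # graph, False once it and its dependencies have been handled.
    visited = {}

    def visit(node):
        if node in visited:
            if visited[node]:
                raise ValueError("Cyclic dependency found.")
            return  # Already returned earlier, nothing to do.
        visited[node] = True
        for dependency in depends_on(node):
            yield from visit(dependency)
        # All dependencies have been returned; this node is finished.
        visited[node] = False
        yield node

    for item in items:
        yield from visit(item)


# x depends on y, y depends on z.
dependencies = {"x": ["y"], "y": ["z"], "z": []}
print(list(topological_order(["x", "y", "z"], dependencies.__getitem__)))
# ['z', 'y', 'x']
```

Like the C# version it is lazy, so nothing is traversed until the result is enumerated.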

Installing Z88DK on Ubuntu 18.04

Kevin Watkins — Thu, 11 Oct 2018 19:11:00 GMT

I had a few issues whilst following the instructions to install Z88DK on Ubuntu 18.04 so thought I would document my solutions here in case anyone else has similar problems. Note that I am a Linux novice so apologies if any of this seems obvious to a Linux pro.

The command chmod 777 config.sh failed because that file does not appear to exist in the distribution any more.
I did not have Subversion installed, giving me the error svn: command not found. This was fixed by installing Subversion with sudo apt install subversion.
I did not have Bison installed, giving me the error configure: error: Cannot find required program bison. Fixed by installing Bison with sudo apt install bison.
At this point re-running ./build.sh gave me the error Patch error -> patching file src/SDCC.lex Reversed (or previously applied) patch detected! This is because the build downloads SDCC via Subversion and then patches it; however, because I had already passed this point in the build before the Bison error, the patched files were still present. I therefore deleted them with rm -r -f /tmp/sdcc. This had to be repeated every time the build failed.
I did not have Flex installed, giving me the error configure: error: Cannot find required program flex. Fixed with sudo apt install flex.
I did not have Boost installed, giving me the error configure: error: boost library not found (boost/graph/adjacency_list.hpp). Fixed with sudo apt-get install libboost-all-dev.
I did not have MakeInfo installed. However this did not give me an obvious error; instead there was a warning a few lines before the end of the build output: WARNING: `makeinfo' is missing on your system.. Fixed with sudo apt-get install texinfo.

After that the build completed successfully. However adding the required environment variables wasn't obvious for a Windows user like myself. In Windows there is a single way to define environment variables; in the land of Linux it seems there are many. I found I had to include the lines from the instructions in the file ~/.profile rather than ~/.bash_profile.

With all that done I am now able to compile Oakley programs on Linux.

Oakley - LoRes Demos

Kevin Watkins — Sun, 07 Oct 2018 20:04:00 GMT

I've been adding some support to the Oakley standard library for the LoRes screen mode of the ZX Spectrum Next, as well as some code for creating palettes and floating point number support. I've created two demos that use the new code:

Stripes

Displays some stripes. 'nuff said. The demo is a little flickery because I haven't tried to sync with the vertical blank, just drawn the stripes on the screen. Obviously there are much better/more efficient ways to create this effect!

Gravity

This one is a bit more interesting. It simulates a number of particles attracted to each other by gravity. The code uses floating point maths to calculate gravity, which is rather slow, so with more than 6 or so particles on the screen the frame rate starts to drop quite a lot. Of course it could be optimised; using lower precision floating point code would speed things up a lot without making much difference to the paths of the particles.

Oakley Progress Report

Kevin Watkins — Sun, 30 Sep 2018 18:09:00 GMT

I've made quite a bit of progress since I last blogged about Oakley. I might actually be able to release an alpha version in the next decade!

Better Numbers

No need to specify the type of number anymore, e.g. you can now write 12 rather than 12b.
Binary literals, e.g. 0b00111011.
Numeric separators to make it easier to read large numbers, e.g. 12_345. Supported for decimal, hex and binary numbers.

Error Reporting

The compiler now actually reports some errors, as opposed to crashing hideously, which is nice. It produces a hopefully useful error message along with the piece of code causing the problem:

Error E0011: Hex literals must have an even number of digits.
In System.Spectrum.Screen.oakley at line 26, column 19.
         Plot(coords.X, 0x102);
                        ^^^^^

Type Checking

There is now some type checking in place rather than relying on z88dk giving me an error when compiling the resultant C code. If you try to use an incompatible type you'll be told, along with a list of allowed types where possible:

Error E0019: Cannot resolve method call Plot(System.Byte, const System.String). Candidates are:
   Plot(System.Byte, System.Byte) (in type Screen)
In System.Spectrum.Screen.oakley at line 26, column 4.
         Plot(coords.X, "Invalid");
         ^^^^^^^^^^^^^^^^^^^^^^^^^

Overloads

Methods, operators and constructors can now be overloaded, i.e. you can have multiple methods/operators/constructors on the same type with the same name provided they have different parameters.

Type Classes

Type classes have been added. Sitting somewhere between Haskell's type classes and Scala's traits, they give a way to share code between types and a form of polymorphism. For example you could define an Equatable type class that represents checking if two types are equal:

typeclass Equatable[T]
{
   public Boolean (==)(T other);

   public Boolean (!=)(T other)
   {
      return !(this == other);
   }
}

Note how == has no implementation! This means you must implement this method when you implement the type class:

type SomeType : Equatable[SomeType]
{
   public Byte Id;

   public constructor(Byte id)
   {
      Id = id;
   }

   public Boolean (==)(SomeType other)
   {
      return Id == other.Id;
   }
}

Your type now gains the != operator for free as it was defined in the type class:

SomeType x = SomeType(4);
SomeType y = SomeType(5);
Console.WriteLine(x != y);

You can also use type classes polymorphically, i.e. write code that refers to the type class rather than the type itself. For example, consider the following type class and types:

typeclass Identifiable
{
   public Byte GetId();
}

type AnotherType : Identifiable
{
   public Byte Id;

   public Byte GetId()
   {
      return Id;
   }
}

type YetAnotherType : Identifiable
{
   public Byte GetId()
   {
      return 125;
   }
}

Both implement Identifiable but have very different implementations. We could then write a method that uses the type class:

public static void Write(Identifiable hasId)
{
   Console.WriteLine(hasId.GetId());
}

Because both AnotherType and YetAnotherType implement Identifiable they can both be passed to this method:

AnotherType a = AnotherType();
Write(a);

YetAnotherType y = YetAnotherType();
Write(y);

Generic Constraints

Having type classes makes generic constraints possible. These enable us to constrain a generic parameter so that it must implement one or more type classes:

type UsesIds[T]
   where T : Identifiable
{
   public static void Write(T hasId)
   {
      Console.WriteLine(hasId.GetId());
   }
}

The above code says that we can use any type we like for T provided it implements Identifiable. This then means we can use the GetId method on T inside the type as we can be sure it has one. Trying to use a type that isn't Identifiable won't work:

UsesIds[YetAnotherType] x = UsesIds[YetAnotherType]();   // Compiles.
UsesIds[Byte] y = UsesIds[Byte]();   // Does not compile.

A type can be constrained with as many type classes as you like. It is also possible to constrain by a set of types, forcing the type parameter to be one of the specified types. The standard library Array type uses this to specify that the length of an array can only be a Byte or a Word:

type Array[T, TLength]
   where TLength : Byte | Word
{
...

Default Generic Parameters

Defaults can now be specified for type parameters, which will be used if they are not specified. Array uses this to default TLength as most of the time your arrays will be fairly small:

type Array[T, TLength = Byte]

You then do not need to specify TLength most of the time, only needing to for large arrays:

Array[SomeType] array = Array[SomeType](5);
Array[SomeType, Word] largeArray = Array[SomeType, Word](500);

TThis

Every type and type class in Oakley has a built in generic type parameter, TThis, which is the current type. On the surface that doesn't sound that useful - don't you always know what type you are? Well not if you're a type class. When implemented by a type TThis will be swapped out with the type itself. Confusing I know. Let's look at how we might define Equatable using TThis:

typeclass Equatable
{
   public Boolean (==)(TThis other);
...

Our implementation is similar to before except we no longer need the type parameter on Equatable:

type SomeType : Equatable
{
   public Boolean (==)(SomeType other)
...
}

Could we not have done this using Equatable itself?

typeclass Equatable
{
   public Boolean (==)(Equatable other);
...

No, because our implementation in SomeType would then have to be:

type SomeType : Equatable
{
   public Boolean (==)(Equatable other)
...
}

Which is very different - it is saying that == will take anything that implements Equatable rather than specifically SomeType.

The implementation of Equatable in the standard library uses TThis with default generic parameters:

typeclass Equatable[T = TThis]
{
   public Boolean (==)(T other);
...

Meaning if you don't specify a type parameter it will use the current type, otherwise it will use the type you specified.

Enums

Enums are strongly typed numbers. For example the standard library has an enum for the ZX Spectrum colours:

enum Colour
{
   Black = 0;
   Blue = 1;
   Red = 2;
   Magenta = 3;
   Green = 4;
   Cyan = 5;
   Yellow = 6;
   White = 7;
}

You can use these as you would any other type, e.g.:

public static void SetBorder(Colour colour)
{
...

Why would you use an enum rather than a number? First reason is that you cannot use a number you haven't defined in the enum. If you used a number then someone could pass the number 8 to SetBorder above which is invalid.

Second reason is readability - it's much easier to work out what code is doing with enums as they have sensible names. Whilst you might remember all the numbers for the Spectrum colours other values might be harder. Which of the following two lines of code is easier to understand?

Hardware.SetRegister(7, 1);

Hardware.SetRegister(Register.TurboControl, Turbo.Double);

Third reason is type safety. It's easy to pass the wrong number to something; it's impossible to pass the wrong enum type because the compiler prevents it:

Hardware.SetRegister(7, 1);   // Compiles.
Hardware.SetRegister(1, 7);   // Compiles.
Hardware.SetRegister(Register.TurboControl, Turbo.Double);   // Compiles.
Hardware.SetRegister(Turbo.Double, Register.TurboControl);   // Does not compile.

Enums default to use Bytes for the number they represent but Words can also be used:

enum BigEnum[Word]
{
   LargeValue = 2345;
   LargerValue = 3456;
}

Enums can even have operators and methods:

enum Turbo
{
   Normal = 0;
   Double = 1;
   Quadruple = 2;

   public static Turbo Halve(Turbo turbo)
   {
      if (turbo == Turbo.Normal)
      {
         return Turbo.Normal;
      }
      return Turbo(turbo.GetValue() / 2);
   }
}

Wondering where the GetValue method above comes from? Enums compile into normal Oakley types that implement the Enum type class, and this type class defines GetValue.

Testing

Automated tests are a good thing. So good that Oakley has a special syntax to make it easier for you to write them:

tests WordTests
{
   test GetLeastSignificantByte_returns_the_expected_value
   {
      Word word = 0x1234;
      Byte lsb = word.GetLeastSignificantByte();
      Assert.AreEqual(lsb, 0x34);
   }

   test GetMostSignificantByte_returns_the_expected_value
   {
      Word word = 0x1234;
      Byte msb = word.GetMostSignificantByte();
      Assert.AreEqual(msb, 0x13);
   }
}

The standard library Assert type contains several checks to test your values are correct. These tests can then be built into an output file that you can run on the Next, or the emulator of your choice:

Did you notice the mistake in the test above?

Currently the test runner just outputs to the screen but eventually (i.e. after the alpha release) it will write output to disk using ESXDOS. This will allow us to combine the output of multiple test runs (because all your tests might not fit into memory) and do more complex things, such as take a screen shot of the emulator on the PC and compare them with a saved screen to test display code, or combine the test run with trace output from the emulator to perform code coverage analysis.

Various Other Things

Address type to add a bit of type safety around peeking and poking.
Got rid of the entrypoint keyword that indicated where the program starts, replacing it with a Program type class that has a Run method.
Improvements to the generated code to use fewer pointers. Makes things run a bit faster.
I think the second demo will now run on the real hardware, however I still don't have any real hardware to check. :(

Alpha Release

Soon... There are still a few more boring things I need to do before I can release a compiler:

The standard library needs a bit more work, although I am hoping other people using an alpha version will help to define the library a bit better. I might just release as-is with my ideas for the library and see what people think.
A few bugs to fix.
Some more errors to check for.
Documentation. I have started but lots more to do.
Command line options - currently I have the options hard coded into the program...

The memory model still needs a bit of work too, however I think I'm going to release an alpha before that is done. With memory I want to avoid any dynamic memory allocation and have everything allocated up front. Whilst this might sound restrictive having 48k to work with is restrictive anyways, meaning you really should be thinking about memory up front, so I might as well force you to do so. :) And it will make things much easier for me when I write the straight-to-assembly version of the compiler. (Which won't be anytime soon admittedly.) The up-front-allocation will also involve some compile time analysis that will enable lots of clever optimisations as a nice side effect, such as running Oakley code and even tests at compile time. But all of that is the subject for another long and rambling blog post some other day...

ZX Spectrum Next Test Programs

Kevin Watkins — Sun, 20 May 2018 10:54:00 GMT

Whilst developing the second Oakley demo I've hit various behavioural differences between CSpect, ZEsarUX and the real hardware. I thought it might be useful to set up a repository of test programs to reproduce these differences in test cases, as opposed to just moaning about things. (Although I do enjoy moaning...)

I've created a GitHub repository to contain the tests. It's very early days yet so there are only two tests so far, however one already highlights a difference in behaviour between CSpect 1.11 and ZEsarUX 6.1-RC:

Eventually it would be nice to have a suite of Acid3 style tests such as these, along with normal unit test programs. And ideally a script to run the lot. Emulator writers can use them to test their emulators match the hardware, and FPGA tinkerers can change the Next core and test things still work as expected.

I can't really write all the tests on my own of course, so if you're interested in helping out please get in touch!

Oakley - The Second Demo

Kevin Watkins — Wed, 16 May 2018 19:53:00 GMT

Time for another demo created with my new programming language for the ZX Spectrum Next. This one has quite a lot more going on than the first demo and if nothing else demonstrates that I'm not completely wasting my time with this project. Here is a video of it running on the CSpect emulator:

This demo uses a lot more features of the Next such as the Layer2 display mode for the background and custom palettes to get the correct greys for the rocks. Combine that with drawing to the standard Spectrum screen for the stars and double or triple sized sprites for the rocks and there is quite a bit going on. I haven't done that much optimisation and yet it still manages to run at 50 frames per second. Not too bad if I do say so myself.

Whilst I haven't added that many more new language features since the first demo a lot of base code has been added for manipulating sprites, drawing to Layer2, loading resources via ESXDOS, etc. The compiler has been enhanced quite a lot with a project system to compile multiple code files and prepare Next compatible resources (images, palettes, etc) on the fly. It produces some useful error output too meaning I didn't tear quite as much of my hair out in writing this demo...

It could be better of course. There is too much repetition in the code at the moment. This is because I haven't yet added any polymorphism to the language. This means that, for example, the main loop has a call to update the ship, then the rocks, then the bullets, then the explosions, etc. With some polymorphism I could treat them all as one thing and just have a single loop over all the elements, irrespective of their type.

Can you have a copy of the compiler then? No, not just yet. There is one major annoyance with the compiler at the moment - you have to specify a suffix on all numbers to specify what type they are. This wouldn't be too bad if that produced proper errors. However the ANTLR parsing section of the code instead completely ignores the numbers if they do not have a suffix, which leads to a myriad of weird and wonderful error messages, none of which mention a number at all... I therefore intend to remove the need for these suffixes before I release the compiler.

Aren't you impressed that I resisted the temptation to call the demo Project Next?

EDIT: The version I initially released didn't work properly on an actual Next, or the ZEsarUX emulator... Turns out that CSpect's timing is a bit different so I had to tweak the main game loop a little to get it to display correctly on ZEsarUX. It's working pretty well on version 6.1 despite the occasional dropped frame.

The palette had to be adjusted too - CSpect appears to have a bug whereby it treats E3 as the index of the transparent colour in the sprite palette, rather than the colour itself. It also doesn't shift the transparent colour with the rest of the colours when using palette offsets. I therefore changed the palette and the offset to accommodate both.

The links in this post have been updated with the new version. Hopefully it will work on a real Next now; mine hasn't arrived yet so I have no idea. 8o(

EDIT 2: After several tweaks and a few more retries I've given up getting the demo to work on a real Next until mine arrives. Thanks to everyone on the Facebook group who helped with my attempts to get it working.

Performance Tuning Oakley

Kevin Watkins — Thu, 10 May 2018 15:49:00 GMT

I've nearly completed the second Oakley demo and by and large it runs very well in CSpect. However there are a few minor display issues at the top of the screen which are due to the code for each frame taking slightly longer than a frame to run, meaning the top of the frame is actually the previous frame...

The ZX Spectrum display is rendered horizontally from top to bottom, line by line. In total there are 312 of these horizontal lines, often called scanlines. The numbering is slightly odd, in that the top of the screen is not 0. Instead 0 is the first of the Spectrum's display rows, immediately below the top border. As the Spectrum display is 256 x 192 (inside the border) then the scanline immediately after the display at the start of the bottom border is 192. The bottom of the screen is then line 247. Scanlines 248 -> 256 are not on the display at all; they were needed for synchronization on old display devices. Lastly the top border covers scanlines 256 -> 312.

The ZX Spectrum Next makes it very easy to wait for the display to be in the right place before executing some code as it has two hardware registers, Raster Line MSB and Raster Line LSB, which can be used to determine the line currently being rendered by the display. There is a method in the Oakley Display type that uses these to wait until the display reaches the bottom border:

public static void WaitForBottomBorderScanLine()
{
    // Wait for the MSB to become zero.
    while (Hardware.GetRegister(Register.RasterLineMsb) != 0b)
    {
    }

    // Wait for the LSB to hit 192.
    while (Hardware.GetRegister(Register.RasterLineLsb) < 192b)
    {
    }
}

There is one complication on the Next however. The Next allows you to change the clock speed of the processor by using a hardware register. However... The full 14 MHz clock speed cannot be used whilst a Layer2 screen is being rendered - a maximum of 7 MHz is allowed. I therefore suspect that if I ran the demo on a real Next the top of the display would look much worse as I am using 14 MHz to run the code...

In the second demo I've been fairly lazy so far and only started running code once the display was in the bottom border. One obvious optimisation therefore is to run some non-display code, such as updating the positions of objects being displayed, in scanlines 0 -> 192 at 7 MHz and run the rest outside this window. I will do this before the demo is released, but for now I just wanted to see if I could speed up the existing code with a few quick wins.

First step in performance optimisation is to measure - how can you know what the slow bits are and if your improvements work without measuring? I intend Oakley to eventually have benchmarking and testing tools included, however they're a long way from being written. Until then one easy way to measure performance is to change the border colour before the code is run and change it again afterwards. This then shows how many scanlines the code takes to run. Adding in some border colours between the various sections of code in the second demo's loop gives the following:

The real demo is more interesting than this, I've just hidden the display as it's not ready yet! A black border is no code running, and then the various colours are different sections of demo code. As you can see I start running code at the bottom border (blue) and it is still running by the time it reaches the top of the display again.

After a bit of investigation I realised the code generated by the compiler for for...in loops could be improved. I reduced the demo down to just an empty loop over a 64 element array and started timing from the top border with no turbo enabled. The following Oakley code:

private static Array[Sprite] sprites = Array[Sprite](64b);

private static void MainLoop()
{
   Screen.Cls(0b);
   Hardware.SetTurbo(0b);

   while (true)
   {
      WaitForTopBorderScanLine();

      Screen.SetBorder(7b);

      for (Sprite sprite in sprites)
      {
      }

      Screen.SetBorder(1b);
   }
}

Gave me this timing:

80 scanlines for an empty loop, not ideal! The Z88DK C code that the Oakley compiler generates for the above is:

void Second_Demo_MainLoop()
{
    struct System_Spectrum_Next_Sprite* sprite;
    struct System_Array_System_Spectrum_Next_Sprite* _items0;
    zx_cls(0);
    ZXN_WRITE_REG(7, 0);
    while (1)
    {
        System_Spectrum_Next_Display_WaitForTopBorderScanLine();
        zx_border(7);
        _items0 = Second_Demo_sprites;
        for (sprite = _items0->Items; sprite < &_items0->Items[_items0->Length]; sprite++)
        {
        }
        zx_border(1);
    }
}

One thing you can see from this is that Oakley arrays are compiled into a C struct which has two fields, Items and Length. The for...in loop is then being mapped to a C for loop. We have a pointer corresponding to the item which we increase each iteration until we reach the end of the array. As Oakley arrays will not change size there is one immediate improvement we can make - do not calculate the pointer for the end of the array each iteration and just calculate it once at the start. That then gives the following loop code: (non-loop code omitted for clarity)

struct System_Spectrum_Next_Sprite* sprite;
struct System_Array_System_Spectrum_Next_Sprite* _items0;
struct System_Spectrum_Next_Sprite* _last0;

_items0 = Second_Demo_sprites;
_last0 = &_items0->Items[_items0->Length];
for (sprite = _items0->Items; sprite < _last0; sprite++)
{
}

This has much better timing, 19 scanlines:

Caching the last pointer helped, but do we even need the last pointer at all? We know the length of the array so we could just count from 0 to Length-1 as we would normally do in a for loop. Even better - count backwards from Length to 1. This has the same number of iterations but our loop termination condition becomes not equal to 0, and processors usually have an instruction to do exactly that. (The Z80 has the JP NZ instruction, JumP if Not Zero) Our generated C code then becomes:

struct System_Array_System_Spectrum_Next_Sprite* _items0;
struct System_Spectrum_Next_Sprite* sprite;
uint16_t _counter0;

_items0 = Second_Demo_sprites;
for (sprite = _items0->Items, _counter0 = _items0->Length; _counter0 != 0; sprite++, _counter0--)
{
}

Which gets us down to 10 scanlines:

8 times faster than the original code, not too bad. Plugging this back into the demo and things look a lot better:

Along with a few other optimisations and running code whilst the display is being rendered there will be more than enough CPU cycles for the second Oakley demo. Which will hopefully be available soon...

Having Greys Makes a Bit of Difference

Kevin Watkins — Mon, 07 May 2018 16:03:00 GMT

The ZX Spectrum Next uses 9 bits for its colours, giving a total of 512 possible colours. The Next's palettes are only 256 colours in size however. The default palette uses 8-bit colours of the format RRRGGGBB, generated by taking the numbers 0 - 255 in order. The third B bit is taken by ORing the other two B bits together. So the colour at index 60 in the palette is the colour given by the number 60, which is 00111100 in binary, or a smidgen of red (001), loads of green (111) and no blue (00).

Whilst this gives a reasonable selection of colours it does obviously lean towards red and green as they have more bits in the 8-bit colour. One big problem with the default palette however is greys - there aren't many. To get a grey colour you need the R, G and B components to all be the same value. As there are only 2 bits for B that means 4 possible greys. However one of those is black and one is white, giving only 2 shades of grey in the default palette. I can see this proving a problem for a lot of games as drawing things like metal and rocks will require the extra greys.
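
The grey count can be checked with a short Python sketch that rebuilds the default palette as described and looks for entries where all three components match. The function name is made up for illustration, and it assumes the ORed bit becomes the least significant of the three blue bits.

```python
def default_palette_entry(index):
    """Expand an 8-bit RRRGGGBB palette index into 3-bit (r, g, b)."""
    r = (index >> 5) & 0b111
    g = (index >> 2) & 0b111
    b2 = index & 0b11
    # Assumption: the third blue bit (the two stored blue bits ORed
    # together) is appended as the least significant bit.
    low_bit = 1 if b2 != 0 else 0
    b = (b2 << 1) | low_bit
    return r, g, b


palette = [default_palette_entry(i) for i in range(256)]
greys = [colour for colour in palette if colour[0] == colour[1] == colour[2]]

print(palette[60])  # (1, 7, 0)
print(greys)        # [(0, 0, 0), (3, 3, 3), (5, 5, 5), (7, 7, 7)]
```

Take away black and white from that list and only the two mid greys remain.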

The second Oakley demo requires a few greys for its sprites so I've made sure there are methods in the language for easily loading 9-bit palettes. For example to load a palette for the sprites from disk using ESXDOS we can do the following:

Display.Load9BitColourPalette(PaletteType.Sprites, "sprites.9pl");

This loads the palette using the ULANext Palette Extension register, which accepts 9-bit colours.

How to generate the file in the first place? Well converting from 24-bit colours to 9-bit colours is fairly easy - just shift the bits for each component right 5 places, leaving the 3 most significant bits. For example if we had an 8-bit R component 11001110 that would become the first 3 bits, or 110. Rather than have some separate utility to do this I've baked support for generating 9-bit palettes on the fly into the Oakley compiler via the new project support.
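A minimal C sketch of that conversion follows. The function name and the packing of the result as RRRGGGBBB in a 16-bit value are assumptions for illustration; the byte layout of the .9pl file itself is not described here.

```c
#include <stdint.h>

/* Reduces a 24-bit colour to 9 bits by keeping the 3 most significant bits
   of each component. The RRRGGGBBB packing is an assumption for this sketch. */
uint16_t rgb24_to_rgb9(uint8_t red, uint8_t green, uint8_t blue)
{
    uint16_t r3 = red >> 5;
    uint16_t g3 = green >> 5;
    uint16_t b3 = blue >> 5;
    return (uint16_t)((r3 << 6) | (g3 << 3) | b3);
}
```

With the R component 11001110 from the example, the top three bits of the result are 110.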

A project is basically a folder filled with *.oakley code files, an oakley.json project file and optional resource files, which are currently limited to indexed BMP images. Images can be referenced in the project file and outputs specified, for example:

{
   "path": "Resources\\bg1.bmp",
   "output": ["image", "palette9"]
}

This will cause the compiler to output two files, bg1.9pl, which is the 9-bit palette, and bg1.img which is the image itself. The image format is simply a series of bytes, one per pixel, corresponding to the palette index for that pixel. There is currently one other output option, "sprites", which is similar to image except that it outputs the image in 16x16 blocks to be used as sprite patterns.
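To make the difference between the two outputs concrete, here is a C sketch of rearranging a one-byte-per-pixel image into 16x16 blocks. It is not the compiler's code: the function name is hypothetical, and it assumes the source pixels are stored row by row, that the dimensions are multiples of 16, and that blocks are emitted left to right, top to bottom.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Copies an indexed image (one byte per pixel, assumed row by row) into `out`
   as consecutive 16x16 blocks. Block order (left to right, top to bottom) is
   an assumption. Returns the number of blocks written. */
size_t image_to_sprite_blocks(const uint8_t *pixels, size_t width,
                              size_t height, uint8_t *out)
{
    size_t blocks = 0;
    for (size_t by = 0; by < height; by += BLOCK_SIZE) {
        for (size_t bx = 0; bx < width; bx += BLOCK_SIZE) {
            for (size_t y = 0; y < BLOCK_SIZE; y++) {
                for (size_t x = 0; x < BLOCK_SIZE; x++) {
                    *out++ = pixels[(by + y) * width + (bx + x)];
                }
            }
            blocks++;
        }
    }
    return blocks;
}
```

Each block is then 256 contiguous bytes, so a single sprite pattern can be copied out without skipping over the rest of the image row.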

To edit images I've been using a combination of paint packages, namely Paint.NET because I'm used to it, followed by GIMP for palette-specific stuff; as far as I'm aware Paint.NET doesn't have support for limited palettes. To help with that I've written two further snippets of code that I might either package up as utilities to be bundled with the Oakley compiler or allow to be called via options from the compiler itself.

The first is to generate palettes for use by GIMP. This snippet of C# can be used to output both 8 and 9-bit Next palettes. I've tried to interpolate the colours a bit; for example rather than just use the 2/3 bits of a colour shifted left I've tried to divide the range between the available bits. This means that 8-bit white (111-111-11) comes out as #fcfcfc, which is much closer to full white (#ffffff) than #e0e0e0 (e0 = 11100000), which is what you would get by just shifting left. It will still convert to the corresponding 8-bit colour by trimming the rightmost bits however. You can grab the code for that, or just grab the GIMP .gpl files it outputs for 8-bit and 9-bit palettes instead.
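The C# snippet isn't reproduced here, so the following C sketch is only one scheme that happens to be consistent with the #fcfcfc figure, not necessarily what the snippet does: repeat the 3 bits of a component into the top six bits of the 8-bit channel, which is the same as multiplying by 36. The function name is hypothetical.

```c
#include <stdint.h>

/* Spreads a 3-bit component over an 8-bit channel by repeating its bits.
   An assumed scheme for illustration: 111 becomes 11111100 (0xfc). */
uint8_t expand3(uint8_t value3)
{
    value3 &= 0x7;
    return (uint8_t)((value3 << 5) | (value3 << 2));
}
```

Because the original 3 bits stay in the top of the byte, shifting the result right by 5 recovers the component, so the colour still trims back to the same Next colour.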

The second snippet of C# code loads an image, adjusts all the colours to the nearest colour from one of the palettes above and saves the result as a new image. I couldn't find an easy way to do this with GIMP, especially for the 9-bit palette as it doesn't support palettes larger than 256 colours for a lot of its functions. I found this snippet handy when taking existing graphics rendered in more colours - I could quickly reduce them to the Next colours before further editing. Download the code here.
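The core of such a tool is a nearest-colour lookup. A minimal C sketch is below; the names are hypothetical and the distance measure (squared distance in RGB space) is an assumption, since the metric used by the C# snippet isn't stated.

```c
#include <stddef.h>
#include <stdint.h>

struct Rgb { uint8_t r, g, b; };

/* Returns the index of the palette entry closest to `colour`, using squared
   RGB distance (an assumed metric). Returns 0 if the palette is empty. */
size_t nearest_palette_index(struct Rgb colour, const struct Rgb *palette,
                             size_t count)
{
    size_t best = 0;
    long best_distance = -1;
    for (size_t i = 0; i < count; i++) {
        long dr = (long)colour.r - palette[i].r;
        long dg = (long)colour.g - palette[i].g;
        long db = (long)colour.b - palette[i].b;
        long distance = dr * dr + dg * dg + db * db;
        if (best_distance < 0 || distance < best_distance) {
            best = i;
            best_distance = distance;
        }
    }
    return best;
}
```

Running this for every pixel and writing back the chosen palette entry gives an image that only uses Next colours.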

Keeping Things Consistent

Kevin Watkins — Mon, 02 Apr 2018 13:38:00 GMT

The main rule I've tried to follow when designing Oakley is to avoid special cases as far as possible. C# has quite a few of these and they annoy me. For example:

Constructors. Why do these have a special syntax? They're just a static function that returns a value. There is no difference in the result of a constructor new Blah() and a factory method Blah.Create(). However because they have a special syntax they cannot be used in various places where you might use a method; e.g. Func<Blah> function = Blah.Create is fine but to use a constructor you need to wrap it in a lambda, Func<Blah> function = () => new Blah().
Operators. A similar argument can be made for operators. They're just methods with funny names. So why are they so different in C#? There is a special syntax to remember when defining them, not all of them can be defined, and they can't be passed around as Funcs without wrapping them in lambdas first. Other languages such as Scala handle operators better; in Scala you can define pretty much any operator you want and can even call operators with method style syntax, e.g. x + y could also be written as x.+(y).
Structs. Whilst I can see why they exist they are very much a leaky abstraction. I don't want to care about allocations and garbage and stacks and blah blah blah - that's why I'm using a high level language.
Keyword aliases for types. These reinforce the idea that some types are just a bit different to the others.

In Oakley I've tried to make things uniform wherever possible. Constructors and operators are methods and can be treated as such. Any operator can be defined, not just some special few. There are no classes/structs; we just have types and all types are treated the same.

There is one exception to this however, and that is the assignment operator. Assignment is treated as a special case. It wasn't originally; earlier versions allowed you to define it yourself and the parser/compiler code treated it the same as any other operator. However this made some things a little more complicated:

Translating to C was tricky for assignment - the C function generated had to be slightly different to others and wouldn't work at all for some native types. I therefore had to treat it as a special case and always inline the method.
The Antlr grammar got a lot more complicated because there are various places where assignment can be used but other operators can't, such as assigning an initial value to a field or variable.

I therefore made assignment a special case and made my life a lot easier in the process. It annoys the hell out of me though...