Zeta Chess

Three Layers of Parallelism on GPU

With Zeta v099 I explored three layers of parallelism for Chess on a GPU:

  • 4x|8x direction-wise parallel vector-processing of Bitboards for move generation
  • 64x square-wise parallel processing for move generation, move picking and evaluation
  • 256x worker-wise parallel AlphaBeta game tree search

I did not succeed to utilize 8-bit vector packed math for a combined piece-wise+direction-wise move generation, and with Zeta NNUE in pipe it has yet to be seen if a 32x piece-wise Bitboard move generator + NNUE eval on a GPU SIMD unit approach can boost Zeta's nodes per second throughput enough to compete with NNUE engines running on CPUs.

I guess it is possible to add a 4th layer for multiple GPUs, to run a kind of PVS (Principal Variation Splitting) on CPU host to a certain ply and offload to up to 4 GPUs to compute the sub-trees, each with own parallel AlphaBeta.


I doubt a 5th layer for distributed computing across multiple nodes would make any sense considering the massive amount of parallel workers in a single node with several GPUs, the latency and communication overhead across nodes, and the effective branching factor of ~2 of modern AlphaBeta chess engines.

Zeta NNUE for AMD RDNA2...

I took a look at the RDNA white paper and I pretty much like the architecture.

One SIMD unit has 32 cores, this would fit for a piece-wise move generator, it clocks up to 2.5 GHz, and the cache hierarchy of RDNA2 seems to fit NNUE networks with 10 million weights with 20 MB.

I am yet not sure if the caches are for textures only or if they can be used for program data, and the latencies are according to some benchmarks about an order of magnitude higher than on CPUs, hence it remains open how the NNUE inference will perform.

The scratch-pad memory (LDS) shared across one work-group is 32KB, this is enough for me to store the iterative search var stacks, constant cache of 16KB for the lookup-tables, the L0 cache is 16KB, L1 128KB, L2 4MB, and L3 varies from 16 to 128 MB. If I use a short2 data-type for the first NNUE layer and char4 for the further this should fit across the L0 to L3 caches. Alternative is to store the INT8 weights in the vector registers file, 128KB per SIMD.

Still not sure if 8-bit vector packed math is supported via OpenCL, to speed up NN inference.

RMO - Randomized Move Order - another Lazy SMP derivate

I implemented ABDADA as parallel game tree search in an iterative way in Zeta, and it scales well, but I did it with some atomic operations on global memory what will cause less optimal NPS scaling with more workers, so here, yet another Lazy SMP derivate - RMO:

With <= 64 workers it can not compete with ABDADA, but with >= 128 there is not such a big difference, at least on my time to depth benchmarks.

RMO - basics:

- communication of TT move and TT scores between workers via Shared Hash Table
- all workers follow the same Iterative Deepening loop with same depth
- worker 0 performs a normal search with move ordering etc.
- workers >0 perform a normal search for oldest son first
- workers >0 randomize the move order up to QSearch for other siblings
- as soon as one worker finishes its search, all worker should terminate the current ID iteration and start the next one

RMO - exceptions for workers>0:

There are some cases you may not want to randomize the move order:

- TT move
- IID move
- Counter-Heuristics move
- Killer-Heuristics move

and some cases where you may want to perform a normal search:

- Null move

I have tested to randomize captures with an higher priority than quiet moves, but have no results that justify to do so.

In general you can also combine RMO with an classic parallel search, to let the first n workers search a classic way and the others via RMO. Maybe such an approach will make more sense with a workers count of > 128.

64-bit Kogge-Stone Move Generator - Vector-Based

I switched to 64-bit Dumb7Fill move gen to keep things simple and spare some registers, but maybe it is of interest for some folks, the generalized and vectorized Kogge-Stone move generation*:

the shifts and wraps:

// move generator costants
__constant ulong4 shift4 = (ulong4)( 9, 1, 7, 8);

// wraps for kogge stone shifts
__constant ulong4 wraps4[2] =
  // wraps shift left
            0xfefefefefefefe00, // <<9
            0xfefefefefefefefe, // <<1
            0x7f7f7f7f7f7f7f00, // <<7
            0xffffffffffffff00  // <<8

  // wraps shift right
            0x007f7f7f7f7f7f7f, // >>9
            0x7f7f7f7f7f7f7f7f, // >>1
            0x00fefefefefefefe, // >>7
            0x00ffffffffffffff  // >>8

the precalculated attack-tables:

// precomputed attack tables for move generation and square in check
__constant Bitboard AttackTablesPawnPushes[2*64] = 
  // white pawn pushes
  // black pawn pushes
// piece attack tables
__constant Bitboard AttackTables[7*64] = 
  // white pawn
  // black pawn
  // knight
  // king
  // bishop
  // rook
  // queen

for each piece do...with lid as square

    pfrom   = GETPIECE(board, lid);
    color   = GETCOLOR(pfrom);
    // generator and propagator (piece and empty squares)
    bbGen4  = (ulong4)bbBlockers&SETMASKBB(lid);
    bbPro4  = (ulong4)(~bbBlockers);
// kogge stone shift left via ulong4 vector bbPro4 &= wraps4[0]; bbGen4 |= bbPro4 & (bbGen4 << shift4); bbPro4 &= (bbPro4 << shift4); bbGen4 |= bbPro4 & (bbGen4 << 2*shift4); bbPro4 &= (bbPro4 << 2*shift4); bbGen4 |= bbPro4 & (bbGen4 << 4*shift4); bbGen4 = wraps4[0] & (bbGen4 << shift4); bbTemp = bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3; // set generator and propagator (piece and empty squares) bbGen4 = (ulong4)bbBlockers&SETMASKBB(lid); bbPro4 = (ulong4)(~bbBlockers);
// kogge stone shift right via ulong4 vector bbPro4 &= wraps4[1]; bbGen4 |= bbPro4 & (bbGen4 >> shift4); bbPro4 &= (bbPro4 >> shift4); bbGen4 |= bbPro4 & (bbGen4 >> 2*shift4); bbPro4 &= (bbPro4 >> 2*shift4); bbGen4 |= bbPro4 & (bbGen4 >> 4*shift4); bbGen4 = wraps4[1] & (bbGen4 >> shift4); bbTemp |= bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3; // consider knights bbTemp = (GETPTYPE(pfrom)==KNIGHT)?BBFULL:bbTemp; // verify captures n = (color==stm)?(s32)stm:(s32)!stm; n = (GETPTYPE(pfrom)==PAWN)?n:GETPTYPE(pfrom); bbMask = AttackTables[n*64+lid]; bbMoves = (color==stm)?(bbMask&bbTemp&bbOpp):(bbMask&bbTemp); // verify non captures bbMask = (GETPTYPE(pfrom)==PAWN)?(AttackTablesPawnPushes[stm*64+lid]):bbMask; bbMoves|= (color==stm&&!qs)?(bbMask&bbTemp&~bbBlockers):BBEMPTY;

This code snippet covers all pieces and moves, except en-passant, castle-moves, pawn promotions, and is without a legality-check. A ulong4 vector is used to perform 4 shift-directions, in theory you can also use a ulong8 vector for 8 directions, but some compilers complained about the negative shift on an unsigned data-type.

*based on work by Steffan Westcott, as published on CPW:



Zeta on a GPU - Mission Impossible?

Alright, I activated my time machine and looked over >10 years old threads on Chess on a GPU. Chess programmers are really smart guys and as soon as there were GPGPU frameworks they started to ponder on how to use a GPU for Chess. So what was the opinion back then?

  • CPU-GPU latency
  • SIMD architecture
  • SIMT architecture
  • VRAM memory latency
  • integer performance
  • no recursion
  • synching of gpu-threads

CPU-GPU latency
I offload the whole search with move generation and position evaluation onto GPU. The host runs just an ID-loop (iterative-deepening) for time control.

SIMD architecture
I couple 64 gpu-threads to one worker, used for move generation, move picking, and position evaluation, square-wise, in parallel.

SIMT architecture
I achieve ~50% VALU utilization with one worker running on one SIMD unit, could be better, but no need to run multiple waves of SIMT.

VRAM memory latency
I lowered the global memory footprint by the use of registers for the board representation, scratch-pad memory for the iterative search stacks, constant memory for pre-calculated attack-tables and alike, and VRAM for Transposition Tables only.

integer performance
I implemented a 64-bit Bitboard based board representation and move generator, did not succeed to switch to something with 32-bit or even 8-bit. 64-bit integer opreations need multiple cycles on 32-bit hardware, could be better, could be worse, with Bitboards it is easier to do SIMD-friendly arithmetics, other designs rely on loops or branches for move generation.

no recursion
I implemented a selective AlphaBeta search in an iterative way.

synching of gpu-threads
I implemented a synchless parallel game tree search (RMO), communication is done via a Shared Hash Table in VRAM for hundreds of workers.

As I mentioned already, to get an chess playing engine running on GPU and to get an competitive engine running on GPU are two different tasks. I worked on the topic of parallel search, and with an NNUE implementation in pipe also on chess heuristics, what is left is the speed (the computed nodes per second of a single worker) and the selective part of the search algorithm. Zeta v099 is yet a quite simple implementation of an chess engine, if I add NNUE for chess position evaluation, there is still the branching-factor of >3 which eats up possible speed gains. If we compare one SIMD unit of a GPU for one single Zeta worker and a single CPU core, we have 32 SIMD 32-bit cores @~1.5GHz vs. 4 CPU 64-bit ALUs@~4GHz, we have roughly the same arithmetic throughput, but I did not succeed to match the nodes-per-second performance of a single CPU core, Zeta still lacks 10 to 20 fold behind, and has to catch up in Elo via hundreds of workers in an parallel search.

Zeta with NNUE on GPU?

I think it is possible to add the new neural network technique 'NNUE' to Zeta for upcoming GPU architectures like Nvidia Lovelace, Intel Xe and AMD RDNA3 which probably will all have support of INT8, 8 bit integer, math with higher throughput and maybe some 10 to 20 MB L3 cache per SIMD unit for the network weights file.

With INT8 optimized datatypes and instructions, one could build an vectorized 8 bit 0x88 move generator which operates over the 8 directions as vector and with 32 parallel gpu threads of one SIMD unit handles all pieces at once. Maybe reaching 1 to 2 million nodes per second per SIMD unit in an Zeta engine like framework.

With 32 SIMD gpu threads performing 32xFP32 or 32x2xFP16 operations per clock the NNUE inference performance could be 2 to 4 times faster than with current NNUE on CPUs with AVX2 (roughly estimated), considering a switch from integer to float weights.

Volta/Turing/Ampere have currently 16 cores per FPU SIMD and support doubled throughput for FP16 operations, I guess Nvidia will move with Lovelace back again to an 32 core per SIMD design with unified INT/FP16 cores. RDNA has 32 cores per SIMD, also with doubled throughput for FP16. Intel seems to use SIMD8 with 8 FP cores for its Xe GPU (with support for higher throughput for lower precision), maybe Intel will also add some kind of SIMD32, to couple 4 EUs to one compute unit.


  • up to 2 Mnps per SIMD unit possible
  • up to 4x faster inference for NNUE possible
  • up to 160 parallel workers (SIMD units) on current highend-gpus

again, just some rough numbers estimated, big grain of salt and alike...

if the above all holds, then you get a hell of NNUE monster on highend-gpus.

Zeta v099 already has an simple AB framework implemented with ABDADA or as option RMO Lazy SMP parallel search across SIMD units, hence the main part would then be to implement all those funny search extensions and tricks Stockfish does in an iterative way in Zeta for GPU - full time job ;)


I wrote 10 to 20 MB L3 cache per SIMD unit, assuming the whole net should fit in cache, doubt that this is common practice with NNUE on CPU, maybe the first layer with most of the weights resists in RAM for the incremental updates, and the further layers only get cached? Dunno.

2021-04-12 Followup:

  • I mixed up NNUE first layer INT16 and further INT8 weights, so the possible 4x inference speedup holds only if we assume 8-bit vector packed math on gpu.
  • I was not able to implement an efficient 8-bit 0x88 vector-based board representation on pen n paper, hence no 8-bit speedup for move gen in sight.

  • Even if I keep the current v099 bitboard design a switch to 32 gpu-threads piece-wise worker may pay off with certain architecture improvements of AMD's RDNA and increasing gpu clocks in mind, benchmarks will tell.
Home - Top