Zeta Chess

Open Score with Nvidia 8800GT/GTX 750

My own main objective in computer chess, how to use a GPU for chess, was already solved by AlphaZero/Leela Chess Zero. Nevertheless I have an open score with my Zeta engine: Zeta Dva on an AMD Phenom X4 CPU plays chess against Zeta on an Nvidia 8800GT GPU, now with a hardware upgrade, Zeta Dva on an i5-6500, 4-cores@3.2GHz, plays chess against Zeta on an Nvidia GTX 750, 512-shader-cores@1GHz, a 30$ CPU against a 30$ GPU. I hope to find some time in 2024 to settle this one ;)

Zeta Dva v0405 + Zeta v099q Blueprint

I got everything I need on paper:

  • Board Representation: Quad-Bitboards
  • Move Generation: vectorized Kogge-Stone
  • Search Algorithm: RMO - parallel AlphaBeta
  • Selective Search: NNOM++*
  • Evaluation: SF 13 NNUE**
  • Parallel Layers: direction-wise, square-wise, NN-wise**, AB-worker-wise, PV-Splitting-device-wise

Now to find some time to implement this blueprint, first on CPU+AVX2, then on GPU.

*NNOM++ - Move Ordering Neural Networks: use SF 13 NNUE for selective search in non-QS search.

** still need to figure out how to use the 64 gpu-threads of a worker during NNUE inference.

Zeta v099o

Zeta v099o released as source and Linux/Windows x86-64 binary, still with HCE (handcrafted evaluation); the next step would be to implement NNUE neural networks.

GitHub:

https://github.com/smatovic/Zeta/releases

Alternative downloads:

https://zeta-chess.app26.de/downloads/

Please consider the README file or --help option before running the engine.

From the release notes:

Zeta (099o) alpha

* switch to 64-bit Kogge-Stone, generalized and vectorized, move gen again
* removed ABDADA parallel search
* activated RMO parallel search
* switch to MIT license
*
* Zeta 099o on Intel HD 530, 1 worker, ~30 Knps
* Zeta 099o on Intel HD 530, 24 workers, ~550 Knps
* Zeta 099o on Nvidia GeForce GTX 750, 1 worker, ~50 Knps
* Zeta 099o on Nvidia GeForce GTX 750, 16 workers, ~850 Knps
* Zeta 099o on AMD Radeon HD 8570, 1 worker, ~22 Knps
* Zeta 099o on AMD Radeon HD 8570, 48 workers, ~900 Knps

-- Srdja Matovic 28 May 2022

Zeta v099 - FP32 - dummy benchmarks

I implemented a dummy 10x12 mailbox floating-point move generation with 32 parallel gpu-threads and got only a 2x speedup for an unoptimized approach compared to the 64 gpu-threads Bitboard version, too little, it does not pay off for me to explore that branch any further.

So I am stuck at ~100 Knps per worker with Zeta v099, with up to 320 workers on current gpu architectures.

Zeta - Trick 17

I will give the Zeta v099 approach one more try and apply programmer's trick 17: develop on old and outdated architecture. If my approach runs on the Nvidia 8800 GT, then it will also run on newer architectures with more beef.

With pen and paper I get a 2x speedup for switching from a 64 gpu-threads square-wise to a 32 gpu-threads piece-wise worker, and a further 2x speedup for switching from 64-bit integer Bitboards to 32-bit floats for the board representation and move generation. I still could not figure out a vectorized 0x88 board representation for uchar4 8-bit move generation.

If I applied trick 17 more strictly, I would have to run the 'one-thread-one-board' approach on the 8800 GT, with some thousands of independent gpu-threads, up to 164K threads on newer architectures like the AMD Fury X. But with NNUE and the upcoming NNOM this approach does not fit anymore, a single thread does not have enough beef to compute the new neural networks alone, so I have to couple multiple threads together to work in parallel on the same node for evaluation.

Three Layers of Parallelism on GPU

With Zeta v099 I explored three layers of parallelism for Chess on a GPU:

  • 4x|8x direction-wise parallel vector-processing of Bitboards for move generation
  • 64x square-wise parallel processing for move generation, move picking and evaluation
  • 256x worker-wise parallel AlphaBeta game tree search

I did not succeed in utilizing 8-bit vector packed math for a combined piece-wise+direction-wise move generation, and with Zeta NNUE in the pipe it has yet to be seen if a 32x piece-wise Bitboard move generator + NNUE eval on a GPU SIMD unit can boost Zeta's nodes-per-second throughput enough to compete with NNUE engines running on CPUs.

I guess it is possible to add a 4th layer for multiple GPUs: run a kind of PV-Splitting (Principal Variation Splitting) on the CPU host up to a certain ply and offload the sub-trees to up to 4 GPUs, each running its own parallel AlphaBeta.
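A minimal host-side sketch of such a device-wise layer, simplified to a root split and using OpenMP host threads just for illustration; generate_moves(), search_on_gpu() and the types are hypothetical placeholders, not part of the Zeta sources, and a real PV-Splitting would also pass the established bound down to the sibling searches:

#include <omp.h>

#define MAX_MOVES 256
typedef struct Position Position;  /* opaque, engine specific            */
typedef unsigned int Move;
int generate_moves(Position *pos, Move *moves);            /* hypothetical */
int search_on_gpu(int gpu, Position *pos, Move m, int d);   /* hypothetical */

int pv_split_root(Position *pos, int depth, int ngpus)
{
  Move moves[MAX_MOVES];
  int n = generate_moves(pos, moves);
  /* search the first (PV) move on GPU 0 to establish a bound */
  int alpha = -search_on_gpu(0, pos, moves[0], depth - 1);
  /* farm the remaining sibling sub-trees out to the GPUs,
     each GPU runs its own parallel AlphaBeta on its sub-tree */
  #pragma omp parallel for num_threads(ngpus) reduction(max:alpha)
  for (int i = 1; i < n; i++) {
    int score = -search_on_gpu(i % ngpus, pos, moves[i], depth - 1);
    if (score > alpha) alpha = score;
  }
  return alpha;
}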

followup:

I doubt a 5th layer for distributed computing across multiple nodes would make any sense, considering the massive number of parallel workers in a single node with several GPUs, the latency and communication overhead across nodes, and the effective branching factor of ~2 of modern AlphaBeta chess engines.

Zeta NNUE for AMD RDNA2...

I took a look at the RDNA white paper and I pretty much like the architecture.

One SIMD unit has 32 cores, which would fit a piece-wise move generator, it clocks up to 2.5 GHz, and the cache hierarchy of RDNA2 seems to fit NNUE networks with ~10 million weights, ~20 MB.

I am not yet sure if the caches are for textures only or if they can be used for program data, and the latencies are, according to some benchmarks, about an order of magnitude higher than on CPUs, hence it remains open how the NNUE inference will perform.

The scratch-pad memory (LDS) shared across one work-group is 32KB, which is enough for me to store the iterative search var stacks, the constant cache of 16KB suffices for the lookup-tables; the L0 cache is 16KB, L1 128KB, L2 4MB, and L3 varies from 16 to 128 MB. If I use a short2 data-type for the first NNUE layer and char4 for the further layers, this should fit across the L0 to L3 caches. An alternative is to store the INT8 weights in the vector register file, 128KB per SIMD.

Still not sure if 8-bit vector packed math is supported via OpenCL, to speed up NN inference.
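As a portable fallback, assuming no packed-INT8 extension is exposed via OpenCL, the multiply-accumulate can be written with plain char4 vectors and left to the compiler to map onto packed instructions, if available; a rough sketch with illustrative names:

// char4 multiply-accumulate in plain OpenCL C; whether the compiler maps
// this to a single packed dot-product instruction is an open question
inline int dot_char4(char4 a, char4 b)
{
  int4 p = convert_int4(a) * convert_int4(b); // widen to avoid 8-bit overflow
  return p.x + p.y + p.z + p.w;
}

// accumulate one hidden-layer neuron from n packed char4 inputs
inline int neuron_acc(__local char4 *act, __constant char4 *w, int n)
{
  int acc = 0;
  for (int i = 0; i < n; i++)
    acc += dot_char4(act[i], w[i]);
  return acc;
}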

RMO - Randomized Move Order - another Lazy SMP derivate

I implemented ABDADA as parallel game tree search in an iterative way in Zeta, and it scales well, but I did it with some atomic operations on global memory, which will cause less optimal NPS scaling with more workers, so here, yet another Lazy SMP derivative - RMO:

With <= 64 workers it cannot compete with ABDADA, but with >= 128 workers there is not such a big difference, at least in my time-to-depth benchmarks.

RMO - basics:

- communication of TT move and TT scores between workers via Shared Hash Table
- all workers follow the same Iterative Deepening loop with same depth
- worker 0 performs a normal search with move ordering etc.
- workers >0 perform a normal search for oldest son first
- workers >0 randomize the move order up to QSearch for other siblings
- as soon as one worker finishes its search, all worker should terminate the current ID iteration and start the next one

RMO - exceptions for workers>0:

There are some cases you may not want to randomize the move order:

- TT move
- IID move
- Counter-Heuristics move
- Killer-Heuristics move

and some cases where you may want to perform a normal search:

- Null move
- LMR
- IID

I have tested randomizing captures with a higher priority than quiet moves, but have no results that justify doing so.

In general you can also combine RMO with a classic parallel search, letting the first n workers search in a classic way and the others via RMO. Maybe such an approach will make more sense with a worker count of > 128.
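To make the scheme concrete, here a minimal sketch of the RMO worker loop and the move-ordering hook in plain C; all identifiers (alphabeta, shuffle, Position, Move, ...) are illustrative, not the actual Zeta v099 API:

typedef struct Position Position;
typedef unsigned int Move;
int  alphabeta(Position *p, int depth, int alpha, int beta,
               int worker_id, unsigned int *rng);
void shuffle(Move *moves, int n, unsigned int *rng);

volatile int shared_finished_depth;  /* in Zeta this lives in VRAM, here a
                                        plain global just for illustration */

/* all workers run the same Iterative Deepening loop on the same root
   with the same depth, communicating via the Shared Hash Table */
void rmo_worker(Position *root, int worker_id, int max_depth)
{
  unsigned int rng = 0x9E3779B9u * (unsigned int)(worker_id + 1);
  for (int depth = 1; depth <= max_depth; depth++) {
    alphabeta(root, depth, -32000, 32000, worker_id, &rng);
    /* the first worker to finish an iteration bumps the shared depth;
       the others poll it and abort their current iteration */
    if (shared_finished_depth < depth) shared_finished_depth = depth;
  }
}

/* move-ordering hook inside alphabeta(): keep TT, IID, killer and counter
   moves plus the oldest son in place, shuffle the remaining siblings */
void rmo_order(Move *moves, int nmoves, int first_sibling,
               int worker_id, int qs, unsigned int *rng)
{
  if (worker_id > 0 && !qs)
    shuffle(&moves[first_sibling], nmoves - first_sibling, rng);
}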

64-bit Kogge-Stone Move Generator - Vector-Based

I switched to a 64-bit Dumb7Fill move gen to keep things simple and spare some registers, but maybe the generalized and vectorized Kogge-Stone move generation* is of interest for some folks:

the shifts and wraps:

// move generator constants
__constant ulong4 shift4 = (ulong4)( 9, 1, 7, 8);

// wraps for kogge stone shifts
__constant ulong4 wraps4[2] =
{
  // wraps shift left
  (ulong4)(
            0xfefefefefefefe00, // <<9
            0xfefefefefefefefe, // <<1
            0x7f7f7f7f7f7f7f00, // <<7
            0xffffffffffffff00  // <<8
          ),

  // wraps shift right
  (ulong4)(
            0x007f7f7f7f7f7f7f, // >>9
            0x7f7f7f7f7f7f7f7f, // >>1
            0x00fefefefefefefe, // >>7
            0x00ffffffffffffff  // >>8
          )
};

the precalculated attack-tables:

// precomputed attack tables for move generation and square in check
__constant Bitboard AttackTablesPawnPushes[2*64] = 
{
  // white pawn pushes
  0x100,0x200,0x400,0x800,0x1000,0x2000,0x4000,0x8000,0x1010000,0x2020000,0x4040000,0x8080000,0x10100000,0x20200000,0x40400000,0x80800000,0x1000000,0x2000000,0x4000000,0x8000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000,0x800000000,0x1000000000,0x2000000000,0x4000000000,0x8000000000,0x10000000000,0x20000000000,0x40000000000,0x80000000000,0x100000000000,0x200000000000,0x400000000000,0x800000000000,0x1000000000000,0x2000000000000,0x4000000000000,0x8000000000000,0x10000000000000,0x20000000000000,0x40000000000000,0x80000000000000,0x100000000000000,0x200000000000000,0x400000000000000,0x800000000000000,0x1000000000000000,0x2000000000000000,0x4000000000000000,0x8000000000000000,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,
  // black pawn pushes
  0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100,0x200,0x400,0x800,0x1000,0x2000,0x4000,0x8000,0x10000,0x20000,0x40000,0x80000,0x100000,0x200000,0x400000,0x800000,0x1000000,0x2000000,0x4000000,0x8000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000,0x800000000,0x1000000000,0x2000000000,0x4000000000,0x8000000000,0x10100000000,0x20200000000,0x40400000000,0x80800000000,0x101000000000,0x202000000000,0x404000000000,0x808000000000,0x1000000000000,0x2000000000000,0x4000000000000,0x8000000000000,0x10000000000000,0x20000000000000,0x40000000000000,0x80000000000000
};
// piece attack tables
__constant Bitboard AttackTables[7*64] = 
{
  // white pawn
  0x200,0x500,0xa00,0x1400,0x2800,0x5000,0xa000,0x4000,0x20000,0x50000,0xa0000,0x140000,0x280000,0x500000,0xa00000,0x400000,0x2000000,0x5000000,0xa000000,0x14000000,0x28000000,0x50000000,0xa0000000,0x40000000,0x200000000,0x500000000,0xa00000000,0x1400000000,0x2800000000,0x5000000000,0xa000000000,0x4000000000,0x20000000000,0x50000000000,0xa0000000000,0x140000000000,0x280000000000,0x500000000000,0xa00000000000,0x400000000000,0x2000000000000,0x5000000000000,0xa000000000000,0x14000000000000,0x28000000000000,0x50000000000000,0xa0000000000000,0x40000000000000,0x200000000000000,0x500000000000000,0xa00000000000000,0x1400000000000000,0x2800000000000000,0x5000000000000000,0xa000000000000000,0x4000000000000000,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,
  // black pawn
  0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x2,0x5,0xa,0x14,0x28,0x50,0xa0,0x40,0x200,0x500,0xa00,0x1400,0x2800,0x5000,0xa000,0x4000,0x20000,0x50000,0xa0000,0x140000,0x280000,0x500000,0xa00000,0x400000,0x2000000,0x5000000,0xa000000,0x14000000,0x28000000,0x50000000,0xa0000000,0x40000000,0x200000000,0x500000000,0xa00000000,0x1400000000,0x2800000000,0x5000000000,0xa000000000,0x4000000000,0x20000000000,0x50000000000,0xa0000000000,0x140000000000,0x280000000000,0x500000000000,0xa00000000000,0x400000000000,0x2000000000000,0x5000000000000,0xa000000000000,0x14000000000000,0x28000000000000,0x50000000000000,0xa0000000000000,0x40000000000000,
  // knight
  0x20400,0x50800,0xa1100,0x142200,0x284400,0x508800,0xa01000,0x402000,0x2040004,0x5080008,0xa110011,0x14220022,0x28440044,0x50880088,0xa0100010,0x40200020,0x204000402,0x508000805,0xa1100110a,0x1422002214,0x2844004428,0x5088008850,0xa0100010a0,0x4020002040,0x20400040200,0x50800080500,0xa1100110a00,0x142200221400,0x284400442800,0x508800885000,0xa0100010a000,0x402000204000,0x2040004020000,0x5080008050000,0xa1100110a0000,0x14220022140000,0x28440044280000,0x50880088500000,0xa0100010a00000,0x40200020400000,0x204000402000000,0x508000805000000,0xa1100110a000000,0x1422002214000000,0x2844004428000000,0x5088008850000000,0xa0100010a0000000,0x4020002040000000,0x400040200000000,0x800080500000000,0x1100110a00000000,0x2200221400000000,0x4400442800000000,0x8800885000000000,0x100010a000000000,0x2000204000000000,0x4020000000000,0x8050000000000,0x110a0000000000,0x22140000000000,0x44280000000000,0x88500000000000,0x10a00000000000,0x20400000000000,
  // king
  0x302,0x705,0xe0a,0x1c14,0x3828,0x7050,0xe0a0,0xc040,0x30203,0x70507,0xe0a0e,0x1c141c,0x382838,0x705070,0xe0a0e0,0xc040c0,0x3020300,0x7050700,0xe0a0e00,0x1c141c00,0x38283800,0x70507000,0xe0a0e000,0xc040c000,0x302030000,0x705070000,0xe0a0e0000,0x1c141c0000,0x3828380000,0x7050700000,0xe0a0e00000,0xc040c00000,0x30203000000,0x70507000000,0xe0a0e000000,0x1c141c000000,0x382838000000,0x705070000000,0xe0a0e0000000,0xc040c0000000,0x3020300000000,0x7050700000000,0xe0a0e00000000,0x1c141c00000000,0x38283800000000,0x70507000000000,0xe0a0e000000000,0xc040c000000000,0x302030000000000,0x705070000000000,0xe0a0e0000000000,0x1c141c0000000000,0x3828380000000000,0x7050700000000000,0xe0a0e00000000000,0xc040c00000000000,0x203000000000000,0x507000000000000,0xa0e000000000000,0x141c000000000000,0x2838000000000000,0x5070000000000000,0xa0e0000000000000,0x40c0000000000000,
  // bishop
  0x8040201008040200,0x80402010080500,0x804020110a00,0x8041221400,0x182442800,0x10204885000,0x102040810a000,0x102040810204000,0x4020100804020002,0x8040201008050005,0x804020110a000a,0x804122140014,0x18244280028,0x1020488500050,0x102040810a000a0,0x204081020400040,0x2010080402000204,0x4020100805000508,0x804020110a000a11,0x80412214001422,0x1824428002844,0x102048850005088,0x2040810a000a010,0x408102040004020,0x1008040200020408,0x2010080500050810,0x4020110a000a1120,0x8041221400142241,0x182442800284482,0x204885000508804,0x40810a000a01008,0x810204000402010,0x804020002040810,0x1008050005081020,0x20110a000a112040,0x4122140014224180,0x8244280028448201,0x488500050880402,0x810a000a0100804,0x1020400040201008,0x402000204081020,0x805000508102040,0x110a000a11204080,0x2214001422418000,0x4428002844820100,0x8850005088040201,0x10a000a010080402,0x2040004020100804,0x200020408102040,0x500050810204080,0xa000a1120408000,0x1400142241800000,0x2800284482010000,0x5000508804020100,0xa000a01008040201,0x4000402010080402,0x2040810204080,0x5081020408000,0xa112040800000,0x14224180000000,0x28448201000000,0x50880402010000,0xa0100804020100,0x40201008040201,
  // rook
  0x1010101010101fe,0x2020202020202fd,0x4040404040404fb,0x8080808080808f7,0x10101010101010ef,0x20202020202020df,0x40404040404040bf,0x808080808080807f,0x10101010101fe01,0x20202020202fd02,0x40404040404fb04,0x80808080808f708,0x101010101010ef10,0x202020202020df20,0x404040404040bf40,0x8080808080807f80,0x101010101fe0101,0x202020202fd0202,0x404040404fb0404,0x808080808f70808,0x1010101010ef1010,0x2020202020df2020,0x4040404040bf4040,0x80808080807f8080,0x1010101fe010101,0x2020202fd020202,0x4040404fb040404,0x8080808f7080808,0x10101010ef101010,0x20202020df202020,0x40404040bf404040,0x808080807f808080,0x10101fe01010101,0x20202fd02020202,0x40404fb04040404,0x80808f708080808,0x101010ef10101010,0x202020df20202020,0x404040bf40404040,0x8080807f80808080,0x101fe0101010101,0x202fd0202020202,0x404fb0404040404,0x808f70808080808,0x1010ef1010101010,0x2020df2020202020,0x4040bf4040404040,0x80807f8080808080,0x1fe010101010101,0x2fd020202020202,0x4fb040404040404,0x8f7080808080808,0x10ef101010101010,0x20df202020202020,0x40bf404040404040,0x807f808080808080,0xfe01010101010101,0xfd02020202020202,0xfb04040404040404,0xf708080808080808,0xef10101010101010,0xdf20202020202020,0xbf40404040404040,0x7f80808080808080,
  // queen
  0x81412111090503fe,0x2824222120a07fd,0x404844424150efb,0x8080888492a1cf7,0x10101011925438ef,0x2020212224a870df,0x404142444850e0bf,0x8182848890a0c07f,0x412111090503fe03,0x824222120a07fd07,0x4844424150efb0e,0x80888492a1cf71c,0x101011925438ef38,0x20212224a870df70,0x4142444850e0bfe0,0x82848890a0c07fc0,0x2111090503fe0305,0x4222120a07fd070a,0x844424150efb0e15,0x888492a1cf71c2a,0x1011925438ef3854,0x212224a870df70a8,0x42444850e0bfe050,0x848890a0c07fc0a0,0x11090503fe030509,0x22120a07fd070a12,0x4424150efb0e1524,0x88492a1cf71c2a49,0x11925438ef385492,0x2224a870df70a824,0x444850e0bfe05048,0x8890a0c07fc0a090,0x90503fe03050911,0x120a07fd070a1222,0x24150efb0e152444,0x492a1cf71c2a4988,0x925438ef38549211,0x24a870df70a82422,0x4850e0bfe0504844,0x90a0c07fc0a09088,0x503fe0305091121,0xa07fd070a122242,0x150efb0e15244484,0x2a1cf71c2a498808,0x5438ef3854921110,0xa870df70a8242221,0x50e0bfe050484442,0xa0c07fc0a0908884,0x3fe030509112141,0x7fd070a12224282,0xefb0e1524448404,0x1cf71c2a49880808,0x38ef385492111010,0x70df70a824222120,0xe0bfe05048444241,0xc07fc0a090888482,0xfe03050911214181,0xfd070a1222428202,0xfb0e152444840404,0xf71c2a4988080808,0xef38549211101010,0xdf70a82422212020,0xbfe0504844424140,0x7fc0a09088848281,
};

for each piece do...with lid as square

    pfrom   = GETPIECE(board, lid);
    color   = GETCOLOR(pfrom);
    // generator and propagator (piece and empty squares)
    bbGen4  = (ulong4)bbBlockers&SETMASKBB(lid);
    bbPro4  = (ulong4)(~bbBlockers);
    // kogge stone shift left via ulong4 vector
    bbPro4 &= wraps4[0];
    bbGen4 |= bbPro4 & (bbGen4 << shift4);
    bbPro4 &= (bbPro4 << shift4);
    bbGen4 |= bbPro4 & (bbGen4 << 2*shift4);
    bbPro4 &= (bbPro4 << 2*shift4);
    bbGen4 |= bbPro4 & (bbGen4 << 4*shift4);
    bbGen4  = wraps4[0] & (bbGen4 << shift4);
    bbTemp  = bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3;
    // set generator and propagator (piece and empty squares)
    bbGen4  = (ulong4)bbBlockers&SETMASKBB(lid);
    bbPro4  = (ulong4)(~bbBlockers);
    // kogge stone shift right via ulong4 vector
    bbPro4 &= wraps4[1];
    bbGen4 |= bbPro4 & (bbGen4 >> shift4);
    bbPro4 &= (bbPro4 >> shift4);
    bbGen4 |= bbPro4 & (bbGen4 >> 2*shift4);
    bbPro4 &= (bbPro4 >> 2*shift4);
    bbGen4 |= bbPro4 & (bbGen4 >> 4*shift4);
    bbGen4  = wraps4[1] & (bbGen4 >> shift4);
    bbTemp |= bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3;
    // consider knights
    bbTemp  = (GETPTYPE(pfrom)==KNIGHT)?BBFULL:bbTemp;
    // verify captures
    n       = (color==stm)?(s32)stm:(s32)!stm;
    n       = (GETPTYPE(pfrom)==PAWN)?n:GETPTYPE(pfrom);
    bbMask  = AttackTables[n*64+lid];
    bbMoves = (color==stm)?(bbMask&bbTemp&bbOpp):(bbMask&bbTemp);
    // verify non captures
    bbMask  = (GETPTYPE(pfrom)==PAWN)?(AttackTablesPawnPushes[stm*64+lid]):bbMask;
    bbMoves|= (color==stm&&!qs)?(bbMask&bbTemp&~bbBlockers):BBEMPTY;

This code snippet covers all pieces and moves except en-passant, castling and pawn promotions, and is without a legality-check. A ulong4 vector is used to perform 4 shift-directions; in theory you could also use a ulong8 vector for 8 directions, but some compilers complained about the negative shift on an unsigned data-type.

*based on work by Steffan Westcott, as published on CPW:

https://www.chessprogramming.org/Kogge-Stone_Algorithm

 

Zeta on a GPU - Mission Impossible?

Alright, I activated my time machine and looked over threads on Chess on a GPU that are more than 10 years old. Chess programmers are really smart guys, and as soon as there were GPGPU frameworks they started to ponder on how to use a GPU for Chess. So what were the concerns back then?

  • CPU-GPU latency
  • SIMD architecture
  • SIMT architecture
  • VRAM memory latency
  • integer performance
  • no recursion
  • synching of gpu-threads

CPU-GPU latency
I offload the whole search with move generation and position evaluation onto GPU. The host runs just an ID-loop (iterative-deepening) for time control.
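Roughly sketched, the host side could look like this, assuming a single 'search' kernel that runs one complete iteration on the GPU; the kernel name, buffer layout and the elapsed_ms() timing helper are illustrative, not the actual Zeta host code:

#include <stdio.h>
#include <CL/cl.h>

double elapsed_ms(void);  /* hypothetical timer helper */

void host_id_loop(cl_command_queue q, cl_kernel search, cl_mem result_buf,
                  size_t workers, size_t threads_per_worker,
                  double time_budget_ms)
{
  cl_int bestmove = 0;
  for (cl_int depth = 1; depth <= 99; depth++) {
    clSetKernelArg(search, 0, sizeof(cl_int), &depth);
    size_t gws = workers * threads_per_worker;  /* e.g. 256 x 64 gpu-threads  */
    size_t lws = threads_per_worker;            /* one worker per work-group  */
    clEnqueueNDRangeKernel(q, search, 1, NULL, &gws, &lws, 0, NULL, NULL);
    clFinish(q);                                /* whole iteration runs on GPU */
    clEnqueueReadBuffer(q, result_buf, CL_TRUE, 0, sizeof(cl_int),
                        &bestmove, 0, NULL, NULL);
    if (elapsed_ms() > time_budget_ms / 2)      /* simple time control */
      break;
  }
  printf("bestmove id %d\n", bestmove);
}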

SIMD architecture
I couple 64 gpu-threads to one worker, used for move generation, move picking, and position evaluation, square-wise, in parallel.

SIMT architecture
I achieve ~50% VALU utilization with one worker running on one SIMD unit, could be better, but no need to run multiple waves of SIMT.

VRAM memory latency
I lowered the global memory footprint by the use of registers for the board representation, scratch-pad memory for the iterative search stacks, constant memory for pre-calculated attack-tables and the like, and VRAM for Transposition Tables only.
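Mapped onto OpenCL address spaces this looks roughly like the following simplified kernel signature; illustrative only, not the actual Zeta kernel:

#define MAX_PLY 64

__kernel void search(
  __global   ulong *tt,             /* Transposition Table only in VRAM       */
  __constant ulong *attack_tables,  /* precalculated attack tables            */
  __global   int   *bestmove)
{
  ulong4 board = (ulong4)(0);       /* board representation held in registers */
  __local int stack[MAX_PLY];       /* iterative search stacks in LDS         */
  const int lid = get_local_id(0);  /* 0..63, one gpu-thread per square       */

  /* ...the actual iterative AlphaBeta search would run here,
     touching only tt[] and bestmove[] in VRAM... */
}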

integer performance
I implemented a 64-bit Bitboard based board representation and move generator, and did not succeed in switching to something with 32-bit or even 8-bit. 64-bit integer operations need multiple cycles on 32-bit hardware, could be better, could be worse; with Bitboards it is easier to do SIMD-friendly arithmetic, other designs rely on loops or branches for move generation.

no recursion
I implemented a selective AlphaBeta search in an iterative way.

synching of gpu-threads
I implemented a synchless parallel game tree search (RMO), communication is done via a Shared Hash Table in VRAM for hundreds of workers.
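One way to do such sync-less communication is the well-known lockless 'XOR trick' for hash entries (as used in CPU engines), sketched below with an illustrative entry layout, not Zeta's actual hash format:

typedef struct {
  ulong hash;   /* zobrist key XORed with the data word */
  ulong data;   /* packed move, score, depth, flag      */
} TTEntry;

void tt_save(__global TTEntry *tt, ulong ttsize, ulong hash, ulong data)
{
  ulong idx = hash & (ttsize - 1);  /* ttsize assumed a power of two */
  tt[idx].hash = hash ^ data;       /* torn writes are detected on probe */
  tt[idx].data = data;
}

bool tt_probe(__global TTEntry *tt, ulong ttsize, ulong hash, ulong *data)
{
  ulong idx = hash & (ttsize - 1);
  ulong d = tt[idx].data;
  if ((tt[idx].hash ^ d) != hash)   /* entry torn or for another position */
    return false;
  *data = d;
  return true;
}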

As I mentioned already, to get a chess playing engine running on a GPU and to get a competitive engine running on a GPU are two different tasks. I worked on the topic of parallel search, and with an NNUE implementation in the pipe also on chess heuristics; what is left is the speed (the computed nodes per second of a single worker) and the selective part of the search algorithm. Zeta v099 is still a quite simple implementation of a chess engine; even if I add NNUE for chess position evaluation, there is still the branching factor of >3, which eats up possible speed gains. If we compare one SIMD unit of a GPU running a single Zeta worker with a single CPU core, 32 SIMD 32-bit cores @~1.5GHz vs. 4 CPU 64-bit ALUs @~4GHz, we have roughly the same arithmetic throughput, but I did not succeed in matching the nodes-per-second performance of a single CPU core; Zeta still lags 10 to 20 fold behind and has to catch up in Elo via hundreds of workers in a parallel search.

Zeta with NNUE on GPU?

I think it is possible to add the new neural network technique 'NNUE' to Zeta for upcoming GPU architectures like Nvidia Lovelace, Intel Xe and AMD RDNA3, which probably will all support INT8 (8-bit integer) math with higher throughput and maybe offer some 10 to 20 MB of L3 cache per SIMD unit for the network weights file.

With INT8 optimized datatypes and instructions, one could build a vectorized 8-bit 0x88 move generator which operates over the 8 directions as a vector and, with 32 parallel gpu-threads of one SIMD unit, handles all pieces at once, maybe reaching 1 to 2 million nodes per second per SIMD unit in a Zeta-like engine framework.

With 32 SIMD gpu-threads performing 32xFP32 or 32x2xFP16 operations per clock, the NNUE inference performance could be 2 to 4 times faster than current NNUE on CPUs with AVX2 (roughly estimated), considering a switch from integer to float weights.
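A rough sketch of what one dense layer could look like with the 32 gpu-threads of a SIMD unit, each thread accumulating one output neuron; the layer sizes, FP32 weights and ReLU activation are assumptions for illustration, not actual NNUE parameters:

#define L_IN  512   /* input activations, illustrative size  */
#define L_OUT 32    /* one output neuron per gpu-thread       */

void layer_forward(__local float *in,       /* L_IN activations in LDS       */
                   __constant float *w,     /* L_OUT x L_IN weights           */
                   __constant float *bias,
                   __local float *out)
{
  int lid = get_local_id(0);                /* 0..31, one neuron per thread   */
  float acc = bias[lid];
  for (int i = 0; i < L_IN; i++)
    acc += w[lid * L_IN + i] * in[i];       /* dense multiply-accumulate      */
  out[lid] = fmax(acc, 0.0f);               /* ReLU-style activation          */
  barrier(CLK_LOCAL_MEM_FENCE);             /* publish result to the worker   */
}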

Volta/Turing/Ampere currently have 16 cores per FP SIMD and support doubled throughput for FP16 operations, I guess Nvidia will move with Lovelace back to a 32 cores per SIMD design with unified INT/FP16 cores. RDNA has 32 cores per SIMD, also with doubled throughput for FP16. Intel seems to use SIMD8 with 8 FP cores for its Xe GPU (with support for higher throughput at lower precision), maybe Intel will also add some kind of SIMD32, coupling 4 EUs to one compute unit.

So...

  • up to 2 Mnps per SIMD unit possible
  • up to 4x faster inference for NNUE possible
  • up to 160 parallel workers (SIMD units) on current highend-gpus

again, just some rough estimated numbers, to be taken with a big grain of salt...

If the above all holds, then you get a hell of an NNUE monster on high-end gpus.

Zeta v099 already has a simple AB framework implemented, with ABDADA or optionally RMO Lazy SMP parallel search across SIMD units, hence the main part would be to implement all those funny search extensions and tricks Stockfish does in an iterative way in Zeta for GPU - a full-time job ;)

Followup:

I wrote 10 to 20 MB L3 cache per SIMD unit, assuming the whole net should fit in cache; I doubt that this is common practice with NNUE on CPU, maybe the first layer with most of the weights resides in RAM for the incremental updates, and only the further layers get cached? Dunno.

2021-04-12 Followup:

  • I mixed up the NNUE first-layer INT16 and further-layer INT8 weights, so the possible 4x inference speedup holds only if we assume 8-bit vector packed math on gpu.
  • I was not able to work out an efficient 8-bit 0x88 vector-based board representation with pen and paper, hence no 8-bit speedup for move gen in sight.

  • Even if I keep the current v099 Bitboard design, a switch to a 32 gpu-threads piece-wise worker may pay off with certain architecture improvements of AMD's RDNA and increasing gpu clocks in mind; benchmarks will tell.

Zeta - v099 revisited II

If I wish to keep the v099 design of Zeta, with classic parallel AlphaBeta, how could I improve the nps throughput per worker further?

The current board representation is Bitboard, 64-bit based, which makes it easier to parallelize across the SIMD unit of a GPU. Current GPUs are 32-bit machines, and upcoming GPUs will probably support INT8 (8-bit integer) math with higher throughput, so you can do four INT8 operations per cycle instead of one 32-bit. Further, I used the most simple parallelization of Bitboards for SIMD during move generation and evaluation, square-wise, so the engine runs the same code 64 times, once per square. Current GPU architectures (Turing/RDNA) have 32 cores per SIMD unit, so I need to run the square-wise code over two waves on the SIMD unit.

If I change to some kind of vectorized 8-bit move generation and evaluation to use INT8 optimized math, I could achieve an ~8x speedup, because 64-bit operations need multiple cycles on 32-bit hardware. If I further switch from a square-wise parallelization to some kind of piece-wise, or better direction-wise, I could achieve at least a further 2x speedup.

Of course, these are numbers on paper, in practice there is always a trade-off, and one has to consider Amdahl's law, but it seems to me that this could be one way to go.

Zeta - v099 revisited

It works, Zeta v099 plays decent chess with a classic parallel AlphaBeta approach, and I am convinced that with some further work it could reach more than 3000 CCRL Elo on a high-end gpu.

But the obvious thing is that it lacks nps throughput per worker, the single-thread performance is too low, and even with a better parallel search, there is not much to gain on massively parallel systems with more than 128 workers.

So to be able to beat the top 10 chess engines out there, the nps throughput per worker must be increased ten or twenty fold...

During early development I tried a design based on a LIFO-stack parallel search. It had the best nps throughput of all my designs, but I was not able to implement AlphaBeta pruning efficiently, so the speed gain was lost again during pruning.

If I had to start over, and make another Zeta version, I would try the LIFO-Stack based parallel search again...

Zeta v099m

Zeta v099m released as source and Linux/Windows 64 bit binary:

https://github.com/smatovic/Zeta/releases

Alternative downloads:

https://zeta-chess.app26.de/downloads/

Please consider the README file or --help option before running the engine.

From the changelog:

Zeta (099m) alpha; urgency=medium

* patch for ABDADA parallel search
* disabled RMO parallel search
* removed max device memory limitation
* mods in time control
* cleanups
*
* Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps
* Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps

-- Srdja Matovic 13 Jul 2019

Here are some nps and search scaling results...

################################################################################
# Zeta 099m, startposition, depth 12, best of 4 runs, Nvidia V100:
# tt1: 2048 MB, tt2: 1536 MB
#
### workers #nps          #nps speedup   #time in s   #ttd speedup   #relative ttd
### 1       86827         1.000000       156.586000   1.000000       1.000000 
### 2       180282        2.076336       55.749000    2.808768       2.808768 
### 4       356910        4.110588       35.564000    4.402936       1.567568 
### 8       704741        8.116611       19.637000    7.974029       1.811071 
### 16      1385758       15.959989      14.583000    10.737571      1.346568 
### 32      2786039       32.087242      11.124000    14.076411      1.310949 
### 64      5460849       62.893443      8.838000     17.717357      1.258656 
### 128     10235993      117.889516     7.377000     21.226244      1.198048 
### 160     11639290      134.051505     7.202000     21.742016      1.024299 

Zeta v099l

Zeta v099k did not scale well on Nvidia Pascal and Turing gpus, so I wrote a patch to fix this issue, and released Zeta v099l:

https://github.com/smatovic/Zeta/releases

On Pascal it runs now 4 workers per Compute Unit and on Turing 2 workers per Compute Unit during guessconfigx.

According to Nvidia papers, Turing should have 16-wide SIMD units, with four units per Compute Unit, but from my tests I can only speculate that the integer units are 32 wide, not 16, with two of them per Compute Unit.

During benchmarks on other systems it was shown again that some Windows OS versions have an OS gpu timeout, so you may want to apply this registry update on your Windows machine:

https://zeta-chess.app26.de/downloads/SetWindowsGPUTimeoutTo20s.reg

Download, double-click and reboot OS to increase gpu timeout from 2 to 20 seconds.

If you want to run an SMP benchmark on your gpu, I suggest increasing the gpu timeout to 400 seconds:

https://zeta-chess.app26.de/downloads/SetWindowsGPUTimeoutTo400s.reg

Zeta - Source Code and Binaries online

I fixed some issues in Zeta Dva and Zeta, source code and binaries are online again:

https://github.com/smatovic/ZetaDva/releases

https://github.com/smatovic/Zeta/releases

Please consider the README file or --help option before running the Zeta engine on GPU.

I lost the source of Zeta Vintage, and an attempt to do a rewrite in C showed again that the 6502 processor should really be programmed in assembly, so a rewrite in 6502 assembly is still on my bucket list...

https://github.com/smatovic/ZetaVintage

Alternative downloads:

https://zeta-chess.app26.de/downloads/

 

Zeta v099

I finished my current run on Zeta v099, my experimental gpu chess engine.

https://github.com/smatovic/Zeta

The actual conclusion of the current iteration is that a simple engine, with standard chess programming techniques, can be ported to OpenCL to run on a gpu, but it would take more effort to make the engine competitive in terms of computed nodes per second (speed), heuristics (expert knowledge), and scaling (parallel search algorithm).

Computer Chess, as a computer science topic, evolved over decades, starting in the 40s and 50s, and reached one peak in 1997 with the match Deep Blue vs. Kasparov. Nowadays chess engines are tuned by playing thousands and thousands of games, so to get a chess playing engine running on the gpu and to get a competitive chess playing engine running on the gpu are two different tasks.

Zeta - Milestones

Here is an overview of what happened before....

Zeta (099o)

* switch to 64-bit Kogge-Stone, generalized and vectorized, move gen again
*
* Zeta 099o on Intel HD 530, 1 worker, ~30 Knps
* Zeta 099o on Intel HD 530, 24 workers, ~550 Knps
* Zeta 099o on Nvidia GeForce GTX 750, 1 worker, ~50 Knps
* Zeta 099o on Nvidia GeForce GTX 750, 16 workers, ~850 Knps
* Zeta 099o on AMD Radeon HD 8570, 1 worker, ~22 Knps
* Zeta 099o on AMD Radeon HD 8570, 48 workers, ~900 Knps

-- Srdja Matovic 28 May 2022

Zeta (099n)

* removed ABDADA parallel search
* activated RMO parallel search
* switch to MIT license

-- Srdja Matovic Sep 2021

Zeta (099m)

* patch for ABDADA parallel search
* disabled RMO parallel search
* removed max device memory limitation
* mods in time control
* cleanups
*
* Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps
* Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps

-- Srdja Matovic 13 Jul 2019

Zeta (099l)

* patch for parallel search scaling
* max device memory increased from 1 GB to 16 GB

-- Srdja Matovic Jun 2019

Zeta (099h to 099k)

* fixes n cleanups
* switch from Lazy SMP to ABDADA parallel search
* added IID - Internal Iterative Deepening
* one .cl file for all gpu generations with inlined optimizations
*
* Zeta 099k on AMD Radeon R9 Fury X, 256 workers, ~ 7.6 Mnps
* Zeta 099k on Nvidia GeForce GTX 750, 16 workers, ~ 800 Knps
* Zeta 099k on AMD Radeon HD 7750, 32 workers, ~ 700 Knps
* Zeta 099k on Nvidia GeForce 8800 GT, 14 workers, ~ 110 Knps

-- Srdja Matovic 2018

Zeta (099b to 099g)

* switch from Kogge-Stone based move generation to Dumb7Fill
* added atomic features for different gpu generations

-- Srdja Matovic 2017

Zeta (099a)

* switch from BestFirstMiniMax-Search to parallel AlphaBeta (Lazy SMP)
* ported all (except IID) search techniques from Zeta Dva v0305 to OpenCL
* ported the evaluation function of Zeta Dva v0305 to OpenCL
* vectorized and generalized 64-bit Kogge-Stone move generator
* 64 gpu-threads are now coupled to one worker, performing move generation,
   move picking and evaluation, square-wise, in parallel on the same node
* portability over performance, should run on the very first GPUs with
   OpenCL 1.x support (>= 2008)

-- Srdja Matovic 2017

Zeta (098d to 098g)

* mostly cleanup and fixes
* restored simple heuristics from Zeta Dva (~2000 Elo on CCRL) engine
* protocol fixes
* fixed autoconfig for AMD gpus
* switched to Kogge-Stone based move generator
* switched to rotate left based Zobrist hashes
* switched to move picker
* switched to GPL >= 2
*
* Zeta 098e on Nvidia GeForce GTX 580, ca. 6 Mnps, est. 1800 Elo on CCRL
* Zeta 098e on AMD Radeon HD 7750, ca. 1 Mnps
* Zeta 098e on AMD Phenom X4, ca. 1 Mnps
* Zeta 098e on Nvidia GeForce 8800 GT, ca. 500 Knps

-- Srdja Matovic 2016


Zeta (098a to 098c)

* improved heuristics, partly ported from the Stockfish chess engine
* AutoConfig for OpenCL devices
* parameter tuning
* Zeta 098c on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 2000 Elo on CCRL
* Zeta 098c on AMD Radeon R9 290, ca. 3.2 Mnps

-- Srdja Matovic Aug 2013

Zeta (097a to 097z)

* implementation of a BestFirstMiniMax-Search algorithm with UCT parameters
   for parallelization
* Zeta 097x on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 1800 Elo on CCRL
* Zeta 097x on AMD Radeon HD 7750, ca. 800 Knps

-- Srdja Matovic Jan 2013

Zeta (0930 to 0960)

* tested LIFO-stack based load balancing for AlphaBeta search on one compute
   unit of the GPU
* tested Monte Carlo Tree Search without UCT across multiple compute units of
   the GPU
* tested 'Nagging' and 'Spam' parallelization, the multi-window approach,
   for AlphaBeta search on one compute unit of the GPU
* tested 'RBFMS', Randomized BestFirstMiniMax-Search, a parallel version of
   BestFirstMiniMax, across multiple compute units of the GPU
* failed to implement YBWC parallel AlphaBeta
* failed to implement Conspiracy Numbers Search

-- Srdja Matovic 2012

Zeta (0915 to 0918)

* 64-bit Magic Bitboard move generator running
* AlphaBeta search algorithm with 'SPPS'-parallelization approach plays chess,
   running 128 gpu-threads on one compute unit of the GPU

-- Srdja Matovic 2011

Zeta (0900 to 0910)

* tested 32-bit 0x88 and 64-bit Magic Bitboard move generator
* ported heuristics, the evaluation function, from CPU engine 'Zeta Dva'
   (~2000 Elo on CCRL) to OpenCL

-- Srdja Matovic 2010

 

* updated on 2022-05-28 *

Zeta - Source Code

Zeta and Zeta Dva support only some basic Xboard protocol commands, and some users have reported problems with the configuration and interface of the last Zeta versions. So I will publish the source code again when these parts are designed in a more user-friendly way and tested with Windows Chess-GUIs like Winboard or Arena.
