Alright, I activated my time machine and looked over >10 years old threads on Chess on a GPU. Chess programmers are really smart guys and as soon as there were GPGPU frameworks they started to ponder on how to use a GPU for Chess. So what was the opinion back then?

  • CPU-GPU latency
  • SIMD architecture
  • SIMT architecture
  • VRAM memory latency
  • integer performance
  • no recursion
  • synching of gpu-threads

CPU-GPU latency
I offload the whole search with move generation and position evaluation onto GPU. The host runs just an ID-loop (iterative-deepening) for time control.

SIMD architecture
I couple 64 gpu-threads to one worker, used for move generation, move picking, and position evaluation, square-wise, in parallel.

SIMT architecture
I achieve ~50% VALU utilization with one worker running on one SIMD unit, could be better, but no need to run multiple waves of SIMT.

VRAM memory latency
I lowered the global memory footprint by use the of registers for the board representation, scratch-pad memory for the iterative search stacks, constant memory for pre-calculated attack-tables and alike, and VRAM for Transposition Tables only.

integer performance
I implemented a 64-bit Bitboard based board representation and move generator, did not succeed to switch to something with 32-bit or even 8-bit. 64-bit integer opreations need multiple cycles on 32-bit hardware, could be better, could be worse, with Bitboards it is easier to do SIMD-friendly arithmetics, other designs rely on loops or branches for move generation.

no recursion
I implemented a selective AlphaBeta search in an iterative way.

synching of gpu-threads
I implemented a synchless parallel game tree search (RMO), communication is done via a Shared Hash Table in VRAM for hundreds of workers.

As I mentioned already, to get an chess playing engine running on GPU and to get an competitive engine running on GPU are two different tasks. I worked on the topic of parallel search, and with an NNUE implementation in pipe also on chess heuristics, what is left is the speed (the computed nodes per second of a single worker) and the selective part of the search algorithm. Zeta v099 is yet a quite simple implementation of an chess engine, if I add NNUE for chess position evaluation, there is still the branching-factor of >3 which eats up possible speed gains. If we compare one SIMD unit of a GPU for one single Zeta worker and a single CPU core, we have 32 SIMD 32-bit cores @~1.5GHz vs. 4 CPU 64-bit ALUs@~4GHz, we have roughly the same arithmetic throughput, but I did not succeed to match the nodes-per-second performance of a single CPU core, Zeta still lacks 10 to 20 fold behind, and has to catch up in Elo via hundreds of workers in an parallel search.