With Zeta v099 I explored three layers of parallelism for Chess on a GPU:

  • 4x|8x direction-wise parallel vector-processing of Bitboards for move generation
  • 64x square-wise parallel processing for move generation, move picking and evaluation
  • 256x worker-wise parallel AlphaBeta game tree search

I did not succeed to utilize 8-bit vector-packed-math for a combined piece-wise+direction-wise move generation, and with Zeta NNUE in pipe it has yet to be seen if a 32x piece-wise Bitboard move generator + NNUE-eval on a GPU-SIMD-unit approach can boost Zeta's nodes-per-second throughput enough to compete with NNUE engines running on CPUs.

I guess it is possible to add a 4th layer for multiple GPUs, to run a kind of PVS (Principal Variation Splitting) on CPU host to a certain ply and offload to up to 4 GPUs to compute the sub-trees, each with own parallel AlphaBeta.