With Zeta v099 I explored three layers of parallelism for Chess on a GPU:

  • 4x|8x direction-wise parallel vector-processing of Bitboards for move generation
  • 64x square-wise parallel processing for move generation, move picking and evaluation
  • 256x worker-wise parallel AlphaBeta game tree search

I did not succeed to utilize 8-bit vector packed math for a combined piece-wise+direction-wise move generation, and with Zeta NNUE in pipe it has yet to be seen if a 32x piece-wise Bitboard move generator + NNUE eval on a GPU SIMD unit approach can boost Zeta's nodes per second throughput enough to compete with NNUE engines running on CPUs.

I guess it is possible to add a 4th layer for multiple GPUs, to run a kind of PVS (Principal Variation Splitting) on CPU host to a certain ply and offload to up to 4 GPUs to compute the sub-trees, each with own parallel AlphaBeta.

followup:

I doubt a 5th layer for distributed computing across multiple nodes would make any sense considering the massive amount of parallel workers in a single node with several GPUs, the latency and communication overhead across nodes, and the effective branching factor of ~2 of modern AlphaBeta chess engines.