If I wish to keep the v099 design of Zeta, with classic parallel AlphaBeta, how could I improve the nps throughput per worker further?

The current board presentation is Bitboard, 64 bit based, this makes it easier to parallelize across the SIMD unit of a GPU. Current GPUs are 32 bit machines, and upcoming GPUs will probable support INT8, 8 bit integer, math with higher throughput, so you can do four INT8 operations per cycle instead of one 32 bit. Further I used the most simple parallelization of Bitboards for SIMD during move generation and evaluation, square-wise, so the engine runs 64 times, per square, the same code. Current GPU architectures (Turing/RDNA) have 32 cores per SIMD unit, so I need to run the square-wise code over two waves on the SIMD unit.

If I change to some kind of vectorized 8 bit move generation and evaluation to use INT8 optimized math, I could achieve a ~8x speedup, cos 64 bit operations need multiple cycles on 32 bit hardware. If I switch further from an square-wise parallelization to some kind of piece-wise, or better direction-wise, I could achieve at least a further 2x speedup.

Of course, these are numbers on paper, in practice there is always a trade-off, and one has to consider Amdahl's law, but it seems to me that this could be one way to go.