If I wish to keep the v099 design of Zeta, with classic parallel AlphaBeta, how
could I improve the nps throughput per worker further?

The current board representation is Bitboard, 64 bit based, which makes it
easier to parallelize across the SIMD unit of a GPU. Current GPUs are 32 bit
machines, and upcoming GPUs will probably support INT8 (8 bit integer) math with
higher throughput, so you can do four INT8 operations per cycle instead of one
32 bit operation.
Further, I used the simplest parallelization of Bitboards for SIMD during move
generation and evaluation, square-wise, so the engine runs the same code 64
times, once per square. Current GPU architectures (Turing/RDNA) have 32 cores
per SIMD unit, so I need to run the square-wise code in two waves on the SIMD
unit.
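
To make the square-wise idea concrete, here is a minimal sketch in plain Python
rather than actual GPU code. It is an illustration only, not Zeta's
implementation: the function names, the a1 = 0 square numbering and the choice
of knight moves as the example are all assumptions. Each of the 64 "lanes" runs
the same code for its own square, the partial results are OR-reduced, and with
a SIMD width of 32 the 64 lanes need two waves.

```python
# Hypothetical illustration of square-wise parallelization (not Zeta's code).
# Assumed square numbering: 0 = a1, 7 = h1, 63 = h8.
KNIGHT_DELTAS = [(1, 2), (2, 1), (2, -1), (1, -2),
                 (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def knight_attacks_from(square):
    """Work of ONE lane: attack set of a knight standing on `square`."""
    file, rank = square % 8, square // 8
    attacks = 0
    for d_file, d_rank in KNIGHT_DELTAS:
        f, r = file + d_file, rank + d_rank
        if 0 <= f < 8 and 0 <= r < 8:        # stay on the board
            attacks |= 1 << (r * 8 + f)
    return attacks


def square_wise_knight_attacks(knights):
    """Run the same code once per square, then OR-reduce the 64 results."""
    partial = [knight_attacks_from(sq) if (knights >> sq) & 1 else 0
               for sq in range(64)]          # on a GPU: one lane per square
    combined = 0
    for lane_result in partial:
        combined |= lane_result
    return combined


def waves_needed(work_items, simd_width):
    """Number of passes over the SIMD unit (ceiling division)."""
    return -(-work_items // simd_width)
```

Most lanes do no useful work here, since only the squares that actually hold a
knight contribute anything. That idle share is what a piece-wise or
direction-wise layout would try to reduce.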

If I change to some kind of vectorized 8 bit move generation and evaluation to
use INT8 optimized math, I could achieve a ~8x speedup, because 64 bit
operations need multiple cycles on 32 bit hardware. If I switch further from a
square-wise parallelization to some kind of piece-wise, or better
direction-wise, I could achieve at least a further 2x speedup.
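
The two effects behind that estimate can be sketched in plain Python, with
integers masked to 32 bits standing in for hardware registers. This is only an
illustration under my own assumptions: the helper names are hypothetical, and
real GPUs expose packed INT8 math through their own instructions rather than
through this bit trick. The first function emulates a 64 bit shift with two 32
bit halves, which takes several word operations. The second adds four 8 bit
lanes packed into a single 32 bit word in one pass.

```python
MASK32 = 0xFFFFFFFF


def shl64_on_32bit(lo, hi, n):
    """64 bit left shift by 0 < n < 32 using only 32 bit words.

    Needs two shifts for the high word, one OR and one shift for the
    low word, where a native 64 bit machine would need a single shift.
    """
    new_hi = ((hi << n) | (lo >> (32 - n))) & MASK32
    new_lo = (lo << n) & MASK32
    return new_lo, new_hi


def pack_u8x4(lanes):
    """Pack four values 0..255 into one 32 bit word, lane 0 lowest."""
    return lanes[0] | (lanes[1] << 8) | (lanes[2] << 16) | (lanes[3] << 24)


def unpack_u8x4(word):
    """Inverse of pack_u8x4."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def add_u8x4(a, b):
    """Add four unsigned 8 bit lanes at once, wrapping around per lane."""
    # Add the low 7 bits of every lane; the sum cannot carry into a neighbour.
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fix up the top bit of every lane without letting a carry escape.
    return (low ^ ((a ^ b) & 0x80808080)) & MASK32
```

For example, `unpack_u8x4(add_u8x4(pack_u8x4([250, 1, 127, 200]),
pack_u8x4([10, 2, 1, 100])))` gives `[4, 3, 128, 44]`: four additions for the
cost of one word-sized operation sequence.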

Of course, these are numbers on paper. In practice there is always a trade-off,
and one has to consider Amdahl's law, but it seems to me that this could be one
way to go.
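
As a rough sanity check, Amdahl's law can be applied to the paper numbers
above. The sketch below multiplies the ~8x and the 2x into a 16x speedup of the
move generation and evaluation code. The fractions of runtime spent in that
code are made-up values for illustration, not measurements of Zeta.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only part of the runtime gets `factor` faster.

    accelerated_fraction: share of the original runtime that benefits (0..1).
    factor: speedup of that share alone.
    """
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / factor)


# Paper numbers from the text: ~8x (INT8) times 2x (direction-wise).
PAPER_FACTOR = 8 * 2

# Hypothetical shares of runtime spent in move generation and evaluation.
for share in (0.5, 0.9, 0.99):
    overall = amdahl_speedup(share, PAPER_FACTOR)
    print(f"share {share:.2f}: overall speedup {overall:.2f}x")
```

Even if 90% of the runtime were spent in the accelerated code, the overall gain
would be 6.4x rather than 16x, so the remaining part of the engine quickly
becomes the limit.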