The most common way of gpgpu programming is simply to run thousands to millions
of gpu threads performing the same task: take input from global memory (VRAM),
do some computation in local memory or registers, and copy the results back to VRAM.
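As a concrete illustration (not taken from any particular engine), a minimal CUDA kernel following this pattern could look like the sketch below; the kernel name and buffers are made up, and each of roughly a million threads handles one element:

// Minimal sketch of the common gpgpu pattern: every thread reads its own
// element from global memory (VRAM), computes in registers, and writes the
// result back to VRAM. Kernel and buffer names are illustrative only.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void squareKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    if (i < n)
    {
        float x = in[i];   // load from global memory into a register
        out[i] = x * x;    // compute in registers, store back to VRAM
    }
}

int main(void)
{
    const int n = 1 << 20;              // ~1 million elements -> ~1 million threads
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    squareKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}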

Current neural-network-based chess engines run millions of threads on the gpu in
this manner.

With Zeta v099 I tried a different mode: to view the gpu as a bunch of SIMD
units. The parallel cores of one SIMD unit are coupled to one worker that works on
the same chess node in parallel during move generation, move picking, and
position evaluation. Different SIMD units work on the chess game tree via a
classic parallel AlphaBeta search like ABDADA or Lazy SMP. The advantage of
such a design is that you have only hundreds of workers (the number of SIMD
units) to feed the parallel search with, because the more workers you run in an
AlphaBeta search, the less efficient the parallel game tree search performs.
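To make the idea more concrete, here is a rough CUDA sketch (not Zeta's actual code) of one thread block acting as a single worker whose lanes score the moves of the same node in parallel and then reduce to the best one; Move, score_move() and the move list layout are placeholders of my own:

// Sketch of "one SIMD unit = one worker": each thread block is one AlphaBeta
// worker, the threads (SIMD lanes) inside the block score the moves of the
// *same* node in parallel, then a block-wide reduction picks the best move.
// Launch with a power-of-two block size of at most 64 (e.g. 32 or 64).
#include <cuda_runtime.h>
#include <float.h>

typedef unsigned int Move;                 // placeholder move encoding

__device__ float score_move(Move m)        // placeholder static move scoring
{
    return (float)(m & 0xFF);
}

__global__ void pickMoveKernel(const Move *moves, const int *moveCounts,
                               int *bestIndex)
{
    __shared__ float scores[64];           // one slot per SIMD lane
    __shared__ int   indices[64];

    int node  = blockIdx.x;                // each block = one worker = one node
    int lane  = threadIdx.x;               // lanes cooperate on that node
    int count = moveCounts[node];

    float best    = -FLT_MAX;
    int   bestIdx = -1;

    // lanes stride over the node's move list in parallel
    // (toy layout: at most 256 moves stored per node)
    for (int i = lane; i < count; i += blockDim.x)
    {
        float s = score_move(moves[node * 256 + i]);
        if (s > best) { best = s; bestIdx = i; }
    }
    scores[lane]  = best;
    indices[lane] = bestIdx;
    __syncthreads();

    // block-wide reduction to find the overall best move of this node
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (lane < stride && scores[lane + stride] > scores[lane])
        {
            scores[lane]  = scores[lane + stride];
            indices[lane] = indices[lane + stride];
        }
        __syncthreads();
    }
    if (lane == 0)
        bestIndex[node] = indices[0];
}
// launch e.g.: pickMoveKernel<<<numOpenNodes, 64>>>(devMoves, devCounts, devBest);

Launched with one block per open node and 32 or 64 threads per block, each worker then maps naturally onto one Warp or Wavefront.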

With Zeta v097 and v098 I tried yet another mode. Current gpus are based on a SIMT
architecture: they are able to run multiple waves of Warps (Nvidia: 32 parallel
SIMD threads) or Wavefronts (AMD: 64 parallel SIMD threads) on the same SIMD
unit to hide memory latencies, so they are able to run a multitude of threads
on a single core. In this mode I ran thousands of parallel gpu threads, each
working on a single node, all working on the chess game tree search in parallel
via a parallel Best-First-MiniMax search. The disadvantage of such a design
is the low nodes-per-second performance of a single gpu thread, and the massive
amount of parallel workers.
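A crude CUDA sketch of this thread-per-node idea could look like the following; the Node layout, eval() and the selection/backup scheme are simplified stand-ins of my own, and the unsynchronized updates gloss over the locking a real search would need:

// Sketch of thread-per-node best-first search: thousands of lightweight gpu
// threads share one tree in global memory; each thread selects a leaf,
// evaluates it with a placeholder eval(), and recomputes negamax values on
// the path back to the root. Races on the score fields are ignored here.
#include <cuda_runtime.h>

struct Node {
    int parent;      // index of parent node, -1 for the root
    int firstChild;  // index of first child, -1 if leaf
    int childCount;  // children are stored contiguously from firstChild
    int score;       // negamax score from the side to move
};

__device__ int eval(int nodeIdx)            // placeholder evaluation
{
    return (int)((nodeIdx * 2654435761u) % 200u) - 100;
}

__global__ void bestFirstStep(Node *tree)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. selection: start at a root child (crude per-thread diversification),
    //    then greedily follow the best child down to a leaf (negamax view).
    int cur = 0;
    if (tree[0].childCount > 0)
        cur = tree[0].firstChild + (tid % tree[0].childCount);
    while (tree[cur].firstChild != -1)
    {
        int best = tree[cur].firstChild;
        for (int c = 1; c < tree[cur].childCount; ++c)
        {
            int cand = tree[cur].firstChild + c;
            if (-tree[cand].score > -tree[best].score)
                best = cand;
        }
        cur = best;
    }

    // 2. evaluation: score the selected leaf
    tree[cur].score = eval(cur);

    // 3. backup: recompute negamax values on the path back to the root
    //    (unsynchronized; a real search needs locks or atomics here)
    int p = tree[cur].parent;
    while (p != -1)
    {
        int best = -100000;
        for (int c = 0; c < tree[p].childCount; ++c)
            best = max(best, -tree[tree[p].firstChild + c].score);
        tree[p].score = best;
        p = tree[p].parent;
    }
}
// launch repeatedly, e.g.: bestFirstStep<<<256, 128>>>(devTree);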

So the question that remains open is how to run thousands to millions of
threads, all performing the same computation, in a parallel game tree search.

I still do not know. As I already mentioned in an earlier post, I tried a
LIFO-stack based parallel search: take one position from the global stack, compute
the children, put the children back onto the global stack. But I could not implement
AlphaBeta pruning effectively; for that, some kind of linked list would be needed
as the data structure, and I could not wrap my mind around this in a SIMD-friendly
manner... but I guess such an approach could be one way to go to utilize
a gpu for chess.
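To give an impression of what such a stack-based scheme might look like, here is a hypothetical CUDA sketch; Position, expand() and the stack handling are made up, races between concurrent pops and pushes are ignored, and the hard part, AlphaBeta pruning, is exactly what is missing:

// Sketch of the global LIFO-stack scheme: every thread atomically pops one
// position from a shared stack in VRAM, generates its children with a
// placeholder expand(), and pushes them back. No overflow handling, no
// pruning; a real version might ping-pong between an input and an output
// stack per kernel launch to avoid pop/push races.
#include <cuda_runtime.h>

#define MAX_CHILDREN 64

struct Position { unsigned long long hash; int depth; };

__device__ int expand(const Position &p, Position *children)  // placeholder movegen
{
    int n = (int)(p.hash % 8) + 1;
    for (int i = 0; i < n; ++i)
        children[i] = { p.hash * 31u + i, p.depth + 1 };
    return n;
}

__global__ void stackSearchStep(Position *stack, int *top, int maxDepth)
{
    // pop: atomically claim the current top-of-stack slot
    int idx = atomicSub(top, 1) - 1;
    if (idx < 0) { atomicAdd(top, 1); return; }   // stack was empty

    Position p = stack[idx];
    if (p.depth >= maxDepth) return;              // treat as leaf, evaluate here

    // expand: generate children in registers/local memory
    Position children[MAX_CHILDREN];
    int n = expand(p, children);

    // push: atomically reserve n slots and write the children back to VRAM
    int base = atomicAdd(top, n);
    for (int i = 0; i < n; ++i)
        stack[base + i] = children[i];
}
// launch repeatedly, e.g.: stackSearchStep<<<blocks, threads>>>(devStack, devTop, 6);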