The most common way of gpgpu programming is simply to run thousands to millions gpu threads performing the same task, take input from global memory, VRAM, do some computation on local memory or registers, copy results back to VRAM.
Current Neural Network based chess engines do run millions of threads on gpu in this manner.
With Zeta v099 I tried a different mode, to view the gpu as a bunch of SIMD units. The parallel cores of one SIMD unit are coupled to one worker to work on the same chess node in parallel during move generation, move picking, and position evaluation. Different SIMD units work on the chess game tree via an classic parallel AlphaBeta search like ABDADA or Lazy SMP. The advantage of such an design is that you have only hundreds of workers (the amount of SIMD units) to feed an parallel search with, cos the more workers you run in AlphaBeta search, the less efficient the parallel game tree search performs.
With Zeta v097 and v098 I tried yet another mode. Current gpus are based on SIMT architecture, they are able to run multiple waves of Warps (Nvidia: 32 parallel SIMD threads) resp. Wavefronts (AMD: 64 parallel SIMD threads) on the same SIMD unit to hide memory latencies. So they are able to run a multitude of threads on a single core. With this mode I ran thousands of parallel gpu threads, each working on a single node, all working on the chess game tree search in parallel via an parallel Best-First-MiniMax search. The disadvantage of such an design is the low nodes-per-second performance of a single gpu thread, and the massive amount of parallel workers.
So, the question which remains open is how to run thousands to millions of threads, performing all the same computation, in an parallel game tree search.
I still do not know. As I already mentioned in an earlier post, I tried an LIFO-Stack based parallel search. Take one position from global stack, compute the children, put back the children onto global stack. But I could not implement AlphaBeta pruning effective, therefore some kind of linked list as data structure is in need, and my mind was not able to wrap around this in a SIMD friendly manner...but I guess such an approach could be one way to go to utilize a gpu for chess.