I will give the Zeta v099 approach one more try and apply programmer's trick 17: develop on an old and outdated architecture. If my approach runs on the Nvidia 8800 GT, it will also run on newer architectures with more beef.

On pen and paper I get a 2x speedup from switching from a 64 gpu-thread square-wise worker to a 32 gpu-thread piece-wise worker, and a further 2x speedup from switching from 64-bit integer Bitboards to 32-bit floats for board representation and move generation. I still have not figured out a vectorized 0x88 board representation for uchar4 8-bit move generation.
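
For reference, here is a minimal scalar sketch of the 0x88 scheme in Python. It is the plain textbook version, not the vectorized uchar4 variant, which remains open. The names `sq0x88`, `on_board` and `knight_targets` are made up for illustration and are not taken from Zeta.

```python
# Scalar 0x88 sketch; all names are illustrative, not from Zeta.
# A square index is (rank << 4) | file, so the board is 16 files wide
# and only files 0..7 of ranks 0..7 are real squares.

# Knight move offsets on the 16-wide board.
KNIGHT_DELTAS = (33, 31, 18, 14, -14, -18, -31, -33)

def sq0x88(file, rank):
    """Map file 0..7 and rank 0..7 to a 0x88 square index."""
    return (rank << 4) | file

def on_board(sq):
    """A square is valid if neither bit 3 nor bit 7 is set."""
    return (sq & 0x88) == 0

def knight_targets(sq):
    """All on-board knight destinations from sq, ignoring occupancy."""
    return [sq + d for d in KNIGHT_DELTAS if on_board(sq + d)]
```

The appeal for 8-bit move generation is that the off-board test is a single AND on one byte, with no lookup table: `knight_targets(sq0x88(0, 0))` yields two squares for a knight on a1, and a knight on e4 gets all eight.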

If I applied trick 17 more strictly, I would have to run the 'one-thread-one-board' approach on the 8800 GT with some thousands of independent gpu-threads, and with up to 164K threads on newer architectures like the AMD Fury X. But since NNUE and the upcoming NNOM, this approach does not fit anymore: a single thread does not have enough beef to compute the new neural networks alone, so by now I have to couple multiple threads together to work in parallel on the same node for evaluation.
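
As a rough back-of-the-envelope check on the 164K figure, the small Python sketch below multiplies out the resident-thread limit. The hardware numbers are assumptions based on the GCN architecture (64 compute units, 4 SIMDs per unit, up to 10 wavefronts per SIMD, 64 threads per wavefront), and the function names are hypothetical.

```python
# Assumed GCN figures for the AMD Fury X (not stated in the post itself).
COMPUTE_UNITS = 64
SIMDS_PER_CU = 4
WAVEFRONTS_PER_SIMD = 10
THREADS_PER_WAVEFRONT = 64

def resident_threads():
    """Upper bound on gpu-threads resident on the device at once."""
    return (COMPUTE_UNITS * SIMDS_PER_CU
            * WAVEFRONTS_PER_SIMD * THREADS_PER_WAVEFRONT)

def workers(threads_per_worker):
    """How many independent workers fit if threads are coupled."""
    return resident_threads() // threads_per_worker

print(resident_threads())  # 163840, i.e. roughly 164K
print(workers(1))          # one-thread-one-board: 163840 boards
print(workers(32))         # 32 coupled threads per node: 5120 workers
```

Under these assumptions, coupling 32 threads per node trades 163840 independent boards for 5120 wider workers, which is the price of giving each node enough beef for the network evaluation.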