I implemented some dummy 10x12 mailbox floating-point move generation with 32 parallel gpu-threads and get only 2x speedup for an unoptimized approach compared to 64 gpu-threads Bitboards, too lil, it does not pay off for me to explore that branch any further.

So I am stuck on ~100 Knps per worker with Zeta v099 with up to 320 workers on current gpu architectures.