Haha, okay, I ran some basic benchmarks with my v099 bitboard design (64 GPU threads per worker). I'm stuck at 100 Knps to 200 Knps max per worker, and that's without any NNUE implementation, so it's far too slow to compete with NNUE engines on CPUs...