I think it is possible to add the new neural network technique 'NNUE' to Zeta for upcoming GPU architectures like Nvidia Lovelace, Intel Xe and AMD RDNA3, which will probably all support INT8 (8-bit integer) math with higher throughput and maybe offer some 10 to 20 MB of L3 cache per SIMD unit for the network weights file.

With INT8-optimized datatypes and instructions, one could build a vectorized 8-bit 0x88 move generator which operates over the 8 ray directions as a vector and, with the 32 parallel gpu threads of one SIMD unit, handles all pieces at once, maybe reaching 1 to 2 million nodes per second per SIMD unit in a Zeta-like engine framework.
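
To make the idea a bit more concrete, here a minimal, hypothetical CUDA sketch (Zeta itself runs on OpenCL, CUDA is used here just for illustration): each of the 32 threads of one warp takes one piece slot and walks the 0x88 ray directions for sliders. The packing of the 8 directions into one 8-byte vector is left out, and the board layout, piece list and move buffer are made up for the example.

    // hypothetical sketch, not Zeta code: one warp, one piece slot per thread,
    // slider rays walked with the 0x88 off-board test
    #include <stdint.h>

    // 0x88 ray offsets: N, S, E, W, NE, NW, SE, SW
    __constant__ int8_t RAY[8] = { 16, -16, 1, -1, 17, 15, -15, -17 };

    struct Board {
        int8_t squares[128];   // 0x88 board: 0 = empty, >0 = own piece, <0 = enemy
        int8_t piece_sq[32];   // square per piece slot, -1 = unused slot
    };

    __global__ void gen_slider_moves(const Board *b, int16_t *moves, int *nmoves)
    {
        int slot = threadIdx.x;          // piece slot 0..31, one per thread
        int from = b->piece_sq[slot];
        if (from < 0) return;            // empty slot

        for (int d = 0; d < 8; d++) {    // walk all 8 ray directions
            int to = from + RAY[d];
            while ((to & 0x88) == 0) {   // 0x88 off-board test
                int8_t target = b->squares[to];
                if (target > 0) break;   // own piece blocks the ray
                int idx = atomicAdd(nmoves, 1);
                moves[idx] = (int16_t)((from << 8) | to);
                if (target < 0) break;   // capture ends the ray
                to += RAY[d];
            }
        }
    }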

With 32 SIMD gpu threads performing 32x FP32 or 32x2x FP16 operations per clock, NNUE inference performance could be 2 to 4 times faster than current NNUE on CPUs with AVX2 (roughly estimated), considering a switch from integer to float weights.
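
Roughly what I have in mind, as a hypothetical CUDA sketch: one warp computes a small dense layer with weights converted to half precision, and each thread packs two FP16 multiply-adds into one __hfma2 instruction, which is where the 32x2x FP16-per-clock figure comes from. Layer sizes and names are assumptions, not the actual NNUE topology.

    // hypothetical FP16 layer sketch: 32 threads, one output neuron per thread,
    // 2 multiply-adds per __hfma2 instruction (needs a gpu with FP16 math)
    #include <cuda_fp16.h>

    #define IN  512   // layer inputs (assumed), multiple of 2
    #define OUT  32   // layer outputs, one per thread of the warp (assumed)

    __global__ void dense_fp16(const __half2 *w,   // OUT x IN/2 packed weights
                               const __half2 *x,   // IN/2 packed activations
                               float *y)           // OUT outputs
    {
        int o = threadIdx.x;                       // output neuron of this thread
        __half2 acc = __float2half2_rn(0.0f);
        for (int i = 0; i < IN / 2; i++)           // 2 MACs per instruction
            acc = __hfma2(w[o * (IN / 2) + i], x[i], acc);
        // sum the two FP16 halves of the accumulator into one float
        y[o] = __low2float(acc) + __high2float(acc);
    }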

Volta/Turing/Ampere currently have 16 cores per FPU SIMD and support doubled throughput for FP16 operations; I guess Nvidia will move back to a 32-core-per-SIMD design with unified INT/FP16 cores with Lovelace. RDNA has 32 cores per SIMD, also with doubled throughput for FP16. Intel seems to use SIMD8 with 8 FP cores for its Xe GPU (with support for higher throughput at lower precision); maybe Intel will also add some kind of SIMD32 to couple 4 EUs into one compute unit.

So...

  • up to 2 Mnps per SIMD unit possible
  • up to 4x faster inference for NNUE possible
  • up to 160 parallel workers (SIMD units) on current high-end gpus

again, just some roughly estimated numbers, take them with a big grain of salt and the like...

if all of the above holds, then you get a hell of an NNUE monster on high-end gpus.

Zeta v099 already has a simple AB framework implemented, with ABDADA or, as an option, RMO Lazy SMP parallel search across SIMD units; hence the main part would then be to implement all those funny search extensions and tricks Stockfish does in an iterative way in Zeta for GPU - a full-time job ;)
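
For readers who wonder what "parallel search across SIMD units" means in practice, here a toy Lazy-SMP-style skeleton in CUDA, not Zeta's actual ABDADA/RMO implementation: every thread block stands in for one SIMD-unit worker, all workers search the same root with jittered depth and share one transposition table in global memory. All names and structures here are invented for illustration.

    // toy Lazy-SMP-style skeleton, invented for illustration, not Zeta's code
    #include <stdint.h>

    struct TTEntry { uint64_t key; int32_t score; int32_t depth; };

    __device__ int search_stub(uint64_t pos, int depth)
    {
        // placeholder: a real worker would run the iterative AB search here
        return (int)(pos % 100U) + depth;
    }

    __global__ void lazy_smp_root(uint64_t root, int base_depth,
                                  TTEntry *tt, int tt_size, int *best)
    {
        if (threadIdx.x != 0) return;              // one searcher per block in this toy
        int depth = base_depth + (blockIdx.x & 1); // workers jitter the search depth
        int score = search_stub(root, depth);

        // shared hash table, last write wins; a real TT would guard against
        // torn entries (e.g. lockless xor trick) and store proper bounds
        TTEntry e = { root, score, depth };
        tt[root % (uint64_t)tt_size] = e;

        atomicMax(best, score);                    // report best root score found
    }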

Followup:

I wrote 10 to 20 MB of L3 cache per SIMD unit, assuming the whole net should fit in cache; I doubt that this is common practice with NNUE on CPU, maybe the first layer with most of the weights resides in RAM for the incremental updates, and only the further layers get cached? Dunno.
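
For reference, the usual trick with the big first layer, as a hypothetical CUDA sketch with assumed sizes: the accumulator gets updated incrementally on each move, so only the two weight columns of the removed and the added feature are touched and the bulk of the first-layer weights does not have to be streamed through cache on every evaluation.

    // incremental first-layer (accumulator) update sketch, sizes are assumptions
    #include <stdint.h>

    #define ACC_SIZE 256   // accumulator width (assumed)

    __global__ void update_accumulator(int16_t *acc,        // ACC_SIZE entries
                                       const int16_t *w,    // features x ACC_SIZE
                                       int removed_feature, // feature index gone
                                       int added_feature)   // feature index new
    {
        // one thread per accumulator entry, only two weight columns are read
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < ACC_SIZE)
            acc[i] += w[added_feature * ACC_SIZE + i]
                    - w[removed_feature * ACC_SIZE + i];
    }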

2021-04-12 Followup:

  • I mixed up NNUE's INT16 first-layer weights with the INT8 weights of the further layers, so the possible 4x inference speedup holds only if we assume packed 8-bit vector math on gpu (see the sketch after this list).
  • I was not able to work out an efficient 8-bit 0x88 vector-based board representation with pen and paper, hence no 8-bit speedup for move gen in sight.

  • Even if I keep the current v099 bitboard design, a switch to 32 gpu threads as piece-wise workers may pay off, with certain architecture improvements of AMD's RDNA and increasing gpu clocks in mind; benchmarks will tell.
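
What "packed 8-bit vector math on gpu" could look like, as a hypothetical CUDA sketch: Nvidia's __dp4a intrinsic (sm_61 and up) does a 4-way INT8 dot product with INT32 accumulation per instruction. Layer sizes and names are assumptions, not the actual NNUE topology.

    // hypothetical INT8 layer sketch with __dp4a, 4 MACs per instruction
    #include <stdint.h>

    #define IN   32   // layer inputs (assumed), multiple of 4
    #define OUT  32   // layer outputs, one per thread of the warp (assumed)

    __global__ void dense_int8(const int8_t *w,    // OUT x IN weights
                               const int8_t *x,    // IN activations
                               int32_t *y)         // OUT INT32 accumulators
    {
        int o = threadIdx.x;                       // output neuron of this thread
        int acc = 0;
        const int *w4 = (const int *)(w + o * IN); // 4 packed INT8 weights per int
        const int *x4 = (const int *)x;
        for (int i = 0; i < IN / 4; i++)
            acc = __dp4a(w4[i], x4[i], acc);       // 4-way INT8 dot product
        y[o] = acc;
    }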