Haha, okay, did some basic benchmarks with my v099 Bitboard 64 gpu-threads per worker design, I am stuck on 100 Knps to max 200 Knps per worker, w/o any NNUE implementation, far too slow to compete with NNUE engines on CPUs...
I took a look at the RDNA white paper and I pretty much like the architecture.
One SIMD unit has 32 cores, this would fit for a piece-wise move generator, it clocks up to 2.5 GHz, and the cache hierarchy of RDNA2 seems to fit NNUE networks with 10 million weights with 20 MB.
I am yet not sure if the caches are for textures only or if they can be used for program data, and the latencies are according to some benchmarks about an order of magnitude higher than on CPUs, hence it remains open how the NNUE inference will perform.
The scratch-pad memory (LDS) shared across one work-group is 32KB, this is enough for me to store the iterative search var stacks, constant cache of 16KB for the lookup-tables, the L0 cache is 16KB, L1 128KB, L2 4MB, and L3 varies from 16 to 128 MB. If I use a short2 data-type for the first NNUE layer and char4 for the further this should fit across the L0 to L3 caches. Alternative is to store the INT8 weights in the vector registers file, 128KB per SIMD.
Still not sure if 8-bit vector packed math is supported via OpenCL, to speed up NN inference.
I think it is possible to add the new neural network technique 'NNUE' to Zeta for upcoming GPU architectures like Nvidia Lovelace, Intel Xe and AMD RDNA3 which probably will all have support of INT8, 8 bit integer, math with higher throughput and maybe some 10 to 20 MB L3 cache per SIMD unit for the network weights file.
With INT8 optimized datatypes and instructions, one could build an vectorized 8 bit 0x88 move generator which operates over the 8 directions as vector and with 32 parallel gpu threads of one SIMD unit handles all pieces at once. Maybe reaching 1 to 2 million nodes per second per SIMD unit in an Zeta engine like framework.
With 32 SIMD gpu threads performing 32xFP32 or 32x2xFP16 operations per clock the NNUE inference performance could be 2 to 4 times faster than with current NNUE on CPUs with AVX2 (roughly estimated), considering a switch from integer to float weights.
Volta/Turing/Ampere have currently 16 cores per FPU SIMD and support doubled throughput for FP16 operations, I guess Nvidia will move with Lovelace back again to an 32 core per SIMD design with unified INT/FP16 cores. RDNA has 32 cores per SIMD, also with doubled throughput for FP16. Intel seems to use SIMD8 with 8 FP cores for its Xe GPU (with support for higher throughput for lower precision), maybe Intel will also add some kind of SIMD32, to couple 4 EUs to one compute unit.
- up to 2 Mnps per SIMD unit possible
- up to 4x faster inference for NNUE possible
- up to 160 parallel workers (SIMD units) on current highend-gpus
again, just some rough numbers estimated, big grain of salt and alike...
if the above all holds, then you get a hell of NNUE monster on highend-gpus.
Zeta v099 already has an simple AB framework implemented with ABDADA or as option RMO Lazy SMP parallel search across SIMD units, hence the main part would then be to implement all those funny search extensions and tricks Stockfish does in an iterative way in Zeta for GPU - full time job ;)
I wrote 10 to 20 MB L3 cache per SIMD unit, assuming the whole net should fit in cache, doubt that this is common practice with NNUE on CPU, maybe the first layer with most of the weights resists in RAM for the incremental updates, and the further layers only get cached? Dunno.
- I mixed up NNUE first layer INT16 and further INT8 weights, so the possible 4x inference speedup holds only if we assume 8-bit vector packed math on gpu.
I was not able to implement an efficient 8-bit 0x88 vector-based board representation on pen n paper, hence no 8-bit speedup for move gen in sight.
- Even if I keep the current v099 bitboard design a switch to 32 gpu-threads piece-wise worker may pay off with certain architecture improvements of AMD's RDNA and increasing gpu clocks in mind, benchmarks will tell.
Currently there is much going on with neural networks for chess. With Giraffe, AlphaZero, and its open source adaptation LC0 (Leela Chess Zero), it was shown that, with enough horse power, artificial neural networks are competitive in computer chess.
Currently LC0 uses an MCTS, Monte-Carlo Tree Search, approach with GPU as neural network accelerator for position evaluation.
My own experiments showed that AlphaBeta search is superior to MCTS, but current GPU architectures suffer from host-device latency, so you have to couple tasks to batches to be executed in one run on the GPU, not that conform with the serial nature of AlphaBeta.
With upcoming GPGPU architectures (or ANN accelerators) with less latency there might be AlphaBeta ANN engines possible...
On this blog you will find information about my computer chess projects...
Zeta Dva - a classic, amateur level, computer chess program.
Zeta (Zeta OpenCL Chess) - an experimental chess engine written in OpenCL to run on GPUs.
Zeta Vintage - a chess engine for the Atari 800 XE (6502 processor).
Zeta AI - an attempt to make use of ANNs in computer chess.
It was in the news, Google's AlphaGo won against the European Champion Fan Hui in the game of Go...another frontier is fallen to computer domination.
The question if such an attempt with deep neural networks works also for chess was answered by Matthew Lai in his Master Thesis with his chess engine Giraffe, which reached the level of an FIDE International Master (about 2400 Elo), an astounding achievement considering only 4 month of work....
...so, when are we going to see AlphaChess Mr. Lai? :-)
Giraffe: Using Deep Reinforcement Learning to Play Chess by Matthew Lai, 2015
Mastering the Game of Go with Deep Neural Networks and Tree Search by Google Deepmind, 2016
Learning to Play the Game of Chess by Sebastian Thrun, 1995
NeuroChess by Sebastian Thrun on CPW