Hmm, I thought I was done with collecting GPU architectures and could move on to cloud computing in 2019 for further development, but it did not work out, meanwhile you have to talk through the sales department of all big cloud players before you get access to their GPUs, and, mostly they offer only Nvidia server brands to rent, no AMD, no Intel. Hence I did set up a lil Maschina again for GPGPU development. Intel i5-6500 (Skylake 14nm with AVX2 from 2015) with HD 530 graphics, Nvidia GeForce GTX 750 (Maxwell arch) and have yet to purchase some AMD GCN, maybe an used Radeon HD 7750 or even HD 8570. These are all outdated and low-end GPUs, but meanwhile I have enough benchmarks to inter/extrapolate Zeta's performance across different models and architectures, I just need some entry-level hardware to test some ideas, it should then scale up on newer and bigger models...
With Zeta v099 I explored three layers of parallelism for Chess on a GPU:
- 4x|8x direction-wise parallel vector-processing of Bitboards for move generation
- 64x square-wise parallel processing for move generation, move picking and evaluation
- 256x worker-wise parallel AlphaBeta game tree search
I did not succeed to utilize 8-bit vector packed math for a combined piece-wise+direction-wise move generation, and with Zeta NNUE in pipe it has yet to be seen if a 32x piece-wise Bitboard move generator + NNUE eval on a GPU SIMD unit approach can boost Zeta's nodes per second throughput enough to compete with NNUE engines running on CPUs.
I guess it is possible to add a 4th layer for multiple GPUs, to run a kind of PVS (Principal Variation Splitting) on CPU host to a certain ply and offload to up to 4 GPUs to compute the sub-trees, each with own parallel AlphaBeta.
I doubt a 5th layer for distributed computing across multiple nodes would make any sense considering the massive amount of parallel workers in a single node with several GPUs, the latency and communication overhead across nodes, and the effective branching factor of ~2 of modern AlphaBeta chess engines.
I took a look at the RDNA white paper and I pretty much like the architecture.
One SIMD unit has 32 cores, this would fit for a piece-wise move generator, it clocks up to 2.5 GHz, and the cache hierarchy of RDNA2 seems to fit NNUE networks with 10 million weights with 20 MB.
I am yet not sure if the caches are for textures only or if they can be used for program data, and the latencies are according to some benchmarks about an order of magnitude higher than on CPUs, hence it remains open how the NNUE inference will perform.
The scratch-pad memory (LDS) shared across one work-group is 32KB, this is enough for me to store the iterative search var stacks, constant cache of 16KB for the lookup-tables, the L0 cache is 16KB, L1 128KB, L2 4MB, and L3 varies from 16 to 128 MB. If I use a short2 data-type for the first NNUE layer and char4 for the further this should fit across the L0 to L3 caches. Alternative is to store the INT8 weights in the vector registers file, 128KB per SIMD.
Still not sure if 8-bit vector packed math is supported via OpenCL, to speed up NN inference.
Alright, I activated my time machine and looked over >10 years old threads on Chess on a GPU. Chess programmers are really smart guys and as soon as there were GPGPU frameworks they started to ponder on how to use a GPU for Chess. So what was the opinion back then?
- CPU-GPU latency
- SIMD architecture
- SIMT architecture
- VRAM memory latency
- integer performance
- no recursion
- synching of gpu-threads
I offload the whole search with move generation and position evaluation onto GPU. The host runs just an ID-loop (iterative-deepening) for time control.
I couple 64 gpu-threads to one worker, used for move generation, move picking, and position evaluation, square-wise, in parallel.
I achieve ~50% VALU utilization with one worker running on one SIMD unit, could be better, but no need to run multiple waves of SIMT.
VRAM memory latency
I lowered the global memory footprint by the use of registers for the board representation, scratch-pad memory for the iterative search stacks, constant memory for pre-calculated attack-tables and alike, and VRAM for Transposition Tables only.
I implemented a 64-bit Bitboard based board representation and move generator, did not succeed to switch to something with 32-bit or even 8-bit. 64-bit integer opreations need multiple cycles on 32-bit hardware, could be better, could be worse, with Bitboards it is easier to do SIMD-friendly arithmetics, other designs rely on loops or branches for move generation.
I implemented a selective AlphaBeta search in an iterative way.
synching of gpu-threads
I implemented a synchless parallel game tree search (RMO), communication is done via a Shared Hash Table in VRAM for hundreds of workers.
As I mentioned already, to get an chess playing engine running on GPU and to get an competitive engine running on GPU are two different tasks. I worked on the topic of parallel search, and with an NNUE implementation in pipe also on chess heuristics, what is left is the speed (the computed nodes per second of a single worker) and the selective part of the search algorithm. Zeta v099 is yet a quite simple implementation of an chess engine, if I add NNUE for chess position evaluation, there is still the branching-factor of >3 which eats up possible speed gains. If we compare one SIMD unit of a GPU for one single Zeta worker and a single CPU core, we have 32 SIMD 32-bit cores @~1.5GHz vs. 4 CPU 64-bit ALUs@~4GHz, we have roughly the same arithmetic throughput, but I did not succeed to match the nodes-per-second performance of a single CPU core, Zeta still lacks 10 to 20 fold behind, and has to catch up in Elo via hundreds of workers in an parallel search.
The most common way of gpgpu programming is simply to run thousands to millions gpu threads performing the same task, take input from global memory, VRAM, do some computation on local memory or registers, copy results back to VRAM.
Current Neural Network based chess engines do run millions of threads on gpu in this manner.
With Zeta v099 I tried a different mode, to view the gpu as a bunch of SIMD units. The parallel cores of one SIMD unit are coupled to one worker to work on the same chess node in parallel during move generation, move picking, and position evaluation. Different SIMD units work on the chess game tree via an classic parallel AlphaBeta search like ABDADA or Lazy SMP. The advantage of such an design is that you have only hundreds of workers (the amount of SIMD units) to feed an parallel search with, cos the more workers you run in AlphaBeta search, the less efficient the parallel game tree search performs.
With Zeta v097 and v098 I tried yet another mode. Current gpus are based on SIMT architecture, they are able to run multiple waves of Warps (Nvidia: 32 parallel SIMD threads) resp. Wavefronts (AMD: 64 parallel SIMD threads) on the same SIMD unit to hide memory latencies. So they are able to run a multitude of threads on a single core. With this mode I ran thousands of parallel gpu threads, each working on a single node, all working on the chess game tree search in parallel via an parallel Best-First-MiniMax search. The disadvantage of such an design is the low nodes-per-second performance of a single gpu thread, and the massive amount of parallel workers.
So, the question which remains open is how to run thousands to millions of threads, performing all the same computation, in an parallel game tree search.
I still do not know. As I already mentioned in an earlier post, I tried an LIFO-Stack based parallel search. Take one position from global stack, compute the children, put back the children onto global stack. But I could not implement AlphaBeta pruning effective, therefore some kind of linked list as data structure is in need, and my mind was not able to wrap around this in a SIMD friendly manner...but I guess such an approach could be one way to go to utilize a gpu for chess.
Currently there is much going on with neural networks for chess. With Giraffe, AlphaZero, and its open source adaptation LC0 (Leela Chess Zero), it was shown that, with enough horse power, artificial neural networks are competitive in computer chess.
Currently LC0 uses an MCTS, Monte-Carlo Tree Search, approach with GPU as neural network accelerator for position evaluation.
My own experiments showed that AlphaBeta search is superior to MCTS, but current GPU architectures suffer from host-device latency, so you have to couple tasks to batches to be executed in one run on the GPU, not that conform with the serial nature of AlphaBeta.
With upcoming GPGPU architectures (or ANN accelerators) with less latency there might be AlphaBeta ANN engines possible...
Hmm, my GPU workstation went broken, so I decommissioned my Maschina, time to say bye bye to my workhorse from 2008, the Nvidia 8800 GT
and the Asus Crosshair Formula II
and the 8 GB GeIL Black Dragon Memory
not to forget, the water-cooled 'Beast' from 2015, AMD Fury X with 4096 cores, 8 TFLOPS and 4 GB HBM
...we had a lot of fun, may you find a new, worthy owner on eBay.
- One SIMD Unit - One Board
To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.
- Use of Local Memory* instead of Global Memory
The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory, less Warps (resp. Wavefronts) are in need to hide Global Memory latency.
- Hundreds of Work-Groups instead Thousands of Threads
YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs. So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.
* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.
- SIMT architecture of GPUs
GPUs consists of tens to hundreds of SIMD or Vector Units that process multiple threads in multiple Warps or Wavefronts in SIMT fashion.
- Memory architecture of GPUs
To hide latency of Global Memory (VRAM) GPUs can run multiple Warps or Wavefronts and prefer to do computation by the use of Local or Private Memory. So, the more Work-Items and Work-Groups you run to hide latency, the less Local and Private Memory per thread will be available.
- Thousands of threads on GPUs
MiniMax search with Alpha-Beta pruning performs best serial, not parallel.
* edit on 2015-03-30 *
- GPU Chess by Hieronymus
- Parallelis by Philippe Vigier
- GPU Hex by Daniel Shawul
- Chess on GPGPU by Samuel
- perft gpu by Ankan Banerjee
- NIKOLA Chess
- WeeChess by Ewan Crawford
- Pigeon by Stuart Riffle
- LC0 - Leela Chess Zero
- Tinsmith by Graham Jones
- Eta by Srdja Matovic
- Gigantua by Daniel Infuehr
* updated on 2022-03-24 *