The most common way of gpgpu programming is simply to run thousands to millions gpu threads performing the same task, take input from global memory, VRAM, do some computation on local memory or registers, copy results back to VRAM.
Current Neural Network based chess engines do run millions of threads on gpu in this manner.
With Zeta v099 I tried a different mode, to view the gpu as a bunch of SIMD units. The parallel cores of one SIMD unit are coupled to one worker to work on the same chess node in parallel during move generation, move picking, and position evaluation. Different SIMD units work on the chess game tree via an classic parallel AlphaBeta search like ABDADA or Lazy SMP. The advantage of such an design is that you have only hundreds of workers (the amount of SIMD units) to feed an parallel search with, cos the more workers you run in AlphaBeta search, the less efficient the parallel game tree search performs.
With Zeta v097 and v098 I tried yet another mode. Current gpus are based on SIMT architecture, they are able to run multiple waves of Warps (Nvidia: 32 parallel SIMD threads) resp. Wavefronts (AMD: 64 parallel SIMD threads) on the same SIMD unit to hide memory latencies. So they are able to run a multitude of threads on a single core. With this mode I ran thousands of parallel gpu threads, each working on a single node, all working on the chess game tree search in parallel via an parallel Best-First-MiniMax search. The disadvantage of such an design is the low nodes-per-second performance of a single gpu thread, and the massive amount of parallel workers.
So, the question which remains open is how to run thousands to millions of threads, performing all the same computation, in an parallel game tree search.
I still do not know. As I already mentioned in an earlier post, I tried an LIFO-Stack based parallel search. Take one position from global stack, compute the children, put back the children onto global stack. But I could not implement AlphaBeta pruning effective, therefore some kind of linked list as data structure is in need, and my mind was not able to wrap around this in a SIMD friendly manner...but I guess such an approach could be one way to go to utilize a gpu for chess.
If I wish to keep the v099 design of Zeta, with classic parallel AlphaBeta, how could I improve the nps throughput per worker further?
The current board presentation is Bitboard, 64 bit based, this makes it easier to parallelize across the SIMD unit of a GPU. Current GPUs are 32 bit machines, and upcoming GPUs will probable support INT8, 8 bit integer, math with higher throughput, so you can do four INT8 operations per cycle instead of one 32 bit. Further I used the most simple parallelization of Bitboards for SIMD during move generation and evaluation, square-wise, so the engine runs 64 times, per square, the same code. Current GPU architectures (Turing/RDNA) have 32 cores per SIMD unit, so I need to run the square-wise code over two waves on the SIMD unit.
If I change to some kind of vectorized 8 bit move generation and evaluation to use INT8 optimized math, I could achieve a ~8x speedup, cos 64 bit operations need multiple cycles on 32 bit hardware. If I switch further from an square-wise parallelization to some kind of piece-wise, or better direction-wise, I could achieve at least a further 2x speedup.
Of course, these are numbers on paper, in practice there is always a trade-off, and one has to consider Amdahl's law, but it seems to me that this could be one way to go.
It works, Zeta v099 plays decent chess, with an classic parallel AlphaBeta approach, and I am convinced that with some further work it could reach more than 3000 CCRL Elo on an highend gpu.
But the obvious thing is, it lacks nps throughput per worker, the single thread performance is too low, and even with an better parallel search, there is not much to gain on massive parallel systems with more than 128 workers.
So to be able to beat the top 10 chess engines out there, the nps throughput per worker must be increased ten or twenty fold...
during early development I tried an design based on an LIFO-Stack parallel search. It had the best nps throughput of all my designs, but I was not able to implement AlphaBeta pruning efficient, so the speed gain was lost again during pruning.
If I had to start over, and make another Zeta version, I would try the LIFO-Stack based parallel search again...
Zeta v099m released as source and Linux/Windows 64 bit binary:
Please consider the README file or --help option before running the engine.
From the changelog:
Zeta (099m) alpha; urgency=medium * patch for ABDADA parallel search * disabled RMO parallel search * removed max device memory limitation * mods in time control * cleanups * * Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps * Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps -- Srdja Matovic 13 Jul 2019
Here some nps and search scaling results...
################################################################################ # Zeta 099m, startposition, depth 12, best of 4 runs, Nvidia V100: # tt1: 2048 MB, tt2: 1536 MB # ### workers #nps #nps speedup #time in s #ttd speedup #relative ttd ### 1 86827 1.000000 156.586000 1.000000 1.000000 ### 2 180282 2.076336 55.749000 2.808768 2.808768 ### 4 356910 4.110588 35.564000 4.402936 1.567568 ### 8 704741 8.116611 19.637000 7.974029 1.811071 ### 16 1385758 15.959989 14.583000 10.737571 1.346568 ### 32 2786039 32.087242 11.124000 14.076411 1.310949 ### 64 5460849 62.893443 8.838000 17.717357 1.258656 ### 128 10235993 117.889516 7.377000 21.226244 1.198048 ### 160 11639290 134.051505 7.202000 21.742016 1.024299
Zeta v099k did not scale well on Nvidia Pascal and Turing gpus, so I wrote a patch to fix this issue, and released Zeta v099l:
On Pascal it runs now 4 workers per Compute Unit and on Turing 2 workers per Compute Unit during guessconfigx.
According to Nvidia papers, Turing should have 16 wide SIMD units, with four units per Compute Unit, but according to my tests I can only speculate that the integer units are 32 wide, not 16, with two of them per Compute Unit.
During benchmarks on other systems it was shown again that some Windows OS have an OS gpu timeout, so you may want to apply this registry update on your Windows machine:
Download, double-click and reboot OS to increase gpu timeout from 2 to 20 seconds.
If you want to run an SMP benchmark for your gpu, I suggest to increase the gpu timeout to 400 seconds:
Currently there is much going on with neural networks for chess. With Giraffe, AlphaZero, and its open source adaptation LC0 (Leela Chess Zero), it was shown that, with enough horse power, artificial neural networks are competitive in computer chess.
Currently LC0 uses an MCTS, Monte-Carlo Tree Search, approach with GPU as neural network accelerator for position evaluation.
My own experiments showed that AlphaBeta search is superior to MCTS, but current GPU architectures suffer from host-device latency, so you have to couple tasks to batches to be executed in one run on the GPU, not that conform with the serial nature of AlphaBeta.
With upcoming GPGPU architectures (or ANN accelerators) with less latency there might be AlphaBeta ANN engines possible...
my GPU workstation went broken, so I decommissioned my Maschina, time to say bye bye to my workhorse from 2008, the Nvidia 8800 GT
and the Asus Crosshair Formula II
and the 8 GB GeIL Black Dragon Memory
not to forget, the water-cooled 'Beast' from 2015, AMD Fury X with 4096 cores, 8 TFLOPS and 4 GB HBM
...we had a lot of fun, may you find a new, worthy owner on eBay.
I fixed some issues in Zeta Dva and Zeta, source code and binaries are online again
Please consider the README file or --help option before running the Zeta engine on GPU.
I lost the source of Zeta Vintage, and an attempt to do an rewrite in C showed again that the 6502 processor should really be programmed in assembly, so a rewrite in 6502 assembly is still on my bucket list...
I finished my current run on Zeta v099, my experimental gpu chess engine.
The actual conclusion of the current iteration is, that an simple engine, with standard chess programming techniques, can be ported to OpenCL to run on a gpu, but it would take more effort to make the engine competitive in terms of computed nodes per second (speed), heuristics (expert knowledge), and scaling (parallel search algorithm).
Computer Chess, as an computer science topic, evolved over decades, starting in the 40s and 50s, and reached one peak 1997 with the match Deep Blue vs. Kapsarow. Nowadays chess engines are tuned by playing thousands and thousands of games, so to get an chess playing engine running on the gpu and to get an competitive chess playing engine running on the gpu are two different tasks.
It was in the news, Google's AlphaGo won against the European Champion Fan Hui in the game of GO...another frontier is fallen to computer domination.
The question if such an attempt with deep neural networks works also for chess was answered by Matthew Lai in his Master Thesis with his chess engine Giraffe, which reached the level of an FIDE International Master (about 2400 Elo), an astounding achievement considering only 4 month of work....
...so, when are we going to see AlphaChess Mr. Lai? :-)
Giraffe: Using Deep Reinforcement Learning to Play Chess by Matthew Lai, 2015
Mastering the Game of Go with Deep Neural Networks and Tree Search by Google Deepmind, 2016
Learning to Play the Game of Chess by Sebastian Thrun, 1995
NeuroChess by Sebastian Thrun on CPW
To port an classic chess engine approach with an parallel Alphabeta algorithm like YBWC to an GPU architecutre would take a significant bunch of time, if it is even possible to port all well known computer chess techniques straight forward. And it is questionable if an Elo gain, by more computed nodes per second, is eaten up again by an higher branchingfactor due to an simpler implementation.
Zeta 098 and 097 make use of an Randomized Best First MiniMax Search, but my implementation makes excessive use of Global Memory and scales poorly.
At the very beginning of the project it was clear, that an Monte Carlo Tree Search would fit best for gpus. But until now there is no known engine that could make MCTS work well for Chess.
What is left, except to try to port an classic approach?
I could improve the performance of the BestFist search significantly by switching from GlobalMemory to LocalMemory and i could remove the randomness...another alternative would be to switch to MCAB, Monte Carlo Alphabeta...
- It looks like Monte Carlo Tree Search gives the best speedups compared to an CPU implementation.
- The Node Based Parallel Search is an hybrid approach that offloads computational tasks to the GPU.
- MiniMax search can be parallelized on the GPU, but is inferior to AlphaBeta.
- Speedup of parallel AlphaBeta implementations depend on the branching factor of the Game.
So far i have found nothing about an implementation that makes use of the recursive features of newer architectures.
Linklist of papers related to Game Tree Seach on GPUs for two-player zero-sum games...
Parallel Game Tree Search on SIMD Machines
Holger Hopp and Peter Sanders (1995), citeseerx pdf
Note - Implementation of YBWC, parallel AlphaBeta, on an 16K SIMD machine for an synthetic game tree.
Efficiency of Parallel Minimax Algorithm for Game Tree Search
Plamenka Borovska, Milena Lazarova (2007), citeseerx pdf
Note - Efficiency of AlphaBeta search for 4x4 TicTacToe via MPI and OpenMP on an CPU cluster.
GPU-Accelerated program to play Go
Zachary Clifford (2009), pdf
Note - Implementation of MCTS for Go.
Playing Zero Sum Games on the GPU
Avi Bleiweiss (2010), pdf
Note - Multliple Game Tree Searches on GPU.
Parallel Minimax Tree Searching on GPU
Kamil Rocki and Reiji Suda (2010), pdf
Note - Implementation of MinixMax for Reversi.
Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks
Damian Sulewski (2011), link to pdf
Note - Chapter 4 - GPUSSD-BFS - A GPU and SSD supported Breadth-First Search.
Parallel Game Tree Search Using GPU
L’ubomír Lackovi (2011), pdf
Note - Implementation of parallel search for Czech Draughts.
Parallel Monte Carlo Tree Search on GPU
Kamil Rocki and Reiji Suda (2011) pdf
Note - Implementation of MCTS for Reversi.
Parallel alpha-beta algorithm on the GPU
Damjan Strnad and Nikola Guid (2011), scholar google pdf
Note - Implementation of PV-Split, parallel AlphaBeta, for Reversi.
A Node-based Parallel Game Tree Algorithm Using GPUs
Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012) IEEE
Note - Implementation of node-based parallelism for Connect6.
Parallel UCT Search on GPUs
Nicolas A. Barriga, Marius Stanescu, Michael Buro (2014), IEEE
Note - Implementation of MCTS with UCT for 8x8 Ataxx.
A Review on Parallelization of Node based Game Tree Search Algorithms on GPU
Ms. Rutuja U. Gosavi, Prof. Payal S. Kulkarni (2014), pdf
Parallelization of Node Based Game Tree Search Algorithm on GPU
Ms. Rutuja U. Gosavi, Mrs. Archana S. Vaidya (2015), pdf
Note - Implementation of node-based parallelism for Connect4/Connect6.
- One SIMD Unit - One Board
To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.
- Use of Local Memory* instead of Global Memory
The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory, less Warps (resp. Wavefronts) are in need to hide Global Memory latency.
- Hundreds of Work-Groups instead Thousands of Threads
YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs. So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.
* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.
On this blog you will find information about my computer chess projects...
Zeta Dva - a classic, amateur level, computer chess program.
Zeta (Zeta OpenCL Chess) - an experimental chess engine written in OpenCL to run on GPUs.
Zeta Vintage - a chess engine for the Atari 800 XE (6502 processor).
Zeta AI - an attempt to make use of ANNs in computer chess.
To make it short, Zeta v097 and v098 make excessive use of the slower Global Memory, therefore the computed nodes per second scale linear with the memory bandwidth and not with the amount of cores or their clock rates.
Here some alternative algorithms to plain MiniMax AlphaBeta search...
- Best-first minimax search by Richard E. Korf and David Maxwell Chickering, http://maxchickering.com/publications/aij96.pdf
- CNS, Conspiracy Number Search
- Monte Carlo Tree Search, with or without UCT extension.
- Nagging, http://www.academia.edu/
- Parallel Game Tree Search on SIMD machines by Holger Hopp and Peter Sanders, http://algo2.iti.kit.edu/sanders/papers/gamet.ps.gz
- Parallel Randomized Best-First Minimax Search by Yaron Shoham and Sivan Toledo, http://www.cs.tau.ac.il/~stoledo/Pubs/rbf-ai.pdf
- SIMT architecture of GPUs
GPUs consists of tens to hundreds of SIMD or Vector Units that process multiple threads in multiple Warps or Wavefronts in SIMT fashion.
- Memory architecture of GPUs
To hide latency of Global Memory (VRAM) GPUs can run multiple Warps or Wavefronts and prefer to do computation by the use of Local or Private Memory. So, the more Work-Items and Work-Groups you run to hide latency, the less Local and Private Memory per thread will be available.
- Thousands of threads on GPUs
MiniMax search with Alpha-Beta pruning performs best serial, not parallel.
* edit on 2015-03-30 *
Here an overview of what happened before....
- patch for ABDADA parallel search
- disabled RMO parallel search
- removed max device memory limitation
- mods in time control
- Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps
- Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps
-- Srdja Matovic 13 Jul 2019
- patch for parallel search scaling
- max device memory increased from 1 GB to 16 GB
-- Srdja Matovic Jun 2019
Zeta (099h to 099k)
- fixes n cleanups
- switch from Lazy SMP to ABDADA parallel search
- added IID - Internal Iterative Deepening
- one cl file for all gpu generations with inlined optimizations
- Zeta 099k on AMD Radeon R9 Fury X, 256 workers, ~ 7.6 Mnps
- Zeta 099k on Nvidia GeForce GTX 750, 16 workers, ~ 800 Knps
- Zeta 099k on AMD Radeon HD 7750, 32 workers, ~ 700 Knps
- Zeta 099k on Nvidia GeForce 8800 GT, 14 workers, ~ 110 Knps
-- Srdja Matovic 2018
Zeta (099b to 099g)
- switch from KoggeStone based move generation to Dumb7Fill
- added atomic features for different gpu generations
-- Srdja Matovic 2017
- switch from best first minimax search to parallel alphabeta (lazy smp)
- ported all (except IID) search techniques from Zeta Dva v0305 to OpenCL
- ported the evaluation function of Zeta Dva v0305 to OpenCL
- vectorized and generalized 64 bit Kogge-Stone move generator
- 64 threads are now coupled to one worker, performing move generation,
move picking and evaluation, square-wise, in parallel on the same node
- portability over performance, should run on the very first gpus with
OpenCL 1.x support (>= 2008)
-- Srdja Matovic 2017
Zeta (098d to 098g)
- mostly cleanup and fixes
- restored simple heuristics from Zeta Dva (~2000 Elo on CCRL) engine
- protocol fixes
- fixed autoconfig for AMD gpus
- switched to KoggeStone based move generator
- switched to rotate left based Zobrist hashes
- switched to move picker
- switched to GPL >= 2
- Zeta 098e on Nvidia GeForce GTX 580, ca. 6 Mnps, est. 1800 Elo on CCRL
- Zeta 098e on AMD Radeon HD 7750, ca. 1 Mnps
- Zeta 098e on AMD Phenom X4, ca. 1 Mnps
- Zeta 098e on Nvidia GeForce 8800 GT, ca. 500 Knps
-- Srdja Matovic 2016
Zeta (098a to 098c)
- Improved heuristics, partly ported from the Stockfish chess engine
- AutoConfig for OpenCL devices
- Parameter tuning
- Zeta 098c on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 2000 Elo on CCRL
- Zeta 098c on AMD Radeon R9 290, ca. 3.2 Mnps
-- Srdja Matovic Aug 2013
Zeta (097a to 097z)
- Implementation of an BestFirstMiniMax search algorithm with UCT parameters for parallelization
- Zeta 097x on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 1800 Elo on CCRL
- Zeta 097x on AMD Radeon HD 7750, ca. 800 Knps
-- Srdja Matovic Jan 2013
Zeta (0930 to 0960)
- Tested Monte Carlo Tree Search without UCT across multiple Compute Units of the GPU
- Tested LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
- Tested the 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
- Tested 'RBFMS', Randomized BestFirstMiniMax Search, a parallelized version of BestFirstMinixMax, across multiple Compute Units of the GPU
-- Srdja Matovic 2012
Zeta (0915 to 0918)
- 64 bit Magic Bitboard Move Generator running
- AlphaBeta search algorithm with 'SPPS'-parallelization running 128 threads on one Compute Unit of the GPU
-- Srdja Matovic 2011
Zeta (0900 to 0910)
- Tested 32 bit 0x88 and 64 bit Magic Bitboard Move Generator
- Ported Heuristics, the Evaluation Function, from CPU engine 'ZetaDva' (~2000 Elo) to OpenCL
-- Srdja Matovic 2010
* updated on 2019-07-13 *
Zeta and Zeta Dva support only some basic Xboard protocol commands and some users have reported problems with the configuration and interface of the last Zeta versions. So i will publish the source code again when these parts are more user friendly designed and tested for Windows Chess-GUIs like Winboard or Arena.