Zeta Chess

Zeta - v099 revisited

It works: Zeta v099 plays decent chess with a classic parallel AlphaBeta
approach, and I am convinced that with some further work it could reach more
than 3000 CCRL Elo on a high-end gpu.

But the obvious weakness is nps throughput per worker: the single-thread
performance is too low, and even with a better parallel search there is not
much to gain on massively parallel systems with more than 128 workers.

So to be able to beat the top 10 chess engines out there, the nps throughput per
worker must be increased ten or twenty fold...

During early development I tried a design based on a LIFO-Stack parallel
search. It had the best nps throughput of all my designs, but I was not able to
implement AlphaBeta pruning efficiently, so the speed gain was lost again during
pruning.

If I had to start over, and make another Zeta version, I would try the LIFO-
Stack based parallel search again...

Zeta v099m

Zeta v099m released as source and Linux/Windows 64 bit binary:

https://github.com/smatovic/Zeta/releases

Alternative downloads:

https://zeta-chess.app26.de/downloads/

Please consult the README file or the --help option before running the engine.

From the changelog:

Zeta (099m) alpha; urgency=medium

* patch for ABDADA parallel search
* disabled RMO parallel search
* removed max device memory limitation
* mods in time control
* cleanups
*
* Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps
* Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps

-- Srdja Matovic 13 Jul 2019

Here are some nps and search scaling results:

################################################################################
# Zeta 099m, startposition, depth 12, best of 4 runs, Nvidia V100:
# tt1: 2048 MB, tt2: 1536 MB
#
### workers #nps          #nps speedup   #time in s   #ttd speedup   #relative ttd
### 1       86827         1.000000       156.586000   1.000000       1.000000 
### 2       180282        2.076336       55.749000    2.808768       2.808768 
### 4       356910        4.110588       35.564000    4.402936       1.567568 
### 8       704741        8.116611       19.637000    7.974029       1.811071 
### 16      1385758       15.959989      14.583000    10.737571      1.346568 
### 32      2786039       32.087242      11.124000    14.076411      1.310949 
### 64      5460849       62.893443      8.838000     17.717357      1.258656 
### 128     10235993      117.889516     7.377000     21.226244      1.198048 
### 160     11639290      134.051505     7.202000     21.742016      1.024299 
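
The derived columns in the table can be recomputed from the measured nps and time values. The following is a minimal Python sketch, assuming that "nps speedup" is nps divided by the 1-worker nps, "ttd speedup" is the 1-worker time divided by the n-worker time, and "relative ttd" is the previous row's time divided by the current row's time. The names RUNS and scaling_rows are made up for this example, and the efficiency column (ttd speedup per worker) is an addition that does not appear in the original table.

```python
# Measurements copied from the Zeta 099m table above: (workers, nps, time in s).
RUNS = [
    (1, 86827, 156.586),
    (2, 180282, 55.749),
    (4, 356910, 35.564),
    (8, 704741, 19.637),
    (16, 1385758, 14.583),
    (32, 2786039, 11.124),
    (64, 5460849, 8.838),
    (128, 10235993, 7.377),
    (160, 11639290, 7.202),
]


def scaling_rows(runs):
    """Return (workers, nps speedup, ttd speedup, relative ttd, efficiency)."""
    _, base_nps, base_time = runs[0]
    prev_time = base_time
    rows = []
    for workers, nps, time_s in runs:
        nps_speedup = nps / base_nps        # throughput vs. 1 worker
        ttd_speedup = base_time / time_s    # time to depth vs. 1 worker
        relative_ttd = prev_time / time_s   # gain vs. the previous row
        efficiency = ttd_speedup / workers  # extra column, not in the table
        rows.append((workers, nps_speedup, ttd_speedup, relative_ttd, efficiency))
        prev_time = time_s
    return rows


ROWS = scaling_rows(RUNS)
for row in ROWS:
    print("%4d  %10.6f  %10.6f  %9.6f  %6.3f" % row)
```

Read this way, the nps speedup stays close to linear (about 134x at 160 workers), while the time-to-depth speedup at 160 workers is about 21.7x, which is roughly 14% efficiency per worker.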

Zeta v099l

Zeta v099k did not scale well on Nvidia Pascal and Turing gpus, so I wrote a
patch to fix this issue, and released Zeta v099l:

https://github.com/smatovic/Zeta/releases

On Pascal it now runs 4 workers per Compute Unit and on Turing 2 workers per
Compute Unit during guessconfigx.

According to Nvidia papers, Turing should have 16 wide SIMD units, with four
units per Compute Unit, but according to my tests I can only speculate that the
integer units are 32 wide, not 16, with two of them per Compute Unit.

Benchmarks on other systems showed again that some Windows versions enforce a
gpu timeout, so you may want to apply this registry update on your Windows
machine:

https://zeta-chess.app26.de/downloads/SetWindowsGPUTimeoutTo20s.reg

Download the file, double-click it and reboot to increase the gpu timeout from 2 to 20 seconds.

If you want to run an SMP benchmark for your gpu, I suggest increasing the gpu
timeout to 400 seconds:

https://zeta-chess.app26.de/downloads/SetWindowsGPUTimeoutTo400s.reg

Zeta - Source Code and Binaries online

I fixed some issues in Zeta Dva and Zeta; source code and binaries are online again:

https://github.com/smatovic/ZetaDva/releases

https://github.com/smatovic/Zeta/releases

Please consult the README file or the --help option before running the Zeta engine on a GPU.

I lost the source of Zeta Vintage, and an attempt at a rewrite in C showed again that the 6502 processor should really be programmed in assembly, so a rewrite in 6502 assembly is still on my bucket list...

https://github.com/smatovic/ZetaVintage

 

YBWC vs. RBFMS vs. MCTS vs. MCAB

Porting a classic chess engine approach with a parallel Alphabeta algorithm like YBWC to a GPU architecture would take a significant amount of time, if it is even possible to port all the well-known computer chess techniques in a straightforward way. It is also questionable whether an Elo gain from more computed nodes per second would be eaten up again by a higher branching factor due to a simpler implementation.

Zeta 098 and 097 make use of a Randomized Best First MiniMax Search, but my implementation makes excessive use of Global Memory and scales poorly.

At the very beginning of the project it was clear that a Monte Carlo Tree Search would fit gpus best. But until now there is no known engine that could make MCTS work well for chess.

What is left, except to try to port an classic approach?

I could improve the performance of the BestFirst search significantly by switching from Global Memory to Local Memory, and I could remove the randomness. Another alternative would be to switch to MCAB, Monte Carlo Alphabeta...

Zeta v099

I finished my current run on Zeta v099, my experimental gpu chess engine.

https://github.com/smatovic/Zeta

The conclusion of the current iteration is that a simple engine, with standard chess programming techniques, can be ported to OpenCL to run on a gpu, but it would take more effort to make the engine competitive in terms of computed nodes per second (speed), heuristics (expert knowledge), and scaling (parallel search algorithm).

Computer chess, as a computer science topic, evolved over decades, starting in the 40s and 50s, and reached one peak in 1997 with the match Deep Blue vs. Kasparov. Nowadays chess engines are tuned by playing thousands and thousands of games, so getting a chess playing engine running on the gpu and getting a competitive chess playing engine running on the gpu are two different tasks.

How Computer Chess Engines could run on GPUs

  1. One SIMD Unit - One Board
    To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position, for instance to generate moves, sort a move list, or do a board evaluation in parallel. A move generator of such a Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential processing of a CPU.
  2. Use of Local Memory* instead of Global Memory
    The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By using the faster Local Memory, fewer Warps (resp. Wavefronts) are needed to hide Global Memory latency.
  3. Hundreds of Work-Groups instead of Thousands of Threads
    YBWC is a parallel game tree search algorithm used in today's chess engines, but the more workers the algorithm runs, the less efficiently it performs. So, by coupling sequentially operating threads into one Work-Group to work on one chess position in parallel, we lower the total number of workers and increase the efficiency of the parallel search.

* Local Memory in OpenCL terms corresponds to Shared Memory in Nvidia CUDA terms.
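
The "64 squares in parallel" idea from point 1 can be illustrated with a small simulation. This is only a sketch in plain Python, not Zeta's OpenCL kernel code: a loop over 64 "lanes" stands in for the 64 Work-Items of one Work-Group, each lane looks at exactly one square, and the per-lane results are merged with a bitwise OR, as a reduction in Local Memory might do. The function names and the bit layout (bit 0 = a1, bit 63 = h8) are assumptions made for this example.

```python
# Knight move offsets as (file delta, rank delta).
KNIGHT_DELTAS = [(1, 2), (2, 1), (2, -1), (1, -2),
                 (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def lane_knight_attacks(lane, knights):
    """Work done by one lane: attack set of a knight on square `lane`, if any."""
    if not (knights >> lane) & 1:
        return 0  # idle lane: no knight here, its instruction slots are 'wasted'
    file, rank = lane % 8, lane // 8
    attacks = 0
    for df, dr in KNIGHT_DELTAS:
        f, r = file + df, rank + dr
        if 0 <= f < 8 and 0 <= r < 8:
            attacks |= 1 << (r * 8 + f)
    return attacks


def workgroup_knight_attacks(knights):
    """Simulate 64 lanes working on the same position, then OR-reduce."""
    per_lane = [lane_knight_attacks(lane, knights) for lane in range(64)]
    combined = 0
    for attacks in per_lane:
        combined |= attacks
    active_lanes = sum(1 for attacks in per_lane if attacks)
    return combined, active_lanes


# White knights on b1 (square 1) and g1 (square 6).
attack_set, active = workgroup_knight_attacks((1 << 1) | (1 << 6))
print(hex(attack_set), "active lanes:", active, "of 64")
```

With only two knights on the board, 62 of the 64 lanes have nothing to do, which is the kind of 'wasted' work that point 1 accepts in exchange for keeping all lanes on the same control path.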

Zeta - Milestones

Here is an overview of what happened before:

Zeta (099m)

  • patch for ABDADA parallel search
  • disabled RMO parallel search
  • removed max device memory limitation
  • mods in time control
  • cleanups
  • Zeta 099m on Nvidia V100, 160 workers, ~ 13.5 Mnps
  • Zeta 099m on Nvidia V100, 1 worker, ~ 85 Knps

Zeta (099l)

  • patch for parallel search scaling
  • max device memory increased from 1 GB to 16 GB

Zeta (099h to 099k)

  • fixes and cleanups
  • switch from Lazy SMP to ABDADA parallel search
  • added IID - Internal Iterative Deepening
  • one cl file for all gpu generations with inlined optimizations
  • Zeta 099k on AMD Radeon R9 Fury X, 256 workers, ~ 7.6 Mnps
  • Zeta 099k on Nvidia GeForce GTX 750, 16 workers, ~ 800 Knps
  • Zeta 099k on AMD Radeon HD 7750, 32 workers, ~ 700 Knps
  • Zeta 099k on Nvidia GeForce 8800 GT, 14 workers, ~ 110 Knps

Zeta (099b to 099g)

  • switch from KoggeStone based move generation to Dumb7Fill
  • added atomic features for different gpu generations
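
For readers unfamiliar with the two move generation techniques named above, here is a minimal Python sketch of both for a single direction (north), written after the well-known textbook formulations rather than taken from the Zeta source. Dumb7Fill repeats the same one-step shift up to the board edge, while Kogge-Stone reaches the same result in three doubling steps. The function names and the bit layout (bit 0 = a1, north = shift left by 8) are assumptions made for this example.

```python
import random

MASK64 = 0xFFFFFFFFFFFFFFFF  # Python ints are unbounded, so clip to 64 bits


def north_attacks_dumb7fill(rooks, empty):
    """Dumb7Fill: flood one rank at a time through empty squares."""
    flood = rooks
    gen = rooks
    for _ in range(6):
        gen = (gen << 8) & empty & MASK64
        flood |= gen
    # The final shift adds the first blocked square (or drops off the board).
    return (flood << 8) & MASK64


def north_attacks_kogge_stone(rooks, empty):
    """Kogge-Stone: parallel-prefix fill with shifts of 8, 16 and 32."""
    gen, pro = rooks, empty
    gen |= pro & (gen << 8)
    pro &= pro << 8
    gen |= pro & (gen << 16)
    pro &= pro << 16
    gen |= pro & (gen << 32)
    return (gen << 8) & MASK64


def agree_on_random_positions(count, seed=1):
    """Check that both generators return identical north attack sets."""
    rng = random.Random(seed)
    for _ in range(count):
        rooks = rng.getrandbits(64) & rng.getrandbits(64) & rng.getrandbits(64)
        occupied = rooks | (rng.getrandbits(64) & rng.getrandbits(64))
        empty = ~occupied & MASK64
        if north_attacks_dumb7fill(rooks, empty) != north_attacks_kogge_stone(rooks, empty):
            return False
    return True


# Rook on a1 (bit 0), blocker on a4 (bit 24): attacks are a2, a3 and a4.
rook, blocker = 1, 1 << 24
empty_squares = ~(rook | blocker) & MASK64
print(hex(north_attacks_dumb7fill(rook, empty_squares)))
print(agree_on_random_positions(1000))
```

Both functions compute the same attack set; they differ in the number and pattern of shift and mask operations, which is where the two approaches would be expected to trade off against each other.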

Zeta (099a)

  • switch from best first minimax search to parallel alphabeta (lazy smp)
  • ported all (except IID) search techniques from Zeta Dva v0305 to OpenCL
  • ported the evaluation function of Zeta Dva v0305 to OpenCL
  • vectorized and generalized 64 bit Kogge-Stone move generator
    64 threads are now coupled to one worker, performing move generation,
    move picking and evaluation, square-wise, in parallel on the same node
  • portability over performance, should run on the very first gpus with
    OpenCL 1.x support (>= 2008)

Zeta (098d to 098g)

  • mostly cleanup and fixes
  • restored simple heuristics from Zeta Dva (~2000 Elo on CCRL) engine
  • protocol fixes
  • fixed autoconfig for AMD gpus
  • switched to KoggeStone based move generator
  • switched to rotate left based Zobrist hashes
  • switched to move picker
  • switched to GPL >= 2
  • Zeta 098e on Nvidia GeForce GTX 580, ca. 6 Mnps, est. 1800 Elo on CCRL
  • Zeta 098e on AMD Radeon HD 7750, ca. 1 Mnps
  • Zeta 098e on AMD Phenom X4, ca. 1 Mnps
  • Zeta 098e on Nvidia GeForce 8800 GT, ca. 500 Knps


Zeta (098a to 098c)

  • Improved heuristics, partly ported from the Stockfish chess engine
  • AutoConfig for OpenCL devices
  • Parameter tuning
  • Zeta 098c on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 2000 Elo on CCRL
  • Zeta 098c on AMD Radeon R9 290, ca. 3.2 Mnps

Zeta (097a to 097z)

  • Implementation of a BestFirstMiniMax search algorithm with UCT parameters for parallelization
  • Zeta 097x on Nvidia GeForce GTX 480, ca. 5 Mnps, est. 1800 Elo on CCRL
  • Zeta 097x on AMD Radeon HD 7750, ca. 800 Knps

Zeta (0930 to 0960)

  • Tested Monte Carlo Tree Search without UCT across multiple Compute Units of the GPU
  • Tested LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
  • Tested the 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
  • Tested 'RBFMS', Randomized BestFirstMiniMax Search, a parallelized version of BestFirstMiniMax, across multiple Compute Units of the GPU

Zeta (0915 to 0918)

  • 64 bit Magic Bitboard Move Generator running
  • AlphaBeta search algorithm with 'SPPS'-parallelization running 128 threads on one Compute Unit of the GPU

Zeta (0900 to 0910)

  • Tested 32 bit 0x88 and 64 bit Magic Bitboard Move Generator
  • Ported Heuristics, the Evaluation Function, from CPU engine 'ZetaDva' (~2000 Elo) to OpenCL

 

* updated on 2019-07-13 *

Zeta - Source Code

Zeta and Zeta Dva support only some basic Xboard protocol commands, and some users have reported problems with the configuration and interface of the last Zeta versions. So I will publish the source code again when these parts are more user-friendly and have been tested with Windows chess GUIs like Winboard or Arena.
