Zeta Chess

Review of Papers on GPU Game Tree Search

  • It looks like Monte Carlo Tree Search gives the best speedups compared to an CPU implementation.
  • The Node Based Parallel Search is an hybrid approach that offloads computational tasks to the GPU.
  • MiniMax search can be parallelized on the GPU, but is inferior to AlphaBeta.
  • Speedup of parallel AlphaBeta implementations depend on the branching factor of the Game.

So far i have found nothing about an implementation that makes use of the recursive features of newer architectures.

Papers on GPU Game Tree Search

Linklist of papers related to Game Tree Seach on GPUs for two-player zero-sum games...

Parallel Game Tree Search on SIMD Machines
Holger Hopp and Peter Sanders (1995), citeseerx pdf
Note - Implementation of YBWC, parallel AlphaBeta, on an 16K SIMD machine for an synthetic game tree.

Efficiency of Parallel Minimax Algorithm for Game Tree Search
Plamenka Borovska, Milena Lazarova (2007), citeseerx pdf
Note - Efficiency of AlphaBeta search for 4x4 TicTacToe via MPI and OpenMP on an CPU cluster.

GPU-Accelerated program to play Go
Zachary Clifford (2009), pdf
Note - Implementation of MCTS for Go.

Playing Zero Sum Games on the GPU
Avi Bleiweiss (2010), pdf
Note - Multliple Game Tree Searches on GPU.

Parallel Minimax Tree Searching on GPU
Kamil Rocki and Reiji Suda (2010), pdf
Note - Implementation of MinixMax for Reversi.

Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks
Damian Sulewski (2011), link to pdf
Note - Chapter 4 - GPUSSD-BFS - A GPU and SSD supported Breadth-First Search.

Parallel Game Tree Search Using GPU
L’ubomír Lackovi (2011), pdf
Note - Implementation of parallel search for Czech Draughts.

Parallel Monte Carlo Tree Search on GPU
Kamil Rocki and Reiji Suda (2011) pdf
Note - Implementation of MCTS for Reversi.

Parallel alpha-beta algorithm on the GPU
Damjan Strnad and Nikola Guid (2011), scholar google pdf
Note - Implementation of PV-Split, parallel AlphaBeta, for Reversi.

A Node-based Parallel Game Tree Algorithm Using GPUs
Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012) IEEE
Note - Implementation of node-based parallelism for Connect6.

Parallel UCT Search on GPUs
Nicolas A. Barriga, Marius Stanescu, Michael Buro (2014), IEEE
Note - Implementation of MCTS with UCT for 8x8 Ataxx.

A Review on Parallelization of Node based Game Tree Search Algorithms on GPU
Ms. Rutuja U. Gosavi, Prof. Payal S. Kulkarni (2014), pdf

Parallelization of Node Based Game Tree Search Algorithm on GPU
Ms. Rutuja U. Gosavi, Mrs. Archana S. Vaidya (2015), pdf
Note - Implementation of node-based parallelism for Connect4/Connect6.

How Computer Chess Engines could run on GPUs

  1. One SIMD Unit - One Board
    To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.
  2. Use of Local Memory* instead of Global Memory
    The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory, less Warps (resp. Wavefronts) are in need to hide Global Memory latency.
  3. Hundreds of Work-Groups instead Thousands of Threads
    YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs. So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.

* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.

Home - Top