Zeta Chess

GPGPU

 

GPU (Graphics Processing Unit) as of 01/2024

A GPU is a specialized processor initially intended for fast image processing. GPUs may have more raw computing power than general-purpose CPUs, but they need a specialized, parallel style of programming. Leela Chess Zero has shown that a best-first Monte-Carlo Tree Search (MCTS) combined with deep learning works on GPU architectures.

History

In the 1970s and 1980s RAM was expensive, and home computers used custom graphics chips that operated directly on registers and memory without a dedicated frame buffer or texture buffer, like TIA in the Atari VCS gaming system, GTIA+ANTIC in the Atari 400/800 series, or Denise+Agnus in the Commodore Amiga series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math emerged, such as SGI Impact (1995) in 3D graphics workstations or 3dfx Voodoo (1996) for playing 3D games on PCs. Some game engines could instead use the SIMD capabilities of CPUs, such as the Intel MMX instruction set or AMD's 3DNow!, for real-time rendering. Sony's 3D-capable chip GTE, used in the PlayStation (1994), and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the unified shader architecture, as in Nvidia Tesla (2006), ATI/AMD TeraScale (2007) or Intel GMA X3000 (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained popularity.

GPU in Computer Chess

There are four main ways to use a GPU for chess:

  • As an accelerator in Lc0: run a neural network for position evaluation on the GPU
  • Offload the search in Zeta: run a parallel game tree search with move generation and position evaluation on the GPU
  • As a hybrid in perft_gpu: expand the game tree to a certain depth on the CPU and offload the computation of the sub-trees to the GPU
  • Neural network training, such as the Stockfish NNUE trainer in PyTorch or Lc0 TensorFlow training

GPGPU

Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like OpenGL or DirectX. These were followed by the first GPGPU frameworks, such as Sh/RapidMind or Brook, and finally by CUDA and OpenCL.

Khronos OpenCL

OpenCL, specified by the Khronos Group, is widely adopted across all kinds of hardware accelerators from different vendors.

AMD

AMD supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With ROCm it offers its own parallel compute platform.

Apple

Since macOS 10.14 Mojave, Apple has recommended a transition from OpenCL to Metal.

Intel

Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the oneAPI platform with DPC++ as frontend language.

Nvidia

CUDA is the parallel computing platform by Nvidia. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via OpenACC and OpenMP.

Hardware Model

A common scheme on GPUs with unified shader architecture is to run multiple threads in SIMT fashion and to keep a multitude of SIMT waves resident on the same SIMD unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the actual SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with a specific bit-width of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as a hardware architecture. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to further speed up neural networks.

Vendor Terminology

AMD Terminology | Nvidia Terminology
Compute Unit    | Streaming Multiprocessor
Stream Core     | CUDA Core
Wavefront       | Warp

 

Hardware Examples

Nvidia GeForce GTX 580 (Fermi)

  • 512 CUDA cores @1.544GHz
  • 16 SMs - Streaming Multiprocessors
  • organized in 2x16 CUDA cores per SM
  • Warp size of 32 threads

AMD Radeon HD 7970 (GCN)

  • 2048 Stream cores @0.925GHz
  • 32 Compute Units
  • organized in 4xSIMD16, each SIMT4, per Compute Unit
  • Wavefront size of 64 work-items
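
The core counts in both examples follow from the hierarchy described in the hardware model: cores per SIMD unit, SIMD units per compute unit, compute units per device. A minimal Python sketch of that arithmetic, using only the figures listed above (the function name `total_cores` is made up for illustration):

```python
def total_cores(compute_units: int, simd_units_per_cu: int, simd_width: int) -> int:
    """Multiply out the hierarchy: device -> compute units -> SIMD units -> cores."""
    return compute_units * simd_units_per_cu * simd_width

# GeForce GTX 580 (Fermi): 16 SMs, each organized as 2 x 16 CUDA cores
gtx580 = total_cores(compute_units=16, simd_units_per_cu=2, simd_width=16)

# Radeon HD 7970 (GCN): 32 compute units, each organized as 4 x SIMD16
hd7970 = total_cores(compute_units=32, simd_units_per_cu=4, simd_width=16)

print(gtx580, hd7970)  # 512 2048
```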

Wavefront and Warp

In general terms, the Wavefront or Warp size is the number of threads executed together in SIMT fashion on a GPU with unified shader architecture.

Programming Model 

A parallel programming model for GPGPU can be data-parallel, task-parallel, a mixture of both, or, with libraries and offload directives, also implicitly parallel. Single GPU threads (work-items in OpenCL) execute the kernel and are grouped into a work-group; one or more work-groups form the NDRange that is executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory. The architecture limits how many work-items a work-group can hold and how many threads can run concurrently on the device in total.
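
To make the relation between work-item, work-group and NDRange concrete, here is a small sequential emulation in Python. It is only a sketch: a real device runs the work-items in parallel, and the names `run_ndrange` and `vector_add` are invented for this illustration and are not part of OpenCL or CUDA. For simplicity it assumes a one-dimensional NDRange whose size is a multiple of the work-group size.

```python
def run_ndrange(kernel, global_size: int, local_size: int, *args) -> None:
    """Sequentially emulate a 1D NDRange: every work-item of every work-group runs `kernel`."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Same index relation a kernel sees on the device
            global_id = group_id * local_size + local_id
            kernel(global_id, local_id, group_id, *args)

def vector_add(global_id, local_id, group_id, a, b, out):
    """Data-parallel kernel: each work-item handles exactly one element."""
    out[global_id] = a[global_id] + b[global_id]

a = list(range(8))
b = [10 * x for x in range(8)]
out = [0] * 8
run_ndrange(vector_add, 8, 4, a, b, out)  # 2 work-groups of 4 work-items
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77]
```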

Terminology 

OpenCL Terminology | CUDA Terminology
Kernel             | Kernel
Compute Unit       | Streaming Multiprocessor
Processing Element | CUDA Core
Work-Item          | Thread
Work-Group         | Block
NDRange            | Grid

 

Thread Examples

Nvidia GeForce GTX 580 (Fermi, CC2)

  • Warp size: 32
  • Maximum number of threads per block: 1024
  • Maximum number of resident blocks per multiprocessor: 8
  • Maximum number of resident warps per multiprocessor: 48
  • Maximum number of resident threads per multiprocessor: 1536

AMD Radeon HD 7970 (GCN)

  • Wavefront size: 64
  • Maximum number of work-items per work-group: 1024
  • Maximum number of work-groups per compute unit: 40
  • Maximum number of Wavefronts per compute unit: 40
  • Maximum number of work-items per compute unit: 2560
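
These limits interact: the work-group size determines how many Wavefronts a work-group occupies, and that in turn bounds how many work-groups fit on one compute unit. The Python sketch below works this out for the Radeon HD 7970 figures listed above. It is a simplified model that deliberately ignores register and local-memory usage, which in practice can limit residency further; `resident_work_groups` is a hypothetical helper name.

```python
def resident_work_groups(work_group_size: int, wave_size: int,
                         max_waves_per_cu: int, max_groups_per_cu: int) -> int:
    """Work-groups that fit on one compute unit, limited by wave and group slots only."""
    waves_per_group = -(-work_group_size // wave_size)  # ceiling division
    return min(max_waves_per_cu // waves_per_group, max_groups_per_cu)

# Radeon HD 7970 (GCN): Wavefront size 64, 40 Wavefronts and 40 work-groups per CU
for size in (64, 256, 1024):
    groups = resident_work_groups(size, wave_size=64,
                                  max_waves_per_cu=40, max_groups_per_cu=40)
    print(size, groups, groups * size)
# 64 40 2560
# 256 10 2560
# 1024 2 2048
```

Note that the largest work-group size leaves some Wavefront slots unused in this model, because 40 is not a multiple of 16.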

Memory Model

OpenCL offers the following memory model for the programmer:

  • __private - usually registers, accessible only by a single work-item resp. thread.
  • __local - scratch-pad memory shared across the work-items of a work-group resp. the threads of a block.
  • __constant - read-only memory.
  • __global - usually VRAM, accessible by all work-items resp. threads.
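
The following Python sketch emulates how one work-group might use these address spaces to sum its slice of a global array: each work-item copies one element from global into local memory, then the group performs a tree reduction in the shared scratch-pad. It is a sequential CPU emulation under stated assumptions (power-of-two work-group size, barriers indicated only as comments), not device code, and `work_group_sum` is an invented name.

```python
def work_group_sum(global_mem: list, group_id: int, local_size: int) -> int:
    """Emulate one work-group summing its slice of global memory via local memory."""
    # __local: scratch-pad shared by all work-items of this work-group
    local_mem = [0] * local_size

    # Each work-item copies one element from __global to __local memory.
    for local_id in range(local_size):
        local_mem[local_id] = global_mem[group_id * local_size + local_id]
    # (on a real device a work-group barrier would be needed here)

    # Tree reduction: the number of active work-items halves in every step.
    stride = local_size // 2
    while stride > 0:
        for local_id in range(stride):
            local_mem[local_id] += local_mem[local_id + stride]
        # (barrier again before the next step)
        stride //= 2
    return local_mem[0]

data = list(range(16))                      # stands in for a buffer in __global memory
partial = [work_group_sum(data, g, 8) for g in range(2)]
print(partial, sum(partial))  # [28, 92] 120
```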

Terminology

OpenCL Terminology | CUDA Terminology
Private Memory     | Registers
Local Memory       | Shared Memory
Constant Memory    | Constant Memory
Global Memory      | Global Memory

 

Memory Examples

Nvidia GeForce GTX 580 (Fermi)

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache in total
  • 1.5 GiB to 3 GiB global memory

AMD Radeon HD 7970

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache in total
  • 3 GiB to 6 GiB global memory
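
Private memory is shared among all resident work-items of a compute unit, so the register budget per work-item shrinks as more work-items are kept resident. A rough Python estimate for the Radeon HD 7970, combining the 256 KiB of private memory listed here with the limit of 2560 work-items per compute unit from the thread examples. This is a back-of-the-envelope sketch assuming 32-bit registers; real hardware allocates registers in blocks, so the usable number may be somewhat lower. The helper name is hypothetical.

```python
def registers_per_work_item(private_kib: int, resident_work_items: int,
                            register_bytes: int = 4) -> int:
    """32-bit registers available per work-item if private memory is split evenly."""
    total_registers = private_kib * 1024 // register_bytes
    return total_registers // resident_work_items

# Radeon HD 7970: 256 KiB private memory per compute unit
for resident in (2560, 1280, 640):
    print(resident, registers_per_work_item(256, resident))
# 2560 25
# 1280 51
# 640 102
```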

Unified Memory

Usually data has to be copied between a CPU host and a discrete GPU device. Depending on the architecture, vendor, framework and operating system, however, a unified address space accessible by both CPU and GPU may be available.

Instruction Throughput 

GPUs are used in HPC environments because of their good FLOPS-per-watt ratio. In general, the instruction throughput depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's TeraScale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

Integer Instruction Throughput

  • INT32
    Depending on the architecture and the operation, 32-bit integer throughput can be lower than 32-bit floating-point or 24-bit integer throughput.

  • INT64
    In general, registers and vector ALUs of consumer-brand GPUs are 32 bits wide, so 64-bit integer operations have to be emulated.

  • INT8
    Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
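
To illustrate what emulating 64-bit integer operations on 32-bit hardware means (the INT64 item above), the Python sketch below builds a 64-bit addition from two 32-bit additions plus a carry. It only models the idea; the instruction sequence a real GPU compiler emits is not shown here, and all names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def split64(x: int) -> tuple:
    """Split a 64-bit value into (low, high) 32-bit words."""
    return x & MASK32, (x >> 32) & MASK32

def join64(lo: int, hi: int) -> int:
    """Combine (low, high) 32-bit words into one 64-bit value."""
    return (hi << 32) | lo

def add64(a: tuple, b: tuple) -> tuple:
    """64-bit add from 32-bit halves: add low words, carry into the high words."""
    lo = (a[0] + b[0]) & MASK32
    carry = 1 if lo < a[0] else 0          # unsigned overflow of the low word
    hi = (a[1] + b[1] + carry) & MASK32    # wraps around like a 64-bit register
    return lo, hi

result = join64(*add64(split64(0x00000000FFFFFFFF), split64(1)))
print(hex(result))  # 0x100000000
```

One 64-bit addition thus costs at least two dependent 32-bit operations in this model, which is why INT64 throughput tends to be lower than INT32 throughput on such hardware.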

Floating-Point Instruction Throughput

  • FP32
    Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.

  • FP64
    Relative to their FP32 throughput, consumer GPUs in general have a lower double-precision (64-bit) floating-point throughput than server-brand GPUs.

  • FP16
    Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec
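
The same calculation can be repeated for the other operations in the two tables. A small Python sketch, where `peak_gigaops` is a hypothetical helper and the inputs are taken from the tables above; the results are theoretical peaks, not measured values:

```python
def peak_gigaops(ops_per_clock: float, units: int, clock_mhz: float) -> float:
    """Theoretical peak = operations per clock per unit x number of units x clock rate."""
    return ops_per_clock * units * clock_mhz / 1000.0

# GeForce GTX 580: rates are per compute unit, 16 compute units @ 1544 MHz
print(peak_gigaops(32, 16, 1544))      # ADD: 790.528
print(peak_gigaops(16, 16, 1544))      # MAD: 395.264

# Radeon HD 7970: rates are per processing element, 2048 PEs @ 925 MHz
print(peak_gigaops(1, 2048, 925))      # ADD: 1894.4
print(peak_gigaops(1 / 4, 2048, 925))  # MAD: 473.6
```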

Tensors

MMAC (matrix-multiply-accumulate) units are used in consumer-brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server-brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.
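
As a hedged illustration of the Winograd idea, the Python sketch below shows the smallest one-dimensional case, usually written F(2,3): two outputs of a 3-tap filter are computed with 4 multiplications instead of 6. In the two-dimensional, multi-channel case used by neural networks the element-wise products turn into batched matrix multiplications, which is the part MMAC units accelerate; that step is not shown here. The function names are made up for the example.

```python
def direct_conv(d: list, g: list) -> list:
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d: list, g: list) -> list:
    """Same two outputs via Winograd F(2,3): 4 multiplications."""
    # Filter transform (can be precomputed once per filter)
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by element-wise products
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1, 2, 3, 4], [1, 0, -1]
print(direct_conv(d, g), winograd_f23(d, g))  # [-2, -2] [-2.0, -2.0]
```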

Nvidia TensorCores 

TensorCores were introduced with Nvidia's Volta series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks. Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation. Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration. Ada Lovelace's 4th gen adds support for FP8.
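
The operation behind such units is D = A x B + C on small matrix tiles. The plain-Python sketch below shows only the arithmetic; it does not model the mixed precision (FP16 inputs, FP32 accumulation) or the fixed tile sizes of real hardware, and `mma` is an illustrative name rather than an actual API.

```python
def mma(a: list, b: list, c: list) -> list:
    """Return D = A x B + C for small matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
print(mma(a, b, c))  # [[20, 23], [44, 51]]
```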

AMD Matrix Cores

In 2020 AMD released its server-class CDNA architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity).

Intel XMX Cores

Intel added XMX (Xe Matrix eXtensions) cores to some of the Intel Xe GPU series, like Arc Alchemist and the Intel Data Center GPU Max Series.

Host-Device Latencies

One reason GPUs are not more widely used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds up to hundreds of microseconds. One way to overcome this limitation is to group tasks into batches that are executed in a single run.
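
A quick Python estimate shows why batching helps. It assumes, as a deliberately crude model, that the launch overhead is the only cost and that a batch of tasks shares a single kernel launch; the batch size of 256 is an arbitrary example value and `max_tasks_per_second` is a hypothetical helper.

```python
def max_tasks_per_second(launch_overhead_us: float, batch_size: int) -> float:
    """Upper bound on tasks per second if kernel-launch overhead were the only cost."""
    launches_per_second = 1_000_000 / launch_overhead_us
    return launches_per_second * batch_size

# One task per launch, at both ends of the latency range quoted above
print(max_tasks_per_second(5, 1))      # 200000.0
print(max_tasks_per_second(100, 1))    # 10000.0

# 256 tasks per launch at the slow end of the range
print(max_tasks_per_second(100, 256))  # 2560000.0
```

Without batching, the launch overhead alone caps the rate at tens to hundreds of thousands of tasks per second in this model, whatever the compute power of the device.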

Under the Hood

AMD architectures

My own conclusions are:

  • TeraScale has VLIW design.
  • GCN has 16 wide SIMD, executing a Wavefront of 64 threads over 4 cycles.
  • RDNA has 32 wide SIMD, executing a Wavefront:32 over 1 cycle and Wavefront:64 over two cycles.
  • CDNA is advanced GCN.

Nvidia architectures

AFAIK Nvidia has never officially mentioned SIMD as hardware architecture in its papers; since Tesla it has only referred to SIMT.

Nevertheless, my own conclusions are: 

  • Tesla has 8 wide SIMD, executing a Warp of 32 threads over 4 cycles.
  • Fermi has 16 wide SIMD, executing a Warp of 32 threads over 2 cycles.
  • Kepler is somewhat odd; I am not sure how the compute units are partitioned.
  • Maxwell and Pascal have 32 wide SIMD, executing a Warp of 32 threads over 1 cycle.
  • Volta and Turing seem to have 16 wide FPU SIMDs, but my own experiments show 32 wide VALU.
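
The cycle counts in the AMD and Nvidia lists above all follow the same rule: the Wavefront or Warp size divided by the SIMD width, rounded up. A short Python sketch of that rule, using the author's inferred SIMD widths (which are conclusions, not vendor-confirmed figures); `issue_cycles` is an invented name:

```python
def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles needed to push one Wavefront/Warp through a SIMD unit of given width."""
    return -(-wave_size // simd_width)  # ceiling division

examples = {
    "GCN, Wavefront 64 on SIMD16": (64, 16),
    "RDNA, Wavefront 32 on SIMD32": (32, 32),
    "RDNA, Wavefront 64 on SIMD32": (64, 32),
    "Tesla, Warp 32 on SIMD8": (32, 8),
    "Fermi, Warp 32 on SIMD16": (32, 16),
    "Maxwell/Pascal, Warp 32 on SIMD32": (32, 32),
}
for name, (wave, width) in examples.items():
    print(name, "->", issue_cycles(wave, width), "cycle(s)")
```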

SIMD + Scalar Unit

It seems every SIMD unit has one scalar unit on GPU architectures, executing control flow (branches, loops) or special functions the SIMD ALUs are not capable of.

Embedded CPU Controller

It is not documented in the white papers, but it seems that every discrete GPU has an embedded CPU controller (e.g. Nvidia Falcon) which (speculation) launches the kernels.

GPUs and Duncan's taxonomy

It is not clear to me how different vendors realize the underlying hardware of GPU SIMD units in architectures with unified shaders. There is the concept of bit-sliced ALUs, the concept of pipelined vector processors, and the concept of SIMD units with fixed bit-width ALUs. The white papers from different vendors leave room for speculation, as do the different instruction throughputs for higher and lower precision. What is left to the programmer is to run microbenchmarks and draw their own conclusions.

https://en.wikipedia.org/wiki/Duncan%27s_taxonomy

https://en.wikipedia.org/wiki/Flynn%27s_taxonomy

Legacy GPGPU

This article does not cover legacy, pre-2007 GPGPU methods, that is, how to use pixel, vertex, geometry, tessellation and compute shaders via OpenGL or DirectX for GPGPU. I can imagine it is possible to backport a neural network Lc0 backend to a certain DirectX/OpenGL API, but I doubt it has real contemporary relevance (running Lc0 on an SGI Indy or alike).

Alternative Architectures

There was, for example, the IBM PowerXCell 8i, used in the IBM Roadrunner super-computer from 2008, the first heterogeneous petaFLOP system; a smaller version of the chip ran in the PlayStation 3:

https://en.wikipedia.org/wiki/Cell_%28processor%29#PowerXCell_8i

There was the Intel Larrabee project from 2010, with many simple x86-64 cores and 512-bit vector units (a forerunner of AVX-512), later released as the Xeon Phi accelerator:

https://en.wikipedia.org/wiki/Larrabee_%28microarchitecture%29

https://en.wikipedia.org/wiki/Xeon_Phi

There is still the NEC SX Aurora (>=2017), a vector processor on a PCIe card, descended from the NEC SX super-computer series as used e.g. in the Earth Simulator super-computer:

https://en.wikipedia.org/wiki/NEC_SX-Aurora_TSUBASA

There is the Chinese Matrix 2000/3000 many-core accelerator (>=2017), used in the Tianhe super-computer:

https://en.wikichip.org/wiki/nudt/matrix-2000

AFAIK, none of the above was used to play computer chess. On the other hand:

IBM Deep Blue used ASICs:
https://www.chessprogramming.org/Deep_Blue

Hydra used FPGAs:
https://www.chessprogramming.org/Hydra

AlphaZero used TPUs:
https://www.chessprogramming.org/AlphaZero