1. One SIMD Unit - One Board
    To avoid thread divergence in a Warp, resp. Wavefront, the engine could couple, for example, 32 or 64 Work-Items of one Work-Group to work together on the same chess position. For instance, to generate moves, sort a move list or do an board evaluation in parallel. A move generator of such an Work-Group could operate over pieces, directions, or simply 64 squares in parallel. But in any of these cases current GPU SIMD units will 'waste' some instructions compared to the more efficient, sequential, processing of an CPU.
  2. Use of Local Memory* instead of Global Memory
    The more sequential threads are coupled into one Work-Group to work on one chess position in parallel, the more Local Memory* per Work-Group could be available to store a move list, or a move list stack. By the use of faster Local Memory, less Warps (resp. Wavefronts) are in need to hide Global Memory latency.
  3. Hundreds of Work-Groups instead Thousands of Threads
    YBWC is a parallel game tree search algorithm used in nowadays chess engines, but the more workers the algorithm runs, the less efficient he performs. So, by coupling sequential operating threads into one Work-Group, to work on one chess position in parallel, we lower the total number of workers and increase efficiency of the parallel search.

* Local Memory as OpenCL term is translated to Shared Memory as Nvidia Cuda term.