GPU Programming

Modern graphics cards sport Graphics Processing Units (GPUs) which are insanely powerful. They are highly optimised to perform a sequence of identical operations on lots of data (the SIMD paradigm). A graphics card can often perform such calculations far faster than the machine's own CPU. In recent years, programmable GPUs have been developed, enabling programmers to run their own code on the card, rather than being restricted to the instructions provided by the hardware (typically OpenGL calls). Nvidia in particular has developed its own variant of the C language to promote this. They call the system CUDA, and it is available as a free download.

Alice and I learned about this, and decided to investigate the potential for astrophysical research. I built a machine containing an Nvidia GeForce 8800 series graphics card - which was never attached to a monitor (the monitor is driven by a far cheaper GeForce 5 card). Next, I settled myself down with a copy of the CUDA manual, and started programming

CUDA Architecture

Block diagram of computer architecture

Figure 1: Block diagram of a computer

A simplified block diagram of a computer containing a GPU. The CPU and GPU each have their own memory spaces, and communicate over the PCI bus.

It is important to understand that a GPU is a separate, self-contained computing device which sits inside another computer. Figure 1 is a simple block diagram of a computer containing a programmable graphics card. The CPU has a direct link to the computer's main memory. Similarly, the GPU has its own processing units, and its own memory banks. The CPU and GPU communicate over the PCI bus. This leads to the following model for GPU programming:

  1. CPU sends data to the GPU
  2. GPU processes data
  3. CPU copies data back from the GPU

It is the addition of the third stage which makes general purpose GPU programming possible. Graphics cards have always been capable of processing large amounts of data, but they could only sent output to a display. Now that a GPU can return its results to the CPU, general purpose programming is possible. This ability also makes it extremely difficult to make graphics cards available to virtual machines (at least until hardware I/O memory management units become available in consumer equipment), since the GPU will write to a physical address (rather than one in virtual memory). Fortunately, we don't need to worry about this here.

Note that steps 1 and 3 in the GPU programming model represent an extra cost, which isn't present for code running exclusively on a CPU. Unless the GPU portion (step 2) is sufficiently faster than the CPU, the total time taken for the program to run will actually be longer. This is a trade-off which must always be borne in mind.

The GPU itself consists of a number of processing units (typically eight or so). Each one of these processing units executes a block of threads (typically up to 512) in parallel. There is one significant limitation: each thread in a block must perform the same instruction. Although each thread can operate its own data, all the threads in a block must be performing the same operation on that data. Branches (such as if statements) violate this constraint. This is the Single Instruction Multiple Data (SIMD) paradigm, which gives GPUs their performance. Codes which cannot be written without branches will perform better on a CPU.

Many array operations are ideally suited to the SIMD approach. For example consider adding two arrays together: ai = bi + ci. Each i value is independent of the others, and hence can be done by a single thread. Note that CUDA code can contain if statements, but this is a very good way to kill performance. The compiler can sometimes remove the branches, but there will inevitably be times that it cannot. In these cases, all the threads in a block will have to be run in serial, with a huge performance penalty.

CUDA code run on a GPU is called a kernel, and only one kernel may be run at a time. The programmer must specify how many blocks of threads are required, and how many threads should be in each block. The GPU itself then schedules the blocks for execution on its processing units. A kernel launch can contain more blocks than there are processing units on the card - the GPU will simply start to run as many blocks as it can, and whenever a processing unit finishes a block, it will be given the next unprocessed one. This continues until all blocks have run.

This is enough information to get started with CUDA programming. I've written a simple nbody program which demonstrates the basics of GPU programming with CUDA.

We can, however, do much better. Modern CPUs are vastly faster than their main memory banks, and so have caches to speed up access to frequently used data. Similarly, each block of threads running on the GPU has a small amount of memory (typically 16 kiB) available for its exclusive use. This memory can be accessed much faster than the graphics card's global memory. Unfortunately, the program must manage this per-block memory itself - unlike a CPU cache, it is not transparent. I have written an optimised version of my simple nbody program, which shows how to use this memory.

CUDA Caveats

The main caveat when using CUDA is that graphics cards do not yet support double precision arithmetic. Only having single precision arithmetic restricts the range of problems which they can solve. Double precision cards are currently still in the design stages. Furthermore, the arithmetic operations (such as sine and cosine) provided are slightly less accurate than those available on the CPU. This has is due to the origins of GPUs - no human is going to notice a single pixel being slightly the wrong colour in a game playing at over one hundred frames per second. Modern GPUs are also very power-hungry, requiring their own feed from the machine's power supply.