Introducing CUDA

Compute Unified Device Architecture (CUDA) is a widely used parallel computing platform and programming model developed by NVIDIA. It is supported only on NVIDIA GPUs. OpenCL can be used to write parallel code for other GPUs, such as those from AMD and Intel, but it is more complex than CUDA. CUDA allows developers to create massively parallel applications that run on graphics processing units (GPUs) through simple programming APIs. Software developers who use C and C++ can accelerate their applications and leverage the power of GPUs by using CUDA C or C++. Programs written in CUDA look like ordinary C or C++ programs, with a few additional keywords needed to exploit the parallelism of GPUs. CUDA lets a programmer specify which parts of the code execute on the CPU and which parts execute on the GPU.

The next section describes in detail why parallel computing is needed and how the CUDA architecture leverages the power of the GPU.

Parallel processing

In recent years, consumers have been demanding more and more functionality from a single handheld device. This creates a need to pack more and more transistors into a small area, where they must work quickly and consume minimal power. We need a fast processor that can carry out multiple tasks with a high clock speed, a small area, and minimum power consumption. Over many decades, transistors have gradually shrunk, which has made it possible to pack more and more of them onto a single chip. For a long time, this also brought a steady rise in clock speed. However, the situation has changed in the last few years, and clock speeds have remained more or less constant. So, what is the reason for this? Have transistors stopped getting smaller? The answer is no. The main reason clock speeds have stalled is the high power dissipation that comes with a high clock rate. Small transistors packed into a small area and switching at high speed dissipate a large amount of power, which makes it very difficult to keep the processor cool. Because clock speed is saturating, we need a new computing paradigm to increase the performance of processors. Let's understand this concept by taking a small real-life example.

Suppose you are told to dig a very big hole in a small amount of time. You will have the following three options to complete this work in time:

You can dig faster.
You can buy a better shovel.
You can hire more diggers, who can help you complete the work.

If we draw a parallel between this example and a computing paradigm, then the first option is similar to having a faster clock. The second option is similar to having more transistors that can do more work per clock cycle. But, as we discussed in the previous paragraph, power constraints limit both of these options. The third option is similar to having many smaller and simpler processors that can carry out tasks in parallel. A GPU follows this computing paradigm. Instead of having one big powerful processor that can perform complex tasks, it has many small and simple processors that can get work done in parallel. The details of GPU architecture are explained in the next section.

Introducing GPU architecture and CUDA

The GeForce 256, released in 1999, was the first GPU developed by NVIDIA. Initially, GPUs were used only for rendering high-end graphics on monitors, that is, for pixel computations. Later on, people realized that if GPUs can do pixel computations, then they can also do other mathematical calculations. Nowadays, GPUs are used in many applications other than rendering graphics. This use of GPUs is known as general-purpose computing on GPUs (GPGPU).

The next question that may have come to your mind is how the hardware architecture of a GPU differs from that of a CPU in a way that allows it to carry out parallel computation. A CPU has complex control hardware and comparatively little data computation hardware. The complex control hardware gives a CPU flexibility and a simple programming interface, but it is expensive in terms of power. A GPU, on the other hand, has simple control hardware and much more hardware for data computation, which gives it its capacity for parallel computation. This structure makes it more power-efficient. The disadvantage is that it has a more restrictive programming model. In the early days of GPU computing, graphics APIs such as OpenGL and DirectX were the only way to interact with GPUs. This was a complex task for programmers who were not familiar with OpenGL or DirectX. This led to the development of the CUDA programming architecture, which provides an easy and efficient way of interacting with GPUs. More details about the CUDA architecture are given in the next section.

Normally, the performance of any hardware architecture is measured in terms of latency and throughput. Latency is the time taken to complete a given task, while throughput is the number of tasks completed in a given time. These are not contradictory concepts. More often than not, improving one improves the other. Still, most hardware architectures are designed to favor either latency or throughput. For example, suppose you are standing in a queue at the post office. Your goal is to finish your own errand in as little time as possible, so you care about latency, while the employee sitting at the post office window wants to serve as many customers as possible in a day. So, the employee's goal is to increase throughput. In this case, improving one will lead to an improvement in the other, but the two sides look at the improvement differently.

In the same way, normal sequential CPUs are designed to optimize latency, while GPUs are designed to optimize throughput. CPUs are designed to execute each instruction in the minimum time, while GPUs are designed to execute more instructions in a given time. This design makes GPUs very useful in image processing and computer vision applications, which we are targeting in this book, because we don't mind a delay in the processing of a single pixel. What we want is to process more pixels in a given time, which is exactly what a GPU can do.
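The trade-off between latency and throughput can be made concrete with a little arithmetic. The following is a minimal sketch that uses host-only code; the Processor struct, its fields, and both helper functions are hypothetical names introduced only for this illustration, and the model is deliberately simplified (every lane is assumed to finish one pixel in the same fixed time).

```cuda
// Hypothetical model, for illustration only: a processor is described by how
// long one pixel takes and by how many pixels it can work on at the same time.
struct Processor {
    double latency_us;  // time to finish one pixel, in microseconds
    int    lanes;       // number of pixels processed simultaneously
};

// Total time to process `pixels` pixels, assuming the work is done in batches
// of `lanes` pixels and every batch takes `latency_us`.
double total_time_us(Processor p, long pixels) {
    long batches = (pixels + p.lanes - 1) / p.lanes;  // ceiling division
    return batches * p.latency_us;
}

// Pixels completed per microsecond when all lanes are busy.
double throughput(Processor p) {
    return p.lanes / p.latency_us;
}
```

With made-up values such as Processor{1.0, 4} and Processor{10.0, 1000}, one million pixels take 250,000 microseconds on the first and 10,000 microseconds on the second, even though each individual pixel takes ten times longer on the second. These numbers are not measurements of any real CPU or GPU; they only show how a design with worse latency can still win on throughput.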

So, to summarize, parallel computing is what we need if we want to increase computational performance at the same clock speed and power requirement. GPUs provide this capability by having lots of simple computational units working in parallel. Now, to interact with the GPU and to take advantage of its parallel computing capabilities, we need a simple parallel programming architecture, which is provided by CUDA.

CUDA architecture

This section covers basic hardware modifications done in GPU architecture and the general structure of software programs developed using CUDA. We will not discuss the syntax of the CUDA program just yet, but we will cover the steps to develop the code. The section will also cover some basic terminology that will be followed throughout this book.

The CUDA architecture includes several new components specifically designed for general-purpose computation on GPUs, which were not present in earlier architectures. It includes a unified shader pipeline, which allows all arithmetic logic units (ALUs) present on a GPU chip to be marshaled by a single CUDA program. The ALUs are also designed to comply with the IEEE floating-point single- and double-precision standards so that they can be used in general-purpose applications. The instruction set is tailored to general-purpose computation rather than being specific to pixel computations. The architecture also allows arbitrary read and write access to memory. These features make the CUDA GPU architecture very useful in general-purpose applications.

All GPUs have many parallel processing units called cores. On the hardware side, these cores, also called streaming processors, are grouped into streaming multiprocessors (SMs), and a GPU contains an array of these SMs. On the software side, a CUDA program is executed as a large number of threads running in parallel, with each thread executed on a core. The threads are organized into blocks, and each block can contain many threads. Each block is assigned to a single SM, although one SM can host several blocks at a time. How blocks are mapped to SMs is not visible to the CUDA programmer; it is decided by a scheduler. Threads from the same block can communicate with one another. The GPU has a hierarchical memory structure that handles communication between threads inside one block and between multiple blocks. This will be dealt with in detail in the upcoming chapters.
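As a small preview of how blocks and threads appear in code (the syntax itself is covered later), the sketch below shows how a thread can work out its own position among all the threads that were launched. The built-in variables blockIdx, blockDim, and threadIdx are provided by CUDA; the names global_index and fill_with_index are hypothetical and chosen only for this example, which assumes one-dimensional blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: position of a thread among all launched threads,
// given its block number, the number of threads per block, and its number
// inside the block. Marked __host__ __device__ so it can be called from
// both CPU and GPU code.
__host__ __device__ inline int global_index(int block, int threads_per_block,
                                            int thread) {
    return block * threads_per_block + thread;
}

// Each thread writes its own global index into one element of `out`.
// The guard is needed because the last block may contain threads past `n`.
__global__ void fill_with_index(int *out, int n) {
    int i = global_index(blockIdx.x, blockDim.x, threadIdx.x);
    if (i < n) {
        out[i] = i;
    }
}
```

For instance, with 256 threads per block, thread 5 of block 2 gets global_index(2, 256, 5), which is 517. Because every thread computes a different index, each one can be made responsible for a different element of the data.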

As a programmer, you will be curious to know what the programming model in CUDA looks like and how the code knows whether it should be executed on the CPU or the GPU. For this book, we will assume that we have a computing platform comprising a CPU and a GPU. We will call the CPU and its memory the host, and the GPU and its memory the device. A CUDA program contains code for both the host and the device. The host code is compiled for the CPU by a normal C or C++ compiler, and the device code is compiled for the GPU by NVIDIA's CUDA compiler, nvcc. The host code invokes the device code through something called a kernel call, which launches many threads in parallel on the device. The number of threads to launch on the device is specified by the programmer.
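A kernel call can be sketched as follows, ahead of the detailed treatment of the syntax. The __global__ keyword, the triple-angle-bracket launch syntax, cudaGetLastError, and cudaDeviceSynchronize are part of CUDA; the function names hello_from_gpu and launch_hello are hypothetical and used only here. The sketch assumes a machine with a CUDA-capable GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: __global__ marks a function that is launched from the host
// and executed on the GPU by every launched thread.
__global__ void hello_from_gpu() {
    printf("thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

// Host code: ordinary C++ that runs on the CPU.
// Launches blocks * threads_per_block threads and returns true if both the
// launch and the execution completed without error.
bool launch_hello(int blocks, int threads_per_block) {
    hello_from_gpu<<<blocks, threads_per_block>>>();  // the kernel call
    cudaError_t launch_status = cudaGetLastError();   // was the launch accepted?
    cudaError_t run_status = cudaDeviceSynchronize(); // wait for the GPU to finish
    return launch_status == cudaSuccess && run_status == cudaSuccess;
}
```

Calling launch_hello(2, 4) would start 2 blocks of 4 threads each, so 8 threads in total, each printing one line. The order in which those lines appear is not guaranteed, because the threads run in parallel.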

Now, you might ask how this device code is different from normal C code. The answer is that it is similar to normal sequential C code; it is just that this code is executed on a large number of cores in parallel. However, for this code to work, it needs its data in the device's memory. So, before launching threads, the host copies data from the host memory to the device memory. The threads work on the data in the device's memory and store their results there. Finally, the results are copied back to the host memory for further processing. To summarize, the steps to develop a CUDA C program are as follows:

Allocate memory for data in the host and device memory.
Copy data from the host memory to the device memory.
Launch a kernel by specifying the degree of parallelism.
After all the threads are finished, copy the data back from the device memory to the host memory.
Free up all memory used on the host and the device.
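The steps above can be sketched with a small element-wise addition of two arrays. This is a minimal illustration rather than a complete program: the names add_kernel and add_on_gpu are hypothetical, the choice of 256 threads per block is an arbitrary assumption, the inputs are assumed to be non-empty and of equal length, and error checking is omitted for brevity. The calls cudaMalloc, cudaMemcpy, and cudaFree belong to the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Device code: each thread adds one pair of elements.
__global__ void add_kernel(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host code (hypothetical helper). Assumes a and b are non-empty and have
// the same length. Error checking is omitted to keep the steps visible.
std::vector<float> add_on_gpu(const std::vector<float> &a,
                              const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    size_t bytes = a.size() * sizeof(float);

    // Step 1: allocate memory on the host and on the device.
    std::vector<float> out(a.size());
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);

    // Step 2: copy the input data from host memory to device memory.
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Step 3: launch the kernel with enough blocks to cover all n elements.
    int threads_per_block = 256;  // assumed value, not tuned
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    add_kernel<<<blocks, threads_per_block>>>(d_a, d_b, d_out, n);

    // Step 4: copy the result back; this copy waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // Step 5: free device memory; the host vectors free themselves.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

On a machine with a CUDA-capable GPU, calling add_on_gpu({1, 2, 3}, {10, 20, 30}) would return the values 11, 22, and 33. Using std::vector for the host side is a design choice made here so that host memory is released automatically; code written in plain C would use malloc and free instead.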