GeForce 256 was the first GPU developed by NVIDIA in 1999. Initially, GPUs were only used for rendering high-end graphics on monitors. They were only used for pixel computations. Later on, people realized that if GPUs can do pixel computations, then they would also be able to do other mathematical calculations. Nowadays, GPUs are used in many applications other than rendering graphics. These kinds of GPUs are called General-Purpose GPUs (GPGPUs).
The next question that may have come to your mind is the difference between the hardware architecture of a CPU and a GPU that allows it to carry out parallel computation. A CPU has a complex control hardware and less data computation hardware. Complex control hardware gives a CPU flexibility in performance and a simple programming interface, but it is expensive in terms of power. On the other hand, a GPU has simple control hardware and more hardware for data computation that gives it the ability for parallel computation. This structure makes it more power-efficient. The disadvantage is that it has a more restrictive programming model. In the early days of GPU computing, graphics APIs such as OpenGL and DirectX were the only way to interact with GPUs. This was a complex task for normal programmers, who were not familiar with OpenGL or DirectX. This led to the development of CUDA programming architecture, which provided an easy and efficient way of interacting with the GPUs. More details about CUDA architecture are given in the next section.
Normally, the performance of any hardware architecture is measured in terms of latency and throughput. Latency is the time taken to complete a given task, while throughput is the amount of the task completed in a given time. These are not contradictory concepts. More often than not, improving one improves the other. In a way, most hardware architectures are designed to improve either latency or throughput. For example, suppose you are standing in a queue at the post office. Your goal is to complete your work in a small amount of time, so you want to improve latency, while an employee sitting at a post office window wants to see more and more customers in a day. So, the employee's goal is to increase the throughput. Improving one will lead to an improvement in the other, in this case, but the way both sides look at this improvement is different.
In the same way, normal sequential CPUs are designed to optimize latency, while GPUs are designed to optimize throughput. CPUs are designed to execute all instructions in the minimum time, while GPUs are designed to execute more instructions in a given time. This design concept of GPUs makes them very useful in image processing and computer vision applications, which we are targeting in this book, because we don't mind a delay in the processing of a single pixel. What we want is that more pixels should be processed in a given time, which can be done on a GPU.
So, to summarize, parallel computing is what we need if we want to increase computational performance at the same clock speed and power requirement. GPUs provide this capability by having lots of simple computational units working in parallel. Now, to interact with the GPU and to take advantage of its parallel computing capabilities, we need a simple parallel programming architecture, which is provided by CUDA.