Once all of our CUDA dependencies are installed and working, we can start with a simple CUDA C++ program:
- First, we'll include the necessary header files and define the number of elements we'd like to process. 1 << 20 is 1,048,576, which is more than enough elements for a meaningful GPU test. You can change the shift amount if you'd like to see the difference in processing time:
#include <cstdlib>
#include <iostream>
const int ELEMENTS = 1 << 20;
Our multiply function is marked with the __global__ specifier. This tells nvcc, the CUDA-specific C++ compiler, to compile the function as a kernel: code that is launched from the host (CPU) and executed on the GPU. The multiply function is relatively straightforward: it takes the a and b arrays, multiplies them together on the GPU, and stores the result in the c array. A __global__ function must have a void return type, so the result is written to c rather than returned:
__global__ void multiply(int j, float...