We will now go through the very basics of writing a full CUDA-C program. We'll start small by translating the fixed version of the little matrix multiplication test program that we debugged in the last section into a pure CUDA-C program. We will then compile it from the command line with NVIDIA's nvcc compiler into a native Windows or Linux executable file. (We will see how to use the Nsight IDE in the next section, so for now we will work with only a text editor and the command line.) Again, the reader is encouraged to look at the Python code we are translating as we go along; it is available as the matrix_ker.py file in the repository.
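As a rough preview of where this is heading, here is a minimal sketch of what a standalone CUDA-C matrix multiplication test might look like. It is not the program developed in this section: the kernel name matrix_mult_ker, the helper run_matrix_mult_test, the CHECK macro, and the 4 x 4 matrix size are all assumptions made for illustration. The sketch shows the general shape of such a program: a kernel, host code that copies data to and from the GPU, and a check against a CPU reference.

```cuda
// Hypothetical sketch of a standalone CUDA-C program (not the book's matrix_ker.cu).
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

#define N 4  // assumed matrix width, chosen only for illustration

// Report a failed CUDA call and return 1 from the enclosing function.
// (For brevity, this sketch does not free device memory on failure.)
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            return 1;                                                     \
        }                                                                 \
    } while (0)

// Each thread computes one entry of the n x n output matrix (row-major).
__global__ void matrix_mult_ker(const float *a, const float *b, float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {
        float val = 0.0f;
        for (int k = 0; k < n; k++)
            val += a[row * n + k] * b[k * n + col];
        out[row * n + col] = val;
    }
}

// Returns 0 if the GPU result matches a CPU reference, 1 otherwise.
int run_matrix_mult_test(void)
{
    const size_t bytes = N * N * sizeof(float);
    float h_a[N * N], h_b[N * N], h_out[N * N];

    // Fill the inputs with small integers so the float arithmetic is exact.
    for (int i = 0; i < N * N; i++) {
        h_a[i] = (float)(i % N + 1);
        h_b[i] = (float)(i / N) - 1.0f;
    }

    // Allocate device buffers and copy the inputs to the GPU.
    float *d_a = NULL, *d_b = NULL, *d_out = NULL;
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    // Launch a single N x N block: one thread per output entry.
    dim3 block(N, N);
    dim3 grid(1, 1);
    matrix_mult_ker<<<grid, block>>>(d_a, d_b, d_out, N);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Copy the result back to the host and release the device buffers.
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Compare every entry against a straightforward CPU computation.
    int mismatches = 0;
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            float ref = 0.0f;
            for (int k = 0; k < N; k++)
                ref += h_a[row * N + k] * h_b[k * N + col];
            if (fabsf(ref - h_out[row * N + col]) > 1e-5f)
                mismatches++;
        }
    }
    return mismatches == 0 ? 0 : 1;
}

int main(void)
{
    int status = run_matrix_mult_test();
    printf(status == 0 ? "GPU result matches the CPU reference.\n"
                       : "GPU result does not match, or a CUDA call failed.\n");
    return status;
}
```

Assuming the sketch is saved as matrix_ker.cu and the CUDA Toolkit is on the path, a command along the lines of nvcc matrix_ker.cu -o matrix_ker should build it into a native executable.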
Now, let's open our favorite text editor and create a new file named matrix_ker.cu. The .cu extension indicates that this is a CUDA-C program, which can be compiled...