In the prior chapters, we saw that there are two primary operations we perform from the host when interacting with the GPU (see the sketch that follows this list):
- Copying memory data to and from the GPU
- Launching kernel functions
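To make these two operations concrete, here is a minimal, synchronous CUDA C sketch; the kernel name `double_elements` and the array size are made up for illustration. The host copies an input array to device memory, launches a kernel on it, and copies the result back; both copies block the host, and the default stream ensures the copy back only begins after the kernel has finished.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: doubles each element of an array.
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; i++)
        h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, bytes);

    // Operation 1: copy memory data from the host to the GPU.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Operation 2: launch a kernel function on the device.
    double_elements<<<(N + 255) / 256, 256>>>(d_data, N);

    // Operation 1 again: copy the result back from the GPU to the host.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```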
We know that within a single kernel there is one level of concurrency, among its many threads; however, there is a second level of concurrency, across multiple kernels and GPU memory operations, that is also available to us. This means that we can launch several memory and kernel operations at once, without waiting for each operation to finish before issuing the next. At the same time, we have to stay organized and ensure that all interdependent operations are properly synchronized: we shouldn't launch a particular kernel until its input data has been fully copied to device memory, and we shouldn't copy the output data of a launched kernel back to the host until the kernel has finished executing. The sketch below illustrates this second level of concurrency.
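One standard way the CUDA runtime exposes this second level of concurrency is through streams: operations queued in the same stream run in order, while operations in different streams may overlap. The following is a minimal CUDA C sketch, assuming two independent work items; the kernel `double_elements`, the array size, and the number of streams are illustrative choices, not taken from the text.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element of an array (illustration only).
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    const int num_streams = 2;

    cudaStream_t streams[num_streams];
    float *h_data[num_streams], *d_data[num_streams];

    for (int s = 0; s < num_streams; s++) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost(&h_data[s], bytes);  // pinned host memory, needed for async copies
        cudaMalloc(&d_data[s], bytes);
        for (int i = 0; i < N; i++)
            h_data[s][i] = (float)i;
    }

    // Each stream issues copy -> kernel -> copy without waiting for the others.
    // Within one stream the operations still run in order, so the kernel never
    // starts before its input copy finishes, and the output copy never starts
    // before the kernel finishes.
    for (int s = 0; s < num_streams; s++) {
        cudaMemcpyAsync(d_data[s], h_data[s], bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        double_elements<<<(N + 255) / 256, 256, 0, streams[s]>>>(d_data[s], N);
        cudaMemcpyAsync(h_data[s], d_data[s], bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Block the host until every stream has drained its queued work.
    cudaDeviceSynchronize();

    for (int s = 0; s < num_streams; s++) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(h_data[s]);
        cudaFree(d_data[s]);
    }
    return 0;
}
```

Note that the host loop returns almost immediately after queuing all of the work; the ordering guarantees within each stream, plus the final synchronization call, are what keep the interdependent operations correct.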