First, we will take a look at dynamic parallelism, a feature of CUDA that allows a kernel to launch and manage other kernels without any interaction or input from the host. Dynamic parallelism also makes many of the host-side CUDA C features that are normally available also available on the GPU, such as device memory allocation/deallocation, device-to-device memory copies, context-wide synchronization, and streams.
Let's start with a very simple example. We will create a small kernel over N threads, each of which prints a short message to the terminal; the kernel will then recursively launch another kernel over N - 1 threads, and this process will continue until N reaches 1. (Of course, beyond illustrating how dynamic parallelism works, this example would be pretty pointless.)
We begin with the import statements in Python:
from __future__ import division
import numpy as np
import pycuda.autoinit
from pycuda.compiler import DynamicSourceModule
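With the imports in place, here is a minimal sketch of the rest of the example, building on the imports above. It uses PyCUDA's DynamicSourceModule, which compiles the kernel with relocatable device code and links it against the CUDA device runtime so that kernels may launch other kernels; the kernel name dynamic_hello_ker and the choice of N = 4 threads are illustrative.

# A recursive kernel: each thread prints a message, then thread 0
# launches a child kernel over one fewer thread.
DynamicParallelismCode = '''
__global__ void dynamic_hello_ker(int depth)
{
    printf("Hello from thread %d, recursion depth %d!\\n", threadIdx.x, depth);

    // Only thread 0 launches the child kernel; the recursion
    // bottoms out once blockDim.x reaches 1.
    if (threadIdx.x == 0 && blockDim.x > 1)
        dynamic_hello_ker<<<1, blockDim.x - 1>>>(depth + 1);
}
'''

dp_mod = DynamicSourceModule(DynamicParallelismCode)
hello_ker = dp_mod.get_function('dynamic_hello_ker')

# Launch the first kernel over N = 4 threads at recursion depth 0.
hello_ker(np.int32(0), grid=(1, 1, 1), block=(4, 1, 1))

Note that an ordinary SourceModule would not work here: device-side kernel launches require the separate compilation and device-runtime linkage that DynamicSourceModule performs.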