Theano Op in C for GPU
As you might have imagined, it is possible to combine both optimizations:
- Reduce the Python/C overhead by programming directly in C
- Write the code for the GPU
To write CUDA code for the GPU, the code that will run in parallel on the GPU's many cores has to be packaged into a special function type named a kernel.
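Before moving to real kernel code, it may help to see the idea in plain Python. The sketch below simulates how a kernel divides work: every "thread" runs the same function and uses its id and the total thread count to stride over the data. The names (`kernel_double`, `thread_id`) are illustrative only, not any GPU API:

```python
# A hypothetical pure-Python illustration of the kernel idea: each
# thread executes the same function on a strided subset of the data.
def kernel_double(thread_id, num_threads, x, z):
    """Each thread doubles the elements at positions
    thread_id, thread_id + num_threads, thread_id + 2*num_threads, ..."""
    for i in range(thread_id, len(x), num_threads):
        z[i] = 2 * x[i]

x = list(range(8))
z = [0] * 8
num_threads = 4
# On a GPU these calls would run in parallel; here we loop sequentially.
for tid in range(num_threads):
    kernel_double(tid, num_threads, x, z)
# z is now [0, 2, 4, 6, 8, 10, 12, 14]
```

On a real GPU, the thread id and thread count come from the runtime (for example `LID_0` and `LDIM_0` in the kernel code shown later), and the threads genuinely execute concurrently.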
For that purpose, the __init__(), make_node(), and c_code_cache_version() methods stay the same as in our Python example for GPU, but there is a new gpu_kernels() method to define the GPU kernels, and the c_code() method (which again replaces the perform() method) implements the C code, also named the host code, that orchestrates how and when to call the different kernels on the GPU:
```python
def gpu_kernels(self, node, name):
    code = """
KERNEL void axpb(GLOBAL_MEM %(ctype)s *x, GLOBAL_MEM %(ctype)s *z,
                 ga_size n, ga_size m) {
    for (ga_size i = LID_0; i < n; i += LDIM_0) {
        for (ga_size j = LID_0; j < m; j += LDIM_0) {
            z[i*m + j] = %...
```
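Setting GPU-specific details aside, the kernel above walks an n x m array with two strided loops and writes one output element per iteration. A sequential NumPy sketch of the equivalent host-side computation follows; it assumes, as the kernel name axpb suggests, that the elided expression computes a*x + b elementwise, and the values of a and b are made up for illustration:

```python
import numpy as np

def axpb_host(x, a=2.0, b=1.0):
    """Sequential sketch of the kernel's double loop.
    Assumes the elided kernel body computes a*x + b (hypothetical)."""
    n, m = x.shape
    z = np.empty_like(x)
    for i in range(n):      # GPU: for (i = LID_0; i < n; i += LDIM_0)
        for j in range(m):  # GPU: for (j = LID_0; j < m; j += LDIM_0)
            z[i, j] = a * x[i, j] + b
    return z

x = np.arange(6, dtype=np.float32).reshape(2, 3)
z = axpb_host(x)
```

In the real kernel the loop indices start at the thread's local id and advance by the number of threads, so the iterations are spread across the GPU cores instead of running one after another.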