Coalesced transpose via shared memory (source: NVIDIA Parallel Forall)
When the size of the data is not an exact multiple of the block size times the grid size, threads at the border, which have less or no data to process, finish earlier than the other threads, and the kernel code must check explicitly for out-of-bounds memory accesses.
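The bounds check described above can be sketched in plain Python that emulates the index arithmetic of a GPU launch. This is only an illustration, not real GPU code: the names `launch_config`, `scale_thread`, and `run`, and the default block size of 256, are assumptions made for the example.

```python
def launch_config(n, block_size=256):
    """Return (grid_size, total_threads) so that the launch covers n elements."""
    # Ceiling division: round up so the last, partially filled block exists.
    grid_size = (n + block_size - 1) // block_size
    return grid_size, grid_size * block_size


def scale_thread(data, out, factor, n, block_idx, block_dim, thread_idx):
    """Emulate the work of a single thread of a hypothetical scaling kernel."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:  # guard: border threads past the end of the data do nothing
        out[i] = data[i] * factor


def run(data, factor, block_size=256):
    """Emulate the whole launch sequentially, one simulated thread at a time."""
    n = len(data)
    out = [0.0] * n
    grid_size, _ = launch_config(n, block_size)
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            scale_thread(data, out, factor, n, block_idx, block_size, thread_idx)
    return out


if __name__ == "__main__":
    # 1000 elements with blocks of 256 threads: 4 blocks, 1024 threads,
    # so the last 24 threads fail the guard and exit immediately.
    print(launch_config(1000, 256))
    print(run([1.0, 2.0, 3.0], 2.0, block_size=2))
```

Without the `i < n` guard, the simulated thread with index 3 in the second call would write past the end of `out`; on a real GPU the equivalent access would read or corrupt unrelated memory.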
Parallel programming brings new pitfalls to check for: race conditions, memory bank conflicts in shared memory, and data that does not fit in the available registers and therefore cannot stay local to the thread. Coalescing global memory accesses is by far the most critical aspect of achieving good performance. The NVIDIA® Nsight™ tool helps you develop, debug, and profile code that executes on the CPU and the GPU.
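To make the notion of a shared memory bank conflict more concrete, here is a small Python sketch of the bank arithmetic for a tile such as the one used in a shared-memory transpose. It assumes 32 banks of 4-byte words and a warp of 32 threads, which holds for recent NVIDIA GPUs but should be checked against the target architecture; the helper names are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def banks_for_column(row_width, column=0, warp_size=32):
    """Banks touched when each thread of a warp reads one row of the same column.

    The tile is stored row by row, so element (row, column) sits at word
    offset row * row_width + column, and its bank is that offset modulo 32.
    """
    return [(row * row_width + column) % NUM_BANKS for row in range(warp_size)]


def conflict_degree(banks):
    """Largest number of threads hitting one bank (1 means conflict-free)."""
    return max(banks.count(bank) for bank in set(banks))


if __name__ == "__main__":
    # A 32-word-wide tile: every row of a column falls into the same bank.
    print(conflict_degree(banks_for_column(32)))
    # Padding each row to 33 words spreads the column over all 32 banks.
    print(conflict_degree(banks_for_column(33)))
```

Padding the tile by one word per row is a common way to remove this kind of conflict, at the cost of a little extra shared memory.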
Model conversions
When a model is saved, the resulting data is simply a list of arrays, that is, weight vectors (for biases) and matrices (for multiplications), together with a name for each layer. It is quite simple to convert a model from one framework to another: it consists...