Warp shuffling

We will now look at what is known as warp shuffling. This is a feature in CUDA that allows threads within the same CUDA Warp to communicate concurrently by directly reading and writing to each other's registers (that is, their local stack-space variables), without the use of shared variables or global device memory. Warp shuffling is actually much faster and easier to use than the other two options. This almost sounds too good to be true, so there must be a catch: indeed, the catch is that shuffling only works between threads that exist on the same CUDA Warp, which limits shuffling operations to groups of 32 or fewer threads. Another catch is that we can only use datatypes that are 32 bits or smaller. This means that we can't shuffle 64-bit long long integers or double floating point values across a Warp.
Only 32-bit (or smaller) datatypes can be used in Warp shuffling; 64-bit values such as long long integers and doubles cannot be shuffled across a Warp.
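As a concrete illustration (not taken from the text), here is a minimal CUDA C sketch that sums the 32-bit values held by the 32 threads of a single Warp using the __shfl_down_sync intrinsic; the kernel name warp_sum_kernel and the single-Warp launch configuration are assumptions made for this example.

#include <cstdio>

// Each of the 32 lanes in the Warp starts with its own lane index held in a
// register; __shfl_down_sync lets a lane read the register of the lane
// `offset` positions above it, so the sum is computed without touching
// shared or global memory during the reduction itself.
__global__ void warp_sum_kernel(int *out)
{
    int lane = threadIdx.x % warpSize;   // lane index within the Warp (0..31)
    int value = lane;                    // the 32-bit value this thread contributes

    // Tree reduction across the Warp: halve the stride each step.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        value += __shfl_down_sync(0xFFFFFFFF, value, offset);

    // Lane 0 now holds 0 + 1 + ... + 31 = 496.
    if (lane == 0)
        *out = value;
}

int main()
{
    int *d_out, result = 0;
    cudaMalloc(&d_out, sizeof(int));
    warp_sum_kernel<<<1, 32>>>(d_out);   // launch exactly one Warp of 32 threads
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("warp sum: %d\n", result);    // prints 496
    cudaFree(d_out);
    return 0;
}

Note that the first argument to __shfl_down_sync is a 32-bit mask of the participating lanes (0xFFFFFFFF means all 32), and that the value being shuffled is a 32-bit int, in keeping with the datatype restriction described above.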