- In the atomic operations example, try changing the grid size from 1 to 2 before launching the kernel, while keeping the block size at 100. If this gives you the wrong output for add_out (anything other than 200), why is it wrong, given that atomicExch is thread-safe?
- In the atomic operations example, try removing __syncthreads, and then run the kernel with the original launch parameters of grid size 1 and block size 100. If this gives you the wrong output for add_out (anything other than 100), why is it wrong, given that atomicExch is thread-safe?
- Why do we not have to use __syncthreads to synchronize over a block of size 32 or less?
- We saw that sum_ker is around five times faster than PyCUDA's sum operation for random-valued arrays of length 640,000 (10000*2*32). If you try adding a zero to the end of this number (that is, multiplying it by 10), does sum_ker keep the same performance advantage over PyCUDA's sum? If not, why not, and how might sum_ker be modified to work efficiently over larger arrays?
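As a reference point for the first two exercises, here is a minimal sketch of the kind of atomic-operations kernel they assume. The name atomic_ker and the exact structure are assumptions based on the questions above, not necessarily the chapter's exact code:

```cuda
// Sketch (assumed structure) of the atomic operations example.
// Launched with grid size 1 and block size 100, add_out should end up as 100.
__global__ void atomic_ker(int *add_out)
{
    // Every thread races to zero the output first. atomicExch is
    // thread-safe per call, but it imposes no ordering between threads.
    atomicExch(add_out, 0);

    // All threads in the block must pass this barrier before any of them
    // starts incrementing. Note that __syncthreads only synchronizes
    // threads within a single block, never across blocks in the grid --
    // which is the key to the grid-size-2 question.
    __syncthreads();

    // Each increment is atomic, so the adds themselves never collide;
    // the exercises are about what happens when an atomicExch from one
    // thread (or one block) can land between another thread's barrier
    // and its atomicAdd.
    atomicAdd(add_out, 1);
}
```

When reasoning about the exercises, keep in mind that thread safety of an individual atomic operation only guarantees that the single read-modify-write is indivisible; it says nothing about the order in which different threads' atomics execute.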