- Try it.
- Not all of the threads operate on the GPU simultaneously. Much like a CPU switching between tasks in an OS, the individual cores of the GPU switch between the different threads of a kernel.
- O((n/640) log n), that is, O(n log n), since constant factors are dropped in big-O notation.
- Try it.
- There is actually no internal grid-level synchronization in CUDA, only block-level (with `__syncthreads`); anything spanning more than a single block has to be synchronized from the host.
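A minimal NumPy sketch of the idea (the function names and block size are hypothetical, and Python loops stand in for GPU blocks): since `__syncthreads` only coordinates threads within one block, a grid-wide reduction is split into two "kernel launches", and the boundary between them, where control returns to the host, is the grid-wide barrier.

```python
import numpy as np

def block_sums_kernel(x, block_size):
    # "Kernel launch" 1: each block independently reduces its own slice.
    # __syncthreads could only coordinate threads inside one such block.
    return np.array([x[i:i + block_size].sum()
                     for i in range(0, len(x), block_size)])

def final_sum_kernel(partials):
    # "Kernel launch" 2: runs only after launch 1 has fully completed,
    # so it safely sees every block's partial sum.
    return partials.sum()

x = np.arange(1024, dtype=np.int64)
partials = block_sums_kernel(x, block_size=32)  # host waits here
total = final_sum_kernel(partials)              # the launch boundary was the barrier
print(total)  # 523776
```

In real CUDA code the same pattern appears as two kernel launches separated by a host-side synchronization (or an implicit one on the same stream).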
- Naive: 129 addition operations. Work-efficient: 62 addition operations.
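Assuming the counts refer to a 32-element array (the arithmetic below reproduces exactly 129 and 62 for n = 32), a short Python sketch of where the two numbers come from:

```python
def naive_scan_adds(n):
    # Hillis-Steele (naive) scan: at step d, every element with index
    # i >= 2**d performs one addition, so each step costs n - 2**d adds.
    adds, d = 0, 1
    while d < n:
        adds += n - d
        d *= 2
    return adds

def work_efficient_scan_adds(n):
    # Blelloch (work-efficient) scan: the up-sweep and the down-sweep
    # each perform n - 1 additions on an n-element array.
    return 2 * (n - 1)

print(naive_scan_adds(32))           # 129
print(work_efficient_scan_adds(32))  # 62
```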
- Again, we can't use `__syncthreads` if we need to synchronize over a large grid of blocks. If we synchronize on the host instead, we can also launch fewer threads on each iteration, freeing up resources for other operations.
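A hypothetical NumPy sketch of this shrinking-launch pattern, with each loop iteration standing in for one kernel launch and the return to the host acting as the grid-wide synchronization point (the input length is assumed to be a power of two):

```python
import numpy as np

def reduce_sum(x):
    x = x.copy()
    n = len(x)
    # Each loop iteration stands in for one kernel launch; returning to
    # the host between launches is the grid-wide synchronization point.
    while n > 1:
        half = n // 2
        # Only `half` "threads" do work this round -- each successive
        # launch is smaller, freeing resources for other operations.
        x[:half] += x[half:n]
        n = half
    return x[0]

print(reduce_sum(np.arange(64, dtype=np.int64)))  # 2016
```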
- In the case of a naive parallel sum, we will likely be working with only a small number of data points that...