- The best method to choose the number of threads and number of blocks is as follows:
gpuAdd<<<512, 512>>>(d_a, d_b, d_c);
The number of threads that can be launched per block is limited: 512 on older GPUs (compute capability 1.x) and 1024 on newer ones. The number of blocks per grid is limited in the same way. So when a large number of elements has to be processed, it is better to launch the kernel with a fixed, moderate number of blocks and threads, as in the launch above, and let each thread handle more than one element.
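The sizing rule described above can be sketched as plain host-side C++ that compiles unchanged inside a `.cu` file. This is only an illustrative sketch: `chooseBlocks`, `passesPerThread`, and the `maxBlocks` cap are hypothetical names introduced here, not part of the CUDA API or of the original answer. It assumes a 1D launch and a kernel that loops over its elements with a stride equal to the total number of threads in the grid.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (an assumption for illustration, not a CUDA API call):
// returns the number of blocks for a 1D launch over n elements.
// maxBlocks is a cap chosen by the caller; elements beyond
// maxBlocks * threadsPerBlock are left to the kernel's stride loop.
int chooseBlocks(int n, int threadsPerBlock, int maxBlocks)
{
    assert(n > 0 && threadsPerBlock > 0 && maxBlocks > 0);
    // Ceiling division: enough blocks to give every element its own thread.
    int needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    // Never exceed the cap; the stride loop covers the remainder.
    return std::min(needed, maxBlocks);
}

// Hypothetical helper: number of loop iterations the busiest thread
// performs when the grid has blocks * threadsPerBlock threads in total.
int passesPerThread(int n, int threadsPerBlock, int blocks)
{
    assert(n > 0 && threadsPerBlock > 0 && blocks > 0);
    int totalThreads = threadsPerBlock * blocks;
    return (n + totalThreads - 1) / totalThreads;
}
```

For 50000 elements and 512 threads per block, `chooseBlocks(50000, 512, 512)` returns 98, so every element gets its own thread in a single pass. With a tighter cap of 16 blocks, the grid has 8192 threads and `passesPerThread(50000, 512, 16)` returns 7, meaning some threads process seven elements.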
- The following CUDA program computes the cube of 50,000 numbers:
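#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

// Number of elements to cube
#define N 50000

// Each thread cubes one or more elements of d_in and writes the results to d_out.
__global__ void gpuCube(float *d_in, float *d_out)
{
    // Global index of this thread across the whole grid
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Keep looping until all N elements are covered
    while (tid < N)
    {
        float temp = d_in[tid];
        d_out[tid] = temp * temp * temp;
        // Jump ahead by the total number of threads in the grid
        tid += blockDim.x * gridDim.x;
    }
}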
int main...