a) Consider this loop:
a[0] = 0;
for (i = 1; i < n; i++)
    a[i] = a[i-1] + i;
Since the value of a[i] can't be computed without the value of a[i-1],
there's a loop-carried dependence. Determine how this dependence could be
eliminated so that the loop could be parallelized (for example, with
OpenMP, although you do not have to write the OpenMP code).
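One possible approach, sketched here rather than given as the expected answer: the recurrence simply accumulates 0 + 1 + ... + i, so a[i] can be written in the closed form i*(i+1)/2, which depends only on i. The function names below are hypothetical, and the OpenMP pragma is optional (a compiler without OpenMP support ignores it).

```c
/* Hypothetical helper: original serial recurrence, kept for comparison. */
void fill_serial(long *a, int n)
{
    int i;
    if (n > 0)
        a[0] = 0;
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + i;      /* loop-carried dependence on a[i-1] */
}

/* Hypothetical helper: same values from the closed form
   a[i] = i*(i+1)/2, i.e. the sum 0 + 1 + ... + i. */
void fill_closed_form(long *a, int n)
{
    int i;
    /* Each iteration reads only i, so iterations are independent
       and may be executed in any order or in parallel. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = (long)i * (i + 1) / 2;
}
```

Both functions should fill the array with the same values; only the second has iterations that can safely run concurrently.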
b) Consider the following portion of CUDA reduction code:
reduce1<<<...>>>(dev_array_orig, dev_array_new);
cudaMemcpy(host_array_new, dev_array_new, sizeof(int)*N,
           cudaMemcpyDeviceToHost);
for (i = 1; i < nBlocks.x; i++)
    host_array_new[0] += host_array_new[i];
Again, you do not have to prepare code. Describe what would need to be
done in order to replace the for loop with another CUDA kernel call that
would implement a final reduction into just one element.
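As a rough illustration of one possible answer, a second kernel could be launched with a single block over the nBlocks.x partial sums, combining them with a stride-halving tree reduction and a barrier such as __syncthreads() between steps. The sketch below is plain C that emulates that access pattern on the host; it is not CUDA code. The function name is hypothetical, and rounding the width up to a power of two is an assumption made to handle arbitrary counts.

```c
#include <stddef.h>

/* Hypothetical host-side emulation of a single-block tree reduction.
   partial[0..count-1] holds the per-block partial sums; on return,
   partial[0] holds their total. In a real kernel each value of tid
   would be a separate thread, and a barrier would follow every step
   of the outer loop. */
int final_reduce_emulated(int *partial, size_t count)
{
    size_t width = 1, stride, tid;

    if (count == 0)
        return 0;

    /* Assumption: round up to a power of two so the stride halves cleanly. */
    while (width < count)
        width *= 2;

    for (stride = width / 2; stride > 0; stride /= 2) {
        for (tid = 0; tid < stride; tid++) {
            /* Bounds check skips the padding beyond count. */
            if (tid + stride < count)
                partial[tid] += partial[tid + stride];
        }
    }
    return partial[0];
}
```

With this pattern only the single result element would need to be copied back to the host, instead of the whole array.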