Description
The aim of this lab is to implement the reduction algorithm we learned in class. You are required to write
a full CUDA program to compute the sum of a 1D float array with 2^24 elements. In this assignment, you
need to implement and time four versions of the kernel that will do the reduction:
• Version (1) uses shared memory but has more divergence within the warps. This version is
similar to the example on page 46 of the lecture notes “05G_CUDA_BestPractices”.
• Version (2) uses shared memory and has less divergence. This is similar to the example on
page 49 of the lecture notes.
• Version (3) is similar to version (1) but does not use shared memory (it writes directly to
global memory).
• Version (4) is similar to version (2) but does not use shared memory. A sketch of versions (1)
and (2) is given after this list.
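As a rough guide, versions (1) and (2) might look like the sketch below. The kernel names, the fixed block size of 512, and the partial-sum output array are choices made for illustration only, and the exact variants shown in the lecture notes may differ slightly. Each block reduces its 512 elements to a single value, so the host still has to combine the per-block partial sums at the end. Versions (3) and (4) would keep the same loop structure but drop the shared buffer and accumulate directly in the global-memory input array.

// Version (1) sketch: shared memory, interleaved addressing -> more warp divergence.
__global__ void reduceSharedDivergent(const float *in, float *partialSums)
{
    __shared__ float sdata[512];                 // assumes 512 threads per block
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = in[i];
    __syncthreads();

    // stride 1, 2, 4, ...; only threads whose index is a multiple of 2*stride work,
    // so the active threads are scattered across the warps (divergence).
    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partialSums[blockIdx.x] = sdata[0];
}

// Version (2) sketch: shared memory, sequential addressing -> less warp divergence.
__global__ void reduceSharedConvergent(const float *in, float *partialSums)
{
    __shared__ float sdata[512];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = in[i];
    __syncthreads();

    // stride blockDim.x/2, /4, ...; the active threads are always the first
    // 'stride' threads of the block, so whole warps become idle together.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partialSums[blockIdx.x] = sdata[0];
}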
In the host code (the main function), create a 1D array with 2^24 random float numbers from 0 to 255.
Then, launch each of the above four kernels, one at a time. The output should be similar to the one below:
Reducing an array of 16777216 floats on a grid of (32768,1,1) blocks, each block with (512,1,1) threads
Using shared memory, More divergence: GPU time: 227.364 ms GPU sum: 2139115520.00
Using shared memory, Less divergence: GPU time: 37.777 ms GPU sum: 2139115520.00
Using global memory, More divergence: GPU time: 234.256 ms GPU sum: 2139115520.00
Using global memory, Less divergence: GPU time: 45.022 ms GPU sum: 2139115520.00
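A minimal host-side sketch for launching and timing one of the kernels is shown below, assuming the reduceSharedConvergent kernel and the partial-sum array from the earlier sketch; CUDA events are used for the timing, and the per-block partial sums are added on the CPU. For the global-memory versions, which modify the input array in place, the input would have to be copied to the device again before each launch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// reduceSharedConvergent is the version (2) kernel sketched earlier.
int main()
{
    const int N       = 1 << 24;              // 2^24 elements
    const int threads = 512;
    const int blocks  = N / threads;          // 32768 blocks

    float *h_in = (float*)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i)
        h_in[i] = (float)(rand() % 256);      // random values in [0, 255]

    float *d_in, *d_partial;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // Time the kernel launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceSharedConvergent<<<blocks, threads>>>(d_in, d_partial);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Copy the per-block partial sums back and finish the sum on the CPU.
    float *h_partial = (float*)malloc(blocks * sizeof(float));
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < blocks; ++i)
        sum += h_partial[i];

    printf("Using shared memory, Less divergence: GPU time: %.3f ms GPU sum: %.2f\n", ms, sum);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_partial);
    free(h_in);
    free(h_partial);
    return 0;
}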
In addition, choose any one of the four kernels and compare its execution time: (i) once using the
‘multiplication’ operator (*) to compute the ‘stride’, and (ii) again using the ‘shift’ operator (<<).
Report the difference in time. Note: ignore the error that Visual Studio displays for using __syncthreads();
your code should still compile and run despite it.
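For part (ii), only the stride update in the reduction loop changes. One way to set it up, reusing the version (1) sketch from above (the kernel name is again just illustrative), is:

// Same as the reduceSharedDivergent sketch, but the stride is doubled with a
// left shift instead of a multiplication; everything else is unchanged.
__global__ void reduceSharedDivergentShift(const float *in, float *partialSums)
{
    __shared__ float sdata[512];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = in[i];
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride <<= 1) {  // was: stride *= 2
        if (tid % (2 * stride) == 0)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partialSums[blockIdx.x] = sdata[0];
}

Since the stride is always a power of two, stride <<= 1 and stride *= 2 produce the same sequence of values, and compilers often emit the same instruction for both, so any measured difference is usually small.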
Marking guide:
+16 for 4 kernels. The marks are distributed for each kernel as follows:
– +1 for using global/shared memory
– +1 for more/less divergence
– +2 for correctness
+8 for the host code distributed as follows:
– +2 for timing the execution
– +4 for kernel configuration and launch (+1 for each kernel)
– +2 for other code (e.g., copying to/from GPU, freeing up the memory, etc.).
+1 for the difference between using * and <<
+3 bonus for creating and using a macro to handle errors for CUDA function calls (a possible form is sketched after this list).
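One possible form of the bonus macro is sketched below; the CUDA_CHECK name and the message format are illustrative choices, not requirements. It would wrap every CUDA runtime call, for example CUDA_CHECK(cudaMalloc(&d_in, N * sizeof(float)));.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Checks the return value of a CUDA runtime call and aborts with a message
// (error string, file, line) if the call did not return cudaSuccess.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)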
Submission Instructions
For this assignment, you need to do the following:
1- Compress all your files (the source code and the image) into one zip file and give it a name that
matches your ID (e.g., 1234567.zip).
2- Submit the zipped file to Canvas.
Note that you can resubmit an assignment, but the new submission overwrites the old submission and
receives a new timestamp.