CS6023: GPU Programming Assignment 1 solution

1 Problem Statement

Write three separate CUDA C++ kernels for performing computations on two
input matrices (A and B) and generating the output matrix C.

In the first kernel, per_row_column_kernel, each thread should process a
complete row of matrix A and the corresponding complete column of matrix B.
In the second kernel, per_column_row_kernel, each thread should process a
complete column of matrix A and the corresponding complete row of matrix B.
In the third kernel, per_element_kernel, each thread should process exactly
one element from each of the two input matrices.

For evaluation, per_row_column_kernel will be invoked with a 1D grid and
1D blocks, per_column_row_kernel with a 1D grid and 2D blocks, and
per_element_kernel with a 2D grid and 2D blocks.
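
The three launch configurations described above differ mainly in how a thread derives a unique index. The sketch below emulates that arithmetic on the host in plain C++, so it can be checked without a GPU. The `Dim3` struct and the three function names are hypothetical stand-ins for CUDA's built-in `gridDim`, `blockIdx`, `blockDim`, and `threadIdx`. The x-fastest linearization is one common convention, not something the assignment prescribes.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3 (x and y only; z is unused here).
struct Dim3 {
    unsigned x, y;
    Dim3(unsigned x_ = 1, unsigned y_ = 1) : x(x_), y(y_) {}
};

// 1D grid, 1D blocks (the launch shape stated for the first kernel).
unsigned flat_id_1d_grid_1d_block(Dim3 bIdx, Dim3 bDim, Dim3 tIdx) {
    return bIdx.x * bDim.x + tIdx.x;
}

// 1D grid, 2D blocks (the launch shape stated for the second kernel).
unsigned flat_id_1d_grid_2d_block(Dim3 bIdx, Dim3 bDim, Dim3 tIdx) {
    unsigned threadsPerBlock = bDim.x * bDim.y;
    unsigned local = tIdx.y * bDim.x + tIdx.x;  // position inside the block
    return bIdx.x * threadsPerBlock + local;
}

// 2D grid, 2D blocks (the launch shape stated for the third kernel).
unsigned flat_id_2d_grid_2d_block(Dim3 gDim, Dim3 bIdx, Dim3 bDim, Dim3 tIdx) {
    unsigned block = bIdx.y * gDim.x + bIdx.x;  // position of the block in the grid
    unsigned threadsPerBlock = bDim.x * bDim.y;
    unsigned local = tIdx.y * bDim.x + tIdx.x;
    return block * threadsPerBlock + local;
}
```

With 4 threads per block, for example, thread 3 of block 2 gets flat id 11 under the first shape. In a real kernel the flat id would then select a row, a column, or a single element, depending on the kernel.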

2 Input and Output

2.1 Input
• Matrix A of size m x n
• Matrix B of size n x m

2.2 Output
• Output is Matrix C of size m x n
• Output is computed as $C = (A + B^T) \cdot (B^T - A)$, where $X^T$ is the
transpose of matrix $X$ and $X \cdot Y$ is the dot product of the matrices
$X$ and $Y$, taken element-wise:
$C_{ij} = (A_{ij} + B_{ji})\,(B_{ji} - A_{ij})$.
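
As a minimal sketch of the output definition, the function below computes a single element of C. It assumes row-major storage and `int` elements, neither of which is stated in this handout; the name `output_element` is hypothetical. It mainly shows how the transpose is handled through indexing rather than by building $B^T$ explicitly.

```cpp
#include <cassert>
#include <vector>

// Computes C[i][j] for 0 <= i < m and 0 <= j < n.
// Assumptions: row-major storage, int elements.
//   A is m x n, so A[i][j] lives at index i * n + j.
//   B is n x m, so (B^T)[i][j] = B[j][i] lives at index j * m + i.
int output_element(const std::vector<int>& A, const std::vector<int>& B,
                   int m, int n, int i, int j) {
    assert(i >= 0 && i < m && j >= 0 && j < n);
    int a = A[i * n + j];
    int bt = B[j * m + i];
    return (a + bt) * (bt - a);
}
```

Each of the three kernels would evaluate this same expression; they differ only in which (i, j) pairs a given thread visits.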

2.3 Constraints
• $2 \le m \le 2^{13}$, $2 \le n \le 2^{13}$
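
A quick sizing check follows from these bounds. The sketch below works out the largest element count and the memory per matrix, assuming 4-byte integer elements (the handout does not state the element type). The constant names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound from the constraints: m, n <= 2^13.
constexpr std::int64_t kMaxDim = std::int64_t{1} << 13;   // 8192
constexpr std::int64_t kMaxElements = kMaxDim * kMaxDim;  // 2^26

// Assumption: 4-byte integer elements.
constexpr std::int64_t kBytesPerMatrix =
    kMaxElements * static_cast<std::int64_t>(sizeof(std::int32_t));

static_assert(kMaxElements == 67108864, "2^26 elements in the largest matrix");
static_assert(kMaxElements - 1 <= INT32_MAX, "largest flat index fits in a 32-bit int");
static_assert(kBytesPerMatrix == 256LL * 1024 * 1024, "256 MiB per matrix");
```

Under that assumption, A, B, and C together need about 768 MiB in the worst case, and a flat index into any one matrix still fits in a signed 32-bit integer.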

3 Sample Testcase

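• Input Matrix A:

$$
\begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
$$

• Input Matrix B:

$$
\begin{bmatrix} -3 & 8 & 10 \\ -3 & -9 & 13 \\ 13 & -2 & 10 \\ 14 & 16 & 19 \end{bmatrix}
$$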

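• $(A + B^T)$:

$$
\begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
+
\begin{bmatrix} -3 & -3 & 13 & 14 \\ 8 & -9 & -2 & 16 \\ 10 & 13 & 10 & 19 \end{bmatrix}
=
\begin{bmatrix} 3 & 15 & 4 & 23 \\ 3 & 4 & 14 & 8 \\ 4 & 16 & 3 & 9 \end{bmatrix}
$$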

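• $(B^T - A)$:

$$
\begin{bmatrix} -3 & -3 & 13 & 14 \\ 8 & -9 & -2 & 16 \\ 10 & 13 & 10 & 19 \end{bmatrix}
-
\begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
=
\begin{bmatrix} -9 & -21 & 22 & 5 \\ 13 & -22 & -18 & 24 \\ 16 & 10 & 17 & 29 \end{bmatrix}
$$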

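• $C = (A + B^T) \cdot (B^T - A)$:

$$
C =
\begin{bmatrix} 3 & 15 & 4 & 23 \\ 3 & 4 & 14 & 8 \\ 4 & 16 & 3 & 9 \end{bmatrix}
\cdot
\begin{bmatrix} -9 & -21 & 22 & 5 \\ 13 & -22 & -18 & 24 \\ 16 & 10 & 17 & 29 \end{bmatrix}
=
\begin{bmatrix} -27 & -315 & 88 & 115 \\ 39 & -88 & -252 & 192 \\ 64 & 160 & 51 & 261 \end{bmatrix}
$$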

• Output Matrix C:

$$
\begin{bmatrix} -27 & -315 & 88 & 115 \\ 39 & -88 & -252 & 192 \\ 64 & 160 & 51 & 261 \end{bmatrix}
$$
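
The sample above can double as a regression check. The sketch below is a plain CPU reference, assuming row-major storage and `int` elements, with the sample matrices hard-coded. The function names `reference`, `sample_A`, `sample_B`, and `sample_C` are hypothetical and are not part of the provided 'main.cu'.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = (A + B^T) . (B^T - A), multiplied element by element.
// Assumptions: row-major storage, int elements. A is m x n, B is n x m.
std::vector<int> reference(const std::vector<int>& A, const std::vector<int>& B,
                           int m, int n) {
    assert(A.size() == static_cast<std::size_t>(m) * n);
    assert(B.size() == static_cast<std::size_t>(n) * m);
    std::vector<int> C(static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int a = A[i * n + j];
            int bt = B[j * m + i];  // element (i, j) of B^T
            C[i * n + j] = (a + bt) * (bt - a);
        }
    }
    return C;
}

// Sample testcase from this section (m = 3, n = 4).
std::vector<int> sample_A() {
    return {6, 18, -9, 9, -5, 13, 16, -8, -6, 3, -7, -10};
}
std::vector<int> sample_B() {
    return {-3, 8, 10, -3, -9, 13, 13, -2, 10, 14, 16, 19};
}
std::vector<int> sample_C() {
    return {-27, -315, 88, 115, 39, -88, -252, 192, 64, 160, 51, 261};
}
```

Comparing `reference(sample_A(), sample_B(), 3, 4)` with `sample_C()` reproduces the output matrix shown above. The same comparison can be applied to the result copied back from each GPU kernel.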


4 Points to be noted

• The file ’main.cu’ provided by us contains the code that handles file
reading, writing, etc. The prototypes of the three kernels are also provided
in the same file. You need to implement these three kernels.
• The number of threads launched for each of the three kernels will be more
than or equal to the number of threads required to do the computation.
• Do not write any print statements inside the kernel.
• Test your code on large input matrices.
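
Because more threads may be launched than there is work, each kernel needs a bounds guard before it touches memory. The sketch below emulates that on the host: `blocks_needed` is the usual ceiling division, and `count_active_threads` walks every emulated thread id and counts those that pass the guard. Both names are hypothetical, and the actual launch parameters are chosen by the provided 'main.cu', not by this code.

```cpp
#include <cassert>

// Ceiling division: the smallest block count with blocks * threadsPerBlock >= work.
int blocks_needed(int work, int threadsPerBlock) {
    assert(work > 0 && threadsPerBlock > 0);
    return (work + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side emulation of a guarded launch. Every launched "thread" gets a
// flat id, but only ids below `work` do any computation.
// Returns how many emulated threads passed the guard.
int count_active_threads(int work, int blocks, int threadsPerBlock) {
    int active = 0;
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threadsPerBlock; ++t) {
            int id = b * threadsPerBlock + t;
            if (id >= work) continue;  // surplus thread: do nothing
            ++active;
        }
    }
    return active;
}
```

For 10 rows and 4 threads per block, 3 blocks (12 threads) are enough, and exactly 10 of them pass the guard. In a CUDA kernel the equivalent guard is an early `return` when the computed index is out of range.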

5 Submission Guidelines
• Use the file ’main.cu’ provided by us.
• Compress the file ’main.cu’, which contains your implementation of the
functionality described above, into ROLL NUMBER.zip
• Submit only the ROLL NUMBER.zip file on Moodle.

• After submission, download the file and make sure it was the one you
intended to submit.
• Kindly adhere strictly to the above guidelines.

6 Learning Suggestions

Write a CPU version of the code that achieves the same functionality. Time
the CPU code and the GPU code separately for large matrices and compare their
performance.
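
One way to time the CPU side is a small wall-clock helper built on `std::chrono`. The sketch below is an assumption about how the measurement could be set up, not part of the provided code; `best_of_ms` is a hypothetical name. It runs a callable several times and reports the fastest run.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `repeats` times and returns the fastest run in milliseconds.
// Taking the minimum reduces noise from other processes on the machine.
template <typename F>
double best_of_ms(F&& work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

The callable would wrap the CPU loop over all elements. Kernel launches are asynchronous, so a host timer around a GPU call is only meaningful if the call is followed by a synchronization such as `cudaDeviceSynchronize()`; CUDA events are the usual alternative.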