## Description

## 1 Problem Statement

Write three separate CUDA C++ kernels that perform computations on two input matrices (A and B) and generate the output matrix C.

In the first kernel, per row column kernel, each thread should process a complete row of matrix A and the corresponding complete column of matrix B. In the second kernel, per column row kernel, each thread should process a complete column of matrix A and the corresponding complete row of matrix B. In the third kernel, per element kernel, each thread should process exactly one element from each of the two input matrices.

For evaluation purposes, per row column kernel will be invoked with a 1D grid and 1D blocks, per column row kernel will be invoked with a 1D grid and 2D blocks, and per element kernel will be invoked with a 2D grid and 2D blocks.
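As a rough illustration of how the three launch shapes can be turned into thread indices, the sketch below shows one possible structure for the kernels. The kernel names, parameter lists, `int` element type, and row-major storage are assumptions made for this example; the real prototypes are the ones in main.cu. It also assumes that the product in the output formula is taken element by element, which is what the sample testcase suggests.

```cuda
#include <cuda_runtime.h>

// Shared per-element formula (assumed element-wise):
// C[i][j] = (A[i][j] + B[j][i]) * (B[j][i] - A[i][j]).
// A is m x n, B is n x m, C is m x n, all row-major (an assumption).
__device__ inline int combine(const int *A, const int *B, int m, int n, int i, int j) {
    int a  = A[i * n + j];
    int bt = B[j * m + i];   // B^T[i][j] is B[j][i]
    return (a + bt) * (bt - a);
}

// 1D grid, 1D blocks: thread t handles row t of A and column t of B.
__global__ void per_row_column_kernel(int m, int n, const int *A, const int *B, int *C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;                       // surplus threads exit early
    for (int col = 0; col < n; ++col)
        C[row * n + col] = combine(A, B, m, n, row, col);
}

// 1D grid, 2D blocks: thread t handles column t of A and row t of B.
__global__ void per_column_row_kernel(int m, int n, const int *A, const int *B, int *C) {
    int threadInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int col = blockIdx.x * (blockDim.x * blockDim.y) + threadInBlock;
    if (col >= n) return;                       // surplus threads exit early
    for (int row = 0; row < m; ++row)
        C[row * n + col] = combine(A, B, m, n, row, col);
}

// 2D grid, 2D blocks: thread t handles the single element with linear index t.
__global__ void per_element_kernel(int m, int n, const int *A, const int *B, int *C) {
    int blockId       = blockIdx.y * gridDim.x + blockIdx.x;
    int threadInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int id = blockId * (blockDim.x * blockDim.y) + threadInBlock;
    if (id >= m * n) return;                    // surplus threads exit early
    C[id] = combine(A, B, m, n, id / n, id % n);
}
```

Flattening the block and thread coordinates into one linear ID means each kernel only relies on the total number of launched threads being large enough, not on the grid having any particular shape.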

## 2 Input and Output

### 2.1 Input

• Matrix A of size m × n

• Matrix B of size n × m

### 2.2 Output

• Output is Matrix C of size m × n

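• Output is computed as $C = (A + B^T) \cdot (B^T - A)$, where $X^T$ is the transpose of matrix $X$ and $X \cdot Y$ is the element-wise product of the matrices $X$ and $Y$ (each entry of $X$ multiplied by the corresponding entry of $Y$), as the sample testcase illustrates.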
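Written out per element, and assuming the product is taken entry by entry (the sample testcase's numbers indicate this, e.g. $3 \times (-9) = -27$ for the top-left entry), the formula becomes:

```latex
C_{ij} = \left(A_{ij} + B_{ji}\right)\left(B_{ji} - A_{ij}\right) = B_{ji}^{2} - A_{ij}^{2},
\qquad 0 \le i < m,\; 0 \le j < n.
```

Here $B_{ji}$ is the $(i, j)$ entry of $B^T$. Each output element depends on exactly one element of A and one element of B, so the three kernels differ only in how output elements are assigned to threads.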

### 2.3 Constraints

• $2 \le m \le 2^{13}$, $2 \le n \le 2^{13}$


## 3 Sample Testcase

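• Input Matrix A:

$$
A = \begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
$$

• Input Matrix B:

$$
B = \begin{bmatrix} -3 & 8 & 10 \\ -3 & -9 & 13 \\ 13 & -2 & 10 \\ 14 & 16 & 19 \end{bmatrix}
$$

• $(A + B^T)$:

$$
\begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
+
\begin{bmatrix} -3 & -3 & 13 & 14 \\ 8 & -9 & -2 & 16 \\ 10 & 13 & 10 & 19 \end{bmatrix}
=
\begin{bmatrix} 3 & 15 & 4 & 23 \\ 3 & 4 & 14 & 8 \\ 4 & 16 & 3 & 9 \end{bmatrix}
$$

• $(B^T - A)$:

$$
\begin{bmatrix} -3 & -3 & 13 & 14 \\ 8 & -9 & -2 & 16 \\ 10 & 13 & 10 & 19 \end{bmatrix}
-
\begin{bmatrix} 6 & 18 & -9 & 9 \\ -5 & 13 & 16 & -8 \\ -6 & 3 & -7 & -10 \end{bmatrix}
=
\begin{bmatrix} -9 & -21 & 22 & 5 \\ 13 & -22 & -18 & 24 \\ 16 & 10 & 17 & 29 \end{bmatrix}
$$

• $C = (A + B^T) \cdot (B^T - A)$:

$$
C =
\begin{bmatrix} 3 & 15 & 4 & 23 \\ 3 & 4 & 14 & 8 \\ 4 & 16 & 3 & 9 \end{bmatrix}
\cdot
\begin{bmatrix} -9 & -21 & 22 & 5 \\ 13 & -22 & -18 & 24 \\ 16 & 10 & 17 & 29 \end{bmatrix}
=
\begin{bmatrix} -27 & -315 & 88 & 115 \\ 39 & -88 & -252 & 192 \\ 64 & 160 & 51 & 261 \end{bmatrix}
$$

• Output Matrix C:

$$
C = \begin{bmatrix} -27 & -315 & 88 & 115 \\ 39 & -88 & -252 & 192 \\ 64 & 160 & 51 & 261 \end{bmatrix}
$$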
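The sample can be reproduced with a few lines of host code, which is a handy check before moving to the GPU. The following is a minimal sketch, assuming `int` elements, row-major storage, and an element-wise product; the function names `reference_cpu` and `sample_matches` are hypothetical and are not part of main.cu.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C[i][j] = (A[i][j] + B[j][i]) * (B[j][i] - A[i][j]).
// A is m x n, B is n x m, C is m x n, all row-major (an assumption).
std::vector<int> reference_cpu(int m, int n,
                               const std::vector<int> &A,
                               const std::vector<int> &B) {
    std::vector<int> C(static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int a  = A[i * n + j];
            int bt = B[j * m + i];   // B^T[i][j] is B[j][i]
            C[i * n + j] = (a + bt) * (bt - a);
        }
    }
    return C;
}

// Returns true when the reference reproduces the sample output matrix C.
bool sample_matches() {
    const std::vector<int> A = {6, 18, -9, 9, -5, 13, 16, -8, -6, 3, -7, -10};
    const std::vector<int> B = {-3, 8, 10, -3, -9, 13, 13, -2, 10, 14, 16, 19};
    const std::vector<int> expected = {-27, -315, 88, 115,
                                       39, -88, -252, 192,
                                       64, 160, 51, 261};
    return reference_cpu(3, 4, A, B) == expected;
}
```

The same `reference_cpu` output can later be compared against whatever each GPU kernel writes into C.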


## 4 Points to be noted

• The file 'main.cu' provided by us contains the code that takes care of file reading, writing, etc. The prototypes of the three kernels are also provided in the same file. You need to implement these three kernels.

• The number of threads launched for each of the three kernels will be greater than or equal to the number of threads required to do the computation.

• Do not write any print statements inside the kernel.

• Test your code on large input matrices.
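Because more threads may be launched than there is work, each kernel needs a bounds check so that surplus threads do not touch memory. The sketch below illustrates the idea with a hypothetical counting kernel (`count_working_threads`) and a hypothetical host helper (`working_threads`); neither comes from main.cu. The helper reports how many threads pass the guard for a given launch shape.

```cuda
#include <cuda_runtime.h>

// Each thread derives a unique linear ID that is valid for 1D or 2D grids and
// blocks, then increments the counter only if that ID maps to real work.
__global__ void count_working_threads(int needed, int *counter) {
    int blockId       = blockIdx.y * gridDim.x + blockIdx.x;
    int threadInBlock = threadIdx.y * blockDim.x + threadIdx.x;
    int id = blockId * (blockDim.x * blockDim.y) + threadInBlock;
    if (id < needed)              // surplus threads skip the work entirely
        atomicAdd(counter, 1);
}

// Hypothetical helper: launches grid x block threads and returns how many
// passed the guard, or -1 if any CUDA call fails.
int working_threads(dim3 grid, dim3 block, int needed) {
    int *dCounter = nullptr;
    int result = -1;
    if (cudaMalloc(&dCounter, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(dCounter, 0, sizeof(int)) == cudaSuccess) {
        count_working_threads<<<grid, block>>>(needed, dCounter);
        bool ok = cudaGetLastError() == cudaSuccess &&
                  cudaMemcpy(&result, dCounter, sizeof(int),
                             cudaMemcpyDeviceToHost) == cudaSuccess;
        if (!ok) result = -1;
    }
    cudaFree(dCounter);
    return result;
}
```

For instance, `working_threads(dim3(4), dim3(32), 100)` launches 128 threads, of which only the 100 with an in-range ID do any work; the same guard behaves identically for `dim3(2)` with `dim3(8, 8)` blocks or for a 2D grid.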

## 5 Submission Guidelines

• Use the file 'main.cu' provided by us.

• Compress the file 'main.cu', which contains the implementation of the above-described functionality, to ROLL NUMBER.zip.

• Submit only the ROLL NUMBER.zip file on Moodle.

• After submission, download the file and make sure it was the one you intended to submit.

• Kindly adhere strictly to the above guidelines.

## 6 Learning Suggestions

Write a CPU version of the code that achieves the same functionality. Time the CPU code and the GPU code separately for large matrices and compare their performance.
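One way to set up such a comparison is sketched below, using `std::chrono` for the CPU loop and CUDA events for the kernel. Everything here is an assumption made for illustration: the names `elementwise_kernel_1d`, `Timing`, and `compare_cpu_gpu`, the synthetic input values, the `int` element type, and the simple 1D launch. The GPU time covers the kernel only, not the host-device copies, and error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element (1D launch chosen only to keep the sketch short).
__global__ void elementwise_kernel_1d(int m, int n, const int *A, const int *B, int *C) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= m * n) return;
    int a  = A[id];
    int bt = B[(id % n) * m + id / n];   // B^T[i][j] is B[j][i]
    C[id] = (a + bt) * (bt - a);
}

struct Timing { double cpu_ms; float gpu_ms; bool match; };

Timing compare_cpu_gpu(int m, int n) {
    std::size_t count = static_cast<std::size_t>(m) * n;
    std::size_t bytes = count * sizeof(int);
    std::vector<int> A(count), B(count), cpu(count), gpu(count);
    for (std::size_t k = 0; k < count; ++k) {   // small synthetic values, no overflow
        A[k] = static_cast<int>(k % 19) - 9;
        B[k] = static_cast<int>(k % 23) - 11;
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            int a = A[i * n + j], bt = B[j * m + i];
            cpu[i * n + j] = (a + bt) * (bt - a);
        }
    auto t1 = std::chrono::steady_clock::now();

    int *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    int threads = 256;
    int blocks = static_cast<int>((count + threads - 1) / threads);  // ceiling division
    cudaEventRecord(start);
    elementwise_kernel_1d<<<blocks, threads>>>(m, n, dA, dB, dC);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait for the kernel to finish
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    cudaMemcpy(gpu.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    double cpu_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return {cpu_ms, gpu_ms, cpu == gpu};
}
```

Calling `compare_cpu_gpu(8192, 8192)` exercises the largest size allowed by the constraints; checking the `match` field first confirms that both versions agree before the timings are compared.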
