## Description

## 1. Problem Statement

Given four input matrices $A$, $B$, $C$, and $D$, compute the output matrix $X = (A + B^T) C D^T$. Write efficient code to compute the output matrix. While writing the code, consider aspects such as memory coalescing, shared memory, and the degree of thread divergence.
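The sum $A + B^T$ is the step where coalescing needs the most care, because reading $B$ in transposed order directly would make neighbouring threads touch addresses a full row apart. A minimal sketch of one common remedy follows: stage a tile of $B$ in shared memory, then read it back transposed. The names `addTranspose`, `addTransposeHost`, and `TILE`, the tile width of 32, and the row-major `int` storage are all assumptions made for illustration, not part of the provided “main.cu”.

```cuda
#include <cuda_runtime.h>

#define TILE 32  // assumed tile width: one warp per tile row

// E = A + B^T, where A and E are p x q and B is q x p (all row-major).
__global__ void addTranspose(const int *A, const int *B, int *E, int p, int q) {
    __shared__ int tile[TILE][TILE + 1];  // one padding column avoids bank conflicts

    // Load a tile of B: consecutive threadIdx.x values read consecutive columns of B.
    int bRow = blockIdx.x * TILE + threadIdx.y;  // row of B, in [0, q)
    int bCol = blockIdx.y * TILE + threadIdx.x;  // column of B, in [0, p)
    if (bRow < q && bCol < p)
        tile[threadIdx.y][threadIdx.x] = B[bRow * p + bCol];
    __syncthreads();

    // Write E: consecutive threadIdx.x values again touch consecutive columns.
    int eRow = blockIdx.y * TILE + threadIdx.y;  // row of A and E, in [0, p)
    int eCol = blockIdx.x * TILE + threadIdx.x;  // column of A and E, in [0, q)
    if (eRow < p && eCol < q)
        E[eRow * q + eCol] = A[eRow * q + eCol] + tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: copies the inputs over, launches the kernel, copies E back.
void addTransposeHost(const int *hA, const int *hB, int *hE, int p, int q) {
    size_t bytes = (size_t)p * q * sizeof(int);
    int *dA, *dB, *dE;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dE, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((q + TILE - 1) / TILE, (p + TILE - 1) / TILE);
    addTranspose<<<grid, block>>>(dA, dB, dE, p, q);

    cudaMemcpy(hE, dE, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dE);
}
```

Both the global read of $B$ and the global write of the result run along rows, so only the shared-memory access is transposed, and the padding column keeps that access free of bank conflicts.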

## 2. Input and Output

**2.1. Input**

- 4 integers: $p$, $q$, $r$ and $s$
- Matrix $A$ of size $p \times q$
- Matrix $B$ of size $q \times p$
- Matrix $C$ of size $q \times r$
- Matrix $D$ of size $s \times r$

**2.2. Output**

- Matrix $X$ of size $p \times s$

**2.3. Constraints**

- $2 \le p, q, r, s \le 2^{10}$
- All the elements in the input matrices will be in the range $[-10, 10]$
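These constraints allow a quick check of whether every entry of the result fits in a 32-bit signed integer. The bound below is a worked estimate, assuming the matrices are stored as 32-bit `int` (the element type is not stated here); $E = A + B^T$ and $T = EC$ are names introduced only for the intermediate results.

$$
\begin{aligned}
|E_{ik}| &\le 10 + 10 = 20,\\
|T_{ij}| &= \Bigl|\sum_{k=1}^{q} E_{ik}\, C_{kj}\Bigr| \le q \cdot 20 \cdot 10 \le 2^{10} \cdot 200 = 204{,}800,\\
|X_{il}| &= \Bigl|\sum_{j=1}^{r} T_{ij}\, D_{lj}\Bigr| \le r \cdot 204{,}800 \cdot 10 \le 2^{10} \cdot 2{,}048{,}000 = 2{,}097{,}152{,}000 < 2^{31} - 1 = 2{,}147{,}483{,}647.
\end{aligned}
$$

Under that assumption the worst case still fits in a signed 32-bit integer, though with little headroom.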

## 3. Sample Testcase

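- Input matrices $A$, $B$, $C$ and $D$:

  $$A = \begin{bmatrix} 2 & 5 & 0 \\ 3 & -2 & 1 \end{bmatrix},\quad B = \begin{bmatrix} 6 & 1 \\ -4 & 2 \\ 1 & 3 \end{bmatrix},\quad C = \begin{bmatrix} 1 & 9 & 6 \\ -6 & 7 & 2 \\ 2 & 4 & -3 \end{bmatrix},\quad D = \begin{bmatrix} 10 & 0 & 5 \\ 1 & 3 & -3 \end{bmatrix}$$

  The input will be given as:

      2 3 3 2
      2 5 0
      3 -2 1
      6 1
      -4 2
      1 3
      1 9 6
      -6 7 2
      2 4 -3
      10 0 5
      1 3 -3

  The first line contains the values $p$, $q$, $r$ and $s$. The next $p$ lines are the rows of matrix $A$, the next $q$ lines are the rows of matrix $B$, the next $q$ lines are the rows of matrix $C$, and the next $s$ lines are the rows of matrix $D$.

- $(A + B^T)$:

  $$A + B^T = \begin{bmatrix} 8 & 1 & 1 \\ 4 & 0 & 4 \end{bmatrix}$$

- Output matrix, $X = (A + B^T) C D^T$:

  $$X = \begin{bmatrix} 275 & 112 \\ 180 & 132 \end{bmatrix}$$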
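A plain host-side reference is handy for checking a GPU result against this sample. The sketch below is one straightforward way to write it, assuming row-major storage. The function name `referenceCompute` and the use of `std::vector<int>` are illustrative choices and do not come from the provided “main.cu”.

```cuda
#include <vector>

// CPU reference for X = (A + B^T) C D^T with row-major storage.
// A is p x q, B is q x p, C is q x r, D is s x r; the result X is p x s.
std::vector<int> referenceCompute(int p, int q, int r, int s,
                                  const std::vector<int> &A,
                                  const std::vector<int> &B,
                                  const std::vector<int> &C,
                                  const std::vector<int> &D) {
    std::vector<int> E(p * q), T(p * r, 0), X(p * s, 0);

    // E = A + B^T
    for (int i = 0; i < p; ++i)
        for (int k = 0; k < q; ++k)
            E[i * q + k] = A[i * q + k] + B[k * p + i];

    // T = E * C  (p x r)
    for (int i = 0; i < p; ++i)
        for (int k = 0; k < q; ++k)
            for (int j = 0; j < r; ++j)
                T[i * r + j] += E[i * q + k] * C[k * r + j];

    // X = T * D^T  (p x s); row l of D supplies column l of D^T
    for (int i = 0; i < p; ++i)
        for (int l = 0; l < s; ++l)
            for (int j = 0; j < r; ++j)
                X[i * s + l] += T[i * r + j] * D[l * r + j];

    return X;
}
```

Running it on the sample input and comparing element by element with the GPU output is a cheap first test before moving on to large matrices.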

## 4. Points to be noted

- The file “main.cu” provided by us contains the code that takes care of reading the input, printing the result, and printing the execution time.
- Don't write any code in the main() function.
- You need to implement the compute() function provided in “main.cu”.
- You are free to use any number of functions/kernels.
- You can launch the kernels as you wish.
- It is compulsory to optimize for coalesced accesses. Also, make use of shared memory.
- Do not write any print statements.
- Test your code on large input matrices.
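The two matrix products are where shared memory pays off most. The following is a hedged sketch of a tiled multiply that handles both $T = (A + B^T)\,C$ and $X = T D^T$. The signature of compute() is not shown here, so the wrapper `multiplyChain`, the kernel `matMulTiled`, the tile width of 16, and the row-major `int` layout are all assumptions. The wrapper expects the sum $E = A + B^T$ to be available already.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width

// out (rows x cols) = M (rows x inner) * N. N is stored as inner x cols when
// transN is false, or as cols x inner when transN is true (computes M * N^T).
__global__ void matMulTiled(const int *M, const int *N, int *out,
                            int rows, int inner, int cols, bool transN) {
    __shared__ int sM[TILE][TILE];
    __shared__ int sN[TILE][TILE + 1];  // padded: read column-wise when transN is set

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    int acc = 0;

    for (int t = 0; t < (inner + TILE - 1) / TILE; ++t) {
        int k = t * TILE;
        // Consecutive tx values read consecutive addresses of M (coalesced).
        sM[ty][tx] = (row < rows && k + tx < inner) ? M[row * inner + k + tx] : 0;
        if (transN) {  // uniform across the grid, so it causes no divergence
            int nRow = blockIdx.x * TILE + ty;  // read along a row of N (coalesced)
            sN[ty][tx] = (nRow < cols && k + tx < inner) ? N[nRow * inner + k + tx] : 0;
        } else {
            sN[ty][tx] = (k + ty < inner && col < cols) ? N[(k + ty) * cols + col] : 0;
        }
        __syncthreads();
        for (int i = 0; i < TILE; ++i)
            acc += sM[ty][i] * (transN ? sN[tx][i] : sN[i][tx]);
        __syncthreads();
    }
    if (row < rows && col < cols) out[row * cols + col] = acc;
}

// Hypothetical wrapper: X = E * C * D^T, with E (p x q), C (q x r), D (s x r).
void multiplyChain(const int *hE, const int *hC, const int *hD, int *hX,
                   int p, int q, int r, int s) {
    int *dE, *dC, *dD, *dT, *dX;
    cudaMalloc(&dE, sizeof(int) * p * q);
    cudaMalloc(&dC, sizeof(int) * q * r);
    cudaMalloc(&dD, sizeof(int) * s * r);
    cudaMalloc(&dT, sizeof(int) * p * r);
    cudaMalloc(&dX, sizeof(int) * p * s);
    cudaMemcpy(dE, hE, sizeof(int) * p * q, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, sizeof(int) * q * r, cudaMemcpyHostToDevice);
    cudaMemcpy(dD, hD, sizeof(int) * s * r, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 gridT((r + TILE - 1) / TILE, (p + TILE - 1) / TILE);
    matMulTiled<<<gridT, block>>>(dE, dC, dT, p, q, r, false);  // T = E * C
    dim3 gridX((s + TILE - 1) / TILE, (p + TILE - 1) / TILE);
    matMulTiled<<<gridX, block>>>(dT, dD, dX, p, r, s, true);   // X = T * D^T

    cudaMemcpy(hX, dX, sizeof(int) * p * s, cudaMemcpyDeviceToHost);
    cudaFree(dE); cudaFree(dC); cudaFree(dD); cudaFree(dT); cudaFree(dX);
}
```

Out-of-range tile entries are filled with zeros instead of being skipped, so every thread executes the same inner loop and the only boundary branch left is the final store.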

## 5. Submission Guidelines

- Use the file “main.cu” provided by us.
- Don't change anything in the main() function.
- Rename the file “main.cu”, which contains the implementation of the above-described functionality, to your roll number followed by the .cu extension. For example, if your roll number is CS20M039, then the name of the file you submit on Moodle should be CS20M039.cu (submit only the .cu file).
- After submission, download the file and make sure it was the one you intended to submit.

## 6. Learning Suggestions

- Write a CPU version of the code that achieves the same functionality. Time the CPU code and the GPU code separately for large matrices and compare their performance.
- Exploit shared memory as much as possible to gain performance benefits.
- Try to reduce thread divergence as much as possible.
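For the CPU-versus-GPU comparison, the two sides are usually timed with different clocks: a host clock for the CPU code and CUDA events for the GPU work. The helpers below are a small sketch of that idea. The names `timeCpuMs` and `timeGpuMs` are hypothetical, and the GPU helper assumes all work is launched on the default stream.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Times a host-side callable with a monotonic clock; returns milliseconds.
template <typename F>
double timeCpuMs(F &&work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Times the GPU work enqueued by `launch` on the default stream; returns milliseconds.
template <typename F>
float timeGpuMs(F &&launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                    // e.g. one or more kernel launches
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the recorded work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

A call such as `timeGpuMs([&] { myKernel<<<grid, block>>>(args); })`, where `myKernel` stands for whichever kernel is being measured, covers only the kernel time. Host-to-device copies should be placed inside the callable if they are meant to count towards the comparison.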