AMATH 483 / 583 – HW2 solution

1 Information for problems

1.1 Timing code segments

    #include <chrono>
    #include <iostream>

    int main() {
        // Start the clock
        auto start = std::chrono::high_resolution_clock::now();

        // Code segment to time
        // ...

        // Stop the clock
        auto stop = std::chrono::high_resolution_clock::now();

        // Calculate the duration of the code segment
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

        // Output the duration to the console
        std::cout << "Time taken by code segment: " << duration.count() << " microseconds" << std::endl;

        return 0;
    }
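Since the problems below ask for an average over ntrial runs, the timing pattern above can be wrapped in a small helper. This is only a sketch: the name average_seconds is hypothetical, std::chrono::steady_clock is swapped in because it is guaranteed to be monotonic, and ntrial is assumed to be at least 1.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of the assignment): calls `work` ntrial times
// and returns the average wall-clock time of one call, in seconds.
// Assumes ntrial >= 1.
template <typename F>
double average_seconds(F&& work, std::size_t ntrial) {
    using clock = std::chrono::steady_clock;  // monotonic clock
    double total = 0.0;
    for (std::size_t k = 0; k < ntrial; ++k) {
        const auto start = clock::now();
        work();  // code segment to time
        const auto stop = clock::now();
        // duration<double> expresses the elapsed time in seconds
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / static_cast<double>(ntrial);
}
```

A call such as average_seconds([&] { daxpy(a, x, y); }, 3) would then return the mean time of three runs. For very small n a single call may be shorter than the clock resolution, so repeating the work inside the lambda can help.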

1.2 Loop Unroll

The loop unrolling technique aims to optimize the performance of a program by reducing the overhead of the loop itself. The idea is to perform the work of several iterations in each pass through the loop, thereby reducing the number of loop-control instructions (counter increments, comparisons, and branches) executed by the processor.

    // Example loop that sums the elements in an array
    int sum(int arr[], int len) {
        int result = 0;
        for (int i = 0; i < len; i++) {
            result += arr[i];
        }
        return result;
    }

    // Loop unrolled version that sums the elements in an array
    int sum_unrolled(int arr[], int len) {
        int result = 0;
        int i = 0;  // declared outside the loop so the cleanup loop can continue from it
        for (; i + 3 < len; i += 4) {  // stop before a group of 4 would run past the end
            result += arr[i];
            result += arr[i + 1];
            result += arr[i + 2];
            result += arr[i + 3];
        }
        for (; i < len; i++) {  // don't forget about the remaining array elements
            result += arr[i];
        }
        return result;
    }

1.3 Shared object library

Consider C++ source files foo.cpp and bar.cpp, which contain functions that will be used by main.cpp. To create a shared object library composed of the object codes foo.o and bar.o, compile as follows (assuming you are using the GNU C++ compiler):

• g++ -c -fPIC foo.cpp -o foo.o
• g++ -c -fPIC bar.cpp -o bar.o
• g++ -shared -o libfoobar.so foo.o bar.o

This creates a shared object library called libfoobar.so. Note that PIC means position independent code. This is binary code that can be loaded and executed at any memory address without modification, which is what allows a shared library to be mapped wherever the dynamic loader places it. For main.cpp to use this library, it must be linked correctly as follows:

• g++ -o xmain main.cpp -L. -lfoobar

The -L. -lfoobar flags tell the linker to search the current directory (-L.) for the library libfoobar.so (-lfoobar) during the link stage. The executable xmain can now be run. If the dynamic loader cannot find libfoobar.so at run time, add its directory to the search path, for example with LD_LIBRARY_PATH=. ./xmain on Linux.

1.4 FLOPs

FLOPs (FLoating point OPerations per second) is a metric used to understand the performance of numerically intensive software. If a code block of problem size n theoretically performs f(n) floating point operations (which we can know if we know the algorithm or problem), and it takes t seconds to complete the code block, then $\mathrm{FLOPs} = f(n)/t$.
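As an illustration of f(n), the sketch below writes out theoretical operation counts for the three BLAS operations in the problems that follow and converts a measured time into FLOPs. The function names are hypothetical, and the counts assume one particular convention (each multiply and each add counts as one operation, and scaling by α and β is included); other conventions differ in the lower-order terms.

```cpp
#include <cstddef>

// Hypothetical helpers; counts assume every multiply and every add is one operation.

// y <- a*x + y : n multiplies and n adds.
double axpy_flop_count(std::size_t n) {
    return 2.0 * n;
}

// y <- a*A*x + b*y with A of size m x n:
// m*(2n - 1) for A*x, m to scale by a, m to scale y by b, m for the final add.
double gemv_flop_count(std::size_t m, std::size_t n) {
    return 2.0 * m * n + 2.0 * m;
}

// C <- a*A*B + b*C with A m x p and B p x n:
// m*n*(2p - 1) for A*B, then 3*m*n for the two scalings and the final add.
double gemm_flop_count(std::size_t m, std::size_t p, std::size_t n) {
    return 2.0 * m * n * p + 2.0 * m * n;
}

// FLOPs = f(n) / t, with t in seconds (assumed to be positive).
double flops(double flop_count, double seconds) {
    return flop_count / seconds;
}
```

For example, if a daxpy of length 1024 takes 2 microseconds on average, flops(axpy_flop_count(1024), 2e-6) gives roughly 1.0e9 FLOPs.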

2 Problems

This assignment is due Wednesday April 19 2023 by midnight PDT. We will explore some coding exercises in this homework. For the performance measurements, you must have a theoretical flop count for each operation you implement. This flop count must be included in your test codes.

1. Level 1 BLAS (Basic Linear Algebra Subprograms). Given the following specification, write a C++ function that computes $y \leftarrow \alpha x + y$, where $x, y \in \mathbb{R}^n$, $\alpha \in \mathbb{R}$. Write a C++ code that calls the function and measures the performance for n = 2 to n = 1024. Let each n be measured ntrial times, with ntrial ≥ 3, and plot the average performance for each case versus n. (Don't forget that performance means FLOPs.) You may initialize your problem with any non-zero values you desire (random numbers are good). The correctness of your function will be tested against a test system with known result, so please test prior to submission. Check for and flag incorrect cases. Submit C++ function file, main source file, and performance plot.

    void daxpy(double a, const std::vector<double> &x, std::vector<double> &y);
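One possible shape for such a function is sketched below. It is a minimal illustration, not a reference solution: the choice to flag a length mismatch by throwing std::invalid_argument is an assumption, since the specification only says that incorrect cases should be flagged.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of y <- a*x + y for equal-length vectors.
// Assumption: a size mismatch is flagged by throwing an exception.
void daxpy(double a, const std::vector<double> &x, std::vector<double> &y) {
    if (x.size() != y.size()) {
        throw std::invalid_argument("daxpy: x and y must have the same length");
    }
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];  // one multiply and one add per element
    }
}
```

With x = (1, 2, 3), y = (1, 1, 1), and a = 2, the call updates y in place to (3, 5, 7).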

2. Level 1 BLAS Loop Unrolled. Given the following specification, write a C++ function that computes $y \leftarrow \alpha x + y$, where $x, y \in \mathbb{R}^n$, $\alpha \in \mathbb{R}$. Your function should unroll the loop at least to depth 4, and accept a block size parameter. Write a C++ code that calls the function and measures the performance for n = 2048 and study the block sizes 1, 2, 4, 8, 16, 32, 64. Measure ntrial times for each block size, with ntrial ≥ 3, and plot the average performance for each case versus n. You may initialize your problem with any non-zero values you desire (random numbers are good). The correctness of your function will be tested against a test system with known result, so please test prior to submission. Check for and flag incorrect cases. Submit C++ function file, main source file, and performance plot.

    void daxpy_unroll(double a, const std::vector<double> &x, std::vector<double> &y, int block_size);

3. Level 2 BLAS. Given the following specification, write a C++ function that computes $y \leftarrow \alpha A x + \beta y$, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, $\alpha, \beta \in \mathbb{R}$. Write a C++ code that calls the function and measures the performance for the case m = n, and n = 2 to n = 1024. Let each n be measured ntrial times, with ntrial ≥ 3, and plot the average performance for each case versus n. You may initialize your problem with any non-zero values you desire (random numbers are good). The correctness of your function will be tested against a general m, n test system with known result, so please test prior to submission. Check for and flag incorrect cases. Submit C++ function file, main source file, and performance plot.

    void dgemv(double a, const std::vector<std::vector<double>> &A, const std::vector<double> &x, double b, std::vector<double> &y);

4. Level 3 BLAS. Given the following specification, write a C++ function that computes $C \leftarrow \alpha A B + \beta C$, where $A \in \mathbb{R}^{m \times p}$, $B \in \mathbb{R}^{p \times n}$, $C \in \mathbb{R}^{m \times n}$, $\alpha, \beta \in \mathbb{R}$. Write a C++ code that calls the function and measures the performance for square matrices of dimension n = 2 to n = 1024. Let each n be measured ntrial times, with ntrial ≥ 3, and plot the average performance for each case versus n. You may initialize your problem with any non-zero values you desire (random numbers are good). The correctness of your function will be tested against a general m, p, n test system with known result, so please test prior to submission. Check for and flag incorrect cases. Submit C++ function file, main source file, and performance plot.

    void dgemm(double a, const std::vector<std::vector<double>> &A, const std::vector<std::vector<double>> &B, double b, std::vector<std::vector<double>> &C);
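The Level 2 and Level 3 operations can be sketched with plain nested loops, as below. This is a straightforward, unoptimized illustration under stated assumptions: matrices are stored as vectors of rows, every row of a matrix must have the same length, and dimension mismatches are flagged by throwing std::invalid_argument (the specification does not say how to flag them).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // vector of rows

// Sketch of y <- a*A*x + b*y, with A of size m x n.
void dgemv(double a, const Matrix &A, const std::vector<double> &x,
           double b, std::vector<double> &y) {
    const std::size_t m = A.size();
    const std::size_t n = x.size();
    if (y.size() != m) throw std::invalid_argument("dgemv: y must have length m");
    for (const auto &row : A) {
        if (row.size() != n) throw std::invalid_argument("dgemv: rows of A must have length n");
    }
    for (std::size_t i = 0; i < m; ++i) {
        double dot = 0.0;  // (A*x)_i
        for (std::size_t j = 0; j < n; ++j) dot += A[i][j] * x[j];
        y[i] = a * dot + b * y[i];
    }
}

// Sketch of C <- a*A*B + b*C, with A m x p, B p x n, C m x n.
void dgemm(double a, const Matrix &A, const Matrix &B, double b, Matrix &C) {
    const std::size_t m = A.size();
    const std::size_t p = B.size();
    // Take n from B when it has rows, otherwise from C.
    const std::size_t n = !B.empty() ? B[0].size() : (C.empty() ? 0 : C[0].size());
    if (C.size() != m) throw std::invalid_argument("dgemm: C must have m rows");
    for (const auto &row : A) {
        if (row.size() != p) throw std::invalid_argument("dgemm: rows of A must have length p");
    }
    for (const auto &row : B) {
        if (row.size() != n) throw std::invalid_argument("dgemm: rows of B must have length n");
    }
    for (const auto &row : C) {
        if (row.size() != n) throw std::invalid_argument("dgemm: rows of C must have length n");
    }
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;  // (A*B)_ij
            for (std::size_t k = 0; k < p; ++k) sum += A[i][k] * B[k][j];
            C[i][j] = a * sum + b * C[i][j];
        }
    }
}
```

The i-j-k loop order shown here is the textbook one; other loop orders compute the same result and may behave differently with respect to cache use, which is one of the things the performance plots can reveal.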

5. Template L1, L2, L3 BLAS. Given the following specifications, write C++ template functions that compute the L1, L2, L3 BLAS functions from the previous problems. Write a C++ code that calls each function separately, and measure and plot the performance as above for each method, except use type float for these experiments. You may initialize with any non-zero values you desire (random numbers are good). The correctness of your functions will be tested against general test systems with known results, so please test prior to submission. Check for and flag incorrect cases. Submit C++ function file(s), main source file, and performance plots.

    template <typename T>
    void axpy(double alpha, const std::vector<T> &x, std::vector<T> &y);

    template <typename T>
    void gemv(T a, const std::vector<std::vector<T>> &A, const std::vector<T> &x, T b, std::vector<T> &y);

    template <typename T>
    void gemm(T a, const std::vector<std::vector<T>> &A, const std::vector<std::vector<T>> &B, T b, std::vector<std::vector<T>> &C);

6. Shared object library. Compile all your functions into a library called librefBLAS.so. Create a header file refBLAS.hpp that contains the specification of each function you created in this homework. Write a C++ code that includes this header file and calls each function from the previous problems to convince yourself that it works. Submit the details of your compilation in file README.txt, C++ function file(s), and header file. We will build your shared object library using your instructions (compilation details) and run it against a test code.

3 Extra Credit

1. (+2) Write the 6-bit representation of +∞ and −∞.

2. (+1) Define the vector 1-norm, 2-norm, and ∞-norm.

3. (+2) Given matrix $A = \begin{pmatrix} 1 & 2 \\ 0 & 2 \end{pmatrix}$, evaluate the action of A on the unit balls of $\mathbb{R}^2$ defined by the 1-norm, 2-norm, and ∞-norm (induced matrix norms). Submit your work and drawings.
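When sketching the unit balls, it can help to evaluate the three vector norms numerically at a few points. The functions below follow the standard definitions (sum of absolute values, square root of the sum of squares, and largest absolute value); the names norm1, norm2, and norm_inf are hypothetical and not part of the assignment.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// 1-norm: sum of the absolute values of the entries.
double norm1(const std::vector<double> &x) {
    double s = 0.0;
    for (double v : x) s += std::abs(v);
    return s;
}

// 2-norm: square root of the sum of squared entries.
double norm2(const std::vector<double> &x) {
    double s = 0.0;
    for (double v : x) s += v * v;
    return std::sqrt(s);
}

// Infinity-norm: largest absolute value among the entries (0 for an empty vector).
double norm_inf(const std::vector<double> &x) {
    double m = 0.0;
    for (double v : x) m = std::max(m, std::abs(v));
    return m;
}
```

For x = (3, -4) these give 7, 5, and 4 respectively.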