1 Measuring Time

In this part we will write a simple C program to measure the amount of time required to compute the sum of the integers from 0 to 10000. The Linux performance counter interface allows us to accurately count the number of cycles it takes to execute a chunk of code.
Two files are given to you, counters.h and counters.cpp, which take care of setting up and reading the hardware performance counters.
1.1 C Skeleton

The following code snippet shows how to use the provided counter library. This code shows how to initialize a hardware counter, and also how to retrieve the current time in cycles.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include "counters.h"
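For concreteness, here is a minimal sketch of a complete skeleton that times the 0-to-10000 sum in cycles. The cycle-counter calls initTicks and getTicks are assumptions, named by analogy with the initInsns/getInsns calls shown in Section 4.1; consult counters.h for the actual interface.

    #include <stdio.h>
    #include <stdint.h>
    #include "counters.h"

    int main(int argc, char *argv[]) {
      // Initialize the cycle counter (function names assumed; see counters.h).
      hwCounter_t c;
      c.init = false;
      initTicks(c);

      // Read the counter before and after the computation.
      uint64_t start = getTicks(c);
      uint64_t sum = 0;
      for (int i = 0; i <= 10000; i++) {
        sum += i;
      }
      uint64_t elapsed = getTicks(c) - start;

      // Printing sum keeps the compiler from optimizing the loop away.
      printf("sum = %lu, cycles = %lu\n",
             (unsigned long)sum, (unsigned long)elapsed);
      return 0;
    }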
2.2 Pointer Chasing

Assume we start with an array, A, of length N, initialized with the integers {0 . . . N − 1} in shuffled order.

• Initialize the starting value of our index, i, to 0.
• In the first step, update i by setting it to the number contained in A[0]. For this example, let us say that i is now 3.
• In the second step, update i by setting it to the number contained in A[3]. Let us say that i is now 7.
• In the third step, update i by setting it to the number contained in A[7].
• Continue in this fashion for the required number of steps.

Program a pointer chasing benchmark that follows the steps outlined above; a sketch of the core loop follows below. Allow the length, N, of the array to be specified by the user. Time how long it takes to run the simulation for 2^20 steps, and compute the average number of cycles needed to complete a single step. Remember to keep in mind the subtleties involved in timing a computation that you learned about in part 1.
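A minimal sketch of the core of such a benchmark, with the timing from part 1 omitted for brevity; the shuffle shown here (Fisher-Yates with rand) is one possible way to produce the shuffled order assumed above:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
      if (argc < 2) {
        fprintf(stderr, "usage: %s N\n", argv[0]);
        return 1;
      }
      int N = atoi(argv[1]);                    // array length from the user
      int *A = (int *)malloc(N * sizeof(int));

      // Fill A with 0..N-1, then shuffle it (Fisher-Yates).
      for (int k = 0; k < N; k++) A[k] = k;
      for (int k = N - 1; k > 0; k--) {
        int j = rand() % (k + 1);
        int tmp = A[k]; A[k] = A[j]; A[j] = tmp;
      }

      // The chase: each step's load depends on the previous step's result.
      int i = 0;
      for (long step = 0; step < (1L << 20); step++) {
        i = A[i];
      }

      printf("%d\n", i);   // use i so the loop is not optimized away
      free(A);
      return 0;
    }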
2.3 Tasks

• Explain how this pointer chasing benchmark measures the time between requesting and obtaining a value from memory (i.e. the memory latency). (2 points)
• How many bytes in memory is an int array of length N on the computer you are running on? (1 point)
• Generate a plot showing the relationship between the average number of cycles to execute a step (vertical axis) and the size in bytes of the array (horizontal axis). Vary the size of the array from 32 kilobytes to 8 megabytes. (32 points)
• Discuss the shape of the curve. In particular, explain what the sudden slope changes in the curve correspond to. (4 points)
• In class we discussed how modern out-of-order processors can execute many instructions in parallel through the use of pipelining. Thus even if a single instruction takes t nanoseconds to complete, ten instructions will likely take much less than 10t nanoseconds to complete. Does this affect our benchmark? That is, if it takes t cycles to run c steps in our pointer chasing benchmark, how can we be sure that a single step is actually completing in t/c cycles and that the steps aren't simply being pipelined? (2 points)
• BONUS: We would ideally like the index, i, to eventually traverse the entire array. That is, we would like i to eventually take on every value between 0 and N − 1. However, not every array A allows this. Write a function that, when given an array A, computes how many unique values i actually takes on (one possible approach is sketched below). Explain how the code works. (16 points)
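One possible approach, as a sketch rather than a definitive solution: because each update i = A[i] is deterministic, the chase can never produce a new value after it revisits an index it has already seen, so it suffices to mark indices as they are visited and count the distinct ones. The function name countUnique is ours, not part of the provided skeleton.

    #include <stdlib.h>

    // Count how many distinct values i takes on when repeatedly
    // setting i = A[i], starting from i = 0.
    // Assumes every entry of A lies in the range [0, N).
    int countUnique(const int *A, int N) {
      char *seen = (char *)calloc(N, 1);
      int i = 0;
      int count = 0;
      while (!seen[i]) {   // stop once the chase revisits an index
        seen[i] = 1;
        count++;
        i = A[i];
      }
      free(seen);
      return count;
    }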
3 Measuring Memory Bandwidth
In this part we will be writing a benchmark for measuring the average number of bytes that the memory system can transfer to the CPU per second. This is known as the memory bandwidth. The two most important parameters that characterize a memory system are its latency and its bandwidth.

3.1 Measuring Time in Seconds

In order to measure time in seconds, instead of in cycles as we did previously, we need a new timing routine.
The following code prints out the current time measured in seconds.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(int argc, char *argv[]) {
      struct timeval tv;
      gettimeofday(&tv, 0);
      double time = tv.tv_sec + 1e-6 * tv.tv_usec;
      printf("Current time = %f seconds.\n", time);
    }

3.2 Array Copying

In order to measure the memory bandwidth of your computer system, we are going to time how long it takes to copy an array. Write a program that, given an array of length N, calculates the average number of megabits copied per second (Mbps). Remember to first "warm up the cache" by copying the array a few times before starting the timing.

3.3 Efficient Array Copying

We have provided two efficient array copying procedures in "simd_copy.cpp", simd_memcpy and simd_memcpy_cache. These two routines use SSE intrinsics to maximize performance. Alter your program so that it can use any of the three array copying procedures to time how long it takes to copy an array of length N. A sketch of such a timing harness follows below.
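A minimal sketch of such a harness, using the gettimeofday routine from Section 3.1. Here memcpy stands in for the copy routine being measured; your own routine, simd_memcpy, or simd_memcpy_cache would be substituted in its place, with their exact signatures taken from simd_copy.cpp.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    // Wall-clock time in seconds, as in Section 3.1.
    static double now(void) {
      struct timeval tv;
      gettimeofday(&tv, 0);
      return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(int argc, char *argv[]) {
      long N = atol(argv[1]);                  // array length in ints
      int *src = (int *)malloc(N * sizeof(int));
      int *dst = (int *)malloc(N * sizeof(int));
      for (long k = 0; k < N; k++) src[k] = (int)k;

      // Warm up the cache before timing.
      for (int w = 0; w < 4; w++) memcpy(dst, src, N * sizeof(int));

      int reps = 100;
      double t0 = now();
      for (int r = 0; r < reps; r++) memcpy(dst, src, N * sizeof(int));
      double elapsed = now() - t0;

      // Megabits per second: (bytes copied * 8 bits) / (elapsed * 10^6).
      double mbps = (double)N * sizeof(int) * 8.0 * reps / (elapsed * 1e6);
      printf("%.1f Mbps\n", mbps);
      free(src); free(dst);
      return 0;
    }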
3.4 Tasks

• Plot the relationship between the average bandwidth (in Mbps) sustained by the memory system and the total size (in bytes) of the array. Produce a plot for each of the three copy routines: your own routine, simd_memcpy, and simd_memcpy_cache. Vary the size of the array from 32 kilobytes to 8 megabytes. (3 × 8 points)
• Discuss the shape of the curve. Explain any sudden increases in bandwidth as the array size varies. (4 points)
• Explain why it is necessary to first "warm up the cache" by copying the array a few times before starting the timing. (2 points)
• We are trying to measure the maximum bandwidth that the memory system can sustain. Explain why an inefficient array copying procedure would yield an inaccurate measure of the bandwidth. (2 points)
• The simd copying procedures use SSE intrinsics to improve performance. Intel and AMD processors come with a special set of instructions, called SSE instructions, for directly controlling the SIMD functional units. SSE intrinsics are a pleasant C wrapper for these instructions, so that the user does not need to write assembly code to gain access to this functionality. Find all the different SSE intrinsics calls (they start with "_mm_") and explain what each of them does. The "Intel Intrinsics Guide" is a good resource. An illustrative example of intrinsics in use follows below. (2 × 3 points)
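To give a flavor of what SSE intrinsics look like (this example is ours and is not taken from simd_copy.cpp), the following snippet adds two groups of four floats at once using _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps from xmmintrin.h:

    #include <stdio.h>
    #include <xmmintrin.h>   // SSE intrinsics

    int main(void) {
      float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
      float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
      float r[4];

      __m128 va = _mm_loadu_ps(a);    // load 4 floats (unaligned load)
      __m128 vb = _mm_loadu_ps(b);
      __m128 vr = _mm_add_ps(va, vb); // 4 additions in one instruction
      _mm_storeu_ps(r, vr);           // store 4 floats back to memory

      printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
      return 0;
    }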
4 Measuring Flops and IPC

In this part we will be writing a benchmark for measuring the number of floating point operations completed per second (Flops) of matrix multiply and the number of instructions completed per cycle (IPC). Flops is an important performance measure for numerical code.

4.1 Measuring Instructions

The provided counter library can measure the number of instructions executed in a chunk of code in a similar manner to how we measured the number of cycles elapsed.

    // Initialize instruction counter
    hwCounter_t c;
    c.init = false;
    initInsns(c);

    // Get current instruction count
    uint64_t count = getInsns(c);

    .... do some computation ....

    // Compute number of instructions executed
    uint64_t executed = getInsns(c) - count;

We will be using this feature to measure the IPC of four different implementations of matrix multiply.

4.2 Matrix Multiply

We are providing a binary file, square_sgemm.o, that contains four different implementations of square matrix multiply (opt_simd_sgemm, opt_scalar1_sgemm, opt_scalar0_sgemm, and naive_sgemm), and a skeleton file, mmultiply.cpp, that shows how to call them.
Here is the prototype for naive_sgemm:

    void naive_sgemm(float *Y, float *A, float *B, int n);

naive_sgemm takes two square matrices, A and B, and their width, n, as input, and stores the result of multiplying A and B in Y. The other implementations have identical signatures. Write a program that computes the average number of floating point operations completed per second (Flops) and the number of instructions completed per cycle (IPC) for multiplying matrices of size 2^10. A sketch of such a measurement harness follows below.
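A minimal sketch of such a harness, to be linked against square_sgemm.o. The instruction counting follows the snippet in Section 4.1; the cycle-counter calls initTicks and getTicks are assumptions named by analogy with initInsns/getInsns (consult counters.h for the actual interface), and the flop count assumes the usual n^3 multiplies and n^3 adds of a straightforward n × n matrix multiply.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <sys/time.h>
    #include "counters.h"

    void naive_sgemm(float *Y, float *A, float *B, int n);

    int main(void) {
      int n = 1 << 10;
      float *Y = (float *)calloc((size_t)n * n, sizeof(float));
      float *A = (float *)malloc((size_t)n * n * sizeof(float));
      float *B = (float *)malloc((size_t)n * n * sizeof(float));
      for (long k = 0; k < (long)n * n; k++) { A[k] = 1.0f; B[k] = 2.0f; }

      // Instruction and cycle counters, as in Section 4.1.
      hwCounter_t insns, ticks;
      insns.init = false; initInsns(insns);
      ticks.init = false; initTicks(ticks);   // name assumed; see counters.h

      struct timeval tv;
      gettimeofday(&tv, 0);
      double t0 = tv.tv_sec + 1e-6 * tv.tv_usec;
      uint64_t i0 = getInsns(insns);
      uint64_t c0 = getTicks(ticks);          // name assumed; see counters.h

      naive_sgemm(Y, A, B, n);

      uint64_t cycles = getTicks(ticks) - c0;
      uint64_t executed = getInsns(insns) - i0;
      gettimeofday(&tv, 0);
      double seconds = tv.tv_sec + 1e-6 * tv.tv_usec - t0;

      // Assumed flop count: n^3 multiplies + n^3 adds = 2n^3 operations.
      double flop = 2.0 * (double)n * n * n;
      printf("Flops = %e, IPC = %f\n", flop / seconds,
             (double)executed / (double)cycles);
      return 0;
    }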
4.3 Tasks

• Produce a table recording the Flops and IPC of each matrix multiply implementation. (6 × 4 points)
• How many floating point operations (in this case, multiplies and adds) are performed to multiply two square matrices of size n? (3 points)
• Compute the Flops by dividing the total number of floating point operations performed by the time in seconds required.
• Some of the implementations should have a measured IPC exceeding 1, indicating that the processor is completing more than one instruction per cycle. How can this be? (2 points)
• Comparing opt_scalar1_sgemm with opt_simd_sgemm, opt_simd_sgemm should have a lower measured IPC than opt_scalar1_sgemm even though it performs nearly eight times more floating point operations per second. How can this be? Isn't a higher IPC always an indicator of an efficient program? (2 points)