Description
CS 6290: High-Performance Computer Architecture Project 0
This project is intended to help you set up and use the simulator for the other three projects, and also to help you understand the importance of branch prediction a little bit better. To complete this project, you will need to set up VirtualBox and the CS6290 Project virtual machine (VM).
You will put your answers in the red-ish boxes in this Word document, and then submit it in Canvas (the submitted file name should still be PRJ0.docx). You should not modify any other text in this document, and your answers should be in the red-ish boxes, in red letters and with a red underline. The font settings for the red-ish boxes are already set that way, so you should be OK if you just put your answers in and don’t change the font colors or underline settings. Georgia Tech students have full access to Office 365, so opening and editing this document should not be a problem for anyone, and we will not accept answers in any other format. If you use another tool to edit and save the file, please make sure that it preserves at least the font colors and underline settings; otherwise graders might not find your answers easily, and points lost for that reason will stay lost.
In each answer box, you must first provide your answer to the actual question (e.g. a number). You can then use square brackets to provide any explanations that the question is not asking for but that you feel might help us grade your answer. E.g. answer 9.7102 may be entered as 9.7102 [Because 9.71+0.0002 is 9.7102]. For questions that are asking “why” and/or “explain”, the correct answer is one that concisely states the cause for what the question is describing, and also states what evidence you have for that. Guesswork, even when entirely correct, will only yield 50% of the points on such questions.
Additional files to upload are specified in each part of this document. Do not archive (zip, rar, or anything else) the files when you submit them – each file should be uploaded separately, with the file name specified in this assignment. You will lose up to 10 points for not following the file submission and naming guidelines, and if any files are missing you will lose all the points for answers that are in any way related to that file (yes, this means an automatic zero score if the PRJ0.docx file is missing). Furthermore, if it is not VERY clear which submitted file matches which requested file, we will treat the submission as missing that file. The same is true if you submit multiple files that appear to match the same requested file (e.g. several files with the same name). In short, if there is any ambiguity about which submitted file(s) should be used for grading, the grading will be done as if those ambiguous files were not submitted at all.
As explained in the course rules, this is an individual project: no collaboration with other students or anyone else is allowed.
Part 1 [60 points]: Simulator set-up and simulation
Install the Oracle VirtualBox VMM (Virtual Machine Manager), then download the CS6290 Project virtual machine from https://drive.google.com/open?id=0B-Ab4Cb_awnScXpPeEhMUDYtQ1k (DO NOT get the VM somewhere else, e.g. at the Udacity website for CS 6290 OMS, because those are likely older obsolete versions). Then import this VM into VirtualBox. Once you start (boot) the CS6290 Project VM from within VirtualBox, you should see a Linux desktop. Click on the topmost icon on the left-side ribbon and in the “Type Your Command” field type “Terminal”. Now you should get a terminal window – click on it and use it to issue the commands we need.
The simulator is in the “sesc” directory within our home directory, and we first need to build it. We will go into that directory and build the simulator:
cd ~/sesc
make
In the ~/sesc directory you should see (use ‘ls’) sesc.opt and sesc.debug links to executable files. We only compiled sesc.opt now, so only that link works, but that is enough for now because we will not be modifying the simulator in this project.
Now we need some application to run inside the simulator. The simulator simulates processors that use the MIPS ISA, so we need an application compiled for a MIPS-based Linux machine. Since these are hard to find, we have set up the Splash2 benchmark suite in the ‘apps’ subdirectory. For now, let’s focus on the lu application:
cd apps/Splash2/lu
There is a Makefile for building the application from its source code, which is in the Source subdirectory:
make
Now you should see (‘ls’) a lu.mipseb executable in the application’s directory. This is an executable for a big-endian MIPS processor – just the right thing for our simulator. So let’s simulate the execution of this program!
Well, the simulator can simulate all sorts of processors – you can change its branch predictor, the number of instructions to execute per cycle, etc. So for each simulation we need to tell it exactly what kind of system to simulate. This is specified in the configuration file. SESC already comes with a variety of configuration files (in the ~/sesc/confs/ directory), and we will use the one described in the cmp4-noc.conf file. So we use the following command line (this is all one line):
~/sesc/sesc.opt -fn32.rpt -c ~/sesc/confs/cmp4-noc.conf
-olu.out -elu.err lu.mipseb -n32 -p1
Note that the dashes in the above command line are “minus” signs – pasting them into this document may have resulted in changing these to longer dashes, in which case you should change them back to “minus” signs. Also, note this is one command line that just does not fit in a single row in this document. Overall, be very careful if you copy-paste this from the document into the VM.
In this command line, the -f option tells the simulator to create a report file that ends with a dot and the specified string. The name of the report file is “sesc_” followed by the name of the executable for the simulated application (lu.mipseb in this case), followed by a dot and the string we specify here. So, in this case sesc_lu.mipseb.n32.rpt will be the file where the simulator stores simulation results. It is very important to delete the old report file (e.g. with rm sesc_lu.mipseb.n32.rpt) before we run another simulation that writes to the same file – if there is already a report file with the same name, the simulator appends its report to the file, and that causes the reporting scripts to combine the results of the simulations, which means you will not get the results you need for the simulation you just ran!
The -c option tells the simulator which configuration file to use. The configuration file tells the simulator all sorts of parameters for the simulated processor, including what its branch predictor looks like, how big its caches are, etc. The -o and -e options tell it to save the standard and error output of the application into the lu.out and lu.err files, respectively. This is needed because the simulator itself prints stuff, and we want to be able to look at the output and possible errors from the application to tell whether it executed OK. Note that there should be a space between “-c” and the configuration file name, but there should be no space between the “-o” and its file name, and there should be no space between “-e” and its file name.
Then the command line tells the simulator which (MIPS) executable to use, and what parameters to pass to that executable. The -n32 parameter tells the LU application (which does LU decomposition of a matrix) to work on a 32×32 matrix. The -p1 parameter tells the application to use only one core – we will want to use only one core until Project 3!
Now you should look at the output of the application – the lu.err file should be empty and the lu.out should have the “Total time without initialization: 0” line at the end. If you do not check, you may get incorrect simulation results without knowing it, e.g. because you had a typo in the command line that causes the application to abort without actually doing much. Applications not really finishing is a very common pitfall in benchmarking – we are running the applications just to see how the hardware performs, so we tend to not look at the output. However, if a run got aborted half-way through because of a problem (e.g. mistyped parameter, running out of memory, etc.), the hardware might look really fast for that run because execution time will be much shorter! So we need to check for each run whether it actually completed!
After the run completes, we also get a simulation report file (sesc_lu.mipseb.n32.rpt for the simulation we just ran). This report file contains the command line we used, a copy of the hardware configuration we used, and then lots of information about the simulated execution we have just completed. You can read the report file yourself, but since the processor we are simulating is very sophisticated, there are many statistics about its various components in the report file. To get the most relevant statistics out of the report file, you can use the report script that comes with SESC:
~/sesc/scripts/report.pl sesc_lu.mipseb.n32.rpt
You can specify multiple files (or use wildcards), in which case the script prints the summary for each file you listed.
The printout of the report script gives us the command line that was used, then tells us how fast the simulation was going (Exe Speed), how much real time elapsed while the simulator was running (Exe Time), and how much simulated time on the simulated machine it took to run the benchmark (Sim Time). Note that we have simulated an execution that takes much less than a millisecond, but it probably took many milliseconds for the simulator to simulate that execution. Because simulation executes on one machine but is trying to model another, when reporting results there can be some confusion about which one we are referring to. To avoid this confusion, the simulation itself is called “native” execution and the simulated execution is sometimes also called “target” execution.
- In this simulation, the simulated processor needed 0.462 milliseconds of simulated time (“Sim Time” in the report) to run the application, but the simulator itself took 2.780 seconds of native execution time to do this simulation (“Exe Time” in the report).
The printout of the report script also tells us how accurate the simulated processor’s branch predictor was overall (the number under Total). The next two numbers are in a parenthesis under the RAS label, and they report the accuracy of the return address stack (RAS) and what percentage of branches actually used the RAS. Next, under the BPred label we have the accuracy for branches that didn’t use the RAS (i.e. they used the direction predictor and possibly the BTB and/or the BTAC), and then we have statistics for the BTB and BTAC, which are less interesting for now.
- The overall accuracy of the simulated branch predictor in this simulation was 91.07 percent. The RAS tends to always be highly accurate, but it only took care of 11.34 percent of all dynamic branch instructions (i.e. branches as they occur at runtime). That means that the direction predictor handles the remaining 88.66 percent of dynamic branch instructions and achieves 89.94 percent accuracy.
Next, the report tells us how many dynamic instructions were completed (nInst) and the breakdown of these instructions, i.e. the percentage of nInst that comes from control flow instructions like branches, jumps, calls, returns, etc. (the number under the BJ label), from loads, from stores, from integer ALU, and from floating point ALU operations. The rest of that line (LD Forward and beyond) is less interesting for now.
- The number of completed dynamic instructions in this run of lu on the simulated processor is 323668, and control flow instructions represent 9.49 percent of these instructions.
Next the report tells us the overall IPC achieved by the simulated processor on this application, how many cycles the processor spent trying to execute instructions, and what those cycles were spent on. This last part is broken down into issuing useful instructions (Busy), waiting for a load queue entry (LDQ) to become free, then the same for the store queue (STQ), reservation stations (IWin), ROB entries, and physical registers (Regs, the simulator uses a Pentium 4-like separate physical register file for register renaming). Some of the cycles are wasted on TLB misses (TLB), on issuing wrong-path instructions due to branch mispredictions (MisBr), and some other more obscure reasons (ports, maxBr, Br4Clk, Other). Note that each of these “how the cycles were spent” items is a percentage although they do not have the percentage sign. Also note that the simulator models a multiple-issue processor, so the breakdown of where the cycles went is actually attributing partial cycles when needed. For example, in a four-issue processor we may have a cycle where one useful instruction is issued and then the processor could not issue another one because the LDQ became full. In this cycle, 0.25 cycles would be counted as “Busy” and 0.75 cycles toward “LDQ”.
- Given that branches represent 9.49 percent of all dynamically completed instructions, and that only 8.93 [100 – 91.07] percent of all completed branch instructions are mispredicted, we can conclude that only 0.85 [8.93% of 9.49%] percent of all dynamically completed instructions are mispredicted branches. But now we see that branch mispredictions result in wasting 45.2 percent of the processor’s performance potential. Why is this last number (percentage of performance wasted) so high compared to the previous one (percentage of instructions that are mispredicted branches)?
Because of pipelining, a branch misprediction invalidates not just the branch itself but all the wrong-path instructions fetched after it: that work must be discarded, and the pipeline must be refilled from the correct target. While the pipeline refills, the processor stalls or issues nothing useful, so each misprediction wastes many cycles (and, in a multiple-issue processor, several issue slots per cycle). Even though mispredicted branches are a tiny fraction of all instructions, each one wastes tens of cycles’ worth of issue opportunity, which adds up to a large fraction of the performance potential.
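A back-of-the-envelope calculation using these two numbers (a rough estimate, not something the report states directly) shows how expensive each misprediction must be:

cycles wasted per misprediction = (0.452 × total cycles) / (0.0085 × nInst) ≈ 53 × CPI

so with a CPI anywhere near the usual 0.5–1 range, each mispredicted branch costs tens of cycles of issue opportunity, which is how less than 1 percent of the instructions can waste 45.2 percent of the cycles.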
Now run two additional simulations, this time using 128 and then 256 as the matrix size for lu (note the changes in the -f and -n parameters):
~/sesc/sesc.opt -fn128.rpt -c ~/sesc/confs/cmp4-noc.conf
-olu.out -elu.err lu.mipseb -n128 -p1
~/sesc/sesc.opt -fn256.rpt -c ~/sesc/confs/cmp4-noc.conf
-olu.out -elu.err lu.mipseb -n256 -p1
Note: remember to check lu.err and lu.out after each simulation to make sure that the simulated run has completed correctly!
- As we increase the matrix size (-n parameter) from 32 to 128, and then to 256, the overall branch predictor accuracy improves from 91.07 to 95.03 and then to 94.72 percent. Why does this accuracy improve as we increase the -n parameter in lu?
From the reports, the number of completed instructions grows with n, while the fraction of branch instructions stays roughly the same (9.49% for 32, 9.29% for 128, 9.44% for 256), so the total number of dynamic branches grows. Most of lu’s branches come from nested “for” loops whose iteration counts grow with the matrix dimension, and a loop branch is typically mispredicted only around the loop exit. As the matrix dimension grows, the number of mispredictions therefore grows more slowly than the number of branches – the branch behavior becomes more regular and patterned – and the overall prediction accuracy improves (see the illustration after the notes below).
Note: Predictor accuracy is essentially the overall probability that an occurrence of a branch instruction will be predicted correctly, i.e. the fact that we complete more instructions or that we complete more branches does not by itself explain why each branch occurrence is now more likely to be predicted correctly.
Note: We are not asking for guesswork – the answer to this question should include either conclusive evidence or excellent reasoning.
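For intuition about loop branches, consider a hypothetical loop (an illustration only, not lu’s actual source code): the loop-back branch of an n-iteration loop goes the same way n−1 times and goes the other way once, at the exit, so even a simple 2-bit counter predicts it correctly about (n−1)/n of the time, and that fraction grows with n:

// Hypothetical loop, for illustration only:
void scale_row(double *row, double factor, int n) {
    for (int i = 0; i < n; i++)   // loop-back branch: mispredicted ~once, at the exit
        row[i] *= factor;         // accuracy ~31/32 = 97% for n=32, ~255/256 = 99.6% for n=256
}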
- Submit all three simulator-generated report files (sesc_lu.mipseb.n32.rpt, sesc_lu.mipseb.n128.rpt, and sesc_lu.mipseb.n256.rpt). Remember that each of these should be a result of only one simulation!
Part 2 [40 points]: Compiling and simulating a small application
Now let’s see how to compile our own code so we can run it within the simulator. For that purpose, modify the hello.c in the ~/sesc/tests/ directory to print a greeting with your first and last name. The hello.c should be exactly as follows (e.g. do not omit the \n at the end of the string in the printf call, do not add extra spaces or other characters to the string, etc.), except that the name (the part in bold italic letters) should be your first and last name as shown in Canvas:
#include <stdio.h>
int main(int argc, char *argv[]){
  printf("Hello! I am Milos Prvulovic\n");
  return 0;
}
Now you should try to compile and run your code natively, to see if it’s working:
gcc -o hello hello.c
./hello
This should result in printing out the “Hello! I am Milos Prvulovic” message and then a newline, with your own name of course. Once your code is working natively, you can try compiling it for the simulator. Remember that, because the simulator uses the MIPS ISA, we cannot use the native compiler. Instead, we use a cross-compiler – a compiler that runs on one machine (the native machine) but produces code for a different machine (the target machine, i.e. MIPS). Installing and setting up the cross-compiler is a time-consuming and tricky process, so we have already installed one and made it ready for you to use. It can be found in /mipsroot/cross-tools/bin with the name “mips-unknown-linux-gnu-gcc”. In fact, there are a number of other cross-compilation tools there, all of them with the prefix “mips-unknown-linux-gnu-”, followed by the usual name of a GNU/GCC tool. Examples of these tools include gcc (a C compiler), g++ (a C++ compiler), gas (assembler), objdump (tool for displaying information about object files), addr2line (returns source file and line for a given instruction address), etc.
To compile our hello.c application, we can use the cross-compiler directly:
/mipsroot/cross-tools/bin/mips-unknown-linux-gnu-gcc -O0 -g -static -mabi=32 -fno-delayed-branch -fno-optimize-sibling-calls -msplit-addresses -march=mips4 -o hello.mipseb hello.c
We disable optimization (-O0), include debugging information in the executable file (-g), link with static libraries (-static) to avoid dynamic linking, which requires additional setting up in the simulator, tell the compiler to avoid using branch delay slots (-fno-delayed-branch), to use regular call/return sequences for all procedure calls (-fno-optimize-sibling-calls), to use some optimizations that allow faster addressing (-msplit-addresses), and to use a 32-bit implementation of the MIPS IV ISA (-mabi=32 -march=mips4).
After the compiler generates the MIPS executable file hello.mipseb for us, we can execute it in the simulator. Run this command line exactly as shown below – you should not change the command line options, run it from a different directory, or specify the path for hello.mipseb – that would change the simulated execution and prevent us from checking if you did it correctly and if everything is working.
~/sesc/sesc.opt -fhello0.rpt -c ~/sesc/confs/cmp4-noc.conf -ohello.out hello.mipseb
To complete this part of the project, get the simulation report summary (using report.pl) and fill in the blanks in the following statements:
- In this simulation, the simulated processor executed 4843 (“nInst” in the report) instructions.
- Why does it take so many instructions to execute even this small program?
Note: Again, answers based on guesswork will not earn points here!
There are preliminary steps before main executes. Execution begins at <__start>, which sets up the argument stack and then calls <__libc_start_main>. <__libc_start_main> sets up the execution environment, calling functions such as <__pthread_initialize_minimal> to initialize the threading environment, and only then calls main. All of those start-up instructions are simulated and counted before our own code even begins.
Helpful Note #1: You can use mips-unknown-linux-gnu-objdump with a “-d” option to see the assembler listing of the hello.mipseb executable. In the listing produced by the disassembler, the code for the C “main” can be found by searching for “<main>”.
Helpful Note #2: When an executable file is loaded, it does not begin execution at main(). For what really happens, you can look up the documentation for the __libc_start_main function, which is part of the GNU standard C library (GLIBC).
Helpful Note #3: Not all of the code you see in the listing from Helpful Note #1 is actually executed. For example, you can use mips-unknown-linux-gnu-strip on the hello.mipseb executable to dramatically reduce its code size without breaking it (you can easily find online what the “strip” utility does). Even among the code that is not removed by “strip”, a lot of the code is only executed under special circumstances (failure to allocate memory, etc.) and is not reached in the specific run we are simulating.
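If you want to convince yourself that the C runtime really does work before main is reached, here is a small native experiment you can try (for illustration only, not part of the required submission; __attribute__((constructor)) is a GCC extension that registers a function to be run by the start-up code before main):

#include <stdio.h>

__attribute__((constructor))
static void before_main(void) {
    puts("the start-up code runs this before main");
}

int main(void) {
    puts("now in main");
    return 0;
}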
- Edit your hello.c to add an exclamation mark (“!”) to the string that is printed: printf("Hello! I am Milos Prvulovic!\n");
then cross-compile and simulate your hello.mipseb program again (use -fhello1.rpt to save the simulation report to a different file). Now the simulated processor has completed 4851 instructions. If the number of simulated instructions has increased, add another exclamation mark, compile, and simulate again with an incremented value for the -f option, so the report is saved to a new file with a number that matches the number of exclamation marks that were added. Keep doing this until the number of completed instructions goes down from the previous run. For example, if the number of simulated instructions increases when you add one exclamation mark, increases again when you add two, but with three you get fewer completed instructions than with two exclamation marks, then you should stop after the simulation with three exclamation marks. Explain why the number of executed (simulated) instructions decreases in this latest simulation compared to the previous one:
The compiled main does not actually call printf: it calls <_IO_puts>, which outputs the string to stdout. Inside the puts code path the string is scanned by a loop (e.g. to find its length), and the number of instructions this loop executes depends on the string’s length and alignment. With the extra exclamation mark, the string reaches a length at which this loop takes its exit branch earlier instead of staying in the loop for more instructions, so the run completes fewer instructions even though the string is longer.
Note: No guesswork or vague answers! The answer should be very specific, i.e. to answer this you need to point out specific behavior in specific code.
Helpful Note: You can use objdump to see what main is really doing (it’s not calling printf!) then find the source code for that function online instead of trying to figure out what it does by looking at the (barely human-readable) disassembly.
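For reference, here is what the compiled main effectively does (a sketch: GCC folds a printf whose format string contains no % conversions and ends in a newline into a call to puts, which appends the newline itself):

#include <stdio.h>

int main(void) {
    puts("Hello! I am Milos Prvulovic");  /* the newline comes from puts */
    return 0;
}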
- Submit your last hello.c source code (the one with the added exclamation mark that made it execute fewer instructions than with one less exclamation mark), the original report file (sesc_hello.mipseb.hello0.rpt), and the one from the last simulation (the one that resulted in a decrease in completed instructions, e.g. it may be sesc_hello.mipseb.hello3.rpt).
CS 6290: High-Performance Computer Architecture Project 1
This project is intended to help you understand branch prediction and performance of out-of-order processors. You will again need the “CS6290 Project VM” virtual machine, the same one we used for Project 0. Just like for Project 0, you will put your answers in the red-ish boxes in this Word document, and then submit it in Canvas (the submitted file name should now be PRJ1.docx).
In each answer box, you must first provide your answer to the actual question (e.g. a number). You can then use square brackets to provide any explanations that the question is not asking for but that you feel might help us grade your answer. E.g. answer 9.7102 may be entered as 9.7102 [Because 9.71+0.0002 is 9.7102]. For questions that are asking “why” and/or “explain”, the correct answer is one that concisely states the cause for what the question is describing, and also states what evidence you have for that. Guesswork, even when entirely correct, will only yield up to 50% of the points on such questions.
Additional files to upload are specified in each part of this document. Do not archive (zip, rar, or anything else) the files when you submit them – each file should be uploaded separately, with the file name specified in this assignment. You will lose up to 20 points for not following the file submission and naming guidelines, and if any files are missing you will lose all the points for answers that are in any way related to a missing file (yes, this means an automatic zero score if the PRJ1.docx file is missing). Furthermore, if it is not VERY clear which submitted file matches which requested file, we will treat the submission as missing that file. The same is true if you submit multiple files that appear to match the same requested file (e.g. several files with the same name). In short, if there is any ambiguity about which submitted file(s) should be used for grading, the grading will be done as if those ambiguous files were not submitted at all.
Most numerical answers should have at least two decimals of precision. Speedups should be computed to at least 4 decimals of precision, using the number of cycles, not the IPC (the IPC reported by report.pl is rounded to only two decimals). You lose points if you round to fewer decimals than required, or if you truncate digits instead of correctly rounding (e.g. a speedup of 3.141592 rounded to four decimals is 3.1416, not 3.1415).
As explained in the course rules, this is an individual project: no collaboration with other students or anyone else is allowed.
Part 1 [20 points]: Configuration of the Branch Predictor
The hardware of the simulated machine is described in the configuration file. In this project we will be using the cmp4-noc.conf configuration file again, but this time we will modify this file so this is a good time to make a copy so we can restore the original configuration when we need it.
The processors (cores) are specified in the “cpucore” parameter near the beginning of the file. In this case, the file specifies that the machine has 4 identical cores numbered 0 through 3 (the procsPerNode parameter is 4), and that each core is described in section [issueX]. Going to section [issueX], we see that a core has a lot of parameters, among which we see that the clock frequency is set at 1GHz, and that this is an out-of-order core (inorder set to false) which fetches, issues, and retires up to 2 instructions per cycle (the “issue” parameter is set to two earlier in the file). The core has a branch predictor described in the [BPredIssueX] section, fetches instructions from a structure called “IL1” described in the [IMemory] section (this is specified by the instrSource parameter), and reads/writes data through a structure called “DL1” described in the [DMemory] section. In this part of the project, we will be modifying the branch predictor, so let’s take a closer look at the [BPredIssueX] section. It says that the type of the predictor is “Hybrid” (which does not tell us much), and then specifies the parameters for this predictor.
The “Hybrid” predictor is actually a tournament predictor. You now need to look at its source code (which is in the BPred.h and BPred.cpp files in the ~/sesc/src/libcore/ directory) and determine which of the parameters in the configuration file controls which aspect of the predictor. Hint: the “Hybrid” predictor is implemented in the BPHybrid class, so its constructor and predict method will tell you most of what you need to find out.
- The meta-predictor in this hybrid predictor is a table that has 2048 entries and each entry is a 2-bit counter. This meta-predictor decides, based on the PC address (i.e. the address from which we fetched the branch instruction), whether to make the prediction using a simple (no history) array of counters, or to use a global (is it local or global?) history predictor. The simpler (non-history) predictor uses 2-bit counters, and has 2048 of them (this number is specified using the parameter labeled localSize in the BPredIssueX section of the configuration file). The history-based predictor has 8 bits of history, which are combined with the PC address to index into an array that has 2048 entries (this number of entries is specified in the configuration file using the parameter labeled l2size), and each entry is a 1-bit counter.
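To make this structure concrete, here is a minimal sketch of a tournament predictor in the spirit of BPHybrid (illustrative only: this is not SESC’s actual code, and the table sizes and counter widths are whatever the configuration file specifies):

#include <cstdint>
#include <vector>

struct Tournament {
    std::vector<uint8_t> meta;    // chooser: 2-bit counters indexed by PC
    std::vector<uint8_t> simple;  // no-history counters ("localSize" entries)
    std::vector<uint8_t> global;  // history-indexed counters ("l2size" entries)
    uint32_t history = 0;         // global branch history bits

    Tournament(size_t nMeta, size_t nSimple, size_t nGlobal)
        : meta(nMeta, 1), simple(nSimple, 1), global(nGlobal, 1) {}

    bool predictTaken(uint32_t pc) const {
        if (meta[pc % meta.size()] >= 2)  // chooser says: use the history predictor
            return global[(pc ^ history) % global.size()] >= 2;
        return simple[pc % simple.size()] >= 2;  // chooser says: use the simple one
    }
    // The update path (not shown) trains the component that was used, nudges
    // the chooser toward whichever component was right, and shifts the branch
    // outcome into 'history'.
};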
Part 2 [30 points]: Changing the Branch Predictor
Now we will compare some branch predictors. The LU benchmark we used in Project 0 does not really stress the branch predictor, so we will use the raytrace benchmark:
cd ~/sesc/apps/Splash2/raytrace
make
Now it is time to do some simulations:
- Simulate the execution of this benchmark using the unmodified
cmp4-noc configuration (with the “Hybrid” predictor). The following should all be a single command line, which has a space before -ort.out. As before, the dashes in this command line should be the minus character, but a copy-paste might result in something else that looks similar but is not a minus character, so be careful if you are copy-pasting.
~/sesc/sesc.opt -f HyA -c ~/sesc/confs/cmp4-noc.conf
-ort.out -ert.err raytrace.mipseb -p1 -m128 -a2 Input/reduced.env
Then we will modify the configuration file, so make a copy of it if you did not do this already. Then change the configuration to model an oracle (perfect) direction predictor by changing the “type” of the predictor from “Hybrid” to “Oracle”, and then re-run the simulation (change the -f parameter to -f OrA so the simulation results are written to a different file). Note that overall branch prediction accuracy is not perfect in this case – only the direction predictor is perfect, but the target address predictor is a (non-oracle) BTB! After that, configure the processor to use a simple predict-not-taken predictor (type=”NotTaken”) and run the simulation again (now using -f NTA). Submit the three simulation report files (sesc_raytrace.mipseb.HyA, sesc_raytrace.mipseb.OrA, and sesc_raytrace.mipseb.NTA) in Canvas along with the other files for this project.
- In the table below, for each simulation fill in the overall accuracy (number under BPred in the output of report.pl), the number of cycles, and the speedup relative to the configuration that uses the Hybrid predictor.
| BPred Accuracy | Cycles | Speedup vs. Hybrid |
NotTaken | 44.39 % | 309510920 | 0.7281X |
Hybrid | 84.18 % | 225359336 | 1.0000X |
Oracle | 88.14 % | 212764015 | 1.0592X |
- Now change the processor’s renameDelay parameter (in the [issueX] section of the configuration file) from 1 to 9. This makes the processor’s pipeline 8 stages longer. Repeat the three simulations, and submit the simulation report files (sesc_raytrace.mipseb.HyC, sesc_raytrace.mipseb.OrC, and sesc_raytrace.mipseb.NTC) in Canvas along with the other files for this project.
- In the table below, fill in the number of cycles with each type of predictor from Part A (simulations with the default pipeline depth) and from Part C (when the pipeline is 8 stages deeper), then compute the speedup of shortening the pipeline for each type of predictor, assuming that the clock cycle time stays the same (so the speedup can be computed using the number of cycles instead of execution time).
| Cycles w/ renameDelay=1 | Cycles w/ renameDelay=9 | Speedup of changing renameDelay from 9 to 1 |
NotTaken | 309510920 | 402817352 | 1.3015X |
Hybrid | 225359336 | 250303858 | 1.1107X |
Oracle | 212764015 | 230989637 | 1.0857X |
- The results in Part E) lead us to conclude that better branch prediction becomes (more or less?) more important when the processor’s pipeline depth increases. Explain why:
Consider CPI = CPI_ideal + MispredictPenalty × MispredictRate: two factors determine the misprediction term. The penalty depends on the pipeline depth, i.e. on how many stages fill with wrong-path work before the branch outcome is resolved, while the rate is determined by the predictor. As the pipeline gets deeper, the penalty grows, because more instructions must be discarded on each misprediction. With the penalty growing, the only way to keep CPI from growing is to lower the misprediction rate, i.e. better branch prediction becomes more important.
- From the simulation results you have collected up to this point, there are at least two good ways to estimate how many cycles are wasted when we have a branch misprediction in the processor that has the default pipeline depth, i.e. what the branch misprediction penalty (in cycles) was for the simulations in Part A). Enter your best estimate here: 32, and then explain how you got it:
According to the configuration file, the “issue” value is 2, meaning up to two instructions can be issued per cycle, so CPI_ideal in the formula above is 0.5. We then use the numbers from the report file for the Hybrid run. The IPC is 0.92, so the CPI is 1/0.92 = 1.087, which is much greater than the ideal CPI. From the equation, 1.087 − 0.5 = MispredictPenalty × MispredictRate. The misprediction rate per instruction is (1 − 84.18%) × BJ% = 1.85%, so MispredictPenalty = 0.587/0.0185 = 31.73, i.e. about 32 cycles.
Part 3 [50 points]: Which branches tend to be mispredicted?
In this part of the project we again use the cmp4-noc configuration. You should change it back to its original content, i.e. what it had before we modified it for Part 2. We will continue to use the Raytrace benchmark with the same parameters as in Part 2.
Our goal in this part of the project is to determine, for each branch instruction in the program, how many times the direction predictor (Hybrid or NotTaken) correctly predicts and how many times it mispredicts that branch. The number of times a static branch instruction was completed can be computed as the sum of the two (correct and incorrect prediction) counts for that static branch instruction. You should change the simulator’s code to count correct and incorrect predictions for each static branch instruction separately, and to print out (at the end of the simulation) the numbers you need to answer the following questions; a sketch of one way to organize this bookkeeping is shown below. The printing should be in the order in which the numbers are requested below, and your code should not be changing the simulation report in any way. Then you should, of course, run the simulation and get the simulation results with the Hybrid and also with the NT predictor.
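Here is a minimal sketch of one way to keep these per-branch counts (illustrative only: in the real change, the counters belong in BPred.h/BPred.cpp, where the predictor already knows the branch PC and whether its prediction was correct):

#include <cstdint>
#include <cstdio>
#include <map>

struct BrStats { uint64_t correct = 0, incorrect = 0; };
static std::map<uint32_t, BrStats> perBranch;   // keyed by the static branch PC

// Called from the direction predictor's update path:
void recordPrediction(uint32_t pc, bool wasCorrect) {
    BrStats &s = perBranch[pc];
    if (wasCorrect) s.correct++; else s.incorrect++;
}

// Called once at the end of the simulation; prints separately from the
// report file, which must remain unchanged:
void dumpBranchStats() {
    for (const auto &kv : perBranch) {
        unsigned long long total = kv.second.correct + kv.second.incorrect;
        std::fprintf(stderr, "PC 0x%08x: %llu completions, %llu correct\n",
                     (unsigned)kv.first, total,
                     (unsigned long long)kv.second.correct);
    }
}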
- In both simulations, the number of static branch instructions that are completed at least once but fewer than 10 times (i.e. between 1 and 9 completions) is 1031, the number of static branch instructions with 10 to 99 completions is 153, the number of static branch instructions with 100 to 999 completions is 426, and the number of static branch instructions with 1000+ completions is 728.
- The accuracy of the direction predictor, computed separately for the four groups of static branch instructions (1-9, 10-99, 100-999, and 1000+ completions), is as follows:
| Hybrid Accuracy | NT Accuracy |
1-9 | 67.10% | 43.60% |
10-99 | 91.31% | 36.54% |
100-999 | 94.28% | 37.29% |
1000+ | 95.16% | 44.42% |
- We cannot run the raytrace benchmark with a much larger input because the simulation time would become excessive. But if we did, concisely state what would happen to the overall direction predictor accuracy of the Hybrid predictor and of the NT predictor, and then explain why the predictor accuracy should change in the way you stated.
As we increase the input, the accuracy of both predictors is expected to improve, because branch behavior becomes more dominated by repeating patterns: for example, the number of “for” loop iterations grows with the input, so the (mispredicted) loop exits become a smaller fraction of all loop branches. The improvement should be larger for the Hybrid predictor, because its history-based component can learn those longer repeating patterns of this specific program, while NotTaken cannot adapt at all. The table in Part H supports this: prediction accuracy is higher for the branches that execute more times, and that increase in accuracy is much larger for the Hybrid predictor than for the NotTaken predictor.
- Submit the BPred.h and BPred.cpp files that you have modified to produce the numbers you needed for Part 3 of this project. If you have modified any other source code in the simulator, create an OtherCode.zip file that includes those files and submit it, too. Also submit the output of the simulator for the two runs (as rt.out.Hybrid and rt.out.NT). Note that there is no need to submit the simulation report files (these should be the same as those from Part A).
CS 6290: High-Performance Computer Architecture Project 2
This project is intended to help you understand caches and performance of out-of-order processors. As with previous projects, for this project you will need VirtualBox and our project virtual machine. Just like in previous projects, you will put your answers in the reddish boxes in this Word document, and then submit it in Canvas (but this time the submitted file name should be PRJ2.docx).
In each answer box, you must first provide your answer to the actual question (e.g. a number). You can then use square brackets to provide any explanations that the question is not asking for but that you feel might help us grade your answer. E.g. answer 9.7102 may be entered as 9.7102 [Because 9.71+0.0002 is 9.7102]. For questions that are asking “why” and/or “explain”, the correct answer is one that concisely states the cause for what the question is describing, and also states what evidence you have for that. Guesswork, even when entirely correct, will only yield up to 50% of the points on such questions.
Additional files to upload are specified in each part of this document. Do not archive (zip, rar, or anything else) the files when you submit them – each file should be uploaded separately, with the file name specified in this assignment. You will lose up to 20 points for not following the file submission and naming guidelines, and if any files are missing you will lose all the points for answers that are in any way related to a missing file (yes, this means an automatic zero score if the PRJ2.docx file is missing). Furthermore, if it is not VERY clear which submitted file matches which requested file, we will treat the submission as missing that file. The same is true if you submit multiple files that appear to match the same requested file (e.g. several files with the same name). In short, if there is any ambiguity about which submitted file(s) should be used for grading, the grading will be done as if those ambiguous files were not submitted at all.
Most numerical answers should have at least two decimals of precision. Speedups should be computed to at least 4 decimals of precision, using the number of cycles, not the IPC (the IPC reported by report.pl is rounded to only two decimals). You lose points if you round to fewer decimals than required, or if you truncate digits instead of correctly rounding (e.g. a speedup of 3.141592 rounded to four decimals is 3.1416, not 3.1415).
This project can be done either individually or in groups of two students. If doing this project as a two-student group, you can do the simulations and programming work together, but each student is responsible for his/her own project report, and each student will be graded based solely on what that student submits. Finally, no collaboration with other students or anyone else is allowed. If you do have a partner, you have to provide his/her name here: None (enter None here if no partner) and his/her Canvas username here: None. Note that this means that you cannot change partners once you begin working on the project, i.e. if you do any work with a partner you cannot “drop” your partner and submit the project as your own (or start working with someone else), because the collaboration you already had with your (original) partner then becomes unauthorized collaboration.
In this project we will be using the FMM benchmark with 256 particles and single-core execution, and we will continue to use the cmp4-noc.conf configuration file. Remember to first restore the cmp4-noc.conf file to its default contents. Also, if your branch predictor changes from Project 1 could change the results of a simulation even when the default (Hybrid) predictor is used, you need to restore the original code of the simulator. Essentially, you need to undo all the changes made in Project 0 and Project 1. Then you can run the simulation:
cd ~/sesc/apps/Splash2/fmm
If the fmm.mipseb file is not already present in this directory, then build it:
make
and run the first simulation we will need for this project exactly like this (this should be one line where all ‘-’ characters are the normal “minus” character, the line has a single space between -ofmm.out and -efmm.err, a single space character between fmm.mipseb and -p, a single space between -p and 1, and no spaces, tabs, or anything else after that):
~/sesc/sesc.opt -f Default -c ~/sesc/confs/cmp4-noc.conf
-iInput/input.256 -ofmm.out -efmm.err fmm.mipseb -p 1
In this command line the -c, -o, and -e simulator options should already be familiar. The -f option tells the simulator to save the report file as sesc_fmm.mipseb.Default instead of using a random string at the end. This way you can produce report files with the names you need for your Project 2 submission, without having to rename them.
As with every simulation, you should verify that the simulated execution didn’t crash or terminate too soon due to misspelling of the command line. After a correct simulation, the fmm.err file should be empty, and fmm.out should begin with “Creating a two cluster, non uniform distribution for 256 particles” and have a “Total time for steps 3 to 5 : 0” line at the end. Having this does not guarantee that everything is OK, but it at least means that FMM did not exit prematurely.
Part 1 [35 points]: Performance impact of caches
In this project, we will be modifying the data caches of the simulated processor, so let’s take a closer look at the [DMemory] section of the configuration file. It says that the structure the processor gets data from is of type “smpcache” (it’s a cache that can work in a multiprocessor, as we will see in Project 3), which can store 32KBytes of data (size parameter), is 4-way set associative (assoc parameter), has a 64-byte block/line size (the bsize parameter uses cacheLineSize, which is set to 64 earlier in the configuration file), is a write-back cache (writePolicy), uses the LRU replacement policy, has two ports with a port occupancy of 1 cycle (so it can handle two accesses every cycle), has a 1-cycle hit time, and takes 1 cycle to detect a miss. If there is a miss, the processor keeps track of it using the DMSHR (data miss handling registers) structure, which is described in the [DMSHR] section as a 64-entry structure where each entry can keep track of a miss to an entire 64-byte block.

On a miss, the L1 cache requests data from the core’s local slice of the L2 cache, or from the on-chip router that connects it to the L2 slices of other cores. Note that in this project we will still be using only one core (Core 0), so it gets to use the entire L2 cache (all four slices). Looking at the [L2Slice] section, we see that each slice can store 1 megabyte of data (so the total L2 cache size is 4MB), that it is a 16-way set associative cache with a 64-byte block size, write-back policy, LRU replacement, 2 ports, a 12-cycle hit time, that it needs 12 cycles to detect a miss, and that it uses a 64-entry MSHR to keep track of misses. When there is a miss, it is handed off to a local on-chip router, which uses the on-chip network (NOC) to deliver the message to a memory controller. It in turn uses the off-chip processor-memory bus to access the main memory, which is modeled in this configuration as an infinite cache with a 200-cycle hit delay. Recall that a real off-chip main memory has a similar delay but much more complicated behavior, so this simplification is there mostly to avoid specifying the myriad main-memory parameters.
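For intuition, these latencies combine into an average memory access time (AMAT). As a back-of-the-envelope (this ignores the latency-hiding that the out-of-order core and the MSHRs provide, which the simulator does model), the default parameters give:

AMAT = L1_hitTime + L1_missRate × (L2_hitTime + L2_missRate × memLatency)
     = 1 + L1_missRate × (12 + L2_missRate × 200) cycles

so once you plug in the L1 miss rates you will measure below, you can see why even small changes in the miss rate have a visible effect on performance.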
Now let’s change some cache parameters and see how they affect performance. Before we make any changes to the cmp4-noc.conf file, we should save the original so we can restore the default configuration later. In general, you should be very careful about ensuring that you have the correct configuration. The values for one thing (e.g. L1 cache) can affect what happens in other things (e.g. L2 cache), so you should be able to restore the default parameters when needed.
- Run the fmm benchmark with the default configuration and submit the report from this simulation as sesc_fmm.mipseb.Default.
- Change the L1 cache size to 1kB (leave all other conf parameters unchanged, and make sure to save the original cmp4-noc.conf), run that simulation (using -f SmallL1 this time in the simulation command line), and submit the report as sesc_fmm.mipseb.SmallL1.
- With a 1kB L1 cache, the miss rate in the L1 cache is 18.37 percent, and with a 32kB L1 cache the miss rate is 1.03 percent. The overall speedup achieved by replacing a 1kB L1 cache with a 32kB cache is 1.6398.
- In Part 1B we have seen that the simulated execution is faster when we change the L1 cache size from 1kB to 32kB. But if we look at the time it takes to do the simulation itself (“Exe Time” from report.pl), we find that the simulation takes less time, too, although in both simulations the simulated processor executes almost exactly the same number of instructions. You may think that it takes less time to simulate fewer cycles, even when the same overall work gets done over fewer cycles, but actually the simulator takes less time because it does do less work when we have a larger L1 cache. What is the simulator’s work that is eliminated?
The miss rate is much smaller in the simulation with the larger L1 cache. With fewer cache misses, the simulator rarely has to model bringing data from the L2 cache and main memory into the L1 cache, so most accesses are simple L1 hits that are cheap to simulate. In AMAT terms (Average Memory Access Time = hitTime + missRate × missPenalty), the smaller L1 cache has a much larger miss rate, so the simulator must model many more L2 and main-memory events per instruction. In addition, with the write-back policy, more misses also mean that dirty blocks are written back toward main memory more often, which is yet more work that the larger cache eliminates.
- Now let’s compare the default 32kB 4-way set-associative L1 cache with one that has the same size but is direct-mapped. You already ran a simulation with the default (set-associative) configuration, so you only need to run a simulation for the direct-mapped L1 cache. Submit the simulation report for this run as sesc_fmm.mipseb.DMapL1.
- The miss rate with the direct-mapped L1 cache is 1.61 percent, the miss rate with the 4-way set-associative L1 cache is 1.03 percent, and the overall speedup achieved by changing from direct-mapped (1-way set-associative) to 4-way set-associative L1 cache is 1.0244.
- Now let’s restore the default configuration (32kB, 4-way set associative L1 cache) and change the L1 cache latency to 4 cycles (change both hitDelay and missDelay to 4) and then to 7 cycles. Submit the reports for these simulations as sesc_fmm.mipseb.4CycL1 and sesc_fmm.mipseb.7CycL1.
- The speedup of improving L1 latency from 7 to 4 cycles is 1.0627, the speedup of improving the L1 latency from 4 to 1 cycles is 1.0579, and the speedup of improving the L1 latency from 7 to 1 cycles is 1.1242.
- The change in L1 latency from 7 cycles to 1 cycle represents a 7X improvement, and loads and stores represent about 30% of all instructions. So a naïve application of Amdahl’s law tells us to expect a speedup of 1.35. Explain why the actual speedup is very different from that:
The naïve Amdahl calculation assumes that the full 30% of execution time attributed to loads and stores is sped up 7X. In reality, only the L1 hit latency changes: accesses that miss still pay the unchanged L2 and main-memory latencies, whose delay is much greater than the L1’s, and much of the L1 hit latency is overlapped with other work by the out-of-order core anyway. Since the part of execution time that actually scales with L1 latency is much smaller than 30%, the actual speedup is much smaller than the naïve prediction.
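For reference, the naïve 1.35 figure comes from a straightforward application of Amdahl’s law with a 30% affected fraction and a 7X improvement of that fraction:

speedup = 1 / ((1 − 0.30) + 0.30/7) = 1 / 0.7429 ≈ 1.35

whereas the measured speedup was only 1.1242.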
Part 2 [20 points]: Changing the simulated cache
The cache implementation in the simulator can only model the LRU replacement policy – note that a RANDOM policy can be specified in the configuration file, but the code that models the replacement policy will still implement LRU even when RANDOM is specified. Now we will explore what happens when we actually change the cache’s replacement policy. We will implement NXLRU (Next to Least Recently Used). While LRU replaces the block that is the first in LRU order (i.e. the least recently used block) in the cache set, NXLRU should replace the block that is the second in LRU order in the set, i.e. the second-least-recently-used block in the set.
To implement NXLRU, we need to modify the code of the simulator. The source that implements the ‘smpcache’ (used for our L1 cache) is in SMPCache.h and SMPCache.cpp in the sesc/src/libcmp/ directory. For much of the “basic” cache behavior, the SMPCache uses code in sesc/src/libsuc/CacheCore.h (and CacheCore.cpp). There are separate classes for CacheDM (for direct mapped caches) and CacheAssoc (for set-associative caches). Since direct-mapped caches do not have a replacement policy (they must replace the one line where the new block must go), we will be looking at the CacheAssoc class. First we must add “NXLRU” as an option that can be specified in the conf file and selected when a CacheAssoc object is constructed. Probably a good approach is to look for “LRU” in the code to see how this is done for LRU (and RANDOM), and then add NXLRU.

Then we must actually implement this policy. The function that actually implements the cache’s replacement policy is the findLine2Replace method of the CacheAssoc class in CacheCore.cpp. The parameter supplied to this method is the new address that needs a line in the cache. Note that this method does not only implement the replacement policy, because an actual replacement (replacing one valid line with another) may not be needed. For example, when addr is already in the cache (a cache hit), this method returns the line that contains addr. When the set where addr belongs contains non-valid lines, one of those non-valid lines is used – a valid block may have a cache hit in the future, while a non-valid line cannot, so we should only replace a valid line if the set has no non-valid lines. Finally, when the set contains “locked” lines they are skipped, so the actual “LRU” policy implemented by findLine2Replace is “From the set where addr belongs, return the line that contains addr if there is such a line, otherwise return the invalid line that was accessed most recently if there are any invalid lines, otherwise return the least recently used line among the lines that are not locked”. Even that is not the complete specification, because findLine2Replace must also consider what should happen when all lines are valid and locked – in that case it returns 0 unless ignoreLocked is true, in which case it returns the least recently used line chosen among all the (valid and locked) lines in the set.
Our NXLRU policy should treat hits and invalid lines just like the existing LRU policy, but when there is no hit and no invalid lines to return, the NXLRU policy should find the second-least-recently-used line among the non-locked lines. However, if only one non-locked line exists in the set, that line must be returned, and if all lines are valid and locked the second-least-recently-used one in the set should be returned.
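Here is a minimal sketch of the NXLRU victim choice (illustrative only: in the simulator this logic belongs inside CacheAssoc::findLine2Replace, and the sketch assumes, as in the existing LRU code, that the set’s lines can be walked from most recently used to least recently used; hits and invalid lines are assumed to have been handled already, exactly as LRU handles them):

#include <vector>

struct Line { bool locked; /* ... tag, state, LRU bookkeeping ... */ };

// 'set' is ordered front (MRU) to back (LRU).
Line *pickNXLRUVictim(std::vector<Line *> &set) {
    Line *lru = nullptr, *nxlru = nullptr;
    for (Line *l : set) {     // walk MRU ... LRU
        if (l->locked) continue;
        nxlru = lru;          // the previous unlocked candidate
        lru = l;              // l is the least recently used unlocked line so far
    }
    if (nxlru) return nxlru;  // second-least-recently-used unlocked line
    if (lru) return lru;      // only one unlocked line in the set: must use it
    // all lines valid and locked: second-least-recently-used overall
    return set.size() >= 2 ? set[set.size() - 2] : set.back();
}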
Note that you have to add the NXLRU policy as an option in the configuration file, i.e. it is not OK to just change the existing LRU (or RANDOM) code to actually follow the NXLRU policy. Changing the behavior of existing policies will change the behavior of all cache-like structures in the processor, including TLBs. We will want to change the replacement policy only in L1 caches and leave behavior of TLBs, L2 caches, etc. unchanged!
Make the changes needed to implement the NXLRU replacement policy and then:
- Run a simulation with a 1kB L1 cache, using the NXLRU policy, and with all other settings at their default values. Submit the simulation report for this as sesc_fmm.mipseb.L1NXLRU. You already ran a simulation with the same L1 cache that uses LRU (in Part 1A). With a 1kB L1 cache, the LRU policy gave us a hit rate of 81.63 percent, while NXLRU gives us 80.74 percent. The number of blocks that are fetched (read) by the L1 cache from the L2 cache changes from 637895 with LRU to 680127 with NXLRU, and the speedup of using LRU instead of NXLRU is 1.0262.
Note: Because report.pl does not provide summary statistics on the L2 cache, you will have to directly examine the report file generated by SESC. This file begins with a copy of the configuration that was used, then reports how many events of each kind were observed in each part of the processor. Events in the DL1 cache of processor zero (the one running the application) are reported in lines that start with “P(0)_DL1:”. Events in the L2 cache are reported separately for each of the four slices. In the report file, the number of blocks requested by the L1 cache from the L2 cache is reported as lineFill (these become entire-block reads from the L2 cache), and the number of write-backs the L1 wants to do to the L2 is reported as writeBack (these become entire-block writes to the L2 cache).
Part 3 [45 points]: Classifying misses in the L1 cache
Now we will change the simulator to identify what kind of L1 cache miss we are having each time there is a miss – compulsory, conflict, or capacity. Recall that a miss is a compulsory miss if it would occur in an infinite-sized cache, i.e. if the block has never been in the cache before. A capacity miss is a non-compulsory miss that would occur even in a fully associative LRU cache that has the same block size and overall capacity, and a conflict miss is a miss that is neither a compulsory nor a capacity miss. The L1 cache in the simulator counts read and write misses in separate counters (which appear in the simulation report as the readMiss and writeMiss numbers for each cache, e.g. there is a “P(0)_DL1:readMiss=something” line in the report file). Now you need to have additional counters, which should appear in the simulation report file as compMiss, capMiss, and confMiss counters (these three values should add up to the readMiss+writeMiss value). Each of the new counters should count both read and write misses of that kind. It is OK to also have counters that count compulsory, capacity, and conflict misses separately for reads and writes, or to do this classification of misses for other caches. But report files that do not have the overall (reads+writes) items for P(0)_DL1:compMiss, P(0)_DL1:capMiss, and P(0)_DL1:confMiss will not be graded. A sketch of the bookkeeping this classification requires is shown below.
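Here is a minimal sketch of the classification logic (illustrative only: in the simulator this would hook into the L1 miss path, with the models updated on every access and the counters bumped only on “normal” misses, per the half-miss note at the end of this part):

#include <cstdint>
#include <list>
#include <unordered_set>

struct MissClassifier {
    std::unordered_set<uint32_t> everSeen;  // block addresses ever brought into the cache
    std::list<uint32_t> faLRU;              // fully-associative LRU model, same capacity
    size_t maxBlocks;                       // cache size / block size
    uint64_t compMiss = 0, capMiss = 0, confMiss = 0;

    explicit MissClassifier(size_t nBlocks) : maxBlocks(nBlocks) {}

    // Call on every L1 data access; 'realMiss' is true only for normal misses.
    void access(uint32_t blockAddr, bool realMiss) {
        bool faHit = touchFullyAssoc(blockAddr);
        if (!realMiss) return;
        if (!everSeen.count(blockAddr)) compMiss++;  // never been cached: compulsory
        else if (!faHit) capMiss++;   // fully-associative LRU also misses: capacity
        else confMiss++;              // only the real cache misses: conflict
        everSeen.insert(blockAddr);
    }

    // Model a fully-associative LRU cache; returns true on hit. (Linear
    // search keeps the sketch short; the real thing should use a faster map.)
    bool touchFullyAssoc(uint32_t blockAddr) {
        for (auto it = faLRU.begin(); it != faLRU.end(); ++it)
            if (*it == blockAddr) {
                faLRU.erase(it);             // hit: move to the MRU position
                faLRU.push_front(blockAddr);
                return true;
            }
        faLRU.push_front(blockAddr);         // miss: insert as MRU
        if (faLRU.size() > maxBlocks)
            faLRU.pop_back();                // evict the LRU block
        return false;
    }
};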
- Attach the CacheCore.h, CacheCore.cpp, SMPCache.h, and SMPCache.cpp files that contain your modifications for both Part 2 (NXLRU) and Part 3 (classification of misses). You should not modify (or add/delete) any other files.
- With your new miss-classification code in the simulator, you should run a simulation with the default configuration (32kB 4-way set-associative LRU L1 cache), with a 1kB 4-way set-associative LRU L1 cache, with a direct-mapped 32kB L1 cache, and with a 32kB 4-way set-associative NXLRU L1 cache. Submit the four simulation reports as sesc_fmm.mipseb.DefLRU, sesc_fmm.mipseb.SmallLRU, sesc_fmm.mipseb.DefDM, and sesc_fmm.mipseb.DefNXLRU.
- Fill in the following table. The first row of fill-in fields is the total number of L1 misses for each configuration, the second is the percentage of all L1 misses for that configuration that are compulsory misses, the third is the percentage of L1 misses that are conflict misses, and the fourth is the percentage of L1 misses that are capacity misses. For example, if a simulation ended up with a total of 1000 L1 misses, and if they include 300 compulsory, 500 conflict, and 200 capacity misses, then the column for that configuration would have the following numbers (from top to bottom): 1000, 30, 50, 20.
| 32kB SA LRU | 1kB SA LRU | 32kB DM | 32kB SA NXLRU |
Total # of misses | 29722 | 637897 | 46840 | 40128 |
% Compulsory | 11.94 [3549] | 0.56 [3549] | 7.58 [3549] | 8.84 [3549] |
% Conflict | 15.31 [4551] | 4.00 [25538] | 52.62 [24645] | 31.03 [12452] |
% Capacity | 72.75 [21622] | 95.44 [608810] | 39.81 [18646] | 60.13 [24127] |
- Now we will consider the 32kB 4-way set-associative LRU cache as the baseline and compute, for each of the other three configurations, the percentage increase of various kinds of misses relative to that baseline. For example, if the 32kB 4-way SA LRU configuration has resulted in 200 compulsory misses and the 1kB 4-way SA LRU configuration has resulted in 190 compulsory misses, then we should put -5 in the compulsory-misses entry in the 1kB SA LRU column: the change is -10 compulsory misses (190-200), and -10 is 5% of the baseline’s 200 misses (-10/200 = -0.05).
% Change in | 1kB SA LRU | 32kB DM | 32kB SA NXLRU |
Total # of Misses | 2046.21 | 57.59 | 35.01 |
# of Compulsory Misses | 0.00 | 0.00 | 0.00 |
# of Conflict Misses | 461.15 | 441.53 | 173.61 |
# of Capacity Misses | 2715.70 | -13.76 | 11.59 |
Some helpful notes for Part 3:
In the simulator, there are two kinds of misses – normal misses and “half-misses”. A half-miss occurs when the processor loads/stores a data item, the corresponding block is not present in the cache, but a cache miss is already in progress for that block. Note that a half-miss would not occur if the cache were handling one access at a time – the block would be brought into the cache as a result of the “normal” miss, and the access that was a half-miss would be a hit. To avoid confusion, in Part 3 of this project you should only be concerned with the “normal” misses, i.e. you should treat half-misses as cache hits.
Another potentially confusing consideration is that it is possible to have an access that is a hit in a set-associative cache but a miss in the fully-associative cache that we are modeling to determine whether a miss is a conflict miss. This can occur, for example, when a cache block B is not accessed for a long time, and the other blocks that map to the same cache set are also not accessed for a long time. In the fully-associative cache, block B would eventually be replaced because we need room for other blocks, but in the set-associative cache the room is needed in other sets, so block B stays in the cache. When B is eventually accessed, we get a hit in the set-associative cache and a miss in the fully-associative cache. For Part 3 of this project, you can simply ignore this problem – your task is only to look at misses that do occur in the actual cache and determine whether those misses would be hits in infinite-size and/or fully-associative caches, so accesses that are hits in the set-associative cache can be ignored.
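Putting these two notes together, the hook into the cache’s access path might look roughly like the sketch below (onDL1Access and the counter variables are invented names, not SESC’s real functions or statistics objects): the FA-LRU model is updated on every access, but the new counters are only incremented for genuine misses, so half-misses and regular hits are never classified.

// Sketch only, reusing the MissClassifier sketch from above.
static long compMiss = 0, capMiss = 0, confMiss = 0;

void onDL1Access(MissClassifier &mc, uint64_t blockAddr, bool realMiss) {
  MissClassifier::Kind k = mc.access(blockAddr);  // keep the FA-LRU model in sync
  if (!realMiss)
    return;                                       // hits and half-misses: count nothing
  switch (k) {
    case MissClassifier::Compulsory: compMiss++; break;
    case MissClassifier::Capacity:   capMiss++;  break;
    case MissClassifier::Conflict:   confMiss++; break;
  }
}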
CS 6290: High-Performance Computer Architecture Project 3
This project is intended to help you understand cache coherence and performance of multi-core processors. As with previous projects, for this project you will need VirtualBox and our project virtual machine. Just like in previous projects, you will put your answers in the reddish boxes in this Word document, and then submit it in Canvas (but this time the submitted file name should be PRJ3.docx).
In each answer box, you must first provide your answer to the actual question (e.g. a number). You can then use square brackets to provide any explanations that the question is not asking for but that you feel might help us grade your answer. E.g. answer 9.7102 may be entered as 9.7102 [Because 9.71+0.0002 is 9.7102]. For questions that are asking “why” and/or “explain”, the correct answer is one that concisely states the cause for what the question is describing, and also states what evidence you have for that. Guesswork, even when entirely correct, will only yield 50% of the points on such questions.
Additional files to upload are specified in each part of this document. Do not archive (zip, rar, or anything else) the files when you submit them, except when we explicitly ask you to submit a zip file (in Part 3H). Each file we are asking for should be uploaded separately, and its name should be as specified in this assignment. You will lose up to 20 points for not following the file submission and naming guidelines. Furthermore, if it is not VERY clear which submitted file matches which requested file, we will treat the submission as missing that file. The same is true if you submit multiple files that appear to match the same requested file (e.g. several files with the same name). In short, if there is any ambiguity about which submitted file(s) should be used for grading, the grading will be done as if those ambiguous files were not submitted at all.
Most numerical answers should have at least two decimals of precision. Speedups should be computed to at least 4 decimals of precision, using the number of cycles, not the IPC (the IPC reported by report.pl is rounded to only two decimals). You lose points if you round to fewer decimals than required, or if you truncate digits instead of correctly rounding (e.g. a speedup of 3.141592 rounded to four decimals is 3.1416, not 3.1415).
This project can be done either individually or in groups of two students. If doing this project as a two-student group, you can do the simulations and programming work together, but each student is responsible for his/her own project report, and each student will be graded based solely on what that student submits. Finally, no collaboration with other students or anyone else is allowed. If you do have a partner you have to provide his/her name here None (enter None here if no partner) and his/her Canvas username here . Note that this means that you cannot change partners once you begin working on the project, i.e. if you do any work with a partner you cannot “drop” your partner and submit the project as your own (or start working with someone else), because the collaboration you already had with your (original) partner then becomes unauthorized collaboration.
Part 1 [40 points]: Running a parallel application
In this part of Project 3 we will be using the LU benchmark. We will also be using a processor with more (sixteen) cores (cmp16-noc.conf). So, for example, to simulate 2-threaded execution you would use a command like this (note the absence of spaces between -n and 256, and between -p and 2):
~/sesc/sesc.opt -fAp2 -c ~/sesc/confs/cmp16-noc.conf -olu.out -elu.err lu.mipseb -n256 -p2
To complete this part of the project, run the lu application with 1, 2, and 4 threads. Then fill in the blanks, taking into account all the runs you were asked to do:
- Submit the three simulation reports: sesc_lu.mipseb.Ap1, sesc_lu.mipseb.Ap2, and sesc_lu.mipseb.Ap4. You will not earn points for submitting these simulation reports, but you will lose 10 points for each missing simulation report.
- Fill out the execution time, parallel speedup, and parallel efficiency with 2 and 4 threads. Enter Sim Time with precision of at least three decimals, and speedup and efficiency with precision of at least two decimals.
| SimTime (in ms) | Parallel Speedup | Parallel Efficiency
-p1 | 56.509 | 1.0000 | 1.0000
-p2 | 35.283 | 1.6016 | 0.8008
-p4 | 24.125 | 2.3423 | 0.5856
Note: Parallel speedup is the speedup of parallel execution over the single-thread execution with the same input size. Parallel efficiency is the parallel speedup divided by the number of threads used – ideally, the speedup would be equal to the number of threads used, so the efficiency would be 1. When computing the speedup and efficiency we cannot use IPC or Cycles that are reported for each processor, because these do not account for the cycles where that core was idle (e.g. because the thread was waiting for something to happen). So we need to use the “Sim Time” we get from report.pl because it accounts for all cycles that elapse between the start and completion of the entire benchmark.
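As a quick sanity check of this arithmetic, here is a stand-alone C++ snippet (not part of the simulator) that recomputes the speedup and efficiency columns from the Sim Times in the table above:

#include <cstdio>

int main() {
  const double t1 = 56.509;                    // -p1 Sim Time, in ms
  const double tp[] = {35.283, 24.125};        // -p2 and -p4 Sim Times, in ms
  const int threads[] = {2, 4};
  for (int i = 0; i < 2; ++i) {
    double speedup = t1 / tp[i];               // parallel speedup = T1 / Tp
    double efficiency = speedup / threads[i];  // efficiency = speedup / #threads
    std::printf("-p%d: speedup=%.4f, efficiency=%.4f\n", threads[i], speedup, efficiency);
  }
  return 0;
}

This prints 1.6016 and 0.8008 for -p2, and 2.3423 and 0.5856 for -p4, matching the table.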
- Our results indicate that parallel efficiency changes as we use more cores. Why do you think this is happening?
With more cores, the threads can run on different cores in parallel, so the overall completion time is reduced. Ideally, the parallel efficiency would stay at 1. However, once different cores read and write the same data, they must maintain coherence, which produces coherence misses and extra cycles spent on data sharing and cache-block transfers. With these overheads, the parallel efficiency can no longer be 1. Since LU factorization is a matrix computation with high spatial locality, multiple cores end up reading and writing the same blocks, so there is more sharing overhead and more memory traffic (considering MSI). Although the total execution time still improves, the per-thread efficiency drops.
- When we use two threads (-p2) instead of one (-p1), the IPC achieved by Core 0 (the first processor listed) got slightly lower because
IPC measures the efficiency of a single core. With two threads, part of the work runs on another core, and Core 0 must spend cycles maintaining coherence and waiting for shared data from that core. Those stall cycles lower Core 0’s efficiency, so its IPC is slightly lower than in the single-threaded run.
- Now look at the simulation reports for these simulations. Core 0 executes more than its fair share of all instructions because
Core 0 is the main core: it creates the other threads, assigns them their work, performs the system calls needed to create new threads, and loads the configuration and initializes the program, all of which leads to more instructions being executed on Core 0.
Part 2 [10 points]: Cache miss behavior
In this part of Project 3, we will be focusing on the number of read misses in the DL1 (Data L1) cache of Core 0, using the same simulations that we already did for Part 1. In the report file generated by the simulator (sesc_lu.mipseb.something, not what you get from report.pl), the number of cache read misses that occur in each DL1 cache (one per processor core) is reported in lines that begin with “P(0)_DL1:readMiss=”.
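For example, assuming the -p1 report was saved under the name requested in Part 1, a command like this will show the relevant lines for all cores:
grep "DL1:readMiss=" sesc_lu.mipseb.Ap1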
- The total number of read misses that occur in the DL1 cache of Core 0 is
Simulation | -p1 | -p2 | -p4
Core 0’s DL1 read misses | 95393 | 46459 | 28610
Your answers here should be integer numbers.
- The number of these misses changes this way as we go from one to two to four threads because
The work is divided among the threads, and these threads now run on different cores instead of all on Core 0. The decrease in Core 0’s readMiss count is therefore roughly proportional to the decrease in the amount of work (and data) handled by Core 0, so readMiss drops as the thread count grows. However, the rate of this decrease itself shrinks: the p1/p2 ratio is 2.05, while the p2/p4 ratio is only 1.62. This is due to coherence misses caused by data sharing between cores – with more cores involved, the number of coherence misses is expected to increase.
Part 3 [50 points]: Identifying accesses to shared data
Your task in this part of the project is to determine how many read misses in each core’s DL1 cache are compulsory misses (readCompMiss), replacement misses (capacity or conflict; the counter should be called readReplMiss), and coherence misses (readCoheMiss), and to separately classify write misses the same way (writeCompMiss, writeReplMiss, and writeCoheMiss). Note that this classification is similar to the one you did for Project 2, except that we are now counting the different categories of misses separately for reads (load instructions) and writes (store instructions), that we are placing conflict and capacity misses in the same (replacement) category, and that we are adding a category for coherence misses that we did not have in Project 2. To simplify classification, we will not follow the exact definition of coherence misses (“those misses that would have been hits were it not for coherence actions from other cores”). Instead, we will use a definition that allows a much simpler implementation: a coherence miss is a miss that finds in the cache a line whose tag matches the block it wants, but whose coherence state prevents such access. In the case of read misses, this means that the line was found in an “Invalid” coherence state. Note that this identification of coherence misses may not be trivial in the SESC simulator because of the way it handles tags during invalidation. If a miss is not a coherence miss, then you can classify it as either a compulsory or a replacement miss by checking whether the block was ever in that cache. When checking whether a miss is a compulsory miss, be careful to track the “was previously in this cache” set of blocks for each cache separately.
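One possible shape for this check is sketched below (classifyMiss, its parameters, and MissKind are invented names – this is not SESC’s actual line/state interface): test the simplified coherence-miss condition first, and only then consult this cache’s “ever seen” set.

#include <cstdint>
#include <unordered_set>

enum MissKind { CoheMiss, CompMiss, ReplMiss };

// everSeen must be a separate set for each core's DL1 cache.
MissKind classifyMiss(std::unordered_set<uint64_t> &everSeen,
                      uint64_t blockAddr,
                      bool tagMatchesButStateForbids) {
  // Coherence miss per the simplified definition: the tag is present in the
  // cache, but its coherence state (e.g. Invalid for a read) forbids the access.
  if (tagMatchesButStateForbids) {
    everSeen.insert(blockAddr);  // the block has clearly been here before
    return CoheMiss;
  }
  // Otherwise: compulsory if this cache never held the block, else replacement.
  bool firstTime = everSeen.insert(blockAddr).second;
  return firstTime ? CompMiss : ReplMiss;
}

The caller would then increment readCompMiss/readReplMiss/readCoheMiss or the corresponding write counters, depending on whether the missing access is a load or a store.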
- Create a Changed.zip file with any simulator source code files that you have modified in Part 3 of the project, and submit this zip file together with your project. You will not earn points for submitting this file, but you will lose 50 points if it is missing or if it does not contain all the source code modifications. Then, with your changed simulator (which now counts compulsory, replacement, and coherence misses separately for reads and writes), re-run the simulations from Part 1 and submit the resulting simulation report files as sesc_lu.mipseb.Hp1, sesc_lu.mipseb.Hp2, and sesc_lu.mipseb.Hp4. As in Part 1, you will not earn points for these submitted simulation reports, but you will lose 10 points for each simulation report that is missing.
- The number of all read misses, compulsory read misses, replacement read misses, and coherence read misses for the DL1 cache of Core 0 is:
| Core 0’s DL1 readMiss | Core 0’s DL1 compMiss | Core 0’s DL1 replMiss | Core 0’s DL1 coheMiss
-p1 | 95393 | 122 | 95271 | 0
-p2 | 46459 | 125 | 46237 | 97
-p4 | 28610 | 126 | 28359 | 125
Note: readMiss numbers here should be the same as those you had in Part 2.
- The number of all write misses, compulsory write misses, replacement write misses, and coherence write misses for the DL1 cache of Core 0 is:
| Core 0’s DL1 writeMiss | Core 0’s DL1 compMiss | Core 0’s DL1 replMiss | Core 0’s DL1 coheMiss
-p1 | 8357 | 8325 | 32 | 0
-p2 | 8517 | 8338 | 62 | 117
-p4 | 8533 | 8362 | 47 | 124