CS3350B Computer Organization Assignment 1 to 4 solutions

$90.00

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (1 vote)

CS3350B Computer Organization Assignment 1

Exercise 1. [15 Marks] Consider the following two functions: factRec and factIter,
which compute the factorial of a number.

Discuss the instruction locality of each the function. Is there good locality or bad locality?
Is there spatial locality or temporal locality? Explain why. For ease of discussion, you may
consider a particular execution of the function, say, n = 5.
int factRec(int n) {
if (n < 1) {
return 1;
}
int n1 = factRec(n-1);
return n1 * n;
}
int factIter(int n) {
int fact = 1;
for (int i = 1; i <= n; ++i) {
fact = fact * i;
}
return fact;
}
2

Exercise 2. [15 Marks] Consider a 16-bit computer with a simplified memory hierarchy.
This hierarchy contains a single cache and an unbounded backing memory. The cache is 2-
way set associative, 4-byte cache lines, and a capacity of 32 bytes. Consider also the following
sequence of memory word addresses.
8, 3, 9, 7, 15, 20, 22, 2, 6, 0

(a) Determine, in binary notation, the set index and block offset for each address in the above
sequence. Include the byte offset as part of the block offset. Assume the cache is initially
empty.

During the sequence of address accesses above, determine if each reference results in a cache
hit or a cache miss. If the reference results in a cache miss, which type of cache miss occurs
(cold, conflict, or capacity). Use the below table to help answer this question.

Address Index Block Offset Hit or Miss Type of Miss
8
3
9
7
15
20
22
2
6
0
(b) Create a table which resembles this cache’s configuration. Fill that table such that
it corresponds to the cache’s contents after all addresses in the above sequence have been
referenced. (See “3350-L4-CacheExample.pdf”).
3

Exercise 3. [30 Marks] In this exercise, we will examine cache capacity and its effect on
performance. For simplicity, let assume consider only data cache. That is, instructions are
not stored in the caches. Recall cache access time is related to its capacity. Consider that
accessing main memory requires 100ns and that, in a particular program, 42% of instructions
incur a data access.

For two different processors executing this program, we have two different L1 caches, attached
to processors P1 and P2, respectively.
L1 size L1 Miss Rate L1 Hit Time
P1 16 KB 3.6% 1.26 ns
P2 32 KB 3.1% 2.17ns

(a) What is the AMAT for P1 and P2 assuming no other levels of cache?
(b) Assuming an ideal CPI of 2.0 for both processors, and where the L1 hit time determines
the cycle time, what is the CPI𝑠𝑡𝑎𝑙𝑙 for P1 and P2? Which processor is faster at executing
this particular program?

Now consider the addition of an L2 cache to P1 with the following characteristics. The data
from the previous table still holds.
L2 size L2 Miss Rate L2 Hit Time
8 MB 48% 26.24 ns

(c) What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or
worse with the L2 cache?
(d) Assuming an ideal CPI of 2.0 and where the L1 hit time determines the cycle time, what
is the CPI𝑠𝑡𝑎𝑙𝑙 for P1 with the addition of an L2 cache?

(e) Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate
would P1 need in its L1 cache to match P2’s performance? If P2 is faster, what miss rate
would P2 need in its L1 cache to match P1’s performance?
4

Exercise 4. [20 Marks] Consider a 64-bit computer with a simplified memory hierarchy.
This hierarchy contains a single cache and an unbounded backing memory. The cache has
the following characteristics:
• Direct-Mapped, Write-through, Write allocate.
• Cache blocks are 4 words each.
• The cache has 256 sets.

(a) Consider the following code fragment in the C programming language to be run on the
described computer. Assume that: program instructions are not stored in cache, arrays are
cache-aligned (the beginning of the array aligns with the beginning of a cache line), ints are
32 bits, and all other variables are stored only in registers.
int N = 32768;
int A[N];
for (int i = 0; i < N; i += 2) {
A[i] = A[i+1];
}

Determine the following:
(i) The number of cache misses.
(ii) The cache miss rate.
(iii) The type of cache misses which occur.

(b) Consider the following code fragment in the C programming language to be run on the
described computer. Assume that: program instructions are not stored in cache, arrays are
cache-aligned (the beginning of the array aligns with the beginning of a cache line), ints are
32 bits, and all other variables are stored only in registers.

int N = 32768;
int A[N];
int B[N];
for (int i = 0; i < N; ++i) {
B[i] = A[i];
}
Determine the following:
(i) The number of cache misses.
(ii) The cache miss rate.
(iii) The type of cache misses which occur.
5

Exercise 5. [20 Marks] The Intel Core i7-8750H processor (more details here) has the
following characteristics, taken from /proc/cpuinfo:
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
stepping : 10
microcode : 0xea
cpu MHz : 2200.000
L1 cache size : 32 KB
L2 cache size : 256 KB
L3 cache size : 9216 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
{…}
clflush size : 64
cache line size : 64

Consider the following two functions Normalize1 and Normalize2 which take in a positive
𝑁x𝑁 matrix of doubles and normalizes its entries to the range (︀0, 1⌋︀. A program implementing these functions is available on OWL as Normalize.c.

After those two code segmenets, data is presented which was collected using the perf utility.
This data shows runtime performance metrics of these functions executing on the Intel Core
i7-8750H processor for various data sizes. In this data:
• “Normalize.bin 1 …” executes the function Normalize1;
• “Normalize.bin 2 …” executes the function Normalize2;

• the second command-line argument is the size 𝑁 of the matrix;
• LLC-loads means “Last Level Cache loads”, the number of accesses to L3;
• LLC-load-misses means “Last Level Cache misses”, the number of L3 cache misses;
• cpu-cycles is the number of CPU cycles elapsed during programing execution.

Using the knowledge learned so far in this course, the specification of the i7-8750H processor,
the code fragments, and the perf data, answer the following questions.
6

(a) Why is the runtime execution of Normalize1 faster than Normalize2? The values of
which performance metrics from the perf data support your claims?

(b) Consider the miss rates of Normalize1. The miss rate drastically increases for values of
𝑁 larger than 512. Explain why the increase occurs at this particular value of 𝑁. Give a reason why this increase is not a sharp “jump” but rather has an intermediate effect at 𝑁 = 1024.

(c) Consider the miss rates of Normalize2. The miss rate starts quite low but quickly increases for increasing values of 𝑁. Disucss why the miss rates for 𝑁 = 256 and 𝑁 = 512 are
misleading for describing the actual data locality of Normalize2. Which additional performance metrics would you record to get a more precise understanding of the data locality of
Normalize2? Assume perf is capable of reporting any possible hardware event.

void Normalize1(double* A, int N) {
double t, min = (double) RAND_MAX, max = 0.0;
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
t = A[i*N + j];
if (t < min) min = t;
if (t > max) max = t;
}
}
double frac = 1 / (max – min);
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
A[i*N + j] = (A[i*N + j] – min) * frac;
}
}
}
7
void Normalize2(double* A, int N) {
double t, min = (double) RAND_MAX, max = 0.0;
for (int j = 0; j < N; ++j) {
for (int i = 0; i < N; ++i) {
t = A[i*N + j];
if (t < min) min = t;
if (t > max) max = t;
}
}
double frac = max – min;
for (int j = 0; j < N; ++j) {
for (int i = 0; i < N; ++i) {
A[i*N + j] = (A[i*N + j] – min) / frac;
}
}
}
8

Runtime performance of Normalize1.
Performance counter stats for ’./Normalize.bin 1 256’:
4,802,912 cpu-cycles
4,538 LLC-loads
425 LLC-load-misses # 9.37% of all LL-cache accesses
0.002047926 seconds time elapsed
Performance counter stats for ’./Normalize.bin 1 512’:
16,577,772 cpu-cycles
5,409 LLC-loads
668 LLC-load-misses # 12.35% of all LL-cache accesses
0.004451773 seconds time elapsed
Performance counter stats for ’./Normalize.bin 1 1024’:
64,124,090 cpu-cycles
14,016 LLC-loads
6,100 LLC-load-misses # 43.52% of all LL-cache accesses
0.016784636 seconds time elapsed
Performance counter stats for ’./Normalize.bin 1 1536’:
143,240,802 cpu-cycles
24,864 LLC-loads
14,732 LLC-load-misses # 59.25% of all LL-cache accesses
0.038767478 seconds time elapsed
Performance counter stats for ’./Normalize.bin 1 2048’:
254,735,648 cpu-cycles
40,122 LLC-loads
24,405 LLC-load-misses # 60.83% of all LL-cache accesses
0.065211502 seconds time elapsed
Performance counter stats for ’./Normalize.bin 1 2560’:
399,380,980 cpu-cycles
73,006 LLC-loads
50,260 LLC-load-misses # 68.84% of all LL-cache accesses
0.102457893 seconds time elapsed
9

Runtime performance of Normalize2.
Performance counter stats for ’./Normalize.bin 2 256’:
5,870,023 cpu-cycles
117,911 LLC-loads
287 LLC-load-misses # 0.24% of all LL-cache accesses
0.001692968 seconds time elapsed
Performance counter stats for ’./Normalize.bin 2 512’:
22,547,539 cpu-cycles
532,221 LLC-loads
7,504 LLC-load-misses # 1.41% of all LL-cache accesses
0.007333935 seconds time elapsed
Performance counter stats for ’./Normalize.bin 2 1024’:
121,908,975 cpu-cycles
2,047,279 LLC-loads

372,388 LLC-load-misses # 18.19% of all LL-cache accesses
0.031579656 seconds time elapsed
Performance counter stats for ’./Normalize.bin 2 1536’:
292,392,531 cpu-cycles
4,528,106 LLC-loads

1,384,196 LLC-load-misses # 30.57% of all LL-cache accesses
0.074422344 seconds time elapsed
Performance counter stats for ’./Normalize.bin 2 2048’:
694,112,174 cpu-cycles
8,338,987 LLC-loads
6,798,261 LLC-load-misses # 81.52% of all LL-cache accesses
0.173902216 seconds time elapsed
Performance counter stats for ’./Normalize.bin 2 2560’:
1,025,207,852 cpu-cycles
12,349,636 LLC-loads
10,425,562 LLC-load-misses # 84.42% of all LL-cache accesses
0.258983051 seconds time elapsed
10

CS3350B Computer Organization Assignment 2

Useful Facts:

Logism is a good tool for drawing circuits.
draw.io is a good tool for drawing any kind of diagram, including circuits.
You may also consider using OneNote, Photoshop, etc. to draw circuits
1 𝐺𝐻𝑧 = 1 × 109 𝐻𝑧
1 𝑏𝑦𝑡𝑒 = 8 𝑏𝑖𝑡𝑠
1 𝐾𝑏𝑦𝑡𝑒 (𝐾𝐵) = 1024 𝑏𝑦𝑡𝑒𝑠
Recall that 𝑋𝑌 ≡ 𝑋 𝑌 ≡ 𝑋 ⋅ 𝑌

Exercise 1. [6 marks] Re-write the following Boolean expression using only NAND operations. Show your step-by-step process. You do not need to simplify.
(𝑝 + 𝑞) ⋅ 𝑟

Exercise 2. [10 Marks] Prove that NOR is functionally complete. You should give formulas as well as truth tables to support those formulas. You can use the symbol ↓ for NOR.

Exercise 3. [10 marks] Draw a 4-bit Serial In, Serial Out register using SR flip-flops.
For example, the below diagram represents a Parallel In, Parallel Out 𝑛-bit register using D
flip-flops.
clock
2

Exercise 4. [20 marks] In this exercise you are to construct an ALU with the following
specification:
• The ALU takes 3 different 32-bit inputs: A, B, C; and has one 32-bit output: D.
• The ALU supports 8 different operations:
– A + B (addition)
– A – B (subtraction)
– MAX(A,B,C)
– MIN(A,B,C)
– A & B (bitwise AND)
– B & C (bitwise AND)
– A | B (bitwise AOR)
– B | C (bitwise OR)

To construct this ALU you will follow a modular design. Assume you have an unlimited
number of the following gates and combinational circuits at your disposal, but no others:
• 2-arity AND gate,
• 2-arity OR gate,
• NOT gate,
• 2-way MUX,

• bit-wise AND circuit: it takes two 32-bit inputs and has a 32-bit output which is the
bit-wise AND of its inputs.
• bit-wise OR circuit: it takes two 32-bit inputs and has a 32-bit output which is the
bit-wise OR of its inputs.

• 32-bit ADDER: it takes two 32-bit inputs, a 1-bit control signal, and has a 32-bit
output. If the control signal is 0, the circuit outputs the sum of its inputs. If the
control signal is 1, the circuit outputs the difference of its inputs. The inputs and
outputs are assumed to be 32-bit two’s complement numbers.

• COMAPRATOR: it takes two 32-bit inputs, a 1-bit control signal, and has a 32-bit
output. If the control signal is 0, it outputs the minimum of its two inputs. If the
control signal is 1, it outputs the maximum of its two inputs. The inputs and outputs
are assumed to be 32-bit two’s complement numbers.

(a) Give the specification of the ALU by determining the number of control signals needed
and then:
(𝑖) present a table specifying the control signal values and the operation the ALU performs
given those control signal values;
(𝑖𝑖) draw a schematic diagram (Lec. 7, pg. 5) for your ALU; include bit-widths and labels.

(b) Using a modular design and the previously stated circuits and gates, draw a circuit which
fulfills the specification of your ALU given in part (a). (Hint: you can have multiple input
wires with the same label; you can have control signal “stubs”; see Lecture 7, page 25.)
3

Exercise 5. [10 + 8 + 8 + 8 = 34 marks] In this exercise we will explore designing
an integrated circuit for use within a microwave. Not the whole thing (that’s too complex),
but just a small piece of it.

Consider that the microwave already has a circuit which controls the timer and the magnetron
(the thing that produces the microwaves). Your goal is to design a simple circuit which
controls the turntable. The turntable is powered by a gear box which takes two input bits:
• 𝑅. If 𝑅 ≡ 1, the gear box rotates. If 𝑅 ≡ 0 the gear box does not rotate.
• 𝐶. If 𝐶 ≡ 1 and 𝑅 ≡ 1, the gearbox turns clockwise. If 𝐶 ≡ 0 and 𝑅 ≡ 1, the gearbox
turns counterclockwise.

The manufacturer of this microwave requests a synchronous circuit which will control the
gearbox to alternate between clockwise and counterclockwise rotation each time
the microwave is turned on. The circuit to be designed will receive a single input bit 𝐼
from the circuit controlling the magnetron.

• When 𝐼 ≡ 1, the turntable should rotate.
• When 𝐼 ≡ 0, the turntable should not rotate.
To guide your design process, complete parts (a) to (d) below.

(a) Draw a Finite State Machine which describes this state circuit. Since this circuit has
4 possible combinations of outputs, it should have 4 states. Be sure to include the inputs
and outputs in the FSM. Order the outputs as 𝑅𝐶. Hence, a transition in the FSM will be
labelled as 𝐼⇑𝑅𝐶. The change in direction should be triggered by the microwave turning on.

(b) Create a truth table to describe the inputs, outputs, and transitions of your FSM. That
is, create a truth table describing the combinational logic of your eventual state circuit. Since
there are 4 states, you must have 2 bits of “input” to encode the current state, and 2 bits of
“output” to encode the next state. (Hint: your table should have 8 rows and 7 columns.)

(c) Give the DNF for each output of the combinational part of your state circuit (there are
4 of them; recall that the “next state” is part of the combinational logic output). Then,
simplify each DNF into a reduced sum of products. For this particular question, you do not
need to specify the Boolean law being applied at each step of the simplification.

(d) Draw a circuit which implements the combinational logic of your turntable controller
circuit. That is, draw a circuit implementing your 4 simplified formulas from part (c). Use
gates with arity at most 2. For clarity (and to avoid crossing too many lines), you may draw
4 independent circuits, one for each output bit.
4

CS3350B Computer Organization Assignment 3

Useful Things:

A MIPS simulator, spim: http://pages.cs.wisc.edu/~larus/spim.html
List of MIPS instructions: https://inst.eecs.berkeley.edu/ cs61c/resources/MIPS_help.html
Example MIPS code which runs on spim is provided in OWL under resources.
Labels can be used in assembly in replace of calculating exact values for branch and jump
instructions. The following is an example.
i n t i sNeg ( i n t a0 ) {
i f ( a0 < 0 ) {
r e t u r n 1 ;
} e l s e {
r e t u r n 0 ;
}
}
i sNeg :
s l t $ t0 $a0 $0
beq $ t0 $0 i sP o s
addi $v0 $0 1
j r $ ra
i sP o s :
add $v0 $0 $0
j r $ ra

When translating code to and from assembly, one should usually consider local variables as
being stored in registers and memory accesses as being done through pointers or arrays.
i n t loadEx ( i n t ∗ a0 ) {
i n t t0 = a0 [ 0 ] ;
r e t u r n t0 ;
}
loadEx :
lw $ t0 0 ( $a0 )
add $v0 $ t0 $0
j r $ ra

Exercise 1. [16 marks] For each of the following MIPS assembly instructions, give the
instruction format, the decimal value of each bit field for that instruction format, and the
32-bit instruction word encoding that instruction written as a single hexadecimal number.
• add $v1, $a2, $t8
• lb $s2, 37($t7)
• bne $v0, $t4, 732
• sra $s4, $s6, 11

Exercise 2. [10 marks]
(a) Consider the following MIPS assembly code fragment. What is the correct value of L2
to be used in the beq instruction? Assume the lw instruction labelled by L1 is located at
address 0x89aa1b. Show your workings.
L1 :
lw $t0 , 0 ( $a0 )
lw $t1 , 4 ( $a0 )
lw $t2 , 0 ( $a1 )
lw $t3 , 4 ( $a1 )
s l t $t4 , $t0 , $ t2
s l t $t5 , $t1 , $ t3
and $t4 , $t4 , $ t5
beq $t4 , $0 , L2
sub $t3 , $t3 , $ t2
j L3
L2 :
sub $t3 , $t2 , $ t3
L3 :
sw $t3 , 0 ( $a0 )
j r $ ra

(b) Assume that, for a particular clock cycle in a single-cycle MIPS datapath, the program
counter is 0x1258AB91 and the fetched instruction word is 0x089F01A7. What is the value of
the program counter on the next clock cycle? Write you answer in hexadecimal. Show your
workings.

2
Exercise 3. [20 marks] Consider the following MIPS assembly code fragment. The sort
function implements bubble sort using a helper function swap.

Assume there are no syntax errors. But, there are 10 bugs in this code with respect to the
semantics of MIPS and runtime bugs. That is, there are exactly 10 lines of code that either
need to be modified or added to get this code working correctly.

• If an instruction needs to be modified, state the instruction as is and then state what
it should be modified to be.
• If an instruction needs to be added, state before which instruction it should be added
and the instruction to add.

3
1 # swap k and k +1 of an array where a0 is array addreess and a1 is k
2 swap :
3 add $t0 , $v1 , $zero # t0 = k
4 add $t0 , $v0 , $t0 # t0 = v [ k ]
5 lw $t1 , 0( $t0 ) # load v [ k ]
6 lw $s1 , 4( $t0 ) # load v [ k +1]
7 sw $s1 , 0( $t0 ) # v [ k ] = v [ k +1]
8 sw $t1 , 4( $t0 ) # v [ k +1] = v [ k ]
9 jr $ra
10

11 # Bubble sort an array of ints with address a0 and length a1
12 sort :
13 addi $sp , $sp , 16 # make room for ints on stack
14 sw $s3 , 12( $sp )
15 sw $s2 , 8( $sp )
16 sw $s1 , 4( $sp )
17 sw $s0 , 0( $sp )
18 add $s2 , $a0 , $zero # s2 = a0 ( address of v array )
19 add $s3 , $a1 , $zero # s3 = a1 ( length of v array )
20 add $s0 , $zero , $zero # s0 = i = 0

21 for1tst :
22 slt $t0 , $s0 , $s3 # t0 = ( s0 < s3 )
23 beq $t0 , $zero , exit1
24 addi $s1 , $s0 , -1 # s1 = j = i-1
25 for2tst :
26 slti $t0 , $s1 , 0 # t0 = ( s1 < 0)
27 bne $t0 , $zero , exit2 # go to exit2 if ( s1 < 0)
28 sll $t1 , $s1 , 2 # t1 = j * 4
29 add $t2 , $s2 , $t1 # t2 = v + ( j *4)
30 lw $t3 , 0( $t2 ) # t3 = v [ j ]
31 lw $t4 , 4( $t2 ) # t4 = v [ j +1]
32 slt $t0 , $t4 , $t3 # t3 = v [ j +1] < v [ j ]
33 beq $t0 , $zero , exit2 # go to exit2 if v [ j +1] >= v [ j ]

34 add $v0 , $s2 , $zero # set args for swap proc
35 add $v1 , $s1 , $zero
36 j swap # call swap
37 addi $s1 , $s1 , -1 # decrement j
38 j for2tst # iterate inner loop
39 exit2 :
40 addi $s0 , $s0 , 1 # increment i
41 j for1tst # iterate outer loop
42 exit1 :
43 lw $s0 , 0( $sp )
44 lw $s1 , 4( $sp )
45 lw $s2 , 8( $sp )
46 lw $s3 , 12( $sp )
47 addi $sp , $sp , -16
48 jr $ra
4

Exercise 4. [14 marks]
Consider the above circuit, and assume the following timings for its constituent pieces.
• Transfer along wires is instantaneous.
• NOT operations are instantaneous (the small circle before the top-right MUX).
• Propagation delay of AND gate: 20ns
• Propagation delay of XOR gate: 45ns
• Propagation delay of MUX: 120ns

(a) Assume the values of A1, A2, A3, B1, B2, B3 all change to new values at time 𝑡 = 0.
For each of C1, C2, C3, C4, C5, C6, determine the time at which its new value becomes
available. That is, determine each output’s critical path.

(b) What is the propagation delay of this circuit as a whole?
5

Exercise 5. [20 Marks] Consider the single-cycle MIPS datapath with control signals
as presented in Figure 1. We want to add a new instruction to the MIPS instruction set
architecture: foo. Its specification is as follows:
MIPS assembly RTL
foo $rt $rs IMM Mem[Reg[$rs]] ← Reg[$rt]
Reg[$rt] ← Mem[Reg[$rs] + Reg[$rt]]
Reg[$rs] ← Reg[$rs] + IMM
PC ← PC + 4

*Note that this is the normal “PC + 4” as for any non-branch, non-jump instruction.

Your task is to modify the MIPS datapath so that it can fulfill this new instruction foo. To
do that, you should:
(i) add new wires, ports, circuitry, MUX, control signals, etc. to the datapath so that it
can execute the new instruction foo (see, e.g., Slides 16-23 in L11-CPUControl);

(ii) ensure that any newly added circuitry and control signals do not hinder the execution
of any existing operations in the MIPS ISA (i.e. your modified datapath should still
be able to successfully execute all the preexisting instructions in the MIPS ISA).

Assume that IMM is a signed integer. Assume that memory is fast enough to read and write
within one clock cycle, and that the read from memory occurs before the write to memory.

To structure your solution to this exercise, perform the following:
(a) Modify (using Photoshop, Gimp, OneNote, etc.) the MIPS datapath (supplied as
MIPSDatapath.png) with new circuits and control signals so that it can perform the foo
instruction. That is, simply draw your new circuits and wires on top of the original image.
Add your modifications in a colour other than black so they are clearly distinguishable from
the original data path. Ensure you include bit-widths and labels.

(b) Give a brief English description of the modifications you have made. Explain how data
flows through the modified datapath when the foo instruction is being executed. Explain how
data flows through the modified datapath when any instruction besides foo is being executed.

(c) Give the values of the control signals required to execute your instruction. That is, when
executing the foo instruction, give the values of the 8 preexisting control signals as well as
the values of any new control signals you have added.

(d) If you have added any new control signals, give the values of these control signals when
any instruction besides foo is being executed.
6
Figure 1: MIPS datapath with control signals
7

Exercise 6. Consider the following 8-stage datapath and the timings for each stage.
Stage IM1 IM2 DEC REG SHFT ALU DM1 DM2
Time 115ps 225ps 175ps 200ps 150ps 425ps 200ps 225ps
(a) [2 marks] Using single-cycle clocking for this datapath what is the minimum clock cycle
possible for this datapath?

(b) [2 marks] If this datapath were to be pipelined, what is the minimum clock cycle which
would be possible?
(c) [6 marks] Assume this datapath was pipelined. What is the ideal speedup? Assuming
no data dependencies or pipeline hazards, what is the actual speedup obtained when executing 300 instructions? Round your answers to 3 decimal places. As always, show your
workings.

(d) [10 marks] Let us assume this datapath is not pipelined but still follows a multi-cycle
clocking method. Let us further assume that the datapath has an additional optimization in
which instructions skip stages that are unused for that instruction. Consider the following
set of instructions to be executed on this datapath along with the stages which are used by
that instruction.

add: IM1, DEC, REG, ALU, DM2
sll: IM1, REG, SHFT, DM2
sw: IM2, DEC, REG, ALU, DM1
lw: IM1, IM2, DEC, REG, ALU, DM1, DM2
beq: IM1, IM2, REG
𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛 𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛 𝐶𝑜𝑢𝑛𝑡
add 1500
sll 250
sw 500
lw 400
beq 350
Table 1: Program Instructions

Consider also a program composed of instructions as detailed in Table 1. If this program
was to be executed on this datapath, determine: (𝑖) the ideal CPI of this program, and (𝑖𝑖)
the total execution time of this program. Assume that there are no dependencies or hazards
between instructions. Assume there are no memory stalls.
8

CS3350B Computer Organization Assignment 4

Useful Facts:

Figure 1: MIPS Datapath with Interstage Registers

Exercise 1. [20 Marks] Consider the multi-cycle MIPS datapath presented in Figure 1,
it shows 4 inter-stage registers: IF/ID, ID/EX, EX/MEM, and MEM/WB. Consider also
the control signals presented in the diagram in blue. Assume the ALUOp control signal is
3 bits. Ignore control signals not shown (i.e. the ones controlling forwarding). Determine
the minimum size, in bits, of each of the four inter-stage registers. For part marks, ensure to
include workings and not just the total values.

Exercise 2. [15 Marks] Consider a pipelined processor with 𝑁 stages. Each stage except
the last runs in 𝑡 units of time (say pico-seconds) meanwhile the last stage runs in 𝑚×𝑡 units
of time where 𝑚 > 1. Thus, the last stage is slower than the other ones. Assume there are 𝑐
independent and hazard-free instructions being processed by this pipeline.

(a) Compute an expression for the actual pipeline speedup based on the variables 𝑁, 𝑡, 𝑚,
and 𝑐. Show your workings.
(b) Compute an expression for the ratio of time for which the pipeline runs at full occupancy
based on the variables 𝑁, 𝑡, 𝑚, and 𝑐. Show your workings. Recall full occupancy means
every pipeline stage is actively computing work.
2

Exercise 3.
Consider the following C code:
for (i = 0; i < n; ++i)
a[i] = b[i] + i;
where a and b are 32-bit integer arrays of size n.

We have a corresponding MIPS instruction sequence:
add $t0, $0, $0 # $t0 = 0, which corresponds to i in C code
loop: lw $s1, 0($s4) # assume $s4 stores the base address of array b
add $s0, $s1, $t0 # $s0 gets b[i] + i
sw $s0, 0($s2) # assume $s2 stores the base address of array a
addi $t0, $t0, 1 # ++i
addi $s2, $s2, 4 # get address of a[i+1]
addi $s4, $s4, 4 # get address of b[i+1]
slt $t2, $t0, $s5 # assume that $s5 holds n
bne $t2, $0, loop # if $t2 == 1, go to loop

Assume that the above MIPS instructions will be executed on a 5-stage pipelined processor.
Ignore structural hazards and assume the branch condition is computed in the EX stage.
(a) [10 Marks] Draw the pipeline execution diagram for one iteration of the loop, that is,
starting at the label loop. Assume there is no data forwarding. Compute the CPI (clock
cycles per instruction) for that one iteration. Do not re-order the instructions.

(b) [10 Marks] Draw the pipeline execution diagram for one iteration of the loop, that
is, starting at the label loop. This time, assume all possible data forwarding is used. For
each instance of data forwarding, annotate your pipeline diagram (say, with arrows, colors,
or footnotes) to indicate the source and destination of the forwarding; also say what kind of
forwarding it is. Compute the CPI (clock cycles per instruction) for that one iteration. Do
not re-order the instructions.

(c) [5 Marks] Apply loop unrolling (as well as instruction re-ordering, if you like) on the
above MIPS code for two iterations. Write out the corresponding MIPS instruction code. Make
sure to use offsets appropriately to avoid unnecessary instructions.

3
(d) [15 Marks] Consider a 2-issue extension of MIPS. That is, a VLIW extension of MIPS
where two instructions occur in each instruction word. The issue packet has the following
format: the first instruction must be arithmetic or branch, and the second instruction must
be a data transfer instruction (lw or sw). Still assume all types of forwarding.

Using the code of your unrolled loop of part (c), statically schedule one iteration of the (now
unrolled) loop to run optimally on this 2-issue machine. You may wish to modify your code
from part (c) to have a different order and use additional registers to accomplish this task.

Consider using the following table as a starting point and to help guide you through the
process
ALU or branch Data transfer CC
loop: 1
2
3
4
5
6
7
8
9

4

Exercise 4. [25 Marks] Consider a multi-core processor with 2 cores, named P1 and P2.
Each core has a dedicated cache with the following characteristics:
• 2-way set associative and a 16-byte capacity;
• is initially empty;
• follows the MESI snooping protocol;
• follows write-back and write-allocate protocols; and
• follows a pseudo-LRU replacement policy where

(𝑖) empty cache lines in a set are filled first, then,
(𝑖) if there are any invalid cache lines in a set replace them, then,
(𝑖𝑖) if no invalid cache lines are present, follows a typical LRU replacement policy.

Given the following list of serialized memory byte address accesses by the cores, determine:
(a) whether each access results in a cache hit, cold miss, conflict miss, capacity miss, true
share miss, or false share miss;

(b) the data stored in each cache after all addresses in the list have been accessed; and
(c) the MESI state of each cache block after all addresses in the list have been accessed.

Time Memory Access Hit/Miss Type
1 P1 Reads 5
2 P2 Writes 8
3 P1 Reads 9
4 P1 Writes 14
5 P1 Reads 3
6 P1 Writes 12
7 P2 Reads 6
8 P2 Reads 17
9 P1 Reads 20
10 P2 Reads 4
Time Memory Access Hit/Miss Type
11 P2 Writes 9
12 P2 Writes 10
13 P2 Reads 2
14 P1 Writes 7
15 P1 Reads 8
16 P1 Reads 4
17 P2 Reads 12
18 P2 Reads 7
19 P1 Writes 2
20 P1 Reads 11

To answer part (a) use the above table. To answer parts (b) and (c), use the below tables.
P1 Cache
Set Cache Block Data State
0
1
2
3
P2 Cache
Set Cache Block Data State
0
1
2
3
5