CSE 13S Assignments 1 to 7 solutions

$160.00

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (1 vote)

CSE 13S Assignment 1 Left, Right and Center

1 Introduction
The gambling known as business looks with austere disfavor
upon the business known as gambling.
—Ambrose Bierce
We are going to implement a simple game called Left, Right, and Center. A subset of philosophers will represent
players of the game. It requires no skill, no real decision-making, and a philosopher that is out of the game can
suddenly come back in and often win.
Some of you may have religious or moral prohibitions against gambling. This program is not gambling, since
(i) you neither win nor lose, and (ii) only fictitious people lose or win any money. But, if you do have any qualms,
let us know and we will give you an alternative assignment involving vampires.
2 Playing the Game
No sympathy for the devil; keep that in mind. Buy the ticket, take the ride. . . and if it
occasionally gets a little heavier than what you had in mind, well. . .maybe chalk it off to forced
conscious expansion: Tune in, freak out, get beaten.
—Hunter S. Thompson, Fear & Loathing in Las Vegas
Some number of k players, 1 < k ≤ 14, sit around a table. Each player has in her hand $3. There are three dice,
and each die has 6 faces and is labeled: 3ו, 1×L, 1×R or 1×C. As a result, we know that there is a 50% chance of
rolling •, and 16.66% chance of rolling each of L, R, or C.
1. Beginning with player 1, roll the dice:
(a) If the player has $3 or more then she rolls three dice; if she has $2 then she rolls two dice; if she has only
$1 then she rolls one die; if she has no money then she must pass.
(b) For each die:
i. If the player rolls L then she gives $1 to the player on her left.
ii. If the player rolls R then she gives $1 to the player on her right.
iii. If the player rolls C then she puts $1 in the pot in the center.
iv. If the player rolls • then she ignores it.
© 2021 Darrell Long 1
2. Move to the next player in sequence: to the right. The players are numbered. There is you. Then there is the
left player which is (you − 1) mod the number of players and there is the right player which is (you + 1) mod
the number of players. Be careful: What does -2 % 10 mean? Consequently, you may find this code useful:
1 //
2 // Returns the position of the player to the left .
3 //
4 // pos : The position of the current player .
5 // players : The number of players in the game .
6 //
7 static inline uint8_t left ( uint8_t pos , uint8_t players ) {
8 return (( pos + players – 1) % players ) ;
9 }
10
11 //
12 // Returns the position of the player to the right .
13 //
14 // pos : The position of the current player .
15 // players : The number of players in the game .
16 //
17 static inline uint8_t right ( uint8_t pos , uint8_t players ) {
18 return (( pos + 1) % players ) ;
19 }
3. Repeat until only one player has any money remaining (who then wins the pot).
3 Your Task
Any one who considers arithmetical methods of producing random digits is,
of course, in a state of sin.
—John von Neumann, 1951
• You must have one source file: lrc.c. Do not name your source file anything else. You will lose points.
• For grading purposes the numbering of the faces matters, and so they must be defined as follows (do not
change it):
1 typedef enum faciem { PASS , LEFT , RIGHT , CENTER } faces ;
2 faces die [] = { LEFT , RIGHT , CENTER , PASS , PASS , PASS };
This means you may not use int as a substitute.
• You must give your players names, and for grading purposes the names must correspond to these (do not
change them):
© 2021 Darrell Long 2
1 char * philosophers [] = { ” Immanuel Kant “,
2 ” Martin Heidegger “,
3 ” David Hume “,
4 ” Georg Wilhelm Friedrich Hegel “,
5 ” Arthur Schopenhauer “,
6 ” Ludwig Wittgenstein “,
7 ” Karl Wilhelm Friedrich Schlegel “,
8 ” Friedrich Nietzsche “,
9 ” Socrates “,
10 ” John Stuart Mill “,
11 ” Plato “,
12 ” Aristotle “,
13 ” Thomas Hobbes “,
14 ” Rene Descartes ” };
This array of names is defined in philos.h, which can be found in the course resources repository.
• Typing make must build your program and ./lrc must run your program. Since you have not learned about
Makefiles yet, here is one that you can use for now.
Makefile
1 CC = clang
2 CFLAGS = -Wall -Wextra -Werror -Wpedantic
3
4 all : lrc
5
6 lrc : lrc .o
7 $(CC) -o lrc lrc .o
8
9 lrc .o: lrc .c
10 $(CC) $( CFLAGS ) -c lrc .c
11
12 clean :
13 rm -f lrc lrc .o
14
15 scan – build : clean
16 scan – build make
• You will need to use a random number generator to simulate rolling a dice. You can do this by calling the
function random() (read the man page) to get a random value. You can then use mod to limit this value to
the range of 0–5 inclusive (in Computer Science, we start from 0). This value is used to determine what was
rolled.
• In order that your program be reproducible, you must start from a known place. This is accomplished by
setting the random seed using the function srandom() (again read the man page). Your program will ask for
two numbers: the random seed and the number of players. You should assign these inputs to variables to
© 2021 Darrell Long 3
use in your program. The random seed completely determines the outcome of your program. If you give
it the same random seed and the number of players you must get the same answer. Here is an example of
using srandom() and random():
1 srandom (1) ; // Sets the random seed as 1.
2 int a = random () ; // Set a as a pseudorandom number .
• Comment your code. Include block comments to tell the grader what a certain block of code does. Use a line
comment to explain something that is not as obvious. Refer to the coding standards.
• All source code should be formatted using clang-format.
• Your program must compile and run on Ubuntu 20.04 with no errors. If it does not, your program will receive
a 0.
• Your program must pass scan-build with no errors. Document any errors that are false positives in your
README.md.
In lecture we talked briefly about random and pseudo-random numbers. As we know, computers produce
pseudo-random numbers, and in this case it is to your benefit since reproducibility is essential. That means that
in reality you program though it appears to be random is actually deterministic. This is why starting with the same
seed produces the same sequence.
4 Resources
Common sense in an uncommon degree is what the world calls wisdom.
—Samuel Taylor Coleridge, Literary Remains
A working example binary can be found in the course resources repository on git.ucsc.edu. The header file,
philos.h, containing the names of the philosophers can be found here as well.
For this assignment, running “./lrc” will prompt the user to enter a random seed and the number of players.
It will then produce an output of the game which your program should also be able to produce given the same seed
and number of players. Your program must be able to do exactly the same.
5 Hints
It is no use trying to sum people up. One must follow hints, not exactly what
is said, nor yet entirely what is done.
—Virginia Woolf
1. Start Now! You can always finish early.
2. You may find it helpful to draw out and run through a game to get a feel for the rules. Doing so may make it
easier to visualize the game and design your program.
3. The game itself should be an infinite loop, where the condition to break out of this loop is when there is only
one active player remaining.
© 2021 Darrell Long 4
4. You should think carefully about what quantities that you must track in order for your program to function.
At a minimum you must keep track of the bank balance of each player, the amount of money in the pot,
and the number of players that are in. Be careful: players that were out may be brought back in if money is
passed to the left or right.
6 Deliverables
The design process is about designing and prototyping and making. When
you separate those, I think the final result suffers.
—Jonathan Ive
You will need to turn in:
1. lrc.c: The source file which is your program.
2. philos.h: The header file containing the names of the philosophers.
3. Makefile: This is a file that will allow the grader to type make to compile your program. Typing make with
no arguments must build your program.
• CC=clang must be specified.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• make clean must remove all files that are compiler generated.
• make should build your program, as should make all.
• Your program executable must be named lrc.
4. README.md: This must be in markdown. This must describe how to use your program and Makefile.
5. DESIGN.pdf: This must be in PDF format. The design document should describe your design for your program with enough detail that a sufficiently knowledgeable programmer would be able to replicate your implementation. This does not mean copying your entire program in verbatim. You should instead describe
how your program works with supporting pseudo-code. For this program, pay extra attention to how you
describe your logic flow/algorithm in C. The commit ID containing the first draft of your design document
must be submitted no later than the Thursday before the assignment is due, by 11:59 pm PST. Failing to
submit this first draft or submitting a insubstantial draft means receiving a 0 for your design document.
All of these files must be in the directory asgn1.
7 Submission
Better three hours too soon than a minute too late.
—William Shakespeare
To submit your assignment, refer back to asgn0 for the steps on how to submit your assignment through git.
Remember: add, commit, and push!
Your assignment is turned in only after you have pushed. If you forget to push, you have not turned in your
assignment and you will get a zero. “I forgot to push” is not a valid excuse. It is highly recommended to commit
and push your changes often.
© 2021 Darrell Long 5
8 Supplemental Readings
I was not proud of what I had learned but I never doubted that
it was worth knowing.
—Hunter S. Thompson, The Rum Diary
• The C Programming Language by Kernighan & Ritchie
– Chapter 1 §1.6, 1.7
– Chapter 2 §2.3
– Chapter 3 §3.1-3.7
La programmation en C c’est comme donner une tronçonneuse à un singe.

CSE 13S Assignment 2 A Small Numerical Library

1 Introduction
Let us change our traditional attitude to the construction of programs.
Instead of imagining that our main task is to instruct a computer what
to do, let us concentrate rather on explaining to human beings what
we want a computer to do.
—Donald Knuth
As we know, computers are simple machines that carry out a sequence of very simple steps, albeit very
quickly. Unless you have a special-purpose processor, a computer can only compute addition, subtraction,
multiplication, and division. If you think about it, you will see that the functions that might interest you
when dealing with real or complex numbers can be built up from those four operations. We use many of
these functions in nearly every program that we write, so we ought to understand how they are created.
If you recall from your Calculus class, with some conditions a function f(x) can be represented by its
Taylor series expansion near some point f(a):
f(x) = f(a)+∑∞
k=1
f
(k)
(a)
k!
(x−a)
k
.
Note: when you see Σ, you should generally think of a for loop.
If you have forgotten (or never taken) Calculus, do not despair. Attend a laboratory section for review:
the concepts required for this assignment are just derivatives.
Since we cannot compute an infinite series, we must be content to calculate a finite number of terms. In
general, the more terms that we compute, the more accurate our approximation. For example, if we expand
to 10 terms we get:
f(x) = f(a)+ f
(1)
(a)
1!
(x−a)
1 +
f
(2)
(a)
2!
(x−a)
2 +
f
(3)
(a)
3!
(x−a)
3 +
f
(4)
(a)
4!
(x−a)
4
+
f
(5)
(a)
5!
(x−a)
5 +
f
(6)
(a)
6!
(x−a)
6 +
f
(7)
(a)
7!
(x−a)
7 +
f
(8)
(a)
8!
(x−a)
8
+
f
(9)
(a)
9!
(x−a)
9 +O((x−a)
10).
© 2021 Darrell Long 1
Note: k! = k(k−1)(k−2)×…×1, and by definition, 0! = 1.
Taylor series, named after Brook Taylor, requires that we pick a point a where we will center the approximation. In the case a = 0, then it is called a Maclaurin series). Often we choose 0, but the closer to the
value of x the better we will approximate the function. For example, let’s consider e
x
centered around 0:
e
x =
∑∞
k=0
x
k
k!
= 1+
x
1
1!
+
x
2
2!
+
x
3
3!
+
x
4
4!
+
x
5
5!
+
x
6
6!
+
x
7
7!
+
x
8
8!
+
x
9
9!
+…
This is one of the simplest series when centered at 0, since e
0 = 1. Consider the general case:
e
x = e
a +
e
a
1!
(x−a)
1 +
e
a
2!
(x−a)
2 +
e
a
3!
(x−a)
3 +
e
a
4!
(x−a)
4 +
e
a
5!
(x−a)
5
+
e
a
6!
(x−a)
6 +
e
a
7!
(x−a)
7 +
e
a
8!
(x−a)
8 +
e
a
9!
(x−a)
9 +
e
a
10!
(x−a)
10 +O((x−a)
11).
Since d
dx e
x = e
x
the exponential function does not drop out as it does for a = 0, leaving us with the
original problem. If we knew e
a
for a ≈ x then we could use a small number of terms. However, we do not
know it and so we must use a = 0.
What is the O((x−a)
11) term? That is the error term that is “on the order of” the value in parentheses.
This is different from the big-O that we will discuss with regard to algorithm analysis.
2 Your Task
Programming is one of the most difficult branches of applied
mathematics; the poorer mathematicians had better remain pure
mathematicians.
—Edsger Dijkstra
For this assignment, you will be creating a small numerical library and a corresponding test harness. Our
goal is for you to have some idea of what must be done to implement functions that you use all the time.
You will be writing and implementing sin−1
(arcsin), cos−1
(arccos),tan−1
(arctan), and log functions.
For arcsin, arccos, and arctan you can choose to use a Taylor series approximation or the inverse method.
You will use Newton’s method to approximate log. You will then compare your implemented functions to
the corresponding implementations in the standard library <math.h> with a test harness and output the
results into a table similar to what is shown in Figures 1 and 2.
$ ./mathlib-test -s | head -n 5
x arcSin Library Difference
– —— ——- ———-
-1.0000 -1.57079633 -1.57079633 -0.0000000000
-0.9000 -1.11976951 -1.11976951 0.0000000000
-0.8000 -0.92729522 -0.92729522 -0.0000000000
Figure 1: First five lines of program output for arcsin.
© 2021 Darrell Long 2
$ ./mathlib-test -c | head -n 5
x arcCos Library Difference
– —— ——- ———-
-1.0000 3.14159265 3.14159265 -0.0000000000
-0.9000 2.69056584 2.69056584 0.0000000000
-0.8000 2.49809154 2.49809154 -0.0000000000
Figure 2: First five lines of program output for arccos.
From left to right, the columns represent the input parameter, your program’s arcsin/arccos value for
the input parameter, the library’s value for the input parameter and lastly, the difference between your value
and the library’s value.
You will test arcsin and arccos in the range [−1,1) with steps of 0.1, while arctan and log will be tested
in the range [1,10) with steps of 0.1.
Each implementation will be a separate function. You must name the functions arcSin(), arcCos(),
arcTan(), and Log(). Since the <math.h> library uses asin, acos, atan, and log, you will not be able
to use the same names. You may use the sin() and cos() functions from <math.h>. You may not use
any other functions from <math.h> in any of your functions. You may only use them in your printf()
functions. However, you should use constants such as M_PI from <math.h>. Do not define π yourself.
2.1 Sin−1 and Cos−1
The domain of sin−1
and cos−1
is [−1,1], and so centering them around 0 makes sense. The Taylor series
for arcsin(x) centered about 0 is:
arcsin(x) =∑∞
k=0
(2k)!
2
2k(k!)2
x
2k+1
2k+1
, |x| ≤ 1.
If we expand a few terms, then we get:
arcsin(x) = x+(
1
2
)
x
3
3
+(
1×3
2×4
)
x
5
5
+(
1×3×5
2×4×6
)
x
7
7
+O(x
9
).
The series for arccos(x) centered about 0 is:
arccos(x) = π
2

∑∞
k=0
(2k)!
2
2k(k!)2
x
2k+1
2k+1
.
So you can implement it as,
arccos(x) = π
2
−arcsin(x).
You can implement the Taylor series expansion or the inverse method when writing the functions for arcsin
and arccos.
© 2021 Darrell Long 3
2.2 Tan−1
The Taylor series expansion for arctan is also complex and is as follows:
arctan(x) =∑∞
k=0
2
2k(k!)2
(2k+1)!
x
2k+1
(1+x
2)
k+1
.
To make things simpler, you can also calculate arctan using the arcsin or arccos functions
arctan(x) = arcsin(
x

x
2 +1
) = arccos(
1

x
2 +1
), x > 0.
You can implement the Taylor series expansion or the inverse method when writing arctan.
2.3 e
x
Fortunately, we have a nice series for e
x
and it happens to converge very quickly. In Figure 3, we use our
expansion to 10 terms and plot for e
0
,…,e10. We see that the approximation starts to diverge significantly
around x = 7. What this tells us is that 10 terms are insufficient for an accurate approximation, and more
terms are needed.
2 4 6 8 10
x
2000
4000
6000
8000
ex
Taylor Approximation Figure 3: Comparing e
x with its Taylor approximation centered at zero.
If we are naïve about computing the terms of the series we can quickly get into trouble — the values of
k! get large very quickly. We can do better if we observe that:
x
k
k!
=
x
k−1
(k−1)! ×
x
k
.
At first, that looks like a recursive definition (and in fact, you could write it that way, but it would be
wasteful). As we progress through the computation, assume that we know the previous result. We then just
have to compute the next term and multiply it by the previous term. At each step we just need to compute
© 2021 Darrell Long 4
2 4 6 8 10
5
10
15
20
25
Figure 4: Comparing
x
k
k!
for x = 2,3,4,5.
x
k
, starting with k = 0! (remember 0! = 1) and multiply it by the previous value and add it into the total. It
turns into a simple for or while loop.
Conceptually, what you need to think about is:
1 new = previous * current;
2 previous = current;
We can use an ϵ (epsilon) to halt the computation since |x
k
| < k! for a sufficiently large k. Consider
Figure 4: briefly, x
k dominates but is quickly overwhelmed by k! and so the ratio rapidly approaches zero.
You should set ϵ = 10−10 for this assignment.
For this assignment, the Exp(x) function is provided to you and you must use it in order to write your
Log function.
2.4 Log
To compute log, you will use Newton’s method, also called the Newton-Raphson method, by computing the
inverse of e
x
. It is an iterative algorithm to approximate roots of real-valued functions, i.e., solving f(x) = 0.
Each iteration of Newton’s method produces successively better approximations.
xk+1 = xk −
f(xk)
f
′(xk)
Your function Log() should behave the same as log() from <math.h>: compute ln(x). Here we are
© 2021 Darrell Long 5
finding the root of f(x) = y−e
x
. Since e
x
is the inverse of ln, i.e. ln(e
y
) = x. This gives us:
xk+1 = xk +
y−e
xk
e
xk
.
Each guess xk+1 gives a successive improvement over the previous guess xk. The function begins with
an initial guess x0 = 1.0 that it uses to compute better approximations. Log() is sufficiently calculated
once the value converges, i.e. the difference between consecutive approximations is sufficiently small. This
occurs when when e
xi −y is small.
In order to implement this function, you will have to use the given Exp() function. Using the exp()
and pow() functions from <math.h> is strictly prohibited.
3 Command-line Options
GUIs tend to impose a large overhead on every single piece of software, even the smallest,
and this overhead completely changes the programming environment. Small utility
programs are no longer worth writing. Their functions, instead, tend to get swallowed up
into omnibus software packages.
—Neal Stephenson, In the Beginning…Was the Command Line
Your test harness will determine which implemented functions to run through the use of commandline options. In most C programs, the main() function has two parameters: int argc and char **argv.
A command, such as ./hello arg1 arg2, is split into an array of strings referred to as arguments. The
parameter argv is this array of strings. The parameter argc serves as the argument counter, the value in
which is the number of arguments supplied. Try this code, and make sure that you understand it:
Trying out command-line arguments.
1 #include <stdio.h>
2
3 int main(int argc , char **argv) {
4 for (int i = 0; i < argc; i += 1) {
5 printf(“argv[%d] = %s\n”, i, argv[i]);
6 }
7 return 0;
8 }
A command-line option is an argument, usually prefixed with a hyphen, that modifies the behavior
of a command or program. They are typically parsed using the getopt() function. Do not attempt to
parse the command-line arguments yourself. Instead, use the getopt() function. Command-line options
must be defined in order for getopt() to parse them. These options are defined in a string, where each
character in the string corresponds to an option character that can be specified on the on the commandline. Upon running the executable, getopt() scans through the command-line arguments, checking for
option characters.
© 2021 Darrell Long 6
Parsing options with getopt().
1 #include <stdio.h>
2 #include <unistd.h> // For getopt().
3
4 #define OPTIONS “pi:”
5
6 int main(int argc , char **argv) {
7 int opt = 0;
8 while ((opt = getopt(argc , argv , OPTIONS)) != -1) {
9 switch (opt) {
10 case ‘p’:
11 printf(“-p option.\n”);
12 break;
13 case ‘i’:
14 printf(“-i option: %s is parameter.\n”, optarg);
15 break;
16 }
17 }
18 return 0;
19 }
This example program supports two command-line options, ‘p’ and ‘i’. Note that the option character
‘i’ in the defined option string OPTIONS has a colon following it. The colon signifies that, when the ‘i’
option is enabled on the command-line using ‘-i’, getopt() is looking for an parameter to be supplied
following it. An error is thrown by getopt() if an argument for a flag requiring one is not supplied.
4 Deliverables
Thinking doesn’t guarantee that we won’t make mistakes. But not
thinking guarantees that we will.
—Leslie Lamport
You will need to turn in:
1. mathlib.h: You will be supplied this file and you are not allowed to modify it. The function prototypes for the math functions you are required to implement are as follows:
• double arcSin(double x);
• double arcCos(double x);
• double arcTan(double x);
• double Log(double x);
© 2021 Darrell Long 7
2. mathlib.c: This file will contain your math function implementations, as prototyped in mathlib.h.
Your functions must not print or have any side effects.
3. mathlib-test.c: This file will contain the main() program and acts as a test harness for your math
library. The code to parse command-line options and print out the results of testing your math functions belong here. The getopt() options that you must support are:
• -a: to run all tests.
• -s: to run arcsin tests.
• -c: to run arccos tests.
• -t: to run arctan tests.
• -l: to run log tests.
Note that these options are not mutually exclusive. You should be able to support any combination
of these options. Each function is tested at most once. Specifying all tests to be run and a specific
test does not result in that test running twice (e.g. ./mathlib-test -a -s). The output for each
function must match the format shown in Figures 1 and 2. Your compiled program must be called
mathlib-test. To aid you with the print formatting, the print statement is given as follows:
1 printf(” %7.4lf % 16.8lf % 16.8lf % 16.10lf\n”, …);
The spaces in the print format statement are intentional and account for negative (but not positive)
signs.
4. Makefile: This is a file that will allow the grader to type make to compile your program. Running
make or make all must build your mathlib-test program. Running make clean must remove
any compiler-generated files.
5. README.md: This must be in Markdown. This must describe how to build and run your program and
list the command-line options it accepts and what they do.
6. DESIGN.pdf: This must be in PDF. The design document should contain answers to the pre-lab
questions. It should also describe the purpose of your program and communicate the overall design
of the program with enough detail such that a sufficiently knowledgeable programmer would be able
to replicate your implementation. This does not mean copying your entire program in verbatim.
You should instead describe how your program works with supporting pseudocode. C code is not
considered pseudocode. You must push DESIGN.pdf before you push any code.
7. WRITEUP.pdf: This must be in PDF. WRITEUP.pdf should contain a discussion of the results for your
tests. This means analyzing the differences in the output of your implementations versus those in the
<math.h> library. Include possible reasons for the differences between your implementation and the
standard library’s.
Graphs can be especially useful in showing the differences and backing up your arguments. In this
document, graphs were effectively used to show the divergence of the Taylor series approximation
© 2021 Darrell Long 8
for e
x
as seen in Figure 3. A visual representation of data makes it easier to get your point across. For
this assignment, a plot can be used to show how close your approximation is to the value returned by
the library function. An example script for using gnuplot to create your graphs has been provided
to you in the Resources repository. gnuplot is a command-line graphing utility that allows you to
visualize data easily and with a few commands you can make your writeup very pretty!
5 Submission
A man who procrastinates in his choosing will inevitably have his
choice made for him by circumstance.
—Hunter S. Thompson, The Proud Highway: Saga of a Desperate
Southern Gentleman, 1955–1967
To submit your assignment, refer back to asgn0 for the steps on how to submit your assignment through
git. Remember: add, commit, and push!
Your assignment is turned in only after you have pushed. If you forget to push, you have not turned in
your assignment and you will get a zero. “I forgot to push” is not a valid excuse. It is highly recommended
to commit and push your changes often.
6 Supplemental Readings
• The C Programming Language by Kernighan & Ritchie
– Chapter 3 §3.4-3.7
– Chapter 4 §4.1 & 4.2 & 4.5
– Chapter 7 §7.2
– Appendix B §B4
• The Kollected Kode Vicious by George V. Neville-Neil
– Chapter 2 §2.6
© 2021 Darrell Long 9
C ņ IJोÀाƒमग करना एक बƫदर को Öनसॉ ċī जƢसा ž।

CSE 13S Assignment 3 Sorting: Putting your affairs in order

Any inaccuracies in this index may be explained by the fact that it has been
sorted with the help of a computer.
—Donald Knuth, Vol. III, Sorting and Searching
1 Introduction
Putting items into a sorted order is one of the most common tasks in Computer Science. As a result, there
are a myriad of library routines that will do this task for you, but that does not absolve you of the obligation
of understanding how it is done. In fact it behooves you to understand the various algorithms in order to
make wise choices.
The best execution time that can be accomplished, also referred to as the lower bound, for sorting using
comparisons is Ω(n logn), where n is the number is elements to be sorted. If the universe of elements to
be sorted is small, then we can do better using a Count Sort or a Radix Sort both of which have a time
complexity of O(n). The idea of Count Sort is to count the number of occurrences of each element in an
array. For Radix Sort, a digit by digit sort is done by starting from the least significant digit to the most
significant digit. It may also use Count Sort as a subroutine.
What is this O and Ω stuff? It’s how we talk about the execution time (or space used) by a program. We
will discuss it in lecture and in sections, and you will see it again in your Data Structures and Algorithms
class, now named CSE 101.
The sorting algorithms that you are expected to implement are Bubble Sort, Shell Sort, and two Quicksorts. The purpose of this assignment is to get you fully familiarized with each sorting algorithm. They are
well-known sorts (save for the latter Quicksort). You can use the Python pseudocode provided to you as
guidelines. Do not get the code for the sorts from the Internet or you will be referred to for cheating. We
will be running plagiarism checkers.
© 2021 Darrell Long 1
2 Bubble Sort
C is peculiar in a lot of ways, but it, like many other successful things, has a
certain unity of approach that stems from development in a small group.
—Dennis Ritchie
Bubble Sort works by examining adjacent pairs of items, call them i and j. If item j is smaller than item
i, swap them. As a result, the largest element bubbles up to the top of the array in a single pass. Since it is
in fact the largest, we do not need to consider it again. So in the next pass, we only need to consider n −1
pairs of items. The first pass requires n pairs to be examined; the second pass, n − 1 pairs; the third pass
n −2 pairs, and so forth. If you can pass over the entire array and no pairs are out of order, then the array is
sorted.
Pre-lab Part 1
1. How many rounds of swapping will need to sort the numbers 8, 22, 7, 9, 31, 5, 13 in ascending
order using Bubble Sort?
2. How many comparisons can we expect to see in the worse case scenario for Bubble Sort? Hint:
make a list of numbers and attempt to sort them using Bubble Sort.
In 1784, when Carl Friedrich Gauß was only 7 years old, he was reported to have amazed his elementary school teacher by how quickly he summed up the integers from 1 to 100. The precocious little Gauß
produced the correct answer immediately after he quickly observed that the sum was actually 50 pairs of
numbers, with each pair summing to 101 totaling to 5,050. We can then see that:
n +(n −1)+(n −2)+…+1 =
n(n +1)
2
so the worst case time complexity is O(n
2
). However, it could be much better if the list is already sorted. If
you haven’t seen the inductive proof for this yet, you will in the discrete mathematics class, CSE 16.
Bubble Sort in Python
1 def bubble_sort(arr):
2 n = len(arr)
3 swapped = True
4 while swapped:
5 swapped = False
6 for i in range(1, n):
7 if arr[i] < arr[i – 1]:
8 arr[i], arr[i – 1] = arr[i – 1], arr[i]
9 swapped = True
10 n -= 1
© 2021 Darrell Long 2
3 Shell Sort
There are two ways of constructing a software design. One way is to make it
so simple that there are obviously no deficiencies, and the other way is to
make it so complicated that there are no obvious deficiencies. The first
method is far more difficult.
—C.A.R. Hoare
Donald L. Shell (March 1, 1924–November 2, 2015) was an American computer scientist who designed the
Shell sort sorting algorithm. He earned his Ph.D. in Mathematics from the University of Cincinnati in 1959,
and published the Shell Sort algorithm in the Communications of the ACM in July that same year.
Shell Sort is a variation of insertion sort, which sorts pairs of elements which are far apart from each
other. The gap between the compared items being sorted is continuously reduced. Shell Sort starts with
distant elements and moves out-of-place elements into position faster than a simple nearest neighbor exchange. What is the expected time complexity of Shell Sort? It depends entirely upon the gap sequence.
The following is the pseudocode for Shell Sort. The gap sequence is represented by the array gaps. You
will be given a gap sequence, the Pratt sequence (2
p3
q
also called 3-smooth), in the header file gaps.h. For
each gap in the gap sequence, the function compares all the pairs in arr that are gap indices away from
each other. Pairs are swapped if they are in the wrong order.
Shell Sort in Python
1 def shell_sort(arr):
2 for gap in gaps:
3 for i in range(gap, len(arr)):
4 j = i
5 temp = arr[i]
6 while j >= gap and temp < arr[j – gap]:
7 arr[j] = arr[j – gap]
8 j -= gap
9 arr[j] = temp
Pre-lab Part 2
1. The worst time complexity for Shell Sort depends on the sequence of gaps. Investigate why
this is the case. How can you improve the time complexity of this sort by changing the gap
size? Cite any sources you used.
© 2021 Darrell Long 3
4 Quicksort
If debugging is the process of removing software bugs, then programming
must be the process of putting them in.
—Edsger Dijkstra
Quicksort (sometimes called partition-exchange sort) was developed by British computer scientist C.A.R.
“Tony” Hoare in 1959 and published in 1961. It is perhaps the most commonly used algorithm for sorting
(by competent programmers). When implemented well, it is the fastest known algorithm that sorts using
comparisons. It is usually two or three times faster than its main competitors, Merge Sort and Heapsort. It
does, though, have a worst case performance of O(n
2
) while its competitors are strictly O(n logn) in their
worst case.
Quicksort is a divide-and-conquer algorithm. It partitions arrays into two sub-arrays by selecting an
element from the array and designating it as a pivot. Elements in the array that are less than the pivot go
to the left sub-array, and elements in the array that are greater than or equal to the pivot go to the right
sub-array.
Note that Quicksort is an in-place algorithm, meaning it doesn’t allocate additional memory for subarrays to hold partitioned elements. Instead, Quicksort utilizes a subroutine called partition() that
places elements less than the pivot into the left side of the array and elements greater than or equal to
the pivot into the right side and returns the index that indicates the division between the partitioned parts
of the array. In a recursive implementation, Quicksort is then applied recursively on the partitioned parts
of the array, thereby sorting each array partition containing at least one element.
Instead of a recursive Quicksort, you will be writing an two iterative Quicksorts: one that utilizes a stack
and one that utilizes a queue. In the implementation using a stack, two indices are pushed into the stack at
a time. The first and second indices respectively represent the leftmost and rightmost indices of the array
partition to sort. Similarly, in the implementation using a queue, two indices are enqueued at a time, where
the first and second indices respectively represent the leftmost and rightmost indices of the array partition
to sort. Hint: the return type of partition() should be int64_t.
Partition in Python
1 def partition(arr, lo, hi):
2 pivot = arr[lo + ((hi – lo) // 2)]; # Prevent overflow.
3 i = lo – 1
4 j = hi + 1
5 while i < j:
6 i += 1 # You may want to use a do-while loop.
7 while arr[i] < pivot:
8 i += 1
9 j -= 1
10 while arr[j] > pivot:
11 j -= 1
12 if i < j:
13 arr[i], arr[j] = arr[j], arr[i]
14 return j
© 2021 Darrell Long 4
Quicksort in Python with a stack
1 def quick_sort_stack(arr):
2 lo = 0
3 hi = len(arr) – 1
4 stack = []
5 stack.append(lo) # Pushes to the top.
6 stack.append(hi)
7 while len(stack) != 0:
8 hi = stack.pop() # Pops off the top.
9 lo = stack.pop()
10 p = partition(arr, lo, hi)
11 if lo < p:
12 stack.append(lo)
13 stack.append(p)
14 if hi > p + 1:
15 stack.append(p + 1)
16 stack.append(hi)
Quicksort in Python with a queue
1 def quick_sort_queue(arr):
2 lo = 0
3 hi = len(arr) – 1
4 queue = []
5 queue.append(lo) # Enqueues at the head.
6 queue.append(hi)
7 while len(queue) != 0:
8 lo = queue.pop(0) # Dequeues from the tail.
9 hi = queue.pop(0)
10 p = partition(arr, lo, hi)
11 if lo < p:
12 queue.append(lo)
13 queue.append(p)
14 if hi > p + 1:
15 queue.append(p + 1)
16 queue.append(hi)
Pre-lab Part 3
1. Quicksort, with a worse case time complexity of O(n
2
), doesn’t seem to live up to its name.
Investigate and explain why Quicksort isn’t doomed by its worst case scenario. Make sure to
cite any sources you use.
© 2021 Darrell Long 5
4.1 Stacks
You will need to implement the stack ADT for the first of your two iterative Quicksorts. An ADT is an abstract
data type. With any ADT comes an interface comprised of constructor, destructor, accessor, and manipulator functions. The following subsections define the interface for a stack. A stack is an ADT that implements
a last-in, first-out, or LIFO, policy. Consider a stack of pancakes. A pancake can only be placed (pushed) on
top of the stack and can only be removed (popped) from the top of the stack. The header file containing the
interface will be given to you as stack.h. You may not modify this file. If you borrow code from any place
— including Prof. Long — you must cite it. It is far better if you write it yourself.
4.1.1 Stack
The stack datatype will be abstracted as a struct called Stack. We will use a typedef to construct a new
type, which you should treat as opaque — which means that you cannot manipulate it directly. We will
declare the stack type in stack.h and you will define its concrete implementation in stack.c. This means
the following struct definition must be in stack.c.
1 struct Stack {
2 uint32_t top; // Index of the next empty slot.
3 uint32_t capacity; // Number of items that can be pushed.
4 int64_t *items; // Array of items , each with type int64_t.
5 };
4.1.2 Stack *stack_create(uint32_t capacity)
The constructor function for a Stack. The top of a stack should be initialized to 0. The capacity of a stack is
set to the specified capacity. The specified capacity also indicates the number of items to allocate memory
for, the items in which are held in the dynamically allocated array items. A working constructor for a stack
is provided below. Note that it uses malloc() and calloc() for dynamic memory allocation. Both these
functions are included from <stdlib.h>.
1 Stack *stack_create(uint32_t capacity) {
2 Stack *s = (Stack *) malloc(sizeof(Stack));
3 if (s) {
4 s->top = 0;
5 s->capacity = capacity;
6 s->items = (int64_t *) calloc(capacity , sizeof(int64_t));
7 if (!s->items) {
8 free(s);
9 s = NULL;
10 }
11 }
12 return s;
13 }
© 2021 Darrell Long 6
4.1.3 void stack_delete(Stack **s)
The destructor function for a stack. A working destructor for a stack is provided below. Any memory that is
allocated using one of malloc(), realloc(), or calloc() must be freed using the free() function. The
job of the destructor is to free all the memory allocated by the constructor. Your programs are expected to
be free of memory leaks. A pointer to a pointer is used as the parameter because we want to avoid use-afterfree errors. A use-after-free error occurs when a program uses a pointer that points to freed memory. To
avoid this, we pass the address of a pointer to the destructor function. By dereferencing this double pointer,
we can make sure that the pointer that pointed to allocated memory is updated to be NULL.
1 void stack_delete(Stack **s) {
2 if (*s && (*s)->items) {
3 free((*s)->items);
4 free(*s);
5 *s = NULL;
6 }
7 return;
8 }
9
10 int main(void) {
11 Stack *s = stack_create();
12 stack_delete(&s);
13 assert(s == NULL);
14 }
4.1.4 bool stack_empty(Stack *s)
Since we will be using typedef to create opaque data types, we need functions to access fields of a data
type. These functions are called accessor functions. An opaque data type means that users do not need to
know its implementation outside of the implementation itself. This also means that it is incorrect to write
s->top outside of stack.c since it violates opacity. This accessor function returns true if the stack is
empty and false otherwise.
4.1.5 bool stack_full(Stack *s)
Returns true if the stack is full and false otherwise.
4.1.6 uint32_t stack_size(Stack *s)
Returns the number of items in the stack.
4.1.7 bool stack_push(Stack *s, int64_t x)
The need for manipulator functions follows the rationale behind the need for accessor functions: there
needs to be some way to alter fields of a data type. stack_push() is a manipulator function that pushes
an item x to the top of a stack.
© 2021 Darrell Long 7
This function returns a bool in order to signify either success or failure when pushing onto a stack.
When can pushing onto a stack result in failure? When the stack is full. If the stack is full prior to pushing
the item x, return false to indicate failure. Otherwise, push the item and return true to indicate success.
4.1.8 bool stack_pop(Stack *s, int64_t *x)
This function pops an item off the specified stack, passing the value of the popped item back through the
pointer x. Like with stack_push(), this function returns a bool to indicate either success or failure. When
can popping a stack result in failure? When the stack is empty. If the stack is empty prior to popping it,
return false to indicate failure. Otherwise, pop the item, set the value in the memory x is pointing to as
the popped item, and return true to indicate success.
1 // Dereferencing x to change the value it points to.
2 *x = s->items[s->top];
4.1.9 void stack_print(Stack *s)
This is a debug function that you should write early on. It will help greatly in determining whether or not
your stack implementation is working correctly.
4.2 Queues
You will need to implement the queue ADT for the second of your two iterative Quicksorts. The following
subsections define the interface for a queue. The header file containing the interface will be given to you as
queue.h. You may not modify this file. A queue is an ADT that implements a first-in, first-out, or FIFO, policy. Consider a checkout line. Customers are dequeued from the front (the head) of the line and enqueued
at the end (the tail) of the line.
4.2.1 Queue
The queue struct must be defined as follows in queue.c:
1 struct Queue {
2 uint32_t head; // Index of the head of the queue.
3 uint32_t tail; // Index of the tail of the queue.
4 uint32_t size; // The number of elements in the queue.
5 uint32_t capacity; // Capacity of the queue.
6 int64_t *items; // Holds the items.
7 }
4.2.2 Queue *queue_create(uint32_t capacity)
The constructor for a Queue. The structure of this function is very similar to the stack constructor function
given in §4.1.2. The head and tail of a queue should be initialized to 0. The capacity of a queue is set to
the specified capacity. The capacity also indicates the number of items to allocate memory for, the items
© 2021 Darrell Long 8
in which are contained in the dynamically allocated array items. The size of a queue tracks the number
of enqueued items. Return a pointer to the dynamically allocated Queue. If malloc() or calloc() fail,
return a NULL pointer to indicate failure.
4.2.3 void queue_delete(Queue **q)
The destructor for a Queue.
4.2.4 bool queue_empty(Queue *q)
Returns true if the queue is empty and false otherwise.
4.2.5 bool queue_full(Queue *q)
Returns true if the queue is full and false otherwise.
4.2.6 uint32_t queue_size(Queue *q)
Returns the number of items in the queue.
4.2.7 bool enqueue(Queue *q, int64_t x)
Enqueues an item x at the tail of a queue. If the queue is full prior to enqueuing x, return false to indicate
failure. Otherwise, enqueue x and return true to indicate success.
4.2.8 bool dequeue(Queue *q, int64_t *x)
Dequeues an item from the head of a queue, passing the value of the dequeued item back through the
pointer x. If the queue is empty prior to dequeuing the item, return false to indicate failure. Otherwise,
dequeue the item, set the value in the memory x is pointing to as the dequeued item, and return true to
indicate success.
4.2.9 void queue_print(Queue *q)
This is a debug function that you should write early on. It will help greatly in determining whether or not
your queue implementation is working correctly.
5 Your Task
Die Narrheit hat ein großes Zelt; Es lagert bei ihr alle Welt, Zumal wer Macht
hat und viel Geld.
—Sebastian Brant, Das Narrenschiff
For this assignment you have 3 tasks:
© 2021 Darrell Long 9
1. Implement Bubble Sort, Shell Sort, and both iterative Quicksorts based on the provided Python pseudocode in C. The interface for these sorts will be given as the header files bubble.h, shell.h, and
quick.h. You are not allowed to modify these files for any reason other than gathering statistics.
2. Implement a test harness for your implemented sorting algorithms. In your test harness, you will
creating an array of random elements and testing each of the sorts. Your test harness must be in the
file sorting.c.
3. Gather statistics about each sort and its performance. The statistics you will gather are the size of the
array, the number of moves required, and the number of comparisons required. Note: a comparison
is counted only when two array elements are compared. For the iterative Quicksorts, you will additionally need to gather statistics on the maximum stack and queue sizes. You will want to create some
graphs.
6 Specifics
Vielleicht sind alle Drachen unseres Lebens Prinzessinnen, die nur darauf
warten uns einmal schön und mutig zu sehen. Vielleicht ist alles
Schreckliche im Grunde das Hilflose, das von uns Hilfe will.
—Rainer Maria Rilke
The following subsections cover the requirements of your test harness. It is crucial that you follow the instructions.
6.1 Command-line Options
Your test harness must support any combination of the following command-line options:
• -a : Enables all sorting algorithms.
• -b : Enables Bubble Sort.
• -s : Enables Shell Sort.
• -q : Enables the Quicksort that utilizes a stack.
• -Q : Enables the Quicksort that utilizes a queue.
• -r seed : Set the random seed to seed. The default seed should be 13371453.
• -n size : Set the array size to size. The default size should be 100.
• -p elements : Print out elements number of elements from the array. The default number of elements to print out should be 100. If the size of the array is less than the specified number of elements
to print, print out the entire array and nothing more.
It is important to read this carefully. None of these options are exclusive of any other (you may specify
any number of them, including zero). The most natural data structure for this problem is a set.
© 2021 Darrell Long 10
6.2 Output
The output your test harness produces must be formatted like in the following examples:
$ ./sorting -bq -n 1000 -p 0
Bubble Sort
1000 elements, 758313 moves, 498465 compares
Quick Sort (Stack)
1000 elements, 7257 moves, 13643 compares
Max stack size: 22
$ ./sorting -bqQ -n 15 -r 2021
Bubble Sort
15 elements, 228 moves, 105 compares
45003895 46620555 199644728 203770850 218081181
230022357 273593510 314322227 377988577 458735007
553822818 570456718 653251166 802708864 890975627
Quick Sort (Stack)
15 elements, 60 moves, 98 compares
Max stack size: 6
45003895 46620555 199644728 203770850 218081181
230022357 273593510 314322227 377988577 458735007
553822818 570456718 653251166 802708864 890975627
Quick Sort (Queue)
15 elements, 60 moves, 98 compares
Max queue size: 8
45003895 46620555 199644728 203770850 218081181
230022357 273593510 314322227 377988577 458735007
553822818 570456718 653251166 802708864 890975627
For each sort that was specified, print its name, the statistics for the run, then the specified number of
array elements to print. The array elements should be printed out in a table with 5 columns. Each array
element should be printed with a width of 13. You should make use of the following printf() statement:
1 printf(“%13” PRIu32); // Include <inttypes.h> for PRIu32.
Pre-lab Part 4
1. Explain how you plan on keeping track of the number of moves and comparisons since each
sort will reside within its own file.
7 Deliverables
Dr. Long, put down the Rilke and step away from the computer.
—Michael Olson
You will need to turn in:
© 2021 Darrell Long 11
1. Your program must have the following source and header files:
• Each sorting method will have its own pair of header file and source file.
– bubble.h specifies the interface to bubble.c.
– bubble.c implements Bubble Sort.
– gaps.h contains the Pratt gap sequence for Shell Sort.
– shell.h specifies the interface to shell.c.
– shell.c implements Shell Sort.
– quick.h specifies the interface to quick.c.
– quick.c implements both iterative Quicksorts.
– stack.h specifies the interface to the stack ADT.
– stack.c implements the stack ADT.
– queue.h specifies the interface to the queue ADT.
– queue.c implements the queue ADT.
• sorting.c contains main() and may contain any other functions necessary to complete the
assignment.
You can have other source and header files, but do not try to be overly clever. The header files for each
of the sorts are provided to you. Each sort function takes the array of uint32_ts to sort as the first
parameter and the length of the array as the second parameter.
2. Makefile: This is a file that will allow the grader to type make to compile your program. Typing make
must build your program and ./sorting alone as well as flags must run your program.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• CC=clang must be specified.
• make clean must remove all files that are compiler generated.
• make should build your program, as should make all.
• make format should format all your source code, including the header files.
• Your program executable must be named sorting.
3. Your code must pass scan-build cleanly.
4. README.md: This must be in Markdown. This must describe how to use your program and Makefile.
5. DESIGN.pdf: This must be a PDF. The design document should contain answers to the pre-lab questions at the beginning and describe your design for your program with enough detail that a sufficiently
knowledgeable programmer would be able to replicate your implementation. This does not mean
copying your entire program in verbatim. You should instead describe how your program works with
supporting pseudo-code.
6. WRITEUP.pdf: This document must be a PDF. The writeup must include the following:
• Identify the respective time complexity for each sort and include what you have to say about the
constant.
© 2021 Darrell Long 12
• What you learned from the different sorting algorithms.
• How you experimented with the sorts.
• How big do the stack and queue get? Does the size relate to the input? How?
• Graphs explaining the performance of the sorts on a variety of inputs, such as arrays in reverse
order, arrays with a small number of elements, and arrays with a large number of elements.
• Analysis of the graphs you produce.
8 Submission
Daß Gott ohn Arbeit Lohn verspricht, Verlaß dich darauf und backe nicht
Und wart, bis dir ’ne Taube gebraten Vom Himmel könnt in den Mund
geraten!
—Sebastian Brant, Das Narrenschiff
To submit your assignment, refer back to asgn0 for the steps on how to submit your assignment through
git. Remember: add, commit, and push!
Your assignment is turned in only after you have pushed and submitted the commit ID on Canvas. If
you forget to push, you have not turned in your assignment and you will get a zero. “I forgot to push” is not
a valid excuse. It is highly recommended to commit and push your changes often.
9 Supplemental Readings
The more you read, the more things you will know. The more that you learn,
the more places you’ll go.
—Dr. Seuss
• The C Programming Language by Kernighan & Ritchie
– Chapter 1 §1.10
– Chapter 3 §3.5
– Chapter 4 §4.10–4.11
– Chapter 5 §5.1–5.3 & 5.10
• C in a Nutshell by T. Crawford & P. Prinz.
– Chapter 6 – Example 6.5
– Chapter 7 – Recursive Functions
– Chapter 8 – Arrays as Arguments of Functions
– Chapter 9 – Pointers to Arrays
© 2021 Darrell Long 13
用 C 編程就像給猴子電鋸一樣。

CSE 13S Assignment 4 The Circumnavigations of Denver Long

1 Introduction
I wonder why it is that when I plan a route too carefully, it goes to pieces,
whereas if I blunder along in blissful ignorance aimed in a fancied
direction I get through with no trouble.
—John Steinbeck, Travels with Charley: In Search of Amerca
Denver Long decides to augment his income during his retirement years by selling the fine products
produced by the Shinola Corporation. He enjoys driving his Lincoln, and so it’s the life of a traveling
salesman for him. Having accidentally taken a wrong turn near Chula Vista and winding up in Mexico,
he asks his son to have his class create a computer program that will provide an optimal route to all the
cities along his route and then return him to his home in scenic Clearlake.
2 Directed Graphs
A graph is a data structure G = 〈V,E〉 where V = {v0,…, vn} is the set of vertices (or nodes) and E = {
〈vi
, v j〉,…}
is the set of edges that connect the vertices. For example, you might have a set of cities V =
{El Cajon,La Mesa,San Diego,…,La Jolla} as the vertices and write “El Cajon” → “Lakeside” to indicate
that there is a path (Los Coches Road) from El Cajon to Lakeside. If there is a path from El Cajon to
Lakeside, as well as a path from Lakeside to El Cajon, then the edge connecting El Cajon and Lakeside is
undirected.
© 2021 Darrell Long 1
Such a simple graph representation simply tells you how vertices are connected and provides the idea
of one-way roads. But it really does not help Denver, since it does not provide any notion of distance.
We solve this problem by associating a weight with each edge and so we might write “Santee → El Cajon,
2” to indicate that there is a path of two miles long from Santee to El Cajon. Given a set of edges and
weights, then we can then find the shortest path from any vertex to any other vertex (the answer may be
that there is no such path). There are elegant (and quick) algorithms for computing the shortest path,
but that is not exactly what we want to do. We want to find a path through all of the vertices, visiting
each exactly once, such that there is a direct (single step) connection from the last vertex to the first. This
is called a Hamiltonian path. This will address Denver’s need to get home, but won’t necessarily be the
shortest such path. So, we need to go through every possible Hamiltonian path given the list of cities
Denver is traveling through and pick the shortest path. Note: if you find yourself on a path that is longer
than the current best found Hamiltonian path, then you can preemptively reject your current path.
3 Representing Graphs
Nothing can be more limiting to the imagination than only writing about
what you know.
—John W. Gardner
Perhaps the simplest way to represent a graph is with an adjacency matrix. Consider an n × n adjacency matrix M, where n is the number of vertices in the graph. If Mi,j = k, where 1 ≤ i ≤ j ≤ n, then we
say that there exists a directed edge from vertex i to vertex j with weight k. Traveling through wormholes
is considered hazardous, so any valid edge weight k must be non-zero and positive.
0 1 2 3 4 5 ··· 25


























0 0 10 0 0 0 0 0 0
1 0 0 2 5 0 0 0 0
2 0 0 0 0 0 3 0 5
3 0 0 0 0 21 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
.
.
. 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0
Each edge will be represented as a triplet 〈i, j,k〉. The set of edges in the adjacency matrix above is
E = {〈0, 1, 10〉,〈1, 2, 2〉,〈1, 3, 5〉,〈2, 5, 3〉,〈2, 25, 5〉,〈3, 4, 21〉}.
If the above adjacency matrix were made to be undirected, it would be reflected along the diagonal.
© 2021 Darrell Long 2
0 1 2 3 4 5 ··· 25


























0 0 10 0 0 0 0 0 0
1 10 0 2 5 0 0 0 0
2 0 2 0 0 0 3 0 5
3 0 5 0 0 21 0 0 0
4 0 0 0 21 0 0 0 0
5 0 0 3 0 0 0 0 0
.
.
. 0 0 0 0 0 0 0 0
25 0 0 5 0 0 0 0 0
We will create a graph ADT based on this struct definition:
1 struct Graph {
2 uint32_t vertices ; // Number of vertices .
3 bool undirected ; // Undirected graph ?
4 bool visited [ VERTICES ]; // Where have we gone ?
5 uint32_t matrix [ VERTICES ][ VERTICES ]; // Adjacency matrix .
6 };
We elect to use an adjacency matrix with set maximum dimensions. This is both to simplify the
abstraction and also due to the computational complexity of solving the Traveling Salesman Problem
(TSP) with depth-first search (DFS), which is discussed in §5. The VERTICES macro will be defined and
supplied to you in vertices.h. In this header file, there is another macro START_VERTEX which defines
the origin vertex of the shortest Hamiltonian path we will be searching for. You may not modify this file.
The struct definition of a graph must go in graph.c. The following subsections define the interface for
the graph ADT.
vertices.h
1 # ifndef __VERTICES_H__
2 # define __VERTICES_H__
3
4 # define START_VERTEX 0 // Starting ( origin ) vertex .
5 # define VERTICES 26 // Maximum vertices in graph .
6
7 # endif
3.1 Graph *graph_create(uint32_t vertices, bool undirected)
The constructor for a graph. It is through this constructor in which a graph can be specified to be undirected. Make sure each cell of the adjacency matrix, matrix, is set to zero. Also make sure that each index
of the visited array is initialized as false to reflect that no vertex has been visited yet. The vertices
field reflects the number of vertices in the graph.
© 2021 Darrell Long 3
3.2 void graph_delete(Graph **G)
The destructor for a graph. Remember to set the pointer G to NULL.
3.3 uint32_t graph_vertices(Graph *G)
Return the number of vertices in the graph.
3.4 bool graph_add_edge(Graph *G, uint32_t i, uint32_t j, uint32_t k)
Adds an edge of weight k from vertex i to vertex j. If the graph is undirected, add an edge, also with
weight k from j to i. Return true if both vertices are within bounds and the edge(s) are successfully
added and false otherwise.
3.5 bool graph_has_edge(Graph *G, uint32_t i, uint32_t j)
Return true if vertices i and j are within bounds and there exists an edge from i to j. Remember: an
edge exists if it has a non-zero, positive weight. Return false otherwise.
3.6 uint32_t graph_edge_weight(Graph *G, uint32_t i, uint32_t j)
Return the weight of the edge from vertex i to vertex j. If either i or j aren’t within bounds, or if an edge
doesn’t exist, return 0.
3.7 bool graph_visited(Graph *G, uint32_t v)
Return true if vertex v has been visited and false otherwise.
3.8 void graph_mark_visited(Graph *G, uint32_t v)
If vertex v is within bounds, mark v as visited.
3.9 void graph_mark_unvisited(Graph *G, uint32_t v)
If vertex v is within bounds, mark v as unvisited.
3.10 void graph_print(Graph *G)
A debug function you will want to write to make sure your graph ADT works as expected.
4 Depth-first Search
Again it might have been the American tendency in travel. One goes, not so
much to see but to tell afterward.
—John Steinbeck, Travels with Charley: In Search of Amerca
We need a methodical procedure for searching through the graph. Once we have examined a vertex,
© 2021 Darrell Long 4
we do not want to do so again—we don’t want Denver going through cities where he has already been
(he has been known to wear out his welcome: charming women and fighting men).
Depth-first search (DFS) first marks the vertex v as having been visited, then it iterates through all of
the edges 〈v,w〉, recursively calling itself starting at w if w has not already been visited.
1 procedure DFS (G,v) :
2 label v as visited
3 for all edges from v to w in G. adjacentEdges (v) do
4 if vertex w is not labeled as visited then
5 recursively call DFS (G,w)
6 label v as unvisited
Finding a Hamiltonian path then reduces to:
1. Using DFS to find paths that pass through all vertices, and
2. There is an edge from the last vertex to the first. The solutions to the Traveling Salesman Problem
are then the shortest found Hamiltonian paths.
5 Computational Complexity
Many a trip continues long after movement in time and space have ceased.
I remember a man in Salinas who in his middle years traveled to Honolulu
and back, and that journey continued for the rest of his life. We could
watch him in his rocking chair on his front porch, his eyes squinted,
half-closed, endlessly traveling to Honolulu.
—John Steinbeck, Travels with Charley: In Search of Amerca
How long will this take? The answer is, it will take a very long time if there are a large number of
vertices. The running time of the simplest algorithm is O(n!) and there is no known algorithm that runs
in less than O(2n
) time. In fact, the TSP has been shown to be NP-hard, which means that it is as difficult
as any problem in the class NP (you will learn more about this in CSE 104: Computability and Computational Complexity). Basically, it means that it can be solved in polynomial time if you have a magical
computer that at each if-statement is takes both branches every time (creating a copy of the computer
for each such branch).
6 Representing Paths
We find after years of struggle that we do not take a trip; a trip takes us.
—John Steinbeck, Travels with Charley: In Search of Amerca
Given that vertices are added to and removed from the traveled path in a stack-like manner, we decide to abstract a path as follows:
© 2021 Darrell Long 5
1 struct Path {
2 Stack * vertices ; // The vertices comprising the path .
3 uint32_t length ; // The total length of the path .
4 };
The following subsections define for the interface for the path ADT.
6.1 Path *path_create(void)
The constructor for a path. Set vertices as a freshly created stack that can hold up to VERTICES number
of vertices. Initialize length to be 0. The length field will track the length of the path. In other words, it
holds the sum of the edge weights between consecutive vertices in the vertices stack.
6.2 void path_delete(Path **p)
The destructor for a path. Remember to set the pointer p to NULL.
6.3 bool path_push_vertex(Path *p, uint32_t v, Graph *G)
Pushes vertex v onto path p. The length of the path is increased by the edge weight connecting the vertex
at the top of the stack and v. Return true if the vertex was successfully pushed and false otherwise.
6.4 bool path_pop_vertex(Path *p, uint32_t *v, Graph *G)
Pops the vertices stack, passing the popped vertex back through the pointer v. The length of the
path is decreased by the edge weight connecting the vertex at the top of the stack and the popped vertex.
Return true if the vertex was successfully popped and false otherwise.
6.5 uint32_t path_vertices(Path *p)
Returns the number of vertices in the path.
6.6 uint32_t path_length(Path *p)
Returns the length of the path.
6.7 void path_copy(Path *dst, Path *src)
Assuming that the destination path dst is properly initialized, makes dst a copy of the source path src.
This will require making a copy of the vertices stack as well as the length of the source path.
6.8 void path_print(Path *p, FILE *outfile, char *cities[])
Prints out a path to outfile using fprintf(). Requires a call to stack_print(), as defined in §7.10,
in order to print out the contents of the vertices stack.
© 2021 Darrell Long 6
7 Stacks, Revisited
If you have some respect for people as they are, you can be more
effective in helping them to become better than they are.
—John W. Gardner
You will use the stack that you implemented for assignment 3 with slight modifications. If there were
any problems with your stack for that assignment, make sure to fix them for this assignment. Here is the
modified stack interface for this assignment.
1 struct Stack {
2 uint32_t top ;
3 uint32_t capacity ;
4 uint32_t * items ;
5 };
7.1 Stack *stack_create(uint32_t capacity)
The constructor function for a Stack. The top of a stack should be initialized to 0. The capacity of a
stack is set to the specified capacity. The specified capacity also indicates the number of items to allocate
memory for, the items in which are held in the dynamically allocated array items.
7.2 void stack_delete(Stack **s)
The destructor function for a stack. Remember to set the pointer s to NULL.
7.3 bool stack_empty(Stack *s)
Return true if the stack is empty and false otherwise.
7.4 bool stack_full(Stack *s)
Return true if the stack is full and false otherwise.
7.5 uint32_t stack_size(Stack *s)
Return the number of items in the stack.
7.6 bool stack_push(Stack *s, uint32_t x)
If the stack is full prior to pushing the item x, return false to indicate failure. Otherwise, push the item
and return true to indicate success.
© 2021 Darrell Long 7
7.7 bool stack_peek(Stack *s, uint32_t *x)
Peeking into a stack is synonymous with querying a stack about the element at the top of the stack. If the
stack is empty prior to peeking into it, return false to indicate failure.
7.8 bool stack_pop(Stack *s, uint32_t *x)
If the stack is empty prior to popping it, return false to indicate failure. Otherwise, pop the item, set the
value in the memory x is pointing to as the popped item, and return true to indicate success.
7.9 void stack_copy(Stack *dst, Stack *src)
Assuming that the destination stack dst is properly initialized, make dst a copy of the source stack src.
This means making the contents of dst->items the same as src->items. The top of dst should also
match the top of src.
7.10 void stack_print(Stack *s, FILE *outfile, char *cities[])
Prints out the contents of the stack to outfile using fprintf(). Working through each vertex in the
stack starting from the bottom, print out the name of the city each vertex corresponds to. This function
will be given to aid you.
1 void stack_print ( Stack *s, FILE * outfile , char * cities []) {
2 for ( uint32_t i = 0; i < s-> top ; i += 1) {
3 fprintf ( outfile , “%s”, cities [s-> items [i]]) ;
4 if (i + 1 != s- > top ) {
5 fprintf ( outfile , ” -> “) ;
6 }
7 }
8 fprintf ( outfile , “\n”);
9 }
8 Command-line Options
Attitude is a choice. Happiness is a choice. Optimism is a choice. Kindness
is a choice. Giving is a choice. Respect is a choice. Whatever choice you
make makes you. Choose wisely.
—Roy T. Bennett, The Light in the Heart
Your program must support any combination of the following command-line options.
• -h : Prints out a help message describing the purpose of the graph and the command-line options
it accepts, exiting the program afterwards. Refer to the reference program in the resources repo for
an idea of what to print.
© 2021 Darrell Long 8
• -v : Enables verbose printing. If enabled, the program prints out all Hamiltonian paths found as
well as the total number of recursive calls to dfs().
• -u : Specifies the graph to be undirected.
• -i infile : Specify the input file path containing the cities and edges of a graph. If not specified,
the default input should be set as stdin.
• -o outfile : Specify the output file path to print to. If not specified, the default output should be
set as stdout.
9 Reading An Input Graph
It behooves a man who wants to see wonders sometimes
to go out of his way.
—John Mandeville, The Travels of Sir John Mandeville
We will be storing graphs in specially formatted files. Here is an example:
$ cat mythical.graph
4
Asgard
Elysium
Olympus
Shangri-La
0 3 5
3 2 4
2 1 10
1 0 2
The first line of a graph file is the number of vertices, or cities, in the graph. Assuming n is the number
of vertices, the next n lines of the file are the names of the cities. Each line after that is an edge. It is to be
scanned in as a triplet 〈i, j,k〉 and interpreted as an edge from vertex i to vertex j with weight k.
10 Specifics
The gladdest moment in human life, methinks, is a
departure into unknown lands.
—Sir Richard Burton
Here are the specifics for your program implementation.
1. Parse command-line options with looped calls to getopt(). This should be familiar from assignments 2 and 3.
© 2021 Darrell Long 9
2. Scan in the first line from the input. This will be the number of vertices, or cities, that will be in
the graph. Print an error if the number specified is greater than VERTICES, the macro defined in
vertices.h.
3. Assuming the number of specified vertices is n, read the next n lines from the input using fgets().
Each line is the name of a city. Save the name of each city to an array. You will want to either
make use of strdup() from <string.h> or implement your own strdup() function. If the line
is malformed, print an error and exit the program. Note: using fgets() will leave in the newline
character at the end, so you will manually have to change it to the null character to remove it.
4. Create a new graph G, making it undirected if specified.
5. Scan the input line by line using fscanf() until the end-of-file is reached. Add each edge to G. If
the line is malformed, print an error and exit the program.
6. Create two paths. One will be for tracking the current traveled path and the other for tracking the
shortest found path.
7. Starting from the origin vertex, defined by the macro START_VERTEX in vertices.h, perform a
depth-first search on G to find the shortest Hamiltonian path. Here is an example function prototype that you may use as the recursive depth-first function:
1 void dfs ( Graph *G, uint32_t v, Path *curr , Path * shortest
, char * cities [] , FILE * outfile ) ;
The parameter v is the vertex that you are currently on. The currently traversed path is maintained
with curr. The shortest found path is tracked with shortest. The array of city names is cities.
Finally, outfile is the output to print to.
8. After the search, print out the length of the shortest path found, the path itself (remember to return
back to the origin), and the number of calls to dfs().
$ ./tsp < mythical.graph
Path length: 21
Path: Asgard -> Shangri-La -> Olympus -> Elysium -> Asgard
Total recursive calls: 4
If the verbose command-line option was enabled, print out all the Hamiltonian paths that were
found as well. It is recommended that you print out the paths as you find them.
$ ./tsp -v < ucsc.graph
Path length: 7
Path: Cowell -> Stevenson -> Merrill -> Cowell
Path length: 6
Path: Cowell -> Merrill -> Stevenson -> Cowell
Path length: 6
Path: Cowell -> Merrill -> Stevenson -> Cowell
Total recursive calls: 5
© 2021 Darrell Long 10
11 Deliverables
Travel isn’t always pretty. It isn’t always comfortable. Sometimes it hurts, it
even breaks your heart. But that’s okay. The journey changes you; it should
change you. It leaves marks on your memory, on your consciousness, on
your heart, and on your body. You take something with you. Hopefully, you
leave something good behind.
—Anthony Bourdain
1. Your program, called tsp, must have the following source and header files:
• vertices.h defines macros regarding vertices.
• graph.h specifies the interface to the graph ADT.
• graph.c implements the graph ADT.
• stack.h specifies the interface to the stack ADT.
• stack.c implements the stack ADT.
• path.h specifies the interface to the path ADT.
• path.c implements the path ADT.
• tsp.c contains main() and may contain any other functions necessary to complete the assignment.
You can have other source and header files, but do not try to be overly clever. You may not modify
any of the supplied header files.
2. Makefile: This is a file that will allow the grader to type make or make all to compile your program.
• CC=clang must be specified.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• make should build your program, as should make all.
• make clean must remove all files that are compiler generated.
• make format should format all your source code, including the header files.
3. Your code must pass scan-build cleanly.
4. README.md: This must be in Markdown. This must describe your program briefly, its usage, and
how to build it using your Makefile.
5. DESIGN.pdf: This must be a PDF. The design document should contain answers to the pre-lab
questions at the beginning and describe your design for your program with enough detail that
a sufficiently knowledgeable programmer would be able to replicate your implementation. This
does not mean copying your entire program in verbatim. You should instead describe how your
program works with supporting pseudo-code.
© 2021 Darrell Long 11
12 Supplemental Readings
One of the reasons people stop learning is that they become less and less
willing to risk failure. you learn, the more places you’ll go.
—John W. Gardner
• The C Programming Language by Kernighan & Ritchie
– Chapter 4 §4.10
– Chapter 7 §7.4–7.8
• Introduction to Algorithms by T. Cormen, C. Leiserson, R. Rivest, & C. Stein
– Chapter 10 §10.1
– Chapter 35 §35.2
13 Submission
A man who procrastinates in his choosing will inevitably have his choice
made for him by circumstance.
—Hunter S. Thompson, The Proud Highway
Refer back to asgn0 for the steps on how to submit your assignment through git. Remember: add,
commit, and push! Your assignment is turned in only after you have pushed and submitted the commit
ID on Canvas. If you forget to push, you have not turned in your assignment and you will get a zero. “I
forgot to push” is not a valid excuse. It is highly recommended to commit and push your changes often.
Fhewhqccydw yd S yi byau wylydw q cedauo q sxqydiqm.

CSE 3S Assignment 5 Hamming Codes

1 Introduction
As we know, the world is far from perfect. In the communications domain, this imperfection is called
noise. Noise (unwanted random disturbances) makes it difficult to have a reliable signal. Thus, transferring data through a noisy communication channel is prone to errors. Noisy channels are omnipresent.
They are present in mobile phone networks and even the wires in a circuit. To counteract noisy interference, we add extra information to our data. This extra information allows us to perform error checking,
and request that the sender retransmit any data that was incorrect. We can also add extra information to
not only detect errors but also correct them. This technique is called forward error correction (FEC). CDs,
DVDs, and even hard drives use FEC to account for scratches or bad sectors. In fact, most of our digital
world such as Netflix would not be possible without FEC.
2 Hamming Codes
Hamming: If you come to the Navy School, I will teach you how to be a great scientist.
Long: And if I go to Santa Cruz, will you still teach me?
Hamming: No.
Lunch with Richard Hamming
Richard Hamming
Richard Wesley Hamming was an American applied mathematician whose work
had profound impact on computer engineering and telecommunications. His contributions include the Hamming code (which makes use of a Hamming matrix), the
Hamming window, Hamming numbers, sphere-packing (or Hamming bound), and the
Hamming distance. Hamming served as president of the Association for Computing
Machinery from 1958 to 1960.
He used to joke that he was the anti-Huffman: David Huffman, the inventor of
Huffman codes and a professor here at Santa Cruz, was a friend. Hamming added
redundancy for reliability, while Huffman focused on removing it for efficiency.
In later life, Hamming became interested in teaching. From 1960 and 1976, before
he left Bell Laboratories, he held visiting positions at Stanford University, Stevens Institute of Technology, the City College of New York, the University of California at Irvine and Princeton
© 2021 Darrell Long 1
University. After retiring from the Bell Laboratories in 1976, Hamming took a position at the Naval Postgraduate School in Monterey, California, where he worked as an adjunct professor and senior lecturer in
computer science, and devoted himself to teaching and writing books.
He spent significant effort in trying to recruit a young Dr. Long, telling him that “If you come to the
Navy School, I will teach you how to be a great scientist.” Dr. Long replied, “If I go to Santa Cruz, will you
still teach me?” To which Hamming replied, simply, “No.” Sadly, it was an opportunity lost.
2.1 Bits, Nibbles, and Bytes
Around 1948, John Wilder Turkey, an American mathematician, coined the word bit to replace the mouthful that was binary digit. A bit is the basic unit of information for digital systems. It represents a logical
state with only two possible values, either a 1 or 0, on or off, true or false. But one bit is rather limiting.
As a result, multiple bits were packed together to make up a nibble (4 bits with 24
states) or a byte (8 bits
with 28
states).
2.2 Overview
Hamming codes are a linear error-correcting code invented by Richard W. Hamming 1
to correct errors
caused by punched card readers. But, we will be using them to correct errors caused by random bitflips (noise). He introduced the Hamming(7,4) code that encodes 4 bits of data D in 7 bits by adding 3
redundant or parity P bits. An explanation of parity bits will be given through an example. This code
can detect and correct one error. With 7 possible errors, 3 parity bits can identify which bit is incorrect
(23 −1 = 7).
The Hamming(7,4) code is shown in table 1 where the Hamming code’s least significant bit (LSB)
is at index 0012 and the most significant bit (MSB) is in position 1112. While normally in practice we
begin counting at 0, the importance of starting the count at 1 and in binary will be evident in the next
paragraphs.
Index 1112 1102 1012 1002 0112 0102 0012
Hamming code D3 D2 D1 P2 D0 P1 P0
Table 1: Hamming(7,4) code with data bits D and parity bits P.
You might notice that each parity bit’s index has only one bit set (a power of 2). P0’s index (0012) has
the 0th bit set, and you might notice that the index for D0, D1, and D3 also have the 0th bit set. Thus, P0
is calculated over D0, D1, and D3, i.e., P0 is set if the data bits have an odd number of 1s. This is an odd
parity scheme. With an odd parity scheme, the parity bit is set if by setting it there are an odd number of
1s. We could use an odd parity scheme, but for this assignment, we will be using an even parity scheme.
We can calculate P1 and P2 in the same way where Pi
is calculated over the data bits whose index has
the i
th bit set. The formulas for the three parity bits Pi are shown below.
P0 = D0 ⊕D1 ⊕D3
P1 = D0 ⊕D2 ⊕D3
P2 = D1 ⊕D2 ⊕D3
1R. W. Hamming, “Error detecting and error correcting codes,” The Bell System Technical Journal, April 1950.
© 2021 Darrell Long 2
As a result, each data bit has at least two parity bits that will help recover its value should an error occur.
The overlap of parity bits also keeps them in check (parity bits can be erroneously flipped too). For
example, the Hamming code for 00012 is shown in table 2. From here on, the index will follow convention
and start at 0 and in decimal.
Index 6 5 4 3 2 1 0
Label D3 D2 D1 P2 D0 P1 P0
Hamming code 0 0 0 0 1 1 1
Table 2: Hamming(7,4) code for 00012.
The values of parity bits P0, P1, and P2 are calculated as follows:
P0 = D0 ⊕D1 ⊕D3 = 1⊕0⊕0 = 1
P1 = D0 ⊕D2 ⊕D3 = 1⊕0⊕0 = 1
P2 = D1 ⊕D2 ⊕D3 = 0⊕0⊕0 = 0
We currently have a non-systematic code, a code where the parity bits are interspersed throughout
the code. While this is fine for a hardware implementation, it is tedious to do in software. Instead, we
will be using a systematic code, a code where the parity bits are placed after the data bits. We will also
be extending the Hamming(7,4) code by adding one more parity bit to make this a Hamming(8,4) code.
The extra parity bit, P3, is calculated over P0–P2 and D0–D3. If P3 for a received code is incorrect then
we know an error has occurred. Either P3 is incorrect or one of the the bits in the code is incorrect. This
extension has two additional benefits. First, each code is a byte rather than seven bits. Second, we can
now detect two errors but only correct one. The new Hamming code is shown below.
Index 7 6 5 4 3 2 1 0
Hamming code P3 P2 P1 P0 D3 D2 D1 D0
Table 3: Hamming(8,4) systematic code, an extension of the Hamming(7,4) code.
where
P0 = D0 ⊕D1 ⊕D3
P1 = D0 ⊕D2 ⊕D3
P2 = D1 ⊕D2 ⊕D3
P3 = D0 ⊕D1 ⊕D2 ⊕D3 ⊕P0 ⊕P1 ⊕P2
2.3 Encoding
One approach to generating a Hamming code for a message is to calculate the parity bits one-by-one and
appending them to the end of the message. Instead, we can use a generator matrix, G. Given a message,
m~ , of four bits (a nibble) we can generate its hamming code, ~c, by vector-matrix multiplication ~c = m~ G
where the resulting code is eight bits in size (a byte). G is defined as:
© 2021 Darrell Long 3
G =





1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0





.
Vector-matrix multiplication for a vector ~y of size n and matrix A of size n×n where ~y =~xA is defined as:
yi =
Xn
k=1
xk · Ai,k .
Those of you with exposure to linear algebra (helpful but not required) will notice the left half of G is
the identity matrix I (1s along the diagonal). This ensures the first four bits of the Hamming code are the
data bits while the right half is used to calculate the parity of message (notice the 0s along the diagonal).
Generating the Hamming Code for m~ =
¡
0 0 1 1¢
(11002 in binary) is shown below. Note: binary
is read from right to left with the LSB in the rightmost position and the MSB in the leftmost position.
Vectors (arrays) are read from left to right.
~c =
³
0 1 2 3
0 0 1 1´





1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0





(mod 2)
=
¡
0 0 1 1 2 2 1 1¢
(mod 2)
=
¡
0 0 1 1 0 0 1 1¢
In binary,~c is equivalent to 1100 11002.
Note: Addition and multiplication is mod 2 in binary. You might remember from CSE 12 that L
(exclusive-or) is the summing function and can be used in lieu of addition, and ∧ (logical and) can substitute multiplication. These operations have the added benefit of operating in mod 2.
a ⊕b = a +b (mod 2)
a ∧b = a ×b (mod 2)
2.4 Decoding
The process for decoding a Hamming code is similar to encoding a message as it uses a parity-checker
matrix, H, to identify any errors and recover the original message if an error occurred. To decode a
message, the code is multiplied by the transpose of the parity-checker matrix, ~e = ~cHÖ where ~e is the
error syndrome,~c is the Hamming code, and H is defined as
H =





0 1 1 1 1 0 0 0
1 0 1 1 0 1 0 0
1 1 0 1 0 0 1 0
1 1 1 0 0 0 0 1





.
However, ~e 6= m~ . Instead, ~e, the error syndrome, is a pattern of bits that identifies the error if there is
one. For example, if ~e = [0, 1, 1, 1], matching HÖ
’s first row, then we know the error lies in the 0th bit and
© 2021 Darrell Long 4
flipping the 0th bit in ~c will correct the error (remember the first four bits in the Hamming code are the
data bits). If ~e = 0 then our message does not contain an error. But, if the error syndrome is non-zero
and the vector is not one of HÖ
’s rows then we cannot correct the error since more than one bit has been
flipped.
For example, to decode~c calculated earlier, we can do the following:
~e =~cH
Ö =
³
0 1 2 3 4 5 6 7
0 0 1 1 0 0 1 1´














0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1














(mod 2)
=
¡
0 0 0 0¢
.
Since ~e =
¡
0 0 0 0¢
, we know our message arrived with no errors. Since this is a systematic code,
m~ =
¡
0 0 1 1¢
(the first four elements in~c).
If the second bit had been flipped due to noise and we received ~c =
¡
0 1 1 1 0 0 1 1¢
instead, we can calculate~e to identify the flipped bit.
~e =~cH
Ö =
¡
0 1 1 1 0 0 1 1¢














0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1














(mod 2)
=
¡
1 0 1 1¢
.
Since ~e =
¡
1 0 1 1¢
, which is HÖ
’s second row, we know the second element in ~c was erroneously
flipped. Flipping the value of the second element gives us ~c =
¡
0 0 1 1 0 0 1 1¢
. As a result,
m~ =
¡
0 0 1 1¢
or 11002, the original message. One possible method to determine which bit to flip is
to compare the error syndrome with each row in HÖ
, but section 5.3 will go over an optimal approach
involving lookup tables.
3 Bit Vector
A bit vector is an ADT that represents a one dimensional array of bits, the bits in which are used to denote
if something is true or false (1 or 0). This is an efficient ADT since, in order to represent the truth or falsity
of a bit vector of n items, we can use bn/8c +1 uint8_ts instead of n, and being able to access 8 indices
with a single integer access is extremely cost efficient. Since we cannot directly access a bit, we must use
bitwise operations to get, set, and clear a bit within a byte.
© 2021 Darrell Long 5
1 struct BitVector {
2 uint32_t length ; // Length in bits .
3 uint8_t * vector ; // Array of bytes .
4 };
3.1 BitVector *bv_create(uint32_t length)
The constructor for a bit vector. In the event that sufficient memory cannot be allocated, the function
must return NULL. Else, it must return a (BitVector *), or a pointer to an allocated BitVector. Each
bit of the bit vector should be initialized to 0.
3.2 void bv_delete(BitVector **v)
The destructor for a bit vector. Remember to set the pointer to NULL after the memory associated with
the bit vector is freed.
3.3 uint32_t bv_length(BitVector *v)
Returns the length of a bit vector.
3.4 void bv_set_bit(BitVector *v, uint32_t i)
Sets the i
th bit in a bit vector. To set a bit in a bit vector, it is necessary to first locate the byte in which
the i
th resides. The location of the byte is calculated as i / 8. The position of the bit in that byte is
calculated as i % 8. Setting a bit in a bit vector should not modify any of the other bits.
3.5 void bv_clr_bit(BitVector *v, uint32_t i)
Clears the i
th bit in the bit vector.
3.6 uint8_t bv_get_bit(BitVector *v, uint32_t i)
Returns the i
th bit in the bit vector.
3.7 void bv_xor_bit(BitVector *v, uint32_t i, uint8_t bit)
XORs the i
th bit in the bit vector with the value of the specified bit.
3.8 void bv_print(BitVector *v)
A debug function to print a bit vector. You will want to do this first in order to verify the correctness of
your bit vector implementation.
© 2021 Darrell Long 6
4 Bit Matrix
In this assignment, we will define a bit matrix ADT, which will be an abstraction over a bit vector. That is,
given an m×n bit matrix M, 0 ≤ r < m, and 0 ≤ c < n, then Mr,c corresponds with index r * n + c in the
underlying bit vector. Using this bit matrix abstraction will help when performing matrix multiplication
mod 2 when encoding and decoding Hamming codes.
1 struct BitMatrix {
2 uint32_t rows ;
3 uint32_t cols ;
4 BitVector * vector ;
5 };
4.1 BitMatrix *bm_create(uint32_t rows, uint32_t cols)
The constructor for a bit matrix. In the event that sufficient memory cannot be allocated, the function
must return NULL. Else, it must return a (BitMatrix *), or a pointer to an allocated BitMatrix. The
number of bits in the bit matrix is calculated as row × cols. Each bit should be initialized to 0.
4.2 void bm_delete(BitMatrix **m)
The destructor for a bit matrix. Remember to set the pointer to NULL after the memory associated with
the bit matrix is freed.
4.3 uint32_t bm_rows(BitMatrix *m)
Returns the number of rows in the bit matrix.
4.4 uint32_t bm_cols(BitMatrix *m)
Returns the number of columns in the bit matrix.
4.5 void bm_set_bit(BitMatrix *m, uint32_t r, uint32_t c)
Sets the bit at row r and column c in the bit matrix. Remember, given an m ×n bit matrix M, 0 ≤ r < m,
and 0 ≤ c < n, then Mr,c corresponds with index r * n + c in the underlying bit vector. Setting a bit
should not modify any of the other bits.
4.6 void bm_clr_bit(BitMatrix *m, uint32_t r, uint32_t c)
Clears the bit at row r and column c. Clearing a bit should not modify any of the other bits.
4.7 uint8_t bm_get_bit(BitMatrix *m, uint32_t r, uint32_t c)
Gets the bit at row r and column c. Getting a bit should not modify any of the other bits.
© 2021 Darrell Long 7
4.8 BitMatrix *bm_from_data(uint8_t byte, uint32_t length)
We will occasionally need to transform the first length number of bits in a byte into a bit matrix. The
returned bit matrix should have the dimensions 1×length. The value of length should never be greater
than 8, since we will only need to transform codes (8 bits) or nibbles of data (4 bits) into bit matrices.
4.9 uint8_t bm_to_data(BitMatrix *m)
Extracts the first 8 bits of a bit matrix, returning those bits as a uint8_t. Hint: since there are 8 bits in a
byte, what would be the most efficient way of getting these bits?
4.10 BitMatrix *bm_multiply(BitMatrix *A, BitMatrix *B)
Performs a true matrix multiply, multiplying bit matrix A and bit matrix B mod 2. Returns a new bit matrix
that holds the result of the multiplication. Remember, given an m × n matrix A and an n × p matrix B,
the m × p matrix product C is defined as follows for 1 ≤ i ≤ m and 1 ≤ j ≤ p:
Ci,j =
Xn
k=1
Ai,k ×Bk,j
.
4.11 void bm_print(BitMatrix *m)
A debug function to print a bit matrix. You will want to do this first in order to verify the correctness of
your bit matrix implementation.
5 Hamming Code Module
This module will implement the Hamming(8,4) code described in §2.2. The API is defined in hamming.h.
As with all provided header files, you may not modify hamming.h. The implementation of the module
should be defined in hamming.c.
5.1 HAM_STATUS
The module will contain enumerated status codes that reflect the different possible cases that can occur
when decoding. HAM_OK will indicate that no corrections were needed when decoding. HAM_CORRECT
will indicate that an error was detected and corrected when decoding. Lastly, HAM_ERR will indicate that
more than error was detected and a code could not be properly decoded. These status codes are provided
in hamming.h.
1 typedef enum HAM_STATUS {
2 HAM_OK = -3 , // No error detected .
3 HAM_ERR = -2, // Uncorrectable .
4 HAM_CORRECT = -1 // Detected error and corrected .
5 } HAM_STATUS ;
© 2021 Darrell Long 8
5.2 uint8_t ham_encode(BitMatrix *G, uint8_t msg)
Generates a Hamming code given a nibble of data stored in the lower of nibble of msg and the generator
matrix G. Returns the generated Hamming code, which is stored as a byte, or a uint8_t.
5.3 HAM_STATUS ham_decode(BitMatrix *Ht, uint8_t code, uint8_t *msg)
Decodes the Hamming code using the transpose of the parity-checker matrix, HÖ
, and passes back the
decoded data through the pointer msg. The decoded data bits should constitute the lower nibble. If
no correction is needed when decoding code, return HAM_OK. If one error is detected and corrected,
return HAM_CORRECT. In the event that more than one error is detected, return HAM_ERR to indicate an
uncorrectable error without modifying msg.
To avoiding comparing the error syndrome with each row in HÖ until a match is found (or no match
is found and the error cannot be corrected), we can refer to a lookup table. A lookup table is an array that
contains precomputed information that is referred to often. By constructing a lookup table, we can avoid
performing the same computation many times at the expense of storing the table in memory (in this
case, storing 24 or 16 bytes is negligible). The index to the table will be the error syndrome and the value is
the bit(start counting at 0) that needs to be flipped if the error can be corrected. Thus, if~e =
¡
1 0 1 1¢
or 11012 (1310), then table[13] = 1. If the error cannot be corrected then table[~e] = HAM_ERR.
Pre-lab Questions
1. Complete the rest of the look-up table shown below.
0 0
1 4
··· ···
15 HAM_ERR
2. Decode the following codes. If it contains an error, show and explain how to correct it.
Remember, it is possible for a code to be uncorrectable.
(a) 1110 00112
(b) 1101 10002
6 Your Task
You will be creating two programs for this assignment: an encoder and a decoder. The encoder will generate Hamming codes given input data and the decoder will decode the generated Hamming codes. Both
will operate with the Hamming(8, 4) systematic code. The generator matrix G and the transpose of the
parity-checker matrix, HÖ
, will be represented with bit matrices.
They must read from stdin by default or a file and write to stdout by default or a file. Thus, they
should support command-line arguments, -i and -o, to specify input and/or output files. The decoder
will also print statistics such as total bytes processed, uncorrected errors, corrected errors, and the error
rate to stderr.
© 2021 Darrell Long 9
6.1 Encoder Command-line Options
Your encoder program, named encode, must support any combination of the following command-line
options.
• -h : Prints out a help message describing the purpose of the program and the command-line options it accepts, exiting the program afterwards. Refer to the reference program in the resources
repo for an idea of what to print.
• -i infile : Specify the input file path containing data to encode into Hamming codes. The default input should be set as stdin.
• -o outfile : Specify the output file path to write the encoded data (the Hamming codes) to. If not
specified, the default output should be set as stdout.
6.2 Decoder Command-line Options
Your decoder program, named decode, will need to support the same command-line options that your
encoder does, plus an option to trigger the verbose printing of statistics.
• -h : Prints out a help message describing the purpose of the program and the command-line options it accepts, exiting the program afterwards. Refer to the reference program in the resources
repo for an idea of what to print.
• -i infile : Specify the input file path containing Hamming codes to decode. The default input
should be set as stdin.
• -o outfile : Specify the output file path to write the decoded Hamming codes to. If not specified,
the default output should be set as stdout.
• -v : Prints statistics of the decoding process to stderr. The statistics to print are the total bytes
processed, uncorrected errors, corrected errors, and the error rate. The error rate is defined as
(uncorrected errors/total bytes processed), the ratio of uncorrected errors to total bytes processed.
6.3 Example Program Output
$ ./encode -i frankenstein.txt | ./decode | diff – frankenstein.txt
Total bytes processed: 902092
Uncorrected errors: 0
Corrected errors: 0
Error rate: 0.000000
Figure 1: Example usage and decoder statistics. Here diff uses the dash to represent stdin.
We will be providing source code for a program (error.c) that will inject errors (noise) into your
Hamming codes. Note that not all errors will be correctable. The rate at which the program injects errors
is a command-line argument and is specified by -e rate (default is 0.01 or 1%) and must be between
[0.0, 1.0]. The seed is specified with -s seed and it must be a positive integer.
© 2021 Darrell Long 10
$ ./encode < frankenstein.txt | ./error -e 0.002 -s 2021 | ./decode > /
dev/null
Total bytes processed: 902092
Uncorrected errors: 89
Corrected errors: 14267
Error rate: 0.000099
Figure 2: Example usage with added noise.
6.4 FILE *
There is a lot of complicated logic at the hardware level to read and write actual bits of information.
This is out of scope for a person wanting to write a software program. To make things simpler, the C
standard library provides functions that abstracts much of this complicated logic. All of these functions
are defined in stdio.h, and utilize the FILE object. The FILE object contains all the information about
the file such as its name, file position indicator, file access mode (read/write), and other details necessary
for I/O operations. Some functions of note for this assignment are fopen(), fclose(), fgetc(), and
fputc(). Read the man pages for these functions to see what they do.
6.5 File I/O and Permissions
In the last assignment, graphs were read from either stdin or a file, and the final paths was printed to
either stdout or a file. Thus, you should be familiar with file I/O. However, we must pay special attention
to the input and output file’s permissions when generating or decoding Hamming codes. For example,
imagine you have a sensitive file on your UNIX machine that only you can read or write. Everyone else
will receive an error if they attempt to read or write to it. The file containing the generated Hamming
codes should also only be read or written by you, i.e., it should inherit the file permissions of the file for
which it is generating Hamming codes. This will prevent another user with prying eyes to decode the
file with Hamming codes and learn the contents of your sensitive file (if it is that sensitive, it should be
encrypted in the first place).
6.5.1 UNIX File Permissions
Every file in UNIX has a set of access modes for the owner, group, and everyone else. The ls -l command will lists all the files in a directory and also displays its permissions. In figure 3, the owner of the
file, bat-notes, is batman, and the group is dc. Users in UNIX can be members of a group such as the
sudo group. The groups command lists the groups the current user is in. Groups allow users to share
files within their groups while controlling the extent of their access.
$ ls -l
-rw-rw-r– 1 batman dc 102 Jan 21 12:33 bat-notes
Figure 3: Output of ls -l
© 2021 Darrell Long 11
The file’s permissions are presented in the same order from left to right after the first dash: owner,
group, and others. Read permission is represented by r, write permission, the ability to modify, create, or
delete a file, is represented by w, and execute permission for programs and shell scripts are represented
by x. Thus, in figure 3, batman has the following permissions on his file, bat-notes: rw-rw-r–. The
owner, batman, has read and write permissions but cannot execute the file. The group, dc, has read and
write permissions but cannot execute the file. Everyone else can only read the file.
File permissions are not set in stone and can be modified by chmod (both a command and a system
call). For example, if batman wants to remove read access to everyone else other than himself and members of the dc group, he can execute the command chmod o-r bat-notes to remove read permissions
from others. If he changes his mind later, he can use the command chmod o+r bat-notes to add read
permissions to others.
$ chmod o-r,g-w bat-notes
$ ls -l
-rw-r—- 1 batman dc 102 Jan 21 12:33 bat-notes
Figure 4: Removing other’s read permission and the group’s write permission
As mentioned earlier, chmod is also a system call and can be called from a C program. For this assignment, all output files should inherit the file permissions of the input file. fstat() should be used to
retrieve the input file’s permissions and fchmod() to set the output file’s permissions to match the input’s.
Refer to man chmod, man fchmod, man fstat for more information. Note: both of these functions expect a file descriptor that is returned by low-level IO (open()), so you will need to use fileno() to get
the file descriptor of an open stream.
Using fstat() and fchmod()
1 // Opening files to read from and write to.
2 FILE * infile = fopen (“bat – notes “, “rb”);
3 FILE * outfile = fopen (“bat -notes – encoded “, “wb”);
4
5 // Getting and setting file permissions
6 struct stat statbuf ;
7 fstat ( fileno ( infile ) , & statbuf );
8 fchmod ( fileno ( outfile ) , statbuf . st_mode ) ;
Another way to represent file permissions is with three octal digits, one for owner, group, and others.
The octable table is shown in table 4. In fact, those with a keen eye will notice that file permissions are
in fact a set. Thus, if a file has its permissions set to 7558 then the owner has read, write, and execute
permissions and the group and others only have read, execute permissions.
7 Packing and Unpacking Nibbles
Here are some functions to aid you in getting either the low or high nibble in a byte, as well as packing
them together back into a byte.
© 2021 Darrell Long 12
Ref. Octal Binary
–x 1 001
-w- 2 010
-wx 3 011
r– 4 100
r-x 5 101
rw- 6 110
rwx 7 111
Table 4: Octal and binary representation of file permissions.
Helper functions
1 // Returns the lower nibble of val
2 uint8_t lower_nibble ( uint8_t val ) {
3 return val & 0xF;
4 }
5
6 // Returns the upper nibble of val
7 uint8_t upper_nibble ( uint8_t val ) {
8 return val >> 4;
9 }
10
11 // Packs two nibbles into a byte
12 uint8_t pack_byte ( uint8_t upper , uint8_t lower ) {
13 return ( upper << 4) | ( lower & 0xF);
14 }
8 Encoder Program Specifics
In main() do the following:
1. Parse the command-line options with getopt() and open any input and/or output files using
fopen() and correct file permissions.
2. Initialize the generator matrix G using bm_create().
3. Read a byte from the specified file stream or stdin with fgetc().
4. Generate the Hamming(8,4) codes for both the upper and lower nibble with ham_encode() and
write to the specified file or stdout with fputc(). The Hamming code for the lower nibble should
be written first followed by the code for the upper nibble. Note: You’ll notice this operation becomes repetitive with larger files. If only there was a lookup table of Hamming codes to refer to.
5. Repeat steps 3 – 4 until all data has been read from the file or stdout.
© 2021 Darrell Long 13
6. Close both the input and output files with fclose() and make sure that any memory allocated is
freed.
9 Decoder Program Specifics
1. Parse the command-line options with getopt() and open any input and/or output files using
fopen() and correct file permissions.
2. Initialize the transpose of the parity-checker matrix, HÖ
, using bm_create().
3. Read two bytes from the specified file stream or stdin with fgetc(). Note: The first byte read is
the Hamming code for the lower nibble, and the second is the upper nibble.
4. For each byte pair read, decode the Hamming(8,4) codes for both with ham_decode() to recover
the original upper and lower nibbles of the message. Then, reconstruct the original byte. Note:
The return code from ham_decode() should be used for statistics. Your program should count the
number of bytes proccessed, Hamming codes that required correction, and Hamming codes that
could not be corrected.
5. Write the reconstructed byte with fputc(). Note: You should still write the byte even for Hamming
codes that could not be corrected, leaving the data bits in the byte as is.
6. Repeat steps 3 – 5 until all data has been read from the file or stdout.
7. Print the following statistics to stderr with fprintf():
• Total bytes processed: The number of bytes read by the decoder.
• Uncorrected errors: The number of Hamming codes that could not be corrected.
• Corrected errors: The number of Hamming codes that experienced an error that was recoverable.
• Error rate: The rate of uncorrected errors for a given input. Refer to §6.2 for the calculation.
8. Close both the input and output files with fclose() and make sure that any memory allocated is
freed.
Remember: In both the encoder and decoder, if an input and output file are specified, the output file
should have the same file permissions as the input file. You can use fstat() to retrieve an open file’s
permissions, fchmod() to change the permissions of the output file to match the input’s. Since both of
these functions expect a file descriptor that is returned by low-level IO (open()), you will need to use
fileno() to get the file descriptor of an open stream.
10 Deliverables
You will need to turn in:
1. encode.c: This file will contain your implementation of the Hamming Code encoder.
© 2021 Darrell Long 14
2. decode.c: This file will contain your implementation of the Hamming Code decoder.
3. error.c: This file will be provided in the resources repo, but should be included in your repo as
well. You must not modify this file.
4. entropy.c: This file will be provided in the resources repo, but should be included in your repo as
well. You must not modify this file.
5. bv.h: This fill will contain the bit vector ADT interface. This file will be provided. You may not
modify this file.
6. bv.c: This file will contain your implementation of the bit vector ADT. You must define the bit
vector struct in this file.
7. bm.h: This file will contain the bit matrix ADT interface. This file will be provided. You may not
modify this file.
8. bm.c: This file will contain your implementation of the bit matrix ADT. You must define the bit
matrix struct in this file.
9. hamming.h: This file will contain the interface of the Hamming Code module. This file will be
provided. You may not modify this file.
10. hamming.c: This file will contain your implementation of the Hamming Code module.
11. Makefile: This is a file that will allow the grader to type make to compile your programs.
• CC=clang must be specified.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• make should build the encoder, the decoder, the supplied error-injection program, and the
supplied entropy-measure program, as should make all.
• make encode should build just the encoder.
• make decode should build just the decoder.
• make error should build just the supplied error-injection program.
• make entropy should build just the supplied entropy-measure program.
• make clean must remove all files that are compiler generated.
• make format should format all your source code, including the header files.
12. Your code must pass scan-build cleanly. If there are any bugs or errors that are false positives,
document them and explain why they are false positives in your README.md.
13. README.md: This must be in Markdown. This must describe how to build and run your program.
14. DESIGN.pdf: This must be a PDF. The design document should answer the pre-lab questions,
describe the purpose of your program, and communicate its overall design with enough detail such
that a sufficiently knowledgeable programmer would be able to replicate your implementation.
This does not mean copying your entire program in verbatim. You should instead describe how
your program works with supporting pseudocode. C code is not considered pseudocode.
© 2021 Darrell Long 15
15. WRITEUP.pdf: This document must be a PDF. You will need to use the entropy.c program that
will be supplied to you. The program reads data from stdin and prints out the amount of entropy
in the read data. Entropy, as defined by Claude Shannon, quantifies the amount of information
that is produced by some process. This will be discussed further in lecture. Your writeup must
include the following:
• Graphs showing the amount of entropy of data before and after encoding it.
• Analysis of the graphs you produce.
11 Submission
To submit your assignment through git, refer to the steps shown in asgn0 Remember: add, commit, and
push! Your assignment is turned in only after you have pushed and submitted the commit ID to Canvas.
Your design document is turned in only after you have pushed and submitted the commit ID to Canvas.
If you forget to push, you have not turned in your assignment and you will get a zero. “I forgot to push”
is not a valid excuse. It is highly recommended to commit and push your changes often.
12 Supplemental Readings
The more that you read, the more things you will know. The more that you learn, the
more places you’ll go.
—Dr. Seuss
• The C Programming Language by Kernighan & Ritchie
– Chapter 2 §2.9
– Chapter 5 §5.7
– Chapter 7
Wrasznstizs qa I ps xusr mpvuzo n svnwqg n ioauzanc.

CSE 13S Assignment 6 Huffman Coding

1 Introduction
You’re trying to take something that can be described in many, many
sentences and pages of prose, but you can convert it into a couple lines
of poetry and you still get the essence, so that’s compression. The best
code is poetry.
—Satya Nadella
David A. Huffman
When David Huffman was a graduate student in a class at MIT, the professor
gave the class an unsolved problem: How to construct an optimal static encoding
of information. The young Huffman came back a few days later with his solution, and that solution changed the world. Data compression is now used in all
aspects of communication. David Huffman joined the faculty of MIT in 1953, and
in 1967 he joined the faculty of University of California, Santa Cruz as one of its
earliest members and helped to found its Computer Science Department, where
he served as chairman from 1970 to 1973. He retired in 1994, and passed away in
1999.
The key idea is called entropy, originally defined by Claude Shannon in 1948.
Entropy is a measure of the amount of information in a, say, set of symbols. If
we define I(x) = log2 Pr[x] to be the information content of a symbol, then the
entropy of the set X = {x1,…, xn} is
H(X) =
∑n
i=1
Pr[xi]I(xi) = − ∑n
i=1
Pr[xi]log2 Pr[xi].
It should be easy to see that the optimal static encoding will assign the least number of bits to the most
common symbol, and the greatest number of bits to the least common symbol.
2 The Encoder
Your first task for this assignment is to implement a Huffman encoder. This encoder will read in an input
file, find the Huffman encoding of its contents, and use the encoding to compress the file. Your encoder
program, named encode, must support any combination of the following command-line options:
© 2021 Darrell Long 1
• -h : Prints out a help message describing the purpose of the program and the command-line options it accepts, exiting the program afterwards. Refer to the reference program in the resources
repo for an idea of what to print.
• -i infile : Specifies the input file to encode using Huffman coding. The default input should be
set as stdin.
• -o outfile : Specifies the output file to write the compressed input to. The default output should
be set as stdout.
• -v : Prints compression statistics to stderr. These statistics include the uncompressed file size,
the compressed file size, and space saving. The formula for calculating space saving is:
100×(1−(compressed size/uncompressed size)).
Refer to the reference program in the resources repository for the exact output.
The algorithm to encode a file, or to compress it, is as follows:
1. Compute a histogram of the file. In other words, count the number of occurrences of each unique
symbol in the file.
2. Construct the Huffman tree using the computed histogram. This will require a priority queue.
3. Construct a code table. Each index of the table represents a symbol and the value at that index the
symbol’s code. You will need to use a stack of bits and perform a traversal of the Huffman tree.
4. Emit an encoding of the Huffman tree to a file. This will be done through a post-order traversal of
the Huffman tree. The encoding of the Huffman tree will be referred to as a tree dump.
5. Step through each symbol of the input file again. For each symbol, emit its code to the output file.
3 The Decoder
The second task for this assignment is to implement a Huffman decoder. This decoder will read in a compressed input file and decompress it, expanding it back to its original, uncompressed size. Your decoder
program, named decode, must support any combination of the following command-line options.
• -h : Prints out a help message describing the purpose of the program and the command-line options it accepts, exiting the program afterwards. Refer to the reference program in the resources
repo for an idea of what to print.
• -i infile : Specifies the input file to decode using Huffman coding. The default input should be
set as stdin.
• -o outfile : Specifies the output file to write the decompressed input to. The default output
should be set as stdout.
© 2021 Darrell Long 2
• -v : Prints decompression statistics to stderr. These statistics include the compressed file size,
the decompressed file size, and space saving. The formula for calculating space saving is:
100×(1−(compressed size/decompressed size)).
Refer to the reference program in the resources repository for the exact output.
The algorithm to decode a file, or to decompress it, is as follows:
1. Read the emitted (dumped) tree from the input file. A stack of nodes is needed in order to reconstruct the Huffman tree.
2. Read in the rest of the input file bit-by-bit, traversing down the Huffman tree one link at a time.
Reading a 0 means walking down the left link, and reading a 1 means walking down the right link.
Whenever a leaf node is reached, its symbol is emitted and you start traversing again from the root.
4 Nodes
The first ADT that we will cover is a node. Huffman trees are composed of nodes, with each node containing a pointer to its left child, a pointer to its right child, a symbol, and the frequency of that symbol.
The node’s frequency is only needed for the encoder.
1 typedef struct Node Node ;
2
3 struct Node {
4 Node * left ; // Pointer to left child .
5 Node * right ; // Pointer to right child .
6 uint8_t symbol ; // Node ‘s symbol .
7 uint64_t frequency ; // Frequency of symbol .
8 };
Immediately, we notice that a symbol is a uint8_t, and not a char. This is because we want to
interpret the input file as raw bytes, not as a string. The following subsections define the interface for
a Node and will be supplied in node.h. The definition of a Node will be made transparent in order to
simplify things.
4.1 Node *node_create(uint8_t symbol, uint64_t frequency)
The constructor for a node. Sets the node’s symbol as symbol and its frequency as frequency.
4.2 void node_delete(Node **n)
The destructor for a node. Make sure to set the pointer to NULL after freeing the memory for a node.
© 2021 Darrell Long 3
4.3 Node *node_join(Node *left, Node *right)
Joins a left child node and right child node, returning a pointer to a created parent node. The parent
node’s left child will be left and its right child will be right. The parent node’s symbol will be ‘$’ and
its frequency the sum of its left child’s frequency and its right child’s frequency.
4.4 void node_print(Node *n)
A debug function to verify that your nodes are created and joined correctly.
5 Priority Queues
As stated in the encoding algorithm, the encoder will make use of a priority queue of nodes. A priority
queue functions like a regular queue, but assigns each of its elements a priority, such that elements with
a high priority are dequeued before elements with a low priority. Assuming that elements are enqueued
at the tail and dequeued from the head, this implies that the enqueue() operation does not simply add
the element at the tail. Of course, the dequeue() operation could search for the highest priority element
each time, but that is a bad idea.
How you implement your priority queue is up to you. There are a couple choices: 1) mimicking an
insertion sort when enqueuing a node, finding the correct position for the node and shifting everything
back, or 2) using a min heap to serve as the priority queue. Why a min heap? Because we want nodes
with lower frequencies to be dequeued first. The lower the frequency of a node, the higher its priority.
Your priority queue, no matter the implementation, must fulfill the interface that will be supplied to you
in pq.h.
5.1 PriorityQueue *pq_create(uint32_t capacity)
The constructor for a priority queue. The priority queue’s maximum capacity is specified by capacity.
5.2 void pq_delete(PriorityQueue **q)
The destructor for a priority queue. Make sure to set the pointer to NULL after freeing the memory for a
priority queue.
5.3 bool pq_empty(PriorityQueue *q)
Returns true if the priority queue is empty and false otherwise.
5.4 bool pq_full(PriorityQueue *q)
Returns true if the priority queue is full and false otherwise.
5.5 uint32_t pq_size(PriorityQueue *q)
Returns the number of items currently in the priority queue.
© 2021 Darrell Long 4
5.6 bool enqueue(PriorityQueue *q, Node *n)
Enqueues a node into the priority queue. Returns false if the priority queue is full prior to enqueuing
the node and true otherwise to indicate the successful enqueuing of the node.
5.7 bool dequeue(PriorityQueue *q, Node **n)
Dequeues a node from the priority queue, passing it back through the double pointer n. The node dequeued should have the highest priority over all the nodes in the priority queue. Returns false if the
priority queue is empty prior to dequeuing a node and true otherwise to indicate the successful dequeuing of a node.
5.8 void pq_print(PriorityQueue *q)
A debug function to print a priority queue. This function will be significantly easier to implement if your
enqueue() function always ensures a total ordering over all nodes in the priority queue. Enqueuing
nodes in a insertion-sort-like fashion will provide such an ordering. Implementing your priority queue
as a heap, however, will only provide a partial ordering, and thus will require more work in printing to
assure you that your priority queue functions as expected (you will be displaying a tree).
6 Codes
After constructing a Huffman tree, you will need to maintain a stack of bits while traversing the tree in
order to create a code for each symbol. We will create a new ADT, a Code, that represents a stack of bits.
1 typedef struct Code {
2 uint32_t top ;
3 uint8_t bits [ MAX_CODE_SIZE ];
4 } Code ;
The struct definition of a Code will be made transparent. This is done for the sole purpose of being
able to pass a struct by value. The macro MAX_CODE_SIZE reflects the maximum number of bytes
needed to store any valid code. The definition of this macro — and other macros — will be given in
defines.h. You will need to combine your knowledge of bit vectors and stacks in order to implement
this ADT. The interface, given in code.h, is defined in the the following subsections.
© 2021 Darrell Long 5
Macros defined in defines.h
1 // 4KB blocks .
2 # define BLOCK 4096
3
4 // ASCII + Extended ASCII .
5 # define ALPHABET 256
6
7 // 32 – bit magic number .
8 # define MAGIC 0 xDEADBEEF
9
10 // Bytes for a maximum , 256 – bit code .
11 # define MAX_CODE_SIZE ( ALPHABET / 8)
12
13 // Maximum Huffman tree dump size .
14 # define MAX_TREE_SIZE (3 * ALPHABET – 1)
6.1 Code code_init(void)
You will immediately notice that this “constructor” function is unlike any of the other constructor functions you have implemented in the past. You may also have noticed, if you glanced slightly ahead, that
there is no corresponding destructor function. This is an engineering decision that was made when considering the constraints of the Huffman coding algorithm.
This function will not require any dynamic memory allocation. You will simply create a new Code on
the stack, setting top to 0, and zeroing out the array of bits, bits. The initialized Code is then returned.
6.2 uint32_t code_size(Code *c)
Returns the size of the Code, which is exactly the number of bits pushed onto the Code.
6.3 bool code_empty(Code *c)
Returns true if the Code is empty and false otherwise.
6.4 bool code_full(Code *c)
Returns true if the Code is full and false otherwise. The maximum length of a code in bits is 256, which
we have defined using the macro ALPHABET. Why 256? Because there are exactly 256 ASCII characters
(including the extended ASCII).
6.5 bool code_push_bit(Code *c, uint8_t bit)
Pushes a bit onto the Code. The value of the bit to push is given by bit. Returns false if the Code is full
prior to pushing a bit and true otherwise to indicate the successful pushing of a bit.
© 2021 Darrell Long 6
6.6 bool code_pop_bit(Code *c, uint8_t *bit)
Pops a bit off the Code. The value of the popped bit is passed back with the pointer bit. Returns false if
the Code is empty prior to popping a bit and true otherwise to indicate the successful popping of a bit.
6.7 void code_print(Code *c)
A debug function to help you verify whether or not bits are pushed onto and popped off a Code correctly.
7 I/O
Now that we have covered all the essential ADTs necessary for the encoder, we will discuss I/O. Instead
of the buffered I/O functions from <stdio.h> that you have become acquainted with in previous assignments, we will use low-level system calls (syscalls) such as read(), write(), open() and close().
The former two functions can be included with <unistd.h> and the latter two can be included with
<fcntl.h>. Functions defined by the following I/O module will be used by both the encoder and decoder. The interface for the I/O module will be supplied in io.h.
7.1 int read_bytes(int infile, uint8_t *buf, int nbytes)
This will be a useful wrapper function to perform reads. As you may know, the read() syscall does not
always guarantee that it will read all the bytes specified (as is the case with pipes). For example, a call
could be issued to read a a block of bytes, but it might only read part of a block. So, we write a wrapper
function to loop calls to read() until we have either read all the bytes that were specified (nbytes) into
the byte buffer buf, or there are no more bytes to read. The number of bytes that were read from the
input file descriptor, infile, is returned. You should use this function whenever you need to perform a
read.
7.2 int write_bytes(int outfile, uint8_t *buf, int nbytes)
This functions is very much the same as read_bytes(), except that it is for looping calls to write().
As you may imagine, write() is not guaranteed to write out all the specified bytes (nbytes), and so we
must loop until we have either written out all the bytes specified from the byte buffer buf, or no bytes
were written. The number of bytes written out to the output file descriptor, outfile, is returned. You
should use this function whenever you need to perform a write.
7.3 bool read_bit(int infile, uint8_t *bit)
You should all know by now that it is not possible to read a single bit from a file. What you can do,
however, is read in a block of bytes into a buffer and dole out bits one at a time. Whenever all the bits in
the buffer have been doled out, you can simply fill the buffer back up again with bytes from infile. This
is exactly what you will do in this function. You will maintain a static buffer of bytes and an index into
the buffer that tracks which bit to return through the pointer bit. The buffer will store BLOCK number of
bytes, where BLOCK is yet another macro defined in defines.h. This function returns false if there are
no more bits that can be read and true if there are still bits to read. It may help to treat the buffer as a bit
vector.
© 2021 Darrell Long 7
7.4 void write_code(int outfile, Code *c)
The same bit-buffering logic used in read_bit() will be used in here as well. This function will also
make use of a static buffer (we recommend this buffer to be static to the file, not just this function) and
an index. Each bit in the code c will be buffered into the buffer. The bits will be buffered starting from the
0
th bit in c. When the buffer of BLOCK bytes is filled with bits, write the contents of the buffer to outfile.
7.5 void flush_codes(int outfile)
It is not always guaranteed that the buffered codes will align nicely with a block, which means that it is
possible to have bits leftover in the buffer used by write_code() after the input file has been completely
encoded. The sole purpose of this function is to write out any leftover, buffered bits. Make sure that any
extra bits in the last byte are zeroed before flushing the codes.
8 Stacks
You will need to use a stack in your decoder to reconstruct a Huffman tree. The interface of the stack
should be familiar from assignments 3 and 4. The difference is that the stack this time around will store
nodes. The interface for the stack is defined in stack.h.
1 struct Stack {
2 uint32_t top ;
3 uint32_t capacity ;
4 Node ** items ;
5 };
8.1 Stack *stack_create(uint32_t capacity)
The constructor for a stack. The maximum number of nodes the stack can hold is specified by capacity.
8.2 void stack_delete(Stack **s)
The destructor for a stack. Remember to set the pointer to NULL after you free the memory allocated by
the stack.
8.3 bool stack_empty(Stack *s)
Returns true if the stack is empty and false otherwise.
8.4 bool stack_full(Stack *s)
Returns true if the stack is full and false otherwise.
8.5 uint32_t stack_size(Stack *s)
Returns the number of nodes in the stack.
© 2021 Darrell Long 8
8.6 bool stack_push(Stack *s, Node *n)
Pushes a node onto the stack. Returns false if the stack is full prior to pushing the node and true
otherwise to indicate the successful pushing of a node.
8.7 bool stack_pop(Stack *s, Node **n)
Pops a node off the stack, passing it back through the double pointer n. Returns false if the stack is
empty prior to popping a node and true otherwise to indicate the succesuccessful popping of a node.
8.8 void stack_print(Stack *s)
A debug function to print the contents of a stack.
9 A Huffman Coding Module
A interface for a Huffman coding module that you will need to implement will be given in huffman.h.
Do not worry if you do not initially understand the exact purpose of each function, as they will be clarified in §10 and §11. The interface is just given now as a reference for which functions are used in the
aforementioned sections.
9.1 Node *build_tree(uint64_t hist[static ALPHABET])
Constructs a Huffman tree given a computed histogram. The histogram will have ALPHABET indices,
one index for each possible symbol. Returns the root node of the constructed tree. The use of static
array indices in parameter declaration is a C99 addition. In this case, it informs the compiler that the
histogram hist should have at least ALPHABET number of indices.
9.2 void build_codes(Node *root, Code table[static ALPHABET])
Populates a code table, building the code for each symbols in the Huffman tree. The constructed codes
are copied to the code table, table, which has ALPHABET indices, one index for each possible symbol.
9.3 Node *rebuild_tree(uint16_t nbytes, uint8_t tree_dump[static nbytes])
Reconstructs a Huffman tree given its post-order tree dump stored in the array tree_dump. The length
in bytes of tree_dump is given by nbytes. Returns the root node of the reconstructed tree.
9.4 void delete_tree(Node **root)
The destructor for a Huffman tree. This will require a post-order traversal of the tree to free all the nodes.
Remember to set the pointer to NULL after you are finished freeing all the allocated memory.
© 2021 Darrell Long 9
10 Specifics for the Encoder
For this section, the input file to compress will be referred to as infile and the compressed output
file as outfile. Your encoder, after parsing command-line options with getopt(), must perform the
following steps exactly to produce the correct Huffman encoding:
1. Read through infile to construct a histogram. Your histogram should be a simple array of 256
(ALPHABET) uint64_ts.
2. Increment the count of element 0 and element 255 by one in the histogram. This is so that at the
very minimum, the histogram will have two elements present. Do this regardless of what you read
in. While doing this may result in a slightly sub-optimal Huffman tree later on, it is a quick and
clean solution to handling the case when a file has no bytes or contains only one unique symbol.
3. Construct the Huffman tree using a priority queue. This will be done using build_tree().
(a) Create a priority queue. For each symbol histogram where its frequency is greater than 0
(there should be at minimum two elements because of step 2), create a corresponding Node
and insert it into the priority queue.
(b) While there are two or more nodes in the priority queue, dequeue two nodes. The first dequeued node will be the left child node. The second dequeued node will be the right child
node. Join these nodes together using node_join() and enqueue the joined parent node.
The frequency of the parent node is the sum of its left child’s frequency and its right child’s
frequency.
(c) Eventually, there will only be one node left in the priority queue. This node is the root of the
constructed Huffman tree.
4. Construct a code table by traversing the Huffman tree. This will be done using build_codes().
The code table is a simple array of 256 (ALPHABET) Codes.
(a) Create a new Code c using code_init(). Starting at the root of the Huffman tree, perform a
post-order traversal.
(b) If the current node is a leaf, the current code c represents the path to the node, and thus is
the code for the node’s symbol. Save this code into code table.
(c) Else, the current node must be an interior node. Push a 0 to c and recurse down the left link.
(d) After you return from the left link, pop a bit from c, push a 1 to c and recurse down the right
link. Remember to pop a bit from c when you return from the right link.
5. Construct a header. A header is defined by the following struct definition, which will be supplied
to you in header.h:
1 typedef struct Header {
2 uint32_t magic ; // 32 – bit magic number .
3 uint16_t permissions ; // Input file permissions .
4 uint16_t tree_size ; // Emitted tree size in bytes .
5 uint64_t file_size ; // Input file size .
6 } Header ;
© 2021 Darrell Long 10
The header’s magic number field, magic, should be set to the macro MAGIC, as defined in defines.h.
This magic number identifies a file as one which has been compressed using your encoder. It is
crucial that you use this magic number and nothing else.
The header’s permissions field stores the original permission bits of infile. You can get these
permissions by using fstat(). You should also set the permissions of outfile to match the permissions of infile using fchmod().
The header’s tree_size field represents the number of bytes that make up the Huffman tree
dump. This size is calculated as (3×unique symbols)−1.
Finally, the header’s file_size field is the size in bytes of the file to compress, infile. You get
this size using fstat() as well.
6. Write the constructed header to outfile.
7. Perform a post-order traversal of the Huffman tree to write the tree to the outfile. This should
write an ‘L’ followed by the byte of the symbol for each leaf, and an ‘I’ for interior nodes. You
should not write a symbol for an interior node.
8. Starting at the beginning of infile, write the corresponding code for each symbol to outfile with
write_code(). When finished with all the symbols, make sure to flush any remaining buffered
codes with flush_codes().
9. Close infile and outfile.
11 Specifics for the Decoder
For this section, the input file to decompress will be referred to as infile and the compressed output
file as outfile. Your decoder, after parsing command-line options with getopt(), must perform the
following steps exactly to decode the file:
1. Read in the header from infile and verify the magic number. If the magic number does not match
0xDEADBEEF (defined as MAGIC in defines.h), then an invalid file was passed to your program.
Display a helpful error message and quit.
2. The permissions field in the header indicates the permissions that outfile should be set to. Set
the permissions using fchmod().
3. The size of the dumped tree is given by the tree_size field in the header. Read the dumped tree
from infile into an array that is tree_size bytes long. Then, reconstruct the Huffman tree using
rebuild_tree().
(a) The array containing the dumped tree will be referred to as tree_dump. The length of this
array will be nbytes. A stack of nodes will be needed to reconstruct the tree.
(b) Iterate over the contents tree_dump from 0 to nbytes.
(c) If the element of the array is an ‘L’, then the next element will be the symbol for the leaf
node. Use that symbol to create a new node with node_create(). Push the created node
onto the stack.
© 2021 Darrell Long 11
(d) If the element of the array is an ‘I’, then you have encountered an interior node. Pop the
stack once to get the right child of the interior node, then pop again to get the left child
of the interior node. Note: the pop order is important. Join the left and right nodes with
node_join() and push the joined parent node on the stack.
(e) There will be one node left in the stack after you finish iterating over the contents tree_dump.
This node is the root of the Huffman tree.
4. Read infile one bit at a time using read_bit(). You will be traversing down the tree one link at
a time for each bit that is read.
(a) Begin at the root of the Huffman tree. If a bit of value 0 is read, then walk down to the left
child of the current node. Else, if a bit of value 1 is read, then walk down to the right child of
the current node.
(b) If you find yourself at a leaf node, then write the leaf node’s symbol to outfile. Note: you
may alternatively buffer these symbols and write out the buffer whenever it is filled (this will
be more efficient). After writing the symbol, reset the current node back to the root of the
tree.
(c) Repeat until the number of decoded symbols matches the original file size, which is given by
the file_size field in the header that was read from infile.
5. Close infile and outfile.
12 Deliverables
You will need to turn in:
1. encode.c: This file will contain your implementation of the Huffman encoder.
2. decode.c: This file will contain your implementation of the Huffman decoder.
3. entropy.c: This file will be provided in the resources repository, but should be included in your
repo as well. You must not modify this file.
4. defines.h: This file will contain the macro definitions used throughout the assignment. You may
not modify this file.
5. header.h: This will will contain the struct definition for a file header. You may not modify this
file.
6. node.h: This file will contain the node ADT interface. This file will be provided. You may not
modify this file.
7. node.c: This file will contain your implementation of the node ADT.
8. pq.h: This file will contain the priority queue ADT interface. This file will be provided. You may
not modify this file.
© 2021 Darrell Long 12
9. pq.c: This file will contain your implementation of the priority queue ADT. You must define your
priority queue struct in this file.
10. code.h: This file will contain the code ADT interface. This file will be provided. You may not
modify this file.
11. code.c: This file will contain your implementation of the code ADT.
12. io.h: This file will contain the I/O module interface. This file will be provided. You may not modify
this file.
13. io.c: This file will contain your implementation of the I/O module.
14. stack.h: This file will contain the stack ADT interface. This file will be provided. You may not
modify this file.
15. stack.c: This file will contain your implementation of the stack ADT. You must define your stack
struct in this file.
16. huffman.h: This file will contain the Huffman coding module interface. This file will be provided.
You may not modify this file.
17. huffman.c: This file will contain your implementation of the Huffman coding module interface.
18. Makefile: This is a file that will allow the grader to type make to compile your programs.
• CC=clang must be specified.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• make should build the encoder, the decoder, and the supplied entropy-measure program, as
should make all.
• make encode should build just the encoder.
• make decode should build just the decoder.
• make entropy should build just the supplied entropy-measure program.
• make clean must remove all files that are compiler generated.
• make format should format all your source code, including the header files.
19. Your code must pass scan-build cleanly. If there are any bugs or errors that are false positives,
document them and explain why they are false positives in your README.md.
20. README.md: This must be in Markdown. This must describe how to build and run your program.
21. DESIGN.pdf: This must be a PDF. The design document should answer the pre-lab questions,
describe the purpose of your program, and communicate its overall design with enough detail such
that a sufficiently knowledgeable programmer would be able to replicate your implementation.
This does not mean copying your entire program in verbatim. You should instead describe how
your program works with supporting pseudocode. C code is not considered pseudocode.
© 2021 Darrell Long 13
13 Submission
To submit your assignment through git, refer to the steps shown in asgn0 Remember: add, commit, and
push! Your assignment is turned in only after you have pushed and submitted the commit ID to Canvas.
Your design document is turned in only after you have pushed and submitted the commit ID to Canvas.
If you forget to push, you have not turned in your assignment and you will get a zero. “I forgot to push”
is not a valid excuse. It is highly recommended to commit and push your changes often.
14 Supplemental Readings
The more that you read, the more things you will know. The more that
you learn, the more places you’ll go.
—Dr. Seuss
• The C Programming Language by Kernighan & Ritchie
– Chapter 8
• Introduction to Algorithms by T. Cormen, C. Leiserson, R. Rivest, & C. Stein
– Chapter 2 §2.1
– Chapter 6 §6.5
– Chapter 10 §10.1
– Chapter 12 §12.1
– Chapter 16 §16.3
15 Strategy
I will, in fact, claim that the difference between a bad programmer and
a good one is whether he considers his code or his data structures more
important. Bad programmers worry about the code. Good
programmers worry about data structures and their relationships.
—Linus Torvalds
Let’s talk strategy.
First, develop the data structures that you will need. Draw pictures, work them out, implement them,
and test them. Test them again. A program called printtree will be supplied in the resources repository. It will help you determine whether or not you are constructing and dumping your Huffman trees
correctly.
Do things in small, incremental steps. Make a histogram. Test it with a small input file. Create a node
using a symbol in the histogram. Does that work? Good, now try putting a tree together. Do the same for
the rest of the data structures. You’ll thank yourself later down the road if you do this.
© 2021 Darrell Long 14
As with all assignments, a working encoder and decoder will be supplied in the resources repository.
Test if you can decode what the reference encoder encodes. Test if the reference decoder and decode
what your encoder encodes.
Build your toolkit. Build components. Test them.
Mafbalrrpyb py Z ph kpdn bpcpyb l rfydnx l zilpyhlj.

CSE 13S Assignment 7 The Great Firewall of Santa Cruz: Bloom Filters, Linked Lists and Hash Tables

1 Introduction
War is peace. Freedom is slavery. Ignorance is strength.
—George Orwell, 1984
You have been selected through thoroughly democratic processes (and the machinations of your friend and hero
Ernst Blofeld) to be the Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz following the failure of the short-lived anarcho-syndicalist commune, where each person in turn acted as a form of executive officer
for the week but all the decisions of that officer have to be ratified at a special bi-weekly meeting by a simple majority in the case of purely internal affairs, but by a two-thirds majority in the case of more major decisions. Where
it was understood that strange women lying in ponds distributing swords is no basis for a system of government.
Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. It
was a very silly place, and so with Herr Blofeld’s assistance you are now the leader of your people.
In order to promote virtue and prevent vice, and to preserve social cohesion and discourage unrest, you have
decided that the Internet content must be filtered so that your beloved children are not corrupted through the use
of unfortunate, hurtful, offensive, and far too descriptive language.
2 Bloom Filters
The Ministry of Peace concerns itself with war, the Ministry of Truth with lies, the Ministry of
Love with torture and the Ministry of Plenty with starvation. These contradictions are not
accidental, nor do they result from from ordinary hypocrisy: they are deliberate exercises in
doublethink.
—George Orwell, 1984
The Internet is very large, very fast, and full of badthink. The masses spend their days sending each other cat videos
and communicating through their beloved social media platforms: Twitter and Discord. To your dismay, you find
that a portion of the masses frequently use words you deem improper—oldspeak. You decide, as the newly elected
Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz (GPRSC), that a more neutral newspeak
is required to keep your citizens of the GPRSC content, pure, and from thinking too much. But how do you process
and store so many words as they flow in and out of the GPRSC at 10Gbits/second? The solution comes to your
brilliant and pure mind—a Bloom filter.
© 2021 Darrell Long 1
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton H. Bloom in 1970, and is
used to test whether an element is a member of a set. False-positive matches are possible, but false negatives are
not—in other words, a query for set membership returns either “possibly in the set” or “definitely not in the set.”
Elements can be added to the set but not removed from it; the more elements added, the higher the probability of
false positives.
A Bloom filter can be represented as an array of m bits, or a bit vector. A Bloom filter should
utilize k different hash functions. Using these hash functions, a set element added to the Bloom
filter is mapped to at most k of the m bit indices, generating a uniform pseudo-random distribution. Typically, k is a small constant which depends on the desired false error rate ², while m is
proportional to k and the number of elements to be added.
Assume you are adding a word w to your Bloom filter and are using k = 3 hash functions,
f (x), g (x), and h(x). To add w to the Bloom filter, you simply set the bits at indices f (w), g (w),
and h(w). To check if some word w
0 has been added to the same Bloom filter, you check if the
bits at indices f (w
0
), g (w
0
), and h(w
0
) are set. If they are all set, then w
0 has most likely been
added to the Bloom filter. If any one of those bits was cleared, then w
0 has definitely not been added to the Bloom
filter. The fact that the Bloom filter can only tell if some word has most likely been added to the Bloom filter means
that false positives can occur. The larger the Bloom filter, the lower the chances of getting false positives.
So what do Bloom filters mean for you as the Dear and Beloved Leader? It means you can take a list of proscribed words, oldspeak and add each word into your Bloom filter. If any of the words that your citizens use seem
to be added to the Bloom filter, then this is very ungood and further action must be taken to discern whether or not
the citizen did transgress. You decide to implement a Bloom filter with three salts forthree different hash functions.
Why? To reduce the chance of a false positive.
You can think of a “salt” as an initialization vector or a key. Using different salts with the same hash function
results in a different, unique hash. Since you are equipping your Bloom filter with three different salts, you are
effectively getting three different hash functions: f (x), g (x), and h(x). Hashing a word w, with extremely high
probability, should result in f (w) 6= g (w) 6= h(w). These salts are to be used for the SPECK cipher, which requires
a 128-bit key, so we have used MD5 1
“message-digest” to reduce three books down to 128 bits each. You will use
the SPECK cipher as a hash function, which will be discussed in §4.
2.1 BloomFilter
The following struct defines the BloomFilter ADT. The three salts will be stored in the primary, secondary,
and tertiary fields. Each salt is 128 bits in size. To hold these 128 bits, we use an array of two uint64_ts. This
struct definition must go in bf.c.
1 struct BloomFilter {
2 uint64_t primary [2]; // Primary hash function salt .
3 uint64_t secondary [2]; // Secondary hash function salt .
4 uint64_t tertiary [2]; // Tertiary hash function salt .
5 BitVector * filter ;
6 };
2.2 BloomFilter *bf_create(uint32_t size)
The constructor for a Bloom filter. Working code for the constructor, complete with primary, secondary, and tertiary salts that you will use, is shown below. Note that you will also have to implement the bit vector ADT for your
Bloom filter, as it will serve as the array of bits necessary for a proper Bloom filter.
1Rivest, R.. “The MD5 Message-Digest Algorithm.” RFC 1321 (1992): 1-21.
© 2021 Darrell Long 2
1 BloomFilter * bf_create ( uint32_t size ) {
2 BloomFilter *bf = ( BloomFilter *) malloc ( sizeof ( BloomFilter ) ) ;
3 if (bf) {
4 // Grimm ‘s Fairy Tales
5 bf – > primary [0] = 0 x5adf08ae86d36f21 ;
6 bf – > primary [1] = 0 xa267bbd3116f3957 ;
7 // The Adventures of Sherlock Holmes
8 bf – > secondary [0] = 0 x419d292ea2ffd49e ;
9 bf – > secondary [1] = 0 x09601433057d5786 ;
10 // The Strange Case of Dr. Jekyll and Mr. Hyde
11 bf – > tertiary [0] = 0 x50d8bb08de3818df ;
12 bf – > tertiary [1] = 0 x4deaae187c16ae1d ;
13 bf – > filter = bv_create ( size ) ;
14 if (!bf – > filter ) {
15 free (bf) ;
16 bf = NULL ;
17 }
18 }
19 return bf;
20 }
2.3 void bf_delete(BloomFilter **bf)
The destructor for a Bloom filter. As with all other destructors, it should free any memory allocated by the constructor and null out the pointer that was passed in.
2.4 uint32_t bf_size(BloomFilter *bf)
Returns the size of the Bloom filter. In other words, the number of bits that the Bloom filter can access. Hint: this
is the length of the underlying bit vector.
2.5 void bf_insert(BloomFilter *bf, char *oldspeak)
Takes oldspeak and inserts it into the Bloom filter. This entails hashing oldspeak with each of the three salts for
three indices, and setting the bits at those indices in the underlying bit vector.
2.6 bool bf_probe(BloomFilter *bf, char *oldspeak)
Probes the Bloom filter for oldspeak. Like with bf_insert(), oldspeak is hashed with each of the three salts for
three indices. If all the bits at those indices are set, return true to signify that oldspeak was most likely added to
the Bloom filter. Else, return false.
2.7 uint32_t bf_count(BloomFilter *bf)
Returns the number of set bits in the Bloom filter.
2.8 void bf_print(BloomFilter *bf)
A debug function to print out a Bloom filter.
© 2021 Darrell Long 3
3 Hashing with the SPECK Cipher
You will need a good hash function to use in your Bloom filter and hash table (discussed in §5). We will have
discussed hash functions in lecture, and rather than risk having a poor one implemented, we will simply provide
you one. The SPECK2 block cipher is provided for use as a hash function.
SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June
2013. SPECK has been optimized for performance in software implementations, while its sister algorithm, SIMON,
has been optimized for hardware implementations. SPECK is an add-rotate-xor (ARX) cipher. The reason a cipher
is used for this is because encryption generates random output given some input; exactly what we want for a hash.
Encryption is the process of taking some file you wish to protect, usually called plaintext, and transforming its
data such that only authorized parties can access it. This transformed data is referred to as ciphertext. Decryption
is the inverse operation of encryption, taking the ciphertext and transforming the encrypted data back to its original state as found in the original plaintext. Encryption algorithms that utilize the same key for both encryption and
decryption, like SPECK, are symmetric-key algorithms, and algorithms that don’t, such as RSA, are asymmetric-key
algorithms.
You will be given two files, speck.h and speck.c. The former will provide the interface to using the SPECK
hash function which has been named hash(), and the latter contains the implementation. The hash function
hash() takes two parameters: a 128-bit salt passed in the form of an array of two uint64_ts, and a key to hash.
The function will return a uint32_t which is exactly the index the key is mapped to.
1 uint32_t hash ( uint64_t salt [] , char * key ) ;
4 Bit Vectors
Symmetrical equations are good in their place, but ‘vector’ is a useless survival, or offshoot from
quaternions, and has never been of the slightest use to any creature.
—Lord Kelvin
You will reuse much of the bit vector implementation from assignment 5 for this assignment. If there were any
problems with your implementation, make sure to fix them here.
1 struct BitVector {
2 uint32_t length ;
3 uint8_t * vector ;
4 };
4.1 BitVector *bv_create(uint32_t length)
The constructor for a bit vector. In the even that sufficient memory cannot be allocated, the function must return
NULL. Else, it must return a BitVector *), or a pointer to an allocated BitVector. Each bit of the bit vector
should be initialized to 0.
4.2 void bv_delete(BitVector **bv)
The destructor for a bit vector. Remember to set the pointer to NULL after the memory associated with the bit
vector is freed.
2Ray Beaulieu, Stefan Treatman-Clark, Douglas Shors, Bryan Weeks, Jason Smith, and Louis Wingers, “The SIMON and SPECK lightweight
block ciphers.” In Proceedings of the 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6. IEEE, 2015.
© 2021 Darrell Long 4
4.3 uint32_t bv_length(BitVector *bv)
Returns the length of a bit vector.
4.4 void bv_set_bit(BitVector *bv, uint32_t i)
Sets the i
th bit in a bit vector. To set a bit in a bit vector, it is necessary to first locate the byte in which the i
th resides.
The location of the byte is calculated as i / 8. The position of the bit in that byte is calculated as i % 8. Setting a
bit in a bit vector should not modify any of the other bits.
4.5 void bv_clr_bit(BitVector *bv, uint32_t i)
Clears the i
th bit in the bit vector.
4.6 uint8_t bv_get_bit(BitVector *bv, uint32_t i)
Returns the i
th bit in the bit vector.
4.7 void bv_print(BitVector *bv)
A debug function to print a bit vector. Write this immediately after the constructor.
5 Hash Tables
To send men to the firing squad, judicial proof is unnecessary . . . These procedures are an archaic
bourgeois detail.
Ernesto Ché Guevara
Armed with a Bloom filter, you now exercise the power to catch and punish those who practice wrongthink and
continue to use oldspeak. It comes to mind however, that a Bloom filter is probabilistic and that it is better to
exercise mercy and counsel the oldspeakers so that they may atone and use newspeak. To remedy this, another
solution pops into your brilliant and pure mind—a hash table.
A hash table is a data structure that maps keys to values and provides fast, O(1), look-up times.
It does so typically by taking a key k, hashing it with some hash function h(x), and placing the
key’s corresponding value in an underlying array at index h(k). This is the perfect way not only to
store translations from oldspeak to newspeak, but also as a way to store all prohibited oldspeak
words without newspeak translations. We will refer to oldspeak without newspeak translations
as badspeak. So what happens when two oldspeak words have the same hash value? This is called
a hash collision, and must be resolved. Rather than doing open addressing (as will be discussed
in lecture), we will be using linked lists to resolve oldspeak hash collisions.
5.1 HashTable
Below is the struct definition for a hash table. Similar to a Bloom filter, a hash table contains a salt which is passed
to hash() whenever a new oldspeak entry is being inserted. As mentioned in §5, we will be using linked lists to
resolve oldspeak hash collisions, which is why a hash table contains an array of linked lists. The mtf field indicates
whether or not the linked lists should use the move-to-front technique, which will be discussed in §5.6.
© 2021 Darrell Long 5
1 struct HashTable {
2 uint64_t salt [2];
3 uint32_t size ;
4 bool mtf ;
5 LinkedList ** lists ;
6 };
5.2 HashTable *ht_create(uint32_t size, bool mtf)
The constructor for a hash table. The size parameter denotes the number of indices, or linked lists, that the hash
table can index up to. The salt for the hash table has been supplied in the constructor as well.
1 HashTable * ht_create ( uint32_t size , bool mtf ) {
2 HashTable *ht = ( HashTable *) malloc ( sizeof ( HashTable ) ) ;
3 if (ht) {
4 // Leviathan
5 ht – > salt [0] = 0 x9846e4f157fe8840 ;
6 ht – > salt [1] = 0 xc5f318d7e055afb8 ;
7 ht – > size = size ;
8 ht – >mtf = mtf ;
9 ht – > lists = ( LinkedList **) calloc (size , sizeof ( LinkedList *) ) ;
10 if (!ht – > lists ) {
11 free (ht) ;
12 ht = NULL ;
13 }
14 }
15 return ht;
16 }
5.3 void ht_delete(HashTable **ht)
The destructor for a hash table. Each of the linked lists in lists, the underlying array of linked lists, is freed. The
pointer that was passed in should be set to NULL.
5.4 uint32_t ht_size(HashTable *ht)
Returns the hash table’s size.
5.5 Node *ht_lookup(HashTable *ht, char *oldspeak)
Searches for an entry, a node, in the hash table that contains oldspeak. A node stores oldspeak and its newspeak
translation. The index of the linked list to perform a look-up on is calculated by hashing the oldspeak. If the node
is found, the pointer to the node is returned. Else, a NULL pointer is returned.
5.6 void ht_insert(HashTable *ht, char *oldspeak, char *newspeak)
Inserts the specified oldspeak and its corresponding newspeak translation into the hash table. The index of the
linked list to insert into is calculated by hashing the oldspeak. If the linked list that should be inserted into hasn’t
been initialized yet, create it first before inserting the oldspeak and newspeak.
© 2021 Darrell Long 6
5.7 uint32_t ht_count(HashTable *ht)
Returns the number of non-NULL linked lists in the hash table.
5.8 void ht_print(HashTable *ht)
A debug function to print out the contents of a hash table. Write this immediately after the constructor.
6 Linked Lists
Education is a weapon, whose effect depends on who holds it in his hands and at whom it is
aimed.
—Joseph Stalin
A linked list will be used to resolve hash collisions. Each node of the linked list contains oldspeak and its newspeak
translation if it exists. The key to search with in the linked list is oldspeak. Each node will also contain a pointer
to the previous node and the next node in the linked list. This is because you will be implementing doubly linked
lists.
6.1 Node
A node is defined with the following struct:
1 struct Node {
2 char * oldspeak ;
3 char * newspeak ;
4 Node * next ;
5 Node * prev ;
6 };
If the newspeak field is NULL, then the oldspeak contained in this node is badspeak, since there is no newspeak
translation. This struct definition and interface for the node ADT will be provided for you in node.h. The node
ADT is not opaque in order to simplify the rest of the linked list implementation.
6.2 Node *node_create(char *oldspeak, char *newspeak)
The constructor for a node. You will want to make a copy of the oldspeak and its newspeak translation that
are passed in. What this means is allocating memory and copying over the characters for both oldspeak and
newspeak. There was a function strdup() that did precisely that, but it has been deprecated. You will find it
helpful to implement your own function to mimic strdup().
6.3 void node_delete(Node **n)
The destructor for a node. Only the node n is freed. The previous and next nodes that n points to are not deleted.
Since you have allocated memory for oldspeak and newspeak, remember to free the memory allocated to both of
those as well. The pointer to the node should be set to NULL.
© 2021 Darrell Long 7
6.4 void node_print(Node *n)
While helpful as debug function, you will use this function to produce correct program output. Thus, it is imperative that you print out the contents of a node in the following manner:
• If the node n contains oldspeak and newspeak, print out the node with this print statement:
1 printf (“%s -> %s\n”, n- > oldspeak , n-> newspeak ) ;
• If the node n contains only oldspeak, meaning that newspeak is null, then print out the node with this print
statement:
1 printf (“%s\n”, n- > oldspeak ) ;
6.5 LinkedList
The struct definition of a linked list is given below. A linked list, when constructed, will initially have two sentinel
nodes, which will be further explained in §6.6 where the constructor function is discussed along with the rationale and usage of the sentinel nodes. The field mtf signifies the move-to-front technique, which will be further
explained in §6.6 and §6.10. The linked list interface, defined in ll.h, will contain the declaration of two extern
variables: seeks and links. These two variables should be defined in ll.c. You will use these variables to calculate statistics, where seeks tracks the number of linked list lookups performed, and links counts the total number
of links traversed.
1 struct LinkedList {
2 uint32_t length ;
3 Node * head ; // Head sentinel node .
4 Node * tail ; // Tail sentinel node .
5 bool mtf ;
6 };
6.6 LinkedList *ll_create(bool mtf)
The constructor for a linked list. The only parameter that this function takes is a boolean, mtf. If mtf is true, that
means any node that is found in the linked list through a look-up is moved to the front of the linked list. To simplify
the code to insert nodes into the linked list, your linked lists will be initialized with exactly two sentinel nodes. They
will serve as the head and the tail of the linked list.
A linked list that is not initialized with sentinel nodes exhibits two different cases when inserting a node. The
first case is when the linked list is empty and contains no nodes. This means the inserted node becomes the head
of the linked list. The second case is when the linked list contains at least one node, which means the inserted
node becomes the new head of the linked list and must point to the old head of the linked list. The old head of the
linked list must also point back to the new head to preserve the properties of a doubly linked list.
Using two sentinel nodes reduces the insertion cases down to one: every inserted node will go between two
nodes. Why? Because there are already two nodes to start with. The following diagrams showcase, in order, a
linked list when initialized, the linked list when a node containing x is inserted, and then the linked list when a
node containing y is inserted. Note that inserting a node into a linked list means inserting it at the head.
NULL head tail NULL
© 2021 Darrell Long 8
NULL head x tail NULL
NULL head y x tail NULL
6.7 void ll_delete(LinkedList **ll)
The destructor for a linked list. Each node in the linked list should be freed using node_delete(). The pointer to
the linked list should be set to NULL.
6.8 uint32_t ll_length(LinkedList *ll)
Returns the length of the linked list, which is equivalent to the number of nodes in the linked list, not including the
head and tail sentinel nodes.
6.9 Node *ll_lookup(LinkedList *ll, char *oldspeak)
Searches for a node containing oldspeak. If a node is found, the pointer to the node is returned. Else, a NULL
pointer is returned. If a node was found and the move-to-front option was specified when constructing the linked
list, then the found node is moved to the front of the linked list. The move-to-front technique decreases look-up
times for nodes that are frequently searched for. You will learn more about optimality in your future classes.
6.10 void ll_insert(LinkedList *ll, char *oldspeak, char *newspeak)
Inserts a new node containing the specified oldspeak and newspeak into the linked list. Before inserting the
node, a look-up is performed to make sure the linked list does not already contain a node containing a matching
oldspeak. If a duplicate exists, a new node is not inserted into the linked list. Else, the new node is inserted at the
head of the linked list. This means that the new node comes directly after the head sentinel node.
6.11 void ll_print(LinkedList *ll)
Prints out each node in the linked list except for the head and tail sentinel nodes. This will require the use of
node_print().
7 Lexical Analysis with Regular Expressions
Ideas are more powerful than guns. We would not let our
enemies have guns, why should we let them have ideas.
—Joseph Stalin
Back to regulating your citizens of the GPRSC. You will need a function to parse out the words that they speak,
which will be passed to you in the form of an input stream. The words that they will use are valid words, which
can include contractions and hyphenations. A valid word is any sequence of one or more characters that are part of
your regular expression word character set. Your word character set should contain characters from a–z, A–Z, 0–9,
and the underscore character. Since you also accept contractions like “don’t” and “y’all’ve” and hyphenations like
“pseudo-code” and “move-to-front”, your word character set should include apostrophes and hyphens as well.
You will need to write your own regular expression for a word, utilizing the regex.h library to lexically analyze
the input stream for words. You will be given a parsing module that lexically analyzes the input stream using
your regular expression. You are not required to use the module itself, but it is mandatory that you parse through
© 2021 Darrell Long 9
an input stream for words using at least one regular expression. The interface for the parsing module will be in
parser.h and its implementation will be in parser.c.
The function next_word() requires two inputs, the input stream infile, and a pointer to a compiled regular expression, word_regex. Notice the word compiled: you must first compile your regular expression using
regcomp() before passing it to the function. Make sure you remember to call the function clear_words() to free
any memory used by the module when you’re done reading in words. Here is a small program that prints out words
input to stdin using the parsing module. In the program, the regular expression for a word matches one or more
lowercase and uppercase letters. The regular expression you will have to write for your assignment will be more
complex than the one displayed here, as it is just an example.
Example program using the parsing module.
1 # include ” parser .h”
2 # include < regex .h >
3 # include < stdio .h >
4
5 # define WORD “[a-zA -Z]+”
6
7 int main ( void ) {
8 regex_t re;
9 if ( regcomp (&re , WORD , REG_EXTENDED ) ) {
10 fprintf ( stderr , ” Failed to compile regex .\n”) ;
11 return 1;
12 }
13
14 char * word = NULL ;
15 while (( word = next_word (stdin , &re) ) != NULL ) {
16 printf (” Word : %s\n”, word ) ;
17 }
18
19 clear_words () ;
20 regfree (& re) ;
21 return 0;
22 }
8 Your Task
The people will believe what the media tells them they believe.
—George Orwell
• Initialize your Bloom filter and hash table.
• Read in a list of badspeak words with fscanf(). Again, badspeak is simply oldspeak without a newspeak
translation. Badspeak is strictly forbidden. Each badspeak word should be added to the Bloom filter and
the hash table. The list of proscribed words will be in badspeak.txt, which can be found in the resources
repository.
• Read in a list of oldspeak and newspeak pairs with fscanf(). Only the oldspeak should be added to the
Bloom filter. The oldspeak and newspeak are added to the hash table. The list of oldspeak and newspeak
pairs will be in newspeak.txt, which can also be found in the resources repository.
© 2021 Darrell Long 10
• Now that the lexicon of badspeak and oldspeak/newspeak translations has been populated, you can start to
filter out words. Read words in from stdin using the supplied parsing module.
• For each word that is read in, check to see if it has been added to the Bloom filter. If it has not been added to
the Bloom filter, then no action is needed since the word isn’t a proscribed word.
• If the word has most likely been added to the Bloom filter, meaning bf_probe() returned true, then further
action needs to be taken.
1. If the hash table contains the word and the word does not have a newspeak translation, then the citizen
who used this word is guilty of thoughtcrime. Insert this badspeak word into a list of badspeak words
that the citizen used in order to notify them of their errors later.
2. If the hash table contains the word, and the word does have a newspeak translation, then the citizen
requires counseling on proper Rightspeak. Insert this oldspeak word into a list of oldspeak words with
newspeak translations in order to notify the citizen of the revisions needed to be made in order to
practice Rightspeak.
3. If the hash table does not contain the word, then all is good since the Bloom filter issued a false positive.
No disciplinary action needs to be taken.
• If the citizen is accused of thoughtcrime and requires counseling on proper Rightspeak, then they are given
a reprimanding mixspeak message notifying them of their transgressions and promptly sent off to joycamp.
The message should contain the list of badspeak words that were used followed by the list of oldspeak words
that were used with their proper newspeak translations.
Dear beloved citizen of the GPRSC,
We have some good news, and we have some bad news.
The good news is that there is bad news. The bad news is that you will
be sent to joycamp and subjected to a week-long destitute existence.
This is the penalty for using degenerate words, as well as using
oldspeak in place of newspeak. We hope you can correct your behavior.
Your transgressions, followed by the words you must think on:
kalamazoo
antidisestablishmentarianism
write -> papertalk
sad -> happy
read -> papertalk
music -> noise
liberty -> badfree
• If the citizen is accused solely of thoughtcrime, then they are issued a thoughtcrime message and also sent
off to joycamp. The badspeak message should contain the list of badspeak words that were used.
Dear beloved citizen of the GPRSC,
You have been caught using degenerate words that may cause
distress among the moral and upstanding citizens of the GPSRC.
As such, you will be sent to joycamp. It is there where you will
sit and reflect on the consequences of your choice in language.
Your transgressions:
© 2021 Darrell Long 11
kalamazoo
antidisestablishmentarianism
• If the citizen only requires counseling, then they are issued an encouraging goodspeak message. They will
read it, correct their wrongthink, and enjoy the rest of their stay in the GPRSC. The message should contain
the list of oldspeak words that were used with their proper newspeak translations.
Dear beloved citizen of the GPRSC,
We recognize your efforts in conforming to the language standards
of the GPSRC. Alas, you have been caught uttering questionable words
and thinking unpleasant thoughts. You must correct your wrongspeak
and badthink at once. Failure to do so will result in your deliverance
to joycamp.
Words that you must think on:
write -> papertalk
sad -> happy
read -> papertalk
music -> noise
liberty -> badfree
• Each of the messages are defined for you in messages.h. You may not modify this file.
• The list of the command-line options your program must support is listed below. Any combination of the
command-line options must be supported.
– -h prints out the program usage. Refer to the reference program in the resources repository for what to
print.
– -t size specifies that the hash table will have size entries (the default will be 10000).
– -f size specifies that the Bloom filter will have size entries (the default will be 220).
– -m will enable the move-to-front rule. By default, the move-to-front rule is disabled.
– -s will enable the printing of statistics to stdout. The statistics to calculate are the total number of
seeks, the average seek length, the hash table load, and the Bloom filter load. The calculations for the
latter three statistics are as follows:
Average seek length =
links
seeks
Hash table load = 100×
ht_count()
ht_size()
Bloom filter load = 100×
bf_count()
bf_size()
The hash table load and Bloom filter load should be printed with up to 6 digits of precision. Enabling
the printing of statistics should suppress all messages the program may otherwise print.
© 2021 Darrell Long 12
9 Deliverables
I would rather have questions that can’t be answered than
answers that can’t be questioned.
—Richard P. Feynman
You will need to turn in:
1. banhammer.c: This contains main() and may contain the other functions necessary to complete the assignment.
2. messages.h: Defines the mixspeak, badspeak, and goodspeak messages that are used in banhammer.c. Do
not modify this.
3. speck.h: Defines the interface for the hash function using the SPECK cipher. Do not modify this.
4. speck.c: Contains the implementation of the hash function using the SPECK cipher. Do not modify this.
5. ht.h: Defines the interface for the hash table ADT. Do not modify this.
6. ht.c: Contains the implementation of the hash table ADT.
7. ll.h: Defines the interface for the linked list ADT. Do not modify this.
8. ll.c: Contains the implementation of the linked list ADT.
9. node.h: Defines the interface for the node ADT. Do not modify this.
10. node.c: Contains the implementation of the node ADT.
11. bf.h: Defines the interface for the Bloom filter ADT. Do not modify this.
12. bf.c: Contains the implementation of the Bloom filter ADT.
13. bv.h: Defines the interface for the bit vector ADT. Do not modify this.
14. bv.c: Contains the implementation of the bit vector ADT.
15. parser.h: Defines the interface for the regex parsing module. Do not modify this.
16. parser.c: Contains the implementation of the regex parsing module.
17. You may have other source and header files, but do not make things overly complicated.
18. Makefile: This is a file that will allow the grader to type make to compile your program.
• CC=clang must be specified.
• CFLAGS=-Wall -Wextra -Werror -Wpedantic must be included.
• make should build the banhammer executable, as should make all.
• make clean must remove all files that are compiler generated.
• make format should format all your source code, including the header files.
19. Your code must pass scan-build cleanly. If there are any bugs or errors that are false positives, document
them and explain why they are false positives in your README.md.
20. README.md: This must be in Markdown. This must describe how to use your program and Makefile. This
includes listing and explaining the command-line options that your program accepts. Any false positives
reported by scan-build should go here as well.
© 2021 Darrell Long 13
21. DESIGN.pdf: This must be a PDF. The design document should describe your design for your program with
enough detail that a sufficiently knowledgeable programmer would be able to replicate your implementation. This does not mean copying your entire program in verbatim. You should instead describe how your
program works with supporting pseudocode. For this program, pay extra attention to how you build each
necessary component.
22. WRITEUP.pdf: This document must be a PDF. The writeup must include the following:
• Graphs comparing the total number of seeks and average seek length as you vary the hash table and
Bloom filter size.
– Do linked lists get longer?
– How does the number of links followed without using the move-to-front rule compare to the number of links followed using the rule?
– How does changing the Bloom filter size affect the number of lookups performed in the hash table?
• Analysis of the graphs you produce.
10 Submission
We can and must write in a language which sows among the
masses hate, revulsion, and scorn toward those who disagree
with us.
—Vladimir Lenin
To submit your assignment through git, refer to the steps shown in asgn0 Remember: add, commit, and push!
Your assignment is turned in only after you have pushed and submitted the commit ID to Canvas. Your design
document is turned in only after you have pushed and submitted the commit ID to Canvas. If you forget to push,
you have not turned in your assignment and you will get a zero. “I forgot to push” is not a valid excuse. It is highly
recommended to commit and push your changes often.
11 Supplemental Readings
The more you read, the more things you will know. The more
that you learn, the more places you’ll go.
—Dr. Seuss
• The C Programming Language by Kernighan & Ritchie
– Chapter 5 §5.7
– Chapter 7
• Introduction to Algorithms by T. Cormen, C. Leiserson, R. Rivest, & C. Stein
– Chapter 10 §10.2 (Linked lists)
– Chapter 11 (Hash tables and hash functions)
• Introduction to the Theory of Computation by M. Sipser
– Chapter 1 §1.3 (Regular expressions)