CPE 202 Project 1 to 4 solutions


CPE 202 Assignment 1

1 Permutations in Lexicographic Order

1.1 Specification
We are going to write a Python program to generate all the permutations of the characters in a string. This will
give you a chance to review some Python constructs (e.g. strings and lists) and also solidify your understanding of
recursion.

Your program must meet the following specifications. In a file named perm_lex.py, you are to write a Python
function perm_gen_lex that:
• Takes a string as a single input argument. You may assume that the input string will be 0 or more unique
lower-case letters in alphabetical order.

• Returns a Python list of strings where each string represents a permutation of the input string. The list of
permutations must be in lexicographic (i.e. dictionary) order.
Note: if you follow the pseudocode below, your list will be constructed such that this condition will be met.
Do not sort the list.

• Is well structured, commented, and easy to read. Contains a docstring explaining its purpose. Adheres to the
style in PEP 8.
• Is recursive and follows the pseudocode below.
perm_gen_lex(s)
    Input: A string s
    Output: A list of all permutations of the characters in s in lexicographic order

    for each character c in s do
        Form a simpler string, t, by removing c from s
        Generate all permutations of t recursively   # i.e., call perm_gen_lex(t)
        for each permutation p of t do
            Add c to the front of p and add the result to your list of permutations
    return the list of permutations
Note, my simple pseudocode never mentions a base case. Your code will need one to function correctly. Think
about what it should be and what you should return in that case.
My pseudocode also uses bad variable names (semi-intentionally). I expect better from your code.
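To make the pseudocode concrete, here is a minimal recursive sketch; the variable names and slicing details are my own choices, not part of the specification:

def perm_gen_lex(s: str) -> list:
    """Return all permutations of the characters in s in lexicographic order."""
    if len(s) == 0:                           # Base case: one permutation, the empty string.
        return ['']
    permutations = []
    for i, char in enumerate(s):
        rest = s[:i] + s[i + 1:]              # Simpler string with char removed.
        for perm in perm_gen_lex(rest):       # All permutations of the rest.
            permutations.append(char + perm)  # Put char back at the front.
    return permutations

Because the input characters are unique and already in alphabetical order, taking them left to right produces the permutations in lexicographic order without sorting.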

1.2 Examples
As an example of what this function does:
>>> perm_gen_lex(”)
[”]
>>> perm_gen_lex(‘a’)
[‘a’]
>>> perm_gen_lex(‘ab’)
[‘ab’, ‘ba’]
>>> perm_gen_lex(‘abc’)
[‘abc’, ‘acb’, ‘bac’, ‘bca’, ‘cab’, ‘cba’]
>>> perm_gen_lex(‘abcd ‘)
[‘abcd’, ‘abdc’, ‘acbd’, ‘acdb’, ‘adbc’, ‘adcb’, ‘bacd’, ‘badc’, ‘bcad’,
‘bcda’, ‘bdac’, ‘bdca’, ‘cabd’, ‘cadb’, ‘cbad’, ‘cbda’, ‘cdab’, ‘cdba’,
‘dabc’, ‘dacb’, ‘dbac’, ‘dbca’, ‘dcab’, ‘dcba ‘]

Note: For a string with n characters, your program will return a list containing n! strings. This will grow very
quickly. If your string contained every lowercase letter a to z (26 letters), the resulting list of permutations would
have 403,291,461,126,605,635,584,000,000 ≈ 4.03 × 10^26 elements. If you want your code to finish in your lifetime,
you should probably stick to shorter strings for testing.

2 Base Conversion

2.1 Specification
One algorithm for converting a base 10 number to base b involves repeated division by the base b. Initially, one
divides the number by b. The remainder from this division is the units digit (the rightmost digit) in the base b
representation of the number (it is the part of the number that contains no powers of b). The quotient is then
divided by b on the next iteration. The algorithm stops when the quotient is 0.

Note that at each iteration, the remainder from the division is the next base b digit from the right—that is, this
algorithm finds the digits for the base b number in reverse order.
Here is an example for converting the base 10 number 30 into base 4:
    Division    Quotient    Remainder
    30 / 4         7            2       <- start here
     7 / 4         1            3
     1 / 4         0            1

At each step, the quotient from the previous division is divided by 4 again. Reading the remainders
from the last division back to the first gives the digits left to right: 1 3 2.
So, 30 (base 10) ≡ 132 (base 4).
Think about how this is recursive in nature.
• For what values do I know the answer without doing any work?
• How can I turn the problem into a smaller problem?
In a file named base_convert.py, write a recursive function named convert that will take a non-negative integer
in base 10 and a target base (an integer between 2 and 16 inclusive) and returns a string representing the number in
the given base.
def convert(num: int, base: int) -> str:
    """Returns a string representing num in the given base.
    Implemented recursively.
    """

2.2 Examples
As some examples:
>>> convert(30, 4)
‘132’
>>> convert(45, 2)
‘101101’
>>> convert(316, 16)
’13C’
Note that for bases > 10, the symbols used for 10, 11, 12, 13, 14, and 15 are A, B, C, D, E, and F respectively.
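One recursive sketch consistent with the division algorithm above; the digit lookup string is an implementation choice, not something mandated by the assignment:

def convert(num: int, base: int) -> str:
    """Returns a string representing num in the given base.
    Implemented recursively.
    """
    digits = '0123456789ABCDEF'
    if num < base:                   # Base case: a single digit remains.
        return digits[num]
    # The remainder is the rightmost digit; the quotient supplies the rest.
    return convert(num // base, base) + digits[num % base]

For example, convert(30, 4) recurses as convert(7, 4) + '2', then convert(1, 4) + '3' + '2', giving '132'.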

3 A Teddy Bear Picnic

3.1 Specification
This question involves a game with teddy bears. The game starts when I give you some bears. You can then
repeatedly give back some bears, but you must follow these rules (where n is the number of bears that you currently
have):

1. If n is even, then you may give back n/2 bears.
2. If n is divisible by 3 or 4, then you may multiply the last two digits of n together and give back this many
bears.
3. If n is divisible by 5, then you may give back 42 bears.
The goal of the game is to end up with EXACTLY 42 bears.

For example, suppose that you start with 250 bears. Then you could make these moves:
• Start with 250 bears.
• Since 250 is divisible by 5, you may return 42 of the bears, leaving you with 208 bears.
• Since 208 is even, you may return half of the bears, leaving you with 104 bears.
• Since 104 is even, you may return half of the bears, leaving you with 52 bears.
• Since 52 is divisible by 4, you may multiply the last two digits (resulting in 5 ∗ 2 = 10) and return these 10
bears. This leaves you with 42 bears.
• 42 bears is the goal! So, we succeeded!

Note that at several of the steps, we had several options for how to proceed; not all of them would have resulted in
success.
In a file named bears.py, write a recursive function named bears to the following specification:
def bears(n: int) -> bool:
    """Returns whether it is possible to win the bear game starting with
    n bears.
    Implemented recursively.
    """
Although this problem may seem silly at first, it’s an example of a reachability problem, which is a problem that
arises in many areas of computer science.

3.2 Examples
Some examples:
>>> bears(250) # We saw this above
True
>>> bears(42) # This is true because we’re already at the goal.
True
>>> bears(40)
False
>>> bears(53)
False
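A hedged sketch of one possible recursive approach follows. Note the guard against giving back zero bears (e.g., when the last two digits multiply to 0); that guard is my own assumption to avoid infinite recursion, not an explicit rule of the game:

def bears(n: int) -> bool:
    """Returns whether it is possible to win the bear game starting with n bears."""
    if n == 42:                      # Already at the goal.
        return True
    if n < 42:                       # Too few bears; no rule ever adds bears back.
        return False
    if n % 2 == 0 and bears(n // 2):
        return True
    if n % 3 == 0 or n % 4 == 0:
        give_back = (n // 10 % 10) * (n % 10)    # Product of the last two digits.
        if give_back > 0 and bears(n - give_back):
            return True
    if n % 5 == 0 and bears(n - 42):
        return True
    return False

Each recursive call strictly decreases n, so the recursion terminates once n falls to 42 or below.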

4 Testing
For each part, you are required to achieve 100% test coverage. This means that every line of code in your functions
must be executed at some point in at least one of your tests.
As discussed in the syllabus, when I grade your code, you will ordinarily receive feedback regarding any tests that
fail, unless you do not have 100% test coverage. In the event you do not have 100% test coverage, the only feedback
you will receive is that you need to do more testing. I don't want you using my grading script to do your testing on the last day.

5 GitHub Submission
Push your finished code back to GitHub. Refer to Lab 0, as needed, to remember how to push your local code.

CPE 202 Project 2

1 Introduction

For this project, you will implement a program to evaluate a postfix expression. To make the program more versatile, you'll also provide code to convert infix expressions (the normal kind to which you're accustomed) to postfix. In this way, your program will be able to evaluate infix and postfix expressions. Many programming languages do something similar to convert expressions into code that is easy to execute on a computer.

Notes:
• To break up a string into “tokens”, consider using the str.split method.
• To convert a token (string) into a number (float), consider using try/except blocks along with the float
function.
• Your functions will support the mathematical operations of +, -, *, /, //, and **.
• That last one, **, is Python’s exponentiation operator. It has higher precedence than the other operators (e.g.,
2 * 4 ** 3 == 2 * 64 == 128) and is right-associative (e.g., 2 ** 3 ** 2 == 2 ** (3 ** 2) == 2 **
9 == 512).

• All the other operators are left-associative (e.g., 2 * 3 // 4 == (2 * 3) // 4 == 6 // 4 == 1).
• At no point should you ever be using the Python builtin function eval.

2 Array Stack

The first thing we’re going to have to do is implement a stack. In class, we talked about how we could use either an
array-based structure, or we could use a link-based structure. While they both do work, for consistency in grading,
I’m going to have you all implement an array-based stack.

In a file called array_stack.py, you will implement these functions (stubs are included in the starting file):
• empty_stack This function takes no arguments and returns an empty stack.
• push This function takes a stack and a value as arguments and places the value on the “top” of the stack.
For this project, because we’re only using the array-based approach, we should be mutating the given stack
and thus our function won’t return anything.

• pop This function takes a stack as an argument and removes (and returns) the element at the “top” of the
stack. If the stack is empty, then this operation should raise an IndexError.
For this project, because we're mutating the stack directly, we only need to return the value being popped.
• peek This function takes a stack as an argument and returns (without removing) the value on the “top” of the
stack. If the stack is empty, then this operation should raise an IndexError.
• is_empty This function takes a stack as an argument and returns whether or not the stack is empty.
• size This function takes a stack as an argument and returns the number of items in the stack.

You should find that you can reuse a lot of your code from your ArrayList implementation, though this one should be a fair bit simpler. As with ArrayList, the same restrictions on which Python list operations you may use apply here. All of these operations should run in O(1) (i.e., constant) time.
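As a rough illustration of the required interface, here is a sketch that leans on a plain Python list. Your actual implementation must respect the same restrictions on list operations as ArrayList, so treat this only as a guide to the function signatures and behavior:

# array_stack.py -- illustrative sketch only
def empty_stack():
    """Return a new, empty stack."""
    return []

def push(stack, value):
    """Place value on the top of the stack (mutates the stack; returns nothing)."""
    stack.append(value)

def pop(stack):
    """Remove and return the top value; raise IndexError if the stack is empty."""
    if is_empty(stack):
        raise IndexError('pop from empty stack')
    return stack.pop()

def peek(stack):
    """Return (without removing) the top value; raise IndexError if empty."""
    if is_empty(stack):
        raise IndexError('peek at empty stack')
    return stack[-1]

def is_empty(stack):
    """Return whether the stack has no items."""
    return len(stack) == 0

def size(stack):
    """Return the number of items in the stack."""
    return len(stack)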

3 Evaluating a Postfix (RPN) Expression

3.1 Algorithm
In a file called exp_eval.py, you will implement this algorithm as a function called postfix_eval.
While RPN will look strange until you are familiar with it, here you can begin to see some of its advantages for
programmers. One such advantage of RPN is that it removes the need for parentheses. Infix notation supports
operator precedence (* and / have higher precedence than + and -) and thus needs parentheses to override this
precedence.

This makes parsing such expressions much more difficult. RPN has no notion of precedence; the operators are processed in the order they are encountered. This makes evaluating RPN expressions fairly straightforward, and it is a perfect application for a stack data structure. We just follow these steps:
• Process the expression from left to right
• When a value is encountered:
– Push the value onto the stack
• When an operator is encountered:
– Pop the required number of values from the stack
– Perform the operation
– Push the result back onto the stack
• Return the last value remaining on the stack

For example, given the expression

5 1 2 + 4 ** + 3 -

Input   Type       Stack      Notes
5       Value      5          Push 5 onto the stack
1       Value      5 1        Push 1 onto the stack
2       Value      5 1 2      Push 2 onto the stack
+       Operator   5 3        Pop two operands (1 and 2), perform 1 + 2, and push the result (3)
4       Value      5 3 4      Push 4 onto the stack
**      Operator   5 81       Pop two operands (3 and 4), perform 3 ** 4, and push the result (81)
+       Operator   86         Pop two operands (5 and 81), perform 5 + 81, and push the result (86)
3       Value      86 3       Push 3 onto the stack
-       Operator   83         Pop two operands (86 and 3), perform 86 - 3, and push the result (83)
        Result     83         The value left on the stack is the answer
You may (and should) use the Python string method str.split to separate the string into tokens.

3.2 Exceptions
You may (and should) assume that a string is always passed to postfix_eval. However, that does not mean that
the RPN expression will always be valid. Specifically, your function should raise a ValueError with the following
messages in the following conditions:
• "empty input" if the input string is an empty string
• "invalid token" if one of the tokens is neither a valid operand nor a valid operator (e.g., 2 a +)
• "insufficient operands" if the expression does not contain sufficient operands (e.g., 2 +)
• "too many operands" if the expression contains too many operands (e.g., 2 3 4 +)
To raise an exception with a message, you will raise ValueError("your message here").
Additionally, if you would divide by zero, your code should raise a ZeroDivisionError.
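Putting the algorithm and the error conditions together, a hedged sketch might look like the following. It assumes the array_stack functions described above; the apply_operator helper is hypothetical, not something the spec requires:

from array_stack import empty_stack, push, pop, size

OPERATORS = {'+', '-', '*', '/', '//', '**'}

def postfix_eval(expression: str) -> float:
    """Evaluate a space-separated RPN expression and return the result."""
    if expression == '':
        raise ValueError('empty input')
    stack = empty_stack()
    for token in expression.split():
        if token in OPERATORS:
            if size(stack) < 2:
                raise ValueError('insufficient operands')
            right = pop(stack)                 # Top of the stack is the right operand.
            left = pop(stack)
            push(stack, apply_operator(left, right, token))
        else:
            try:
                push(stack, float(token))
            except ValueError:
                raise ValueError('invalid token')
    if size(stack) > 1:
        raise ValueError('too many operands')
    return pop(stack)

def apply_operator(left, right, operator):
    """Hypothetical helper applying one operator to two operands."""
    if operator in ('/', '//') and right == 0:
        raise ZeroDivisionError('division by zero')
    if operator == '+':
        return left + right
    if operator == '-':
        return left - right
    if operator == '*':
        return left * right
    if operator == '/':
        return left / right
    if operator == '//':
        return left // right
    return left ** right                       # '**'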

4 Converting Infix Expressions to Postfix

In a file called exp_eval.py, you will implement this algorithm as a function called infix_to_postfix.
We can (and you will!) also use a stack to convert an infix expression to an RPN expression via the Shunting-yard
algorithm. The steps are:
• Process the expression from left to right.
• When you encounter a value:
– Append the value to the RPN expression
• When you encounter an opening parenthesis:
– Push it onto the stack
• When you encounter a closing parenthesis:
– Until the top of the stack is an opening parenthesis, pop operators off the stack and append them to the
RPN expression

– Pop the opening parenthesis from the stack (but don’t put it into the RPN expression)
• When you encounter an operator, o1:
– While there is an operator, o2 on the top of the stack and either:
∗ o2 has greater precedence than o1, or
∗ they have the same precedence and o1 is left-associative
Pop o2 from the stack into the RPN expression.
– Then push o1 onto the stack
• When you get to the end of the infix expression, pop (and append to the RPN expression) all remaining
operators.

The following table summarizes the operator precedence, from the highest precedence at the top to the lowest precedence at the bottom. Operators in the same row have the same precedence.

Operators   Notes
**          Right-associative
*, /, //    Left-associative
+, -        Left-associative
For example, given the expression

3 + 4 * 2 / ( 1 - 5 ) ** 2 ** 3

Input   Action                       RPN                            Stack
3       Append 3 to RPN              3
+       Push + to stack              3                              +
4       Append 4 to RPN              3 4                            +
*       Push * to stack              3 4                            + *
2       Append 2 to RPN              3 4 2                          + *
/       Pop *, push /                3 4 2 *                        + /
(       Push ( to stack              3 4 2 *                        + / (
1       Append 1 to RPN              3 4 2 * 1                      + / (
-       Push - to stack              3 4 2 * 1                      + / ( -
5       Append 5 to RPN              3 4 2 * 1 5                    + / ( -
)       Pop until (                  3 4 2 * 1 5 -                  + /
**      Push ** to stack             3 4 2 * 1 5 -                  + / **
2       Append 2 to RPN              3 4 2 * 1 5 - 2                + / **
**      Push ** to stack             3 4 2 * 1 5 - 2                + / ** **
3       Append 3 to RPN              3 4 2 * 1 5 - 2 3              + / ** **
(end)   Pop entire stack into RPN    3 4 2 * 1 5 - 2 3 ** ** / +

You may (and should) assume that a well formatted, correct infix expression containing only numbers, the specified
operators, and parentheses will be passed to your function. You may also assume that all tokens are separated by
exactly one space.

You may (and should) use the Python string methods str.split (to split the string into tokens) and str.join (to join the RPN expression when you're done).
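A hedged sketch of the Shunting-yard conversion, again assuming the array_stack functions and encoding the precedence table above as a dictionary:

from array_stack import empty_stack, push, pop, peek, is_empty

PRECEDENCE = {'**': 3, '*': 2, '/': 2, '//': 2, '+': 1, '-': 1}
LEFT_ASSOCIATIVE = {'*', '/', '//', '+', '-'}

def infix_to_postfix(expression: str) -> str:
    """Convert a space-separated infix expression to an RPN string."""
    rpn = []
    stack = empty_stack()
    for token in expression.split():
        if token == '(':
            push(stack, token)
        elif token == ')':
            while peek(stack) != '(':          # Pop operators back to the output...
                rpn.append(pop(stack))
            pop(stack)                         # ...then discard the '(' itself.
        elif token in PRECEDENCE:
            while (not is_empty(stack) and peek(stack) != '('
                   and (PRECEDENCE[peek(stack)] > PRECEDENCE[token]
                        or (PRECEDENCE[peek(stack)] == PRECEDENCE[token]
                            and token in LEFT_ASSOCIATIVE))):
                rpn.append(pop(stack))
            push(stack, token)
        else:
            rpn.append(token)                  # Values go straight to the output.
    while not is_empty(stack):                 # Flush any remaining operators.
        rpn.append(pop(stack))
    return ' '.join(rpn)

Running this sketch on the worked example above reproduces the table, ending with '3 4 2 * 1 5 - 2 3 ** ** / +'.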

5 Testing
For each part, you are required to achieve 100% test coverage. This means that every line of code in your functions
must be executed at some point in at least one of your tests.
As discussed in the syllabus, when I grade your code, you will ordinarily receive feedback regarding any tests that
fail, unless you do not have 100% test coverage. In the event you do not have 100% test coverage, the only feedback
you will receive is that you need to do more testing. I don't want you using my grading script to do your testing on the last day.

6 GitHub Submission
Push your finished code back to GitHub. Refer to Lab 0, as needed, to remember how to push your local code.
The files you need to have submitted are:
• stack_array.py
– Your array-based implementation of a stack.
• exp_eval.py
– Your functions for evaluating RPN and for converting from infix to RPN.
• exp_eval_tests.py
– Your tests for the functions mentioned above. You must have 100% coverage.

CPE 202 Project 3a — Huffman Encoding and Project 3b — Huffman Decoding

1 Introduction

For this project, you will implement a program to encode text using Huffman coding. Huffman coding is a method of
lossless (the original can be reconstructed perfectly) data compression. Each character is assigned a code consisting
of ‘0’s and ‘1’s (i.e., binary digits or bits). The length of the code is based on how frequently the character occurs;
more frequent characters are assigned shorter codes.

The basic idea is to use a binary tree, where each leaf node represents a character and frequency count. The Huffman
code of a character is then determined by the unique path from the root of the binary tree to that leaf node, where
each ‘left’ accounts for a ‘0’ and a ‘right’ accounts for a ‘1’. Since the number of bits needed to encode a character
is the path length from the root to the character, a character occurring frequently should have a shorter path from
the root (i.e., should be higher in the tree) compared to an “infrequent” character (which should be lower in the tree).

The key problem is therefore to construct such a binary tree based on the frequency of occurrence of characters. For
a detailed example of how to construct such a tree, see the Huffman Example document on Canvas.
BE SURE that you understand the example in the document before attempting to implement any of the following.
If you do not understand what the algorithm should be doing, you will not be able to write code for it.

2 Specification

In the provided huffman.py, you will implement the following:

2.1 Count Frequencies

You will implement a function called count_frequencies that takes as an argument the name of a text file and
counts the number of occurrences of all the characters within that file.
Use the builtin Python list data structure of size 256 for counting the occurrences of characters. The indices into
the list will be the ASCII values of the characters (use the Python builtin ord to get this value for a given character)
and the value at a given index will be the frequency with which that character occurred.

For example, suppose that the file to be encoded contained “aaabbbbc”. Then this function will return a list
containing mostly zeros, except indices 97–99 will be [3, 4, 1].
For an empty file, this function should return a list containing 256 zeros.
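A minimal sketch of this counting step, assuming the file contains only ASCII characters:

def count_frequencies(filename: str) -> list:
    """Return a 256-entry list of character counts for the named file."""
    frequencies = [0] * 256
    with open(filename) as infile:
        for line in infile:                  # Newline characters are counted too.
            for char in line:
                frequencies[ord(char)] += 1
    return frequencies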

2.2 Data Definition for Huffman Tree

Most of a HuffmanNode class is provided. These represent nodes in your Huffman Tree. You will need to implement the __eq__ and __lt__ methods so that these nodes may be placed in your OrderedList (from Lab 4). See the example for details of when one tree should be considered "less than" another tree.
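The exact attribute names come from the provided class and the tie-breaking rule from the example document; assuming attributes named char (an ASCII value) and freq, and ties broken by the smaller character, the comparisons might look roughly like this:

class HuffmanNode:
    def __init__(self, char: int, freq: int):    # Assumed constructor shape; the real class is provided.
        self.char = char                          # ASCII value of the character.
        self.freq = freq                          # Occurrence count.
        self.left = None
        self.right = None

    def __eq__(self, other):
        """Two nodes are equal when both frequency and character match."""
        return (isinstance(other, HuffmanNode)
                and self.freq == other.freq
                and self.char == other.char)

    def __lt__(self, other):
        """Order by frequency, breaking ties on the smaller character value."""
        if self.freq != other.freq:
            return self.freq < other.freq
        return self.char < other.char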

2.3 Building the Huffman Tree

You will implement a function build_huffman_tree that takes as an argument a list of frequencies and builds and
returns the resulting Huffman Tree. You will do this by:
• Creating an OrderedList (from Lab 4) of individual Huffman Trees each consisting of a single HuffmanNode
containing the character and its frequency. We should only include characters with non-zero frequency here.
• While the list contains more than one tree, you will remove the two least frequent trees, join them together
into one tree, and put the resulting tree back into the ordered list.

– In our implementation, when joining two trees, the “lesser” of the trees must go on the left. See the
example document for details.
• Once the list only has a single tree, this is our Huffman Tree!
Note that when connecting two trees, this new larger tree does not represent a single character, but rather jointly
represents all the characters in the tree. As such, the new node should have the sum of the frequencies of its children.

For our implementation, we will also store the smaller of the characters in this new node to continue to enable tie
breaking. See the example document for details.
Some examples to consider:
• If the frequency list was all zeros, your function should return an empty tree.
• If there was only one character with non-zero frequency, the resulting tree will only have a single node.
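The assignment requires the Lab 4 OrderedList; since its exact interface isn't reproduced here, the sketch below substitutes the standard-library heapq module purely to illustrate the joining loop. It relies on the HuffmanNode comparisons sketched earlier:

import heapq

def build_huffman_tree(frequencies: list):
    """Build and return the Huffman tree for a 256-entry frequency list."""
    nodes = [HuffmanNode(char, freq)
             for char, freq in enumerate(frequencies) if freq > 0]
    if not nodes:
        return None                              # All zeros: an empty tree.
    heapq.heapify(nodes)                         # Ordered via HuffmanNode.__lt__.
    while len(nodes) > 1:
        lesser = heapq.heappop(nodes)            # The "lesser" tree goes on the left.
        greater = heapq.heappop(nodes)
        parent = HuffmanNode(min(lesser.char, greater.char),   # Keep the smaller character
                             lesser.freq + greater.freq)       # for later tie breaking.
        parent.left, parent.right = lesser, greater
        heapq.heappush(nodes, parent)
    return nodes[0]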

2.4 Creating the Codes

You will implement a function create_codes that takes as an argument a Huffman Tree and determines the code
for each character in the tree.
Use the builtin Python list data structure of size 256 for storing the codes. The indices into the list will be the
ASCII values of the characters and the value at a given index will be the code for that character. Any character not
in the tree should have an empty code (i.e., '').

You should accomplish this by traversing the tree, building the codes on the way down. Each time you traverse to
the left, you will add a ‘0’ and each time you traverse to the right, you will add a ‘1’. You may find it useful to
use the builtin ‘+’ operator to concatenate strings.
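A sketch of that traversal; the inner helper is hypothetical, assumes the node attributes used in the earlier sketches, and leaves the single-node edge case to the testing questions in Section 3:

def create_codes(tree) -> list:
    """Return a 256-entry list of Huffman codes for the characters in tree."""
    codes = [''] * 256

    def walk(node, path):                    # Hypothetical helper.
        if node is None:
            return
        if node.left is None and node.right is None:
            codes[node.char] = path          # Leaf: the path so far is the code.
            return
        walk(node.left, path + '0')          # Going left adds a '0'.
        walk(node.right, path + '1')         # Going right adds a '1'.

    walk(tree, '')
    return codes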

2.5 Creating the Header

When we decode a file (coming in Project 3b), we need to have the Huffman Tree available (see the example document
for details). This means that we have to store enough information into the encoded file to reconstruct the tree. For
our implementation we will accomplish this by storing the frequencies at the top of the file.
You will implement a function create_header that takes as an argument a list of frequencies and returns a string
containing the frequency data to be stored at the top of the encoded file.

This will be a string of the ASCII values and their associated frequencies, separated by spaces. You should only
include characters with non-zero frequency, and they should be ordered by ASCII value. For example, if the original
file had contained “aaacbbbb”, then this function would return the string “97 3 98 4 99 1”
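This is a straightforward pass over the frequency list; a minimal sketch:

def create_header(frequencies: list) -> str:
    """Return the header: ASCII value and count for each non-zero entry."""
    parts = []
    for ascii_value, freq in enumerate(frequencies):
        if freq > 0:
            parts.extend([str(ascii_value), str(freq)])
    return ' '.join(parts)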

2.6 Huffman Encoding

You will implement a function huffman_encode that takes as arguments two file names (for the input and output),
encodes the input file, and stores the resulting encoded data in the output file.

This encoded data has two pieces:
• A header on the first line of the file (see Section 2.5 for details), ending with a newline character.
• The encoded data, where each character is replaced with its Huffman Code (see the example document for
details).
You may notice after doing this that the resulting file is larger than the original! What lousy compression! This is
because although we encoded our input text characters in sequences of 0s and 1s representing bits, we wrote them
to the file as the text characters 0 (ASCII value 48) and 1 (ASCII value 49). So, the result is roughly 8 times larger
than it should have been. Writing the “true” binary data to the file is a bit of a hassle, so we will forgo it.
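Tying the pieces together, a hedged sketch of the encoding function; it glosses over the empty-file and single-character edge cases raised in Section 3:

def huffman_encode(in_file: str, out_file: str) -> None:
    """Encode in_file and write the header plus coded data to out_file."""
    frequencies = count_frequencies(in_file)
    codes = create_codes(build_huffman_tree(frequencies))
    with open(in_file) as infile, open(out_file, 'w') as outfile:
        outfile.write(create_header(frequencies) + '\n')     # Header line first.
        for line in infile:
            for char in line:
                outfile.write(codes[ord(char)])              # Then the coded bits.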

2.7 More Notes
When writing your own text files, or copying and pasting text around, take into account that most text editors will
add a new line character (‘\n’, ASCII value 10) to the end of the file. This is now a character in your file that will
be encoded along with everything else. So, it will end up in your Huffman Tree, and its encoded bits will end up in
your resulting file. This isn’t wrong, but it makes testing a little harder if the file you’re encoding for testing isn’t
what you thought it was.

Additionally, the exact behavior of newlines is different for Windows vs. every other operating system in the world.
On a Windows machine, a newline is actually two characters “\r\n” (ASCII values 13 and 10).
If you’re seeing your Huffman Trees getting nodes with values 10 and 13, this isn’t necessarily wrong, it may just
mean that the files you’re compressing have newlines.

3 Testing
Writing tests for functions that operate on files is difficult. So I’ve provided you with an example (along with some
basic tests of the rest of your functions).
When testing, always consider edge cases like the following cases:
• If the input file consists only of some number of a single character, say “aaaaa”, what do you think the result
should look like? What is the “code” for the letter ‘a’?
• If the input file is empty (as in 0 bytes large), what should the result look like?

In your huffman_tests.py, you should only test the functions that are required (for which stubs were provided).
I will be running your tests against my code for this project (similar to Project 1), and if you're testing a helper
function, it would crash my tests (because I most likely won’t have written the exact same helper functions).
That said, you are still required to achieve 100% test coverage. This means that every line of code in your functions
must be executed at some point in at least one of your tests. If you would like to write tests for your helper functions
directly, you may create a file huffman_helper_tests.py for your helper function tests.

As discussed in the syllabus, when I grade your code, you will ordinarily receive feedback regarding any tests that
fail, unless you do not have 100% test coverage. In the event you do not have 100% test coverage, the only feedback
you will receive is that you need to do more testing. I don't want you using my grading script to do your testing on the last day.

4 GitHub Submission
Push your finished code back to GitHub. Refer to Lab 0, as needed, to remember how to push your local code.
The files you need to have submitted are:
• ordered_list.py
– Your correct implementation of the OrderedList from Lab 4.
• huffman.py
– Contains the functions specified above, and any helper functions that you find necessary.
• huffman_tests.py
– Contains tests for all your functions specified by the assignment. These should run on any correct implementation of this project. As such, you should not include any tests for helper functions.
• huffman_helper_tests.py
– Optional file containing tests for all your helper functions. These tests will be used in conjunction with
your huffman_tests.py to check for 100% coverage.
• Any text files you use for testing.
– If you use a text file for testing, you’ll need to submit it so that your tests run on my end.

 

Project 3b — Huffman Decoding

Please read the entire document carefully! I would suggest reading the entire document before writing any code.

1 Introduction

For this project, you will implement a program to decode text that has been encoded using Huffman coding. This
project assumes that the encoded file is in the format that was output from the previous part of the assignment.
For decoding, you will need to recreate the Huffman Tree that was used to determine the Huffman codes used for
encoding. The header in the input file contains the character frequency information that will be needed. After
reading in the frequency information, you will be able to use the same build_huffman_tree function that you
wrote for the encoding portion of the assignment.

Once the Huffman tree is recreated, you will be able to traverse the tree using the encoded 0s and 1s from the input
file. The decoding process starts by beginning at the root node of the Huffman tree. A 0 will direct the navigation
to the left, while a 1 will direct the navigation to the right. When a leaf node is reached, the character stored in that
node is the decoded character that is added to the decoded output. Navigation of the Huffman tree is then restarted
at the root node, until all of the characters from the original file have been written.

2 Specification

In the same huffman.py file from the first part of the assignment, you will add the following:

2.1 Parse Header

You will implement a function called parse_header that takes as an argument a string (the first line of the file) and
returns a list of frequencies. The list of frequencies should be in the same format that count_frequencies returned
in the first part of the assignment, i.e., a list with 256 entries, indexed by the ASCII value of the characters.
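Since the header alternates ASCII values and counts, a minimal sketch is:

def parse_header(header_string: str) -> list:
    """Rebuild the 256-entry frequency list from the header line."""
    frequencies = [0] * 256
    tokens = header_string.split()
    for i in range(0, len(tokens), 2):           # Pairs of (ASCII value, count).
        frequencies[int(tokens[i])] = int(tokens[i + 1])
    return frequencies

For example, parse_header("97 3 98 4 99 1") restores a list with 3, 4, and 1 at indices 97, 98, and 99.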

2.2 Huffman Decoding

You will implement a function huffman_decode that takes as arguments two file names (for the input and output),
decodes the input file, and stores the resulting decoded data in the output file.
This will be done by first building the Huffman tree (using functions you’ve already written), and now recreating
the original file one character at a time using the tree traversing method described above.
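A hedged sketch of the decoding loop, assuming the node attributes from the encoding sketches and deferring the single-character and empty-file special cases discussed in Section 3:

def huffman_decode(encoded_file: str, decoded_file: str) -> None:
    """Decode encoded_file (header line + coded bits) into decoded_file."""
    with open(encoded_file) as infile:
        frequencies = parse_header(infile.readline())
        bits = infile.readline()
    tree = build_huffman_tree(frequencies)
    characters = []
    node = tree
    for bit in bits:
        if bit == '0':
            node = node.left                     # '0' steps left...
        elif bit == '1':
            node = node.right                    # ...'1' steps right.
        else:
            continue                             # Ignore any trailing newline.
        if node.left is None and node.right is None:
            characters.append(chr(node.char))    # Leaf: emit and restart at the root.
            node = tree
    with open(decoded_file, 'w') as outfile:
        outfile.write(''.join(characters))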

2.3 More Notes

You may find it useful to read a single line of text from a file. In Python, you can read just one line with
file.readline() which will return one line in the file.
For our purposes, recall that the first line will be the header and the second line will be all the 0s and 1s.

3 Testing
When testing, always consider edge cases like the following cases:
• If the input file consists only of some number of a single character, say “aaaaa”, what do you think the result
should look like? What is the “code” for the letter ‘a’?
• If the input file is empty (as in 0 bytes large), what should the result look like?

Now when we decode those files, how do we handle these cases? These will possibly require special cases in your
decoding code.
In your huffman_tests.py, you should leave all the tests you had for the previous part (they’re needed for 100%
coverage) and additionally add tests for the new functions that you’re writing.
There are sample tests provided on Canvas (huffman_decode_tests.py) that you can/should copy over into your
huffman_tests.py file.

Like the previous assignments, you are still required to achieve 100% test coverage. This means that every line of
code in your functions must be executed at some point in at least one of your tests.
As discussed in the syllabus, when I grade your code, you will ordinarily receive feedback regarding any tests that
fail, unless you do not have 100% test coverage. In the event you do not have 100% test coverage, the only feedback
you will receive is that you need to do more testing. I don't want you using my grading script to do your testing on the last day.

4 GitHub Submission
Push your finished code back to GitHub. Refer to Lab 0, as needed, to remember how to push your local code.
The files you need to have submitted are:
• ordered_list.py
– Your correct implementation of the OrderedList from Lab 4.
• huffman.py
– Contains the functions specified above, and any helper functions that you find necessary.
• huffman_tests.py
– Contains tests for all your functions specified by the assignment. These should run on any correct implementation of this project. As such, you should not include any tests for helper functions.
• huffman_helper_tests.py
– Optional file containing tests for all your helper functions. These tests will be used in conjunction with
your huffman_tests.py to check for 100% coverage.
• Any text files you use for testing.
– If you use a text file for testing, you’ll need to submit it so that your tests run on my end.

CPE 202 Project 4

1 Introduction

For this project, you will write a program to generate a concordance. A concordance is an alphabetized list of the
words (usually the important ones) from a text, along with their locations in the text. Your implementation will use
a hash table to accomplish this.

2 Implementation

2.1 Hash Function
For all of the following, you will be using strings as keys to your hash tables. This means that we need a function that takes as input a string and returns an integer. You must use the following hash function (a slight modification of DJBX33A) as the hash function for your hash tables:

DJBX33A(str) = 5381 * 33^n + sum_{i=0}^{n-1} ord(str[i]) * 33^(n-1-i), where n = min(len(str), 8)

Note that this may be implemented more efficiently than a direct translation of the above mathematical expression; this, however, is not required.
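A direct (unoptimized) translation of the formula; the function name here is illustrative, so use whatever name the provided stub specifies:

def djbx33a(string: str) -> int:
    """Return the modified DJBX33A hash of string."""
    n = min(len(string), 8)
    result = 5381 * 33 ** n
    for i in range(n):
        result += ord(string[i]) * 33 ** (n - 1 - i)
    return result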

2.2 Concordance
A typical method of constructing a concordance begins with splitting the input by sections (or lines, verses, chapters,
etc), and then into tokens (words, essentially). Then, each token is considered, and non-important ones are discarded.
This includes numbers and any "common" words that are not interesting enough to keep track of. For this project, a list of words to ignore, known as the stop-list, is provided. Important words (words that were not filtered) are entered into a hash table, along with the section (or line, verse, etc.) on which they were found.
Once the input has been processed, the resulting set of words is output in alphabetical order, along with the (distinct)
locations at which they appeared (also in order).

A hash table is perfect for both aspects of this task, storing the stop words and building the concordance itself. This
will allow fast lookup of words that occur multiple times in the file.
For this project, you will be reporting the concordance of a single text file, tracking the lines on which each word appeared. So, for example:

Sample stop-words file
a
about
be
by
can
do
i
in
is
it
of
on
the
this
to
was
Sample text file
This is a sample data (text) file to be processed
by your word-concordance program!
A real data file is much bigger.

Then the output should be:
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
Notes:
• Words are defined as sequences of characters delimited by any non-letter (white space, punctuation). Apostrophes will be ignored completely.

• There is no distinction made between upper and lowercase letters (e.g. CaT is the same word as cat). To
accomplish this, we will make everything lowercase (by calling str.lower()).
• Blank lines are counted in line numbering.

So, the general algorithm for the program will be:
1. Read the stop words file into a hash table (we want O(1) ability to check if a word is a stop word). Note that
this will use your implementation of a hash table. Also note that the values aren’t important, all we really care
about are the keys.

2. The word-concordance will be in a separate hash table. Process the input file one line at a time to build the
word-concordance. This hash table should only contain the non-stop words as the keys (you will use the stop
words hash table to filter out stop words). Associated with each key is its value, where the value consists of
the line numbers where the key appears. Do not include duplicate line numbers. This should be efficient (i.e.
O(1)).

3. Generate a text file containing the concordance words printed out in alphabetical order along with their line
numbers. You may use Python’s builtin sorting here. There should be one word per line followed by a colon
and spaces separating items on each line.

See the provided Python files for specifics of each function you are required to write.
For processing the lines of the file (a brief sketch follows this list):
• I recommend processing the file one line at a time (e.g., using a for loop over the file).

• For each line in the input file:
– Convert the line to lowercase.
– Remove all occurrences of the apostrophe character (')
– Convert all characters in string.punctuation to spaces. (That is the variable punctuation in the
builtin string module.)
– Split the string into tokens using str.split() (with no parameters).
– Each token that is only alphabetic (i.e. str.isalpha() returns True) is considered a “word”. All other
tokens should be ignored.
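As referenced above, here is a sketch of that per-line processing; the helper name is illustrative, not part of the provided stubs:

import string

def words_on_line(line: str) -> list:
    """Return the alphabetic "words" found on one line of the input file."""
    line = line.lower()                          # Case does not matter.
    line = line.replace("'", '')                 # Apostrophes vanish entirely.
    for mark in string.punctuation:
        line = line.replace(mark, ' ')           # Other punctuation becomes spaces.
    return [token for token in line.split() if token.isalpha()]

Words returned by this helper would then be checked against the stop-words table before being added to the concordance.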

3 Testing
You are provided with several text files to be used for testing. You are also encouraged to write your own!
The text files are:
• a stop words file stop_words.txt
• six sample data files (three input/solution pairs) that can be used for preliminary testing of your programs:
– file1.txt/file1_sol.txt: contains no punctuation
– file2.txt/file2_sol.txt: contains punctuation to be removed
– declaration.txt/declaration_sol.txt: a larger file for testing
• six large data files (three input/solution pairs) that can be used for performance testing. DO NOT include any of these in your submitted tests. They may take too long to run and will be killed by the grader. That said, none of these should take longer than a few seconds.
– dictionary_a-c.txt/dictionary_a-c_sol.txt
– file_the.txt/file_the_sol.txt: for this you should use the empty stop words file
– war_and_peace.txt/war_and_peace_sol.txt
You are required to achieve 100% test coverage. This means that every line of code in your functions must be
executed at some point in at least one of your tests.

As discussed in the syllabus, when I grade your code, you will ordinarily receive feedback regarding any tests that
fail, unless you do not have 100% test coverage. In the event you do not have 100% test coverage, the only feedback
you will receive is that you need to do more testing. I don't want you using my grading script to do your testing on the last day.

4 GitHub Submission
Push your finished code back to GitHub. Refer to Lab 0, as needed, to remember how to push your local code.
The files you need to have submitted are:
• hash_table.py
– Your correct implementation of the HashTable from Lab 8.

• concordance.py
– Contains the functions for creating the concordance.
• concordance_tests.py
– Contains tests for all your required concordance functions. These should run on any correct implementation of this project. As such, you should not include any tests for helper functions.
• concordance_helper_tests.py
– Optional file containing tests for all your helper functions. These tests will be used in conjunction with
your concordance_tests.py to check for 100% coverage.
• Any text files you use for testing.
– If you use a text file for testing, you’ll need to submit it so that your tests run on my end.