CSCI 141 Assignment 1 to 5 solutions


CSCI 141 Assignment 1: Variables, print, input, operators

Introduction

This first homework has 2 parts. For the first part please answer the questions on Canvas
assigned for this homework, and for the second portion you will complete a single programming
task.

Getting Started
Refer to lab 1, as well as the lecture slides, to review. In this and future assignments, you may
not have seen all the topics in lecture before the assignment is released, but they will be covered
before the deadline. As usual, seek help early if you get stuck: come talk to me or the TAs
during office hours, or visit the CS mentors for help. Please keep track of approximately how
much time you spend on both portions of this assignment. You will be asked to report your
estimate on Canvas after you submit.

Collaboration and Academic Honesty
The answers to the questions and programming solution MUST be your own. You can discuss
the problems with your peers, but these discussions must happen away from computers and you
should take a break before returning to work on them to help ensure that you truly understand
the answers. You may not copy another person’s code, or have another person tell you what code
to type. If you have any questions, or are unsure about whether a specific sort of collaboration
violates academic honesty, please come talk to me.

1 Questions: 16 points
Please answer the questions in the A1 Written quiz on Canvas. The questions on Canvas have
been configured so that there is no time limit, but you have only 2 attempts to submit your
answers. The score recorded in Canvas is the score of your most recent submission.

2 Programming Task: 20 points

Congratulations! You’ve just been hired as a Python programmer at an education start-up
company. Your first task is to develop a prototype of a program that kindergarten students will
use to check their homework assignments which involve addition, multiplication, and division
problems.

Program Specification

The program begins with a series of prompts, then prints a few lines to the screen in response.
In total there are 6 lines that are printed each time the program is run:

1. Prompt the user for their name
2. Greet the user and ask them to supply the first integer
3. Prompt the user for a second integer
4. Output the sum of the two numbers
5. Output the product of the two numbers

6. Rephrase the division question, and output the whole number and remainder.

All numerical outputs on the 6th line of output must be integers (whole numbers, without
decimals).

A sample invocation of the program is shown in Figure 1:
Figure 1: Sample Output
Although this is a simple set of steps, there are many, many different Python programs that
can achieve it. The text of your prompts does not need to match the example exactly. However,
your solution must follow the instructions above exactly as specified. For example:
• Both the greeting and the prompt for the first number must be printed on the second line
of output.

• The last (6th) line of output must rephrase the division question and output the whole
number and remainder portions of the calculation on a single line.

Valid Input and Error Checking

You should assume that the user provides all requested inputs (via the keyboard) as instructed,
and assume that all integers are positive numbers. Your program is not required to check the
input or behave in any specific way if the above conditions are not met.

Testing Your Program
Testing is a major component in the process of writing software. Often, testing (detecting
errors) and debugging (locating and fixing errors) takes way more effort than writing the code
did in the first place. We’ll talk more about testing as the quarter progresses; in the meantime,
the following table provides some helpful test cases that you can use to see if your program is
working correctly. Try your code out with the given pairs of integers and see if your output
matches the sum, product, and division result.

First Integer   Second Integer   Sum       Product       Division
7               5                12        35            1 remainder 2
5               7                12        35            0 remainder 5
3               3                6         9             1 remainder 0
1               678              679       678           0 remainder 1
8364724         9738             8374462   81455682312   858 remainder 9520
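
The expected values in the table can be reproduced with Python's integer operators. The sketch below only illustrates `//` (whole-number quotient) and `%` (remainder). The function name `arithmetic_results` is hypothetical, and the sketch deliberately leaves out the prompts and output wording that the specification asks you to design yourself.

```python
def arithmetic_results(first, second):
    """Return (sum, product, quotient, remainder) for two positive integers."""
    total = first + second
    product = first * second
    quotient = first // second    # whole-number part of first divided by second
    remainder = first % second    # what is left over after that division
    return total, product, quotient, remainder


# Try the sketch on two rows of the test table.
print(arithmetic_results(7, 5))
print(arithmetic_results(8364724, 9738))
```

Because `//` and `%` both return integers when given integers, the results print without decimals.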

Submission
Double check that your program works according to the specification. Take a look through
the rubric below and make sure you won’t lose points for reasons that could easily be foreseen
and fixed. When you’re finished, submit your program to Canvas as a single .py file named
arithmetic.py. Finally, fill out the A1 Hours quiz with an estimate of the number of hours
you spent on A1 (include both the written and programming portions in your estimate).

Rubric
Canvas questions 16 points
Author, date, and program description given in comments at the top of the file 1 point
Program prompts for user’s name on the first line 4 points
Greeting on second line includes user’s name 4 points
First integer prompt also appears on second line 2 points
Correct sum output on fourth line 2 points
Correct product output on fifth line 2 points

Division question is rephrased, quotient and remainder are printed on sixth line 3 points
Code is commented adequately and variables are appropriately named 2 points
Total 36 points

3 Optional Challenge Problem

Some assignments will come with an optional challenge problem. In general, these problems
will be worth very small amounts of extra credit: this one is worth one point. Though the
grade payoff is small, you may find them interesting to work on and test your skills in Python
and algorithm development.

The skills and knowledge needed to solve these problems are not
intended to go beyond those needed for the base assignment, but less guidance is provided and
more decisions are left up to you. The A1 challenge problem is as follows:
Many online real estate websites have mortgage calculator features [1]. These calculators ask
for some information, such as the price of a home, the down payment (amount of the home
price you’d pay up front), and the interest rate, then calculate the amount you’d have to pay
monthly on a loan for the home.

According to NerdWallet [2], the formula used to calculate the monthly payment based on
these inputs is as follows:
$$M = (P - D)\,\frac{r(1 + r)^N}{(1 + r)^N - 1}$$
Where:
M = The monthly payment
P = The price of the home
D = The down payment amount
N = The number of months over which the loan will be paid off
r = R/12, the monthly interest rate, which is the yearly rate divided by 12

Write a program that asks the user to enter P, D, N, and R, then outputs the monthly
payment amount M. Notice that you will prompt the user for R, the annual interest rate, but
the formula uses r, the monthly interest rate.
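
As a rough illustration of the formula, the sketch below computes M from P, D, N, and R. The function name `monthly_payment` is hypothetical, and treating R as a fraction (0.06 for 6% per year) is an assumption: the assignment does not say how the user enters the rate. Prompts and output formatting are left out.

```python
def monthly_payment(price, down_payment, months, annual_rate):
    """Monthly payment M for a loan of (price - down_payment).

    annual_rate is R as a fraction, e.g. 0.06 for 6% per year (an assumption).
    """
    r = annual_rate / 12           # monthly rate: yearly rate divided by 12
    growth = (1 + r) ** months     # (1 + r)^N, used twice in the formula
    return (price - down_payment) * r * growth / (growth - 1)


print(round(monthly_payment(300000, 60000, 360, 0.06), 2))
```

Note that the formula divides by zero when the rate is 0, so a complete program might want to treat that case separately.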

3.1 Submission
Upload your submission to Canvas in a file called challenge.py.
[1] See https://www.zillow.com/mortgage-calculator/ for an example.
[2] Go to https://www.nerdwallet.com/mortgages/mortgage-calculator/calculate-mortgage-payment and click
“How to calculate your mortgage payment” for the source of the formula.

CSCI 141 A2: Variables, Boolean logic, Conditionals

Getting Started

Refer to the labs and lecture slides to review. Topics needed to complete this assignment will be
covered before the deadline. As usual, seek help early if you get stuck: come talk to me or the
TAs during office hours, or visit the CS mentors for help. Please keep track of approximately
how much time you spend on both portions of this assignment. You will be asked to report
your estimate on Canvas after you submit.

Reminder: You can discuss this homework with your peers. However, the answers to the
questions and programming solution MUST be your own. You cannot copy another person’s
code, you cannot have another person tell you what code to type, etc. If any part of this is
unclear, please come see me.

1 Questions: 16 points
Please answer the questions available on Canvas. The questions on Canvas have been configured
so that there is no time limit, but you have only 2 attempts to submit your answers. The score
recorded in Canvas is the score of your most recent submission.

2 Coding Task: 25 points

Suppose that you are a programmer for a game development company called Fungi. The text
adventure game being prepared for launch involves a character meandering through the forest,
during which they find and pick up mushrooms.

Your task is to write code for a portion of the game in which the role-playing character encounters a chef who wants to exchange some of the gathered mushrooms for rubies. The chef
exchanges mushrooms for rubies according to her secret formula (explained below).

The chief game designer has given you the pseudocode below, which explains the mechanics that
your Python program should implement. The chief software engineer has also instructed you
to use no more than 10 if keywords.

• Prompt the player to specify how many shiitake mushrooms were found and picked up
• Prompt the player to specify how many portobello mushrooms were found and picked up
• Include a narrative of how the player is meandering through the forest

• The chef asks the player how many of the shiitake mushrooms they’d like to trade
• The chef asks the player how many of the portobello mushrooms they’d like to trade
– If the player specifies that they want to trade more mushrooms (of either kind) than
have been collected, the chef ends the conversation (the program ends; it should not
throw an error).

– If the player specifies to trade a total of zero mushrooms (i.e., the sum of both
mushroom types), the chef ends the conversation (the program ends; it should not
throw an error).

– If the player wants to trade their mushrooms, then the chef will offer rubies according
to the following exchange rules (the chef’s secret formula):
Shiitake Player is Willing to Trade            Portobello Player is Willing to Trade  Rubies Offered by Chef
Fewer than 10                                  Fewer than 5                           Twice the number of Shiitake offered for trade
Fewer than 10                                  5 or more                              Three times the number of Portobello offered for trade
Multiple of 12 but NOT a multiple of 24        20 or more                             Four times the number of Portobello offered for trade
Multiple of 12 but NOT a multiple of 24        Fewer than 20                          The number of Portobello offered for trade
Any number different from the 4 choices above  Any                                    Five times the number of Shiitake offered for trade

• The chef should ask the player if they want to make the exchange. If the player enters y,
yes, or Yes, the program should output the number of rubies that the player walks away
with, as well as the number of portobello and shiitake mushrooms that the player retains.
Otherwise, the program should output the number of portobello and shiitake mushrooms
the player walks away with.
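
The chef's exchange rules can be read as a short chain of conditions. The sketch below covers only the ruby calculation. The function name `rubies_offered` is hypothetical, and the prompts, the checks for trading too many or zero mushrooms, and the yes/no confirmation are all left out.

```python
def rubies_offered(shiitake, portobello):
    """Rubies the chef offers for the mushrooms put up for trade."""
    if shiitake < 10:
        if portobello < 5:
            return 2 * shiitake
        return 3 * portobello
    # Multiple of 12 but not a multiple of 24
    if shiitake % 12 == 0 and shiitake % 24 != 0:
        if portobello >= 20:
            return 4 * portobello
        return portobello
    # Any other number of shiitake
    return 5 * shiitake


print(rubies_offered(5, 22))
print(rubies_offered(12, 8))
```

Nesting the portobello check inside each shiitake check is one way to keep the number of `if` keywords small; this sketch uses four.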

Two sample invocations of the program are shown in Figure 1:

Figure 1: Sample Outputs
If you proceed with computer science you’ll learn about more formal testing techniques. For
the time being, use the below table for sample inputs and outputs of the program to make sure
that your code is working correctly. Note that these sample inputs are not an exhaustive test
suite.

Your code will be graded on a different set of tests from the ones given below, so you
can’t count on these tests finding all possible mistakes. You should test your program on your
own combinations of inputs, making sure that you have tried out all possible paths that your
code might take.

Shiitakes Found / Willing to Trade  Portobellos Found / Willing to Trade  Chef Offers         Accept?  Player’s Final Shiitake/Portobello/Rubies
10/5                                30/22                                 66 rubies           Yes      5/8/66
100/0                               40/5                                  15 rubies           Yes      100/35/15
10/10                               5/6                                   Chef runs away      NA       NA
10/10                               6/5                                   50 rubies           No       10/6/0
20/0                                0/0                                   Unwilling to trade  NA       NA
13/12                               9/8                                   8 rubies            Yes      1/1/8

Submission
Submit a file called fungiExchange.py to Canvas containing your implementation of the program, and complete the questions on Canvas. Fill out the A2 Hours quiz on Canvas with an
estimate of the number of hours you spent working on both parts of this assignment.

Rubric
Canvas questions 16 points
Top of python file has comments, including name, date, and description 1
The program correctly prompts for the mushroom input 2
The ruby and remaining mushroom counts are correct after making a trade 7
The remaining mushroom counts are correct when the trade is not made 2
The program responds correctly if the player specifies they want to trade 0 3
The program responds correctly if the player wants to trade more mushrooms than have been picked up
The program uses no more than 10 if keywords 5
The code is commented adequately and variables are appropriately named 2
Total 41 points

3 Challenge Problem
The A2 challenge problem is worth 1 point of extra credit: Write a program that prompts
the user for three integers, and prints the median and mean of the three integers. Do not use
any built-in or external libraries (i.e., your program should not have any import statements).
Upload your solution as threestats.py
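
One possible approach, sketched below, finds the median of three values by subtracting the smallest and largest from their sum. The function name `median_and_mean` is hypothetical. The sketch assumes that the built-in functions `min` and `max` are allowed, since they need no import statement; if they are not, the same idea can be written with comparisons.

```python
def median_and_mean(a, b, c):
    """Return (median, mean) of three integers without importing anything."""
    total = a + b + c
    # Removing the smallest and the largest value leaves the middle one.
    median = total - min(a, b, c) - max(a, b, c)
    mean = total / 3
    return median, mean


print(median_and_mean(3, 1, 2))
```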

CSCI 141 Assignment 3: Conditionals and Loops

Guessing Game: 32 points

Assume you are a computer programmer working for a company, Nostalgia-R-Us, that makes
legacy (old-style, text-only) games for people who were using computers in the early 1980s.

The game you have been tasked to write is a simple guessing game. A player specifies how
many tries are allowed, and then proceeds to guess a secret two-character sequence. Because
the game is intended for distribution to alumni of Western Washington University, the letters
are selected from the letters in the word bellingham. See the sample screen shots in Figure 1
for sample gameplay.

Figure 1: Two sample runs of the program

Your manager has provided you with the following requirements and/or pseudocode:
• Your program must be named letterGuessGame.py.
• The program should provide a brief blurb that explains the game and prompts the player
to specify the number of tries.

• Two letters from the word bellingham should be chosen randomly as the secret answer.
Because the letters are chosen independently, both the first and second secret letters may
be the same.

• While the player has tries remaining, the game should prompt the user to guess a letter.
For each of the two secret letters, the program should specify whether the guess was
correct or not. If the guess does not match one (or either) of the secret letters, the program
should output a statement saying so.

• The program should remember which letter(s) (if any) have already been guessed correctly.
Once a letter has been guessed correctly, subsequent output should not mention that letter.
• If the player guesses both letters, the game should output “You win” and terminate right
away, even if the player has tries remaining.

• If the player does not correctly guess the secret letters in the number of tries indicated,
the game should end, specify that there are no more tries remaining, and the correct
answer should be revealed.

This game can be implemented many different ways. Declare and use as many variables as you
need to keep track of guesses. The logic for a sample “you lose” game play is shown below.
Num Guesses   : 4
Secret Answer : bh
User Guess 1  : b
Game Response : You have guessed the first letter. The second letter is not b.
User Guess 2  : g
Game Response : The second letter is not g.
User Guess 3  : l
Game Response : The second letter is not l.
User Guess 4  : e
Game Response : You are out of tries. Game over. The secret letters were b and h.
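
The per-guess logic in the sample above can be sketched as follows. This is one possible arrangement, not the required one. The names `pick_secret` and `respond` are hypothetical, and keeping the "already guessed" state in a pair of booleans is an assumption; separate variables would work just as well. The loop over remaining tries and the win/lose messages are left out.

```python
import random

WORD = "bellingham"


def pick_secret():
    """Choose two letters independently, so both may be the same letter."""
    return random.choice(WORD), random.choice(WORD)


def respond(guess, secret, found):
    """Return (messages, found) after one guess.

    secret is a pair of letters; found is a pair of booleans marking which
    positions were already guessed and should no longer be mentioned.
    """
    names = ("first", "second")
    messages = []
    updated = list(found)
    for i in range(2):
        if found[i]:
            continue  # already guessed: say nothing more about this letter
        if guess == secret[i]:
            updated[i] = True
            messages.append("You have guessed the " + names[i] + " letter.")
        else:
            messages.append("The " + names[i] + " letter is not " + guess + ".")
    return messages, tuple(updated)


messages, found = respond("b", ("b", "h"), (False, False))
print(" ".join(messages))
```

The player has won once both entries of `found` are `True`.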
2.1 Testing
Test your program thoroughly. Make sure it works correctly for a “win” and a “lose” scenario,
noting that a player can win after a minimum of 1 attempt (if both letters are the same and the
player guesses them on the first try) but can also win in more than that, up to (and including)
the maximum number of attempts specified.

Submission
Upload letterGuessGame.py to Canvas.

Rubric
Canvas (written) questions 18 points
Top of python file contains comments, including your name 1
The program correctly prompts for the number of tries 2
The program selects two random characters from the letters in bellingham 4
The program correctly keeps track of how many tries are remaining 4
The program specifies which (if any) of the secret letters have been guessed correctly after each guess
If one of the two letters has been guessed correctly, on subsequent guesses, the program does not mention the already-guessed letter
The program terminates right away and says “Win” if the player guesses correctly. If the player loses, the answer is revealed. 5
The program runs as intended, and does not generate errors 5
The code is commented adequately and variables are appropriately named 2
Total 50 points

3 Challenge Problem
This challenge problem is worth two points of extra credit: Write a program that prompts the
user for a non-negative decimal number, then prints the binary representation of the number
with no leading zeros.

Submit your Challenge Problem solution in a file named binary.py to the A3 Challenge (NOT
A3 Code) assignment on Canvas.
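
A common way to build a binary representation is repeated division by 2, collecting the remainders from right to left. The sketch below assumes that "decimal number" means a non-negative base-10 integer; the function name `to_binary` is hypothetical, and the prompt is left out.

```python
def to_binary(number):
    """Binary digits of a non-negative integer, with no leading zeros."""
    if number == 0:
        return "0"
    digits = ""
    while number > 0:
        digits = str(number % 2) + digits   # the next bit goes on the left
        number = number // 2
    return digits


print(to_binary(10))
```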

CSCI 141 Assignment 4: Functions

1 Overview

For this assignment you will write a Python program that draws a Sierpinski Triangle using a method
called a chaos game.

The chaos game is played as follows. The user chooses the window size (say 300 by 300 pixels). Denote
the three corners of an isosceles triangle 1, 2, 3, where corner 1 is at the top center of the screen,
corner 2 is in the lower left of the screen, and corner 3 is in the lower right of the screen.

Here’s some pseudocode to help you understand how the chaos game works:

Let p be a random point in the window
loop 10000 times:
    c = a random corner of the triangle
    m = the midpoint between p and c
    choose a color for m
    color the pixel at m
    p = m

This process will generate a Sierpinski Triangle like the one pictured below.
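
To see the geometry without any drawing, the pseudocode can be sketched as a function that only collects the points. This is an illustration under assumptions: the names `chaos_points` and `skip` are hypothetical, the coloring and Turtle steps are omitted, and `skip` leaves out the first few points as item 1 of the Details section suggests.

```python
import random


def midpoint(a, b):
    """Midpoint of two (x, y) points."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def chaos_points(width, height, iterations=10000, skip=10):
    """Run the chaos game and return the points that would be colored."""
    # Corners 1, 2, 3: top center, lower left, lower right.
    corners = [(width / 2, height), (0, 0), (width, 0)]
    p = (random.uniform(0, width), random.uniform(0, height))
    points = []
    for i in range(iterations):
        c = random.choice(corners)
        p = midpoint(p, c)
        if i >= skip:   # early points may not lie on the triangle yet
            points.append(p)
    return points


points = chaos_points(300, 300)
print(len(points))
```

In the full program each collected point would be colored with the turtle instead of being stored in a list.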

2 Details

1. One thing you might notice if you try this on paper (or if you carefully look at the image you
generate once it’s complete) is that the first few iterations of this game may produce points that
are not actually part of the Sierpinski triangle. Thus, you should start the process of generating
points, but don’t color the first few points you calculate. To be safe you should start adding
points to the image after 10 iterations.

2. You are provided with a skeleton code file called sierpinski.py. This file contains some code
to help get you started, including a function that sets up the Turtle graphics window for our
somewhat nontraditional turtle use case. In particular, take a look at the specification for the
turtle_setup function: this takes care of creating a turtle, and resizing the window to the
desired dimensions. Call this before beginning the chaos game iterations, and use the turtle it
returns to do all your pixel coloring.

3. The setup function changes the window so its coordinate system now has (0, 0) at the bottom
left corner, with positive x going right and positive y going up, so the top left corner is
at (0, canv_height), the bottom right corner is at (canv_width, 0), and the top right is at
(canv_width, canv_height). This helps to simplify the math when locating corners of the
triangle. The setup function also calls tracer(0, 0), which you may recall disables automatic
re-drawing of the canvas. This means that to get your picture to show up, you need to call
turtle.update() yourself. For the sake of speed, I recommend re-drawing the picture only every
100 or every 1000 iterations so the drawing doesn’t take too long.

4. In this program, we’re not really using turtles for what they were meant for. Instead of drawing
lines as the turtle moves, we’ll use the turtle to color individual pixels on the canvas. Turtles
draw as they move, but they can also draw shapes, such as circles and dots; we’ll make use of the
aptly named dot method. To fill in a pixel, all you need to do is move the turtle to that pixel,
then draw a dot of size 1.

5. When the turtle draws via movement with the pen down, or via other methods such as dot, the
color it draws is determined by the turtle’s current color. You can change the turtle’s current
color using the (again, aptly named) color method. One way to specify colors is using various
standard color names (“red”, “green”, “purple”, etc.).

A more flexible way is to specify
how much red, green, and blue you want: some combination of these three primary colors can
represent all colors that your screen can display. When storing images on computers, we typically
store each R,G, and B value using a single byte (8 bits). That means a color is represented by
three numbers from 0 to 255, which is the maximum number representable using 8 bits. For
instance (255, 0, 0) is red, (0, 255, 0) is green and (0, 0, 255) is blue. Furthermore, (255, 255, 255)
is white and (0, 0, 0) is black.

6. In the figure on the first page, you can see that the colors of the pixels are related to their
coordinates. If you simply followed the pseudocode at the top of this document, but chose black
for the color every time, then you’d have a black and white version of the Sierpinski triangle.

That’s a good first step to make sure your chaos procedure is working correctly. Once you have
that working, you should figure out how to make the triangle prettier. The color scheme used
in the example above chooses each color value based on the distance from one of the corners. In
particular, the red color scales from 255 to 0 based on distance from corner 1, the green scales
with distance from corner 2, and blue scales with distance from corner 3.

You may choose a
different scheme, but your colors should appear in a smooth gradient across the triangle, and
the corners should each be colored one of the “base” colors (red, green, and blue). This is not
too difficult in a square window but you need to be a little more careful in a non-square window
(when the width of the window is not equal to the height of the window).
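
One way to get a distance-based gradient is sketched below. It is not necessarily the scheme used for the example image. The scaling rule is an assumption: each channel is 255 at its own corner and fades to 0 at the distance of the nearest other corner, which keeps every corner a pure base color even in a non-square window. The names `choose_color` and `distance` follow the handout, but their signatures here are guesses.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def choose_color(point, corners):
    """Return an (r, g, b) tuple for point, given corners (c1, c2, c3)."""
    color = []
    for i in range(3):
        others = [corners[j] for j in range(3) if j != i]
        # Assumed scale: the channel reaches 0 at the nearest other corner.
        scale = min(distance(corners[i], other) for other in others)
        strength = max(0.0, 1 - distance(point, corners[i]) / scale)
        color.append(round(255 * strength))
    return tuple(color)


corners = ((100, 200), (0, 0), (200, 0))   # corners 1, 2, 3 of a 200 by 200 window
print(choose_color((100, 0), corners))     # bottom middle of the window
```

Clamping with `max(0.0, ...)` keeps a channel from going negative for points farther away than the scale distance.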

3 Suggested Approach

This may seem like a big problem to solve all at once; in fact it is, so it is highly recommended that
you write functions to solve small pieces of the problem, then put them together into a solution to the
full program.

1. In class we wrote distance and midpoint functions. These will come in handy here, so copy
those into your code, and feel free to modify them as needed.

2. I included the pseudocode for the chaos game in the skeleton file. This is a handy way to keep
track of your overall program structure: start with pseudocode and piece-by-piece fill in code
that accomplishes each of the steps. Because each step has some complexity to it, you should
define functions that take care of the details of each step. That way, the code in your main
program will end up corresponding fairly closely to the lines of the pseudocode, and it will be
easy to understand.

3. Based on the pseudocode, decide what functions you’d like to have in order to make the algorithm
easy to implement. In my solution, I have almost one-to-one correspondence between functions
and lines in the pseudocode. To give one example, to choose a color for the point m, I have a
choose_color function. It takes a point and the three corners and calculates the RGB color
values based on the distance of the point from each of the corners. This function in turn makes use of
the distance function.

4. Instead of immediately starting to code each function you’ve decided to write, try this instead:
write out the specification (docstring) for the function. This means deciding what the function
takes as arguments and what it returns. Once you have this, try sketching out the code for the
chaos game, using the functions (even though you haven’t written them yet!). In doing this, you
may discover changes that you want to make to your function specifications—make them now so
you don’t have to rewrite the code.

5. Now, go implement each of your functions. Start with the ones that will be needed to draw the
triangle in black. After finishing one function, test it. Use the interactive shell and/or put code
in your main program that checks whether the code does what you expect it to. For example, to
test my choose_color function, I first tried passing in each corner: I made sure the top corner
gave me (255, 0, 0) back, and so on for all three corners.

Then I tried the bottom middle point on
the canvas, because it’s easy for me to calculate that its blue and green values should be about
128 (it’s equidistant from the green corner and the blue corner). Then test the center point – its
RGB values should all be equal because it’s equidistant from all three corners.

6. Finally, turn your sketch of the overall chaos game algorithm into real code that uses your
functions to draw the Sierpinski triangle. Make sure it works with different square window sizes
first (e.g., 200 by 200, 300 by 300). Then try testing it with unequal width and height (e.g., 200
by 300).

4 Hints
1. I defined three variables in my main program that hold the coordinates of the three corners, since
the corner coordinates are needed in several places. The functions that do calculations involving
corners need to take the relevant corners as parameters.

2. Drawing a black and white triangle is a great first step. You may want to start out simply
choosing black as the color for all pixels to ensure the geometry is all working correctly.

3. The distance and midpoint functions used tuples to pack the coordinates of points together into
a single argument / return value. This is a design decision, and you may choose to use this
approach in your functions or not. For example, my function that colors a certain pixel a given
color has the following header:
color_pixel(turt, point, color)
where point is expected to be a 2-tuple point = (x, y), and color = (r, g, b) is a 3-tuple
of the RGB color values. I could also have written it
color_pixel(turt, px, py, r, g, b)
but I think it’s cleaner to pass points and colors to functions as tuples.

4. How solidly filled in your triangle is depends on how many iterations of the chaos game you run
and how large your canvas is. A smaller canvas has fewer pixels to fill in, so fewer iterations
will make a more solid picture, but it will have lower resolution. A large picture requires more
iterations but has higher definition. Feel free to experiment with running more iterations to get
larger, higher-definition triangles, but please turn in code that runs 10000 iterations and
runs in less than 5 seconds.

To keep things fast, remember that you can choose how often
to call turtle.update(); for maximum speed, call it once after all your iterations are complete.
The images below show what you can expect your drawing to look like with 10000 iterations for
a few different canvas sizes.
Here’s a 200×200 output:
Here’s a 100×300 output:
Here’s a 200×100 output:

5 Guidelines
Please make sure your program follows these guidelines:
• Your code should run 10000 iterations of the chaos game and run in under 5 seconds.
• Your functions should not directly make use of (refer to) any global variables. Any information
a function needs to do its job should be passed into the function as a parameter.
• Your code should do all the drawing (i.e., color all pixels) with the Turtle object returned by
the setup function. Don’t create any additional turtles.

• Each of your functions, and the main program, should not be too long. Not counting comments,
docstrings, and blank lines, my main program (the part in the if __name__ == "__main__":
block) is just under 20 lines and each of my functions is less than 10 lines. If you find yourself
writing a continuous block of code that’s longer than about 30 lines (not counting comments and
blank lines), think about how you could break it up into logical subtasks and write functions to
accomplish each one.

• Your functions and variable names should be descriptive but not overly long. For example, your
corner 1 variable should probably not be called c1, nor should it be called
the_top_middle_corner_of_the_triangle. Somewhere in between is best.

Submission
Take a screenshot of the drawing produced on a canvas with width = 300, height = 120, and name
it triangle.png. Zip sierpinski.py and triangle.png into a zip file called a4.zip and submit it to the A4 Code assignment
on Canvas. As usual, please fill out the A4 Hours survey with an estimate of the hours you spent on
this assignment.

Rubric
Submission Mechanics (10 points)
You submitted a zip file called a4.zip containing sierpinski.py. 1
Your zip file also includes triangle.png, a screenshot of your program’s result on a 300×120 canvas.
Your sierpinski.py program runs in under 5 seconds. 4
sierpinski.py contains comments including your name, date, and description at the top

Code Style and Clarity (36 points)
Your program defines at least two additional functions beyond the provided setup, distance, and midpoint functions.

Each function you introduce has a docstring containing a clear function specification. 8
Your functions do not make use of any global variables. 6
The main program and each individual function are not excessively long. 6
Function and variable names are descriptive but not too verbose. 6
Correctness (34 points)

The triangle is drawn correctly for a square window 10
The triangle is drawn correctly for a non-square window 10
The first ten points generated are not added to the image. 4
Each corner is colored one of red, green and blue as described above 5
The colors gradually blend according to their distance from each corner 5
Total 80 points

6 Challenge Problem
Take a look at the following web page: https://mathworld.wolfram.com/ChaosGame.html. There
you can see how what we’re doing here is just one specific case of a general idea. The general idea is
that you can have triangles, squares, pentagons, hexagons, etc. And when you choose a random corner,
instead of finding the midpoint you could find the point that is 1/3 of the way to the corner, or 3/8 of
the way to the corner, etc. Make a copy of your main assignment program in a file named chaos.py.

In this file, implement the following function:
def chaos_game(canv_width, canv_height, poly_sides, ratio):
    """ Run a chaos game on a canvas with size (canv_width, canv_height)
    with n = poly_sides (i.e., a poly_sides-sided polygon)
    and r = ratio (i.e., fraction of distance from the corner)
    """

This challenge may require usage of material we haven’t covered in detail (for example, lists will likely
come in handy to store the corners of the polygon). If you are trying to tackle this and encounter any
problems, come talk to me and I’d be happy to help. Successful completion of the challenge problem
is worth 5 points of extra credit. Submit chaos.py to the A4 Challenge assignment on Canvas.

CSCI 141 Assignment 5: Cancer Classification using Machine Learning

1 Overview

The goal of this project is to gain more practice with using functions, lists and dictionaries and
gain some intuition for Machine Learning, the field of computer science concerned with writing
algorithms that allow computers to “learn” from data.

The problem we’ll be solving is as follows: Given a data file containing hundreds of patient
records with values describing measurements of cancer tumors and whether or not each tumor
is malignant or benign, develop a simple rule-based classifier that can be used to predict whether
an as-yet-unseen tumor is malignant or benign.

The general idea is that malignant tumors are different than benign tumors. Malignant tumors
tend to have larger radii, to be more smooth, to be more symmetric, etc. Measurements have
been taken on many tumors whose class (malignant or benign) is known. The code you are
going to write will get the average score across all the malignant tumors for an attribute (e.g.
‘area’) as well as the average score for that attribute for benign tumors. Let’s say that the
average area for malignant tumors is 100, and for benign tumors is 50. We can then use that
information to try to predict whether a given tumor is malignant or benign.

Imagine you are presented with a new tumor and told the area was 99. All else being equal,
we would have reason to think this tumor is more likely to be malignant than had its area
been 51. Based on this intuition, we are going to create a simple classification scheme. We
will calculate the midpoint between the malignant average and the benign average (75 in our
hypothetical example), and simply say that for each new tumor, if its value for that attribute
is greater than or equal to the midpoint value for that attribute, that is one vote for the tumor
being malignant.

Each attribute that we are using produces a vote, and at the end of counting
votes for each attribute, if the malignant votes are greater than or equal to the benign votes,
we predict that the tumor is malignant.
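
The hypothetical area example above can be worked through in a few lines. The function names midpoint and vote are made up for this illustration, and the averages (100 and 50) are the hypothetical numbers from the text, not values from the real data.

```python
def midpoint(malignant_avg, benign_avg):
    """Return the value halfway between the two class averages."""
    return (malignant_avg + benign_avg) / 2


def vote(value, cutoff):
    """Return "M" if value is at or above the cutoff, otherwise "B"."""
    if value >= cutoff:
        return "M"
    return "B"


area_cutoff = midpoint(100, 50)
print(area_cutoff)            # 75.0
print(vote(99, area_cutoff))  # M
print(vote(51, area_cutoff))  # B
```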

2 Machine Learning Framework

“Machine learning” is a popular buzzword that might evoke computer brain simulations, or
robots walking among humans. In reality (for now, anyway), machine learning refers to something less fanciful: algorithms that use previously observed data to make predictions about
new data. It may sound less glamorous than fully sentient robots, but that’s exactly what was
described above! You can get more sophisticated about the specifics of how you go about this,
but that’s the core of what machine learning really means.

If using data to make predictions on new data is our goal, you might think it makes sense to use
all the data we have to learn from. But in fact, if we truly don’t know the labels (e.g., malignant
or benign) of the data we’re testing our algorithm on, we won’t have any idea whether it’s doing
a good job! For this reason, it makes sense to split the data we have labels for into a training
set, which we’ll use to “learn” from, and a test set, which we’ll use to evaluate how well the
algorithm does on new data (i.e., data it wasn’t trained on). We will take about 80% of the
data as our training set, and use the remaining 20% as our test set.
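
In this assignment the split has already been done for you (the training and test sets come as separate files), but the idea can be sketched as follows. The function name split_records is hypothetical, and the sketch assumes the records are already in a random order.

```python
def split_records(records, train_fraction=0.8):
    """Split records into a training list and a test list.

    The first train_fraction of the records become the training set and
    the rest become the test set. Assumes records are in random order.
    """
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


train, test = split_records(list(range(10)))
print(len(train), len(test))  # 8 2
```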

2.1 Training Phase
Here’s how our classifier will work: In the training phase, we will “learn” (read: compute) the
average value of each attribute (e.g. area, smoothness, etc.) among the malignant tumors. We
will also “learn” (again: compute) the average value of each attribute among benign tumors.
Then we’ll compute the midpoint for each attribute. This collection of midpoints, one for each
attribute, is our classifier.
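
As a rough sketch of the training phase on a toy data set: the names class_average and compute_midpoints are hypothetical, the attribute list is shortened to two attributes, and the four records are made up for illustration. This is one possible way to organize the computation, not the required structure of your solution.

```python
ATTRS = ["radius", "area"]  # shortened, hypothetical attribute list


def class_average(records, attr, label):
    """Return the average of attr over the records whose class is label."""
    values = [record[attr] for record in records if record["class"] == label]
    return sum(values) / len(values)


def compute_midpoints(records):
    """Return a dictionary mapping each attribute to its midpoint."""
    midpoints = {}
    for attr in ATTRS:
        malignant_avg = class_average(records, attr, "M")
        benign_avg = class_average(records, attr, "B")
        midpoints[attr] = (malignant_avg + benign_avg) / 2
    return midpoints


toy_records = [
    {"radius": 20.0, "area": 1200.0, "class": "M"},
    {"radius": 22.0, "area": 1400.0, "class": "M"},
    {"radius": 10.0, "area": 300.0, "class": "B"},
    {"radius": 12.0, "area": 500.0, "class": "B"},
]
print(compute_midpoints(toy_records))  # {'radius': 16.0, 'area': 850.0}
```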

2.2 Testing Phase
Having trained our classifier, we can now use it to make an educated guess about the label of
a new tumor if we have the measurements of all of its attributes. Our educated guess will be
pretty simple:
• If the tumor’s value for an attribute is greater than or equal to the midpoint value for
that attribute, cast one vote for the tumor being malignant.

• If the tumor’s attribute value is less than the midpoint, cast one vote for the tumor being
benign.
• Tally up the votes cast according to these rules for each of the ten attributes. If the
malignant votes are greater than or equal to the benign votes, we predict that the tumor
is malignant.
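
The voting rules above can be sketched for a single record as follows. The names count_votes and predict are hypothetical, and the two-attribute ATTRS list and midpoint values are made up to keep the example short.

```python
ATTRS = ["radius", "area"]  # shortened, hypothetical attribute list


def count_votes(record, midpoints):
    """Return (malignant_votes, benign_votes) for one record."""
    malignant = 0
    benign = 0
    for attr in ATTRS:
        if record[attr] >= midpoints[attr]:
            malignant += 1
        else:
            benign += 1
    return malignant, benign


def predict(record, midpoints):
    """Return "M" if malignant votes are at least the benign votes, else "B"."""
    malignant, benign = count_votes(record, midpoints)
    if malignant >= benign:
        return "M"
    return "B"


toy_midpoints = {"radius": 16.0, "area": 850.0}
print(predict({"radius": 17.0, "area": 800.0}, toy_midpoints))  # M (1-1 tie)
print(predict({"radius": 10.0, "area": 300.0}, toy_midpoints))  # B
```

Note that a tie goes to malignant, because the rule is "greater than or equal to".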

If we want to use this classifier to diagnose people, we have an important question to answer:
how good are our guesses? To answer this question, we’ll test our algorithm on the 20% of
our data that we held out as the test set, which we didn’t use to train the classifier but for
which we do know the correct labels. Our rate of accuracy on these data should be indicative of how well
our classifier will do on new, unlabeled tumors.

3 Dataset Description

You have been provided with cancerTrainingData.txt, a text file containing the 80% of the
data that we’ll use as our training set.

The file has many numbers per patient record, some of which refer to attributes of the tumor.
The skeleton code includes the function make_training_set(), which reads in the important
information from this file and produces a list of dictionaries. Each dictionary contains attributes
for a single tumor as follows:
0. ID
1. radius
2. texture
3. perimeter
4. area
5. smoothness
6. compactness
7. concavity
8. concave
9. symmetry
10. fractal
11. class

The middle 10 attributes (numbered 1 through 10) are the numbers that describe the tumor.
The first attribute is just the patient ID number, and the last attribute is the actual real life
state of the tumor, namely, malignant (represented by “M”) or benign (represented by “B”).

We don’t need to know what these attributes mean: all we need to know is that they are
measurements of the tumors, and that benign and malignant tumors tend to have different
attribute values. For all 10 of these tumor attributes, a value at or above the midpoint
indicates malignancy.

Pictorially, the list of dictionaries looks like this (two are shown,
but the list contains many more than that):
training_set (list of dicts)

key           training_set[0]    a later entry
ID            897880             89812
radius        10.05              23.51
texture       17.53              24.27
perimeter     64.41              155.1
area          310.8              1747
smoothness    0.1007             0.1069
compactness   0.07326            0.1283
concavity     0.02511            0.2308
concave       0.01775            0.141
symmetry      0.189              0.1797
fractal       0.06331            0.05506
class         B                  M

Figure 1: Illustration of the data layout of the training set returned by make_training_set

The dictionary stored in the 0th spot in the list gives the attributes for the 0th tumor:
training_set[0]["class"] gives the true class label (in this case, "B" for benign) of the
0th tumor.
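
As a small illustration of this layout, here is a hand-built list containing shortened versions of the two records from Figure 1. Whether make_training_set stores the ID as an integer or a string is an assumption here; check the skeleton code.

```python
# Two shortened records copied from Figure 1 (only a few attributes kept).
training_set = [
    {"ID": 897880, "radius": 10.05, "area": 310.8, "class": "B"},
    {"ID": 89812, "radius": 23.51, "area": 1747, "class": "M"},
]

print(training_set[0]["class"])  # B
print(training_set[1]["area"])   # 1747

# Count how many records in the list are malignant.
malignant_count = 0
for record in training_set:
    if record["class"] == "M":
        malignant_count += 1
print(malignant_count)  # 1
```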

4 Getting Started
Download the skeleton code (cancer_classifier.py), training set (cancerTrainingData.txt),
and the test set (cancerTestingData.txt). Make sure all three files are in the same directory,
or the main program will not be able to load the data from the files.

5 Tasks

5.0 Overview
Training and evaluating our classifier involves several steps. The first task, which has been
done for you, is to write code to load the training and test data sets from text files into lists
of dictionaries representing patient records, as described in the previous section. The functions
make_training_set and make_test_set are included in the skeleton code to complete these
steps.

You will complete the following four tasks:
• TODO 1: Train the classifier
• TODO 2: Apply the classifier to the test set
• TODO 3: Calculate and report accuracy on the test set
• TODO 4: Provide classifier details on user-specified patients
The main program has been provided to you: you will be implementing functions that are called
from the main program at the bottom of the skeleton code file. Take a moment to read through
and understand the main program (notice that the parts of the program that use TODOs 1–4
are commented out).

Each of the above steps is described in detail in the remainder of this section. After you finish
each TODO (2 and 3 are completed together), uncomment the corresponding block in the main
program and run your code to make sure your output matches the sample output provided
below.

5.1 TODO 1: Train the classifier
A classifier is simply some model of a problem that allows us to make predictions about new
records. We use the training set to build up a simple model, as described in Section 2:
• For all malignant records, calculate the average value of each attribute.
• For all benign records, calculate the average value of each attribute.

• Calculate the midpoint between these averages for each attribute.
Our classifier is a single dictionary that stores this midpoint value for each attribute.
Implement this functionality in train_classifier. My solution for this part totals roughly 30
lines of code. As always, you may find it useful to write helper functions that perform smaller
tasks: for example, you could create a helper function to initialize a dictionary with each of the
attributes as keys and 0 as values.

When done, uncomment the block of code in the main program that calls train_classifier
and debug your code until your attribute midpoints match the sample output.
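
The helper suggested above might look something like this. The names make_zero_dict and sum_attributes are hypothetical, and the contents of ATTRS shown here are an assumption based on the attribute list in Section 3; use the ATTRS defined in the skeleton code.

```python
# Assumed contents of ATTRS; the skeleton code defines the real list.
ATTRS = ["radius", "texture", "perimeter", "area", "smoothness",
         "compactness", "concavity", "concave", "symmetry", "fractal"]


def make_zero_dict():
    """Return a dictionary with each name in ATTRS as a key and 0 as its value."""
    totals = {}
    for attr in ATTRS:
        totals[attr] = 0
    return totals


def sum_attributes(records):
    """Return a dictionary of per-attribute sums over the given records."""
    totals = make_zero_dict()
    for record in records:
        for attr in ATTRS:
            totals[attr] += record[attr]
    return totals
```

Dividing each sum by the number of records then gives the per-attribute average for that group of records.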

5.2 TODO 2: Apply the classifier
After computing the classifier (namely, the dictionary of attribute midpoints), we can use these
values to make predictions given the attribute values of a new patient. A record is classified as
follows:
For each attribute, determine whether the record’s value is less than the classifier’s
midpoint value (the same rule as in Section 2.2). If so, cast one vote for Benign; otherwise,
cast one vote for Malignant. If the
votes for Malignant are greater than or equal to the votes for Benign, the record is classified as
Malignant; otherwise, it is classified as Benign.

Implement this classification scheme in the classify function, applying it to each record in
the test set. Notice that the prediction for a record is to be stored in the "prediction" field
of the dictionary for that record.

5.3 TODO 3: Report accuracy
For each record in the test set, compare the predicted class to the actual class. Print out the
percentage of records that were labeled correctly (i.e., the predicted class is the same as the
true class).
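
A minimal sketch of the accuracy computation, assuming each record already has its "prediction" field filled in. The function name percent_correct is hypothetical, and the four toy records are made up.

```python
def percent_correct(records):
    """Return the percentage of records whose prediction matches their class."""
    correct = 0
    for record in records:
        if record["prediction"] == record["class"]:
            correct += 1
    return 100 * correct / len(records)


toy_test_set = [
    {"class": "M", "prediction": "M"},
    {"class": "B", "prediction": "B"},
    {"class": "B", "prediction": "M"},
    {"class": "M", "prediction": "M"},
]
print(percent_correct(toy_test_set))  # 75.0
```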

5.4 TODO 4: Provide patient details
The final task is to provide a user the opportunity to examine the details of the predictions made
for individual patients. Implement check_patients, which contains commented pseudocode
describing its exact behavior. You are strongly encouraged to write helper functions that
are called from within this function: if a pseudocode step requires more than a few lines of
code, consider making a helper function to accomplish that step.
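
One step that lends itself to a helper function is looking up a record by patient ID. The sketch below is only an illustration: find_patient is a hypothetical name, and it assumes IDs are stored as integers, so a string typed by the user would need to be converted with int() first. Follow the pseudocode in the skeleton where it differs.

```python
def find_patient(records, patient_id):
    """Return the record whose "ID" equals patient_id, or None if not found."""
    for record in records:
        if record["ID"] == patient_id:
            return record
    return None


toy_test_set = [
    {"ID": 897880, "class": "B"},
    {"ID": 89812, "class": "M"},
]
print(find_patient(toy_test_set, 89812))  # {'ID': 89812, 'class': 'M'}
print(find_patient(toy_test_set, 12345))  # None
```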

If the user-specified patient ID is found in the test set, print a table with four columns:
• Attribute: the name of the attribute
• Patient: the patient’s value for that attribute
• Classifier: the classifier’s threshold (midpoint) for that attribute
• Vote: the vote cast by the classifier for that attribute

See the sample output for specifics of what the table should look like.
Printing a table of results with nice even columns and uniformly formatted decimal numbers
requires delving into the details of string formatting in Python. There are multiple ways to do
this, but the following tips should be sufficient (see footnote 1):

• String objects have rjust and ljust methods, which return a copy of the string padded
to the given width, justified either right or left. For example, "abc".rjust(5) returns
"  abc".

• Floating point numbers can be formatted nicely using the format method, which is called
on strings containing special formatting specifiers. For example, "{:.2f}".format(8.632)
formats the argument (8.632) as a float with 2 decimal places, resulting in the string
"8.63".

Your table does not need to match the sample output character for character, but your columns
should be lined up, right justified, and floating-point values should be printed with the decimals
aligned and a consistent number of digits following the decimal point.
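
Putting the two tips together, one row of the table could be built like this. The function name format_row and the column width of 12 are arbitrary choices made for this sketch; the numbers are taken from the sample output.

```python
def format_row(attribute, patient_value, cutoff, vote):
    """Return one table row with four right-justified, 12-character columns."""
    cells = [
        attribute.rjust(12),
        "{:.4f}".format(patient_value).rjust(12),
        "{:.4f}".format(cutoff).rjust(12),
        vote.rjust(12),
    ]
    return "".join(cells)


header = "".join(name.rjust(12)
                 for name in ["Attribute", "Patient", "Classifier", "Vote"])
print(header)
print(format_row("radius", 10.05, 14.545393772893773, "Benign"))
print(format_row("area", 310.8, 693.337728937729, "Benign"))
```

Because every number is formatted with the same number of decimal places before being right-justified, the decimal points line up automatically.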

6 Sample Output
A sample run of my solution program is shown below. User input follows each “Enter a patient ID” prompt.
Reading in training data…
Done reading training data.
Reading in test data…
Done reading test data.
Training classifier…
Classifier cutoffs:
radius: 14.545393772893773
texture: 19.279093406593404
perimeter: 94.91928571428579
area: 693.337728937729
smoothness: 0.09783294871794869
compactness: 0.1104729532967033
concavity: 0.09963735815018318
concave: 0.054678068681318664
symmetry: 0.18456510989010982
fractal: 0.06286657967032966
Done training classifier.
Making predictions and reporting accuracy
Classifier accuracy: 92.20779220779221
Done classifying.

Enter a patient ID to see classification details: 897880
  Attribute     Patient  Classifier       Vote
     radius     10.0500     14.5454     Benign
    texture     17.5300     19.2791     Benign
  perimeter     64.4100     94.9193     Benign
       area    310.8000    693.3377     Benign
 smoothness      0.1007      0.0978  Malignant
compactness      0.0733      0.1105     Benign
  concavity      0.0251      0.0996     Benign
    concave      0.0177      0.0547     Benign
   symmetry      0.1890      0.1846  Malignant
    fractal      0.0633      0.0629  Malignant
Classifier’s diagnosis: Benign
Enter a patient ID to see classification details: 89812
  Attribute     Patient  Classifier       Vote
     radius     23.5100     14.5454  Malignant
    texture     24.2700     19.2791  Malignant
  perimeter    155.1000     94.9193  Malignant
       area   1747.0000    693.3377  Malignant
 smoothness      0.1069      0.0978  Malignant
compactness      0.1283      0.1105  Malignant
  concavity      0.2308      0.0996  Malignant
    concave      0.1410      0.0547  Malignant
   symmetry      0.1797      0.1846     Benign
    fractal      0.0551      0.0629     Benign
Classifier’s diagnosis: Malignant
Enter a patient ID to see classification details: quit

Footnote 1: For much more detail on string formatting, see the Python Tutorial entry:
https://docs.python.org/3/tutorial/inputoutput.html

7 Hints and Guidelines
• Start by reading through the skeleton code, and making sure you know what the main
program does and how the functions you are tasked with implementing fit into the overall
program.

• If your understanding of lists and dictionaries is shaky, you will have great difficulty
making progress. Visit my office hours, TA office hours, or mentor hours early so you
don’t spend too much time struggling.

• The top of the skeleton file has a global variable called ATTRS, which is a list of the
attribute names each patient record has. Using global variables with all-caps names is
a common convention when you have variables that need to be referenced all over your
program and (crucially) never change value. You may refer to ATTRS from anywhere in
your program, including inside function definitions, without passing it in as a parameter.

• As in A4, all variables (other than ATTRS) referenced from within functions must be local
variables – if you need access to information from outside the function, it must be passed
into the function as a parameter.

• When iterating over patient record dictionaries, use loops over the keys stored in ATTRS
rather than looping directly over the dictionary’s keys. An example of this appears in the
main program where the classifier cutoffs are printed.

• The functions provided in the skeleton code include headers and specifications. Make sure
you follow the given specifications (and don’t modify them!).

• Keep the length of each function short: if you’re writing a function that takes more than
about 30 lines of code (not including comments and whitespace), consider how you might
break the task into smaller pieces and implement each piece using a helper function.

• All helper functions you write must have docstrings with precise, clearly written specifications.

• Test each function after you’ve written it by running the main program with the
corresponding code block uncommented. Don’t move on until the corresponding portion of the
output matches the sample.
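
To see why looping over ATTRS is safer than looping over a record’s own keys, consider the short sketch below. The two-attribute ATTRS list and the function name attribute_values are made up for illustration; the record values come from Figure 1.

```python
ATTRS = ["radius", "texture"]  # shortened, hypothetical attribute list

record = {"ID": 897880, "radius": 10.05, "texture": 17.53,
          "class": "B", "prediction": "B"}


def attribute_values(record):
    """Return the record's numeric measurements, in ATTRS order."""
    values = []
    for attr in ATTRS:
        values.append(record[attr])
    return values


print(attribute_values(record))  # [10.05, 17.53]
```

Looping over the record’s keys directly would also visit "ID", "class", and "prediction", none of which should be averaged or compared to a midpoint.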

Submission
Upload cancer_classifier.py to Canvas and fill in the A5 Hours quiz with an estimate of
how many hours you spent working on this assignment.

Rubric
Submission Mechanics (2 points)
File called cancer_classifier.py is submitted to Canvas 2

Code Style and Clarity (28 points)
Comment at the top with author/date/description 3
Comments throughout code clarify any nontrivial code sections 5
Variable and function names are descriptive 5
Helper functions are used to keep functions no longer than about 30 lines of
code (not counting comments and blank lines) 5
ATTRS is used to iterate over dictionary attributes 5
No global variables except ATTRS are referenced from within functions 5

Correctness (70 points)
The trained classifier has the correct midpoint values for each attribute 30
Prediction is performed as described using the midpoints computed in training 5
Accuracy is computed and reported correctly as shown in the demo output 10
User is repeatedly prompted for Patient ID 5
Message is printed if given ID is not in the test set. 5
If ID is in the test set, table is printed with all four columns and rows for all 10
attributes 10
Table columns are right-justified and aligned 3
Floating-point values in the table are lined up on the decimal point and have a
fixed number of digits after the decimal. 2

Total 100 points

Acknowledgements
This assignment was adapted from a version used by Perry Fizzano, who adapted it from an
original assignment developed at Michigan State University for their CSE 231 course.

8 Challenge Problem
The following challenge problem is worth up to 10 points of extra credit. As usual, this can
be done using only material we’ve learned in class, but it’s much more open-ended. If you are
trying to tackle this, feel free to come talk to me about it in office hours.

In this assignment, you trained a simple classifier that used means over the entire training set
to classify unseen examples. This simple classifier does quite well (92% accuracy) on the test
set. There are many more sophisticated methods for learning classifiers from training data,
some of which depend on some pretty heavy-duty mathematical derivations.

One type of classifier that doesn’t require a lot of math but nonetheless performs pretty well
on a lot of real-world problems is called the Nearest Neighbor Classifier or its more general
cousin, the K Nearest Neighbors Classifier. The idea behind KNN is that records with
similar attributes should have similar labels. So a reasonable way to guess the label of a
previously-unseen record is to find the record from the training set that is most similar to it
and guess that record’s label.

To implement a nearest neighbor classifier, we need some definition of what it means to be
“near”. One of the simplest choices for numerical data like ours is the Sum of Squared
Differences (SSD) metric. Given two records, compute the difference between the two records’
values for each attribute, square the difference, and add up all the squared differences over
all 10 attributes. The smaller
the SSD metric, the more similar the two records are.
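
A minimal sketch of the SSD computation, using a shortened two-attribute ATTRS list and made-up records. The function name ssd is a hypothetical choice.

```python
ATTRS = ["radius", "area"]  # shortened, hypothetical attribute list


def ssd(record_a, record_b):
    """Return the sum of squared differences between two records over ATTRS."""
    total = 0
    for attr in ATTRS:
        difference = record_a[attr] - record_b[attr]
        total += difference ** 2
    return total


first = {"radius": 10.0, "area": 300.0}
second = {"radius": 13.0, "area": 304.0}
print(ssd(first, second))  # 25.0, from 3 squared plus 4 squared
```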

(up to 5 points) Implement a nearest neighbor classifier using the SSD metric in a file called
KNN.py. Feel free to copy and re-use the data loading functions, and any other functions that
remain relevant, from cancer_classifier.py. Evaluate your classifier’s accuracy like we did
in the base assignment. Write a comment in KNN.py reporting your classifier’s performance.

You might notice an issue with the SSD metric applied to our dataset: some attributes have
huge values (e.g., in the hundreds) and others have tiny values. When computing SSD, the
large-valued attributes will dominate the SSD score, even if they aren’t the most important
attributes. Come up with a way to modify the distance metric so that it weights attributes
evenly. Describe your approach and compare the performance of your new metric with SSD in
a comment in KNN.py.
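
There is more than one reasonable answer here. As one possibility (not necessarily the intended one), each difference could be divided by that attribute’s range in the training set before squaring, so that every attribute contributes on a comparable scale. The names attribute_ranges and scaled_ssd are hypothetical, and the sketch assumes no attribute has the same value in every training record (which would make its range zero).

```python
ATTRS = ["radius", "area"]  # shortened, hypothetical attribute list


def attribute_ranges(training_set):
    """Return a dictionary mapping each attribute to (max - min) over the set."""
    ranges = {}
    for attr in ATTRS:
        values = [record[attr] for record in training_set]
        ranges[attr] = max(values) - min(values)
    return ranges


def scaled_ssd(record_a, record_b, ranges):
    """Return the SSD with each difference divided by the attribute's range."""
    total = 0
    for attr in ATTRS:
        difference = (record_a[attr] - record_b[attr]) / ranges[attr]
        total += difference ** 2
    return total


toy_training = [
    {"radius": 10.0, "area": 300.0},
    {"radius": 20.0, "area": 1300.0},
]
toy_ranges = attribute_ranges(toy_training)
print(toy_ranges)  # {'radius': 10.0, 'area': 1000.0}
print(scaled_ssd(toy_training[0], toy_training[1], toy_ranges))  # 2.0
```

With plain SSD the area difference (1000) would swamp the radius difference (10); after scaling, both contribute equally.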

(up to 5 points) The nearest neighbor classifier can be a bit fiddly, because the guess depends
on a single datapoint in your training set, so a single unusual (or mislabeled!) training record
could cause a wrong prediction. A more robust classifier looks not just at the single nearest
neighbor, but at each of the K nearest neighbors, for some choice of K. Generalize your nearest
neighbor classifier to a KNN classifier. Try out different values of K and include a comment
discussing the classification accuracy for values of K. Do any of them beat the base assignment
classifier’s performance?
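
A rough sketch of the K nearest neighbors idea is shown below, using a shortened two-attribute ATTRS list and made-up training records. The name knn_predict is hypothetical, and breaking ties in favor of malignant is an assumption that mirrors the rule in the base assignment. With k = 1 this reduces to the plain nearest neighbor classifier.

```python
ATTRS = ["radius", "area"]  # shortened, hypothetical attribute list


def ssd(record_a, record_b):
    """Return the sum of squared differences between two records over ATTRS."""
    total = 0
    for attr in ATTRS:
        total += (record_a[attr] - record_b[attr]) ** 2
    return total


def knn_predict(record, training_set, k):
    """Predict "M" or "B" by majority vote among the k nearest training records."""
    by_distance = sorted(training_set, key=lambda other: ssd(record, other))
    neighbors = by_distance[:k]
    malignant = 0
    for neighbor in neighbors:
        if neighbor["class"] == "M":
            malignant += 1
    benign = len(neighbors) - malignant
    if malignant >= benign:  # ties go to malignant (an assumption)
        return "M"
    return "B"


toy_training = [
    {"radius": 10.0, "area": 300.0, "class": "B"},
    {"radius": 11.0, "area": 320.0, "class": "B"},
    {"radius": 20.0, "area": 1200.0, "class": "M"},
    {"radius": 22.0, "area": 1400.0, "class": "M"},
    {"radius": 12.0, "area": 350.0, "class": "B"},
]
print(knn_predict({"radius": 21.0, "area": 1300.0}, toy_training, 3))  # M
print(knn_predict({"radius": 10.5, "area": 310.0}, toy_training, 3))   # B
```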