Description
CPSC6430 Project 1: Basic plotting in Python with matplotlib.pyplot
Data File
The first line of IrisData.txt contains two integers. The first represents how many lines of data
are in the file. The second represents how many features are associated with each line. Each line after that
contains four floating point values representing the sepal length, sepal width, petal length, and
petal width of a type of iris flower, followed by the name of the iris type: either setosa,
versicolor, or virginica. Values are tab separated.
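The file layout described above can be read with a few lines of standard Python. The following is only a minimal sketch: the function names parse_iris and load_iris are hypothetical, and the sample rows are made up for illustration rather than taken from IrisData.txt.

```python
def parse_iris(lines):
    """Parse lines laid out like IrisData.txt.

    Returns (features, labels): a list of float lists and a list of type names.
    """
    # Header: number of data lines (m) and number of features per line (n).
    m, n = (int(tok) for tok in lines[0].split())
    features, labels = [], []
    for line in lines[1:m + 1]:
        fields = line.rstrip("\n").split("\t")
        features.append([float(v) for v in fields[:n]])
        labels.append(fields[n].strip())
    return features, labels


def load_iris(filename):
    """Read the named file and parse it (hypothetical helper)."""
    with open(filename) as fin:
        return parse_iris(fin.readlines())


# Made-up sample rows, for illustration only.
sample = [
    "2\t4\n",
    "5.0\t3.5\t1.5\t0.2\tsetosa\n",
    "6.0\t3.0\t4.5\t1.5\tversicolor\n",
]
features, labels = parse_iris(sample)
print(features, labels)
```

Keeping the parsing separate from the file handling makes it easy to check the logic on a few in-memory lines before pointing it at the real file.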
Python Program
Write a Python program that will print out a two-dimensional plot of any two features for all
three varieties of iris flowers in the plot area of Spyder using the features of
matplotlib.pyplot. Your program should begin by asking for the name of the input file. For
example (blue is user response):
Enter the name of your data file: IrisData.txt
Then the program should prompt the user for two features to plot. For example, it could say:
You can do a plot of any two features of the Iris Data set
The feature codes are:
0 = sepal length
1 = sepal width
2 = petal length
3 = petal width
Enter feature code for the horizontal axis: 0
Enter feature code for the vertical axis: 1
Once the program does a plot it should ask the user if they want to do another plot. For example:
Would you like to do another plot? (y/n) y
If the answer is y, the user should be prompted for new features to plot on the horizontal
and vertical axes as before, and a new plot should be generated. An answer of n should end the program.
Plots should be labelled with a title, axes should be labelled, there should be a legend and all
three varieties should be color coded and have different symbols. An example is shown below.
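One possible way to meet these plotting requirements is sketched below. It is an illustration under assumptions, not a reference solution: the names FEATURE_NAMES, STYLES, group_by_species and plot_features are hypothetical, the colors and markers are arbitrary choices, and the demo rows are made up.

```python
FEATURE_NAMES = ["sepal length", "sepal width", "petal length", "petal width"]
# One (color, marker) pair per variety; these particular choices are arbitrary.
STYLES = {
    "setosa": ("red", "o"),
    "versicolor": ("green", "^"),
    "virginica": ("blue", "s"),
}


def group_by_species(features, labels, x_code, y_code):
    """Return {variety: (xs, ys)} for the two chosen feature codes."""
    groups = {}
    for row, name in zip(features, labels):
        xs, ys = groups.setdefault(name, ([], []))
        xs.append(row[x_code])
        ys.append(row[y_code])
    return groups


def plot_features(features, labels, x_code, y_code):
    """Scatter-plot two features with a title, axis labels and a legend."""
    # Imported here so the grouping helper above works without matplotlib.
    import matplotlib.pyplot as plt

    groups = group_by_species(features, labels, x_code, y_code)
    for name, (xs, ys) in groups.items():
        color, marker = STYLES[name]
        plt.scatter(xs, ys, color=color, marker=marker, label=name)
    plt.title(f"Iris data: {FEATURE_NAMES[x_code]} vs. {FEATURE_NAMES[y_code]}")
    plt.xlabel(FEATURE_NAMES[x_code])
    plt.ylabel(FEATURE_NAMES[y_code])
    plt.legend()
    plt.show()


# Made-up rows, for illustration only.
demo_features = [[5.0, 3.5, 1.5, 0.2], [6.0, 3.0, 4.5, 1.5], [6.5, 3.0, 5.5, 2.0]]
demo_labels = ["setosa", "versicolor", "virginica"]
demo_groups = group_by_species(demo_features, demo_labels, 0, 1)
print(demo_groups)
# To draw the figure: plot_features(demo_features, demo_labels, 0, 1)
```

Calling plt.scatter once per variety is what lets the legend show one entry per variety with its own color and symbol.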
You are required to submit a project report, including:
• At least 3 plots with different x and y label combinations.
• A brief discussion of which feature combination is best for classifying the iris data.
• Full screenshot of your python console.
• A copy of your code
Your report should be named yourlastname_yourfirstname_P1.docx or .doc or .pdf. Your
Python program should be named yourlastname_yourfirstname_P1.py, then zipped together
with your project report and uploaded to Canvas.
CPSC6430 Project 2: Choosing a Model for Predicting on Unseen Data
For Project 2 you will create a regression program and choose a model to predict the women’s
Olympic 100-meter race record time for year 2022. We will code the year of each race as we did
in lecture 2.3. A text file with the data is available on Canvas for the years 1928 through 2008
when the Olympics were held. The first line of the text file gives m, the number of lines of data,
and n, the number of features (in this case, one).
Your project assignment is to compare three different models, linear, quadratic, and cubic.
h_w(x) = w_0 + w_1 x
h_w(x) = w_0 + w_1 x + w_2 x^2
h_w(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3
using 5-fold cross validation.
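To make the three hypotheses concrete, here is a minimal pure-Python sketch that fits a polynomial of a given degree and evaluates a cost J. Two things are assumptions rather than part of the assignment: the weights are found with the normal equations (gradient descent would also work), and J is taken to be the squared error divided by 2m. All function names are hypothetical and the demo data is made up.

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def fit_polynomial(xs, ys, degree):
    """Least-squares weights w_0..w_degree from (X^T X) w = X^T y."""
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    n = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(xtx, xty)


def predict(w, x):
    """Evaluate h_w(x) = w_0 + w_1 x + w_2 x^2 + ..."""
    return sum(wi * x ** p for p, wi in enumerate(w))


def cost_j(w, xs, ys):
    """Assumed cost: J = (1 / 2m) * sum of squared errors."""
    m = len(xs)
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Made-up data generated from 1 + 2x + 0.5x^2, for illustration only.
demo_x = [0.0, 1.0, 2.0, 3.0, 4.0]
demo_y = [1 + 2 * x + 0.5 * x ** 2 for x in demo_x]
for degree in (1, 2, 3):
    w = fit_polynomial(demo_x, demo_y, degree)
    print(degree, cost_j(w, demo_x, demo_y))
```

Raw year values raised to the third power are very large, which is one reason to work with coded years rather than the calendar year when fitting the higher-degree models.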
Then you should present a chart, similar to the one in the lecture (see below), of all your test
results and a plot of your training and test J’s with respect to the polynomial degree.
Training folds    Test fold    Linear    Quadratic    Cubic
1234              5
1235              4
1245              3
1345              2
2345              1
Mean for Training
Mean for Testing
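The fold pattern in the chart (train on four folds, test on the remaining one, five times) can be generated without any machine learning package. The sketch below is one way to do it, under the assumption that example i is assigned to fold i mod k; other partitions are equally valid. The names k_fold_indices and cross_validate are hypothetical, and the fit and cost arguments stand for whatever training and J functions you have written.

```python
def k_fold_indices(m, k=5):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    # Assumption: example i belongs to fold i % k.
    folds = [list(range(f, m, k)) for f in range(k)]
    # Hold out the last fold first, matching the row order of the chart.
    for test_fold in reversed(range(k)):
        test_idx = folds[test_fold]
        train_idx = sorted(
            i for f in range(k) if f != test_fold for i in folds[f]
        )
        yield train_idx, test_idx


def cross_validate(xs, ys, fit, cost, k=5):
    """Return (mean training J, mean test J) over the k splits."""
    train_js, test_js = [], []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        test_x = [xs[i] for i in test_idx]
        test_y = [ys[i] for i in test_idx]
        w = fit(train_x, train_y)
        train_js.append(cost(w, train_x, train_y))
        test_js.append(cost(w, test_x, test_y))
    return sum(train_js) / k, sum(test_js) / k


# Stand-in model for the demo: predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def half_mse(w, xs, ys):
    return sum((w - y) ** 2 for y in ys) / (2 * len(ys))


splits = list(k_fold_indices(10, 5))
mean_train_j, mean_test_j = cross_validate(
    list(range(10)), [3.0] * 10, fit_mean, half_mse
)
print(splits[0], mean_train_j, mean_test_j)
```

Passing fit and cost as arguments lets the same loop fill in the Linear, Quadratic and Cubic columns by swapping in a different model each time.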
Based on your data and plot, you should then:
• Argue which model (linear, quadratic, or cubic) you expect will best predict the times for
the women’s Olympic 100-meter race in the future.
• Compute weights using the complete data set with your best model.
• Using those weights, write a Python program that takes a year as input, then outputs
the winning women’s Olympic 100-meter race time for that year.
Important Note:
You may not use a Python machine learning package that provides k-fold cross validation
as a built-in function, for instance the sklearn package.
You are required to submit a project report, including:
• The J value chart as shown in the table above.
• A plot of your training and test J’s with respect to the polynomial degree
• Argue which model (linear, quadratic, or cubic) you will choose
• The final hypothesis function hw(x)
• Your prediction of the women’s Olympic 100-meter race record time for this winter’s Olympics (2022)
• Full screenshot of your python console.
• A copy of your code
Your report should be named yourlastname_yourfirstname_P2.docx or .doc or .pdf. Your
Python program should be named yourlastname_yourfirstname_P2.py, then zipped together
with your project report and uploaded to Canvas.
CPSC6430 Project 3: Classification with Logistic Regression and SVM
For this project we will apply both Logistic Regression and SVM to predict whether capacitors from a fabrication
plant pass quality control (QC) based on two different tests. To train your system and determine its reliability you
have a set of 118 examples. The plot of these examples is shown below, where a red x is a capacitor that failed QC
and the green circles represent capacitors that passed QC.
I have already randomized the data into two data sets: a training
set of 85 examples and a test set of 33 examples. Both are
formatted as
• First line: m and n, tab separated
• Each line after that has two real numbers representing the
results of the two tests, followed by a 1.0 if the capacitor
passed QC and a 0.0 if it failed QC (tab separated).
Assignment: Your assignment is to use what you have learned
from the class slides and homework to create a Logistic Regression
binary classifier (from scratch in Python, not by using a Logistic
Regression library function!) and an SVM binary classifier to predict whether
each capacitor in the test set will pass QC.
Logistic Regression: You are free to use any model variation and any testing or training approach we have discussed
for logistic regression. In particular, since this data is not linear, I assume you will want to add new features based on
powers of the original two features to create a good decision boundary. w_0 + w_1 x_1 + w_2 x_2 is not going to work!
One choice might be
w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6 + w_7 x_7 + w_8 x_8, where the new features are built from the original features as listed in the New Features table.
Note that it is easy to create a small Python program that reads in your original
features, uses a nested loop to create the new features and then writes them to a file.
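# x1, x2 and y hold one example from the original file; fout1 is the open output file.
thePower = 2
for j in range(thePower+1):
    for i in range(thePower+1):
        if i == 0 and j == 0:
            continue  # skip the constant term x1**0 * x2**0
        temp = (x1**i)*(x2**j)
        fout1.write(str(temp)+"\t")
fout1.write(str(y)+"\n")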
With a few additions to the code, you can make a program to create combinations of any powers of x1 and x2!
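As a sketch of that generalization, the nested loop can be wrapped in a function that takes the maximum power as a parameter. The name expand_features is hypothetical, and the column order simply follows the loop nesting.

```python
def expand_features(x1, x2, power=2):
    """Return every product x1**i * x2**j with 0 <= i, j <= power, except i = j = 0."""
    feats = []
    for j in range(power + 1):
        for i in range(power + 1):
            if i == 0 and j == 0:
                continue  # the constant term is handled by the bias weight w0
            feats.append((x1 ** i) * (x2 ** j))
    return feats


# Made-up input values, for illustration only.
expanded = expand_features(2.0, 3.0, power=2)
print(len(expanded), expanded)
```

Skipping the constant term by its exponents, rather than by testing whether the value equals 1, keeps every output row the same length even when a product happens to evaluate to 1.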
SVM: Use the original training and testing data files with kernel functions for SVM. You can use the svm
functions in the Scikit-learn library and don’t need to implement the algorithm from scratch.
Please refer to https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html for details.
New Features    From Original Features
x_1             x_1
x_2             x_1^2
x_3             x_2
x_4             x_1 x_2
x_5             x_1 x_2^2
x_6             x_2^2
x_7             x_1^2 x_2
x_8             x_1^2 x_2^2
What to Upload to Canvas:
Logistic Regression:
1. A single py file (lastname_firstname_P3_LR.py) that prompts for a training file name, computes weights
using gradient descent, displays a plot of iterations vs. J, plots the decision boundary on the whole dataset,
then prompts for a test file name and, using the computed weights, prints out the final J, FP, FN, TP, TN,
accuracy, precision, recall and F1 for the test set. All values should be clearly labelled.
2. Your training set file (lastname_firstname_P3Train.txt). First line should contain integers m and n, tab
separated. Each line after that should have n real numbers representing the new feature data, followed by a 1
if the capacitor passed QC and a 0 if it failed QC (tab separated).
3. Your test set file (lastname_firstname_P3Train.txt). First line should contain integers m and n, tab separated.
Each line after that should have n real numbers representing the new feature data, followed by a 1 if the
capacitor passed QC and a 0 if it failed QC (tab separated).
4. A pdf file (lastname_firstname_P3_LR.pdf) that includes
• A description of your model and testing procedure, including
o Description of your model
o Initial values that you chose for your weights, learning rate, and the initial value for J.
o Final values for learning rate, your weights, how many iterations your learning algorithm went
through and your final value of J on your training set.
o Include a plot of J (vertical axis) vs. number of iterations (horizontal axis).
o Include a plot of hyperplane on the whole dataset
o Value of J on your test set.
o Your code
• A confusion matrix showing your results on your test set.
• A description of your final results that includes accuracy, precision, recall and F1 values.
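For orientation, the core of a from-scratch logistic regression trainer might look like the sketch below. It assumes batch gradient descent with zero initial weights and the usual cross-entropy cost for J; the learning rate, iteration count and every function name are placeholders, and the toy data is made up and has nothing to do with the capacitor files.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(w, x):
    """h_w(x) with w[0] as the bias weight; x holds the features only."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def cost_j(w, xs, ys, eps=1e-12):
    """Assumed cost: mean cross-entropy, with h clamped away from 0 and 1."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = min(max(hypothesis(w, x), eps), 1.0 - eps)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(xs)


def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Return (weights, list of J after each iteration)."""
    m, n = len(xs), len(xs[0])
    w = [0.0] * (n + 1)
    history = []
    for _ in range(iterations):
        errors = [hypothesis(w, x) - y for x, y in zip(xs, ys)]
        grad = [sum(errors) / m]
        grad += [sum(e * x[j] for e, x in zip(errors, xs)) / m for j in range(n)]
        w = [wi - alpha * g for wi, g in zip(w, grad)]
        history.append(cost_j(w, xs, ys))
    return w, history


# Made-up one-feature toy data, for illustration only.
toy_x = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_w, toy_history = gradient_descent(toy_x, toy_y, alpha=0.5, iterations=2000)
print(toy_w, toy_history[0], toy_history[-1])
```

The history list is what you would plot against the iteration number for the J vs. iterations figure.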
SVM:
1. A single py file (lastname_firstname_P3_SVM.py) that prompts for a training file name, plots the margin and
hyperplane, then prompts for a test file name and, using the computed weights, prints out the final FP, FN, TP,
TN, accuracy, precision, recall and F1 for the test set. All values should be clearly labelled.
2. A pdf file (lastname_firstname_P3_SVM.pdf) that includes
• A description of your model and testing procedure, including
o Description of your model
o Description of your kernel function
o Include a plot of margin and hyperplane
o Your code
• A confusion matrix showing your results on your test set.
• A description of your final results that includes accuracy, precision, recall and F1 values.
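The confusion-matrix counts and the four summary values requested for both classifiers follow directly from the true and predicted labels. The sketch below uses the standard definitions with 1 as the positive class (passed QC); the function names are hypothetical, the labels are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not a requirement.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == 1 and truth == 1:
            tp += 1
        elif guess == 1 and truth == 0:
            fp += 1
        elif guess == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Made-up labels, for illustration only.
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
scores = summary_metrics(*counts)
print("TP, FP, TN, FN:", counts)
print("accuracy, precision, recall, F1:", scores)
```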
Note:
For undergrads (CPSC 4430), the final accuracy of both algorithms on your test set should be higher than 70%.
For graduate-level (CPSC 6430), the final accuracy of both algorithms on your test set should be higher than 85%.
Do not assume that any files are available to you besides files you turn in!
Zip your files into one zip file named lastname_firstname_P3_midterm.zip and upload it to Canvas.
CPSC6430 Project 4: Building a Spam Filter using a Naïve Bayes Classifier
Project 4 is to build a Naïve Bayes Spam filter. You will be able to download a labeled training
set file and a labeled test set file from Canvas. Both files will have the same format. Each line
will start with either a 1 (Spam) or a 0 (Ham), then a space, followed by an email subject line. A
third file will contain a list of Stop Words—common words that you should remove from your
vocabulary list. Format of the Stop Word list will be one word per line.
Assignment
Your program should prompt the user for the name of a training set file in the format described
above and for the name of the file of Stop Words. Your program should create a vocabulary of
words found in the subject lines of the training set, each associated with an estimated probability of
the word appearing in a Spam email and an estimated probability of the word appearing in a Ham
email. Your program should then prompt the user for a labeled test set and predict the class (1
= Spam, 0 = Ham) of each subject line using a Naïve Bayes approach as discussed in the class
videos. Note: We may or may not test your program on the same files that you used to create
it!
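The training step described above could be sketched as follows. Several choices here are assumptions rather than requirements: words are lower-cased and stripped of surrounding punctuation, each word is counted at most once per subject line, and add-one (Laplace) smoothing keeps any probability from being exactly 0 or 1. The function names and the four sample lines are made up for illustration.

```python
from collections import Counter


def tokenize(subject, stop_words):
    """Return the set of cleaned words in a subject line, minus stop words."""
    words = (w.strip(".,!?:;\"'()").lower() for w in subject.split())
    return {w for w in words if w and w not in stop_words}


def train(lines, stop_words):
    """Return (P(word | Spam), P(word | Ham), P(Spam)) from labeled lines."""
    spam_docs = ham_docs = 0
    spam_counts, ham_counts = Counter(), Counter()
    for line in lines:
        # Each line is "1 subject..." (Spam) or "0 subject..." (Ham).
        label, _, subject = line.strip().partition(" ")
        words = tokenize(subject, stop_words)
        if label == "1":
            spam_docs += 1
            spam_counts.update(words)
        else:
            ham_docs += 1
            ham_counts.update(words)
    vocab = set(spam_counts) | set(ham_counts)
    # Assumption: add-one smoothing over the two outcomes present / absent.
    p_word_spam = {w: (spam_counts[w] + 1) / (spam_docs + 2) for w in vocab}
    p_word_ham = {w: (ham_counts[w] + 1) / (ham_docs + 2) for w in vocab}
    p_spam = spam_docs / (spam_docs + ham_docs)
    return p_word_spam, p_word_ham, p_spam


# Made-up training lines and stop words, for illustration only.
demo_lines = ["1 Win cash now!", "1 Win a prize", "0 Meeting at noon", "0 Lunch now?"]
demo_stop = {"a", "at"}
p_word_spam, p_word_ham, p_spam = train(demo_lines, demo_stop)
print(p_spam, p_word_spam["win"], p_word_ham["win"])
```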
Output to the screen of your program should include:
• How many Spam and Ham emails were in the Test set file that was read in.
• Number of False Positives, True Positives, False Negatives and True Negatives that your
spam filter predicted.
• Accuracy, precision, recall and F1 values for your Spam filter on the Test Set file.
What to Turn In via Canvas
You are required to submit a project report, including:
• A brief introduction of your model.
• Number of False Positives, True Positives, False Negatives and True Negatives that your
spam filter predicted.
• Accuracy, precision, recall and F1 values for your Spam filter on the Test Set file.
• Screenshot of your python console.
• A copy of your code
Your report should be named yourlastname_yourfirstname_P4.docx or .doc or .pdf. Your Python
program should be named yourlastname_yourfirstname_P4.py, then zipped together with your
project report and uploaded to Canvas.
Notes and Suggestions
• If your program has problems with reading in the files, try opening the files like
this: file = open(filename, "r", encoding="unicode-escape")
• Use the training file to figure out the percentage of emails expected to be Spam.
• You will probably have to use the natural log form of Bayes’ equation to avoid
computer precision problems.
1. Instead of multiplying a lot of probabilities together, add
their logs, then raise e to the power of the final sum.
• Total probability = 0.8 * 0.0001 * 0.002 * 0.9
• Total probability = e^{\ln(0.8) + \ln(0.0001) + \ln(0.002) + \ln(0.9)}
2. Then use
P(E \mid F) = \frac{1}{1 + e^{\ln(P(F \mid \neg E) P(\neg E)) - \ln(P(F \mid E) P(E))}}
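A small numeric check of the log trick, using the four example probabilities above, is sketched below. The posterior function is a hypothetical helper that reads E as "the email is Spam" and F as the observed words, which is an assumption about the notation; it takes the two log joint probabilities as inputs, and the cutoff of 700 is just a guard against overflow in math.exp.

```python
import math

probs = [0.8, 0.0001, 0.002, 0.9]

# Multiplying directly and summing logs give the same total probability.
direct = 0.8 * 0.0001 * 0.002 * 0.9
via_logs = math.exp(sum(math.log(p) for p in probs))
print(direct, via_logs)


def posterior(log_joint_e, log_joint_not_e):
    """P(E | F) from ln(P(F|E) P(E)) and ln(P(F|not E) P(not E))."""
    diff = log_joint_not_e - log_joint_e
    if diff > 700:
        # exp(diff) would overflow; the posterior is effectively zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Made-up joint probabilities 0.3 and 0.1, for illustration only.
example = posterior(math.log(0.3), math.log(0.1))
print(example)
```

Because only the difference of the two log sums is exponentiated, the very small individual products never have to be represented directly.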


