CS-677 Preliminary Assignment 1
In this assignment, you will write simple trading strategies
using Python. This will review Python and prepare you for the
next assignments.
1. Preliminary Task 1: choose a stock ticker that starts with
the same letter as your last name (we want each student to
work with a different stock symbol). For example, if your
last name is Jones, you can try JBL (Jabil Circuit), JPM
(JP Morgan), JNJ (Johnson and Johnson) or any other ticker
starting with J. You can refer to the link below for
suggestions:
https://www.nasdaq.com/screening/company-list.aspx
2. Preliminary Task 2: use the script
read_and_save_stock_data.py
to download daily stock data (5 years, from Jan 1, 2015 to
Dec 31, 2019, matching Table 1 in the next assignment) for
your ticker as a CSV file (using the Pandas Yahoo stock
market reader). In this script, you will need to change the
location of the directory for the CSV file and change the
ticker to the one you chose in Task 1. This script downloads
historical data and computes additional fields for time
(week, day, month) and prices (daily returns, 14- and 50-day
moving price averages).
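For reference, here is a minimal sketch of what such a download
script might do. The pandas_datareader package and its Yahoo
reader are assumptions here (recent versions of the Yahoo API
may require a workaround), so prefer the provided script as-is:

import pandas_datareader.data as web

ticker = "JPM"  # replace with your chosen ticker
df = web.DataReader(ticker, "yahoo", "2015-01-01", "2019-12-31")
df["Return"] = df["Adj Close"].pct_change()       # daily returns
df["MA_14"] = df["Adj Close"].rolling(14).mean()  # 14-day moving average
df["MA_50"] = df["Adj Close"].rolling(50).mean()  # 50-day moving average
df.to_csv(ticker + ".csv")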
3. Preliminary Task 3: use the script
read_stock_data_from_file.py
to read your saved CSV file into a list of lines.
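A minimal sketch of reading the saved file into a list of lines
(the file name is an assumption):

with open("JPM.csv") as f:  # replace with your CSV file name
    lines = [line.strip() for line in f]
print(len(lines), "lines read")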
CS-677 Assignment 2: Predicting Daily Trading Labels
In many data science applications, you want to identify patterns,
labels or classes based on available data. In this assignment we
will focus on discovering patterns in your stock's past behavior.
To each trading day i you will assign a "trading" label "+" or
"-", depending on whether the corresponding daily return for that
day satisfies ri >= 0 or ri < 0. We will call these "true" labels,
and we compute them for all days in all 5 years.
We will use years 1, 2 and 3 as training years, and years 4 and 5
as testing years. For each day in years 4 and 5 we will predict a
label based on patterns that we observe in the training years. We
will call these "predicted" labels. Since we know the "true"
labels for years 4 and 5 and we compute "predicted" labels for
years 4 and 5, we can analyze how good our predictions are for
all labels, for "+" labels only and for "-" labels only in years
4 and 5.
Question 1: You have a CSV table of daily returns for your
stock and for S&P-500 (ticker "SPY").
1. For each file, read it into a pandas dataframe and add a
column "True Label". In that column, for each day (row) i
with daily return ri >= 0 you assign a "+" label ("up day").
For each day i with daily return ri < 0 you assign "-"
("down day"). You do this for every day in all 5 years, for
both tickers.
For example, if your initial dataframe were

Date        ...  Return
1/2/2015    ...  0.015
1/3/2015    ...  -0.01
1/6/2015    ...  0.02
...         ...  ...
12/30/2019  ...  0
12/31/2019  ...  -0.03

Table 1: Initial data
you will add an additional column "True Label" and have the
data shown in Table 2.

Date        ...  Return  True Label
1/2/2015    ...  0.015   +
1/3/2015    ...  -0.01   -
1/6/2015    ...  0.02    +
...         ...  ...     ...
12/30/2019  ...  0       +
12/31/2019  ...  -0.03   -

Table 2: Adding True Labels

Your daily "true labels" sequence is +, -, +, ..., +, -.
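One possible way to add this column, assuming the daily returns
are in a column named "Return" (adjust names to your CSV):

import numpy as np
import pandas as pd

df = pd.read_csv("JPM.csv")  # your ticker's file
df["True Label"] = np.where(df["Return"] >= 0, "+", "-")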
2. take years 1, 2 and 3. Let L be the number of trading days.
Assuming 250 trading days per year, L will contain about 750
days. Let L- be the set of all trading days with "-" labels and
L+ the set of all trading days with "+" labels. Assuming that
all days are independent of each other and that the ratio of
"up" and "down" days remains the same in the future, compute
the default probability p* that the next day is an "up" day.
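A minimal sketch of p*, assuming the dataframe df from the
sketch above plus a "Year" column (adjust the year values to
your data's first three years):

train = df[df["Year"].isin([2015, 2016, 2017])]  # training years 1-3
p_star = (train["True Label"] == "+").mean()     # fraction of "up" days
print("default probability p* =", round(p_star, 4))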
3. take years 1, 2 and 3. What is the probability that after
seeing k consecutive "down days", the next day is an "up day"?
For example, if k = 3, what is the probability of seeing
"-, -, -, +" as opposed to "-, -, -, -"? Compute this for
k = 1, 2, 3.
4. take years 1, 2 and 3. What is the probability that after
seeing k consecutive "up days", the next day is still an "up
day"? For example, if k = 3, what is the probability of seeing
"+, +, +, +" as opposed to "+, +, +, -"? Compute this for
k = 1, 2, 3. A sketch covering items 3 and 4 follows.
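One possible way to compute these conditional probabilities by
scanning the training label sequence (assumes the train
dataframe from the previous sketch):

labels = train["True Label"].tolist()

def prob_up_after(run_label, k, labels):
    # P(next day is "+" | previous k days were all run_label)
    hits = total = 0
    for i in range(len(labels) - k):
        if labels[i:i + k] == [run_label] * k:
            total += 1
            hits += labels[i + k] == "+"
    return hits / total if total else None

for k in (1, 2, 3):
    print(k, prob_up_after("-", k, labels), prob_up_after("+", k, labels))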
Predicting labels: We will now describe a procedure to predict
labels for each day in years 4 and 5 from the "true" labels in
training years 1, 2 and 3.
For each day d in years 4 and 5, we look at the pattern of the
last W true labels (including day d itself). By looking at the
frequency of this pattern, and the true label for the next day,
in the training set, we will predict the label for day d + 1.
Here W is a hyperparameter that we will choose based on our
prediction accuracy.
Suppose W = 3. You look at a particular day d and suppose that
the sequence of the last W labels is s = "-, +, -". We want to
predict the label for the next day d + 1. To do this, we count
the number of sequences of length W + 1 in the training set
whose first W labels coincide with s. In other words, we count
the number N-(s) of sequences "s, -" and the number N+(s) of
sequences "s, +". If N+(s) >= N-(s), then the next day is
assigned "+". If N+(s) < N-(s), then the next day is assigned
"-". In the unlikely event that N+(s) = N-(s) = 0, we will
assign a label based on the default probability p* that we
computed in the previous question.
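A sketch of this predictor; the fallback rule for the
N+(s) = N-(s) = 0 case is one possible reading of "based on the
default probability p*":

def predict_next(train_labels, prior_labels, p_star):
    # prior_labels: the last W true labels, ending with day d
    W = len(prior_labels)
    n_plus = n_minus = 0
    for i in range(len(train_labels) - W):
        if train_labels[i:i + W] == prior_labels:
            if train_labels[i + W] == "+":
                n_plus += 1
            else:
                n_minus += 1
    if n_plus == n_minus == 0:                # pattern never seen:
        return "+" if p_star >= 0.5 else "-"  # fall back on p*
    return "+" if n_plus >= n_minus else "-"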
Question 2:
1. for W = 2, 3, 4, compute predicted labels for each day in
years 4 and 5 based on the true labels in years 1, 2 and 3
only. Perform this for your ticker and for "SPY".
2. for each W = 2, 3, 4, compute the accuracy: what percentage
of true labels (both positive and negative) have you predicted
correctly for the last two years?
3. which value W* gave you the highest accuracy for your stock,
and which W* gave you the highest accuracy for S&P-500?
Question 3: One of the most powerful methods to (potentially)
improve predictions is to combine predictions by some
"averaging". This is called ensemble learning. Let us consider
the following procedure: for every day d, you have 3 predicted
labels: for W = 2, W = 3 and W = 4. Let us compute an
"ensemble" label for day d by taking the majority of your
labels for that day. For example, if your predicted labels were
"-", "-" and "+", then we would take "-" as the ensemble label
for day d (the majority of the three labels is "-"). If, on the
other hand, your predicted labels were "-", "+" and "+", then
we would take "+" as the ensemble label for day d (the majority
of predicted labels is "+"). Compute such ensemble labels (a
minimal voting sketch appears after the list below) and answer
the following:
1. compute ensemble labels for years 4 and 5 for both your
stock and S&P-500.
2. for both S&P-500 and your ticker, what percentage of labels
in years 4 and 5 do you predict correctly by using the ensemble?
3. did you improve your accuracy on predicting "-" labels by
using the ensemble compared to W = 2, 3, 4?
4. did you improve your accuracy on predicting "+" labels by
using the ensemble compared to W = 2, 3, 4?
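A minimal sketch of the majority vote:

def ensemble_label(l2, l3, l4):
    # majority of the three predicted labels for W = 2, 3, 4
    votes = [l2, l3, l4]
    return "+" if votes.count("+") >= 2 else "-"

print(ensemble_label("-", "+", "+"))  # prints "+"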
Question 4: For W = 2, 3, 4 and the ensemble, compute the
following statistics (both for your ticker and for "SPY") based
on years 4 and 5 (a helper sketch follows the list):
1. TP – true positives (your predicted label is + and the true
label is +)
2. FP – false positives (your predicted label is + but the true
label is -)
3. TN – true negatives (your predicted label is - and the true
label is -)
4. FN – false negatives (your predicted label is - but the true
label is +)
5. TPR = TP/(TP + FN) – true positive rate. This is the
fraction of positive labels that you predicted correctly. It is
also called sensitivity, recall or hit rate.
6. TNR = TN/(TN + FP) – true negative rate. This is the
fraction of negative labels that you predicted correctly. It is
also called specificity or selectivity.
7. summarize your findings in a table as shown below:

W         ticker      TP  FP  TN  FN  accuracy  TPR  TNR
2         S&P-500
3         S&P-500
4         S&P-500
ensemble  S&P-500
2         your stock
3         your stock
4         your stock
ensemble  your stock

Table 3: Prediction results for W = 2, 3, 4 and ensemble
8. discuss your findings
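A sketch of a helper for these statistics, assuming parallel
lists of true and predicted labels for years 4 and 5:

def confusion_stats(true, pred):
    TP = sum(t == "+" and p == "+" for t, p in zip(true, pred))
    FP = sum(t == "-" and p == "+" for t, p in zip(true, pred))
    TN = sum(t == "-" and p == "-" for t, p in zip(true, pred))
    FN = sum(t == "+" and p == "-" for t, p in zip(true, pred))
    accuracy = (TP + TN) / len(true)
    TPR = TP / (TP + FN) if TP + FN else None
    TNR = TN / (TN + FP) if TN + FP else None
    return TP, FP, TN, FN, accuracy, TPR, TNR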
Question 5: At the beginning of year 4 you start with $100 and
trade for 2 years based on predicted labels.
1. take your stock. Plot the growth of your amount over the 2
years if you trade based on the best W* and on the ensemble.
On the same graph, plot the growth of your portfolio for the
"buy-and-hold" strategy.
2. examine your chart. Any patterns? (e.g. any differences
between year 4 and year 5?)
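A sketch of the simulation under one common reading of the
strategy: on a day whose predicted label is "+" you are
invested and earn that day's return, otherwise you sit in cash;
buy-and-hold is always invested. The names test_returns and
predicted_labels are assumed to hold your years 4 and 5 data:

def grow(returns, labels, start=100.0):
    amounts, amount = [], start
    for r, lab in zip(returns, labels):
        if lab == "+":  # invested only on predicted "up" days
            amount *= 1 + r
        amounts.append(amount)
    return amounts

strategy = grow(test_returns, predicted_labels)
buy_and_hold = grow(test_returns, ["+"] * len(test_returns))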
CS-677 Assignment 3: kNN & Logistic Regression (Banknotes)
In this assignment, we will implement k-NN and logistic
regression classifiers to detect "fake" banknotes and analyze
the comparative importance of features for prediction accuracy.
For the dataset, we use the "banknote authentication" dataset
from the Machine Learning Repository at UCI:
https://archive.ics.uci.edu/ml/datasets/banknote+authentication
Dataset Description: From the website: "This dataset contains
1,372 examples of both fake and real banknotes. Data were
extracted from images that were taken from genuine and forged
banknote-like specimens. For digitization, an industrial camera
usually used for print inspection was used. The final images
have 400 x 400 pixels. Due to the object lens and distance to
the investigated object, gray-scale pictures with a resolution
of about 660 dpi were gained. A Wavelet Transform tool was used
to extract features from the images."
There are 4 continuous attributes (features) and a class:
1. f1 – variance of wavelet transformed image
2. f2 – skewness of wavelet transformed image
3. f3 – curtosis of wavelet transformed image
4. f4 – entropy of image
5. class (integer)
In other words, assume that you have a machine that examines a
banknote and computes 4 attributes (step 1). Then each banknote
is examined by a much more expensive machine and/or by human
expert(s) and classified as fake or real (step 2). The second
step is very time-consuming and expensive. You want to build a
classifier that gives you results after step 1 only.
We assume that class 0 banknotes are good. We will use the
color "green" or "+" for legitimate banknotes. Class 1
banknotes are assumed to be fake, and we will use the color
"red" or "-" for counterfeit banknotes. These are the "true"
labels.
Question 1:
1. load the data into a dataframe and add a column "color". For
each class 0 row, this should contain "green", and for each
class 1 row it should contain "red"
2. for each class and for each feature f1, f2, f3, f4, compute
its mean µ() and standard deviation σ(). Round the results to
2 decimal places and summarize them in a table as shown below:
3. examine your table. Are there any obvious patterns in the
distribution of banknotes in each class?
class  µ(f1)  σ(f1)  µ(f2)  σ(f2)  µ(f3)  σ(f3)  µ(f4)  σ(f4)
0
1
all
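A sketch of items 1 and 2, assuming the raw UCI file
data_banknote_authentication.txt (comma-separated, no header
row):

import pandas as pd

cols = ["f1", "f2", "f3", "f4", "class"]
df = pd.read_csv("data_banknote_authentication.txt", names=cols)
df["color"] = df["class"].map({0: "green", 1: "red"})

for name, subset in [("0", df[df["class"] == 0]),
                     ("1", df[df["class"] == 1]),
                     ("all", df)]:
    print(name)
    print(subset[["f1", "f2", "f3", "f4"]].agg(["mean", "std"]).round(2))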
Question 2:
1. split your dataset X into training (Xtrain) and testing
(Xtest) parts (50/50 split). Using "pairplot" from the seaborn
package, plot pairwise relationships in Xtrain separately for
class 0 and class 1. Save your results into 2 pdf files,
"good_bills.pdf" and "fake_bills.pdf"
2. visually examine your results. Come up with three simple
comparisons that you think may be sufficient to detect a fake
bill. For example, your classifier may look like this:
# assume you are examining a bill
# with features f_1, f_2, f_3 and f_4
# your rule may look like this:
if (f_1 > 4) and (f_2 > 8) and (f_4 < 25):
    x = "good"
else:
    x = "fake"
3. apply your simple classifier to Xtest and compute predicted
class labels
4. comparing your predicted class labels with the true labels,
compute the following:
(a) TP – true positives (your predicted label is + and the
true label is +)
(b) FP – false positives (your predicted label is + but the
true label is -)
(c) TN – true negatives (your predicted label is - and the
true label is -)
(d) FN – false negatives (your predicted label is - but the
true label is +)
(e) TPR = TP/(TP + FN) – true positive rate. This is the
fraction of positive labels that you predicted correctly. It
is also called sensitivity, recall or hit rate.
(f) TNR = TN/(TN + FP) – true negative rate. This is the
fraction of negative labels that you predicted correctly. It
is also called specificity or selectivity.
5. summarize your findings in the table as shown below:
6. does your simple classifier give you higher accuracy on
identifying "fake" bills or "real" bills? Is your accuracy
better than 50% ("coin" flipping)?
TP FP TN FN accuracy TPR TNR
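A sketch of the 50/50 split and the two pairplots from
Question 2, assuming the dataframe df from the previous sketch:

import seaborn as sns
from sklearn.model_selection import train_test_split

features = ["f1", "f2", "f3", "f4"]
Xtrain, Xtest = train_test_split(df, test_size=0.5, random_state=1)

sns.pairplot(Xtrain[Xtrain["class"] == 0][features]).savefig("good_bills.pdf")
sns.pairplot(Xtrain[Xtrain["class"] == 1][features]).savefig("fake_bills.pdf")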
Question 3 (use the k-NN classifier from the sklearn library):
1. take k = 3, 5, 7, 9, 11. Use the same Xtrain and Xtest as
before. For each k, train your k-NN classifier on Xtrain and
compute its accuracy on Xtest
2. plot a graph showing the accuracy: on the x-axis you plot k,
and on the y-axis you plot accuracy. What is the optimal value
k* of k?
3. use the optimal value k* to compute the performance measures
and summarize them in the table

TP FP TN FN accuracy TPR TNR

4. is your k-NN classifier better than your simple classifier
on any of the measures from the previous table?
5. consider a bill x whose feature values are the last 4 digits
of your BUID. What is the class label predicted for this bill
by your simple classifier? What is the label predicted for this
bill by k-NN using the best k*?
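A sketch of the accuracy sweep, assuming the Xtrain/Xtest split
and feature list from the earlier sketch:

from sklearn.neighbors import KNeighborsClassifier

for k in (3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Xtrain[features], Xtrain["class"])
    print(k, knn.score(Xtest[features], Xtest["class"]))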
Question 4: One of the fundamental questions in machine
learning is "feature selection". We try to come up with the
smallest number of features that still retains good accuracy.
The natural question is whether some of the features are
unimportant and can be dropped.
1. take your best value k*. For each of the four features
f1, . . . , f4, drop that feature from both Xtrain and Xtest.
Train your classifier on the "truncated" Xtrain and predict
labels on Xtest using just the 3 remaining features. You will
repeat this for 4 cases: (1) just f1 missing, (2) just f2
missing, (3) just f3 missing and (4) just f4 missing. Compute
the accuracy for each of these scenarios (a sketch of this loop
follows the list).
2. did accuracy increase in any of the 4 cases compared with
accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to loss
of accuracy?
4. which feature, when removed, contributed the least to loss
of accuracy?
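A sketch of the drop-one-feature loop; k_star is your best k
from Question 3:

from sklearn.neighbors import KNeighborsClassifier

for dropped in features:
    kept = [f for f in features if f != dropped]
    knn = KNeighborsClassifier(n_neighbors=k_star)
    knn.fit(Xtrain[kept], Xtrain["class"])
    print("without", dropped, ":", knn.score(Xtest[kept], Xtest["class"]))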
Question 5 (use the logistic regression classifier from the
sklearn library):
1. Use the same Xtrain and Xtest as before. Train your logistic
regression classifier on Xtrain and compute its accuracy on
Xtest
2. summarize your performance measures in the table
TP FP TN FN accuracy TPR TNR
3. is your logistic regression better than your simple
classifier on any of the measures from the previous table?
4. is your logistic regression better than your k-NN classifier
(using the best k*) on any of the measures from the previous
table?
5. consider a bill x whose feature values are the last 4 digits
of your BUID. What is the class label predicted for this bill x
by logistic regression? Is it the same label as predicted by
k-NN?
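A minimal sketch with sklearn's LogisticRegression on the same
split:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(Xtrain[features], Xtrain["class"])
print("accuracy:", log_reg.score(Xtest[features], Xtest["class"]))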
Question 6: We will investigate the change in accuracy when
removing one feature. This is similar to Question 4, but now we
use logistic regression.
1. For each of the four features f1, . . . , f4, drop that
feature from both Xtrain and Xtest. Train your logistic
regression classifier on the "truncated" Xtrain and predict
labels on Xtest using just the 3 remaining features. You will
repeat this for 4 cases: (1) just f1 missing, (2) just f2
missing, (3) just f3 missing and (4) just f4 missing. Compute
the accuracy for each of these scenarios.
2. did accuracy increase in any of the 4 cases compared with
accuracy when all 4 features are used?
3. which feature, when removed, contributed the most to loss
of accuracy?
4. which feature, when removed, contributed the least to loss
of accuracy?
5. is the relative significance of the features the same as you
obtained using k-NN?
CS-677 Assignment 4: Linear Models
In this assignment, we will implement a number of linear models
(including linear regression) to model relationships between
different clinical features of heart failure patients.
For the dataset, we use the "heart failure clinical records"
data set at UCI:
https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records
Dataset Description: From the website: "This dataset contains
the medical records of 299 patients who had heart failure,
collected during their follow-up period, where each patient
profile has 13 clinical features."
These 13 features are:
1. age: age of the patient (years)
2. anaemia: decrease of red blood cells or hemoglobin (boolean)
3. high blood pressure: if the patient has hypertension (boolean)
4. creatinine phosphokinase (CPK): level of the CPK enzyme
in the blood (mcg/L)
5. diabetes: if the patient has diabetes (boolean)
6. ejection fraction: percentage of blood leaving the heart at
each contraction (percentage)
7. platelets: platelets in the blood (kiloplatelets/mL)
8. sex: woman or man (binary)
9. serum creatinine: level of serum creatinine in the blood
(mg/dL)
10. serum sodium: level of serum sodium in the blood (mEq/L)
11. smoking: if the patient smokes or not (boolean)
12. time: follow-up period (days)
13. target, death event: whether the patient died (DEATH EVENT
= 1) during the follow-up period (boolean)
We will focus on the following subset of four features:
1. creatinine phosphokinase
2. serum creatinine
3. serum sodium
4. platelets
and try to establish a relationship between some of them using
various linear models and their variants.
Question 1:
1. load the data into a Pandas dataframe. Extract two
dataframes with the above 4 features: df_0 for surviving
patients (DEATH EVENT = 0) and df_1 for deceased patients
(DEATH EVENT = 1)
2. for each dataframe, construct the visual representations of
the corresponding correlation matrices M0 (from df_0) and M1
(from df_1) and save the plots into two separate files (a
heatmap sketch follows the list)
3. examine your correlation matrix plots visually and answer
the following:
(a) which features have the highest correlation for surviving
patients?
(b) which features have the lowest correlation for surviving
patients?
(c) which features have the highest correlation for deceased
patients?
(d) which features have the lowest correlation for deceased
patients?
(e) are the results the same for both cases?
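A sketch of item 2 using seaborn heatmaps; the column names are
assumptions, so adjust them to the CSV:

import matplotlib.pyplot as plt
import seaborn as sns

cols = ["creatinine_phosphokinase", "serum_creatinine",
        "serum_sodium", "platelets"]
for frame, fname in [(df_0, "corr_survived.pdf"),
                     (df_1, "corr_deceased.pdf")]:
    plt.figure()
    sns.heatmap(frame[cols].corr().round(2), annot=True)
    plt.savefig(fname)
    plt.close()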
Question 2: In this question you will compare a number of
different models using linear systems (including linear
regression). You choose one feature as independent variable X
and another feature Y as dependent. Your choice of X and Y
depends on your facilitator group as follows:
1. Group 1: X: creatinine phosphokinase (CPK), Y: platelets
2. Group 2: X: platelets, Y: serum sodium
3. Group 3: X: serum sodium, Y: serum creatinine
4. Group 4: X: platelets, Y: serum creatinine
We will now look for the model (from the list below) that best
explains the relationship for surviving and deceased patients.
Consider surviving patients (DEATH EVENT = 0). Extract the
corresponding columns for X and Y. For each of the models
below, we will take a 50/50 split, fit the model with Xtrain
and predict Ytest using Xtest. From the predicted values
Pred(y_i) we compute the residuals r_i = y_i - Pred(y_i). We
can then estimate the loss function (SSE, the sum of squared
residuals):

L = sum over x_i in Xtest of r_i^2

You do the same analysis for deceased patients. You will
consider the following models for both deceased and surviving
patients:
1. y = ax + b (simple linear regression)
2. y = ax^2 + bx + c (quadratic)
3. y = ax^3 + bx^2 + cx + d (cubic polynomial)
4. y = a log x + b (GLM – generalized linear model)
5. log y = a log x + b (GLM – generalized linear model)
For each of the models above, you will do the following (for
both deceased and surviving patients):
(a) fit the model on Xtrain
(b) print the weights (a, b, . . .)
(c) compute predicted values using Xtest
(d) plot (if possible) predicted and actual values in Xtest
(e) compute (and print) the corresponding loss function
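A sketch for the three polynomial models using np.polyfit; the
two logarithmic models can be fit the same way after
transforming x (and, for model 5, y) with np.log. Here x and y
are assumed to be the extracted feature columns for one
DEATH EVENT group:

import numpy as np
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(x, y,
                                                test_size=0.5,
                                                random_state=1)
for degree in (1, 2, 3):               # linear, quadratic, cubic
    weights = np.polyfit(Xtrain, ytrain, degree)
    pred = np.polyval(weights, Xtest)
    sse = np.sum((ytest - pred) ** 2)  # loss L: sum of squared residuals
    print(degree, weights, sse)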
Question 3: Summarize your results from Question 2 in a table
like the one shown below:

Model                     SSE (death event=0)  SSE (death event=1)
y = ax + b
y = ax^2 + bx + c
y = ax^3 + bx^2 + cx + d
y = a log x + b
log y = a log x + b
1. which model was the best (smallest SSE) for surviving
patients? for deceased patients?
2. which model was the worst (largest SSE) for surviving
patients? for deceased patients?
CS-677 Assignment 5: Naive Bayesian, Decision Trees & Random Forests
In this assignment, we will compare Naive Bayesian and Decision
Tree classification for identifying normal vs. non-normal fetus
status based on fetal cardiograms.
For the dataset, we use the "fetal cardiotocography" data set
at UCI:
https://archive.ics.uci.edu/ml/datasets/Cardiotocography
Dataset Description: From the website: "2126 fetal
cardiotocograms (CTGs) were automatically processed and the
respective diagnostic features measured. The CTGs were also
classified by three expert obstetricians and a consensus
classification label assigned to each of them. Classification
was both with respect to a morphologic pattern (A, B, C, ...)
and to a fetal state (N, S, P). Therefore the dataset can be
used either for 10-class or 3-class experiments."
We will focus on the "fetal state". We will combine the labels
"S" (suspect) and "P" (pathological) into one class "A"
(abnormal), and focus on predicting "N" (normal) vs. "A"
(abnormal). For a detailed description of the features, please
visit the website above.
The data is an Excel (not CSV) file. For ways to process Excel
files in Python, see https://www.python-excel.org/
You will use the following subset of 12 numeric features:
1. LB – FHR baseline (beats per minute)
2. ASTV – percentage of time with abnormal short term variability
3. MSTV – mean value of short term variability
4. ALTV – percentage of time with abnormal long term variability
5. MLTV – mean value of long term variability
6. Width – width of FHR histogram
7. Min – minimum of FHR histogram
8. Max – Maximum of FHR histogram
9. Mode – histogram mode
10. Mean – histogram mean
11. Median – histogram median
12. Variance – histogram variance
You will consider the following set of 4 features depending on
your facilitator group.
β’ Group 1: LB, ALTV, Min, Mean
β’ Group 2: ASTV, MLTV, Max, Median
β’ Group 3: MSTV, Width, Mode, Variance
β’ Group 4: LB, MLTV, Width, Variance
For each of the questions below, these would be your features.
Question 1:
1. load the Excel data ("Raw Data" worksheet) into a Pandas
dataframe
2. combine the NSP labels into two groups: N (normal, the
existing NSP = 1 labels) and A (abnormal, everything else). We
will use the existing class 1 for normal and define class 0 for
abnormal.
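A sketch of loading and relabeling, assuming the UCI file
CTG.xls with a worksheet named "Raw Data" and an NSP column
(adjust the names to your download; reading .xls files may
require the xlrd package):

import pandas as pd

df = pd.read_excel("CTG.xls", sheet_name="Raw Data")
df = df.dropna(subset=["NSP"])               # drop summary/blank rows
df["label"] = (df["NSP"] == 1).astype(int)   # 1 = normal, 0 = abnormal (S or P)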
Question 2: Use a Naive Bayesian (NB) classifier to answer
these questions:
1. split your dataset 50/50, train NB on Xtrain and predict
class labels on Xtest
2. what is the accuracy?
3. compute the confusion matrix
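A sketch with sklearn's GaussianNB, assuming your group's four
features are in the list features and the labels come from the
previous sketch; Question 3 below follows the same pattern with
DecisionTreeClassifier:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

Xtrain, Xtest, ytrain, ytest = train_test_split(
    df[features], df["label"], test_size=0.5, random_state=1)
nb = GaussianNB().fit(Xtrain, ytrain)
print("accuracy:", nb.score(Xtest, ytest))
print(confusion_matrix(ytest, nb.predict(Xtest)))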
Question 3: Use a Decision Tree to answer these questions:
1. split your dataset 50/50, train the decision tree on Xtrain
and predict class labels on Xtest
2. what is the accuracy?
3. compute the confusion matrix
Question 4: Recall that there are two hyper-parameters in the
random forest classifier: N, the number of (sub)trees to use,
and d, the max depth of each subtree.
Use the Random Forest classifier to answer these questions:
1. take N = 1, . . . , 10 and d = 1, 2, . . . , 5. For each
value of N and d, split your data into Xtrain and Xtest and
construct a random forest classifier (use "entropy" as the
splitting criterion; set it explicitly, since it is not the
sklearn default). Train your classifier on Xtrain and compute
the error rate on Xtest (a sketch of the sweep follows)
2. plot your error rates and find the best combination of N
and d.
3. what is the accuracy for the best combination of N and d?
4. compute the confusion matrix using the best combination of N
and d
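A sketch of the sweep, reusing the split from the previous
sketch:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

error_rates = np.zeros((10, 5))
for n in range(1, 11):
    for d in range(1, 6):
        rf = RandomForestClassifier(n_estimators=n, max_depth=d,
                                    criterion="entropy",
                                    random_state=1)
        rf.fit(Xtrain, ytrain)
        error_rates[n - 1, d - 1] = 1 - rf.score(Xtest, ytest)
print(error_rates)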
Question 5: Summarize your results for Naive Bayesian, decision
tree and random forest in the table below and discuss your
findings.

Model           TP  FP  TN  FN  accuracy  TPR  TNR
naive bayesian
decision tree
random forest
CS-677 Assignment 6: SVM & Clustering
In this assignment, you will implement k-means clustering and
use it to construct a multi-label classifier to determine the
variety of wheat. For the dataset, we use the "seeds" dataset
from the Machine Learning Repository at UCI:
https://archive.ics.uci.edu/ml/datasets/seeds
Dataset Description: From the website: "... The examined group
comprised kernels belonging to three different varieties of
wheat: Kama, Rosa and Canadian, 70 elements each, randomly
selected for the experiment..."
There are 7 continuous features F = {f1, . . . , f7} and a
class label L (Kama: 1, Rosa: 2, Canadian: 3).
1. f1: area A
2. f2: perimeter P
3. f3: compactness C = 4πA/P^2
4. f4: length of kernel,
5. f5: width of kernel,
6. f6: asymmetry coefficient
7. f7: length of kernel groove.
8. L: class (Kama: 1, Rosa: 2, Canadian: 3)
For the first question, you will choose 2 class labels as
follows. Take the last digit of your BUID and divide it by 3.
Choose the following 2 classes depending on the remainder R:
1. R = 0: class L = 1 (negative) and L = 2 (positive)
2. R = 1: class L = 2 (negative) and L = 3 (positive)
3. R = 2: class L = 1 (negative) and L = 3 (positive)
Question 1: Take the subset of the dataset containing your two
class labels. You will use random 50/50 splits for training and
testing data (a sketch follows the list).
1. implement a linear kernel SVM. What is your accuracy and
confusion matrix?
2. implement a Gaussian kernel SVM. What is your accuracy
and confusion matrix?
3. implement a polynomial kernel SVM of degree 3. What is
your accuracy and confusion matrix?
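A sketch of the three variants with sklearn's SVC, assuming X
and y hold the two-class subset's features and labels:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.5)
for name, model in [("linear", SVC(kernel="linear")),
                    ("Gaussian", SVC(kernel="rbf")),
                    ("polynomial, degree 3", SVC(kernel="poly", degree=3))]:
    model.fit(Xtrain, ytrain)
    print(name, model.score(Xtest, ytest))
    print(confusion_matrix(ytest, model.predict(Xtest)))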
Question 2: Pick any classifier for supervised learning (e.g.
kNN, logistic regression, Naive Bayesian, etc.).
1. apply this classifier to your dataset. What is your accuracy
and confusion matrix?
2. summarize your findings in the table below and discuss your
results
Model TP FP TN FN accuracy TPR TNR
linear SVM
Gaussian SVM
polynomial SVM
your classifier
Question 3: Take the original dataset with all 3 class labels.
1. for k = 1, 2, . . . , 8, run k-means clustering with random
initialization and defaults. Compute and plot distortion vs. k.
Use the "knee" method to find the best k (a sketch appears at
the end of this question).
2. re-run your clustering with the best k clusters. Pick two
features fi and fj at random (using Python, of course) and plot
your data points (a different color for each class, plus the
centroids) using fi and fj as axes. Examine your plot. Are
there any interesting patterns?
3. for each cluster, assign a cluster label based on the
majority class of its items. For example, if cluster Ci
contains 45% of class 1 ("Kama" wheat), 35% of class 2 ("Rosa"
wheat) and 20% of class 3 ("Canadian" wheat), then cluster Ci
is assigned label 1. For each cluster, print out its centroid
and assigned label.
4. consider the following multi-label classifier. Take the
largest 3 clusters with labels 1, 2 and 3, respectively. Let us
call these clusters A, B and C. For each of these clusters, you
know their means (centroids): µ(A), µ(B) and µ(C). We now
consider the following procedure (conceptually analogous to
nearest neighbor with k = 1): for every point x in your
dataset, assign a label based on the label of the nearest
(using Euclidean distance) centroid of A, B or C. In other
words, if x is closest to the center of cluster A, you assign
it label 1. If x is closest to the center of cluster B, you
assign it class 2. Finally, if x is closest to the center of
cluster C, you assign it class 3. What is the overall accuracy
of this new classifier when applied to the complete dataset?
5. take this new classifier and consider the same two labels
that you used for SVM. What is your accuracy and confusion
matrix? How does your new classifier (from task 4) compare with
the classifiers listed in the table for Question 2 above?
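The sketch promised in task 1; X is assumed to hold all seven
feature columns for the full 3-class dataset:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, init="random", random_state=1).fit(X)
    distortions.append(km.inertia_)  # sum of squared distances to centroids
plt.plot(range(1, 9), distortions, "o-")
plt.xlabel("k")
plt.ylabel("distortion")
plt.savefig("knee.pdf")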