Description
CS 4342 Assignment #1
1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a
flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size n is extremely large, and the number of predictors p is small.
(b) The number of predictors p is extremely large, and the number of observations n is small.
(c) The relationship between the predictors and response is highly non-linear.
(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.
Points: 5
2. Explain whether each scenario is a classification or regression problem, and indicate whether we are
most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of
employees, industry and the CEO salary. We are interested in understanding which factors affect CEO
salary.
(b) We are considering launching a new product and wish to know whether it will be a success or a failure.
We collect data on 20 similar products that were previously launched. For each product we have recorded
whether it was a success or failure, price charged for the product, marketing budget, competition price, and
ten other variables.
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly
changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record
the % change in the USD/Euro, the % change in the US market, the % change in the British market, and
the % change in the German market.
Points: 5
3. We now revisit the bias-variance decomposition.
(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible)
error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible
approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should
represent the values for each curve. There should be five curves. Make sure to label each one.
(b) Explain why each of the five curves has the shape displayed in part (a).
Points: 5
4. You will now think of some real-life applications for statistical learning.
(a) Describe three real-life applications in which classification might be useful. Describe the response, as
well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(b) Describe three real-life applications in which regression might be useful. Describe the response, as well
as the predictors. Is the goal of each application inference or prediction? Explain your answer.
(c) Describe three real-life applications in which cluster analysis might be useful.
Points: 5
5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for
regression or classification? Under what circumstances might a more flexible approach be preferred to a
less flexible approach? When might a less flexible approach be preferred?
Points: 5
6. Describe the differences between a parametric and a non-parametric statistical learning approach. What
are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric
approach)? What are its disadvantages?
Points: 5
7. The table below provides a training data set containing six observations, three predictors, and one
qualitative response variable.
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest
neighbors.
(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
(b) What is our prediction with K = 1? Why?
(c) What is our prediction with K = 3? Why?
(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value
for K to be large or small? Why?
Points: 5
Applied Questions
1. This exercise involves the Auto data set studied in the lab. Make sure that the missing values have been
removed from the data.
(a) Which of the predictors are quantitative, and which are qualitative?
(b) What is the range of each quantitative predictor?
(c) What is the mean and standard deviation of each quantitative predictor?
(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of
each predictor in the subset of the data that remains?
(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your
choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.
(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots
suggest that any of the other variables might be useful in predicting mpg? Justify your answer.
Hints:
– Range: The range of a set of data is the difference between the highest and lowest values in the
set. To find the range, first order the data from least to greatest. Then subtract the smallest value
from the largest value in the set.
– You can use the NumPy package. It provides functions for the mean, min, and max of arrays.
– Load Data: You can use the Pandas package. You might use functions in Pandas like read_csv.
– Scatter plots: check https://jakevdp.github.io/PythonDataScienceHandbook/04.02-simple-scatter-plots.html
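A minimal sketch of how the hinted NumPy/Pandas functions fit together, shown on a tiny made-up frame (in the assignment, load the real Auto data with pd.read_csv and drop missing values first; the column values below are illustrative):

```python
import pandas as pd

# Tiny made-up stand-in for the Auto data; with the real file you would call
# pd.read_csv(...) and then .dropna() to remove missing values.
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 24.0],
    "horsepower": [130, 165, 58, 95],
})

ranges = df.max() - df.min()   # range = largest value minus smallest value
means = df.mean()              # mean of each quantitative column
stds = df.std()                # sample standard deviation (ddof=1)
```

The same calls applied to df.drop(df.index[9:85]) cover part (d)'s subset.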
Points: 15
2. This exercise involves the Boston housing data set.
(a) How many rows are in this data set? How many columns? What do the rows and columns represent?
(b) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.
(c) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.
(d) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher
ratios? Comment on the range of each predictor.
(e) How many of the suburbs in this data set bound the Charles river?
(f) What is the median pupil-teacher ratio among the towns in this data set?
(g) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the
other predictors for that suburb, and how do those values compare to the overall ranges for those
predictors? Comment on your findings.
(h) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than
eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.
Hints:
– You can find the description of Boston data set in the following link:
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
– The median is the value at the middle position of a sorted array.
– You might use the NumPy package.
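The counting and median questions above reduce to a few Pandas one-liners; a sketch on a tiny made-up frame (column names crim, chas, ptratio, rm follow the Boston data description linked above, but the values are illustrative):

```python
import pandas as pd

# Tiny made-up stand-in for the Boston data.
boston = pd.DataFrame({
    "crim":    [0.02, 0.08, 9.50, 0.03],   # per capita crime rate
    "chas":    [0, 1, 0, 1],               # 1 = tract bounds the Charles river
    "ptratio": [15.3, 17.8, 20.2, 18.7],   # pupil-teacher ratio
    "rm":      [6.5, 7.2, 5.9, 8.3],       # average rooms per dwelling
})

n_rows, n_cols = boston.shape                  # (a) rows and columns
charles = int((boston["chas"] == 1).sum())     # (e) suburbs bounding the river
median_pt = boston["ptratio"].median()         # (f) median pupil-teacher ratio
over_eight = int((boston["rm"] > 8).sum())     # (h) avg > 8 rooms per dwelling
```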
Points: 15
Mathematics and Probability Questions
1. Minimum and Maximum of a function
For the following function
f(x, y) = 4x² + 2y² − 4x + 6y + 2xy + 5
a. Show a contour plot of the function
b. Find the partial derivatives with respect to x and y
c. Find the minimum point of the function
Hints:
– Check https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contour.html for contour plot
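A minimal sketch of the hinted contour call for this f(x, y); the grid limits, resolution, and level count are arbitrary choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def f(x, y):
    # f(x, y) = 4x^2 + 2y^2 - 4x + 6y + 2xy + 5
    return 4 * x**2 + 2 * y**2 - 4 * x + 6 * y + 2 * x * y + 5

x = np.linspace(-4, 4, 200)
y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(x, y)     # 2-D grid over the plotting window
Z = f(X, Y)

fig, ax = plt.subplots()
cs = ax.contour(X, Y, Z, levels=20)
ax.clabel(cs, inline=True, fontsize=8)  # label the contour levels
fig.savefig("contour.png")
```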
Points: 10
2. Maximum Likelihood Estimation (MLE) (10)
For the normal distribution with mean m and variance σ², its pdf is defined by
f(x) = (1 / (√(2π) σ)) exp(−(x − m)² / (2σ²))
a. Plot the curve for 𝑚 = 3 and 𝜎 = 1
b. Plot the curve for 𝑚 = 3 and 𝜎 = √10
c. Let’s assume we have N samples, {x1, x2, …, xN}. The likelihood function is defined by
L(x1, x2, …, xN; m, σ) = ∏_{i=1}^N f(x_i) = f(x1) × f(x2) × ⋯ × f(xN)
find the MLE for 𝑚 and 𝜎.
d. Draw 10 samples from a normal distribution with m = 3 and σ = √10. Show its histogram, and
calculate its mean and variance
e. Draw 100 samples from a normal distribution with m = 3 and σ = √10. Show its histogram, and
calculate its mean and variance
f. Draw 1000 samples from a normal distribution with m = 3 and σ = √10. Show its histogram, and
calculate its mean and variance
Hints:
– Check https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html for
random normal number generator
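The hinted generator covers parts (d)–(f) above; a seeded sketch (the seed and variable names are my own choices, made so repeated runs agree):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
m, sigma = 3.0, np.sqrt(10.0)

stats = {}
for n in (10, 100, 1000):
    samples = rng.normal(loc=m, scale=sigma, size=n)
    counts, edges = np.histogram(samples, bins=10)  # plt.hist(samples) draws this
    stats[n] = (samples.mean(), samples.var(ddof=1))
```

As n grows, the sample mean and variance should approach m = 3 and σ² = 10.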
3. Bayes Rule and Conditional Distribution (15)
For a company, we have collected the following information for their hiring process over the last 10
years.
Education   Ph.D.   Master   Bachelor
Accepted      10      25       45
Rejected      90     125       55
a. What is the probability that an applicant has a Ph.D.?
b. What is the probability of being accepted given that the applicant has at least a Master's degree?
c. What is the probability of being accepted?
d. What is the probability of having a Ph.D. given that the candidate was accepted?
Points: 15
CS 4342 Assignment #2
1. Describe the null hypotheses to which the p-values given in the table below correspond. Explain what
conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV,
radio, and newspaper, rather than in terms of the coefficients of the linear model.
Points: 5
2. Carefully explain the differences between the KNN classifier and KNN regression methods.
Points: 5
3. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male),
X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting
salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β̂0 = 50,
β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.
(a) Which answer is correct, and why?
i. For a fixed value of IQ and GPA, males earn more on average than females.
ii. For a fixed value of IQ and GPA, females earn more on average than males.
iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is
high enough.
iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is
high enough.
(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.
(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of
an interaction effect. Justify your answer.
Points: 5
4. I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a
linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1 X + β2 X² + β3 X³ + ε.
(a) Suppose that the true relationship between X and Y is linear, i.e. 𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖 . Consider the training
residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would
we expect one to be lower than the other, would we expect them to be the same, or is there not enough information
to tell? Justify your answer.
(b) Answer (a) using test rather than training RSS.
(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear.
Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we
expect one to be lower than the other, would we expect them to be the same, or is there not enough information to
tell? Justify your answer.
(d) Answer (c) using test rather than training RSS.
Points: 5
5. Consider the fitted values that result from performing linear regression without an intercept. In this setting, the ith
fitted value takes the form
ŷ_i = x_i β̂,
where
β̂ = (Σ_{i=1}^n x_i y_i) / (Σ_{i'=1}^n x_{i'}²)
Show that we can write
ŷ_i = Σ_{i'=1}^n a_{i'} y_{i'}
What is a_{i'}?
Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the
response values.
Points: 5
6. Using equation (3.4) – shown below, argue that in the case of simple linear regression, the least squares line
always passes through the point (𝑥̅, 𝑦̅).
β̂1 = (Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)) / (Σ_{i=1}^n (x_i − x̄)²)   (3.4.a)
β̂0 = ȳ − β̂1 x̄   (3.4.b)
Points: 5
Applied Questions
1. This question involves the use of simple linear regression on the Auto data set.
(a) Perform a simple linear regression with mpg as the response and horsepower as the predictor and answer the
following questions:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 95?
(b) Plot the response and the predictor along with the predicted line.
Hints:
– Check https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
– Check https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
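A sketch of the scikit-learn route from the hints, run here on synthetic data standing in for the mpg and horsepower columns (the assumed negative trend and all numbers are illustrative; the assignment uses the real Auto data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 200, 100)                # stand-in predictor
mpg = 40 - 0.15 * horsepower + rng.normal(0, 1, 100)  # assumed downward trend

X = horsepower.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
model = LinearRegression().fit(X, mpg)

slope = model.coef_[0]                           # sign answers part iii
pred_95 = model.predict(np.array([[95.0]]))[0]   # part iv: mpg at horsepower 95
```

For the strength of the relationship in parts i–ii, the statsmodels OLS summary (first hint) is the more direct tool, since it reports t-statistics and p-values.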
Points: 20
2. This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
(b) Compute the matrix of correlations between the variables. You will need to exclude the name variable which is
qualitative.
(c) Perform a multiple linear regression with mpg as the response and all other variables except name as the
predictors. Examine the results, and comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
(d) Produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the
residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually
high leverage?
(e) Fit linear regression models with predictors and interaction terms. Do any interactions appear to be statistically
significant?
(f) Fit linear regression models with only interaction terms. Do any interactions appear to be statistically significant?
(g) Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.
Hints:
– Check https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html
– Check NumPy, SciPy, or Pandas for correlation
– Check
https://www.statsmodels.org/stable/generated/statsmodels.graphics.regressionplots.influence_plot.html for
the leverage plot
– Check https://joelcarlson.github.io/2016/05/10/Exploring-Interactions/ for the interaction term. You can
build them manually too.
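A sketch of parts (a), (b), and the manual interaction-term route from the last hint, on a small synthetic frame standing in for the numeric Auto columns (column choices are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({                  # stand-in for the numeric Auto columns
    "mpg": rng.normal(23, 5, 50),
    "weight": rng.normal(3000, 500, 50),
    "year": rng.integers(70, 83, 50).astype(float),
})

corr = df.corr()                     # (b) matrix of pairwise correlations
# pd.plotting.scatter_matrix(df)     # (a) draws the scatterplot matrix

# Interaction terms can be built manually as products of columns:
df["weight_x_year"] = df["weight"] * df["year"]
```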
Points: 20
3. This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are
qualitative!
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which
there is evidence of association with the outcome.
(f) How well do the models in (a) and (e) fit the data?
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
Points: 20
4. This problem involves the Boston data set, which we saw in the previous HW. We will now try to predict
per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response,
and the other variables are the predictors.
(a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of
the models is there a statistically significant association between the predictor and the response? Create some plots
to back up your assertions.
(b) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For
which predictors can we reject the null hypothesis H0 : βj = 0?
(c) How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression
coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each
predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the
x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
(d) Is there evidence of non-linear association between any of the predictors and the response? To answer this
question, for each predictor X, fit a model of the form
Y = β0 + β1 X + β2 X² + β3 X³ + ε
Points: 10
CS4342 Assignment #3
1. Using a little bit of algebra, prove that (4.2) is equivalent to (4.3) – check below equations. In other
words, the logistic function representation and logit representation for the logistic regression model are
equivalent.
P(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))   (4.2)
P(X) / (1 − P(X)) = e^(β0 + β1 X)   (4.3)
Points: 5
2. It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent
to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other
words, under the assumption that the observations in the kth class are drawn from a N(μ_k, σ²) distribution,
the Bayes’ classifier assigns an observation to the class for which the discriminant function is maximized.
p_k(x) = [π_k (1/(√(2π)σ)) exp(−(x − μ_k)²/(2σ²))] / [Σ_{l=1}^K π_l (1/(√(2π)σ)) exp(−(x − μ_l)²/(2σ²))]   (4.12)

δ_k(x) = x μ_k/σ² − μ_k²/(2σ²) + log(π_k)   (4.13)
Points: 5
3. This problem relates to the QDA model, in which the observations within each class are drawn from a
normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider
the simple case where p = 1; i.e. there is only one feature. Suppose that we have K classes, and that if an
observation belongs to the kth class then X comes from a one-dimensional normal distribution, X ~ N(μ_k, σ_k²).
Prove that in this case, the Bayes’ classifier is not linear. Argue that it is in fact quadratic.
Hint: For this problem, you should follow the arguments laid out in Section 4.4.2, but without making the
assumption that σ1² = σ2² = ⋯ = σK².
Points: 5
4. We now examine the differences between LDA and QDA.
(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training
set? On the test set?
(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training
set? On the test set?
(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative
to LDA to improve, decline, or be unchanged? Why?
(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve
a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear
decision boundary. Justify your answer.
Points: 5
5. Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2
= undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients
β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
(a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an
A in the class.
(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in
the class?
Points: 5
6. Suppose that we wish to predict whether a given stock will issue a dividend this year (“Yes” or “No”)
based on X, last year’s percent profit. We examine a large number of companies and discover that the mean
value of X for companies that issued a dividend was X̄ = 10, while the mean for those that didn’t was X̄ =
0. In addition, the variance of X for these two sets of companies was σ̂² = 36. Finally, 80% of companies
issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will
issue a dividend this year given that its percentage profit was X = 4 last year.
Hint: You will need to use Bayes’ theorem.
Points: 5
7. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two
different classification procedures. First we use logistic regression and get an error rate of 20% on the
training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average
error rate (averaged over both test and training data sets) of 18%. Based on these results, which method
should we prefer to use for classification of new observations? Why?
Points: 5
Applied Questions
1. This question should be answered using the Weekly data set.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any
patterns?
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag
variables plus Volume as predictors. Do any of the predictors appear to be statistically significant? If so,
which ones?
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion
matrix is telling you about the types of mistakes made by logistic regression.
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the
only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out
data (that is, the data from 2009 and 2010).
(e) Repeat (d) using LDA.
(f) Repeat (d) using QDA.
(g) Repeat (d) using KNN with K = 1.
(h) Which of these methods appears to provide the best results on this data?
(i) Experiment with different combinations of predictors, including possible transformations and
interactions, for each of the methods. Report the variables, method, and associated confusion matrix that
appears to provide the best results on the held out data. Note that you should also experiment with values
for K in the KNN classifier.
Hints:
– The dataset explanation can be found here: https://rdrr.io/cran/ISLR/man/Smarket.html
– Check https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html for logistic regression
– Check https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html for QDA
– Check https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html for LDA
– Check https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html for confusion matrix
– Check https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html for KNN
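A sketch wiring the hinted scikit-learn pieces together. The data here are synthetic stand-ins for Lag2 and Direction, and the 250/50 split only mimics the 1990–2008 vs 2009–2010 split in part (d):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(3)
lag2 = rng.normal(0, 1, 300)                                # stand-in for Lag2
direction = (lag2 + rng.normal(0, 2, 300) > 0).astype(int)  # 1 = "Up"

# Held-out split mimicking training on 1990-2008 and testing on 2009-2010.
X_train, X_test = lag2[:250].reshape(-1, 1), lag2[250:].reshape(-1, 1)
y_train, y_test = direction[:250], direction[250:]

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

cm = confusion_matrix(y_test, pred, labels=[0, 1])  # rows: truth, cols: predicted
accuracy = (pred == y_test).mean()                  # overall fraction correct
```

Swapping LogisticRegression for LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis, or KNeighborsClassifier(n_neighbors=1) covers parts (e)–(g).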
Points: 25
2. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage
based on the Auto data set.
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if
mpg contains a value below its median. Create a single data set containing both mpg01 and the other
Auto variables.
(b) Explore the data graphically in order to investigate the association between mpg01 and the other
features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and
boxplots may be useful tools to answer this question. Describe your findings.
(c) Split the data into a training set and a test set.
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the model obtained?
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the model obtained?
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that
seemed most associated with mpg01 in (b). What is the test error of the model obtained?
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the
variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of
K seems to perform the best on this data set?
Hint:
– Check this link for boxplots https://seaborn.pydata.org/generated/seaborn.boxplot.html
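A sketch of parts (a) and (c) on a synthetic stand-in for the Auto data (the columns and split fraction are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
auto = pd.DataFrame({                 # synthetic stand-in for Auto
    "mpg": rng.uniform(10, 45, 100),
    "weight": rng.uniform(1600, 5000, 100),
})

# (a) mpg01 = 1 when mpg is above its median, else 0
auto["mpg01"] = (auto["mpg"] > auto["mpg"].median()).astype(int)

# (b) e.g. seaborn.boxplot(x="mpg01", y="weight", data=auto)

# (c) a simple 70/30 train/test split
train = auto.sample(frac=0.7, random_state=0)
test = auto.drop(train.index)
```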
Points: 25
3. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime
rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets
of the predictors. Describe your findings.
Points: 15
CS4342 Assignment #4
1. Using basic statistical properties of the variance, as well as single-variable calculus, derive (1). In other
words, prove that α given by (1) does indeed minimize Var(αX + (1 − α)Y).
α = (σ_Y² − σ_XY) / (σ_X² + σ_Y² − 2σ_XY)   (1)
Points: 5
2. We now review k-fold cross-validation.
(a) Explain how k-fold cross-validation is implemented.
(b) What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
Points: 5
3. Suppose that we use some statistical learning method to make a prediction for the response Y for a
particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our
prediction.
Points: 5
4. We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For
each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:
(a) Which of the three models with k predictors has the smallest training RSS?
(b) Which of the three models with k predictors has the smallest test RSS?
(c) True or False:
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors
in the (k+1)-variable model identified by forward stepwise selection.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the
predictors in the (k + 1)- variable model identified by backward stepwise selection.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the
predictors in the (k + 1)- variable model identified by forward stepwise selection.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the
predictors in the (k+1)-variable model identified by backward stepwise selection.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in
the (k + 1)-variable model identified by best subset selection.
Points: 5
5. Suppose we estimate the regression coefficients in a linear regression model by minimizing
Σ_{i=1}^n (y_i − β0 − Σ_{j=1}^p β_j x_{ij})²   subject to   Σ_{j=1}^p |β_j| ≤ s
for a particular value of s, where s is positive. For parts (a) through (e), indicate which of i. through v. is correct.
Justify your answer.
(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
(b) Repeat (a) for test RSS.
(c) Repeat (a) for variance.
(d) Repeat (a) for (squared) bias.
(e) Repeat (a) for the irreducible error.
Points: 5
6. Suppose we estimate the regression coefficients in a linear regression model by minimizing
Σ_{i=1}^n (y_i − β0 − Σ_{j=1}^p β_j x_{ij})² + λ Σ_{j=1}^p β_j²
for a particular value of λ. For parts (a) through (e), indicate which of i. through v. is correct. Justify your
answer.
(a) As we increase λ from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.
(b) Repeat (a) for test RSS.
(c) Repeat (a) for variance.
(d) Repeat (a) for (squared) bias.
(e) Repeat (a) for the irreducible error.
Points: 5
Applied Questions
1. We can use logistic regression to predict the probability of default using income and balance on the
Default data set. We will now estimate the test error of this logistic regression model using the validation
set approach.
(a) Fit a logistic regression model that uses income and balance to predict default.
(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must
perform the following steps:
i. Split the sample set into a training set and a validation set.
ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation set by computing the
posterior probability of default for that individual, and classifying the individual to the default
category if the posterior probability is greater than 0.5.
iv. Compute the validation set error, which is the fraction of the observations in the validation set
that are misclassified.
(c) Repeat the process in (b) three times, using three different splits of the observations into a training set
and a validation set. Comment on the results obtained.
(d) Now consider a logistic regression model that predicts the probability of default using income, balance,
and a dummy variable for student. Estimate the test error for this model using the validation set approach.
Comment on whether or not including a dummy variable for student leads to a reduction in the test error
rate.
Hint:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
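The four steps in (b) can be sketched as follows, on synthetic stand-ins for income, balance, and default (all distributions and the 0.5 cutoff from step iii are as stated; the specific numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 1000
income = rng.normal(40, 10, n)          # in thousands, an arbitrary scale
balance = rng.normal(800, 400, n)
default = (balance + rng.normal(0, 300, n) > 1400).astype(int)

X = np.column_stack([income, balance])

# i. split the sample into training and validation halves
idx = rng.permutation(n)
train, valid = idx[:500], idx[500:]

# ii. fit a multiple logistic regression on the training observations only
clf = LogisticRegression(max_iter=1000).fit(X[train], default[train])

# iii. classify as "default" when the posterior probability exceeds 0.5
post = clf.predict_proba(X[valid])[:, 1]
pred = (post > 0.5).astype(int)

# iv. validation-set error = fraction of validation observations misclassified
val_error = (pred != default[valid]).mean()
```

Repeating the split with different permutations gives part (c).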
Points: 15
2. We will now perform cross-validation on a simulated data set.
(a) Generate a simulated data set as follows:
> x: create 100 random samples from a normal distribution with mean 0 and variance 1
> y = x − 2x² + noise, where the noise terms are samples from a normal distribution with mean 0 and variance 1
In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
(b) Create a scatterplot of X against Y . Comment on what you find.
(c) Compute the LOOCV errors that result from fitting the following four models using least squares:
i. Y = β0 + β1 X + ε
ii. Y = β0 + β1 X + β2 X² + ε
iii. Y = β0 + β1 X + β2 X² + β3 X³ + ε
iv. Y = β0 + β1 X + β2 X² + β3 X³ + β4 X⁴ + ε
(d) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your
answer.
(e) Comment on the statistical significance of the coefficient estimates that results from fitting each of the
models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html
– Check extra for cross-validation https://scikit-learn.org/stable/modules/cross_validation.html
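A sketch of how the hinted LeaveOneOut splitter can be combined with cross_val_score for the four polynomial fits (the seed is an arbitrary choice; your simulated data will differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data as in (a): y = x - 2x^2 + noise
rng = np.random.default_rng(6)
x = rng.normal(0, 1, 100)
y = x - 2 * x**2 + rng.normal(0, 1, 100)
X = x.reshape(-1, 1)

loocv_mse = {}
for degree in (1, 2, 3, 4):
    # Polynomial expansion up to `degree`, then ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    loocv_mse[degree] = -scores.mean()   # average held-out squared error

best_degree = min(loocv_mse, key=loocv_mse.get)
```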
Points: 10
3. We will now consider the Boston housing data set.
(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.
(b) Provide an estimate of the standard error of μ̂. Interpret this result.
(c) Now estimate the standard error of μ̂ using the bootstrap. How does this compare to your answer from
(b)?
(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
– We can compute the standard error of the sample mean by dividing the sample standard deviation
by the square root of the number of observations.
– You can approximate a 95% confidence interval using the formula [μ̂ − 2 SE(μ̂), μ̂ + 2 SE(μ̂)].
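The hints above combine into the following sketch, using a synthetic stand-in for the medv column (distribution parameters and the number of bootstrap replicates are arbitrary choices):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(7)
medv = rng.normal(22.5, 9.0, 500)   # synthetic stand-in for the medv column

mu_hat = medv.mean()                                  # (a) sample mean
se_formula = medv.std(ddof=1) / np.sqrt(len(medv))    # (b) s / sqrt(n)

# (c) bootstrap: resample with replacement, recompute the mean each time
boot_means = [resample(medv, random_state=b).mean() for b in range(1000)]
se_boot = np.std(boot_means, ddof=1)

# (d) approximate 95% confidence interval
ci = (mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
```

The two standard-error estimates should agree closely, which is part of what (c) asks you to observe.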
Points: 10
4. Here, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Generate a predictor X of length n = 100 from a normal distribution with mean 0 and variance 1, as well
as a noise vector 𝜖 of length n = 100.
(b) Generate a response vector Y of length n = 100 according to the model
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝛽2 𝑋² + 𝛽3 𝑋³ + 𝜖,
where 𝛽0, 𝛽1, 𝛽2, and 𝛽3 are constants of your choice. For 𝑋 and 𝜖, use the data generated in (a).
(c) Perform best subset selection in order to choose the best model containing the predictors 𝑋, 𝑋², . . .,
𝑋¹⁰. What is the best model obtained according to Cp, BIC, and adjusted 𝑅²? Show some plots to provide
evidence for your answer, and report the coefficients of the best model obtained.
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does
your answer compare to the results in (c)?
(e) Now fit a lasso model to the simulated data, again using 𝑋, 𝑋², . . ., 𝑋¹⁰ as predictors. Use
cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ.
Report the resulting coefficient estimates, and discuss the results obtained.
(f) Now generate a response vector Y according to the model
𝑌 = 𝛽0 + 𝛽7 𝑋⁷ + 𝜖,
and perform best subset selection and the lasso. Discuss the results obtained.
Hints:
– Check this link for best subset selection:
https://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
You can change the code for different metrics.
– Check this for forward and backward subset selection
https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
– Check this link for Ridge https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
– Check this link for Lasso https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
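Part (e) can be sketched like this. The betas, seed, and standardization step are our illustrative choices, not prescribed by the assignment:

```python
# A minimal sketch of (e): lasso on X, X^2, ..., X^10 with lambda chosen by
# cross-validation. The betas and the standardization step are illustrative.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=100)
eps = rng.normal(size=100)
Y = 1 + 2 * X - 1 * X**2 + 0.5 * X**3 + eps        # beta0..beta3 of our choice

# Columns X, X^2, ..., X^10, standardized so the lasso penalty treats them alike
Xpoly = PolynomialFeatures(10, include_bias=False).fit_transform(X.reshape(-1, 1))
Xpoly = StandardScaler().fit_transform(Xpoly)

lasso = LassoCV(cv=10, random_state=0).fit(Xpoly, Y)
# lasso.alpha_ is the CV-chosen lambda; zero coefficients are dropped terms
nonzero_powers = np.flatnonzero(lasso.coef_) + 1    # powers of X that survive
```

Plotting lasso.mse_path_.mean(axis=1) against lasso.alphas_ gives the cross-validation-error-versus-λ plot the question asks for.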
Points: 20
5. Here, we will predict the number of applications received using the other variables in the College data
set.
(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error
obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained,
along with the number of non-zero coefficient estimates.
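The workflow for (a)-(d) can be sketched as below. Synthetic data with a College-like predictor count stands in for the real data set (the real College data has 777 rows), so the numbers are illustrative only:

```python
# A minimal sketch of (a)-(d) on synthetic data shaped like the College set
# (17 numeric predictors); the real data set would replace X and y.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 17))
y = X @ rng.normal(size=17) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_tr, y_tr)  # CV over lambda
lasso = LassoCV(cv=10, random_state=0).fit(X_tr, y_tr)

test_mse = {name: mean_squared_error(y_te, m.predict(X_te))
            for name, m in [("ols", ols), ("ridge", ridge), ("lasso", lasso)]}
n_nonzero = int(np.sum(lasso.coef_ != 0))           # for part (d)
```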
Points: 15
CS4342 Assignment #5
1. Suppose we fit a curve with basis functions 𝑏1(𝑋) = 𝑋, 𝑏2(𝑋) = (𝑋 − 1)² 𝐼(𝑋 ≥ 1). (Note that I(X ≥ 1)
equals 1 for X ≥ 1 and 0 otherwise.) We fit the linear regression model
𝑌 = 𝛽0 + 𝛽1 𝑏1(𝑋) + 𝛽2 𝑏2(𝑋) + 𝜖,
and obtain coefficient estimates 𝛽̂0 = 1, 𝛽̂1 = 1, 𝛽̂2 = −2. Sketch the estimated curve between X = −2
and X = 2. Note the intercepts, slopes, and other relevant information.
Points: 5
2. Suppose we fit a curve with basis functions 𝑏1(𝑋) = 𝐼(0 ≤ 𝑋 ≤ 2) − (𝑋 − 1)𝐼(1 ≤ 𝑋 ≤ 2), 𝑏2(𝑋) =
(𝑋 − 3)𝐼(3 ≤ 𝑋 ≤ 4) + 𝐼(4 < 𝑋 ≤ 5). We fit the linear regression model
𝑌 = 𝛽0 + 𝛽1 𝑏1(𝑋) + 𝛽2 𝑏2(𝑋) + 𝜖,
and obtain coefficient estimates 𝛽̂0 = 1, 𝛽̂1 = 1, 𝛽̂2 = 3. Sketch the estimated curve between X = −2 and
X = 2. Note the intercepts, slopes, and other relevant information.
Points: 5
3. Draw an example (of your own invention) of a partition of two dimensional feature space that could result
from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree
corresponding to this partition. Be sure to label all aspects of your figures, including the regions R1, R2, . .
., the cutpoints t1, t2, . . ., and so forth.
Points: 5
4. It is mentioned in Section 8.2.3 that boosting using depth-one trees (or stumps) leads to an additive
model: that is, a model of the form
𝑓(𝑋) = ∑ᵖⱼ₌₁ 𝑓𝑗(𝑋𝑗).
Explain why this is the case. You can begin with (8.12) in Algorithm 8.2.
Points: 5
5. This question relates to the plots in the below figures.
(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of
the below figure. The numbers inside the boxes indicate the mean of Y within each region.
(b) Create a diagram similar to the left-hand panel of the figure, using the tree illustrated in the right-hand
panel of the same figure. You should divide up the predictor space into the correct regions, and indicate
the mean for each region.
Points: 5
6. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We
then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10
estimates of P(Class is Red|X):
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two common ways to combine these results together into a single class prediction. One is the
majority vote approach discussed in this chapter.
The second approach is to classify based on the average probability. In this example, what is the final
classification under each of these two approaches?
Points: 5
Applied Questions
1. In this exercise, we will further analyze the Wage data set.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal
degree d for the polynomial. Make a plot of the resulting polynomial fit to the data.
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number
of cuts. Make a plot of the fit obtained.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
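For part (b), a step function is just a linear model on one-hot age bins, so cross-validation over the number of cuts can be sketched as below. Synthetic age/wage data stands in for the Wage set, and the bin strategy is our choice:

```python
# A minimal sketch of (b): a step function as a linear model on one-hot age
# bins. Synthetic data stands in for the Wage set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(4)
age = rng.uniform(18, 80, size=500)
wage = 50 + 0.6 * age - 0.005 * age**2 + rng.normal(scale=5, size=500)

def cv_mse(n_cuts):
    # Equal-width bins of age; one-hot columns give a piecewise-constant fit
    bins = KBinsDiscretizer(n_bins=n_cuts, encode="onehot-dense",
                            strategy="uniform").fit_transform(age.reshape(-1, 1))
    scores = cross_val_score(LinearRegression(), bins, wage, cv=10,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

errs = {k: cv_mse(k) for k in range(2, 9)}
best_cuts = min(errs, key=errs.get)
```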
Points: 15
2. The Wage data set contains a number of other features not explored in Chapter 7, such as marital status
(maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors
and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of
the results obtained, and write a summary of your findings.
Points: 10
3. This question uses the variables dis (the weighted mean of distances to five Boston employment centers)
and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as
the predictor and nox as the response.
(a) Fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the
resulting data and polynomial fits.
(b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the
associated residual sum of squares.
(c) Perform cross-validation or another approach to select the optimal degree for the polynomial, and
explain your results.
(d) Fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom.
How did you choose the knots? Plot the resulting fit.
(e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the
resulting RSS. Describe the results obtained.
(f) Perform cross-validation or another approach in order to select the best degrees of freedom for a
regression spline on this data. Describe your results.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
– Check https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/
Points: 10
4. Apply random forests to predict medv of the Boston data after converting it into a qualitative response
variable – values above the median of medv are set to 1 and the others are set to 0. Use all other predictors
to predict this qualitative response using 25 and 500 trees. Create a plot displaying the test error resulting
from random forests on this data set for a more comprehensive range of values of the number of predictors
and trees. Describe the results obtained.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
– Check https://scikit-learn.org/stable/modules/ensemble.html
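The grid over number of trees and predictors per split can be sketched as below. Synthetic data stands in for the Boston median split, and the max_features grid is an illustrative choice:

```python
# A minimal sketch of the test-error grid over trees and max_features.
# Synthetic data stands in for the Boston median split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # 1 above the "median", else 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

test_err = {}
for n_trees in (25, 500):
    for m in (2, 4, 12):                         # predictors tried at each split
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=m,
                                    random_state=0).fit(X_tr, y_tr)
        test_err[(n_trees, m)] = 1 - rf.score(X_te, y_te)
```

Plotting test_err over a finer grid of (n_trees, m) values gives the comprehensive plot the question asks for.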
Points: 10
5. We want to predict Sales in the Carseats data set using regression trees and related approaches.
(a) Split the data set into a training set and a test set.
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you
obtain?
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree
improve the test MSE?
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Determine which
variables are most important (variable importance measure).
(e) Use random forests to analyze this data. What test MSE do you obtain? Determine which variables
are most important (variable importance measure). Describe the effect of m, the number of variables
considered at each split, on the error rate obtained.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
– Check https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
– Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
– Check https://scikit-learn.org/stable/modules/ensemble.html
– Check https://machinelearningmastery.com/calculate-feature-importance-with-python/
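Parts (b) and (d) can be sketched as below. Bagging is implemented here as a random forest that considers all predictors at each split; synthetic data with Carseats-like dimensions stands in for the real set:

```python
# A minimal sketch of (b) and (d): single tree vs. bagging, where bagging is a
# random forest with max_features equal to all predictors. Synthetic data
# stands in for Carseats.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
bag = RandomForestRegressor(n_estimators=200, max_features=X.shape[1],
                            random_state=0).fit(X_tr, y_tr)

tree_mse = mean_squared_error(y_te, tree.predict(X_te))
bag_mse = mean_squared_error(y_te, bag.predict(X_te))
top_var = int(np.argmax(bag.feature_importances_))  # variable importance for (d)
```

Swapping in a smaller max_features turns bag into a random forest, which is part (e)'s question about the effect of m.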
Points: 15
6. We now use boosting to predict Salary in the Hitters data set.
(a) Remove the observations for whom the salary information is unknown, and then log-transform the
salaries.
(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining
observations.
(c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter
λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on
the y-axis.
(d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the
y-axis.
(e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression
approaches seen in Chapters 3 and 6.
(f) Which variables appear to be the most important predictors in the boosted model?
(g) Now apply bagging to the training set. What is the test set MSE for this approach?
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
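The shrinkage sweep in (c)-(d) can be sketched as below. Synthetic data with Hitters-like dimensions stands in for the real (log-salary) data, and the λ grid is an illustrative choice:

```python
# A minimal sketch of (c)-(d): boosting with 1,000 trees over a grid of
# shrinkage values. Synthetic data stands in for the log-salary Hitters set.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(8)
X = rng.normal(size=(263, 19))                   # Hitters-like dimensions
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=263)

# First 200 observations train, the rest test, as in part (b)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

test_mse = {}
for lam in (0.001, 0.01, 0.1, 0.2):              # shrinkage (learning rate)
    gb = GradientBoostingRegressor(n_estimators=1000, learning_rate=lam,
                                   random_state=0).fit(X_tr, y_tr)
    test_mse[lam] = mean_squared_error(y_te, gb.predict(X_te))
# gb.feature_importances_ answers (f) for the last fitted model
```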
Points: 10
CS4342 Assignment #6
1. This problem involves hyperplanes in two dimensions.
(a) Sketch the hyperplane 1 + 3𝑋1 − 𝑋2 = 0. Indicate the set of points for which 1 + 3𝑋1 − 𝑋2 > 0, as
well as the set of points for which 1 + 3𝑋1 − 𝑋2 < 0.
(b) On the same plot, sketch the hyperplane −2 + 𝑋1 + 2𝑋2 = 0. Indicate the set of points for which
−2 + 𝑋1 + 2𝑋2 > 0, as well as the set of points for which −2 + 𝑋1 + 2𝑋2 < 0.
Points: 5
2. We have seen that in p = 2 dimensions, a linear decision boundary takes the form 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 = 0.
We now investigate a non-linear decision boundary.
(a) Sketch the curve (1 + 𝑋1)² + (2 − 𝑋2)² = 4.
(b) On your sketch, indicate the set of points for which (1 + 𝑋1)² + (2 − 𝑋2)² > 4, as well as the set of
points for which (1 + 𝑋1)² + (2 − 𝑋2)² ≤ 4.
(c) Suppose that a classifier assigns an observation to the blue class if (1 + 𝑋1)² + (2 − 𝑋2)² > 4, and to
the red class otherwise. To what class is the observation (0, 0) classified? (−1, 1)? (2, 2)? (3, 8)?
(d) Argue that while the decision boundary in (c) is not linear in terms of 𝑋1 and 𝑋2, it is linear in terms of
𝑋1, 𝑋2, 𝑋1², and 𝑋2².
Points: 5
3. Here we explore the maximal margin classifier on a toy data set.
(a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class
label.
(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form
(9.1)).
(c) Describe the classification rule for the maximal margin classifier.
It should be something along the lines of “Classify to Red if 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 > 0, and classify to Blue
otherwise.” Provide the values for 𝛽0, 𝛽1, and 𝛽2.
(d) On your sketch, indicate the margin for the maximal margin hyperplane.
(e) Indicate the support vectors for the maximal margin classifier.
(f) Argue that a slight movement of the seventh observation would not affect the maximal margin
hyperplane.
(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this
hyperplane.
(h) Draw an additional observation on the plot so that the two classes are no longer separable by a
hyperplane.
Points: 5
4. Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity
between the second and fourth observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering
these four observations using complete linkage. Be sure to indicate on the plot the height at which each
fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.
(b) Repeat (a), this time using single linkage clustering.
(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations
are in each cluster?
(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations
are in each cluster?
(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters
being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is
equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the
meaning of the dendrogram is the same.
Points: 5
5. In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6
observations and p = 2 features. The observations are as follows.
(a) Plot the observations.
(b) Randomly assign a cluster label to each observation. Report the cluster labels for each observation.
(c) Compute the centroid for each cluster.
(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the
cluster labels for each observation.
(e) Repeat (c) and (d) until the answers obtained stop changing.
(f) In your plot from (a), color the observations according to the cluster labels obtained.
Points: 5
6. Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using
complete linkage. We obtain two dendrograms.
(a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete
linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur
higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
(b) At a certain point on the single linkage dendrogram, the clusters {5} and {6} fuse. On the complete
linkage dendrogram, the clusters {5} and {6} also fuse at a certain point. Which fusion will occur higher on
the tree, or will they fuse at the same height, or is there not enough information to tell?
Points: 5
7. In words, describe the results that you would expect if you performed K-means clustering of the eight
shoppers in Figure 10.14 – shown below, on the basis of their sock and computer purchases, with K = 2.
Give three answers, one for each of the variable scalings displayed. Explain.
Points: 5
Applied Questions
1. We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a
non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by
performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a
quadratic decision boundary between them. For instance, you can do this as follows:
> X1 = np.random.uniform(size=500) - 0.5
> X2 = np.random.uniform(size=500) - 0.5
> y = 1 * (X1**2 - X2**2 > 0)
(This assumes numpy is imported as np; see
https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html)
(b) Plot the observations, colored according to their class labels. Your plot should display 𝑋1 on the x-axis,
and 𝑋2 on the y-axis.
(c) Fit a logistic regression model to the data, using 𝑋1 and 𝑋2 as predictors.
(d) Apply this model to the training data in order to obtain a predicted class label for each training
observation. Plot the observations, colored according to the predicted class labels. The decision boundary
should be linear.
(e) Now fit a logistic regression model to the data using non-linear functions of 𝑋1 and 𝑋2 as predictors
(e.g. 𝑋1², 𝑋1 × 𝑋2, log(𝑋2), and so forth).
(f) Apply this model to the training data in order to obtain a predicted class label for each training
observation. Plot the observations, colored according to the predicted class labels. The decision boundary
should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which
the predicted class labels are obviously non-linear.
(g) Fit a support vector classifier to the data with 𝑋1 and 𝑋2 as predictors. Obtain a class prediction for each
training observation. Plot the observations, colored according to the predicted class labels.
(h) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation.
Plot the observations, colored according to the predicted class labels.
(i) Comment on your results.
Hint:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
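The core of (a), (c), (e), and (g)-(h) can be sketched as below. The quadratic feature set and the RBF gamma value are illustrative choices:

```python
# A minimal sketch: linear vs. quadratic logistic regression and an RBF-kernel
# SVM on the quadratic-boundary data from part (a).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X1 = rng.uniform(size=500) - 0.5
X2 = rng.uniform(size=500) - 0.5
y = 1 * (X1**2 - X2**2 > 0)
X = np.column_stack([X1, X2])

linear_lr = LogisticRegression().fit(X, y)
# Adding squared terms lets logistic regression recover the true boundary
Xq = np.column_stack([X1, X2, X1**2, X2**2, X1 * X2])
quad_lr = LogisticRegression(max_iter=1000).fit(Xq, y)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)

acc = {"linear": linear_lr.score(X, y),
       "quadratic": quad_lr.score(Xq, y),
       "svm": rbf_svm.score(X, y)}
```

Coloring the points by each model's predictions (part (b)'s plot) makes the linear-versus-curved decision boundaries visible, which is what (i) asks you to comment on.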
Points: 15
2. In this problem, you will use support vector approaches in order to predict whether a given car gets high
or low gas mileage based on the Auto data set.
(a) Create a binary variable that takes on a 1 for cars with gas mileage above the median, and a 0 for cars
with gas mileage below the median.
(b) Fit a support vector classifier to the data with the linear kernel, in order to predict whether a car gets
high or low gas mileage. Report the cross-validation error. Comment on your results.
(c) Now repeat (b), this time using SVMs with radial and polynomial basis kernels, with different values of
gamma and degree. Comment on your results.
(d) Make some plots to back up your assertions in (b) and (c).
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
– Check https://scikit-learn.org/0.18/auto_examples/svm/plot_iris.html
Points: 15
3. Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which
clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the
variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion,
should the variables be scaled before the inter-observation dissimilarities are computed? Provide a
justification for your answer.
Hints:
– Check
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html
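Parts (a)-(c) can be sketched with scipy as below. Random data with an offset group stands in for USArrests (50 states × 4 variables), so the cluster memberships are illustrative:

```python
# A minimal sketch of (a)-(c); random data stands in for USArrests
# (50 states x 4 variables), with one group of "states" offset.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(10)
X = rng.normal(size=(50, 4))
X[:17] += 5                                   # give one group a visible offset

Z = linkage(X, method="complete", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into three clusters

# For (c): scale each variable to standard deviation one, then recluster
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Z_scaled = linkage(Xs, method="complete", metric="euclidean")
```

Passing Z to scipy.cluster.hierarchy.dendrogram (linked above) draws the tree; comparing labels before and after scaling addresses part (d).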
Points: 15
4. In this problem, you will generate simulated data, and then perform PCA and K-means clustering on the
data.
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total),
and 50 variables. Use uniformly or normally distributed samples.
(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a
different color to indicate the observations in each of the three classes. If the three classes appear
separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation
so that there is greater separation between the three classes. Do not continue to part (c) until the three
classes show at least some separation in the first two principal component score vectors. Hint: you can
assign different means to different classes to create separate clusters.
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained
in K-means clustering compare to the true class labels?
(d) Perform K-means clustering with K = 2. Describe your results.
(e) Now perform K-means clustering with K = 4, and describe your results.
(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather
than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix whose first column is the
first principal component score vector and whose second column is the second principal component score
vector. Comment on the results.
(g) Using the z-score function to scale your variables, perform K-means clustering with K = 3 on the data
after scaling each variable to have standard deviation one. How do these results compare to those obtained
in (b)? Explain.
Hints:
– Check https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
– Check https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
– Check https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html
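Parts (a)-(c) and (f) can be sketched as below. The class means (0, 1.5, 3) are our illustrative choice for creating separation, as the hint in (b) suggests:

```python
# A minimal sketch of (a)-(c) and (f): three shifted-mean classes, PCA scores,
# then K-means on those scores. The means (0, 1.5, 3) are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
# 20 observations per class, 50 variables; class means differ for separation
X = np.vstack([rng.normal(loc=m, size=(20, 50)) for m in (0.0, 1.5, 3.0)])
truth = np.repeat([0, 1, 2], 20)

scores = PCA(n_components=2).fit_transform(X)     # first two PC score vectors
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

# Agreement up to label permutation: majority true label within each cluster
agree = sum(int(np.bincount(truth[km.labels_ == k]).max()) for k in range(3))
```

With this much separation the clusters should recover the true classes almost perfectly; reducing the gap between the means shows how the K = 2 and K = 4 runs in (d)-(e) merge or split classes.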
Points: 25


