# ISyE 7406 HW 3 solution

## Classification.

In this problem, you are asked to write a report summarizing your analysis of the
popular “Auto MPG” data set. Much research has been done on this data set; here, the
objective of our analysis is to predict whether a given car gets high or low gas
mileage based on seven car attributes: cylinders, displacement, horsepower, weight,
acceleration, model year, and origin.

(a) The “Auto MPG” data set is available at UCI Machine Learning (ML) Repository:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Download the data file “auto-mpg.data” from UCI ML Repository or from Canvas, and use Excel
or Notepad to see the data (this is a .txt file).

There are 398 rows (i.e., 398 different kinds of cars) and 9 columns (the car attributes and
name). Before doing any analysis, we need to clean the raw data. In particular, some values
are missing in this data set. Many statistical methods have been proposed to deal with missing
values; you are encouraged to survey the literature on your own.

For simplicity in this homework, we adopt a simple though inefficient method: remove the rows
with missing values. We also remove the last column of car names, which is text/string and may
cause trouble in our numerical analysis. These two deletions lead to a new cleaned data set of
392 observations and 8 columns. To save time, you can also directly download the cleaned data
file “Auto.csv” from Canvas.
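
The two deletions described in (a) can be sketched in Python with the standard library only. This is a minimal illustration, not the course's reference code: the three sample rows, the column names in `COLUMNS`, and the helper `clean` are assumptions made for the example, and the sketch assumes missing values are marked with `?`, as in the UCI file.

```python
import shlex

# Illustrative rows in the whitespace-separated layout of "auto-mpg.data".
# Assumption: a missing value is marked with "?".
RAW = '''\
18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"
25.0   4   98.00   ?       2046.   19.0   71  1\t"ford pinto"
24.0   4   113.0   95.00   2372.   15.0   70  3\t"toyota corona mark ii"
'''

# Assumed names for the 8 numeric columns (the file itself has no header).
COLUMNS = ["mpg", "cylinders", "displacement", "horsepower",
           "weight", "acceleration", "year", "origin"]


def clean(raw_text):
    """Drop rows with a missing value and drop the trailing car-name column."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        fields = shlex.split(line)        # keeps the quoted car name as one field
        numeric = fields[:len(COLUMNS)]   # discard the 9th column (car name)
        if "?" in numeric:
            continue                      # remove rows with missing values
        rows.append(dict(zip(COLUMNS, map(float, numeric))))
    return rows


cleaned = clean(RAW)
print(len(cleaned), "rows kept;", len(cleaned[0]), "columns each")
```

On the full file, the same two steps should give the 392 observations and 8 columns stated above.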

(b) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median,
and a 0 if mpg contains a value below its median. This binary variable will be the response variable
in this homework. Note that you need to first compute the median value of the mpg variable in the
data set.
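
As a minimal sketch of step (b), the response variable can be built in Python as follows. The `mpg` values are hypothetical stand-ins for the cleaned data, and sending ties (mpg exactly equal to the median) to 0 is an assumption, since the assignment does not say how to treat them.

```python
from statistics import median

# Hypothetical mpg values standing in for the cleaned data's mpg column.
mpg = [18.0, 15.0, 36.0, 24.0, 31.5, 22.0, 27.0, 14.0]

mpg_median = median(mpg)  # computed first, as the assignment notes

# 1 if mpg is above the median, 0 otherwise. Assigning ties (mpg equal to the
# median) to 0 is an assumption; the assignment does not specify this case.
mpg01 = [1 if value > mpg_median else 0 for value in mpg]

print("median:", mpg_median)
print("mpg01: ", mpg01)
```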

(c) Explore the data graphically in order to investigate the association between mpg01 and the
other features. Which of the other features seem most likely to be useful in predicting mpg01?
Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
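
A numeric companion to the boxplots in (c) is a per-class summary of each feature. The sketch below computes the minimum, median, and maximum of one feature within each mpg01 class. The data values and the helper name `summary_by_class` are hypothetical, and this complements the plots rather than replacing them.

```python
from statistics import median

# Hypothetical weight values and mpg01 labels; real values would come from
# the cleaned data set.
weight = [3504, 3693, 2130, 2372, 1985, 4341, 2264, 3433]
mpg01 = [0, 0, 1, 1, 1, 0, 1, 0]


def summary_by_class(feature, labels):
    """Return {class: (min, median, max)} for one feature."""
    out = {}
    for cls in sorted(set(labels)):
        values = [x for x, y in zip(feature, labels) if y == cls]
        out[cls] = (min(values), median(values), max(values))
    return out


for cls, stats in summary_by_class(weight, mpg01).items():
    print(f"mpg01={cls}: min={stats[0]}, median={stats[1]}, max={stats[2]}")
```

A feature whose two class ranges barely overlap, as in this toy example, is a natural candidate for the "most associated" variables used in (e).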

(d) Split the data into a training set and a test set. Any reasonable splitting is acceptable, as
long as you clearly explain how you split and why you think it is reasonable. For convenience,
you can either split at random or hold out every fifth (or tenth) observation as test data.
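
The "every fifth observation" option can be sketched as below. The function name `split_every_kth` is hypothetical, and the integers 1 to 392 only stand in for the cleaned rows.

```python
def split_every_kth(rows, k=5):
    """Hold out every k-th row (the 5th, 10th, ...) as test data."""
    train, test = [], []
    for position, row in enumerate(rows, start=1):
        (test if position % k == 0 else train).append(row)
    return train, test


rows = list(range(1, 393))  # stand-ins for the 392 cleaned observations
train, test = split_every_kth(rows, k=5)
print(len(train), "training rows,", len(test), "test rows")
```

A systematic split like this is reproducible and spreads the test rows across the whole file, which helps if the rows are ordered (for example by model year). A random split with a fixed seed is the usual alternative.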

(e) For the purpose of this homework, apply the following classification methods to the
training data in order to predict mpg01 using the variables that seemed most associated with mpg01
in (c). What is the test error of each model obtained?

(1) LDA (2) QDA (3) Naive Bayes (4) Logistic Regression
(5) KNN with several values of K. Use only the variables that seemed most associated with
mpg01 in (c). Which value of K seems to perform the best on this data set?

(6) (Optional) PCA-KNN. Principal Component Analysis (PCA) or other dimension
reduction methods can easily be combined with other data mining methods. Recall that the essence
of the PC-reduction is to replace the $p$-dimensional explanatory variable $x_i = (x_{i1}, \dots, x_{ip})$, for
$i = 1, \dots, n$, with a new $p$-dimensional explanatory variable $u_i = (u_{i1}, \dots, u_{ip})$, where $u_i = A_{p \times p} x_i$.
Then we can apply standard data mining methods such as KNN to the first $r \, (\le p)$ entries of the
$u_i$'s, $(u_{i1}, \dots, u_{ir})$, to predict the $Y_i$'s. Find the testing errors when KNN with different values of
K (neighbors) is applied to the PCA-dimension-reduced data for different $r = p - 1, p - 2, \dots, 1$.

(7) (Optional) Any other classification methods you want to propose or use.
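
For item (5), a K-nearest-neighbour classifier and its test error can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a required implementation: the function names `knn_predict` and `knn_test_error` and the tiny data set are hypothetical, and the two features are assumed to be standardized already, since KNN depends on distances.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_test_error(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label is wrong."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_y)


# Hypothetical standardized (weight, horsepower) features with mpg01 labels.
train_x = [(-1.2, -1.0), (-0.9, -0.8), (-0.7, -1.1),
           (1.0, 1.2), (0.8, 0.9), (1.3, 0.7)]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [(-1.0, -0.9), (1.1, 1.0), (0.1, -0.2)]
test_y = [1, 0, 0]

for k in (1, 3, 5):
    err = knn_test_error(train_x, train_y, test_x, test_y, k)
    print(f"K={k}: test error = {err:.3f}")
```

Using odd values of K avoids tied votes in a two-class problem. The same test-error definition (misclassified test points divided by test-set size) applies to the other methods in this list.
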

Write a report to summarize your findings, e.g., which method is best and how your
results can guide the manufacture or purchase of high-gas-mileage cars. The report should
include (i) Introduction, (ii) Exploratory (or preliminary) Data Analysis, (iii) Methods,
(iv) Results and (v) Findings. Please attach your computing code for R, Python, or other
statistical software (without, or with limited, output) in the appendix of your report, and please
do not just dump the computer output in the body of the report. It is important to summarize
and interpret your computer output results.

Remarks: (a) From now on, we might no longer ask you to conduct cross-validation (CV)
explicitly as in the previous HWs, but you might need to do it on your own if you want your results
to be more convincing.

(b) In this HW, you are asked to use the explanatory variables that seem most associated
with mpg01. This approach often occurs in manufacturing or biomedical applications, where one
wants to incorporate domain knowledge to improve performance. Note that this might not
be applicable in other applications of machine learning, such as deep neural networks or random
forests, where it is okay to use all explanatory variables.

(c) Below is some sample R code if you want it (please feel free to use Python, MATLAB, or other
software):
## Suppose that you save the data file in a local folder on your computer, say, "C:/Temp":