DSCI552 Assignment 2: Exploratory Data Analysis and K Nearest Neighbors Classification


For this assignment you will perform exploratory data analysis to visualize Fisher’s Iris dataset using Scikit Learn. You will also explore the bias/variance trade-off by applying k-nearest neighbors classification to the Iris dataset while varying the hyperparameter k.

Documentation for Scikit Learn: https://scikit-learn.org/stable/

For more information about the Iris dataset, see this page https://en.wikipedia.org/wiki/Iris_flower_data_set.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from pandas import DataFrame

Load the Iris dataset

iris = datasets.load_iris()
X = iris.data  
y = iris.target
print("Number of instances in the iris dataset:", X.shape[0])
print("Number of features in the iris dataset:", X.shape[1])
print("The dimension of the data matrix X is", X.shape)
X

The vector y has length 150 and contains three unique values: 0, 1, and 2. Each value represents a species of iris flower.

y
y.shape
dir(iris)
iris.target_names
iris.feature_names

Extension: Show a summary table of the iris data, including the min, max, median, and quartiles.

# One possible answer: build a DataFrame from X so describe() works
iris_df = DataFrame(X, columns=iris.feature_names)
iris_df.describe()

Part 1 Exploratory Data Analysis

Part 1a

Generate scatter plots using each pair of the attributes as axes. You should generate $6 = \binom{4}{2}$ scatter plots.

## Insert your answer here...
import matplotlib.pyplot as plt
import seaborn as sns
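
One possible sketch (not the only valid answer; the 2×3 grid layout is an arbitrary choice): iterate over the six attribute pairs and color the points by species.

from itertools import combinations

# Sketch: one scatter plot per attribute pair, colored by species.
pairs = list(combinations(range(X.shape[1]), 2))  # the 6 pairs of 4 features
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (i, j) in zip(axes.ravel(), pairs):
    scatter = ax.scatter(X[:, i], X[:, j], c=y)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
fig.legend(*scatter.legend_elements(), title="species")
plt.tight_layout()
plt.show()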

Extension: Draw boxplots of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Use color to distinguish the target classes.

Some links to help you:

https://seaborn.pydata.org/generated/seaborn.boxplot.html

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html

# Insert your code ...
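
One possible sketch using seaborn (the long-form reshape and the column names "feature" and "cm" are illustrative choices, not requirements):

# Sketch: reshape to long form so seaborn can group boxes by feature and class.
df = DataFrame(X, columns=iris.feature_names)
df["species"] = [iris.target_names[t] for t in y]
long_df = df.melt(id_vars="species", var_name="feature", value_name="cm")
sns.boxplot(data=long_df, x="feature", y="cm", hue="species")
plt.xticks(rotation=20)
plt.show()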

Part 1b

If you were to draw linear decision boundaries to separate the classes, which scatter plot do you think will have the least error and which the most?

Insert your 1b answer here

Part 1c

Scatter plots using two attributes of the data are equivalent to projecting the four-dimensional data down to two dimensions using an axis-parallel projection. Principal component analysis (PCA) is a technique for linearly projecting the data to lower dimensions that are not necessarily axis-parallel. Use PCA to project the data down to two dimensions.

Documentation for PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

### Insert your code here
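
A minimal sketch (axis labels and the choice to print the explained variance are arbitrary):

from sklearn.decomposition import PCA

# Sketch: project the 4-D data onto its first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
print("Explained variance ratio:", pca.explained_variance_ratio_)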

In the case of the Iris dataset, does PCA do a better job of separating the classes?

Insert your answer

Part 2 K Nearest Neighbors

Split the dataset into a train set and a test set. Use 67 percent of the dataset for training and 33 percent for testing.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print("Number of instances in the train set:", X_train.shape[0])
print("Number of instances in the test set:", X_test.shape[0])

Part 2a Create a KNeighborsClassifier with n_neighbors = 5, then train the classifier using the train set.

### One possible answer: fit a 5-NN classifier and report accuracies
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Using", knn.n_neighbors, "neighbors:")
print("The train accuracy score is:", knn.score(X_train, y_train))
print("The test accuracy score is :", knn.score(X_test, y_test))

Part 2b Tuning the hyperparameter k

As we have seen in class, the hyperparameter k of the K Nearest Neighbors classifier affects the inductive bias. For this part, train multiple nearest neighbor classifier models and store the results in a DataFrame. Then plot the training error and testing error versus N/k, where N = 100 (the size of the train set).

Extension: Use different distance metrics for the kNN classification (one possible sketch covering all three metrics follows the code cells below):

1. Euclidean distance
2. Manhattan distance
3. Chebyshev distance

Distance Metrics Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics

k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
train = []
test = []
### Insert your code
# Use the `result` to store the DataFrame
# euclidean
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# manhattan
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# chebyshev
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
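
One way to fill in each cell above (a sketch; the helper function name knn_errors and the exact column names are illustrative choices): loop over k_list, fit a classifier with the given metric, and record the error rates.

# Sketch: train/test error for each k under a given distance metric.
def knn_errors(metric):
    rows = []
    for k in k_list:
        clf = neighbors.KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        rows.append({"N/k": 100 / k,
                     "train error": 1 - clf.score(X_train, y_train),
                     "test error": 1 - clf.score(X_test, y_test)})
    return DataFrame(rows)

result = knn_errors("euclidean")  # likewise "manhattan" and "chebyshev"
result.plot(x="N/k", y=["train error", "test error"], ylabel="error")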

Part 2c Plot decision boundaries of K Nearest Neighbors

Use Scikit Learn’s DecisionBoundaryDisplay class to visualize the nearest neighbor boundaries as k is varied.

https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html#sklearn.inspection.DecisionBoundaryDisplay

k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]

Simplify the problem by using only the first 2 attributes of the dataset.

X2 = iris.data[:, :2]
### Insert your code here
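
A possible sketch (the 3×3 grid matches the nine values in k_list; the plot styling is arbitrary):

from sklearn.inspection import DecisionBoundaryDisplay

# Sketch: one decision-boundary panel per k, using the first two features only.
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for ax, k in zip(axes.ravel(), k_list):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X2, y)
    DecisionBoundaryDisplay.from_estimator(
        clf, X2, ax=ax, response_method="predict", alpha=0.5,
        xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k")
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()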