Description
For this assignment you will perform exploratory data analysis to visualize Fisher’s Iris dataset using Scikit Learn. You will also explore the bias/variance trade-off by applying k-nearest neighbors classification to the Iris dataset while varying the hyperparameter k.
Documentation for Scikit Learn:
- The top level documentation page is here: https://scikit-learn.org/stable/index.html
- The API for KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
- The User Guide for KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/neighbors.html#classification
- Scikit Learn provides many Jupyter notebook examples of how to use the toolkit. These Jupyter notebook examples can be run on MyBinder: https://scikit-learn.org/stable/auto_examples/index.html
For more information about the Iris dataset, see this page https://en.wikipedia.org/wiki/Iris_flower_data_set.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from pandas import DataFrame
Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
print("Number of instances in the iris dataset:", X.shape[0])
print("Number of features in the iris dataset:", X.shape[1])
print("The dimension of the data matrix X is", X.shape)
X
The y vector has length 150. It has three unique values: 0, 1, and 2. Each value represents a species of iris flower.
y
y.shape
dir(iris)
iris.target_names
iris.feature_names
Extension: Show a summary table of the iris data including the min, max, median, and quartiles.
# Insert your answer here
# Build a DataFrame from the data matrix so describe() can be called on it
iris_df = DataFrame(X, columns=iris.feature_names)
iris_df.describe()
Part 1 Exploratory Data Analysis
Part 1a
Generate scatter plots using each pair of the attributes as axes. You should generate $6 = \binom{4}{2}$ scatter plots.
## Insert your answer here...
import matplotlib.pyplot as plt
import seaborn as sns
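A minimal sketch of one possible approach (not the only valid one), assuming X, y, and iris are loaded as in the cells above:
# Sketch: one scatter plot per pair of features, points colored by species.
from itertools import combinations
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (i, j) in zip(axes.ravel(), combinations(range(X.shape[1]), 2)):
    for label, name in enumerate(iris.target_names):
        ax.scatter(X[y == label, i], X[y == label, j], label=name, alpha=0.7)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
axes[0, 0].legend()
plt.tight_layout()
plt.show()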
Extension: Draw a boxplot of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Use color to show the different target classes.
Some links to help you:
https://seaborn.pydata.org/generated/seaborn.boxplot.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html
# Insert your code ...
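A possible sketch for the boxplot extension, assuming the data are reshaped into a long-form DataFrame (the column names 'species', 'feature', and 'cm' below are illustrative, not required):
# Sketch: melt the measurements into long form so seaborn can draw one box
# per feature, colored by species. Assumes X, y, and iris are loaded above.
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

box_df = DataFrame(X, columns=iris.feature_names)
box_df['species'] = [iris.target_names[label] for label in y]
long_df = box_df.melt(id_vars='species', var_name='feature', value_name='cm')
sns.boxplot(data=long_df, x='feature', y='cm', hue='species')
plt.xticks(rotation=15)
plt.show()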
Part 1b
If you were to draw linear decision boundaries to separate the classes, which scatter plot do you think will have the least error and which the most?
Insert your 1b answer here
…
Part 1c
Scatter plots using two attributes of the data are equivalent to projecting the four-dimensional data down to two dimensions with an axis-parallel projection. Principal component analysis (PCA) is a technique for linearly projecting the data to lower dimensions that are not necessarily axis-parallel. Use PCA to project the data down to two dimensions.
Documentation for PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
### Insert your code here
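One possible sketch using sklearn.decomposition.PCA, fit on the full data matrix X (variable names are illustrative):
# Sketch: project the 4-dimensional data onto the first two principal
# components and scatter-plot the projection, colored by class.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name, alpha=0.7)
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.legend()
plt.show()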
In the case of the Iris dataset, does PCA do a better job of separating the classes?
Insert your answer
…
Part 2 K Nearest Neighbors
Split the dataset into a train set and a test set. Use 67 percent of the dataset for training and 33 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
print("Number of instances in the train set:", X_train.shape[0])
print("Number of instances in the test set:", X_test.shape[0])
Part 2a Create a KNeighborsClassifier with n_neighbors = 5, and train the classifier using the train set.
### Insert your answer here
print("Using", '____', "neighbors:")
print("The train accuracy score is:", '______')
print("The test accuracy score is :", '______')
Part 2b Tuning hyperparameter k
As we have seen in class, the hyperparameter k of the K Nearest Neighbors classifier affects the inductive bias. For this part, train multiple nearest neighbor classifier models and store the results in a DataFrame. Then plot training error and testing error versus N/k, where N = 100 is the number of training instances.
Extension: Use different distance metrics for kNN classification.
- 1) Euclidean distance
- 2) Manhattan distance
- 3) Chebyshev distance
Distance Metrics Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
train = []
test = []
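A possible structure for each of the cells below (a sketch, assuming error = 1 - accuracy and N = size of the training set; swap the metric argument for 'manhattan' or 'chebyshev' in the later cells):
# Sketch: one model per k, errors on both splits collected into a DataFrame
# named `result` so the plot calls below work. metric='euclidean' here.
N = X_train.shape[0]
rows = []
for k in k_list:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    rows.append({'N/k': N / k,
                 'train error': 1 - knn.score(X_train, y_train),
                 'test error': 1 - knn.score(X_test, y_test)})
result = DataFrame(rows)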
### Insert your code
# Use the `result` to store the DataFrame
# euclidean
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# manhattan
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# chebyshev
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
Part 2c Plot decision boundaries of K Nearest Neighbors
Use Scikit Learn’s DecisionBoundaryDisplay class to visualize the nearest neighbor boundaries as k is varied.
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
Simplify the problem by using only the first 2 attributes of the dataset
X2 = iris.data[:, :2]
### Insert your code here
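A possible sketch using DecisionBoundaryDisplay (available in scikit-learn 1.1 and later), with one subplot per value of k:
# Sketch: decision regions of a k-nearest-neighbor classifier fitted on the
# first two features, one panel per value of k in k_list.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for ax, k in zip(axes.ravel(), k_list):
    knn2 = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X2, y)
    DecisionBoundaryDisplay.from_estimator(
        knn2, X2, ax=ax, response_method='predict', alpha=0.5,
        xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor='k')
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()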