Description
For this assignment you will perform exploratory data analysis to visualize Fisher’s Iris dataset using Scikit Learn. You will also explore the bias/variance trade-off by applying k-nearest neighbors classification to the Iris dataset while varying the hyperparameter k.
Documentation for Scikit Learn:
- The top level documentation page is here: https://scikit-learn.org/stable/index.html
- The API for KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
- The User Guide for KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/neighbors.html#classification
- Scikit Learn provides many Jupyter notebook examples of how to use the toolkit. These Jupyter notebook examples can be run on MyBinder: https://scikit-learn.org/stable/auto_examples/index.html
For more information about the Iris dataset, see this page https://en.wikipedia.org/wiki/Iris_flower_data_set.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from pandas import DataFrame
Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
print("Number of instances in the iris dataset:", X.shape[0])
print("Number of features in the iris dataset:", X.shape[1])
print("The dimension of the data matrix X is", X.shape)
X
The y vector has length 150. It has three unique values: 0, 1, and 2. Each value represents a species of iris flower.
y
y.shape
dir(iris)
iris.target_names
iris.feature_names
Extension: Show a summary table of the iris data including the min, max, median, and quartiles.
# Insert your answer here
# Build a DataFrame from the data matrix so describe() can be called on it
iris_df = DataFrame(X, columns=iris.feature_names)
iris_df.describe()
Part 1 Exploratory Data Analysis
Part 1a
Generate scatter plots using each pair of the attributes as axes. You should generate $6 = \binom{4}{2}$ scatter plots.
## Insert your answer here...
import matplotlib.pyplot as plt
import seaborn as sns
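A minimal sketch of one possible approach (not the only valid one), assuming X, y, and iris are loaded as in the cells above:
# Sketch: one scatter plot per pair of features, points colored by species.
from itertools import combinations
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (i, j) in zip(axes.ravel(), combinations(range(X.shape[1]), 2)):
    for label, name in enumerate(iris.target_names):
        ax.scatter(X[y == label, i], X[y == label, j], label=name, alpha=0.7)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
axes[0, 0].legend()
plt.tight_layout()
plt.show()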
Extension: Draw a boxplot of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Use color to show the different target classes.
Some links to help you:
https://seaborn.pydata.org/generated/seaborn.boxplot.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html
# Insert your code ...
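A possible sketch for the boxplot extension, assuming the data are reshaped into a long-form DataFrame (the column names 'species', 'feature', and 'cm' below are illustrative, not required):
# Sketch: melt the measurements into long form so seaborn can draw one box
# per feature, colored by species. Assumes X, y, and iris are loaded above.
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

box_df = DataFrame(X, columns=iris.feature_names)
box_df['species'] = [iris.target_names[label] for label in y]
long_df = box_df.melt(id_vars='species', var_name='feature', value_name='cm')
sns.boxplot(data=long_df, x='feature', y='cm', hue='species')
plt.xticks(rotation=15)
plt.show()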
Part 1b
If you were to draw linear decision boundaries to separate the classes, which scatter plot do you think will have the least error and which the most?
Insert your 1b answer here
…
Part 1c
Scatter plots using two attributes of the data are equivalent to projecting the four-dimensional data down to two dimensions with an axis-parallel projection. Principal component analysis (PCA) is a technique for linearly projecting the data to lower dimensions that are not necessarily axis-parallel. Use PCA to project the data down to two dimensions.
Documentation for PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
### Insert your code here
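One possible sketch using sklearn.decomposition.PCA, fit on the full data matrix X (variable names are illustrative):
# Sketch: project the 4-dimensional data onto the first two principal
# components and scatter-plot the projection, colored by class.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name, alpha=0.7)
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.legend()
plt.show()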
In the case of the Iris dataset, does PCA do a better job of separating the classes?
Insert your answer
…
Part 2 K Nearest Neighbors
Split the dataset into a train set and a test set. Use 67 percent of the dataset for training and 33 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
print("Number of instances in the train set:", X_train.shape[0])
print("Number of instances in the test set:", X_test.shape[0])
Part 2a Create a KNeighborsClassifier with n_neighbors = 5, and train the classifier using the train set.
### Insert your answer here
print("Using", '____', "neighbors:")
print("The train accuracy score is:", '______')
print("The test accuracy score is :", '______')
Part 2b Tuning hyperparameter k
As we have seen in class, the hyperparameter k of the K Nearest Neighbors classifier affects the inductive bias. For this part, train multiple nearest neighbor classifier models and store the results in a DataFrame. Then plot training error and testing error versus N/k, where N = 100 is the number of training instances.
Extension: Use different distance metrics for kNN classification.
- 1) Euclidean distance
- 2) Manhattan distance
- 3) Chebyshev distance
Distance Metrics Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
train = []
test = []
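A possible structure for each of the cells below (a sketch, assuming error = 1 - accuracy and N = size of the training set; swap the metric argument for 'manhattan' or 'chebyshev' in the later cells):
# Sketch: one model per k, errors on both splits collected into a DataFrame
# named `result` so the plot calls below work. metric='euclidean' here.
N = X_train.shape[0]
rows = []
for k in k_list:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    knn.fit(X_train, y_train)
    rows.append({'N/k': N / k,
                 'train error': 1 - knn.score(X_train, y_train),
                 'test error': 1 - knn.score(X_test, y_test)})
result = DataFrame(rows)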
### Insert your code
# Use the `result` to store the DataFrame
# euclidean
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# manhattan
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
### Insert your code
# Use the `result` to store the DataFrame
# chebyshev
result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')
Part 2c Plot decision boundaries of K Nearest Neighbors
Use Scikit Learn’s DecisionBoundaryDisplay class to visualize the nearest neighbor boundaries as k is varied.
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
Simplify the problem by using only the first 2 attributes of the dataset
X2 = iris.data[:, :2]
### Insert your code here
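A possible sketch using DecisionBoundaryDisplay (available in scikit-learn 1.1 and later), with one subplot per value of k:
# Sketch: decision regions of a k-nearest-neighbor classifier fitted on the
# first two features, one panel per value of k in k_list.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
for ax, k in zip(axes.ravel(), k_list):
    knn2 = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X2, y)
    DecisionBoundaryDisplay.from_estimator(
        knn2, X2, ax=ax, response_method='predict', alpha=0.5,
        xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor='k')
    ax.set_title(f"k = {k}")
plt.tight_layout()
plt.show()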