## Description

For this assignment you will perform exploratory data analysis to visualize Fisher’s Iris dataset using Scikit Learn, and you will explore the bias/variance trade-off by applying k-nearest neighbors classification to the Iris dataset while varying the hyperparameter k.

Documentation for Scikit Learn:

- The top level documentation page is here: https://scikit-learn.org/stable/index.html
- The API for the KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
- The User Guide for the KNeighborsClassifier is here: https://scikit-learn.org/stable/modules/neighbors.html#classification
- Scikit Learn provides many Jupyter notebook examples on how to use the toolkit. These examples can be run on MyBinder: https://scikit-learn.org/stable/auto_examples/index.html

For more information about the Iris dataset, see this page https://en.wikipedia.org/wiki/Iris_flower_data_set.

```
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from pandas import DataFrame
```

Load the Iris dataset:

```
iris = datasets.load_iris()
X = iris.data
y = iris.target
```

```
print("Number of instances in the iris dataset:", X.shape[0])
print("Number of features in the iris dataset:", X.shape[1])
print("The dimension of the data matrix X is", X.shape)
```

`X`

The `y` vector has length 150. It has three unique values: 0, 1, and 2; each value represents a species of iris flower.

`y`

`y.shape`

`dir(iris)`

`iris.target_names`

`iris.feature_names`

### Extension: Show the summary table of the iris data, including min, max, median, and quantiles

`# Insert your answer here`

`iris_df.describe()`
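The notebook never defines `iris_df`, so the answer cell needs to build it. A minimal sketch, using the column names from `iris.feature_names` and the `DataFrame` imported from pandas above (the name `iris_df` matches the cell above):

```
# Build a DataFrame with one column per feature, named after
# iris.feature_names, so describe() reports per-feature statistics.
iris_df = DataFrame(X, columns=iris.feature_names)
```

`iris_df.describe()` then reports count, mean, std, min, max, and the 25%/50%/75% quantiles (the 50% row is the median) for each feature.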

## Part 1 Exploratory Data Analysis

### Part 1a

Generate scatter plots using each pair of the attributes as axes. You should generate $6 = \binom{4}{2}$ scatter plots.

`## Insert your answer here...`

```
import matplotlib.pyplot as plt
import seaborn as sns
```
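A minimal sketch of the six plots, using `itertools.combinations` to enumerate the attribute pairs and coloring points by class (the 2×3 grid layout is an arbitrary choice):

```
from itertools import combinations

# One subplot per unordered pair of the four features: C(4, 2) = 6 plots.
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (i, j) in zip(axes.ravel(), combinations(range(4), 2)):
    ax.scatter(X[:, i], X[:, j], c=y)  # color each point by its class label
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
fig.tight_layout()
plt.show()
```

If the `iris_df` from the earlier extension is available, `sns.pairplot(iris_df.assign(target=y), hue='target')` produces the full scatter-plot matrix in one call.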

### Extension: Draw a boxplot of sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Use color to show the different target classes.

Some links to help you:

https://seaborn.pydata.org/generated/seaborn.boxplot.html

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html

`# Insert your code ...`
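A sketch with seaborn, assuming the `iris_df` built in the earlier extension; attaching the class labels and melting to long form lets `hue` carry the target class:

```
# Attach the class labels, then reshape to long form: one row per
# (flower, feature) measurement so hue can drive the box colors.
iris_long = iris_df.assign(target=y).melt(
    id_vars='target', var_name='feature', value_name='cm')
sns.boxplot(data=iris_long, x='feature', y='cm', hue='target')
plt.xticks(rotation=30)
plt.show()
```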

### Part 1b

If you were to draw linear decision boundaries to separate the classes, which scatter plot do you think will have the least error and which the most?

### Insert your 1b answer here

…

### Part 1c

Scatter plots using two attributes of the data are equivalent to projecting the four-dimensional data down to two dimensions using an axis-parallel projection. Principal component analysis (PCA) is a technique that linearly projects the data to lower dimensions that are not necessarily axis-parallel. Use PCA to project the data down to two dimensions.

Documentation for PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

`### Insert your code here`
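A minimal sketch of the projection, using `sklearn.decomposition.PCA`:

```
from sklearn.decomposition import PCA

# Fit PCA on all four features and keep the two leading components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Scatter plot of the data in the projected space, colored by class.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```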

### In the case of the Iris dataset, does PCA do a better job of separating the classes?

### Insert your answer

…

## Part 2 K Nearest Neighbors

Split the dataset into a train set and a test set. Use 67 percent of the dataset for training and 33 percent for testing.

```
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```

```
print("Number of instances in the train set:", X_train.shape[0])
print("Number of instances in the test set:", X_test.shape[0])
```

### Part 2a Create a KNeighborsClassifier with `n_neighbors = 5`

Train the classifier using the train set.

`### Insert your answer here`

```
print("Using", '____', "neighbors:")
print("The train accuracy score is:", '______')
print("The test accuracy score is :", '______')
```
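A sketch of what this cell could contain, using `KNeighborsClassifier` and its `score` method (which returns mean accuracy):

```
# Fit a 5-nearest-neighbor classifier on the training split.
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Using", knn.n_neighbors, "neighbors:")
print("The train accuracy score is:", knn.score(X_train, y_train))
print("The test accuracy score is :", knn.score(X_test, y_test))
```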

### Part 2b Tuning hyperparameter k

As we have seen in class, the hyperparameter k of the K Nearest Neighbors classifier affects the inductive bias. For this part, train multiple nearest neighbor classifier models and store the results in a DataFrame. Then plot the training error and testing error versus N/k, where N = 100.

### Extension: Use different distance metrics for kNN classification.

- Euclidean distance
- Manhattan distance
- Chebyshev distance

Distance Metrics Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics

```
k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]
train = []
test = []
```

```
### Insert your code
# Use the `result` to store the DataFrame
# euclidean
```
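One possible way to fill in this cell is sketched below. The helper name `knn_error_table` is hypothetical; it assumes error = 1 − accuracy and the column names that the plotting cell expects. The Manhattan and Chebyshev cells below can reuse it by changing only the `metric` argument.

```
# Hypothetical helper: sweep k_list for one distance metric and return
# a DataFrame with the columns the plotting cell below expects.
def knn_error_table(metric):
    rows = []
    for k in k_list:
        knn = neighbors.KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train, y_train)
        rows.append({'N/k': 100 / k,  # N = 100 training instances
                     'train error': 1 - knn.score(X_train, y_train),
                     'test error': 1 - knn.score(X_test, y_test)})
    return DataFrame(rows)

result = knn_error_table('euclidean')
```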

`result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')`

```
### Insert your code
# Use the `result` to store the DataFrame
# manhattan
```
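Reusing the hypothetical helper from the Euclidean sketch above:

```
result = knn_error_table('manhattan')
```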

`result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')`

```
### Insert your code
# Use the `result` to store the DataFrame
# chebyshev
```
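And likewise for the Chebyshev metric:

```
result = knn_error_table('chebyshev')
```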

`result.plot(x='N/k', y=['train error', 'test error'], ylabel='error')`

### Part 2c Plot decision boundaries of K Nearest Neighbors

Use Scikit Learn’s DecisionBoundaryDisplay class to visualize the nearest neighbor boundaries as k is varied.

`k_list = [1, 3, 5, 7, 9, 11, 13, 15, 50]`

Simplify the problem by using only the first 2 attributes of the dataset:

`X2 = iris.data[:, :2]`

`### Insert your code here`