## Description

1. For the next exercise, you are going to use the “airline_costs.csv” dataset.

The dataset has the following attributes:

i. Airline name

ii. Length of flight in miles

iii. Speed of plane in miles per hour

iv. Daily flight time per plane in hours

v. Customers served in 1000s

vi. Total operating cost in cents per revenue ton-mile

vii. Revenue in tons per aircraft mile

viii. Ton-mile load factor

ix. Available capacity

x. Total assets in $100,000s

xi. Investments and special funds in $100,000s

xii. Adjusted assets in $100,000s

(Implement this exercise in Python language; import ‘pandas’, ‘sklearn.linear_model

import LinearRegression’ libraries)

1.1) Use a linear regression model to predict the number of customers each airline serves from

its length of flight and daily flight time per plane (this will be referred to as model 1

throughout this question).

(Note: save each of these objects as different variable names as it will make question 1.7

easier for you) (10 points)

1.2) What is the Root Mean Squared Error (RMSE) of this model? (Hint: import ‘from

sklearn.metrics import mean_squared_error’ and numpy’s sqrt function to help solve for

this) (10 points)

1.3) Now repeat exercises 1.1 and 1.2, but first split the data into train (80%) and test (20%)

datasets and find the RMSE of the test set (this will be referred to as model 2 throughout

this question) (10 points)

(Hint: import ‘from sklearn.model_selection import train_test_split’ to help solve this)

1.4) Now find the RMSE of the train set. (5 points)

1.5) What do you notice about the difference between the RMSE on the entire dataset (model

1), the RMSE on the 20% test/holdout set (model 2), and the RMSE on the 80% train set

(model 2)? Why do you think this is? (10 points)

1.6) Build another regression model to predict the total assets of an airline from the customers

served by the airline using a 75%/25% train-test dataset split. Evaluate the RMSE of this

model as well (this will be referred to as model 3 throughout this question).

(Note: your predictor variables must a DataFrame, not a Series to use sklearn’s linear

model) (10 points)

1.7) What are the coefficients of the 3 models? (look up in the sklearn documentation on how

to find this) (10 points)

1.8) What do you notice about these coefficients? Research what linear regression coefficients

mean if you are not sure. (5 points)

2. For this clustering exercise, you are going to use the data on women professional golfers’

performance on the LPGA, 2008 tour (“lpga2008.csv” dataset). The dataset has the

following attributes:

Golfer: name of the player

• Average Drive distance

• Fairway Percentage

• Greens in regulation: in percentage

• Average putts per round

• Sand attempts per round

• Sand saves: in percentage

• Total Winnings per round

• Log: Calculated as (Total Win/Round)

• Total Rounds

• Id: Unique ID representing each player (10 points)

2.1) Use agglomerative clustering on this dataset to find out which players have similar

performance in the same season. To do this, perform the following:

• First, remove the columns ‘Id’ and ‘Golfer’ from the dataset

• Normalize the data using ‘from sklearn.preprocessing import StandardScalar’ and the

method ‘fit_transform()’

• Save this result into a dataframe

• Next, use ‘import scipy.cluster.hierarchy as shc’ and ‘import matplotlib.pyplot as plt’ to

visualize the a dendrogram of this data

• Use the ‘sch.linkage()’ method with the linkage as ward and the metric as Euclidean to

create the clusters

• Then use the ‘sch.denogram()’ method and ‘plt.show()’ to visualize the denogram

• Once we’ve plotted this denogram, we see that a good number of clusters is 4.

• Use ‘from sklearn.cluster import AgglomerativeClustering’ and implement a model that

has 4 clusters, linkage as ward, and the metric as Euclidean

• Print the cluster labels for this model on our normalized dataset (9 points)

2.2) What is the difference between agglomerative clustering and divisive clustering? (1 point)