Description
1. Feature engineering (one-hot encoding and data imputation)
1a. Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.
- Retain only these columns: Survived, Pclass, Sex, Age, SibSp, Parch.
- Display the first 7 rows.
These data are described at https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the “Data Dictionary”); that page is also their source.
- Read that “Data Dictionary” paragraph (with your eyes, not Python) so you understand what each column represents.
(We used these data before in HW02:
- There we used `df.dropna()` to drop any observations with missing values; here we use data imputation instead.
- There we manually did one-hot encoding of the categorical `Sex` column by making a `Female` column; here we do the same one-hot encoding with the help of pandas’s `df.join(pd.get_dummies())`.
- There we used a decision tree; here we use 𝑘-NN.
We evaluate how these strategies can improve model performance by allowing us to use columns with categorical or missing data.)
# ... your code here ...
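For reference, a minimal sketch of one possible solution (assuming pandas; the name `df` is illustrative, not prescribed):

```python
import pandas as pd

# Read the training data and keep only the required columns.
df = pd.read_csv('http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']]
df.head(7)  # display the first 7 rows
```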
1b. Try to train a 𝑘NN model to predict 𝑦 = ‘Survived’ from 𝑋 = these features: ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’.
- Use 𝑘 = 3 and the (default) Euclidean metric.
- Notice at the bottom of the error message that it fails with “ValueError: could not convert string to float: ‘male’”.
- Comment out your .fit() line so the cell can run without error.
# ... your code here ...
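One way the failing attempt might look (a sketch assuming scikit-learn; `df` is the DataFrame from 1a):

```python
from sklearn.neighbors import KNeighborsClassifier

X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']]
y = df['Survived']
knn = KNeighborsClassifier(n_neighbors=3)  # k=3, default Euclidean metric
# knn.fit(X, y)  # raises ValueError: could not convert string to float: 'male'
```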
1c. Try to train again, this time without the ‘Sex’ feature.
- Notice that it fails because “Input contains NaN”.
- Comment out your .fit() line so the cell can run without error.
- Run `X.isna().any()` (where `X` is the name of your DataFrame of features) to see that the ‘Age’ feature has missing values. (You can see the first missing value in the sixth row that you displayed above.)
# ... your code here ...
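A sketch of 1c under the same assumptions:

```python
X = df[['Pclass', 'Age', 'SibSp', 'Parch']]  # drop 'Sex' this time
knn = KNeighborsClassifier(n_neighbors=3)
# knn.fit(X, y)  # raises ValueError: Input contains NaN
X.isna().any()  # only 'Age' should show True
```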
1d. Train without the ‘Sex’ and ‘Age’ features.
- Report accuracy on the training data with a line of the form `Accuracy on training data is 0.500` (0.500 may not be correct).
# ... your code here ...
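A sketch of 1d (dropping both problematic features lets `.fit()` succeed):

```python
X = df[['Pclass', 'SibSp', 'Parch']]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(f'Accuracy on training data is {knn.score(X, y):.3f}')
```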
1e. Use one-hot encoding to include a binary ‘male’ feature made from the ‘Sex’ feature. (Or include a binary ‘female’ feature, according to your preference. Using both is unnecessary, since either is the logical negation of the other.) That is, train on these features: ‘Pclass’, ‘SibSp’, ‘Parch’, ‘male’.
- Use pandas’s `df.join(pd.get_dummies())`.
- Report training accuracy as before.
# ... your code here ...
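A sketch of 1e. Note that `pd.get_dummies()` names the new columns after the category values, so producing `female` and `male` columns is an assumption about the data’s values:

```python
df = df.join(pd.get_dummies(df['Sex']))  # adds binary 'female' and 'male' columns
X = df[['Pclass', 'SibSp', 'Parch', 'male']]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(f'Accuracy on training data is {knn.score(X, y):.3f}')
```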
1f. Use data imputation to include an ‘age’ feature made from ‘Age’ by replacing each missing value with the median of the non-missing ages. That is, train on these features: ‘Pclass’, ‘SibSp’, ‘Parch’, ‘male’, ‘age’.
- Report training accuracy as before.
# ... your code here ...
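A sketch of 1f, imputing the median age into a new lowercase `age` column:

```python
df['age'] = df['Age'].fillna(df['Age'].median())  # median imputation
X = df[['Pclass', 'SibSp', 'Parch', 'male', 'age']]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(f'Accuracy on training data is {knn.score(X, y):.3f}')
```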
2. Explore model fit, overfitting, and regularization in the context of multiple linear regression
2a. Prepare the data:
- Read http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv into a DataFrame.
- Set a variable `X` to the subset consisting of all columns except `mpg`.
- Set a variable `y` to the `mpg` column.
- Use `train_test_split()` to split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`.
  - Reserve half the data for training and half for testing.
  - Use `random_state=0` to get reproducible results.
# ... your code here ...
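A sketch of 2a (assuming scikit-learn’s `train_test_split()`; whether the CSV’s first column holds car names is an assumption you should check against the file):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# If the first column holds car names, index_col=0 keeps them out of X.
df = pd.read_csv('http://www.stat.wisc.edu/~jgillett/451/data/mtcars.csv', index_col=0)
X = df.drop(columns='mpg')
y = df['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)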
2b. Train three models on the training data and evaluate each on the test data:
- `LinearRegression()`
- `Lasso()`
- `Ridge()`
The evaluation consists of displaying MSE_train, MSE_test, and the coefficients 𝑤 for each model.
# ... your code here ...
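A sketch of 2b, using the default hyperparameters for `Lasso()` and `Ridge()`:

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

for model in (LinearRegression(), Lasso(), Ridge()):
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    print(f'{type(model).__name__}: MSE_train={mse_train:.2f}, '
          f'MSE_test={mse_test:.2f}, w={model.coef_}')
```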
2c. Answer a few questions about the models:
- Which one best fits the training data?
- Which one best fits the test data?
- Which one does feature selection by setting most coefficients to zero?
# ... your answers here in a markdown cell ...