## Description

Problem 1. In this problem, you will predict tumor type from gene expression data. Since there are many more gene features than observations of patients, we will use ridge and LASSO regularization for logistic regression to reduce overfitting and help select the most relevant features out of a large group of features. This dataset has a multi-class outcome variable. The possible tumor types are BRCA, COAD, KIRC, LUAD, or PRAD. You will analyze this dataset by building a multinomial regression model with ℓ1 and ℓ2 regularization. The recommended approach is the glmnet package in R, which is covered in the code in class. You can check the “Multinomial Regression” section found at this link for specific information about multinomial regression in glmnet.
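
For orientation, the penalized objective that glmnet minimizes for a multinomial model can be written as below. This follows the form given in the glmnet documentation; the symbols N (number of patients), x_i (predictor vector for patient i), y_ic (1 if patient i has tumor type c, else 0), β_0c (intercept for class c), and α (mixing parameter) are notation introduced here, while p, C, and λ are as used in this problem.

$$
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\left[\sum_{c=1}^{C} y_{ic}\left(\beta_{0c} + x_i^{\top}\beta_c\right)
- \log\sum_{c'=1}^{C} e^{\beta_{0c'} + x_i^{\top}\beta_{c'}}\right]
+ \lambda\left[\frac{1-\alpha}{2}\sum_{j=1}^{p}\sum_{c=1}^{C}\beta_{jc}^{2}
+ \alpha\sum_{j=1}^{p}\sum_{c=1}^{C}\left|\beta_{jc}\right|\right]
$$

Setting α = 0 leaves only the ℓ2 (ridge) penalty and α = 1 leaves only the ℓ1 (LASSO) penalty.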

Note: This is quite a large dataset so models will take a minute or two to fit.

(a) Load the labels and data with read.csv. Remove any columns with missing entries. Remove any columns with variance less than 0.001. Standardize each gene predictor column to have mean 0 and standard deviation 1 (this is important when doing regularized regression). Split the dataset randomly into a training and validation set.

(b) Use ridge logistic regression with 10-fold cross validation to model the response given the gene expression predictors. What is your optimal value of the regularization parameter λ? Apply your model to give predictions using the optimal value of λ. Make confusion matrices showing the accuracy of your model on the training set and on the validation set.

(c) Use LASSO logistic regression with 10-fold cross validation to model the response given the gene expression predictors. What is your optimal value of the regularization parameter λ? Apply your model to give predictions using the optimal value of λ. Make confusion matrices showing the accuracy of your model on the training set and on the validation set.

(d) Give a list of the top 20 most relevant genes that are selected by your LASSO model at the optimal value of λ. The coefficients for a multinomial regression model form a p × C matrix, where C is the number of classes and p is the number of feature columns. What relation do your selected genes have to tumor expression? You can determine this by looking at which of the C coefficients associated with a given gene are non-zero. A positive value at a given index corresponds to a higher probability of the tumor type associated with that index, while a negative value corresponds to a lower probability.
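
The steps in parts (a) through (d) can be sketched end to end on synthetic data as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data are simulated rather than read with read.csv, the sizes, gene names, and the 80/20 split are made up, `lambda.min` is taken as the "optimal" λ (`lambda.1se` is the other value that cv.glmnet reports), and "top genes" are ranked by largest absolute coefficient across classes, which is one reasonable reading of "most relevant".

```r
library(glmnet)  # glmnet depends on Matrix, used for the sparse coefficient objects
set.seed(1)

# Synthetic stand-in for the expression data (sizes and gene names are hypothetical).
n <- 150; p <- 40
classes <- c("BRCA", "COAD", "KIRC")
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene_", seq_len(p))))
y <- factor(sample(classes, n, replace = TRUE), levels = classes)
for (k in seq_along(classes)) {            # make genes 1..3 informative, one per class
  X[y == classes[k], k] <- X[y == classes[k], k] + 3
}
X[1, 5] <- NA                              # a column with a missing entry
X[, 6]  <- 0.5                             # a zero-variance column

# (a) Drop columns with NAs, drop low-variance columns, standardize, split.
X <- X[, colSums(is.na(X)) == 0, drop = FALSE]
X <- X[, apply(X, 2, var) >= 0.001, drop = FALSE]
X <- scale(X)
train <- sample(n, size = round(0.8 * n))  # 80/20 split is an arbitrary choice

# (b), (c) 10-fold cross validation; alpha = 0 is ridge, alpha = 1 is LASSO.
fit_cv <- function(alpha) {
  cv.glmnet(X[train, ], y[train], family = "multinomial", alpha = alpha, nfolds = 10)
}
ridge <- fit_cv(0)
lasso <- fit_cv(1)

# Confusion matrix of predicted vs. actual class at lambda.min.
confusion <- function(fit, rows) {
  pred <- predict(fit, newx = X[rows, , drop = FALSE], s = "lambda.min", type = "class")
  table(predicted = factor(as.vector(pred), levels = classes), actual = y[rows])
}
conf_train <- confusion(lasso, train)
conf_valid <- confusion(lasso, -train)

# (d) Assemble the p x C coefficient matrix (intercept row dropped).
B <- sapply(coef(lasso, s = "lambda.min"), function(b) as.matrix(b)[-1, 1])
rownames(B) <- colnames(X)
# Rank genes by largest absolute coefficient across classes (an assumed criterion),
# keeping at most 20 and only those LASSO actually selected.
score <- apply(abs(B), 1, max)
top <- head(order(score, decreasing = TRUE), 20)
top_genes <- B[top[score[top] > 0], , drop = FALSE]
```

Printing `ridge$lambda.min`, `lasso$lambda.min`, `conf_train`, `conf_valid`, and `top_genes` gives the quantities the parts ask for. Reading across a row of `top_genes` shows which tumor types a gene raises (positive entries) or lowers (negative entries) the probability of.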

Note: If columns are highly correlated, LASSO will often arbitrarily select a single column, so a full report of relevant genes would involve both the predictors selected by LASSO and the genes that are highly correlated with them. Other techniques like group LASSO can select subsets of related genes, but these will not be covered in this class. You could also try to combine the ℓ1 and ℓ2 penalties to get representation of meaningful predictors that are also correlated (see the reference material above).
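
As a rough sketch of the two ideas in this note, the example below fits an elastic net (a mix of the ℓ1 and ℓ2 penalties, set through the `alpha` argument of glmnet) and then lists the genes that are highly correlated with each selected gene. The data are simulated, and the values `alpha = 0.5` and the 0.9 correlation cutoff are arbitrary assumptions chosen for illustration.

```r
library(glmnet)
set.seed(2)

# Simulated data with one deliberately correlated pair (gene_1, gene_2).
n <- 120; p <- 10
z <- rnorm(n)
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene_", seq_len(p))))
X[, 1] <- z + rnorm(n, sd = 0.05)
X[, 2] <- z + rnorm(n, sd = 0.05)
y <- factor(ifelse(z > 0.5, "A", ifelse(z < -0.5, "B", "C")))  # hypothetical labels
X <- scale(X)

# Elastic net: 0 < alpha < 1 mixes the ridge and LASSO penalties.
enet <- cv.glmnet(X, y, family = "multinomial", alpha = 0.5, nfolds = 10)

# Genes with a non-zero coefficient for at least one class.
B <- sapply(coef(enet, s = "lambda.min"), function(b) as.matrix(b)[-1, 1])
selected <- colnames(X)[rowSums(B != 0) > 0]

# For each selected gene, list other genes with |correlation| above the cutoff.
R <- abs(cor(X))
diag(R) <- 0
partners <- lapply(setNames(selected, selected), function(g) names(which(R[g, ] > 0.9)))
```

The same correlation lookup can be applied to genes selected by a pure LASSO fit (`alpha = 1`) to recover correlated genes that LASSO left out.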
