CSE 598 Linear Regression & Binary Classification


Purpose
The purpose of this project is to strengthen your understanding of linear and logistic regression by implementing the models yourself. In the first part, you will implement the objective function for linear regression and polynomial regression, and use ridge regression to obtain the best possible fit. In the second part, you will implement a logistic regression model to predict whether a subject is diabetic by classifying subjects with a binary classification model.
Objectives
Learners will be able to
● Understand the core principles of linear regression, including estimating the objective function and exploring various techniques for solving linear regression
● Understand logistic neurons, outline the logistic regression objective function, and learn about binary classification and gradient descent for solving logistic regression
Technology Requirements
● GPU environment (optional)
● Jupyter Notebook
● Python 3 (version 3.8 or above)
● NumPy
● Matplotlib
Directions
Accessing ZyLabs
You will complete and submit your work through zyBooks’s zyLabs. Follow the directions to correctly
access the provided workspace:
1. Go to the course submission page “Submission: Linear Regression & Binary
Classification”
2. Click the “Load Submission…in new window” button.
3. Once in zyLabs, click the green button in the Jupyter Notebook to get started.
4. Review the directions and resources provided in the description.
5. When ready, review the provided code and develop your work where instructed.
Project Directions
Part 1 – Linear Regression
You need to define the objective function that will be optimized by the linear regression model.
L(\Phi(X), Y, \theta) = (Y - \Phi(X)\theta)^T (Y - \Phi(X)\theta)

Here, Φ(X) is the design matrix of dimensions (m × (d+1)), Y is the m-dimensional vector of labels, and θ is the (d+1)-dimensional vector of weight parameters.
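As a sanity check, here is a minimal NumPy sketch of this objective; the function name lin_reg_objective is illustrative, not a required signature.

import numpy as np

def lin_reg_objective(Phi, Y, theta):
    # Residual between the ground truth labels and the model predictions.
    residual = Y - Phi @ theta
    # Sum of squared errors: (Y - Phi theta)^T (Y - Phi theta).
    return float(np.sum(residual ** 2))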
Define a closed-form solution to the objective function. The closed-form solution is given by

\theta = (\Phi(X)^T \Phi(X))^{-1} \Phi(X)^T Y

Here, Φ(X) is the (m × (d+1))-dimensional design matrix obtained using the poly_transform function defined earlier, and Y holds the ground truth labels of dimensions (m × 1).
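A minimal sketch of the closed-form fit, assuming Phi was built by your poly_transform function; np.linalg.solve is used rather than an explicit matrix inverse for numerical stability.

import numpy as np

def fit_linear_regression(Phi, Y):
    # Solve the normal equations (Phi^T Phi) theta = Phi^T Y for theta.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)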
Define a function to evaluate the goodness of the linear regression model using root mean squared error (RMSE). It compares the estimated Y-labels against the ground truth Y-labels; the smaller the RMSE value, the better the fit.

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_{\mathrm{pred}}^{(i)} - y^{(i)} \right)^2}

Here, y_pred holds the estimated labels of dimensions (m, 1) and y holds the ground truth labels of dimensions (m, 1).
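A minimal RMSE sketch, assuming y_pred and y are NumPy arrays of matching shape:

import numpy as np

def rmse(y_pred, y):
    # Root mean squared error between predictions and ground truth.
    return float(np.sqrt(np.mean((y_pred - y) ** 2)))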
Ridge Regression
Similar to linear regression, implement the closed-form solution to ridge regression.
The degree of the polynomial regression is d = 10. Even though the curve appears to be smooth, it may be fitting to the noise. Hence, you need to use ridge regression to get a smoother fit and avoid overfitting.
Ridge regression objective form:
L(\Phi(X), Y, \theta, \lambda) = (Y - \Phi(X)\theta)^T (Y - \Phi(X)\theta) + \frac{\lambda}{2} \theta^T \theta
where λ ≥ 0 is the regularization parameter. The larger the value of λ, the smoother the curve. The closed-form solution to the objective is given by:
\theta = \left( \Phi(X)^T \Phi(X) + \frac{\lambda}{2} I_d \right)^{-1} \Phi(X)^T Y

Here, I_d is the identity matrix of dimensions ((d+1) × (d+1)), Φ(X) is the (m × (d+1))-dimensional design matrix obtained using the poly_transform function (from linear regression), and Y holds the ground truth labels of dimensions (m × 1).
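A minimal sketch of the ridge closed-form solution under the same assumptions as above; note that the λ/2 factor matches the objective as written here.

import numpy as np

def fit_ridge_regression(Phi, Y, lam):
    # Identity of size (d+1) x (d+1), matching the number of columns of Phi.
    I = np.eye(Phi.shape[1])
    # Solve (Phi^T Phi + (lam/2) I) theta = Phi^T Y for theta.
    return np.linalg.solve(Phi.T @ Phi + (lam / 2.0) * I, Phi.T @ Y)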
Cross-Validation to Estimate λ
To avoid overfitting when using a high-degree polynomial, we have used ridge regression. We now
need to estimate the optimal value of λ using cross-validation.
We will obtain a generic value of λ by validating over the entire training dataset. We will employ the method of k-fold cross-validation, where we split the training data into k non-overlapping random subsets. In every cycle, for a given value of λ, k−1 subsets are used for training the ridge regression model and the remaining subset is used for evaluating the goodness of the fit. We estimate the average goodness of fit across all the subsets and select the λ that results in the best fit.
It is easier to shuffle the indices and slice the training data into the required number of segments than to process the complete dataset.
Refer to the following documentation for splitting and shuffling:
● https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.shuffle.html
● https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html
You need to define a function implementing k-fold cross-validation (see the sketch after this list) that takes the following inputs:
● k_fold: number of validation subsets
● train_X: training data of dimensions (m, 1)
● train_Y: ground truth training labels
● lambda: ridge regularization parameter λ
● d: polynomial degree
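One possible way to structure the function is sketched below. It assumes the poly_transform function from earlier plus the fit_ridge_regression and rmse helpers sketched above; all names are illustrative, and np.array_split is used so the folds need not be of equal size.

import numpy as np

def k_fold_cross_validation(k_fold, train_X, train_Y, lam, d):
    m = train_X.shape[0]
    # Shuffle the indices, then slice them into k non-overlapping subsets.
    idx = np.arange(m)
    np.random.shuffle(idx)
    folds = np.array_split(idx, k_fold)
    scores = []
    for i in range(k_fold):
        # Hold out fold i for validation; train on the remaining k-1 folds.
        val_idx = folds[i]
        trn_idx = np.concatenate([folds[j] for j in range(k_fold) if j != i])
        Phi_trn = poly_transform(train_X[trn_idx], d)
        Phi_val = poly_transform(train_X[val_idx], d)
        theta = fit_ridge_regression(Phi_trn, train_Y[trn_idx], lam)
        scores.append(rmse(Phi_val @ theta, train_Y[val_idx]))
    # Average goodness of fit across the k validation subsets.
    return float(np.mean(scores))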
Part 2 – Logistic Regression
Machine learning is used in medicine to assist doctors with crucial decision-making based on diagnostic data. In this project, we will design a logistic regression model (a single-layer neural network) to predict whether a subject is diabetic. The model will classify the subjects into two groups, diabetic (Class 1) or non-diabetic (Class 0): a binary classification model.
We will use the ‘Pima Indians Diabetes dataset’ to train our model. It contains different clinical parameters (features) for multiple subjects along with the label (diabetic or non-diabetic). Each subject is represented by 8 features (Pregnancies, Glucose, Blood-Pressure, SkinThickness, Insulin, BMI, Diabetes-Pedigree-Function, Age) and the ‘Outcome’ column, which is the class label. The dataset contains the results from 768 subjects.
We will be splitting the dataset into train and test data. We will load and train our model on the train
data and predict the categories on the test data.
1. Steps involved in logistic regression using gradient descent (a sketch of the full loop follows this numbered list):
a. Training data X is of dimensions (d × m), where d is the number of features and m is the number of samples. Training labels Y are of dimensions (1 × m).
b. Initialize the logistic regression model parameters ω and b, where ω is of dimension (d, 1) and b is a scalar. ω is initialized to small random values and b is set to zero.
c. Calculate Z using X and the initial parameter values (ω, b):
Z = \omega^T X + b
d. Apply the sigmoid activation to Z to estimate A:
A = \frac{1}{1 + \exp(-Z)}
e. Calculate the loss between the predicted probabilities A and the ground truth labels Y:
loss = logistic_loss(A, Y)
f. Calculate the gradient dZ (or dL/dZ):
dZ = A - Y
g. Calculate the gradients dω and db (representing dL/dω and dL/db):
dω, db = grad_fn(X, dZ)
h. Adjust the model parameters using the gradients, where α is the learning rate:
\omega := \omega - \alpha \cdot d\omega
b := b - \alpha \cdot db
i. Loop until the loss converges or for a fixed number of epochs.
2. Initialize the model parameters. The weights will be initialized with small random values and the bias as 0. While the bias is a scalar, the dimension of the weight vector will be (d × 1), where d is the number of features.
3. Define the sigmoid activation function (see the first sketch after this list):
\sigma(z) = \frac{1}{1 + \exp(-z)}
where z is the input variable.
4. Logistic Loss Function: Define the objective function that will be used later for determining the loss between the model prediction and the ground truth labels. You need to use the vectors A (activation output of the logistic neuron) and Y (ground truth labels) to define the loss.
L(A, Y) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log(1 - a^{(i)}) \right]
where m is the number of input data points and is used for averaging the total loss.
5. Gradient Function: Define the gradient function for calculating the gradients (dL/dω, dL/db); use it during gradient descent. The gradients can be calculated as
d\omega = \frac{1}{m} X (A - Y)^T
db = \frac{1}{m} \sum_{i=1}^{m} \left( a^{(i)} - y^{(i)} \right)
Instead of (A − Y), use dZ (or dL/dZ), where dZ = A − Y.
Make sure the gradients are of the correct dimensions. Refer to the lecture for more information.
6. Implement the steps for gradient descent described in step 1. Write a function that fits a logistic model with parameters ω, b to the training data X with labels Y (a sketch follows this list).
7. Once you have the optimal values of the model parameters (ω, b), you can determine the accuracy of the model on the test data (see the final sketch after this list):
Z = \omega^T X + b
A = \sigma(Z)
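The following is a minimal NumPy sketch of the building blocks from steps 3–5. The names sigmoid, logistic_loss, and grad_fn follow the spec above, but the exact graded signatures may differ; shapes assume X is (d, m) and Y, A are (1, m).

import numpy as np

def sigmoid(z):
    # Sigmoid activation: 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(A, Y):
    # Average binary cross-entropy over the m samples.
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

def grad_fn(X, dZ):
    # Gradients of the loss with respect to the weights and the bias.
    m = X.shape[1]
    dw = (X @ dZ.T) / m          # shape (d, 1), matching omega
    db = float(np.sum(dZ)) / m   # scalar, matching b
    return dw, db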
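A sketch of the gradient-descent loop from steps 1 and 6, reusing the helpers above; the learning rate and epoch count here are placeholder values, not prescribed by the spec.

import numpy as np

def fit_logistic(X, Y, alpha=0.01, epochs=1000):
    d = X.shape[0]
    rng = np.random.default_rng(0)
    omega = rng.normal(scale=0.01, size=(d, 1))  # small random weights
    b = 0.0                                      # bias starts at zero
    for _ in range(epochs):
        Z = omega.T @ X + b          # pre-activation, shape (1, m)
        A = sigmoid(Z)               # predicted probabilities
        dZ = A - Y                   # gradient of the loss w.r.t. Z
        dw, db = grad_fn(X, dZ)
        omega = omega - alpha * dw   # parameter updates
        b = b - alpha * db
    return omega, b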
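Finally, a sketch of the evaluation in step 7, thresholding the sigmoid output at 0.5 to assign Class 0 or Class 1; the function name is again illustrative.

import numpy as np

def test_accuracy(omega, b, X_test, Y_test):
    # Forward pass on the test data.
    A = sigmoid(omega.T @ X_test + b)
    # Probability >= 0.5 maps to Class 1, otherwise Class 0.
    preds = (A >= 0.5).astype(int)
    return float(np.mean(preds == Y_test))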
Submission Directions for Project Deliverables
Learners are expected to work on the project individually. Ideas and concepts may be discussed with peers, and other sources may be referenced for assistance, but the submitted work must be entirely your own.
You must complete and submit your work through zyBooks’s zyLabs to receive credit for the project:
1. To get started, use the provided Jupyter Notebook in your workspace.
2. All necessary datasets are already loaded into the workspace.
3. Execute your code by clicking the “Run” button in the top menu bar.
4. When you are ready to submit your completed work, click “Submit for grading”, located at the bottom left of the notebook.
5. You will know you have completed the project when feedback appears below the notebook.
6. If needed: to resubmit the project in zyLabs
a. Edit your work in the provided workspace.
b. Run your code again.
c. Click “Submit for grading” again at the bottom of the screen.
Your submission score will automatically be populated from zyBooks into your course grade.
However, the course team will review submissions after the due date has passed to ensure grades
are accurate.
Evaluation
This assignment is auto-graded. There are a total of thirteen (13) test cases and each has points
assigned to it. Please review the notebook to see the points assigned for each test case. A
percentage score will be passed to Canvas based on your score.