CMPT 726: Assignment 2

1 Probability

Let’s consider a scenario where we have a binary classification problem in machine learning, and we want to calculate
the KL divergence between two probability distributions: the true distribution P(Y ) and the predicted distribution
Q(Y ), where Y represents the class labels (0 or 1).
Suppose we have a dataset with 100 samples, where the true distribution P of labels is as follows:
• Class 0: 60 examples
• Class 1: 40 examples

Let’s say our machine learning model makes predictions on this dataset, and the predicted probabilities Q for each
class are as follows:
• Predicted probabilities for Class 0: p
• Predicted probabilities for Class 1: 1 − p
with 0 < p < 1.

a) Calculate the entropy of the true distribution, H(P), and write down the entropy of the predicted distribution, H(Q).
b) Calculate the minimum cross-entropy Hmin(P, Q) and find the corresponding probability p, where
Hmin(P, Q) := min_p [ −Σ_{i=0}^{1} P(Y = i) log Q(Y = i) ].
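
As a numerical sanity check for parts (a) and (b), here is a minimal Python sketch (assuming natural logarithms and the class proportions 0.6 and 0.4 from the counts above; the grid search is only an illustration, not a required derivation):

import numpy as np

# True distribution from the counts above: 60/100 for class 0, 40/100 for class 1.
P = np.array([0.6, 0.4])

# Entropy of the true distribution, H(P) = -sum_i P_i log P_i (natural log assumed).
H_P = -np.sum(P * np.log(P))

# Cross-entropy H(P, Q) as a function of p, with Q = [p, 1 - p].
def cross_entropy(p):
    Q = np.array([p, 1.0 - p])
    return -np.sum(P * np.log(Q))

# Locate the minimizing p numerically on a grid.
grid = np.linspace(0.01, 0.99, 999)
p_star = grid[np.argmin([cross_entropy(p) for p in grid])]
print(H_P, p_star, cross_entropy(p_star))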
c) Prove that the KL divergence satisfies the following relationship between the cross-entropy H(P, Q) and
the entropy H(P):
DKL(P||Q) = H(P, Q) − H(P).
d) What is the minimum KL divergence of the prediction, DKL,min(P||Q)?
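
For parts (c) and (d), the identity from part (c) is easy to check numerically; the following sketch is only a sanity check, not a substitute for the proof (natural log assumed, and the Q below is just an arbitrary valid example):

import numpy as np

P = np.array([0.6, 0.4])
Q = np.array([0.7, 0.3])   # any valid predicted distribution works for the check

H_P  = -np.sum(P * np.log(P))        # entropy H(P)
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy H(P, Q)
D_KL = np.sum(P * np.log(P / Q))     # KL divergence DKL(P||Q)

# The two quantities below should agree up to floating-point error.
print(D_KL, H_PQ - H_P)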

2 Bayesian Inference

Consider a simple linear model y = wx + b + ϵ, where x, y, w, b ∈ R and ϵ ∼ N(0, σ²). Assume prior information
such that w ∼ N(0, σ_w²) and b ∼ N(0, σ_b²). The regularization parameters λ_w and λ_b are defined as
λ_w = 1/σ_w² and λ_b = 1/σ_b², respectively. The training data consists of 20 data points, with code provided in this Jupyter Notebook.

(a) Identify the prior means and covariances.
State the prior means and covariance values for both w and b based on the given prior information.
(b) Explain and implement the formula for [w_MAP, b_MAP] (Maximum A Posteriori estimation).
State the expression for [w_MAP, b_MAP] in terms of the design matrix X (with an added column of ones for the
intercept), the output vector y, and regularization terms. Provide a high-level explanation of the derivation steps.
A detailed derivation is not required as it is available in the provided slides. Focus on summarizing the intuition
behind the MAP estimate and implement the formula in code.
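
A minimal sketch of one standard form of the MAP solution for this Gaussian model (assuming X is the 20×2 design matrix with a column of ones, sigma2 is the known noise variance, and lambda_w, lambda_b are the prior precisions defined above; the names are illustrative, not taken from the provided notebook):

import numpy as np

def map_estimate(X, y, sigma2, lambda_w, lambda_b):
    # MAP estimate [w_MAP, b_MAP] for y = X @ [w, b] + noise with zero-mean Gaussian priors.
    # One standard closed form: (X^T X + sigma2 * Lambda)^(-1) X^T y,
    # with Lambda = diag(lambda_w, lambda_b).
    Lam = np.diag([lambda_w, lambda_b])
    return np.linalg.solve(X.T @ X + sigma2 * Lam, X.T @ y)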

(c) Calculate the posterior means and covariances for w and b.
Provide expressions for the posterior mean and covariance matrix for the parameter vector [w, b]. For a detailed derivation and further explanation of Bayesian linear regression, see Probabilistic Machine Learning: An
Introduction by Kevin P. Murphy, Chapter 7, Section 7.6.1, where V_N denotes the posterior covariance matrix.
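
A sketch of the corresponding posterior computation, under the same assumed names as the MAP sketch above (zero prior means; V_N follows the Gaussian-posterior formula referenced in the text):

import numpy as np

def posterior(X, y, sigma2, lambda_w, lambda_b):
    # Posterior N(m_N, V_N) over [w, b] for the Gaussian linear model:
    #   V_N = (Lambda_prior + X^T X / sigma2)^(-1)
    #   m_N = V_N @ X^T y / sigma2          (zero prior mean assumed)
    Lam = np.diag([lambda_w, lambda_b])          # prior precision matrix
    V_N = np.linalg.inv(Lam + X.T @ X / sigma2)  # posterior covariance
    m_N = V_N @ (X.T @ y) / sigma2               # posterior mean (equals the MAP estimate here)
    return m_N, V_N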

(d) Interpret the posterior mean and covariance.
Explain what the posterior mean and covariance represent for w and b after observing the data, and describe the
practical significance of w.

(e) Sample models from the posterior distribution and plot the results.
Use the posterior mean and covariance to sample multiple sets of parameters [w, b] from the multivariate normal
distribution. Plot the training data along with multiple linear models represented by these samples. Display
these sampled models as semi-transparent lines to visualize the uncertainty in the predictions.
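
A possible sketch for this sampling-and-plotting step, assuming training arrays x_train and y_train and the posterior (m_N, V_N) from part (c) are available (variable names are illustrative):

import numpy as np
import matplotlib.pyplot as plt

samples = np.random.multivariate_normal(m_N, V_N, size=30)  # each row is one sampled [w, b]

xs = np.linspace(x_train.min(), x_train.max(), 100)
plt.scatter(x_train, y_train, color="black", label="training data")
for w_s, b_s in samples:
    plt.plot(xs, w_s * xs + b_s, color="tab:blue", alpha=0.15)  # semi-transparent sampled models
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()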

3 Nonlinear Optimization

In this question, we implement iterative algorithms to solve a nonlinear optimization problem. As a practical application, we will optimize a nonlinear model of chess win probability given the ELO ratings of the players, using data from
lichess.org.

Please provide the code for your answers in this Jupyter Notebook (there are a total of 4 "#<<TODO#x>>" markers in this question).
a) Define the objective function.
Implement your chosen objective function. You may choose between the following:
– Option (i): A simpler quadratic function: f(x, y) = x² + (y − 2)² / 2

– Option (ii): The cross-entropy loss for the sigmoid model, which better represents the chess ELO win probability data.
Implement this function and ensure the code correctly calculates the chosen loss value and its gradient with respect to the
two model parameters for the given data.

Note: You have the option to proceed with either one of the two objective functions. The first one is an optional,
simpler fallback that might help you get started. There is a −5% penalty on this question if you choose not to
implement the cross-entropy loss, option (ii).
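
A hedged sketch of both objective options follows; for option (ii), the exact parameterization of the sigmoid model lives in the provided notebook, so the form used here (win probability = sigmoid(a·d + c), with d the rating difference) is only an assumption:

import numpy as np

# Option (i): f(x, y) = x² + (y − 2)²/2 and its gradient.
def f_quadratic(theta):
    x, y = theta
    value = x**2 + (y - 2)**2 / 2
    grad = np.array([2 * x, y - 2])
    return value, grad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Option (ii): cross-entropy loss for a sigmoid win-probability model.
# Assumed parameterization: p_i = sigmoid(a * d_i + c), where d_i is the rating
# difference for game i and y_i in {0, 1} is the outcome; the notebook's exact
# form may differ.
def f_cross_entropy(theta, d, y):
    a, c = theta
    p = sigmoid(a * d + c)
    eps = 1e-12                                    # guard against log(0)
    value = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = np.array([np.mean((p - y) * d),         # d(loss)/d(a)
                     np.mean(p - y)])              # d(loss)/d(c)
    return value, grad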

b) Implement a basic Gradient Descent algorithm for the objective function (i) or (ii).
• Use a step size of 0.2, 100 total iterations, and an initial point of (−10, 10) for the objective function.
• Change the step size to 0.01. Report and plot your observations.
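
A minimal gradient-descent sketch that works with either objective above (the objective is assumed to return both the loss value and its gradient):

import numpy as np

def gradient_descent(objective, theta0, step_size=0.2, iterations=100):
    # Basic gradient descent; objective(theta) returns (value, gradient).
    theta = np.array(theta0, dtype=float)
    history = []
    for _ in range(iterations):
        value, grad = objective(theta)
        history.append(value)
        theta = theta - step_size * grad
    return theta, history

# Example for option (i):
# theta_opt, losses = gradient_descent(f_quadratic, (-10.0, 10.0), step_size=0.2, iterations=100)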

c) Implement the Adam algorithm for the chosen objective function.
• Use a step size of 0.2, 150 total iterations, and an initial point of (−10, 10) for the objective function. Use
default values for the other parameters. Does the solution converge to the minimal objective value? Report
and plot your observations.
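
A sketch of Adam with its usual default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), again assuming the objective returns (value, gradient):

import numpy as np

def adam(objective, theta0, step_size=0.2, iterations=150,
         beta1=0.9, beta2=0.999, eps=1e-8):
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
    history = []
    for t in range(1, iterations + 1):
        value, grad = objective(theta)
        history.append(value)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)      # bias-corrected first moment
        v_hat = v / (1 - beta2**t)      # bias-corrected second moment
        theta = theta - step_size * m_hat / (np.sqrt(v_hat) + eps)
    return theta, history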

d) Finally, use the best parameters that you found via Adam (or gradient descent) to draw the best-fitting sigmoid
model on top of the data points. Briefly discuss this result in comparison with the linear fit from the
earlier question.
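
A short plotting sketch for this last step, assuming rating differences d and game outcomes y from the notebook and fitted parameters a_opt, c_opt from Adam (all names illustrative):

import numpy as np
import matplotlib.pyplot as plt

grid = np.linspace(d.min(), d.max(), 200)
plt.scatter(d, y, s=10, alpha=0.3, label="games")
plt.plot(grid, 1.0 / (1.0 + np.exp(-(a_opt * grid + c_opt))),
         color="red", label="fitted sigmoid model")
plt.xlabel("rating difference")
plt.ylabel("win probability")
plt.legend()
plt.show()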