CS434 Homework 3: Naïve Bayes and Neural Networks


In this homework, we are going to do some exercises to understand Naïve Bayes a bit
better and then get some hands-on experience with simple neural networks in a multi-class classification setting.
How to Do This Assignment.
• Each question that you need to respond to is in a blue “Task Box” with its corresponding point-value listed.
• We prefer typeset solutions (LaTeX / Word) but will accept scanned written work if it is legible. If a TA can't
read your work, they can't give you credit.
• Programming should be done in Python and numpy. If you don’t have Python installed, you can install it from
here. This is also the link showing how to install numpy. You can also search through the internet for numpy
tutorials if you haven’t used it before. Google and APIs are your friends!
You are NOT allowed to…
• Use machine learning packages such as sklearn.
• Use data analysis packages such as pandas or seaborn.
• Discuss low-level details or share code / solutions with other students.
Advice. Start early. There are two sections to this assignment – one involving working with math (20% of grade) and
another focused more on programming (80% of the grade). Read the whole document before deciding where to start.
How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw3-report.pdf.
You will also submit test set predictions to a class-wide Kaggle competition.
1 Written Exercises: Analyzing Naïve Bayes [5pts]
1.1 Bernoulli Naïve Bayes As A Linear Classifier
Consider a Naïve Bayes model for binary classification where the features X1, …, Xd are also binary variables. As part
of training this Naïve Bayes model, we would estimate the conditional distributions P(Xi | y=c) for c ∈ {0, 1} and
i = 1, …, d, as well as the prior distribution P(y=c) for c ∈ {0, 1}. For notational simplicity, let's denote P(Xi=1 | y=c)
as θ_{ic} and P(Xi=0 | y=c) as 1 − θ_{ic}. Likewise, let θ_1 be P(y=1) and θ_0 = 1 − θ_1 be P(y=0).
If we write out the posterior of this classifier (leaving out the normalizing constant), we would have:

P(y = 1 | x1, …, xd) = [ P(y = 1) ∏_{i=1}^{d} P(xi | y = 1) ] / P(x1, …, xd) ∝ θ_1 ∏_{i=1}^{d} θ_{i1}^{xi} (1 − θ_{i1})^{1−xi}   (1)

P(y = 0 | x1, …, xd) = [ P(y = 0) ∏_{i=1}^{d} P(xi | y = 0) ] / P(x1, …, xd) ∝ θ_0 ∏_{i=1}^{d} θ_{i0}^{xi} (1 − θ_{i0})^{1−xi}   (2)
The classification rule in Naïve Bayes is to choose the class with the highest posterior probability, that is, to predict
y = argmax_c P(y = c | x1, …, xd) for a new example x = [x1, …, xd]. In order for the prediction to be class
1, P(y = 1 | x1, …, xd) must be greater than P(y = 0 | x1, …, xd), or equivalently we can check whether:

P(y = 1 | x1, …, xd) / P(y = 0 | x1, …, xd) > 1   (3)
In this setting with binary inputs, we will show that this is a linear decision boundary. This is also true for continuous
inputs if they are modelled with a member of the exponential family (including Gaussian) – see the proof here in §3.1.
Q1 Prove Bernoulli Naïve Bayes has a linear decision boundary [4pts]. Prove that the Bernoulli
Naïve Bayes model described above has a linear decision boundary. Specifically, you'll need to show that
class 1 will be predicted only if b + w^T x > 0 for some parameters b and w. To do so, show that:

P(y = 1 | x1, …, xd) / P(y = 0 | x1, …, xd) > 1  =⇒  b + Σ_{i=1}^{d} wi xi > 0   (4)

As part of your report, explicitly write expressions for the bias b and each weight wi.

Hints: Expand Eq. (3) by substituting the posterior expressions from Eqs. (1) & (2) into Eq. (3). Take the
log of that expression and combine like terms.
1.2 Duplicated Features in Naïve Bayes
Naïve Bayes classifiers assume features are conditionally independent given the class label. When this assumption is
violated, Naïve Bayes classifiers may be over-confident or under-confident in the outputs. To examine this more closely,
we’ll consider the situation where an input feature is duplicated. These duplicated features are maximally dependent
– making this a strong violation of conditional independence. Here, we'll examine a case where the confidence
increases when the duplicated feature is included despite no new information actually being added as input.
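To make the effect concrete before proving it, here is a small numerical sketch in numpy; the Bernoulli likelihood values below are made up purely for illustration. With a uniform prior, duplicating the feature squares each class-conditional likelihood term, which pushes the posterior further toward whichever class was already favored.

import numpy as np

# Made-up Bernoulli likelihoods for observing X1 = 1 (illustration only)
p_x_given_y1 = 0.8   # assumed P(X1=1 | y=1)
p_x_given_y0 = 0.3   # assumed P(X1=1 | y=0)
prior = 0.5          # uniform prior: P(y=1) = P(y=0)

# Posterior for y=1 using the single feature
num1 = prior * p_x_given_y1
num0 = prior * p_x_given_y0
post_single = num1 / (num1 + num0)

# Posterior after (mistakenly) duplicating X1 as X2: each likelihood term is squared
num1_dup = prior * p_x_given_y1 ** 2
num0_dup = prior * p_x_given_y0 ** 2
post_dup = num1_dup / (num1_dup + num0_dup)

print(post_single)  # ~0.727
print(post_dup)     # ~0.877, i.e. more confident with no new information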
Q2 Duplicated Features in Naïve Bayes [1pt]. Consider a Naïve Bayes model with a single feature
X1 for a binary classification problem. Assume a uniform prior such that P(y = 0) = P(y = 1). Suppose the
model predicts class 1 for an example x; then we know:

P(y = 1|X1 = x1) > P(y = 0|X1 = x1)   (5)

Now suppose we make a mistake and duplicate the X1 feature in our data, so that we have two identical inputs X1 and X2. Show that the predicted probability for class 1 is higher than it was before; that is, prove:
P(y = 1|X1 = x1, X2 = x2) > P(y = 1|X1 = x1) (6)
Hints: Use the assumption that P(y = 0) = P(y = 1) to simplify the posterior expressions in Eq. (6) and
Eq. (5). As X2 is an exact copy of X1, P(X2 = x2|y) is the same as P(X1 = x1|y) for any example.
2 Implementing a Neural Network For Digit Identification [20pts]
Small MNIST. In this section, we will implement a feed-forward neural network model for predicting the value of a
drawn digit. We are using a subset of the MNIST dataset commonly used in machine learning research papers. A few
examples of these handwritten-then-digitized digits from the dataset are shown below:

[Figure: sample digit images from the small MNIST dataset]

Each digit is a 28 × 28 greyscale image with values ranging from 0 to 255. We represent an image as a row vector
x ∈ R^{1×784} where the image has been serialized into one long vector. Each digit has an associated class label from
0,1,2,…,9 corresponding to its value. We provide three dataset splits for this homework – a training set containing
5000 examples, a validation set containing 1000, and our test set containing 4220 (no labels). These datasets can be
downloaded from the class Kaggle competition for this homework.
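As a concrete starting point, a minimal loading sketch might look like the following; the filenames and the column layout (label in the first column, followed by the 784 pixel values) are assumptions based on the description above and may differ from the actual Kaggle files.

import numpy as np

# Hypothetical filenames and layout; check the actual files on the Kaggle Data tab.
# Assumed format: each row is one example, the first column is the label,
# and the remaining 784 columns are the serialized 28 x 28 pixel values.
train = np.loadtxt("mnist_small_train.csv", delimiter=",")
val = np.loadtxt("mnist_small_val.csv", delimiter=",")

X_train, y_train = train[:, 1:], train[:, 0].astype(int)
X_val, y_val = val[:, 1:], val[:, 0].astype(int)

print(X_train.shape)  # expected: (5000, 784)
print(X_val.shape)    # expected: (1000, 784)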
2.1 Cross-Entropy Loss for Multiclass Classification
Unlike the previous classification tasks we've examined, we have 10 different possible class labels here. How do
we measure the error of our model? Let's formalize this a little and say we have a dataset D = {x_i, y_i}_{i=1}^N with
y_i ∈ {0, 1, 2, …, 9}. Assume we have a model f(x; θ) parameterized by a set of parameters θ that predicts P(Y | X = x)
(a distribution over our labels given an input). Let's refer to P(Y = c | X = x) predicted from this model as p_{c|x} for
compactness. We can write this output as a categorical distribution:

P(Y = y | X = x) = p_{0|x} if y = 0;  p_{1|x} if y = 1;  … ;  p_{9|x} if y = 9
                 = ∏_{c=0}^{9} p_{c|x}^{I[y == c]}   (7)
where I[condition] is the indicator function that is 1 if the condition is true and 0 otherwise. Using this, we can write
our negative log-likelihood of a single example as:

−log P(D|θ) = −Σ_{c=0}^{9} I[y_i == c] log p_{c|x_i} = −log p_{y_i|x_i}   (8)

This loss function is also often referred to as a Cross-Entropy loss. In this homework, we will minimize this negative
log-likelihood by stochastic gradient descent. In the following, we will refer to this negative log-likelihood as

L(θ) = −log p_{y_i|x_i}   (9)

Note that we write L as a function of θ because each p_{y_i|x_i} is produced by our model f(x_i; θ).
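As a sanity check on Eqs. (8) and (9), here is a minimal numpy sketch of the batched cross-entropy loss. It assumes probs is an (n × 10) matrix of predicted probabilities p_{·|x} and labels is a length-n integer vector; both names are just for illustration, and the loss is averaged over the batch.

import numpy as np

def cross_entropy(probs, labels):
    # probs: (n, 10) predicted class probabilities, each row sums to 1
    # labels: (n,) integer labels in {0, ..., 9}
    n = probs.shape[0]
    # Pick out p_{y_i | x_i} for each example and average the negative log-likelihoods
    return -np.mean(np.log(probs[np.arange(n), labels]))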
2.2 Implementing Backpropagation for Feed-forward Neural Network
In this homework, we’ll consider feed-forward neural networks composed of a sequence of linear layers xW1 + b1 and
non-linear activation functions g1(·). As such, a network with 3 of these layers stacked together can be written
b3 + g2(b2 + g1(b1 + x ∗ W1) ∗ W2) ∗ W3 (10)
Note how this is a series of nested functions, reflecting the sequential feed-forward nature of the computation.
To make our notation easier later on, I want to give a name to the intermediate outputs at each stage, so I will
expand this to write:
z1 = x ∗ W1 + b1   (11)
a1 = g1(z1)   (12)
z2 = a1 ∗ W2 + b2   (13)
a2 = g2(z2)   (14)
z3 = a2 ∗ W3 + b3   (15)
where z’s are intermediate outputs from the linear layers and a’s are post-activation function outputs. In the case of
our MNIST experiments, z3 will have 10 dimensions – one for each of the possible labels. Finally, the output vector
z3 is not yet a probability distribution so we apply the softmax function:
p_{j|x} = e^{z_{3j}} / Σ_{c=0}^{9} e^{z_{3c}}   (17)

and let p_{·|x} be the vector of these predicted probability values.
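A minimal, numerically stable softmax sketch over a batch of z3 rows might look like the following; subtracting the row-wise maximum does not change Eq. (17) but avoids overflow. This is an illustrative helper, not necessarily how the starter code implements it.

import numpy as np

def softmax(z):
    # z: (n, 10) pre-softmax scores (the z3 outputs for a batch of n examples)
    z = z - np.max(z, axis=1, keepdims=True)  # stability shift; cancels in the ratio
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)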
Gradient Descent for Neural Networks. Considering this simple 3-layer neural network, there are quite a few
parameters spread out through the function – weight matrices W3, W2, W1 and bias vectors b3, b2, b1. Suppose we
would like to find parameters that minimize our loss L, which measures the error in the network's prediction.
How can we update the weights to reduce this error? Let’s use gradient descent and start by writing out the chain
rule for the gradient of each of these. I’ll work backwards from W3 to W1 to expose some structure here.
∂L/∂W3 = (∂L/∂p_{·|x}) (∂p_{·|x}/∂z3) (∂z3/∂W3)   (18)

∂L/∂W2 = (∂L/∂p_{·|x}) (∂p_{·|x}/∂z3) (∂z3/∂a2) (∂a2/∂z2) (∂z2/∂W2)   (19)

∂L/∂W1 = (∂L/∂p_{·|x}) (∂p_{·|x}/∂z3) (∂z3/∂a2) (∂a2/∂z2) (∂z2/∂a1) (∂a1/∂z1) (∂z1/∂W1)   (20)
As I’ve highlighted in color above, we end up reusing the same intermediate terms over and over as we compute
derivatives for weights further and further from the output in our network.1 As discussed in class, this suggests
the straight-forward backpropagation algorithm for computing these efficiently. Specifically, we will compute these
intermediate colored terms starting from the output and working backwards.
Forward-Backward Pass in Backpropagation. One convenient way to implement backpropagation is to consider
each layer (or operation) f as having a forward pass that computes the function output normally as
output = f_forward(input)   (21)

and a backward pass that takes in the gradient of the loss up to this point in our backward pass and then outputs the gradient
of the loss with respect to its input:

∂L/∂input = f_backward(∂L/∂output) = (∂L/∂output) (∂output/∂input)   (22)
The backward operator will also compute any gradients with respect to parameters of f and store them to be used in
a gradient descent update step after the backwards pass. The starter code implements this sort of framework.
See the snippet further below that defines a neural network like the one we've described here, except that it
allows for a configurable number of linear layers. Please read the comments and code below before continuing with
this document. To give a concrete example of the forward-backward steps for an operator, consider the Sigmoid (aka
the logistic) activation function:

Sigmoid(x) = 1 / (1 + e^{−x})   (23)

The implementation of forward and backward for the Sigmoid is below – in forward it computes Eq. (23); in backward
it computes and returns

∂L/∂input = (∂L/∂output) (∂output/∂input) = (∂L/∂output) Sigmoid(input) (1 − Sigmoid(input))   (24)

It has no parameters, so it does nothing during the "step" function.
class Sigmoid:

    # Given the input, apply the sigmoid function and
    # store the output value for use in the backward pass
    def forward(self, input):
        self.act = 1 / (1 + np.exp(-input))
        return self.act

    # Compute the gradient of the output with respect to the input,
    # self.act * (1 - self.act), and then multiply by the loss gradient with
    # respect to the output to produce the loss gradient with respect to the input
    def backward(self, grad):
        return grad * self.act * (1 - self.act)

    # The Sigmoid has no parameters so there is nothing to do during a gradient descent step
    def step(self, step_size):
        return
¹ I don't repeat this for the bias vectors, as it would simply involve changing the final terms from ∂z_i/∂W_i to ∂z_i/∂b_i.
class FeedForwardNeuralNetwork:

    # Builds a network of linear layers separated by non-linear activations,
    # either ReLU or Sigmoid. Each internal layer has hidden_dim dimensions.
    def __init__(self, input_dim, output_dim, hidden_dim, num_layers, activation="ReLU"):

        if num_layers == 1:  # Just a linear mapping from input to output
            self.layers = [LinearLayer(input_dim, output_dim)]

        else:  # At least two layers

            # Layer to map input to hidden dimension size
            self.layers = [LinearLayer(input_dim, hidden_dim)]
            self.layers.append(Sigmoid() if activation == "Sigmoid" else ReLU())

            # Hidden layers
            for i in range(num_layers - 2):
                self.layers.append(LinearLayer(hidden_dim, hidden_dim))
                self.layers.append(Sigmoid() if activation == "Sigmoid" else ReLU())

            # Layer to map hidden dimension to output size
            self.layers.append(LinearLayer(hidden_dim, output_dim))

    # Given an input, call the forward function of each of our layers,
    # passing the output of each layer to the next one
    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    # Given a gradient with respect to the network output, call the backward
    # function of each of our layers, passing the output of each layer to the one before it
    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)

    # Tell each layer to update its weights based on the gradient computed in the backward pass
    def step(self, step_size=0.001):
        for layer in self.layers:
            layer.step(step_size)
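To see how these pieces fit together, here is a hedged sketch of what a single stochastic gradient descent step might look like with this framework. It assumes the softmax and cross_entropy helpers sketched in Section 2.1 and uses the fact that, for softmax followed by cross-entropy, ∂L/∂z3 takes the simple form (p − onehot(y)); the actual starter code may package these steps differently (e.g. inside a loss object).

import numpy as np

def train_step(net, X_batch, y_batch, step_size=0.01):
    n = X_batch.shape[0]

    # Forward pass: raw scores z3, then probabilities via the softmax
    logits = net.forward(X_batch)          # (n, 10)
    probs = softmax(logits)                # (n, 10)
    loss = cross_entropy(probs, y_batch)   # scalar

    # Backward pass: dL/dz3 = (p - onehot(y)) / n for the batch-averaged loss
    grad = probs.copy()
    grad[np.arange(n), y_batch] -= 1.0
    grad /= n
    net.backward(grad)

    # Gradient descent update on every layer's parameters
    net.step(step_size)
    return loss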
Operating on Batches. The network described in the equations earlier in this section operates on a single input
at a time. In practice, we will want to operate on sets of n examples at once, such that a layer actually computes
Z = XW + b for X ∈ R^{n×input_dim} and Z ∈ R^{n×output_dim} – call this a batched operation. It is straightforward to
change the forward pass to operate on these all at once. For example, a linear layer can be rewritten as Z = XW + b
where the +b is a broadcasted addition – this is already done in the code above. On the backward pass, we simply
need to aggregate the gradient of the loss of each data point with respect to our parameters. For example,

∂L/∂W1 = Σ_{i=1}^{n} ∂L_i/∂W1   (25)

where L_i is the loss of the i'th datapoint and L is the overall loss.
Deriving the Backward Pass for a Linear Layer. In this homework, we'll implement the backward pass of a linear
layer. To do so, we'll need to be able to compute dZ/db, dZ/dW, and dZ/dX. For each, we'll start by considering
the problem for a single training example x (i.e. a single row of X) and then generalize to the batch setting. In this
single-example setting, z = xW + b such that z, b ∈ R^{1×c}, x ∈ R^{1×d}, and W ∈ R^{d×c}. Once we solve this case,
extending to the batch setting just requires summing over the gradient terms for each example.
dZ/db. Considering just the i'th element of z, we can write z_i = x w_{·,i} + b_i, where w_{·,i} is the i'th column of W. From
this equation, it is straightforward to observe that element b_i only affects the corresponding output z_i, such that

dz_i/db_j = 1 if i = j, and 0 otherwise   (26)

This suggests that the Jacobian dz/db is an identity matrix I of dimension c × c. Applying the chain rule and summing
over all our datapoints, we see dL/db can be computed as a sum of the rows of dL/dZ:
dL/db = Σ_{k=1}^{n} (dL_k/dZ_k) (dZ_k/db) = Σ_{k=1}^{n} (dL_k/dZ_k) I = Σ_{k=1}^{n} dL_k/dZ_k   (27)
dZ/dW. Following the same process of reasoning from the single-example case, we can again write the i'th element
of z as z_i = x w_{·,i} + b_i, where w_{·,i} is the i'th column of W. When considering the derivative of z_i with respect to the
columns of W, we see that it is just x for w_{·,i} and 0 for the other columns as they don't contribute to z_i – that is to say:

∂z_i/∂w_{·,j} = x if i = j, and 0 otherwise   (28)

Considering the loss gradient ∂L/∂w_{·,i} for a single example, we can write:

∂L/∂w_{·,i} = (∂L/∂z_i) (∂z_i/∂w_{·,i}) = (∂L/∂z_i) x   (29)

That is to say, each column i of ∂L/∂W is the input x scaled by the loss gradient of z_i. As such, we can compute the
gradient for the entire W as the product:

∂L/∂W = x^T (∂L/∂z)   (30)

Notice that x^T is d × 1 and ∂L/∂z is 1 × c – resulting in a d × c gradient that matches the dimensions of W.
Now let's consider if we have multiple datapoints x_1, …, x_n as the matrix X and likewise multiple activation vectors
z_1, …, z_n as the matrix Z. As our loss simply sums each datapoint's loss, the gradient also decomposes into a sum of
∂L/∂z_{i,·} terms:

∂L/∂W = Σ_{k=1}^{n} (∂L/∂Z_k) (∂Z_k/∂W) = Σ_{k=1}^{n} X_k^T (∂L_k/∂Z_k)   (31)

We can write this even more compactly as:

∂L/∂W = X^T (∂L/∂Z)   (32)
dZ/dX. This follows a very similar path as dZ/dW. We again consider the i'th element of z as z_i = x w_{·,i} + b_i, where
w_{·,i} is the i'th column of W. Taking the derivative with respect to x, it is clear that for z_i the result will be w_{·,i}:

∂z_i/∂x = w_{·,i}   (33)

This suggests that the rows of dz/dx are simply the columns of W, such that dz/dx = W^T, and we can write

∂L/∂x = (∂L/∂z) (∂z/∂x) = (∂L/∂z) W^T   (34)

Moving to the multiple-example setting, the above expression gives each row of dL/dX and the entire matrix can be
computed efficiently as

dL/dX = (dL/dZ) W^T   (35)
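Before moving to the implementation, here is a small sketch that expresses Eqs. (27), (32), and (35) directly in numpy and checks the resulting shapes; the array names and sizes are illustrative and do not correspond to variables in the starter code.

import numpy as np

n, d, c = 4, 784, 10                # batch size, input_dim, output_dim (example values)
X = np.random.randn(n, d)           # batch of inputs
W = np.random.randn(d, c)           # weight matrix
grad_Z = np.random.randn(n, c)      # dL/dZ arriving from the layer above

grad_W = X.T @ grad_Z                           # Eq. (32): shape (d, c)
grad_b = np.sum(grad_Z, axis=0, keepdims=True)  # Eq. (27): shape (1, c), sum of the rows of dL/dZ
grad_X = grad_Z @ W.T                           # Eq. (35): shape (n, d)

assert grad_W.shape == (d, c) and grad_b.shape == (1, c) and grad_X.shape == (n, d)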
Q3 Implementing the Backward Pass for a Linear Layer [6pt]. Implement the backward pass
function of the linear layer in the skeleton code. The function takes in the matrix dL/dZ as the variable
grad and you must compute dL/dW, dL/db, and dL/dX. The first two are stored as self.grad_weights
and self.grad_bias and the third is returned. The expressions for these can be found above in Eq. (27)
(dL/db), Eq. (32) (dL/dW), and Eq. (35) (dL/dX).
class LinearLayer:

    # Initialize our layer with an (input_dim, output_dim) weight matrix and a (1, output_dim) bias vector
    def __init__(self, input_dim, output_dim):
        self.weights = np.random.randn(input_dim, output_dim).astype(np.float64) * np.sqrt(2. / input_dim)
        self.bias = np.ones((1, output_dim)).astype(np.float64) * 0.5

    # During the forward pass, we simply compute XW + b
    def forward(self, input):
        self.input = input
        return self.input @ self.weights + self.bias

    #################################################
    # Q3 Implementing Backward Pass for Linear
    #################################################
    # Inputs:
    #
    # grad dL/dZ -- For a batch size of n, grad is an (n x output_dim) matrix where
    #               the i'th row is the gradient of the loss of example i with respect
    #               to z_i (the output of this layer for example i)
    #
    # Computes and stores:
    #
    # self.grad_weights dL/dW -- An (input_dim x output_dim) matrix storing the
    #                            gradient of the loss with respect to the weights of
    #                            this layer.
    #
    # self.grad_bias dL/db -- A (1 x output_dim) matrix storing the gradient
    #                         of the loss with respect to the bias of this layer.
    #
    # Return Value:
    #
    # grad_input dL/dX -- For a batch size of n, grad_input is an (n x input_dim)
    #                     matrix where the i'th row is the gradient of the loss of
    #                     example i with respect to x_i (the input of this
    #                     layer for example i)
    #################################################

    def backward(self, grad):       # grad is dL/dZ
        self.grad_weights = ?       # Compute dL/dW as in Eq. 32
        self.grad_bias = ?          # Compute dL/db as in Eq. 27
        return ?                    # Compute dL/dX as in Eq. 35

    # During the gradient descent step, update the weights and biases based on the
    # stored gradients from the backward pass
    def step(self, step_size):
        self.weights -= step_size * self.grad_weights
        self.bias -= step_size * self.grad_bias
Once you've completed the above task, running the skeleton code should load the digit data and train a 2-layer
neural network with a hidden dimension of 16 and Sigmoid activations. This model is trained on the training set and
evaluated once per epoch on the validation data. After training, it will produce a plot of your results that should look
like the one below. The plot shows training and validation loss (cross-entropy in this case) over training iterations (in
red, measured on the left vertical axis). It also shows training and validation accuracy (in blue, measured on the
right vertical axis). As you can see, this model achieves between 80% and 90% accuracy on the validation set.

[Figure: training/validation loss and accuracy curves produced by the skeleton code]
2.3 Analyzing Hyperparameter Choices
Neural networks have many hyperparameters. These range from architectural choices (How many layers? How wide
should each layer be? What activation function should be used? ) to optimization parameters (What batch size for
stochastic gradient descent? What step size (aka learning rate)? How many epochs should I train? ). This section has
you modify many of these to examine their effect. The default parameters are below for easy reference.
# GLOBAL PARAMETERS FOR STOCHASTIC GRADIENT DESCENT
np.random.seed(102)
step_size = 0.01
batch_size = 200
max_epochs = 200

# GLOBAL PARAMETERS FOR NETWORK ARCHITECTURE
number_of_layers = 2
width_of_layers = 16  # only matters if number of layers > 1
activation = "ReLU" if False else "Sigmoid"
Optimization Parameters. Optimization parameters in Stochastic Gradient Descent are very inter-related. Large
batch sizes mean less noisy estimates of the gradient, so larger step sizes could be used. But larger batch sizes
also mean fewer gradient updates per epoch, so we might need to increase the max epochs. Getting a good set of
parameters that work well can be tricky and requires checking the validation set performance. Further, these “good
parameters” will vary model-to-model.
Q4 Learning Rate [2pts]. The learning rate (or step size) in stochastic gradient descent controls how
large of a step we take in the direction of the loss gradient at each iteration. The batch size
determines how many data points we use to estimate the gradient. Modify the hyperparameters to run the
following experiments:
1. Step size of 0.0001 (leave default values for other hyperparameters)
2. Step size of 5 (leave default values for other hyperparameters)
3. Step size of 10 (leave default values for other hyperparameters)
Include these plots in your report and answer the following questions:
a) Compare and contrast the learning curves with your curve using the default parameters. What do you
observe in terms of smoothness, shape, and what performance they reach?
b) For (a), what would you expect to happen if the max epochs were increased?
Activation Function and Depth. As networks get deeper (i.e., have more layers), they tend to be able to fit more
complex functions (though this may also lead to overfitting). However, this also means the backpropagated gradient
is a product of many terms before reaching the lower layers, resulting in gradients of relatively small magnitude.
This has the effect of making learning slower. Certain activation functions make this better or worse depending
on the shape of their derivative. One popular choice is to use a Rectified Linear Unit (ReLU) activation, which computes:
ReLU(x) = max(0, x) (36)
This is especially common in very deep networks. In the next question, we’ll see why experimentally.
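For reference, a ReLU operator written in the same forward/backward/step style as the Sigmoid class above might look like the sketch below; the ReLU in the starter code may differ in its details, so treat this as illustrative only.

class ReLU:

    # Forward: zero out negative inputs, remembering which entries were positive
    def forward(self, input):
        self.mask = input > 0
        return input * self.mask

    # Backward: the derivative is 1 where the input was positive and 0 elsewhere,
    # so the incoming gradient is simply passed through the stored mask
    def backward(self, grad):
        return grad * self.mask

    # ReLU has no parameters, so there is nothing to do during a gradient descent step
    def step(self, step_size):
        return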
Q5 ReLUs and Vanishing Gradients [3pts]. Modify the hyperparameters to run the following experiments:
1. 5-layer with Sigmoid Activation (leave default values for other hyperparameters)
2. 5-layer with Sigmoid Activation with 0.1 step size (leave default values for other hyperparameters)
3. 5-layer with ReLU Activation (leave default values for other hyperparameters)
Include these plots in your report and answer the following questions:
a) Compare and contrast the learning curves you observe and the curve for the default parameters in terms
of smoothness, shape, and what performance they reach. Do you notice any differences in the
relationship between the train and validation curves in each plot?
b) If you observed increasing the learning rate in (2) improves over (1), why might that be?
c) If (3) outperformed (1), why might that be? Consider the derivative of the sigmoid and ReLU functions.
2.4 Randomness in Training
There is also a good deal of randomness in training a neural network with stochastic gradient descent – network weights
are randomly initialized and the batches are randomly ordered. This can make a non-trivial difference to outcomes.
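One simple way to quantify this variability is to sweep the seed and look at the spread of validation accuracies; the sketch below assumes a hypothetical train_and_evaluate helper (not part of the starter code) that trains with the default hyperparameters under a given seed and returns the final validation accuracy.

import numpy as np

# Hypothetical helper: fill in with your own training and evaluation code.
def train_and_evaluate(seed):
    np.random.seed(seed)
    ...  # build the network, run SGD with the default hyperparameters, evaluate on validation
    return 0.0  # placeholder return value

accuracies = [train_and_evaluate(seed) for seed in [1, 2, 3, 4, 5]]
print("mean:", np.mean(accuracies), "std:", np.std(accuracies))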
Q6 Measuring Randomness [1pt]. Using the default hyperparameters, set the random seed to 5 different values and report the validation accuracies you observe after training. What impact does this randomness have on the certainty of your conclusions in the previous questions?
2.5 Make Your Kaggle Submission
Great work getting here. In this section, you’ll submit the predictions of your best model to the class-wide Kaggle
competition. You are free to make any modification to your neural network or the optimization procedure to improve
performance; however, it must remain a feed-forward neural network! For example, you can change any of the
optimization hyperparameters or add momentum / weight decay, vary the number of layers or width, add dropout or
residual connections, etc.
Q7 Kaggle Submission [8pt]. Submit a set of predictions to Kaggle that outperforms the baseline on
the public leaderboard. To make a valid submission, use the train set to train your neural network classifier
and then apply it to the test instances in mnist_small_test.csv available from Kaggle's Data tab. Format
your output as a two-column CSV as below:

id,digit
0,3
1,9
2,4
3,1
...

where the id is just the row index in mnist_small_test.csv. You may submit up to 10 times a day. In
your report, tell us what modifications you made for your final submission.
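As a minimal sketch of producing this file with numpy (the array name and output filename are illustrative; replace the placeholder predictions with your model's outputs for the test rows, in order):

import numpy as np

# Placeholder predictions for the 4220 test rows; replace with your model's predicted digits.
preds = np.zeros(4220, dtype=int)

ids = np.arange(len(preds))
submission = np.stack([ids, preds], axis=1)
np.savetxt("submission.csv", submission, fmt="%d", delimiter=",",
           header="id,digit", comments="")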
Extra Credit and Bragging Rights [1.25pt Extra Credit]. The TA has made a submission to the leaderboard. Any
submission outperforming the TA on the private leaderboard at the end of the homework period will receive 1.25 extra
credit points on this assignment. Further, the top 5 ranked submissions will “win HW3” and receive bragging rights.
3 Debriefing (required in your report)
1. Approximately how many hours did you spend on this assignment?
2. Would you rate it as easy, moderate, or difficult?
3. Did you work on it mostly alone or did you discuss the problems with others?
4. How deeply do you feel you understand the material it covers (0%–100%)?
5. Any other comments?