CSCE 633 Homework 1 solution


Question 1
1-dimensional linear regression: Assume a 1-dimensional linear regression model y = w_0 + w_1 x. The residual sum of squares (RSS) of the training data D_train = {(x_1, y_1), ..., (x_N, y_N)} can be written as:

RSS(w_0, w_1) = \sum_{n=1}^{N} (y_n - w_0 - w_1 x_n)^2

We estimate the weights w_0 and w_1 by minimizing the above error.
(a) Show that minimizing RSS results in the following closed-form expressions:
w_1^* = \frac{\sum_{n=1}^{N} x_n y_n - N \left( \frac{1}{N} \sum_{n=1}^{N} x_n \right) \left( \frac{1}{N} \sum_{n=1}^{N} y_n \right)}{\sum_{n=1}^{N} x_n^2 - N \left( \frac{1}{N} \sum_{n=1}^{N} x_n \right)^2}

w_0^* = \left( \frac{1}{N} \sum_{n=1}^{N} y_n \right) - w_1^* \left( \frac{1}{N} \sum_{n=1}^{N} x_n \right)
Tip: Set the partial derivatives \partial RSS(w_0, w_1)/\partial w_0 and \partial RSS(w_0, w_1)/\partial w_1 equal to 0. Then solve a 2 × 2 system of linear equations with respect to w_0 and w_1.
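One way to begin, sketched here for orientation with the notation above: the zero-derivative conditions read

-2 \sum_{n=1}^{N} (y_n - w_0 - w_1 x_n) = 0 \quad \text{and} \quad -2 \sum_{n=1}^{N} x_n (y_n - w_0 - w_1 x_n) = 0,

which rearrange into the 2 × 2 linear system

N w_0 + \Big( \sum_{n=1}^{N} x_n \Big) w_1 = \sum_{n=1}^{N} y_n, \qquad \Big( \sum_{n=1}^{N} x_n \Big) w_0 + \Big( \sum_{n=1}^{N} x_n^2 \Big) w_1 = \sum_{n=1}^{N} x_n y_n.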
(b) Show that the above expressions for w_0^* and w_1^* are equivalent to the following:

w_1^* = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sum_{n=1}^{N} (x_n - \bar{x})^2}

w_0^* = \bar{y} - w_1^* \bar{x}

where \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n and \bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n are the sample means of the input features and outcome values, respectively.
(c) How would you interpret the above expressions in terms of the descriptive statistics (e.g. sample mean, variance, covariance) of the populations {x_n}_{n=1}^{N} and {y_n}_{n=1}^{N}?
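For concreteness, a minimal NumPy sketch of the closed-form fit from part (b); the array names x and y are placeholders for the training inputs and outcomes:

import numpy as np

def fit_1d_ols(x, y):
    """Closed-form 1-D least-squares fit; returns (w0, w1)."""
    x_bar, y_bar = x.mean(), y.mean()
    # w1 = sample covariance of (x, y) divided by sample variance of x
    w1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
    w0 = y_bar - w1 * x_bar
    return w0, w1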
Question 2
Principled method for learning the step size in gradient descent: In class we discussed
that when we use gradient descent to minimize target function J(w) with respect to w, the step
size α(k) at iteration k is a crucial hyperparameter. We further said that we can determine α(k) experimentally through cross-validation. There is, however, a principled way to compute the optimal α(k) at each iteration, and we are going to derive the expression for it.
(a) According to the Taylor series expansion, a differentiable function f(x) can be written around x_0 as follows:

f(x) = f(x_0) + (\nabla f|_{x=x_0})^T (x - x_0) + \frac{1}{2} (x - x_0)^T H_f|_{x=x_0} (x - x_0) + \ldots

where \nabla f and H_f are the gradient vector and Hessian matrix of f evaluated at x_0.
Let w(k) be the value of w at the k-th iteration of gradient descent. Show that the second-order Taylor expansion of the target function J(w) around w(k) is the following:

J(w) \approx J(w(k)) + (\nabla J|_{w=w(k)})^T (w - w(k)) + \frac{1}{2} (w - w(k))^T H_J|_{w=w(k)} (w - w(k))

where \nabla J and H_J are the gradient vector and Hessian matrix of J evaluated at w(k).
(b) Show that the above expression for J(w) evaluated at w(k+1) (i.e. at the (k+1)-th gradient descent iteration) can be written as:

J(w(k+1)) \approx J(w(k)) - \|\nabla J|_{w=w(k)}\|_2^2 \cdot \alpha(k) + \frac{1}{2} (\nabla J|_{w=w(k)})^T H_J|_{w=w(k)} (\nabla J|_{w=w(k)}) \cdot \alpha^2(k)
Tip: Take into account the gradient descent update rule w(k+1) = w(k) − α(k) · ∇J|_{w=w(k)}.
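For clarity, write g = \nabla J|_{w=w(k)} and H = H_J|_{w=w(k)}. The update rule gives w(k+1) - w(k) = -\alpha(k) g, so substituting into the expansion from part (a):

J(w(k+1)) \approx J(w(k)) + g^T (-\alpha(k) g) + \frac{1}{2} (-\alpha(k) g)^T H (-\alpha(k) g) = J(w(k)) - \|g\|_2^2 \, \alpha(k) + \frac{1}{2} g^T H g \, \alpha^2(k)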
(c) Show that minimizing the above expression with respect to the step size α(k) results in:

\alpha(k) = \frac{\|\nabla J|_{w=w(k)}\|_2^2}{(\nabla J|_{w=w(k)})^T H_J|_{w=w(k)} (\nabla J|_{w=w(k)})}
The above expression gives a closed-form solution for the step size at iteration k (i.e. α(k)) that minimizes the target function at the next iteration.
(d) What is the cost of computing α(k) at each iteration k using the above expression?
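As an illustration, here is a minimal sketch of gradient descent with this step size on a quadratic target J(w) = 0.5 w^T A w - b^T w, whose gradient and Hessian are available in closed form (the matrix A, vector b, iteration budget, and starting point are made-up examples). The Hessian term in the denominator is what drives the per-iteration cost.

import numpy as np

def gd_optimal_step(A, b, w, iters=50):
    """Gradient descent on J(w) = 0.5 * w^T A w - b^T w using the
    closed-form step size alpha(k) = ||g||_2^2 / (g^T H g)."""
    for _ in range(iters):
        g = A @ w - b          # gradient of J at the current w
        if g @ g < 1e-12:      # stop once the gradient vanishes
            break
        Hg = A @ g             # Hessian-vector product (H = A for this J)
        w = w - (g @ g) / (g @ Hg) * g
    return w

# Illustrative usage on a small symmetric positive-definite system.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_min = gd_optimal_step(A, b, np.zeros(2))  # approaches the solution of A w = b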
Question 3
Predicting forest fires: Forest fires are a major environmental issue endangering human
lives. This renders their fast detection a key element for controlling them and potentially
preventing them. Since it is hard for humans to monitor all forests, we can use automatic tools
based on local sensors to do that. Through these sensors we can get information regarding the
meteorological conditions, such as temperature, wind, relative humidity (RH), and amount of
rain. We can also compute several fire hazard indexes, such as the forest fire weather index
(FWI), fine fuel moisture code (FFMC), duff moisture code (DMC), drought code (DC), and
initial spread index (ISI). Using these measures, we can predict whether fire is going to occur in
the forest, as well as estimate the amount of burned area. Such data are part of the “Forest
Fires Data Set” of the UCI Machine Learning Repository and their description can be found
here: http://archive.ics.uci.edu/ml/datasets/Forest+Fires
Inside the “Homework 1” folder on Piazza you can find two files containing the train and test data (named “train.csv” and “test.csv”) for our experiments. The rows of these files refer to the data samples, while the columns denote the features (columns 1-12) and the outcome variable (column 13), as described below:
1. X: x-axis spatial coordinate of the forest: 1 to 9
2. Y: y-axis spatial coordinate of the forest: 2 to 9
3. month: month of the year: 1 to 12 to denote "jan" to "dec"
4. day: day of the week: 1 to 7 to denote "mon" to "sun"
5. FFMC: FFMC index from the FWI system
6. DMC: DMC index from the FWI system
7. DC: DC index from the FWI system
8. ISI: ISI index from the FWI system
9. temp: temperature in Celsius degrees
10. RH: relative humidity
11. wind: wind speed in km/h
12. rain: outside rain in mm/m²
13. area: the burned area of the forest (this is the outcome variable)
(a) Data exploration: Inspect the input features (e.g. you can plot histograms, scatter plots,
etc.). Which of the features are continuous and which categorical?
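For instance, a quick pandas sketch for loading the data and plotting per-feature histograms (it assumes the CSV files have no header row; adjust the read_csv call if they do):

import pandas as pd
import matplotlib.pyplot as plt

cols = ["X", "Y", "month", "day", "FFMC", "DMC", "DC",
        "ISI", "temp", "RH", "wind", "rain", "area"]
train = pd.read_csv("train.csv", header=None, names=cols)

train.hist(figsize=(12, 10))  # one histogram per column
plt.tight_layout()
plt.show()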
(b) Classification: From data exploration, we can notice that the outcome value (i.e. the
burned area) is zero for many samples, meaning that the corresponding forests are not affected
by fire. Therefore we can dichotomize the outcome variable, based on whether its corresponding
value is zero or greater than zero. This creates the following two classes:
Class 0: Forests not affected by the fire, i.e. area = 0
Class 1: Forests affected by the fire, i.e. area > 0
After dichotomizing the outcome variable, we can run a classification task to predict whether or
not fire will occur in a certain forest based on the input features.
(b.i) Implement a K-Nearest Neighbor (K-NN) classifier using the Euclidean distance as the distance measure to perform the above binary classification task; a from-scratch sketch follows part (b.iv). Reminder: Don't forget to normalize the features.
(b.ii) Explore different values of K through cross-validation on the training set. Plot the classification accuracy, i.e. (#samples correctly classified) / (total #samples), against the different
values of K.
(b.iii) Report the classification accuracy on the test set using the best K from cross-validation.
(b.iv) Bonus: Instead of using the Euclidean distance for all features, experiment with different types of distances or distance combinations, e.g. the Hamming distance for categorical features.
Report your findings.
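A minimal from-scratch sketch for parts (b.i)-(b.iii); the z-score normalization, the number of folds, and the tie-breaking toward class 1 are illustrative choices rather than requirements:

import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Euclidean K-NN: majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    neighbors = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[neighbors]                 # labels of the k neighbors
    return (votes.mean(axis=1) >= 0.5).astype(int)

def cv_accuracy(X, y, k, folds=5):
    """Cross-validated accuracy for one value of k; normalization is
    fitted on the training folds only, to avoid leakage."""
    splits = np.array_split(np.random.permutation(len(y)), folds)
    accs = []
    for i in range(folds):
        val = splits[i]
        tr = np.concatenate([splits[j] for j in range(folds) if j != i])
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0) + 1e-12
        pred = knn_predict((X[tr] - mu) / sd, y[tr], (X[val] - mu) / sd, k)
        accs.append((pred == y[val]).mean())
    return float(np.mean(accs))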
(c) Linear Regression: Among the forests that were affected by the fire, we can use linear
regression to predict the actual amount of area that was burned. For this task, we will only
use the samples of the train and test set with burned area (column 13) greater than
zero, i.e. area > 0.
(c.i) Plot the histogram of the outcome variable. What do you observe? Plot the histogram of
the logarithm of the outcome value, i.e. log(area). What do you observe now?
(c.ii) Implement a linear regression model to fit the outcome data using the ordinary least squares (OLS) solution; a minimal sketch follows part (c.iv).
(c.iii) Test your model on the test data and compute the residual sum of squares (RSS)
and the correlation between the actual and predicted outcome variable.
(c.iv) Bonus: Experiment with different non-linear functions of the input features. Report
your findings on the train and test sets.
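A minimal OLS sketch for parts (c.ii)-(c.iii); solving the least-squares problem with numpy (rather than writing out the normal equations) and fitting log(area), as the histograms in (c.i) suggest, are assumptions here:

import numpy as np

def add_intercept(X):
    return np.hstack([np.ones((len(X), 1)), X])

def fit_ols(X, y):
    """Ordinary least squares via numpy's least-squares solver."""
    w, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return w

def evaluate(w, X_test, y_test):
    """RSS and Pearson correlation between actual and predicted outcomes."""
    y_hat = add_intercept(X_test) @ w
    rss = ((y_test - y_hat) ** 2).sum()
    corr = np.corrcoef(y_test, y_hat)[0, 1]
    return rss, corr

# e.g. fit on y_train = np.log(area_train), using only the area > 0 samples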