MAST30034: Applied Data Science Assignment 1

Assignment Overview
Spatiotemporal datasets consist of both space and time dimensions and represent many
real-world applications such as medical imaging data, video data, weather patterns, and so
on. The aim of this assignment is to familiarize ourselves with dataset generation (by time
samples and pixel values), its preprocessing (by statistical measures), its analysis (by parameter
estimation), the visualization of its results (by graphs), and its evaluation (by performance
metrics). In this regard, you are free to choose any tool (R/Python/Matlab) to perform these tasks.
Assignment Details
In this assignment, you are going to attempt six main tasks, given as follows:
1. Generate a spatiotemporal synthetic dataset (that follows a linear regression model),
2. Perform data preprocessing steps,
3. Apply least squares regression and its variants on the generated (synthetic) dataset,
4. Visualize the generated dataset to address critical questions,
5. Select a regularization parameter using the mean square error (MSE) curve,
6. Use performance metrics to judge which estimator is better.
Least Square Regression (LSR)
The least squares estimator $\mathbf{A} = (\mathbf{D}^\top\mathbf{D})^{-1}\mathbf{D}^\top\mathbf{X}$ solves a linear regression problem of the type
$\mathbf{X} = \mathbf{D}\mathbf{A} + \mathbf{E}$, and in terms of a cost function it can be written as

$$\min_{\mathbf{a}_v} \|\mathbf{x}_v - \mathbf{D}\mathbf{a}_v\|^2, \quad v = 1, \ldots, V \qquad (1)$$

where $\mathbf{X} \in \mathbb{R}^{N \times V}$ is the observed dataset, $\mathbf{D} \in \mathbb{R}^{N \times L}$ contains the set of regressors, $\mathbf{A} \in \mathbb{R}^{L \times V}$
(known as the coefficient matrix) measures the relationship between the observed responses and
the regressors, $\mathbf{E}$ accounts for the model error, $\mathbf{x}_v$ and $\mathbf{a}_v$ are the $v$-th columns of $\mathbf{X}$ and $\mathbf{A}$,
respectively, and $\mathbf{x}^k$ is the $k$-th row. Also, $N$ is the number of time points, $L$ is the number
of regressors, and $V$ is the number of observed variables. Throughout this document, capital
bold letters represent matrices and small bold letters represent vectors.
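As a minimal R sketch (assuming matrices D (N x L) and X (N x V) already exist in the workspace; the helper name lsr_estimate is an assumption, not part of the assignment), the LSR estimate can be computed as:

lsr_estimate <- function(D, X) {
  # A = (D^T D)^{-1} D^T X, the least squares solution of X = DA + E
  solve(t(D) %*% D) %*% t(D) %*% X
}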
Synthetic Dataset Generation
A spatiotemporal synthetic dataset X can be constructed using temporal sources called time
courses (TC) and spatial sources called spatial maps (SM). The model presented in eq. (1) can
be used to generate a synthetic dataset where D becomes TC and A becomes SM. On this
synthetic dataset X one can then perform LSR using regressors from D (D = the source
TCs) to estimate the response signal strength A (the retrieved SMs), and then use A (the
retrieved SMs) to estimate D (the retrieved TCs).
The source TCs and SMs can either be statistically dependent on each other or independent.
For this assignment, a simple case is considered that avoids spatial dependence; however,
temporal correlation will be introduced to address the topic of multicollinearity (MC).
In order to mitigate the problem of MC, variants of LSR that introduce regularization into
their model, making them better at handling MC, can be used. Regularization improves the
efficiency of the estimate (reduced variance, or minimum mean square error) by introducing
a bias into the model that shrinks the coefficient estimates towards certain values (small
values near zero).
Ridge Regression (RR)
The $\ell_2$ norm of $\mathbf{a}_v$ in the objective function (1) is penalized by introducing a regularization
term $\tilde{\lambda}$ as

$$\min_{\mathbf{a}_v} \|\mathbf{x}_v - \mathbf{D}\mathbf{a}_v\|^2 + \tilde{\lambda}\sum_{l=1}^{L} \|a_v^l\|^2, \quad v = 1, \ldots, V, \; l = 1, \ldots, L \qquad (2)$$
This results in the ridge regression estimate $\mathbf{A} = (\mathbf{D}^\top\mathbf{D} + \tilde{\lambda}\mathbf{I})^{-1}\mathbf{D}^\top\mathbf{X}$. Here $\mathbf{I}$ is an identity
matrix of size $L \times L$, whereas the penalty term $\tilde{\lambda}$ is a scalar, which has no effect when
it is zero (RR then produces the LSR estimate); as it approaches infinity its impact increases
and the coefficients in $\mathbf{A}$ get close to zero. Generally, for our synthetic data case,
selecting $\lambda$ between 0 and 1 suffices only if we multiply whatever $\lambda$ we choose between
0 and 1 by the number of variables in the dataset, which is $V$ in our case, that is $\tilde{\lambda} = \lambda V$. We
will use a guess-and-check method for parameter selection ($\lambda$) in this case, but we will use a plot
of the sum of MSE versus the regularization parameter in the case of LASSO regression. Furthermore, RR
will not produce a sparse solution: all coefficients are shrunk towards zero by the same
factor, so it fails to eliminate any of them.
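As a minimal R sketch (assuming D and X as before; the helper name ridge_estimate and its lambda argument are illustrative assumptions), the RR estimate with lambda_tilde = lambda * V can be computed as:

ridge_estimate <- function(D, X, lambda) {
  # A = (D^T D + lambda_tilde * I)^{-1} D^T X, with lambda_tilde = lambda * V
  lambda_tilde <- lambda * ncol(X)
  solve(t(D) %*% D + lambda_tilde * diag(ncol(D))) %*% t(D) %*% X
}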
LASSO Regression (LR)
LASSO regression, on the other hand, yields a sparse solution, as it can eliminate less important
coefficients by shrinking them to zero. Let's use this regression to further enhance the
prediction accuracy for our case. Other than the inability of LSR to handle MC, the main
reason behind the poor performance of LSR and RR (both of them producing many false
positives while recovering coefficients) is that they incorporate the undesired (noise-carrying)
pixels into the estimate of $\mathbf{A}$, and this also affects the estimate of $\mathbf{D}$ in terms of overfitting.
In order to mitigate both overfitting and multicollinearity, the $\ell_1$ norm of $\mathbf{a}_v$ in
the objective function (1) is penalized by introducing a regularization term $\rho$ as
$$\min_{\mathbf{a}_v} \|\mathbf{x}_v - \mathbf{D}\mathbf{a}_v\|^2 + \rho\sum_{l=1}^{L} |a_v^l|, \quad v = 1, \ldots, V, \; l = 1, \ldots, L \qquad (3)$$
Luckily, its closed-form solution exists in soft thresholding as

$$s = \frac{1}{1.1\,\|\mathbf{D}^\top\mathbf{D}\|_2}, \qquad thr = N\rho s,$$
$$\mathbf{E} = \mathbf{a}_v^{(i)} + s\,\mathbf{D}^\top(\mathbf{x}_v - \mathbf{D}\mathbf{a}_v^{(i)}),$$
$$\mathbf{a}_v^{(i+1)} = \frac{1}{1 + thr}\left[\operatorname{sgn}(\mathbf{E}) \circ \left(|\mathbf{E}| - thr\,\mathbf{1}_L\right)_+\right], \qquad (4)$$
where $i = 1, \ldots, I$ represents the $i$-th iteration of the algorithm, $0 < \rho < 1$ controls the thresholding
of $\mathbf{a}_v$, and $\operatorname{sgn}(\cdot)$, $|\cdot|$, $(x)_+$, and $\circ$ denote the component-wise sign, the component-wise absolute
value, the component-wise $\max(0, x)$, and the Hadamard product, respectively. The R code
to solve (4) is given below; it is needed to answer Questions 2.3 and 2.4.
R Code for LR
step <- 1 / (norm(TC %*% t(TC)) * 1.1)       # step size s from eq. (4)
thr  <- rho * N * step                       # threshold thr = N * rho * s
Ao   <- matrix(0, nsrcs, 1)                  # previous iterate of a_v
A    <- matrix(0, nsrcs, 1)                  # current iterate of a_v
Alr  <- matrix(0, nsrcs, x1 * x2)            # LASSO estimate, one column per pixel
for (k in 1:(x1 * x2)) {
  # One gradient step followed by component-wise soft thresholding
  A <- Ao + step * (t(TC) %*% (X[, k] - (TC %*% Ao)))
  A <- (1 / (1 + thr)) * (sign(A) * pmax(replicate(nsrcs, 0), abs(A) - thr))
  for (i in 1:10) {                          # 10 further soft-thresholding iterations
    Ao <- A
    A  <- Ao + step * (t(TC) %*% (X[, k] - (TC %*% Ao)))
    A  <- (1 / (1 + thr)) * (sign(A) * pmax(replicate(nsrcs, 0), abs(A) - thr))
  }
  Alr[, k] <- A
}
Principal Component Regression (PCR)
PCR also addresses the multicollinearity problem: instead of regressing the observed
variables in $\mathbf{X}$ on the design matrix $\mathbf{D}$ directly, the principal components (PCs)
of the design matrix are used as regressors, and only a subset of all the principal components
is used for the regression. We can apply PCA to the matrix of regressors $\mathbf{D}$ to obtain a new
regressor matrix $\mathbf{Z}$ as
$$\min_{\mathbf{a}_v} \|\mathbf{x}_v - \mathbf{D}\mathbf{a}_v\|^2 = \min_{\mathbf{b}_v} \|\mathbf{x}_v - \mathbf{Z}\mathbf{b}_v\|^2, \quad v = 1, \ldots, V \qquad (5)$$
In order to estimate the PCs, you can use the svd command in Matlab; in Python it is found in
the numpy library, and in R the svd function is available in the base distribution. For instance, in Matlab:

[U, V, W] = svds(X, 5)
% the 5 indicates that we have only decomposed the data into the first five PCs
% U holds the eigenvectors of XX' and is where you obtain "Z" for your case
% W holds the eigenvectors of X'X, and V holds the singular values
Note that we did not take the PCA of the source TCs in order to generate the
dataset X; the TCs are kept intact for dataset generation. The PCs of the TCs are only
used as the regressors in Z when estimating B (see the sketch below).
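As a minimal R sketch of this step (assuming the standardized TC matrix already exists; following the Matlab comment above, Z is taken from the left singular vectors):

# Obtain the PC regressor matrix Z from the SVD of TC (base R, no extra library).
udv <- svd(TC)   # TC = U diag(d) W', with U the left singular vectors
Z   <- udv$u     # per the comment above, Z for your case comes from U
# B (i.e., A_PCR) is then estimated by regressing X on Z instead of on TC.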
Figure 1: First TC source (left), random correlation matrix (middle), first SM source (right)
Assignment Questions
In both questions, the tasks that carry marks are highlighted in italic text. For both questions,
and for the code given in the LR section, the value of N is 240, the value of V is 441, x1 = 21 and
x2 = 21 give the size of each slice, the value of ρ can be selected between 0 and 1, nsrcs is the
number of sources = 6, and X is the standardized generated dataset. For Question 1.4, σ stands
for standard deviation. For Question 2, always apply the absolute value operation to the retrieved
coefficient values in A, as this will make its plots easier to visualize; whatever variant of LSR you
encounter, you can always estimate D using $\mathbf{D} = \mathbf{X}\mathbf{A}^\top$.
Question 1: Synthetic dataset generation, data preprocessing, & data visualization
1. Construct a matrix TC of size 240 × 6 consisting of six temporal sources using three
vectors: 1) an onset arrival vector (AV) = [0,20,0,0,0,0], 2) an increment vector (IV) =
[30,45,60,40,40,40], and 3) a duration-of-ones vector = [15,20,25,15,20,25]. For instance, you
can generate the first TC using these three vectors as AV:IV:(N-20) = 0:30:220, with the ones
staying active for 15 samples (N here is 240); a sketch of this construction appears after this list.
This TC is also shown in Figure 1 (left).
Mean-center each TC by subtracting its mean, and standardize each TC by dividing it
by its standard deviation. This will make the TCs bias-free (centered around the origin)
and equally important (of unit variance). Plot all TCs as six subplots. Why not
normalize (divide by the l-2 norm) the TCs instead of standardizing them?
2. A randomly generated correlation matrix (CM) (illustrating uncorrelatedness among
all variables) is shown as a sample in Figure 1 (middle). For your case, construct a CM
that represents the correlation values between the 6 variables. Show its plot. Can you tell
visually which two TCs are highly correlated? If not, can you tell this from the CM?
3. Construct an array tmpSM of size 6 × (21 × 21) consisting of ones and zeros, by placing
ones at the following pixels along the "vertical, horizontal" directions of the slice: i) 02:06, 02:06;
ii) 02:06, 15:19; iii) 08:13, 02:06; iv) 08:13, 15:19; v) 15:19, 02:06; vi) 15:19, 15:19. The first
SM source is also shown in Figure 1 (right). Plot these SMs in six subplots. Reshape
the array tmpSM into a two-dimensional matrix, call it SM, of size 6 × 441.
Using the CM, show whether these 6 vectorized SMs are independent. For our particular case, why
is standardization of the SMs, as was done for the TCs, not important? Hint: imagine one of the slices has
pixel values of 5, while the others remain at 1.
4. Generate zero-mean white Gaussian noise for the temporal and spatial sources, denoted
$\Gamma_t \in \mathbb{R}^{240 \times 6}$ and $\Gamma_s \in \mathbb{R}^{6 \times 441}$. Besides their dimensions, another difference between
the spatial and temporal noise is the noise variance, which is 0.25 for $\Gamma_t$ and 0.015 for
$\Gamma_s$. Using a 6 × 6 CM for each noise type (spatial and temporal), can you show whether they
are correlated across sources? Also plot the histograms of both noise sources to see whether
they follow a normal distribution. Does this normal distribution fulfil the mean and
variance = 1.96σ criteria relating to 0.25, 0.015, and zero mean? Is their product $\Gamma_t\Gamma_s$
correlated across the V variables?
5. Generate a synthetic dataset X of size 240 × 441 as X = (TC + $\Gamma_t$) × (SM + $\Gamma_s$). This
builds a dataset that follows the model shown in eq. (1). Can the products TC × $\Gamma_s$
and $\Gamma_t$ × SM exist? If yes, what happened to them, because if we keep them then we
cannot fit our model onto (1)? Plot at least 100 randomly selected time series from X;
a few of them are shown in Figure 2 (left). Also plot the variance of all 441 variables on
a separate plot. What information does this plot give you? At the end, standardize the
dataset X, because the source TCs (regressors) are also standardized, and so the dataset should
be too. (A sketch of this generation step appears after this list.)
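A minimal R sketch of the construction of the first TC described in Question 1.1 (the loop bounds are an illustrative reading of AV:IV:(N-20) with 15 active samples):

# Build the first TC: onsets at 0:30:220, each followed by 15 ones.
N   <- 240
TC1 <- numeric(N)
for (onset in seq(0, N - 20, by = 30)) {
  TC1[(onset + 1):(onset + 15)] <- 1     # ones stay active for 15 samples
}
TC1 <- (TC1 - mean(TC1)) / sd(TC1)       # mean-centre, then standardize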
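And a minimal R sketch of the dataset-generation step in Question 1.5 (assuming TC (240 x 6) and SM (6 x 441) have already been constructed as in Questions 1.1 and 1.3; the variances follow Question 1.4):

# Generate the noise terms and the synthetic dataset X = (TC + Gt)(SM + Gs).
N <- 240; V <- 441; nsrcs <- 6
Gt <- matrix(rnorm(N * nsrcs, mean = 0, sd = sqrt(0.25)),  N, nsrcs)  # temporal noise
Gs <- matrix(rnorm(nsrcs * V, mean = 0, sd = sqrt(0.015)), nsrcs, V)  # spatial noise
X  <- (TC + Gt) %*% (SM + Gs)
X  <- scale(X)  # standardize each variable, as the question requires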
Question 2: Data analysis, results visualization, & performance metrics
1. The synthetic standardized dataset X that you generated in Question 1 follows
the linear regression model $\mathbf{X} = \mathbf{D}\mathbf{A} + \mathbf{E}$, where the unknown $\mathbf{A}$ can be estimated
using least squares. The set of regressors $\mathbf{D}$ is known, since you used them to
generate X, and these were the TCs, so D = TC. Estimate $\mathbf{A}$ (the retrieval of the SMs) using
the least squares solution $\mathbf{A}_{LSR} = (\mathbf{D}^\top\mathbf{D})^{-1}\mathbf{D}^\top\mathbf{X}$. Similarly, retrieve the TCs in $\mathbf{D}_{LSR}$ for the
known $\mathbf{A}_{LSR}$ using $\mathbf{D}_{LSR} = \mathbf{X}\mathbf{A}_{LSR}^\top$. Plot the six retrieved sources from $\mathbf{A}_{LSR}$ and $\mathbf{D}_{LSR}$
side by side, as shown in Figure 2 (right) for one of the retrieved sources. Make a scatter
plot between the 3rd column of $\mathbf{D}_{LSR}$ and the 30th column of the standardized X; you will find a
linear relationship between them. Why does this relationship not exist between the 4th column of $\mathbf{D}_{LSR}$
and the same column of X? Hint: look at the arrangement of the slices; which source TCs do you
think contributed to building the 30th data element?
2. Estimate the RR parameters $\mathbf{A}_{RR} = (\mathbf{D}^\top\mathbf{D} + \tilde{\lambda}\mathbf{I})^{-1}\mathbf{D}^\top\mathbf{X}$ and $\mathbf{D}_{RR}$, and then compare
RR to LSR by estimating two correlation vectors that retain only the maximum absolute
correlations: i) between each TC and $\mathbf{D}_{LSR}$, stored in $c_{TLSR}$, and ii) between each
TC and $\mathbf{D}_{RR}$, stored in $c_{TRR}$. Calculate the sums of these two correlation vectors
(a sketch of this maximum-absolute-correlation computation appears after this list).
If you have carefully selected the value of λ between 0 and 1, you must end up with
$\sum c_{TRR} > \sum c_{TLSR}$; remember $\tilde{\lambda} = \lambda V$. Also, for λ = 1000, plot the first vector from
$\mathbf{A}_{RR}$ and the corresponding vector from $\mathbf{A}_{LSR}$. Do you find all values in $\mathbf{a}_{RR}^1$ shrinking
towards zero?
3. For 21 values of ρ selected between 0 and 1 with an interval of 0.05, estimate the LR parameters
$\mathbf{A}_{LR}$ and $\mathbf{D}_{LR}$ and the sum of MSE using the LR parameters as
$\sum_{v=1}^{V} \|\mathbf{x}_v - \mathbf{D}_{LR}\mathbf{a}_v\|_2^2 / (NV)$, equivalently $\|\mathbf{X} - \mathbf{D}_{LR}\mathbf{A}_{LR}\|_2^2 / (NV)$, where $\|\cdot\|_2$ here is the Frobenius norm.
Repeat this process 10 times (10 realizations), each time with a new standardized
X (new in terms of $\Gamma_t$ and $\Gamma_s$). Then plot the average of the MSE over these 10 realizations
against each value of ρ. At what value of ρ do you find the minimum MSE? Is it okay
to select this value? At what value of ρ did the MSE start to increase again (LR diverged)?
Hint: the Matlab code for the MSE calculation over one realization and one λ value, where rr = 1
is the first realization and kk = 1 is the first λ, is given below (an R equivalent is sketched after this list):
MSE(kk, rr) = sum(sum((Y - Dlr*Alr).^2)) / (N*V)

Figure 2: A few variables from X (left); the first retrieved spatiotemporal source from X via LSR (right)
4. Estimate the LR parameters for the value of ρ that you selected in Question 2.3.
Now estimate four correlation vectors that retain only the maximum absolute correlations:
i) between each TC and $\mathbf{D}_{RR}$, stored in $c_{TRR}$; ii) between each SM and $\mathbf{A}_{RR}$,
stored in $c_{SRR}$; iii) between each TC and $\mathbf{D}_{LR}$, stored in $c_{TLR}$; and iv)
between each SM and $\mathbf{A}_{LR}$, stored in $c_{SLR}$. Calculate the sums of these four
correlation vectors. If you have carefully selected the value of ρ, you must end up with
$\sum c_{TLR} > \sum c_{TRR}$ and $\sum c_{SLR} > \sum c_{SRR}$. Plot the estimates of $\mathbf{D}$ and $\mathbf{A}$
for both RR and LR side by side in the form of 4 columns to see the difference visually. You will
see a major difference in the estimates of $\mathbf{A}$ in terms of false positives. Can you give
the reason behind this difference?
5. Estimate the PCs of the TCs and plot their eigenvalues. For which PC is the eigenvalue
the smallest? Plot the regressors in $\mathbf{Z}$ and the source TCs side by side. Did you notice
the deteriorated shape of the PCs? Why has the shape of the TCs been lost? Now, keeping all
components in $\mathbf{Z}$, apply LASSO regression on X using ρ = 0.001, and then plot the results
of $\mathbf{D}_{PCR}$ and $\mathbf{A}_{PCR}$ side by side (note that $\mathbf{A}_{PCR} = \mathbf{B}$ and your regressors are in $\mathbf{Z}$,
the PCs of the TCs). Did you notice the inferior performance of PCR compared to the
other three regression models? Why is that so?
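As a minimal R sketch of the maximum-absolute-correlation computation used in Questions 2.2 and 2.4 (the helper name max_abs_corr is an assumption; sources and estimates are assumed to be column-aligned matrices):

max_abs_corr <- function(S, Est) {
  # cor(S, Est) correlates every source column with every estimated column;
  # keep, for each source, the largest absolute correlation.
  apply(abs(cor(S, Est)), 1, max)
}
# Example usage (assuming TC, Dlsr, and Drr exist):
# cTLSR <- max_abs_corr(TC, Dlsr)
# cTRR  <- max_abs_corr(TC, Drr)
# sum(cTRR) > sum(cTLSR)   # expected for a well-chosen lambda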
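And, for the MSE hint in Question 2.3, a one-line R equivalent of the Matlab snippet (Y in the Matlab code corresponds to the standardized dataset X; MSE is assumed to be a pre-allocated 21 x 10 matrix):

# Sum of squared residuals over all N x V entries, divided by N * V.
# Assumes X, Dlr, Alr, N, V, kk, rr exist; MSE <- matrix(0, 21, 10) beforehand.
MSE[kk, rr] <- sum((X - Dlr %*% Alr)^2) / (N * V)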