ECE 445 – Exercise #2 solution

1. Feature Engineering for Environmental Sensor Telemetry Data
In this part of the exercise, we will focus on ‘feature engineering’ (aka, hand-crafted features) for the Environmental
Sensor Telemetry Data. This dataset corresponds to time-series data collected using three identical, custom-built,
breadboard-based sensor arrays mounted on three Raspberry Pi devices. This dataset was created with the hope
that temporal fluctuations in the sensor data of each device might enable machine learning algorithms to determine
when a person is near one of the devices. You can read further about this dataset at Kaggle using the link provided
below. The dataset is stored as a CSV file, which is provided to you as part of this exercise.
Dataset link: https://kaggle.com/rjconstable/environmental-sensor-telemetry-dataset
Dataset csv filename: iot_telemetry_dataset.csv
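As a starting point, the dataset can be read with pandas. The snippet below is a minimal sketch, assuming the CSV file sits in the working directory and carries the column names described on the Kaggle page (ts, device, co, humidity, light, lpg, motion, smoke, temp).

```python
import pandas as pd

# Load the telemetry data distributed with the exercise.
df = pd.read_csv("iot_telemetry_dataset.csv")

# Sanity checks: column types and the absence of missing entries.
print(df.dtypes)
print(df.isna().sum())
```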
1. We now turn our attention to preprocessing of the dataset, which includes one-hot encoding of categorical
variables and standardization of non-categorical variables that don't represent time. Note that no further
preprocessing is needed for this data since the dataset does not have any missing entries (be aware that
real-world data in most problems is never this nice!).
(a) One-hot encode the categorical variables device, light, and motion. In this exercise (and subsequent exercises), you are allowed to use the pandas.get_dummies() method for this purpose. (3 points)
(b) Standardize the dataset by making the data associated with co, humidity, lpg, smoke, and temp
variables zero mean and unit variance. Such standardization, however, must be done separately for
data associated with each device. This is an important lesson for practical purposes, as data samples
associated with different devices cannot be thought of as having the same mean and variance. (6 points)
(c) Print the first 20 samples of the preprocessed data (e.g., using the pandas.DataFrame.head() method).
(1 point)
(d) Why do you think the ts variable in the dataset has not been touched during preprocessing? Comment
as much as you can in a markdown cell. (1 point)
(e) Provide two grouped bar charts, grouped by the three devices, for the original means and
variances (i.e., before standardization) associated with the co, humidity, lpg, smoke, and temp variables.
Comment on any observations that you can make from these charts; a code sketch for parts (a)–(c) and (e) is given below. (3 points)
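The following is a minimal sketch of parts (a), (b), (c), and (e), assuming the column names from the Kaggle description and that df holds the raw data loaded above. The per-device statistics are computed before one-hot encoding so that the original device column is still available for grouping.

```python
import pandas as pd
import matplotlib.pyplot as plt

sensor_cols = ["co", "humidity", "lpg", "smoke", "temp"]

# (e) Per-device means and variances before standardization, as grouped bar charts.
means = df.groupby("device")[sensor_cols].mean()
variances = df.groupby("device")[sensor_cols].var()
means.plot.bar(title="Per-device means (before standardization)")
variances.plot.bar(title="Per-device variances (before standardization)")
plt.show()

# (b) Standardize each sensor column separately within each device (zero mean, unit variance).
df[sensor_cols] = df.groupby("device")[sensor_cols].transform(
    lambda s: (s - s.mean()) / s.std()
)

# (a) One-hot encode the categorical variables.
df = pd.get_dummies(df, columns=["device", "light", "motion"])

# (c) Inspect the first 20 samples of the preprocessed data.
print(df.head(20))
```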
2. Map the co, humidity, lpg, smoke, and temp variables for each data sample into the following four independent features (5 points):
(a) mean of the five independent variables (e.g., use mean() function in either pandas or numpy)
(b) variance of the five independent variables (e.g., use var() function in either pandas or numpy)
(c) kurtosis of the five independent variables (e.g., use kurtosis() function in scipy.stats)
(d) skewness of the five independent variables (e.g., use skew() function in scipy.stats)
Print the first 40 samples of the transformed dataset (e.g., using the pandas.DataFrame.head() method),
which has four features calculated from the five independent variables; a code sketch is given below.
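Continuing from the preprocessing sketch, the four hand-crafted features can be computed row-wise over the five sensor columns. This is only a sketch, and it assumes df still holds the standardized data from above.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

sensor_cols = ["co", "humidity", "lpg", "smoke", "temp"]

# Row-wise statistics over the five sensor readings of each data sample.
features = pd.DataFrame({
    "mean": df[sensor_cols].mean(axis=1),
    "variance": df[sensor_cols].var(axis=1),
    "kurtosis": kurtosis(df[sensor_cols], axis=1),
    "skewness": skew(df[sensor_cols], axis=1),
}, index=df.index)

# First 40 samples of the transformed dataset.
print(features.head(40))
```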
Remark 1: One of the things you will notice is that some of the terminology used in the descriptions
of the functions in scipy.stats might not be familiar to you. It is my hope that you can try to get a
handle on this terminology by digging into resources such as Wikipedia, Stack Overflow, Google,
etc. Helping you become comfortable with the idea of lifelong self-learning is one of the goals of this course.
2. Feature Learning for Synthetically Generated Data
In this part of the exercise, we will focus on ‘feature learning’ using Principal Component Analysis (PCA).
In order to grasp the basic concepts underlying PCA, we limit ourselves in this exercise to synthetically
generated three-dimensional data samples (i.e., p = 3) that actually lie on a two-dimensional subspace (i.e.,
k = 2).
a) In order to create synthetic data in $\mathbb{R}^3$ that lies on a two-dimensional subspace, we need basis vectors
(i.e., a basis) for the two-dimensional subspace. You will generate such a basis (matrix) randomly, as
follows; a code sketch covering steps (i)–(iv) is given after item (iv).
(i) Create a matrix $A \in \mathbb{R}^{3 \times 2}$ whose individual entries are drawn from a Gaussian distribution with
mean 0 and variance 1 in an independent and identically distributed (iid) fashion. While this can
be accomplished in a number of ways in Python, you might want to use the numpy.random.randn()
method for this purpose. Once generated, this matrix should not be changed for the rest of this
exercise. (2 points)
(ii) Matrices with iid Gaussian entries are full rank with probability one, which makes the matrix A a basis matrix
whose column space is a two-dimensional subspace in $\mathbb{R}^3$. Verify this by printing the rank of A; it
should be 2. (1 point) Note: The numpy.linalg package (https://docs.scipy.org/doc/numpy/
reference/routines.linalg.html) is one of the best packages for most linear algebra operations in Python.
(iii) Note that the basis vectors in A are neither unit-norm, nor orthogonal to each other. Verify this by
printing the norm of each vector in A as well as the inner product between the two vectors in A. (1
point)
(iv) Let S denote the subspace corresponding to the column space of the matrix A. Generate and print
three unique vectors that lie in the subspace S. (1 point)
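Parts (i)–(iv) can be sketched as follows; this is one possible realization, using numpy.random.randn() as suggested in the text.

```python
import numpy as np

# (i) 3x2 basis matrix with iid N(0, 1) entries; keep it fixed for the rest of the exercise.
A = np.random.randn(3, 2)

# (ii) The rank should come out as 2 (full column rank) with probability one.
print("rank(A) =", np.linalg.matrix_rank(A))

# (iii) Column norms and the inner product between the two basis vectors.
print("norms:", np.linalg.norm(A[:, 0]), np.linalg.norm(A[:, 1]))
print("inner product:", A[:, 0] @ A[:, 1])

# (iv) Any linear combination of the columns of A lies in the subspace S.
for b in ([1.0, 0.0], [0.0, 1.0], [2.0, -3.0]):
    print(A @ np.array(b))
```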
b) We now turn our attention to the generation of synthetic data. We will resort to a ‘random’ generation
mechanism for this purpose. Specifically, each of our (unlabeled) data samples $x \in \mathbb{R}^3$ is going to be
generated as follows: $x = Ab$, where $b \in \mathbb{R}^2$ is a random vector whose entries are iid Gaussian with
mean 0 and variance 1. Note that we will have a different b for each new data sample (i.e., unlike A, it
is not fixed across data samples). A code sketch covering steps (i)–(iv) is given after item (iv).
(i) Generate 250 data samples $\{x_i\}_{i=1}^{250}$ using the aforementioned mathematical model. (2 points)
(ii) Does each data sample $x_i$ lie in the subspace S? Justify your answer. (1 point)
(iii) Store the data samples into a data matrix $X \in \mathbb{R}^{n \times p}$ such that each data sample is a row in this data
matrix. What are n and p in this case? Print the dimensionality of X and confirm that it matches your
answer. (3 points)
(iv) Since we can write $X^T = AB$, where $B \in \mathbb{R}^{2 \times 250}$ is a matrix whose columns are the vectors
$b_i$ corresponding to the data samples $x_i$, the rank of X is 2 (can you see why? Perhaps refer to
Wikipedia?). Verify this by printing the rank of X. (1 point)
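A sketch of parts (i)–(iv), reusing the matrix A generated above:

```python
import numpy as np

n, p, k = 250, 3, 2

# (i) Each sample is x = A b with a fresh b whose entries are iid N(0, 1).
B = np.random.randn(k, n)      # columns are the b_i vectors
X = (A @ B).T                  # X is n x p, one data sample per row

# (iii) Dimensionality check: n = 250 samples, p = 3 coordinates.
print("X has shape", X.shape)

# (iv) rank(X) = rank(X^T) = rank(A B) = 2.
print("rank(X) =", np.linalg.matrix_rank(X))
```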
c) Before turning our attention to the calculation of PCA features for our data samples, we first investigate the
relationship between the eigenvectors of the scaled covariance matrix $X^T X$ and the right singular vectors
of X. A code sketch for part (i) is given at the end of this part.
(i) Compute the singular value decomposition (SVD) of X and the eigenvalue decomposition (EVD)
of $X^T X$ and verify (by printing) that:
(a) The right singular vectors of X correspond to the eigenvectors of $X^T X$. Hint: Recall that the eigenvalue decomposition does not necessarily list the eigenvalues in decreasing order. You would
need to be aware of this fact to appropriately match the eigenvectors and singular vectors. (2
points)
(b) The eigenvalues of $X^T X$ are the squares of the singular values of X. (2 points)
(c) The energy in X, defined by $\|X\|_F^2$, is equal to the sum of squares of the singular values of X. (2
points)
(ii) Since the rank of X is 2, the entire dataset spans only a two-dimensional subspace in
$\mathbb{R}^3$. We now dig a bit deeper into this.
(a) Since the rank of X is 2, we should ideally have only two nonzero singular values of X. However,
unless you are really lucky, you will see that none of your singular values are exactly zero.
Comment on why that might be happening (and if you are the lucky one, run your code
again and you will hopefully become unlucky :). (2 points)
(b) What do you think is the relationship between the right singular vectors of X corresponding
to the two largest singular values and the subspace S? Try to be as precise and mathematically
rigorous as you can. (3 points)
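For part (i), the SVD/EVD comparison can be sketched as follows; note that numpy.linalg.eigh returns eigenvalues in ascending order and that eigenvectors are only defined up to a sign, which is why the reordering and absolute values appear below.

```python
import numpy as np

# SVD of X: the rows of Vt (columns of Vt.T) are the right singular vectors.
_, s, Vt = np.linalg.svd(X, full_matrices=False)

# EVD of the scaled covariance matrix X^T X (symmetric, so eigh is appropriate).
evals, evecs = np.linalg.eigh(X.T @ X)
evals, evecs = evals[::-1], evecs[:, ::-1]   # flip to decreasing order to match the SVD

# (a) Right singular vectors vs. eigenvectors (compare up to sign).
print(np.abs(Vt.T) - np.abs(evecs))

# (b) Eigenvalues of X^T X vs. squared singular values of X.
print(evals - s**2)

# (c) Energy ||X||_F^2 vs. the sum of squared singular values.
print(np.linalg.norm(X, "fro")**2 - np.sum(s**2))
```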
d) We finally turn our attention to PCA of the synthetic dataset, which is stored in the matrix X. Our focus in
this problem is on the computation of PCA features for k = 2, computation of the projected data (also termed
reconstructed data), and the sum of squared errors (also termed representation error or PCA error). A code
sketch covering parts (ii)–(x) is given at the end of this part.
(i) Since each data sample $x_i$ lies in a three-dimensional space, we can have up to three principal
components of this data. However, based on your knowledge of how the data was created (and
subsequent discussion above), how many principal components should be enough to capture all
variation in the data? Justify your answer as much as you can, especially in light of the discussion
in class. (2 points)
(ii) While mean centering is an important preprocessing step for PCA, we do not necessarily need to
carry out mean centering in this problem, since the mean vector for this dataset will have very small
entries. Indeed, if we let $x_1$, $x_2$, and $x_3$ denote the first, second, and third components of the random
vector x, then it follows that $\mathbb{E}[x_k] = 0$, $k = 1, 2, 3$.
i. Formally show that $\mathbb{E}[x_k] = 0$, $k = 1, 2, 3$, for our particular data generation method. (3 points)
ii. Compute the (empirical) mean vector $\hat{\mu}$ from the data matrix X and verify by printing that its
entries are indeed small. (2 points)
(iii) Compute the top two principal component directions (loading vectors) $U = \begin{bmatrix} u_1 & u_2 \end{bmatrix}$ of this dataset
and print them. (3 points)
(iv) Compute feature vectors $\tilde{x}_i$ from the data samples $x_i$ by ‘projecting’ the data onto the top two principal
component directions of X. (2 points)
(v) Reconstruct (approximate) the original data samples $x_i$ from the PCA feature vectors $\tilde{x}_i$ by computing $\hat{x}_i = U\tilde{x}_i$. (2 points)
(vi) Ideally, since the data comes from a two-dimensional subspace, the representation error (aka the
PCA error) $\sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2 = \|\hat{X} - X\|_F^2$
should be zero. Verify (unless, again, you are super lucky) that this is, in fact, not the case. This
error, however, is so small that it can be treated as zero for all practical purposes. (2 points)
(vii) Now compute feature vectors $\tilde{x}_i$ from the data samples $x_i$ by projecting the data onto only the top principal
component direction of X. (2 points)
(viii) Reconstruct (approximate) the original data samples $x_i$ from the PCA feature vectors $\tilde{x}_i$ by computing $\hat{x}_i = u_1\tilde{x}_i$. (2 points)
(ix) Compute the representation error $\|\hat{X} - X\|_F^2$ and show that this error is equal to the square of the
second-largest singular value of X. (2 points)
(x) Using mpl_toolkits.mplot3d, display two 3D scatterplots corresponding to the original data
samples $x_i$ and the reconstructed data samples $\hat{x}_i$ corresponding to the top principal component.
Comment on the shape of the scatterplot for the reconstructed samples and the mathematical reason
for this shape. (4 points)
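The PCA computations in parts (ii)–(x) can be sketched directly from the SVD of X (skipping mean centering, as discussed in part (ii)). This is a sketch of one possible implementation, reusing the data matrix X generated above.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection)

# (ii) Empirical mean vector; its entries should be close to zero.
print("empirical mean:", X.mean(axis=0))

# (iii) Top two principal component directions = top two right singular vectors of X.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:2].T                              # 3x2 loading matrix [u1 u2]
print(U)

# (iv)-(vi) Project onto the top two directions, reconstruct, and measure the error.
Z2 = X @ U                                # n x 2 PCA feature vectors
X_hat2 = Z2 @ U.T                         # reconstruction with k = 2
print("k = 2 error:", np.linalg.norm(X_hat2 - X, "fro")**2)   # tiny, but not exactly zero

# (vii)-(ix) Repeat with only the top principal component direction.
u1 = Vt[:1].T                             # 3x1
Z1 = X @ u1
X_hat1 = Z1 @ u1.T
print("k = 1 error:", np.linalg.norm(X_hat1 - X, "fro")**2, "vs sigma_2^2 =", s[1]**2)

# (x) 3D scatterplots of the original and the k = 1 reconstructed samples.
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(121, projection="3d")
ax1.scatter(X[:, 0], X[:, 1], X[:, 2])
ax1.set_title("Original samples")
ax2 = fig.add_subplot(122, projection="3d")
ax2.scatter(X_hat1[:, 0], X_hat1[:, 1], X_hat1[:, 2])
ax2.set_title("Reconstruction with the top principal component")
plt.show()
```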