STAT580 Homework 1 to 4 solutions


STAT580 Homework 1

1. Let $X_1, X_2, \ldots, X_n$ be a sample from the standard uniform $U(0, 1)$ distribution. Find
the distribution of $\left(\sum_{i=1}^{n} X_i\right) \bmod 1$. [10 points]

2. Let $F$ be a cumulative distribution function (c.d.f.). Define $F^{[-1]}(u) = \min\{x \mid F(x) \ge u\}$.
Show that if $U \sim U(0, 1)$, then $F^{[-1]}(U) \sim F$. [5 points]

3. There are many methods of sampling from the standard normal distribution.
(a) Prove that the Box-Muller algorithm provides two independent standard normal
random deviates. [5 points]
(b) Prove that the Polar algorithm for simulating from the standard normal distribution provides two independent standard normal random deviates. [5 points]

4. Sampling from the tail of a standard normal. Perform a simulation experiment in R (or
any other software) to map the rejection rate with increasing d. Compare the result
with the theoretical naive rejection rate. (For this experiment, no code is required to
be turned in, but you are required to describe the experiment that you set up, and also
detail the results, using appropriate figures (and tables) if needed.) [10 points]

5. Compressed Column Storage. Consider the $n \times p$ sparse matrix $W = ((w_{ij}))$ with $m$
non-zero entries, stored in the following Compressed Column Storage (CCS) format:
• Consider the triple $A = (a, r, c)$, where $a = (a_1, a_2, \ldots, a_m)$ is an $m$-dimensional vector containing the non-zero values in $W$ (stored column-wise),
$r = (r_1, r_2, \ldots, r_m)$ is the $m$-dimensional vector with $r_i$ storing the row index
of the corresponding $a_i$, and $c = (c_1, c_2, \ldots, c_{p+1})$ is a $(p + 1)$-dimensional vector
with $c_k$ indicating the element of $a$ that starts column $k$ of $W$. By convention,
$c_{p+1} = m + 1$. Thus, $a_i \equiv w_{r_i, k}$, where $c_k \le i < c_{k+1}$. Matrices may alternatively
be stored in the Compressed Row Storage (CRS) format, defined correspondingly.

(a) Consider the product of $W$ with a $p$-variate vector $x$. Let $y = Wx$. Provide an
algorithm for calculating the $j$th element of $y$. That is, write an algorithm which
will calculate $y_j$ given $W$ in CCS format $A = (a, r, c)$ and without expanding it
back to $W$. No programming is required; only the algorithm in pseudo-code
should be provided. [10 points]

(b) Suppose that we now have a sparse symmetric p × p matrix V . Answer the
following questions:
i. Develop a similar CCS-type strategy for storing the matrix V in a packed
CCS format. [5 points]
ii. Provide, in pseudo-code, an algorithm for finding the jth element yj of y =
V x. [5 points]

6. Let $U_1, U_2, \ldots, U_{n_U}$ be a sample from Population 1 and $V_1, V_2, \ldots, V_{n_V}$ be a sample from Population 2. Consider the following samples, hypothesized (but not known
for sure) to be from different populations as follows: $W_1, W_2, \ldots, W_{n_W}$ from Population 1, $X_1, X_2, \ldots, X_{n_X}$ from Population 2, $Y_1, Y_2, \ldots, Y_{n_Y}$ from Population 3, and
$Z_1, Z_2, \ldots, Z_{n_Z}$ from Population 4. The total within sum of squares (WSS) in this
scenario is given by

$$
\sum_{i=1}^{n_U} \|U_i - \hat\mu_1\|^2 + \sum_{i=1}^{n_V} \|V_i - \hat\mu_2\|^2 + \sum_{i=1}^{n_W} \|W_i - \hat\mu_1\|^2 + \sum_{i=1}^{n_X} \|X_i - \hat\mu_2\|^2 + \sum_{i=1}^{n_Y} \|Y_i - \hat\mu_3\|^2 + \sum_{i=1}^{n_Z} \|Z_i - \hat\mu_4\|^2, \tag{1}
$$

where $\hat\mu_1 = (n_U \bar U + n_W \bar W)/(n_U + n_W)$, $\hat\mu_2 = (n_V \bar V + n_X \bar X)/(n_V + n_X)$, $\hat\mu_3 = \bar Y$
and $\hat\mu_4 = \bar Z$. The above total WSS can be rewritten as a sum of the WSS from the
samples $U_1, U_2, \ldots, U_{n_U}$ and $V_1, V_2, \ldots, V_{n_V}$ plus some terms. (This is important
in the context of the following questions.) We are interested in finding how the total
WSS changes when we reassign an (unsure) observation from one population to the
other. To do so, we will investigate three possibilities (there is actually a fourth,
but that is the converse of Case (b) below).

(a) Suppose we reassign $W_{n_W}$ from Population 1 to Population 2. How does the total
WSS in (1) change? (Note that such a reassignment affects both $\hat\mu_1$ and $\hat\mu_2$.) [10 points]

(b) Suppose that, from the original setup leading to (1), we reassign $W_{n_W}$ from
Population 1 to Population 3. How does the total WSS in (1) change? (Note that
such a reassignment affects both $\hat\mu_1$ and $\hat\mu_3$.) [10 points]

(c) Suppose that, from the original setup behind (1), we reassign $Y_{n_Y}$ from Population
3 to Population 4. How does the total WSS in (1) change? (Note that such a
reassignment affects both $\hat\mu_3$ and $\hat\mu_4$. Note also that an extension of this is the basis for
the Hartigan-Wong (1979) implementation of the k-means algorithm.) [10 points]

The reductions in all three parts above provide a stripped-down context of the basis
for an efficient implementation of the k-means semi-supervised clustering algorithm.
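
As a hedged aside, the single-cluster identities that underlie such reassignments can be stated in generic notation. The symbols $n$, $\hat\mu$, $\mathrm{WSS}_C$ and $x$ below are introduced here for illustration and are not part of the assignment. For a cluster $C$ of $n$ points with mean $\hat\mu$, adding a new point $x$ or removing a member $x$ changes the cluster's within sum of squares by

$$
\mathrm{WSS}_{C \cup \{x\}} - \mathrm{WSS}_C = \frac{n}{n + 1}\,\|x - \hat\mu\|^2,
\qquad
\mathrm{WSS}_{C \setminus \{x\}} - \mathrm{WSS}_C = -\frac{n}{n - 1}\,\|x - \hat\mu\|^2 .
$$

The first identity follows from the updated mean $\hat\mu' = (n\hat\mu + x)/(n + 1)$, which gives $\hat\mu - \hat\mu' = (\hat\mu - x)/(n + 1)$ and $x - \hat\mu' = n(x - \hat\mu)/(n + 1)$; the second follows in the same way with $n - 1$ in place of $n + 1$. Applying them to the pooled populations in (1) is what the three parts ask for.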

7. For each program, please write out how you compiled and executed the program, together with the result.
You may include this as a comment in the code.
(a) Write an annotated C program which converts temperature, from Fahrenheit to
the Celsius scale and vice-versa. [7 points]

(b) Write an annotated C program which takes in two integers i = 320 and j = 256
and reports their product. Store the product as a short and as an int and print
the result. What differences do you see? Please put in your observations and
reasoning as a comment in the C program. [8 points]

STAT580 Homework 2

1. Let $X \sim N_p(0, I)$. Let $Y$ be a random variable having the $\chi^2_p$-density that is independent of $X$. Let $Z = \sqrt{Y}\, X / \|X\|$. Show that the density of $Z$ is also standard
multivariate normal. [10 points]

2. Write an example in C to illustrate that a function passes its arguments by value. To
do this, write a function which takes in two arguments: an integer and a pointer to
an integer and then increments each by 1. Follow the location of the arguments inside
and outside the function to illustrate the point. [10 points]

3. Write a function in C which illustrates the use of the matrix multiplication algorithm
using CCS representation that you wrote in the previous assignment. [20 points]

4. Let $X_1, X_2, \ldots, X_n$ be a random multivariate sample such that only the first
$p_i$ coordinates of $X_i$ are observed, where $1 \le p_i \le p$. Assume that each $X_i$ is a realization from
the $p_i$-dimensional marginal distribution of $N_p(\mu, \Sigma)$. Further, there are at least two
$i$s for which $p_i = p$. Answer the following questions:
(a) Find the maximum likelihood estimator of µ and Σ using direct maximization of
the loglikelihood. [15 points]

(b) Use the expectation-maximization algorithm to formulate the maximum likelihood estimator of µ and Σ. [20 points]

(c) Compare the number of computations (floating point operations) needed in one
EM step to the number of computations in the direct calculations. You may make
simplifying assumptions as needed for calculating the number of operations. [20
points]

(d) How do the results in (c) change if the pi observed coordinates are not the first
ones? [5 points]

STAT580 Homework 3

1. Write a C function to provide the largest eigenvalue and eigenvector of a nonnegative
definite matrix using the power method. The function should take in a function pointer
which defines the multiplication of a matrix in appropriate storage format and a vector.
[15 points]

(a) Use calls to the above function in another C function which provides the first m
eigenvalues of a positive definite matrix. [5 points]

(b) Demonstrate the above in an example program. [5 points]

2. Let $X_1, X_2, \ldots, X_n$ be a sample. Let $s_{ij} = s(X_i, X_j)$ be a similarity measure between
$X_i$ and $X_j$. For example, $s_{ij} = \mathrm{Corr}(X_i, X_j)$. Let $n_i^k$ be the set of $X_j$s which are
the $k$-nearest neighbors to $X_i$. That is, for each $i$, let $s_{ij_1} \ge s_{ij_2} \ge \ldots \ge s_{ij_{n-1}}$ be
the ordered similarities (in decreasing order) among $\{s_{ij} : j \in (1, 2, \ldots, i - 1, i + 1, i + 2, \ldots, n)\}$.
Then $X_{j_1}, X_{j_2}, \ldots, X_{j_k}$ are the $k$-nearest neighbors of $X_i$.

(a) Write a function n_k(i, j, …) in C (with appropriate arguments) which takes in
a dataset, a pair of observations (i, j) and a user-supplied similarity
measure, and returns 0 if i and j are not among the k-nearest neighbors of each other
and 1 if each is among the k-nearest neighbors of the other. Let $N_k$ be the
set of (i, j) for which the above function returns 1. [15 points]

(b) Use the above to write a function in C which takes in a dataset and a user-supplied
similarity measure and filters out those observations that are not a k-nearest
neighbor to another observation. In order to make this function easy to use in
big data problems, make sure that you do not unnecessarily store the distance
matrix. Test the function. [5 points]

3. For any $X_1, X_2, \ldots, X_n$ (assumed to be filtered as with a call to the previous function),
calculate the similarity matrix $W$ with non-zero elements $W_{ij} = s_{ij}$ when $(i, j) \in N_k$
and $s_{ij} = \mathrm{Corr}(X_i, X_j)$ when $\mathrm{Corr}(X_i, X_j) > \rho$.

Clearly, W is a sparse symmetric
matrix that can be stored in the sparse packed format. (Actually, it can be stored even
more efficiently, since the diagonals are all unity but we will ignore this for now.)

Let $G$ be the diagonal matrix with $W\mathbf{1}$ in the diagonals. Here $\mathbf{1} = (1, 1, \ldots, 1)'$.
It is common to think of the points $X_1, X_2, \ldots, X_n$ as nodes on a graph, with edges
between nodes weighted by similarities $W_{ij}$ and the diagonal entries $g_i$ of $G$ as the so-called node degrees, that
is, the sum of the weights of the edges connected to node $i$.

Consider the standardized (with respect to the node degrees) graph Laplacian matrix
$L = I - G^{-1}W$. Then $L$ is sparse and, further, the smallest eigenvalues of $L$
correspond (in reverse order) to the largest eigenvalues of $G^{-1}W$, and they share
the same respective eigenvectors.
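
The correspondence can be checked in one line. The symbols $\lambda$ and $v$ below are generic notation introduced here. If $G^{-1}W v = \lambda v$, then

$$
L v = (I - G^{-1}W)\,v = v - \lambda v = (1 - \lambda)\,v ,
$$

so $v$ is also an eigenvector of $L$, with eigenvalue $1 - \lambda$. In particular, since $G$ has $W\mathbf{1}$ on its diagonal, $G^{-1}W\mathbf{1} = \mathbf{1}$, so $G^{-1}W$ has eigenvalue 1 and $L$ has eigenvalue 0, both with eigenvector $\mathbf{1}$.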

Obtain the first m eigenvectors. How does one
determine m? One may do so on the basis of the eigenvalues of $L$ that are close to zero
(equivalently, on the basis of those eigenvalues of $G^{-1}W$ that are close to unity). The
above framework provides the background for spectral clustering (which additionally
involves clustering these eigenvectors).

Write a function in C which does the above in a general framework. Test the function
(if it is too cumbersome, you may test the function using a subset of the observations
below for testing purposes.) [25 points]

4. Microarray gene expression data. The file diurnaldata.csv contains gene expression
data on 22,810 genes from Arabidopsis plants exposed to equal periods of light and
darkness in the diurnal cycle. Leaves were harvested at eleven time-points, at the
start of the experiment (end of the light period) and subsequently after 1, 2, 4, 8 and
12 hours of darkness and light each.

Note that there are 23 columns, with the first
column representing the gene probeset. Columns 2–12 represent measurements on gene
abundance taken at 1, 2, 4, 8 and 12 hours of darkness and light each, while columns
13-23 represent the same for a second replication. Note that the file has a header and
also that the first column, in character, is not particularly of value.

Use the functions written in the problems above to obtain the eigenvectors, using only
the first replication, and provide plots of the eigenvectors for different values of k, ρ
and m. Comment. [15 points]

5. In this problem, you will generate and print out all possible strings $c_1 c_2 \cdots c_8$ where
$c_i \in \{A, C, G, T\}$. Write a program that uses eight nested loops to output the strings.

Can you rewrite the program to take advantage of the bitwise operators? Can you adapt
the second program to output, with one minor change, all possible strings $c_1 c_2 \cdots c_{16}$
of length 16? [15 points]

STAT580 Homework 4

1. Use LAPACK to write a function in C which calculates the eigenvalues of a positive
definite matrix A. Test the function with a simple example. Provide instructions on
how to use and test the function and test your result with the output of R. [15 points]

2. Write a function in C which uses the R mathematical library to provide the loglikelihood of the gamma density. Illustrate with an example. [15 points]

3. Modify the multivariate skewness function discussed in class to use the .Call() function while calling C from R. [20 points]