IE 531 Programming Assignment 2: Experimental Verification of the Johnson-Lindenstrauss Lemma solution

$24.99

Original Work ?

Download Details:

  • Name: mp2-hkl3p2.zip
  • Type: zip
  • Size: 7.22 MB

Category: You will Instantly receive a download link upon Payment||Click Original Work Button for Custom work

Description

4.4/5 - (7 votes)

1 Introduction
We have n-many, d-dimensional vectors, {x1, x2, . . . , xn}, where ∀i, xi ∈
Rd
, and d is very large. We wish to find n-many, k-dimensional vectors,
{y1, y2, . . . , yn} where ∀i, yi ∈ Rk
, and k  d is comparatively smaller than d.
Theorem 2.11 of the text as the following form –
Theorem 1.1 (Johnson-Lindenstrauss Lemma) For any  ∈ (0, 1), and any
integer n, let k ≥
3
c2 ln n, with c as defined in Theorem 2.9. For any set of nmany vectors {x1, x2, . . . , xn} where xi ∈ Rd
, the random projection f : Rd →
Rk defined in section 2.7 of the text has the property that for all pairs xi and
xj , with probability at least 1 −
3
2n
,
(1 − )

kkxi − xjk ≤ kf(xi) − f(xj )k ≤ (1 + )

kkxi − xjk
(Note, I have replaced the text’s | • | with my k • k to denote length of the
vector-argument).
2 Discussion
For this programming assignment we will take a slightly different form/version
of the JL-Lemma. As before, we would like a mapping f : Rd → Rk
(where
k  d) such that the distance between any two d-dimensional points x1, x2 ∈ Rd
is more-or-less preserved when we look at f(x1), f(x2) ∈ Rk
. We are going to
rewrite the requirement of theorem 1.1 as equation 1 shown below, where for
some (small)  ∈ R, and ∀x1, x2 ∈ Rd
, we want
(1 − )kx1 − x2k ≤ kf(x1) − f(x2)k ≤ (1 + )kx1 − x2k (1)
As suggested by the text, you make k-many calls to a routine that generates
d-dimensional Gaussian RVs with zero-mean and an identity-covariance matrix.
Let us suppose this process gets us the vectors {u1, u2, . . . , uk}, where each
ui ∈ <d
. You re-scale each ui ∈ ui ←
r
d
k
×
ui
ku1k
1
This is needed to preserve the expected-norm (Why?1
). You stack the k-many,
members of {u
T
i
}
k
i=1 on top of each other to get the (k × d) matrix A. For each
xi ∈ <n, you define f(xi) = Axi (∈ Rk
).
There is a version of Johnson-Lindenstrauss Lemma that says that if
k = O

1

2
ln(n
δ
)

,
then ∀x1, x2 ∈ Rn equation 1 will be satisfied with probability at least (1 −
δ). I leave it as an exercise to you to show that theorem 1.1 and the above
version are theoretically similar. As noted in class, the neat thing about the
JL-Lemma is that this bound on k does not depend on d (the dimension we
want to reduce/scale-down). There are ways of gaining small efficiencies when
it comes to producing the random map f(•) alluded to above. But that is for
you to discover after you have done some experimentation.
If you spend enough time experimenting with this, you will see that while
the (ln n)-term is nice, the 1

2 -term can be a killer. For example, if  = 0.01 (i.e.
1% error), we will need k ≈ ln n × 104
. If this is to have any practical-value/use
we should have a d  ln n × 104
. If  = 0.1 (i.e. 10% error), we will need
k ≈ ln n × 100. On the flip-side, keep in mind that the error-bounds in the
formal theorem is an upper bound (i.e. it is the worst-case) – you may find that
the JL-Lemma in practice might not be as bad as the worst-case. It is for you
to experiment and find out – as a part of this programming exercise.
3 Data for Verification
The Compass site should contain a data file JL Data1 that contains data for a
(10000×1000) matrix that represents the set {xi}
1000
i=1 , where each xi ∈ R10,000
.
We want to represent each xi ∈ R10,000 with a yi ∈ Rk
, where k < 10, 000. As
you can gather, n = 1, 000 for this data set. The plan is to have you try a bunch
of different values for k (say, k ∈ {200, 300, 400, 500}) and see if the bounds in
the statement of the JL-Lemma holds for this data-set.
Thrombin Data
This is not necessary for the programming assignment. I am just providing this
to you to work with a bigger dataset. I know practically nothing about drugdiscovery. All I know is that there is a small role for the JL-Lemma in reducing
the dimension for a standard search that is part of the process. My superficial
understanding is that every time a new drug (i.e. chemical compound) is to
be considered, you have to check if there are others in your list that have a
similar “performance” – this involves checking the nearest neighbor of the new
compound in the list. I am not even sure about all this. That said, we have a
“real-world” data set that we can for testing the JL-Lemma.
1A very careful reading of my notes, the lectures, and the text, should tell you why!
2
You can try the training-data for Prediction of Molecular Bioactivity for
Drug Design at this link. Each row of this CSV file represents a chemicalcompound (i.e. a component of a drug). The first-item in each row is either an
“A” or an “I” (representing Active/Inactive). This is followed by 139,531 many
comma-seperated 0/1 values representing location on the Thrombin molecule
(i.e. d = 139, 531). There are 1909-many chemical compounds (i.e. n = 1909).
The Compass website contains some Python-Code I wrote to process this data
set to output a file that can be read the code you would have written for this
programming assignment. You can explore if each 139,531-dimensional binary
vector can be represented by a shorter k-dimensional vectors that meet all the
requirements of the JL-Lemma. Do not be surprised if you see some weird/interesting results ,,.
4 Requirements
What I need from you:
1. *.cpp and *.h that verifies the JL-Lemma by taking a bunch of values at
the command-line.
(a) First Command Line Variable – d,
(b) Second Command Line Variable – k,
(c) Third Command Line Variable – n,
(d) Fourth Command Line Variable – ,
(e) Fifth Command Line Variable – δ,
(f) Sixth Command Line Variable – input-filename (that contains the
relevant data), and
(g) Seventh Command Line Variable – number of trials, which is the number of statistical-trials where the JL-Lemma is experimentally/statistically verified.
I have provided a hint on Compass for anyone that needs it. To verify the JLLemma, you will implement what is shown below in pseudo-code.
1: no of hits = 0;
2: for int i = 1; i ≤ no of trials; i++ do
3: pick a pair of vectors x1, x2 ∈ Rd
;
4: compute y1 = Ax1 and y2 = Ax2;
5: if (1 − )kx1 − x2k ≤ ky1 − y2k ≤ (1 + )kx1 − x2k then
6: no of hits++;
7: end if
8: end for
9: Check if no of hits
no of trials ≥ (1 − δ)
I am looking for an output that is along the lines of what is shown in figures 1
and 2. I want you to try a bunch of values for k, and see if the specifications
3
of the JL-Lemma is verified for all of them. A single page statement of your
experiments will suffice.
Figure 1: A sample output.
4
Figure 2: A sample output on Thrombin-Data (run on my iMac).
5