Homework 3 COMS 4771 solution


Problem 1 (Features; 10 points). It is common to pre-process the feature vectors in $\mathbb{R}^d$ before passing them to a learning algorithm. Two simple and generic ways to pre-process are as follows.
• Centering: Subtract the mean $\hat\mu := \frac{1}{|S|} \sum_{(x,y) \in S} x$ (of the training data) from every feature vector:
$$x \mapsto x - \hat\mu.$$
• Standardization: Perform centering, and then divide every feature by the per-feature standard deviation $\hat\sigma_i = \sqrt{\frac{1}{|S|} \sum_{(x,y) \in S} (x_i - \hat\mu_i)^2}$:
$$(x_1, x_2, \ldots, x_d) \mapsto \left( \frac{x_1 - \hat\mu_1}{\hat\sigma_1}, \frac{x_2 - \hat\mu_2}{\hat\sigma_2}, \ldots, \frac{x_d - \hat\mu_d}{\hat\sigma_d} \right).$$
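For concreteness, a minimal NumPy sketch of the two transformations (assuming the training features are stored as an n-by-d array X; the function names are illustrative, not part of the assignment):

```python
import numpy as np

def center(X):
    """Subtract the per-feature training mean from every feature vector."""
    mu = X.mean(axis=0)
    return X - mu, mu

def standardize(X):
    """Center, then divide each feature by its per-feature standard deviation."""
    Xc, mu = center(X)
    sigma = np.sqrt((Xc ** 2).mean(axis=0))  # the problem assumes no standard deviation is zero
    return Xc / sigma, mu, sigma
```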
For each of the following learning algorithms, and each of the above pre-processing transformations, does the transformation affect the learning algorithm?
(a) The classifier based on the generative model where class conditional distributions are multivariate Gaussian distributions with a fixed covariance equal to the identity matrix I. Assume MLE is used for parameter estimation.
(b) The 1-NN classifier using Euclidean distance.
(c) The greedy decision tree learning algorithm with axis-aligned splits. (For concreteness, assume Gini index is used as the uncertainty measure, and the algorithm stops after 20 leaf nodes.)
(d) Empirical Risk Minimization: the (intractable) algorithm that finds the linear classifier (both the weight vector and threshold) that has the smallest training error rate.
To make this more precise, consider any training data set S from $\mathcal{X} \times \mathcal{Y}$, and let $\hat f_S \colon \mathcal{X} \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set S. Let $\phi \colon \mathcal{X} \to \mathcal{X}'$ be the pre-processing transformation, and let $\tilde f_{S'} \colon \mathcal{X}' \to \mathcal{Y}$ be the classifier obtained from the learning algorithm with training set S', where S' is the data set from $\mathcal{X}' \times \mathcal{Y}$ containing $(\phi(x), y)$ for each $(x, y)$ in S. We say $\phi$ affects the learning algorithm if there is a set of training data S such that the classifier $\hat f_S$ is not the same as $x \mapsto \tilde f_{S'}(\phi(x))$.
You should assume the following: (i) the per-feature standard deviations are never zero; (ii) there are never any "ties" whenever you compute an arg max or an arg min; (iii) there are no issues with numerical precision or computational efficiency.
What to submit: "yes" or "no" for each pre-processing and learning algorithm pair, along with a brief statement of justification.
Problem 2 (More features; 30 points). Download the review data set reviews_tr.csv (training data) and reviews_te.csv (test data) from Courseworks. This data set comprises reviews of restaurants in Pittsburgh; the label indicates whether or not the reviewer-assigned rating is at least four (on a five-point scale). The data are in CSV format (where the first line is the header); the first column is the label ("label"), and the second column is the review text ("text"). The text has been processed to remove non-alphanumeric symbols and to make all letters lowercase.
Write a script that, using only the training data, tries different methods of data representation and different methods for learning a linear classifier, and ultimately selects, via five-fold cross validation, one combination of these methods and uses it to train a final classifier. (You can think of the choice of these methods as "hyperparameters".)
The data representations to try are the following.
1. Unigram representation.
In this representation, there is a feature for every word w, and the feature value associated
with a word w in a document d is
$$\mathrm{tf}(w, d) := \text{number of times word } w \text{ appears in document } d.$$
(tf is short for term frequency.)
2. Term frequency-inverse document frequency (tf-idf) weighting.
This is like the unigram representation, except the feature associated with a word w in a document d from a collection of documents D (e.g., training data) is
$$\mathrm{tf}(w, d) \cdot \log_{10}(\mathrm{idf}(w, D)),$$
where $\mathrm{tf}(w, d)$ is as defined above, and
$$\mathrm{idf}(w, D) := \frac{|D|}{\text{number of documents in } D \text{ that contain word } w}.$$
This representation puts more emphasis on rare words and less emphasis on common words. (There are many variants of tf-idf that are unfortunately all referred to by the same name.)
Important: When you apply this representation to a new document (e.g., a document in the test set), you should still use the idf defined with respect to D. This, however, becomes problematic if a word w appears in a new document but did not appear in any document in D: in this case, $\mathrm{idf}(w, D) = |D|/0 = \infty$. It is ambiguous what should be done in these cases, but a safe recourse, which you should use in this assignment, is to simply ignore words w that do not appear in any document in D.
3. Bigram representation.
In addition to the unigram features, there is a feature for every pair of words $(w_1, w_2)$ (called a bigram), and the feature value associated with a bigram $(w_1, w_2)$ in a given document d is
$$\mathrm{tf}((w_1, w_2), d) := \text{number of times bigram } (w_1, w_2) \text{ appears consecutively in document } d.$$
In the sequence of words "a rose is a rose", the bigrams that appear are: (a, rose), which appears twice; (rose, is); and (is, a). (A minimal sketch of the first three feature maps appears after this list.)
4. Another data representation of your own choosing or design.
Some examples:
• Extend the bigram representation to n-gram representations (for any positive integer n) to consider n-tuples of words appearing consecutively in documents.
• Extend to k-skip-n-grams: instead of just counting the number of times the n-tuple $(w_1, w_2, \ldots, w_n)$ appears consecutively in the document, also count occurrences where $w_i$ and $w_{i+1}$ are at most k words apart.
• Use variants of tf-idf weighting, such as one that replaces $\mathrm{tf}(w, d)$ with $1 + \log_{10}(\mathrm{tf}(w, d))$. This is better for documents where words often appear in bursts.
• Select specific words or n-grams you think are informative for the classification task.
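For concreteness, a minimal sketch of the first three feature maps (assuming each document is a string of space-separated lowercase words, as in the provided CSV files; the function names are illustrative, not prescribed by the assignment):

```python
import math
from collections import Counter

def unigram_tf(doc):
    """tf(w, d): number of times word w appears in document d."""
    return Counter(doc.split())

def bigram_tf(doc):
    """tf((w1, w2), d): number of times bigram (w1, w2) appears consecutively in d."""
    words = doc.split()
    return Counter(zip(words, words[1:]))

def idf(docs):
    """idf(w, D) = |D| / (number of documents in D containing w), for every w seen in D."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return {w: len(docs) / df[w] for w in df}

def tfidf(doc, idf_table):
    """tf(w, d) * log10(idf(w, D)); words that never appear in D are ignored."""
    return {w: c * math.log10(idf_table[w])
            for w, c in unigram_tf(doc).items() if w in idf_table}
```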
The learning methods to try are the following.
1. Averaged-Perceptron (with some modifications described below).
Averaged-Perceptron is like Voted-Perceptron (which uses Online Perceptron), except instead of forming the final classifier as a weighted-majority vote over the various linear classifiers, you simply form a single linear classifier by taking a weighted average of the weight vectors and thresholds of all the linear classifiers used by Online Perceptron.
Use the following two modifications (a minimal sketch of the resulting procedure appears after this list):
• Run Online Perceptron to make two passes through the data. Before each pass, randomly shuffle the order of the training examples. Note that a total of 2n + 1 linear classifiers are considered during the run of the algorithm (where n is the number of training examples).
• Instead of averaging the weights and thresholds for all 2n + 1 linear classifiers, just average these things for the final n + 1 linear classifiers.
You should use Averaged-Perceptron with all four data representations.
2. Naïve Bayes classifier with parameter estimation as in the previous homework assignment.
Note that with this learning method, a word that appears more than once in a document is treated the same as appearing exactly once.
You only need to use Naïve Bayes with the unigram representation.
3. (Optional.) Any other learning method you like.
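Purely as an illustration of the two modifications above (the assignment still requires your own implementation), a minimal dense-NumPy sketch of Averaged-Perceptron, assuming labels in {-1, +1} and a dense feature matrix (the sparse text representations would require adapting the updates):

```python
import numpy as np

def averaged_perceptron(X, y, seed=0):
    """Two shuffled passes of Online Perceptron; average the weights and
    thresholds of the final n + 1 of the 2n + 1 linear classifiers."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0                  # current weight vector and threshold
    w_sum, b_sum, count = np.zeros(d), 0.0, 0
    step = 0                                 # examples processed so far (0, ..., 2n)
    for _ in range(2):                       # two passes through the data
        for i in rng.permutation(n):         # shuffle before each pass
            if step >= n:                    # classifiers n, ..., 2n - 1 enter the average
                w_sum, b_sum, count = w_sum + w, b_sum + b, count + 1
            if y[i] * (X[i] @ w - b) <= 0:   # mistake: Online Perceptron update
                w = w + y[i] * X[i]
                b = b - y[i]
            step += 1
    w_sum, b_sum, count = w_sum + w, b_sum + b, count + 1  # classifier 2n enters too
    return w_sum / count, b_sum / count      # final classifier: x -> sign(w @ x - b)
```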
You must write your own code to implement Averaged-Perceptron. The code should be easy to understand (e.g., by using sensible variable names and comments). For other things (e.g., Naïve Bayes, cross validation, feature processing), you can use your own or existing library implementations in MATLAB (+Statistics/ML library) and Python (+numpy, scipy, sklearn), but you are responsible for the correctness of these implementations as per the specifications from the course lectures. Provide references for any such third-party implementations you use (e.g., specific scikit-learn or MATLAB functions that perform cross validation).
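As an illustration of the model-selection step, a five-fold cross validation loop in Python might look like the sketch below; build_features, train, and error_rate are hypothetical helpers standing in for one data representation / learning method combination, and sklearn.model_selection.KFold is one library routine for producing the fold indices:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_error(texts, labels, build_features, train, error_rate, n_splits=5, seed=0):
    """Average held-out error over n_splits folds for one representation/method pair.
    The feature map is fit on the training folds only, then applied to the held-out fold."""
    texts, labels = np.asarray(texts, dtype=object), np.asarray(labels)
    errors = []
    for tr_idx, te_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(texts):
        featurizer, X_tr = build_features(texts[tr_idx])  # fit the representation on the training folds
        X_te = featurizer(texts[te_idx])                  # apply the same representation to the held-out fold
        model = train(X_tr, labels[tr_idx])
        errors.append(error_rate(model, X_te, labels[te_idx]))
    return float(np.mean(errors))
```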
What to submit:
1. A concise and unambiguous description of the fourth data representation you try (and also
of any additional learning methods you try).
2. Cross-validation error rates for all data representations / learning methods combinations.
3. Name of the method ultimately selected using the cross validation procedure.
4. Training and test error rates of the classi er learned by the selected method.
5. Source code and scripts that you write yourself (in separate files).
Problem 3 (MLE; 10 points).
(a) Consider the statistical model $\mathcal{P} = \{P_{\mu,\sigma^2} : \mu \in \mathbb{R}^d, \sigma^2 > 0\}$, where $P_{\mu,\sigma^2}$ is the multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\sigma^2 I$. Give a formula for the MLE of $\sigma^2$ given the data $\{x_i\}_{i=1}^n$ from $\mathbb{R}^d$, which is regarded as an iid sample. Also give a clear derivation of the formula in which you briefly justify each step.
(b) Consider the statistical model $\mathcal{P} = \{P_\theta : \theta \in \mathbb{N}\}$, where $P_\theta$ is the distribution of the random pair $(X, Y)$ in $\mathbb{N} \times \{0, 1\}$ such that:
• $X$ is uniformly distributed on $\{1, 2, \ldots, \theta\}$;
• for any $x \in \{1, 2, \ldots, \theta\}$, the conditional distribution of $Y$ given $X = x$ is specified by $P_\theta(Y = 1 \mid X = x) = x/\theta$.
Give a formula for the MLE of $\theta$ given a single observation $(x, y) \in \mathbb{N} \times \{0, 1\}$. Also give a clear derivation of the formula in which you briefly justify each step.