Description
1. Q1 (55 points): Create a MaxEnt POS tagger, maxent_tagger.sh.
• The command line is: maxent_tagger.sh train_file test_file rare_thres feat_thres output_dir
• The train_file and test_file have the format (e.g., test.word_pos): w1/t1 w2/t2 … wn/tn
• rare_thres is an integer: any word (in the train_file or test_file) that appears
LESS THAN rare_thres times in the train_file is treated as a rare word, and features
such as pref=xx and suf=xx should be used for rare words (see Table 1 in (Ratnaparkhi,
1996)).
• feat_thres is an integer: All the wi features (i.e., CurrentWord=xx features), regardless
of their frequency, should be kept. For all OTHER types of features, if a feature appears
LESS THAN feat_thres times in the train_file, that feature should be removed from the
feature vectors.
• output_dir is a directory that stores the output files from the tagger. Your script should
create the following files and store them under output_dir:
– train_voc (e.g., ex_train_voc): the vocabulary that includes all the words appearing
in train_file. The file has the format “word freq”, where freq is the frequency
of the word in the training data. The lines should be sorted by freq in descending
order. For words with the same frequency, sort the lines alphabetically.
– init_feats (e.g., ex_init_feats): features that occur in the train_file. It has the
format “featName freq”, and the lines are sorted by the frequency of the feature in
the train_file in descending order. For features with the same frequency, sort the
lines alphabetically.
– kept_feats (e.g., ex_kept_feats): This is a subset of init_feats, and it includes
the features that are kept after applying feat_thres.
– final_train.vectors.txt (e.g., ex_final_train.vectors.txt): the feature vectors
for the train_file in the Mallet text format. Only features in kept_feats should
be kept in this file.
– final_test.vectors.txt: the feature vectors for the test_file in the Mallet text
format. The format is the same as final_train.vectors.txt.
– final_train.vectors: the binary format of the vectors in final_train.vectors.txt.
– me_model: the MaxEnt model (in binary format) which is produced by the MaxEnt
trainer.
– me_model.stdout and me_model.stderr: the stdout (standard out) and stderr (standard
error) produced by the MaxEnt trainer are redirected and saved to those files by
running a command such as mallet train-classifier --trainer MaxEnt
--input final_train.vectors --output-classifier me_model > me_model.stdout
2> me_model.stderr. The training accuracy is displayed at the end of me_model.stdout.
– sys_out: the system output file when running the MaxEnt classifier with a command
such as mallet classify-file --input final_test.vectors.txt --classifier
me_model --output sys_out.
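As an illustration of the "word freq" format and the sort order required for the vocabulary file, here is a minimal Python sketch. The function names build_vocab and format_vocab are hypothetical. Splitting each token on its last "/" is an assumption about how word/tag pairs are delimited, and "alphabetically" is taken to mean Python's default string order, which the assignment does not spell out.

```python
from collections import Counter


def build_vocab(lines):
    """Count word frequencies from lines in 'w1/t1 w2/t2 ...' format."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Assumption: the tag follows the LAST slash, so a word that
            # itself contains "/" is kept intact.
            word, _tag = token.rsplit("/", 1)
            counts[word] += 1
    return counts


def format_vocab(counts):
    """Return 'word freq' lines: descending frequency, ties in string order."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{word} {freq}" for word, freq in ordered]


if __name__ == "__main__":
    sample = ["the/DT dog/NN saw/VBD the/DT cat/NN"]
    for entry in format_vocab(build_vocab(sample)):
        print(entry)
```

The same sort key can be reused for the feature files, since they follow the same "name freq" layout and ordering rule.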
Your script maxent_tagger.sh should do the following:
(a) Create feature vectors for the training data and the test data. The vector files should
be called final_train.vectors.txt and final_test.vectors.txt.
(b) Run mallet import-file to convert the training vectors into binary format; the
binary file is called final_train.vectors.
(c) Run mallet train-classifier to create a MaxEnt model me_model using final_train.vectors.
(d) Run mallet classify-file to get the result on the test data final_test.vectors.txt.
(e) Calculate the test accuracy.
For steps (b)-(d), you should use Mallet commands. For step (e), if you don’t want to write code
for it, you can use the vectors2classify command, which covers steps (c)-(e). In that case, you
need to convert final_test.vectors.txt to the binary format first.
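If you do write your own code for the accuracy calculation, it could look like the following Python sketch. It assumes, without confirmation from the assignment text, that each line of the classifier output has the layout "instance_name label_1 prob_1 label_2 prob_2 ..." and that the gold tags are available as a list in the same order as the output lines. The function names predicted_label and accuracy are hypothetical.

```python
def predicted_label(sys_out_line):
    """Return the label with the highest probability on one output line.

    Assumed layout: instance_name label_1 prob_1 label_2 prob_2 ...
    """
    fields = sys_out_line.split()
    # Labels sit at odd positions, their probabilities right after them.
    pairs = zip(fields[1::2], fields[2::2])
    best_label, _best_prob = max(pairs, key=lambda lp: float(lp[1]))
    return best_label


def accuracy(gold_labels, sys_out_lines):
    """Fraction of instances whose top-scoring label matches the gold label."""
    predictions = [predicted_label(l) for l in sys_out_lines if l.strip()]
    if len(predictions) != len(gold_labels):
        raise ValueError("gold labels and system output differ in length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predictions))
    return correct / len(gold_labels)
```

Check the layout against a real output file before relying on this; if the columns differ, only predicted_label needs to change.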
For the first step, you need to write some code. Features are defined in Table 1 in
(Ratnaparkhi, 1996) (see MaxEnt slides). The following is one way to implement this step:
(a) Create train_voc from the train_file, and use the word frequency in train_voc and
rare_thres to determine whether a word should be treated as a rare word. The feature
vectors for rare words and non-rare words are different.
(b) Form feature vectors for the words in train_file, and store the features and frequencies
in the training data in init_feats.
(c) Create kept_feats by using feat_thres to filter out low-frequency features in init_feats.
Note that wi features are NOT subject to filtering with feat_thres, and every wi feature
in init_feats should be kept in kept_feats.
(d) Go through the feature vector file for train_file and remove all the features that are
not in kept_feats.
(e) Create feature vectors for test_file, and use only the features in kept_feats. If a word
in the test_file appears LESS THAN rare_thres times (or does not appear at all) in
the training file, the word should be treated as a rare word even if it appears many
times in the test_file.
(f) For the feature vector files, replace all the occurrences of “,” with “comma” as Mallet
treats “,” as a separator.
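The steps above can be sketched in Python as follows. This is one possible reading, not a reference solution. The feature names (curW=, prevW=, prevT=, containNum, and so on), the BOS/EOS padding symbols, and all function names are hypothetical choices; only pref=xx and suf=xx come from the assignment text. The sketch also assumes the previous-tag features are read from the tags given in the input file, and it leaves out writing the vectors in the Mallet text format.

```python
from collections import Counter

BOS, EOS = "BOS", "EOS"  # hypothetical sentence-boundary padding symbols


def word_features(words, tags, i, vocab, rare_thres):
    """Feature names for position i, following Table 1 of Ratnaparkhi (1996)."""
    w = words[i]
    feats = []
    if vocab.get(w, 0) < rare_thres:
        # Rare (or unseen) word: affixes up to length 4 plus spelling tests.
        for k in range(1, min(4, len(w)) + 1):
            feats.append(f"pref={w[:k]}")
            feats.append(f"suf={w[-k:]}")
        if any(c.isdigit() for c in w):
            feats.append("containNum")
        if any(c.isupper() for c in w):
            feats.append("containUC")
        if "-" in w:
            feats.append("containHyp")
    else:
        feats.append(f"curW={w}")
    # Context features are used for every word, rare or not.
    pw = [BOS, BOS] + list(words) + [EOS, EOS]
    pt = [BOS, BOS] + list(tags)
    j = i + 2
    feats += [f"prevW={pw[j - 1]}", f"prev2W={pw[j - 2]}",
              f"nextW={pw[j + 1]}", f"next2W={pw[j + 2]}",
              f"prevT={pt[j - 1]}", f"prevTwoTags={pt[j - 2]}+{pt[j - 1]}"]
    return feats


def count_features(sentences, vocab, rare_thres):
    """Feature frequencies over (words, tags) sentence pairs (init_feats)."""
    counts = Counter()
    for words, tags in sentences:
        for i in range(len(words)):
            counts.update(word_features(words, tags, i, vocab, rare_thres))
    return counts


def kept_features(feat_counts, feat_thres):
    """Keep every current-word feature; drop others seen < feat_thres times."""
    return {f for f, c in feat_counts.items()
            if f.startswith("curW=") or c >= feat_thres}


def clean(feat):
    """Replace commas, which Mallet treats as a separator."""
    return feat.replace(",", "comma")
```

Because vocab holds training counts only, calling word_features on test sentences with the same vocab treats any word that is unseen or infrequent in training as rare, regardless of its test frequency.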
2. Q2 (20 points): Run maxent_tagger.sh with wsj_sec0.word_pos as train_file, test.word_pos
as test_file, and the thresholds as specified in Table 1. In the table:
• training accuracy is the accuracy of the tagger on the train_file
• test accuracy is the accuracy of the tagger on the test_file
• # of feats is the number of features in the train_file before applying feat_thres
• # of kept feats is the number of features in the train_file after applying feat_thres
• running time is the CPU time (in minutes) of running maxent_tagger.sh.
Table 1: Tagging accuracy with different thresholds
Expt id  rare_thres  feat_thres  training accuracy  test accuracy  # of feats  # of kept feats  running time
1_1      1           1
1_3      1           3
2_3      2           3
3_5      3           5
5_10     5           10
Please do the following:
• Fill out Table 1.
• What conclusion can you draw from Table 1?
• Save the output files of maxent_tagger.sh to res_id/, where id is the experiment id in
the first column (e.g., the files for the first experiment will be stored under res_1_1).
Submit only the subdirs for the first row and the last row (i.e., res_1_3 and res_3_5).
The submission should include:
• The readme.[txt|pdf] file that includes Table 1 and your answer to Q2.
• hw.tar.gz that includes maxent_tagger.sh and res_1_3 and res_3_5 created in Q2 (see the
complete file list in submit-file-list).