LING 570: Hw1 solution


For this homework, you are going to write an English tokenizer and a tool that creates a
vocabulary from the input text. All the sample files are under

Q1 (50 points): Implementing an English tokenizer,
 Format:
o The command line is: cat input_file | ./ abbrev_list >

o abbrev_list is an input file. It contains a list of abbreviations, one
abbreviation per line.

o The input and output files should have the same number of lines, and the
k-th line in the input corresponds to the k-th line in the output file.

o The tokens in each output line should be separated by whitespace.

o A sample input file is “ex1”, and a sample output file is “ex1.tok”. The
sample output file is meant to show you the format, NOT the gold

 Note:
o Your tokenizer should not split numbers, URLs, paths, etc. See the slides
for 9/27’s lecture.

o You can assume that a token will not cross the line boundary; therefore,
your code should process each line independently of other lines.

o Do not merge tokens in the input text (e.g., collocations such as
“pick up”, “because of”, and “Hong Kong” should NOT be merged
into one token).
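
The notes above can be illustrated with a minimal Python sketch. It is one possible approach, not the reference solution: it splits each line on whitespace and then peels punctuation only from the edges of each chunk, so the interior of numbers, URLs, and paths is never split. The function names (load_abbreviations, split_chunk, tokenize_line, main) and the OPENERS and CLOSERS character sets are assumptions made for illustration; the lecture slides define the actual tokenization rules.

```python
import sys

# Assumed punctuation inventories; the real rules come from the lecture slides.
OPENERS = set("\"'([{")
CLOSERS = set(".,;:!?\"')]}")


def load_abbreviations(path):
    """Read one abbreviation per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def split_chunk(chunk, abbrevs):
    """Split one whitespace-delimited chunk into a list of tokens."""
    prefix, suffix = [], []
    # Peel opening punctuation off the front, one character at a time.
    while chunk and chunk[0] in OPENERS:
        prefix.append(chunk[0])
        chunk = chunk[1:]
    # Peel closing punctuation off the end, but stop as soon as what
    # remains is a listed abbreviation such as "Dr." or "etc.".
    while chunk and chunk[-1] in CLOSERS:
        if chunk in abbrevs:
            break
        suffix.insert(0, chunk[-1])
        chunk = chunk[:-1]
    core = [chunk] if chunk else []
    return prefix + core + suffix


def tokenize_line(line, abbrevs):
    """Tokenize a single line independently of all other lines."""
    tokens = []
    for chunk in line.split():
        tokens.extend(split_chunk(chunk, abbrevs))
    return " ".join(tokens)


def main(argv):
    """Read stdin and write exactly one output line per input line."""
    abbrevs = load_abbreviations(argv[1])
    for line in sys.stdin:
        print(tokenize_line(line, abbrevs))

# When saved as a script, main(sys.argv) would be called under an
# `if __name__ == "__main__":` guard.
```

Because the sketch never joins chunks, collocations such as “pick up” stay as separate tokens, and because it prints once per input line, the k-th output line always corresponds to the k-th input line. It does not handle clitics or punctuation inside a chunk.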

Q2 (15 points): Writing a tool,, that creates a vocabulary from the input
 The command line should be: cat input_file | ./ > output_file

 The tool reads in each line in the input, breaks it into tokens by whitespace only,
and outputs the frequencies of the tokens.

 Each line in the output file is a (token, frequency) pair. The lines are sorted by the
frequency of the tokens in descending order.
 A sample input is “ex1”, and a sample output is “ex1.voc”.
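
The vocabulary tool described above could be sketched in Python as follows. This is an illustration under stated assumptions rather than the expected solution: the tab separator between token and frequency and the alphabetical tie-break are guesses, since the actual format is defined by the sample file “ex1.voc”, and the function names are hypothetical.

```python
import sys
from collections import Counter


def count_tokens(lines):
    """Count tokens, splitting each line on whitespace only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_vocab(counts):
    """Return 'token<TAB>frequency' rows sorted by descending frequency."""
    # Assumption: ties are broken alphabetically; the handout only
    # requires descending order of frequency.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{tok}\t{freq}" for tok, freq in ordered]


def main():
    """Read stdin and print one (token, frequency) pair per line."""
    for row in format_vocab(count_tokens(sys.stdin)):
        print(row)

# When saved as a script, main() would be called under an
# `if __name__ == "__main__":` guard.
```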

Q3 (10 points): Run the code in Q1 and Q2
 Run the following commands:
o cat ex2 | ./ abbrev-list > ex2.tok
o cat ex2.tok | ./ > ex2.tok.voc
o cat ex2 | ./ > ex2.voc
 In your note file, write down
o the number of tokens in ex2 and in ex2.tok
o the number of lines in ex2.voc and in ex2.tok.voc
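
One way to obtain the counts requested above is a small Python helper like the sketch below. The function names are hypothetical, and tokens are counted by whitespace only, matching the definition used in Q2. Since each line of a vocabulary file holds one (token, frequency) pair, the line count of ex2.voc or ex2.tok.voc equals the number of distinct tokens in the corresponding input.

```python
def count_tokens_and_lines(lines):
    """Return (number of whitespace-separated tokens, number of lines)."""
    n_tokens = 0
    n_lines = 0
    for line in lines:
        n_lines += 1
        n_tokens += len(line.split())
    return n_tokens, n_lines


def report(path):
    """Print the token and line counts for one file, e.g. report('ex2.tok')."""
    with open(path, encoding="utf-8") as f:
        n_tokens, n_lines = count_tokens_and_lines(f)
    print(f"{path}: {n_tokens} tokens, {n_lines} lines")
```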

Submission instructions:
 Submit two files, readme.[txt|pdf] and hw.tar.gz, as specified in the course policy.
 The note file, readme.[txt|pdf], should include the answers to Q3 and any note that
you want us to read.

 hw.tar.gz should include all the files specified in ~/dropbox/18-19/570/hw1/submitfile-list, plus any source code (and corresponding binary code) called by the shell