Description
For this homework, you will write an English tokenizer that processes input text, along with
a script that evaluates its output. All the sample files are under ~/dropbox/19-20/570/hw1/examples.
Rubric:
2pts hw.tar.gz submitted; it should contain the following files:
• eng_tokenizer.sh
• abbrev_list
• file.tok.system
• eng_tokenizer_eval.sh
• file.tok.score
2pts readme.txt or readme.pdf submitted
6pts All files and folders are present in expected locations
10pts Programs run to completion
5pts The output of the programs on patas matches the submitted output
1. (25pts) Implement an English tokenizer, eng_tokenizer.sh, using regular expressions and the
exception list (abbrev_list).
• The command line is: cat file.txt | ./eng_tokenizer.sh abbrev_list > file.tok.system
• At least minimal in-line comments should be provided.
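To illustrate how regular expressions and an exception list can work together, here is a minimal sketch in Python. It is not a reference solution. It assumes that the abbreviation file holds one abbreviation per line, that each input line produces one output line of space-separated tokens, and that the shell script simply calls a Python program; none of these details are specified above. The names load_abbreviations, tokenize_line, TOKEN_RE and TRAILING_RE are hypothetical.

```python
import re
import sys

# Words may contain internal hyphens, apostrophes, or periods (e.g. "3.50", "don't");
# any other non-space, non-word character becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:[-'.]\w+)*|[^\w\s]")

# Splits a chunk into a core and trailing punctuation other than ".",
# so that "Inc.," can be looked up in the abbreviation list as "Inc.".
TRAILING_RE = re.compile(r'^(.*?)([,;:!?)"\]]*)$')


def load_abbreviations(lines):
    """Return the set of abbreviations (assumed: one per non-empty line)."""
    return {line.strip() for line in lines if line.strip()}


def tokenize_line(line, abbreviations):
    """Tokenize one line; chunks found in the abbreviation set are kept whole."""
    tokens = []
    for chunk in line.split():
        match = TRAILING_RE.match(chunk)
        core, trailing = match.group(1), match.group(2)
        if core in abbreviations:
            tokens.append(core)                      # exception: keep "Dr." intact
        else:
            tokens.extend(TOKEN_RE.findall(core))    # regular-expression split
        tokens.extend(trailing)                      # one token per trailing mark
    return tokens


def main(abbrev_path):
    """Read text from stdin and write space-separated tokens to stdout."""
    with open(abbrev_path, encoding="utf-8") as handle:
        abbreviations = load_abbreviations(handle)
    for line in sys.stdin:
        print(" ".join(tokenize_line(line, abbreviations)))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

With the abbreviations "Dr." and "i.e." loaded, tokenize_line keeps both intact in the line 'Dr. Smith paid $3.50, i.e. too much.' while splitting "$", the comma, and the final period into separate tokens. A real tokenizer needs many more patterns (clitics, URLs, hyphenation policies); the sample files define the expected behavior.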
2. (10pts) Calculate the Levenshtein distance (LD) between "execution" and "intention" by
completing the following table of the minimum edit distance algorithm:
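    |  # |  I |  N |  T |  E |  N |  T |  I |  O |  N |
  # |    |    |    |    |    |    |    |    |    |    |
  E |    |    |    |    |    |    |    |    |    |    |
  X |    |    |    |    |    |    |    |    |    |    |
  E |    |    |    |    |    |    |    |    |    |    |
  C |    |    |    |    |    |    |    |    |    |    |
  U |    |    |    |    |    |    |    |    |    |    |
  T |    |    |    |    |    |    |    |    |    |    |
  I |    |    |    |    |    |    |    |    |    |    |
  O |    |    |    |    |    |    |    |    |    |    |
  N |    |    |    |    |    |    |    |    |    |    |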
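As a way to check a hand-filled table, the following Python sketch builds the full minimum edit distance table for any pair of strings. It is illustrative only. The insertion and deletion costs are assumed to be 1, and the substitution cost is left as a parameter (sub_cost, default 1) because the cost scheme is not stated above; some textbooks use a substitution cost of 2. The function names are hypothetical.

```python
def edit_distance_table(row_word, col_word, sub_cost=1):
    """Return the (len(row_word)+1) x (len(col_word)+1) edit distance table.

    Row 0 and column 0 correspond to the empty prefix, shown as "#".
    """
    n, m = len(row_word), len(col_word)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i                      # delete i characters
    for j in range(1, m + 1):
        table[0][j] = j                      # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = row_word[i - 1] == col_word[j - 1]
            table[i][j] = min(
                table[i - 1][j] + 1,                              # deletion
                table[i][j - 1] + 1,                              # insertion
                table[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return table


def format_table(table, row_word, col_word):
    """Render the table with "#" and the letters as row and column labels."""
    lines = ["   " + " ".join(f"{c:>2}" for c in "#" + col_word)]
    for label, row in zip("#" + row_word, table):
        lines.append(f"{label:>2} " + " ".join(f"{v:>2}" for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # A small pair, deliberately different from the one in the exercise.
    print(format_table(edit_distance_table("cat", "cart"), "cat", "cart"))
```

The bottom-right cell holds the distance between the two full strings; for "cat" and "cart" it is 1 (one insertion).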
3. (40pts) Implement an English tokenizer evaluator, eng_tokenizer_eval.sh, using the minimum
edit distance algorithm.
• The command line is: cat file.tok.system | ./eng_tokenizer_eval.sh file.tok.gold > file.tok.score
• At least minimal in-line comments should be provided.
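One possible shape for the evaluator is sketched below in Python. It applies minimum edit distance to token sequences rather than to characters. Several points are assumptions, since they are not specified above: the system and gold files are taken to be aligned line by line, tokens are taken to be space-separated, every edit costs 1 (the substitution cost is adjustable through sub_cost), and the output format (one distance per line plus a total) is hypothetical. Check the sample score file for the expected format.

```python
import sys
from itertools import zip_longest


def token_edit_distance(system_tokens, gold_tokens, sub_cost=1):
    """Minimum edit distance between two token lists, keeping only two rows."""
    previous = list(range(len(gold_tokens) + 1))    # row for the empty system prefix
    for i, sys_tok in enumerate(system_tokens, start=1):
        current = [i]                               # i deletions reach the empty gold prefix
        for j, gold_tok in enumerate(gold_tokens, start=1):
            current.append(min(
                previous[j] + 1,                                         # delete sys_tok
                current[j - 1] + 1,                                      # insert gold_tok
                previous[j - 1] + (0 if sys_tok == gold_tok else sub_cost),
            ))
        previous = current
    return previous[-1]


def score_lines(system_lines, gold_lines):
    """Yield one distance per line pair; a missing line counts as empty."""
    for sys_line, gold_line in zip_longest(system_lines, gold_lines, fillvalue=""):
        yield token_edit_distance(sys_line.split(), gold_line.split())


def main(gold_path):
    """Read system output from stdin, compare with the gold file, print scores."""
    with open(gold_path, encoding="utf-8") as handle:
        gold_lines = handle.readlines()
    total = 0
    for number, distance in enumerate(score_lines(sys.stdin, gold_lines), start=1):
        total += distance
        print(f"line {number}\t{distance}")         # hypothetical output format
    print(f"total\t{total}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

For example, the system line "Dr . Smith left ." compared with the gold line "Dr. Smith left ." has a distance of 2: one substitution ("Dr" for "Dr.") and one deletion (the extra ".").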