MIE 451/1513 Assignment 1: Information Retrieval (IR) solution





1 Before the Introductory lab
In the lab, you’ll familiarize yourself with three pieces of software: Whoosh, a pure-Python search
engine library; NLTK, a natural language processing toolkit; and pytrec_eval, an Information
Retrieval evaluation tool for Python based on the popular trec_eval, the standard software for
evaluating search engines with test collections.
The Whoosh documentation can be found on its website at https://whoosh.readthedocs.io/en
The NLTK documentation can be found on its website at http://www.nltk.org/
Before coming to the lab, you should know that we use the Whoosh and NLTK libraries
for IR tasks and pytrec_eval to evaluate IR performance.
2 In the Introductory lab
Please use the link to create the lab repository, clone it, and open the Jupyter Notebook in
Google Colab.
Start executing statements and follow the instructions in the lab. You will work with a dataset
that contains example data for this lab. The example data we will be using is in the “lab-data”
folder. This is a very small data set that contains three things:
• A set of documents (email messages), in the documents directory.
• A set of queries – also called “topics” – the file “air.topics”.
• A set of judgments, saying which documents are relevant for each topic – the file “air.qrels”.
The overall goal here is to search the documents for each query and produce some output. The
output can then be compared with the judgments to say how good (or bad) Whoosh is for these
tasks. In order to do so:
1. Whoosh needs to make an index of the set of documents, then
2. Whoosh needs to read queries (topics) from the set given, and
3. Whoosh needs to produce output for each query (topic). Then,
4. pytrec_eval can compare Whoosh’s output with the judgments and say how good (or bad)
the output is.
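Step 4 needs the judgments in a form that can be compared against a run. The sketch below loads judgments into a nested dictionary, assuming the qrels file uses the standard four-column TREC layout (topic, iteration, document, relevance). The function name parse_qrels and the sample judgments are made up for illustration and are not taken from air.qrels.

```python
from collections import defaultdict


def parse_qrels(text):
    """Parse TREC-style qrels text into {topic_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, relevance = parts
        qrels[topic_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up judgments for illustration only (not the contents of air.qrels).
sample = """01 0 email09 1
01 0 email06 0
02 0 email03 1
"""

qrels = parse_qrels(sample)
print(qrels["01"])
```

In a notebook the text would come from reading the qrels file rather than from an inline string.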
Once you understand how Whoosh works from running and understanding the provided code:
1. You’ll need to index the correct document set.
2. Ideally, you’ll need to read topics from a file (air.topics) instead of having them in the program.
3. Produce output in the format pytrec_eval expects. This is:
01 Q0 email09 0 1.23 myname
01 Q0 email06 1 1.08 myname
where “01” is the topic number (01 to 06); “Q0” is the literal string Q0, that is, exactly those
two characters¹; “emailn” is the name of the file you’re returning; “0” (or “1” or some other
number) is the rank of this result; “1.23” (or “1.08” or some other number) is the score of
this result; and “myname” is some name for your software (it doesn’t matter what, just be
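The six-column run format described in step 3 can be produced with a short helper. This is a minimal sketch: the function name format_run is hypothetical, and the ranked hits are assumed to be (document name, score) pairs already sorted best first.

```python
def format_run(topic_id, ranked_hits, run_name="myname"):
    """Return one run line per hit: topic, Q0, document, rank, score, run name.

    ranked_hits is assumed to be a list of (doc_id, score) pairs, best first.
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranked_hits):
        lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.2f} {run_name}")
    return lines


# The two hits from the example above.
hits = [("email09", 1.23), ("email06", 1.08)]
run_lines = format_run("01", hits)
print("\n".join(run_lines))
```

Ranks start at 0 here only because the example above does; the score is rounded to two decimals purely for display.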
Once you’ve done this, index and rank the documents for any or all of the six test queries (e.g.,
using a default Whoosh index and ranking) and run pytrec_eval with the qrels file provided in
If pytrec_eval runs correctly and produces numbers which you think are sensible, you’re done
with this part. You might want to look at the output, though, and get some understanding of what
it means; later you will be asked to interpret this and to choose evaluation measures you prefer.
This is an example of what pytrec_eval may return:
P_5 01 0.2000
gm_map all 0.0141
where the first column, e.g. “P_5”, is the name of one of the measures. The second column, e.g.
“01”, is the ID of the query/topic. If the value of this column is “all”, the row is the average of
the measure over all queries/topics. The last column is the score for this measure.
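To work with these rows programmatically, they can be grouped by topic. The sketch below assumes the three-column text layout shown in the example (measure, topic, value); the helper name parse_eval_output is hypothetical, and rows whose last column is not a number are skipped.

```python
def parse_eval_output(text):
    """Parse 'measure topic value' rows into {topic: {measure: value}}."""
    results = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        measure, topic, value = parts
        try:
            results.setdefault(topic, {})[measure] = float(value)
        except ValueError:
            continue  # skip rows with a non-numeric value
    return results


# The two rows from the example above.
sample_output = """P_5 01 0.2000
gm_map all 0.0141
"""

scores = parse_eval_output(sample_output)
print(scores)
```

The “all” key then holds the averages, and every other key holds the scores for one topic.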
3 Main Assignment (Assessed in Evaluation Lab)
In the Introductory lab section, you were asked to get Whoosh to index the documents from a
very small test collection, run a few queries (“topics”) and produce output in the format expected
by pytrec_eval. You were also asked to run pytrec_eval, to compare your output to the human
relevance judgments (“qrels”), and to check you understand the output.
In this part of the assignment, you will run tests with a bigger collection of documents, and
more queries. You will also need to improve on the baseline Whoosh configuration.
The data you need should be available in the government directory of the assignment repository.
The test collection is about 4,000 documents from US Government web sites and the topics are 15
needs for government information. Both were part of the TREC conference in 2003.
Make sure you provide all the answers in the provided notebook:
• Text answers should be written in the dedicated markdown cells for each question.
• Code answers should be written in the dedicated code cells. Make sure you fill in the values
of the result variables as instructed in each question.
• Make sure you perform a basic validation of your code as explained below.
¹ Once upon a time, this field meant something to pytrec_eval. It is not used for anything now, but it’s still required.
Code Preparation
(a) Put all your imports and path constants in the dedicated cell in the notebook.
(b) Make sure all your paths are relative to DATA_DIR and do NOT hard-code absolute paths in
your code.
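One way to keep every path relative is to build it from DATA_DIR. The sketch below is only an illustration: the value given to DATA_DIR and the file and folder names are assumptions, so use whatever the notebook and repository actually define.

```python
import os

# Assumed value for illustration; the notebook defines the real DATA_DIR.
DATA_DIR = "government"

# Hypothetical file and folder names, built relative to DATA_DIR.
DOCS_DIR = os.path.join(DATA_DIR, "documents")
TOPICS_PATH = os.path.join(DATA_DIR, "gov.topics")
QRELS_PATH = os.path.join(DATA_DIR, "gov.qrels")

# Quick self-check: none of the derived paths should be absolute.
for path in (DOCS_DIR, TOPICS_PATH, QRELS_PATH):
    assert not os.path.isabs(path), f"{path} should be relative"

print(DOCS_DIR, TOPICS_PATH, QRELS_PATH)
```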
Q1. Appropriate TREC measures
pytrec_eval can report dozens of measures: for example “P_5” (precision in the first five documents
returned), “num_rel_ret” (the number of relevant documents retrieved over all queries), and “recip_rank” (the reciprocal rank of the top relevant document: e.g., 0.25 if the first relevant document is
the fourth in the ranking). You can find descriptions of some key pytrec_eval measures at
this link:
(a) Which of pytrec_eval’s measures might be appropriate for measuring search system performance for government web sites? [List the measure]
(b) Why do you think this measure is appropriate? [1 sentence]
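To make the three measures named above concrete, here is a hand-rolled sketch for a single query. It is not pytrec_eval’s implementation; the function names and the document IDs are made up for illustration.

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top-k returned documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant document, or 0.0 if none is returned."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def num_rel_ret(ranking, relevant):
    """Number of relevant documents that appear anywhere in the ranking."""
    return sum(1 for doc in ranking if doc in relevant)


# Made-up ranking and judgments for one query.
ranking = ["d7", "d2", "d9", "d4", "d1", "d3"]
relevant = {"d4", "d3", "d8"}

print(precision_at_k(ranking, relevant))   # one relevant document in the top five
print(reciprocal_rank(ranking, relevant))  # first relevant document is fourth
print(num_rel_ret(ranking, relevant))      # d4 and d3 are retrieved, d8 is not
```

The reciprocal rank here is 0.25, matching the example in the text where the first relevant document is fourth in the ranking.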
Q2. Indexing and querying
Index the government documents, run the queries (“topics”) through (vanilla) Whoosh as a baseline
system, and run pytrec_eval to compare Whoosh’s results with human judgments.
(a) Save your index (after indexing all the documents) in the provided variable INDEX_Q2, your
query parser in the provided variable QP_Q2, and your searcher in the provided variable
(b) How well did the baseline Whoosh system do on your chosen measure? [Provide the number.]
(c) Are there any particular topics where it did very well, or very badly? [If so, list a few topic
IDs for each]
(Note: pytrec_eval will report measures for each query/topic separately as well as the averages.
This will help you pinpoint good or bad cases.)
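Per-topic scores can be sorted to surface the best and worst topics for part (c). The sketch below assumes the scores are held as {topic: {measure: value}}; the helper name best_and_worst and all the numbers are made up for illustration, and “map” stands in for whichever measure you chose.

```python
def best_and_worst(per_topic, measure, n=3):
    """Return (best, worst) lists of topic IDs ranked by one measure."""
    ordered = sorted(
        per_topic, key=lambda topic: per_topic[topic][measure], reverse=True
    )
    # Best: highest scores first. Worst: lowest scores first.
    return ordered[:n], ordered[-n:][::-1]


# Made-up per-topic scores, not real results.
per_topic = {
    "1": {"map": 0.62},
    "2": {"map": 0.05},
    "3": {"map": 0.31},
    "4": {"map": 0.00},
    "5": {"map": 0.88},
}

best, worst = best_and_worst(per_topic, "map", n=2)
mean_map = sum(s["map"] for s in per_topic.values()) / len(per_topic)
print(best, worst, round(mean_map, 3))
```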
Q3. Improving performance
Look at where the baseline Whoosh system did well, or badly.
(a) What do you think would improve Whoosh’s performance on this test collection, and why?
• For the system you aim to improve, you need to understand (1) which documents were
highly ranked and (2) which documents should have been highly ranked, and then (3) explain false
positives (irrelevant documents ranked highly) and false negatives (relevant documents
not ranked highly) in order to directly inform your suggested improvements. Hence,
please choose one query, identify one false positive and one false negative case,
explain each error, and describe how it motivates your suggested modification. (Please note: it
is highly unlikely for two students to choose the same query and the same false positive
and false negative examples; similarity in responses will be reported to the department
for investigation as a possible plagiarism case.)
• Use printRelName() to show the ground-truth relevant files and the files your system
returns. Based on the result, you can open the false positive and false negative files
to do the analysis described above.
(b) Based on your analysis, make any changes you think can improve your baseline. Run your
modified version of Whoosh, and look again at the evaluation measure you chose. Save your
new index in the provided variable INDEX_Q3, your query parser in the provided variable
QP_Q3, and your searcher in the provided variable SEARCHER_Q3.
(c) What modifications did you make and what were the improvements? Explain whether there
were overall improvements (over some/all queries) in performance and also whether either
the false negative or false positive cases from part (a) improved. [1-3 sentences, any single
improvement over the baseline is sufficient for full credit, but nonetheless, you are encouraged
to explore]
(d) Did your changes improve things overall? [yes/no]
(e) Did some queries get better while others got worse? [yes/no]
(f) What do you think this means for your idea: was it good? Why or why not? [1-3 sentences]
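For the error analysis in part (a), false positives and false negatives for one query can be listed by comparing the top of the ranking against the relevant set. This is a sketch and not a replacement for printRelName(): the function name error_cases, the cutoff k, and the document IDs are all hypothetical.

```python
def error_cases(ranking, relevant, k=10):
    """Return (false_positives, false_negatives) for the top-k results.

    False positives: irrelevant documents ranked in the top k.
    False negatives: relevant documents missing from the top k.
    """
    top_k = ranking[:k]
    false_positives = [doc for doc in top_k if doc not in relevant]
    false_negatives = sorted(relevant - set(top_k))
    return false_positives, false_negatives


# Made-up ranking and judgments for one query.
ranking = ["G01", "G07", "G03", "G12", "G05"]
relevant = {"G03", "G09", "G05"}

fps, fns = error_cases(ranking, relevant, k=3)
print("false positives:", fps)
print("false negatives:", fns)
```

With k=3, G05 counts as a false negative even though it was retrieved, because it was not ranked highly.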
Q4. Search engine optimization
Try alternative techniques (tokenizers, filters, stemmers, scoring functions, etc.) to improve performance. You should show multiple cells corresponding to multiple iterations of your improvement
attempts. At the end, we want a clear markdown cell stating the following:
(a) A clear list of all final modifications made.
(b) Why each modification was made – how did it help?
(c) The final MAP performance that these modifications attained. We will verify this score
by running the best configuration stored in the Q4 variables described below. Your Q4
autograde will vary linearly from no credit for a MAP score of 0.32 (or below)
to full credit for a MAP score equal to 0.41 (or above).
This time, save your final best-performing index in the provided variable INDEX_Q4, your query
parser in the provided variable QP_Q4, and your searcher in the provided variable SEARCHER_Q4 so
we can verify the claimed performance during autograding.
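The linear scale in part (c) can be read as a simple interpolation between the two MAP thresholds. The sketch below expresses credit as a fraction between 0 and 1, which is an assumption about how the autograder reports it; the function name q4_credit is hypothetical.

```python
def q4_credit(map_score, floor=0.32, ceiling=0.41):
    """Fraction of full credit: 0 at or below floor, 1 at or above ceiling."""
    fraction = (map_score - floor) / (ceiling - floor)
    return max(0.0, min(1.0, fraction))


for score in (0.30, 0.32, 0.365, 0.41, 0.45):
    print(score, round(q4_credit(score), 3))
```

A MAP halfway between the two thresholds would earn about half credit under this reading.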
Code Validation
(a) Run the validation cells at the end of the notebook to make sure the variables are properly
defined and their type is as required by the auto-grader.
(b) Make sure you address any AssertionError. The submitted notebook should not have any
(c) Make sure your notebook runs without generating exceptions by restarting it and running
all code cells. This can be done by choosing Runtime → Restart and run all. You should
see no exceptions.