DSCI 558 Homework 5: Information Extraction II





In this homework, you will extract data from unstructured text using Snorkel
(https://snorkel.org/), a system for training models with weak supervision through data
programming. You will use Snorkel (v.0.7.0-beta) to extract, from the biographies of cast
members (actresses and actors), the schools, colleges and universities where they studied.
We provide a Python notebook (snorkel.ipynb), which contains code and instructions to
accomplish this task.
• Once you have finished all the tasks, rename the notebook to
Firstname_Lastname_hw05_snorkel.ipynb (you will submit it).

Task 1 (2pts)
Label 99 documents in a development set as instructed in the notebook.
Save your results to two files (Firstname_Lastname_hw05_gold_labels.dev.json and
Firstname_Lastname_hw05_extracted_relation.dev.json) using the code in the notebook.
Task 2 (4pts)
2.1. Define (in the notebook) your own labeling functions (LFs), which Snorkel uses to label
the training set.
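As a rough illustration of 2.1, the sketch below shows the general shape of labeling functions for this task. It assumes the Snorkel 0.7 convention that an LF takes one candidate and returns 1 (true), -1 (false) or 0 (abstain). `ToyCandidate` and the cue lists are hypothetical stand-ins so the example runs on its own; in the notebook, the LFs receive Snorkel candidate objects instead.

```python
import re
from dataclasses import dataclass

POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


@dataclass
class ToyCandidate:
    """Hypothetical stand-in for a (person, organization) candidate."""
    person: str
    organization: str
    between: str  # text found between the two spans


# Illustrative cue lists; they are assumptions, not part of the assignment.
STUDY_CUES = re.compile(r"\b(graduated from|studied at|attended|enrolled at)\b", re.I)
WORK_CUES = re.compile(r"\b(hired by|signed with|worked for|worked at)\b", re.I)
SCHOOL_WORDS = re.compile(r"\b(university|college|school|academy|institute)\b", re.I)


def lf_study_cue(c):
    """Vote true when a study-related phrase sits between the two spans."""
    return POSITIVE if STUDY_CUES.search(c.between) else ABSTAIN


def lf_work_cue(c):
    """Vote false when an employment-related phrase sits between the spans."""
    return NEGATIVE if WORK_CUES.search(c.between) else ABSTAIN


def lf_school_word(c):
    """Vote true when the organization name itself looks like a school."""
    return POSITIVE if SCHOOL_WORDS.search(c.organization) else ABSTAIN


LFS = [lf_study_cue, lf_work_cue, lf_school_word]


def apply_lfs(candidate):
    """Return the list of votes, one per labeling function."""
    return [lf(candidate) for lf in LFS]
```

Each function abstains unless its pattern fires, so the LFs can disagree or overlap; resolving those votes is left to the generative model.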
2.2. Report (attach in your report, with comments) the performance of your LFs using the two
cells shown in Figures 1 and 2 below. We will grade you based on the performance of your
LFs. Normally, the F1 score should be greater than 0.35.
Figure 1: Score of the generative model
Figure 2: Detailed statistics about LFs learned by the generative model
2.3. Report (attach in your report) the distribution of the training marginals (as in Figure 3).
The distribution is important because it tells you whether your labeling functions are good. In
the example below, the distribution is poor because the labeling functions do not label anything.
2.4. Comment (in your report) on your marginal distribution (max 3 sentences). Is
it good or bad? Explain briefly.
Figure 3: Distribution of training marginals
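To help with the comment in 2.4, the following minimal sketch summarizes a list of marginals without plotting. It assumes the marginals are probabilities in [0, 1] and that candidates no LF labels end up near 0.5; the function name, the bin count and the width of the "undecided" band are all hypothetical choices.

```python
def summarize_marginals(marginals, bins=10, band=0.1):
    """Return (histogram counts, fraction of marginals within `band` of 0.5)."""
    if not marginals:
        return [0] * bins, 0.0
    counts = [0] * bins
    for p in marginals:
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * bins), bins - 1)
        counts[idx] += 1
    undecided = sum(1 for p in marginals if abs(p - 0.5) <= band)
    return counts, undecided / len(marginals)


# Toy example: most candidates are left undecided by the labeling functions.
toy = [0.5] * 8 + [0.95, 0.05]
counts, undecided_fraction = summarize_marginals(toy)
print(counts, undecided_fraction)
```

A large undecided fraction suggests the LFs abstain or cancel each other out on most candidates, whereas mass near 0 and 1 suggests they produce confident labels.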
Task 3 (2pts)
Distant supervision generates training data automatically using an external, imperfectly
aligned training resource, such as a Knowledge Base.
Define an additional distant-supervision-based labeling function that uses DBpedia. With
this additional labeling function in place, repeat steps 2.2, 2.3 & 2.4 and attach your
answers to the report (refer to these answers as 3.2, 3.3 & 3.4 respectively).
• Hint: You can use SPARQLWrapper (https://rdflib.github.io/sparqlwrapper/) to
access DBpedia’s SPARQL endpoint and query for instances of types such as
schema:CollegeOrUniversity and dbo:EducationalInstitution.
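One possible way to follow the hint is sketched below. It assumes the public endpoint at https://dbpedia.org/sparql, English rdfs:label values as school names, and an exact lowercase match against the organization text. The helper names are hypothetical, and the labeling function takes the organization string directly, whereas in the notebook it would read that text from the candidate.

```python
QUERY_TEMPLATE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?label WHERE {
  ?school a dbo:EducationalInstitution ;
          rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT %d OFFSET %d
"""


def build_query(limit=1000, offset=0):
    """Fill in the paging parameters of the SPARQL query."""
    return QUERY_TEMPLATE % (limit, offset)


def fetch_school_names(endpoint="https://dbpedia.org/sparql", limit=1000, offset=0):
    """Query DBpedia and return a set of lowercased school names."""
    # Imported lazily so the rest of the sketch runs without the package.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(build_query(limit, offset))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return {b["label"]["value"].lower() for b in bindings}


def make_lf_dbpedia(known_schools):
    """Build an LF that votes 1 when the organization is a known school."""
    def lf_dbpedia_school(organization_text):
        name = organization_text.strip().lower()
        return 1 if name in known_schools else 0  # 0 means abstain
    return lf_dbpedia_school


# Offline usage with a hand-written set instead of a live query.
lf = make_lf_dbpedia({"yale university", "juilliard school"})
print(lf("Yale University"), lf("Warner Bros."))
```

Fetching the names once and caching them in a set keeps the labeling function fast, since it is called for every candidate. Exact matching will miss name variants, so a looser match may be worth trying.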
Task 4 (2pts)
Train an end extraction model (as demonstrated in the notebook).
Tune the hyper-parameters to get your best F1 score and include it in your report. Comment
on your tuning process and explain your line of thought.
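The tuning loop can be organized as a simple grid search, sketched below under stated assumptions. The `evaluate` callback is hypothetical: it stands for whatever code trains the model with the given settings and returns the F1 score on the development set. The parameter names in the toy example are placeholders, not the notebook's actual options.

```python
from itertools import product


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    history = []  # every (params, score) pair, useful for the report
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        history.append((params, score))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


# Toy stand-in for "train the model and return dev-set F1".
def toy_evaluate(params):
    return 1.0 - abs(params["lr"] - 0.01) - abs(params["dropout"] - 0.5)


grid = {"lr": [0.001, 0.01, 0.1], "dropout": [0.25, 0.5]}
best_params, best_score, history = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```

Keeping the full history makes it easier to describe the tuning process, since it records which settings were tried and how each one scored.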
Extract and save the relations file in the testing set to
Submission Instructions
You must submit (via Blackboard) the following files/folders in a single .zip archive named
• Firstname_Lastname_hw05_report.pdf: A PDF file with your relevant answers
• Firstname_Lastname_hw05_snorkel.ipynb: The notebook containing the code you wrote to
accomplish the tasks
• Firstname_Lastname_hw05_snorkel.pdf: A printed version of the notebook. You can
save your notebook as a PDF using Print Preview or Download as PDF in the File menu
• Firstname_Lastname_hw05_extracted_relation.dev.json: The extracted relations
from Task 1
• Firstname_Lastname_hw05_gold_labels.dev.json: The labeled data from Task 1
• Firstname_Lastname_hw05_extracted_relation.test.json: The extracted relations
from Task 4
• source: This folder includes any additional code you may have written to accomplish the
tasks (other than the notebook).