CS 6200 Homework 3: Crawling, Vertical Search solution

Description

In this assignment, you will work with a team to create a vertical search engine using Elasticsearch. Please read these instructions carefully: although you are working with teammates, you will be graded individually for most of the assignment. You will write a web crawler and crawl Internet documents to construct a document collection focused on a particular topic. Your crawler must conform strictly to a particular politeness policy. Once the documents are crawled, you will pool them together. Form a team of three students with your classmates. Your team will be assigned a single query with a few associated seed URLs. You will each crawl web pages starting from a different seed URL. When you have each collected your individual documents, you will pool them together, index them, and implement search. Although you are working in a team, you are each responsible for developing your own crawler individually, and for crawling from your own seeds for your team's assigned topic.

Obtaining a topic

Form a team of three students with your classmates. If you have trouble finding teammates, please contact the TAs right away to be placed in a team. Once your team has been formed, have each team member create a file in Dropbox named teamXYabcd.txt (using your first initial and last name). This file should contain the names of the team members. The TAs will update this file with a topic and three seed URLs. Each individual on your team will crawl using three seed URLs: one of the URLs provided to the team, and at least two additional seed URLs you devise on your own. In total, the members of your team will crawl from at least nine seed URLs.

Crawling Documents

Each individual is responsible for writing their own crawler and crawling from their own seed URLs. Set up Elasticsearch with your teammates so that everyone uses the same cluster name and the same index name. Your crawler will manage a frontier of URLs to be crawled. The frontier will initially contain just your seed URLs. URLs will be added to the frontier as you crawl, by finding the links on the web pages you crawl.

1. You should crawl at least 20,000 documents individually, including your seed URLs. This will take several hours, so think carefully about how to adequately test your program without running it to completion in each debugging cycle.
2. You should choose the next URL to crawl from your frontier using a best-first strategy. See Frontier Management, below.
3. Your crawler must strictly conform to the politeness policy detailed in the section below. You will be consuming resources owned by the web sites you crawl, and many of them actively look for misbehaving crawlers to permanently block. Please be considerate of the resources you consume.
4. You should only crawl HTML documents. It is up to you to devise a way to ensure this. However, do not reject documents simply because their URLs don't end in .html or .htm.
5. You should find all outgoing links on the pages you crawl, canonicalize them, and add them to your frontier if they are new. See the Document Processing and URL Canonicalization sections below for a discussion.
6. For each page you crawl, you should store the following fields in Elasticsearch: an id, the URL, the HTTP headers, the cleaned page contents (with term positions), the raw HTML, and a list of all known in-links and out-links for the page. One possible way to store these fields is sketched after this section.

Once your crawl is done, you should get together with your teammates and figure out how to merge the indexes. With proper ids, Elasticsearch will do the merging itself; you still have to manage the link graph.
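As a rough illustration of item 6 above, the sketch below stores one crawled page with the elasticsearch Python client. It is only a sketch under assumed names: the index name hw3_pages and the field names are placeholder choices, not a schema prescribed by the assignment.

    # Minimal sketch, not the required schema: index one crawled page so that
    # teammates' crawls merge by document id. Index and field names are
    # illustrative placeholders.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # point this at your team's shared cluster

    def index_page(canonical_url, headers, cleaned_text, raw_html, in_links, out_links):
        doc = {
            "url": canonical_url,
            "http_headers": dict(headers),
            "text": cleaned_text,      # map this field with term vectors/positions enabled
            "raw_html": raw_html,
            "in_links": sorted(in_links),
            "out_links": sorted(out_links),
        }
        # Using the canonical URL as the id means the same page indexed by two
        # teammates collapses to a single document. Older client versions take
        # body=doc instead of document=doc.
        es.index(index="hw3_pages", id=canonical_url, document=doc)

Note that merging the in-link lists contributed by different team members still has to be handled separately, for example by unioning the lists when a duplicate id is seen; Elasticsearch only keeps one copy of the document per id.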
Politeness Policy

Your crawler must strictly observe this politeness policy at all times, including during development and testing. Violating these policies can harm the web sites you crawl and cause the web site administrators to block the IP address from which you are crawling.

1. Make no more than one HTTP request per second from any given domain. You may crawl multiple pages from different domains at the same time, but be prepared to convince the TAs that your crawler obeys this rule. The simplest approach is to make one request at a time and have your program sleep between requests. The one exception to this rule is that if you make a HEAD request for a URL, you may then make a GET request for the same URL without waiting.
2. Before you crawl the first page from a given domain, fetch its robots.txt file and make sure your crawler strictly obeys the file. You should use a third-party library to parse the file and tell you which URLs are OK to crawl.

Frontier Management

The frontier is the data structure you use to store pages you need to crawl. For each page, the frontier should store the canonicalized page URL and the in-link count to the page from other pages you have already crawled. When selecting the next page to crawl, you should choose the next page in the following order:

1. Seed URLs should always be crawled first.
2. Use BFS as the baseline graph traversal (variations and optimizations are allowed).
3. Prefer pages with higher in-link counts.
4. If multiple pages have maximal in-link counts, choose the one that has been in the queue the longest.

If the next page in the frontier is at a domain you have recently crawled a page from and you do not wish to wait, then you should crawl the next page from a different domain instead.

URL Canonicalization

Many URLs can refer to the same web resource. In order to ensure that you crawl 20,000 distinct pages, you should apply the following canonicalization rules to all URLs you encounter.

1. Convert the scheme and host to lower case: HTTP://www.Example.com/SomeFile.html → http://www.example.com/SomeFile.html
2. Remove port 80 from HTTP URLs, and port 443 from HTTPS URLs: http://www.example.com:80 → http://www.example.com
3. Make relative URLs absolute: if you crawl https://www.example.com/a/b.html and find the URL ../c.html, it should canonicalize to https://www.example.com/c.html.
4. Remove the fragment, which begins with #: https://www.example.com/a.html#anything → https://www.example.com/a.html
5. Remove duplicate slashes: https://www.example.com//a.html → https://www.example.com/a.html

You may add additional canonicalization rules to improve performance, if you wish to do so.
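The sketch below shows one possible implementation of the five rules above using Python's standard urllib.parse; it is a starting point only, and any extra rules or edge cases (unusual ports, query-string handling, trailing slashes) are left to you.

    # Rough sketch of the canonicalization rules; covers only rules 1-5 above.
    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalize(url, base=None):
        if base:
            url = urljoin(base, url)                    # rule 3: make relative URLs absolute
        scheme, netloc, path, query, _ = urlsplit(url)  # discarding the last part drops the fragment (rule 4)
        scheme = scheme.lower()                         # rule 1: lower-case the scheme...
        netloc = netloc.lower()                         # ...and the host
        if (scheme == "http" and netloc.endswith(":80")) or \
           (scheme == "https" and netloc.endswith(":443")):
            netloc = netloc.rsplit(":", 1)[0]           # rule 2: remove default ports
        while "//" in path:
            path = path.replace("//", "/")              # rule 5: collapse duplicate slashes
        return urlunsplit((scheme, netloc, path, query, ""))

For example, canonicalize("../c.html", base="https://www.example.com/a/b.html") returns https://www.example.com/c.html, matching rule 3.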
Document Processing

Once you have downloaded a web page, you will need to parse it to update the frontier and save its contents. You should parse it using a third-party library. We suggest jsoup for Java, and Beautiful Soup for Python. You will need to do the following:

1. Extract all links in <a> tags. Canonicalize each URL, add it to the frontier if it has not been crawled (or increment the in-link count if the URL is already in the frontier), and record it as an out-link in the link graph file.
2. Extract the document text, stripped of all HTML formatting, JavaScript, CSS, and so on. Write the document text to a file in the same format as the AP89 corpus, as described below. Use the canonical URL as the DOCNO. If the page has a
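As a rough sketch of steps 1 and 2 above using Beautiful Soup: the code below resolves links against the page URL and strips scripts and styles before taking the text. Error handling and the final canonicalization pass (your own canonicalize() helper, such as the one sketched in the URL Canonicalization section) are omitted.

    # Minimal sketch of link and text extraction with Beautiful Soup.
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    def process_page(raw_html, page_url):
        soup = BeautifulSoup(raw_html, "html.parser")
        # Step 1: out-links from <a href="..."> tags, made absolute against the page URL;
        # pass each through your canonicalization rules before updating the frontier.
        out_links = {urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)}
        # Step 2: drop script/style blocks, then keep only the visible text.
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = " ".join(soup.get_text(separator=" ").split())
        return out_links, text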