DSCI 558 Homework 1: Crawling

In this homework, you will implement a web crawler to collect webpages and extract data from the Internet Movie Database (IMDb) website (https://www.imdb.com/). A web crawler is a program/bot that systematically browses the World Wide Web (WWW), typically for the purpose of web indexing (web spidering). It starts with a list of seed URLs to visit; as it visits each webpage, it finds the links in that page, then visits those links and repeats the entire process. You are required to use Scrapy (https://scrapy.org), a free and open-source web-crawling Python library.

Task 1 (4 points)

Crawl at least 5000 webpages of Comedy movies/shows in IMDb using Scrapy. Extract and generate the following attributes for each webpage:

• id: unique id for the webpage
• url: URL of the webpage
• timestamp_crawl: timestamp of the crawling event
• title, genres*, languages*, release_date, budget, gross, runtime: see Figure 1; if an attribute does not exist, set it to an empty value (string or list, according to its type)

* the attribute holds a list of strings, not a single string value

Store your crawled data in a JSON-Lines (.jl) file. In this file format, each line is a valid JSON object (dictionary) that holds the attributes listed above for a single crawled webpage. You can check the attached file sample.jl to understand the format (note that genres and languages are lists of values). While crawling, please make sure you obey the website's politeness rules (i.e. sleep time between requests) in order to avoid getting banned.

Figure 1: An example movie/show webpage with the required attributes
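To make the setup concrete, below is a minimal sketch of what a Task 1 spider could look like. The seed URL, the link patterns, and the CSS selectors are assumptions about IMDb's markup and will likely need adjusting against the live pages; the id field is generated here as an MD5 hash of the URL, which is just one possible choice of unique id. A Task 2 spider can follow the same pattern with the cast-page attributes.

```python
# A minimal sketch of a Task 1 spider, NOT a reference solution.
# Seed URL, link patterns, and selectors are assumptions about IMDb's markup.
import hashlib
from datetime import datetime, timezone

import scrapy


class ComedyTitleSpider(scrapy.Spider):
    name = "imdb_comedy_titles"
    # Assumed seed: an IMDb search page listing Comedy titles.
    start_urls = ["https://www.imdb.com/search/title/?genres=comedy"]

    # Politeness: sleep between requests and respect robots.txt.
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "ROBOTSTXT_OBEY": True,
    }

    def parse(self, response):
        # Follow links that look like title pages (/title/tt.../).
        for href in response.css("a::attr(href)").re(r"/title/tt\d+/"):
            yield response.follow(href, callback=self.parse_title)
        # Follow further listing pages; newer IMDb listings load results
        # dynamically, so this pattern may need to be replaced.
        for href in response.css("a::attr(href)").re(r"/search/title/\S*"):
            yield response.follow(href, callback=self.parse)

    def parse_title(self, response):
        yield {
            # One possible unique id: a hash of the page URL.
            "id": hashlib.md5(response.url.encode()).hexdigest(),
            "url": response.url,
            "timestamp_crawl": datetime.now(timezone.utc).isoformat(),
            # Placeholder selectors: inspect the live pages and replace
            # them with working CSS/XPath expressions for each attribute.
            "title": response.css("h1::text").get(default=""),
            "genres": response.css("a[href*='genres=']::text").getall(),
            "languages": [],
            "release_date": "",
            "budget": "",
            "gross": "",
            "runtime": "",
        }
```

Saving the sketch as comedy_spider.py (a hypothetical file name) and running scrapy runspider comedy_spider.py -o titles.jl writes one JSON object per line, i.e. the JSON-Lines format required above; the DOWNLOAD_DELAY and ROBOTSTXT_OBEY settings address the politeness requirement.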
Task 2 (4 points)

Similar to the previous task, crawl at least 5000 webpages of cast (actors and actresses) in IMDb using Scrapy. Extract and generate the following attributes for each cast webpage:

• id: unique id for the webpage
• url: URL of the webpage
• timestamp_crawl: timestamp of the crawling event
• name, date_of_birth, place_of_birth, date_of_death, place_of_death, mini_bio: see Figure 2; if an attribute does not exist, set it as an empty string

Similarly, store your crawled data in a JSON-Lines (.jl) file.

Figure 2: An example cast webpage with the required attributes

Task 3 (2 points)

Answer the following questions (no more than 2 sentences per question) for each of the previous tasks (Tasks 1 and 2):

3.1. What is the seed URL(s) you used?
3.2. How did you manage to only collect movie/show or cast pages?
3.3. Did you need to discard irrelevant pages? If so, how?
3.4. Did you collect the required number of pages? If you were not able to do so, please describe and explain your issues.

Submission Instructions

You must submit (via Blackboard) the following files/folders in a single .zip archive named Firstname_Lastname_hw01.zip:

• Firstname_Lastname_hw01_report.pdf: PDF file with your answers to Task 3
• JSON-Lines files containing the data you crawled using Scrapy for Tasks 1 and 2:
  o Firstname_Lastname_hw01_scrapy_title.jl: generated data from Task 1
  o Firstname_Lastname_hw01_scrapy_cast.jl: generated data from Task 2
• source: this folder includes all the code you wrote to accomplish Tasks 1 and 2 (i.e. your Scrapy crawler, seed files, your script/program to eliminate unwanted pages and store webpages into JSON-Lines format, etc.)

Additional Notes

1. We also provide you with a script called post_processing.py; you can use it to validate the structure of your JSON-Lines file from Task 1:
   a. Prerequisites: Python 3 and the ujson package (pip install ujson)
   b. Usage: the script takes one argument, the path of your .jl file; it processes the file and prints a message to let you know whether it is valid. For example, python post_processing.py /path/to/file/sample.jl will print:
      Process: (0/2)…
      Process: (1/2)…
      Finished processing, looks good, found 2 entries.
2. It is your responsibility to validate the structure of the outputs in Task 2 prior to submitting.
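Since the provided post_processing.py only covers Task 1, you may want a similar sanity check for the Task 2 file. The sketch below is not the provided script; it is a minimal, assumed-equivalent check that each line parses as JSON (using the same ujson package) and carries the Task 2 keys. The file name check_cast_jl.py used in the usage example is hypothetical.

```python
# A quick sanity check for a JSON-Lines file (not the provided post_processing.py).
import sys

import ujson

# Keys required for each Task 2 entry (adjust the set for Task 1 files).
REQUIRED_KEYS = {
    "id", "url", "timestamp_crawl", "name", "date_of_birth",
    "place_of_birth", "date_of_death", "place_of_death", "mini_bio",
}

count = 0
with open(sys.argv[1], encoding="utf-8") as fh:
    for lineno, line in enumerate(fh, start=1):
        if not line.strip():
            continue  # skip blank lines, if any
        try:
            entry = ujson.loads(line)  # each line must be a valid JSON object
        except ValueError:
            print(f"line {lineno}: not valid JSON")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            print(f"line {lineno}: missing keys {sorted(missing)}")
        count += 1

print(f"Finished: {count} entries checked.")
```

Run it as python check_cast_jl.py Firstname_Lastname_hw01_scrapy_cast.jl; lines that fail to parse or lack a required key are reported with their line numbers.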