CS 6200 Homework4: Web graph computation solution

Download Details:

  • Name: HW4-1.zip
  • Type: zip
  • Size: 177.08 KB

Description

Compute link graph measures for each crawled page using the adjacency matrix. Although you must use the
merged team index, this assignment is individual (you may compare results with teammates).
Page Rank – crawl
Compute the PageRank of every page in your crawl (merged team index). You can use any of the methods
described in class: random walks (slow), transition matrix, algebraic solution etc. List the top 500 pages by the
PageRank score. For the basic iteration method, you can look at this PageRank pseudocode
(https://www.ccs.neu.edu/course/cs6200f13/proj1.html) to get an idea.
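The basic iteration method can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution: the function names `pagerank` and `top_k`, the damping factor 0.85, the convergence tolerance, and the choice to spread the rank of sink pages evenly over all pages are assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, tol=1e-8, max_iter=200):
    """Iterative PageRank over a dict {page: set of pages it links to}."""
    # Collect every page that appears as a link source or a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Assumption: rank held by sink pages (no out-links) is shared by all pages.
        sink_mass = sum(rank[p] for p in pages if not out_links.get(p))
        base = (1.0 - damping) / n + damping * sink_mass / n
        new_rank = {p: base for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        # Stop once the total change in scores (L1 distance) is small.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def top_k(scores, k=500):
    """Return the k highest-scoring (page, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

For a crawl, `out_links` would be built from the out-link field of the merged index; `top_k(pagerank(out_links))` then gives the 500 pages to list.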
Page Rank – other graph
Get the graph linked by the in-links in file resources/wt2g_inlinks.txt.zip
Compute the PageRank of every page. List the top 500 pages by the PageRank score.
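Loading the in-link file might look like the sketch below. It assumes, without confirmation from this handout, that each line holds a page ID followed by the IDs of the pages linking to it, separated by whitespace; check the actual file before relying on that. The helper names `parse_inlinks` and `read_inlinks_zip` are hypothetical.

```python
import io
import zipfile


def parse_inlinks(lines):
    """Build in-link and out-link maps from lines of 'page src1 src2 ...'."""
    in_links = {}
    out_links = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        page, sources = tokens[0], set(tokens[1:])
        in_links.setdefault(page, set()).update(sources)
        out_links.setdefault(page, set())
        for src in sources:
            # A source may never appear at the start of a line, so register it here.
            in_links.setdefault(src, set())
            out_links.setdefault(src, set()).add(page)
    return in_links, out_links


def read_inlinks_zip(zip_path, member=None):
    """Parse the first (or the named) member of a zip archive of in-link lines."""
    with zipfile.ZipFile(zip_path) as zf:
        name = member or zf.namelist()[0]
        with zf.open(name) as fh:
            return parse_inlinks(io.TextIOWrapper(fh, encoding="utf-8"))
```

The out-link map is the form most iterative PageRank routines consume; the in-link map is kept because it is handy for sanity checks such as counting pages with no in-links.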
HITS- crawl
Compute Hubs and Authority score for the pages in the crawl (merged team index)
1. Create a root set
   1. Obtain a basic set of documents by ranking all pages using an IR function (e.g. BM25, ES Search)
      and add the basic set to the root set.
      You will need to use your topic as your query.
   2. For each page in the 1000 web pages, add all pages that the page points to.
   3. For each page in the 1000 web pages, obtain the set of pages that point to the page:
      if the size of the set is less than or equal to d, add all pages in the set to the root set;
      if the size of the set is greater than d, add an arbitrary set of d pages from the set to the root
      set.
   Note: The constant d can be 50. The idea is to include more potentially strong hubs
   in the root set while constraining the size of the root set.
2. Create a base set: expand the root set by incoming and outgoing links. Some capping by size might be
necessary.
3. For each web page, initialize its authority and hub scores to 1. Update hub and authority scores for each
page in the base set until convergence
Authority Score Update: Set each web page’s authority score in the base set to the sum of the hub
scores of the web pages that point to it
Hub Score Update: Set each web page’s hub score in the base set to the sum of the authority scores
of the web pages that it points to
After every iteration, it is necessary to normalize the hub and authority scores. Please see the lecture
notes for details.
4. Create one file for top 500 hub webpages, and one file for top 500 authority webpages
The format for both files should be: [webpage url][tab][hub/authority score]\n
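The steps above can be sketched as follows. This is one possible reading, not the course's reference implementation: the function names are hypothetical, "an arbitrary set of d pages" is taken to be the first d in sorted order, normalization uses the L2 norm (the lecture notes may prescribe a different one), and convergence is judged by the total change in scores.

```python
import math


def expand_root_set(root, out_links, in_links, d=50):
    """Add every out-link target and at most d in-link sources of each page."""
    expanded = set(root)
    for page in root:
        expanded.update(out_links.get(page, ()))
        sources = sorted(in_links.get(page, ()))
        expanded.update(sources[:d])  # all of them when there are d or fewer
    return expanded


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else dict(scores)


def hits(base, out_links, tol=1e-6, max_iter=100):
    """Return (hub, authority) score dicts for the pages in the base set."""
    # Keep only links whose source and target both lie in the base set.
    links = {p: [q for q in out_links.get(p, ()) if q in base] for p in base}
    hub = {p: 1.0 for p in base}
    auth = {p: 1.0 for p in base}
    for _ in range(max_iter):
        # Authority update: sum of hub scores of the pages pointing to each page.
        new_auth = {p: 0.0 for p in base}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum of authority scores of the pages each page points to.
        new_hub = {p: sum(new_auth[q] for q in links[p]) for p in base}
        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        delta = sum(abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
                    for p in base)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return hub, auth


def format_top(scores, k=500):
    """Render the top-k pages as lines of 'url<TAB>score'."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return "".join(f"{url}\t{score}\n" for url, score in top)
```

Writing `format_top(hub)` and `format_top(auth)` to two separate files produces the two required outputs.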
EC1
Implement a Topical PageRank by designing categories appropriate for your crawl (merged team index)
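One common way to make PageRank topical is to restrict the random jump to pages belonging to a category, computing one score vector per category. The sketch below illustrates that idea under stated assumptions: the function name `topical_pagerank`, the damping factor, and sending the rank of sink pages back through the same topic-restricted jump are choices made for the example, and how pages are assigned to categories is left to the crawl.

```python
def topical_pagerank(out_links, topic_pages, damping=0.85, tol=1e-8, max_iter=200):
    """PageRank whose random jump lands only on pages of one category."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    topic = set(topic_pages) & pages
    if not topic:
        raise ValueError("no topic page occurs in the graph")

    # Teleport distribution: uniform over the topic pages, zero elsewhere.
    teleport = {p: (1.0 / len(topic) if p in topic else 0.0) for p in pages}
    rank = dict(teleport)

    for _ in range(max_iter):
        # Assumption: sink pages hand their rank to the teleport distribution.
        sink_mass = sum(rank[p] for p in pages if not out_links.get(p))
        jump = (1.0 - damping) + damping * sink_mass
        new_rank = {p: jump * teleport[p] for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

Running it once per category, for example `{name: topical_pagerank(out_links, members) for name, members in categories.items()}` with a hypothetical `categories` mapping, gives one ranking per topic.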
EC2
Implement SALSA scoring on your crawl (merged team index) and compare with HITS
Rubric
15 points
Page Rank on wt2g_inlinks data
30 points
PageRank on crawled data
30 points
HITS on side data