CSCI572 HW 1 to 5 solutions

CSCI572 HW 1 Web Search Engine Comparison

This exercise is about comparing the search results from Google with those of other search engines.
Many search engine comparison studies have been done.

All of them use samples of data, some
small and some large, so no definitive conclusions can be drawn. But it is always instructive to
see how two search engines match up, even on a small data set.

The process you will follow is to issue a set of queries and to evaluate how closely the results of
the two search engines compare. You will compare the results from the search engine that you
are assigned to with the results from Google (provided by us on the class website).

To begin, the class is divided into four groups. Students are pre-assigned according to their USC
ID number, as given in the table below.
Note: Please stick with the assigned dataset and search engine according to your ID number.
PLEASE don’t work on another dataset and later ask for an exception.

USC ID ends with | Query Data Set | Google Reference Dataset | Assigned Search Engine
00~24 | 100QueriesSet1 | Google_Result1.json | Bing
25~49 | 100QueriesSet2 | Google_Result2.json | Yahoo!
50~74 | 100QueriesSet3 | Google_Result3.json | Ask
75~99 | 100QueriesSet4 | Google_Result4.json | DuckDuckGo

THE QUERIES

The queries will be given to you in a text file, one query per line. Each file contains 100 queries.
These are actual queries extracted from query log files of several search engines. Here is a
sample of some of the queries:
The European Union includes how many countries
What are Mia Hamms accomplishments
Which form of government is still in place in Greece
When was the canal de panama built
What color is the black box on commercial airplanes

Note: Some of the queries will include misspellings; you should not alter the queries in any way,
as this accurately reflects the type of query data that search engines have to deal with.

REFERENCE GOOGLE DATASET

A Google Reference JSON file is given which contains the Google results for each of the
queries in your dataset. (JSON, JavaScript Object Notation, is a programming-language-independent file
format used to transmit data objects consisting of key-value pairs; see
https://www.softwaretestinghelp.com/json-tutorial/.) The JSON file is structured in the form of a query as
the key and a list of 10 results as the value for that key (each a particular URL representing a result).
The Google results for a specific query are ordered as they were returned by Google: the 1st element
in the list represents the top result that was scraped from Google, the 2nd element represents the
second result, and so on.

Example:
{
  "A two dollar bill from 1953 is worth what": [
    "https://www.antiquemoney.com/old-two-dollar-bill-value-price-guide/two-dollarbank-notes-pictures-prices-history/prices-for-two-dollar-1953-legal-tenders/",
    "https://oldcurrencyvalues.com/1953_red_seal_two_dollar/",
    "https://www.silverrecyclers.com/blog/1953-2-dollar-bill.aspx",
    "https://www.ebay.com/b/1953-A-2-Dollar-Bill/40033/bn_7023293545",
    "https://www.ebay.com/b/1953-2-US-Federal-Reserve-SmallNotes/40029/bn_71222817",
    "https://coinsite.com/why-the-1953-2-dollar-bill-has-a-red-seal/",
    "https://hobbylark.com/collecting/Value-of-Two-Dollar-Bills",
    "https://www.quora.com/What-is-the-value-of-a-2-dollar-bill-from-1953",
    "https://www.reference.com/hobbies-games/1953-2-bill-worth-c778780b24b9eb8a",
    "https://treasurepursuits.com/1953-2-dollar-bill-value-whats-it-worth/"
  ]
}

DETERMINING OVERLAP AND CORRELATION

Overlap: Since the Google results are taken as our baseline, it will be interesting to see how
many identical results are returned by your assigned search engine, regardless of their position.
Assuming Google’s results are the standard of relevance, the percentage of identical results will
act as a measure of the quality of your assigned search engine.

Each of the queries in your dataset should be run on your assigned search engine. You should
capture the top ten results; only the resulting URL is required. For each query, you should compute
an overlap score between our reference Google answer dataset and your scraped results.
The output format is described below.

Note: If you get less than 10 URLs for a particular query, you can just use those results to
compare against Google results. For example: if a query gets 6 results from a search engine, just
use those 6 results to compare against 10 results of Google reference dataset and produce
statistics for that particular query.

Note: For a given query, if the Google result has 10 URLs, but the other search engine has fewer
results (e.g. 8), and there are 5 overlapping URLs, the percent overlap would be 5/10 (50%).

Correlation: In statistics, Spearman's rank correlation coefficient, or Spearman's rho, is a
measure of the statistical dependence between the rankings of two variables. It assesses how well
the relationship between two variables can be described by a monotonic function. Intuitively, the
Spearman correlation between two variables will be high when observations have a similar rank,
and low when observations have a dissimilar rank.

The rank coefficient rs can be computed using the formula

rs = 1 − (6 Σ di²) / (n (n² − 1))

where
● di is the difference between the two rankings of the i-th matching result, and
● n is the number of observations

Note: The formula above, when applied to search results, yields a somewhat modified set of values
that can be greater than one or less than minus one. However, the sign of the Spearman correlation
indicates the direction of association between the two rank variables. If the rank results of one
search engine are near the ranks of the other, then the Spearman correlation value is positive. If the
ranks of one are dissimilar to the ranks of the other, then the Spearman correlation value will be
negative.

Note: In the event that your search engine account enables personalized search, please turn this
off before performing your tests.

Example 1.1: "Who discovered x-rays in 1885"

GOOGLE RESULTS

1. https://explorable.com/wilhelm-conrad-roentgen
2. https://www.the-scientist.com/foundations/the-first-x-ray-1895-42279
3. https://www.bl.uk/learning/cult/bodies/xray/roentgen.html
4. https://en.wikipedia.org/wiki/Wilhelm_R%C3%B6ntgen
5. https://www.wired.com/2010/11/1108roentgen-stumbles-x-ray/
6. https://www.history.com/this-day-in-history/german-scientist-discovers-x-rays
7. https://www.aps.org/publications/apsnews/200111/history.cfm
8. https://www.nde-ed.org/EducationResources/CommunityCollege/Radiography/Introduction/history.htm

9. https://www.dw.com/en/x-ray-vision-an-accidental-discovery-that-revolutionized-medicine/a-18833060
10. https://www.slac.stanford.edu/pubs/beamline/25/2/25-2-assmus.pdf

RESULTS FROM ANOTHER SEARCH ENGINE

1. https://explorable.com/wilhelm-conrad-roentgen
2. https://www.history.com/this-day-in-history/german-scientist-discovers-x-rays
3. https://www.coursehero.com/file/p5jkhl/Discovery-of-X-rays-In-1885-Wilhem-Rontgen-while-studying-the-characteristics/
4. https://www.nde-ed.org/EducationResources/HighSchool/Radiography/discoveryxrays.htm
5. https://www.answers.com/Q/Who_discovered_x-rays

6. https://www.aps.org/publications/apsnews/200111/history.cfm
7. https://www.answers.com/Q/Who_discovered_x-rays
8. https://www.coursehero.com/file/p5jkhl/Discovery-of-X-rays-In-1885-Wilhem-Rontgen-while-studying-the-characteristics/
9. https://www.wired.com/2010/11/1108roentgen-stumbles-x-ray/
10. https://time.com/3649842/x-ray/

RANK MATCHES FROM GOOGLE AND ANOTHER SEARCH ENGINE

1 AND 1
5 AND 9
6 AND 2
7 AND 6
We are now ready to compute Spearman’s rank correlation coefficient.
Rank Google | Rank Other Search Engine | di | di²
1 | 1 | 0 | 0
5 | 9 | -4 | 16
6 | 2 | 4 | 16
7 | 6 | 1 | 1

The sum of di² = 33. The value of n = 4. Substituting into the equation:

rs = 1 − (6 × 33) / (4 × (4² − 1)) = 1 − (198 / 60) = 1 − 3.3 = −2.3

Even though we have four overlapping results (40% overlap), their positions in the search result
lists produce a negative Spearman coefficient, indicating that the rankings of the overlapping results
are negatively correlated. Clearly the two search engines are using different algorithms for weighting and
ranking the documents they determine are most relevant to the query. Moreover, their algorithms
are emphasizing different ranking features.

Note: the value of n in the equation above refers to the number of URL matches (in this case,
four) and does not refer to the original number of results (in this case, ten).
Note: If n=1 (which means only one paired match), we deal with it in a different way:
1. if Rank in your result = Rank in Google result → rho=1
2. if Rank in your result ≠ Rank in Google result → rho=0
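For concreteness, here is a minimal Python sketch of the per-query computation described above (the function name is ours, and it assumes the URLs have already been normalized as described in the FAQ below):

# Sketch: percent overlap and Spearman's rho for one query,
# following the rules above (n = number of matches; special cases for n=0 and n=1).
def overlap_and_rho(google_urls, other_urls):
    # Positions are 1-based ranks in each result list.
    google_rank = {url: i + 1 for i, url in enumerate(google_urls)}
    matches = [(google_rank[url], j + 1)
               for j, url in enumerate(other_urls) if url in google_rank]

    percent_overlap = 100.0 * len(matches) / len(google_urls)  # divide by Google's count (10)

    n = len(matches)
    if n == 0:
        rho = 0.0
    elif n == 1:
        g, o = matches[0]
        rho = 1.0 if g == o else 0.0
    else:
        sum_d_sq = sum((g - o) ** 2 for g, o in matches)
        rho = 1.0 - (6.0 * sum_d_sq) / (n * (n * n - 1))
    return percent_overlap, rho

Applied to Example 1.1, the matches (1,1), (5,9), (6,2), (7,6) give 40% overlap and rho = −2.3, as computed above.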

TASKS

Task1: Scraping results from your assigned search engine

In this task you need to develop a script (computer program) that scrapes the top 10 results
from your assigned search engine. You may use any language of your choice. Always
incorporate a random delay of between 10 and 100 seconds while scraping multiple queries;
otherwise you may be blocked by the search engine, which may then refuse to serve your
scraper for several hours.

For reference:
● https://pypi.org/project/beautifulsoup4, a python library for parsing HTML documents
● URLs for the search engines:
○ Bing: https://www.bing.com/search?q=
○ Yahoo!: https://www.search.yahoo.com/search?p=
○ Ask: https://www.ask.com/web?q=
○ DuckDuckGo: https://www.duckduckgo.com/html/?q=

For each URL, you can add your query string after q=
● Selectors for various search engines, you grab links by looking for href in these selectors:
○ Bing: ["li", attrs = {"class" : "b_algo"}]
○ Yahoo!: ["a", attrs = {"class" : "ac-algo fz-l ac-21th lh-24"}]
○ Ask: ["div", attrs = {"class" : "PartialSearchResults-item-title"}]
○ DuckDuckGo: ["a", attrs = {"class" : "result__a"}]
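As an illustration only (assuming the Bing selector above; the class names may change whenever the search engine updates its HTML), link extraction might look like this:

# Sketch: pulling result links out of an already-fetched Bing result page
from bs4 import BeautifulSoup

def extract_bing_links(html_text):
    # html_text: the raw HTML of a Bing result page fetched earlier
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for item in soup.find_all("li", attrs={"class": "b_algo"}):
        anchor = item.find("a")                    # the result-title link inside each li
        if anchor and anchor.get("href"):
            links.append(anchor.get("href"))
    return links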

By executing this task you need to generate a JSON file which will store your results in the
JSON format described above and repeated here.
{
Query1: [Result1, Result2, Result3, Result4, Result5, Result6, Result7,
Result8, Result9, Result10],
Query2: [Result1, Result2, Result3, Result4, Result5, Result6, Result7,
Result8, Result9, Result10],
….
Query100: [Result1, Result2, Result3, Result4, Result5, Result6, Result7,
Result8, Result9, Result10]
}
Here Result1 is the top result for that particular query.
NOTE: In the JSON shown above, query string should be used as keys.

Task2: Determining the Percent Overlap and the Spearman Coefficient

For this task, you need to use the JSON file that you generated in Task 1 and the Google
reference dataset which is provided by us and compare the results as shown in the Determining
Correlation section above. The output should be a CSV file with the following information:
1. Use the JSON file that you generated in Task 1 and do the following steps on each query:
2. Determine the URLs that match with the given reference Google dataset, and their
position in the search engine result list

3. Compute the percent overlap. In Example 1.1 above, the percent overlap is 4/10 or
40%.
4. Compute the Spearman correlation coefficient. In Example 1.1 above, the coefficient
is −2.3.
5. Once you run all of the queries, collect all of the top ten URLs and compute the statistics,
as shown in the following example:

Note: The above example is a table with four columns, rows containing results for each of the
queries, and averages for each of the columns. Of course the actual values above are only for
demonstration purposes. The first column should contain “Query 1”, “Query 2” … “Query 100”
and should not be replaced by actual queries.
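A minimal sketch of writing the final CSV with Python's csv module is shown below; the exact column header names here are placeholders, so follow whatever the example table/grading guideline specifies:

# Sketch: writing hw1.csv from per-query statistics
import csv

# rows: one (number of overlapping results, percent overlap, Spearman coefficient) tuple per query
rows = [(4, 40.0, -2.3), (0, 0.0, 0.0)]   # toy values; in practice one entry per query, in query order

with open("hw1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Queries", "Overlapping Results", "Percent Overlap", "Spearman Coefficient"])
    for i, (overlap, percent, rho) in enumerate(rows, start=1):
        writer.writerow(["Query %d" % i, overlap, percent, rho])
    writer.writerow(["Averages",
                     sum(r[0] for r in rows) / len(rows),
                     sum(r[1] for r in rows) / len(rows),
                     sum(r[2] for r in rows) / len(rows)])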

Points to note:
● Always incorporate a delay while scraping. We recommend that you use a random delay
with a range of 10 to 100 seconds.
● You will likely be blocked off from the search engine if you do not implement some
delay in your code.
● You should ignore the People Also Ask boxes and any carousels that may be included in
the results.
● You should ignore Ads and scrape only organic results

SUBMISSION INSTRUCTIONS

Please place your homework in your Google Drive CSCI572 folder that is shared with your
grader, in the subfolder named hw1. You need to submit:
● JSON file generated in Task 1 while scraping your assigned search engine, call it
hw1.json
● CSV file of final results after determining relevance between your assigned search engine
and Google reference dataset provided by us, call it hw1.csv. Note: you need not
format the numbers.

● TXT file stating why the assigned search engine performed better, worse, or the same as
Google; call it hw1.txt. For the txt file, we are just looking for a paragraph that
states how similar your assigned search engine is to Google based on the Spearman
coefficients and percent overlap. Make sure you clearly state the "average percent
overlap" and the "average Spearman coefficient" over all queries in the file.

SAMPLE SCRAPING PROGRAM IN PYTHON

Here is a program you can use to help you get started:

from bs4 import BeautifulSoup
import time
import requests
from random import randint

USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

class SearchEngine:
    @staticmethod
    def search(query, sleep=True):
        if sleep:  # Prevents loading too many pages too soon
            time.sleep(randint(10, 100))
        temp_url = '+'.join(query.split())   # for adding + between words for the query
        url = 'SEARCHING_URL' + temp_url     # replace SEARCHING_URL with your assigned engine's search URL
        soup = BeautifulSoup(requests.get(url, headers=USER_AGENT).text, "html.parser")
        new_results = SearchEngine.scrape_search_result(soup)
        return new_results

    @staticmethod
    def scrape_search_result(soup):
        raw_results = soup.find_all("SEARCH SELECTOR")   # replace with the selector for your engine
        results = []
        # implement a check to get only 10 results and also check that URLs must not be duplicated
        for result in raw_results:
            link = result.get('href')
            results.append(link)
        return results

############# Driver code ############
SearchEngine.search("QUERY")
######################################

FAQs
1. What do I need to run Python on my Windows/Mac machine?
You can refer to the documentation for setup:
https://docs.python.org/3.6/using/index.html
We encourage you to use Python 3.6. You can find many tutorials on Google.
2. Given that Python is installed, what lines of the sample program do I have to modify to
get it to work on a specific search engine?
In the reference code, you need to:
● Fill in the query variable
● Change SEARCHING_URL and SEARCH SELECTOR as per the search engine
that is assigned to you
● Implement the code that extracts only the top 10 URLs and make sure that none of
them is repeated
● Implement the main function

3. What to do if the query does not produce ten results?
You can modify the URLs to get 30 results on a single page:
– For Bing use count=30 – https://www.bing.com/search?q=test&count=30
– For Yahoo use n=30 – https://search.yahoo.com/search?p=test&n=30
– For Ask there does not appear to be a parameter which could produce n results on a
single page, so instead you can update the URL in such a manner that it increments the
page number
– For Ask use page=2 – https://www.ask.com/web?q=test&page=2
– If, after trying the above hints, you are unable to get 10 results for a particular query,
you can just use those results to compare against Google results. For example: if a
query gets 6 results from a search engine, just use those 6 results to compare against
the 10 results of the Google reference dataset and produce statistics for that particular query.
4. Two URLs that differ only in the scheme (http versus https) can be treated as the same.

5. Metrics for similar URLs:
a. Browsers default to www when no host name is provided, so xyz.com is
identical to www.xyz.com
b. URLs that only differ in the scheme (http or https) are identical
c. www.xyz.com and www.xyz.com/ – You need to remove slash(/) at the end of
URL
d. URLs should NOT be converted to lower case.
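A small sketch of a normalization helper that follows rules a-d (a hypothetical helper, shown only to make the rules concrete):

# Sketch: normalize a URL into a comparison key per rules a-d above
def normalize(url):
    u = url.strip()
    u = u.replace("https://", "", 1).replace("http://", "", 1)   # rule b: the scheme is ignored
    if u.startswith("www."):                                     # rule a: www.xyz.com == xyz.com
        u = u[4:]
    u = u.rstrip("/")                                            # rule c: remove the trailing slash
    return u                                                     # rule d: case is left unchanged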
6. Value of rho:
a. If no overlap, rho = 0
b. If only one result matches:
i. if Rank in your result = Rank in Google result → rho=1
ii. if Rank in your result ≠ Rank in Google result → rho=0

7. Rho value may be negative
a. The maximum value of rho is 1, but it may have negative values that are smaller
than -1.
b. How to calculate the average rho? Calculate rho for each query, sum the values, and
divide by the number of queries to get the average.

8. Save order as JSON
a. You can save the dictionary in Python as JSON directly by importing the json library
and calling json.dump(args)

9. What to do if a search engine blocks your IP:
a. Try to change USER_AGENT and try again.
b. Sometimes if you are hitting a URL in quick succession, it may block the IP. Put a
sleep or wait after each query.
c. Run the queries in batches to prevent IP ban.
d. Use a different WiFi or mobile hotspot.

CSCI572 Homework 2: Web Crawling

1. Objective

In this assignment, you will work with a simple web crawler to measure aspects of a crawl, study the characteristics of the crawl, download web pages from the crawl and gather webpage metadata, all from pre-selected news websites.

2. Preliminaries

To begin we will make use of an existing open source Java web crawler called crawler4j, which is hosted on GitHub; see that repository for complete details on downloading and compiling it. Also see the document "Instructions for Installing Eclipse and Crawler4j" located on the Assignments web page for help. Note: You can use any IDE of your choice, but we have provided installation instructions for the Eclipse IDE only.

3. Crawling

Your task is to configure and compile the crawler and then have it crawl a news website. In the interest of distributing the load evenly and not overloading the news servers, we have pre-assigned the news sites to be crawled according to your USC ID number, given in the table below.

The maximum pages to fetch can be set in crawler4j and it should be set to 20,000 to ensure a reasonable execution time for this exercise. Also, the maximum depth should be set to 16 to ensure that we limit the crawling. You should crawl only the news website assigned to you, and your crawler should be configured so that it does not visit pages outside of the given news website!

USC ID ends with | News Site to Crawl | NewsSite Name | Root URL
01~20 | NY Times | nytimes | https://www.nytimes.com
21~40 | Wall Street Journal | wsj | https://www.wsj.com
41~60 | Fox News | foxnews | https://www.foxnews.com
61~80 | USA Today | usatoday | https://www.usatoday.com
81~00 | Los Angeles Times | latimes | https://www.latimes.com

Limit your crawler so it only visits HTML, doc, pdf and different image format URLs, and record the metadata for those file types.

4. Collecting Statistics

Your primary task is to enhance the crawler so it collects information about:

1. the URLs it attempts to fetch: a two-column spreadsheet, column 1 containing the URL and column 2 containing the HTTP/HTTPS status code received; name the file fetch_NewsSite.csv (where the name "NewsSite" is replaced by the news website name in the table above that you are crawling). The number of rows should be no more than 20,000 as that is our pre-set limit. Column names for this file can be URL and Status.

2. the files it successfully downloads: a four-column spreadsheet, column 1 containing the URLs successfully downloaded, column 2 containing the size of the downloaded file (in bytes, or your own preferred unit: bytes, KB, MB), column 3 containing the number of outlinks found, and column 4 containing the resulting content-type; name the file visit_NewsSite.csv. Clearly the number of rows will be less than the number of rows in fetch_NewsSite.csv.

3. all of the URLs (including repeats) that were discovered and processed in some way: a two-column spreadsheet where column 1 contains the encountered URL and column 2 an indicator of whether the URL
a. resides in the website (OK), or
b. points outside of the website (N_OK).
(A URL points out of the website if it does not start with the initial host/domain name; e.g. when crawling the USA Today news website, all inside URLs must start with https://www.usatoday.com.) Name the file urls_NewsSite.csv. This file will be much larger than fetch_*.csv and visit_*.csv. For example, for the New York Times, URLs under https://www.nytimes.com are considered as residing in the same website, whereas https://store.nytimes.com/ is not. A sketch of the inside/outside check appears after the notes below.
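The OK/N_OK test itself is simple; here is a rough sketch in Python (your crawler would implement the same test in Java), where the host value is just an example taken from the table above:

# Sketch: mark a discovered URL as inside (OK) or outside (N_OK) the assigned news site
from urllib.parse import urlparse

SITE_HOST = "www.usatoday.com"    # example: the host of the root URL from the table above

def residence_indicator(url):
    host = urlparse(url).netloc.lower()
    return "OK" if host in (SITE_HOST, SITE_HOST.replace("www.", "")) else "N_OK"

print(residence_indicator("https://www.usatoday.com/news/"))   # OK
print(residence_indicator("https://store.usatoday.com/"))      # N_OK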

Note 1: you should modify the crawler so it outputs the above data into three separate CSV files; you will use them for processing later.
Note 2: all uses of NewsSite above should be replaced by the name given in the column labeled NewsSite Name in the table above.
Note 3: You should denote the units in the size column of visit_NewsSite.csv. The best way is to write the units you are using in the column header name and keep the rest of the size data as plain numbers for easier statistical analysis.

The hard requirement is only to show the units clearly and correctly.

Based on the information recorded by the crawler in the output files above, you are to collate the following statistics for a crawl of your designated news website:

● Fetch statistics:
o # fetches attempted: The total number of URLs that the crawler attempted to fetch. This is usually equal to the MAXPAGES setting if the crawler reached that limit; less if the website is smaller than that.
o # fetches succeeded: The number of URLs that were successfully downloaded in their entirety, i.e. returning a HTTP status code of 2XX.
o # fetches failed or aborted: The number of fetches that failed for whatever reason, including, but not limited to: HTTP redirections (3XX), client errors (4XX), server errors (5XX) and other network-related errors. (Count these based purely on the success/failure of the fetching process; do not include errors caused by difficulty in parsing content after it has already been successfully downloaded.)

● Outgoing URLs: statistics about URLs extracted from visited HTML pages
o Total URLs extracted: The grand total number of URLs extracted (including repeats) from all visited pages
o # unique URLs extracted: The number of unique URLs encountered by the crawler
o # unique URLs within your news website: The number of unique URLs encountered that are associated with the news website, i.e. the URL begins with the given root URL of the news website, but the remainder of the URL is distinct
o # unique URLs outside the news website: The number of unique URLs encountered that were not from the news website

● Status codes: the number of times various HTTP status codes were encountered during crawling, including (but not limited to): 200, 301, 401, 402, 404, etc.

● File sizes: statistics about file sizes of visited URLs – the number of files in each size range (see Appendix A). 1KB = 1024B; 1MB = 1024KB.

● Content Type: a list of the different content-types encountered

These statistics should be collated and submitted as a plain text file whose name is CrawlReport_NewsSite.txt, following the format given in Appendix A at the end of this document. Make sure you understand the crawler code and required output before you commence collating these statistics.

For efficient crawling it is a good idea to have multiple crawling threads. You are required to use multiple threads in this exercise. crawler4j supports multi-threading and our examples show setting the number of crawlers to seven (see the line in the code: int numberOfCrawlers = 7;).

However, if you do a naive implementation, the threads will trample on each other when outputting to your statistics collection files. Therefore you need to be a bit smarter about how you collect the statistics, and the crawler4j documentation has a good example of how to do this. See the following link for details:
https://github.com/yasserg/crawler4j/blob/master/crawler4j-examples/crawler4j-examples-base/src/test/java/edu/uci/ics/crawler4j/examples/localdata/LocalDataCollectorCrawler.java
All the information that you are required to collect can be derived by processing the crawler output.

5. FAQ

Q: For the purposes of counting unique URLs, how should we handle URLs that differ only in the query string? For example: https://www.nytimes.com/page?q=0 and https://www.nytimes.com/page?q=1
A: These can be treated as different URLs.

Q: URL case sensitivity: are these the same, or different URLs? https://www.nytimes.com/foo and https://www.nytimes.com/FOO
A: The path component of a URL is considered to be case-sensitive, so the crawler behavior is correct according to RFC 3986. Therefore, these are different URLs. The page served may be the same because:
● that particular web server implementation treats the path as case-insensitive (some server implementations do this, especially Windows-based implementations), or
● the web server implementation treats the path as case-sensitive, but aliasing or a redirect is being used.
This is one of the reasons why deduplication is necessary in practice.

Q: Attempting to compile the crawler results in syntax errors.
A: Make sure that you have included crawler4j as well as all its dependencies. Also check your Java version; the code includes more recent Java constructs such as the typed collection List, which requires at least Java 1.5.0.

Q: I get the following warnings when trying to run the crawler:
log4j: WARN No appenders could be found for logger
log4j: WARN Please initialize the log4j system properly.
A: You failed to include the log4j.properties file that comes with crawler4j.

Q: On Windows, I am encountering the error: Exception_Access_Violation
A: This is a Java issue. See:

Q: I am encountering multiple instances of this info message:
INFO [Crawler 1] I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond
INFO [Crawler 1] Retrying request
A: If you're working off an unsteady wireless link, you may be battling network issues such as packet losses – try to use a better connection.

If not, the web server may be struggling to keep up with the frequency of your requests. As indicated by the info message, the crawler will retry the fetch, so a few isolated occurrences of this message are not an issue. However, if the problem repeats persistently, the situation is not likely to improve if you continue hammering the server at the same frequency.

Try giving the server more room to breathe:

/*
 * Be polite: Make sure that we don't send more than
 * 1 request per second (1000 milliseconds between requests).
 */
config.setPolitenessDelay(2500);
/*
 * READ ROBOTS.TXT of the website – Crawl-Delay: 10
 * Multiply that value by 1000 for the millisecond value
 */

Q: The crawler seems to choke on some of the downloaded files, for example:
java.lang.StringIndexOutOfBoundsException: String index out of range: -2
java.lang.NullPointerException: charsetName
A: Safely ignore those. We are using a fairly simple, rudimentary crawler and it is not necessarily robust enough to handle all the possible quirks of heavy-duty crawling and parsing. These problems are few in number (compared to the entire crawl size), and for this exercise we're okay with it as long as it skips the few problem cases, keeps crawling everything else, and terminates properly – as opposed to exiting with fatal errors.

Q: While running the crawler, you may get the following error:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See for further details.
A: Download slf4j-simple-1.7.25.jar and add it as an external JAR to the project in the same way as the crawler4j JAR; this will make the crawler display logs.

Q: What should we do with a URL if it contains a comma?

A: Replace the comma with "-" or "_", so that it doesn't throw an error.

Q: Should the number of 200 codes in the fetch.csv file exactly match the number of records in visit.csv?
A: No, but it should be close, like within 2,000 of 20,000. If not, then you may be filtering too much.

Q: "CrawlConfig cannot be resolved to a type"?
A: import edu.uci.ics.crawler4j.crawler.CrawlConfig; and make sure the external JARs are added to the ClassPath. ModulePath should only contain the JRE. If it doesn't work, check whether standard JRE imports are working. Or use an alternative way: Maven. Initialize a new project using Maven and then just add a crawler4j dependency in the pom.xml file (auto-generated by Maven). The dependency is given on the crawler4j GitHub page.

Q: What's the difference between aborted fetches and failed fetches?
A: failed: can be due to HTTP errors and other network-related errors. aborted: the client decided to stop the fetching (e.g. taking too much time to fetch). You may sum up both values and provide the combined result in the write-up.

Q: For some reason my crawler attempts 19,999 fetches, even though max pages is set to 20,000. Does this matter?
A: No, it doesn't matter. It can occur because 20,000 is the limit of what you will try to fetch (it may contain successful status codes like 200 and others like 301). But visit.csv will contain only the URLs for which you were able to successfully download the files.

Q: How to differentiate fetched pages and downloaded pages?
A: In this assignment we do not ask you to save any of the downloaded files to disk. Visiting a page means crawler4j processing a page (it will parse the page and extract relevant information like outgoing URLs). That means all visited pages are downloaded.

You must make sure that your crawler crawls both the http and https pages of the given domain.

Q: How much time should it approximately take to crawl a website using n crawlers?
A: (i) It depends on the parameters you set for the crawler. (ii) It depends on the politeness delay you set in the crawler program. Your crawl time in hours = (maxPagesToFetch × politeness delay in seconds) / 3600. Example: a 20,000 page fetch with a politeness delay of 2 seconds will take 11.11 hours, assuming you are running enough threads to ensure a page fetch every 2 seconds. Therefore, it can vary for everyone.

Q: For the third CSV file, urls_NewsSite.csv, should the discovered URLs include redirect URLs?
A: YES. If the redirect URL is the one that gets status code 300, then the URL that it points to will be added to the scheduler of the crawler and waits to be visited.

Q: When the URL ends with "/", what needs to be done?
A: You should filter using content type. Have a peek into the crawler4j code; you will get a hint on how to know the content type of the page, even if the extension is not explicitly mentioned in the URL.

Q: Eclipse keeps crashing after a few minutes of running my code, but when I reduce the number of pages to fetch, it works fine.
A: Increase the heap size for Eclipse.

Q: What if a URL has an unknown extension?

A: Please check the content type of the page if it has an unknown extension.

Q: Why do some links return true in shouldVisit() but cannot be visited by visit()?
A: The shouldVisit() function is used to decide whether the page should be visited or not. It may or may not be a visitable page. For example: if you are crawling the site https://viterbi.usc.edu/, the page https://viterbi.usc.edu/mySamplePage.html should be visited, but this page may return a 404 Not Found error or it may be redirected to some other site like https://mysamplesite.com. In this case, the shouldVisit() function would return true because the page should be visited, but visit() will not be called because the page cannot be visited.

Comment: take care with the regular expressions you use to filter URLs.
Comment: Since many newspaper websites dump images and other types of media on a CDN, your crawl may only encounter HTML files. That is fine.
Comment: File types css, js, json and others should not be visited. E.g. you can add .json to your pattern filter. If the extension does not appear, use !page.getContentType().contains("application/json").
Comment: Some sites may have fewer than 20,000 pages, but your homework is ok as long as the formula matches, i.e. # fetches attempted = # fetches succeeded + # fetches aborted + # fetches failed. However, the variation should not be more than 10% away from the limit, as that is an indication that something is wrong.

Scenario: My visit.csv file has about 15 fewer URLs than the number of URLs with status code 200. It is fine if the difference is less than 10%.
Comment: the homework description states that you only need to consider HTML, doc, pdf and different image format URLs, but you should also consider URLs with no extension, as they may return a file of one of the above types.
Comment: the distinction between failed and aborted web pages. failed: can be due to content not found, HTTP errors or other network-related errors. aborted: the client (the crawler) decided to stop the fetching (e.g. taking too much time to fetch). You may sum up both values and provide the combined result in the write-up.

Q: In visit_NewsSite.csv, do we also need to chop "charset=utf-8" from the content-type? Or just chop it in the report?
A: You can chop the encoding part (charset=utf-8) in all places.

Q: REGARDING STATISTICS
A: #unique URLs extracted = #unique URLs within + #unique URLs outside. #total URLs extracted is the sum of the numbers of outgoing links, i.e. the sum of all values in column 3 of visit.csv. For text/html files, find the number of outlinks; for non-text/html files, the number should be 0.

Q: How to handle pages with NO extension?
A: Use getContentType() in visit() and don't rely just on the extension. If the content type returned is not one of the required content types for the assignment, you should ignore it for any calculation of the statistics. This will probably result in more rows in visit.csv, but it's acceptable according to the grading guidelines.

Q: Clarification on "the URLs it attempts to fetch"
A: "The URLs it attempts to fetch" means all the URLs crawled from the start seed which reside in the news website and have the required media types.
Note #1: Extracted URLs do not have to be added to the visit queue. Some of them which satisfy a requirement (e.g. content type, domain, not a duplicate) will be added to the visit queue, but others will be dumped by the crawler. However, as long as the grading guideline is satisfied, we will not deduct points.
Note #2: 303 could be considered aborted.

404 could be considered failed. To summarize: we consider a request to be aborted if the crawler decides to terminate that request (a client-side timeout is an example). Requests can fail due to reasons like content not found, server errors, etc.
Note #3: Fetch statistics:
# fetches attempted: The total number of URLs that the crawler attempted to fetch. This is usually equal to the MAXPAGES setting if the crawler reached that limit; less if the website is smaller than that.
# fetches succeeded: The number of URLs that were successfully downloaded in their entirety, i.e. returning a HTTP status code of 2XX.
# fetches failed or aborted: The number of fetches that failed for whatever reason, including, but not limited to: HTTP redirections (3XX), client errors (4XX), server errors (5XX) and other network-related errors.
Note #4: Consider failed and aborted fetches as the same, as mentioned in Note #3.
Note #5: Hint on crawling pages other than HTML: look for how to turn ON binary content crawling in crawler4j. Make sure you are not just crawling the HTML parsed data but also the binary data, which includes file types other than HTML. Search on the internet for how to crawl binary data and you will find how to parse pages other than HTML types.

There will be pages other than HTML on almost every news site, so please make sure you crawl them properly.

Q: Regarding the content type in visit_NewsSite.csv, should we display "text/html;charset=UTF-8" or chop out the encoding and write "text/html" in the spreadsheet?
A: Only "text/html"; ignore the rest.

Q: Should we limit the URLs that the crawler attempts to fetch to the news domain? E.g. if we encounter a URL outside the domain, should we skip fetching it by adding constraints in shouldVisit()? But do we need to include it in urls_NewsSite.csv?
A: Yes, you need to include every encountered URL in urls_NewsSite.csv.

Q: Should all 3XX, 4XX, 5XX responses be considered as aborted?
A: YES

Q: Are "cookie" domains considered as the original news site domain?
A: NO, they should not be included as part of the news site you are crawling. For details see https://web.archive.org/web/20200418163316/https://www.mxsasha.eu/blog/2014/03/04/definitive-guide-to-cookie-domains/

Q: More about statistics
A: visit.csv will contain the URLs which succeeded, i.e. 200 status code with known/allowed content types. fetch.csv will include all the URLs which were attempted, i.e. with all the status codes. fetch.csv entries = visit.csv entries (with 2XX status codes) + entries with status codes other than 2XX; visit.csv = entries with 2XX status codes. Also, you should not (and it is not necessary to) use customized status codes; just use the status code that the webpage returns to you. (Note: fetch.csv should have URLs from the news site domain only.)

Q: Do we need to check the content-type for all the extracted URLs, i.e. those in urls.csv, or just for visited URLs, e.g. those in visit.csv?
A: Only those in visit_NewsSite.csv.

Q: How to get the size of the downloaded file?
A: It will be the size of the page. E.g. for an image or pdf, it will be the size of the image or the pdf; for HTML files, it will be the size of the file. The size should be in bytes (or KB, MB, etc.). (page.getContentData().length)

Q: How to change the logging level in crawler4j?
A: If you are using the latest version of crawler4j, logging can be controlled through logback.xml. You can view the GitHub issue thread to learn more about the logback configuration.

Q: Crawling URLs only yields text/html. I have only filtered out css|js|mp3|zip|gz, but all the visited URLs have return type text/html. Is it fine, or is there a problem?
A: It is fine. Some websites host their asset files (images/pdfs) on another CDN, and the URL for those would be different from www.newssite.com, so you might only get HTML files for that news site.

Q: Eclipse error: "Provider class org.apache.tika.parser.external.CompositeExternalParser not in module". I'm trying to follow the guide and run the boilerplate code, but Eclipse gives this error when I run the copy-pasted code from the installation guide.
A: Please import the crawler4j JARs in ClassPath and not ModulePath while configuring the build in Eclipse.

Q: Illegal State Exception error
A: 1) If you are using the newest Java version, downgrade to 8; there are similar known issues with the newest Java version. 2) Carefully follow the instructions in Crawler4jinstallation.pdf. 3) Make sure to add the JAR files to the CLASS PATH. 4) If any module is missing, download it and add it to the project class path.

Q: /data/crawl error: Exception in thread "main" java.lang.Exception: couldn't create the storage folder: /data/crawl does it already exist ? at edu.uci.ics.crawler4j.crawler.CrawlController.(CrawlController.java:84) at Controller.main(Controller.java:20)
A: Replace the path /data/crawl in the Controller class code with a location on your machine.

Q: Do we need to remove duplicate URLs in fetch.csv (if they exist)?
A: crawler4j already handles duplication checks, so you don't have to handle it. It doesn't crawl pages that have already been visited.

Q: Error in Controller.java: "Unhandled exception type Exception"
A: Make sure exception handling is taken care of in the code. Since the CrawlController class throws an exception, it needs to be handled inside a try-catch block.

Q: Crawler cannot stop – when I set maxPagesToFetch to 20000, my script cannot stop and keeps running forever; I have to kill it myself. However, it looks like my crawler has crawled all 20000 pages but just cannot end.
A: Set a reasonable maxDepthOfCrawling, politeness delay, setSocketTimeout(), and number of crawlers in the Controller class, and retry. Also ensure there are no System.out.print() statements running inside the Crawler code.

Q: What if you are in a country that has connection problems?
A: We would suggest you visit https://itservices.usc.edu/vpn/ for more information. Enabling the VPN, clearing the cache, and restarting the computer should help solve the problem.

6. Submission Instructions

● Save your statistics report as a plain text file and name it based on the news website assigned to you:

USC ID ends with | Site
01~20 | CrawlReport_nytimes.txt
21~40 | CrawlReport_wsj.txt
41~60 | CrawlReport_foxnews.txt
61~80 | CrawlReport_usatoday.txt
81~00 | CrawlReport_latimes.txt

● Also include the output files generated from your crawler run, named as shown above:
o fetch_NewsSite.csv
o visit_NewsSite.csv
● Do NOT include the output file:
o urls_NewsSite.csv
where NewsSite should be replaced by the name from the table above.
● Do not submit Java code or compiled programs; it is not required.
● Compress all of the above into a single zip archive and name it crawl.zip. Use only standard zip format.

Do NOT use other formats such as zipx, rar, ace, etc. For example, the zip file might contain the following three files:
1. CrawlReport_nytimes.txt (the statistics file)
2. fetch_nytimes.csv
3. visit_nytimes.csv
● Please upload your homework to your Google Drive CSCI572 folder, in the subfolder named hw2.

Appendix A

Use the following format to tabulate the statistics that you collated based on the crawler outputs. Note: The status codes and content types shown are only a sample.

The status codes and content types that you encounter may vary, and should all be listed and reflected in your report. Do NOT lump everything else that is not in this sample under an "Other" heading. You may, however, exclude status codes and types for which you have a count of zero. Also, note the use of multiple threads. You are required to use multiple threads in this exercise.

CrawlReport_NewsSite.txt

Name: Tommy Trojan
USC ID: 1234567890
News site crawled: nytimes.com
Number of threads: 7

Fetch Statistics
================
# fetches attempted:
# fetches succeeded:
# fetches failed or aborted:

Outgoing URLs:
==============
Total URLs extracted:
# unique URLs extracted:
# unique URLs within News Site:
# unique URLs outside News Site:

Status Codes:
=============
200 OK:
301 Moved Permanently:
401 Unauthorized:
403 Forbidden:
404 Not Found:

File Sizes:
===========
< 1KB:
1KB ~ <10KB:
10KB ~ <100KB:
100KB ~ <1MB:
>= 1MB:

Content Types:
==============
text/html:
image/gif:
image/jpeg:
image/png:
application/pdf:
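All of the report numbers can be derived by post-processing the three CSVs; here is a rough Python sketch of that collation (the file layouts and column order are assumed to match the spec above, and this is separate from the Java crawler itself):

# Sketch: collate CrawlReport statistics from fetch_*.csv, visit_*.csv and urls_*.csv
import csv
from collections import Counter

def collate(site):
    # Read the three CSVs produced by the crawler (header row skipped).
    with open("fetch_%s.csv" % site, newline="") as f:
        fetch = list(csv.reader(f))[1:]          # rows: [URL, Status]
    with open("visit_%s.csv" % site, newline="") as f:
        visit = list(csv.reader(f))[1:]          # rows: [URL, Size(bytes), #Outlinks, Content-Type]
    with open("urls_%s.csv" % site, newline="") as f:
        urls = list(csv.reader(f))[1:]           # rows: [URL, OK or N_OK]

    status_counts = Counter(row[1] for row in fetch)
    succeeded = sum(c for code, c in status_counts.items() if code.startswith("2"))
    print("# fetches attempted:", len(fetch))
    print("# fetches succeeded:", succeeded)
    print("# fetches failed or aborted:", len(fetch) - succeeded)

    flag_by_url = {row[0]: row[1] for row in urls}          # dedupe discovered URLs
    inside = sum(1 for flag in flag_by_url.values() if flag == "OK")
    print("Total URLs extracted:", len(urls))
    print("# unique URLs extracted:", len(flag_by_url))
    print("# unique URLs within News Site:", inside)
    print("# unique URLs outside News Site:", len(flag_by_url) - inside)

    print("Status codes:", dict(status_counts))

    sizes = Counter()
    for row in visit:
        b = float(row[1])                        # size column assumed to be in bytes
        if b < 1024:              sizes["< 1KB"] += 1
        elif b < 10 * 1024:       sizes["1KB ~ <10KB"] += 1
        elif b < 100 * 1024:      sizes["10KB ~ <100KB"] += 1
        elif b < 1024 * 1024:     sizes["100KB ~ <1MB"] += 1
        else:                     sizes[">= 1MB"] += 1
    print("File sizes:", dict(sizes))
    print("Content types:", dict(Counter(row[3] for row in visit)))

collate("nytimes")   # example site name from the table above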

CSCI572 HW3: Inverted-index creation

Summary

In this homework, you’ll write code that indexes words from multiple text files, and outputs an inverted
index that looks like this:

Description

You will need a collection of text files to index; they are provided as a zip file. After you unzip it, you'll get two directories
with text files in them: 'devdata' (with 5 files) and 'fulldata' (with 74 files).

The input data is already cleaned, that is all the \n\r characters are removed – but one or more \t chars
might still be present (which needs to be handled). There is punctuation, and you are required to handle
this in your code: replace all the occurrences of special (punctuation) characters and numerals with the
space character, and convert all the words to lowercase. A single ‘\t’ separates the key (docID) from the
value (actual document contents).

In other words, the input files are in a key-value format where the docID is
the key, and the contents are the value; the key and value are separated by a tab.

The above format is to help you build the inverted index easily – the filename (docID) is in the file’s text so
you can simply extract it.
As you know, Google invented MapReduce to help them do this on a massive scale. For this HW, we could
have based it on GCP/Azure/AWS/… where you would upload the files to the cloud and launch MapReduce
jobs there. To get practice doing this, please do try it on your own, after this course.

But for this HW, there is a much simpler way! It's a repl [containing code I got from elsewhere, and the custom
environment I set up to run it (i.e. the minimal collection of .jar files needed, including the one for
MapReduce)]: start with it – it is a self-contained,
complete, minimal MapReduce example that counts words in two input documents. Study it thoroughly –
see what files are used (code, data, config), how the code is organized (a single class called WordCount),
how it is run (we specify input and output folders). Be sure to look at the three columns: files on the left, file
contents in the center, execution on the right.

Fork it to do your homework – in other words, do it on your
own repl.it area (you need to sign up for a free account). FYI, repl.it is built on top of GCP:
https://cloud.google.com/customers/repl-it [how cool!].
For the HW, you need to create a unigram index, and a bigram one. Details are in the two sections that
follow.

Unigram index

You will need to create a file called unigram_index.txt, containing words from the files in fulldata.
Modify the mapper in the repl.it link above to output (word, docID) pairs, instead of what it currently outputs
for word counting, which is (word, count); also modify your reducer accordingly (hint: use a HashMap data structure).
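The actual HW should be done by modifying the Java mapper and reducer in the repl, but to make the expected map/reduce logic and output shape concrete, here is a small Python illustration (the docIDs and contents below are made up):

# Sketch (Python, for illustration only): unigram inverted index over docID<TAB>contents lines
import re
from collections import defaultdict

def map_unigrams(line):
    # key = docID, value = document contents (separated by the first tab)
    doc_id, _, text = line.partition("\t")
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # punctuation and numerals become spaces
    for word in text.split():
        yield word, doc_id

def reduce_postings(pairs):
    # word -> {docID: count}, i.e. one inverted-index entry per word
    index = defaultdict(lambda: defaultdict(int))
    for word, doc_id in pairs:
        index[word][doc_id] += 1
    return index

# Tiny demo with made-up docIDs/contents (the real input lines come from fulldata/*):
lines = ["doc1\tInformation retrieval, and more information!",
         "doc2\tRetrieval of 42 documents."]
index = reduce_postings(p for line in lines for p in map_unigrams(line))
for word, postings in sorted(index.items()):
    print(word, " ".join("%s:%d" % (d, c) for d, c in postings.items()))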

Bigram index

You’ll create a file called selected_bigram_index.txt, containing the inverted index for just these five
bigrams, using files in devdata:
computer science
information retrieval
power politics
los angeles
bruce willis
Modify your mapper, to output (word1 word2, docID) pairs, rather than the (word, docID) pairs you had in
the unigram task. There is no need to change your reducer.
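Continuing the Python illustration above, the bigram change affects only the map step (the reducer from the unigram sketch can be reused as-is):

# Sketch: bigram version of the mapper; the reducer stays the same
import re

SELECTED_BIGRAMS = {"computer science", "information retrieval", "power politics",
                    "los angeles", "bruce willis"}

def map_bigrams(line):
    doc_id, _, text = line.partition("\t")
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    for w1, w2 in zip(words, words[1:]):
        bigram = w1 + " " + w2
        if bigram in SELECTED_BIGRAMS:     # keep only the five requested bigrams
            yield bigram, doc_id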

Extra details

If your execution takes way too long, or crashes, you can simply make the data files smaller by deleting text,
starting from the end. Each file contains 500000+ words, you can make it as small as 50000; but do keep
the file count the same, ie. use all the files in devdata as well as fulldata.

You could 'test' your code using the devdata collection for unigrams as well, then do it 'for real' (in
'production mode') on the fulldata set of files. Or, for both unigrams and bigrams, you can use your own set
of files, e.g. a.txt, b.txt … e.txt (5 files), each with a paragraph from https://www.gutenberg.org/files/74/74-0.txt


:) – that way your code will run quite fast and output results, so you can make rapid alterations in a
shortened develop-run-debug cycle. To develop and test your code, you can even simply use the two sample
files I have in the repl.
If you like, you can read/do the ‘official’ MapReduce tutorial

UPDATE, 3/23: there is also a collection of shorter data files [each file has fewer words] – you have the choice of
using these instead of the 'big' files. The filenames and the file counts are the same as for the
original/previous ('big') set of files; the only difference is that there are fewer words in each. For the bigrams, you'll
get different counts compared to the originals, but that shouldn't/doesn't matter – if you use these smaller
files, just make sure you're creating a bigram index that shows files and counts for 'los angeles' etc.

Many files end with a chopped-off word, e.g. 'sabb', but that's ok. Also, if you're curious, here's how I shortened the
fulldata/* files 🙂

Rubrics
Your (max 10) points will come from fulfilling these:
4 points for the unigram index entries, contained in unigram_index.txt
4 points for the index entries to the words mentioned in the bigrams section above, contained in selected_bigram_index.txt
1 point for screenshots of the output folder (the output folder is what you specify as the second argument while running the job)
for the job for unigrams, and bigrams (two screenshots)
1 point for your code, for unigrams and bigrams (two source files)

Getting help
There is a hw3 ‘forum’ on Piazza, for you to post questions/answers. You can also meet w/ the TAs, CPs, or
me.
Hope you have fun doing the HW 🙂

CSCI572 HW4: Inverted-index creation, using Lunr and Solr

Summary

In this 4th HW, you’re going to use Lunr (a JS library) in three ways, and Solr (a Java library) in one way, to inverted-index
documents/data (each piece of data, eg a student’s info, is called a ‘document’ and is expressed as JSON).

Description

We are going to describe each of the 4 pieces separately below, since they are all independent of each other…

1. Lunr, using xem

Start with https://bytes.usc.edu/~saty/tools/xem/run.html?x=lunr-cs572-hw4. As you can see, we fetch
https://bytes.usc.edu/cs572/s23-searchhh/hw/HW4/data/yelp_academic_dataset_business.json/yelp_academic_dataset_business-50.json
[a portion of the famous Yelp businesses dataset, with only 50 rows], index the 'state' field, then display the 'name' field as
the output (in Lunr terminology, this (name) is the 'ref' field).

Your turn: also fetch
https://bytes.usc.edu/cs572/s23-searchhh/hw/HW4/data/yelp_academic_dataset_business.json/yelp_academic_dataset_business-1000.json,
and after that,
https://bytes.usc.edu/cs572/s23-searchhh/hw/HW4/data/yelp_academic_dataset_business.json/yelp_academic_dataset_business.json,
and do the same state searches. For "PA", how many results do we get, for the three different files?

Next, pick a different column (field) to index, then search for a value in that
field, and grab a screenshot to submit.

2. Lunr, using repl.it

Bring up the provided repl, fork it and run it. You'll see a file called fun.jsonl, containing two
simple docs that get indexed and searched. REPLACE the data with your own, and make it have 10 rows/docs – each also with two fields,
e.g. (movieName, rating), (course, grade), (foodItemUnitPortion, calories)… Alternately you can also add more schools and rankings if
you like (e.g. from https://www.usnews.com/best-colleges/rankings/national-universities), rather than use your own alternate fields.

Programmatically search for a range of the second field (e.g. rank, grade, calories, ratings…), and display the result as a simple JS
array – get a screenshot. In other words, suppose you have food names and calories – you'd search for a range of calories, e.g. 500 to
1000, via a for() or forEach() loop for each of 500, 501, 502 … 1000 [search 501 times]; if a search comes back non-null, you'd add
the result to an (initially empty) array. When you're done, that array (e.g. myResults) will contain the range results; print it, get a
screenshot.

3. Lunr, via TypeScript in StackBlitz

Fork the provided StackBlitz project, and take a look at index.ts, to find a 'documents' array
(these are what get indexed and searched). Put in 10 paragraphs from HW3's data2.zip (the 'smaller' files), to create 10 indexable
documents. Search the documents for two different terms which you know occur in more than one document. Grab a screenshot
for each search.


4. Solr, using Docker

Solr is a powerful search engine with a rich search syntax, easy setup, fast indexing – we specify a search ‘schema’ (fieldname,
fieldtype), for the fields we want to index and search, then we add documents (data) which kicks off the indexing, then we search (via
a URL query, or via one of MANY APIs in multiple languages).

Start by installing Docker (https://docs.docker.com/engine/install/). Docker makes it easy to run Solr inside a lightweight virtual
machine 'container' (runtime) – to do this, we first download an 'image' (template), then run it to launch a container.
Bring up a conda shell on a PC (https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html) or a terminal on
Mac/Linux, and type the following, to download the Solr image:
docker pull solr
Now we can run Solr via [see https://hub.docker.com/_/solr for more]:
docker run -d -p 8983:8983 --name my_solr solr solr-precreate my_core

In the above, 'my_core' is our name for a Solr 'core', which is an area where Solr keeps the docs to index, the index itself, and other
files (e.g. config).

FYI, you can run 'docker ps' to verify that the container is indeed up and running.
We are almost ready to start using Solr via the browser (by visiting https://localhost:8983/) -but- we need one more thing first – a
webserver. Here is one way: in the conda shell, run
python serveit.py
Here is serveit.py (you'd create it by copying and pasting my code below):

import http.server

class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_my_headers()
        http.server.SimpleHTTPRequestHandler.end_headers(self)

    def send_my_headers(self):
        self.send_header("Cache-Control", "no-cache, no-store, must-revalidate")
        self.send_header("Pragma", "no-cache")
        self.send_header("Expires", "0")

if __name__ == '__main__':
    http.server.test(HandlerClass=MyHTTPRequestHandler)

# from https://stackoverflow.com/questions/12193803/invoke-python-simplehttpserver-from-command-line-with-no-cache-option
# usage: python serveit.py :) GREAT, because it does NOT cache files!! Serves out of port 8000 by default.
I’ve aliased the Python command above, to ‘pws’, since I run it a lot:

Ready to roll! To see the 'Solr panel' [lol!], go to https://localhost:8983 – ta da!!
Poke around in the panel, by clicking on the various buttons, to get a feel for the interface – you can see that there is a LOT you can
specify, customize, and run – please do explore these after the course.

But for this exercise, we only need to do three things:
1. Add field names and types, for the docs (data) we will index – in other words, we need to specify a schema (click on 'Add Field' to add fields one by
one).
After creating our schema, we can verify that it looks right by searching for the fieldnames we put in.

2. Add docs you'd like Solr to index, that correspond to the schema (note the non-standard syntax – we are NOT specifying valid JSON; instead we're
adding a list of comma-separated JSONs (one for each doc), a lot like for Lunr (which called for a .jsonl, i.e. a list of JSONs without commas)).
Doing the above (pressing 'Submit Document') leads to our data getting indexed!

FYI, if you ever want to clean out all the docs and start over [more here:
https://stackoverflow.com/questions/23228727/deleting-solr-documents-from-solr-admin] – click on 'Submit Document' after typing the 'delete' command.
3. Now we can search – via URL query syntax, among many other ways [more here:
https://solr.apache.org/guide/solr/latest/query-guide/standard-query-parser.html].
Cool! And that’s all there is to it, for Solr basics. If you exit your shell, that will stop Docker and also the webserver.

Do a simple search (like ‘name:USC’). Next, do a range search (eg. schools with ranks between 5 and 15). For the range search you
won’t do it via a loop, like you did in Lunr under replit; instead, you’d use Solr’s own syntax for that (look that up).
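For reference, here is a hedged sketch of both queries issued against the Solr select endpoint from Python; the core name my_core matches the docker command above, while the field names name and rank are placeholders for whatever schema you actually created:

# Sketch: simple and range queries against the Solr core created above
import requests

SOLR = "http://localhost:8983/solr/my_core/select"   # core name from the docker command above

# Simple field:value query (the field name 'name' is a placeholder for your schema)
resp = requests.get(SOLR, params={"q": "name:USC"}).json()
print(resp["response"]["numFound"], "docs:", resp["response"]["docs"])

# Range query, using Solr's own range syntax field:[low TO high]
resp = requests.get(SOLR, params={"q": "rank:[5 TO 15]"}).json()
print(resp["response"]["docs"])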

Rubrics
Your (max 10) points will come from your submitting these:
2 points for: using the code in https://bytes.usc.edu/~saty/tools/xem/run.html?x=lunr-cs572-hw4 to run the 'state' search on docs of 3 sizes and report the result
counts, index another field rather than the 'state' one, and do a sample search for it; submit a screenshot of the search result

3 points for: going off of the provided repl and modifying the .jsonl file to put in a different kind of data, doing a 'range'
search and printing out the aggregated result as an array

2 points for: using StackBlitz to add paragraph text data from HW3, and doing two
different searches of terms known to occur in multiple docs, submitting a screenshot for each

3 points for: running Solr under Docker, doing a simple query (field:value), and a 'range' one as well

Getting help
There is a hw4 ‘forum’ on Piazza, for you to post questions/answers. You can also meet w/ the TAs, CPs, or me.
Have fun! This is a quick+easy+useful+fun one!

CSCI572 HW5: Vector-based similarity search!

Summary

In this final HW, you will use Weaviate [https://weaviate.io/], which is a vector DB (it stores data as vectors, and computes a search
query by vectorizing it and doing similarity search with existing vectors).

Description

The (three) steps we need are really simple:
install Weaviate plus vectorizer via Docker as images, run them as containers
specify a schema for data, upload data (in .json format) to have it be vectorized
run a query (which gets vectorized and sim-searched), get back results (as JSON)

The following sections describe the above steps. The entire HW will only take you 2 to 3 hours to complete, pinky swear 🙂

1. Installing Weaviate and a vectorizer module

After installing Docker, bring it up (e.g. on Windows, run Docker Desktop). Then, in your (ana)conda shell, run the docker-compose
command below, which uses the provided 'docker-compose.yml' config file to pull in two images: the 'weaviate' one, and a text2vec transformer
called 't2v-transformers':
docker-compose up -d

These screenshots show the progress, completion, and subsequently, two containers automatically being started (one for weaviate,
one for t2v-transformers):

Yeay! Now we have the vectorizer transformer (to convert sentences to vectors), and weaviate (our vector DB search engine)
running! On to data handling 🙂

2. Loading data to search for

This is the data that we'd like searched, part of which will get returned to us as results. The data is conveniently represented as an
array of JSON documents, similar to Solr/Lunr. Our data file is conveniently named data.json (you can rename it if you like) –
place it in the 'root' directory of your webserver (see below). As you can see, each datum/'row'/JSON contains three k:v pairs, with
'Category', 'Question', 'Answer' as keys – as you might guess, it seems to be in Jeopardy(TM) answer-question (reversed) format 🙂
The file is actually called jeopardy-tiny.json; I simply made a local copy called data.json.

The overall idea is this: we’d get the 10 documents vectorized, then specify a query word, eg. ‘biology’, and automagically have that
pull up related docs, eg. the ‘DNA’ one! This is a really useful semantic search feature where we don’t need to specify exact
keywords to search for.

Start by installing the weaviate Python client:
pip install weaviate-client
So, how do we submit our JSON data to get it vectorized? Simply use the provided Python script and run:
python weave-loadData.py

You will see this:
If you look in the script, you’ll see that we are creating a schema – we create a class called ‘SimSearch’ (you can call it something else
if you like). The data we load into the DB, will be associated with this class (the last line in the script does this via add_data_object()).

NOTE – you NEED to run a local webserver [in a separate ana/conda (or other) shell], eg. via ‘python serveit.py’ like you did for
HW4 – it’s what will ‘serve’ data.json to weaviate 🙂

Great! Now we have specified our searchable data, which has been first vectorized (by ‘t2v-transformers’), then stored as vectors
(in weaviate).
Only one thing left: querying!

3. Querying our vectorized data

To query, use the simple shell script called weave-doQuery.sh, and run:
sh weave-doQuery.sh
As you can see in the script, we search for 'physics'-related docs, and sure enough, that's what we get:

Why is this exciting? Because the word ‘physics’ isn’t in any of our results!
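If you prefer Python over the shell script, a roughly equivalent query can be issued with the weaviate client; this is only a sketch, and it assumes the class is called SimSearch (as in the loader script) and that the property names ended up as category/question/answer, so adjust to whatever your loader actually created:

# Sketch: the same kind of nearText similarity query via the Python client
import weaviate, json

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("SimSearch", ["category", "question", "answer"])   # class and property names assumed
    .with_near_text({"concepts": ["physics"]})               # the query gets vectorized too
    .with_limit(3)
    .do()
)
print(json.dumps(result, indent=2))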
Now it’s your turn:
• first, MODIFY the contents of data.json, to replace the 10 docs in it, with your own data, where you’d replace (“Category”,”Question”,”Answer”)
with ANYTHING you like, eg. (“Author”,”Book”,”Summary”), (“MusicGenre”,”SongTitle”,”Artist”), (“School”,”CourseName”,”CourseDesc”), etc, etc –

HAVE fun coming up with this! You can certainly add more docs, eg. have 20 of them instead of 10
• next, MODIFY the query keyword(s) in the query .sh file – eg. you can query for ‘computer science’ courses, ‘female’ singer, ‘American’ books,
[‘Indian’,’Chinese’] food dishes (the query list can contain multiple items), etc. Like in the above screenshot, ‘cat’ the query, then run it, and get a
screenshot to submit. BE SURE to also modify the data loader .py script, to put in your keys (instead of (“Category”,”Question”,”Answer”))

That's it, you're done w/ the HW 🙂 In RL you will have a .json or .csv file (or data in other formats) with BILLIONS of items! Later, do
feel free to play with bigger JSON files, e.g. the 200K-question Jeopardy JSON file 🙂

FYI/'extras'

Here are two more things you can do, via 'curl':
● https://localhost:8080/v1/meta [you can also open this URL in your browser]
● https://localhost:8080/v1/schema [you can also open this URL in your browser]

Weaviate has a cloud version too, called WCS – you can try that as an alternative to using the Dockerized version 🙂
Also, for fun, see if you can print the raw vectors for the data (the 10 docs)…
More info:
https://weaviate.io/developers/weaviate/quickstart/end-to-end
https://weaviate.io/developers/weaviate/installation/docker-compose
https://medium.com/semi-technologies/what-weaviate-users-should-know-about-docker-containers-1601c6afa079
https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers

What to submit
• your data.json that contains the data (10 docs) you put in
• a screenshot of the ‘cat’ of your query and the results

Alternative (!) submission [omg]
You can just submit a README.txt file that notes why you didn’t/couldn’t do the HW.
“Wait, WHAT?” Turns out THIS HW IS OPTIONAL!!! You will get the full 10 points for this HW, regardless of what you submit – but
you DO need to submit something – either the .json+screenshot combo, or a README.

Such a deal! For your own benefit, it’s
worth doing the HW of course – you’ll get first-hand experience using a vector DB (Weaviate is a worthy alternative to Pinecone
btw); but if you aren’t able, we understand (hope you’ll do it after the course!) 🙂

Getting help
There is a hw5 ‘forum’ on Piazza, for you to post questions/answers. You can also meet w/ the TAs, CPs, or me.
Have fun! This is a really useful piece of tech to know. Vector DBs are sure to be used more and more in the near future, as a way to
provide 'infinite external runtime memory' for pretrained LLMs.