Description

In this assignment, we’re interested in the main topics discussed on the /r/mcgill subreddit vs. the /r/concordia subreddit. We’ll do this using human annotation … and you’re the annotator.

Task 1: Data collection

First, let’s collect some Reddit posts using the /new.json endpoint. We’ll collect two data files: one from the McGill subreddit and one from the Concordia subreddit. For the purpose of this assignment, collect them manually: in a web browser, get the JSON dump and download it to a file. You should end up with a mcgill.json file and a concordia.json file.
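Once the files are downloaded, it can help to check that they parse and contain posts. The sketch below is only an illustration: it assumes the usual Reddit listing layout, where posts sit under data → children and each child carries its fields in a nested data object. The helper name load_posts and the sample data are made up for the example.

```python
import json


def load_posts(listing_text):
    """Return the list of post dicts from a Reddit listing JSON string.

    Assumes the listing layout {"data": {"children": [{"data": {...}}, ...]}}.
    """
    listing = json.loads(listing_text)
    return [child["data"] for child in listing["data"]["children"]]


# Tiny hand-made listing standing in for the contents of mcgill.json.
sample = json.dumps({
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t3", "data": {"name": "t3_abc", "title": "First post"}},
        {"kind": "t3", "data": {"name": "t3_def", "title": "Second post"}},
    ]},
})

posts = load_posts(sample)
print(len(posts), "posts; first title:", posts[0]["title"])
```

For a real file, read its contents with open("mcgill.json", encoding="utf-8").read() and pass the text to load_posts.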

Task 2: Prep for coding

Write a script extract_to_tsv.py that accepts one of the files you collected from Reddit and outputs a random selection of posts from that file to a TSV (tab-separated values) file. It should function like this: python3 extract_to_tsv.py -o <out_file> <json_file> <num_posts_to_output>. If <num_posts_to_output> is greater than the number of posts in the file, the script should just output all of them. If the file contains more than <num_posts_to_output> posts (which is likely the case), it should randomly select num_posts_to_output of them (the parameter you passed to the script) and output only those. The output format (written to out_file) is three tab-separated columns: Name, title, coding.
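A minimal sketch of how such a script could be organised is shown below. It is not a reference solution and rests on several assumptions: that the input is a Reddit listing with posts under data → children, that the Name column holds each post's name field, that the coding column is left empty for the annotator, and that the arguments come in the order -o out_file, input file, num_posts_to_output. The function names are illustrative.

```python
import argparse
import csv
import json
import random


def select_posts(posts, num_posts_to_output):
    """Return every post if too few exist, otherwise a random subset."""
    if num_posts_to_output >= len(posts):
        return list(posts)
    # random.sample draws without replacement, so no post is repeated.
    return random.sample(posts, num_posts_to_output)


def write_tsv(posts, out_path):
    """Write a header row, then one tab-separated row per post."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["Name", "title", "coding"])
        for post in posts:
            # Assumption: "name" and "title" are the post fields to export;
            # the coding column stays empty until annotation.
            writer.writerow([post["name"], post["title"], ""])


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Export a random selection of posts to a TSV file.")
    parser.add_argument("-o", dest="out_file", required=True)
    parser.add_argument("json_file")
    parser.add_argument("num_posts_to_output", type=int)
    args = parser.parse_args(argv)

    with open(args.json_file, encoding="utf-8") as f:
        listing = json.load(f)
    # Assumption: Reddit listing layout, posts under data -> children.
    posts = [child["data"] for child in listing["data"]["children"]]

    write_tsv(select_posts(posts, args.num_posts_to_output), args.out_file)
```

To run this from the command line, call main() under an if __name__ == "__main__": guard at the bottom of the file. Keeping the selection logic in its own function makes it easy to check both cases (fewer posts than requested, and more) without touching any files.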

Solved COMP 370 Homework 8 – Data Annotation
