CSE 6242 / CX 4242 HW 1 to 4 solutions



CSE 6242 / CX 4242: Data and Visual Analytics HW 1: End-to-end analysis of TMDb data, SQLite, D3 Warmup, OpenRefine, Flask

Homework Overview

Vast amounts of digital data are generated each day, but raw data is often not immediately “usable”. Instead, we are interested in the information content of the data: what patterns does it capture? This assignment covers useful tools for acquiring, cleaning, storing, and visualizing datasets. In questions 1 and 2, we perform a simple end-to-end analysis using data from The Movie Database (TMDb).

We will collect movie data via the TMDb API, store the data in CSV files, and analyze it using SQL queries. For Q3, we will complete a D3 warmup to prepare for the visualization questions in HW2. Q4 and Q5 provide an opportunity to explore other industry tools used to acquire, store, and clean datasets. The maximum possible score for this homework is 100 points. Download the HW1 Skeleton before you begin.

Contents

Homework Overview
Important Notes
Submission Notes
Do I need to use the specific version of the software listed?
Q1 [40 points] Collect data from TMDb to build a co-actor network
Q2 [35 points] SQLite
Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species
Q4 [5 points] OpenRefine
Q5 [5 points] Introduction to Python Flask

Important Notes
A. Submit your work by the due date on the course schedule.
   a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
   b. Before the grace period expires, you may resubmit as many times as you need.
   c. TA assistance is not guaranteed during the grace period.
   d. Submissions during the grace period will display as “late” but will not incur a penalty.
   e. We will not accept any submissions after the grace period ends.
B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion.
C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, using a HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit their own answers.
D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, handled directly by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for the assignments concerned, and prohibition from withdrawing from the class.

Submission Notes
A. All questions are graded on the Gradescope platform, accessible through Canvas.
B. We will not accept submissions anywhere outside of Gradescope.
C. Submit all required files as specified in each question. Make sure they are named correctly.
D. You may upload your code periodically to Gradescope to obtain feedback. There are no hidden test cases; the score you see on Gradescope is what you will receive.
E. Do not use Gradescope as the primary way to test your code. It provides only a few test cases, and its error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check.
F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify that:
   a. The code is free of syntax errors (by running it locally)
   b. All methods have been implemented
   c. The correct file was submitted with the correct name
   d. No extra packages or files were imported
G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time.
H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”).

Do I need to use the specific version of the software listed?

Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will use to run your code. Installing those specific versions on your computer is therefore highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder.

Q1 [40 points] Collect data from TMDb to build a co-actor network

Leveraging the power of APIs for data acquisition, you will build a co-actor network of highly rated movies using information from The Movie Database (TMDb). Through data collection and analysis, you will create a graph showing the relationships between actors based on their highly rated movies. This will not only highlight the practical application of APIs in collecting rich datasets, but also introduce the importance of graphs in understanding and visualizing real-world data.

Technology
• Python 3.10.x
• TMDb API version 3

Allowed Libraries
The Python Standard Library and Requests only.

Max runtime
10 minutes. Submissions exceeding this will receive zero credit.

Deliverables
• Q1.py: The completed Python file
• nodes.csv: The csv file containing nodes
• edges.csv: The csv file containing edges

Follow the instructions found in Q1.py to complete the Graph class, the TMDbAPIUtils class, and the one global function. The Graph class will serve as a reusable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval.

Tasks and point breakdown
1. [10 pts] Implementation of the Graph class according to the instructions in Q1.py.
   a. The graph is undirected, thus {a, b} and {b, a} refer to the same undirected edge in the graph; keep only either {a, b} or {b, a} in the Graph object. A node’s degree is the number of (undirected) edges incident on it. In-/out-degrees are not defined for undirected graphs.
2. [10 pts] Implementation of the TMDbAPIUtils class according to the instructions in Q1.py. Use version 3 of the TMDb API to download data about actors and their co-actors. To use the API:
   a. Create a TMDb account and follow the instructions in this document to obtain an API key.
   b. Be sure to use the key, not the token. The key is the shorter of the two.
   c. Refer to the TMDb API documentation as you work on this question.
3. [20 pts] Producing correct nodes.csv and edges.csv.
   a. If an actor’s name contains comma characters (“,”), remove those characters before writing that name into the CSV files.
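
The skeleton file defines the exact classes and method signatures the autograder expects, so follow Q1.py rather than the sketch below. The snippet is only a hedged illustration of the overall workflow: call a TMDb v3 endpoint with the Requests library, keep co-actor pairs as unordered edges, and write them out as CSV. The class name SimpleGraph, the helper get_cast(), and the API-key placeholder are illustrative and do not match the skeleton.

```python
import csv
import requests

API_KEY = "YOUR_TMDB_V3_API_KEY"  # placeholder: use your own key (not the token)
BASE_URL = "https://api.themoviedb.org/3"


class SimpleGraph:
    """Illustrative undirected graph; the real Graph class in Q1.py has its own interface."""

    def __init__(self):
        self.nodes = {}     # node id -> name (commas stripped for CSV safety)
        self.edges = set()  # frozensets, so {a, b} and {b, a} are the same edge

    def add_node(self, node_id, name):
        self.nodes[node_id] = name.replace(",", "")

    def add_edge(self, source, target):
        if source != target:
            self.edges.add(frozenset((source, target)))

    def write_edges_file(self, path="edges.csv"):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["source", "target"])
            for edge in self.edges:
                writer.writerow(sorted(edge))


def get_cast(movie_id, limit=5):
    """Fetch up to `limit` cast members for a movie from the TMDb v3 credits endpoint."""
    resp = requests.get(f"{BASE_URL}/movie/{movie_id}/credits", params={"api_key": API_KEY})
    resp.raise_for_status()
    return resp.json().get("cast", [])[:limit]


if __name__ == "__main__":
    graph = SimpleGraph()
    cast = get_cast(550)  # 550 is just an example movie id
    for member in cast:
        graph.add_node(member["id"], member["name"])
    for i, a in enumerate(cast):
        for b in cast[i + 1:]:
            graph.add_edge(a["id"], b["id"])  # a co-appearance becomes one undirected edge
    graph.write_edges_file()
```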

Q2 [35 points] SQLite

SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database: just one cross-platform file that does not need to be parsed explicitly (unlike CSV files, which must be parsed). You can find instructions to install SQLite here.

In this question, you will construct a TMDb database in SQLite, partition it, and combine information within tables to answer questions. You will modify the given Q2.py file by adding SQL statements to it. We suggest testing your SQL locally on your computer using interactive tools such as DB Browser for SQLite to speed up testing and debugging.

Technology
• SQLite release 3.37.2
• Python 3.10.x

Allowed Libraries
Do not modify the import statements. Everything you need to complete this question has been imported for you. Do not use other libraries for this question.

Max runtime
10 minutes. Submissions exceeding this will receive zero credit.

Deliverables
• Q2.py: Modified file containing all the SQL statements you have used to answer parts a – h in the proper sequence.

IMPORTANT NOTES:
• If the final output asks for a decimal column, format it to two places using printf(). Do NOT use the ROUND() function, as in rare cases it works differently on different platforms. If you need to sort that column, be sure to sort it using the actual decimal value and not the string returned by printf().
• A sample class has been provided to show example SQL statements; you can turn off this output by changing the global variable SHOW from True to False.
• In this question, you must use only INNER JOIN when performing a join between two tables, except for part g. Other types of joins may produce incorrect results.
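
All of the tasks below follow the same pattern: execute a SQL statement through Python’s built-in sqlite3 module, which is what the skeleton imports. As a hedged warm-up for tasks 1 and 3 below, the sketch creates a table, imports a CSV with INSERT INTO, and formats a percentage with printf(). The database file name and the assumption that movies.csv has no header row are illustrative; the real Q2.py wraps each part in its own method.

```python
import csv
import sqlite3

conn = sqlite3.connect("movies.db")  # any scratch database file works for local testing
cur = conn.cursor()

# Task 1a-style table creation.
cur.execute("CREATE TABLE IF NOT EXISTS movies (id INTEGER, title TEXT, score REAL)")

# Task 1b-style import: loop over the CSV rows and INSERT INTO the table.
with open("movies.csv", newline="", encoding="utf-8") as f:
    rows = [(int(r[0]), r[1], float(r[2])) for r in csv.reader(f)]  # assumes no header row
cur.executemany("INSERT INTO movies (id, title, score) VALUES (?, ?, ?)", rows)
conn.commit()

# Task 3-style query: printf('%.2f', ...) formats the percentage to two decimal places.
cur.execute(
    "SELECT printf('%.2f', 100.0 * SUM(score BETWEEN 7 AND 20) / COUNT(*)) FROM movies"
)
print(cur.fetchone()[0])
conn.close()
```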

Tasks and point breakdown
1. [9 points] Create tables and import data.
   a. [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, in Q2.py) named movies and movie_cast with columns having the indicated data types:
      i. movies
         1. id (integer)
         2. title (text)
         3. score (real)
      ii. movie_cast
         1. movie_id (integer)
         2. cast_id (integer)
         3. cast_name (text)
         4. birthday (text)
         5. popularity (real)
   b. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table.
      i. Write Python code that imports the .csv files into the individual tables. This will include looping through the file and using the INSERT INTO SQL command. Make sure you use paths relative to the Q2 directory.
   c. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. Create a new table cast_bio from the movie_cast table. Be sure that the values are unique when inserting into the new cast_bio table. Read this page for an example of vertical database partitioning.
      i. cast_bio
         1. cast_id (integer)
         2. cast_name (text)
         3. birthday (text)
         4. popularity (real)
2. [1 point] Create indexes. Create the following indexes. Indexes increase data retrieval speed; although the speed improvement may be negligible for this small database, it is significant for larger databases.
   a. movie_index for the id column in the movies table
   b. cast_index for the cast_id column in the movie_cast table
   c. cast_bio_index for the cast_id column in the cast_bio table
3. [3 points] Calculate a proportion. Find the proportion of movies with a score between 7 and 20 (both limits inclusive). The proportion should be calculated as a percentage.
   a. Output format and example value: 7.70
4. [4 points] Find the most prolific actors. List the 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order.
   a. Output format and example row values (cast_name,appearance_count): Harrison Ford,2
5. [4 points] List the 5 highest-scoring movies. In the case of a tie, prioritize movies with fewer cast members. Sort the result by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order.
   a. Output format and example values (movie_title,score,cast_count):
      Star Wars: Holiday Special,75.01,12
      Games,58.49,33
6. [4 points] Get high-scoring actors. Find the top ten cast members who have the highest average movie scores. Sort the output by average_score in descending order, then by cast_name alphabetically.
   a. Exclude movies with score < 25 before calculating average_score.
   b. Include only cast members who have appeared in three or more movies with score >= 25.
      i. Output format and example value (cast_id,cast_name,average_score): 8822,Julia Roberts,53.00
7. [2 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 2 movies together AND whose average score for those movies is >= 40. The view should have the format:
      good_collaboration(cast_member_id1, cast_member_id2, movie_count, average_movie_score)
   For symmetrical or mirror pairs, keep only the row in which cast_member_id1 has the lower numeric value. For example, for the ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self-pair” where cast_member_id1 is the same as cast_member_id2.
   Remember that creating a view will not produce any output, so you should test your view with a few simple SELECT statements during development. One such test has already been added to the code as part of the auto-grading.
   NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may have used for testing.
   Optional Reading: Why create views?
8. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors appearing in cast_member_id1 as well as cast_member_id2.
   a. Order your output by collaboration_score in descending order, then by cast_name alphabetically.
   b. Output format and example values (cast_id,cast_name,collaboration_score):
      2,Mark Hamil,99.32
      1920,Winoa Ryder,88.32
9. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). A hedged sketch of the FTS workflow follows this task list.
   a. [1 point] Import movie overview data from movie_overview.csv into a new FTS table called movie_overview with the schema:
      movie_overview
         id (integer)
         overview (text)
      NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR, and NOT are case-sensitive in FTS queries.
      NOTE: If you have issues where FTS is not enabled, try the following steps:
      • Go to the SQLite downloads page: https://www.sqlite.org/download.html
      • Download the dll file for your system
      • Navigate to your Python packages folder, e.g., C:\Users\… …\Anaconda3\pkgs\sqlite-3.29.0-he774522_0\Library\bin
      • Drop the downloaded .dll file into the bin folder
      • In your IDE, import sqlite3 again; FTS should now be enabled.
   b. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings.
      i. Example: Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’ Disallowed: ‘gunfight’, ‘fighting’, etc.
      ii. Output format and example value: 12
   c. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings.
      i. Example: Allowed: ‘In Space there was a program’, ‘In this space program’. Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’
      ii. Output format and example value: 6
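
For task 9, the FTS table is created with CREATE VIRTUAL TABLE ... USING fts4, and full-word and proximity matching are expressed with MATCH and NEAR/5. The sketch below is a hedged stand-alone example; it assumes movie_overview.csv has two columns (id, overview) with no header row, and should be adapted to the structure of Q2.py.

```python
import csv
import sqlite3

conn = sqlite3.connect("movies.db")  # scratch database for local testing
cur = conn.cursor()

# Task 9a-style FTS table (fts4 here; fts3 also satisfies the question).
cur.execute("CREATE VIRTUAL TABLE movie_overview USING fts4(id, overview)")
with open("movie_overview.csv", newline="", encoding="utf-8") as f:
    cur.executemany(
        "INSERT INTO movie_overview (id, overview) VALUES (?, ?)",
        ((row[0], row[1]) for row in csv.reader(f)),  # assumes no header row
    )
conn.commit()

# Task 9b-style query: MATCH works on whole tokens, so 'fight' matches 'Fight.'
# but not 'gunfight' or 'fighting', and matching is case-insensitive.
cur.execute("SELECT COUNT(*) FROM movie_overview WHERE overview MATCH 'fight'")
print(cur.fetchone()[0])

# Task 9c-style proximity query: NEAR/5 allows at most 5 intervening terms.
cur.execute("SELECT COUNT(*) FROM movie_overview WHERE overview MATCH 'space NEAR/5 program'")
print(cur.fetchone()[0])
conn.close()
```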

Q3 [15 points] D3 Warmup – Visualizing Wildlife Trafficking by Species

In this question, you will utilize a dataset provided by TRAFFIC, an NGO working to ensure that the global trade of wildlife is both legal and sustainable. TRAFFIC provides data through their interactive Wildlife Trade Portal, some of which we have already downloaded and pre-processed for you to use in Q3. Using species-related data, you will build a bar chart to visualize the most frequently illegally trafficked species between 2015 and 2023. Using D3, you will get firsthand experience with how interactive plots can make data more visually appealing, engaging, and easier to parse.

Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., jdoe3@gatech.edu). This reading provides an important foundation you will need for Homework 2. The question and autograder have been developed and tested for D3 version 5 (v5), while the book covers v4; what you learn from the book is transferable to v5, as v5 introduced few breaking changes. We also suggest briefly reviewing chapters 1-3 for background information on web development.

TRAFFIC International (2024) Wildlife Trade Portal. Available at www.wildlifetradeportal.org.

Technology
• D3 Version 5 (included in the lib folder)
• Chrome 97.0 (or newer): the browser for grading your code
• Python HTTP server (for local testing)

Allowed Libraries
The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided.

Deliverables
• Q3.html: Modified file containing all HTML, JavaScript, and any CSS code required to produce the bar plot. Do not include the D3 libraries or the q3.csv dataset.

IMPORTANT NOTES:
• Set up an HTTP server to run your D3 visualizations, as discussed in the D3 lecture (OMS students: watch the lecture video; campus students: see the lecture PDF). The easiest way is to use http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder.
• We have provided sections of skeleton code and comments to help you complete the implementation. While you do not need to remove them, you need to write additional code to make things work.
• All d3*.js files are provided in the lib folder and must be referenced using relative paths in your html file. For example, since the file Q3/Q3.html uses d3, its header references lib/d3/d3.min.js with a relative path; it is incorrect to use an absolute path that exists only on your computer. The 3 files that are referenced are:
   a. lib/d3/d3.min.js
   b. lib/d3-dsv/d3-dsv.min.js
   c. lib/d3-fetch/d3-fetch.min.js
• In your html / js code, use a relative path to read the dataset file. For example, since Q3 requires reading data from the q3.csv file, the path must be “q3.csv” and NOT an absolute path such as “C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute paths are specific locations that exist only on your computer, which means your code will NOT run on our machines when we grade, and you will lose points. As file paths are case-sensitive, ensure you provide the relative path correctly.
• Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv(). Handle any data conversions that might be needed, e.g., strings that need to be converted to integers. See https://github.com/d3/d3-fetch#dsv.
• VERY IMPORTANT: Use the Margin Convention guide to specify chart dimensions and layout.

Tasks and point breakdown
Q3.html: When run in a browser, it should display a horizontal bar plot with the following specifications:
1. [3.5 points] The bar plot must display one bar for each of the five most trafficked species by count. Each bar’s length corresponds to the number of wildlife trafficking incidents involving that species between 2015 and 2023, represented by the ‘count’ column in our dataset.
2. [1 point] The bars must have the same fixed thickness, and there must be some space between the bars so they do not overlap.
3. [3 points] The plot must have visible X and Y axes that scale according to the generated bars. That is, the axes are driven by the data they represent; they must not be hard-coded. The x-axis must be a <g> element having the id “x_axis” and the y-axis must be a <g> element having the id “y_axis”.
4. [2 points] Set the x-axis label to ‘Count’ and the y-axis label to ‘Species’. The x-axis label must be a <text> element having the id “x_axis_label” and the y-axis label must be a <text> element having the id “y_axis_label”.
5. [2 points] Use a linear scale for the X-axis to represent the count (recommended function: d3.scaleLinear()). Only display ticks and labels at every 500 interval. The X-axis must be displayed below the plot.
6. [2 points] Use a categorical scale for the Y-axis to represent the species names (recommended function: d3.scaleBand()). Order the species names from greatest to least by ‘count’ and limit the output to the top 5 species. The Y-axis must be displayed to the left of the plot.
7. [1 point] Set the HTML title tag and display a title for the plot. These two titles are independent of each other and need to be set separately. Set the HTML title tag (i.e., <title>), and position the title “Wildlife Trafficking Incidents per Species (2015 to 2023)” above the bar plot. The plot title must be a <text> element having the id “title”.
8. [0.25 points] Add your GT username (usually a mix of letters and numbers) to the area beneath the bottom-right of the plot. The GT username must be a <text> element having the id “credit”.
9. [0.25 points] Fill each bar with a unique color. We recommend using a colorblind-safe palette.

NOTE: Gradescope will render your plot using Chrome and present you with a Dropbox link to view a screenshot of your plot as the autograder sees it. This visual feedback helps you adjust and identify errors; e.g., a blank plot indicates a serious error. Your design does not need to replicate the solution plot. However, the autograder requires the following DOM structure (including correct IDs for elements) and sizing attributes to know how your chart is built.

plot: <svg> | width: 900 | height: 370
   +-- <g> containing the Q3.a plot elements
         +-- <g> containing the bars
         +-- <g id="x_axis"> x-axis
         |     +-- (x-axis elements)
         +-- <text id="x_axis_label"> x-axis label
         +-- <g id="y_axis"> y-axis
         |     +-- (y-axis elements)
         +-- <text id="y_axis_label"> y-axis label
         +-- <text id="credit"> GT username
         +-- <text id="title"> chart title

Q4 [5 points] OpenRefine

OpenRefine is a powerful tool for working with messy data, allowing users to clean and transform data efficiently. Use OpenRefine in this question to clean data from Mercari. Construct GREL queries to filter the entries in this dataset. OpenRefine is a Java application that requires a Java JRE to run; however, OpenRefine v3.6.2 comes with a compatible Java version embedded in the installer, so there is no need to install Java separately when working with this version.

Go through the main features on OpenRefine’s homepage. Then, download and install OpenRefine 3.6.2. The link to release 3.6.2 is https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2

Technology
• OpenRefine 3.6.2

Deliverables
• properties_clean.csv: Export the final table as a csv file.
• changes.json: Submit a list of the changes made, in json format. Go to the ‘Undo/Redo’ tab → ‘Extract’ → ‘Export’. This downloads ‘history.json’; rename it to ‘changes.json’.
• Q4Observations.txt: A text file with answers to parts b.i, b.ii, b.iii, b.iv, b.v, b.vi.
  Provide each answer on a new line in the output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line.

Tasks and point breakdown
1. Import Dataset
   a. Run OpenRefine and point your browser at http://127.0.0.1:3333.
   b. We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page. We have sampled a subset of the dataset, provided as “properties.csv”.
   c. Choose “Create Project” → This Computer → properties.csv. Click “Next”.
   d. You will now see a preview of the data. Click “Create Project” at the upper right corner.
2. [5 points] Clean/Refine the Data
   a. [0.5 points] Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows.
      Output format and sample values: i.rows: 500
      NOTE: OpenRefine maintains a log of all changes. You can undo changes with the “Undo/Redo” button at the upper left corner. You must follow all the steps in order and submit the final cleaned data file properties_clean.csv. The changes made in this step need to be present in the final submission; if they are not done at the beginning, the final number of rows can be incorrect and raise errors from the autograder.
   b. [1 point] Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns created by this operation, excluding the original category_name column.
      Output format and sample values: ii.columns: 10
      NOTE: While multiple methods can split data, ensure the new columns aren’t empty. Validate by sorting and checking for null values after using our suggested method in step b.
   c. [0.5 points] Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …); this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters produced.
      Output format and sample values: iii.function: fingerprint, 200
      NOTE: Use the default Ngram size when testing Ngram-fingerprint.
   d. [1 point] Replace the null values in the brand_name column with the text “Unknown” (Edit Cells → Transform). Provide the expression used.
      Output format and sample values: iv.GREL_categoryname: endsWith(“food”, “ood”)
      NOTE: “Unknown” is case and space sensitive (“Unknown” is different from “unknown” and “Unknown ”).
   e. [0.5 points] Create a new column high_priced with the values 0 or 1 based on the “price” column, with the following conditions: if the price is greater than 90, high_priced should be set to 1, else 0. Provide the GREL expression used to perform this.
      Output format and sample values: v.GREL_highpriced: endsWith(“food”, “ood”)
   f. [1.5 points] Create a new column has_offer with the values 0 or 1 based on the item_description column, with the following conditions: if it contains the text “discount” or “offer” or “sale”, then set the value in has_offer to 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase in the GREL expression before you search for the terms.
      Output format and sample values: vi.GREL_hasoffer: endsWith(“food”, “ood”)

Q5 [5 points] Introduction to Python Flask

Flask is a lightweight web application framework written in Python that provides tools, libraries, and technologies to build a web application quickly and scale it up as needed. In this question, you will build a web application that displays a table of TMDb data on a single-page website using Flask. You will modify the given file: wrangling_scripts/Q5.py

Technology
• Python 3.10.x
• Flask

Allowed Libraries
• Python standard libraries
• Libraries already imported in Q5.py

Deliverables
• Q5.py: Completed Python file with your changes

Tasks and point breakdown
1. username() – Update the username() method inside Q5.py by including your GT username.
2. Install Flask on your machine by running $ pip install Flask
   a. You can optionally create a virtual environment by following the steps here. Creating a virtual environment is purely optional and can be skipped.
3. To run the code, navigate to the Q5 folder in your terminal/command prompt and execute the command: python run.py. After running the command, go to http://127.0.0.1:3001/ in your browser. This will open index.html, showing a table in which the rows returned by data_wrangling() are displayed.
4. You must solve the following two sub-questions (a hedged sketch of both follows this list):
   a. [2 points] Read and store the first 100 rows in a table using the data_wrangling() method.
      NOTE: The skeleton code, by default, reads all the rows from movies.csv. You must add the required code to ensure that you are reading only the first 100 data rows. The skeleton code already handles reading the table header for you.
   b. [3 points] Sort this table in descending order of the values in the last (3rd) column, i.e., with larger values at the top and smaller values at the bottom of the table. Note that this column needs to be returned as a string for the autograder, but sorting may require float casting.
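
The skeleton’s data_wrangling() already reads movies.csv and its header; the sketch below is only a hedged illustration of the two sub-questions (limit the read to the first 100 data rows, then sort by the last column numerically while keeping the cell values as strings). The file path and the assumption of a single header row are illustrative; keep the return shape the skeleton expects.

```python
import csv

def data_wrangling(path="movies.csv"):
    """Hedged sketch: header + first 100 data rows, sorted by the 3rd column, descending."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)          # the skeleton already handles the header for you
        table = []
        for i, row in enumerate(reader):
            if i >= 100:               # keep only the first 100 data rows
                break
            table.append(row)
    # Cast to float only for ordering; the cell itself stays a string for the autograder.
    table.sort(key=lambda row: float(row[-1]), reverse=True)
    return header, table
```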


CSE 6242 / CX 4242: Data and Visual Analytics HW 2: Tableau, D3 Graphs, and Visualization

“Visualization gives you answers to questions you didn’t know you had.” – Ben Shneiderman

Download the HW2 Skeleton before you begin.

Homework Overview

Data visualization is an integral part of exploratory analysis and communicating key insights. This homework focuses on exploring and creating data visualizations using two of the most popular tools in the field: Tableau and D3.js. All 5 questions use data on the same topic to highlight the uses and strengths of different types of visualizations. The data comes from BoardGameGeek and includes games’ ratings, popularity, and metadata. Below are some terms you will often see in the questions:
• Rating – a value from 0 to 10 given to each game. BoardGameGeek calculates a game’s overall rating in different ways, including Average and Bayes, so make sure you are using the correct rating called for in a question. A higher rating is better than a lower rating.
• Rank – the overall rank of a board game from 1 to n, with ranks closer to 1 being better and n being the total number of games. The rank may be for all games or for a subgroup of games such as abstract games or family games.

The maximum possible score for this homework is 100 points. Students have the option to complete any 90 points’ worth of work to receive 100% (equivalent to 15 course total grade points) for this assignment. They can earn more than 100% if they submit additional work. For example, a student scoring 100 points will receive 111% for the assignment (equivalent to 16.67 course total grade points, as shown on Canvas).

Contents

Homework Overview
Important Notes
Submission Notes
Do I need to use the specific version of the software listed?
Q1 [25 points] Designing a good table. Visualizing data with Tableau.
Important Points about Developing with D3 in Questions 2–5
Q2 [15 points] Force-directed graph layout
Q3 [15 points] Line Charts
Q4 [20 points] Interactive Visualization
Q5 [25 points] Choropleth Map of Board Game Ratings

Important Notes
A. Submit your work by the due date on the course schedule.
   a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
   b. Before the grace period expires, you may resubmit as many times as needed.
   c. TA assistance is not guaranteed during the grace period.
   d. Submissions during the grace period will display as “late” but will not incur a penalty.
   e. We will not accept any submissions after the grace period ends.
B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion.
C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, using a HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit their own answers.
D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, handled directly by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for the assignments concerned, and prohibition from withdrawing from the class.

Submission Notes
A. All questions are graded on the Gradescope platform, accessible through Canvas.
   a. Question 1 will be manually graded after the final HW due date and grace period.
   b. Questions 2-5 are auto-graded at the time of submission.
B. We will not accept submissions anywhere outside of Gradescope.
C. Submit all required files as specified in each question. Make sure they are named correctly.
D. You may upload your code periodically to Gradescope to obtain feedback. There are no hidden test cases; the score you see on Gradescope is what you will receive.
E. Do not use Gradescope as the primary way to test your code. It provides only a few test cases, and its error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check.
F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify that:
   a. The code is free of syntax errors (by running it locally)
   b. All methods have been implemented
   c. The correct file was submitted with the correct name
   d. No extra packages or files were imported
G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time.
H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”).

Do I need to use the specific version of the software listed?

Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will use to run your code. Installing those specific versions on your computer is therefore highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code.
We will not award points for code that works locally but not on the autograder.

Q1 [25 points] Designing a good table. Visualizing data with Tableau.

Goal
Design a table, a grouped bar chart, and a stacked bar chart with filters in Tableau.

Technology
• Tableau Desktop

Deliverables
Gradescope: After selecting HW2 – Q1, click Submit Images. You will be taken to a list of questions for your assignment. Click Select Images and submit the following four PNG images under the corresponding questions:
• table.png: Image/screenshot of the table in Q1.1
• grouped_barchart.png: Image of the chart in Q1.2
• stacked_barchart_1.png: Image of the chart in Q1.3 after filtering data for Max.Players = 2
• stacked_barchart_2.png: Image of the chart in Q1.3 after filtering data for Max.Players = 4
Q1 will be manually graded after the grace period.

Setting Up Tableau
Install and activate Tableau Desktop by following “HW2 Instructions” on Canvas. The product activation key is for your use in this course only. Do not share the key with anyone. If you already have Tableau Desktop installed on your machine, you may use this key to reactivate it.
If you do not have access to a Mac or Windows machine, use the 14-day trial version of Tableau Online:
1. Visit https://www.tableau.com/trial/tableau-online
2. Enter your information (name, email, GT details, etc.)
3. You will then receive an email to access your Tableau Online site
4. Go to your site and create a workbook
If neither of the above methods works, use Tableau for Students. Follow the link and select “Get Tableau For Free”. You should be able to receive an activation key that offers you one year of Tableau Desktop at no cost by providing a valid Georgia Tech email.

Connecting to Data
1. It is optional to use Tableau for Q1.1. Otherwise, complete all parts using a single Tableau workbook.
2. Q1 will require connecting Tableau to two different data sources. You can connect to multiple data sources within one workbook by following the directions here.
3. For Q1.1 and Q1.2:
   a. Open Tableau and connect to a data source. Choose To a File – Text file. Select the popular_board_game.csv file from the skeleton.
   b. Click on the graph area at the bottom section next to “Data Source” to create worksheets.
4. For Q1.3:
   a. You will need a data.world account to access the data for Q1.3. Add a new data source by clicking on Data – New Data Source.
   b. When connecting to a data source, choose To a Server – Web Data Connector.
   c. Enter this URL to connect to the data.world data set on board games. You may be prompted to log in to data.world and authorize Tableau. If you haven’t used data.world before, you will be required to create an account by clicking “Join Now”. Do not edit the provided SQL query.
      NOTE: If you cannot connect to data.world, you can use the provided csv files for Q1 in the skeleton. The provided csv files are identical to those hosted online and can be loaded directly into Tableau.
   d. Click the graph area at the bottom section to create another worksheet, and Tableau will automatically create a data extract.

Table and Chart Design
1. [5 points] Good table design. Visualize the data contained in popular_board_game.csv as a data table (known as a text table in Tableau). In this part (Q1.1), you can use any tool (e.g., Excel, HTML, Pandas, Tableau) to create the table. We are interested in grouping popular games into “support solo” (min player = 1) and “not support solo” (min player > 1).
   Your table should clearly communicate information about these two groups simultaneously. For each group (Solo Supported, Solo Not Supported), show:
   a. The total number of games in each category (fighting, economic, …).
   b. In each category, the game with the highest number of ratings. If more than one game has the same (highest) number of ratings, pick the game you prefer. NOTE: Level of Detail expressions may be useful if you use Tableau.
   c. The average rating of games in each category (use a simple average), rounded to 2 decimal places.
   d. The average playtime of games in each category, rounded to 2 decimal places.
   e. In the bottom left corner below your table, include your GT username. (In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard; if you use Tableau, refer to the tutorial here.)
   f. Save the table as table.png. (If you use Tableau, go to Worksheet/Dashboard → Export → Image.) NOTE: Do not take screenshots in Tableau, since your image must have high resolution. You can take a screenshot if you use HTML, Pandas, etc.
   Your learning goal here is to practice good table design, which is not strongly dependent on the tool that you use. Thus, we do not require that you use Tableau in this part. You may decide the most meaningful column names, the number of columns, and the column order. You are not limited to only the techniques described in the lecture. For OMS students, the lecture video on this topic is Week 4 – Fixing Common Visualization Issues – Fixing Bar Charts, Line Charts. For campus students, review lecture slides 42 and 43. If you choose Pandas for this part, a hedged sketch of the grouping logic is shown after Q1.3 below.
2. [10 points] Grouped bar chart. Visualize popular_board_game.csv as a grouped bar chart in Tableau. Your chart should display game category (e.g., fighting, economic, …) along the horizontal axis and game count along the vertical axis. Show game playtime (e.g., <=30, (30, 60]) for each category. NOTE: Do not differentiate between “support solo” and “non-support solo” for this question.
   a. Design a vertically grouped bar chart. For each category, show the game count for each playtime.
   b. Include clearly labeled axes, a clear chart title, and a legend.
   c. In the bottom left corner of your image, include your GT username. NOTE: In Tableau, this can be done by including a caption when exporting an image of a worksheet or by adding a text box to a dashboard. Refer to the tutorial here.
   d. Save the chart as grouped_barchart.png (go to Worksheet/Dashboard → Export → Image). NOTE: Do not take screenshots in Tableau, since your image must have high resolution.
   The main goal here is for you to become familiar with Tableau; we therefore kept this open-ended so you can practice making design decisions. We will accept most designs. We show one possible design in Figure 1.2, based on the tutorial from Tableau.
3. [10 points] Stacked bar chart. Visualize the data.world dataset (or games_detailed_info_filtered.csv if using the local files in the skeleton) as a stacked bar chart. Showcase the count of games in different categories and the relationship between game categories, their mechanics, and max player size.
   a. Create a Worksheet with a stacked bar chart that shows game counts for each playing mechanic (sub-bars) for each game category. NOTE: This data contains duplicate rows, as each row represents a distinct game. Do not remove duplicate rows from the data.
   b. Display game counts along the vertical axis and category along the horizontal axis.
   c. Include clear axis labels, a clear chart title, and a legend.
   d. Create a Dashboard using the worksheet you created.
   e. Add a filter for the number of ‘Max.Players’ allowed in each game. Update the chart using this filter to generate the following chart images (refer to the tutorial here on how to add a filter in a dashboard; make sure to add ‘Max.Players’ to the filter shelf in the Worksheet first, like this):
      i. Select “2 Players” only in the filter. Save the resulting chart as ‘stacked_barchart_1.png’.
      ii. Select “4 Players” only in the filter. Save the resulting chart as ‘stacked_barchart_2.png’.
      iii. Both images must include your GT username in the bottom left. This can be added using a text box. Refer to the tutorial here: https://youtu.be/fRwQenvBJ6I
      iv. In each image, the filter must be visible. If you are using Tableau Online, you may need to add the worksheet containing the chart to a dashboard and then download an image of the dashboard that contains both the filter and the chart.
   Note: To save a dashboard image, go to Dashboard → Export Image. Do not submit screenshots. An example of a possible design is shown in Figure 1.3.
   Optional Reading: The effectiveness of stacked bar charts is often debated: sometimes they can be confusing, difficult to understand, and may make data series comparisons challenging.

Figure 1.2: Example of a grouped bar chart. Your chart may appear different and can earn full credit if it meets all the stated requirements. Your submitted image should include your GT username in the bottom left.
Figure 1.3: Example of a stacked bar chart after selecting “4 Players” in the Max.Players filter. Your chart may appear different and can earn full credit if it meets all the stated requirements. Your submitted image should include your GT username in the bottom left.
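
If you build the Q1.1 table with Pandas instead of Tableau, the grouping boils down to a groupby over a solo/non-solo flag and the game category. The sketch below is hedged: the column names (name, min_players, category, num_ratings, rating, playtime) are assumptions, so inspect the actual header of popular_board_game.csv and rename accordingly.

```python
import pandas as pd

# Assumed column names; inspect popular_board_game.csv and adjust before running.
df = pd.read_csv("popular_board_game.csv")
df["solo"] = df["min_players"].apply(
    lambda m: "Solo Supported" if m == 1 else "Solo Not Supported"
)

summary = (
    df.groupby(["solo", "category"])
      .agg(
          game_count=("name", "count"),                                   # a. games per category
          most_rated_game=("num_ratings", lambda s: df.loc[s.idxmax(), "name"]),  # b. most-rated game
          avg_rating=("rating", lambda s: round(s.mean(), 2)),            # c. average rating
          avg_playtime=("playtime", lambda s: round(s.mean(), 2)),        # d. average playtime
      )
      .reset_index()
)
print(summary.to_string(index=False))
```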

Important Points about Developing with D3 in Questions 2–5
1. We highly recommend that you use the latest Chrome browser to complete these questions. We will grade your work using Chrome v92 (or higher).
2. You will work with version 5 of D3 in this homework. You must NOT use any D3 libraries (d3*.js) other than the ones provided in the lib folder.
3. For Q3–5, your D3 visualization MUST produce a DOM structure as specified at the end of each question. Not only does the structure help guide your D3 code design, but it also enables your code to be auto-graded (the auto-grader identifies and evaluates relevant elements in the rendered HTML). We highly recommend you review the specified DOM structure before starting to code.
4. You need to set up a local HTTP server in the root (hw2-skeleton) folder to run your D3 visualizations, as discussed in the D3 lecture (OMS students: the video “Week 5 – Data Visualization for the Web (D3) – Prerequisites: JavaScript and SVG”; campus students: see the lecture PDF). The easiest way is to use http.server for Python 3.x (for more details, see the link).
5. All d3*.js files in the lib folder must be referenced using relative paths, e.g., “../lib/” in your html files. For example, if the file “Q2/submission.html” uses d3, its header should reference ../lib/d3/d3.min.js with a relative path; it is incorrect to use an absolute path that exists only on your computer.
6. For questions that require reading from a dataset, use a relative path to read in the dataset file. For example, suppose a question reads data from earthquake.csv; the path should simply be “earthquake.csv” and NOT an absolute path such as “C:/Users/polo/hw2-skeleton/Q/earthquake.csv”.
7. You can and are encouraged to decouple the style, functionality, and markup in the code for each question. That is, you can use separate files for CSS, JavaScript, and HTML.

Q2 [15 points] Force-directed graph layout

Goal
Create a network graph in D3 that shows relationships between games. Use interactive features like pinning nodes to give the viewer some control over the visualization.

Technology
• D3 Version 5 (included in the lib folder)
• Chrome v92.0 (or higher): the browser for grading your code
• Python HTTP server (for local testing)

Allowed Libraries
The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the auto-grading environment.

Deliverables
[Gradescope] Q2.(html/js/css): The HTML, JavaScript, and CSS to render the graph. Do not include the D3 libraries or the board_games.csv dataset.

You will experiment with many aspects of D3 for graph visualization. To help you get started, we have provided the Q2.html file and an undirected graph dataset of board games, board_games.csv (both in the Q2 folder). The dataset for this question was inspired by a Reddit post about visualizing board games as a network, where the author calculates the similarity between board games based on categories and game mechanics, and the edge value between each pair of board games (nodes) is the total weighted similarity index. This dataset has been modified and simplified for this question and does not fully represent the actual data found in the post.

The provided Q2.html file will display a graph (network) in a web browser. The goal of this question is for you to experiment with the visual styling of this graph to make a more meaningful representation of the data. Here is a helpful resource (about graph layout) for this question.

Note: You can submit a single Q2.html that contains all the CSS and JS components, or you can split Q2.html into Q2.html, Q2.css, and Q2.js.

1. [2 points] Adding node labels: Modify Q2.html to show the node label (the node name, e.g., the source) at the top right of each node in bold. If a node is dragged, its label must move with it.
2. [3 points] Styling edges: Style the edges based on the “value” field in the links array:
   a. If the value of the edge is equal to 0 (similar), the edge should be gray, thick, and solid (a dashed line with zero gap is not considered solid).
   b. If the value of the edge is equal to 1 (not similar), the edge should be green, thin, and dashed.
3. [3 points] Scaling nodes:
   a. [1.5 points] Scale the radius of each node in the graph based on the degree of the node (you may try a linear or squared scale, but you are not limited to these choices). Note: Regardless of which scale you decide to use, you should avoid extreme node sizes, which will likely lead to a low-quality visualization (e.g., nodes that are mere points, barely visible, or huge with overlaps). Note: D3 v5 does not support d.weight (which was the typical approach to obtain node degree in D3 v3); you may need to calculate node degrees yourself. An example of a relevant approach is here.
   b. [1.5 points] The degree of each node should be represented by varying colors. Pick a meaningful color scheme (hint: color gradients). There should be at least 3 color gradations, and it must be visually evident that nodes with a higher degree use darker/deeper colors and nodes with a lower degree use lighter colors. You can find example color gradients at Color Brewer.
4. [6 points] Pinning nodes:
   a. [2 points] Modify the code so that dragging a node will fix (i.e., “pin”) the node’s position such that it will not be modified by the graph layout algorithm. (Note: pinned nodes can be further dragged around by the user. Additionally, pinning a node should not affect the free movement of the other nodes.) Node pinning is an effective interaction technique to help users spatially organize nodes during graph exploration. The D3 API for pinning nodes has evolved over time; we recommend reading this post when you work on this sub-question.
   b. [1 point] Mark pinned nodes to visually distinguish them from unpinned nodes, i.e., show pinned nodes in a different color.
   c. [3 points] Double-clicking a pinned node should unpin (unfreeze) its position and unmark it. When a node is no longer pinned, it should move freely again.
   IMPORTANT:
   1. To pass the autograder consistently for part a (which tests whether a dragged node becomes pinned and retains its position), you may need to increase the radius of highly weighted nodes and reduce their label sizes, so that the nodes can be more easily detected by the autograder’s webdriver mouse cursor.
   2. To avoid timeout errors on Gradescope, complete the double-click function in part c before submitting.
   3. If you receive timeout messages for all parts and your code works locally on your computer, verify that you are indeed using the appropriate ids provided in the “add the nodes” section of the skeleton code.
   4. D3 v5 does not support the d.fixed method (it was deprecated after D3 v3). For our purposes, it is used as a Boolean value to indicate whether a node has been pinned or not.
5. [1 point] Add GT username: Add your Georgia Tech username (usually a mix of letters and numbers, e.g., gburdell3) to the top right corner of the force-directed graph (see the example image). The GT username must be a <text> element having the id “credit”.

Figure 2: Example of the visualization with a pinned node (yellow). Your chart may appear different and can earn full credit if it meets all the stated requirements.

Q3 [15 points] Line Charts

Goal
Explore temporal patterns in the BoardGameGeek data using line charts in D3 to compare how the number of ratings grew. Integrate additional data about board game rankings onto these line charts and explore the effect of axis scale choice.

Technology
• D3 Version 5 (included in the lib folder)
• Chrome v92.0 (or higher): the browser used for grading your code
• Python HTTP server (for local testing)

Allowed Libraries
The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the autograder environment.

Deliverables
[Gradescope] Q3.(html/js/css): The HTML, JavaScript, and CSS to render the line charts. Do not include the D3 libraries or the boardgame_ratings.csv dataset.

Use the dataset in the file boardgame_ratings.csv (in the Q3 folder) to create line charts. Refer to the line chart tutorial here.

Note: You will create four charts in this question, which should be placed one after the other on a single HTML page, like the example image below (Figure 3). Note that your design need NOT be identical to the example; however, the submission must follow the DOM structure specified at the end of this question.

IMPORTANT: Use the Margin Convention guide for specifying chart dimensions and layout. The autograder will assume this convention has been followed for grading purposes. The SVG viewBox attribute is not recommended for defining the position and dimension of your SVG.
The SVG viewBox attribute is not recommended to define the position and dimension of your SVG. 1. [5 points] Creating line chart. Create a line chart (Figure 3.1) that visualizes the number of board game ratings from November 2016 to August 2020 (inclusively), for the eight board games: [‘Catan’, ‘Dominion’, ‘Codenames’, ‘Terraforming Mars’, ‘Gloomhaven’, ‘Magic: The Gathering’, ‘Dixit’, ‘Monopoly’]. Use d3.schemeCategory10() to differentiate these board games. Add each board game’s name next to its corresponding line. For the x-axis, show a tick label for every three months. Use D3 axis.tickFormat() and d3.timeFormat() to format the ticks to display abbreviated months and years. For example, Jan 17, Apr 17, Jul 17. (See Figure 3.1 and its x-axis ticks). ● Chart title: Number of Ratings 2016-2020 ● Horizontal axis label: Month. Use D3.scaleTime(). ● Vertical axis label: Num of Ratings. Use a linear scale (for this part). VERY IMPORTANT — Beware of “Silent Date Conversion”: Opening the csv file in an application like Excel may silently modify date strings without warning you, e.g., converting hyphen-separated date strings (e.g., 2016-11-01) into slash-separated date strings (e.g., 11/01/16). Impacted students would see a “correct” line chart visualization on their local computers, but when they upload their code to Gradescope, test cases will fail (e.g., tick labels are not found, lines are not drawn) because the x-scale cannot be computed (as the dates are parsed as NaN). To view the content of a csv file, we recommend you only use text editors (e.g., sublime text, notepad) that do not silently modify csv files. 2. [5 points] Adding board game rankings. Create a line chart (Figure 3.2) for this part (append to the same HTML page) whose design is a variant of what you have created in part 1. Start with your chart from part 1. Modify the code to visualize how the rankings of [‘Catan’, ‘Codenames’, ‘Terraforming Mars’, ‘Gloomhaven’] change over time by adding a symbol with the ranking text on their corresponding lines. Show the symbol for every three months and exactly align with the x-axis ticks in part 1. (See Figure 3.2). Add a legend to explain what this symbol represents next to your chart (See the Figure 3.2 bottom right). 11 Version 0 ● Chart title: Number of Ratings 2016-2020 with Rankings 3. [5 points] Axis scales in D3. Create two line charts (Figure 3.3-1 and 3.3-2) for this part (append to the same HTML page) to try out two axis scales in D3. Start with your chart from part 2. Then modify the vertical axis scale for each chart: the first chart uses the square root scale for its vertical axis (only), and the second chart uses the log scale for its vertical axis (only). Keep the symbols and the symbol legend you implemented in part 2. At the bottom right of the last chart, add your GT username (e.g., gburdell3, see Figure 3.3-2 for example). Note: the horizontal axes should be kept in linear scale, and only the vertical axes are affected. Hint: You may need to carefully set the scale domain to handle the 0s in data. ● First chart (Figure 3.3-1) ○ Chart title: Number of Ratings 2016-2020 (Square root Scale) ○ This chart uses the square root scale for its vertical axis (only) ○ Other features should be the same as part 2. ● Second chart (Figure 3.3-2) ○ Chart title: Number of Ratings 2016-2020 (Log Scale) ○ This chart uses the log scale for its vertical axis (only). Set the y-scale domain minimum to 1. ○ Other features should be the same as part 2. Figure 3.1: Example line chart. 
Figure 3.1: Example line chart. Your chart may appear different and can earn full credit if it meets all stated requirements.
Figure 3.2: Example of a line chart with rankings. Your chart may appear different and can earn full credit if it meets all stated requirements.
Figure 3.3-1: Example of a line chart using the square root scale. Your chart may appear different and can earn full credit if it meets all stated requirements.
Figure 3.3-2: Example of a line chart using the log scale. Your chart may appear different and can earn full credit if it meets all stated requirements.

Note: Your D3 visualization MUST produce the following DOM structure.

plot (Q3.1)
   +-- chart title
   +-- element containing the Q3.1 plot elements
         +-- element containing the plot lines and line labels
         +-- x-axis
         |     +-- (x-axis elements)
         |     +-- x-axis label
         +-- y-axis
               +-- (y-axis elements)
               +-- y-axis label

plot (Q3.2)
   +-- chart title
   +-- element containing the Q3.2 plot elements
   |     +-- element containing the plot lines and line labels
   |     +-- element for the x-axis
   |     |     +-- (x-axis elements)
   |     |     +-- x-axis label
   |     +-- element for the y-axis
   |     |     +-- (y-axis elements)
   |     |     +-- y-axis label
   |     +-- element containing the plotted symbols and symbol labels
   +-- element containing the legend symbol and legend text element(s)

plot (Q3.3-1): same format as Q3.2, with c-1 in ids (e.g., id="svg-c-1", etc.)
plot (Q3.3-2): same format as Q3.2, with c-2 in ids (e.g., id="svg-c-2", etc.)

element containing the GT username

Q4 [20 points] Interactive Visualization

Goal: Create line charts in D3 that use interactive elements to display additional data. Then implement a bar chart that appears when you mouse over a point on the line chart.

Technology
- D3 Version 5 (included in the lib folder)
- Chrome v92.0 (or higher): the browser for grading your code
- Python http server (for local testing)

Allowed Libraries: The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the auto-grading environment.

Deliverables: [Gradescope] Q4.(html/js/css): The HTML, JavaScript, CSS to render the visualization in Q4. Do not include the D3 libraries or the average-rating.csv dataset.

Use the dataset average-rating.csv provided in the Q4 folder to create an interactive frequency polygon line chart. This dataset contains a list of games, their ratings, and supporting information like the number of users who rated a game and the year a game was published. In the data sample below, each row under the header represents a game name, year of publication, average rating, and the number of users who rated the game. Helpful resource for working with nested data in D3: https://gist.github.com/phoebebright/3176159

name,year,average_rating,users_rated
Codenames,2015,7.71148,51209
King of Tokyo,2011,7.23048,48611

1. [3 points] Create a line chart. Summarize the data by displaying the count of board games by rating for each year. Round each rating down to the nearest integer, using Math.floor(). For example, a rating of 7.71148 becomes 7. For each year, sum the count of board games by rating. Display one plot line for each of the 5 years (2015-2019) in the dataset. Note: the dataset comprises year data from 2011 to 2020; this question asks you to plot lines for the years 2015-2019. If some of the datapoints in the chart do not have ratings, generate dummy values (0s) to be displayed on the chart for the required years. All axes must start at 0, and their upper limits must be automatically adjusted based on the data. Do not hard-code the upper limits. Note: if you are losing points on Gradescope for axis or scale, ensure that you are using the proper margin convention without any additional paddings or translations.
• The vertical axis represents the count of board games for a given rating. Use a linear scale.
• The horizontal axis represents the ratings. Use a linear scale.

2. [3 points] Line styling, legend, title and username.
• For each line, use a different color of your choosing. Display a filled circle for each rating-count data point.
• Display a legend on the right-hand portion of the chart to show how line colors map to years.
• Display the title “Board games by Rating 2015-2019” at the top of the chart.
• Add your GT username (usually includes a mix of lowercase letters and numbers, e.g., gburdell3) beneath the title (see example Figure 4.1).

Figure 4.1 shows an example line chart design. Yours may look different but can earn full credit if it meets all stated requirements.

Note: The data provided in average-rating.csv requires some processing for aggregation. All aggregation must only be performed in JavaScript; you must NOT modify average-rating.csv. That is, your code should first read the data from the .csv file as is, then you may process the loaded data using JavaScript. (An optional offline sanity check of the expected counts is sketched below.)
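Since the graded aggregation must be written in JavaScript, the following is only an optional, hedged pandas sketch for checking the counts you expect to see; the file and column names follow the sample above, and it is not part of the submission:

# Offline sanity check only -- the graded aggregation must be done in JavaScript.
# Assumes average-rating.csv sits in the working directory with the columns shown above.
import math
import pandas as pd

df = pd.read_csv("average-rating.csv")
df["rating_floor"] = df["average_rating"].apply(math.floor)   # e.g., 7.71148 -> 7

# Count board games per (year, floored rating) for the plotted years 2015-2019.
counts = (
    df[df["year"].between(2015, 2019)]
    .groupby(["year", "rating_floor"])
    .size()
    .unstack(fill_value=0)          # missing year/rating combinations become 0
)
print(counts)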
If you are getting a MoveTargetOutOfBoundsException, (a) check that your margin convention is correct, and (b) check the screenshot of your graph linked on Dropbox to get a good idea of how the plot could be improved compared to the sample graph provided.

Figure 4.1: Line chart representing the count of board games by rating for each year. Your chart may appear different but can earn full credit if it meets all stated requirements.

Figure 4.2: Bar chart representing the number of users who rated the top 5 board games with the rating 6 in year 2019. Your chart may appear different but can earn full credit if it meets all stated requirements.

Interactivity and sub-chart. In the next few sub-questions, you will create event handlers to detect mouseover and mouseout events over each circle that you added in Q4.2.

3. [8 points] Create a horizontal bar chart, so that when hovering over a circle, that bar chart is shown below the line chart. The bar chart displays the top 5 board games that received the highest numbers of user ratings (users_rated), for the hovered year and rating. For example, hovering over the rating-6 circle for 2019 will display the bar chart for the number of users who rated the top 5 board games. If a certain year/rating combination has fewer than 5 entries, display as many as there are. Figure 4.2 shows an example design. Show one bar per game. The bar length represents the number of users who rated the game. Note: No bar chart should be displayed when the count of games is 0 for the hovered year and rating. Axes: All axes should be automatically adjusted based on the data. Do not hard-code any values.
• The vertical axis represents the board games. Sort the game names in descending order, such that the game with the highest users_rated is at the bottom, and the game with the smallest users_rated is at the top. Some board game names are quite long. For each game name, display its first 10 characters (if a name has fewer than 10 characters, display them all). A space counts as a character.
• The horizontal axis represents the number of users who rated the game (for the hovered year and rating). Use a linear scale.
• Set the horizontal axis label to ‘Number of users’ and the vertical axis label to ‘Games’.

4. [2 points] Bar styling, grid lines and title
• Bars: All bars should have the same color regardless of year or rating. All bars for a specific year should have a uniform bar thickness.
• Grid lines should be displayed.
• Title: Display a title with the format “Top 5 Most Rated Games of <year> with Rating <rating>” at the top of the chart, where <year> and <rating> are what the user hovers over in the line chart. For example, hovering over rating 6 in 2015, the title would read: “Top 5 Most Rated Games of 2015 with Rating 6”.

5. [2 points] Mouseover Event Handling
• The bar chart and its title should only be displayed during mouseover events for a circle in the line chart.
• The circle in the line chart should change to a larger size during mouseover to emphasize that it is the selected point.
• When the count of games is 0 for the hovered year and rating, no bar chart should be displayed. The hovered-over circle on the line graph should still change to a larger size to show it is selected.

Hint: .attr() is generally used for describing the size, shape, location, etc. of an element, whereas .style() is used for other design aspects like color, opacity, etc.
6. [2 points] Mouseout Event Handling. The bar chart and its title should be hidden from view on mouseout, and the circle previously mouseover-ed should return to its original size in the line chart. The graph should exhibit interactivity similar to Figure 4.6, where the mouse is over the larger circle.

Figure 4.6: Line chart and bar chart demonstrating interactivity. Your chart may appear different, but you can earn full credit if it meets all stated requirements.

Note: Your D3 visualization MUST produce the following DOM structure.

containing line chart
+-- element containing all line elements
|   +-- elements for plotted lines
+-- element for y-axis
+-- element for all circular elements
|   +-- elements
+-- element for line chart title
+-- element for GT username
+-- element for legend
|   +-- (elements for legend)
|   +-- (elements for legend)
+-- element for x axis label
+-- element for y axis label

containing bar chart title

containing bar chart
+-- element for bars
|   +-- elements for bars
+-- element for y-axis
+-- element for x axis label
+-- element for y axis label

Q5 [25 points] Choropleth Map of Board Game Ratings

Goal: Create a choropleth map in D3 to explore the average rating of board games in different countries.

Technology
- D3 Version 5 (included in the lib folder)
- Chrome v92.0 (or higher): the browser for grading your code
- Python http server (for local testing)

Allowed Libraries: The D3 library is provided to you in the lib folder. You must NOT use any D3 libraries (d3*.js) other than the ones provided. On Gradescope, these libraries are provided for you in the auto-grading environment.

Deliverables: [Gradescope] Q5.(html/js/css): Modified file(s) containing all html, javascript, and any css code required to produce the plot. Do not include the D3 libraries or csv files.

Choropleth maps are a very common visualization in which different geographic areas are colored based on the value of a variable for each geographic area. You have most probably seen choropleth maps showing quantities like unemployment rates for each county in the US, or COVID-related maps and data at the county level in the US. We will use choropleth maps to examine the popularity of different board games across the world. We have provided two files in the Q5 folder, ratings-by-country.csv and world_countries.json.
• Each row in ratings-by-country.csv represents a game’s information for a country, in the form of <Game, Country, Number of Users, Average Rating>, where
  o Game: the name of a game, e.g., Catan.
  o Country: a country in the world, e.g., United States of America.
  o Number of Users: the number of users who have rated Game who are from Country.
  o Average Rating: the mean rating given to Game by users who are from Country.
  This dataset has been preprocessed and filtered to include only those games that have been rated by more than 1000 users in the world.
• The world_countries.json file is a geoJSON, containing a single geometry collection: countries. You can find examples of map generation using geoJSON here.

1. [20 points] Create a choropleth map using the provided data; use Figures 5.1 and 5.2 as references.
a. [5 points] Dropdown lists are commonly used on dashboards to enable filtering of data. Create a dropdown list (see example in Figure 5.2) to allow users to select which game’s data are displayed.
• The list options should be obtained from the Game column of the csv file.
• Sort the list options in (case-sensitive) alphabetical order. Set the default display value to the first option.
• Selecting a different game from the dropdown list should update both the choropleth map (see part 2) and the legend (see part 3) accordingly.
Hint: If the map fails to load, you may try this method.
b. [10 points] Load the data from ratings-by-country.csv and create a choropleth map such that the color of each country in the map corresponds to the average rating, in that country, of the game selected in the dropdown. Use a “Natural Earth” projection of the geoJSON file. You MUST name the function for calculating path as ‘path’, to help the auto-grader locate it. Create a projection first and use it as an input for the path. Promise.all() is provided within the skeleton code and you can use it to read in both the world json file and the game data csv file. Usage example: Making a Map in D3.js v.5. Many countries have no ratings for some games — these should be colored gray.
For those countries that do have ratings for the selected game, use a quantile scale to generate the color scheme based on the average rating by country. Color them along a gradient of exactly 4 gradations from a single hue, with darker colors corresponding to higher rating values and lighter colors corresponding to lower values (see gradient examples at Color Brewer).

About Scaling Colormaps: In order to create effective visualizations that highlight patterns of interest, it is important to carefully think about the relationship between the range and distribution of values being displayed (the domain) and the color scale the values are mapped to (the range). Many types of mapping functions are possible, e.g., we could use a linear mapping where the lowest game rating is mapped to the first value in the color scheme, the highest game rating is mapped to the highest value in the color scheme, and intermediate ratings are mapped to hues in the middle. This article illustrates the value of choosing appropriate endpoints for linear color maps, or log-scaling the domain so that large but relatively infrequent values do not cause differences between smaller values to be washed out. In our case, most board games have similar average ratings across countries, e.g., Catan has an average rating close to 9.3 in almost all countries, making it challenging to perceive relative differences in popularity. To address this, we can compute quantiles of the domain data — game rating values that divide the ordered list of average ratings per country into roughly equally-sized groups. Here, we will get 4 groups, a special case of quantiles called “quartiles” since the data are divided into quarters.

Hint: You can verify the correctness of the quartiles generated by using the ‘quartile’ function in Excel. Open ratings-by-country.csv and filter the data for one game (say Catan). Then use the quartile function to get the 0th, 1st, 2nd, 3rd and 4th quartile values from the Average Rating column. Here [0th quartile, 1st quartile), [1st quartile, 2nd quartile), [2nd quartile, 3rd quartile), [3rd quartile, 4th quartile] will represent the 4 groups of values generated by the d3 quantile scale. (A similar check in Python is sketched at the end of this question.) Use all the countries listed in ratings-by-country.csv to generate your quartiles depending on the selection, including ones which may not appear in the geoJSON.

c. [5 points] Add a vertical legend showing how colors map to the average rating for a particular game. The legend must update for the quartiles of the selected game, and display values formatted to show precision up to 2 decimal places. You must use exactly 4 color gradations in your submission. It is recommended, but not required, to use d3-legend.min.js (in the lib folder) to create the legend for the scale you use. The legend bars should be rectangular in shape. Also, display your GT username (e.g., gburdell3) beneath the map.

2. [5 points] Add a tooltip using the d3-tip.min library (in the lib folder). On hovering over a country, the tooltip should show the following information on separate lines:
• Country name
• Game
• Avg Rating for that game in that country
• Number of Users from the Country who rated the game
For countries with no data, the tooltip should display “N/A” for Avg Rating and Number of Users. Note: The tooltip should appear when the mouse hovers over the country. Figure 5.2 demonstrates this for Catan. On mouseout, the tooltip should disappear.
You can position the tooltip a small distance away from the mouse cursor (e.g., add a margin to the tooltip via CSS) and have it follow the cursor, which will prevent the tooltip from “flickering” as you move the mouse around quickly (the tooltip disappears when your mouse leaves a country and enters the tooltip’s bounding box). You must prevent such “flickering”, because rapid appearance and disappearance prevents the autograder from detecting the tooltip’s content (since the tooltip can no longer be found) and adversely affects the usability of your visualization in practice. Alternatively, you may position the tooltip at a location (picked by you) such that it is close to the country the cursor is currently at. Please ensure the tooltip is fully visible (i.e., not clipped, especially near the page edges). If the tooltip becomes clipped, you may lose points.

Note: Please ensure that you only have a single tooltip element defined in your code. You should not create new tooltip elements for different countries; rather, update the contents, position, and visibility of a single tooltip.

Note: You must create the tooltip by only using d3-tip.min.js in the lib folder.

Figure 5.1: Reference example for Choropleth Map showing the average rating of Catan. Your chart may appear different but you will earn full credit as long as it meets all stated requirements.

Figure 5.2: Reference example for Choropleth Map showing the tooltip. Your chart may appear different but you will earn full credit as long as it meets all stated requirements.

Figure 5.3: Reference example showing updated Choropleth and legend for Azul. Your chart may appear different but you will earn full credit as long as it meets all stated requirements.

Hints
• Countries without data should be colored gray. These countries can be found using a condition that compares the country’s average rating with ‘undefined’.
• It is optional for your visualization to show (or not show) Antarctica.
• The D3-tip warning may be ignored if it does not break the code.
• You may consider clearing the SVG and creating a new map when selecting a new game.

Note: You may change the given code in choropleth.html as necessary. Your D3 visualization MUST produce the following DOM structure.

contains tooltip to display (Q5.2)
+-- (text for tooltip)
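As referenced in the quartile hint above, here is a minimal Python sketch for verifying the quartile breakpoints offline. It assumes the header names are exactly “Game” and “Average Rating” as described above and uses Catan only as an example; it is a verification aid, not part of the D3 submission:

# Offline check of the quartile breakpoints used by the d3 quantile scale (Q5.1b).
# Assumes header names "Game" and "Average Rating"; adjust if the csv differs.
import pandas as pd

df = pd.read_csv("ratings-by-country.csv")
ratings = df.loc[df["Game"] == "Catan", "Average Rating"]

# 0th-4th quartiles; the four color groups are [q0, q1), [q1, q2), [q2, q3), [q3, q4].
quartiles = ratings.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(quartiles.round(2))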

CSE 6242 / CX 4242: Data and Visual Analytics HW 3

CSE 6242 / CX 4242: Data and Visual Analytics HW 3: Spark, Docker, Databricks, AWS and GCP

Download the HW3 Skeleton, Q1 Data, Q2 Data, and Q4 Data before you begin. Also, create an AWS Academy account as outlined in Step 1 of the AWS Setup Guide.

Homework Overview

Modern-day datasets are large. For example, the NASA Terra and Aqua satellites each produce over 300GB of satellite imagery daily. These datasets are too large for typical computer hard drives and require advanced technologies for processing. In this assignment, you will work with a dataset of over 1 billion taxi trips from the New York City Taxi & Limousine Commission (TLC). Further details on this dataset are available here.

This assignment aims to familiarize you with various tools that will be valuable for future projects, research, or career opportunities. By including AWS, Azure and GCP, we want to provide the opportunity to explore and compare these rapidly evolving platforms. This experience will help you make informed decisions when selecting a cloud platform in the future, allowing you to get started quickly and confidently. Many of the computational tasks in this assignment are straightforward, though quite a bit of “setup” will be needed before reaching the actual “programming” stage. Setting up work environments, launching clusters, monitoring compute usage, and running large-scale experiments on cloud platforms are important skills. This assignment familiarizes you with using machine clusters and understanding the pay-per-use model of most cloud services, offering a valuable first experience with cloud computing for many students.

The maximum possible score for this homework is 100 points.

Homework Overview……………………………………………………… 1
Important Notes …………………………………………………………… 2
Submission Notes…………………………………………………………. 2
Do I need to use the specific version of the software listed?…………. 2
Q1 [15 points] Analyzing trips data with PySpark……………………… 3
Tasks and point breakdown………………………………………………. 3
Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks ………. 6
Tasks and point breakdown………………………………………………. 7
Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS…. 9
Tasks and point breakdown……………………………………………… 10
Q4 [10 points] Analyzing a Large Dataset using Spark on GCP……… 12
Tasks and point breakdown……………………………………………… 13
Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning …… 14
Tasks and point breakdown……………………………………………… 14

Important Notes
A. Submit your work by the due date on the course schedule.
a. Every assignment has a generous 48-hour grace period, allowing students to address unexpected minor issues without facing penalties. You may use it without asking.
b. Before the grace period expires, you may resubmit as many times as you need.
c. TA assistance is not guaranteed during the grace period.
d. Submissions during the grace period will display as “late” but will not incur a penalty.
e. We will not accept any submissions executed after the grace period ends.
B. Always use the most up-to-date assignment (version number at the bottom right of this document). The latest version will be listed in Ed Discussion.
C. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation works, use HashMap instead of an array) and review any relevant materials online. However, each student must write up and submit the student’s own answers.
D. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures, directly handled by the Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

Submission Notes
A. All questions are graded on the Gradescope platform, accessible through Canvas.
B. We will not accept submissions anywhere else outside of Gradescope.
C. Submit all required files as specified in each question. Make sure they are named correctly.
D. You may upload your code periodically to Gradescope to obtain feedback on your code. There are no hidden test cases. The score you see on Gradescope is what you will receive.
E. You must not use Gradescope as the primary way to test your code. It provides only a few test cases and error messages may not be as informative as local debuggers. Iteratively develop and test your code locally, write more test cases, and follow good coding practices. Use Gradescope mainly as a “final” check.
F. Gradescope cannot run code that contains syntax errors. If you get the “The autograder failed to execute correctly” error, verify:
a. The code is free of syntax errors (by running locally)
b. All methods have been implemented
c. The correct file was submitted with the correct name
d. No extra packages or files were imported
G. When many students use Gradescope simultaneously, it may slow down or fail. It can become even slower as the deadline approaches. You are responsible for submitting your work on time.
H. Each submission and its score will be recorded and saved by Gradescope. By default, your last submission is used for grading. To use a different submission, you MUST “activate” it (click the “Submission History” button at the bottom toolbar, then “Activate”).

Do I need to use the specific version of the software listed?
Under each question, you will see a set of technologies with specific versions – this is what is installed on the autograder and what it will run your code with. Thus, installing those specific versions on your computer to complete the question is highly recommended. You may be able to complete the question with different versions installed locally, but you are responsible for determining the compatibility of your code. We will not award points for code that works locally but not on the autograder.
Q1 [15 points] Analyzing trips data with PySpark

Follow these instructions to download and set up a preconfigured Docker image that you will use for this assignment.

Why use Docker? In earlier iterations of this course, students installed software on their own machines, and we (both students and the instructor team) ran into many issues that could not be resolved satisfactorily. Docker allows us to distribute a cross-platform, preconfigured image with all the requisite software and correct package versions. Once Docker is installed and the container is running, access Jupyter by browsing to http://localhost:6242. There is no need to install any additional Java or PySpark dependencies, as they are all bundled as part of the Docker container.

You will use the yellow_tripdata_2019-01_short.csv dataset, a modified record of the NYC Green Taxi trips that includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts. When processing the data or performing calculations, do not round any values, unless specifically instructed to.

Technology: PySpark, Docker

Deliverables: [Gradescope] q1.ipynb: your solution as a Jupyter Notebook file

IMPORTANT NOTES:
• Only regular PySpark Dataframe Operations can be used.
• Do NOT use PySpark SQL functions, i.e., sqlContext.sql(‘select * … ‘). We noticed that students frequently encountered difficult-to-resolve issues when using these functions. Additionally, since you already worked extensively with SQL in HW1, completing this task in SQL would offer limited educational value.
• Do not reference sqlContext within the functions you are defining for the assignment.
• If you re-run cells, remember to restart the kernel to clear the Spark context, otherwise an existing Spark context may cause errors.
• Be sure to save your work often! If you do not see your notebook in Jupyter, then double check that the file is present in the folder and that your Docker has been set up correctly. If, after checking both, the file still does not appear in Jupyter, then you can still move forward by clicking the “upload” button in the Jupyter notebook and uploading the file – however, if you use this approach, then your file will not be saved to disk when you save in Jupyter, so you would need to download your work by going to File > Download as… > Notebook (.ipynb). Be sure to download often to save your work!
• Do not add any cells or additional library imports to the notebook.
• Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.

Tasks and point breakdown
1. [1 pt] You will be modifying the function clean_data to clean the data. Cast the following columns into the specified data types:
a. passenger_count — integer
b. total_amount — float
c. tip_amount — float
d. trip_distance — float
e. fare_amount — float
f. tpep_pickup_datetime — timestamp
g. tpep_dropoff_datetime — timestamp
2. [4 pts] You will be modifying the function common_pair. Return the top 10 pickup-dropoff location pairs that have the highest sum of passenger_count who have traveled between them. Sort the location pairs by total passengers between pairs. For each location pair, also compute the average amount per passenger over all trips (name this per_person_rate), utilizing total_amount. For pairs with the same total passengers, sort them in descending order of per_person_rate. Filter out any trips that have the same pick-up and drop-off location. Rename the column for total passengers to total_passenger_count.

Sample Output Format — The values below are for demonstration purposes:

PULocationID  DOLocationID  total_passenger_count  per_person_rate
1             2             23                     5.242345
3             4             5                      6.61345634

3. [4 pts] You will be modifying the function distance_with_most_tip. Filter the data for trips having fares (fare_amount) greater than $2.00 and a trip distance (trip_distance) greater than 0. Calculate the tip percent (tip_amount * 100 / fare_amount) for each trip. Round all trip distances up to the closest mile and find the average tip_percent for each trip_distance.
Sort the result in descending order of tip_percent to obtain the top 15 trip distances which tip the most generously. Rename the column for rounded trip distances to trip_distance, and the column for average tip percents to tip_percent.

Sample Output Format — The values below are for demonstration purposes:

trip_distance  tip_percent
2              6.2632344561
1              4.42342882

4. [6 pts] You will be modifying the function time_with_most_traffic to determine which hour of the day has the most traffic. Calculate the traffic for a particular hour using the average speed of all taxi trips which began during that hour. Calculate the average speed as the average trip_distance divided by the average trip duration, as distance per hour. Make sure to determine the average durations and average trip distances before calculating the speed. It will likely be helpful to cast the dates to the long data type when determining the interval. A day with low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Additionally, you must separate the hours into AM and PM, with hours 0:00-11:59 being AM, and hours 12:00-23:59 being PM. Convert these times to the 12-hour format, so you can match the output below. For example, the row with 1 as time of day should show the average speed between 1 am and 2 am in the am_avg_speed column, and between 1 pm and 2 pm in the pm_avg_speed column. Use date_format along with the appropriate pattern letters to format the time of day so that it matches the example output below. Your final table should contain values sorted from 0-11 for time_of_day. There may be data missing for a time of day, and it may be null for am_avg_speed or pm_avg_speed. If an hour has no data for am or pm, there may be missing rows. You will not have rows for all possible times of day, and do not need to add them to the data if they are missing.

Sample Output Format — The values below are for demonstration purposes:

time_of_day  am_avg_speed  pm_avg_speed
1            0.953452345   9.23345272
2            5.2424622     null
4            null          2.55421905

Q2 [30 pts] Analyzing dataset with Spark/Scala on Databricks

Firstly, go over this Spark on Databricks Tutorial to learn the basics of creating Spark jobs, loading data, and working with data. You will analyze nyc-tripdata.csv [1] using Spark and Scala on the Databricks platform. (A short description of how Spark and Scala are related can be found here.) You will also need to use the taxi zone lookup table, taxi_zone_lookup.csv, that maps a location ID into the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.

Technology: Spark/Scala, Databricks

Deliverables: [Gradescope]
• q2.dbc: Your solution as a Scala Notebook archive file (.dbc) exported from Databricks (see the Databricks Setup Guide below)
• q2.scala: Your solution as a Scala source file exported from Databricks (see the Databricks Setup Guide below)
• q2_results.csv: The output results from your Scala code in the Databricks q2 notebook file. You must carefully copy the outputs of the display()/show() function into a file titled q2_results.csv under the relevant sections. Please double-check and compare your actual output with the results you copied.

IMPORTANT NOTES:
• Use only Firefox, Safari or Chrome when configuring anything related to Databricks. The setup process has been verified to work on these browsers.
• Carefully follow the instructions in the Databricks Setup Guide. (You should have already downloaded the data needed for this question using the link provided before Homework Overview.)
  o You must choose the Databricks Runtime (DBR) version as “10.4 (includes Apache Spark 3.2.1, Scala 2.12)”. We will grade your work using this version.
  o Note that you do not need to install Scala or Spark on your local machine. They are provided with the DBR environment.
• You must use only Scala DataFrame operations for this question. Scala DataFrames are just another name for Spark DataSets of rows. You can use the DataSet API in Spark to work on these DataFrames. Here is a Spark document that will help you get started on working with DataFrames in Spark. You will lose points if you use SQL queries, Python, or R to manipulate a DataFrame.
  o After selecting the default language as SCALA, do not use the language magic % with other languages like %r, %python, %sql etc. The language magics are used to override the default language, which you must not do for this assignment.
  o You must not use full SQL queries in lieu of the Spark DataFrame API. That is, you must not use functions like sql(), which allow you to directly write full SQL queries like spark.sql("SELECT * FROM col1 WHERE …"). This should be df.select("*") instead.
• The template Scala notebook q2.dbc (in hw3-skeleton) provides you with code that reads the data file nyc-tripdata.csv. The input data is loaded into a DataFrame, inferring the schema using reflection (refer to the Databricks Setup Guide above). It also contains code that filters the data to only keep the rows where the pickup location is different from the drop-off location, and the trip distance is strictly greater than 2.0 (> 2.0).
  o All tasks listed below must be performed on this filtered DataFrame, or you will end up with wrong answers.
  o Carefully read the instructions in the notebook, which provides hints for solving the problems.
• Some tasks in this question have specified data types for the results that are of lower precision (e.g., float). For these tasks, we will accept relevant higher precision formats (e.g., double). Similarly, we will accept results stored in data types that offer “greater range” (e.g., long, bigint) than what we have specified (e.g., int).
• Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
• Hint: You may find some of the following DataFrame operations helpful: toDF, join, select, groupBy, orderBy, filter, agg, window(), partitionBy, orderBy, etc.

[1] Graph derived from the NYC Taxi and Limousine Commission

Tasks and point breakdown
1. List the top 5 most popular locations for:
a. [2 pts] dropoff based on “DOLocationID”, sorted in descending order by popularity. If there is a tie, the one with the lower “DOLocationID” gets listed first.
b. [2 pts] pickup based on “PULocationID”, sorted in descending order by popularity. If there is a tie, the one with the lower “PULocationID” gets listed first.
2. [4 pts] List the top 3 LocationIDs with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first.
Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.
3. [4 pts] List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx, along with “Unknown” and “EWR”) and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens). An example output format is shown below.
4. [5 pts] List the top 2 days of the week with the largest number of daily average pick-ups, along with the average number of pick-ups on each of the 2 days, in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3. An example output is shown below.
Note: The day of week is a string of the day’s full spelling, e.g., “Monday” instead of the number 1 or “Mon”. Also, the pickup_datetime is in the format: yyyy-mm-dd.
5. [6 pts] For each hour of a day (0 to 23, 0 being midnight) — in the order from 0 to 23 (inclusive), find the zone in the Brooklyn borough with the largest number of total pick-ups.
Note: All dates for each hour should be included.
6. [7 pts] Find which 3 different days in the month of January, in Manhattan, saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase. An example output is shown below.
Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.

List the results of the above tasks in the provided q2_results.csv file under the relevant sections. These preformatted sections also show you the required output format from your Scala code with the necessary columns — while column names can be different, their resulting values must be correct.
• You must manually enter the output generated into the corresponding sections of the q2_results.csv file, preferably using some spreadsheet software like MS-Excel (but make sure to keep the csv format). For generating the output in the Scala notebook, refer to the show() and display() functions of Scala.
• Note that you can edit this csv file using a text editor, but please be mindful about putting the results under the designated columns.
• If you encounter a “UnicodeDecodeError”, please save the file as “.csv UTF-8” to resolve it.
Note: Do NOT modify anything other than filling in those required output values in this csv file. We grade by running the Spark Scala code you write and by looking at your results listed in this file. So, make sure your output is obtained from the Spark Scala code you write. Failure to include the dbc and scala files will result in a deduction from your overall score.

Q3 [35 points] Analyzing Large Amount of Data with PySpark on AWS

You will try out PySpark for processing data on Amazon Web Services (AWS). Here you can learn more about PySpark and how it can be used for data analysis. You will be completing a task that may be accomplished using a commodity computer (e.g., consumer-grade laptops or desktops).
However, we would like you to use this exercise as an opportunity to learn distributed computing on AWS, and to gain experience that will help you tackle more complex problems. The services you will primarily be using are Amazon S3 storage and Amazon Athena. You will be creating an S3 bucket, running code using Athena and its serverless PySpark engine, and then storing the output into that S3 bucket. Amazon Athena is serverless, meaning that you pay for what you use. There are no servers to maintain that will accrue costs whether they are being used or not. For this question, you will only use up a very small fraction of your AWS credit. If you have any issues with the AWS Academy account, please post in the dedicated AWS Setup Ed Discussion thread.

In this question, you will use a dataset of trip records provided by the New York City Taxi and Limousine Commission (TLC). You will be accessing the dataset directly through AWS via the code outlined in the homework skeleton. Specifically, you will be working with two samples of this dataset, one small, and one much larger. Optionally, if you would like to learn more about the dataset, check out here and here; also optionally, you may explore the structure of the data by referring to [1] [2]. You are provided with a python notebook (q3.ipynb) file which you will complete and load into EMR. You are provided with the load_data() function, which loads two PySpark DataFrames. The first DataFrame, trips, contains trip data where each record refers to one (1) trip. The second DataFrame, lookup, maps a LocationID to its trip information. It can be linked to either the PULocationID or DOLocationID fields in the trips DataFrame.

Technology: PySpark, AWS

Deliverables: [Gradescope]
• q3.ipynb: PySpark notebook for this question (for the larger dataset).
• q3_output_large.csv: output file (comma-separated) for the larger dataset.

IMPORTANT NOTES
• Use Firefox, Safari or Chrome when configuring anything related to AWS.
• EXTREMELY IMPORTANT: Both the datasets are in the US East (N. Virginia) region. Using machines in other regions for computation will incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon, which is the default). This is extremely important, otherwise your code may not work, and you may be charged extra.
• Strictly follow the guidelines below, or your answer may not be graded.
a. Ensure that the parameters for each function remain as defined and the output order and names of the fields in the PySpark DataFrames are maintained.
b. Do not import any functions which were not already imported within the skeleton.
c. You must NOT round any numeric values. Rounding numbers can introduce inaccuracies. Our grader will be checking the first 8 decimal places of each value in the DataFrame.
d. You will not have access to the Spark object directly in the autograder. If you use it in your functions, the autograder will fail! You can use the Spark Context from the DataFrame.
e. Double check that you are submitting the correct files, and the filenames follow the correct naming standard — we only want the script and output from the larger dataset. Also, double check that you are writing the right dataset’s output to the right file.
f. You are welcome to store your script’s output in any bucket you choose, if you can download and submit the correct files.
g. Do not make any manual changes to the output files.
h. Please ensure that you do not remove #export from the HW skeleton.
i. Do not import any additional packages, INCLUDING pyspark.sql.functions, as this may cause the autograder to work incorrectly. Everything you need should be imported for you.
j. Using .rdd() can cause issues in the GradeScope environment. You can accomplish this assignment without it. In general, since the RDD API is outdated (though not deprecated), you should be wary of using this API.
k. Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
l. Regular PySpark Dataframe Operations and PySpark SQL operations can be used. To use PySpark SQL operations, you must use the SQL Context on the Spark Dataframe. Example:
• df.createOrReplaceTempView("some_table")
• df.sql_ctx.sql("SELECT * FROM some_table")

Hints:
a. Refer to DataFrame commands such as filter, join, groupBy, agg, limit, sort, withColumnRenamed and withColumn. Documentation for the DataFrame APIs is located here.
b. Testing on a single, small dataset (i.e., a “test case”) is helpful, but is not sufficient for discovering all potential issues, especially if such issues only become apparent when the code is run on larger datasets. It is important for you to develop more ways to review and verify your code logic.
c. Overwriting the DataFrames from the function parameters can cause unintended side effects when it comes to rounding. Be sure to preserve the DataFrames in each function.
d. Precision in data analytics is very important. Keep in mind that precision reduction in an earlier step can accumulate and be magnified, subsequently significantly affecting the final output’s precision (e.g., for a dataset with 1,000,000 data points, a 0.0001 difference for each data point can lead to a total difference of 100 over the whole dataset). This is called precision loss. Check out this post for hints on how to avoid precision loss.
e. Check if you’re reducing the precision (or “scale”) too aggressively. Can you relax the restriction during intermediate steps?
f. Make sure you return a DataFrame. If you get NoneType errors, you are most likely not returning what you think you are.
g. Some columns may need to be cast to the right data type. Keep that in mind!

Tasks and point breakdown
Your objective is to locate profitable pick-up locations in Manhattan by analyzing taxi trip data (only trips 2 miles or longer). Follow the steps below to identify top pick-up locations based on a “weighted profit” calculation:
1. [0 pts] Setting up the AWS environment.
a. Go through all the steps in the AWS Setup Guide (you should have already completed Step 1 to create your account) to set up your AWS environment, e.g., creating an S3 storage bucket and uploading the skeleton file.
2. [1 pt] user()
a. Returns your GT Username as a string (e.g., gburdell3)
3. [2 pts] long_trips(trips)
a. This function filters trips to keep only trips 2 miles or longer (e.g., >= 2).
b. Returns a PySpark DataFrame with the same schema as trips.
c. Note: Parts 4, 5 and 6 will use the result of this function.
4. [6 pts] manhattan_trips(trips, lookup)
a. This function determines the top 20 locations with a DOLocationID in Manhattan by sum of passenger count.
b. Returns a PySpark DataFrame (mtrips) with the schema (DOLocationID, pcount).
c. Note: If you encounter the error ‘Can only compare identically labeled DataFrame objects,’ it is likely due to the use of the RDD API.
We recommend avoiding the use of the RDD API since it is not compatible with the autograder. Instead, we suggest rewriting the logic using a join clause.
5. [6 pts] weighted_profit(trips, mtrips)
a. This function determines
i. the average total_amount,
ii. the total count of trips, and
iii. the total count of trips ending in the top 20 destinations.
b. Using the above values,
i. determine the proportion of trips that end in one of the popular drop-off locations (# trips that end in a drop-off location divided by total # of trips) and
ii. multiply that proportion by the average total_amount to get a weighted_profit value based on the probability of passengers going to one of the popular destinations.
iii. Return the weighted_profit.
c. Returns a PySpark DataFrame with the schema (PULocationID, weighted_profit) for the weighted_profit.
6. [5 pts] final_output(wp, lookup)
a. This function
i. takes the results of weighted_profit,
ii. links it to the borough and zone through the lookup data frame,
iii. and returns the top 20 locations with the highest weighted_profit.
b. Returns a PySpark DataFrame with the schema (Zone, Borough, weighted_profit).
c. Note: If you encounter issues with ‘3.5 Test Final Output,’ primarily due to the DataFrame returned from ‘final_output()’ containing incorrect data, it is essential to reformat column data types, particularly when applying ‘agg()’ operations in previous sections.

Once you have implemented all these functions, run the main() function, which is already implemented, and update the line of code to include the name of your output s3 bucket and a location. This function will fail if the output directory already exists, so make sure to change it each time you run the function.

Example: final.write.csv('s3://cse6242-gburdell3/output-large3')

Your output file will appear in a folder in your s3 bucket as a csv file with a name similar to part-0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output_large.csv for submission. Do NOT make any other changes to the file.

Q4 [10 points] Analyzing a Large Dataset using Spark on GCP

The goal of this question is to familiarize you with creating storage buckets/clusters and running Spark programs on Google Cloud Platform. This question asks you to create a new Google Storage Bucket and load the NYC Taxi & Limousine Commission Dataset. You are also provided with a Jupyter Notebook q4.ipynb file, which you will load and complete in a Google Dataproc Cluster. Inside the notebook, you are provided with the skeleton for the load_data() function, which you will complete to load a PySpark DataFrame from the Google Storage Bucket you created as part of this question. Using this PySpark DataFrame, you will complete the following tasks using Spark DataFrame functions. You will use the data file yellow_tripdata09-08-2021.csv. The preceding link allows you to download the dataset you are required to work with for this question from the course DropBox. Each line represents a single taxi trip consisting of the comma-separated columns bulleted below. All columns are of string data type. You must convert the highlighted columns below into decimal data type (do NOT use float datatype) inside their respective functions when completing this question. Do not convert any datatypes within the load_data function. While casting to a decimal datatype, use a precision of 38 and a scale of 10.
• vendorid
• tpep_pickup_datetime
• tpep_dropoff_datetime
• passenger_count
• trip_distance (decimal data type)
• ratecodeid
• store_and_fwd_flag
• pulocationid
• dolocationid
• payment_type
• fare_amount (decimal data type)
• extra
• mta_tax
• tip_amount (decimal data type)
• tolls_amount (decimal data type)
• improvement_surcharge
• total_amount

Technology: Spark, Google Cloud Platform (GCP)

Deliverables: [Gradescope] q4.ipynb: the PySpark notebook for this question.

IMPORTANT NOTES:
• Use Firefox, Safari or Chrome when configuring anything related to GCP.
• Strictly follow the guidelines below, or your answer may not be graded.
  o Regular PySpark Dataframe Operations can be used.
  o Do NOT use any functions from the RDD API or your code will break the autograder. In general, the RDD API is considered outdated, so you should use the DataFrame API for better performance and compatibility.
  o Make sure to download the notebook from your GCP cluster before deleting the GCP cluster (otherwise, you will lose your work).
  o Do not add new cells to the notebook, as this may break the auto-grader.
  o Remove all your additional debugging code that renders output, as it will crash Gradescope. For instance, any additional print, display and show statements used for debugging must be removed.
  o Do not use any .rdd function in your code. Not only will this break the autograder, but you should be wary of using this function in general.
  o Ensure that you are only submitting a COMPLETE solution to Gradescope. Anything less will break the autograder. Write local unit tests to help test your code.

Tasks and point breakdown
1. [0 pts] Set up your GCP environment.
a. Instructions to set up GCP Credits, GCP Storage and a Dataproc Cluster are provided here: written instructions.
b. Helpful tips/FAQs for special scenarios:
i. If GCP service is disabled for your google account, try the steps in this google support link.
ii. If you have any issues with the GCP free credits, please post in the dedicated GCP Setup Ed Discussion thread.
2. [0 pts — required] Function load_data() to load data from a Google Storage Bucket into a Spark DataFrame.
a. You must first perform this task (part 2) BEFORE performing parts 3, 4, 5, 6 and 7. No points are allocated to task 2, but it is essential that you correctly implement the load_data() function, as the remaining graded tasks depend upon this task and its correct implementation. Upload code to Gradescope ONLY after completing all tasks and removing/commenting all the testing code. Anything else will break the autograder.
3. [2 pts] Function exclude_no_pickup_locations() to exclude trips with no pick-up locations (pick-up location id column is null or is zero) in the original data from part 2.
4. [2 pts] Function exclude_no_trip_distance() to exclude trips with no distance (i.e., trip distance column is null or zero) in the dataframe output by exclude_no_pickup_locations().
5. [2 pts] Function include_fare_range() to include trips with fare from $20 (inclusive) to $60 (inclusive) in the dataframe output by exclude_no_trip_distance().
6. [2 pts] Function get_highest_tip() to identify the highest tip (rounded to 2 decimal places) in the dataframe output by include_fare_range().
7. [2 pts] Function get_total_toll() to calculate the total toll amount (rounded to 2 decimal places) in the dataframe output by include_fare_range(). (A hedged sketch of this style of DataFrame filtering and casting is shown after this list.)
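For reference, a minimal PySpark sketch of the kind of filtering and decimal casting these functions involve is shown below. It is illustrative only; the official skeleton defines the actual function signatures, and the column names are assumed to match the list above:

# Hedged sketch of the Q4-style filters and casts (illustration only, not the official skeleton).
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, round as spark_round
from pyspark.sql.functions import max as spark_max, sum as spark_sum

spark = SparkSession.builder.appName("q4-sketch").getOrCreate()

def exclude_no_pickup_locations(df: DataFrame) -> DataFrame:
    # Drop trips whose pulocationid is null or zero.
    return df.filter(col("pulocationid").isNotNull() & (col("pulocationid").cast("int") != 0))

def include_fare_range(df: DataFrame) -> DataFrame:
    # Keep trips with fare_amount between $20 and $60 inclusive, cast to decimal(38,10).
    fare = col("fare_amount").cast("decimal(38,10)")
    return df.filter((fare >= 20) & (fare <= 60))

def get_highest_tip(df: DataFrame):
    # Highest tip_amount, rounded to 2 decimal places.
    tip = col("tip_amount").cast("decimal(38,10)")
    return df.select(spark_round(spark_max(tip), 2)).first()[0]

def get_total_toll(df: DataFrame):
    # Total tolls_amount, rounded to 2 decimal places.
    tolls = col("tolls_amount").cast("decimal(38,10)")
    return df.select(spark_round(spark_sum(tolls), 2)).first()[0]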
Q5 [10 points] Regression: Automobile price prediction using Azure Machine Learning

The primary purpose of this question is to introduce you to Microsoft Azure Machine Learning Studio by familiarizing you with its basic functionalities and machine learning workflows. Go through the Automobile Price Prediction tutorial and create/run ML experiments to complete the following tasks. You will not incur any cost if you save your experiments on Azure until submission. Once you are sure about the results and have reported them, feel free to delete your experiments. You will manually modify the given file q5.csv by adding the results from the following tasks using a plain text editor.

Technology: Azure Machine Learning

Deliverables: [Gradescope] q5.csv: a csv file containing results for all parts

IMPORTANT NOTES:
• Strictly follow the guidelines below, or your answer may not be graded.
  o DO NOT change the order of the questions.
  o Report the exact numerical values that you get in your output. DO NOT round any of them.
  o When manually entering a value into the csv file, append it immediately after a comma, so there will be NO space between the comma and your value, and no trailing spaces or commas after your value.
  o Follow the tutorial and do not change values for L2 regularization. For tasks 3 and 4, select the columns given in the tutorial.

Tasks and point breakdown
1. [0 pts] Create and use a free workspace instance on Azure Machine Learning. Use your Georgia Tech username (e.g., jdoe3) to log in.
2. [0 pts] Update q5.csv by replacing gburdell3 with your GT username.
3. [3 pts] Repeat the experiment described in the tutorial and report values of all metrics as mentioned in the Evaluate Model section of the tutorial. Make sure the Split Data looks as it does below:
4. [3 pts] Repeat the experiment mentioned in task 3 with a different value of Fraction of rows in the first output dataset in the Split Data module. Change the value to 0.8 from the originally set value of 0.7. Report the corresponding values of the metrics.
5. [4 pts] After fully completing tasks 3 and 4, run a new experiment — evaluate the model using 5-fold cross-validation (CV).
a. Select parameters in the Partition and Sample component in accordance with the figure below.
b. For Cross Validate Model, set the column name as “price” for CV and use 0 as the random seed.
c. Report the values of Root Mean Squared Error (RMSE) and Coefficient of Determination for each of the five folds (the 1st fold corresponds to fold number 0 and so on). Do NOT round the results. Report exact values. (A short reference sketch of these two metrics follows this question.)
d. HINT: to see results, right click Cross Validate Model and select Preview data > Evaluation results by fold. Make sure to utilize the same data cleaning/processing steps as you did before.

Figure: Property Tab of Partition and Sample Module
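For reference when reporting the two metrics in task 5, the formulas are standard; here is a small, generic Python illustration (this is not Azure ML's internal code, just the textbook definitions):

# Textbook definitions of the two metrics reported in Q5 (generic illustration only).
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def coefficient_of_determination(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

print(rmse([3.0, 5.0], [2.5, 5.5]))                          # 0.5
print(coefficient_of_determination([3.0, 5.0], [2.5, 5.5]))  # 0.75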

CSE 6242 / CX 4242: Data and Visual Analytics HW 4

CSE 6242 / CX 4242: Data and Visual Analytics
HW 4: PageRank Algorithm, Random Forest, Scikit-learn
Download the HW4 Skeleton before you begin

Homework Overview

Data analytics and machine learning both revolve around using computational models to capture
relationships between variables and outcomes. In this assignment, you will code and fit a range
of well-known models from scratch and learn to use a popular Python library for machine learning.

In Q1, you will implement the famous PageRank algorithm from scratch. PageRank can be
thought of as a model for a system in which a person is surfing the web by choosing uniformly at
random a link to click on at each successive webpage they visit. Assuming this is how we surf the
web, what is the probability that we are on a particular webpage at any given moment? The
PageRank algorithm assigns values to each webpage according to this probability distribution.
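As a concrete illustration of this random-surfer view, here is a tiny power-iteration sketch on a made-up three-node graph, using the standard damping factor as an assumption. It is only illustrative; the assignment's required interface, data format, and parameters are defined in the HW4 skeleton:

# Toy power-iteration sketch of the random-surfer model (illustrative only; the HW4
# skeleton defines the required interface, data format, and parameters).
damping = 0.85                               # standard damping factor, assumed here
edges = {0: [1, 2], 1: [2], 2: [0]}          # hypothetical graph: node -> outgoing links
n = len(edges)
pr = {v: 1.0 / n for v in edges}             # start from the uniform distribution

for _ in range(50):                          # iterate until (approximately) converged
    new_pr = {v: (1.0 - damping) / n for v in edges}
    for v, out_links in edges.items():
        share = damping * pr[v] / len(out_links)
        for u in out_links:
            new_pr[u] += share               # the surfer follows a uniformly random link
    pr = new_pr

print({v: round(p, 4) for v, p in pr.items()})   # approximate stationary probabilities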
In Q2, you will implement Random Forests, a very common and widely successful classification
model, from scratch. Random Forest classifiers also describe probability distributions—the
conditional probability of a sample belonging to a particular class given some or all its features.
Finally, in Q3, you will use the Python scikit-learn library to specify and fit a variety of supervised
and unsupervised machine learning models.
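For a sense of the fit/predict workflow Q3 exercises, here is a minimal, hedged scikit-learn example; the dataset and hyperparameters below are placeholders, not the ones the assignment requires:

# Minimal scikit-learn workflow sketch; dataset and hyperparameters are placeholders.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))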

The maximum possible score for this homework is 100 points.
Download the HW4 Skeleton before you begin …………………………………………………………….. 1
Homework Overview……………………………………………………………………………………………………. 1
Important Notes ………………………………………………………………………………………………………….. 2
Submission Notes……………………………………………………………………………………………………….. 2
Q1 [20 pts] Implementation of PageRank Algorithm ……………………………………………………… 3
Tasks …………………………………………………………………………………………………………………………………. 4
Q2 [50 pts] Random Forest Classifier…………………………………………………………………………… 5
Q2.1 – Random Forest Setup [45 pts] …………………………………………………………………………………….. 5
Q2.2 – Random Forest Reflection [5 pts]…………………………………………………………………………………. 7
Q3 [30 points] Using Scikit-Learn ………………………………………………………………………………… 8
Q3.1 – Data Import [2 pts]……………………………………………………………………………………………………… 8
Q3.2 – Linear Regression Classifier [4 pts] ……………………………………………………………………………… 8
Q3.3 – Random Forest Classifier [10 pts]………………………………………………………………………………… 8
Q3.4 – Support Vector Machine [10 pts] ………………………………………………………………………………….. 9
Q3.5 – Principal Component Analysis [4 pts]…………………………………………………………………………..10

Important Notes
1. Submit your work by the due date on the course schedule.
a. Every assignment has a generous 48-hour grace period, allowing students to address
unexpected minor issues without facing penalties. You may use it without asking.
b. Before the grace period expires, you may resubmit as many times as needed.
c. TA assistance is not guaranteed during the grace period.
d. Submissions during the grace period will display as “late” but will not incur a penalty.
e. We will not accept any submissions executed after the grace period ends.
2. Always use the most up-to-date assignment (version number at bottom right of this
document). The latest version will be listed in Ed Discussion.
3. You may discuss ideas with other students at the “whiteboard” level (e.g., how cross-validation
works, use HashMap instead of an array) and review any relevant materials online. However,
each student must write up and submit the student’s own answers.
4. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor
Code will be subject to the institute’s Academic Integrity procedures, directly handled by the
Office of Student Integrity (OSI). Consequences can be severe, e.g., academic probation or
dismissal, a 0 grade for assignments concerned, and prohibition from withdrawing from the
class.

Submission Notes
1. All questions are graded on the Gradescope platform, accessible through Canvas.
2. We will not accept submissions anywhere else outside of Gradescope.
3. Submit all required files as specified in each question. Make sure they are named correctly.
4. You may upload your code periodically to Gradescope to obtain feedback on your code. There
are no hidden test cases. The score you see on Gradescope is what you will receive.
5. You must not use Gradescope as the primary way to test your code. It provides only a few
test cases and error messages may not be as informative as local debuggers. Iteratively
develop and test your code locally, write more test cases, and follow good coding practices.
Use Gradescope mainly as a “final” check.
6. Gradescope cannot run code that contains syntax errors. If you get the “The autograder
failed to execute correctly” error, verify:
a. The code is free of syntax errors (by running locally)
b. All methods have been implemented
c. The correct file was submitted with the correct name
d. No extra packages or files were imported
7. When many students use Gradescope simultaneously, it may slow down or fail. It can become
even slower as the deadline approaches. You are responsible for submitting your work on
time.
8. Each submission and its score will be recorded and saved by Gradescope. By default, your
last submission is used for grading. To use a different submission, you MUST “activate”
it (click the “Submission History” button at the bottom toolbar, then “Activate”).

Q1 [20 pts] Implementation of PageRank Algorithm
Technology PageRank Algorithm
Graph
Python >=3.7.x. You must use Python >=3.7.x for this question.
Allowed Libraries Do not modify the import statements; everything you need to complete this question
has been imported for you. You MUST not use other libraries for this assignment.
Max runtime 5 minutes
Deliverables [Gradescope]
• Q1.ipynb [12 pts]: your modified implementation
• simplified_pagerank_iter{n}.txt: 2 files (as given below) containing the top
10 node IDs (w.r.t. the PageRank values) and their PageRank values for n
iterations via the provided run() helper function
o simplified_pagerank_iter10.txt [2 pts]
o simplified_pagerank_iter25.txt [2 pts]
• personalized_pagerank_iter{n}.txt: 2 files (as given below) containing the
top 10 node IDs (w.r.t the PageRank values) and their PageRank values for
n iterations via the provided run() helper function
o personalized_pagerank_iter10.txt [2 pts]
o personalized_pagerank_iter25.txt [2 pts]
Important: Remove all “testing” code that renders output, or Gradescope will crash. For instance,
any additional print, display, and show statements used for debugging must be removed.
In this question, you will implement the PageRank algorithm in Python for a large graph network
dataset.
The PageRank algorithm was first proposed to rank web pages in search results. The basic
assumption is that more “important” web pages are referenced more often by other pages and
thus are ranked higher. To estimate the importance of a page, the algorithm works by considering
the number and “importance” of links pointing to the page. PageRank outputs a probability
distribution over all web pages, representing the likelihood that a person randomly surfing the web
(randomly clicking on links) would arrive at those pages.
As mentioned in the lectures, the PageRank values are the entries in the dominant eigenvector
of the modified adjacency matrix in which each column’s values adds up to 1 (i.e., “column
normalized”), and this eigenvector can be calculated by the power iteration method that you will
implement in this question. This method iterates through the graph’s edges multiple times to
update the nodes’ PageRank values (“pr_values” in Q1.ipynb) in each iteration. We recommend
that you review the lecture video for PageRank and personalized PageRank before working on
your implementation. At 9 minutes and 41 seconds of the video, the full PageRank algorithm is
expressed in a matrix-vector form. Equivalently, the PageRank value of node $v_j$ at iteration $t+1$ can also be expressed as (notation differs from the video's):

$$PR_{t+1}(v_j) = (1 - d) \times P(v_j) + d \times \sum_{v_i} \frac{PR_t(v_i)}{\text{out degree}(v_i)}$$

where
• $v_j$ is node $j$
• $v_i$ is any node $i$ that has a directed edge pointing to node $j$
• $\text{out degree}(v_i)$ is the number of links going out of node $v_i$
• $PR_{t+1}(v_j)$ is the PageRank value of node $j$ at iteration $t+1$
• $PR_t(v_i)$ is the PageRank value of node $i$ at iteration $t$
• $d$ is the damping factor, i.e., the probability that the surfer continues to follow links; set it to the common value of 0.85
• $P(v_j)$ is the probability of a random jump, which can be personalized based on use cases
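To make the update rule concrete, here is a minimal, illustrative Python sketch of a single power-iteration pass over an edge list. It is not the required Q1.ipynb structure; the names (edges, pr_values, node_degree, node_weights) only mirror the skeleton's terminology, and the skeleton may store these quantities differently (e.g., as lists indexed by node id).

    def pagerank_iteration(edges, pr_values, node_degree, node_weights, d=0.85):
        # edges: list of (source, target) pairs
        # pr_values: dict node -> PR value at iteration t
        # node_degree: dict node -> out-degree; node_weights: dict node -> P(v_j)
        new_pr = {node: (1 - d) * node_weights[node] for node in pr_values}
        for source, target in edges:
            new_pr[target] += d * pr_values[source] / node_degree[source]
        return new_pr  # PR values at iteration t + 1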
Tasks
You will be using the “network.tsv” graph network dataset in the hw4-skeleton/Q1 folder, which
contains about 1 million nodes and 3 million edges. Each row in that file represents a directed
edge in the graph. The edge’s source node id is stored in the first column of the file, and the target
node id is stored in the second column.

Your code must NOT make any assumptions about the relative magnitude between the node ids
of an edge. For example, suppose we find that the source node id is smaller than the target node
id for most edges in a graph, we must NOT assume that this is always the case for all graphs (i.e.,
in other graphs, a source node id can be larger than a target node id).
You will complete the code in Q1.ipynb (guidelines also provided in the file).
1. Calculate and store each node’s out-degree and the graph’s maximum node id in calculate_node_degree() (a minimal sketch of this step follows the task list below).
a. A node’s out-degree is its number of outgoing edges. Store the out-degree in the instance variable “node_degree”.
b. max_node_id refers to the highest node id in the graph. For example, if a graph contains the two edges (1,4) and (2,3), in the format of (source,target), the max_node_id here is 4. Store the maximum node id in the instance variable max_node_id.

2. Implement run_pagerank()
a. For the simplified PageRank algorithm, P( vj ) = 1/(max_node_id + 1) is provided as node_weights in the script, and you will submit the output for 10- and 25-iteration runs with a damping factor of 0.85. To verify your implementation, we provide the sample output of 5 iterations for simplified PageRank (simplified_pagerank_iter5_sample.txt).
b. For personalized PageRank, the P( ) vector will be assigned values based on your 9-digit GTID (e.g., 987654321), and you will submit the output for 10- and 25-iteration runs with a damping factor of 0.85.
3. Compare output
a. Generate output text files by running the last cell of Q1.ipynb.
b. Note: When comparing your output for simplified_pagerank for 5 iterations with the given sample output, the relative absolute difference must be less than 5%. For example, absolute((SampleOutput – YourOutput) / SampleOutput) must be less than 0.05.
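For task 1, the following is a minimal, illustrative sketch of computing the out-degrees and the maximum node id from the edge list; the actual calculate_node_degree() in Q1.ipynb stores these in the instance variables node_degree and max_node_id, and its exact structure and imports are fixed by the skeleton.

    node_degree = {}
    max_node_id = 0
    with open("network.tsv") as f:
        for line in f:
            source, target = (int(x) for x in line.split())       # one directed edge per row
            node_degree[source] = node_degree.get(source, 0) + 1  # out-degree: count outgoing edges
            max_node_id = max(max_node_id, source, target)        # highest node id seen so far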

 

Q2 [50 pts] Random Forest Classifier
Technology Python >=3.7.x
Allowed Libraries Do not modify the import statements; everything you need to complete
this question has been imported for you. You MUST not use other
libraries for this assignment.
Max runtime 300 seconds
Deliverables [Gradescope]
• Q2.ipynb [45 pts]: your solution as a Jupyter notebook, developed by
completing the provided skeleton code
o 10 points are awarded for 2 utility functions, 5 points for
entropy() and 5 points for information_gain()
o 35 points are awarded for successfully implementing your random
forest
• Random Forest Reflection [5 pts]: multiple-choice question
completed on Gradescope.
Q2.1 – Random Forest Setup [45 pts]
Note: You must use Python >=3.7.x for this question.
You will implement a random forest classifier in Python via a Jupyter notebook. The performance
of the classifier will be evaluated via the out-of-bag (OOB) error estimate using the provided
dataset Wisconsin_breast_prognostic.csv, a comma-separated (csv) file in the Q2 folder.
Features (Attributes) were computed from a digitized image of a fine needle aspirate (FNA) of a
breast mass. They describe characteristics of the cell nuclei present in the image. You must not
modify the dataset. Each row describes one patient (a data point, or data record) and each row
includes 31 columns. The first 30 columns are attributes. The 31st (the last column) is the label,
and you must NOT treat it as an attribute. The values one and zero in the last column indicate whether the cancer is malignant or benign, respectively. You will perform binary classification on the dataset to determine if a particular cancer is benign or malignant.
Important:
1. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any
additional print, display, and show statements used for debugging must be removed.
2. You may only use the modules and libraries provided at the top of the notebook file
included in the skeleton for Q2 and modules from the Python Standard Library. Python
wrappers (or modules) must NOT be used for this assignment. Pandas must NOT be used
— while we understand that they are useful libraries to learn, completing this question is
not critically dependent on their functionality. In addition, to make grading more
manageable and to enable our TAs to provide better, more consistent support to our
students, we have decided to restrict the libraries accordingly.
Essential Reading
Decision Trees. To complete this question, you will develop a good understanding of how
decision trees work. We recommend that you review the lecture on the decision tree. Specifically,
review how to construct decision trees using Entropy and Information Gain to select the splitting
attribute and split point for the selected attribute. These slides from CMU (also mentioned in the
lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain. Note: there is a typo in the Entropy equation on page 10 of those slides; ignore one of the negative signs (only one negative sign is needed).
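For reference, here is a minimal sketch of how entropy and information gain are typically computed, using only the Python Standard Library per the library restrictions above; the actual Utility method signatures in the skeleton may differ.

    from math import log2

    def entropy(class_labels):
        # H = -sum(p * log2(p)) over the classes present in class_labels
        total = len(class_labels)
        result = 0.0
        for label in set(class_labels):
            p = class_labels.count(label) / total
            result -= p * log2(p)
        return result

    def information_gain(parent_labels, split_label_lists):
        # entropy of the parent minus the size-weighted entropy of the child splits
        total = len(parent_labels)
        weighted = sum(len(part) / total * entropy(part) for part in split_label_lists)
        return entropy(parent_labels) - weighted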
Random Forests. To refresh your memory about random forests, see Chapter 15 in the Elements
of Statistical Learning book and the lecture on random forests. Here is a blog post that introduces
random forests in a fun way, in layman’s terms.
Out-of-Bag Error Estimate. In random forests, it is not necessary to perform explicit cross-validation or use a separate test set for performance evaluation. The out-of-bag (OOB) error estimate has been shown to be reasonably accurate and unbiased. Below, we summarize the key points about OOB from the original article by Breiman and Cutler.
Each tree in the forest is constructed using a different bootstrap sample from the original data.
Each bootstrap sample is constructed by randomly sampling from the original dataset with
replacement (usually, a bootstrap sample has the same size as the original dataset). Statistically,
about one-third of the data records (or data points) are left out of the bootstrap sample and not
used in the construction of the kth tree. Each data record that is not used in the construction of the kth tree can be classified by that tree. As a result, each record will have a “test set” classification from the subset of trees that treat the record as an out-of-bag sample. The majority vote for that record will be its predicted class. The proportion of times that a record’s predicted class differs from the true class, averaged over all such records, is the OOB error estimate.
While splitting a tree node, make sure to randomly select a subset of attributes (e.g., square root
of the number of attributes) and pick the best splitting attribute (and splitting point of that attribute)
among these subsets of attributes. This randomization is the main difference between random
forest and bagging decision trees.
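The two sources of randomness described above can be illustrated with the following sketch; the function names and return values are illustrative and are not the skeleton's _bootstrapping() interface.

    import random
    from math import sqrt

    def bootstrap_sample(X, y):
        # draw row indices with replacement; rows never chosen are this tree's out-of-bag records
        n = len(X)
        chosen = [random.randrange(n) for _ in range(n)]
        chosen_set = set(chosen)
        oob_indices = [i for i in range(n) if i not in chosen_set]
        return [X[i] for i in chosen], [y[i] for i in chosen], oob_indices

    def random_attribute_subset(num_attributes):
        # e.g., the square root of the number of attributes, as suggested above
        k = max(1, int(sqrt(num_attributes)))
        return random.sample(range(num_attributes), k)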
Starter Code
We have prepared some Python starter code to help you load the data and evaluate your model.
The starter file, Q2.ipynb, has three classes:
● Utility: contains utility functions that help you build a decision tree
● DecisionTree: a decision tree class that you will use to build your random forest
● RandomForest: a random forest class
What you will implement
Below, we have summarized what you will implement to solve this question. Note that you must
use information gain to perform the splitting in the decision tree. The starter code has detailed
comments on how to implement each function.
1. Utility class: implement the functions to compute entropy, information gain, perform splitting, and find the best variable (attribute) and split-point. You can add additional methods for convenience. Note: Do not round the output of any of your functions.
2. DecisionTree class: implement the learn() method to build your decision tree using
the utility functions above.
3. DecisionTree class: implement the classify() method to predict the label of a test
record using your decision tree.
4. RandomForest class: implement the methods _bootstrapping(), fitting(),
voting() and user().
5. get_random_seed(), get_forest_size(): implement the functions to return a
random seed and forest size (number of decision trees) for your implementation.
Important:
1. You must achieve a minimum accuracy of 90% for the random forest. If the accuracy turns out to be low, try adjusting the hyper-parameters. If it is extremely low, revisit your best_split() and classify() methods.
2. Your code must take no more than 5 minutes to execute (which is a very long time, given
the low program complexity). Otherwise, it may time out on Gradescope. Code that takes
longer than 5 minutes to run likely means you need to correct inefficiencies (or incorrect logic)
in your program. We suggest that you check the hyperparameter choices (e.g., tree depth,
number of trees) and code logic when figuring out how to reduce runtime.
3. The run() function is provided to test your random forest implementation; do NOT modify
this function.
4. Note: In your implementation, use basic Python lists rather than the more complex Numpy data
structures to reduce the chances of version-specific library conflicts with the grading scripts.
As you solve this question, consider the following design choices. Some may be more straightforward to determine than others (hint: study the lecture materials and the essential reading above). For example:
● Which attributes to use when building a tree?
● How to determine the split point for an attribute?
● How many trees should the forest contain?
● You may implement your decision tree using the data structure of your choice (e.g., dictionary,
list, class member variables). However, your implementation must still work within the
DecisionTree Class Structure we have provided.
● Your decision tree will be initialized using DecisionTree(max_depth=10), in the
RandomForest class in the jupyter notebook.
● When do you stop splitting leaf nodes?
● The depth passed to the learn function is the depth of the current node/tree. Add a check within learn() that looks at the current depth and returns if the depth is greater than or equal to the specified max depth; otherwise, you may keep splitting nodes indefinitely and create a messy tree (see the sketch after this list). The max_depth parameter should be used as the stopping condition for when your tree should stop growing. Your decision tree will be instantiated with a depth of 0 (the input to the learn() function in the Jupyter notebook). To comply with this, implement the decision tree such that the root node starts at depth 0 and is built with increasing depth.
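Here is a minimal sketch of that stopping check, assuming a learn(X, y, depth) signature; the skeleton's actual signatures and tree representation may differ.

    class DecisionTree:
        def __init__(self, max_depth=10):
            self.max_depth = max_depth
            self.tree = {}

        def learn(self, X, y, depth=0):
            # stop growing when the max depth is reached or the node is pure
            if depth >= self.max_depth or len(set(y)) <= 1:
                self.tree['label'] = max(set(y), key=y.count)  # store the majority class as a leaf
                return
            # ... otherwise choose the best attribute/split-point and recurse with depth + 1 ...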
Note that, as mentioned in the lecture, there are other approaches to implement random forests.
For example, instead of information gain, other popular choices include the Gini index, random
attribute selection (e.g., PERT – Perfect Random Tree Ensembles). We decided to ask everyone
to use an information gain based approach in this question (instead of leaving it open-ended),
because information gain is a useful machine learning concept to learn in general.
Q2.2 – Random Forest Reflection [5 pts]
On Gradescope, answer the following multiple-choice question; you may answer it only once. Select all that apply. The answer must be completely correct to earn the points; no partial credit will be awarded if not all of the correct options are selected:
What is the main reason to use a random forest versus a decision tree?
Q3 [30 points] Using Scikit-Learn
Technology Python >=3.7.x
Scikit-Learn >=0.22
Allowed Libraries Do not modify the import statements; everything you need to complete
this question has been imported for you. You MUST not use other
libraries for this assignment.
Max runtime 15 minutes
Deliverables [Gradescope] Q3.ipynb [30 pts]: your solution as a Jupyter notebook,
developed by completing the provided skeleton code
Scikit-learn is a popular Python library for machine learning. You will use it to train some classifiers
to predict diabetes in the Pima Indian tribe. The dataset is provided in the Q3 folder as pima-indians-diabetes.csv.
For this problem, you will be utilizing a Jupyter notebook.
Important:
1. Remove all “testing” code that renders output, or Gradescope will crash. For instance, any
additional print, display, and show statements used for debugging must be removed.
2. Use the default values while calling functions unless specific values are given.
3. Do not round off the results except the results obtained for Linear Regression Classifier.
4. Do not change the ‘#export’ statements or add any other code/comments above them.
They are needed for grading.
Q3.1 – Data Import [2 pts]
In this step, you will import the pima-indians-diabetes dataset and allocate the data to two separate arrays. After importing the dataset, you will split the data into a training set and a test set using the scikit-learn function train_test_split. You will then use scikit-learn's built-in machine learning algorithms and report the accuracy on the training and test sets separately. Refer to the hyperlinks provided below for each algorithm for more details, such as the concepts behind these classifiers and how to implement them.
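As an illustration, a minimal sketch of this step is shown below; it assumes the label is the last column and uses an arbitrary test_size and random_state, whereas the notebook specifies the exact arguments to use.

    import numpy as np
    from sklearn.model_selection import train_test_split

    data = np.genfromtxt("pima-indians-diabetes.csv", delimiter=",")
    x_data, y_data = data[:, :-1], data[:, -1]               # feature columns, label column
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data, test_size=0.25, random_state=614)    # split values are illustrative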
Q3.2 – Linear Regression Classifier [4 pts]
Q3.2.1 – Classification
Train the Linear Regression classifier on the dataset. You will provide the accuracy for both the test and train sets. Make sure that you round your predictions to a binary value of 0 or 1. Do not use the np.round function, as it can produce results that surprise you and do not meet your needs (see the official numpy documentation for details). Instead, we recommend you write a custom round function using if-else; see the Jupyter notebook for more information. Linear regression is most commonly used to solve regression problems. The exercise here demonstrates the possibility of using linear regression for classification (even though it may not be the optimal model choice).
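A minimal sketch of this approach, building on the split from Q3.1 and using a simple if-else rounding rule (the 0.5 threshold is an assumption; follow the notebook's instructions):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import accuracy_score

    model = LinearRegression().fit(x_train, y_train)

    def custom_round(predictions, threshold=0.5):
        # map continuous regression outputs to binary labels with if-else
        return [1 if p >= threshold else 0 for p in predictions]

    train_acc = accuracy_score(y_train, custom_round(model.predict(x_train)))
    test_acc = accuracy_score(y_test, custom_round(model.predict(x_test)))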
Q3.3 – Random Forest Classifier [10 pts]
Q3.3.1 – Classification
Train the Random Forest classifier on the dataset. You will provide the accuracy for both
the test and train sets. Do not round your prediction.
Q3.3.2 – Feature Importance
You have performed a simple classification task using the random forest algorithm. You
have also implemented the algorithm in Q2 above. The concept of entropy gain can also
be used to evaluate the importance of a feature. You will determine the feature importance
evaluated by the random forest classifier in this section. Sort the features in descending
order of feature importance score, and print the sorted features’ numbers.
Hint: There is a function available in sklearn to achieve this. Also, take a look at the argsort() function in NumPy; argsort() returns the indices of the elements in ascending order. You will use the random forest classifier that you trained in Q3.3.1, without any kind of hyperparameter tuning, for reporting these features.
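A minimal sketch, assuming a classifier named rf_clf trained with default hyper-parameters as in Q3.3.1, and reversing argsort() to obtain descending order:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rf_clf = RandomForestClassifier().fit(x_train, y_train)   # default hyper-parameters, as in Q3.3.1
    importances = rf_clf.feature_importances_                 # one importance score per feature
    sorted_feature_numbers = np.argsort(importances)[::-1]    # feature indices, most important first
    print(sorted_feature_numbers)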
Q3.3.3 – Hyper-Parameter Tuning
Tune your random forest hyper-parameters to obtain the highest accuracy possible on the
dataset. Finally, train the model on the dataset using the tuned hyper-parameters. Tune
the hyperparameters specified below, using the GridSearchCV function in Scikit library:
‘n_estimators’: [4, 16, 256], ’max_depth’: [2, 8, 16]
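A minimal sketch of that grid search over the training split (the notebook may specify additional GridSearchCV arguments, such as cv):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {'n_estimators': [4, 16, 256], 'max_depth': [2, 8, 16]}
    grid = GridSearchCV(RandomForestClassifier(), param_grid)   # uses 5-fold CV by default
    grid.fit(x_train, y_train)
    best_params = grid.best_params_                             # tuned hyper-parameters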
Q3.4 – Support Vector Machine [10 pts]
Q3.4.1 – Preprocessing
For SVM, we will standardize attributes (features) in the dataset using StandardScaler,
before training the model.
Note: for StandardScaler,
● Transform both x_train and x_test to obtain the standardized versions of both.
● Review the StandardScaler documentation, which provides details about
standardization and how to implement it.
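A minimal sketch, fitting the scaler on the training features and transforming both splits (a common convention; follow the notebook if it directs otherwise):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(x_train)        # learn mean/std from the training set only
    x_train_scaled = scaler.transform(x_train)    # standardized training features
    x_test_scaled = scaler.transform(x_test)      # standardized test features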
Q3.4.2 – Classification
Train the Support Vector Machine classifier on the dataset (the link points to SVC, a
particular implementation of SVM by Scikit). You will provide the accuracy on both the test
and train sets.
Q3.4.3 – Hyper-Parameter Tuning
Tune your SVM model to obtain the highest accuracy possible on the dataset. For SVM,
tune the model on the standardized train dataset and evaluate the tuned model with the
test dataset. Tune the hyperparameters specified below in the same order, using the
GridSearchCV function in Scikit library:
‘kernel’:(‘linear’, ‘rbf’), ‘C’:[0.01, 0.1, 1.0]

Note: If GridSearchCV takes a long time to run for SVM, make sure you standardize your
data beforehand using StandardScaler.
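A minimal sketch, tuning on the standardized training data (x_train_scaled) from the Q3.4.1 sketch and evaluating on the standardized test data (again, the notebook may specify additional GridSearchCV arguments):

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    svm_param_grid = {'kernel': ('linear', 'rbf'), 'C': [0.01, 0.1, 1.0]}
    svm_grid = GridSearchCV(SVC(), svm_param_grid)             # uses 5-fold CV by default
    svm_grid.fit(x_train_scaled, y_train)                      # tune on standardized training data
    test_accuracy = svm_grid.score(x_test_scaled, y_test)      # evaluate the tuned model on the test set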

Q3.4.4 – Cross-Validation Results
Let’s practice obtaining the results of cross-validation for the SVM model. Report the rank
test score and mean testing score for the best combination of hyper-parameter values that
you obtained. The GridSearchCV class holds a cv_results_ dictionary that helps you
report these metrics easily.
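A minimal sketch of pulling those two values out of cv_results_ for the best hyper-parameter combination, continuing from the svm_grid object sketched above:

    best_idx = svm_grid.best_index_                                        # index of the best parameter combination
    rank_test_score = svm_grid.cv_results_['rank_test_score'][best_idx]   # rank of that combination (1 is best)
    mean_test_score = svm_grid.cv_results_['mean_test_score'][best_idx]   # mean CV score for that combination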

Q3.5 – Principal Component Analysis [4 pts]
Performing Principal Component Analysis based dimensionality reduction is a common step in many data analysis tasks, and it involves projecting the data to a lower-dimensional space using Singular Value Decomposition. Refer to the examples given here; set the parameters n_components to 8 and svd_solver to 'full'. See the sample outputs below.
1. Percentage of variance explained by each of the selected components. Sample
Output:
[6.51153033e-01 5.21914311e-02 2.11562330e-02 5.15967655e-03
6.23717966e-03 4.43578490e-04 9.77570944e-05 7.87968645e-06]
2. The singular values corresponding to each of the selected components. Sample
Output:
[5673.123456 4532.123456 4321.68022725 1500.47665361
1250.123456 750.123456 100.123456 30.123456]
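A minimal sketch of fitting the PCA and reading off the two quantities above; whether to use the raw or standardized features is specified in the notebook, and x_data here refers to the full feature array loaded in Q3.1:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=8, svd_solver='full').fit(x_data)
    print(pca.explained_variance_ratio_)   # percentage of variance explained by each component
    print(pca.singular_values_)            # singular values of the selected components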
Use the Jupyter notebook skeleton file called Q3.ipynb to write and execute your code.
As a reminder, the general flow of your machine learning code will look like:
1. Load dataset
2. Preprocess (you will do this in Q3.2)
3. Split the data into x_train, y_train, x_test, y_test
4. Train the classifier on x_train and y_train
5. Predict on x_test
6. Evaluate testing accuracy by comparing the predictions from step 5 with y_test.
Here is an example. Scikit has many other examples as well that you can learn from.