Description
The main focus of our course is data analytics, but there are many other exciting Big Data topics that we cannot cover due to time constraints. Lecture 3 gave you a brief overview of Visualization, and Assignment 3 is designed to deepen your understanding. After completing this assignment, you should be able to answer the following questions:
- How to perform visual data analysis using Python?
- How to study the behaviour of a machine learning algorithm using visualization?
As a motivating example of how visualization can bring data to life and clear up misconceptions, consider watching Hans Rosling’s famous TED talks, e.g. “The best stats you’ve ever seen” from 2006.
Part 1. EDA
Real estate data
Imagine you are a data scientist working at a real-estate company. This week, your job is to analyze Vancouver’s housing prices. You first download a dataset from property_tax_report_2019.zip. The dataset contains information on properties from BC Assessment (BCA) and City sources in 2019. You can find the schema of the dataset on this webpage. But this is not enough: you still know little about the data. That’s why you need to do EDA, in order to gain a better and deeper understanding of it.
We first load the data as a DataFrame. To make this analysis more interesting, I added two new columns to the data: CURRENT_PRICE represents the property price in 2019; PREVIOUS_PRICE represents the property price in 2018.
import pandas as pd
# before running this, unzip the provided data
df = pd.read_csv("data/property_tax_report_2019.csv")
# vectorized column addition is clearer and faster than a row-wise apply
df['CURRENT_PRICE'] = df['CURRENT_LAND_VALUE'] + df['CURRENT_IMPROVEMENT_VALUE']
df['PREVIOUS_PRICE'] = df['PREVIOUS_LAND_VALUE'] + df['PREVIOUS_IMPROVEMENT_VALUE']
Now let’s start the EDA process.
Hint. For some of the following questions, we provide an example plot (see link). Note that you do not have to use the same plot design. In fact, my plots do not follow the Principles of Visualization Design covered in the second half of the Lecture 3 slides; please review that part on your own and think about how to correct the bad designs in my plots.
Question 1. Look at some example rows
Print the first five rows of the data:
# --- Write your code below ---
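One possible sketch (not the required solution), shown here on a tiny synthetic frame standing in for the real `df` — with the real data, simply call `head()` on the DataFrame loaded above:

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset (column names are only
# illustrative); with the real data, use the df loaded above.
df = pd.DataFrame({
    "YEAR_BUILT":    [1990.0, 2005.0, None, 2018.0, 1975.0],
    "CURRENT_PRICE": [1.2e6, 9.5e5, 2.1e6, 1.8e6, 7.0e5],
})

# head(5) returns the first five rows without modifying the frame.
print(df.head(5))
```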
Question 2. Get summary statistics
From the above output, you will know that the data has 28 columns. Please use the describe() function to get the summary statistics of each column.
# --- Write your code below ---
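A minimal sketch of the idea, again on a synthetic stand-in frame (with the real data, call `describe()` on the loaded `df`):

```python
import pandas as pd

# Synthetic stand-in for the real df.
df = pd.DataFrame({
    "YEAR_BUILT":    [1990.0, 2005.0, None, 2018.0, 1975.0],
    "CURRENT_PRICE": [1.2e6, 9.5e5, 2.1e6, 1.8e6, 7.0e5],
})

# describe() summarizes numeric columns: count, mean, std, min,
# quartiles, and max. Note that count excludes missing values.
summary = df.describe()
print(summary)
```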
Please look at the above output carefully, and make sure that you understand the meaning of each row (e.g., std, the 25th percentile).
Question 3. Examine missing values
Now we are going to perform EDA on a single column (i.e., univariate analysis). We chose YEAR_BUILT, which represents the year in which a property was built. We first check whether the column has any missing values.
# --- Write your code below ---
# Print the percentage of the rows whose YEAR_BUILT is missing.
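One hedged sketch of the computation, using a tiny synthetic column in place of the real one:

```python
import pandas as pd

# Synthetic stand-in; with the real data, use df["YEAR_BUILT"] from above.
df = pd.DataFrame({"YEAR_BUILT": [1990.0, None, 2018.0, None, 1975.0]})

# isna() flags missing entries; the mean of the boolean mask is the
# fraction of missing rows.
missing_pct = df["YEAR_BUILT"].isna().mean() * 100
print(f"{missing_pct:.1f}% of rows have a missing YEAR_BUILT")
```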
Missing values are very common in real-world datasets. In practice, you should always be aware of the impact of the missing values on your downstream analysis results.
Question 4. Plot a line chart
We now start investigating the values in the YEAR_BUILT column. Suppose we want to know: “How many properties were built in each year (from 1990 to 2018)?” Please plot a line chart to answer the question.
# --- Write your code below ---
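A possible sketch, assuming pandas plus matplotlib and a synthetic YEAR_BUILT column in place of the real one:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when working interactively
import matplotlib.pyplot as plt

# Synthetic YEAR_BUILT values standing in for the real column.
df = pd.DataFrame({"YEAR_BUILT": [1990, 1990, 1995, 2005, 2005, 2005, 2018]})

# Count properties per year, restricted to 1990-2018, in chronological order.
mask = df["YEAR_BUILT"].between(1990, 2018)
counts = df.loc[mask, "YEAR_BUILT"].value_counts().sort_index()

fig, ax = plt.subplots()
counts.plot(ax=ax, xlabel="Year built", ylabel="Number of properties")
fig.savefig("q4_line.png")
```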
Please write down the two most interesting findings that you draw from the plot.
Findings
- [ADD TEXT]
- [ADD TEXT]
Question 5. Plot a bar chart
Next, we want to find out: between 1900 and 2018, in which years were the most properties built? Plot a bar chart showing the top 20 years.
# --- Write your code below ---
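One possible sketch on synthetic data (replace the synthetic frame with the real `df`):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the real YEAR_BUILT column.
df = pd.DataFrame({"YEAR_BUILT": [1912, 1912, 1965, 1965, 1965, 2005]})

# value_counts() sorts by frequency, so the first 20 entries are the top 20 years.
mask = df["YEAR_BUILT"].between(1900, 2018)
top20 = df.loc[mask, "YEAR_BUILT"].value_counts().head(20)

fig, ax = plt.subplots()
top20.plot.bar(ax=ax, xlabel="Year built", ylabel="Number of properties")
fig.savefig("q5_bar.png")
```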
Please write down the two most interesting findings that you draw from the plot.
Findings
- [ADD TEXT]
- [ADD TEXT]
Question 6. Plot a histogram
What’s the distribution of the number of properties built between 1990 and 2018? Please plot a histogram to answer this question.
# --- Write your code below ---
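A hedged sketch of one reading of the question (histogram of the per-year counts), again on synthetic data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the real YEAR_BUILT column.
df = pd.DataFrame({"YEAR_BUILT": [1990, 1990, 1995, 2005, 2005, 2005, 2010]})

# First count properties per year, then histogram those per-year counts.
mask = df["YEAR_BUILT"].between(1990, 2018)
per_year = df.loc[mask, "YEAR_BUILT"].value_counts()

fig, ax = plt.subplots()
ax.hist(per_year.values, bins=10)
ax.set_xlabel("Properties built in a year")
ax.set_ylabel("Number of years")
fig.savefig("q6_hist.png")
```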
Please write down the two most interesting findings that you draw from the plot.
Findings
- [ADD TEXT]
- [ADD TEXT]
Question 7. Make a scatter plot
Suppose we are interested in those years in which more than 2000 properties were built. Make a scatter plot to examine whether there is a relationship between the number of properties built and the year.
# --- Write your code below ---
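One possible sketch, assuming you have already computed per-year counts (synthetic numbers are used here in place of the real ones):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt

# Synthetic per-year counts standing in for the real data.
per_year = pd.Series({1990: 2500, 1995: 1800, 2005: 3100, 2010: 2200})

# Keep only years with more than 2000 properties, then scatter year vs. count.
busy = per_year[per_year > 2000]

fig, ax = plt.subplots()
ax.scatter(busy.index, busy.values)
ax.set_xlabel("Year built")
ax.set_ylabel("Number of properties")
fig.savefig("q7_scatter.png")
```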
Please write down the two most interesting findings that you draw from the plot.
Findings
- [ADD TEXT]
- [ADD TEXT]
Part 2: Data and Model Visualization
Revisit Assignment 9 (weather prediction) from CMPT 732 and perform a deeper analysis of the same temperature data, using a simplified version of the model you already have.
Data
The weather data on HDFS under /courses/732/tmax-{1,2,3,4} spans a large time period and covers many stations around the globe. There are many possible questions to study. Use a Python plotting library of your choice, such as matplotlib.
Model
The model from A9 of CMPT 732 used 'latitude', 'longitude', 'elevation', 'yesterday_tmax', 'day_of_year' as input features to predict t_max. Please retrain your model to use only 'latitude', 'longitude', 'elevation', 'day_of_year' before proceeding with task (b) below, and include this re-trained weather model in your submission.
Tasks
a) Produce one or more figures that illustrate the daily max. temperature distribution over the entire globe and enable a comparison of different, non-overlapping time periods, e.g. to reveal temporal trends over longer time periods or recurring seasons.
Only show temperatures where you have data available. Take care to handle overplotting, i.e. multiple different values landing on the same point of the figure, which can happen when you have multiple measurements for the same station in a chosen period. By handling overplotting we mean, for instance, aggregating your data so that the value displayed for a particular station has a clear meaning, such as the maximum or average within the period.
Here is an example from the web:
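As one hedged sketch of the aggregation idea (the column names below are assumptions; adapt them to your tmax schema):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt

# Synthetic tmax records; the column names mirror what the 732 data
# roughly contains, but are only illustrative here.
obs = pd.DataFrame({
    "station":   ["A", "A", "B", "B", "C"],
    "latitude":  [49.3, 49.3, 35.7, 35.7, -33.9],
    "longitude": [-123.1, -123.1, 139.7, 139.7, 151.2],
    "tmax":      [18.0, 22.0, 30.0, 34.0, 25.0],
})

# Aggregate per station so each plotted point has one clear meaning
# (here: the period's average tmax), avoiding overplotting.
per_station = obs.groupby(
    ["station", "latitude", "longitude"], as_index=False
)["tmax"].mean()

fig, ax = plt.subplots()
sc = ax.scatter(per_station["longitude"], per_station["latitude"],
                c=per_station["tmax"], cmap="coolwarm")
fig.colorbar(sc, label="avg tmax")
fig.savefig("tmax_map.png")
```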
b) Produce two or more figures that show the result of your re-trained regression model from CMPT 732-A9, i.e. a version of the model that does not use yesterday_tmax as an extra input feature:
(b1) Evaluate your model at a grid of latitude, longitude positions around the globe, spanning oceans and continents, leading to a dense plot of temperatures. This could, for instance, look something like the following. You can use a fixed day_of_year of your choice. Also, see further hints about elevation below.
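One way the grid construction could be sketched (the elevation values here are zero placeholders; with the lab files in place, fill them in via elevation_grid’s get_elevations as described in the hints below):

```python
import numpy as np
import pandas as pd

# Dense lat/lon grid covering the globe (2-degree spacing here for brevity;
# use a finer grid for a truly dense plot).
lats = np.arange(-90, 90, 2.0)
lons = np.arange(-180, 180, 2.0)
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")

grid = pd.DataFrame({
    "latitude": lat_grid.ravel(),
    "longitude": lon_grid.ravel(),
    # Placeholder zeros: replace with elevations for these coordinates
    # (see the elevation_grid.py hints below).
    "elevation": 0.0,
    "day_of_year": 180,  # a fixed day of your choice, as the task allows
})
print(grid.shape)
# grid can now be fed to the re-trained model's predict step, and the
# predictions reshaped back to (len(lats), len(lons)) for a dense image plot.
```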
(b2) In a separate plot show the regression error of your model predictions against test data. In this case only use locations where data is given, i.e. you may reuse your plotting method from Part 2 (a).
Comments and Hints
Any imperfections of your trained model that show up in the visualization are fine. In fact, in this example it is a sign of a good visualization, if it enables us to understand shortcomings of the model. You are not marked for the performance of your model from 732-A9 again, but rather for the methods you create here to investigate it.
Please attempt to make continent or country borders visible on your map. You can do that either by using a library function or by using enough data points that the shape of some continents roughly emerges from the data distribution. Out of the different datasets, please use one with at least 100k rows.
For (b1) you will need elevation information for the points you produce. Have a look at elevation_grid.py for a possible way to add this info to your choice of coordinates. If you place the accompanying elevation data in the same folder as the script, you can import the module and see help(elevation_grid) for example usage. elevation_grid.py internally stores elevation data as an array at 5 times the resolution of the figure shown here; use the get_elevations function to access it.
Submission of Part 2
Please prepare the following components (each has one or two files):
- Report: Combine the plots into a PDF document weather_report.pdf along with brief captions explaining and discussing the figures. If you decide to produce the PDF using a Jupyter notebook that contains the markdown to render and discuss the figures saved by weather_plot.py, you can submit it as weather_report.ipynb. Submitting the notebook is optional.
- Code: Please provide your code to produce the figures in a script weather_plot.py, which could be based on the weather_test.py from 732-A9. Since you may want to separate the Spark code that runs on the cluster from the plotting code, you can provide that in an optional script called weather_spark.py. Please ensure that all visualization code relevant for marking is in these Python scripts.
- Submit the weather model that you are using.
Submission
In summary, you need to complete the first part by filling out the first half of this notebook, and the second part by following the submission instructions above. Overall, please submit the files listed above to the CourSys activity Assignment 3.
Lab environment for the assignment
Scratch space
Your scratch space allows you to store larger files outside of your home folder, not counting them towards your limited disk quota. To make that space available via a link from your home folder use:
ln -s /usr/shared/CMPT/scratch/<username> ~/scratch
Similar to HDFS on gateway, please treat this space as a shared resource, i.e. remove large temporary files when you’re done working with them.
Conda
For the big data lab setup, we have put a few useful Python modules, such as basemap or geoviews, into a shared conda environment. To use the environment, call
source activate /usr/shared/CMPT/big-data/condaenv/py36
or prepare once with
mkdir -p ~/.conda/envs
ln -s /usr/shared/CMPT/big-data/condaenv/py36 ~/.conda/envs/
and from then on simply activate using source activate py36 and, for instance, work with pyspark on your local lab machine.
Pip
As an alternative to conda, you can also just use pip. For instance, create a pip environment called myenv in the scratch space (see above):
python -m venv ~/scratch/myenv
Activate: source ~/scratch/myenv/bin/activate