## Description

## Introduction

President Donald Trump said in 2012 that he believed climate change was a hoax created by China. In

this pset, we will attempt to prove him wrong. We will use regression analysis to model the climate of

different areas in the United States in order to find evidence of global warming.

• First, you will create models to analyze and visualize climate change in terms of temperature, and

then consider ways to make the data less noisy and obtain clearer temperature change trends.

• You will then test your models to see how well historical data can predict future temperatures.

• Lastly, you will investigate a way to model the extremity of temperature, rather than just the

increasing temperature.

Getting Started

Please do not rename the files we provide you with, change any of the provided helper functions,

change function/method names, or delete provided docstrings. You will need to keep data.csv in the

same folder as ps5.py .

Also, please consult the Style Guide, as we will be taking point deductions for specific violations.

Furthermore, there will be deductions for non-descriptive variable names and uncommented code.

Besides that, you may want to check out the pylab documentation (see the numpy and scipy sections

https://scipy.org/docs.html), as pylab includes a number of functions that will make parts of this pset

much easier.

You should submit

1. Code for your solutions in ps5.py

2. Write-up for your discussions in ps5_writeup.pdf

Please do not submit data.csv , since you won’t make any changes to it.

The Write Up

For this problem set, you will be submitting a write up based on the plots generated from the different

problems. You must put plots created by your code in this writeup. Even if you believe your plots are

wrong, answer the question according to your plots/code, and you can include an explanation of what

you think should have happened. As this problem set is based on exploring trends in data, we will be

grading the write up more strictly based on your quality of analysis. Answer the questions in full

sentences.

The Climate Class

To model the change in climate of a particular area, you will need some data. For this problem set, we

will use temperature data obtained from the National Centers for Environmental Information (NCEI).

The data, stored in data.csv , contains the daily maximum and minimum temperatures observed in 21

U.S. cities from 1961 to 2015. Open the file, and take a look at the raw data. In order to parse the raw

data, in ps5.py we have provided a helper class, Climate . You can initialize an instance of the Climate

class by providing the name of the raw data file. Look over this class and read its docstrings to figure out

how to get data for the following problems.

## Part A: Creating Models

### Problem 1: Curve Fitting

This project was adopted from MIT Open Course ware –

Introduction to Computer Science and Programming in Python

Implement the generate_models function. This function takes in a set of (x,y) coordinates and degrees

(1 = linear, 2 = quadratic, 3 = cubic, etc), and fits polynomials of the specified degrees to the data points.

It then returns the coefficients for each of the best-fit polynomials.

Hint: see the documentation for pylab.polyfit.

Function inputs : x and y are two pylab one-dimensional arrays (NOT Python lists) corresponding to the

x-coordinates and y-coordinates of the data samples. For example, if you have N data points, then

x = [x1, x2, …, xN] and y = [y1, y2, …, yN]

where xi and yi are the x and y coordinates of the i th data points. In this problem set, each x-coordinate

is an integer and corresponds to the year of a sample (e.g., 1997 ). Each corresponding y-coordinate is a

float , and represents the temperature observation of that year in Celsius (we will describe how exactly

these observations are calculated later on in the problem set — you don’t need to worry about it for this

function.) This two-array representation of data points will be used throughout the entire problem set.

degs is a list of integers, indicating the degrees for each regression model that we want to create. For

each model, this function should fit the data (x,y) to a polynomial curve of that degree.

Function output:

The function should return a list of models. A model is a 1-d pylab array of the coefficients used for the

polynomial. (So your final output is a list made up of pylab arrays.) The models should be in the same

order as their corresponding integers in degs .

Example:

print generate_models(pylab.array([1961, 1962, 1963]),

pylab.array([-4.4, -5.5, -6.6]), [1, 2])

Should print something close to

[array([ -1.10000000e+00, 2.15270000e+03]),

array([ 6.83828238e-14, -1.10000000e+00, 2.15270000e+03])]

The above example generates linear and quadratic curves (degrees 1 and 2) on data samples (xi, yi) =

(1961, -4.4), (1962, -5.5), and (1963, -6.6). The resulting models are in the same order as specified in

degs . Note that it is fine if you did not get the exact number because of numerical errors.

After implementing this, your code should pass the unit test test_generate_models .

### Problem 2: R^2

After we create some regression models, we want to be able to evaluate our models to figure out how

well they represent our data and choose the best ones.

One way to evaluate how well a model performs is by computing the model’s R 2 value, also known as

its coefficient of determination. R^2 provides a measure of how well the total variation of samples is

explained by the model. Implement the function r_squared.

Input:

y : a 1-dimensional pylab array, which contains the y-coordinates of the actual data samples estimated :

a 1-dimensional pylab array, which contains the estimated y-coordinates from a regression model

Output:

This function should return the computed R^2 value. You can compute R^2 as follows:

If you are still confused about , the Wikipedia page R^2 has a good explanation about its use and how to

calculate it.

Some Python packages include functions that directly calculate R^2 ; you may not use these, and should

do the calculations from scratch.

Hint: with pylab arrays, you can easily perform many operations without using for-loops. For example:

>>> a = pylab.array([1,2,3])

>>> b = pylab.array([1,0,1])

>>> a + b array([2, 2, 4])

>>> a – 1 array([0, 1, 2])

>>> a * 2 array([2, 4, 6])

Your code should now pass test_r_squared.

### Problem 3: Visualizing the Models

We’ve learned how to obtain a numerical metric for evaluation. Visualizing our data samples along with

fitting curves can also help us figure out the goodness of obtained models. In this problem, we will

integrate the numerical metrics and visualization for a comprehensive evaluation.

Implement the function evaluate_models_on_training. This function will be used to evaluate models on

the same data that we use to create them — aka the training data.

This function takes as input several data samples ( x and y ) and a list of models (which are the arrays of

coefficients we obtain from generate_models ) that you want to apply to your data. The function should

generate a figure for each model. In the figure, you should plot the data along with the curve specified

by the model, and report the goodness of fit with the R^2 value.

Your graph needs to match the following format:

• Plot the data points as individual blue dots

• Plot your model as a red solid line

• Include a title and label your axes (you can assume this function will only be used in the case

where the x-axis is years and the y-axis is degrees Celsius)

o Your title should include the R^2 value of your model, along with the degree of this model .

Your title may be longer than your graph, and get cut off when you copy and paste the

graph to your write-up. To fix that you can add the newline character “\n” , which adds a

line break to your string, in the title (e.g., title = string_a + “\n” + string_b ).

If the model is a linear curve (i.e., its degree is one), the title of your plot should also include the ratio of

the standard error of this fitted curve’s slope to the slope. ( **see se_over_slope helper function**)

This ratio measures how likely it is that you’d see the trend in your data (upward/downward) and fitting

curve just by chance.

The larger the absolute value of this ratio is, the more likely it is that the trend is by

chance. We won’t cover this evaluation method in class, so if you are interested about it check out:

Hypothesis Test for Regression Slope. [https://stattrek.com/regression/slope-test.aspx?Tutorial=AP]

In our case, if the absolute value of the ratio is less than 0.5, the trend is significant (i.e., not by chance).

### Problem 4: Investigating the trend

Now we have all the components we need. We can start generating data samples from the raw

temperature records and investigate the trend.

Problem 4.I January 10th

A simple first method for sampling: we randomly pick a day from a year (i.e., Jan 10th in this case), and

see whether we can find any trends in the temperatures over the years. We surmise, due to global

warming, that the temperature of this specific date should increase over time.

Write your code for parts 4.I and 4.II under if __name__ == ‘__main__’: after the comment # Part A.4 .

First, generate your data samples. Each sample (data point) should be a year from 1961 to 2009 (i.e., the

years in TRAINING_INTERVAL ) and the temperature on January 10th for New York City in that year (look

at the Climate class for a function to help with you this!)

Next, fit your data to a degree-one polynomial with g enerate_models, and plot the regression results

using evaluate_models_on_training.

You will need to include the figure you generate in your write-up.

Problem 4.II Annual Temperature

Let’s try another way to get data points. We surmise that due to global warming, the average

temperature for each year should increase over time. Thus, we will going to plot the results of a linear

regression on the average annual temperature of New York City.

Write your code for this part under if __name__ == ‘__main__’: after the comment # Part A.4 .

First, generate your data samples. Each sample (data point) should be a year from 1961 to 2009 (i.e., the

years in TRAINING_INTERVAL ) and the average temperature in New York City for that year (again, look

at the Climate class for a function to help with you this.)

Hint: make sure you properly account for leap years!

Next, fit your data to a degree-one polynomial with g enerate_models and plot the regression results

with evaluate_models_on_training.

You will need to include the figure you generate in your write-up.

Include the plots for A4.I and A4.II in a document called ps5_writeup.pdf . Make sure each plot has

appropriately labeled axes, and is titled according to the type of model (e.g. linear, quadratic, etc.), the

R^2 value, and the standard error-to-slope ratio.

You also need to answer the following questions with a short paragraph in ps5_writeup.pdf.

• What difference does choosing a specific day to plot the data for versus calculating the yearly

average have on our graphs (i.e., in terms of the R 2 values and the fit of the resulting curves)?

Interpret the results.

• Why do you think these graphs are so noisy? Which one is more noisy?

• How do these graphs support or contradict the claim that global warming is leading to an increase

in temperature? The slope and the standard error-to-slope ratio could be helpful in thinking about

this.

## Part B: Incorporating More Data

Let’s see whether we can get a better picture of climate change by using data from more than just one

city.

Implement the function gen_cities_avg to get the yearly average temperature over multiple cities. Use

this function to compute national yearly temperature (i.e., average the yearly averaged temperature

over the 21 cities listed in CITIES ) between 1961-2009 as your data samples. As in previous parts, the xcoordinate of a sample is an integer for the year; this time, the y-coordinate is a float for the national

yearly temperature in that year.

Your code should now pass the test test_gen_cities_avg.

Again, you should fit your data to a degree-one polynomial with generate_models and plot the

regression results with evaluate_models_on_training. Leave your code for this part after the comment #

Part B , which is under if __name__ == ‘__main__’: .

You will need to include the figure you generate in your write-up.

Plot the result and include it in ps 5_writeup.pdf.

Answer the following questions with a short paragraph in ps5_writeup.pdf.

• How does this graph compare to the graphs from part A (i.e., in terms of the R^2 values, the fit of

the resulting curves, and whether the graph supports/contradicts our claim about global

warming)? Interpret the results.

• Why do you think this is the case?

• How would we expect the results to differ if we used 3 different cities? What about 100 different

cities?

• How would the results have changed if all 21 cities were in the same region of the United States

(for ex., New England)?

## Part C: 5-year Moving Average

We are now going to generate the temperatures for samples by taking a moving average over 5 years of

data. The moving average allows us to emphasize the general/global trend over local/yearly fluctuation.

Implement the function moving_average. This function takes in an 1d pylab array, y , and the

window_length , which is the length of window for moving average. It returns another 1d pylab array

containing the moving average results.

Here, we define moving average of y[i] as the average of y[i-window_length+1] to y[i] . For example, if y

= [10, 20, 30, 40, 50] and window_length = 3 , the array of moving averages is:

Note that in some cases we have less than window_length-1 previous values to use (as in the first two

calculations in the array); in those cases you should just average over the current value along with the

prior values that do exist. (So for the second array value, for example, we just average 10 and 20.)

You should now pass the test test_moving_avg.

Use this function on the national yearly temperatures from 1961-2009 in order to generate the moving

average temperatures with a window size of 5. Then, fit the (year, moving average) samples a to a

degree-one polynomial with g enerate_models, and plot the regression results with

evaluate_models_on_training.

Leave your code for this part after the comment # Part C , which is under if __name__ == ‘__main__’: .

Plot the result and include it in ps 5_writeup.pdf.

Answer the following questions with a short paragraph in ps5_writeup.pdf.

• How does this graph compare to the graphs from part A and B ( i.e., in terms of the R^2 values,

the fit of the resulting curves, and whether the graph supports/contradicts our claim about global

warming)? Interpret the results.

• Why do you think this is the case?

## Part D: Predicting the Future

It looks like we have indeed discovered some trends. Now, we are curious whether we can predict

future temperatures based on what we learn from historical data.

### Problem 1: RMSE

Before we use our models to predict future data points (i.e., data from later than 1961-2009), we should

think about how to evaluate our models’ performance. We can’t use R^2 here, since R^2 does not have

a clear meaning on testing data — R^2 measures how closely a model matches the data used to generate

the model, but we are generating the model with 1961-2009 and testing on 2010-2015.

One way to evaluate a model’s performance on test data is with Root Mean Square Error (RMSE), which

measures the deviation of predicted values from true values.

Implement the function rmse to return the RMSE of a model, where y represents the actual y-values

from the data set and estimated is the y-values estimated by the model.

RMSE can be found as follows:

If you are still confused about RMSE, its Wikipedia page has a good explanation about its use/how to

calculate it. https://en.wikipedia.org/wiki/Root-mean-square_deviation

You should now pass the test test_rmse . For the evaluation, you also need to implement the function

evaluate_models_on_testing . This function works very similarly to evaluate_models_on_training,

except that you should use rmse rather than r _squared to evaluate the prediction of your model. You

should report the RMSE value in the figure’s title.

You don’t need to compute the ratio of the standard error of slope to the slope.

### Problem 2: Predicting

Now that we have a method for evaluating models’ performance on test data, we are going to use our

models to “predict the future”. We need to compare our models’ predictions to the real data in order to

evaluate their performance, so we will use data from 2010-2015 (i.e. the TESTING_INTERVAL ) to

simulate the future. We will call data from 1961-2009 the “training” data set that we create the model

on, and data from 2010-2015 the “test” data set which we predict the values for.

Leave your code for this part after the comment # Part D , which is under if __name__ == ‘__main__’: .

### Problem 2.I Generate more models

First, we want to generate more models for prediction. Complete the following steps:

1. Compute 5-year moving averages of the national yearly temperature from 1961-2009 as your

training data samples.

2. Fit the samples to polynomials of degree 1, 2 and 20.

3. Use evaluate_models_on_training to plot your fitting results.

Plot the results and include them in ps5_writeup.pdf.

Answer the following questions with a short paragraph in ps5_writeup.pdf.

• How do these models compare to each other?

• Which one has the best R^2? Why?

• Which model best fits the data? Why?

### Problem 2.II Predict the results

Now, let’s do some predictions and compare our predictions to the real average temperatures from

2010-2015. Complete the following steps:

1. Compute 5-year moving averages of the national yearly temperature from 2010-2015 as your test

data samples.

2. For each model obtained in the previous problem (i.e., the curves fit to 5-year moving averages of

the national yearly temperature from 1961-2009 with degree 1, 2, and 20), apply the model to

your test data (defined above), and graph the predicted and the real observations (i.e., 5-year

moving average of test data). You should use evaluate_models_on_testing for applying the model

and plotting the results.

Plot the the resulting graphs for the degree =1,2,20 models and include them in ps5_writeup.pdf.

Answer the following questions with a short paragraph in ps5_writeup.pdf.

• How did the different models perform? How did their RMSEs compare?

• Which model performed the best? Which model performed the worst? Are they the same as

those in part D.2.I? Why?

• If we had generated the models using the A.4.II data (i.e. average annual temperature of New York

City) instead of the 5-year moving average over 22 cities, how would the prediction results 2010-

2015 have changed?

## Part E: Modeling Extreme Temperatures

In addition to raising temperature, global warming also makes temperatures more extreme (e.g., very

hot or very cold). We surmise that we can model this effect by measuring the standard deviation in our

data. A small standard deviation would suggest that the data is very close together around the mean. A

larger standard deviation, however, would suggest that the data varies a lot (i.e., more extreme

weather). Therefore, we expect that over time, the standard deviation should increase.

In order to test our prediction, implement the function gen_std_devs. This should be very similar to

gen_cities_avg.

This function returns a pylab array containing one value for each specified year. For each year in years,

your function will need to:

1. Calculate a temperature for each day in that year, by averaging the temperatures for that day

across the specified cities.

2. Take the standard deviation of the daily averages for the whole year.

You should now pass the test test_gen_std_devs.

Next, let’s try out the function. Leave your code for this part after the comment # Part E , which is under

if __name__ == ‘__main__’: .

1. Use gen_std_devs to compute the standard deviations using all 21 cities over the years in the

training interval, 1961-2009.

2. Compute 5-year moving averages on the yearly standard deviations.

3. Finally, fit your data to a degree-one polynomial with g enerate_models and plot the regression

results with evaluate_models_on_training.

Plot the the resulting graph and include it in ps5_writeup.pdf.

Answer the following questions with a short paragraph in ps5_writeup.pdf.

• Does the result match our claim (i.e., temperature variation is getting larger over these years)?

• Can you think of ways to improve our analysis?

Hand-In Procedure

1. Save

Save your solutions as ps5.py and p s5_writeup.pdf.

For ps5_writeup.pdf , please do not submit a .doc, .odt, .docx, etc.

2. Time and Collaboration Info

At the start of each file, in a comment, write down the number of hours (roughly) you spent on

the problems in that part, and the names of the people you collaborated with. For example:

# Problem Set 5

# Name: Jane Lee

# Collaborators: John Doe

# Time:

#

… your code goes here …

3. Sanity checks

After you are done with the problem set, do sanity checks. Run the code and make sure it can be

run without errors. You should never submit code that immediately generates an error when run!

Make sure that your write up contains everything we’ve asked for. In particular, your write up

needs to contain one graph from A.4.I, one graph from A.4.II, one from B, one from part C, three

graphs from D.4.I, three graphs from D.4.II, and one graph from part E.