Description
Question one (20 points)
Replicate Figure 4.5 in Section 4.7 of Alpaydin, 4th edition.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
The ground truth function for the regression is f(x) = 3 cos(1.3x).
Generate 100 sample datasets from f(x) plus Gaussian white noise, N(0, 1). Each dataset has 20 points, with x drawn uniformly at random from [0, 5] and the corresponding noisy target values.
# Ground truth target function
def f(x):
    return 3 * np.cos(1.3 * x)
# seed
np.random.seed(62)
# x
x = np.random.uniform(0.0, 5.0, [100, 20])
x = np.sort(x)
# Ground truth targets
g = f(x)
# Add white noise
noisy = np.random.normal(0, 1, [100, 20])
# y
y = g + noisy
# use linspace(0,5,100) as test set to plot the images
x_test = np.linspace(0,5,100)
TODO: Use the first 5 datasets to generate 4 plots.
- Figure one: The function f(x) = 3 cos(1.3x) and one noisy dataset sampled from it, titled “Function, and data”.
- Figure two: Fit five polynomials of degree ONE to the first five datasets and title this figure “Order 1”.
- Figure three: Fit five polynomials of degree THREE to the first five datasets and title this figure “Order 3”.
- Figure four: Fit five polynomials of degree FIVE to the first five datasets and title this figure “Order 5”.
- For figures two, three, and four, add a dotted line showing the average of the five fits.
Please use x_test to plot all the model functions, not just the ground truth function. Evaluating on this dense grid makes the higher-order polynomial models look smoother.
Hint: You can use scikit-learn’s PolynomialFeatures and LinearRegression.
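One possible way to combine the two classes from the hint is a scikit-learn Pipeline. The sketch below is illustrative only: the helper name `fit_poly` and the toy quadratic data are assumptions and not part of the assignment, and other designs (for example, transforming the features by hand) would work equally well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_poly(x_1d, y_1d, degree):
    """Fit a polynomial of the given degree to 1-D data (hypothetical helper)."""
    model = Pipeline([
        # include_bias=False: LinearRegression already fits an intercept
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linreg", LinearRegression()),
    ])
    # scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
    model.fit(np.asarray(x_1d).reshape(-1, 1), y_1d)
    return model


# Toy data (NOT the assignment's data): an exact quadratic,
# so a degree-2 fit should reproduce it almost exactly.
x_toy = np.linspace(0.0, 5.0, 20)
y_toy = 1.0 + 2.0 * x_toy - 0.5 * x_toy ** 2

toy_model = fit_poly(x_toy, y_toy, degree=2)

# Predict on a dense grid, as one would when plotting a smooth curve
x_dense = np.linspace(0.0, 5.0, 100)
y_hat = toy_model.predict(x_dense.reshape(-1, 1))
```

The reshape to a column vector is the step that is easiest to miss: both `fit` and `predict` take a 2-D array even when there is a single input feature.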
# model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
def linear_model_predict(X, Y, order):
    # fit one polynomial model of degree `order`
    ## Insert your code BEGIN
    ## Insert your code END
    return model
def plot_figure(x, y, x_test, order):
    # plot five curves corresponding to the polynomial of degree `order`
    # plot the average of these five curves
    ## Insert your code BEGIN
    ## Insert your code END
# show the plots
fig, axs = plt.subplots(2, 2, figsize=(15, 15))
# figure one
plt.subplot(2, 2, 1)
## Insert your code BEGIN
## Insert your code END
# figure two
plt.subplot(2, 2, 2)
plt.ylim(-5, 5)
plot_figure(x, y, x_test, order=1)
# figure three
plt.subplot(2, 2, 3)
plt.ylim(-5, 5)
plot_figure(x, y, x_test, order=3)
# figure four
plt.subplot(2, 2, 4)
plt.ylim(-5, 5)
plot_figure(x, y, x_test, order=5)
Question 2 (40 points)
TODO: Generate Figure 4.6 from Alpaydin 4th Edition
The x-axis is the order of the polynomial model, from 1 to 5. The y-axis is the error. The plot should contain three curves: total error, bias error, and variance error.
Use all 100 datasets to compute the total error, bias error, and variance error functions using the total error equation (4.36): $E_X[(E[r|x] - g(x))^2 \mid x] = (E[r|x] - E_X[g(x)])^2 + E_X[(g(x) - E_X[g(x)])^2]$
Evaluate each of the three error functions with 10 equally spaced values starting from 0 and ending at 5, i.e. np.linspace(0, 5, 10)
TODO: For each of the five polynomial models, print the average predictions, $E_X[g(x)]$, at np.linspace(0, 5, 10)
Hint: The average prediction at a point x is the mean of the predictions of the 100 models fitted on the 100 datasets. The point x ranges over np.linspace(0, 5, 10).
TODO: Generate and print a DataFrame with 5 rows (one for each order) and 4 columns. The 4 columns are:
- Order
- Bias error
- Variance error
- Total error
Hint: For the bias error $(E[r|x] - E_X[g(x)])^2$, $E[r|x] = f(x)$ and $E_X[g(x)]$ is the average over the 100 models from the 100 datasets. You can then approximate the bias error by averaging $(E[r|x] - E_X[g(x)])^2$ over x in np.linspace(0, 5, 10).
Hint: For the variance error, you need nested loops (over each dataset and over x in np.linspace(0, 5, 10)) to get the average variance error.
Hint: The total error is the sum of bias error and variance error.
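The hints above can be sketched on toy numbers. The example below uses fabricated stand-in predictions (a sine curve plus an offset and noise, purely an assumption for illustration) rather than fitted polynomial models, and it replaces the nested loops with NumPy axis reductions. It is one way to organize the computation, not the required one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the assignment's data or models)
x_eval = np.linspace(0.0, 5.0, 10)   # evaluation points
f_true = np.sin(x_eval)              # plays the role of E[r|x]

# preds[i, j] = prediction of "model" i at x_eval[j]; 100 fake models
preds = f_true + 0.3 + rng.normal(0.0, 0.5, size=(100, x_eval.size))

# Average prediction over the models at each x: E_X[g(x)], shape (10,)
avg_pred = preds.mean(axis=0)

# Bias error: squared gap between the truth and the average prediction, averaged over x
bias_sq = np.mean((f_true - avg_pred) ** 2)

# Variance error: squared spread of each model around the average, averaged over models and x
variance = np.mean((preds - avg_pred) ** 2)

total = bias_sq + variance

# Sanity check: the mean squared gap between truth and individual predictions
# splits exactly into bias + variance for these sample averages
direct = np.mean((f_true - preds) ** 2)
print(np.isclose(total, direct))
```

The final check works because the cross term vanishes once `avg_pred` is the mean over the same set of models, which is why the total error can be obtained as a simple sum.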
## Insert your code BEGIN
# Define any variables or methods that you need here
## Insert your code END
def bias_error(avg_pred, x_eval):
    # For each polynomial order, computes its bias error
    # returns a list of length 5
    five_bias = []
    ## Insert your code BEGIN
    ## Insert your code END
    return five_bias
def variance_error(avg_pred, models_evals):
    # For each polynomial order, computes its variance error
    # returns a list of length 5
    five_variance = []
    ## Insert your code BEGIN
    ## Insert your code END
    return five_variance
# Fit 5 * 100 models, i.e. fit 100 models for each degree in range(1, 6).
# The shape of models_list is (5, 100)
models_list = []
## Insert your code BEGIN
## Insert your code END
# create evaluation x data
x_eval = np.linspace(0, 5, 10)
# Evaluate each of the 5 * 100 models on `x_eval`
# The shape of models_evals_list is (5, 100, 10): 5 degrees, 100 models per degree, and each model's predictions at the 10 evaluation points
models_evals_list = []
## Insert your code BEGIN
## Insert your code END
# For each degree compute the average predictions at `x_eval`
# The shape of `avg_preds_list` is (5, 10)
avg_preds_list = []
## Insert your code BEGIN
## Insert your code END
bias_lst = bias_error(avg_preds_list, x_eval)
variance_lst = variance_error(avg_preds_list, models_evals_list)
total_error = [x + y for x, y in zip(bias_lst, variance_lst)]
# show the plot
x_points = [1,2,3,4,5]
plt.plot(x_points, bias_lst, linestyle='dashed',label = "Bias^2", marker='o', markersize=10)
plt.plot(x_points, variance_lst, linestyle='dashed', label = "Variance", marker='o', markersize=10)
plt.plot(x_points, total_error, linestyle='solid', label = "Error", marker='o', markersize=10)
plt.legend()
plt.xlim(0.9, 5.1)
plt.xticks(np.linspace(1, 5, 5))
plt.xlabel("Order")
plt.ylabel("Error")
plt.title("Bias and Variance Trade-off")
# Display graph
plt.show()
# Error DataFrame
pd.set_option("display.precision", 3)
error_df = pd.DataFrame({
    'Order': range(1, 6),
    'Bias Error': bias_lst,
    'Variance Error': variance_lst,
    'Total Error': total_error
})
error_df
# Average predictions
pd.set_option("display.precision", 3)
pd.DataFrame(avg_preds_list)
HW ASSIGNMENT 3
DSCI – 552
1) In this problem we will perform Maximum Likelihood Estimation (MLE) to find the parameters of a Gaussian distribution. Consider a dataset of n one-dimensional points, denoted by the variable X. If we assume they come from a Gaussian distribution with mean $\mu$ and variance $V$, then each point x follows the probability density:
$P(x \mid \mu, V) = \frac{1}{\sqrt{2\pi V}} \exp\left(-\frac{(x - \mu)^2}{2V}\right)$
Apply MLE to the above equation using the following hints.
- a) The probability of the Gaussian distribution over all of X, assuming the points are independent, is given by
$P(X \mid \mu, V) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi V}} \exp\left(-\frac{(x_i - \mu)^2}{2V}\right)$
We need to maximize this to find the values of $\mu$ and $V$. This is done by taking the partial derivative with respect to $\mu$ and $V$ separately, setting each to 0, and solving for the values.
- b) Maximizing the log of a function is the same as maximizing the function itself, because log is monotonically increasing. Take the log of the equation before maximizing it.
- c) The derivative of log(x) is 1/x
- d) The derivative of f(g(x)) is f’(g(x)) · g’(x)
- e) log(ab) = log a + log b
- f) $\log(e^x) = x$
- g) $\log(a^b) = b \log a$
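As a sketch of how the logarithm rules listed above combine, the log of the likelihood can be expanded as follows. This assumes the n points are independent and Gaussian, and writes $\mu$ for the mean and $V$ for the variance; the differentiation step is left as the exercise.

```latex
\log P(X \mid \mu, V)
  = \sum_{i=1}^{n} \log\left[ \frac{1}{\sqrt{2\pi V}}
      \exp\left( -\frac{(x_i - \mu)^2}{2V} \right) \right]
  = -\frac{n}{2} \log(2\pi V) - \frac{1}{2V} \sum_{i=1}^{n} (x_i - \mu)^2
```

The product over points becomes a sum, the exponential cancels against the log, and the normalizing constant contributes the $-\frac{n}{2}\log(2\pi V)$ term.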
2) Given the following statistics, what is the probability that a man in a town has a particular disease, given that he has tested positive on a home testing kit?
- One percent of men have the disease
- 90% of men who have the disease test positive on the home kit
- 8% of men who use the kit will have false positives.
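A question of this kind can be set up with Bayes' rule. The sketch below assumes the 8% figure is the false positive rate, i.e. the probability of a positive test given no disease; the statement could be read differently, in which case the inputs change. The helper name `posterior_given_positive` is hypothetical.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' rule (hypothetical helper)."""
    # P(positive and disease)
    true_pos = prior * sensitivity
    # P(positive and no disease)
    false_pos = (1.0 - prior) * false_positive_rate
    # Divide by the total probability of a positive test
    return true_pos / (true_pos + false_pos)


# Assumed reading of the statistics:
#   prior = 0.01                -> one percent of men have the disease
#   sensitivity = 0.90          -> 90% of men with the disease test positive
#   false_positive_rate = 0.08  -> P(positive | no disease), an interpretation
p = posterior_given_positive(prior=0.01, sensitivity=0.90, false_positive_rate=0.08)
print(f"P(disease | positive) = {p:.4f}")  # about 0.1020 under these assumptions
```

The result is small despite the 90% detection rate because healthy men vastly outnumber sick ones, so most positive tests come from the false positive group.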