Take-Home Final Exam for ISyE 7406

Spring 2025

Overview: In probability and statistics, it is important to understand the mean and variance of any random variable. In many applications it is straightforward to simulate realizations of a random variable Y, but it is often highly non-trivial to characterize the exact distribution of Y = Y(X1, X2), including deriving explicit formulas for the mean and variance of Y as functions of X1 and X2.
Objective: In this exam, suppose that Y = Y(X1, X2) is a random variable whose distribution depends on two independent variables X1 and X2, and the objective is to estimate two deterministic functions of X1 and X2: the mean µ(X1, X2) = E(Y) and the variance V(X1, X2) = Var(Y).
For that purpose, you are provided with 200 observed realizations of Y for each of a set of given pairs (X1, X2). You are asked to use data mining or machine learning methods that allow us to conveniently predict or approximate the mean and variance of Y = Y(X1, X2) as functions of X1 and X2. That is, your task is to predict two values for each pair (X1, X2) in the testing data set: the mean µ(X1, X2) = E(Y(X1, X2)) and the variance V(X1, X2) = Var(Y(X1, X2)).
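To make the setup concrete, here is a minimal R sketch of a toy version of the problem. The functions mu.toy and V.toy below are invented purely for illustration and have nothing to do with the actual (undisclosed) data-generating process of this exam:

## Toy illustration only: invent a mean and a variance function,
## then simulate 200 realizations of Y at a single design point (X1, X2)
set.seed(1)
mu.toy <- function(x1, x2) sin(2*pi*x1) + x2^2   ## hypothetical E(Y)
V.toy  <- function(x1, x2) 0.5 + x1*x2           ## hypothetical Var(Y)
x1 <- 0.3; x2 <- 0.7
y <- rnorm(200, mean = mu.toy(x1, x2), sd = sqrt(V.toy(x1, x2)))
c(mean(y), var(y))   ## close to mu.toy(0.3, 0.7) and V.toy(0.3, 0.7)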
Training data set: To help you develop a reasonable estimation of the mean and variance of Y = Y(X1, X2) as deterministic functions of X1 and X2, we provide a training data set generated as follows. We first choose uniform design points over 0 ≤ X1 ≤ 1 and 0 ≤ X2 ≤ 1, namely x1i = 0.01 ∗ i for i = 0, 1, 2, . . . , 99, and x2j = 0.01 ∗ j for j = 0, 1, 2, . . . , 99. Thus there are a total of 100 ∗ 100 = 10^4 combinations of (x1i, x2j)'s, and for each of these 10^4 combinations we generate 200 independent realizations of the Y variable, denoted by Yijk for k = 1, . . . , 200.
The corresponding training data, 7406train.csv, is available from Canvas. Note that this training data set is a 10^4 × 202 table. Each row corresponds to one of the 100 ∗ 100 = 10^4 combinations of (X1, X2)'s. The first and second columns are the X1 and X2 values, respectively, whereas the remaining 200 columns are the corresponding 200 independent realizations of Y.
Based on the training data, you are asked to develop an accurate estimation of the functions µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y), as deterministic functions of X1 and X2, over 0 ≤ X1 ≤ 1 and 0 ≤ X2 ≤ 1.
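For reference, the uniform design grid described above can be reconstructed in R as follows (a sketch only; the training file already stores these values in its first two columns, and its row ordering may differ from that of expand.grid):

## Reconstruct the 100 x 100 uniform design grid on [0, 0.99] x [0, 0.99]
x1.grid <- 0.01 * (0:99)
x2.grid <- 0.01 * (0:99)
design <- expand.grid(X1 = x1.grid, X2 = x2.grid)
dim(design)   ## 10000 by 2, matching the first two columns of 7406train.csv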
To assist you, a brief exploratory data analysis (EDA) of the training data, written in R, is provided in the appendix. Please feel free to adapt it to other languages such as Python, Matlab, etc.
Testing data set: For the purpose of evaluating your proposed estimation models and methods, we choose 50 random design points for X1 and 50 random design points for X2. Thus there are a total of 50 ∗ 50 = 2500 combinations of (X1, X2) in the testing data set. The exact values of these (X1, X2)'s are included in the file 7406test.csv, which is available from Canvas. You are asked to use your fitted models to predict µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for Y = Y(X1, X2) at each of the 2500 combinations of (X1, X2) in the testing data (please keep (at least) six digits for your answers).
Estimation Evaluation Criterion: In order to evaluate your estimation or prediction, we obtain "true" values of µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for each combination of (X1, X2) in the testing data set, based on the following Monte Carlo simulations (we will not release these true values!). We first generated 200 random realizations of Y for each combination of (X1, X2) in the testing data set; these held-out realizations will not be released either. Then, for each given combination of (X1, X2), we have 200 realizations of Y, denoted by Y1, . . . , Y200, and we compute the "true" values as
\[
\mu^{\text{true}} = \bar{Y} = \frac{Y_1 + \cdots + Y_{200}}{200}
\qquad\text{and}\qquad
V^{\text{true}} = \widehat{\operatorname{Var}}(Y) = \frac{1}{200-1}\sum_{i=1}^{200}\bigl(Y_i - \bar{Y}\bigr)^2 .
\]
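In R terms, these two quantities are simply the sample mean and sample variance of the 200 held-out draws at each testing point; a short sketch, with y denoting a hypothetical length-200 vector of such draws:

## "true" values at one testing point, given its 200 held-out realizations y
mu.true <- mean(y)   ## (Y1 + ... + Y200)/200
V.true  <- var(y)    ## var() uses the 1/(200-1) divisor, matching the formula above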
Your predicted mean and variance functions, say µ̂(X1, X2) and V̂(X1, X2), will then be evaluated against these true values µ^true(X1, X2) and V^true(X1, X2) via
\[
\mathrm{MSE}_{\mu} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}\bigl(\hat{\mu}(x_{1i},x_{2j}) - \mu^{\text{true}}(x_{1i},x_{2j})\bigr)^2,
\qquad
\mathrm{MSE}_{V} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}\bigl(\hat{V}(x_{1i},x_{2j}) - V^{\text{true}}(x_{1i},x_{2j})\bigr)^2,
\qquad (1)
\]
where (I, J) = (50, 50) for the testing data.
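Since the true testing values are withheld, one reasonable way to compare candidate models before submission is a hold-out split of the 10^4 training grid points, mimicking the criterion in (1). The sketch below assumes the data frame data0 built in Appendix (A); the 80/20 split and the loess candidate with span = 0.05 are illustrative assumptions, not recommendations:

## Estimate MSE_mu on a validation split, assuming data0 from Appendix (A)
set.seed(7406)
idx    <- sample(nrow(data0), round(0.8 * nrow(data0)))
train0 <- data0[idx, ]
valid0 <- data0[-idx, ]
## one illustrative candidate: local smoothing of muhat on (X1, X2)
fit  <- loess(muhat ~ X1 + X2, data = train0, span = 0.05, degree = 2)
pred <- predict(fit, newdata = valid0)
mean((pred - valid0$muhat)^2, na.rm = TRUE)   ## empirical analogue of MSE_mu in (1)

The same split can be reused with Vhat as the response to compare candidates for the variance model.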
Your tasks: as your solution set to this exam, you are required to submit two files to Canvas before the deadline:
(a) A .csv file with the required predictions, containing your predicted values of µ(X1, X2) = E(Y) and V(X1, X2) = Var(Y) for the testing data (in 6 digits). Please name your file "1.YourLastName.YourFirstName.csv", e.g., "1.Mei.Yajun.csv" for the name of the instructor. We believe each student in our class has a unique last/first name combination, so there is no need to include the middle name.
• The submitted .csv file must be a 2500 × 4 table, and the first two columns must be exactly the same as in the provided testing data file "7406test.csv". The third column should be your estimated mean µ̂(X1, X2), and the fourth column your estimated variance V̂(X1, X2).
• If you want, you may round your numerical answers to six decimal places, e.g., report your estimates in the form 30.xxxxxx, but this is optional: in our evaluation process we will use the round function to round your answers to six decimals before computing the MSEs.
• Please save your predictions as a 2500 ∗ 4 data matrix in this .csv file, i.e., without headers or row/column labels/names. We will use a computer program to auto-read your .csv file and then auto-compute the MSE values in equation (1) for all students, in alphabetical order of last/first name. It is therefore important to follow this guideline: no headers or extra columns/rows in the .csv file, and name your .csv file in the form above.
(b) A (pdf or docx) file that explains the methods used for the prediction. Please name your file "2.YourLastName.YourFirstName", e.g., "2.Mei.Yajun.pdf" or "2.Mei.Yajun.docx" for the name of the instructor.
Your written report should read like a good journal paper: concise, clearly explaining and justifying your proposed models and methods; see also the guidelines for the final report of our course project. Please feel free to use any methods. This is an open-ended problem, and you may either use any standard methods you learned in class or develop your estimation with a completely new approach.
Remark:
• If you upload your files multiple times to Canvas, the file names might be renamed automatically by Canvas to "1.YourLastName.YourFirstName.csv-1" or similar. If this occurs, please do not worry, as we will take this into account and correct it for you.
• This final exam essentially asks you to build two different models: one to predict µ̂ and the other to predict V̂. For each model there are p = 2 independent variables (X1, X2), although both the µ̂(X1, X2) and V̂(X1, X2) functions likely need to be nonlinear to achieve good prediction performance. To be more specific, you might want to look beyond the multiple linear regression model β0 + β1X1 + β2X2 and investigate nonlinear models such as polynomial regression, local smoothing, generalized additive models, random forests, boosting, support vector machines with suitable kernels, neural networks, etc.; see the sketch after this list for two illustrative candidates. You then need to decide which (nonlinear) models to use for prediction on the testing data set. Hopefully this high-level viewpoint allows you to develop prediction models easily.
• After your submission, please double-check your submitted .csv file on Canvas: does it have exactly 2500 rows and 4 columns, and is it free of "NA" or missing values? In the past, there were three typical small mistakes that severely affected predictions: (i) having 5 or more columns (e.g., an unnecessary extra column labeling the observations); (ii) having more than 2500 rows (e.g., predictions on the training data instead of the testing data, or predictions from multiple models); and (iii) having "NA" values in the .csv file (e.g., some models will generate a prediction of "NA" if a point in the testing data set is outside the range of the training data set). Thus it is crucial to ensure that your .csv file has the required 2500 rows and 4 columns reporting the desired prediction values.
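To illustrate the model search suggested in the remark above, here is a hedged sketch that fits two of the listed candidates, a generalized additive model and a random forest, to the empirical means. It assumes data0 from Appendix (A) and testX from Appendix (B); the packages (mgcv, randomForest) and the tuning values (k, ntree, nodesize) are illustrative assumptions that would still need to be tuned and compared, and the variance model can be fitted analogously with Vhat as the response:

library(mgcv)          ## generalized additive models
library(randomForest)  ## ensembles of regression trees

## GAM with a joint bivariate smooth, so the fitted surface is nonlinear in (X1, X2)
gam.fit <- gam(muhat ~ s(X1, X2, k = 100), data = data0)

## Random forest; ntree and nodesize are illustrative starting values, not tuned choices
rf.fit <- randomForest(muhat ~ X1 + X2, data = data0, ntree = 500, nodesize = 5)

## Predictions at the 2500 testing points (testX as read in Appendix (B))
colnames(testX) <- c("X1", "X2")
mu.gam <- predict(gam.fit, newdata = testX)
mu.rf  <- predict(rf.fit,  newdata = testX)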
Grading Policies: This take-home final exam is worth a total of 25 points and will be graded by the TAs and instructor. There are three components:
• Prediction accuracy on mean: 10 points. The smaller MSEµ in (1), the better. Tentatively, we plan to assign "10" if MSEµ ≤ 1.20, "9" if it is in (1.20, 1.40], "8" if (1.40, 1.60], "7" if (1.60, 1.80], "6" if (1.80, 2.00], "5" if (2.00, 3], "4" if (3, 10], "3" if (10, 20], "2" if (20, 30], etc. For your information, in past semesters the percentages of students who received "10", "9", and "8" were 43%, 35%, and 16%, respectively. Students who received low grades often did not realize that they had made small mistakes here or there, or did not tune the hyper-parameters appropriately. In general, we feel that our grading is very generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Prediction accuracy on variance: 10 points. The smaller MSEV in (1), the better. Note that predicting the variance is a much harder problem than predicting the mean, so we expect this MSE to be much larger. Tentatively, we plan to assign "10" if MSEV ≤ 550, "9" if it is in (550, 570], "8" if (570, 590], "7" if (590, 610], "6" if (610, 630], "5" if (630, 650], "4" if (650, 700], "3" if (700, 1000], "2" if (1000, 5000], etc. In past semesters, 60% of students received "10", and more than 90% received at least "8". Thus we feel the grading should be generous, and we reserve the right to adjust the grading scale to be more generous if needed.
• Written Report: 5 points. There are no specific guidelines for this written report; please use common sense. With that said, we will look at the following aspects. Is the report well-written and easy to read? Is it easy to find the final chosen model or method? Does the report clearly explain how and why the final method was chosen? Does the report discuss how the parameters of the final chosen model were tuned? We plan to assign grades for this component as follows: "A" - 5, "B" - 4, "C" - 3, "D" - 2, "F" - 1, "Not submitted" - 0.
The TAs and instructor will try their best to give fair technical grades to all reasonable answers; e.g., even if your prediction accuracy is not as good as other students', we reserve the right to increase your prediction accuracy scores if your written report provides a solid justification of your proposed models/methods/results. However, we acknowledge that ultimately this is a subjective decision.
If needed, please feel free to leave a public or private message on Piazza. Good luck on your final exam!
Appendix: Some useful R code for (A) the training data set, (B) the testing data set, and (C) our auto-grading program.
(A) Exploratory data analysis of the training data set, which might inspire suitable methods for prediction:
#####
### Read Training Data
## Assume you saved the training data in the folder "C:/temp" on your local laptop
traindata <- read.table(file = "C:/temp/7406train.csv", sep = ",");
dim(traindata);
## dim = 10000*202
## The first two columns are the X1 and X2 values, and the last 200 columns are the Y values
### Some example plots for exploratory data analysis
### please feel free to add more exploratory analysis
X1 <- traindata[,1];
X2 <- traindata[,2];
## compute the empirical estimates muhat = E(Y) and Vhat = Var(Y) for each row
muhat <- apply(traindata[, 3:202], 1, mean);
Vhat <- apply(traindata[, 3:202], 1, var);
## You can construct a data frame in R that includes all crucial
## information for our exam
data0 <- data.frame(X1 = X1, X2 = X2, muhat = muhat, Vhat = Vhat);
## we can plot 4 graphs in a single plot
par(mfrow = c(2, 2));
plot(X1, muhat);
plot(X2, muhat);
plot(X1, Vhat);
plot(X2, Vhat);
## Or you can first create an initial plot of one line
## and then iteratively add the lines
##
## below is an example plotting X1 vs. muhat for different X2 values
##
## let us reset the plot
dev.off()
##
## now plot the lines one by one for each fixed X2
##
flag <- which(data0$X2 == 0);
plot(data0[flag, 1], data0[flag, 3], type = "l",
     xlim = range(data0$X1), ylim = range(data0$muhat), xlab = "X1", ylab = "muhat");
for (j in 1:99){
  flag <- which(data0$X2 == 0.01*j);
  lines(data0[flag, 1], data0[flag, 3]);
}
## You can also plot figures for each fixed X1 or for Vhat
### You are essentially asked to build two models based on "data0":
### one is to predict muhat based on (X1, X2); and
### the other is to predict Vhat based on (X1, X2).
(B) Read the testing data and write your prediction on the testing data:
## Testing Data: first read the testing X variables
testX <- read.table(file = "C:/temp/7406test.csv", sep = ",");
dim(testX)
## This should be a 2500*2 matrix
## Next, based on your models, you predict muhat and Vhat for each (X1, X2) in testX.
## Suppose that leads you to a new data.frame
## "testdata" with 4 columns: "X1", "X2", "muhat", "Vhat".
## Then you can write them to a csv file as follows
## (please use your own Last Name and First Name):
write.table(testdata, file = "C:/temp/1.LastName.FirstName.csv",
            sep = ",", col.names = FALSE, row.names = FALSE)
## Then you can upload the .csv file to Canvas
## Note that in your final answers, you essentially add two columns, your estimates of
## mu(X1, X2) = E(Y) and V(X1, X2) = Var(Y),
## to the testing X data file "7406test.csv".
## Please double check that your predictions are saved as a 2500*4 data matrix
## in a .csv file "without" headers or extra columns/rows.
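One way to automate this double check is to read the submission back and verify its format; a minimal sketch, assuming the file path and names used earlier in this appendix:

## Read the submission back and verify the required 2500*4 format
check <- read.table(file = "C:/temp/1.LastName.FirstName.csv", sep = ",");
stopifnot(nrow(check) == 2500, ncol(check) == 4);   ## exactly 2500 rows and 4 columns
stopifnot(!anyNA(check));                           ## no NA or missing predictions
## the first two columns should reproduce the testing design points exactly
stopifnot(isTRUE(all.equal(check[, 1:2], testX[, 1:2], check.attributes = FALSE)));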
(C) Our auto-grading program for your predictions (this does not affect your prediction; it is provided only for interested students). If the auto-grading program somehow fails (e.g., due to inconsistent file names), we will compute your prediction scores manually, as we want to make sure everyone is graded fairly.
##### In the auto-grading, we run loops, one loop for each student.
##### In each loop, we first generate the filename as name1 = "1.LastName.FirstName.csv".
##### Next, we compare your answers with the Monte Carlo based values
##### "muhatestMC" and "VhatestMC", which were computed as described in the exam
##### and are held out by the instructor (sorry that we will not release them,
##### since otherwise students could simply copy and paste to get perfect predictions).
#####
resulttemp <- read.table(file = name1, sep = ",");
muhatmp <- round(resulttemp[, 3], 6);   ## Your predicted values for mu, rounded to 6 digits
Vhatmp <- round(resulttemp[, 4], 6);    ## Your predicted values for Vhat, rounded to 6 digits
MSEmu <- mean((muhatestMC - muhatmp)^2);
MSEV <- mean((VhatestMC - Vhatmp)^2);
##### Your technical scores will be based on the MSEmu and MSEV values.
##### In general, the smaller the MSEs, the better.
##### However, there is no universal answer to how small is small enough.
##### Also, it is more difficult to predict the Variance accurately than the Mean.
##### END #####