DSCC465 Int. to Statistical Machine Learning Problem Sets 1-5


DSCC465 Int. to Statistical Machine Learning Problem Set – 1

Questions
1) Write a program using Python that does the following:
– Takes two matrices of any size as the input
– Returns their dot product as the output
Note: You cannot use pre-packaged algorithms for matrix operations for this question.
You can use numpy or pandas to store your data (not for calculations).
Please do the following:
a. Please test the following matrix multiplications using your hand-written code and report the results:
b. Compare the results to the packaged dot product numpy.dot. Are they the same?
c. Please add your code to your .pdf file and also save it as an .ipynb file.
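A minimal sketch of a hand-written dot product (the name matmul_f is illustrative, not required by the assignment; numpy is used only for storage, per the note above):

```python
import numpy as np

def matmul_f(A, B):
    """Dot product of two matrices without packaged matrix math."""
    A, B = np.asarray(A), np.asarray(B)   # storage only, per the assignment
    n, k = A.shape
    k2, m = B.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((n, m))
    for i in range(n):                    # row of A
        for j in range(m):                # column of B
            s = 0.0
            for t in range(k):            # accumulate elementwise products
                s += A[i, t] * B[t, j]
            C[i, j] = s
    return C

# Sanity check against the packaged version:
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_f(A, B))
print(np.dot(A, B))
```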
2) Assume that we have two (2) d-dimensional real vectors x and y, and denote by x_i (or y_i) the value in the i-th coordinate of x (or y). Prove or disprove the following statements by checking non-negativity, definiteness, homogeneity, and the triangle inequality.
a. The following distance function is a metric. (5 points)
b. The following distance function is a metric. (5 points)
c. The following distance function is a metric. (10 points)
3) Calculating by hand, find the characteristic polynomial, eigenvalues, and eigenvectors of the following matrix:
4) Provide a proof for the following: Let A, B, and C be any n x n matrices:
a. Show that trace(ABC) = trace(CAB) = trace(BCA) (10 points)
b. trace(ABC) = trace(BAC). Provide a proof or a counterexample (10 points)
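For part a, a standard index computation may help (a sketch, not the only route):

```latex
% Sketch for part a: expand the trace as a triple sum over indices.
\operatorname{tr}(ABC) = \sum_{i}(ABC)_{ii}
                       = \sum_{i}\sum_{j}\sum_{k} A_{ij}\, B_{jk}\, C_{ki}.
% The same triple sum, grouped with the outer index on C, equals
\operatorname{tr}(CAB) = \sum_{k}\sum_{i}\sum_{j} C_{ki}\, A_{ij}\, B_{jk},
% and grouped with the outer index on B it equals tr(BCA). Part b asks
% whether the non-cyclic swap tr(ABC) = tr(BAC) also holds in general.
```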
5) Let A and B be n x n matrices with AB = 0.
Each question below is 5 points. Provide a proof or counterexample for each of the
following:
a) BA = 0
b) Either A = 0 or B = 0 (or both)
c) If det(A) = -3, then B = 0
d) There is a vector v ≠ 0 such that BAv = 0

DSCC465 Int. to Statistical Machine Learning Problem Set – 2

Questions
1) Suppose you’re on a game show, and you’re given the choice of three doors: Behind one
door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who
knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then
says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your
choice? (Note: Please show your solution step-by-step by using what you know about
marginal probability, conditional probability, joint probability, and Bayes' theorem)
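One way to organize the Bayes computation (a sketch; the event labels C_i and H_3 are my own notation, not part of the problem statement):

```latex
% C_i: the car is behind door i; H_3: the host opens door 3 after you pick door 1.
% Priors: P(C_i) = 1/3. The host's behavior gives the likelihoods:
%   P(H_3 | C_1) = 1/2,   P(H_3 | C_2) = 1,   P(H_3 | C_3) = 0.
P(C_2 \mid H_3)
  = \frac{P(H_3 \mid C_2)\,P(C_2)}{\sum_{i=1}^{3} P(H_3 \mid C_i)\,P(C_i)}
  = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2}\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} + 0}
  = \frac{2}{3}
```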
2) Suppose we have two NBA teams – for simplicity, team A and team B – who have made it to the NBA Playoffs. In each game between these two teams, team A has a winning probability of 0.55, and team B has a winning probability of 0.45. What is the probability that these two teams will play the 7th game in the NBA Playoffs? (Notes: (i) there cannot be a tie in any game; (ii) please check this link for more information about the NBA Playoffs and to think about the possible combinations: https://en.wikipedia.org/wiki/NBA_playoffs; (iii) please show your solution step-by-step by using what you know about marginal probability, conditional probability, joint probability, and Bayes' theorem.)
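A compact framing of the calculation (assuming game outcomes are independent with p = 0.55):

```latex
% A 7th game is played iff the series is tied 3-3 after six games:
P(\text{game 7}) = \binom{6}{3}\, p^{3}(1-p)^{3}
                 = 20 \cdot 0.55^{3} \cdot 0.45^{3} \approx 0.303
```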
3) From scratch (not using any pre-packaged tools for direct calculation), implement the
gradient descent algorithm for linear regression and test your results on the California
Housing Dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
Here is what you need to do step by step:
a. Implement the gradient descent algorithm from scratch
b. Choose the following features from the dataset as your X matrix: MedInc,
HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
c. Choose the following feature from the dataset as your Y matrix: MedHouseVal
d. Randomly split your data into training (70% of total) and test sets (30% of total)
by using sklearn's train_test_split function. Set random_state = 265:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
e. Set the number_of_steps = 1000 and learning_rate = 0.01.
f. By running your code, determine the best set of parameters (=weights) for the
constant and your features listed in b). Your cost function will be MSE (=you should
pick the set of parameters that give you the lowest MSE).
g. Report and interpret the results. What are the factors that explain the house
prices the most?
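A minimal sketch of steps a-f (variable names are illustrative; the assignment fixes only the feature list, number_of_steps, learning_rate, and random_state; the feature standardization step is my own addition, since raw-scale features tend to diverge at this learning rate):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=True).frame
features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude", "Longitude"]
X = data[features].to_numpy()
y = data["MedHouseVal"].to_numpy()

# d. 70/30 split with the required seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=265)

# Standardize features (an assumption beyond the assignment text; it helps
# gradient descent converge at learning_rate = 0.01)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_test = (X_train - mu) / sigma, (X_test - mu) / sigma

def gradient_descent(X, y, number_of_steps=1000, learning_rate=0.01):
    """Batch gradient descent on the MSE cost; returns weights incl. constant."""
    Xb = np.c_[np.ones(len(X)), X]          # prepend the constant column
    w = np.zeros(Xb.shape[1])
    for _ in range(number_of_steps):
        grad = (2.0 / len(y)) * Xb.T @ (Xb @ w - y)   # d(MSE)/dw
        w -= learning_rate * grad
    return w

w = gradient_descent(X_train, y_train)
test_mse = np.mean((np.c_[np.ones(len(X_test)), X_test] @ w - y_test) ** 2)
print(w, test_mse)
```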
4) Now, try using a pre-packaged tool and comparing the results. Do the following:
a. Use SGDRegressor provided by scikit:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
b. Steps b), c), and d) are the same as in Question 3.
c. Set max_iter = 1000, alpha = 0.01, random_state = 265, and
loss = 'squared_error'. Other parameters should be set to 'default'.
d. By running your code, determine the best set of parameters (=weights) for the
constant and your features listed in b).
e. Report and interpret the results. What are the factors that explain the house
prices the most? Are the results different from the previous question? If
different, explain why the results might be different.
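A sketch of the packaged counterpart (reusing the split and scaled matrices from the sketch above; only the parameters the question names are set explicitly):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(max_iter=1000, alpha=0.01, random_state=265,
                   loss='squared_error')    # other parameters left at default
sgd.fit(X_train, y_train)
print(sgd.intercept_, sgd.coef_)            # weights for the constant and features
print(np.mean((sgd.predict(X_test) - y_test) ** 2))   # test MSE
```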
5) Finally, write a function from scratch that computes a variance-covariance matrix by
transforming the following formula into code:
Variance-covariance matrix: $\operatorname{cov}(\mathbf{X}) = \mathbb{E}\left[(\mathbf{X} - \mathbb{E}[\mathbf{X}])\,(\mathbf{X} - \mathbb{E}[\mathbf{X}])^{T}\right]$
Your function/code should work for matrices of any size. Test that your function is running
(=successfully computing the variances and covariances of the variables and variable pairs
in the dataset) by using the California Housing Dataset that you have used in previous
questions.
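A direct transcription of the formula into code might look like the sketch below (cov_f is an illustrative name; note that, unlike numpy.cov's default, this divides by n rather than n − 1):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing

def cov_f(X):
    """Variance-covariance matrix E[(X - E[X])(X - E[X])^T]; columns = variables."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # X - E[X], column-wise
    return centered.T @ centered / len(X)    # population covariance (divide by n)

# Test on the California Housing data used in the previous questions:
housing = fetch_california_housing(as_frame=True).frame
print(cov_f(housing.to_numpy()))
```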

DSCC465 Int. to Statistical Machine Learning Problem Set – 3

Questions
In this homework, you have three questions. The first question is worth 20 points. The remaining
two questions are each worth 40 points.
1) Read the following highly-cited article by Fearon and Laitin (2003):
https://cisac.fsi.stanford.edu/publications/ethnicity_insurgency_and_civil_war
Answer the following:
a. What is the paper about? (Please write a paragraph – max. 250 words)
b. How many observations do the authors have in their dataset? What does each
observation represent? (=What is the unit of analysis?)
c. What is the identification strategy of the authors? (=What are the different
regression equations they are running?) Please write them down in the form of
equations and explain. Identify the independent and dependent variables.
d. What do the coefficient values listed in Table 1 represent? (theoretically speaking)
e. Which independent variables have positive coefficients? Which independent
variables have negative coefficients? Which ones are statistically significant?
f. Thinking about the range of your independent variables, which variables do you
think have a greater impact on the dependent variable(s)?
2) Build a two-class logistic regression model from scratch. You will need to work on the
following:
a. Implement the sigmoid function from scratch and call it sigmoid_f
b. Implement the hypothesis function from scratch and call it classifier_f
c. Implement the entropy function as your cost function and call it
binary_loss_f
d. Implement gradient descent for logistic regression and call it gradient_f
e. Combining the functionalities of what you have coded above, create an optimizer
function and call it optimizer_f. Note: You should find out the input and
output to the functions above by reviewing the class notes and the textbook; in
other words, this will be part of the challenge! If needed, use 265 as your random
seed.
Let's test your code on a dataset. Load the Breast Cancer Wisconsin Dataset provided by sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer
Now, do the following:
a. Set the target column as your Y variable.
b. Set all other numeric variables (excluding index) as your X matrix.
c. Apply 0-1 normalization on both the X matrix and Y vector.
d. Run logistic regression by using the code you have written (no need to do
train/test split). Set the maximum number of iterations to 10,000.
e. Report the final equation you have obtained for logistic regression.
f. Also indicate which coefficients are positively associated and which
coefficients are negatively associated with the target variable. Rank them
from positive to negative. Interpret the results.
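A compact sketch of how the five functions might fit together (the assignment deliberately leaves the exact inputs and outputs to you, so the signatures below are one possible choice, not the required one):

```python
import numpy as np

def sigmoid_f(z):
    return 1.0 / (1.0 + np.exp(-z))

def classifier_f(X, w):
    """Hypothesis: predicted probability that y = 1."""
    return sigmoid_f(X @ w)

def binary_loss_f(y, p, eps=1e-12):
    """Binary cross-entropy cost."""
    p = np.clip(p, eps, 1 - eps)           # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_f(X, y, w):
    """Gradient of the cross-entropy cost with respect to w."""
    return X.T @ (classifier_f(X, w) - y) / len(y)

def optimizer_f(X, y, learning_rate=0.1, max_iter=10_000):
    Xb = np.c_[np.ones(len(X)), X]         # intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        w -= learning_rate * gradient_f(Xb, y, w)
    return w, binary_loss_f(y, classifier_f(Xb, w))

# Usage on the Breast Cancer data with 0-1 normalization, per parts a-d:
# from sklearn.datasets import load_breast_cancer
# X, y = load_breast_cancer(return_X_y=True)
# X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# w, final_loss = optimizer_f(X, y)
```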
3) Implement the three following cross-validation algorithms from scratch:
a. Leave-one-out cross-validation
b. K-fold cross-validation
c. Train-test split cross-validation
Test your results on the California Housing Dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing
Now, do the following:
a. Implement the cross-validation algorithms from scratch.
b. Choose the following features from the dataset as your X matrix: MedInc,
HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
c. Choose the following feature from the dataset as your Y matrix: MedHouseVal
d. Apply 0-1 normalization on X and Y.
e. Apply the cross-validation algorithms that you implemented to train your model.
(For splitting your data always use 265 as your random number or seed value).
Note: You cannot use the pre-packaged algorithms for splitting the data. To split
the data please do the following:
a. Install the random package written for Python
b. Set (the initial) random.seed() to 265
c. Create a list of integers that will function as your index numbers:
list(range(0, len(name_of_dataset)))
d. Pick one integer for Train-Test Split CV from the list you created in c) to
split 70% of your data into the training set and the remaining 30% into the test set.
e. For the K-fold CV, set k = 5. Please divide the dataset into 5 quasi-equal
portions starting from index 0.
f. For LOOCV, start the training by randomly picking a feature vector
associated with an index in your dataset (Reminder: random seed is 265)
– you will need to run the model on every point.
f. Using scikit’s sklearn.linear_model.LinearRegression, predict
the house prices by using all of the data in your X matrix. Compare different
techniques of CV. Which CV provides the lowest MSE? Why? Interpret the results.
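One reading of the splitting recipe above, sketched with the random package only (the function names are illustrative):

```python
import random

random.seed(265)                            # set once, per step b

def train_test_indices_f(n, train_frac=0.70):
    """Single shuffled 70/30 cut of the index list from step c."""
    idx = list(range(0, n))
    random.shuffle(idx)
    cut = int(train_frac * n)
    return idx[:cut], idx[cut:]

def kfold_indices_f(n, k=5):
    """k quasi-equal folds starting from index 0, per step e."""
    idx = list(range(0, n))
    random.shuffle(idx)
    size = n // k
    folds = [idx[i * size:(i + 1) * size] for i in range(k)]
    folds[-1].extend(idx[k * size:])        # leftovers join the last fold
    return folds

# LOOCV (step f) is the k = n special case: each index is its own test fold.
```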

DSCC465 Int. to Statistical Machine Learning Problem Set – 4

Data Pre-Processing (40 points)
1) [20 points] Using the pandas package for Python, import the corona_fake.csv dataset,
and do the following:
a) [5 points] Import the nltk package. Check the documentation:
https://www.nltk.org/
b) [15 points] Take a look at the text column in the dataset, and do the following:
i. [3 points] Using nltk.word_tokenize(), tokenize the text.
ii. [3 points] Using the POS-tagging feature (nltk.pos_tag), POS-tag the
tokenized words.
iii. [3 points] Using WordNetLemmatizer (from nltk.stem import
WordNetLemmatizer) lemmatize the pos-tagged words you obtained
above. (Hint: If there is no available tag, append the token as is; else, use the
tag to lemmatize the token)
iv. [3 points] Using the list of stop words that can be imported (from nltk.corpus
import stopwords), remove the stopwords from the lemmatized text [Note: the
language needs to be set as 'english'.].
v. [3 points] Finally, also remove numbers, words that are shorter than 2
characters, punctuation, links, and emojis. Then convert the obtained list of
tokenized+tagged+lemmatized+cleaned words back into a single string (joined by
a space ' ') and add the result as a text_clean column to your dataset.
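A sketch of parts i-v in sequence (the tag_map helper is my own addition, since nltk.pos_tag emits Penn Treebank tags that WordNetLemmatizer does not consume directly; df is the dataframe holding the text column):

```python
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet", "stopwords"):
    nltk.download(pkg, quiet=True)          # fetch the required nltk data

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
tag_map = {"J": wordnet.ADJ, "N": wordnet.NOUN,
           "V": wordnet.VERB, "R": wordnet.ADV}

def clean_text_f(text):
    tokens = nltk.word_tokenize(str(text))                    # i. tokenize
    tagged = nltk.pos_tag(tokens)                             # ii. POS-tag
    lemmas = [lemmatizer.lemmatize(tok, tag_map[tag[0]])      # iii. lemmatize;
              if tag[0] in tag_map else tok                   # token as-is when
              for tok, tag in tagged]                         # no usable tag
    kept = [w for w in lemmas if w.lower() not in stop_words] # iv. stopwords
    kept = [w for w in kept if w.isalpha() and len(w) >= 2    # v. drop numbers,
            and not w.lower().startswith("http")]             # punctuation, links
    return " ".join(kept)

# df["text_clean"] = df["text"].apply(clean_text_f)
```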
2) [20 points] Let’s vectorize the data we produced above by using two approaches: Bag of
Words (BOW) and TF-IDF; and, at the end, we will make a prediction:
a. [5 points] Read the following page: https://en.wikipedia.org/wiki/N-gram. Explain
what an ‘n-gram’ is and why it is helpful in max. 200 words.
b. [5 points] Import CountVectorizer and TfidfVectorizer:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
c. [5 points] Using CountVectorizer, create three vectorized representations of
text_clean [set lowercase=True]:
i. One vectorized representation where ngram_range = (1,1)
ii. One vectorized representation where ngram_range = (1,2)
iii. One vectorized representation where ngram_range = (1,3)
d. [5 points] Using TfidfVectorizer, create three vectorized representations of
text_clean [set lowercase=True]:
i. One vectorized representation where ngram_range = (1,1)
ii. One vectorized representation where ngram_range = (1,2)
iii. One vectorized representation where ngram_range = (1,3)
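A sketch of parts c and d together (df carries the text_clean column built in Question 1):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow_mats, tfidf_mats = {}, {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    bow = CountVectorizer(lowercase=True, ngram_range=ngram_range)
    tfidf = TfidfVectorizer(lowercase=True, ngram_range=ngram_range)
    bow_mats[ngram_range] = bow.fit_transform(df["text_clean"])
    tfidf_mats[ngram_range] = tfidf.fit_transform(df["text_clean"])
```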
Prediction (20 points)
3) [20 points] Now, let's use sklearn.linear_model.LogisticRegressionCV
to do some predictions. Set cv = 5, random_state = 265, max_iter = 1000, and
n_jobs = -1 (other parameters should be left as default) [Note: training
size is 70%, test size is 30%, split by random_state = 265].
a. [10 points] By using the three (3) different versions of the CountVectorizer
dataset you created above, run logistic regression to predict class labels (fake,
true). Report three (3) accuracy values associated with each of the regressions.
b. [10 points] By using the three (3) different versions of the TfidfVectorizer
dataset you created above, run logistic regression to predict class labels (fake,
true). Report three (3) accuracy values associated with each of the regressions.
c. Combine and report all accuracy values in a table (6 values in total).
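One way to wire up a single run (repeat over the six matrices built above; the name of the fake/true label column is an assumption about corona_fake.csv):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = bow_mats[(1, 1)]                  # one of the six vectorized matrices
y = df["label"]                       # assumed column name for fake/true

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=265)

clf = LogisticRegressionCV(cv=5, random_state=265, max_iter=1000, n_jobs=-1)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```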
Theoretical question (40 points)
4) [40 points] Check the optimizer (solver) functions used by
sklearn.linear_model.LogisticRegressionCV. For each function, explain
in around 100 words what it means; specifically:
a. [8 points] What does newton-cg mean?
b. [8 points] What does lbfgs mean?
c. [8 points] What does liblinear mean?
d. [8 points] What does sag mean?
e. [8 points] What does saga mean?
Note: For this question you might need to do some online research. It is your job to find
out how they work. You are also welcome to use formulas / matrices in your description.

DSCC465 Int. to Statistical Machine Learning Problem Set – 5

Questions
1) [20 points] Download the dataset called ‘country_information.xlsx’ that can be found under
the ‘Data’ tab on BlackBoard. Do the following:
a. [10 points] Provide a summary of what the dataset is about (around 100 words) by
checking the variable names.
b. [10 points] Excluding the ‘country’ column, apply 0-1 normalization on the numeric
columns. Save the resulting dataset as:
‘country_information_normalized.xlsx’ [Note: Do not forget to add the ‘country’ column
to the normalized dataset. For normalization, you can use a package.]
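A short sketch of part b (pandas for I/O; the min-max step is written out explicitly, though the note also allows a packaged scaler such as sklearn's MinMaxScaler):

```python
import pandas as pd

df = pd.read_excel("country_information.xlsx")
numeric = df.drop(columns=["country"])

# 0-1 (min-max) normalization, column by column
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
normalized.insert(0, "country", df["country"])   # re-attach the country column
normalized.to_excel("country_information_normalized.xlsx", index=False)
```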
2) [20 points] Code the kmeans++ algorithm from scratch. For more information about the
individual steps of the algorithm, please check here:
https://en.wikipedia.org/wiki/K-means%2B%2B.
As input, your algorithm should take a numpy matrix or a pandas dataframe and a k value
that denotes the expected number of clusters. The output needs to be the labels associated
with feature vectors coming from your dataset.
Note: You are welcome to use pre-packaged algorithms to calculate distances and means. If
you need to pick a point randomly, please do the following:
i. Import the random package of Python.
ii. Set seed to 265 by running the following line: random.seed(265) [This should be
done at the very beginning of your code file, after importing the packages.]
iii. Run the following line: random.randrange(0, len(name_of_your_dataset), 1).
Use the resulting number as the index number for the data point that should be
randomly picked in different stages of the kmeans++ algorithm.
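A condensed sketch of the whole algorithm under the randomness recipe above (numpy handles distances and means, which the note permits; kmeanspp_f is an illustrative name):

```python
import random
import numpy as np

random.seed(265)   # set once near the top of the file, per step ii

def kmeanspp_f(X, k, n_iter=100):
    """kmeans++ seeding followed by standard Lloyd iterations; returns labels."""
    X = np.asarray(X, dtype=float)
    # kmeans++ initialization: first center uniformly at random
    centers = [X[random.randrange(0, len(X), 1)]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # next center drawn with probability proportional to d2
        idx = random.choices(range(len(X)), weights=d2.tolist(), k=1)[0]
        centers.append(X[idx])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):        # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```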
For the remainder of the analysis, use the ‘country_information_normalized.xlsx’ dataset you
created in Q1.
3) [20 points] Now, we will test the code we have written in Q2 and apply dimension reduction:
Specifically, do the following:
a. [10 points]. Set the random seed to 265 again (to (re-)guarantee the same initialization).
Set k = 6. Run your kmeans++ code on the ‘country_information_normalized.xlsx’ dataset
by excluding the ‘country’ column.
Record the labels. Attach the labels as a new column to your dataset by naming your new
variable as kmeans_label.
b. [10 points] Excluding the ‘country’ and ‘kmeans_label’ columns, run dimension reduction (specifically PCA) on your dataset by using sklearn's PCA function: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.
[Note: set n_components = 2 and random_state = 265. Other parameters
should be left as 'default'.] Add the new variables to your dataset as pca_dim_1 and
pca_dim_2.
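A sketch of part b (df is the normalized dataset with kmeans_label already attached from part a):

```python
from sklearn.decomposition import PCA

features = df.drop(columns=["country", "kmeans_label"])
pca = PCA(n_components=2, random_state=265)      # other parameters at default
coords = pca.fit_transform(features)
df["pca_dim_1"], df["pca_dim_2"] = coords[:, 0], coords[:, 1]
```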
For the next question, use the attached ‘visualization_code.py’ file.
4) [20 points] Now, let’s visualize the results, use the clustering labels to color our data points,
and present them in convex hulls. Run the code provided to you in the
‘visualization_code.py’ file. Change the name of the dataset where it says […]. Add the visual
to your .pdf submission.
Note: For this exercise, you will need to find and import the required packages yourself. The resulting plot should look (somewhat) similar to the example figure included with the original assignment (but you will have k = 6).
5) [20 points] Interpret the results (in around 300 words) by answering the following:
a. [5 points] Which countries seem to be similar? Why do you think these countries are
clustered together?
b. [5 points] If you run the kmeans++ algorithm more than once, do you think the results
will change?
c. [5 points] (Subjectively speaking) Do you think this is an accurate clustering of the
countries? Would the results change greatly if we had different social/economic
variables?
d. [5 points] Do you think PCA may have affected the results at all? In other words, if we had
a different number of principal components, would our visual interpretation be different?