CS 422 – Data Mining Midterm Exam solution


Part I – Short Answer (Show Points/Results) – 5 points each, 40 points total
1. Given the following feature vector x = [4.4, 5.1, −3.7, 2.1, −1.9], what would a
categorical representation of this feature vector be if we assumed discrete
categories with values x ≤ −2.5 as A, −2.5 < x < 2.5 as B, and x ≥ 2.5 as C?
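
A minimal Python sketch of this kind of threshold-based discretization (illustrative only; the bin edges are taken from the question):

```python
# Map each numeric value to a discrete category by the stated thresholds.
def discretize(value, low=-2.5, high=2.5):
    if value <= low:
        return "A"
    if value >= high:
        return "C"
    return "B"

x = [4.4, 5.1, -3.7, 2.1, -1.9]
print([discretize(v) for v in x])  # ['C', 'C', 'A', 'B', 'B']
```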

2. Given a binary classification problem with classes {C1, C2}, draw a Confusion Matrix
showing result counts (f11, f10, f1+, f0+, …) in terms of Predicted and Actual class.
Provide calculations for Accuracy and Error Rate, highlighting False Positives and False
Negatives (FP, FN) as functions of these result counts.
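
For reference, a minimal sketch of how Accuracy and Error Rate fall out of the four result counts (the counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts.
TP, FN = 40, 10   # actual positives: predicted positive / negative
FP, TN = 5, 45    # actual negatives: predicted positive / negative

total = TP + TN + FP + FN
accuracy = (TP + TN) / total    # fraction of predictions that were correct
error_rate = (FP + FN) / total  # equivalently, 1 - accuracy
print(accuracy, error_rate)     # 0.85 0.15
```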

3. For frequent itemsets {{A, B}, {C}}, show the difference between the Confidence c
vs the Interest Factor (Lift) for the Association Rule {A, B} ⟹ {C}. What value
does Lift take into account that Confidence does not?
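
A small sketch of the two measures under assumed support values; note how Lift additionally divides by the consequent's baseline support:

```python
# Hypothetical supports, chosen for illustration.
support_AB = 0.30   # s({A, B})
support_ABC = 0.24  # s({A, B, C})
support_C = 0.80    # s({C}): the consequent's baseline frequency

confidence = support_ABC / support_AB  # estimates P(C | A, B)
lift = confidence / support_C          # confidence rescaled by P(C)
print(confidence, lift)                # 0.8 1.0 -> high confidence, no real association
```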

4. Given a dataset with n observations, what is the size of the training set if we choose
to hold out k records as a test set? If we allow for k → n, what does the
corresponding training set size approach?

5. With a data set containing d = 15 features and N = 12,000 observations, what is the
dimensionality of the covariance matrix of the predictors? If we were to represent the
predictors with a multivariate normal (Gaussian) distribution, how many distribution
parameters would need to be estimated from the feature data?
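
As a sanity check on the parameter counting (d means plus the distinct entries of a symmetric d × d covariance matrix), a short sketch:

```python
d = 15
mean_params = d                 # one mean per feature
cov_params = d * (d + 1) // 2   # symmetric covariance: diagonal plus upper triangle
print((d, d), mean_params + cov_params)  # (15, 15) 135
```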

6. Given the following point observations: x1 = [3, 4] and x2 = [5, 12], what would the
length of each vector in terms of the Manhattan and Euclidean norms (L1, L2) be
defined as? Would the distance between the two points be larger under the L1 or L2
norm?
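
A minimal sketch of both norms and the corresponding distances for these two points:

```python
# L1 (Manhattan) and L2 (Euclidean) norms.
def l1(v):
    return sum(abs(c) for c in v)

def l2(v):
    return sum(c * c for c in v) ** 0.5

x1, x2 = [3, 4], [5, 12]
diff = [a - b for a, b in zip(x1, x2)]
print(l1(x1), l2(x1))      # 7 5.0
print(l1(x2), l2(x2))      # 17 13.0
print(l1(diff), l2(diff))  # 10 vs ~8.25 -> the L1 distance is larger
```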

7. Draw the 2-way contingency table for a binary association rule {A} ⟹ {B},
containing presence/absence counts (f11, f10, f1+, f0+, …). Interest Factor (Lift),
P(A, B) / (P(A) P(B)), can be interpreted in terms of conditional probability as
P(B|A) / P(B); show this probability in terms of these counts.
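
One way to see the counts-based form, using N for the total transaction count and the usual marginals (f1+ = f11 + f10, f+1 = f11 + f01); the counts here are hypothetical:

```python
# 2x2 contingency counts for {A} => {B}.
f11, f10 = 30, 10   # A present: B present / absent
f01, f00 = 20, 40   # A absent:  B present / absent

N = f11 + f10 + f01 + f00
f1p = f11 + f10     # row marginal: count of A
fp1 = f11 + f01     # column marginal: count of B

lift = (f11 / N) / ((f1p / N) * (fp1 / N))  # = N * f11 / (f1+ * f+1)
print(lift)  # 1.5
```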

8. For a binary association rule {a} ⟹ {b}, show that the ϕ coefficient for the rule's
correlation measure is not invariant under null addition (i.e., it does not remain
unchanged when unrelated data is added), in terms of changes to the relevant
counts (f11, f10, f1+, f0+, …).
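
A quick numerical check of the claim (hypothetical counts): inflating f00 with records that contain neither item changes ϕ, so the measure is not null-invariant.

```python
from math import sqrt

# Phi coefficient from 2x2 contingency counts.
def phi(f11, f10, f01, f00):
    f1p, f0p = f11 + f10, f01 + f00  # row marginals
    fp1, fp0 = f11 + f01, f10 + f00  # column marginals
    return (f11 * f00 - f10 * f01) / sqrt(f1p * fp1 * f0p * fp0)

print(phi(30, 10, 10, 50))         # ~0.583 with the original counts
print(phi(30, 10, 10, 50 + 1000))  # ~0.741 after adding unrelated records
```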

Part II – Long Answer (Show Reasoning/Calculations) – 10 points each, 40 points total
1. Show the cosine similarity of the two vectors x = [3, 4, 5] and y = [5, 12, 13]. Results
can be kept in formula form in terms of the component values of x and y (calculation
of final value not required).
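
A minimal sketch of the formula, kept in terms of the components:

```python
# cos(x, y) = (x . y) / (||x||_2 * ||y||_2)
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: sum(c * c for c in v) ** 0.5
    return dot / (norm(x) * norm(y))

# (3*5 + 4*12 + 5*13) / (sqrt(50) * sqrt(338))
print(cosine([3, 4, 5], [5, 12, 13]))
```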

2. Given a classifier with True Positives/Negatives (TP, TN) and False Positives/
Negatives (FP, FN), what is the highest Recall value r that a model can achieve?
Define the Recall measure via (TP, TN, FP, FN). How can one design a simple
model which achieves the maximum value for Recall?
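
A minimal illustration of the limiting case: a degenerate model that labels every record positive drives FN to 0, so r = TP / (TP + FN) is maximized.

```python
def recall(tp, fn):
    return tp / (tp + fn)  # r = TP / (TP + FN)

# Toy labels; the "model" predicts positive for everything,
# catching every actual positive at the cost of false positives.
labels = [1, 0, 1, 1, 0]
preds = [1] * len(labels)
tp = sum(1 for p, a in zip(preds, labels) if p == 1 and a == 1)
fn = sum(1 for p, a in zip(preds, labels) if p == 0 and a == 1)
print(recall(tp, fn))  # 1.0
```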

3. Given the following transactions: {a, b, c}, {a, c}, {b, c}, {a}, {b}, {c}, with
minsup = 60%, what itemsets would be frequent? What would the support s of
the association rule {a} ⟹ {c} be? What would the confidence c of this rule be?
Given the minsup value, would this be a valid rule that is extracted via the Apriori
Algorithm?
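
A small sketch of the support/confidence bookkeeping over these six transactions (set containment does the counting):

```python
# The transactions from the question.
T = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a"}, {"b"}, {"c"}]

def support(itemset):
    return sum(1 for t in T if itemset <= t) / len(T)

s = support({"a", "c"})                   # rule support s({a} => {c})
c = support({"a", "c"}) / support({"a"})  # rule confidence
print(support({"a"}), support({"c"}))     # 0.5 0.666...
print(s, c)                               # 0.333... 0.666...
```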

4. Given a data matrix D with d = 5 features/columns and a total variance of 100, an
analyst performs a PCA via eigenvalue decomposition, with the resulting eigenvalues
as [35, 25, 20, 15, 5]. If the analyst wishes to reduce dimensionality with 80% of
variance explained, how many dimensions would the analyst be able to reduce their
selection to? What would be the standard deviations σi of the data for each of these
selected dimensions?
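
For reference, a sketch of the cumulative-variance calculation; since the eigenvalues are the variances along the components, σi = √λi:

```python
eigenvalues = [35, 25, 20, 15, 5]  # component variances from the question
total = sum(eigenvalues)           # 100, the total variance

explained = 0.0
for k, lam in enumerate(eigenvalues, start=1):
    explained += lam / total
    print(k, explained, lam ** 0.5)  # components kept, variance explained, sigma_i
    if explained >= 0.80:
        break
```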

Part III – Essay Question (Show Argument/Proof) – 20 points each, 20 points total
1. Given a decision tree node containing 10 records, half of which belong to Class CA and
the other half of which belong to Class CB, show the impurity I of the node under the
Entropy, Gini, and Misclassification Error measures. What would the value of these
measures be for the child nodes, assuming an optimal split is performed? (Hint:
Assume 0 log2 0 = 0.)
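
A small sketch of the three impurity measures at a 50/50 node and at a pure node (the 0 log2 0 = 0 convention is handled by skipping zero proportions):

```python
from math import log2

# Impurity measures for a node with class proportions p.
def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)  # 0 * log2(0) treated as 0

def gini(p):
    return 1 - sum(x * x for x in p)

def misclassification(p):
    return 1 - max(p)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]), misclassification([0.5, 0.5]))  # 1.0 0.5 0.5
print(entropy([1.0, 0.0]), gini([1.0, 0.0]), misclassification([1.0, 0.0]))  # -0.0 0.0 0.0
```
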
Lucky 7 – Bonus Questions (Industry News, AI/ML Topics) – 1 point each, 7 points total
1. What model recently released by DeepMind allows for accurate prediction of the
3-dimensional shape of a protein molecule given its input amino acid sequence?

2. Which firm recently fired its head of AI ethics, shortly after the controversial
departure of one of its senior researchers?

3. What family of algorithms was recently developed that is able to solve classic
treasure-hunting video games such as Pitfall on Atari?

4. What disease was IBM able to predict the onset of, based on changes in writing/
language, via the use of machine learning models?

5. What category of modified videos did a consortium led by Facebook/Microsoft/
Cornell/MIT recently introduce a detection challenge for?

6. Which firm recently released a new image recognition algorithm that was trained on
over 1 billion images, but did not require manual labels?

7. What quantum computing goal was recently achieved by Google, and revealed to
the public via NASA?