## Description

## Problem 1. Short Answers. (8 points)

(a) (2 points) List two differences between the Frequent Pattern Growth algorithm and the Apriori algorithm (e.g., pattern generation, candidate generation, processing time).

(b) (2 points) The Apriori algorithm makes use of prior knowledge of subset support properties. Prove that (1) all nonempty subsets of a frequent itemset must also be frequent; and (2) the support of any nonempty subset s′ of an itemset s must be at least as large as the support of s.

(c) (2 points) Regarding measures of interestingness, what problem might Lift and χ² have? To address this problem, null-invariant measures are used. Explain why the Jaccard function is null-invariant. (Hint: Venn diagram.)

(d) (2 points) Overall, in sequential pattern mining, PrefixSpan has better performance than GSP and SPADE. Under what circumstances will PrefixSpan also have poor performance?

## Problem 2. Apriori vs Frequent Pattern Growth. (15 points)

| TID | Items |
|-----|-------|
| 1 | {A, B, C, H, P} |
| 2 | {A, C, M, B, L} |
| 3 | {B, F, O, L} |
| 4 | {A, C, D, F, G, M, P} |
| 5 | {A, C, F, P, M} |
| 6 | {C, F, H, D} |

Table 1: Transaction database.

A transactional database is given in Table 1.
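As a quick illustration of the support definition used in parts (a) and (b) (this is not the full Apriori answer), the 1-itemset supports for Table 1 can be counted with a short script; the transactions below are copied verbatim from Table 1:

```python
from collections import Counter

# Transactions from Table 1
transactions = [
    {"A", "B", "C", "H", "P"},
    {"A", "C", "M", "B", "L"},
    {"B", "F", "O", "L"},
    {"A", "C", "D", "F", "G", "M", "P"},
    {"A", "C", "F", "P", "M"},
    {"C", "F", "H", "D"},
]

MIN_SUP = 3

# Absolute support: how many transactions contain each single item
support = Counter(item for t in transactions for item in t)

# Frequent 1-itemsets: items whose support meets the threshold
frequent_1 = {item for item, s in support.items() if s >= MIN_SUP}
print(sorted(frequent_1))  # ['A', 'B', 'C', 'F', 'M', 'P']
```

Subsequent Apriori scans would join these frequent 1-itemsets into candidate 2-itemsets, prune by the subset property, and rescan the database.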

(a) (6 points) Let min_sup = 3 and apply the Apriori algorithm to the dataset. Present the intermediate results of the first three scans and the final derived frequent itemsets.

(b) (6 points) Let min_sup = 3 and sort the frequent items first in descending frequency order and then in alphabetical order. Construct the Frequent Pattern tree from the dataset in the same format as in the lecture.

(c) (3 points) List the top-3 association rules in terms of confidence from 2-itemsets.

## Problem 3. Frequent Pattern Mining Implementation (12 points)

Implement the core algorithm of Apriori (directly calling Apriori from external packages is NOT allowed) on the grocery market transaction data (i.e., groceries.csv), and finish the following tasks.

(a) (2 points) Report data statistics, for example, #product types, average #products per transaction, the 5 most popular products, etc.

(b) (5 points) Manually define the threshold settings, e.g., min_support and min_confidence; record the settings and list the top-10 corresponding frequent itemsets, if any exist. Choose any pair of items (e.g., "yogurt" and "coffee"), construct the contingency table, and evaluate the interestingness using Lift. Report the table and the evaluation results.
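For part (b), Lift can be computed directly from the cells of the contingency table. A minimal sketch with made-up counts (the numbers below are hypothetical, not values from groceries.csv):

```python
def lift(n_ab, n_a, n_b, n_total):
    """Lift(A, B) = P(A and B) / (P(A) * P(B)), computed from absolute counts.

    n_ab: transactions containing both A and B
    n_a, n_b: transactions containing A (resp. B)
    n_total: total number of transactions
    """
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_ab / (p_a * p_b)

# Hypothetical contingency-table counts (NOT the real groceries.csv values)
print(lift(n_ab=100, n_a=400, n_b=500, n_total=2000))  # 1.0 -> independent
```

Lift > 1 suggests a positive correlation between the two items, Lift < 1 a negative one, and Lift = 1 independence.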

(c) (5 points) Report the running time of Apriori w.r.t. different data sizes (e.g., let the number of transactions be 1,000, 2,000, etc.). Compare the efficiency of Apriori and FP-Growth (you may directly call third-party libraries to run FP-Growth).

## Problem 4. Sequential and Graph Pattern Mining (10 points)

(a) A sequential database is presented in Table 2.

(1) (3 points) Given the set of prefixes ⟨ab⟩, ⟨(ab)⟩, ⟨de⟩, ⟨abc⟩, ⟨acb⟩, ⟨acba⟩, list the corresponding projection(s) according to the definition of prefix and projection;

(2) (2 points) List the sequential pattern(s) with support = 3 and length ≥ 3;

| SID | Sequence |
|-----|----------|
| 1 | ⟨a(bcf)(ac)b⟩ |
| 2 | ⟨(ab)c(bc)ab(ce)⟩ |
| 3 | ⟨de(af)acb(acf)acd⟩ |
| 4 | ⟨(ad)de(abc)(ae)⟩ |
| 5 | ⟨(aef)ac(abd)df(ab)⟩ |
| 6 | ⟨(abc)eab(cd)f(abd)⟩ |

Table 2: Sequence Database.

(Figure not reproduced here: the first row shows input graphs G1–G5 and the second row shows subgraph patterns P1–P3.)

Figure 1: Input graphs and subgraph patterns.

(b) Given the input graphs (i.e., the first row) and the patterns (i.e., the second row) in Figure 1:

(1) (3 points) Compute the support (as a fraction) of the subgraph patterns and list the corresponding input graphs in which the patterns appear.

(2) (2 points) Let min_sup = 3; show how the Apriori-based method works to find all frequent graph patterns using either edge growing or vertex growing (present the intermediate results/patterns at each iteration).

## Problem 5. SVM, RBF Kernel (15 points)

Given n training examples (x_i, y_i) (i = 1, 2, . . . , n), where x_i is the feature vector of the i-th training example and y_i is its label, we train a support vector machine (SVM) with a Radial Basis Function (RBF) kernel on the training data. Note that the RBF kernel is defined as

K_RBF(x, y) = exp(−γ ||x − y||₂²).

(a) (5 points) Let G be the n×n kernel matrix of the RBF kernel, i.e., G[i, j] = K_RBF(x_i, x_j). Prove that all eigenvalues of G are non-negative.

(b) (5 points) Prove that the RBF kernel is the sum of an infinite number of polynomial kernels.

(c) (3 points) Suppose the distribution of training examples is shown in Figure 2, where '+' denotes a positive example and '−' denotes a negative example. If we set γ large enough (say 1000 or larger), what could possibly be the decision boundary of the SVM after training? Please draw it on Figure 2.

Figure 2: Distribution of Training Examples.

(d) (2 points) If we set γ to be infinitely large, what could possibly happen when training this SVM?

## Problem 6. Feature Selection (15 points)

Given the risk auditing dataset, where every row refers to a data sample, the last column denotes the labels of the data samples, and the remaining columns serve as the input features, try to:

(a) (5 points) implement an SVM classifier using train.csv, validation.csv, and test.csv. Report the classification results over the test set. (3rd-party packages are allowed)

(b) (10 points) improve the performance of the SVM classifier by implementing the Fisher Score feature selection method. (3rd-party packages are NOT allowed)
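One common form of the Fisher Score for feature j is F_j = Σ_k n_k (μ_kj − μ_j)² / Σ_k n_k σ_kj², where k ranges over classes, n_k is the class size, μ_kj and σ_kj² are the per-class mean and variance of feature j, and μ_j is the overall mean. A minimal pure-Python sketch under that definition (check it against the exact formula given in the lecture before reusing it):

```python
def fisher_scores(X, y):
    """Fisher Score per feature: between-class scatter / within-class scatter.

    X: list of samples (each a list of feature values); y: list of class labels.
    Uses F_j = sum_k n_k*(mu_kj - mu_j)^2 / sum_k n_k*var_kj (one common form).
    """
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    mu = [sum(row[j] for row in X) / n for j in range(d)]  # overall means
    scores = []
    for j in range(d):
        between, within = 0.0, 0.0
        for c in classes:
            vals = [X[i][j] for i in range(n) if y[i] == c]
            n_c = len(vals)
            mu_c = sum(vals) / n_c
            var_c = sum((v - mu_c) ** 2 for v in vals) / n_c
            between += n_c * (mu_c - mu[j]) ** 2
            within += n_c * var_c
        scores.append(between / within if within > 0 else float("inf"))
    return scores

# Toy data: feature 0 separates the classes, feature 1 is noise
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [1.1, 2.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
# feature 0 should get a much higher score than feature 1
```

Features are then ranked by score, and the number of kept features is a hyper-parameter to tune on the validation set.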

Note that you should strictly follow the conventional machine learning paradigm: train your model on the training set, select the best hyper-parameters on the validation set, and report the corresponding results on the test set. The source code you write should be submitted with the .pdf report in the compressed .zip file.

## Problem 7. Self-Training (15 points)

Following the improvement made in Problem 6 via feature selection, try to further improve the classification results by self-training using the unlabelled data in unlabelled.csv.

Note that all the columns in unlabelled.csv should serve as input features. You should strictly follow the conventional machine learning paradigm: train your model on the training set, select the best hyper-parameters on the validation set, and report the corresponding results on the test set. The source code you write should be submitted with the .pdf report in the compressed .zip file. [HINT: change the output of the SVM into a probability output and define the 'confidence' towards every unlabelled sample.] (For the implementation of the SVM, 3rd-party packages are allowed, but for the implementation of the self-training framework, 3rd-party packages are NOT allowed.)
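Following the hint, one natural definition of 'confidence' is the maximum class probability. A small helper for selecting pseudo-labelled samples, assuming the SVM's probability outputs are available as per-class probability lists (the threshold 0.75 below is an arbitrary example, to be tuned on the validation set):

```python
def select_confident(probs, threshold=0.75):
    """Pick unlabelled samples whose top class probability meets the threshold.

    probs: probs[i][c] is the predicted probability of class c for sample i.
    Returns (indices, pseudo_labels) for the selected samples.
    """
    indices, labels = [], []
    for i, p in enumerate(probs):
        conf = max(p)                      # 'confidence' = max class probability
        if conf >= threshold:
            indices.append(i)
            labels.append(p.index(conf))   # pseudo-label = argmax class
    return indices, labels

# Example probability outputs for three unlabelled samples (made-up numbers)
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
idx, pseudo = select_confident(probs)
# samples 0 and 2 are confident; their pseudo-labels are classes 0 and 1
```

The self-training loop then adds the selected samples with their pseudo-labels to the training set, retrains the SVM, and repeats until no confident samples remain or performance on the validation set stops improving.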

## Problem 8. Random Walk with Restart (10 points)

(a) (3 points) From its formula, explain why random walk with restart (RWR) can capture multiple weighted relationships between nodes.

(b) (7 points) Implement random walk with restart (RWR) on the email-EU-core unweighted undirected graph. The given file is an edge list; since the graph is unweighted and undirected, when you see 'i j' in the edge list, set both A[i, j] and A[j, i] to 1. Set the preference vector e to a uniform vector to obtain a global ranking, and set the damping factor c to 0.9. Report the indices of the top-10 nodes; the source code you write should be submitted with the .pdf report in the compressed .zip file. [HINT: normalize the adjacency matrix before RWR: Ã = D^(−1/2) A D^(−1/2), where D is a diagonal matrix s.t. D[i, i] = Σ_j A[i, j].] (3rd-party packages are NOT allowed)
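A minimal RWR sketch on a tiny hand-made graph (the edges below are hypothetical; the real assignment uses the email-EU-core edge list), following the hint's symmetric normalization and iterating r ← c·Ã·r + (1−c)·e to a fixed point:

```python
import math

# Toy undirected edge list (hypothetical; replace with email-EU-core)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Build the symmetric adjacency matrix A
A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0

# Symmetric normalization: A_tilde = D^(-1/2) A D^(-1/2)
deg = [sum(row) for row in A]
A_t = [[A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
        for j in range(n)] for i in range(n)]

c = 0.9                      # damping factor, as in the assignment
e = [1.0 / n] * n            # uniform preference vector -> global ranking
r = e[:]

# Power iteration: r = c * A_tilde * r + (1 - c) * e
for _ in range(1000):
    r_new = [c * sum(A_t[i][j] * r[j] for j in range(n)) + (1 - c) * e[i]
             for i in range(n)]
    if max(abs(a - b) for a, b in zip(r_new, r)) < 1e-12:
        r = r_new
        break
    r = r_new

ranking = sorted(range(n), key=lambda i: -r[i])
# the best-connected node (node 2) ranks first on this toy graph
```

The iteration converges because the spectral radius of c·Ã is below 1; on the real graph the same loop is run over the full adjacency matrix and the top-10 indices of the converged r are reported.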