## Description

1. After your yearly checkup, the doctor has good news and bad news. The bad news is that you tested positive for a

serious disease and that the test is very accurate: the probability of testing positive when you do have the disease

is 0.983, and the probability of testing negative when you don’t have the disease is 0.945. The good news is that

this is a rare disease, striking only one in ten thousand people in your demographic.

(a) What are the chances you have the disease?

(b) Now assign a cost to the errors: deciding to seek treatment for the cancer when in fact you are healthy will

cost you $1000 in unnecessary tests and the recovery therefrom. Deciding to forgo treatment when in fact you

have the cancer will cost you and your family $1,000,000 in loss of life/income etc. Assume a correct decision

(seek treatment if you have cancer, forgo treatment if you are healthy) has no cost, for simplicity.

(c) What is the expected cost (i.e., “risk”) assuming the cancer test comes out positive and you undergo treatment?

(d) What should your decision be after a positive test? (Is this different from the answer to part (a)?)

(e) What is the expected cost if the cancer test is negative and you do not undergo treatment?

2. We want to build a pattern classifier with a continuous attribute using Bayes’ Theorem. The object to be classified

has one feature, x in the range 0 ≤ x ≤ 4. The conditional probability density functions for each class are,

respectively,

p(x|C1) = 1

4

if 0 ≤ x < 4

0 otherwise

p(x|C2) =

x − 1 if 1 ≤ x < 2

3 − x if 2 ≤ x < 3

0 otherwise

0 1 2 3 4 5

0.25

0.50

0.75

1.00

p

x

p(x|C2)

p(x|C1)

(a) Assuming equal priors, P(C1) = P(C2) = 0.5, classify an object with the attribute value x = 1.5.

(b) Assuming unequal priors, P(C1) = 0.75, P(C2) = 0.25, classify the object with the attribute value x = 1.5

(c) Consider a decision function φ(x) of the form φ(x) = (|x − 2|) − α with one free parameter α in the range

0 ≤ α ≤ 1. You choose Class 2 for a given input x if and only if φ(x) < 0, or equivalently 2 − α < x < 2 + α,

otherwise you choose class 1. What is the optimal decision boundary – that is, what is the value of α which

minimizes the probability of misclassification? What is the resulting probability of misclassification with this

optimal value for α? Assume equal priors. Hint: take advantage of the symmetry around x = 2.

(d) Assume equal priors. Also assume there are penalties when choosing a class as follows:

true true

class class

is 1 is 2

you classify object as Class 1 −5 +1

you classify object as Class 2 +3 −5

What is the decision boundary (optimal value for α) that would minimize the expected penalty?

1

(e) Compute the estimated means and standard deviations for the conditional probability density for each class

separately [use the unbiased estimates]. Plot corresponding normal (Gaussian) density functions using these

estimated means and variances.

3. Consider a sample training set in one dimension with attribute values in the interval [0, b], and 2 classes. Suppose

your space of possible classifiers (“hypothesis space” Hk) consists of “bucket” classifiers constructed by dividing

the interval [0, b] into k equal subintervals and assigning class 1 or 2 to each subinterval. Your only choices are

the number k and the class assignment for each subinterval. The learning process is to determine which class to

associate with each subinterval. Assume the number k of sub-intervals is given and fixed.

(a) How many different classifiers are there in the Hypothesis space Hk?

(b) What is the VC dimension of [0, b] with respect to Hk?

4. Implement a program to fit two multivariate Gaussian distributions to the 2-class data in “training data.txt” and

classify the test data in “test data.txt” by computing the log odds log P (C1|x)

P (C2|x) with P(C1) = 0.6 and P(C2) = 0.4.

Your program should display the quantities µ1, µ2, S1 and S2, the sample means and sample covariance matrices

obtained for each class separately, assuming they are independent.

You should then apply the classifier to the test set and show the resulting contigency table (confusion matrix):

number of C1 samples classified as C1 number of C2 samples classified as C1

number of C1 samples classified as C2 number of C2 samples classified as C2

What is the resulting error rate on the test set?

Instructions

• All solutions must be submitted electronically via Canvas.

• Things to submit: one PDF and one ZIP file:

1. hw1 sol.pdf: A document which contains the solutions to Problems 1, 2, 3, and 4, your name, student ID,

email, any assumptions you are making, and any other necessary details. The solution to 4 should include the

formulas for the parameters and their corresponding numerical values. The PDF file should include all the

numerical values requested in Problem 4.

2. For Problem 4 also submit a zip file containing the Matlab source file classify.m and any associated files

needed to make this run. The function classify reads in the training and test data files, computes and returns

the parameters estimated from the training set and the error rate on the test set. It should be a function which

begins as follows:

function [mu1,mu2,S1,S2,ConfusionMatrix,ErrorRate]=classify(TrainingSet, TestSet);

% Solve Hw1 Q4, student name:…..

% Input Parameters: TrainingSet, TestSet: file names (strings), in “csv” format.

% . . . more comments explaining the contents . . .

TrainingData=dlmread(TrainingSet);

class1=find(TrainingData(:,end)==1); % indices of observations in class 1.

class2=find(TrainingData(:,end)==2); % indices of observations in class 2.

TestData=dlmread(TestSet);

. . .

• Do not include the data files downloaded from the class web site. Do not include the PDF file within the ZIP file.

Rather, the PDF document should be submitted as a separate document.

2