## Description

1 Model performance

A very common question in every machine learning problem is: how many data samples do we need to model the system behaviour adequately? Unfortunately, as with many other topics in machine learning, there is no straightforward answer. In many toy problems presented in textbooks, a classification problem is solved with only 50-100 data points. In real-world problems, a classification problem may be very difficult even with millions of data points. Generally, the model performance depends on the following factors:

1. Are the classes easily separated, or are they heavily mixed? Are they separated linearly or non-linearly? Is a linear or non-linear model used?

2. The quality of the features. Do they carry information about the output/class? More features do not necessarily mean better performance. The well-known saying “Garbage in, garbage out” applies to uninformative features.

3. The number of data points. Intuitively, more data points lead to better performance, but beyond some point the gain in model performance is expected to diminish.

The last point is the subject of this section. From a business perspective, you want to know how many samples you need to model client behaviour adequately. This information is crucial when conditions change and you may want to re-fit your model.

For example, client behaviour changed dramatically with Covid-19. Let’s assume that you are at the beginning of Covid-19 in March 2020 and your manager asks you to re-fit the model for the retail response problem you solved in Assignment #5 (apologies for putting you mentally back at the beginning of Covid-19; we are almost out of it). The question that comes with this request is: how many data points do you need to re-fit the model with adequate performance?

You know that more data points generally mean better performance, but you cannot wait too long to collect new data post-March 2020, because your business will not have a reliable model for as long as you are collecting data. A similar situation may arise in an industrial setting, for example after the annual maintenance of a machine or a reactor: how many data points do you need to model the machine or reactor behaviour after the maintenance?


1.1 Dataset size vs model performance

Here, you will quantify the relationship between the dataset size and the model performance. Essentially, you will answer the question: how much data is enough to model client behaviour? To do this, you will pick the best single tree model you created in Assignment #5 and evaluate it with datasets of different sizes, using the monthly features you created in Assignment #3.

Perform the evaluation with the following steps:

1. Split the data into train and test sets with a 9:1 ratio. This split should give you approximately 291k/32k samples in the train/test sets, respectively.

2. Initialize a list and create a for loop in which you take N samples (e.g. 50) from the train set, build a tree model with those N samples and evaluate its AUC on the test set. Repeat the sampling process 10 times, appending the test set AUC each time. The following table shows the desired output:

N = 50 samples

| Sample # | Test AUC |
|----------|----------|
| 1        | 0.545    |
| 2        | 0.561    |
| ...      | ...      |
| 10       | 0.551    |

From this table, you can calculate the mean and standard deviation of the test AUC for N samples.

3. Repeat the procedure you performed in the previous step for different sample sizes N (e.g. 100, 500, 1000, 2000, 5000, 10000).¹

4. Build a table that contains the following values: sample size N, test AUC mean, and test AUC standard deviation.

5. Using the matplotlib function errorbar, plot the model performance, captured by the test AUC mean and standard deviation, as a function of the sample size. From this plot, can you estimate the minimum number of samples needed to model the behaviour adequately?
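The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the expected solution: the data is synthetic (scikit-learn's `make_classification` stands in for the monthly features from Assignment #3), `DecisionTreeClassifier(max_depth=5)` is a placeholder for your best tree from Assignment #5, and the helper names `auc_for_sample_sizes`, `summarize` and `plot_summary` are hypothetical. Redrawing a sample that contains only one class is also an addition of this sketch, since a tree fitted on a single class cannot produce a meaningful AUC.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def auc_for_sample_sizes(X_train, y_train, X_test, y_test, sizes, n_repeats=10, seed=0):
    """Return one row per (N, repeat) with the test AUC of a tree fit on N samples."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in sizes:
        for rep in range(1, n_repeats + 1):
            # Draw N training rows without replacement; redraw if only one
            # class is present (assumption of this sketch, see lead-in).
            while True:
                idx = rng.choice(len(X_train), size=n, replace=False)
                if len(np.unique(y_train[idx])) == 2:
                    break
            # Placeholder hyperparameters: substitute your own best tree here.
            tree = DecisionTreeClassifier(max_depth=5, random_state=0)
            tree.fit(X_train[idx], y_train[idx])
            scores = tree.predict_proba(X_test)[:, 1]
            rows.append({"N": n, "sample": rep, "test_auc": roc_auc_score(y_test, scores)})
    return pd.DataFrame(rows)


def summarize(results):
    """Collapse the per-repeat AUCs into mean and standard deviation per N."""
    return results.groupby("N")["test_auc"].agg(auc_mean="mean", auc_std="std").reset_index()


def plot_summary(summary):
    """Plot mean test AUC with standard-deviation error bars against N."""
    fig, ax = plt.subplots()
    ax.errorbar(summary["N"], summary["auc_mean"], yerr=summary["auc_std"], fmt="o-", capsize=3)
    ax.set_xscale("log")  # the sample sizes span several orders of magnitude
    ax.set_xlabel("Sample size N")
    ax.set_ylabel("Test AUC (mean ± std)")
    return ax


# Synthetic stand-in data (assumption): replace with your own features and labels.
X, y = make_classification(n_samples=20_000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

sizes = [50, 100, 500, 1000, 2000, 5000, 10000]
results = auc_for_sample_sizes(X_train, y_train, X_test, y_test, sizes)
summary = summarize(results)
print(summary)
```

Calling `plot_summary(summary)` followed by `plt.show()` draws the error-bar plot from step 5. The helpers index with NumPy arrays, so if your features live in a pandas DataFrame you would pass `df.to_numpy()` or switch to `.iloc` indexing.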

¹ The N values here are just my educated guesses. You should try values that will give you a meaningful result as described in the next steps.
