1. (10 points) Section A (Theoretical)
(a) (2 marks) You are developing a machine-learning model for a prediction task. As
you increase the complexity of your model, for example, by adding more features
or by including higher-order polynomial terms in a regression model, what is most
likely to occur? Explain in terms of bias and variance with suitable graphs as
applicable.
(b) (3 marks) You’re working at a tech company that has developed an advanced email
filtering system to ensure users’ inboxes are free from spam while safeguarding legitimate messages. After the model has been trained, you are tasked with evaluating
its performance on a validation dataset containing a mix of spam and legitimate
emails. The results show that the model successfully identified 200 spam emails.
However, 50 spam emails managed to slip through, being incorrectly classified as legitimate. Meanwhile, the system correctly recognised most of the legitimate emails,
with 730 reaching the users’ inboxes as intended. Unfortunately, the filter mistakenly flagged 20 legitimate emails as spam, wrongly diverting them to the spam
folder. You are asked to assess the model by calculating its overall classification
performance, averaged across the different categories of emails.
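For reference, the four counts in the passage form a 2×2 confusion matrix. Assuming the intended averaged measure is plain accuracy (the fraction of all emails classified correctly), the computation can be sketched as:

```python
# Confusion-matrix counts taken from the question (spam = positive class).
tp = 200  # spam correctly flagged as spam
fn = 50   # spam that slipped through as legitimate
tn = 730  # legitimate emails correctly delivered
fp = 20   # legitimate emails wrongly diverted to spam

total = tp + fn + tn + fp
accuracy = (tp + tn) / total  # correct predictions over all predictions
print(f"accuracy = {accuracy:.2%}")  # prints "accuracy = 93.00%"
```

If instead a macro-average of per-class recall is intended, average the spam recall (200/250) and the legitimate-mail recall (730/750) rather than pooling the counts.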
(c) (3 marks) Consider the following data, where y (units) is related to x (units) over a
period of time. Find the equation of the regression line and, using the regression
equation obtained, predict the value of y when x = 12.

    x    y
    3    15
    6    30
    10   55
    15   85
    18   100

Table 1: Table of x and y values
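A hand computation of the least-squares line can be sanity-checked numerically. A minimal sketch with the values from Table 1:

```python
import numpy as np

x = np.array([3, 6, 10, 15, 18], dtype=float)
y = np.array([15, 30, 55, 85, 100], dtype=float)

# Least-squares coefficients: slope b1 = Sxy / Sxx, intercept b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_at_12 = b0 + b1 * 12  # prediction requested by the question
print(f"y = {b0:.3f} + {b1:.3f} x;  y(12) = {y_at_12:.2f}")
```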
(d) (2 marks) Given a training dataset with features X and labels Y, let f̂(X) be the
prediction of a model f and L(f̂(X), Y) be the loss function. Suppose you have two
models, f1 and f2, and the empirical risk for f1 is lower than that for f2. Provide a
toy example where model f1 has a lower empirical risk on the training set but may
not necessarily generalize better than model f2.
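One shape such a toy example can take: a high-degree polynomial that interpolates noisy training points achieves lower empirical risk than a straight line, yet predicts worse on fresh data. A minimal sketch (the linear ground truth and the fixed "noise" values are illustrative choices, not part of the question):

```python
import numpy as np

# True relationship is linear (y = 2x); training labels carry fixed noise.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
noise = np.array([0.5, -1.0, 0.8, -0.7, 0.3])
y_train = 2 * x_train + noise

# f1: degree-4 polynomial -> interpolates all 5 points, near-zero training loss
f1 = np.polyfit(x_train, y_train, 4)
# f2: straight line -> nonzero training loss
f2 = np.polyfit(x_train, y_train, 1)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Noiseless test points between the training points reveal the difference.
x_test = np.array([0.5, 1.5, 2.5, 3.5])
y_test = 2 * x_test
print("train MSE:", mse(f1, x_train, y_train), "vs", mse(f2, x_train, y_train))
print("test  MSE:", mse(f1, x_test, y_test), "vs", mse(f2, x_test, y_test))
```

Here f1 has the lower empirical risk on the training set, but f2 generalizes better because f1 has fitted the noise.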
2. (15 points) Section B (Scratch Implementation)
Implement Logistic Regression on the given dataset. You need to implement Gradient
Descent from scratch, meaning you cannot use any libraries for training the model (you
may use libraries like NumPy for other purposes, but not for training the model). Split
the dataset into 70:15:15 (train : test : validation). The loss function to be used is
Cross-entropy loss.
Dataset: Heart Disease
(a) (3 marks) Implement Logistic Regression using Batch Gradient Descent. Plot training loss vs. iteration, validation loss vs. iteration, training accuracy vs. iteration,
and validation accuracy vs. iteration. Comment on the convergence of the model.
Compare and analyze the plots.
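The core update loop can be sketched as follows. This is a minimal from-scratch version, demonstrated on synthetic two-class data rather than the Heart Disease dataset (which is not reproduced here); the function and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_batch(X, y, lr=0.1, n_iters=500):
    """Batch gradient descent on the cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)                 # predicted probabilities
        eps = 1e-12                            # guard against log(0)
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(loss)
        grad_w = X.T @ (p - y) / n             # gradient over the FULL batch
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Tiny synthetic, nearly separable demo (stands in for the real dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b, losses = train_logreg_batch(X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Recording the same quantities on the validation split each iteration gives the four curves the question asks for.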
(b) (2 marks) Investigate and compare the performance of the model with different
feature scaling methods: Min-max scaling and No scaling. Plot the loss vs. iteration
for each method and discuss the impact of feature scaling on model convergence.
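One detail worth getting right: the scaling parameters must be fitted on the training split only and then applied to the validation/test splits, otherwise information leaks across splits. A minimal sketch (helper name illustrative):

```python
import numpy as np

def min_max_scale(X_train, X_other=None):
    """Fit min-max scaling on the training split, then apply it elsewhere."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)     # guard constant columns
    scale = lambda X: (X - lo) / span
    return scale(X_train), (scale(X_other) if X_other is not None else None)

X_tr = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
X_va = np.array([[2.5, 250.0]])
X_tr_s, X_va_s = min_max_scale(X_tr, X_va)   # train columns now span [0, 1]
```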
(c) (2 marks) Calculate and present the confusion matrix for the validation set. Report precision, recall, F1 score, and ROC-AUC score for the model based on the
validation set. Comment on how these metrics provide insight into the model’s
performance.
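Since Section B forbids library training code, the confusion-matrix metrics are easy to compute by hand as well. A sketch (positive class = 1; note that ROC-AUC additionally needs the predicted probabilities, not just the 0/1 labels, so it is omitted here):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall and F1 from the 2x2 confusion matrix."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])
m = classification_metrics(y_true, y_pred)
```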
(d) (3 marks) Implement and compare the following optimisation algorithms: Stochastic Gradient Descent and Mini-Batch Gradient Descent (with varying batch sizes,
at least 2). Plot and compare the loss vs. iteration and accuracy vs. iteration for
each method. Discuss the trade-offs in terms of convergence speed and stability
between these methods.
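The only structural change relative to batch gradient descent is how examples are grouped per update. One common sketch: shuffle once per epoch and slice into batches, where batch_size=1 recovers SGD and batch_size=n recovers full-batch descent:

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield index arrays covering one shuffled epoch."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(10, 4, rng))   # last batch may be smaller
sizes = [len(b) for b in batches]
```

Each yielded index array selects the rows used for one gradient update; smaller batches give noisier but more frequent updates, which is exactly the convergence/stability trade-off the question asks about.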
(e) (2 marks) Implement k-fold cross-validation (with k=5) to assess the robustness
of your model. Report the average and standard deviation for accuracy, precision,
recall, and F1 score across the folds. Discuss the stability and variance of the
model’s performance across different folds.
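The fold bookkeeping can be done without any library support. A sketch that shuffles once and yields disjoint train/validation index sets (names illustrative):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Split n sample indices into k disjoint folds after one shuffle."""
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(100, k=5))
```

Training and scoring the model inside the loop, then taking the mean and standard deviation of each metric over the five folds, gives the numbers the question asks to report.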
(f) (3 marks) Implement early stopping in your best Gradient Descent method to avoid
overfitting. Define and use appropriate stopping criteria. Experiment with different
learning rates and regularization techniques (L1 and L2). Plot and compare the
performance with and without early stopping. Analyze the effect of early stopping
on overfitting and generalization.
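One reasonable stopping criterion is patience-based: stop when the validation loss has not improved by more than a tolerance for a fixed number of consecutive iterations. A generic sketch, decoupled from any particular optimiser (the callback structure and the simulated loss curve are illustrative):

```python
def train_with_early_stopping(step_fn, val_loss_fn,
                              max_iters=1000, patience=3, tol=1e-4):
    """Patience-based early stopping.

    step_fn()     performs one optimisation step (e.g. one GD update).
    val_loss_fn() returns the current validation loss.
    """
    best = float("inf")
    bad = 0
    for it in range(max_iters):
        step_fn()
        loss = val_loss_fn()
        if loss < best - tol:          # meaningful improvement: reset patience
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:        # plateaued/worsening: stop early
                return it + 1, best
    return max_iters, best

# Demo: a simulated validation-loss curve that bottoms out, then rises.
curve = iter([1.0, 0.8, 0.6, 0.5, 0.45, 0.44, 0.46, 0.48, 0.50, 0.55])
steps, best = train_with_early_stopping(lambda: None, lambda: next(curve))
```

In a real run one would also snapshot the weights at the best validation loss and restore them when stopping.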
OR
3. (15 points) Section C (Algorithm implementation using packages)
Split the given dataset into 80:20 (train: test) and perform the following tasks:
Dataset: Electricity Bill Dataset
(a) (2.5 marks) Perform EDA by creating pair plots, box plots, violin plots, count plots
for categorical features, and a correlation heatmap. Based on these visualizations,
provide at least five insights on the dataset.
(b) (1 mark) Use the Uniform Manifold Approximation and Projection (UMAP) algorithm to reduce the data dimensions to 2 and plot the resulting data as a scatter
plot. Comment on the separability and clustering of the data after dimensionality
reduction.
(c) (2.5 marks) Perform the necessary pre-processing steps, including handling missing
values and normalizing numerical features. For categorical features, use LabelEncoding. Apply Linear Regression on the preprocessed data. Report Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), R2 score, Adjusted R2 score,
and Mean Absolute Error (MAE) on the train and test data.
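Of these metrics, Adjusted R2 is the one scikit-learn does not provide directly, so it is worth writing down: Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. A self-contained sketch with toy numbers (the helper name and example values are illustrative):

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """MSE, RMSE, MAE, R2 and Adjusted R2 computed from residuals."""
    n = len(y_true)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(resid))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalises adding predictors that do not help.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "AdjR2": adj_r2}

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5])
rep = regression_report(y_true, y_pred, n_features=1)
```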
(d) (2 marks) Perform Recursive Feature Elimination (RFE) or Correlation analysis on
the original dataset to select the 3 most important features. Train the regression
model using the selected features. Compare the results (MSE, RMSE, R2 score,
Adjusted R2 score, MAE) on the train and test dataset with the results obtained
in part (c).
(e) (2 marks) Encode the categorical features of the original dataset using One-Hot
Encoding and perform Ridge Regression on the preprocessed data. Report the
evaluation metrics (MSE, RMSE, R2 score, Adjusted R2 score, MAE). Compare
the results with those obtained in part (c).
(f) (2 marks) Perform Independent Component Analysis (ICA) on the one-hot encoded
dataset and choose the appropriate number of components (try 4, 5, 6, and 8
components). Compare the results (MSE, RMSE, R2 score, Adjusted R2 score,
MAE) on the train and test dataset.
(g) (1.5 marks) Use ElasticNet regularization (which combines the L1 and L2 penalties) while training a linear model on the preprocessed dataset from part (c). Compare the evaluation metrics (MSE, RMSE, R2 score, Adjusted R2 score, MAE) on the test dataset
for different values of the L1/L2 mixing parameter (alpha; in scikit-learn this corresponds to l1_ratio, while scikit-learn's own alpha sets the overall penalty strength).
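To make the naming concrete: scikit-learn parameterises the penalty as alpha · (l1_ratio · ‖w‖₁ + ½ · (1 − l1_ratio) · ‖w‖₂²), so the mixing parameter this part varies is l1_ratio. A minimal sketch of that penalty (the helper name is illustrative; the formula matches scikit-learn's documented objective):

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Elastic Net penalty as parameterised in scikit-learn.

    l1_ratio is the L1/L2 mixing parameter:
    1.0 -> pure Lasso (L1), 0.0 -> pure Ridge (L2).
    """
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2_sq = np.sum(w ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2_sq)

w = [1.0, -2.0]
lasso_like = elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0)   # pure L1
ridge_like = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0)   # pure L2
mixed      = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5)   # 50/50 mix
```

Sweeping l1_ratio between 0 and 1 while refitting and re-evaluating on the test split produces the comparison this part asks for.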
(h) (1.5 marks) Use the Gradient Boosting Regressor to perform regression on the
preprocessed dataset from part (c). Report the evaluation metrics (MSE, RMSE,
R2 score, Adjusted R2 score, MAE). Compare the results with those obtained in
parts (c) and (g).