CS 539 Machine Learning Assignment 1

QUESTION 1 [55 Marks]
Climate change stands as one of the most urgent challenges confronting our planet
today. To effectively comprehend and address this critical issue, access to precise and
comprehensive data regarding global temperatures and other climate-related factors is
indispensable.
In this regard, you serve as a data analyst at the National Aeronautics & Space
Administration (NASA) and are engaged in researching Earth’s climate and
temperature. Your work involves utilizing datasets sourced from satellites and ground-based sensors.
You have been entrusted with a dataset “earth_surface_temperatures.csv”
encompassing surface temperature data for various countries worldwide, spanning from
December 1743 to December 2020. Your mission is to conduct an in-depth Exploratory Data
Analysis (EDA) on this dataset. This analysis aims to extract insights and answer crucial
questions about the data by delving into trends and patterns.
To achieve this, you are expected to:
a. Identify and rectify any missing values in the data using appropriate techniques.
[5 Marks]
b. Transform the Years and Month columns into a single column labeled “Date” in
the MM-YYYY format, with a datetime64[ns] data type. For example, the year
1848 and month 5 should be unified as a single value, such as 5-1848. [5 Marks]
c. Detect and investigate extreme temperature values that might be regarded as
outliers. [5 Marks]
d. Compute summary statistics for temperature, monthly variation, and anomaly
values, including mean, median, standard deviation, and range. [5 Marks]
e. Identify the countries included in the dataset and calculate their average
temperature values. [5 Marks]
f. Determine the overall trend in global temperatures over the years and visualize
this trend using a suitable chart. [5 Marks]
g. Identify the months with the highest and lowest temperatures for each country
and find out whether there are noticeable seasonal patterns in the temperature
data. [5 Marks]
h. Explore the variation in temperature anomalies on a monthly basis and identify
any months with consistently high or low anomalies across the years. [5 Marks]
i. Choose five countries and compare the trends in their temperatures over the
years, seeking any similar temperature patterns. [5 Marks]
j. Explore the potential correlation between temperature and monthly variation or
anomaly values. Calculate correlation coefficients and create scatterplots to
investigate this relationship. [5 Marks]
k. Provide an intriguing insight from the dataset by utilizing data visualization
techniques such as histograms, box plots, or heatmaps to represent the data’s
distribution, trends, and relationships. [5 Marks]
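Items (b) and (c) can be sketched with pandas as follows. This is a minimal sketch under stated assumptions: the column names `Years`, `Month`, and `Temperature` and the small in-memory frame are hypothetical stand-ins for the real file, and the 1.5 × IQR rule is one common outlier convention, not one the assignment prescribes. Note that a datetime64[ns] column stores full timestamps, so the MM-YYYY form can only be a display format derived from it.

```python
import pandas as pd

# Hypothetical sample; the real data would be loaded with pd.read_csv(...).
df = pd.DataFrame({
    "Years": [1848, 1848, 1849, 1849, 1850, 1850],
    "Month": [5, 6, 5, 6, 5, 6],
    "Temperature": [12.1, 15.3, 11.8, 15.9, 12.4, 48.0],
})

# (b) Assemble a datetime column from year and month (day fixed to 1).
parts = df[["Years", "Month"]].rename(columns={"Years": "year", "Month": "month"})
parts = parts.assign(day=1)
# The explicit cast pins the resolution to nanoseconds.
df["Date"] = pd.to_datetime(parts).astype("datetime64[ns]")

# MM-YYYY is only a string view of the datetime values.
df["Date_label"] = df["Date"].dt.strftime("%m-%Y")

# (c) Flag values outside 1.5 * IQR of the quartiles (assumed convention).
q1 = df["Temperature"].quantile(0.25)
q3 = df["Temperature"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["Temperature"] < q1 - 1.5 * iqr) | (df["Temperature"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(df.dtypes)
print(outliers)
```

Flagged rows should be investigated rather than dropped automatically, since an extreme reading can be a genuine observation.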
QUESTION 2 [45 Marks]
As a member of the retail analytics team, you have been contacted by the Category
Manager at a retail store, who desires to gain a deeper understanding of the customers who
buy chips and their purchasing habits within the region through valuable insights that will
eventually be used to inform the store’s strategic plan for the chip category in the upcoming
six months.
You have received the following e-mail from your manager.
Greetings!
I am following up on our earlier conversation with a few pointers to help you succeed in this
task. Here are the key areas you will be working on and what we’re looking for in each one:
Firstly, examine the transaction data (“transaction_data” file) and look for inconsistencies,
missing data, and outliers, and check that category items and numeric data are correctly
identified across all tables.
If you notice any anomalies, please make the necessary changes in the dataset and save it
for further analysis. Having clean data will make it easier for us to conduct an effective
analysis.
Secondly, examine the customer data (“purchase_behaviour” file) for similar issues and
check for null values. Once you’re satisfied with the data, merge the transaction and
customer data together for analysis, ensuring that you save your files along the way.
Thirdly, conduct data analysis and identify customer segments. Define the metrics, such as
total sales, drivers of sales, and the source of the highest sales. Explore the data, create
charts and graphs, and note any interesting trends and insights you find.
Finally, deep dive into customer segments and recommend which segments we should
target. Determine if packet sizes are relative and form an overall conclusion based on your
analysis.
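The null check and merge described in the e-mail can be sketched with pandas. This is a minimal sketch, not the expected solution: the join key `CUSTOMER_ID`, the columns `PRODUCT` and `SALES`, and all sample values are hypothetical; only `LIFESTAGE` and `PREMIUM_CUSTOMER` are names given in the column description.

```python
import pandas as pd

# Hypothetical miniature versions of the two files.
transactions = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 2, 3],
    "PRODUCT": ["Chips A 175g", "Chips B 150g", "Chips A 175g", "Chips C 330g"],
    "SALES": [6.0, 3.0, 6.0, 7.5],
})
customers = pd.DataFrame({
    "CUSTOMER_ID": [1, 2, 3],
    "LIFESTAGE": ["YOUNG FAMILIES", "RETIREES", "YOUNG FAMILIES"],
    "PREMIUM_CUSTOMER": ["Budget", "Premium", "Budget"],
})

# Count missing values per column before merging.
null_counts = customers.isnull().sum()

# Left join keeps every transaction; validate raises an error if a
# customer appears more than once in the customer table.
merged = transactions.merge(
    customers, on="CUSTOMER_ID", how="left", validate="many_to_one"
)

# Transactions whose customer was not found end up with null attributes.
unmatched = int(merged["LIFESTAGE"].isnull().sum())

# One possible metric: total sales per customer segment, largest first.
sales_by_segment = (
    merged.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])["SALES"]
    .sum()
    .sort_values(ascending=False)
)
print(sales_by_segment)

# merged.to_csv("merged_data.csv", index=False)  # save for later analysis
```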
Here is the task:
Your task is to provide a data-driven strategic recommendation for the upcoming category
review. To achieve this, you must first analyze the current purchasing trends and behaviors
to understand the customer segments and their chip purchasing behavior. To describe the
customers’ purchasing behavior, you need to identify relevant metrics. The client has a
specific interest in understanding the chip purchasing behavior of different customer
segments.
To begin the task, download the comma-separated values (CSV) data files provided to you
and conduct preliminary data checks, including:
• Generating and interpreting high-level data summaries.
• Identifying any outliers and removing them, if necessary.
• Verifying the data formats and correcting them, if needed.
In addition to the preliminary data checks, it is essential to extract additional features, such
as pack size and brand name, from the data. Defining relevant metrics of interest is also
crucial to gaining insights into the chip purchasing behavior of different customer segments.
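Feature extraction of this kind can be sketched as below. The sketch assumes a hypothetical `PROD_NAME` column whose values start with the brand and contain the pack size as a number followed by "g"; the product names are invented, and real data will likely need extra cleaning (for example, inconsistent brand spellings).

```python
import pandas as pd

# Hypothetical product names; the real column name and format may differ.
products = pd.DataFrame({
    "PROD_NAME": [
        "Crispo Sea Salt 175g",
        "Crunchy Original 330G",
        "Crispo Cheese 150g",
    ]
})

# Pack size: the digits immediately before a trailing "g" or "G".
products["PACK_SIZE"] = (
    products["PROD_NAME"].str.extract(r"(\d+)\s*[gG]\b", expand=False).astype(int)
)

# Brand: assumed to be the first word of the name, upper-cased so that
# differently cased spellings of one brand fall into the same group.
products["BRAND"] = products["PROD_NAME"].str.split().str[0].str.upper()

print(products)
```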
Your ultimate goal is to formulate a strategic recommendation for the Category Manager,
based on your findings. Therefore, it is essential that your insights have a commercial
application and can be used to inform decision-making.
Lastly, a detailed report on your analysis findings, no longer than 3-4 pages, is required.
The report should include any relevant visualizations you have created, as well as your
recommendation to the Category Manager, to inform the store’s strategic plan for the chip
category. Do not include any technical aspects of your analysis, such as coding, in the
report.
Note: This is an open-ended case study and can be approached in various ways, allowing
for flexibility and creativity in the analysis process.
Additional Pointers (column description of purchase behavior):
LIFESTAGE: Customer attribute that determines if they have a family or not, and at what
stage of life they are in. For instance, it considers whether their children are in preschool,
primary or secondary school.
PREMIUM_CUSTOMER: Customer segmentation approach that distinguishes shoppers
based on the price point and product types they purchase. Its purpose is to determine
whether customers are willing to pay more for brand or quality or prefer to purchase the
most economical options.

Assignment 2 CS 539

For this assignment, you will:
(0 pts) Get started with Python
(60 pts) Implement Decision Tree with Discrete Attributes
(40 pts) Credit Risk Prediction
Part 0: Getting Started with Python
As in all the homeworks this semester, you will be using Python. So let’s get started first with our Python
installation.
Python Installation and Basic Configuration
First off, you will need to install a recent version of Python 3. There are lots of online resources for help
in installing Python.
Alternatively, there is a nice collection called Anaconda, that comes with Python plus tons of helpful packages
that we may use down the line in this course:
Anaconda: which has install instructions for Windows, Linux, and Mac OSX.
Installing Python Libraries (optional)
You may need to install Python libraries. To manage your Python installations, we recommend pip. Pip is a
tool for installing and keeping track of python packages. It is a replacement for easy_install which is included
with python. It’s a bit smarter than easy_install, and gives better error messages, so you probably want to use it.
You can install pip and the two packages we currently need by running these commands:
> easy_install pip
> pip install -r reqs.pip
Then, you may install other Python libraries such as NumPy by typing ‘pip install numpy’
Part 1: Implement Decision Tree with Discrete Attributes (60 pts)
In this assignment, you will implement the decision tree algorithm for a classification problem in python 3. We
provide the following three files:
a) data1.csv – You will load the file, build a tree, and evaluate its performance.
The first row of the file is the header (including the names of the attributes). In the remaining rows, each row
represents an instance/example. The first column of the file is the target label.
b) part1.py – You will implement several functions. Do not change the input and the output of the functions.
c) test1.py – This file includes unit tests. Run this file by typing ‘pytest -v test1.py’ in the terminal to check
whether all of the functions are properly implemented. No modification is required.
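As an illustration of the core computation, the sketch below scores a discrete attribute by information gain. It assumes an entropy-based (ID3-style) splitting criterion, which this handout does not specify, and the function names `entropy` and `information_gain` are hypothetical; they are not the signatures required by part1.py.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy reduction obtained by splitting on one discrete attribute."""
    n = len(labels)
    # Group the labels by the attribute value of each instance.
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: "windy" separates the classes perfectly, "humidity" not at all.
labels = ["yes", "yes", "no", "no"]
windy = ["FALSE", "FALSE", "TRUE", "TRUE"]
humidity = ["high", "normal", "high", "normal"]

print(information_gain(labels, windy))
print(information_gain(labels, humidity))
```

A tree builder would pick the attribute with the highest gain at each node and recurse on the resulting groups.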
Part 2: Credit Risk Prediction (40 pts)
Let’s assume that you work for a credit card company. Given the sample credit dataset (credit.txt) as a training
set, your job is to build a decision tree and make risk prediction of individuals. The target/class variable is credit
risk described as high or low. Features are debt, income, marital status, property ownership, and gender.
Task 2-1: Draw your decision tree and report it. You may use visualization tools (e.g., Graphviz) or use text.
You might find it easier if you turn the decision tree on its side, and use indentation to show levels of the tree as
it grows from the left. For example:
outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes
Feel free to print out something similarly readable if you think it is easier to code.
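One way to produce the text format above is a short recursive function. The sketch assumes a hypothetical tree representation (a nested dict with an `attribute` name and a `branches` mapping, with plain strings as leaves); your own implementation may store the tree differently.

```python
def format_tree(node, depth=0):
    """Return the tree as a list of indented text lines."""
    lines = []
    prefix = "| " * depth
    for value, child in node["branches"].items():
        test = f"{prefix}{node['attribute']} = {value}"
        if isinstance(child, dict):
            # Internal node: print the test, then the subtree one level deeper.
            lines.append(test)
            lines.extend(format_tree(child, depth + 1))
        else:
            # Leaf: print the test and the predicted label on one line.
            lines.append(f"{test}: {child}")
    return lines


def classify(node, example):
    """Follow the branches matching the example until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node


# The weather tree from the example above, in the assumed representation.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy", "branches": {"TRUE": "no", "FALSE": "yes"}},
    },
}

print("\n".join(format_tree(tree)))
print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))
```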
Apply the decision tree to determine the credit risk of the following individuals:
Name   Debt   Income   Married?   Owns Property   Gender
Tom    low    low      no         Yes             Male
Ana    low    medium   yes        Yes             female
Report a snapshot of your decision tree, and predicted credit risk of Tom and Ana.
Task 2-2: How does your decision tree change if Sofia’s credit risk is high instead of low as recorded in the
training data? Given the decision tree constructed from the original dataset, if existing, name any feature not
playing a role in the decision tree.
What to turn in:
• Submit to Canvas your part1.py and a PDF document for Part 2.
• This is an individual assignment.

Assignment 3 CS 539

For this assignment, you will:
(70 pts) Implement linear regression with gradient descent
(30 pts) Make predictions by using your implementation
Part 1: Implement linear regression with gradient descent
In this problem, you will implement the linear regression algorithm in python3. We provide the following files:
a) linear_regression.py – You will implement several functions. As we discussed in class, implement the
functions by using vectorization. You may refer to matrix calculus here:
https://en.wikipedia.org/wiki/Matrix_calculus
Do not change the input and the output of the functions.
b) test.py – This file includes unit tests. Run this file by typing ‘pytest -v test.py’ in the terminal as you did
in homework 1 in order to check whether all of the functions are properly implemented. No modification
is required.
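To illustrate what "vectorization" means here, the sketch below trains linear regression with full-batch gradient descent using only matrix operations, with no Python loop over examples. It rests on assumptions: the loss is taken to be half the mean squared error, one epoch performs one full-batch update, and the function names are hypothetical rather than those in linear_regression.py.

```python
import numpy as np


def compute_loss(X, y, w):
    """Half mean squared error: (1 / 2n) * ||Xw - y||^2."""
    residual = X @ w - y
    return float(residual @ residual) / (2 * len(y))


def compute_gradient(X, y, w):
    """Gradient of the loss above with respect to w: (1 / n) * X^T (Xw - y)."""
    return X.T @ (X @ w - y) / len(y)


def train(X, y, alpha=0.1, n_epoch=500):
    """Full-batch gradient descent starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epoch):
        w -= alpha * compute_gradient(X, y, w)
    return w


# Toy data generated from y = 1 + 2x; the first column of ones is the bias.
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w = train(X, y, alpha=0.1, n_epoch=500)
print(w, compute_loss(X, y, w))
```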
Part 2: Make predictions by using your implementation
Given training and test sets, you will make predictions of test examples by using your linear regression
implementation (linear_regression.py). We provide the following file:
a) application.py – write your code in this file. Do not change X and y.
Please play with the parameters alpha and number of epochs to make sure your testing loss is smaller than 1e-2
(i.e., 0.01). Report your parameters, training loss and testing loss. In addition, based on your observations,
report a relationship between alpha and number of epochs. Note that a single epoch means one full pass over
all examples in the training set.
What to turn in:
• Submit to Canvas your linear_regression.py, application.py and a PDF document for Part 2.
• This is an individual assignment.

Assignment 4 CS 539

Part 1. Softmax regression [60 points]
In this part, you will implement softmax regression (in problem1.py) with stochastic gradient descent in
python3.
We provide the following files:
a) problem1.py – You will implement several functions of softmax regression. Do not change the input and
the output of the functions.
b) test1.py – This file includes unit tests. Run this file by typing ‘pytest -v test1.py’ in the terminal. No
modification is required.
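The sketch below shows the main ingredients of softmax regression trained with stochastic gradient descent. It is a minimal illustration under assumptions: cross-entropy loss, one update per training example, integer class labels, and hypothetical function names that do not correspond to the required signatures in problem1.py.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D logit vector, shifted by its max to avoid overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def sgd_step(x, y, W, b, alpha):
    """One SGD update on a single example (x, y) with cross-entropy loss."""
    a = softmax(W @ x + b)
    dz = a.copy()
    dz[y] -= 1.0                      # dL/dz = a - onehot(y)
    W -= alpha * np.outer(dz, x)      # dL/dW = dz x^T
    b -= alpha * dz                   # dL/db = dz
    return W, b


def train(X, Y, n_classes, alpha=0.1, n_epoch=100, seed=0):
    """Visit the examples in a new random order in every epoch."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    b = np.zeros(n_classes)
    for _ in range(n_epoch):
        for i in rng.permutation(len(Y)):
            W, b = sgd_step(X[i], Y[i], W, b, alpha)
    return W, b


def predict(X, W, b):
    """Predicted class = index of the largest logit."""
    return np.argmax(X @ W.T + b, axis=1)


# Toy data: three well-separated clusters, two points each.
X = np.array([[2.0, 0.0], [2.2, 0.1], [-2.0, 0.0], [-2.1, -0.1], [0.0, 3.0], [0.1, 3.2]])
Y = np.array([0, 0, 1, 1, 2, 2])

W, b = train(X, Y, n_classes=3)
print(predict(X, W, b))
```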
Part 2. Adding CNN and fully connected layers to recognize handwritten digits on PyTorch [40 points]
In this part, you will deal with the MNIST Database [1]. The MNIST Database is a collection of samples of
handwritten digits from many people, originally collected by the National Institute of Standards and
Technology (NIST), and modified to be more easily analyzed computationally. We will use a tutorial and
sample software provided:
• Read and run a tutorial [2] to be familiar with how to add CNN layers into PyTorch.
• Download hw4-part2.ipynb and run it to be familiar with the code. Currently it contains two fully connected
layers with softmax.
• Your job is to add one CNN layer with one pooling layer before the two fully connected layers with softmax.
Refer to the detailed instruction about the CNN layer. The main difference between the tutorial and this
given hw4_part2.ipynb is the input image of hw4_part2.ipynb has only 1 channel (i.e., gray scale).
Report results of two fully connected layers without CNN and with CNN.
• Then, experiment with at least 3 alternative network topologies and hyper-parameters (e.g., different # of
CNN/fully-connected layers, # of epochs, # of hidden units, learning rate, batch size, and different activation
functions).
• Save and summarize the results and report them.
• Through the experiment, what is the best configuration? What prediction accuracy did you get on the test set?
What did you learn?
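When a CNN layer and a pooling layer are placed before the fully connected layers, the input size of the first fully connected layer has to match the flattened output of the pooling layer. The sketch below works out that size for 28 × 28 MNIST images. The kernel size, pooling size, and channel count are hypothetical placeholders (use the values from the detailed instruction), and the formula assumes no dilation and floor rounding.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size, out_channels, conv_kernel, pool_kernel):
    """Number of inputs the first fully connected layer receives."""
    after_conv = conv_output_size(image_size, conv_kernel)
    # Pooling is assumed to use a stride equal to its kernel size.
    after_pool = conv_output_size(after_conv, pool_kernel, stride=pool_kernel)
    return out_channels * after_pool * after_pool


# Hypothetical configuration: 6 output channels, 5x5 kernel, 2x2 pooling.
n_inputs = flattened_features(image_size=28, out_channels=6, conv_kernel=5, pool_kernel=2)
print(n_inputs)
```

With these placeholder values the image shrinks from 28 to 24 after the convolution and to 12 after pooling, so the first fully connected layer would take 6 × 12 × 12 inputs.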
What to turn in:
• Submit to Canvas your problem1.py and a PDF document for Part 2.
• This is an individual assignment.
[1] MNIST Database: https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html
[2] Tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
