Description
P0. Set up your Python Machine Learning Environment
(Nothing to turn in for this part)
Step 1: Install Python+Conda distribution.
Your choice is either Miniforge or Miniconda. If you use a Mac with the M1 chip, you should install
Miniforge.
Step 2: Create a basic Python environment for projects in this course and install needed Python packages.
You should install the following Python packages to the newly created environment: matplotlib, numpy,
pandas, scipy, and scikit-learn. For editing Python code and running Python interactively, jupyter
notebook is recommended, and in that case, you will add jupyter to the above package list for installation.
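Once the environment is created and activated, a short check like the one below can confirm that the packages are visible to Python. This is only a sketch: the helper name check_packages is made up for illustration, and the list uses import names (scikit-learn is imported as sklearn, and Jupyter Notebook as notebook).

```python
import importlib.util

# Import names of the packages listed in Step 2.
# Note: scikit-learn is imported as "sklearn".
REQUIRED = ["matplotlib", "numpy", "pandas", "scipy", "sklearn"]
# Optional: "notebook" is the import name of the Jupyter Notebook package.
OPTIONAL = ["notebook"]


def check_packages(names):
    """Return {name: True/False}, True if Python can locate the package."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_packages(REQUIRED + OPTIONAL).items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Run it with the Python interpreter of the new environment. A package reported as MISSING usually means it was installed into a different environment.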
Resources:
• The lecture slides contain many details that can guide the setup process.
• There are many resources on the Internet. Here are some videos and links I have found useful in the
past.
Warning: These are for information only. Some of the instructions may be out of date. For package
installations, follow our lecture slides or search for the most recent instructions on the web.
For Mac computers with the M1 chip, I recommend the following two videos. You don’t need to follow
their steps to the end, but the videos give you an idea of the big picture. For most of the projects in
this course, we don’t need TensorFlow, and you don’t have to install it for now.
1. Jeff Heaton, Mac M1 Monterey Installing Miniforge and Anaconda/Miniconda Side-by-Side
2. Daniel Bourke, Setup Apple Silicon Mac for Machine Learning in 13 minutes (TensorFlow
edition)
Other Introductory Resources on the Internet:
David Chong, How I Set Up My MacBook Pro as A ML Engineer in 2022
https://towardsdatascience.com/how-i-set-up-my-macbook-pro-as-a-ml-engineer-in-2022-88226f08bde2
Zolzaya Luvsandorj, Introduction to Conda virtual environments
https://towardsdatascience.com/introduction-to-conda-virtual-environments-eaea4ac84e28
Machine Learning libraries (NumPy, SciPy, matplotlib, scikit-learn, pandas)
https://www.dotnetlovers.com/article/217/machine-learning-libraries-numpy-scipy-matplotlib-scikitlearn-pandas
P1. (30 points) Work on the written part of the assignment. See the file named ‘A1-written.pdf’. You will
need its solution for the programming part.
P2. (20 points) Load the data set named ‘lin_df.csv’ from Canvas. You can use a pandas DataFrame to load it. Inspect
it and you will see that it contains two columns of data. The first column contains the input X. The second
column contains the output Y. You will use the entire data set as the training set. In other words, we don’t
worry about generalization in this exercise.
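As a rough illustration of the loading step, the sketch below reads a two-column CSV into a DataFrame and converts the columns to numpy arrays. The inline text stands in for ‘lin_df.csv’ so the example runs anywhere; the column names "X" and "Y" and the values are made up, so check the real file's header before relying on them.

```python
import io

import pandas as pd

# Stand-in for 'lin_df.csv'. The header names and numbers are hypothetical.
csv_text = "X,Y\n0.0,1.1\n1.0,2.9\n2.0,5.2\n3.0,6.8\n"

# For the real file, replace the StringIO object with the path: pd.read_csv("lin_df.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first few rows
print(df.shape)    # (number of rows, number of columns)

# Select columns by position so the code does not depend on the header names.
x = df.iloc[:, 0].to_numpy(dtype=float)
y = df.iloc[:, 1].to_numpy(dtype=float)
```

Selecting by position (iloc) is a design choice here: it keeps working even if the header names in the actual file differ from the ones assumed above.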
(a) Plot the data points and inspect them.
(b) Write your own linear regression code to find the best fit (don’t use the scikit-learn linear
regression package). You will need the result from the written part of the assignment. Plot the
learned linear function together with the training data points and see how it fits.
You may find it convenient to convert the columns of the DataFrame into numpy arrays and work
with the arrays.
(c) What are the results of 𝜃0 and 𝜃1 of your linear regression? Assume the linear function has the
form 𝑦 = 𝜃0 + 𝜃1 𝑥.
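One possible shape for the fitting code in (b) is sketched below. It assumes the written part leads to the standard least-squares closed form for a line (slope = sample covariance of x and y divided by sample variance of x); if your derivation gives a different but equivalent expression, use yours. The function name fit_line and the synthetic data are illustrative only.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares fit of y = theta0 + theta1 * x; returns (theta0, theta1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of centered cross-products over sum of centered squares of x.
    theta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1


# Synthetic data (not the course data): a noisy line with intercept 2 and slope 3.
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(scale=0.5, size=x_demo.size)

theta0, theta1 = fit_line(x_demo, y_demo)
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")
```

To draw the learned function over the data, evaluate theta0 + theta1 * x on a grid of x values and pass it to matplotlib's plot alongside a scatter of the training points.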
P3. (15 points) Load the data set named ‘nonlin_df.csv’ from Canvas. Repeat the steps in P2.
The data is generated by 𝑌 = 𝑋^2.5 + 𝜖, where 𝜖 is random noise that is independent of 𝑋 and has zero mean.
You should superimpose the function 𝑦 = 𝑥^2.5 on your plot. It is the best prediction function because
𝐸[𝑌|𝑋 = 𝑥] = 𝑥^2.5.
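The plot described above might be put together roughly as follows. The sketch uses synthetic data drawn from the same model (with x assumed non-negative so that x^2.5 is real), and np.polyfit stands in for your own linear regression code purely to keep the example short. The function name plot_with_reference is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_with_reference(x, y, theta0, theta1):
    """Scatter the data, then overlay the linear fit and the curve y = x^2.5."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), 200)
    fig, ax = plt.subplots()
    ax.scatter(x, y, s=10, alpha=0.6, label="training data")
    ax.plot(grid, theta0 + theta1 * grid, color="tab:red", label="linear fit")
    ax.plot(grid, grid ** 2.5, color="tab:green", label="y = x^2.5")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    return fig, ax


# Synthetic stand-in for 'nonlin_df.csv' (range and noise level are made up).
rng = np.random.default_rng(1)
x_demo = rng.uniform(0.0, 2.0, size=100)
y_demo = x_demo ** 2.5 + rng.normal(scale=0.3, size=x_demo.size)

# Placeholder for your own regression code; polyfit returns [slope, intercept].
theta1_demo, theta0_demo = np.polyfit(x_demo, y_demo, deg=1)

fig, ax = plot_with_reference(x_demo, y_demo, theta0_demo, theta1_demo)
```

In a script, call plt.show() afterwards; in a Jupyter notebook the figure is normally displayed at the end of the cell.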
P4. (20 points) You will see that for the ‘nonlin_df.csv’ data set, linear regression does not give a good
fit. Now, implement your own K-Nearest-Neighbors (KNN) code. Plot the result of learning for three
cases: 𝐾 = 4, 𝐾 = 8, and 𝐾 = 16. You will see that although KNN provides a good fit, it does not yield a
smooth function.
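A minimal sketch of KNN regression for one-dimensional inputs is given below, assuming the prediction at a query point is the plain average of the outputs of its K nearest training points (one common variant; the lecture may define it slightly differently). The name knn_predict and the synthetic data are illustrative.

```python
import numpy as np


def knn_predict(x_train, y_train, x_query, k):
    """Predict at each query point by averaging y over the k nearest x_train."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Pairwise distances, shape (number of queries, number of training points).
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Column indices of the k smallest distances in each row (unordered).
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


# Synthetic stand-in for 'nonlin_df.csv' (range and noise level are made up).
rng = np.random.default_rng(2)
x_demo = rng.uniform(0.0, 2.0, size=200)
y_demo = x_demo ** 2.5 + rng.normal(scale=0.3, size=x_demo.size)

grid = np.linspace(0.0, 2.0, 200)
fits = {k: knn_predict(x_demo, y_demo, grid, k) for k in (4, 8, 16)}
for k, pred in fits.items():
    print(k, pred[:3])
```

Plotting each entry of fits against grid shows the piecewise-constant, non-smooth shape mentioned above: the prediction jumps whenever the set of K nearest neighbors changes.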
P5. (10 points) For this part, you will use the data in the file ‘lin_df.csv’. In your written part of the
assignment, you derive the function ℎ(𝜃0, 𝜃1), which is a quadratic function of 𝜃0 and 𝜃1. Calculate the
required coefficients using the training data. Plot the function ℎ(𝜃0, 𝜃1) in 3D using matplotlib. Please try
to show the minimum in your plot, if you can. If the function is hard to visualize in 3D, you may
supplement it with a sequence of 2D plots, one for each chosen (fixed) value for 𝜃1.
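The sketch below shows one way to compute the coefficients and draw the surface. It ASSUMES that ℎ is the mean squared error, ℎ(𝜃0, 𝜃1) = (1/n) Σ (𝜃0 + 𝜃1 x_i − y_i)², which expands to 𝜃0² + A𝜃1² + 2B𝜃0𝜃1 − 2C𝜃0 − 2D𝜃1 + E with the sample averages A, B, C, D, E defined in the code. The written part may use a different scaling (for example a factor of 1/2), so adapt the coefficients to your own derivation. All function names and the data are made up.

```python
import numpy as np
import matplotlib.pyplot as plt


def quadratic_coefficients(x, y):
    """Sample averages that define h under the mean-squared-error assumption."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return {"A": np.mean(x ** 2), "B": np.mean(x), "C": np.mean(y),
            "D": np.mean(x * y), "E": np.mean(y ** 2)}


def h(theta0, theta1, c):
    """Assumed form: mean squared error written as a quadratic in the thetas."""
    return (theta0 ** 2 + c["A"] * theta1 ** 2 + 2 * c["B"] * theta0 * theta1
            - 2 * c["C"] * theta0 - 2 * c["D"] * theta1 + c["E"])


def minimizer(c):
    """Point where both partial derivatives of h vanish."""
    theta1 = (c["D"] - c["B"] * c["C"]) / (c["A"] - c["B"] ** 2)
    theta0 = c["C"] - c["B"] * theta1
    return theta0, theta1


# Synthetic stand-in for 'lin_df.csv' (values are made up).
rng = np.random.default_rng(3)
x_demo = rng.uniform(-2.0, 2.0, size=100)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(scale=0.5, size=x_demo.size)

coef = quadratic_coefficients(x_demo, y_demo)
t0_min, t1_min = minimizer(coef)

# Grid centered on the minimizer so the bottom of the bowl is visible.
T0, T1 = np.meshgrid(np.linspace(t0_min - 3, t0_min + 3, 60),
                     np.linspace(t1_min - 3, t1_min + 3, 60))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(T0, T1, h(T0, T1, coef), cmap="viridis", alpha=0.8)
ax.scatter([t0_min], [t1_min], [h(t0_min, t1_min, coef)], color="red")
ax.set_xlabel(r"$\theta_0$")
ax.set_ylabel(r"$\theta_1$")
ax.set_zlabel("h")
```

Centering the grid on the minimizer is a design choice that makes the minimum easy to spot; for the 2D supplement, plot h(T0[0], fixed_theta1, coef) against T0[0] for a few fixed values of 𝜃1.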
P6. (5 points) Plot the function 𝑔(𝜃0, 𝜃1) = 𝜃0² − 𝜃1² in 3D around the point (0,0). You should see that (0,0) is
a saddle point.
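A short sketch of such a plot follows; the grid range of [−2, 2] and the color map are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt


def g(theta0, theta1):
    """g(theta0, theta1) = theta0^2 - theta1^2."""
    return theta0 ** 2 - theta1 ** 2


# Symmetric grid around the origin (the range is an arbitrary choice).
t = np.linspace(-2.0, 2.0, 61)
T0, T1 = np.meshgrid(t, t)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(T0, T1, g(T0, T1), cmap="coolwarm", alpha=0.85)
ax.scatter([0.0], [0.0], [0.0], color="black")  # mark the point (0, 0)
ax.set_xlabel(r"$\theta_0$")
ax.set_ylabel(r"$\theta_1$")
ax.set_zlabel("g")
```

The surface curves upward along the 𝜃0 axis and downward along the 𝜃1 axis, so the origin is a minimum in one direction and a maximum in the other, which is what makes it a saddle point.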