1. Logistic regression
1a. Make a logistic regression model
relating the probability an iris has Species=’virginica’ to its ‘Petal.Length’ and classifying irises as ‘virginica’ or not ‘virginica’ (i.e. ‘versicolor’).
- Read into a DataFrame.
- Make a second data frame that excludes the ‘setosa’ rows (leaving the ‘virginica’ and ‘versicolor’ rows) and includes only the Petal.Length and Species columns.
- Train the model using X = petal length and y = whether the Species is ‘virginica’. (I used "y = (df['Species'] == 'virginica').to_numpy().astype(int)", which sets y to zeros and ones.)
- Report its accuracy on the training data.
- Report the estimated P(Species=virginica | Petal.Length=5).
- Report the predicted Species for Petal.Length=5.
- Make a plot showing:
  - the data points
  - the estimated logistic curve
  - what I have called the “sample proportion” of y == 1 at each unique Petal.Length value
  - a legend, a title, and any other labels necessary to make the plot easy to read
# ... your code here ...
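The steps in 1a can be sketched on a small made-up data set (not the iris data, whose source is not shown here). The arrays `x` and `labels` and the class names "a" and "b" are hypothetical stand-ins. This is one possible outline under those assumptions, not the expected solution, and it leaves out the plot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data (NOT the iris data): one feature and a two-class label.
x = np.array([3.0, 3.5, 4.0, 4.5, 4.5, 5.0, 5.0, 5.5, 6.0, 6.5])
labels = np.array(["a", "a", "a", "a", "b", "a", "b", "b", "b", "b"])

X = x.reshape(-1, 1)               # scikit-learn expects a 2-D feature array
y = (labels == "b").astype(int)    # 1 where the label is "b", else 0

model = LogisticRegression()
model.fit(X, y)

accuracy = model.score(X, y)                   # accuracy on the training data
p_at_5 = model.predict_proba([[5.0]])[0, 1]    # estimated P(y = 1 | x = 5)
pred_at_5 = model.predict([[5.0]])[0]          # predicted class (0 or 1) at x = 5

# "Sample proportion" of y == 1 at each unique x value.
unique_x = np.unique(x)
proportions = np.array([y[x == v].mean() for v in unique_x])

print(f"training accuracy = {accuracy:.3f}")
print(f"P(y=1 | x=5) = {p_at_5:.3f}, predicted class = {pred_at_5}")
```

Here `proportions[i]` is the fraction of rows with `y == 1` among the rows whose feature equals `unique_x[i]`; these are the points a plot would overlay on the fitted curve.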
1b. Do some work with logistic regression by hand.
Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(wx + b)}}$.
Logistic regression is named after the log-odds of success, $\ln\frac{p}{1-p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $wx + b$. (That is, start with $\ln\frac{p}{1-p}$ and connect it in a series of equalities to $wx + b$.)
… your LaTeX math in a Markdown cell here …
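One possible route for 1b, offered as a sketch rather than the required presentation: first write $1 - p$ over the same denominator as $p$, then take the ratio so the denominators cancel.

```latex
\begin{aligned}
1 - p &= 1 - \frac{1}{1 + e^{-(wx+b)}} = \frac{e^{-(wx+b)}}{1 + e^{-(wx+b)}}, \\
\frac{p}{1-p} &= \frac{1}{e^{-(wx+b)}} = e^{wx+b}, \\
\ln\frac{p}{1-p} &= \ln e^{wx+b} = wx + b.
\end{aligned}
```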
1c. Do some more work with logistic regression by hand.
I ran some Python/scikit-learn code to make the model pictured here:
From the image and without the help of running code, match each code line from the top list with its output from the bottom list.
model.predict_proba(X)[:, 1]
A. array([0, 0, 0, 1])
B. array([0.003, 0.5, 0.5, 0.997])
C. array([5.832])
D. array([0.])
# ... Your answer here in a Markdown cell ...
# For example, "1: A, 2: B, 3: C, 4: D" is wrong but has the right format.
2. Decision tree
2a. Make a decision tree model on a Titanic data set.
Read the data from
These data are described at (click on the small down-arrow to see the “Data Dictionary”), which is where they are from.
- Retain only the Survived, Pclass, Sex, and Age columns.
- Display the first seven rows (passengers). Notice that the Age column includes NaN, indicating a missing value.
- Drop rows with missing data via
. Display your data frame’s shape before and after dropping rows. (It should be (714, 4) after dropping rows.)
- Add a column called ‘Female’ that indicates whether a passenger is female. You can make this column via df.Sex == 'female'. This gives bool values True and False, which are interpreted as 1 and 0 when used in an arithmetic context.
- Train a decision tree with
to decide whether a passenger Survived from the other three columns. Report its accuracy (with 3 decimal places) on training data along with the tree’s depth (which is available in clf.tree_.max_depth).
- Train another tree with
. Report its accuracy (with 3 decimal places). Use tree.plot_tree() to display it, including feature_names to make the tree easy to read.
# ... your code here ...
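The workflow in 2a can be sketched on a few made-up rows (not the Titanic file, whose location is not shown here). Several choices below are assumptions rather than the assignment's own code: `dropna()` as the way to drop rows, the default tree settings for the first tree, `max_depth=2` for the second (taken from the wording of 2b), `random_state=0`, and `export_text` swapped in for `tree.plot_tree()` so that no plotting backend is needed.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy rows (NOT the Titanic data); column names mirror the assignment.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "female", "male"],
    "Age":      [29.0, 40.0, None, 22.0, 35.0, 54.0, 4.0, None],
})

shape_before = df.shape
df = df.dropna().copy()     # assumption: dropna() removes rows containing NaN
shape_after = df.shape
print(shape_before, shape_after)

# Bool column cast to 0/1 so the feature matrix is purely numeric.
df["Female"] = (df.Sex == "female").astype(int)

features = ["Pclass", "Female", "Age"]
X = df[features]
y = df["Survived"]

clf = DecisionTreeClassifier(random_state=0)    # no depth limit (assumed settings)
clf.fit(X, y)
print(f"accuracy = {clf.score(X, y):.3f}, depth = {clf.tree_.max_depth}")

shallow = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow.fit(X, y)
print(f"accuracy = {shallow.score(X, y):.3f}")
print(export_text(shallow, feature_names=features))   # text view of the tree
```

On the real data the two trees will generally differ in depth and accuracy; the toy rows here are too few for that contrast to show.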
2b. Which features are used in the (max_depth=2) decision-making? Answer in a markdown cell.
# ... your English text in a Markdown cell here ...
2c. What proportion of females survived? What proportion of males survived?
Answer in two sentences via print(), with each proportion rounded to three decimal places.
Hint: There are many ways to do this. One quick way is to find the average of the Survived column for each subset (females and males).
# ... your code here ...
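The idea behind a subset average can be sketched on made-up rows (not the Titanic data): the mean of a 0/1 column over a subset is the proportion of ones in that subset. The data frame below is hypothetical, and boolean indexing with `.loc` is only one of several ways to form the subsets.

```python
import pandas as pd

# Made-up toy rows (NOT the Titanic data).
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Female":   [True, False, True, False, True, False, True, False],
})

# Mean of the 0/1 Survived column within each subset = proportion who survived.
female_rate = df.loc[df.Female, "Survived"].mean()
male_rate = df.loc[~df.Female, "Survived"].mean()

print(f"The proportion of females who survived is {female_rate:.3f}.")
print(f"The proportion of males who survived is {male_rate:.3f}.")
```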
2d. Do some decision tree calculations by hand.
Consider a decision tree node containing the following set $S$ of examples $(x, y)$, where $x = (x_1, x_2)$:
((4, 9), 1)
((2, 6), 0)
((5, 7), 0)
((3, 8), 1)
Find the entropy of $S$.
# ... your brief work and answer here in a markdown cell ...
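A hand calculation of entropy can be checked with a few lines of code. The sketch below assumes the base-2 definition, $H(S) = -\sum_k p_k \log_2 p_k$ with $p_k$ the fraction of examples in class $k$; the helper name `entropy` is hypothetical, and a course that uses natural logarithms would get a value scaled by $\ln 2$.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits (log base 2) of a sequence of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# The node's examples as ((x1, x2), y) pairs, copied from the problem.
S = [((4, 9), 1), ((2, 6), 0), ((5, 7), 0), ((3, 8), 1)]
labels = [y for _, y in S]

h = entropy(labels)
print(h)
```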
2e. Do some more decision tree calculations by hand.
Find a (feature, threshold) pair that yields the best split for this node.
# ... your brief work and answer here in a markdown cell ...
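A brute-force search can confirm a hand-picked split. The sketch below rests on assumptions that the assignment does not state: candidate thresholds are midpoints between consecutive distinct feature values, examples with $x_j \le t$ go left, and the best split is the one with the lowest weighted child entropy (base 2). The helper names `entropy` and `best_split` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(S):
    """Return (feature_index, threshold, weighted_child_entropy) with lowest cost."""
    n = len(S)
    n_features = len(S[0][0])
    best = None
    for j in range(n_features):
        values = sorted({x[j] for x, _ in S})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in S if x[j] <= t]
            right = [y for x, y in S if x[j] > t]
            cost = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best

# The node's examples as ((x1, x2), y) pairs, copied from the problem.
S = [((4, 9), 1), ((2, 6), 0), ((5, 7), 0), ((3, 8), 1)]
j, t, cost = best_split(S)
print(f"feature index {j}, threshold {t}, weighted child entropy {cost:.3f}")
```

The feature index is 0-based, so index 0 refers to $x_1$ and index 1 to $x_2$. Any threshold strictly between the same two neighboring values separates the examples identically; the midpoint is just a convention.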