## Description

## 1. Logistic regression

## 1a. Make a logistic regression model

Make a logistic regression model relating the probability that an iris has Species='virginica' to its 'Petal.Length', and use it to classify irises as 'virginica' or not 'virginica' (i.e. 'versicolor').

- Read http://www.stat.wisc.edu/~jgillett/451/data/iris.csv into a DataFrame.
- Make a second data frame that excludes the ‘setosa’ rows (leaving the ‘virginica’ and ‘versicolor’ rows) and includes only the Petal.Length and Species columns.
- Train the model using $X$ = petal length and $y$ = whether the Species is 'virginica'. (I used `y = (df['Species'] == 'virginica').to_numpy().astype(int)`, which sets `y` to zeros and ones.)
- Report its accuracy on the training data.
- Report the estimated P(Species=virginica | Petal.Length=5).
- Report the predicted Species for Petal.Length=5.
- Make a plot showing:
  - the data points
  - the estimated logistic curve
  - what I have called the "sample proportion" of y == 1 at each unique Petal.Length value
  - a legend, a title, and other labels necessary to make the plot easy to read

`# ... your code here ...`
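The modeling steps listed above can be sketched on a small made-up data set. This is a minimal sketch, not the assignment's solution: the arrays `lengths` and `labels` are hypothetical stand-ins for the iris data, and using `LogisticRegression` with its default settings is an assumption. Plotting is left out; the `proportions` array is the quantity that would be drawn as the "sample proportion" points.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical petal lengths and 0/1 labels (1 = 'virginica'); not the real iris data.
lengths = np.array([4.0, 4.5, 4.5, 5.0, 5.0, 5.5, 6.0, 6.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = lengths.reshape(-1, 1)    # scikit-learn expects a 2-D feature array
model = LogisticRegression()  # default settings (an assumption)
model.fit(X, labels)

accuracy = model.score(X, labels)            # accuracy on the training data
p_at_5 = model.predict_proba([[5.0]])[0, 1]  # estimated P(y = 1 | x = 5)
label_at_5 = model.predict([[5.0]])[0]       # predicted class (0 or 1) at x = 5

# "Sample proportion" of y == 1 at each unique petal length.
unique_lengths = np.unique(lengths)
proportions = np.array([labels[lengths == v].mean() for v in unique_lengths])

print(f"training accuracy: {accuracy:.3f}")
print(f"P(y=1 | x=5): {p_at_5:.3f}, predicted label: {label_at_5}")
print(dict(zip(unique_lengths.tolist(), proportions.tolist())))
```

With the real data, `lengths` and `labels` would come from the filtered data frame instead of being typed in by hand.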

## 1b. Do some work with logistic regression by hand.

Consider the logistic regression model, $P(y_i = 1) = \frac{1}{1 + e^{-(wx + b)}}$.

Logistic regression is named after the log-odds of success, $\ln \frac{p}{1 - p}$, where $p = P(y_i = 1)$. Show that this log-odds equals $wx + b$. (That is, start with $\ln \frac{p}{1 - p}$ and connect it in a series of equalities to $wx + b$.)

#### … your Latex math in a Markdown cell here …

$\ln \frac{p}{1 - p} = \ldots = \ldots = \ldots = \ldots = wx + b$
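A numerical check can complement the algebra. The sketch below uses arbitrary made-up values for `w`, `b`, and `x` (assumptions for illustration only), evaluates the model's probability, and compares its log-odds with `w * x + b`. It does not replace the requested derivation.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter and input values, chosen only for illustration.
w, b = 1.5, -4.0
for x in [0.0, 1.0, 2.5, 4.0]:
    p = sigmoid(w * x + b)
    print(f"x={x}: p={p:.4f}, log-odds={log_odds(p):.4f}, wx+b={w * x + b:.4f}")
```

The last two columns of each printed row should agree up to floating-point error.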

## 1c. Do some more work with logistic regression by hand.

I ran some Python/scikit-learn code to make the model pictured here:

From the image and without the help of running code, match each code line from the top list with its output from the bottom list.

1. `model.intercept_`
2. `model.coef_`
3. `model.predict(X)`
4. `model.predict_proba(X)[:, 1]`

- A. `array([0, 0, 0, 1])`
- B. `array([0.003, 0.5, 0.5, 0.997])`
- C. `array([5.832])`
- D. `array([0.])`

```
# ... Your answer here in a Markdown cell ...
# For example, "1: A, 2: B, 3: C, 4: D" is wrong but has the right format.
```

## 2. Decision tree

## 2a. Make a decision tree model on a Titanic data set.

Read the data from http://www.stat.wisc.edu/~jgillett/451/data/kaggle_titanic_train.csv.

These data are described at https://www.kaggle.com/competitions/titanic/data (click on the small down-arrow to see the “Data Dictionary”), which is where they are from.

- Retain only the Survived, Pclass, Sex, and Age columns.
- Display the first seven rows (passengers). Notice that the Age column includes NaN, indicating a missing value.
- Drop rows with missing data via `df.dropna()`. Display your data frame's shape before and after dropping rows. (It should be (714, 4) after dropping rows.)
- Add a column called 'Female' that indicates whether a passenger is female. You can make this column via `df.Sex == 'female'`. This gives bool values True and False, which are interpreted as 1 and 0 when used in an arithmetic context.
- Train a decision tree with `max_depth=None` to decide whether a passenger `Survived` from the other three columns. Report its accuracy (with 3 decimal places) on training data along with the tree's depth (which is available in `clf.tree_.max_depth`).
- Train another tree with `max_depth=2`. Report its accuracy (with 3 decimal places). Use `tree.plot_tree()` to display it, including feature_names to make the tree easy to read.

`# ... your code here ...`
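The data-preparation and training steps above can be sketched on a tiny hand-made frame. This is a minimal sketch under stated assumptions: the eight rows are hypothetical, not real Titanic records; `random_state=0` is added only for reproducibility; and `export_text` is swapped in as a text-only alternative to `tree.plot_tree()` so the example needs no plotting backend.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical passengers using the assignment's column names; not real records.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 2, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0, 27.0],
})

print(df.shape)   # shape before dropping rows
df = df.dropna()  # dropna() returns a new frame, so reassign it
print(df.shape)   # shape after dropping rows

df = df.assign(Female=df.Sex == "female")  # bool column; True/False act as 1/0

feature_names = ["Pclass", "Female", "Age"]
X = df[feature_names].astype(float)  # cast bools to 0.0/1.0 explicitly
y = df["Survived"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}, depth: {clf.tree_.max_depth}")
print(export_text(clf, feature_names=feature_names))  # text rendering of the tree
```

Passing `max_depth=None` instead lets the tree grow until its leaves are pure, which is the first case the exercise asks about.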

## 2b. Which features are used in the (max_depth=2) decision-making? Answer in a markdown cell.

`# ... your English text in a Markdown cell here ...`

## 2c. What proportion of females survived? What proportion of males survived?

Answer in two sentences via print(), with each proportion rounded to three decimal places.

Hint: There are many ways to do this. One quick way is to find the average of the `Survived` column for each subset (females and males).

`# ... your code here ...`
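One way to compute a survival proportion per group is to average a 0/1 outcome column within each subset. The sketch below assumes a frame with `Survived` and `Female` columns as in the assignment, but the eight rows are hypothetical and the printed sentences are only one possible wording.

```python
import pandas as pd

# Hypothetical rows using the assignment's column names; not real Titanic records.
df = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
    "Female":   [True, True, True, True, False, False, False, False],
})

# The mean of a 0/1 column is the proportion of ones in that subset.
female_rate = df.loc[df.Female, "Survived"].mean()
male_rate = df.loc[~df.Female, "Survived"].mean()

print(f"The proportion of females who survived is {female_rate:.3f}.")
print(f"The proportion of males who survived is {male_rate:.3f}.")
```

`df.groupby("Female")["Survived"].mean()` gives the same two numbers in one call.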

## 2d. Do some decision tree calculations by hand.

Consider a decision tree node containing the following set of examples $S = (x, y)$ where $x = (x_1, x_2)$:

((4, 9), 1)

((2, 6), 0)

((5, 7), 0)

((3, 8), 1)

Find the entropy of $S$.

`# ... your brief work and answer here in a markdown cell ...`
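For checking hand calculations, entropy can be computed from class counts. This sketch assumes base-2 logarithms (entropy in bits); a course that uses natural logs would swap in `math.log`. The label lists are hypothetical and deliberately differ from the node in the exercise.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    # p * log2(1 / p) summed over classes, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# Hypothetical label sets, not the node from the exercise.
print(round(entropy([1, 1, 1, 0]), 3))  # three of one class, one of the other
print(entropy([1, 1, 1, 1]))            # a pure node
```

A pure node has entropy 0, and a binary node reaches the maximum of 1 bit when the two classes are equally frequent.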

## 2e. Do some more decision tree calculations by hand.

Find a (feature, threshold) pair that yields the best split for this node.

`# ... your brief work and answer here in a markdown cell ...`
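A brute-force search can be used to check a hand-found split. The sketch below is one possible formulation, with assumptions stated plainly: candidate thresholds are midpoints between consecutive distinct feature values, examples with `x[j] <= t` go left, and the best split is the one with the lowest size-weighted entropy of the two children. The function `best_split` and the `toy` node are hypothetical and are not the node from the exercise.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_split(examples):
    """Return (feature_index, threshold, weighted_entropy) of the lowest-entropy split.

    Feature indices are 0-based, so index 0 corresponds to x_1.
    """
    n = len(examples)
    best = None
    for j in range(len(examples[0][0])):
        values = sorted({x[j] for x, _ in examples})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint threshold (an assumed convention)
            left = [y for x, y in examples if x[j] <= t]
            right = [y for x, y in examples if x[j] > t]
            cost = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best

# Hypothetical node: ((x_1, x_2), y) pairs, not the ones in the exercise.
toy = [((1, 10), 0), ((2, 30), 0), ((3, 20), 1), ((4, 40), 1)]
print(best_split(toy))
```

Ties are broken in favor of the first split found, so another split with the same weighted entropy may be equally good.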