## Description

1 (50 points). This exercise relates to the Red Wine Quality data set (winequality-red.csv), which can be found under the Datasets modules in Canvas. The dataset contains a number of physicochemical test variables for 1599 different red wine variants of the Portuguese “Vinho Verde” wine.

The variables are:

• fixed_acidity

• volatile_acidity

• citric_acid

• residual_sugar

• chlorides

• free_sulfur_dioxide

• total_sulfur_dioxide

• density

• pH

• sulphates

• alcohol

• quality (output variable based on sensory data; score between 0 and 10)

Before reading the data into R or Python, you can view it in Excel or a text editor. For each of the following questions, include the code you used to complete the task as your response, along with any plots or numeric outputs produced. You may omit outputs that are not relevant (such as dataframe contents), but still include all of your code.

(a, 6 points) Use the read.csv() function to read the data into R, or the csv library to read in the data with Python. In R you will load the data into a dataframe. In Python you may store it as a list of lists or use a pandas dataframe to store your data. Call the loaded data redwine. Ensure that your column headers are not treated as a row of data.
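
As one possible illustration of the list-of-lists route in Python, the sketch below reads CSV text with the standard csv module and keeps the header line separate from the data rows. It assumes the file is comma-separated and that every column is numeric; the three sample rows and the helper name `load_rows` are invented for illustration and are not taken from the real file.

```python
import csv
import io

# Toy stand-in for winequality-red.csv; these rows are invented for illustration.
SAMPLE_CSV = """alcohol,quality
9.4,5
9.8,5
11.2,6
"""

def load_rows(handle):
    """Read CSV text into a list of lists, keeping the header out of the data."""
    reader = csv.reader(handle)
    header = next(reader)  # the first line holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows

# With the real file, pass open("winequality-red.csv", newline="") instead.
header, redwine = load_rows(io.StringIO(SAMPLE_CSV))
print(header)
print(len(redwine))
```

If the file turns out to use a different delimiter, `csv.reader` accepts a `delimiter` argument.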

(b, 8 points) Find the mean quality of all the wine samples. Then find the median alcohol level for all the wine samples.
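
A minimal sketch of the two summary statistics, using Python's standard statistics module. The two lists hold invented toy values that stand in for the quality and alcohol columns; they are not real data.

```python
import statistics

# Invented toy values standing in for two columns of the data set.
quality = [5, 5, 6, 7, 5]
alcohol = [9.4, 9.8, 11.2, 10.0, 12.5]

mean_quality = statistics.mean(quality)      # arithmetic mean
median_alcohol = statistics.median(alcohol)  # middle value after sorting

print(mean_quality)
print(median_alcohol)
```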

(c, 8 points) Produce a scatterplot that shows the relationship between wine density and residual_sugar. Ensure it has appropriate axis labels and a title. Briefly state if you see any effect of residual_sugar on density.

(d, 10 points) Create a new qualitative variable, called ALevel, by binning the alcohol variable into two categories (High and Medium). Specifically, divide the data into two groups based on whether the alcohol level exceeds 11 or not (alcohol greater than 11 is considered High; otherwise it is considered Medium).

Now produce side-by-side boxplots of the ratio of sulphates to chlorides (hint: create a new variable that calculates sulphates / chlorides) for each of the two ALevel categories. There should be two boxes on your figure, one for High and one for Medium. How many samples are in the High category?
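
The binning rule and the derived ratio can be sketched without any plotting library, as below. The four rows are invented toy values, and the function name `alevel` is hypothetical. Note that a value of exactly 11 falls into Medium, because the rule is strictly "greater than 11".

```python
from collections import Counter

# Invented toy rows: (alcohol, sulphates, chlorides).
samples = [
    (9.4, 0.56, 0.076),
    (11.0, 0.65, 0.080),
    (11.2, 0.58, 0.075),
    (12.5, 0.75, 0.050),
]

def alevel(alcohol, threshold=11.0):
    """Label a sample High if alcohol is strictly above the threshold."""
    return "High" if alcohol > threshold else "Medium"

labels = [alevel(alcohol) for alcohol, _, _ in samples]
ratios = [sulphates / chlorides for _, sulphates, chlorides in samples]

# Group the ratios by category; each list would feed one box of the boxplot.
ratio_by_level = {"High": [], "Medium": []}
for label, ratio in zip(labels, ratios):
    ratio_by_level[label].append(ratio)

counts = Counter(labels)
print(counts["High"])
```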

(e, 8 points) Produce a histogram showing the fixed_acidity numbers for both High and Medium (ALevel) wine samples. You may choose to show both on a single plot (using side-by-side bars) or produce one plot for High samples and one for Medium samples. Ensure whatever figures you produce have appropriate axis labels and a title.

(f, 10 points) Continue exploring the data, producing two new plots of any type, and provide a brief (one to two sentence) summary of your hypotheses and what you discover. Feel free to think outside the box on this one, but if you want something to point you in the right direction, look at the summary statistics for various features and think about what they tell you. Perhaps try plotting various features from the dataset against each other and see if any patterns emerge.

2 (50 points). This exercise involves the forestfires.csv dataset, which can be found under the Datasets modules in Canvas. The features of the dataset are:

• X: x-axis spatial coordinate

• Y: y-axis spatial coordinate

• month: month of the year (‘jan’ to ‘dec’)

• day: day of the week (‘mon’ to ‘sun’)

• FFMC: Fine Fuel Moisture Code index

• DMC: Duff Moisture Code index

• DC: Drought code index

• ISI: Initial spread index

• temp: Temperature in degrees Celsius

• RH: Relative Humidity in %

• wind: Wind speed (km/h)

• rain: Amount of rainfall (mm/m2)

• area: area that got burnt in the forest fire

(a, 6 points) Specify which of the predictors are quantitative (measuring numeric properties such as size or quantity) and which, if any, are qualitative (measuring non-numeric properties such as color, appearance, type, etc.). Keep in mind that a qualitative variable may be represented as a quantitative type in the dataset, or the reverse. You may wish to adjust the types of your variables based on your findings.

(b, 8 points) What are the range, mean, and standard deviation of each quantitative predictor? Which month has the highest number of fires?
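
One way to compute these summaries with only the standard library is sketched below. The `temp` and `months` lists are invented toy values, and `summarize` is a hypothetical helper name. `statistics.stdev` is the sample standard deviation (divisor n - 1), which is also the default in R's `sd()` and in pandas' `.std()`.

```python
import statistics
from collections import Counter

# Invented toy values standing in for the temp and month columns.
temp = [8.2, 18.0, 14.6, 22.8]
months = ["mar", "aug", "aug", "sep"]

def summarize(values):
    """Return the range (as min and max), mean and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

summary = summarize(temp)

# most_common(1) returns a one-element list of (value, count) pairs.
busiest_month, n_fires = Counter(months).most_common(1)[0]

print(summary)
print(busiest_month, n_fires)
```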

(c, 8 points) Produce boxplots of relative humidity (RH) by month. Your figure will have a boxplot for every month. Which month has the highest median RH value?

(d, 10 points) Produce a bar plot to show the count of forest fires in each month for which wind is greater than 4.9. During which months are high wind forest fires most common? (Hint: filter data by wind, group data by month and calculate count.)
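
The filter, group, and count steps from the hint might look like the following sketch, which leaves out the plotting itself. The `fires` list holds invented (month, wind) pairs. The resulting counts are the bar heights a bar plot would show.

```python
from collections import Counter

# Invented toy rows: (month, wind speed in km/h).
fires = [("mar", 6.7), ("aug", 5.4), ("aug", 2.2), ("sep", 4.9), ("aug", 5.8)]

# Filter on wind > 4.9, then count the surviving rows per month.
high_wind_counts = Counter(month for month, wind in fires if wind > 4.9)

for month, count in high_wind_counts.most_common():
    print(month, count)
```

The row with wind exactly 4.9 is excluded, since the condition is strictly "greater than".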

(e, 10 points) Using the full data set, investigate the predictors graphically, using scatterplots, correlation scores or other tools of your choice. Create a correlation matrix for the relevant variables.
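
To make the idea of a correlation matrix concrete, the sketch below computes pairwise Pearson correlations by hand for three columns. It assumes Pearson correlation is the intended score, which the exercise does not specify; the column values are invented, and `pearson` is a hypothetical helper. In practice a dataframe library would usually provide this directly.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented toy values standing in for three quantitative columns.
columns = {
    "temp": [8.2, 18.0, 14.6, 22.8],
    "RH": [51, 33, 40, 29],
    "wind": [6.7, 0.9, 1.3, 4.0],
}

names = list(columns)
# Nested dict: corr[a][b] is the correlation between columns a and b.
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal carry information.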

(f, 8 points) Suppose that we wish to predict the Initial spread index (ISI) based on the other variables. Which, if any, of the other variables might be useful in predicting ISI? Justify your answer based on the prior correlations.