1. Exploratory Data Analysis [40 points]
In this homework, you are given a dataset, data la happiness.csv, of happiness (meanvalence)
in different census tracts in Los Angeles County. The dataset is used to assess the influence of socio-demographic features on happiness. You are given four socio-demographic features:
• meanHSize – Mean household size
• percent bachelorPlus – Percent of census tract population with a Bachelor’s degree
or higher
• totalRace1 – Number of people from race 1
• totalRace2 – Number of people from race 2
1.1 Data Sanctity [8 points]
Load the file data la happiness.csv and check for missing values. Note which columns have
missing values and how many each has. If there are no missing values, proceed to 1.2. Otherwise,
explain how you would handle the missing values and implement your approach. Report the mean and median of
all variable columns: meanvalence, totalRace1, totalRace2, percent bachelorPlus. Multiply
percent bachelorPlus by 1000. Be discreet.
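The missing-value check and summary statistics can be sketched as follows. This is a minimal illustration, not a reference solution: it uses synthetic stand-in data rather than the real file, and it assumes the column names use underscores where the extracted text shows spaces (for example `percent_bachelorPlus`). Median imputation is only one possible strategy, chosen here for illustration; dropping incomplete rows is another.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset. With the real file this would be
# something like pd.read_csv("data_la_happiness.csv") (file name assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "meanvalence": rng.normal(5.9, 0.1, 200),
    "percent_bachelorPlus": rng.uniform(0.0, 0.8, 200),  # assumed column name
    "totalRace1": rng.lognormal(7, 1, 200),
    "totalRace2": rng.lognormal(6, 1, 200),
})
df.loc[[3, 17, 42], "percent_bachelorPlus"] = np.nan  # inject missing values

# Number of missing values in each column.
missing_counts = df.isna().sum()
print(missing_counts)

# One possible handling strategy (an assumption): fill with the column median.
df_filled = df.fillna(df.median(numeric_only=True))

# Mean and median of every column, one row per statistic.
summary = df_filled.agg(["mean", "median"])
print(summary)
```

Median imputation keeps every census tract in the sample but shrinks the variance of the imputed column, so it is worth stating the choice explicitly in the report.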
1.2 Outlier Detection [8 points]
Generate a boxplot to visualize outliers in the outcome variable meanvalence. Use the interquartile range (IQR) method to remove census tracts whose meanvalence is more than 1.5 × IQR below
Q1 or more than 1.5 × IQR above Q3, that is, outside the interval [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR].
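The IQR rule can be sketched as a small filter. The helper name `remove_iqr_outliers` and the toy values are hypothetical and only illustrate the rule; they are not taken from the dataset.

```python
import pandas as pd

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = frame[column].between(lower, upper)  # bounds are inclusive
    return frame[keep]

# Toy data with two obvious outliers (2.0 and 9.5).
demo = pd.DataFrame({"meanvalence": [5.8, 5.9, 6.0, 6.1, 6.2, 9.5, 2.0]})
cleaned = remove_iqr_outliers(demo, "meanvalence")
print(len(demo), "->", len(cleaned))
```

The boxplot itself could be drawn with `seaborn.boxplot(x=demo["meanvalence"])`. With seaborn's default whisker length of 1.5 × IQR, the points drawn beyond the whiskers are the same rows this filter removes.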
1.3 Variable Relationships [8 points]
Use seaborn’s pairplot to analyze relationships between the variables. Report correlations
between variables. Discuss the distributions of totalRace1 and totalRace2.
1.4 Simple Models and Residual Analysis [8 points]
Build two simple models using statsmodels and generate residual plots for each:
meanvalence ∼ totalRace1
meanvalence ∼ totalRace2
What do the residual plots reveal? Are any linear regression assumptions violated?
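One way to fit a simple model and draw its residual plot is sketched here for the first formula; the second follows the same pattern with totalRace2. The data are synthetic and deliberately generated so that happiness depends on the logarithm of the count, which is an assumption made only to show what a violated linearity assumption can look like.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic data: the outcome depends on log(totalRace1), not on the raw count.
rng = np.random.default_rng(1)
n = 300
race1 = rng.lognormal(7, 1, n)
toy = pd.DataFrame({
    "totalRace1": race1,
    "meanvalence": 5.5 + 0.05 * np.log(race1) + rng.normal(0, 0.05, n),
})

# Ordinary least squares on the untransformed predictor.
model = smf.ols("meanvalence ~ totalRace1", data=toy).fit()

# Residuals against fitted values; a patternless band around zero is the ideal.
fig, ax = plt.subplots()
ax.scatter(model.fittedvalues, model.resid, s=10)
ax.axhline(0, color="black", linewidth=1)
ax.set_xlabel("Fitted meanvalence")
ax.set_ylabel("Residual")
ax.set_title("meanvalence ~ totalRace1")
```

Curvature in this plot suggests a nonlinear relationship, and a funnel shape suggests non-constant variance. Call `plt.show()` or `fig.savefig(...)` to view the figure.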
1.5 Log Transformation [8 points]
Apply the log transformation to totalRace1 and totalRace2. Explain briefly why this
transformation is necessary.
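The effect of the transformation can be checked numerically by comparing skewness before and after. This is a sketch on synthetic, strictly positive counts; the `log_` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for the real columns.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "totalRace1": rng.lognormal(7, 1, 500),
    "totalRace2": rng.lognormal(6, 1, 500),
})

# Natural log of each count column. np.log returns -inf for zeros, so tracts
# with a zero count would need separate handling (np.log1p is one common
# workaround, at the cost of a slightly different interpretation).
for col in ["totalRace1", "totalRace2"]:
    toy[f"log_{col}"] = np.log(toy[col])

# Sample skewness: values near 0 indicate a roughly symmetric distribution.
skew = toy.skew()
print(skew.round(2))
```

A drop in skewness toward zero is one concrete piece of evidence to cite when explaining why the transformation helps.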
2. Multivariate Regression [40 points]
2.1 Model Building [15 points]
After applying the log transformation, build the following multivariate regression model
using statsmodels:
meanvalence ∼ percent bachelorPlus + log(totalRace1) + log(totalRace2)
Report your findings and discuss interpretations for each independent variable.
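A minimal sketch of the multivariate fit follows, assuming the log-transformed counts are stored in hypothetical columns `log_totalRace1` and `log_totalRace2` and that the education column is named `percent_bachelorPlus`. The data and the coefficients used to generate them are synthetic and say nothing about the real results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with known coefficients, for illustration only.
rng = np.random.default_rng(4)
n = 500
toy = pd.DataFrame({
    "percent_bachelorPlus": rng.uniform(0, 0.8, n),
    "log_totalRace1": rng.normal(7, 1, n),
    "log_totalRace2": rng.normal(6, 1, n),
})
toy["meanvalence"] = (
    5.5
    + 0.30 * toy["percent_bachelorPlus"]
    + 0.02 * toy["log_totalRace1"]
    - 0.03 * toy["log_totalRace2"]
    + rng.normal(0, 0.05, n)
)

formula = "meanvalence ~ percent_bachelorPlus + log_totalRace1 + log_totalRace2"
full_model = smf.ols(formula, data=toy).fit()
print(full_model.summary())

# Compact table of estimates and p-values for the write-up.
coef_table = pd.DataFrame({"coef": full_model.params,
                           "p_value": full_model.pvalues})
print(coef_table.round(4))
```

When interpreting a log-transformed predictor, a 1% increase in the raw count corresponds to a change of roughly coef / 100 in meanvalence, holding the other predictors fixed.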
2.2 Scatterplot Analysis [10 points]
Generate a scatterplot of meanvalence against percent bachelorPlus, colored by log(totalRace2).
Repeat this with the predicted values (predicted meanvalence) from the model. Discuss any
differences observed.
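The two scatterplots can be placed side by side with a shared color scale, as in this sketch. The data, the column names with underscores, and the `predicted_meanvalence` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic stand-in data.
rng = np.random.default_rng(5)
n = 400
toy = pd.DataFrame({
    "percent_bachelorPlus": rng.uniform(0, 0.8, n),
    "log_totalRace1": rng.normal(7, 1, n),
    "log_totalRace2": rng.normal(6, 1, n),
})
toy["meanvalence"] = (5.5 + 0.3 * toy["percent_bachelorPlus"]
                      - 0.03 * toy["log_totalRace2"] + rng.normal(0, 0.05, n))

model = smf.ols(
    "meanvalence ~ percent_bachelorPlus + log_totalRace1 + log_totalRace2",
    data=toy,
).fit()
toy["predicted_meanvalence"] = model.fittedvalues

# Left panel: observed outcome. Right panel: model predictions.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, ycol in zip(axes, ["meanvalence", "predicted_meanvalence"]):
    points = ax.scatter(toy["percent_bachelorPlus"], toy[ycol],
                        c=toy["log_totalRace2"], cmap="viridis", s=10)
    ax.set_xlabel("percent_bachelorPlus")
    ax.set_title(ycol)
axes[0].set_ylabel("meanvalence")
fig.colorbar(points, ax=axes, label="log(totalRace2)")
```

Predicted values from a least-squares fit vary less than the observed ones, so the right-hand panel usually looks tighter and any color gradient is easier to see.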
2.3 Correlation Heatmap [15 points]
Generate a correlation heatmap using seaborn. Report correlations between:
• log(totalRace1) and meanvalence
• log(totalRace2) and meanvalence
• log(totalRace1) and predicted meanvalence
• log(totalRace2) and predicted meanvalence
Discuss whether the model appears biased.
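The heatmap and the four requested correlations can be sketched as follows, again on synthetic data with assumed column names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

# Synthetic stand-in data.
rng = np.random.default_rng(6)
n = 400
toy = pd.DataFrame({
    "percent_bachelorPlus": rng.uniform(0, 0.8, n),
    "log_totalRace1": rng.normal(7, 1, n),
    "log_totalRace2": rng.normal(6, 1, n),
})
toy["meanvalence"] = (5.5 + 0.3 * toy["percent_bachelorPlus"]
                      + 0.02 * toy["log_totalRace1"]
                      - 0.03 * toy["log_totalRace2"]
                      + rng.normal(0, 0.05, n))
model = smf.ols(
    "meanvalence ~ percent_bachelorPlus + log_totalRace1 + log_totalRace2",
    data=toy,
).fit()
toy["predicted_meanvalence"] = model.fittedvalues

cols = ["meanvalence", "predicted_meanvalence",
        "log_totalRace1", "log_totalRace2"]
corr = toy[cols].corr()

# Annotated heatmap with a symmetric color scale centered on zero.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)

# The four correlations the task asks for, as a 2 x 2 table.
bias_check = corr.loc[["log_totalRace1", "log_totalRace2"],
                      ["meanvalence", "predicted_meanvalence"]]
print(bias_check.round(3))
```

For a least-squares fit with an intercept, the correlation between the predictions and any included predictor is at least as large in magnitude as the correlation between the observed outcome and that predictor. This amplification is a useful point to raise when discussing whether the model appears biased.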
3. Analyzing Bias in the Model [20 points]
3.1 Protected Variable Analysis [15 points]
Run the following model, considering log(totalRace2) as a protected variable:
meanvalence ∼ percent bachelorPlus + log(totalRace1)
Report regression results and correlations as in Section 2.3. Compare these correlations to
those in 2.3.
Discuss whether bias was reduced. [5 points]
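The comparison between the full and reduced models can be sketched with a small helper. The helper name `protected_corr` is hypothetical and the data are synthetic. In this toy data the protected variable is generated independently of the other predictors, which is an assumption that real census data are unlikely to satisfy.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def protected_corr(formula, data, protected):
    """Correlation between a model's fitted values and a protected column."""
    fitted = smf.ols(formula, data=data).fit().fittedvalues
    return fitted.corr(data[protected])

# Synthetic stand-in data; predictors are mutually independent by construction.
rng = np.random.default_rng(7)
n = 500
toy = pd.DataFrame({
    "percent_bachelorPlus": rng.uniform(0, 0.8, n),
    "log_totalRace1": rng.normal(7, 1, n),
    "log_totalRace2": rng.normal(6, 1, n),
})
toy["meanvalence"] = (5.5 + 0.3 * toy["percent_bachelorPlus"]
                      + 0.02 * toy["log_totalRace1"]
                      - 0.05 * toy["log_totalRace2"]
                      + rng.normal(0, 0.05, n))

corr_full = protected_corr(
    "meanvalence ~ percent_bachelorPlus + log_totalRace1 + log_totalRace2",
    toy, "log_totalRace2")
corr_reduced = protected_corr(
    "meanvalence ~ percent_bachelorPlus + log_totalRace1",
    toy, "log_totalRace2")
print(f"full: {corr_full:.3f}  reduced: {corr_reduced:.3f}")
```

On real data the reduced model's predictions may remain correlated with log(totalRace2), because the remaining predictors can act as proxies for the dropped variable. The discussion should report whichever pattern the actual correlations show.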

Solved DSCI531: Fairness in Artificial Intelligence Homework 1: Linear Models
