## Description

Walmart Data (5 points)

Walmart Sales Data in file 'WalmartSalesData.csv'

There are 12 columns in this dataset.

1. Store: Store number

2. Date: Week

3. Temperature: Average temperature in the region

4. Fuel Price: Cost of fuel in the region

5. MarkDown1: Anonymized data related to promotional markdowns that Walmart is running.

6. MarkDown2: Anonymized data related to promotional markdowns that Walmart is running.

7. MarkDown3: Anonymized data related to promotional markdowns that Walmart is running.

8. MarkDown4: Anonymized data related to promotional markdowns that Walmart is running.

9. MarkDown5: Anonymized data related to promotional markdowns that Walmart is running.

10. CPI: The consumer price index

11. Unemployment: The unemployment rate

12. IsHoliday: Whether the week is a special holiday week

Questions:

1. Load the data

2. Print the first and last six rows of the dataset

3. Find the dimensions of the dataset (number of rows and columns)

4. Find the total number of missing values

5. Find the number of missing values in each column

6. Visualize this data in any way you choose by producing 3 separate figures

7. Remove the missing values, or impute them using an approach of your choice (+2 points for imputation)

8. Show that there are no more missing values
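One possible way to approach the inspection and missing-value steps above is sketched here with pandas. This is a minimal sketch and not a reference solution: the helper names (`load_walmart`, `missing_summary`, `impute_median`) are hypothetical, median imputation is only one of many reasonable choices, and the small `demo` frame is synthetic stand-in data so the example runs without the CSV. The actual column headers in the file may differ from the names listed above.

```python
import numpy as np
import pandas as pd


def load_walmart(path="WalmartSalesData.csv"):
    """Read the CSV into a DataFrame (file name taken from the assignment text)."""
    return pd.read_csv(path)


def missing_summary(df):
    """Return (total number of missing cells, per-column missing counts)."""
    per_column = df.isna().sum()
    return int(per_column.sum()), per_column


def impute_median(df):
    """Fill missing values in numeric columns with that column's median.

    Assumption: median imputation is acceptable here. Non-numeric columns,
    and numeric columns that are entirely missing, are left unchanged.
    """
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Synthetic stand-in for the real data (values are made up for illustration).
demo = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Temperature": [42.3, np.nan, 55.1, 60.0],
    "MarkDown1": [np.nan, np.nan, 120.5, 80.0],
    "IsHoliday": [False, True, False, False],
})

print(demo.head(6))   # first six rows (fewer if the frame is shorter)
print(demo.tail(6))   # last six rows
print(demo.shape)     # (rows, columns)

total, per_column = missing_summary(demo)
print(total)
print(per_column)

clean = impute_median(demo)
print(clean.isna().sum())  # every count should now be zero
```

With the real file, `demo` would be replaced by `load_walmart()`. Dropping incomplete rows instead of imputing would be `df.dropna()`.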

Credit Card Data (13 points)

Credit Card Data in file 'CreditCardData.csv'

There are 20 columns in this dataset.

1. Client ID

2. Gender: Gender (1=Male, 0=Female)

3. Own car: Does the client own a car? (1=Yes, 0=No)

4. Own property: Does the client own property? (1=Yes, 0=No)

5. Work phone: Does the client own a work phone? (1=Yes, 0=No)

6. Phone: Does the client own a phone? (1=Yes, 0=No)

7. Email: Does the client have an email address? (1=Yes, 0=No)

8. Unemployed: Is the client unemployed? (1=Yes, 0=No)

9. Num children: Number of children

10. Num family: Number of family members

11. Account length: Number of months credit card has been owned

12. Total income: Total income (Chinese Yuan)

13. Age: Age in years

14. Years employed: Number of years employed

15. Income type

16. Education type

17. Family status

18. Housing type

19. Occupation type

20. Target: Target (1=high risk, 0=low risk)

(Before delving into further questions, get familiar with the data and check if there are any missing values.)

1. Partition the data into training and testing sets (e.g. 70-30)

2. Using the training dataset, produce a decision tree model with 'Unemployed' as the dependent variable and 'Own property', 'Phone', 'Email', and 'Num children' as the independent variables. Visualize the fitted tree. Use the testing dataset to report the accuracy in predicting the Unemployed values.

3. Using the training dataset, produce a decision tree model with 'Age' as the dependent variable and 'Total income', 'Num children', 'Unemployed', and 'Own property' as the independent variables. Visualize the fitted tree. Use the testing dataset to report the RMSE.

4. Produce a decision tree to predict the Target column values of high risk or low risk. Using the accuracy on the test dataset, which 4 independent variables (features) do you choose? Report the accuracy for the alternatives you considered (list at least 3).
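The four modelling questions above could be approached along the following lines with scikit-learn. This is a sketch under stated assumptions, not a model answer: the helpers `tree_accuracy` and `tree_rmse` are hypothetical names, `max_depth=3` and `random_state=0` are arbitrary choices, the three candidate feature sets are illustrative only, and `demo` is randomly generated stand-in data whose column names follow the list above (the real CSV headers may differ). Because the stand-in data is random, the scores it yields carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def tree_accuracy(train, test, features, target, max_depth=3):
    """Fit a classification tree on train; return (model, test accuracy)."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(train[features], train[target])
    return model, accuracy_score(test[target], model.predict(test[features]))


def tree_rmse(train, test, features, target, max_depth=3):
    """Fit a regression tree on train; return (model, test RMSE)."""
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train[features], train[target])
    mse = mean_squared_error(test[target], model.predict(test[features]))
    return model, float(np.sqrt(mse))


# Randomly generated stand-in data (assumption: replace with the real CSV).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Own property": rng.integers(0, 2, n),
    "Phone": rng.integers(0, 2, n),
    "Email": rng.integers(0, 2, n),
    "Num children": rng.integers(0, 4, n),
    "Unemployed": rng.integers(0, 2, n),
    "Total income": rng.normal(150_000, 40_000, n),
    "Age": rng.uniform(21, 65, n),
    "Target": rng.integers(0, 2, n),
})

# Question 1: 70-30 train-test partition.
train, test = train_test_split(demo, test_size=0.3, random_state=0)

# Question 2: classification tree for 'Unemployed'.
clf, acc = tree_accuracy(
    train, test, ["Own property", "Phone", "Email", "Num children"], "Unemployed"
)

# Question 3: regression tree for 'Age', scored by RMSE.
reg, rmse = tree_rmse(
    train, test,
    ["Total income", "Num children", "Unemployed", "Own property"], "Age",
)

# Question 4: compare alternative 4-feature sets by test accuracy.
candidates = [
    ["Own property", "Phone", "Email", "Num children"],
    ["Total income", "Age", "Unemployed", "Num children"],
    ["Total income", "Age", "Own property", "Phone"],
]
scores = {tuple(f): tree_accuracy(train, test, f, "Target")[1] for f in candidates}
best = max(scores, key=scores.get)

print(f"accuracy={acc:.3f}  rmse={rmse:.2f}")
for feats, score in scores.items():
    print(feats, round(score, 3))
```

A fitted tree can be drawn with `sklearn.tree.plot_tree(clf, feature_names=[...])`. Text-valued columns such as Income type would need encoding (for example with `pd.get_dummies`) before being passed to a tree.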

Credit contour visualization (2 points)

Produce a contour plot where the horizontal axis is the Account length, the vertical axis is the Age, and the z-value (contour level) is the Total income. The z-values can be produced using a linear model (regression) or with decision trees.
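As an illustration of the linear-model option, the sketch below fits a regression of Total income on Account length and Age, predicts over a regular grid, and draws filled contours. The helper name `income_surface`, the grid resolution, and the number of contour levels are assumptions made for the example, and `demo` is synthetic data standing in for the real file. Swapping `LinearRegression` for `DecisionTreeRegressor` would give the tree-based variant.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def income_surface(df, x_col="Account length", y_col="Age",
                   z_col="Total income", steps=50):
    """Fit z ~ x + y and return grid arrays (xx, yy, zz) of predicted z."""
    model = LinearRegression().fit(df[[x_col, y_col]], df[z_col])
    xs = np.linspace(df[x_col].min(), df[x_col].max(), steps)
    ys = np.linspace(df[y_col].min(), df[y_col].max(), steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = pd.DataFrame({x_col: xx.ravel(), y_col: yy.ravel()})
    zz = model.predict(grid).reshape(xx.shape)
    return xx, yy, zz


# Synthetic stand-in data (assumption: replace with the real CSV).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Account length": rng.integers(1, 61, 300),
    "Age": rng.uniform(21, 65, 300),
})
demo["Total income"] = (
    50_000 + 1_000 * demo["Account length"] + 2_000 * demo["Age"]
    + rng.normal(0, 5_000, 300)
)

xx, yy, zz = income_surface(demo)

fig, ax = plt.subplots()
contour = ax.contourf(xx, yy, zz, levels=15)
fig.colorbar(contour, ax=ax, label="Predicted Total income")
ax.set_xlabel("Account length (months)")
ax.set_ylabel("Age (years)")
ax.set_title("Predicted Total income (linear model)")
```

The figure can then be written to disk with `fig.savefig(...)` or shown with `plt.show()` when an interactive backend is in use. With a linear model the contour lines are straight and parallel; a decision tree would produce rectangular regions instead.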
