MIE 1624 Introduction to Data Science and Analytics Assignment 1 solution

Background:
For this assignment, you are responsible for answering the questions below based on the dataset provided.
You will then need to submit a 2-page report in which you present the results of your analysis. In your
report, you should use visual forms to present your results. How you decide to present your results (i.e.
with tables/plots/etc.) is up to you but your choice should make the results of your analysis clear and
obvious. In your report, you will need to explain what you have used to arrive at the answer to the
research question and why it was appropriate for the data/question. You must interpret your final results in
the context of the dataset for your problem.
Dataset:
Kaggle hosted an open data science competition in 2020 titled “Kaggle ML & DS Survey
Challenge.” The purpose of this challenge was to “tell a data story about a subset of the data science
community represented in this survey, through a combination of both narrative text and data exploration.”
More information on the competition, data, and prizes can be found on:
https://www.kaggle.com/c/kaggle-survey-2020/data
The dataset provided (kaggle_survey_2020_responses.csv) contains the survey results provided by
Kaggle. The survey results from 20036 participants are shown in 355 columns, representing survey
questions. Not all questions are answered by each participant, and responses contain various data types.
In the dataset for Assignment 1, column Q24 “What is your current yearly compensation (approximate
$USD)?” contains a numerical target variable. Rows with null salaries have been dropped. (Please refer to
clean_kaggle_data.csv). You should work with the clean dataset for this assignment.
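A minimal sketch of loading the data and summarizing salary by group might look like the following. The column names Q2, Q4 and Q24 come from the description above; the helper name `load_salaries`, the in-memory stand-in CSV, and its rows are hypothetical and exist only so the snippet runs on its own. In the notebook the call would instead be `load_salaries("clean_kaggle_data.csv")`.

```python
import io

import pandas as pd


def load_salaries(source):
    """Read the survey CSV and keep only rows with a numeric Q24 salary."""
    df = pd.read_csv(source, low_memory=False)
    # Coerce Q24 to numbers; anything unparseable becomes NaN and is dropped.
    df["Q24"] = pd.to_numeric(df["Q24"], errors="coerce")
    return df.dropna(subset=["Q24"]).reset_index(drop=True)


# Hypothetical stand-in for clean_kaggle_data.csv (made-up rows, illustration only).
sample_csv = io.StringIO(
    "Q2,Q4,Q24\n"
    "Man,Master's degree,50000\n"
    "Woman,Doctoral degree,70000\n"
    "Man,Bachelor's degree,30000\n"
    "Woman,Master's degree,not reported\n"
)

df = load_salaries(sample_csv)

# Descriptive statistics of salary for each Q2 group.
summary = df.groupby("Q2")["Q24"].describe()
print(summary)
```

The same `groupby(...).describe()` call works with Q4 in place of Q2 when comparing education levels.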
Questions:
The objective of this assignment is to explore the survey data to understand (1) the nature of women’s
representation in Data Science and Machine Learning and (2) the effects of education on income level.
The following tasks should be completed:
1. [3pts] Perform exploratory data analysis to analyze the survey dataset and to summarize its main
characteristics. Present 3 graphical figures that represent different trends in the data. For your
exploratory data analysis, you can consider Country, Age, Education, Professional Experience,
and Salary.
1/4 MIE 1624 Introduction to Data Science and Analytics – Assignment 1
2. [4pts] Estimate the difference between the average salary (Q24) of men and women (Q2).
a. [0.5pts] Compute and report descriptive statistics for each group (remove missing data, if
necessary).
b. [0.5pts] If suitable, perform a two-sample t-test with a 0.05 threshold. Explain your
rationale.
c. [1.5pts] Bootstrap your data for comparing the mean of salary (Q24) for the two groups.
Note that the number of instances you sample from each group should be relative to its
size. Use 1000 replications. Plot two bootstrapped distributions (for men and women) and
the distribution of the difference in means.
d. [0.5pts] If suitable, perform a two-sample t-test with a 0.05 threshold on the bootstrapped
data. Explain your rationale.
e. [1pts] Comment on your findings.
3. [5pts] Select “highest level of formal education” (Q4) from the dataset and repeat steps a to e,
this time using analysis of variance (ANOVA) instead of a t-test for hypothesis testing to compare the
means of salary for three groups (Bachelor’s degree, Doctoral degree, and Master’s degree)
[0.75pts for a; 0.5 pts for b; 2pts for c; 0.75 pts for d; 1pt for e].
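The testing and bootstrapping steps in tasks 2 and 3 can be sketched as follows. This is one possible reading, not a reference solution: it assumes "relative to its size" means each group is resampled with replacement at its own size, it uses Welch's variant of the t-test (`equal_var=False`), and every array below is synthetic placeholder data rather than survey salaries. The helper `bootstrap_means` is a hypothetical name. Whether each test is suitable for the real data remains a judgment the report must justify.

```python
import numpy as np
from scipy import stats


def bootstrap_means(values, n_reps=1000, rng=None):
    """Return n_reps means, each from a with-replacement resample of the group's own size."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_reps)
    ])


rng = np.random.default_rng(0)

# Synthetic placeholder salaries, NOT survey data; swap in the Q24 values per Q2 group.
men = rng.lognormal(mean=10.5, sigma=1.0, size=400)
women = rng.lognormal(mean=10.3, sigma=1.0, size=100)

# Two-sample t-test on the raw groups (Welch's variant, an assumption made here).
t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)

# 1000 bootstrap replications per group, then the distribution of the difference in means.
boot_men = bootstrap_means(men, rng=rng)
boot_women = bootstrap_means(women, rng=rng)
boot_diff = boot_men - boot_women
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])

# Task 3: one-way ANOVA across three education groups (again synthetic placeholders).
education = {
    "bachelor": rng.lognormal(mean=10.2, sigma=1.0, size=300),
    "master": rng.lognormal(mean=10.4, sigma=1.0, size=450),
    "doctoral": rng.lognormal(mean=10.6, sigma=1.0, size=150),
}
f_stat, p_anova = stats.f_oneway(*education.values())

# ANOVA on the bootstrapped means of the same three groups.
boot_education = {name: bootstrap_means(vals, rng=rng) for name, vals in education.items()}
f_boot, p_boot = stats.f_oneway(*boot_education.values())

print(f"t-test p-value: {p_value:.4f}")
print(f"95% bootstrap interval for the difference in means: [{ci_low:.0f}, {ci_high:.0f}]")
print(f"ANOVA p-value (raw groups): {p_anova:.4f}")
print(f"ANOVA p-value (bootstrapped means): {p_boot:.4g}")
```

The arrays `boot_men`, `boot_women` and `boot_diff` can be passed directly to Matplotlib's `plt.hist` to produce the three distributions requested in task 2c.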
Submission:
1) Produce a 2-page report explaining your response to each question for the given data set and
detailing the analysis you performed. When writing the report, make sure to explain, for each step,
what you are doing, why it is important, and the pros and cons of that approach.
2) Produce an IPython Notebook detailing the analysis you performed to answer the questions for
the given data set.
Tools:
● Software:
○ Python Version 3.X is required for this assignment. Your code should run on the
CognitiveClass Virtual Lab http://labs.cognitiveclass.ai (Kernel 3). All libraries are
allowed, but here is a list of the major libraries you might consider: NumPy, SciPy,
scikit-learn, Matplotlib, pandas.
○ No other tool or software besides Python and its component libraries can be used to touch
the data files. For instance, using Microsoft Excel to clean the data is not allowed.
○ Read the required data file from the same directory as your notebook on the
CognitiveClass Virtual Lab – for example, pd.read_csv("clean_kaggle_data.csv").
● Required data files:
○ clean_kaggle_data.csv: survey responses with yearly compensation.
○ The data file cannot be altered by any means. The Jupyter notebook will be run using a
local version of this data file. Do not save anything to file within the notebook and read it
back.
What to submit:
1. Submit via Quercus a Jupyter (IPython) notebook containing your implementation and motivation
for all the steps of the analysis with the following naming convention:
lastname_studentnumber_assignment1.ipynb
Make sure that you comment on your code appropriately and describe each step in sufficient
detail. Respect the above convention when naming your file, making sure that all letters are
lowercase and underscores are used as shown. A program that cannot be evaluated because it
varies from specifications will receive zero marks.
2. Submit a report in PDF format presenting the findings from your analysis. Use the following naming
convention: lastname_studentnumber_assignment1.pdf
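As a quick self-check before uploading, the naming convention can be expressed as a regular expression. The pattern below is an assumption rather than an official rule: it takes the last name to be lowercase ASCII letters only and the student number to be digits only, and the function name and example filenames are hypothetical.

```python
import re

# Assumed pattern: lowercase letters, underscore, digits, underscore, "assignment1", extension.
SUBMISSION_NAME = re.compile(r"[a-z]+_[0-9]+_assignment1\.(ipynb|pdf)")


def is_valid_submission_name(filename):
    """Return True if the whole filename matches the assumed convention."""
    return SUBMISSION_NAME.fullmatch(filename) is not None


print(is_valid_submission_name("smith_1001234567_assignment1.ipynb"))  # True
print(is_valid_submission_name("Smith_1001234567_Assignment1.ipynb"))  # False: uppercase letters
```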
Late submissions will receive a standard penalty:
● up to one hour late – no penalty
● one day late – 15% penalty
● two days late – 30% penalty
● three days late – 45% penalty
● more than three days late – mark of zero
Other requirements:
1. A large portion of marks is allocated to analysis and justification. Full marks will not be given for
code alone.
2. Output must be shown and readable in the notebook. The only files that can be read into the
notebook are the files posted in the assignment without modification. All work must be done
within the notebook.
3. The notebook should be presentable; do not show large amounts of raw output.
4. Ensure the code runs in full before submitting. Open the code in CognitiveClass Virtual Lab
(Kernel 3) and navigate to Kernel -> Restart Kernel and Run all Cells. Ensure that there are no
errors.