Stats 12 Lab 1: R Basics and Working with Data solution

$24.99

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (5 votes)

Objectives
1. Demonstrate basic R Skills (create vectors, perform vector operations, install packages)
2. Use basic statistical functions (mean, min, max, median, sd)
3. Visualize data (dot plot, histogram, box plot, contingency tables, scatter plot, bubble plot)
Before Coming to Lab
1. Download the latest version of R at http://cran.stat.ucla.edu/
2. After R is installed, download the latest version of RStudio (the free one on the left):
https://www.rstudio.com/products/rstudio/download3/
3. Download “Introductory Statistics with R” by Peter Dalgaard. The book is available for
free on springerlink.com. You must access it via a campus computer or using the BOL
VPN.
4. Read 1.1 and 1.2.1-1.2.3 of the Dalgaard book
5. The computer lab is equipped with all the tools necessary to complete your lab
assignments. However, if you prefer to work on your own computer you can bring your
laptop to the lab as well.
A Primer on labs
In Stats 12 the lab courses are designed to help you explore the capabilities of computer
programming to assist you in statistical analysis. This quarter we will be using RStudio, a highlevel statistical programming language. We will cover several basic topics in RStudio to help
you carry out analysis that you learn in lecture. RStudio is a popular program in universities
because it is free and open-source, which means anyone can access the program’s source code
and suggest improvements.
Collaboration Policy
In Lab, you are encouraged to work in pairs or small groups to discuss the concepts on the
assignments. However, DO NOT copy each other’s work as this constitutes cheating. The work
you submit must be entirely your own. If you have a question in lab, feel free to reach out to
other groups or as the TA if you get stuck.
Section 1 – R and RStudio Basics
Instructions
When you open RStudio, you’ll notice the window is partitioned into four sections:
• The top left section is your working code file. This is where you should type and run
your commands. It serves two main purposes:
o You can edit and save this section just like a word document. In fact, let’s save
right now. Go to file>save, and save this in a convenient location on your
computer.
o You will run lines of code by highlighting the code chunk you want to run, and
clicking the run button (shortcut: ctrl+enter for PC, cmd+enter for Mac)
• The bottom left section is the “console”. This is essentially the R engine which runs
your commands and produces output. You should not need to type anything in this
section, but you will need to copy and paste output for exercises.
• The top right section defines your environment. The environment contains any objects
you create and data you import. You won’t use it much but it’s a quick way to see what
data and objects you have to work with.
• The bottom left section contains three important tabs in this course:
o Plots: when you create plots, they will show up here. You should copy and paste
these to answer some of the lab exercises.
o Packages: R contains packages that you can install and run to use cool features
and functions that the basic R install did not give you.
o Help: If you are unsure about what a function does, try searching for it in the help
tab. You will find information about the function arguments and examples.
Working with packages
To work with packages, we must do two things:
1. Install the package by using the packages tab mentioned above
2. Type and run library(“package_name”) to tell R to load the package we installed
Reading Data into RStudio
R needs to be told which folder to look in to retrieve data. To read in data, first download it into
a convenient folder on your machine. Then, try one of these methods:
1. At the top of RStudio, go to Session> set working directory> choose directory. Then
select the folder the data is saved in, and click “Select Folder”. Then, create a new data
object by using the syntax object_name <- read.csv(file = “filename.csv”)
2. Use the syntax object_name <- read.csv(file.choose()) and use the file explorer to
manually select the downloaded data.
Vectors – Examples
Try running the commands below. The first creates a numeric vector object called numbers that
contains the numbers 1 to 5. The second creates a character vector object called schools that
contains the names of two top-tier colleges, and the name of a terrible school for spoiled
children. Note that when you create a vector of names, you must put quotes around each element.
The c() function is used to create both vectors. Try typing c() in the help menu to get a better
sense of what this function does.
numbers <- c(1,2,3,4,5)
schools <- c(“UCLA”, “UC Berkeley”, “USC”)
Note that when you run these commands, there is no output. That is because whenever you
define an object, there is not output. However, once you have made the object, you can use the
print function to print contents of an object. Example: type print(numbers)
If the vector is numeric, we can do math on it. For example, try typing numbers*2
We can also use square brackets to create subsets of our data. For example, if you type
schools[2] you will retrieve only the second element from the schools vector.
Object classes
Vectors, matrices, and data frames are three common object classes you will work with. Matrices
and data frames can be thought of as tables of data. However, matrices contain only numeric
data. Data frames can contain several types of data. The file you will work with in this lab is a
sample of information about babies born in North Carolina. It is considered a data frame because
it contains numeric information about each baby, as well as various pieces of categorical data
such as the race of the parents, and whether the mom has a smoking habit.
Section 1 Exercises
1. – vectors:
a. Create a vector named heights that contains the heights, in inches, of yourself and two
students near you. Print the contents of this vector.
b. Create a vector named names that contains the names of these people. Print the contents
of this vector.
c. Try typing cbind(heights, names). What did this command do? What class is this new
object?
2. – downloading data:
a. Download the data set births.csv from the CCLE site and upload it into Rstudio. Name
the data frame NCbirths.
b. Demonstrate that you have been successful by typing head(NCbirths) and copying and
pasting the output into your word processing document.
3. – Load the maps package
a. Install the maps package. Verify its installation by typing find.package(“maps”) and
include the output in your answer.
b. Type library(maps) to load up the package. Type map(“state”) and include the plot output
in your answer.
4. – Perform vector operations
a. Extract the weight variable as a vector from the dataframe by typing
weights <- NCbirths$weight
b. What units do you think the weights are in?
c. Create a new vector named weights.in.pounds which are the weights of the babies in
pounds. You can look up conversion factors on the internet.
d. Demonstrate your success by typing weights.in.pounds[1:20] and including the output in
your word processing document.
Section 2 – Summarizing Data (one variable)
Useful functions
The functions summary(), mean(), sd(), max(), and min() all produce helpful results to help us
understand how quantitative data is distributed. These functions will take in a numeric vector
inside the parentheses. As an example, max(NCbirths$weight) will give us the maximum weight
of all the babies in out sample.
We can also use the tally() function in the Mosaic package to help us summarize categorical
data. Tally requires a categorical vector inside the parentheses. For example, after you have
installed and loaded the Mosaic package, try typing tally(NCbirths$Racemom).
Section 2 – Exercises
1. What is the mean weight of the babies in pounds?
2. What percentage of the mothers in the sample smoke? Hint: use the tally function with
the format argument. Use the help screen for guidance.
3. According to the Centers for Disease Control, approximately 21% of adult Americans are
smokers. How far off is the percentage you found in question 2 from the CDC’s report?
Section 3 – Visualizing Data (one quantitative variable)
Useful functions
The histogram, dot plot, and box plot are useful tools for visualizing quantitative data. The
functions you can use in R are histogram(), dotPlot(), and boxplot(). Browse the help screen for
these functions to get a sense of what these functions require.
Section 3 – Exercises
1. Produce a dot plot of the weights in pounds.
2. Produce three different histograms of the weights in pounds. Use 3 bins, 20 bins, and 100
bins. Which histogram seems to give the best visualization, and why?
3. We can use the syntax boxplot(vector1, vector2) to make a side by side box plot. Create a
side by side boxplot of the mother’s ages and the father’s ages. Which gender tends to be
older?
4. Try typing histogram(~weight | Habit, data = NCbirths, layout = c(1,2)). Describe what
this code does. Based on the graph, do you see any major differences between baby
weights from smoking moms vs. non-smoking moms?
Section 4 – Visualizing Data (two categorical)
Useful functions
With two categorical variables we can again create two-way tables using the tally() function. For
example, try tally(~Habit | MomPriorCond, data = NCbirths, format = “proportion”)
Section 4 – Exercises
1. Consider the other categorical variables in this data. Of those that record the health of the
baby, which do you think will be associated with the mother’s smoking and why? Make a
two-way Summary Table to check your hypothesis. Do you have evidence that this
variable associated with smoking? Why?
Section 5 – Visualizing Data (two quantitative)
Useful functions
To visualize a potential relationship between two quantitative variables, we typically use scatter
plots. The syntax is plot(variable1~variable2)
Also, for most plotting functions we can add some arguments to make the plots nicely formatted
and look more presentable.
• The col argument changes the color of the data points.
• cex changes the size of the points
• pch should be an integer from 1 to 25 and changes the style of the points
• xlab and ylab define the labels for the x and y axes
• main gives us the title for the plot
As an example, run and compare the following lines of code:
plot(NCbirths$weight ~ NCbirths$Gained)
plot(NCbirths$weight ~ NCbirths$Gained, col = “red”, cex = 1.5, pch = 3,
xlab = “Weight gained during pregnancy”, ylab = “Baby weight (oz.)”,
main = “Baby weight vs. pregnancy weight gain”)
Most of these arguments work for histograms, box plots, and dot plots too!
Section 5 – Exercises
1. Produce a nicely formatted scatter plot of the weight of the baby vs. the mother’s age.
Section 6 – Visualizing Data (geographic data)
If we have geographic data, often the best way to visualize it is with a bubble plot. Please read
pages 8-11 of the handout titled “stat12_intro_to_R”. Read in the data as instructed, and replicate
the code to reproduce the plots.
Remember that the maps package needs to be installed and loaded before beginning this
exercise.
Section 6 – Exercises
1. To demonstrate you have followed the handout, produce a modified version of the
colored bubble plot. Rename the title to “California ozone bubble plot”. Also, use a
different point style, and different colors for each air quality category.