Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab5report.Rmd
In 2005, the U.S. Census Bureau introduced the American Community Survey (ACS). The ACS, sent to approximately 250,000 households monthly (or 3 million per year), gathers information about a variety of factors including demographic, economic, housing, and social data.
In this lab, we’re going to work with the responses from 26,134 Iowans to the 2013 ACS. Let’s download the dataset (from my website) and look at the header:
acs <- read.csv("http://bradthiessen.com/html5/stats/m300/acs13.csv")
head(acs)
## Status WageSalary TotalIncome
## 1 Employed 0 100000
## 2 Employed 0 110000
## 3 Employed 0 80000
## 4 Employed 0 56760
## 5 Employed 0 30000
## 6 Employed 0 45500
The full dataset has many more variables, but I limited our investigation to three:
Status
= employment status of each respondentWageSalary
= Salary and wages for each respondentTotalIncome
= Total income for each respondentLet’s look at the structure of our data with the str()
command:
str(acs)
## 'data.frame': 26134 obs. of 3 variables:
## $ Status : Factor w/ 6 levels "Child","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ WageSalary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TotalIncome: int 100000 110000 80000 56760 30000 45500 75000 48000 12000 7000 ...
The str()
command shows our dataset has 26,134 observations and 3 variables. The WageSalary
and TotalIncome
variables are stored as integers (as we should expect). The Status
variable is a factor variable with 6 different levels. We know the first two levels are “child” and “employed,” but what are the other levels?
To find out, we can either look at that variable’s attributes or construct a simple table:
# The $ symbol tells R to use the Status variable in the acs dataset
# dataset$variable
attributes(acs$Status)
## $levels
## [1] "Child" "Employed"
## [3] "Employed but not at work" "Military at work"
## [5] "Not in labor force" "Unemployed"
##
## $class
## [1] "factor"
table(acs$Status)
##
## Child Employed Employed but not at work
## 406 15286 305
## Military at work Not in labor force Unemployed
## 17 9418 702
You can see survey respondents are classified into one of 6 Status
categories:
We’re going to look at the distribution of incomes, so let’s only look at individuals who are employed. How can we do this? How can we pull out the responses from only the employed respondents?
There are a few different ways to do this:
Option #1:subset
The subset()
command, built into R, allows you to specify which data you’d like to keep. In this scenario, we want to keep all the data (in the acs dataset) for respondents with Status == "Employed"
.
We’ll store our subset of data as acsEmp
with the command:
acsEmp <- subset(acs, Status == "Employed")
acsEmp <- subset(acs, Status == "Employed")
Let’s construct a quick table of our new dataset to make sure it only has employed respondents:
table(acsEmp$Status)
##
## Child Employed Employed but not at work
## 0 15286 0
## Military at work Not in labor force Unemployed
## 0 0 0
Yep, it looks like it worked.
ames
has been loaded for you. It’s described in the lab report template. Create a table of the Street
variable to show how many houses are on paved and gravel roads. Then, use the subset()
command to select only houses on gravel roads. Construct another table to verify you’ve selected the correct subset of data.Option #2: []
A faster, better, yet more confusing way to subset your data is by using square brackets []
. The general form of this syntax is:
dataframe[ (rows to keep), (columns to keep) ]
In this scenario, we want to keep all the columns (all 3 variables). To do this, we can just leave the (columns to keep) blank. Our syntax is now:
dataframe[ (rows to keep), ]
The rows (observations) we want to keep are those in which Status == "Employed"
. In other words, our condition is that Status == "Employed"
.
We can feed this condition into the general form of our syntax:
dataframe[ dataframe$variable (condition) ]
to obtain the command
acsEmp2 <- acs[acs$Status == "Employed", ]
Notice I stored the results in a dataframe called acsEmp2
. We can get a table of the status variable within this dataframe to verify we correctly subsetted our data:
acsEmp2 <- acs[acs$Status == "Employed", ]
table(acsEmp2$Status)
##
## Child Employed Employed but not at work
## 0 15286 0
## Military at work Not in labor force Unemployed
## 0 0 0
Once again, our subset of data contains 15,286 employed respondents.
[]
to select only houses on gravel roads.Option #3: dplyr
Another popular way to select a subset of data is to use the dplyr package developed by Hadley Wickham.
The dplyr
package is designed to help you manipulate data quickly and efficiently. To this end, dplyr allows you to use 5 verbs on your data:
It also lets you use pipes %>%
to string together verbs to perform more complicated manipulations.
Let’s see how these verbs can be used on our simple dataset. First, we better install the dplyr
package using the command: install.packages("dplyr")
. I’ve already installed this package, so I’m going to load it now:
library(dplyr)
Our acs
dataset has two variables related to income: WageSalary
and TotalIncome
. Suppose we only want to analyze the WageSalary
variable. We can use the select()
verb to focus on that variable (and our Status
variable). I’ll store the new dataframe as acsSalary
:
acsSalary <- acs %>%
select(Status, WageSalary)
If we look at the header of this new dataframe, we see it has only the two variables we selected:
head(acsSalary)
## Status WageSalary
## 1 Employed 0
## 2 Employed 0
## 3 Employed 0
## 4 Employed 0
## 5 Employed 0
## 6 Employed 0
Let’s now use the filter
verb to do what we want – filter our data to keep only the cases in which Status == "Employed"
. We’ll chain this together with the select
verb to keep only our Status
and WageSalary
variables:
acsEmp3 <- acs %>%
select(Status, WageSalary) %>%
filter(Status == "Employed")
head(acsEmp3)
## Status WageSalary
## 1 Employed 0
## 2 Employed 0
## 3 Employed 0
## 4 Employed 0
## 5 Employed 0
## 6 Employed 0
Let’s take one more look at the dplyr
package by chaining together a series of those verbs. Suppose, with our original acs dataframe, we want to find the mean and median total income for adults by employment status. Furthermore (just to make this more difficult), suppose we want these values in cents (rather than dollars).
To do this, we’d use:
acs %>% # Take our acs dataframe
select(Status, WageSalary) %>% # Select variables of interest
group_by(Status) %>% # Group data by employment status
filter(Status != "Child") %>% # Filter out children
mutate(cents = WageSalary * 100) %>% # Create new "cents" variable
summarize(mean = mean(cents), sd = sd(cents)) # Calculate mean & sd
## Source: local data frame [5 x 3]
##
## Status mean sd
## 1 Employed 3605196.5 3877046
## 2 Employed but not at work 2442606.6 2975496
## 3 Military at work 6829411.8 7355505
## 4 Not in labor force 144407.2 1036304
## 5 Unemployed 826317.7 1388675
With the %>%
pipes chaining together these verbs, the code is pretty easy to understand by reading from left-to-right.
Street
, Year.Built
, and SalePrice
variables; (b) group the data by Street
type; (c) filter out houses sold for less than 100,000 (dollars); (d) mutate a new variable named PricePerFoot
that measures price per square foot (of above-ground living area); (e) summarize the mean PricePerFoot
.We now have a dataset consisting of the salaries of 15286 employed adults in Iowa in 2013. Let’s pretend as though this is our population distribution.
As you can see below, the distribution has a positive skew with a mean of $36,051.96` and a standard deviation of $38770.46.
# Don't worry about the syntax here
# I set lots of options to create this density plot
densityplot(~WageSalary, data=acsEmp,
xlab=list(paste("mean = ", round(mean(acsEmp$WageSalary),2),
" std dev = ", round(sd(acsEmp$WageSalary),4)),
cex=1.5),
main="Distribution of Iowa Salaries in 2013", plot.points=FALSE,
from=-5000, to=350000, lw=5,
col="steelblue", bw=10000)
Suppose we didn’t know the average income of employed adults in Iowa in 2013 was $36,051.96. Suppose we were going to estimate it by surveying a sample of employed adults in Iowa.
Let’s start by assuming we take a random sample of n = 5. Let’s take one random sample and see what average we get.
mean( sample(acsEmp$WageSalary, 5, replace=TRUE))
## [1] 43160
From this single sample, we’d say our best estimate of the average income of Iowa adults is $43,160.
If we took another random sample of size n = 5, we’d get a different estimate:
mean( sample(acsEmp$WageSalary, 5, replace=TRUE))
## [1] 18200
From this sample, our best estimate of the average income of Iowa adults would be $18,200.
Because we know the population mean ($36,052), we can evaluate our sample estimates. The first estimate ($43,160) wasn’t too far off; the second estimate ($18,200) greatly underestimated the population mean.
But if we didn’t know the population mean, we wouldn’t be able to evaluate these sample estimates. If they came from random samples, they would be equally valid estimates.
We might, then, be interested in knowing how “good” of an estimate we would get if we take a single sample of size n=5. In other words, if we repeatedly took samples of size n=5, what would this distribution look like?
In class, we learned how to predict this sampling distribution (distribution of sample means). We learned that:
On top of that, if the Central Limit Theorem holds:
Let’s simulate the sampling distribution by taking 10,000 samples of size n=5. Will the CLT hold?
# We'll start by calculating means for 10,000 samples of size n=5
samples5 <- do(10000) * mean(sample(acsEmp$WageSalary, 5, replace=TRUE))
With a skewed population distribution and a small sample size, we can predict that the CLT will not hold and our sampling distribution will not approximate a normal distribution. Let’s create a histogram of our 10,000 sample means to see:
histogram(~result, data = samples5, col="grey", border="white",
xlim=c(0, 200000), xlab = "n=5", width=2000)
Yep, that’s not a normal curve. Theory tells us that we should expect:
$38,770.46 / sqrt(5) =
round(38770.46 / sqrt(5) ,2)`These results hold as the number of samples we take approaches infinity. We only took 10,000 samples, so our results won’t be perfect. Let’s see what we got:
mean(samples5) #This should be close to 36052
## [1] 36300.88
sd(samples5) #This should be close to 17339
## [1] 17623.9
Those aren’t too far off.
Our sampling distribution certainly doesn’t look normal, but let’s take a closer look. First, let’s see how it compares to a perfect normal distribution:
# Plot the histogram again
histogram(~result, data = samples5, col="grey", border="white",
xlim=c(0, 200000), xlab = "n=5", width=2000)
# Add a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(5)^.5)), col="red", lwd=3, add=TRUE)
Ok, that makes it pretty clear that our sampling distribution doesn’t approximate a normal distribution. We can also look at a Q-Q plot:
qqnorm(samples5$result, ylab = "n = 5"); qqline(samples5$result, ylab = "n = 5")
This, again, shows our sampling distribution doesn’t approximate a normal distribution.
Would the CLT hold if we take a larger sample?
As many textbooks claim, the CLT holds if our sample size is “large enough.” The rough guideline is that large enough = 30. Let’s see how well that works in this example.
Let’s take 10,000 samples of size n=30, calculate the mean of each sample, and then plot the sampling distribution. We’ll then compare the sampling distribution to a normal distribution.
# We'll start by calculating means for 10,000 samples of size n=30
samples30 <- do(10000) * mean(sample(acsEmp$WageSalary, 30, replace=TRUE))
# Get the mean of our sample means
paste("Mean =", round(mean(samples30), 2), " (compared to the theoretical value of 36052)")
## [1] "Mean = 36045.87 (compared to the theoretical value of 36052)"
# Get the standard deviation of our sample means
paste("SD =", round(sd(samples30), 2),
" (compared to the theoretical value of ", round(sd(acsEmp$WageSalary)/sqrt(30),2))
## [1] "SD = 6984.56 (compared to the theoretical value of 7078.49"
# Plot a histogram of our sample means
histogram(~result, data = samples30, col="grey", border="white",
xlim=c(0, 100000), xlab = "n=30", width=1000)
#Superimpose a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(30)^.5)), col="red", lwd=3, add=TRUE)
Let’s look at a Q-Q plot to compare our sampling distribution with a normal distribution:
qqnorm(samples30$result, ylab = "n = 30"); qqline(samples30$result, ylab = "n = 30")
Let’s quickly see the sampling distribution of 10,000 means from samples of size n = 1000. That’s a sample size big enough for the CLT to hold… right?
samples1000 <- do(10000) * mean(sample(acsEmp$WageSalary, 1000, replace=TRUE))
# Get the mean of our sample means
paste("Mean =", round(mean(samples1000), 2), " (compared to the theoretical value of 36052)")
## [1] "Mean = 36056.47 (compared to the theoretical value of 36052)"
# Get the standard deviation of our sample means
paste("SD =", round(sd(samples1000), 2),
" (compared to the theoretical value of ", round(sd(acsEmp$WageSalary)/sqrt(1000),2))
## [1] "SD = 1243.4 (compared to the theoretical value of 1226.03"
# Plot a histogram of our sample means
histogram(~result, data = samples1000, col="grey", border="white",
xlim=c(30000, 43000), xlab = "n=1000", width=300)
#Superimpose a normal distribution on top of the histogram
plotDist("norm", params=list(mean=mean(acsEmp$WageSalary), sd=(sd(acsEmp$WageSalary)/(1000)^.5)), col="red", lwd=3, add=TRUE)
qqnorm(samples1000$result, ylab = "n = 30"); qqline(samples1000$result, ylab = "n = 30")
It looks like this sampling distribution does approximate a normal distribution.
The prices of all 2,390 homes sold in Ames, Iowa between 2006-2010 have been loaded in a vector named price
.
Calculate the mean and standard deviation of this price
variable.
Construct a histogram of the price
variable.
Take 10,000 samples of size n = 50 from this price distribution. Calculate the mean of each sample. Then, generate a plot of the sampling distribution (of those 10,000 means).
Use a graphical method to determine if your sampling distribution approximates a normal distribution.
Calculate the mean and standard deviation of your sampling distribution.
When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button:
Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.
Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab1report.html file to me via email or print it out and hand it in to me.
Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.