Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab6report.Rmd
IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.
plotDist("norm", params=list(mean=100, sd=16), col="steelblue", lwd=5)
If we wanted to calculate probabilities under this normal distribution, we could use the xpnorm() command:
To get P(X < 72), we use xpnorm(72, mean = 100, sd = 16)
To calculate P(72 < X < 110), we use xpnorm(c(72,110), mean = 100, sd = 16)
xpnorm(72, mean = 100, sd = 16) # P(X < 72)
##
## If X ~ N(100,16), then
##
## P(X <= 72) = P(Z <= -1.75) = 0.0401
## P(X > 72) = P(Z > -1.75) = 0.9599
## [1] 0.04005916
xpnorm(c(72,110), mean = 100, sd = 16) # P(72 < X < 110)
##
## If X ~ N(100,16), then
##
## P(X <= 72) = P(Z <= -1.75) = 0.0401
## P(X <= 110) = P(Z <= 0.625) = 0.734
## P(X > 72) = P(Z > -1.75) = 0.9599
## P(X > 110) = P(Z > 0.625) = 0.266
## [1] 0.04005916 0.73401447
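The same probabilities can be checked without the mosaic package; base R's pnorm() returns the cumulative probabilities that xpnorm() reports:

```r
# Cumulative probabilities under N(mean = 100, sd = 16), using base R only
p_below_72  <- pnorm(72, mean = 100, sd = 16)    # P(X < 72)
p_below_110 <- pnorm(110, mean = 100, sd = 16)   # P(X < 110)

round(p_below_72, 8)                 # matches xpnorm's 0.04005916
round(p_below_110 - p_below_72, 5)   # P(72 < X < 110) = 0.69396
```

Subtracting the two cumulative probabilities gives the probability between the two values, which is what the xpnorm(c(72,110), ...) output lets you compute by hand.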
For now, forget that you know anything about the distribution of IQ scores. Suppose you don’t know the shape, mean, or standard deviation of the distribution of IQ scores.
If you’re interested in this unknown population distribution, you might take a random sample of adults and measure their IQ scores. Let’s take a random sample of n = 16
adults from the “unknown” population. We’ll use set.seed()
to ensure we get the same sample each time we run this lab:
n <- 16 # Set size of our sample
set.seed(3141) # Set random number seed
# Take sample of size n from our population distribution
samp <- rnorm(n, mean = 100, sd = 16)
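As a quick sanity check (a base-R sketch; it assumes the default random-number generator, so the values should reproduce on your machine), we can look at the mean and standard deviation of this simulated sample, which reappear later in the lab:

```r
set.seed(3141)                           # same seed as above
samp <- rnorm(16, mean = 100, sd = 16)   # same simulated sample

round(mean(samp), 2)   # sample mean (about 104.12)
round(sd(samp), 4)     # sample standard deviation (about 16.3639)
```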
We can take a quick look at our sample data by plotting a dotplot:
# cex sets the size of the dots
dotPlot(samp, cex=.25, xlim=c(30,170))
With such a small sample size, this dotplot doesn’t really tell us what shape to expect for the population distribution. Let’s take a look at a Q-Q plot:
qqnorm(samp, ylab = "n = 16"); qqline(samp)
Based on this Q-Q plot, suppose we’re willing to believe the population distribution is normal (which it is in this case).
Without any explanation, let me show you the syntax to construct a 95% confidence interval in R. This confidence interval is theory-based – it uses the formula we derived in class:
(sample mean) +/- (t * SE)
where SE = sd/sqrt(n)
confint(t.test(samp, conf.level = 0.95))
## mean of x lower upper level
## 104.11760 95.39791 112.83728 0.95000
We told R to get a confidence interval (confint) based on the t-distribution (t.test). Our sample data was called samp and we wanted a 95% interval (conf.level = 0.95).
As you can see from the output, our confidence interval is (95.398, 112.837).
Let’s verify this result with our formula. To use our formula, we need the sample mean, the sample standard deviation, the sample size, and the appropriate t-score.
We already have our sample of data stored in samp (and we already told R to store the sample size as n), so let’s get the other components for our formula:
sampleMean <- mean(samp)
sampleSD <- sd(samp)
t <- qt(p = 0.975, df = n-1)
SE <- sampleSD / sqrt(n)
The syntax for calculating sampleMean, sampleSD, and SE should be easy enough to understand. To get the t-score, I needed to tell R the percentile to cut off and the degrees of freedom.
We learned in class that for a CI for a single mean, the degrees of freedom equal n-1. That means in this case, we have 15 degrees of freedom.
Since I want a 95% interval, I need to cut off 2.5% in each tail of the distribution. That means I want the t-score for the 0.025 and 0.975 percentiles. Both p=0.025 and p=0.975 would give me the same t-score – one would be negative and one would be positive. I chose the positive one, as displayed in the following plot:
xqt(0.975, df=n-1)
## [1] 2.13145
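You can confirm the symmetry claim directly with qt(): the two percentile cut-offs return the same t-score with opposite signs.

```r
# t-scores for the 2.5th and 97.5th percentiles, with 15 degrees of freedom
qt(0.025, df = 15)   # -2.13145
qt(0.975, df = 15)   #  2.13145
```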
Let’s summarize the values of all the components we need in our formula:
sampleMean = 104.12
sampleSD = 16.3639
n = 16
qt(p = 0.975, df = n-1) = 2.131
sampleSD / sqrt(n) = 4.091

From these components, we can calculate:
(sample mean) +/- (t * SE)
where SE = sd/sqrt(n)
Let’s calculate the lower and upper bounds of our CI with this formula:
lower <- sampleMean - (t * SE)
upper <- sampleMean + (t * SE)
# This will format the output in a single line
CI <- paste("95% CI: (", round(lower,5), ",", round(upper,5), ")")
CI
## [1] "95% CI: ( 95.39791 , 112.83728 )"
This confidence interval matches the output from the confint(t.test(samp, conf.level = 0.95)) command.
Notice that the true population mean of 100 (that we pretended we didn’t know) does fall inside our confidence interval. If we were to repeat this process many times (constructing confidence intervals for random samples of size n=16), we’d find that approximately 95% of those intervals would capture our population mean.
To see this, we can conduct a simulation. Let’s simulate the construction of 100 95% confidence intervals from samples of size n=16:
set.seed(3140) # Set the random number seed
CIsim(n=16, samples=100, rdist=rnorm, args = list(mean=100, sd=16), estimand=100)
## Did the interval cover?
## No Yes
## 0.05 0.95
Don’t worry about the syntax for that command. Instead, take a look at that plot. The horizontal line across the middle of the plot represents the population mean of 100. Each vertical line represents a confidence interval calculated from a random sample of 16 observations.
As you can see, 5 of the 100 intervals constructed do not capture the population mean. That means 95% of our intervals do capture the population mean.
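CIsim() hides the bookkeeping, but here is a base-R sketch of the same experiment. (The yes/no counts will differ slightly from CIsim()'s because the random draws are consumed in a different order.)

```r
set.seed(3140)
# For each of 100 trials: draw a sample of 16 from N(100, 16),
# build a 95% t-interval, and record whether it captures the true mean
covered <- replicate(100, {
  s  <- rnorm(16, mean = 100, sd = 16)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= 100 && 100 <= ci[2]
})
mean(covered)   # proportion of intervals that captured the mean
```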
The 1974 Motor Trend Car Road Test dataset has been loaded into R. The variable mtcars$mpg contains fuel efficiency data (measured in miles per gallon) for a sample of 32 cars. Use the confint(t.test()) command to construct a 95% confidence interval for the population mean fuel efficiency for cars in 1974.
This time, use the confint(t.test()) command to construct a 90% confidence interval. Explain why this interval is wider or narrower than the 95% confidence interval.
Let’s see what would happen if we started with a population that does not follow a normal distribution. Suppose the population follows an exponential distribution that looks like this:
plotDist("exp", col="steelblue", lwd=5)
Let’s simulate 100 95% confidence intervals for the mean and see how many capture the population mean:
set.seed(3141)
CIsim(n=16, samples=100, rdist=rexp, estimand=1)
## Did the interval cover?
## No Yes
## 0.11 0.89
When our population follows an exponential distribution, only 89% of our 100 simulated confidence intervals captured the population mean. Explain why.
Suppose we change our sample size to n=1000. If we repeatedly calculate 95% confidence intervals from a population that follows an exponential distribution, do you expect approximately 95% of them to capture the population mean? Explain. Then, run this simulation with the code that’s provided.
This time, let’s use bootstrap methods to construct a 95% confidence interval for IQ scores. Remember, we’re pretending we don’t know IQ scores are normally distributed with a mean of 100 and a standard deviation of 16.
Instead, we only know about our sample of 16 observations, which had sampleMean = 104.12 and sampleSD = 16.3639.
To take a bootstrap sample from our data, we select 16 observations with replacement from our sample. That means some observations may be chosen multiple times, while others aren’t chosen at all. We’ll use the resample() command to generate a bootstrap sample.
To verify this command works, let’s look at all 16 observations in our sample. We’ll sort them from lowest to highest:
sort(samp)
## [1] 72.25090 83.59202 86.72965 91.13124 92.89035 96.76862 100.34194
## [8] 103.75708 106.23904 111.04474 111.75398 112.04855 116.36972 124.37817
## [15] 125.92229 130.66328
You can see all 16 IQ scores, from the lowest (72.25090) to the highest (130.66328).
Now, let’s take a single bootstrap sample, sort them from lowest-to-highest, and display the results:
set.seed(3141)
sort(resample(samp))
## [1] 83.59202 83.59202 86.72965 91.13124 92.89035 92.89035 92.89035
## [8] 92.89035 103.75708 103.75708 111.04474 111.04474 116.36972 124.37817
## [15] 125.92229 125.92229
From that output, you can see the IQ score of 72.25090 was not chosen in this bootstrap sample. The next value (83.59202) was chosen twice. Looking across this bootstrap sample, you can see the IQ score 92.89035 was chosen 4 times!
We could easily calculate the mean of this bootstrap sample with the command mean(resample(samp)).
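Under the hood, resample() is ordinary sampling with replacement, so base R's sample() does the same job. (A sketch; the particular draws below will differ from those shown above because the random-number stream is used differently.)

```r
set.seed(3141)
samp <- rnorm(16, mean = 100, sd = 16)   # the same simulated sample

boot <- sample(samp, replace = TRUE)     # one bootstrap sample of size 16
table(round(boot, 5))                    # counts reveal repeated observations
mean(boot)                               # the mean of this bootstrap sample
```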
To construct a confidence interval, we’ll want to calculate the means of many bootstrap samples. Let’s generate 10,000 bootstrap samples and calculate the mean of each one. We’ll store those means in a dataframe named bootsamples:
set.seed(3141)
bootsamples <- do(10000) * mean(resample(samp))
head(bootsamples)
## result
## 1 103.1490
## 2 107.9057
## 3 106.3548
## 4 104.3306
## 5 102.5325
## 6 100.8986
The output shows the means of the first 6 bootstrap samples (ranging from 100.8986 to 107.9057). To visualize all 10,000 means of our bootstrap samples, we can generate a histogram (or, in this case, a density plot):
densityplot(~result, data=bootsamples, plot.points = FALSE, col="steelblue", lwd=4)
To get a 95% confidence interval, we simply need to find the 2.5 and 97.5 percentiles of this distribution. In other words, we need to find the bootstrap sample means that cut off 2.5% in each tail. To do this, we can use the confint() command:
confint(bootsamples, level = 0.95, method = "quantile")
## name lower upper level method
## 1 result 96.28272 111.8013 0.95 quantile
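If you'd like to see the quantile method without mosaic, here is a base-R sketch using replicate() and quantile(). (The bounds will differ slightly from confint()'s output because the bootstrap draws differ.)

```r
set.seed(3141)
samp <- rnorm(16, mean = 100, sd = 16)   # the same simulated sample

# 10,000 bootstrap means, base-R style
boot_means <- replicate(10000, mean(sample(samp, replace = TRUE)))

# Percentile ("quantile") method: cut off 2.5% in each tail
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```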
Our bootstrap methods yielded a 95% confidence interval of:
(96.283, 111.801)
The theory-based methods we used earlier resulted in this 95% CI:
(95.398, 112.837)
They’re similar to one another because our sample came from a population with a normal distribution.
Use bootstrap methods to construct a 95% confidence interval for the fuel efficiency of 1974 cars (the mtcars$mpg variable). Then, calculate the center and width of the theory-based and bootstrap confidence intervals.
Recall from the previous lab that the average income of employed Iowa adults in 2013 followed a positively-skewed distribution with a mean of $36,052. I’ve loaded a random sample of 30 observations and stored it in a vector named incomes. Using this sample data, construct a 95% confidence interval for the population mean using both theory-based and bootstrap methods. Briefly explain which method (theory-based or bootstrap) you think is more appropriate in this scenario.
One advantage of the bootstrap method is that we can extend it to get confidence intervals for other parameters. For example, to construct a 95% confidence interval for the population median IQ, we could use:
bootmedians <- do(10000) * median(resample(samp))
densityplot(~result, data=bootmedians, plot.points = FALSE, col="steelblue", lwd=4)
confint(bootmedians, level = 0.95, method = "quantile")
## name lower upper level method
## 1 result 92.89035 114.0619 0.95 quantile
This interval does, in fact, capture our population median IQ of 100.
Let’s try to construct a confidence interval for the population standard deviation (which we know is equal to 16):
bootsd <- do(10000) * sd(resample(samp))
densityplot(~result, data=bootsd, plot.points = FALSE, col="steelblue", lwd=4)
confint(bootsd, level = 0.95, method = "quantile")
## name lower upper level method
## 1 result 11.05526 20.08566 0.95 quantile
Using the incomes vector, construct 95% confidence intervals for the population median and standard deviation.

Scenario #3: Sleep time
Because it makes an important point about interpreting confidence intervals, I wanted to replicate the simple data collection example we did in class. Let’s construct a 90% confidence interval for the average number of hours slept by St. Ambrose students last night.
Here are the hours slept by a random sample of 20 (fictitious) St. Ambrose students:
sleep <-c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)
dotPlot(sleep, cex=.4, xlim=c(4,12))
qqnorm(sleep, ylab = "n = 20"); qqline(sleep)
From these 20 observations, we could calculate:
mean(sleep) = 7.28
sd(sleep) = 1.56

We could get a theory-based confidence interval with the confint command:
confint(t.test(sleep, conf.level = 0.90))
## mean of x lower upper level
## 7.275000 6.671838 7.878162 0.900000
We could also use bootstrap methods to get the CI:
bootsleep <- do(10000) * mean(resample(sleep))
confint(bootsleep, level = 0.90, method = "quantile")
## name lower upper level method
## 1 result 6.725 7.875 0.9 quantile
Rounded to a single decimal place, both methods yielded the same confidence interval:
90% CI for mean number of hours slept last night: (6.7, 7.9)
To get at a common misconception students have about confidence intervals, let me ask a question:
What proportion of our 20 (fictitious) students slept between 6.7 and 7.9 hours?
Because 6.7 and 7.9 are the bounds of our 90% confidence interval, many students think 90% of our data is between those two values. Let’s check to see if this is correct. Let’s find the proportion of our 20 observations that are between 6.7 and 7.9:
prop(~(sleep >= 6.7 & sleep <= 7.9))
## TRUE
## 0.25
Only 25% of our data was in our confidence interval! Why is this the case? Remember, confidence intervals aren’t statements of confidence about individual observations; rather, confidence intervals try to capture population means.
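The prop() command comes from mosaic; the same check works in base R, where the mean of a logical vector is the proportion of TRUE values:

```r
sleep <- c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)

# Proportion of observations inside the 90% CI bounds (6.7, 7.9)
mean(sleep >= 6.7 & sleep <= 7.9)   # 0.25
```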
In class, we derived a formula for constructing confidence intervals for proportions. We then learned our formula didn’t really yield good confidence intervals, so we briefly mentioned alternatives like the Clopper-Pearson and Wilson +4 methods. Let’s see how to construct these CIs in R.
First, we need a dataset…
In 2012, WIN-Gallup International published results from a survey that asked, among other things, the question: Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?
We can load the data from this survey with the command:
load(url("http://www.openintro.org/stat/data/atheism.RData"))
The atheism dataframe includes responses from a total of 88,032 individuals across 57 countries in the years 2005 and 2012. Let’s take a look at a random sample of this data:
sample(atheism, 10, orig.ids=FALSE)
## nationality response year
## 5374 Belgium non-atheist 2012
## 39494 Peru non-atheist 2012
## 68075 United States non-atheist 2005
## 18686 Germany non-atheist 2012
## 79044 Pakistan non-atheist 2005
## 65057 Russian Federation non-atheist 2005
## 8806 Bulgaria non-atheist 2012
## 62124 Spain non-atheist 2005
## 50163 United States non-atheist 2012
## 67141 Moldova non-atheist 2005
This dataset is big, so let’s look at a subset of the data. Let’s look at the results from the U.S. in the year 2012. To get this subset of data, let’s use the subset() command:
us12 <- subset(atheism, nationality == "United States" & year == "2012")
A table will summarize the responses to the question about religious beliefs:
table(us12$response)
##
## atheist non-atheist
## 50 952
Out of 1002 total respondents, 50 (or 4.99%) identified themselves as atheists. Since this is a random sample of American adults, let’s construct a 95% confidence interval for the proportion of all American adults who identify as atheists.
With the mosaic package that we’ve been using in our labs, the command to construct a confidence interval for a proportion is confint(prop.test(x, n, conf.level)).
In this scenario, x = 50 = number of atheists and n = 1002 = total number of respondents.
confint(prop.test(50, 1002, conf.level=0.95))
## p lower upper level
## 0.04990020 0.03761982 0.06574456 0.95000000
This confidence interval includes a continuity correction (similar to the Wilson +4 intervals we saw in class). If we wanted to use our (not-as-good) formula without the correction, we can tell R:
confint(prop.test(50, 1002, conf.level=0.95, correct=FALSE))
## p lower upper level
## 0.04990020 0.03805375 0.06518465 0.95000000
With such a large sample size, the correction didn’t matter too much.
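For comparison, the classroom formula can also be computed by hand. (A sketch; note that its bounds won't exactly match the prop.test() output above, since prop.test() computes a score-based interval rather than this simple formula.)

```r
phat <- 50 / 1002                        # sample proportion of atheists
se   <- sqrt(phat * (1 - phat) / 1002)   # standard error of the proportion
z    <- qnorm(0.975)                     # z-score for a 95% interval (about 1.96)

# phat +/- z * SE
c(lower = phat - z * se, upper = phat + z * se)
```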
If we want a Clopper-Pearson interval, we can tell R to run a binomial test (binom.test) on our dataframe. We also need to tell R that we consider atheist to be a “win”:
binom.test(us12$response == "atheist")$conf.int
## [1] 0.03726042 0.06526103
## attr(,"conf.level")
## [1] 0.95
## attr(,"method")
## [1] "Score"
When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button:
Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.
Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab6report.html file to me via email or print it out and hand it in to me.
Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.