Let’s load the mosaic package:
require(mosaic)
# Calculate the probability
xpnorm(32.76, mean = 21, sd = 6, verbose=FALSE, plot=FALSE) - xpnorm(9.24, mean = 21, sd = 6, verbose=FALSE, plot=FALSE)
## [1] 0.95
## Graph the shaded normal distribution
# First, plot the distribution (note: mosaic's function is plotDist, not plotdist)
curve(dnorm(x,21,6), xlim=c(5, 37), col="steelblue", lwd=3)
# Then set coordinates for a shaded polygon
cord.x <- c(9.24,seq(9.24,32.76,0.01),32.76)
cord.y <- c(0,dnorm(seq(9.24, 32.76, 0.01), mean=21, sd=6),0)
# Graph the shaded region
polygon(cord.x,cord.y,col='skyblue')
From this, we see the probability is almost exactly 0.95.
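Where do 9.24 and 32.76 come from? They’re just 21 +/- 1.96(6) = 21 +/- 11.76. A quick base-R check (no mosaic functions needed):

```r
# 95% margin of error: the z-critical value times sigma
margin <- qnorm(0.975) * 6   # about 11.76
# Area under Normal(21, 6) between 21 - margin and 21 + margin
pnorm(21 + margin, mean = 21, sd = 6) - pnorm(21 - margin, mean = 21, sd = 6)
# about 0.95
```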
Let’s test this claim by simulation: we’ll randomly sample one person (and repeat this process 10,000 times), add and subtract 11.76 from each sampled score, and see how many of the resulting intervals capture the population mean of 21.
## Sample 1 score from this normal distribution 10,000 times
# Create a simulated population of 100,000 ACT scores
ACTscores<-rnorm(n=100000, mean=21, sd=6)
# Sample one score 10,000 times
sampleACT<- do(10000) * sample(ACTscores, 1)
# Subtract and add 11.76 from each observation sampled
sampleACT$lower <- sampleACT$result-11.76
sampleACT$upper <- sampleACT$result+11.76
# Determine if the interval captures mu
sampleACT$captured <- sampleACT$lower<21 & sampleACT$upper>21
Let’s take a look at what we’ve generated:
head(sampleACT)
## result lower upper captured
## 1 13.81 2.053 25.57 TRUE
## 2 26.09 14.333 37.85 TRUE
## 3 15.04 3.280 26.80 TRUE
## 4 23.60 11.845 35.36 TRUE
## 5 15.88 4.120 27.64 TRUE
## 6 16.94 5.183 28.70 TRUE
The first column shows the ACT scores we sampled (obviously, the scores aren’t real ACT scores). The next two columns show the scores +/- 11.76. The last column shows TRUE if 21 (the population mean) is between the lower and upper bounds.
From this, we can calculate the proportion of ACT score intervals (score +/- 11.76) that capture mu (21):
prop(~captured, data=sampleACT)
## TRUE
## 0.9539
We replicated our answer of 95%.
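The same simulation can be written without mosaic. This base-R sketch uses replicate() in place of do(); the exact proportion varies run to run but should land near 0.95:

```r
set.seed(42)                                    # for reproducibility
ACTscores <- rnorm(100000, mean = 21, sd = 6)   # simulated population
# Draw one score, 10,000 times
scores <- replicate(10000, sample(ACTscores, 1))
# Does the interval (score - 11.76, score + 11.76) capture mu = 21?
captured <- (scores - 11.76 < 21) & (scores + 11.76 > 21)
mean(captured)   # proportion of intervals capturing mu; roughly 0.95
```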
Now suppose we take samples of n = 16 scores and average them. The sample mean has standard error 6/√16 = 1.5, so its 95% margin of error is 1.96 × 1.5 = 2.94. First, let’s calculate the (theoretical) probability:
## Graph the shaded normal distribution
# First, plot the distribution (note: mosaic's function is plotDist, not plotdist)
curve(dnorm(x,21,1.5), xlim=c(15, 26), col="steelblue", lwd=3)
# Then set coordinates for a shaded polygon
cord.x <- c((21-2.94),seq((21-2.94),(21+2.94),0.01),(21+2.94))
cord.y <- c(0,dnorm(seq((21-2.94), (21+2.94), 0.01), mean=21, sd=1.5),0)
# Graph the shaded region
polygon(cord.x,cord.y,col='skyblue')
# Calculate the probability
xpnorm((21+2.94), mean = 21, sd = 1.5, verbose=FALSE, plot=FALSE) - xpnorm((21-2.94), mean = 21, sd = 1.5, verbose=FALSE, plot=FALSE)
## [1] 0.95
Once again, we calculate 95%. In any normal curve, 95% of the observations are within 1.96 standard deviations of the mean.
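That 2.94 margin comes from the standard error of the mean for n = 16, which we can verify in base R:

```r
sigma <- 6
n <- 16
se <- sigma / sqrt(n)   # standard error of the mean = 1.5
qnorm(0.975) * se       # 95% margin of error, about 2.94
```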
Now, let’s get simulated results. First, let’s sample 16 ACT scores 10,000 times and calculate an average for each sample.
## Sample 16 scores from this normal distribution 10,000 times
# Sample 16 scores, calculate an average, and repeat 10,000 times
sampleACT16<- do(10000) * mean(sample(ACTscores, 16))
# Subtract and add 2.94 from each observation sampled
sampleACT16$lower <- sampleACT16$result-2.94
sampleACT16$upper <- sampleACT16$result+2.94
# Determine if the interval captures mu
sampleACT16$captured <- sampleACT16$lower<21 & sampleACT16$upper>21
Once again, the last column will tell us whether each sample captured the true population mean or not:
head(sampleACT16)
## result lower upper captured
## 1 22.51 19.57 25.45 TRUE
## 2 18.83 15.89 21.77 TRUE
## 3 22.52 19.58 25.46 TRUE
## 4 22.19 19.25 25.13 TRUE
## 5 22.20 19.26 25.14 TRUE
## 6 22.22 19.28 25.16 TRUE
Now we can see what proportion of the intervals captured 21 (equivalently, what proportion of sample means fell within 21 +/- 2.94):
prop(~captured, data=sampleACT16)
## TRUE
## 0.9497
Yep, it’s 95%.
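As before, a base-R version of this simulation (replicate() instead of do()) gives essentially the same proportion:

```r
set.seed(1)
ACTscores <- rnorm(100000, mean = 21, sd = 6)   # simulated population
# Mean of 16 sampled scores, repeated 10,000 times
means16 <- replicate(10000, mean(sample(ACTscores, 16)))
# Proportion of intervals (mean +/- 2.94) that capture mu = 21
mean((means16 - 2.94 < 21) & (means16 + 2.94 > 21))   # roughly 0.95
```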
We know 95% of observations fall within +/-1.96 standard deviations of the mean, so there’s a 95% chance this score comes within $29,400 (= 1.96 * 15000) of the mean. Therefore, if we add and subtract 29,400 from this observation, we’ll have a 95% chance of capturing mu.
So the confidence interval, assuming normality, would be: 40000 +/- 29400 = ($10,600 , $69,400).
We can use R to speed up these calculations. (Note: There’s a much easier way to do this when we do NOT know the population standard deviation).
# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 40000
sigma=15000
n=1
confidence = 0.95
# Calculate alpha and the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 10600.5402318992 , 69399.4597681008 )"
So our 95% confidence interval (based on a single observation) is approximately ($10,600, $69,400). That matches what we know it should be (assuming student debt load is normally distributed).
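Since we’ll reuse this block twice more below, it could be wrapped in a small helper function. (z_interval is my own name for it; it’s not from any package.)

```r
# Hypothetical helper: z-based CI for a mean when sigma is known
z_interval <- function(xbar, sigma, n, confidence = 0.95) {
  alpha <- (1 - confidence) / 2
  stdError <- sigma / sqrt(n)
  c(lower = xbar - qnorm(1 - alpha) * stdError,
    upper = xbar + qnorm(1 - alpha) * stdError)
}
z_interval(40000, 15000, 1)   # about (10600.5, 69399.5)
```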
Let’s calculate a confidence interval for question B:
Let’s use the code we used in the previous example:
# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 25000
sigma=15000
n=100
confidence = 0.95
# Calculate alpha and the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 22060.0540231899 , 27939.9459768101 )"
Our CI is now ($22,060, $27,940).
Finally, let’s answer question C:
# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 21000
sigma=15000
n=10000
confidence = 0.95
# Calculate alpha and the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 20706.005402319 , 21293.994597681 )"
With such a large sample size, our confidence interval is fairly narrow: ($20,706, $21,294). In fact, the width of that interval is:
upper-lower
## [1] 588
If we wanted to be more confident, we’d have to expand that interval. As an example, here’s a 99% confidence interval for mu:
# Enter values for the observed sample mean, population std. dev., sample size, and confidence level
observedMean = 21000
sigma=15000
n=10000
confidence = 0.99
# Calculate alpha and the standard error
alpha = (1-confidence)/2
stdError <- sigma/sqrt(n)
# This calculates the confidence interval limits:
lower<- observedMean - qnorm(1-alpha)*(stdError)
upper<- observedMean + qnorm(1-alpha)*(stdError)
paste("CI: (", lower, " , ", upper, ")")
## [1] "CI: ( 20613.6256044677 , 21386.3743955323 )"
The width of this interval is:
upper-lower
## [1] 772.7
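Both widths can be checked directly: a z-interval’s width is 2 × z* × SE, and moving from 95% to 99% confidence replaces z* = 1.96 with z* = 2.576:

```r
se <- 15000 / sqrt(10000)   # standard error = 150
2 * qnorm(0.975) * se       # 95% interval width, about 588
2 * qnorm(0.995) * se       # 99% interval width, about 772.7
```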
So far, we’ve assumed that we knew the population standard deviation in all our examples. As I explained in class, that’s not a very reasonable assumption in many cases. If we don’t know the population standard deviation, we can substitute the sample standard deviation if we use a t-distribution to construct our confidence intervals.
Let’s look at 3 examples of t-distributions.
plotDist("t", params=list(df=2000), col="lightblue", lwd=4)
plotDist("t", params=list(df=10), col="blue", lwd=4, add=TRUE)
plotDist("t", params=list(df=2), col="darkblue", lwd=4, add=TRUE)
plotDist("norm", col="red", lwd=1, add=TRUE)
The dark blue plot is a t-distribution with 2 degrees of freedom. The medium blue plot has 10 degrees of freedom. The light blue plot (which you can’t really see) has 2000 degrees of freedom and is almost exactly the same as the red plot (which is the standard normal distribution).
As you can see, as degrees of freedom increase, the t-distribution better approximates a normal distribution.
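The same convergence shows up in the critical values: as the degrees of freedom grow, qt() approaches qnorm():

```r
# 97.5th-percentile critical values
qt(0.975, df = 2)      # about 4.30
qt(0.975, df = 10)     # about 2.23
qt(0.975, df = 2000)   # about 1.961
qnorm(0.975)           # about 1.96
```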
It’s May 30, 2014, and I don’t have any students right now, so I can’t get a real dataset of sleep times. I’ll make up some data for 20 hypothetical students:
sleep <-c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)
favstats(sleep)
## min Q1 median Q3 max mean sd n missing
## 4.5 6.375 7 8 11 7.275 1.56 20 0
As you can see from our sample:

* The sample size is 20
* The sample mean is 7.275
* The sample standard deviation is 1.56
Let’s let R construct a 90% confidence interval:
confint(t.test(~sleep, conf.level=0.90))
## mean of x lower upper level
## 7.275 6.672 7.878 0.900
The interval is (6.6718, 7.8782). One question statistics teachers love to ask is: Does this mean 90% of students in this class slept between 6.6718 and 7.8782 hours?
To answer that question, let’s have the computer tell us what proportion of our 20 students slept between 6.6718 and 7.8782 hours:
prop(~(sleep >= 6.6718 & sleep <= 7.8782))
## TRUE
## 0.25
Only 25% of our observations fell in our confidence interval. That’s because our confidence interval deals with the population mean hours slept; not our individual sample data.
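The interval R reported can be reproduced by hand with the t-interval formula (sample mean +/- t* × s/√n), using qt() for the critical value:

```r
sleep <- c(6, 6.5, 6.5, 7, 7, 4.5, 11, 8, 8, 6, 5, 9, 8, 7, 7, 8, 6, 7, 8, 10)
xbar <- mean(sleep)
s <- sd(sleep)
n <- length(sleep)
tstar <- qt(0.95, df = n - 1)   # a 90% CI leaves 5% in each tail
c(xbar - tstar * s / sqrt(n),
  xbar + tstar * s / sqrt(n))   # about (6.672, 7.878)
```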
To get a 95% confidence interval, we’d use:
confint(t.test(~sleep, conf.level=0.95))
## mean of x lower upper level
## 7.275 6.545 8.005 0.950
Let’s quickly construct the 90% confidence interval. I’ll make up a dataset that has a mean of 101.5 and a standard deviation of 10:
# Simulate the dataset
UFO <- c(85.84339, 84.1224, 84.93813, 87.95693, 87.97534, 93.77714, 96.53963, 97.49079, 97.83092, 99.51939, 99.93461, 99.99142, 100.7256, 102.1713, 105.6651, 105.9388, 107.4627, 108.4752, 109.5886, 109.6127, 113.0563, 113.7874, 113.7894, 115.4635, 115.8434)
favstats(UFO)
## min Q1 median Q3 max mean sd n missing
## 84.12 96.54 100.7 109.6 115.8 101.5 10 25 0
histogram(~UFO, width=4)
So you can see the dataset does have the appropriate mean and standard deviation. Now let’s construct a 90% confidence interval:
confint(t.test(~UFO, conf.level=0.90))
## mean of x lower upper level
## 101.50 98.08 104.92 0.90
Our interval is (98.078, 104.922). Since the average population IQ is 100, we have no evidence that UFO observers have a higher IQ than average.
So far, our confidence intervals have been constructed under some assumptions: we’ve either assumed the population distribution is approximately normal or that our sample size is large enough for the CLT to kick in.
What do we do if we’re unsure of those assumptions?
Another way to construct a confidence interval is by using bootstrap methods. Before we begin, let’s look at a dataset from Assignment 18 (question #5):
crashes <- read.csv(url("http://bradthiessen.com/html5/stats/m300/failures.csv"))
This dataset contains 136 times between successive computer crashes. Let’s take a look at the data:
histogram(~v1, data=crashes, width=250)
favstats(~v1, data=crashes)
## min Q1 median Q3 max mean sd n missing
## 0 80.5 272 831.8 6150 649.1 1036 136 0
As you can see, the data are heavily skewed to the right (the mean is much greater than the median). Suppose we wanted to construct a 90% confidence interval for the average time between crashes.
We could assume the sample is large enough for the CLT to kick in, but I’m worried about how skewed the data are.
For now, let’s go ahead and make that assumption.
confint(t.test(~v1, data=crashes, conf.level=0.90))
## mean of x lower upper level
## 649.1 502.0 796.3 0.9
Our confidence interval is (502, 796) and has a width of 294.
Rather than making the normality assumption, we could construct a bootstrap confidence interval. To do so, we do the following:
1. Treat our sample of 136 observations as if it were the entire population of interest.
2. Randomly sample n = 136 observations WITH REPLACEMENT from this "population." Since we’re sampling with replacement, the same observation could be chosen more than once in a bootstrap sample.
3. Repeat this process 10,000 times, so we end up with 10,000 random bootstrap samples of size n = 136.
4. For each bootstrap sample, calculate the sample mean.
5. With 10,000 means in hand, find the 5th and 95th percentiles of the bootstrap means. These percentiles are the bounds of our 90% confidence interval.
Let’s walk through this process by first taking a look at sampling with replacement. Suppose we have the numbers 1, 2, 3, 4, 5. Let’s sample 5 of these numbers with replacement:
simple = c(1, 2, 3, 4, 5)
resample(simple)
## [1] 2 3 4 5 4
As you can (probably) see, some numbers were chosen multiple times. This is because we replaced the numbers after we chose each one.
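mosaic’s resample() is essentially base R’s sample() with replace = TRUE, so the same idea works without the package. In fact, most resamples of five items contain at least one duplicate:

```r
set.seed(7)
simple <- c(1, 2, 3, 4, 5)
sample(simple, replace = TRUE)   # same length, but values may repeat
# Proportion of resamples with at least one repeated value
pDup <- mean(replicate(1000, anyDuplicated(sample(simple, replace = TRUE)) > 0))
pDup   # about 0.96 (theoretical value: 1 - 5!/5^5 = 0.9616)
```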
So let’s take 10,000 bootstrap samples from our dataset. For each sample, we’ll calculate the mean:
meanCrashes <- do(10000) * mean(~v1, data = resample(crashes))
Let’s see what that distribution looks like:
densityplot(~result, data=meanCrashes, plot.points = FALSE, col="darkblue", lwd=4)
plotDist("norm", mean=mean(~result, data=meanCrashes), sd=sd(~result, data=meanCrashes), col="red", add=TRUE)
That dark blue distribution is the distribution of our bootstrap means. The red distribution is a normal curve. They’re close enough that we probably could have made our normality assumption.
Let’s get the confidence interval from our bootstrap samples:
confint(meanCrashes, level = 0.9, method = "quantile")
## name lower upper level method
## 1 result 512.3 799.1 0.9 quantile
**Bootstrap confidence interval:** (512, 799), with a width of about 287 and centered near 656
**Confidence interval with normality assumption:** (502, 796), with a width of about 294 and centered at 649
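The whole bootstrap pipeline can also be written in base R. Since the crashes data lives at a URL, this sketch substitutes made-up right-skewed (exponential) data as a stand-in:

```r
set.seed(123)
times <- rexp(136, rate = 1/650)   # stand-in for the crash-time data
# 10,000 bootstrap samples of size 136, drawn WITH replacement;
# record the mean of each
boot_means <- replicate(10000, mean(sample(times, replace = TRUE)))
# The 5th and 95th percentiles of the bootstrap means form the 90% CI
ci <- quantile(boot_means, c(0.05, 0.95))
ci
```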
One last estimate of a 90% confidence interval for our crashes dataset: a Bayesian approach using the BEST package. I won’t explain this method here (I will in either STAT 301 or STAT 305) or even show the output.
# Load the Bayesian package
require(BEST)
# Extract the data to a vector
crashvector <- crashes[,1]
# Run an analysis - it took my laptop a little more than 2 minutes
BESTout1 <- BESTmcmc(crashvector)
# Get the output
BESTout1
plot(BESTout1)