Let’s load the Mosaic package:
require(mosaic)
Let’s start by having the computer create an “answer key” for our 4- and 10-question quizzes:
PossibleAnswers <-c ("A","B","C","D")
FourAnswers <- sample(PossibleAnswers, 4, replace=TRUE)
TenAnswers <- sample(PossibleAnswers, 10, replace=TRUE)
FourAnswers; TenAnswers
## [1] "D" "D" "A" "B"
## [1] "C" "D" "D" "C" "D" "D" "D" "C" "C" "A"
Let’s simulate 10,000 students randomly guessing answers to these quizzes:
StudentsFour <- do(10000) * sample(PossibleAnswers, 4, replace=TRUE)
## Loading required package: parallel
StudentsTen <- do(10000) * sample(PossibleAnswers, 10, replace=TRUE)
# This stores the first student's choice as V1, the second student's choice as V2, etc.
Let’s look at the answers (on the 4-question quiz) from students in our sample of 10,000 students: Student #6 and Student #164.
Recall the answers to this quiz are:
FourAnswers
## [1] "D" "D" "A" "B"
Student #6:
StudentsFour[6,]
## V1 V2 V3 V4
## 6 D A A A
Student #164:
StudentsFour[164,]
## V1 V2 V3 V4
## 164 C D D D
Compare their scores to the answer key to calculate their quiz scores.
To “grade” these quizzes, we need to compare the students’ answers to those in the answer key.
# Convert incorrect answers to 0 and correct answers to 1
StudentsFour$Q1 <- as.numeric(recode(StudentsFour$V1, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsFour$Q2 <- as.numeric(recode(StudentsFour$V2, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsFour$Q3 <- as.numeric(recode(StudentsFour$V3, "'A'='1'; 'B'='0'; 'C'='0'; 'D'='0'"))-1
StudentsFour$Q4 <- as.numeric(recode(StudentsFour$V4, "'A'='0'; 'B'='1'; 'C'='0'; 'D'='0'"))-1
# Sum the item scores to get a total quiz score
StudentsFour$Sum <- rowSums(subset(StudentsFour, select=Q1:Q4))
Just to see that this code worked, let’s look again at students #6 and #164:
StudentsFour[6,]
## V1 V2 V3 V4 Q1 Q2 Q3 Q4 Sum
## 6 D A A A 1 0 1 0 2
StudentsFour[164,]
## V1 V2 V3 V4 Q1 Q2 Q3 Q4 Sum
## 164 C D D D 0 1 0 0 1
That last column shows the quiz score for each student. Those scores look accurate, so it looks like our code works.
We can now tally the quiz scores and find the probability of passing the quiz:
tally(~Sum, data=StudentsFour)
##
## 0 1 2 3 4
## 3165 4116 2192 486 41
histogram(~Sum, width=1, main="4-question Quiz Scores", col="Steelblue", data=StudentsFour)
# Finds probability of scoring 2 or higher
prop(~(Sum>=2), data=StudentsFour)
## TRUE
## 0.2719
That last value is P(X≥2) on the 4-question quiz. Let’s do the same thing for the 10-question quiz:
# Convert incorrect answers to 0 and correct answers to 1
StudentsTen$Q1 <- as.numeric(recode(StudentsTen$V1, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q2 <- as.numeric(recode(StudentsTen$V2, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q3 <- as.numeric(recode(StudentsTen$V3, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q4 <- as.numeric(recode(StudentsTen$V4, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q5 <- as.numeric(recode(StudentsTen$V5, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q6 <- as.numeric(recode(StudentsTen$V6, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q7 <- as.numeric(recode(StudentsTen$V7, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q8 <- as.numeric(recode(StudentsTen$V8, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q9 <- as.numeric(recode(StudentsTen$V9, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q10 <- as.numeric(recode(StudentsTen$V10, "'A'='1'; 'B'='0'; 'C'='0'; 'D'='0'"))-1
# Sum the item scores to get a total quiz score
StudentsTen$Sum <- rowSums(subset(StudentsTen, select=Q1:Q10))
We can now tally the quiz scores and find the probability of passing the quiz:
tally(~Sum, data=StudentsTen)
##
## 0 1 2 3 4 5 6 7 8
## 525 1925 2815 2494 1457 576 178 28 2
histogram(~Sum, width=1, main="10-question Quiz Scores", col="Steelblue", data=StudentsTen)
# Finds probability of scoring 5 or higher
prop(~(Sum>=5), data=StudentsTen)
## TRUE
## 0.0784
That last value is P(X≥5) on the 10-question quiz. It’s significantly lower than the probability of passing the 4-question quiz.
Passing the 4-question quiz can be modeled by a binomial distribution with:
p = 0.25
n = 4
x ≥ 2
plotDist('binom', params=list(size=4, prob=0.25), xlim=c(-1,5), main="Binomial Probabilities with n=4, p=0.25")
To calculate the probability of passing, we take:
# P(X≥2) = 1 - P(X≤1)
1-pbinom(1, 4, 0.25, lower.tail=TRUE, log.p=FALSE)
## [1] 0.2617
Passing the 10-question quiz can be modeled by a binomial distribution with:
p = 0.25
n = 10
x ≥ 2
plotDist('binom', params=list(size=10, prob=0.25), xlim=c(-1,11), main="Binomial Probabilities with n=10, p=0.25")
To calculate the probability of passing, we take:
# P(X≥5) = 1 - P(X≤1=4)
1-pbinom(4, 10, 0.25, lower.tail=TRUE, log.p=FALSE)
## [1] 0.07813
The probability of getting a perfect score on each quiz would be:
pbinom(4, 4, 0.25, lower.tail=TRUE, log.p=FALSE) - pbinom(3, 4, 0.25, lower.tail=TRUE, log.p=FALSE)
## [1] 0.003906
pbinom(10, 10, 0.25, lower.tail=TRUE, log.p=FALSE) - pbinom(9, 10, 0.25, lower.tail=TRUE, log.p=FALSE)
## [1] 9.537e-07
Making at least 9 out of 10 free throws can be modeled by a binomial distribution with:
p = 0.80
n = 10
x ≥ 9
plotDist('binom', params=list(size=10, prob=0.80), xlim=c(-1,11), main="Binomial Probabilities with n=10, p=0.80")
# P(X≥9) = 1 - P(X≤8)
1-pbinom(8, 10, 0.80, lower.tail=TRUE, log.p=FALSE)
## [1] 0.3758
Here’s some code you can use to create an interactive plot of the binomial distribution. If you run this code on RStudio, you will be able to manipulate the sample size, probability, and graph type. On this document, we’ll only get an error message:
require(manipulate)
manipulate(
plotDist('binom',
kind=type,
params=list(size=N, prob=P),
xlim=c(-1, N+1)),
N=slider(3,50, step=1, initial=10),
P=slider(0,1, step=0.1, initial=0.5),
type=picker("density"="density", "cdf"="cdf", "qq"="qq", "histogram"="histogram"))
Dolphins = data.frame(switch=c(rep("correct",15),rep("wrong",1))); Dolphins
## switch
## 1 correct
## 2 correct
## 3 correct
## 4 correct
## 5 correct
## 6 correct
## 7 correct
## 8 correct
## 9 correct
## 10 correct
## 11 correct
## 12 correct
## 13 correct
## 14 correct
## 15 correct
## 16 wrong
To run a Binomial Test, we use the following syntax: binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95,)
In our scenario, a success is when the switch is “correct.” Our null hypothesis is that p=0.50 and our alternate hypothesis is p>0.50:
binom.test(~switch == "correct", p=0.5, alternative=c("greater"), data=Dolphins)
##
## Exact binomial test
##
## data: Dolphins$switch == "correct"
## number of successes = 15, number of trials = 16, p-value =
## 0.0002594
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.736 1.000
## sample estimates:
## probability of success
## 0.9375
At the top-right on that output, you can see our p-value of 0.0002594.
Pigeons = data.frame(direction=c(rep("east",10),rep("west",2))); Pigeons
## direction
## 1 east
## 2 east
## 3 east
## 4 east
## 5 east
## 6 east
## 7 east
## 8 east
## 9 east
## 10 east
## 11 west
## 12 west
binom.test(~direction == "east", p=0.5, alternative=c("greater"), data=Pigeons)
##
## Exact binomial test
##
## data: Pigeons$direction == "east"
## number of successes = 10, number of trials = 12, p-value = 0.01929
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.5619 1.0000
## sample estimates:
## probability of success
## 0.8333
As you can see, the p-value is 0.01929.
Our null hypothesis in this case would be p=0.60. Our alternate would be p>0.60.
Survey = data.frame(BondIssue=c(rep("favor",12),rep("against",3))); Survey
## BondIssue
## 1 favor
## 2 favor
## 3 favor
## 4 favor
## 5 favor
## 6 favor
## 7 favor
## 8 favor
## 9 favor
## 10 favor
## 11 favor
## 12 favor
## 13 against
## 14 against
## 15 against
binom.test(~BondIssue == "favor", p=0.6, alternative=c("greater"), data=Survey)
##
## Exact binomial test
##
## data: Survey$BondIssue == "favor"
## number of successes = 12, number of trials = 15, p-value = 0.0905
## alternative hypothesis: true probability of success is greater than 0.6
## 95 percent confidence interval:
## 0.5602 1.0000
## sample estimates:
## probability of success
## 0.8
As you can see, the p-value is 0.0905.
Getting more than 14 correct matches can be modeled by a binomial distribution with:
p = 0.50
n = 28
x ≥ 15
plotDist('binom', params=list(size=28, prob=0.50), xlim=c(-1,29), main="Binomial Probabilities with n=28, p=0.50")
# P(X≥15) = 1 - P(X≤14)
1-pbinom(14, 28, 0.50, lower.tail=TRUE, log.p=FALSE)
## [1] 0.4253
So the probability is 0.425277.
Our null hypothesis is p=0.425277.
DogOwners = data.frame(DogMatch=c(rep("match",16),rep("No-match",9))); DogOwners
## DogMatch
## 1 match
## 2 match
## 3 match
## 4 match
## 5 match
## 6 match
## 7 match
## 8 match
## 9 match
## 10 match
## 11 match
## 12 match
## 13 match
## 14 match
## 15 match
## 16 match
## 17 No-match
## 18 No-match
## 19 No-match
## 20 No-match
## 21 No-match
## 22 No-match
## 23 No-match
## 24 No-match
## 25 No-match
binom.test(~DogMatch == "match", p=0.425277, alternative=c("greater"), data=DogOwners)
##
## Exact binomial test
##
## data: DogOwners$DogMatch == "match"
## number of successes = 16, number of trials = 25, p-value = 0.02504
## alternative hypothesis: true probability of success is greater than 0.4253
## 95 percent confidence interval:
## 0.4561 1.0000
## sample estimates:
## probability of success
## 0.64
As you can see, the p-value is 0.02504.