Activity #7: R Syntax

Loading the Mosaic Package

Let’s load the Mosaic package:

require(mosaic)

Examples from Activity #7

2. Let’s go ahead and take both the shorter and longer versions of the quiz. Since you’re going to guess the answers anyway, I won’t even bother reading the questions. Instead, I’ll have a computer randomly select the answer to each question.

Let’s start by having the computer create an “answer key” for our 4- and 10-question quizzes:

PossibleAnswers <-c ("A","B","C","D")
FourAnswers <- sample(PossibleAnswers, 4, replace=TRUE) 
TenAnswers <- sample(PossibleAnswers, 10, replace=TRUE)
FourAnswers; TenAnswers

## [1] "D" "D" "A" "B"

##  [1] "C" "D" "D" "C" "D" "D" "D" "C" "C" "A"

Let’s simulate 10,000 students randomly guessing answers to these quizzes:

StudentsFour <- do(10000) * sample(PossibleAnswers, 4, replace=TRUE)

## Loading required package: parallel

StudentsTen <- do(10000) * sample(PossibleAnswers, 10, replace=TRUE)
# This stores the first student's choice as V1, the second student's choice as V2, etc.

Let’s look at the answers (on the 4-question quiz) from students in our sample of 10,000 students: Student #6 and Student #164.

Recall the answers to this quiz are:

FourAnswers

## [1] "D" "D" "A" "B"

Student #6:

StudentsFour[6,]

##   V1 V2 V3 V4
## 6  D  A  A  A

Student #164:

StudentsFour[164,]

##     V1 V2 V3 V4
## 164  C  D  D  D

Compare their scores to the answer key to calculate their quiz scores.

To “grade” these quizzes, we need to compare the students’ answers to those in the answer key.

# Convert incorrect answers to 0 and correct answers to 1
StudentsFour$Q1 <- as.numeric(recode(StudentsFour$V1, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsFour$Q2 <- as.numeric(recode(StudentsFour$V2, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsFour$Q3 <- as.numeric(recode(StudentsFour$V3, "'A'='1'; 'B'='0'; 'C'='0'; 'D'='0'"))-1
StudentsFour$Q4 <- as.numeric(recode(StudentsFour$V4, "'A'='0'; 'B'='1'; 'C'='0'; 'D'='0'"))-1
# Sum the item scores to get a total quiz score
StudentsFour$Sum <- rowSums(subset(StudentsFour, select=Q1:Q4))

Just to see that this code worked, let’s look again at students #6 and #164:

StudentsFour[6,]

##   V1 V2 V3 V4 Q1 Q2 Q3 Q4 Sum
## 6  D  A  A  A  1  0  1  0   2

StudentsFour[164,]

##     V1 V2 V3 V4 Q1 Q2 Q3 Q4 Sum
## 164  C  D  D  D  0  1  0  0   1

That last column shows the quiz score for each student. Those scores look accurate, so it looks like our code works.

We can now tally the quiz scores and find the probability of passing the quiz:

tally(~Sum, data=StudentsFour)

## 
##    0    1    2    3    4 
## 3165 4116 2192  486   41

histogram(~Sum, width=1, main="4-question Quiz Scores", col="Steelblue", data=StudentsFour)

plot of chunk unnamed-chunk-10

# Finds probability of scoring 2 or higher
prop(~(Sum>=2), data=StudentsFour)

##   TRUE 
## 0.2719

That last value is P(X≥2) on the 4-question quiz. Let’s do the same thing for the 10-question quiz:

# Convert incorrect answers to 0 and correct answers to 1
StudentsTen$Q1 <- as.numeric(recode(StudentsTen$V1, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q2 <- as.numeric(recode(StudentsTen$V2, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q3 <- as.numeric(recode(StudentsTen$V3, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q4 <- as.numeric(recode(StudentsTen$V4, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q5 <- as.numeric(recode(StudentsTen$V5, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q6 <- as.numeric(recode(StudentsTen$V6, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q7 <- as.numeric(recode(StudentsTen$V7, "'A'='0'; 'B'='0'; 'C'='0'; 'D'='1'"))-1
StudentsTen$Q8 <- as.numeric(recode(StudentsTen$V8, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q9 <- as.numeric(recode(StudentsTen$V9, "'A'='0'; 'B'='0'; 'C'='1'; 'D'='0'"))-1
StudentsTen$Q10 <- as.numeric(recode(StudentsTen$V10, "'A'='1'; 'B'='0'; 'C'='0'; 'D'='0'"))-1
# Sum the item scores to get a total quiz score
StudentsTen$Sum <- rowSums(subset(StudentsTen, select=Q1:Q10))

We can now tally the quiz scores and find the probability of passing the quiz:

tally(~Sum, data=StudentsTen)

## 
##    0    1    2    3    4    5    6    7    8 
##  525 1925 2815 2494 1457  576  178   28    2

histogram(~Sum, width=1, main="10-question Quiz Scores", col="Steelblue", data=StudentsTen)

plot of chunk unnamed-chunk-12

# Finds probability of scoring 5 or higher
prop(~(Sum>=5), data=StudentsTen)

##   TRUE 
## 0.0784

That last value is P(X≥5) on the 10-question quiz. It’s significantly lower than the probability of passing the 4-question quiz.

Now that we’ve seen the simulation results, here’s how we’d calculate these probabilities using the binomial distribution:

Passing the 4-question quiz can be modeled by a binomial distribution with:
p = 0.25
n = 4
x ≥ 2

plotDist('binom', params=list(size=4, prob=0.25), xlim=c(-1,5), main="Binomial Probabilities with n=4, p=0.25")

plot of chunk unnamed-chunk-13

To calculate the probability of passing, we take:

# P(X≥2) = 1 - P(X≤1)
1-pbinom(1, 4, 0.25, lower.tail=TRUE, log.p=FALSE)

## [1] 0.2617

Passing the 10-question quiz can be modeled by a binomial distribution with:
p = 0.25
n = 10
x ≥ 2

plotDist('binom', params=list(size=10, prob=0.25), xlim=c(-1,11), main="Binomial Probabilities with n=10, p=0.25")

plot of chunk unnamed-chunk-15

To calculate the probability of passing, we take:

# P(X≥5) = 1 - P(X≤1=4)
1-pbinom(4, 10, 0.25, lower.tail=TRUE, log.p=FALSE)

## [1] 0.07813

The probability of getting a perfect score on each quiz would be:

pbinom(4, 4, 0.25, lower.tail=TRUE, log.p=FALSE) - pbinom(3, 4, 0.25, lower.tail=TRUE, log.p=FALSE)

## [1] 0.003906

pbinom(10, 10, 0.25, lower.tail=TRUE, log.p=FALSE) - pbinom(9, 10, 0.25, lower.tail=TRUE, log.p=FALSE)

## [1] 9.537e-07

20) A basketball player has an 80% chance of making a free throw. What’s the probability that he makes at least 9 out of 10 free throws?

Making at least 9 out of 10 free throws can be modeled by a binomial distribution with:
p = 0.80
n = 10
x ≥ 9

plotDist('binom', params=list(size=10, prob=0.80), xlim=c(-1,11), main="Binomial Probabilities with n=10, p=0.80")

plot of chunk unnamed-chunk-18

# P(X≥9) = 1 - P(X≤8)
1-pbinom(8, 10, 0.80, lower.tail=TRUE, log.p=FALSE)

## [1] 0.3758

Interactive Binomial Distribution

Here’s some code you can use to create an interactive plot of the binomial distribution. If you run this code on RStudio, you will be able to manipulate the sample size, probability, and graph type. On this document, we’ll only get an error message:

require(manipulate)

manipulate(
  plotDist('binom',
    kind=type,
    params=list(size=N, prob=P),
    xlim=c(-1, N+1)),
  N=slider(3,50, step=1, initial=10),
  P=slider(0,1, step=0.1, initial=0.5),
  type=picker("density"="density", "cdf"="cdf", "qq"="qq", "histogram"="histogram"))

Binomial Test

26) If the null hypothesis were true, what’s the likelihood that Buzz and Doris would have pushed the correct switch at least 15 times out of 16?

Dolphins = data.frame(switch=c(rep("correct",15),rep("wrong",1))); Dolphins

##     switch
## 1  correct
## 2  correct
## 3  correct
## 4  correct
## 5  correct
## 6  correct
## 7  correct
## 8  correct
## 9  correct
## 10 correct
## 11 correct
## 12 correct
## 13 correct
## 14 correct
## 15 correct
## 16   wrong

To run a Binomial Test, we use the following syntax: binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95,)

In our scenario, a success is when the switch is “correct.” Our null hypothesis is that p=0.50 and our alternate hypothesis is p>0.50:

binom.test(~switch == "correct", p=0.5, alternative=c("greater"), data=Dolphins)

## 
##  Exact binomial test
## 
## data:  Dolphins$switch == "correct"
## number of successes = 15, number of trials = 16, p-value =
## 0.0002594
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.736 1.000
## sample estimates:
## probability of success 
##                 0.9375

At the top-right on that output, you can see our p-value of 0.0002594.

27-31) Assuming the null hypothesis is true, what’s the likelihood that we would have observed 10 or more pigeons flying eastward?

Pigeons = data.frame(direction=c(rep("east",10),rep("west",2))); Pigeons

##    direction
## 1       east
## 2       east
## 3       east
## 4       east
## 5       east
## 6       east
## 7       east
## 8       east
## 9       east
## 10      east
## 11      west
## 12      west

binom.test(~direction == "east", p=0.5, alternative=c("greater"), data=Pigeons)

## 
##  Exact binomial test
## 
## data:  Pigeons$direction == "east"
## number of successes = 10, number of trials = 12, p-value = 0.01929
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.5619 1.0000
## sample estimates:
## probability of success 
##                 0.8333

As you can see, the p-value is 0.01929.

33) A civic group reported to the town council that at least 60% of the town residents were in favor of a particular bond issue. The town council then asked a random sample of 15 residents if they were in favor of the bond issue. Twelve said yes. Is the group’s report reasonable?

Our null hypothesis in this case would be p=0.60. Our alternate would be p>0.60.

Survey = data.frame(BondIssue=c(rep("favor",12),rep("against",3))); Survey

##    BondIssue
## 1      favor
## 2      favor
## 3      favor
## 4      favor
## 5      favor
## 6      favor
## 7      favor
## 8      favor
## 9      favor
## 10     favor
## 11     favor
## 12     favor
## 13   against
## 14   against
## 15   against

binom.test(~BondIssue == "favor", p=0.6, alternative=c("greater"), data=Survey)

## 
##  Exact binomial test
## 
## data:  Survey$BondIssue == "favor"
## number of successes = 12, number of trials = 15, p-value = 0.0905
## alternative hypothesis: true probability of success is greater than 0.6
## 95 percent confidence interval:
##  0.5602 1.0000
## sample estimates:
## probability of success 
##                    0.8

As you can see, the p-value is 0.0905.

36) Let’s assume dogs do NOT resemble their owners (and, therefore, the students were just randomly guessing which of the two dogs belonged to the owner). For any photo triad, what’s the probability more than 14 students would choose the correct dog?

Getting more than 14 correct matches can be modeled by a binomial distribution with:
p = 0.50
n = 28
x ≥ 15

plotDist('binom', params=list(size=28, prob=0.50), xlim=c(-1,29), main="Binomial Probabilities with n=28, p=0.50")

plot of chunk unnamed-chunk-24

# P(X≥15) = 1 - P(X≤14)
1-pbinom(14, 28, 0.50, lower.tail=TRUE, log.p=FALSE)

## [1] 0.4253

So the probability is 0.425277.

37) Given your answer to the previous question, what was the likelihood of observing 16 or more purebred dogs that resemble their owners?

Our null hypothesis is p=0.425277.

DogOwners = data.frame(DogMatch=c(rep("match",16),rep("No-match",9))); DogOwners

##    DogMatch
## 1     match
## 2     match
## 3     match
## 4     match
## 5     match
## 6     match
## 7     match
## 8     match
## 9     match
## 10    match
## 11    match
## 12    match
## 13    match
## 14    match
## 15    match
## 16    match
## 17 No-match
## 18 No-match
## 19 No-match
## 20 No-match
## 21 No-match
## 22 No-match
## 23 No-match
## 24 No-match
## 25 No-match

binom.test(~DogMatch == "match", p=0.425277, alternative=c("greater"), data=DogOwners)

## 
##  Exact binomial test
## 
## data:  DogOwners$DogMatch == "match"
## number of successes = 16, number of trials = 25, p-value = 0.02504
## alternative hypothesis: true probability of success is greater than 0.4253
## 95 percent confidence interval:
##  0.4561 1.0000
## sample estimates:
## probability of success 
##                   0.64

As you can see, the p-value is 0.02504.