Let’s load the Mosaic package:
require(mosaic)
We’ll first do this as a simulation (like the last example in the syntax for Activity #2):
# Let's create 5 engineering majors (all named "E") and 3 math majors (all named "M").
Students <-c ("E","E","E","E","E","M","M","M")
# To choose 2 students, we sample 2 of these letters. We'll do this 30,000 times
Money <- do(30000) * sample(Students, 2, replace=FALSE)
## Loading required package: parallel
# Note that this creates two columns
# V1 = first student, V2 = second student
# We can then find the proportion of those samples that have two M's
prop( ~(V1=="M" & V2=="M"), data=Money)
## TRUE
## 0.1061
That final value is our estimated probability. We can also use the hypergeometric distribution to estimate this probability:
Recall that the function is: phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
where:
q = number of type 1 objects chosen
m = number of type 1 objects
n = number of type 2 objects
k = number of objects selected
We have 3 math majors (Type 1; m=3) and 5 engineering majors (Type 2; n=5). We select 2 students at random (k=2) and we want to know the probability of getting 2 math majors (q = 2).
We want to calculate P(M=2). I only know how to calculate P(M≤2), so I’ll need to take > P(M≤2) - P(M≤1).
Our syntax would be:
phyper(2, 3, 5, 2, lower.tail = TRUE, log.p = FALSE) - phyper(1, 3, 5, 2, lower.tail = TRUE, log.p = FALSE)
## [1] 0.1071
Simulation:
# We'll use the simulation we created in the previous question.
# I'll count the proportion of times we got
# E,M or M,E
prop( ~(V1=="M" & V2=="E" | V1=="E" & V2=="M"), data=Money)
## TRUE
## 0.534
Hypergeometric:
# I'll take P(M≤1)-P(M≤0) to find P(M=1)
phyper(1, 3, 5, 2, lower.tail = TRUE, log.p = FALSE) - phyper(0, 3, 5, 2, lower.tail = TRUE, log.p = FALSE)
## [1] 0.5357
I’ll skip straight to the hypergeometric distribution:
# I'll take P(M≤20)-P(M≤19) to find P(M=20)
phyper(20, 30, 50, 20, lower.tail = TRUE, log.p = FALSE) - phyper(19, 30, 50, 20, lower.tail = TRUE, log.p = FALSE)
## [1] 8.499e-12
We can use the hypergeometric distribution here, too. We’ll let:
Questions you know = m = 80
Questions you don't know = n = 20
Questions we select = k = 4
To find P(all 4 correct) = P(x=4) = P(x≤4)-P(x≤3), we use:
phyper(4, 80, 20, 4, lower.tail = TRUE, log.p = FALSE) - phyper(3, 80, 20, 4, lower.tail = TRUE, log.p = FALSE)
## [1] 0.4033
If I wanted to reduce that probability, I could grade more questions. Suppose I grade 6 questions (instead of 4). The probability of getting all 6 questions correct would be:
phyper(6, 80, 20, 6, lower.tail = TRUE, log.p = FALSE) - phyper(5, 80, 20, 6, lower.tail = TRUE, log.p = FALSE)
## [1] 0.2521
If we wanted to conduct a simulation of this random babies experiment, we’d use the exact same syntax used in the previous activity (where we randomly shuffled the letters ABCD).
We need to simulate 4 students each choosing a tire. Let’s call the tires A, B, C, and D.
Tires <-c ("A","B","C","D")
To simulate the 4 students each choosing a tire, we’ll sample 4 “tires” with replacement (since multiple students can select the same tire). We’ll do this 100,000 times:
TiresChosen <- do(100000) * sample(Tires, 4, replace=TRUE)
# This stores the first student's choice as V1, the second student's choice as V2, etc.
head(TiresChosen)
## V1 V2 V3 V4
## 1 A C B A
## 2 D C D A
## 3 D B A C
## 4 A B D C
## 5 B A C A
## 6 D B B D
To estimate the probability of all 4 students choosing the same tire, we simply find the proportion of our 100,000 trials that yielded the same letter for all 4 students:
prop( ~(V1==V2 & V1==V3 & V1==V4), data=TiresChosen)
## TRUE
## 0.01577
That estimate should be close to the answer we got in class (.015625).
Observed | Yawn seed | No seed | Total |
---|---|---|---|
Yawned | 10 | 4 | 14 |
Did not Yawn | 24 | 12 | 36 |
Total | 34 | 16 | 50 |
As explained in the activity, we can argue that the 14 subjects who yawned would have yawned regardless of getting a yawn seed or not. Under our null hypothesis, the yawn seed had no effect on whether subjects yawned or not.
With this logic, we can treat our 50 subjects as consisting of 14 yawners and 36 non-yawners. For simplicity, I’ll create a vector of subjects where 1 = Yawner and 0 = Non-Yawner.
Yawners = c(rep(1,14),rep(0,36))
Yawners
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There you can see our 14 1’s (Yawners) and 36 0’s.
To simulate the random assignment of our 50 subjects into the yawn seed groups, we can randomly sample 34 subjects (without replacement) to represent the Yawn Seed Group. We’ll repeat this random sampling 25,000 times:
YawnSeed <- do(25000) * sample(Yawners, 34, replace=FALSE)
# Note that this creates 34 columns
# V1 = first subject, V2 = second subject, etc.
Now we want to count the number of Yawners (1’s) that were randomly assigned to the Yawn Seed group:
# Creates a new column, sum, that adds all the 1s from the 34 sampled values
YawnSeed$sum <- rowSums(subset(YawnSeed, select=V1:V34))
Our actual (observed) data had 10 Yawners randomly assigned to the Yawn Seed group. To estimate the p-value, we need to estimate the likelihood of observing something as or more extreme than that. So, in this scenario, we need to find the proportion of our 25,000 samples that have 10 or more yawners in the Yawn Seed group:
# Creates a histogram of the number of yawners in all 25,000 simulations
histogram( ~sum, width=1, main="Yawners in the Yawn Seed Group", v=9.5, col="Steelblue", data=YawnSeed)
# Estimates the p-value
prop( ~(sum>=10), data=YawnSeed)
## TRUE
## 0.5117
We can use the hypergeometric distribution once more:
Yawners = m = 14
Non-Yawners = n = 36
Subjects selected for Yawn Seed group = k = 34
To find P(10 or more yawners in our sample of 34) = P(x≥10) = 1-P(x≤9), we use:
1-phyper(9, 14, 36, 34, lower.tail = TRUE, log.p = FALSE)
## [1] 0.5128
We first need to create the dataset:
group=c(rep("drug",4), rep("placebo", 4))
place=c(1, 2, 4, 5, 3, 6, 7, 8)
Race = data.frame(group=group,place=place)
Race
## group place
## 1 drug 1
## 2 drug 2
## 3 drug 4
## 4 drug 5
## 5 placebo 3
## 6 placebo 6
## 7 placebo 7
## 8 placebo 8
In class, we used the sum of the ranks as our test statistic. Let’s find that sum for each group:
sum(place ~ group, data = Race)
## drug placebo
## 12 24
To estimate the p-value, we wanted to know the likelihood of getting a sum less than 12 if we randomly select 4 of these subjects to take the drug (which we assume has no effect on the results of the race).
That’s the same as simply shuffling the group for each subject. We’ll take all 8 of our subjects and mix-up all their group assignments (drug or placebo). Here are two examples of this shuffling:
RaceShuffle1 = data.frame(group=shuffle(group),place=place)
RaceShuffle2 = data.frame(group=shuffle(group),place=place)
RaceShuffle1
## group place
## 1 placebo 1
## 2 drug 2
## 3 drug 4
## 4 placebo 5
## 5 drug 3
## 6 placebo 6
## 7 placebo 7
## 8 drug 8
RaceShuffle2
## group place
## 1 drug 1
## 2 placebo 2
## 3 drug 4
## 4 placebo 5
## 5 drug 3
## 6 drug 6
## 7 placebo 7
## 8 placebo 8
As you can see, the individual subjects may get reassigned to either group in each shuffle. It doesn’t impact their performance in the race, though. The subject who finished first was going to finish first regardless of the drug – we’re assuming the drug does not matter.
Let’s run 30,000 of these shuffles and see how many times we get a sum of 12 or lower for the placebo group:
RaceShuffles <- do(30000) * sum(place ~ shuffle(group), data=Race)
histogram( ~drug, width=1, main="Sum of Ranks in Placebo Group", v=12.5, col="Steelblue", data=RaceShuffles)
# Estimates the p-value
prop( ~(drug<=12), data=RaceShuffles)
## TRUE
## 0.0565
We first need to create the dataset:
style=c(rep("U",3), rep("T", 3))
ftmade=c(10, 14, 8, 6, 12, 9)
FreeThrows = data.frame(style=style,ftmade=ftmade)
FreeThrows
## style ftmade
## 1 U 10
## 2 U 14
## 3 U 8
## 4 T 6
## 5 T 12
## 6 T 9
In class, we used the sum of the ranks as our test statistic. Let’s find that sum for the underhanded and traditional styles:
sum(ftmade ~ style, data = FreeThrows)
## T U
## 27 32
To estimate the p-value, we wanted to know the likelihood of getting a sum greater than 32 if we randomly select 3 of these subjects to shoot underhanded (which we assume has no effect on the results of the race).
That’s the same as simply shuffling the group for each subject. We’ll take all 6 of our subjects and mix-up all their group assignments (underhanded or traditional). Here are two examples of this shuffling:
Let’s run 25,000 of these shuffles and see how many times we get a sum of 32 or higher for the underhand group:
FTShuffles <- do(25000) * sum(ftmade ~ shuffle(style), data=FreeThrows)
histogram( ~U, width=1, main="Sum of Ranks in Underhanded Group", v=31.5, col="Steelblue", data=FTShuffles)
# Estimates the p-value
prop( ~(U>=32), data=FTShuffles)
## TRUE
## 0.2997
That’s our p-value.