Activity #1: R Syntax

Reproducibility

To run our simulations, we’ll need to generate random numbers. To ensure your random numbers match mine, we can set the seed:

set.seed(3141)  # for reproducibility



Examples from Activity #1

4) So, assuming dolphins cannot communicate, how likely was it that Buzz chose the correct switch 15 out of 16 times? In other words, how likely was Buzz to get 15 correct switches if he was only guessing? How can we estimate this probability?

In class, we stated a null hypothesis that the dolphins had a 50% chance of choosing the correct switch. We then simulated the experiment by flipping a coin 16 times and counting how many heads we get. We replicated this process a large number of times to see the likelihood of observing at least 15 heads.

Flipping a coin with the Mosaic package is easy. We only need to enter:

rflip()
## 
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
## 
## H
## 
## Number of Heads: 1 [Proportion Heads: 1]

The output shows that our “coin” came up heads. We can modify the code to flip 16 coins:

rflip(16)
## 
## Flipping 16 coins [ Prob(Heads) = 0.5 ] ...
## 
## H T T H H H H H H H H H T T H T
## 
## Number of Heads: 11 [Proportion Heads: 0.6875]

The output shows we got 11 heads and 5 tails. To replicate this process, we tell R to do it multiple times. Here’s how we’d flip 16 coins five times:

do(5) * rflip(16)
## Loading required package: parallel
##    n heads tails   prop
## 1 16     8     8 0.5000
## 2 16     8     8 0.5000
## 3 16     9     7 0.5625
## 4 16     9     7 0.5625
## 5 16     7     9 0.4375

If we wanted to see the results for each of those replications, we’d use the following commands:

coins <- do(5) * rflip(16)
tally( ~heads, data=coins)
## 
##  7  9 10 11 
##  1  1  2  1

The first line of code tosses 16 coins 5 times and stores the results in coins. The second line keeps a tally (count) of the number of heads in each replication (using the data we just stored in coins).

The top row of the output shows the number of heads and the bottom row shows the number of replications with that result. If we wanted to count the number of times we observed 8 or more heads (from our 5 replications), we’d modify the tally() command:

tally( ~(heads >= 8), data=coins)
## 
##  TRUE FALSE 
##     4     1

In the output, TRUE shows the number of times it was true that we observed 8 or more heads from our 16 coin tosses. We can get this as a proportion:

prop( ~(heads >= 8), data=coins)
## TRUE 
##  0.8

We’re finally ready to simulate our dolphin study. We’ll replicate the experiment 50,000 times and store the results in Dolphins. We’ll also create a histogram showing the number of heads in each of our 10,000 replications. Then, we’ll find the proportion of those replications in which we observe 15 or more heads.

Dolphins <- do(50000) * rflip(16)
tally( ~heads, data=Dolphins)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##    1   11   93  410 1387 3380 6316 8544 9801 8681 6140 3312 1407  427   78 
##   15   16 
##   10    2
histogram( ~ heads, data=Dolphins, v=15, width=1)

plot of chunk unnamed-chunk-9

tally( ~(heads >= 15) , data=Dolphins)
## 
##  TRUE FALSE 
##    12 49988
prop( ~(heads >= 15), data=Dolphins)
##    TRUE 
## 0.00024

That final value is our estimated p-value. It represents the likelihood of observing 15 or more heads from 16 coin tosses (assuming the coin is fair). In our study, it represents the likelihood of the dolphins correctly choosing the correct switch 15 or more times if they, in fact, had a 50% chance of choosing the correct switch each time.


7) If we flip a coin 16 times, what is the probability that we observe 0 heads?

As we’ll learn in Activity #7, we can use the Binomial Distribution to calculate this probability. In R, we would use:

pbinom(0, 16, .5)
## [1] 1.526e-05

From this, we can see P(0 heads) = 0.00001526. Since we’ve already flipped 16 coins 50,000 times (and stored the results in Dolphins), we can estimate this probability by the proportion of times we observed 0 heads out of those 50,000 replications:

prop( ~(heads <= 0), data=Dolphins)
##  TRUE 
## 2e-05


7) If we flip a coin 16 times, what is the probability that we observe 7 or fewer heads? 8 or more heads?

The “exact” solutions are:

pbinom(7, 16, .5)
## [1] 0.4018
1-pbinom(7, 16, .5)
## [1] 0.5982

The first line calculates the probability of 7 or fewer heads. The second line gets the complement of this result (the opposite, or, 8+ heads). Notice that if we add those probabilities together, we must get 1.00.

pbinom(7, 16, .5) + 1-pbinom(7, 16, .5)
## [1] 1

Let’s see what probabilities are estimated from our simulation:

prop( ~(heads <= 7), data=Dolphins)
##   TRUE 
## 0.4028
prop( ~(heads >= 8), data=Dolphins)
##   TRUE 
## 0.5972
prop( ~(heads <= 7), data=Dolphins) + prop( ~(heads >= 8), data=Dolphins)
## TRUE 
##    1


8) Suppose Buzz had only chosen 10 correct switches out of 16. Estimate the p-value from this study.

To get this answer, we simply use:

prop( ~(heads >= 10), data=Dolphins)
##   TRUE 
## 0.2275

This p-value is large enough that the dolphins could have gotten 10 or more correct even if they had a 50% chance each time. From this, we have very little (if any) evidence that the dolphins can communicate.


9) This time, suppose Dr. Bastian gave the dolphins 160 trials (instead of only 16). What would you conclude if Buzz got 100 correct switches out of 160 trials? What is more unlikely, for Buzz to get 10 out of 16 or 100 out of 160?

We just calculated the likelihood of observing 10 or more correct out of 16 in the previous question. To estimate the likelihood of observing 100 or more correct out of 160, we’d use:

Dolphins2 <- do(50000) * rflip(160)
histogram( ~ heads, data=Dolphins2, v=100, width=1)

plot of chunk unnamed-chunk-16

prop( ~(heads >= 100), data=Dolphins2)
##    TRUE 
## 0.00104


10) Worrying that Buzz could possibly have seen the light around the curtain, Dr. Bastian conducted a follow-up study where he separated the dolphins by a wooden barrier. This time, he gave the dolphins 28 trials and Buzz chose the correct switch 16 times. Use the computer simulation to estimate the p-value in this study and state your conclusion.

Dolphins3 <- do(50000) * rflip(28)
histogram( ~ heads, data=Dolphins3, v=16, width=1)

plot of chunk unnamed-chunk-17

prop( ~(heads >= 16), data=Dolphins3)
##  TRUE 
## 0.288


12) We know 18 of the 24 applicants were selected for promotion. Let’s assume that gender had no impact on promotion decisions. We’ll call this the null hypothesis. Assuming this null hypothesis is true, how many of the 18 promotions should have gone to male applicants?

Expected Promoted Not Promoted Total
Male 9 3 12
Female 9 3 12
Total 18 6 24

Observed | Promoted | Not Promoted | Total — | — | — | — Male | 11 | 1 | 12 Female | 7 | 5 | 12 Total | 18 | 6 | 24

13) From these results, can we conclude that male bank supervisors are sexist?

We need to load the data that’s displayed in the table above. To do this, I’m first going to create two vectors. The first one, called gender will contain 12 males and 12 females. The second vector, called promoted will consist of values of 0 (not promoted) and 1 (promoted).

Instead of typing the gender and promotion status of each of the 24 subjects individually, I’ll use some shorthand.

gender = c(rep("male",12),rep("female",12))

This code lists “male” 12 times and “female” 12 times. In other words, the code combines c() the value “male” replicated rep() 12 times and the value “female” 12 times.

We can see the contents of this vector:

gender
##  [1] "male"   "male"   "male"   "male"   "male"   "male"   "male"  
##  [8] "male"   "male"   "male"   "male"   "male"   "female" "female"
## [15] "female" "female" "female" "female" "female" "female" "female"
## [22] "female" "female" "female"

We’ll do the same to get the vector of promotion data. We know 11 of the 12 males (and 7 of the 12 females) were promoted. That means we need 11 of the first 12 data values to be 1 (which is our code for promoted). We then need a 0 to represent the male who wasn’t promoted. This is followed by 7 promotions and 5 non-promotions for the females.

promoted = c(rep(1,11),0,rep(1,7),rep(0,5))

We’ll then combine these two vectors into a single data frame. I’ll name the columns gender and promotion:

promotion = data.frame(gender=gender,promoted=promoted)
promotion
##    gender promoted
## 1    male        1
## 2    male        1
## 3    male        1
## 4    male        1
## 5    male        1
## 6    male        1
## 7    male        1
## 8    male        1
## 9    male        1
## 10   male        1
## 11   male        1
## 12   male        0
## 13 female        1
## 14 female        1
## 15 female        1
## 16 female        1
## 17 female        1
## 18 female        1
## 19 female        1
## 20 female        0
## 21 female        0
## 22 female        0
## 23 female        0
## 24 female        0

Just to make sure the data matches the table displayed in question #12, let’s create a table using:

tally(promoted~gender, format=c("count"), data=promotion)
##         gender
## promoted female male
##        0      5    1
##        1      7   11

Other than having the rows and columns in a different order, this table matches what we want.

Instead of thinking about frequencies, we can think about the proportion of males and females who were promoted:

mean(promoted ~ gender, data = promotion)
## female   male 
## 0.5833 0.9167

The difference, then, between the proportions of males and females who were promoted is:

diff(mean(promoted ~ gender, data = promotion))
##   male 
## 0.3333

Under our null hypothesis, gender had no impact on promotion decisions. Therefore, we can shuffle the gender labels around in our data. Let’s shuffle the gender labels a couple times:

tally(promoted~shuffle(gender), format=c("count"), data=promotion)
##         shuffle(gender)
## promoted female male
##        0      6    0
##        1      6   12
tally(promoted~shuffle(gender), format=c("count"), data=promotion)
##         shuffle(gender)
## promoted female male
##        0      3    3
##        1      9    9

Both of these tables have the same number of males (12), females (12), promotions (18), and non-promotions (6) as our observed dataset. The gender labels have been shuffled, though, so some of the promoted males are now females (and some of those non-promoted females are now males).

We care about the difference in proportion of promotions within each group, so let’s calculate that for 5000 replications:

ShuffledPromotions<- do(5000) * diff(mean(promoted ~ shuffle(gender), data = promotion))

A histogram shows the 5,000 differences in proportions:

histogram(~male, data=ShuffledPromotions, width=0.16)

plot of chunk unnamed-chunk-27

We can then count the proportion of replications that gave us results as or more extreme than what we observed:

prop( ~(male >= 0.3333333), data=ShuffledPromotions)
##   TRUE 
## 0.0746

Compare that result to what we’ll get when we’re able to calculate these probabilities directly (after Activity #8):

1-phyper(10, 18, 6, 12)
## [1] 0.07748