Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab2report.Rmd
Recall the scenario from class in which a professor teaches a statistics class with 5 engineering majors and 3 math majors. The professor, through a process he claims is random, chooses to give money to 2 math majors. We were interested in the likelihood of the professor choosing 2 math majors if he did, in fact, choose students randomly.
In class, we solved this problem with a few different methods.
Let’s learn how to apply these methods in R (and estimate the likelihood via simulation).
Let’s begin by generating the dataset for this scenario. We need to tell R to create a vector containing 5 engineering majors and 3 math majors.
Since the identities of the individual students aren’t important, we only need to generate a dataset of the students’ majors. If we let E = engineering and M = math, our dataset will look like:
E E E E E M M M
To input this dataset into R, we’ll use the c() command. You can think of c() as representing combine(). To construct our dataset, we could use:
students <- c("E", "E", "E", "E", "E", "M", "M", "M")
Notice that since our data consist of characters (letters), we need to put each value in quotation marks.
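As a side note, the quotation marks are only needed because our values are characters. A purely numeric vector (a quick illustration, not part of this lab’s data) would be entered without quotes:

# Hypothetical example: numeric values need no quotation marks
ages <- c(19, 20, 21)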
While this method works, it can be tedious. Is there any way I can input the data without having to type each individual student’s major?
Yes, as long as we learn the replicate function in R: rep(). This function takes two arguments: rep(x, times), where

- x is the value we want to repeat
- times is the number of times to repeat it
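For example (a small illustration on its own, separate from our dataset), replicating the word “Math” 3 times looks like this:

rep("Math", 3)
## [1] "Math" "Math" "Math"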
To construct our dataset, we can combine the c() and rep() functions, like this:
students <- c( rep("Eng.", 5), rep("Math", 3) )
students
## [1] "Eng." "Eng." "Eng." "Eng." "Eng." "Math" "Math" "Math"
The output shows that our dataset, students, was constructed by replicating “Eng.” 5 times and “Math” 3 times.
Now that we have our dataset, let’s simulate the process of choosing 2 students at random. To do this, we can use the sample() command. Typing help(sample) into the console shows me this command allows the following arguments: sample(x, size, replace = FALSE), where

- x is the vector of values we want to sample from
- size is the number of values to select
- replace indicates whether the same value can be selected more than once
Let’s assume the statistics professor was not going to choose the same student twice. In this case, we want to choose 2 students without replacement. To do this, we use:
sample(students, size=2, replace=FALSE)
## [1] "Eng." "Math"
The output shows the two students who were chosen at random.
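For contrast, here’s a sketch of sampling with replacement, which would allow the same student to be drawn twice (this is not what our professor did, and your output will vary from run to run):

# Illustration only: sampling WITH replacement
sample(students, size=2, replace=TRUE)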
Just as we did in the first lab, we’ll want to replicate this sampling process many times with the do() function. Let’s get 10,000 replications of sampling 2 students without replacement:
sampled_students <- do(10000) * sample(students, size=2, replace=FALSE)
Now that we’ve simulated this process 10,000 times, let’s summarize the results in a table.
table(sampled_students)
## V2
## V1 Eng. Math
## Eng. 3671 2619
## Math 2650 1060
The output gives us a 2x2 table showing the number of times the simulated professor chose:

- two engineering majors (3671 replications)
- an engineering major first and a math major second (2619 replications)
- a math major first and an engineering major second (2650 replications)
- two math majors (1060 replications)
We can tell R to give us a more nicely-formatted table if we use the tally() command. To use tally(), we need to identify which variable(s) we want to tally.

When we simulated the selection of two students, we stored the results in a dataframe called sampled_students. Let’s look at the first several rows of this dataframe to identify our variables:
head(sampled_students)
## V1 V2
## 1 Math Eng.
## 2 Math Eng.
## 3 Eng. Math
## 4 Eng. Eng.
## 5 Eng. Eng.
## 6 Math Eng.
This output shows our two variables are named V1 (representing the first student selected) and V2 (the second student selected). Now that we know the names of our variables, we can use the tally() command:
# Tally our variables, V1 and V2, and include the margin totals
tally(~V1 & V2, data=sampled_students, margins=TRUE)
## V2
## V1 Eng. Math Total
## Eng. 3671 2619 6290
## Math 2650 1060 3710
## Total 6321 3679 10000
Finally, if we’re interested in the likelihood of choosing 2 math majors at random, we can format our table to display proportions (instead of frequencies). To do this, I use:
# Tally our variables V1 and V2, include the margin totals, and display proportions
tally(~V1 & V2, data=sampled_students, margins=TRUE, format="proportion")
## V2
## V1 Eng. Math Total
## Eng. 0.3671 0.2619 0.6290
## Math 0.2650 0.1060 0.3710
## Total 0.6321 0.3679 1.0000
The value in the Math/Math cell (0.1060) represents the proportion of times our simulated professor chose two math majors.
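If you’d prefer to skip the full table, one option (a sketch using the prop() command we’ll meet again later in this lab; the exact label on the output may differ across mosaic versions) is to ask directly for the proportion of replications in which both selected students were math majors:

# Proportion of replications where both chosen students were math majors
# (should come out close to the 0.1060 shown in the table above)
prop(~(V1 == "Math" & V2 == "Math"), data=sampled_students)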
Exercise #3: Suppose you teach a fictitious class with 12 engineering majors, 5 math majors, and 3 other majors, and you randomly choose 3 students. Simulate this selection many times and summarize the results with the table or tally commands. Finally, estimate the likelihood of choosing 3 math majors from your simulation.

We’ve used a simulation to estimate the likelihood of choosing 2 math majors at random. This time, let’s calculate the likelihood using combinations.
As we saw in class, we calculate this likelihood as:

P(choose 2 math majors) = (# of ways to choose 2 of the 3 math majors) / (# of ways to choose any 2 of the 8 students)

Since we’re sampling without replacement and order does not matter, we use combinations to count both the numerator and denominator.
To calculate combinations in R, we use the choose() command. In this command, we type the number of objects and then the number we want to choose:
# Choose 2 out of 3 math majors
choose(3,2)
## [1] 3
# Choose 2 out of 8 total students
choose(8,2)
## [1] 28
We can combine these in a single command to calculate the likelihood of choosing 2 math majors at random:
choose(3,2) / choose(8,2)
## [1] 0.1071429
That answer is close to the likelihood we estimated with our simulation.
Exercise: Use the choose() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Verify that this answer is similar to the likelihood you estimated in exercise #3.

Finally, let’s verify our answer using the hypergeometric distribution. We’ll discuss the hypergeometric distribution in-depth in unit #2. For now, let’s see how to calculate probabilities with this distribution in R.
The syntax for the hypergeometric distribution is dhyper(q, m, n, k), where

- q is the number of TYPE 1 objects we want to select
- m is the number of TYPE 1 objects in the population
- n is the number of other (TYPE 2) objects in the population
- k is the total number of objects selected
If we let math majors be our TYPE 1 objects, our scenario would need:

- q = 2 (we want the probability of choosing 2 math majors)
- m = 3 (there are 3 math majors in the class)
- n = 5 (there are 5 engineering majors)
- k = 2 (the professor chose 2 students)
Let’s calculate this probability:
dhyper(2, 3, 5, 2)
## [1] 0.1071429
This answer (0.1071429) matches the answer we calculated via combinations.
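As a quick sanity check (optional, not required for the lab), we can ask dhyper() for the probabilities of choosing 0, 1, or 2 math majors all at once; the three values should add up to 1:

# Probabilities of choosing 0, 1, or 2 math majors
dhyper(0:2, 3, 5, 2)
## [1] 0.3571429 0.5357143 0.1071429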
Exercise: Use the dhyper() command to calculate the probability of choosing 3 math majors (out of the 12 engineering, 5 math, and 3 other majors in your fictitious class). Although it seems as though you have 3 different types of objects, you really only have math majors and non-math majors.

Recall the scenario in class where 48 bank supervisors were asked if they’d promote an applicant to a managerial position. 24 of those supervisors were randomly chosen and told the applicant was male. The other 24 supervisors were told the applicant was female.
The results of the study were summarized in a table:

| Observed | Promoted | Not promoted | Total |
|---|---|---|---|
| Male App | 21 | 3 | 24 |
| Female App | 14 | 10 | 24 |
| Total | 35 | 13 | 48 |
If the gender of the applicants had no effect on the promotion decisions, we’d expect a similar number of male and female applicants to be promoted. Instead, the results of this study indicate males were much more likely to be promoted than females.
Our question of interest is this: Assuming gender does not influence promotion decisions, how likely were we to observe 21 or more promoted males in this study?
We’ll estimate this likelihood via simulation and the hypergeometric distribution.
The code for this simulation is going to be a bit complicated. Two factors make it complicated: (1) I want to make the code as readable as possible for you, and (2) I’m not a very skilled coder.
Let’s start by constructing our dataset. I need to create a dataset that has 21 promoted males, 3 unpromoted males, 14 promoted females, and 10 unpromoted females. To do this, I’m going to create two variables: gender and promoted.
For the gender variable, I need to create 24 males and 24 females using:
gender <- c( rep("Male",24), rep("Female", 24) )
Remember, this will create a list of the word “Male” repeated 24 times, followed by “Female” repeated 24 times. I want 21 of those 24 males to be promoted and 3 to not be promoted. I then want 14 promoted females and 10 unpromoted females. To do this, I’ll use:
promoted <- c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10) )
I’ll then combine the two variables (gender and promoted) into a single dataframe I’ll call promotions.
Let’s put this together in a single command:
promotions <- data.frame(
gender = c( rep("Male",24), rep("Female", 24) ),
promoted = c( rep("Yes", 21), rep("No", 3), rep("Yes", 14), rep("No", 10) )
)
I could have written that across a single line, but I’m hoping the spacing helps you interpret the code.
Let’s look at a tally (or table) of this data to make sure it matches the actual results from this experiment:
tally(promoted~gender, data=promotions)
## gender
## promoted Female Male
## No 10 3
## Yes 14 21
The rows and columns are in a different (alphabetical) order, but this does match the results from the study.
To simulate the results of this study, we’re going to assume that gender had no effect on promotion decisions. If that’s true, the gender listed on the application (male or female) had no impact on the supervisor’s decision to promote the applicant.
Imagine one supervisor in this study who was randomly assigned a male applicant and decided to promote that male applicant. Now imagine we go back in time and, through random assignment, this supervisor was now assigned a female applicant. Because we’re assuming that gender does not influence promotion decisions, we assume that this supervisor would still choose to promote the applicant (even though the supervisor now believes the applicant is female).
This means that in order to simulate the results of this experiment, we simply need to shuffle the gender of the applicants assigned to the supervisors. To randomly shuffle the gender variable, we need to add shuffle() to the previous command. I’m also going to add a format="data.frame" argument to store the results of this shuffling process as a dataframe:
shuffled_gender <- tally(promoted ~ shuffle(gender), data=promotions, format="data.frame")
shuffled_gender
## promoted shuffle.gender. Freq
## 1 No Female 7
## 2 Yes Female 17
## 3 No Male 6
## 4 Yes Male 18
As you can see, randomly shuffling the data yielded different results than the original experiment. While we still have 35 promotions, there’s now a different number of promoted males.
Notice that the number of promoted males is stored in the 4th row and 3rd column of our shuffled_gender dataframe. Using square brackets, I can tell R to give me only this single value:
shuffled_gender[4, 3]
## [1] 18
Yep, that’s the number of promoted males we obtained.
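Relying on the 4th row and 3rd column works here, but it would break if the rows ever appeared in a different order. As a more defensive sketch (using the column names shown in the output above), we could pull out the same value by name instead of position:

# Same value (18 for the shuffle above), selected by name rather than position
shuffled_gender$Freq[shuffled_gender$promoted == "Yes" &
                     shuffled_gender$shuffle.gender. == "Male"]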
As we’ve done several times now, we’re going to replicate our simulation many times. We’re going to:

- shuffle the gender assigned to each supervisor
- tally the shuffled genders against the promotion decisions
- record the number of promoted males
- repeat this process 10,000 times
Here’s the code (with some comments to help explain) to run 10,000 replications of this simulation:
shuffled_gender <- do(10000) * # Replicate 10,000 times
tally( # Create a table
promoted ~ shuffle(gender), # of promotions and (shuffled) gender.
data=promotions, # The data is stored in "promotions"
format="data.frame")[4,3] # Access the number of promoted males
Now we can examine the distribution of promoted males we got from our simulation:
# Tally the results
tally(~result, data=shuffled_gender)
##
## 12 13 14 15 16 17 18 19 20 21 22 23
## 1 33 210 688 1631 2424 2391 1606 751 222 39 4
# Create a histogram of the results
histogram(~result, # Plot the results
data=shuffled_gender, # Stored in shuffled_gender
type="count", # Show frequencies on the y-axis
width=1, # Make the bars have width = 1
xlab = "Promoted males", # Label the x-axis
v = 21, # Draw a vertical line at X=21
groups = (result >= 21) ) # Color all bars >= 21
From this, it doesn’t look as though it was likely to get 21 or more promoted males if gender did not influence promotion decisions. To estimate this likelihood, we can use our prop() command:
prop(~result>=21, data=shuffled_gender, format="prop")
## TRUE
## 0.0265
That’s our p-value – the likelihood of observing 21+ promoted males if gender does not influence promotion decisions. Because this likelihood is low, we have evidence that gender may, in fact, influence promotion decisions (at least for these bank supervisors).
We can use the hypergeometric distribution to estimate the likelihood of observing 21 or more promoted males. To do this, we need to think of our experiment as a process in which we choose a certain number of objects from two groups.
One way to do this would be to think of our data as:

- m = 24 male applications (our TYPE 1 objects)
- n = 24 female applications (TYPE 2 objects)
- k = 35 promotions handed out (the number of objects selected)
- q = 21 or more of those promotions going to male applicants
The phyper() command tells R to calculate the probability of choosing q or fewer objects. In this case, it would calculate the probability of choosing 21 or fewer promoted males. We’re interested in the likelihood of choosing 21 or more promoted males, so we can use the complement rule to calculate 1 - P(20 or fewer):
1 - phyper(20, 24, 24, 35)
## [1] 0.02449571
That’s our likelihood: 0.0244957. It’s similar to what we estimated with our simulation methods.
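Equivalently (a small sketch, if you’d rather avoid the complement rule), we could add up the individual hypergeometric probabilities for 21, 22, 23, and 24 promoted males:

# P(21 or more promoted males), summed directly
sum(dhyper(21:24, 24, 24, 35))
## [1] 0.02449571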
Another way of characterizing our experiment under the hypergeometric distribution would be to consider:

- m = 35 promoted applicants (our TYPE 1 objects)
- n = 13 applicants who were not promoted (TYPE 2 objects)
- k = 24 male applications (the number of objects selected)
- q = 21 or more of the male applicants being promoted

1 - phyper(20, 35, 13, 24)
## [1] 0.02449571

Notice that this gives us exactly the same likelihood.
In class, we analyzed a simple (fictitious) dataset to see if a new drug made people run faster. Out of 8 randomly chosen subjects, 4 were randomly assigned to take a drug. All 8 subjects then ran a race. The subjects who took the drug finished in 1st, 2nd, 4th, and 5th place.
Let’s input this data:
race <- data.frame(
group = c( rep("Drug",4), rep("Placebo", 4) ),
finish = c(1, 2, 4, 5, 3, 6, 7, 8) )
# Get a tally of the data
tally(group ~ finish, data=race)
## finish
## group 1 2 3 4 5 6 7 8
## Drug 1 1 0 1 1 0 0 0
## Placebo 0 0 1 0 0 1 1 1
In class, we decided that one good way of measuring the relative performance of each group was to sum their ranks.
So, for the drug group, the sum is 1 + 2 + 4 + 5 = 12. The placebo group has a sum of 3 + 6 + 7 + 8 = 24. Lower numbers are better, so our data suggests the drug is associated with better race results.
We can get these sums in R with the sum() command:
sum(finish ~ group, data = race)
## Drug Placebo
## 12 24
Let’s look a little more in-depth at the arguments in this command. Specifically, let’s look at the finish ~ group portion:

sum(finish ~ group, data=race)

The y ~ x argument is the formula notation used throughout the mosaic package. In this case, we’re telling R that we want to sum the finish variable as a function of the group variable (i.e., sum the race ranks for each group separately).
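This same y ~ x notation works with other mosaic summary functions. For example (just an illustration, assuming the mosaic package is loaded as it is for the rest of this lab), we could compare the mean finishing position of each group:

# Mean finish position for each group (drug: 3, placebo: 6)
mean(finish ~ group, data=race)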
Assuming the drug has no effect on an individual’s speed, what’s the likelihood we would have obtained a sum of 12 or less in our study?
To estimate this likelihood, we’re going to shuffle the group variable. Since we’re assuming the drug doesn’t matter, we’re free to randomly assign subjects to the drug and placebo groups (all the while assuming the results of the race wouldn’t change).
Let’s run 10,000 replications of this shuffling simulation:
race_simulation <- do(10000) * sum(finish ~ shuffle(group), data=race)
Now, let’s visualize the results and estimate the p-value:
histogram(~Drug, # Plot the sums of the drug group
data=race_simulation, # Stored in race_simulation
type="count", # Show frequencies on the y-axis
width=1, # Make the bars have width = 1
xlab = "Sum of Ranks in Drug Group", # Label x-axis
v = 12, # Draw a vertical line at X=12
groups = (Drug <= 12) ) # Color all bars <= 12
prop(~ Drug<=12, data=race_simulation)
## TRUE
## 0.0573
In class, we began listing all 70 possible outcomes of this study. We stopped once we had found the 4 most extreme outcomes and calculated our p-value to be 0.0571429.
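We can check those in-class numbers with the choose() command from earlier in this lab (a quick verification, not a new method):

# Number of ways to assign 4 of the 8 runners to the drug group
choose(8, 4)
## [1] 70
# Exact p-value: 4 most extreme outcomes out of 70
4 / choose(8, 4)
## [1] 0.05714286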
In this simulation approach, we didn’t bother figuring out how many outcomes were possible. In many scenarios (with datasets bigger than n=8 subjects), it wouldn’t be feasible to list out all possible outcomes. Instead, we rely on the computer to generate a large representative sample of possible outcomes.
As you can see, the p-value estimated with this simulation approach was very similar to the “exact” p-value we calculated in class.
When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button:
Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.
Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab2report.html file to me via email or print it out and hand it in to me.
Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.