# Load the tidyverse packages
library(tidyverse)
library(mosaic)
# Raise the penalty for scientific notation so results print as full decimals
options(scipen=999)
In this unit, we’ll learn how to construct probability models. Probability models display (through a list, plot, or formula) the possible outcomes of a random process and the probability associated with each outcome.
Probabilities, while useful, can be interpreted in many ways. Physical probabilities, such as those arising from coin tosses, card games, or radioactive atoms, might be interpreted and estimated using the relative frequency or classical approaches introduced below. Evidential probabilities, such as our belief in the fairness of a coin, can be estimated through a subjective approach.
Relative Frequency Approach: \(P(\textrm{win}) \approx \lim_{\textrm{trials}\rightarrow\infty}\frac{\textrm{wins}}{\textrm{trials}}\)
We estimate probability as the proportion of times an outcome occurs in a long series of repetitions.
# First, specify how many coin tosses we want. We want 24,000
N <- 24000
# Flip a coin N times and store the results in **coin**
coin <- Do(N) * nflip()
# Add a column: *sum* = cumulative number of heads
# Add a column: *toss* = cumulative number of coin tosses
# Add a column: *runprop* = running proportion of heads
coin <- coin %>%
  mutate(sum = cumsum(nflip),
         toss = 1:N,
         runprop = sum / toss)
# Let's look at some of the data
head(coin)
# # A tibble: 6 × 4
#   nflip   sum  toss   runprop
# * <dbl> <dbl> <int>     <dbl>
# 1     0     0     1 0.0000000
# 2     1     1     2 0.5000000
# 3     1     2     3 0.6666667
# 4     0     2     4 0.5000000
# 5     1     3     5 0.6000000
# 6     0     3     6 0.5000000
# Let's plot the results
ggplot(data = coin, aes(x = toss, y = runprop)) +
  geom_line(color = "steelblue", alpha = 0.7) +
  annotate("segment", x = 0, xend = 24000, y = 0.50, yend = 0.50,
           colour = "black") +
  labs(title = "Running proportion of heads in 24,000 coin tosses",
       subtitle = paste("Proportion after 100 trials = ", coin$runprop[100],
                        ". Proportion of heads after 24,000 trials = ",
                        coin$runprop[24000]))
Using this approach, how would you estimate the probability that the current vice president will become president?
How could we use this approach to estimate the probability of obtaining a sum of 7 if we roll two 6-sided dice?
Let’s simulate 5,000 rolls of two dice.
# Create the data. The outcome of each die is simply
# one sample (with replacement) from the numbers 1-6.
dice <- tibble(
  die1 = sample(1:6, size = 5000, replace = TRUE),
  die2 = sample(1:6, size = 5000, replace = TRUE),
  sum  = die1 + die2
)
# Let's see a table of those sums
table(dice$sum)
#
# 2 3 4 5 6 7 8 9 10 11 12
# 146 260 399 591 732 849 689 537 411 255 131
# A histogram of those sums
dice %>%
  ggplot(aes(x = sum)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "white", alpha = 0.8) +
  labs(
    title = "5,000 replications of rolling two 6-sided dice",
    subtitle = paste("Proportion with sum equal to 7 = ", mean(dice$sum == 7)),
    x = "sum of the two dice"
  ) +
  scale_x_continuous(breaks = seq(-1, 13, 1), minor_breaks = NULL) +
  theme(
    axis.text.x = element_text(size = 11, color = "grey10"),
    legend.position = "none",
    panel.grid.major.y = element_line(colour = "white"),
    panel.grid.major.x = element_line(colour = "white", size = .15),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "grey93")
  )
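For comparison, the classical approach we develop below gives an exact answer: 6 of the 36 equally likely ordered outcomes of two dice (1-6, 2-5, 3-4, 4-3, 5-2, 6-1) sum to 7.
# 6 of the 36 equally likely ordered outcomes sum to 7
6 / 36
# [1] 0.1666667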
# Suppose a lazy nurse returns four babies (labeled A, B, C, D) to their
# parents at random. We can simulate this by shuffling the letters ABCD
# and counting how many land in the correct position.
babies <- c("A", "B", "C", "D")
shuffle(babies)
# [1] "B" "A" "D" "C"
shuffle(babies)
# [1] "C" "B" "A" "D"
# We'll replicate this process 10,000 times
lazynurse <- Do(10000) * shuffle(babies)
# Calculate the number of matches
lazynurse <- lazynurse %>%
  mutate(return = paste(V1, V2, V3, V4, sep = ""),
         firstmatch  = (V1 == "A"),
         secondmatch = (V2 == "B"),
         thirdmatch  = (V3 == "C"),
         fourthmatch = (V4 == "D"),
         matches = firstmatch + secondmatch + thirdmatch + fourthmatch) %>%
  select(return, matches)
# Histogram
lazynurse %>%
  ggplot(aes(x = matches)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "white", alpha = 0.8) +
  labs(
    title = "10,000 replications of returning four babies at random",
    x = "number of correct matches"
  ) +
  scale_x_continuous(breaks = seq(-1, 5, 1), minor_breaks = NULL) +
  theme(
    axis.text.x = element_text(size = 11, color = "grey10"),
    legend.position = "none",
    panel.grid.major.y = element_line(colour = "white"),
    panel.grid.major.x = element_line(colour = "white", size = .15),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "grey93")
  ) +
  stat_bin(aes(y = (..count..) / sum(..count..),
               label = round((..count..) / sum(..count..), 2)),
           geom = "text", size = 4, binwidth = 1, vjust = -1.5)
With the relative frequency approach, the estimated probability will change each time we collect (or simulate) more trials. Perhaps we want an approach that gives us a single consistent estimate.
Classical / Theoretical Approach: \(P(\textrm{event}) = \frac{\textrm{number of outcomes that result in the event}}{\textrm{total number of possible outcomes}}\)
# Simulate tossing 3 coins 10,000 times
threecoins <- Do(10000) * nflip(3)
prop(~nflip >= 2, data=threecoins)
# TRUE
# 0.4981
# List all possible outcomes
# HHH
# HHT, HTH, THH
# TTH, THT, HTT
# TTT
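# 4 of these 8 equally likely outcomes contain at least 2 heads
4 / 8
# [1] 0.5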
• Rolling one 6-sided die 10 times: 60,466,176 outcomes
• Tossing a coin 20 times: 1,048,576 outcomes
• Dividing 20 students into 4 equal-sized groups (group ID matters): 11,732,745,024 outcomes
• Choosing 6 lottery numbers between 1-36 in any order with no repeats: 1,947,792 outcomes
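Each of these counts can be checked directly in R:
# One 6-sided die rolled 10 times: 6 options on each roll
6^10
# [1] 60466176
# A coin tossed 20 times: 2 options on each toss
2^20
# [1] 1048576
# 20 students split into 4 labeled groups of 5: 20! / (5!)^4
factorial(20) / factorial(5)^4
# [1] 11732745024
# 6 lottery numbers from 1-36, no repeats, order irrelevant
choose(36, 6)
# [1] 1947792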
To use this classical approach, we’re going to need to learn how to count.
# Number of options
12 * 20 * 4 * 6
# [1] 5760
# Probability it includes sea salt (1 of the 6 options in its category)
1 / 6
# [1] 0.1666667
# Simulate flipping 10 coins 10,000 times
quiz <- Do(10000) * nflip(10)
# Count proportion of quiz scores that equal 10
prop(~nflip == 10, data=quiz)
# TRUE
# 0.0018
# "Exact" calculation
1 / (2 ^ 10)
# [1] 0.0009765625
# Count proportion of quiz scores that are at least 1
prop(~nflip >=1, data=quiz)
# TRUE
# 0.999
# "Exact" calculation
1 - (1 / (2 ^ 10))
# [1] 0.9990234
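As a check, both "exact" values are binomial probabilities, which R's built-in dbinom() reproduces:
# P(all 10 correct guesses)
dbinom(10, size = 10, prob = 0.5)
# [1] 0.0009765625
# P(at least 1 correct guess)
1 - dbinom(0, size = 10, prob = 0.5)
# [1] 0.9990234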
# Number of available characters: 26 lower case + 26 upper case + 10 digits
possibilities <- (26 + 26 + 10)
# 4-character passwords, repeats allowed
possibilities^4
# [1] 14776336
# 4-character passwords with no repeated characters
62 * 61 * 60 * 59
# [1] 13388280
# 4-character passwords where each character differs from the one before it
62 * 61 * 61 * 61
# [1] 14072822
# Assuming the two events are independent, multiply their probabilities
(1 / 104) * (1 / 157)
# [1] 0.00006124449
# Let's load 1.83 million baby names from the Social Security Administration
library(babynames)
# Let's only look at boys born +/- 3 years from my birth year
babies <- babynames %>%
  filter(sex == "M", year < 1980, year > 1972)
# How many boys were born in these 7 years?
totalbabies <- sum(babies$n)
# How many boys were born with the name Bradley in these 7 years?
bradleys <- sum(babies$n[babies$name=="Bradley"])
# How many boys were born with the name Adam in these 7 years?
adams <- sum(babies$n[babies$name=="Adam"])
# Proportion of babies born with the name Bradley
bradprop <- bradleys / totalbabies
# Proportion of babies born with the name Adam
adamprop <- adams / totalbabies
# P(Bradley Adam)
bradprop * adamprop
# [1] 0.00002464961
# Assume Adam Bradley is just as likely as Bradley Adam
bradprop * adamprop / 2
# [1] 0.00001232481
# Number of ways to arrange 4 objects in order: 4!
4 * 3 * 2 * 1
# [1] 24
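# R's factorial() gives the same count
factorial(4)
# [1] 24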
# Probability a random arrangement of 10 objects lands in one specific order
1 / factorial(10)
# [1] 0.0000002755732
# I'll assume 25 students are in the class
factorial(25)
# [1] 15511210043330986055303168
# I'll assume 25 students are in the class
factorial(25) / factorial(25-5)
# [1] 6375600
25 * 24 * 23 * 22 * 21
# [1] 6375600
factorial(10) / factorial(10-4)
# [1] 5040
10*9*8*7
# [1] 5040
# We can also calculate permutations with this function:
perm <- function(n, x) {
  factorial(n) / factorial(n - x)
}
perm(10, 4)
# [1] 5040
# This assumes the order matters
perm(6, 3)
# [1] 120
# Order does not matter, so ABC is the same as:
# ACB, BAC, BCA, CAB, CBA
# In other words, six outcomes actually represent
# only one unique outcome
# We can divide our permutation by the number of ways
# we can rearrange our chosen objects
perm(6, 3) / factorial(3)
# [1] 20
# Or we can calculate a combination with choose()
choose(6,3)
# [1] 20
# Order does not matter, so we have a combination
choose(52,5)
# [1] 2598960
# Order does not matter, so we have a combination
# We're choosing 4 people for a group
choose(8,4)
# [1] 70
# To choose 3 groups, we choose
choose(12,4) * choose(8, 4) * choose(4, 4)
# [1] 34650
The answers to those questions are wrong. Why?
# Does the order of the groups matter? For example, suppose we choose
# Group 1: A, B, C, D
# Group 2: E, F, G, H
# Is that different from:
# Group 1: E, F, G, H
# Group 2: A, B, C, D
# If the group identifications do not matter
# (and those two scenarios are the same)
# then we need to divide by the number of ways
# we can rearrange groups
choose(8,4) / factorial(2)
# [1] 35
choose(12,4) * choose(8, 4) * choose(4, 4) / factorial(3)
# [1] 5775
# Arranging 11 letters where one letter appears once, two letters appear
# four times each, and one letter appears twice (e.g., MISSISSIPPI):
# choose the positions for each letter in turn
choose(11,1) * choose(10,4) * choose(6,4) * choose(2,2)
# [1] 34650
# The order of the letters matters (different orders represent different "words")
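# Equivalently, this is the multinomial coefficient 11! / (1! 4! 4! 2!)
factorial(11) / (factorial(1) * factorial(4) * factorial(4) * factorial(2))
# [1] 34650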
# We'll learn how to use the hypergeometric distribution to calculate this quickly
# For now, let's do it step-by-step
# Choose 4 students out of 20
total <- choose(20, 4)
# Choose 4 engineers out of 10
engineers <- choose(10,4)
# Choose no other majors
other <- choose(10,0)
# Event of interest = choosing 4 engineers (and 0 other majors)
engineers * other / total
# [1] 0.04334365
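# The hypergeometric function mentioned above gives the same answer directly
dhyper(4, m = 10, n = 10, k = 4)
# [1] 0.04334365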
# We can also simulate this experiment
# Create 10 engineering majors (1s) and 10 other majors (0s)
students <- c(rep(1,10), rep(0,10))
# Choose 4 of them 10,000 times
choice <- Do(10000) * sample(students, 4)
# Create the data frame
choice <- choice %>%
  mutate(engineers = V1 + V2 + V3 + V4)
# Calculate proportion
prop(~engineers == 4, data=choice)
# TRUE
# 0.0433
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported license.