# Load the tidyverse and mosaic packages
library(tidyverse)
library(mosaic)
If you gave a student’s essay to multiple teachers, would they score it consistently? Can consistency improve through training?
50 teachers score the same essay. 28 received prior training, while the other 22 received no training.
Click code
to see how the data were entered –>
# Simulate n=28 scores with a mean=5, sd=1.8.
# Force all scores to be between 1-10
# Round to nearest whole number.
set.seed(123)
n1 <- 28
m <- 5
s1 <- 1.8
L <- 1
U <- 10
trained <- round(qnorm(runif(n1, pnorm(L, mean=m, sd=s1),
pnorm(U, mean=m, sd=s1)), mean=m, sd=s1),0)
# Simulate n=22 scores with the same conditions
# except make the standard deviation larger (around 3.7)
set.seed(123)
n2 <- 22
s2 <- 3.7
L <- 1
U <- 10
untrained <- round(qnorm(runif(n2, pnorm(L, mean=m, sd=s2),
pnorm(U, mean=m, sd=s2)), mean=m, sd=s2),0)
# Combine these scores into a single data frame
essay <- tibble(
group = c( rep("trained", 28), rep("not-trained", 22) ),
scores = c(trained, untrained))
# Get summary statistics
essay %>%
group_by(group) %>%
summarize(n = n(), mean=mean(scores), sd=sd(scores), var=var(scores))
Group | n | mean | sd | var |
---|---|---|---|---|
Not-trained | 22 | 5.95 | 2.44 | 5.950 |
Trained | 28 | 5.57 | 1.73 | 2.995 |
essay %>%
ggplot(aes(x = group, y = scores)) +
geom_dotplot(binaxis = "y", binpositions="all", stackdir = "center", fill = "steelblue", color = "white") +
scale_y_continuous(breaks=seq(1, 10, 10), minor_breaks=NULL) +
theme(axis.title.x = element_text(size = 12, color = "#777777")) +
theme(axis.text.x = element_text(size = 12)) +
theme(axis.title.y = element_text(size = 12, color="#777777")) +
theme(axis.text.y = element_text(size = 12)) +
labs(
title = "Scores by group"
)
essay %>%
ggplot(aes(y = scores, x = group)) +
geom_boxplot(fill = "white", color = "black", alpha = 0.6) +
geom_dotplot(binaxis = "y", stackdir = "center", fill = "steelblue", color = "white", alpha = 0.8) +
scale_y_continuous(breaks=seq(1, 10, 10), minor_breaks=NULL) +
theme(axis.title.x = element_text(size = 12, color = "#777777")) +
theme(axis.text.x = element_text(size = 12)) +
theme(axis.title.y = element_text(size = 12, color="#777777")) +
theme(axis.text.y = element_text(size = 12)) +
labs(
title = "Scores by group"
)
We use \(\overline{X}_{1}-\overline{X}_{2}\) to estimate the difference between two means. How do we estimate the difference in variances? Consider the following:
Scenario A: A hospital calibrates and then repeatedly tests two blood pressure monitors. Measurements from the inexpensive armband-pump had a variance of 36 units. The expensive automated monitor had a variance of 9 units.
Scenario B: Samples of upholstery fabric from two suppliers have similar average durability ratings (75,000 double-rubs) but different variances. Samples from one supplier had a variance of 22,000,000 double-rubs, while the variance from the other supplier was 20,000,000 double-rubs.
We’ll define our test statistic as:
test.stat
= \(\frac{s_{1}^{2}}{s_{2}^{2}}=\frac{5.950}{2.995}=1.987\)
test_stat <- var( essay$scores[essay$group=="not-trained"] ) /
var( essay$scores[essay$group=="trained"] )
test_stat
How likely were we to obtain a ratio of 1.987 from our sample data if there were no differences in population variances for the groups? To estimate this, we can use randomization- and theory-based methods to determine the samping distribution of the ratio of variances.
Framework: | Goal: NHST | Goal: Estimate effect |
---|---|---|
Randomization-based | Randomized test of ratios | Bootstrap confidence interval |
Theory-based | F-, Levene, Brown-Forsythe tests | Confidence interval |
To estimate the sampling distribution of our test statistic, we will:
The code to the right shows how I did this in R:
# There are many ways to calculate the ratio of variances
# with 10,000 random shuffles of the group assignment.
# The sample() function randomly samples (shuffles) the data
#
# <------------------------------>
# <--- Option #1: Really fast --->
# <------------------------------>
#
# To conduct the 10,000 replications...
# random_ratios <- replicate(10000, {
# all <- sample(essay$scores)
# shuffled.train <- all[1:28]
# shuffled.untrain <- all[29:50]
# return(var(shuffled.untrain) / var(shuffled.train))
# })
#
# Here's a function to shuffle and calculate the variance ratio
# It's slow, though.
# random_ratios <- function (scores, groups)
# {
# shuffledata <- sample(scores)
# n1 <- count(groups)
# N <- length(groups)
# shuffle1 <- shuffledata[1:n1]
# shuffle2 <- shuffledata[(n1+1):N]
# return(var(shuffle1) / var(shuffle2))
# }
# random_ratios <- replicate(10000, rvr(essay$scores, essay$group))
#
#
# <------------------------------>
# <------ Option #2: dplyr ------>
# <------------------------------>
#
# random_ratios <- Do(10000) * essay %>%
# mutate(shuffled = sample(scores)) %>%
# summarize(var.ratio = var(shuffled[group=="not-trained"]) /
# var(shuffled[group=="trained"]) )
#
# or
# random_ratios <- Do(1000) * essay %>%
# mutate(shuffled = sample(scores)) %>%
# group_by(group) %>%
# summarize(variances = var(shuffled)) %>%
# mutate(ratio = lag(variances) / variances) %>%
# select(ratio) %>%
# filter(ratio != "NA")
#
# <------------------------------>
# <----- Option #3: slowest ----->
# <------------------------------>
random_ratios <- Do(10000) * var.test(scores ~ shuffle(group), data=essay)$estimate
#
# That stored the results in a variable (column) called "ratio.of.variances"
Now we plot the randomized variance ratios:
# Let's plot the distribution of those randomized ratios of variances
#
# Here's the most straight-forward way to plot this
# histogram(~ratio.of.variances, data=random_ratios,
# xlim=c(0,4), # Set the x-axis range
# width=.1, # Set the width of each bar
# xlab="Possible mean differences assuming null hypothesis is true")
# ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic
#
# I'll use ggplot2. Replace geom_density with geom_histogram for a histogram
ggplot(data = random_ratios, aes(x = ratio.of.variances)) +
geom_density(fill="lightblue", color="white", alpha = 0.8) +
annotate("segment", x = 1.987, xend = 1.987, y = 0, yend = .8, color = "red") +
annotate("text", x = 1.987, y = .85, label = "observed 1.987", color = "red") +
annotate("segment", x = 1/1.987, xend = 1/1.987, y = 0, yend = .8, color = "red") +
annotate("text", x = 1/1.987, y = .85, label = "observed 1/1.987", color = "red") +
labs(
title = "Randomized ratios of variances",
x = "ratios of variances"
) +
scale_x_continuous(breaks=seq(0, 4, 1), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
)
# Count the proportion of randomized ratios > our test statistic
p <- prop(~ (ratio.of.variances >= test_stat), data=random_ratios)
# We were just as likely to calculate our ratio as
# var(untrained) / var(trained)
# as we were to calculate the reciprocal. We should estimate
# the p-value for the ratio of our test statistic.
p_reciprocal <- prop(~ (ratio.of.variances <= 1/test_stat), data=random_ratios)
# Sum these values
pvalue <- p + p_reciprocal
pvalue
Expand the code to see how I did this in R.
boot_ratios <- Do(10000) * var.test(scores ~ group, data=resample(essay))$estimate
# Capture the endpoints
lower <- quantile(boot_ratios$ratio.of.variances, 0.025)
upper <- quantile(boot_ratios$ratio.of.variances, 0.975)
# Put the endpoints into the plot
# Density plot
ggplot(data = boot_ratios, aes(x = ratio.of.variances)) +
geom_density(fill="lightblue", color="white", alpha = 0.8) +
labs(
title = "Bootstrap distribution of variance ratios",
x = "ratio of group variances (from resampled data)"
) +
scale_x_continuous(breaks=seq(0, 6, 1), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
) +
annotate("text", x = lower, y = .05, label = round(lower,3)) +
annotate("text", x = upper, y = .05, label = round(upper,3)) +
annotate("text", x = mean(boot_ratios$ratio.of.variances), y = .055, label = paste("95%")) +
annotate("segment", x = lower, xend = upper, y = 0.03, yend = .03, color = "red")
In this example, the randomization-based methods resulted in a sampling distribution with a positive skew. Will the distribution of sample variance ratios always have a positive skew? To investigate, let’s run some simulations.
Let’s start with two populations that follow normal distributions with equal variances:
# I'll use lattice graphics
plotDist('norm', mean=60, sd=10, lw=3, xlim = c(0, 200), main="Normal distributions with equal variances")
plotDist('norm', mean=140, sd=10, lw=3, col="red", add=TRUE)
# Here's the ggplot2 syntax
# ggplot(data.frame(x = c(0,200)), aes(x = x)) +
# stat_function(fun = dnorm, args = list(mean = 60, sd = 10), color="blue", size=2) +
# stat_function(fun = dnorm, args = list(mean = 140, sd = 10), color="red", size=2) +
# labs(
# title = "Normal distributions with equal variances")
Here are the distributions of variance ratios with sample sizes of 10 and 100. Do they match your expectations?
# Take random samples and calculate ratio of variances
# Results are stored in a variable called "result"
ratios10 <- Do(5000) * ( var(rnorm(10, mean=60, sd=10)) / var(rnorm(10, mean=140, sd=10)) )
ratios100 <- Do(5000) * ( var(rnorm(100, mean=60, sd=10)) / var(rnorm(100, mean=140, sd=10)) )
# Plot the distributions
ggplot(data = ratios100, aes(x = result)) +
geom_density(fill="lightgreen", color="forestgreen", alpha = 0.2) +
geom_density(data = ratios10, aes(x=result), fill="purple", color="purple", alpha = 0.1) +
annotate("text", x = 2.5, y = .25, label = "n=10", color="purple") +
annotate("text", x = 1.8, y = 1.5, label = "n=100", color="forestgreen") +
coord_cartesian(seq(0,10,1)) +
scale_x_continuous(breaks=seq(0, 10, 1), minor_breaks=NULL)
This time, let’s repeat the simulation with normal distributions that have unequal variances.
# I'll use lattice graphics
plotDist('norm', mean=60, sd=20, lw=3, xlim = c(0, 200), ylim = c(0, .042), main="Normal distributions with unequal variances (400 vs 100)")
plotDist('norm', mean=140, sd=10, lw=3, col="red", add=TRUE)
# Take random samples and calculate ratio of variances
# Results are stored in a variable called "result"
ratios10 <- Do(5000) * ( var(rnorm(10, mean=60, sd=20)) / var(rnorm(10, mean=140, sd=10)) )
ratios100 <- Do(5000) * ( var(rnorm(100, mean=60, sd=20)) / var(rnorm(100, mean=140, sd=10)) )
# Plot the distributions
ggplot(data = ratios100, aes(x = result)) +
geom_density(fill="lightgreen", color="forestgreen", alpha = 0.2) +
geom_density(data = ratios10, aes(x=result), fill="purple", color="purple", alpha = 0.1) +
annotate("text", x = 2, y = .2, label = "n=10", color="purple") +
annotate("text", x = 2.8, y = .4, label = "n=100", color="forestgreen") +
coord_cartesian(seq(0,14,1)) +
scale_x_continuous(breaks=seq(0, 14, 2), minor_breaks=NULL)
Let’s see if we can predict what will happen if we start with distributions that are not normally distributed. Let’s start with two non-central t-distributions with 3 degrees-of-freedom:
plotDist('t', df=3, ncp=1, lw=3, xlim = c(-5, 5), ylim = c(0,.5), main="Skewed distributions with equal variances")
plotDist('t', df=3, ncp=-1, lw=3, col="red", add=TRUE)
# Take random samples and calculate ratio of variances
ratios10 <- Do(5000) * ( var(rt(10, df=3, ncp=1)) / var(rt(10, df=3, ncp=-1)) )
# Plot the distribution
ggplot(data = ratios10, aes(x = result)) +
geom_density(fill="purple", color="purple", alpha = 0.2) +
stat_function(fun = df, args = list(df1=9, df2=9)) +
annotate("text", x = -.5, y = .25, label = "n=10", color="purple") +
annotate("text", x = 1.8, y = .5, label = "expectation", color="black") +
scale_x_continuous(limits = c(-1,8), breaks=seq(0, 8, 1), minor_breaks=NULL)
If we repeatedly sample \(n_1\) and \(n_2\) observations from two populations and calculate \(\frac{s_{1}^{2}}{s_{2}^{2}}\) for each pair of samples, then:
If:
If we want to compare two sample variances, we can simply take their ratio and compare it to an F-distribution with the appropriate set of degrees-of-freedom. If our sample variance ratio is unlikely to have come from that F-distribution, then either:
# Q-Q Plot to check for normality
# What does this plot show?
qqnorm(essay$scores, ylab = "Essay scores"); qqline(essay$scores)
# Recall our observed test_stat = 1.987
# Here's a function to plot the F-distribution
# with 21 and 27 degrees-of-freedom
# and display the p-value.
xpf((1.987), df1=21, df2=27)
# Here's the built-in F-test
# Note that it also produces a confidence interval
var.test(scores ~ group, data = essay, alternative="greater")
If you want to see a derivation of the F-distribution (touching such topics as confidence intervals for variances, along with the t- and chi-square distributions), you can read through pages 4-17 of an old activity I wrote long ago.
Recall this F-test only works under the condition that our population data follow normal distributions. Other tests to compare variances (e.g., Bartlett’s test and Hartley’s Fmax test are also sensitive to violations of normality.
Alternative tests (such as Levene’s test or the Brown-Forsyth test) are more robust against violations of the normality assumption. To understand these tests, though, we need to first learn about analysis of variance (which is the topic of the next activity.)
Expand the code to see Levene’s test and the Brown-Forsythe test in action:
# Functions for both tests are contained in the car package
# install.packages("car", dependencies=TRUE)
library(car)
# Levene Test
leveneTest(scores ~ group, data = essay, center = mean)
# Brown-Forsythe Test (uses medians)
leveneTest(scores ~ group, data = essay, center = median)
studyhours
dataframe (with variables class
(Freshmen or Sophomores) and hours
(spent studying per week).var.test
function to conduct an F-test to compare group variances. Make sure you evaluate the assumption(s) necessary to conduct this test.studyhours <- read.csv("http://www.bradthiessen.com/html5/data/studyhours.csv")
head(studyhours)
Once you’re satisfied with your solutions, email that html file to thiessenbradleya@sau.edu or give me a printed copy.
If you run into any problems, email me or work with other students in the class.
Sources
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.