# Install and load all necessary packages
list.of.packages <- c("tidyverse", "mosaic", "energy", "ggExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(tidyverse)
library(mosaic)
library(energy)
library(ggExtra)
Do NFL teams with more malevolent-looking uniforms play more aggressively than other teams?
In 1988, 25 volunteers were paid $2 each to rate the perceived malevolence of each NFL team’s uniform and logo. The total yards each team was penalized that year were then recorded as z-scores (to represent standardized penalty yards for each team).
Here’s a sample of the data:
Recall the concept of variance: \(s_{x}^{2}=\frac{\sum \left ( x_{i}-\bar{X} \right )^{2}}{n-1}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( x_{i}-\bar{X} \right )}{n-1}=s_{xx}\)
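As a quick sketch in R, we can check that this formula matches the built-in var() function (using the six malevolence ratings from the sample below):
# Verify the variance formula against var()
x <- c(5.1, 4.68, 4, 3.9, 3.38, 2.8)
sum((x - mean(x)) * (x - mean(x))) / (length(x) - 1)
var(x)   # same value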
We can visualize the covariance of our data in a couple ways:
nfl_sample <- tibble(
  malevolence = c(5.1, 4.68, 4, 3.9, 3.38, 2.8),
  penalty = c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6))
ggplot(nfl_sample, aes(x = malevolence, y = penalty)) +
  geom_point(color = "steelblue", size = 3)
To calculate the covariance, we can modify the formula for variance:
\(\textrm{cov}_{xy}=s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}=\frac{(5.1-3.977)(1.19-(-0.27))+...+(2.8-3.977)(-1.60-(-0.27))}{6-1}=0.6889\).
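We can verify that arithmetic in R before reaching for the built-in function:
# Verify the covariance formula by hand; returns 0.6889
with(nfl_sample,
     sum((malevolence - mean(malevolence)) * (penalty - mean(penalty))) / (6 - 1))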
# This calculates the variance-covariance matrix
cov(nfl_sample)
# Multiply both variables by 10
ten_times <- nfl_sample * 10
# The covariance is now 10 * 10 = 100 times larger
cov(ten_times)
That’s one limitation of using covariance as a measure of the strength of the relationship between two variables: it depends on the scale of measurement used.
For example, the penalty variable in our data is reported as z-scores. If we were to use (unstandardized) penalty yards, the covariance would increase to 4.822. If we used penalty feet (instead of yards), the covariance would increase further to 14.467. These examples show that larger values of covariance do not necessarily indicate stronger relationships between variables. Because of this, we shouldn’t directly compare covariances.
To address this scale dependence, we need a standard unit of measurement (e.g., meters, pounds, decibels) into which any scale of measurement (e.g., penalty distance, uniform malevolence) could be converted.
If we use the standard deviation as that standard unit of measurement, we can standardize the covariance by:
\(r_{xy}=\frac{\textrm{cov}_{xy}}{s_{x}s_{y}}\)
or
\(r_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{\left ( n-1 \right )s_{x}s_{y}}\)
This standardized covariance is called Pearson's product-moment correlation coefficient (or r).
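As a quick check, dividing the covariance by the product of the two standard deviations gives the same value that cor() returns below:
# Standardize the covariance by hand
with(nfl_sample, cov(malevolence, penalty) / (sd(malevolence) * sd(penalty)))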
# Using cor()
cor(nfl_sample$malevolence, nfl_sample$penalty)
# Using cor() within dplyr
nfl_sample %>%
  summarize(correlation = cor(malevolence, penalty))
Let’s now calculate the correlation for our full dataset:
# Load data
nfl <- read.csv("http://www.bradthiessen.com/html5/data/NFLuniforms.csv")
# Rename variables
names(nfl) <- c("team", "malevolence", "penalty")
# Plot and annotate with correlation
ggplot(nfl, aes(x = malevolence, y=penalty)) +
geom_point(color="steelblue", size=3) +
annotate("text", x=4.5, y=-1.2, label=paste("r = ", round(cor(nfl$malevolence, nfl$penalty),4)))
The magnitude of the correlation does not change under any linear transformation of the variables. Here’s an example:
# Let's convert malevolence to centi-malevolence
nfl$centi_malevolence <- nfl$malevolence * 100
# The correlation does not change
cor(nfl$centi_malevolence, nfl$penalty)
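A linear transformation with a negative slope flips the sign of r but leaves its magnitude unchanged; here’s a quick sketch (the particular transformation is arbitrary):
# A negative linear transformation flips the sign of the correlation
cor(-2 * nfl$malevolence + 5, nfl$penalty)   # equals -1 times the original r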
Pearson’s r only measures the strength of a linear relationship. If we want to measure the strength of nonlinear relationships, we’ll need another statistic. Here are some values of r for various scatterplots:
Pearson’s r is also sensitive to outliers. Watch what happens to the correlation of our NFL data when we remove two extreme observations:
# Add a variable to highlight the two outliers
nfl$outlier <- c(1, rep(0,26), 1)
# Create scatterplot with the outliers highlighted
ggplot(data=nfl, aes(x = malevolence, y = penalty, color=outlier)) +
geom_point(size=3) + theme(legend.position="none")
# Calculate correlation without highlighted observations
nfl %>%
filter(outlier==0) %>%
summarize(correlation = cor(malevolence, penalty))
Let’s simulate a large dataset with a strong, positive correlation:
# Simulate data
sim_data <- tibble(
  x = 1:1000,
  y = 2*x + rnorm(1000, 3, 300))
# Plot
ggplot(data = sim_data, aes(x, y)) +
geom_point(color="steelblue", size=2, alpha=0.5) +
annotate("text", x = 125, y = 2222, label=paste("r = ",round(cor(sim_data$x, sim_data$y),3))) +
geom_vline(xintercept=900, color="red")
# Calculate correlation when only considering x > 900 (range restriction)
sim_data %>%
filter(x>900) %>%
summarize(correlation = cor(x, y))
Everyone knows that correlation is not causation, but the warning cuts both ways: a correlation does not imply a causal relationship, and a lack of correlation does not mean there is no relationship between two variables.
There are many factors that contribute to this. We’ve already seen that outliers and range restriction can influence the magnitude (and even sign) of Pearson’s r. Likewise, if two variables have a nonlinear relationship, Pearson’s r will misrepresent that relationship.
Yet another reason is the third-variable problem.
A 2004 article questioned research into the relationship between hormone replacement therapy (HRT) and coronary heart disease (CHD). Numerous studies had shown negative correlations between HRT and CHD, leading doctors to propose that HRT was protective against CHD.
Randomized controlled trials showed, in fact, that HRT caused a small but significant increase in the risk of CHD.
Re-analysis of the data from the original studies showed that women undertaking HRT were more likely to be from higher socioeconomic groups. These women, therefore, were more likely to have better-than-average diets and exercise regimens.
The third variable linking hormone replacement therapy to coronary heart disease was socioeconomic status.
We can illustrate the third-variable problem with simulated data. Within each of three subgroups, x and y are generated independently (so the within-group correlations should be near zero), yet the subgroups line up to produce a strong overall correlation:
test <- tibble(
  x = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
  y = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
  group = as.factor(c(rep(1,100), rep(2,100), rep(3,100)))
)
# Calculate overall correlation
overall <- test %>%
  summarize(r = cor(x, y))
# Calculate correlation for each subgroup
subs <- test %>%
  group_by(group) %>%
  summarize(r = cor(x, y))
# Plot and annotate with correlations
ggplot(data = test, aes(x, y, color = group)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme(legend.position = "none") +
  annotate("text", x = 5, y = 18, label = paste("overall r =", round(overall$r, 2)), color = "black") +
  annotate("text", x = 2, y = 4, label = paste("r =", round(subs$r[1], 2)), color = "red") +
  annotate("text", x = 16, y = 9, label = paste("r =", round(subs$r[2], 2)), color = "forestgreen") +
  annotate("text", x = 18.8, y = 16, label = paste("r =", round(subs$r[3], 2)), color = "blue")
If you’d like to explore other examples in which correlation is mistaken for causation, check out:
Even two completely unrelated variables will almost never have a sample correlation of exactly zero. Let’s generate random, independent values for X and Y:
set.seed(3141)
# Generate random values for X and Y
random_xy <- tibble(
x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1)
)
cor(random_xy$x, random_xy$y)
ggplot(data = random_xy, aes(x, y)) +
geom_point(size=3, color="steelblue") +
annotate("text", x = 0, y = 2.8, label=paste("overall r =", round(cor(random_xy$x, random_xy$y), 3)), color="black")
The correlation between any two variables in any sample will (almost certainly) be non-zero, so how can we have any confidence that an observed correlation is “real”?
Let’s take a look at some randomizations of our data:
# Store our sample correlation as test_stat
test_stat <- cor(nfl$malevolence, nfl$penalty)
# Randomize our data
p1 <- nfl %>%
ggplot(., aes(x = malevolence, y = penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "actual data", color="red", size=3)
p2 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #1", color="red", size=3)
p3 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #2", color="red", size=3)
p4 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #3", color="red", size=3)
p5 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #4", color="red", size=3)
p6 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #5", color="red", size=3)
# This uses a "multiplot" function that was loaded silently
multiplot(p1, p4, p2, p5, p3, p6, cols=3)
Each randomization of data yields a non-zero correlation. Let’s run 10,000 randomizations and plot all 10,000 correlations.
# Using mosaic's shuffle() function
shuffled_r <- Do(10000) * cor(malevolence ~ shuffle(penalty), data=nfl)
# Using dplyr (equivalent approach)
# shuffled_r <- Do(10000) * nfl %>%
#   mutate(shuffled = sample(penalty)) %>%
#   summarize(cor = cor(shuffled, malevolence))
# Calculate likelihood of test statistic
pvalue <- prop(shuffled_r$cor >= test_stat)
pvalue
# Plot simple histogram
# histogram(~cor, data=shuffled_r,
# width=.04,
# xlab="Possible correlations assuming null hypothesis is true")
# ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic
# Plot
ggplot(data = shuffled_r, aes(x = cor)) +
geom_histogram(binwidth = 0.04, fill="lightblue", color="white", alpha=0.8) +
annotate("segment", x = test_stat, xend = test_stat, y = 0, yend = 600, color = "red") +
annotate("text", x = test_stat, y = 630, label = "observed r = 0.43", color = "red") +
annotate("text", x = test_stat+.11, y = 170, label = paste("p =",round(pvalue,4)), color = "red") +
labs(
title = "Randomized Correlations",
x = "r"
) +
scale_x_continuous(limits = c(-.7,.7), breaks=seq(-.6, .6, .3), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white", size=.25),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
)
We can also bootstrap the correlation to construct a 95% confidence interval:
# Resample the data (with replacement) 10,000 times
boot <- Do(10000) * cor( malevolence ~ penalty, data=resample(nfl) )
# Get a percentile-based 95% confidence interval
bootstrapCI <- confint(boot, level = 0.95, method = "quantile")
lower <- as.numeric(bootstrapCI[2]) # Store lower CI bound
upper <- as.numeric(bootstrapCI[3]) # Store upper CI bound
# Density plot
ggplot(data = boot, aes(x = cor)) +
geom_density(fill="lightblue", color="white", alpha = 0.8) +
labs(
title = "Bootstrap distribution of correlations",
x = "bootstrap correlations"
) +
scale_x_continuous(breaks=seq(-0.8, 0.8, 0.2), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
) +
annotate("text", x = lower, y = .2, label = round(lower,3)) +
annotate("text", x = upper, y = .2, label = round(upper,3)) +
annotate("text", x = median(boot$cor), y = .25, label = paste("95%")) +
annotate("segment", x = lower, xend = upper, y = 0.1, yend = .1, color = "red")
If we assume our data are sampled from populations with normal distributions, we can use a t-test.
In the next lesson, we’ll derive the following test statistic for the correlation coefficient:
\(t_{n-2}=\frac{r_{xy}-0}{\textrm{SE}_{r_{xy}}}=\frac{r_{xy}}{\sqrt{\frac{1-r_{xy}^{2}}{n-2}}}=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}}=2.427\ \ (p=.011)\)
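As a quick sketch, we can compute this statistic and its one-sided p-value by hand:
# Compute the t statistic for the correlation by hand
r <- cor(nfl$malevolence, nfl$penalty)
n <- nrow(nfl)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                                       # 2.427
pt(t_stat, df = n - 2, lower.tail = FALSE)   # one-sided p-value = 0.011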
We can run this test in R using the cor.test() function:
cor.test(nfl$malevolence, nfl$penalty, alternative = "greater", method = "pearson")
Pearson's product-moment correlation
data: nfl$malevolence and nfl$penalty
t = 2.4272, df = 26, p-value = 0.01122
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
0.1299369 1.0000000
sample estimates:
cor
0.429796
Spearman’s rho uses the same formula as Pearson’s r, except the data are converted to ranks.
# Enter the raw data
mba <- tibble(
  gmat = c(710, 610, 640, 580, 545, 560, 610, 530),
  gpa = c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5)
)
# Calculate Pearson's r and Spearman's rho
mba %>%
  summarize(r = cor(gmat, gpa),
            rho = cor(gmat, gpa, method = "spearman"))
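To verify that Spearman’s rho really is Pearson’s r computed on ranks, we can rank the data ourselves (rank() assigns average ranks to ties, matching what cor() does internally):
# Spearman's rho equals Pearson's r applied to the ranks
mba %>%
  summarize(rho = cor(rank(gmat), rank(gpa)))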
Kendall’s Tau is calculated as:
\(\tau =\frac{\textrm{(number of concordant pairs)}-\textrm{(number of discordant pairs)}}{\textrm{(number of concordant pairs)}+\textrm{(number of discordant pairs)}}\)
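We can calculate Kendall’s tau in R with cor(). Note that when ties are present (as in our mba data), R computes the tie-adjusted version, tau-b:
# Calculate Kendall's tau for the mba data
mba %>%
  summarize(tau = cor(gmat, gpa, method = "kendall"))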
We’ve calculated correlation coefficients for continuous variables, but correlation coefficients can also be calculated for categorical data.
We won’t have time to investigate these in class, but they are all very easy to learn.
The distance correlation is a modern measure of the relationship between two variables, even when the relationship is nonlinear.
Earlier in this lesson, you saw a figure with values of Pearson’s r for various scatterplots. Compare that to the following figure showing values of the distance correlation:
Recall Pearson’s r is a standardized covariance. The numerator of the covariance is the sum of the products of two distances (distance from the mean of X and distance from the mean of Y) over all points: \(\frac{\sum (x_{i}-\bar{X})(y_{i}-\bar{Y})}{n-1}\). The covariance is maximized when all data points are arranged along a straight line.
The numerator of the distance covariance is similar to that of the covariance. The difference is that the distances are between pairs of data points, not between a data point and the mean. The distance covariance is defined by the sum of the products of the two distances over all pairs of points. It is maximized when the data are arranged along a straight line locally (that is, when the data overall form a chain of any shape).
Let’s calculate the distance correlation for an extremely small dataset. Our dataset will be:
X = 1, 2, 3
Y = 3, 1, 5
Let’s first calculate Pearson’s r and Spearman’s rho for these data:
x <- c(1, 2, 3)
y <- c(3, 1, 5)
cor(x, y)
[1] 0.5
cor(x, y, method="spearman")
[1] 0.5
Now, let’s calculate the distance covariance:
\(a_{x}=\begin{bmatrix}0 &1 &2 \\ 1&0 &1 \\ 2 &1 &0 \end{bmatrix} \textrm{ and } b_{y}=\begin{bmatrix}0 &2 &2 \\ 2&0 &4 \\ 2 &4 &0 \end{bmatrix}\)
These matrices contain the pairwise distances between observations: for example, the entries in the first row of each matrix are the distances between observation 1 and observations 1, 2, and 3.
Next, we double-center each matrix: subtract each entry’s row mean and column mean, then add back the grand mean. This gives us:
\(A_{x}=\begin{bmatrix} -1.11 &0.22 &0.89 \\ 0.22&-0.44 &0.22 \\ 0.89 &0.22 &-1.11\end{bmatrix} \textrm{ and } B_{y}=\begin{bmatrix} -0.89 &0.44 &0.44 \\ 0.44&-2.22 &1.78 \\ 0.44 &1.78 &-2.22\end{bmatrix}\)
Taking the element-wise product of these two matrices:
\(A_{x}\circ B_{y}=\begin{bmatrix}0.9877 &0.0987 &0.3951 \\ 0.0987&0.9877 &0.3951 \\ 0.3951 &0.3951 &2.4691\end{bmatrix}\)
The squared distance covariance is the average of these nine entries:
\(\textrm{dCov}_{xy}^{2}=0.69136\)
\(\textrm{dCov}_{xy}=\sqrt{0.69136}=0.8315\)
To convert that distance covariance to a distance correlation, we standardize by dividing by the square root of the product of the distance variances of X and Y (where a distance variance is simply the distance covariance of a variable with itself).
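For the curious, here’s a sketch of the entire calculation in base R, which reproduces the numbers above:
# Reproduce the worked example in base R
x <- c(1, 2, 3)
y <- c(3, 1, 5)
a <- as.matrix(dist(x))   # pairwise distance matrix for x
b <- as.matrix(dist(y))   # pairwise distance matrix for y
# Double-center each matrix: subtract row/column means, add back the grand mean
A <- a - rowMeans(a)[row(a)] - colMeans(a)[col(a)] + mean(a)
B <- b - rowMeans(b)[row(b)] - colMeans(b)[col(b)] + mean(b)
dcov2 <- mean(A * B)                 # squared distance covariance: 0.6914
dvarx <- mean(A * A)                 # squared distance variance of x
dvary <- mean(B * B)                 # squared distance variance of y
sqrt(dcov2)                          # distance covariance: 0.8315
sqrt(dcov2 / sqrt(dvarx * dvary))    # distance correlation: 0.8367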
Thankfully, we can let the dcor() function from the energy package do all these calculations for us:
dcor(x, y)
[1] 0.83666
Let’s look at some correlation coefficients for various simulated datasets.
The midwest dataset (included with ggplot2) contains demographic information about 437 counties in the Midwest. The list of variables is displayed below. Let’s examine the relationship between percollege (the percent of people in the county with a college education) and percbelowpoverty (the percent of people below the poverty line).
# Load data
data(midwest)
# Display variable names and types
str(midwest)
# Display scatterplot
ggplot(data=midwest, aes(x = percollege, y = percbelowpoverty)) +
geom_point(alpha=.5, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
Once you’re satisfied with your solutions, email your html file to thiessenbradleya@sau.edu or give me a printed copy.
If you run into any problems, email me or work with other students in the class.
License
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.