# Install and load all necessary packages
list.of.packages <- c("tidyverse", "mosaic", "energy", "ggExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(tidyverse)
library(mosaic)
library(energy)
library(ggExtra)
Do NFL teams with more malevolent-looking uniforms play more aggressively than other teams?
In 1988, 25 volunteers were paid $2 each to rate the perceived malevolence of each NFL team’s uniform and logo. The total yards each team was penalized that year were then recorded as z-scores (to represent standardized penalty yards for each team).
Here’s a sample of the data:
Recall the concept of variance: \(s_{x}^{2}=\frac{\sum \left ( x_{i}-\bar{X} \right )^{2}}{n-1}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( x_{i}-\bar{X} \right )}{n-1}=s_{xx}\)
We can visualize the covariance of our data in a couple of ways:
nfl_sample <- tibble(
malevolence = c(5.1, 4.68, 4, 3.9, 3.38, 2.8),
penalty = c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6))
ggplot(nfl_sample, aes(x = malevolence, y=penalty)) +
geom_point(color="steelblue", size=3)
To calculate the covariance, we can modify the formula for variance:
\(\textrm{cov}_{xy}=s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}=\frac{(5.1-3.977)(1.19-(-0.27))+...+(2.8-3.977)(-1.60-(-0.27))}{6-1}=0.6889\).
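To see where that 0.6889 comes from, here's the same calculation done directly from the formula, using the nfl_sample tibble defined above:
# Covariance "by hand": sum of products of deviations, divided by n - 1
with(nfl_sample,
     sum((malevolence - mean(malevolence)) * (penalty - mean(penalty))) / (length(penalty) - 1))
# [1] 0.68892  (matches the cov() output below)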
# This calculates the variance-covariance matrix
cov(nfl_sample)
# malevolence penalty
# malevolence 0.7007067 0.68892
# penalty 0.6889200 0.96268
# Multiply both variables by 10 (so the covariance is multiplied by 10 * 10 = 100)
ten_times <- nfl_sample * 10
cov(ten_times)
# malevolence penalty
# malevolence 70.07067 68.892
# penalty 68.89200 96.268
That’s one limitation of using covariance as a measure of the strength of the relationship between two variables: it depends on the scale of measurement used.
For example, the penalty variable in our data is reported as z-scores. If we were to use (unstandardized) penalty yards, the covariance would increase to 4.822; if we used penalty feet instead of yards, it would increase again to 14.467. Larger covariances, then, do not necessarily indicate stronger relationships, so we shouldn’t directly compare covariances measured on different scales.
To address this scale dependence, we need a standard unit of measurement into which any variable, whatever its original scale (penalty distance, uniform malevolence, etc.), can be converted.
If we use the standard deviation as that standard unit of measurement, we can standardize the covariance:
\(r_{xy}=\frac{\textrm{cov}_{xy}}{s_{x}s_{y}}\)
or
\(r_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{(n-1)s_{x}s_{y}}\)
This is called Pearson's product-moment correlation coefficient (or r).
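As a quick check with the sample data above, dividing the covariance by the product of the two standard deviations gives the correlation:
# Standardize the covariance by the two standard deviations
with(nfl_sample, cov(malevolence, penalty) / (sd(malevolence) * sd(penalty)))
# [1] 0.8388025  (the same value cor() returns below)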
# Using cor()
cor(nfl_sample$malevolence, nfl_sample$penalty)
# [1] 0.8388025
# Using cor() within dplyr
nfl_sample %>%
summarize(correlation = cor(malevolence, penalty))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.8388025
Let’s now calculate the correlation for our full dataset:
# Load data
nfl <- read.csv("http://www.bradthiessen.com/html5/data/NFLuniforms.csv")
# Rename variables
names(nfl) <- c("team", "malevolence", "penalty")
# Plot and annotate with correlation
ggplot(nfl, aes(x = malevolence, y=penalty)) +
geom_point(color="steelblue", size=3) +
annotate("text", x=4.5, y=-1.2, label=paste("r = ", round(cor(nfl$malevolence, nfl$penalty),4)))
The magnitude of the correlation does not change under any linear transformation of the variables. Here’s an example:
# Let's convert malevolence to centi-malevolence
nfl$centi_malevolence <- nfl$malevolence / 100
# The correlation does not change
cor(nfl$centi_malevolence, nfl$penalty)
# [1] 0.429796
Pearson’s r only measures the strength of a linear relationship. If we want to measure the strength of nonlinear relationships, we’ll need another statistic. Here are some values of r for various scatterplots:
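As a quick illustration (a small made-up example), r can be essentially zero even when y is completely determined by x, as long as the relationship isn't linear:
# A perfect quadratic relationship still has r = 0
quad <- tibble(x = -10:10, y = x^2)
cor(quad$x, quad$y)
# [1] 0  (zero, up to floating-point rounding, despite the perfect relationship)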
Pearson’s r is also sensitive to outliers. Let’s flag two extreme observations in the NFL data and see how the correlation changes when they’re excluded:
# Add a variable marking the two outliers (the first and last rows)
nfl$outlier <- c(1, rep(0,26), 1)
# Create scatterplot with the outliers highlighted
ggplot(data=nfl, aes(x = malevolence, y = penalty, color=outlier)) +
geom_point(size=3) + theme(legend.position="none")
# Calculate correlation without highlighted observations
nfl %>%
filter(outlier==0) %>%
summarize(correlation = cor(malevolence, penalty))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.08164226
Let’s simulate a large dataset with a strong, positive correlation:
# Simulate data
sim_data <- tibble(
x = c(1:1000),
y = 2*x + rnorm(1000, 3, 300))
# Plot
ggplot(data = sim_data, aes(x, y)) +
geom_point(color="steelblue", size=2, alpha=0.5) +
annotate("text", x = 125, y = 2222, label=paste("r = ",round(cor(sim_data$x, sim_data$y),3))) +
geom_vline(xintercept=900, color="red")
# Range restriction: calculate the correlation using only the observations with x > 900
sim_data %>%
filter(x>900) %>%
summarize(correlation = cor(x, y))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.4114621
Everyone has heard that correlation is not causation. A correlation does not imply a causal relationship, and a lack of correlation does not mean there is no relationship between two variables.
Several factors contribute to this. We’ve already seen that outliers and range restriction can influence the magnitude (and even the sign) of Pearson’s r. Likewise, if two variables have a nonlinear relationship, Pearson’s r will misrepresent that relationship.
Yet another reason is the third-variable problem.
A 2004 article questioned research into the relationship between hormone replacement therapy (HRT) and coronary heart disease (CHD). Numerous studies had shown negative correlations between HRT and CHD, leading doctors to propose that HRT was protective against CHD.
Randomized controlled trials showed, in fact, that HRT caused a small but significant increase in the risk of CHD.
Re-analysis of the data from the original studies showed that women undergoing HRT were more likely to be from higher socioeconomic groups. These women, therefore, were also more likely to have better-than-average diets and exercise regimens.
The third variable linking hormone replacement therapy to coronary heart disease was socioeconomic status.
We can simulate a similar third-variable effect. In the data below, x and y are uncorrelated within each subgroup, but the subgroup means rise together, producing a strong overall correlation:
test <- tibble(
x = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
y = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
group = as.factor(c(rep(1,100), rep(2,100), rep(3,100)))
)
# Calculate overall correlation
overall <- test %>%
summarize(r = cor(x, y))
# Calculate correlation for each subgroup
subs <- test %>%
group_by(group) %>%
summarize(r = cor(x, y))
# Plot and annotate with correlations
ggplot(data = test, aes(x, y, color = group)) +
geom_point(size = 2, alpha=0.7) +
geom_smooth(method="lm", se=FALSE) +
theme(legend.position="none") +
annotate("text", x = 5, y = 18, label=paste("overall r =", round(overall, 2)), color="black") +
annotate("text", x = 2, y = 4, label = paste("r =", round(subs[1,2], 2)), color="red") +
annotate("text", x = 16, y = 9, label = paste("r =", round(subs[2,2], 2)), color="forestgreen") +
annotate("text", x = 18.8, y = 16, label = paste("r = ", round(subs[3,2], 2)), color = "blue")
If you’d like to explore other examples in which correlation is mistaken for causation, there are plenty of collections of spurious correlations online.
Correlation can also appear purely by chance. Let’s generate two variables completely at random and look at their sample correlation:
set.seed(3141)
# Generate random values for X and Y
random_xy <- tibble(
x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1)
)
cor(random_xy$x, random_xy$y)
# [1] 0.200954
ggplot(data = random_xy, aes(x, y)) +
geom_point(size=3, color="steelblue") +
annotate("text", x = 0, y = 2.8, label=paste("overall r =", round(cor(random_xy$x, random_xy$y), 3)), color="black")
The correlation between any two variables in any sample of data will almost never be exactly zero, so how can we have any confidence that an observed correlation is “real”?
Let’s take a look at some randomizations of our data:
# Store our sample correlation as test_stat
test_stat <- cor(nfl$malevolence, nfl$penalty)
# Randomize our data
p1 <- nfl %>%
ggplot(., aes(x = malevolence, y = penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "actual data", color="red", size=3)
p2 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #1", color="red", size=3)
p3 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #2", color="red", size=3)
p4 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #3", color="red", size=3)
p5 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #4", color="red", size=3)
p6 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #5", color="red", size=3)
# This uses a "multiplot" function that was loaded silently
multiplot(p1, p4, p2, p5, p3, p6, cols=3)
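If the multiplot() helper isn't available in your session, the gridExtra package (not loaded above, so this is just an alternative sketch) produces the same layout:
# Equivalent layout using gridExtra
# install.packages("gridExtra")   # if needed
gridExtra::grid.arrange(p1, p4, p2, p5, p3, p6, ncol = 3)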
Each randomization of data yields a non-zero correlation. Let’s run 10,000 randomizations and plot all 10,000 correlations.
# Using mosaic's Do() and shuffle()
shuffled_r <- Do(10000) * cor(malevolence ~ shuffle(penalty), data=nfl)
# Using dplyr
# shuffled_r <- Do(10000) * nfl %>%
#   mutate(shuffled = sample(penalty)) %>%
#   summarize(cor = cor(shuffled, malevolence))
# Calculate likelihood of test statistic
pvalue <- prop(shuffled_r$cor >= test_stat)
pvalue
# TRUE
# 0.0097
# Plot simple histogram
# histogram(~cor, data=shuffled_r,
# width=.04,
# xlab="Possible correlations assuming null hypothesis is true")
# ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic
# Plot
ggplot(data = shuffled_r, aes(x = cor)) +
geom_histogram(binwidth = 0.04, fill="lightblue", color="white", alpha=0.8) +
annotate("segment", x = test_stat, xend = test_stat, y = 0, yend = 600, color = "red") +
annotate("text", x = test_stat, y = 630, label = "observed r = 0.43", color = "red") +
annotate("text", x = test_stat+.11, y = 170, label = paste("p =",round(pvalue,4)), color = "red") +
labs(
title = "Randomized Correlations",
x = "r"
) +
scale_x_continuous(limits = c(-.7,.7), breaks=seq(-.6, .6, .3), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white", size=.25),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
)
We can also use the bootstrap to estimate a confidence interval for the correlation. Here, we resample the NFL data (with replacement) 10,000 times and calculate the correlation in each resample:
# Bootstrap the correlation by resampling rows of the data
boot <- Do(10000) * cor( malevolence ~ penalty, data=resample(nfl) )
bootstrapCI <- confint(boot, level = 0.95, method = "quantile")
lower <- as.numeric(bootstrapCI[2]) # Store lower CI bound
upper <- as.numeric(bootstrapCI[3]) # Store upper CI bound
# Density plot
ggplot(data = boot, aes(x = cor)) +
geom_density(fill="lightblue", color="white", alpha = 0.8) +
labs(
title = "Bootstrap distribution of correlations",
x = "bootstrap correlations"
) +
scale_x_continuous(breaks=seq(-0.8, 0.8, 0.2), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
) +
annotate("text", x = lower, y = .2, label = round(lower,3)) +
annotate("text", x = upper, y = .2, label = round(upper,3)) +
annotate("text", x = median(boot$cor), y = .25, label = paste("95%")) +
annotate("segment", x = lower, xend = upper, y = 0.1, yend = .1, color = "red")
If we’re willing to assume our data are sampled from a bivariate normal population, we can also test the correlation with a t-test.
In the next lesson, we’ll derive the following test statistic for the correlation coefficient:
\(t_{n-2}=\frac{r_{xy}-0}{\textrm{SE}_{r_{xy}}}=\frac{r_{xy}}{\sqrt{\frac{1-r_{xy}^{2}}{n-2}}}=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}}=2.427\ \ (p=.011)\)
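Here's that calculation carried out directly in R, as a check against the cor.test() output that follows:
# t statistic for H0: rho = 0, computed from the formula above
r <- cor(nfl$malevolence, nfl$penalty)
n <- nrow(nfl)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                                      # approximately 2.427
pt(t_stat, df = n - 2, lower.tail = FALSE)  # one-sided p-value, approximately 0.011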
We can run this test in R using the cor.test() function:
# t-test for correlation coefficient
cor.test(nfl$malevolence, nfl$penalty, alternative = c("greater"), method=c("pearson"))
#
# Pearson's product-moment correlation
#
# data: nfl$malevolence and nfl$penalty
# t = 2.4272, df = 26, p-value = 0.01122
# alternative hypothesis: true correlation is greater than 0
# 95 percent confidence interval:
# 0.1299369 1.0000000
# sample estimates:
# cor
# 0.429796
Spearman’s rho uses the same formula as Pearson’s r, except the data are converted to ranks.
# Enter the raw data
mba <- tibble(
gmat = c(710, 610, 640, 580, 545, 560, 610, 530),
gpa = c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5)
)
# Calculate Pearson's r and Spearman's rho
mba %>%
summarize(r = cor(gmat, gpa),
rho = cor(gmat, gpa, method="spearman"))
# # A tibble: 1 x 2
# r rho
# <dbl> <dbl>
# 1 0.6953241 0.6788003
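To see that rho really is just Pearson's r computed on ranks, we can rank the data ourselves (rank() assigns midranks to the tied values):
# Spearman's rho as Pearson's r on the ranked data
mba %>%
  summarize(rho_via_ranks = cor(rank(gmat), rank(gpa)))
# Should match the rho value above (0.6788)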
Kendall’s Tau is calculated as:
\(\tau =\frac{\textrm{(number of concordant pairs)}-\textrm{(number of discordant pairs)}}{\textrm{(number of concordant pairs)}+\textrm{(number of discordant pairs)}}\)
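This simple form assumes no tied observations. Kendall's tau is also available through cor(); note that our GMAT/GPA data do contain ties, which R handles with a tie-adjusted version of tau:
# Kendall's tau for the same data
mba %>%
  summarize(tau = cor(gmat, gpa, method = "kendall"))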
We’ve calculated correlation coefficients for continuous variables, but correlation coefficients can also be calculated for categorical data.
We won’t have time to investigate these in class, but they are all very easy to learn.
The distance correlation is a modern measure of the relationship between two variables, even when the relationship is nonlinear.
Earlier in this lesson, you saw a figure with values of Pearson’s r for various scatterplots. Compare that to the following figure showing values of the distance correlation:
Recall Pearson’s r is a standardized covariance. The numerator of the covariance is the sum of the products of two distances (distance from the mean of X and distance from the mean of Y) over all points: \(\frac{\sum (x_{i}-\bar{X})(y_{i}-\bar{Y})}{n-1}\). The covariance is maximized when all data points are arranged along a straight line.
The numerator of the distance covariance is similar to that of the covariance, except the distances are measured between pairs of data points rather than between each data point and the mean. The distance covariance sums the products of these two distances over all pairs of points, and it is maximized when the data are arranged along a straight line locally (that is, when the data overall form a chain of any shape).
Let’s calculate the distance correlation for an extremely small dataset. Our dataset will be:
X = 1, 2, 3
Y = 3, 1, 5
Let’s first calculate Pearson’s r and Spearman’s rho for this data:
# Input data
x <- c(1, 2, 3)
y <- c(3, 1, 5)
# Pearson's r
cor(x, y, method="pearson")
# [1] 0.5
# Spearman's rho
cor(x, y, method="spearman")
# [1] 0.5
Now, let’s calculate the distance covariance. We start by forming the matrices of pairwise distances within X and within Y:
\(a_{x}=\begin{bmatrix}0 &1 &2 \\ 1&0 &1 \\ 2 &1 &0 \end{bmatrix} \textrm{ and } b_{y}=\begin{bmatrix}0 &2 &2 \\ 2&0 &4 \\ 2 &4 &0 \end{bmatrix}\)
Each entry of these matrices is the distance between a pair of observations; for example, the (1, 2) entry of each matrix is the distance between observations 1 and 2.
Next, we double-center each matrix: from every entry, we subtract that entry’s row mean and column mean and add back the grand mean of the matrix. This gives us:
\(A_{x}=\begin{bmatrix} -1.11 &0.22 &0.89 \\ 0.22&-0.44 &0.22 \\ 0.89 &0.22 &-1.11\end{bmatrix} \textrm{ and } B_{y}=\begin{bmatrix} -0.89 &0.44 &0.44 \\ 0.44&-2.22 &1.78 \\ 0.44 &1.78 &-2.22\end{bmatrix}\)
Multiplying the two centered matrices element by element gives:
\(A_{x}\circ B_{y}=\begin{bmatrix}0.9877 &0.0987 &0.3951 \\ 0.0987&0.9877 &0.3951 \\ 0.3951 &0.3951 &2.4691\end{bmatrix}\)
The squared distance covariance is the average of these \(n^{2}=9\) entries:
\(\textrm{dCov}_{xy}^{2}=\frac{1}{n^{2}}\sum_{i,j}A_{ij}B_{ij}=0.69136\)
\(\textrm{dCov}_{xy}=\sqrt{0.69136}=0.8315\)
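Here's a minimal by-hand check of those steps in R (redefining x and y so the chunk stands on its own):
# Distance covariance "by hand"
x <- c(1, 2, 3)
y <- c(3, 1, 5)
a <- as.matrix(dist(x))    # pairwise distance matrix for x
b <- as.matrix(dist(y))    # pairwise distance matrix for y
# Double-center: subtract each entry's row mean and column mean, add back the grand mean
center <- function(m) m - rowMeans(m)[row(m)] - colMeans(m)[col(m)] + mean(m)
A <- center(a)
B <- center(b)
sqrt(mean(A * B))          # distance covariance, approximately 0.8315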
To convert that distance covariance to a distance correlation, we standardize it by dividing by the square root of the product of the distance variances of X and Y:
\(\textrm{dCor}_{xy}=\frac{\textrm{dCov}_{xy}}{\sqrt{\textrm{dVar}_{x}\,\textrm{dVar}_{y}}}=\frac{0.8315}{\sqrt{(0.7027)(1.4055)}}=0.8367\)
Thankfully, we can let R do all these calculations for us:
# We loaded the energy package earlier
# library(energy)
# Calculate all the components of a distance correlation
# Verify the distance covariance is 0.8315
DCOR(x, y)
# $dCov
# [1] 0.8314794
#
# $dCor
# [1] 0.83666
#
# $dVarX
# [1] 0.7027284
#
# $dVarY
# [1] 1.405457
# Here's a direct way to calculate the distance correlation
dcor(x, y)
# [1] 0.83666
Let’s look at some correlation coefficients for various simulated datasets.
The midwest dataset (it comes with ggplot2) contains demographic information about 437 counties in the Midwest. The list of variables is displayed below. We’ll examine the relationship between percollege (the percent of people in the county with a college education) and percbelowpoverty (the percent of people below the poverty line).
# Load data
data(midwest)
# Display variable names and types
str(midwest)
# Classes 'tbl_df', 'tbl' and 'data.frame': 437 obs. of 28 variables:
# $ PID : int 561 562 563 564 565 566 567 568 569 570 ...
# $ county : chr "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
# $ state : chr "IL" "IL" "IL" "IL" ...
# $ area : num 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
# $ poptotal : int 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
# $ popdensity : num 1271 759 681 1812 324 ...
# $ popwhite : int 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
# $ popblack : int 1702 3496 429 127 547 50 1 111 16 16559 ...
# $ popamerindian : int 98 19 35 46 14 65 8 30 8 331 ...
# $ popasian : int 249 48 16 150 5 195 15 61 23 8033 ...
# $ popother : int 124 9 34 1139 6 221 0 84 6 1596 ...
# $ percwhite : num 96.7 66.4 96.6 95.3 90.2 ...
# $ percblack : num 2.575 32.9 2.862 0.412 9.373 ...
# $ percamerindan : num 0.148 0.179 0.233 0.149 0.24 ...
# $ percasian : num 0.3768 0.4517 0.1067 0.4869 0.0857 ...
# $ percother : num 0.1876 0.0847 0.2268 3.6973 0.1028 ...
# $ popadults : int 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
# $ perchsd : num 75.1 59.7 69.3 75.5 68.9 ...
# $ percollege : num 19.6 11.2 17 17.3 14.5 ...
# $ percprof : num 4.36 2.87 4.49 4.2 3.37 ...
# $ poppovertyknown : int 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
# $ percpovertyknown : num 96.3 99.1 95 98.5 82.5 ...
# $ percbelowpoverty : num 13.15 32.24 12.07 7.21 13.52 ...
# $ percchildbelowpovert: num 18 45.8 14 11.2 13 ...
# $ percadultpoverty : num 11.01 27.39 10.85 5.54 11.14 ...
# $ percelderlypoverty : num 12.44 25.23 12.7 6.22 19.2 ...
# $ inmetro : int 0 0 0 1 0 0 0 0 0 1 ...
# $ category : chr "AAR" "LHR" "AAR" "ALU" ...
# Display scatterplot
ggplot(data=midwest, aes(x = percollege, y = percbelowpoverty)) +
geom_point(alpha=.5, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
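To put a number on the pattern in that scatterplot, we can also compute the correlation between the two variables (a sketch you can adapt; try the other coefficients from this lesson as well):
# Correlation between college education and poverty rates across counties
midwest %>%
  summarize(r = cor(percollege, percbelowpoverty))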
Once you’re satisfied with your solutions, email that html file to thiessenbradleya@sau.edu or give me a printed copy.
If you run into any problems, email me or work with other students in the class.
License
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.