In movies, good guys wear white and bad guys wear black. Does the color of a person’s clothing have a significant impact on that person’s behavior (or the perception of that behavior)?
In 1988, researchers asked whether NFL teams with black uniforms play more aggressively (or are perceived to play more aggressively) than teams with less malevolent-looking uniforms.
The researchers paid 25 volunteers $2 each to rate the perceived malevolence of each NFL team’s uniform and logo. Then, the researchers converted the number of yards each team was penalized each year into z-scores and averaged those z-scores to obtain a standardized number of penalty yards for each team.
The researchers wanted to measure the relationship between perceived malevolence and penalty yards.
Here’s a sample of the data collected by the researchers:
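The full table isn’t reproduced here, but the six sample values can be recovered from the R output shown later in this lesson (the biggerdata object below is just this sample multiplied by 10). Here’s one way to enter them as the nflsample data frame used in the code that follows:
# Enter the six-team sample (values recovered from the output below)
nflsample <- data.frame(
  malevolence = c(5.10, 4.68, 4.00, 3.90, 3.38, 2.80),
  penalty     = c(1.19, 0.29, -0.73, -0.81, 0.04, -1.60))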
Recall the concept of variance: \(s_{x}^{2}=\frac{\sum \left ( x_{i}-\bar{X} \right )^{2}}{n-1}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( x_{i}-\bar{X} \right )}{n-1}=s_{xx}\)
In this scenario, we’re interested in the relationship between two continuous variables. Specifically, we’re interested in whether changes in one variable (malevolence) are met with similar changes in the other variable (penalty yards). In other words, we’re interested in the following question: When one variable deviates from its mean, does the other variable deviate from its mean in a similar way?
We can visualize this covariance in a diagram:
… or in a scatterplot:
To calculate the covariance, we can modify our formula for the variance: \(\textrm{cov}_{xy}=s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}\)
The covariance of our sample data is:
\(s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}=\frac{(5.1-3.977)(1.19-(-0.27))+...+(2.8-3.977)(-1.60-(-0.27))}{6-1}=0.6889\)
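We can check this arithmetic directly in R by applying the formula to the nflsample data frame created above:
# Covariance "by hand": sum of cross-products of deviations, divided by n - 1
with(nflsample,
     sum((malevolence - mean(malevolence)) * (penalty - mean(penalty))) / (6 - 1))
## [1] 0.68892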
We can use the cov() function to calculate the covariance in R. It will output the sample variances on the diagonal and the sample covariances on the off-diagonals:
# Calculate the covariance of our sample data (nflsample)
cov(nflsample)
## malevolence penalty
## malevolence 0.7007067 0.68892
## penalty 0.6889200 0.96268
# Multiply every value by 10
biggerdata <- nflsample *10
# Take a look at these bigger values
biggerdata
## malevolence penalty
## 1 51.0 11.9
## 2 46.8 2.9
## 3 40.0 -7.3
## 4 39.0 -8.1
## 5 33.8 0.4
## 6 28.0 -16.0
# Calculate the covariance of these bigger values
cov(biggerdata)
## malevolence penalty
## malevolence 70.07067 68.892
## penalty 68.89200 96.268
There is one problem with using covariance as a measure of the relationship between variables: it depends on the scale of measurement used. Covariance is not a standardized measure.
For example, the penalty yards variable in our dataset is reported as z-scores. If we were to convert those values to the number of yards penalized, the covariance would differ (it becomes 4.822). Likewise, if we were then to convert that variable to the number of feet penalized, the covariance would change yet again (to 14.467).
This dependence on the scale of measurement means we cannot compare covariances directly.
To address this scale dependence, we need to convert the covariance to a standard set of units. We need to think of a unit of measurement (e.g., meters) into which any scale of measurement (e.g., malevolence of NFL uniforms) could be converted.
The unit of measurement we’ll use for standardization is the standard deviation. We can standardize the covariance in one of two ways:
• Standardize our variables (into z-scores) and then calculate the covariance: \(r_{xy}=\frac{\sum \left ( z_{xi}-\bar{Z_{x}} \right )\left ( z_{yi}-\bar{Z_{y}} \right )}{n-1}=\frac{\sum z_{x}z_{y}}{n-1}\)
or
• Calculate the covariance and standardize it (by dividing by the product of the standard deviations): \(r_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{(n-1)s_{x}s_{y}}\)
This standardized covariance is called Pearson’s product-moment correlation coefficient (or r, for short).
Here’s an example of how to calculate Pearson’s correlation coefficient by hand for a small dataset of femur and humerus bone lengths:
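That worked example isn’t reproduced here, so the five femur/humerus pairs below are borrowed from a commonly used bone-length example and may not match the original figure’s values; the steps are what matter. This sketch computes r both ways described above and checks the answer against cor():
# Hypothetical femur and humerus lengths (cm) -- illustrative values only
femur   <- c(38, 56, 59, 64, 74)
humerus <- c(41, 63, 70, 72, 84)
n <- length(femur)
# Covariance: sum of cross-products of deviations, divided by n - 1
sxy <- sum((femur - mean(femur)) * (humerus - mean(humerus))) / (n - 1)
# Standardize by dividing by the product of the standard deviations
sxy / (sd(femur) * sd(humerus))
# Equivalent route: convert both variables to z-scores, then average the products
sum(scale(femur) * scale(humerus)) / (n - 1)
# Both match R's built-in function
cor(femur, humerus)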
To calculate a correlation coefficient in R, we use the cor() function. Here’s the correlation of our sample NFL data:
# Calculate correlation between malevolence and penalty yards
cor(penalty ~ malevolence, data=nflsample)
## [1] 0.8388025
Now, let’s calculate the correlation coefficient for the entire NFL dataset (with 28 teams):
# Load data
NFL <- read.csv("http://www.bradthiessen.com/html5/data/NFLuniforms.csv")
# Rename variables
names(NFL) <- c("team", "malevolence", "penalty")
# Create scatterplot
ggplot(data=NFL, aes(x = malevolence, y = penalty)) +
geom_point(alpha=.9, color="steelblue", fill="lightblue", shape=20, size=5)
# Calculate correlation
cor(penalty ~ malevolence, data=NFL)
## [1] 0.429796
Let’s focus on some characteristics (or limitations) of the Pearson product-moment correlation coefficient.
As intended, Pearson’s r is scale invariant. That means the magnitude of the correlation does not change under any linear transformation of the variables (multiplying by a negative constant flips only the sign). If we change the units of measurement of X and/or Y, the correlation remains constant.
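For example, here’s a quick sketch using the nflsample data from earlier; rescaling or shifting either variable leaves the correlation untouched:
# Linear changes of scale don't change the correlation
cor(nflsample$malevolence, nflsample$penalty)           # original units
cor(nflsample$malevolence * 10, nflsample$penalty)      # stretch X by 10
cor(nflsample$malevolence, 3 * nflsample$penalty + 7)   # stretch and shift Y
## Each call returns 0.8388025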
Pearson’s r only measures the strength of a linear relationship. It does not measure the dependency between variables that have a nonlinear relationship.
The following figure displays values of Pearson’s r for various scatterplots:
If we want a measure that also works for nonlinear relationships, we’ll need to use something like the distance correlation discussed later in this lesson.
Pearson’s r is influenced heavily by outliers. Take another look at the scatterplot for our NFL data. In this scatterplot, I’ve highlighted two observations that could be considered outliers:
# Add a variable to highlight the two outliers
NFL$outlier <- c(1, rep(0,26), 1)
# Create scatterplot with the outliers highlighted
ggplot(data=NFL, aes(x = malevolence, y = penalty, color=outlier)) +
geom_point(alpha=.9, shape=20, size=5) + theme(legend.position="none")
Let’s go ahead and calculate the correlation with those outliers removed:
# Remove the two outliers
NFLnoOutliers <- NFL[NFL$outlier==0, ]
# Notice the outliers are gone
ggplot(data=NFLnoOutliers, aes(x = malevolence, y = penalty, color=outlier)) +
geom_point(alpha=.9, shape=20, size=5) + theme(legend.position="none")
# Calculate correlation - look how low the correlation gets!
cor(penalty ~ malevolence, data=NFLnoOutliers)
## [1] 0.08164226
Pearson’s r is also influenced by range restriction. To see this effect, let’s look at a large dataset with a high correlation:
# Generate dataset with high correlation
x <- as.numeric(c(1:1000))
y <- 2*x + 3 + rnorm(1000,0,300)
lineardata <- data.frame(x,y)
# Sketch scatterplot
ggplot(data=lineardata, aes(x = x, y = y)) +
geom_point(alpha=.3, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
# Calculate correlation
cor(lineardata$x, lineardata$y)
## [1] 0.8828121
# Restrict the range of our data
restricted <- lineardata[lineardata$x > 750, ]
# Calculate correlation
cor(restricted$x, restricted$y)
## [1] 0.4909807
As you’ve probably heard, correlation is not causation. Not only does correlation not guarantee a causal relationship, but a lack of correlation does not even mean there is no relationship between two variables!
One reason for this is that correlations are best suited to continuous, normally distributed data. As we’ve already seen, outliers have a heavy influence on the magnitude (and direction) of Pearson’s r.
Another reason we’ve already encountered is that Pearson’s r is a measure of linear dependency. It will misrepresent any relationship that isn’t linear.
Yet another reason is called the third-variable problem.
A 2004 article questioned research into the relationship between hormone replacement therapy (HRT) and coronary heart disease (CHD). Numerous studies had shown negative correlations between HRT and CHD, leading doctors to propose that HRT was protective against CHD.
Randomized controlled trials showed that, in fact, HRT caused a small but significant increase in the risk of CHD.
Re-analysis of the data from the original studies showed that women undertaking HRT were more likely to be from higher socioeconomic groups. These women, therefore, were more likely to have better-than-average diets and exercise regimens.
The third variable linking hormone replacement therapy to coronary heart disease was socioeconomic status.
If you’d like to explore other cases in which correlation is mistaken for causation, check out the following websites:
One more thing to keep in mind: even when two variables are completely unrelated, the correlation in a sample of data will rarely be exactly zero. Let’s try it by generating two independent variables and calculating their correlation:
# Set random number seed
set.seed(3141)
# Generate X and Y from standard normal distributions
x <- rnorm(100, 0, 1)
y <- rnorm(100, 0, 1)
xy <- data.frame(x, y)
# Scatterplot
ggplot(data=xy, aes(x = x, y = y)) +
geom_point(alpha=.9, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
# Calculate correlation coefficient
cor(x, y)
## [1] 0.200954
When we have a sample of data, our correlation will be non-zero even if the variables have no relationship. The correlation is due to random chance.
What about the correlation we found for the NFL data? Is that correlation (r = 0.43) large enough for us to suspect that it represents a real correlation for the population?
To test if our correlation differs significantly from zero, we can use randomization- or theory-based methods.
We’re interested in determining the likelihood of observing a correlation of r = 0.43 or greater if, in fact, there is no relationship between malevolence and penalty yards.
To test this, we can randomize our data. Let’s take a look at a subset of our data:
Let’s assume penalty yards have no association with malevolence. If that’s the case, then there’s no real reason for the Raiders, with the highest malevolence rating, to have a standardized penalty value of 1.19. With no relationship between these variables, the Raiders were just as likely to have a value of 0.48, 0.27, or -1.60.
With this logic, we can randomize the values of one of our variables while holding the other variable constant. Each time we randomize the data, we can then calculate the correlation coefficient.
Let’s take a look at some correlations of several randomizations of our data:
To run this randomization-based test of our correlation, we simply need to replicate this process many times.
# Store our test statistic
test.corr <- cor(penalty ~ malevolence, data=NFL)
# Randomize our data and calculate correlations
randomized_correlations <- do(10000) * cor(shuffle(penalty) ~ malevolence, data=NFL)
# Plot distribution of randomized correlations
histogram(~cor, data=randomized_correlations,
xlab="Possible correlations assuming null hypothesis is true",
groups=cor >= test.corr, # Highlight values > test statistic
main=paste("p-value = ", prop(~cor >= test.corr, data=randomized_correlations)))
ladd(panel.abline(v=test.corr)) # Add vertical line at test statistic
We can also use bootstrap methods to construct a confidence interval for our correlation.
# Generate 10000 bootstrap samples and calculate the test statistic for each
boot.correlations <- do(10000) * cor(penalty ~ malevolence, data=resample(NFL))
# Create a plot of the bootstrap estimates of our test statistic
densityplot(~cor, data=boot.correlations, plot.points = FALSE, col="steelblue", lwd=4)
# Calculate confidence interval
cdata(0.95, cor, data = boot.correlations)
## low hi central.p
## -0.05912212 0.72782685 0.95000000
If both variables come from populations with normal distributions, we can use theory-based methods to test our correlation.
In the next lesson, we’ll derive the following test statistic for the correlation coefficient:
\(t_{n-2}=\frac{r_{xy}-0}{\textrm{SE}_{r_{xy}}}=\frac{r_{xy}}{\sqrt{\frac{1-r_{xy}^{2}}{n-2}}}=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}}\)
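As a quick sketch, we can plug our NFL correlation (r = 0.43, n = 28) into this formula directly; the values should match the cor.test() output below:
# Test statistic and one-sided p-value computed from the formula above
r <- cor(penalty ~ malevolence, data=NFL)
n <- 28
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t.stat                      # approximately 2.43
1 - pt(t.stat, df = n - 2)  # approximately 0.011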
We can run this test in R using the cor.test() function:
# t-test for correlation coefficient
cor.test(NFL$malevolence, NFL$penalty, alternative = c("greater"), method=c("pearson"))
##
## Pearson's product-moment correlation
##
## data: NFL$malevolence and NFL$penalty
## t = 2.4272, df = 26, p-value = 0.01122
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
## 0.1299369 1.0000000
## sample estimates:
## cor
## 0.429796
So far, we’ve only investigated Pearson’s product-moment correlation coefficient. There are many other types of correlations, though.
One correlation coefficient directly related to Pearson’s r is Spearman’s rho. It’s calculated using the exact same formula as Pearson’s coefficient. The only difference is we first convert our data to ranks.
# First, let's enter the raw data
gmat <- c(710, 610, 640, 580, 545, 560, 610, 530)
gpa <- c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5)
# Calculate Pearson's r
cor(gmat, gpa)
## [1] 0.6953241
# Calculate Spearman's rho
cor(gmat, gpa, method="spearman")
## [1] 0.6788003
# This time, enter ranks (using mid-ranks for tied values)
gmatranks <- c(8, 5.5, 7, 4, 2, 3, 5.5, 1)
gparanks <- c(7.5, 7.5, 6, 5, 4, 3, 1.5, 1.5)
# Pearson's r on the ranks matches Spearman's rho
cor(gmatranks, gparanks, method="pearson")
## [1] 0.6788003
Another fairly popular nonparametric correlation coefficient is Kendall’s tau. For data with no tied observations, it’s calculated as:
\(\tau =\frac{\textrm{(number of concordant pairs)}-\textrm{(number of discordant pairs)}}{\textrm{(number of concordant pairs)}+\textrm{(number of discordant pairs)}}\)
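Here’s a minimal sketch of that counting process for a tiny, tie-free dataset (the values are made up purely for illustration), checked against R’s built-in calculation:
# Count concordant and discordant pairs for a tiny tie-free dataset
xx <- c(1, 2, 3, 4)
yy <- c(1, 3, 2, 4)
pairs <- combn(length(xx), 2)    # all pairs of observations
signs <- sign(xx[pairs[1,]] - xx[pairs[2,]]) * sign(yy[pairs[1,]] - yy[pairs[2,]])
concordant <- sum(signs > 0)     # 5 concordant pairs
discordant <- sum(signs < 0)     # 1 discordant pair
(concordant - discordant) / (concordant + discordant)
## [1] 0.6666667
# Matches the built-in calculation
cor(xx, yy, method="kendall")
## [1] 0.6666667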
We’ve calculated correlation coefficients for continuous variables. We can calculate correlations for categorical variables, too.
We won’t have time to investigate these in class, but here are links if you want to learn more:
• Phi coefficient for 2x2 contingency tables
• Cramer’s V for two nominal variables
• Biserial correlation for one continuous and one artificially dichotomous (ordinal) variable
• Point-biserial correlation for one continuous and one (artificial or natural) dichotomous variable
• Polychoric correlation for two ordinal variables with underlying continuous distributions
One newer measure for the dependence between two variables is the distance correlation. It gives a measure of the relationship between two variables, even when the relationship is nonlinear.
Earlier in this lesson, you saw a figure with values of Pearson’s r for various scatterplots. Compare that to the following figure showing values of the distance correlation:
Recall we derived Pearson’s r as a standardized measure of covariance. The numerator of the covariance is the sum of the products of two distances (distance from the mean of X and distance from the mean of Y) over all points: \(\frac{\sum (x_{i}-\bar{X})(y_{i}-\bar{Y})}{n-1}\). The covariance is maximized when all data points are arranged along a straight line.
The numerator of the distance covariance is similar to that of the covariance. The only difference is that the distances are between pairs of data points, not between a data point and the mean. The distance covariance is defined by the sum of the products of the two distances over all pairs of points. The distance covariance is maximized when the data are arranged along a straight line locally (that is, when the data overall form a chain of any shape).
Let’s see how to calculate the distance correlation for an extremely small dataset. Our dataset will be:
X = 1, 2, 3
Y = 3, 1, 5
Before we begin, let’s calculate Pearson’s r and Spearman’s rho for this data:
# Input data
x <- c(1, 2, 3)
y <- c(3, 1, 5)
# Pearson's r
cor(x, y, method="pearson")
## [1] 0.5
# Spearman's rho
cor(x, y, method="spearman")
## [1] 0.5
To calculate a distance covariance, we must do the following:
• First, we calculate all euclidean distances between pairs of observations for X and then for Y.
\(a_{x}=\begin{bmatrix}0 &1 &2 \\ 1&0 &1 \\ 2 &1 &0 \end{bmatrix} \textrm{ and } b_{y}=\begin{bmatrix}0 &2 &2 \\ 2&0 &4 \\ 2 &4 &0 \end{bmatrix}\)
The off-diagonal values are the distances between observation pairs (1, 2), (1, 3), and (2, 3); the diagonal values are zero.
• Next, double-center each matrix: take each value and (1) subtract its row mean, (2) subtract its column mean, and (3) add the grand mean.
This gives us:
\(A_{x}=\begin{bmatrix} -1.11 &0.22 &0.89 \\ 0.22&-0.44 &0.22 \\ 0.89 &0.22 &-1.11\end{bmatrix} \textrm{ and } B_{y}=\begin{bmatrix} -0.89 &0.44 &0.44 \\ 0.44&-2.22 &1.78 \\ 0.44 &1.78 &-2.22\end{bmatrix}\)
• Now, multiply each value in the A matrix by its corresponding value in the B matrix
\(AxB_{xy}=\begin{bmatrix}0.988 &0.099 &0.395 \\ 0.099&0.988 &0.395 \\ 0.395 &0.395 &2.469\end{bmatrix}\)
• Calculate the average of all nine values in this AxB matrix
\(cov_{xy}^{2}=0.69136\)
• Finally, take the square root of that value
\(cov_{xy}=\sqrt{0.69136}=0.8315\)
To convert that distance covariance to a distance correlation, we standardize by dividing by the square root of the product of the distance variances of X and Y: \(\textrm{dCor}_{xy}=\frac{\textrm{dCov}_{xy}}{\sqrt{\textrm{dVar}_{x}\,\textrm{dVar}_{y}}}\)
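Here’s a from-scratch sketch of all these steps in base R (using the same x and y as above), so you can see where each number comes from:
# Pairwise euclidean distance matrices for x = 1, 2, 3 and y = 3, 1, 5
x <- c(1, 2, 3)
y <- c(3, 1, 5)
a <- as.matrix(dist(x))
b <- as.matrix(dist(y))
# Double-center: subtract row means and column means, add back the grand mean
A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)
# Distance covariance: square root of the average element-wise product
dCov <- sqrt(mean(A * B))           # 0.8315
# Distance variances: same idea with each matrix against itself
dVarX <- sqrt(mean(A * A))          # 0.7027
dVarY <- sqrt(mean(B * B))          # 1.4055
# Distance correlation
dCov / sqrt(dVarX * dVarY)          # 0.83666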
Thankfully, we can let R do all these calculations for us:
# We need to load the energy package
library(energy)
# Calculate all the components of a distance correlation
# Verify the distance covariance is 0.8315
DCOR(x, y)
## $dCov
## [1] 0.8314794
##
## $dCor
## [1] 0.83666
##
## $dVarX
## [1] 0.7027284
##
## $dVarY
## [1] 1.405457
# Here's a direct way to calculate the distance correlation
dcor(x, y)
## [1] 0.83666
Let’s look at some correlation coefficients for various simulated datasets.
First, let’s simulate a dataset with a strong positive correlation:
Notice that each correlation coefficient yields a similar value.
Now let’s try a dataset with a weaker (and negative) relationship:
Again, each correlation coefficient yields a similar value. That implies the distance correlation also works well for linear relationships.
Now let’s try a dataset with a nonlinear relationship:
Here you can see the advantage of the distance correlation. Spearman’s rho and Pearson’s r found no (linear) relationship in the data, so their values are close to zero. The distance correlation indicates there is a dependency between X and Y.
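If you’d like to reproduce this kind of comparison yourself, here’s a minimal sketch using simulated quadratic data (the exact values depend on the random seed, but the pattern holds):
# Simulate a clearly nonlinear (quadratic) relationship
set.seed(314)
x <- seq(-3, 3, length.out = 500)
y <- x^2 + rnorm(500, 0, 0.5)
# Pearson and Spearman see almost no (linear/monotonic) relationship...
cor(x, y)
cor(x, y, method="spearman")
# ...but the distance correlation (energy package, loaded earlier) detects the dependency
dcor(x, y)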
Let’s try a wavy (sinusoidal) relationship:
Finally, let’s try data that form a circular relationship:
The midwest dataset contains demographic information about 437 counties in the Midwest. The list of variables is displayed below. We’re interested in the relationship between percollege (the percent of people in the county with a college education) and percbelowpoverty (the percent of people below the poverty line).
# Load data
data(midwest)
# Display variable names and types
str(midwest)
## 'data.frame': 437 obs. of 28 variables:
## $ PID : int 561 562 563 564 565 566 567 568 569 570 ...
## $ county : chr "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
## $ state : chr "IL" "IL" "IL" "IL" ...
## $ area : num 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
## $ poptotal : int 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
## $ popdensity : num 1271 759 681 1812 324 ...
## $ popwhite : int 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
## $ popblack : int 1702 3496 429 127 547 50 1 111 16 16559 ...
## $ popamerindian : int 98 19 35 46 14 65 8 30 8 331 ...
## $ popasian : int 249 48 16 150 5 195 15 61 23 8033 ...
## $ popother : int 124 9 34 1139 6 221 0 84 6 1596 ...
## $ percwhite : num 96.7 66.4 96.6 95.3 90.2 ...
## $ percblack : num 2.575 32.9 2.862 0.412 9.373 ...
## $ percamerindan : num 0.148 0.179 0.233 0.149 0.24 ...
## $ percasian : num 0.3768 0.4517 0.1067 0.4869 0.0857 ...
## $ percother : num 0.1876 0.0847 0.2268 3.6973 0.1028 ...
## $ popadults : int 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
## $ perchsd : num 75.1 59.7 69.3 75.5 68.9 ...
## $ percollege : num 19.6 11.2 17 17.3 14.5 ...
## $ percprof : num 4.36 2.87 4.49 4.2 3.37 ...
## $ poppovertyknown : int 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
## $ percpovertyknown : num 96.3 99.1 95 98.5 82.5 ...
## $ percbelowpoverty : num 13.15 32.24 12.07 7.21 13.52 ...
## $ percchildbelowpovert: num 18 45.8 14 11.2 13 ...
## $ percadultpoverty : num 11.01 27.39 10.85 5.54 11.14 ...
## $ percelderlypoverty : num 12.44 25.23 12.7 6.22 19.2 ...
## $ inmetro : int 0 0 0 1 0 0 0 0 0 1 ...
## $ category : Factor w/ 16 levels "AAR","AAU","AHR",..: 1 15 1 6 1 1 13 1 1 8 ...
# Display scatterplot
ggplot(data=midwest, aes(x = percollege, y = percbelowpoverty)) +
geom_point(alpha=.5, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
When you’re ready to try the Your Turn exercises:
Sources:
Frank, M.G. & Gilovich, T. (1988). The Dark Side of Self- and Social Perception: Black Uniforms and Aggression in Professional Sports. Journal of Personality and Social Psychology, 54(1), 74-85.
Lawlor, D.A., Davey Smith, G., & Ebrahim, S. (2004). Commentary: The hormone replacement-coronary heart disease conundrum: Is this the death of observational epidemiology? International Journal of Epidemiology, 33(3), 464-467.
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.