# Install and load all necessary packages
list.of.packages <- c("tidyverse", "mosaic", "energy", "ggExtra")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(tidyverse)
library(mosaic)
library(energy)
library(ggExtra)
Do NFL teams with more malevolent-looking uniforms play more aggressively than other teams?
In 1988, 25 volunteers were paid $2 each to rate the perceived malevolence of each NFL team’s uniform and logo. The total yards each team was penalized that year were then recorded as z-scores (to represent standardized penalty yards for each team).
Here’s a sample of the data:
Recall the concept of variance: \(s_{x}^{2}=\frac{\sum \left ( x_{i}-\bar{X} \right )^{2}}{n-1}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( x_{i}-\bar{X} \right )}{n-1}=s_{xx}\)
We can visualize the covariance of our data in a couple of ways:
nfl_sample <- tibble(
malevolence = c(5.1, 4.68, 4, 3.9, 3.38, 2.8),
penalty = c(1.19, 0.29, -0.73, -0.81, 0.04, -1.6))
ggplot(nfl_sample, aes(x = malevolence, y=penalty)) +
geom_point(color="steelblue", size=3)
To calculate the covariance, we can modify the formula for variance:
\(\textrm{cov}_{xy}=s_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{n-1}=\frac{(5.1-3.977)(1.19-(-0.27))+...+(2.8-3.977)(-1.60-(-0.27))}{6-1}=0.6889\).
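To see where that 0.6889 comes from, here's the same calculation done directly from the formula, using the nfl_sample tibble defined above:
# Covariance "by hand": sum of products of deviations, divided by n - 1
with(nfl_sample,
     sum((malevolence - mean(malevolence)) * (penalty - mean(penalty))) / (length(penalty) - 1))
# [1] 0.68892  (matches the cov() output below)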
# This calculates the variance-covariance matrix
cov(nfl_sample)
# malevolence penalty
# malevolence 0.7007067 0.68892
# penalty 0.6889200 0.96268
# Multiply both variables by 10 (so the covariance is multiplied by 10 * 10 = 100)
ten_times <- nfl_sample * 10
cov(ten_times)
# malevolence penalty
# malevolence 70.07067 68.892
# penalty 68.89200 96.268
That’s one limitation of using covariance as a measure of the strength of the relationship between two variables: it depends on the scale of measurement used.
For example, the penalty variable in our data is reported as z-scores. If we were to use (unstandardized) penalty yards, the covariance would increase to 4.822; if we used penalty feet instead of yards, it would increase again to 14.467. Larger covariances, then, do not necessarily indicate stronger relationships, so we shouldn’t directly compare covariances measured on different scales.
To address this scale dependence, we need a standard unit of measurement into which any variable, whatever its original scale (penalty distance, uniform malevolence, etc.), can be converted.
If we use the standard deviation as that standard unit of measurement, we can standardize the covariance:
\(r_{xy}=\frac{\textrm{cov}_{xy}}{s_{x}s_{y}}\)
or
\(r_{xy}=\frac{\sum \left ( x_{i}-\bar{X} \right )\left ( y_{i}-\bar{Y} \right )}{(n-1)s_{x}s_{y}}\)
This is called Pearson's product-moment correlation coefficient (or r).
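As a quick check with the sample data above, dividing the covariance by the product of the two standard deviations gives the correlation:
# Standardize the covariance by the two standard deviations
with(nfl_sample, cov(malevolence, penalty) / (sd(malevolence) * sd(penalty)))
# [1] 0.8388025  (the same value cor() returns below)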
# Using cor()
cor(nfl_sample$malevolence, nfl_sample$penalty)
# [1] 0.8388025
# Using cor() within dplyr
nfl_sample %>%
summarize(correlation = cor(malevolence, penalty))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.8388025
Let’s now calculate the correlation for our full dataset:
# Load data
nfl <- read.csv("http://www.bradthiessen.com/html5/data/NFLuniforms.csv")
# Rename variables
names(nfl) <- c("team", "malevolence", "penalty")
# Plot and annotate with correlation
ggplot(nfl, aes(x = malevolence, y=penalty)) +
geom_point(color="steelblue", size=3) +
annotate("text", x=4.5, y=-1.2, label=paste("r = ", round(cor(nfl$malevolence, nfl$penalty),4)))
The magnitude of the correlation does not change under any linear transformation of the variables. Here’s an example:
# Let's convert malevolence to centi-malevolence
nfl$centi_malevolence <- nfl$malevolence / 100
# The correlation does not change
cor(nfl$centi_malevolence, nfl$penalty)
# [1] 0.429796
Pearson’s r only measures the strength of a linear relationship. If we want to measure the strength of nonlinear relationships, we’ll need another statistic. Here are some values of r for various scatterplots:
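As a quick illustration (a small made-up example), r can be essentially zero even when y is completely determined by x, as long as the relationship isn't linear:
# A perfect quadratic relationship still has r = 0
quad <- tibble(x = -10:10, y = x^2)
cor(quad$x, quad$y)
# [1] 0  (zero, up to floating-point rounding, despite the perfect relationship)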
Pearson’s r is also sensitive to outliers. Let’s flag two extreme observations in the NFL data and see how the correlation changes when they’re excluded:
# Add a variable marking the two outliers (the first and last rows)
nfl$outlier <- c(1, rep(0,26), 1)
# Create scatterplot with the outliers highlighted
ggplot(data=nfl, aes(x = malevolence, y = penalty, color=outlier)) +
geom_point(size=3) + theme(legend.position="none")
# Calculate correlation without highlighted observations
nfl %>%
filter(outlier==0) %>%
summarize(correlation = cor(malevolence, penalty))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.08164226
Let’s simulate a large dataset with a strong, positive correlation:
# Simulate data
sim_data <- tibble(
x = c(1:1000),
y = 2*x + rnorm(1000, 3, 300))
# Plot
ggplot(data = sim_data, aes(x, y)) +
geom_point(color="steelblue", size=2, alpha=0.5) +
annotate("text", x = 125, y = 2222, label=paste("r = ",round(cor(sim_data$x, sim_data$y),3))) +
geom_vline(xintercept=900, color="red")
# Range restriction: calculate the correlation using only the observations with x > 900
sim_data %>%
filter(x>900) %>%
summarize(correlation = cor(x, y))
# # A tibble: 1 x 1
# correlation
# <dbl>
# 1 0.4114621
Everyone has heard that correlation is not causation. A correlation does not imply a causal relationship, and a lack of correlation does not mean there is no relationship between two variables.
Several factors contribute to this. We’ve already seen that outliers and range restriction can influence the magnitude (and even the sign) of Pearson’s r. Likewise, if two variables have a nonlinear relationship, Pearson’s r will misrepresent that relationship.
Yet another reason is the third-variable problem.
A 2004 article questioned research into the relationship between hormone replacement therapy (HRT) and coronary heart disease (CHD). Numerous studies had shown negative correlations between HRT and CHD, leading doctors to propose that HRT was protective against CHD.
Randomized controlled trials showed, in fact, that HRT caused a small but significant increase in the risk of CHD.
Re-analysis of the data from the original studies showed that women undergoing HRT were more likely to be from higher socioeconomic groups. These women, therefore, were also more likely to have better-than-average diets and exercise regimens.
The third variable linking hormone replacement therapy to coronary heart disease was socioeconomic status.
We can simulate a similar third-variable effect. In the data below, x and y are uncorrelated within each subgroup, but the subgroup means rise together, producing a strong overall correlation:
test <- tibble(
x = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
y = c(rnorm(100, 6, 2), rnorm(100, 10, 2), rnorm(100, 14, 2)),
group = as.factor(c(rep(1,100), rep(2,100), rep(3,100)))
)
# Calculate overall correlation
overall <- test %>%
summarize(r = cor(x, y))
# Calculate correlation for each subgroup
subs <- test %>%
group_by(group) %>%
summarize(r = cor(x, y))
# Plot and annotate with correlations
ggplot(data = test, aes(x, y, color = group)) +
geom_point(size = 2, alpha=0.7) +
geom_smooth(method="lm", se=FALSE) +
theme(legend.position="none") +
annotate("text", x = 5, y = 18, label=paste("overall r =", round(overall, 2)), color="black") +
annotate("text", x = 2, y = 4, label = paste("r =", round(subs[1,2], 2)), color="red") +
annotate("text", x = 16, y = 9, label = paste("r =", round(subs[2,2], 2)), color="forestgreen") +
annotate("text", x = 18.8, y = 16, label = paste("r = ", round(subs[3,2], 2)), color = "blue")
If you’d like to explore other examples in which correlation is mistaken for causation, there are plenty of collections of spurious correlations online.
Correlation can also appear purely by chance. Let’s generate two variables completely at random and look at their sample correlation:
set.seed(3141)
# Generate random values for X and Y
random_xy <- tibble(
x = rnorm(100, 0, 1),
y = rnorm(100, 0, 1)
)
cor(random_xy$x, random_xy$y)
# [1] 0.200954
ggplot(data = random_xy, aes(x, y)) +
geom_point(size=3, color="steelblue") +
annotate("text", x = 0, y = 2.8, label=paste("overall r =", round(cor(random_xy$x, random_xy$y), 3)), color="black")
The correlation between any two variables in any sample of data will almost never be exactly zero, so how can we have any confidence that an observed correlation is “real”?
Let’s take a look at some randomizations of our data:
# Store our sample correlation as test_stat
test_stat <- cor(nfl$malevolence, nfl$penalty)
# Randomize our data
p1 <- nfl %>%
ggplot(., aes(x = malevolence, y = penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "actual data", color="red", size=3)
p2 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #1", color="red", size=3)
p3 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #2", color="red", size=3)
p4 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #3", color="red", size=3)
p5 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #4", color="red", size=3)
p6 <- nfl %>%
mutate(shuffled_penalty = sample(penalty)) %>%
ggplot(., aes(x = malevolence, y = shuffled_penalty)) +
geom_point(color="steelblue", size=2) +
geom_smooth(method="lm", se=FALSE, color="red", size=.2) +
annotate("text", x = 4, y = 1, label = "randomization #5", color="red", size=3)
# This uses a "multiplot" function that was loaded silently
multiplot(p1, p4, p2, p5, p3, p6, cols=3)
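If the multiplot() helper isn't available in your session, the gridExtra package (not loaded above, so this is just an alternative sketch) produces the same layout:
# Equivalent layout using gridExtra
# install.packages("gridExtra")   # if needed
gridExtra::grid.arrange(p1, p4, p2, p5, p3, p6, ncol = 3)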
Each randomization of data yields a non-zero correlation. Let’s run 10,000 randomizations and plot all 10,000 correlations.
# Using mosaic's Do() and shuffle()
shuffled_r <- Do(10000) * cor(malevolence ~ shuffle(penalty), data=nfl)
# Using dplyr
# shuffled_r <- Do(10000) * nfl %>%
#   mutate(shuffled = sample(penalty)) %>%
#   summarize(cor = cor(shuffled, malevolence))
# Calculate likelihood of test statistic
pvalue <- prop(shuffled_r$cor >= test_stat)
pvalue
# TRUE
# 0.0097
# Plot simple histogram
# histogram(~cor, data=shuffled_r,
# width=.04,
# xlab="Possible correlations assuming null hypothesis is true")
# ladd(panel.abline(v=test_stat)) # Add vertical line at test statistic
# Plot
ggplot(data = shuffled_r, aes(x = cor)) +
geom_histogram(binwidth = 0.04, fill="lightblue", color="white", alpha=0.8) +
annotate("segment", x = test_stat, xend = test_stat, y = 0, yend = 600, color = "red") +
annotate("text", x = test_stat, y = 630, label = "observed r = 0.43", color = "red") +
annotate("text", x = test_stat+.11, y = 170, label = paste("p =",round(pvalue,4)), color = "red") +
labs(
title = "Randomized Correlations",
x = "r"
) +
scale_x_continuous(limits = c(-.7,.7), breaks=seq(-.6, .6, .3), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white", size=.25),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
)
We can also use the bootstrap to estimate a confidence interval for the correlation. Here, we resample the NFL data (with replacement) 10,000 times and calculate the correlation in each resample:
# Bootstrap the correlation by resampling rows of the data
boot <- Do(10000) * cor( malevolence ~ penalty, data=resample(nfl) )
bootstrapCI <- confint(boot, level = 0.95, method = "quantile")
lower <- as.numeric(bootstrapCI[2]) # Store lower CI bound
upper <- as.numeric(bootstrapCI[3]) # Store upper CI bound
# Density plot
ggplot(data = boot, aes(x = cor)) +
geom_density(fill="lightblue", color="white", alpha = 0.8) +
labs(
title = "Bootstrap distribution of correlations",
x = "bootstrap correlations"
) +
scale_x_continuous(breaks=seq(-0.8, 0.8, 0.2), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
) +
annotate("text", x = lower, y = .2, label = round(lower,3)) +
annotate("text", x = upper, y = .2, label = round(upper,3)) +
annotate("text", x = median(boot$cor), y = .25, label = paste("95%")) +
annotate("segment", x = lower, xend = upper, y = 0.1, yend = .1, color = "red")
If we’re willing to assume our data are sampled from a bivariate normal population, we can also test the correlation with a t-test.
In the next lesson, we’ll derive the following test statistic for the correlation coefficient:
\(t_{n-2}=\frac{r_{xy}-0}{\textrm{SE}_{r_{xy}}}=\frac{r_{xy}}{\sqrt{\frac{1-r_{xy}^{2}}{n-2}}}=\frac{r_{xy}\sqrt{n-2}}{\sqrt{1-r_{xy}^{2}}}=2.427\ \ (p=.011)\)
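Here's that calculation carried out directly in R, as a check against the cor.test() output that follows:
# t statistic for H0: rho = 0, computed from the formula above
r <- cor(nfl$malevolence, nfl$penalty)
n <- nrow(nfl)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                                      # approximately 2.427
pt(t_stat, df = n - 2, lower.tail = FALSE)  # one-sided p-value, approximately 0.011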
We can run this test in R using the cor.test() function:
# t-test for correlation coefficient
cor.test(nfl$malevolence, nfl$penalty, alternative = c("greater"), method=c("pearson"))
#
# Pearson's product-moment correlation
#
# data: nfl$malevolence and nfl$penalty
# t = 2.4272, df = 26, p-value = 0.01122
# alternative hypothesis: true correlation is greater than 0
# 95 percent confidence interval:
# 0.1299369 1.0000000
# sample estimates:
# cor
# 0.429796
Spearman’s rho uses the same formula as Pearson’s r, except the data are converted to ranks.
# Enter the raw data
mba <- tibble(
gmat = c(710, 610, 640, 580, 545, 560, 610, 530),
gpa = c(4, 4, 3.9, 3.8, 3.7, 3.6, 3.5, 3.5)
)
# Calculate Pearson's r and Spearman's rho
mba %>%
summarize(r = cor(gmat, gpa),
rho = cor(gmat, gpa, method="spearman"))
# # A tibble: 1 x 2
# r rho
# <dbl> <dbl>
# 1 0.6953241 0.6788003
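To see that rho really is just Pearson's r computed on ranks, we can rank the data ourselves (rank() assigns midranks to the tied values):
# Spearman's rho as Pearson's r on the ranked data
mba %>%
  summarize(rho_via_ranks = cor(rank(gmat), rank(gpa)))
# Should match the rho value above (0.6788)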
Kendall’s Tau is calculated as:
\(\tau =\frac{\textrm{(number of concordant pairs)}-\textrm{(number of discordant pairs)}}{\textrm{(number of concordant pairs)}+\textrm{(number of discordant pairs)}}\)
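This simple form assumes no tied observations. Kendall's tau is also available through cor(); note that our GMAT/GPA data do contain ties, which R handles with a tie-adjusted version of tau:
# Kendall's tau for the same data
mba %>%
  summarize(tau = cor(gmat, gpa, method = "kendall"))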
We’ve calculated correlation coefficients for continuous variables, but correlation coefficients can also be calculated for categorical data.
We won’t have time to investigate these in class, but they are all very easy to learn.
The distance correlation is a modern measure of the relationship between two variables, even when the relationship is nonlinear.
Earlier in this lesson, you saw a figure with values of Pearson’s r for various scatterplots. Compare that to the following figure showing values of the distance correlation:
Recall Pearson’s r is a standardized covariance. The numerator of the covariance is the sum of the products of two distances (distance from the mean of X and distance from the mean of Y) over all points: \(\frac{\sum (x_{i}-\bar{X})(y_{i}-\bar{Y})}{n-1}\). The covariance is maximized when all data points are arranged along a straight line.
The numerator of the distance covariance is similar to that of the covariance, except the distances are measured between pairs of data points rather than between each data point and the mean. The distance covariance sums the products of these two distances over all pairs of points, and it is maximized when the data are arranged along a straight line locally (that is, when the data overall form a chain of any shape).
Let’s calculate the distance correlation for an extremely small dataset. Our dataset will be:
X = 1, 2, 3
Y = 3, 1, 5
Let’s first calculate Pearson’s r and Spearman’s rho for this data:
# Input data
x <- c(1, 2, 3)
y <- c(3, 1, 5)
# Pearson's r
cor(x, y, method="pearson")
# [1] 0.5
# Spearman's rho
cor(x, y, method="spearman")
# [1] 0.5
Now, let’s calculate the distance covariance. We start by forming the matrices of pairwise distances within X and within Y:
\(a_{x}=\begin{bmatrix}0 &1 &2 \\ 1&0 &1 \\ 2 &1 &0 \end{bmatrix} \textrm{ and } b_{y}=\begin{bmatrix}0 &2 &2 \\ 2&0 &4 \\ 2 &4 &0 \end{bmatrix}\)
Each entry of these matrices is the distance between a pair of observations; for example, the (1, 2) entry of each matrix is the distance between observations 1 and 2.
Next, we double-center each matrix: from every entry, we subtract that entry’s row mean and column mean and add back the grand mean of the matrix. This gives us:
\(A_{x}=\begin{bmatrix} -1.11 &0.22 &0.89 \\ 0.22&-0.44 &0.22 \\ 0.89 &0.22 &-1.11\end{bmatrix} \textrm{ and } B_{y}=\begin{bmatrix} -0.89 &0.44 &0.44 \\ 0.44&-2.22 &1.78 \\ 0.44 &1.78 &-2.22\end{bmatrix}\)
Multiplying the two centered matrices element by element gives:
\(A_{x}\circ B_{y}=\begin{bmatrix}0.9877 &0.0987 &0.3951 \\ 0.0987&0.9877 &0.3951 \\ 0.3951 &0.3951 &2.4691\end{bmatrix}\)
The squared distance covariance is the average of these \(n^{2}=9\) entries:
\(\textrm{dCov}_{xy}^{2}=\frac{1}{n^{2}}\sum_{i,j}A_{ij}B_{ij}=0.69136\)
\(\textrm{dCov}_{xy}=\sqrt{0.69136}=0.8315\)
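Here's a minimal by-hand check of those steps in R (redefining x and y so the chunk stands on its own):
# Distance covariance "by hand"
x <- c(1, 2, 3)
y <- c(3, 1, 5)
a <- as.matrix(dist(x))    # pairwise distance matrix for x
b <- as.matrix(dist(y))    # pairwise distance matrix for y
# Double-center: subtract each entry's row mean and column mean, add back the grand mean
center <- function(m) m - rowMeans(m)[row(m)] - colMeans(m)[col(m)] + mean(m)
A <- center(a)
B <- center(b)
sqrt(mean(A * B))          # distance covariance, approximately 0.8315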
To convert that distance covariance to a distance correlation, we standardize it by dividing by the square root of the product of the distance variances of X and Y:
\(\textrm{dCor}_{xy}=\frac{\textrm{dCov}_{xy}}{\sqrt{\textrm{dVar}_{x}\,\textrm{dVar}_{y}}}=\frac{0.8315}{\sqrt{(0.7027)(1.4055)}}=0.8367\)
Thankfully, we can let R do all these calculations for us:
# We loaded the energy package earlier
# library(energy)
# Calculate all the components of a distance correlation
# Verify the distance covariance is 0.8315
DCOR(x, y)
# $dCov
# [1] 0.8314794
#
# $dCor
# [1] 0.83666
#
# $dVarX
# [1] 0.7027284
#
# $dVarY
# [1] 1.405457
# Here's a direct way to calculate the distance correlation
dcor(x, y)
# [1] 0.83666
Let’s look at some correlation coefficients for various simulated datasets.
The midwest dataset (it comes with ggplot2) contains demographic information about 437 counties in the Midwest. The list of variables is displayed below. We’ll examine the relationship between percollege (the percent of people in the county with a college education) and percbelowpoverty (the percent of people below the poverty line).
# Load data
data(midwest)
# Display variable names and types
str(midwest)
# Classes 'tbl_df', 'tbl' and 'data.frame': 437 obs. of 28 variables:
# $ PID : int 561 562 563 564 565 566 567 568 569 570 ...
# $ county : chr "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
# $ state : chr "IL" "IL" "IL" "IL" ...
# $ area : num 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
# $ poptotal : int 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
# $ popdensity : num 1271 759 681 1812 324 ...
# $ popwhite : int 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
# $ popblack : int 1702 3496 429 127 547 50 1 111 16 16559 ...
# $ popamerindian : int 98 19 35 46 14 65 8 30 8 331 ...
# $ popasian : int 249 48 16 150 5 195 15 61 23 8033 ...
# $ popother : int 124 9 34 1139 6 221 0 84 6 1596 ...
# $ percwhite : num 96.7 66.4 96.6 95.3 90.2 ...
# $ percblack : num 2.575 32.9 2.862 0.412 9.373 ...
# $ percamerindan : num 0.148 0.179 0.233 0.149 0.24 ...
# $ percasian : num 0.3768 0.4517 0.1067 0.4869 0.0857 ...
# $ percother : num 0.1876 0.0847 0.2268 3.6973 0.1028 ...
# $ popadults : int 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
# $ perchsd : num 75.1 59.7 69.3 75.5 68.9 ...
# $ percollege : num 19.6 11.2 17 17.3 14.5 ...
# $ percprof : num 4.36 2.87 4.49 4.2 3.37 ...
# $ poppovertyknown : int 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
# $ percpovertyknown : num 96.3 99.1 95 98.5 82.5 ...
# $ percbelowpoverty : num 13.15 32.24 12.07 7.21 13.52 ...
# $ percchildbelowpovert: num 18 45.8 14 11.2 13 ...
# $ percadultpoverty : num 11.01 27.39 10.85 5.54 11.14 ...
# $ percelderlypoverty : num 12.44 25.23 12.7 6.22 19.2 ...
# $ inmetro : int 0 0 0 1 0 0 0 0 0 1 ...
# $ category : chr "AAR" "LHR" "AAR" "ALU" ...
# Display scatterplot
ggplot(data=midwest, aes(x = percollege, y = percbelowpoverty)) +
geom_point(alpha=.5, color="steelblue", fill="lightblue", shape=20, size=5) +
theme_grey()
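To put a number on the pattern in that scatterplot, we can also compute the correlation between the two variables (a sketch you can adapt; try the other coefficients from this lesson as well):
# Correlation between college education and poverty rates across counties
midwest %>%
  summarize(r = cor(percollege, percbelowpoverty))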
Once you’re satisfied with your solutions, email that html file to thiessenbradleya@sau.edu or give me a printed copy.
If you run into any problems, email me or work with other students in the class.
License
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.