Does having a disability influence how others perceive your performance in a job interview? If so, is the effect positive or negative?
To study this, researchers prepared five videotaped job interviews using the same two male actors for each. A set script was designed to reflect an interview with an applicant of average qualifications. The tapes differed only in that the applicant appeared with a different handicap:
• In one, he appeared in a wheelchair
• In a second, he appeared on crutches
• In a third, his hearing was impaired
• In a fourth, he appeared to have one leg amputated
• In a fifth, he appeared to have no disability
Seventy undergraduate students from an American university were randomly assigned to view the tapes, 14 to each tape. After viewing the tape, each subject rated the qualifications of the applicant on a scale from 0 to 10.
Our primary questions are:
• Is having a disability associated with higher or lower interview scores?
• Do permanent disabilities (amputee, hearing, wheelchair) receive higher/lower scores than temporary disabilities (crutches)?
Let’s download the interview scores given by the 70 undergraduate students, visualize them, and test some assumptions for an ANOVA.
# Download data from my website and store it as "interviews"
interviews <- read.csv("http://www.bradthiessen.com/html5/data/disability.csv")
# See the first several rows of data to get a sense of the contents
head(interviews)
## disability score
## 1 none 1.9
## 2 none 2.5
## 3 none 3.0
## 4 none 3.6
## 5 none 4.1
## 6 none 4.2
# Stripplot (stripplot and bwplot come from the lattice package)
library(lattice)
stripplot(score ~ disability, data=interviews, jitter.data = TRUE,
xlab = "Disability", cex=1.3, col="steelblue", aspect="1", pch=20)
# Boxplot
bwplot(score ~ disability, data=interviews, col="steelblue", xlab="Group", ylab="scores",
par.settings=list(box.rectangle=list(lwd=3), box.umbrella=list(lwd=3)))
# Error bars showing means and standard errors
# I need to load the ggplot2 library to generate this plot
# (mean_cl_boot below also requires the Hmisc package to be installed)
library(ggplot2)
ggplot(interviews, aes(disability, score)) +
stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom = "line", aes(group = 1),
colour = "Blue", linetype = "dashed") +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
labs(x = "Disability", y = "Interview score")
# Run Shapiro-Wilk test for normality
# I run the test for each disability and then print all p-values
# (the crutches result is stored as "cr" to avoid masking R's c() function)
a <- round(shapiro.test(interviews$score[interviews$disability=="Amputee"])$p.value,3)
cr <- round(shapiro.test(interviews$score[interviews$disability=="Crutches"])$p.value,3)
h <- round(shapiro.test(interviews$score[interviews$disability=="Hearing"])$p.value,3)
n <- round(shapiro.test(interviews$score[interviews$disability=="none"])$p.value,3)
w <- round(shapiro.test(interviews$score[interviews$disability=="Wheelchair"])$p.value,3)
paste("Amputee: p =", a, ". Crutch: p =", cr, ". Hear: p =", h, ". none: p =", n, ". W-chair: p =", w)
## [1] "Amputee: p = 0.805 . Crutch: p = 0.548 . Hear: p = 0.932 . none: p = 0.971 . W-chair: p = 0.425"
# Levene's test for equal variances (leveneTest is in the car package)
library(car)
leveneTest(score ~ disability, data=interviews)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 0.1974 0.9389
## 65
Let’s run the ANOVA:
# Run the analysis of variance (aov) model
model <- aov(score ~ disability, data=interviews)
# Show the output in an ANOVA summary table
anova(model)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## disability 4 30.521 7.6304 2.8616 0.03013 *
## Residuals 65 173.321 2.6665
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even though the p-value indicates it’s unlikely we would have observed these data if the group population means were equal, the ANOVA does not tell us which group means differ from one another.
Since the questions of interest in this study are related to comparisons among specific group means, we need to derive some methods to do just that. We’ll classify these post-hoc methods as:
• Pairwise comparisons
• Flexible comparisons (contrasts)
Since we found a statistically significant result with our ANOVA, why don’t we just compare each pair of group means using t-tests? The problem is that each test carries its own chance of a Type I error, so the overall (familywise) error rate grows quickly as we run more tests: with 10 independent tests at \(\alpha = 0.05\), the chance of at least one false positive is roughly \(1 - 0.95^{10} \approx 0.40\).
This is the idea behind the Bonferroni correction. If we want to maintain an overall Type I error rate, we need to divide \(\alpha\) for each test by the number of tests we want to conduct:
Bonferroni correction: \(\alpha_{\textrm{each test}}=\frac{\alpha_{\textrm{overall}}}{\textrm{\# of tests}}\).
So, for example, if we wanted to conduct 10 t-tests and maintain an overall Type I error rate of 0.05, we would need to use \(\alpha_{\textrm{each test}}=0.005\).
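Here’s that arithmetic as a quick sketch in R (assuming, for simplicity, that the 10 tests are independent):
# With alpha = .05 per test, the chance of at least one Type I error
# across 10 independent tests is 1 - (.95)^10, or about 0.40
1 - (1 - 0.05)^10
# Bonferroni: the per-test alpha that keeps the overall rate at .05
0.05 / 10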
Instead of adjusting the Type I error rate for each test, we could correct the p-values after running the t-tests. To do this, we would calculate (capping the result at 1.0):
\(\left (\textrm{p-value from each test} \right )\left ( \textrm{\# of tests to be conducted} \right )\)
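Base R’s p.adjust() function applies exactly this correction (including the cap at 1.0). As a quick sketch, here it is applied to three of the uncorrected p-values we’ll compute below, with 10 total tests:
# Multiply each p-value by the number of tests (10), capping at 1.0
# Returns 0.035, 0.184, and 1.000
p.adjust(c(0.0035, 0.0184, 0.1028), method = "bonferroni", n = 10)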
Let’s see the Bonferroni correction in action. Remember, the p-value from the ANOVA we conducted earlier implies at least one of our group means differs from the others. (This isn’t exactly true. It actually tells us at least one linear combination of our group means will differ significantly from zero)
We’ll have R run all the pairwise t-tests to compare pairs of group means. First, we’ll have R run t-tests with no adjustment to the p-values. Then, we’ll adjust the p-values with the Bonferroni correction.
# Pairwise t-tests with no adjustment
with(interviews, pairwise.t.test(score, disability, p.adjust.method="none"))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: score and disability
##
## Amputee Crutches Hearing none
## Crutches 0.0184 - - -
## Hearing 0.5418 0.0035 - -
## none 0.4477 0.1028 0.1732 -
## Wheelchair 0.1433 0.3520 0.0401 0.4756
##
## P value adjustment method: none
# Pairwise t-tests with Bonferroni correction on the p-values
# Verify the p-values are multiplied by 10 (and capped at 1) in this example
with(interviews, pairwise.t.test(score, disability, p.adjust.method="bonferroni"))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: score and disability
##
## Amputee Crutches Hearing none
## Crutches 0.184 - - -
## Hearing 1.000 0.035 - -
## none 1.000 1.000 1.000 -
## Wheelchair 1.000 1.000 0.401 1.000
##
## P value adjustment method: bonferroni
Because Bonferroni’s correction is conservative (and can struggle to find mean differences that actually exist), several other post-hoc strategies have been developed.
One alternative strategy that’s relatively easy to explain is Holm’s method. You first conduct all pairwise t-tests to obtain a p-value from each test. Then, you rank order those p-values from largest to smallest, so the largest p-value gets rank 1 and (with our 10 comparisons) the smallest gets rank 10. For our data, the largest uncorrected p-value is 0.5418 (Hearing vs. Amputee) and the smallest is 0.0035 (Hearing vs. Crutches).
We then correct each of those p-values by multiplying the p-value by its rank: the smallest becomes \(0.0035 \times 10 = 0.035\) and the next-smallest becomes \(0.0184 \times 9 = 0.166\) (up to rounding, the 0.165 you’ll see in the output below), while the largest is multiplied by only 1.
Notice that for the smallest p-value, this method is exactly the same as the Bonferroni correction. In subsequent comparisons, however, the p-value is corrected only for the remaining number of comparisons. This makes it more likely to find a difference between a pair of group means.
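Here’s a sketch of Holm’s calculation using our 10 uncorrected p-values. One detail the description above glosses over: Holm’s method also forces the corrected p-values to be non-decreasing (the cummax() step below). The results match the Holm table that follows, up to rounding.
# Our 10 uncorrected pairwise p-values (from the earlier table)
p <- c(0.0035, 0.0184, 0.0401, 0.1028, 0.1433,
       0.1732, 0.3520, 0.4477, 0.4756, 0.5418)
# Sort from smallest to largest, multiply by ranks 10, 9, ..., 1,
# force the sequence to be non-decreasing, and cap at 1.0
pmin(cummax(sort(p) * (10:1)), 1)
# The built-in adjustment returns the same values (in the original order)
p.adjust(p, method = "holm")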
Let’s have R calculate the pairwise tests using Holm’s method:
# Holm's method
with(interviews, pairwise.t.test(score, disability, p.adjust.method="holm"))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: score and disability
##
## Amputee Crutches Hearing none
## Crutches 0.165 - - -
## Hearing 1.000 0.035 - -
## none 1.000 0.719 0.866 -
## Wheelchair 0.860 1.000 0.321 1.000
##
## P value adjustment method: holm
There are many more post-hoc methods (e.g., the Benjamini-Hochberg false discovery rate procedure, Tukey’s Honestly Significant Difference, Dunnett’s method) that we simply don’t have time to investigate. It is important, though, that you get a sense for which methods perform best.
As you learned in a previous statistics class, the Type I error rate is linked to the statistical power of a test. If we decrease the probability of making a Type I error (as these corrections do), the power declines as well. Because of this, it’s important that post-hoc methods control the overall Type I error rate without sacrificing too much power.
Bonferroni’s correction and Tukey’s HSD control the Type I error rate well, but they lack statistical power. Bonferroni’s correction has more power when testing a small number of comparisons, while Tukey’s HSD performs better with a larger number of comparisons. Holm’s method is more powerful than both Bonferroni and Tukey (and Hochberg’s method is more powerful still).
Each of these methods, however, performs poorly if the population group variances differ and the sample sizes are unequal. If you have unequal sample sizes and unequal variances, you should probably use a robust post-hoc comparison method.
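We won’t derive a robust method here, but one quick option in base R is to run the pairwise t-tests without pooling the standard deviations (Welch-type tests), still combined with a p-value correction. A sketch:
# Pairwise Welch t-tests (no pooled SD) with Holm's correction
with(interviews, pairwise.t.test(score, disability,
                                 pool.sd = FALSE, p.adjust.method = "holm"))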
In case you’re interested, here’s the code to conduct a Tukey’s HSD:
# Conduct the post-hoc test
TukeyHSD(model, conf.level=.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = score ~ disability, data = interviews)
##
## $disability
## diff lwr upr p adj
## Crutches-Amputee 1.4928571 -0.2388756 3.2245899 0.1232819
## Hearing-Amputee -0.3785714 -2.1103042 1.3531613 0.9724743
## none-Amputee 0.4714286 -1.2603042 2.2031613 0.9399911
## Wheelchair-Amputee 0.9142857 -0.8174470 2.6460185 0.5781165
## Hearing-Crutches -1.8714286 -3.6031613 -0.1396958 0.0277842
## none-Crutches -1.0214286 -2.7531613 0.7103042 0.4686233
## Wheelchair-Crutches -0.5785714 -2.3103042 1.1531613 0.8812293
## none-Hearing 0.8500000 -0.8817328 2.5817328 0.6442517
## Wheelchair-Hearing 1.2928571 -0.4388756 3.0245899 0.2348141
## Wheelchair-none 0.4428571 -1.2888756 2.1745899 0.9517374
# See a plot of the confidence intervals for each pairwise mean difference
plot(TukeyHSD(model))
The previous methods allow us to compare pairs of means. Sometimes, as in this scenario, we’ll want more flexibility. For example, think about the key questions in this study: Is having a disability associated with higher or lower scores? Do permanent disabilities receive higher or lower scores than temporary ones?
To address these questions, we’ll need to compare linear combinations of group means:
• the mean of the no-disability group vs. the average of the four disability-group means
• the average of the permanent-disability means (amputee, hearing, wheelchair) vs. the average of the other two group means (crutches and no disability)
Let’s focus on that first contrast (no disability vs. the 4 disability groups). To compare this linear combination of means, we can use the Scheffe method.
With the Scheffe method, we first need to calculate our contrast of interest by choosing weights in our linear combination:
Contrast (linear combination of means): \(\psi =\sum_{a=1}^{5}c_{a}\mu _{a}=c_{1}\mu _{1}+c_{2}\mu _{2}+c_{3}\mu _{3}+c_{4}\mu _{4}+c_{5}\mu _{5}, \ \ \ \textrm{where} \ \sum c_{a}=0\)
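For our first question, a natural choice gives the no-disability group a weight of +1 and each of the four disability groups a weight of -1/4 (the same weights we’ll use in the code below). A quick sketch of the sum-to-zero check:
# Groups in alphabetical order: Amputee, Crutches, Hearing, none, Wheelchair
coeff <- c(-1/4, -1/4, -1/4, 1, -1/4)
# Contrast weights must sum to zero
sum(coeff)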
When we calculate a contrast from our sample means, it won’t be exactly zero. That alone doesn’t tell us much: even if the population values of our group means were identical, we’d expect our sample group means to differ and our sample contrast to be nonzero.
So our task is to determine the likelihood of observing a contrast as or more extreme than the one we calculated (assuming the null hypothesis is true). To do this, we need to derive the sampling distribution of our contrast.
Recall the general form of a t-test. We may be able to use the t-distribution to determine the likelihood of our contrast:
\(t=\frac{\textrm{observed}-\textrm{hypothesized}}{\textrm{standard error}}=\frac{\hat{\psi }-0}{\textrm{SE}_{\psi }}\)
If this works, we only need to derive the standard error of a contrast. Then, we’d know the distribution of our sample contrast and we could estimate its likelihood.
To derive the standard error of a contrast, we need to remember two properties of variances:
\(\textrm{var}\left ( ax \right )=a^{2}\textrm{var}\left ( x \right )\).
If X and Y are independent, then: \(\textrm{var}\left ( X+Y \right )=\textrm{var}\left ( X \right )+\textrm{var}\left ( Y \right )\)
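If you want to convince yourself of these two facts, here’s a quick simulation sketch (the seed and sample size are arbitrary):
# Simulate two independent variables
set.seed(42)
x <- rnorm(100000)
y <- rnorm(100000)
# var(aX) = a^2 var(X): these two lines print identical values
var(3 * x)
9 * var(x)
# For independent X and Y, var(X + Y) = var(X) + var(Y)
# (approximately, in a finite sample)
var(x + y)
var(x) + var(y)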
With these two facts, we can calculate the variance of our contrast:
\(\textrm{var}\left ( \hat{\psi} \right )=\textrm{var}\left ( c_{1}\overline{X}_{1} \right )+\textrm{var}\left ( c_{2}\overline{X}_{2} \right )+\textrm{var}\left ( c_{3}\overline{X}_{3} \right )+\textrm{var}\left ( c_{4}\overline{X}_{4} \right )+\textrm{var}\left ( c_{5}\overline{X}_{5} \right )\)
\(\textrm{var}\left ( \hat{\psi} \right )=c_{1}^{2}\textrm{var}\left ( \overline{X}_{1} \right )+c_{2}^{2}\textrm{var}\left ( \overline{X}_{2} \right )+c_{3}^{2}\textrm{var}\left ( \overline{X}_{3} \right )+c_{4}^{2}\textrm{var}\left ( \overline{X}_{4} \right )+c_{5}^{2}\textrm{var}\left ( \overline{X}_{5} \right )\)
We now need one other fact about the variance of a mean: \(\textrm{var}\left ( \overline{X} \right )=\frac{\sigma _{x}^{2}}{n}\)
\(\textrm{var}\left ( \hat{\psi} \right )=c_{1}^{2}\frac{\sigma _{1}^{2}}{n_{1}}+c_{2}^{2}\frac{\sigma _{2}^{2}}{n_{2}}+c_{3}^{2}\frac{\sigma _{3}^{2}}{n_{3}}+c_{4}^{2}\frac{\sigma _{4}^{2}}{n_{4}}+c_{5}^{2}\frac{\sigma _{5}^{2}}{n_{5}}\)
If we’re going to assume the group variances are equal (which we already assumed in our ANOVA), we have:
\(\textrm{var}\left ( \hat{\psi} \right )=\sum \frac{c_{a}^{2}}{n_{a}}\sigma _{x}^{2}\)
Finally, we know \(\textrm{MS}_{\textrm{E}}\) is our best estimate of the pooled variance of our groups, so we can substitute:
\(\textrm{var}\left ( \hat{\psi} \right )=\textrm{MS}_{\textrm{E}}\sum \frac{c_{a}^{2}}{n_{a}}\)
That’s the variance of our contrast; its square root is the standard error. We can now substitute this into our t-statistic:
\(t=\frac{\textrm{observed}-\textrm{hypothesized}}{\textrm{standard error}}=\frac{\hat{\psi }-0}{\textrm{SE}_{\psi }}=\frac{\hat{\psi }-0}{\sqrt{\textrm{MS}_{\textrm{E}}\sum \frac{c_{a}^{2}}{n_{a}}}}\)
For the Scheffe method, we compare this statistic to the critical value \(\sqrt{\left ( a-1 \right )F_{\alpha ;\ a-1,\ N-a}}\), where \(a\) is the number of groups and \(N\) is the total sample size.
Now let’s try to calculate this Scheffe statistic in R. Remember, we’re comparing the mean of the “no disability” group to the combination of the 4 disability group means:
# Create vector of contrast coefficients
# (groups in alphabetical order: Amputee, Crutches, Hearing, none, Wheelchair)
coeff <- c(-1/4,-1/4,-1/4,1,-1/4)
# Store group means as mm
# (this formula interface to mean() comes from the mosaic package)
library(mosaic)
mm <- mean(score ~ disability, data=interviews)
# Store group sample sizes as nn
nn <- tapply(interviews$score, interviews$disability, length)
# Calculate contrast (psi)
psi <- sum(coeff*mm)
# Calculate standard error of contrast
# First, run the ANOVA to get the mean squares
omnianova <- anova(aov(score ~ disability, data=interviews))
# Then, store mean square error as MSE
MSE <- omnianova$`Mean Sq`[2]
# Finally, calculate the SE of our psi
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )
# Calculate our test statistic (psi / std. error)
PSIoverSE <- psi / SEpsi
# Find critical value
# Choose alpha
alpha = 0.05
# sqrt(a-1) x sqrt(F)
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
# Paste results
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = -0.0732 . Critical value = 3.1705"
Rescaling the contrast weights doesn’t change the test statistic, so we can get the same result with another version of the same contrast (every coefficient multiplied by 4):
coeff <- c(-1,-1,-1,4,-1)
mm <-mean(score ~ disability, data=interviews)
nn <- tapply(interviews$score, interviews$disability, length)
psi <- sum(coeff*mm)
omnianova <- anova(aov(score ~ disability, data=interviews))
MSE <- omnianova$`Mean Sq`[2]
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )
PSIoverSE <- psi / SEpsi
alpha = 0.05
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = -0.0732 . Critical value = 3.1705"
Because the absolute value of the observed test statistic is less than the critical value, we’d conclude that we do not have evidence that the mean of the no-disability group differs from the combined mean of the disability groups.
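As a side note, we can convert this Scheffe statistic into a p-value: squaring it and dividing by \(a-1\) gives an F statistic with \(a-1\) and \(N-a\) degrees of freedom. A sketch, reusing the objects from the code above:
# Convert the Scheffe statistic to a p-value
Fpsi <- PSIoverSE^2 / omnianova$Df[1]
pf(Fpsi, omnianova$Df[1], omnianova$Df[2], lower.tail = FALSE)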
Now let’s use the Scheffe method for our second question, comparing the permanent disabilities (amputee, hearing, wheelchair) with the other two groups. The weights below give +3 to the crutches and no-disability groups and -2 to each permanent-disability group:
coeff <- c(-2,3,-2,3,-2)
mm <-mean(score ~ disability, data=interviews)
nn <- tapply(interviews$score, interviews$disability, length)
psi <- sum(coeff*mm)
omnianova <- anova(aov(score ~ disability, data=interviews))
MSE <- omnianova$`Mean Sq`[2]
SEpsi <- sqrt(MSE) * sqrt( sum((coeff^2)/nn) )
PSIoverSE <- psi / SEpsi
alpha = 0.05
cv <- sqrt(omnianova$Df[1]) * sqrt(qf(1-alpha, omnianova$Df[1], omnianova$Df[2]))
paste("observed test.stat =", round(PSIoverSE,4), ". Critical value =", round(cv,4))
## [1] "observed test.stat = 2.017 . Critical value = 3.1705"
Because the observed test statistic (2.017) is again smaller than the critical value, we do not have evidence that the permanent-disability group means differ from the crutches and no-disability group means.
Finally, here’s the setup for a second dataset, which records the amount of money spent under three types of music:
# Load data
music <- read.csv("http://www.bradthiessen.com/html5/data/music.csv")
# Means plot with standard error bars
ggplot(music, aes(MusicType, Amount)) +
stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom = "line", aes(group = 1),
colour = "Blue", linetype = "dashed") +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
labs(x = "Music Type", y = "Money Spent")
# Levene's test for equal variances
leveneTest(Amount ~ MusicType, data=music)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 6.2569 0.002115 **
## 390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
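Notice that Levene’s test is significant for these data, so the equal-variance assumption looks questionable. One option to consider in the exercises is Welch’s ANOVA, which doesn’t assume equal variances; a sketch using base R’s oneway.test():
# Welch's ANOVA: does not assume equal group variances
oneway.test(Amount ~ MusicType, data = music)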
When you’re ready, try the Your Turn exercises.
Sources:
Study: Cesare, Tannenbaum, & Dalessio (1990). Interviewers’ decisions related to applicant handicap type and rater empathy. Human Performance, 3(3).
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.