Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab7report.Rmd
Let’s begin with the example we went through in class: Do doctors spend less time with obese patients? We’ll attempt to answer this question via null hypothesis significance testing (NHST). We’ll conduct a t-test first; then we’ll use randomization-based methods.
Let’s load the data and take a look at its structure:
doctors <- read.csv(url("http://bradthiessen.com/html5/stats/m300/doctors.csv"))
str(doctors)
## 'data.frame': 71 obs. of 2 variables:
## $ weight: Factor w/ 2 levels "average","obese": 1 1 1 1 1 1 1 1 1 1 ...
## $ time : int 15 15 45 40 45 20 40 30 40 30 ...
We want to test the difference in time between the average and obese groups. Let's visualize the distribution of the time variable by weight:
# Side-by-side dotplots
dotPlot(~ time | weight, data = doctors,
        width = 1,         # Width of each "bar" = 1
        layout = c(1, 2),  # Plot 1 column, 2 rows
        xlab = "Minutes spent with patient")

# Side-by-side density plots
densityplot(~ time | weight, data = doctors,
            lwd = 4,           # Increase width of line
            bw = 7,            # Increase bandwidth
            layout = c(1, 2),  # Plot 1 col. and 2 rows
            xlab = "Minutes with patient")

# Boxplots
bwplot(~ time | weight, data = doctors,
       ylab = "Patient weight", xlab = "Minutes with patient",
       layout = c(1, 2),
       par.settings = list(box.rectangle = list(col = "steelblue", lwd = 3),
                           box.umbrella = list(lwd = 3, col = "steelblue")))
Each plot shows that the average amount of time doctors report they would spend with a patient is smaller when the patient is obese.
The favstats() command gives us summary statistics:
# Summary statistics (time as a function of weight)
favstats(time ~ weight, data=doctors)
## weight min Q1 median Q3 max mean sd n missing
## 1 average 15 25 30 40 50 31.36364 9.864134 33 0
## 2 obese 5 20 25 30 60 24.73684 9.652571 38 0
The difference in means (obese - average) equals 24.737 - 31.364 = -6.627. We can get this difference with the diffmean() command:
diffmean(time ~ weight, data=doctors)
## diffmean
## -6.626794
We’re going to conduct an independent samples t-test to determine if the mean time spent with obese patients is less than the mean time spent with non-obese patients. Before we do this, let’s investigate a couple of the assumptions (conditions): normality and homogeneity of variances.
To check for normality, we can look at the density plots we constructed earlier. They didn’t look too bad, outside of the outlier in the obese group. We can also check for normality via Q-Q plots:
# Obese patients
qqnorm(doctors$time[doctors$weight=="obese"],
       ylab = "Minutes with patients", main = "Obese Patients")
qqline(doctors$time[doctors$weight=="obese"])

# Non-obese patients
qqnorm(doctors$time[doctors$weight=="average"],
       ylab = "Minutes with patients", main = "Non-obese Patients")
qqline(doctors$time[doctors$weight=="average"])
We could also run a Shapiro-Wilk normality test. This tests the hypothesis that the data follow a normal distribution. Let’s run this test for both the obese and non-obese groups:
shapiro.test(doctors$time[doctors$weight=="obese"])
##
## Shapiro-Wilk normality test
##
## data: doctors$time[doctors$weight == "obese"]
## W = 0.83843, p-value = 7.024e-05
shapiro.test(doctors$time[doctors$weight=="average"])
##
## Shapiro-Wilk normality test
##
## data: doctors$time[doctors$weight == "average"]
## W = 0.90958, p-value = 0.009531
The low p-values tell us there's only a small chance we would have gotten data like ours if, in fact, the times came from normal distributions. In other words, we have evidence against normality in both groups.
To check for equal variances, we can run the F-max test we discussed in class:
varNonobese <- var(doctors$time[doctors$weight=="average"])
varObese <- var(doctors$time[doctors$weight=="obese"])
Fmax <- varNonobese / varObese
Fmax
## [1] 1.044316
Since this Fmax value is close to 1.00, we have evidence to support our decision to assume equal variances. We could also test equal variances with Bartlett’s test:
bartlett.test(time ~ weight, data=doctors)
##
## Bartlett test of homogeneity of variances
##
## data: time by weight
## Bartlett's K-squared = 0.015916, df = 1, p-value = 0.8996
The p-value of 0.8996 tells us there’s a good chance of getting this data if, in fact, the groups have equal variances.
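If you'd like a significance test to go along with that Fmax ratio, base R also provides var.test(), which runs a two-group F test of equal variances. We didn't run this one in class, but here's a quick sketch:
# F test of equal variances (base R)
var.test(time ~ weight, data=doctors)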
To conduct the t-test, we can use the t.test() command. The arguments include:
t.test(y ~ x, data, alternative = c("two.sided", "less", "greater"), var.equal = FALSE, sig.level = 0.05)
In this scenario…
y ~ x = time ~ weight
data = doctors
alternative = "greater" (we'll test if non-obese > obese)
var.equal = TRUE (we'll try FALSE later)
sig.level = 0.05 (to match the example in the activity)
Let's run the command:
t.test(time ~ weight, data=doctors,
var.equal=TRUE, alternative="greater", sig.level=0.05)
##
## Two Sample t-test
##
## data: time by weight
## t = 2.856, df = 69, p-value = 0.002831
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 2.758328 Inf
## sample estimates:
## mean in group average mean in group obese
## 31.36364 24.73684
The output shows our observed t-statistic (2.856) along with its degrees of freedom (69). The small p-value (0.0028) provides evidence that the mean time for non-obese patients is greater than the mean time for obese patients (the direction our plots already suggested).
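If you want to see where that t-statistic comes from, here's a sketch that reproduces it by hand using the pooled-variance formulas from class (the object names are my own):
# Group sizes, means, and variances
n1 <- sum(doctors$weight == "average")
n2 <- sum(doctors$weight == "obese")
m1 <- mean(doctors$time[doctors$weight == "average"])
m2 <- mean(doctors$time[doctors$weight == "obese"])
v1 <- var(doctors$time[doctors$weight == "average"])
v2 <- var(doctors$time[doctors$weight == "obese"])
# Pooled variance (weighted by degrees of freedom)
sp2 <- ((n1-1)*v1 + (n2-1)*v2) / (n1 + n2 - 2)
# Observed t-statistic and its one-sided p-value
t.obs <- (m1 - m2) / sqrt(sp2 * (1/n1 + 1/n2))
t.obs                                              # about 2.856
pt(t.obs, df = n1 + n2 - 2, lower.tail = FALSE)    # about 0.0028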
If we want to get the corresponding confidence interval (like we did in class), we can find a two-sided 90% CI by adding $conf.int to the end of our command:
# Notice the conf.level is 0.90
# Also notice that I did not specify the alternative
t.test(time ~ weight, var.equal=TRUE, conf.level=0.90, data=doctors)$conf.int
## [1] 2.758328 10.495260
## attr(,"conf.level")
## [1] 0.9
Since both the lower and upper bounds of our confidence interval are positive, we have reason to believe the difference is positive (and not zero).
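You can verify that interval by hand, too. Here's a sketch that reuses the objects from the t-statistic sketch above:
# 90% CI: (difference in means) +/- t* x (standard error)
se <- sqrt(sp2 * (1/n1 + 1/n2))
tstar <- qt(0.95, df = n1 + n2 - 2)   # critical value for a 90% CI
(m1 - m2) + c(-1, 1) * tstar * se     # about (2.758, 10.495)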
In activity #19 (which we may not have yet completed in class), we discuss effect sizes. Specifically, we calculate and interpret Cohen’s d. We can do this manually in R using:
# Group means
mean1 <- mean(doctors$time[doctors$weight=="average"])
mean2 <- mean(doctors$time[doctors$weight=="obese"])
# Group variances
var1 <- var(doctors$time[doctors$weight=="average"])
var2 <- var(doctors$time[doctors$weight=="obese"])
# Pooled standard deviation (simple average of the two variances)
spooled <- sqrt( (var1 + var2) / 2 )
# Cohen's d: mean difference in pooled-SD units
effectsize <- (mean1 - mean2) / spooled
effectsize
## [1] 0.6790496
That effect size, d = 0.679, indicates we have a medium-size effect.
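Note that spooled above is a simple (unweighted) average of the two variances. With unequal group sizes, you'll sometimes see the pooled SD weighted by degrees of freedom instead; here's a sketch of that variant (with these data it barely changes the answer, since the group variances are so similar):
# Cohen's d with a df-weighted pooled SD (group sizes: 33 and 38)
spooled.w <- sqrt( ((33-1)*var1 + (38-1)*var2) / (33 + 38 - 2) )
(mean1 - mean2) / spooled.w   # about 0.68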
Exercise: The bfeed dataset contains the variables Feeding (whether each child was "Breastfed" or "NotBreastfed") and GCI (a score of intelligence). Compare the GCI scores for each group.

If we aren't willing to make an equal-variance assumption, we can use the Welch-Satterthwaite method. In R, we use the same t.test() command but we include the argument var.equal=FALSE:
t.test(time ~ weight, data=doctors, var.equal=FALSE, alternative="greater", sig.level=0.05)
##
## Welch Two Sample t-test
##
## data: time by weight
## t = 2.8516, df = 67.174, p-value = 0.002887
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 2.750899 Inf
## sample estimates:
## mean in group average mean in group obese
## 31.36364 24.73684
As you can see, the p-value didn’t change much. This is because our group variances were so similar – the assumption of equal variances wasn’t much of a stretch. Notice, though, the degrees of freedom changed from 69 to 67.174.
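Those fractional degrees of freedom come from the Welch-Satterthwaite approximation. Here's a sketch that reproduces them, reusing the group variances and sizes from the t-statistic sketch earlier:
# Welch-Satterthwaite degrees of freedom
a <- v1/n1
b <- v2/n2
(a + b)^2 / ( a^2/(n1-1) + b^2/(n2-1) )   # about 67.174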
Exercise: Using the bfeed data, run an independent-samples t-test without the equal variance assumption.

We can also compare time spent with obese and non-obese patients using randomization-based methods. These methods don't require the conditions of normality or homogeneity of variances.
Our first step is to identify our test statistic of interest. The test statistic should be a single value that measures what we want to know. In this case, we want to know the difference between the means of our two groups.
As we did earlier, we can calculate this test statistic with the diffmean() command:
test.stat <- diffmean(time ~ weight, data = doctors)
Remember that for this dataset, the difference in means is test.stat = -6.63.
Now that we have our test statistic, let's randomize our data. Our null hypothesis is that the weight of a patient has no effect on the time doctors would spend with that patient. If this is the case, we can shuffle the weight of our patients with no effect on the time spent with them.
For example, the first doctor in this study was randomly assigned a non-obese “patient.” This doctor reported that he or she would spend 15 minutes with the patient. Now suppose we go back in time, run this study again, and randomly assign the doctor to an obese patient. Under our null hypothesis, this doctor would still report spending 15 minutes with the patient.
With this logic, we can randomly shuffle our data so that we randomly assign non-obese patients to 33 doctors and obese patients to 38 doctors (just like was done in the actual study). Once we shuffle our data like this, we calculate our test statistic.
We repeat this process 10,000 times, randomly shuffling our data and calculating our test statistic each time. This gives us a nice, large, representative sample of test statistic values if our null hypothesis were true. From this, we can estimate the likelihood of our actual test statistic (-6.63).
Let's see how this works with the shuffle() command. We'll shuffle our data two times and calculate the test statistic in each randomization:
set.seed(3141) #Set the random number seed
diffmean(time ~ shuffle(weight), data=doctors)
## diffmean
## -6.3437
Take a look at the syntax and try to make sense of it. It might help to start at the right and work our way left:
diffmean(time ~ shuffle(weight), data=doctors)
data=doctors: we're using the doctors dataset
shuffle(weight): the weight variable gets randomly shuffled
time ~ shuffle(weight): time is modeled as a function of the shuffled weight
diffmean(): we're calculating the difference in means

The output shows our test statistic in this first randomization equals -6.34. That's fairly similar to our actual test statistic (-6.63). Let's try a second randomization:
set.seed(3142) #Set the random number seed
diffmean(time ~ shuffle(weight), data=doctors)
## diffmean
## 1.016746
In this randomization, the test statistic equals +1.02. That means that in this randomization, doctors actually reported spending about 1 more minute (on average) with obese patients.
Let’s get the values of our test statistic under 10,000 randomizations:
randomized <- do(10000) * diffmean(time ~ shuffle(weight), data=doctors)
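If you're curious about what do() produced, randomized is a data frame with one diffmean value per randomization. You can peek at the first few rows (your values will differ unless you use the same seeds):
# Look at the first few randomized test statistics
head(randomized, 3)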
Let’s plot the values of our test statistic for all 10,000 randomizations:
histogram(~ diffmean, data=randomized, groups=diffmean <= test.stat,
xlim=c(-12,12), width=1,
xlab="Mean differences assuming null hypothesis is true")
ladd(panel.abline(v=test.stat))
You can already see that our actual test statistic value of -6.63 is an unusual value (if our null hypothesis were true). Let's see how unlikely our observed test statistic would be:
# We want the proportion of our 10,000 values
# in which the mean difference is less than
# or equal to our actual test.stat of -6.63
prop(~ (diffmean <= test.stat), data=randomized)
## TRUE
## 0.0034
That's our p-value: p = 0.0034. Notice how close it is to the p-value from our t-test (p = 0.0028).
Exercise: Conduct this randomization-based test using the bfeed dataset (store your observed difference in means as test.stat first).
We can extend our randomization-based methods to test other test statistics, such as the difference between the group medians. In our actual dataset, the difference in medians is 5 minutes:
median(time ~ weight, data=doctors)
## average obese
## 30 25
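That 5-minute difference comes straight from subtracting those two medians; here's a quick sketch (the object name is my own):
# Difference in group medians (average - obese)
meds <- median(time ~ weight, data=doctors)
meds["average"] - meds["obese"]   # 5 minutes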
Let’s randomize our data and get 10,000 values of this test statistic:
# Calculate medians for randomized weight groups
randmedians <- do(10000) * median(time ~ shuffle(weight), data=doctors)
# Calculate difference between group medians for each
# of the 10,000 randomizations
randmedians$diffmedians <- randmedians$average - randmedians$obese
# Plot a histogram and highlight differences > 5 minutes
histogram(~ diffmedians, data=randmedians, groups=diffmedians >= 5,
          xlim=c(-12,12), width=1,
          xlab="Median differences assuming null hypothesis is true")
ladd(panel.abline(v=5))
# Estimate the p-value
prop(~ (diffmedians >= 5), data=randmedians)
## TRUE
## 0.0277
There's our p-value of p = 0.0277.
Exercise: Repeat this randomization-based test of the difference in medians using the bfeed dataset.

In class, we looked at the smiles dataset. This dataset contains leniency scores for each of 4 different types of smiles:
smiles <- read.csv("http://www.bradthiessen.com/html5/data/smiles2.csv")
str(smiles)
## 'data.frame': 136 obs. of 2 variables:
## $ expression: Factor w/ 4 levels "felt","feltfalse",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ leniency : num 7 3 6 4.5 3.5 4 3 3 3.5 4.5 ...
favstats(leniency ~ expression, data=smiles)
## expression min Q1 median Q3 max mean sd n missing
## 1 felt 2.5 3.500 4.75 5.875 9 4.911765 1.680866 34 0
## 2 feltfalse 2.5 3.625 5.50 6.500 9 5.367647 1.827023 34 0
## 3 miserable 2.5 4.000 4.75 5.500 8 4.911765 1.453682 34 0
## 4 neutral 2.0 3.000 4.00 4.875 8 4.117647 1.522850 34 0
One test statistic that could be used to measure the difference among these 4 groups is the sum of absolute deviations or SAD.
SAD = |mean1 - mean2| + |mean1 - mean3| + |mean1 - mean4| + |mean2 - mean3| + |mean2 - mean4| + |mean3 - mean4|
With this dataset, SAD is calculated to be 3.75:
test.stat <- SAD(mean(leniency ~ expression, data = smiles))
test.stat
## [1] 3.75
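To see what SAD() is doing, here's a sketch that computes the same value by hand from the four group means. The outer() call builds all pairwise differences between the means, and dividing by 2 keeps each unordered pair only once:
# Sum of absolute pairwise differences among the group means
groupmeans <- mean(leniency ~ expression, data = smiles)
sum( abs( outer(groupmeans, groupmeans, "-") ) ) / 2   # 3.75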
We can randomly shuffle the smile types and calculate SAD for each of 10,000 randomizations:
randomSAD <- do(10000) * SAD(mean(leniency ~ shuffle(expression), data = smiles))
From this, we can visualize our randomization-based distribution and estimate a p-value:
# Distribution of SAD values
histogram(~ result, xlim=c(0,6), groups=result >= test.stat, data=randomSAD, width=.2)
ladd(panel.abline(v=test.stat))
# Estimate p-value
prop(~ (result >= test.stat), data=randomSAD)
## TRUE
## 0.022
Exercise: Conduct a similar randomization-based test using the earn dataframe. It contains the following variables for 250 individuals: Degree, the highest level of education achieved by each individual (Some college, Associate degree, Bachelor degree, Master degree, or Doctorate), and Earnings, the income of each individual (in tens of thousands of dollars).

I won't run it in this lab, but here's the code I would use to estimate the difference in group means via Bayesian estimation:
# Get times for obese patients in a vector called "obese"
obese <- doctors$time[doctors$weight == "obese"]
# Get times for non-obese patients in "average"
average <- doctors$time[doctors$weight == "average"]
# Load the BEST package
library(BEST)
# Run the MCMC chains
BESTout <- BESTmcmc(average, obese)
# Get output
BESTout
# Plot posterior distributions
plot(BESTout)
plot(BESTout, "sd")
plotPostPred(BESTout)
pairs(BESTout)
When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button:
Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.
Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab7report.html file to me via email or print it out and hand it in to me.
Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.