Author(s): [Enter names of students working on these solutions]
bfeed
with variables: group
: identifies whether each child was breastfed or notscore
: each child’s score on the intelligence test.# The data is loaded as bfeed
bfeed <- read.csv("http://www.bradthiessen.com/html5/stats/m300/bfeed.csv")
# Rename variables and groups
library(forcats)
bfeed <- bfeed %>%
rename(group = Feeding, score = GCI) %>%
mutate(group = fct_recode(group, "breastfed" = "Breastfed", "not" = "NotBreastfed"))
# Report means and standard deviations
bfeed %>%
group_by(group) %>%
summarize(n = n(),
mean = mean(score),
sd = sd(score))
## # A tibble: 2 × 4
## group n mean sd
## <fctr> <int> <dbl> <dbl>
## 1 breastfed 237 105.3 14.49998
## 2 not 85 100.9 13.99997
# a) Plot the data to visualize the intelligence scores
# You can modify the code I used at the beginning of the activity
# to create a dotplot or boxplot. You'll need to change the name
# of the data frame (from "spider" to "bfeed") and the names of the
# variables (from "anxiety" and "group" to "score" and "group").
# You don't need any of the lines of code that start with: theme(.
# Type your code right below this line.
bfeed %>%
ggplot(aes(y = score, x = group)) +
geom_boxplot(fill = "gold", color = "black", alpha = 0.5) +
geom_dotplot(binaxis = "y", stackdir = "center", fill = "steelblue", color = "white", alpha = 0.5) +
scale_y_continuous(breaks=seq(50, 150, 10), minor_breaks=NULL) +
theme(axis.title.x = element_text(size = 12, color = "#777777")) +
theme(axis.text.x = element_text(size = 12)) +
theme(axis.title.y = element_text(size = 12, color="#777777")) +
theme(axis.text.y = element_text(size = 12)) +
labs(
title = "Anxiety by group"
)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
# b) Calculate either cohen's d or D. You can use the functions
# "cohens_d" or "D_effect." Type your code right below this line.
cohens_d(score ~ group, data = bfeed)
## [1] "Cohen's d = 0.306"
D_effect(score ~ group, data = bfeed)
## [1] "D = 0.586"
# In this shaded box, type your interpretation of the effect size you
# just calculated.
Cohen's d: The groups differ by 0.3 standard deviation units. That's a low/medium effect.
D: If we randomly select one person from each group, there is a 59% chance the individual who was breastfed has the higher intelligence test score.
# c) Conduct a t-test using code similar to this:
# t.test(YYY ~ XXX, data = DDD, alternative = "greater")
t.test(score ~ group, data = bfeed, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: score by group
## t = 2.4624, df = 153.01, p-value = 0.007455
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.442962 Inf
## sample estimates:
## mean in group breastfed mean in group not
## 105.3 100.9
# Write your hypotheses here:
Null hypothesis: U-breastfeed = U-not
Alternative hypothesis: U-bf > U-not
Assumptions I am making: Independence, normality, equal variances, H0 is true
Interpretation of the p-value: Likelihood of observing results as or more extreme if the null hypothesis were true.
# d) Make sure you construct a 90% interval (not 95%):
confint(t.test(score ~ group, data=bfeed, conf.level = 0.90, var.equal=TRUE))
## # A tibble: 1 × 5
## `mean in group breastfed` `mean in group not` lower upper level
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 105.3 100.9 1.40296 7.39714 0.9
confint(t.test(score ~ group, data=bfeed, conf.level = 0.90, var.equal=FALSE))
## # A tibble: 1 × 5
## `mean in group breastfed` `mean in group not` lower upper level
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 105.3 100.9 1.442962 7.357138 0.9
If we repeated this experiment a large number of times, 95% of the mean differences would fall between (1.4, 7.4)
simdata
data frame contains simulated data from a fictitious study on weightloss. Subjects were randomly assigned to one of two diet
groups (Diet A or Diet B). The number of pounds each subject lost (weightloss
) was then recorded.# You'll type your answers in the next shaded box
# Simulate the data
set.seed(1234) # Set random number generator to replicate this data
simdata <- tibble(
diet = c(rep("A", 20000), rep("B", 20000)),
weightloss = c(rnorm(20000, 2, 20), rnorm(20000, 1, 20))
)
# Summary statistics
simdata %>%
group_by(diet) %>%
summarize(n = n(),
mean = mean(weightloss),
sd = sd(weightloss))
## # A tibble: 2 × 4
## diet n mean sd
## <chr> <int> <dbl> <dbl>
## 1 A 20000 1.983689 19.93599
## 2 B 20000 1.150867 20.03774
# Effect sizes
cohens_d(weightloss ~ diet, data = simdata)
## [1] "Cohen's d = 0.042"
D_effect(weightloss ~ diet, data = simdata)
## [1] "D = 0.512"
# t-test
t.test(weightloss ~ diet, data = simdata, alternative="greater")
##
## Welch Two Sample t-test
##
## data: weightloss by diet
## t = 4.1668, df = 39997, p-value = 1.548e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.5040582 Inf
## sample estimates:
## mean in group A mean in group B
## 1.983689 1.150867
# Write your conclusion here
The means differ, but the difference is not meaningful or important.
# Explain what caused both the p-value and effect sizes to be low.
The large sample size
spider
data frame. Use bootstrap methods to construct a 95% confidence interval for Cohen’s d (the effect size of the difference in anxiety levels). To do this, you’ll want to use the cohens_d_2
function I wrote (instead of the diffmeans
function we used to construct a confidence interval for the mean difference in the activity). Plot the distribution of bootstrap estimates and report the confidence interval limits. From this confidence interval, what can you conclude about the effect size?# You'll calculate the effect size using
# cohens_d_2(anxiety ~ group, data = spider)
cohens_d_2(anxiety ~ group, data = spider)
## real
## 0.6864065
# You'll want to calculate the effect size
# as you repeatedly sample your data with replacement
# Your code will look something like this:
# bootsamples <- Do(###) * cohens_d_2(anxiety ~ group, data = spider)
# But you'll need to make sure you are *resampling* your data
bootsamples <- Do(5000) * cohens_d_2(anxiety ~ group, data = resample(spider))
# To plot those bootstrap estimates of the effect size, you can simply use:
# histogram(~real, data=bootsamples)
histogram(~real, data=bootsamples)
# To calculate the confidence interval, you'll use something like:
# confint(bootsamples, level = 0.95, method = "quantile")
confint(bootsamples, level = 0.95, method = "quantile")
## # A tibble: 1 × 6
## name lower upper level method estimate
## * <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 real 0.0687742 1.652369 0.95 percentile 0.6864065
spider
data frame. Use randomization-based NHST methods to compare the medians of the real spider vs. photo groups. Plot the randomized sampling distribution and estimate the p-value. To calculate the difference in medians, you can use a diffmedian()
function I wrote for this lesson.# To calculate the difference in medians for our spider data
# you can use: diffmedian(anxiety ~ group, data=spider)
random_medians <- Do(5000) * diffmedian(anxiety ~ shuffle(group), data = spider)
ggplot(data = random_medians, aes(x = real)) +
geom_histogram(binwidth = 1, fill="lightblue", color="white", alpha = 0.8) +
annotate("segment", x = 7, xend = 7, y = 0, yend = 1200, color = "red") +
labs(
title = "Randomized median differences",
x = "difference between group medians"
) +
scale_x_continuous(breaks=seq(-25, 25, 5), minor_breaks=NULL) +
theme(
axis.text.x = element_text(size = 11, color="grey10"),
legend.position = "none",
panel.grid.major.y = element_line(colour = "white"),
panel.grid.major.x = element_line(colour = "white", size=.15),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey93")
)
prop(~ (real >= 7), data=random_medians)
## TRUE
## 0.2872
# We no longer have two independent groups. We needed that independence
# assumption to calculate either SEpooled or the Satterthwaite approximation.
# What can we do with two dependent groups (the same group measured twice)?
# Let's separate our data into the two repeated measures
# First, let's get anxiety scores related to the real spider
real <- spider %>%
filter(group == "real") %>%
transmute(real = anxiety)
# Now, we'll get anxiety scores for those same people with the photo
photo <- spider %>%
filter(group == "photo") %>%
transmute(photo = anxiety)
# We can combine these into a single data frame
spider2 <- data.frame(real, photo)
# Let's look at this data
spider2
## # A tibble: 12 × 2
## real photo
## <dbl> <dbl>
## 1 40 30
## 2 35 35
## 3 50 45
## 4 55 40
## 5 65 50
## 6 55 35
## 7 50 55
## 8 35 25
## 9 30 30
## 10 50 45
## 11 60 40
## 12 39 50
# This shows that instead of having 24 independent anxiety scores
# we have a pair of anxiety scores for each of 12 subjects.
# The first subject had an anxiety of 40 with the real spider
# and 30 with the photo.
# Now, let's calculate the difference in anxiety scores for each subject
spider2 <- spider2 %>%
mutate(difference = real - photo)
# We can see this difference column now...
spider2
## # A tibble: 12 × 3
## real photo difference
## <dbl> <dbl> <dbl>
## 1 40 30 10
## 2 35 35 0
## 3 50 45 5
## 4 55 40 15
## 5 65 50 15
## 6 55 35 20
## 7 50 55 -5
## 8 35 25 10
## 9 30 30 0
## 10 50 45 5
## 11 60 40 20
## 12 39 50 -11
# Under a true null hypothesis, we would expect this difference to be zero
# Let's see what our sample mean difference equals
mean(spider2$difference)
## [1] 7
# How likely were we to get an average difference score of 7
# if the null hypothesis were true? Let's use a theory-based
# null hypothesis significance test
# Run the following code:
# t_test(anxiety ~ group, data=spider, paired=TRUE, alternative="less")
t_test(anxiety ~ group, data=spider, paired=TRUE, alternative="less")
##
## Paired t-test
##
## data: anxiety by group
## t = -2.4725, df = 11, p-value = 0.01549
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -1.915663
## sample estimates:
## mean of the differences
## -7
You’re done!