Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab4report.Rmd
Let’s replicate the summaries and visualizations from our in-class Activity #12. We first start by loading the MAP-Works dataset.
mw <- read.csv("http://www.bradthiessen.com/html5/mw.csv")
head(mw)
## ID risk female minority ACT ACTeng ACTmath Hsgpa
## 1 1 Low (Green) 1 7 26 25 21 3.72
## 2 2 Moderate (Yellow) 1 7 20 16 24 2.15
## 3 3 Low (Green) 0 7 19 16 18 2.33
## 4 4 Low (Green) 1 7 20 24 16 2.61
## 5 5 Low (Green) 1 7 20 26 22 3.70
## 6 6 Low (Green) 1 7 21 23 17 3.47
## major GPAfall SPRretain GPAspring GPAyear Retained
## 1 Early Childhood Educ 2.87 1 3.44 3.16 1
## 2 Nursing 2.08 1 2.37 2.23 1
## 3 Criminal Justice 2.20 1 2.86 2.52 1
## 4 Pub Rel/Strategic Comm 2.87 1 2.62 2.74 1
## 5 Theatre 2.42 1 1.46 1.92 1
## 6 Elementary Education 2.07 1 3.21 2.64 1
## COMMITsept ACADEMICsept SATISsept COMMITapril ACADEMICapril SATISapril
## 1 7.000000 5.166667 6.333333 7 6.0 7.000000
## 2 7.000000 4.333333 5.666667 5 4.8 5.000000
## 3 6.500000 5.000000 6.000000 7 4.8 5.666667
## 4 7.000000 6.666667 7.000000 7 4.8 4.333333
## 5 7.000000 4.250000 5.666667 7 4.2 4.000000
## 6 6.666667 4.333333 4.666667 7 7.0 5.333333
If you compare this output to what’s shown on the first page of Activity #12, you’ll find this is the exact same dataset. Let’s use the str()
command to see if the variable types match:
str(mw)
## 'data.frame': 200 obs. of 20 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ risk : Factor w/ 3 levels "High (Red)","Low (Green)",..: 2 3 2 2 2 2 3 2 2 1 ...
## $ female : int 1 1 0 1 1 1 1 1 1 1 ...
## $ minority : int 7 7 7 7 7 7 7 7 7 1 ...
## $ ACT : int 26 20 19 20 20 21 24 28 24 24 ...
## $ ACTeng : int 25 16 16 24 26 23 24 27 27 20 ...
## $ ACTmath : int 21 24 18 16 22 17 23 28 24 27 ...
## $ Hsgpa : num 3.72 2.15 2.33 2.61 3.7 3.47 3.87 3.75 3.83 3.22 ...
## $ major : Factor w/ 27 levels "Accounting","Biology",..: 8 20 7 23 26 9 17 11 12 2 ...
## $ GPAfall : num 2.87 2.08 2.2 2.87 2.42 2.07 3.5 3.35 3.35 3.19 ...
## $ SPRretain : int 1 1 1 1 1 1 1 1 1 1 ...
## $ GPAspring : num 3.44 2.37 2.86 2.62 1.46 3.21 3.93 3.46 3.63 3.62 ...
## $ GPAyear : num 3.16 2.23 2.52 2.74 1.92 2.64 3.71 3.41 3.49 3.39 ...
## $ Retained : int 1 1 1 1 1 1 0 1 1 0 ...
## $ COMMITsept : num 7 7 6.5 7 7 ...
## $ ACADEMICsept : num 5.17 4.33 5 6.67 4.25 ...
## $ SATISsept : num 6.33 5.67 6 7 5.67 ...
## $ COMMITapril : int 7 5 7 7 7 7 1 7 7 2 ...
## $ ACADEMICapril: num 6 4.8 4.8 4.8 4.2 7 4.4 5.2 5.8 6.6 ...
## $ SATISapril : num 7 5 5.67 4.33 4 ...
The activity identifies 6 factor variables, while our dataset only lists risk
and major
as factor variables. Let’s convert those other variables to factor variables using the as.factor
command.
mw$female <- as.factor(mw$female)
mw$minority <- as.factor(mw$minority)
mw$SPRretain <- as.factor(mw$SPRretain)
mw$Retained <- as.factor(mw$Retained)
The first line of code in that block converts the female variable (mw$female
) to a factor variable with the same name. The other lines convert the minority, PRretain, and Retained variables. Let’s run the str()
command again to see if the conversions worked:
str(mw)
## 'data.frame': 200 obs. of 20 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ risk : Factor w/ 3 levels "High (Red)","Low (Green)",..: 2 3 2 2 2 2 3 2 2 1 ...
## $ female : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
## $ minority : Factor w/ 2 levels "1","7": 2 2 2 2 2 2 2 2 2 1 ...
## $ ACT : int 26 20 19 20 20 21 24 28 24 24 ...
## $ ACTeng : int 25 16 16 24 26 23 24 27 27 20 ...
## $ ACTmath : int 21 24 18 16 22 17 23 28 24 27 ...
## $ Hsgpa : num 3.72 2.15 2.33 2.61 3.7 3.47 3.87 3.75 3.83 3.22 ...
## $ major : Factor w/ 27 levels "Accounting","Biology",..: 8 20 7 23 26 9 17 11 12 2 ...
## $ GPAfall : num 2.87 2.08 2.2 2.87 2.42 2.07 3.5 3.35 3.35 3.19 ...
## $ SPRretain : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
## $ GPAspring : num 3.44 2.37 2.86 2.62 1.46 3.21 3.93 3.46 3.63 3.62 ...
## $ GPAyear : num 3.16 2.23 2.52 2.74 1.92 2.64 3.71 3.41 3.49 3.39 ...
## $ Retained : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 1 ...
## $ COMMITsept : num 7 7 6.5 7 7 ...
## $ ACADEMICsept : num 5.17 4.33 5 6.67 4.25 ...
## $ SATISsept : num 6.33 5.67 6 7 5.67 ...
## $ COMMITapril : int 7 5 7 7 7 7 1 7 7 2 ...
## $ ACADEMICapril: num 6 4.8 4.8 4.8 4.2 7 4.4 5.2 5.8 6.6 ...
## $ SATISapril : num 7 5 5.67 4.33 4 ...
Yep, the output now shows we have 6 factor variables. There’s still one problem, though. Take a look at the levels of the female
variable.
As you can see, the female variable is now a factor variable that takes on two possible values: 0 and 1. We can see this more clearly in a summary table:
table(mw$female)
##
## 0 1
## 41 159
This dataset has 41 individuals in which female == 0
. This may make sense to you (female = 0 means the individual is male), but it would be nice if we could simply label those values as male and female. To do this, we can use the levels()
command. We’d like to tell R that the levels 0 and 1 of our female variable actually represent male and female:
levels(mw$female) <- c("male", "female")
levels(mw$minority) <- c("Caucasian", "Minority")
levels(mw$SPRretain) <- c("Dropped", "Returned")
levels(mw$Retained) <- c("Dropped", "Returned")
Let’s reconstruct a table of this female variable to check the results:
table(mw$female)
##
## male female
## 41 159
It worked – now we can more easily see we have 41 males and 159 females.
Let’s continue replicating Activity #12. In the first question of that activity, we looked at tables of the risk
and ACT
variables:
table(mw$risk)
##
## High (Red) Low (Green) Moderate (Yellow)
## 8 145 47
table(mw$ACT)
##
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 5 11 11 17 23 17 16 24 13 17 11 8 8 8 5 3 2 1
We’ve already seen we can also create tables using the tally()
command:
tally(~risk, data=mw)
##
## High (Red) Low (Green) Moderate (Yellow)
## 8 145 47
tally(~ACT, data=mw)
##
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## 5 11 11 17 23 17 16 24 13 17 11 8 8 8 5 3 2 1
The activity then displays 3 histograms for the Hsgpa
variable. Each histogram has a different bin width:
histogram(~Hsgpa, data=mw, col="lightblue", border="lightblue", width=0.01)
histogram(~Hsgpa, data=mw, col="lightblue", border="white", width=0.22)
histogram(~Hsgpa, data=mw, col="lightblue", border="white", width=0.75)
On question #3 of the activity, we looked at side-by-side histograms of GPAfall
by Retained
. The general syntax for this is:
histogram( ~y | x, data)
Where y is our dependent variable and x is our independent variable. Let’s try it:
histogram(~GPAfall | Retained, data=mw, col="lightblue", border="white", type="count")
Question #4 of the activity displays a stemplot for GPAfall
:
stem(mw$GPAfall)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 8 | 6
## 10 |
## 12 | 1
## 14 | 6777
## 16 | 49
## 18 | 23
## 20 | 00027778898
## 22 | 00123339133369
## 24 | 00022799900344778
## 26 | 02357777791337
## 28 | 000022223777788023333588
## 30 | 0000000066666778882333444557889
## 32 | 00002445555556779023335567788888
## 34 | 0000000002334577777777000000003679
## 36 | 1133345713358
Question #6 of the activity displays side-by-side kernal density plots for GPAfall
by Retained
. To do this, we can replace histogram
with densityplot
and use the same general syntax:
densityplot(~GPAfall | Retained, data=mw, col="steelblue", lw=5)
The lw=5
option simply tells R to increase the width of the line. If I used the default setting, the line would be thinner:
densityplot(~GPAfall | Retained, data=mw, col="steelblue")
Beginning with question #7 of the activity, we started calculating numerical summaries. Here’s how we can get the mean, median, and trimmed mean of our ACT
variable:
mean(~ACT, data=mw)
## [1] 23.72
median(~ACT, data=mw)
## [1] 23.5
mean(mw$ACT, trim=.2)
## [1] 23.41667
Similar to what we did in question #8 of the activity, let’s see what happens to the mean and median if we were to convert our ACT score as follows:
newACT = 2*ACT + 4
Let’s create a newACT
variable that is equal to four plus the original ACT variable doubled
:
mw$newACT <- (2 * mw$ACT) + 4
Recall the mean and median of the original ACT variable equal: mean = 23.72 mean = 23.5
The mean and median of our newACT variable are:
mean(mw$newACT)
## [1] 51.44
median(mw$newACT)
## [1] 51
We could have predicted this, because we know:
mean(2x + 4) = 2 * mean(x) + 4
and median(2x + 4) = 2 * median(x) + 4
We can verify this:
mean(mw$newACT) == (2 * mean(mw$ACT)) + 4
## [1] TRUE
median(mw$newACT) == (2 * median(mw$ACT)) + 4
## [1] TRUE
The output of TRUE indicates that both equations are, in fact, equal.
Question #9 in the activity displays some quantiles (percentiles) of a dataset. Let’s find the 10th, 57th, and 90th percentiles of our GPAfall
variable:
quantile(mw$GPAfall, c(.10, .57, .90))
## 10% 57% 90%
## 2.171 3.140 3.500
Question #10 then displays side-by-side boxplots of GPAfall
by Retained
and female
:
bwplot(GPAfall~Retained | female, data=mw, col="steelblue")
In question 12 of the activity, we calculated a standard deviation. We can do this easily in R with the sd()
command. Let’s get the standard deviation of our ACT
variable:
sd(mw$ACT)
## [1] 3.855636
What’s the standard deviation of our newACT
variable (in which we doubled each ACT score and then added 4)?
sd(mw$newACT)
## [1] 7.711272
Is this equal to (2 * sd(ACT)) + 4
? Let’s see:
sd(mw$newACT) == (2 * sd(mw$ACT)) + 4
## [1] FALSE
Nope! Remember that adding a constant to each value does not change the standard deviation of a variable. We can verify this:
# Notice the +4 is missing on the right
sd(mw$newACT) == (2 * sd(mw$ACT))
## [1] TRUE
Question 15 of the activity displays a scatterplot of the relationship between ACT
and GPAfall
. To get this scatterplot, we use:
xyplot(GPAfall ~ ACT, data=mw, cex=1.1, col="steelblue")
Fancy stuff
If you want to see some fancier plots (including some interactive plots), you can check out the following documents from our course website:
Example 1. Creating plots of baseball statistics
Example 2. Manipulating baseball data
The documents show how to use the ggplot2, ggvis, and dplyr packages to manipulate and plot data. The first document also shows this interactive plot (displaying all the pitches in a single game from one of my favorite pitchers):
If you’re at all interested in statistics, data science, or R, I recommend you check out the ggvis and dplyr packages. I’m more than willing to help you learn how to use them.
As a couple quick examples, here are plots of our ACT
and GPAfall
variables using ggplot2 and ggvis. Don’t let the length of the commands bother you – I just set lots of options:
ggplot(mw, aes(x=ACT)) +
geom_histogram(binwidth=1, colour = "steelblue", fill = "white", size=1, alpha=0.4) +
geom_point(aes(y = -1), position = position_jitter(height = 0.5), size=2) +
ggtitle("ACT Scores") +
xlim(0, 36) +
xlab("ACT scores for incoming freshmen") +
ylab("number of students")
library(ggvis)
##
## Attaching package: 'ggvis'
##
## The following object is masked from 'package:mosaic':
##
## prop
##
## The following object is masked from 'package:ggplot2':
##
## resolution
mw %>%
ggvis(x=~ACT, y=~GPAfall, fill := "steelblue", stroke:="steelblue",
strokeOpacity:=.2, fillOpacity:=.2) %>%
layer_points() %>%
layer_smooths(span = input_slider(0.2, 1, value = 1), strokeOpacity:=1) %>%
add_axis("x", title = "ACT scores") %>%
scale_numeric("x", domain=c(0, 36), nice=FALSE, clamp=TRUE) %>%
add_axis("y", title = "Fall semester GPAs") %>%
scale_numeric("y", domain=c(0, 4), nice=FALSE, clamp=TRUE)
## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.
Note that the above plot is interactive if you run it within RStudio.
Suppose you’re analyzing a dataset – creating visualizations as you should – and you come across a variable X
with the following distribution:
histogram(X, col="lightblue", border="white")
It looks like this data might approximate a normal distribution, but how can we check? We could overlay a normal curve over the plot. To do this, we need to know the mean and standard deviation of this variable:
mean(X)
## [1] 99.02143
sd(X)
## [1] 32.9947
Now we can plot a normal distribution (with mean = 99.495 and standard deviation = 29.676) on top of our histogram to see how they compare:
plotDist("norm", params=list(mean=99.495, sd=29.676), col="red", lwd=3, add=TRUE)
The add=TRUE
option tells R to plot this normal distribution on top of our previous plot.
As expected, the fit isn’t perfect but it looks pretty good. Unfortunately, it’s difficult to judge the degree of fit with a normal curve overlaid on a histogram. This is due, in large part, to the fact that our histogram looks different if we choose a different number of bins.
The following plot will be the exact same variable X
. I only changed the histogram to show a greater number of bins:
histogram(X, col="lightblue", border="white", width=5)
plotDist("norm", params=list(mean=99.495, sd=29.676), col="red", lwd=3, add=TRUE)
From this plot, it would be hard to argue that our data approximate a normal distribution. Is there any way we can evaluate normality and not worry about the width of our bins (the number of bars in our histogram)?
Thankfully, the answer is yes. We can use Q-Q (Quantile-Quantile) Plots. Q-Q plots show the observed quantiles of a dataset compared to the theoretical quantiles under a normal distribution.
We’ll discuss Q-Q plots in class. For now, just know that the Q-Q plot for a variable that follows a perfect normal distribution would show all the points lying along the diagonal line.
Let’s look at an example. I’ll generate a variable Z
that contains 10,000 samples from a normal distribution.
Z <- rnorm(10000)
qqnorm(Z); qqline(Z)
You can see that the points mostly fall along the diagonal line, indicating that this variable does closely follow a normal distribution.
Let’s take a look at the Q-Q plot for our variable of interest, X
:
qqnorm(X); qqline(X)
Hmm… does this indicate the variable approximates a normal distribution? It’s still hard to tell, isn’t it? This variable, in fact, contains 100 random samples from a normal distribution (so we could argue that we should conclude it approximates a normal distribution).
So how can we train our eyes to tell whether a small sample of data approximate (or come from an approximate) normal distribution? We could look at a bunch of Q-Q plots, I guess.
Below, I’ve pasted some code from Randall Pruim that displays Q-Q plots for data generated from normal distributions.
The first row displays 8 Q-Q plots for samples of size n=10. The second and third rows show Q-Q plots for samples of size n=25 and n=100. Remember, all these values were generated from normal distributions, so any fluctuations from the diagonal line are due to random variation:
dat10 <- data.frame(
x = rnorm(8*10), # 8 samples of size 10
size=rep(10,8*10), # record sample size
sample=rep(1:8,each=10) # record sample number
)
dat25 <- data.frame(
x = rnorm(8*25), # 8 samples of size 25
size=rep(25,8*25), # record sample size
sample=rep(1:8,each=25) # record sample number
)
dat100 <- data.frame(
x = rnorm(8*100), # 8 samples of size 100
size=rep(100,8*100), # record sample size
sample=rep(1:8,each=100) # record sample number
)
simdata <- rbind(dat10,dat25,dat100)
# generate the normal-quantile plots for each of the 30 samples
qqmath(~x|sample*factor(size),data=simdata,
layout=c(8,3), as.table=TRUE,cex=0.5)
When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button:
Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.
Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab1report.html file to me via email or print it out and hand it in to me.
Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.