Lab #4: Visual and numerical summaries

Remember to download the report template for this lab and open it in RStudio. You can download the template by clicking this link: http://bradthiessen.com/html5/stats/m300/lab4report.Rmd

Replicate Activity #12

Tidy Data

Let’s replicate the summaries and visualizations from our in-class Activity #12. We first start by loading the MAP-Works dataset.

mw <- read.csv("http://www.bradthiessen.com/html5/mw.csv")
head(mw)

##   ID              risk female minority ACT ACTeng ACTmath Hsgpa
## 1  1       Low (Green)      1        7  26     25      21  3.72
## 2  2 Moderate (Yellow)      1        7  20     16      24  2.15
## 3  3       Low (Green)      0        7  19     16      18  2.33
## 4  4       Low (Green)      1        7  20     24      16  2.61
## 5  5       Low (Green)      1        7  20     26      22  3.70
## 6  6       Low (Green)      1        7  21     23      17  3.47
##                    major GPAfall SPRretain GPAspring GPAyear Retained
## 1   Early Childhood Educ    2.87         1      3.44    3.16        1
## 2                Nursing    2.08         1      2.37    2.23        1
## 3       Criminal Justice    2.20         1      2.86    2.52        1
## 4 Pub Rel/Strategic Comm    2.87         1      2.62    2.74        1
## 5                Theatre    2.42         1      1.46    1.92        1
## 6   Elementary Education    2.07         1      3.21    2.64        1
##   COMMITsept ACADEMICsept SATISsept COMMITapril ACADEMICapril SATISapril
## 1   7.000000     5.166667  6.333333           7           6.0   7.000000
## 2   7.000000     4.333333  5.666667           5           4.8   5.000000
## 3   6.500000     5.000000  6.000000           7           4.8   5.666667
## 4   7.000000     6.666667  7.000000           7           4.8   4.333333
## 5   7.000000     4.250000  5.666667           7           4.2   4.000000
## 6   6.666667     4.333333  4.666667           7           7.0   5.333333

If you compare this output to what’s shown on the first page of Activity #12, you’ll find this is the exact same dataset. Let’s use the str() command to see if the variable types match:

str(mw)

## 'data.frame':    200 obs. of  20 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ risk         : Factor w/ 3 levels "High (Red)","Low (Green)",..: 2 3 2 2 2 2 3 2 2 1 ...
##  $ female       : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ minority     : int  7 7 7 7 7 7 7 7 7 1 ...
##  $ ACT          : int  26 20 19 20 20 21 24 28 24 24 ...
##  $ ACTeng       : int  25 16 16 24 26 23 24 27 27 20 ...
##  $ ACTmath      : int  21 24 18 16 22 17 23 28 24 27 ...
##  $ Hsgpa        : num  3.72 2.15 2.33 2.61 3.7 3.47 3.87 3.75 3.83 3.22 ...
##  $ major        : Factor w/ 27 levels "Accounting","Biology",..: 8 20 7 23 26 9 17 11 12 2 ...
##  $ GPAfall      : num  2.87 2.08 2.2 2.87 2.42 2.07 3.5 3.35 3.35 3.19 ...
##  $ SPRretain    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ GPAspring    : num  3.44 2.37 2.86 2.62 1.46 3.21 3.93 3.46 3.63 3.62 ...
##  $ GPAyear      : num  3.16 2.23 2.52 2.74 1.92 2.64 3.71 3.41 3.49 3.39 ...
##  $ Retained     : int  1 1 1 1 1 1 0 1 1 0 ...
##  $ COMMITsept   : num  7 7 6.5 7 7 ...
##  $ ACADEMICsept : num  5.17 4.33 5 6.67 4.25 ...
##  $ SATISsept    : num  6.33 5.67 6 7 5.67 ...
##  $ COMMITapril  : int  7 5 7 7 7 7 1 7 7 2 ...
##  $ ACADEMICapril: num  6 4.8 4.8 4.8 4.2 7 4.4 5.2 5.8 6.6 ...
##  $ SATISapril   : num  7 5 5.67 4.33 4 ...

The activity identifies 6 factor variables, while our dataset only lists risk and major as factor variables. Let’s convert those other variables to factor variables using the as.factor command.

mw$female <- as.factor(mw$female)
mw$minority <- as.factor(mw$minority)
mw$SPRretain <- as.factor(mw$SPRretain)
mw$Retained <- as.factor(mw$Retained)

The first line of code in that block converts the female variable (mw$female) to a factor variable with the same name. The other lines convert the minority, PRretain, and Retained variables. Let’s run the str() command again to see if the conversions worked:

str(mw)

## 'data.frame':    200 obs. of  20 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ risk         : Factor w/ 3 levels "High (Red)","Low (Green)",..: 2 3 2 2 2 2 3 2 2 1 ...
##  $ female       : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 2 2 ...
##  $ minority     : Factor w/ 2 levels "1","7": 2 2 2 2 2 2 2 2 2 1 ...
##  $ ACT          : int  26 20 19 20 20 21 24 28 24 24 ...
##  $ ACTeng       : int  25 16 16 24 26 23 24 27 27 20 ...
##  $ ACTmath      : int  21 24 18 16 22 17 23 28 24 27 ...
##  $ Hsgpa        : num  3.72 2.15 2.33 2.61 3.7 3.47 3.87 3.75 3.83 3.22 ...
##  $ major        : Factor w/ 27 levels "Accounting","Biology",..: 8 20 7 23 26 9 17 11 12 2 ...
##  $ GPAfall      : num  2.87 2.08 2.2 2.87 2.42 2.07 3.5 3.35 3.35 3.19 ...
##  $ SPRretain    : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ GPAspring    : num  3.44 2.37 2.86 2.62 1.46 3.21 3.93 3.46 3.63 3.62 ...
##  $ GPAyear      : num  3.16 2.23 2.52 2.74 1.92 2.64 3.71 3.41 3.49 3.39 ...
##  $ Retained     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 1 ...
##  $ COMMITsept   : num  7 7 6.5 7 7 ...
##  $ ACADEMICsept : num  5.17 4.33 5 6.67 4.25 ...
##  $ SATISsept    : num  6.33 5.67 6 7 5.67 ...
##  $ COMMITapril  : int  7 5 7 7 7 7 1 7 7 2 ...
##  $ ACADEMICapril: num  6 4.8 4.8 4.8 4.2 7 4.4 5.2 5.8 6.6 ...
##  $ SATISapril   : num  7 5 5.67 4.33 4 ...

Yep, the output now shows we have 6 factor variables. There’s still one problem, though. Take a look at the levels of the female variable.

As you can see, the female variable is now a factor variable that takes on two possible values: 0 and 1. We can see this more clearly in a summary table:

table(mw$female)

## 
##   0   1 
##  41 159

This dataset has 41 individuals in which female == 0. This may make sense to you (female = 0 means the individual is male), but it would be nice if we could simply label those values as male and female. To do this, we can use the levels() command. We’d like to tell R that the levels 0 and 1 of our female variable actually represent male and female:

levels(mw$female) <- c("male", "female")
levels(mw$minority) <- c("Caucasian", "Minority")
levels(mw$SPRretain) <- c("Dropped", "Returned")
levels(mw$Retained) <- c("Dropped", "Returned")

Let’s reconstruct a table of this female variable to check the results:

table(mw$female)

## 
##   male female 
##     41    159

It worked – now we can more easily see we have 41 males and 159 females.

Let’s continue replicating Activity #12. In the first question of that activity, we looked at tables of the risk and ACT variables:

table(mw$risk)

## 
##        High (Red)       Low (Green) Moderate (Yellow) 
##                 8               145                47

table(mw$ACT)

## 
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
##  5 11 11 17 23 17 16 24 13 17 11  8  8  8  5  3  2  1

We’ve already seen we can also create tables using the tally() command:

tally(~risk, data=mw)

## 
##        High (Red)       Low (Green) Moderate (Yellow) 
##                 8               145                47

tally(~ACT, data=mw)

## 
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
##  5 11 11 17 23 17 16 24 13 17 11  8  8  8  5  3  2  1

The activity then displays 3 histograms for the Hsgpa variable. Each histogram has a different bin width:

histogram(~Hsgpa, data=mw, col="lightblue", border="lightblue", width=0.01)

histogram(~Hsgpa, data=mw, col="lightblue", border="white", width=0.22)

histogram(~Hsgpa, data=mw, col="lightblue", border="white", width=0.75)

On question #3 of the activity, we looked at side-by-side histograms of GPAfall by Retained. The general syntax for this is:

histogram( ~y | x, data)

Where y is our dependent variable and x is our independent variable. Let’s try it:

histogram(~GPAfall | Retained, data=mw, col="lightblue", border="white", type="count")

Question #4 of the activity displays a stemplot for GPAfall:

stem(mw$GPAfall)

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##    8 | 6
##   10 | 
##   12 | 1
##   14 | 6777
##   16 | 49
##   18 | 23
##   20 | 00027778898
##   22 | 00123339133369
##   24 | 00022799900344778
##   26 | 02357777791337
##   28 | 000022223777788023333588
##   30 | 0000000066666778882333444557889
##   32 | 00002445555556779023335567788888
##   34 | 0000000002334577777777000000003679
##   36 | 1133345713358

Question #6 of the activity displays side-by-side kernal density plots for GPAfall by Retained. To do this, we can replace histogram with densityplot and use the same general syntax:

densityplot(~GPAfall | Retained, data=mw, col="steelblue", lw=5)

The lw=5 option simply tells R to increase the width of the line. If I used the default setting, the line would be thinner:

densityplot(~GPAfall | Retained, data=mw, col="steelblue")

Numerical summaries

Beginning with question #7 of the activity, we started calculating numerical summaries. Here’s how we can get the mean, median, and trimmed mean of our ACT variable:

mean(~ACT, data=mw)

## [1] 23.72

median(~ACT, data=mw)

## [1] 23.5

mean(mw$ACT, trim=.2)

## [1] 23.41667

Similar to what we did in question #8 of the activity, let’s see what happens to the mean and median if we were to convert our ACT score as follows:

newACT = 2*ACT + 4

Let’s create a newACT variable that is equal to four plus the original ACT variable doubled:

mw$newACT <- (2 * mw$ACT) + 4

Recall the mean and median of the original ACT variable equal: mean = 23.72 mean = 23.5

The mean and median of our newACT variable are:

mean(mw$newACT)

## [1] 51.44

median(mw$newACT)

## [1] 51

We could have predicted this, because we know:

mean(2x + 4) = 2 * mean(x) + 4 and median(2x + 4) = 2 * median(x) + 4

We can verify this:

mean(mw$newACT) == (2 * mean(mw$ACT)) + 4

## [1] TRUE

median(mw$newACT) == (2 * median(mw$ACT)) + 4

## [1] TRUE

The output of TRUE indicates that both equations are, in fact, equal.

Question #9 in the activity displays some quantiles (percentiles) of a dataset. Let’s find the 10th, 57th, and 90th percentiles of our GPAfall variable:

quantile(mw$GPAfall, c(.10, .57, .90))

##   10%   57%   90% 
## 2.171 3.140 3.500

Question #10 then displays side-by-side boxplots of GPAfall by Retained and female:

bwplot(GPAfall~Retained | female, data=mw, col="steelblue")

In question 12 of the activity, we calculated a standard deviation. We can do this easily in R with the sd() command. Let’s get the standard deviation of our ACT variable:

sd(mw$ACT)

## [1] 3.855636

What’s the standard deviation of our newACT variable (in which we doubled each ACT score and then added 4)?

sd(mw$newACT)

## [1] 7.711272

Is this equal to (2 * sd(ACT)) + 4? Let’s see:

sd(mw$newACT) == (2 * sd(mw$ACT)) + 4

## [1] FALSE

Nope! Remember that adding a constant to each value does not change the standard deviation of a variable. We can verify this:

# Notice the +4 is missing on the right
sd(mw$newACT) == (2 * sd(mw$ACT))

## [1] TRUE

Question 15 of the activity displays a scatterplot of the relationship between ACT and GPAfall. To get this scatterplot, we use:

xyplot(GPAfall ~ ACT, data=mw, cex=1.1, col="steelblue")

Fancy stuff

If you want to see some fancier plots (including some interactive plots), you can check out the following documents from our course website:

Example 1. Creating plots of baseball statistics

Example 2. Manipulating baseball data

The documents show how to use the ggplot2, ggvis, and dplyr packages to manipulate and plot data. The first document also shows this interactive plot (displaying all the pitches in a single game from one of my favorite pitchers):

Pitches from Justin Verlander on May 7, 2011

If you’re at all interested in statistics, data science, or R, I recommend you check out the ggvis and dplyr packages. I’m more than willing to help you learn how to use them.

As a couple quick examples, here are plots of our ACT and GPAfall variables using ggplot2 and ggvis. Don’t let the length of the commands bother you – I just set lots of options:

ggplot(mw, aes(x=ACT)) + 
  geom_histogram(binwidth=1, colour = "steelblue", fill = "white", size=1, alpha=0.4) + 
  geom_point(aes(y = -1), position = position_jitter(height = 0.5), size=2) +
  ggtitle("ACT Scores") +
  xlim(0, 36) +
  xlab("ACT scores for incoming freshmen") +
  ylab("number of students")

library(ggvis)

## 
## Attaching package: 'ggvis'
## 
## The following object is masked from 'package:mosaic':
## 
##     prop
## 
## The following object is masked from 'package:ggplot2':
## 
##     resolution

mw %>% 
  ggvis(x=~ACT, y=~GPAfall, fill := "steelblue", stroke:="steelblue",
        strokeOpacity:=.2, fillOpacity:=.2) %>%
  layer_points() %>%
  layer_smooths(span = input_slider(0.2, 1, value = 1), strokeOpacity:=1) %>%
  add_axis("x", title = "ACT scores") %>%
  scale_numeric("x", domain=c(0, 36), nice=FALSE, clamp=TRUE)  %>%
  add_axis("y", title = "Fall semester GPAs") %>%
  scale_numeric("y", domain=c(0, 4), nice=FALSE, clamp=TRUE)

## Warning: Can't output dynamic/interactive ggvis plots in a knitr document.
## Generating a static (non-dynamic, non-interactive) version of the plot.

Note that the above plot is interactive if you run it within RStudio.

Q-Q Plots

Suppose you’re analyzing a dataset – creating visualizations as you should – and you come across a variable X with the following distribution:

histogram(X, col="lightblue", border="white")

It looks like this data might approximate a normal distribution, but how can we check? We could overlay a normal curve over the plot. To do this, we need to know the mean and standard deviation of this variable:

mean(X)

## [1] 99.02143

sd(X)

## [1] 32.9947

Now we can plot a normal distribution (with mean = 99.495 and standard deviation = 29.676) on top of our histogram to see how they compare:

plotDist("norm", params=list(mean=99.495, sd=29.676), col="red", lwd=3, add=TRUE)

The add=TRUE option tells R to plot this normal distribution on top of our previous plot.

As expected, the fit isn’t perfect but it looks pretty good. Unfortunately, it’s difficult to judge the degree of fit with a normal curve overlaid on a histogram. This is due, in large part, to the fact that our histogram looks different if we choose a different number of bins.

The following plot will be the exact same variable X. I only changed the histogram to show a greater number of bins:

histogram(X, col="lightblue", border="white", width=5)
plotDist("norm", params=list(mean=99.495, sd=29.676), col="red", lwd=3, add=TRUE)

From this plot, it would be hard to argue that our data approximate a normal distribution. Is there any way we can evaluate normality and not worry about the width of our bins (the number of bars in our histogram)?

Thankfully, the answer is yes. We can use Q-Q (Quantile-Quantile) Plots. Q-Q plots show the observed quantiles of a dataset compared to the theoretical quantiles under a normal distribution.

We’ll discuss Q-Q plots in class. For now, just know that the Q-Q plot for a variable that follows a perfect normal distribution would show all the points lying along the diagonal line.

Let’s look at an example. I’ll generate a variable Z that contains 10,000 samples from a normal distribution.

Z <- rnorm(10000)
qqnorm(Z); qqline(Z)

You can see that the points mostly fall along the diagonal line, indicating that this variable does closely follow a normal distribution.

Let’s take a look at the Q-Q plot for our variable of interest, X:

qqnorm(X); qqline(X)

Hmm… does this indicate the variable approximates a normal distribution? It’s still hard to tell, isn’t it? This variable, in fact, contains 100 random samples from a normal distribution (so we could argue that we should conclude it approximates a normal distribution).

So how can we train our eyes to tell whether a small sample of data approximate (or come from an approximate) normal distribution? We could look at a bunch of Q-Q plots, I guess.

Below, I’ve pasted some code from Randall Pruim that displays Q-Q plots for data generated from normal distributions.

The first row displays 8 Q-Q plots for samples of size n=10. The second and third rows show Q-Q plots for samples of size n=25 and n=100. Remember, all these values were generated from normal distributions, so any fluctuations from the diagonal line are due to random variation:

dat10 <- data.frame(
            x = rnorm(8*10),                     # 8 samples of size 10
            size=rep(10,8*10),                   # record sample size
            sample=rep(1:8,each=10)              # record sample number
            )
dat25 <- data.frame(
            x = rnorm(8*25),                     # 8 samples of size 25
            size=rep(25,8*25),                   # record sample size
            sample=rep(1:8,each=25)              # record sample number
            )
dat100 <- data.frame(
            x = rnorm(8*100),                    # 8 samples of size 100
            size=rep(100,8*100),                 # record sample size
            sample=rep(1:8,each=100)             # record sample number
            )
simdata <- rbind(dat10,dat25,dat100)

# generate the normal-quantile plots for each of the 30 samples
qqmath(~x|sample*factor(size),data=simdata,
    layout=c(8,3), as.table=TRUE,cex=0.5)

Lab questions

Find a dataset, load it into R, and explore it graphically and numerically. At a minimum, pick some variables and create (a) a table, (b) a histogram, (c) a stemplot, (d) a density plot, (e) a boxplot, (f) an xyplot, and (g) a Q-Q plot. Also, pick a variable and calculate its mean, median, trimmed mean, standard deviation, and quantiles. If you need any help locating or loading a dataset, contact me at <a href="mailto:thiessenbradleya@sau.edu">thiessenbradleya@sau.edu</a>

Generating (pubishing) your lab report

When you’ve finished typing all your answers to the exercises, you’re ready to publish your lab report. To do this, look at the top of your source panel (the upper-left pane) and locate the Knit HTML button: Drawing

Click that button and RStudio should begin working to create an .html file of your lab report. It may take a minute or two to compile everything, but the report should open automatically when finished.

Once the report opens, quickly look through it to see all your completed exercises. Assuming everything looks good, send that lab1report.html file to me via email or print it out and hand it in to me.

You’ve done it! Congratulations on finishing your 4th lab!

Feel free to browse around the websites for R and RStudio if you’re interested in learning more, or find more labs for practice at http://openintro.org.

This lab, released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, is a derivative of templates originally developed by Mark Hansen (UCLA Statistics) and adapted by Andrew Bray (Mt. Holyoke) & Mine Çetinkaya-Rundel (Duke).