I'm going to load a dataset that is included with a standard R installation. It's the same Old Faithful data we used in Activity 13, but the variables are measured on a different scale.
Let's just go through some basic visualizations and summary statistics…
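# Load the mosaic package (it supplies favstats(), dotPlot() and the
# formula interface used below, e.g. mean(~x, data = ...))
library(mosaic)
# Load the dataset
data(faithful)
# Look at the first several rows
head(faithful)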
# Calculate the mean eruption time
mean(~eruptions, data = faithful)
## [1] 3.49
# Calculate the standard deviation
sd(~eruptions, data = faithful)
## [1] 1.14
# Calculate the variance
var(~eruptions, data = faithful)
## [1] 1.3
# Calculate the median
median(~eruptions, data = faithful)
## [1] 4
# Calculate quartiles
with(faithful, quantile(eruptions))
# Calculate all the "standard" descriptive statistics
favstats(~eruptions, data = faithful)
# Create histograms with different bin widths (or intervals)
histogram(~eruptions, data = faithful)
histogram(~eruptions, nint=20, data = faithful)
histogram(~eruptions, width=.1, data = faithful)
# Create a histogram (with a fitted normal curve) using only the eruptions with a waiting time of 75 or less
histogram(~ eruptions, fit="normal", data=subset(faithful, waiting<=75))
## Loading required package: MASS
# Stemplot
with(faithful, stem(eruptions))
# Dotplot
dotPlot(~ eruptions, data=faithful)
# Density Plot
densityplot(~ eruptions, data=faithful)
# Scatterplot
xyplot(waiting ~ eruptions, data=faithful)
# Scatterplot with a loess smoother
xyplot(waiting ~ eruptions, type=c('p', 'smooth'), cex=.6, lwd=3, data=faithful)
# Scatterplot with best-fitting line and confidence/prediction intervals
xyplot(waiting~eruptions, panel=panel.lmbands, cex=.3, band.lwd=2, data=faithful)
# Correlation coefficient
cor(waiting, eruptions, data = faithful)
## [1] 0.901
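The formula-style calls above, such as mean(~eruptions, data = faithful), rely on the mosaic package. As a rough cross-check, the same numbers can be reproduced with base R alone. The sketch below assumes nothing beyond the built-in faithful data; the variable names are only illustrative.

```r
# Base R cross-check of the summary statistics (no extra packages needed)
erupt <- faithful$eruptions
wait  <- faithful$waiting

erupt_mean   <- mean(erupt)
erupt_sd     <- sd(erupt)
erupt_var    <- var(erupt)      # the variance is the squared standard deviation
erupt_median <- median(erupt)

# Pearson correlation computed from its definition:
# covariance divided by the product of the two standard deviations
r_manual <- cov(wait, erupt) / (sd(wait) * sd(erupt))

print(round(c(mean = erupt_mean, sd = erupt_sd, var = erupt_var,
              median = erupt_median, r = r_manual), 3))
```

Because the variance is the square of the standard deviation, squaring the 1.14 reported above gives roughly the 1.3 shown for var().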
I thought I'd include some other examples of graphs you could create with R. You can read the # Comments to follow along:
### Source:
## Create a plot of top batting averages over time
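# Load the Lahman package, which contains the baseball datasets
library(Lahman)
# Load the batting data
data(Batting)
# Load plyr, a package that helps manipulate/reshape data
library(plyr)
# Look at the first several rows of data
head(Batting)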
# calculate batting average and other stats
batting <- battingStats()
# add salary to Batting data; need to match by player, year and team
batting <- merge(batting,
Salaries[,c("playerID", "yearID", "teamID", "salary")],
by=c("playerID", "yearID", "teamID"), all.x=TRUE)
# Add name, age and bat hand information:
masterInfo <- Master[, c('playerID', 'birthYear', 'birthMonth',
'nameLast', 'nameFirst', 'bats')]
batting <- merge(batting, masterInfo, all.x = TRUE)
batting$age <- with(batting, yearID - birthYear -
ifelse(birthMonth < 10, 0, 1))
batting <- arrange(batting, playerID, yearID, stint)
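The two merge() calls above act as left joins: all.x = TRUE keeps every batting row even when no salary or biographical record matches. Here is a small sketch of that behaviour, together with the age calculation, on made-up data. The players, years, salaries and birth dates are all hypothetical.

```r
# Hypothetical toy tables standing in for the batting, salary and player info data
batting_toy <- data.frame(playerID = c("p1", "p2", "p3"),
                          yearID   = c(2000, 2000, 2001),
                          stringsAsFactors = FALSE)
salary_toy  <- data.frame(playerID = c("p1", "p3"),
                          yearID   = c(2000, 2001),
                          salary   = c(500000, 750000),
                          stringsAsFactors = FALSE)
info_toy    <- data.frame(playerID   = c("p1", "p2", "p3"),
                          birthYear  = c(1975, 1970, 1980),
                          birthMonth = c(3, 11, 6),
                          stringsAsFactors = FALSE)

# Left join: p2 has no salary record, so its salary becomes NA
merged <- merge(batting_toy, salary_toy,
                by = c("playerID", "yearID"), all.x = TRUE)

# With no 'by' argument, merge() joins on the shared column (playerID)
merged <- merge(merged, info_toy, all.x = TRUE)

# Same rule as above: subtract one more year for players born in October or later
merged$age <- with(merged, yearID - birthYear -
                     ifelse(birthMonth < 10, 0, 1))
print(merged)
```

Without all.x = TRUE, merge() would silently drop p2, which is why the option matters when salary data are missing for many player-seasons.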
## Generate a plot of batting average over time
# Restrict the pool of eligible players to seasons from 1900 onward and
# to players with more than 450 plate appearances. The lower cutoff covers
# the strike year of 1994, when Tony Gwynn hit .394 before play was suspended
# for the season; in a normal year the minimum is 502 plate appearances.
eligibleHitters <- subset(batting, yearID >= 1900 & PA > 450)
# Find the hitters with the highest BA in MLB each year (there are a
# few ties). Include all players with BA > .400
topHitters <- ddply(eligibleHitters, .(yearID), subset, (BA == max(BA))|BA > .400)
# Create a factor variable to distinguish the .400 hitters
topHitters$ba400 <- with(topHitters, BA >= 0.400)
# Sub-data frame for the .400 hitters plus the outliers after 1950
# (averages above .380) - used to produce labels in the plot below
bignames <- rbind(subset(topHitters, ba400),
subset(topHitters, yearID > 1950 & BA > 0.380))
# Cut to the relevant set of variables
bignames <- subset(bignames, select = c('playerID', 'yearID', 'nameLast',
'nameFirst', 'BA'))
# Ditto for the original data frame
topHitters <- subset(topHitters, select = c('playerID', 'yearID', 'BA', 'ba400'))
# Positional offsets to spread out certain labels
bignames$xoffset <- c(0, 0, 0, 0, 0, 0, 0, 0, -8, 0, 3, 3, 0, 0, -2, 0, 0)
bignames$yoffset <- c(0, 0, -0.003, 0, 0, 0, 0, 0, -0.004, 0, 0, 0, 0, 0, -0.003, 0, 0) + 0.002
# Load package for creating visualizations
library(ggplot2)
# Create the visualization
ggplot(topHitters, aes(x = yearID, y = BA)) +
geom_point(aes(colour = ba400), size = 2.5) +
geom_hline(yintercept = 0.400, size = 1) +
geom_text(data = bignames, aes(x = yearID + xoffset, y = BA + yoffset,
label = nameLast), size = 3) +
scale_colour_manual(values = c('FALSE' = 'black', 'TRUE' = 'red')) +
ylim(0.330, 0.430) +
xlab('Year') +
scale_y_continuous('Batting average',
breaks = seq(0.34, 0.42, by = 0.02),
labels = c('.340', '.360', '.380', '.400', '.420')) +
geom_smooth() +
theme(legend.position = 'none')
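The ddply() step in the code above keeps, for each year, the hitter with the league-leading average (ties included) plus anyone who hit above .400. The same selection rule can be sketched in base R on a toy table. The data here are hypothetical, and ave() is used in place of ddply() only to avoid needing plyr.

```r
# Hypothetical toy data: two seasons, five player-seasons
toy <- data.frame(yearID   = c(1901, 1901, 1902, 1902, 1902),
                  playerID = c("a", "b", "c", "d", "e"),
                  BA       = c(0.426, 0.410, 0.378, 0.378, 0.350),
                  stringsAsFactors = FALSE)

# Per-year maximum, repeated on every row of that year
year_max <- ave(toy$BA, toy$yearID, FUN = max)

# Keep each year's leader(s) plus every hitter above .400
top <- toy[toy$BA == year_max | toy$BA > 0.400, ]

# Flag the .400 hitters, as in the plot code
top$ba400 <- top$BA >= 0.400
print(top)
```

In this toy table, player b is kept despite not leading 1901 because of the .400 rule, and players c and d are both kept because they tie for the 1902 lead.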
We could also look at the long-term trend of home runs:
### Source:
# Total home runs by year
totalHR <- ddply(Batting, .(yearID), summarise,
  HomeRuns = sum(as.numeric(HR), na.rm=TRUE),
  Games = sum(as.numeric(G_batting), na.rm=TRUE),
  HRperGame = HomeRuns/Games)
totalHR <- totalHR[ which(totalHR$Games>0), ]
# Quick look at the data
head(totalHR)
# Plot the trend (total home runs / total games played)
# and add a loess-smoothed line to make the trend easier to see
ggplot(totalHR, aes(x=yearID, y=HRperGame)) +
geom_point(shape=1, alpha=0.8) + # Use hollow circles
geom_smooth(alpha=0.3) # Add a loess smoothed fit curve with confidence region
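The yearly summary behind this plot is a group-wise sum followed by a ratio. A minimal base R sketch of that calculation, on hypothetical numbers and with tapply() standing in for ddply(), might look like this:

```r
# Hypothetical toy data: player-level home runs (HR) and games (G) by year
toy <- data.frame(yearID = c(1990, 1990, 1991, 1991, 1992),
                  HR     = c(10, NA, 20, 5, 0),
                  G      = c(100, 50, 120, 80, 0))

# Sum within each year, ignoring missing values
hr_by_year <- tapply(toy$HR, toy$yearID, sum, na.rm = TRUE)
g_by_year  <- tapply(toy$G,  toy$yearID, sum, na.rm = TRUE)

totals <- data.frame(yearID   = as.numeric(names(hr_by_year)),
                     HomeRuns = as.vector(hr_by_year),
                     Games    = as.vector(g_by_year))
totals$HRperGame <- totals$HomeRuns / totals$Games

# Drop years with no recorded games (their rate would be 0/0)
totals <- totals[totals$Games > 0, ]
print(totals)
```

The final filter mirrors the totalHR$Games > 0 step above: a year with zero games would otherwise produce an undefined rate.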
The pitchRx package allows you to analyze all the pitches thrown in baseball games. We'll examine the pitches thrown by Justin Verlander when he threw a no-hitter on May 7, 2011.
### Source:
# Load the package
library(pitchRx)
## scrape pitchFX data from the Detroit vs Toronto game on May 7th, 2011
# To get all data from a particular set of dates, we'd use:
## scrape(start = "2011-05-07", end = "2011-05-07", suffix = "inning/inning_all.xml")
## I already know the Game ID of the game I want
GameData <- scrape(game.ids="gid_2011_05_07_detmlb_tormlb_1", suffix = "inning/inning_all.xml")
# This gets us four dataframes: atbat, action, pitch, po
# Combine pitch and at-bat data
pitchFX <- join(GameData$pitch, GameData$atbat, by = c("num", "url"), type = "inner")
# This creates a dataframe with 69 columns
# Keep only the pitches thrown by Justin Verlander
pitches <- subset(pitchFX, pitcher_name == "Justin Verlander")
# Graph all the pitches thrown by JV
strikeFX(pitches, geom="tile", layer=facet_grid(.~stand))
The above graph shows the location of all the pitches Justin Verlander threw that day (to right- and left-handed batters). Let's look at just the called strikes:
strikes <- subset(pitches, des == "Called Strike")
strikeFX(strikes, geom="tile", layer=facet_grid(.~stand))
… and the swinging strikes:
swingstrikes <- subset(pitches, des == "Swinging Strike")
strikeFX(swingstrikes, geom="tile", layer=facet_grid(.~stand))
… and the balls:
balls <- subset(pitches, des == "Ball")
strikeFX(balls, geom="tile", layer=facet_grid(.~stand))
We can see the results of all pitches thrown to right-handed batters (B = ball, S = strike, X = hit into play):
Righties <- subset(pitches, stand=="R")
strikeFX(Righties, geom="subplot2d", fill="type")
We can even have the computer estimate the probability of a strike based on the location of the pitch.
noswing <- subset(pitches, des %in% c("Ball", "Called Strike"))
noswing$strike <- as.numeric(noswing$des %in% "Called Strike")
# Fit the model with bam() from the mgcv package
library(mgcv)
m1 <- bam(strike ~ s(px, pz, by=factor(stand)) +
  factor(stand), data=noswing, family = binomial(link='logit'))
strikeFX(noswing, model=m1, layer=facet_grid(.~stand))
When pitching to left-handed batters, it looks like high pitches have a higher chance of being called strikes.
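To get a feel for what "estimating the probability of a strike from location" means, here is a simplified sketch. It is not the model fitted above: it uses a plain logistic regression with quadratic terms via glm() instead of the smooth surface from bam(), and it runs on simulated pitch locations rather than the Verlander data. The "true" strike probability in the simulation is an assumption made up for illustration.

```r
set.seed(1)
n  <- 2000
px <- runif(n, -2, 2)      # simulated horizontal location (feet from the middle of the plate)
pz <- runif(n, 0.5, 4.5)   # simulated height (feet above the ground)

# Assumed "true" called-strike probability: highest near px = 0, pz = 2.5
p_true <- plogis(3 - 4 * px^2 - 3 * (pz - 2.5)^2)
strike <- rbinom(n, size = 1, prob = p_true)
sim    <- data.frame(px, pz, strike)

# Logistic regression: the log-odds of a called strike as a quadratic in location
fit <- glm(strike ~ px + I(px^2) + pz + I(pz^2),
           data = sim, family = binomial(link = "logit"))

# Predicted strike probability down the middle vs. well off the plate
new_pitches <- data.frame(px = c(0, 1.5), pz = c(2.5, 2.5))
p_hat <- predict(fit, newdata = new_pitches, type = "response")
print(round(p_hat, 3))
```

The fitted model should give a much higher probability to the pitch down the middle than to the one 1.5 feet off the plate. The bam() model above does the same kind of thing, but lets the shape of the zone be estimated flexibly and separately for left- and right-handed batters.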
Finally, here's an animation of all the pitches from Justin Verlander during this game:
animateFX(pitches, layer=list(facet_grid(pitcher_name~stand, labeller = label_both), theme_bw(), coord_equal()))
# ani.options() and saveHTML() come from the animation package
library(animation)
ani.options(interval = 0.05)
saveHTML({animateFX(pitches, layer=list(facet_grid(pitcher_name~stand, labeller = label_both), theme_bw(), coord_equal()))},"JVpitches")
Let's download a big dataset of baby names:
# Install and load the package with the baby names dataset
install.packages("babynames")
library(babynames)
# Save the dataset as "babies"
babies <- babynames
# See the first several rows
head(babies)
This dataset (from the Social Security Administration) lists every name given to at least 5 babies of the same sex in a given year. Within each year and sex, the rows are ordered by popularity, so the first row shows that Mary was the most popular name for a girl in 1880 (about 7% of girls were given that name).
Let's arbitrarily choose a name and see how it changed in popularity over time.
# Choose the name "Bradley"
bradley <- subset(babies, name == "Bradley")
# Quick line plot of name popularity
qplot(x = year, y = prop, data = bradley, geom = 'line', group = sex, colour=sex)
It looks like Bradley peaked in the early 1980s (although it was also somewhat popular when I was born in 1976).
Let's choose another name:
# Choose the name "Trinity"
trinity <- subset(babies, name == "Trinity")
# Quick line plot of name popularity
qplot(x = year, y = prop, data = trinity, geom = 'line', group = sex, colour=sex)
The name Trinity really peaked in the early 2000s. It just so happens that Trinity is the name of a character in The Matrix, which was released in 1999.
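Reading the peak off a line plot works, but it can also be found numerically. The sketch below uses a tiny hypothetical table with the same columns as the subsets above (year, sex, name, prop); the proportions are invented for illustration and are not the real figures.

```r
# Hypothetical toy data shaped like a subset of the baby names table
toy <- data.frame(year = c(1998, 1999, 2000, 2001, 1998, 1999),
                  sex  = c("F", "F", "F", "F", "M", "M"),
                  name = "Trinity",
                  prop = c(0.0008, 0.0012, 0.0026, 0.0024, 0.00001, 0.00001),
                  stringsAsFactors = FALSE)

# Look at one sex at a time, since each sex has its own trend line
girls <- subset(toy, sex == "F")

# which.max() returns the position of the largest proportion
peak <- girls[which.max(girls$prop), ]
print(peak[, c("year", "prop")])
```

Applied to the real data, the same two lines on subset(babies, name == "Trinity") would return the actual peak year for each sex.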