Let’s load the Mosaic package:
require(mosaic)
First Digit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Proportion | .301 | .176 | .125 | .097 | .079 | .067 | .058 | .051 | .046 |
R comes with some built-in datasets. I’m going to use a dataset called state.x77. Let’s look at the first several rows to see what this dataset contains:
Data1 <- state.x77
head(Data1)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
The entire dataset has values for each of the 50 states.
Benford’s law holds for first digits, so we need to select the first digit from those variables. To do this, we use a function called substr()
Let’s get the first digits of the Area variable:
FirstArea <- substr(as.character(state.area),1,1)
FirstArea
## [1] "5" "5" "1" "5" "1" "1" "5" "2" "5" "5" "6" "8" "5" "3" "5" "8" "4"
## [18] "4" "3" "1" "8" "5" "8" "4" "6" "1" "7" "1" "9" "7" "1" "4" "5" "7"
## [35] "4" "6" "9" "4" "1" "3" "7" "4" "2" "8" "9" "4" "6" "2" "5" "9"
We can then see the frequency of each of those first digits:
tally(~FirstArea)
##
## 1 2 3 4 5 6 7 8 9
## 8 3 3 8 11 4 4 5 4
To get a larger dataset, let’s get the first digits of each of the variables:
FirstDigits <- substr(as.character(Data1),1,1)
tally(~FirstDigits)
##
## 0 1 2 3 4 5 6 7 8 9
## 26 89 26 36 59 53 36 52 12 11
Uh oh… there aren’t supposed to be any zeros. Take another look at our dataset:
head(Data1)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
See Colorado at the bottom? The illiteracy rate is 0.7, so it looks like the illiteracy variable may have some leading zeros. To get rid of those leading zeros, we can multiply everything by 10:
table(substr(as.character(10*Data1),1,1))
##
## 0 1 2 3 4 5 6 7 8 9
## 1 89 26 36 59 56 46 57 15 15
Argh! There’s still one leading zero. It took me awhile to find it, but I eventually did. Hawaii has 0 days of frost. I’m going to convert that value to NA and run the tabulation again:
Data1["Hawaii","Frost"]<- NA
FirstDigits <- table(substr(as.character(10*Data1),1,1)); FirstDigits
##
## 1 2 3 4 5 6 7 8 9
## 89 26 36 59 56 46 57 15 15
There we go. Now that we have our leading digits, let’s compare their relative frequencies to what we’d expect under Benford’s Law:
# Sum all the first digits in the dataset
SumOfFirstDigits <- sum(FirstDigits)
# Get Benford's Law
BLaw <- sapply(1:9, function(x) log10(1+1/x)); BLaw
## [1] 0.30103 0.17609 0.12494 0.09691 0.07918 0.06695 0.05799 0.05115 0.04576
m <- cbind(FirstDigits/SumOfFirstDigits,BLaw)
colnames(m)<- c("Observed Prop.", "Theoretical Prop."); m
## Observed Prop. Theoretical Prop.
## 1 0.22306 0.30103
## 2 0.06516 0.17609
## 3 0.09023 0.12494
## 4 0.14787 0.09691
## 5 0.14035 0.07918
## 6 0.11529 0.06695
## 7 0.14286 0.05799
## 8 0.03759 0.05115
## 9 0.03759 0.04576
# Boxplots of observed vs theoretical
barplot( rbind(FirstDigits/SumOfFirstDigits,BLaw/sum(BLaw)), beside = T, col = rainbow(7)[c(2,5)], xlab = "First Digit")
title("Benford’s Law Compared to States Data")
## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <e2>
## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <80>
## Warning: conversion failure on 'Benford’s Law Compared to States Data' in 'mbcsToSbcs': dot substituted for <99>