In 1993, Wayne James Nelson was accused of defrauding the state of Arizona out of $2 million by diverting funds to a bogus vendor. Nelson claimed he had diverted randomly chosen payments to the bogus vendor to demonstrate the absence of safeguards in a new computer system. The prosecutors argued that the diverted funds were not randomly selected payments; rather, Nelson fabricated the amounts in an attempt to make them appear random.
Here are the amounts of the 23 checks Nelson diverted to the bogus vendor:
arizona
## date amount
## 1 Oct. 9, 1992 1927.46
## 2 Oct. 9, 1992 27902.31
## 3 Oct. 14, 1992 86241.90
## 4 Oct. 14, 1992 72117.46
## 5 Oct. 14, 1992 81321.75
## 6 Oct. 14, 1992 97473.96
## 7 Oct. 19, 1992 93249.11
## 8 Oct. 19, 1992 89658.17
## 9 Oct. 19, 1992 87776.89
## 10 Oct. 19, 1992 92105.83
## 11 Oct. 19, 1992 79949.16
## 12 Oct. 19, 1992 87602.93
## 13 Oct. 19, 1992 96879.27
## 14 Oct. 19, 1992 91806.47
## 15 Oct. 19, 1992 84991.67
## 16 Oct. 19, 1992 90831.83
## 17 Oct. 19, 1992 93776.67
## 18 Oct. 19, 1992 88336.72
## 19 Oct. 19, 1992 94639.69
## 20 Oct. 19, 1992 83709.28
## 21 Oct. 19, 1992 96412.21
## 22 Oct. 19, 1992 88432.86
## 23 Oct. 19, 1992 71552.15
If you look closely, you’ll notice:
• The amounts start small and then increase, a pattern we might expect in a case of fraud.
• Many amounts fall just below $100,000 (perhaps the threshold at which transactions receive greater scrutiny).
If we look at the distribution of the first digits of each of those amounts, we find:
# Extract first digit of each amount and store as "firstdigit"
arizona$firstdigit <- as.numeric(substr(arizona$amount, 1, 1))
# Get table of first digits
tally(~firstdigit, margins=TRUE, data=arizona)
##
## 1 2 7 8 9 Total
## 1 1 3 9 9 23
# Histogram of the first digits
histogram(~firstdigit, data=arizona, col="lightblue", border="white", xlab="First Digit", width=1)
Does the distribution of these first digits give us evidence that the amounts were not randomly selected? If so, it may support the accusation that Nelson created these amounts in an attempt to defraud the state.
What should the distribution of leading digits be in a set of financial transactions? Without giving it much thought, we might expect a uniform distribution, with each digit from 1 to 9 appearing about 1/9 of the time. It turns out, however, that the leading digits of financial records follow a very different distribution.
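For comparison, here's a quick look at both candidate distributions in R (the Benford probabilities come from the formula derived below):
# Naive guess: each leading digit 1-9 is equally likely
uniform_probs <- rep(1/9, 9)
# Benford's Law probabilities for the digits 1-9
benford_probs <- log10(1 + 1/(1:9))
round(rbind(uniform = uniform_probs, benford = benford_probs), 3)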
This distribution, generated from Benford’s Law, applies to first digits of numeric data that have a wide distribution and are generated by a natural process (as opposed to being generated by a human). Some examples of data that follow Benford’s Law include: stock market volumes, populations of cities, distances to stars in the universe, house addresses, and numbers of followers on Twitter.
What is Benford’s Law? Its origins go back to the nineteenth century, when the astronomer Simon Newcomb visited the library. Like most mathematicians of the time, Newcomb relied on reference books like Arithmetica Logarithmica to help with calculations. The book provided page after page of logarithms, calculated to 10 decimal places, for every number up to 100,000.
As Newcomb used this book, he noticed the pages at the beginning were dirtier than the pages toward the end. In fact, the pages for numbers beginning with 1 were the dirtiest, followed by the pages for numbers beginning with 2, and so on; the cleanest pages were those for numbers beginning with 9. Realizing the dirt came from people using some pages more than others, Newcomb argued:
“That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9.”
Newcomb published his findings in an 1881 article, describing a law for the probability that a number begins with a particular digit:
Probability of a number beginning with digit d is: \(\textrm{log}_{10}\left ( d+1 \right )-\textrm{log}_{10}\left ( d \right )=\textrm{log}_{10}\left ( 1+\frac{1}{d} \right )\)
Imagine you have a pencil that is one unit long. It doesn’t matter what one unit means to you – it could be centimeters, inches, cubits, or anything else.
Now imagine that pencil slowly and continuously growing in length. For a long time, the length will be 1.x units long. In fact, it will need to double in length (a 100% change) before the leading digit changes from 1 to 2.
Once the leading digit is 2, however, it only needs to change in length by 50% to get a leading digit of 3. Once at 3, a change of 33.3% will bring the leading digit to a 4.
The following figure shows that as we move along a logarithmic scale, there’s a shorter distance between each subsequent digit. In fact, once we get to a leading digit of 9, we only need an 11% increase to go back to a leading digit of 1.
In the figure displayed below, the amount of “time” the leading digit is a 1 has been shaded red. You can see the shaded area represents roughly 30% of the entire figure:
If each leading digit is given its own color, we can compare the relative widths of each shaded region:
It turns out that the width of each colored segment is proportional to \(\textrm{log}_{10}\left ( d+1 \right )-\textrm{log}_{10}\left ( d \right )=\textrm{log}_{10}\left ( 1+\frac{1}{d} \right )\)
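We can verify this in R: the gaps between consecutive integers on a log10 scale reproduce those probabilities exactly, and together they tile the entire decade.
# Widths of the nine segments on a log10 scale
d <- 1:9
widths <- log10(d + 1) - log10(d)   # equivalently, log10(1 + 1/d)
round(widths, 4)
sum(widths)   # the widths sum to 1, covering the full distance from log10(1) to log10(10)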
These probabilities tell us what to expect if we examine the first digits of a series of financial records.
Unfortunately for Newcomb, his article went largely unnoticed. Fifty-seven years later, a physicist named Frank Benford made the same discovery, prompted by the same observation of dirtier pages at the front of logarithm books, and derived the same formula. Benford, though, had the resources to test the theory empirically.
Benford spent years gathering 20,229 observations from a diverse range of data sets: baseball statistics, areas of rivers, atomic weights of elements, and even arbitrary numbers plucked from Reader’s Digest articles. This mix of datasets fit his predicted distribution remarkably well.
With this empirical support, Benford’s article quickly gained popularity, and the distribution of first digits became known as Benford’s Law.
Now that we know what we should expect for the distribution of leading digits, we simply need to compare it to what we observed.
How can we calculate the likelihood of observing these first-digits if, in fact, Nelson had selected a random sample of 23 transactions (from a population that follows Benford’s Law)?
To derive some methods to estimate this likelihood, let’s turn to a simpler scenario.
Suppose we roll a single 6-sided die 120 times and observe the following frequencies:

Face:       1    2    3    4    5    6
Observed:  23   15   19   24   21   18
To measure the distance from an observed set of frequencies to a theoretical distribution of expectations, we can use a chi-squared statistic:
\(\chi _{k-1}^{2}=\sum_{c=1}^{k}\frac{\left ( \textrm{observed}_{c} -\textrm{expected}_{c}\right )^{2}}{\textrm{expected}_{c}}=\sum \frac{\left ( \textrm{O}_{c}-\textrm{E}_{c} \right )^{2}}{\textrm{E}_{c}}\), where \(k\) is the number of categories.
To estimate a p-value, we compare this value to a chi-squared distribution with \(k-1\) degrees of freedom.
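As a quick check of how the formula works, here's the statistic computed step-by-step for the die-roll counts above (the chisq.test function, used below, automates all of this):
# Observed counts for faces 1-6 (from the table above)
observed <- c(23, 15, 19, 24, 21, 18)
# With a fair die and 120 rolls, we expect 20 of each face
expected <- rep(120 * 1/6, 6)
# Chi-square statistic: sum of (O - E)^2 / E
chisq_stat <- sum((observed - expected)^2 / expected)
chisq_stat   # 2.8
# p-value from a chi-square distribution with 6 - 1 = 5 degrees of freedom
pchisq(chisq_stat, df = 5, lower.tail = FALSE)   # about 0.73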
We can also run this chi-square goodness-of-fit test with the chisq.test function in R:
# Input the observed frequency of each face of the die
die <- c(23, 15, 19, 24, 21, 18)
# Input the expected probability of each face
dieprob <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
# Run the chi-square test using chisq.test
# rescale.p = TRUE just makes sure the probabilities sum to 1
chisq.test(die, p = dieprob, rescale.p = TRUE)
##
## Chi-squared test for given probabilities
##
## data: die
## X-squared = 2.8, df = 5, p-value = 0.7308
We could also conduct this goodness-of-fit test using randomization-based methods. In this simple example with rolling a die 120 times, we expect to see an equal proportion of 1s, 2s, 3s, 4s, 5s, and 6s. In other words, we expect P(1) = P(2) = ... = P(6) = 1/6.
With this assumption, we can have the computer repeatedly sample 120 values from a discrete distribution in which each number from 1 to 6 is equally likely, and then see how often the resulting chi-square statistic is at least as large as the one we observed:
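Here's a rough sketch of that randomization done by hand (the simulate.p.value option of chisq.test, used below, follows the same idea):
# Chi-square statistic for the observed rolls
expected <- 120 * dieprob
obs_stat <- sum((die - expected)^2 / expected)
# Simulate 5000 sets of 120 fair-die rolls and compute each set's statistic
set.seed(123)   # for reproducibility
sim_stats <- replicate(5000, {
  rolls <- sample(1:6, size = 120, replace = TRUE, prob = dieprob)
  counts <- tabulate(rolls, nbins = 6)
  sum((counts - expected)^2 / expected)
})
# Estimated p-value: proportion of simulated statistics at least as large as the observed one
mean(sim_stats >= obs_stat)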
# Run the chi-square test using chisq.test
# Set simulate.p.value = TRUE to run the randomization-based test
# B = the number of randomizations to run
chisq.test(die, p = dieprob, rescale.p = TRUE, simulate.p.value = TRUE, B=5000)
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 5000 replicates)
##
## data: die
## X-squared = 2.8, df = NA, p-value = 0.7421
Let’s conduct this goodness-of-fit test on our potential fraud case:
# First, let's input the frequency of each first digit that we observed
nelsondigits <- c(1, 1, 0, 0, 0, 0, 3, 9, 9)
# Now, let's calculate the expected probabilities of each digit
# First, I'll input the digits 1-9
digits <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
# Now, I'll calculate the probabilities using Benford's Law
benford <- log10(1 + 1/digits)
# Let's take a look at those probabilities
benford
## [1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679
## [7] 0.05799195 0.05115252 0.04575749
# Now we can run the chi-squared goodness-of-fit test
# First via theory-based methods:
chisq.test(nelsondigits, p = benford)
## Warning in chisq.test(nelsondigits, p = benford): Chi-squared approximation
## may be incorrect
##
## Chi-squared test for given probabilities
##
## data: nelsondigits
## X-squared = 129.95, df = 8, p-value < 2.2e-16
# Next with randomization-based methods
chisq.test(nelsondigits, p = benford, simulate.p.value = TRUE, B=5000)
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 5000 replicates)
##
## data: nelsondigits
## X-squared = 129.95, df = NA, p-value = 2e-04
Goodness-of-fit tests also let us check whether a sample of data is consistent with a theoretical distribution (not just a set of category probabilities). Suppose the following data represent the times between breakdowns for a machine:
times <- c(0.1, 0.1, 0.6, 0.6, 1.3, 1.6, 2.4, 2.4, 3.7, 3.7, 4.9,
5.0, 6.2, 6.3, 10.3, 10.6, 0.1, 0.2, 0.7, 0.7, 1.7, 1.8,
2.4, 2.7, 4.1, 4.1, 5.2, 5.3, 7.4, 7.5, 10.9, 12.3, 0.2,
0.3, 0.3, 0.4, 0.8, 0.9, 0.9, 1.0, 1.9, 1.9, 2.0, 2.0,
2.8, 2.8, 2.8, 3.0, 4.2, 4.2, 4.3, 4.3, 5.3, 5.7, 5.7,
5.9, 7.5, 7.7, 8.1, 8.6, 13.9, 13.7, 13.9, 14.8, 0.5, 0.6,
0.6, 1.1, 1.2, 1.3, 2.1, 2.2, 2.2, 3.1, 3.5, 3.7, 4.5,
4.9, 4.9, 6.1, 6.2, 6.2, 9.2, 9.5, 10.0, 14.9, 17.3, 17.6)
We want to know if the exponential distribution is an appropriate model for the time we wait until the machine breaks down.
To test this, let's place our sample data into bins. I'll arbitrarily choose bins of width = 3 units:

Bin:        0-3   3-6   6-9   9-12   12-15   15-18
Observed:    40    23    11      6       6       2

That last bin only has 2 observations, so I'll merge it with the preceding bin, making the final bin "12 or more" with 8 observations.
Now that we have our observed data in bins, we need to calculate the expected probability (and frequency) for each bin. We could do this by hand using the exponential distribution: with the rate \(\lambda\) estimated by \(1/\bar{x}\), the probability of landing in a bin is \(P(a < X < b)=e^{-\lambda a}-e^{-\lambda b}\), and multiplying each probability by the sample size of 88 gives the expected frequency. From these we could then calculate our chi-square goodness-of-fit statistic.
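Here's a sketch of that hand calculation, using R as a calculator (the rate \(\lambda\) is estimated by 1/mean(times)):
# Estimate the exponential rate from the sample mean
lambda <- 1 / mean(times)
# For an exponential distribution, P(a < X < b) = exp(-lambda*a) - exp(-lambda*b)
lower <- c(0, 3, 6, 9, 12)
upper <- c(3, 6, 9, 12, Inf)
bin_probs <- exp(-lambda * lower) - exp(-lambda * upper)
# Expected frequency in each bin = bin probability x sample size
expected <- length(times) * bin_probs
# Observed frequencies in each bin (from the table above)
observed <- c(40, 23, 11, 6, 8)
# Chi-square goodness-of-fit statistic (matches the chisq.test output below)
sum((observed - expected)^2 / expected)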
We can also do the whole calculation in R, using pexp for the bin probabilities and chisq.test for the test:
# The exponential distribution is defined by lambda = rate = 1 / E(x)
# Calculate probabilities for each bin
# This calculates probabilities under the exponential distribution
p1 <- pexp(3, rate = 1/mean(times), lower.tail = TRUE) #P(0 < x < 3)
p2 <- pexp(6, rate = 1/mean(times)) - pexp(3, rate = 1/mean(times)) #P(3<x<6)
p3 <- pexp(9, rate = 1/mean(times)) - pexp(6, rate = 1/mean(times)) #P(6<x<9)
p4 <- pexp(12, rate = 1/mean(times)) - pexp(9, rate = 1/mean(times)) #P(9<x<12)
p5 <- pexp(12, rate = 1/mean(times), lower.tail = FALSE) #P(X > 12)
# Store all the probabilities as "probs"
probs <- c(p1, p2, p3, p4, p5)
# There's an easier way to do the following, but I think this makes sense
# Let's store our observed data as frequencies in each bin
# To do this, I'm going to "cut" my data into 5 bins
# The bins are defined by the endpoints 0, 3, 6, 9, 12, and infinity
# (I use 2.99 rather than 3 as the first break so the observation of exactly 3.0 falls in the second bin)
# I name the bins "1", "2", "3", "4", and "5"
cuts <- cut(times, breaks = c(0, 2.99, 6, 9, 12, Inf), labels=1:5)
# Now I can get frequencies for each bin
tally(~cuts, margins=TRUE)
##
## 1 2 3 4 5 Total
## 40 23 11 6 8 88
# I'll store those frequencies in a vector called "timebins"
# This code does it automatically
timebins <- c(length(cuts[cuts==1]), length(cuts[cuts==2]),
length(cuts[cuts==3]), length(cuts[cuts==4]),
length(cuts[cuts==5]))
# Now I run the chi-squared test
chisq.test(timebins, p = probs)
##
## Chi-squared test for given probabilities
##
## data: timebins
## X-squared = 0.36087, df = 4, p-value = 0.9856
# Next with randomization-based methods
chisq.test(timebins, p = probs, simulate.p.value = TRUE, B=5000)
##
## Chi-squared test for given probabilities with simulated p-value
## (based on 5000 replicates)
##
## data: timebins
## X-squared = 0.36087, df = NA, p-value = 0.9886
Researchers selected a sample of 663 male heart disease patients and a control group of 772 males not suffering from heart disease. Each subject was classified into one of 5 baldness categories:

Baldness:           none   little   some   much   extreme
Heart disease:       251      165    195     50         2
No heart disease:    331      221    185     34         1
# Mosaic plot of the baldness / heart disease table
# (baldheart is the data frame we enter further below)
mosaicplot(table(baldheart))
Can we use a chi-square test to test for a relationship between heart disease and baldness? We can, as long as we can derive an expected frequency for each cell of the table. Under independence, each cell's expected frequency is its row total times its column total, divided by the overall total.
Before tackling the baldness data, let's work through a smaller, hypothetical example. Suppose 50 males and 50 females each flip a coin, and overall 40 of the 100 flips land heads, with 30 of those heads coming from the males. If the flips are independent of gender, how likely is it that the males would account for 30 or more of the 40 heads? The hypergeometric distribution gives an exact answer, and we can conduct the same test (Fisher's Exact Test) in R:
# P(30 or more heads) = 1 - P(29 or fewer heads), where
# m = 40 heads in the population
# n = 60 tails in the population
# k = 50 flips belong to the males (we count how many of those are heads)
1 - phyper(29, 40, 60, 50, lower.tail=TRUE)
## [1] 4.154329e-05
# Enter the data for Fisher's Test
coin <- c(rep("heads", 40), rep("tails", 60))
gender <- c(rep("male", 30), rep("female", 10), rep("male", 20), rep("female", 40))
coingender <- data.frame(coin, gender)
# Examine the data
tally(coin ~ gender, margins = TRUE, data=coingender)
## gender
## coin female male
## heads 10 30
## tails 40 20
## Total 50 50
# Fisher's test
fisher.test(coingender$gender, coingender$coin, alternative="less")
##
## Fisher's Exact Test for Count Data
##
## data: coingender$gender and coingender$coin
## p-value = 4.154e-05
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
## 0.0000000 0.3864017
## sample estimates:
## odds ratio
## 0.1700317
You can see the results from Fisher’s Exact Test match the calculation from the hypergeometric distribution. This method is more difficult to use with larger numbers (or when our table isn’t a 2x2 table).
Since we have observed frequencies and can compute expected frequencies for these categorical variables, we could also calculate a chi-square statistic.
We can conduct this chi-square test for independence in R:
# Our data are still stored in the "coingender" data.frame
# We use chisq.test(x, y) to run the test
chisq.test(coingender$gender, coingender$coin, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: coingender$gender and coingender$coin
## X-squared = 16.667, df = 1, p-value = 4.456e-05
# We can estimate the p-value with randomization-based methods
chisq.test(coingender$gender, coingender$coin, simulate.p.value=TRUE, B=5000)
##
## Pearson's Chi-squared test with simulated p-value (based on 5000
## replicates)
##
## data: coingender$gender and coingender$coin
## X-squared = 16.667, df = NA, p-value = 2e-04
Let’s enter our heart disease / baldness data into R and run both Fisher’s Exact Test and a Chi-square Test for Independence:
# Enter data
baldness <- c(rep("none", 582), rep("little", 386), rep("some", 380), rep("much", 84), rep("extreme", 3))
disease <- c(rep("yes", 251), rep("no", 331), rep("yes", 165), rep("no", 221), rep("yes", 195), rep("no", 185), rep("yes", 50), rep("no", 34), rep("yes", 2), rep("no", 1))
baldheart <- data.frame(baldness, disease)
# Check for entry errors
tally(disease ~ baldness, margins = TRUE, data=baldheart)
## baldness
## disease extreme little much none some
## no 1 221 34 331 185
## yes 2 165 50 251 195
## Total 3 386 84 582 380
# Run Fisher's Exact Test
fisher.test(baldheart$baldness, baldheart$disease, alternative="less")
##
## Fisher's Exact Test for Count Data
##
## data: baldheart$baldness and baldheart$disease
## p-value = 0.003818
## alternative hypothesis: less
# Chi-square test for independence
chisq.test(baldheart$baldness, baldheart$disease, correct=FALSE)
## Warning in chisq.test(baldheart$baldness, baldheart$disease, correct =
## FALSE): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: baldheart$baldness and baldheart$disease
## X-squared = 14.57, df = 4, p-value = 0.005682
# We can estimate the p-value with randomization-based methods
chisq.test(baldheart$baldness, baldheart$disease, simulate.p.value=TRUE, B=5000)
##
## Pearson's Chi-squared test with simulated p-value (based on 5000
## replicates)
##
## data: baldheart$baldness and baldheart$disease
## X-squared = 14.57, df = NA, p-value = 0.005199
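The warnings R gave for the theory-based tests come from small expected cell counts. We can pull the expected counts out of the test object to see the problem directly:
# Expected cell counts under independence
# (suppressWarnings just hides the warning we've already seen)
expected_counts <- suppressWarnings(
  chisq.test(table(baldheart$baldness, baldheart$disease))
)$expected
round(expected_counts, 2)
# Only 3 subjects fall in the "extreme" category, so both of its expected
# counts are well below 5 (the rule of thumb discussed next)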
One problem with the chi-square test is that the sampling distribution of the test statistic only approximately follows a chi-square distribution. With larger sample sizes, the approximation is good enough. With smaller sample sizes, however, p-values estimated from the chi-square distribution can be inaccurate.

This is why many textbooks recommend using chi-square tests only when the expected frequency in each cell is at least 5. Even with large overall sample sizes, expected cell frequencies below 5 can result in a loss of statistical power.

When you have cells with expected frequencies below 5, you can:
• Use Fisher’s Exact Test
• Use a likelihood ratio test (like the G-test we’ll see in a bit)
• Merge cells into larger categories so that the expected frequencies are all at least 5
Another condition of the chi-square test is independence: each observation must contribute to exactly one cell of the contingency table. This means you cannot use chi-square tests with repeated-measures designs.
With a large enough sample size, the chi-square test can produce a small p-value even when the difference in proportions is trivially small. Whenever you analyze contingency tables, it's important to also calculate some measure of effect size, such as a relative risk or odds ratio, to estimate the magnitude of the differences among categories.
When you have a 2x2 contingency table, the chi-square test (with one degree of freedom) tends to produce p-values that are too low (i.e., it makes too many Type I errors). One suggestion to address this issue is Yates’s continuity correction.
Yates’s correction simply has you subtract 0.5 from the absolute value of each deviation in the numerator of the chi-square test statistic:

\(\chi _{\textrm{Yates}}^{2}=\sum_{i}\frac{\left ( \left | O_{i}-E_{i} \right |-0.5 \right )^{2}}{E_{i}}\)
This correction lowers our chi-square test statistic, thereby yielding a higher p-value. Although this may seem like a nice solution, there is evidence that this correction over-corrects and produces chi-square values that are too small.
If you want to use Yates’s correction in R, you simply set correct=TRUE (note that R only applies the correction to 2x2 tables):
# Chi-square test for the baldness data, requesting Yates's correction
chisq.test(baldheart$baldness, baldheart$disease, correct=TRUE)
## Warning in chisq.test(baldheart$baldness, baldheart$disease, correct =
## TRUE): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: baldheart$baldness and baldheart$disease
## X-squared = 14.57, df = 4, p-value = 0.005682
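Because our baldness table is 2x5, R made no correction, and the results above match the uncorrected test exactly. To see the correction actually change something, we can apply it to the 2x2 coin/gender table from earlier:
# Yates's correction on a 2x2 table
chisq.test(coingender$gender, coingender$coin, correct = TRUE)
# The corrected statistic works out to about 15.0, compared with the
# uncorrected value of 16.67 we saw earlier (and a slightly larger p-value)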
An alternative to the chi-square test (that happens to produce results similar to the chi-square test) is the G-test.
The G-test is an example of a likelihood ratio test (LRT). We’ll see other likelihood ratio tests later in the course. For now, let’s conduct an LRT on a simple dataset.
Suppose you hypothesize 75% of St. Ambrose students graduate within 6 years. You then collect a sample of 1000 students and find 770 of them graduated within 6 years.
You can think of this data as frequencies across two cells: 770 (graduated) and 230 (did not).
With this, we could calculate a chi-square goodness-of-fit test:
# Input data
graduated <- c(770, 230)
# Input probabilities
probabilities <- c(.75, .25)
# Run the test
chisq.test(graduated, p = probabilities)
##
## Chi-squared test for given probabilities
##
## data: graduated
## X-squared = 2.1333, df = 1, p-value = 0.1441
We could also calculate the likelihood of observing this data under our hypothesis. If we think of each student as an independent trial with two possible outcomes (1 = graduated; 0 = did not), we can use the binomial distribution.
We want to calculate P(X = 770) under a binomial distribution with n = 1000 trials and p = 0.75 (our hypothesized value). Let’s store this likelihood as null:
# P(X = 770) in binomial distribution with n=1000, p=0.75
null <- dbinom(770, size=1000, prob=0.75)
null
## [1] 0.0101099
Now let’s consider an alternative hypothesis: the true proportion of students who graduate is 77% (the proportion we actually observed). We’ll store this likelihood as alternative:
# P(X = 770) in binomial distribution with n=1000, p=0.77
alternative <- dbinom(770, size=1000, prob=0.77)
alternative
## [1] 0.02996627
From these results, we’d conclude that the data we obtained are more likely under a true graduation proportion of 77% than under our hypothesized 75%.
Let’s take the ratio of the null likelihood to the alternative likelihood:
# Likelihood ratio
LR <- null / alternative
LR
## [1] 0.337376
This ratio gets smaller as the observed results move further from our null hypothesis. If we take the natural log of this ratio and multiply it by -2, we get the likelihood ratio test statistic, known as the G-statistic.
\(G=-2\textrm{ln}\frac{L_{\textrm{null}}}{L_{\textrm{alternative}}}\)
We take the natural log and multiply by -2 because the resulting statistic approximately follows a chi-square distribution under the null hypothesis.
Let’s calculate this statistic for our graduation example and determine its likelihood from a chi-square distribution:
# Calculate G-statistic
G <- -2 * log(LR)
G
## [1] 2.173115
# Compare it to chi-square distribution with df = 2-1
pchisq(G, df=1, lower.tail=FALSE)
## [1] 0.1404415
For this simple example, our p-value is 0.140. With a p-value that large, we have little evidence against the null hypothesis that 75% of St. Ambrose students graduate within 6 years; the 77% we observed is not convincingly different from 75%.
Calculating the G-statistic this way can be tedious. Thankfully, there’s an easier formula (based on the fact that, in the ratio of likelihoods, a lot of terms cancel):

\(G=2\sum_{i=1}^{c} O_{i}\,\textrm{ln}\frac{O_{i}}{E_{i}}\)
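For the graduation example, this shortcut gives the same value we computed from the likelihood ratio:
# Shortcut formula applied to the graduation data
observed <- c(770, 230)
expected <- 1000 * c(0.75, 0.25)
2 * sum(observed * log(observed / expected))   # about 2.173, matching G above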
The G-test can also be used to test for independence between two categorical variables.
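Base R doesn't include a built-in G-test function, but the calculation is short enough to do directly. As a sketch, here's a G-test for independence on the 2x2 coin/gender table from earlier (add-on packages provide functions that automate this):
# G-test for independence, computed by hand
tab <- table(coingender$gender, coingender$coin)
# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
# G statistic and p-value, with df = (rows - 1) x (cols - 1) = 1
G <- 2 * sum(tab * log(tab / expected))
G   # about 17.3, close to the Pearson chi-square of 16.67 from earlier
pchisq(G, df = 1, lower.tail = FALSE)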
Since both the G-test and the chi-squared test can be used for both goodness-of-fit and independence, which test should you use? It depends.
The G-test is more flexible: G-statistics are additive, which makes the test easier to extend to more elaborate experimental designs. The chi-square test, on the other hand, is more widely taught and easier to explain.

The G-statistic comes directly from the multinomial likelihood ratio, whereas the Pearson chi-square statistic is an approximation to it.
This document is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license.