Friday, April 24, 2015

Benford and the IRS: A Love Story

If you're gonna cheat on your taxes, do it right.

I'm not saying you should cheat on your taxes or anything. I'm not. I'm just saying that there's a dumb way to do it and there's a slightly less dumb way to do it, and I'm gonna tell you this slightly less dumb way to do it, and then you're not gonna do it. 

It's all based on this one fact. Take just about any real-life data set (a table of lengths of rivers, or the population of the largest US cities, or the street addresses of everyone who has eaten an apple in the last week), at least one that's pretty large, and where the figures within aren't artificially constrained within a small range. Look at the leading digit of each of the numbers that appears in the data. You'd probably guess that among the 9 possible leading digits, you'd see 1 appearing roughly 1/9 ≈ 11.1% of the time, just like every other digit. But in practice, 1 is a leading digit something like 30% of the time! What? Yeah. This is a consequence of a little thing called Benford's Law, which says that the actual appearance of leading digits should look something like this in real-life data sets:

(Cultural note: The exact numbers above are given by the difference between the logs, so the law says that 1 should appear as a leading digit approximately log10(2)-log10(1) ≈ 30.1% of the time, 2 should appear approximately log10(3)-log10(2) ≈ 17.6% of the time, etc. We won't discuss this any further, so if you want more rigorous math info, check out some other sources.)
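If you'd rather see all nine numbers than take my word for it, here's a minimal Python sketch of the log-difference formula from the note above. The names `benford_probability` and `benford` are just mine, made up for illustration.

```python
import math

def benford_probability(digit):
    """Chance that `digit` (1-9) is the leading digit under Benford's law."""
    return math.log10(digit + 1) - math.log10(digit)

# Table of all nine leading-digit probabilities.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities add up to 1, because the logs telescope down to log10(10) - log10(1).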

So maybe you see where I'm going with this. If you're gonna cook the books by making up random-seeming numbers, you'd probably end up having leading 1's showing up about as often as leading 2's, 7's, 9's, etc. because humans suck at randomness. But a clever IRS agent who knows about Benford's law 
(and you better believe those pasty fluorescent-lighting-lit-basement dwellers know every trick in the book, learned after years of what I assume is indentured servitude in a that-scene-in-the-Matrix-where-they're-running-down-the-hallway-with-the-Keymaker type of government black site)

Copyright: The Matrix people?

would be able to spot that in his sleep, and slap you with a CP06 notice and Form 14950 request for your improperly filed 1095-A and Form 8962 right the dang away. Because maybe then the debt of his ancestors will be paid. Maybe then he'll finally see sunlight again. 

Copyright: Agent Smith?! I don't know!

Benford distribution in real life

In practice, data isn't going to follow the Benford distribution perfectly. One type of data which will be pretty far off from the Benford prediction is data that is constrained to a single order of magnitude ("orders of magnitude" are basically powers of ten, so 1,456 and 2,984 would have the same order of magnitude, whereas 1,456 is of a smaller order of magnitude than 10,233). If we were to write out all the numbers from 10 to 99 (which are all in a single order of magnitude), we'd have each digit from 1 to 9 appearing exactly 10 times as the leading digit. This is not very interesting. But let's look at what happens when we have, say, the first 100 powers of 2 (and while we're at it, let's also multiply those numbers by 3 or divide them by π or whatever other scaling you want to do):
The first 100 powers of two range from 1 to about 6.3 × 10^29 (or about 1.9 × 10^30 once you multiply by 3), which means we have a range of about 30 different orders of magnitude, and clearly this seems to fit much better.
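Here's a rough sketch of that experiment in Python, if you want to count the leading digits yourself. I'm assuming "the first 100 powers of 2" means 2^0 through 2^99, and the helper names are mine.

```python
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

def leading_digit_counts(numbers):
    """How many numbers start with 1, 2, ..., 9 (in that order)."""
    counts = Counter(leading_digit(n) for n in numbers)
    return [counts.get(d, 0) for d in range(1, 10)]

powers_of_two = [2 ** k for k in range(100)]  # assumption: 2^0 through 2^99
tripled = [3 * n for n in powers_of_two]      # the "multiply by 3" scaling

print("powers of 2:", leading_digit_counts(powers_of_two))
print("times 3:    ", leading_digit_counts(tripled))
```

Python integers don't overflow, so the 30-digit monsters at the end of the list are handled exactly.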

A convincing argument for why we'd expect data to follow this distribution is that if there is some kind of distribution that real-life data sets follow, then this distribution shouldn't depend on what units of measurement we use. So if we're looking at some data that's measured in kilometers, and it follows this magic distribution, it should still roughly follow the magic distribution if we change to miles instead. The uniform distribution (i.e. where each digit is equally likely to be a leading digit) doesn't have this property! But as evidenced by this power-of-2 example, the Benford distribution seems to. It's actually the case that the Benford distribution is the only scale-invariant distribution of leading digits out there.
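To make the kilometers-versus-miles point concrete, here's a small sketch that pretends two lists are distances in kilometers, converts them to miles, and checks how much the leading-digit shares move. The two data sets (the numbers 10 through 99, and the powers of 2) are stand-ins I picked for "uniform" and "Benford-ish" data, and the conversion factor is approximate.

```python
from collections import Counter

KM_TO_MILES = 0.621371  # approximate conversion factor

def leading_digit(x):
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_shares(values):
    """Fraction of values whose leading digit is 1, 2, ..., 9."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts.get(d, 0) / len(values) for d in range(1, 10)]

def to_miles(values):
    return [v * KM_TO_MILES for v in values]

uniform_km = list(range(10, 100))             # every leading digit equally common
benfordish_km = [2 ** k for k in range(100)]  # spread over many orders of magnitude

for name, data in [("10..99", uniform_km), ("powers of 2", benfordish_km)]:
    before = digit_shares(data)
    after = digit_shares(to_miles(data))
    biggest_shift = max(abs(b - a) for b, a in zip(before, after))
    print(f"{name}: biggest change in any digit's share = {biggest_shift:.1%}")
```

The uniform list gets visibly scrambled by the unit change, while the powers of 2 barely budge. That's an illustration on two toy lists, not a proof of the uniqueness claim.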

Just how closely does natural data follow the distribution? Well, 1 definitely seems to appear more than most other digits, for most data sets, even if the Benford distribution isn't perfectly followed. Here are a handful of examples. The first chart below gives the leading digit in some data about the 295 most populated US cities. The second chart gives the leading digit in various data on a list of the 176 longest rivers in the world.

The gray bars represent the numbers dictated by the Benford distribution, and the lines represent actual data. Some are decently matched to the Benford distribution (like the drainage area and average discharge for rivers) and others are pretty far off (like the length of the rivers and the population of US cities).

The worst fit is undoubtedly the length of rivers in miles, which looks wildly different from the length of rivers in kilometers. But we could have guessed that the Benford distribution would be a bad fit here by looking at the data. The range is 625 miles to 4,132 miles. This means that all of the 1's that appear as leading digits are from rivers that have length 1,000 miles to 1,999 miles. So the problem is that our data isn't spread over enough orders of magnitude for Benford's law to be applicable.
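A quick way to put a number on "not enough orders of magnitude" is to take the base-10 log of the ratio between the biggest and smallest values. This is just a back-of-the-envelope sketch, and the function name is mine.

```python
import math

def orders_of_magnitude_spanned(smallest, largest):
    """Base-10 log of the ratio between the largest and smallest value."""
    return math.log10(largest / smallest)

river_span = orders_of_magnitude_spanned(625, 4132)    # river lengths in miles
powers_span = orders_of_magnitude_spanned(1, 2 ** 99)  # 2^0 through 2^99

print(f"river lengths: {river_span:.2f} orders of magnitude")
print(f"powers of 2:   {powers_span:.2f} orders of magnitude")
```

The river lengths cover less than one full order of magnitude, while the powers of 2 cover nearly thirty.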

The first person to make a note of this surprising law was an astronomer named Simon Newcomb. Back in the late 1800s, people used log tables to do computations, which were basically long books with pages and pages of values of different logarithms on them.

A log table. Photo credit: agr.
Newcomb noted that the pages which had numbers starting with 1 were way more worn out than the later pages. So basically we can chalk this discovery up to the fact that new versions of log tables weren't coming out at an iPhone rate (I guess because logs haven't changed much over the years) and everyone had crappy old books full of worn out pages. I guess people weren't that impressed with Newcomb and his old book because it isn't called Newcomb's law. A guy named Frank Benford published some statistics about a mind-boggling number of different data points about half a century later and people named the law after him. Which is fair enough because there were 20,229 observations in the paper.

Benford and fraud in practice

Before you go off testing too many data sets, it should be noted that there have been cases where someone was accused of faking data because it didn't fit the Benford distribution, but it turned out that the data was not actually fake. One such embarrassing misapplication of Benford's law was actually done by the US Secret Service. It's important to note that just because data fails to satisfy Benford's law doesn't mean that it's falsified. Benford's law doesn't say that there cannot exist a data set that's uniform, for example.

There are people who actually use Benford's law to analyze whether tax documents are likely to be faked or not. An accounting consultant analyzed Bill Clinton's tax returns. He came out looking pretty good. So either he didn't cheat on his taxes, or he cheated on them pretty well.
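I don't know exactly which test the consultants run, but one common way to score how far a set of numbers strays from the Benford distribution is a Pearson chi-square statistic on the leading-digit counts. The sketch below is that generic statistic under my own assumptions, not anybody's actual audit procedure: the "made up" figures are just uniformly random three-digit integers standing in for cooked books, and the helper names are mine.

```python
import math
import random
from collections import Counter

# Benford probabilities for leading digits 1 through 9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Standard 5% critical value for a chi-square with 8 degrees of freedom.
CRITICAL_5_PERCENT = 15.507

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

def benford_chi_square(numbers):
    """Pearson chi-square statistic of observed leading digits vs. Benford."""
    counts = Counter(leading_digit(n) for n in numbers)
    total = len(numbers)
    return sum(
        (counts.get(d, 0) - total * p) ** 2 / (total * p)
        for d, p in zip(range(1, 10), BENFORD)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
made_up = [rng.randint(100, 999) for _ in range(500)]  # hypothetical cooked books
spread_out = [2 ** k for k in range(100)]              # Benford-ish data

for name, data in [("made up", made_up), ("powers of 2", spread_out)]:
    score = benford_chi_square(data)
    verdict = "looks off" if score > CRITICAL_5_PERCENT else "looks fine"
    print(f"{name}: chi-square = {score:.1f} ({verdict})")
```

A big score only says the digits don't look Benford-ish. As the Secret Service story shows, that isn't proof of fraud, especially for data squeezed into a narrow range.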

So if you try to use Benford's law in your daily life, be a Bill Clinton, not a Secret Service.
