Friday, April 24, 2015

Benford and the IRS: A Love Story

If you're gonna cheat on your taxes, do it right.

I'm not saying you should cheat on your taxes or anything. I'm not. I'm just saying that there's a dumb way to do it and there's a slightly less dumb way to do it, and I'm gonna tell you this slightly less dumb way to do it, and then you're not gonna do it. 

It's all based on this one fact. Take just about any real life data set (a table of lengths of rivers, or the population of the largest US cities, or the street addresses of everyone who has eaten an apple in the last week)---at least one that's pretty large, and where the figures within aren't artificially constrained within a small range. Look at the leading digit of each of the numbers that appears in the data. You'd probably guess that among the 9 possible leading digits, you'd see 1 appearing roughly 1/9 ≈ 11.1% of the time, just like every other digit. But in practice, 1 is a leading digit something like 30% of the time! What? Yeah. This is a consequence of a little thing called Benford's Law, which says that the actual appearance of leading digits should look something like this in real life data sets:

(Cultural note: The exact numbers above are given by the difference between the logs, so the law says that 1 should appear as a leading digit approximately log10(2)-log10(1) ≈ 30.1% of the time, 2 should appear approximately log10(3)-log10(2) ≈ 17.6% of the time, etc. We won't discuss this any further, so if you want more rigorous math info, check out some other sources.)
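If you want to see where those percentages come from without taking my word for it, here's a quick check you can run in Matlab/Octave (my choice of tool here, the same one that shows up in the coin-flipping post further down; nothing official about Benford):

% Benford's law: the leading digit d should show up with frequency log10(d+1) - log10(d).
d = 1:9;
benford = log10(d + 1) - log10(d);
disp([d' benford'])   % first column: digit, second column: predicted frequency
                      % top row is about 0.301, i.e. a leading 1 about 30.1% of the time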

So maybe you see where I'm going with this. If you're gonna cook the books by making up random-seeming numbers, you'd probably end up having leading 1's showing up about as often as leading 2's, 7's, 9's, etc. because humans suck at randomness. But a clever IRS agent who knows about Benford's law 
(and you better believe those pasty fluorescent-lighting-lit-basement dwellers know every trick in the book, learned after years of what I assume is indentured servitude in a that-scene-in-the-Matrix-where-they're-running-down-the-hallway-with-the-Keymaker type of government black site)

Copyright: The Matrix people?

would be able to spot that in his sleep, and slap you with a CP06 notice and Form 14950 request for your improperly filed 1095-A and Form 8962 right the dang away. Because maybe then the debt of his ancestors will be paid. Maybe then he'll finally see sunlight again. 

Copyright: Agent Smith?! I don't know!

Benford distribution in real life

In practice, data isn't going to follow the Benford distribution perfectly. One type of data which will be pretty far off from the Benford prediction is data that is constrained to a single order of magnitude ("orders of magnitude" are basically powers of ten---so 1,456 and 2,984 would have the same order of magnitude, whereas 1,456 is of a smaller order of magnitude than 10,233). If we were to write out all the numbers from 10 to 99 (which are all in a single order of magnitude), we'd have each digit from 1 to 9 appearing exactly 10 times as the leading digit. This is not very interesting. But let's look at what happens when we have, say, the first 100 powers of 2 (and while we're at it, let's also multiply those numbers by 3 or divide them by π or whatever other scaling you want to do):
The first 100 powers of two range from 1 up to something like 6.3×10^29, which means we cover around 30 different orders of magnitude, and clearly this seems to fit much better. 

A convincing argument for why we'd expect data to follow this distribution is that if there is some kind of distribution that real-life data sets follow, then this distribution shouldn't depend on what units of measurement we use. So if we're looking at some data that's measured in kilometers, and it follows this magic distribution, it should still roughly follow the magic distribution if we change to miles instead. The uniform distribution (i.e. where each digit is equally likely to be a leading digit), doesn't have this property! But as evidenced by this power-of-2 example, the Benford distribution seems to. It's actually the case that the Benford distribution is the only scale-invariant distribution out there. 
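If you want to reproduce the power-of-2 experiment (and the scaling trick) yourself, here's a little Matlab/Octave sketch. The only idea it assumes beyond the post is the standard trick that the leading digit of x is floor(x / 10^floor(log10(x))):

% Leading digits of the first 100 powers of two, compared with Benford's prediction.
powers = 2.^(0:99);                                 % 2^0 through 2^99
lead = floor(powers ./ 10.^floor(log10(powers)));   % leading digit of each power

observed = histc(lead, 1:9) / numel(lead);          % observed frequency of each digit 1-9
benford  = log10(2:10) - log10(1:9);                % Benford's predicted frequencies
disp([(1:9)' observed' benford'])

% Scale invariance in action: multiply everything by 3 (or divide by pi, etc.)
% and the leading-digit frequencies still hug the Benford prediction.
scaled = 3 * powers;
lead_scaled = floor(scaled ./ 10.^floor(log10(scaled)));
disp(histc(lead_scaled, 1:9) / numel(lead_scaled))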

Just how closely does natural data follow the distribution? Well, 1 definitely seems to appear more than most other digits, for most data sets, even if the Benford distribution isn't perfectly followed. Here are a handful of examples. The first chart below gives the leading digits in some data about the 295 most populated US cities. The second chart gives the leading digits in various data on a list of the 176 longest rivers in the world.

The gray bars represent the numbers dictated by the Benford distribution, and the lines represent actual data. Some are decently matched to the Benford distribution (like the drainage area and average discharge for rivers) and others are pretty far off (like the length of the rivers and the population of US cities).

The worst fit is undoubtedly the length of rivers in miles, which looks wildly different from the length of rivers in kilometers. But we could have guessed that the Benford distribution would be a bad fit here by looking at the data. The range is 625 miles to 4,132 miles. This means that all of the 1's that appear as leading digits are from rivers that have length 1,000 miles to 1,999 miles. So the problem is that our data isn't spread over enough orders of magnitude for Benford's law to be applicable.

The first person to make a note of this surprising law was an astronomer named Simon Newcomb. Back in the late 1800s, people used log tables to do computations, which were basically long books with pages and pages of values of different logarithms on them.

A log table. Photo credit: agr.
Newcomb noted that the pages which had numbers starting with 1 were way more worn out than the later pages. So basically we can chalk this discovery up to the fact that new versions of log tables weren't coming out at an iPhone rate (I guess because logs haven't changed much over the years) and everyone had crappy old books full of worn out pages. I guess people weren't that impressed with Newcomb and his old book, because it isn't called Newcomb's law. A guy named Frank Benford published statistics about a mind-boggling number of different data points about half a century later, and people named the law after him. Which is fair enough, because there were 20,229 observations in the paper.

Benford and fraud in practice

Before you go off testing too many data sets, a word of caution: there have been cases where someone was accused of faking data because it didn't fit the Benford distribution, and it turned out the data wasn't fake at all. One such embarrassing misapplication of Benford's law was committed by the US Secret Service. Just because data fails to satisfy Benford's law doesn't mean it's falsified. Benford's law doesn't say that a data set with uniformly distributed leading digits can't exist, for example.

There are people who actually use Benford's law to analyze whether tax documents are likely to be faked or not. An accounting consultant ran the analysis on Bill Clinton's tax returns, and Clinton came out looking pretty good. So either he didn't cheat on his taxes, or he cheated on them pretty well.

So if you try to use Benford's law in your daily life, be a Bill Clinton, not a Secret Service.

Sunday, April 19, 2015

The Unintuitive Nature of Randomness

Humans suck at randomness. I mean we really, really don't have a good intuition for what randomness looks like. What follows is an example that illustrates this total intuition blind spot.

Whenever I teach probability in a math class, I like to start with this experiment (partly because I want to show that intuitive BS arguments are likely to be totally false in this class, and partly because this experiment is neat and memorable and I'm pretty into what my students think of me). I split the class into, say, twenty groups. Ten of the groups are told to flip an actual coin 100 times and record the results, while the other ten groups are told to make up a plausible-seeming sequence of 100 coin flips and write it down. I leave the room for a few minutes, and when I come back I figure out which results are real and which are made up just by looking at the sequences. In practice, I end up being correct about 93% of the time, which is a number I just made up (no but it's accurate). Sorcery! I only look at each of the sequences for a few seconds, and I'm sadly not a mental arithmetic savant or anything. So how does it work?

The answer is that I'm using the aforementioned central tenet of human probability-intuition: we don't have it. My one check is this. If there is a run of length 5 or 6 or longer (by a run, I mean that the same result happened in consecutive flips, so for instance the sequence---where here a white coin represents a tail and a black coin represents a head---


has a run of one tail, followed by a run of two heads, etc.), then I guess that this sequence was generated by actual coin flips. If the longest run is of length 4 or less, then I guess that this sequence is made up. Here's why this works most of the time (we'll talk about where these numbers come from later):

  1. The actual probability of having a run of length at least 5 in a sequence of 100 coin flips is around 97%. The probability of having a run of length at least 6 is a bit over 80%.
  2. Most people not only don't know this first fact but assume that long runs are actually pretty unlikely, so most of the time the made up sequences don't have any runs longer than length 3. The probability that a sequence of 100 coin flips has longest run length 3 or less is around 0.03%.
Let's look at an example. Here is a pair of sequences of coin flips. Each one represents a coin that was flipped 50 times in a row (because 100 didn't fit nicely on the page), with the results recorded each time. Does one look more "random" to you than the other?


or



Both look like something that's reasonable to expect as an outcome, but one of these was generated by an actual series of random flips (simulated by a random number generator, anyway) and the other one is made up by hand to "look" random. From the facts outlined above, a good guess is that the first one is the made up one. The first sequence has longest run length 3, which has about a 2% probability of happening, and so is pretty dang unlikely. The second sequence has longest run length 5, and the probability of a sequence of 50 coin flips having a run of length 5 or longer is about 82%. Not definitive, but likely.

The first thing that goes wrong at this point in the class is that someone takes this to mean that sequence 1 is less likely to come up than sequence 2. In fact, both sequences are equally likely. What? But you just said---NOPE. Think about it this way. Each time you flip a coin you have a 1/2 chance of getting heads, and 1/2 chance of getting tails, and these events are independent of each other (there are mathy definitions for what we actually mean by event and independence but our colloquial understanding of these terms will do here). So the probability of getting


is (1/2)(1/2)=1/4. The probability of getting

is (1/2)(1/2)=1/4. OK, you get the idea.

But this seems kind of weird. Suppose that a friend invited you to play a game in which you must pay him $1 every time a coin comes up heads, and you'll get paid $1 every time a coin comes up tails (which certainly seems like, and is, a fair game). If the first ten coin flips are



then this seems legit. Heads and tails have both come up five times, so the game is tied up, and that seems like a reasonable outcome for a fair game. Now suppose that the first ten coin flips are


You'd probably be suspicious as heck because you have yet to make any money and your friend has already won $10 of yours. (If you're not convinced yet: imagine you get another ten heads in a row. Your friend is obviously cheating somehow. Another ten heads, for a total of thirty in a row, and you've cut all ties with this former friend of yours and don't regret it one bit because he's a filthy lying scumbag.) Well, the probability of getting 10 heads in a row is (1/2)^10 ≈ 0.00098, so you're right about the fact that he's probably a scumbag. The reason that the outcome that gave five heads and five tails seems less dubious is that there are actually 252 different ways to flip five heads and five tails (because 252 = "10 choose 5," but we won't go into that). There is only one outcome that gives exactly ten heads. So while any particular outcome that consisted of five heads and five tails is only going to occur with minuscule probability (1/2)^10, one of these 252 outcomes will occur with probability 252·(1/2)^10 ≈ 0.25, or 252 times more often than the all-heads scenario.
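If you don't feel like trusting my arithmetic, these numbers are easy to check at a Matlab/Octave prompt (again, just my tool of choice, nothing special about it):

nchoosek(10,5)               % 252: the number of ways to arrange five heads among ten flips
(1/2)^10                     % about 0.00098: the probability of any one specific 10-flip sequence
nchoosek(10,5) * (1/2)^10    % about 0.246: the probability of exactly five heads and five tails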

So that's a quick example that helps illustrate the fact that human beings just aren't good at "sensing" randomness, and all the education in the world doesn't change our crappy intuition, even if we should know better. In the immortal words of Ygritte and Seth Meyers, when it comes to randomness:



Bonus math stuff

Skip this part if you've had enough.

If you're wondering where the probabilities referenced above come from, here's a first, easier-to-answer question that can improve our misguided intuition. If you repeat the flipping-100-coins experiment many times (like 1,000,000 times), how many runs of length 5 should we get on average in each experiment? So, we have 100 flips, and a run of 5-in-a-row can start at flip 1, flip 2, flip 3, ... , all the way out to flip 96 (if we started any later than that, we'd run out of coin flips). So there are 96 places for a length 5 run to start. As we discussed above, any particular set of 5 consecutive flips is going to be all-heads with probability (1/2)^5 and all-tails with probability (1/2)^5, for a total probability of 
(1/2)^5 + (1/2)^5 = (1/2)^4 = 1/16. 
To figure out the average number of length 5 runs per experiment, we can simply add up the average contribution from each of the 96 sets of 5 consecutive coin flips. On average, each one of these is a length 5 run 1/16 of the time, so in total we have 96/16 = 6 length 5 runs per experiment. Note that the way we're counting them, the sequence 


actually has two length 5 runs, so these aren't necessarily separate runs. But still, if on average there are six in an experiment, intuition says that it should be pretty likely that we find at least one. 
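If you'd rather check that "six on average" claim with a simulation than with the back-of-the-envelope argument, here's a quick Matlab/Octave sanity check (it counts all-heads or all-tails windows of five consecutive flips, which is exactly the counting scheme described above):

% Average number of length-5 windows that are all heads or all tails in 100 flips.
num_experiments = 10000;
total = 0;
for i = 1:num_experiments
    flips = randi(2, 1, 100) - 1;        % 100 flips: 0 for tails, 1 for heads
    for start = 1:96                     % a length-5 window can start at flips 1 through 96
        window = flips(start:start+4);
        if all(window == window(1))      % all five flips in this window came up the same
            total = total + 1;
        end
    end
end
disp(total / num_experiments)            % should land near 96/16 = 6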

Okay, so that's intuition-building. But we did say that our intuition is not to be trusted, so you really shouldn't be satisfied right now. To get the actual numbers claimed above, we can do some clever but pretty complicated math, or we can simply run a simulation. Here's an example of what a simulation looks like, in a piece of math software called Matlab (there's a free Matlab equivalent called Octave that has most of the same functionality). If you have access to this program, you can run this function as written. If you have some facility with some other programming language, you can write a program like this one. If you have no idea what I'm talking about, just ignore what follows.

function probability = runCounter(run_time,num_flips,n)
% runCounter outputs the proportion of simulations among the run_time
% simulations tested in which there was a run of length n or greater among
% the num_flips flips of a coin.

run_of_n = zeros(1,run_time);   % run_of_n(i) is 1 if simulation i had a run of length n or more

for i=1:run_time
    longest = 1;    % longest run seen so far in this simulation
    counter = 1;    % length of the current run
    flip = 2;       % dummy value so the first real flip never extends a run

    for j=1:num_flips
        last_flip = flip;
        flip = randi(2)-1;      % 0 for tails, 1 for heads

        if flip == last_flip
            counter = counter + 1;
            longest = max(longest,counter);
        else
            counter = 1;
        end
    end

    if longest >= n
        run_of_n(i) = 1;
    end
end

probability = sum(run_of_n)/run_time;

end
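And here's roughly how you'd call it to reproduce the numbers quoted earlier (it's a simulation, so your results will wobble a bit from run to run):

runCounter(10000, 100, 5)       % probability of a run of length 5 or more in 100 flips; lands near 0.97
runCounter(10000, 100, 6)       % probability of a run of length 6 or more; lands near 0.80
1 - runCounter(10000, 50, 4)    % probability that 50 flips have no run longer than 3; lands near 0.02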

Tuesday, April 14, 2015

The Four Color Problem: the Illustrious History

Remember being a kid being dragged to a restaurant with grownups? The one consolation was when they'd hand you a paper placemat with the outline of the US or a parrot or something and that set of four crayons. Why was it always only four crayons? Come on, Friendly's, more than half a billion in revenues and you couldn't give us one more dang crayon? Dang! Well, turns out you really didn't need to complain so much, at least if you were going for efficiency in your coloring practice. Four crayons isn't a whole lot, but there's a pretty deep math theorem that says it's enough.

Theorem: Four crayons suffice.

But okay, we should probably be clearer about what that means. Let's think of the uncolored picture on the placemat like a map. You can use whatever color you want in each region, but you don't want two regions that border each other to get the same color. So let's restate the theorem with this constraint.

Theorem: Suppose you have a picture, consisting of any number of regions, that you want to color so that no two bordering regions get the same color. Then four crayons suffice.

And if that isn't applied math, I don't know what is. If you doubt the direct applicability of this theorem to your life, just buy yourself a grown-up coloring book and four colored pencils. It should be noted that though you'll never need more than four colors, it's not always easy to actually find a way to color a picture with only four. There was even a famous April Fools' joke gone awry in which Martin Gardner published a picture that he claimed required five colors to color, and a bunch of mathematicians actually believed this meant that the theorem had been disproven. Here's Gardner's picture. See if you can color it using only four colors.

Fig. 1. This picture requires five colors to color, haha jk lol April Fool's it only takes four.


By the way, can we think of any uncomplicated examples where we need four? If you draw a few simple pictures without thinking about it too hard, you may not end up with anything that requires more than three colors. To illustrate where three don't suffice, here's a picture of a bird. 
Fig. 2. This is a bird, okay?
You might be wondering if this is really a picture of a bird and if so, what is with that middle thing that has to be its wing, and can a bird with a wing that looks like a potato fly or is this one of those flightless things like a dodo or a penguin, and my answer to you is that we're talking about MATHEMATICS here, not art or zoology, so focus. Here is an example of how we can color this bird using only four colors, which is all this tuber-winged air skunk is good for right now.
Fig. 3. Now it's obviously a bird, right?
Convince yourself that you couldn't get away with three colors here. (The problem is with the "wing" in the middle. It's touching three other regions, all of which touch each other, so coloring those four regions with any three colors would necessarily result in two bordering regions sharing a color. I don't know why the mouth-triangle and the eye are red, okay, so stop asking. Maybe the bird is a demon and it bathes its flesh-tearing beak in the blood of the vanquished in sacrifice to its unholy idol. Look, you're getting really distracted here. We have to get back to the math.)
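If you'd rather let a machine do the convincing, here's a brute-force sketch in Matlab/Octave (the adjacency matrix is just my own encoding of "four regions that all touch each other," like the wing and its three neighbors; it's not read off the bird picture itself). It tries every way of handing 3 colors to those 4 regions and reports whether any assignment avoids a clash:

% Four mutually bordering regions: adjacency(i,j) = 1 means regions i and j share a border.
adjacency = ones(4) - eye(4);
num_colors = 3;
found_valid = false;

% Try all 3^4 = 81 ways of assigning one of 3 colors to each of the 4 regions.
for c1 = 1:num_colors
  for c2 = 1:num_colors
    for c3 = 1:num_colors
      for c4 = 1:num_colors
        colors = [c1 c2 c3 c4];
        valid = true;
        for i = 1:4
          for j = i+1:4
            if adjacency(i,j) == 1 && colors(i) == colors(j)
              valid = false;            % two bordering regions got the same color
            end
          end
        end
        found_valid = found_valid || valid;
      end
    end
  end
end

disp(found_valid)   % prints 0: no assignment of 3 colors works, so four really are needed here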

Okay, so four colors are sometimes necessary, and the theorem---clearly true because it is in bold---tells us that we'll never need more than that. The theorem is so easy to state that you might think this problem is not very substantive. To address that, let's talk a little about this problem's lengthy and colorful (ha, ha) history, which should quickly give the idea that proving that theorem is going to be Really Dang Hard.


Fig. 4. How it all went down.

It's 1852 and Francis Guthrie, a future-mathematician-slash-botanist in London, is coloring a map of the counties of England, because math grad school leaves you with a lot of free time. Hold up, he thinks. He's in England though so it comes out, Hark! Verily! What if like, one day, there were chain restaurants that distracted small children from their ennui with sets of crayons? How many crayons shall suffice? Based on his map, I guess, the answer seemed to be four, and he couldn't come up with a map that needed more than that, and lo!---a conjecture was born. A couple of years later, he (or maybe his brother? or maybe someone else entirely with the same initials as him?) published the problem in The Athenaeum, the New Yorker of 19th century London, but fancier because there are three vowels in a row in its name.  But nobody cares what a grad student thinks, and what kind of self-respecting fancy pants needs to concern himself with such paltry questions? (Fancy people have lots of crayons.) So it wasn't until Guthrie's super-famous advisor Augustus De Morgan took an interest in the problem that people really got into it.

De Morgan wrote to this guy, Sir Hamilton, and told him about the problem. Sir Hamilton wrote back,
I am not likely to attempt your quaternion of colour very soon.

Because that is how people talked and it was awesome. So after sitting on it for a few more years, De Morgan published the same problem in the same dang magazine in 1860, and that got attention. Francis was probably pissed that his advisor stole his glory, though.

So now all these high-powered mathematicians are into it, but nothing happens for a while. And then, 19 years after De Morgan's publication, this barrister Alfred Kempe comes out of nowhere with a proof that impresses the pants off all these hoity-toities. Barrister is British for "lawyer in a wig" I'm guessing. Everyone is impressed as heck. Francis Guthrie was a barrister for a while too, by the way. They were both trained mathematicians as well, but the wig game isn't as strong in that industry.

The next year, another guy comes on the scene with another proof. This guy, Peter Guthrie Tait, is actually a physicist. He also says something like,

The four color conjecture is true if and only if no snarks are planar.

Nobody knows what he's talking about, so they just nod. By the way, Tait studied the ozone, which is kinda neat, and quaternions, which are weird kinds of numbers. He also hung out with Sir Hamilton from earlier and was really into golf.

Things are pretty good for a while after that. But then in 1890, Percy Heawood finds a flaw in Kempe's proof, and then in 1891, Julius Petersen finds a flaw in Tait's proof, and we're left totally proofless. Heawood is a gloriously mustachioed but tiny little dude who wears a cape, carries a ratty old handbag, and always brings his little dog to his lectures. Also, he goes by Pussy, so.

Pussy salvages Kempe's proof somewhat and uses it to prove the Five Color Theorem by using something called Kempe chains. The Five Color Theorem says that five colors suffice to color any picture (with the same constraints we made earlier). The argument is pretty short and sweet and is taught in undergrad math classes now. But back in 1891, we're still nowhere with the Four Color Problem, and over the next 64 years pretty much all that happens is that this guy Hugo Hadwiger comes up with a problem that's a vast generalization of this one, and it's still unsolved, and it just serves to depress everyone further.

Meanwhile, in Germany, Heinrich Heesch is doing his thing at his university, and then the 1930's come along and the Nazis say that you can't work in a university unless you become a member of the Nazi party. (Hitler tries to do a lot of crazy stuff at this point, and we'll just fast forward.) Heesch decides on a firm no thanks to that, and quits. He moves into his parents' basement in 1935, where he stays until 1948 (which seems longer than strictly necessary, given that WWII ends in 1945). When he emerges, he's awesome. This is basically a mathematical superhero origins story. In 1955, he starts working on using computers to prove stuff. Which we should take a second to appreciate, because this is what computers looked like at the time.

Fig. 5. What the hell is this?
This starts an arms race throughout the 1960's and 1970's over who can use computers to solve the Four Color Problem, and basically everyone is using Heesch's methods. Heesch is close, maybe, but it's a pair of mathematicians at the University of Illinois, Appel and Haken, who finally beat him to it in 1976. Their paper is long and basically unreadable, and they themselves admit it:
...50 pages containing text and diagrams, 85 pages filled with almost 2500 additional diagrams, and 400 microfiche pages that contain further diagrams and thousands of individual verifications of claims made in the 24 lemmas in the main section of text. In addition ... certain facts have been verified with the use of about 1200 hours of computer time and would be extremely time-consuming to verify by hand. ... few mathematicians have read [the paper] in any detail. 
The upshot is that not a whole lot of people actually understand the proof, and to this day there are those who consider it controversial to use a computer to prove something. Some mathematicians are still looking for a non-computer proof, but nobody has succeeded.