07.20.07

You’re making it up

Posted in Python, maths at 10:25 pm by Twm

Most of us have faked data in columns at some point in our lives, be it the results from the GCSE lab experiment which went wrong or some time sheet for the week before the holiday. It’s not hard to imagine that humans in general are pretty bad at making up convincing numbers. We are especially bad at picking extremes, which makes us bad at generating random numbers.
There are several tests for determining the randomness of data, but there is a surprising amount of data which isn’t random. Quantitative measurements (e.g. the height of a building, the lengths of phalli or the value of the stocks on the stock exchange), although they may seem random, exhibit an interesting property called Benford’s Law, which may help in detecting made-up values in non-random data.

Benford’s law is a curious little observation about non-random numbers which states that, for a set of numbers, the leading digit will be a ‘1’ around 30% of the time, a ‘2’ around 18% of the time, and so on, ever decreasing until ‘9’, which will only be present as the first digit around 5% of the time.
The classic example is river lengths: if you measure the lengths of the rivers of the world and count the number of times each of the digits 1-9 is the leading digit, Benford’s observation can be confirmed. Since the law is ‘scale-invariant’ (it’s not affected by multiplying or dividing the numbers by a constant), it applies whether you measure the length of the rivers in meters, feet or inches.
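
The percentages quoted above come from the usual statement of Benford’s law, P(d) = log10(1 + 1/d) for a leading digit d between 1 and 9. The post itself doesn’t spell the formula out, so treat the following as a minimal sketch; the function name `benford_expected` is made up for illustration.

```python
import math

def benford_expected(digit):
    """Probability that `digit` (1-9) is the leading digit under Benford's law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)

# Print the expected frequency for every possible leading digit,
# rounded to two decimal places.
for d in range(1, 10):
    print(d, round(benford_expected(d), 2))
```

Rounded to two decimal places, these values are the same as the E column in the tables further down.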

See the graph below, which compares the expected frequency of leading digits in truly random numbers (each digit equally likely, roughly 0.1) with Benford’s numbers.
Freq of random numbers

For any set of data which is known to follow Benford’s law, it makes sense that a significant deviation from the frequencies of leading digits predicted by Benford could indicate foul play in the data.
The IRS is thought to be running this sort of algorithm in order to filter out suspect tax claims for closer scrutiny.

In the absence of adequate measurements of penis sizes (sample size = 1), I thought I would try with some more readily available data: the size of files on my hard disk. I tried with both c:\windows and the temp directory in my user area.
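
Collecting this kind of data takes only a few lines of Python. The sketch below is not the script linked at the end of the post; it assumes that only regular files directly inside the directory are counted (no recursion into subdirectories) and that empty files are skipped. The names `leading_digit` and `leading_digit_frequencies` are made up for illustration.

```python
import os
from collections import Counter

def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])

def leading_digit_frequencies(directory):
    """Relative frequency of each leading digit (1-9) among file sizes.

    Assumption: only regular files directly inside `directory` are counted,
    and empty files are skipped because a size of 0 has no leading digit
    between 1 and 9.
    """
    counts = Counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                if size > 0:
                    counts[leading_digit(size)] += 1
    total = sum(counts.values())
    if total == 0:
        # No usable files: report a zero frequency for every digit.
        return {d: 0.0 for d in range(1, 10)}
    return {d: counts[d] / total for d in range(1, 10)}
```

Calling `leading_digit_frequencies(r"c:\windows")` would then return a dictionary mapping each digit from 1 to 9 to its observed frequency, which is what the A column in the tables below records.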

Windows directory (sample size = 41)

Digit	E 	A	D
1	0.3	0.21	0.09
2	0.18	0.13	0.05
3	0.12	0.11	0.01
4	0.1	0.27	0.17
5	0.08	0.08	0
6	0.07	0.05	0.02
7	0.06	0.05	0.01
8	0.05	0.03	0.02
9	0.05	0.08	0.03

E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Leading digit of file sizes

The sample size is pretty small, but already this shows signs of the decreasing Benford curve. Note that the digit five is the only one which matches perfectly.
The thing to note is how close the overall match of the curve is compared to the random case, where each digit’s probability is roughly 0.1. So the digit ‘1’ looks way off at 0.21, but it is still twice as frequent as in the random case (0.1).
The sum of the differences comes out at 0.45 (the rounded figures in the D column add up to 0.40).


Temp directory (sample size = 143)

Digit	E 	A	D
1	0.3	0.45	0.15
2	0.18	0.17	0.01
3	0.12	0.08	0.04
4	0.1	0.09	0.01
5	0.08	0.08	0
6	0.07	0.04	0.03
7	0.06	0.02	0.04
8	0.05	0.03	0.02
9	0.05	0.03	0.02

E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Files in my temp directory

With a sample size of over a hundred, the Benford curve reveals itself quite nicely. Again note that 5 is spot on, but this time the sum difference is 0.32, which is closer to zero, meaning that the sample data is a closer match to the prediction than the previous sample.
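
The ‘sum difference’ appears to be the total of the D column, i.e. the sum of the absolute differences between expected and observed frequencies. That reading is an assumption rather than something confirmed from the linked source, and `sum_of_differences` is a made-up name; as a minimal sketch, using the E and A columns of the temp directory table:

```python
# Expected (Benford) and observed frequencies for digits 1-9,
# copied from the E and A columns of the temp directory table.
expected = [0.30, 0.18, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05]
observed = [0.45, 0.17, 0.08, 0.09, 0.08, 0.04, 0.02, 0.03, 0.03]

def sum_of_differences(expected, observed):
    """Total absolute deviation between two lists of digit frequencies."""
    return sum(abs(e - a) for e, a in zip(expected, observed))

# Round to two decimals to hide floating-point noise.
print(round(sum_of_differences(expected, observed), 2))  # prints 0.32
```

A score of zero would mean a perfect match with Benford’s prediction; the larger the score, the worse the fit.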


Made up data (sample size = 36)

Finally, here are the frequencies of the leading digits when using a list of made-up file sizes (I made them up myself).

Digit	E 	A	D
1	0.3	0.14	0.16
2	0.18	0.11	0.06
3	0.12	0.22	0.1
4	0.1	0.19	0.1
5	0.08	0.03	0.05
6	0.07	0.06	0.01
7	0.06	0.06	0
8	0.05	0.08	0.03
9	0.05	0.11	0.07

E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Made up file sizes

With a sum difference of 0.6, the data represents a lousy fit and inspection of the graph confirms that the pattern has been disrupted significantly. It looks as if I have been caught.

link : The source
