07.20.07

Plura

Posted in Uncategorized at 11:18 pm by Twm

I should be ashamed of myself: I have a degree in computer science and I still misuse the word data by treating it as a singular.
One item of data is of course a datum. But these days we would have a problem with “the data are large” vs “the data is large”, with the latter sounding more natural. If the word data is replaced with a familiar plural such as bananas, then we get “the bananas are large” vs “the bananas is large” and it becomes a lot easier to identify the correct form.

Fortunately, the English language has been raped and battered throughout the ages and now data as a singular is the norm, referring to a wad of information. It’s less to do with the particular case of the word data than with the problem most of us have with a plural ending in a. It just sounds like a singular noun.

There are loads of other examples, but two I tried to use today (and got strange looks) were:
erratum – errata
agendum – agenda

Making up plurals is rife in the medical profession. Here is a good article on the subject:
Link : Medical Plurals

Widget wank

Posted in AJAX, Mobile, javascript at 10:51 pm by Twm

We had a presentation at work today by some enthusiastic folk from the wireless industry which took me back to 1999.
As part of the talk, the words “widgesphere” and “widgetization” were exchanged without a trace of irony. This of course relates to the hype wagon which surrounds the widget engine which Nokia announced recently.

It will be nice to have another application programming environment for ‘creative types’, but I tried to convince my manager to just replace the word widget with app and see if the proposition was still as revolutionary. For me the word widgets just means something you spend an hour choosing and configuring when you get a new PC and then disable after a couple of weeks when you realise that you can see what the weather is like by sticking your head out of the window.

I’d much prefer to write web services using AJAX than C++ and thousands of _LIT() statements, but as an app development environment it’s going to need access to native APIs and a persistence model (something like Google Gears) to float my boat. Once you start adding lots of native calls, it starts to look like a clunky version of Flash without debugging tools.

You’re making it up

Posted in Python, maths at 10:25 pm by Twm

Most of us have faked data in columns at some point in our lives, be it the results from the GCSE lab experiment which went wrong or some time sheet for the week before the holiday. It’s not hard to imagine that humans in general are pretty bad at making up convincing numbers; we are especially bad at picking extremes, which makes us bad at generating random numbers.
There are several tests for determining the randomness of data, but there is a surprising amount of data which isn’t random. Quantitative measurements (e.g. the height of a building, the lengths of phalli or the value of the stocks on the stock exchange), although they may seem random, exhibit an interesting property called Benford’s law which may help in detecting made-up values in non-random data.

Benford’s law is a curious little observation about non-random numbers which states that for a set of numbers, the leading digit will be a ‘1’ around 30% of the time, a ‘2’ around 18% of the time, and ever decreasing until ‘9’, which will only be present as the first digit about 5% of the time.
The classic example is the lengths of the world’s rivers: if you measure them and count the number of times each of the digits 1-9 is the leading digit, then Benford’s observation can be confirmed. Since the law is ‘scale-invariant’ (it’s not affected by multiplying or dividing the numbers by a constant), it applies whether you measure the length of the rivers in meters, feet or inches.
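
The percentages above come from the usual statement of Benford’s law, P(d) = log10(1 + 1/d) for a leading digit d. Here is a rough Python sketch that prints those expected values and checks the scale-invariance claim. The "river lengths" are synthetic numbers spread evenly on a log scale over six orders of magnitude, not real measurements, and all the function names are my own for illustration.

```python
import math
import random
from collections import Counter

def benford_expected(digit):
    """Expected frequency of `digit` (1-9) as a leading digit."""
    return math.log10(1 + 1 / digit)

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

def digit_frequencies(values):
    """Observed frequency of each leading digit 1-9 in `values`."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

# Hypothetical data: synthetic "lengths in metres", not real rivers.
random.seed(1)
metres = [10 ** random.uniform(0, 6) for _ in range(100_000)]
feet = [m * 3.28084 for m in metres]  # same lengths, different unit

in_metres = digit_frequencies(metres)
in_feet = digit_frequencies(feet)

# Columns: digit, Benford prediction, observed in metres, observed in feet.
for d in range(1, 10):
    print(d, round(benford_expected(d), 3),
          round(in_metres[d], 3), round(in_feet[d], 3))
```

The last two columns should stay close to the prediction whichever unit is used, which is all that scale invariance means here.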

The graph below compares the expected frequency of leading digits in truly random numbers (all roughly 0.11, since only the digits 1 to 9 can lead) with Benford’s numbers.
Freq of random numbers

For any set of data which is known to follow Benford’s law, it makes sense that significant deviation from the frequencies of leading digits predicted by Benford could indicate foul play in the data.
The IRS is thought to be running this sort of algorithm in order to filter out suspect tax claims for closer scrutiny.

In the absence of adequate measurements of penis sizes (sample size = 1), I thought I would try with some more readily available data – the size of files on my hard disk. I tried with both c:\windows and the temp directory in my user area.
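
Collecting the numbers for a directory could look something like the sketch below. It is not the linked source: the function names are made up, and it assumes that only the files directly inside the directory are counted (no subdirectories) and that zero-byte files are skipped, since they have no leading digit.

```python
import os
from collections import Counter

def file_sizes(directory):
    """Sizes in bytes of the regular files directly inside `directory`."""
    sizes = []
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes.append(entry.stat().st_size)
    return sizes

def leading_digit_frequencies(sizes):
    """Frequency of each leading digit 1-9, ignoring empty (0 byte) files."""
    digits = [int(str(size)[0]) for size in sizes if size > 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}

def report(directory):
    """Print one line per digit: the digit and its observed frequency."""
    frequencies = leading_digit_frequencies(file_sizes(directory))
    for digit, freq in frequencies.items():
        print(digit, round(freq, 2))

# Example call (the path is only an illustration):
# report(r"c:\windows")
```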

Windows directory (sample size = 41)

Digit	E 	A	D
1	0.3	0.21	0.09
2	0.18	0.13	0.05
3	0.12	0.11	0.01
4	0.1	0.27	0.17
5	0.08	0.08	0
6	0.07	0.05	0.02
7	0.06	0.05	0.01
8	0.05	0.03	0.02
9	0.05	0.08	0.03

E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Leading digit of file sizes

The sample size is pretty small, but already this shows signs of the decreasing Benford curve. Note that the digit five is the only one which matches perfectly.
The thing to note is how close the overall match of the curve is compared to the random case, where each digit’s probability is roughly 0.11. So the digit ‘1’ looks way off at 0.21, but is still roughly twice as frequent as in the random case.
The sum of the differences is 0.45 (the rounded values shown in the table add up to 0.40).


Temp directory (sample size = 143)

Digit	E 	A	D
1	0.3	0.45	0.15
2	0.18	0.17	0.01
3	0.12	0.08	0.04
4	0.1	0.09	0.01
5	0.08	0.08	0
6	0.07	0.04	0.03
7	0.06	0.02	0.04
8	0.05	0.03	0.02
9	0.05	0.03	0.02
E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Files in my temp directory

With a sample size of over a hundred, the Benford curve reveals itself quite nicely. Again note that 5 is spot on, but this time the sum of the differences is 0.32, which is closer to zero, meaning that the sample data is a closer match to the prediction than the previous sample.
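
The sum of the differences is just the absolute gap between the E and A columns, added up over the nine digits. A small sketch using the rounded figures from the temp directory table (the function name is mine, not taken from the linked source):

```python
# Columns E and A from the temp directory table, digits 1 to 9.
expected = [0.30, 0.18, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05]
actual = [0.45, 0.17, 0.08, 0.09, 0.08, 0.04, 0.02, 0.03, 0.03]

def sum_of_differences(expected, actual):
    """Add up the absolute difference for each digit; 0 is a perfect fit."""
    return sum(abs(e - a) for e, a in zip(expected, actual))

score = round(sum_of_differences(expected, actual), 2)
print(score)  # 0.32
```

The lower the score, the closer the observed leading digits are to Benford’s prediction.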


Made up data (sample size = 36)

Finally, here are the frequencies of the leading digits when using a list of made-up file sizes (I made them up myself).

Digit	E 	A	D
1	0.3	0.14	0.16
2	0.18	0.11	0.06
3	0.12	0.22	0.1
4	0.1	0.19	0.1
5	0.08	0.03	0.05
6	0.07	0.06	0.01
7	0.06	0.06	0
8	0.05	0.08	0.03
9	0.05	0.11	0.07

E = Expected frequency (according to Benford)
A = Actual observed
D = The difference (i.e. zero means no difference)

Made up file sizes

With a sum of differences of 0.6, the data represents a lousy fit, and inspection of the graph confirms that the pattern has been disrupted significantly. It looks as if I have been caught.

link : The source

07.14.07

KErrWhat?

Posted in Python, Symbian, c++ at 9:57 pm by Twm

I got absolutely sick of looking up Symbian leave codes and was glad to come across a list on newlc.
http://newlc.com/Symbian-OS-Error-Codes.html

To ease the lookup further, here is a Python script which scrapes the website into a database and allows you to look up error codes from the command line.

e.g:

C:\kerrwhat -12
KErrPathNotFound [Unable to find the specified folder]

Link: The Source
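
For a flavour of the lookup half, here is a minimal sketch. It is not the linked script: it skips the scraping entirely and loads a handful of well-known codes into an in-memory SQLite database. The table layout and function names are assumptions, and only the KErrPathNotFound description comes from the example above.

```python
import sqlite3

# A few well-known Symbian error codes. Only the KErrPathNotFound
# description is taken from the post; the other rows are left without one.
SAMPLE_CODES = [
    (0, "KErrNone", ""),
    (-1, "KErrNotFound", ""),
    (-2, "KErrGeneral", ""),
    (-4, "KErrNoMemory", ""),
    (-12, "KErrPathNotFound", "Unable to find the specified folder"),
]

def build_database(rows, path=":memory:"):
    """Create (or reuse) the errors table and load `rows` into it."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS errors "
        "(code INTEGER PRIMARY KEY, name TEXT, description TEXT)"
    )
    db.executemany("INSERT OR REPLACE INTO errors VALUES (?, ?, ?)", rows)
    db.commit()
    return db

def lookup(db, code):
    """Return 'Name [description]' for `code`, or a note if it is unknown."""
    row = db.execute(
        "SELECT name, description FROM errors WHERE code = ?", (code,)
    ).fetchone()
    if row is None:
        return f"Unknown error code {code}"
    name, description = row
    return f"{name} [{description}]" if description else name

db = build_database(SAMPLE_CODES)
print(lookup(db, -12))  # KErrPathNotFound [Unable to find the specified folder]
```

Passing a file name instead of ":memory:" would keep the table between runs, which is presumably the point of scraping the list into a database in the first place.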