2008-04-13


So I've had this idea bouncing around in my head and I'm not sure why. Well, that's not quite true; if I were really that unsure I don't think I'd post this. Maybe it comes from what's going on with a project called StupidFilter, which is pretty much what its name would suggest: its aim is to create a software filter that removes stupid blog comments (or any other content that you'd like to filter). Note that by my use of 'stupid' above, I'm not really talking about stupidity per se (because detecting that is very likely a hard problem). No, what I'm talking about may be more akin to a symptom of stupidity, or maybe just gross violations of proper English, which doesn't really speak to a person's intelligence...

I had been thinking about what would be the simplest, most brain-dead method for identifying stupidity in text. The first thing that sprang to mind was entropy. Entropy, at least as it applies to information theory, is a measure of how much uncertainty there is in something. A fair coin toss has 1 bit of entropy: with two outcomes, each of probability 1/2, we get -1/2 * log2(1/2) + -1/2 * log2(1/2) = 1. I wrote the following code here (which you can actually run, thanks codepad!).
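Roughly, the calculation looks something like this (a minimal sketch rather than the exact codepad program; it counts only letters, case-folded, using base-2 logarithms, and the function name and sample strings are just placeholders):

from collections import Counter
from math import log2

def entropy(text):
    # Shannon entropy, in bits per letter, of the letters in `text`.
    # Only alphabetic characters are counted (case-folded), matching the
    # "letters only, no punctuation" approach described above.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Two equally likely symbols give 1 bit, just like the coin toss.
print(entropy("ht"))

# Compare an ordinary sentence with a mangled comment.
print(entropy("Some ordinary English text that I typed in myself."))
print(entropy("OMG!!1 lolz ur so rite thx 4 postin dis"))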
You'll notice that the text from a YouTube comment has a higher entropy than some text that I typed in myself. I'm actually just going by the letters; no punctuation is included. My hypothesis here is that badly mangled text will have a higher entropy than normal English. Perfectly random text (i.e. text with all 26 letters equally likely) has an entropy of log2(26), about 4.7 bits. This makes sense, since if you wanted to encode all the letters of the alphabet in binary, you'd need at least 5 bits (2^5 = 32, the first power of two greater than 26). I think that if I included punctuation and all that I might get a better "reading", because an exclamation point, being rare in normal text at least, would have a longer Huffman coding (I haven't really thought about this, so I could be wrong) and thus lend more to the entropy.

My next idea was taken from my cryptography class. English has a characteristic frequency distribution for its letters (and numbers, punctuation, etc.) that we can exploit; a string of 25 'z's in a row doesn't "look" like any standard sentence. We expect, roughly, that as the length of an English text increases, the frequencies of the letters should approach 13% 'e', 9% 't', 8% 'a', and so on (list here). I wrote a very simple program here; it takes the sum of the squared differences from the "normal" distribution as a "distance" measurement (there's a sketch of the idea below). We would expect a very long English text to score very close to zero (i.e. its frequency distribution will tend to match the reference one).

Both of these tests are really simple and would need a lot of work (multiplying them together? weighting?) to be in any way practical. But both constitute a very simple test of English-ness: basically, does the target text resemble English (at least statistically)?
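Here's a rough sketch of that frequency-distance test (again, not the exact codepad program; the reference table uses commonly quoted approximate English letter frequencies, which may not match the list linked above exactly):

import string
from collections import Counter

# Approximate relative frequencies (in percent) of letters in English text.
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.10, 'z': 0.07,
}

def english_distance(text):
    # Sum of squared differences (in percentage points) between the letter
    # frequencies of `text` and the reference English distribution.
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    total = len(letters)
    counts = Counter(letters)
    return sum((100.0 * counts[c] / total - ENGLISH_FREQ[c]) ** 2
               for c in ENGLISH_FREQ)

# An ordinary English sentence should score far lower than a run of 'z's.
print(english_distance("This is a fairly ordinary English sentence, typed out by hand."))
print(english_distance("z" * 25))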


