Friday, February 24, 2012

Captcha - Reading along the Lines

You know those little puzzles you get on some web sites that make you translate blurred and distorted pictures of letters and numbers before you can get whatever it is you are after...

...well there's more to them than meets the eye.

The most common variety is 'captcha' (as in capture). They have two words and look like this:

The puzzle checks to see that you are a 'real' person before allowing you to access the site. People are quite good at deciphering them, computers aren't. And there are other people 'out there' who write programmes which log into millions of sites automatically to steal email addresses, leave links, sell Viagra, break into your bank account, steal your children and generally make life difficult. But so far computers generally find this sort of puzzle too difficult and cannot enter the sites protected by these garbled guardians.

In an intriguing bit of double-think Google has come up with a practical use for this process.

You may be aware that Google are in the process of digitising a shed load of printed material including millions of books, magazines and newspapers.

All the books in the world in Google's shed...
The intention is to make them available online in an easily searchable form. They scan the pages of print and use OCR (optical character recognition) programmes to convert the pictures into words, sentences and paragraphs. Sometimes the computers cannot come up with an acceptable interpretation of a particular word. This may be because the text has been printed at a small size on absorbent paper, printed on a fold, scribbled over, or otherwise damaged.

Books damaged by a deluge of 1988 Chardonnay following
an accident in the basement of a restaurant next to a
rare book dealer in the Charing Cross Road
Google has a 'captcha' application called ReCaptcha. It uses these garbled words as the fodder for its puzzles. ReCaptcha looks like this:

Every time we 'solve' one of these puzzles our answer is sent back to Google who use it to suggest a meaning for a word that their character recognition programme could not understand.

This is a nifty implementation of distributed computing (where a massive task is divided up between a load of people who all do a little bit each) but it does give rise to an interesting scenario and one big question.

The scenario concerns what happens if the answers we supply are fed back into the character recognition programme so that it learns to decipher garbled words. This sort of feedback loop must be an irresistible temptation to the programmers. It is elegant in the extreme and will help speed up the process - but if the programme escapes and the bad guys get hold of it they will destroy the 'is this a real person' test.

I expect the answer is 'We wouldn't do that' or alternatively 'It won't happen'. Hmmm, we shall see...

The big and more obvious question is: If they do not know what the word is, how do they know we have typed in the correct letters?

Of course they have thought of this. As noted above, there are always two words. The reason is that one word is known and the other is an unknown word from the scanning project. They work on the basis that if you get the known one right, the unknown one will be either right or nearly right. They will collect a load of suggestions for the unknown word and (probably) choose the most popular. An exercise in democracy in translation (discuss).
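For the curious, here is a rough sketch in Python of how the two-word trick might work. It is only my guess at the logic, not Google's actual code: the function names, the example words and the three-vote threshold are all made up for illustration.

```python
from collections import Counter


def passes_test(known_word, typed_known):
    """Only the known word decides whether the visitor passes."""
    return typed_known.strip().lower() == known_word.lower()


def record_guess(votes, known_word, typed_known, typed_unknown):
    """Keep the guess for the unknown word only if the known word was right."""
    if passes_test(known_word, typed_known):
        votes[typed_unknown.strip().lower()] += 1
        return True
    return False


def most_popular(votes, min_votes=3):
    """Return the leading suggestion once it has enough votes, else None.

    The threshold of 3 is an assumption chosen for this example.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Hypothetical puzzle: the known word is "morning", the unknown word is a smudged scan.
votes = Counter()
attempts = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "open"),  # known word wrong, so this guess is ignored
    ("morning", "up0n"),
    ("morning", "upon"),
]
for typed_known, typed_unknown in attempts:
    record_guess(votes, "morning", typed_known, typed_unknown)

print(most_popular(votes))  # upon
```

Notice that whatever is typed for the unknown word never affects whether the visitor gets in; it only adds to the tally.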

But as far as the 'is this a real person' test is concerned, only the known word counts. So next time you are having trouble with one of the words, the chances are 50/50 at worst that you need to get it right at all. In fact the easier it is to read the more likely it is to be the one that counts. If it is a number, has accents or punctuation or is truly indecipherable there is a good chance that you can type in just about anything and still pass the test.

Among the issues yet to be addressed is: Do they try to prevent rude and offensive words appearing? If so how?
