2007-10-14

CAPTCHA helps to keep out bots and preserve books

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), a weapon used to fight spammers, is now helping university researchers preserve old books and manuscripts. Carnegie Mellon University (CMU) is using it to help decipher words in books that machines cannot read.

The CMU research team, based in Pittsburgh, Pennsylvania, is digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive. It uses Optical Character Recognition (OCR) software to examine scanned images of the texts and turn them into digital text files that computers can store and search. Unfortunately, because of the poor quality of the original documents, the OCR software cannot read about one in ten words. To solve this problem, the team takes images of such words and uses them as CAPTCHAs. These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs. Thanks to the adoption of reCAPTCHAs by popular websites like Facebook, Twitter and StumbleUpon, the system is helping to decipher about one million words every day, allowing the CMU team to digitise documents and manuscripts as fast as the Internet Archive can supply them. How is it possible? Well, whenever a visitor deciphers a reCAPTCHA to gain access to a website, another word from an old book or manuscript is digitised and sent back to CMU. Simple but brilliant, isn’t it?
