Friday, July 06, 2007

Google exposes plain text

An interesting post from the "Inside Google Book Search" blog:

http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html


For the works that Google deems to be in the public domain, they are exposing the full "text layer" of the work. This is not the page images, but what one assumes is the output of their OCR process. The plain text is organized into pages and retains punctuation and line breaks. You can toggle between the plain text and page image views.

This is a huge breakthrough in terms of accessibility -- the principal reason that they cite for the new feature. What is slightly disappointing is that you cannot download the plain text version to use with offline reader tools -- you have to read online. That could be burdensome for some.

It would also be nice if the downloaded PDFs someday included the plain text behind the scenes to support searching within the PDF.
