Saturday, January 03, 2009

on electronic texts

I just read an article at Information Today by Nicholas Tomaiuolo, an instruction librarian at Central Connecticut State University, entitled "U-Content: Project Gutenberg, Me, and You." He outlines the requirements and steps for preparing an etext for Project Gutenberg.

At one point in the article, there is a discussion about the requirement for full text, not just a PDF created from page images. The author wrote this from the point of view of one unfamiliar with PG's requirements, illustrating the process one might follow to create an acceptable PG submission: images to PDF, and images to OCR to corrected plain text. Reading it, I found myself thinking quite a bit about the often-heard statement (not in this article, mind you) that PDF is the ultimate format for texts.

I'm in no way denigrating PDF. PDF is an absolutely required format for texts. PDF is highly portable and shareable and readable, and, if the source files are good enough, clearly printable. But it's not innately analyzable or easily repurposed. That requires full text.

I am not unfamiliar with what it takes to create an accurate plain text transcription of a text. When Gutenberg was in its early days, we were really talking about transcriptions, as in people typing in text. OCR has greatly streamlined that process, but the proofreading required is non-trivial. Want to work with a highly formatted text, or one with tables or formulae or figures? Challenging. Adding layers of structural and semantic markup to plain text, as with TEI, is time-consuming. Rich markup, including identifying dates or names or geographical places, or providing normalized versions of said dates and names, is a large undertaking. But the payoff is real: a full text with structural and semantic markup can be repurposed into many formats, including ebooks and PDF.
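To make "repurposed" a little more concrete, here is a minimal sketch in Python of what marked-up full text allows that page images do not. The sample fragment is made up for illustration, and the helper names (plain_text, entities) are my own, not part of any TEI toolkit. It assumes TEI P5's persName, placeName, and date elements, with the normalized date in the "when" attribute.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace, needed to address elements by name.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative sample only; not taken from any real edition.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Abraham Lincoln</persName> spoke at
    <placeName>Gettysburg</placeName> on
    <date when="1863-11-19">November 19, 1863</date>.</p>
  </body></text>
</TEI>"""


def squeeze(text):
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return " ".join(text.split())


def plain_text(root):
    """Derive a plain text rendering: one string per paragraph, markup dropped."""
    paragraphs = [squeeze("".join(p.itertext()))
                  for p in root.iterfind(".//tei:p", TEI_NS)]
    return "\n\n".join(paragraphs)


def entities(root):
    """Collect tagged names, places, and dates for indexing or analysis."""
    found = {"persName": [], "placeName": [], "date": []}
    for tag in found:
        for el in root.iterfind(f".//tei:{tag}", TEI_NS):
            # Prefer the normalized value in @when; fall back to the text.
            found[tag].append(el.get("when", squeeze("".join(el.itertext()))))
    return found


root = ET.fromstring(SAMPLE)
print(plain_text(root))
print(entities(root))
```

The same source yields a readable plain text and an index of people, places, and normalized dates. An ebook or PDF would be one more transformation of that source, which is the part a PDF of page images cannot give you.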

And you do want ebooks. Some months ago I had the great opportunity to demonstrate the prototype World Digital Library site at the National Book Festival. There is no greater focus group than thousands of people who love to read! The two top requests were that the books should be downloadable as ebooks and that all the text content be available as full text in all seven project languages. These were not academics or librarians (although there were some of the former and many of the latter who stopped by), but parents and commuters and researchers and genealogists.

Both are daunting requests when you do not have full text available to work from. There will be PDFs. The others are goals to strive for.

