Wednesday, October 15, 2008

First Monday article on Google Books and OCA

The newest issue of First Monday (volume 13, number 10, 6 October 2008) has an interesting article by Kalev Leetaru -- "Mass book digitization: The deeper story of Google Books and the Open Content Alliance."
The article compares what is publicly known about the Google Books and OCA projects.

From the conclusions:

While on their surface, the Google Books and Open Content Alliance projects may appear very different, they in fact share many similarities:

  • Both operate as a black box outsourcing agent. The participating library transports books to the facility to be scanned and fetches them when they are done. The library provides or assists with housing for the facility, but its personnel are not permitted to operate the scanning units, which must be staffed by personnel from either Google or OCA.

  • Neither publishes official technical reports. Google engineers have published in the literature on specific components of their project, which offer crucial insights into the processes they use, while talks from senior leadership have yielded additional information. OCA has largely been absent from the literature and few speeches have unveiled substantial technical details. Both projects have chosen not to issue exhaustive technical reports outlining their infrastructure: Google due to trade secret concerns and OCA due to a lack of available time.

  • Both digitize in–copyright works. Google Books scans both out–of–copyright books and those for which copyright protection is still in force. OCA scans out–of–copyright books and only scans in–copyright books when permission has been secured to do so. Both initiatives maintain partnerships with publishers to acquire substantial in–copyright digital content.

  • Both use manual page turning and digital camera capture. Large teams of humans are used to manually turn pages in front of a pair of digital cameras that snap color photographs of the pages.

  • Both permit libraries to redistribute materials digitized from their collections. While redistribution rights vary for other entities, both the Google Books and OCA initiatives permit the library providing a work for digitization to host its own copy of that digitized work for selected personal use distribution.

  • Both permit unlimited personal use of out–of–copyright works. While redistribution rights vary for other entities, both the Google Books and OCA initiatives permit the library providing a work for digitization to host its own copy of that digitized work for selected personal use distribution.

  • Both enforce some restrictions on redistribution or commercial use. Google Books enforces a blanket prohibition on the commercial use of its materials, while at least one of OCA’s scanning partners does the same. Google requires users to contact it about redistribution or bulk downloading requests, while OCA permits any of its member institutions to restrict the redistribution of their material.

From the section on "Transparency":

A common comparison of the Google Books and Open Content Alliance projects revolves around the shroud of secrecy that underlies the Google Books operation. However, one may argue that such secrecy does not necessarily diminish the usefulness of access digitization projects, since the underlying technology and processes do not matter, only the final result. This is in contrast to preservation scanning, in which it may be argued that transparency is an essential attribute, since it is important to understand the technologies being used so as to understand the faithfulness of the resulting product. When it comes down to it, does it necessarily matter what particular piece of software or algorithm was used to perform bitonal thresholding on a page scan? When the intent of a project is simply to generate useable digital surrogates of printed works, the project may be considered a success if the files it offers provide digital access to those materials.

To me, that paragraph gets at the key issue in discussing and comparing the projects -- are books being scanned in a consistent way and being made accessible through at least one portal, enforcing current rights restrictions? Yes? Then both these projects are, at a basic level, successful and provide a useful service.

Yes, there are issues to quibble with for both projects. More technical transparency is desirable for both. Both have controlled workflows that limit, in different ways, what can be contributed to the projects. There are aspects of the Google workflow that Google contractually requires its partners to keep secret. Google has the right to include that in its contracts, and a potential partner that finds it objectionable can decide not to participate. Each documents and enforces rights in different ways and to different extents -- we should be looking to standards in that area. Each sets different requirements for allowing reuse. If only there could be agreement.

One note on preservation. Neither project is a preservation project -- both are access projects. Even if there were something we could point to and say "that's a preservation-quality digital surrogate" -- if such a concept as "preservation-quality" exists -- neither project aims for that. Both projects do, however, allow the participating libraries to preserve the files created through the projects. These files should and must be preserved because they can be used to provide digital modes of access, and, in some cases, they may be the only surrogates ever made if the condition of a book has deteriorated. Look at the HathiTrust for more on the topic of preserving the output of mass digitization projects.

And one note about the Google project providing "free" digitization for its participants. Yes, Google is underwriting the cost of digitization. But each partner library is bearing the cost of staffing and supplies for project management, checkout/checkin, shelving, barcoding, cataloging, and conservation activities, not to mention storage and management of the files. The overall cost is definitely reduced, but participation is not free.
