Sunday, November 16, 2008

google book search session at dlf

I was going to spend some time transforming my notes from Dan Clancy's session on Google Book Search from the DLF Fall 2008 Forum into more coherent prose, but for the sake of timeliness, I'm going to post them as is.

  • 20% of the content in Google Book Search is in the public domain, 5% is in print, and the rest is in a “twilight zone” -- unknown status and/or out-of-print.
  • 7 million books scanned, over 1 million are public domain, 4-5 million are in snippet view.
  • Early scanning was not performed at an impressive rate, and it took way longer than expected to set up.
  • Priorities are working on search quality, and exposure to
  • Search is definitely not solved and “done,” and is harder given the big distribution of relatively successful hits.
  • They are working to improve the quality of scanning and the algorithm to process the books and improve usability. They admit that they still have work to do, especially with the re-processing of older scans.
  • The data supports the Long Tail model.
  • Creating open APIs, including one to determine the status of a book, and a syndicated viewer that can be embedded.
  • Trying to identify the status of orphans, and release a database of determinations. But institutions need to use determinations to guide their decisions, not just follow them because “Google said so.”
  • On the proposed settlement agreement: Google thought settling would benefit users more than litigating.
  • The class is defined as anyone in the U.S. with a copyright interest in a book, in U.S. use. (no journals or music)
  • For all books in copyright, Google is allowed to scan, index, and provide varying access models dependent upon the status of the book -- if in print or out-of-print. Rights holders can opt out.
  • 4 access models: consumer digital purchase (in the cloud, not downloads – downloads are not specifically included in the agreement); free preview of up to 20% of a book; institutional subscription for the entire database (site license with authentication, can be linked into course reserves and course management systems); public access terminals for public libraries or higher ed institutions that do not want to subscribe (1 access point in each public library building, some # by FTE for higher ed institutions), which allow printing (for 5 years or $3 million underwriting of payments to rights holders).
  • Book Rights Registry to record rights and handle payments to rights holders. It can operate on behalf of other content providers, not just Google.
  • Plan to open up government documents, because they feel that the rights registry organization will deal with the issue of possible in-copyright content included in gov docs, which kept them from opening gov docs before.
  • Admits that publishers and authors do not always agree if publishers have the rights for digital distribution of books. Some authors are adamant that they did not assign rights, some publishers are adamant that even if not explicit, it's allowed. The settlement supposedly allows sharing between authors and publishers to cover this.
  • What is “Non-consumptive research”? OCR application research. Image processing research. Textual analysis research. Search development research. Use of the corpus as a test corpus for technology research, not research using the content. 2 institutions will run data centers for access to the research corpus, with financial support from Google to set up the centers.
  • What about their selling books back to the libraries that contributed them via subscriptions? They will take the partnership and amount of scanning into account and provide a subsidy toward a subscription. Stanford and Michigan will likely be getting theirs free. Institutions can get a free limited set of their own books for the length of the copyright of the books. They can already do whatever they want with their public domain books.
  • They will not necessarily be collecting rights information/determinations from other projects for the registry. In building the registry, they are including licensed metadata (from libraries, OCLC, publishers, etc.), so they cannot publicly share all the data that will make up the registry. But they will make public the status of books that are identified/claimed as in copyright.
  • If Google goes away or becomes “evil Google,” there is lots of language in contracts and settlement for an out.
  • The settlement is U.S. only because the class in the suit was U.S. only. Non-U.S. terms are really challenging because many countries have no concept of class action, and there is wide variation in laws.
  • A notice period begins January 5. Mid 2009 is the earliest time this could be approved by the court.
