Tuesday, December 30, 2008

archiving the bush administration

There is a great article on ars technica today about the major processing effort that will be required at the National Archives when the Bush administration leaves office. The ars technica piece references a New York Times article on the topic from this past weekend.

This section really strikes home:

The contingency plan will entail "ingesting" the Bush White House's data into a separate system before integrating it with the ordinary archive. As the plan explains, "the current PERL [Presidential Electronic Records Library] system architecture was not scalable to actually support the volume of records that are expected from the current Presidential administration."

It's not just size that matters, though: the Archives will also need to process reams of information locked in some quaint proprietary formats. The RMS index, for example, "consists of an implementation of a customized older version of Documentum running on Oracle, with image files (including copies of scanned records) incorporated as objects in the database." The photos are stored in a "proprietary photo management software called MerlinOne, running on Microsoft SQL as the database engine," and it has apparently taken several months to extract the images and metadata for relinkage outside the Merlin format.

First, the use of quotation marks should remind us all that "ingest" means absolutely nothing to someone who is not a repository manager.

I have participated in some discussions about a potential data migration project at work. I recently saw an inventory of media formats -- not file formats, but media formats -- that the project would need to encompass, and it is lengthy. The only source I can think of for hardware to read some of the formats is eBay. That doesn't even take into account the files themselves. It's interesting how quickly a format becomes obsolete, and how many customized systems federal agencies use.

Monday, December 29, 2008

blogging has fallen by the wayside

Between a month almost solely dedicated to a single high-stress project and a lot of other writing commitments -- revising a paper for a conference, drafting a conference proposal and a co-authored conference proposal, and writing a chapter for a book -- I find that I haven't made time to blog. I promise to make time soon.

best metrics for comparing hardware?

Recently I spent 4 weeks on a project where we were considering hardware options for a large amount of storage for a data migration project. We ended up with 4 different proposals -- three from vendors and one to be built in-house. One of the tasks that I worked on was a matrix to compare the 4 potential solutions.

There were the easy metrics -- the amount of raw and usable storage, number of racks/tiles required, electrical and cooling requirements, cost, etc. Comparing supportability was trickier but doable, with 24*7 versus 12*5 phone support, availability of on-site technicians, warranty terms, support contract costs, etc. Where it became more difficult was identifying metrics to compare performance. Ratio of processors to storage? Location of processing nodes in the architecture? I/O rates? Time to read all data? And how do you best calculate those last two with four quite architecturally different proposals? We ended up with metrics that not everyone agreed upon, in part because there was a requirement that not everyone agreed upon.
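For what it's worth, two of the performance metrics above can be reduced to simple arithmetic once each proposal states a usable capacity and a sustained aggregate read rate. This is only a rough sketch: the `Proposal` fields, the proposal names, and every number are invented for illustration, and a single sustained rate glosses over exactly the architectural differences that made the comparison hard.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """One storage proposal; every figure used below is invented."""
    name: str
    usable_tb: float       # usable capacity in terabytes (10**12 bytes)
    read_mb_per_s: float   # sustained aggregate read rate in MB/s (10**6 bytes)
    processors: int        # processing nodes in the proposal


def hours_to_read_all(p: Proposal) -> float:
    """Time to stream the full usable capacity once at the sustained rate."""
    seconds = (p.usable_tb * 1_000_000) / p.read_mb_per_s
    return seconds / 3600


def processors_per_100tb(p: Proposal) -> float:
    """Ratio of processors to storage, normalized to 100 TB."""
    return p.processors / p.usable_tb * 100


proposals = [
    Proposal("vendor-a", usable_tb=500, read_mb_per_s=2000, processors=16),
    Proposal("in-house", usable_tb=500, read_mb_per_s=800, processors=40),
]

for p in proposals:
    print(f"{p.name}: {hours_to_read_all(p):.1f} h to read all data, "
          f"{processors_per_100tb(p):.1f} processors per 100 TB")
```

Normalizing to the same units across proposals at least makes the disagreements about requirements explicit rather than buried in each vendor's own numbers.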

I'm curious how other folks have gone about doing this. I'd be interested in hearing from anyone who is willing to share their strategies.

Wednesday, December 17, 2008

ICDL adding European collections

From an article on Forbes.com, the International Children's Digital Library (ICDL) announced a partnership with the Taliaferro Family Fund to increase the number of European children's titles in the collection. The Elias Project will target three collections in Europe: the Norwegian Children's Book Institute in Oslo, Norway, the International Youth Library in Munich, Germany, and the National Center for Children's Books in Paris, France.

After reading the article, I checked in at the ICDL site, which I hadn't visited in a few months, and noticed two other news announcements: ICDL and the Google Book project will be sharing public domain children's book titles; and ICDL has launched an iPhone app with full access to the collection, a new titles feature, an offline mode, and an airplane mode. It's great to see such a worthwhile project making such advances in collection building and in adding new services.

(I didn't see a press release about the European project on the ICDL site. I saw the press release on some other sites, so I assume it's meant to be out there.)

Tuesday, December 16, 2008

letter to santa

Nik Honeysett has posted a great letter to Santa on the Musematic blog.

If enough of us ask for that image format, will Santa grant our wish?

Friday, December 12, 2008

interview with Paul LeClerc in New York Times

Paul LeClerc, director of the New York Public Library, answered questions online at the New York Times that have been made available in three parts: part one, part two, part three. Topics include budgets, branch closures and renovations, ebooks, and preservation efforts.

In part two, he briefly mentioned their participation in the Library of Congress National Digital Newspaper Project and its public access content web site Chronicling America. It was nice to see this project get a media mention in the context of preserving and providing access to often ephemeral newspapers.

Library of Congress releases report on flickr pilot

The Library of Congress has released its report on its Flickr Commons pilot, where approximately 5,000 images were uploaded for a crowdsourcing metadata experiment. A full report and a summary report are available, both PDFs.

The photos have drawn more than 10 million views, 7,166 comments and more than 67,000 tags. When Flickr commenters provide updated place and personal names, dates, and event identification, staff from the Library's Prints and Photographs Division verify the information and have so far updated more than 500 records in their catalog -- with many more in the queue -- citing the Flickr Commons Project as the source of the new information.

Thursday, December 11, 2008

creative commons wants feedback on licenses

Creative Commons is conducting a study to collect feedback on the term “noncommercial” and how it should be covered in its licenses. The hope is that what’s learned from the survey can improve the licenses that allow or restrict noncommercial uses. The questionnaire has to be completed by this Sunday, December 14, 2008. Everyone who has taken advantage of CC licenses as a creator or a user should take some time to answer the questions.

world war II collection at the national archive and footnote

The US National Archives and the historical document website Footnote.com have collaborated on the digitization of a large collection of documents from the US involvement in World War II, which are now available on the footnote.com web site. There is an ars technica article on the collection and interface.

Like the ars technica writer, I had a lot of difficulty finding anything that I hoped to find. My grandfather, father, and uncle all served in WWII. My grandfather died in a friendly fire incident where allied planes accidentally sank a ship carrying prisoners of war to be returned. I found nothing. There was nothing in the documents or in the photos, although I did find out that a man with almost the same name as my uncle (same middle initial but different middle name) was listed as missing when his plane was shot down in 1943. Still, it's a lot of useful content that I'm glad to see digitized and OCR'ed.

I was disappointed, but I wasn't surprised. I found the navigation to be a bit puzzling. I found I had to have multiple tabs open to easily go back to search. Not just the image but the entire image viewer screen had to come into focus when I selected something to view.

The ars technica writer said that his view of the site included the disclaimer "All Free (for a limited time)," and commented that "... it would be nice to think that a service based on government records of a significant American experience would be free indefinitely." The original press release describing the collaboration is worth reviewing, because it addresses that point in the ars technica article. The agreement allows Footnote.com non-exclusive access, and "After an interval of five years, all images digitized through this agreement will be available at no charge through the National Archives web site." So, Footnote can charge for it for now, but it will all revert to the National Archives for free and open access.

I don't see that disclaimer when using my Library of Congress computer because we have full access -- I wonder how long it will be fully accessible for those without subscriptions?

Sunday, December 07, 2008

laine farley named cdl director

I just saw the press release naming Laine Farley as the new Director of the California Digital Library. I am thrilled for Laine, who's been serving as the interim Director for over 2 years. I have worked with her on Aquifer and at least one other collaborative community project, and I know what an experienced and capable person she is.

Sunday, November 30, 2008

costs of operating a data center

In recent weeks I've been in a number of meetings about storage architecture. In comparing potential solutions for a particular project's needs, there has been a lot of discussion about requirements for space, cooling, power, etc., which are some of the metrics for comparison of the proposals. Today I came across a posting by James Hamilton from Microsoft's Data Center Futures Team, on misconceptions about the cost of power in large-scale data centers.

There is an error in the table for the two amortization entries: the 180-month amortization of the facility says 3 years in the notes column when it should say 15, and the 36-month amortization of the servers says 15 years when it should say 3. The right numbers are used in the calculation -- it's just the explanatory notes that are switched.

There are some very interesting calculations that I plan to forward to other folks. One commenter points out that the figures and graph don't take staffing into account, which, according to Hamilton, is because that cost is a very small percentage of the cost. At Microsoft's scale that's probably true. I guess we're a relatively small- to medium-scale data center, because we are definitely taking that into consideration. And, while the cost of power may be a lower percentage of overall cost than previously considered, power capacity is still definitely a factor when you're considering adding a number of servers and switches.
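The swapped notes are easy to sanity-check by converting the amortization periods from months to years. The sketch below does that and also shows a straight-line monthly charge; the capital costs are hypothetical round numbers, and straight-line amortization is my simplification here, since it ignores the cost of money.

```python
def amortization_years(months: int) -> float:
    """Convert an amortization period in months to years."""
    return months / 12


def monthly_straight_line(capital_cost: float, months: int) -> float:
    """Straight-line monthly charge, ignoring the cost of money."""
    return capital_cost / months


FACILITY_MONTHS = 180  # facility amortization period discussed above
SERVER_MONTHS = 36     # server amortization period discussed above

# Hypothetical capital costs, chosen only to make the arithmetic visible.
facility_cost = 180_000_000
server_cost = 36_000_000

print(amortization_years(FACILITY_MONTHS), "years for the facility")
print(amortization_years(SERVER_MONTHS), "years for the servers")
print(monthly_straight_line(facility_cost, FACILITY_MONTHS), "per month (facility)")
print(monthly_straight_line(server_cost, SERVER_MONTHS), "per month (servers)")
```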

Saturday, November 29, 2008

Jamie Boyle on public domain

Jamie Boyle's book The Public Domain: Enclosing the Commons of the Mind has been published by Yale University Press, and is also available for free download under a Creative Commons license.

I've seen Jamie Boyle speak two or three times, and I consider him a very important voice in the discussions on the public domain, intellectual property, patents, the economics of same, and their place in technology and culture.

Read this book.

Wednesday, November 26, 2008

LuSql - Lucene indexing of DBMS records

The release of LuSql has been announced on a few email lists:

LuSql is a high-performance, simple tool for indexing data held in a DBMS into a Lucene index. It can use any JDBC-aware SQL database.

It includes a tutorial with a series of increasingly complex use cases, showing how article metadata held in a series of MySql tables can be indexed and how file system files containing full-text can also be indexed.

It has been tested extensively, including using 6.4 million metadata and full-text records to produce a 86GB index in 13.5 hours.

It is licensed with the Apache 2.0 license.

release: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
tutorial: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html
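
To put the test figures in the announcement in perspective, here is a quick back-of-the-envelope calculation of the average indexing rate and index size per record. It uses only the three numbers quoted above and assumes decimal gigabytes, which the announcement does not specify.

```python
RECORDS = 6_400_000  # metadata and full-text records, from the announcement
HOURS = 13.5         # total indexing time, from the announcement
INDEX_GB = 86        # resulting index size, from the announcement

# Average indexing rate over the whole run.
records_per_second = RECORDS / (HOURS * 3600)

# Assumes decimal units (1 GB = 1,000,000 KB); this is an assumption.
index_kb_per_record = INDEX_GB * 1_000_000 / RECORDS

print(f"{records_per_second:.0f} records/s")
print(f"{index_kb_per_record:.1f} KB of index per record")
```

That works out to roughly 130 records per second and a little over 13 KB of index per record, averaged across metadata and full text.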

Tuesday, November 25, 2008

commentary on why google must die

John Dvorak wrote an essay in PC Magazine entitled "Why Google Must Die." It's a pithy article on search engine optimization (SEO) and the SEO tricks that are in play to work best with Google or get around a Google feature. This is an essay that I never would have noticed had it not been referenced in a posting by Stephen Abram that I very much took notice of, also entitled "Why Google Must Die."

His post is a response to the often-heard suggestion that OPACs, federated search, and web site search engines should be "just like Google." He asks what should be implemented first:

1. Should I start manipulating the search results of library users based on the needs of advertisers who pay for position?
2. Should I track your users' searches and offer different search results or ads based on their private searches?
3. Should I open library OPACs and searches to 'search engine optimization' (SEO) techniques that allow special interest groups, commercial interests, politicians (as we've certainly seen with the geotagged searches in the US election this year), racist organizations (as in the classic MLK example), or whatever to change results?
4. Should I geotag all searches, using Google Maps, coming from colleges, universities or high schools because I can ultimately charge more for clicks coming from younger searchers? Should I build services like Google Scholar to attract young ad viewers and train or accredit librarians and educators in the use of same?
5. Should I allow the algoritim to override the end-user's Boolean search if it meets an advertiser's goal?
6. "Evil," says Google CEO Eric Schmidt, "is what Sergey says is evil." (Wired). Is that who you want making your personal and institutional values decisions?

There's more to the post. I admire a forthright post like this that pushes back on the assertion that doing things the Google way is automatically better.

I used to have lengthy discussions with a library administrator in a past job who wanted image searching to be just like Google images, because searches on a single word like "horse" would always produce images of horses at the top of the results. It was a lot of effort to explain that this was somewhat artificial: given the sheer number of images and the absence of descriptive metadata, having the string "horse" in the file name ensured that those images would be near the top of the list, and Google didn't actually recognize that it was an image of a horse. Sorry, we really did need to expend effort on descriptive metadata.
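
The point about file names can be shown with a toy example. This is not how Google or any real image search works; the records, field names, and `search` function are all made up to illustrate why a string match on file names finds only the images that happen to be named well.

```python
# Three hypothetical image records; all three are photos of horses.
images = [
    {"file": "horse.jpg", "description": ""},
    {"file": "IMG_0042.jpg", "description": ""},
    {"file": "IMG_0043.jpg", "description": "Horse grazing in a field"},
]


def search(term, records, use_metadata):
    """Return file names whose searchable text contains the term."""
    term = term.lower()
    hits = []
    for rec in records:
        text = rec["file"].lower()
        if use_metadata:
            text += " " + rec["description"].lower()
        if term in text:
            hits.append(rec["file"])
    return hits


print(search("horse", images, use_metadata=False))
print(search("horse", images, use_metadata=True))
```

With file names alone, only `horse.jpg` is found. Adding descriptions surfaces `IMG_0043.jpg` as well, and `IMG_0042.jpg` stays invisible until someone describes it.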


The Library of Congress is hosting the SearchCampDC barcamp on Tuesday, December 2. From the wiki:

The goal of SearchCampDC is to bring together people working on and using search releated technologies in and around Washington DC. We all rely on tools like Google in our daily lives, but as IT professionals we often need to build or integrate search technologies into applications, and the enterprise.

Search technologies typically rely on a wide variety of techniques such as fulltext indexing, geocoding, named entity recognition, data mining, natural language parsing, machine learning, distributed processing. Perhaps you've used a piece of technology, and care share how well it worked? Or perhaps you help develop a search-related tool that you'd like to get feedback on? Or maybe you've got a search-itch to scratch, and want to learn more about how to scratch it? If so, please join us for SearchCampDC.

The idea for this event came about from a happy coincidence that several Lucene, Solr, OpenLayers and TileCache developers were going to be in the DC area at the same time. The idea is to provide them (and hopefully others like you) a space for short & sweet presentations about stuff they are working on, and also to provide a collaborative space for people to try things out, share challenges, ideas etc.

Monday, November 24, 2008

europeana a victim of its own success

There are a number of articles documenting the wild success and consequent server failure of the Europeana digital library: Times Online, PublicTechnology.net, and Yahoo Tech. The development site that documents the project is still available.

This is a cautionary tale for those of us who are working on the World Digital Library project, which is set to launch in April 2009. We know that there is a potentially high level of interest in a multi-lingual international digital collection site -- albeit one with a much smaller initial collection -- and seeing this confirms for us that making plans for mirroring is a necessity.

Thursday, November 20, 2008


The prototype site for Europeana, the European digital library funded by the EC, is set to launch today, November 20, 2008.

The initial collection of 2 million items comes from museums, libraries, archives, and audio-visual collections, and includes paintings, maps, videos and newspapers. The interface is in French, English, and German, with more languages planned. Highlights include the Magna Carta from Britain, the Vermeer painting “Girl With a Pearl Earring” from the Mauritshuis Museum in The Hague, a copy of Dante’s “Divine Comedy,” and newsreel footage of the fall of the Berlin Wall. Content is free of copyright restrictions.

New York Times and EU Business articles provide more details. The goal is to add 10 million items into the digital library by 2010, at a price of 350 million to 400 million euros.

Wednesday, November 19, 2008

LIFE photo archive in google images

Google has launched a hosted collection of newly-digitized images that includes photos and etchings produced and owned by LIFE Magazine.

Apparently only a small percentage of these images have been published; the remainder come from their photo archive. Google is digitizing them: 20 percent of the collection is online, and they are working toward the goal of having all 10 million photos online.

There are some great Civil War, WWI, WWII and Vietnam war images, portraits of Queen Victoria and Czar Nicholas, civil rights-era documentation, and early images of Disneyland. There are photographs by Mathew Brady, Alexander Gardner, Margaret Bourke-White, Dorothea Lange, Alfred Eisenstaedt, Carl Mydans, and Larry Burrows, among others.

Tuesday, November 18, 2008

cataloging flash mob

I love the idea of a flash mob volunteer effort to catalog the book and videotape collection at St. John's Church in Beverly Farms, Massachusetts. There's more from one of the volunteers.

I'd like to see this effort replicated at small historical societies, house museums, and other organizations that are often supported only by a small cadre of volunteers.

Monday, November 17, 2008

facebook repository deposit app using sword

From Stuart Lewis' blog comes word of a Facebook app -- SWORDAPP -- for depositing content into SWORD-enabled repositories. It's meant to encourage social deposit: notice of your deposits goes out in your Facebook newsfeed, and you can receive news of your friends' deposits. You have to already be eligible to authenticate and deposit into a repository somewhere.

This is an interesting use of SWORD in an application that lives in a really different context. He's looking for testers and feedback.

Sunday, November 16, 2008

arl guide to the google book search settlement

The Association of Research Libraries has created "A Guide for the Perplexed: Libraries & the Google Library Project Settlement," a 23-page document intended to help libraries understand the impact of the proposed Google Book Search settlement.

oxford university preserv2 project session at dlf

My brief, unstructured notes from a presentation by Sally Rumsey from Oxford University on the Preserv2 project, at the DLF Fall 2008 Forum.

  • Effort focused primarily on IRs with research data, but applies to all types of repositories.
  • “Boxed up -- but innovative at the time – repository environment.” 6 different repositories at Oxford, multiple Fedora & Eprints, some sharing the same storage.
  • A desire for decoupled services -- distributed storage and transportable data. Scalability issue.
  • Implemented British National Archives seamless flow approach to preservation. There are three categories of activities: Characterization (inventory and format identification). Preservation planning and technology watch (using the PRONOM technology watch service). Preservation Actions (file migration, rendering tools, etc, based on repository policy).
  • Smart Storage. Ability to address objects in open storage through a repository layer or directly in the storage system. DROID was used as the tool to verify new items as they are stored, and continually check files in storage. They are implementing a scheduler.
  • Oxford looking at Sun Honeycomb server architecture.
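
The characterization step in these notes (format identification of items as they are stored) can be illustrated with a toy signature check. This is not DROID and does not use PRONOM; real signature files are far richer. The signature table and `identify` function are a minimal sketch using a handful of well-known magic numbers.

```python
# A few well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
]


def identify(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify(b"%PDF-1.4\n% sample"))
print(identify(b"\x00\x00\x00\x0cjP  \r\n\x87\n" + b"\x00" * 8))
print(identify(b"plain text"))
```

A scheduler like the one mentioned above would presumably run a check of this kind over stored files on a regular basis, but that part is not sketched here.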

exposing harvard digital collections to web crawlers session at dlf

My somewhat unstructured notes from a presentation by Roberta Fox from Harvard, at the DLF Fall 2008 Forum.

  • Barriers in exposing their collections from their legacy applications: session-based, frames, form-driven, non-compliant coding, URLs with lots of parameters.
  • Crawlers couldn’t get past first page of any of their services (VIA, OASIS, PDS, TED, etc)
  • Concerns: Server load an issue, dynamic database-driven sites.
  • But, exposure to crawler declared a priority.
  • Added robots links on every page to identify how it should be handled -- index, no, follow links, etc, to control server use.
  • Slow down crawl using Google webmaster tools to avoid major server hits.
  • Added alt and title tags to provide context for pages/items found through external search (context was originally assumed because access would be through the Harvard University Library portal). They added links on all pages to provide additional context, e.g. the full preferred presentation, so that users who land on a page through Google and not through the HUL context can find their way around.
  • Crawler friendly – generated a static site map for key dynamic pages, update the site map weekly.
  • Updated, simplified URL structure for deep pages.
  • Access to Page Delivery Service the most challenging for OCR’ed text. Generated a crawler-friendly page for each text in addition to original unfriendly frames version (there is no priority to rewrite the app to remove frames).
  • It is somewhat burdensome to create index pages that point to all the items in a database (such as all the images in VIA) on a weekly basis, but it’s better than no access – it’s an automated creation process and doesn’t take up that much room on the server.
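
The static site map mentioned in these notes could be generated along the following lines. This is a sketch under assumptions, not Harvard's process: the record identifiers and URL pattern are hypothetical, and only the sitemap namespace and element names come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, changefreq="weekly"):
    """Return sitemap XML (as a string) with one <url> entry per address."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for address in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = address
        ET.SubElement(url, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record identifiers and URL pattern, for illustration only.
record_ids = ["1001", "1002", "1003"]
pages = [f"https://example.org/via/record/{rid}" for rid in record_ids]

xml_text = build_sitemap(pages)
print(xml_text)
```

In practice the list of identifiers would come from a database query, and the output would be written to a file on the weekly schedule described above.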

ncsu course views session at dlf

My somewhat unstructured notes from a presentation by Tito Sierra and Jason Casden from NCSU on the Course Views tools, at the DLF Fall 2008 Forum.

  • Their goal is to create a web page for every single course at NCSU.
  • Blackboard Vista (hard to work with), Moodle (easy to work with), WolfWare (internal tool, easy to integrate), Library web site.
  • Implemented using object-oriented PHP, and the front-end design makes use of YUI, jQuery and CSS. RESTful requests to the Widget System, and RESTful URIs.
  • They screen scrape the course directory to build the database of course titles.
  • There is an issue with the course reserves system so they are not currently embedding reserves, just linking.
  • Built their own mini-metasearch which searches one vendor database set for each general discipline (created by subject librarians, who identify one vendor and a set of their databases to be searched).
  • The subject librarians create a “recommended” resource list for each course.
  • Real issues with customization, especially for sections of large courses.
  • Balance of fully custom, hand-made versus fully automated. Librarians given a number of widgets to select from, some of which are canned and some of which can be customized.
  • Reserves used the most by far, then search, then the “recommended” widget.
  • Customized widgets used most (the 3 listed above).

djatoka jpeg 2000 image server session at dlf

My brief, unstructured notes from a presentation by Ryan Chute from Los Alamos National Labs on the Djatoka JPEG 2000 Image Server, at the DLF Fall 2008 Forum.

  • APIs and libraries: Kakadu SDK, Sun Java advanced images, NIH ImageJ, OCLC OpenURL OOM. Could work with other than Kakadu, including Aware.
  • Viewer adapted from a British Library implementation.
  • Features: resolution and region extraction, rotation, support for rich set of input/output formats, extensible interfaces for image transformations, including watermarking. Persistent URLs for images and for individual regions.
  • URI-addressability for specified regions needed, and OpenURL provided a standardized service framework foundation for requesting regions. This extends the OpenURL use cases.
  • Service example: Ajax-based client requests, obtains metadata and regions from image server using image identifier, client iterates through tiles and metadata to display.
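
The client-side tile iteration in the service example can be sketched generically. The code below only computes the rectangular regions a client would request to cover an image; it does not show Djatoka's actual request parameters, and the function name and 256-pixel tile size are assumptions.

```python
def tile_regions(width, height, tile=256):
    """Yield (x, y, w, h) regions that together cover a width-by-height image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Edge tiles are clipped so they do not run past the image.
            yield (x, y, min(tile, width - x), min(tile, height - y))


regions = list(tile_regions(600, 300, tile=256))
for region in regions:
    print(region)
```

Each region would then be turned into one request to the image server, with the image identifier and the region passed along in whatever form the service expects.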

university of texas repo atom use session at dlf

My somewhat unstructured notes from a presentation by Peter Keane from UT Austin on the use of Atom and Atom/Pub in their DASe repository, at the DLF Fall 2008 Forum.

  • DASe project: lightweight repository, 100+ collections, 1.2 million files, 3 million metadata records.
  • DASe has replaced their image reserves system. Home grown (“built instead of borrowed”), originally prototyped 2004/2005.
  • They didn’t originally plan to build a repository, they were building an image slideshow and ended up with a repository, too.
  • It’s a data first application. Data comes from spreadsheets, FM, Flickr, iPhoto, file headers, etc. System includes a variety of different collection-based data models. Needed to map to/from standard schemas. Accepted as is, no normalization or enrichment at all.
  • SynOA: Syndicated Oriented Architecture. Importance recognized in being RESTful. DASe is a REST framework.
  • Use the Atom publishing protocol to represent collections and items and searches. Used internally between services, including upload and ingest (uses http get, post, etc). Everything is Atom with a UI (Smarty PHP templates) on top of it.
  • Working on a Blackboard integration.
  • Interesting use of Google spreadsheets – create a Google spreadsheet for whatever they have as name/value pairs, which automatically outputs Atom, and ingest from the feed.
  • No fielded search across collections, only within a single collection. They could map across data models to a common standard, but haven’t. (corrected as per comment below)
  • Repositories were considered a door to libraries, all trying to create a better door. This is not the right concept, instead should be exposing in a standard way to any and all services.
  • Loves REST; used the term “RESTafarian.”
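
As a rough picture of what "everything is Atom" might look like, here is a sketch that turns an item's name/value pairs into an Atom entry. The mapping of each pair onto an `atom:category` element, the function name, and the example.org URLs are my assumptions for illustration, not a description of how DASe actually serializes items.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def item_to_atom_entry(item_id, title, updated, pairs):
    """Build an Atom <entry>; each name/value pair becomes a <category>."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = item_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    for name, value in pairs:
        # Assumption: scheme carries the field name, term carries the value.
        ET.SubElement(
            entry,
            f"{{{ATOM_NS}}}category",
            {"scheme": f"https://example.org/fields/{name}", "term": value},
        )
    return ET.tostring(entry, encoding="unicode")


xml_text = item_to_atom_entry(
    "https://example.org/collection/demo/item/1",
    "Sample item",
    "2008-11-16T00:00:00Z",
    [("creator", "Unknown"), ("medium", "photograph")],
)
print(xml_text)
```

An entry like this could be sent to a collection with an HTTP POST and retrieved with a GET, which is the general shape of the Atom publishing protocol use described in the notes.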

creative commons non-commercial use session at dlf

My somewhat unstructured notes from a presentation by Virginia Rutledge, an attorney from Creative Commons, at the DLF Fall 2008 Forum.

  • Copyright is a bundle of rights. She went over this in some detail for those who are less familiar.
  • Creative Commons exists to support the ability to share, remix, and reuse, legally.
  • Example of recent Library use: the entire UCLA Library web site is under a CC license to clarify its content re-use status.
  • The Creative Commons definition of non-commercial is tied to the intent of the user -- no intent towards commercial advantage or private monetary compensation. BUT, there is no single definition of non-commercial.
  • They are undertaking a research project in many phases. 1st (done) – focus groups. Identified 4 communities: Arts, education, web, and science communities. This proved to be a VERY bad idea, as the boundaries are actually way too fuzzy and interdisciplinary. The work invalidated the assumption that they could do this on a community-based basis.
  • A number of issues of importance to rights holders in allowing non-commercial use were identified in the discussions: Is there a perceived economic value? Who is the user-- an individual or an organization? Non profit or not? Is any money generated? Is access supported by advertising or not? Is the use for the “public good” -- for charity/education? What is the amount of distribution? Will the work be used in part or in whole? Is this use by a “competitor?”
  • There are also subjective issues: Is it an objectionable use? Is it perceived as fair use?
  • Personal creator and personal use versus institutional ownership and use is a distinction that really makes a difference to people, but has no meaning in US law.
  • Some of the confusion over how to define "non-commercial" comes from not understanding what activities a non-commercial restriction actually prohibits. Most rights holders don’t really want to prohibit all commercial uses, just some, and it varies wildly by person/organization.
  • Based on the research so far, there is no checklist they can come up with.
  • As of November 17, a poll will be available online, and they are encouraging librarians to participate.

google book search session at dlf

I was going to spend some time transforming my notes from Dan Clancy's session on Google Book Search from the DLF Fall 2008 Forum into more coherent prose, but for the sake of timeliness, I'm going to post them as is.

  • 20% of the content in Google Book Search is in the public domain, 5% is in print, and the rest is in an unknown “twilight zone” -- unknown status and/or out-of-print.
  • 7 million books scanned, over 1 million are public domain, 4-5 million are in snippet view.
  • Early scanning was not performed at an impressive rate, and it took way longer than expected to set up.
  • Priorities are working on search quality, and exposure to google.com.
  • Search is definitely not solved and “done,” and is harder given the big distribution of relatively successful hits.
  • They are working to improve the quality of scanning and the algorithm to process the books and improve usability. They admit that they still have work to do, especially with the re-processing of older scans.
  • Their data supports the Long Tail model.
  • Creating open APIs, including one to determine the status of a book, and a syndicated viewer that can be embedded.
  • Trying to identify the status of orphans, and release a database of determinations. But institutions need to use determinations to guide their decisions, not just follow them because “Google said so.”
  • On the proposed settlement agreement: Google thought they would benefit users more to settle than to litigate.
  • The class is defined as anyone in the U.S. with a copyright interest in a book, in U.S. use. (no journals or music)
  • For all books in copyright, Google is allowed to scan, index, and provide varying access models dependent upon the status of the book -- if in print or out-of-print. Rights holders can opt out.
  • 4 access models: consumer digital purchase (in the cloud, not downloads – downloads are not specifically included in agreement); free preview of up to 20% of book; institutional subscription for the entire database (site license with authentication, can be linked into course reserves and course management systems); public access terminals for public libraries or higher ed that do not want to subscribe (1 access point in each public library building, some # by FTE for higher ed institutions) which allows printing (for 5 years or $3 million underwriting of payments to rights holders).
  • Books Rights Registry to record rights, handle payments to rights holders. It can operate on behalf of other content providers, not just Google.
  • Plan to open up government documents, because they feel that the rights registry organization will deal with the issue of possible in-copyright content included in gov docs, which kept them from opening gov docs before.
  • Admits that publishers and authors do not always agree if publishers have the rights for digital distribution of books. Some authors are adamant that they did not assign rights, some publishers are adamant that even if not explicit, it's allowed. The settlement supposedly allows sharing between authors and publishers to cover this.
  • What is “Non-consumptive research”? OCR application research. Image processing research. Textual analysis research. Search development research. Use of the corpus as a test corpus for technology research, not research using the content. 2 institutions will run data centers for access to the research corpus, with financial support from Google to set up the centers.
  • What about their selling books back to the libraries that contributed them via subscriptions? They will take the partnership and amount of scanning into account and provide a subsidy toward a subscription. Stanford and Michigan will likely be getting theirs free. Institutions can get a free limited set of their own books for the length of the copyright of the books. They can already do whatever they want with their public domain books.
  • They will not necessarily be collecting rights information/determinations from other projects for the registry. In building the registry, they are including licensed metadata (from libraries, OCLC, publishers, etc), so they cannot publicly share all the data that will make up the registry. But they will make public the status of books that are identified/claimed as in copyright.
  • If Google goes away or becomes “evil Google,” there is lots of language in contracts and settlement for an out.
  • The settlement is U.S. only because the class in the suit was U.S. only. Non-U.S. terms are really challenging because many countries have no concept of class-action, and there is a wide variation of laws.
  • A notice period begins January 5. Mid 2009 is the earliest time this could be approved by the court.

Friday, November 14, 2008

omeka 0.10

Omeka v 0.10 has been released. Omeka 0.10b incorporates many requested changes: an unqualified Dublin Core metadata schema and fully extensible element sets to accommodate interoperability with digital repository software and collections management systems; elegant reworkings of the theme and plugin APIs to make add-on development more intuitive and more powerful; a new, even more user friendly look for the administrative interface; and a new and improved Exhibit Builder.

scholastic books flickr set

I came across a link to a flickr set that someone has created with covers and illustrations from Scholastic Book Services books from the 1960s and 1970s. I _loved_ it when the Scholastic book order forms were distributed, and I always ordered something like a dozen books every time. This set includes a number of books that I know I owned, and even a very few that I _still_ own. This person has a great collection.

EDIT: Here's another flickr set.

Greenstone release

Greenstone v2.81 has been released. Improvements include handling filenames that include non-ASCII characters, accent folding switched on by default for Lucene, and character based segmentation for CJK languages. There are many other significant additions, including the Fedora Librarian Interface (analogous to GLI, but working with a Fedora repository).
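
For anyone unfamiliar with the term, accent folding means treating accented and unaccented forms of a letter as equivalent when searching. Here is a minimal Python sketch of the idea using Unicode decomposition. It is only an illustration: the function name is my own, and I am not claiming this is how Greenstone or Lucene implement it.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining accent marks so that 'é' compares equal to 'e'."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, title: str) -> bool:
    """Case-insensitive, accent-insensitive substring match."""
    return fold_accents(query).lower() in fold_accents(title).lower()
```

With this in place, a search for "bibliotheque" would find a title containing "Bibliothèque".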

Rome Reborn and Google Earth

BBC News reported on a release of a collaboration between Google Earth and the Rome Reborn project. Ancient Rome is the first historical city to be added to Google Earth. The model contains more than 6,700 buildings, with more than 250 place marks linking to key sites in a variety of languages.

roman de la rose digital library

Johns Hopkins University and the Bibliothèque nationale de France have announced that the Roman de la Rose Digital Library is available at http://romandelarose.org/. The goal is to bring together digital surrogates of all the approximately 270 extant manuscript copies of the Roman de la Rose. By the end of 2009 they expect to have 150 versions included in the resource. There is an associated blog available at http://romandelarose.blogspot.com/.

I am particularly interested in the pageturner and image browser that they used -- the FSI Viewer, a Flash-based tool. It seems to work with TIF, JPG, FPX, and PDF (but not JPEG2000?), and converts files to multi-resolution TIFs. It's a very intuitive interface.

Monday, November 10, 2008

photoshop ui rendered in real-world objects

Via BoingBoing, the UI for Photoshop recreated with real objects, created by the agency Bates 141 in Jakarta for Software Asli. Follow the links to the image and to the "making of" flickr set.

Friday, October 31, 2008

court rules that hash analysis is a fourth amendment search

The U.S. District Court for the Middle District of Pennsylvania has issued an opinion in the case United States v. Crist that a hash value analysis in a criminal investigation counts as a Fourth Amendment "search." Read a synopsis at ars technica.
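
In general terms, a hash value analysis means computing a digest for each file and checking it against a list of digests of already-known files. A minimal Python sketch of that idea follows. The function names and sample data are hypothetical, and SHA-256 is my choice for illustration, not a statement about what was used in this case.

```python
import hashlib


def file_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file's bytes using the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def find_known_files(files: dict[str, bytes], known_digests: set[str]) -> list[str]:
    """Return the names of files whose digest appears in the known set."""
    return sorted(
        name for name, data in files.items()
        if file_digest(data) in known_digests
    )


# Hypothetical sample data: two "files" and one known digest.
sample_files = {"a.txt": b"hello", "b.txt": b"world"}
known = {file_digest(b"hello")}
flagged = find_known_files(sample_files, known)
```

The point is that the comparison examines the contents of every file, even though no person opens or reads any of them.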

JISC Digital Preservation Policies Study

JISC has released a two-part study of digital preservation policies: Digital Preservation Policies Study and Digital Preservation Policies Study, Part 2: Appendices—Mappings of Core University Strategies and Analysis of Their Links to Digital Preservation. The study aims to provide an outline model for digital preservation policies and to analyse the role that digital preservation can play in supporting and delivering key strategies for higher ed institutions.

cloud computing

An interesting new book -- The Tower and The Cloud: Higher Education in the Age of Cloud Computing -- has been published by Educause. The term "cloud computing" is usually used to refer to applications that run on remote systems in "the cloud" rather than on desktop computers or to the storage of files remotely rather than locally, but the book defines the term more broadly, including open-source software and social-networking tools. The full book is available online as a free PDF.

twitter war of the worlds

Thanks to Amanda for pointing this out -- I am addictively following the twitter production of War of the Worlds, an homage to the Orson Welles radio production. How this came about is described at the Ask a Wizard blog.

Tuesday, October 28, 2008

google book search settlement agreement announced

Today it was announced that Google has reached a settlement in the lawsuit filed by the Authors Guild, the Association of American Publishers, and a group of individual authors.

Some of the details are available at Google. The changes that I am the most interested in are these:

"Until now, we've only been able to show a few snippets of text for most of the in-copyright books we've scanned through our Library Project. Since the vast majority of these books are out of print, to actually read them you'd have to hunt them down at a library or a used bookstore. This agreement will allow us to make many of these out-of-print books available for preview, reading and purchase in the U.S.. Helping to ensure the ongoing accessibility of out-of-print books is one of the primary reasons we began this project in the first place, and we couldn't be happier that we and our author, library and publishing partners will now be able to protect mankind's cultural history in this manner."


"The agreement will also create an independent, not-for-profit Book Rights Registry to represent authors, publishers and other rightsholders. In essence, the Registry will help locate rightsholders and ensure that they receive the money their works earn under this agreement. You can visit the settlement administration site, the Authors Guild or the AAP to learn more about this important initiative."
I'm all for more access to these books and for rightsholders to get their due, but what does it mean to assign a value to them?

They also plan to offer subscriptions: "We'll also be offering libraries, universities and other organizations the ability to purchase institutional subscriptions, which will give users access to the complete text of millions of titles while compensating authors and publishers for the service." I have mixed feelings -- the subscription model is not an unusual one, and libraries have certainly provided digitized materials from their collections for paid subscription services before, e.g., with ProQuest. I wonder if the partners will get any share in the compensation for providing the content for the service?

I'm currently at an Open Content Alliance meeting and I'm looking forward to what I am sure will be many discussions among the attendees today.

EDIT: There's now a joint press release from the University of Michigan, the University of California, and Stanford University, a FAQ from the Association of American Publishers, a Google rightsholders site, and a Google blog post, in addition to the site above and the press release.

Friday, October 24, 2008

search engine cache isn't copyright infringement

Some argue that search engines are copyright violators because they crawl, index, and keep an archive of web sites. That copied archive -- or cache -- is, according to this argument, an unauthorized copy. Found via TechDirt: the U.S. District Court for the Eastern District of Pennsylvania held that a Web site operator's failure to deploy a robots.txt file containing instructions not to copy and cache Web site content gave rise to an implied license to index that site.

In Parker v. Yahoo!, Inc., 2008 U.S. Dist. LEXIS 74512 (E.D. Pa. Sep. 26, 2008), the court found that the plaintiff's acknowledgment that he deliberately chose not to deploy a robots.txt file on the site containing his work was conclusive on the issue of implied license. In so ruling the court followed Field v. Google, a similar copyright infringement action brought by an author who failed to deploy a robots.txt file and whose works were copied and cached by the Google search engine.

The court further ruled, though, that a nonexclusive implied license may be terminated. Parker may have terminated the implied license by the institution of the litigation, and he alleged that the search engines failed to remove copies of his works from their cache even after the litigation was instituted. If proved, "the continued use over Parker's objection might constitute direct infringement." That issue will likely be resolved at a later date.
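
For readers who haven't worked with robots.txt, here is a minimal Python sketch of how a well-behaved crawler consults it, using the standard library's urllib.robotparser. The robots.txt content, the "ExampleBot" user agent, and the example.org URLs are all made up for illustration. Note that this only shows the crawl permission check, not how any particular search engine decides what to cache.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.org/private/page.html")
allowed = may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.org/public/page.html")
# An empty robots.txt has no rules, so the parser permits everything.
no_rules = may_crawl("", "ExampleBot", "http://example.org/private/page.html")
```

The last case is the one that matters in these opinions: with no instructions in place, a crawler treats the whole site as open to it.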

For an analysis, see the New Media and Technology Law Blog.

The same plaintiff's earlier Parker v. Google, Inc., No. 06-3074 (3d Cir. July 10, 2007) is also a search engine copyright infringement case.

Wednesday, October 22, 2008

tiny faces

This struck me as hilarious -- Someone noted that a box of Cascadian Farms frozen broccoli had teeny, tiny faces worked into the image on the label:


A comment in another blog said this (unsubstantiated):

"They've been putting tiny faces of employees, family and friends on the labels since at least 1995, which was when someone first showed me this on the labels of Cascadian Farms jams when I was first working at Fresh Fields (later bought by Whole Foods). CF has since been bought by General Mills, but it seems the tiny faces continue."

A bulletin board thread from 2007 claimed that the creamed corn packaging had a little hidden baby's face. Strange. That thread described it as a version of an "easter egg" in a video game or DVD ... an undocumented feature that you have to really try to find. That's pretty apt.

Sunday, October 19, 2008

los angeles food nostalgia

On a long drive recently, my partner Bruce and I were reminiscing about places we used to eat at in Los Angeles. He grew up there and has a longer list than I (maybe for another post). So many places we used to patronize as recently as the early 1990s are now gone, or, as the L.A. Time Machines site puts it, "extinct." I decided that I would try to write down the places I remember frequenting that are no longer open. Then I started semi-obsessively researching them.

  • A Chinese restaurant on Sunset Blvd or Antioch in Pacific Palisades. I don't remember the name or the food, but I do remember the old school Cantonese-American black, red, and gold dragon-decorated bar that served cocktails in tiki glasses with umbrellas. I used to order Fog Cutters or Mai Tais.
  • Blum’s in the Plaza Building at the Disneyland Hotel complex. Mom and I used to visit Disneyland every summer when we went down to L.A. to visit grandmother Johnston in Beverly Hills. Mom loved to stay in the old "Garden Rooms" section of the Disneyland Hotel (she called them the Lanai rooms). Every day we ate at least one meal at Blum's before or after a monorail trip. And on every trip we watched the "Dancing Waters" at the hotel complex.
  • Cafe Casino with locations on Gayley Ave in Westwood and on Ocean Blvd in Santa Monica. I ate a lot at the Westwood location (there was a great vintage poster store next door), and Bruce ate at the Santa Monica location.
  • Café Katsu on Sawtelle Blvd in West L.A. One of the first places Bruce and I went on a date. It was very small but the food was extraordinary. It was down the street from a Japanese restaurant whose name I cannot remember that had fabulous grilled squid, and near the take-away hole-in-the-wall Tempura House, which you had to visit early or they'd be out of the shrimp and sweet potatoes.
  • D.B. Levy’s sandwiches on Lindbrook Drive in Westwood. Located above what I seem to remember was a Burger King and near the now-demolished National movie theater, this place had a massive menu of sandwiches named after celebrities.
  • English Tea Room on Glendon Ave in Westwood. You walked off the street into a brick courtyard to enter this very quaint tea room. I almost always had the Welsh Rarebit and the bread and butter pudding.
  • Gianfranco restaurant and deli on Santa Monica Blvd in West L.A. This was the place a group of us ate on a very regular basis when undergraduates. I had a weakness for the gnocchi with pesto. My friend Cynthia remembers nothing but their delicate hazelnut cake.
  • Gorky’s Café at 536 East 8th Street in the downtown Garment District. Saturday mornings when I was in need of fabric or trim or beads, I'd drive to downtown L.A. to the garment district before any place was open, because the shopping day had to start with blintzes at Gorky's.
  • India's Oven on Pico Blvd in Culver City(?). The original location, with the disposable plates and cutlery.
  • Kelbo’s on Pico Blvd in west L.A. Polynesian tiki tacky like you cannot believe. Friends went there for the cheap, huge communal well drinks with lots of long straws. I went for the tiki tacky. They also sold painted faux stained glass for some reason.
  • Knoll's Black Forest Inn on Wilshire Blvd in Santa Monica. An unchanging decor and menu for decades.
  • Merlin McFly's on Main St. in Santa Monica. I remember it being near a great vintage clothing store. The draw was the amazing stained glass windows that featured historic magicians. The restaurant is long gone, but the windows were saved and are now at a venue called Magicopolis.
  • Mie and Mie Falafal in Westwood. I had never had falafel before my freshman year of college. Our dorm was being renovated for the 1984 Olympics and had no dining hall, so I often found myself there.
  • Moise’s Mexican on Santa Monica Blvd near Federal in West L.A. I lived four blocks away and always ordered the same thing -- carne en serape, a burrito filled with beef in a sour cream sauce, covered in cheese and quickly broiled.
  • Panda Inn on Pico Blvd in West L.A. An elegant 80s room, the 2nd or 3rd location after Pasadena, before it went national. It was at the Westside Pavilion mall, an upscale mall designed by Jon Jerde.
  • The Penguin Coffee Shop at 1670 Lincoln Blvd in Santa Monica. My friend Cynthia and I loved it for its Penguin logo. The sign is partly still there (as is the Googie-style building), even though it became an orthodontic office.
  • Polly's Pies at 501 Wilshire Blvd in Santa Monica. Bruce loved this place.
  • R.J.s for Ribs at 252 N. Beverly Blvd in Beverly Hills. The ribs were good, but the real joy was trying to stump your waiter when he asked what animal he should fashion your foil package for leftover food into. Armadillos, bats, ...
  • Robata on Santa Monica Blvd in West L.A. near the Nuart theater. I was a very frequent attendee when the Nuart (and the Fox on Lincoln in Venice) were full-time revival houses. Robata was, as you'd expect, Japanese robata, or grilled, skewered food.
  • The Sculpture Gardens restaurant on Abbott Kinney in Venice. Bruce and I ate brunch there very frequently, but no one else seems to remember it, with its multiple little buildings surrounding a funky courtyard with sculptures. They had the best breakfast bread basket and baked apple pancakes.
  • Ship’s coffee shop at 10877 Wilshire Blvd in Westwood. A classic Googie-style coffee shop with toasters at every table. My friend Kevin ate there a lot.
  • Tampico Tilly's on Wilshire in Santa Monica. Cheap, decent Mexican in a huge faux rancho house. I think El Cholo took over the building.
  • Trader Vic’s at 9876 Wilshire Blvd in Beverly Hills. I remember eating there every summer with my grandmother Johnston. I was always allowed to order a Shirley Temple, which feels pretty daring when you're 10 years old. Sometimes we also ate at The Velvet Turtle on Sepulveda Blvd.
  • Wildflour pizza on Wilshire Blvd in Santa Monica. There is still one location open, but this is the one I remember best. They had a to-die-for spinach salad with marinated artichoke hearts.
  • Zucky’s Deli at 431 Wilshire Blvd at 5th St in Santa Monica. I had a roommate in college who was from New York via Florida. He took me to Zucky's for my first egg cream, and introduced me to Fox's Ubet syrup. When I worked at the Getty Research Institute when it was at 4th and Wilshire, I often stopped in for the fabulous corn muffins from their bakery. Izzy's was across the street -- they had good pie but I remember really disliking their tuna melts. What an odd thing to remember.
Of course there were lots of other places that are still there ... Angeli Caffe on Melrose Ave in West Hollywood, Anna Maria's Trattoria on Wilshire in Santa Monica, The Apple Pan on Pico, the Border Grill on 4th Street in Santa Monica, the Broadway Deli on the 3rd Street Promenade in Santa Monica, Campanile on La Brea Ave in North Hollywood, Chaya Venice on Navy Street in Venice, Chin Chin on San Vicente Blvd in Brentwood, i Cugini on Ocean Avenue in Santa Monica, Dhaba Indian on Main in Santa Monica, Empress Pavilion on Hill St in Chinatown, Father's Office on Montana in Santa Monica, Marix Playa on Entrada in Pacific Palisades, Noma Sushi on Wilshire Blvd in Santa Monica, Stan's Donuts on Weyburn in Westwood (the best apple fritters), Robin Rose ice cream on Rose Ave. in Venice, The Rose Cafe on Rose Ave. in Venice, Snug Harbor on Wilshire in Santa Monica, Thai Dishes on Wilshire in West L.A., Versailles Cuban on Venice Blvd in Culver City, Woo Lae Oak Korean on Western at Wilshire (the best place to eat before shows at the Wiltern theater), Ye Olde King's Head on Santa Monica Blvd in Santa Monica (why do I think it was somewhere else before?) ... and likely dozens that I don't remember right now.

Friday, October 17, 2008

digital book access at Johns Hopkins

Jonathan Rochkind has posted a great description of digital book access features that he's put into production in the link resolver and OPAC at Johns Hopkins. They're remarkable in the sense that he's taken advantage of so many different service APIs (Google Books, IA, OCLC, Amazon, HathiTrust) to provide functionality with conditional options to provide as much collection coverage as possible.

obstacles to universal access

I've just read an interesting paper from a presentation at the recent CIDOC meeting: Nicholas Crofts, “Digital Assets and Digital Burdens: Obstacles to the Dream of Universal Access,” 2008 Annual Conference of CIDOC (Athens, September 15-18, 2008).

The premise is that technology is not the issue keeping our institutions from reaching a goal of universal access -- it's a number of post-technical issues, including varied intellectual property barriers, institutions' desires to protect their digital assets, and collection documentation that is not well-suited to sharing.

From the section on "Suitability of Documentation":

... but while this technical revolution has taken place, there has not been a corresponding revolution in documentation practice. The way that documentation is prepared and maintained and the sort of documentation that is produced are still heavily influenced by pre-Internet assumptions. The documentation found in museums – the raw material for diffusion – is often ill-suited for publication.
From the conclusion:
While making cultural material freely available is part of their mission, and therefore a goal that they are obliged to support, it may still come into conflict with other factors, notably commercial interests: the need to maintain a high-profile and to protect an effective brand image. If museums are to cooperate successfully and make digital resources widely available on collaborative platforms, they will either need to find ways of avoiding institutional anonymity, or agree to put aside their institutional identity to one side.
It's a frank and interesting paper. I think there has been progress in documentation practice -- look at the CCO and the Aquifer Shareable Metadata efforts, and the earlier Categories for the Description of Works of Art -- but it's true that this hasn't yet taken hold in a widespread way.

Wednesday, October 15, 2008

First Monday article on Google Books and OCA

The newest issue of First Monday (volume 13, number 10, 6 October 2008) has an interesting article by Kalev Leetaru -- "Mass book digitization: The deeper story of Google Books and the Open Content Alliance."
The article compares what is publicly known about the Google Books and OCA projects.

From the conclusions:

While on their surface, the Google Books and Open Content Alliance projects may appear very different, they in fact share many similarities:

  • Both operate as a black box outsourcing agent. The participating library transports books to the facility to be scanned and fetches them when they are done. The library provides or assists with housing for the facility, but its personnel are not permitted to operate the scanning units, which must be staffed by personnel from either Google or OCA.

  • Neither publishes official technical reports. Google engineers have published in the literature on specific components of their project, which offer crucial insights into the processes they use, while talks from senior leadership have yielded additional information. OCA has largely been absent from the literature and few speeches have unveiled substantial technical details. Both projects have chosen not to issue exhaustive technical reports outlining their infrastructure: Google due to trade secret concerns and OCA due to a lack of available time.

  • Both digitize in–copyright works. Google Books scans both out–of–copyright books and those for which copyright protection is still in force. OCA scans out–of–copyright books and only scans in–copyright books when permission has been secured to do so. Both initiatives maintain partnerships with publishers to acquire substantial in–copyright digital content.

  • Both use manual page turning and digital camera capture. Large teams of humans are used to manually turn pages in front of a pair of digital cameras that snap color photographs of the pages.

  • Both permit libraries to redistribute materials digitized from their collections. While redistribution rights vary for other entities, both the Google Books and OCA initiatives permit the library providing a work for digitization to host its own copy of that digitized work for selected personal use distribution.

  • Both permit unlimited personal use of out–of–copyright works. While redistribution rights vary for other entities, both the Google Books and OCA initiatives permit the library providing a work for digitization to host its own copy of that digitized work for selected personal use distribution.

  • Both enforce some restrictions on redistribution or commercial use. Google Books enforces a blanket prohibition on the commercial use of its materials, while at least one of OCA’s scanning partners does the same. Google requires users to contact it about redistribution or bulk downloading requests, while OCA permits any of its member institutions to restrict the redistribution of their material.

From the section on "Transparency":
A common comparison of the Google Books and Open Content Alliance projects revolves around the shroud of secrecy that underlies the Google Books operation. However, one may argue that such secrecy does not necessarily diminish the usefulness of access digitization projects, since the underlying technology and processes do not matter, only the final result. This is in contrast to preservation scanning, in which it may be argued that transparency is an essential attribute, since it is important to understand the technologies being used so as to understand the faithfulness of the resulting product. When it comes down to it, does it necessarily matter what particular piece of software or algorithm was used to perform bitonal thresholding on a page scan? When the intent of a project is simply to generate useable digital surrogates of printed works, the project may be considered a success if the files it offers provide digital access to those materials.
To me, that paragraph gets at the key issue in discussing and comparing the projects -- are books being scanned in a consistent way and being made accessible through at least one portal, enforcing current rights restrictions? Yes? Then both these projects are, at a basic level, successful and provide a useful service.

Yes, there are issues to quibble with for both projects. More technical transparency is desirable for both projects. Both have controlled workflows that limit what can be contributed to the projects in different ways. There are aspects of the Google workflow that Google contractually requires its partners to keep secret. That's their right to include in their contracts, and a potential partner's decision to make if they find it objectionable and therefore choose not to participate. Each documents and enforces rights in different ways and to different extents -- we should be looking to standards in that area. Each sets different requirements for allowing reuse. If only there could be agreement.

One note on preservation. Neither project is a preservation project -- they're access projects. Even if there were something we could point to and say "that's a preservation-quality digital surrogate" -- if such a concept as "preservation-quality" exists -- neither project aims for that. Both projects do, however, allow the participating libraries to preserve the files created through the projects. These files should and must be preserved because they can be used to provide digital modes of access, and, in some cases, they may be the only surrogates ever made if the condition of a book has deteriorated. Look at the HathiTrust for more on the topic of preserving the output of mass digitization projects.

And one note about the Google project providing "free" digitization for its participants. Yes, Google is underwriting the cost of digitization. But each partner library is bearing the cost of staffing and supplies for project management, checkout/checkin, shelving, barcoding, cataloging, and conservation activities, not to mention storage and management of the files. The overall cost is definitely reduced, but not free.

Tuesday, October 14, 2008

Frankfurt Book Fair survey on digitization

Via TeleRead, the 2008 Frankfurt Book Fair conducted a survey on how digitization will shape the future of publishing. The summary results are available in a press release.

These are the top four challenges facing the industry identified through the survey:

• copyright – 28 per cent
• digital rights management – 22 per cent
• standard format (such as epub) – 21 per cent
• retail price maintenance – 16 per cent

I don't know the details behind these concerns in the survey results, but as generalizations the first three overlap in interesting ways with the challenges facing digital collection building in libraries. What are appropriate terms for copyright and licensing for libraries? How do we identify/document copyright (and other rights) status? How do we manage access and provide for fair use with varying DRM scenarios? What standards will enhance preservation and ongoing access?

Wednesday, October 08, 2008

DCC Curation Lifecycle Model

Via the Digital Curation Blog, I came across the DCC Curation Lifecycle Model. This is a very interesting high-level overview of the life cycle stages in digital curation efforts. There's an introductory article available.

The model proposes a generic set of sequential activities -- creating or receiving content, appraisal, ingest, preservation events, storage, etc. There are some decision points at the appraisal and preservation event stages about next steps -- refusal, reappraisal, migration, etc. A colleague and I sat together and looked it over this afternoon. We were both looking at it from the perspective of a digital collections repository and not an IR, and the model was designed primarily with IRs in mind, so our thoughts are coming from a different place in terms of what we wanted to see additionally taken into account in the visualization.

There's a "transform" activity -- definitely something that takes place potentially multiple times in a data life cycle. In the visualization this appears sequentially after "store" and "access, use and reuse." This is an activity that's hard to include in a visualization of a sequence because it can take place at so many points, but it feels like it should be earlier in the sequence, perhaps before those two steps.

The next ring is labeled with the activities "curate" and "preserve" with arrows. Does the placement of the terms and arrows mean anything in relation to the outermost ring? Are "ingest," "preservation activity" and "store" part of "preserve" and the rest part of "curate?" Or does this more simply represent ongoing activities?

The center of the model is the data. It's surrounded by a ring for descriptive and presentation information. It's an activity of central importance and is directly related to the data as is shown, but we weren't sure how its placement related to the sequence of tasks in the visualization.

"Preservation planning" is the next ring out. Planning and implementation are a central, ongoing activity. We also weren't sure when this ongoing activity meshed with the sequence.

"Community watch and participation" is the last remaining inner ring. It's also an ongoing activity. What actions might the outcomes of this activity affect?

Overall, this is a good model for planning. It's challenging to create a visualization for complex processes and dependencies and this covers a lot of ground. And of course it's meant to be generic and high-level, to be made more concrete by an institution that makes use of it. It certainly stimulated our thinking in terms of how we might model our data life cycle and the dependencies between the various tasks.

NOTE: Sarah Higgins, who created the model, has provided excellent responses to my thoughts and questions in the comments to this post. Please read them!

Sunday, October 05, 2008

Gettysburg Cyclorama

A cyclorama was the cutting-edge multimedia installation of its time in the 1870s-90s. A massive 360-degree painting in the round, it was often accompanied by narration, music, and a light show to heighten the illusion. Today I went to see the conserved, restored, and reinstalled Gettysburg Cyclorama at the new visitors' center. The center opened in April, but the Cyclorama only reopened 10 days ago.

I'm glad I went. True, you only get to spend 15 minutes in the Cyclorama gallery, and you have to sit through a short movie about the battle first because the museum, movie, and cyclorama are on one ticket. But the new museum is very nicely designed and installed (and extensive), the movie is not too long and very well done, and the tickets are reasonably priced.

The painting (by Paul Philippoteaux, 1884) was installed in the new facility with its diorama foreground illusions recreated. They run a 15-minute narrated sound and light show to recreate Pickett's Charge (dawn over the battlefield is amazing), then they bring up the lights for a few minutes so you can see the entire painting clearly. In some spots the diorama leads seamlessly into the painting. It's still an amazing illusion and it takes your breath away.

The Cyclorama painting was previously housed in a Richard Neutra-designed building at Gettysburg. The Neutra building is scheduled for demolition in December 2008, but there is litigation to attempt to stop it. That will be a difficult case -- battlefield restoration versus Modern architecture preservation.

Friday, October 03, 2008

Federal Agencies Digitization Guidelines Initiative

The Federal Agencies Digitization Guidelines Initiative site went live on September 30, 2008. The initiative represents a collaborative effort between U.S. government agencies to establish a common set of guidelines for digitizing historical materials. Participants include the Defense Visual Information Directorate, the Library of Congress, the National Agricultural Library, the National Archives and Records Administration, the National Gallery of Art, the National Library of Medicine, the National Technical Information Service, the National Transportation Library, the Smithsonian Institution, the U.S. Geological Survey, the U.S. Government Printing Office, and The Voice of America.

The Still Image Working Group is focusing its efforts on books, manuscripts, maps, and photographic prints and negatives. There are draft "Digital Imaging Framework" and "TIFF Image Metadata" documents available. The Audio-Visual Working Group effort will cover sound and video recordings and will consider the inclusion of motion picture film as the project proceeds. That group is still at the document drafting stage.

Thursday, October 02, 2008

interesting re-use of American Memory content

From Boing Boing:

American Memory is a new and compelling DVD coming from extended Skinny Puppy posse members William Morrison and Justin Bennett later this year. It took me a while to figure out exactly what was going on (and exactly who was responsible), but that didn't detract from this hypnotic and ultimately forceful piece.

The voice in the clip on the DVD's trailer is that of former slave Alice Gaston, interviewed in her eighties for the Library of Congress in 1941. The actress is lip-synching to her dialogue. Videomaker William Morrison explains that the whole project works this way, using audio from the American Memory Archive along with new and processed footage. And, of course, Skinny Puppy music.

According to Morrison: "The theoretical context of the project is that some time in the very distance future, long after America is gone, some artists scouring the backwater of whatever the net has become discover the American Memory Archive. They have no context for it's meaning but are intrigued by the sights and sounds. They create surreal impressions of the material they find and broadcast it back through time. A quantum radio channel beamed into the sub conscious minds of the 21st century."

A few different permutations of the band will be playing a show on December 4 at the Gramercy in NYC.

Wednesday, September 24, 2008

new version of getty introduction to metadata

The third edition of the Getty Introduction to Metadata -- edited by Murtha Baca, with essays by Tony Gill, Anne J. Gilliland, Maureen Whalen, and Mary Woodley -- is now available online and in hard copy. This is a very useful overview and it's nice to see it updated.

Thursday, September 18, 2008

generational myths

Siva Vaidhyanathan has a great article in The Chronicle Review entitled "Generational Myth." Siva and I first met through our online discussion of this topic -- I very strongly agree with him on this issue.

Lorcan Dempsey posted about blog posts by Andy Powell and Dave White giving their takes on this issue. Dave's proposed "Resident" and "Visitor" categories, and his acknowledgment of the spectra of behaviors that these categories represent, are a well-considered take on how libraries might better understand styles of learning, of distance students in particular. I'm obviously not a fan of human categorization -- people are notoriously hard to pigeonhole. But I think these are actually more akin to personas than categories, like those that you'd develop as an exercise when designing a new online service. Not unerringly accurate, but not without usefulness. It certainly supplements the often simplistic thinking about our users as "faculty" or "graduate students" or "undergraduates" or "the public."

I also strongly recommend Janna Brancolini's blog Generation Underrated, her response to Mark Bauerlein's The Dumbest Generation: How the Digital Age Stupefies Young Americans and Jeopardizes Our Future (Or, Don't Trust Anyone Under 30). Check out what someone under 30 has to say, who also happens to be the daughter of a digital librarian.

grapes need a eula?

From Serious Eats, an image of an empty bag of grapes ... with a EULA.

The recipient of the produce contained in this package agrees not to propagate or reproduce any portion of the produce, including (but not limited to) seeds, stems, tissue and fruit.
To me this is particularly amusing because they're seedless grapes ...

Wednesday, September 17, 2008


There is major buzz around the announcement that the Los Alamos National Laboratory Research Library has released djatoka, a "reuse friendly" open source JPEG2000 Image Server. It's available on SourceForge under a GNU Lesser General Public License.

There's an excellent D-Lib article that fully describes the server. I love the first sentence of the article: "The Digital Library Research & Prototyping Team at the Los Alamos National Laboratory (LANL) enjoys tackling challenging problems." Now there's an understatement!

We did some explorations with Kakadu (one of the components of djatoka) when I was at UVA, and we use Aware at LC. I plan to take a long, hard look at this.

smithsonian digitization initiative

There's an announcement on CNN that the Smithsonian plans to put its 137 million-object collection online. The new Smithsonian Secretary, G. Wayne Clough, said in an interview that they do not yet know how long it will take or how much it will cost to digitize the full collection, and that they will do it as money becomes available. A team will prioritize which artifacts are digitized first. They plan to focus on making the collections usable for the K-12 audience.

When I was at the Smithsonian yesterday for David Weinberger's talk, this seemed to be a buzzing topic of discussion among audience members; one Smithsonian employee even mentioned it in a question to Weinberger, expressing a certain level of surprise.

Tuesday, September 16, 2008

small pieces loosely joined by metadata

Today I had the extreme pleasure of attending a talk by David Weinberger (The Cluetrain Manifesto, Small Pieces Loosely Joined, Everything is Miscellaneous) at the Smithsonian, entitled "Knowledge, Noise and the End of Information." It was webcast, and I strongly suggest viewing it if you can.

There was lots of interesting discussion about the definition of information, the innately social nature of the human race and how social interaction is a vital aspect of information discovery, and how the loosely joined and messy nature of the internet just reflects human nature and is not a bad thing. He also stressed that one can never know what digital information might be of importance in the future, so we should, as cultural institutions, be striving to keep as much as possible. He also touched on the importance of brand and authoritativeness, but cautioned against equating them with control.

A word I did not expect to hear today, let alone about a hundred times, was "metadata." The cellphone image above is a shot of one of his concluding statements. He talked a lot about the importance of metadata, whether it be authoritative cataloging, community tagging, or contextual relationships through linking. Since we can never imagine all the uses for our digital content, we cannot possibly expend the costly effort to provide all the descriptive metadata that every community might want or need, so all three are complementary and of equal value.

One of my take-aways was that this again shows the importance of just getting digital content out there. Let the content express itself through its authoritative metadata, but also provide open access and support multiple mechanisms through which it can be incorporated into new contexts and uses and gain new descriptions.

Monday, September 15, 2008

open access to museum collections

Last Friday there was a post on Open Access News that Wake Forest University's Anthropology Museum had issued a press release about the launch of its online collections, supported by an IMLS grant.

I welcomed this news on many fronts -- there aren't enough ethnographic or archaeological collections online; the museum is using Re:discovery, a great product geared toward small museums; and I have a number of friends with ties to Wake Forest and I've visited Winston-Salem many times and have a fondness for the area.

What made me sit down to think about this for a few days was the passing description of this as an Open Access project.

I worked for many years in the museum community, and every museum that I ever worked for or consulted for wanted to make its collections available in one digital form or another. The Museum Computer Network was founded in 1967 to enable museums to automate their processes and convert collections records to digital form. Museums were among the earliest institutions to share their collections online in the mid 1990s. The University of California Museum of Paleontology had a web site in 1994. The Fine Arts Museums of San Francisco brought their Thinker "imagebase" online in 1996 -- and they had volunteers assist with an early form of experimental user-supplied subject metadata, i.e., proto-tagging. By 1997 the National Gallery of Art provided access to over 100,000 objects in its collection, and the Los Angeles County Museum of Art experimented with converting print museum catalogs into freely available online publications.

Sure, there have been lengthy discourses about levels of access to the digital media surrogates and questions of rights and control of those new media assets, and there is some information about the acquisition of objects that's subject to privacy restrictions, but no museum wants to limit discovery of their collections -- they want to facilitate their collections' use in research and teaching.

I've just not heard it described as "open access" before.

I'm not saying that it isn't a sort of open access initiative -- it most obviously is -- but I think of it as such a normal museum activity that I don't categorize it in my mind as anything other than business as usual. Then it hit me -- for the past 15 years museums have been major players in the open access movement without necessarily always knowing it.

Labeling this an open access initiative re-contextualizes this core museum activity into a different realm -- one that I hope will make museum collections information more visible and reinforce the importance of all categories of open access content.

Friday, September 12, 2008

nsdl metadata registry

This afternoon a group of us had the opportunity to sit down with Jon Phipps, implementer of the NSDL Metadata Registry.

I knew that such a thing existed. I understand RDF. I know about SKOS. I hadn't really given a lot of thought as to how to best take advantage of it.

Today, I had one of those skies-are-opening, angels-singing-from-on-high moments. RDF can be used to model relationships between concepts and potentially enforce them through schemas. This can obviously be applied to improve discoverability when a hierarchical taxonomy is employed. Then my LC colleague Clay Redding said that he was experimenting with multiple schemas and managing local alternative labels in addition to authoritative preferred labels. And then Jon and Ed Summers mentioned the potential for this tool to map across schemas. My a-ha moment was understanding the potential for formalized mappings across metadata schemas to improve discoverability within and across collections described with heterogeneous taxonomies and vocabularies.

I remember using Chenhall's Nomenclature in records for ethnographic objects where we recorded every level of the hierarchy in its own field -- it was madness. I remember when we were in the early days of the AAT, busily submitting new terms and building the hierarchies. Our dream was searching for "case furniture" and getting results with bookcases, chests, desks, wardrobes, and every semantic child, even where "case furniture" never appeared in the record.
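
That "case furniture" dream can be sketched in a few lines. This is only a minimal illustration, assuming a toy hierarchy held in a Python dictionary; the terms, the `NARROWER` table, and the `expand` and `search` helpers are all hypothetical and are not drawn from the AAT or the NSDL Metadata Registry.

```python
# Toy thesaurus: each term maps to its direct narrower terms.
# This hierarchy is invented for illustration, not copied from the AAT.
NARROWER = {
    "case furniture": ["bookcases", "chests", "desks", "wardrobes"],
    "chests": ["blanket chests"],
}

def expand(term):
    """Return the term plus every narrower term beneath it."""
    found = [term]
    for child in NARROWER.get(term, []):
        found.extend(expand(child))
    return found

def search(records, term):
    """Return records whose subject is the term or any of its descendants."""
    wanted = set(expand(term))
    return [r for r in records if r["subject"] in wanted]

# Hypothetical catalog records; none of them mentions "case furniture".
records = [
    {"id": 1, "subject": "desks"},
    {"id": 2, "subject": "blanket chests"},
    {"id": 3, "subject": "chairs"},
]

hits = search(records, "case furniture")
print([r["id"] for r in hits])
```

The records only carry their most specific term; the hierarchy lives in one place and is applied at search time, rather than being repeated field by field in every record.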

I remember some research at USC in the late 1990s about thesaurus-enabled searching. OCLC's Metadata Switch project has done some work in cross-schema mapping. I know this is very difficult to accomplish. Today was the first time I saw a tool that might make the conceptual mapping simpler. But not simple. This is a potentially massively overwhelming task if it can't be done programmatically to a large extent.
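
To make the cross-schema idea concrete, here is a rough sketch, assuming the mappings are stored as simple triples in the spirit of the SKOS mapping properties. The vocabulary prefixes, the terms, and the `equivalents` helper are invented for illustration and do not reflect how the Registry itself stores or serves mappings.

```python
# Hypothetical mappings between two local vocabularies, loosely modeled on
# SKOS exactMatch / closeMatch links. All terms are invented examples.
MAPPINGS = [
    ("vocabA:wardrobes", "exactMatch", "vocabB:armoires"),
    ("vocabA:desks", "closeMatch", "vocabB:writing tables"),
]

def equivalents(term, relations=("exactMatch", "closeMatch")):
    """Return the term plus any terms mapped to it, in either direction."""
    result = {term}
    for source, relation, target in MAPPINGS:
        if relation not in relations:
            continue
        if source == term:
            result.add(target)
        elif target == term:
            result.add(source)
    return result

# Search terms for a query that should span both vocabularies.
print(sorted(equivalents("vocabA:wardrobes")))
# Restricting to exact matches drops the looser "closeMatch" link.
print(sorted(equivalents("vocabB:writing tables", relations=("exactMatch",))))
```

The lookup is the easy part. The hard, expensive part is producing the mapping table in the first place, which is why doing as much of it as possible programmatically matters.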

I'm coming late to the party, but now I'm really intrigued by what might be accomplished in this arena.

Tuesday, September 09, 2008

Cory Doctorow book of essays

I am a big fan of Cory Doctorow's writing -- his fiction and his essays on technology, rights, and privacy. Via BoingBoing comes word of his new book of essays -- Content: Selected Essays on Technology, Creativity, Copyright, and the Future of the Future -- which he is making available as a free Creative Commons licensed PDF download.

If you haven't read Cory Doctorow yet, you should. I don't always agree with everything he says, but he is thoughtful and technologically savvy and writes thorough essays on very relevant topics in an entertaining style.

I've read some of these essays before, but having them together in one beautifully-designed volume that I can always refer to is the proverbial good thing.

LoC Repository Development Group hiring

Our group has a position open. Visit the LoC jobs page and search for posting "080214". The posting does not mention our unit specifically, so this is a heads-up that the job is with us. We're still a relatively new group, working on a variety of projects with many units across the Library and developing our group's role in the institution.

The application period closes on October 3, and that is an absolute deadline. You must apply using an online federal job application system -- it's a lengthy form that requires some time to fill out. Be prepared with electronic copies of your documents to cut and paste.

EDIT (9/24/2008): This position reports to the Director of the Repository Development Group. Everyone in the team -- including me -- reports to the Director. There is no additional management structure.

Monday, September 08, 2008

google newspaper digitization

Google is digitizing newspapers.

Not only will you be able to search these newspapers, you'll also be able to browse through them exactly as they were printed -- photographs, headlines, articles, advertisements and all.

This effort expands on the contributions of others who've already begun digitizing historical newspapers. In 2006, we started working with publications like the New York Times and the Washington Post to index existing digital archives and make them searchable via the Google News Archive. Now, this effort will enable us to help you find an even greater range of material from newspapers large and small, in conjunction with partners such as ProQuest and Heritage, who've joined in this initiative. One of our partners, the Quebec Chronicle-Telegraph, is actually the oldest newspaper in North America—history buffs, take note: it has been publishing continuously for more than 244 years.

You’ll be able to explore this historical treasure trove by searching the Google News Archive or by using the timeline feature after searching Google News. Not every search will trigger this new content, but you can start by trying queries like [Nixon space shuttle] or [Titanic located]. Stories we've scanned under this initiative will appear alongside already-digitized material from publications like the New York Times as well as from archive aggregators, and are marked "Google News Archive." Over time, as we scan more articles and our index grows, we'll also start blending these archives into our main search results so that when you search Google.com, you'll be searching the full text of these newspapers as well.
It's interesting that they're working directly with publishers and with aggregators such as ProQuest to digitize and improve discoverability of back files. That's good news, but do they also plan to work with major newspaper open access projects such as the National Digital Newspaper Program? Are they digitizing any collections in addition to publisher collections?

When I last looked at the Google news archive in September 2006 I found that way too much of the content was pay-per-view, made you pay even if your institution had licensed subscription access, and didn't work with OpenURL resolvers. I don't see that any of that has changed. I hope it will.

vintage museum photos

Via BoingBoing, check out these fabulous vintage photos from the American Museum of Natural History. I love dioramas, and the exhibit installation images are just great. Taxidermy mounting, diorama background painting, articulating dinosaur bones, casting animal models ... And the vintage exhibitions! I love the images of earnest children being led around ... and doing so-called Indian dances in their construction paper bonnets. State-of-the-art, 1900s-1970s.

Saturday, September 06, 2008

ambient awareness

This week's New York Times Magazine has a piece by Clive Thompson that explores issues around ambient awareness and privacy. Facebook, twitter, flickr, dopplr, and texting and blogging more generally. Is it narcissistic to broadcast your status using awareness tools? Are these tools to improve connectedness in a more mobile and global human ecology -- the ultimate tools for building and maintaining relationships?

This is the paradox of ambient awareness. Each little update — each individual bit of social information — is insignificant on its own, even supremely mundane. But taken together, over time, the little snippets coalesce into a surprisingly sophisticated portrait of your friends’ and family members’ lives, like thousands of dots making a pointillist painting. This was never before possible, because in the real world, no friend would bother to call you up and detail the sandwiches she was eating. The ambient information becomes like “a type of E.S.P.,” as Haley described it to me, an invisible dimension floating over everyday life.
And when they do socialize face to face, it feels oddly as if they’ve never actually been apart. They don’t need to ask, “So, what have you been up to?” because they already know. Instead, they’ll begin discussing something that one of the friends Twittered that afternoon, as if picking up a conversation in the middle.
An interesting section focuses on the so-called "Dunbar Number" -- just how many people can you be "friends" with, anyway? According to anthropologist Robin Dunbar, about 150. Can you max out on social connectedness? Not really, since many of one's ambient connections are weak ties, not close, intimate friends. But weak ties are still an important part of social and professional networks.

I find it useful to check in on my Facebook account and see the status newsfeeds of my friends and colleagues. I have also personally met all but a handful, and I believe that they are controlling their feeds and filtering what they write in their status in a way that maintains their chosen levels of privacy. I keep my status updated. I blog, and I know and expect that people who have never met me read it. But is the ability to follow personal newsfeeds and tweets of people you will never know a creepy invasion of privacy, making it too easy to develop parasocial relationships? Or is it all just part of ubiquitous ambient awareness where participation is increasingly not optional?

I originally refused to blog or join Facebook because I thought it was vain to assume that anyone wanted to know what I was thinking or doing, and that I'd be giving up my privacy. OK, I have given up some of my privacy, but I've also made new connections I might never have otherwise, re-established relationships that had gone dormant, and built stronger ties with geographically disparate friends. While I'm not willing to give up my privacy for a free cup of coffee, I am willing to give up some privacy to do that.

Wednesday, September 03, 2008


The University of Michigan has announced that their MBooks initiative has grown into a shared repository effort called the HathiTrust (pronounced hah-TEE).

HathiTrust was originally a collaboration of the thirteen universities of the Committee on Institutional Cooperation (CIC) to establish a repository for those universities to archive and share their digitized collections. All content to date has been supplied by the University of Michigan and the University of Wisconsin, and Indiana University and Purdue University will soon be contributing their digital materials. 20% of its current content is open access and 80% is restricted. Don't look for a single search interface yet -- it's planned. As they say: "Good, useful, technology takes time.... and the strength and insight born of collaborative work."

The new HathiTrust initiative has been funded for an initial five-year period beginning January 2008, and is now open to other institutions. Partners will be charged a one-time start-up fee based on the number of volumes added to the repository, in addition to an annual fee for the curation of the data. They already support both open access and dark archive materials, and will also do so for new partners.

Their July 2008 monthly report gives a good sense of their activities. It is interesting to note that the only initial ingest workflow supported is the Google partner workflow. That's not too surprising since this work is based on the MBooks project developed in support of Google content workflows. That in and of itself ensures that there are many institutions who'll be considering partnership.

This announcement is exceptionally exciting. I look forward to its development as a service.