Tuesday, December 30, 2008

archiving the bush administration

There is a great article on ars technica today about the major processing effort that will be required at the National Archives when the Bush administration leaves office. The ars technica piece references a New York Times article on the topic from this past weekend.

This section really strikes home:

The contingency plan will entail "ingesting" the Bush White House's data into a separate system before integrating it with the ordinary archive. As the plan explains, "the current PERL [Presidential Electronic Records Library] system architecture was not scalable to actually support the volume of records that are expected from the current Presidential administration."

It's not just size that matters, though: the Archives will also need to process reams of information locked in some quaint proprietary formats. The RMS index, for example, "consists of an implementation of a customized older version of Documentum running on Oracle, with image files (including copies of scanned records) incorporated as objects in the database." The photos are stored in a "proprietary photo management software called MerlinOne, running on Microsoft SQL as the database engine," and it has apparently taken several months to extract the images and metadata for relinkage outside the Merlin format.

First, the use of quotation marks should remind us all that "ingest" means absolutely nothing to someone who is not a repository manager.

I have participated in some discussions about a potential data migration project at work. I recently saw an inventory of media formats -- not file formats, but media formats -- that the project would need to encompass, and it is lengthy. The only source I can think of for hardware to read some of the formats is eBay. That doesn't even take into account the files themselves. It's interesting how quickly a format becomes obsolete, and how many customized systems federal agencies use.

Monday, December 29, 2008

blogging has fallen by the wayside

Between a month almost solely dedicated to a single high-stress project and a lot of other writing commitments -- revising a paper for a conference, drafting a conference proposal and a co-authored conference proposal, and writing a chapter for a book -- I find that I haven't made time to blog. I promise to make time soon.

best metrics for comparing hardware?

Recently I spent 4 weeks on a project where we were considering hardware options for a large amount of storage for a data migration project. We ended up with 4 different proposals -- three from vendors and one to be built in-house. One of the tasks that I worked on was a matrix to compare the 4 potential solutions.

There were the easy metrics -- the amount of raw and usable storage, number of racks/tiles required, electrical and cooling requirements, cost, etc. Comparing supportability was trickier but doable, with 24*7 versus 12*5 phone support, availability of on-site technicians, warranty terms, support contract costs, etc. Where it became more difficult was identifying metrics to compare performance. Ratio of processors to storage? Location of processing nodes in the architecture? I/O rates? Time to read all data? And how do you best calculate those last two with four quite architecturally different proposals? We ended up with metrics that not everyone agreed upon, in part because there was a requirement that not everyone agreed upon.
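For what it's worth, two of those performance questions reduce to simple arithmetic once each proposal states its usable capacity, core count, and sustained aggregate read throughput. Below is a minimal sketch in Python of that arithmetic. The proposal names and every number are made up for illustration (they are not the figures from our matrix), and it assumes decimal units (1 TB = 1,000 GB) and that a vendor's quoted throughput can actually be sustained across the whole system, which is exactly the kind of assumption people disagreed about.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """One storage proposal. All field values used below are hypothetical."""
    name: str
    usable_tb: float               # usable storage, decimal terabytes
    cpu_cores: int                 # processor cores across the system
    aggregate_read_gb_per_s: float  # sustained aggregate read rate, gigabytes/second

    def cores_per_tb(self) -> float:
        # Ratio of processors to storage.
        return self.cpu_cores / self.usable_tb

    def hours_to_read_all(self) -> float:
        # Time to read all data = capacity / throughput (1 TB = 1000 GB).
        seconds = (self.usable_tb * 1000) / self.aggregate_read_gb_per_s
        return seconds / 3600


# Illustrative numbers only; not taken from any real vendor proposal.
proposals = [
    Proposal("vendor-a", usable_tb=500, cpu_cores=64, aggregate_read_gb_per_s=2.0),
    Proposal("in-house", usable_tb=720, cpu_cores=48, aggregate_read_gb_per_s=4.0),
]

for p in proposals:
    print(f"{p.name}: {p.cores_per_tb():.3f} cores/TB, "
          f"{p.hours_to_read_all():.1f} hours to read everything")
```

The hard part is not the division but agreeing on what goes into the throughput figure for architecturally different systems, so a comparison like this is only as good as that one input.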

I'm curious how other folks have gone about doing this. I'd be interested in hearing from anyone who is willing to share their strategies.

Wednesday, December 17, 2008

ICDL adding European collections

From an article on Forbes.com, the International Children's Digital Library (ICDL) announced a partnership with the Taliaferro Family Fund to increase the number of European children's titles in the collection. The Elias Project will target three collections in Europe: the Norwegian Children's Book Institute in Oslo, Norway, the International Youth Library in Munich, Germany, and the National Center for Children's Books in Paris, France.

After reading the article, I checked in at the ICDL site, which I hadn't visited in a few months, and noticed two other news announcements: ICDL and the Google Book project will be sharing public domain children's book titles; and ICDL has launched an iPhone app with full access to the collection, a new titles feature, an offline mode, and an airplane mode. It's great to see such a worthwhile project making such advances in collection building and in adding new services.

(I didn't see a press release about the European project on the ICDL site. I saw the press release on some other sites, so I assume it's meant to be out there.)

Tuesday, December 16, 2008

letter to santa

Nik Honeysett has posted a great letter to Santa on the Musematic blog.

If enough of us ask for that image format, will Santa grant our wish?

Friday, December 12, 2008

interview with Paul LeClerc in New York Times

Paul LeClerc, director of the New York Public Library, answered questions online at the New York Times that have been made available in three parts: part one, part two, part three. Topics include budgets, branch closures and renovations, ebooks, and preservation efforts.

In part two, he briefly mentioned their participation in the Library of Congress National Digital Newspaper Project and its public access content web site Chronicling America. It was nice to see this project get a media mention in the context of preserving and providing access to often ephemeral newspapers.

Library of Congress releases report on flickr pilot

The Library of Congress has released its report on its Flickr Commons pilot, where approximately 5,000 images were uploaded for a crowdsourcing metadata experiment. A full report and a summary report are available, both PDFs.

The photos have drawn more than 10 million views, 7,166 comments and more than 67,000 tags. When Flickr commenters provide updated place and personal names, dates, and event identification, staff from the Library's Prints and Photographs Division verify the information and have so far updated more than 500 records in their catalog -- with many more in the queue -- citing the Flickr Commons Project as the source of the new information.

Thursday, December 11, 2008

creative commons wants feedback on licenses

Creative Commons is conducting a study to collect feedback on the term “noncommercial” and how it should be covered in its licenses. The hope is that what’s learned from the survey can improve the licenses that allow or restrict noncommercial uses. The questionnaire has to be completed by this Sunday, December 14, 2008. Everyone who has taken advantage of CC licenses as a creator or a user should take some time to answer the questions.

world war II collection at the national archives and footnote

The US National Archives and the historical document website Footnote.com have collaborated on the digitization of a large collection of documents from the US involvement in World War II, which are now available on the footnote.com web site. There is an ars technica article on the collection and interface.

Like the ars technica writer, I had a lot of difficulty finding anything that I hoped to find. My grandfather, father, and uncle all served in WWII. My grandfather died in a friendly fire incident where allied planes accidentally sank a ship carrying prisoners of war to be returned. I found nothing, either in the documents or in the photos, although I did find out that a man with almost the same name as my uncle (same middle initial but different middle name) was listed as missing when his plane was shot down in 1943. Still, it's a lot of useful content that I'm glad to see digitized and OCR'ed.

I was disappointed, but I wasn't surprised. I found the navigation a bit puzzling, and I had to keep multiple tabs open to get back to my search easily. Not just the image but the entire image viewer screen had to come into focus when I selected something to view.

The ars technica writer said that his view of the site included the disclaimer "All Free (for a limited time)," and commented that "... it would be nice to think that a service based on government records of a significant American experience would be free indefinitely." The original press release describing the collaboration is worth reviewing, because it addresses that point in the ars technica article. The agreement allows Footnote.com non-exclusive access, and "After an interval of five years, all images digitized through this agreement will be available at no charge through the National Archives web site." So, Footnote can charge for it for now, but it will all revert to the National Archives for free and open access.

I don't see that disclaimer when using my Library of Congress computer because we have full access -- I wonder how long it will be fully accessible for those without subscriptions?

Sunday, December 07, 2008

laine farley named cdl director

I just saw the press release naming Laine Farley as the new Director of the California Digital Library. I am thrilled for Laine, who's been serving as the interim Director for over 2 years. I have worked with her on Aquifer and at least one other collaborative community project, and I know what an experienced and capable person she is.