Thursday, August 13, 2009

metaphors

My colleague Thorny Staples often uses the metaphor that digital humanities projects are, at their most basic level, online exhibitions. Curated content is presented with key descriptive information not unlike exhibition tombstone labels and contextualized through categorization and by scholarly essays of varying lengths as well as site information architecture (not unlike rooms of an exhibition with wall texts). The end results include the identification and explication of relationships and the presentation of deep readings of objects. That metaphor always resonated with me.

In a recent discussion a small group was trying to work out some generalized models for the processes we follow from the receipt/creation of digital files through to providing access. We were having a particularly lengthy discussion about description and contextualization -- at what point in a digital file's life cycle is it related to other files and identified as a digital object, and at what point is some sort of intellectual meaning overlaid onto that digital object?

My new colleague Terry Harrison -- a big fan of using metaphors -- commented that when museums acquire objects they cannot know every context in which the object will be exhibited or published in the future, but they acquire it and put effort into description and conservation to prepare for future display/publication when the object will be contextualized many times over.

This sent me down the road to a metaphor that's still developing in my head and may not yet translate to something that anyone besides me thinks is sensible. Or it may not be sensible at all.

First, I'm starting with an assumption that there are four very broad categories of activities that we need to describe (leaving out "preservation" for now). On the museum side, it's these:

Acquisition: Items are proposed, selected, and acquired
Accessioning: Items have accession numbers assigned, are assigned storage locations, relationships between parts are identified (a tea set is made up of individual components), and basic descriptive information is recorded in a registration system
Preparation: Items are cleaned, repaired, mounted, framed, or otherwise stabilized and made ready for research use and public viewing
Exhibition: Items are further described and presented in the context identified by a collection or exhibition curator; an object will be exhibited many times and assigned to multiple contexts

This roughly translates to this in the digital realm:

Creation/Transfer: Selection and digitization or transfer of digital (master?) files to an institution
Inventory: Files are assigned identifiers/names, placed into some sort of meaningful (or not) storage location in a server environment
Processing: QA, manipulation, derivative creation
Access: Making content discoverable and usable, which can include a curator providing context and intellectual overlays for objects (not files)

I'm having one real issue in making this metaphor work for me and for others, and that's around the creation of metadata and recording of file relationships. At what point is the relationship of files to each other recorded? Is the creation of metadata identifying/describing an intellectual object part of inventory, processing, or access? When is the relationship of files to that intellectual object recorded?

I think that inventorying should include a step whereby the relationships between files are recorded so it is recognizable that some set of 300 files go together. There wasn't a lot of pushback on this in our discussion. The questions of when descriptive metadata for an intellectual object is created, and when the relationship of files to an intellectual object is recorded, engendered a lot of discussion. I personally think that descriptive metadata for intellectual objects represented by those files is also created during the inventory stage, and that files in hand at that stage should in some way be associated with the intellectual objects at that time.

This is complicated because the recording of all the relationships of files to intellectual objects is not fully possible until objects are prepared and added to an access application. That's where the contextualization happens, so one can argue that that is where intellectual objects are truly defined and the process of associating files to objects takes place. Preparation is driven by access. If access applications are siloed at all, each might use different derivative files, and there has to be some association of those derivatives to the master and to the intellectual objects.

So, we have master files, derivative files (possibly multiple sets over time per access point), intellectual object metadata, relationships of all files to each other and to that intellectual object, and the need to inventory and manage all of the above, which may be separate from an access application or multiple access points. Where is this recorded, in what order, and how do we describe these activities? I'm struggling with that part of the metaphor/model.
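To make the moving parts a little more concrete, here is a rough Python sketch of one possible way to record them. It is only an illustration under my own assumptions, not a model anyone has adopted: the names `FileRecord`, `IntellectualObject`, and `Inventory`, and every field on them, are hypothetical. It assumes that file-to-file relationships (derivative to master) are recorded when a file is inventoried, and that the association of files to an intellectual object can be added at that point or later.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileRecord:
    """One inventoried file (all field names are hypothetical)."""
    file_id: str
    path: str
    role: str                           # "master" or "derivative"
    derived_from: Optional[str] = None  # file_id of the master, if a derivative
    access_point: Optional[str] = None  # access application that uses this file


@dataclass
class IntellectualObject:
    """Descriptive record for the thing the files represent."""
    object_id: str
    title: str
    file_ids: List[str] = field(default_factory=list)


class Inventory:
    """Keeps files, objects, and their relationships apart from any access application."""

    def __init__(self) -> None:
        self.files: Dict[str, FileRecord] = {}
        self.objects: Dict[str, IntellectualObject] = {}

    def add_file(self, record: FileRecord) -> None:
        # A derivative can only be inventoried once its master is known.
        if record.derived_from is not None and record.derived_from not in self.files:
            raise KeyError(f"unknown master file: {record.derived_from}")
        self.files[record.file_id] = record

    def add_object(self, obj: IntellectualObject) -> None:
        self.objects[obj.object_id] = obj

    def associate(self, object_id: str, file_id: str) -> None:
        # Record that an inventoried file belongs to an intellectual object.
        if file_id not in self.files:
            raise KeyError(f"unknown file: {file_id}")
        obj = self.objects[object_id]
        if file_id not in obj.file_ids:
            obj.file_ids.append(file_id)

    def derivatives_of(self, master_id: str,
                       access_point: Optional[str] = None) -> List[FileRecord]:
        # All derivatives of a master, optionally limited to one access point.
        return [
            f for f in self.files.values()
            if f.derived_from == master_id
            and (access_point is None or f.access_point == access_point)
        ]


# Example: one master, two derivatives made for two siloed access points.
inventory = Inventory()
inventory.add_file(FileRecord("f1", "masters/teapot.tif", "master"))
inventory.add_file(FileRecord("f2", "web/teapot.jpg", "derivative",
                              derived_from="f1", access_point="exhibit-site"))
inventory.add_file(FileRecord("f3", "zoom/teapot.jp2", "derivative",
                              derived_from="f1", access_point="image-viewer"))
inventory.add_object(IntellectualObject("obj1", "Tea set"))
for fid in ("f1", "f2", "f3"):
    inventory.associate("obj1", fid)
```

In this sketch nothing stops `associate` from being called much later, when an access application defines new contexts, which is exactly the ordering question I'm still stuck on.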

How did this conversation arise? Well, we're trying to scope out some future directions and activities, and a shared understanding of the model for the activities we support is vital. Mine is not the only model proposed and it just may not be right. I'm sharing this as much for my own process as anything else.

Tuesday, June 30, 2009

LoC on iTunes

The Library of Congress now has content on iTunes U. iTunes U is the area of the iTunes Store which offers open educational audio and video content from universities and other educational institutions. The Library’s initial iTunes U content includes historical videos such as original Edison films and a series of 1904 films from the Westinghouse Works, as well as event videos such as author talks from the National Book Festival, the "Books and Beyond" series, discussions with curators, and lectures from the Kluge Center. The audio content includes Library podcast series such as "Music and the Brain," slave narratives from the American Folklife Center, and interviews with authors from the National Book Festival. The collection also includes Library-produced classroom and educational materials, such as courses from the Catalogers’ Learning Workshop.

You must be running iTunes to be able to view the LoC content.

Saturday, June 27, 2009

new BIL on SourceForge and update to BagIt spec

This week saw a couple of events around the BagIt specification and tools.

A revision of the BagIt specification went out this week. You will note that it is still 0.96 -- the revisions were only in language to clarify some questions that had been received. There are some discussions going on about 0.97 -- join the Digital Curation Google group. I'd like to see some more activity there!

Version 3.0 of BIL, the BagIt Library for Java, was released on SourceForge this week. It's available as binary and source code.

Plus, there was the BagIt video ...

BagIt video

The first in a planned series of digital preservation videos is available on the digitalpreservation.gov site -- an introduction to BagIt! Brian Vargas did a great job as "the talent" -- i.e., the narrator -- but folks should know that Brian was not selected just for his acting experience: he wrote many of our transfer tools (like the transfer scripts on SourceForge) and is a co-author of the BagIt specification.

The video premiered this week at the annual NDIIPP Partner's Meeting to great acclaim. It's aimed at a general audience.

EDIT: The NDIIPP site has added a great new page on the Transfer Tools with a link to the video.

Friday, June 26, 2009

Chesapeake Project Legal Information Archive

I came across a very interesting resource today -- the Chesapeake Project Legal Information Archive -- and the just-released results of a study they did on archiving legal resources on the web:

The Chesapeake Project Legal Information Archive has released a comprehensive report evaluating its digital preservation efforts during the project's two-year pilot phase.

The project evaluation reveals that nearly 14 percent — or approximately one in seven — of the online publications archived between March 2007 and March 2009 have already disappeared from their original locations on the Web but, due to the project's efforts, remain accessible via permanent archive URLs. A similar analysis in 2008 showed that slightly more than 8 percent of archived titles had disappeared from their original URLs, demonstrating a dramatic increase in "link rot," or inactive URLs, among archived content over the past year.

During the two-year pilot phase, the libraries participating in the project archived more than 4,300 digital objects and tracked more than 177,000 visits to www.legalinfoarchive.org, the home of The Chesapeake Project's digital archive collections. Users of the project's Web site visited from educational, government, and military institutions in the United States, as well as from countries abroad throughout the Americas, Europe, the Middle East, Asia, Africa, Australia, and the Pacific Islands.

Not too surprisingly, the class of domain with the second-highest rate of resource loss is .edu, after .info. Academic institutions are not always very conscientious about preserving access to their content, and with their academic term structure and the movement of faculty between institutions, web content on .edu sites is highly variable in its longevity. I don't see a characterization of how old the harvested resources are -- that can be very difficult to identify -- but it is a high percentage of link rot, and there was quite an increase from the end of the first year to the end of the second year.

Download the PDF of their report.

Tuesday, June 16, 2009

milestones for the National Digital Newspaper Program

Today there was an exciting press event at the Newseum for the National Digital Newspaper Program, sponsored by the Library of Congress and the National Endowment for the Humanities. There was a great live demo, a video on digital production for the project from the University of Kentucky, and some nice speechmaking. The event promoted the milestone where the project surpassed 1,000,000 pages available at the Chronicling America site, the addition of seven new state partners, and the addition of images of illustrated newspaper supplements to the LoC Flickr Commons set (with more to come every month).

So far the AP has an article available, and there were representatives of other news outlets at the event. Check out the press release. Roy Tennant has a post that includes some of the technical specs supplied by my colleague Ed Summers. Ed and Dan Krech have done some great work to update the underlying application, improving the ingest and search functionality, adding the functionality that allows the site to be crawled, and exposing the data as RDF for a multitude of possibilities.

Edit: Here's the Washington Post article, and the official LoC blog posting.

Saturday, June 13, 2009

something odd happened today

Last weekend I went to my local public library (which I love), where I spotted a book that was on my to-be-read list. I keep a list of books I want to read, and periodically search the library's catalog to see if they have them at any of their branches. I had this book noted on my list as being held in the collection of my local branch. Depending upon how much I want to read a book, I'll put a hold on it if they have it in the collection but it isn't checked in. This is a book that held a middling position on my list for a while, a 2007 sequel to a science fiction novel by a newish but award-winning author which I liked but didn't love, but thought might be interesting. I grabbed the book off the shelf, but, in the process of wandering around and gathering up other books, I must have set it down and it didn't make it to the self-checkout with me, something I didn't discover until I got home. Ah well, I knew I'd be back this weekend, and maybe it would still be available.

I returned today and wandered over to the shelf. It wasn't there. I decided to look the book up and see when it was due and put a hold on it this time.

It wasn't there any more. It wasn't in the catalog, and the author wasn't in the catalog either.

I left with the books I found and one that was on hold for me. I considered asking about the missing book/author, but there was quite a line and I didn't want to hold people up while I asked my crazy-conspiracy-sounding questions -- how did this author and his books disappear in the last week? And why?

Tuesday, May 26, 2009

how did a month go by?

In re-writing the opening sentence to this post about seventeen times, I have alternated between apologizing, rationalizing, making excuses for, and outright ignoring that I haven't posted here in a month.

I've been attending conferences and traveling a lot. Four meetings/trips in three weeks, and four states (yes, one state was Virginia, but I was off site for three days, followed the next day by a trip over two hours away and overnight for two nights, so that counts). That doesn't stop most folks from continuing to reach out and share, but I find travel very draining. I can happily spend my days chatting with colleagues, taking notes and tweeting, and talking about what excites me about my job. By the time I collapse in my room at the end of the day, I sometimes feel like I hope to never discuss the BagIt specification again (but I will, you know I will, and with great enthusiasm). And when I get home, I hole up and do not feel social for a good 24 hours. Yes, I might be the most outgoing Myers-Briggs "I" out there, but I'm still an I who just wants to sit quietly and think for a while.

And, if I also want to make some semi-valid excuses, my work PC died again and it was out of my possession for 3 weeks, one of my projects had a major deadline that was almost fully met on time and required some last minute scrambling on my part so I didn't blow the deadline too badly, and we had to pack up and move out of our office suite so some duct repairs could take place. I should not even admit how far behind I am in studying for my Japanese class.

I hope to resume normal blogging this week. The coming attractions: the IS&T Archiving 2009 conference, Open Repositories 2009, and a visit to Scola, the Library's international newscast preservation partner.

Monday, April 27, 2009

Digital Karnak

I am a huge fan of 3-D visualizations of archaeological sites, and there's a new one developed by a team under Diane Favro and Willeke Wendrich at UCLA. Digital Karnak provides a Google Earth visualization of the site of Karnak, a massive temple complex in Egypt that was in use for some 1,500 years. There's a nice interactive timeline through which you can view the development of the site over time. Start with the overview if you're unfamiliar with Karnak.

The web site includes an amazing archive consisting of stills from the 3-D model and photographs from the archaeological site. I'd like to see that expanded some day to include any smaller objects from Karnak that are in various cultural heritage collections. Historical renderings (there are known drawings from the early 18th century onwards) would also be a nice addition.

There's a nice article in the Chronicle of Higher Education.

Tuesday, April 21, 2009

World Digital Library Launch

The World Digital Library is now available.

The site is launching with 1,170 objects from 26 partner institutions. WDL focuses on significant primary materials reflecting the cultural heritage of all UNESCO member countries, including manuscripts, maps, rare books, recordings, films, prints, photographs, architectural drawings, and other types of primary sources from varying time periods. The project will continue to add content to the site, and will enlist new partners from the widest possible range of institutions and countries.

The site is available in seven different languages: Arabic, Chinese, English, French, Russian, Spanish, and Portuguese. The content is not translated -- the items appear in their original language. The metadata and all the site navigation are translated to make it possible to search and browse the site in any of the languages. The metadata came from partner institutions or was created by catalogers at the Library of Congress, and much of the translation was provided by Lingotek.

The site was built using the Django Python framework, nginx, Lucene/Solr, and a MySQL database. The zooming in the image viewer and page turner is Seadragon Ajax. There is heavy use of JavaScript, jQuery, JSON, and underlying XML. Check out the image carousels and timeline tool! The project also developed a cataloging tool to manage the metadata and cataloging process and interact with the Lingotek translation system via their API.

Sunday, April 12, 2009

museum data exchange software

OCLC, funded by the Mellon Foundation and working with the software company Cognitive Applications, Inc, has released COBOAT and OAICat Museum to support data interchange between museums. This work is happening under the auspices of their Museum Data Exchange Project.

So what, many people will say? It should already be easy to share museum data, right?

Not so much.

The museum collection management system arena has some major vendors (Gallery Systems, Willoughby, Minisis, Cuadra, etc.) and some smaller vendors (Re:discovery, PastPerfect, etc.), and countless (and I really mean countless) home-grown systems running on FileMaker, Access, and MS-SQL. I know, because I spent many years working for museums and I was on the board of the Museum Computer Network, a group that diligently worked on many interchange initiatives. I worked with software from 3 vendors and managed a FileMaker-based system. Getting data in was easy. Getting data out was often hard. Participation in data aggregation projects took a lot of effort. And most small- or medium-sized museums (and there are many, many more of them than large museums) have little or no technology staff to enable data sharing. And there is no common data schema in the community.

The museum community itself has sometimes slowed progress. When relevant library community standards were mentioned, some said "We're nothing like libraries! Our collections are unique! Their standards are not for us!" That attitude seems to have changed in the last 10 years.

I am glad to see something like this going forward. A fee-free tool that can help museums extract data from black-box vendor systems and enable sharing? Bring it on.

Friday, April 10, 2009

open repositories 2009

The abstracts are now available for the presentation and poster sessions at OR09. This is one of my favorite conferences to attend and present at.

Sunday, April 05, 2009

DigCCurr 2009

I was in Chapel Hill the first week of April for the DigCCurr 2009 conference and to attend a meeting to brainstorm about personal digital collection preservation. I thought the conference was very good, better than the first one in 2007. I saw many excellent presentations, had some great conversations, and got a good response to my presentation on LC's work with file transfer and inventory tools. As with the last conference, I walked out thinking that I should have been an archivist.

I strongly recommend the proceedings from DigCCurr 2009. They're available as a free download from Lulu, or you can buy a POD version. You can also look up the very active twittering history at #digccurr.

I found it strangely hard to write up my notes from this meeting. I think it's because I'm still struggling with some aspects of the digital preservation problem space.

I absolutely agree that the activities of traditional archival practice have a place in the preservation of digital records. Where I found myself disagreeing with some presenters is in the balance between collecting and saving what we can versus an appraisal process to select what we will collect/save. In collection development practices for general collections, there is the often-held discussion about never knowing what might prove useful in the future, so it is a disservice to be too selective now. I guess that I have taken that point of view to heart, and I want to see our institutions cast as open a net as possible for digital collections. If we don't grab it when we can, there will be nothing to select.

I also found myself bristling occasionally over the implied scope of the term "digital collections" as I most often heard that phrase used at the meeting. There was very much a focus on electronic records and the digital realm of personal papers. Of course there were some great discussions around multimedia, web sites, audio/video, and image collections, but what I pretty much never heard anybody mention was born-digital scholarship and teaching and learning materials.

My first web site preservation project was at the Harvard Design School in the late 1990s, where, while developing courseware software, I realized that we were losing the history of what we taught and the products of the courses as we overwrote sites every term. Part of an institution's records are its lists of course offerings, course syllabi and reading lists, and, for some courses, the projects that the students created and put online in the course site. This was particularly true at a graduate school with programs in architecture, landscape architecture, and urban planning, where the studio courses produced important site-specific work and case studies that were often lost after every term. I felt so strongly about this that I launched a course site preservation project that would have involved retrieving sites off server archives. We were looking at using METS (in its early days) to map the sites. But, as often happens, I ended up leaving before the project got very far along, and no one else felt nearly as devoted to the project as I did, so it didn't go very far.

At UVA we launched a project called "Sustaining Digital Scholarship" to preserve born-digital scholarship, primarily in the humanities and social sciences. We instituted a technical assessment process and were working on documenting and migrating some major digital scholarly resources with varying strategies. That project is still going on in a limited way. It can take a lot of resources to assess and document a large digital archive.

That said, I was excited by some of the tools that I saw. ACE from the University of Maryland. MOPSEUS from Greece. The PARSE.Insight draft preservation roadmap. CASPAR for representation information. PLATO and Hoppla from Austria. LANL's ReMember Framework for OAI-ORE. CDL's Pairtree directory structure. Prometheus and MediaPedia from Australia. All very much worth looking into.

There was also a thread in this meeting on the use of digital forensics, transitioning some tools and practices from legal digital forensics into archival digital forensics. This interested me very much and I intend to read up in this area.

Thursday, April 02, 2009

new flip book beta

From Peter Brantley on the OCA blog -- A new beta version of the Flipbook bookreader has been released open source under GNU license. The source code is available from the Open Library site.

Wednesday, April 01, 2009

LC/CLIR report on pre-1972 sound recording copyright

Excerpted from the press release:

Sound recordings were not protected by federal copyright law until 1972. A Library of Congress report indicates that the miscellany of state laws protecting pre-1972 sound recordings will extend copyright protection until 2067, creating a situation where some recordings dating to the 19th century are not available in the public domain.

The Library announced today the completion of a commissioned report that examines copyright issues associated with unpublished sound recordings. This new report from the Library of Congress and the Council on Library and Information Resources addresses the question of what libraries and archives are legally empowered to do, under current laws, to preserve and make accessible for research their holdings of unpublished sound recordings made before 1972.

The report, "Copyright and Related Issues Relevant to Digital Preservation and Dissemination of Unpublished Pre-1972 Sound Recordings by Libraries and Archives," is one of a series of studies undertaken by the National Recording Preservation Board (NRPB), under the auspices of the Library of Congress. It was written by June Besek, executive director of the Kernochan Center for Law, Media and the Arts at Columbia University. The report is available free of charge at www.clir.org/pubs/abstract/pub144abst.html.

Friday, March 27, 2009

New LC multimedia collection sharing initiatives

This is news ... The Library of Congress will begin sharing content from its vast video and audio collections on the YouTube and Apple iTunes web services as part of a continuing initiative to make its incomparable treasures more widely accessible to a broad audience. The new Library of Congress channels on each of the popular services will launch within the next few weeks.

...

The General Services Administration today also announced agreements with Flickr, YouTube, Vimeo and blip.tv that will allow other federal agencies to participate in new media while meeting legal requirements and the unique needs of government. GSA plans to negotiate agreements with other providers, and the Library will explore these new media services when they are appropriate to its mission and as resources permit.

Read the Press Release.