Friday, December 28, 2007

adding metadata to the list of what we learned

A couple of folks wrote and asked me why I didn't specify metadata standards in my post on what we learned from our repository project. I mentioned the impact of interface design on metadata needs, but some additional reinforcement can't hurt.

So ...

5. It shouldn't even need to be said that you should have your metadata standards identified before you start your development. We did. What we learned is that activities related to categories 2-4 will mean that you will have to make changes to what metadata you create and how you use it. For example, when changes were made to our interface design and functionality, we needed metadata formatted in a certain way for search results and some displays. We thought that we'd generate the metadata on-the-fly, but that turned out to be a lot of overhead so we decided to pre-generate the metadata needed: display name, sort name, display title, sort title, display date, and sort date. It isn't metadata we necessarily create during cataloging processes, but it's something we can generate during the conversion from its original form to UVA DescMeta XML. Another example is faceted browsing. To have the most sensible facets in our next interface, we need to break up post-coordinated subject strings or we'll have a facet for every variation. We thought about pre-generating this, but it turns out that Lucene can do this as part of the index building.
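To make the pre-generation idea a little more concrete, here is a rough sketch of deriving display and sort fields once, at conversion time, instead of on every search. It is not our actual conversion code: the input field names (title, name, date), the article list, and every function name are assumptions made up for illustration.

```python
import re
import unicodedata

# Hypothetical list of leading articles to skip when sorting English titles.
LEADING_ARTICLES = ("a ", "an ", "the ")


def sort_key(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", "", no_accents.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def sort_title(display_title: str) -> str:
    """Build a sort title by dropping one leading article, if present."""
    key = sort_key(display_title)
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key


def display_name(inverted: str) -> str:
    """Turn an inverted name ("Last, First") into display order ("First Last")."""
    if "," not in inverted:
        return inverted.strip()
    last, first = (part.strip() for part in inverted.split(",", 1))
    return f"{first} {last}"


def pregenerate(record: dict) -> dict:
    """Derive all six display/sort fields from one (assumed) source record."""
    year = re.search(r"\d{4}", record["date"])
    return {
        "display_title": record["title"],
        "sort_title": sort_title(record["title"]),
        "display_name": display_name(record["name"]),
        "sort_name": sort_key(record["name"]),
        "display_date": record["date"],
        # Sort on the first four-digit year found; empty if there is none.
        "sort_date": year.group(0) if year else "",
    }


example = pregenerate(
    {"title": "The Hunting of the Snark", "name": "Carroll, Lewis", "date": "ca. 1876"}
)
```

The point of doing this during conversion is that the search interface only ever reads stored values; the cost of normalizing is paid once per object rather than once per result list.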

Wednesday, December 26, 2007

copyright the pyramids?

From BoingBoing and Techdirt -- Egypt plans to pass a law that enacts copyright protection for the pyramids and the Sphinx and any object in any museum in Egypt in order to levy fines against creators of full-scale replicas that infringe on this so-called copyright. Apparently the Luxor hotel is safe because it's not an exact replica ...

I understand looking for income streams to care for their cultural heritage and I am extremely sympathetic, but this is such a wrongheaded interpretation of the concept of copyright. That said, Egypt can set their own copyright laws, and if they want to declare this part of their copyright code, they can. I wonder how they plan to enforce this?

NIH Open Access mandate is law

Via Peter Suber: President Bush signed the omnibus spending bill that includes the requirement that publications based on NIH-funded research be submitted to PubMed Central and made publicly available, with an embargo of no more than 12 months.

That said, there is no reason to wait to start depositing. Articles should be deposited upon publication with restricted access for one year rather than held back until the embargo ends (Who keeps track of embargoes? -- "Today it's been a year and I should deposit that article"), and the same articles should additionally be self-archived in the Institutional Repository of each researcher's own university. If their institution does not have an IR, they should ask "Why Not?"

EDIT at 6:27 PM: Here's the press release from the Alliance for Taxpayer Access.

Saturday, December 22, 2007

Google responds to privacy questions

Via Stephen's Lighthouse, a posting at Search Engine Land on Google's response to 24 questions about privacy issues submitted by US Representative Joe Barton. A PDF of Google's response is provided.

Friday, December 21, 2007

Fedora 3.0b1 is released

The first beta version of Fedora 3.0 is now available. I am thrilled about this (even though we're going to wait until after the release of final 3.0 to upgrade) because this includes a change that I have wanted for a long time -- the Content Model Architecture (CMA). When the Fedora architecture was first designed, the tight bindings between objects and their disseminators seemed to be a good thing. As repository services were developed using Fedora, though, some of us discovered that we needed to make changes to disseminator mechanisms after we developed them. In the previous (current) architecture, because the objects were tightly bound to the disseminators we could not make changes to the disseminators unless NO objects were bound to them. In other words, we had to purge disseminators that were bound to objects (the easiest way was to actually purge the objects), then update the disseminators, and then re-ingest the objects and re-establish bindings. CMA adds the ability to formally instantiate content models in relation to objects, also making it easier to update disseminators without the purging of objects. This also supports a new ability to validate objects against content models, a very desirable QA feature.
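For anyone who hasn't lived with the old bindings, here is a toy illustration of why the indirection matters. This is emphatically not Fedora's API: the class names, fields, and the "view" behavior are all hypothetical, and the sketch only models the idea of objects pointing at a content model rather than being bound directly to disseminators.

```python
from dataclasses import dataclass, field


@dataclass
class ContentModel:
    """Toy content model: required datastream IDs plus named behaviors."""
    name: str
    required_datastreams: set
    disseminators: dict = field(default_factory=dict)  # behavior name -> callable


@dataclass
class RepositoryObject:
    """Toy object that points at a content model instead of owning disseminators."""
    pid: str
    datastreams: dict
    model: ContentModel

    def disseminate(self, behavior: str):
        # The behavior is looked up through the model at call time.
        return self.model.disseminators[behavior](self)

    def validate(self) -> set:
        """Return the required datastream IDs that this object is missing."""
        return self.model.required_datastreams - set(self.datastreams)


text_model = ContentModel(
    "text", {"TEI", "DC"}, {"view": lambda obj: obj.datastreams["TEI"]}
)
complete = RepositoryObject("demo:1", {"TEI": "<TEI/>", "DC": "<dc/>"}, text_model)
incomplete = RepositoryObject("demo:2", {"DC": "<dc/>"}, text_model)

# Update the disseminator on the model; no object is purged or re-ingested.
text_model.disseminators["view"] = lambda obj: f"<html>{obj.datastreams['TEI']}</html>"
```

Because both objects reach the "view" behavior through the model, replacing it once changes what every object delivers, and the same model doubles as a checklist for the validation step.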

what we learned from our repository project

Earlier this week someone asked me what we had learned from our repository development project over the years. This is the first time that anyone has asked that so directly, as opposed to general discussions about assessment and process review and software optimization.

So, what did we learn? This is what I've come up with so far.

1. Have your media file standards in mind before you start. Of course standards will change (especially if you're talking about video objects) during a multi-year implementation project. But if you have standards identified before you start (and minimize the number of file standards that you'll be working with), you at least have some chance of making it easier to migrate and manage and preserve what you've got and to design a simpler architecture. We did this (in conjunction with an inventory of our existing digital assets) and it was key for us in developing our architecture and content models.

2. Know what the functional requirements of your interface will be before you start. We had developed functional spec documents and use cases, but two different stakeholders came back to us during the process with new requests that we couldn't ignore. In those two cases the newly identified functional requirements for our interface required that we change our deliverable files and our metadata standard. We had to go back and re-process tens of thousands and then over 100,000 objects to meet the functional need and have consistency across our objects.

3. Some aspect of your implementation technologies will change during the project. New technologies will become available for implementation during your project that are a better fit than what you planned to use. For example, we never initially identified Cocoon as part of our implementation, but it became a core part of our text disseminators.

4. Your project will never be "done." OK, we've got a production repository with three format types in full production and three in prototype. We've still got to figure out production workflows for those media types and there are more media types to consider. And, as a corollary to point 3, there are new technologies that we want to substitute for what we used. We're obviously going to switch to Lucene and Solr for our indexing. New indexing capabilities will absolutely bring about an interface change. There are also more open source web services applications available now than when we started in 2003. We can potentially employ XTF in place of some very complex TEI and EAD transformation and display disseminators that we developed locally. This is bringing about a discussion about simplifying our architecture -- fewer complex delivery disseminators to manage and develop and more handing off of datastreams to outside web services. Not that there aren't complexities there and a lot of re-development work, but it's a discussion worth having. We're talking a lot these days about simplifying what it takes for us to put new content models into production. The development of Fedora 3.0 will also have a huge effect.

EDIT 28 December 2007:

A couple of folks wrote and asked me why I didn't specify metadata standards. I mentioned the impact of interface design on metadata needs, but some additional reinforcement can't hurt. So ...

5. It shouldn't even need to be said that you should have your metadata standards identified before you start your development. We did. What we learned is that activities related to categories 2-4 will mean that you will have to make changes to what metadata you create and how you use it. For example, when changes were made to our interface design and functionality, we needed metadata formatted in a certain way for search results and some displays. We thought that we'd generate the metadata on-the-fly, but that turned out to be a lot of overhead so we decided to pre-generate the metadata needed: display name, sort name, display title, sort title, display date, and sort date. It isn't metadata we necessarily create during cataloging processes, but it's something we can generate during the conversion from its original form to UVA DescMeta XML. Another example is faceted browsing. To have the most sensible facets in our next interface, we need to break up post-coordinated subject strings or we'll have a facet for every variation. We thought about pre-generating this, but it turns out that Lucene can do this as part of the index building.
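As a rough illustration of the faceting problem, here is what breaking up subject strings looks like in plain Python. This is not Lucene code and not how our index does it; it assumes subdivisions are separated by a double hyphen, and the function names and sample subjects are invented for the example.

```python
from collections import Counter


def split_subject(subject: str, delimiter: str = "--") -> list:
    """Split one subject string into its component terms."""
    parts = [part.strip() for part in subject.split(delimiter)]
    return [part for part in parts if part]


def facet_counts(records: list) -> Counter:
    """Count how many records carry each component term."""
    counts = Counter()
    for subjects in records:
        # Count each term once per record, even if several strings repeat it.
        terms = {term for subject in subjects for term in split_subject(subject)}
        counts.update(terms)
    return counts


# Two hypothetical records, each with a list of subject strings.
records = [
    ["Virginia -- History -- Civil War, 1861-1865", "Virginia -- Maps"],
    ["Virginia -- History -- Colonial period"],
]
facets = facet_counts(records)
```

Left whole, those three strings would be three separate facet values with one hit each; split up, "Virginia" and "History" each collect both records.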


Monday, December 17, 2007

Open Journal Systems 2.2 released

OJS 2.2 has been released. We have an earlier version up that we've been testing with a student journal. I see some additions that I know folks will find attractive. Strangely, one of the potentially most desirable features is the integration of Google Analytics. I have watched the editors of the journals that we host implement it themselves (sometimes struggling to do so) because we're just not in the business anymore of supplying web stats for every site on our servers. I want to upgrade our test instance to the new version to see how it compares with other solutions that we are considering for the hosting of journals and the submission/review/editing process.

petition for open bibliographic records

Peter Suber reports on one of the responses to the draft report from the Working Group on the Future of Bibliographic Control -- a petition to LC to make its bibliographic records openly available.

open data commons

There is a lot of hidden data out there -- data used for and created through research -- and a lot of confusion as to what rights it carries and how it can be used. Talis has announced a partnership with Creative Commons to launch a new type of personal data licensing: a Science Commons protocol and the Open Data Commons Public Domain Dedication and License. Notes on the project are available at the panlibus and Nodalities blogs.

Friday, December 14, 2007

now it's a trend

I shouldn't port my data without someone to watch over my shoulder. Today, supposed power-user that I am, I decided to port my own data in our conversion to Microsoft Outlook and Exchange.

I went to the web page that I was pointed to by many helpful email messages. I backed up my calendar and created the right format file for importing. I didn't need to convert an IMAP mailbox because I had been using POP and would retain my old local Eudora archive. I determined that I had an active account by logging into the web version. But somehow, while setting up the desktop Outlook client I messed up not only my authentication method (it kept asking me for my certificate and wouldn't authenticate using it), but in trying to fix it I set up an offline email box that became my default.

I feel for the poor IT guy (thanks, Rob!) who got called in to help me after I commented to my colleague across the hall that I had messed things up. We had to delete my certificate to get my authentication to reset to asking for a password. Then we were able to get into the application and declare my server email box as my default, not the offline "personal" email box.

We kept trying to check my email, but then I was told that I'd missed a step -- I needed to register Exchange as my new default email server. I think I can blame the documentation -- that was NOT listed as a step on the page "Getting Started with Exchange," I swear. I had four tech support guys in my office (I was the object of much good-tempered group teasing for messing up), and the only reason that one of them could tell me the URL I needed to go to so I could change my default email is that he had printed the page. None of them could find a link to that page anywhere on the site.

I set up Exchange as my default, imported my "old" calendar (through January 14 when we change), and even got my Eudora address book imported with only a little weirdness. I still have no clue how to set up thematic email folders and filters like I had in Eudora. I guess I have to give in to the teasing and sit through some training, or at least read the help docs ...

Thursday, December 13, 2007

sharing personal libraries

I am very excited by the news about the joint project between the Zotero group at the Center for History and New Media and the Internet Archive.

Zotero is a very easy-to-use tool for developing personal citation repositories for distributed resources. The creation of a "Zotero Commons" registry of sorts where materials used by researchers can be shared is a powerful idea. It's an institutional repository without institutional boundaries. The idea of tying this in to the Internet Archive's archive of the web so that materials cited but not directly deposited are also not lost is even more intriguing. That there will be the capacity for both individual and group work is as it should be.

Here's the core of the project to me: "The combined digital collections present opportunities for scholars to find primary research materials, to discover one another’s work, to identify materials that are already available in digital form and therefore do not need to be located and scanned, to find other scholars with similar interests and to share their own insights broadly."

I wonder how this will fit into the landscape with other digital registries and collections. The DLF/OCLC Registry? OAIster? Aquifer? American Memory? What is the relationship between what institutions digitize, what their research communities have deposited in IRs, what is harvested into larger aggregations, and what scholars personally create? This is a problem space that bears a lot more discussion.

Wednesday, December 12, 2007

personal libraries

There is a great post on Hackito Ergo Sum about how they set up their personal library system, from barcode readers to cataloging to shelving. They went with Readerware, a reasonably well-known desktop program.

It's a shame they never encountered LibraryThing in their research. I knew about LibraryThing but didn't do anything with it until the time came to pack up our books to move to our new house in 2006. In late December 2005 I signed up and by mid-February 2006 I had all of our over 2,000 books cataloged, tagged, and packed without the use of a barcode reader (although I do have an original cuecat that Wired shipped out as a subscriber benefit sometime in the 90s, still in its original box). Like them, I have older books without ISBNs and foreign imprints. Many of the foreign imprints I found through non-US library catalogs. I had to manually enter some items, but that was not a challenge.

I guess I'm just a shill for LibraryThing.

Monday, December 10, 2007

print and digital at the New York Times

One of the most insightful blogs on media culture is written by David Byrne. He recently blogged about the new New York Times building, and, in response to his posting, he was invited to meet with various NY Times staff about print and digital journalism. His post on those meetings is one to read.

photography in museums

Please check out Perian's post on photography in museums over at Musematic. It presents both sides, and quotes from a BoingBoing comment of hers to rebut some comments made by Cory Doctorow and others.

Sunday, December 09, 2007

xISBN everywhere

The FRBR blog posted on Eric Hellman's announcement that Wikipedia is now available as a lookup option in xISBN.

xISBN seems to be everywhere now. It's used by LibX. It's available to resolvers. It's obviously part of WorldCat. But through what tools can you take advantage of the option to choose which, or maybe multiple, lookup options you want?

I went to the Alice's Adventures in Wonderland Wikipedia article. I wondered which icon would appear next to the ISBNs -- LibX or Find@UVA. It was the latter. For a paperback edition of Alice in Amazon, it's the former.

With LibX and Amazon I got a great application of xISBN against our catalog. With Find@UVA I got no matches because it's not using xISBN (note to self -- make sure that we request support for xISBN and the newly announced xISSN as features, if we haven't already). In both cases the lookup target was our catalog, because that's the default (or only option).

LibX can search other lookups -- WorldCat and GoogleScholar. Depending on the requests they get (Wikipedia has a bad rep as a search resource among many academic librarians and faculty), they could add Wikipedia.

But what if I want to check in more than one location for links or citations or holdings? Is there a tool that can do that?

Saturday, December 08, 2007

how did I miss this?

Apparently today was "Pretend to be a Time Traveler Day."

The Wired geek dad blog had some great suggestions.

I love this one: "Take some trinket with you (it can be anything really), hand it to some stranger, along with a phone number and say 'In thirty years dial this number. You'll know what to do after that.'"

UpLib personal digital library tool

From Bill Janssen of PARC, via TeleRead:

"We’re planning to release our UpLib digital library system as open source software in a couple of months, and I’m still looking for beta testers, people who’d like to receive early a raw codebase tar file which may require some hacking skills to compile and install (though, I like to think, not too much). If you know about ‘configure’ and ‘make,’ you’re a candidate. I’m particularly interested in finding beta testers who use Windows, as that platform is still somewhat of a mystery to me. We use MSYS and MINGW to create the installer for Windows. For those of you who are wondering, the core of UpLib is a document repository server which serves as a base for a number of document analysis functions."

“The system consists of a full-text indexed repository accessed through an active agent via a Web interface. It is suitable for personal collections comprising tens of thousands of documents (including papers, books, photos, receipts, email, etc.), and provides for ease of document entry and access as well as high levels of security and privacy. Unlike many other systems of the sort, user access to the document collection is assured even if the UpLib system is unavailable.”

There are a couple of PARC white paper PDFs available, from 2003 and 2005, and one ACM article from 2005 that I found.

I have no idea what the status of this is -- there appears to have been some conversation on the ebook community list that this has been vaporware for some time -- but I'm intrigued that another tool is entering the personal digital collection problem space.

Thursday, December 06, 2007

supporting data in PQDT

Today we received a downtime notice from ProQuest. This is not unusual, especially during the holidays when services make time to update and upgrade. What cheered my little heart was this notice:

ProQuest Dissertations & Theses (PQDT) Multimedia Support release -- ProQuest has seen an increase in dissertations and theses that include supplementary digital materials - audio, video, spreadsheets, etc. To properly support scholarly access to these materials, ProQuest is now making them available online in the Full Text version of the ProQuest Dissertations & Theses (PQDT) database.

Yes, ProQuest is going to include at least some supporting media and data for the theses and dissertations that it makes available. PQDT already has an Open PQDT that includes Open Access ETDs. I cannot find any details about this on the ProQuest site, but it's a promising step.

Wednesday, December 05, 2007

digital whomever

Siva has a great post about the labeling of "Digital" whomevers -- natives, immigrants, generation, millennials, etc.

He doesn't buy it. I have a slightly different spin and some different reasons, but I don't buy it either.

I often take part in discussions about services for faculty and students, and sometimes hear ageist comments about how older faculty are completely non-digital and all students are automatically all digital. Hah! Just like some folks have an interest or skill in languages or math or art and some folks don't, it's the same with whatever "digital" is. I have worked with faculty in their 60s who saw something in being digital decades ago and have worked in that realm for years. I have worked with colleagues -- librarians and faculty -- in my own age group (I'm 44) who hate all technology with a passion and others who embrace it in all ways. I have worked with students at three different research universities who could not care less about being digital.

Being digital is not generational. At the core of what Jeff Gomez calls "Generation Download" and "Generation Upload" in his book Print is Dead, there is truly a ubiquity of digital media use that is changing media consumption and production paradigms and changing the media market. There is absolutely an increased level of acceptance that this is standard operating procedure. I'm still not willing to agree that an entire generation is digital and that the entirety of other generations are not. There's still predilection and interest and skill and, yes, issues of availability and affordability of technology that cross all generations.

There are degrees of digital-ness. Different comfort levels. Different skill levels. Different levels of access. Why do we have to apply such absolute labels?

developing a service vision for a repository

Dorothea rightly challenged me for not including a service vision in my post on repository goals and vision. I do have something like a vision, but I wouldn't say that it's quite where it needs to be yet. That said, I said I would post it, so I am.

What are the services needed around a repository?

  • Identification and acquisition of valuable content
    • You can't wait for content to come to you – research what’s going on in the departments and at the University, and initiate a dialog.
    • Digital collections must also come from the Library and other University units – University Archives, Museums, etc.
  • Consulting Services
    • Advise on intellectual property and contract/licensing issues for scholarly output.
    • Assist with preparing files for deposit, creating or converting metadata, and the actual deposit process.
  • Access
    • Easy-to-use discovery interface with full-text searching and browse.
    • Instruction for community on how to find and use and cite content.
    • Make the content shareable via Open Archives Initiative (OAI).
  • Promotion and Marketing
    • Build awareness of the high cost of scholarly journals, and that we are buying back our own institutional scholarship.
    • Promote the value of building sustainable digital collections – preservation is more than just backing up files.
    • Promote the goals of the Open Access movement, including managed, free online access and a focus on improved visibility and impact.
    • Show faculty that they can build personal and community archives.
    • Market repository building services that will enable the institution to build a body of digital content.
    • Market the repository as a content resource and a venue that increases the visibility of the institution.

Tuesday, December 04, 2007

Report on the Future of Bibliographic Control available for comment

The Library of Congress has released a draft of the Report on the Future of Bibliographic Control for comment. Comments should be received by December 15, 2007, so pull together all your autocat, ngc4lib, and web4lib postings on the issues and get your comments in.

Alliance for Permanent Access

I haven't been able to track down much on the new Alliance for Permanent Access. There's this press release. Did anyone attend the Second International Conference on Permanent Access to the Records of Science held in Brussels on November 15, 2007 where the Alliance was launched?


There's nothing that I can say about the MARCThing Z39.50 implementation for LibraryThing that isn't in the Thingology post. But this statement caught my eye:

I do have a recommendation for anybody involved in implementing a standard or protocol in the library world. Go down to your local bookstore and grab 3 random people browsing the programming books. If you can't explain the basic idea in 10 minutes, or they can't sit down and write some basic code to use it in an hour or two, you've failed. It doesn't matter how perfect it is on paper -- it's not going to get used by anybody outside the library world, and even in the library world, it will only be implemented poorly.

goals and vision for a repository

Last week I had the opportunity to have a lengthy conversation with some folks about our Repository. In doing so I was able to get at some really simplified statements about our activities.

Why a Repository?

  • A growing body of the scholarly communications and research produced in our institutions exists solely in digital form.
  • Valuable assets -- secondary or gray scholarship such as proceedings, white papers, presentations, working papers, and datasets -- are being lost or not reproduced.
  • Numerous online digital collections and databases produced through research activity are not formally managed and are at risk.
  • An institutional repository is needed as a trusted system to permanently archive, steward, and manage access to the intellectual work – both research and teaching – of a university.
  • Open Access, Open Access, Open Access and Preservation, Preservation, Preservation.

What's the vision for a Repository?

  • A new scholarly publishing paradigm: an outlet for the open distribution of scholarly output as part of the open access movement.
  • A trusted digital repository for collections.
  • A cumulative and perpetual archive for an institution.

What does success look like?

  • Improved open access and visibility of digital scholarship and collections.
  • Participation from a variety of units, departments, and disciplines at the institution.
  • Usable process and standards for adding content.
  • Content is actively added.
  • Content is used: searched and cited and downloaded.
  • There is a wide variety of content types.
  • Simple counts are NOT a metric.

I really appreciate having the chance to formulate ideas like these that have nothing to do with the technology but everything to do with why we're doing what we do. I want to work this up into something more formal to share broadly.

Tuesday, November 27, 2007

boxes, the outcome of online shopping

In our household we shop online a lot. Boxes often arrive at home that one or the other of us didn't expect.

What happens when thousands of students living at a University are shopping online every day for whatever they might need, from shoes to car tires? In a very entertaining article, The New York Times reports on what's happening at mail rooms at Universities.

ars technica on Google Book Project

ars technica has a very fair blog posting outlining the ongoing online discussion between Paul Courant at Michigan and Siva Vaidhyanathan at UVA about the Google Book project. It nicely presents both sides of the discussion.

Saturday, November 24, 2007

unplanned hiatus

Between a 3-day planning retreat, updating technology and the threat of data loss, a cat that went missing 7 days ago who has not returned, and the Thanksgiving holiday, blogging hasn't been top of the list. I'm back now.

the not-so-nameless fear of personal data loss

Ten days ago I upgraded my smartphone.

For over three years I hung onto my Treo 600. Yes, a 600. Even when its antenna broke while I was at JCDL 2006, I drove to Durham and got a replacement unit rather than upgrade. I just didn't see a reason, as none of the new Treos were better. So what if my contract expired a year and a half ago and I could have upgraded at any time?

Then I saw the Centro, and something about it changed my mind. Maybe it was the much smaller size, or the brighter, clearer screen. Maybe the much more usable buttons and keyboard (I know, they're smaller keys, but their gel surface is great for my small hands and fingernails). Maybe that it had Bluetooth. Or that it was available in slightly sparkly black. I came to covet it online, and then I gave in and bought it.

Then I had to transfer my data.

Now, I've had a Palm device for 10 years. I have my calendar going back to early 1996 because I've been synching up my calendar with an enterprise system at three universities during those eleven years. I have databases like all my books (downloaded from LibraryThing) and DVDs in FileMaker Mobile. I have the electronic Zagat. I have ebooks. I have memos and to-do lists and a large address book.

None of them moved over when I updated the Palm desktop and initiated my first sync. Not a darned thing. Not only did they not move to the Centro, they disappeared from my Palm Desktop.

Hysteria ensued.

Once I was talked down off the proverbial ledge, things improved. I was able to beam over all my memos, my to-do list, and my address book from the Treo (although they all lost their categories). I had a calendar backup on my laptop that had 1996 to late 2005 (apparently the last time I backed up my full calendar; memo to self -- never forget to do that again). I discovered that my Zagat had expired the week before, which is why it didn't move. Fixed. I got my ebooks reinstalled (after discovering that I somehow had 2 Palm users set up and that I was using the wrong one in the installer). I deleted the extraneous Palm user, causing a bit more temporary hysteria because I re-lost and had to retrieve some data again.

My Oracle Calendar Sync completely failed. I couldn't even see the conduit. I thought, hey, there's a newer version than the one I have, upgrading will reinitialize the conduit. A fine plan if the installer hadn't failed, corrupting the old install so that I not only couldn't install a new one, I couldn't remove the old one because a file was missing. Our IT help desk didn't have the old installer available anymore (Memo to self -- keep all installers AND keep up on new versions) but a nameless university elsewhere in the world had it on their web site and all was repaired. The first sync took over 130 minutes, but that was a small price to pay to have my past and future calendar back.

All of the above took 2 1/2 days. What never came back, though, was FileMaker Mobile. UVA never moved past FM 6, so that's what I've continued to use as a client with FM Mobile 2. Given that FM is on to version 9 and FM Mobile 8, and the company has announced that it's discontinuing FM Mobile, I do seem to be in a jam there. Yes, I have that installer, and I have my databases, and I can see the conduit, but no joy in synching.

Change is good. I just need to decide what to change to, and then I will have all my data back. Until the next time I have to upgrade ...

Thursday, November 08, 2007

highlights from the Fall 2007 DLF Forum

There were a number of highlights for me from the fall 2007 DLF forum.

Dan Gillmor gave a great opening plenary on journalism in a world of ubiquitous media access. He talked quite a bit about participatory media and collaboration, and how the average person with a cell phone can participate in "random acts of journalism," such as recording the shooting at Virginia Tech or the 2005 tsunami in Sri Lanka. It is more likely that on-the-spot photojournalism will come from anyone, not journalists. He also talked about what he termed "advocacy journalism," where a community is formed, whether around a location or a topic, and that community reports often more quickly and more deeply on issues of interest to their community. What's the role of professional journalism? "Do what you do best and point to the rest." Follow well-established journalistic practices in reporting, and point to those communities and compendia that are doing a good job rather than trying to also do what they do. In a world with so much access, there is more transparency; conversely, it is also much more difficult to keep secrets. But what do you trust? What's accurate? Trust cannot be based solely on popularity, but on reputation, which is exceptionally difficult to qualify and quantify.

Rick Prelinger gave a talk on moving image archiving and digitization. I loved this phrase: "Wonderful and unpredictable things happen when ordinary people get access to original materials." In a world where there is now more individual production than institutional production, we should be crawling and preserving what's out there on YouTube and elsewhere, starting early on what will be the hardest to preserve. He also pointed out that YouTube raised popular expectations about video findability while simultaneously lowering quality expectations and making the segmenting of content out of its raw or original context the norm. In making his point that workflows should not be sacred, Rick also referenced the SAA Greene-Meissner report, which urged archivists to consider new ways to deal with hidden collections. Our social contract with users is to provide access. Digitization provides visibility and access, which can drive preservation.

Ricky Erway led an interesting overview of the agreements that various institutions have entered into with their third-party digitizing partners. The one that I knew the least about was the NARA arrangement with Footnote, where the materials will be available only on Footnote by subscription for five years, after which NARA can make them available, although NARA admits that it isn't clear exactly what they can and cannot do. For some reason James Hastings made sure to make the point that Footnote is not an LDS Church unit, although the parties involved definitely have ties to the church and are strongly interested in the materials for use in genealogy. There is absolutely nothing wrong with that.

There was a session on mass digitization "under the hood." I was particularly floored with the work at the National Archive of Recorded Sound and Images in Sweden. Their automated (in some cases robotic) processes for digitization are truly astonishing, as is the scale of their digitization. If I am reading my notes right, they create between 1.5 and 2.5 TB every day.

Herbert Van de Sompel gave a very effective presentation on ORE. I see a lot more folks getting what he and Carl Lagoze have been saying about compound objects. I love their elegant use of ATOM.

Denise Troll Covey gave a report on the preliminary results of an in-progress study at Carnegie Mellon on faculty self-archiving. I look forward to reading the final results and being able to share them, especially given the misinformation that faculty believe about their rights and their lack of archiving.

Steve Toub and Heather Christenson gave a great talk on a survey of book discovery interfaces. Microsoft Live Search Books seemed to fare the worst, while LibraryThing seemed to be at the top. They promise to make available a ppt with many more slides than they presented.

Tito Sierra, Markus Wust, and Emily Lynema from NC State presented "CatalogWS," their RESTful Web API, which they take advantage of for their very cool MobiLib mobile catalog app, as well as a staff book cover visualization tool for large screen displays, and an advanced faceted search interface for generating custom catalog item lists for blogs and webpages. They also gave a nice shout-out to Blacklight, which we appreciated.

Mike Furlough and Patrick Alexander from Penn State led a good discussion on publishing and libraries. Activities represented in the room ranged from journal hosting to publishing of library collections online to collaborating on born-digital scholarship to working with university presses on electronic editions of works.

Read Peter Brantley's post on Mimi Calter's talk on the examination of the Copyright Registration Database that Peter and Carl Malamud worked so hard to set free. Make sure to also read the comments.

I was happy with the responses that Bess Sadler and I received from our presentation on Project Blacklight. Bess went way above the call of duty and completed some UI update tasks (translating language codes and setcodes into human-readable terms and adding browse centuries) and figured out how to combine the Virgo and Repo indexes in Lucene while sitting in our hotel room. We were able to show REALLY up-to-date screen shots in our presentation, including one we added while setting up the laptop at the podium.

Wednesday, October 31, 2007

LibX and OpenURL Referrer browser extensions

We launched our UVA Library LibX plugin for Firefox in June 2007, and it's gotten some rave reviews from staff. Now that UVA has approved the rollout of Vista and IE 7 on its computers, we're testing the beta IE version of LibX. I understand we've supplied some feedback on installation and running on Vista.

When I saw the recent announcement of the availability of OCLC's OpenURL Referrer for IE, I paused a bit when considering who to send the announcement to. LibX is the tool we promote with our users. It recognizes DOIs, ISBNs, ISSNs, and PubMed IDs, supports COinS and OCLC xISBN, and works with our resolver and our catalog, Virgo. We have our resolver working with Google Scholar.
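For anyone who hasn't looked under the hood at COinS: it's just an empty span with the class "Z3988" whose title attribute carries an OpenURL ContextObject as key/encoded-value pairs, which a tool like LibX can then hand to a resolver. Here is a minimal sketch of building one for a book. The helper name `coins_span` and the ISBN and title are placeholders of mine for illustration, not anything from LibX or our catalog.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(isbn: str, title: str) -> str:
    """Return a COinS <span> describing a book (illustrative helper)."""
    # Key/encoded-value pairs of an OpenURL (Z39.88-2004) ContextObject.
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:book"),
        ("rft.btitle", title),
        ("rft.isbn", isbn),
    ]
    query = urlencode(fields)
    # The ContextObject rides in the title attribute of an empty span;
    # escape() turns the "&" separators into "&amp;" so the HTML stays valid.
    return f'<span class="Z3988" title="{escape(query)}"></span>'


# Placeholder values, not a real catalog record.
print(coins_span("9780000000002", "An Example Book"))
```

A page that embeds spans like this doesn't need to know anything about the reader's institution; the browser extension supplies the resolver.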

In the end, I didn't forward the announcement because we're trying to promote the use of LibX and I didn't want to dilute that message for our staff and users. The OpenURL Referrer is a very cool tool and a great use of the OCLC Resolver Registry so users don't have to know anything except the name of their institution to set it up. I'm just not sure if we need both, at least not right now.

I need to ask if we know how much use our LibX toolbar is getting.

JPEG2000 at LC

It is indeed welcome news (Digital Koans and the Jester) that the Library of Congress and Xerox are teaming up on j2k implementation, and even more welcome that it's in context of NDIIPP and preservation.

This is a key part of the announcement for me:

Xerox scientists will develop the parameters for converting TIFF files to JPEG 2000 and will build and test the system, then turn over the specifications and best practices to the Library of Congress. The specific outcome will be development of JPEG 2000 profiles, which describe how to use JPEG 2000 most effectively to represent photographic content as well as content digitized from maps. The Library plans to make the results available on a public Web site.

We all know that we could have jp2 files that are smaller than our TIFF masters. We all know that we could move from TIFF to jp2 as an archival format. We've all researched codecs and delivery applications so we can cut down on the number of deliverable files we generate. That's real progress. What we haven't done is figure out how best to migrate our legacy files and move forward. I look forward to seeing the outcomes of this project.

Thursday, October 25, 2007

OCLC report on privacy and trust

I grabbed the report the other day but only had the chance to barely skim it. It's obviously full of valuable data, and the visualization graphics are great.

I liked these stats, because I have long tried to make the argument that online interactions are becoming ubiquitous:

Browsing/purchasing activities: Activities considered as emerging several years ago, such as online banking, have been used by more than half of the total general public respondents. Over 40% of respondents have read someone’s blog, while the majority have browsed for information and used e-commerce sites in the last year, a substantial increase in activity as seen in 2005. While commercial and searching activities have surged in the past two years, the use of the library Web site has declined from our 2005 study.

Interacting activities: The majority of the respondents have sent or received an e-mail and over half have sent or received an instant message. Twenty percent (20%) or more of respondents have participated in social networking and used chat rooms.

Creating activities: Twenty percent (20%) or more of respondents have used a social media site and have created and/or contributed to others’ Web pages; 17% have blogged or written an online diary/journal. (section 1-6)

It's nice to see library web sites firmly in the middle when grouped with commercial web sites used by all age groups. (section 2-8)

Data on how much private information is shared (section 2-31) is not too surprising but interesting to see quantified. People's faith in the security of personal information on the Internet (section 3-2, 3-4) is higher than mine. That younger respondents have different ideas about relative privacy of categories of data (section 3-9, 3-10, 3-35) is not surprising, but I wonder why more people aren't concerned about privacy. It's good to see trust data in addition to privacy data. (section 3-24)

Section 4 focuses on Library Directors as a category of respondents. It seems that overall they read more, have been online longer, and interact more online than the general public. They also over-estimate how important privacy is to their users.

Section 5 is on libraries and social activities.

Karen Schneider had more time, and her response is worth a read.

Section 7 is the summary if you can't face all 280 pages.

Open Geospatial Consortium

It's ironic in a way, but I was unfamiliar with the Open Geospatial Consortium until seeing mention that Microsoft was joining as a principal member. Their standards seem worth looking at, but I have no sense of how much their reference model or specifications are used or what their compliance certification means for content providers or users, other than the promise of interoperability. They have a long list of registered products, not a lot of which seem to be noted as compliant yet.

Open Library ILL

While I was reacting to one aspect of the New York Times article on the OCA, others were reacting to the very last bit of the article on the proposed ILL-like service from the Open Library initiative. Aaron Swartz was interviewed at the Berkman Center. Günter Waibel briefly described the effort. Peter Brantley had something to say.

But Peter Hirtle really had something to say.

For me, this is the key portion of Peter's post:

Unfortunately just because a book is out of print does not mean that it is not protected by copyright. Right now a library may use Section 108(e) of the Copyright Act to make a copy of an entire book for delivery to a patron either directly or through ILL, and that copy can be digital. But the library has to clear some hurdles first. One of them is that the library has to determine that a copy (either new or used) cannot be obtained at a fair price. The rationale for this requirement was that activity in the OP market for a particular title might encourage a publisher to reprint that work. In addition, the copy has to become the property of the user - the library cannot add a digital copy to its own collection.

Is this inefficient? Yes. Would it be better if libraries could pre-emptively digitize copyrighted works, store them safely, conduct the market test when an ILL request arrives, and then deliver the book to the user if no copy can be found? Yes. But this is not what the law currently allows.

As much as I want to encourage digitization and freeing of post-1923 works that are indeed out of copyright or orphaned, I know that Peter Hirtle has a strong position here. This principle is one that we've been looking at quite closely in a related realm -- developing a proposed "drop-off" digitization service for faculty. Not surprisingly, faculty sometimes hope we can digitize, say, slides that they have purchased commercially. We must determine who the source is and if we can obtain digital versions (whether at a fair price or at all). If we then decide that we can digitize them for the faculty member (which will not always be the case) we definitely cannot add the files to our collections or hold on to them in any way.

Peter ends his post with a goal for us all -- that we should all work toward "convincing Congress that the public good would benefit from a more generous ILL policy for out of print books" in order to increase access to our collections.

ArchaeoInformatics Virtual Lecture Series

The Archaeoinformatics Consortium has announced their 2007-2008 Virtual Lecture Series schedule, where archaeologists describe their successful cyberinfrastructure efforts and innovators from other disciplines present information on their cyberinfrastructure initiatives and strategies that may be useful to archaeology.

These lectures are presented every other week using the Access GRID video conferencing system. It is also possible to participate in the lectures by downloading the presentation slides and participating via a telephone bridge. Information on how to connect to the Access GRID system and alternatives is provided online. The lectures from the 2006-2007 series and this year’s lectures are also available as streaming video from the archaeoinformatics web site.

Monday, October 22, 2007

New York Times article on OCA

There was an article in the New York Times about libraries choosing to work with OCA instead of Google.

These are two quotes that I kept going back to -- "It costs the Open Content Alliance as much as $30 to scan each book" and "Libraries that sign with the Open Content Alliance are obligated to pay the cost of scanning the books. Several have received grants from organizations like the Sloan Foundation. The Boston Library Consortium’s project is self-funded, with $845,000 for the next two years. The consortium pays 10 cents a page to the Internet Archive ..."
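A quick back-of-the-envelope check on what those quoted figures imply. This is only a sketch: it assumes the whole $845,000 goes to scanning at the quoted per-page rate, which the article does not actually say, and it treats $30 as the cost of every book when the quote says "as much as."

```python
# Figures quoted in the article, held in cents to keep the math exact.
COST_PER_PAGE_CENTS = 10          # 10 cents a page to the Internet Archive
MAX_COST_PER_BOOK_CENTS = 30_00   # "as much as $30 to scan each book"
BUDGET_CENTS = 845_000_00         # Boston Library Consortium, two years

# Page count implied by a book that costs the full $30 to scan.
pages_per_book = MAX_COST_PER_BOOK_CENTS // COST_PER_PAGE_CENTS

# Assumption: every budget dollar is spent on scanning.
pages_funded = BUDGET_CENTS // COST_PER_PAGE_CENTS
books_at_max_cost = BUDGET_CENTS // MAX_COST_PER_BOOK_CENTS

print(f"{pages_per_book} pages in a $30 book")
print(f"{pages_funded:,} pages funded")
print(f"{books_at_max_cost:,} whole books at $30 each")
```

So a $30 book is a 300-page book at that rate, and under these assumptions the budget covers roughly 8.45 million pages, or a little over 28,000 books of that size.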

A number of years ago we estimated that it would cost us many hundreds of dollars to digitize a book, but that involved a lot of manual work, from shooting the images to keyboarding of text (rather than OCR) to QA. We haven't revisited such an estimate in a while -- I'm sure it's lower now, but not that low. Of OCA participants, some get foundation support to shoulder the costs, and some fund it entirely themselves. Compared to doing it all yourselves, that's a bargain. I'm a fan of the OCA effort.

But then there's this quote -- “taking Google money for now while realizing this is, at best, a short-term bridge to a truly open universal library of the future.”

We don't actually take any Google money. Yes, Google provides a service for us, but they don't pay us for our participation. We underwrite certain costs for our participation. Yes, there are restrictions. You can read our agreement. One can and should question issues of control over data by any company or institution, but there is value in our pre-1923 volumes being made publicly available through Google Books. The institutions that chose to participate in one project versus the other (or both) should be neither lionized nor assailed.

Friday, October 19, 2007

RUBRIC Toolkit released

The RUBRIC project (Regional Universities Building Research Infrastructure Collaboratively) is sponsored by the Australian Commonwealth Department of Education, Science and Training. The RUBRIC Toolkit is the documentation of the process used by the project to model institutional repository services and evaluate tools. It includes a number of great checklists that could be used by any institution planning for an IR.


Every place I turn today there's something about Twine:

Twine is Smart

Twine is unique because it understands the meaning of information and relationships and automatically helps to organize and connect related items. Using the Semantic Web, natural language processing, and artificial intelligence, Twine automatically enriches information and finds patterns that individuals cannot easily see on their own. Twine transforms any information into Semantic Web content, a richer and ultimately more useful and portable form of knowledge. Users of Twine also can locate information using powerful new patent-pending social and semantic search capabilities so that they can find exactly what they need, from people and groups they trust.

Twine “ties it all together”

Twine pools and connects all types of information in one convenient online location, including contacts, email, bookmarks, RSS feeds, documents, photos, videos, news, products, discussions, notes, and anything else. Users can also author information directly in Twine like they do in weblogs and wikis. Twine is designed to become the center of a user’s digital life.

This is an exceptionally attractive concept, especially given its ability to parse content for meaning and identify new relationships and context with other content. You identify your "twine" content through tagging, and Twine applies your assigned semantics to other content, adding to your twine. It's also a social site where a personal network can interact with your shared twine and its content, enriching its semantic layer through that interaction. Twine is meant to learn from its folksonomies.

O'Reilly Radar reports on the demo at the Web2 Summit.
Read/Write Web has two postings here and here.

I've submitted a beta participation request. This could be an interesting tool for distributed research efforts.

Thursday, October 18, 2007

killer digital libraries and archives

Yesterday the Online Education Database released a great list of "250+ Killer Digital Libraries and Archives." It lists sites by state, by type, has a focus on etexts, and is a remarkable compendium of digital resources.

Of course, the first thing I did was look for our digital collections on the list. The UVA Library hosts the wonderful Virginia Heritage resource, which brings together thousands of EAD finding aids for two dozen institutions across the state of Virginia. We have our Digital Collections, with more than 20,000 images, 10,000 texts, and almost 4,000 finding aids.

Nope. Not on the list.

Not surprisingly, our former Etext Center was on the list under etexts (the Center no longer exists as a unit and its texts are gradually being migrated). The Virginia Center for Digital History was there, as it should be with its groundbreaking projects and its great blog. IATH was there with its many innovative born-digital scholarly projects.

I sulked about this for a few minutes while thinking about the likely reason we weren't on the list -- for the past few years we've been talking nonstop about our Repository and Fedora and not about our collections. Now, we wanted and needed to talk about Fedora and our Repository because we were really trying new things and solving interesting problems with our development and participating in building a community around Fedora. But users don't care about how cool our Repository development is. They care about the collections in the Repository.

We've spent the last few months working at raising awareness about what we have. Our new Library home page now has a number of links to the digital collections. We have pages on how to find what you're looking for in our digital collections. We have feature pages for all of our collections in the Repository. We're making progress in migrating collections and making the Digital Collections site a central location where they're visible. We have an RSS Feed for additions to the collections. We now have a librarian and a unit dedicated to shepherding collection digitization through the process and working more closely with faculty. I hope the next time someone creates a list like this we'll be visible enough to be on it.

Wednesday, October 17, 2007

two interesting IP decisions

Just in time for World Series fever -- The US Court of Appeals for the Eighth Circuit has upheld (PDF) a lower court's ruling that stats are not copyrightable in a case between CBC Distribution (a fantasy sports operation) and Major League Baseball and the MLB Players Association.

MLB argued that its player names and stats were copyrightable and that CBC—or any other fantasy league—couldn't operate a fantasy baseball league without a multimillion-dollar licensing agreement with MLB. CBC countered that the data was in the public domain and as such, it had a First Amendment right to use it. In August 2006 a US District Court sided with CBC. Now that decision has been upheld after MLB appealed.

In other news, the USPTO has rejected many of the broadest claims of the Amazon One-Click patent following the re-examination request by blogger Peter Calveley. To read the outcome of the re-examination request, go to the USPTO PAIR access site, choose the "Control Number" radio button, enter 90/007,946 and press the "Submit" button. Then go to the "Image File Wrapper" tab and select the "Reexam Non-Final Action" document. This doesn't mean that the patent has been thrown out -- five of the twenty-six claims were found to be patentable. The document is the "non-final action" document and Amazon has rights in the process. The patent could be either thrown out or modified. It will be interesting to see.

Tuesday, October 16, 2007

what if google had to optimize its design for google?

Web developers have a lot of hoops to jump through to optimize their sites for discovery through the Google search engine. Getting good Google index placement is paramount, so it is apparently increasingly difficult for designers to offer the uncluttered user experience that they'd like to while building needed content into their pages. Here is a funny look at what would happen to Google's uncluttered look if Google had to design for the Google Search Engine. Not sure how accurate it is ...

Monday, October 15, 2007


DRAMA (Digital Repository Authorization Middleware Architecture) at Macquarie University has released Muradora, a repository application that supports federated identity (via Shibboleth authentication) and flexible authorization (using XACML). Fedora forms the core back-end repository, while different front-end applications (such as portlets or standalone web interfaces) can all talk to the same instance of Fedora, and yet maintain a consistent approach to access control. A Live DVD image can be downloaded to install Muradora on a server following an easy installation procedure that is based on Ubuntu Linux Distribution. Muradora LiveDVD can be downloaded from

From the announcement email:

- "Out-of-the-box" or customized deployment options

- Intuitive access control editor allows end-users to specify their own access control criteria without editing any XML.

- Hierarchical enforcement of access control policies. Access control can be set at the collection level, object level or datastream level.

- Metadata input and validation for any well-formed metadata schema using XForms (a W3C standard). New metadata schemas can be supported via XForms scripts (no Muradora code modification required).

- Flexible and extensible architecture based on the well known Java Spring enterprise framework.

- Multiple deployments of Muradora (each customized for their own specific purpose) can talk to the one instance of Fedora.

- Freely available as open source software (Apache 2 license). All dependent software is also open source.
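To make the "hierarchical enforcement" bullet a little more concrete, here is a toy sketch of how a policy set at the collection level can be overridden at the object or datastream level. Everything in it is my own illustration: the policy table, the role names, and the "most specific policy wins, otherwise deny" rule are assumptions, not Muradora's code and not how XACML itself combines policies (XACML has its own policy-combining algorithms).

```python
# Hypothetical policy table: a path in the hierarchy -> roles allowed to read.
# Paths are (collection,), (collection, object), or
# (collection, object, datastream).
POLICIES = {
    ("maps",): {"public"},
    ("maps", "map-001"): {"staff"},
    ("maps", "map-001", "MASTER_TIFF"): {"admin"},
}


def allowed_roles(*path):
    """Return the roles from the most specific policy covering this path."""
    # Walk from the full path up toward the collection level.
    while path:
        if path in POLICIES:
            return POLICIES[path]
        path = path[:-1]
    return set()  # assumption: no applicable policy means deny


def can_read(role, *path):
    return role in allowed_roles(*path)


# An object with no policy of its own inherits from its collection.
print(can_read("public", "maps", "map-002"))
# The object-level policy overrides the collection-level one.
print(can_read("public", "maps", "map-001"))
# A datastream with no policy of its own inherits from its object.
print(can_read("staff", "maps", "map-001", "THUMBNAIL"))
```

The appeal of setting access this way is that most items need no policy of their own; only the exceptions do.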

Friday, October 12, 2007

public personae

When I joined Facebook earlier this week I had every intention of keeping my activity to a minimum. No picture. No personal details. Minimal professional details. No groups. Why would I share my personal life online? That's personal, that's private!

Then folks started pointing out to me that I was being silly. My email address and information about my job are all over our Library's web site. Presentations that I've given are online everywhere. Old email listserv postings are available through list archives. I use flickr, and while some images are kept restricted to friends and family, most aren't (but with a Creative Commons license). My LibraryThing catalog is public. I have this blog.

Basically, I was told to get over it. I already have a public persona, even if it's not one that interests more than a few close friends and colleagues and folks interested in arcane digital library topics.

But, but, what about privacy? I know better than to share too much about myself online. I have friends who have been the subject of identity theft. One hears cautionary (but possibly apocryphal) tales every day about middle schoolers getting stalked online through their MySpace pages. How did my life get so public? Gradually, without my even consciously noticing it. There's no going back.

what is publishing?

I'm in the process of making a transition in my organization, shifting into a newly created position as Head of Digital Publishing Services.

The first question that everyone asks is "What will you be doing?" The second is "What is publishing in a Library?"

We partner with faculty who are selecting content, organizing it, describing it, analyzing and identifying and creating intellectual relationships, and presenting and interpreting that content in new ways as born-digital scholarship. These projects are more frequently being considered in the promotion and tenure process. Is that the Library supporting a publishing activity? Most certainly.

We digitize collections, describe them, organize them, present them online, and promote them to our community for their use in teaching and research. Is that a publishing activity? It can be argued either way (and has been) -- I'm on the side that leans toward yes.

We provide production support and hosting for peer-reviewed electronic journals. No one would dispute that we're participating in a publishing activity.

We're evaluating an Institutional Repository. Not a publishing activity per se, but a way to preserve the publishing output of our community. That's a service related to our stewardship role.

As a Library we're already very active participants in publishing activities. We have our Scholars' Lab, Research Computing Lab, and Digital Media Lab serving as the loci for our faculty collaborations. My role will be formalizing what our publishing services are, what our work flows should be, and how we can sustain and expand our consulting services related to scholarly communication.

Wednesday, October 10, 2007

apparently peer pressure works on me

I gave in and joined Facebook. I'm astonished at the high level of activity that some folks seem to be able to maintain. I have a very boring profile: I added a few applications; some people seem to have dozens. I haven't added much personal information. People apparently send lots of messages and post on each other's profiles. I'm going to have to remember to check it, on top of checking other places where I do my sharing and connect with folks (LinkedIn, LibraryThing, flickr). Still, I think it will be good to have another place to try to maintain contact.

Friday, October 05, 2007

digital lives project

Digital Koans posted about the Digital Lives research project, "focusing on personal digital collections and their relationship with research repositories."

For centuries, individuals have used physical artifacts as personal memory devices and reference aids. Over time these have ranged from personal journals and correspondence, to photographs and photographic albums, to whole personal libraries of manuscripts, sound and video recordings, books, serials, clippings and off-prints. These personal collections and archives are often of immense importance to individuals, their descendants, and to research in a broad range of Arts and Humanities subjects including literary criticism, history, and history of science.


These personal collections support histories of cultural practice by documenting creative processes, writing, reading, communication, social networks, and the production and dissemination of knowledge. They provide scholars with more nuanced contexts for understanding wider scientific and cultural developments.

As we move from cultural memory based on physical artifacts, to a hybrid digital and physical environment, and then increasingly shift towards new forms of digital memory, many fundamental new issues arise for research institutions such as the British Library that will be the custodians of and provide research access to digital archives and personal collections created by individuals in the 21st century.

I very much look forward to seeing the results of this work, as university archives and institutional repositories increasingly have to cope not only with managing and preserving deposited personal digital materials, but potentially also with describing, organizing, and making such collections usable.

While not the focus of their study, anyone who has ever supported teaching with images knows a tangential area of this problem space intimately. Faculty develop their own collections of teaching images -- their own analog photography, purchased slides, digital photography, images found on the open web, images from colleagues, etc. We have licensed images and surrogates of our own physical collections. They want to use materials from their own collections and our repositories together in their teaching. What is the relationship between their image collections and our repositories and teaching tools? Do we integrate their collections into ours? Do we have a role in digital curation and preservation of their data used in teaching and research, which happen to be images? We struggle with the legal and resource allocation issues every day.

elastic lists

I've been exploring a demo of the visualization model called Elastic Lists. It comes from work done on "Information Aesthetics" published by Andrea Lau and Andrew Vande Moere (pdf) and is implemented using the Flamenco Browser developed at UC Berkeley. The facets have a visualization component comprising color (a brighter background connotes higher relevancy) and proportion (larger facets take up a larger relative proportion of space).
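The proportion half of that idea is simple enough to sketch. The snippet below gives each facet value a row height in proportion to its share of the current result set. The function names, the 300-pixel list height, and the 12-pixel minimum row are my own assumptions for illustration; this is not the demo's code, and I've left out the color component because I don't know how the demo computes relevancy.

```python
from collections import Counter


def facet_proportions(values):
    """Map each facet value to its share of the current result set."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def row_heights(values, total_height=300, min_height=12):
    """Give each facet value a row height proportional to its share.

    min_height is an assumed floor so rare values stay readable; rows at
    the floor can push the total slightly past total_height.
    """
    shares = facet_proportions(values)
    return {
        value: max(min_height, round(share * total_height))
        for value, share in shares.items()
    }


# Made-up result set: the type facet of ten records.
results = ["image"] * 6 + ["text"] * 3 + ["finding aid"]
for value, height in row_heights(results).items():
    print(f"{value}: {height}px")
```

Each time the user picks a value in one facet, the result set shrinks and every other list would be recomputed the same way, which is where the "elastic" behavior comes from.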

I find this to be a really compelling visualization, easier for the user to navigate than the perhaps too-complex Bungee View. I'd like to see this applied to more heterogeneous data, such as one might encounter in a large digital collection.

Thursday, October 04, 2007

Cal has its own YouTube channel for course lectures

The University of California at Berkeley has its own YouTube channel. This complements their use of iTunesU.

Ars Technica covers the launch.

It's interesting to see everything from course lectures to colloquia presentations to football highlights. This is very smart co-branding with YouTube as a web destination, and a savvy use of a commercial hosting service.

a new museum starts virtual

I am thrilled to see that the Smithsonian's newest museum -- the National Museum of African American History and Culture -- is open for business. Their opening exhibition is “Let Your Motto Be Resistance,” featuring portraits and photographs of people who stood against oppression, from Frederick Douglass to Ella Fitzgerald to Malcolm X. They have lesson plans. They have a Memory Book for visitor-supplied content that makes great use of the visualization interface for browsing that's also employed on the rest of the site.

It's a really nice experience, especially for a museum for which groundbreaking is not scheduled until 2012.

From the site:

With the help of a $1 million grant of technology and expertise from IBM, the NMAAHC Museum on the Web represents a unique partnership to use innovative IBM expertise and services to bring the stories of African American History to a global audience. Conceived from the very beginning as a fully virtual precursor to the museum to be built on the Washington Mall, this is the first time a major museum is opening its doors on the Web prior to its physical existence.

This is a very exciting collaboration, and a great way to build a community around an institution and its mission years before anyone will walk through the door.

Tuesday, October 02, 2007

self-determined music pricing

Yesterday it was all over the blogosphere and on NPR that Radiohead is taking control of its own distribution and releasing its new album with self-determined pricing. This was heralded as huge, earth-shattering, and a "watershed moment" according to CNET.

It's as if no one has done this before. But of course, at least one person has, and very successfully.

In the early 1980s I discovered Canadian singer Jane Siberry, who now goes by the name Issa. I own her albums. I've seen her perform twice so far (a third is in the offing -- she's coming to Charlottesville this month).

And, in 2005, she transformed her personal label's inventory from physical to digital, put it online, and allowed self-determined pricing. Some of her earlier material is not available due to licensing restrictions -- she's not encouraging illegal downloading -- but she has successfully licensed some albums and songs she did for Warner and made them available through this route. Her Wikipedia article quotes an interview in The Globe and Mail where she says that since she instituted the self-determined pricing policy, the average income she receives per song is in fact slightly more than the standard price. The "Press" section of her Jane Siberry site has an interesting Chicago Tribune article from that year.

But, since almost no one has heard of Jane Siberry and everyone has heard of Radiohead, it's as if no one has ever done this before. There was a Thingology post that commented that Radiohead probably borrowed the idea from someone or thought of it on their own. Here's an example of someone who did it first. I'm not knocking Radiohead. I like their music. I'm thrilled that such a high profile group is doing this. But this is not their watershed moment alone.

Monday, October 01, 2007

freeing copyright data

Messages went out on the Digital Library Federation (DLF) mailing list and on O'Reilly Radar yesterday letting us know that DLF had helped set free the entire dataset from the U.S. Copyright Office's copyright registration database. LC pointed out that Congress set the rules for charging, but they should have known that their referring to the issue as a "blogospheric brouhaha" was just going to drive someone to do something.

DLF and Public.Resource.Org sent a request to open access to the data. Lots of sites picked this up, including BoingBoing. The registration database is a compilation of facts -- it's not copyrightable itself. The U.S. Register of Copyright agreed that these are public records and should be available in bulk.

So Public.Resource.Org and DLF made it so.

It's extremely exciting to see DLF lobbying so effectively and participating in an effort to make data available via open access, and help all libraries provide better copyright search services. BoingBoing celebrates them as guerrilla librarians. This is not your father's DLF.

Friday, September 28, 2007

cool securing tool for kids on the internet

I just saw a commercial for the Fisher Price Easy Link Internet Launch Pad, targeted at children three and up. The Easy Link -- a specialized USB peripheral with its own software -- allows children to explore sites dedicated to characters when they plug a figure of that character, like Elmo, into the Launch Pad. The kids are offered links to read and games to play, and nothing more -- there is no access to the Internet or to the hard drive and any of its applications without a password. It's $30, which is a reasonable price to introduce kids to working with a computer while limiting their access to anything they can damage or that can harm them.

And yes, kids that young do use computers. I remember watching my cousin's daughter playing computer games when she was four. But of course, both her parents are software engineers.

follow-up to virtual strike

Read the report and see the screen shots from the virtual strike in Second Life.

award for digital preservation tool

A press release was circulated via email announcing that DROID, a tool from The National Archives in London, had won the 2007 Digital Preservation Award.

From the press release:

An innovative tool to analyse and identify computer file formats has won the 2007 Digital Preservation Award.

DROID, developed by The National Archives in London, can examine any mystery file and identify its format. The tool works by gathering clues from the internal 'signatures' hidden inside every computer file, as well as more familiar elements such as the filename extension (.jpg, for example), to generate a highly accurate 'guess' about the software that will be needed to read the file.

Identifying file formats is a thorny issue for archivists. Organisations such as the National Archives have an ever-increasing volume of electronic records in their custody, many of which will be crucial for future historians to understand 21st-century Britain. But with rapidly changing technology and an unpredictable hardware base, preserving files is only half of the challenge. There is no guarantee that today's files will be readable or even recognisable using the software of the future.

Now, by using DROID and its big brother, the unique file format database known as PRONOM, experts at the National Archives are well on their way to cracking the problem. Once DROID has labelled a mystery file, PRONOM's extensive catalogue of software tools can advise curators on how best to preserve the file in a readable format. The database includes crucial information on software and hardware lifecycles, helping to avoid the obsolescence problem. And it will alert users if the program needed to read a file is no longer supported by manufacturers.

PRONOM's system of identifiers has been adopted by the UK government and is the only nationally-recognised standard in its field.

The judges chose The National Archives from a strong shortlist of five contenders, whittled down from the original list of thirteen. The prestigious award was presented in a special ceremony at The British Museum on 27 September 2007 as part of the 2007 Conservation Awards, sponsored by Sir Paul McCartney.

Ronald Milne, Chair of the Board of Directors of the Digital Preservation Coalition, which sponsors the award, said: "The National Archives fully deserves the recognition that accompanies this award."
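
To make the approach described in the press release a bit more concrete, here is a minimal Python sketch of signature-based format identification. It is not DROID's code and does not use PRONOM's signature files; the small signature table and the function name identify_format are assumptions for illustration only. It checks a file's leading bytes against a few well-known signatures and falls back to the filename extension when nothing matches.

```python
import os

# A tiny, illustrative signature table (assumption: not taken from PRONOM).
# Each entry maps a leading byte sequence to a format label.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
]

# Fallback guesses from the filename extension, used only when no signature matches.
EXTENSIONS = {
    ".txt": "plain text (guessed from extension)",
    ".csv": "CSV (guessed from extension)",
}


def identify_format(path):
    """Return a best-guess format label for the file at `path`."""
    # Read only as many bytes as the longest signature needs.
    longest = max(len(signature) for signature, _ in SIGNATURES)
    with open(path, "rb") as handle:
        header = handle.read(longest)
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    extension = os.path.splitext(path)[1].lower()
    return EXTENSIONS.get(extension, "unknown")
```

A real tool needs far more than this, since many formats carry their signatures at offsets other than the start of the file, but the sketch shows why internal bytes are a stronger clue than the extension alone.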

Thursday, September 27, 2007


I spent a little time this afternoon reading up on the newly released digital preservation tool Xena.

You can point it at a directory of diverse file types and it will convert the files into normalized open formats. The list of supported formats and the conversion outcomes is available in the help docs.

This is potentially a really useful workflow tool but there's a lot to examine here. I don't know how scriptable it is. You can write plugins to add in new formats -- I'm not yet sure if you can change conversion decisions and alter the target formats. Why is the target format for pretty much every image format PNG? Could we change that to TIFF or JPEG2000 if we were willing to write the plugin? It runs on Windows and Linux and requires OpenOffice. On Linux, does it require a graphical environment, or can you run it from the command line?

I'm thinking that this could be really useful for an IR, but I'm not yet sure if it will scale for Library-wide preservation or collection repositories.

Monday, September 24, 2007

archives on the web

Technophilia lists Where the Web Archives Are. Here's what they say:

Some of the most intriguing resources on the web are located in archives—compilations of data that in the past, could only be found by making appointments in dusty libraries. Today, I'm going to take you on a quick tour through some of the most fascinating archives on the web.
So where are they? If I am reading the list correctly, they're pretty much not at any academic libraries.

In the "Government" section, there is the National Archives and the Library of Congress. There is the Internet Archive, which is indeed a library. There's the Rockefeller Archive. There's NASA. There's David Rumsey, possibly the best private map archive in the world. There is the British Library.

Otherwise, it's Calvin and Hobbes, Smithsonian Magazine, the Smoking Gun, and The Balcony Archives of movie reviews.

I don't want to knock their list -- it's an interesting list full of great collections of very worthwhile content. But where are all the other myriad Library special collections and archives on this list? Is it that we aren't visible enough? Or perhaps not cool enough compared to PBS's Nova? Where are our extensive online archives on runaway slaves or civil rights or early American literature? Or political cartoons or penny dreadfuls or sheet music? Or puzzles or jazz or the civil war?

I think we have to remember that our target audience is not just our very local community, but the global community, including non-academics. We all need to think a bit more about how to get the word out about what we've made freely available. Being available in a Google search isn't proactive enough. We need to work to get noticed.

Friday, September 21, 2007

not the usual google law suit

As seen at Tech Crunch, a Pennsylvania resident is suing Google for crimes against humanity and is asking the court for $5 billion in damages because his social security number, when turned upside down and scrambled, spells Google. His handwritten filings are on the Justia site.

Tuesday, September 18, 2007

virtual strike

The first virtual strike is taking place soon. Apparently there are labor actions planned by the union representing Italian employees of IBM over pay negotiations -- as one of their strategies they plan to picket the company's campus in Second Life. They're even providing orientation for IBM employees who are new users. I wonder what the corporate reaction will be? The press this action is getting is pretty intensive.

new york times open access

The story of the day seems to be about the NY Times opening up its archives. So far I've seen postings at boing boing, if:book, open access news, o'reilly radar, and teleread.

So why am I bothering to blog this? Because this made me think about something I blogged about some months ago -- Google News Archive Search. One of the things that galled me at the time was how much of what they indexed was behind a pay firewall. Now, the NY Times is opening almost all their content up (save for 1923-1986), making this a more useful service, at least for resources from one newspaper. If only there wasn't so much other for-fee public domain newspaper content controlled through ProQuest Archiver. I still hope for an OpenURL Resolver service so authorized users can get to authorized resources at ProQuest Historical Newspapers instead.

Saturday, September 15, 2007

career meme

Jerry blogged about the results he received from a test at Career Cruising. Since I was sitting at home on a Saturday afternoon, it seemed the thing to do. I dutifully answered the questions and the follow-up questions, and I just about fell off the sofa when I got the results:

1. Anthropologist
2. Video Game Developer
3. Multimedia Developer
4. Scientist
5. Picture Framer
6. Political Aide
7. Computer Animator
8. Interior Designer
9. Business Systems Analyst
10. Website Designer
11. Market Research Analyst
12. Librarian
13. Medical Illustrator
14. Artist
15. Real Estate Appraiser
16. Computer Programmer
17. Set Designer
18. Cartographer
19. Animator
20. Costume Designer
21. Cartoonist / Comic Illustrator
22. Illustrator
23. Mathematician
24. GIS Specialist
25. Epidemiologist
26. Dental Assistant
27. Statistician
28. Economist
29. Graphic Designer
30. Desktop Publisher
31. Historian
32. Archivist
33. Curator
34. Web Developer
35. Public Policy Analyst
36. Esthetician
37. Hairstylist
38. Technical Writer
39. Makeup Artist
40. Webmaster

I have no idea how their questions led their system to tell me that I should be an anthropologist. Apparently I did select the correct course of study in college and graduate school! Archivist, curator, web designer and developer, and tech writer are all familiar activities to me. I did my share of amateur theatrical work years ago. This was uncannily on target.

But where did dental assistant come from? Or esthetician? Picture framer? Political aide? I just cannot imagine any of those are for me.

Friday, September 14, 2007


I don't think that there is much that I can add to this excellent review of oSkope at if:book. I spent some time at oSkope exploring their flickr search. The mouseover shows the title and date for the image, plus whose collection it came from. If you click on the image, a popup appears that includes the above plus the tags and a zoomable thumbnail. There's a slider at the right that changes the number of images that appear in the grid -- from 4 to 500. The grid, stack, pile, and list views are great -- but I'm not sure what the axes are for the graph view.

I like the drill-down navigation through the ebay categories but, as noted in the if:book entry, it didn't seem to be working and kept returning no items.

The oSkope User Agreement (pdf) accompanies the language "Use of this website consitutes [sic] acceptance of the oSkope User Agreement and Privacy Policy. Please read these agreements carefully." At six pages it is thorough. There's also a four page privacy policy (pdf).

Monday, September 10, 2007


In January I saw a presentation by Julie Allinson at Open Repositories on the UKOLN Repository Deposit Service work. Phil Barker of CETIS has a blog entry on a number of repository standards topics, one of which is SWORD (Simple Web-service Offering Repository Deposit), the project which takes forward the work I saw presented. The goal is to take their deposit protocol and implement it as a lightweight web-service using a prototype "smart deposit" tool for four repository software platforms: EPrints, DSpace, Fedora and IntraLibrary. They're taking advantage of the ATOM Publishing Protocol and extending it, which seems like a smart direction to me. I'm looking forward to seeing more of this.
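
As a rough illustration of what building on the Atom Publishing Protocol means for a deposit, the Python sketch below prepares (but does not send) an HTTP POST of a packaged file to a collection URI, which is how AtomPub creates a new media resource. The collection URL, the file name, and the helper name build_deposit_request are hypothetical, and none of SWORD's own extensions are shown, since I don't have those details here.

```python
import urllib.request


def build_deposit_request(collection_url, payload, content_type, filename):
    """Prepare an AtomPub-style POST that would create a new resource.

    Nothing is sent over the network; the caller decides whether to pass
    the returned request to urllib.request.urlopen().
    """
    return urllib.request.Request(
        collection_url,
        data=payload,
        method="POST",
        headers={
            # The media type of the package being deposited.
            "Content-Type": content_type,
            # AtomPub's Slug header suggests a name for the new resource.
            "Slug": filename,
        },
    )


# Hypothetical collection URI and package, for illustration only.
request = build_deposit_request(
    "http://repository.example.org/deposit/collection",
    b"PK\x03\x04 placeholder zip bytes",
    "application/zip",
    "thesis-package.zip",
)
print(request.get_method(), request.full_url)
```

Under AtomPub, a server that accepts the POST answers with 201 Created and a Location header pointing at the newly created resource, which is what makes the same small client usable across different repository platforms.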

Sunday, September 09, 2007

UNESCO open source repository report

UNESCO has issued a very interesting report -- Towards an Open Source Archival Repository and Preservation System -- that defines the requirements for a digital archival and preservation system and describes a set of open source software which can be used to implement it. It focuses on DSpace, Fedora, and Greenstone, principally comparing the three systems in their support for OAIS. The report uses as the basis for its comparison a single use case -- the management and preservation of images.

I think it's a very fair report, not deeply technical, but an overview of the capabilities of the tools. Fedora is well-reviewed, with some shortcomings mentioned -- it takes a high level of programming expertise to contribute to the core development (true), the administrative reporting tools could stand some improvement (I could use granular use statistics), and it lacks built-in automated preservation metadata extraction and file format validation. On those last two points, the Fedora architecture very easily supports the integration of locally developed automated processes for metadata extraction and format validation into object preparation. That's what we have done. That Fedora has supported checksum checking since version 2.2 is a huge step for file preservation.
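
For readers unfamiliar with checksum checking, here is a minimal Python sketch of the general idea: record a digest when a file is ingested, then recompute it later to confirm the bytes have not changed. This is not Fedora's implementation or API; the function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def compute_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return True if the file still matches the checksum recorded at ingest."""
    return compute_checksum(path, algorithm) == expected.lower()
```

The recorded digest would be stored alongside the object at ingest, and a scheduled job could call verify_checksum on each file and flag any mismatch for repair from a backup copy.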

Thursday, September 06, 2007

google book search features

Google Book Search has introduced a My Library feature, where you can identify volumes in GBS and books that you own and associate them with your Google account. I already had an account that I use with blogger and Google Analytics, so there was nothing to set up. I can search and easily click on an "add to my library" link. I can assign a star rating, add a review, and add labels. I don't seem to be able to see a list of labels that I've assigned. I'd like to be able to create individual sets, but there doesn't seem to be a way to do that. The export is a lightweight xml document that's lacking publication data like date or publisher. You automatically have an RSS feed. It's interesting, but I'm not sure what this gives me over LibraryThing other than URLs for the books in GBS.

The more interesting service is the ability to highlight and quote from a text in GBS. It only works with full view texts -- the tool is not available for any other view. I searched for the term I was interested in and went through 20 screens of results without finding a book that I could try the tool with. I had to resort to an advanced search for titles between 1900 and 1923 to try it. That's an interesting indicator of just how much is in GBS that's post 1923 -- none of the first 200 results in my search were in the public domain and full view.

I found a text I wanted to quote and used the tool to draw a box around the text. Drawing the box is a tad tricky -- my first two tries I didn't get the box large enough to get the first line of what I wanted to quote. I was given the option to create an image of the text block or to grab the text. I could add it to my Google Notebook or send it to blogger (because I have an account). You are also presented with a URL that you can use to embed the note in a web page. The quote includes a link to the text in GBS.

This seems really useful to me. In our paradigm at UVA we talk about how it's not enough to digitize something -- you have to be able to use it. This is the first tool I've seen from GBS where it makes its texts into something that you can really take advantage of in a networked environment.

amazon kindle

There was an article in the New York Times yesterday on ebooks that briefly mentioned two upcoming business models:

In October, the online retailer will unveil the Kindle, an electronic book reader that has been the subject of industry speculation for a year, according to several people who have tried the device and are familiar with Amazon’s plans. The Kindle will be priced at $400 to $500 and will wirelessly connect to an e-book store on Amazon’s site.

That is a significant advance over older e-book devices, which must be connected to a computer to download books or articles.

Also this fall, Google plans to start charging users for full online access to the digital copies of some books in its database, according to people with knowledge of its plans. Publishers will set the prices for their own books and share the revenue with Google. So far, Google has made only limited excerpts of copyrighted books available to its users.

The Google announcement is, I think, a fair one -- right now they limit viewing of copyrighted books to a snippet view. If a work is still clearly in copyright and the rights owner wants to release that book for full access, they should be able to charge for that access. It's their right. Of course I'd like to see more publishers make e-versions of their titles available freely ...

The Amazon news gives me pause, not knowing all the details yet. You access the files wirelessly -- do you read them via a live connection from their servers, or is the file downloaded to the device? I understand why some think it's a plus to not require a full-fledged computer to get access to a book, but it potentially seems like a really limited version of access. The ebook files will be Mobipocket format and the Kindle device seems to use a proprietary wireless system to grab the files (known through their FCC filing), so the files likely won't be available to other devices. They are not using the Adobe format for their files; it's not clear if the Kindle will support reading of Adobe ebooks from other sources or if you can only read Amazon files. Can you get the files off the device or back it up? If you can get the files off the device, will they work with the desktop version of Mobipocket? There have also been complaints about Mobipocket DRM.

This is all speculation given the lack of details. TeleRead has some speculation of their own. I look forward to hearing more about the product and the service.

Wednesday, September 05, 2007

fair use decision

Today the Tenth Circuit court ruled unanimously in favor of Larry Lessig, et al, in Golan v. Gonzales, a case about the scope of fair use. The court has acknowledged that First Amendment freedoms must be considered when copyright law is made.

The government had argued in this case, and in related cases, that the only First Amendment review of a copyright act possible was if Congress changed either fair use or erased the idea/expression dichotomy. We, by contrast, have argued consistently that in addition to those two, Eldred requires First Amendment review when Congress changes the "traditional contours of copyright protection." In Golan, the issue is a statute that removes work from the public domain.

Monday, September 03, 2007

internet archive and nasa

I missed this announcement last week (even though Peter Suber blogged it) -- NASA and Internet Archive Team to Digitize Space Imagery:

NASA and Internet Archive of San Francisco are partnering to scan, archive and manage the agency's vast collection of photographs, historic film and video. The imagery will be available through the Internet and free to the public, historians, scholars, students and researchers.

Currently, NASA has more than 20 major imagery collections online. With this partnership, those collections will be made available through a single, searchable "one-stop-shop" archive of NASA imagery.


NASA selected Internet Archive, a nonprofit organization, as a partner for digitizing and distributing agency imagery through a competitive process. The two organizations are teaming through a non-exclusive Space Act agreement to help NASA consolidate and digitize its imagery archives at no cost to the agency.


Under the terms of this five-year agreement, Internet Archive will digitize, host and manage still, moving and computer-generated imagery produced by NASA.


In addition, Internet Archive will work with NASA to create a system through which new imagery will be captured, catalogued and included in the online archive automatically. To open this wealth of knowledge to people worldwide, Internet Archive will provide free public access to the online imagery, including downloads and search tools....

From an AP article on Wired News:

Kahle said the archive won't be able to digitize everything NASA has ever produced but will try to capture the images of broadest interest to historians, scholars, students, filmmakers and space enthusiasts.

Kahle said the images already in digital form represent the minority of NASA's collections, and they are scattered among some 3,000 Web sites operated by the space agency. He said those sites would continue to exist; the archive would keep copies on its own servers to provide a single, free site to augment the NASA sites.


The Internet Archive is bearing all of the costs, and Kahle said fundraising has just started. The five-year agreement is non-exclusive, meaning NASA is free to make similar deals with others to further digitize its collections.

What's particularly exciting is that this is both an aggregation and a digitization project -- widely scattered materials will be brought together for easier discovery and get enriched metadata, and important materials will be selected and digitized to add to the corpus.