Wednesday, December 05, 2007

developing a service vision for a repository

Dorothea rightly challenged me for not including a service vision in my post on repository goals and vision. I do have something like a vision, but I wouldn't say that it's quite where it needs to be yet. Still, I said I would post it, so here it is.

What are the services needed around a repository?

  • Identification and acquisition of valuable content
    • You can't wait for content to come to you – research what’s going on in the departments and at the University, and initiate a dialog.
    • Digital collections must also come from the Library and other University units – University Archives, Museums, etc.
  • Consulting Services
    • Advise on intellectual property and contract/licensing issues for scholarly output.
    • Assistance in preparing files for deposit, creating or converting metadata, and in the actual deposit process.
  • Access
    • Easy-to-use discovery interface with full-text searching and browse.
    • Instruction for community on how to find and use and cite content.
    • Make the content shareable via Open Archives Initiative (OAI).
  • Promotion and Marketing
    • Build awareness of the high cost of scholarly journals, and that we are buying back our own institutional scholarship.
    • Promote the value of building sustainable digital collections – preservation is more than just backing up files.
    • Promote the goals of the Open Access movement, including managed, free online access and a focus on improved visibility and impact.
    • Show faculty that they can build personal and community archives.
    • Market repository building services that will enable the institution to build a body of digital content.
    • Market the repository as a content resource and a venue that increases the visibility of the institution.
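To make that OAI bullet concrete: an OAI-PMH harvester is nothing more than an HTTP client building URLs from a handful of parameters. A quick sketch in Python (the repository base URL here is made up):

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH base URL for a repository (an assumption, not a real endpoint)
BASE_URL = "https://repository.example.edu/oai"

def list_records_url(metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for harvesting repository metadata."""
    if resumption_token:
        # A resumption token carries all other arguments for paged harvesting
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return BASE_URL + "?" + urlencode(params)

print(list_records_url(set_spec="etds"))
```

Any OAI-aware service can fetch such a URL and get back Dublin Core records, which is exactly what makes the content shareable.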

Tuesday, December 04, 2007

Report on the Future of Bibliographic Control available for comment

The Library of Congress has released a draft of the Report on the Future of Bibliographic Control for comment. Comments should be received by December 15, 2007, so pull together all your autocat, ngc4lib, and web4lib postings on the issues and get your comments in.

Alliance for Permanent Access

I haven't been able to track down much on the new Alliance for Permanent Access. There's this press release. Did anyone attend the Second International Conference on Permanent Access to the Records of Science held in Brussels on November 15, 2007 where the Alliance was launched?

MARCThing

There's nothing that I can say about the MARCThing Z39.50 implementation for LibraryThing that isn't in the Thingology post. But this statement caught my eye:

I do have a recommendation for anybody involved in implementing a standard or protocol in the library world. Go down to your local bookstore and grab 3 random people browsing the programming books. If you can't explain the basic idea in 10 minutes, or they can't sit down and write some basic code to use it in an hour or two, you've failed. It doesn't matter how perfect it is on paper -- it's not going to get used by anybody outside the library world, and even in the library world, it will only be implemented poorly.
Amen.
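For what it's worth, SRU -- the HTTP-based cousin of Z39.50 -- comes much closer to passing that bookstore test: a search is just a GET request with a CQL query. A quick sketch (the server URL is made up):

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint (an assumption for illustration)
SRU_BASE = "https://catalog.example.edu/sru"

def sru_search_url(query, start=1, maximum=10):
    """Build an SRU searchRetrieve URL; the query is CQL, e.g. 'dc.title = fedora'."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return SRU_BASE + "?" + urlencode(params)

print(sru_search_url('dc.title = "open access"'))
```

Three random people from the programming aisle could write that in well under an hour, which is rather the point.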

goals and vision for a repository

Last week I had the opportunity to have a lengthy conversation with some folks about our Repository. In doing so I was able to get at some really simplified statements about our activities.

Why a Repository?

  • A growing body of the scholarly communications and research produced in our institutions exists solely in digital form.
  • Valuable assets -- secondary or gray scholarship such as proceedings, white papers, presentations, working papers, and datasets -- are being lost or not reproduced.
  • Numerous online digital collections and databases produced through research activity are not formally managed and are at risk.
  • An institutional repository is needed as a trusted system to permanently archive, steward, and manage access to the intellectual work – both research and teaching – of a university.
  • Open Access, Open Access, Open Access and Preservation, Preservation, Preservation.
What's the vision for a Repository?
  • A new scholarly publishing paradigm: an outlet for the open distribution of scholarly output as part of the open access movement.
  • A trusted digital repository for collections.
  • A cumulative and perpetual archive for an institution.
What does success look like?
  • Improved open access and visibility of digital scholarship and collections.
  • Participation from a variety of units, departments, and disciplines at the institution.
  • Usable process and standards for adding content.
  • Content is actively added.
  • Content is used: searched and cited and downloaded.
  • There is a wide variety of content types.
  • Simple counts are NOT a metric.
I really appreciate having the chance to formulate ideas like these that have nothing to do with the technology but everything to do with why we're doing what we do. I want to work this up into something more formal to share broadly.

Tuesday, November 27, 2007

boxes, the outcome of online shopping

In our household we shop online a lot. Boxes often arrive at home that one or the other of us didn't expect.

What happens when thousands of students living at a University are shopping online every day for whatever they might need, from shoes to car tires? In a very entertaining article, The New York Times reports on what's happening at mail rooms at Universities.

ars technica on Google Book Project

ars technica has a very fair blog posting outlining the ongoing online discussion between Paul Courant at Michigan and Siva Vaidhyanathan at UVA about the Google Book project. It nicely presents both sides of the discussion.

Saturday, November 24, 2007

unplanned hiatus

Between a 3-day planning retreat, updating technology and the threat of data loss, a cat that went missing seven days ago and has not returned, and the Thanksgiving holiday, blogging hasn't been top of the list. I'm back now.

the not-so-nameless fear of personal data loss

Ten days ago I upgraded my smartphone.

For over three years I hung onto my Treo 600. Yes, a 600. Even when its antenna broke while I was at JCDL 2006, I drove to Durham and got a replacement unit rather than upgrade. I just didn't see a reason, as none of the new Treos were better. So what if my contract expired a year and a half ago and I could have upgraded at any time?

Then I saw the Centro, and something about it changed my mind. Maybe it was the much smaller size, or the brighter, clearer screen. Maybe the much more usable buttons and keyboard (I know, they're smaller keys, but their gel surface is great for my small hands and fingernails). Maybe that it had Bluetooth. Or that it was available in slightly sparkly black. I came to covet it online, and then I gave in and bought it.

Then I had to transfer my data.

Now, I've had a Palm device for eleven years. I have my calendar going back to early 1996 because I've been synching it with an enterprise calendar system at three universities during those eleven years. I have databases like all my books (downloaded from LibraryThing) and DVDs in FileMaker Mobile. I have the electronic Zagat. I have ebooks. I have memos and to-do lists and a large address book.

None of them moved over when I updated the Palm desktop and initiated my first sync. Not a darned thing. Not only did they not move to the Centro, they disappeared from my Palm Desktop.

Hysteria ensued.

Once I was talked down off the proverbial ledge, things improved. I was able to beam over all my memos, my to-do list, and my address book from the Treo (although they all lost their categories). I had a calendar backup on my laptop that covered 1996 to late 2005 (apparently the last time I backed up my full calendar. Memo to self -- never forget to do that again). I discovered that my Zagat had expired the week before, which is why it didn't move. Fixed. I got my ebooks reinstalled (after discovering that I somehow had 2 Palm users set up and that I was using the wrong one in the installer). I deleted the extraneous Palm user, causing a bit more temporary hysteria because I re-lost and had to retrieve some data again.

My Oracle Calendar Sync completely failed. I couldn't even see the conduit. I thought, hey, there's a newer version than the one I have, upgrading will reinitialize the conduit. A fine plan if the installer hadn't failed, corrupting the old install so that I not only couldn't install a new one, I couldn't remove the old one because a file was missing. Our IT help desk didn't have the old installer available anymore (Memo to self -- keep all installers AND keep up on new versions) but a nameless university elsewhere in the world had it on their web site and all was repaired. The first sync took over 130 minutes, but that was a small price to pay to have my past and future calendar back.

All of the above took 2 1/2 days. What never came back, though, was FileMaker Mobile. UVA never moved past FM 6, so that's what I've continued to use as a client with FM Mobile 2. Given that FM is on to version 9 and FM Mobile 8, and the company has announced that it's discontinuing FM Mobile, I do seem to be in a jam there. Yes, I have that installer, and I have my databases, and I can see the conduit, but no joy in synching.

Change is good. I just need to decide what to change to, and then I will have all my data back. Until the next time I have to upgrade ...

Thursday, November 08, 2007

highlights from the Fall 2007 DLF Forum

There were a number of highlights for me from the fall 2007 DLF forum.

Dan Gillmor gave a great opening plenary on journalism in a world of ubiquitous media access. He talked quite a bit about participatory media and collaboration, and how the average person with a cell phone can participate in "random acts of journalism," such as recording the shooting at Virginia Tech or the 2004 tsunami in Sri Lanka. It is increasingly likely that on-the-spot photojournalism will come from anyone, not just journalists. He also talked about what he termed "advocacy journalism," where a community forms, whether around a location or a topic, and reports more quickly and more deeply on issues of interest to its members. What's the role of professional journalism? "Do what you do best and point to the rest." Follow well-established journalistic practices in reporting, and point to the communities and compendia that are doing a good job rather than trying to also do what they do. In a world with so much access, there is more transparency; conversely, it is also much more difficult to keep secrets. But what do you trust? What's accurate? Trust cannot be based solely on popularity, but on reputation, which is exceptionally difficult to qualify and quantify.

Rick Prelinger gave a talk on moving image archiving and digitization. I loved this phrase: "Wonderful and unpredictable things happen when ordinary people get access to original materials." In a world where there is now more individual production than institutional production, we should be crawling and preserving what's out there on YouTube and elsewhere, starting early on what will be the hardest to preserve. He also pointed out that YouTube raised popular expectations about video findability while simultaneously lowering quality expectations and making the segmenting of content out of its raw or original context the norm. Rick also referenced the SAA Greene-Meissner report, which urged archivists to consider new ways to deal with hidden collections, in making his point that workflows should not be sacred. Our social contract with users is to provide access. Digitization provides visibility and access, which can drive preservation.

Ricky Erway led an interesting overview of the agreements that various institutions have entered into with their third-party digitizing partners. The one that I knew the least about was the NARA arrangement with Footnote.com, where the materials will be available only on Footnote by subscription for five years, after which NARA can make them available, although NARA admits that it isn't clear exactly what it can and cannot do. For some reason James Hastings made sure to make the point that Footnote is not an LDS Church unit, although the parties involved definitely have ties to the church and are strongly interested in the materials for use in genealogy. There is absolutely nothing wrong with that.

There was a session on mass digitization "under the hood." I was particularly floored with the work at the National Archive of Recorded Sound and Images in Sweden. Their automated (in some cases robotic) processes for digitization are truly astonishing, as is the scale of their digitization. If I am reading my notes right, they create between 1.5 and 2.5 TB every day.

Herbert Van de Sompel gave a very effective presentation on ORE. I see a lot more folks getting what he and Carl Lagoze have been saying about compound objects. I love their elegant use of ATOM.
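To give a rough sense of why the ATOM serialization is so elegant: a compound object's resource map can be read as an ordinary Atom entry whose links enumerate the aggregated parts. A toy parsing sketch (the feed below is invented for illustration, not real ORE output):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Invented example: an Atom entry standing in for a resource map,
# whose links point to the parts of a compound object.
FEED = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example compound object</title>
  <link rel="alternate" href="http://example.org/item/1/page1.tif"/>
  <link rel="alternate" href="http://example.org/item/1/page2.tif"/>
  <link rel="alternate" href="http://example.org/item/1/ocr.xml"/>
</entry>"""

def aggregated_resources(atom_xml):
    """Return the hrefs of all links in an Atom entry."""
    entry = ET.fromstring(atom_xml)
    return [link.get("href") for link in entry.findall(ATOM_NS + "link")]

print(aggregated_resources(FEED))
```

Any off-the-shelf feed reader or XML parser can walk the object, which is exactly the point about lowering the barrier to entry.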

Denise Troll Covey gave a report on the preliminary results of an in-progress study at Carnegie Mellon on faculty self-archiving. I look forward to reading the final results and being able to share them, especially given the misinformation faculty believe about their rights and how little they actually self-archive.

Steve Toub and Heather Christenson gave a great talk on a survey of book discovery interfaces. Microsoft Live Search Books seemed to fare the worst, while LibraryThing seemed to be at the top. They promise to make available a ppt with many more slides than they presented.

Tito Sierra, Markus Wust, and Emily Lynema from NC State presented their "CatalogWS," a RESTful Web API, which they take advantage of for their very cool MobiLib mobile catalog app, as well as for a staff book-cover visualization tool for large screen displays and an advanced faceted search interface for generating custom catalog item lists for blogs and webpages. They also gave a nice shout-out to Blacklight, which we appreciated.
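The REST idea is easy to sketch: every search and every record gets its own stable URL that any application can build and fetch. Something in the spirit of CatalogWS might look like this (the URL patterns are my invention, not NC State's actual interface):

```python
from urllib.parse import quote

# Hypothetical RESTful catalog API in the spirit of CatalogWS
# (the URL pattern is an assumption, not NC State's actual interface)
API_BASE = "https://catalog.example.edu/api"

def search_resource(keyword, fmt="atom"):
    """Each search is an addressable resource: one URL per query."""
    return f"{API_BASE}/search/{quote(keyword)}?format={fmt}"

def record_resource(record_id):
    """Each catalog record likewise gets its own stable URL."""
    return f"{API_BASE}/record/{quote(str(record_id))}"

print(search_resource("digital libraries"))
print(record_resource("b1234567"))
```

Once the catalog is addressable this way, a mobile app, a wall display, and a blog widget are all just different consumers of the same URLs.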

Mike Furlough and Patrick Alexander from Penn State led a good discussion on publishing and libraries. Activities represented in the room ranged from journal hosting to publishing of library collections online to collaborating on born-digital scholarship to working with university presses on electronic editions of works.

Read Peter Brantley's post on Mimi Calter's talk on the examination of the Copyright Registration Database that Peter and Carl Malamud worked so hard to set free. Make sure to also read the comments.

I was happy with the responses that Bess Sadler and I received from our presentation on Project Blacklight. Bess went way above the call of duty and completed some UI update tasks (translating language codes and setcodes into human-readable terms and adding browse centuries) and figured out how to combine the Virgo and Repo indexes in Lucene while sitting in our hotel room. We were able to show REALLY up-to-date screen shots in our presentation, including one we added while setting up the laptop at the podium.

Wednesday, October 31, 2007

LibX and OpenURL Referrer browser extensions

We launched our UVA Library LibX plugin for Firefox in June 2007, and it's gotten some rave reviews from staff. Now that UVA has approved the rollout of Vista and IE 7 on its computers, we're testing the beta IE version of LibX. I understand we've supplied some feedback on installation and running on Vista.

When I saw the recent announcement of the availability of OCLC's OpenURL Referrer for IE, I paused a bit when considering who to send the announcement to. LibX is the tool we promote with our users: it recognizes DOIs, ISBNs, ISSNs, and PubMed IDs; supports COinS and OCLC xISBN; and works with our resolver and our catalog, Virgo. We have our resolver working with Google Scholar.

In the end, I didn't forward the announcement because we're trying to promote the use of LibX and I didn't want to dilute that message for our staff and users. The OpenURL Referrer is a very cool tool and a great use of the OCLC Resolver Registry, since users don't have to know anything except the name of their institution to set it up. I'm just not sure we need both, at least not right now.

I need to ask if we know how much use our LibX toolbar is getting.
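For anyone who hasn't looked under the hood of these tools: an OpenURL is just a set of key/value pairs appended to a resolver's base URL. A minimal sketch of the KEV (key-encoded-value) form for a journal article (the resolver address is made up):

```python
from urllib.parse import urlencode

# Hypothetical link resolver base URL (an assumption)
RESOLVER = "https://resolver.example.edu/openurl"

def article_openurl(doi, atitle, jtitle):
    """Build an OpenURL 1.0 (KEV) link for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",                       # OpenURL 1.0 version tag
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # journal metadata format
        "rft_id": "info:doi/" + doi,                    # identifier for the referent
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
    }
    return RESOLVER + "?" + urlencode(params)

print(article_openurl("10.1000/xyz123", "An Example Article", "Journal of Examples"))
```

Tools like LibX and the OpenURL Referrer simply recognize identifiers on a page and generate links like this against your institution's resolver.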

JPEG2000 at LC

It is indeed welcome news (Digital Koans and the Jester) that the Library of Congress and Xerox are teaming up on j2k implementation, and even more welcome that it's in context of NDIIPP and preservation.

This is a key part of the announcement for me:

Xerox scientists will develop the parameters for converting TIFF files to JPEG 2000 and will build and test the system, then turn over the specifications and best practices to the Library of Congress. The specific outcome will be development of JPEG 2000 profiles, which describe how to use JPEG 2000 most effectively to represent photographic content as well as content digitized from maps. The Library plans to make the results available on a public Web site.
We all know that we could have jp2 files that are smaller than our TIFF masters. We all know that we could move from TIFF to jp2 as an archival format. We've all researched codecs and delivery applications so we can cut down on the number of deliverable files we generate. That's real progress. What we haven't done is figure out how best to migrate our legacy files and move forward. I look forward to seeing the outcomes of this project.
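When we do get to migrating legacy masters, the batch job itself is simple to sketch. Something like the following, assuming the open-source OpenJPEG opj_compress tool (its flags vary by version, so treat this as illustrative):

```python
from pathlib import Path

def jp2_conversion_commands(tiff_dir, out_dir, rates=(20, 10, 1)):
    """Build (but do not run) one conversion command per TIFF master.

    Assumes the OpenJPEG `opj_compress` tool; the -r flag requests
    compression ratios for successive quality layers. Flags differ
    between versions, so this is a sketch, not a recipe.
    """
    commands = []
    for tiff in sorted(Path(tiff_dir).glob("*.tif")):
        jp2 = Path(out_dir) / (tiff.stem + ".jp2")
        commands.append([
            "opj_compress",
            "-i", str(tiff),
            "-o", str(jp2),
            "-r", ",".join(str(r) for r in rates),
        ])
    return commands
```

The hard part, of course, is exactly what the LC/Xerox project aims to settle: which parameters (quality layers, tile sizes, progression orders) the profile should fix before anyone runs a loop like this over millions of files.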

Thursday, October 25, 2007

OCLC report on privacy and trust

I grabbed the report the other day but only had the chance to barely skim it. It's obviously full of valuable data, and the visualization graphics are great.

I liked these stats, because I have long tried to make the argument that online interactions are becoming ubiquitous:

Browsing/purchasing activities: Activities considered as emerging several years ago, such as online banking, have been used by more than half of the total general public respondents. Over 40% of respondents have read someone’s blog, while the majority have browsed for information and used e-commerce sites in the last year, a substantial increase in activity as seen in 2005. While commercial and searching activities have surged in the past two years, the use of the library Web site has declined from our 2005 study.

Interacting activities: The majority of the respondents have sent or received an e-mail and over half have sent or received an instant message. Twenty percent (20%) or more of respondents have participated in social networking and used chat rooms.

Creating activities: Twenty percent (20%) or more of respondents have used a social media site and have created and/or contributed to others’ Web pages; 17% have blogged or written an online diary/journal. (section 1-6)
It's nice to see library web sites firmly in the middle when grouped with commercial web sites used by all age groups. (section 2-8)

Data on how much private information is shared (section 2-31) is not too surprising but interesting to see quantified. People's faith in the security of personal information on the Internet (section 3-2, 3-4) is higher than mine. That younger respondents have different ideas about relative privacy of categories of data (section 3-9, 3-10, 3-35) is not surprising, but I wonder why more people aren't concerned about privacy. It's good to see trust data in addition to privacy data. (section 3-24)

Section 4 focuses on Library Directors as a category of respondents. It seems that overall they read more, have been online longer, and interact more online than the general public. They also overestimate how important privacy is to their users.

Section 5 is on libraries and social activities.

Karen Schneider had more time, and her response is worth a read.

Section 7 is the summary if you can't face all 280 pages.

Open Geospatial Consortium

It's ironic in a way, but I was unfamiliar with the Open Geospatial Consortium until seeing mention that Microsoft was joining as a principal member. Their standards seem worth looking at, but I have no sense of how much their reference model or specifications are used, or what their compliance certification means for content providers or users beyond the promise of interoperability. They have a long list of registered products, not many of which seem to be noted as compliant yet.

Open Library ILL

While I was reacting to one aspect of the New York Times article on the OCA, others were reacting to the very last bit of the article on the proposed ILL-like service from the Open Library initiative. Aaron Swartz was interviewed at the Berkman Center. Günter Waibel briefly described the effort. Peter Brantley had something to say.

But Peter Hirtle really had something to say.

For me, this is the key portion of Peter's post:

Unfortunately just because a book is out of print does not mean that it is not protected by copyright. Right now a library may use Section 108(e) of the Copyright Act to make a copy of an entire book for delivery to a patron either directly or through ILL, and that copy can be digital. But the library has to clear some hurdles first. One of them is that the library has to determine that a copy (either new or used) cannot be obtained at a fair price. The rationale for this requirement was that activity in the OP market for a particular title might encourage a publisher to reprint that work. In addition, the copy has to become the property of the user - the library cannot add a digital copy to its own collection.

Is this inefficient? Yes. Would it be better if libraries could pre-emptively digitize copyrighted works, store them safely, conduct the market test when an ILL request arrives, and then deliver the book to the user if no copy can be found? Yes. But this is not what the law currently allows.

As much as I want to encourage digitization and freeing of post-1923 works that are indeed out of copyright or orphaned, I know that Peter Hirtle has a strong position here. This principle is one that we've been looking at quite closely in a related realm -- developing a proposed "drop-off" digitization service for faculty. Not surprisingly, faculty sometimes hope we can digitize, say, slides that they have purchased commercially. We must determine who the source is and if we can obtain digital versions (whether at a fair price or at all). If we then decide that we can digitize them for the faculty member (which will not always be the case) we definitely cannot add the files to our collections or hold on to them in any way.

Peter ends his post with a goal for us all -- that we should all work toward "convincing Congress that the public good would benefit from a more generous ILL policy for out of print books" in order to increase access to our collections.

ArchaeoInformatics Virtual Lecture Series

The Archaeoinformatics Consortium has announced their 2007-2008 Virtual Lecture Series schedule, where archaeologists describe their successful cyberinfrastructure efforts and innovators from other disciplines present information on their cyberinfrastructure initiatives and strategies that may be useful to archaeology.

These lectures are presented every other week using the Access GRID video conferencing system. It is also possible to participate in the lectures by downloading the presentation slides and participating via a telephone bridge. Information on how to connect to the Access GRID system and alternatives are provided at http://archaeoinformatics.org/lecture_series.html. The lectures from the 2006-2007 series and this year’s lectures are also available as streaming video from the archaeoinformatics web site.

Monday, October 22, 2007

New York Times article on OCA

There was an article in the New York Times about libraries choosing to work with OCA instead of Google.

These are two quotes that I kept going back to -- "It costs the Open Content Alliance as much as $30 to scan each book" and "Libraries that sign with the Open Content Alliance are obligated to pay the cost of scanning the books. Several have received grants from organizations like the Sloan Foundation. The Boston Library Consortium’s project is self-funded, with $845,000 for the next two years. The consortium pays 10 cents a page to the Internet Archive ..."

A number of years ago we estimated that it would cost us many hundreds of dollars to digitize a book, but that involved a lot of manual work, from shooting the images to keyboarding of text (rather than OCR) to QA. We haven't revisited that estimate in a while -- I'm sure it's lower now, but not that low. Of the OCA participants, some get foundation support to shoulder the costs, and some fund it entirely themselves. Compared to doing it all yourself, that's a bargain. I'm a fan of the OCA effort.
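The quoted numbers hang together, as a little arithmetic shows (the article's figures taken as given; the implied book length is my inference):

```python
# Figures as quoted in the NYT article
cost_per_book = 30.00   # dollars, OCA's per-book scanning cost
cost_per_page = 0.10    # dollars per page, paid to the Internet Archive
budget = 845_000        # Boston Library Consortium, self-funded, two years

pages_per_book = cost_per_book / cost_per_page   # implies roughly 300-page books
books_in_budget = budget / cost_per_book         # rough upper bound on volumes

print(pages_per_book, int(books_in_budget))
```

So $30 a book at 10 cents a page implies an average book of about 300 pages, and the consortium's budget would cover on the order of 28,000 volumes, which is far beyond what hundreds of dollars per book would buy.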

But then there's this quote -- “taking Google money for now while realizing this is, at best, a short-term bridge to a truly open universal library of the future.”

We don't actually take any Google money. Yes, Google provides a service for us, but they don't pay us for our participation. We underwrite certain costs for our participation. Yes, there are restrictions. You can read our agreement. One can and should question issues of control over data by any company or institution, but there is value in our pre-1923 volumes being made publicly available through Google Books. The institutions that chose to participate in one project versus the other (or both) should be neither lionized nor assailed.

Friday, October 19, 2007

RUBRIC Toolkit released

The RUBRIC project (Regional Universities Building Research Infrastructure Collaboratively) is sponsored by the Australian Commonwealth Department of Education, Science and Training. The RUBRIC Toolkit is the documentation of the process used by the project to model institutional repository services and evaluate tools. It includes a number of great checklists that could be used by any institution planning for an IR.

Twine

Every place I turn today there's something about Twine:

Twine is Smart

Twine is unique because it understands the meaning of information and relationships and automatically helps to organize and connect related items. Using the Semantic Web, natural language processing, and artificial intelligence, Twine automatically enriches information and finds patterns that individuals cannot easily see on their own. Twine transforms any information into Semantic Web content, a richer and ultimately more useful and portable form of knowledge. Users of Twine also can locate information using powerful new patent-pending social and semantic search capabilities so that they can find exactly what they need, from people and groups they trust.

Twine “ties it all together”

Twine pools and connects all types of information in one convenient online location, including contacts, email, bookmarks, RSS feeds, documents, photos, videos, news, products, discussions, notes, and anything else. Users can also author information directly in Twine like they do in weblogs and wikis. Twine is designed to become the center of a user’s digital life.

This is an exceptionally attractive concept, especially given its ability to parse content for meaning and identify new relationships and context with other content. You identify your "twine" content through tagging, and Twine applies your assigned semantics to other content, adding to your twine. It's also a social site where a personal network can interact with your shared twine and its content, enriching its semantic layer through that interaction. Twine is meant to learn from its folksonomies.

O'Reilly Radar reports on the demo at the Web2 Summit.
Read/Write Web has two postings here and here.

I've submitted a beta participation request. This could be an interesting tool for distributed research efforts.

Thursday, October 18, 2007

killer digital libraries and archives

Yesterday the Online Education Database released a great list of "250+ Killer Digital Libraries and Archives." It lists sites by state, by type, has a focus on etexts, and is a remarkable compendium of digital resources.

Of course, the first thing I did was look for our digital collections on the list. The UVA Library hosts the wonderful Virginia Heritage resource, which brings together thousands of EAD finding aids for two dozen institutions across the state of Virginia. We have our Digital Collections, with more than 20,000 images, 10,000 texts, and almost 4,000 finding aids.

Nope. Not on the list.

Not surprisingly, our former Etext Center was on the list under etexts (the Center no longer exists as a unit and its texts are gradually being migrated). The Virginia Center for Digital History was there, as it should be with its groundbreaking projects and its great blog. IATH was there with its many innovative born-digital scholarly projects.

I sulked about this for a few minutes while thinking about the likely reason we weren't on the list -- for the past few years we've been talking nonstop about our Repository and Fedora and not about our collections. Now, we wanted and needed to talk about Fedora and our Repository because we were really trying new things, solving interesting problems with our development, and participating in building a community around Fedora. But users don't care about how cool our Repository development is. They care about the collections in the Repository.

We've spent the last few months working at raising awareness about what we have. Our new Library home page now has a number of links to the digital collections. We have pages on how to find what you're looking for in our digital collections. We have feature pages for all of our collections in the Repository. We're making progress in migrating collections and making the Digital Collections site a central location where they're visible. We have an RSS Feed for additions to the collections. We now have a librarian and a unit dedicated to shepherding collection digitization through the process and working more closely with faculty. I hope the next time someone creates a list like this we'll be visible enough to be on it.

Wednesday, October 17, 2007

two interesting IP decisions

Just in time for World Series fever -- the US Court of Appeals for the Eighth Circuit has upheld (PDF) a lower court's ruling that stats are not copyrightable, in a case pitting CBC Distribution (a fantasy sports operation) against Major League Baseball and the MLB Players Association.

MLB argued that its player names and stats were copyrightable and that CBC—or any other fantasy league—couldn't operate a fantasy baseball league without a multimillion-dollar licensing agreement with MLB. CBC countered that the data was in the public domain and as such, it had a First Amendment right to use it. In August 2006 a US District Court sided with CBC. Now that decision has been upheld after MLB appealed.

In other news, the USPTO has rejected many of the broadest claims of the Amazon One-Click patent following the re-examination request by blogger Peter Calveley. To read the outcome of the re-examination request, go to the USPTO PAIR access site, choose the "Control Number" radio button, enter 90/007,946, and press the "Submit" button. Then go to the "Image File Wrapper" tab and select the "Reexam Non-Final Action" document. This doesn't mean that the patent has been thrown out -- five of the twenty-six claims were found to be patentable. The document is a "non-final action" and Amazon retains rights in the process. The patent could be either thrown out or modified. It will be interesting to see.

Tuesday, October 16, 2007

what if google had to optimize its design for google?

Web developers have a lot of hoops to jump through to optimize their sites for discovery through the Google search engine. Getting good Google index placement is paramount, so it is apparently increasingly difficult for designers to offer the uncluttered user experience they'd like while building needed content into their pages. Here is a funny take on what would happen to Google's uncluttered design if Google had to optimize for its own search engine. Not sure how accurate it is ...

Monday, October 15, 2007

Muradora

DRAMA (Digital Repository Authorization Middleware Architecture) at Macquarie University has released Muradora, a repository application that supports federated identity (via Shibboleth authentication) and flexible authorization (using XACML). Fedora forms the core back-end repository, while different front-end applications (such as portlets or standalone web interfaces) can all talk to the same instance of Fedora and yet maintain a consistent approach to access control. A Live DVD image, based on the Ubuntu Linux distribution, can be downloaded from http://www.muradora.org/software to install Muradora on a server via an easy installation procedure.

From the announcement email:

- "Out-of-the-box" or customized deployment options

- Intuitive access control editor allows end-users to specify their own access control criteria without editing any XML.

- Hierarchical enforcement of access control policies. Access control can be set at the collection level, object level or datastream level.

- Metadata input and validation for any well-formed metadata schema using XForms (a W3C standard). New metadata schemas can be supported via XForms scripts (no Muradora code modification required).

- Flexible and extensible architecture based on the well known Java Spring enterprise framework.

- Multiple deployments of Muradora (each customized for their own specific purpose) can talk to the one instance of Fedora.

- Freely available as open source software (Apache 2 license). All dependent software is also open source.

Friday, October 12, 2007

public personae

When I joined Facebook earlier this week I had every intention of keeping my activity to a minimum. No picture. No personal details. Minimal professional details. No groups. Why would I share my personal life online? That's personal, that's private!

Then folks started pointing out to me that I was being silly. My email address and information about my job are all over our Library's web site. Presentations that I've given are online everywhere. Old email listserv postings are available through list archives. I use flickr, and while some images are kept restricted to friends and family, most aren't (but with a Creative Commons license). My LibraryThing catalog is public. I have this blog.

Basically, I was told to get over it. I already have a public persona, even if it's not one that interests more than a few close friends and colleagues and folks interested in arcane digital library topics.

But, but, what about privacy? I know better than to share too much about myself online. I have friends who have been the subject of identity theft. One hears cautionary (but possibly apocryphal) tales every day about middle schoolers getting stalked online through their MySpace pages. How did my life get so public? Gradually, without my even consciously noticing it. There's no going back.

what is publishing?

I'm in the process of making a transition in my organization, shifting into a newly created position as Head of Digital Publishing Services.

The first question that everyone asks is "What will you be doing?" The second is "What is publishing in a Library?"

We partner with faculty who are selecting content, organizing it, describing it, analyzing it, identifying and creating intellectual relationships, and presenting and interpreting that content in new ways as born-digital scholarship. These projects are increasingly being considered in the promotion and tenure process. Is that the Library supporting a publishing activity? Most certainly.

We digitize collections, describe them, organize them, present them online, and promote them to our community for their use in teaching and research. Is that a publishing activity? It can be argued either way (and has been) -- I'm on the side that leans toward yes.

We provide production support and hosting for peer-reviewed electronic journals. No one would argue that we're participating in a publishing activity.

We're evaluating an Institutional Repository. Not a publishing activity per se, but a way to preserve the publishing output of our community. That's a service related to our stewardship role.

As a Library we're already very active participants in publishing activities. We have our Scholars' Lab, Research Computing Lab, and Digital Media Lab serving as the loci for our faculty collaborations. My role will be formalizing what our publishing services are, what our work flows should be, and how we can sustain and expand our consulting services related to scholarly communication.

Wednesday, October 10, 2007

apparently peer pressure works on me

I gave in and joined Facebook. I'm astonished at the high level of activity that some folks seem to be able to maintain. I have a very boring profile: I added a few applications; some people seem to have dozens. I haven't added much personal information. People apparently send lots of messages and post on each other's profiles. I'm going to have to remember to check it, on top of checking other places where I do my sharing and connect with folks (LinkedIn, LibraryThing, flickr). Still, I think it will be good to have another place to try to maintain contact.

Friday, October 05, 2007

digital lives project

Digital Koans posted about the Digital Lives research project, "focusing on personal digital collections and their relationship with research repositories."

For centuries, individuals have used physical artifacts as personal memory devices and reference aids. Over time these have ranged from personal journals and correspondence, to photographs and photographic albums, to whole personal libraries of manuscripts, sound and video recordings, books, serials, clippings and off-prints. These personal collections and archives are often of immense importance to individuals, their descendants, and to research in a broad range of Arts and Humanities subjects including literary criticism, history, and history of science.

...

These personal collections support histories of cultural practice by documenting creative processes, writing, reading, communication, social networks, and the production and dissemination of knowledge. They provide scholars with more nuanced contexts for understanding wider scientific and cultural developments.

As we move from cultural memory based on physical artifacts, to a hybrid digital and physical environment, and then increasingly shift towards new forms of digital memory, many fundamental new issues arise for research institutions such as the British Library that will be the custodians of and provide research access to digital archives and personal collections created by individuals in the 21st century.

I very much look forward to seeing the results of this work, as university archives and institutional repositories increasingly have to cope not only with managing and preserving deposited personal digital materials, but also with describing, organizing, and making such collections usable.

While not the focus of their study, anyone who has ever supported teaching with images knows a tangential area of this problem space intimately. Faculty develop their own collections of teaching images -- their own analog photography, purchased slides, digital photography, images found on the open web, images from colleagues, etc. We have licensed images and surrogates of our own physical collections. They want to use materials from their own collections and our repositories together in their teaching. What is the relationship between their image collections and our repositories and teaching tools? Do we integrate their collections into ours? Do we have a role in digital curation and preservation of their data used in teaching and research, which happen to be images? We struggle with the legal and resource allocation issues every day.

elastic lists

I've been exploring a demo of the visualization model called Elastic Lists. It comes from work done on "Information Aesthetics" published by Andrea Lau and Andrew Vande Moere (pdf) and is implemented using the Flamenco Browser developed at UC Berkeley. The facets have a visualization component comprising color (a brighter background connotes higher relevancy) and proportion (larger facets take up a larger relative proportion of space).

I find this to be a really compelling visualization, easier for the user to navigate than the perhaps too-complex Bungee View. I'd like to see this applied to more heterogeneous data, such as one might encounter in a large digital collection.
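The two encodings are simple to state concretely. Here is a toy sketch (not the actual Elastic Lists or Flamenco implementation, whose internals I haven't seen) of mapping facet counts to relative heights and relevancy scores to background brightness:

```python
def elastic_list_layout(facets, total_height=300):
    """Sketch of the Elastic Lists encodings described above.

    facets: list of (name, count, relevancy) tuples, relevancy in [0, 1].
    Returns {name: (height_px, brightness)} where brightness is 0-255.
    """
    total_count = sum(count for _, count, _ in facets)
    layout = {}
    for name, count, relevancy in facets:
        # Proportion encoding: a facet's height reflects its share of items.
        height = total_height * count / total_count
        # Color encoding: a brighter background connotes higher relevancy.
        brightness = int(128 + 127 * relevancy)
        layout[name] = (height, brightness)
    return layout

facets = [("maps", 60, 0.9), ("photos", 30, 0.4), ("texts", 10, 0.1)]
layout = elastic_list_layout(facets)
# "maps" gets the tallest, brightest cell; "texts" the smallest, dimmest.
```

The interesting design question is exactly this coupling: the user reads both frequency and relevancy at a glance, without separate legends.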

Thursday, October 04, 2007

Cal has its own YouTube channel for course lectures

The University of California at Berkeley has its own YouTube channel. This complements their use of iTunesU.

Ars Technica covers the launch.

It's interesting to see everything from course lectures to colloquia presentations to football highlights. This is very smart co-branding with YouTube as a web destination, and a savvy use of a commercial hosting service.

a new museum starts virtual

I am thrilled to see that the Smithsonian's newest museum -- the National Museum of African American History and Culture -- is open for business. Their opening exhibition is “Let Your Motto Be Resistance,” featuring portraits and photographs of people who stood against oppression, from Frederick Douglass to Ella Fitzgerald to Malcolm X. They have lesson plans. They have a Memory Book for visitor-supplied content that makes great use of the visualization interface for browsing that's also employed on the rest of the site.

It's a really nice experience, especially for a museum for which groundbreaking is not scheduled until 2012.

From the site:

With the help of a $1 million grant of technology and expertise from IBM, the NMAAHC Museum on the Web represents a unique partnership to use innovative IBM expertise and services to bring the stories of African American History to a global audience. Conceived from the very beginning as a fully virtual precursor to the museum to be built on the Washington Mall, this is the first time a major museum is opening its doors on the Web prior to its physical existence.

This is a very exciting collaboration, and a great way to build a community around an institution and its mission years before anyone will walk through the door.

Tuesday, October 02, 2007

self-determined music pricing

Yesterday it was all over the blogosphere and on NPR that Radiohead is taking control of its own distribution and releasing its new album with self-determined pricing. This was heralded as huge, earth-shattering, and a "watershed moment" according to CNET.

It's as if no one has done this before. But of course, at least one person has, and very successfully.

In the early 1980s I discovered Canadian singer Jane Siberry, who now goes by the name Issa. I own her albums. I've seen her perform twice so far (a third is in the offing -- she's coming to Charlottesville this month).

And, in 2005, she transformed her personal label's inventory from physical to digital, put it online, and allowed self-determined pricing. Some of her earlier material is not available due to licensing restrictions -- she's not encouraging illegal downloading -- but she has successfully licensed some albums and songs she did for Warner and made them available through this route. Her Wikipedia article quotes an interview in The Globe and Mail where she says that since she instituted the self-determined pricing policy, the average income she receives per song is in fact slightly more than standard price. The "Press" section of her Jane Siberry site has an interesting Chicago Tribune article from that year.

But, since almost no one has heard of Jane Siberry and everyone has heard of Radiohead, it's as if no one has ever done this before. There was a Thingology post that commented that Radiohead probably borrowed the idea from someone or thought of it on their own. Here's an example of someone who did it first. I'm not knocking Radiohead. I like their music. I'm thrilled that such a high profile group is doing this. But this is not their watershed moment alone.

Monday, October 01, 2007

freeing copyright data

Messages went out on the Digital Library Federation (DLF) mailing list and on O'Reilly Radar yesterday letting us know that DLF had helped set free the entire dataset from the U.S. Copyright Office's copyright registration database. LC pointed out that Congress set the rules for charging, but they should have known that their referring to the issue as a "blogospheric brouhaha" was just going to drive someone to do something.

DLF and Public.Resource.Org sent a request to open access to the data. Lots of sites picked this up, including BoingBoing. The registration database is a compilation of facts -- it's not copyrightable itself. The U.S. Register of Copyright agreed that these are public records and should be available in bulk.

So Public.Resource.Org and DLF made it so.

It's extremely exciting to see DLF lobbying so effectively and participating in an effort to make data available via open access, and help all libraries provide better copyright search services. BoingBoing celebrates them as guerrilla librarians. This is not your father's DLF.

Friday, September 28, 2007

cool security tool for kids on the internet

I just saw a commercial for the Fisher Price Easy Link Internet Launch Pad, targeted at children three and up. The Easy Link -- a specialized USB peripheral with its own software -- allows children to explore sites dedicated to characters when they plug a figure of that character, like Elmo, into the Launch Pad. The kids are offered links to read and games to play, and nothing more -- there is no access to the Internet or to the hard drive and any of its applications without a password. It's $30, which is a reasonable price to introduce kids to working with a computer while limiting their access to anything they could damage or that could harm them.

And yes, kids that young do use computers. I remember watching my cousin's daughter playing computer games when she was four. But of course, both her parents are software engineers.

follow-up to virtual strike

Read the report and see the screen shots from the virtual strike in Second Life.

award for digital preservation tool

A press release was circulated via email announcing that DROID, a tool from The National Archives in London, had won the 2007 Digital Preservation Award.

From the press release:

An innovative tool to analyse and identify computer file formats has won the 2007 Digital Preservation Award.

DROID, developed by The National Archives in London, can examine any mystery file and identify its format. The tool works by gathering clues from the internal 'signatures' hidden inside every computer file, as well as more familiar elements such as the filename extension (.jpg, for example), to generate a highly accurate 'guess' about the software that will be needed to read the file.

Identifying file formats is a thorny issue for archivists. Organisations such as the National Archives have an ever-increasing volume of electronic records in their custody, many of which will be crucial for future historians to understand 21st-century Britain. But with rapidly changing technology and an unpredictable hardware base, preserving files is only half of the challenge. There is no guarantee that today's files will be readable or even recognisable using the software of the future.

Now, by using DROID and its big brother, the unique file format database known as PRONOM, experts at the National Archives are well on their way to cracking the problem. Once DROID has labelled a mystery file, PRONOM's extensive catalogue of software tools can advise curators on how best to preserve the file in a readable format. The database includes crucial information on software and hardware lifecycles, helping to avoid the obsolescence problem. And it will alert users if the program needed to read a file is no longer supported by manufacturers.

PRONOM's system of identifiers has been adopted by the UK government and is the only nationally-recognised standard in its field.

The judges chose The National Archives from a strong shortlist of five contenders, whittled down from the original list of thirteen. The prestigious award was presented in a special ceremony at The British Museum on 27 September 2007 as part of the 2007 Conservation Awards, sponsored by Sir Paul McCartney.

Ronald Milne, Chair of the Board of Directors of the Digital Preservation Coalition, which sponsors the award, said: "The National Archives fully deserves the recognition that accompanies this award."
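The signature-matching approach the press release describes -- internal "signatures" first, with the filename extension as a fallback clue -- can be sketched in a few lines. This is purely illustrative: DROID itself draws on the far richer PRONOM signature database, and the magic-byte table below is just a handful of well-known prefixes.

```python
# Illustrative "magic byte" signatures: (prefix, format name).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff",      "JPEG image"),
    (b"%PDF-",             "PDF document"),
    (b"PK\x03\x04",        "ZIP container"),
]

def identify(data, filename=""):
    """Identify a file's format from its leading bytes, DROID-style."""
    for magic, fmt in SIGNATURES:
        if data.startswith(magic):
            return fmt
    # No internal signature matched; fall back to the extension "clue".
    if "." in filename:
        return "unknown (extension ." + filename.rsplit(".", 1)[1] + ")"
    return "unknown"

print(identify(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # PNG image
```

The payoff of doing this against a maintained database rather than a hard-coded table is exactly what PRONOM provides: once a format is labelled, you can look up its software dependencies and obsolescence risk.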

Thursday, September 27, 2007

xena

I spent a little time this afternoon reading up on the newly released digital preservation tool Xena.

You can point it at a directory of diverse file types and it will convert the files into normalized open formats. The list of supported formats and the conversion outcomes is available in the help docs.

This is potentially a really useful workflow tool but there's a lot to examine here. I don't know how scriptable it is. You can write plugins to add in new formats -- I'm not yet sure if you can change conversion decisions and alter the target formats. Why is the target format for pretty much every image format PNG? Could we change that to TIFF or JPEG2000 if we were willing to write the plugin? It runs on Windows and Linux and requires OpenOffice. On Linux, does it require a graphical environment, or can you run it from the command line?

I'm thinking that this could be really useful for an IR, but I'm not yet sure if it will scale for Library-wide preservation or collection repositories.

Monday, September 24, 2007

archives on the web

Technophilia lists Where the Web Archives Are. Here's what they say:

Some of the most intriguing resources on the web are located in archives—compilations of data that in the past, could only be found by making appointments in dusty libraries. Today, I'm going to take you on a quick tour through some of the most fascinating archives on the web.

So where are they? If I am reading the list correctly, they're pretty much not at any academic libraries.

In the "Government" section, there are the National Archives and the Library of Congress. There is the Internet Archive, which is indeed a library. There's the Rockefeller Archive. There's NASA. There's David Rumsey, possibly the best private map archive in the world. There is the British Library.

Otherwise, it's Calvin and Hobbes, Smithsonian Magazine, the Smoking Gun, and The Balcony Archives of movie reviews.

I don't want to knock their list -- it's an interesting list full of great collections of very worthwhile content. But where are the myriad other library special collections and archives? Is it that we aren't visible enough? Or perhaps not cool enough compared to PBS's Nova? Where are our extensive online archives on runaway slaves or civil rights or early American literature? Or political cartoons or penny dreadfuls or sheet music? Or puzzles or jazz or the Civil War?

I think we have to remember that our target audience is not just our very local community, but the global community, including non-academics. We all need to think a bit more about how to get the word out about what we've made freely available. Being available in a Google search isn't proactive enough. We need to work to get noticed.

Friday, September 21, 2007

not the usual google law suit

As seen at Tech Crunch, a Pennsylvania resident is suing Google for crimes against humanity and is asking the court for $5 billion in damages because his social security number, when turned upside down and scrambled, spells Google. His handwritten filings are on the Justia site.

Tuesday, September 18, 2007

virtual strike

The first virtual strike is taking place soon. Apparently there are labor actions planned by the union representing Italian employees of IBM over pay negotiations -- as one of their strategies they plan to picket the company's campus in Second Life. They're even providing orientation for IBM employees who are new users. I wonder what the corporate reaction will be? The press coverage this action is getting is pretty extensive.

new york times open access

The story of the day seems to be about the NY Times opening up its archives. So far I've seen postings at boing boing, if:book, open access news, o'reilly radar, and teleread.

So why am I bothering to blog this? Because this made me think about something I blogged about some months ago -- Google News Archive Search. One of the things that galled me at the time was how much of what they indexed was behind a paywall. Now, the NY Times is opening almost all their content up (save for 1923-1986), making this a more useful service, at least for resources from one newspaper. If only there weren't so much other for-fee public domain newspaper content controlled through ProQuest Archiver. I still hope for an OpenURL Resolver service so authorized users can get to authorized resources at ProQuest Historical Newspapers instead.

Saturday, September 15, 2007

career meme

Jerry blogged about the results he received from a test at Career Cruising. Since I was sitting at home on a Saturday afternoon, it seemed the thing to do. I dutifully answered the questions and the follow-up questions, and I just about fell off the sofa when I got the results:

1. Anthropologist
2. Video Game Developer
3. Multimedia Developer
4. Scientist
5. Picture Framer
6. Political Aide
7. Computer Animator
8. Interior Designer
9. Business Systems Analyst
10. Website Designer
11. Market Research Analyst
12. Librarian
13. Medical Illustrator
14. Artist
15. Real Estate Appraiser
16. Computer Programmer
17. Set Designer
18. Cartographer
19. Animator
20. Costume Designer
21. Cartoonist / Comic Illustrator
22. Illustrator
23. Mathematician
24. GIS Specialist
25. Epidemiologist
26. Dental Assistant
27. Statistician
28. Economist
29. Graphic Designer
30. Desktop Publisher
31. Historian
32. Archivist
33. Curator
34. Web Developer
35. Public Policy Analyst
36. Esthetician
37. Hairstylist
38. Technical Writer
39. Makeup Artist
40. Webmaster

I have no idea how their questions led their system to tell me that I should be an anthropologist. Apparently I did select the correct course of study in college and graduate school! Archivist, curator, web designer and developer, and tech writer are all familiar activities to me. I did my share of amateur theatrical work years ago. This was uncannily on target.

But where did dental assistant come from? Or esthetician? Picture framer? Political aide? I just cannot imagine any of those are for me.

Friday, September 14, 2007

oSkope

I don't think that there is much that I can add to this excellent review of oSkope at if:book. I spent some time at oSkope exploring their flickr search. The mouseover shows the title and date for the image, plus whose collection it came from. If you click on the image, a popup appears that includes the above plus the tags and a zoomable thumbnail. There's a slider at the right that changes the number of images that appear in the grid -- from 4 to 500. The grid, stack, pile, and list views are great -- but I'm not sure what the axes are for the graph view.

I like the drill-down navigation through the ebay categories. As noted in the if:book entry, it didn't seem to be working and kept returning no items.

The oSkope User Agreement (pdf) is accompanied by the language "Use of this website consitutes [sic] acceptance of the oSkope User Agreement and Privacy Policy. Please read these agreements carefully." At six pages it is thorough. There's also a four-page privacy policy (pdf).

Monday, September 10, 2007

SWORD

In January I saw a presentation by Julie Allinson at Open Repositories on the UKOLN Repository Deposit Service work. Phil Barker of CETIS has a blog entry on a number of repository standards topics, one of which is SWORD (Simple Web-service Offering Repository Deposit), the project that takes forward the work I saw presented. The goal is to take their deposit protocol and implement it as a lightweight web service using a prototype "smart deposit" tool for four repository software platforms: EPrints, DSpace, Fedora, and IntraLibrary. They're taking advantage of the Atom Publishing Protocol and extending it, which seems like a smart direction to me. I'm looking forward to seeing more of this.
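Building on AtomPub means a deposit is ultimately just an HTTP POST of an Atom entry (or a packaged file) to a repository collection URI. Here's a minimal Python sketch that constructs such an entry body; the endpoint URI and metadata values are invented for illustration, and real SWORD deposits layer their own extensions on top of plain Atom:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def make_deposit_entry(title, author, summary):
    """Build a bare Atom entry suitable as an AtomPub POST body."""
    ET.register_namespace("", ATOM)  # serialize with Atom as the default namespace
    entry = ET.Element("{%s}entry" % ATOM)
    ET.SubElement(entry, "{%s}title" % ATOM).text = title
    author_el = ET.SubElement(entry, "{%s}author" % ATOM)
    ET.SubElement(author_el, "{%s}name" % ATOM).text = author
    ET.SubElement(entry, "{%s}summary" % ATOM).text = summary
    return ET.tostring(entry, encoding="unicode")

body = make_deposit_entry("My Preprint", "A. Author",
                          "A test deposit via an AtomPub-style POST.")
# The deposit itself would then be, roughly (hypothetical URI):
#   POST /sword/deposit/collection HTTP/1.1
#   Content-Type: application/atom+xml;type=entry
#   <body as built above>
```

The appeal of this design is that any AtomPub-aware client is already most of the way to being a deposit tool.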

Sunday, September 09, 2007

UNESCO open source repository report

UNESCO has issued a very interesting report -- Towards an Open Source Archival Repository and Preservation System -- that defines the requirements for a digital archival and preservation system and describes a set of open source software which can be used to implement it. It focuses on DSpace, Fedora, and Greenstone, principally comparing the three systems in their support for OAIS. The report uses as the basis for its comparison a single use case -- the management and preservation of images.

I think it's a very fair report, not deeply technical, but an overview of the capabilities of the tools. Fedora is well-reviewed, with some shortcomings mentioned -- it takes a high level of programming expertise to contribute to the core development (true), the administrative reporting tools could stand some improvement (I could use granular use statistics), and there's a lack of built-in automated preservation metadata extraction and file format validation. On those last two points, the Fedora architecture very easily supports integrating locally developed automated metadata extraction and format validation processes into object preparation. That's what we have done. And Fedora's support for checksum checking since version 2.2 is a huge step for file preservation.
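To make the fixity point concrete, here's a generic sketch (not Fedora's internal implementation) of the kind of check that checksum support enables: record a digest at ingest, recompute it later, and flag any drift.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=8192):
    """Compute a file's digest incrementally, so large files don't need
    to be read into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path, recorded_checksum, algorithm="sha256"):
    """True if the file still matches the checksum recorded at ingest."""
    return file_checksum(path, algorithm) == recorded_checksum
```

Run periodically over a repository's datastreams, a loop like this is what turns "backing up files" into actual preservation monitoring.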

Thursday, September 06, 2007

google book search features

Google Book Search has introduced a My Library feature, where you can identify volumes in GBS and books that you own and associate them with your Google account. I already had an account that I use with blogger and Google Analytics, so there was nothing to set up. I can search and easily click on an "add to my library" link. I can assign a star rating, add a review, and add labels. I don't seem to be able to see a list of labels that I've assigned. I'd like to be able to create individual sets, but there doesn't seem to be a way to do that. The export is a lightweight xml document that's lacking publication data like date or publisher. You automatically have an RSS feed. It's interesting, but I'm not sure what this gives me over LibraryThing other than URLs for the books in GBS.

The more interesting service is the ability to highlight and quote from a text in GBS. It only works with full view texts -- the tool is not available for any other view. I searched for the term I was interested in and went through 20 screens of results without finding a book that I could try the tool with. I had to resort to an advanced search for titles between 1900 and 1923 to try it. That's an interesting indicator of just how much is in GBS that's post 1923 -- none of the first 200 results in my search were in the public domain with full view.

I found a text I wanted to quote and used the tool to draw a box around the text. Drawing the box is a tad tricky -- my first two tries I didn't get the box large enough to get the first line of what I wanted to quote. I was given the option to create an image of the text block or to grab the text. I could add it to my Google Notebook or send it to blogger (because I have an account). You are also presented with a URL that you can use to embed the note in a web page. The quote includes a link to the text in GBS.

This seems really useful to me. In our paradigm at UVA we talk about how it's not enough to digitize something -- you have to be able to use it. This is the first tool I've seen from GBS where it makes its texts into something that you can really take advantage of in a networked environment.

amazon kindle

There was an article in the New York Times yesterday on ebooks that briefly mentioned two upcoming business models:

In October, the online retailer Amazon.com will unveil the Kindle, an electronic book reader that has been the subject of industry speculation for a year, according to several people who have tried the device and are familiar with Amazon’s plans. The Kindle will be priced at $400 to $500 and will wirelessly connect to an e-book store on Amazon’s site.

That is a significant advance over older e-book devices, which must be connected to a computer to download books or articles.

Also this fall, Google plans to start charging users for full online access to the digital copies of some books in its database, according to people with knowledge of its plans. Publishers will set the prices for their own books and share the revenue with Google. So far, Google has made only limited excerpts of copyrighted books available to its users.

The Google announcement is, I think, a fair one -- right now they limit viewing to copyrighted books to a snippet view. If a work is still clearly in copyright and the rights owner wants to release that book for full access, they should be able to charge for that access. It's their right. Of course I'd like to see more publishers make e-versions of their title available freely ...

The Amazon news gives me pause, not knowing all the details yet. You access the files wirelessly -- do you read them via a live connection from their servers, or is the file downloaded to the device? I understand why some think it's a plus to not require a full-fledged computer to get access to a book, but it potentially seems like a really limited version of access. The ebook files will be Mobipocket format and the Kindle device seems to use a proprietary wireless system to grab the files (known through their FCC filing), so the files likely won't be available to other devices. They are not using the Adobe format for their files; it's not clear if the Kindle will support reading of Adobe ebooks from other sources or if you can only read Amazon files. Can you get the files off the device or back them up? If you can get the files off the device, will they work with the desktop version of Mobipocket? There have also been complaints about Mobipocket DRM.

This is all speculation given the lack of details. TeleRead has some speculation of their own. I look forward to hearing more about the product and the service.

Wednesday, September 05, 2007

fair use decision

Today the Tenth Circuit court ruled unanimously in favor of Larry Lessig, et al, in Golan v. Gonzales, a case about the scope of fair use. The court has acknowledged that First Amendment freedoms must be considered when copyright law is made.

The government had argued in this case, and in related cases, that the only First Amendment review of a copyright act possible was if Congress changed either fair use or erased the idea/expression dichotomy. We, by contrast, have argued consistently that in addition to those two, Eldred requires First Amendment review when Congress changes the "traditional contours of copyright protection." In Golan, the issue is a statute that removes work from the public domain.

Monday, September 03, 2007

internet archive and nasa

I missed this announcement last week (even though Peter Suber blogged it) -- NASA and Internet Archive Team to Digitize Space Imagery:

NASA and Internet Archive of San Francisco are partnering to scan, archive and manage the agency's vast collection of photographs, historic film and video. The imagery will be available through the Internet and free to the public, historians, scholars, students and researchers.

Currently, NASA has more than 20 major imagery collections online. With this partnership, those collections will be made available through a single, searchable "one-stop-shop" archive of NASA imagery.

...

NASA selected Internet Archive, a nonprofit organization, as a partner for digitizing and distributing agency imagery through a competitive process. The two organizations are teaming through a non-exclusive Space Act agreement to help NASA consolidate and digitize its imagery archives at no cost to the agency.

...

Under the terms of this five-year agreement, Internet Archive will digitize, host and manage still, moving and computer-generated imagery produced by NASA.

...

In addition, Internet Archive will work with NASA to create a system through which new imagery will be captured, catalogued and included in the online archive automatically. To open this wealth of knowledge to people worldwide, Internet Archive will provide free public access to the online imagery, including downloads and search tools....


From an AP article on Wired News:

Kahle said the archive won't be able to digitize everything NASA has ever produced but will try to capture the images of broadest interest to historians, scholars, students, filmmakers and space enthusiasts.

Kahle said the images already in digital form represent the minority of NASA's collections, and they are scattered among some 3,000 Web sites operated by the space agency. He said those sites would continue to exist; the archive would keep copies on its own servers to provide a single, free site to augment the NASA sites.

...

The Internet Archive is bearing all of the costs, and Kahle said fundraising has just started. The five-year agreement is non-exclusive, meaning NASA is free to make similar deals with others to further digitize its collections.


What's particularly exciting is that this is both an aggregation and a digitization project -- scattered materials will be brought together for easier discovery and given enriched metadata, and important materials will be selected and digitized to add to the corpus.

Friday, August 31, 2007

some weeks, you feel like you've just survived

This week was the first week of classes and it seemed more stressful than other first weeks. We put redirects in place for some directories of resources that we'd migrated from our former Etext Center collections to the Repository. We didn't give as much notice as we could have, and some folks were surprised. We also formally announced that our Etext and Geostat Centers no longer exist and are now part of our Scholars' Lab. Those announcements required some redirects too. Things then went briefly wrong with the redirects. There was a wrinkle in updating links in our catalog records -- in some cases we weren't migrating individual texts but were instead pointing to LION, so where would the links go? And the redirects meant that records weren't going to as granular a location as they were before.

The increased load of the first week of classes caused some text delivery issues, but it helped us find what appears to be a bug in Joost that was the cause of mysterious problems in the past, which we have now worked around. Two tools that used to exchange data easily didn't anymore (but we found the cause immediately). An old assumption about which regions we included in our simple text search was proven false by some newly migrated texts, and we had to make a mid-week change. One of the text sets that we migrated was missing what turned out to be a vital element for its styled delivery. We tried to be nimble in our responses, occasionally briefly breaking something else with a fix, but our amazing team worked hard to address everything quickly.

I know of two outstanding issues to resolve, then we're set until we start the process to completely replace our searching infrastructure and interface. We've got a prototype BlacklightDL almost where it needs to be to start seriously planning the swapout project. Another change management challenge ...

Monday, August 27, 2007

zotero

Catalogablog points out release candidate 2 of Zotero 1.0. Of note from the Zotero site:

  • Zotero now offers full-text indexing of PDFs, adding your archived PDFs to the searchable text in your collection.
  • Zotero’s integration with word processing tools has been greatly improved. The MS Word plugin works much more seamlessly and we now support OpenOffice on Windows, Mac (in the form of NeoOffice), and Linux.
  • Zotero is also now better integrated with the desktop. Users can drag files from their desktop into their Zotero collection and can also drag attachments out of their Zotero collection onto their desktop.
  • We have begun to add tools to browse and visualize Zotero collections in new ways. Using MIT’s SIMILE Timeline widget, Zotero can now generate timelines from any collection or selected items.
  • The new version of CSL (Citation Style Language), used by Zotero to format references into specific styles, is more human readable and easier to edit. We will be adding many more styles soon.
There are also announcements of new compatible sites, including Institute of Physics, BioMed Central, ERIC, Engineering Village, the L.A. Times, The Economist, Time, and Epicurious, among others.

LinkedIn

I have become addicted to LinkedIn. Two departing colleagues have accounts and I was told it was fun and interesting to discover one's level of connectedness while building a network.

It's true. I spent way too much of my weekend searching out colleagues and friends and inviting them to join my network. In one place I can see the connections between different spheres of my life, and the intersections are really interesting. I have also reconnected with two friends from my college years that I lost touch with when I moved away from Los Angeles 16 years ago, which makes it definitely worthwhile to me. Now it's up to me (and not the technology) to stay connected.

Thursday, August 23, 2007

bungee view

I have been playing with a very, very cool image collection visualization and browse application called Bungee View from Carnegie Mellon University.

It was developed for the Historic Pittsburgh image collection, but I found it through a code4lib del.icio.us link to American Memory. They have American Memory image collections available as a test set. I was particularly impressed by the browse of the music collections.

When I first started the app I thought that something was wrong with my system -- what I took for search boxes were filled with a myriad of vertical lines. Then I came to understand the UI -- they aren't search boxes, they're visualizations of the distribution of terms in each category. Mousing over them shows links to the various terms and the number of times each is used. You can expand each category to see multiple forms of the visualization -- the distribution of terms in the bar, a simple list of terms, or a detailed color-coded graph. You can drill deep into the collections by combining browse categories. As you browse, color-coding provides clues to which category combinations will yield results and which are negatively associated. You see the results as a set of thumbnails on the right of the screen. You can select a thumbnail and see its full metadata.
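The idea behind those "search boxes" -- a per-category distribution of term frequencies -- is easy to sketch. Here's a minimal Python toy (the item metadata, field names, and the tick-mark rendering are all hypothetical, not Bungee View's actual data model):

```python
from collections import Counter

# Hypothetical item metadata: each item carries terms in facet categories.
items = [
    {"subject": ["music", "jazz"], "decade": ["1920s"]},
    {"subject": ["music", "opera"], "decade": ["1930s"]},
    {"subject": ["portrait"], "decade": ["1920s"]},
]

def facet_distribution(items, category):
    """Count how often each term appears in one facet category."""
    counts = Counter()
    for item in items:
        counts.update(item.get(category, []))
    return counts

def distribution_bar(counts, width=20):
    """Render the counts as a crude row of tick marks, loosely analogous
    to Bungee View's vertical-line visualization of a category."""
    total = sum(counts.values())
    return "".join("|" * round(width * n / total) for n in counts.values())

subjects = facet_distribution(items, "subject")
print(subjects.most_common())   # most frequent subject terms first
print(distribution_bar(subjects))
```

Drilling down by combining categories then amounts to filtering the item list before recomputing each category's distribution.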

Bungee View was implemented using Piccolo, a C# graphics framework that I am not familiar with. The source seems to be available, but the link didn't seem to be working when I tried it. I want to explore this more.

Wednesday, August 22, 2007

ndiipp grant to preserve virtual worlds

Many congratulations to Jerry McDonough at UIUC for his NDIIPP grant to investigate the preservation of virtual worlds. The work will be a collaborative effort between UIUC, Stanford, the University of Maryland, the Rochester Institute of Technology, and Linden Lab.

In addition to developing standards for preservation metadata and content representation, the project will investigate preservation issues through archiving a series of case studies representing early games and literature and later interactive multi-player game environments.
This is very exciting. With this work and work developing around COLLADA as an interchange format for 3-D files, there's hope. We could preserve a very early text-based game like Hammurabi, a classic arcade game like Tempest, or virtual events in Second Life.

Preserving Virtual Worlds.

Sunday, August 19, 2007

trademarking for literary protection

Via TeleRead, an interesting article in the Financial Times on trademarking of literary characters, author names, etc., as a layer of protection after copyright runs out.

Quoting from the Financial Times, "Posthumous publishers who refuse to live and let die," August 16, 2007 (might be behind a paywall):

Because copyright comes to an end – 70 years after an author's death in Europe, sooner in the US – literary estates have turned to trademark registration for an extra layer of protection. Characters, book titles and authors' names have all been registered.

For dead authors who are still in copyright, trademarking may help estates keep control after the term ends, says intellectual property lawyer Laurence Kaye. "If you intend to republish a book that has gone out of copyright, you would have to do it in a way that did not infringe any trademarks."

IFP (Ian Fleming's estate) has registered everything from Ian Fleming to James Bond and Miss Moneypenny, so any attempt to reproduce the books without permission after they go out of copyright would meet difficulties.

Mr Kaye says: "You would have to manipulate the book so that there was nothing in it that infringed the registered trademarks."
The article also mentions the trend of dead authors such as Robert Ludlum and V. C. Andrews who just keep publishing posthumously. Ludlum at least left behind outlines for work he wanted written after his death. I don't know what to think of the writer who has produced more than two dozen V. C. Andrews novels under her name in the twenty years since her death.

Friday, August 17, 2007

pynchon in semaphore

Over a year ago the artist Ben Rubin installed a piece ("San Jose Semaphore") on the Adobe building in San Jose, California, with four LED semaphore wheels that broadcast a mystery text, accompanied by an audio component. It took over a year, but two men finally deciphered the text: Thomas Pynchon's "The Crying of Lot 49."

From the San Jose Mercury News, 8/14/2007:

The solution was discovered by two Silicon Valley tech workers, Bob Mayo and Mark Snesrud, who received a commendation at San Jose City Hall today.

Using both the rotating disks and the art project's audio broadcast, they deciphered a preliminary code based on the James Joyce novel, "Ulysses," which was the key to solving the entire message. It took them about three weeks.

"It was not a real easy thing to figure out," said Snesrud, a chip designer for Santa Clara based W&W Communications.

Ben Rubin, the New York artist who developed the project, applauded the duo's "computational brute force" in finding the message. "I'm especially glad the code was cracked and that it was done in a very classical way," Rubin said.

The Pynchon book, written in the mid-1960s, is set in a fictional California city filled with high-tech campuses. It follows a woman's discovery of latent symbols and codes embedded in the landscape and local culture, Rubin said.

The semaphore is made up of four 10-foot wide disks, which are composed of 24,000 light-emitting diodes. The disks each have a dark line going from one end to another and twirl around every eight seconds to create a new pattern. It made its debut on Aug. 7, 2006 as part of the ZeroOne digital art festival. Rubin said there are no plans to stop the semaphore or change its message - at least for the time being.

"It'll change the way people look at it," Rubin said of having the solution known. "Maybe in a few years, we'll revisit it."
The choice of text is inspired. I hope that he updates it.

Wednesday, August 15, 2007

radio interview about Google Book Search

Martha Sites, our AUL for Production and Technology, was interviewed along with Ben Bunnell of Google about the Google Book Search project.

Go to: http://wmet1160.com/schedule/the_beat/

Scroll to "The Beat 20070810 1100 web.mp3" and select it (the scrolling is a bit tricky)

The interview starts at ~4:00 and ends at ~31:00.

Tuesday, August 14, 2007

Announcing the Fedora Commons

The news is finally out there:

Fedora Commons today announced the award of a four year, $4.9M grant from the Gordon and Betty Moore Foundation to develop the organizational and technical frameworks necessary to effect revolutionary change in how scientists, scholars, museums, libraries, and educators collaborate to produce, share, and preserve their digital intellectual creations. Fedora Commons is a new non-profit organization that will continue the mission of the Fedora Project, the successful open-source software collaboration between Cornell University and the University of Virginia.
There's also some staffing news, in addition to Sandy Payette becoming Executive Director:
Daniel Davis will lead Fedora core software development as chief architect, Thornton Staples will lead outreach efforts as director of community strategy and outreach starting on October 1, 2007, and Carol Minton Morris will serve as director of communications and media. Chris Wilper and Eddie Shin will continue in their roles as lead software developer for Fedora software and developer for Fedora software, respectively.
I'm going to miss having Thorny right down the hallway where I can wander down and think things through with him. He's staying local and will maintain an office at UVA, so we're not completely losing him.

The full press release is at the new web site: http://www.fedora-commons.org/about/news.php#moore-grant.

If you're not familiar with Fedora yet, check out the Portfolio portion of the site to see examples of systems built using Fedora. You might also want to check out the first in a series of promotional videos that includes our UVA Library work.

100 Year Archive report

There's an article in InfoWorld Tech Watch -- "Entering the Digital Dark Ages?" -- that notes that we have entered "an era of unprecedented information gathering likely to leave no lasting impression on the future, thanks in large part to a cross-departmental lack of understanding of the business requirements for data archiving" according to a recent study conducted by the Storage Networking Industry Association's 100 Year Archive Task Force.

The article is brief and points out a few key issues, such as data archiving not being considered a valuable business service, which is ironic when some industries have record retention standards with time frames of 50 or 100 years.


While the report was anchored by an organization representing the physical storage business, there was a lot of participation from the records management and archival communities. This work is based on a survey to identify the requirements for "long-term" storage and retention. The survey results and quite a few respondent comments are included in the report. The next steps for the group include production of a reference model similar to OAIS or the Sedona Guidelines covering the storage domain of long-term retention, creation of a "Self-Describing, Self-Contained Data Format" (SD-SCDF) for use as an archival information package in a trusted digital repository, and extending the definition and use of the XAM (eXtensible Access Method) standard that SNIA is already working on.

The report, which was issued in January 2007, is one to read. Register at the Task Force site and you can download it as a PDF.

Literature in a Digital Age

Matt Kirschenbaum has an interesting article in the Chronicle of Higher Education -- “Hamlet.doc? Literature in a Digital Age.”

Dan Cohen comments on Matt's observations on how technology such as change tracking creates new possibilities for understanding the creative process, and how important standards will become. Another part of the article resonated even more with me:

The implications here extend beyond scholarship to a need to reformulate our understanding of what becomes part of the public cultural record. If an author donates her laptop to a library, what are the boundaries of the collection? Old e-mail messages, financial records, Web-browser history files? Overwritten or erased data that is still recoverable from the hard drive? Since computers are now ground zero for so many aspects of our daily lives, the boundaries between our creative endeavors and more mundane activities are not nearly as clear as they might once have been in a traditional set of author's "papers." Indeed, what are the boundaries of authorship itself in an era of blogs, wikis, instant messaging, and e-mail? Is an author's blog part of her papers? What about a chat transcript or an instant message stored on a cellphone? What about a character or avatar the author has created for an online game? The question is analogous to Foucault's famous provocation about whether Nietzsche's laundry list ought to be considered part of his complete works, but the difference is not only in the extreme volume and proliferation of data but also in the relentless way in which everything on a computer operating system is indexed, stamped, quantified, and objectified.
I remember the discussion of boundaries when we first started talking about archiving web sites. Where does a web site "end" when it has linkages to other sites? Within the same subdomain? Within the same domain? Do you include the pages that are linked to in other sites because they might provide important context?

Many years ago I served on the board of directors of a professional organization. As part of the organizational archive, I was asked to supply my print files, electronic documents, and my email archives when my service ended. At the time I was an obsessive file archiver and I could supply all my email from four different email addresses and two different environments (Compuserve and Eudora) as well as many snapshots of document versions and web sites over a seven-year period. But those were official versions. Would I want every awkward draft of a report or a brochure saved for posterity? Is that really part of the organization's history?

While I think a lot about privacy and what an author might/should restrict access to (short-term or long-term) when leaving behind their digital legacy, there is so much potential for research. How does working on digital financial records differ from studying account ledgers? How does studying email differ from studying written correspondence or memoranda? Or blogs versus published editorials? They're the same research activities, just different media. Again, from Matt's article:
The wholesale migration of literature to a born-digital state places our collective literary and cultural heritage at real risk. But for every problem that electronic documents create — problems for preservation, problems for access, problems for cataloging and classification and discovery and delivery — there are equal, and potentially enormous, opportunities. What if we could use machine-learning algorithms to sift through vast textual archives and draw our attention to a portion of a manuscript manifesting an especially rich and unusual pattern of activity, the multiple layers of revision captured in different versions of the file creating a three-dimensional portrait of the writing process? What if these revisions could in turn be correlated with the content of a Web site that someone in the author's MySpace network had blogged?
Yes, there are definitely issues in accessing file formats as they age. When I rediscovered my single-sided original Mac disks from the mid-80s with my MA research and thesis written in MacWrite 1.0, or 5 1/4" disks with documentation that I wrote in 1991 in WordPerfect, I had to call in favors from folks with vintage Mac and PC hardware and buy Conversions Plus software to get at the file content (not fully successfully). I was incredibly lucky that the media could be read at all, let alone that the files could be transformed. Let us not even speak of the versions of files over time that I lost on Mac Zip disks that were accidentally discarded in a move. There went part of the history of the organization that I mentioned above.

There is a lot of education needed about preserving digital output and the file and media standards to be used. I look forward to seeing the work of Maryland's X-Lit project.

Saturday, August 11, 2007

COLLADA

Reading a review of this year's SIGGRAPH, I came across COLLADA:

COLLADA is a COLLAborative Design Activity for establishing an open standard digital asset schema for interactive 3D applications. It involves designers, developers, and interested parties from within Sony Computer Entertainment America (SCEA) as well as key third-party companies in the 3-D industry. With its 1.4.0 release, COLLADA became a standard of The Khronos Group Inc., where consortium members continue to promote COLLADA to be the centerpiece of digital-asset toolchains used by the 3-D interactive industry.

COLLADA defines an XML database schema that enables 3-D authoring applications to freely exchange digital assets without loss of information, enabling multiple software packages to be combined into extremely powerful tool chains.

However, COLLADA is not merely a technology, as technology alone cannot solve this communication problem. COLLADA has succeeded in providing a neutral zone where competitors work together in the design of a common specification. This creates a new paradigm in which the schema (format) is supported directly by the digital content creation (DCC) vendors. Each of them writes and supports their own implementation of COLLADA importer and exporter tools.

COLLADA is an XML schema, combined with its COMMON profile, that can be exchanged between proprietary software packages and open source programs, giving users more control over their digital assets. The list of products that support COLLADA is impressive. Some support it via plugins and others directly import or export the format. It's a truly open exchange format.
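Because a COLLADA document is plain XML, even a few lines of script can read asset names back out of it -- which is exactly what makes it appealing as an exchange and preservation format. A minimal sketch, using a hypothetical fragment (element names follow the 1.4 schema; real files carry far more, such as assets, effects, and scenes):

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical COLLADA 1.4 fragment for illustration only.
doc = """<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.0">
  <library_geometries>
    <geometry id="box-geom" name="Box"/>
    <geometry id="cone-geom" name="Cone"/>
  </library_geometries>
</COLLADA>"""

# COLLADA elements live in the schema's XML namespace.
NS = {"c": "http://www.collada.org/2005/11/COLLADASchema"}

root = ET.fromstring(doc)
names = [g.get("name") for g in root.findall(".//c:geometry", NS)]
print(names)  # geometry names that any conforming tool could read back
```

The point is less the parsing than the neutrality: any tool that honors the schema can extract the same geometry, regardless of which DCC package wrote the file.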

Having worked in instructional technology at a school of architecture, I saw the difficulties in exchanging files between individuals and projects first hand. This is a huge step for design practice and as a potential preservation format for these files. What a great possibility for preserving and sharing these files in repositories.

Friday, August 10, 2007

wikipedia trustworthiness

There was a brief article in the Chronicle of Higher Ed last week that I didn't spot until yesterday -- UC Santa Cruz researchers have developed a simple yet clever test of the trustworthiness of Wikipedia article authors:

... the researchers analyzed Wikipedia’s editing history, tracking material that has remained on the site for a long time and edits that have been quickly overruled. A Wikipedian with a distinguished record of unchanged edits is declared trustworthy, and his or her contributions are left untouched on the Santa Cruz team’s color-coded pages. But a contributor whose posts have frequently been changed or deleted is considered suspect, and his or her content is highlighted in orange.
It's a demo with only a few hundred pages, but it's still a very interesting proof-of-concept. Of course the software cannot do actual fact checking to vet content, but it's an elegant method for looking at the trustworthiness of the people who are the source of the content. It's simplistic in a way -- an author could be an expert in one area but not in others, or be overruled due to personality issues rather than authoritativeness -- but it's worth reviewing for the process and for the presentation of an article's authority ranking through color coding.
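The core idea -- trust an author in proportion to how many of their edits survive later revision -- can be sketched in a few lines. This is my toy reading of the approach, not the UCSC team's actual algorithm, and the data and threshold are made up:

```python
from collections import defaultdict

def author_trust(edit_history):
    """edit_history: list of (author, survived) pairs, where survived is
    True if the edit was still present after later revisions.
    Trust = fraction of an author's edits that survived."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for author, survived in edit_history:
        total[author] += 1
        kept[author] += survived
    return {a: kept[a] / total[a] for a in total}

# Hypothetical edit history for one article.
history = [("alice", True), ("alice", True), ("bob", False),
           ("alice", True), ("bob", True)]

trust = author_trust(history)
# Text from low-trust authors would be flagged (orange, in the demo).
flagged = {a for a, t in trust.items() if t < 0.75}
```

Even this toy version shows the limitation the article notes: a frequently overruled author scores low everywhere, whether the reverts reflected bad facts or just edit-war politics.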

A conference paper describing the work is available.