Friday, December 28, 2007

adding metadata to the list of what we learned

A couple of folks wrote and asked me why I didn't specify metadata standards in my post on what we learned from our repository project. I mentioned the impact of interface design on metadata needs, but some additional reinforcement can't hurt.

So ...

5. It shouldn't even need to be said that you should have your metadata standards identified before you start your development. We did. What we learned is that the activities in points 2-4 will mean changes to what metadata you create and how you use it. For example, when changes were made to our interface design and functionality, we needed metadata formatted in a certain way for search results and some displays. We thought that we'd generate that metadata on the fly, but that turned out to be a lot of overhead, so we decided to pre-generate what was needed: display name, sort name, display title, sort title, display date, and sort date. It isn't metadata we necessarily create during cataloging, but it's something we can generate during the conversion from its original form to UVA DescMeta XML. Another example is faceted browsing. To have the most sensible facets in our next interface, we need to break up post-coordinated subject strings or we'll have a facet for every variation. We thought about pre-generating this too, but it turns out that Lucene can do it as part of index building.
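
For what it's worth, here is a minimal sketch, in Python, of the kind of subject-splitting involved -- this is not our indexing code, and the field names and headings are made up for the example. The logic is the same whether you pre-generate the facet values or, as we're planning, let Lucene do the equivalent while it builds the index.

    # Split post-coordinated LCSH-style subject strings on their "--" delimiters
    # so each piece can stand alone as a facet value. All names here are
    # hypothetical; this is an illustration, not production code.

    def subject_facets(subject_string):
        """Break a post-coordinated subject heading into individual facet values."""
        parts = [part.strip() for part in subject_string.split("--")]
        return [part for part in parts if part]

    record = {
        "title": "Sample photograph",
        "subjects": [
            "United States -- History -- 1933-1945",
            "Migrant agricultural laborers -- Photographs",
        ],
    }

    facet_values = set()
    for heading in record["subjects"]:
        facet_values.update(subject_facets(heading))

    print(sorted(facet_values))
    # ['1933-1945', 'History', 'Migrant agricultural laborers',
    #  'Photographs', 'United States']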

Wednesday, December 26, 2007

copyright the pyramids?

From BoingBoing and Techdirt -- Egypt plans to pass a law that enacts copyright protection for the pyramids and the Sphinx and any object in any museum in Egypt in order to levy fines against creators of full-scale replicas that infringe on this so-called copyright. Apparently the Luxor hotel is safe because it's not an exact replica ...

I understand the search for income streams to care for cultural heritage, and I am extremely sympathetic, but this is such a wrongheaded interpretation of the concept of copyright. That said, Egypt can set its own copyright laws, and if it wants to declare this part of its copyright code, it can. I do wonder how such a thing would be enforced, though.

NIH Open Access mandate is law

Via Peter Suber, President Bush has signed the omnibus spending bill that includes the requirement that publications based on NIH-funded research be submitted to PubMed Central and made publicly available, with an embargo of no more than 12 months.

That said, there is no reason to wait to start depositing. Articles should be deposited upon publication, with access restricted for one year, rather than waiting until the embargo runs out (who keeps track of embargoes? -- "Today it's been a year, so I should deposit that article"). The same articles should also be self-archived in each researcher's own university's Institutional Repository. If their institution does not have an IR, they should ask "Why not?"

EDIT at 6:27 PM: Here's the press release from the Alliance for Taxpayer Access.

Saturday, December 22, 2007

Google responds to privacy questions

Via Stephen's Lighthouse, a posting at Search Engine Land on Google's response to 24 questions about privacy issues submitted by US Representative Joe Barton. A PDF of Google's response is provided.

Friday, December 21, 2007

Fedora 3.0b1 is released

The first beta version of Fedora 3.0 is now available. I am thrilled about this (even though we're going to wait until after the final 3.0 release to upgrade) because it includes a change that I have wanted for a long time -- the Content Model Architecture (CMA). When the Fedora architecture was first designed, the tight bindings between objects and their disseminators seemed like a good thing. As repository services were developed using Fedora, though, some of us discovered that we needed to change disseminator mechanisms after we had developed them. In the previous (and, for us, still current) architecture, because objects were tightly bound to disseminators, we could not change a disseminator unless NO objects were bound to it. In other words, we had to purge the disseminators that were bound to objects (the easiest way was to actually purge the objects), then update the disseminators, and then re-ingest the objects and re-establish the bindings. CMA adds the ability to formally instantiate content models in relation to objects, which also makes it easier to update disseminators without purging objects. It also supports a new ability to validate objects against content models, a very desirable QA feature.
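
To make the validation point concrete, here's a toy sketch of the kind of check a formal content model enables. This is not Fedora's API and not how CMA is implemented -- the PIDs, datastream IDs, and data structures are all invented -- it just shows why having objects formally assert their content model makes QA like this possible.

    # Toy illustration: each object asserts the content model(s) it conforms to,
    # and each model declares the datastreams a conforming object must carry.
    # Everything here (PIDs, datastream IDs, structures) is made up.

    CONTENT_MODELS = {
        "uva-lib:stdImageCModel": {
            "required_datastreams": {"DC", "descMeta", "MASTER", "SCREEN"},
        },
    }

    def validate(obj):
        """Report any required datastreams the object is missing, per its model(s)."""
        problems = []
        for model_pid in obj["has_model"]:
            required = CONTENT_MODELS[model_pid]["required_datastreams"]
            missing = required - set(obj["datastreams"])
            if missing:
                problems.append((model_pid, sorted(missing)))
        return problems

    image_object = {
        "pid": "uva-lib:12345",
        "has_model": ["uva-lib:stdImageCModel"],
        "datastreams": ["DC", "descMeta", "MASTER"],
    }

    print(validate(image_object))
    # [('uva-lib:stdImageCModel', ['SCREEN'])]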

what we learned from our repository project

Earlier this week someone asked me what we had learned from our repository development project over the years. This is the first time that anyone has asked that so directly, as opposed to general discussions about assessment and process review and software optimization.

So, what did we learn? This is what I've come up with so far.

1. Have your media file standards in mind before you start. Of course standards will change (especially if you're talking about video objects) during a multi-year implementation project. But if you have standards identified before you start (and minimize the number of file standards that you'll be working with), you at least have some chance of making it easier to migrate and manage and preserve what you've got and to design a simpler architecture. We did this (in conjunction with an inventory of our existing digital assets) and it was key for us in developing our architecture and content models.

2. Know what the functional requirements of your interface will be before you start. We had developed functional spec documents and use cases, but two different stakeholders came back to us during the process with new requests that we couldn't ignore. In both cases the newly identified functional requirements for our interface required changes to our deliverable files and our metadata standard. We had to go back and re-process tens of thousands of objects the first time, and over 100,000 the second, to meet the functional need and keep consistency across our objects.

3. Some aspect of your implementation technologies will change during the project. New technologies will become available for implementation during your project that are a better fit than what you planned to use. For example, we never initially identified Cocoon as part of our implementation, but it became a core part of our text disseminators.

4. Your project will never be "done." OK, we've got a production repository with three format types in full production and three in prototype. We've still got to figure out production workflows for those media types and there are more media types to consider. And, as a corollary to point 3, there are new technologies that we want to substitute for what we used. We're obviously going to switch to Lucene and Solr for our indexing. New indexing capabilities will absolutely bring about an interface change. There are also more open source web services applications available now than when we started in 2003. We can potentially employ XTF in place of some very complex TEI and EAD transformation and display disseminators that we developed locally. This is bringing about a discussion about simplifying our architecture -- fewer complex delivery disseminators to manage and develop and more handing off of datastreams to outside web services. Not that there aren't complexities there and a lot of re-development work, but it's a discussion worth having. We're talking a lot these days about simplifying what it takes for us to put new content models into production. The development of Fedora 3.0 will also have a huge effect.

EDIT 28 December 2007:

A couple of folks wrote and asked me why I didn't specify metadata standards. I mentioned the impact of interface design on metadata needs, but some additional reinforcement can't hurt. So ...

5. It shouldn't even need to be said that you should have your metadata standards identified before you start your development. We did. What we learned is that the activities in points 2-4 will mean changes to what metadata you create and how you use it. For example, when changes were made to our interface design and functionality, we needed metadata formatted in a certain way for search results and some displays. We thought that we'd generate that metadata on the fly, but that turned out to be a lot of overhead, so we decided to pre-generate what was needed: display name, sort name, display title, sort title, display date, and sort date. It isn't metadata we necessarily create during cataloging, but it's something we can generate during the conversion from its original form to UVA DescMeta XML. Another example is faceted browsing. To have the most sensible facets in our next interface, we need to break up post-coordinated subject strings or we'll have a facet for every variation. We thought about pre-generating this too, but it turns out that Lucene can do it as part of index building.

(http://digitaleccentric.blogspot.com/2007/12/adding-metadata-to-list-of-what-we.html)

Monday, December 17, 2007

Open Journal Systems 2.2 released

OJS 2.2 has been released. We have an earlier version up that we've been testing with a student journal. I see some additions that I know folks will find attractive. Strangely, one of the potentially most desirable features is the integration of Google Analytics. I have watched the editors of the journals that we host implement analytics themselves (sometimes struggling to do so), because we're just not in the business anymore of supplying web stats for every site on our servers. I want to upgrade our test instance to the new version to see how it compares with other solutions that we are considering for journal hosting and the submission/review/editing process.

petition for open bibliographic records

Peter Suber reports on one of the responses to the draft report from the Working Group on the Future of Bibliographic Control -- a petition to LC to make its bibliographic records openly available.

open data commons

There is a lot of hidden data out there -- data used for and created through research -- and a lot of confusion as to what rights it carries and how it can be used. Talis has announced a partnership with Creative Commons to launch a new approach to open data licensing: a Science Commons protocol and the Open Data Commons Public Domain Dedication and License. Notes on the project are available at the panlibus and Nodalities blogs.

Friday, December 14, 2007

now it's a trend

I shouldn't port my data without someone to watch over my shoulder. Today, supposed power-user that I am, I decided to port my own data in our conversion to Microsoft Outlook and Exchange.

I went to the web page that I was pointed to by many helpful email messages. I backed up my calendar and created the right format file for importing. I didn't need to convert an IMAP mailbox because I had been using POP and would retain my old local Eudora archive. I determined that I had an active account by logging into the web version. But somehow, while setting up the desktop Outlook client, I not only messed up my authentication method (it kept asking me for my certificate and wouldn't authenticate using it), but in trying to fix it I set up an offline email box that became my default.

I feel for the poor IT guy (thanks, Rob!) who got called in to help me after I commented to my colleague across the hall that I had messed things up. We had to delete my certificate to get my authentication to reset to asking for a password. Then we were able to get into the application and declare my server email box as my default, not the offline "personal" email box.

We kept trying to check my email, but then I was told that I'd missed a step -- I needed to register Exchange as my new default email server. I think I can blame the documentation -- that was NOT listed as a step on the page "Getting Started with Exchange," I swear. I had four tech support guys in my office (I was the object of much good-tempered group teasing for messing up), and the only reason one of them could tell me the URL I needed to change my default email was that he had printed the page. None of them could find a link to that page anywhere on the site.

I set up Exchange as my default, imported my "old" calendar (through January 14 when we change), and even got my Eudora address book imported with only a little weirdness. I still have no clue how to set up thematic email folders and filters like I had in Eudora. I guess I have to give in to the teasing and sit through some training, or at least read the help docs ...

Thursday, December 13, 2007

sharing personal libraries

I am very excited by the news about the joint project between the Zotero group at the Center for History and New Media and the Internet Archive.

Zotero is a very easy-to-use tool for developing personal citation repositories for distributed resources. The creation of a "Zotero Commons" registry of sorts, where materials used by researchers can be shared, is a powerful idea. It's an institutional repository without institutional boundaries. The idea of tying this in to the Internet Archive's archive of the web, so that materials cited but not directly deposited are also not lost, is even more intriguing. That there will be the capacity for both individual and group work is as it should be.

Here's the core of the project to me: "The combined digital collections present opportunities for scholars to find primary research materials, to discover one another’s work, to identify materials that are already available in digital form and therefore do not need to be located and scanned, to find other scholars with similar interests and to share their own insights broadly."

I wonder how this will fit into the landscape with other digital registries and collections. The DLF/OCLC Registry? OAIster? Aquifer? American Memory? What is the relationship between what institutions digitize, what their research communities have deposited in IRs, what is harvested into larger aggregations, and what scholars personally create? This is a problem space that bears a lot more discussion.

Wednesday, December 12, 2007

personal libraries

There is a great post on Hackito Ergo Sum about how they set up their personal library system, from barcode readers to cataloging to shelving. They went with Readerware, a reasonably well-known desktop program.

It's a shame they never encountered LibraryThing in their research. I knew about LibraryThing but didn't do anything with it until the time came to pack up our books to move to our new house in 2006. In late December 2005 I signed up, and by mid-February 2006 I had all of our more than 2,000 books cataloged, tagged, and packed without the use of a barcode reader (although I do have an original CueCat that Wired shipped out as a subscriber benefit sometime in the 90s, still in its original box). Like them, I have older books without ISBNs and foreign imprints. Many of the foreign imprints I found through non-US library catalogs. I had to manually enter some items, but that was not a challenge.

I guess I'm just a shill for LibraryThing.

Monday, December 10, 2007

print and digital at the New York Times

One of the most insightful blogs on media culture is written by David Byrne. He recently blogged about the new New York Times building, and, in response to his posting, he was invited to meet with various NY Times staff about print and digital journalism. His post on those meetings is one to read.

photography in museums

Please check out Perian's post on photography in museums over at Musematic. It presents both sides, and quotes from a BoingBoing comment of hers to rebut some comments made by Cory Doctorow and others.

Sunday, December 09, 2007

xISBN everywhere

The FRBR blog posted on Eric Hellman's announcement that Wikipedia is now available as a lookup option in xISBN.

xISBN seems to be everywhere now. It's used by LibX. It's available to resolvers. It's obviously part of WorldCat. But through what tools can you choose which lookup option -- or multiple lookup options -- you want to use?

I went to the Alice's Adventures in Wonderland Wikipedia article. I wondered which icon would appear next to the ISBNs -- LibX or Find@UVA. It was the latter. For a paperback edition of Alice on Amazon, it was the former.

With LibX and Amazon I got a great application of xISBN against our catalog. With Find@UVA I got no matches because it's not using xISBN (note to self -- make sure that we request support for xISBN and the newly announced xISSN as features, if we haven't already). In both cases the lookup target was our catalog, because that's the default (or only option).

LibX can search other lookups -- WorldCat and Google Scholar. Depending on the requests they get (Wikipedia has a bad rep as a search resource among many academic librarians and faculty), they could add Wikipedia.

But what if I want to check in more than one location for links or citations or holdings? Is there a tool that can do that?
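
Here's roughly what I have in mind, sketched in Python. The xISBN endpoint and response handling below are only approximations from memory, and the lookup targets and their URL patterns are placeholders -- the point is the shape of such a tool, not a working client.

    # Take one ISBN, expand it to related editions via xISBN, then check each
    # lookup target you care about. URLs here are placeholders/approximations.

    import urllib.request
    from xml.etree import ElementTree

    def related_isbns(isbn):
        """Ask xISBN for ISBNs of related editions (endpoint is approximate)."""
        url = ("http://xisbn.worldcat.org/webservices/xid/isbn/"
               + isbn + "?method=getEditions")
        with urllib.request.urlopen(url) as response:
            tree = ElementTree.parse(response)
        return [el.text for el in tree.iter() if el.tag.endswith("isbn") and el.text]

    LOOKUP_TARGETS = {
        # Hypothetical URL patterns; substitute whatever catalogs/resolvers you use.
        "our catalog": "http://catalog.example.edu/search?isbn=%s",
        "WorldCat": "http://www.worldcat.org/isbn/%s",
    }

    def check_everywhere(isbn):
        """Print a lookup URL for every related edition in every target."""
        for candidate in related_isbns(isbn):
            for name, pattern in LOOKUP_TARGETS.items():
                print("check %s for %s -> %s" % (name, candidate, pattern % candidate))

    check_everywhere("0486275434")  # some starting ISBN (placeholder)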

Saturday, December 08, 2007

how did I miss this?

Apparently today was "Pretend to be a Time Traveler Day."

The Wired geek dad blog had some great suggestions.

I love this one: "Take some trinket with you (it can be anything really), hand it to some stranger, along with a phone number and say 'In thirty years dial this number. You'll know what to do after that.'"

UpLib personal digital library tool

From Bill Janssen of PARC, via TeleRead:

"We’re planning to release our UpLib digital library system as open source software in a couple of months, and I’m still looking for beta testers, people who’d like to receive early a raw codebase tar file which may require some hacking skills to compile and install (though, I like to think, not too much). If you know about ‘configure’ and ‘make,’ you’re a candidate. I’m particularly interested in finding beta testers who use Windows, as that platform is still somewhat of a mystery to me. We use MSYS and MINGW to create the installer for Windows. For those of you who are wondering, the core of UpLib is a document repository server which serves as a base for a number of document analysis functions."

“The system consists of a full-text indexed repository accessed through an active agent via a Web interface. It is suitable for personal collections comprising tens of thousands of documents (including papers, books, photos, receipts, email, etc.), and provides for ease of document entry and access as well as high levels of security and privacy. Unlike many other systems of the sort, user access to the document collection is assured even if the UpLib system is unavailable.”

There are a couple of PARC white paper PDFs available, from 2003 and 2005, and one ACM article from 2005 that I found.

I have no idea what the status of this is -- there appears to have been some conversation on the ebook community list that this has been vaporware for some time -- but I'm intrigued that another tool is entering the personal digital collection problem space.

Thursday, December 06, 2007

supporting data in PQDT

Today we received a downtime notice from ProQuest. This is not unusual, especially during the holidays when services make time to update and upgrade. What cheered my little heart was this notice:

ProQuest Dissertations & Theses (PQDT) Multimedia Support release—ProQuest has seen an increase in dissertations and theses that include supplementary digital materials - audio, video, spreadsheets, etc. To properly support scholarly access to these materials, ProQuest is now making them available online in the Full Text version of the ProQuest Dissertations & Theses (PQDT) database.

Yes, ProQuest is going to include at least some supporting media and data for the theses and dissertations that it makes available. PQDT already has an Open PQDT that includes Open Access ETDs. I cannot find any details about this on the ProQuest site, but it's a promising step.

Wednesday, December 05, 2007

digital whomever

Siva has a great post about the labeling of "Digital" whomevers -- natives, immigrants, generation, millennials -- etc.

He doesn't buy it. I have a slightly different spin and some different reasons, but I don't buy it either.

I often take part in discussions about services for faculty and students, and sometimes hear ageist comments about how older faculty are completely non-digital and all students are automatically all digital. Hah! Just like some folks have an interest or skill in languages or math or art and some folks don't, it's the same with whatever "digital" is. I have worked with faculty in their 60s who saw something in being digital decades ago and have worked in that realm for years. I have worked with colleagues -- librarians and faculty -- in my own age group (I'm 44) who hate all technology with a passion and others who embrace it in all ways. I have worked with students at three different research universities who could not care less about being digital.

Being digital is not generational. At the core of what Jeff Gomez calls "Generation Download" and "Generation Upload" in his book Print is Dead, there truly is a ubiquity of digital media use that is changing media consumption and production paradigms and changing the media market. There is absolutely an increased level of acceptance that this is standard operating procedure. I'm still not willing to agree that an entire generation is digital and that the entirety of other generations is not. There are still predilections and interests and skills and, yes, issues of availability and affordability of technology that cross all generations.

There are degrees of digital-ness. Different comfort levels. Different skill levels. Different levels of access. Why do we have to apply such absolute labels?

developing a service vision for a repository

Dorothea rightly challenged me for not including a service vision in my post on repository goals and vision. I do have something like a vision, but I wouldn't say that it's quite where it needs to be yet. Still, I said I would post it, so here it is.

What are the services needed around a repository?

  • Identification and acquisition of valuable content
    • You can't wait for content to come to you – research what’s going on in the departments and at the University, and initiate a dialog.
    • Digital collections must also come from the Library and other University units – University Archives, Museums, etc.
  • Consulting Services
    • Advise on intellectual property and contract/licensing issues for scholarly output.
    • Assistance in preparing files for deposit, creating or converting metadata, and in the actual deposit process.
  • Access
    • Easy-to-use discovery interface with full-text searching and browse.
    • Instruction for community on how to find and use and cite content.
    • Make the content shareable via the Open Archives Initiative protocol (OAI-PMH); see the sketch after this list.
  • Promotion and Marketing
    • Build awareness of the high cost of scholarly journals, and that we are buying back our own institutional scholarship.
    • Promote the value of building sustainable digital collections – preservation is more than just backing up files.
    • Promote the goals of the Open Access movement, including managed, free online access and a focus on improved visibility and impact.
    • Show faculty that they can build personal and community archives.
    • Market repository building services that will enable the institution to build a body of digital content.
    • Market the repository as a content resource and a venue that increases the visibility of the institution.
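
Since "shareable via OAI" can sound abstract, here's a minimal sketch of what it means in practice: the repository exposes a base URL that answers simple OAI-PMH requests like the ones below, and any harvester or aggregator can pull the metadata from it. The base URL here is a placeholder.

    # Build OAI-PMH request URLs against a hypothetical data provider.
    import urllib.parse

    OAI_BASE_URL = "http://repository.example.edu/oai"  # placeholder base URL

    def oai_request(verb, **params):
        """Build an OAI-PMH request URL for the given verb and arguments."""
        query = urllib.parse.urlencode(dict(verb=verb, **params))
        return OAI_BASE_URL + "?" + query

    # Who are you, and what sets do you expose?
    print(oai_request("Identify"))
    print(oai_request("ListSets"))

    # Harvest Dublin Core records, optionally limited to a set and a date range.
    print(oai_request("ListRecords", metadataPrefix="oai_dc",
                      set="etd", **{"from": "2007-01-01"}))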

Tuesday, December 04, 2007

Report on the Future of Bibliographic Control available for comment

The Library of Congress has released a draft of the Report on the Future of Bibliographic Control for comment. Comments should be received by December 15, 2007, so pull together all your autocat, ngc4lib, and web4lib postings on the issues and get your comments in.

Alliance for Permanent Access

I haven't been able to track down much on the new Alliance for Permanent Access. There's this press release. Did anyone attend the Second International Conference on Permanent Access to the Records of Science, held in Brussels on November 15, 2007, where the Alliance was launched?

MARCThing

There's nothing that I can say about the MARCThing Z39.50 implementation for LibraryThing that isn't in the Thingology post. But this statement caught my eye:

I do have a recommendation for anybody involved in implementing a standard or protocol in the library world. Go down to your local bookstore and grab 3 random people browsing the programming books. If you can't explain the basic idea in 10 minutes, or they can't sit down and write some basic code to use it in an hour or two, you've failed. It doesn't matter how perfect it is on paper -- it's not going to get used by anybody outside the library world, and even in the library world, it will only be implemented poorly.

Amen.

goals and vision for a repository

Last week I had the opportunity to have a lengthy conversation with some folks about our Repository. In doing so I was able to boil our activities down to some very simple statements.

Why a Repository?

  • A growing body of the scholarly communications and research produced in our institutions exists solely in digital form.
  • Valuable assets -- secondary or gray scholarship such as proceedings, white papers, presentations, working papers, and datasets -- are being lost or not reproduced.
  • Numerous online digital collections and databases produced through research activity are not formally managed and are at risk.
  • An institutional repository is needed as a trusted system to permanently archive, steward, and manage access to the intellectual work – both research and teaching – of a university.
  • Open Access, Open Access, Open Access and Preservation, Preservation, Preservation.

What's the vision for a Repository?

  • A new scholarly publishing paradigm: an outlet for the open distribution of scholarly output as part of the open access movement.
  • A trusted digital repository for collections.
  • A cumulative and perpetual archive for an institution.

What does success look like?

  • Improved open access and visibility of digital scholarship and collections.
  • Participation from a variety of units, departments, and disciplines at the institution.
  • Usable process and standards for adding content.
  • Content is actively added.
  • Content is used: searched and cited and downloaded.
  • There is a wide variety of content types.
  • Simple counts are NOT a metric.

I really appreciate having the chance to formulate ideas like these that have nothing to do with the technology but everything to do with why we're doing what we do. I want to work this up into something more formal to share broadly.