Friday, December 21, 2007

what we learned from our repository project

Earlier this week someone asked me what we had learned from our repository development project over the years. This is the first time anyone has asked that so directly, as opposed to general discussions about assessment, process review, and software optimization.

So, what did we learn? This is what I've come up with so far.

1. Have your media file standards in mind before you start. Of course standards will change during a multi-year implementation project (especially if you're talking about video objects). But if you have standards identified before you start (and minimize the number of file standards you'll be working with), you at least have some chance of making it easier to migrate, manage, and preserve what you've got, and of designing a simpler architecture. We did this (in conjunction with an inventory of our existing digital assets), and it was key for us in developing our architecture and content models.

2. Know what the functional requirements of your interface will be before you start. We had developed functional spec documents and use cases, but two different stakeholders came back to us mid-process with new requests that we couldn't ignore. In both cases the newly identified functional requirements for our interface forced us to change our deliverable files and our metadata standard. We had to go back and re-process tens of thousands of objects the first time, and over 100,000 the second, to meet the functional need and keep consistency across our objects.

3. Some aspect of your implementation technologies will change during the project. New technologies that are a better fit than what you planned to use will become available mid-project. For example, we hadn't initially identified Cocoon as part of our implementation, but it became a core part of our text disseminators.
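
At bottom, a text disseminator applies an XSLT stylesheet to a TEI (or EAD) datastream to produce an HTML view. Ours run as Cocoon pipelines, but for the curious, here's a minimal Java sketch of the underlying operation -- the stylesheet and file names are purely illustrative:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class TeiDisseminatorSketch {
        public static void main(String[] args) throws Exception {
            // Compile the TEI-to-HTML stylesheet (illustrative path).
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("tei-to-html.xsl"));

            // Apply it to a TEI datastream and write the HTML view.
            transformer.transform(
                    new StreamSource("sample-tei-object.xml"),
                    new StreamResult(System.out));
        }
    }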

4. Your project will never be "done." OK, we've got a repository with three format types in full production and three in prototype. We've still got to figure out production workflows for the prototype media types, and there are more media types to consider. And, as a corollary to point 3, there are new technologies that we want to substitute for what we used. We're obviously going to switch to Lucene and Solr for our indexing. New indexing capabilities will absolutely bring about an interface change. There are also more open source web services applications available now than when we started in 2003. We can potentially employ XTF in place of some very complex TEI and EAD transformation and display disseminators that we developed locally. That prospect is prompting a discussion about simplifying our architecture -- fewer complex delivery disseminators to manage and develop, and more handing off of datastreams to outside web services. Not that there aren't complexities there, and a lot of re-development work, but it's a discussion worth having. We're talking a lot these days about simplifying what it takes for us to put new content models into production. The development of Fedora 3.0 will also have a huge effect.
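
To give a flavor of why Lucene appeals to us, here's a hedged sketch of indexing a single object with the Lucene 2.x Java API. The field names, PID, and index path are made up for illustration and aren't our production schema:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class RepositoryIndexerSketch {
        public static void main(String[] args) throws Exception {
            // Create (or overwrite) an index on disk.
            IndexWriter writer = new IndexWriter(
                    "/tmp/repo-index", new StandardAnalyzer(), true);

            Document doc = new Document();
            // Stored and tokenized: full-text searchable, retrievable for display.
            doc.add(new Field("title", "Notes on the State of Virginia",
                    Field.Store.YES, Field.Index.TOKENIZED));
            // Stored and untokenized: an exact-match identifier.
            doc.add(new Field("pid", "uva-lib:12345",
                    Field.Store.YES, Field.Index.UN_TOKENIZED));

            writer.addDocument(doc);
            writer.optimize();  // merge index segments for faster searching
            writer.close();
        }
    }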

EDIT 28 December 2007:

A couple of folks wrote and asked me why I didn't specify metadata standards. I mentioned the impact of interface design on metadata needs, but some additional reinforcement can't hurt. So ...

5. It shouldn't even need to be said that you should have your metadata standards identified before you start your development. We did. What we learned is that the activities in points 2-4 will mean changes to what metadata you create and how you use it. For example, when changes were made to our interface design and functionality, we needed metadata formatted in a certain way for search results and some displays. We thought that we'd generate the metadata on the fly, but that turned out to be a lot of overhead, so we decided to pre-generate the metadata needed: display name, sort name, display title, sort title, display date, and sort date. It isn't metadata we necessarily create during cataloging, but it's something we can generate during the conversion from its original form to UVA DescMeta XML. Another example is faceted browsing. To have the most sensible facets in our next interface, we need to break up post-coordinated subject strings, or we'll have a facet for every variation. We thought about pre-generating this, but it turns out that Lucene can do it as part of the index building.
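
To illustrate the splitting: a post-coordinated LCSH-style heading like "Virginia -- History -- Civil War, 1861-1865" should yield three facet values, not one long unique string. Here's a hypothetical Java sketch, assuming "--" as the subdivision delimiter (in practice we'd let Lucene handle this with a custom analyzer at index-build time):

    import java.util.ArrayList;
    import java.util.List;

    public class SubjectFacetSketch {
        // Split a post-coordinated subject string on its "--"
        // subdivision delimiter so each component becomes its
        // own facet value.
        static List<String> facetValues(String subject) {
            List<String> values = new ArrayList<String>();
            for (String part : subject.split("--")) {
                String trimmed = part.trim();
                if (trimmed.length() > 0) {
                    values.add(trimmed);
                }
            }
            return values;
        }

        public static void main(String[] args) {
            // Prints "Virginia", "History", "Civil War, 1861-1865"
            for (String v : facetValues(
                    "Virginia -- History -- Civil War, 1861-1865")) {
                System.out.println(v);
            }
        }
    }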
