Monday, October 02, 2006

it's all about the metadata

Last week I attended the NISO "Managing Electronic Collections" workshop. I spoke about our Digital Library Repository implementation, and was gratified to have a number of people ask me questions over the course of the two days that I was there. One question really struck me -- "What is the most important thing that you learned in your process that we should take into account in our project?"

It could almost be a one word answer: metadata.

Of course it's a more complex answer than that. What metadata do you need to capture? Technical, preservation, administrative, descriptive? In what format? What's the minimum? We have experimented a lot in this area, and there has been a certain amount of "lather, rinse, repeat" as we've refined our metadata. In some cases, encoding standards have changed, so mappings had to change. Or workflow tools have changed, requiring a review of what metadata we can automatically capture, and in what form. Or standards have developed, such as those for preservation or rights, so we need to review what we're capturing.

One of the most significant change agents has been evolving end-user services. Why? Because you can't support functionality and services (and often usability) if the needed metadata isn't there, or is in the wrong form. Having an extensible architecture is vital. Identifying the standards to be used, and having production workflows that can process appropriate content in a timely fashion, is key. But really, it's all about the metadata.

Ex: We want to be able to support sorting and grouping of search results by creator or title, which is easier if there are pre-generated sort names and sort titles (doing it on the fly takes a lot of processor overhead).
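To make the sort-title idea concrete, here is a minimal sketch in Python of generating a sort title once, at ingest time, so that result lists can be ordered on a stored value. The function name, the list of leading articles (English only), and the normalization rules are assumptions for illustration, not a description of our actual workflow.

```python
import re
import unicodedata

# Assumption: English-only leading articles. A real workflow would more
# likely rely on cataloging data (e.g. non-filing character counts).
LEADING_ARTICLES = ("a ", "an ", "the ")


def make_sort_title(title):
    """Return a normalized form of a title suitable for storing as a sort key."""
    # Decompose accented characters and drop the combining marks.
    normalized = unicodedata.normalize("NFKD", title)
    normalized = "".join(ch for ch in normalized if not unicodedata.combining(ch))
    normalized = normalized.lower()
    # Remove punctuation, then collapse runs of whitespace.
    normalized = re.sub(r"[^\w\s]", "", normalized)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    # Drop one leading article, if present.
    for article in LEADING_ARTICLES:
        if normalized.startswith(article):
            normalized = normalized[len(article):]
            break
    return normalized


# Hypothetical records: the sort title is computed once and stored.
records = [{"title": t, "sort_title": make_sort_title(t)}
           for t in ("The Virginia Gazette", "A Tale of Two Cities", "Notes on Virginia")]
records.sort(key=lambda r: r["sort_title"])
```

With the sort title stored alongside the display title, the search layer only has to order on an existing field rather than normalize every title at query time.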

Ex: We want to create aggregation objects that bring together multi-volume series or issues in a serial title, which is easier if you have the most complete enumeration possible and identify scope to as granular a level as possible (e.g., volume, issue, article).
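As a rough illustration of why granular enumeration helps, the Python sketch below groups records into a series, volume, issue hierarchy. The field names (series, volume, issue) are hypothetical, and it assumes the enumeration has already been captured as separate numeric values rather than as a display string such as "v. 10, no. 2".

```python
from collections import defaultdict


def build_aggregation(records):
    """Group records as {series: {volume: [records ordered by issue]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for rec in records:
        tree[rec["series"]][rec["volume"]].append(rec)
    # Numeric volume and issue values sort correctly (2 before 10),
    # which a free-text enumeration string would not.
    return {
        series: {
            volume: sorted(items, key=lambda r: r["issue"])
            for volume, items in sorted(volumes.items())
        }
        for series, volumes in tree.items()
    }


# Hypothetical sample data.
sample = [
    {"id": "obj-3", "series": "Example Serial", "volume": 10, "issue": 1},
    {"id": "obj-2", "series": "Example Serial", "volume": 2, "issue": 2},
    {"id": "obj-1", "series": "Example Serial", "volume": 2, "issue": 1},
]
aggregation = build_aggregation(sample)
```

If only the display string were stored, building this hierarchy would require parsing it back apart first, which is exactly the kind of regeneration work that better enumeration avoids.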

Ex: We want to support faceted subject navigation, which is easier if the subject terms are broken out in a granular way from their post-coordinated forms, such as identifying geographic vs. topical vs. temporal parts in the subject.
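A minimal sketch of that kind of breakdown, in Python, assuming the subject string still carries MARC-style subfield codes ($a, $x, $y, $z, $v) from a topical subject field. The mapping of codes to facet names and the function name are illustrative assumptions, not our actual DescMeta encoding.

```python
import re
from collections import defaultdict

# Assumed mapping for a topical subject heading (MARC 650 style):
# $a topical term, $x general subdivision, $y chronological subdivision,
# $z geographic subdivision, $v form subdivision.
FACET_BY_SUBFIELD = {
    "a": "topical",
    "x": "topical",
    "y": "temporal",
    "z": "geographic",
    "v": "form",
}

# One subfield: a "$" delimiter, a one-character code, then its value.
SUBFIELD = re.compile(r"\$([a-z0-9])([^$]*)")


def split_subject(subject_string):
    """Split a subfield-coded subject string into facet -> list of terms."""
    facets = defaultdict(list)
    for code, value in SUBFIELD.findall(subject_string):
        facet = FACET_BY_SUBFIELD.get(code)
        term = value.strip()
        if facet and term:  # skip unmapped codes and empty values
            facets[facet].append(term)
    return dict(facets)


parts = split_subject("$aArchitecture$zVirginia$zCharlottesville$y19th century$vPhotographs")
```

Each facet list can then be indexed separately, so a browse interface can offer "Virginia" as a geographic facet without matching on the whole subject string.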

Each of these requires a change to our DTD and/or the patterns of our encoding, and sometimes requires us to regenerate the metadata from the original sources. But each time we both better document the objects and improve the services and the interface that we provide, so it's worth it.

If you're interested in what we've delved into so far:


Winona said...

Hi Leslie,
I'm currently struggling with the metadata question. I'm also interested in doing some sort of faceted browse interface, but am hampered by the metadata standard we are using (Dublin Core). I was curious to see that you are not using something like DC or MODS but are instead using an institution-created standard. I’d be interested in hearing about your decision process for this and how you selected your element set.

Leslie Johnston said...

We made some of these decisions almost 5 years ago. MODS wasn't ready, VRA Core wasn't ready, and DC wasn't enough. We started off with DC (qualified, not unqualified) as the basis and it grew from there. We have both AdminMeta and DescMeta in place, and they continue to evolve. One example is having to update our DescMeta for subjects. We were storing the entire subject string, but to support faceted browse we needed more granular subject terms, so we're breaking them up using the subfield indicators (which we keep from the transformation from MARC) and qualifying them.