Wednesday, October 31, 2007

LibX and OpenURL Referrer browser extensions

We launched our UVA Library LibX plugin for Firefox in June 2007, and it's gotten some rave reviews from staff. Now that UVA has approved the rollout of Vista and IE 7 on its computers, we're testing the beta IE version of LibX. I understand we've supplied some feedback on installation and running on Vista.

When I saw the recent announcement of the availability of OCLC's OpenURL Referrer for IE, I paused a bit when considering whom to send the announcement to. LibX is the tool we promote with our users: it recognizes DOIs, ISBNs, ISSNs, and PubMed IDs; supports COinS and OCLC xISBN; and works with our resolver and our catalog, Virgo. We have our resolver working with Google Scholar.
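
Under the hood, what both tools generate is just an OpenURL: a query string of context-object key/value pairs appended to an institution's resolver address. A minimal sketch in Python -- the resolver base URL and the citation values here are hypothetical, for illustration only:

```python
from urllib.parse import urlencode

# Hypothetical resolver base URL -- a real one comes from your institution
# (or, for OpenURL Referrer, from the OCLC Resolver Registry).
RESOLVER_BASE = "https://resolver.example.edu/openurl"

def build_openurl(metadata):
    """Build an OpenURL 1.0 (Z39.88-2004) link from citation metadata."""
    params = {
        "ctx_ver": "Z39.88-2004",                    # context-object version
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # referent format
    }
    params.update(metadata)
    return RESOLVER_BASE + "?" + urlencode(params)

# Example: an article identified by ISSN, volume, and start page.
link = build_openurl({"rft.issn": "1045-4438",
                      "rft.volume": "12",
                      "rft.spage": "1"})
print(link)
```

A COinS span embedded in a web page carries the same key/value payload in its title attribute, which is how LibX can turn a citation on an arbitrary page into a link to our resolver.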

In the end, I didn't forward the announcement because we're trying to promote the use of LibX and I didn't want to dilute that message for our staff and users. The OpenURL Referrer is a very cool tool and a great use of the OCLC Resolver Registry so users don't have to know anything except the name of their institution to set it up. I'm just not sure if we need both, at least not right now.

I need to ask if we know how much use our LibX toolbar is getting.

JPEG2000 at LC

It is indeed welcome news (Digital Koans and the Jester) that the Library of Congress and Xerox are teaming up on JPEG 2000 implementation, and even more welcome that it's in the context of NDIIPP and preservation.

This is a key part of the announcement for me:

Xerox scientists will develop the parameters for converting TIFF files to JPEG 2000 and will build and test the system, then turn over the specifications and best practices to the Library of Congress. The specific outcome will be development of JPEG 2000 profiles, which describe how to use JPEG 2000 most effectively to represent photographic content as well as content digitized from maps. The Library plans to make the results available on a public Web site.

We all know that we could have jp2 files that are smaller than our TIFF masters. We all know that we could move from TIFF to jp2 as an archival format. We've all researched codecs and delivery applications so we can cut down on the number of deliverable files we generate. That's real progress. What we haven't done is figure out how best to migrate our legacy files and move forward. I look forward to seeing the outcomes of this project.

Thursday, October 25, 2007

OCLC report on privacy and trust

I grabbed the report the other day but only had the chance to barely skim it. It's obviously full of valuable data, and the visualization graphics are great.

I liked these stats, because I have long tried to make the argument that online interactions are becoming ubiquitous:

Browsing/purchasing activities: Activities considered as emerging several years ago, such as online banking, have been used by more than half of the total general public respondents. Over 40% of respondents have read someone’s blog, while the majority have browsed for information and used e-commerce sites in the last year, a substantial increase in activity as seen in 2005. While commercial and searching activities have surged in the past two years, the use of the library Web site has declined from our 2005 study.

Interacting activities: The majority of the respondents have sent or received an e-mail and over half have sent or received an instant message. Twenty percent (20%) or more of respondents have participated in social networking and used chat rooms.

Creating activities: Twenty percent (20%) or more of respondents have used a social media site and have created and/or contributed to others’ Web pages; 17% have blogged or written an online diary/journal. (section 1-6)

It's nice to see library web sites firmly in the middle when grouped with commercial web sites used by all age groups. (section 2-8)

Data on how much private information is shared (section 2-31) is not too surprising but interesting to see quantified. People's faith in the security of personal information on the Internet (section 3-2, 3-4) is higher than mine. That younger respondents have different ideas about relative privacy of categories of data (section 3-9, 3-10, 3-35) is not surprising, but I wonder why more people aren't concerned about privacy. It's good to see trust data in addition to privacy data. (section 3-24)

Section 4 focuses on Library Directors as a category of respondents. It seems that overall they read more, have been online longer, and interact more online than the general public. They also overestimate how important privacy is to their users.

Section 5 is on libraries and social activities.

Karen Schneider had more time, and her response is worth a read.

Section 7 is the summary if you can't face all 280 pages.

Open Geospatial Consortium

It's ironic in a way, but I was unfamiliar with the Open Geospatial Consortium until seeing mention that Microsoft was joining as a principal member. Their standards seem worth looking at, but I have no sense of how much their reference model or specifications are used, or what their compliance certification means for content providers or users, other than the promise of interoperability. They have a long list of registered products, not a lot of which seem to be noted as compliant yet.

Open Library ILL

While I was reacting to one aspect of the New York Times article on the OCA, others were reacting to the very last bit of the article on the proposed ILL-like service from the Open Library initiative. Aaron Swartz was interviewed at the Berkman Center. Günter Waibel briefly described the effort. Peter Brantley had something to say.

But Peter Hirtle really had something to say.

For me, this is the key portion of Peter's post:

Unfortunately just because a book is out of print does not mean that it is not protected by copyright. Right now a library may use Section 108(e) of the Copyright Act to make a copy of an entire book for delivery to a patron either directly or through ILL, and that copy can be digital. But the library has to clear some hurdles first. One of them is that the library has to determine that a copy (either new or used) cannot be obtained at a fair price. The rationale for this requirement was that activity in the OP market for a particular title might encourage a publisher to reprint that work. In addition, the copy has to become the property of the user - the library cannot add a digital copy to its own collection.

Is this inefficient? Yes. Would it be better if libraries could pre-emptively digitize copyrighted works, store them safely, conduct the market test when an ILL request arrives, and then deliver the book to the user if no copy can be found? Yes. But this is not what the law currently allows.

As much as I want to encourage digitization and freeing of post-1923 works that are indeed out of copyright or orphaned, I know that Peter Hirtle has a strong position here. This principle is one that we've been looking at quite closely in a related realm -- developing a proposed "drop-off" digitization service for faculty. Not surprisingly, faculty sometimes hope we can digitize, say, slides that they have purchased commercially. We must determine who the source is and whether we can obtain digital versions (whether at a fair price or at all). If we then decide that we can digitize them for the faculty member (which will not always be the case), we definitely cannot add the files to our collections or hold on to them in any way.

Peter ends his post with a goal for us all -- that we should all work toward "convincing Congress that the public good would benefit from a more generous ILL policy for out of print books" in order to increase access to our collections.

ArchaeoInformatics Virtual Lecture Series

The Archaeoinformatics Consortium has announced their 2007-2008 Virtual Lecture Series schedule, where archaeologists describe their successful cyberinfrastructure efforts and innovators from other disciplines present information on their cyberinfrastructure initiatives and strategies that may be useful to archaeology.

These lectures are presented every other week using the Access GRID video conferencing system. It is also possible to participate in the lectures by downloading the presentation slides and joining via a telephone bridge. Information on how to connect to the Access GRID system and its alternatives is provided on the archaeoinformatics web site, where the lectures from the 2006-2007 series and this year’s lectures are also available as streaming video.

Monday, October 22, 2007

New York Times article on OCA

There was an article in the New York Times about libraries choosing to work with OCA instead of Google.

These are two quotes that I kept going back to -- "It costs the Open Content Alliance as much as $30 to scan each book" and "Libraries that sign with the Open Content Alliance are obligated to pay the cost of scanning the books. Several have received grants from organizations like the Sloan Foundation. The Boston Library Consortium’s project is self-funded, with $845,000 for the next two years. The consortium pays 10 cents a page to the Internet Archive ..."

A number of years ago we estimated that it would cost us many hundreds of dollars to digitize a book, but that involved a lot of manual work, from shooting the images to keyboarding of text (rather than OCR) to QA. We haven't revisited such an estimate in a while -- I'm sure it's lower now, but not that low. Of OCA participants, some get foundation support to shoulder the costs, and some fund it entirely themselves. Compared to doing it all yourselves, that's a bargain. I'm a fan of the OCA effort.

But then there's this quote -- “taking Google money for now while realizing this is, at best, a short-term bridge to a truly open universal library of the future.”

We don't actually take any Google money. Yes, Google provides a service for us, but they don't pay us for our participation. We underwrite certain costs for our participation. Yes, there are restrictions. You can read our agreement. One can and should question issues of control over data by any company or institution, but there is value in our pre-1923 volumes being made publicly available through Google Books. The institutions that chose to participate in one project versus the other (or both) should be neither lionized nor assailed.

Friday, October 19, 2007

RUBRIC Toolkit released

The RUBRIC project (Regional Universities Building Research Infrastructure Collaboratively) is sponsored by the Australian Commonwealth Department of Education, Science and Training. The RUBRIC Toolkit is the documentation of the process used by the project to model institutional repository services and evaluate tools. It includes a number of great checklists that could be used by any institution planning for an IR.


Twine

Every place I turn today there's something about Twine:

Twine is Smart

Twine is unique because it understands the meaning of information and relationships and automatically helps to organize and connect related items. Using the Semantic Web, natural language processing, and artificial intelligence, Twine automatically enriches information and finds patterns that individuals cannot easily see on their own. Twine transforms any information into Semantic Web content, a richer and ultimately more useful and portable form of knowledge. Users of Twine also can locate information using powerful new patent-pending social and semantic search capabilities so that they can find exactly what they need, from people and groups they trust.

Twine “ties it all together”

Twine pools and connects all types of information in one convenient online location, including contacts, email, bookmarks, RSS feeds, documents, photos, videos, news, products, discussions, notes, and anything else. Users can also author information directly in Twine like they do in weblogs and wikis. Twine is designed to become the center of a user’s digital life.

This is an exceptionally attractive concept, especially given its ability to parse content for meaning and identify new relationships and context with other content. You identify your "twine" content through tagging, and Twine applies your assigned semantics to other content, adding to your twine. It's also a social site where a personal network can interact with your shared twine and its content, enriching its semantic layer through that interaction. Twine is meant to learn from its folksonomies.

O'Reilly Radar reports on the demo at the Web2 Summit.
Read/Write Web has two postings here and here.

I've submitted a beta participation request. This could be an interesting tool for distributed research efforts.

Thursday, October 18, 2007

killer digital libraries and archives

Yesterday the Online Education Database released a great list of "250+ Killer Digital Libraries and Archives." It lists sites by state, by type, has a focus on etexts, and is a remarkable compendium of digital resources.

Of course, the first thing I did was look for our digital collections on the list. The UVA Library hosts the wonderful Virginia Heritage resource, which brings together thousands of EAD finding aids for two dozen institutions across the state of Virginia. We have our Digital Collections, with more than 20,000 images, 10,000 texts, and almost 4,000 finding aids.

Nope. Not on the list.

Not surprisingly, our former Etext Center was on the list under etexts (the Center no longer exists as a unit and its texts are gradually being migrated). The Virginia Center for Digital History was there, as it should be with its groundbreaking projects and its great blog. IATH was there with its many innovative born-digital scholarly projects.

I sulked about this for a few minutes while thinking about the likely reason we weren't on the list -- for the past few years we've been talking nonstop about our Repository and Fedora and not about our collections. Now, we wanted and needed to talk about Fedora and our Repository because we really were trying new things, solving interesting problems with our development, and participating in building a community around Fedora. But users don't care about how cool our Repository development is. They care about the collections in the Repository.

We've spent the last few months working at raising awareness about what we have. Our new Library home page now has a number of links to the digital collections. We have pages on how to find what you're looking for in our digital collections. We have feature pages for all of our collections in the Repository. We're making progress in migrating collections and making the Digital Collections site a central location where they're visible. We have an RSS Feed for additions to the collections. We now have a librarian and a unit dedicated to shepherding collection digitization through the process and working more closely with faculty. I hope the next time someone creates a list like this we'll be visible enough to be on it.

Wednesday, October 17, 2007

two interesting IP decisions

Just in time for World Series fever -- the US Court of Appeals for the Eighth Circuit has upheld (PDF) a lower court's ruling that stats are not copyrightable, in a case brought by CBC Distribution (a fantasy sports operation) against Major League Baseball and the MLB Players Association.

MLB argued that its player names and stats were copyrightable and that CBC—or any other fantasy league—couldn't operate a fantasy baseball league without a multimillion-dollar licensing agreement with MLB. CBC countered that the data was in the public domain and as such, it had a First Amendment right to use it. In August 2006 a US District Court sided with CBC. Now that decision has been upheld after MLB appealed.

In other news, the USPTO has rejected many of the broadest claims of the Amazon One-Click patent following the re-examination request by blogger Peter Calveley. To read the outcome of the re-examination request, go to the USPTO PAIR access site, choose the "Control Number" radio button, enter 90/007,946 and press the "Submit" button. Then go to the "Image File Wrapper" tab and select the "Reexam Non-Final Action" document. This doesn't mean that the patent has been thrown out -- five of the twenty-six claims were found to be patentable. The document is a "non-final action" and Amazon still has rights in the process. The patent could be either thrown out or modified. It will be interesting to see how it plays out.

Tuesday, October 16, 2007

what if google had to optimize its design for google?

Web developers have a lot of hoops to jump through to optimize their sites for discovery through the Google search engine. Getting good Google index placement is paramount, so it is apparently increasingly difficult for designers to offer the uncluttered user experience they'd like while building the needed content into their pages. Here is a funny look at what would happen to Google's uncluttered look if Google had to design for the Google search engine. Not sure how accurate it is ...

Monday, October 15, 2007


Muradora released

DRAMA (Digital Repository Authorization Middleware Architecture) at Macquarie University has released Muradora, a repository application that supports federated identity (via Shibboleth authentication) and flexible authorization (using XACML). Fedora forms the core back-end repository, while different front-end applications (such as portlets or standalone web interfaces) can all talk to the same instance of Fedora, yet maintain a consistent approach to access control. A Live DVD image, based on the Ubuntu Linux distribution, can be downloaded from the project site to install Muradora on a server following an easy installation procedure.

From the announcement email:

- "Out-of-the-box" or customized deployment options

- Intuitive access control editor allows end-users to specify their own access control criteria without editing any XML.

- Hierarchical enforcement of access control policies. Access control can be set at the collection level, object level or datastream level.

- Metadata input and validation for any well-formed metadata schema using XForms (a W3C standard). New metadata schemas can be supported via XForms scripts (no Muradora code modification required).

- Flexible and extensible architecture based on the well known Java Spring enterprise framework.

- Multiple deployments of Muradora (each customized for their own specific purpose) can talk to the one instance of Fedora.

- Freely available as open source software (Apache 2 license). All dependent software is also open source.
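
To make the access-control piece a little more concrete, here is an illustrative sketch of the kind of XACML policy such a setup evaluates behind the scenes. Everything in it is hypothetical -- the policy ID, the attribute name, and the "staff" value are made up for illustration and are not taken from Muradora's actual configuration:

```xml
<!-- Hypothetical sketch: permit a datastream only to "staff" users. -->
<Policy xmlns="urn:oasis:names:tc:xacml:2.0:policy:schema:os"
        PolicyId="demo-restrict-datastream"
        RuleCombiningAlgId="urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:deny-overrides">
  <Target/>
  <Rule RuleId="permit-staff" Effect="Permit">
    <Condition>
      <Apply FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-at-least-one-member-of">
        <SubjectAttributeDesignator
            AttributeId="urn:mace:dir:attribute-def:eduPersonAffiliation"
            DataType="http://www.w3.org/2001/XMLSchema#string"/>
        <Apply FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-bag">
          <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">staff</AttributeValue>
        </Apply>
      </Apply>
    </Condition>
  </Rule>
  <!-- Anything not explicitly permitted falls through to Deny. -->
  <Rule RuleId="deny-others" Effect="Deny"/>
</Policy>
```

In a Shibboleth-fronted deployment an attribute like eduPersonAffiliation would arrive from the user's identity provider; the point of Muradora's access control editor is that end users never have to write XML like this by hand.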

Friday, October 12, 2007

public personae

When I joined Facebook earlier this week I had every intention of keeping my activity to a minimum. No picture. No personal details. Minimal professional details. No groups. Why would I share my personal life online? That's personal, that's private!

Then folks started pointing out to me that I was being silly. My email address and information about my job are all over our Library's web site. Presentations that I've given are online everywhere. Old email listserv postings are available through list archives. I use flickr, and while some images are kept restricted to friends and family, most aren't (but with a Creative Commons license). My LibraryThing catalog is public. I have this blog.

Basically, I was told to get over it. I already have a public persona, even if it's not one that interests more than a few close friends and colleagues and folks interested in arcane digital library topics.

But, but, what about privacy? I know better than to share too much about myself online. I have friends who have been the subject of identity theft. One hears cautionary (but possibly apocryphal) tales every day about middle schoolers getting stalked online through their MySpace pages. How did my life get so public? Gradually, without my even consciously noticing it. There's no going back.

what is publishing?

I'm in the process of making a transition in my organization, shifting into a newly created position as Head of Digital Publishing Services.

The first question that everyone asks is "What will you be doing?" The second is "What is publishing in a Library?"

We partner with faculty who are selecting content, organizing it, describing it, analyzing it, identifying and creating intellectual relationships, and presenting and interpreting that content in new ways as born-digital scholarship. These projects are more frequently being considered in the promotion and tenure process. Is that the Library supporting a publishing activity? Most certainly.

We digitize collections, describe them, organize them, present them online, and promote them to our community for use in teaching and research. Is that a publishing activity? It can be argued either way (and has been) -- I'm on the side that leans toward yes.

We provide production support and hosting for peer-reviewed electronic journals. No one would deny that we're participating in a publishing activity.

We're evaluating an Institutional Repository. Not a publishing activity per se, but a way to preserve the publishing output of our community. That's a service related to our stewardship role.

As a Library we're already very active participants in publishing activities. We have our Scholars' Lab, Research Computing Lab, and Digital Media Lab serving as the loci for our faculty collaborations. My role will be formalizing what our publishing services are, what our workflows should be, and how we can sustain and expand our consulting services related to scholarly communication.

Wednesday, October 10, 2007

apparently peer pressure works on me

I gave in and joined Facebook. I'm astonished at the high level of activity that some folks seem to be able to maintain. I have a very boring profile: I added a few applications; some people seem to have dozens. I haven't added much personal information. People apparently send lots of messages and post on each other's profiles. I'm going to have to remember to check it, on top of checking other places where I do my sharing and connect with folks (LinkedIn, LibraryThing, flickr). Still, I think it will be good to have another place to try to maintain contact.

Friday, October 05, 2007

digital lives project

Digital Koans posted about the Digital Lives research project, "focusing on personal digital collections and their relationship with research repositories."

For centuries, individuals have used physical artifacts as personal memory devices and reference aids. Over time these have ranged from personal journals and correspondence, to photographs and photographic albums, to whole personal libraries of manuscripts, sound and video recordings, books, serials, clippings and off-prints. These personal collections and archives are often of immense importance to individuals, their descendants, and to research in a broad range of Arts and Humanities subjects including literary criticism, history, and history of science.


These personal collections support histories of cultural practice by documenting creative processes, writing, reading, communication, social networks, and the production and dissemination of knowledge. They provide scholars with more nuanced contexts for understanding wider scientific and cultural developments.

As we move from cultural memory based on physical artifacts, to a hybrid digital and physical environment, and then increasingly shift towards new forms of digital memory, many fundamental new issues arise for research institutions such as the British Library that will be the custodians of and provide research access to digital archives and personal collections created by individuals in the 21st century.

I very much look forward to seeing the results of this work, as university archives and institutional repositories increasingly have to cope not only with managing and preserving deposited personal digital materials, but also with potentially describing, organizing, and making such collections usable.

While not the focus of their study, anyone who has ever supported teaching with images knows a tangential area of this problem space intimately. Faculty develop their own collections of teaching images -- their own analog photography, purchased slides, digital photography, images found on the open web, images from colleagues, etc. We have licensed images and surrogates of our own physical collections. They want to use materials from their own collections and our repositories together in their teaching. What is the relationship between their image collections and our repositories and teaching tools? Do we integrate their collections into ours? Do we have a role in digital curation and preservation of their data used in teaching and research, which happen to be images? We struggle with the legal and resource allocation issues every day.

elastic lists

I've been exploring a demo of the visualization model called Elastic Lists. It comes from work done on "Information Aesthetics" published by Andrea Lau and Andrew Vande Moere (pdf) and is implemented using the Flamenco Browser developed at UC Berkeley. The facets have a visualization component comprising color (a brighter background connotes higher relevancy) and proportion (larger facets take up a larger relative proportion of space).
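
The brightness/proportion idea is easy to state computationally: each facet value gets screen space proportional to its share of the current result set, and a brightness driven by how that share compares to the value's share of the whole collection. A rough sketch of that weighting (my own reading of the model, not the actual Elastic Lists implementation):

```python
def elastic_weights(counts, baseline):
    """For each facet value, compute (proportion, relevancy).

    counts   -- occurrences of each value in the current result set
    baseline -- occurrences of each value in the whole collection
    Proportion drives the relative size of the facet's box; relevancy
    (share in results vs. share overall) drives its brightness.
    """
    total = sum(counts.values())
    base_total = sum(baseline.values())
    weights = {}
    for value, n in counts.items():
        share = n / total                      # proportion of results
        expected = baseline[value] / base_total  # proportion of collection
        weights[value] = (share, share / expected)
    return weights

# Toy example: "France" is over-represented in the current results,
# so it would render both larger and brighter than "Italy".
w = elastic_weights({"France": 8, "Italy": 2},
                    {"France": 100, "Italy": 100})
```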

I find this to be a really compelling visualization, easier for the user to navigate than the perhaps too-complex Bungee View. I'd like to see this applied to more heterogeneous data, such as one might encounter in a large digital collection.

Thursday, October 04, 2007

Cal has its own YouTube channel for course lectures

The University of California at Berkeley has its own YouTube channel. This complements their use of iTunesU.

Ars Technica covers the launch.

It's interesting to see everything from course lectures to colloquia presentations to football highlights. This is very smart co-branding with YouTube as a web destination, and a savvy use of a commercial hosting service.

a new museum starts virtual

I am thrilled to see that the Smithsonian's newest museum -- the National Museum of African American History and Culture -- is open for business. Their opening exhibition is “Let Your Motto Be Resistance,” featuring portraits and photographs of people who stood against oppression, from Frederick Douglass to Ella Fitzgerald to Malcolm X. They have lesson plans. They have a Memory Book for visitor-supplied content that makes great use of the visualization interface for browsing that's also employed on the rest of the site.

It's a really nice experience, especially for a museum for which groundbreaking is not scheduled until 2012.

From the site:

With the help of a $1 million grant of technology and expertise from IBM, the NMAAHC Museum on the Web represents a unique partnership to use innovative IBM expertise and services to bring the stories of African American History to a global audience. Conceived from the very beginning as a fully virtual precursor to the museum to be built on the Washington Mall, this is the first time a major museum is opening its doors on the Web prior to its physical existence.

This is a very exciting collaboration, and a great way to build a community around an institution and its mission years before anyone will walk through the door.

Tuesday, October 02, 2007

self-determined music pricing

Yesterday it was all over the blogosphere and on NPR that Radiohead is taking control of its own distribution and releasing its new album with self-determined pricing. This was heralded as huge, earth-shattering, and a "watershed moment" according to CNET.

It's as if no one has done this before. But of course, at least one person has, and very successfully.

In the early 1980s I discovered Canadian singer Jane Siberry, who now goes by the name Issa. I own her albums. I've seen her perform twice so far (a third is in the offing -- she's coming to Charlottesville this month).

And, in 2005, she transformed her personal label's inventory from physical to digital, put it online, and allowed self-determined pricing. Some of her earlier material is not available due to licensing restrictions -- she's not encouraging illegal downloading -- but she has successfully licensed some albums and songs she did for Warner and made them available through this route. Her Wikipedia article quotes an interview in The Globe and Mail where she says that since she instituted the self-determined pricing policy, the average income she receives per song is in fact slightly more than the standard price. The "Press" section of her Jane Siberry site has an interesting Chicago Tribune article from that year.

But, since almost no one has heard of Jane Siberry and everyone has heard of Radiohead, it's as if no one has ever done this before. There was a Thingology post that commented that Radiohead probably borrowed the idea from someone or thought of it on their own. Here's an example of someone who did it first. I'm not knocking Radiohead. I like their music. I'm thrilled that such a high profile group is doing this. But this is not their watershed moment alone.

Monday, October 01, 2007

freeing copyright data

Messages went out on the Digital Library Federation (DLF) mailing list and on O'Reilly Radar yesterday letting us know that DLF had helped set free the entire dataset from the U.S. Copyright Office's copyright registration database. LC pointed out that Congress set the rules for charging, but they should have known that their referring to the issue as a "blogospheric brouhaha" was just going to drive someone to do something.

DLF and Public.Resource.Org sent a request to open access to the data. Lots of sites picked this up, including BoingBoing. The registration database is a compilation of facts -- it's not copyrightable itself. The U.S. Register of Copyright agreed that these are public records and should be available in bulk.

So Public.Resource.Org and DLF made it so.

It's extremely exciting to see DLF lobbying so effectively, participating in an effort to make data available via open access, and helping all libraries provide better copyright search services. BoingBoing celebrates them as guerrilla librarians. This is not your father's DLF.