Friday, August 31, 2007

some weeks, you feel like you've just survived

This week was the first week of classes and it seemed more stressful than other first weeks. We put redirects in place for some directories of resources that we'd migrated from our former Etext Center collections to the Repository. We didn't give as much notice as we could have, and some folks were surprised. We also formally announced that our Etext and Geostat Centers no longer exist and are now part of our Scholars' Lab; those announcements required some redirects too, and things then went briefly wrong with the redirects. There was also a wrinkle in updating links in our catalog records -- in some cases we weren't migrating individual texts but were instead pointing to LION, so where should the links go? And the redirects meant that records weren't pointing to as granular a location as they had before.
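For the curious, the redirect logic amounts to longest-prefix matching from old paths to new homes. Here's a minimal sketch in Python; the paths are invented placeholders, not our actual URLs or rules.

```python
# Sketch of longest-prefix redirect mapping, the kind of rule set needed
# when old Etext Center directory URLs have to land in the Repository.
# All paths here are hypothetical placeholders.

REDIRECTS = {
    "/etext/collections/modeng/": "/repository/modeng/",
    "/etext/": "/scholarslab/etext-legacy/",
}

def redirect(path):
    """Return the new location for path, preferring the most specific rule."""
    for prefix in sorted(REDIRECTS, key=len, reverse=True):
        if path.startswith(prefix):
            return REDIRECTS[prefix] + path[len(prefix):]
    return None  # no rule matched: serve (or 404) as before
```

The granularity problem shows up right here: a prefix rule can only preserve whatever tail structure survives on the new side, so anything that moved to a coarser location loses its deep link.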

The increased load of the first week of classes caused some text delivery issues, but it helped us find what appears to be a bug in Joost that had caused mysterious problems in the past; we've now worked around it. Two tools that used to exchange data easily didn't anymore (but we found the cause immediately). An old assumption about what regions we included in our simple text search was proven false by some newly migrated texts, and we had to make a mid-week change. One of the text sets that we migrated was missing what turned out to be a vital element for its styled delivery. We tried to be nimble in our responses, occasionally briefly breaking something else with a fix, but our amazing team worked hard to address everything quickly.

I know of two outstanding issues to resolve, then we're set until we start the process to completely replace our searching infrastructure and interface. We've got a prototype BlacklightDL almost where it needs to be to start seriously planning the swapout project. Another change management challenge ...

Monday, August 27, 2007


Catalogablog points out release candidate 2 of Zotero 1.0. Of note from the Zotero site:

  • Zotero now offers full-text indexing of PDFs, adding your archived PDFs to the searchable text in your collection.
  • Zotero’s integration with word processing tools has been greatly improved. The MS Word plugin works much more seamlessly and we now support OpenOffice on Windows, Mac (in the form of NeoOffice), and Linux.
  • Zotero is also now better integrated with the desktop. Users can drag files from their desktop into their Zotero collection and can also drag attachments out of their Zotero collection onto their desktop.
  • We have begun to add tools to browse and visualize Zotero collections in new ways. Using MIT’s SIMILE Timeline widget, Zotero can now generate timelines from any collection or selected items.
  • The new version of CSL (Citation Style Language), used by Zotero to format references into specific styles, is more human readable and easier to edit. We will be adding many more styles soon.
There are also announcements of new compatible sites, including Institute of Physics, BioMed Central, ERIC, Engineering Village, the L.A. Times, The Economist, Time, and Epicurious, among others.


I have become addicted to LinkedIn. Two departing colleagues have accounts and I was told it was fun and interesting to discover one's level of connectedness while building a network.

It's true. I spent way too much of my weekend searching out colleagues and friends and inviting them to join my network. In one place I can see the connections between different spheres of my life, and the intersections are really interesting. I have also reconnected with two friends from my college years whom I lost touch with when I moved away from Los Angeles 16 years ago, which makes it definitely worthwhile to me. Now it's up to me (and not the technology) to stay connected.

Thursday, August 23, 2007

bungee view

I have been playing with a very, very cool image collection visualization and browse application called Bungee View from Carnegie Mellon University.

It was developed for the Historic Pittsburgh image collection, but I found it through a code4lib link to American Memory. They have American Memory image collections available as a test set, and I was particularly impressed by the browse of their music collections.

When I first started the app I thought that something was wrong with my system -- what I took for search boxes were filled with a myriad of vertical lines. Then I came to understand the UI: they aren't search boxes, they're visualizations of the distribution of terms in each category. Mousing over them shows links to the various terms and the number of times each is used. You can expand each category to see multiple forms of the visualization -- the distribution of terms in the bar, a simple list of terms, or a detailed color-coded graph. You can drill deep into the collections by combining browse categories. As you browse, color coding provides clues about which combinations of categories will yield results and which are negatively associated. You see the results as a set of thumbnails on the right of the screen, and you can select a thumbnail to see its full metadata.
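The category bars boil down to term-frequency counts per facet, and drilling down is just intersecting facet selections. A toy sketch (the records and category names are invented, and this is my reading of the UI, not Bungee View's actual code):

```python
# Tally how often each term occurs in each browse category (facet) --
# the information the vertical-line visualizations summarize -- and
# combine facet selections to drill into the collection.
# Sample records and categories are invented for illustration.
from collections import Counter, defaultdict

records = [
    {"subject": ["music", "jazz"], "decade": ["1920s"]},
    {"subject": ["music"], "decade": ["1930s"]},
    {"subject": ["portraits"], "decade": ["1920s"]},
]

def facet_counts(records):
    """Map each category to a Counter of its terms across all records."""
    counts = defaultdict(Counter)
    for rec in records:
        for category, terms in rec.items():
            counts[category].update(terms)
    return counts

def drill(records, **selected):
    """Combine facets: keep records containing every selected term."""
    return [r for r in records
            if all(term in r.get(cat, []) for cat, term in selected.items())]

counts = facet_counts(records)
# counts["subject"]["music"] == 2; drill(records, subject="music",
# decade="1920s") returns only the first record.
```

The color-coding correlation cues would fall out of the same counts: compare the term distribution in the drilled-down subset against the whole collection.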

Bungee View was implemented using Piccolo, a C# graphics framework that I am not familiar with. The source seems to be available, but the link didn't seem to be working when I tried it. I want to explore this more.

Wednesday, August 22, 2007

ndiipp grant to preserve virtual worlds

Many congratulations to Jerry McDonough at UIUC for his NDIIPP grant to investigate the preservation of virtual worlds. The work will be a collaborative effort between UIUC, Stanford, the University of Maryland, the Rochester Institute of Technology, and Linden Lab.

In addition to developing standards for preservation metadata and content representation, the project will investigate preservation issues through archiving a series of case studies representing early games and literature and later interactive multi-player game environments.
This is very exciting. With this work and work developing around COLLADA as an interchange format for 3-D files, there's hope. We could preserve a very early text-based game like Hammurabi, a classic arcade game like Tempest, or virtual events in Second Life.

Preserving Virtual Worlds.

Sunday, August 19, 2007

trademarking for literary protection

Via TeleRead, an interesting article in the Financial Times on trademarking of literary characters, author names, etc., as a layer of protection after copyright runs out.

Quoting from the Financial Times, "Posthumous publishers who refuse to live and let die," August 16, 2007 (might be behind a paywall):

Because copyright comes to an end – 70 years after an author's death in Europe, sooner in the US – literary estates have turned to trademark registration for an extra layer of protection. Characters, book titles and authors' names have all been registered.

For dead authors who are still in copyright, trademarking may help estates keep control after the term ends, says intellectual property lawyer Laurence Kaye. "If you intend to republish a book that has gone out of copyright, you would have to do it in a way that did not infringe any trademarks."

IFP (Ian Fleming's estate) has registered everything from Ian Fleming to James Bond and Miss Moneypenny, so any attempt to reproduce the books without permission after they go out of copyright would meet difficulties.

Mr Kaye says: "You would have to manipulate the book so that there was nothing in it that infringed the registered trademarks."
The article also mentions the trend of dead authors such as Robert Ludlum and V. C. Andrews continuing to publish posthumously. Ludlum at least left behind outlines for work he wanted written after his death. I don't know what to think of the writer who has produced more than two dozen V. C. Andrews novels under her name in the twenty years since her death.

Friday, August 17, 2007

pynchon in semaphore

Over a year ago the artist Ben Rubin installed a piece ("San Jose Semaphore") on the Adobe building in San Jose, California with four LED semaphore wheels that broadcast a mystery text, accompanied by an audio component. It took over a year, but two men finally deciphered what the text is -- Thomas Pynchon's "The Crying of Lot 49."

From the San Jose Mercury News, 8/14/2007:

The solution was discovered by two Silicon Valley tech workers, Bob Mayo and Mark Snesrud, who received a commendation at San Jose City Hall today.

Using both the rotating disks and the art project's audio broadcast, they deciphered a preliminary code based on the James Joyce novel, "Ulysses," which was the key to solving the entire message. It took them about three weeks.

"It was not a real easy thing to figure out," said Snesrud, a chip designer for Santa Clara-based W&W Communications.

Ben Rubin, the New York artist who developed the project, applauded the duo's "computational brute force" in finding the message. "I'm especially glad the code was cracked and that it was done in a very classical way," Rubin said.

The Pynchon book, written in the mid-1960s, is set in a fictional California city filled with high-tech campuses. It follows a woman's discovery of latent symbols and codes embedded in the landscape and local culture, Rubin said.

The semaphore is made up of four 10-foot wide disks, which are composed of 24,000 light-emitting diodes. The disks each have a dark line going from one end to another and twirl around every eight seconds to create a new pattern. It made its debut on Aug. 7, 2006 as part of the ZeroOne digital art festival. Rubin said there are no plans to stop the semaphore or change its message - at least for the time being.

"It'll change the way people look at it," Rubin said of having the solution known. "Maybe in a few years, we'll revisit it."
The choice of text is inspired. I hope that he updates it.

Wednesday, August 15, 2007

radio interview about Google Book Search

Martha Sites, our AUL for Production and Technology, was interviewed along with Ben Bunnell of Google about the Google Book Search project.

Go to:

Scroll to "The Beat 20070810 1100 web.mp3" and select it (the scrolling is a bit tricky)

The interview starts at ~4:00 and ends at ~31:00.

Tuesday, August 14, 2007

Announcing the Fedora Commons

The news is finally out there:

Fedora Commons today announced the award of a four year, $4.9M grant from the Gordon and Betty Moore Foundation to develop the organizational and technical frameworks necessary to effect revolutionary change in how scientists, scholars, museums, libraries, and educators collaborate to produce, share, and preserve their digital intellectual creations. Fedora Commons is a new non-profit organization that will continue the mission of the Fedora Project, the successful open-source software collaboration between Cornell University and the University of Virginia.
There's also some staffing news, in addition to Sandy Payette becoming Executive Director:
Daniel Davis will lead Fedora core software development as chief architect, Thornton Staples will lead outreach efforts as director of community strategy and outreach starting on October 1, 2007, and Carol Minton Morris will serve as director of communications and media. Chris Wilper and Eddie Shin will continue in their roles as lead software developer for Fedora software and developer for Fedora software, respectively.
I'm going to miss having Thorny right down the hallway where I can wander down and think things through with him. He's staying local and will maintain an office at UVA, so we're not completely losing him.

The full press release is at the new web site:

If you're not familiar with Fedora yet, check out the Portfolio portion of the site to see examples of systems built using Fedora. You might also want to check out the first in a series of promotional videos that includes our UVA Library work.

100 Year Archive report

There's an article in InfoWorld Tech Watch -- "Entering the Digital Dark Ages?" -- that notes that we have entered "an era of unprecedented information gathering likely to leave no lasting impression on the future, thanks in large part to a cross-departmental lack of understanding of the business requirements for data archiving" according to a recent study conducted by the Storage Networking Industry Association's 100 Year Archive Task Force.

The article is brief and points out a few key issues, such as data archiving not being considered a valuable business service -- ironic given that some industries have record retention standards with time frames of 50 or 100 years.

While the report was anchored by an organization representing the physical storage business, there was a lot of participation from the records management and archival communities. This work is based on a survey to identify the requirements for "long-term" storage and retention. The survey results and quite a few respondent comments are included in the report. The next steps for the group include:

  • production of a reference model, similar to OAIS or the Sedona Guidelines, covering the storage domain of long-term retention;
  • creation of a "Self-Describing, Self-Contained Data Format" (SD-SCDF) for use as an archival information package in a trusted digital repository; and
  • extending the definition and use of the XAM (eXtensible Access Method) standard that SNIA is already working on.

The report, which was issued in January 2007, is one to read. Register at the Task Force site and you can download it as a PDF.

Literature in a Digital Age

Matt Kirschenbaum has an interesting article in the Chronicle of Higher Education -- “Hamlet.doc? Literature in a Digital Age.”

Dan Cohen comments on Matt's observations on how technology such as change tracking creates new possibilities for understanding the creative process, and how important standards will become. Another part of the article resonated even more with me:

The implications here extend beyond scholarship to a need to reformulate our understanding of what becomes part of the public cultural record. If an author donates her laptop to a library, what are the boundaries of the collection? Old e-mail messages, financial records, Web-browser history files? Overwritten or erased data that is still recoverable from the hard drive? Since computers are now ground zero for so many aspects of our daily lives, the boundaries between our creative endeavors and more mundane activities are not nearly as clear as they might once have been in a traditional set of author's "papers." Indeed, what are the boundaries of authorship itself in an era of blogs, wikis, instant messaging, and e-mail? Is an author's blog part of her papers? What about a chat transcript or an instant message stored on a cellphone? What about a character or avatar the author has created for an online game? The question is analogous to Foucault's famous provocation about whether Nietzsche's laundry list ought to be considered part of his complete works, but the difference is not only in the extreme volume and proliferation of data but also in the relentless way in which everything on a computer operating system is indexed, stamped, quantified, and objectified.
I remember the discussion of boundaries when we first started talking about archiving web sites. Where does a web site "end" when it has linkages to other sites? Within the same subdomain? Within the same domain? Do you include the pages that are linked to in other sites because they might provide important context?
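In crawler terms, the boundary question becomes a scope policy. A hypothetical sketch of such a check (the policy names are mine, and the same-domain test is deliberately crude -- real crawlers consult a public-suffix list):

```python
# Decide whether a linked URL belongs inside the archive of a seed site,
# under two illustrative boundary policies:
#   "subdomain" -- exact same host only
#   "domain"    -- same registered domain (crudely, the last two host labels)
from urllib.parse import urlparse

def in_scope(url, seed, policy="domain"):
    host = urlparse(url).hostname or ""
    seed_host = urlparse(seed).hostname or ""
    if policy == "subdomain":
        return host == seed_host
    # crude same-domain test; a real implementation would use a
    # public-suffix list to find the registered domain
    return host.split(".")[-2:] == seed_host.split(".")[-2:]
```

The "do you include linked pages on other sites for context" question would be a third policy: crawl one hop past the boundary and stop.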

Many years ago I served on the board of directors of a professional organization. As part of the organizational archive, I was asked to supply my print files, electronic documents, and my email archives when my service ended. At the time I was an obsessive file archiver and I could supply all my email from four different email addresses and two different environments (Compuserve and Eudora) as well as many snapshots of document versions and web sites over a seven year period. But those were official versions. Would I want every awkward draft of a report or a brochure saved for posterity? Is that really part of the organization's history?

While I think a lot about privacy and what an author might/should restrict access to (short-term or long-term) when leaving behind their digital legacy, there is so much potential for research. How does working on digital financial records differ from studying account ledgers? How does studying email differ from studying written correspondence or memoranda? Or blogs versus published editorials? They're the same research activities, just different media. Again, from Matt's article:
The wholesale migration of literature to a born-digital state places our collective literary and cultural heritage at real risk. But for every problem that electronic documents create — problems for preservation, problems for access, problems for cataloging and classification and discovery and delivery — there are equal, and potentially enormous, opportunities. What if we could use machine-learning algorithms to sift through vast textual archives and draw our attention to a portion of a manuscript manifesting an especially rich and unusual pattern of activity, the multiple layers of revision captured in different versions of the file creating a three-dimensional portrait of the writing process? What if these revisions could in turn be correlated with the content of a Web site that someone in the author's MySpace network had blogged?
Yes, there are definitely issues in accessing file formats as they age. When I rediscovered my single-sided original Mac disks from the mid-80s with my MA research and thesis written in MacWrite 1.0, or 5 1/4" disks with documentation that I wrote in 1991 in WordPerfect, I had to call in favors from folks with vintage Mac and PC hardware and buy Conversions Plus software to get at the file content (not fully successfully). I was incredibly lucky that the media could be read at all, let alone that the files could be converted. Let us not even speak of the versions of files over time that I lost on Mac Zip disks that were accidentally discarded in a move. There went part of the history of the organization that I mentioned above.

There is a lot of education needed about preserving digital output and the file and media standards to be used. I look forward to seeing the work of Maryland's X-Lit project.

Saturday, August 11, 2007


Reading a review of this year's SIGGRAPH, I learned about COLLADA:

COLLADA is a COLLAborative Design Activity for establishing an open standard digital asset schema for interactive 3D applications. It involves designers, developers, and interested parties from within Sony Computer Entertainment America (SCEA) as well as key third-party companies in the 3-D industry. With its 1.4.0 release, COLLADA became a standard of The Khronos Group Inc., where consortium members continue to promote COLLADA to be the centerpiece of digital-asset toolchains used by the 3-D interactive industry.

COLLADA defines an XML database schema that enables 3-D authoring applications to freely exchange digital assets without loss of information, enabling multiple software packages to be combined into extremely powerful tool chains.

However, COLLADA is not merely a technology, as technology alone cannot solve this communication problem. COLLADA has succeeded in providing a neutral zone where competitors work together in the design of a common specification. This creates a new paradigm in which the schema (format) is supported directly by the digital content creation (DCC) vendors. Each of them writes and supports their own implementation of COLLADA importer and exporter tools.

COLLADA is an XML schema, combined with its COMMON profile, that can be exchanged between proprietary software packages and open source programs, giving more control over digital assets. The list of products that support COLLADA is impressive; some support it through plugins and others directly import or export the format. It's a truly open exchange format.
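Because COLLADA is plain XML, any XML toolchain can get at an asset's metadata. A stripped-down, not-quite-valid fragment parsed with Python's standard library (the authoring-tool name is invented; the namespace is COLLADA 1.4's):

```python
# Parse asset metadata out of a minimal COLLADA-style document.
# The fragment below is for illustration only -- a real COLLADA file
# carries much more (geometry libraries, scenes, required asset dates).
import xml.etree.ElementTree as ET

doc = """\
<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.0">
  <asset>
    <contributor><authoring_tool>SketchTool</authoring_tool></contributor>
    <up_axis>Y_UP</up_axis>
  </asset>
</COLLADA>"""

ns = {"c": "http://www.collada.org/2005/11/COLLADASchema"}
root = ET.fromstring(doc)
up_axis = root.find("c:asset/c:up_axis", ns).text  # "Y_UP"
```

That transparency is exactly what makes it attractive as a preservation format: the asset description is inspectable without the authoring tool.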

Having worked in instructional technology at a school of architecture, I saw the difficulties in exchanging files between individuals and projects first hand. This is a huge step for design practice and as a potential preservation format for these files. What a great possibility for preserving and sharing these files in repositories.

Friday, August 10, 2007

wikipedia trustworthiness

There was a brief article in the Chronicle of Higher Ed last week that I didn't spot until yesterday -- UC Santa Cruz researchers have developed a simple yet clever test of the trustworthiness of Wikipedia article authors:

... the researchers analyzed Wikipedia’s editing history, tracking material that has remained on the site for a long time and edits that have been quickly overruled. A Wikipedian with a distinguished record of unchanged edits is declared trustworthy, and his or her contributions are left untouched on the Santa Cruz team’s color-coded pages. But a contributor whose posts have frequently been changed or deleted is considered suspect, and his or her content is highlighted in orange.
It's a demo with only a few hundred pages, but it's still a very interesting proof of concept. Of course the software cannot do actual fact checking to vet content, but it's an elegant method for assessing the trustworthiness of the people who are the source of the content. It's simplistic in a way -- an author could be an expert in one area but not in others, or be overruled due to personality issues rather than authoritativeness -- but it's worth reviewing for the process and for the presentation of an article's authority ranking through color coding.
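The scoring idea reduces to edit survival rates. A toy version (the threshold and edit records are invented; the actual UCSC algorithm is surely more sophisticated):

```python
# Score an editor by the fraction of their past edits that survived,
# then flag text from low-trust contributors, echoing the orange
# highlighting in the Santa Cruz demo. Values here are illustrative.

def trust(edits):
    """edits: list of True (edit survived) / False (edit was reverted)."""
    return sum(edits) / len(edits) if edits else 0.0

def highlight(text, editor_trust, threshold=0.5):
    """Mark text when its contributor's record looks suspect."""
    return text if editor_trust >= threshold else f"[suspect] {text}"

veteran = trust([True] * 9 + [False])         # long record, rarely reverted
newbie = trust([True, False, False, False])   # mostly overruled
```

The one-dimensional score is where the "expert in one area but not others" objection bites: a per-topic trust table would be the obvious refinement.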

A conference paper describing the work is available.