Sunday, November 30, 2008

costs of operating a data center

In recent weeks I've been in a number of meetings about storage architecture. In comparing potential solutions for a particular project's needs, we've had a lot of discussion about requirements for space, cooling, power, and so on, which are some of the metrics for comparing the proposals. Today I came across a posting by James Hamilton from Microsoft's Data Center Futures Team on misconceptions about the cost of power in large-scale data centers.

There is an error in the table for the two amortization entries: the 180-month amortization of the facility says 3 years in the notes column when it should say 15, and the 36-month amortization of the servers says 15 years when it should say 3. The right numbers are used in the calculation -- it's just the explanatory notes that are switched.
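The arithmetic behind those two entries is simple enough to sketch. Here's a trivial, hypothetical Java version of the straight-line monthly amortization the table is based on -- the dollar figures below are made up, not Hamilton's, and his spreadsheet (if I recall correctly) also charges a cost of money, which I've left out:

```java
// Toy numbers only -- not the figures from Hamilton's table. This just
// illustrates why 180 months of facility amortization corresponds to 15
// years, and 36 months of server amortization to 3 years.
public class AmortizationSketch {
    public static void main(String[] args) {
        double facilityCost = 200e6; // hypothetical facility build-out cost
        int facilityMonths = 180;    // 180 / 12 = 15 years
        double serverCost = 40e6;    // hypothetical server spend
        int serverMonths = 36;       // 36 / 12 = 3 years

        System.out.printf("Facility: $%,.0f/month over %d years%n",
                facilityCost / facilityMonths, facilityMonths / 12);
        System.out.printf("Servers:  $%,.0f/month over %d years%n",
                serverCost / serverMonths, serverMonths / 12);
    }
}
```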

There are some very interesting calculations that I plan to forward to other folks. One commenter points out that the figures and graph don't take staffing into account, which, according to Hamilton, is because that cost is a very small percentage of the overall cost. At Microsoft's scale that's probably true. I guess we're a relatively small- to medium-scale data center, because we are definitely taking staffing into consideration. And, while the cost of power may be a lower percentage of overall cost than previously thought, power capacity is still definitely a factor when you're considering adding a number of servers and switches.

Saturday, November 29, 2008

Jamie Boyle on public domain

Jamie Boyle's book The Public Domain: Enclosing the Commons of the Mind has been published by Yale University Press, and is also available for free download under a Creative Commons license.

I've seen Jamie Boyle speak two or three times, and I consider him a very important voice in the discussions on the public domain, intellectual property, patents, the economics of same, and their place in technology and culture.

Read this book.

Wednesday, November 26, 2008

LuSql - Lucene indexing of DBMS records

The release of LuSql has been announced on a few email lists:

LuSql is a high-performance, simple tool for indexing data held in a DBMS into a Lucene index. It can use any JDBC-aware SQL database.

It includes a tutorial with a series of increasingly complex use cases, showing how to index article metadata held in a series of MySQL tables and how to index full text held in file system files.

It has been tested extensively, including using 6.4 million metadata and full-text records to produce an 86GB index in 13.5 hours.

It is licensed under the Apache 2.0 license.

release: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
tutorial: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html
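For anyone who hasn't wired a database to Lucene before, the loop that LuSql automates looks roughly like the sketch below. This is not LuSql's own code or API -- just a minimal, hand-rolled version of the JDBC-to-Lucene pattern using the Lucene 2.x-era API, with a hypothetical MySQL `articles` table and field names. LuSql layers configuration, filters, and multithreading on top of this basic idea.

```java
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch of the JDBC-to-Lucene pattern that LuSql automates.
// The table and column names ("articles", "id", "title", "abstract")
// and the connection details are hypothetical.
public class JdbcToLucene {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // older drivers need explicit registration
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/journals", "user", "password");

        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/demo-index"),
                new StandardAnalyzer(),
                true,                                  // create a new index
                IndexWriter.MaxFieldLength.UNLIMITED);

        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, title, abstract FROM articles");
        while (rs.next()) {
            Document doc = new Document();
            // Store the primary key so search results can link back to the DBMS row.
            doc.add(new Field("id", rs.getString("id"),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", rs.getString("title"),
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("abstract", rs.getString("abstract"),
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        rs.close();
        st.close();
        conn.close();

        writer.optimize(); // optional merge for faster searching
        writer.close();
    }
}
```

Doing this row by row in a single thread is exactly what becomes painful at the 6.4-million-record scale mentioned above, which is where a dedicated tool earns its keep.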

Tuesday, November 25, 2008

commentary on why google must die

John Dvorak wrote an essay in PC Magazine entitled "Why Google Must Die." It's a pithy article on search engine optimization (SEO) and the SEO tricks that are in play to work best with Google or get around a Google feature. This is an essay that I never would have noticed had it not been referenced in a posting by Stephen Abram that I very much took notice of, also entitled "Why Google Must Die."

His post is a response to the often-heard suggestion that OPACs, federated search, and web site search engines should be "just like Google." He asks what should be implemented first:

1. Should I start manipulating the search results of library users based on the needs of advertisers who pay for position?
2. Should I track your users' searches and offer different search results or ads based on their private searches?
3. Should I open library OPACs and searches to 'search engine optimization' (SEO) techniques that allow special interest groups, commercial interests, politicians (as we've certainly seen with the geotagged searches in the US election this year), racist organizations (as in the classic MLK example), or whatever to change results?
4. Should I geotag all searches, using Google Maps, coming from colleges, universities or high schools because I can ultimately charge more for clicks coming from younger searchers? Should I build services like Google Scholar to attract young ad viewers and train or accredit librarians and educators in the use of same?
5. Should I allow the algorithm to override the end-user's Boolean search if it meets an advertiser's goal?
6. "Evil," says Google CEO Eric Schmidt, "is what Sergey says is evil." (Wired). Is that who you want making your personal and institutional values decisions?
There's more to the post. I admire a forthright post like this that pushes back on the assertion that doing things the Google way is automatically better.

I used to have lengthy discussions with a library administrator in a past job who wanted image searching to be just like Google Images, because searches on a single word like "horse" would always produce images of horses at the top of the results. It took a lot of effort to explain that this was somewhat artificial: given the sheer number of images, and in the absence of descriptive metadata, having the string "horse" in the file name would ensure those images appeared near the top of the list -- Google didn't actually recognize that an image was of a horse. Sorry, we really did need to expend effort on descriptive metadata.

SearchCampDC

The Library of Congress is hosting the SearchCampDC barcamp on Tuesday, December 2. From the wiki:

The goal of SearchCampDC is to bring together people working on and using search-related technologies in and around Washington DC. We all rely on tools like Google in our daily lives, but as IT professionals we often need to build or integrate search technologies into applications and the enterprise.

Search technologies typically rely on a wide variety of techniques such as fulltext indexing, geocoding, named entity recognition, data mining, natural language parsing, machine learning, and distributed processing. Perhaps you've used a piece of technology, and care to share how well it worked? Or perhaps you help develop a search-related tool that you'd like to get feedback on? Or maybe you've got a search-itch to scratch, and want to learn more about how to scratch it? If so, please join us for SearchCampDC.

The idea for this event came about from a happy coincidence that several Lucene, Solr, OpenLayers and TileCache developers were going to be in the DC area at the same time. The idea is to provide them (and hopefully others like you) a space for short & sweet presentations about stuff they are working on, and also to provide a collaborative space for people to try things out, share challenges, ideas etc.

Monday, November 24, 2008

europeana a victim of its own success

There are a number of articles documenting the wild success and consequent server failure of the Europeana digital library: Times Online, PublicTechnology.net, and Yahoo Tech. The development site that documents the project is still available.

This is a cautionary tale for those of us who are working on the World Digital Library project, which is set to launch in April 2009. We know that there is a potentially high level of interest in a multi-lingual international digital collection site -- albeit one with a much smaller initial collection -- and seeing this confirms for us that making plans for mirroring is a necessity.

Thursday, November 20, 2008

Europeana

The prototype site for Europeana, the European digital library funded by the EC, is set to launch today, November 20, 2008.

The initial collection of 2 million items comes from museums, libraries, archives, and audio-visual collections, and includes paintings, maps, videos and newspapers. The interface is in French, English, and German, with more languages planned. Highlights include the Magna Carta from Britain, the Vermeer painting “Girl With a Pearl Earring” from the Mauritshuis Museum in The Hague, a copy of Dante’s “Divine Comedy,” and newsreel footage of the fall of the Berlin Wall. Content is free of copyright restrictions.

New York Times and EU Business articles provide more details. The goal is to add 10 million items into the digital library by 2010, at a price of 350 million to 400 million euros.

Wednesday, November 19, 2008

LIFE photo archive in google images

Google has launched a hosted collection of newly-digitized images that includes photos and etchings produced and owned by LIFE Magazine.

Apparently only a small percentage of these images have ever been published; the remainder come from LIFE's photo archive. Google is digitizing them: 20 percent of the collection is online, and they are working toward the goal of having all 10 million photos online.

There are some great Civil War, WWI, WWII and Vietnam war images, portraits of Queen Victoria and Czar Nicholas, civil rights-era documentation, and early images of Disneyland. There are photographs by Mathew Brady, Alexander Gardner, Margaret Bourke-White, Dorothea Lange, Alfred Eisenstaedt, Carl Mydans, and Larry Burrows, among others.

Tuesday, November 18, 2008

cataloging flash mob

I love the idea of a flash mob volunteer effort to catalog the book and videotape collection at St. John's Church in Beverly Farms, Massachusetts. There's more from one of the volunteers.

I'd like to see this effort replicated at small historical societies, house museums, and other organizations that are often supported only by a small cadre of volunteers.

Monday, November 17, 2008

facebook repository deposit app using sword

From Stuart Lewis' blog comes word of a Facebook app -- SWORDAPP -- for depositing content into SWORD-enabled repositories. It's meant to encourage social deposit: notice of your deposits goes out in your Facebook newsfeed, and you can receive news of your friends' deposits. You have to already be eligible to authenticate and deposit into a repository somewhere.

This is an interesting use of SWORD in an application that lives in a really different context. He's looking for testers and feedback.

Sunday, November 16, 2008

arl guide to the google book search settlement

The Association of Research Libraries has created "A Guide for the Perplexed: Libraries & the Google Library Project Settlement," a 23-page document intended to help libraries understand the impact of the proposed Google Book Search settlement.

oxford university preserv2 project session at dlf

My brief, unstructured notes from a presentation by Sally Rumsey from Oxford University on the Preserv2 project, at the DLF Fall 2008 Forum.

  • Effort focused primarily on IRs with research data, but applies to all types of repositories.
  • A “boxed up -- but innovative at the time -- repository environment.” There are 6 different repositories at Oxford, multiple Fedora & EPrints instances, some sharing the same storage.
  • A desire for decoupled services -- distributed storage and transportable data. Scalability issue.
  • Implemented the British National Archives' “seamless flow” approach to preservation. There are three categories of activities: characterization (inventory and format identification); preservation planning and technology watch (using the PRONOM technology watch service); and preservation actions (file migration, rendering tools, etc., based on repository policy).
  • Smart storage: the ability to address objects in open storage through a repository layer or directly in the storage system. DROID was used as the tool to verify new items as they are stored and to continually check files in storage. They are implementing a scheduler.
  • Oxford looking at Sun Honeycomb server architecture.

exposing harvard digital collections to web crawlers session at dlf

My somewhat unstructured notes from a presentation by Roberta Fox from Harvard, at the DLF Fall 2008 Forum.

  • Barriers to exposing their collections from their legacy applications: session-based, frames, form-driven, non-compliant coding, URLs with lots of parameters.
  • Crawlers couldn’t get past the first page of any of their services (VIA, OASIS, PDS, TED, etc.).
  • Concerns: server load is an issue with dynamic, database-driven sites.
  • But exposure to crawlers was declared a priority.
  • Added robots directives on every page to identify how each page should be handled (index or noindex, follow links or not, etc.) to control server use.
  • Slowed down the crawl using Google Webmaster Tools to avoid major server hits.
  • Added alt and title tags to provide context for pages/items found through external search (the original design assumed access would come through the Harvard University Library portal context). They also added links on all pages to additional context, e.g. the full preferred presentation, to assist users who land on a page through Google rather than through the HUL context, so they can find their way around.
  • Crawler friendly: generated a static site map for key dynamic pages, and the site map is updated weekly (a rough sketch of this approach follows these notes).
  • Updated, simplified URL structure for deep pages.
  • Access to the Page Delivery Service was the most challenging, for OCR’ed text. Generated a crawler-friendly page for each text in addition to the original, unfriendly frames version (there is no priority to rewrite the app to remove frames).
  • It is somewhat burdensome to create index pages that point to all the items in a database (such as all the images in VIA) on a weekly basis, but it’s better than no access – it’s an automated creation process and doesn’t take up that much room on the server.
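The static site map approach Harvard described is easy to reproduce for any database-driven collection. Here's a rough sketch (not Harvard's code -- the table, URL pattern, and output path are all hypothetical) that writes a sitemap.xml in the standard sitemaps.org format; a weekly cron job could regenerate it:

```java
import java.io.PrintWriter;
import java.sql.*;

// Hypothetical sketch: regenerate a static sitemap.xml for the records in a
// database-driven collection so that crawlers can reach "deep" dynamic pages.
public class SitemapGenerator {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/catalog", "user", "password");
        Statement st = conn.createStatement();
        ResultSet rs = st.executeQuery("SELECT record_id, last_modified FROM records");

        PrintWriter out = new PrintWriter("/var/www/html/sitemap.xml", "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        while (rs.next()) {
            // One <url> entry per record; the URL pattern here is made up.
            out.println("  <url>");
            out.println("    <loc>http://collections.example.edu/record/"
                    + rs.getString("record_id") + "</loc>");
            out.println("    <lastmod>" + rs.getDate("last_modified") + "</lastmod>");
            out.println("  </url>");
        }
        out.println("</urlset>");
        out.close();
        rs.close();
        conn.close();
    }
}
```

Note that the sitemaps.org protocol caps a single file at 50,000 URLs, so a collection the size of VIA would need to be split across several files tied together by a sitemap index.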

ncsu course views session at dlf

My somewhat unstructured notes from a presentation by Tito Sierra and Jason Casden from NCSU on the Course Views tools, at the DLF Fall 2008 Forum.

  • Their goal is to create a web page for every single course at NCSU.
  • Integration targets: Blackboard Vista (hard to work with), Moodle (easy to work with), WolfWare (internal tool, easy to integrate), and the library web site.
  • Implemented using object-oriented PHP, and the front-end design makes use of YUI, jQuery, and CSS. RESTful requests to the Widget System, and RESTful URIs.
  • They screen scrape the course directory to build the database of course titles.
  • There is an issue with the course reserves system so they are not currently embedding reserves, just linking.
  • Built their own mini-metasearch which searches one vendor database set for each general discipline (created by subject librarians, who identify one vendor and a set of their databases to be searched).
  • The subject librarians create a “recommended” resource list for each course.
  • Real issues with customization, especially for sections of large courses.
  • Balance of fully custom, hand-made versus fully automated. Librarians given a number of widgets to select from, some of which are canned and some of which can be customized.
  • Reserves used the most by far, then search, then the “recommended” widget.
  • Customized widgets used most (the 3 listed above).

djatoka jpeg 2000 image server session at dlf

My brief, unstructured notes from a presentation by Ryan Chute from Los Alamos National Labs on the Djatoka JPEG 2000 Image Server, at the DLF Fall 2008 Forum.

  • APIs and libraries: Kakadu SDK, Sun Java Advanced Imaging, NIH ImageJ, OCLC OpenURL OOM. Could work with codecs other than Kakadu, including Aware.
  • Viewer adapted from a British Library implementation.
  • Features: resolution and region extraction, rotation, support for rich set of input/output formats, extensible interfaces for image transformations, including watermarking. Persistent URLs for images and for individual regions.
  • URI-addressability for specified regions was needed, and OpenURL provided a standardized service framework foundation for requesting regions. This extends the OpenURL use cases (see the sketch following these notes).
  • Service example: Ajax-based client requests, obtains metadata and regions from image server using image identifier, client iterates through tiles and metadata to display.
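As an illustration of what "OpenURL as the service framework for region requests" looks like in practice, here is a sketch that just builds such a request URL. The resolver path and svc.* parameter names below are my approximation of djatoka's getRegion service as described in the session -- treat them as assumptions and check the released documentation for the real ones:

```java
import java.net.URLEncoder;

// Illustrative only: builds an OpenURL-style request for a region of a
// JPEG 2000 image. The resolver path, identifier, and svc.* parameter
// names are assumptions modeled on djatoka's getRegion service.
public class RegionUrlExample {
    public static void main(String[] args) throws Exception {
        String resolver = "http://example.org/adore-djatoka/resolver";
        String imageId  = "info:example-repo/image/12345"; // hypothetical identifier

        String url = resolver
                + "?url_ver=Z39.88-2004"
                + "&rft_id=" + URLEncoder.encode(imageId, "UTF-8")
                + "&svc_id=" + URLEncoder.encode("info:lanl-repo/svc/getRegion", "UTF-8")
                + "&svc.format=image/jpeg"
                + "&svc.level=3"              // resolution level to extract
                + "&svc.rotate=0"
                + "&svc.region=0,0,256,256";  // region to extract from the image

        // Because the request is a plain, parameterized URL, it can double as a
        // persistent link to that individual region.
        System.out.println(url);
    }
}
```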

university of texas repo atom use session at dlf

My somewhat unstructured notes from a presentation by Peter Keane from UT Austin on the use of Atom and AtomPub in their DASe repository, at the DLF Fall 2008 Forum.

  • DASe project: lightweight repository, 100+ collections, 1.2 million files, 3 million metadata records.
  • DASe has replaced their image reserves system. Home-grown (“built instead of borrowed”), originally prototyped in 2004/2005.
  • They didn’t originally plan to build a repository, they were building an image slideshow and ended up with a repository, too.
  • It’s a data-first application. Data comes from spreadsheets, FM, Flickr, iPhoto, file headers, etc. The system includes a variety of different collection-based data models. They needed to map to/from standard schemas. Data is accepted as is, with no normalization or enrichment at all.
  • SynOA: Syndication-Oriented Architecture. They recognized the importance of being RESTful; DASe is a REST framework.
  • They use the Atom Publishing Protocol to represent collections, items, and searches. It is used internally between services, including upload and ingest (via HTTP GET, POST, etc.). Everything is Atom with a UI (Smarty PHP templates) on top of it (a generic AtomPub sketch follows these notes).
  • Working on a Blackboard integration.
  • Interesting use of Google spreadsheets -- they create a Google spreadsheet for whatever name/value pairs they have, it automatically outputs Atom, and they can ingest from the feed.
  • No fielded search across collections, only within a single collection. They could map across data models to a common standard, but haven’t. (corrected as per comment below)
  • Repositories were considered a door to libraries, with everyone trying to create a better door. This is not the right concept; instead, repositories should expose content in a standard way to any and all services.
  • Loves REST; used the term “RESTafarian.”
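To make the "everything is Atom" point concrete, here is a minimal AtomPub-style ingest using plain java.net classes. The collection URL and entry content are hypothetical rather than DASe's actual service documents and URI layout, but the handshake -- POST an Atom entry to a collection URI and get back 201 Created with a Location header for the new member -- is standard AtomPub (RFC 5023):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal AtomPub ingest sketch (RFC 5023): POST an Atom entry to a
// collection URI. The URL and entry content are hypothetical, not DASe's.
public class AtomPubDeposit {
    public static void main(String[] args) throws Exception {
        String entry =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<entry xmlns=\"http://www.w3.org/2005/Atom\">\n" +
            "  <title>Sample item</title>\n" +
            "  <author><name>Example Depositor</name></author>\n" +
            "  <content type=\"text\">Descriptive metadata goes here.</content>\n" +
            "</entry>\n";

        URL collection = new URL("http://repo.example.edu/dase/collections/images");
        HttpURLConnection conn = (HttpURLConnection) collection.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/atom+xml;type=entry");

        OutputStream out = conn.getOutputStream();
        out.write(entry.getBytes("UTF-8"));
        out.close();

        // A successful AtomPub create returns 201 Created and a Location header
        // pointing at the newly created member resource.
        System.out.println(conn.getResponseCode() + " "
                + conn.getHeaderField("Location"));
        conn.disconnect();
    }
}
```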

creative commons non-commercial use session at dlf

My somewhat unstructured notes from a presentation by Virginia Rutledge, an attorney from Creative Commons, at the DLF Fall 2008 Forum.

  • Copyright is a bundle of rights. She went over this in some detail for those who are less familiar.
  • Creative Commons exists to support the ability to share, remix, and reuse, legally.
  • Example of recent Library use: the entire UCLA Library web site is under a CC license to clarify its content re-use status.
  • The Creative Commons definition of non-commercial is tied to the intent of the user -- no intent towards commercial advantage or private monetary compensation. BUT, there is no single definition of non-commercial.
  • They are undertaking a research project in many phases. The first phase (done) was focus groups, which identified 4 communities: arts, education, web, and science. This proved to be a VERY bad idea, as the boundaries are actually way too fuzzy and interdisciplinary. The work invalidated the assumption that they could do this on a community-by-community basis.
  • A number of issues of importance to rights holders in allowing non-commercial use were identified in the discussions: Is there a perceived economic value? Who is the user -- an individual or an organization? Non-profit or not? Is any money generated? Is access supported by advertising or not? Is the use for the “public good” -- for charity/education? What is the amount of distribution? Will the work be used in part or in whole? Is this use by a “competitor”?
  • There are also subjective issues: Is it an objectionable use? Is it perceived as fair use?
  • Personal creator and personal use versus institutional ownership and use is a distinction that really makes a difference to people, but has no meaning in US law.
  • Some of the confusion over how to define "non-commercial" comes from not understanding what activities a prohibition on commercial use actually prohibits. Most rights holders don’t really want to prohibit all commercial uses, just some, and it varies wildly by person/organization.
  • Based on the research so far, there is no checklist they can come up with.
  • As of November 17, a poll will be available online, and they are encouraging librarians to participate.

google book search session at dlf

I was going to spend some time transforming my notes from Dan Clancy's session on Google Book Search from the DLF Fall 2008 Forum into more coherent prose, but for the sake of timeliness, I'm going to post them as is.

  • 20% of the content in Google Book Search is in the public domain, 5% is in print, and the rest is in an unknown “twilight zone” -- unknown status and/or out-of-print.
  • 7 million books scanned, over 1 million are public domain, 4-5 million are in snippet view.
  • Early scanning was not performed at an impressive rate, and it took way longer than expected to set up.
  • Priorities are improving search quality and exposure through google.com.
  • Search is definitely not solved and “done,” and is harder given the big distribution of relatively successful hits.
  • They are working to improve the quality of scanning and the algorithm to process the books and improve usability. They admit that they still have work to do, especially with the re-processing of older scans.
  • The data supports the Long Tail model.
  • Creating open APIs, including one to determine the status of a book, and a syndicated viewer that can be embedded.
  • Trying to identify the status of orphans, and release a database of determinations. But institutions need to use determinations to guide their decisions, not just follow them because “Google said so.”
  • On the proposed settlement agreement: Google thought they would benefit users more to settle than to litigate.
  • The class is defined as anyone in the U.S. with a copyright interest in a book, for U.S. use (no journals or music).
  • For all books in copyright, Google is allowed to scan, index, and provide varying access models dependent upon the status of the book -- if in print or out-of-print. Rights holders can opt out.
  • 4 access models: consumer digital purchase (in the cloud, not downloads -- downloads are not specifically included in the agreement); free preview of up to 20% of a book; institutional subscription for the entire database (site license with authentication, can be linked into course reserves and course management systems); and public access terminals for public libraries or higher ed institutions that do not want to subscribe (1 access point in each public library building, some # by FTE for higher ed institutions), which allow printing (for 5 years or $3 million underwriting of payments to rights holders).
  • Books Rights Registry to record rights, handle payments to rights holders. It can operate on behalf of other content providers, not just Google.
  • Plan to open up government documents, because they feel that the rights registry organization will deal with the issue of possible in-copyright content included in gov docs, which kept them from opening gov docs before.
  • Admits that publishers and authors do not always agree if publishers have the rights for digital distribution of books. Some authors are adamant that they did not assign rights, some publishers are adamant that even if not explicit, it's allowed. The settlement supposedly allows sharing between authors and publishers to cover this.
  • What is “Non-consumptive research”? OCR application research. Image processing research. Textual analysis research. Search development research. Use of the corpus as a test corpus for technology research, not research using the content. 2 institutions will run data centers for access to the research corpus, with financial support from Google to set up the centers.
  • What about their selling books back to the libraries that contributed them via subscriptions? They will take the partnership and amount of scanning into account and provide a subsidy toward a subscription. Stanford and Michigan will likely be getting theirs free. Institutions can get a free limited set of their own books for the length of the copyright of the books. They can already do whatever they want with their public domain books.
  • They will not necessarily be collecting rights information/determinations from other projects for the registry. In building the registry, they are including licensed metadata (from libraries, OCLC, publishers, etc.), so they cannot publicly share all the data that will make up the registry. But they will make public the status of books that are identified/claimed as in copyright.
  • If Google goes away or becomes “evil Google,” there is lots of language in contracts and settlement for an out.
  • The settlement is U.S. only because the class in the suit was U.S. only. Non-U.S. terms are really challenging because many countries have no concept of class-action, and there is a wide variation of laws.
  • A notice period begins January 5. Mid 2009 is the earliest time this could be approved by the court.

Friday, November 14, 2008

omeka 0.10

Omeka v 0.10 has been released. Omeka 0.10b incorporates many requested changes: an unqualified Dublin Core metadata schema and fully extensible element sets to accommodate interoperability with digital repository software and collections management systems; elegant reworkings of the theme and plugin APIs to make add-on development more intuitive and more powerful; a new, even more user friendly look for the administrative interface; and a new and improved Exhibit Builder.

scholastic books flickr set

I came across a link to a flickr set that someone has created with covers and illustrations from Scholastic Book Services books from the 1960s and 1970s. I _loved_ it when the Scholastic book order forms were distributed, and I always ordered something like a dozen books every time. This set includes a number of books that I know I owned, and even a very few that I _still_ own. This person has a great collection.

EDIT: Here's another flickr set.

Greenstone release

Greenstone v2.81 has been released. Improvements include handling filenames that include non-ASCII characters, accent folding switched on by default for Lucene, and character-based segmentation for CJK languages. There are many other significant additions, including the Fedora Librarian Interface (analogous to GLI, but working with a Fedora repository).

Rome Reborn and Google Earth

BBC News reported on a release of a collaboration between Google Earth and the Rome Reborn project. Ancient Rome is the first historical city to be added to Google Earth. The model contains more than 6,700 buildings, with more than 250 place marks linking to key sites in a variety of languages.

roman de la rose digital library

Johns Hopkins University and the Bibliothèque nationale de France have announced that the Roman de la Rose Digital Library is available at http://romandelarose.org/. The goal is to bring together digital surrogates of all of the approximately 270 extant manuscript copies of the Roman de la Rose. By the end of 2009 they expect to have 150 versions included in the resource. There is an associated blog available at http://romandelarose.blogspot.com/.

I am particularly interested in the pageturner and image browser that they used -- the FSI Viewer, a Flash-based tool. It seems to work with TIF, JPG, FPX, and PDF (but not JPEG2000?), and converts files to multi-resolution TIFs. It's a very intuitive interface.

Monday, November 10, 2008

photoshop ui rendered in real-world objects

Via BoingBoing, the UI for Photoshop recreated with real objects, created by the agency Bates 141 in Jakarta for Software Asli. Follow the links to the image and to the "making of" flickr set.