Sunday, November 16, 2008

exposing harvard digital collections to web crawlers session at dlf

My somewhat unstructured notes from a presentation by Roberta Fox of Harvard at the DLF Fall 2008 Forum.

  • Barriers to exposing their collections from their legacy applications: session-based interfaces, frames, form-driven navigation, non-compliant coding, URLs with lots of parameters.
  • Crawlers couldn’t get past the first page of any of their services (VIA, OASIS, PDS, TED, etc.).
  • Concerns: server load was an issue, since the sites are dynamic and database-driven.
  • But exposure to crawlers was declared a priority.
  • Added robots meta tags on every page to specify how each should be handled (index/noindex, follow/nofollow, etc.) and to control server use; a rough sketch of this kind of tagging follows these notes.
  • Slowed the crawl rate using Google Webmaster Tools to avoid major server hits.
  • Added alt and title tags to provide context for pages/items found through external search (the original design assumed access would come through the Harvard University Library portal context). They also added links on all pages to provide additional context, e.g. to the full preferred presentation, to help users who land on a page through Google rather than through the HUL context find their way around.
  • Crawler friendly – generated a static site map for key dynamic pages, updated weekly (a sitemap-generation sketch follows these notes).
  • Updated, simplified URL structure for deep pages.
  • Access to the Page Delivery Service was the most challenging, for OCR’ed text. Generated a crawler-friendly page for each text in addition to the original, crawler-unfriendly frames version (there is no priority to rewrite the app to remove frames).
  • It is somewhat burdensome to create index pages that point to all the items in a database (such as all the images in VIA) on a weekly basis, but it’s better than no access – the creation process is automated and doesn’t take up that much room on the server (a rough index-generation sketch follows these notes).
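
Fox didn’t show any code, but here’s a rough sketch of what the per-page robots tagging described above might look like; the page types and the index/noindex, follow/nofollow policy mapping are my own guesses, not Harvard’s actual configuration.

```python
# Hypothetical sketch of per-page robots meta tags (index/noindex,
# follow/nofollow). Page types and the policy mapping are assumptions
# for illustration, not Harvard's actual rules.

ROBOTS_POLICY = {
    "record": "index, follow",         # item/record pages: fully crawlable
    "search_form": "noindex, follow",  # forms: don't index, but follow links
    "session": "noindex, nofollow",    # session-bound pages: keep crawlers out
}

def robots_meta(page_type: str) -> str:
    """Return the robots meta tag to embed in a page's <head>."""
    content = ROBOTS_POLICY.get(page_type, "noindex, nofollow")
    return f'<meta name="robots" content="{content}">'

if __name__ == "__main__":
    for page_type in ROBOTS_POLICY:
        print(page_type, "->", robots_meta(page_type))
```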
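
Likewise, the weekly static site map could be generated with something as simple as the sketch below; the base URL and the get_record_ids() stub are placeholders, not the real Harvard services.

```python
# Minimal sketch of generating a static sitemap for key dynamic pages,
# meant to run on a weekly schedule (e.g. from cron). The URL pattern and
# get_record_ids() are placeholders for a real query against the database.
from datetime import date

BASE_URL = "https://example.harvard.edu/catalog/record"  # placeholder URL

def get_record_ids():
    """Stand-in for a query against the collection database."""
    return ["olvwork1001", "olvwork1002", "olvwork1003"]

def write_sitemap(path="sitemap.xml"):
    today = date.today().isoformat()
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for record_id in get_record_ids():
        lines.append("  <url>")
        lines.append(f"    <loc>{BASE_URL}/{record_id}</loc>")
        lines.append(f"    <lastmod>{today}</lastmod>")
        lines.append("  </url>")
    lines.append("</urlset>")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

if __name__ == "__main__":
    write_sitemap()
```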
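
And the weekly index pages pointing at every item in a database (e.g. all the images in VIA) could be produced along these lines; again, the item URLs, titles, and page size here are invented for illustration.

```python
# Rough sketch of regenerating static index pages that link to every item
# in a collection database. Record IDs, titles, and paths are placeholders.
import os

ITEM_URL = "https://example.harvard.edu/via/item"  # placeholder URL pattern
PAGE_SIZE = 500  # items per index page, an arbitrary choice

def get_items():
    """Stand-in for pulling (id, title) pairs from the collection database."""
    return [(f"img{n:05d}", f"Image {n}") for n in range(1, 1201)]

def write_index_pages(out_dir="crawl_index"):
    os.makedirs(out_dir, exist_ok=True)
    items = get_items()
    for page, start in enumerate(range(0, len(items), PAGE_SIZE), start=1):
        chunk = items[start:start + PAGE_SIZE]
        links = "\n".join(
            f'<li><a href="{ITEM_URL}/{item_id}">{title}</a></li>'
            for item_id, title in chunk
        )
        html = f"<html><body><ul>\n{links}\n</ul></body></html>"
        with open(os.path.join(out_dir, f"index{page}.html"), "w",
                  encoding="utf-8") as f:
            f.write(html)

if __name__ == "__main__":
    write_index_pages()
```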
