Monday, August 25, 2008

concordance as word cloud

Eric Lease Morgan posted about a cool little hack to present a text concordance as a word cloud. A visualization of a concordance -- what a nice idea! It would be interesting to see one at a larger scale -- for every word in a book. I'd like to see how the visual metaphor scales.
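For readers who have not seen the idea, here is a rough sketch of how a concordance might feed a word cloud. It is not Eric's code; the function names `concordance_weights` and `font_sizes`, the window size, and the font range are all assumptions made for illustration. It counts the words that appear near a keyword and scales those counts to font sizes.

```python
import re
from collections import Counter


def concordance_weights(text, keyword, window=3):
    """Count the words found within `window` tokens of each `keyword` hit.

    Hypothetical helper: the tokenizer is deliberately crude (lowercase
    letters and apostrophes only) and no stop words are removed.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)
            # Neighbors on both sides, excluding the keyword itself.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts


def font_sizes(counts, smallest=10, largest=40):
    """Scale counts linearly so the most frequent word gets `largest`."""
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: smallest + (largest - smallest) * count / top
        for word, count in counts.items()
    }


sample = "The whale swam. The white whale dove. A whale is a whale."
weights = concordance_weights(sample, "whale", window=1)
sizes = font_sizes(weights)
```

Rendering the sizes as an actual cloud is left out. Running this for every word in a book would mean one such table per distinct word, which is where the question of scale comes in.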

Eric said one thing, though, that gives me pause:

"It is a trivial example of how libraries can provide services against documents, not just the documents themselves."

He is absolutely right -- it is a trivial effort to create this useful service. What is unfortunately still not as trivial as it should be is getting access to accurate transcriptions of all the texts one might want to analyze. There are ASCII transcriptions for many, many works, but there is always a question of accuracy, and of whether the desired edition(s) are available. There's OCR, but it's a fair amount of effort to check and correct the output. Google isn't releasing its OCR, and even if it did, that output would need correction too. Keyboarding is expensive. And many works in copyright haven't been touched for fear of legal action.

We have the ability to build extraordinary analytical tools. Where is the critical mass of text content?


Eric Lease Morgan said...

Leslie said, "We have the ability to build extraordinary analytical tools. Where is the critical mass of text content?"

Thank you for the feedback.

I believe there is quite a mass of textual content. I would begin with the freely available texts in things like the Open Archive as well as the article content linked from the Directory of Open Access Journals.


Leslie Johnston said...

The IA Open Text Archive is indeed a large collection, but again, it's mostly uncorrected OCR, which I think is challenging to do critical analysis on. Maybe I should have said "Where is the critical mass of authoritative text content?" I know from my experience at UVA that it is exceptionally labor-intensive and expensive to create full-text content that you can say is authoritative and accurate. I'm in no way knocking OCR, because any full text provides far better access than none, but is it good enough for that level of text analysis?