Sunday, 11 March 2012

Crowdsourcing transcription of handwritten archives

One of the big differences between libraries and archives is that libraries tend to hold more of ‘the printed word’ whilst archives hold vast amounts of ‘the handwritten record’. While some libraries are getting up to speed with mass digitisation of books and journals and can then offer users full-text searchable digitised items, this is still a distant dream for most archives. Some archives are undertaking mass digitisation, but the second step, making handwritten records full-text searchable, is a massive challenge. The reason lies in the technology and processing steps.

After a ‘printed word’ page is scanned into an image file, a piece of software called Optical Character Recognition (OCR) converts the image into searchable text. OCR works best with clean, clear, black and white typeface such as a Word document or a book, not quite so well on old books and journals, and very poorly on old newspapers. When it comes to converting handwriting it fails miserably. It just can’t distinguish and convert handwriting to text in the way the human eye can. Therefore archives can’t easily automate the second part of the digitisation process using OCR software as libraries can for the printed word.

If you at least get some OCR text from print that is readable and therefore searchable, you can offer users a service to full-text search the books or journals, as Google does. If the OCR text is poor there are some things you can do to improve it. You can encourage users of your service to correct the OCR text with a text correction tool so that searching is improved, as Trove does with the Australian Newspapers.

Unfortunately the only viable option open to archives to convert digital images into full-text searchable text is to use a manuscript transcription tool, in combination with harnessing the power of a crowd to do the transcription work. Transcribing handwritten records is much harder than, for example, text correcting old newspapers, because the handwriting is often difficult to read, old fashioned, barely legible and not necessarily structured in lines or columns. There is often nothing to go on.

I recently stumbled across a blog all about manuscript transcription tools, written by Ben W Brumfield, a software developer in Texas. Ben developed his own software to transcribe his great-great grandmother’s journal. ‘FromThePage’ is now being used by archives because Ben has made it available open source.

A year ago he wrote an in-depth blog post that covered manuscript transcription tools under development and manuscript transcription projects in archives, and made some predictions for future directions of manuscript transcription. I am not going to repeat what he said here; I suggest you read the post in full. He notes that software development in this area is still fragmented and young, with no particular tool taking dominance. Most developed applications are being made available open source. A standout is ‘Scribe’ from the Zooniverse team, currently being used both by the ‘Old Weather’ project to transcribe maritime weather records and by the ‘What’s the score’ project to transcribe music scores at the Bodleian Library, Oxford.

Before an archive implements a manuscript tool it needs to find out which content, from the vast vaults of everything it holds, its users would most like to be easily full-text searchable. It is important to find this out, because the crowd will only be motivated and swell in numbers if they really feel that what they are doing is interesting, is very important to a broad group of people, and really matters either right now or in the long term. They have to feel this before they will join in. Once they have joined in there are other motivational techniques you can use to keep them going. Implementing a manuscript tool alone is simply not enough. You need to engage, watch, understand and learn from your crowd, for they hold the passion and power in their hands to make your project successful or not.

Photo by Rose Holley, outside Canberra Bus Station


  1. Thanks for the kind words about my blog. A couple of recent posts about motivation may be of interest: my own WebWise talk touches briefly on the subject in the last three minutes, but more importantly Trevor Owens's post on institutional motivation for crowdsourcing and the resulting comment thread is very much worth reading: Crowdsourcing Cultural Heritage: The Objectives Are Upside Down.

  2. Thanks for sharing this Ben. It was very useful to hear about your projects behind the scenes.