Rose Holley's Blog - views and news on digital libraries and archives: March 2012

Sunday, 25 March 2012

Crowdsourcing: the crowd ‘rations’ its experience to make it last

When I am at work I feel I have far too much to do and that my ambitions can never all be achieved. This tends to make me feel despondent. However, in crowdsourcing projects it is well known that providing far too much work and seemingly impossible goals is a very powerful motivator. Rather than leaving individuals in the crowd feeling despondent, it drives them to put in even more hours.

I know from experience that this is the case because online volunteers working on the Australian Newspapers text correction have told me so. Also, whenever thousands of new pages were added to the service, a surge in text correction followed. This was a regular pattern.

Last week I was alerted to a great article in the Guardian about crowdsourcing and two good blog posts on crowdsourcing in cultural heritage. All three articles are well worth reading and give some fascinating background to specific crowdsourcing projects. They all touch on the fact that the crowd wants to be given as much work as possible.

Crowdsourcing Cultural Heritage: the objectives are upside down, by Trevor Owens, 10 March 2012

Ben Brumfield noticed that in his transcription project one of his most valuable power users was slowing down on their transcriptions. The user had started to cut back significantly in the time they spent transcribing this particular set of manuscripts. Ben reached out to the user and asked about it. Interestingly, the user responded to explain that they had noticed that there weren’t as many scanned documents showing up that required transcription. For this user, the 2-3 hours they spent each day working on transcriptions was such an important experience, such an important part of their day, that they had decided to cut back and deny themselves some of that experience. The user needed to ration out that experience. It was such an important part of their day that they needed to make sure that it lasted.

Crowdsourcing at IMLS Webwise 2012 by Ben Brumfield

Galaxy Zoo and the new dawn of citizen science, by Tim Adams, 18 March 2012

The volunteers only worry that the source of their obsession will dry up, and that they will run out of visible galaxies to classify. "In the beginning," Alice Sheppard said, "we all were enjoying it so much that we didn't like the idea of getting to the end." As it has worked out, more data sets have kept becoming available just as one tranche of images has been classified; now Sheppard believes that the work will continue to expand like the objects of its attention, "though no one seems quite sure how many galaxies are in the Hubble database?"

So the lesson we can learn from this is that we must give our crowd as much work and new data as we can. We don’t want our crowd to have to ‘ration themselves’ because we haven’t left them enough work to do.

Photo: a worker bee eats the last crumbs of my sticky date pudding.

Sunday, 11 March 2012

Crowdsourcing transcription of handwritten archives

One of the big differences between libraries and archives is that libraries tend to have more of ‘the printed word’ whilst archives have vast amounts of ‘the handwritten record’. While some libraries are getting up to speed with mass digitisation of books and journals, and can then offer users full-text searchable digitised items, this is still a distant dream for most archives. Some archives are undertaking mass digitisation, but the second step, making handwritten records full-text searchable, is a massive challenge. The reason lies in the technology and processing steps.

After a ‘printed word’ page has been scanned into an image file, a piece of software called Optical Character Recognition (OCR) converts the image into searchable text. OCR works best with clean, clear, black and white typeface such as a Word document or a book, not quite so well on old books and journals, and very poorly on old newspapers. When it comes to converting handwriting it fails miserably. It just can’t distinguish and convert handwriting to text in the way the human eye can. Therefore archives can’t easily automate the second part of the digitisation process using OCR software in the way libraries can for the printed word.

If you get at least some OCR text from print that is readable, and therefore searchable, you can offer users a service to full-text search the books or journals, as Google does. If the OCR text is poor there are some things you can do to improve it. You can encourage users of your service to correct the OCR text with a text correction tool so that searching is improved, as Trove does with the Australian Newspapers.
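
To see why text correction matters for searching, here is a minimal sketch in Python. It is not how Trove or Google actually index their text; the page ids, the sample sentences, the misread word and the function names `build_index` and `search` are all made up for illustration. It simply shows that a word the OCR got wrong cannot be found until someone corrects it.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, term):
    """Return the sorted page ids whose text contains the exact term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical raw OCR output: "Canberra" was misread as "Canbcrra" on page 2.
ocr_pages = {
    "page-1": "Floods reported near Canberra this week",
    "page-2": "The Canbcrra show opens on Saturday",
}

# The same pages after a volunteer has corrected page 2.
corrected_pages = dict(ocr_pages)
corrected_pages["page-2"] = "The Canberra show opens on Saturday"

hits_before = search(build_index(ocr_pages), "Canberra")
hits_after = search(build_index(corrected_pages), "Canberra")
print(hits_before)  # only page-1 is found
print(hits_after)   # page-1 and page-2 are both found
```

A real service would probably also use fuzzy matching to catch some OCR errors automatically, but the basic point stands: every correction a volunteer makes turns a page that was invisible to search into one that can be found.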

Unfortunately the only viable option open to archives for converting digital images of handwritten records into full-text searchable text is to use a manuscript transcription tool, in combination with harnessing the power of a crowd to do the transcription work. Transcribing handwritten records is much harder than, for example, text correcting old newspapers, because the handwriting is often difficult to read, old fashioned, barely legible and not necessarily structured in lines or columns. There is often nothing to go on.

I recently stumbled across a blog all about manuscript transcription tools, written by Ben W Brumfield, a software developer in Texas. Ben developed his own software to transcribe his great-great grandmother’s journal. ‘FromThePage’ is now being used by archives because Ben has made it available open source.

A year ago he wrote an in-depth blog post that covered manuscript transcription tools under development and manuscript transcription projects in archives, and made some predictions for future directions of manuscript transcription. I am not going to repeat what he said here; I suggest you read the post in full. He notes that software development in this area is still fragmented and young, with no particular tool dominating. Most developed applications are being made available open source. A standout is ‘Scribe’ from the Zooniverse team, currently being used both by the ‘Old Weather’ project to transcribe maritime weather records and by the ‘What’s the score’ project to transcribe music scores at the Bodleian Library, Oxford.

Before an archive implements a manuscript tool it needs to find out what its users would most like to be easily full-text searchable from the vast vaults of content it holds. It is important to find this out, because the crowd will only be motivated and swell in numbers if they feel that what they are doing is interesting, is very important to a broad group of people, and really matters either right now or in the long term. They have to feel this before they will join in. Once they have joined in there are other motivational techniques you can use to keep them going. Just implementing a manuscript tool is simply not enough. You need to engage, watch, understand and learn from your crowd, for they hold the passion and power in their hands to make your project successful or not.

Photo by Rose Holley, outside Canberra Bus Station

Saturday, 3 March 2012

The digital game: helping librarians get digital jobs

A question I am often asked is “How can I become a digital librarian and get a job in this field?”

I wish I could say “Well, the subject is well covered in Library/Information Science courses, and there are lots of opportunities for you to gain experience.” But unfortunately neither is the case in Australia or New Zealand. I have been mentoring and helping Masters in Library Studies students with their assignments on digital topics for the last 10 years. I do this for two reasons. Firstly because I am naturally curious about the assignments and how close to reality they are, and secondly because I live in the hope that some, or even just one, of these new graduates may be inspired rather than discouraged by digital, and end up becoming a digital specialist like me. We are so short of digital specialists.

I am disappointed that most library courses and degrees still offer the digital bit as an optional rather than compulsory part of the program, even though these days most libraries would be doing something they call ‘digital’, in the same way they all catalogue. There is no Australian university course I am aware of that actually covers the whole breadth of digital topics at degree level for cultural heritage specialists (museums, galleries, libraries, archives), that is, digitisation, digital delivery, digital preservation and data sharing. However, I am encouraged because the digital assignments from library courses that I am asked about are becoming increasingly realistic and practical. They are moving on from theoretical questions about online catalogues and digitisation to topics such as utilising social media and digital preservation. But it is still hard for new graduates to find jobs when they may have theoretical knowledge only and no practical experience in the field. It is also hard to up-skill our existing librarians.

I was very interested therefore to hear about a new board game focusing on digital topics that was road tested at the DISH2011 conference. I thought it held immense value as a tool for three things: graduate teaching; up-skilling staff in an organisation; and interview practice to get some of those tricky digital questions right. The game is based on Monopoly and covers the whole digital life cycle, including digitisation and digital preservation. I was interested to see that some of the questions are ones I have actually been asked at interview. Things like “What would you do if halfway through your digitisation project the funding was cut?” The game was created by the European DigCurV project. DigCurV brings together a network of partners to address the availability of vocational training for digital curators in the library, archive, museum and cultural heritage sectors in Europe. These skills are needed for the long-term management of digital collections. There is a very good blog post with pictures of the game being played and some of the questions, so I won’t repeat them here.

At the moment the game is being refined and will only be available to European partners of DigCurV (some of whom would like it translated from English into their own language). It would be great if copies could be obtained for the national Australian cultural heritage institutions and Australian universities offering Library/Archive/Museum degree courses.

There have been a number of organisations set up in Europe in the past to address training issues in digitisation and digital preservation. Not all of these survived, many being based on short-term funding. The earliest I am aware of was in the UK in the year 2000, funded through revenue from the National Lottery. 50 million pounds was given away as ‘nof-digitise’ for organisations to start digitisation projects. However, it was quickly realised that training would be required before the digitisation and delivery could start, and so short-term national training courses were set up. In 2001 the UK was the place to be if you were working as an information professional, wanted to learn about digitisation on the job and had got your hands on some of the nof-digi money. Sadly in Australia and New Zealand we are still awaiting a financial windfall for digitisation on the scale we have seen from the European Union, the French and Scandinavian Governments and UK Lottery Funds. This means that we also haven’t developed the training we need and have no organisation equivalent to DigCurV. I’m still hoping the proposed National Cultural Policy may address some of these things in 2012.

Photo from DEN Flickr stream: