Tuesday, 13 December 2011

From little things big things grow – gold star to newspaper text correctors for e-books!

For the last 4 years I have been working with a largely invisible crowd of people. That’s the way it is now in the digital library world. In the old days as a reference librarian you saw your crowd standing in a great big queue in front of you. You got to know your regulars, met visitors, and people doing their personal research. 25 years ago a busy Saturday in a city reference library involved having a face-to-face conversation with about 150 people in a day (and in winter transferring their germs!).

These days I know there are about 4 million people using the Trove service, which at the moment averages 10,000 searches per hour, every hour. Among these people there is an extra special group: ‘my regulars’. About 10,000 people log in at home each morning, afternoon and evening to do a little bit of newspaper text correction. As I sit in my office in Canberra and they sit in their living rooms at home, I know they are there, and I see what they do. Their presence and achievements are felt. Sometimes I wish they could all be physically here, standing in front of the Library, just to show what a big crowd they are and so that I can thank them in person.

Most of the text correctors started out thinking they would do just a little bit, often correcting a single article involving their family history. But everything snowballed, and people got sucked in, addicted, hooked. Now together my virtual special crowd has corrected more than 50 million lines of text. The busiest corrector, Ann Manley, achieved a personal best of 1 million lines single-handedly last week. The day she achieved this was International Volunteers Day. Most people are doing the work because they want to help others find information (especially names) and the text correction helps with this. They also think they are making a difference to the nation by accurately recording Australian history in this way (which they definitely are). The text correctors never fail to inspire, impress and surprise me with their dedication and commitment.

Last week I received a message from two of our correctors high in the ranking tables. Their activity has expanded to a whole new level. After correcting shipping notices for some time they have now moved on to stories and novels that appeared in the newspapers. In their own words:

“It is quite enjoyable being able to read a story while doing the text correction. The added bonus is being able to put them online in e-book format for others to read, as we have found lots of stories that are found nowhere else. We started on a novel by Fred M. White back in March, 2010 entitled 'The Case for the Crown' (it started in the Sydney Morning Herald on 2/11/1918) when we were doing some corrections concerning the end of the first World War. We initially focussed in on Fred M. White who was an English author who wrote a number of crime/intrigue/suspense novels and short stories e.g. The Red Speck (1899), The Corner House (1906), The Slave of Silence (1906), The Law of the Land (1908), The Scales of Justice (1909), The Empty House (1909), Hard Pressed (1910), A Front of Brass (1910), A Rope of Snow (1911), A Mummer's Throne (1911), The Man Called Gilray (1912), A Secret Service (1913), The Case for the Crown (1919), The Leopard's Spots (1920). The earliest novel that we collected was 'The Miser's Daughter' by William Harrison Ainsworth and it was published in The Colonial Times, Hobart starting in Aug. 1842. The story started in the first edition of 'Ainsworth's Magazine' in Feb. 1842, which was then used to produce the story in the Hobart paper. At times publication in the Hobart paper was delayed due to non-arrival of mail from England.

Works by Australian authors include: John Sandes - 'A Bush Bayard', and 'The Call of the Southern Cross', Arthur Gask - 'The Lonely House', 'The Shadow of Larose', and Rev. William Draper - 'The Hermit Convict'. We have also produced e-books by Katharine Tynan (an Irish Author), Bernard Capes, Percy Andreae, Vernon Houseman, Harold Bindloss and a few others.

The stories are mostly serialized and so first we correct them, then save the corrected text. Most of the text downloads in one big block of text which needs to be broken up into paragraphs. I then have to reduce all word spacings back to one space as well as fixing broken words with a hyphen in them. Then run a spell-check and a run through Guiguts (a Project Gutenberg program which finds a lot of errors and re-wraps the text to a line length of 70). After that I have to sit down and proof-read the whole document before it is ready to send off to Project Gutenberg Australia. This whole process is quite time-consuming and can take longer than the original newspaper text-correcting. But it is very rewarding, so far I have managed to create 105 e-books from the text in the last 18 months. We upload the e-books to Project Gutenberg Australia. I found recently that some of the stories that we have uploaded have since been copied by some other websites and set up in different file formats. Some of the stories have even been put onto Amazon.com.”
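For readers curious about what the tidy-up steps described above might look like if scripted, here is a minimal sketch in Python. It is not the correctors' actual process (they use Guiguts and careful proof-reading). The function names `tidy_paragraph` and `tidy_document` are made up for illustration, and the sketch assumes the paragraph breaks have already been marked by hand with blank lines. It only covers rejoining words split by a hyphen at a line break, reducing word spacing to one space, and re-wrapping to a line length of 70.

```python
import re
import textwrap


def tidy_paragraph(text: str, width: int = 70) -> str:
    """Clean one paragraph of corrected newspaper text (illustrative only)."""
    # Rejoin words that were split by a hyphen at a line break,
    # e.g. "cor-\nrected" becomes "corrected".
    # Assumption: every end-of-line hyphen is a broken word. A genuine
    # compound split at a line end would be joined wrongly, which is one
    # reason a human proof-read is still needed.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Reduce every run of whitespace (spaces, tabs, line breaks) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Re-wrap the paragraph to the requested line length.
    return textwrap.fill(text, width=width)


def tidy_document(raw: str, width: int = 70) -> str:
    """Clean a whole story whose paragraphs are separated by blank lines."""
    paragraphs = re.split(r"\n\s*\n", raw)
    cleaned = [tidy_paragraph(p, width) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The  quick   brown fox was cor-\nrected  today.\n\nSecond   paragraph."
    print(tidy_document(sample))
```

A script like this can only take care of the mechanical part. Splitting the downloaded block into paragraphs, spell-checking and the final proof-read remain the time-consuming human work.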

The inventiveness and dedication of the public text correctors, and the power of crowdsourcing activity, must not be underestimated. Gold star to all text correctors!


  1. I am one of these, but only I admit doing my own family history stuff, and articles on the same pages, as I can. It is rewarding...

    I don't know if you have seen it, but this may be of interest...


  2. Thanks for this. The TEDx video of Luis von Ahn talking about how he developed CAPTCHA to improve OCR on books is interesting. As far as I know it's only the New York Times and Google Books who get the benefit of this.

    His latest project, Duolingo, which gets people to translate the web while learning a new language, sentence by sentence, also sounds interesting.