Sunday, 18 December 2011

Software for journal and newspaper text correction

When I was working as Manager of the Australian Newspapers Digitisation Program I led the development of two pieces of software: the Digital Newspaper Content Management System and the Australian Newspapers Delivery Service. Quite early in the development of the public delivery system, the National, State and Territory Libraries raised concerns about the poor quality of the newspaper OCR text, which negatively affected public searching. In a team brainstorming session we came up with the idea of opening up the OCR text for public correction and improvement. We developed our own software to do this. Because we knew that what we were doing was quite groundbreaking at the time, that it might be very useful for other archives and libraries, and that no ‘off the shelf’ software existed to do this, we decided to share the code as open source. The code was developed by Kent Fitch, Director of Project Computing and System Architect for the National Library of Australia, ably assisted by Ninh Nguyen, programmer at the National Library of Australia. They created a fantastic service for Australians. The public interface for delivery of newspapers and text correction was designed, developed and tested on real users in a matter of weeks by Alexi Paschalidis, Creative Director of Oxide Interactive.

Over the course of the next two years I had approaches from about 20 National Libraries from around the world asking if they could have the code. Although we said yes every time, I was disappointed that, to the best of my knowledge, no library actually implemented public text correction by using or adapting our open source software as we had hoped. This mainly seemed to be because other libraries were still very unsure about “allowing” user edits, crowdsourcing, public interaction - whatever you want to call it. Giving users freedom over data, rather than retaining tight control, seemed to be a daunting prospect for libraries and a step they didn’t want to take. Several commented that they would like a bit more time to observe whether our user activity was a ‘fad’ before committing to doing the same. For this reason only three libraries, as far as we were aware, even got to the stage of having their IT support download the open source code from the National Library of Australia’s ‘LibraryForge’ website. True, it would have been quite difficult for a library to implement and adapt our code to hook into a content management system, but this did not seem to be the reason for the low uptake. We never shared the code for the newspaper content management system because our philosophy was to share code only once the product was in a ready and usable state, and the content management system was under constant development for three years. Now, after four years, the National Library of Australia has decided to remove the open source code for Australian Newspapers from its download site.

However, all is not lost for libraries wanting software for text correction or for delivering newspapers and journals: I was recently alerted to the fact that a New Zealand company has developed software that replicates the text correction functionality of Australian Newspapers. DL Consulting are selling a product called Veridian, which has already been installed by some US libraries. The company has a background in newspaper software development, having been involved in a very early open source digital product called ‘Greenstone’, which was used to deliver Maori Newspapers from the University of Waikato back in 2002. I remember the service well from my time working at the University of Auckland. The digital full-text search access was groundbreaking for the Maori, history and research community.

Kent Fitch has continued to work for the National Library of Australia on system architecture, and went on to lead the development of Trove and to integrate the Newspapers service into it. Kent has worked as a programmer for over 30 years. Since 1982 he has been a principal of the Canberra software development company Project Computing Pty Ltd. He has developed commercial systems, communications packages and custom software for many clients. In the past ten years, his work has focused on library-related systems including AustLit, NLA Newspapers Digitisation, and Trove.

Last week Kent gave a presentation to the Australian Computer Society called ‘Scaling up: the technology behind the NLA's newspaper digitisation and the Trove search service’, in which he describes some of the technical aspects in more detail.

2 comments:

  1. Thanks for the post. The New Brunswick Free Public Library in New Jersey, USA has 51 years' worth of digitized newspapers, but most of it has very poor OCR results. The vendor we use does not have the ability to correct those texts, and we do not have the budget to pay for Veridian. Do you have any suggestions for us? Thanks
    Hsien-min Chen,
    Principal Librarian

  2. Dear Hsien-min, I have three things to suggest, and two of them would require a short-term programmer to help you:
    1) Utilise Re-Captcha. Re-Captcha is free, but you would need some programming help to be able to share your text with the programme and then feed the changes back into your system. Re-Captcha is used by all sorts of services, so the people using it are unaware they are doing text correction for your project. At the moment Google Books and the historic versions of the New York Times are using it on their OCR text.
    2) Use some open source software for text correction and integrate this into your system with the help of a programmer. Software I am aware of includes that from the National Library of Australia, SCRIBE from Galaxy Zoo (though aimed more at handwritten text, it could be adapted), and Wikisource. I'm not sure whether the IMPACT (Improving Access to Text) European Project has finished the open source software they were developing for text correction yet, or how this stacks up against other software.
    3) Contact Wikisource Transcription Projects to see if they want to set up a project to help you. http://en.wikisource.org/wiki/Wikisource:Transcription_Projects They have only been transcribing books so far, and I'm not sure what their capability is for newspapers. It may be best to talk to your local chapter first.

    Hope that helps.
