Sunday, 18 December 2011

Software for journal and newspaper text correction

When I was working as Manager of the Australian Newspapers Digitisation Program I led the development of two pieces of software: the Digital Newspaper Content Management System and the Australian Newspapers Delivery Service. Quite early in the process of developing the public delivery system, the National, State and Territory Libraries raised concerns about the poor quality of the newspaper OCR text, which negatively affected public searching. In a team brainstorming session we came up with the idea of opening up the OCR text for public correction and improvement. We developed our own software to do this. Because we knew that what we were doing was quite groundbreaking at the time and might be very useful for other archives and libraries, and because no ‘off the shelf’ software existed to do this, we decided to share the code as open source. The code was developed by Kent Fitch, Director of Project Computing and System Architect for the National Library of Australia, ably assisted by Ninh Nguyen, programmer at the National Library of Australia. They created a fantastic service for Australians. The public interface for delivery of newspapers and text correction was designed, developed and tested on real users in a matter of weeks by Alexi Paschalidis, Creative Director of Oxide Interactive.

Over the course of the next two years I had approaches from about 20 National Libraries from around the world asking if they could have the code. Although we said yes every time, I was disappointed that, to the best of my knowledge, no library actually implemented public text correction by using or adapting our open-source software as we had hoped. This mainly seemed to be because other libraries were still very unsure about “allowing” user edits, crowdsourcing, public interaction - whatever you want to call it. Giving users freedom over data, rather than retaining tight control, seemed to be a daunting prospect for libraries and a step they didn’t want to take. Several commented they would like a bit more time to observe whether our user activity was a ‘fad’ before committing to doing the same. For this reason only three libraries, as far as we were aware, even got to the stage of having their IT support download the open-source code from the National Library of Australia’s ‘LibraryForge’ website. True, it would have been quite difficult for a library to implement and adapt our code to hook into a content management system, but this did not seem to be the reason for the low uptake. We never shared the code for the newspaper content management system because our philosophy was to only share code once the product was in a ready and usable state, and the content management system was under constant development for three years. Now, after four years, the National Library of Australia has decided to remove the open-source code for Australian Newspapers from its download site.

However, all is not lost for libraries wanting software for text correction or for delivering newspapers and journals. I was recently alerted to the fact that a New Zealand company had developed software to replicate the text correction functionality of Australian Newspapers. DL Consulting is selling a product called Veridian, which has already been installed by some US libraries. The company has a background in newspaper software development, having been involved in a very early open-source digital product called ‘Greenstone’, which was used to deliver Maori Newspapers from the University of Waikato back in 2002. I remember the service well from my time working at the University of Auckland. The digital full-text search access was groundbreaking for the Maori, history and research community.

Kent Fitch has continued to work for the National Library of Australia on system architecture and went on to lead development of Trove and integrate the Newspapers service into it. Kent has worked as a programmer for over 30 years. Since 1982 he has been a principal of the Canberra software development company, Project Computing Pty Ltd. Kent has developed many commercial systems and communications packages and custom software for many clients. In the past ten years, his work has focused on library-related systems including AustLit, NLA Newspapers Digitisation, and Trove.

Last week Kent gave a presentation to the Australian Computer Society called ‘Scaling up: the technology behind the NLA's newspaper digitisation and the Trove search service’, in which he describes some of the technical aspects in more detail.

Tuesday, 13 December 2011

From little things big things grow – gold star to newspaper text correctors for e-books!

For the last 4 years I have been working with a largely invisible crowd of people. That’s the way it is now in the digital library world. In the old days as a reference librarian you saw your crowd standing in a great big queue in front of you. You got to know your regulars, met visitors, and people doing their personal research. 25 years ago a busy Saturday in a city reference library involved having a face-to-face conversation with about 150 people in a day (and in winter transferring their germs!).

These days I know there are about 4 million people using the Trove service, which at the moment averages 10,000 searches per hour, every hour. Among these people there is an extra-special group: ‘my regulars’. About 10,000 people log in at home each morning, afternoon and evening to do a little bit of newspaper text correction. As I sit in my office in Canberra and they sit in their living rooms at home, I know they are there, I see what they do. Their presence and achievements are felt. Sometimes I wish they could all be physically here, standing in front of the Library, just to show what a big crowd they are and so that I can thank them in person.

Most of the text correctors just started by thinking they would do a little bit, often correcting an article involving their family history, and thinking they would just do that one. But everything snowballed, people got sucked in, addicted, hooked. Now together my virtual special crowd has corrected more than 50 million lines of text. The busiest corrector, Ann Manley, achieved a personal best of 1 million lines single-handedly last week. The day she achieved this was International Volunteers Day. Most people are doing the work because they want to help others find information (especially names) and the text correction helps with this. They also think they are making a difference to the nation by accurately recording Australian history in this way (which they definitely are). The text correctors never fail to inspire, impress and surprise me with their dedication and commitment.

Last week I received a message from two of our correctors high in the ranking tables. Their activity has expanded to a whole new level. After correcting shipping notices for some time they have now moved on to stories and novels that appeared in the newspapers. In their own words:

“It is quite enjoyable being able to read a story while doing the text correction. The added bonus is being able to put them online in e-book format for others to read, as we have found lots of stories that are found nowhere else. We started on a novel by Fred M. White back in March, 2010 entitled 'The Case for the Crown' (it started in the Sydney Morning Herald on 2/11/1918) when we were doing some corrections concerning the end of the first World War. We initially focussed in on Fred M. White who was an English author who wrote a number of crime/intrigue/suspense novels and short stories e.g. The Red Speck (1899), The Corner House (1906), The Slave of Silence (1906), The Law of the Land (1908), The Scales of Justice (1909), The Empty House (1909), Hard Pressed (1910), A Front of Brass (1910), A Rope of Snow (1911), A Mummer's Throne (1911), The Man Called Gilray (1912), A Secret Service (1913), The Case for the Crown (1919), The Leopard's Spots (1920). The earliest novel that we collected was 'The Miser's Daughter' by William Harrison Ainsworth and it was published in The Colonial Times, Hobart starting in Aug. 1842. The story started in the first edition of 'Ainsworth's Magazine' in Feb. 1842, which was then used to produce the story in the Hobart paper. At times publication in the Hobart paper was delayed due to non-arrival of mail from England.

Works by Australian authors include: John Sandes - 'A Bush Bayard', and 'The Call of the Southern Cross', Arthur Gask - 'The Lonely House', 'The Shadow of Larose', and Rev. William Draper - 'The Hermit Convict'. We have also produced e-books by Katharine Tynan (an Irish Author), Bernard Capes, Percy Andreae, Vernon Houseman, Harold Bindloss and a few others.

The stories are mostly serialized and so first we correct them, then save the corrected text. Most of the text downloads in one big block of text which needs to be broken up into paragraphs. I then have to reduce all word spacings back to one space as well as fixing broken words with a hyphen in them. Then run a spell-check and a run through Guiguts (a Project Gutenberg program which finds a lot of errors and re-wraps the text to a line length of 70). After that I have to sit down and proof-read the whole document before it is ready to send off to Project Gutenberg Australia. This whole process is quite time-consuming and can take longer than the original newspaper text-correcting. But it is very rewarding, so far I have managed to create 105 e-books from the text in the last 18 months. We upload the e-books to Project Gutenberg Australia. I found recently that some of the stories that we have uploaded have since been copied by some other websites and set up in different file formats. Some of the stories have even been put onto Amazon.com.”
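For anyone curious what the mechanical part of that clean-up might look like in code, here is a minimal Python sketch. It is my own illustration, not the correctors' actual tools and not Guiguts: the function names are hypothetical, it assumes paragraphs have already been separated by blank lines by hand, and it assumes a "broken word" is one where a hyphen is followed by a space (e.g. "news- paper"). The spell-check and the final proof-read still need a human.

```python
import re
import textwrap


def tidy_paragraph(text: str, width: int = 70) -> str:
    """Clean one paragraph of corrected newspaper text (illustrative only)."""
    # Rejoin words split by a hyphen followed by whitespace, e.g. "news- paper".
    # Assumption: this pattern is always a broken word. Real text will contain
    # exceptions, so the result still needs checking by eye.
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    # Reduce every run of spaces, tabs or newlines to a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Re-wrap to the chosen line length (70 in the workflow described above).
    return textwrap.fill(text, width=width)


def tidy_document(raw: str, width: int = 70) -> str:
    """Tidy each blank-line-separated paragraph and rejoin them."""
    paragraphs = re.split(r"\n\s*\n", raw)
    return "\n\n".join(tidy_paragraph(p, width) for p in paragraphs if p.strip())
```

Calling `tidy_document(downloaded_text)` on a saved article would give text with single spacing, rejoined words and lines wrapped at 70 characters, ready for the spell-check and proof-reading steps.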

The inventiveness and dedication of the public text correctors and crowdsourcing activity must not be under-estimated. Gold star to all text correctors!

Monday, 12 December 2011

Digital Libraries: into the future: food for thought

During November I read and listened to three talks from well-known people, all of whom gave views on digital libraries and the future. The contents were considered to be ‘controversial’ and ‘thought provoking’ by your average librarian. I loved each talk but did not consider them controversial. Quite the opposite: I considered them to be stating the obvious, and clearly explaining the reality of the situation that currently faces libraries. I wish more library managers would grasp the reality, focus on it and do something about it in a timely way. And now is the time. Many libraries, including the National Library of Australia, are developing their statements/strategies/visions for the years ahead right now. Each talk was also packed with useful quotes and advice for library managers. The speakers praised Trove and thought it demonstrated part of the way forward, which was good to hear. Here’s a bit of a summary with the links:

Caroline Brazier (British Library). ‘Collect/connect’ A presentation to the Libraries Australia Forum. http://www.nla.gov.au/librariesaustralia/news-events/forum/2011-forum/2011-program/

Caroline’s presentation focused on the 2020 vision for the British Library, but later over coffee with me she shared some other thoughts. In times of severe budget constraint innovation should thrive. Some of the best ideas have always evolved from having no money. You have to RE-THINK the ways you do things, FOCUS on what is the most important thing, and come up with NEW IDEAS on how to do things. The benefit of budget cuts in libraries could be innovation. Innovative services rarely come from having as much budget as you want to play with. I can identify with this, through the Australian Newspapers and Trove project. It may surprise you to know that, contrary to popular belief, we did not have a team of hundreds behind the scenes at the National Library of Australia working on either project, just a sum total of 5 staff. We had no additional budget to develop, design or support the new services. Our only extra budget was to digitise items. I am sure that if we had had more money we would not have come up with the idea of public newspaper text correction; we would simply have paid the digitisation contractors more money to manually re-key text.

People in the UK were asked what 2020 might mean for libraries and one person commented that because digital is the way forward the proportion of IT/web staff to other staff needs to shift considerably. I’m not sure what the current proportion is at the British Library, but most libraries would have less than 10% of their staff in IT, and this needs to change.

My favourite quote from a member of the public was “BL should replicate the Glastonbury Festival feeling and at the same time provide the great scholastic silence”. The BL had trouble understanding what that meant and didn’t think you could do both (assuming the whole thing is physical), but I took it to mean that the users want an online creative, collaborative feeling with a ‘wow’ factor of digital collections and services doing what they want, combined with a physical reflective space. This is exactly the sort of thing that my dear friend Paul Reynolds would have said, and in fact did just before he died, in his valedictory lecture to librarians at NLNZ.

Nathan Torkington (web expert). ‘Where it all went wrong’ A speech delivered to the National and State Librarians of Australasia. http://nathan.torkington.com/blog/2011/11/23/libraries-where-it-all-went-wrong/

“…libraries are like Microsoft. At one point you had a critical role: you were one of the few places to conduct research. When academics and the public needed to do research into the documentary record, they’d come to you.  As you know that monopoly has been broken. The internet led by Google, is the start and end of most people’s research. It’s good enough to meet their needs, which is great news for the casual researcher but bad news for you. Now they don’t think of you at all.”

“You need to be useful as well as important. Being useful helps you be important… Oh I know you thought about digital a lot. You’ve got digitisation projects. You’re aggregating metadata. But these are bolt-ons. You’ve added digital after the fact. You probably have special digital groups, made up of younger people than the usual library employee. You have some advance R&D guys working on the future while the rest of us just get on with the past”.

“Your new reading room is your patron’s web browser. Are you designing distribution for that? How much did you spend building a new reading room? How much are you spending on digital delivery? The first place users start looking for things is Google. Are you designing discovery for that? Do you know how to be found? If I look at the results of the digitisation projects, I find the shittiest websites on the planet. It’s like a gallery spent all its money buying art and then just stuck the paintings in supermarket bags and leaned them against the wall.”

Nat does not think that Google is the answer, but libraries are:
“The best solution is when both man and machine work together: librarians make sense of indexes, this is what they do. Computers are great at building indexes. Don’t think ‘either-or’, think ‘and’. Libraries need to FOCUS. Success for you is RELEVANCE. Make things that people use. Then when someone asks ‘why do we tip all these millions into this?’ or ‘doesn’t Google do that already?’ your relevance is your answer.”

Peter Macinnis (author). ‘A question of collaboration’ Interview on ABC Ockham’s Razor http://www.abc.net.au/radionational/programs/ockhamsrazor/a-question-of-collaboration/3692142

Peter talked about science and technology and his visions for the future, as an author and member of a changing society.
“Why do people happily embrace the prospect of a world without libraries based on the prediction that we don’t need books or libraries anymore because we can get everything we need from the internet? Those who make sweeping assertions like this don’t know what books are, have no sense of what libraries do and absolutely no idea of what the internet is - or offers. Most importantly, these rigid descendants of Wackford Squeers lack the wit to see that INSTITUTIONS EVOLVE.”

He then talked about Wikipedia, Trove and the Australian Newspapers, what he had searched for and found and how he had added his knowledge to the sources by editing, correcting newspaper text, tagging and commenting, and how other users coming after him could use his annotations in context.

“As the twig bends, so the tree bends. A future built on COLLABORATION relies on people who gain a quiet joy from contributing gems, nuggets and crumbs to future generations, whimsical folk who amuse themselves by committing acts of anonymous scholarship.”

Wednesday, 7 December 2011

Rose Holley: recently published articles

Trove:
Holley, R. (2010). Trove: Innovation to Access to Information in Australia. http://www.ariadne.ac.uk/issue64/holley/

Holley, R (2011). Trove: The First Year January 2010 - January 2011
http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1882

Holley, R (2011). Resource Sharing in Australia: ‘Find’ and ‘Get’ in Trove – making ‘Getting’ better
http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1868

Holley, R (2011). Extending the scope of Trove: Addition of e-resources subscribed to by Australian Libraries
http://dlib.org/dlib/november11/holley/11holley.html

Australian Newspapers Digitisation:
HOLLEY, Rose. (2009) How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine, 2009, vol. March/April 15, n. 4. http://www.dlib.org/dlib/march09/holley/03holley.html

HOLLEY, Rose. (2009) A success story – Australian Newspapers Digitisation Program. Online Currents, Volume 23, Issue 6, December 2009 pp 283-295 http://eprints.rclis.org/17665

HOLLEY, Rose. (2010) Ramping It Up: 10 Lessons Learned in Mass Digitisation. Online Currents, volume 24, Issue 1, March 2010 pp 16 -24 http://eprints.rclis.org/18064

HOLLEY, Rose. (2010) Tagging full text searchable articles: An overview of social tagging activity in historic Australian Newspapers August 2008 – August 2009. D-Lib Magazine, vol 16, no 1/2 Jan/Feb 2010 http://dlib.org/dlib/january10/holley/01holley.html

Crowdsourcing:
HOLLEY, Rose (2009). Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia, March 2009, ISBN 978-0-642-27694-0. http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf

Holley, R (2010). Crowdsourcing: How and Why Should Libraries Do It?
http://www.dlib.org/dlib/march10/holley/03holley.html

HOLLEY, Rose. (2011) Crowdsourcing and social engagement in libraries: the state of play. 29 June 2011
http://www.crowdsourcing.org/document/crowdsourcing-and-social-engagement-in-libraries-the-state-of-play/5550

HOLLEY, Rose (2011) Text Correction in Aussie Newspapers (video)
http://www.dlconsulting.com/rose-holley-on-text-correction-at-trove/

HOLLEY, Rose (2012). Harnessing the Cognitive Surplus of the nation: new opportunities for libraries in a time of change.
http://www.sl.nsw.gov.au/about/awards/docs/Arnot_Memorial_Fellowship_Winner%202012.pdf

Rose Holley: recent presentations

I've been uploading my PowerPoint presentations to SlideShare:

http://www.slideshare.net/RHmarvellous/presentations

Last year I did 62 presentations in 45 weeks so this year I've been taking a bit of a break! Sorry folks!

My most popular talk was one I gave to the public in Sydney called 'The making of our Digital Nation', which you can view on YouTube: http://www.youtube.com/user/TroveNLA#p/f/1/a19icvJO_HE It's rather long at an hour and 18 minutes! I very much enjoyed this talk because usually I only get to talk to librarians.

My most recent presentation was to librarians in Spain and was called 'Collecting, sharing and improving data: Changing roles for librarians and users.' I drew on my personal experience over the last 25 years and particularly talked about Australian Newspapers and Trove. You can see it on Vimeo: http://vimeo.com/23614929. Again rather long at an hour!