Saturday, 10 November 2012

National Archives of Australia embraces crowdsourcing and releases ‘The Hive’.

The National Archives of Australia (NAA) has taken a bold step into the cultural heritage crowdsourcing arena with ‘The Hive’, which was released two weeks ago. The brand makes a clever play on the word ‘Archive’, combined with the idea of a hive of working bees (the public). The site encourages the public to transcribe archive records.

Early this year, when David Fricker became Director General of the NAA, he was quick to encourage staff to think innovatively, embrace change, and harness opportunities such as crowdsourcing to improve access to our collections. He spoke publicly in favour of crowdsourcing and a changing business model for archives at the International Council on Archives Congress in August:

“Another key development in expanding access is crowdsourcing. As many of us are now seeing, by allowing the public to contribute to the description of archival resources we are enhancing the ability of future generations to discover and learn from our archives. I also think it is a wonderful opportunity for the public to be more engaged with us as archives and to share in the work we do – preserving the memories of our nations. There is still some work to do here, in order to maximise the value of contributions and to maintain the integrity of our archives as authentic and accurate. However, I do not believe these problems are insurmountable, and indeed I believe these systems can to some extent be self-correcting.

This is a type of the co-design, citizen first activity… drawing on the interest and enthusiasm of the community to bring more of our archives into view – discoverable and retrievable…Access will be online and everywhere, improved by rich new data visualisation techniques and expanded descriptive contributions from an engaged citizenry”.

The Hive is the Archives’ pilot of, and experiment in, large-scale transcription crowdsourcing as a way to improve access to records. Staff have looked closely at other crowdsourcing sites and attempted to build on their knowledge and techniques, to provide a site that could be used as a large-scale platform for a variety of transcription crowdsourcing projects.

At present the site offers just over 800 lists for the public to transcribe. Some of these are typed and some handwritten. They are rated in difficulty as easy, medium or hard. Part of the difficulty with this project is that the public need some understanding of how archives receive and describe their records to make sense of what they are being asked to do. In simple terms, archives receive vast amounts of records in batches (referred to as consignments). Each consignment comes with a list of the items in it. However, because of the large volume of records being received, usually only the consignment record is entered into the catalogue, e.g. ‘100 boxes of plans and drawings’ from x government agency, rather than each individual item on the consignment list. The ideal scenario for users of the archives is that every item, e.g. each plan and drawing, is described in the catalogue so that it can be found. Without this, a lot of guesswork goes into finding relevant things, or personal visits are required to view the hard copy consignment lists.

The project that the Archives is undertaking is to digitise consignment lists and then make them available for transcription by the public. Once transcribed, the lists become searchable and the items within them can be found more easily. Because so many of the lists are old and handwritten, it is virtually impossible to get good OCR on them. That is where the public come in: they can read the lists with the human eye. The public's time is also needed to speed up access. Projections of the time it would take archives staff to describe the lists without public help currently stand at 210 years. It is anticipated that a member of the public could, with relative ease, describe several hundred items per hour with the Hive tool, which would make a big difference, especially if there was a swarm.

The consignment lists in the pilot are those that have proved most popular with researchers and that contain items in the ‘open period’, that is, older than 30 years and now open to the public. The top interest is lists of architectural drawings and historic buildings. This is closely followed by PNG patrol officer records, maritime incidents, personal records from the war office, prisoners of war, meteorology and cyclones, WW1 intelligence, and oil drilling on the Great Barrier Reef.

In the first two weeks, 300 of the 800 lists have been transcribed. There is a definite preference for the lists rated hard (handwritten) and for ones that involve names.
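For a rough sense of what that pace means, the figures above (about 300 of roughly 800 lists in two weeks) can be turned into a back-of-the-envelope projection. This is only a sketch: the function name is made up for illustration, and it assumes the transcription rate stays constant, which is unlikely once early enthusiasm settles and the mix of remaining lists changes.

```python
def project_completion(total_lists, transcribed, weeks_elapsed):
    """Naive linear projection; assumes a constant transcription rate."""
    lists_per_week = transcribed / weeks_elapsed
    remaining = total_lists - transcribed
    return {
        "percent_done": 100 * transcribed / total_lists,
        "lists_per_week": lists_per_week,
        "weeks_remaining": remaining / lists_per_week,
    }


# Figures quoted in the post; "just over 800" is rounded down to 800 here.
result = project_completion(total_lists=800, transcribed=300, weeks_elapsed=2)
print(f"{result['percent_done']:.1f}% of the pilot lists transcribed")
print(f"{result['lists_per_week']:.0f} lists per week so far")
print(f"about {result['weeks_remaining']:.1f} more weeks at the same pace")
```

On these assumptions the pilot set would be finished within a few more weeks, which is why a steady supply of newly digitised lists matters.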

The site is well presented and gives volunteer transcribers things we know they want, such as a progress chart, recent activity, a points scoring system, rewards, and optional login using OpenID, e.g. their Google ID. Transcribers can search for and choose items or just take the next one served up, pick easy or difficult items, add a marker for where they got to if they are interrupted, and favourite records.

The only slight drawback is the placing of the transcription window at the bottom of the screen rather than to the right or left, which often makes it hard to see the transcription window and the content you are transcribing at the same time. Also, the OCR text and cursor in the transcription window are not linked directly to the text in the image, so it is sometimes easy to get lost whilst transcribing. This is largely because most of the lists are in tables, and the table rows and columns have not been retained in the OCR, so the OCR is somewhat muddled.

Further development of the site will largely depend on feedback given by the public users, and on the ability of the Archives to keep up a steady supply of new, interesting digitised consignment lists to the Hive. The Archives is still considering how it may be able to integrate the public content back into its main catalogue, RecordSearch, or integrate the Hive into RecordSearch. In the meantime the list content will remain searchable in the Hive.

The Archives clearly expects that making its content more discoverable will lead to more access requests. This is why, at the point of transcription, there is a button which enables the user to request a copy of the item. These requests are being met by digitising the item and then uploading it into the main catalogue, ‘RecordSearch’, with the full item description.

I congratulate the National Archives of Australia Access Team on the development of this exciting new site, which holds so much potential to improve access to records and engage with our citizens in new ways.

The screenshots below show the site in action:

Easy level transcription - Archived drawings

Medium level transcription - ABC Drama Scripts

Difficult level transcription - Plans


  1. Thanks for this, Rose. I'd seen the program announced but hadn't had time to take a look at it, so the screenshots are invaluable.

    Do you happen to know what software platform they're using? It looks like it may be a heavily styled version of Mediawiki with the plug-ins used for Wikisource, but I'm not sure.

    Also, when are you going to join Twitter?

  3. Wow! It’s so nice that the National Archives has decided to use crowdsourcing to transcribe these documents. It’s one way to digitize and preserve them. Just a suggestion: maybe store them in a media vault once they’ve been saved and the transcriptions are ready for storage, to make sure that they won’t get destroyed over time.
    Ruby Badcoe