Saturday 10 November 2012

National Archives of Australia embraces crowdsourcing and releases ‘The Hive’.


 
The National Archives of Australia (NAA) has taken a bold step into the cultural heritage crowdsourcing arena with ‘The Hive’, released two weeks ago. The brand makes a clever play on the word ‘Archive’ combined with the idea of a hive of working bees (the public). The site encourages the public to transcribe archival records.

Early this year, when David Fricker became Director-General of the NAA, he was quick to encourage staff to think innovatively, embrace change, and harness opportunities such as crowdsourcing to improve access to our collections. He publicly spoke in favour of crowdsourcing and a changing business model for archives at the International Council on Archives Congress in August:

“Another key development in expanding access is crowdsourcing. As many of us are now seeing, by allowing the public to contribute to the description of archival resources we are enhancing the ability of future generations to discover and learn from our archives. I also think it is a wonderful opportunity for the public to be more engaged with us as archives and to share in the work we do – preserving the memories of our nations. There is still some work to do here, in order to maximise the value of contributions and to maintain the integrity of our archives as authentic and accurate. However, I do not believe these problems are insurmountable, and indeed I believe these systems can to some extent be self-correcting.

This is a type of the co-design, citizen first activity… drawing on the interest and enthusiasm of the community to bring more of our archives into view – discoverable and retrievable…Access will be online and everywhere, improved by rich new data visualisation techniques and expanded descriptive contributions from an engaged citizenry”.

The Hive is the Archives’ pilot and experiment into the potential of large-scale transcription crowdsourcing to improve access to records. Staff have looked closely at other crowdsourcing sites on offer and attempted to build on their knowledge and techniques, to provide a site that could be used as a large-scale platform for a variety of transcription crowdsourcing projects.

At present the site offers just over 800 lists for the public to transcribe. Some of these are typed and some handwritten. They are rated in difficulty as easy, medium or hard. Part of the difficulty with this project is that the public need to have some understanding of how archives receive and describe their records to make sense of what they are being asked to do. In simple terms, archives receive vast quantities of records (referred to as consignments). Each consignment comes with a list of the items in it. However, because of the large volume of records being received, it is usual that only the consignment record is entered into the catalogue, e.g. ‘100 boxes of plans and drawings’ from x government agency, rather than all the individual items on the consignment list being described in the catalogue. The ideal scenario for users of the archives is that every item, e.g. each plan and drawing, is described in the catalogue so that it can be found. Without this a lot of guesswork goes into finding relevant things, or alternatively personal visits are required to view the hard-copy consignment lists.
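To make the consignment/item distinction concrete, here is a minimal Python sketch. The field names and sample values are invented for illustration and are not the NAA's actual catalogue schema:

# A minimal sketch of the two levels of description discussed above.
# Field names and values are illustrative only, not the NAA's real schema.

consignment = {
    "series": "A9999",                      # hypothetical series number
    "description": "100 boxes of plans and drawings from agency X",
    "boxes": 100,
    "items_catalogued": False,              # only this summary entry is searchable
}

# What the Hive's transcribers would effectively produce: one searchable
# record per line of the handwritten consignment list.
items = [
    {"series": "A9999", "item_no": 1, "title": "Plan of Parliament House, 1927"},
    {"series": "A9999", "item_no": 2, "title": "Drawing of hospital elevation, 1935"},
    # ... one entry per item on the consignment list
]

def search(query, records):
    """Naive full-text search over item titles, only possible once item records exist."""
    return [r for r in records if query.lower() in r["title"].lower()]

print(search("parliament", items))  # finds item 1; impossible at consignment level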
 
The project the Archives is undertaking is to digitise consignment lists and then make them available for transcription by the public. Once transcribed they become searchable and the items within them can be found more easily. Because so many of the lists are old and handwritten it is virtually impossible to get good OCR on them. That’s where the public come in: people can read them with the human eye. The public’s time is also needed to speed up access. Projections of the time it would take archives staff to describe the lists without public help currently stand at 210 years. It is anticipated that a member of the public could, with relative ease, describe several hundred items per hour with the Hive tool, which would make a big difference, especially if there were a swarm.

The consignment lists in the pilot are those that have proved most popular with researchers and contain items in the ‘open period’, that is, items older than 30 years and now open to the public. The topic of greatest interest is lists of architectural drawings and historic buildings. This is closely followed by PNG patrol officer records, maritime incidents, personal records from the War Office, prisoners of war, meteorology and cyclones, WW1 intelligence, and oil drilling on the Great Barrier Reef.

In the first two weeks, 300 of the 800 records have been transcribed. There is a definite preference for the lists rated hard (handwritten) and for ones that involve names.

The site is well presented and gives volunteer transcribers things we know they want: a progress chart, recent activity, a points scoring system, rewards, optional login using OpenID (e.g. a Google ID), the ability to search for and choose items or just take the next one served up, to pick easy or difficult items, to add a marker for where they got to if they are interrupted, and to favourite records. The only slight drawback is the placement of the transcription window at the bottom of the screen rather than to the right or left, which often makes it hard to see the transcription window and the content being transcribed at the same time. The OCR text in the transcription window and the cursor are also not linked directly to the text in the image, so it is easy to get lost while transcribing. This is largely because most of the lists are tables, and the table rows and columns have not been retained in the OCR, so the OCR text is somewhat muddled. Further development of the site will largely depend on the feedback given by public users, and on the ability of the Archives to keep up a steady supply of new, interesting digitised consignment lists in the Hive. The Archives is still considering how it might integrate the public content back into its main catalogue, RecordSearch, or integrate the Hive into RecordSearch. In the meantime the list content will remain searchable in the Hive.

There is obviously an expectation from the Archives that making its content more discoverable will lead to more access requests. This is why, at the point of transcription, there is a button which enables the user to request a copy of the item. These requests are being met by digitising the item and then uploading it into the main catalogue, RecordSearch, with the full item description.

I congratulate the National Archives of Australia Access Team on the development of this exciting new site, which holds so much potential to improve access to records and engage with our citizens in new ways.

The screenshots below show the site in action:



Easy level transcription - Archived drawings

Medium Level Transcription - ABC Drama Scripts
 


Difficult level transcription - Plans
 


Tuesday 2 October 2012

Digital Motor Archive available free for the month of October in return for…


It came to my notice last week that a publisher in the UK had digitised both the current and back issues of their magazine, the ‘Commercial Motor’, which in its early life was a newspaper. In the cut-throat world of publishing there are few publishers left that are still publishing the same title they were 100 years ago, and who also have a complete set of back copies. Those that fall into this bracket are in the unique position of being able to either digitise the content themselves for their readers (usually at a loss); offer it to a library to digitise (at no cost); or sell it to a commercial e-vendor to package with another product for academia (and make a profit). Unfortunately most choose the latter, which makes this type of content only really accessible to academics and students via academic libraries. E-vendors normally charge high subscription rates for digitised magazines and newspapers and package them up with other content, making them only viable for large universities and national libraries to purchase, and therefore severely restricting readership of the content.

Not a lot of publishers are digitising their own content because, generally speaking, the costs of preparation, digitisation and OCR, and of building a good website to deliver the content, outweigh the money they would ever recoup from reader subscriptions. Normally a subscription to the current copy would be packaged up with the old copies. And that’s where the model fails, because often current readers have no interest in the old stuff. People who do have an interest in the old stuff are generally a different group of people – historians, researchers and so on. The exception to this rule appears to be anything to do with hobbies such as knitting, cooking, railways, cars, and stamps.

The Commercial Motor Archive http://archive.commercialmotor.com/ came to my notice because it is available for free for the month of October and I wondered why. It is a rich archive going from 1905 to the present day, covering a complete century and two world wars, well illustrated and with everything you ever wanted to know about commercial vehicles.   

The search and browse mechanism is very impressive and works well. For example, articles on pages have been zoned so you can search for and find an article easily within a page. It has many similarities to the hugely popular Australian Newspapers http://trove.nla.gov.au/newspaper. The page displays alongside the OCR text to make it easier to read. You can browse covers, browse by date, and zoom in on pages. Results can be filtered. Users can add comments and tags. The quality of the OCR text, and therefore of the search, is very good.

It has one thing that was never implemented on Australian Newspapers (though often asked for by users): a little box on each page called ‘Report an error’, and this is the reason for its free access. The site owner is hoping that as people use and read the pages in the archive they will report errors as they see them, and in return they get free access to the content. The known errors that need to be identified are incomplete articles (where the zoning has gone wrong), OCR errors in article headlines, and OCR errors in the body text. However readers can only report errors, not actually fix them. The site states:

“the archive is beta because it isn’t perfect at the moment and there are a few glitches to be ironed out. Every article page has a 'Noticed an error?' button you can use to report a problem. Please don’t expect an immediate change to the error - we gather all the reports together and prioritise them, fixing the most pressing errors first.”

It sounds like they don’t know how many errors there are, how many people will report errors, or who will fix the errors and how. Interesting.

Although the ‘report an error’ button was never implemented on Australian Newspapers (it was mainly wanted for reporting upside-down and duplicate pages), it had already been decided that a ‘super user’ would be the person to review such reports and take action. In a world where some volunteer text correctors wanted to take on extra responsibility and have special roles, like the hierarchy in the Wikipedia editor community, this would have been a good thing for trusted volunteers to do.

The Commercial Motor Archive has impressed me because they are clearly striving for perfection: they understand that the fewer mistakes there are, the better the search will be, and they have taken a brave step by asking the public to help in return for free access. This is indeed unusual for a commercial publisher, belonging more in the realms of libraries and archives and referred to as crowdsourcing…

 

Sunday 26 August 2012

Digital vandalism or just good fun? The Prado comes to Brisbane


Since I was attending the International Congress of Archives (ICA 2012) in Brisbane last week I had the opportunity to visit the Queensland Art Gallery which is a stone’s throw from the Brisbane Convention Centre.  As an added bonus all conference attendees got a discounted entry to the ‘Portrait of Spain’ Exhibition, which has 100 paintings from the Prado, Madrid on show.

I was surprised to see that visitors were encouraged to get their iPhones out at the start of the tour. The primary reason was to take a photo of yourself against a backdrop of a Prado gallery, so that you could pretend to friends you had actually been to the Prado. The ticket collector obliged and took my photo:
 
As I approached the first painting a security guard warned me that photos were not allowed, with or without flash, and that I should put my iPhone away. At the next painting, when I commented to another visitor how little description was provided beside the paintings, another security guard overheard and said ‘oh, you can use your iPhone’. Now somewhat perplexed, I asked again if I could take a photo. ‘No, but you can scan the QR codes beside the paintings, which tell you more information about them’. [If you don't know what QR codes are and how galleries and museums use them, read this short explanation].

I couldn’t be bothered with that. That is, until I got to a breathtaking painting that had absolutely no explanation of why it took your breath away. The painting appeared to be of a man (with hairy forearms, a moustache and a thick neck) but dressed in formal women’s court clothing, with a bust. The description made no mention of this surprising phenomenon. At that point I got my iPhone out and checked the QR code. Disappointingly, still no information on the surprise, simply that the artist had ‘a good eye for detail and had painted the hair well. The woman was graceless and had a less than feminine appearance’. I made a note of the painting to look it up afterwards and see if anyone else had more information on the person in it that they were willing to candidly share. Perhaps a Wikipedia entry? The painting is called Señora de Delicado de Imaz by Vicente López Portaña, from the Spanish court of 1833.

On leaving the exhibition I felt in dire need of a cup of tea so headed towards what I thought was the café, only to be blocked by a guard who demanded to see my entry ticket (yes there are a lot of guards).  I queried why I should need to show my entry ticket to partake of tea and was told that this café was themed with Spanish food (?). I was about to turn away when she also mentioned that the photo booths were included in the ticket price.  Being a sucker for a photo I couldn’t resist so passed through the checkpoint having no idea what the photo booth would do. 

Well – talk about surprise!! You are absolutely not allowed to take a photo of a painting in an art gallery. The reason usually given is copyright, or misuse or inappropriate use of images. Clearly the paintings in the Prado exhibition are out of copyright, dating from 1500-1800. Images of them are sold in the gift shop. However the photo booth allowed me to stick my own face into a selection of the Prado portraits (in a similar way to sticking your head through a funfair cardboard cut-out of Popeye and Olive) and create my own digital image. After all the highbrow gallery poppycock I have heard over the years about galleries digitising images and the rules they have made up around this, I was staggered and thrilled to be able to do this fun activity (which I think is probably aimed at children!). It was the most fun I have had for a while and my first foray into what is commonly called ‘digital vandalism’ or misappropriation of digital artworks.

It made me recall the long-running battle that Wikipedia had with art galleries over the use of images. On one particular occasion Liam Wyatt, the Vice-President of Wikimedia Australia, spoke about how the public were not allowed to take or share photos of artworks for invalid reasons of copyright ownership or inappropriate or commercial use of images, yet the galleries or museums in question would use the same images themselves to make things like ties or mugs for profit in the gift shop. This was exactly such a case, but more extreme than any I have yet seen. In case you are in doubt – I am endorsing the activity offered in the photo booths at the Queensland Art Gallery. The e-mail I received with my bastardised portrait also had an animated version… where things like my hand and head moved – crikey! The experience led me to formally write to the Queensland Art Gallery (via their online form) and ask them why, if I can do this, I was not allowed to take a photo (without flash) of the same painting in the exhibition. That was 7 days ago and I still have not had a reply…

Incidentally, I cannot find a Wikipedia entry for the portrait of Señora de Delicado de Imaz, or anything about her life and circumstances, which I am sure are most interesting. I also could not find a really good digital print online that matched the real-life experience of seeing the painting. I did manage, however, to take a photo of the painting myself via the photo booth. For some strange reason the photo booth kept offering me this painting as the perfect match to put my face into.

However I preferred to go with a much more regal match.

Me with my head digitally stuck into the portrait of La infanta Isabel Clara Eugenia y Magdalena Ruiz by Alonso Sánchez Coello, 1588.
The animation had me fiddling with my miniature and the monkeys…

 

Saturday 25 August 2012

Crowdsourcing and Social Media at US National Archives (NARA). The Citizen Archivist Dashboard


Last week I attended the International Congress of Archives (ICA 2012), which was held in Brisbane. Over 1,000 archivists from 93 countries attended.

The much-anticipated opening keynote on the first day was given by David Ferriero, head of the US National Archives. He is the first librarian to become the US National Archivist, having previously been in charge of the New York Public Library, and is known for promoting the use of social media and relationships with Google and Wikipedia. His talk was called ‘A world of social media’. I was looking forward to hearing what the US National Archives are doing with social media and crowdsourcing. People were generally of the opinion that this organisation is, or will be, leading by example in this field.

David Ferriero took to the stage and took us by surprise. He used only 20 minutes of his 40-minute slot, gave no presentation, and instead read from his notes at breakneck speed, bombarding us with statistics that were largely out of context. At the end he took no questions and dashed off the stage. He left a surprised and bewildered audience behind. I for one was immensely disappointed not to see and hear more about some of the exciting US Archives activities. He may of course have had mitigating circumstances that I am totally unaware of. He did, however, give small tasters of what his organisation is doing. There was brief mention of large-scale crowdsourcing on unspecified projects, a citizen archivist dashboard, and a relationship with Wikipedia, which piqued my interest.

So I decided to follow up online and find out for myself what may be happening at NARA. It took me quite some time to search the internet and blogs to get the information I had hoped David would give in his keynote, but it was worth it. Here is what I found:

1. Citizen Archivist Dashboard Webpage http://www.archives.gov/citizen-archivist/

In January 2012 the US National Archives launched the Citizen Archivist Dashboard. This is a great webpage bringing all the online and physical social engagement and crowdsourcing activities together.  It is easy for someone to see what options they may have to help the US National Archives. It is very clearly designed and I like it a lot.

 
2. Transcription Projects

There are two transcription projects going on for handwritten records. The first is the National Archives Transcription Pilot Project. It appears still to be in ‘pilot’ mode (it started in January 2012) since only 300 documents (about 1,000 pages) are available for transcription. They have been very carefully selected from a collection of billions of pages and graded by colour codes according to how difficult the handwriting is to read. This pre-selection must have taken very valuable staff time. You can browse or search by difficulty of transcription, year, and the status of transcription: “Not Yet Started,” “Partially Transcribed,” and “Completed.” You then choose a page to work on, and that page is then locked to other users so it is not edited by multiple users at the same time. The interface is very simple, much like Australian Newspapers. In a free-text box beside the image you can transcribe what you see. No login is required, though you do have to complete a CAPTCHA.
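The page-locking behaviour described above (one volunteer per page at a time) is simple to approximate. This is a hypothetical Python sketch of the idea, not NARA's actual implementation; the timeout and identifiers are invented:

import time

# Hypothetical sketch of per-page locking so two volunteers do not
# transcribe the same page at once. Not NARA's real code.
LOCK_TIMEOUT = 30 * 60          # release a lock after 30 minutes of inactivity
locks = {}                      # page_id -> (user_id, timestamp)

def checkout_page(page_id, user_id, now=None):
    """Give the page to this user unless someone else holds a fresh lock."""
    now = now or time.time()
    holder = locks.get(page_id)
    if holder and holder[0] != user_id and now - holder[1] < LOCK_TIMEOUT:
        return False            # page is blocked to other users
    locks[page_id] = (user_id, now)
    return True

def release_page(page_id, user_id):
    """Free the page once the transcription is saved or abandoned."""
    if locks.get(page_id, (None,))[0] == user_id:
        del locks[page_id]

assert checkout_page("doc42-p3", "volunteer_a")
assert not checkout_page("doc42-p3", "volunteer_b")   # blocked while A works
release_page("doc42-p3", "volunteer_a")
assert checkout_page("doc42-p3", "volunteer_b")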

The missing part is that I can’t see how many people have transcribed what. It’s not clear whether documents disappear from here when fully transcribed, or how and where they become full-text searchable in the collection. It also seems to be a time-consuming process for NARA staff to do the pre-selection and difficulty rating of the documents. This is of course a very small pilot, and hopefully lessons will be learnt and the site will be developed further to reach its full potential. It would also be good if more documents became available for transcription. This is one of the easiest handwritten transcription tools I have seen. I could not find any information about who developed the tool or whether it is available open source.

Interestingly David Ferriero says that many US school children are no longer taught cursive handwriting and therefore cannot read handwriting. He says ‘Help us transcribe records and guarantee that school children can make use of our documents’. I’m not quite clear if he thinks this is a potential crowdsourcing exercise for school children to learn handwriting and become better educated, or if adults are supposed to do it so that school children can just read the finished text.

The National Archives have developed a relationship with the Wikipedia community and currently have a Wikipedian in Residence. As part of that program they have shared some primary handwritten national documents on ‘Wikisource’ for transcription via the Wikisource tool. These documents are mostly at the beginner level in terms of difficulty. I’m not clear whether they are the same ones being used in the Archives’ own pilot, or different documents. I’m also not clear why they are piloting two different methods for transcription, or how the initial results compare to each other. Wikisource offers more than transcription, however: Wikipedians (if they can get access to original documents or copies) can also scan documents and OCR them.

3. Scanning Projects

  • Scanathons
For reasons I don’t understand the US National Archives has only digitised 750,000 of its 40 million images. This is a very low figure for an organisation like this. They seem to be focusing quite a lot of effort on getting volunteers to come in person to the Archives to digitise/scan images for them at ‘Scanathons’. This started in 2011. In January 2012 there was a four-day Wikipedia ExtravaSCANza. Over the four days a group of Wikipedians met in the Still Pictures Research Room and scanned 500 images on desktop scanners. Each day had a theme: NASA, women’s history, Chile, and battleships.

NARA also encourages readers to take their own photos of records in the reading rooms and upload them to a special group on Flickr. The important thing here is that they should also be described with title, series, and record group if possible so they can be found. So far only 20 people have joined the group and 133 photos have been uploaded (most of these by the same person). I’m not clear how NARA intends to link these digital images back to the item descriptions in their collections, but this is a great idea for tackling large-scale digitisation of images.
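One way the title/series/record group metadata could travel with a reader's photo on Flickr is as machine-style tags that can later be parsed back into catalogue fields. This is a speculative Python sketch; the 'archive:' tag convention and the values are my own invention, not NARA's practice:

# Hypothetical convention for packing catalogue metadata into Flickr tags
# so reader-contributed photos can be matched back to item descriptions.

def build_tags(title, series, record_group):
    """Return machine-style tags, e.g. 'archive:series=E44'."""
    return [
        f'archive:title="{title}"',
        f"archive:series={series}",
        f"archive:recordgroup={record_group}",
    ]

def parse_tags(tags):
    """Recover the metadata fields from a photo's tag list."""
    fields = {}
    for tag in tags:
        if tag.startswith("archive:") and "=" in tag:
            key, value = tag[len("archive:"):].split("=", 1)
            fields[key] = value.strip('"')
    return fields

tags = build_tags("Letter re lighthouse supplies", "E44", "RG26")
print(parse_tags(tags))
# {'title': 'Letter re lighthouse supplies', 'series': 'E44', 'recordgroup': 'RG26'}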

 

4. Tagging

The tagging facility, unlike the other pilots, seems to me unlikely to succeed in its objectives. This is perhaps because of the tight controls that have been placed around it and the isolation of the activity from normal search and browse behaviour. Whilst anyone can easily transcribe a record without needing to log in, the process for tagging is difficult.

The activity is focused on Tuesdays and themed around a topic. Records for the topic are pre-selected by the Archives and made available in an online group, e.g. Elvis or the Titanic. Volunteers must register and follow a set of guidelines; tags will be reviewed by NARA staff before being accepted and going live on the database. I looked at the topics and it was unclear to me why, if the Archives had already identified the items as being about Elvis, they couldn’t simply generate an automatic tag for ‘Elvis’. In my opinion tagging is not really a crowdsourcing activity, because individuals are motivated to add tags to help themselves find things; it is a by-product of search. Research shows it is rare for users to have consensus on tag terms and use. Crowdsourcing activities achieve a big, clear goal that could not be achieved by individuals alone, and everyone in the crowd should be aware of how they are helping the ultimate goal.
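On the question of the automatic tag: if the Archives has already curated records into a themed group, applying the theme as a tag is a trivial bulk operation. A hypothetical Python sketch (item identifiers invented) of what that would look like, leaving volunteers to add only the tags that need human judgement:

# Hypothetical sketch: items already curated into a themed group could
# receive the group's tag automatically, reserving volunteer effort for
# tags that genuinely need human judgement.

themed_groups = {
    "Elvis": ["item-001", "item-002", "item-003"],
    "Titanic": ["item-101", "item-102"],
}

tags = {}   # item_id -> set of tags

for topic, item_ids in themed_groups.items():
    for item_id in item_ids:
        tags.setdefault(item_id, set()).add(topic)   # automatic theme tag

# Volunteers would then only add what a machine cannot infer,
# e.g. names of people, places or events visible in the record.
tags["item-001"].add("Graceland")

print(tags["item-001"])   # {'Elvis', 'Graceland'}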

5. Indexing the 1940 Census

On April 2, 2012, NARA released the digital images of the 1940 United States Federal Census after a 72-year embargo. The census images will be uploaded and made available on Archives.com, FindMyPast.com, the National Archives, ProQuest, and FamilySearch.org. The entire 1940 census will be indexed by a community of volunteers and made available for free. The free index of the census records and corresponding images will be available to the public in perpetuity.

6. Useful Links

I found a recent presentation given this year by Pamela Wright, Chief Digital Access Strategist at NARA, which gives screenshots of what I have talked about above: ‘From access to engagement’.

7. Social Media

NARA is an active user of social media channels and has started to monitor its activity. The social media statistics from NARA for May 2012 may be interesting reading for some.

I would be interested in reading more presentations or articles about the citizen archivist pilot projects from NARA and finding out what they have achieved and learnt so far. I hope this information is made available to the archives and library community soon. Please reply in the comments if you have any more information on the pilot activities.

Sunday 17 June 2012

If only they would crowdsource! – Diamond Jubilee - Royal Archives at Windsor Castle


Many years ago I worked for a software company installing the first archive management systems into large UK archives such as the London Metropolitan Archives, Cumbria Archives at Carlisle Castle and the Royal Archives at Windsor Castle.  It was a challenging time for archives going from paper systems to computer systems, in fact very similar to the challenges archives now face transitioning from managing paper records to born digital records.  Ironically I have just returned again to the archives sector and am now working at the National Archives of Australia on the second challenge.

When the first computer systems were installed in archives, it often came as a shock to archivists to discover that the system would arrive ‘empty’ and their records would not somehow miraculously appear in it. This was the first piece of news I usually had to convey in training before showing an online process for acquisitions. I particularly remember that at the Royal Archives they estimated that with their staff of four it would take them 700 years to record their archive collection in the new system, and they were somewhat despondent, to say the least. Nevertheless the Queen was pleased with the installation of the first computerised system at Windsor Castle and awarded the software company I worked for the Royal Warrant, which meant we could use the Royal Coat of Arms on our letterhead. The warrant is more often seen on pots of jam and pickle than on software. The implementation celebration party at Windsor Castle with members of the Royal Household and staff was one to remember.

The Round Tower at Windsor Castle contained every handwritten record that every monarch and member of their household had ever created. Queen Victoria’s collection was particularly large. The Royal Archives could only be contacted by letter, and each year fewer than 10 well-vetted members of the public were allowed to access a very restricted and pre-agreed part of the collection under strict supervision. Because the collection was largely uncatalogued, undescribed and unknown, there was a terrible fear of what a member of the public might find in the archives. This was understandable, since household records such as the cost of banquets were intermingled with personal letters and diaries. From the public’s point of view the archive is that of our Kings and Queens and we would like to access it, but from Her Majesty’s view it is her private family archive. Although it is now more acceptable to expose skeletons in the family closet, and programmes such as "Who Do You Think You Are?" promote this, there is probably a reluctance among the aristocracy and royalty to do so. The Royal Archives is one of the richest, most interesting and significant collections ever created. It could aptly be described as a pot of gold – an absolute treasure trove. The archivists were aware of this and of some of the treasures within it. The Royal Library at Windsor Castle was in a similar situation and also had extremely restricted access. Because I have always championed access to archive and library collections I felt very sad whenever I thought of the treasures locked up and hidden (literally) at Windsor Castle.

I was very interested therefore to read about a new development at the Royal Archives timed to coincide with the Diamond Jubilee.  The Queen released this message:

“In this the year of my Diamond Jubilee, I am delighted to be able to present, for the first time, the complete on-line collection of Queen Victoria's journals from the Royal Archives. These diaries cover the period from Queen Victoria's childhood days to her Accession to the Throne, marriage to Prince Albert, and later, her Golden and Diamond Jubilees. Thirteen volumes in Victoria's own hand survive, and the majority of the remaining volumes were transcribed after Queen Victoria's death by her youngest daughter, Princess Beatrice, on her mother's instructions. It seems fitting that the subject of the first major public release of material from the Royal Archives is Queen Victoria, who was the first Monarch to celebrate a Diamond Jubilee. It is hoped that this historic collection will make a valuable addition to the unique material already held by the Bodleian Libraries at Oxford University, and will be used to enhance our knowledge and understanding of the past.”

I was intrigued by this and immediately found the website http://www.queenvictoriasjournals.org/home.do, which tells us a lot more about Queen Victoria’s diaries and that this was a project undertaken in conjunction with the Bodleian Library at Oxford and ProQuest. However, on looking further it was a bit disappointing, since although every page of all the journals has been scanned, they have not all been transcribed. Because they are all handwritten, they won’t be fully text-searchable until they are all transcribed, a process which at present is most effectively done by the human hand and eye. The website doesn’t give any indication that I could see of when or how they will be transcribed, although it says transcription ‘is in progress’. So far only the first diary has been transcribed, by whom I am not sure. I bet the project is only letting academics do it, who will be paid lots of money and progress very slowly. There is a lot to do: 1832-1901, since Queen Victoria wrote her diary every day.

If ever I saw a collection that was well suited to crowdsourcing for public transcription, this is it! I could guarantee that in a few days or weeks all of Queen Victoria’s diaries would be transcribed by a willing and fascinated public. The handwriting is hard to decipher, but with thousands of highly motivated eyes, and amateur and professional genealogists and historians used to reading old writing, I am sure it could be achieved. I feel excited just imagining it. But why stop there? What about the rest of the collection – the official royal records and the personal records? When is that going to come out of hiding? It’s just crying out for public description, tagging and transcribing. If only. If only.

Extract of Queen Victoria’s diary.

Then, thinking I would come back later and have another look, I was most disappointed to read that, following the example set by the British Library with its UK digitised newspapers, the intent is to restrict access to the UK only, and to charge for access from July. So, loyal British subjects living in Commonwealth countries, and academic researchers – you only have 14 more days to look at this for free, or at all. A great shame! But congratulations to whoever it was behind the scenes who convinced Her Majesty to release the diaries from the Royal Archives, and who set up and managed the project with the Bodleian and ProQuest. Bravo!! Perhaps we just need to beg and grovel for more content and offer our unconditional help to get it for free.

Photo by Rose. June 2012. After participating in Trooping the Colour for the Queen’s Birthday in Canberra, Irish Guard Cliff Doidge (who plays the clarinet in the Royal Military Band and is on exchange from London to Australia for 4 months) stands beside Lake Burley Griffin with the National Library of Australia behind.



Tuesday 8 May 2012

Libraries harnessing the cognitive surplus of the nation

It was with great pleasure that I accepted an invitation to lunch at the Parliament of New South Wales last week with Her Excellency Marie Bashir, the Governor of NSW. The lunch was in memory of Jean Arnot, a forward-thinking librarian. In her memory, each year a female librarian is awarded a prize for the best essay on librarianship. This year I was the Jean Arnot Memorial Fellowship prize winner for my essay: ‘Harnessing the cognitive surplus of the nation: new opportunities for libraries in a time of change’.

The judges said:

“your essay was energetic and passionate, and argued cogently for your position, which obviously has significant import for the Library profession”.

The essay was an amalgamation of my ideas, research and practice over the last four years into crowdsourcing in libraries. Although it is aimed at librarians it is equally relevant to archivists. The essay focuses on the idea of cognitive surplus and on how and why libraries urgently need to tap into this opportunity. ‘Cognitive surplus’ is a phrase coined by the author and academic Clay Shirky (whose mother is a librarian). It means the free time that people have in which they could be creative or use their brains. Many people spend their ‘cognitive surplus’ time watching hours of television, gaming, surfing the internet or reading. However, due to the increased availability of the internet in households, the rise of social media technology, and the desire of people to create rather than just consume, there is now a major change in the use of cognitive surplus time. People want to produce and share just as much as, if not more than, they consume. Thanks to new forms of online collaboration and participation, people are seeking out and becoming very productive in online social endeavours. Clay Shirky hypothesises in his books that there is huge potential for creative human endeavour if the billions of hours that people spend watching TV are channelled into useful causes instead.

I suggest that libraries can and should harness this cognitive surplus to save themselves. Four powerful examples of libraries harnessing cognitive surplus are:

2008. The National Library of Australia set an international example of how to harness the cognitive surplus of the nation with the Australian Newspapers service. The community is able to improve the computer generated text in digitised historic newspapers by a ‘text correction’ facility, thereby improving the search results in the service. 40,000 people have corrected 52 million lines of text.

2010. The National Library of Finland was the second library to implement community newspaper text correction in their Digitalkoot crowdsourcing project. So far 50,000 people have corrected the text to 99% accuracy.

2011. The New York Public Library released ‘What’s on the menu?’, a crowdsourcing project where the community transcribe text from digitised menus held in the library’s collection. So far 800,000 dishes have been transcribed from 12,000 menus, making them full-text searchable.

2012. The Bodleian Library released the fourth large-scale library crowdsourcing project this year. ‘What’s the score?’ is a project where the community can help describe the vast music score collection at Oxford.

If the library profession leverages our expertise with technology and collaboratively harnesses the cognitive surplus of the community we will be able to develop, expand, and open our collections. We will be able to enhance and preserve the social history of the nation while meeting the ever-changing needs of our society. By engaging the community, libraries can develop projects equal in scale, quality and output to commercial endeavours.

The survival of libraries is under threat and I believe that gaining the help of our community with their ideas, knowledge, skills, time and money is the answer. To remain relevant and valued in society libraries must look at their collections and communities in new, imaginative and open ways.  We have the technology to do whatever we want. We must change our culture and thinking to embrace new opportunities such as crowdsourcing on a mass scale.  The value and relevance of libraries is two-fold. It lies in both our collections and in the community that creates, uses, and values these collections. Let us demonstrate this and our place in it. Let us hold onto our original values of open access to all, and do whatever it takes to remain core, valued and relevant in society.

I would encourage you to read the full essay, pass it on to your colleagues, think about this idea deeply, and work out how you can harness cognitive surplus to help your profession and organisation in the immediate future.

Photo: Women reach for the skies; big opportunities are out there…
This Andrew Rogers sculpture was unveiled at Canberra Airport on 2 April 2012. It is the largest bronze figurative sculpture in Australia and is called ‘Perception and Reality 1’.

Monday 7 May 2012

Church archive starts crowdsourcing: help tag sermon podcasts

There are many church and cathedral archives around the world but a particular one that has just caught my eye and held my interest is All Souls Anglican Church at Langham Place, London. This is because of a crowdsourcing project it has started.  It is setting a fine example for other cathedral and church archives to follow. In a blog post last week the church appealed for Christian volunteers to help make the archive more accessible and used.  The church upholds the principles of information access, strongly believing that resources it generates should be free and open to the community. The church puts its current sermons and talks up on its website as podcasts.  However they have a large back archive of sermons: 3,600 to be precise.  As far as I can see these are all available as podcasts.  To increase their usage and make them more findable they want the community to add subject tags to them. There is a webpage explaining how to do this.  

I followed through to see how simple the process would be. It is pretty simple and easy to do, but there are a couple of surprising things here. Firstly, it is assumed that only one person needs to allocate tags to a sermon and that they will put the ‘right’ tags on. Because of this, once someone has ‘grabbed’ a series of sermons to tag, no one else can pick them, as far as I could see. It may have been set up like this because they thought that not enough people would sign up to help. However, even though the call for help only went out last week, there are very few sermon series left that haven’t been grabbed. I think they have underestimated the interest and enthusiasm of the crowd here. Personally I think it may be helpful to encourage more than one person to add tags to the same sermon. The general premise in crowdsourcing is to use the wisdom of the crowd, and this is particularly relevant for tagging. In order to choose tags, the sermon or talk needs to be listened to first. This takes about 30 minutes for each one.

The next interesting thing is that volunteers pick 3-4 tags from a very small controlled list and then have the chance to add one tag of their own choosing that is not on the list. That one tag will be moderated by the archivist (and presumably added to the list if deemed suitable and often used). This is the first time that I have seen such a combination tagging approach. Again, I’m not quite sure about the thinking behind this and would like to know more. This is a very interesting project to me because, firstly, it is a small, controlled experiment in crowdsourcing where it will be very easy to report back to the community on results, levels of activity and lessons learned. If successful, as I am sure it will be, it could easily be replicated in other church archives, or widened to other item types in the church archive. It is also a demonstration of how to make audio-visual content more searchable, as well as calling on a specific group of the community – Christians.
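The combination approach (3-4 tags from a small controlled list, plus one free tag held back for moderation) is straightforward to express as a validation rule. The following Python sketch is purely illustrative of the workflow; the vocabulary terms and function names are invented, not the church's actual system:

# Hypothetical sketch of the combination tagging workflow described above:
# 3-4 tags must come from a controlled vocabulary, plus at most one
# free-text tag that is queued for the archivist to moderate.

CONTROLLED_VOCAB = {"prayer", "grace", "mission", "gospel", "psalms"}  # invented terms

moderation_queue = []   # (sermon_id, proposed_tag) awaiting archivist review

def submit_tags(sermon_id, controlled_tags, free_tag=None):
    controlled_tags = set(controlled_tags)
    if not 3 <= len(controlled_tags) <= 4:
        raise ValueError("Pick 3-4 tags from the controlled list")
    unknown = controlled_tags - CONTROLLED_VOCAB
    if unknown:
        raise ValueError(f"Not in the controlled list: {unknown}")
    record = {"sermon": sermon_id, "tags": sorted(controlled_tags)}
    if free_tag:
        moderation_queue.append((sermon_id, free_tag))   # archivist decides later
    return record

print(submit_tags("sermon-1984-07-15", ["prayer", "grace", "mission"], free_tag="stewardship"))
print(moderation_queue)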

I am really interested to hear more about the results and lessons learned from this small experiment.

Photo: I was lost and parked the car to consult the map when I noticed the car in front of me, it gave me a chuckle…



Sunday 29 April 2012

Mobilising and archiving social metadata (user generated content).


It is fantastic to see members of our library communities adding their own knowledge and opinions to our content through features such as tags and comments, and social media tools such as Twitter and Facebook. More libraries than ever before are opening their content and sites to their communities through these tools and features. We call this content user generated content (UGC) or social metadata.

But if we think about it for too long it gives us a big headache. Being of the ‘collecting’ mind we really want to care for and keep the UGC in the same way we care for our collection content.  Caring for it means:

  • knowing how much has been added and keeping meaningful statistics.
  • keeping the UGC in context with the data the users meant it to be related to.
  • archiving it for the long-term.
  • being able to migrate it along with our own content as our services and interfaces change in the future.
  • being able to mobilise it to share with other services.
  • being able to easily supply it back to the original creators if they want it.
Doing any one of these things is currently difficult, let alone all of them together. We really haven’t got our act together yet for managing UGC content and social metadata, only enabling the facility for the community to add it.

Firstly, let’s take a simple concept. A member of the community is actively engaged with your site. They are contributing a lot of data to it in the form of comments and descriptions. After a while they want to get all of ‘their’ data out so they can use it for something else they are working on. Let’s call this ‘user takeout’. It seems reasonable, it seems simple, but I don’t know of any library site that does it. For example, a ‘user takeout’ option in Trove newspapers would let a contributor get a copy of all of the comments and tags they have added to historic newspaper articles. You may ask, “Do people want to do this?” Contributors seem to accept that content they add to sites will be locked to that site. I’m not sure they even think about it very much when they start to add stuff, or check the user licence for the terms. Many don’t intend to add the volume of stuff that they do. But suddenly they think about it when either a better site comes along that they would like to transfer or copy their content to, or the site they are adding to is unexpectedly taken down or frozen.

Recent examples in the news are Facebook users wanting to be able to transfer or ‘take out’ their photographs from the site. Although it is of course technically easy to implement this, social media sites such as Facebook are reluctant to let users do it, for fear they will take their content and move to competitors’ sites. In the library world, however, it is reasonable that users may want to share their value-added data around multiple library sites, and yet we still don’t enable it. Another item in the news was the sudden closure of poetry.com. Over 7 million users were given 15 days’ notice that the 14 million poems they had added would be taken down when the site was sold. They were not given an easy option to ‘take out’ their poems; instead it was suggested that they could copy and paste their poems if they had time. This infuriated many users who read the message, and many others who didn’t read the message in time. It’s worth pointing out here that a lot of sites people use frequently and think are for the common good are actually commercial sites that can do exactly what they like, and never promise to keep, manage or archive content in the same way libraries do. Although the new owners restored the poetry.com site, it appears that the 14 million poems added prior to 2012 are still not restored, hence the large pink box at the top: ‘Where’s my poem?’
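At its simplest, 'user takeout' is just a per-user export of contributions in an open format, with enough context to reuse them elsewhere. A hypothetical Python sketch (field names invented; no existing library service is being described):

import json
from datetime import datetime, timezone

# Hypothetical sketch of a 'user takeout' export: all tags and comments a
# contributor has added, bundled as JSON they can reuse elsewhere.

contributions = [
    {"user": "rose", "type": "tag", "target": "article/123", "value": "shipwreck",
     "created": "2012-03-01T10:15:00Z"},
    {"user": "rose", "type": "comment", "target": "article/456",
     "value": "This mentions my grandfather.", "created": "2012-03-02T09:00:00Z"},
    {"user": "someone_else", "type": "tag", "target": "article/123", "value": "storm",
     "created": "2012-03-03T11:30:00Z"},
]

def user_takeout(username, records):
    """Bundle one user's contributions, with their context links, as portable JSON."""
    export = {
        "user": username,
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "contributions": [r for r in records if r["user"] == username],
    }
    return json.dumps(export, indent=2)

print(user_takeout("rose", contributions))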

If we think about measuring our user activity and data through all channels, i.e. our own site as well as Twitter, Facebook and so on, we hit a brick wall. Providing useful statistics on both the volume and the value of social metadata is difficult. For social media sites such as Twitter and Facebook your options are either to buy costly software and do it yourself, or to employ a company (many of which are springing up) to do it for you. These companies, however, would have great difficulty integrating measurements of the value and content from social media sources with those for contributions that go directly to your site, i.e. your own comments, tags and blogs. Doing the measurements separately is difficult, and combining them even more so.

Many libraries are part of central or local government and so have requirements to archive records and content they create, which should also include social metadata and media. But does anyone know the best way to do this, and are our archives agencies telling us how to do it? The simple answer is no. The National Archives in the USA (NARA) says it is working on it as a matter of urgency and is due to explain how it should be done by this July. The National Archives of Australia website states that “The Archives Act 1983 does not define a record by its format. Generally, records created as a result of using social media are subject to the same business and legislative requirements as records created by other means.” But the guidelines on the NAA website as to how this should be done simply say “Methods of capturing social media content as a record may vary according to the tools being used”. This month the Public Record Office of Victoria released an issues paper for comment: ‘Recordkeeping implications of social media’.

An extract of the PROV proposed guidelines for archiving social metadata follows:

How should the record be captured?

Currently printing screenshots to .pdf and registering the resulting document in an Electronic Document and Record Management System (EDRMS) to record the necessary metadata is the  most accessible and expedient method of creating social media records. Necessary metadata includes who sent it (username and real name), date and time of sending, context and purpose of content, name of tool used to create it.

My first reaction on reading this was ‘this is mad!’ Perhaps the archives are underestimating the amount of social metadata and media activity that is going on. Taking a screenshot of every tweet, for example, assumes that you are not going to get thousands of them, whereas successful sites and topics such as Trove do get thousands and millions of interactions, which makes this unworkable from a staff resourcing point of view. Twitter is notorious for ‘disappearing’ tweets after a very short amount of time – sometimes less than a week – because of the volume of activity that takes place. This also puts pressure on archiving tweets at the time of creation; you don’t have the luxury of going back to archive them later. The suggested form of archiving only gives a screen-based image, which is not in context, not searchable, has no metadata or timestamp, and cannot be authenticated. It seems there is money to be made if someone develops a simple software system that mechanically captures a tweet, its responses and its components, and safely and uniformly archives and indexes them along with descriptive metadata. The tool could also render the page “as it appears” and save it as a PDF if that is required.
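The 'simple software system' imagined here would essentially capture each post, at creation time, as a structured record carrying the metadata PROV lists, rather than as a screenshot. A hypothetical Python sketch of such a record; no real Twitter API calls are shown and the example values are invented:

import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical sketch of archiving a tweet as a structured, searchable record
# (rather than a screenshot), capturing the metadata PROV says is necessary.

@dataclass
class SocialMediaRecord:
    username: str                    # who sent it (account name)
    real_name: str                   # who sent it (real name, if known)
    sent_at: str                     # date and time of sending, ISO 8601
    text: str                        # the content itself, full-text searchable
    tool: str                        # name of the tool used to create it
    context: str                     # context and purpose of the content
    in_reply_to: Optional[str] = None

    def to_archive_entry(self):
        payload = asdict(self)
        # A checksum gives a basic integrity hook for later authentication.
        payload["sha256"] = hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()
        return json.dumps(payload, indent=2)

tweet = SocialMediaRecord(
    username="@TroveAustralia",
    real_name="Trove, National Library of Australia",
    sent_at="2012-04-29T02:15:00Z",
    text="New digitised newspapers added this week.",
    tool="Twitter",
    context="Service announcement to text-correction volunteers",
)
print(tweet.to_archive_entry())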

In April 2010 the Library of Congress rather bravely announced that it intended to archive all tweets since they began in 2006, to record the social fabric of the world, and signed an agreement with Twitter and Google. In 2010 the Twitter archive was growing rapidly, with users sending 50 million tweets a day. A year and a half later several news agencies tried to get a progress report from the LC without much success. Other than trying to transfer the data from Twitter’s servers to LC servers, the LC wasn’t giving any detail on what technological developments it was creating to do the mammoth task. The task seemed to be growing bigger by the day with usage of Twitter increasing. Currently 140 million tweets are sent every day.

A core element of the archiving process should be that the data is kept in context with what it was referring to, and with the other elements surrounding it. Most libraries that are keeping UGC and social metadata keep it in a layer separate from their own content in the database to protect provenance, but may integrate it for public display. If it is kept separate it can easily be stored, managed and moved, but it is at risk of becoming separated from the context it relates to. This is something libraries need to work out. It will become more pressing in a few years’ time when existing services are migrated as part of their maintenance; the UGC needs to be migrated in context with them.
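One way to picture the 'separate layer' approach is that the UGC lives in its own store but is anchored to collection content by stable identifiers, so it can be integrated for display now and migrated alongside that content later. A minimal hypothetical Python sketch (identifiers and records invented):

# Hypothetical sketch of keeping UGC in a separate layer from collection
# content, joined by stable identifiers so context survives migration.

collection = {
    "article-123": {"title": "Floods in Gundagai", "year": 1852},
}

ugc_layer = [
    {"target": "article-123", "type": "tag", "value": "floods", "user": "rose"},
    {"target": "article-123", "type": "comment",
     "value": "My great-aunt is mentioned here.", "user": "rose"},
]

def display(item_id):
    """Integrate the two layers for public display without mixing the stores."""
    item = dict(collection[item_id])
    item["contributions"] = [u for u in ugc_layer if u["target"] == item_id]
    return item

# Migrating the service means carrying both stores and preserving the
# identifier scheme; the UGC never has to be untangled from catalogue data.
print(display("article-123"))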

On this topic I have more questions than answers. I think libraries and archives need to work together to take an active role in, firstly, encouraging the mobilisation of social metadata – ‘user takeout’ – and secondly, demonstrating how social metadata and social media activity can be archived. I see massive opportunities for start-ups to create archiving tools that bolt onto Facebook, Twitter, YouTube and blogging software to meet the requirements of government archiving.
 
Photo: Prime Ministers Chifley and Curtin chat on the way to work, 1945. Bronze sculpture by Peter Corlett outside the National Archives of Australia, Canberra. Rose Holley

Sunday 25 March 2012

Crowdsourcing: the crowd ‘rations’ its experience to make it last

When I am at work I feel I have far too much to do and that my ambitions can never all be achieved. This tends to make me feel despondent. However, in crowdsourcing projects it is well known that providing far too much work and impossible goals is a very powerful motivator. Rather than leaving individuals in the crowd feeling despondent, it drives them to put in even more hours.

I know from experience that this is the case because online volunteers working on the Australian Newspapers text correction have told me this.  Also after adding thousands of new pages to the service, surges in text correction would be observed.  This was a regular pattern.

Last week I was alerted to a great article in the Guardian about crowdsourcing and two good blog posts on crowdsourcing in cultural heritage.  All three articles are well worth reading and give some fascinating background to specific crowdsourcing projects. They all touch on the fact that the crowd wants to be given as much work as possible.

Crowdsourcing Cultural Heritage: the objectives are upside down, by Trevor Owens 10 March 2012  

Ben Brumfield noticed that in his transcription project one of his most valuable power users was slowing down on their transcriptions. The user had started to cut back significantly in the time they spent transcribing this particular set of manuscripts. Ben reached out to the user and asked about it. Interestingly, the user responded to explain that they had noticed that there weren’t as many scanned documents showing up that required transcription. For this user, the 2-3 hours they spent each day working on transcriptions was such an important experience, such an important part of their day, that they had decided to cut back and deny themselves some of that experience. The user needed to ration out that experience. It was such an important part of their day that they needed to make sure that it lasted.


Galaxy Zoo and the new dawn of citizen science, by Tim Adams, 18 March 2012

The volunteers only worry that the source of their obsession will dry up, and that they will run out of visible galaxies to classify. "In the beginning," Alice Sheppard said, "we all were enjoying it so much that we didn't like the idea of getting to the end." As it has worked out, more data sets have kept becoming available just as one tranche of images has been classified; now Sheppard believes that the work will continue to expand like the objects of its attention, "though no one seems quite sure how many galaxies are in the Hubble database?"

So the lesson we can learn from this is that we must give our crowd as much work and new data as we can.  We don’t want our crowd to have to ‘ration themselves’ because we haven’t left them enough work to do.

Photo: a worker bee eats the last crumbs of my sticky date pudding.

Sunday 11 March 2012

Crowdsourcing transcription of handwritten archives


One of the big differences between libraries and archives is that libraries tend to have more of ‘the printed word’ whilst archives have vast amounts of ‘the handwritten record’. While some libraries are getting up to speed with mass digitisation of books and journals and are then able to offer users full-text searchable digitised items, this is still a distant dream for most archives. Some archives are undertaking mass digitisation, but the second step – making handwritten records full-text searchable – is a massive challenge. The reason for this lies in the technology and processing steps.

After scanning a ‘printed word’ page into an image file, a piece of software called Optical Character Recognition (OCR) converts the image into searchable text. OCR works best with clean, clear, black-and-white type such as a Word document or a book, not quite so well on old books and journals, and very poorly on old newspapers. When it comes to converting handwriting it fails miserably. It just can’t distinguish and convert handwriting to text in the way the human eye can. Therefore archives can’t easily automate the second part of the digitisation process using OCR software the way libraries can for the printed word.
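For the printed word, the OCR step described here amounts to a single library call. A minimal sketch using the open-source Tesseract engine via the pytesseract Python wrapper (file names invented; any OCR engine could be substituted):

# Minimal sketch of the OCR step for printed pages, using the open-source
# Tesseract engine via pytesseract. File names are invented for illustration.
# Handwritten pages put through the same call typically come back as garbage,
# which is why crowdsourced transcription is needed for them instead.

from PIL import Image
import pytesseract

def ocr_page(image_path):
    """Convert a scanned page image into plain text for indexing."""
    return pytesseract.image_to_string(Image.open(image_path))

if __name__ == "__main__":
    text = ocr_page("scans/printed_journal_page_001.tif")
    print(text[:500])   # clean type OCRs well; old newspapers and handwriting do not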

If you at least get some OCR text from print that is readable and therefore searchable, you can offer users a service to full-text search the books or journals, as Google does. If the OCR text is poor there are some things you can do to improve it: you can encourage users of your service to correct the OCR text with a text correction tool so that searching is improved, as Trove does with the Australian Newspapers.

Unfortunately the only viable option open to archives for converting digital images into full-text searchable text is to use a manuscript transcription tool, in combination with harnessing the power of a crowd to do the transcription work. Transcription of handwritten records is much harder than, for example, text-correcting old newspapers, because the handwriting is often difficult to read, old-fashioned, barely legible and not necessarily structured in lines or columns. There is often nothing to go on.

I recently stumbled across a blog all about manuscript transcription tools, written by a software developer, Ben W. Brumfield, in Texas. Ben developed his own software to transcribe his great-great-grandmother’s journal. ‘FromThePage’ is now being used by archives because Ben has made it available open source.

A year ago he wrote an in-depth blog post that covered manuscript transcription tools under development and manuscript transcription projects in archives, and made some predictions about future directions for manuscript transcription. I am not going to repeat what he said here; I suggest you read the post in full. He notes that software development in this area is still fragmented and young, with no particular tools taking dominance. Most developed applications are being made available open source. A standout is ‘Scribe’ from the Zooniverse team, currently being used both by the ‘Old Weather’ project to transcribe maritime weather records and by the ‘What’s the score?’ project to transcribe music scores at the Bodleian Library, Oxford.

Before an archive implements a manuscript transcription tool it needs to find out what its users would most like to be easily full-text searchable from the vast vaults of content it holds. It is important to find this out, because the crowd will only be motivated and swell in numbers if they really feel that what they are doing is very important to a broad group of people, really matters either right now or in the long term, and is also interesting. They have to feel this before they will join in. Once they have joined in there are other motivational techniques you can use to keep them going. Just implementing a manuscript tool is simply not enough. You need to engage, watch, understand and learn from your crowd, for they hold in their hands the passion and power to make your project successful or not.


Photo by Rose Holley, outside Canberra Bus Station