Saturday, 11 February 2012

Crowdsourcing: more cool sites to give libraries, archives and museums inspiration

Many people know of my interest in how online crowdsourcing can be applied in libraries, archives and museums, because of an article I wrote in 2010 called ‘Crowdsourcing: how and why should libraries do it?’ and my initiation of public text correction in Australian Newspapers. People therefore often send me links to sites they think may interest me, which is really great. Sometimes sites that have nothing to do with libraries or archives can give us ideas. There is a ‘List of Crowdsourcing Projects’ in Wikipedia (separate from the main article on crowdsourcing), which is a useful starting point for an overview of the sorts of activities going on. It goes without saying that Wikipedia itself is the greatest crowdsourcing project ever!
In this post I wanted to mention some newish crowdsourcing projects that I have been looking at that interest me, and that I haven’t written about before. 
1. Star Wars Uncut (SWU). Released August 2011
About the project: In 2009 Casey Pugh, a web developer, asked thousands of Internet users to remake "Star Wars: A New Hope" as a fan film, 15 seconds at a time. Contributors were allowed to recreate scenes from Star Wars however they wanted. Multiple entries were submitted for each scene, and votes were held to determine which ones would be added to the final film. Although the scenes reflect the dialogue and imagery of the original film, each one is created in its own distinct style, such as live action, animation or stop-motion. Within just a few months SWU grew into a wild success, and the creativity that poured into the project was remarkable. SWU has been featured in documentaries, news features and conferences around the world for its unique appeal. In 2010 it won a Primetime Emmy for Outstanding Creative Achievement in Interactive Media. Now the crowdsourced project has been stitched together and put online on YouTube and Vimeo. The "Director's Cut" is a feature-length film that contains hand-picked scenes from the entire collection.

Relevance for libraries and archives: In the world of film, TV and radio, fans and consumers are the subject experts. They not only have in-depth knowledge, but also have the motivation and interest to share that knowledge with others in creative ways. This project really shows that. The fans apparently had no trouble identifying specific seconds in a very long film. This type of knowledge and interest is really useful for librarians and archivists who want to open up discovery of audio items. A fan is much more likely than the librarian who created the catalogue record to know in which series, episode, minute and second a subject came up or a thing was said. That knowledge could be used to help with the discovery process. At the moment most audio is still catalogued and described at item level, for example “it’s an interview with x”. It is still a costly and difficult process to convert speech from audio into text, and to add subject tags manually. Most of our historic audio collections do not have this level of discoverability. One crowdsourcing project that taps into the crowd to make films more discoverable through public tags is ‘Waisda’. We know that the public like to consume by watching and listening, but they also want to create and share. There is potential for crowdsourcing to improve the accessibility of historic digitised audio, especially audio that has a fan base or is iconic.

2. What’s on the Menu (New York Public Library). Launched April 2011
About the project:  With approximately 40,000 menus dating from the 1840s to the present, The New York Public Library’s restaurant menu collection is one of the largest in the world, used by historians, chefs, novelists and everyday food enthusiasts. But the menus cannot be searched for specific information about the dishes and prices. To solve this problem the NYPL is appealing for the public to transcribe the menus, dish by dish. Doing this will enable the collection to be accessed and researched in new ways, opening the door to new kinds of discoveries. The site was launched in late April 2011 and the original aim was to transcribe the 9,000 menus photographed several years before for inclusion in the NYPL Digital Gallery.   Volunteers transcribed all of these in the first three months, so more items have been scanned from the collection and are now awaiting transcription. As of 5 February 2012, there have been 758,748 dishes transcribed from 12,167 menus. The ultimate goal is to get the whole collection transcribed and to turn it into a powerful research tool.  NYPL are also looking into partnering with other libraries and archives with menu collections.

Researchers who use the collection, for example historians, chefs, nutritional scientists and novelists, are looking for a juicy period detail. They often have very specific questions they’re trying to answer, for example:

“Where were oysters served in 19th century New York and how did their varieties and cost change over time?”
“When did apple pie first appear on a menu? What about pizza?”
“What was the price of a cup of coffee in 1907?”

To find out these sorts of things more easily, the text on the cards needs to be transcribed.  Quotes on their website about the usefulness of the project:
Rich Torrisi, New York Chef:

“What’s on the Menu is a tremendous educational resource that breathes life into our city’s most beloved restaurants and dishes.  It has been an indispensable and hugely inspirational tool in the ongoing development of my restaurant…”

Mario Batali, New York Chef, Author, Entrepreneur:

“Menu writing is an art form seldom appreciated. In our restaurants, we put an incredible amount of time and thought into crafting menus. It’s remarkable to see menus being preserved and documented, for them to become a resource for future chefs, sociologists, historians and everyone who loves food.  It’s not just What’s on the Menu, it reveals so much more.”

Relevance for libraries and archives: Libraries love to collect and keep stuff, and that includes things like menus, tickets, pamphlets, posters, invitations, theatre programs and greeting cards. We call this stuff ‘ephemera’. The word comes from Greek and refers to printed matter that is intended to be transitory, short-lived, or to last only a day. When the item is created it is not intended that it will be retained or preserved. However, I haven’t encountered a single library that did not have a large ‘ephemera’ collection and intend to keep it long-term. The National Library of Australia is no exception and collects ephemera because it is “a record of Australian life and social customs, popular culture, national events, and issues of national concern”. There are 2.3 million items of ephemera in the collection at the NLA. Nearly 170,000 of them have been digitised and are browsable by title.
However, their full potential has still not been unlocked. Ephemera is usually printed on just a few pages, which contain both words and pictures. When ephemera is digitised it is scanned or photographed as an image file, and therefore the text is not indexed or searchable. It would be very hard to apply OCR to the text because of the varying and usually fancy typefaces used. The only way to make the text searchable, thereby unlocking its full discoverability potential, is to transcribe it manually. Librarians don’t have time for this, but an interested public do. Give them a really interesting or topical ephemera collection like the menu cards and watch them go!
3. Historypin. Launched July 2011
About the project: Historypin was launched in July 2011. It allows people to upload historic and contemporary photos, videos and sounds to a specific location on a map of the world. Well, it’s actually not just any map: it’s a Google map, and this is likely to make all the difference. It’s a combination of a crowdsourcing project (they want organisations and individuals to load content), a useful educational site, and a service that libraries and archives can hook into to expose their content and collections to new audiences (similar to Flickr Commons). I’ve seen quite a few sites like this before, but on a small scale for specific locations. For example, Sydney Sidetracks was launched in 2008 by the ABC in partnership with The Dictionary of Sydney, The National Film and Sound Archive, The City of Sydney, The Powerhouse Museum, The State Library of New South Wales and the Museum of Contemporary Art. It has a website and mobile app that make historic images, videos and sound available for locations in central Sydney, overlaid on a map.
The big difference with Historypin is that it has been developed by ‘We Are What We Do’ (a not-for-profit organisation that creates ways for millions of people to do more small, good things) in partnership with Google. Google is the main technology partner on the project and has helped with Google tools, including Google Maps, Google Street View, Picasa, Google App Engine and Android. Google has supported the development costs of the project with donations and sponsorship. It has also given marketing support and created the video that promotes the service: a one-minute introduction to Historypin. This means it is not some small-scale project that may suffer from lack of budget, development, maintenance or marketing. It is likely to be around for a while and may perhaps rival Flickr Commons. Google says “We share ‘We Are What We Do’s commitment to Historypin as a non-commercial, collaborative project that delivers social impact and contributes to digital inclusion.”
The marketing blurb says “Historypin is a way for millions of people to come together, from across different generations, cultures and places, to share small glimpses of the past and to build up the huge story of human history through a well-known medium - picture.”
Relevance for libraries and archives: Interestingly, although the crowd Historypin initially set out to attract was the public, contributing their photos and stories, it now appears that the crowd may actually be the libraries and archives community. This community has massive amounts of digitised content in image, video and sound formats, and wants it more widely exposed, tagged and used. A service that lets libraries and archives do this, that they don’t have to develop and support themselves, and that has no geographical boundaries is certainly a drawcard. Batch upload has already been enabled, as have ‘make your own collection’ and ‘view slideshow’. You can pin your content on any Google Street View scene, in any country of the world. If you happen to be somewhere that Street View hasn’t yet been, don’t worry: you can still pin your content down. It is a service that will become more valuable the more content it holds. I only wonder whether they have underestimated the interest that libraries and archives will have in joining, and the volume of content they will bring. If so, it is advisable to get in early in case there is a three-year waiting list like the one Flickr Commons had when it started. This is a crowdsourcing project that has direct relevance to libraries and archives, no matter what their size or where they are located.
TEDx video: Nick Stanhope on mapping history  

4. Ancient Lives – Decoding Papyri. Launched July 2011
About the project: The Ancient Lives project presents you with fragments of papyri more than 1,000 years old to decode. The papyri were discovered over a century ago by Oxford University researchers Bernard Grenfell and Arthur Hunt at Oxyrhynchus (the city of the long-nosed fish).
“With about 100 men from the local village, Grenfell and Hunt dug in the high winds roaring across the desert. In early January of 1897 a papyrus containing the apocryphal Gospel of Thomas was unearthed, and then a fragment of St. Matthew’s Gospel. The flow of papyri began. Within a few years not only Thucydides and Plato were delicately pulled from the sand, but also Greek lyric poetry that had not been seen or read in about 1000 years. Further, the private documents of this vanished city were collected en masse: private letters, accounts, wills, marriage certificates, land leases, etc. Ancient garbage became a modern treasure. By 1907 the digging ceased. 700 boxes of papyri, potentially carrying about 500,000 fragments, made the long journey back to Oxford University, where Grenfell and Hunt opened up a new branch of study: papyrology. A little over a century later, only a small percentage has been translated by scholars. The Oxyrhynchus collection is owned and overseen by the Egypt Exploration Society.”
The papyrus can be decoded easily by volunteers who match known characters from a grid to the unknown characters on the fragment.  Fragments can be matched by adding measurements of the fragments and the columns within them. The task is mammoth and before the arrival of the online tool could only be undertaken by scholars who were familiar with the code. A very difficult task has been effectively simplified, whilst retaining the challenge that is found in crosswords or code-breaking.
The project was launched in July 2011 and is part of the Citizen Science Alliance, a transatlantic collaboration of universities and museums dedicated to involving everyone in the process of science. Growing out of the wildly successful Galaxy Zoo project, the Alliance builds and maintains the Zooniverse network of crowdsourcing projects, of which Ancient Lives is one of the newest. Nearly half a million people are contributing to the Zooniverse crowdsourcing projects.
Relevance for libraries and archives: This is a good example of a task that appears on the surface to be too difficult and extensive for a crowd to undertake.  By clever breaking down of the task and designing a simple user interface it becomes achievable.  It also demonstrates that private information about people is of eternal interest to the public. This project along with all the other Zooniverse projects has extensive public discussion forums to firstly foster the volunteer community and secondly let them know how their work helps new discoveries and knowledge grow and develop. We can learn much from how Zooniverse treats its volunteer community.
5. Duolingo - translate the web and learn a new language. Launched November 2011
About the project: Luis von Ahn of Carnegie Mellon University is a co-creator of CAPTCHA and the creator of reCAPTCHA. Google bought reCAPTCHA in 2009, and it has effectively helped Google Books improve the OCR in its digitised books word by word. Each year 750 million people are unwittingly converting the equivalent of 2.5 million books by using reCAPTCHA. This is a crowdsourcing project where people don’t realise they are in a crowd or what they are doing. Luis is now working on a new project: Duolingo. Luis says “Before the internet the biggest projects had 100,000 people involved and with that you could for example put a man on the moon.  My question is what can you achieve with the internet when you can have 100 million people working together on something?” A good question, especially when you combine the number of people with all that ‘cognitive surplus’ that Clay Shirky is always talking about.
Duolingo will help people learn a new language and simultaneously (unwittingly) translate the Web. Luis says “It is estimated that there are over 1 billion people learning a foreign language at any given time”. OK, so this means a big potential crowd. The Google translator tool is quite good at translating websites, but not as good as he thinks Duolingo will be. The site went live in beta mode in November 2011, but only a few road testers have been accepted so far. There is already a waiting list of 100,000 people who want to join the site. Luis says “Duolingo is a 100% free language learning site in which people learn by helping to translate the Web. That is, they learn by doing.” The difference from reCAPTCHA is that people will know what they are doing and consciously want to do it. Watch this space.
Relevance for libraries and archives: I’m not sure what the relevance for libraries and archives will be. Although reCAPTCHA is a free program that is obviously very relevant for libraries and archives, it has only been utilised by commercial companies so far, namely the New York Times historic newspaper archive and Google Books. No library has utilised it. I thought I should mention the new project Duolingo since its potential also seems big. It’s a good idea to translate the web, but I also like the idea of something Luis didn’t mention, which is translating books and newspapers into different languages. A question that the National Library of Australia was thinking about last week was “will our volunteer newspaper text correctors be as keen to correct Australian newspapers in foreign languages as they are the English ones? Will they correct them even if they don’t speak the language?” We are asking this because we will soon be adding Australian newspapers in foreign languages to Trove. If this content is classed as ‘part of the web waiting to be translated’, then I guess Duolingo holds big relevance for all national libraries. Duolingo is at an early stage of development, so we will have to wait and see. That is, unless libraries want to be really pro-active and actually make suggestions to the development team for things that would help them make their content more widely accessible and used...
The TEDx video:  Luis talking on CAPTCHA, reCAPTCHA and Duolingo

I hope you find some inspiration from these five crowdsourcing sites for your library, archive or museum.  If there is a newish site of relevance to libraries and archives that you think I’ve missed please add a comment to this post and share. Crowdsourcing sites I have previously reviewed are:
· Picture Australia (National Library of Australia)
· FamilySearch Indexing (The Church of Jesus Christ of Latter-day Saints)
· Distributed Proofreaders (contributes to Project Gutenberg)
· Wikipedia
· UK MPs' Expenses (The Guardian)
· Galaxy Zoo (Citizen Science Alliance)
· BBC WW2 People's War (BBC)
· Digitalkoot (National Library of Finland)
· Old Weather (National Maritime Museum and Citizen Science Alliance)
· Remember Me: Displaced Children of the Holocaust (United States Holocaust Memorial Museum)
· Trove Australian Newspapers (National Library of Australia)
· Transcribe Bentham (University College London)
· Waisda (Netherlands Institute for Sound and Vision)

Read more - related posts by Rose Holley on crowdsourcing:
· Gold star to text correctors for e-books, 13 December 2011
· Software for journal and newspaper text correction, 18 December 2011
· Digital cultural heritage awards for crowdsourcing, 4 February 2012

In March 2011 images of the digitised Australian Women's Weekly (1932-1984) were projected onto the National Library of Australia building as part of the ‘Enlighten’ Festival in Canberra. Nearly 395,000 articles from the Australian Women's Weekly can be improved by public text correction in Trove.  Photograph by Paul Hagon.


  1. Hi Rose,
    This post is a great summary of some interesting projects. I see crowdsourcing as an exciting opportunity for the GLAM community. I work mostly with students and I think this suite of projects presents them with powerful ways to engage with history (and other topics too) in a real way. It fits very well with contemporary notions of education.
    Do you have any links or information about what aspects of crowdsourcing projects make them successful? I.e. what draws the crowd?

  2. I'm glad you found this useful. I previously wrote an article which gave 14 tips for what to do to make your crowdsourcing project successful. I illustrated the tips with screenshots from real sites. The article is here:
    Hope that helps.

  3. A couple of things I'd like to add:

    First, the Zooniverse team have made both their transcription tool and their discussion tool open source on GitHub: Scribe and Talk. They've also deployed Scribe on a library-driven site at What's the Score at the Bodleian?

    Second, I'd like to mention a recent article describing a very small-scale crowdsourcing project at Southwestern University in Georgetown, Texas which uses my own transcription software: Collaborative Transcription Project. I will be speaking about the lessons to be learned for small crowdsourcing projects at IMLS WebWise next week, and hope to have something interesting to say.

    1. Dear Ben

      Thanks for sharing this. Congratulations on writing your own tool for your own personal job, which archives can now use as well because you have shared it open source. That is so good to hear. I had not heard of the new project to transcribe music scores at the Bodleian Library which looks very interesting, so thanks for sharing that too.