Sunday, 29 April 2012

Mobilising and archiving social metadata (user generated content).

It is fantastic to see members of our library communities adding their own knowledge and opinions to our content through use of features such as tags and comments, and social media tools such as Twitter and Facebook.  More libraries are opening their content and sites to their communities through these tools and features than ever have before. We call this content user generated content (UGC) or social metadata.

But if we think about it for too long it gives us a big headache. Being of the ‘collecting’ mind we really want to care for and keep the UGC in the same way we care for our collection content.  Caring for it means:

  • knowing how much has been added and keeping meaningful statistics.
  • keeping the UGC in context with the data the users meant it to be related to.
  • archiving it for the long-term.
  • being able to migrate it along with our own content as our services and interfaces change in the future.
  • being able to mobilise it to share with other services.
  • being able to easily supply it back to the original creators if they want it.

Doing any one of these things is currently difficult, let alone all of them together. We really haven’t got our act together yet for managing UGC and social metadata; so far we have only enabled the facility for the community to add it.

Firstly let’s take a simple concept. A member of the community is actively engaged with your site. They are contributing a lot of data to it in the form of comments and descriptions. After a while they want to get all of ‘their’ data out so they can use it for something else they are working on. Let’s call this ‘user takeout’. Seems reasonable, seems simple, but I don’t know of any library site that does this. For example a ‘user takeout’ option in Trove newspapers would let a contributor get a copy of all of the comments and tags they have added to historic newspaper articles.

You may ask “Do people want to do this?” Contributors seem to accept that content they add to sites will be locked to that site. I’m not sure they even think about it very much when they start to add stuff, or check the user licence for the terms. Many don’t intend to add the volume of stuff that they do. But suddenly they think about it when either a better site comes along that they would like to transfer or copy their content to, or the site they are adding to is unexpectedly taken down or frozen.

Recent examples in the news are Facebook users wanting to be able to transfer or ‘take out’ their photographs from the site. Although it is of course technically easy to implement this, social media sites such as Facebook are reluctant to let users do it, for fear they will take their content and move to competitors’ sites. However in the library world it is reasonable that users may want to share their value-added data around multiple library sites, and yet we still don’t enable it.

Another item in the news was the sudden closure of a poetry site. Over 7 million users were given 15 days’ notice that the 14 million poems they had added would be taken down when the site was sold. They were not given an easy option to ‘take out’ their poems, but instead it was suggested that they could copy and paste their poems if they had time. This infuriated many users who read the message, and many others who didn’t read the message in time. It’s worth pointing out here that a lot of sites people use frequently and think are for the common good are actually commercial sites that can do exactly what they like, and do not ever promise to keep, manage or archive content in the same way libraries do. Although the new owners restored the site, it appears that the 14 million poems added prior to 2012 are still not restored, hence the large pink box at the top: ‘Where’s my poem?’
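
To make the ‘user takeout’ idea concrete, here is a minimal sketch in Python of what such an export might look like. It is only an illustration: the Contribution fields, the sample data and the user_takeout function are all hypothetical, and it assumes the site already stores each comment or tag alongside the username and the identifier of the item it was added to. It is not how Trove or any other service actually works.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Contribution:
    """One piece of user generated content (hypothetical structure)."""
    user: str        # contributor's username
    kind: str        # "comment" or "tag"
    target_id: str   # identifier of the item the contribution is attached to
    text: str        # the comment body or the tag value
    created: str     # ISO 8601 timestamp


def user_takeout(contributions: List[Contribution], user: str) -> str:
    """Return everything one user has contributed as a JSON document."""
    mine = [asdict(c) for c in contributions if c.user == user]
    # Oldest first, so the export reads as a history of the user's activity.
    mine.sort(key=lambda c: c["created"])
    return json.dumps(
        {"user": user, "count": len(mine), "contributions": mine},
        indent=2,
        ensure_ascii=False,
    )


# Made-up sample data standing in for the site's UGC store.
store = [
    Contribution("alice", "tag", "article-1", "shipwreck", "2012-03-01T10:00:00Z"),
    Contribution("bob", "comment", "article-1", "My grandfather was on board.", "2012-03-02T09:30:00Z"),
    Contribution("alice", "comment", "article-2", "The date in the headline is wrong.", "2012-02-11T14:05:00Z"),
]

print(user_takeout(store, "alice"))
```

Keeping the target identifier in every exported entry matters: without it the contributor gets their words back but not the context they were written in.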

If we think about measuring our user activity and data through all channels, i.e. our own site as well as Twitter, Facebook etc., we hit a brick wall. Providing useful statistics on both the volume and the value of social metadata is difficult. For social media sites such as Twitter and Facebook your options are to either buy costly software and do it yourself, or employ a company (many of which are springing up) to do it for you. These companies however would have great difficulty integrating measurements on the value and content from social media sources with those that go directly to your site, i.e. your own comments, tags and blogs. Doing measurements separately is difficult, but combining them even more so.

Many libraries are part of central or local government so have requirements to archive records and content they create, which should also include social metadata and media. But does anyone know the best way to do this, and are our archives agencies telling us how to do it? The simple answer is no. The National Archives in the USA (NARA) say they are working on it as a matter of urgency. They are due to explain how it should be done by this July. The National Archives of Australia website states that “The Archives Act 1983 does not define a record by its format. Generally, records created as a result of using social media are subject to the same business and legislative requirements as records created by other means.” But the guidelines on the NAA website as to how this should be done simply say “Methods of capturing social media content as a record may vary according to the tools being used”. This month the Public Record Office of Victoria released an issues paper for comment: ‘Recordkeeping implications of social media’.

An extract of the PROV proposed guidelines for archiving social metadata follows:

How should the record be captured?

Currently printing screenshots to .pdf and registering the resulting document in an Electronic Document and Record Management System (EDRMS) to record the necessary metadata is the  most accessible and expedient method of creating social media records. Necessary metadata includes who sent it (username and real name), date and time of sending, context and purpose of content, name of tool used to create it.

My first reaction on reading this was ‘this is mad!’ Perhaps the archives are under-estimating the amount of social metadata and media activity that is going on. Taking a screenshot of every tweet, for example, assumes that you are not going to get thousands, whereas successful sites and topics such as Trove do get thousands and millions of interactions, which makes this unworkable from a staff resourcing point of view. Twitter is notorious for ‘disappearing tweets’ after a very short amount of time, sometimes less than a week, because of the volume of activity that takes place. This puts pressure on you to archive tweets at the time of creation. You don’t have the luxury of going back to archive later. The suggested form of archiving only gives a screen-based image, which is not in context, not searchable, has no metadata, no timestamp, and cannot be authenticated. It seems there is money to be made if someone develops a simple software system to mechanically capture the tweet, its responses and its components, and safely and uniformly archive and index them along with descriptive metadata. The tool could also render the page "as it appears" and save it as a PDF if that is required.
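
As a rough illustration of what ‘mechanical capture’ could mean, the Python sketch below builds a searchable record holding the metadata the PROV extract asks for (username and real name, date and time of sending, context and purpose, name of the tool) and adds a SHA-256 fingerprint so that later tampering can be detected. Everything here is an assumption made for the example: the field names and the capture_record and verify_record functions are hypothetical, the sample message is made up, and no real Twitter or EDRMS interface is used.

```python
import hashlib
import json
from datetime import datetime, timezone

# Fields that describe the message itself and feed the fingerprint.
CONTENT_FIELDS = ("username", "real_name", "sent_at", "text", "tool", "in_reply_to")


def _fingerprint(record):
    """SHA-256 over a canonical JSON form of the content fields."""
    content = {key: record[key] for key in CONTENT_FIELDS}
    canonical = json.dumps(content, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def capture_record(username, real_name, sent_at, text, tool, context, in_reply_to=None):
    """Build an archival record for one social media message."""
    record = {
        "username": username,
        "real_name": real_name,
        "sent_at": sent_at,          # when the message was sent (ISO 8601)
        "text": text,
        "tool": tool,                # e.g. "Twitter"
        "context": context,          # purpose of the message, written by staff
        "in_reply_to": in_reply_to,  # identifier of the message it responds to, if any
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    record["sha256"] = _fingerprint(record)
    return record


def verify_record(record):
    """Return True if the content fields still match the stored fingerprint."""
    return _fingerprint(record) == record["sha256"]


record = capture_record(
    username="@examplelibrary",
    real_name="Example Library",
    sent_at="2012-04-29T09:15:00+10:00",
    text="New newspaper titles added this week.",
    tool="Twitter",
    context="Service announcement",
)
print(json.dumps(record, indent=2, ensure_ascii=False))
print(verify_record(record))
```

A record like this can be indexed and searched, which a screenshot cannot, and a PDF rendering could still be stored next to it where the guidelines require one.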

In April 2010 the Library of Congress rather bravely announced that it intended to archive all tweets since they began in 2006 to record the social fabric of the world, and signed an agreement with Twitter and Google. In 2010 the Twitter archive was growing rapidly, with users sending 50 million tweets a day. A year and a half later several news agencies tried to get a progress report from LC without much success. Other than trying to transfer the data from Twitter servers to LC servers, the LC weren’t giving any detail on what technological developments they were creating to do the mammoth task. The task seemed to be growing bigger by the day with usage of Twitter increasing. Currently 140 million tweets are sent every day.

A core element of the archive process should be that the data is kept in context with what it was referring to, and with the other elements surrounding it. Most libraries that are keeping UGC and social metadata are keeping it in a separate layer to their own content in the database to protect the provenance, but may integrate it for public display. If it is kept separate it can easily be stored, managed and moved, but is at risk of becoming separated from the context it is related to. This is something libraries need to work out. It will become more pressing in a few years’ time when existing services are migrated as part of their maintenance. The UGC needs to be migrated in context with them.
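
One way to picture migrating a separate UGC layer ‘in context’ is sketched below in Python. It assumes, purely for illustration, that each annotation carries the identifier of the item it refers to and that the migration produces a mapping from old item identifiers to new ones. The migrate_annotations function, the identifiers and the sample data are hypothetical. The point is that anything whose target cannot be found is reported rather than silently dropped.

```python
from typing import Dict, List, Tuple


def migrate_annotations(
    annotations: List[dict], id_map: Dict[str, str]
) -> Tuple[List[dict], List[dict]]:
    """Re-point annotations at their items' new identifiers.

    Returns (migrated, orphaned). Orphaned annotations are those whose
    target item has no entry in the old-to-new identifier mapping.
    """
    migrated, orphaned = [], []
    for annotation in annotations:
        old_id = annotation["target_id"]
        new_id = id_map.get(old_id)
        if new_id is None:
            orphaned.append(annotation)
        else:
            # Keep the old identifier as well, to preserve provenance.
            migrated.append(
                {**annotation, "target_id": new_id, "previous_target_id": old_id}
            )
    return migrated, orphaned


# Made-up UGC layer and identifier mapping from a hypothetical migration.
ugc_layer = [
    {"user": "alice", "kind": "tag", "target_id": "old-101", "text": "shipwreck"},
    {"user": "bob", "kind": "comment", "target_id": "old-102", "text": "Wrong date."},
    {"user": "carol", "kind": "tag", "target_id": "old-999", "text": "floods"},
]
id_map = {"old-101": "new-5001", "old-102": "new-5002"}

migrated, orphaned = migrate_annotations(ugc_layer, id_map)
print(len(migrated), "migrated,", len(orphaned), "orphaned")
```

The list of orphans is the useful part: it tells staff exactly which contributions have lost their context and need attention before the old service is switched off.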

On this topic I have more questions than answers. I think libraries and archives need to work together to take an active role in firstly encouraging mobilisation of social metadata (‘user takeout’), and secondly demonstrating how social metadata and social media activity can be archived. I see massive opportunities for start-ups to create archiving tools to bolt onto Facebook, Twitter, YouTube and blogging software to meet the requirements of government archiving.

Photo: Prime Ministers Chifley and Curtin chat on the way to work 1945. Bronze sculpture by Peter Corlett outside the National Archives of Australia, Canberra. Rose Holley