Crowdsourcing

Advice document on key features of crowdsourcing projects.


Image from the Library of Congress. Persistent URL: http://www.loc.gov/pictures/resource/ggbain.21683/

Advice paper showing examples of where crowdsourcing has been used effectively to create metadata and content for digitisation projects, highlighting challenges and how to overcome them.

Introduction
The past decade has seen a rapid rise in organisations using crowds to carry out work that would otherwise not be possible with a finite resource of professional employees.  Examples of crowdsourcing can be seen in many domains and activities, from Wikipedia and Open Source software to charity, aid work and commercial business.  The sorts of activities typically undertaken in crowdsourcing projects in the culture, heritage and education sectors include the rating, tagging, transcribing and classifying of existing media and text, as well as the creation and sharing of new collections and information. The benefits and rewards for these organisations can be huge.  Not only are they able to carry out more work and expand and advance research, but they are also able to build new communities, engage better with users, and foster a shared responsibility for educational resources and cultural heritage.

Moreover, crowdsourcing is happening in the context of mass digitisation and an ever increasing body of born-digital data.  When working in combination with the development of open data and the universality and egalitarian spirit that embodies the best of the Internet's educational resources, crowdsourcing has the potential to provide new access points to information, create new collections, involve more people in the work of professional curators and collection owners, inform, educate and ultimately democratise knowledge for the benefit of everyone.

This advice paper examines some key features of crowdsourcing.  First, it looks at some of the notable projects of the past decade and points out some differences between ‘Community’ and ‘Crowd’ based projects.  It then goes on to address challenges that organisations face when embarking on crowdsourcing endeavours, such as the quality assurance of gathered data, copyright, and audience motivation, both in terms of sparking initial enthusiasm and then sustaining this interest.


Community vs Crowd
Community Collections often have different characteristics from what may be termed typical crowdsourcing projects.  Community Collections usually involve a smaller number of participants who are comparatively more active, are more likely to have social interaction with their peers, may require more qualitative recognition for their efforts, and may also contribute data or ideas of their own that can shape the project's direction.  In academia, a great example of a Community Collection was the Great War Archive, run by the University of Oxford in 2008. In this project, participants were invited to upload digitised images and documents relating to their own family history of the First World War. The project was also notable because, as an early example of a successful form of community engagement, it developed methodologies from which subsequent projects have been able to learn (1).

Participants in crowdsourcing, on the other hand, are usually involved in what has been termed lightweight peer production (2).  They may be comparatively anonymous, less frequent contributors, and may be asked to fulfil less skilled tasks governed by stricter guidelines and rules.  However, crowds derive their power from their greater number.  Many of the Citizen Science projects under the Zooniverse umbrella fall into this category.  The pioneering and hugely successful Galaxy Zoo project, in which users were asked to categorise images of galaxies taken by telescopes, is one such example.

While it is, to some extent, useful to make distinctions in crowdsourcing projects in this way, it is also true that many projects display characteristics of both.  Wikipedia, for example, has many thousands of casual and sporadic Wikipedians, while also having a smaller number of more regular contributors who take on a greater editorial role.  Both crowd and community also have in common a blurring of the boundary between professional and amateur, both use web technology and social media as a primary means of communication and gathering data, and both provide a sense of collaboration and of working towards a shared outcome for those taking part. It is also true that both crowd and community face similar challenges in terms of how to motivate people to take part, how to sustain this motivation, and how to manage risks around the quality of the data they are collecting.

Before going on to look at some of these risks and challenges in greater detail, it is useful to provide examples of some key projects that have taken place in the last decade.


Disaster zone and aid work
While not directly applicable to the educational, cultural and heritage domains, the use of crowdsourcing in recent disaster recovery operations provides valuable insight into the cumulative effect of shared action.  Disaster zones are typically characterised by a struggle to piece together accurate, up-to-date information, which makes it difficult for traditional media outlets to make sense of events and for aid workers to target their efforts.  Paradoxically, there can be a wealth of information being shared by individuals on the ground via mobile phones, blogs and other social media tools.  Much of this data can be collected efficiently by software tools, such as the open source Ushahidi platform, and can provide raw data that can be visualised and mapped interactively to help disaster recovery teams, aid workers and media outlets locate people and concentrate effort (3).  Tools such as this have been used to great effect in recent disaster areas such as Haiti, the Philippines and Japan.


Public Interest and News
Crowdsourcing has also been used effectively by the newspaper industry as a means of filtering and researching large datasets.  Following the publication in 2009 of over 700,000 digitised copies of MPs' expenses receipts, the Guardian newspaper asked the public to help sift the information. The Guardian was therefore able to use its readers to analyse what would have been an impossibly large set of information for any group of journalists, and thus uncover potential stories relating to MPs' expenses (4).
Academic Crowdsourcing

In academia, some of the most well-known crowdsourcing projects include:

The Great War Archive
As noted above, this project, run by the University of Oxford in 2008, sought public involvement in major research into the First World War (5).  It invited the public to submit their own images, media and stories to the archive and collected over 6,500 digital objects in its lifetime.  This material has now been subsumed into the larger First World War Poetry Digital Archive, where it can be searched for and viewed alongside other archive material (6).

Zooniverse
Zooniverse is an example of crowdsourcing on a large scale (7).  It started as the Galaxy Zoo project in 2007, where, as mentioned above, the aim was to ask volunteers to classify - according to shape and colour - images of around one million galaxies taken from the Sloan Digital Sky Survey (8).  The success of this project enabled the team to re-use and further develop their methods and tools for other projects under the wider Zooniverse umbrella.  Zooniverse now runs dozens of crowdsourcing projects covering not only space and astronomy but also the humanities, biology, climate and nature.

Old Weather
One well-known Zooniverse project is Old Weather.  The aim of the project, which is still ongoing, is to track climate change using ships' logs from 19th-century United States naval records.  Users are asked to transcribe these logs, of which there are many thousands of pages.

Transcribe Bentham
Transcribe Bentham is an ongoing project that asks volunteers to transcribe and encode the manuscript papers of the philosopher and reformer Jeremy Bentham.  The ultimate aim of the project is to use the resulting data to contribute to a new scholarly edition of the collected works of Jeremy Bentham (9).  To date, participants have completed the transcription of over six thousand manuscripts.  Transcribe Bentham is also notable because it demands a high level of skill of its users, asking them not only to transcribe texts but also to encode them using the Text Encoding Initiative (TEI) standard.

Australian Newspapers Digitisation Program
The Australian Newspapers project was one of the first crowdsourcing projects run by a national library.  This project involved the public in the transcription and correction of digitised and OCR'd Australian national newspapers. The project has corrected over 100,000 newspapers to date, and many of the lessons learned from it have informed subsequent transcription endeavours.  The project continues to run on the National Library of Australia's TROVE website (10).


Challenges

Quality Assurance
As noted above, there are risks and challenges that institutions have to deal with when undertaking crowdsourcing initiatives.  One such challenge is to ensure that data collected, encoded or annotated is of sufficient quality.   Notions of quality are particularly resonant for organisations like museums, libraries and archives that are judged by their usually professionally acquired, curated and annotated data and information.  Crowdsourcing presents a challenge here as it inevitably leads to some loss of control and a blurring of the boundary between amateur and professional.   Organisations, therefore, have to facilitate amateur and enthusiast contributions while ensuring the quality of the information being collected remains fit for purpose.

Institutions can develop various strategies in order to do this.  These strategies typically include a combination of technological and interaction aids.  For web-based projects, it is important to design an interface and tools that make tasks simple and understandable for users and, if possible, facilitate effective data-checking mechanisms.  The Australian TROVE website provides a good example of this.  TROVE has a user interface that allows participants to correct and edit text transcriptions, but alongside this it also makes apparent the various issues that can arise with optical character recognition (OCR) and immediately shows the corrected version on screen.  This helps the user to understand the value of their contribution and to share their corrections with other participants.

Another aid to ensuring quality is to build a supportive team environment.  This can include supporting interaction between professional staff and amateurs, and also between participants themselves.  Many projects therefore run discussion fora and blogs alongside the core work, where participants can share their experiences.

It is also important to assume that volunteers will get things right rather than wrong.  As we have seen, there are various strategies that can be deployed to check the quality and consistency of data, but experience shows that participants in crowdsourcing, who are often motivated by interest in the subject and by reaching a shared common goal, are likely to care about the quality of their work and are keen to provide the best information that they can.

Thus, effective design, simple tools and checking mechanisms, as well as clear guidelines and open channels for communication, will help to develop a philosophy and ethos around a project and to foster clear behavioural norms, which in turn help to establish the desired quality of content.

IPR
IPR generally, and copyright in particular, can be a challenge for crowdsourcing projects.  Crowdsourcing is, at its heart, a form of mass outsourcing.  In commercial outsourcing there would normally be a contractual agreement between the organisation and the worker, the organisation would usually assert copyright over whatever was produced, and the worker would usually receive payment for their effort.  Crowdsourcing turns this usual scenario on its head.  Participants are not usually paid, and there is rarely any formal contract between the organisation and the participant.  However, there is still an outcome of someone's intellectual endeavour, be it a transcription, an edit, an idea expressed or, in some cases, a submitted media file or image, and this outcome ‘belongs’ to someone.

There are broadly two ways of dealing with this.  The first is for the organisation to assert copyright over what is produced, up front, in the form of a disclaimer.  This would normally be placed in the terms and conditions of taking part in the project and be implicitly agreed to by participants.  Transcribe Bentham, for example, has a copyright disclaimer whereby participants agree that the project holds all rights over corrections and data submitted, and this has not presented any issues for its participants (11).  The second option is to give users copyright of their data, but also to make sure that they agree to licence their work for others to use.  Wikipedia, for example, states that any contributions remain the copyright of the person who created or edited an article, while at the same time participants sign up to the Creative Commons CC BY-SA and open source GFDL licences, which allow the work to be freely distributed and edited by others.

In non-profit crowdsourcing projects where the outcome is a cause for the greater good, it is likely that either model could be deployed without major concern for participants.  The important thing for projects is to be clear, and to have policies in place from the outset as to who owns copyright and what rights participants have.

Motivation
Research shows that people are motivated to engage in crowdsourcing activities for various reasons.  These range from wanting to be involved in a worthy cause; to learn and discover new things; to feel a sense of belonging in a community; to have fun; to be intellectually challenged; and to do something interesting with their free time (12).  Recent research suggests that advances in social media technologies have led to changes in people's use of their free time, or ‘cognitive surplus’.  The sociologist Clay Shirky and others have identified a trend in people's use of the Internet and social media, away from being predominantly passive consumers of information - for example, reading or watching television - towards contributing to and participating in creating content themselves (13).  Other commentators, such as James Surowiecki in his book ‘The Wisdom of Crowds’, point to the generally better decision-making of crowds and the inherent power they have over individuals (14).  There is therefore huge potential in tapping into this cognitive surplus and engaging crowds in a common goal.  The key for institutions is to spark potential participants' imagination in the first place, but then also to sustain it.

Have a Big Target
Most successful crowdsourcing projects have targets that would be impossible to reach with just a core team of professional staff.  Indeed, evidence suggests that the bigger the task or problem faced, the more likely people are to get involved.  Old Weather estimated that it could take one person up to 28 years to accurately transcribe one captain's logbook from one voyage.  Using crowdsourcing, the same task can be done in six months.  Galaxy Zoo's initial target was to classify one million galaxies - an ambitious target which it far exceeded. In fact, the project received a total of 50 million individual classifications by the end of the first year, meaning that each image was classified more than once; this redundancy provided a great way of checking the data and ensuring its accuracy.
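The value of this kind of redundancy can be illustrated with a short sketch.  Assuming each item receives several independent classifications, a simple majority-vote check could accept a label only once enough contributors agree.  The function name and the 75% agreement threshold below are illustrative, not taken from Galaxy Zoo itself:

```python
# Minimal sketch of redundancy-based quality checking: accept a
# classification only when a clear majority of volunteers agree.
from collections import Counter

def consensus(labels, threshold=0.75):
    """Return the majority label if agreement meets the threshold, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else None

# Five hypothetical volunteer classifications of the same galaxy image:
votes = ["spiral", "spiral", "spiral", "elliptical", "spiral"]
print(consensus(votes))                      # "spiral" (4/5 agreement)
print(consensus(["spiral", "elliptical"]))   # None - no clear majority
```

Items that fail to reach consensus can then be queued for further classification or referred to professional staff, which is one way crowdsourced data can remain fit for purpose at scale.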

Communicate Progress
It is important to make sure that progress bars or other mechanisms are used to make it clear how much still needs to be done.  Again this will psychologically encourage people to sustain their initial spark of motivation, and usually, progress indicators are placed upfront and unambiguously on project home pages.  
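A progress indicator of this kind is simple to compute from a project's task counts.  The sketch below, with invented figures, renders completion as a text bar; a real project site would of course present this graphically:

```python
# Minimal sketch of a completion indicator from task counts.
def progress_bar(done, total, width=20):
    """Render completion as a text bar plus a percentage."""
    fraction = done / total if total else 0.0
    filled = round(fraction * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {fraction:.0%}"

# Hypothetical example: 6,500 of 10,000 pages transcribed.
print(progress_bar(6500, 10000))   # [#############-------] 65%
```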

Make the system easy to use, fun and reliable
The Transcribe Bentham project survey showed that one of the biggest dissuading factors for participation, after a lack of time, was the complexity and difficulty of the task.  It is therefore important, particularly for crowdsourcing projects that expect a lot of casual or one-time participants, to make the tasks as simple and as easy to follow as possible.

Gamification
One technique that can be deployed to make tasks fun, and also to encourage sustained participation, is to introduce a reward system or a level of competition for users.  For example, Old Weather has a ranking system based on naval rank: the occasional or sporadic user is afforded the rank of cadet and can, based on their contributions, rise through the ranks to become a lieutenant and then captain of a vessel.  Points are awarded for each transcription carried out, and new cadets on a given ship are told their captain's points total and asked to try to beat it.
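At its core, a scheme like this is just a mapping from accumulated points to rank thresholds.  The sketch below shows one way this might be implemented; the point thresholds are invented for illustration and are not Old Weather's actual values:

```python
# Minimal sketch of an Old Weather-style ranking scheme: points are
# accumulated per transcription, and promotion happens at fixed totals.
import bisect

THRESHOLDS = [0, 30, 100]                    # points needed for each rank
RANKS = ["Cadet", "Lieutenant", "Captain"]

def rank_for(points):
    """Return the highest rank whose points threshold has been reached."""
    return RANKS[bisect.bisect_right(THRESHOLDS, points) - 1]

print(rank_for(0))     # Cadet
print(rank_for(45))    # Lieutenant
print(rank_for(250))   # Captain
```

Keeping thresholds in a single table makes it easy for a project to tune how quickly participants are promoted, which in turn affects how motivating the scheme feels.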

Gaming techniques are also used successfully in metadata or tagging types of crowdsourcing. Dartmouth College in the US has developed a platform called ‘metadatagames.org’ where users can simply and quickly tag resources, and even join up and collaborate with other participants online (15).   Institutions that require large datasets of digital images, movies or sound, which are too big and time consuming for any small team to sift through, can send their data to the platform to be included.  Again like Old Weather above, users are given scores for each piece of media they tag.

Another good example of gamification techniques being used in metadata crowdsourcing can be found at the Museum of Design in Plastics at Bournemouth University.  Here they have developed an online game to find information on the ‘10 Most Wanted’ objects from the collection (16).  ‘10 Most Wanted’ is a play on the FBI's most wanted lists: users become Agents tasked with finding wanted information, such as designer, maker or material.  Agents can earn extra points and achieve a 'Special Mention' if they add contextual stories and information about objects, and can rise from Field Agent to Special Agent based on their points score.

There is no doubt that this type of gamification can work well in certain types of crowdsourcing project.  However, an interesting piece of research by the Transcribe Bentham team shows that competition is a small motivational factor when compared to a sense of community, a shared sense of purpose or interest in the subject matter.  The Bentham project ran a survey of its participants which showed that users were overwhelmingly motivated by being part of a collaborative endeavour and by an interest in the subject, rather than by any notion of competition or individual recognition (17).  That said, the project team did note that this may have been due in part to participants not wanting to admit to themselves and others that they were motivated by the competition!  It is true, however, that in some situations an overly ‘gamified’ solution may put potential participants off, as it may seem to trivialise their effort.  Yet, used carefully, it can provide a useful means of attracting and sustaining the motivation of participants.


Conclusion

Crowdsourcing has fundamentally changed the way in which cultural, heritage and education organisations and collections work.  A unique combination of improved communication and social media tools, a resulting change in people's expectations of and interactions with collections and in their use of their free time, and a rise in the creation and use of open data and resources has enabled organisations to reach out to users in new and innovative ways, to improve the management and accessibility of their collections, and to share experiences, resources and information in ways that bring collections closer to users and that advance knowledge and research.

There are challenges that organisations face, such as ensuring quality, managing rights, and sparking and sustaining motivation.  This paper has provided some guidance and techniques that organisations can use to mitigate these challenges, from making tasks clear and unambiguous, to good communication and, where appropriate, introducing gamification techniques to sustain motivation.  Ultimately, when organisations understand these risks and take appropriate steps to alleviate them, the benefits that crowdsourcing brings to our cultural heritage far exceed any potential negatives.