A Crowdsourcing Platform for Wales

The National Library of Wales have recently appointed Digirati to deliver a new open source crowdsourcing platform. It will enable them, and other organisations, to run crowdsourcing projects across various digital collections (see announcement).

The specification summarises the core objectives as follows:

“.... to develop and implement a customisable, bilingual platform, which can be enhanced in the future and offered to other organisations who may wish to deliver digital crowdsourcing projects. We expect that the availability of a crowdsourcing system built using open standards, namely IIIF and Web Annotations, will be of interest to a wide range of international organisations and also other cultural heritage organisations in Wales.”

We’re going to build this platform on the Omeka S collection management system, with some new custom modules and our W3C annotation server, Elucidate. We’re already working on a project (the Indigenous Digital Archive) that uses some of the same components, and we will build on that work to deliver a crowdsourcing engine that can be used by the NLW and others. You deploy the platform, pick the material you want to build a project around, define what you want people to capture from the material, and let the platform present the material and manage contributions from volunteers. The platform is extensible so you can define new kinds of thing to capture for different projects.

Two open standards lie at the heart of the project - the International Image Interoperability Framework (IIIF) and the W3C Web Annotation Data Model.

IIIF gives us APIs for interoperability of content, so that anyone who provides digitised archival material, books, manuscripts, photographs and other content as IIIF can use the platform to expose the material for crowdsourcing projects. The National Library of Wales is a pioneering institution both in IIIF and in crowdsourcing, and they have large collections available via the IIIF APIs.

When volunteers take part in a crowdsourcing project, their contributions take the form of tags, comments, descriptions, identifications, transcriptions and other types of annotation. Just as IIIF gives us an interoperable way of presenting and sharing the content, the W3C Web Annotation Data Model gives us an interoperable way to say things about the content. The combination of these two standards makes the project possible. Both the inputs (IIIF resources) and the outputs (W3C annotations) are external to the platform and defined by open standards.
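To make the pairing of the two standards concrete, here is a minimal sketch of a volunteer's transcription expressed as a W3C Web Annotation that targets a IIIF canvas. The canvas URI, the sample text and the function name are made up for illustration; the platform's actual annotation shape may well differ.

```python
import json


def make_transcription_annotation(canvas_id: str, text: str, language: str = "cy") -> dict:
    """Build a minimal W3C Web Annotation whose target is a IIIF canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "describing",
        "body": {
            "type": "TextualBody",
            "value": text,
            "format": "text/plain",
            "language": language,
        },
        # The target is simply the URI of the canvas (the page image in IIIF terms).
        "target": canvas_id,
    }


annotation = make_transcription_annotation(
    "https://example.org/iiif/letter-1/canvas/1",  # hypothetical canvas URI
    "Annwyl gyfaill ...",
)
print(json.dumps(annotation, indent=2, ensure_ascii=False))
```

Because the target is just a URI, the annotation can live in any annotation server, independently of whoever publishes the image.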


What we propose to do

1. Work with NLW to develop a simple, extensible data model that a crowdsourcing project uses to define what its volunteers are to capture from IIIF resources, in the form of W3C annotations.

2. Develop user interface components that capture entities described by the model (transcriptions and comments, but also people and places and events and other topics).

A model to capture might be very simple. A project comprising a set of handwritten correspondence might only wish to capture simple full page text transcriptions, one annotation per image. Other projects need to capture more complex entities, such as people, places and other concepts. For example: from a set of First World War military hospital records, we want to capture information about personnel. The records are formed of pages of tabular data. Each table row corresponds to a person. The model to capture is a person, with additional fields specific to the project (such as rank and regiment).

Where the model differs from simply describing common entities like people and places is that it can specify, if needed, how that entity is represented in the images that the crowdsourcing platform will present to volunteers via its user interface. This is not just a person - it is a person represented as a table row in a document. The model contains the information required by user interface components to enable the volunteer to extract instances of the model from images. The user interface components handle particular classes from the data model (for example, component X is a UI element used to extract people from table rows).

A person might be present in the source image as a figure in a captioned photograph, a row in tabular data, or a mention in unstructured text. In each case the model to be captured is a person, maybe with additional project-specific fields - but the actions the volunteer performs to capture that model are different and require different UI components.
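One way to picture this is a model that carries both the entity to capture and the way it appears in the image, with the second part deciding which UI component is used. The sketch below is purely illustrative: the class names, the representation keys and the component names are all assumptions, not the data model we will agree with NLW.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaptureField:
    name: str
    label: str


@dataclass
class CaptureModel:
    entity: str          # what is captured, e.g. "Person"
    representation: str  # how it appears in the image
    fields: List[CaptureField] = field(default_factory=list)


# Hypothetical mapping from representation to the UI component that handles it.
UI_COMPONENTS = {
    "table-row": "TableRowSelector",
    "photo-figure": "RegionDrawer",
    "text-mention": "TextHighlighter",
}


def component_for(model: CaptureModel) -> str:
    """Pick the UI component from how the entity is represented, not from what it is."""
    return UI_COMPONENTS[model.representation]


person_fields = [
    CaptureField("name", "Name"),
    CaptureField("rank", "Rank"),
    CaptureField("regiment", "Regiment"),
]
hospital_person = CaptureModel("Person", "table-row", person_fields)
photo_person = CaptureModel("Person", "photo-figure", person_fields)

print(component_for(hospital_person))
print(component_for(photo_person))
```

Both models capture the same entity with the same fields; only the representation, and therefore the component, changes.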

New projects can reuse existing models or define new ones. For example, a recipe book transcription project might involve specifying a “recipe” model. If this can be captured as a set of free text fields and lookups from controlled vocabularies, it may not need any additional user interface components because it can be built from free text elements, taxonomy pickers, etc. Other models might occasionally require new UI components for volunteers to use. A model contains enough information for two purposes - generating UI to capture that model from contributors (drawing tools, form fields etc), and generating W3C Web Annotations that allow a representation of the model to be stored in an annotation server (or for an annotation server to store a reference to an entity held elsewhere).
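The two purposes of a model can be sketched with the recipe example. Everything here is an assumption made for illustration: the field names, the vocabulary, and in particular the choice to serialise the captured values as a JSON text body, which is only one of several ways a model instance could be carried in an annotation.

```python
import json

# Hypothetical "recipe" model built only from generic inputs.
RECIPE_MODEL = {
    "entity": "Recipe",
    "fields": [
        {"name": "title", "input": "text"},
        {"name": "method", "input": "textarea"},
        {"name": "course", "input": "vocabulary",
         "options": ["starter", "main", "dessert"]},
    ],
}


def form_fields(model: dict) -> list:
    """Purpose 1: list the inputs a UI would render to capture the model."""
    return [(f["name"], f["input"]) for f in model["fields"]]


def to_annotation(model: dict, values: dict, canvas_id: str) -> dict:
    """Purpose 2: validate a contribution and wrap it in a W3C Web Annotation."""
    for f in model["fields"]:
        if f["name"] not in values:
            raise ValueError(f"missing field: {f['name']}")
        options = f.get("options")
        if options is not None and values[f["name"]] not in options:
            raise ValueError(f"value for {f['name']} is not in the vocabulary")
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "describing",
        "body": {
            "type": "TextualBody",
            "format": "application/json",
            "value": json.dumps({"entity": model["entity"], **values}),
        },
        "target": canvas_id,
    }


contribution = {"title": "Bara brith", "method": "Soak the fruit overnight.", "course": "dessert"}
recipe_annotation = to_annotation(
    RECIPE_MODEL, contribution, "https://example.org/iiif/recipes/canvas/12"  # hypothetical
)
print(form_fields(RECIPE_MODEL))
print(recipe_annotation["body"]["value"])
```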

The human crowdsourcing activity could be preceded or complemented by automated processes for capturing parts of capture models, such as tagging entities in text, identifying the subjects of images, face recognition and so on. Although these parallel machine-driven processes are out of scope of the platform, the standardisation around W3C Annotations means that machines and humans can both contribute to the generation of annotations. The platform can use annotations that may have already been partially generated by machine, or machines can do further work on human-created annotations.
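As a rough illustration of machines and humans sharing one format, the Web Annotation model lets an annotation name its creator as an agent of type Person or Software. The helper names and the review rule below are hypothetical; they only show how a consumer might tell the two kinds of contribution apart.

```python
def with_creator(annotation: dict, agent_type: str, name: str) -> dict:
    """Return a copy of the annotation with a creator agent attached."""
    if agent_type not in ("Person", "Software"):
        raise ValueError("agent_type must be 'Person' or 'Software'")
    return {**annotation, "creator": {"type": agent_type, "name": name}}


def is_machine_generated(annotation: dict) -> bool:
    """True when the annotation declares a Software agent as its creator."""
    return annotation.get("creator", {}).get("type") == "Software"


base = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "tagging",
    "body": {"type": "TextualBody", "value": "Aberystwyth"},
    "target": "https://example.org/iiif/letter-1/canvas/1",  # hypothetical canvas URI
}

machine_tag = with_creator(base, "Software", "example-entity-tagger")  # hypothetical tool
human_tag = with_creator(base, "Person", "Volunteer 42")

# A project might queue machine-made annotations for volunteers to confirm.
to_review = [a for a in (machine_tag, human_tag) if is_machine_generated(a)]
print(len(to_review))
```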


3. Incorporate a module for Omeka S (the IIIF import and display module) that creates Items and Item Sets from supplied IIIF manifests and collections, and extends Omeka S item display with a page per canvas.

This module is being developed by Digirati as part of the Indigenous Digital Archive project. It allows an admin user to create new Omeka S Items and Item Sets from IIIF resources.

You could use the import and display module without doing any crowdsourcing. You can pick some IIIF resources from any provider and build a small interpretative site around them. For example, collect interesting editions of Euclid from around the world and write some interpretative material around them.

As well as creating new Omeka S resources the module also serves a page for every image. We need a simple, robust and easily shared solution to accommodate a wide range of crowdsourcing use cases and potential user interface requirements for different kinds of capture models. This kind of interface is familiar from many existing crowdsourcing applications including some of NLW’s existing projects. Using this module, the crowdsourcing platform presents the IIIF resource in “exploded” view, spread across many web pages on which annotation activity can occur.
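The "exploded" view amounts to walking a manifest and producing one page per canvas. The sketch below assumes a IIIF Presentation 2.x manifest (with `sequences` and `canvases`); the URL pattern, the function name and the sample manifest are invented for illustration and are not the module's actual routes or the Omeka S API.

```python
def pages_from_manifest(manifest: dict, item_slug: str) -> list:
    """Return one page description per canvas in the manifest's first sequence."""
    sequences = manifest.get("sequences", [])
    canvases = sequences[0].get("canvases", []) if sequences else []
    return [
        {
            "path": f"/item/{item_slug}/page/{number}",  # hypothetical URL pattern
            "canvas_id": canvas["@id"],
            "label": canvas.get("label", f"Page {number}"),
        }
        for number, canvas in enumerate(canvases, start=1)
    ]


# A tiny hand-written manifest; a real one would be fetched from its publisher.
sample_manifest = {
    "@id": "https://example.org/iiif/euclid/manifest",
    "@type": "sc:Manifest",
    "label": "Elements of Euclid",
    "sequences": [
        {
            "canvases": [
                {"@id": "https://example.org/iiif/euclid/canvas/1", "label": "Title page"},
                {"@id": "https://example.org/iiif/euclid/canvas/2"},
            ]
        }
    ],
}

for page in pages_from_manifest(sample_manifest, "euclid"):
    print(page["path"], page["label"])
```

Each page then has a stable address on which annotation activity for that canvas can take place.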


4. Develop a module (the crowdsourcing module) for Omeka S that allows administrators to create a new crowdsourcing project, specify the IIIF resources that provide the content of the project, and define the model(s) that they want volunteers to contribute in the form of annotations. The platform will then be able to “execute” the model by generating navigation, rendering pages and rendering the UI components for model capturing.

An administrator logs in to Omeka S and creates a new Site (in Omeka S terms). This corresponds directly to a crowdsourcing project. Multiple projects are simply multiple Omeka S sites. The crowdsourcing module adds extra admin functionality to site/project creation and management. As well as a choice of site themes and content management functionality, the administrator can specify a IIIF Collection (or multiple IIIF resources) from which the crowdsourcing module will configure a new project.

This collection might be an existing IIIF collection already published by the National Library of Wales (or anyone else), or it might have been specifically created to group together a small set of material for a particular crowdsourcing project. This module adds the functionality into Omeka S to specify the resources required - by supplying IIIF resource URLs. A project can comprise a mixture of resources from different IIIF publishers; there is no reason why the source material has to be from NLW.

The following diagram shows the correspondence between NLW’s concepts of a crowdsourcing project, Omeka S, and IIIF:

We will also be developing internationalisation and authentication modules for the crowdsourcing platform. Every aspect of the user interface and editorial content needs to be available in two or more languages, and the volunteer will be able to choose the language, and switch at any time. NLW requires that the platform allow users to authenticate via OAuth2 providers, and via Shibboleth for use within the Library.
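At its simplest, the language requirement means every interface string is looked up by key in the volunteer's chosen language, with a fallback when a translation is missing. This is only a sketch of that idea; the message keys, the catalogue layout and the function name are assumptions and say nothing about how the internationalisation module will be built.

```python
# Hypothetical message catalogue keyed by language code, then by message key.
MESSAGES = {
    "en": {"contribute": "Contribute", "save": "Save", "help": "Help"},
    "cy": {"contribute": "Cyfrannu", "save": "Cadw"},
}


def translate(key: str, language: str, fallback: str = "en") -> str:
    """Look up a UI string, falling back to another language, then to the key itself."""
    messages = MESSAGES.get(language, {})
    if key in messages:
        return messages[key]
    return MESSAGES.get(fallback, {}).get(key, key)


# The volunteer can switch language at any time; only the lookup argument changes.
for language in ("cy", "en"):
    print(language, translate("contribute", language))
```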

We’ll be following up with some more details about how the platform works on top of Omeka S very soon.