Europeana Transcribe: Crowdsourcing WWI data from handwritten text


Across more than 180 collection days, over 200,000 personal items from World War One have been collected and digitised. Since last November, the Europeana Transcribe project has been inviting the public to transcribe them.

Guided by the aim of uncovering stories hidden in the forgotten pages of handwritten text, the project has called on the crowd to help transform these analogue data into digital troves for journalists and historians.

We spoke to Frank Drauschke, from Facts & Files, and Ad Pollé, from Europeana, about the project and the value of converting analogue documents into digital data.

DDJ: Can you tell us a little bit about the project and how it came about?

Frank and Ad: The Europeana 1914-1918 project started in 2011 with the collection and digitisation of private memorabilia from WWI. The idea derived from the Great War Archive project of the IT department of Oxford University. The first collection days were held in 2011 in eight cities in Germany; afterwards the project travelled around Europe. To date, over 180 collection days have taken place in 22 countries. See the map of the collection days here.

Together with online contributions, over 16,000 stories with over 205,000 digital images have been collected so far. With over 360,000 items of institutional content, www.europeana1914-1918.eu represents the biggest international digital archive on the First World War.

With this resource it is now possible to read and research private letters and diaries from Serbia, Germany, France, Denmark and England, from all parts of Europe and from every side of the former front lines, at the convenience of your home or work desk. But it is a raw diamond, which now needs to be polished and used by writers, scholars and journalists. One obstacle to doing this more extensively is the handwriting: most sources are written by hand and are very often hard to decipher. Europeana Transcribe aims to help ‘polish’ this diamond by transcribing the sources and so making them more usable.

In 2014 Facts & Files and Olaf Baldini / piktoresk started to develop a prototype transcription tool, which was funded by the German Federal Government Commissioner for Culture and the Media (BKM). Building on this experience and knowledge, we went on to develop the current transcription website in 2016. We also held the first transcription workshops at a Berlin school (Primo Levi Gymnasium) in order to test the transcription tool and the concept with pupils. The Europeana Transcribe website was finally launched with the first international Transcribathon challenge at the Latvian National Library in Riga in November 2016.

We are now continuing the project in a series of thematic ‘Runs’, such as the Christmas Run and the Love Letter Run, which continuously encourage our transcribers to help us uncover the untold stories of the First World War.

The Loveletters Transcribathon introduces and kicks off Europeana’s new pan-European campaign #AllezLiterature, working with libraries, archives and the public across Europe in the first half of 2017 to celebrate the written word and Europeana’s textual treasures. After Valentine’s Day, the next stages of the campaign will focus on World Poetry Day (21 March 2017) and World Book Day (23 April 2017).

Something very compelling about the Europeana Transcribe project is its commitment to converting analogue data from World War One documents into digital data. How do you think the creation of a digital data repository will impact the way we tell stories from the War?

It makes the data readable, translatable and thereby accessible to everyone. This makes it possible to make as many private stories and documents as possible, from diverse languages and regions, available to the public. Reading these moving first-hand accounts makes it clear that the sorrow and suffering of ordinary people was the same on all sides of the battle line.

What are the biggest challenges transcribing analogue data from the historical sources into digital data?

The greatest challenge in transcribing digitised documents is being able to decipher and read old handwriting. Not only is it difficult to distinguish the letters and words, but transcribers also have to deal with old spelling, old-fashioned vocabulary or, as in German, a completely different script. The old German script (Kurrent) was taught in schools only until after the Second World War.

Why was a crowdsourcing approach chosen?

It was chosen because the power of the crowd is the only feasible way to deal with such a great quantity of material that is so diverse in type and language. And since Europeana 1914-1918 is a crowdsourced archive of ordinary people, it was the logical step to advance it into a citizen science project and engage the people.

We are very happy about the overwhelmingly positive response and the successful start. The Christmas Run is the ideal example of this: in only one month, over 1,600 documents and more than 1.5 million characters in six languages were transcribed. In just the few days since Valentine’s Day, we have gained 130 more registered transcribers, and some of them made it within one day from the starting rank of Recruit via Runner to Champion. These astounding results demonstrate the effectiveness of the power of the crowd and the incredible dedication of our community.

How do you validate transcriptions? What kind of quality controls do you have in place to ensure that the transcribed data is accurate?

Once more, we utilise the power of our community to edit, review and mark transcriptions as complete. Every Champion can review a transcription, check former versions, edit it and mark it as complete. This process ensures the quality of the data, which increases over time with every revision.
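As an illustration only, the review workflow described in this answer could be modelled roughly as follows. This is a minimal sketch, not the project's actual code: the class and method names are hypothetical, the numeric ordering of the ranks is assumed from the order in which the interview names them, and the rule that a completed transcription can no longer be edited is an assumption added for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Rank(IntEnum):
    # Rank names come from the interview; the numeric order is an assumption.
    RECRUIT = 1
    RUNNER = 2
    CHAMPION = 3


@dataclass
class Transcription:
    document_id: str
    versions: list[str] = field(default_factory=list)  # all revisions, oldest first
    complete: bool = False

    @property
    def current(self) -> str:
        """The latest revision, or an empty string if nothing was saved yet."""
        return self.versions[-1] if self.versions else ""

    def edit(self, text: str) -> None:
        """Save a new revision; former versions stay available for review."""
        if self.complete:
            # Assumed rule for this sketch: completed transcriptions are locked.
            raise ValueError("transcription is already marked complete")
        self.versions.append(text)

    def mark_complete(self, reviewer_rank: Rank) -> None:
        """Sign off a transcription; only a Champion may do so."""
        if reviewer_rank < Rank.CHAMPION:
            raise PermissionError("only Champions can mark a transcription complete")
        if not self.versions:
            raise ValueError("nothing to review yet")
        self.complete = True


# Invented example text, for illustration only.
letter = Transcription("letter-001")
letter.edit("Dear Anna, I hope this leter finds you well.")
letter.edit("Dear Anna, I hope this letter finds you well.")
letter.mark_complete(Rank.CHAMPION)
print(len(letter.versions), letter.complete)  # 2 True
```

Keeping every revision rather than overwriting the text is what lets a reviewer check former versions before signing off.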

The project also includes an interactive map that lets you explore transcribed documents. Why was this visual format chosen to represent your data and what design choices did you make in its construction?

Geotagging stories and documents and then displaying these locations on a map is a great way to visualise and understand historical sources, and to get a better picture of the geographical extent of each story and of the whole collection. Everyone is curious to discover documents that come from their home town or the place where their grandfather was born or … One transcriber wrote to us: “It was very interesting to participate and I cried, sometime. I had the sheer chance to transcribe a letter where the action was in the town where I studied for about eight years (and my father too) ; I knew all the cited places.”

The interactive map on the home page depicts one geolocation per story. Here in particular you already get a great visualisation of the global dimension of the war. Each story also has such a map, which depicts all geotags from the items of the story, and each document has a small map as well, where users can add geocoordinates for locations they have just transcribed from that document.
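To make the three map levels concrete, here is a minimal sketch of how the geotags might roll up from documents to stories to the home page. All names are hypothetical and the sample coordinates are approximate. How the site actually picks the single location shown per story is not stated here, so taking the first geotag is purely an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class GeoTag:
    name: str
    lat: float
    lon: float


@dataclass
class Item:
    """A single document; users add geotags for places they transcribed from it."""
    title: str
    geotags: list[GeoTag] = field(default_factory=list)


@dataclass
class Story:
    title: str
    items: list[Item] = field(default_factory=list)

    def all_geotags(self) -> list[GeoTag]:
        """Story map: every geotag from every item, duplicates removed, order kept."""
        unique = dict.fromkeys(tag for item in self.items for tag in item.geotags)
        return list(unique)

    def representative_geotag(self) -> GeoTag | None:
        """Home-page map: one point per story (assumed here: the first geotag)."""
        tags = self.all_geotags()
        return tags[0] if tags else None


def home_page_markers(stories: list[Story]) -> list[tuple[str, GeoTag]]:
    """One (story title, location) pair per story that has any geotag at all."""
    return [
        (story.title, tag)
        for story in stories
        if (tag := story.representative_geotag()) is not None
    ]


# Invented sample data, for illustration only.
diary = Story("A soldier's diary", [
    Item("Page 1", [GeoTag("Berlin", 52.52, 13.40)]),
    Item("Page 2", [GeoTag("Riga", 56.95, 24.11), GeoTag("Berlin", 52.52, 13.40)]),
])
print([tag.name for tag in diary.all_geotags()])  # ['Berlin', 'Riga']
print(home_page_markers([diary])[0][1].name)      # Berlin
```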

We are still working on the website and will develop more new functionalities. These will include a dedicated map search page and other ways to display and interact with geographic data.

For the map we especially wanted a different look and feel than the normal Google Maps design. Inspired by some other map examples, we use the facilities of the Google Maps API but chose our own colour scheme and display options, in order to achieve a visual effect that better suits the historical sources, the topic and the design of the whole page.

We especially like the look of the mountain areas, as in these screenshots:



Once the project is complete, how can journalists, or other storytellers, access the project’s data?

Working with so many historical documents will certainly be a long-lasting, ongoing process. Nevertheless, all the materials, both the digitised objects and the transcriptions, are already available for research and reuse. Everybody is free to use all of them, provided they cite the source. All 200,000 user-generated (UGC) digital images are available under the Creative Commons licence CC BY-SA. Later in the process we will also develop new functionalities, for instance the ability to download a PDF with all transcriptions from one story. We are also planning collaborations with publishers and institutions to publish some special sources as books or in other ways.

Explore the project further (or start transcribing) here.