11/8/2017

Marple: “We tried to automate story finding in data - this is what we learnt”

 

Journalism++ Stockholm has spent a year developing an automated news service for finding stories in large amounts of statistical data. It turned out to be more challenging than expected, but also very successful.

Robot journalism has been a big buzzword over the last couple of years. Most examples of robots in the newsroom have revolved around text generation using natural language generation systems. Software such as Wordsmith and AX Semantics is currently being used to write stories about sports results and financial reports.

A year ago, our project Marple was awarded funding from Google’s Digital News Initiative to work on a different kind of newsroom automation. We are trying to automate story finding.

Using Bayesian statistical models, we analyze large amounts of statistical data to find anomalies, particularly local ones. When we find a local peak in crime or a regional trend in unemployment, we file the story to local newsrooms in Sweden.

In this blog post, we will share some of our learnings by showing how we iterated towards a model for monitoring unemployment data to find local stories.

Step 1. Going for the extremes

When we approach a new topic we ask ourselves: what would be a story in this data? The most obvious answer is simple: a new record. “More bike thefts than in ten years”, “Record temperatures in July”.

Initially, this looked like a promising approach for examining unemployment data as well. We found plenty of local stories like this one:

(Unemployment among foreign born citizens in Degerfors municipality, Sweden)

Unemployment among the foreign born population was rising and reaching new record levels all across Sweden. An obvious story, but hardly a unique one: we were basically picking up local examples of a pretty well-known national trend.
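The record check behind these first leads can be sketched in a few lines. This is not Marple's actual code, just an illustration of the idea: flag a series when its latest value beats everything in a recent window (say, ten years of monthly data).

```python
def is_record(series, window=120):
    """Return True if the latest value is strictly higher than
    every earlier value in the window (default: 120 months)."""
    recent = series[-window:]
    return recent[-1] > max(recent[:-1])
```

A lead like "more bike thefts than in ten years" is simply `is_record(monthly_thefts)` coming back `True` for a municipality. The same check on a negated series catches record lows.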

Step 2. Let’s compare

So our first simple method had some flaws. Consider the following. Unemployment is rising in four out of five municipalities, in the fifth it is declining. Which story is most interesting from a local point of view?

We quickly realized that looking at one municipality at a time was not enough. We started comparing municipalities with their parent counties. How had the unemployment gap between the municipality and the county evolved?

This made things a lot more interesting. We were able to pick up trends such as these:

(Unemployment in Härnösand municipality and Västernorrland county)

This chart shows a record gap between the municipality (growing unemployment) and the county (declining unemployment). A solid news lead!
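The comparison step can be sketched the same way (again an illustrative snippet of ours, not the production code): compute the gap between the municipality and its parent county at each point in time, and flag the series when the latest gap is the widest on record in either direction.

```python
def gap_is_record(municipality, county):
    """Return True if the latest gap between the municipality
    and county series is the largest (in absolute terms) so far."""
    gaps = [m - c for m, c in zip(municipality, county)]
    return abs(gaps[-1]) > max(abs(g) for g in gaps[:-1])
```

This is what turns "unemployment rose everywhere" into the more interesting "unemployment rose here while it fell in the rest of the county".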

But only looking for extreme values is a pretty blunt method for detecting statistical anomalies. House prices, for example, typically reach new record levels every month. And how many months in a row is it meaningful to report about a “new record”?

We realized there is also a risk that we capture statistical “noise”, rather than actual, “real”, anomalies.

Step 3. Hello Bayes

Journalists love trends. It is always a story when we see a new trend in a time series. But defining a trend is easier said than done. When does a trend start? When does it end? Is a small bump down one month enough to say that a trend is broken?

We have seen that simple statistical methods work pretty well on small datasets. It's a story if the air temperature last month was one of the 10 percent warmest or coldest in the past 50 years. Simple anomalies also have the advantage that they are easy to describe.
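That temperature rule is a plain percentile check. A minimal sketch (our own illustration, with a simple tie-ignoring rank):

```python
def is_extreme(history, latest, pct=0.10):
    """True if `latest` falls among the warmest or coldest `pct`
    share of the historical values."""
    below = sum(v < latest for v in history)
    frac = below / len(history)
    return frac >= 1 - pct or frac <= pct
```

Easy to compute, and just as easy to explain in the generated lead text: "one of the ten warmest Julys in 50 years".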

But to find the really interesting signals in larger amounts of data we had to turn to more developed statistical models.

These models can take previous levels, variation and seasonality into account using Bayesian state space modeling. A sudden rise in unemployment is not necessarily surprising (in a statistical sense) if there have been a lot of ups and downs historically. And a small decline in unemployment one month is probably not a story, but if that same small decline continues for three months it might be something reporters should look into.

The purpose of these sniffers is to distinguish “real” trends from random developments and help the journalist focus on what is actually interesting, rather than random chance or seasonal effects.
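The simplest member of the state space family illustrates the principle: a local level model, where the true level follows a random walk and each observation adds noise. A Kalman filter tracks the level, and an observation far outside the one-step-ahead predictive interval counts as "surprising". Marple's actual sniffers are richer than this (seasonality, trends), and the variances and threshold below are assumptions we picked for illustration:

```python
import math

def surprising_points(series, obs_var=1.0, level_var=0.1, threshold=2.0):
    """Local level model: level_t = level_{t-1} + w,  y_t = level_t + v.
    Returns indices of observations more than `threshold` predictive
    standard deviations from the one-step-ahead forecast."""
    level, var = series[0], obs_var  # initialise at first observation
    flags = []
    for t, y in enumerate(series[1:], start=1):
        var += level_var                       # predict: level drifts
        pred_sd = math.sqrt(var + obs_var)     # forecast uncertainty
        if abs(y - level) > threshold * pred_sd:
            flags.append(t)
        gain = var / (var + obs_var)           # Kalman gain
        level += gain * (y - level)            # update level estimate
        var *= (1 - gain)
    return flags
```

The key property is that the surprise threshold adapts to the series: a volatile municipality needs a much bigger jump to trigger a lead than a historically stable one.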

(Unemployment among foreign born citizens in the municipality of Gnesta and county of Södermanland)

Conclusions

A year of work on this project has left us both encouraged and humble. Our platform works, and we are able to help local reporters find stories that they would not have found otherwise.

But automation is also hard. There is no magical formula that works across all datasets. We have to approach every topic with fresh analytical minds. The definition of a news story in unemployment data is not the same as in crime, car accident or housing price data.

But with more refined statistical models we are able to approach larger datasets and find leads beyond the most obvious ones.

About Marple

  • An automated news service developed by Journalism++ Stockholm for finding stories in statistical data
  • The service monitors statistical data for anomalies and generates story leads consisting of a brief generated text, a chart and data in a spreadsheet
  • Marple got prototype funding from Google’s Digital News Initiative in 2016
  • Currently operating in Sweden with local newsrooms as the main user group

About the authors

Jens Finnäs is a data journalist and founder of Journalism++ Stockholm.

Måns Magnusson is a PhD student in statistics at Linköping University.

Explore Marple here.
