3/9/2017

Editorial transparency in computational journalism

 

As journalists, we are committed to being watchdogs and to holding governments and companies to account. A new area of accountability is the use of algorithms in many facets of our lives – often without our knowledge – to determine prison sentencing, hiring, whether someone should be granted a loan, and so forth. With computational methods also becoming more prevalent in the newsroom - including data journalism, curated news feeds, automated writing, social media analytics, and news recommender systems - we should hold ourselves to the same standards by making ourselves accountable and our processes transparent. To begin this process, I presented a paper at the Computation+Journalism 2016 conference that looks at how we can develop standards and expectations for transparency.

Benefits of editorial transparency

Releasing documented code and data does sound like extra work, and it is. But by investing this extra time, we benefit ourselves, our field, and our readers.

Writing code that others will see, and potentially criticise or question, encourages us to write cleaner code, with appropriate commenting, logical organisation, and visualisations. Adopting this process helps ensure that what we report is based on evidence, and that the evidence is present and correct. And when our code and data are open, development within the field tends to speed up, as it has in most open source fields.

The field of journalism benefits from an educational standpoint, because journalists can learn from clear, well-documented code. Journalists can also access the data and create something new from it: an angle the original authors had not considered or had no time for, or a local version of the story.

We also benefit our readers. Particularly in these times, establishing and maintaining trust with readers is paramount to the continued success of journalism. Part of building that trust is enabling readers to check your work or see the steps that led you to your story, facts, and conclusions. Just as we cite our sources, we should provide evidence for our data journalism too. By providing the code and the data, we allow readers to engage more with our work, and we create the opportunity for new stories, or even corrections if errors are found in our code.

Case studies

For context by example, below are two case studies of my own work where all the transparency tools used are free and open source.

Uber – a data driven journalism project investigating uberX wait times across demographics in Washington, D.C.

  • Data was shared via a lab-account Google Drive. Google Drive is a good option if the data set is too large to upload to GitHub.
  • Interim data (cleaned and processed raw data) was shared as a .csv file within the GitHub repository enabling others to pick up the data analysis at later stages if desired
  • Data analysis code was shared in commented Jupyter notebooks and Python scripts in the GitHub repository
  • Project and code documentation, the data dictionary and other experimental particulars were described in the readme
  • Everything that could be achieved programmatically was done programmatically, rather than manually, to enable replicability and facilitate reproducibility
  • The Google Drive and the GitHub repository were linked to in the news article
  • The news article was linked to in the GitHub repository

In doing this we:

  • Were accountable so that others could inspect our code, data, and assumptions. We were notified of a bug in our code via the “Issues” affordance in GitHub.
  • Facilitated several independent policy studies in other states and cities based on our code
  • Enabled others to conduct novel studies and data visualisations, including this one by Kate Rabinowitz

Image: Map showing average wait times in seconds for an uberX car in each census tract in Washington, D.C.
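To illustrate what doing the analysis programmatically can look like, here is a minimal sketch of the kind of aggregation behind a map like this one: averaging wait times per census tract from an interim CSV file. It is not the project's actual code. The column names `tract` and `wait_seconds`, the sample rows, and the function name `mean_wait_by_tract` are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical interim data. The column names and values are invented
# for illustration and are not the project's real schema or results.
SAMPLE_CSV = """tract,wait_seconds
000100,240
000100,300
000201,420
"""


def mean_wait_by_tract(csv_file):
    """Return {tract: mean wait in seconds} from an open CSV file."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["tract"]] += float(row["wait_seconds"])
        counts[row["tract"]] += 1
    return {tract: totals[tract] / counts[tract] for tract in totals}


averages = mean_wait_by_tract(io.StringIO(SAMPLE_CSV))
for tract, mean_wait in sorted(averages.items()):
    print(f"{tract}: {mean_wait:.0f} s")
```

Because every step from the file to the averages lives in a script, a reader can rerun it on the shared data and check each number rather than taking it on trust.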

@AnecbotalNYT - a Twitter bot that tweets comments from news articles shared on Twitter. The role of this bot is to surface comments and stimulate broader engagement with articles, news and social media users.

  • Code is available on GitHub
  • Usage and customisation instructions and documentation in the README.md file
  • The tool is independent of platform (Mac, PC, Linux) - all that is necessary to create one’s own bot is to install the required Python libraries, edit the configuration file, and run it on a server such as AWS

The goal was to make our tool as accessible to others as possible, so that any newsroom or individual could make a copy of the code, and customise the configuration settings to create their own bot. It was challenging to discern how much customisation to build into the tool while not making it so flexible as to render it too complicated or to dilute the specific goal of the bot.
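One way to keep customisation bounded is to expose a small, fixed set of settings and reject anything else. The sketch below shows that idea only. It is not the bot's actual configuration format: the setting names in `DEFAULTS`, the JSON format, and the function `load_config` are hypothetical.

```python
import json

# Hypothetical defaults. These keys are illustrative and are not the
# bot's real settings.
DEFAULTS = {
    "search_terms": ["nytimes.com"],
    "min_comment_words": 25,
    "max_tweets_per_hour": 4,
}


def load_config(text):
    """Merge user settings (a JSON string) over the defaults.

    Unknown keys raise an error, so a typo in the configuration file
    fails loudly instead of being silently ignored.
    """
    user = json.loads(text)
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown setting(s): {sorted(unknown)}")
    return {**DEFAULTS, **user}


config = load_config('{"min_comment_words": 40}')
print(config["min_comment_words"])    # value supplied by the user
print(config["max_tweets_per_hour"])  # value kept from the defaults
```

Someone adapting the tool then only has to read one short list of settings, and everything they leave out falls back to a sensible default.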

Image: Screenshot of @AnecbotalNYT with example of a comment from the New York Times tweeted as an image.

Other news organisations’ GitHub examples

These are just some examples of newsrooms sharing, but not all of them come with documentation or links to the articles they were published in, so simply sharing code is not equivalent to transparency.

Documentation

What does documentation entail?

  • Commenting code, explaining what each line, code block or function does
  • Writing code in Jupyter Notebooks can be helpful if the project is in Python, R, or Julia, as HTML and Markdown text can be added in between code blocks to provide context or explanation. Graphics can also be displayed inline with the code. A notebook can also be viewed online without the reader having to install specific software.
  • Writing a README.md for your GitHub repository that provides context for the study, links to the article, links to the data, a list of code dependencies (which code libraries were used), a data dictionary, and any other information that may assist someone in following the code
  • Linking within your news article to the code and data repositories
  • Linking to reference material for any APIs or external software used. This saves you from re-writing instructions on how to use the APIs. For example, the comment bot collects comments from Disqus forums. Rather than explain how to set up a Disqus API account, I referred the reader to the Disqus documentation. 
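A data dictionary is easier to keep up to date when part of it is generated from the data itself. As a minimal sketch, the script below reads a CSV header and drafts a table for the README, marking any column that still needs a hand-written description. The column names, sample rows, descriptions, and the function name `data_dictionary` are assumptions for illustration.

```python
import csv
import io

# Hypothetical data file contents, invented for illustration.
SAMPLE_CSV = """tract,wait_seconds,surge
000100,240,1.0
000201,420,1.5
"""

# Descriptions are written by hand; only the table layout is automated.
DESCRIPTIONS = {
    "tract": "Census tract identifier",
    "wait_seconds": "Estimated wait time in seconds",
}


def data_dictionary(csv_file, descriptions):
    """Return a Markdown table with one row per CSV column."""
    reader = csv.DictReader(csv_file)
    first_row = next(reader, {})  # used only to show an example value
    lines = ["| Column | Example | Description |", "| --- | --- | --- |"]
    for name in reader.fieldnames or []:
        example = first_row.get(name, "")
        description = descriptions.get(name, "TODO: describe")
        lines.append(f"| {name} | {example} | {description} |")
    return "\n".join(lines)


table = data_dictionary(io.StringIO(SAMPLE_CSV), DESCRIPTIONS)
print(table)
```

The "TODO: describe" placeholder makes undocumented columns easy to spot before the repository is published.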

Considerations when sharing

Each project will have its own unique considerations. Sometimes sharing the data in its raw form will not be possible because of privacy issues. In these cases, it might be possible to share aggregated or cleaned data that no longer contains personally identifiable information. The data used may be proprietary or have been provided to you with restrictions or under an agreement. Sharing this data will of course not be possible, and a statement could be made to that effect. If the data itself cannot be shared, still consider sharing the code used to clean and analyse it. If graphics used in the article were created using code, share that too.
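As a minimal sketch of preparing a shareable file, the script below copies a CSV while leaving out columns that identify people. The column names, the sample rows, and the function name `strip_pii` are hypothetical. Note that dropping columns is only a first step: combinations of the remaining fields can sometimes still identify someone, so the output should be reviewed before release.

```python
import csv
import io

# Hypothetical list of identifying columns for this example.
PII_COLUMNS = {"name", "email"}

# Invented raw data, for illustration only.
RAW_CSV = """name,email,tract,wait_seconds
Jane Doe,jane@example.com,000100,240
John Roe,john@example.com,000201,420
"""


def strip_pii(in_file, out_file, pii_columns):
    """Copy a CSV to out_file without the listed columns.

    Returns the list of column names that were kept.
    """
    reader = csv.DictReader(in_file)
    kept = [c for c in reader.fieldnames or [] if c not in pii_columns]
    writer = csv.DictWriter(
        out_file, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # extra (PII) keys are ignored
    return kept


shareable = io.StringIO()
kept = strip_pii(io.StringIO(RAW_CSV), shareable, PII_COLUMNS)
print(shareable.getvalue())
```

Sharing this script alongside the cleaned file also documents exactly what was removed, even when the raw data has to stay private.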

Licensing

Sharing code is great, but for others to use your code, it has to come with a licence. There are many different ones to choose from depending on whether you’re sharing code, data, or a mixture of the two. Check out the links below to help you choose:

Closing

I hope these case studies and suggestions can stimulate discussion that leads us to developing expectations and standards for transparency and accountability in computational journalism.

Read my full research paper here.

Image: Mariano Mantel.
