When data writes…who is the author?


If a robot wrote this article, you probably wouldn’t know it. The movement to produce automated content, or ‘robot journalism’, has grown substantially over the past couple of years. But editorial policies on how this content is attributed have not moved at the same pace. Typically, human-written stories are accompanied by a byline crediting the journalist who spent hours researching and writing a piece; yet, when a data-driven algorithm is doing all the work, the attribution imperative is not as clear. So how do news organizations gauge the best attribution regime for automated journalism? And why is it important to develop a sound crediting policy for this type of content?

To find out more about the prevailing authorship and crediting policies for automated content, Tal Montal and Zvi Reich, from Ben-Gurion University of the Negev, conducted a content analysis of automated stories on 12 websites and collected qualitative data from interviews with the associated news outlets. We spoke to Tal about their research.

DDJ: Your study looks at transparency behind automated journalism, particularly disclosure transparency and algorithmic transparency. Can you provide a brief outline of these concepts, as well as why journalists and audiences should care about them?

Tal: Transparency is a crucial feature of modern journalism, and there is an ongoing debate over how open news organizations need to be towards the public. In automated journalism, the "black box" of generative algorithms is used at different levels of news production. Given the potential implications of using these algorithms, we saw transparency as an essential part of any desirable crediting policy.

Disclosure transparency refers to the extent to which journalistic organizations and news producers tell readers how their news stories are selected, processed and produced. In our case, it is reflected in how readers are informed that a certain story was generated by an algorithm. This type of transparency, which became more common with the rise of online journalism, is a must in the domain of automated journalism.

Algorithmic transparency is a more complex, expensive and controversial type of transparency, in which the methodology of an algorithm (the way it actually works, from input to output, and even its limitations) is disclosed to readers. It is more complex because the methodology has to be explained to readers who are not familiar with computer programming. It is more expensive because explaining it takes time. And it is controversial because of trade secret and copyright issues: the algorithms' owners (either software companies in the field or news organizations) are understandably reluctant to reveal how their algorithms work.

Making these transparency routines part of a crediting policy for automated content will diminish the effects of automation (covered in a later question) and help both journalists and readers understand the strengths and weaknesses of this mode of content production and evaluate its output properly.

You categorized your findings into four levels of transparency. Can you outline what these are and how they can be used to assess the current state of disclosure and algorithmic transparency?

The four levels of transparency were based on a content analysis of automated stories from the sites for which we had solid information about their use of algorithms in generating stories. We first separated sites (or different sections of the same site) that had a full disclosure note (covering the algorithmic nature of the specific story, the developer of the algorithm, data sources, etc.) from those that didn't. Sites with such a note formed the top tier, Full Transparency. Besides a full disclosure note, all of the Full Transparency sites had a byline, crediting the news organization (such as AP), the software company (such as Automated Insights or Narrative Science), the human reporter or even the bot itself (like Quakebot at the LA Times).

Then, among the sites without a full disclosure note, we distinguished three groups: those with a byline crediting the software vendor or the algorithm itself, which implied the automated nature of the news piece, though not explicitly (Partial Transparency); those with a byline crediting the organization itself, without naming a writer, hence implying the uniqueness of this particular news story (Low Transparency); and those with no byline at all (No Transparency).

Only the Full Transparency sites actually employ both disclosure transparency and (to some extent) algorithmic transparency. The Partial Transparency sites can only be considered to employ, to some extent, a disclosure transparency routine, though that is arguable. The Low Transparency sites, and of course the No Transparency sites, don't meet the standards of either type of transparency routine.
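As an illustration only, and not part of the study itself, the tiering described above can be read as a small decision rule. The Python sketch below encodes it under stated assumptions: the function name `classify_site`, the parameter names and the byline labels (`"vendor"`, `"algorithm"`, `"organization"`) are hypothetical, chosen for this example. The interview does not say how a site with a human-reporter byline but no disclosure note would be tiered, so the sketch refuses to guess.

```python
from enum import Enum
from typing import Optional


class Transparency(Enum):
    FULL = "Full Transparency"
    PARTIAL = "Partial Transparency"
    LOW = "Low Transparency"
    NONE = "No Transparency"


def classify_site(has_disclosure_note: bool,
                  byline_credit: Optional[str]) -> Transparency:
    """Assign one of the four tiers described in the interview.

    byline_credit is a hypothetical label: "vendor", "algorithm",
    "organization", or None when the story carries no byline.
    """
    # A full disclosure note puts a site in the top tier, whatever the byline.
    if has_disclosure_note:
        return Transparency.FULL
    # No disclosure note and no byline at all.
    if byline_credit is None:
        return Transparency.NONE
    # The byline hints at automation by crediting the vendor or the bot.
    if byline_credit in ("vendor", "algorithm"):
        return Transparency.PARTIAL
    # The byline credits only the news organization.
    if byline_credit == "organization":
        return Transparency.LOW
    # Any other combination is not covered by the interview.
    raise ValueError(f"tier not specified for byline credit {byline_credit!r}")
```

For example, a story with no disclosure note and a byline naming the bot would come out as `Transparency.PARTIAL`, while the same story with no byline would come out as `Transparency.NONE`.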

What did your study reveal about news organisations' views towards the relationship between automated and public interest journalism?

This was a very interesting part, involving interviews with senior staff and experts from a variety of roles (editors, managers, journalists, developers), coverage domains (sports, weather, finance, etc.) and modes of automated content generation (in-house development or software vendors).

First, we discovered a unanimous anthropomorphic perception of the author of automated content. All of the interviewees named either a single human author (mainly the developer of the code) or the organization as a whole (because it is considered a collaborative process, and because of the responsibility the organization takes for the automated output). None of them regarded the algorithm itself as the author.

Second, when we asked about their views on byline and crediting policies, most of them didn't think automated content requires a different or more suitable policy. Some of the respondents said it shouldn't differ from the usual crediting policy for human authors, or from whatever policy prevails in their organization, and didn't mention any special need to add a full disclosure note.

Third, and more interesting: although they all point to a human author, and don't believe automated journalism needs a special crediting policy of its own, these senior staff and experts consider transparency crucial and believe readers have the right to know that these stories are automated. In effect, they fully agree with the importance of disclosure transparency and, in part, with algorithmic transparency; one of the respondents, for instance, even spoke about the importance of publishing the stories' data sources.

We must admit that the interviewees come mostly, as you might presume, from Full Transparency organizations, which understand the importance of transparency routines, so their views represent organizations that accept and act on that principle. These views, the gaps between them and actual practices, and the discrepancies between these views and the scholarly literature made us understand the crucial need for a new, comprehensive and consistent byline and full disclosure policy for automated journalism.

What challenges does automated journalism present for public interest journalism? Are there any opportunities?

Well, the opportunities are pretty exciting for both readers and journalists. Readers are now provided with news pieces and journalistic stories in niche coverage domains (such as high-school basketball leagues). They can get important and accurate data in almost real time (such as earthquake reports minutes after a quake occurs, or earnings previews). And they can now "drill down" into granular data within broad, aggregated stories (such as schools projects in the US, in which algorithms make it possible to filter for and read a particular story about each and every school).

From the side of journalists and journalistic organizations, it seems like a life-saver for the struggling media, in the sense that it expands the possibilities of reaching these "long-tail" readers at almost no additional marginal cost (developing and running an algorithm for generating sports recaps costs almost the same for 1,000 recaps as for 10,000, except for the reload and processing times, which are getting shorter as we speak). It also allows for collaborative pieces in which the algorithms provide the data and the short textual pieces, and the human writers expand on them or use them in a wider perspective. It even provides a way of automatically generating formulaic news pieces (such as earnings previews), which frees human journalists to write other, broader, deeper stories that still need human journalistic routines. In addition, these generative algorithms can serve as a stopgap for understaffed media organizations.

Nevertheless, there are challenges which we cannot ignore. We identified five major potential implications of using these algorithms in news and journalistic organizations.

There are (1) practical implications, because these algorithms are used in fields such as real estate and securities. Any mistake, wrong data source or even a misleading sentence can affect readers' real-life decisions. The same goes for human-written news stories, but the quality assurance routine may differ, and automated news has other effects (such as the following ones) that can mislead readers. This is related to the (2) psychological effect of the algorithms, which are perceived as more objective, accurate and fair. This affects both readers and journalists, and may therefore influence their evaluation processes and practices. (3) These algorithms have the (potential) ability to select the required data or process it in a certain way, and to "frame" it (either through the data they choose or through the speech patterns they use), so when covering social and political issues they can affect the visibility of socio-political actors or maintain a predetermined agenda. These ramifications may lead to complaints and even lawsuits against the news organizations that use these generative algorithms, and there lies the (4) vicarious liability towards readers, not only legally but also from an ethical perspective. The last type of implication is the (5) occupational aspect: the technology might threaten human journalists, their positions, their practices and their autonomy but, on the other hand, can assist them in their work, as mentioned earlier.

Following the completion of the study, what are your views on the best way to treat algorithmic authorship in line with the public interest?

Given the variety of crediting policies, the interview responses from senior staff and experts, and the challenges inherent in using this innovative technology, and based on our theoretical integration, we suggest in our study a new, consistent and comprehensive policy that promotes the public interest. It distinguishes between output that is fully generated by an algorithm (algorithmic content generation) and output generated by an algorithm in collaboration with a human journalist (integrative content generation).

Our suggested attribution policy for algorithmic content generation is as follows:

  • The byline should be attributed to the software vendor, or the programmer in the case of an individual in-house programmer.
  • The full disclosure should clearly state the algorithmic nature of the content (while describing the software vendor, or the programmer’s role in the organization), and detail the data sources of the particular story and the algorithm methodology.

In case of integrative content generation, our suggested policy is as follows:

  • The byline should be attributed to the human journalist(s), as the representative of the collaborative work done with the algorithm, in accordance with the anthropomorphic characteristics of the modern journalistic credit.
  • The full disclosure should declare the objects created by an algorithm in the particular story (a chart, map, specific paragraph, etc.), as well as the content’s algorithmic nature (describing the software vendor’s business domain, or the programmer’s role), data sources of the story and the algorithm methodology.

As you can see, we accept the notion of a human author of automated content, as argued by most interviewees and scholars: the representative of the collaborative work for integrative content generation, and the programming entity (programmer or software vendor) for algorithmic content generation. Our suggested policy is tailored, however, to the current level of technological development of robot journalism algorithms, considering their current level of creativity, among other criteria. Any significant technological development, or legislative progress regarding computer-generated works, may invite corresponding adjustments to this policy.
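For newsrooms that want to apply the two suggested policies inside a publishing workflow, the rules above might be sketched in code along the following lines. This is a minimal Python illustration under stated assumptions, not an implementation from the study: the `Attribution` class, the `suggest_attribution` function, its parameters and the wording of the disclosure lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Attribution:
    byline: str
    disclosure: List[str]


def suggest_attribution(
    programming_entity: str,              # software vendor or in-house programmer
    data_sources: Sequence[str],
    methodology_note: str,
    journalists: Sequence[str] = (),      # empty means fully algorithmic content
    algorithmic_objects: Sequence[str] = (),  # e.g. a chart or a specific paragraph
) -> Attribution:
    """Build a byline and disclosure items following the suggested policy."""
    # Both policies disclose the algorithmic nature of the content,
    # the story's data sources and the algorithm's methodology.
    disclosure = [
        f"Algorithmic content; software by {programming_entity}",
        "Data sources: " + ", ".join(data_sources),
        "Methodology: " + methodology_note,
    ]
    if journalists:
        # Integrative content generation: the human journalist(s) take the
        # byline, and the algorithm-made objects are declared first.
        byline = ", ".join(journalists)
        disclosure.insert(
            0, "Generated by algorithm: " + ", ".join(algorithmic_objects)
        )
    else:
        # Algorithmic content generation: the byline goes to the vendor
        # or the in-house programmer.
        byline = programming_entity
    return Attribution(byline, disclosure)
```

Called with no journalists, the function credits the programming entity in the byline; called with one or more journalists, it credits them instead and adds a line listing which objects in the story were created by the algorithm.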

Read the full research article here.

Image: Rene Passet.