Data journalism and the ethics of publishing Twitter data


Collecting and publishing data from social media sites such as Twitter are everyday practices for the data journalist. Recent findings from Cardiff University’s Social Data Science Lab question the practice of publishing Twitter content without seeking some form of informed consent from users beforehand. Researchers found that tweets collected around certain topics, such as those related to terrorism, political votes, changes in the law and health problems, create datasets that might contain sensitive content, such as extreme political opinion, grossly offensive comments, overly personal revelations and threats to life (both to oneself and to others). Handling these data in the process of analysis (such as classifying content as hateful and potentially illegal) and reporting has brought the ethics of using social media in social research and journalism into sharp focus.

Ethics is an issue that is becoming increasingly salient in research and journalism using social media data. The digital revolution has outpaced parallel developments in research governance and agreed good practice. Codes of ethical conduct that were written in the mid-twentieth century are being relied upon to guide the collection, analysis and representation of digital data in the twenty-first century. Social media is particularly ethically challenging because of the open availability of the data (particularly from Twitter). Many platforms’ terms of service specifically state that users’ public data will be made available to third parties, and by accepting these terms users legally consent to this. However, researchers and data journalists must interpret and engage with these commercially motivated terms of service through a more reflexive lens, which implies a context-sensitive approach, rather than focusing only on the legally permissible uses of these data.

Social media researchers and data journalists have experimented with data from a range of sources, including Facebook, YouTube, Flickr, Tumblr and Twitter to name a few. Twitter is by far the most studied of all these networks. This is because Twitter differs from other networks, such as Facebook, that are organised around groups of ‘friends’, in that it is more ‘open’ and the data (in part) are freely available to researchers. This makes Twitter a more public digital space that promotes the free exchange of opinions and ideas. Twitter has become the primary space for online citizens to publicly express their reaction to events of national significance, and also the primary source of data for social science research into digital publics.

The Twitter streaming API provides three levels of data access: the free random 1% sample, which provides ~5M tweets daily, and the random 10% and 100% streams (chargeable, or free to academic researchers upon request). Datasets on social interactions of this scale, speed and ease of access have been hitherto unrealisable in the social sciences and journalism, and have led to a flood of journal articles and news pieces, many of which include tweets with full text content and author identity without informed consent. This is presumably because of Twitter’s ‘open’ nature, which leads to the assumption that ‘these are public data’ and that using them does not require the rigour and scrutiny of ethical oversight. Even when these data are scrutinised, the ‘public data’ argument is readily accepted, because there is no framework for evaluating the potential harms to users. The Social Data Science Lab takes a more ethically reflexive approach to the use of social media data in social research, and carefully considers users’ perceptions, online context and the role of algorithms in estimating potentially sensitive user characteristics.
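To give a feel for the scale involved, the following is a back-of-envelope sketch only. It assumes the ~5M tweets per day quoted for the 1% sample and assumes that the larger streams scale linearly with the sampling percentage; the function name is ours and is not part of any Twitter API.

```python
# Assumption: ~5M tweets/day in the 1% sample, the figure quoted in this post.
SAMPLE_1PCT_PER_DAY = 5_000_000


def estimated_daily_volume(percent: int) -> int:
    """Scale the 1% figure linearly to another sampling percentage.

    Hypothetical helper for illustration; real stream volumes vary by day.
    """
    return SAMPLE_1PCT_PER_DAY * percent


if __name__ == "__main__":
    for pct in (1, 10, 100):
        print(f"{pct}% stream: about {estimated_daily_volume(pct):,} tweets per day")
```

Under these assumptions the full stream would run to roughly half a billion tweets a day, which is why case-by-case consent for every collected tweet is not a realistic expectation at the collection stage.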

A recent Lab survey conducted into users’ perceptions of the use of their social media posts found the following:

  • 94% were aware that social media companies had Terms of Service
  • 65% had read the Terms of Service in whole or in part
  • 76% knew that when accepting Terms of Service they were giving permission for some of their information to be accessed by third parties
  • 80% agreed that if their social media information is used in a publication they would expect to be asked for consent
  • 90% agreed that if their tweets were used without their consent they should be anonymised

These survey findings show that there may be a disjuncture between the current practices of social researchers and data journalists in relation to publishing the content of Twitter posts, and users’ views of the fair use of their online communications in publications and their rights as data subjects. Much of this disconnection seems to stem from what is perceived as public in online communications, and therefore what can be published as data without consent or the protection of anonymisation.

Existing ethical guidelines that provide principles for research in public places focus on traditional forms of data and data collection. Most guidelines stress that consent, confidentiality and anonymity are often not required where the research is conducted in a public place where people would reasonably expect to be observed by strangers. However, the perceptions of the majority of Twitter users clearly differ from this viewpoint. This is most likely because Twitter blurs the boundary between public and private space.

A social media researcher’s point of view must take into account the unique nature of this online public environment. Internet interactions are shaped by ephemerality, anonymity, a reduction in social cues and the realisation of time-space distanciation, leading individuals to reveal more about themselves within online environments than would be done in offline settings, blurring the public and the private. Research has highlighted the disinhibiting effect of computer-mediated communication, meaning Internet users, while acknowledging the environment as a (semi) public space, often use it to engage in what would be considered private talk. Online information is often intended only for a specific networked public made up of peers, a support network or specific community, not necessarily the Internet public at large, and certainly not for publics beyond the Internet. When it is viewed by unintended audiences it has the potential to cause harm, as the information is flowing out of the context it was intended for. We may be satisfied with the argument that academic and regulatory delineations of the public-private divide may not hold in online contexts and as such privacy is a concept that must include a consideration of expectations and consensus within context.

Informed consent and anonymity are further warranted given the abundance of sensitive data that are generated and contained within these online networks. Lab research shows associations between users’ sexual orientation, ethnicity and gender, and their feelings of concern and expectations of anonymity. A principal ethical consideration is to ensure the maximum insight from data journalism whilst minimising the risk of actual or potential harm during data collection, analysis and publication.

Potential for harm in data journalism using social media increases when sensitive data are estimated. These data can include personal demographic information (such as ethnicity and sexual orientation), information on associations (such as membership of particular groups or links to other individuals known to belong to such groups) and communications of an overly personal or harmful nature (such as details on morally ambiguous or illegal activity and expressions of extreme opinion). In some cases, such information is knowingly placed online, whether or not the user is fully aware of who has access to this information and how it might be repurposed. In other cases, sensitive information is not knowingly created by users, but it can often come to light in analysis where associations are identified between users and personal characteristics are estimated by algorithms.

If we are to balance the privacy of Twitter users (given the disinhibiting nature of the environment and the abundance of sensitive information) with the needs of data journalists, a sensible way forward would be to collect data without explicit consent and seek informed consent for all directly quoted content in publications. The alternative of providing anonymity to directly quoted users is not practical in this form of research, due to Twitter guidelines and the issue of online search (where quoted text is easily searchable, rendering users and their partners in conversation identifiable). In the case of the reproduction of tweets (public display of tweets by any and all means of media), Twitter’s (2016) Broadcast guidelines state that publishers should:

  • Include the user’s name and Twitter handle (@username) with each Tweet;
  • Use the full text of the Tweet. Editing Tweet text is only permitted for technical or medium limitations (for example, removing hyperlinks);
  • Not delete, obscure, or alter the identification of the user. Tweets can be shown in anonymous form in exceptional cases such as concerns over user privacy;
  • In some cases, seek permission from the content creator, as Twitter users retain rights to the content they post.

If data journalists are to abide by these guidelines, informed consent should be sought from each tweeter to directly quote their post in research outputs, given that anonymity is not advised. This is particularly important considering Twitter’s view that users retain rights to the content they post. The issue of deletion and the ‘right to be forgotten’ further buttress the need for consent to directly quote. Twitter’s (2015) terms of service for the use of its APIs by developers require that data harvesters honour any future changes to user content, including deletion. However, data journalists should not conclude that conventional representation of social media content is precluded. As in conventional journalism, journalists can make efforts to gain informed consent from a limited number of posters if verbatim examples of text are required.
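For readers who hold collected tweets locally, the points above can be sketched as a simple pre-publication check. This is an illustration under stated assumptions rather than an official workflow: the `StoredTweet` record, its `consent_given` flag and the list of deleted tweet IDs are all hypothetical, and how you learn that a tweet has been deleted depends on the API access you have.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StoredTweet:
    """Hypothetical local record of a collected tweet."""
    tweet_id: str
    display_name: str
    handle: str                  # without the leading "@"
    text: str                    # full, unedited tweet text
    consent_given: bool = False  # informed consent to quote verbatim


def purge_deleted(tweets: Iterable[StoredTweet],
                  deleted_ids: Iterable[str]) -> List[StoredTweet]:
    """Drop stored tweets whose IDs have since been reported as deleted."""
    deleted = set(deleted_ids)
    return [t for t in tweets if t.tweet_id not in deleted]


def quotable(tweets: Iterable[StoredTweet],
             deleted_ids: Iterable[str]) -> List[StoredTweet]:
    """Keep only tweets that still exist and whose authors have consented."""
    return [t for t in purge_deleted(tweets, deleted_ids) if t.consent_given]


def format_quote(tweet: StoredTweet) -> str:
    """Show the full text together with the user's name and @handle."""
    return f'{tweet.display_name} (@{tweet.handle}): "{tweet.text}"'


if __name__ == "__main__":
    collected = [
        StoredTweet("1", "Ann Example", "ann_example", "Voting today.", True),
        StoredTweet("2", "Bob Example", "bob_example", "A personal post."),
        StoredTweet("3", "Cy Example", "cy_example", "Since removed.", True),
    ]
    for tweet in quotable(collected, deleted_ids=["3"]):
        print(format_quote(tweet))
```

In this sketch only the first tweet would be printed: the second has no consent recorded and the third has been deleted. Tweets that fail the check can still inform aggregate analysis; they are simply not quoted.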

In line with the points raised in this blog, we propose that researchers conduct a risk assessment ahead of publishing tweets in research outputs. The decision flow chart below is designed to assist researchers and data journalists in reaching a decision on whether or not to publish a tweet, and in what contexts informed consent (opt-in or opt-out) may be required.

Image: Esther Vargas.