Islam, Media Subject: How to quantify the perception of Islam in the media


By Pierre Bellon, for Skoli agency.

Islam, Media Subject is a collaboration between Moussa Bourekba, researcher at the Barcelona Center for International Affairs (CIDOB) and associate professor at Ramon Llull University in Barcelona, and Skoli, our young agency created in 2016 by Pierre Bellon (developer), Gauthier Bravais (project manager) and Lucas Piessat (journalist).

Project's origins

When we created Skoli, we were inspired by our experience in the data-driven journalism community. We wanted to help knowledge producers (researchers, activists, NGOs and so on) spread their information through innovative web formats.

This project is a proof of concept for what we mean by "innovative web formats". The idea behind it is to quantify how French media talk about Islam: over time it has become a political issue, and therefore a journalistic one, that we wanted to question. To this end, we chose to quantitatively analyse newspaper articles mentioning terms related to Islam and Muslims.

A small disclaimer: since this was our agency's first project, we were short on time and material resources, which limited our ability to explore and gather all the data we wanted. Our schedule allotted one month to documentation, research, and conception, followed by two months of analysis and development.

Delimiting the perimeter

To quantify the evolution of Islam-related subjects we chose to focus on newspaper articles that mentioned "Islam" or "Muslim", as well as articles mentioning "Islamism", "Islamist" and so on. Then, we needed to decide which newspapers to target. To do this, we investigated the most read daily national newspapers in France.

From these newspapers, we selected only three titles: Le Monde, Le Figaro, and Libération. We kept the selection small partly because we wanted to represent the political spectrum of the press, with Libération's left-leaning coverage, Le Figaro's right-leaning coverage, and Le Monde's centrist coverage. It was also an economical and time-saving choice, because access to articles (old or recent) was uneven across newspapers. For instance, some didn't have a working search engine for retrieving articles, so we reduced our scope in order to harvest data in a reasonable way.

The last choice we had to make was the time period to study. At first, we wanted to study a period of 20 years (1995-2015), but we soon had to narrow it because some newspapers' articles were not accessible before 1997. We therefore fixed our period to 1997-2015, hoping to update this study later to cover a longer time span.


We separated this project into two distinct applications (and code bases): the backend, responsible for harvesting, storing, cleaning and analyzing the data, and the frontend, responsible for presenting the analysis we conducted in collaboration with the researcher.


For the backend, a Python-only environment was a natural choice for two reasons. First, all the building blocks needed for this part were available in Python. This guaranteed homogeneity and interoperability between the backend's components, which simplified and accelerated development. Second, it's a language we're familiar with, so we could begin without having to overcome a big technical barrier.

The frontend was designed as a static web application fed with data exported from the backend. It was developed with the AngularJS framework combined with the well-known d3js data-visualization library. Here again, we chose these technologies because they were familiar to us, so we could concentrate on developing our application rather than on learning new techniques.

Data collection and preparation

Once we had delimited our study perimeter and coded the scrapers needed for our various sources, we started collecting data and ended up with more than 40,000 articles in our corpus.

We ran basic operations to clean the corpus, such as removing empty articles and stripping artefacts from the texts, like leftover JavaScript code or HTML entities. We then noticed that newspapers tended to publish the same article several times with identical (or almost identical) content. We could have ignored this, but it risked giving artificial weight to certain terms in our analysis, so we decided to remove those duplicates.
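As a rough illustration of this cleaning step, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the function names, the regular expressions and the 0.9 similarity threshold are all assumptions chosen for the example, and `difflib.SequenceMatcher` is simply one possible way to detect "almost identical" texts.

```python
import html
import re
from difflib import SequenceMatcher

# Assumed patterns for the artefacts mentioned above (scripts, tags, whitespace).
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip script blocks, leftover tags and HTML entities from one article."""
    text = SCRIPT_RE.sub(" ", raw)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)  # e.g. "&#39;" becomes "'"
    return SPACE_RE.sub(" ", text).strip()


def drop_empty_and_duplicates(articles, threshold=0.9):
    """Keep the first copy of each article; skip empty and near-identical ones.

    `threshold` is a hypothetical similarity cut-off, not a value from the study.
    """
    kept = []
    for article in articles:
        text = clean_text(article)
        if not text:
            continue  # empty once the artefacts are removed
        if any(SequenceMatcher(None, text, other).ratio() >= threshold
               for other in kept):
            continue  # identical or almost identical to an article already kept
        kept.append(text)
    return kept
```

Comparing every article with every article already kept costs quadratic time, so on a corpus of tens of thousands of articles a cheaper pre-filter (for example, comparing hashes of normalized text first) would probably be needed.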

Finally, to prepare our data for analysis, we removed punctuation, converted the text to lowercase to improve the counting of terms (so that "Muslim" and "muslim" count as the same term), and removed stop words (function words that carry little information).
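The three preparation steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word list below is a tiny hand-picked sample (a real French list is far longer), and treating every non-alphanumeric character as punctuation is our simplification, not necessarily what the project did.

```python
import re

# Tiny illustrative stop-word list (assumption); a real French list is much longer.
STOP_WORDS = {"le", "la", "les", "l", "un", "une", "de", "des", "du", "et", "en", "est"}

# Anything that is neither a word character nor whitespace counts as punctuation here.
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text):
    """Lowercase, replace punctuation with spaces, then drop stop words."""
    lowered = text.lower()
    no_punct = PUNCT_RE.sub(" ", lowered)
    return [token for token in no_punct.split() if token not in STOP_WORDS]
```

For example, `preprocess("L'Islam et les Musulmans de France.")` yields the tokens `islam`, `musulmans` and `france`, so that "Islam" and "islam" end up as the same term.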

Analysis model

This is where things got exciting for us, because we could finally start to explore the data. We experimented with many text-mining models and libraries, but in the end we stuck to a "simple" model: we didn't want to use techniques we didn't fully understand, nor produce results that we couldn't explain.


As you can see, this model has two parts. The first is part-of-speech tagging, which turns raw texts into tagged texts so that we can see which words are nouns, adjectives, verbs, and so on. The second is CountVectorizer, a feature borrowed from scikit-learn (a Python library for machine learning), which creates document-term matrices, a key element in text-mining analyses.

The basic principle of this matrix is to record the occurrences of terms in the documents of our corpus. As shown in the diagram above, the matrix's column indices represent the corpus' terms (each of which can be a single word or a set of words, an n-gram) while the row indices represent the documents. Since we kept all references to those documents in a separate array, we could then run complex queries against this matrix.
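To make the principle concrete, here is a small standard-library sketch of a document-term matrix with unigrams and bigrams. It deliberately does not use CountVectorizer: it is a simplified stand-in meant only to show the row/column layout, and the function names and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def document_term_matrix(documents, n_max=2):
    """Build (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(ngrams(doc.split(), n_max)) for doc in documents]
    vocabulary = sorted(set().union(*counts))  # column index -> term
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix
```

With the two toy documents `"islam radical"` and `"islam politique islam"`, the column for `islam` holds 1 in the first row and 2 in the second, while the column for the bigram `islam politique` holds 0 and 1. A real corpus produces a very large and mostly empty matrix, which is why libraries such as scikit-learn store it in a sparse format rather than as nested lists.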


For instance, we could retrieve all the matrix columns representing a term containing "Islam", then process them with pandas either to obtain the total number of occurrences or to examine occurrences over certain periods of time. Also, thanks to part-of-speech tagging, we could create even more complex queries like "give me all the terms with a first word like Islam associated with an adjective", as illustrated above.
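The two kinds of query described above might look like the following sketch. Everything in it is illustrative: the vocabulary, the matrix and the tagged documents are made-up toy values, the tag names (`NOUN`, `ADJ`, and so on) are assumed because the source does not say which tagger was used, and plain Python structures stand in for pandas.

```python
from collections import Counter

# Toy document-term matrix (made-up values): rows are documents, columns are terms.
VOCABULARY = ["guerre", "islam", "islamisme"]
MATRIX = [
    [0, 2, 1],
    [1, 1, 0],
    [3, 0, 0],
]

# Hypothetical tagger output: each document is a list of (word, tag) pairs.
TAGGED_DOCS = [
    [("islam", "NOUN"), ("radical", "ADJ"), ("inquiète", "VERB")],
    [("islam", "NOUN"), ("politique", "ADJ"), ("et", "CONJ"),
     ("islam", "NOUN"), ("radical", "ADJ")],
    [("islamisme", "NOUN"), ("progresse", "VERB")],
]


def column_totals(vocabulary, matrix, needle):
    """Sum, over all documents, each column whose term contains `needle`."""
    return {
        term: sum(row[j] for row in matrix)
        for j, term in enumerate(vocabulary)
        if needle in term
    }


def adjective_pairs(tagged_docs, prefix):
    """Count two-word terms: a word starting with `prefix` followed by an adjective."""
    pairs = Counter()
    for doc in tagged_docs:
        for (word, _tag), (next_word, next_tag) in zip(doc, doc[1:]):
            if word.startswith(prefix) and next_tag == "ADJ":
                pairs[f"{word} {next_word}"] += 1
    return pairs
```

The second function looks for the adjective right after the noun, which matches the usual French word order ("islam radical"); an adjective placed before the noun would need an extra check.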

We could then divide our analysis into two parts: first, a quantitative study focused only on how the occurrences of "Islam" evolved over time; second, a more qualitative study of how journalists talked about Islam and how this evolved over time depending on the newspaper they wrote for.

Visualizing the evolution of "Islam"

In this part, we wanted two levels of granularity: annual and monthly. The first level is useful for seeing the bigger picture and the global trend in our corpus. It revealed a global shift around 2001, linked to the 9/11 attacks: after this date, the number of occurrences of "Islam" never returned to its earlier level.


The second granularity level (occurrences per month) is more sensitive to variations, which helps in spotting "hot" and "cold" moments in the news (see the big peak in September 2001 below).
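The two granularity levels boil down to grouping the same counts by different keys. The sketch below shows the idea with the standard library only. The figures in `OCCURRENCES` are invented for illustration and are not data from the study, and the function name and its `granularity` parameter are hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up per-article counts of "Islam": (publication date, occurrences).
OCCURRENCES = [
    (date(2001, 8, 14), 1),
    (date(2001, 9, 12), 4),
    (date(2001, 9, 20), 3),
    (date(2002, 1, 5), 2),
]


def aggregate(occurrences, granularity):
    """Sum occurrences per year ("year") or per calendar month ("month")."""
    if granularity not in ("year", "month"):
        raise ValueError("granularity must be 'year' or 'month'")
    totals = Counter()
    for day, count in occurrences:
        key = day.year if granularity == "year" else (day.year, day.month)
        totals[key] += count
    return dict(sorted(totals.items()))
```

Grouping by `(year, month)` rather than by month alone keeps September 2001 separate from September of any other year, which is what makes individual peaks visible on the monthly chart.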


The evolution of terms & adjectives

We then studied trends in the use of certain terms, and visualized these in two ways. First, we created a word cloud to give a global overview of the most used words across the whole corpus. We then complemented this visual with a bubble chart, built with various parts of d3js (detailed in this gist). The idea behind this visualization is to show how the main annual subjects evolved per newspaper, with new words represented by filled bubbles. For instance, we can see below that the war in Iraq was one of the main subjects of 2003.
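On the data side, the bubble chart needs to know, for each year, which main terms are new. The sketch below shows one way this could be prepared for a single newspaper. It rests on an assumption the source does not spell out: a term is treated as "new" when it was not among the main terms of the previous year in the data. The function name is hypothetical.

```python
def flag_new_terms(top_terms_by_year):
    """For each year, pair every main term with True if it was absent the year before.

    Assumption: "new" means "not among the previous year's main terms".
    Every term of the first year is therefore flagged as new.
    """
    flagged = {}
    previous = set()
    for year in sorted(top_terms_by_year):
        terms = top_terms_by_year[year]
        flagged[year] = [(term, term not in previous) for term in terms]
        previous = set(terms)
    return flagged
```

Given made-up input such as `{2002: ["afghanistan", "terrorisme"], 2003: ["irak", "terrorisme"]}`, the term `irak` is flagged as new in 2003 while `terrorisme` is not, so only the first would be drawn as a filled bubble.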


The last part of this analysis focused on visualizing the adjectives associated with Islam. It was one of our main goals, because we thought it would reflect the tone used when speaking about Islam. As with the first part, we divided this goal into two data visualizations: first, a word cloud to provide a global picture of our corpus; then, line charts to see and compare the usage of a small number of adjectives that we wanted to study in depth.



In conclusion, we're happy that we could publish this little study in the way we initially imagined it. However, in the process of conducting this study, we learnt about a ton of techniques that we couldn’t use due to resource limitations. We hope to continue this study further because we think this corpus deserves deeper consideration. With additional text-mining techniques, like topic modelling, and some more time, we will hopefully be able to draw out more precise and qualitative information that will improve our understanding of how Islam is treated in the media.

Nevertheless, this was a rich and rewarding collaboration between a researcher and a data-driven agency, and we're eager to continue exploring new topics and trends.

Visit Islam, Media Subject here.