Counting the Pakistani Prime Minister’s speech writers: A stylistic analysis


Several months ago, a series of terrorist attacks in Pakistan, and the ensuing pressure from the army to crack down on terrorists, prompted Prime Minister Nawaz Sharif to address the nation and clarify his government’s strategy. After the speech, an audio clip leaked onto social media, allegedly of the Prime Minister asking his aides for advice on his diction throughout the speech.

A few months later, the Panama Papers were leaked, putting the Prime Minister’s family in the spotlight for their properties in Panama. This impelled the PM to give another speech. Purportedly, another audio clip was then leaked, again containing discussions between the PM and his advisers about the diction of his speech.

Can we find out whether the PM normally seeks advice on his speech writing? Or does he have specific authors for speeches delivered on special occasions? The speeches for which audio clips were leaked were written in the native Urdu. Speeches in English may well be assigned to others.

This presented an interesting opportunity. Traditional journalism would need to follow long and uncertain paths to find out how many authors write the PM’s speeches, whereas data journalism may offer a more direct route through stylometry. In this field, stylistic patterns in text are analysed to detect deviations, which would indicate the possible presence of a different author. The crucial assumption is this: each author has a unique writing style, so the way they use function words and construct sentences forms a ‘textual fingerprint’. If we are able to extract these ‘fingerprints’, then we can possibly identify authorship.

I therefore decided to conduct a stylistic analysis of the PM’s speeches, using texts obtained from the PM’s official website. I first cleaned the text data and removed honorifics and the names of dignitaries, so that the features would reflect writing style rather than the occasion of each speech. I then had to generate feature sets that would capture stylistic patterns and enable the identification of shifts in authorial signature. For this purpose, I generated four data sets based on the following:

  • Bigrams for Most Frequent Part-of-Speech (POS) Tags
  • Unigrams for Most Frequent Function Words
  • Bigrams for Most Frequent Function Words
  • 4-grams for Most Frequent Characters
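
As a rough illustration of what two of these feature sets look like, here is a minimal Python sketch that counts character 4-grams and computes relative frequencies of function words. It is not the code used in the analysis: the function names, the tiny `FUNCTION_WORDS` list and the whitespace-based tokenisation are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real study would use a much longer list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "is", "we", "our"]


def char_ngrams(text, n=4):
    """Count overlapping character n-grams after lowercasing and
    collapsing runs of whitespace into single spaces."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def function_word_freqs(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word among all tokens.
    Tokenisation is a naive whitespace split with punctuation stripped."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {word: counts[word] / total for word in vocabulary}


if __name__ == "__main__":
    sample = "We are committed to the peace and the prosperity of our people."
    print(char_ngrams(sample).most_common(3))
    print(function_word_freqs(sample))
```

Each speech then becomes one row of numbers, one column per feature, and those rows are what the techniques below compare.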

Once these were available, I applied the following analytical techniques:

  • Multi-Dimensional Scaling
  • Hierarchical Clustering with Heatmap
  • Bootstrap Consensus Networks (a recent innovation due to Maciej Eder)

The first two techniques are well known, but the third merits elaboration because of its unusual procedure. We define batches of features of increasing size, which are used to calculate the dissimilarity between speeches (for example, the first batch contains the first 50 features, the second the first 100, the third the first 150, and so on). For each batch, dissimilarity is calculated, preferably using Cosine Delta (the cosine distance applied to z-score standardised features). Once a dissimilarity matrix is available for each batch, we keep, for each speech, only the n speeches most similar to it, akin to nearest-neighbour selection. Because we now have several batches, each listing the n most similar speeches for every speech, we can form a consensus across the batches to check whether two speeches are similar merely by chance or because of consistent stylistic patterns.
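
The procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Eder's implementation or the code behind this analysis: the function names are hypothetical, every neighbour link gets an equal vote, and the input is assumed to be a table of relative frequencies with one row per speech and columns ordered from most to least frequent feature.

```python
import math
from collections import Counter


def zscore_columns(rows):
    """Standardise each feature (column) to zero mean and unit variance."""
    standardised = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        standardised.append([(x - mean) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*standardised)]


def cosine_distance(u, v):
    """1 minus the cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def consensus_votes(freqs, batch_sizes, n_neighbours=1):
    """For each batch size k, use the first k features, compute Cosine Delta
    between all speeches, and give one vote to the link (i, j) whenever
    speech j is among the n nearest neighbours of speech i."""
    votes = Counter()
    for k in batch_sizes:
        z = zscore_columns([row[:k] for row in freqs])
        for i, zi in enumerate(z):
            ranked = sorted(
                (cosine_distance(zi, zj), j) for j, zj in enumerate(z) if j != i
            )
            for _, j in ranked[:n_neighbours]:
                votes[(i, j)] += 1
    return votes


if __name__ == "__main__":
    # Toy frequencies (made up for illustration): 4 speeches, 4 features.
    toy = [
        [0.10, 0.02, 0.08, 0.01],
        [0.11, 0.03, 0.09, 0.02],
        [0.02, 0.10, 0.01, 0.08],
        [0.03, 0.11, 0.02, 0.09],
    ]
    for link, count in sorted(consensus_votes(toy, batch_sizes=[2, 4]).items()):
        print(link, count)
```

A link that collects votes in most or all batches reflects a similarity that holds regardless of how many features are used, which is the sense in which the consensus filters out chance resemblances. Drawing the well-supported links as a network then reveals the groups of speeches.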


In my analyses, each of the three techniques produced two distinct clusters of speeches. Since the two clusters differ stylistically, it is conceivable that two authors are involved in writing speeches for Prime Minister Nawaz Sharif. One cluster was larger, and it also contained more of the speeches delivered at important venues, such as UN meetings.


So does the PM have two authors? Does he prefer one author over the other for writing important speeches?

It is difficult to prove with complete certainty, but it all seems very likely!

Check out the full analysis here.