How researchers created a journalism tool by scraping data from science journals


Using data extraction and analysis to create tools for other forms of journalism - not just the reporting itself.

Complex jargon often makes scientific work less accessible to the general public. By employing a set of specific reporting strategies, journalists bridge the gap between scientists and the public, delivering information about scientific advances in a readable, engaging way. One such strategy is using simpler terms in place of complex jargon. To assist in this process, we created DeScipher [3], a text editor application that suggests and ranks possible simplifications of complex terminology for a journalist while she is authoring an article. DeScipher applies simplification rules (mappings between complex words and simple words with the same meaning) derived from a large collection of scientific abstracts and associated author summaries. DeScipher is also designed to consider properties of the textual context when suggesting simplifications to the journalist.

The corpus

We used 15,867 abstracts and author summaries from various open-access PLOS (Public Library of Science) journals published between 2010 and 2014 [5]. In the author summary section, the authors of a scientific article are guided to use simpler, non-technical terms. The main purpose of this section is to make the article more accessible to a wider (non-scientific) audience. In contrast, the abstract is intended to describe the contributions succinctly and clearly for readers without necessarily simplifying the language.

How we learn simplification rules

We adapted Biran et al.’s context-aware lexical simplification technique [1], which generates simplification rules given a corpus divided into complex and simple documents (i.e., “standard” and “Simple English” versions of Wikipedia in the original implementation). Each rule consists of a pair of words where the second word can be used to simplify the first (e.g., acaricides and pesticides, Na and sodium), along with a score indicating the similarity of the pair based on the other words they co-occur with.
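To make the similarity score concrete, here is a minimal sketch of one way to score a pair by the words each member co-occurs with. It assumes cosine similarity over co-occurrence counts within a small token window; the window size, the function names (context_vectors, cosine) and the toy sentences are illustrative assumptions, not details of the DeScipher implementation.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy sentences invented for illustration (not from the PLOS corpus).
sentences = [
    "farmers spray acaricides on crops".split(),
    "farmers spray pesticides on crops".split(),
    "sodium levels rise in blood".split(),
]
vectors = context_vectors(sentences)
score_similar = cosine(vectors["acaricides"], vectors["pesticides"])
score_different = cosine(vectors["acaricides"], vectors["sodium"])
```

In this toy data, acaricides and pesticides appear in identical contexts and so score far higher than acaricides and sodium, which share no context words.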

The first stage of the pipeline consists of identifying simplification rules. We first found content words in the combined abstracts and author summaries (i.e., words that remained after eliminating stop words, numbers and punctuation). We then filtered all possible pairs of the content words we observed using the following steps: stem both words (using the Porter stemmer provided by the Python NLTK library [2]) and omit those pairs which share a lemma (e.g., permutable, permutation); tag the part of speech (POS) of each word and omit those pairs for which the POS differs (e.g., permutation (noun), changed (verb)); check that the words in each pair have a synonym or hypernym relation to each other using WordNet [4], and exclude those that do not.
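The filtering steps above can be sketched as a small function that takes the three checks as parameters. In a real pipeline these would be backed by the NLTK Porter stemmer, a POS tagger and WordNet lookups; here they are replaced by hand-written toy dictionaries (TOY_STEMS, TOY_POS, TOY_RELATED, all hypothetical) so the example runs without any downloads. It illustrates the order of the filters rather than reproducing the authors' code.

```python
from itertools import combinations


def candidate_pairs(words, stem, pos_of, related):
    """Yield word pairs that survive the stem, POS and relation filters."""
    for a, b in combinations(sorted(set(words)), 2):
        if stem(a) == stem(b):      # same stem: morphological variants, skip
            continue
        if pos_of(a) != pos_of(b):  # part of speech differs, skip
            continue
        if not related(a, b):       # no synonym/hypernym relation, skip
            continue
        yield (a, b)


# Toy stand-ins (assumptions) for the stemmer, POS tagger and WordNet.
TOY_STEMS = {
    "permutable": "permut", "permutation": "permut", "changed": "chang",
    "acaricides": "acaricid", "pesticides": "pesticid",
}
TOY_POS = {
    "permutable": "ADJ", "permutation": "NOUN", "changed": "VERB",
    "acaricides": "NOUN", "pesticides": "NOUN",
}
TOY_RELATED = {frozenset({"acaricides", "pesticides"})}

pairs = list(candidate_pairs(
    TOY_STEMS,                                    # iterating a dict yields its keys
    stem=lambda w: TOY_STEMS[w],
    pos_of=lambda w: TOY_POS[w],
    related=lambda a, b: frozenset({a, b}) in TOY_RELATED,
))
```

Passing the checks in as functions keeps the filter order visible and makes it straightforward to swap the toy lookups for real NLTK and WordNet calls.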

We then ensure that one word is, in fact, simpler (to ensure that the suggestions that DeScipher makes are simplifications, not simply synonyms). To do so, we first calculate the corpus complexity of each word in a pair as the ratio between the frequency of occurrence of the word in the complex versus simple corpus.
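As a rough illustration of the corpus complexity ratio, the sketch below counts each word in a complex corpus (abstracts) and a simple corpus (author summaries) and divides the two frequencies. The add-one smoothing, the tie handling and the rule that the word with the higher ratio is treated as the complex one are assumptions made for this example; the function names and the toy token lists are hypothetical.

```python
from collections import Counter


def corpus_complexity(word, complex_counts, simple_counts, smoothing=1.0):
    """Ratio of a word's frequency in the complex corpus to the simple corpus.

    Smoothing (an assumption here) avoids division by zero for words that
    never appear in the simple corpus.
    """
    return (complex_counts[word] + smoothing) / (simple_counts[word] + smoothing)


def order_pair(a, b, complex_counts, simple_counts):
    """Return (complex_word, simple_word), or None if the ratios are tied."""
    ca = corpus_complexity(a, complex_counts, simple_counts)
    cb = corpus_complexity(b, complex_counts, simple_counts)
    if ca == cb:
        return None
    return (a, b) if ca > cb else (b, a)


# Toy token lists invented for illustration.
complex_counts = Counter("na na na sodium channel".split())
simple_counts = Counter("sodium sodium sodium na salt".split())

na_ratio = corpus_complexity("na", complex_counts, simple_counts)
sodium_ratio = corpus_complexity("sodium", complex_counts, simple_counts)
rule = order_pair("sodium", "na", complex_counts, simple_counts)
```

With these counts, na is relatively more frequent in the complex corpus than sodium, so the pair is ordered with na as the complex word and sodium as its simplification.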

To ensure that the suggestions made by DeScipher are grammatically correct, we produce additional pairs for morphological variants of the original pair by generating other possible conjugations and tenses of verbs and other possible forms of nouns (e.g., plurals).


After creating DeScipher, we evaluated the simplification rules the system suggested for a new set of scientific texts (the abstracts of articles in the journal Science) by comparing these results to those found by the original Biran et al. approach, which learned rules from Wikipedia. We presented each rule from DeScipher as well as each rule from Biran et al.’s pipeline to 12 crowdworkers. We asked them to confirm which of the two words was more familiar, and how well the more familiar term helped them understand the meaning of the complex term. We found that while workers were slightly less likely to agree that the second word was simpler than the first in pairs produced on PLOS versus Wikipedia (55.5% vs 62.8%), they were more likely to find our simplifications helpful in cases where the complex word was correctly identified (f(1,2817)=0.76, p < 0.001). In our continuing work, we are investigating whether increasing the size of our corpus can result in better predictions of which word is more complex. We are also now evaluating how DeScipher fits into existing science journalists’ authoring processes.


[1] O. Biran, S. Brody, and N. Elhadad. Putting it simply: A context-aware approach to lexical simplification. In ACL ’11, 2011.
[2] S. Bird, E. Klein, and E. Loper. Natural language processing with Python. O’Reilly Media, Inc., 2009.
[3] Y.-S. Kim, J. Hullman, and E. Adar. DeScipher: A text simplification tool for science journalism. In Computation+Journalism, 2015.
[4] G. A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, Nov. 1995.
[5] C. Zhang. DeepDive: A data management system for automatic knowledge base construction. Ph.D. dissertation, University of Wisconsin-Madison, 2015.

Photo: Wolfram Burner