A Fundamental Way Data Repositories Must Change


In 1966, Romania engaged in one of the world's most brazen attempts at population growth. From one day to the next, abortion, the main method of birth control in the country, was banned. The birth rate increased in line with the official plan, but the increase was accompanied by a rise in illegal abortions and a spike in maternal and infant mortality. Unhappy with the outcome of his experiment, Romanian dictator Nicolae Ceausescu decided to hide the traces of his failure: from 1980 onwards, hospitals were prevented from registering infant deaths. Births were only registered after one to four weeks of life, ensuring that stillbirths did not make it into the official tally.

The statistical history of this tragedy is nowhere to be seen in most data repositories. Data from the World Health Organization shows no trace of the 1967 spike in infant mortality, as data is smoothed over several years, and does not mention the absence of reliable data in the 1980s.

As data becomes more open and fashionable, it finds its way into newsrooms and government cabinets. A fact-based approach is certainly better than gut feelings, provided the facts are reliable. Corrupted data can lead to tragic consequences, from bad investment decisions to improper risk management. Greece's tampering with its government accounts before joining the Euro is one of the major causes of the ongoing economic crisis in Europe.

Data in dictatorships

Rwanda is considered by many in Western capitals to be a poster child for post-conflict reconstruction. Official statistics certainly support this picture. Infant mortality, for instance, appears to have improved from the world's third worst in 1994 to the world average in 2011. No other country can boast such a stellar improvement.

American health professionals familiar with the situation in Rwanda have claimed that, despite real progress since the 1994 genocide that left the country's infrastructure in a dismal state, official statistics do not accurately represent the situation. "What we see and what is reported is not the same picture," one of them says. A hospital in Rwanda manages ten smaller health centers, where midwives are poorly trained and records are inadequate. If a woman gives birth to a stillborn baby on the drive from a health center to a hospital, for instance, the death is not registered at all.

Freedom of speech for data scientists

The American sources mentioned earlier agreed to talk on strict condition of anonymity, fearing that going on the record would jeopardize their work. Data collection and analysis are no different from any other sort of expression. Governments that restrict freedom of speech foster unreliable statistics. Even if the technical infrastructure for sound data collection is in place, fear of retribution might cause statisticians to adjust the data in ways that please the administration.

Christopher Balding, who exposed several of the inconsistencies that riddle the accounting of Temasek, a sovereign wealth fund managed by the government of Singapore, publicly declared that he would fear for his safety, were he to set foot in the city-state. Among other things, he revealed that compensation of Temasek employees was 60 times the amount at Goldman Sachs, an investment bank not known for its frugality.

The results of the 1937 census in the Soviet Union displeased Stalin so much that he had the whole team of statisticians crunching the data sent to the gulag. Even if the life expectancy of statisticians in freedom-deprived places has improved, self-censorship and corruption still mar the reliability of the data.

Precision is as important as the data itself

Bogus data need not be worthless. Chinese data is another example of grossly manipulated statistics. GDP growth figures for 2012, for instance, show that all regions but two beat the national growth average of 7.8%. (The weighted average of the regional figures implies a national growth rate of 10.3%.) Academics argue that the discrepancy between official data and reality could reach 10% of Chinese GDP.
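The arithmetic behind such a discrepancy can be sketched in a few lines of Python. This is only an illustration: the three regions and their figures are invented placeholders, not real Chinese data, and the function names are hypothetical. Only the 7.8% official figure comes from the text above. The sketch also assumes that regional growth rates are weighted by each region's previous-year GDP.

```python
# Hypothetical regions: (previous-year GDP, reported growth in %).
# These numbers are invented for illustration only.
REGIONS = {
    "Region A": (100.0, 10.0),
    "Region B": (300.0, 12.0),
    "Region C": (50.0, 7.0),
}

OFFICIAL_NATIONAL_GROWTH = 7.8  # official 2012 figure quoted in the article


def implied_national_growth(regions):
    """GDP-weighted average of regional growth rates, in percent."""
    total_gdp = sum(gdp for gdp, _ in regions.values())
    weighted = sum(gdp * growth for gdp, growth in regions.values())
    return weighted / total_gdp


def regions_above(regions, national):
    """Names of regions whose reported growth exceeds the national figure."""
    return [name for name, (_, growth) in regions.items() if growth > national]


implied = implied_national_growth(REGIONS)
print(f"Implied national growth: {implied:.1f}%")
print(f"Official national growth: {OFFICIAL_NATIONAL_GROWTH:.1f}%")
print("Regions above the official figure:",
      regions_above(REGIONS, OFFICIAL_NATIONAL_GROWTH))
```

A weighted average always lies between the lowest and the highest regional figure. When nearly every region reports growth above the national number, the regional and national figures are unlikely to both be right, which is the kind of inconsistency a repository could flag automatically.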

Tom Rafferty, an analyst with the Economist Intelligence Unit (EIU), a data analysis company, explains that even though there is not much that can be done (recomputing the GDP figures from scratch would be an impossibly long endeavor), he can offer some perspective to EIU's clients using other indicators less likely to be manipulated. Indexes of manufacturing activity or electricity consumption are useful, but no single one can give an accurate view of the whole Chinese economy. Mr Rafferty says part of his job is to highlight the strengths and weaknesses of each indicator to clients.

Companies such as EIU have the experience needed to understand what insight can be gained from a statistical measurement. As data becomes more popular and more accessible, laypeople start to use source data without the skills and experience required to judge the value of a particular data set. As the most popular gateways to data, repositories such as the World Bank's have a duty to communicate the precision and trustworthiness of the measurements much more clearly.

A call to the World Bank

A simple solution would be to indicate the number of significant figures in a measurement. Significant figures are a very powerful way to convey precision. It is easy to understand that a data point of 12.4, with 3 significant figures, is precise to the nearest 0.1. It could be 12.3 or 12.5. Similarly, the more precise 12.40 can be anywhere between 12.39 and 12.41. It is impossible to know that the population of France was 65,696,689 in 2012, for instance, if only because births and deaths occur too often for such precision to make any sense. It would be much clearer if the World Bank displayed a figure of 65,697,000, with 5 significant figures instead of 8.
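Rounding to a given number of significant figures takes very little code. The following Python sketch is one possible way to do it, not a description of how any repository actually works; the function names `round_sig` and `last_digit_step` are hypothetical.

```python
import math


def round_sig(value, sig):
    """Round value to `sig` significant figures."""
    if value == 0:
        return 0
    # Position of the leading digit: 7 for 65,696,689 and 1 for 12.4.
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, sig - 1 - magnitude)


def last_digit_step(value, sig):
    """Size of one unit in the last significant digit."""
    if value == 0:
        return 0
    magnitude = math.floor(math.log10(abs(value)))
    return 10 ** (magnitude - sig + 1)


population = 65696689  # France, 2012, as quoted in the article
print(round_sig(population, 5))        # 65697000
print(last_digit_step(population, 5))  # 1000
print(round_sig(12.4, 3))              # 12.4
```

Publishing the step alongside the rounded value (here, the nearest thousand) tells readers how far the figure can be trusted without asking them to read a methodology note.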

The World Bank should also communicate the reservations scientists have about some data sets. Russian scientists have reported that some of the official statistics are so bogus that they cannot find any use for them. This, as well as the problems regarding Chinese, Rwandan, Singaporean and Romanian data, among others, should be displayed prominently on the World Bank Data portal. A warning icon next to uncertain data points would go a long way.

Asked for comment, a World Bank spokesperson simply stated that the World Bank Data portal displays the sources from which the data originates. This is not enough. Journalists and analysts trust the Bank to provide accurate data and do not have the time to thoroughly investigate each data set. The whole point of statistics is to synthesize a diffuse reality. If data repositories lose sight of this basic mission and become passive aggregators of unusable data, all companies and political bodies that use data as an input will suffer. Newsrooms and NGOs could do more, too, to put pressure on statistics bureaus. Offering competing measures of inflation, the indicator most likely to be tampered with by governments (because pensions are indexed to it), would force official outlets to improve their game and be more transparent.



Read about the politics of Romanian birth control in The Politics of Duplicity by Gail Kligman. A physician who was active in Romania in the 1980s confirmed that births were recorded late but explained that no written order was ever issued.

Greece's creative accounting led to several scandals. Read this FT piece about the 2001 data fiddling, or this Reuters article about similar wrongdoings from 2009.

Christopher Balding's blog offers lots of insights on data quality in China and Singapore. He is the author of How Badly Flawed is Chinese Economic Data?, in which he claims that China's GDP could be 1 trillion lower than official claims because of bogus inflation calculations. This post by the Online Citizen, a Singapore-based outlet, on the island-state's special brand of accounting is also interesting.

The quality of Russian statistics is discussed in this article (in Russian) and referenced in this piece (in English). This piece from Journalism.co.uk explains how La Nacion, an Argentinian newspaper, measures inflation. Long-term polling solutions such as Elva or Feowl could also be repurposed to tackle inflation figures. (Disclaimer: I am the head of Journalism++, which developed Feowl.)