Prescription drugs that kill: The challenge of identifying deaths in government data


By Mike Stucka, The Palm Beach Post

An editor at The Palm Beach Post printed out hundreds of pages of reports and asked a simple question that turned out to be weirdly complex: How many people were being killed by a prescription drug?

That question relied on a version of a report that was soon discontinued by the U.S. Food and Drug Administration. Instead, the agency built a new website that doesn't allow exports or let users see substantial chunks of the data. So I went to the raw data files, which were horribly formatted -- and, before the project was over, the FDA had reissued some of those data files and taken most of them offline.

But I didn’t give up hope. Behind the data -- known as FAERS, the FDA Adverse Event Reporting System -- is more than a decade of reports on suspected drug complications of nearly every kind. With multiple drugs in many reports, and multiple versions of many reports, the drug list alone runs to some 35 million records. And it's a potential gold mine.

How much of a gold mine? For one relatively rare drug, meant only for the worst kind of cancer pain, we found records tying the drug to more than 900 deaths. A salesman had hired a former exotic dancer and a former Playboy model to help sell the drug known as Subsys. He then pushed salesmen to up the dosage, John Pacenti and Holly Baltz found in their package, "Pay To Prescribe? The Fentanyl Scandal."

FAERS has some serious limitations, but some serious benefits. The data can tell you why a drug was prescribed; it can tell you if a person was hospitalized because of a drug reaction, or killed, or permanently disabled. It can tell you what country the report came from. It's got the patient age. It's got the date of reporting. It's got other drugs involved. Dosage. There's a ton of useful information.

Now the bad stuff: There may be multiple reports for each actual case, as well as multiple versions of a single "case" ID.
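One common way to handle duplicate versions -- an assumption here, not necessarily how the Post handled it -- is to keep only the highest caseversion seen for each caseid:

```python
def latest_versions(rows):
    # Keep only the highest caseversion seen for each caseid; rows are
    # dicts with "caseid" and "caseversion" keys (a hypothetical sketch).
    latest = {}
    for row in rows:
        key = row["caseid"]
        if key not in latest or int(row["caseversion"]) > int(latest[key]["caseversion"]):
            latest[key] = row
    return list(latest.values())
```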

Reporting cases is both voluntary and potentially time-consuming, so underreporting is probably quite common as well; one 2017 study suggested that for two classes of drugs, perhaps only a fifth to a third of serious events were reported. In a 1996 newsletter, the FDA itself cited a 1987 study finding the agency gets direct reports of less than 1 percent of serious adverse drug reactions. Some of those authors went on to contract with the FDA and launch an education program training doctors to report cases; they got a 17-fold increase in direct reports to the FDA. About half the hospitals in a 1989 survey reported no adverse events at all.

The FDA warns people to be cautious about trying to calculate rates of reactions. If minor reactions are reported one time in 100 but every death makes it into the database, deaths will appear 100 times more prevalent relative to the minor reactions.

On top of that, the data itself is terribly maintained by the FDA. As mentioned, most of the raw data files were stripped off the FDA's site before we were done. The files are erratically named; the contents of the files are even more erratic. (Contact me, or Investigative Reporters and Editors' database team, if you have problems getting what you need.)

So, the first problem was getting the files. I wrote a scraper to get all the ASCII versions of the datasets. 
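The core of that harvesting step can be sketched roughly like this, assuming the quarterly archives are linked as .zip files from an index page (the real FDA page layout, and the names here, are assumptions):

```python
import re

def find_zip_links(html):
    # Pull .zip hrefs out of an index page's HTML; each link can then
    # be fetched and unpacked. A hypothetical sketch of the scraper's core.
    return re.findall(r'href="([^"]+\.zip)"', html, flags=re.IGNORECASE)
```

Each returned link can then be downloaded and unzipped into a raw directory for cleanup.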

asci     08/10/2017 12:05 PM
ascii    08/10/2017 12:04 PM
asii     08/10/2017 12:04 PM


The FDA literally found three different ways to spell ASCII. This was a sign of future surprises.

I had to find the data files I wanted -- all with a .txt ending, but in three directories -- and put 'em in a good place. I had to clean up the filenames themselves. That wound up being some easy Python code:


import os

subdirectories = next(os.walk("./raw"))[1]
for subdirectory in subdirectories:
    dirfiles = os.listdir("./raw/" + subdirectory)
    for filename in dirfiles:
        proposed = "./clean/" + str(filename).lower()
        if proposed[-4:] == ".txt" and filename.lower()[:4] != "stat":
            print("./raw/" + subdirectory + "/" + filename + " ... to ... " + proposed)
            if os.path.exists(proposed):
                print("\tFile already found in clean: " + proposed)
            else:
                os.rename("./raw/" + subdirectory + "/" + filename, proposed)


The next step was reading the first line of each of the purported data files and seeing if my headers were consistent. Now, some of these data files came from an older system simply known as AERS, instead of the newer FAERS. But other files had some really random headers. Sometimes it was a typo of a data field I was supposed to have; " rept_dt" is not the same thing as "rept_dt". Once I identified bad header names, I had to figure out how to handle them. I also had to plan a conversion between the FAERS and AERS data. That was annoying but fairly straightforward, thanks to the FDA's documentation. For example, I'd leave the AERS version identifier intact, but copy it over for FAERS, while also patching up problematic field names:


# Inside the loop over each row, patch AERS-era and typo'd field names:
if "foll_seq" in row:      # AERS files used foll_seq for the case version
    row["caseversion"] = row["foll_seq"]
if "FOLL_SEQ" in row:      # ... sometimes in uppercase
    row["caseversion"] = row["FOLL_SEQ"]
if " rept_dt" in row:      # header typo'd with a leading space
    row["rept_dt"] = row[" rept_dt"]
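
The first-line audit itself can be sketched as a small helper -- a hypothetical sketch, where the "$"-delimited headers and latin-1 encoding are assumptions about the FAERS ASCII files:

```python
import collections

def audit_headers(paths, encoding="latin-1"):
    # Tally each distinct header (first) line across the data files,
    # so typos like " rept_dt" stand out next to the common headers.
    headers = collections.Counter()
    for path in paths:
        with open(path, encoding=encoding) as f:
            headers[f.readline().rstrip("\r\n")] += 1
    return headers
```

Any header that appears only a handful of times is a candidate for hand inspection.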


The end result was seven gigabytes of CSVs with consistent field names. With a little bit more work, my Python script generated SQL import commands based on the (cleaned) column headers it already had, and made a few exceptions for fields I knew I'd need later. For example, "text" worked for most fields, but MySQL doesn't allow "text"-type fields to be indexed. So primaryid and caseid both became varchar(100), nda_num somehow became varchar(50), and drugname needed varchar(600). It took some painfully slow iterations for this to work out.
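That schema generation can be sketched like this -- a minimal sketch where the column list is illustrative, and the type overrides mirror the exceptions described above:

```python
# Columns that can't be plain "text", e.g. because MySQL won't index
# text-type fields or because values run long. Sizes from the project above.
overrides = {
    "primaryid": "varchar(100)",
    "caseid": "varchar(100)",
    "nda_num": "varchar(50)",
    "drugname": "varchar(600)",
}

def create_table_sql(table, headers):
    # Build a MySQL CREATE TABLE statement from cleaned column headers,
    # defaulting every field to "text" unless it has an override.
    cols = ", ".join("`%s` %s" % (h, overrides.get(h, "text")) for h in headers)
    return "create table %s (%s);" % (table, cols)
```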

That got my data into MySQL. The rest was some pretty basic SQL work, with just a few exceptions.

For example, to get the first cut of Subsys cases:

drop table if exists sdrug;
create table sdrug like drug;
insert into sdrug
    select * from drug
    where nda_num = "202788" or drugname like "%subsys%";

The FDA's online reporting site looks up drugs literally by drug name alone. There's a drug identifier known as the NDA number that the site doesn't use at all; in this project, relying on drug name alone would have undercounted the deaths by about 10 percent. I looked at cases where the NDA number matched Subsys' to see what drug name was reported. If a case carried Subsys' NDA number but a generic drug name -- "fentanyl" -- I counted it as Subsys. An expert validated this approach. In going through the data, I also found a case where the narrative described the reaction as "death" but death wasn't listed as a reaction. We did not include that case in our tallies.

Once I had the cases identified through the drug list, I built a master table joining in details from other tables: demographic information, outcomes and why the drug was prescribed. A lot of quality-control verification went into making sure those joins held up. In the end, we counted only cases we knew resulted in death, where Subsys was the primary suspect drug in the reaction, and where the reports had come through Subsys' manufacturer.

We ran the Subsys story with the obligatory nerd box.

Besides the more than 900 deaths, we found we often couldn’t tell why Subsys had been prescribed. While the ages of some of the people who died suggested they were unlikely to have had cancer -- and thus may have died after off-label prescriptions -- the FAERS data was entirely too limited to let us draw definitive conclusions. Still, the data set earned us a few very important paragraphs in the story. There are surely other great stories that can be done with FAERS.

About the author

Mike Stucka has been the self-described data dork at The Palm Beach Post since early 2016. Before that, he used data on the beat at places like The Salem (Mass.) News and The Telegraph of Macon, Ga. He started with data at his college paper, when he still had hair. He is a graduate of Northeastern University, Loyola University Chicago, and a great IRE bootcamp a decade ago. Follow on Twitter: @mikestucka.

If you need help with your project, please reach out.

Explore the project here.

Image: Matt Allworth.