1/9/2017

World Series Mortality Tracker: How many fans actually get to see their team win more than once?

 

Last year the Chicago Cubs won their first World Series in more than 100 years, ending the longest title drought of any Major League team, and giving generations of Cubs fans the ability to say they saw at least one championship in their lifetime.

At WBEZ, the NPR member station in Chicago, we wanted to cover the Cubs’ run to the World Series (and their playoff run the year before) in a different way. That idea of “just once in my lifetime” stuck with me. Just how many people could say that for their favorite team? How many people who are still alive saw any given team win its last World Series?

Luckily for me, the United States Census Bureau keeps estimates of population by age back to 1900, three years before the first World Series. And so came the World Series Mortality Tracker.

The project estimates, for every Major League Baseball team, how many living Americans were alive for its most recent Series win. It’s our way to try to put the collective pain of each team’s fanbase into perspective.

While the idea was simple — just adding up Census figures over a period of time — as always, the details were more complicated than I initially thought.

Cleaning 110+ years of Census data

The project took two sets of data: yearly population estimates from the U.S. Census Bureau since 1900 and a list of World Series winners since 1903.

The Census helpfully provides all of its yearly population estimates back to 1900 on one FTP site (unfortunately, this location keeps changing, so check back often).

After downloading all 115 available years, I realized that the Census stores the data in different formats for different decades. From 1900 to 1979, each year has a single CSV file with a standard format. But the 1980s come in fixed-width text files, the 1990s in a single header-less CSV, and in the 2000s estimates for those 85 and older are kept in separate files. Also, each period tracked different data as the Census – and the country – evolved. The oldest single age tracked is 75 until 1940, 85 until 1980 and 100 since then.

Even with the changing formats, there were only five types to deal with (1900 to 1979, the 1980s, the 1990s, the 2000s, and 2010 onward). That meant that though there were 115-plus years to deal with and up to 100 ages for each, scripting the process would take only five functions.
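As a rough sketch of how years might be routed to those five cleaning functions (this is not the original script; the era labels and the 1900-2015 range are assumptions made for illustration):

```python
from collections import defaultdict


def era_for_year(year):
    """Return a label for the file format used in a given year.

    The labels are hypothetical names for this sketch; the year
    boundaries follow the five groups described above.
    """
    if year < 1900:
        raise ValueError("yearly estimates start in 1900")
    if year <= 1979:
        return "single_csv"      # one standard CSV per year
    if year <= 1989:
        return "fixed_width"     # fixed-width text files
    if year <= 1999:
        return "headerless_csv"  # one CSV with no header row
    if year <= 2009:
        return "split_85_plus"   # 85-and-older kept in separate files
    return "2010_onward"


# Group every year by the parser it would need (range is an assumption).
years_by_era = defaultdict(list)
for year in range(1900, 2016):
    years_by_era[era_for_year(year)].append(year)
```

Each label would then map to one cleaning function, so the per-year loop never has to know which format it is reading.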

I knew I wanted a table with ages as columns and years as rows, with a population count filling each cell. I chose to do this in Python and pandas as it’s what I’m most comfortable with, but the process is something that could easily be done in R or your language of choice.

I started the script by making a list of possible ages (0-100, the oldest age released by the Census) and creating a dataframe with those ages as the columns. From there, I made five functions to clean the five different eras of Census files. For each file type I iterated over all the years in that group, pulled out the age for each population, and added the results for that year to a new row in the data frame.
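A minimal sketch of that table-building step might look like the following. It is not the original script: the column names `age` and `population`, the function name, and the sample figures are all made up for illustration, and the real Census files are laid out differently in each era.

```python
import io

import pandas as pd

AGES = list(range(0, 101))  # 0-100, the oldest age released by the Census


def parse_simple_csv(text, year):
    """Parse one year's file into a Series indexed by age.

    Assumes hypothetical columns named 'age' and 'population'.
    """
    raw = pd.read_csv(io.StringIO(text))
    counts = raw.set_index("age")["population"]
    counts.name = year
    # Ages the file does not report come back as NaN.
    return counts.reindex(AGES)


# Invented sample data standing in for two yearly files.
sample_1950 = "age,population\n0,3000\n1,2900\n2,2800\n"
sample_1951 = "age,population\n0,3100\n1,2950\n2,2850\n"

rows = [
    parse_simple_csv(sample_1950, 1950),
    parse_simple_csv(sample_1951, 1951),
]

# One column per year after concat, then transpose so years are rows.
population = pd.concat(rows, axis=1).T
population.index.name = "year"
```

Reindexing every year to the same list of ages keeps the columns aligned even when an era stops tracking single ages at 75 or 85.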

That left me with a table containing the population estimates for all years and ages. Even that first step showed some interesting things, such as this very early heatmap.

 

Matching it to the World Series

Baseball has a long history of obsessing over numbers, so it’s no surprise baseball data is pretty easy to come by. My go-to source for historic baseball stats is the Lahman Database, a free collection of player and team stats. The project is run by Sean Lahman, who is also a data reporter with the Rochester Democrat and Chronicle.

Now that I had a table of population by age and by year, with tables of World Series winners and teams, I needed to find a way to combine them. I created a new blank dataframe using team IDs as the index and years (1903-2015) as the columns.

I iterated over each year (the columns), figured out for each team how long it had been since it had won a World Series, and then summed the number of people at least that old. So when the script was on the year 2008 and found the Los Angeles Dodgers, who last won in 1988, it would add up everyone at least 20 years old.
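The Dodgers example can be sketched for a single cell like this. The function name is hypothetical and the population table is a toy with one person at every age, so the arithmetic is easy to check:

```python
import pandas as pd


def alive_for_last_win(population, year, last_win_year):
    """Sum everyone in `year` who is at least as old as the title drought."""
    years_since = year - last_win_year
    old_enough = population.columns[population.columns >= years_since]
    return population.loc[year, old_enough].sum()


# Toy table: one row (2008), ages 0-100 as columns, one person per age.
population = pd.DataFrame(1, index=[2008], columns=range(101))

# Dodgers in 2008, last win in 1988: ages 20 through 100 are counted.
dodgers_2008 = alive_for_last_win(population, 2008, 1988)
```

With one person per age, ages 20 through 100 give 81 people, which is what the function returns here.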

Using pandas' apply function, I was able to quickly run that function on every column in my dataframe, calculating the population figure for each team and adding it to the correct cell. Now it was possible to see not only how many Americans alive today were also alive for each team’s last World Series victory, but also how that number had changed over the past century.
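Here is one way the column-by-column step could be sketched with `DataFrame.apply`. The team IDs, win years, and population figures are invented, and treating a win as counting in its own year is an assumption of this sketch rather than something the original script is known to do:

```python
import numpy as np
import pandas as pd

# Toy population table: years as rows, ages 0-2 as columns.
population = pd.DataFrame(
    [[100, 90, 80], [110, 95, 85], [120, 100, 90]],
    index=[2000, 2001, 2002],
    columns=[0, 1, 2],
)

# Hypothetical team IDs mapped to the years they won.
wins = {"AAA": [2000], "BBB": [2001], "CCC": []}

# Blank frame: team IDs as the index, years as the columns.
tracker = pd.DataFrame(index=list(wins), columns=[2000, 2001, 2002], dtype=float)


def fill_year(column):
    """Compute every team's figure for one year (the column's name)."""
    year = column.name
    values = {}
    for team in column.index:
        past_wins = [w for w in wins[team] if w <= year]
        if not past_wins:
            values[team] = np.nan  # no title yet
            continue
        years_since = year - max(past_wins)
        old_enough = population.columns >= years_since
        values[team] = population.loc[year, old_enough].sum()
    return pd.Series(values)


# apply passes each column in turn; the results form the filled table.
tracker = tracker.apply(fill_year)
```

Each row of the result is then one team's line over time, which is the shape a line chart needs.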

Visualizing

As an NPR member station, we use a lot of tools from the NPR Visuals team. One of the projects we get the most use out of is the Dailygraphics Rig. The rig has a set of templates and a lightweight server to set up and test (mostly) d3-based charts. It also has options to push the finished graphics to S3, and has the team’s pym.js built in for responsive iframes of the finished graphics.

I knew I wanted to display the yearly population figures for each team together as a line graph. While this was a little more than the rig was meant to deal with, it was pretty straightforward to modify the code to handle a larger amount of data and add a dropdown to select an individual team. And using the rig meant the whole thing was still responsive when I embedded it into the final article.

Working on the visualization also triggered a change in how I processed the data. Originally, expansion teams (those added to the league over time) showed zeros for every year before they had joined the league. My editor caught that, and I started recording years a team wasn’t active as ‘NaN’, since d3 lets you draw a line with gaps for missing data using the line generator’s defined method.
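On the pandas side, that change could be sketched roughly as follows. The team IDs, first seasons, and figures are invented for illustration, and this is one possible approach rather than the original code:

```python
import pandas as pd

# Toy tracker: a long-standing team and an expansion team showing zeros.
tracker = pd.DataFrame(
    [[500.0, 480.0, 460.0], [0.0, 0.0, 300.0]],
    index=["OLD", "NEW"],
    columns=[1990, 1991, 1992],
)

# Hypothetical first season for each team.
first_season = pd.Series({"OLD": 1903, "NEW": 1992})

# True wherever the column's year is before the team's first season.
years = tracker.columns.to_numpy()[None, :]
starts = first_season.reindex(tracker.index).to_numpy()[:, None]
inactive = years < starts

# mask replaces the True cells with NaN and leaves the rest alone.
cleaned = tracker.mask(inactive)

# pandas writes NaN as null in JSON, which the chart can test for.
new_team_json = cleaned.loc["NEW"].to_json()
```

Keeping inactive years as missing values rather than zeros means the line for an expansion team starts where the team does, instead of running along the bottom of the chart.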

Overall, while this wasn’t the most serious of topics, we felt like it let us be part of the news of the day in a fun way. It also showed that sometimes you can find a new perspective on an old story by combining two seemingly unrelated sets of data in a different way.

Explore the project here.

About the author:

Chris Hagan is an Interactive Producer with Capital Public Radio in Sacramento. He’s previously worked at WBEZ in Chicago and the Statesman Journal in Salem, Oregon. He has a BS in Journalism from the University of Oregon, and only got interested in data and programming after messing around with baseball statistics in his spare time.

Image: Brandon Schatsiek.
