That’s no moon: Data stories don’t always have to be downers


Sometimes it may seem that data journalism revolves around bleak numbers. Hard facts can make a story a little too serious and heavy. While these important stories need to be told, there is also a lighter side to data journalism. Fluff pieces with numbers, if you will.

This is one of those.

*Star Wars fan trigger warning*

For a data visualization class at the Newhouse School of Public Communications at Syracuse University (yes, they teach data visualization in J-schools now), we had to use Highcharts to visualize any open dataset. While the directives of the assignment were vague, I knew I wanted to use data that would interest me.

I was watching the third season of Star Wars: Rebels at the time, and the word ‘Karabast’ (a Lasat exclamation of frustration) had found its way into my everyday vocabulary, so I began to wonder how many times the characters used it in the show. A quick Google search didn’t turn up anything, but it did open a door to the vibrant world of ‘Star Wars Data Analysis’.

I tried looking for the scripts of the show, but all I found were closed captions. These could have worked for my original quest to find the number of times ‘Karabast’ was said in Rebels, but by then I was onto something bigger. Knowing that the Star Wars film scripts would be available, I decided to do a histogram of some of the more popular phrases from the franchise.

I had planned to write Python code that would strip all the text from the scripts, convert it to lowercase, and then run a regex match on the phrases. But during preliminary analysis (skimming through the scripts), I realized that I could do a simple find on the PDF of each script and manually count the occurrences of each phrase. Choosing between a knuckle-dragging solution and writing code to automate it was an interesting dilemma. In the end, I counted the phrases manually because I was new to coding at the time and thought it would be the faster way to finish the assignment. Sometimes you must work with familiar tools to get the job done on time. Deadlines can really restrict you that way.
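
For readers curious about the automated route, here is a minimal sketch of what the lowercase-then-regex idea might look like in Python. It is not the code from this project (the counting was done by hand). The phrase list, the `count_phrases` function name, and the sample text are all made up for illustration, and the sketch assumes the script text has already been extracted from the PDF into a plain string.

```python
import re
from collections import Counter

# Hypothetical phrase list, for illustration only.
PHRASES = ["i have a bad feeling about this", "may the force be with you"]


def count_phrases(script_text, phrases=PHRASES):
    """Count case-insensitive occurrences of each phrase in a script's text."""
    # Lowercase the text and collapse runs of whitespace so that a phrase
    # broken across two lines of the script still matches.
    normalized = re.sub(r"\s+", " ", script_text.lower())
    counts = Counter()
    for phrase in phrases:
        # re.escape keeps any punctuation in the phrase literal;
        # \b avoids matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, normalized))
    return counts


# Made-up sample text standing in for an extracted script.
sample = """SPEAKER A: I have a bad
feeling about this.
SPEAKER B: May the Force be with you."""

print(count_phrases(sample))
```

Collapsing whitespace before matching is the one design choice worth noting: scripts wrap dialogue across lines, so a naive search on the raw text could miss phrases that a manual find would also miss.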

The other dimensions I collected were the name of the character who said each phrase and the movie in which it was said.

Visualizing the data

Since the assignment’s directive was to use Highcharts, a library of interactive JavaScript charts, I was limited in the tools I could use to visualize this data. Highcharts has a vast library of chart types to choose from. You can download the code of an example and then play around with it to make it yours. They even let you edit the example in JSFiddle, which is a great way to explore the code before you decide to use it.

Scrolling through the options, I chose a few that I thought would work. Beginning in JSFiddle, I plugged in some of my data to see how it looked. After much trial and error, I settled on a drill-down treemap. It was the perfect representation. The first level shows the phrases; clicking a phrase expands it into the films/episodes. Hovering over an episode shows the number of times that phrase was said in the film. Clicking on an episode expands it into the characters and, as with the previous levels, hovering over a name tells you how many times that character said that phrase in that episode.
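
To give a sense of the data shape behind a three-level treemap like this, here is a rough Python sketch that flattens (phrase, film, character, count) rows into the `id`/`parent`/`value` points that a Highcharts treemap series accepts. It is an illustration under assumptions, not the project's actual code: the records are placeholders rather than real counts, `build_treemap_points` is a hypothetical helper, and the `allowTraversingTree` option name should be checked against the Highcharts version in use (older releases used `allowDrillToNode`).

```python
import json

# Placeholder records for illustration: (phrase, film, character, count).
records = [
    ("Phrase A", "Film 1", "Character X", 2),
    ("Phrase A", "Film 1", "Character Y", 1),
    ("Phrase A", "Film 2", "Character X", 1),
    ("Phrase B", "Film 1", "Character Z", 3),
]


def build_treemap_points(rows):
    """Flatten rows into treemap points linked by id/parent."""
    points = []
    seen = set()  # ids of phrase and film nodes already emitted
    for phrase, film, character, count in rows:
        phrase_id = phrase
        film_id = f"{phrase}/{film}"
        # Level 1: one node per phrase.
        if phrase_id not in seen:
            seen.add(phrase_id)
            points.append({"id": phrase_id, "name": phrase})
        # Level 2: one node per film under each phrase.
        if film_id not in seen:
            seen.add(film_id)
            points.append({"id": film_id, "name": film, "parent": phrase_id})
        # Level 3: leaf node per character, carrying the count.
        points.append({
            "id": f"{film_id}/{character}",
            "name": character,
            "parent": film_id,
            "value": count,
        })
    return points


# Assumed series options; verify the option names for your Highcharts version.
series = {
    "type": "treemap",
    "allowTraversingTree": True,
    "data": build_treemap_points(records),
}
print(json.dumps(series, indent=2))
```

Only the leaf points carry a `value` here; as far as I understand, the treemap sizes each phrase and film node from the total of its children, which is what makes the per-level counts appear on hover. The printed JSON could then be pasted into the `series` array of a chart configuration in JSFiddle.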

Taking it further

Since this was a class assignment, its scope was limited, but the project leaves room for many additions. I am still looking for the cartoon scripts and have even begun searching for the video game scripts. I think it may be easier to get the text of all the books written in the canonical Star Wars universe. If you wish to contribute to this ongoing project or have some Star Wars data of your own, please email me. I would love to talk geek with you.

About the author

Mahima Singh is a data journalist at the Palm Beach Post in South Florida. Before that, she was with a news analysis and media criticism website in India. She has an MS in Computational Journalism from Syracuse University, where she learned to programme in Python and hasn't stopped since.

Explore the project here.

Image: Lucas Lima 91.