Data in multiple languages means data in multiple cultures


My colleague Rahul Bhargava and I care a lot about data literacy, particularly for the 99% of us who are not data scientists and don't have advanced degrees in statistics. At Emerson College, I teach journalism students how to work with datasets to tell stories in the public interest - a skill that is desperately needed as more and more public decisions are made using data and algorithmic processing. Rahul and I think it can be both easy and fun to get started working with data, which is why we built the tool Databasic.io.


Image: Databasic.io is a suite of simple tools that help learners get started with data.

Databasic consists of three tools that help new learners to get started with data. Do you have a document dump like the Panama Papers? Then quantitative text analysis might be a good option and you can get started with Databasic's tools WordCounter or SameDiff. Did your government source just send you a crazy complicated spreadsheet? You might want to run it through WTFcsv, which gives you a basic analysis of each column in the sheet so that you can start to ask better questions of the data. 
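
If you are curious what "a basic analysis of each column" can look like in practice, here is a minimal Python sketch. It is not WTFcsv's actual code: the function names, the summary fields, and the tiny sample spreadsheet are all made up for illustration, and a real tool would need to handle much messier files.

```python
import csv
import io
from collections import Counter


def is_number(text):
    """Return True if the cell text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize_columns(csv_text):
    """Return a small summary for every column in a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames or []:
        # Treat missing or blank cells as empty strings.
        values = [(row[name] or "").strip() for row in rows]
        filled = [v for v in values if v]
        counts = Counter(filled)
        summary[name] = {
            "rows": len(values),
            "missing": len(values) - len(filled),
            "unique": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
            "looks_numeric": bool(filled) and all(is_number(v) for v in filled),
        }
    return summary


# Invented sample rows, only here to show the shape of the result.
SAMPLE = "city,rain_mm\nSan José,120\nLimón,300\nSan José,\n"

for column, stats in summarize_columns(SAMPLE).items():
    print(column, stats)
```

Even a summary this small starts to suggest questions: why is a rainfall value missing, and why does one city show up twice?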


Image: Databasic's latest release makes the button to choose your language very clear.

We released the first version of Databasic in January of this year. We always wanted to support working with data in multiple languages, so we included a Spanish version from the beginning - but we've learned a couple of things since then. First of all, we had an Argentine journalist rave about the tools and email us to ask when we'd release a Spanish-language version so she could teach it in Latin America. This made us realize that the button to change language was completely hidden in the top right corner of the app. Second, users wrote to us saying that the sample data in the Spanish version was still in English. It is much harder to learn from a data set if you can't read its columns, rows and values.

We listened and we iterated, and now we are proud to release a multilingual and multicultural version of Databasic. This includes all three tools in Spanish and Portuguese along with culturally appropriate sample data sets in those languages. The Spanish sample data includes music lyrics by Paulina Rubio and Maná and political speeches by Fidel Castro, along with spreadsheets about weather patterns in Costa Rica. For the Portuguese sample data, we also included popular musicians along with spreadsheets about Brazilian soccer. We found the sample data by putting out calls to email lists like DDJ and NICAR, and some great folks like Miguel Paz, Vivian Guilherme and Daniel Paz de Araújo stepped forward to source data sets in other languages.


Image: Sample data in Portuguese includes football/soccer stats, baby names, internet domain names and tourist spots in Rio.

Our main takeaway from this process is that supporting multiple languages is really about supporting multiple cultures. And multiple cultures might mean not only national cultures but also professional ones. One of the things we continue to learn in workshops is that learners need examples of data that relates to them in order to take the next step with data. I taught a Databasic workshop to 50 municipal government officials in the US, for example. While they were delighted to analyze music lyrics by Prince and the Beatles, they raised their hands and asked "How does this relate to us?" Once I provided qualitative text data from government surveys to analyze, the example clicked and the groups started chattering about the different applications they could see for their offices.

We will continue to iterate on Databasic.io based on learner feedback, so please get in touch with us via email or on Twitter (@kanarinka, @rahulbot). And send us your interesting sample data sets in multiple languages!