7/6/2016

A Mirador for the data forest: From exploration to predictive modeling

 

Visualization tools allow us to explore vast and lush “forests of data,” and to search for paths that could lead us to unexpected insights. Each path is a potential story –a feasible predictor that takes the form of a relationship between quantities observed in the data. However, no visualization tool can definitively tell us whether these relationships are meaningful, real paths to new knowledge, or simply the result of chance –accidental paths leading nowhere. The subsequent stage in the data analysis process often involves using our intuition and expertise about the data, sometimes with the help of statistics, to choose the most promising paths to explore further. In collaboration with the Sabeti Lab and Fathom Information Design, we created the visualization tool Mirador –named after the Spanish word for "vantage point"– to not only provide a novel mechanism for searching for these meaningful relationships and patterns, but also to act as a convenient launching platform for the second stage of the analysis, with the ultimate goal of generating actionable knowledge.


Figure 1: Mirador's main interface

We can use Mirador in two distinct ways: first, as a navigation interface for “open-ended” exploration of correlation patterns, and second, as a tool to identify the factors that could be predictive of an outcome that the user is particularly interested in (for example, patient diagnosis or prognosis given observed symptoms). Mirador’s interface was designed as an interactive matrix of all possible correlations between the variables in the data, primarily with the open-ended search use case in mind. This type of matrix is typically referred to as "small multiple views," with the scatterplot matrix being one form of such views. Mirador adds four features to the traditional scatterplot matrix:

  1. the possibility of searching through variables very quickly, by either name or hierarchical navigation, allowing the user to probe hypotheses as soon as a pattern leads to a new idea or intuition;
  2. defining subsets of the data by setting intervals of interest, which results in an immediate update of the view matrix. This allows users to explore conditional dependencies interactively and to inspect whether a pattern of association persists in a specific subsample;
  3. automatic evaluation of the p-value of each pattern, using a mutual information-based test of statistical significance; and
  4. a unified visual representation of pairwise correlations, via a type of graph that integrates eikosograms (which are a particular type of mosaic plot) and boxplots. With these unified plots, associations between variables have a consistent visual representation, irrespective of whether the variables are continuous or discrete, numerical or categorical.
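The significance test mentioned in point 3 can be sketched as a permutation test: mutual information measures dependence between any pair of discrete (or discretized) variables, and shuffling one variable simulates the null hypothesis of independence. The following is only an illustrative sketch, not Mirador's actual implementation; the function names and discretization choices are assumptions.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    xi = np.unique(x, return_inverse=True)[1]
    yi = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)            # contingency table of counts
    pxy = joint / joint.sum()                # joint distribution
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

def mi_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation p-value for the null hypothesis of independence:
    the fraction of shuffled datasets whose MI >= the observed MI."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y)
    hits = sum(mutual_information(rng.permutation(x), y) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)         # add-one smoothing avoids p = 0
```

Continuous variables would first need to be binned before a discrete estimator like this one applies.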


Figure 2: Results from the Mirador Data Competition

After releasing the first version of Mirador in late 2014, we organized a competition in which we invited users to search for interesting correlations in four open datasets: the National Health and Nutrition Examination Survey (NHANES), the Behavioral Risk Factor Surveillance System (BRFSS), the World Bank Development Indicators, and Lahman's Baseball Database. Each user interrogated the data using the same Mirador exploration interface, but with his or her own unique perspectives, interests, and definition of "interesting." The winners found correlations between the number of researchers in a country and its spending on R&D; between salary and birth month among baseball players; and between general health and exercise in the past 30 days.

This approach did not necessarily presuppose the existence of a hypothesis to test in the data (although Mirador can certainly be used for that purpose), but rather was intended as an exploratory expedition through the data that could lead to new hypotheses. Such expeditions do come with a risk: interactive visualization, coupled with the finely-tuned capability of humans to find visual patterns, allows users to "nudge" the data until a discernible pattern can be extracted –even if it is merely the result of chance. Indeed, after a large enough number of attempts, a pattern is guaranteed to emerge by chance alone. Therefore, any correlation "discovered" with Mirador needs to be interpreted just as a signal for a potentially meaningful effect that requires subsequent analysis and validation. One possibility for incorporating multiple hypothesis testing into the interactive exploration itself could be to use some form of dynamic Bonferroni correction tied to the number of patterns the user visualizes with the tool (a feature that we are considering adding in the next version of Mirador).
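A minimal sketch of what such a dynamic Bonferroni correction could look like, assuming every pattern the user views counts as one more hypothesis test, so the per-test threshold alpha/m shrinks as the session proceeds. The class and method names are invented for illustration; this is not a planned Mirador API.

```python
class DynamicBonferroni:
    """Hypothetical running Bonferroni correction: the per-test
    significance threshold tightens with every pattern viewed."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha   # desired family-wise error rate
        self.n_viewed = 0    # number of patterns viewed so far

    def check(self, p_value):
        """Register one more viewed pattern and report whether its
        p-value clears the current corrected threshold alpha / m."""
        self.n_viewed += 1
        return p_value < self.alpha / self.n_viewed
```

Under this scheme a p-value that looks significant early in a session may no longer clear the threshold after many more patterns have been inspected, which is exactly the conservatism that guards against patterns emerging by chance alone.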

It is very important to understand how the interface design of data exploration tools conditions the search process of their users, and how interfaces and statistical computations could be effectively combined to allow users to discover patterns that would be overlooked by visual exploration or algorithmic search alone. As a first step toward studying these search processes, we visualized the submissions in the data competition as paths through the variables the users visited while working with Mirador. We observed that search patterns differed across both users and datasets, which is expected, but also that the correlations selected by the users were not necessarily the statistically strongest ones; they were perhaps the most relevant to the user by some other measure. This suggests that purely algorithmic search can be complemented with human intuition and expertise, which are harder to quantify explicitly.

Video showing the meta-visualization of the search processes with Mirador

As mentioned earlier, Mirador can be used to understand the relationships between a specific variable of interest and the rest of the variables in a dataset. This is how we are currently using Mirador in the Sabeti Lab to explore clinical records from Ebola and Lassa fever patients. Our aim is to construct predictive models that calculate the chances of patient survival given the most relevant clinical symptoms. Doctors and nurses could use these models to triage patients based on their risk score, and to decide which patients need more urgent interventions or even experimental treatments in cases with poor prognoses.

The first step in this process is to determine which symptoms are relevant enough to be included in the development of the prognosis predictors. Since Mirador quantifies the strength of the relationships between all variables using a general test of statistical significance, we can also rank all the associations with one specific variable, in this case patient outcome. This ranking will include several false positives (variables that appear to be correlated with outcome by chance alone), particularly if the number of candidate symptoms is large. There are several techniques to deal with multiple hypothesis testing. Mirador controls the False Discovery Rate (FDR) using the step-up procedure originally proposed by Benjamini and Hochberg. The user can specify an acceptable FDR, that is, the fraction of associations mistakenly declared significant, expressed as a percentage of the total number of associations listed in the ranking (this FDR control is entirely separate from the dynamic multiple-hypothesis correction we are planning to add to the open-ended search mode). Once this final list of variables is determined, we can export it and use the variables as predictors in prognosis models trained on the dataset. A proof of concept of this approach was recently published in PLoS Neglected Tropical Diseases, and includes an additional step after model training: packaging and deploying these models as mobile apps:
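The Benjamini-Hochberg step-up procedure is short enough to sketch directly. This is a generic textbook illustration, not Mirador's code: sort the p-values, compare the i-th smallest against (i/m) times the target FDR, and declare significant everything up to the largest p-value that passes.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean mask of the p-values declared significant while
    controlling the false discovery rate at the given level (BH step-up)."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices of sorted p-values
    thresholds = fdr * np.arange(1, m + 1) / m     # (i/m) * fdr for i = 1..m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest i with p_(i) <= (i/m)*fdr
        significant[order[:k + 1]] = True          # everything up to rank k passes
    return significant
```

Note the step-up logic: a p-value above its own threshold can still be declared significant if some larger-ranked p-value passes, which is what distinguishes BH from a simple per-test cutoff.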


Figure 3: Mobile app for predicting mortality risks of Ebola patients

The predictive models are automatically compiled into apps that take the available symptoms as inputs and generate a risk score that is presented to the user. Although the apps are currently only prototypes, this approach illustrates one way in which the models could be made actionable and field-deployable, thereby defining an entire pipeline from visual exploration of clinical records to training and deployment of predictive models.
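As an illustration of how such an app could turn symptoms into a risk score, here is a minimal logistic-regression-style scorer. The coefficients and symptom names are invented placeholders for the sake of the example, not the published Ebola model.

```python
import math

# Hypothetical coefficients (placeholders, not from the published models):
# each binary symptom indicator contributes a weight to the log-odds score.
COEFFS = {"intercept": -2.0, "temperature_high": 1.1,
          "bleeding": 1.8, "weakness": 0.6}

def risk_score(symptoms):
    """Mortality risk in [0, 1] from a dict of binary symptom indicators;
    symptoms missing from the dict are treated as absent (0)."""
    z = COEFFS["intercept"] + sum(
        w * symptoms.get(name, 0)
        for name, w in COEFFS.items() if name != "intercept")
    return 1.0 / (1.0 + math.exp(-z))    # logistic link: log-odds -> probability
```

Because the score is a probability, it can be mapped directly onto triage categories (for example, low/medium/high risk) at thresholds chosen by clinicians.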

To ensure that the apps effectively assist physicians and health care workers in decision-making, we plan to continue working on the app-generation component of Mirador and to collaborate with the stakeholders who would directly benefit from these models and apps, in order to identify the gaps in our visualization and modeling approach. The use of mobile devices and apps for data collection, patient diagnosis, and physician notification has been growing at a rapid pace in the past few years. It is clearly an area in which Mirador-derived models could have a significant impact, provided that they are properly deployed in the field and able to address the needs of patient care, hopefully fulfilling our initial goal of generating actionable knowledge from data.

Acknowledgments: Many thanks to Mary Carmichael for a detailed review of the text.
