Who's afraid of the big black box?: Statisticians' vital role in big data and predictive modelling



What goes on inside the black-box algorithms that turn big data into something useful? The answer, say Max Kuhn and Kjell Johnson, is statistical – so statisticians should come to the big data party.

Over the past century, the field of statistics has played an important role in developing society's awareness of how an understanding of data can form, shape, and improve our lives. Our field began to grow and thrive before the advent of modern computing and data storage, when the scale of experiments and data analyses was severely limited compared to what can be done today. During this time, statisticians’ development of experimental design and theoretically sound inferential analyses produced global, game-changing reverberations in fields such as finance, manufacturing, and pharmaceutical research and development. These important contributions clearly signalled that collecting relevant data leads to meaningful understanding.

Computational technology has rapidly improved, and we now have the ability to store ever increasing amounts of data and perform ever more intricate mathematical calculations on these data. Furthermore, these computational improvements have enabled individuals in fields such as computer science, bioinformatics and engineering to develop complicated, difficult-to-understand black-box algorithms to search for meaningful relationships within these large databases. They are called “black-box” algorithms for the simple reason that we know what goes in, we know what comes out, but very few understand what happens in-between. A less loaded term for the same thing is “predictive models” – a term that falls within the comfort zone of the average statistician.

Using this state-of-the-art technology, businesses have seen how the application of black boxes can help them accurately target their customers and make bigger profits; insurers use models to relate individuals’ characteristics to risk; pharmaceutical companies connect patient attributes to the effectiveness of a therapy; and investment managers predict events that influence market directions. The successful application of predictive models, along with ever improving and cheaper computational technologies, has, in turn, prompted public and private sector entities to gather increasing amounts of data. More data, they hope and assume, means that more valuable information can be extracted for corporate or municipal benefit.

Problems, of course, come in producing the algorithms that go inside the black boxes. These problems are so widespread, and the gains from solving them are so great, that they have sprung to life across the web. One recent, widely publicised big data problem was the Heritage Health Prize that sought a model to “Identify patients who will be admitted to a hospital within the next year, using historical claims data” (http://www.heritagehealthprize.com/c/hhp). The notoriety of this challenge was in part due to the assertion that a model could predict avoidable hospitalisations, help to improve patient health and decrease healthcare costs. It is a radical assertion, but, in the world of big data, a reasonable one. Moreover, the $3 million grand prize vaulted this problem to the front of the news and stimulated interest from many competitors. During the course of this competition, more than 1600 individuals submitted more than 25,000 models for evaluation.

These big data and predictive modelling stories now pepper the web with titles that include phrases like “data mining”, “machine learning”, “artificial intelligence” and “predictive modelling”. While the word “statistics” is often found somewhere in predictive modelling articles and competitions, it rarely appears in the title. This lack of bold-type recognition can make statisticians feel that we are not an essential part of the big data party. At the same time, our rightly engrained suspiciousness of observational data and lack of understanding of black-box algorithms make us leery of attending any such party.

Does statistics play an essential role at the big data party? Without a doubt. The job of the black boxes, as we said earlier, is to search for meaningful relationships; and that, surely, is what statistics is all about. That the data sets are large should not deter us. Here we will explain the case for statistical involvement by highlighting the advantages that big data offer, as well as the pitfalls they present and how statisticians are uniquely trained to identify and avoid them.

First, let us define what we mean by “big data”. Information is usually regarded in two dimensions: the number of attributes (variables like blood pressure, temperature and antibody concentrations, for example) and the number of cases (samples or data points – for example, the number of different blood pressure readings). Big data result from drastically increasing either of these dimensions, and increases in either dimension can have important positive and negative repercussions. Following the common notation, we will refer to “big N” as an increase in the number of cases, and “big P” as an increase in the number of attributes, and we will examine each of these scenarios below.

Big N

Is it not always better to have more samples? Under many circumstances, it is a statistician's dream come true, since our statistical training tells us that the larger the sample size we have, the more power we have to perform a test and the closer the statistics will be to the parameters we are interested in understanding. A practical application of these principles is in Ian Ayres's 2007 book, Super Crunchers1. He describes the process he followed to find a title for the book. To understand the reader population's opinion on his choices of titles, he used several variations of targeted Google Ads, each with a different candidate name for the book. After a short period of time, he collected a quarter of a million data points related to which advertisement was clicked on most. Since the ads were served at random, this large-scale randomisation test provided strong evidence of which book name the reader population liked best.
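To make the logic of such a randomised comparison concrete, here is a minimal sketch of a two-proportion z-test for two candidate titles. The click and impression counts are invented for illustration (the book's actual figures are not given here), and the function name is our own; the sketch assumes each impression is an independent trial.

```python
import math

def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Two-sided z-test for a difference in click-through rates."""
    rate_a = clicks_a / shown_a
    rate_b = clicks_b / shown_b
    # Pooled rate under the null hypothesis that both titles perform equally
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 125,000 randomly served impressions per title
z, p_value = two_proportion_z_test(1250, 125_000, 1000, 125_000)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

With samples this large, a difference of only 0.2 percentage points in click-through rate is detected with very little ambiguity, which is the power advantage of big N described above.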

While technology like the web now makes big N data sets easy to acquire, an important question that statisticians are trained to understand is: has the data sampling mechanism changed and introduced potential biases in the data? Access to large databases does not reduce the need to understand why these particular samples are available. If there is a systematic bias in a small data set, there will be a systematic bias in a larger data set, if the source is the same. As an extreme example, consider the US Food and Drug Administration's adverse event reporting system database which provides information on millions of reported occurrences of drugs and their reported side effects (http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/). Obvious biases abound in this collection of data. For example, a search on a drug for treating nausea may reflect the fact that a large proportion of the patients using the treatment had leukaemia. An uninformed analysis may identify leukaemia as a potential side effect of the drug. The more likely explanation is that the subjects were taking the nausea medication to mitigate the side effects of the cancer therapy. This may be intuitively obvious to statisticians, but clearly the availability of large quantities of records is no protection against bias. Conversely, there may be benefits from larger samples to give better coverage of the population, since rare or novel subgroups may be sampled to the point where they are detectable above the noise.
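The point that a larger sample from the same biased source keeps the same bias can be checked with a small simulation. This is a sketch under invented assumptions: the outcome is standard normal with a true mean of zero, and a record enters the database with probability 0.8 when the outcome is positive and 0.2 otherwise. Neither the mechanism nor the numbers come from the FDA database.

```python
import math
import random

def biased_sample(n, rng):
    """Draw n recorded cases under an assumed outcome-dependent reporting rule."""
    sample = []
    while len(sample) < n:
        y = rng.gauss(0.0, 1.0)               # true population mean is 0
        keep_prob = 0.8 if y > 0 else 0.2     # hypothetical reporting bias
        if rng.random() < keep_prob:
            sample.append(y)
    return sample

def mean_and_std_error(sample):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((y - mean) ** 2 for y in sample) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(42)
results = {}
for n in (100, 10_000, 200_000):
    results[n] = mean_and_std_error(biased_sample(n, rng))
    mean, std_err = results[n]
    print(f"n = {n:>7}: estimated mean = {mean:.3f}, standard error = {std_err:.4f}")
```

The standard error shrinks as n grows, but the estimate settles near 0.48 rather than the true value of 0. More data only makes us more certain of the wrong answer.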

If the data do not contain problematic sampling biases, then building the black boxes – the predictive models that mathematically elucidate a relationship between the outcome and the predictors – can significantly benefit from the large amount of information available about the underlying population. Exploratory statistical methods – restricted cubic splines2 for example – can help to derive empirically the functional form of a predictor within a model, a form that may be difficult to obtain with small to moderate sample sizes. Big data, in short, with good statistical methodology, help to find the form of a mathematical relationship which can usefully sit within the black box.

When it comes to building these models, the number of available observations also affects what statisticians understand as the bias–variance trade-off3. We know that the mean squared error (MSE) of a model can be decomposed into a squared bias term – crudely, how close the averaged predictions come to the correct result – and a variance term – how spread out the predictions are. To briefly explore this relationship, consider a model which is likely to have low bias but high variance. A classification or regression tree model might be an example4. This implies that slight perturbations in the data you feed in can cause major changes in the structure (and interpretation) of the tree – in what happens inside the black box – while still getting, on average, close to the correct result. Other low-bias, high-variance models include neural networks, k-nearest neighbours, support vector machines, as well as other “black-box” models (Figure 1). Statisticians will recognise them; non-statisticians need only know that they may be part of the contents of the black boxes.
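For readers who want the decomposition written out, one standard form is given below. The notation is our own assumption rather than the article's: f(x) is the true value at a point x, \hat{f}(x) is the model's prediction there, and the expectation is taken over repeated training samples.

```latex
\mathrm{MSE}\bigl(\hat{f}(x)\bigr)
  = E\Bigl[\bigl(\hat{f}(x) - f(x)\bigr)^{2}\Bigr]
  = \underbrace{\bigl(E[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{squared bias}}
  + \underbrace{E\Bigl[\bigl(\hat{f}(x) - E[\hat{f}(x)]\bigr)^{2}\Bigr]}_{\text{variance}}
```

When the error is measured against a new noisy observation rather than against f(x) itself, an irreducible noise variance is added to the right-hand side.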

Figure 1.

Common predictive models and their bias–variance positions

Fundamental statistical knowledge tells us that one way to reduce the variance of an estimate is to increase sample size. Having access to big N therefore helps to reduce the variance of low-bias, high-variance models. To illustrate this big N advantage, consider a simple relationship between a predictor and a response that follows a sine wave pattern with noise. In the left-hand plot of Figure 2, 50 samples of size 50 were randomly drawn from this population, and a regression tree was built on each sample. The resulting predictions from each of those trees are illustrated with grey lines. Overall, each tree's prediction generally follows the shape of the sine wave (low bias), but there is considerable variation in the predictive patterns (high variability). The average root MSE across these 50 models is 0.43. The right-hand plot illustrates the effect of a large sample size on the resulting regression trees. In this case, 50 samples of size 5000 were randomly drawn from the population, and a regression tree was built on each sample. Clearly, the resulting predictions follow the shape of the sine wave much more closely and there is much less variation between them. The increase in sample size produces very consistent trees, as illustrated by the overlap in predictions across the models. The average root MSE across these 50 models is 0.29. As this example demonstrates, access to big N can aid in the model building process.

Figure 2.

The impact of increasing sample size for a simple CART model. Left: a CART model was built on 50 different data sets of size n=50. The predictions from these models have high variance. Right: a CART model was built on 50 different data sets of size n=5000. The predictions from these models have lower variance
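A rough version of this experiment can be sketched in a few lines. This is not the code behind Figure 2: the noise level, the range of the predictor, and the tree settings (a minimum of 5 points per leaf, a maximum depth of 6) are all assumptions, and the tree is a bare-bones single-predictor regression tree written for illustration. The numbers it prints should therefore not be expected to match the root MSE values quoted above.

```python
import math
import random
import statistics

NOISE_SD = 0.3  # assumed noise level; the article does not state one

def simulate(n, rng):
    """Return n (x, y) pairs sorted by x, with y = sin(x) + noise."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return sorted((x, math.sin(x) + rng.gauss(0.0, NOISE_SD)) for x in xs)

def fit_tree(points, min_leaf=5, max_depth=6, depth=0):
    """Greedy least-squares regression tree on one predictor.
    A leaf is a float (the mean response); a node is (threshold, left, right)."""
    n = len(points)
    total = sum(y for _, y in points)
    if depth >= max_depth or n < 2 * min_leaf:
        return total / n
    best_score, best_i, left_sum = float("-inf"), None, 0.0
    for i in range(1, n):  # split between points[i - 1] and points[i]
        left_sum += points[i - 1][1]
        if i < min_leaf or n - i < min_leaf or points[i - 1][0] == points[i][0]:
            continue
        # Maximising this score is equivalent to minimising the summed squared error
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        return total / n
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2.0
    return (threshold,
            fit_tree(points[:best_i], min_leaf, max_depth, depth + 1),
            fit_tree(points[best_i:], min_leaf, max_depth, depth + 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def experiment(n, n_repeats=50, seed=1):
    """Average RMSE against the true curve, and average spread between trees."""
    rng = random.Random(seed)
    grid = [2.0 * math.pi * (i + 0.5) / 100 for i in range(100)]
    all_preds = []
    for _ in range(n_repeats):
        tree = fit_tree(simulate(n, rng))
        all_preds.append([predict(tree, g) for g in grid])
    rmse = sum(
        math.sqrt(sum((p - math.sin(g)) ** 2 for p, g in zip(preds, grid)) / len(grid))
        for preds in all_preds) / n_repeats
    spread = sum(statistics.pstdev(column) for column in zip(*all_preds)) / len(grid)
    return rmse, spread

small = experiment(50)
large = experiment(5000)
print(f"n = 50:   RMSE to true curve = {small[0]:.3f}, spread between trees = {small[1]:.3f}")
print(f"n = 5000: RMSE to true curve = {large[0]:.3f}, spread between trees = {large[1]:.3f}")
```

Here the error is measured against the noise-free sine curve and the spread is the standard deviation of the 50 trees' predictions at each point, averaged over a grid. Both quantities should fall markedly when n rises from 50 to 5000.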

High-bias, low-variance models include techniques such as linear regression, linear discriminant analysis, partial least squares and naïve Bayes, to name but a few (Figure 1). These models are more stable, but lack the flexibility to capture many types of non-linear relationships between the predictors and outcomes. They also tend to be less computationally burdensome and are well suited to very large data sets. Big N can again reduce variance and enable highly accurate parameter estimates for these models – but it does not help to reduce the underlying model bias. But low-variance, high-bias models can still be used effectively if the modeller has the time to reduce the bias by more standard statistical means. This manual approach to approximating the functional form of the model can be rewarding when a solid methodology is combined with large data sets.
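The same sine-wave setting shows why big N cannot rescue a high-bias model. In the sketch below a straight line is fitted by ordinary least squares; the noise level and the range of the predictor are again our own assumptions.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse_to_truth(intercept, slope, grid):
    """Root mean squared distance between the fitted line and the true sine curve."""
    return math.sqrt(
        sum((intercept + slope * g - math.sin(g)) ** 2 for g in grid) / len(grid))

rng = random.Random(7)
grid = [2.0 * math.pi * (i + 0.5) / 200 for i in range(200)]
results = {}
for n in (50, 5000):
    slopes, errors = [], []
    for _ in range(50):  # 50 independent samples of size n
        xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]  # assumed noise sd 0.3
        intercept, slope = fit_line(xs, ys)
        slopes.append(slope)
        errors.append(rmse_to_truth(intercept, slope, grid))
    results[n] = (statistics.pstdev(slopes), sum(errors) / len(errors))
    print(f"n = {n:>4}: sd of slope = {results[n][0]:.4f}, "
          f"RMSE to true curve = {results[n][1]:.3f}")
```

The slope estimate becomes far more stable with the larger samples, yet the distance from the true curve barely moves, because no straight line can follow a sine wave. Removing that bias requires changing the functional form, for example by adding spline terms.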

In many cases big N has less positive consequences. First, many of the black-box methods incur significant computational burdens as N (or P) grows. The direct impact of increasing N and/or P on computational burden can grow faster than linearly and depends on the model's implementation, the ability of the model building to be parallel processed, and the availability of computational resources (i.e. memory and processors). As an example, consider building a single classification tree. This model performs many exhaustive searches across N and P to find optimal splits of the data. As N and P grow, computation time likewise grows. Moreover, the computational burden for ensembles of trees (the techniques known as boosting, random forests, etc.) is even greater, often requiring more expensive hardware and/or special implementations of models that make the computations feasible. Second, there are diminishing returns on adding more data from the same population. Since parameter estimates stabilise with sufficiently large N, garnering more N is less likely to have an impact on the model (of course, “large” depends on the nature of the problem). An alternate strategy in this situation is to sample the data for maximum diversity5 so that the deluge of points in the mainstream of the data does not drown more subtle trends.

Therefore, statisticians’ fundamental understanding of the relationship among bias, variance, and big N in the context of modelling can help to guide the model building process towards more accurate and computationally efficient models given the data at hand.

Big P

Now that we see the advantages and disadvantages of a large number of samples in the context of modelling, let us turn to the impact of “big P”. Many technologies are moving in the direction of making large-scale measurements on each case, such as the advances in genetic sequencing, imaging, online commerce and financial indicators. Expanding the number of meaningful attributes collected for each case can improve model performance more effectively than expanding the sample size; however, adding non-informative attributes can greatly reduce model performance.

As a simple example, if you are trying to model energy consumption for residential homes, the model will need measurements such as square footage of home, age of home, type of construction materials, and so on. But factors such as the colour of a home's exterior siding or its proximity to the closest school are unlikely to affect energy consumption. Feeding irrelevant attributes into your model can reduce its effectiveness and will require additional computational time.

Big P has traditionally been a double-edged sword for small- to medium-N data sets. The major concern statisticians understand is over-fitting6. This happens when the output of your model is too exact: it reflects not just the underlying structure of the input data – which you want – but the random noise fluctuations of those data as well – a distracting irrelevance that one would prefer to see filtered out by the time the data reach the far side of the black box. Given a large-N data set, we can reserve a large enough random sample of the data to use solely for predictor selection, separated from parameter estimation and hence mitigating the over-fitting problem. Better yet, a predictive model that internalises feature selection is likely to be optimal since the filtering of predictors is determined in conjunction with the model training (as opposed to externally with a separate data set). For example, if a classification tree only splits on 20 predictors in the model, the rest of the predictors are essentially filtered out. Models with built-in feature selection tend to be low-bias and high-variance, but there are a few low-variance alternatives, such as the lasso7 and elastic net8.


Regardless of big N or big P, the modeller should always pause to reassess the goals of the analysis. The tendency with predictive modelling is to focus on model performance. But big data may offer the ability to find and develop more targeted models. That is, enough information may be present in large data sets to identify subsets of data where local models are more informative. For example, a general focus of business analytics problems is to find models that increase revenue. In these cases, the best model may not globally incorporate all data; instead, there may be small niche subpopulations that generate unique relationships between predictors and the outcome of interest and lead to better revenues than any global model. In this context, local models beat global models. Big data can enable the identification of these opportunities and contain the content to capitalise on them.

Given these ideas, how does the statistician contribute? Our biggest role is to affect the methodology. Predictive modelling techniques have been developed and pioneered in other fields, such as computer science, chemistry and engineering (the head of engineering at a diagnostic company once remarked, “We need machine learning, not statistics!”). However, in essence, most predictive models are statistical models. As such, a well-trained statistician will be very attuned to the assumptions made about the data, to how uncertainty should be measured and to how logical dangers can be avoided.

As previously mentioned, big P without big N can lead to severe over-fitting due to feature selection. For example, Ambroise and McLachlan6 re-examined the analysis of a microarray study9 and determined that there was a logical error in how genes were selected for the predictive model. In one case, the mistake was so significant that, even when the outcome classes were completely non-informative, the proposed algorithm could achieve zero errors. The issue related to how the uncertainty in feature selection was aggregated, which is purely a statistical problem. If the feature selection routine is viewed as a (non-statistical) computer science algorithm, the likelihood of these types of mistakes increases. As another example, when faced with missing data, many practitioners will focus on the technical consequences, such as how the data are encoded. A statistician is more likely to ask why data are missing, whether this is informative and how it will affect the posterior probabilities generated by a predictive model.
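The kind of selection error described here can be reproduced on pure noise. The sketch below is not the analysis in either cited paper. As stand-ins, it ranks predictors by the absolute difference in class means and classifies with a nearest-centroid rule, on 40 simulated cases with 1000 predictors that carry no information about the class labels. All names and settings are our own assumptions.

```python
import random

def make_noise_data(n, p, rng):
    """Predictors are pure noise; labels alternate and are unrelated to them."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def top_features(X, y, rows, k):
    """Keep the k predictors with the largest class-mean difference, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        sums, counts = [0.0, 0.0], [0, 0]
        for i in rows:
            sums[y[i]] += X[i][j]
            counts[y[i]] += 1
        scores.append((abs(sums[1] / counts[1] - sums[0] / counts[0]), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid_predict(X, y, train_rows, test_row, features):
    """Assign the test case to the class whose training centroid is nearer."""
    distances = []
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in features]
        distances.append(
            sum((X[test_row][j] - c) ** 2 for j, c in zip(features, centroid)))
    return 0 if distances[0] <= distances[1] else 1

def loo_error(X, y, k, select_inside):
    """Leave-one-out error, with selection either inside or outside the loop."""
    rows = list(range(len(X)))
    fixed = None if select_inside else top_features(X, y, rows, k)
    errors = 0
    for i in rows:
        train = [r for r in rows if r != i]
        features = top_features(X, y, train, k) if select_inside else fixed
        errors += centroid_predict(X, y, train, i, features) != y[i]
    return errors / len(rows)

rng = random.Random(2024)
X, y = make_noise_data(n=40, p=1000, rng=rng)
biased = loo_error(X, y, k=10, select_inside=False)  # selection saw every case
honest = loo_error(X, y, k=10, select_inside=True)   # selection repeated per fold
print(f"selection outside cross-validation: error = {biased:.2f}")
print(f"selection inside cross-validation:  error = {honest:.2f}")
```

Because the labels are unrelated to the predictors, no honest procedure should do much better than chance. When the predictors are chosen once using all cases, the held-out case has already influenced the choice, so the reported error looks far better than it should. Repeating the selection within each fold restores an error close to that of guessing.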

Does statistics play an essential role at the big data party? Hopefully we have convinced you that it does. The original principles developed by our field's predecessors who helped the world to recognise the importance of data still, and will always, play a crucial role in uncovering truth, helping to move “statistics” to the bold-type font it deserves to be in the twenty-first century.