The biological sciences are acquiring a growing number of tools that measure large numbers of quantities in a single “assay.” These tools cover genes (single-nucleotide polymorphisms, copy number variation, and direct sequencing), transcription (microarrays and real-time polymerase chain reaction), proteins (shotgun proteomics), cytokines (multiplexed assays),1 and many more. I will refer to this entire range of instrumentation as high-dimensional biology, that is, techniques in which the number of results per sample often vastly exceeds the number of samples in an experiment.
These new data sources have brought with them several important challenges, many of which are only beginning to be adequately addressed. A 1997 address by Jerome Friedman on the challenges of data mining captures a key issue: Every time a technology increases in effectiveness by a factor of 10, one should completely rethink how to apply it. Consider the historical progression from walking to driving to flying. Each increases speed by roughly a factor of 10. However, each such purely quantitative increase has completely changed our thinking of the use of transportation in society. A corollary of this might be ‘every time the amount of data increases by a factor of 10, we should totally rethink how we analyze it.’2
In the biological sciences, we have seen such increases multiple times in the last several years, and the end is not yet in sight. I sometimes feel like I am watching a bench-top arms race, with each new device obsolete almost before we fully understand it. We need to avoid the recently voiced characterization that “it's not an information explosion, it's a data explosion and that is not necessarily the same thing.”3
The purely computational challenges that arise are complex; it is not completely clear how all of them will be solved.4 One of the most obvious changes is the way in which data are stored and understood. A standard model for many decades has been the laboratory notebook, which assembles all of the data into a compact space for examination and cogitation. The modern spreadsheet has largely served as an electronic extension of this principle. High-dimensional methods break this model by simply generating too much data for the mind to grasp. One of the most insidious consequences is that simple laboratory anomalies such as lapses in instrument quality control, which would normally be caught at or near their inception, now become invisible. A well-publicized case was that of a “proteomic signature” for ovarian cancer that was later ascribed to drift of the primary instrument over time; detection of the problem required fairly sophisticated methods.5 Such errors are not unique.6 I had a similar experience in a study of digital mammograms in which the computational tool found a “signature” for future malignancy that was somewhat inexplicable. After 2 months of further digging and forming hypotheses in an attempt to construct a biological story about the result, we found the primary issue to be a particular form of noise in the scanner, which had been reset during regular maintenance partway through the study. Again, this was a pattern that would have been quickly spotted in a small set of numbers but was hidden in the mass. Thus, the challenges of the new methods include not just the mechanics of storing and manipulating such large data sets but also the need to create simpler summaries and data checks that will flag issues quickly. When using large amounts of data, we need to remember that more, not less, diligence is required in examining and cross-checking results before they are presented to a wider audience. External validation of any results is almost certainly needed.
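As one illustration of the kind of simple data check called for here, the sketch below reduces each sample to a single summary number (for instance, its median intensity) and asks whether any batch of samples sits far from the first batch, as might happen after an instrument is reset. The function name, the threshold of 3 robust standard deviations, and the example values are all assumptions made for illustration; none of them comes from the studies cited.

```python
from collections import defaultdict
from statistics import median


def flag_shifted_batches(summaries, batches, threshold=3.0):
    """Return the batches whose median summary is far from the first batch.

    summaries: one summary number per sample, in run order.
    batches:   the batch label of each sample (same length and order).
    threshold: allowed shift, in robust standard deviations (an assumed default).
    """
    by_batch = defaultdict(list)
    for value, batch in zip(summaries, batches):
        by_batch[batch].append(value)

    # Center of each batch, keyed in order of first appearance.
    centers = {batch: median(values) for batch, values in by_batch.items()}

    # Within-batch spread: median absolute deviation around each sample's own
    # batch center, scaled to be comparable to a standard deviation.
    residuals = [abs(v - centers[b]) for v, b in zip(summaries, batches)]
    scale = 1.4826 * median(residuals)
    if scale == 0:
        raise ValueError("no within-batch variation; cannot scale the shift")

    reference = batches[0]  # the first batch run serves as the reference
    return [
        batch
        for batch, center in centers.items()
        if batch != reference
        and abs(center - centers[reference]) > threshold * scale
    ]


# Hypothetical per-sample medians before and after a maintenance reset.
values = [10.1, 9.9, 10.0, 10.2, 9.8, 12.1, 11.9, 12.0, 12.2, 11.8]
labels = ["before"] * 5 + ["after"] * 5
print(flag_shifted_batches(values, labels))  # ['after']
```

A check of this sort does not explain a shift; it only brings it to attention early, when the cause is still easy to trace.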
The second issue is the screening problem. Any test applied to a large population in which the fraction of true positives is small will, unless its specificity is nearly perfect, find more false positives than true positives. This is well understood in clinical settings (e.g., screening for hepatitis in a blood bank). In this setting, the usual P value, which addresses whether any single test result is a false positive, is not particularly useful, and instead one has to focus on the overall sensitivity and specificity of the test. The identical case holds in microarray studies, in which each gene is tested for its association with a particular disease, with the a priori hypothesis that the great majority of genes have no association. Here the genes are the “subjects,” and spurious associations are the false positives. The language of reporting is likewise in terms of the sensitivity, specificity, and false discovery rate (the fraction of reported positives that are false) instead of null hypotheses and P values. However, the primary consequence is that, as in disease screening, any given positive must be viewed only as an indication for further exploration of that pathway and not as reliable stand-alone evidence for association.
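The arithmetic behind the screening problem can be sketched in a few lines. The numbers used here (20,000 genes, 1% truly associated, sensitivity 0.90, specificity 0.95) are hypothetical values chosen for illustration, not figures from any study, and the function and class names are likewise invented.

```python
from dataclasses import dataclass


@dataclass
class ScreeningResult:
    true_positives: float        # expected number of real associations flagged
    false_positives: float       # expected number of null genes flagged
    false_discovery_rate: float  # fraction of flagged genes that are false


def screen(n_tested, fraction_true, sensitivity, specificity):
    """Expected outcome of applying one test to every gene (or subject)."""
    n_true = n_tested * fraction_true
    n_null = n_tested - n_true
    tp = n_true * sensitivity
    fp = n_null * (1 - specificity)
    flagged = tp + fp
    fdr = fp / flagged if flagged > 0 else 0.0
    return ScreeningResult(tp, fp, fdr)


# Hypothetical microarray study: 20,000 genes, 1% truly associated.
r = screen(20_000, 0.01, sensitivity=0.90, specificity=0.95)
print(f"{r.true_positives:.0f} {r.false_positives:.0f} "
      f"{r.false_discovery_rate:.2f}")  # 180 990 0.85
```

Under these assumed values, a test that is right 95% of the time on null genes still produces a list in which roughly 85% of the entries are spurious, which is why each positive is best treated as a lead for follow-up.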
Closely related to the screening problem is the problem of unrelated factors. Assume that some particular instrumentation (single-nucleotide polymorphisms, microarrays, or whatever) is absolutely reliable, giving exact expression results on 20,000 genes, and assume further that one-half of these genes both vary from subject to subject, as would be expected for a human population, and have nothing whatsoever to do with a given disease under study. Then, in a small-sample comparison of five cases to five controls, we would expect, by chance alone, about 80 of these genes to perfectly segregate between the cases and the controls. That is, the values for all five cases are larger (or smaller) than those for all five controls. For any one such gene, only 2 of the 252 equally likely ways to pick the 5 subjects with the highest values match the case/control split, and 10,000 × 2/252 ≈ 80. Validating the results with another assay does nothing to alleviate this: the five cases really are different on this particular factor, be it darker hair or a longer second toe or whatever. A good global screening method will find everything that differs between two groups, whether it is relevant or not. The standard cure for this is to use group sizes in the thousands, which diminishes the likelihood of a large imbalance for such chance associations. This is currently the fashion in genome-wide association studies; however, it is not a realistic option for most work.
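The figure of 80 can be checked with a short calculation. The sketch assumes that each unrelated gene takes continuous values with no ties and that, for such a gene, every assignment of subjects to the case and control groups is equally likely; the function names are invented for illustration.

```python
from math import comb


def perfect_split_probability(n_cases, n_controls):
    """Chance that an unrelated gene separates cases from controls perfectly.

    Of the comb(n, n_cases) equally likely choices of which subjects hold
    the highest values, exactly two match the grouping: all cases on top
    or all controls on top.
    """
    return 2 / comb(n_cases + n_controls, n_cases)


def expected_perfect_splits(n_unrelated_genes, n_cases, n_controls):
    """Expected count of unrelated genes that segregate perfectly by chance."""
    return n_unrelated_genes * perfect_split_probability(n_cases, n_controls)


# 10,000 unrelated but varying genes, five cases versus five controls.
print(round(expected_perfect_splits(10_000, 5, 5), 1))  # 79.4
```

Because an expectation is additive, the count does not depend on the genes being independent of one another. The calculation covers only perfect segregation; weaker chance imbalances are far more common and are not counted here.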
Ultimately, high-dimensional methods are just too good as a screening tool. I usually find that the investigator wishes that the “significant” list from such a study had far fewer entries. The classic modeling approach, which tries to pull out a single subset of significant variables from a multivariate model, is nearly useless in this context despite a lot of interesting work on adapting it to a large number of predictors. Nakagawa et al.7 evaluated nine different gene signatures obtained from five studies in a single validation sample. Despite minimal overlap in their respective gene lists, predictive accuracy was good with all the proposals.
What then is the direction out of this dilemma? The most promising is to take a global approach, that is, to cease looking at the data as a list of individual results and to look at it instead as a system of interconnected pathways. To be successful, however, this type of approach needs to integrate data from outside the study. It is too much to ask of a single data set that it illuminate both the high and the low, both an overlying network topology and how the given data set participates in those networks. There is a bootstrap problem of course, in that refining the network maps depends on good systems interaction data, which in turn depend on accurate network topology for their interpretation. Work on this next-generation genome, beyond the first step of simply enumerating all genetic components, is proceeding apace with resources such as the Gene Ontology project, the Cancer Genome Atlas, and many more. An example of recent methods that make use of such data is a pair of papers that address the analysis of predefined gene sets; the results show a much lower false discovery rate and more stable inference than one-gene-at-a-time approaches.8, 9 This is only one example of multiple new tools. A recent application is an overview of the genetics of immune-related disease,10 which used pathway information to integrate seemingly disparate data from multiple studies. Data from 56 large studies of 11 different diseases generated a plethora of possible associations; the authors were able to map a large fraction into three major immunological pathways.
This is of course the familiar call of systems biology.11 The most serious issue with this approach is that the central problem is very hard: networks are complex. It is always easier to step closer than to step back, and in these times of worried funding, a safe, less ambitious approach is certainly appealing. Nevertheless, we need to go in that direction if we are to take full advantage of the feast of data spread before us and make substantial, deep progress in the understanding of human health and disease.