A dispatch from the multivariate frontier

Authors


David Houle, Department of Biological Science, Florida State University, Tallahassee, FL 32306-1100, USA.
Tel.: 850 645 0388; fax: 850 644 9829; e-mail: dhoule@bio.fsu.edu

The central challenge in biology is how to relate the vast amount of information in an organism's genome to its highly multivariate phenotype, a task that can be labelled phenomics. Blows' contribution is helpful to this central effort because it reminds us of statistical techniques that can greatly simplify complex data sets. Our limited human brains need all the help they can get grappling with biology's multivariate future.

In the early 19th century, a handful of Europeans explored the interiors of North America and Australia. Word of what they found was eagerly awaited by the settlers along the coast. Mark Blows’ paper is like one of those reports of the frontier, in this case the multivariate frontier (Blows, 2007). He has been out there, found that it is conducive to human understanding, and taken the time to explain it to us.

Organisms are fantastically complicated. Traditional biological intuition comes down to choosing an interesting character as the object of study, from the essentially infinite number of characters that could be identified. Scientists who pick characters that are both biologically important and readily measured can obtain interesting results that resonate with others; those who choose less fortunately struggle.

About 25 years ago, Lande (1979; Lande & Arnold, 1983) and others formulated elegant models for the evolution of multiple traits that seemed to offer a way around such difficulties. The most familiar of these predicts the response to selection on n traits as $\Delta\bar{z} = G\beta$, where $\Delta\bar{z}$ is the n × 1 vector of predicted changes in trait means, G is an n × n matrix of genetic variances and covariances, and β is an n × 1 vector of linear selection terms. Lande reminded us that the simultaneous study of selection and inheritance on many traits at once would allow us to identify which traits were subject to the most natural selection, and which were best able to respond to selection.
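To make the arithmetic of the Lande equation concrete, here is a minimal sketch in plain Python. The function name `predicted_response` and the numerical values of G and β are hypothetical, chosen only for illustration; they are not estimates from any study discussed here.

```python
def predicted_response(G, beta):
    """Return the predicted change in trait means, G times beta.

    G is an n x n list of rows (genetic variances and covariances);
    beta is a length-n list of linear selection terms.
    """
    n = len(beta)
    if len(G) != n or any(len(row) != n for row in G):
        raise ValueError("G must be n x n and beta must have length n")
    return [sum(G[i][j] * beta[j] for j in range(n)) for i in range(n)]


# Hypothetical two-trait example (illustrative values only).
G = [[2.0, 1.0],
     [1.0, 2.0]]
beta = [1.0, 0.0]  # direct selection on trait 1 only

response = predicted_response(G, beta)
print(response)
```

In this made-up case the second trait is predicted to change even though it is not under direct selection, because it covaries genetically with the first. That kind of indirect response is what a one-trait-at-a-time analysis cannot see.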

Although the Lande equations are widely used heuristic and theoretical tools, their informative use in empirical studies has been frankly limited. For example, my colleagues and I were recently involved in a survey of the literature on the strength of directional selection as measured by β (Hereford et al., 2004). I was greatly surprised by how few of those studies incorporated large numbers of traits, or were competently analysed and adequately reported. With a few exceptions, the many estimates of G and β have not been obtained in the same populations, where a real prediction could be made.

There are two major stumbling blocks to informative empirical application of the Lande models. First, it is very difficult to get the data necessary to adequately fit the models. On the selection side, one really wants to study selection in natural populations, where manipulating and tracking organisms is difficult or even harmful to the population. On the inheritance side, the necessary sample sizes (e.g. Lynch & Walsh, 1998) strain our capabilities, even in model systems. I have been led to the study of the wings of Drosophila precisely because I could automate their measurement (Houle et al., 2003).

Blows’ (2007) paper is targeted at relieving a second limitation on such studies – our relative inability to imagine simple generalizations that could be obtained from multivariate data. We do not just pick one trait to study because there is less to measure, but also because it is difficult to think in terms of multiple traits. I am sure I am not alone in feeling somewhat at a loss when confronted by a huge G matrix (e.g. Riska et al., 1984; Cowley & Atchley, 1990): how am I supposed to make sense of this? We humans need simplifying principles that help us make the complex understandable.

Blows reminds us of two such simplifying ideas based on matrix diagonalization. First, the matrix of nonlinear selection terms can be diagonalized to simplify the picture of the selective surface, as originally noted by Phillips & Arnold (1989). Secondly, the G matrix can be diagonalized to summarize the distribution of breeding values in multivariate space (e.g. Kirkpatrick & Lofsvold, 1992). This can tell us about the number and nature of phenotypic dimensions with genetic variation, and conversely about the ‘null space’ where genetic variation may be absent, or at least less abundant (Mezey & Houle, 2005). There are other relevant simplifying ideas that could also be mentioned, for example the treatment of some characters as functions, rather than as a set of discrete values (Kirkpatrick & Heckman, 1989). Size during growth is much more simply represented as a continuous function of age, rather than an infinite number of sizes through time.
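The two diagonalizations described above can be sketched for the simplest case of two traits, where a symmetric matrix has a closed-form eigen decomposition. This is only an illustration under stated assumptions: the helper name `diagonalize_symmetric_2x2` and all matrix values are hypothetical, and a real analysis of many traits would use a numerical linear algebra library rather than this two-trait formula.

```python
import math


def diagonalize_symmetric_2x2(m):
    """Return [(eigenvalue, eigenvector), ...] for a symmetric 2 x 2 matrix,
    ordered from the largest eigenvalue to the smallest."""
    (a, b), (c, d) = m
    if not math.isclose(b, c):
        raise ValueError("matrix must be symmetric")
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    # Angle of the axis associated with the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    major = (math.cos(theta), math.sin(theta))
    minor = (-math.sin(theta), math.cos(theta))
    return [(mean + radius, major), (mean - radius, minor)]


# Hypothetical G matrix with a strong genetic correlation between two traits.
G = [[1.0, 0.9],
     [0.9, 1.0]]
g_axes = diagonalize_symmetric_2x2(G)
share_first_axis = g_axes[0][0] / (g_axes[0][0] + g_axes[1][0])
print("G eigenvalues:", [value for value, _ in g_axes])
print("share of genetic variance on first axis:", share_first_axis)

# Hypothetical matrix of nonlinear selection terms.
gamma = [[-0.1, 0.2],
         [0.2, -0.1]]
gamma_axes = diagonalize_symmetric_2x2(gamma)
print("gamma eigenvalues:", [value for value, _ in gamma_axes])
```

In the made-up G matrix, almost all of the genetic variance lies along one combination of the two traits, so the second axis approximates a direction with little variation. In the made-up selection matrix, both diagonal terms are negative, yet the rotated axes have eigenvalues of opposite sign, indicating a saddle-shaped surface that the trait-by-trait terms alone would not reveal.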

Blows’ paper reminds us that the goal of multivariate analysis is not to study more traits than our limited human brains can grasp at once, but to discover how to make sense of many traits in simpler terms. Once we have done this, we can focus our attention on just what is important, where importance is verified algorithmically, rather than just by intuition.

In 1979, when Lande first published his famous equation, the challenge of thinking in multivariate terms seemed like a private game for smart evolutionary biologists. The world of biology has changed since then. We now have whole genomes, patterns of RNA expression and interacting networks to study. The biggest challenge in all of biology is now to connect all the multivariate genomic data to the multivariate whole organism phenotype. We need to invent methods for high-throughput phenotyping and the ability to relate that data to the genome, a task I have called phenomics (Houle, 2001). All of biology must now become multivariate, and not least evolutionary biology.

Thus overcoming the two major limitations to multivariate studies of evolution, obtaining data and interpreting it, is no longer a game but a necessity. The problem of obtaining data may ultimately be more severe, but the challenge of interpretation is perhaps the more important in the short term. We would not want to solve the data limitation issue until we can see what generalizations might flow from it. This is just where the value of Blows’ paper lies – a simple, lucid explanation of what we could gain from a multivariate study of selection. The challenge going forward is that there are only a handful of evolutionary biologists who can think well enough in multivariate terms to formulate the connections between simple ideas and complex data sets, Blows being prominent among them. On top of this, there are not that many more who can understand such connections once they are formulated. Go multivariate, young scientist! The frontier awaits you.
