With the advent of modern molecular techniques, increasing attention is being paid to nonmodel organisms for investigating the genetic basis of various phenotypes in physiological, ecological or geographical contexts. As genes are discovered that covary with an environmental parameter such as temperature, light or latitude, there is a natural temptation to ascribe causality to these correlations. However, correlations are only the tantalizing starting points for robust experimental designs and, in themselves, provide evidence for neither causality nor an underlying functional mechanism. Herein, we use covariation of traits with latitude to illustrate the problem of confounding causation and correlation over geographic gradients. We begin with a simple diagram:
If A is correlated with C and B is correlated with C, then A will automatically be correlated with B. There follows the natural temptation to infer or conclude that A causes B, that is, that genetic variation in A constitutes the genetic basis of B. As an example, we consider the relationship between the circadian clock regulating daily activities of organisms and the photoperiodic timer regulating seasonal activities of organisms. This relationship has a long and contentious history (Tauber & Kyriacou 2001; Hazlerigg & Loudon 2008; Bradshaw & Holzapfel 2010; Saunders 2010; Koštál 2011), a legacy of Bünning’s (1936) proposition that the circadian clock formed the causal basis of photoperiodism. At the molecular level, a probabilistic cause between circadian rhythmicity and photoperiodism occurs in plants (Kobayashi & Weigel 2007; Wilczek et al. 2009) and in a long-established laboratory strain of Syrian hamsters (Shimomura et al. 1997; Lowrey et al. 2000). However, there are no examples where the circadian clock has been shown to be necessary, let alone sufficient, for regulating photoperiodic response in natural populations of any animal. Yet, elements of the circadian clock have been shown to vary with latitude, as have phenotypes of the photoperiodic timing mechanism (Fig. 1). Therein lies the problem: Covariation is not proof of causation.
The seasonal timing of life-history events, which is typically orchestrated by the photoperiodic timer, is correlated with latitude in both plants (Wilczek et al. 2009) and animals (Bradshaw & Holzapfel 2007). An increasing number of circadian-related genes are now known to vary with latitude in Neurospora (Michael et al. 2007), plants (Arabidopsis: Michael et al. 2003; Caicedo et al. 2004; Stinchcombe et al. 2004; Glycine: Zhang et al. 2008), Drosophila (Kyriacou et al. 2008; Rand et al. 2010), fish (O’Malley & Banks 2008, O’Malley et al. 2010), birds (Johnsen et al. 2007), and humans (Cruciani et al. 2008). Given the observation that both circadian genes and photoperiodically mediated seasonal traits vary with latitude, the tendency is to conclude a causal connection between the circadian clock and the photoperiodic timer based on their latitudinal covariation.
The covariation of two traits with latitude could indeed be due to a common causal mechanism (pleiotropy), in which case an interesting relationship has been established and the question then becomes resolving the mechanistic basis of their coevolution. However, while latitude usually and appropriately serves as a composite variable, latitudinal variation represents multiple environmental factors, any one or a combination of which could be exerting parallel selective forces. The covariation of two traits with latitude could be a result of different selective forces acting on the two traits, the same selective force acting on two genetically independent traits, or a single selective force acting on one trait accompanied by genetic hitchhiking of a closely linked trait (Li 1997, ch. 9; Schluter et al. 2004, 2010). Examination of the relationship between variables can be made using techniques described in Sokal and Rohlf (1995, ch. 16): partial correlation examines the relationship between two variables while all the other correlated variables are held constant; path analysis incorporates simultaneously the contribution of several correlated variables. While useful, these statistics are complex, may suffer from collinearity of the independent variables (Petraitis et al. 1996), are not readily accessible in many statistical packages and heretofore have not incorporated discrete variables. We propose a more transparent test that requires little more than a hand calculator or an Excel spreadsheet and incorporates both linear regression and analysis of variance. Below, we provide examples from flies and fish to illustrate the simplicity and usefulness of the analysis of residuals to avoid a spurious conclusion of causation when only correlation exists. When Y is regressed on X, the regression equation, $\hat{Y} = a + bX$, plots the regression line and $Y - \hat{Y}$ = deviations from regression (residuals). The residuals are uncorrelated with X; that is, the effect of X on Y has been factored out.
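That residuals are uncorrelated with X follows directly from the least-squares slope. The short derivation below uses conventional notation that is assumed here rather than taken from the source: a for the intercept, b for the slope, and d for a residual.

```latex
\begin{align*}
\hat{Y}_i &= a + bX_i, \qquad
b = \frac{\sum_i (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_i (X_i-\bar{X})^2}, \qquad
a = \bar{Y} - b\bar{X},\\
d_i &= Y_i - \hat{Y}_i = (Y_i-\bar{Y}) - b\,(X_i-\bar{X}),\\
\sum_i (X_i-\bar{X})\,d_i &= \sum_i (X_i-\bar{X})(Y_i-\bar{Y}) - b\sum_i (X_i-\bar{X})^2 = 0 .
\end{align*}
```

Because the residuals also sum to zero, their covariance with X, and hence their correlation with X, is zero.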
If A is a causal element of B and both are correlated with latitude, then even when the common element of latitude is factored out, the residuals should still be correlated; if not, their common correlation with latitude is due to linkage or independent evolution and not due to a basic underlying causal relationship between A and B. When A or B is a discrete and not a continuous variable, the residuals are computed as deviations from mean latitude for each category of Y. Although applicable to the covariation or association of any two traits or processes with any environmental parameter, we continue with examples from the biological timing literature. To illustrate the test, we have chosen two specific examples because of their connection with latitude, because of the large number of sample populations over a wide latitudinal range, and because the numerical data were available in the source papers. This sort of analysis was not possible for most of the papers we read because either the sample size was too small or the tabular, numerical data from which figures were generated were not available either in the body of the text or in supplemental online material. The advent of requiring the posting of such data (Fairbairn 2011) will make subsequent verification via independent analysis tractable.
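The logic of the test for two continuous traits can be sketched in a few lines of Python. This is a minimal illustration only: the latitudes, slopes and deviations are invented values chosen so that the arithmetic is easy to follow, not data from any study, and the helper names (`residuals`, `pearson_r`) are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def residuals(y, x):
    """Deviations of y from its least-squares regression on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson_r(u, v):
    """Product-moment correlation between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    return suv / sqrt(suu * svv)

# Hypothetical populations (assumption: invented for illustration).
latitude = [40, 45, 50, 55, 60, 65, 70, 75]
# Two sets of deviations built to be uncorrelated with latitude and each other.
dev_a = [1, -1, -1, 1, 1, -1, -1, 1]
dev_b = [1, 1, -1, -1, -1, -1, 1, 1]

# Trait A tracks latitude.
trait_a = [0.5 * lat + d for lat, d in zip(latitude, dev_a)]
# Case 1: trait B tracks latitude independently of A.
trait_b_independent = [-0.4 * lat + d for lat, d in zip(latitude, dev_b)]
# Case 2: trait B depends directly on A (and so, indirectly, on latitude).
trait_b_causal = [0.8 * a + 0.5 * d for a, d in zip(trait_a, dev_b)]

res_a = residuals(trait_a, latitude)
r_raw = pearson_r(trait_a, trait_b_independent)
r_resid_independent = pearson_r(res_a, residuals(trait_b_independent, latitude))
r_resid_causal = pearson_r(res_a, residuals(trait_b_causal, latitude))

print("raw correlation, independent case:", round(r_raw, 3))
print("residual correlation, independent case:", round(r_resid_independent, 3))
print("residual correlation, causal case:", round(r_resid_causal, 3))
```

In this constructed example, the two independent traits are strongly correlated before latitude is factored out and uncorrelated afterwards, whereas the correlation persists in the residuals when one trait depends directly on the other. A real analysis would also test the residual correlation for significance, which this sketch omits.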
First, in Drosophila littoralis, Lankinen (1986) found significant correlations between latitude and a proxy for the photoperiodic timer (critical photoperiod necessary to induce adult diapause) and between latitude and the two most fundamental properties of any circadian rhythm (the period and amplitude of its oscillation) (Fig. 2). Insightfully, he factored out the common effect of latitude and showed that the residuals of critical photoperiod were no longer correlated with the residuals of either period or amplitude of the circadian eclosion rhythm. Hence, he proposed that their covariation with latitude was due to linkage and not a causal relationship between them. To verify this conclusion, Lankinen and Forsman (2006) crossed two extreme populations, allowed free recombination and then imposed selection for nondiapause on short days. The hybrid lines exhibited a more ‘southern’ photoperiodic response and a more ‘northern’, circadian-based eclosion rhythm than found in any of Lankinen’s original geographic strains (Fig. 3), that is, the reverse of what would have been expected had the circadian clock been a causal component of photoperiodism. These experiments confirmed Lankinen’s earlier conclusion (1986) that when the common effect of latitude was factored out, critical photoperiod was not correlated with either fundamental property of circadian rhythmicity. More generally, Lankinen and Forsman’s (2006) experiments confirmed the robustness of testing for a potentially causal connection between two traits by using residuals to factor out their common, correlated element.
Second, in Chinook salmon, Oncorhynchus tshawytscha, O’Malley and Banks (2008) found a significant correlation between latitude and their proxy for the circadian clock (length of the polyglutamine repeat in the gene OtsClock1b, hereafter, Poly Q) (Fig. 4a). They also found a significant association between latitude and their proxy for the photoperiodic timer (run time = seasonal timing of upstream migration in freshwater) (Fig. 4b). O’Malley and Banks (2008, p. 2813) conclude with the suggestion ‘that length polymorphisms in OtsClock1b may be maintained by selection and reflect an adaptation to ecological factors correlated with latitude, such as the seasonally changing day length.’ After extending their correlative analyses to three more species of salmon (Oncorhynchus), O’Malley et al. (2010, p. 3705) state more boldly that the ‘Clock gene is a central component of an endogenous circadian clock that senses changes in photoperiod (day length) and mediates seasonal behaviours’. At the heart of the conclusion is the association between latitude, Poly Q and the timing of migration and spawning. This conclusion makes at least three essential but untested assumptions.
First, the conclusion requires that a single genotype (high frequency of the 335 allele and concomitant low frequency of the 359 allele) of the Chinook Poly Q domain is the primary determinant of two different run times, spring and autumn (Fig. 4b), even in the same river. This assumption may or may not be true.
Second, the conclusion assumes that the salmon-specific OtsClock1b plays a functional role in salmon circadian rhythmicity. There are two Clock paralogs in salmon: OtsClock1a and OtsClock1b, only the latter of which shows a significant correlation with latitude. However, the assumption that OtsClock1b has the same functional role in salmon as its ortholog in the mammalian circadian clock (Baggs et al. 2009) is untested.
Third, the conclusion assumes that there is a causal relationship between the daily circadian clock and the seasonal photoperiodic timer. This assumption is at best contentious (Bradshaw & Holzapfel 2010; Saunders 2010; Koštál 2011) and has not been tested in any fish. There is then a great leap from observing a correlation between latitude and only the OtsClock1b paralog and a correlation between latitude and run time or spawning date to concluding that the circadian clock is responsible for the evolution of photoperiodism and, hence, seasonal timing (O’Malley et al. 2010).
Strictly for purposes of illustration, we assume the first two of the above three assumptions to be true. We then use Lankinen’s (1986) approach of analysing residuals to test for an association between Poly Q and run time by factoring out the effect of latitude on Poly Q. In this case, Poly Q is a continuous variable and run time is a discrete variable. We therefore calculated the residuals from regression of Poly Q on latitude (Fig. 4a) and performed one-way anova of the residuals using run time as treatments. After factoring out the effect of latitude, run time accounted for a nonsignificant 7% of the residual variation in Poly Q (Fig. 4c). We therefore conclude that there is no basis to infer or suggest a causal relationship between them, either as a direct, independent effect of Poly Q on run time or as an indirect effect of Poly Q on the circadian clock. Further discussion of the adaptive significance of Poly Q in relation to run time is unwarranted, as is any speculation about a potential connection between the circadian clock and the seasonal photoperiodic timer. Future research might well be directed towards determining the function and adaptive significance of Clock1b in salmon in the context of the circadian clock itself, much as have other studies in diverse organisms (Yerushalmi & Green 2009).
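When one variable is continuous and the other discrete, the same idea can be sketched as a regression followed by a one-way analysis of variance of the residuals. The example below is a hedged illustration: the values are invented and are not the salmon data, the variable and function names are hypothetical, and the degrees of freedom are those of an ordinary one-way ANOVA, which is an assumption on our part.

```python
def mean(values):
    return sum(values) / len(values)

def regression_residuals(y, x):
    """Deviations of y from its least-squares regression on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def one_way_anova(values, groups):
    """Return (proportion of variation among groups, F, df_between, df_within)."""
    grand = mean(values)
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in by_group.values())
    ss_within = ss_total - ss_between
    # Assumption: ordinary one-way ANOVA degrees of freedom.
    df_between = len(by_group) - 1
    df_within = len(values) - len(by_group)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between / ss_total, f_ratio, df_between, df_within

# Hypothetical populations (assumption: invented, not the Chinook data).
latitude = [40, 45, 50, 55, 60, 65, 70, 75]
deviation = [1, -1, -1, 1, 1, -1, -1, 1]
repeat_length = [0.5 * lat + d for lat, d in zip(latitude, deviation)]
run_time = ["spring", "spring", "autumn", "spring",
            "spring", "autumn", "autumn", "autumn"]

# Step 1: factor out latitude from the continuous variable.
resid = regression_residuals(repeat_length, latitude)
# Step 2: ask how much residual variation the discrete variable explains.
explained, f_ratio, df_between, df_within = one_way_anova(resid, run_time)

print("proportion of residual variation explained:", round(explained, 3))
print("F =", round(f_ratio, 3), "with df =", (df_between, df_within))
```

The F ratio would then be compared with the critical value for the stated degrees of freedom from a table or a statistical package; if it falls short, the discrete variable explains no significant share of the residual variation.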
Hence, we propose that, before a causal relationship is inferred in similar cases of covariation of two or more traits with a third physiological or ecological independent variable, a straightforward analysis of deviations from the common independent variable be used. Absent a significant association, no causal relationship should be inferred or suggested. Even an inference of a causal relationship would be reasonable only if all of the following were true: (i) Variation in each trait is significantly correlated with a third common element, in our case, with latitude. (ii) The significant correlation between the two traits persists after the effect of the common element is factored out. (iii) The correlations in (i) and (ii) were shown in the same organism and determined under the same environmental conditions.
Note that our test accommodates the situation where both the trait and the gene are associated with latitude in the same way. In that case, their latitudinal covariation is due to an environmental factor(s) selecting concomitantly on both the gene and the trait; no correlation between them should persist once the latitude-dependent causal environmental factor(s) is accounted for. If the relationship between the gene and the trait is due to an underlying causal connection, then a significant correlation between them should persist independently of latitude.
Significant, positive results from analysis of residuals serve as a point of departure for future experiments but, in and of themselves, do not substitute for an understanding of the functional connection between genotype and phenotype (Kingsolver & Schemske 1991; Petraitis et al. 1996; Dalziel et al. 2009; Blackman 2010; Storz & Wheat 2010). Successful connections between molecular variation and functional phenotypes have been established (but only after additional study) both in model organisms such as Drosophila (Schmidt et al. 2008; McKechnie et al. 2010; Paaby et al. 2010) and Arabidopsis (Wilczek et al. 2009) and in natural populations of nonmodel organisms such as the house mosquito, Culex pipiens (Labbé et al. 2009), lizards (Rosenblum et al. 2010), and organisms cited by Storz and Wheat (2010) and Dalziel et al. (2009, their Appendix S1, Supporting information), including killifish, butterflies, garter snakes, deer mice, oldfield mice, three-spined stickleback, and Darwin’s finches.
With the advent of tractable molecular approaches in an increasing number of nonmodel organisms with interesting physiological or ecological backgrounds, there will be increasing impetus to ascribe an adaptive significance to molecular genetic variation. Because postglacial climate change has established many eco-climatic selection gradients across latitudes in nature, any correlation between molecular variation in SNPs, nonsynonymous substitutions or transcriptional profiles with latitude provides a tempting avenue for concluding an adaptive significance for the observed genetic variation. Instead of proposing untested suggestions or implications because of their inherent plausibility, investigators should first examine residuals as described herein. If nonsignificant, further discussion or speculation of the potential adaptive significance of their covariation is not warranted. If significant, then an inferred causal connection can be used as a platform from which to seek a functional connection between genotype, phenotype and, ultimately, fitness.