Testing for causality in covarying traits: genes and latitude in a molecular world


William E. Bradshaw, Fax: +1 541 346 2364; E-mail: mosquito@uoregon.edu


Many traits are assumed to have a causal (necessary) relationship with one another because of their common covariation with a physiological, ecological or geographical factor. Herein, we demonstrate a straightforward test for inferring causality using residuals from regression of the traits with the common factor. We illustrate this test using the covariation with latitude of a proxy for the circadian clock and a proxy for the photoperiodic timer in Drosophila and salmon. A negative result of this test means that further discussion of the adaptive significance of a causal connection between the covarying traits is unwarranted. A positive result of this test provides a point of departure that can then be used as a platform from which to determine experimentally the underlying functional connections and only then to discuss their adaptive significance.

With the advent of modern molecular techniques, increasing attention is being paid to nonmodel organisms for investigating the genetic basis of various phenotypes in physiological, ecological or geographical contexts. As genes are discovered that covary with an environmental parameter such as temperature, light or latitude, there is a natural temptation to ascribe causality to these correlations. However, correlations are only the tantalizing starting points for robust experimental designs and, in themselves, provide evidence for neither causality nor an underlying functional mechanism. Herein, we use covariation of traits with latitude to illustrate the problem of confounding causation and correlation over geographic gradients. We begin with a simple diagram:

[Diagram: A and B are each correlated with a common factor, C]

If A is correlated with C and B is correlated with C, then A will automatically be correlated with B. There follows the natural temptation to infer or conclude that A causes B, that is, that genetic variation in A constitutes the genetic basis of B. As an example, we consider the relationship between the circadian clock regulating daily activities of organisms and the photoperiodic timer regulating seasonal activities of organisms. This relationship has a long and contentious history (Tauber & Kyriacou 2001; Hazlerigg & Loudon 2008; Bradshaw & Holzapfel 2010; Saunders 2010; Koštál 2011), a legacy of Bünning’s (1936) proposition that the circadian clock formed the causal basis of photoperiodism. At the molecular level, a probabilistic cause between circadian rhythmicity and photoperiodism occurs in plants (Kobayashi & Weigel 2007; Wilczek et al. 2009) and in a long-established laboratory strain of Syrian hamsters (Shimomura et al. 1997; Lowrey et al. 2000). However, there are no examples where the circadian clock has been shown to be necessary, let alone sufficient, for regulating photoperiodic response in natural populations of any animal. Yet, elements of the circadian clock have been shown to vary with latitude, as have phenotypes of the photoperiodic timing mechanism (Fig. 1). Therein lies the problem: Covariation is not proof of causation.
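The point about a shared factor can be illustrated with a small simulation. The following is a minimal sketch under invented assumptions, not an analysis of any real data: the latitude range, slopes, noise levels and variable names are all hypothetical. A and B are each generated from C plus independent noise, so neither has any causal influence on the other, yet their raw correlation is substantial.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 2000

# C: hypothetical latitudes. A and B each depend on C plus independent noise.
# Neither A nor B appears in the other's equation: there is no causal link.
c = [rng.uniform(30.0, 70.0) for _ in range(n)]
a = [0.1 * ci + rng.gauss(0.0, 1.0) for ci in c]
b = [0.2 * ci + rng.gauss(0.0, 2.0) for ci in c]

r_ac = pearson_r(a, c)
r_bc = pearson_r(b, c)
r_ab = pearson_r(a, b)
print(f"r(A,C) = {r_ac:.2f}, r(B,C) = {r_bc:.2f}, r(A,B) = {r_ab:.2f}")
```

In this simulated case the correlation between A and B arises entirely through C, which is the situation the diagram describes.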

Figure 1.

Inference of causality between circadian rhythmicity and photoperiodism due to their common covariation with the independent variable, latitude. If allelic variation in a circadian gene is correlated with latitude and a proxy for the photoperiodic timer is correlated with latitude, the incorrect conclusion could be drawn that the circadian clock forms the causal basis of the photoperiodic timer, that is, that the circadian clock is necessary for or forms the mechanistic basis of photoperiodic time measurement. In fact, the circadian clock, the photoperiodic timer, and an endless array of other variables are correlated with latitude but are not necessarily causally connected.

The seasonal timing of life-history events, which is typically orchestrated by the photoperiodic timer, is correlated with latitude in both plants (Wilczek et al. 2009) and animals (Bradshaw & Holzapfel 2007). An increasing number of circadian-related genes are now known to vary with latitude in Neurospora (Michael et al. 2007), plants (Arabidopsis: Michael et al. 2003; Caicedo et al. 2004; Stinchcombe et al. 2004; Glycine: Zhang et al. 2008), Drosophila (Kyriacou et al. 2008; Rand et al. 2010), fish (O’Malley & Banks 2008, O’Malley et al. 2010), birds (Johnsen et al. 2007), and humans (Cruciani et al. 2008). Given the observation that both circadian genes and photoperiodically mediated seasonal traits vary with latitude, the tendency is to conclude a causal connection between the circadian clock and the photoperiodic timer based on their latitudinal covariation.

The covariation of two traits with latitude could indeed be due to a common causal mechanism (pleiotropy), in which case an interesting relationship has been established and the question then becomes resolving the mechanistic basis of their coevolution. However, while latitude usually and appropriately serves as a composite variable, latitudinal variation represents multiple environmental factors, any one or a combination of which could be exerting parallel selective forces. The covariation of two traits with latitude could be a result of different selective forces acting on the two traits, the same selective force acting on two genetically independent traits, or a single selective force acting on one trait accompanied by genetic hitchhiking of a closely linked trait (Li 1997, ch. 9; Schluter et al. 2004, 2010). Examination of the relationship between variables can be made using techniques described in Sokal and Rohlf (1995, ch. 16): partial correlation examines the relationship between two variables, while all the other correlated variables are held constant; path analysis incorporates simultaneously the contribution of several correlated variables. While useful, these statistics are complex, may suffer from collinearity of the independent variables (Petraitis et al. 1996), are not readily accessible in many statistical packages and heretofore have not incorporated discrete variables. We are proposing a more transparent test that requires little more than a hand calculator or an Excel spreadsheet and incorporates both linear regression and analysis of variance. Below, we provide examples from flies and fish to illustrate the simplicity and usefulness of the analysis of residuals to avoid a spurious conclusion of causation when only correlation exists.

When Y is regressed on X, the regression equation, $\hat{Y} = a + bX$, plots the regression line and $Y - \hat{Y}$ = deviations from regression (residuals). The residuals have zero correlation with X, that is, the effect of X on Y has been factored out. If A is a causal element of B and both are correlated with latitude, then even when the common element of latitude is factored out the residuals should still be correlated; if not, their common correlation with latitude is due to linkage or independent evolution and not due to a basic underlying causal relationship between A and B. When A or B is a discrete and not a continuous variable, the residuals are computed as deviations from mean latitude for each category of Y. Although applicable to the covariation or association of any two traits or processes with any environmental parameter, we continue with examples from the biological timing literature. To illustrate the test, we have chosen two specific examples because of their connection with latitude, because of the large number of sample populations over a wide latitudinal range, and because the numerical data were available in the source papers. This sort of analysis was not possible for most of the papers we read because either the sample size was too small or the tabular, numerical data from which figures were generated were not available either in the body of the text or in supplemental online material. The advent of requiring the posting of such data (Fairbairn 2011) will make subsequent verification via independent analysis tractable.
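For two continuous traits, the residual test described above might be sketched as follows. This is an illustrative sketch only: the function names and the numerical values are invented for the example and are not taken from any of the studies discussed, and no significance test is included. Each trait is regressed on latitude, the deviations from regression are retained, and the two sets of residuals are then correlated.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def regression_residuals(x, y):
    """Deviations of y from the least-squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def residual_correlation(common, trait_a, trait_b):
    """Correlate two traits after factoring the common variable out of each."""
    res_a = regression_residuals(common, trait_a)
    res_b = regression_residuals(common, trait_b)
    return pearson_r(res_a, res_b)


# Made-up values for illustration only; not data from any study.
latitude = [40.0, 44.0, 48.0, 52.0, 56.0, 60.0, 64.0, 68.0]
trait_a = [12.1, 12.9, 13.2, 14.1, 14.4, 15.3, 15.6, 16.5]
trait_b = [20.3, 20.1, 21.0, 21.2, 21.5, 22.4, 22.2, 23.1]

raw_r = pearson_r(trait_a, trait_b)
resid_r = residual_correlation(latitude, trait_a, trait_b)
print(f"raw r = {raw_r:.2f}; r of residuals = {resid_r:.2f}")
```

The same steps can be carried out in a spreadsheet; the code only makes the order of operations explicit.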

First, in Drosophila littoralis, Lankinen (1986) found significant correlations between latitude and a proxy for the photoperiodic timer (critical photoperiod necessary to induce adult diapause) and between latitude and the two most fundamental properties of any circadian rhythm (the period and amplitude of its oscillation) (Fig. 2). Insightfully, he factored out the common effect of latitude and showed that the residuals of critical photoperiod were no longer correlated with the residuals of either period or amplitude of the circadian eclosion rhythm. Hence, he proposed that their covariation with latitude was due to linkage and not a causal relationship between them. To verify this conclusion, Lankinen and Forsman (2006) crossed two extreme populations, allowed free recombination and then imposed selection for nondiapause on short days. The hybrid lines exhibited a more ‘southern’ photoperiodic response and a more ‘northern’, circadian-based eclosion rhythm than found in any of Lankinen’s original geographic strains (Fig. 3), that is, the reverse of what would have been expected had the circadian clock been a causal component of photoperiodism. These experiments confirmed Lankinen’s earlier conclusion (1986) that when the common effect of latitude was factored out, critical photoperiod was not correlated with either fundamental property of circadian rhythmicity. More generally, Lankinen and Forsman’s (2006) experiments confirmed the robustness of testing for a potentially causal connection between two traits by using residuals to factor out their common, correlated element.

Figure 2.

 Use of residuals to test for a causal relationship between circadian rhythmicity and photoperiodism in Drosophila littoralis. (Top) Latitudinal covariation in photoperiodic response (critical photoperiod) and two fundamental properties of the circadian clock, period and amplitude of the oscillation; (Bottom) lack of correlation between deviations from regression of critical photoperiod, period and amplitude on latitude. Any significant relationship between photoperiodic response and properties of the circadian clock is eliminated when the common element of latitude is factored out (plotted from Table 2 in Lankinen 1986). Details of analyses are provided in Appendix S1.

Figure 3.

Verification of analysis of residuals as a test for a causal relationship between photoperiodism and circadian rhythmicity by response to selection on critical photoperiod and period (τ) of the circadian oscillation in D. littoralis. A northern and a southern population were hybridized, maintained for eight generations on constant light (L:L) to allow free recombination, selected for nondiapause under short days (L:D = 12:12) for 30 generations, maintained in L:L for 10 generations, and the descendants of a full-sib pair maintained in L:L for a further six generations (plotted from data in Lankinen and Forsman 2006).

Second, in Chinook salmon, Oncorhynchus tshawytscha, O’Malley and Banks (2008) found a significant correlation between latitude and their proxy for the circadian clock (length of the polyglutamine repeat in the gene OtsClock1b, hereafter, Poly Q) (Fig. 4a). They also found a significant association between latitude and their proxy for the photoperiodic timer (run time = seasonal timing of upstream migration in freshwater) (Fig. 4b). O’Malley and Banks (2008, p. 2813) conclude with the suggestion ‘that length polymorphisms in OtsClock1b may be maintained by selection and reflect an adaptation to ecological factors correlated with latitude, such as the seasonally changing day length.’ After extending their correlative analyses to three more species of salmon (Oncorhynchus), O’Malley et al. (2010, p. 3705) state more boldly that the ‘Clock gene is a central component of an endogenous circadian clock that senses changes in photoperiod (day length) and mediates seasonal behaviours’. At the heart of the conclusion is the association between latitude, Poly Q and the timing of migration and spawning. This conclusion makes at least three essential but untested assumptions.

Figure 4.

Latitudinal covariation of mean OtsClk1b Poly Q domain length (Poly Q) and run (migration) time in Chinook salmon. (a) $r^2$ = coefficient of determination from the regression. The plot is redrawn from data from Table 3 in O’Malley and Banks (2008); n = 40, omitting the single ‘W’ and undefined ‘F’ runs as did O’Malley and Banks; (b-c) vertical lines show means; $r^2$ = reduction in total sum of squares from one-way ANOVA. Plots and analyses are based on the same data set as in (a). Details of analyses are provided in Appendix S2 (Supporting information).

First, the conclusion requires that a single genotype (high frequency of the 335 allele and concomitant low frequency of the 359 allele) of the Chinook Poly Q domain is the primary determinant of two different run times, spring and autumn (Fig. 4b), even in the same river. This assumption may or may not be true.

Second, the conclusion assumes that the salmon-specific OtsClock1b plays a functional role in salmon circadian rhythmicity. There are two Clock paralogs in salmon: OtsClock1a and OtsClock1b, only the latter of which shows a significant correlation with latitude. However, the assumption that OtsClock1b has the same functional role in salmon as its ortholog in the mammalian circadian clock (Baggs et al. 2009) is untested.

Third, the conclusion assumes that there is a causal relationship between the daily circadian clock and the seasonal photoperiodic timer. This assumption is at best contentious (Bradshaw & Holzapfel 2010; Saunders 2010; Koštál 2011) and has not been tested in any fish. There is then a great leap from observing a correlation between latitude and only the OtsClock1b paralog and a correlation between latitude and run time or spawning date to concluding that the circadian clock is responsible for the evolution of photoperiodism and, hence, seasonal timing (O’Malley et al. 2010).

Strictly for purposes of illustration, we assume the first two of the above three assumptions to be true. We then use Lankinen’s (1986) approach of analysing residuals to test for an association between Poly Q and run time by factoring out the effect of latitude on Poly Q. In this case, Poly Q is a continuous variable and run time is a discrete variable. We therefore calculated the residuals from regression of Poly Q on latitude (Fig. 4a) and performed one-way ANOVA of the residuals using run time as treatments. After factoring out the effect of latitude, run time accounted for a nonsignificant 7% of the residual variation in Poly Q (Fig. 4c). We therefore conclude that there is no basis to infer or suggest a causal relationship between them, either as a direct, independent effect of Poly Q on run time or as an indirect effect of Poly Q on the circadian clock. Further discussion of the adaptive significance of Poly Q in relation to run time is unwarranted, as is any speculation about a potential connection between the circadian clock and the seasonal photoperiodic timer. Future research might well be directed towards determining the function and adaptive significance of Clock1b in salmon in the context of the circadian clock itself, much as have other studies in diverse organisms (Yerushalmi & Green 2009).
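When one trait is continuous and the other discrete, the procedure described above (residuals from regression on latitude, followed by one-way ANOVA with the discrete trait as treatments) might be sketched as follows. This is a sketch under stated assumptions: the function names, the latitudes, the repeat lengths and the run labels are all invented for illustration and are not the data of O’Malley and Banks (2008). The F ratio uses the usual one-way ANOVA degrees of freedom and does not adjust for the degree of freedom already used by the regression.

```python
def regression_residuals(x, y):
    """Deviations of y from the least-squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def one_way_anova(values, labels):
    """One-way ANOVA of values grouped by labels.

    Returns (r2, f_ratio), where r2 is the proportion of the total sum of
    squares accounted for by differences among groups.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total = len(values)
    grand_mean = sum(values) / n_total
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between / ss_total, f_ratio


# Made-up values for illustration only; not data from any study.
latitude = [38.0, 40.0, 42.0, 44.0, 46.0, 48.0,
            50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
repeat_length = [340.1, 341.0, 340.6, 342.2, 342.0, 343.1,
                 343.4, 344.6, 344.1, 345.3, 345.2, 346.4]
run = ["spring", "autumn", "summer", "spring", "autumn", "summer",
       "spring", "autumn", "summer", "spring", "autumn", "summer"]

residuals = regression_residuals(latitude, repeat_length)
r2, f_ratio = one_way_anova(residuals, run)
print(f"run accounts for {100 * r2:.0f}% of residual variation (F = {f_ratio:.2f})")
```

Whether the resulting F ratio is significant would then be judged against the F distribution in the usual way, for example from a table or a statistical package.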

Hence, we propose that, before inferring a causal relationship in similar cases of covariation of two or more traits with a third physiological or ecological independent variable, a straightforward analysis of deviations from the common independent variable be used. Absent a significant association, no causal relationship should be inferred or suggested. Even an inference of a causal relationship would be reasonable only if all of the following were true: (i) Variation in each trait is significantly correlated with a third common element, in our case, with latitude. (ii) The significant correlation between the two traits persists after the effect of the common element is factored out. (iii) The correlations in (i) and (ii) were shown in the same organism and determined under the same environmental conditions.

Note that our test accommodates the situation where both the trait and the gene are associated with latitude in the same way. In that case, their latitudinal covariation is due to an environmental factor(s) selecting concomitantly on both the gene and the trait; no correlation between them should persist once the latitude-dependent causal environmental factor(s) is accounted for. If the relationship between the gene and the trait is due to an underlying causal connection, then a significant correlation between them should persist independently of latitude.
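The contrast drawn in this paragraph can be made concrete with a simulation. The sketch below rests on invented assumptions throughout (hypothetical latitude range, slopes, noise levels and variable names) and is not a model of any real organism. In the first scenario a latitude-dependent factor acts on the gene and the trait separately; in the second the trait additionally depends directly on the gene.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def regression_residuals(x, y):
    """Deviations of y from the least-squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def residual_r(common, x, y):
    """Correlation of x and y after factoring the common variable out of each."""
    return pearson_r(regression_residuals(common, x),
                     regression_residuals(common, y))


rng = random.Random(7)  # fixed seed so the sketch is repeatable
n = 2000
lat = [rng.uniform(30.0, 70.0) for _ in range(n)]

# Scenario 1 (hypothetical): latitude acts on gene and trait independently.
gene_1 = [0.1 * x + rng.gauss(0.0, 1.0) for x in lat]
trait_1 = [0.2 * x + rng.gauss(0.0, 2.0) for x in lat]

# Scenario 2 (hypothetical): the trait also depends directly on the gene.
gene_2 = [0.1 * x + rng.gauss(0.0, 1.0) for x in lat]
trait_2 = [0.2 * x + 1.5 * g + rng.gauss(0.0, 2.0)
           for x, g in zip(lat, gene_2)]

r_resid_1 = residual_r(lat, gene_1, trait_1)
r_resid_2 = residual_r(lat, gene_2, trait_2)
print(f"scenario 1: raw r = {pearson_r(gene_1, trait_1):.2f}, "
      f"residual r = {r_resid_1:.2f}")
print(f"scenario 2: raw r = {pearson_r(gene_2, trait_2):.2f}, "
      f"residual r = {r_resid_2:.2f}")
```

Under these assumptions both scenarios show a strong raw correlation, but only the second retains an appreciable correlation between residuals once latitude is factored out.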

Significant, positive results from analysis of residuals serve as a point of departure for future experiments but, in and of themselves, do not substitute for an understanding of the functional connection between genotype and phenotype (Kingsolver & Schemske 1991; Petraitis et al. 1996; Dalziel et al. 2009; Blackman 2010; Storz & Wheat 2010). Successful connections between molecular variation and functional phenotypes have been established (but only after additional study) both in model organisms such as Drosophila (Schmidt et al. 2008; McKechnie et al. 2010; Paaby et al. 2010) or Arabidopsis (Wilczek et al. 2009) and in natural populations of nonmodel organisms such as the house mosquito, Culex pipiens (Labbé et al. 2009), lizards (Rosenblum et al. 2010), and organisms cited by Storz and Wheat (2010) and Dalziel et al. (2009; their Appendix S1, Supporting information), including killifish, butterflies, garter snakes, deer mice, oldfield mice, three-spined stickleback, and Darwin’s finches.

With the advent of tractable molecular approaches in an increasing number of nonmodel organisms with interesting physiological or ecological backgrounds, there will be increasing impetus to ascribe an adaptive significance to molecular genetic variation. Because postglacial climate change has established many eco-climatic selection gradients across latitudes in nature, any correlation between molecular variation in SNPs, nonsynonymous substitutions or transcriptional profiles with latitude provides a tempting avenue for concluding an adaptive significance for the observed genetic variation. Instead of proposing untested suggestions or implications because of their inherent plausibility, investigators should first examine residuals as described herein. If nonsignificant, further discussion or speculation of the potential adaptive significance of their covariation is not warranted. If significant, then an inferred causal connection can be used as a platform from which to seek a functional connection between genotype, phenotype and, ultimately, fitness.


We thank Kathleen O’Malley, Lisa Crozier, William Cresko, Vincent Cassone, and David Hazlerigg for useful discussion, the reviewers and Nolan Kane for their thoughtful comments, the National Science Foundation through grants DEB-0917827 and IOB-083998 to WEB and grant IOS-064226 to WA Cresko, and the National Institutes of Health through grant 1R24GM079486-01A1 to WA Cresko.

C.O.B.’s research deals with the genetic foundations of photoperiodism and hormonal events regulating seasonal sexual maturation in the Threespine Stickleback, Gasterosteus aculeatus. Dr. W.B. and Dr. C.H. pursue the general question: “How has the genetic differentiation of populations actually taken place in nature?” Their diverse backgrounds blend physiology, development, quantitative and molecular genetics, and second generation high throughput genomics to answer questions ranging from the evolutionary consequences of climate change, the central processing and genetic control of biting in mosquitoes, to seasonal regulation and biological timing in the pitcher-plant mosquito, Wyeomyia smithii.