#### Common garden data

Common garden data on metamorphosed juveniles from earlier studies (Laugen *et al.*, 2003b, 2005a; Palo *et al.*, 2003) were analysed to separate the contributions of additive, dominance, maternal and environmental effects to variation in body size–corrected leg length in six populations (Table 1) across the latitudinal gradient. The same three leg traits were used as for the wild adults: femur length, tibia length and the ratio of femur to tibia length. We used the data to estimate heritability based on the additive genetic component and to test for possible latitudinal divergence in the traits. As described in the following paragraphs, the larvae were reared in three temperature and two food treatments, allowing us to test for genotype–environment interactions and thus for possible latitudinal genetic divergence in phenotypic plasticity of the traits. Details of the common garden experiments have been published previously (Laugen *et al.*, 2003b, 2005a; Palo *et al.*, 2003); the pertinent parts are briefly restated, and the leg length measurements described, in the following paragraphs.

Tadpoles for the common garden experiments were produced in artificial laboratory crosses of adult frogs caught at spawning sites at the beginning of the breeding season. A North Carolina type II breeding design (Lynch & Walsh, 1998) was used, except for the Ammarnäs population, for which eight freshly laid spawn clutches were collected from the wild. Except for the Umeå and Ammarnäs populations, 16 maternal half-sib families (i.e. 32 full-sib families) were created where eggs from each of eight females were fertilized by sperm from four of the 16 males. The Umeå tadpoles came from a similar design, but for this population two sets of 16 maternal half-sib families were created on different fertilization dates. Because of the large difference in the onset of spawning along the latitudinal gradient (Merilä *et al.*, 2000), the starting dates among the other populations also differed. The fertilizations for the southernmost population (Lund) were performed on 9 April 1998, whereas the corresponding date for the northernmost population (Kilpisjärvi) was 4 June 1998. However, the rearing conditions were the same for all populations. The crosses were carried out following the principles outlined in Laugen *et al.* (2002).

The eggs were divided among three temperature treatments (14, 18 and 22 ± 1 °C, two bowls per cross at each temperature), in which they were kept until Gosner stage 25 (Gosner, 1960). Water was changed every third day during embryonic development. When most of the embryos in a given temperature treatment had reached Gosner stage 25, eight randomly chosen tadpoles from each cross were placed individually in 0.9-L opaque plastic containers at each of two food levels (restricted and *ad libitum*). This procedure was repeated for each population in the three temperature treatments, resulting in 48 experimental tadpoles per cross. However, because of mortality during the experiment, the final number of tadpoles per family was typically fewer than 48 (Table 1).

Every seventh day, the tadpoles were fed a finely ground 1 : 3 mixture of fish flakes (TetraMin; Ulrich Baensch GmbH, Melle, Germany) and rodent pellets (AB Joh. Hansson, Uppsala, Sweden). Each tadpole received 15 mg (restricted) or 45 mg (*ad libitum*) in the first week, 30 or 90 mg in the second week, and 60 or 180 mg per week thereafter until metamorphosis. The *ad libitum* level was chosen so that individuals did not consume all the food before the next feeding event at any of the temperature treatments. In the restricted food treatment, the tadpoles at the two highest temperatures consumed all of their food before the next feeding, indicating food limitation, but at the lowest temperature the tadpoles frequently had food left even after 7 days. The tadpoles were raised in dechlorinated tap water that was aerated and aged for at least 24 h before use. The water was changed every seventh day in conjunction with feeding. The light rhythm was 16L : 8D. As the rearing of the tadpoles continued from mid-April to late August, temperatures were measured in the laboratories at fixed locations twice a day throughout the experiment to check that the water temperature did not change over time. There was no temperature change over time in any of the laboratories (see Laugen *et al.*, 2005a).
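The design arithmetic and the feeding schedule described above can be sketched in a few lines of Python. This is only an illustrative summary of the numbers given in the text; the function names are hypothetical, and the assumption that every male was used in the same number of crosses is ours, not a statement from the original protocol.

```python
# Design constants as stated in the text (populations other than Umeå and Ammarnäs).
N_FEMALES = 8
MALES_PER_FEMALE = 4
N_MALES = 16
TEMPERATURES_C = (14, 18, 22)
FOOD_LEVELS = ("restricted", "ad libitum")
TADPOLES_PER_CELL = 8  # per cross, per temperature x food combination


def design_counts():
    """Return the family and tadpole counts implied by the crossing design."""
    full_sib_families = N_FEMALES * MALES_PER_FEMALE
    # Assumption: crosses are spread evenly over the 16 males.
    crosses_per_male = full_sib_families // N_MALES
    tadpoles_per_cross = len(TEMPERATURES_C) * len(FOOD_LEVELS) * TADPOLES_PER_CELL
    return {
        "full_sib_families": full_sib_families,
        "crosses_per_male": crosses_per_male,
        "tadpoles_per_cross": tadpoles_per_cross,
        "tadpoles_per_population": full_sib_families * tadpoles_per_cross,
    }


def weekly_ration_mg(week, food_level):
    """Food per tadpole (mg) in a given rearing week (1 = first week)."""
    if week < 1:
        raise ValueError("week must be 1 or larger")
    restricted = {1: 15, 2: 30}.get(week, 60)
    if food_level == "restricted":
        return restricted
    if food_level == "ad libitum":
        # 45, 90 and 180 mg: three times the restricted ration in every week.
        return 3 * restricted
    raise ValueError("food_level must be 'restricted' or 'ad libitum'")


if __name__ == "__main__":
    print(design_counts())
    print([weekly_ration_mg(w, "ad libitum") for w in (1, 2, 3, 4)])
```

The counts returned are starting numbers; as noted above, mortality reduced the realized family sizes (Table 1).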

When metamorphosis in a given population was anticipated to start, the vials were checked once a day. Metamorphosed frogs (Gosner stage 42) were weighed, and age at metamorphosis (in days since Gosner stage 25) was recorded. The water level in the vials was reduced, and the metamorphs were allowed to absorb their tails before being anaesthetized and killed with an overdose of MS-222. They were then frozen at −20 °C. Leg measurements were later taken from the thawed metamorphs by measuring their right tibia and femur lengths under a stereomicroscope with the aid of digital callipers. Snout–vent length was measured similarly. All measurements were taken by one person, blind with respect to the identity of the samples, and recorded to the nearest 0.1 mm.

#### Statistical analysis

We calculated the repeatability of femur and tibia lengths and the ratio of femur to tibia length, both for the adults caught from the wild and for the juveniles reared in the common garden, following Lessells & Boag (1987). In addition, repeatability was calculated for the snout–vent length of the juveniles. In short, a one-way analysis of variance was fitted using the functions *lm* and *anova* in R (R Development Core Team, 2009), and the repeatability (*r*) for each trait was derived as:

$$r = \frac{s_A^2}{s^2 + s_A^2} \tag{1}$$

where *s*^{2} was the within-individual mean squares (MS_{W}) and *s*_{A}^{2}, the among-individual variance component, was calculated from

$$s_A^2 = \frac{\mathrm{MS}_A - \mathrm{MS}_W}{n} \tag{2}$$

where MS_{A} was the among-individual mean squares, MS_{W} the within-individual mean squares and *n* the number of measurements per individual, i.e. two. Ninety-five percent confidence intervals for the repeatability estimates were obtained by nonparametric bootstrap, resampling the data 5000 times.
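As an illustration of the calculation, the following Python sketch computes the Lessells & Boag (1987) repeatability from the one-way ANOVA mean squares and attaches a percentile bootstrap interval. It is not the R code used in the study: the function names and the example measurements are hypothetical, and resampling whole individuals (rather than single measurements) is our assumption about how the bootstrap was set up.

```python
import random


def mean_squares(measurements):
    """One-way ANOVA mean squares for k individuals measured n times each.

    `measurements` is a list of equal-length tuples, one tuple per individual.
    Returns (MS_A, MS_W, n).
    """
    k = len(measurements)
    n = len(measurements[0])
    grand_mean = sum(sum(ind) for ind in measurements) / (k * n)
    ind_means = [sum(ind) / n for ind in measurements]
    ss_among = n * sum((m - grand_mean) ** 2 for m in ind_means)
    ss_within = sum((x - m) ** 2
                    for ind, m in zip(measurements, ind_means) for x in ind)
    return ss_among / (k - 1), ss_within / (k * (n - 1)), n


def repeatability(measurements):
    """r = s_A^2 / (s^2 + s_A^2), with s^2 = MS_W and s_A^2 = (MS_A - MS_W) / n."""
    ms_a, ms_w, n = mean_squares(measurements)
    s2_among = (ms_a - ms_w) / n
    return s2_among / (ms_w + s2_among)


def bootstrap_ci(measurements, n_boot=5000, seed=1):
    """Percentile 95% interval; individuals are resampled with replacement."""
    rng = random.Random(seed)
    k = len(measurements)
    estimates = sorted(
        repeatability([measurements[rng.randrange(k)] for _ in range(k)])
        for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


if __name__ == "__main__":
    # Hypothetical duplicate femur measurements (mm) for six individuals.
    demo = [(10.1, 10.2), (12.0, 11.9), (9.5, 9.6),
            (11.2, 11.3), (13.1, 13.0), (10.8, 10.9)]
    print(repeatability(demo), bootstrap_ci(demo))
```

Because the among-individual component is estimated by subtraction, the estimate can be negative when the two measurements of an individual differ more than individuals differ from each other.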

The repeatabilities for all traits were generally high. For wild-collected adults, they were 0.98 [95% confidence intervals (CI): 0.97–1.00] for femur length, 1.00 (95% CI: 0.99–1.00) for tibia length and 0.80 (95% CI: 0.76–0.98) for the ratio of femur to tibia length. For juveniles reared in the common garden, the repeatabilities were 0.98 (95% CI: 0.97–0.99) for snout–vent length, 0.92 (95% CI: 0.85–0.96) for femur length, 0.98 (95% CI: 0.96–0.99) for tibia length and 0.61 (95% CI: 0.25–0.79) for the ratio of femur to tibia length.

We used a linear mixed model to investigate variation in femur length, tibia length and their ratio in wild-collected adults. We incorporated snout–vent length as a covariate to correct for variation in age and body size, and latitude and its square as further covariates to test for a latitudinal effect on relative extremity length. Sex was included as a fixed effect and population as a random effect to account for the varying sample sizes and the nonindependence of the data. Finally, we included the interactions between sex and latitude, sex and the square of latitude, and sex and snout–vent length in the models. The models were fitted in R with the *lmer* function of the *lme4* extension package (available through CRAN; http://cran.r-project.org). *P*-values are not available for the fixed effects of linear mixed models in R because they involve unresolved statistical issues; we therefore obtained 95% highest posterior density intervals (HPDI) for the parameter estimates by Markov chain Monte Carlo (MCMC) methods using the functions *mcmcsamp* and *HPDinterval* of the *lme4* package.
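Written out, the model described above has roughly the following form for individual *i* in population *j*. The notation is ours and serves only as a sketch: the coefficient symbols, the coding of sex as an indicator variable and the normality assumptions for the random terms are not taken from the original analysis.

$$
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1\,\mathrm{SVL}_{ij} + \beta_2\,\mathrm{lat}_j + \beta_3\,\mathrm{lat}_j^2 + \beta_4\,\mathrm{sex}_{ij} \\
&+ \beta_5\,\mathrm{sex}_{ij}\,\mathrm{lat}_j + \beta_6\,\mathrm{sex}_{ij}\,\mathrm{lat}_j^2 + \beta_7\,\mathrm{sex}_{ij}\,\mathrm{SVL}_{ij} + u_j + \varepsilon_{ij}, \\
u_j \sim{}& N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)
\end{aligned}
$$

Here *y* is femur length, tibia length or their ratio, SVL is snout–vent length, lat is the latitude of the population, *u* is the random population effect and *ε* the residual. A latitudinal effect on relative leg length would show up in the coefficients of latitude and its square, because body size is already accounted for by the SVL term.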

A Bayesian univariate hierarchical model (Gelman *et al.*, 2004) was constructed for femur and tibia lengths and for the ratio of femur to tibia length measured in the common garden environment. For comparison, a similar model was fitted to the snout–vent length data. The model allowed us to estimate simultaneously the among- and within-population genetic variances and heritabilities of the traits, accounting for the different quantitative genetic variance components, and the degree of population differentiation as measured by *Q*_{ST} (see e.g. Merilä & Crnokrak, 2001; Leinonen *et al.*, 2008). In addition, it allowed us to estimate the correlation of population effects with latitude and the correlations of pairwise *Q*_{ST} values with pairwise *F*_{ST} estimates (divergence in neutral molecular marker loci; see e.g. Merilä & Crnokrak, 2001; Leinonen *et al.*, 2008) and with the physical distances separating the populations. We obtained estimates for the parameters of interest from the joint posterior distribution by MCMC simulation (Gelman *et al.*, 2004) using OpenBUGS version 3.0.3 (Lunn *et al.*, 2009). The estimates were summarized as posterior means and 95% credible intervals (CI), i.e. Bayesian confidence intervals. For each trait, we ran three chains of 150 000 iterations each and thinned them by five. The first 10 000 of the 30 000 thinned iterations of each chain were discarded as burn-in. The convergence and mixing of the chains were checked visually.
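The chain bookkeeping and the posterior summaries can be sketched as follows. The helper names are hypothetical, and the use of an equal-tailed interval is an assumption: the text does not say whether the reported credible intervals were equal-tailed or highest-density.

```python
def retained_draws(n_chains=3, iterations=150_000, thin=5, burn_in_thinned=10_000):
    """Number of posterior draws left after thinning and burn-in removal."""
    thinned_per_chain = iterations // thin
    return n_chains * (thinned_per_chain - burn_in_thinned)


def summarize(draws):
    """Posterior mean and equal-tailed 95% interval of a list of draws."""
    ordered = sorted(draws)
    n = len(ordered)
    mean = sum(ordered) / n
    lower = ordered[int(0.025 * n)]
    upper = ordered[min(n - 1, int(0.975 * n))]
    return mean, lower, upper


if __name__ == "__main__":
    print(retained_draws())           # draws pooled over the three chains
    print(summarize(list(range(1000))))
```

With the settings given in the text, each chain contributes 20 000 draws after burn-in, so every posterior summary rests on 60 000 pooled draws.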

The Bayesian model was similar to the one used by Palo *et al.* (2003). It was a linear mixed effects model with treatment combination (temperature, food availability)-specific means. The means were given vague normal priors *N*(*μ*, *σ*^{2}), where *μ* was the observed trait mean and *σ*^{2} the prior variance (0.1 for the ratio of femur to tibia length, 10 for the other traits). Although laboratory blocks were shared between populations and the environmental conditions in principle remained physically the same throughout, we included block as a population-specific fixed effect because the populations did not overlap completely in time owing to their different fertilization dates. In the case of the Umeå population, we used separate block effects for the two fertilization dates. The effect of the first block within each population and fertilization group was fixed to zero, and the other block effects were given vague normal priors *N*(0, 100), where the second parameter is the variance. Snout–vent length (mm) and age were included as linear covariates after subtracting their means, so that the population and treatment combination–specific means were defined for individuals of average length (15.7 mm) and age (48.9 days). In the analysis of snout–vent length, snout–vent length itself was omitted from the explanatory part of the model. The regression coefficients of snout–vent length and age were given vague normal priors *N*(0, 10) and *N*(0, 0.1), respectively, except in the case of the femur to tibia length ratio, where the regression coefficient of snout–vent length was also given the normal prior *N*(0, 0.1). Population, dam, sire and family were included as population and treatment combination–specific random effects. The variances of dam, sire and family effects and the residual variance were modelled in terms of the underlying variance components (Lynch & Walsh, 1998):

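$$\sigma^2_{\mathrm{dam}} = \tfrac{1}{4}V_A + V_M \tag{3}$$

$$\sigma^2_{\mathrm{sire}} = \tfrac{1}{4}V_A \tag{4}$$

$$\sigma^2_{\mathrm{family}} = \tfrac{1}{4}V_D \tag{5}$$

$$\sigma^2_{\mathrm{residual}} = \tfrac{1}{2}V_A + \tfrac{3}{4}V_D + V_\varepsilon \tag{6}$$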

where *V*_{A} is the within-population additive genetic, *V*_{M} the maternal, *V*_{D} the dominance and *V*_{ε} the microenvironmental variance. The variance of population effects corresponded to the among-population additive genetic variance *V*_{population}, derived from *Q*_{ST} as described below. All variance components were defined to be population and treatment combination specific but subscripts indicating this have been omitted from the notation for simplicity.
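The link between the causal variance components and the variances of the random effects can be illustrated with a small Python function. The mapping used here is the standard expectation for a North Carolina II (factorial) design without epistasis, which we assume corresponds to the parameterization referred to above; the function name and the example values are hypothetical.

```python
def observational_variances(v_a, v_m, v_d, v_e):
    """Map causal components to random-effect variances (NC II, no epistasis).

    v_a: additive, v_m: maternal, v_d: dominance, v_e: microenvironmental variance.
    """
    return {
        "sire": v_a / 4,
        "dam": v_a / 4 + v_m,
        "family": v_d / 4,  # sire x dam interaction
        "residual": v_a / 2 + 3 * v_d / 4 + v_e,
    }


if __name__ == "__main__":
    components = observational_variances(v_a=4.0, v_m=1.0, v_d=4.0, v_e=2.0)
    print(components)
    # The four variances add up to the total phenotypic variance.
    print(sum(components.values()))
```

The four random-effect variances sum to *V*_{A} + *V*_{M} + *V*_{D} + *V*_{ε}, so the parameterization partitions the total phenotypic variance within a population and treatment combination.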

To obtain flat priors for *Q*_{ST} values, we parameterized the model in terms of treatment combination–specific *Q*_{ST} rather than among-population additive genetic variance. Thus, *Q*_{ST} was given a uniform prior *U*(0, 1) and the treatment combination–specific among-population additive genetic variances *V*_{population} were calculated from:

$$V_{\mathrm{population}} = \frac{2\,Q_{ST}\,\mu_{V_A}}{1 - Q_{ST}} \tag{7}$$

where *μ*_{VA} is the across-population mean of the additive genetic variances *V*_{A}. The other variance components (*V*_{A}, *V*_{M}, *V*_{D} and *V*_{ε}) were given gamma priors Gamma(0.001, 0.001). Heritability (*h*^{2}) was calculated as *V*_{A}/(*V*_{A} + *V*_{M} + *V*_{D} + *V*_{ε}) for all populations and in all treatment combinations and summarized as the mean *h*^{2} across all populations and all treatment combinations.
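A short Python sketch shows how the *Q*_{ST} parameterization and the heritability calculation fit together. It assumes the usual definition *Q*_{ST} = *V*_{population}/(*V*_{population} + 2*V*_{A}) with *V*_{A} replaced by its across-population mean; the function names and numerical values are hypothetical.

```python
def v_population(qst, mean_v_a):
    """Among-population additive variance implied by a given Q_ST."""
    if not 0.0 <= qst < 1.0:
        raise ValueError("qst must lie in [0, 1)")
    return 2.0 * qst * mean_v_a / (1.0 - qst)


def qst_from_variances(v_pop, mean_v_a):
    """Inverse mapping: Q_ST from among- and within-population variances."""
    return v_pop / (v_pop + 2.0 * mean_v_a)


def heritability(v_a, v_m, v_d, v_e):
    """Narrow-sense heritability as defined in the text."""
    return v_a / (v_a + v_m + v_d + v_e)


if __name__ == "__main__":
    v_pop = v_population(qst=0.5, mean_v_a=2.0)
    print(v_pop, qst_from_variances(v_pop, mean_v_a=2.0))
    print(heritability(v_a=1.0, v_m=1.0, v_d=1.0, v_e=1.0))
```

Placing the uniform prior on *Q*_{ST} and deriving the variance from it keeps the prior flat on the quantity of interest; a vague prior placed directly on the among-population variance would not be flat on the *Q*_{ST} scale.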

The data collected from the Ammarnäs population consisted of full-sib families rather than half-sib families (Table 1), and the variance components for this population were thus confounded. However, because the parameters of interest were estimated from the joint posterior distribution, the results remained valid, and the confounding showed up only as possibly wider CIs for this population.

Pairwise *Q*_{ST} values, i.e. *Q*_{ST} values for each pair of populations, were calculated from:

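$$Q_{ST,\mathrm{pairwise}} = \frac{V_{\mathrm{pairwise}}}{V_{\mathrm{pairwise}} + 2\,\mu_{V_A}} \tag{8}$$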

where *V*_{pairwise} was the variance of the estimates of the population means of the two populations.

*F*_{ST} (see, e.g. Merilä & Crnokrak, 2001; Leinonen *et al.*, 2008) was used to measure the degree of population divergence in neutral marker loci. We used the overall and pairwise *F*_{ST} estimates published by Palo *et al.* (2003) for our study populations, which are based on eight presumably neutral microsatellite loci. Correlations between pairwise estimates of *Q*_{ST}, *F*_{ST} and geographical distance were calculated from the posterior distribution for all treatment combinations and traits, using the odds [*p*/(1 − *p*)] of the pairwise *Q*_{ST} and *F*_{ST} values. The *F*_{ST} values were simulated from the pairwise estimates and associated 95% confidence intervals from Palo *et al.* (2003), assuming normality. The correlations were equivalent to the ones calculated in Mantel tests, and the CIs of the correlation coefficients took into account the correlations between pairwise *Q*_{ST} estimates (Palo *et al.*, 2003). The correlations between pairwise *Q*_{ST} and geographical distance (*r*_{QST}) and between pairwise *F*_{ST} and distance (*r*_{FST}) were used to assess population divergence and the role of selection. A significant positive *r*_{QST} would suggest that relative leg length differs genetically between populations, and *r*_{QST} > *r*_{FST} would be evidence that natural selection is a stronger force than genetic drift in driving the divergence. Furthermore, the difference between overall *Q*_{ST} and *F*_{ST} in each treatment combination was calculated. A difference significantly greater than zero would also suggest divergent selection. Finally, for each treatment combination, we calculated the correlation between population effects and latitude. A significant correlation would indicate latitudinal divergence.
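The correlation step can be sketched in Python as follows. For every posterior draw of the pairwise *Q*_{ST} values, one set of pairwise *F*_{ST} values is simulated from a normal distribution, both are converted to odds, and their Pearson correlation is stored. All names are hypothetical; deriving the standard deviation as the CI width divided by 2 × 1.96 and clamping simulated values to the open interval (0, 1) are our assumptions, not details given in the text.

```python
import math
import random


def odds(p):
    """Odds p / (1 - p) of a proportion strictly between 0 and 1."""
    return p / (1.0 - p)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sd_from_ci95(lower, upper):
    """Standard deviation implied by a symmetric normal 95% interval."""
    return (upper - lower) / (2 * 1.96)


def correlation_draws(qst_draws, fst_estimates, fst_ci95, seed=1):
    """One Q_ST-F_ST correlation (odds scale) per posterior draw.

    qst_draws: list of draws, each a list of pairwise Q_ST values.
    fst_estimates, fst_ci95: pairwise F_ST point estimates and (lower, upper) CIs.
    """
    rng = random.Random(seed)
    correlations = []
    for qst in qst_draws:
        fst = []
        for estimate, (lower, upper) in zip(fst_estimates, fst_ci95):
            value = rng.gauss(estimate, sd_from_ci95(lower, upper))
            # Assumption: keep simulated values inside (0, 1) so odds are defined.
            fst.append(min(max(value, 1e-6), 1.0 - 1e-6))
        correlations.append(pearson([odds(q) for q in qst],
                                    [odds(f) for f in fst]))
    return correlations
```

Summarizing the returned list by its mean and 95% interval gives a correlation estimate whose uncertainty reflects both the posterior of *Q*_{ST} and the sampling error of *F*_{ST}. The same function applies to geographical distance if the simulated *F*_{ST} values are replaced by fixed distances.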

The fit of the Bayesian model was checked by eye and by a formal test. Visual examination of the residuals plotted against the fitted values revealed a weak positive trend which, however, was deemed to have no practical effect on the analysis. This conclusion was supported by formal testing, which found no evidence of lack of model fit: the Bayesian *P*-value for the χ^{2} discrepancy test was 0.80 for femur length, 0.79 for tibia length, 0.73 for the ratio of femur to tibia length and 0.78 for snout–vent length.
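A Bayesian *P*-value of this kind is typically obtained by posterior predictive checking (Gelman *et al.*, 2004). The Python sketch below shows the general idea under simplifying assumptions that are ours: a normal likelihood with a single residual standard deviation per draw, and a χ^{2} discrepancy defined as the sum of squared standardized residuals. The exact discrepancy used in the analysis is not specified above, and the function names are hypothetical.

```python
import random


def chi2_discrepancy(y, mu, sigma):
    """Sum of squared standardized deviations of y from its expectation."""
    return sum(((obs - exp) / sigma) ** 2 for obs, exp in zip(y, mu))


def bayesian_p_value(y, posterior_draws, seed=1):
    """Share of draws in which replicated data fit worse than the observed data.

    posterior_draws: list of (mu, sigma) pairs, where mu is the list of fitted
    values for all observations and sigma the residual standard deviation.
    """
    rng = random.Random(seed)
    exceed = 0
    for mu, sigma in posterior_draws:
        # Simulate a replicate data set from the model at this posterior draw.
        y_rep = [rng.gauss(m, sigma) for m in mu]
        if chi2_discrepancy(y_rep, mu, sigma) >= chi2_discrepancy(y, mu, sigma):
            exceed += 1
    return exceed / len(posterior_draws)
```

Values close to 0 or 1 signal that the observed data are unusual relative to data generated by the fitted model, whereas intermediate values give no indication of lack of fit.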