Analyses of genome-wide association studies (GWASs) have reminded researchers that seemingly homogeneous populations often turn out to be structured, and have shown how easily such structure can generate false-positive associations, even in population-specific case–control studies (see the example from genetics below). Testing and compensating for structure has become an indispensable part of complex genetics but seems to have been largely ignored in efforts to explore the possible role of environmental factors in complex traits. In our analysis of publicly available birth records, we confirm that birth rate is heterogeneous in the general population, with substantial and significant differences both between and within populations. It follows that birth rates calculated by the unweighted averaging of available population statistics are unlikely to provide appropriate controls for studies of specific diseases, where case collections are invariably heterogeneous with respect to year of birth and regional origin. Although it seems logical to average over many observations to generate more reliable control estimates, the mean birth rates obtained by such a process are a weighted average of the seasonality present in the population and not an estimate of some fundamental underlying rate. This weighted mean can only safely be compared with cases that have the same structure. Spurious differences will arise if birth rates are calculated for cases that have a different distribution across the subgroups that make up the general population. Three groups[13, 17, 18] have considered unaffected siblings as controls in an attempt to avoid such problems. However, although these related controls are inevitably much better matched for regional influences, they are necessarily unmatched for year of birth and are invariably limited in size and therefore underpowered (see Supplementary Information and Fig 4).
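The weighted-average argument above can be illustrated with a minimal sketch. All strata, rates, and counts below are hypothetical values chosen for illustration and are not taken from the birth records analyzed here. The cases are assigned exactly the same stratum-specific spring birth proportion as the general population, so there is no true effect, and the only difference is how the cases are distributed across strata.

```python
def pooled_rate(rates, weights):
    """Weighted mean of stratum-specific rates; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(rates[s] * weights[s] for s in rates) / total

# Hypothetical proportion of births falling in spring, by (region, birth decade).
spring_rate = {
    ("north", 1950): 0.270,
    ("north", 1970): 0.260,
    ("south", 1950): 0.255,
    ("south", 1970): 0.245,
}

# Hypothetical number of births in the general population per stratum.
population_births = {
    ("north", 1950): 200_000,
    ("north", 1970): 150_000,
    ("south", 1950): 400_000,
    ("south", 1970): 350_000,
}

# Hypothetical case collection, over-representing northern and older births.
case_counts = {
    ("north", 1950): 400,
    ("north", 1970): 150,
    ("south", 1950): 300,
    ("south", 1970): 150,
}

# Control estimate obtained from pooled national statistics.
national = pooled_rate(spring_rate, population_births)
# Expected spring proportion in cases when there is no true effect:
# the same stratum rates, weighted by the case distribution.
matched = pooled_rate(spring_rate, case_counts)
# Apparent excess of spring births produced by structure alone.
spurious_excess = matched - national

print(f"national control rate: {national:.4f}")
print(f"expected rate in cases: {matched:.4f}")
print(f"spurious excess: {spurious_excess:.4f}")
```

In this toy setting the appropriate comparison for the cases is `matched`, not `national`; the gap between the two is what a study using pooled national statistics as controls would misread as an association.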
The confounding effects described here are a consequence of the substantial geographical and temporal variation in birth rate that is present in the general population; they are not specific to multiple sclerosis and have the potential to generate false-positive associations with month of birth in any study where cases are inadequately matched, regardless of the phenotype. To date there are >500 reports relating month of birth to the etiology of a complex trait. The list includes various autoimmune diseases (celiac disease, diabetes, Graves disease, Hashimoto disease, inflammatory bowel disease, rheumatoid arthritis, and systemic lupus erythematosus), mental health disorders (schizophrenia and suicide), and health-related traits (birth weight). Because almost all of these studies are based on traits that are known to vary in frequency geographically, and they have invariably used averaged national statistics as their source of controls, it is likely that many of these apparent associations resulted from confounding rather than any true biological effect.
The extent of variation in the seasonality of birth rates and the correlation of this phenomenon with latitude and year of birth are surprising, underappreciated, and difficult to compensate for fully in the design and analysis of individual studies. Our observations serve as a reminder that risk factors that are easy to determine and seemingly homogeneous, such as date of birth, may yet be heterogeneous within the general population and may therefore generate false-positive signals if cases and controls are not adequately matched for the relevant extraneous variables. Efforts to identify and correct for confounding should be no less rigorous in the study of environmental risk factors than those that are now routine in the field of complex genetics.
National birth statistics show that within the general population: (1) birth rate is subject to highly significant variation with respect to place and time; and (2) the probability of being born in the spring (March, April, and May) is positively correlated with latitude, whereas the probability of being born in the winter (November, December, and January) is negatively correlated.
Because typical multiple sclerosis case collections are unlikely to be uniform with respect to region of origin or year of birth: (1) studies using national statistics as controls are predisposed to generate false-positive associations with month of birth; and (2) the associations generated by such studies are inherently biased toward showing an apparent excess of births in spring and/or an apparent deficit of births in winter.
The Effects of Structure: An Example from Genetics
Confounding due to structure is a well-recognized problem in complex genetics. Consider the data shown in the Table, which indicate the total population in each of the 11 Government Office Regions of the United Kingdom in mid-2010 (Office for National Statistics: http://www.ons.gov.uk), together with corresponding multiple sclerosis prevalence estimates and the approximate region-specific frequency of the C allele of the rs1042712 single nucleotide polymorphism (SNP) from the lactase gene (taken from Fig 7 of the supplementary file of the Wellcome Trust Case Control Consortium Genome Wide Association Study). Inspection of the Table shows that both the prevalence of multiple sclerosis and the frequency of the C_rs1042712 allele vary geographically; prevalence correlates positively with latitude, whereas allele frequency is negatively correlated. Although the absolute differences in allele frequency amount to only a few percentage points, the resulting structure is sufficient to confound association studies if it is ignored and the UK population is wrongly assumed to be homogeneous with respect to the frequency of this allele.
Table 1. Regional Variation in the Prevalence of Multiple Sclerosis (per 100,000) and the Frequency (%) of the Lactase Single Nucleotide Polymorphism Allele (C_rs1042712) in the United Kingdom
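|Region|Population (mid-2010)|Multiple sclerosis prevalence (per 100,000)|C_rs1042712 allele frequency (%)|
|---|---|---|---|
|Yorkshire and Humber|5,301,252|109|10.9|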
If controls are randomly selected from across the United Kingdom (as, for example, they would tend to be in a birth cohort), the expected allele frequency is 11.6%, less than is seen in London (14.5%) and more than in Wales (9.5%). If an association study were performed in London using 10,000 such controls and 2,000 local patients, it would find nominally significant evidence (p < 0.05) that C_rs1042712 is a risk factor for multiple sclerosis, with >99.9% certainty, and would generate a genome-wide significant result (p < 5 × 10⁻⁸) >40% of the time. If the study were then repeated using a similar design but with 2,000 patients from the East of England Government Office Region, the apparent association would be replicated with 1-tailed nominal significance on the majority of occasions (>90%). Conversely, if replication were attempted in Wales, researchers would most likely (>95%) find nominally significant evidence that C_rs1042712 appears protective. The ease with which apparently significant results can be generated despite the relatively modest absolute difference in allele frequency illustrates how sensitive case–control analysis is to unrecognized, and therefore uncompensated, population structure.
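The scale of these probabilities can be checked approximately with a short calculation. The sketch below is not the authors' method: it assumes an allelic two-proportion z-test with a normal approximation, two alleles per individual, no true association (cases simply carry their region's allele frequency), and, for Wales, the same 2,000 cases as in the London design. The function name `approx_power` is introduced here for illustration. Because the exact test used for the quoted figures is not stated, the output should be expected to agree only roughly with them.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_cases, p_controls, n_cases, n_controls, alpha):
    """Approximate two-sided power of an allelic two-proportion z-test."""
    alleles_cases = 2 * n_cases        # two alleles per individual
    alleles_controls = 2 * n_controls
    # Pooled frequency gives the standard error used by the test under the null.
    pooled = (p_cases * alleles_cases + p_controls * alleles_controls) / (
        alleles_cases + alleles_controls
    )
    se_null = sqrt(pooled * (1 - pooled) * (1 / alleles_cases + 1 / alleles_controls))
    # Unpooled standard error describes the actual sampling spread of the difference.
    se_alt = sqrt(
        p_cases * (1 - p_cases) / alleles_cases
        + p_controls * (1 - p_controls) / alleles_controls
    )
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    diff = p_cases - p_controls
    # Probability that the observed difference falls in either rejection tail.
    upper = 1 - std_normal.cdf((z_crit * se_null - diff) / se_alt)
    lower = std_normal.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower

# Allele frequencies quoted in the text: UK-wide 11.6%, London 14.5%, Wales 9.5%.
london_nominal = approx_power(0.145, 0.116, 2_000, 10_000, 0.05)
london_genomewide = approx_power(0.145, 0.116, 2_000, 10_000, 5e-8)
wales_nominal = approx_power(0.095, 0.116, 2_000, 10_000, 0.05)

print(f"London, p < 0.05: {london_nominal:.3f}")
print(f"London, p < 5e-8: {london_genomewide:.3f}")
print(f"Wales,  p < 0.05: {wales_nominal:.3f}")
```

The point of the calculation is qualitative: with samples of this size, a difference of a few percentage points in allele frequency between the case region and the national controls is enough to make a nominally significant result close to certain.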
Even if everyone in the UK population were genotyped for the rs1042712 SNP, enabling all cases to be compared with all controls, nominally significant evidence that C_rs1042712 is associated (protective) would be observed on >95% of occasions. This bias arises because a disproportionate number of the cases will come from northern (high prevalence) parts of the country, which generally have a lower than average allele frequency for this particular SNP; as a result the estimated allele frequency in cases will tend to be lower (11.3%) than in the general population (11.6%). This illustrates why, even if all cases and all controls from a country are included in an analysis, structure can still generate false-positive associations unless there is compensation for the confounding effects of their unequal distribution. Whether a study that only employs cases and controls from a single Government Office Region would be free from bias cannot be judged from the figures provided in the Table. Given the highly significant difference in both disease and allele frequency between regions, it seems probable that structure will also occur within, although to a lesser extent than between, regions.
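The direction of this bias follows from simple weighting, as the sketch below illustrates. The two regions and all of their values are hypothetical and are not taken from the Table; they are chosen only so that the higher-prevalence region has the lower allele frequency. Cases are assumed to carry the allele at their region's frequency, so there is no true association.

```python
# Hypothetical regions: higher prevalence in the north, higher allele frequency in the south.
regions = [
    {"name": "north", "population": 5_000_000, "prevalence_per_100k": 150, "allele_freq": 0.10},
    {"name": "south", "population": 8_000_000, "prevalence_per_100k": 100, "allele_freq": 0.14},
]

def allele_frequency_in_population(regions):
    """Population-weighted mean allele frequency."""
    total = sum(r["population"] for r in regions)
    return sum(r["population"] * r["allele_freq"] for r in regions) / total

def allele_frequency_in_cases(regions):
    """Mean allele frequency weighted by the expected number of cases per region."""
    case_counts = [r["population"] * r["prevalence_per_100k"] / 100_000 for r in regions]
    total_cases = sum(case_counts)
    return sum(n * r["allele_freq"] for n, r in zip(case_counts, regions)) / total_cases

freq_population = allele_frequency_in_population(regions)
freq_cases = allele_frequency_in_cases(regions)

print(f"general population: {freq_population:.4f}")
print(f"cases:              {freq_cases:.4f}")
```

Because the case weights are population multiplied by prevalence, the high-prevalence region contributes disproportionately to the case average, which pulls the case allele frequency below the population value even though the allele has no effect on risk in either region.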
Genetic analyses are uniquely well placed with respect to the assessment of, and compensation for, structure. First, although the above analysis has considered matters from a geographical perspective, the primary confounding variable in genetic studies is ancestry, not geography. The probability that individuals carry a particular allele of interest primarily depends upon who their ancestors were and not where they were born or domiciled when recruited to a study. Because ancestry does not change over time, and can be accurately inferred using genetic markers, in principle little if any geographical precision is necessary to assess and correct for structure in allele frequency distribution within a population. Conversely, for social and mobility reasons, there is an inevitable correlation between geography and ancestry, and this relationship means that geography often provides a reasonable, and useful, surrogate for ancestry as illustrated above. Second, because GWASs typically include many thousands of markers that are not associated with the particular disease of interest, these additional neutral data allow researchers both to measure and to compensate for differences in the ancestry of cases and controls. By comparing allele frequency distributions across regions of the United Kingdom, the Wellcome Trust Case Control Consortium established that most common variants show little variation in allele frequency within populations, the rs1042712 SNP from the lactase gene being 1 of only a limited number of exceptions to this rule. Available evidence suggests that this SNP is part of a haplotype that has been subject to considerable selection in the recent past and hence is now highly structured in the population.
However, for the vast majority of common variants studied in GWASs, there is only limited structure within and between populations, and this can usually be compensated for using ancestral information from the thousands of unassociated markers that are inevitably typed as part of a GWAS.
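One widely used way of exploiting unassociated markers is genomic control, sketched below. The text does not say which correction method is intended, so this is offered only as an illustration of the general idea, and the test statistics in the example are hypothetical. The inflation factor is estimated as the median 1-degree-of-freedom chi-square statistic across presumed null markers divided by the theoretical median, with a floor of 1 as is conventional.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(null_marker_chi2):
    """Genomic-control inflation factor estimated from presumed null markers."""
    return max(1.0, median(null_marker_chi2) / CHI2_1DF_MEDIAN)

def corrected_statistic(chi2, lam):
    """Deflate an association statistic by the estimated inflation factor."""
    return chi2 / lam

# Hypothetical chi-square statistics from markers unrelated to the disease.
null_chi2 = [0.1, 0.3, 0.5, 0.6, 0.9, 1.4, 2.2]
lam = inflation_factor(null_chi2)
adjusted = corrected_statistic(10.0, lam)

print(f"inflation factor: {lam:.3f}")
print(f"adjusted statistic: {adjusted:.3f}")
```

A single genome-wide factor of this kind suits the limited structure seen at most common variants; it would not be expected to correct adequately for a strongly structured variant such as the lactase SNP discussed above.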