‘Big data’ from shrinking pathogen populations


Correspondence: Daniel E. Neafsey; E-mail: neafsey@broadinstitute.org


Falling costs for genome sequencing and genotyping mean that population genomic data sets are becoming commonplace for a wide variety of species. Once these data are used for the initial tasks of investigating population structure and demographic history, however, is there reason to go back for more? In this issue of Molecular Ecology, Nkhoma et al. (2013) explore the applications of longitudinal genomic diversity data for detecting changes in the prevalence and transmission of the Plasmodium falciparum malaria parasite in South-East Asia. While this study finds several genetic signatures indicative of reduced disease transmission, other measures, such as short-term effective population size, geographical population structure and heterozygosity, were not informative. These results indicate the potential contribution of genomic data to the surveillance of small, dynamic populations, whether they are at risk of extinction or targeted for elimination. The interpretation of such data will require close consideration of biological context, however, at both the species and the population level.

The field of population genetics is traditionally considered to have been founded in the early 1930s, but existed without nucleotide data until the first description of variation in the Adh locus in Drosophila in 1983 (Kreitman 1983). As such, it has been a largely theoretical field longer than it has been an empirical one. The advent of the empirical era resulted in a rapid flowering of techniques for extracting meaningful information from sequence data through application of theory, but recent years have seen the arrival of a new ‘big data’ empirical era. Data sets composed of thousands of sequenced genomes or thoroughly genotyped individuals now offer novel opportunities for applying population genetic theory to massive data sets for practical purposes.

A study in this issue of Molecular Ecology by Nkhoma et al. (2013) explores the potential epidemiological applications of ‘big’ genotype data gathered from 96 SNPs typed in 1731 clinical malaria samples collected over a decade in South-East Asia (Fig. 1). Malaria has been on the decline in this region, largely due to increased treatment for disease with artemisinin combination therapy. The authors explored the genomic data for signatures of reduced disease transmission and made a number of observations that are informative in the light of the life cycle and biology of Plasmodium malaria parasites. The power of this approach is impressive, considering it is based on a small collection of SNPs rather than complete genome sequences, and suggests that full genome sequencing of population samples may be economically justifiable for many applications only for initial identification of the most informative (high frequency) variants within a population.

Figure 1.

Habitat of the P. falciparum malaria parasite in Thailand. Disease incidence has exhibited a steady decline in this region in recent years.

The most revealing genetic indicator of reduced disease transmission over time was one of the simplest to infer: the incidence of infections composed of more than one parasite clone. Plasmodium parasites are haploid in humans, and mixed clonal infections may therefore be reliably detected through heterozygous genotype calls. The authors saw mixed clone infections drop from 63% to 14% of samples in annual collections over the study period. Sexual outcrossing in malaria parasites happens in mosquitoes and occurs most commonly when a mosquito bites a multiply-infected person, so it would be reasonable to expect that reduced sexual outcrossing during the study period, coupled with reduced transmission, could affect the genetic effective population size (Ne) of the parasite population.
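The logic of detecting mixed clone infections from heterozygous calls can be sketched in a few lines of Python. This is an illustration only, not the procedure used by Nkhoma et al.: the genotype encoding (`A`, `A/G`, empty string for a missing call), the function names and the `min_het_sites` threshold are all assumptions made for the example.

```python
from typing import Sequence


def is_mixed_infection(calls: Sequence[str], min_het_sites: int = 1) -> bool:
    """Flag a sample as a mixed clone infection.

    `calls` holds one genotype call per SNP, e.g. "A", "A/G", or "" if missing
    (hypothetical encoding). Because the parasite is haploid in humans, a call
    showing two different alleles implies that more than one clone is present.
    `min_het_sites` is an assumed threshold, not a value from the study.
    """
    het_sites = sum(
        1 for call in calls
        if "/" in call and len(set(call.split("/"))) > 1
    )
    return het_sites >= min_het_sites


def mixed_fraction(samples: Sequence[Sequence[str]]) -> float:
    """Fraction of samples flagged as mixed clone infections."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(is_mixed_infection(s) for s in samples) / len(samples)
```

Applied to each annual collection, `mixed_fraction` would give the kind of yearly proportion discussed above. In practice, a threshold above one heterozygous site may be preferable to guard against genotyping error.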

However, this is not what Nkhoma et al. observed. Effective population size, measured via variance in allele frequency between transmission seasons, was unchanged during the course of the study, despite falling infection prevalence. The authors proffer several biological explanations for this null result, including migration-based stabilization of allele frequencies and an unchanged reservoir of asymptomatic infections between transmission seasons. A third explanation, that this finding is a false negative resulting from insufficient power, also deserves consideration. Figure 2 illustrates the results of a binomial sampling-based simulation to explore the expected drift-based allele frequency variance over the course of 1 year in a parasite population, assuming eight parasite generations per year. As the plot shows, discerning changes in Ne on the basis of allele frequency variance may be difficult when Ne exceeds 1000. Even with massive sample sizes designed to reduce sampling variance, expected differences in allele frequency may be too small in large populations to be detectable (Hare et al. 2011), regardless of minor allele frequency. Consequently, detectable changes in variance Ne may be most useful for identifying populations on the brink of collapse.

Figure 2.

Drift-based allele frequency variance in a parasite population over the course of one year. A binomial sampling simulation was used to predict variance in allele frequency as a function of minor allele frequency and effective population size (Ne). Variants with high minor allele frequency are expected to be most informative, and only then in small populations. The simulation assumes eight generations per year and a constant population size. Points represent mean allele frequency variance after eight generations calculated from 1000 replicates.
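A simulation of this general kind can be sketched as follows. This is not the Perl script referred to under Data accessibility; it is a minimal Python illustration that assumes a haploid Wright-Fisher population of Ne gene copies, measures variance as the mean squared deviation of the final allele frequency from its starting value, and uses hypothetical function names. The closed-form expectation included for comparison is the standard Wright-Fisher result under the same assumptions.

```python
import random
import statistics


def simulate_drift_variance(ne, maf, generations=8, replicates=1000, seed=None):
    """Variance in allele frequency after repeated binomial sampling.

    Assumes (for illustration) a constant-size haploid population of `ne`
    gene copies. Each generation, every copy is drawn independently from the
    previous generation's allele frequency, which is a binomial sample.
    """
    rng = random.Random(seed)
    final_freqs = []
    for _ in range(replicates):
        p = maf
        for _ in range(generations):
            # Binomial(ne, p) draw built from ne Bernoulli trials
            count = sum(1 for _ in range(ne) if rng.random() < p)
            p = count / ne
        final_freqs.append(p)
    # Mean squared deviation from the starting frequency
    return statistics.pvariance(final_freqs, mu=maf)


def expected_drift_variance(ne, maf, generations=8):
    """Wright-Fisher expectation: p(1 - p) * [1 - (1 - 1/ne)^t]."""
    return maf * (1 - maf) * (1 - (1 - 1 / ne) ** generations)


if __name__ == "__main__":
    for ne in (10, 100, 1000):
        sim = simulate_drift_variance(ne, maf=0.5, seed=1)
        print(ne, round(sim, 5), round(expected_drift_variance(ne, 0.5), 5))
```

Under these assumptions the expected variance falls roughly tenfold for each tenfold increase in Ne, which is consistent with the difficulty of resolving differences among large populations. The pure-Python sampling loop becomes slow for large Ne; a vectorized binomial sampler such as NumPy's `Generator.binomial` would be a natural substitute.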

For an endangered species subject to conservation efforts, such information may come too late to be of practical use. For campaigns against infectious diseases, however, this genetic signature could be a critical indicator of an opportunity for local disease elimination (Volkman et al. 2012). Even when estimates of Ne fail to provide resolution, Nkhoma et al. show that genomic data may illuminate other aspects of parasite transmission dynamics, such as the persistence of multilocus genotypes. Given that Plasmodium parasites facultatively outcross when distinct strains co-infect a mosquito, such information is another useful means of inferring the incidence of mixed infections and profiling the nature of infectious reservoirs in the human population. Collectively, these data appear to indicate that, although P. falciparum parasite populations in South-East Asia are small relative to those in some other disease-endemic regions and continue to shrink, they are not yet on the cusp of elimination. Full realization of the potential for genomic tools to provide useful surveillance of population dynamics will ultimately require the collection of benchmark data sets across a range of populations, calibrated using traditional estimators of population size or disease transmission rate.

D.E.N. performed the simulation analysis and wrote the perspective.

Data accessibility

The Perl script used for conducting the genetic drift simulation in Fig. 2 is available as supplemental online information.