PopGenReport: simplifying basic population genetic analyses in R

Summary

  1. Using scripting languages such as R to perform population genetic analyses can improve the reproducibility of research, but using R can be challenging for many researchers due to its steep learning curve.
  2. PopGenReport is a new R package that simplifies performing population genetic analyses in R through a new report-generating function. The function popgenreport allows users to perform up to 13 predefined analyses and one user-defined analysis with a single command. Each analysis generates figures and tables that are incorporated into a pdf report and are also made available as individual files (figures in multiple formats, table contents as csv files).
  3. The package includes new R functions that simplify the importation of data from a spreadsheet file, examine allele distributions across populations and loci and identify private alleles, determine pairwise individual genetic distances using the methods of Smouse and Peakall (1999) and of Kosman and Leonard (2005), detect the presence of null alleles, calculate allelic richness, and test for spatial autocorrelation in genotypes using the methods of Smouse and Peakall (1999).
  4. The package has a modular structure that makes the process of adding new functionality straightforward. To facilitate the addition of user-designed functions, the package includes a fully customizable module that can be automatically included in the pdf report.
  5. To support users not experienced in R, the website (www.popgenreport.org) has a tutorial for the package and a downloadable, portable version of the package with LaTeX pre-configured for the Windows operating system.

Introduction

Microsatellite genetic markers are an important and widely used tool for answering questions about populations and individuals (Hartl & Clark 1997; Selkoe & Toonen 2006; Blanchet 2012) in fields such as ecology, agriculture and forensic science. They have been used to answer questions about many topics including population structure (Pritchard, Stephens & Donnelly 2000), dispersal (Ouborg, Piquot & van Groenendael 1999; Goudet, Perrin & Waser 2002), immigration (Rannala & Mountain 1997), kinship and parentage (Queller, Strassmann & Hughes 1993; Jones & Ardren 2003) and individual identity (Waits, Luikart & Taberlet 2001). Molecular markers have a number of characteristics that make them attractive for use in a wide range of fields. Sufficient material for genotyping an organism can be collected using minimally invasive sampling methods (e.g. taking small tissue samples) or without ever observing an individual by collecting discarded tissues (e.g. hair, blood or faecal samples). Once a sample has been collected, it is relatively straightforward to genotype it and determine which species or individual was the source of the sample. In combination with genotypes from additional samples, the genotype can then be used to answer questions about the demographics and origins of populations. Microsatellite markers are often used over other types of markers due to their high allelic diversity, which allows large amounts of population genetic information to be gained from a relatively small number of loci (Guichoux et al. 2011).

The large amount of data produced using microsatellite DNA markers has led to the development of a wide array of stand-alone statistical programs for the analysis of these data (e.g. Excoffier & Heckel 2006). The authors surveyed three recent issues of Molecular Ecology (volume 22, issues 9, 10, and 12) and found 22 articles whose analyses were predominantly based upon microsatellite genetic data. Sixty different programs were used across the 22 articles (76 if individual R packages are counted separately), with an average of 7 programs per article (range 1–13). Having many programs to choose from makes it relatively easy to conduct very sophisticated analyses on a data set. However, statistical programs often have distinct input file formats. GenAlEx (Peakall & Smouse 2012), for example, can transform its own file format into 30 other file formats. The process of converting files is becoming more automated, but can still consume significant amounts of time and introduce errors into data sets. Additionally, as Goecks, Nekrutenko and Taylor (2010) note, ‘the sudden reliance on computation has created an “informatics crisis” for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging’. The data analysis process is often iterative, with new data being added, subsets of data having to be reanalysed, or errors in existing data being identified and corrected. As a result, the data conversion and analysis steps must be repeated multiple times, which increases the odds that new errors are introduced or that analyses differ between iterations because program settings were applied inconsistently. Thus, a scripted workflow within a single framework should be given preference over a workflow based on linking the outputs of multiple point-and-click programs, as it potentially reduces inconsistencies and facilitates repeatability of an analysis.

Many problems associated with using separate stand-alone programs for genetic analyses could be resolved by developing script-based analyses in R (R Core Team 2013) and using a mix of existing R packages (e.g. http://cran.r-project.org/web/views/Genetics.html) and external calls to stand-alone programs where necessary. Using this approach, we believe that the number of programs necessary for a typical analysis could be reduced substantially, depending on the analysis being done, while at the same time increasing the reproducibility of the analysis. We are aware that, until recently, analysing genetic data using R suffered from the same multiple input file problem as stand-alone programs, but the issue is being addressed. The package adegenet (Jombart 2008) introduced a new data class for the storage of genetic data in R, called a ‘genind object’, and functions that import several of the most widely used file formats (e.g. structure (Pritchard et al. 2000), Genepop (Raymond & Rousset 1995), fstat (Goudet 1995, 2002) and simple spreadsheet tables) to genind objects. Genind objects primarily contain data on the allele frequencies and population memberships of individuals, and allele and marker names. Additionally, genind objects are capable of storing user-defined variables such as an individual's sex, physical location (e.g. latitude and longitude) and other user-defined data. The genind data class has been adopted by several recently released packages including pegas (Paradis 2010) and mmod (Winter 2012).

Despite the many benefits of using R for carrying out genetic analyses on microsatellite data, many researchers continue to use an assortment of stand-alone programs for their basic population genetic analyses. This seems to be due, in part, to the familiarity of researchers with many existing programs and the steep learning curve of R. To increase the accessibility of population genetic analyses in R, we have developed a new package called PopGenReport.

PopGenReport is a freely available, open-source package that simplifies the analysis of microsatellite genetic data in R by automating several population genetic analyses. It provides a user experience closer to that of a text-based program such as Genepop (Raymond & Rousset 1995) and relaxes the requirement to be able to write a script in R. The package combines the functionality of several new and existing R functions into a single function to produce an annotated pdf report (a complete example is provided in the Supporting information). The function exports all of the figures generated as a part of the report in multiple file formats (png, svg and pdf) and the contents of each table as individual csv files, allowing for their direct use in presentations or manuscripts. In addition, an R object and the code used to generate the report are returned, allowing users to customize their analysis and record the analysis for future reproducibility.

Overview of the package and the functions

The PopGenReport package currently consists of eight newly developed R functions that are described below.

The package's main function, popgenreport, integrates an assortment of new and existing R functions into a single function that performs several basic population genetic analyses for microsatellite data and then generates a report. The analyses include summary counts and frequencies of alleles and individuals, maps of sample locations, multiple measures of genetic differentiation within and between populations, tests for null alleles, observed and expected heterozygosity, tests for departures from Hardy–Weinberg equilibrium, individual pairwise genetic distances, tests for spatial autocorrelation and a principal coordinate analysis. A complete list of the analysis modules is provided in the online supplementary material. Several of the routines (mk.counts, mk.locihz, mk.hwe, mk.fst, mk.pcoa) included in the popgenreport function are derived from, or build upon, analyses that were initially developed by Jombart (2008). The component routines of the popgenreport function are designed to run independently of one another, allowing the user to run only a subset of the analysis routines. We also include a dummy routine (mk.custom) that allows a user to develop their own analysis for inclusion in the report by simply editing a file. The procedure is detailed in the tutorial for the package (browseVignettes("PopGenReport")).
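As a minimal sketch of running only a subset of the routines, the call below switches on three of the routines named above. The bundled bilby example data set and the mk.pdf and path.pgr arguments are assumptions taken from the package documentation rather than from this text, so they should be checked against ?popgenreport before use.

```r
# Folder for the figures and tables written by the report (assumed to be
# controlled by the 'path.pgr' argument).
out_dir <- tempdir()

# Only attempt the run when the package is installed.
if (requireNamespace("PopGenReport", quietly = TRUE)) {
  library(PopGenReport)
  data(bilby)                      # assumed: example genind object in the package
  report <- popgenreport(
    bilby,
    mk.counts = TRUE,              # summary counts of alleles and individuals
    mk.hwe    = TRUE,              # Hardy-Weinberg tests
    mk.fst    = TRUE,              # genetic differentiation between populations
    mk.pdf    = FALSE,             # assumed switch: skip the LaTeX/pdf step
    path.pgr  = out_dir
  )
  str(report, max.level = 1)       # the returned R object mentioned in the text
}
```

Setting mk.pdf to FALSE is intended to keep the sketch usable on machines without LaTeX; the individual figure and table files should still be written to the output folder.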

The function read.genetable simplifies the process of creating a genind object from a tabular csv file that includes individual identifiers, co-dominant marker data and optional user-defined data including population, coordinates, gender and size. It improves on the existing df2genind function in adegenet (Jombart 2008; Jombart & Ahmed 2011) by automating the process of importing latitudes and longitudes and other ‘extra’ data into genind objects.
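The kind of table described above might look like the following sketch. All sample values and column names are hypothetical, and the argument names passed to read.genetable (ind, pop, lat, long, other.min, other.max, oneColPerAll, sep) are assumed from the package documentation; consult ?read.genetable for the authoritative signature.

```r
# Hypothetical table: one row per individual, one column per locus,
# with the two alleles of a diploid genotype separated by "/".
samples <- data.frame(
  ind  = c("A1", "A2", "B1", "B2"),
  pop  = c("North", "North", "South", "South"),
  lat  = c(-35.24, -35.25, -35.40, -35.41),
  long = c(149.08, 149.09, 149.20, 149.21),
  sex  = c("F", "M", "F", "M"),
  loc1 = c("100/102", "100/100", "102/104", "104/104"),
  loc2 = c("200/200", "200/204", "204/204", "200/204"),
  stringsAsFactors = FALSE
)
csv_file <- file.path(tempdir(), "samples.csv")
write.csv(samples, csv_file, row.names = FALSE)

if (requireNamespace("PopGenReport", quietly = TRUE)) {
  library(PopGenReport)
  # Columns are identified by position (assumed interface).
  gen <- read.genetable(
    csv_file, ind = 1, pop = 2, lat = 3, long = 4,
    other.min = 5, other.max = 5,    # range of 'extra' columns (here: sex)
    oneColPerAll = FALSE, sep = "/"
  )
  print(gen)
}
```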

The function null.all is used to detect the presence of null alleles and their frequency for each locus. The function has two main parts. The first determines, for each allele at each locus, the probability of the observed number of homozygotes, using the observed allele frequencies to generate bootstrap estimates of the expected number of homozygotes. The second part determines the frequency of null alleles at each locus using the methods of Chakraborty et al. (1992) and Brookfield (1996).

The function allel.rich calculates the allelic richness for each combination of population and locus using the methods of El Mousadik and Petit (1996). As sample sizes are often uneven across combinations of population and locus, the number of alleles sampled to determine allelic richness is standardized to the smallest number of individuals sampled across all combinations of locus and population multiplied by the ploidy level of the species. This function behaves similarly to the allelic.richness function in the hierfstat package (Goudet 2005) but is capable of working with genind objects and can handle polyploid organisms.
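The two calculations described above can be sketched in base R. This is not the package's code: the null allele estimators are written in the forms commonly attributed to Chakraborty et al. (1992) and Brookfield (1996, estimator 1), and the rarefaction formula is the standard expected number of alleles in a subsample. All function names and toy counts are hypothetical, and the bootstrap step of null.all is not shown.

```r
# Null allele frequency from observed (Ho) and expected (He) heterozygosity.
null_chakraborty <- function(Ho, He) (He - Ho) / (He + Ho)
null_brookfield1 <- function(Ho, He) (He - Ho) / (1 + He)

# Rarefied allelic richness for one population at one locus: the expected
# number of distinct alleles in a random subsample of n gene copies.
# counts: number of copies of each allele in the full sample.
rarefied_richness <- function(counts, n) {
  N <- sum(counts)
  stopifnot(n <= N)
  sum(1 - choose(N - counts, n) / choose(N, n))
}

# Toy allele counts at one locus in two diploid populations.
pop_a <- c(`100` = 10, `102` = 6, `104` = 4)   # 10 individuals, 20 gene copies
pop_b <- c(`100` = 7, `102` = 1)               # 4 individuals, 8 gene copies

# Standardize to the smallest number of individuals times the ploidy level.
ploidy <- 2
n_min  <- min(c(10, 4)) * ploidy

rarefied_richness(pop_a, n_min)   # between 2 and 3 alleles
rarefied_richness(pop_b, n_min)   # all copies are used, so the observed count
null_chakraborty(Ho = 0.40, He = 0.60)
null_brookfield1(Ho = 0.40, He = 0.60)
```

Rarefying to the smallest sample keeps richness comparable across populations, at the cost of discarding information from the larger samples.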

We developed two R functions (gd.smouse and gd.kosman) to calculate pairwise genetic distances between all individuals within a data set using the methods of Smouse and Peakall (1999) and Kosman and Leonard (2005), respectively. We note that our function (gd.kosman) differs from the dist.codom function in package mmod (Winter 2012) in the handling of missing data. If an individual is missing data at any locus, dist.codom excludes that individual from the analysis of genetic distance. The gd.kosman function includes individuals with missing data by ignoring any loci for which one or both individuals have missing data when calculating the genetic distance between that pair of individuals. Our approach has the advantage of determining pairwise individual genetic distances for all possible pairs of individuals, but comes at the cost of variable levels of precision in the estimates, which could be important if a data set has a large proportion of missing data. To help evaluate the precision of individual pairwise genetic distances, we provide a table with the number of loci used in the calculation of each distance.
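The missing-data rule described above can be illustrated with a small base R sketch for diploid genotypes. It is a simplified stand-in, not the code of gd.smouse or gd.kosman: the per-locus measures follow the usual diploid forms of the two distances, and the function names, genotype coding and toy data are hypothetical.

```r
# Genotypes as "allele/allele" strings; NA marks missing data (toy values).
geno <- rbind(
  ind1 = c(loc1 = "100/100", loc2 = "200/204"),
  ind2 = c(loc1 = "100/102", loc2 = "200/204"),
  ind3 = c(loc1 = "102/104", loc2 = NA)
)

# Per-locus distances between two diploid genotypes given as allele vectors.
locus_dist <- function(a, b) {
  alleles <- union(a, b)
  ya <- as.vector(table(factor(a, levels = alleles)))  # allele counts, sum to 2
  yb <- as.vector(table(factor(b, levels = alleles)))
  c(smouse = sum((ya - yb)^2) / 2,    # squared distance: 0, 1, 2, 3 or 4
    kosman = sum(abs(ya - yb)) / 4)   # share of non-matching alleles: 0, 0.5 or 1
}

# Mean per-locus distance over the loci typed in BOTH individuals,
# plus the number of loci that entered each estimate.
pairwise_kosman <- function(geno, sep = "/") {
  n   <- nrow(geno)
  ids <- rownames(geno)
  dist_m  <- matrix(NA_real_, n, n, dimnames = list(ids, ids))
  nloci_m <- matrix(0L, n, n, dimnames = list(ids, ids))
  diag(dist_m) <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ok <- which(!is.na(geno[i, ]) & !is.na(geno[j, ]))
      if (length(ok) > 0) {
        d <- vapply(ok, function(l) {
          locus_dist(strsplit(geno[i, l], sep, fixed = TRUE)[[1]],
                     strsplit(geno[j, l], sep, fixed = TRUE)[[1]])[["kosman"]]
        }, numeric(1))
        dist_m[i, j] <- dist_m[j, i] <- mean(d)
      }
      nloci_m[i, j] <- nloci_m[j, i] <- length(ok)
    }
  }
  list(distance = dist_m, n_loci = nloci_m)
}

res <- pairwise_kosman(geno)
res$distance   # ind3 is kept even though loc2 is missing
res$n_loci     # pairs involving ind3 rest on a single locus
```

The n_loci matrix plays the role of the table mentioned in the text: distances based on few loci deserve less weight when interpreting the results.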

The function allele.dist summarizes the distribution of alleles across populations and identifies any private alleles in the data set. For each locus, the function produces a pair of matrices that contain the absolute count and relative frequency of alleles for each combination of population and allele. Private alleles are identified by locus, and the name of the allele and the population in which it occurs are provided.
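For a single locus, the summaries described above can be sketched in base R as follows. The data layout (one row per sampled gene copy) and all values are hypothetical, and this is an illustration of the idea rather than the code of allele.dist.

```r
# One row per sampled gene copy at a single locus (toy values).
copies <- data.frame(
  pop    = c("North", "North", "North", "North",
             "South", "South", "South", "South"),
  allele = c("100", "100", "102", "104", "100", "102", "102", "102"),
  stringsAsFactors = FALSE
)

# Allele-by-population matrices: absolute counts and relative frequencies.
counts <- unclass(table(copies$allele, copies$pop))
freqs  <- prop.table(counts, margin = 2)    # each column sums to 1

# An allele is private when it occurs in exactly one population.
is_private <- rowSums(counts > 0) == 1
idx <- which(counts > 0 & is_private, arr.ind = TRUE)
private <- data.frame(
  allele = rownames(counts)[idx[, 1]],
  pop    = colnames(counts)[idx[, 2]],
  stringsAsFactors = FALSE
)

counts
freqs
private   # allele 104 occurs only in the North population
```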

The function spautocor is newly available to R and carries out a spatial autocorrelation analysis using the methods of Smouse and Peakall (1999) and Peakall, Ruibal and Lindenmayer (2003) on a matrix of individual pairwise genetic distances (that can be generated using the gd.smouse or gd.kosman functions) and a Euclidean distance matrix based upon the coordinates of individuals (that can be generated using the dist function in R).
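A compact base R sketch of this kind of analysis is given below. It assumes the usual formulation of the Smouse and Peakall coefficient: squared genetic distances are double-centred into a covariance matrix C, and for each distance class r is twice the sum of the pairwise covariances divided by the sum of the corresponding diagonal terms. The function name, the toy transect data and the choice of distance classes are hypothetical, and the permutation tests normally used to assess significance are omitted.

```r
# Toy data: six individuals along a transect with a genetic gradient.
coords <- cbind(x = 1:6, y = 0)
g      <- 1:6                             # stand-in genetic score
gen_d2 <- as.matrix(dist(g))^2            # squared genetic distances
eucl   <- as.matrix(dist(coords))         # Euclidean distances via dist()

sp_autocor <- function(gen_d2, eucl, breaks) {
  n <- nrow(gen_d2)
  J <- diag(n) - 1 / n                    # centring matrix
  C <- -0.5 * J %*% gen_d2 %*% J          # covariance matrix from distances
  cii <- diag(C)
  pairs <- which(upper.tri(eucl), arr.ind = TRUE)   # each pair once
  cls <- cut(eucl[pairs], breaks = breaks, include.lowest = TRUE)
  r <- sapply(levels(cls), function(l) {
    p <- pairs[which(cls == l), , drop = FALSE]
    2 * sum(C[p]) / sum(cii[p[, 1]] + cii[p[, 2]])
  })
  data.frame(class = levels(cls),
             n_pairs = as.vector(table(cls)),
             r = unname(r))
}

res <- sp_autocor(gen_d2, eucl, breaks = 0:5)
res   # r is positive for neighbours and negative for the most distant pair
```

On a smooth gradient such as this one, r declines with distance, which is the pattern a correlogram is meant to reveal.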

Obtaining PopGenReport

PopGenReport requires a current R installation (freely available under a GPL license for all major operating systems from http://cran.r-project.org/) and a LaTeX installation (freely available for all major operating systems from http://www.latex-project.org) as the package uses knitr (Xie 2012) to generate the annotated pdf reports. PopGenReport can be installed in R using the standard package installation command, install.packages("PopGenReport").
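Before generating a report, the two prerequisites named above can be checked from within R. This is a small convenience sketch; it assumes that pdflatex is the LaTeX engine used to compile the report, which may differ between installations.

```r
# Is a LaTeX engine on the search path? (pdflatex is an assumption here.)
has_latex <- unname(nzchar(Sys.which("pdflatex")))

# Is the package already installed?
has_package <- requireNamespace("PopGenReport", quietly = TRUE)

if (!has_package) {
  message('PopGenReport is not installed; run install.packages("PopGenReport")')
}
if (!has_latex) {
  message("pdflatex not found; install a LaTeX distribution to build pdf reports")
}
```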

Alternatively, a ready-to-run software suite (R, MiKTeX and the PopGenReport package) is available for download from http://www.popgenreport.org/. A tutorial on how to install PopGenReport and carry out analyses is also available from the website.

PopGenReport is under active development, and we encourage users to report any bugs that they may encounter or to suggest additional routines for future versions of the package. We also invite other package developers to contribute routines for use in PopGenReport. The popgenreport function has a modular structure that makes the process of adding new functionality straightforward, particularly if the functions are built around genind objects.

Acknowledgements

We thank S. Sarre, A. MacDonald, C. Holleley and the editors and reviewers for their critical remarks on this manuscript. We also thank T. Jombart and D. Winter for their review of an earlier version of this manuscript and help with software development. A.T.A's work on this project was funded by the University of Canberra Postdoctoral Fellowship Scheme.
