Keywords:

  • heterozygote-excess;
  • linkage disequilibrium;
  • molecular coancestry;
  • Plan I and II temporal sampling

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Data input
  5. Missing data
  6. Data output files
  7. Confidence intervals
  8. Negative or infinite estimates of Ne
  9. Rare alleles
  10. Examples
  11. Download and usage
  12. Acknowledgements
  13. References
  14. Data Accessibility

NeEstimator v2 is a completely revised and updated implementation of software that produces estimates of contemporary effective population size, using several different methods and a single input file. NeEstimator v2 includes three single-sample estimators (updated versions of the linkage disequilibrium and heterozygote-excess methods, and a new method based on molecular coancestry), as well as the two-sample (moment-based temporal) method. New features include the following: (i) an improved method for accounting for missing data; (ii) options for screening out rare alleles; (iii) confidence intervals for all methods; (iv) the ability to analyse data sets with large numbers of genetic markers (10 000 or more); (v) options for batch processing large numbers of different data sets, which will facilitate cross-method comparisons using simulated data; and (vi) correction for temporal estimates when individuals sampled are not removed from the population (Plan I sampling). The user is given considerable control over the input data and over the composition and format of output files. The freely available software has a new Java interface and runs under MacOS, Linux and Windows.


Introduction

Spurred by recent advances in development of molecular markers and nonlethal methods for extracting DNA from natural populations, interest in using genetic methods to estimate contemporary effective population size (Ne) has grown exponentially over the past decade (Palstra & Fraser 2012). Until recently, most such estimates have used the temporal method, which requires at least two samples from the same population spaced in time. However, several new single-sample estimators have been developed recently (Nomura 2008; Waples & Do 2008; Zhdanova & Pudovkin 2008) and since 2009, published estimates using single-sample methods have eclipsed those from the temporal method by a wide margin (Palstra & Fraser 2012).

Given the variety of available methods, an attractive option is to develop software that can apply multiple methods to the same data set. Since 2004, this role has been filled by NeEstimator v1.4 (Ovenden et al. 2007). However, that original implementation predated most of the recent developments in single-sample methods, and this has increasingly limited its usefulness. Here, we describe a completely updated and revamped version of the NeEstimator software (version 2.0) that includes (i) three single-sample estimators [a bias-corrected version of the linkage disequilibrium method (Waples & Do 2008); an updated version of the heterozygote-excess method (Zhdanova & Pudovkin 2008); and a new implementation of the molecular coancestry method (Nomura 2008)]; and (ii) the standard temporal method (Waples 1989), with three different options for computing the standardized variance in allele frequency, F [Fc (Nei & Tajima 1981); Fk (Pollak 1983); and Fs (Jorde & Ryman 2007)].
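As a rough illustration of the temporal method, the sketch below computes the textbook single-locus forms of Fc and Fk and converts a value of F into an estimate of Ne under Plan II sampling. The formulas are the standard published forms associated with Nei & Tajima (1981), Pollak (1983) and Waples (1989); the function names, the single-locus treatment and the handling of non-positive drift terms are assumptions made for this example and are not taken from the NeEstimator source code. Fs is omitted.

```python
import math


def f_c(x, y):
    """Fc for one locus; x and y are allele frequencies in the two samples."""
    k = len(x)
    total = sum((xi - yi) ** 2 / ((xi + yi) / 2 - xi * yi) for xi, yi in zip(x, y))
    return total / k


def f_k(x, y):
    """Fk for one locus; x and y are allele frequencies in the two samples."""
    k = len(x)
    total = sum((xi - yi) ** 2 / ((xi + yi) / 2) for xi, yi in zip(x, y))
    return total / (k - 1)


def temporal_ne(f_hat, t, s0, st):
    """Plan II moment estimate: Ne = t / (2 * (F - 1/(2*S0) - 1/(2*St))).

    t is the number of generations between samples; s0 and st are the
    numbers of individuals sampled at the two time points.
    """
    drift = f_hat - 1 / (2 * s0) - 1 / (2 * st)
    # Assumption for this sketch: a non-positive drift term is reported as infinity.
    return t / (2 * drift) if drift > 0 else math.inf


# Hypothetical allele frequencies at one diallelic locus, two generations apart.
first_sample = [0.6, 0.4]
second_sample = [0.5, 0.5]
f_hat = f_c(first_sample, second_sample)
ne_hat = temporal_ne(f_hat, t=2, s0=100, st=100)
```

In practice F would be combined across many loci before conversion to an estimate of Ne; the single-locus form is kept here only to make the arithmetic easy to follow.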

The new version has a flexible and friendly graphical user interface (Fig. 1) and is suitable for empirical and simulated data sets containing varying numbers of nuclear genotypes consisting of two or more loci and having two or more alleles per locus. The genotypes can represent one to many populations that can be sampled once or at two or more times. NeEstimator (v2) also has versions for Windows, MacOS, and Linux operating systems. This programme note summarizes key features of the new software and should be read in conjunction with the primary literature describing the concept of genetic effective population size (e.g. Schwartz et al. 2007; Charlesworth 2009; Luikart et al. 2010; Palstra & Fraser 2012) and the published estimation methods (Waples 1989; Nomura 2008; Waples & Do 2008; Zhdanova & Pudovkin 2008).

Figure 1. Key features of the user interface of NeEstimator v2.

Data input

NeEstimator (v2) accepts common input formats (genepop or fstat). The user browses directories to select the appropriate input file and can choose to show only files with accepted extensions (.TXT, .GEN, .DAT). One or several methods for calculating Ne can be run simultaneously, generally on a single input file; a batch option is available for processing many separate files. Input files can include data for a large number of samples. For the single-sample methods (linkage disequilibrium, heterozygote-excess and molecular coancestry), each sample is treated as a separate ‘population’. For the temporal method, each population is represented by two or more samples taken at different times, separated by a known number of generations. Generations for each sample can be defined as whole or fractional numbers.

In the simplest circumstance, an input file for the temporal method would contain two samples, separated by a number of generations defined by the user. The temporal method would produce a single estimate of Ne applicable to the number of generations between samples, while the single-sample methods would produce separate estimates applicable to each sampled generation. More advanced sampling strategies can be implemented. For example, the user can specify that the first population was sampled at generations zero and two, the second population at generations three and five, and the remaining three populations were all sampled at generations zero, four and five. In this circumstance, the input data file would contain 13 total samples, analysed as five separate populations.

The user can also specify whether samples were taken after reproduction or nonlethally before reproduction, so that sampled individuals can contribute to future generations (Plan I), or whether individuals (typically juveniles) are sampled without replacement before reproduction (Plan II) (Waples 2005). An estimate of census size is needed for temporal estimation under Plan I.
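The five-population sampling design described above can be written down as a small data structure to check the sample count. This is only a sketch: the dictionary layout and function names are hypothetical and do not reflect how NeEstimator stores or reads sampling plans, and pairing consecutive samples is an assumption made for illustration.

```python
# Generations at which each population was sampled (example from the text).
sampling_plan = {
    "pop1": [0, 2],
    "pop2": [3, 5],
    "pop3": [0, 4, 5],
    "pop4": [0, 4, 5],
    "pop5": [0, 4, 5],
}


def total_samples(plan):
    """Number of samples the input file would need to contain."""
    return sum(len(generations) for generations in plan.values())


def consecutive_intervals(plan):
    """Generations elapsed between consecutive samples of each population."""
    return {
        pop: [later - earlier for earlier, later in zip(gens, gens[1:])]
        for pop, gens in plan.items()
    }


print(total_samples(sampling_plan))  # 13
print(consecutive_intervals(sampling_plan)["pop3"])  # [4, 1]
```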

The software also allows the user flexibility in defining parameters for the analyses. For all methods, the user can choose to screen out rare alleles with frequencies below a user-specified criterion, commonly referred to as Pcrit. Previously, this option has only been available for LDNe (Waples & Do 2008). Furthermore, subsets of the input data can be selected for analysis in the graphical user interface. For example, if the data file includes ten populations, the software can be directed to analyse the first two only. The user can also restrict the number of individuals analysed (e.g. to the first 10 or 20) in each sample. Additionally, loci can be selectively excluded, either by specifying a range (e.g. loci 1–5 and 10–15) or by listing individual loci (e.g. loci 2,4,6). For the linkage disequilibrium method, the user can toggle between the assumptions of random or monogamous mating. When an input file contains thousands of loci, or large numbers of individuals per population, the LD and coancestry methods can take hours or days to run. The interface makes an approximate assessment of this possibility and displays a warning dialogue box if necessary; the user can then decide whether to continue or to use options available on the interface to limit the data. If the user chooses to run the analysis, the terminal screen prints progress at certain milestones so that the user can estimate when the run will finish.

Missing data

The software describes the extent of missing data in output files. NeEstimator (v2) implements an improved method to account for missing data by calculating a unique fixed-inverse variance-weighted harmonic mean (Peel et al. 2013). Here, the sample size is taken as the weighted mean sample size across loci, with weights based on the number of alleles. If no data are missing, the formulas using weighted harmonic means reduce to the formulas given by Jorde & Ryman (2007). In evaluations using simulated data generated under a wide range of scenarios (see Fig. 1, Peel et al. 2013), this new method outperformed the simple weighted mean that was implemented to correct for the presence of missing data in version 1.4 of NeEstimator. It is also preferred over the method used to jointly weight by sample size and number of independent comparisons, which was implemented in LDNe (Waples & Do 2008).
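The general idea of a weighted harmonic mean sample size can be sketched as follows. The exact weighting scheme of Peel et al. (2013) is not reproduced in this note, so using the raw number of alleles per locus as the weight is purely an assumption for illustration, as are the function name and the example values.

```python
def weighted_harmonic_mean(sample_sizes, weights):
    """Weighted harmonic mean: sum(w) / sum(w / n) over loci."""
    if len(sample_sizes) != len(weights):
        raise ValueError("one weight is required per locus")
    return sum(weights) / sum(w / n for n, w in zip(sample_sizes, weights))


# Hypothetical data: individuals successfully genotyped at each of three loci,
# and the number of alleles observed at each locus (used here as the weight).
genotyped = [50, 44, 38]
allele_counts = [2, 5, 8]
effective_sample_size = weighted_harmonic_mean(genotyped, allele_counts)
```

When every locus has the same number of genotyped individuals (no missing data), the weighted harmonic mean equals that common sample size regardless of the weights, which is consistent with the statement that the formulas reduce to the standard ones when no data are missing.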

Data output files

One potential downside of software offering multiple analysis methods is large, hard-to-navigate output data files. NeEstimator (v2) overcomes this by generating a simple default output file which describes estimated population parameters for each selected analysis method and allows the user the option to select additional, more detailed output files. For example, the user can choose to have results for each method printed in a separate file that is organized in a streamlined, tabular format that is easy to analyse and import into other software. Other options include reporting frequency data at each locus and results for each pair of loci in the linkage disequilibrium method; the latter can be particularly useful for evaluating evidence for physical linkage in genomics studies.

Confidence intervals

NeEstimator (v2) provides confidence intervals for all methods and in several cases implements new and improved routines. Potential bias associated with standard parametric chi-squared confidence intervals for the LD method is reduced by implementing the jackknife method of Waples & Do (2008) as an alternative and allowing the user to determine whether one or both intervals are relevant for their analyses. For the heterozygote-excess method, our implementation corrects an error in the confidence interval method proposed by Zhdanova & Pudovkin (2008). Nomura (2008) did not propose a method for constructing confidence intervals for his molecular coancestry method; NeEstimator (v2) implements a new jackknife method developed specifically for this purpose. An important caveat is that the performance of the new methods for confidence intervals implemented in NeEstimator (v2) has not been evaluated. In particular, use of large numbers (hundreds or thousands) of SNP loci, many of which inevitably will be linked, introduces important issues related to pseudo-replication that need quantitative evaluation. Precision of estimates based on large numbers of loci might be substantially lower than suggested by traditional methods for computing CIs.

Negative or infinite estimates of Ne

All the methods considered here are based on a genetic index that has two components: one due to genetic drift (the signal) and one due to sampling a finite number of individuals. Unbiased estimators depend on knowing the sample size (so that the expected magnitude of sampling error can be calculated) and subtracting that from the index. By chance, however, the actual amount of sampling error can be larger than expected, in which case it is possible for the correction to result in a negative estimate of Ne. The usual interpretation in this case is that the estimate of Ne is infinity, that is, there is no evidence for variation in the genetic characteristic caused by a finite number of parents – it can all be explained by sampling error (see discussion in Waples & Do 2010). An equivalent phenomenon also can occur with unbiased estimators of genetic distance or FST (e.g. Nei 1978; Weir & Cockerham 1984).

In NeEstimator (v2), negative point estimates and confidence intervals are reported as ‘infinity’ in the main output file. In accessory output files, however, the actual negative values are reported, because negative estimates of Ne contain valuable information when included in harmonic mean calculations to provide an overall estimate of Ne, for example, when there are several replicate samples from the same population.
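The value of keeping the raw negative estimates can be seen in a small harmonic-mean calculation. This is a minimal sketch with made-up replicate estimates; the function name and the rule of reporting a non-positive mean of 1/Ne as infinity are assumptions for the example, not a description of the software's internals.

```python
import math


def harmonic_mean_ne(estimates):
    """Harmonic mean of Ne estimates; negative raw values are allowed."""
    mean_inverse = sum(1.0 / ne for ne in estimates) / len(estimates)
    # Assumption: a non-positive mean drift signal is reported as infinity.
    return 1.0 / mean_inverse if mean_inverse > 0 else math.inf


# Hypothetical replicate estimates from one population; one is negative.
raw = [80.0, 150.0, -400.0]
# The same replicates with the negative value reported as infinity (1/inf = 0).
as_infinity = [80.0, 150.0, math.inf]

combined_raw = harmonic_mean_ne(raw)  # about 180
combined_inf = harmonic_mean_ne(as_infinity)  # about 156.5
```

Replacing the negative value with infinity discards the information that the drift signal in that replicate fell below its expectation, so the two combined estimates differ.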

Rare alleles

NeEstimator software (v2) provides options for screening out rare alleles for all methods except molecular coancestry (for which allele frequency is not an issue), using the same protocols as LDNe (Waples & Do 2008). By default, the software conducts and reports results for separate analyses that use all alleles, or which screen out alleles with frequencies below PCrit values of 0.01, 0.02, and 0.05. The user can change these default settings to implement any desired PCrit value(s). The user also has an option to select an additional output file that contains the allele frequencies for each locus for each population and reports the number of alleles per locus that were removed because they were below the user-specified PCrit value(s).
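Screening with PCrit amounts to dropping alleles whose frequency falls below the threshold. The sketch below applies the thresholds named above to one hypothetical locus and counts the alleles removed; the function name and data layout are assumptions, and the frequencies of the remaining alleles are deliberately left unadjusted because the text does not say how the software treats them.

```python
def screen_rare_alleles(allele_freqs, pcrit):
    """Keep alleles whose frequency is at least pcrit."""
    return {allele: f for allele, f in allele_freqs.items() if f >= pcrit}


# Hypothetical allele frequencies at one locus in one population.
locus = {"A": 0.60, "B": 0.385, "C": 0.015}

# Thresholds from the text: 0.05, 0.02, 0.01, and 0 (all alleles retained).
for pcrit in (0.05, 0.02, 0.01, 0.0):
    kept = screen_rare_alleles(locus, pcrit)
    removed = len(locus) - len(kept)
    print(f"PCrit = {pcrit}: kept {sorted(kept)}, removed {removed}")
```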

Examples

We used genetic data simulated using Easypop (Balloux 2001) to illustrate some of the novel features of NeEstimator (v2). In the first example, we simulated two groups of 100 isolated populations with true Ne = 100, and for each group, we estimated effective size using the three single-sample estimators. In the first group of populations, we tracked 20 loci similar to microsatellites (μ = 0.0005, maximum of 10 alleles per locus); in the second group, we tracked 200 loci similar to SNPs (μ = 10⁻⁷, maximum of two alleles per locus). We initialized using the Maximum Diversity option and used a burn-in period (25 generations) that produced an average heterozygosity for the ‘microsat’ loci of approximately 0.8 and ensured that most or all of the ‘SNP’ loci were still segregating for both alleles in most or all populations. We used PCrit = 0.02 for the microsat loci and used all alleles for the SNP loci. All 100 individuals were sampled for the genetic analyses. We recorded the minimum and maximum estimates for each method, as well as the harmonic mean N̂e and the coefficient of variation in 1/N̂e, which is the drift signal these methods respond to (Wang 2001, 2009). For the latter two metrics, infinite estimates were converted to 99 999.

Results (Fig. 2; Table 1) show that about 80–90% of the estimates for LDNe fell within the range 80–120, while the remaining values were between 120 and 160. For the other two methods, in contrast, most estimates were substantially too low or too high, with 10% or less of the N̂e values falling between 80 and 120. About 30% of the estimates for the heterozygote-excess method were infinite, as were 10–20% of those for the molecular coancestry method. An interesting result was that patterns of bias and precision for all three methods were similar for the ‘microsat’ and ‘SNP’ analyses. As expected, based on previous results (Waples & Do 2010), use of PCrit = 0.02 with the microsat loci led to a slight (6%) upward bias for the LD method, while the upward bias was slightly lower (<3%) for the SNP analyses, which had few rare alleles because of the relatively short burn-in period. Precision of the LD method was slightly better with 200 SNPs compared with 20 microsats (slightly tighter range and slightly smaller CV; Table 1); again, this agreed with a previous theoretical prediction (Waples & Do 2010), which suggested that, for the LD method, precision comparable to that of 20 ‘microsat’ loci could be achieved with about 180 diallelic ‘SNP’ loci. Results for the heterozygote-excess method were nearly identical for the two marker types: ~20% downward bias in N̂e, high CV, and a large fraction of infinite estimates. Harmonic mean N̂e for the coancestry method was substantially lower than true Ne (60% lower for microsats and >70% lower for SNPs), while precision was slightly improved with SNPs (fewer infinite estimates, lower CV).

Table 1. Summary of results comparing performance of effective size estimators on simulated data with true Ne = 100

Single sample
                Microsats; PCrit = 0.02               SNPs; PCrit = 0
                LDNe      Het Excess  Coancestry      LDNe      Het Excess  Coancestry
Hmean (N̂e)      106.0     77.5        39.6            102.7     78.4        28.2
Min             82.2      18.2        14.6            84.1      19.5        11.3
Max             139.6     Infinite    Infinite        131.7     Infinite    Infinite
% Infinite      0.0       30.0        17.0            0.0       29.0        11.0
CV (1/N̂e)       0.114     0.974       0.777           0.095     0.992       0.648

Temporal
                PCrit = 0                             Fk
                Fs        Fc          Fk              PCrit = 0.05  PCrit = 0.02  PCrit = 0
Hmean (N̂e)      98.1      113.6       111.7           96.9          100.6         111.7
Min             59.5      83.2        81.0            57.2          65.1          81.0
Max             167.3     157.0       156.1           179.1         177.3         156.1
CV (1/N̂e)       0.204     0.144       0.144           0.227         0.182         0.144

In each case, estimates of Ne reflect data for 100 replicates. Figure 2 shows distribution of single-sample estimates summarized here.

Figure 2. Distribution of estimates of effective population size (N̂e) from three single-sample estimators, based on 100 replicate, simulated data sets using 20 ‘microsatellite’ loci (top panel, with up to 10 alleles each) or 200 ‘SNP’ loci (bottom panel, with up to two alleles each). True Ne was 100. The numbers above the vertical bars for N̂e > 200 indicate the number of those estimates that were infinitely large. The microsatellite analyses used PCrit = 0.02; the SNP analyses used PCrit = 0.

The second example compared the three methods for computing the temporal F. We simulated a metapopulation of 50 populations, each with effective sizes of 100. After a burn-in period of complete panmixia (island model with migration rate = 0.98 per generation), we imposed a single generation of isolation before collecting data. This produced 50 populations of Ne = 100 that, on average, were as divergent from each other as would be samples from a single population taken two generations apart. We made 25 independent pairwise comparisons of these 50 populations and treated them as temporal samples taken two generations apart. We repeated the process four times to produce 100 replicate temporal comparisons. Sample size again was 100 individuals. For the temporal comparisons, we tracked 20 ‘microsat’ loci with up to 20 alleles each (hence large numbers of rare alleles). In the first analysis, we compared performance of Fs, Fc and Fk using all alleles (PCrit = 0). Results (Table 1) agreed with Jorde & Ryman's (2007) conclusion that Fs is both less biased and less precise than Fc and Fk. Harmonic mean N̂e for Fs showed a slight (<2%) downward bias, while the other two estimators were both biased upwards by >10%. On the other hand, CV (1/N̂e) for Fs was 50% higher than for the other two indices. Table 1 also shows in more detail how rare alleles affect the estimates from Fk: there is little or no bias for PCrit = 0.02 or 0.05; substantial upward bias only occurs when alleles at frequency <0.02 are allowed into the analysis. Information like this can be used to identify PCrit values for each method that strike an appropriate balance between increasing precision and minimizing bias. In this example, using PCrit = 0.02 rather than 0.05 is a win-win situation (more precision and less bias), but going all the way to PCrit = 0 would appear to be a poor choice (bias increases from negligible to >10% while CV drops by only one-fifth) unless the user was much more concerned about precision than bias.

These examples should not be considered to represent definitive evaluations of performance of any of these methods, as only a few specific scenarios were considered. Nevertheless, they illustrate how easy it is, using routine features of the new NeEstimator (v2), to generate comparative information that previously would have required much more effort to compile.

Download and usage

The software is available at no cost (http://molecularfisherieslaboratory.com.au/neestimator-software) in MacOS, Linux and Windows versions. To run the NeEstimator (v2) software, start the graphical user interface file: Windows or Mac users can double-click on the NeEstimator.jar files, whereas Linux users can start the program from the command line by executing ‘java -jar ./NeGUI.jar’. An example input data file is provided, as well as a help file in .pdf and .html formats.

Although NeEstimator (v2) can in theory handle arbitrarily large numbers of individuals, loci and populations, large combinations can slow the program considerably, and it is possible that the capabilities could be exceeded at some point. We have successfully run the 32-bit program using the LD method with a data set that included a single population with 27 individuals and >46 000 loci. This analysis involved calculation of r² values for over one billion pairs of loci. The analysis, including calculation of jackknifed confidence intervals, took about 2 h for each PCrit value used on a Dell OptiPlex 390 PC running Windows 7.
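The figure of over one billion locus pairs follows directly from the number of unordered pairs among the loci, n(n − 1)/2. A short check, using 46 000 as a lower bound on the number of loci (the function name is only illustrative):

```python
from math import comb


def locus_pairs(n_loci):
    """Number of unordered pairs of loci, n * (n - 1) / 2."""
    return comb(n_loci, 2)


print(locus_pairs(46_000))  # 1057977000
```

Because the pair count grows with the square of the number of loci, doubling the loci roughly quadruples the work for the LD method.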

Acknowledgements

We thank the authors of the methods included here (Per Erik Jorde, Tetsuro Nomura, Alexander Pudovkin and Oxana Zhdanova) for reviewing and confirming the accuracy of implementations of their methods. We are also indebted to our cadre of beta testers, who diligently evaluated earlier versions of the software and provided valuable comments and feedback (Tiago Antão, Dean Blower, Mark Christie, Christine Dudgeon, Jon Hesse, Wes Larson, Friso Palstra, Ivan Phillipsen, Malin Pinsky and Ryan Waples), and to Dezhi Peng for sharing a large data set.

References

C.D. wrote the software, with input from R.W., D.P. and G.M. J.O. and R.W. led the project. J.O. coordinated the project. J.O. and B.T. wrote the NeEstimator v2 help file and drafted the manuscript. All authors contributed to and approved the final manuscript.