Keywords:

  • heterozygosity-fitness correlation;
  • pipeline;
  • population genetics;
  • Python

Abstract

As the body of heterozygosity-fitness correlation (HFC) research grows, a growing number of increasingly complicated tests have become integral to a typical HFC analysis (Chapman et al. 2009). Currently, no software is available both to convert between the file formats required by these tests and to conduct the main regression analyses at the core of all HFCs. Heterozygosity-Fitness Pipeline (HeFPipe) is a script written in Python that accomplishes both of these tasks for studies based on microsatellite data. HeFPipe is designed to be used from the command line terminal and will run on any Mac OSX computer. The script takes input in the form of allele reports from either of the genotype-calling programs GeneMapper or GeneMarker and reconfigures the data into GENEPOP (Raymond & Rousset 1995), Rhh (Alho et al. 2010), RMES (David et al. 2007) and GEPHAST (Amos & Acevedo-Whitehouse 2009) formats. The script is also equipped to reformat the output from GENEPOP on the Web (option 5) and Rhh into csv spreadsheets that can be incorporated into downstream analyses. HeFPipe accommodates user-provided lists of samples and markers to be included in or excluded from analyses. HeFPipe is equipped to create generalized linear models (GLMs) from both the main data set and subsets of the data. Finally, HeFPipe allows users to explore single-marker effects and conduct correlation analyses. The script, a comprehensive manual, a link to a series of video tutorials, and an example data set are available from GitHub (http://github.com/Atticus29/HeFPipe_repos).

Identifying associations between heterozygosity and fitness is a lynchpin of studies intending to clarify the role that genetic diversity plays in the survival and reproductive success of individuals. HFCs have a long history in the population genetics literature (Mitton & Grant 1984; Pogson 1991; Britten 1996; Coltman & Slate 2003; Chapman et al. 2009), and they were originally conducted by genotyping samples at a modest panel of allozyme loci and regressing one or more traits associated with fitness (e.g. survival, growth rate) on multilocus heterozygosity (MLH) as measured by the panel (Britten 1996). Modern HFC studies almost exclusively employ microsatellite markers rather than allozymes, and an emphasis has been placed on the use of large numbers of markers (Balloux et al. 2004), although few studies have yet fulfilled this recommendation (Chapman et al. 2009). The shift to larger marker panels containing potentially neutral loci has brought with it a wider availability of statistical tests that help researchers explore the nature of heterozygosity-fitness correlations in their study systems. These tests make it possible to determine (i) whether the MLH of the marker panel is reflective of genome-wide MLH (Balloux et al. 2004; Alho et al. 2010), (ii) whether there is identity disequilibrium (ID) among the markers and consequently inbreeding sensu lato in the study system (David et al. 2007; Szulkin et al. 2010) and (iii) whether there is evidence for single-marker effects on the trait(s) of interest (David 1997; Amos & Acevedo-Whitehouse 2009; Szulkin et al. 2010) (Fig. 1). The programs now available to conduct these tests, as well as to run the regressions and correlations that are the core of HFC analyses, require input files of different formats, and there is currently no software that converts among these formats.
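The core calculation described above, regressing a fitness-related trait on MLH, can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not code from HeFPipe or any other HFC tool: MLH is taken here as the proportion of typed loci that are heterozygous (one common definition), the regression is ordinary least squares with a single predictor, and the function names and genotype encoding are hypothetical.

```python
def multilocus_heterozygosity(genotypes):
    """Return the proportion of typed loci that are heterozygous.

    `genotypes` holds one entry per locus: a tuple of two allele codes,
    or None where the locus could not be scored (hypothetical encoding).
    Returns None if no locus was typed.
    """
    typed = [g for g in genotypes if g is not None]
    if not typed:
        return None
    heterozygous = sum(1 for a1, a2 in typed if a1 != a2)
    return heterozygous / len(typed)


def regress_trait_on_mlh(mlh, trait):
    """Fit trait = intercept + slope * MLH by ordinary least squares.

    Returns (slope, intercept).
    """
    if len(mlh) != len(trait) or len(mlh) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    n = len(mlh)
    mean_x = sum(mlh) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in mlh)
    if sxx == 0:
        raise ValueError("MLH does not vary among individuals")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mlh, trait))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

A positive slope would be consistent with a positive HFC, but the sketch performs no significance test and makes no allowance for non-normal response variables.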

Figure 1. Simplified flowchart depicting the Heterozygosity-Fitness Pipeline (HeFPipe). The input files listed in the top section are used at various relevant points throughout the pipeline. The chronological flow of the pipeline is depicted by the direction of the arrow in the middle of the figure, and the output of the pipeline is depicted on the right, while a brief description of the relevance of each output item to a heterozygosity-fitness correlation (HFC) analysis is given on the left. The items listed under ‘Output’ are files generated by HeFPipe that are either useable by external programs (GENEPOP, RMES, GEPHAST, Rhh) or are themselves core results of HFC analyses (correlations, regressions). Other, less-essential output files that are also products of HeFPipe are not described in this figure but are discussed in the HeFPipe manual and tutorial videos, along with instructions for how to use the output from the external programs listed in this figure as input in subsequent steps of the HeFPipe pipeline.

Heterozygosity-Fitness Pipeline (HeFPipe) is a script written in Python that conducts analyses typically performed in HFC studies. It also tests for evidence of single-marker effects on one or more traits. More specifically, HeFPipe takes input in the form of allele reports in the ‘Marker Table’ style from either of the microsatellite genotype-calling programs GeneMarker or GeneMapper and reconfigures the data into GENEPOP (Raymond & Rousset 1995), Rhh (Alho et al. 2010), RMES (David et al. 2007) and GEPHAST (Amos & Acevedo-Whitehouse 2009) formats (Fig. 1). The script is also equipped to reformat the output from GENEPOP on the Web (option 5) and Rhh into comma-separated values (csv) formatted spreadsheets and incorporate them into downstream analyses. The HeFPipe script accommodates user-provided lists of markers to be included in or excluded from analyses, a list of samples to exclude from analyses, and a spreadsheet containing trait values on which to perform the HFCs and search for single-marker effects. These input files allow the user to refine and repeat each analysis with ease. For the analyses that require regression (HFCs and one of the tests for single-marker effects), HeFPipe is equipped to run generalized linear models (GLMs) using the Python package PypeR (Xia et al. 2010), a package that enables the statistics software R (R Development Core Team 2011) to be used in the context of a Python script. Using GLMs, the user is able to assign a link function and error distribution appropriate for the response variable in a particular model, which can be used to relax some of the assumptions of general linear regression. The script is also equipped to conduct the regression analyses on subsets of the data set, which might be desirable in various scenarios, such as where HFCs are predicted to appear in stressed individuals (e.g. Pujolar et al. 2006; Schmeller et al. 2007).
Single-marker effects are explored using methodologies described in David (1997) and using GEPHAST (Amos & Acevedo-Whitehouse 2009). Correlations (both Pearson and Spearman) among the traits provided are reported in several different formats (as text, spreadsheets, and images); significance tests are conducted on these correlations, and the P-values (both adjusted and unadjusted for multiple comparisons) are also reported in the various formats.
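The reformatting step described above can be illustrated with a minimal sketch in Python 3; it is not HeFPipe's own code. It assumes that genotypes have already been parsed from an allele report into an in-memory structure. The function name `to_genepop`, the sample and locus names, and the choice to write allele sizes directly as three-digit codes are all assumptions made for illustration. The layout written (a title line, one locus name per line, a `Pop` line, then `sample , genotypes` rows with `000000` for a missing genotype) follows the standard GENEPOP input format.

```python
def to_genepop(title, loci, populations):
    """Render genotypes as text in GENEPOP three-digit input format.

    `loci` is an ordered list of locus names. `populations` is a list of
    populations, each a list of (sample_id, calls) pairs, where `calls`
    maps a locus name to a tuple of two integer allele codes, or to None
    (or is missing the locus) when the genotype is unknown.
    All names and structures here are hypothetical.
    """
    lines = [title]
    lines.extend(loci)  # one locus name per line
    for population in populations:
        lines.append("Pop")  # starts each population block
        for sample_id, calls in population:
            fields = []
            for locus in loci:
                call = calls.get(locus)
                if call is None:
                    fields.append("000000")  # missing genotype
                    continue
                for allele in call:
                    if not 1 <= allele <= 999:
                        raise ValueError("allele code out of range: %r" % (allele,))
                fields.append("%03d%03d" % call)
            lines.append("%s , %s" % (sample_id, " ".join(fields)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [[
        ("ind01", {"LocA": (152, 156), "LocB": (201, 201)}),
        ("ind02", {"LocA": (152, 152), "LocB": None}),
    ]]
    print(to_genepop("Example microsatellite data", ["LocA", "LocB"], demo))
```

Writing allele sizes as their own three-digit codes keeps the sketch short; a real converter might instead recode alleles, and would also need to handle the specific column layout of the GeneMarker or GeneMapper report, which is not reproduced here.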

Heterozygosity-Fitness Pipeline is designed to be used from the command line terminal and will run on any Mac OSX computer that has Python v 2.7.3 and R v 2.15.1 (or compatible versions) installed. Several features of the pipeline depend on properties of UNIX-based operating systems that are not native to Windows-based operating systems. The script, a users' manual, a link to a video tutorial and an example data set are available at GitHub (https://github.com/Atticus29/HeFPipe_repos). Software dependencies, including packages in both R and Python required for the pipeline (e.g. PypeR), are listed in the manual, as are brief instructions for their installation.

Related manuscripts

An empirical manuscript for which the pipeline was developed is in preparation.

Acknowledgements

I thank Xiao-Qin Xia for help with PypeR and Emily Bewick, Kerin Bentley, and Brian Shamblin for providing additional data sets on which HeFPipe was tested. I also thank Kenneth Ross and DeWayne Shoemaker, who reviewed an early version of the manuscript, as well as three anonymous reviewers for constructive comments. This work was funded in part by the Georgia Agricultural Experiment Stations, University of Georgia.

References

Fisher wrote the software and manuscript, curated and tested the GeneMarker and GeneMapper example data sets and recorded the tutorial video series.

Data Accessibility

HeFPipe scripts, a users' manual and example data are deposited in GitHub: http://github.com/Atticus29/HeFPipe_repos.