Using heterozygosity to estimate a percentage DNA sequence similarity for environmental species’ delimitation across basidiomycete fungi


Until recently, characteristics used to circumscribe species have been morphological, including macro- and micromorphological characters; ecological, including host specialization; and/or reproductive, including mating studies. Operational criteria for the delimitation of species, including tree- and nontree-based approaches, were explored by Sites & Marshall (2004), who concluded that all methods will sometimes fail to delimit species’ boundaries or will provide conflicting results. Typically, species’ recognition is a slow, meticulous process, but the ongoing rapid loss of natural habitats, coupled with global climate perturbations, has led to concern that the rate of species’ discovery must be accelerated. For the septate filamentous fungi, there are additional problems in species’ discovery, including ephemeral fruitbodies, clades that have inconspicuous fruitbodies or never fruit, and few diagnostic morphocharacters. To assess total fungal biodiversity, both above-ground and below-ground, attention has turned to environmental DNA sequencing (see, for example, Anderson & Cairney, 2004; O’Brien et al., 2005; Smith et al., 2007). This interest in environmental sequencing is producing an exponential increase in unidentified DNA sequences, and has created a need for guidelines to estimate and identify the number of species and individuals represented in a given environmental sample based on DNA sequence homology.

Unfortunately, there are no established criteria for determining whether two sequences represent genetic variants of the same species. Smith et al. (2007) and Morris et al. (2008), in studies of ectomycorrhizal community structure in a dry oak forest, used an estimate of ≤ 3% internal transcribed spacer (ITS) sequence divergence to indicate conspecificity. This estimate allowed for a 0.2–1.2% error rate generated by PCR, cloning and unidirectional sequencing artefacts. Similarly, Izzo et al. (2005) treated ≤ 4% ITS sequence divergence as indicating conspecificity, with error rates estimated at 0.5–1%. O’Brien et al. (2005), Ryberg et al. (2008) and Walker et al. (2008) all assumed ≤ 3% ITS sequence divergence as indicating conspecificity. For fungi, the problem of whether a single general percentage sequence similarity can be used as an indicator of conspecificity was discussed at a recent Fungal Environmental Sampling and Informatics Network (FESIN) meeting. The consensus was that no arbitrary percentage sequence similarity could accurately indicate conspecific taxa across all fungi (Bruns et al., 2007). Using publicly available fungal ITS sequences for kingdom Fungi, Nilsson et al. (2008) estimated the divergence between sequences deposited under the same binomial. Infraspecific divergence ranged from a low of 0.2% (Aspergillus fumigatus) to a high of 24.2% (Xylaria hypoxylon). For kingdom Fungi overall, the weighted percentage within-species’ divergence was 2.51% and, for the Basidiomycotina, 3.33%. The authors demonstrated that the ITS region is not equally variable across all groups of fungi and argued that no single level of variability could be used for estimates of conspecificity. These and other studies based on GenBank accessions must, by their nature, include misidentifications and as yet unidentified cryptic or morphologically similar species in estimates of infraspecific variability (see Bridge et al., 2003; Nilsson et al., 2006, 2008).
Ideally, estimates of infraspecific variability should include collections across the geographical range of the species, but there is the largely unresolved question of whether widely dispersed and/or intercontinental distributions under one morphological species’ epithet represent one or several biological or phylogenetic species (Taylor et al., 2006; Hughes et al., 2007; Taylor, 2008). This study focuses on putative biological species within a region of significant biogeographical diversity and avoids errors based on the inclusion of cryptic species in estimates of infraspecific genetic variability.

The Great Smoky Mountains National Park (GSMNP, south-eastern USA) is a unique venue to examine infraspecific genetic divergence. A United Nations Educational, Scientific and Cultural Organization (UNESCO) World Heritage site known for its high biodiversity and endemism, GSMNP is especially rich in salamanders, mosses and fungi. The exceptional biodiversity of GSMNP is a function of the varied habitat and post-glacial repopulation from disjunct refugia. During the last glacial period, the area around the current GSMNP was predominantly tundra, although small islands of deciduous trees may have existed (Delcourt, 1985; Pielou, 1991). With post-glacial warming, vegetation migrated from refugia along the Mississippi river banks (Delcourt & Delcourt, 1984), from Mexico and Central America (see Lickey et al., 2002), and potentially from other unknown refugia, into the Appalachians, whereas cold-tolerant vegetation migrated northwards or to higher altitudes, establishing the now threatened disjunct Spruce–Fir ecosystem islands of the southern Appalachian Mountains (Hughes & Petersen, 2004). Currently, GSMNP contains both subtropical and northern/boreal elements. Hybridization of genetically divergent lineages within species occurs within the Park (Lickey et al., 2002; Mata et al., 2007) and c. 50% of basidiomycete collections from GSMNP are heterozygous for one or more bases of the ribosomal ITS region (see later).

The All Taxa Biodiversity Inventory (ATBI) for agaric fungi at GSMNP was designed, in part, to map patterns of ectomycorrhizal fungal biodiversity to plant ecosystems in GSMNP and to examine changes in ectomycorrhizal fungal species and distributions over 50 yr of collecting in the Park. As part of this study, ITS sequences were generated for collections made over a 3-yr period (2004–07). As samples used for DNA analyses were derived from presumed dikaryotic fruitbody tissues (gills, cap, etc), we assume that the observed heterozygosity represents the product of mating between two different heterothallic parental genotypes (inbreeding fungi would tend to become homozygous and thus would not have been included in this study). The exceptions to this assumption may be basidiocarps that are a mosaic of two or more nonmating genotypes, interspecific hybridization and, for multicopy loci, such as rDNA, departures from strict concerted evolution. Apart from these exceptions, if the biological species’ concept is followed, haplotypes derived from a single fruitbody would represent genetic variation within the same biological species, i.e. individuals that are genetically similar enough to mate, form fruitbodies and produce spores (Mayr, 1942). The analysis of the sequence divergence of these haplotypes may therefore provide data contributing towards the determination of biological conspecificity from sequence data. We examined the percentage base heterozygosity as a proxy for percentage sequence divergence in a subset of 100 predominantly agaric fungal collections from GSMNP. Fifty-two per cent of these collections are presumed to be ectomycorrhizal (Table S1, see Supporting Information).

Collections of fruitbodies (mushrooms and allies) were made throughout the year, c. 500 yr−1. Fruitbodies were collected c. every 2 wk between mid-May and November, and occasionally over the winter season, for a 3-yr period by experts in various groups of fungi, identified morphologically, photographed, dried and accessioned into the University of Tennessee Herbarium (TENN) or into the collector's institutional herbarium. A small sample was reserved before drying and stored in silica gel beads in a 2-ml microfuge tube at −80°C until DNA extraction. DNA extraction from dried fungal tissue has been described previously (Mata et al., 2007). The ribosomal ITS1-5.8S-ITS2 region was amplified using primers ITS1F (Bruns & Gardes, 1993) and ITS4 (White et al., 1990). Primers for sequencing were ITS5 (White et al., 1990) and ITS4. Cloning was accomplished with the Promega pGEM-T cloning vector and JM109 competent cells following the manufacturer's directions. Cloning was required when a DNA sequence was heterozygous for more than one simple (1–2 bp) insertion or deletion event (indel), or if the fruitbody was contaminated with a second fungus (e.g. Candida fungicola). Cloned sequences made up 13% of the total. Clean, uncomplicated sequences were sequenced unidirectionally; bidirectional sequencing was used in all other cases.

Sequence data for this study comprised two datasets: sequences that required cloning to obtain a clean ITS sequence, and sequences for which heterozygosity could be resolved without cloning. Collections were selected from each dataset using a random number generator, 13 from the cloned dataset and 87 from the uncloned dataset, for a total of 100 collections. Sequences from the uncloned dataset that were unclear or ‘dirty’, or did not show heterozygosity, were replaced by further random selection from the sequence dataset.
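The selection procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_collections`, the `is_usable` predicate (standing in for the manual check that a sequence is clean and heterozygous) and the example identifiers are hypothetical, and the original random number generator is not specified in the text.

```python
import random

def select_collections(ids, n_wanted, is_usable, seed=None):
    """Draw collections at random until n_wanted usable ones are found.

    ids       -- identifiers of all collections in one dataset
    n_wanted  -- number of usable collections required (e.g. 87 or 13)
    is_usable -- hypothetical predicate: True if the sequence is clean
                 and heterozygous, False if it must be replaced
    Returns (chosen ids, number of collections examined).
    """
    rng = random.Random(seed)
    pool = list(ids)
    rng.shuffle(pool)  # random order without repeats
    chosen = []
    examined = 0
    for cid in pool:
        if len(chosen) == n_wanted:
            break
        examined += 1
        if is_usable(cid):
            chosen.append(cid)  # unusable draws are simply skipped
    return chosen, examined

# Hypothetical example: 20 collections, even-numbered ones are usable.
chosen, examined = select_collections(range(20), 5, lambda c: c % 2 == 0, seed=1)
print(len(chosen), examined)
```

Returning the number of collections examined alongside the selection makes it possible to report the proportion of usable (heterozygous) collections among those drawn.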

For uncloned sequences without indels, the percentage heterozygosity was determined by examining the electrophoretograms for overlapping peaks indicative of base pair heterozygosity (two bases read at the same position in the DNA sequence). Heterozygosity was calculated as a percentage of total bases starting with the first base of the ITS1 region, including the 5.8S region and ending with the last base of the ITS2 region, as determined by comparison with Saccharomyces cerevisiae ribosomal 18S and LSU sequences V01335 (Rubtsov et al., 1980) and J10355 (Georgiev et al., 1981). For sequences with a simple indel, where it was possible to determine the sequence by comparing both forward and reverse DNA sequences, heterozygosity was calculated by counting both the number of overlapping peaks and the number of bases in indels. All base pairs in an indel were included in estimates of heterozygosity.
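As a minimal sketch of this calculation, assume that heterozygous positions have already been scored from the electrophoretogram and recorded as two-base IUPAC ambiguity codes, and that the sequence has been trimmed to the ITS1-5.8S-ITS2 region. The function name, the encoding of overlapping peaks and the example sequence are assumptions for illustration, not part of the original workflow.

```python
# Two-base IUPAC ambiguity codes, used here to mark overlapping peaks.
# Assumption: heterozygous sites were scored by eye and encoded this way.
HET_CODES = set("RYSWKM")

def percent_heterozygosity(its_seq, indel_bases=0):
    """Percentage of heterozygous bases across ITS1-5.8S-ITS2.

    its_seq     -- sequence from the first base of ITS1 to the last of ITS2
    indel_bases -- total number of bases in heterozygous indels
                   (every base of an indel is counted)
    """
    seq = its_seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    het_sites = sum(1 for base in seq if base in HET_CODES)
    return 100.0 * (het_sites + indel_bases) / len(seq)

# Hypothetical 600-bp sequence with two overlapping-peak positions.
example = "ACGT" * 150
example = example[:100] + "R" + example[101:300] + "Y" + example[301:]
print(round(percent_heterozygosity(example), 2))
```

The denominator here is the length of the scored sequence; whether indel bases should also be added to the total length is not stated in the text, so that choice is an assumption.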

For cloned ITS sequences, preliminary sequence alignment was performed in Wisconsin Package 10.3 (GCG, 2000) using the ‘pileup’ program (which generates an alignment based on sequence similarity), followed by manual adjustment to the alignment. Five clones were generated per fruitbody, occasionally 10 if the first five were all the same haplotype. All clones were compared using the Wisconsin Package ‘distances’ program using ‘uncorrected distance’. The two most divergent clones from this analysis were selected for the calculation of heterozygosity. The percentage difference between the paired sequences included point mutations and insertion/deletion events with each base of an indel counted as a difference. Where lengths of the paired sequences differed, the average length was used to calculate the percentage difference. An estimate of cloning error was calculated by examining the sequence of 20 clones of the ITS PCR product from a known homozygote, and this was subtracted from the calculated percentage heterozygosity.
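The clone comparison can be illustrated with a short plain-Python stand-in for the distance calculation (it does not reproduce the Wisconsin Package programs). It assumes the clones are already aligned with `-` as the gap character; the function names, the example clones and the choice to floor the corrected value at zero are assumptions.

```python
from itertools import combinations

def uncorrected_percent_difference(a, b):
    """Percentage difference between two aligned clones (gaps as '-')."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # gap in both clones: not a difference
        if x != y:
            diffs += 1  # substitutions and each indel base count once
    # Where ungapped lengths differ, use the average length.
    mean_len = (len(a.replace("-", "")) + len(b.replace("-", ""))) / 2
    return 100.0 * diffs / mean_len

def most_divergent_pair(clones, cloning_error=0.0):
    """Return the most divergent pair of clone names and its corrected divergence."""
    pair = max(
        combinations(clones, 2),
        key=lambda p: uncorrected_percent_difference(clones[p[0]], clones[p[1]]),
    )
    raw = uncorrected_percent_difference(clones[pair[0]], clones[pair[1]])
    # Assumption: the corrected value is not allowed to fall below zero.
    return pair, max(raw - cloning_error, 0.0)

# Hypothetical aligned clones from one fruitbody.
clones = {"c1": "ACGTACGTAC", "c2": "ACGTACGTAC", "c3": "ACGAAC--AC"}
pair, divergence = most_divergent_pair(clones)
print(pair, round(divergence, 2))
```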

ITS sequences starting with the first base of ITS1 and ending with the last base of ITS2 were individually submitted to a BLAST search in GenBank. Sequences were considered to ‘match’ a sequence in GenBank if more than 80% coverage gave a percentage homology of 98% or greater.
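The match rule can be written as a simple filter. In this sketch, ‘percentage homology’ is read as BLAST percentage identity and coverage as query coverage; both readings, the function name and the example hits are assumptions, and no BLAST call is made.

```python
def is_genbank_match(query_coverage, percent_identity):
    """Match rule: more than 80% coverage and at least 98% identity."""
    return query_coverage > 80.0 and percent_identity >= 98.0

# Hypothetical BLAST hits for one query sequence (values in per cent).
hits = [
    {"subject": "hit_A", "qcov": 95.0, "pident": 99.1},
    {"subject": "hit_B", "qcov": 80.0, "pident": 99.5},   # coverage not > 80%
    {"subject": "hit_C", "qcov": 100.0, "pident": 97.9},  # identity below 98%
    {"subject": "hit_D", "qcov": 81.0, "pident": 98.0},
]
matches = [h["subject"] for h in hits if is_genbank_match(h["qcov"], h["pident"])]
print(matches)
```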

The distribution of the percentage individual sequence divergence (heterozygosity) for collections from GSMNP is given in Fig. 1 and Table S1 (see Supporting Information). The proportion of heterozygous collections for the uncloned dataset was c. 50% (87/172 randomly selected collections), and 100% for the 13 randomly selected cloned collections. Within-collection sequence divergence (heterozygosity) in the randomly selected uncloned sample varied from 0.12 to 0.78% and, in the cloned sample, from 0.17 to 3.27%. When all 100 sequences were grouped and ≤ 2.00% sequence divergence was used to indicate conspecificity, 97% of the heterozygous collections would be correctly assumed to represent the same putative biological species from their individual ITS sequences. The remaining 3% of the collections, however, would be excluded and their individual haplotypes would be considered not to represent the same species. Using ≤ 3% to indicate conspecificity, we recovered 99% of heterozygous collections.
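The recovery figures are the share of collections whose within-collection divergence falls at or below a chosen threshold. A minimal sketch follows; the function name is hypothetical and the example values are invented for illustration (they are not the Table S1 data).

```python
def percent_recovered(divergences, threshold):
    """Percentage of collections with divergence <= threshold (both in %)."""
    if not divergences:
        raise ValueError("no divergence values supplied")
    kept = sum(1 for d in divergences if d <= threshold)
    return 100.0 * kept / len(divergences)

# Invented example values, NOT the study data.
example_divergences = [0.12, 0.30, 0.78, 1.50, 2.40, 3.27]
for cutoff in (2.0, 3.0):
    print(cutoff, round(percent_recovered(example_divergences, cutoff), 1))
```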

Figure 1. Percentage heterozygosity for 100 randomly selected collections of predominantly agaric fungi from the Great Smoky Mountains National Park, USA.

No percentage DNA sequence divergence will accurately determine whether two sequences represent the same species for a number of reasons: morphological species’ concepts vary from group to group and from expert to expert (Nilsson et al., 2008); morphological, phylogenetic and biological species’ concepts may not be congruent (Taylor, 2008). Morphospecies’ delineations may occur without significant sequence divergence (see, for example, Gymnopus dryophilus complex: Mata et al., 2007), and significant sequence divergence may occur without morphological differentiation (see, for example, Hughes et al., 2007). Data from a region such as GSMNP, where species of fungi exhibit extensive intraspecific sequence divergence, may be used to suggest an upper limit for determining whether two sequences from the same geographical area represent the same putative biological species, and, on the basis of our data, we suggest a value of 2–3% divergence. A 2% sequence divergence value should be used with caution as it will certainly result in an underestimate of true biological species’ diversity for geographically dispersed species; however, for studies involving agaric fungi from a restricted geographical area (most environmental studies to date), this estimate recovered 97% of heterozygotes representing a single biological species. Using 3% divergence as indicative of conspecificity increased the recovery of heterozygotes to 99%, but selecting this level of divergence increases the possibility that recent or cryptic species would be considered as conspecific. As noted above, most environmental sampling studies use 3% or greater. Estimates of conspecificity of > 3% sequence divergence from a small geographical area may underestimate the total biodiversity in that area.

One collection of Amanita citrina var. ‘lavendula’ (FB 13441) produced sequences that diverged by 3.27% when all bases in multibase indels were counted as separate events, but 2.66% when indels were considered to be the result of a single event. The latter estimate may be more reasonable as indels may not occur as a progression of single base pair insertions and deletions; however, in all environmental studies to date, no consideration of indel size has been mentioned and the assumption has been that all base pairs are included in estimates of sequence divergence. As these divergent sequences were derived from a single fruitbody, they theoretically represent a single biological species containing significant genetic diversity. Alternatively, these may represent hybridization between ‘different’ species. The line between these two concepts is not always clear and depends on the individual species’ concept(s) of the investigator.
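The two ways of counting indels can be compared with a short sketch, assuming a pair of aligned sequences with `-` as the gap character. The function name and the example alignment are hypothetical; here a multibase indel is taken to be an unbroken run of gap columns in one sequence.

```python
def count_differences(a, b, indel_as_single_event=False):
    """Count differences between two aligned sequences.

    With indel_as_single_event=False every gapped base is one difference;
    with True, each contiguous gap run in one sequence counts once.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    diffs = 0
    open_gap = None  # which sequence ('a' or 'b') has a gap run in progress
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if not indel_as_single_event or side != open_gap:
                diffs += 1
            open_gap = side
        else:
            open_gap = None
            if x != y:
                diffs += 1  # point mutation
    return diffs

# Hypothetical pair: one 4-base indel plus one substitution.
seq_a = "ACGTTTTACGA"
seq_b = "ACG----ACGT"
print(count_differences(seq_a, seq_b))                              # per-base counting
print(count_differences(seq_a, seq_b, indel_as_single_event=True))  # per-event counting
```

Dividing either count by the average sequence length gives the corresponding percentage divergence, which is how a single collection can yield two different estimates.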

Using 2% or less sequence divergence, 39% of sequences in this dataset ‘match’ sequences deposited in GenBank, but eight of the 39 matches were sequences previously deposited by the authors. An additional 9% of the ITS sequences matched environmental samples, but none of the environmental sequences was identical to sequences derived from above-ground fruitbodies, including sequences derived from soil and litter from the nearby North Carolina Piedmont region (O’Brien et al., 2005). Our data are similar to those generated by Brock et al. (2008). In their study of 279 ITS sequences from Kew Herbarium, 10% matched environmental sequences and only 30% were represented in GenBank. It is clear that significantly better coverage of fungal sequences is needed. Sampling herbaria and biotic surveys are valuable in filling such knowledge gaps.


We wish to thank the many specialists who have contributed to the fungal All Taxa Biodiversity Inventory in the Great Smoky Mountains National Park, including especially Drs Jean Lodge, Egon Horak, Andrew Methven, Bart Buyck, Brandon Matheny, Joseph Ammirati, Rod Tulloss, Clark Ovrebo, Juan Luis Mata, Joaquin Cifuentes Blanco, Roy Halling and participants in the 2004 mycoblitz. Many undergraduates have helped with DNA extractions and with specimen management, including David Mather, Anna Becker and Shawn Robertson. We thank two anonymous reviewers for their comments, and acknowledge support from the National Science Foundation, DEB 0338699.