Sequence analysis of a compound coding-region microsatellite in Candida albicans resolves homoplasies and provides a high-resolution tool for genotyping


*Corresponding author. Tel.: +1 (619) 534-5474; Fax: +1 (619) 534-7108; E-mail:


Sequence diversity at a coding-region microsatellite locus of two diploid Candida species was surveyed. Twenty-one alleles from fourteen strains of Candida albicans and three alleles from two strains of the closely related Candida dubliniensis were sequenced. Results show independent length variation in two contiguous hexanucleotide repeats, one non-contiguous hexanucleotide repeat, and two non-contiguous trinucleotide repeats within a 120 bp coding region. A neighboring, non-repetitive 120 bp region showed no variation. The information density of sequence polymorphisms in this region provides a powerful tool for genotyping microorganisms in epidemiological studies, yielding detailed resolution of closely related strains, and clearly distinguishing the two species studied here. The individual length-variable repeat regions are very short (2–8 repeats), demonstrating that even very short microsatellites can show high levels of length variability when surrounded by similarly repetitive DNA. Extensive homoplasy was discovered among the C. albicans alleles, with the majority of overall length categories consisting of alleles with more than one sequence. Our results show that microsatellite length alone should not be used to assume either sequence identity or identity by descent. Microsatellite length mutations appear to have generated the high degree of both inter- and intraspecific polymorphism seen at the ERK1 locus, and form an island of variability in an otherwise well-conserved gene.


1 Introduction

Microsatellites are tandemly repetitive DNA elements composed of short (mono- to hexanucleotide) repeat units [1]. They are common, easy to identify, and extremely polymorphic in higher eukaryotes, and have been widely used as molecular markers for genetic mapping and population structure studies [2]. Length polymorphisms appear to stem primarily from slipped-strand mispairing errors during replication [3, 4].

Microsatellite variation is usually assayed by comparing the lengths of different alleles using PCR amplification of the repeat region, followed by acrylamide gel electrophoresis of the resulting product. The loci assayed usually contain a single run of identical repeat motifs, and length variation is assumed to result from differences in the number of repeats.

Compound microsatellites, defined as loci containing more than one run of different repeat motifs, exhibit length polymorphisms similar to those seen in simple repeats [5]. Sequence data from compound repetitive loci can provide useful information regarding strain identity and the possible history of alleles, since the length of each variable region within the locus can be measured separately [6]. If variation results from repeat-number mutations in different parts of a compound microsatellite, the overall length cannot be used to infer identity of sequence. This is because different mutational events can cause ancestral alleles to converge on the same length, an effect known as homoplasy. Similar homoplastic events are likely to occur within simple repeats as well, but these will remain undetectable at the sequence level. Variability and homoplasy have been documented and studied in compound non-coding microsatellites of insects and primates [5, 6] but not in microorganisms or in coding-region microsatellites.
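The effect of length homoplasy can be illustrated with a minimal sketch. The locus below is entirely hypothetical (two trinucleotide runs and an invented spacer, not the ERK1 sequence); it only shows how a gain at one repeat and a loss at another yield two distinct sequences of identical length.

```python
from collections import defaultdict

# Hypothetical compound locus (NOT the ERK1 sequence): two trinucleotide
# runs separated by an invented, length-invariant spacer.
MOTIFS = ("CAA", "CTT")
SPACER = "GGATCC"

def build_allele(repeat_counts):
    """Assemble an allele sequence from one repeat count per motif."""
    first, second = (motif * n for motif, n in zip(MOTIFS, repeat_counts))
    return first + SPACER + second

alleles = {
    "a": build_allele((4, 3)),
    "b": build_allele((5, 2)),  # one repeat gained in CAA, one lost in CTT
    "c": build_allele((4, 2)),
}

# Group distinct sequences by overall length, as a gel assay would.
by_length = defaultdict(set)
for seq in alleles.values():
    by_length[len(seq)].add(seq)

# A length class holding more than one distinct sequence is homoplastic.
homoplastic = {length for length, seqs in by_length.items() if len(seqs) > 1}
```

Here alleles "a" and "b" fall into the same length class even though their sequences differ, so a length-only assay would score them as identical.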

In higher eukaryotes, microsatellite data can be very informative, since available microsatellites are often long (>20 repeats) and provide a large number of length variants (electromorphs) for use in genotyping [7]. In microorganisms, available microsatellites tend to be less common and shorter than those found in higher eukaryotes but have been shown to provide informative variation [8]. The lack of long simple sequences is probably related to the general scarcity of non-coding DNA in microorganisms. The long microsatellites used as markers in higher eukaryotes are usually found in non-coding DNA, where length mutations are less likely to affect gene function. The variability of repetitive sequences in microorganisms (compared to higher eukaryotes) has not been closely examined.

Candida albicans, a pathogenic diploid yeast, is primarily clonal [9–11]. This suggested to us that, given sufficiently long divergence times, even short microsatellites with relatively low mutation rates might be highly polymorphic, providing a useful tool with which to genotype individual strains. As part of an effort to develop informative genetic markers in this organism, we previously identified a length-variable compound microsatellite (ERK1) in the coding region of an ERK family protein kinase gene [8]. A 120 bp region of this gene contains several short (n=2 to 8 motifs) tri- and hexanucleotide repeats, as well as neighboring and intervening regions of related, but imperfect, repeats. This locus showed greater length diversity (seven electromorphs) than any of the simple microsatellites from this organism that were assayed in the same study. To determine the potential of compound microsatellites as informative microorganismal genetic markers and to resolve length homoplasies, we have sequenced 21 alleles of ERK1 from 14 independent standard and clinical samples of C. albicans and, for interspecies comparison, three alleles from two samples of the closely related Candida dubliniensis [12].

ERK1 is a coding microsatellite, and its length might therefore be under strong constraint. This could lead to correlated changes in different variable sites, since an insertion at one site might select for compensatory deletions, or vice versa. To test for this type of constraint, we quantified the real distribution of overall lengths and compared it with distributions obtained from shuffled data sets.

2 Materials and methods

2.1 Sample origin

The samples used in this work came primarily from a clinical study of HIV+ patients with opportunistic Candida infections, in which patients were sampled both before and after a 3 to 5 month anti-fungal treatment. All sequenced alleles shown in Section 3 are from different patients. We did, however, also sequence alleles from both the first and second samples in two cases to determine if the ERK1 microsatellite sequence was stable over the treatment period. Based on an analysis of additional microsatellite loci and RAPD patterns (Metzgar et al., in preparation), both these patients had appeared to carry the same strain of C. albicans throughout treatment.

All clinical samples of C. albicans were obtained from Pfizer Study 066-174. These were isolated from randomly selected HIV+ patients undergoing anti-fungal prophylaxis at several clinics. None of the patients exhibited yeast-related symptoms, and the Candida isolates were obtained from buccal, palatal and lingual swabs cultured on Sabouraud agar. Alleles from three standard American Type Culture Collection (ATCC) strains (14053, 36232, and 60193) were also used in this study. The C. dubliniensis samples, used here for interspecies comparison, were discovered among the clinical samples that had been designated as C. albicans. Divergent ERK1 sequences obtained from them had originally suggested they might be representative of a distinct clade. Our identification of them as C. dubliniensis was based on conserved base substitutions at a ribosomal RNA locus, as was done in the original paper describing this species [12]. In addition, the two species differ in their ability to grow on YPD agar at 42°C (C. albicans can grow, whereas C. dubliniensis cannot) [12]. We confirmed our species identifications using this test.

2.2 Identification of the ERK1 locus and primer development

The ERK1 compound microsatellite was identified by a computer survey of C. albicans sequences archived in Genbank. This search was part of an effort to identify informative polymorphic microsatellites in this organism [8]. Although the longest perfect repeat in the Genbank ERK1 sequence was only made up of four repeat units, it was chosen as a candidate marker locus because few longer perfect repeats were found in the available sequence data. It was therefore a surprise that it turned out to be so polymorphic.

2.3 DNA isolation, assays for length polymorphism, and cloning of alleles

All specific protocols used for these steps were previously described in the paper reporting the identification of the ERK1 compound microsatellite [8]. C. dubliniensis alleles were difficult to clone and amplify, due to considerable divergence of the primer sites between the two species. Amplification was accomplished by decreasing the annealing temperature of the cloning PCR reaction to 45°C and increasing the annealing and elongation times to 1.5 min per cycle. Efficient amplification of ERK1 from C. dubliniensis will require development of more precisely complementary primers. The primers used are listed below. Ribosomal RNA primers are from Sullivan et al. [12].


2.4 Sequencing of ERK1 and rRNA clones

Plasmid DNA was isolated from 5 μl overnight cultures of clones using a QIAGEN plasmid-prep mini-kit (QIAGEN), then resuspended in 30 μl sterile water. Two μl of the DNA solution was diluted to 9 μl in sterile water, then sequenced in both directions using T7 and T4 primers. Sequencing was accomplished using standard ABI sequencing methods. All sequences presented in this paper were obtained from at least two separate clones.

2.5 Sequence alignment

Sequences were aligned by hand. Since trinucleotide (and hexanucleotide) indels were by far the most common polymorphisms, these were assumed to be the source of any polymorphisms which could have arisen by either base substitution or replication slippage. The interspecies comparative alignment (Fig. 3) was ambiguous in short segments of the variable region.

Figure 3.

Alignment of ERK1 sequences from C. albicans and C. dubliniensis, entire amplified region (alleles 246 A and 246 F, respectively).

2.6 Genbank codes

All unique allele sequences presented in this paper were submitted to Genbank. The accession numbers are provided in Appendix A (allele codes refer to the designation used in Fig. 1).

Figure 1.

Alignment of 15 unique C. albicans ERK1 sequences (variable region only). Shown above is the Genbank sequence, identical to allele 237 A from the clinical isolates. All others were cloned from either clinical (C) or ATCC (A) strains. Allele designations refer to the length of the amplified product and sequence subtype among that size class. The five length-variable repeats are delineated by brackets: VH, variable hexanucleotide; VT, variable trinucleotide. Base substitutions are shown in bold and underlined.

2.7 Analysis of overall length constraint

The following procedure was executed in True Basic® for the Macintosh. The program is available on request. The length of each variable repeat within each of the 21 sequences was entered into an array. For 1000 replicates, the data set was then shuffled to randomize the associations between variable sites. This removes any between-site correlations from the data, while retaining the original distribution of repeat numbers for each variable site. The mean length remained constant, as all array elements were retained. The mean standard deviation for the shuffled replicates was calculated and compared to the standard deviation of the real data.
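The shuffling procedure can be sketched as follows. This is a Python illustration under stated assumptions, not the original True Basic program: the repeat-count table is invented (the per-allele counts are not tabulated here), the function names are hypothetical, the sample standard deviation is assumed, and only the variable-region length is summed because an invariant flank adds a constant that does not change the standard deviation.

```python
import random
import statistics

# Base pairs per repeat unit at the five variable sites
# (three hexanucleotide and two trinucleotide repeats).
UNIT_BP = (6, 6, 3, 6, 3)

# INVENTED repeat counts, one row per allele and one column per site;
# these are placeholders, not the sequenced ERK1 alleles.
observed = [
    (3, 0, 4, 1, 2),
    (4, 1, 6, 1, 3),
    (5, 2, 9, 2, 4),
    (4, 0, 5, 2, 2),
    (3, 2, 7, 1, 3),
    (5, 1, 8, 2, 4),
]

def overall_lengths(table):
    """Variable-region length (bp) of each allele."""
    return [sum(n * bp for n, bp in zip(row, UNIT_BP)) for row in table]

def shuffle_sites(table, rng):
    """Permute each site's column independently across alleles.

    This breaks any between-site correlation while keeping each site's
    own distribution of repeat numbers (and hence the mean length).
    """
    columns = [list(col) for col in zip(*table)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

def shuffled_sd_summary(table, replicates=1000, seed=0):
    """Mean and spread of the length SD over shuffled replicates."""
    rng = random.Random(seed)
    sds = [
        statistics.stdev(overall_lengths(shuffle_sites(table, rng)))
        for _ in range(replicates)
    ]
    return statistics.mean(sds), statistics.stdev(sds)

real_sd = statistics.stdev(overall_lengths(observed))
mean_sd, sd_of_sd = shuffled_sd_summary(observed)
```

If compensatory changes constrained overall length, `real_sd` would be expected to fall well below `mean_sd`; similar values are consistent with the sites varying independently.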


3 Results

The variable regions of sequenced ERK1 alleles from C. albicans and C. dubliniensis are shown in Figs. 1 and 2, respectively. Fig. 3 provides an alignment of the entire amplified region of one allele from each species. For C. albicans, the sequence data show 15 different alleles detectable at the sequence level among the six distinguishable electromorphs. The sequences are designated by a code made up of the number of nucleotides in the amplified region (electromorph group), accompanied by a letter representing each unique sequence within that size class. Among C. albicans alleles, five repetitive sites are found to be length variable in the 3′ half (120 bp) of the total amplified region (237 bp in the Genbank clone), while the rest of the region is completely length invariant. The five length-variable sites on the coding strand are, from 5′ to 3′, as follows (the range after each motif is the observed number of repeat units): (CAGGCT)3–5(CAAGCT)0–2… (CAA)4–9… (GCCGCA)1–2… (CTT)2–4.
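As an illustration of how repeat numbers at the five sites could be read from a sequence, the sketch below counts the longest perfect tandem run of each motif. The toy sequence and its spacers are invented, and the function name is hypothetical; the alleles in this study were aligned by hand, and a longest-perfect-run rule is only an assumption that ignores the imperfect flanking repeats.

```python
import re

# The five length-variable motifs, 5' to 3' on the coding strand.
MOTIFS = ("CAGGCT", "CAAGCT", "CAA", "GCCGCA", "CTT")

def longest_runs(seq, motifs=MOTIFS):
    """Longest perfect tandem run (in repeat units) of each motif."""
    runs = {}
    for motif in motifs:
        pattern = re.compile(f"(?:{motif})+")
        runs[motif] = max(
            (len(m.group(0)) // len(motif) for m in pattern.finditer(seq)),
            default=0,  # motif absent from the sequence
        )
    return runs

# Invented toy sequence: the spacers "TTG", "GGT" and "AC" are placeholders,
# not the real intervening ERK1 sequence.
toy = (
    "CAGGCT" * 3 + "CAAGCT" * 1 + "TTG"
    + "CAA" * 5 + "GGT"
    + "GCCGCA" * 2 + "AC"
    + "CTT" * 3
)

counts = longest_runs(toy)
```

Stray single copies of a short motif inside a longer one (for example CAA inside CAAGCT) do not affect the result, because only the longest run is kept.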

Figure 2.

Comparison of two ERK1 sequences from C. dubliniensis (variable region only).

For all five variable repeat regions, every possible intermediate length was found in at least one sequence. Three sequences also show single base substitutions in the length-variable region.

In the two cases in which alleles were sequenced from samples taken from the same patient both before and after a 3 to 5 month treatment regime, the ERK1 diplotype remained exactly the same (not shown). This suggests that the locus is at least stable enough to allow strains to be followed over short periods of time.

The distributions of overall lengths in real and shuffled data sets were not significantly different. The mean length in both the real and the shuffled data sets was 244.3. The standard deviation for the real data was 6.5, while the mean standard deviation for 1000 shuffled replicates was 6.7±0.89.
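As a rough check of this comparison, and assuming that ±0.89 is the standard deviation of the replicate standard deviations (the spread statistic is not named), the real value lies well within the shuffled distribution. The symbols below are introduced only for this illustration:

```latex
z = \frac{s_{\mathrm{real}} - \bar{s}_{\mathrm{shuffled}}}{\sigma_{s}}
  = \frac{6.5 - 6.7}{0.89} \approx -0.22
```

A deviation of about a quarter of one standard deviation gives no indication that overall length is more tightly clustered than expected from independent variation at the five sites.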

ERK1 sequences from C. dubliniensis have diverged significantly from C. albicans. Nonetheless, despite numerous indels and base substitutions, the sequences between the primer sites remain approximately the same overall length as those of C. albicans (see Fig. 3). Moreover, the region is also length variable in C. dubliniensis and shows repeat-number variation in two short repetitive sites. It is notable that, although the polymorphic repeats are located in the second half of this region in both species, the length-variable microsatellites of C. dubliniensis alleles do not correspond in position to those found in C. albicans. The two different C. dubliniensis alleles that were found also differ by 12 base substitutions, while only three variable bases were found in the 15 different C. albicans alleles.


4 Discussion

The ERK (or MAP) family of kinase genes is highly conserved and has been identified in all major eukaryotic clades [13]. Their protein products function in a wide array of signal transduction pathways, generally linked to cell cycle control responses to extracellular signals. Among fungi, known functions involve pheromone-induced cell cycle arrest and starvation-induced invasive growth in Saccharomyces cerevisiae [14] and pathogenic growth and invasive differentiation in rice blast fungus [13]. However, the variable region of the ERK family gene sequenced in this paper represents a unique amino-terminal glutamine/alanine-rich domain not conserved in other known ERK kinases (such as FUS3 and KSS1 from S. cerevisiae). The function of this polymorphic domain is unknown [15]. The variable repeats in C. albicans code for poly (QA), poly (Q), poly (A), and poly (S). All but the terminal polyserine are part of a single domain consisting entirely of glutamine and alanine.

The observed level of variation at the ERK1 locus demonstrates that compound, coding-region microsatellite sites can provide extensive sequence polymorphisms useful for genotyping microorganisms. The combined length variability of all five repetitive sites allows for a large number of possible alleles (226 if each repeat region is assumed to be limited to, but free to vary within, the range of repeat numbers seen in the sequenced alleles). Potential allele diversity is increased further when base substitution polymorphisms are included.

The repeat regions within the ERK1 locus are extremely short, some having no more than two tandem repeats. If slipped-strand mispairing is responsible for microsatellite expansion and contraction, as is generally accepted [4], it is possible that this process may produce repeat-number variation even in extremely short microsatellites, such as those seen in ERK1, when flanking sequence patterns are closely related and similarly repetitive. This would seem reasonable because closely related sequences can provide a partially complementary substrate for mispairing (Gostout, 1993). In the present data, many of the repeats share closely related motifs, and similar imperfect repeats often flank the length-variable perfect repeats. Further investigation is required to determine the stability of similar sequences.

Length homoplasy is the rule, not the exception, among ERK1 alleles. All size classes which include alleles from more than one strain of C. albicans are represented by multiple divergent sequences. This process is likely to occur in simple as well as compound microsatellites but cannot be resolved in the former case, indicating that neither length nor sequence identity should be used to assume identity by descent. A few alleles at any microsatellite locus should always be sequenced to determine the source of the variation if length is used to assume sequence identity.

Length homoplasy is seen between, as well as within, the two species at this locus. The two unique alleles sequenced from C. dubliniensis are the same lengths as the two most common C. albicans electromorphs. It would be impossible to distinguish alleles of the two species using length data alone, despite their considerable sequence divergence. This is not a purely academic problem, as many of the strains found in clinical collections that are classified as C. albicans are in fact C. dubliniensis, which has only recently been identified as a separate species. With sequence data, however, this distinction becomes obvious. The identification of genetically divergent groups among phenotypically similar pathogens is critical to the appropriate development and use of treatment strategies, since separate groups may exhibit different responses to treatment.

Our previous studies demonstrated that ERK1 and several other polymorphic microsatellite loci can act as species-specific markers. Of seven loci analyzed in C. albicans, only one (ERK1) can be amplified from C. dubliniensis, and none of them can be amplified from Candida krusei (see [8]). Thus, the use of these markers in strain identification is limited to C. albicans (with the exception of ERK1 in C. dubliniensis). For species identification, they can indicate only whether or not a particular strain is C. albicans; they cannot determine which other species a strain belongs to if it is not C. albicans. For purposes of species identification, rRNA sequencing is far more efficient and informative. However, the identification of a polymorphic ERK1 homologue in C. dubliniensis suggests that specific design could provide microsatellite-based strain identification systems for species of yeast other than C. albicans.

The ERK1 locus is length variable in C. dubliniensis, though many of the variable regions seen in C. albicans do not have homologous repeats in C. dubliniensis. This raises the possibility that length variability itself might be selectively maintained at this locus across species. The overall length of ERK1 might also be conserved because of constraints on the encoded protein sequence. One piece of evidence for this is that although the ERK1 sequences of C. albicans and C. dubliniensis have undergone considerable divergence, including substitutions, insertions and deletions, they maintain very similar sizes. However, comparison of the real distribution of allele lengths in C. albicans with length distributions from data sets in which the individual variable site lengths had been shuffled showed that within C. albicans overall length is not strongly constrained at this locus.

The high level of divergence between C. albicans and C. dubliniensis sequences, as well as between the two C. dubliniensis sequences, shows that considerable genetic diversity exists within a group of pathogens which was, until recently, considered a single species. The value of high-resolution markers, such as ERK1, is two-fold: first, they allow identification of individual strains of microorganisms within a species on the basis of codominantly inherited characters. Second, at least among the Candida pathogens studied here, they can provide evidence of major genetic divisions (‘species’) within morphologically similar groups of microorganisms.


Appendix A. Genbank accession numbers of unique ERK1 alleles

Allele    Accession number
225 A     U95785
237 A     U95786
237 B     U95784
237 C     U95799
240 A     U95800
243 A     U95787
246 A     U95789
246 B     U95790
246 C     U95791
246 D     U95793
246 E     U95794
249 A     U95788
249 B     U95792
249 C     U95795
249 D     U95796
249 E     U95797
255 A     U95798