Microsatellites, or tandem simple sequence repeats (SSR), are abundant across genomes and show high levels of polymorphism. SSR genetic and evolutionary mechanisms remain controversial. Here we attempt to summarize the available data related to SSR distribution in coding and noncoding regions of genomes and SSR functional importance. Numerous lines of evidence demonstrate that SSR genomic distribution is nonrandom. Random expansions or contractions appear to be selected against for at least part of SSR loci, presumably because of their effect on chromatin organization, regulation of gene activity, recombination, DNA replication, cell cycle, mismatch repair system, etc. This review also discusses the role of two putative mutational mechanisms, replication slippage and recombination, and their interaction in SSR variation.
Genomic microsatellites (simple sequence repeats; SSRs), iterations of 1–6 bp nucleotide motifs, have been detected in the genomes of every organism analysed so far, and are often found at frequencies much higher than would be predicted purely on the grounds of base composition (Tautz & Renz 1984; Epplen et al. 1993). Bell (1996) suggested that the abundance and length distribution of SSRs across the genome could result from unbiased single-step random-walk processes. Some authors considered SSRs to be selectively neutral sequences randomly or almost randomly distributed over the euchromatic genome (Schlötterer & Wiehe 1999; Schlötterer 2000). Bachtrog et al. (1999) detected a significant positive correlation between AT content and (AT/TA)n SSR density, suggesting that SSR genesis may be a random process. However, they also found that 39% of the contiguous sequences analysed deviate from random distribution of SSRs in Drosophila melanogaster.
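The working definition above (tandem iterations of 1–6 bp motifs) can be illustrated with a minimal sketch of a perfect-repeat scanner. The function name `find_ssrs` and the `min_repeats` threshold are assumptions made for illustration only; the studies cited here each used their own detection criteria, which this sketch does not reproduce.

```python
import re

def find_ssrs(seq, min_repeats=4, max_motif=6):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    min_repeats is a hypothetical cutoff chosen for this example.
    """
    seq = seq.upper()
    # Group 1 is a candidate motif of 1..max_motif bases, shortest tried first;
    # the back-reference then demands at least min_repeats - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits

print(find_ssrs("GGCACACACACACATT"))
```

Because the motif is matched lazily, a tract such as (CA)6 is reported with the dinucleotide motif CA rather than as (CACA)3. Imperfect or interrupted repeats are not detected by this simple pattern.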
The controversial interpretation of SSR evolution is reflected in recent literature. Numerous studies have documented SSR structural patterns with allele size constraints (Garza et al. 1995; Dermitzakis et al. 1998; Samadi et al. 1998; Li et al. 2000c; 2002a), and functional significance (reviewed in: Kashi et al. 1997; King et al. 1997; Kashi & Soller 1999; King & Soller 1999; Gur-Arie et al. 2000). Nevertheless, SSRs are usually just considered as evolutionarily neutral DNA markers (e.g. Tachida & Iizuka 1992; Awadalla & Ritland 1997; Schlötterer & Wiehe 1999). Such controversy calls for further evidence of SSR functional importance and justifies a comprehensive discussion concerning the evolutionary significance of genomic SSRs. An attempt to analyse the phenomenon of SSR variation from the viewpoint of the ‘qualitative’ alternative, i.e. functional vs. neutral, does not seem to suit the problem. In fact, there may be no fundamental contradiction in the opposite interpretations of SSR variation: functional vs. neutral, if the question is formulated in quantitative rather than qualitative terms. The rich evidence accumulated on SSRs and their manifold effects justifies this approach.
The present review focuses on the following aspects: (i) SSR distribution across coding and noncoding regions of the genome; (ii) evolutionary significance and dynamics of SSR genomic distribution; (iii) SSR effects/functions in gene expression and genetic disorder, chromatin organization, cell cycle, and DNA metabolic processes; and (iv) the relative contribution of replication slippage and associated DNA repair mechanisms and recombination to SSR mutation.
Nonrandom SSR distribution within and between genomes
SSRs constitute a rather large fraction of noncoding DNA and are relatively rare in protein-coding regions. For instance, all of the observed 101 mono-, di- and tetranucleotide SSRs were located in noncoding regions across 54 plant species (Wang et al. 1994). All types of SSRs (from mono- to hexanucleotide repeats) were found in excess (compared to random appearance) in noncoding genomic regions across seven eukaryotic clades: Saccharomyces cerevisiae, Caenorhabditis elegans, Schizosaccharomyces pombe, Mus, Drosophila, plants, and primates (Metzgar et al. 2000). Morgante et al. (2002) reported that all SSR types except triplets and hexanucleotides are significantly less frequent in the 25 762 predicted protein-coding sequences compared with the noncoding fraction in six plant species including Arabidopsis, rice, soybean, maize, and wheat (Triticum aestivum). In the genome of Japanese pufferfish, Fugu rubripes, only 11.6% of 6042 SSRs were detected in protein-coding regions (Edwards et al. 1998). This is attributable to negative selection against frameshift mutations in coding regions (Metzgar et al. 2000). Previously, a similar distribution pattern was found for triplet SSRs in coding and noncoding regions of genomes of fungi, protists, prokaryotes, viruses, organelles, plasmids and humans (Field & Wills 1996; Wren et al. 2000). However, the disease-associated triplet repeats are mostly found in coding regions of the human genome (Nadir et al. 1996). Likewise, Morgante et al. (2002) recently found that triplet SSRs doubled in frequency in the coding region of the above-mentioned six plant species, as a result of mutation pressure and possibly positive selection for specific single amino acid stretches. Some tri-arrays are not extensively conserved for long periods of time, even when they form parts of protein-coding sequences, since long tri-repeats (e.g. CAG arrays) can be destabilized during meiosis or gametogenesis (Jankowski et al. 2000).
The majority of SSRs (48–67%) found in many species are dinucleotides (Wang et al. 1994; Schug et al. 1998), but in primates mononucleotides [mainly, poly(A/T) tracts] are the most copious classes of SSRs (Tóth et al. 2000; Wren et al. 2000). In contrast to the triplet SSRs, di- and tetranucleotide SSRs are much less frequent in coding regions than in noncoding regions. For example, dinucleotide repeats are about 20 times less frequent in the expressed sequences than in random genomic clones of Norway spruce, Picea abies (Scotti et al. 2000). In eight prokaryotes and yeast, long mono- and di-tracts are almost exclusively distributed in nontranslated regions (Field & Wills 1998). For perfect dimeric SSRs, Bell & Jurka (1997) found that (i) shorter repeats (≤ 3 units) in coding sequences and other functionally important DNA regions can be predicted by a Bernoulli model; and (ii) the length distribution of longer (≥ 5) perfect dimeric SSR DNA in a noncoding region fits the unbiased single-step mutation model. In this model, repeats are permitted to change length by plus or minus one unit, with equal probabilities, and base substitutions are allowed to destroy long perfect repeats, producing two short perfect repeats. Analysis of available DNA sequences of human, mouse, worm (Caenorhabditis elegans), and yeast genomes shows that the distribution functions of all possible dimeric SSRs are exponential in coding DNA, whereas in noncoding DNA most of the dimeric SSRs have surprisingly long tails that better fit a power-law function (Dokholyan et al. 2000). These long, nonexponential tails are hypothesized to result from a higher tolerance of noncoding DNA to mutations (Dokholyan et al. 2000). A number of genes were found with dinucleotide SSRs in the untranslated 5′ and/or 3′ ends, such as five genes of channel catfish Ictalurus punctatus (Liu et al. 1999) and the mammalian heat shock protein 70 (hsp 70) genes [(GA)6CAG(TC)24 tract: Lisowska et al. (1997)].
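The length-change part of the unbiased single-step model described above can be sketched as a simple random walk. This is a minimal illustration, not the procedure of Bell & Jurka (1997): the function name, the per-generation mutation probability `mu`, and the floor of one repeat unit are assumptions, and the base-substitution step that splits long perfect repeats is omitted.

```python
import random

def single_step_walk(start_units, generations, mu, rng):
    """Simulate repeat number under an unbiased single-step model.

    In each generation a mutation occurs with probability mu (hypothetical
    parameter); a mutation adds or removes exactly one repeat unit with
    equal probability.
    """
    length = start_units
    history = [length]
    for _ in range(generations):
        if rng.random() < mu:
            length += rng.choice((-1, 1))
            # Assumption for this sketch: a tract never drops below one unit.
            length = max(length, 1)
        history.append(length)
    return history

rng = random.Random(42)
trajectory = single_step_walk(start_units=10, generations=1000, mu=0.01, rng=rng)
print(min(trajectory), max(trajectory))
```

Running many independent walks and tabulating the final repeat numbers gives a length distribution that could be compared with observed tract lengths, which is the kind of comparison the model-fitting studies above rely on.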
Dinucleotide SSRs are also found in introns. For instance, in Mus musculus the intron A of the Adh-1 gene contains (TA)14, (TG)8, and (TA)19, and the intron of the IL-5 gene contains (AT)17; in the tree Betula pendula the intron of the BVGC34 gene includes (CA)17, (TA)14, and (TGTA)3. The potential size expansion of di- or tetranucleotide SSRs at the 3′ and 5′ regions and introns could lead to disruption of the original protein and/or formation of new genes by frame shift (Bachtrog et al. 1999; Liu et al. 1999). These patterns suggest that random distributions of such di- and tetranucleotide SSRs were strongly selected against (Bachtrog et al. 1999; Liu et al. 1999). For a given number of repeats, a tetranucleotide locus is longer than a dinucleotide locus. This may affect the selective pressure, if the stability of meiotic processes depends on the absolute size (in base pairs) of the target region. Loci with longer repeat units seem to experience stronger selection against the difference in size, especially in genome regions with high recombination rates (Samadi et al. 1998).
These findings also suggest that the differences between coding and noncoding SSR frequencies arise from specific selection against frame-shift mutations in coding regions resulting from length changes in nontriplet repeats (Liu et al. 1999; Dokholyan et al. 2000). Nevertheless, 14% of all proteins contain repeated sequences, with a three times higher abundance of repeats in eukaryotes compared to prokaryotes (Marcotte et al. 1999). Prokaryotic and eukaryotic repeat families are clustered to nonhomologous proteins. This may indicate that repeated sequences emerged after these two kingdoms had split. The eukaryotes incorporating more repeats may have an evolutionary advantage of faster adaptation to new environments (Marcotte et al. 1999; see also below section 5 and Kashi et al. 1997; King & Soller 1999; Wren et al. 2000).
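The frame-shift argument reduces to simple arithmetic: a change in repeat number alters the coding length by motif length times the number of units gained or lost, and the reading frame is preserved only when that product is a multiple of three. The helper below is a hypothetical illustration of this check, not a published method.

```python
def shifts_reading_frame(motif_len, delta_repeats):
    """Return True if gaining or losing delta_repeats units of a
    motif_len-bp motif changes coding length by a non-multiple of 3."""
    return (motif_len * delta_repeats) % 3 != 0

# One-unit changes for mono- to hexanucleotide motifs.
for motif_len in range(1, 7):
    print(motif_len, shifts_reading_frame(motif_len, 1))
```

Under this check only tri- and hexanucleotide motifs keep the frame for every length change, which matches the observation above that these two classes are the exceptions to the general depletion of SSRs in coding sequences.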
Tóth et al. (2000) conducted a detailed analysis of SSRs in several eukaryotic taxa, from fungi to humans, and revealed highly taxon-specific patterns in the distribution of different repeat types (from mono- up to hexanucleotides) for different motifs in coding and noncoding sequences, in introns and intergenic regions. This specificity can partly be explained by interaction of mutation mechanisms and differential selection. The accumulated empirical evidence seems to indicate that SSR sequences are more abundant and longer in vertebrates than in invertebrates, and among vertebrates longer SSR tracts are observed in cold-blooded species (see for review: Chambers & MacAvoy 2000). It is interesting that among the taxa compared by Tóth et al. (2000), maximum abundance of SSRs was displayed by rodents and the minimum by C. elegans.
Eyre-Walker (1999) found that compositional variation in noncoding DNA cannot be explained by mutation bias alone, and that selection may play an important role. Contrary to predictions of the neutral mutation theory, noncoding DNA is positionally constrained along the banding pattern with short interspersed repeats in R-bands (the primitive chromatin state) and long interspersed repeats in G-bands (Giemsa dark bands: Holmquist 1989). It was speculated that each sequence of tandem repeats is under the influence of local and general biological activities that determine its level of instability (Debrauwere et al. 1997). The dynamic organization of noncoding DNA suggests a feedback loop that could influence codon usage and stabilize the chromosome's chromatin pattern (Holmquist 1989). The preserved nonrandom codon usage, or overall amino acid usage, or both, play a significant role in determining the short-repeat excess and selection against long repeats (Field & Wills 1998). Theories of hierarchical selection show how selection can act on noncoding DNA at the genome level thereby creating positionally constrained DNA and contributing minimal genetic load at the individual level (Holmquist 1989; Trifonov 1989; Baldi & Basnee 2000). The overall level of repetition in genomes is related to genome size and to the degree of repetition, suggesting that the entire genome may react to an increase of simple repeated sequences in a concerted manner (Hancock 1996; Tóth et al. 2000). The presented examples demonstrate various nonrandom patterns of SSR variation both within the genome and between species and higher taxa that call for functional interpretation(s).
Chromosomal organization. Some aspects of SSR distribution point to their possible role in taxon-specific chromosome structure. For instance, SSR hybridization signals were found in related chromosome positions independently of the motif used, and showed remarkably similar distribution patterns in wheat and rye, suggesting a special role of SSRs in chromosome organization as a possible ancient genomic component of the tribe Triticeae (Cuadrado & Schwarzacher 1998). At locus GWM601 located in the short arm of chromosome 4A, the CT repeat was maintained at (CT)17 in wild emmer wheat (Li et al. 2000a, b, c, 2002b) and most remarkably also in its descendant cultivated wheat Chinese Spring (Triticum aestivum) (Röder et al. 1998), indicating that this locus may be involved in some aspects of organization of chromosome 4A. Furthermore, the implications of excess numbers of short iterated repeats (≤ 8 units) could be extremely important not only for genomic stability, but also with regard to the evolution of additional genomic features such as codon usage (Field & Wills 1998).
DNA structure. SSR DNA sequences are capable of forming a wide variety of unusual DNA structures with simple and complex loop-folding patterns. For example, the hairpin formed by the fragile X repeat (CCG)n, and the bipartite triplex formed by (GAA)n/(TTC)n, show simple loop-folding. Such triplex structures may have important regulatory effects on gene expression (Fabregat et al. 2001). The human centromeric repeat (AATGG)n can form a double-folded hairpin DNA structure (Catasti et al. 1999). Similarly, short triplet repeats have been shown to form extensive secondary structures when single-stranded (Zheng et al. 1996). Longer (CAG) and (CTG) repeats produce unusual secondary structures upon denaturation and subsequent renaturation (Pearson & Sinden 1996). The formation of such stable structures offers a mechanism of unwinding which is advantageous during transcription, and provides unique protein recognition motifs (Catasti et al. 1999). In many species, dimeric SSR relative abundance, which represents most of the departure from randomness in genome sequence, may also reflect duplex curvature, supercoiling, and other higher-order DNA structural features (Karlin et al. 1998; Baldi & Basnee 2000). The repeat number seems to be a critical parameter that determines the balance between the advantage gained from an unusual structure during gene expression and the disadvantage posed by the same structure during replication (Catasti et al. 1999).
Centromere and telomere. In many species, the centromeric region of chromosomes is composed of numerous tandem repeats, which affect the centromere organization. Long SSRs with mono-, di-, tri- and tetranucleotide motifs are highly clustered in the centromeric regions of tomato (Areshchenkova & Ganal 1999), Arabidopsis (Brandes et al. 1997), and sugar beet Beta vulgaris (Schmidt & Heslop-Harrison 1996). In Neurospora crassa, genomic Southern blots and sequence analysis of centromeric repetitive DNA revealed a unique centromere structure containing a divergent family of centromere-specific repeats (Centola & Carbon 1994). The characteristics and arrangements of simple repeats in the N. crassa centromere are similar to those seen in Drosophila centromeres, but relative abundance of each class of repeats is unique to Neurospora (Cambareri et al. 1998). In Drosophila minichromosomes, the centromere-flanking DNA predominantly contains highly repeated sequences, and the number of repeats required for normal transmission differs among cell division types and between the sexes (Murphy & Karpen 1995).
The assembly of divergent tandem sequences into chromosome-specific higher-order repeats appears to be a common organizational feature of many organismal centromeres, and also suggests that the evolutionary mechanism(s) that creates and maintains high-order repeats is conserved among their genomes (Janzen et al. 1999). The ubiquity of repetitive sequences among diverse species at sites of primary constriction also argues for a strong evolutionary link between centromere structure and function (Eichler 1999). The centromere-flanking repeated DNA may perform two functions: sister chromatid cohesion and indirect assistance in kinetochore formation or function (Murphy & Karpen 1995).
The repeat number also influences recombination. For instance, the effect of GT/GC SSR tracts on RecA-dependent homologous recombination was examined in vitro, and it was found that the number of molecules which underwent a complete strand exchange decreased from 100% to 80% and 30% for DNA containing 7, 16, and 37 repeats of (GT), respectively (Dutreix 1997). Majewski & Ott (2000) analysed the distribution of different SSRs and recombination intensity along human chromosome 22. Significant association between increased recombination and the presence of SSR tracts was found only for the GT repeat. The effect was especially pronounced for male recombination. In yeast, a (GT)39 tract at the ARG4 locus was shown to increase the frequency of gene conversions; the repetitive sequence strongly stimulated the formation of multiple crossovers, but had no effect on single crossovers (Gendrel et al. 2000). The evidence listed above demonstrates that SSRs could affect recombination not only through repeat motif, but also through repeat number.
DNA replication and cell cycle. SSRs may affect DNA replication (Field & Wills 1996). In rat cells, DNA amplification is arrested within a specific fragment which consists of a d(GA)27·d(TC)27 tract. It is found at the end of an amplicon, and in conjunction with the inverted repeat, may serve as an arrest site for DNA replication in vivo. In a mammalian mutator phenotype CSA7 clone, the unstable (CA)n SSRs are co-selectable with other gene amplification events (Caligo et al. 1999).
SSRs can affect enzymes controlling the cell cycle. For instance, the human CHK1 gene has a role in controlling cell cycle progression, and its coding region contains an (A)9 tract (Codegoni et al. 1999) that is a potential site of mutations in tumours with SSR instability (Bertoni et al. 1999). Alterations in the CHK1 gene in human colon and endometrial cancers were associated with the presence of a high degree of poly(A) tract instability. The insertion or deletion of one A in the (A)n tract resulted in a truncated protein. Alterations of the CHK1 gene could represent an alternative way for cancer cells to escape from cell cycle control (Bertoni et al. 1999). Some genes controlling the cell cycle, such as hMSH3, hMSH6, BAX, IGFIIR, TGFbetaIIR, E2F4 and BRCA2, carry short repeated sequences, which are important in cell fidelity and growth control. SSR instability affects these genes by both insertions and deletions of repeat units. Most SSR-instability tumours had acquired mutations in more than one of these genes, and longer repeated sequences were more frequently targets for mutations (Johannsdottir et al. 2000). There is also evidence for a connection between DNA repair and cell cycle checkpoints: the mismatch repair (MMR) system interacts with the G2 cell cycle checkpoint in response to DNA lesions induced by 6-thioguanine (6-TG) or N-methyl-N′-nitro-N-nitrosoguanidine (Hawn et al. 1995). Very large (CAG)n repeat expansions were found in sperm cells of two spinocerebellar ataxia type 7 males; a significant proportion of such alleles might be associated with embryonic lethality or dysfunctional sperm (Monckton et al. 1999; see also Parniewski et al. 2000 for the role of the MMR system in deletions of large CAG tracts in Escherichia coli).
SSRs in the eukaryotic DNA MMR gene as modulators of evolutionary mutation rate. DNA MMR proteins correct replication errors and actively inhibit recombination between diverged sequences (Chen & Jinks-Robertson 1998; Kolodner & Marsischky 1999), thus controlling mutation rates and evolutionary adaptation. It is found that the constellation of (A)n SSRs in the coding regions of the minor MMR genes (MSH3, MSH6, PMS2 and MLH3) is a general feature among eukaryotes including Homo sapiens, Mus musculus, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, Arabidopsis thaliana, and the prokaryote E. coli. Although in some species, 7-bp mononucleotide runs are sporadically found in the major MMR genes (MSH2 or MLH1), the longer runs, which are exponentially more mutable, are exclusively present in the minor MMR genes (see review Chang et al. 2001). SSR sequences are exceptionally vulnerable to spontaneous insertion or deletion mutations, and nontriplet SSRs, when located in coding sequences, are expected to introduce frameshift loss of function mutations at high frequency (Moxon et al. 1994). A recent experiment has proved that mutation rates are significantly higher in longer SSR tracts in both SSR-proficient mouse cells and SSR-deficient human cells (Yamada et al. 2002). The loss of activity in any one of these minor MMR proteins generates weaker mutator phenotypes than occurs with the loss of major MMR proteins (MSH2 or MLH1). A high rate of frameshift mutations that inactivate the minor MMR genes would provide a eukaryotic lineage with a subpopulation of individuals that exhibit mildly increased mutation rates. Chang et al. (2001) hypothesized that the exceptional density of SSRs in the minor MMR genes represents a genetic switch that allows the adaptive mutation rate to be modulated over evolutionary time.
Regulation of gene activity
SSRs and transcription. Numerous lines of evidence show that SSRs located in promoter regions may affect gene activity. The (TC)n tract in promoter regions was found to serve as a transcriptional element for heat-shock protein gene hsp26 in Drosophila (Sandaltzopoulos et al. 1995), Aspergillus (Punt et al. 1990), and Phytophthora (Chen & Roxby 1997). Deletion of various di-, tri- and tetra-SSR tracts markedly changed transcriptional activity. For instance, promoters’ transcriptional activity was dramatically decreased by deletion of a (TCCC)n tract from the promoter regions of c-KI-ras (Hoffman et al. 1990) and TGF-β3 gene in the CAT expression system (Lafyatis et al. 1991). Moreover, a (GT)n repeat could enhance gene activity from a distance independent of its orientation, but more effective transcriptional enhancement resulted from the GT repeat being closer to the promoter sequences (Stallings et al. 1991).
SSRs in intronic regions can also affect gene transcription. For example, a tetra-SSR HUMTH01 in the first intron of the tyrosine hydroxylase gene acts as a transcription regulatory element (Meloni et al. 1998). Gebhardt et al. (1999, 2000) found that transcription activity was affected by a (CA)n tract located in the first intron of the epidermal growth factor receptor (EGFR) gene. They also showed that RNA elongation terminates at a site closely downstream of the SSR and that there are two separate major transcription start sites. Model calculations for the helical DNA conformation revealed a high bendability in the EGFR polymorphic region, especially if the CA stretch is extended. These data suggest that the (CA)n SSR can act like a joint, bringing the promoter into proximity with a putative repressor protein bound downstream of the (CA)n SSR. It is noteworthy that triplet SSRs may be preferentially located in regulatory genes related to transcription and signal transduction and under-represented in genes for structural proteins (Young et al. 2000), also suggesting an SSR effect on gene transcription.
Effect of repeat number on gene expression. In many cases, SSR repeat number appears to be a key factor for gene expression and expression level. Some genes can only be expressed at a specific repeat number of SSRs. For example, (GAA)12 in the promoter of the Escherichia coli lacZ gene allows lacZ gene expression, whereas neither (GAA)14–16 nor (GAA)5–11 allows this gene to be expressed (Liu et al. 2000). Some genes can be expressed within a narrow range of SSR repeat numbers, and outside this range gene activity is turned off. In yeast, a promoter containing (CTG/CAG)n with n = 25 allows expression of the URA3 reporter gene and yields sensitivity to the drug 5-fluoroorotic acid. However, a tract with n ≥ 30 (CTG/CAG) repeats turns off the URA3 gene and provides drug resistance (Miret et al. 1998).
Another group of genes shows adjusted expression levels when the repeat numbers of their regulatory SSRs change over a relatively wide range. In a direct experiment aimed to test the effect of (TG) length on enhancer activities for pSV2-CAT (simian virus 40 enhancer plus) or pA10-CAT (enhancer minus) expression vector plasmid, maximum enhancement was obtained with 30–40 bp of (TG). As the length of (TG) tract increased from 40 to 130 bp, enhancer activity fell, and 130 bp of (TG) tract was fivefold less active than 50 bp (Hamada et al. 1984b). Interestingly, the size range of the bulk of poly(TG) elements found in the human genome is between 20 and 60 bp, which is the size that seems to have maximum activity in this system (Hamada et al. 1984a). Transcription activity of the epidermal growth factor receptor gene declines with the increasing numbers of (CA) repeats (Gebhardt et al. 1999, 2000). In a CAT reporter system carrying an androgen response element with human CAG repeats in the presence of dihydrotestosterone, expansion mutations ranging from 25 to 77 repeats showed a progressive decrease in transcriptional transactivation with increasing CAG repeat length (Chamberlain et al. 1994). Similar results were obtained with the use of a somewhat different reporter system and androgen receptor poly Gln [encoded by poly(CAG)] tract length ranging from zero to 50 repeats (Lanz et al. 1995). In contrast, some genes’ transcriptional levels increase with the SSR repeat numbers. For instance, for the PAX-6 gene in human brain, promoter activity of variants with ≥ 29 repeats of (AC)m(AG)n was four- to ninefold higher than that of the 26-repeat allele (Okladnova et al. 1998). In chicken, constructs of 10, 15, or 22 repeats of (CT) in the promoter of the malic enzyme gene were shown to increase the expression relative to the wild-type with (CT)7 (Xu & Goodridge 1998).
All this evidence, representing both natural variation and the results of direct controlled experiments with various organisms, indicates the importance of the SSR repeat number for the SSR-related regulation of gene expression.
Protein binding. Some SSRs, found in upstream activation sequences, serve as binding sites for a variety of regulatory proteins (reviewed in Lue et al. 1989; Csink & Henikoff 1998). For example, single-stranded poly(GA)- and poly(GT)-binding proteins have been identified in human fibroblasts (Aharoni et al. 1993). (GT)n or mixed (GT)n(GA)m stretches of intronic repeats are preserved in immunologically relevant genes for at least 70 × 10⁶ years and bind nuclear protein molecules with high affinities (Epplen et al. 1993). SSR repeat number may also affect protein binding. For example, datin binds to poly(T) tracts 19, 15 and 11 bp long with comparable affinities, but does not bind to poly(T) tracts 3, 5, 7, or 8 bp long (Winter & Varshavsky 1989). Similar evidence appears in the African green monkey α-SSR-binding protein for (A)n tract (Solomon et al. 1986).
Translation. Many studies show that SSRs can affect gene translation. For instance, a downstream (CA)n increases translation from leadered and unleadered mRNA in Escherichia coli (Martin-Farmer & Janssen 1999). A moderate size expansion of the CGG tract can markedly reduce translation of a CAT gene (pSVsCAT) carrying an insertion of human CGG repeats (Sandberg & Schalling 1997). The distribution of AGCT tetranucleotides in the E. coli and Bacillus subtilis genomes predicts translational frameshift and ribosomal hopping in several genes (infB, aceF/pdhC, eno, rplI, OMPa, OMF and tolA: Henaut et al. 1998). CUG repeat binding protein (CUGBP1) interacts with the 5′ region of C/EBPbeta mRNA and regulates translation of C/EBPbeta isoforms (Timchenko et al. 1999). AGG triplet repeats have a strong inhibitory effect on translation of mRNA in E. coli, as demonstrated with the CAT reporter gene (Ivanov et al. 1992).
Human cancers and genetic disorders. About 15% of sporadic colorectal cancers, as well as cancers at several other sites, show SSR instability (see review, Atkin 2001). The progressive accumulation of SSR instability may also contribute to gastric cancer development (Leung et al. 2000). Patterns of SSR changes observed in various human cancers can be classified into two subtypes: type A, showing relatively small changes within six base pairs; and type B, exhibiting drastic changes over eight base pairs. Although type A SSR instability has been connected to defective MMR phenotype, the relationship between type B SSR instability and defective MMR phenotype remains unclear. Nevertheless, as symbolized in cases of hereditary nonpolyposis colorectal cancer, connections between type B SSR instability and familial predisposition have been suggested in some cancers (see review: Oda et al. 2002).
It has been shown that 14 neurological disorders result from the expansion of unstable trinucleotide repeats. Such triplet SSR diseases include cases with changes in either noncoding (untranslated) or coding (exonic) sequences (recent reviews: Rubinsztein 1999; Cummings & Zoghbi 2000; Masino & Pastore 2001). Expanded repeats may form altered DNA secondary structures (for detail, see review of Kovtun et al. 2001) that confer genetic instability, and most likely contribute to transcriptional silencing. The human FMR1 (CGG)n array can exhibit genetic instability, characterized by progressive expansion over generations leading to gene silencing and development of the fragile X syndrome (White et al. 1999). At the level of RNA, the expanded repeat may either interfere with processing of the primary transcript, resulting in a deficit of the corresponding protein, or interact with RNA-binding proteins altering their normal activity (Galvao et al. 2001). Recent evidence suggests that expanded RNAs and associated RNA-binding proteins are potential contributors to the pathogenesis of several triplet-repeat diseases (review: Galvao et al. 2001).
Myotonic dystrophy, an autosomal dominant neurological disorder, is caused by CTG-repeat expansions at the DMPK locus, with affected individuals having n ≥ 50 repeats of (CTG). Abnormal expansion (≥ 39) of (CAG) repeats, which translate into polyglutamine stretches, causes Machado-Joseph disease (Rubinsztein et al. 1995). It has been shown that a dynamic balance of the joint effects of segregation distortion and selection maintains a normal range of CTG-repeat sizes at the myotonic dystrophy locus (Polanski et al. 1998). Negative selection seems to act against ATG triplets near start codons in eukaryotic and prokaryotic genomes (Saito & Tomita 1999). Nucleic acid sequences around start codons contain fewer AUG triplets due to negative selection against disruptive triplets, which could disturb the accurate detection of proper start codons. Negative selection in the upstream regions is especially strong in eukaryotes, an observation consistent with the fact that eukaryotic ribosomes scan mRNA from left to right (5′–3′) to find start codons (Saito & Tomita 1999). The average distance between a start codon and its nearest upstream-located ATG is generally longer in higher organisms (Saito & Tomita 1999). Such negative selection may act with differential strength on SSRs located in genomic regions with different functions (Hancock 1996).
Except for triplet expansion, other classes of SSRs were also found to cause human diseases. For instance, length rather than a specific allele of (CA)n repeat in the 5′ upstream region of the aldose reductase gene is associated with diabetic retinopathy (Fujisawa et al. 1999), and (CA)n allele polymorphism in the first intron of the human interferon-γ gene is associated with lung allograft fibrosis (Awad et al. 1999). De Fonzo et al. (1998) found that tetra-, penta- to 82 bp-repeats could also be associated with the onset of many human diseases. For example, allele (CCTTT)14 in the promoter of NOS2A gene significantly limited diabetic retinopathy, but other alleles (repeat number) could cause diabetic retinopathy (Warpeha et al. 1999).
The foregoing evidence indicates that SSR variation can produce either drastic or quantitative variations in gene expression. Given the genomic overabundance and high mutability of SSRs, this implies that changes in SSR array size may serve as a rich source of variation in fitness-related traits in natural populations (Kashi et al. 1997; King et al. 1997; Kashi & Soller 1999; King & Soller 1999; Trifonov 2002). This role may be especially important for population survival and adaptation to spatially and temporally varying environmental conditions (Li et al. 2000a,b,c, 2002b).
Mutational mechanisms of SSR variation
The genomic abundance of SSRs and their various functions and effects (either putative or reliably established) are associated with their mutation rates, which (10⁻²–10⁻⁶ events per locus per generation) are very high compared with the rates of point mutation at coding gene loci. Although the mutation process seems to differ distinctly among species, repeat types, loci and alleles, and with age and sex (Brock et al. 1999; Hancock 1999; Ellegren 2000; Schlötterer 2000), the instability is predominantly manifested as changes in the number of SSR repeats. Two mutational mechanisms can be invoked to explain such high rates of mutation. The first involves DNA slippage during DNA replication (Tachida & Iizuka 1992). The second involves recombination between DNA strands (Harding et al. 1992). The efficiencies of the two mechanisms may depend on environmental conditions. Various factors were found to affect the rate of mutations at SSR loci, including repeated motif, allele size, chromosome position, GC content in flanking DNA, cell division (mitotic vs. meiotic), sex, and genotype (e.g. mutations at MMR genes) (see below, and section above on ‘SSRs in the eukaryotic DNA MMR gene as modulators of evolutionary mutation rate’).
Many changes in repeat number at SSR loci are caused by slip-strand mispairing errors during DNA replication (see Eisen 1999). Some of these errors are corrected by exonucleolytic proofreading and mismatch DNA repair, but many escape repair and become mutations. For instance, CTG/CAG or CGG/CCG repeats can form hairpin-like secondary structures that escape DNA repair in yeast (Moore et al. 1999). Thus, SSR instability can represent a balance between the generation of replication errors by slip-strand mispairing and the correction of some of these errors by exonucleolytic proofreading and mismatch repair. Although mononucleotide SSR tracts may modulate the efficiency of the MMR system (Chang et al. 2001), MMR was found to have a much more significant impact on SSR instability than proofreading (Eisen 1999). SSR instability in humans is associated with DNA MMR genes such as hMLH1 (Boyer et al. 1995), hMSH2, hMSH3, and hMSH6 (Boyer et al. 1995; Clark et al. 1999). When these genes mutate or become defective, SSR instability increases. If the MMR system is defective, coding sequences containing repetitive DNA tracts become preferred target sites for mutations in human tumours (Sia et al. 1997). Several recent reviews have comprehensively discussed the relationship between SSR instability caused by defective MMR and human cancers (e.g. Aquilina & Bignami 2001; Atkin 2001; Hussein & Wood 2002). In Drosophila melanogaster, flies lacking the spel1 gene (a member of the mutS gene family, whose MutS proteins promote correction of DNA mismatches) displayed highly increased instability in long runs of dinucleotide repeats when analysed after 10–12 generations (Flores & Engels 1999). The MMR system not only enhanced CTG/GAC triplet repeat stability in Escherichia coli (Jaworski et al. 1995), but also stabilized long tracts of 64 (CTG/GAC) repeats (Schumacher et al. 1998). However, in (CTG/GAC) tracts larger than 100 repeats, an active MMR system promoted large deletions; interruption of the tract purity enhanced the frequency of deletions (Parniewski et al. 2000).
The effectiveness of the mismatch DNA repair system can be influenced by the genomic location of the mismatch, the DNA surrounding the mismatch, the presence of strand-recognition signals, methylation state, etc. (reviewed in Eisen 1999). Moreover, it is a paradox that the MMR system, which limits mutation in SSR sequences, would itself be particularly vulnerable to mutation by virtue of having SSRs in its own coding regions (Chang et al. 2001). Differences in MMR among individuals of a particular species have been well documented. For example, many strains of E. coli in the ‘wild’ are defective in the MMR system (Matic et al. 1997). Since mismatch recognition is involved in other cellular processes, such as the regulation of interspecies recombination, there may be other selective pressures affecting cellular processes that lead to variation in MMR capabilities within a species (Eisen 1999).
Based on the assumption that replication slippage is the main mechanism, and on empirical data that mutations in SSR repeats are biased with a tendency to make the array grow larger (Amos & Rubinsztein 1996), Dermitzakis et al. (1998) proposed a hypothetical model in which such clustering is generated in microsatellites. In this model, at two juxtaposed loci, allele size-constraints (natural selection and/or mutation bias) eliminate repeats randomly from short or long arrays with equal likelihood. In such a case, long arrays lose repeats at the same rate as short ones, but gain repeats faster than the short arrays. If this mechanism acts for long enough, it may generate two clusters of about the same size at the two extremes, giving a bimodal distribution of allele-size frequencies. Ellegren (2000) suggested that within genomes at equilibrium, the SSR length distribution is at a delicate balance between biased mutation processes and point mutations acting towards the decay of repetitive DNA.
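The qualitative behaviour described above can be explored with a minimal Python simulation. This is a sketch under stated assumptions and not the model of Dermitzakis et al. (1998) itself: each allele loses one repeat per generation with a length-independent probability, gains one repeat with a probability proportional to its current length, and is held between hard lower and upper bounds that stand in for the size constraints. All parameter values and function names are hypothetical.

```python
import random


def simulate_allele(start_len, generations, rng,
                    loss_rate=0.01, gain_rate_per_repeat=0.001,
                    min_len=2, max_len=60):
    """Follow one SSR array through single-step gains and losses."""
    length = start_len
    for _ in range(generations):
        # Gain probability grows with array length; loss probability does not.
        p_gain = min(gain_rate_per_repeat * length, 1.0 - loss_rate)
        u = rng.random()
        if u < loss_rate:
            length -= 1
        elif u < loss_rate + p_gain:
            length += 1
        # Assumed hard bounds standing in for allele size-constraints.
        length = max(min_len, min(max_len, length))
    return length


def simulate_population(n_alleles=100, start_len=10, generations=10000, seed=1):
    """Simulate independent alleles that all start at the same length."""
    rng = random.Random(seed)
    return [simulate_allele(start_len, generations, rng) for _ in range(n_alleles)]


if __name__ == "__main__":
    sizes = simulate_population()
    print(sorted(sizes))
```

With these illustrative defaults, gains and losses balance only at a length of 10 repeats, so alleles that drift below this value tend to shrink further and those that drift above it tend to grow; given enough generations, the simulated sizes would therefore be expected to pile up near the two bounds.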
We believe that natural selection is a very important constraint acting on SSR allele sizes. In theory, natural selection could act against long alleles, introducing a form of length ceiling (Garza et al. 1995). Our microgeographical study in wild wheat implied that natural selection might impose both upper and lower limits on allele size (Li et al. 2000c). Morgante et al. (2002) provided evidence of positive selection for specific repeats in the transcribed regions by examining whether complementary motifs were equally represented in the transcribed strand in the Arabidopsis thaliana genome, and suggested that very different selective pressures seem to act on the 5′UTR, open-reading and 3′UTR regions of genes.
Recombination could potentially change the SSR length by unequal crossing over or by gene conversion (Brohede & Ellegren 1999; Hancock 1999; Jakupciak & Wells 2000; Richard & Pâques 2000). Structural analysis of mutant sequences derived from six alleles in human sperm revealed that it is the rate of intra-allelic rearrangement that increases with array size, and that intra-allelic duplication events tend to cluster within homogeneous segments of alleles. Both phenomena resemble features of trinucleotide repeat instability. Unequal exchanges in combination with random genetic drift and selection can have a strong effect on the accumulation of tandem-repetitive sequences in the genome (Charlesworth et al. 1994). Recent studies point to the important role of nonreciprocal recombination (gene conversion) in the destabilization of tandem repeats (both micro- and minisatellites) (Jakupciak & Wells 2000; Richard & Pâques 2000). Depending on the motif, nonreciprocal exchange may yield unidirectional (e.g. only contraction) or bidirectional changes. The effect may be associated with either meiosis or mitosis, albeit with different rates.
Interaction of replication slippage and recombination
In our study of microsatellite diversity in wild emmer wheat, we found a very strong effect of the interaction between mean repeat length and SSR locus distance from the centromere on the number of alleles and the variance in repeat size at SSR loci (Li et al. 2000c, 2002a). We suggested that this effect might reflect the influence of replication slippage during recombination-dependent DNA repair. Indeed, strand exchange between two homologous chromosomes should create a four-stranded configuration, called a Holliday structure, associated with mismatched (heteroduplex) DNA regions. These regions undergo replication-dependent correction. Hence, a slippage mechanism may also work in recombination tracts involving SSR arrays (Brohede & Ellegren 1999; Gendrel et al. 2000; Li et al. 2000c, 2002a). Similarly, in yeast, such ‘repair-slippage’ events associated with gene conversion and leading to SSR expansion or contraction may occur 800-fold more often than ‘replication-slippage’ events (Richard & Pâques 2000). Recombinational repair during gene conversion was hypothesized to play a major role in trinucleotide expansion for a series of human neurological disorders (Jakupciak & Wells 2000; Richard & Pâques 2000). The interaction of slippage and recombination, which may happen in heteroduplex DNA regions, could also affect SSR stability. Some SSRs promote recombination events, including multiple exchanges (e.g. Gendrel et al. 2000).
This review has presented numerous lines of evidence suggesting that SSR genomic distribution is nonrandom across coding and noncoding regions. These findings are not always consistent with theoretical hypotheses, such as the classical stepwise mutation model (Ohta & Kimura 1973) based on neutral or nearly neutral theory, and the random single-step walk model (Bell & Jurka 1997). This is because the evolutionary process leading to length variability at SSR loci follows neither a simple mutation model nor a strict single-stepwise model. Since a significant part of SSR structures are functionally important for gene transcription, translation, chromatin organization, recombination, DNA replication, the DNA MMR system, cell cycle, etc. (see Fig. 1), selection seems to act against random SSR size expansions or contractions in the corresponding genomic regions. Consequently, at least these functional groups of SSRs may not be neutral, in certain ranges of SSR variation, and under some ecological–genomic conditions. Therefore, molecular ecologists should be cautious in explaining empirical material on SSR diversity in certain environments in terms of neutral evolution without critically testing this hypothesis against non-neutral alternatives. Despite the numerous examples where the functional role of repeated sequences is known, in the vast majority of concrete cases the origin and biological function of SSRs are poorly understood. Nonetheless, insufficient knowledge or absence of function-relevant information cannot justify almost automatic attempts to explain SSR variation as neutral. Still, these comments do not mean that the authors’ message is to warn others against using SSRs as molecular markers in molecular-ecology, population-genetic or genetic-mapping studies. The converse is true, as illustrated by our own numerous studies in this field (e.g. Li et al. 2000a,b,c). Nevertheless, we believe that the demonstrated widespread nonrandomness, selectivity, and other patterns displayed by repeated elements call for special attention at all stages of using this class of markers in the aforementioned fields, from the design of experiments to data analysis and interpretation.
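As one concrete form of the neutral baseline against which such tests could be made, the strict stepwise mutation model of Ohta & Kimura (1973) gives an equilibrium expected heterozygosity of H = 1 − 1/√(1 + 8Nₑμ) for a diploid population. The short Python sketch below evaluates this expression; it assumes neutrality, constant effective population size Nₑ, and symmetric single-step mutation at rate μ, and the function name and the example values are hypothetical.

```python
import math


def smm_expected_heterozygosity(effective_size: float, mutation_rate: float) -> float:
    """Equilibrium heterozygosity under the strict stepwise mutation model.

    Assumes a diploid, neutral locus, constant effective size and
    symmetric single-step mutations: H = 1 - 1 / sqrt(1 + 8 * Ne * mu).
    """
    theta = 4.0 * effective_size * mutation_rate
    return 1.0 - 1.0 / math.sqrt(1.0 + 2.0 * theta)


if __name__ == "__main__":
    # Illustrative values only; the rates lie within the range quoted for SSR loci.
    for mu in (1e-6, 1e-4, 1e-2):
        print(mu, round(smm_expected_heterozygosity(10_000, mu), 4))
```

Observed heterozygosity that departs consistently from such an expectation would not by itself demonstrate selection, since demography and departures from single-step mutation can produce similar deviations.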
Replication slippage and recombination may function as two major mutational mechanisms. The effectiveness of the mismatch repair system is critical to SSR stability. Recombination could change SSR repeat number by unequal crossing over or gene conversion. The interaction of slippage and recombination, which may happen in heteroduplex DNA tracts, could also affect SSR stability. In turn, many of the repeated elements are known to function as very efficient promoters of recombination events (Gendrel et al. 2000; Korol 2001). It was speculated that the dependence of these mechanisms on stress might be of high importance for fast genetic adaptation and speciation in both eukaryotic and prokaryotic organisms (Parsons 1992; Korol et al. 1994; Korol 1999, 2001; Young et al. 2000; Rocha et al. 2002). Although many clear examples of SSR function, genomic distribution, and mechanisms have accumulated across taxa, the current knowledge about this class of sequences is still described as ‘an enigmatic consortium of consensus and controversy’ (Chambers & MacAvoy 2000). This is especially true of the mechanisms of co-evolution of SSRs and the rest of the genome. The observed patterns of SSR genomic distribution and polymorphism in extant species should have resulted from complex interactions between extreme structural features of repeated DNA, peculiarities of DNA metabolism, and selectable regulatory functions of SSR elements. Clarification of the evolutionary significance of these fast-evolving genomic elements calls for new critical evidence from the field and laboratory, especially by testing natural populations inhabiting ecologically heterogeneous and stressful environments (see also Rocha et al. 2002). We believe that this may be an important contribution of molecular ecology to ecological genomics in particular and to genomics in general.
Yochun Li completed her Ph.D in the Institute of Evolution, University of Haifa and now is a Postdoc at the Department of Plant Sciences, University of Arizona, Tucson, Arizona, USA. Abraham Korol is a full Professor of Genetics and Head of the Laboratory of Mathematical Population Genetics at the Institute of Evolution, University of Haifa, Israel. Tzion Fahima is a Senior Researcher and Head of the Laboratory of Plant Disease Resistance at the Institute of Evolution, University of Haifa, Israel. Avigdor Beiles is an Emeritus Senior Researcher of Population Genetics and Evolutionary Biology at the Institute of Evolution, University of Haifa, Israel. Eviatar Nevo is a Full Professor of Evolutionary Biology and Director of the Institute of Evolution, University of Haifa, Israel.