Spinocerebellar ataxias: An example of the challenges associated with genetic databases for dynamic mutations † ‡
For the Databases in Neurogenetics Special Issue
This work was funded in part by FEDER through the Operational Competitiveness Programme – COMPETE and by national funds through FCT – Fundação para a Ciência e a Tecnologia under the project “FCOMP-01-0124-FEDER-022718 (PEst-C/SAU/LA0002/2011).”
Locus-specific databases are an important source of information for diagnostic laboratories and a valued means of improving the quality of genetic testing. Although increasingly frequent, databases for oligonucleotide repeat expansions are still scarce, because several factors set these mutations apart and make building databases for them much more difficult. Definition of what constitutes “the repeat” to measure is not a simple matter and correct sizing is not always straightforward. Reference ranges and penetrance classes are not easy to establish. Acceptable margins of error depend on the disease and allele-size distribution, and vary according to size range and pathogenic significance. Inter- and intralaboratory variance is well documented and allele distribution may vary among populations. The spinocerebellar ataxias, used only as an example of those difficulties, are also a highly heterogeneous group, which includes loci with both pathogenic repeat expansions and point mutations or insertions/deletions. They display a variable, but often overlapping phenotype, where genotype–phenotype correlation is difficult or nonexistent. Standard (Human Genome Variation Society) nomenclature is not appropriate for oligonucleotide repeats, as established during harmonization among all EMQN (European Molecular Genetics Quality Network) external quality assessment (EQA) schemes for “repeat disorders.” Curation of such databases is a difficult task, but one that needs to be addressed adequately and without much delay. Hum Mutat 33:1359–1365, 2012. © 2012 Wiley Periodicals, Inc.
Databases of information about genes and mutations causing genetic diseases are an invaluable resource for any center offering genetic testing: they provide important technical details and a means to update scientific knowledge, and thus constitute a straightforward way to ascertain whether a variant has previously been identified in association with a particular disorder. They may also indicate whether there is any experimental evidence supporting a pathogenic role and give useful information about specific features of a phenotype, genotype–phenotype correlations (if existing), and whether there is evidence for founder effects in certain populations. A number of such databases are widely available and may be locus specific, such as the X-linked Adrenoleukodystrophy Database (www.x-ald.nl/) and the Cystic Fibrosis Mutation Database (www.genet.sickkids.on.ca/cftr/app).
Other databases contain data pertaining to a wide variety of genes, examples being the Leiden Open Variation Database (LOVD; www.lovd.nl/2.0), which provides access to data from over 4,000 individual genes, and the Diagnostic Mutation Database (DMuDB) in the United Kingdom, which was established in 2005 by the National Genetics Reference Laboratory in Manchester, as “a repository of diagnostic variant data, to support the diagnostic process in UK genetic testing laboratories” (www.ngrl.org.uk/Manchester/projects/informatics/dmudb).
Not many databases exist for diseases caused by dynamic mutations, despite the increasing number of neurodegenerative disorders (NDDs) identified as being caused by expansion of oligonucleotide repeat sequences. For example, trinucleotide repeat expansions, such as those associated with Huntington's disease (HD) and a number of the autosomal dominant spinocerebellar ataxias (SCAs), are individually rare, but collectively they contribute significantly to this class of genetic disorders.
Although for HD there is a general consensus as to what constitutes normal, large normal unstable, reduced penetrance, and disease alleles, this is not so for the SCAs, which present more of a challenge to laboratories offering testing. These disorders are also clinically and genetically heterogeneous, exhibiting significant differences in prevalence and in the type of mutations causing ataxia among different populations. For some genes, disease may result from point mutations, as well as repeat expansions, although the phenotype may be very different; for example, episodic ataxia type 2 and familial hemiplegic migraine type 1 are due to point mutations in the CACNA1A gene, expansions in which cause SCA6 [Barros et al., 2012]. We thus believe that, in addition to some challenges and problems specific to the SCAs, these are also a good paradigm to show the great difficulties associated with building of mutation databases in the case of “repeat disorders.”
Currently, several pathogenic mechanisms for oligonucleotide repeat disorders have been identified, including loss of function of the gene, gain of function of a protein containing an expanded polyamino acid (GCG/alanine or CAG/glutamine) tract, or RNA toxicity of an untranslated transcript. Further details regarding repeat disease pathology, as well as the where and when of repeat instability, are beyond the scope of this report [for reviews see, e.g., Gatchel and Zoghbi, 2005; López Castel et al., 2010].
SCAbase (www.scabase.eu) is “an evidence-based online resource in the field of the spinocerebellar ataxias,” to help laboratories keep up to date and improve the quality of genetic testing for the SCAs, while a structured, dynamic, interactive database for the SCAs is not available. It resulted from a specific recommendation of the EMQN (European Molecular Genetics Quality Network) Best Practice (BP) Guidelines for molecular genetic testing of SCAs [Sequeiros et al., 2010a] and the discussions at the BP Meeting and during the drafting of those guidelines. It incorporates a number of tables with information about the genes and mutations known by then, relative frequencies and founder effects, repeat reference ranges, a list of primers and the definition of the repeat for each of the main SCAs, as well as a comprehensive list of relevant bibliographic references.
SCAbase is thus a purely static repository of information and not truly the dynamic disease mutation database still to be built, with phenotypic information and interaction with the diagnostic laboratories and the scientific community, to further and better update its information and make it more solid and more useful. The fact is that the complexity of these diseases is such that building and curating of a database of this nature is no simple task for a number of reasons, as discussed below.
Definition of the Repeat
Some of the repeat motifs associated with SCAs are not “pure” stretches of the relevant repeat sequence, but include interruptions by other sequences or even other polymorphic repeats. This has led to debate concerning whether interruptions should be included in what is counted, or whether the counting should be restricted to pure stretches of nucleotide or amino acid sequences.
It is also debatable as to the significance of whatever is encoded by the interruption. An example is SCA2, where the repeat motif, as originally defined [Pulst et al., 1996; Sobczak et al., 2004], is commonly interrupted by a CAA, thus: [(CAG)n CAA (CAG)n]n. However, because CAA also codes for glutamine, this interruption has no effect on the sequence of the protein; it may, however, confer stability on the repeat array and thereby make it less likely to expand on transmission.
Other interruptions have greater significance, such as in SCA1, where there is overlap between the allele sizes in the normal and pathogenic ranges and it is the absence of any CAT interruption (coding for histidine) that determines whether an allele in this overlapping range is disease causing [Chung et al., 1993; Orr et al., 1993; Sobczak et al., 2004].
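The difference between counting a whole tract and counting only pure stretches can be made concrete with a minimal Python sketch. The function name `describe_tract` and both example alleles are hypothetical illustrations, not sequences taken from the cited studies; the sketch assumes the tract is given on the coding strand and in frame.

```python
def describe_tract(seq: str) -> dict:
    """Split an in-frame, coding-strand repeat tract into codons and summarize it."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("tract length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]

    # Longest uninterrupted run of CAG codons (the "pure" stretch).
    longest = run = 0
    for codon in codons:
        run = run + 1 if codon == "CAG" else 0
        longest = max(longest, run)

    return {
        "total_codons": len(codons),
        # CAG and CAA both encode glutamine; CAT encodes histidine.
        "glutamine_codons": sum(c in ("CAG", "CAA") for c in codons),
        "longest_pure_cag": longest,
        # (codon index, codon) for every non-CAG codon in the tract.
        "interruptions": [(i, c) for i, c in enumerate(codons) if c != "CAG"],
    }


# Hypothetical SCA2-like allele: (CAG)8 CAA (CAG)4 CAA (CAG)8
sca2_like = "CAG" * 8 + "CAA" + "CAG" * 4 + "CAA" + "CAG" * 8
# Hypothetical SCA1-like allele: (CAG)12 CAT CAG CAT (CAG)14
sca1_like = "CAG" * 12 + "CAT" + "CAG" + "CAT" + "CAG" * 14

print(describe_tract(sca2_like))
print(describe_tract(sca1_like))
```

For the SCA2-like allele the total codon count and the glutamine count coincide while the longest pure stretch is much shorter, which is exactly the ambiguity a database has to resolve by stating which of these numbers it records.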
Consensus as to the definition of the repeat structures for DRPLA, SCA1, SCA2, MJD/SCA3, SCA6, SCA7, SCA8, SCA10, SCA12, and SCA17 was achieved at the 2007 EMQN Best Practice Meeting for molecular genetic analysis of SCAs and discussion thereafter [Sequeiros et al., 2010b]. It is important to consider that further variation may be identified and should be included in any database providing information about these disorders because they may compromise routine genetic testing. Indeed, this has been shown to be the case in SCA10, where complex interruptions of the (ATTCT)n repeat in the ATXN10 gene have been shown to cause unusual patterns in repeat-primed polymerase chain reaction (PCR) analysis of expanded alleles [Matsuura et al., 2006]. Similar issues have been reported in the analysis of other repeat expansion disorders, for example, myotonic dystrophy types 1 and 2 [Radvansky et al., 2011a, b].
In some cases, the sequence composition of the expanded unstable repeat differs considerably from the standard seen in the majority of the patients. This has been documented for ATXN8 (SCA8), where it was suggested that differences could contribute to reduced penetrance [Moseley et al., 2000] and, more recently, also for the DMPK gene [Braida et al., 2010]. This diversification has important consequences for molecular genetic testing, as it might hamper the detection of repeat expansions. Indeed, routine triplet repeat-primed PCR (TP-PCR) assays [Warner et al., 1996], developed for typical triplets, will not amplify these variant expanded patterns, which can lead to false negative results. The phenotype of these patients might also diverge from the classical form of the disease. A potential for pathogenic variant expansions might also be present among alleles of a subset of patients in other repeat expansion disorders.
Detection and Sizing of Repeats
The general methodology for detection of these expansion mutations involves PCR across the repeat array and determination of the number of repeats from the size of the PCR products obtained. In practice, this will exclude a clinical diagnosis in any samples unequivocally demonstrating two normal alleles of different sizes, but it may be necessary to go on to a confirmatory test, such as Southern blotting or TP-PCR, to completely exclude the presence of a very large expansion if only a single normal allele (presumed homoallelism) is detected by routine PCR.
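The arithmetic behind sizing from a PCR product, and the rule for when a confirmatory test is indicated, can be sketched as follows. This is a minimal illustration, not a validated sizing procedure: the function names and the flanking length of 84 bp are hypothetical, and real assays calibrate against sequenced controls rather than relying on nominal fragment sizes.

```python
def repeats_from_amplicon(amplicon_bp: float, flanking_bp: int, unit_bp: int = 3) -> int:
    """Estimate the repeat number from a PCR product size.

    amplicon_bp: measured fragment size in base pairs
    flanking_bp: non-repeat sequence amplified with the repeat (primers
                 plus flanks); assay-specific, hypothetical here
    unit_bp:     length of one repeat unit (3 for a trinucleotide)
    """
    repeat_bp = amplicon_bp - flanking_bp
    if repeat_bp < 0:
        raise ValueError("amplicon is shorter than the flanking sequence")
    return round(repeat_bp / unit_bp)


def needs_confirmatory_test(repeat_counts: list) -> bool:
    """True when routine PCR shows fewer than two distinct alleles.

    A single peak may be true homoallelism, or it may hide a large
    expansion that failed to amplify, so a second method is indicated.
    """
    return len(set(repeat_counts)) < 2


print(repeats_from_amplicon(150, flanking_bp=84))
print(needs_confirmatory_test([22, 22]))
```

Because the estimate is rounded to the nearest whole repeat, a small error in the measured fragment size can shift the reported count by one unit, which is one reason sizing tolerances matter.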
In accordance with the current Best Practice Guidelines for molecular genetic testing of SCAs [Sequeiros et al., 2010a], methods used for determining the sizes of alleles should be capable of differentiating between alleles one repeat apart in size; and should be accurate enough to size to within ±1 repeat in the normal range and ±3 repeats in the pathogenic range (with the exception of SCA6, where the expanded alleles are also relatively small). Obviously, any errors in sizing will contribute inaccuracies to a database; hence the use of sequenced controls is recommended to ensure that laboratories are able to size the alleles as accurately as possible and assign them to their exact penetrance class.
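As a rough illustration, the tolerances quoted above can be written as a simple check of a reported size against a sequenced control. The function names are hypothetical, and treating the SCA6 exception as a ±1 tolerance throughout is an assumption of this sketch, since the text here states only that SCA6 is an exception.

```python
def sizing_tolerance(pathogenic: bool, small_expansion_locus: bool = False) -> int:
    """Allowed sizing error in repeat units.

    ±1 repeat in the normal range and ±3 in the pathogenic range.
    small_expansion_locus marks loci such as SCA6, whose expanded alleles
    are relatively small; applying ±1 there is an assumption of this sketch.
    """
    if pathogenic and not small_expansion_locus:
        return 3
    return 1


def sizing_acceptable(reported: int, reference: int, pathogenic: bool,
                      small_expansion_locus: bool = False) -> bool:
    """Compare a laboratory's reported size with the sequenced control."""
    tolerance = sizing_tolerance(pathogenic, small_expansion_locus)
    return abs(reported - reference) <= tolerance


print(sizing_acceptable(23, 22, pathogenic=False))   # one repeat off, normal range
print(sizing_acceptable(44, 40, pathogenic=True))    # four repeats off, pathogenic range
```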
Despite these recommendations, significant variability is still seen in sizing trinucleotide repeat alleles in the SCAs. Evidence for this is obtained from the annual EMQN external quality assessment (EQA) scheme for SCAs [Seneca et al., 2008], where three mock clinical cases are provided to all participant laboratories, which are required to genotype the samples, interpret the results, and report on the findings. The samples provided have allele sizes validated by two independent laboratories (by two different methodologies each), before and after aliquoting. The participating laboratories are requested to provide the allele sizes determined, whether or not they include (normal and/or expanded) allele sizes in their reports. Experience of this EQA scheme has shown that there is not only interlaboratory variation in the sizes reported, but there are also differences between samples tested in the same laboratory, that is, errors are not necessarily consistent. Although there has been an improvement in the allele sizing reported by participant laboratories over the eight years this EQA scheme has been in existence, considerable variation is still demonstrated. In some circumstances, this may make it difficult to assign an allele confidently to a normal, intermediate or expanded range, which could potentially result in a diagnostic error and incorrect recommendations.
For HD, the European Huntington Disease Network felt the need to duplicate in a central accredited laboratory the results obtained in 121 laboratories from 15 countries, and found a discrepancy in the repeat number reported in 51% of the upper alleles (n = 1326) and 40% of the lower alleles (n = 1250) [Quarrell et al., 2012]. All these issues related to sizing a repeat need to be adequately addressed, and they mark important differences from other locus-specific databases (LSDBs).
The distribution of specific genetic diseases, such as the SCAs [Sequeiros et al., 2012], differs markedly among populations. For example, although the SCAs are collectively rare in the UK, the most commonly identified forms are SCA2 and SCA6. In Italy, SCA1 is relatively frequent in the north, whereas SCA2 is the most frequent form in the south. However, these are relatively rare in Portugal, where MJD/SCA3 predominates and DRPLA is more frequent than anywhere else in Europe. In the northwest of Spain, the most frequent ataxia is SCA36 [García-Murias et al., 2012]; whereas in Germany, MJD/SCA3 and SCA6 are the most frequent. In addition to this, the sizes of alleles that are common in any given population may differ. An example of this is in SCA7, where a (CAG)10 allele is by far the most frequently seen in the normal range in most populations, but may not be so in others. This can cause difficulties in defining normal, intermediate, and expanded allele size ranges in any database, as there may be differences in where the boundaries between categories lie in different ethnic groups and, in some populations, specific allele sizes may not be seen at all.
Furthermore, there is no evidence available in the literature about the prevalence or the relative frequency of the various forms of SCA for most populations [for a review of existing data, see Sequeiros et al., 2012].
Identification of Normal and Pathogenic Allele Ranges
Owing to the differences seen in allele sizes in different populations, in addition to the sizing problems discussed, it can prove problematic to define the normal ranges and those associated with disease. This is particularly relevant for expansion disorders because the size of a pathogenic allele correlates inversely with age at onset, and thus directly with disease severity, which may have a bearing on the decisions of family members with regard to presymptomatic testing and reproduction, including prenatal diagnosis.
A further problem is that it may not be clear whether there is an “intermediate” range (alleles which in principle will be observed less frequently) and whether alleles within such a range may be pathogenic, perhaps with later onset and reduced penetrance. In some cases, the disease may not develop until very late in life and, therefore, the association of the phenotype with a smaller allele may go unrecognized for a long time. The effect of modifier genes, although suspected in some populations, is also unclear, and recent evidence indicates that some SCA alleles may be modifiers for other diseases. This has been demonstrated by reports of ATXN2 intermediate alleles (27–33 repeats) contributing to susceptibility to amyotrophic lateral sclerosis [Elden et al., 2010] and the detection of pathogenic ATXN2 alleles in 2% of a cohort of ALS patients [Daoud et al., 2011].
Human Genome Variation Society Nomenclature
The recommended nomenclature for reporting sequence variants is defined by the Human Genome Variation Society (HGVS; www.hgvs.org). In brief, nucleotide numbering starts from the first nucleotide of the ATG initiation codon and intronic nucleotides are numbered relative to the nearest exonic base, for example, +1, +2, and so on. Variable short sequence repeats are described such that the first nucleotide of the repeat counts as the start and the number of repeats is also indicated [den Dunnen and Antonarakis, 2000].
The alleles seen in trinucleotide repeat disorders are very complex to describe using HGVS nomenclature. This is made even more difficult by the fact that there are often many transcripts resulting from the relevant SCA genes, some of which do not contain the trinucleotide repeat sequence. A good example of this is in the CACNA1A gene associated with SCA6, where the transcript containing the (CAG)n (isoform 2, NM_023035.2) is distinct from the transcript generally used for the description of sequence variants associated with episodic ataxia type 2 or familial hemiplegic migraine type 1 (isoform 3, NM_001127221.1). This transcript does not include the polyglutamine tract.
Utilizing HGVS nomenclature to describe genotypes in individuals tested for the SCAs is, therefore, not straightforward. Taking SCA6 as an example, the genotype c.6955CAG[11];[23] pertains to reference sequence NM_023035.2 and would denote an affected individual carrying a normal (CAG)11 repeat on one allele and a pathogenic (CAG)23 repeat on the other. The approved EMQN Best Practice Guidelines for molecular genetic testing of SCAs [Sequeiros et al., 2010a] state that the use of HGVS-approved nomenclature is potentially confusing for laboratories and clinicians and, thus, not appropriate for reporting repeat expansions. Harmonization among all EMQN EQA schemes for repeat disorders (HD, DM, FRDA, FRAXA) has also reached the same conclusion.
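To show how differently the two styles read, the following sketch formats the same SCA6 genotype both ways. The position (6955), repeat unit, and allele sizes (11 and 23) come from the example above; the bracketed `[n];[m]` form reflects our reading of the HGVS convention for repeated sequences, and both function names are hypothetical.

```python
def hgvs_repeat_genotype(position: int, unit: str, allele1: int, allele2: int) -> str:
    """HGVS-style description: repeat counts in square brackets, alleles joined by ';'."""
    small, large = sorted((allele1, allele2))
    return f"c.{position}{unit}[{small}];[{large}]"


def plain_repeat_genotype(unit: str, allele1: int, allele2: int) -> str:
    """Plain repeat-count description of the kind laboratories commonly report."""
    small, large = sorted((allele1, allele2))
    return f"({unit}){small}/({unit}){large}"


print(hgvs_repeat_genotype(6955, "CAG", 23, 11))
print(plain_repeat_genotype("CAG", 23, 11))
```

The HGVS-style string is meaningful only together with its reference transcript, whereas the plain form carries the two numbers a clinician actually needs; a database could store both.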
Primer Sequences and the Problems of Single Nucleotide Polymorphisms
Many of the original PCR primer sequences for the detection of expansion mutations associated with SCAs were published many years ago. In the intervening period, with the increased quantity of data being generated from the Human Genome Project, there have been reports of a number of single nucleotide polymorphisms (SNPs) located within the annealing sites of some of these primers. This raises the possibility of alleles being missed if a primer fails to anneal for this reason. Any database providing information concerning sequences of primers that can be used for the detection of expansion mutations in the SCAs also needs the capacity to provide access to information regarding any SNPs associated with the primers.
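One way a database could link primers to known SNPs is a simple coordinate overlap check, sketched below. The class names, primer names, coordinates, and SNP identifiers are all hypothetical placeholders; a real implementation would take genomic coordinates from a reference assembly and SNP positions from dbSNP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Primer:
    name: str
    start: int  # 1-based, inclusive, on a shared reference coordinate system
    end: int    # 1-based, inclusive


@dataclass(frozen=True)
class Snp:
    snp_id: str
    position: int
    frequency: Optional[float]  # None when the frequency is not known


def snps_in_primers(primers, snps):
    """Map each primer name to the SNPs lying within its annealing site."""
    hits = {primer.name: [] for primer in primers}
    for primer in primers:
        for snp in snps:
            if primer.start <= snp.position <= primer.end:
                hits[primer.name].append(snp)
    return hits


# Hypothetical example data (not real coordinates or rs numbers).
primers = [Primer("fwd_example", 100, 120), Primer("rev_example", 300, 322)]
snps = [Snp("snp_example_1", 118, 0.119), Snp("snp_example_2", 250, None)]

for name, found in snps_in_primers(primers, snps).items():
    print(name, [s.snp_id for s in found])
```

A fuller version might also weight hits by their distance from the 3′ end of the primer, where a mismatch is generally more likely to prevent extension.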
The latest build of dbSNP (build 137) was released in June 2012 (http://www.ncbi.nlm.nih.gov/projects/SNP/). It is based on Genome Reference Consortium (GRC) assembly GRCh37.p5 and includes 1000 Genomes Project phase 1 release data; with these population data, it is becoming increasingly apparent that some of these SNPs occur at a significant frequency.
SNPs identified within published primer sequences for PCR detection of expansion mutations associated with SCAs are provided in Table 1.
Table 1. Details of SNPs Identified within Published Primer Sequences for PCR Detection of Expansion Mutations Associated with SCAs
|Disease||Reference||Primer||SNP in primer sequence||Frequency|
|SCA1||Orr et al (1993)||Rep-1||rs61747470||0.001|
| || ||Rep-2||rs112175378||0.003|
| || || ||(rs35216501)||Not known|
| || ||CAG-a||(rs66949327)||Not known|
| || ||CAG-b||rs140370240||0.001|
|DRPLA||Majounie et al (2007)||DRPLAF||(rs144390061)||Not known|
| || ||DRPLAR||None reported|| |
| ||Li et al (1993)||CTGB37.5F||(rs144390061)||Not known|
| || ||CTGB37.5R||None reported|| |
|SCA2||Pulst et al (1996)||SCA2-A||None reported|| |
| || ||SCA2-B||None reported|| |
| ||Imbert et al (1996)||UH13||None reported|| |
| || ||UH10||None reported|| |
|SCA3/MJD||Kawaguchi et al (1994)||MJD52||None reported|| |
| || ||MJD25||None reported|| |
| ||Juvonen et al (2005)||SCA3F||None reported|| |
| || ||SCA3R||None reported|| |
|SCA6||Zhuchenko et al (1997)||S-5-F1||None reported|| |
| || ||S-5-R1||None reported|| |
|SCA7||David et al (1997)||4U1024||None on dbSNP, but see note b||Not known|
| || ||4U716||rs201334618||0.004|
| || || ||(rs201152899)||0.001|
| ||Del-Favero et al (1998)||H2||None reported|| |
| || ||H1||None reported|| |
| ||Juvonen et al (2005)||SCA7F||None reported|| |
| || ||SCA7R||None reported|| |
|SCA8||Koob et al (1999)||SCA8-F3||None reported|| |
| || ||SCA8-R4||rs190934396||Not known|
| ||Majounie et al (2007)||SCA8 F||rs302018||0.446|
| || ||SCA8 R||rs190934396||Not known|
|SCA10||Matsuura et al (2000)||attct-L||None reported|| |
| || ||attct-R||None reported|| |
| ||Majounie et al (2007)||SCA10F||None reported|| |
| || ||SCA10R||None reported|| |
|SCA12||Holmes et al (1999)||A||None reported|| |
| || ||B||rs61326177||0.119|
| ||Majounie et al (2007)||SCA12F||None reported|| |
| || ||SCA12R||rs61326177||0.119|
|SCA17||Koide et al (1999)||TBP-F||(rs138026963)||Not known|
| || || ||rs191076110||Not known|
| || ||TBP-R||(rs141845648)||Not known|
| || || ||rs201331220||Not known|
| ||Juvonen et al (2005)||SCA17F||None reported|| |
| || ||SCA17R||(rs141845648)||Not known|
| || || ||rs148074761||<0.001|
| ||Majounie et al (2007)||SCA17 F||rs191076110||Not known|
| || ||SCA17 R||rs148074761||<0.001|
An open-access database is the most efficient way of getting important information into the public domain, where it can be used most effectively. However, it is clear that the factors that make oligonucleotide repeat expansion disorders (Table 2) different from other genetic disorders also make the task of curating any data to be included in a database for these diseases a very difficult one.
Table 2. Diseases Currently Known to Be Caused by Oligonucleotide Repeat Expansions
|FTD/ALS||105550||C9orf72||GGGGCC||Intronic||AD/sporadic|| ||None reported||Yes|
|DM2||602668||CNBP||CCTG||Intronic||AD|| ||None reported||Yes|
|DRPLA||125370||ATN1||CAG||Exonic||AD|| ||None reported||No|
|EPM1A||254800||CSTB||CCCCGCCCCGCG||5′UTR||AR|| ||Point mutations||Yes|
|FRAXA||300624||FMR1||CGG||5′UTR||XL|| ||Deletions and point mutations||Yes|
|HD||143100||HTT||CAG||Exonic||AD|| ||None reported||No|
|HDL1||603218||PRNP||8-octapeptide||Exonic||AD|| ||None reported||Yes|
|HDL2||606438||JPH3||CAG/CTG||Alternatively spliced exon (2A)||AD|| ||None reported||Yes|
|SBMA||313200||AR||CAG||Exonic||XL|| ||Point mutations (AIS)||No|
|SCA2||183090||ATXN2||CAG||Exonic||AD|| ||None reported||Yes|
|MJD (SCA3)||109150||ATXN3||CAG||Exonic||AD|| ||None reported||No|
|SCA6||183086||CACNA1A||CAG||Exonic||AD|| ||Deletions and point mutations (EA2/FHM1)||No|
|SCA7||164500||ATXN7||CAG||Exonic||AD|| ||None reported||Yes|
|SCA10||603516||ATXN10||ATTCT||Intronic||AD|| ||None reported||Yes|
|SCA12||604326||PPP2R2B||CAG||5′||AD|| ||None reported||No|
|SCA17||607136||TBP||CAG||Exon 3||AD|| ||None reported||No|
|SCA31||117210||BEAN1||TGGAA (note a)||Intronic||AD|| ||None reported||Yes|
|SCA36||614153||NOP56||GGCCTG||Intronic||AD|| ||None reported||Yes|
|OPMD||164300||PABPN1||GCG||Exonic||AD/AR||No||Point mutation leading to polyalanine stretch (note b)||No|
|EDM1/PSACH||132400 177170||COMP||GAC||Exonic||AD|| ||Deletion of GAC also seen in PSACH; point mutations||No|
It is essential to keep up to date with current literature and to ensure that any relevant changes are incorporated into the database, as soon as feasible. The scale of this task is beyond the capacity of a single individual, or even a small group, and is best addressed by being a community venture. Although the simplest way of keeping the database up to date with biological data would be to allow access to professionals working in the field, there needs to be regulation to ensure that a consistent approach is adopted. There would also need to be some form of validation process to ensure that updated information is accurate.
Although access to viewing the content of the database could be universal, it would be desirable to restrict the rights to update the database to specific interested parties specializing in the relevant diseases. It would be very important to protect the database with a comprehensive audit trail, as any changes might have serious consequences if they were incorrect and unchecked.
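The combination of universal read access, restricted write access, and an audit trail could look roughly like the following in-memory sketch. The class and field names (`CuratedStore`, `AuditEntry`, `curator_a`) are hypothetical; a production database would add authentication, persistent storage, and a review step before changes go live.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One immutable record of a change: who, when, what, old and new value."""
    timestamp: datetime
    curator: str
    record_id: str
    field_name: str
    old_value: object
    new_value: object


class CuratedStore:
    """Anyone may read; only authorized curators may update; every update is logged."""

    def __init__(self, authorized):
        self._authorized = set(authorized)
        self._records = {}
        self._audit = []  # append-only list of AuditEntry

    def read(self, record_id):
        # Return a copy so callers cannot change records without an audit entry.
        return dict(self._records.get(record_id, {}))

    def update(self, curator, record_id, field_name, new_value):
        if curator not in self._authorized:
            raise PermissionError(f"{curator} is not an authorized curator")
        record = self._records.setdefault(record_id, {})
        old_value = record.get(field_name)
        record[field_name] = new_value
        self._audit.append(AuditEntry(datetime.now(timezone.utc), curator,
                                      record_id, field_name, old_value, new_value))

    def history(self, record_id):
        return [entry for entry in self._audit if entry.record_id == record_id]


store = CuratedStore(authorized={"curator_a"})
store.update("curator_a", "record_1", "repeat_count", 23)
store.update("curator_a", "record_1", "repeat_count", 22)
print(store.read("record_1"))
print(len(store.history("record_1")))
```

Keeping the old value in every entry means an incorrect change can be traced and reversed, which addresses the concern about unchecked edits.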
Such databases should include mutation and phenotypic data specific to individual patients, which would potentially be useful. However, collecting data that may be relevant, such as ethnic origin, age of onset and disease course, is not always easy. In addition, there may be local information governance issues that prevent the submission of such data to an open access database. Obtaining consent from patients to allow their appropriately anonymized data to be included in the database could circumvent this problem, although it would be an additional burden on laboratories and clinicians.
Locus-specific databases (LSDBs) are a major challenge for most of the oligonucleotide repeat expansion disorders, including the SCAs. There are intrinsic problems, such as the definition of what constitutes the repeat and the difficulties in sizing it, which lead to significant variation in its measurement. In addition, it is not always easy to determine acceptable margins of error, reference ranges for different penetrance classes, and pathogenic significance for intermediate or borderline size alleles. Thus, the establishment of phenotype–genotype relationships constitutes a major additional difficulty when building databases for dynamic mutations. Nomenclature difficulties, intra- and interlocus heterogeneity, population diversity, phenotypic overlap, and complex relations among several of these diseases all make the construction and maintenance of databases for repeat disorders a difficult task, but also an essential one.
We thank the Human Genome Variation Society, which organized a meeting on neurogenetics databases [Montréal, 2011], and Suzy Sobrido (the organizer) for this challenge. We also thank Maria García-Murias for helpful suggestions and John Martindale for help with the manuscript. We are also thankful to EMQN and all the participants in its SCAs EQA scheme over the years, some of whom, such as Carmen Ayuso, have provided us with important feedback.
Conflicts of Interests: We declare no conflicts of interests for any of the authors.