Friedreich ataxia (FRDA) is a progressive neurodegenerative disorder resulting from mutations within the first intron of the FRDA gene on chromosome 9, which lead to reduced frataxin levels and impaired mitochondrial function. To date, there is no cure for FRDA. The study of the natural history of the disease and the execution of future therapeutic trials call for the development of appropriate outcome markers. Therapies may have a symptomatic effect resulting in improvement of symptoms and/or may interfere with the disease pathogenesis, leading to a slowing-down of disease progression. Potential outcome measures must be sensitive instruments carefully analysed for their significance with respect to either effect. Otherwise, the lack of appropriate assessment tools may undermine any scientific evidence for therapeutic effectiveness. Clinical scales may represent an appropriate and inexpensive tool since they do not require elaborate infrastructure.
Friedreich ataxia (FRDA) is a progressive neurodegenerative disorder associated with ataxia, dysarthria, pyramidal tract signs, sensory loss, cardiomyopathy and diabetes. There is no cure for FRDA so far. Studies of the natural history of the disease and future therapeutic trials require development of appropriate outcome markers. Since any therapeutic benefit is expected to modulate deterioration over time rather than to reverse disability, potential outcome measures must be sensitive instruments carefully analysed for their significance. Clinical scales may represent an appropriate measuring tool. Over the last few years the construction, evaluation and validation of sensitive clinical scales for the assessment of disease severity and progression in ataxia have had considerable impact on our understanding of the disease. Currently, there are three different scales that are most frequently applied: The International Cooperative Ataxia Rating Scale (ICARS), the Friedreich Ataxia Rating Scale (FARS) and the Scale for the Assessment and Rating of Ataxia (SARA). All scales have been validated and compared with regard to their testing properties.
ADL: Activities of Daily Living
FARS: Friedreich Ataxia Rating Scale
FIM: Functional Independence Measure
ICARS: International Cooperative Ataxia Rating Scale
ICC: Intraclass Correlation Coefficients
MBI: Modified Barthel Index
SARA: Scale for the Assessment and Rating of Ataxia
Quality characteristics of clinical scales
A scale should be easy to learn and perform to facilitate integration into clinical routine. In addition, it should be compact to limit fatigue in both the proband and the examiner. Furthermore, it should be robust and not influenced by external factors or by fluctuations over the course of a day or from day to day. Scales must also meet certain quality criteria: a measuring tool must fulfil standards of acceptability, reliability and validity. A scale is acceptable if scores span the entire range without major floor effects (more than 20% of values fall below a certain threshold) or ceiling effects (more than 20% of scores fall in the high-end range). To correctly display a wide range of clinical severity, a scale must also display adequate linearity. This is analysed by regression analysis applying a linear model. Reliability refers to the influence of random errors on a score. Variation between different raters applying the same scale is described as interrater reliability. Test-retest reliability characterizes the variation in repeated measurements under identical conditions. When comparing various scales, the quality of such reliability features is expressed as intraclass correlation coefficients (ICC). Reliability is good when ICC values exceed 0.8. Internal consistency describes whether performance measures that are thought to test the same theoretical construct lead to comparable testing scores; it refers to the correlation of various items within the same (sub)scale and is expressed as Cronbach's α (range: 0 to 1). Values higher than 0.8 indicate good internal consistency. It is also important whether a scale really tests what it purports to test, a feature designated as validity. Internal construct validity is assessed by a principal component analysis.
The latter assesses whether the number of factors and the loadings of the variables measured on them correspond to theoretical expectations. The factor analysis tests the variability among observed variables with respect to unobserved variables called factors to detect unobserved variables causing joint variations in a series of observed variables. The observed variables are modelled as linear combinations of the potential factors, plus error terms.
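Two of the quality criteria described above can be illustrated with a short calculation. The following is a minimal sketch rather than the procedure used in any of the cited studies: it computes the proportion of scores at the floor and ceiling of a scale and Cronbach's α from an item-by-patient table. The function names, the cutoff parameters and the example data are hypothetical; the text specifies only the 20% and 0.8 thresholds.

```python
from statistics import variance


def floor_ceiling_fractions(scores, low_cutoff, high_cutoff):
    """Return the fractions of scores at or below low_cutoff and at or above high_cutoff.

    The cutoffs are assumptions chosen by the analyst; the text only states that
    more than 20% of values in either region indicates a major effect.
    """
    n = len(scores)
    floor = sum(1 for s in scores if s <= low_cutoff) / n
    ceiling = sum(1 for s in scores if s >= high_cutoff) / n
    return floor, ceiling


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per patient and one column per item."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # transpose: one tuple per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: five patients rated on three items of one subscale.
ratings = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 2, 3],
    [4, 4, 3],
    [4, 3, 4],
]
alpha = cronbach_alpha(ratings)
floor, ceiling = floor_ceiling_fractions([sum(r) for r in ratings], 3, 11)
print(f"alpha = {alpha:.2f}, floor = {floor:.0%}, ceiling = {ceiling:.0%}")
```

With real data, an α above 0.8 and floor and ceiling fractions of at most 20% would satisfy the thresholds given in the text.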
Ataxia rating scales
There are several clinical ataxia rating scales including the International Cooperative Ataxia Rating Scale (ICARS), the Friedreich Ataxia Rating Scale (FARS) and the Scale for the Assessment and Rating of Ataxia (SARA) (Trouillas et al. 1997; Subramony et al. 2005; Schmitz-Hübsch et al. 2006). All three scales measure motor aspects of cerebellar dysfunction including ataxia of stance, gait and limbs. SARA is the shortest scale; FARS and ICARS also cover further aspects of the neurological examination (e.g. dysarthria or oculomotor symptoms). Only FARS considers features not directly related to the physical examination, such as activities of daily living.
In degenerative ataxia, the clinical presentation may be highly variable among patients suffering from the same disease. In FRDA, for example, the age of onset and the clinical severity depend largely on the number of trinucleotide repeats resulting in a highly polymorphic clinical syndrome. However, a scale should not only capture the phenotypical variety of a disease, but also reflect its progression over time.
ICARS was originally developed and assessed in 1997 as a clinical rating scale for cerebellar disorders in general (Trouillas et al. 1997). ICARS completion requires at least 20 minutes. Elaborate testing instructions are intended to reduce interrater variability (Trouillas et al. 1997), but they require proper training prior to administration. The ICARS structure aims to reflect the functional organization of the cerebellum (vermis and anterior lobe: posture and gait, hemispheres: limb ataxia, vermis and flocculus: oculomotor function). The maximum ICARS score of 100 is composed of four subscores. Higher scores reflect more severe disability. The four subscores refer to posture and gait (items 1–7, maximum score 34), limb ataxia (‘kinetic functions’, items 8–14, maximum score 52; each side of the body is assessed separately, and the scores of both sides are added to the total score), dysarthria (‘speech disturbances’, items 15 and 16, maximum score 8) and oculomotor function (items 17–19, maximum score 6). ICARS is based on the hypothesis that each of its 19 items belongs to one of the four multi-item subscales. The summation of all subscores to a total score without further weighting or standardization is referred to as Likert's method of summated ratings (Likert 1932). The ICARS model therefore assumes that single raw scores may be summed to subscale scores and that all subscale scores may be summed to a sum score.
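The unweighted summation described above can be written out in a few lines. This is an illustrative sketch only: the subscale maxima are those quoted in the text, whereas the dictionary keys, the function name and the example patient are hypothetical.

```python
# Subscale maxima as quoted in the text (Trouillas et al. 1997).
ICARS_SUBSCALE_MAX = {
    "posture_gait": 34,   # items 1-7
    "kinetic": 52,        # items 8-14, both body sides included
    "speech": 8,          # items 15 and 16
    "oculomotor": 6,      # items 17-19
}


def icars_total(subscores):
    """Sum the four subscores without weighting (Likert's summated ratings)."""
    for name, max_score in ICARS_SUBSCALE_MAX.items():
        value = subscores[name]
        if not 0 <= value <= max_score:
            raise ValueError(f"{name} must lie between 0 and {max_score}, got {value}")
    return sum(subscores[name] for name in ICARS_SUBSCALE_MAX)


# Hypothetical patient with moderate disability.
patient = {"posture_gait": 18, "kinetic": 24, "speech": 3, "oculomotor": 2}
print(icars_total(patient))  # 47
```

Because every point contributes equally, the sum is only meaningful if the items share a common construct and carry comparable information, which is what the studies discussed next set out to test.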
In a study of 77 FRDA patients, Cano and colleagues tested whether single item scores may be summed to subscores, and these four subscores to a sum score, without further processing (Cano et al. 2005). Three postulates were examined: (i) all items of each ICARS subscale should share an underlying construct as a prerequisite for summation of subscores and sum scores, (ii) all items of each ICARS subscale should contain a similar proportion of information concerning this construct to avoid the necessity of weighting and (iii) the items in each ICARS subscale should be correctly grouped together.
Analysis revealed that items of each subscale indeed shared a common theoretical construct. When comparing the correlations between single item scores and ICARS sum scores, it emerged that neither the ‘finger-finger test’ nor the ‘fluency of speech’ item exceeded the recommended correlation of 0.3. In contrast, all items in each subscale contained a comparable amount of information about the construct being assessed. Sixteen of 19 single items correlated better with the respective subscale than with other subscales, but correlations were rather weak and inconsistent (Cano et al. 2005).
In a second step, ICARS was tested for acceptability, reliability and validity in FRDA (Cano et al. 2005). ICARS sum and subscale scores yielded good acceptability for application in FRDA. There was also evidence for reliability of the ICARS sum score and of the ‘posture and gait’ and ‘kinetic function’ subscores, while the ‘speech’ and ‘oculomotor disorders’ subscales failed standard criteria for reliability. With intercorrelations ranging from 0.30 to 0.75, subscales were shown to capture related but distinct underlying constructs. ‘Age’ and ‘disease duration’ correlated well with all subscale scores except the ‘speech disorders’ subscale. Taken together, these results challenge the structural concept of four ICARS subscales. When interrater reliability (IRR) was tested by independent rating of patients’ videotapes by three assessors, ICARS yielded only a suboptimal ICC below 0.8. Nevertheless, these data had to be interpreted with caution because of the very small sample size. Indeed, Storey and co-workers established good interrater and intrarater reliability for ICARS (Storey et al. 2004). The speech disorder subscale revealed the weakest IRR with an ICC of 0.791, while all other subscales as well as the ICARS sum score had an ICC of 0.981 or better. In this study, the ICARS rating was videotaped. The videotapes were rated independently by two additional assessors and reviewed by the original live examiner 2–10 months after the initial rating. The comparison between the first live rating and the videotape rating was used to assess intrarater reliability and the reliability of live versus videotaped examination. However, rating videotapes entails the risk of overestimating the IRR because the instruction and performance of the examination do not vary between assessors. Interestingly, intrarater ICCs did not differ from ICCs of videotaped ratings.
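To make the ICC values quoted above more concrete, the following sketch computes an intraclass correlation coefficient from a table of ratings. The text does not state which ICC model the cited studies used; the one-way random-effects form, often written ICC(1,1), is assumed here purely for illustration, and the function name and data are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: one row per patient, each row holding the scores given by k raters.
    This model is an assumption; the cited studies may have used another ICC form.
    """
    n = len(ratings)          # number of patients
    k = len(ratings[0])       # number of raters per patient
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Mean square between patients and mean square within patients.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ICARS sum scores of three patients, each rated by two assessors.
scores = [[10, 12], [20, 18], [30, 31]]
print(round(icc_oneway(scores), 3))  # 0.985
```

In this toy example the raters differ by at most two points while patients differ by ten or more, so almost all variance lies between patients and the ICC is well above the 0.8 threshold.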
In a large cross-sectional study on 603 FRDA patients, ICARS scores were analysed as a function of disease duration, age of onset and GAA repeat length (Metz et al. 2013). Age of onset, time between diagnosis and examination as well as repeat length were found to influence progression. Interestingly, individuals with an onset at 14 years or younger were characterized by an average increase of 2.5 points on the ICARS sum score per year, while individuals with a later onset increased their ICARS sum score by 1.8 points per year. In contrast to the study of Cano, Metz found evidence for major ceiling effects in the ‘posture/gait’ and ‘lower limb’ subscores, leading to impaired ICARS sensitivity in more advanced disability stages with an ICARS sum score above 60. Subscale structure emerged as the major weakness of ICARS, while sum scores were considered robust measuring tools in FRDA. These data are further corroborated by a comparative analysis of ICARS, FARS and SARA, in which principal component analysis identified several factors contributing to the variance of FARS and ICARS sum scores. For ICARS, four factors with an eigenvalue greater than 1 were determined. Only factor 3 loaded on a single subscale (oculomotor function), with no other factor corresponding to a specific subscale. Similarly, five different factors were determined for FARS, of which only factor 4 loaded on a specific subscale (bulbar function). Hence, these findings again question the division of both ICARS and FARS into subscales (Bürk et al. 2009). Nevertheless, ICARS has been successfully applied in clinical and therapeutic trials in various ataxia disorders (Hassin-Baer et al. 2000; Gabsi et al. 2001; Artuch et al. 2002; Mori et al. 2002; Bier et al. 2003; Di Prospero et al. 2007; Cooper et al. 2008; Heo et al. 2008; Pineda et al. 2008, 2010; Kimura et al. 2009, 2011; Nakamura et al. 2009; Lynch et al. 2010; Ristori et al. 2010; Meier et al. 2012).
Because several clinical trials using ICARS as the primary outcome failed to show efficacy, its use as the primary clinical scale in future clinical trials should be questioned.
The neurodegenerative profile differs between various ataxia disorders: FRDA is characterised by prominent sensory dysfunction because of primary degeneration of dorsal root ganglia. It may also involve the skeletal system and the inner organs, resulting in scoliosis, foot deformity, diabetes and cardiac hypertrophy. To address these phenotypical specificities, another scale entitled FARS was developed, evaluated and validated specifically for FRDA (Subramony et al. 2005). The sum score (maximum 159) is composed of an examination score derived from ordinal grading of a directed neurological assessment (maximum score 117), a functional ataxia staging score of overall mobility (score 0 to 6) and an assessment of the activities of daily living (ADL) (score 0 to 36). The physical examination score is composed of bulbar (maximum score 11), upper limb (maximum score 36), lower limb (maximum score 16), peripheral nerve (maximum score 26) and upright stability/gait (maximum score 28) subscores.
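The nested composition of the FARS sum score can be checked arithmetically. The sketch below uses only the maxima quoted in the text; the key names and the function are hypothetical labels introduced for illustration.

```python
# Maxima as quoted in the text (Subramony et al. 2005).
FARS_EXAM_MAX = {
    "bulbar": 11,
    "upper_limb": 36,
    "lower_limb": 16,
    "peripheral_nerve": 26,
    "upright_stability_gait": 28,
}
FARS_STAGING_MAX = 6   # functional ataxia staging, 0 to 6
FARS_ADL_MAX = 36      # activities of daily living, 0 to 36


def fars_sum_score(exam_subscores, staging, adl):
    """Return (examination score, sum score) from the three FARS components."""
    exam = sum(exam_subscores[name] for name in FARS_EXAM_MAX)
    return exam, exam + staging + adl


# The five examination subscores add up to the examination maximum of 117,
# and the three components add up to the maximum sum score of 159.
exam_max, sum_max = fars_sum_score(FARS_EXAM_MAX, FARS_STAGING_MAX, FARS_ADL_MAX)
print(exam_max, sum_max)  # 117 159
```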
In addition, three timed activities were applied as performance measures in the original publication (Subramony et al. 2005): the ‘PATA’ rate (number of repetitions of the bisyllabic phrase ‘PATA’ within 10 s), the 9-hole pegboard test (time taken to place and retrieve pegs on a 9-hole pegboard, assessed for each side) and the timed walk of 50 feet (25 feet or 8 m one way, turn and walk back with or without device). FARS assessment together with administration of all three performance measures takes about 30 minutes (Subramony et al. 2005). FARS was initially evaluated in 14 FRDA patients (Subramony et al. 2005). In this study, seven independent raters assessed all probands separately. There was no evidence for any influence of learning or fatigue. For disease stage, ADL, upper limb and lower limb coordination, stability/gait and sum neurological examination, interrater reliability was excellent with ICC > 0.75, but bulbar and peripheral nerve scores were found to be less consistent between examiners. A rater bias was found for disease stage, ADL, upper limb coordination, peripheral nerve scores and sum neurological examination. Timed performance tests were characterized by less rater bias (as demonstrated by non-significant p values) and good interrater reliability.
Factor loadings were high for most items tested. However, they indicated more variation for bulbar, peripheral nerve and lower limb scores. Promax rotation showed that almost 90% of variance depended on two factors. This was explained by the structure of FARS with its focus on lower-body and upper-body dysfunction. Correspondingly, ‘disease stage’, ‘lower limb’, ‘peripheral nerve’ and ‘upright stability’ scores as well as ‘PATA rate’, ‘bulbar’, ‘upper limb’ and ‘pegboard scores’ were highly correlated, thereby suggesting two common underlying constructs for the correlated items.
Validity of FARS as a measure for FRDA was evident through a good correlation of ‘ADL’ and ‘mobility measures’ with ‘neurological examination sum’ and most of its five subscores. Correlations of ADL and mobility scores with timed testing measures were weaker than those with most examination measures. The authors suggested composite functional and examination scores for monitoring global dysfunction in FRDA.
FARS sensitivity and responsiveness were investigated by Fahey and colleagues in 76 FRDA patients (Fahey et al. 2000). The clinical severity covered a broad range among the participating FRDA patients. In this study, the testing protocol was restricted to FARS examination, ADL and functional ataxia scores and did not cover any performance measures. FARS scores were found to correlate with parameters such as age, age of onset and disease duration. Concurrent criterion validity was then analysed by correlating FARS and its subscores with ICARS as well as with a modified version of the Barthel index (Modified Barthel Index, MBI) and the Functional Independence Measure (FIM) (Keith et al. 1987; Shah et al. 1989). Both MBI (score 0–100) and FIM (score 18–126), on which scores decrease with increasing disability, estimate the amount of support required for the activities of daily living; they represent validated and established measures of disability (Keith et al. 1987; Shah et al. 1989).
FARS sum scores (examination subscores + ADL score + functional score), FARS examination scores and ADL scores were found to significantly correlate with MBI, FIM and ICARS scores (Fahey et al. 2000). The FARS functional subscore showed the highest correlation with FIM, while FARS examination scores correlated best with ICARS scores. The latter finding is not surprising given the examination-based character of both testing tools. The strong correlation between these four measuring tools (FARS-MBI-FIM-ICARS) suggests a common underlying theoretical construct.
As a measure for progression in a cross-sectional approach, higher FARS scores would be expected in individuals with longer disease duration. Indeed, disease duration was significantly correlated with FARS, ICARS, MBI and FIM (Fahey et al. 2000). The strongest correlation, however, was evident for FARS sum scores. Considering progression, face validity can therefore be assumed for all four measuring tools. To analyse FARS properties in longitudinal studies, all parameters were re-assessed after 12 months in a subgroup of 43 FRDA individuals (Fahey et al. 2000). The mean change was 9.5 points on FARS, 5 points on ICARS, 3.1 points on FIM and 1.9 points on MBI. Progression proved significant on FARS, ICARS and FIM, but not on the MBI.
Even a small benefit resulting from therapeutic intervention is likely to be clinically relevant. A clinical scale should therefore be sensitive enough to detect even small effects. In the study of Fahey, FARS showed the largest effect size and was found to require the smallest number of individuals (60 individuals versus 118 for ICARS, 253 for FIM and 605 for MBI) for an equivalently powered clinical trial (Fahey et al. 2000). These estimated numbers refer to a trial powered to detect a 50% reduction in disease progression; for a 25% reduction, four times as many individuals per group would be needed. Indeed, FARS has meanwhile been successfully applied in therapeutic trials (Boesch et al. 2007; Di Prospero et al. 2007; Meier et al. 2012).
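The fourfold increase follows from the usual approximation that the required sample size is inversely proportional to the square of the effect to be detected, so halving the detectable effect quadruples the sample. The sketch below applies this scaling to the numbers quoted from Fahey et al.; it assumes that variance, power and significance level stay unchanged, and the function name is hypothetical.

```python
import math


def scale_sample_size(n_reference, reference_reduction, target_reduction):
    """Rescale a sample size, assuming n is proportional to 1 / effect**2.

    Assumes unchanged variance, power and significance level; this is an
    approximation, not a substitute for a formal power calculation.
    """
    ratio = reference_reduction / target_reduction
    return math.ceil(n_reference * ratio ** 2)


# Numbers quoted in the text for detecting a 50% reduction in progression.
reported = {"FARS": 60, "ICARS": 118, "FIM": 253, "MBI": 605}

# Approximate numbers for detecting a 25% reduction instead.
for scale, n in reported.items():
    print(scale, scale_sample_size(n, 0.50, 0.25))
# FARS 240, ICARS 472, FIM 1012, MBI 2420
```

The ranking of the scales is unaffected by the rescaling, but the absolute numbers show how quickly a less sensitive scale becomes impractical in a rare disease.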
In the original FARS evaluation study, 12 of 14 participating patients were still ambulatory with or without a walking aid (Subramony et al. 2005). It was therefore not clear whether FARS would also be applicable in more advanced stages of FRDA. In a multi-institutional study on 155 FRDA patients by Lynch and colleagues, fewer than 50% of the patients were ambulatory (Lynch et al. 2006). The testing protocol consisted of the FARS examination scale, ADL and functional ataxia scale, several performance measures (PATA rate, timed 25-foot walk (T25FW), 9-Hole Pegboard Test (9HPT) and a low-contrast letter acuity test based on Sloan Charts) as well as the SF-36, a generic health-related quality of life scale. In addition to single performance measures, several composite performance scores were analysed for their measurement properties in FRDA (Lynch et al. 2006).
Because ADL and functional disability scores correlate not only with each other but also with disease duration, these parameters were used as reference measures of progression for further statistical analysis. A cross-sectional approach yielded a good correlation of the FARS sum examination score and FARS subscores with disease duration, functional disability and ADL rating. Correlations were higher for the FARS sum score and upright stability subscore than for all other FARS subscores (Lynch et al. 2006). In contrast to the originally published validation study on FARS, factor analysis yielded only a single factor accounting for 99% of the variation among subscale scores. In a multivariate linear regression model accounting for sex and participating institution, FARS scores were reliably predicted by ‘age’ and ‘repeat length’ (Lynch et al. 2006). T25FW, 9HPT and PATA rate showed reproducible results between trials. All single performance scores were correlated with measures of disease progression (disease duration, ADL, functional staging and FARS examination scores). Among these, correlations were considerably lower for PATA rate and slightly lower for Sloan Charts. Performance measures are therefore likely to sufficiently capture progression in FRDA. The variables ‘age’ and ‘repeat length’ predicted T25FW, 9HPT and Sloan Charts scores in a linear regression model. On the other hand, correlation among performance measures was highly variable, pointing to a specific sensitivity of each performance measure to distinct functional aspects of the clinical syndrome of FRDA. Performance measures, therefore, may not develop congruently over time. To overcome these shortcomings, single performance measures were combined to create composite scores analogous to the Multiple Sclerosis Functional Composite (MSFC) (Cutter et al. 1999). Because of their low correlation with disease progression parameters, PATA rate scores were not taken into consideration for the generation of composite performance scores.
A combined performance score of T25FW, 9HPT and Sloan Charts scores did not show any floor effect even in late disease stages, in contrast to the two-component composite of T25FW and 9HPT. Both composite performance scores correlated to a greater extent with all measures of disease progression (disease duration, ADL, functional disability score, FARS examination score) than single performance scores. They were also reliably predicted by ‘age’ and ‘repeat length’. Correlations of composite performance scores with measures of progression were considered equivalent or superior to those of FARS examination scores, suggesting that composite performance scores probably reflect progression to a greater extent than the clinically based FARS examination rating.
On the SF-36, a patient questionnaire referring to the quality of life, FRDA individuals achieved significantly lower scores than healthy controls on the Physical Component Summary, but not on the Mental Component Summary (MCS) (Ritvo et al. 1997). FARS and the three component composite performance score showed moderate but significant correlations with the Physical Component Summary (Cutter et al. 1999).
Stratification by ‘age’, ‘age of onset’ and ‘repeat length’ did not significantly alter the correlation with markers of severity. However, disease severity was best reflected by FARS and composite functional scores in individuals younger than 30 years with a faster progression (Cutter et al. 1999).
Despite these favourable results, we should be aware that performance measures may not represent appropriate measuring tools for the entire course of FRDA: It is obvious that the T25FW is not applicable in wheelchair-bound patients. Despite their excellent testing properties in moderate disability stages, performance measures targeted to restricted functional skills should only be applied in combination with broader assessment tools that do not ‘max out’ over the course of the disease. Their great advantage, however, is their continuous nature in contrast to the categorical or ordinal properties of ADL and functional disability rating. In the study of Lynch, another problem arose from limited test-retest properties: The test-retest method error of the two-component composite performance score was shown to roughly correspond to the predicted change per year (Cutter et al. 1999). This feature would have a direct impact on modelling ‘numbers to treat’ for therapeutic trials.
SARA is based on a semiquantitative assessment of cerebellar ataxia on an impairment level (Schmitz-Hübsch et al. 2006). SARA is restricted to cerebellar clinical symptoms and does not take into account any extracerebellar features. Depending on the disease stage, its administration takes 5–40 min (mean 14.2 min) and does not require special training or technical equipment (Schmitz-Hübsch et al. 2006). The eight items were selected from a standard neurological examination for their specificity for ataxia and for their suitability for standardized testing and rating procedures. A maximum score of 40 reflects most severe ataxia. The items are the following: gait (score 0 to 8), stance (score 0 to 6), sitting (score 0 to 4), speech disturbance (score 0 to 6), finger chase (score 0 to 4), nose-finger test (score 0 to 4), fast alternating hand movements (score 0 to 4), and heel-shin slide (score 0 to 4). Limb function is rated independently for both sides, and the arithmetic mean of both sides enters the sum score.
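The scoring rule just described, including the averaging of both body sides for the four limb items, can be sketched as follows. Item maxima are taken from the text; the key names, the function name and the example ratings are hypothetical.

```python
# Item maxima as quoted in the text (Schmitz-Hübsch et al. 2006).
SARA_AXIAL_MAX = {"gait": 8, "stance": 6, "sitting": 4, "speech": 6}
SARA_LIMB_MAX = {
    "finger_chase": 4,
    "nose_finger": 4,
    "fast_alternating": 4,
    "heel_shin": 4,
}


def _check(item, value, max_score):
    if not 0 <= value <= max_score:
        raise ValueError(f"{item} must lie between 0 and {max_score}, got {value}")


def sara_total(axial, right, left):
    """SARA sum score: axial items plus the right/left mean of each limb item."""
    total = 0.0
    for item, max_score in SARA_AXIAL_MAX.items():
        _check(item, axial[item], max_score)
        total += axial[item]
    for item, max_score in SARA_LIMB_MAX.items():
        _check(item, right[item], max_score)
        _check(item, left[item], max_score)
        total += (right[item] + left[item]) / 2   # arithmetic mean of both sides
    return total


# Hypothetical patient with slightly more severe limb ataxia on the left.
axial = {"gait": 3, "stance": 2, "sitting": 1, "speech": 2}
right = {"finger_chase": 1, "nose_finger": 2, "fast_alternating": 1, "heel_shin": 2}
left = {"finger_chase": 2, "nose_finger": 2, "fast_alternating": 2, "heel_shin": 3}
print(sara_total(axial, right, left))  # 15.5
```

Averaging the two sides keeps the maximum at 40 and means that the sum score can take half-point values.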
SARA had initially been evaluated in large validation trials in SCA (Schmitz-Hübsch et al. 2006, 2008; Weyer et al. 2007). Its longitudinal metric properties were also established in a large series of SCA patients (Schmitz-Hübsch et al. 2010).
As mentioned above, the FRDA phenotype is characterised by proprioceptive deficits leading to a rather sensory type of ataxia. Therefore, it was questioned whether SARA would be appropriate to monitor progression in FRDA. In a study of 96 FRDA patients, reliability and validity of SARA were assessed in comparison to ICARS and FARS (Bürk et al. 2009). SARA did not show major floor or ceiling effects and was not prone to interrater variability. Correlation was good across SARA, ICARS and FARS sum scores and between all three scales and both progression parameters, ‘disease duration’ and ‘ADL ratings’ (Bürk et al. 2009). Internal consistency was conclusive for single items as well as for subscale scores. All SARA items showed high construct validity. Principal component analysis showed that all SARA item scores loaded on a single factor that accounted for 66% of the variance. Single item-sum score correlations and item homogeneity were comparable for SARA, FARS and ICARS. When correlated with ICARS and FARS items including FARS functional scores, all SARA items demonstrated good convergent and divergent validity, that is, they correlated highly with items similar in content and less with items dissimilar in content (Bürk et al. 2009).
Longitudinal analysis based on repeated ICARS, FARS and SARA ratings in FRDA patients gave evidence that SARA was the most sensitive and ICARS the least sensitive scale for detecting longitudinal changes over time (Bürk et al. 2011): SARA required the smallest number of individuals for an equivalently powered clinical trial. Changes of ICARS, SARA and FARS scores correlated significantly with one another, but also with the changes on ADL. So, despite its compact nature, SARA was found to sufficiently capture progression in FRDA. FARS, which had been established specifically for FRDA, was not found to be superior to SARA in monitoring progression in FRDA. These cross-sectional and longitudinal studies, therefore, question the necessity of complex, time-consuming and thus cost-intensive scales (Bürk et al. 2009, 2011), implying that robust scales are equally or better suited to detect change in symptoms.
Indeed, SARA has meanwhile been applied in therapeutic trials not only in FRDA, but also in other types of ataxia, including cortical cerebellar atrophy, episodic ataxia and Ataxia Telangiectasia (AT) (Boesch et al. 2007; Gazulla and Benavente 2007; Lohle et al. 2008; Broccoletti et al. 2011; Zesiewicz et al. 2012).
A still unsolved problem is the rating of children with FRDA who are younger than 10 years: in the first decade of life, motor and coordination skills are physiologically still improving, which confounds all measures of increasing disability. Future studies should investigate the longitudinal properties of various measurement tools in larger patient samples covering the whole age and disability range of FRDA.
The National Institute of Neurological Disorders and Stroke (NINDS) initiated a program to create a toolbox for common data elements (CDE) for neuroscience clinical research (www.commondataelements.ninds.nih.gov). A team consisting of international FRDA experts and members of the NINDS-CDE team developed FRDA common data elements for ataxia and performance measures, biomarkers, cardiac and other clinical outcomes, and demographics, laboratory tests and medical history (Lynch et al. 2013). The subgroups’ recommendations are classified and published as core elements, supplemental elements or exploratory elements. Template case report forms were created for the core tests. It is considered mandatory to use core elements in future clinical research studies and interventional trials. This will allow a faster initiation of clinical trials and a standardization leading to greater ability to compare and analyse data across studies.
Conflict of interest
The authors have no conflicts of interest to declare.