Measuring Parkinson's disease over time: The real‐world within‐subject reliability of the MDS‐UPDRS

Abstract

Background: An important challenge in Parkinson's disease research is how to measure disease progression, ideally at the individual patient level. The MDS‐UPDRS, a clinical assessment of motor and nonmotor impairments, is widely used in longitudinal studies. However, its ability to assess within‐subject changes is not well known. The objective of this study was to estimate the reliability of the MDS‐UPDRS when used to measure within‐subject changes in disease progression under real‐world conditions.

Methods: Data were obtained from the Parkinson's Progression Markers Initiative cohort and included repeated MDS‐UPDRS measurements from 423 de novo Parkinson's disease patients (median follow‐up: 54 months). Subtotals were calculated for parts I, II, and III (in ON and OFF states). In addition, factor scores were extracted from each part. A linear Gaussian state space model was used to differentiate variance introduced by long‐lasting changes from variance introduced by measurement error and short‐term fluctuations. Based on this, we determined the within‐subject reliability of 1‐year change scores.

Results: Overall, the within‐subject reliability ranged from 0.13 to 0.62. Of the subscales, parts II and III (OFF) demonstrated the highest within‐subject reliability (both 0.50). Of the factor scores, the scores related to gait/posture (0.62), mobility (0.45), and rest tremor (0.43) showed the most consistent behavior.

Conclusions: Our results highlight that MDS‐UPDRS change scores contain a substantial amount of error variance, underscoring the need for more reliable instruments to advance our understanding of the heterogeneity in PD progression. Focusing on gait and rest tremor may be a promising approach for an early Parkinson's disease population.

© 2019 The Authors. Movement Disorders published by Wiley Periodicals, Inc. on behalf of International Parkinson and Movement Disorder Society.

predictions remains difficult to date, both because of the heterogeneous symptomatology of PD and because we lack objective biomarkers.
Recent longitudinal studies, both observational and experimental, have primarily used the Movement Disorder Society-Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to quantify disease progression. 3,4 Introduced by the Movement Disorder Society in 2008 as a revision of the original UPDRS, it was designed as a comprehensive instrument for evaluating both motor and nonmotor impairments and disability in PD. 5 The extent to which the MDS-UPDRS, or in fact any instrument, is suitable to quantify disease progression strongly depends on its reliability: an instrument should show reasonably low measurement error in comparison with expected changes, so that it can give a precise estimate of a patient's true progression rate. Using instruments with a high reliability allows for smaller sample sizes and shorter follow-up in trials assessing new disease-modifying therapies. 6 It is also a key prerequisite for building fine-grained predictive models.
Several aspects might affect the reliability of the MDS-UPDRS with respect to monitoring changes over time. First, the observer-rated items are subject to inter- and intrarater variability. Second, an individual assessment can only provide a snapshot of the patient's condition and is therefore prone to reflect short-term effects that are irrelevant to the overall progression of the disease. This is particularly relevant to the motor function assessment (part III), as a substantial proportion of patients experience motor fluctuations as a consequence of dopaminergic therapy (DT). To control for this, longitudinal studies often apply a washout period prior to the motor assessment. However, this can be burdensome for participants, and the length of the washout period (commonly >6 or >12 hours) is not sufficient to cancel out the long-duration response of levodopa and dopamine agonists. 7 In real-world applications, short-term effects may also be introduced by factors such as mood, stress, climate, time of the day, and, specifically for parts I and II, whether the patient or caregiver answered the questions.
Despite these sources of variation, the MDS-UPDRS is often referred to as a highly valid and reliable instrument, which is largely based on two studies examining its clinimetric properties, one conducted by Goetz et al in 2008 (English version) and the other by Martinez-Martin et al in 2013 (Spanish version). 5,8 Only the latter examined test-retest reliability, showing intraclass correlation coefficients (ICCs) of greater than 0.90 for all subscales. However, in the context of monitoring changes over time, the high ICC values should be interpreted with caution because these values only reflect how well the scale can differentiate between patients at a given point in time, that is, the ICC is a ratio between the true variance of absolute scores and the total variance of absolute scores (consisting of the true variance and variance produced by measurement error). 9 To quantify the scale's ability to assess changes over time, we need a measure of reliability that incorporates the variance of the within-subject changes instead.
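The contrast between the two notions of reliability can be illustrated with a small numerical sketch. All variances below are hypothetical values chosen for illustration and are not estimates from the PPMI data; the function names are likewise invented here. The sketch assumes independent measurement errors, so that a change score (the difference of two measurements) carries the error variance twice.

```python
def between_subject_reliability(var_true_score, var_error):
    """ICC-style ratio: true between-patient variance over total variance."""
    return var_true_score / (var_true_score + var_error)


def change_score_reliability(var_true_change, var_error):
    """Share of change-score variance that reflects true change.

    A change score is the difference of two measurements, so it carries
    the error variance twice (assuming independent errors).
    """
    return var_true_change / (var_true_change + 2.0 * var_error)


# Hypothetical variances, for illustration only.
VAR_BETWEEN = 100.0  # spread of true scores across patients
VAR_CHANGE = 9.0     # spread of true 1-year changes within patients
VAR_ERROR = 9.0      # measurement error plus short-term fluctuations

icc = between_subject_reliability(VAR_BETWEEN, VAR_ERROR)
r_change = change_score_reliability(VAR_CHANGE, VAR_ERROR)
print(f"between-subject reliability: {icc:.2f}")   # 0.92
print(f"change-score reliability:    {r_change:.2f}")  # 0.33
```

With the same error variance, the scale separates patients well at a single visit yet recovers only a third of the variance in their changes, because true changes over 1 year vary far less than true scores across patients.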
The problem becomes apparent when looking at two subsequent 1-year change scores of the MDS-UPDRS in the Parkinson's Progression Markers Initiative (PPMI) cohort (Fig. 1). A clear negative correlation can be seen, which means that if an MDS-UPDRS score increases during one period, it is likely to decrease in the subsequent period, and vice versa. This behavior is in essence a reflection of the well-known "regression toward the mean" phenomenon, indicating that the observed changes include a substantial amount of measurement error and are only partly related to changes in the true disease state. These effects can be quantified using models that assume there is an underlying "true" latent phenotype that evolves over time, of which the instrument provides noisy estimates at different times.

Study Objective
The objective of this study was to use data from a large cohort study to provide a realistic estimate of the reliability of the MDS-UPDRS when used to measure individual changes in disease progression. By modeling time series of MDS-UPDRS scores using linear state space models, we aimed to discriminate between variance introduced by actual disease progression and variance introduced by "noise," consisting of measurement error and short-term effects irrelevant to the overall disease progression.

FIG. 1. Correlation between two subsequent 1-year change scores of the MDS-UPDRS part III (OFF) on the PPMI data set. Line is fitted by linear regression on the response variable y3-y2. A similar negative correlation can be seen in parts I, II, and III (ON).

MDS-UPDRS Progression Model
Observational data sets containing time series can be used to model the measurement error of an instrument using linear state space models. 11 In this study, a linear Gaussian state space model was used to describe the within-subject changes in MDS-UPDRS scores over time, defined by the following 2 equations:

$$y_{t,i} = \theta_{t,i} + v_{t,i}, \qquad v_{t,i} \sim \mathcal{N}(0, \sigma^2_E) \tag{1}$$

$$\theta_{t,i} = \theta_{t-1,i} + w_{t,i}, \qquad w_{t,i} \sim \mathcal{N}(\text{trend}, \sigma^2_{\Delta T}) \tag{2}$$

Time is indicated by the discrete index $t$, which refers to the number of months after screening. Months without measurements in between study visits were treated as missing values. The index $i$ refers to an individual subject in the study sample. In equation (1), the observation equation, analogous to classical test theory, we assume that there are hidden true progression states $\theta_{t,i}$, which can only be measured indirectly through our observations $y_{t,i}$, the MDS-UPDRS scores. These observations are the result of the true progression states $\theta_{t,i}$ plus Gaussian noise $v_{t,i}$ (independent and identically distributed with mean 0 and variance $\sigma^2_E$). Thus, in our model $\sigma^2_E$ is closely related to measurement error, reflecting both inter- and intrarater variability and short-term effects that are irrelevant to disease progression. These effects may be introduced by a wide range of factors such as mood, stress, and climate at the time of the assessment. In equation (2), the state equation, we assume that the true progression states $\theta_{t,i}$ from adjacent study visits are linked through Gaussian true progression $w_{t,i}$ (independent and identically distributed with mean trend and variance $\sigma^2_{\Delta T}$). Therefore, $\sigma^2_{\Delta T}$ corresponds to the true variance of change scores. Although factors such as age influence the rate of progression, these were not included in our model because we aimed to estimate the magnitude of the variance in true progression $\sigma^2_{\Delta T}$, not which factors explain this variance.
Several Markov assumptions are implied by the model: for instance, given the previous state $\theta_{t-1,i}$, the present state $\theta_{t,i}$ is independent of all other past states, and the observed score only depends on the current hidden state. We estimate one trend parameter for the study population, independent of the time $t$ or individual $i$. This implies that we assume that the average progression in the study population is linear. Given the observed progression of the subscales of the MDS-UPDRS as displayed in Figure 2 and in the Supplementary Materials, this appears to be a reasonable assumption. Individual progression is not necessarily linear, because $w_{t,i}$ allows for individual variation around the population average trend. The specification of the model is completed by defining the initial state $\theta_{0,i}$, which is assumed to be normally distributed with mean $m_0$ and variance $C_0$. These parameters can be estimated without observations at this point using the distribution of the scores at the first time point and the assumption of a constant average progression of the study population. See Figure 3 for a graphical representation of the model used in this study.
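To make the behavior of this model concrete, the following minimal sketch simulates scores from the observation and state equations and computes the correlation between two subsequent change scores. It is an illustration under stated assumptions, not the authors' code: one model step is taken to be one annual visit, and all parameter values (trend, standard deviations, initial-state distribution, number of simulated patients) are hypothetical.

```python
import math
import random


def simulate_scores(n_visits, trend, sd_true_change, sd_error, theta0, rng):
    """Return observed scores for one simulated patient (one step per visit)."""
    theta = theta0
    observed = []
    for _ in range(n_visits):
        # State equation: the true state moves by a Gaussian step with mean `trend`.
        theta += rng.gauss(trend, sd_true_change)
        # Observation equation: the score is the true state plus Gaussian noise.
        observed.append(theta + rng.gauss(0.0, sd_error))
    return observed


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2019)
first_change, second_change = [], []
for _ in range(5000):  # hypothetical number of simulated patients
    theta0 = rng.gauss(20.0, 8.0)  # hypothetical initial-state distribution
    y1, y2, y3 = simulate_scores(
        3, trend=2.0, sd_true_change=3.0, sd_error=3.0, theta0=theta0, rng=rng
    )
    first_change.append(y2 - y1)
    second_change.append(y3 - y2)

rho = pearson(first_change, second_change)
print(f"correlation between subsequent change scores: {rho:.2f}")
```

Because the two change scores share the middle measurement with opposite signs, the simulated correlation comes out negative even though the true progression steps are independent.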

Estimation and Validation of Model Parameters
The model parameters were estimated using maximum likelihood estimation, implemented in R using the dlm package dedicated to linear state space models. 12 For replication purposes, we should note that there is no direct option in the dlm package to add a trend. To achieve this, we introduced a second variable to the hidden states that is set to a constant. So, in our implementation, $\theta_{t,i}$ is a vector of length 2, with the first element representing the disease progression state and the second element a constant 1. By properly setting the transition matrix $G$ and the variance of $w_{t,i}$, we can keep this variable constant while having the new hidden state depend on the trend (which is a value in $G$): $\theta_{t,i} = G\theta_{t-1,i} + w_{t,i}$. Confidence intervals were calculated using the simple percentile bootstrap method with 1000 repeats: we randomly sampled patients with replacement to construct bootstrap samples and used the 2.5% and 97.5% percentiles of the estimates to obtain 95% confidence intervals. The appropriateness of the model to describe the data was assessed by evaluating the distribution of residuals, that is, the difference between the estimated true progression states and the observed values and the difference between the predicted next state and its measurement.
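The augmented-state construction can be sketched as follows. The exact matrices used by the authors are not given in the text, so the arrangement below (trend in the upper-right entry of $G$, zero noise on the constant element) is one assumption consistent with the description. The sketch is written in plain Python rather than R/dlm and only illustrates a single state transition.

```python
def transition(state, trend, true_change_deviation):
    """Apply theta_t = G * theta_{t-1} + w_t to the augmented two-element state.

    state[0] is the disease progression state, state[1] is the constant 1.
    """
    # Assumed layout: the trend sits in the upper-right entry of G, so it is
    # multiplied by the constant 1 and added to the progression state.
    G = [[1.0, trend],
         [0.0, 1.0]]
    # Noise enters the first element only; the constant has zero variance.
    w = [true_change_deviation, 0.0]
    return [
        G[0][0] * state[0] + G[0][1] * state[1] + w[0],
        G[1][0] * state[0] + G[1][1] * state[1] + w[1],
    ]


state = [20.0, 1.0]  # hypothetical starting score, plus the constant 1
state = transition(state, trend=2.0, true_change_deviation=0.5)
print(state)  # [22.5, 1.0]
```

In this arrangement the noise term has mean zero and the trend is carried by $G$, which is equivalent to a scalar state whose step has mean equal to the trend.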
Based on the estimated parameters, the within-subject reliability $\hat{r}_{\Delta\Delta}$ of 1-year change scores was calculated as follows:

$$\hat{r}_{\Delta\Delta} = \frac{\hat{\sigma}^2_{\Delta T}}{\hat{\sigma}^2_{\Delta X}} = \frac{\hat{\sigma}^2_{\Delta T}}{\hat{\sigma}^2_{\Delta T} + 2\hat{\sigma}^2_E}$$

Here, $\hat{\sigma}^2_{\Delta T}$ denotes the estimate of the true variance of change scores, which is divided by the total variance of 1-year change scores $\hat{\sigma}^2_{\Delta X}$, consisting of the true variance plus 2 times the variance produced by measurement error $\hat{\sigma}^2_E$. In addition, the effect of the length of the levodopa washout period on part III OFF measurements was assessed by comparing the estimates for two different thresholds. Because the median time after DT (for participants receiving DT) was approximately 14 hours, we compared the original threshold (>6 hours postdose) with the threshold >14 hours postdose. To maximize the power of the comparison, we first generated the bootstrap sample before applying the two different thresholds.
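The factor 2 in the denominator, and the negative correlation seen in Fig. 1, both follow from the observation and state equations. The short derivation below is a sketch under two assumptions made here for simplicity: one model step corresponds to one year, and the noise terms $v_{t,i}$ and $w_{t,i}$ are mutually independent. The symbol $\Delta X_{t,i}$ for the observed change score is introduced for this derivation.

```latex
\begin{align*}
\Delta X_{t,i} &= y_{t,i} - y_{t-1,i} = w_{t,i} + v_{t,i} - v_{t-1,i}, \\
\operatorname{Var}(\Delta X_{t,i}) &= \sigma^2_{\Delta T} + 2\sigma^2_E, \\
\operatorname{Cov}(\Delta X_{t,i}, \Delta X_{t+1,i}) &= \operatorname{Cov}(v_{t,i}, -v_{t,i}) = -\sigma^2_E, \\
\operatorname{Corr}(\Delta X_{t,i}, \Delta X_{t+1,i}) &= \frac{-\sigma^2_E}{\sigma^2_{\Delta T} + 2\sigma^2_E} = -\frac{1 - r_{\Delta\Delta}}{2}.
\end{align*}
```

Under these assumptions, the lower the within-subject reliability, the closer the correlation between two subsequent change scores lies to $-0.5$; a perfectly reliable instrument would show no such correlation.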

Results
A total of 423 PD subjects were included (277 men and 146 women). At baseline, the average age was 61.7 years, and the average time since diagnosis was 7 months (see Table 1 for all baseline characteristics). The maximum follow-up period was 60 months. In Figure 2 (left), the number of included assessments can be seen for each visit and for each part of the MDS-UPDRS. The attrition in the number of included parts I and II assessments was almost completely because of loss to follow-up. The main reason why fewer part III OFF than parts I and II assessments were included for the annual visits is that not all part III OFF assessments fulfilled the >6 hours postdose criterion. For each part of the MDS-UPDRS, the progression at the group level and an illustration of individual progression patterns is displayed in Figure 2 (middle and right).

MDS-UPDRS Progression Model
The estimates of the parameters $\sigma^2_E$, $\sigma^2_{\Delta T}$, trend, and within-subject reliability, including their 95% confidence intervals, are presented in Table 2. Within-subject reliability varied from 0.13 to 0.62 for the different subscales and factors. Of all subscales, parts II and III (OFF) demonstrated the highest within-subject reliability. Factors 3.3 (postural instability and gait difficulty), 2.1 (mobility), and 3.4 (rest tremor) were the most reliable factor scores.
Regarding the part III subscale (OFF), $\hat{\sigma}^2_E$ was significantly lower (difference of -1.75; 95% CI: -2.81 to -0.84), and the within-subject reliability was significantly higher (difference of 0.04; 95% CI: 0.01-0.07) when applying the >14-hour threshold in comparison with the >6-hour threshold (P < 0.001 based on bootstrap procedure). We should note that because the time since last medication intake was not randomized in this data set, the possibility of confounding should be considered (see Supplementary Materials).
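The paired structure of this comparison (both thresholds applied to the same bootstrap sample of patients, followed by a percentile interval for the difference) can be sketched as below. This is an illustration only: the data are synthetic, the function names are invented here, and the estimator is a hypothetical stand-in (a plain variance of scores) rather than the maximum likelihood fit of the state space model.

```python
import random
from statistics import pvariance


def percentile_interval(estimates, lower=0.025, upper=0.975):
    """Simple percentile interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]


def paired_bootstrap_difference(patients, estimate_a, estimate_b, n_boot=1000, seed=0):
    """Percentile CI for estimate_b - estimate_a, resampling whole patients."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_boot):
        # Draw patients with replacement, then apply both analyses to the
        # same bootstrap sample so that the comparison is paired.
        sample = [rng.choice(patients) for _ in patients]
        differences.append(estimate_b(sample) - estimate_a(sample))
    return percentile_interval(differences)


def score_variance_above(threshold_hours):
    """Hypothetical stand-in estimator: variance of scores taken after the threshold."""
    def estimate(sample):
        scores = [score for patient in sample
                  for hours, score in patient if hours > threshold_hours]
        return pvariance(scores)
    return estimate


# Synthetic patients: each is a list of (hours since last dose, score) pairs,
# with noisier scores for shorter washout (an assumption of this toy example).
data_rng = random.Random(1)
patients = []
for _ in range(200):
    visits = []
    for _ in range(4):
        hours = data_rng.uniform(6.5, 24.0)
        noise_sd = 5.0 if hours <= 14.0 else 3.0
        visits.append((hours, 30.0 + data_rng.gauss(0.0, noise_sd)))
    patients.append(visits)

low, high = paired_bootstrap_difference(
    patients, score_variance_above(6.0), score_variance_above(14.0), n_boot=200
)
print(f"95% CI for the difference: {low:.2f} to {high:.2f}")
```

Resampling patients rather than individual visits keeps each patient's time series intact, and reusing the same sample for both thresholds removes between-sample variation from the difference.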

Model Performance
The QQ plots of parts I and II displaying the model residuals showed heavier tails than the normal distribution, which could be a consequence of $v_{t,i}$ and/or $w_{t,i}$ having a distribution with heavier tails than the normal distribution (see Supplementary Materials). There were no indications that the data set contained any outliers that might have been caused by erroneous data entry (eg, no values were out of range of feasible MDS-UPDRS scores), so all data were retained. The distribution of residuals from part III more closely resembled the normal distribution. The inverse correlation between 2 subsequent 1-year change scores that was presented earlier was similar in data generated based on the model and estimated parameters (see Supplementary Materials).

Discussion
Our primary aim was to assess the reliability of the MDS-UPDRS as a tool to measure individual disease progression over time in a population of early-stage PD patients. A linear Gaussian state space model was applied to a large observational data set to capture the longitudinal behavior of the different subscales and factors. The selected model is closely related to classical test theory, which was the theoretical foundation for previous estimates of the (test-retest) reliability of the MDS-UPDRS. 9 The novelty of applying this model to time series of MDS-UPDRS measurements lies in its ability to provide estimates for both the variance introduced by "noise" (ie, measurement error and short-term effects) and the variance introduced by long-lasting differences in individual progression rates. Expanding on previous validation studies, which presented a high between-subject reliability for all parts of the MDS-UPDRS, 8 we now demonstrate that the within-subject reliability of all parts is noticeably lower. Parts II and III (OFF) demonstrate a favorable within-subject reliability compared with parts I and III (ON). The factors related to mobility and tremor demonstrated a relatively consistent behavior and may be, in terms of their reliability, most suitable to monitor individual disease progression in early PD. Last, our findings underscore the importance of considering the symptomatic effects of levodopa when using part III to monitor changes over time, as OFF assessments >6 hours postdose were more consistent than ON assessments, and a longer washout (>14 hours) may further increase the assessment's reliability.
Like any model, ours is a simplification of reality and does not aim to explain the complete behavior of the MDS-UPDRS, but rather attempts to capture aspects relevant to the question at hand. Still, it is important to explore the assumptions underlying the model and their effect on the relevance of the results. First, some deviations from normality were observed, mostly visible in parts I and II. Although using a distribution with heavier tails might have produced a more accurate fit to the data, it also would have increased the model's complexity, affecting both the complexity of the parameter estimation algorithm and the interpretation of its results. 13 Given that data sampled from our model and real data display a similar negative correlation between two subsequent change scores, we believe the model is an appropriate choice to capture this remarkable behavior of the MDS-UPDRS. Second, it was assumed that both $\sigma^2_E$ and $\sigma^2_{\Delta T}$ would remain constant during the course of the disease, which is a simplification; it is plausible that, for example, the error variance ($\sigma^2_E$) of part III is larger in populations with a longer disease duration and more severe and unpredictable motor fluctuations. Also, some nonmotor symptoms start to develop later in the disease, which would result in a higher true progression variance ($\sigma^2_{\Delta T}$) of part I in this population. Although subjects with de novo PD are often the population of interest in cohort studies on disease progression, the generalizability to other disease stages remains to be evaluated. Given the homogeneity of the PPMI cohort in terms of disease duration at study start and the maximum follow-up of only 5 years, we believe it was reasonable to estimate one set of variance components that describes the behavior of the MDS-UPDRS in early PD. This is supported by the observation that the mean and variance of yearly changes did not show any obvious changes over time during the follow-up period (see Supplementary Materials). Third, the interpretation of the model parameter $\sigma^2_{\Delta T}$ deserves some nuance. Although $\sigma^2_{\Delta T}$ was referred to as variance in true progression, a more accurate description would be variance in long-term changes in PD symptomatology. Both the underlying disease progression and long-term effects of symptomatic treatment (for example, the effect of treatment with antidepressants on part I scores or the gradually increasing dose of dopaminergic medication on parts II and III scores) may contribute to this parameter. Future work may aim to disentangle the contributions from both factors. Last, results presented here are based on one cohort. Although statistical procedures were used to estimate the confidence intervals of the estimates, the results should be validated on independent cohorts.
An important advantage of our approach is that the estimates are based on the actual results from a large multicenter cohort study with all its logistical challenges, in contrast with the highly standardized conditions in which most clinimetric validation studies take place (eg, a small number of participating study centers, a small number of assessors and a short time interval between test and retest, so short-term effects are more likely to be similar during both). Therefore, our results are more likely to reflect the real-world behavior of the MDS-UPDRS when used in this type of study. The presented within-subject reliabilities can be interpreted as the proportion of variance in MDS-UPDRS change scores that originates from long-lasting changes in PD symptomatology and are therefore directly relevant to any cohort study aiming to build predictive models for disease progression. Indeed, the identified behavior of the MDS-UPDRS may well be an explanation for the results of Latourelle et al, who achieved a higher explained variance (Pearson $R^2$) when modeling part III changes in untreated subjects compared with part III (ON) changes in subjects receiving DT. 14 Because error variance is a substantial proportion of variance in all MDS-UPDRS change scores, the explained variance that can be maximally achieved is limited, and researchers should be aware of the risk of overfitting complex predictive models. It should be noted that it is possible, for example in the case of performing regression on the change scores, to reach a higher explained variance than the presented within-subject reliabilities by including the baseline score or the previous change score in the model (such as shown in Fig. 1). However, what happens here is that a part of the error variance ($\sigma^2_E$) is explained in the model, which does not provide any knowledge about the actual disease progression ($\sigma^2_{\Delta T}$).
The results should also be taken into account when applying (parts of) the MDS-UPDRS for individual follow-up in clinical practice, because the literature suggests that changes smaller than the measurement error ($\sigma_E$), of which we provide estimates, are unlikely to be clinically meaningful. 15 Although the developers of the MDS-UPDRS already recommended analyzing the subscales separately instead of analyzing one composite MDS-UPDRS score, this study allows for a more informed selection of factors and subscales based on their reliability. It deserves some attention that the factor scores related to gait/mobility and rest tremor outperformed the factor scores related to bradykinesia, rigidity, kinetic/postural tremor, and nonmotor symptoms. Initial studies suggested that rest tremor severity measured by the UPDRS did not correlate with disease duration and was therefore not a good marker for progression. 16,17 However, these studies were all performed in populations with a baseline disease duration of 4-9 years, and later evidence suggested that resting tremor severity does worsen in the early stages of the disease. 18 Our findings support this and show that, compared with other items (eg, bradykinesia and rigidity), the items related to resting tremor are a relatively reliable way to measure the variation in within-subject changes of symptom severity in early PD. Although detailed studies on the progression of gait and balance impairments in early PD are rare, Galna et al showed that in this population a deterioration in gait impairment can be observed within 18 months of follow-up. 19 In addition, there are indications that early changes in gait/mobility are measurable using the Timed Up & Go test. 20 We also observed significant progression in gait/posture-related items of the MDS-UPDRS in early PD and demonstrated a relatively consistent behavior of these items over time. The latter may be partially explained by the contribution from nondopaminergic pathology, which renders these items less sensitive to the short-term effects of DT. 18

In conclusion, our results support the search for more reliable instruments to monitor individual changes in PD symptomatology. In this light, wearable sensors have the potential to overcome some limitations of the MDS-UPDRS by collecting rater-independent and continuous data in the patient's own natural environment. 21 The finding that gait/mobility and tremor-related items demonstrate the highest within-subject reliability highlights the potential of sensor-based outcome measures in these domains. 22-25 Hopefully, the combination of optimally using current clinical rating scales and the development of new reliable instruments will lead to a better understanding of the large heterogeneity in PD progression and pave the way for reliable measures of new disease-modifying treatments.