Predicting motor, cognitive & functional impairment in Parkinson's

Abstract Objective We recently demonstrated that 998 features derived from a simple 7‐minute smartphone test could distinguish between controls, people with Parkinson's and people with idiopathic Rapid Eye Movement sleep behavior disorder, with mean sensitivity/specificity values of 84.6‐91.9%. Here, we investigate whether the same smartphone features can be used to predict future clinically relevant outcomes in early Parkinson's. Methods A total of 237 participants with Parkinson's (mean (SD) disease duration 3.5 (2.2) years) in the Oxford Discovery cohort performed smartphone tests in clinic and at home. Each test assessed voice, balance, gait, reaction time, dexterity, rest, and postural tremor. In addition, standard motor, cognitive and functional assessments and questionnaires were administered in clinic. Machine learning algorithms were trained to predict the onset of clinical outcomes provided at the next 18‐month follow‐up visit using baseline smartphone recordings alone. The accuracy of model predictions was assessed using 10‐fold and subject‐wise cross validation schemes. Results Baseline smartphone tests predicted the new onset of falls, freezing, postural instability, cognitive impairment, and functional impairment at 18 months. For all outcome predictions AUC values were greater than 0.90 for 10‐fold cross validation using all smartphone features. Using only the 30 most salient features, AUC values greater than 0.75 were obtained. Interpretation We demonstrate the ability to predict key future clinical outcomes using a simple smartphone test. This work has the potential to introduce individualized predictions to routine care, helping to target interventions to those most likely to benefit, with the aim of improving their outcome.


Introduction
Significant heterogeneity in Parkinson's influences clinical presentation, progression, medication response, and disease complication risk. The Oxford Discovery and Tracking Parkinson's cohorts provide around 2500 community-ascertained patients, prospectively followed from early diagnosis, in whom these phenotypic variations can be studied. 1 We used data-driven approaches to identify fast and slow motor progressor subtypes, with differences akin to the minimal clinically important 3-point difference on the Movement Disorders Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS) that neuroprotective treatment trials are often powered to detect. [1][2][3] While such differences are observed in cohort studies, individualized predictions remain challenging.
Disability in Parkinson's is mainly determined by the onset of postural instability, falls, freezing of gait, and dementia. 4,5 The time to reach these disease milestones varies considerably, leading to increased outcome variation and requiring larger sample sizes to demonstrate potential treatment effects. 6 A number of models have aimed to predict these clinically relevant outcomes. A 3-step falls prediction tool by Paul et al. attached the greatest weighting to whether individuals reported falling at baseline, yet limited numbers prevented the prediction of the onset of falls in those without falls at baseline, an outcome of greater interest to the treating clinician. 7,8 Ehgoetz et al. recently reported a logistic regression model utilizing the Hospital Anxiety and Depression Scale (HADS) and the Freezing of Gait (FOG) questionnaire total score to predict the onset of future freezing, which requires external validation. 5 Velseboer et al. described a similar model utilizing age, the number of animals named in a verbal fluency task and the UPDRS axial score to predict a composite adverse outcome, but it was not possible to distinguish between death, dementia or postural instability. 9 Models to date have relied on combinations of different clinical questionnaires and assessments, requiring time and skill to administer, in order to make specific predictions; to the best of our knowledge no single test has been able to predict multiple future clinical outcomes.
A multi-device study is being planned that uses smartphones to capture questionnaire data and to store tremor data recorded by smartwatches, alongside tablet-based assessments, with the aim of differentiating Parkinson's from Essential Tremor. 10 So far, however, studies using smartphones alone have focused on equipping the clinician: distinguishing individuals with and without Parkinson's 11,12 and deriving smartphone scales with which to measure disease severity. 13,14 Our aim was to use smartphone data to predict outcomes of direct clinical relevance to people with Parkinson's and clinicians alike.

Methods
The Oxford Parkinson's Disease (PD) Centre Discovery study 15 is a longitudinal cohort study that recruits participants with early Parkinson's who fulfil the United Kingdom PD Brain Bank criteria for probable PD. 16 Continued participation depends upon participants being ascribed a probability of Parkinson's of at least 90% by trained researchers at their latest clinic assessment. 17 Antecedent approval was granted by the local research ethics committee, adhering to national legislation and the Declaration of Helsinki. Participants provide written informed consent to participate.
Since 2014, participants have been invited to perform smartphone assessments. We analysed smartphone tests performed at 18-monthly clinic visits and up to four times a day for a maximum of seven days at home, within a 3-month period of their clinic visit (see Fig. 1 for device details). 18 Smartphone tests assess: (1) Voice (participants hold the phone to their ear, take a deep breath, and say "aaah" at a comfortable and steady tone and level, for as long as possible); (2) Balance and (3) Gait (with the phone in a trouser pocket or arm band, participants stand still and then walk a distance of 20 yards before turning and walking back); (4) Dexterity (participants tap alternately between two buttons on the screen at a comfortable rate); (5) Non-cued reaction time (participants press on a button as it appears on the screen, keeping their finger down whilst it is there and lifting their finger off as it disappears); (6) Rest and (7) Postural tremor (participants hold the smartphone in the hand most affected by tremor if they have tremor, or their dominant hand if they do not, while their hand is at rest or held outstretched in front of them).
The seven smartphone tasks take 6-7 min in total to perform. All seven tasks constitute one smartphone recording. Incomplete recordings, where all seven tasks were not performed within a 15-min time period, were excluded from analysis.
Clinical data collected at in-person 18-monthly longitudinal clinic visits were matched to smartphone recordings performed in clinic and at home within 3 months of the clinic visit; henceforth referred to as a time window. Smartphone recordings contributed at different time windows, related to different longitudinal clinic visits, were treated independently for the purpose of analysis. Smartphone recordings analyzed were collected between 8 August 2014 and 7 November 2017.
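The matching of recordings to clinic visits can be illustrated with a short Python sketch. It is not the study's own code (the analysis was performed in Matlab); the function name, the example dates, and the approximation of "3 months" as 91 days are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "within 3 months" is approximated as +/- 91 days.
WINDOW = timedelta(days=91)

def assign_time_window(recording_date, visit_dates, window=WINDOW):
    """Return the clinic visit closest to the recording if it lies within
    the window, otherwise None (the recording is left unmatched)."""
    if not visit_dates:
        return None
    nearest = min(visit_dates, key=lambda v: abs(v - recording_date))
    return nearest if abs(nearest - recording_date) <= window else None

# Hypothetical 18-monthly clinic visits and smartphone recording dates.
visits = [date(2015, 1, 10), date(2016, 7, 12)]
recordings = [date(2015, 2, 1), date(2015, 9, 1), date(2016, 6, 20)]

# Map each recording to its time window (or None if it falls outside every window).
matched = {r: assign_time_window(r, visits) for r in recordings}
```

Recordings matched to different visits belong to different time windows and, as described above, would be treated independently in the analysis.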
Future outcomes were defined according to the results of clinical assessments and questionnaires performed at the next 18-month clinic visit as detailed in (Table 1) and included the new onset of (1) falls (>1 self-reported fall in the preceding 6 months), (2) freezing (a freezing frequency of at least "about once a month" on the FOG questionnaire 19 23 part II item 2.8 score ≥ 3 indicating major difficulty or an inability to do activities of enjoyment) and (6) the self-reported need for future help at home. Individuals who had already reached a given milestone at the time of their smartphone recording were excluded from the respective analysis. Similarly, smartphone recordings missing the clinical data necessary for the assessment of the outcome definition or meeting exclusion criteria were excluded. All assessments were done on existing medication.
Altogether, 998 statistical features were extracted from each smartphone recording as previously described. 24 Such features include variation in speech frequency and amplitude, degree of turbulent noise in speech due to incomplete vocal fold closure, tapping speed and rhythm, mean reaction time, changes in body motion (such as detrended fluctuation analysis) and association between the tri-axial accelerometer sensor data (such as mutual information). Statistical analyses were performed using Matlab® software (R2018a; Mathworks®, USA).
Smartphone features were used to train machine learning algorithms (random forests) to predict each binary clinical outcome of interest. Random forests is an ensemble learning technique that has commonly been used for a range of classification and regression applications. 25 We chose random forests as they are relatively robust to outliers and noisy features. Moreover, in previous smartphone-based studies, random forests have been shown to be useful in distinguishing participants with RBD and Parkinson's from controls, 18 and in predicting MDS-UPDRS part III motor examination scores. 12 To try and avoid model overfitting, we used cross validation (CV) which involves splitting the dataset into non-overlapping training and test sets. The training data was used for learning the underlying patterns of interest; the test dataset for validating the model prediction accuracy. CV helps assess the generalizability of the model to previously unseen datasets. To test the accuracy of predictions, we used two common methods of CV: (1) 10-fold CV, and (2) leave one subject out (LOSO) CV.
Figure 1. Smartphone models. In the search for a scalable solution to the quantification of motor symptoms in Parkinson's, an Android based smartphone app was installed on a range of consumer grade smartphones that were used in clinic and provided to participants to take home. Participants also had the option of being sent a link to download the app onto their own Android smartphone. A specialized smartphone app was used to collect the raw accelerometer, microphone and screen data and was run alongside KitKat, Lollipop, Marshmallow, Nougat, and Oreo Android operating systems. The raw data from the app was encrypted, time-stamped, and uploaded to a secure online server. The processing and analysis of the data was performed separately using computer-based Matlab® software (R2018a; Mathworks®, USA). "Others" include: Samsung Galaxy Ace 4 SM-G357FZ, Samsung Galaxy Ace 2 GT-I8160, Samsung Galaxy S3 Mini GT-I8200N, LG Optimus 3G CX670, Samsung Galaxy S III mini I8190, Samsung Galaxy J5 J500FN, Sony Xperia L C2105, Moto G LTE XT1039, Huawei Ascend G510, Samsung Galaxy S4 I9505.
Machine learning is also sensitive to the proportion of data with and without the clinical outcome of interest that is used to train the model. If there is an imbalance in outcome between groups, for example, where significantly more recordings from people without falls are used to train a falls prediction model, high prediction accuracies may simply reflect the prevalence of the outcome of interest in the dataset. Data were therefore balanced in a 1:1 ratio prior to the training of algorithms for both CV schemes by randomly under-sampling the majority class.
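Balancing in a 1:1 ratio by random under-sampling of the majority class can be sketched as follows. This is a minimal Python illustration rather than the study's Matlab implementation; the function name and the toy labels are hypothetical.

```python
import random

def undersample_balance(labels, rng):
    """Return sorted indices of a 1:1 balanced subset, obtained by randomly
    under-sampling whichever class is the majority."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return sorted(minority + kept)

# Toy example: 5 recordings with the outcome and 20 without.
labels = [1] * 5 + [0] * 20
idx = undersample_balance(labels, random.Random(0))
balanced = [labels[i] for i in idx]
```

Repeating the draw with a different random seed gives a different balanced subset, which is one way the repeated cross validation runs described in this section could each use a freshly balanced dataset.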
For 10-fold CV, to balance the dataset, an equal randomly selected sample of recordings from participants without the outcome of interest was merged with recordings with the outcome of interest, assuming independence between recordings and ignoring within person correlations. The recordings were randomized and partitioned into 10 separate folds. In turn, each fold was set aside to form the test set, on which the accuracy of the model, trained using the remaining mutually exclusive 9 folds, was tested. Ten repetitions of 10-fold CV, each with randomly selected balanced recordings, were performed. Similarly, datasets were randomly balanced by outcome for LOSO CV, where only the single clinic recording performed by participants at their clinic visit was analysed; all home recordings were excluded. Home recordings were excluded from LOSO CV in order to allow for a direct comparison to be made with existing logistic regression models (detailed below); a single set of clinical data was matched to each smartphone recording performed in clinic for LOSO CV, with the clinical data also being fed into existing logistic regression models. 5,9 LOSO CV involves each recording in turn forming the test set, with the rest being used to train the model. Ten repetitions of LOSO CV, each with randomly selected balanced datasets, were performed.
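The two partitioning schemes can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, and the leave-one-out helper relies on the fact that, with a single clinic recording per participant, leaving out one recording is the same as leaving out one subject.

```python
import random

def k_fold_splits(n_items, k, rng):
    """Shuffle indices 0..n_items-1 and yield (train, test) index lists,
    with each of the k folds serving as the test set exactly once."""
    order = list(range(n_items))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def leave_one_out_splits(n_items):
    """Yield (train, test) index lists in which each item is held out once."""
    for i in range(n_items):
        yield [j for j in range(n_items) if j != i], [i]

rng = random.Random(0)
# Ten repetitions of 10-fold CV over 50 hypothetical balanced recordings.
repeats = [list(k_fold_splits(50, 10, rng)) for _ in range(10)]
# Leave-one-out over 8 hypothetical participants (one clinic recording each).
loso = list(leave_one_out_splits(8))
```

Because the 10-fold scheme partitions recordings rather than participants, recordings from the same person can appear in both the training and test folds, which is the independence assumption noted in the text.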
Following both CV schemes (see Table S1 for description of modelling algorithm), the predicted probability of each recording belonging to either group was compared to the known future outcome. To summarize the accuracy of predictions we calculated area under the curve (AUC) values.
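For readers unfamiliar with the metric, the AUC can be computed directly from predicted probabilities and known outcomes as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. The Python sketch below uses this pairwise definition; the function name and the toy values are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive has the higher score,
    with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities and known future outcomes.
probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
truth = [1, 1, 1, 0, 0, 0]
value = auc(probs, truth)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.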
A level of redundancy is likely among the 998 smartphone features, with some being more important for the prediction of certain outcomes than others. Feature ranking was obtained using random forests (employing only the training data that was balanced at each LOSO CV iteration). The assigned predictor importance was averaged across all trained models for a given outcome prediction, to allow the ranking of features in order of importance. We then trained models using the top 30 smartphone features.
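The averaging and ranking step can be sketched as follows. The importance values are made up for illustration; in the study they would come from the random forests trained at each CV iteration, which this sketch does not reproduce.

```python
def rank_features(importances_per_model, top_k):
    """Average per-feature importance across trained models and return the
    indices of the top_k features (most important first) with the means."""
    n_models = len(importances_per_model)
    n_features = len(importances_per_model[0])
    mean_imp = [sum(m[j] for m in importances_per_model) / n_models
                for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: mean_imp[j], reverse=True)
    return order[:top_k], mean_imp

# Hypothetical importances from three trained models over five features.
per_model = [
    [0.10, 0.40, 0.05, 0.30, 0.15],
    [0.20, 0.35, 0.05, 0.25, 0.15],
    [0.15, 0.45, 0.10, 0.20, 0.10],
]
top, mean_imp = rank_features(per_model, top_k=2)
```

With 998 features and `top_k=30`, the same function would return the reduced feature set on which the smaller models are trained.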
In order to gauge the validity of our machine learning approach and the strength of our smartphone features, we applied two existing logistic regression prediction models to our clinical data and assessed the accuracy of their predictions of (1) future freezing 5 and (2) a composite future adverse outcome of dementia, postural instability, or death 9 against those made using smartphone features, random forests and LOSO CV. Following personal communication with an author of the paper, direct MDS-UPDRS values for items 3.9, 3.13, 3.10, and 3.12 were substituted for UPDRS items 27-30 to form the axial score. Future freezing and postural instability were defined as before (Table 1).

Results
A total of 1842 recordings performed by 237 participants at 268 time windows were included in the analysis of one or more outcomes (Fig. 2).

Clinical characteristics
Baseline clinical characteristics of participants at the time of their smartphone recordings are presented in Table 2. Participants who developed falls, postural instability, and cognitive impairment were significantly older than those who did not. Participants who developed falls and freezing had a longer disease duration from diagnosis and higher motor examination scores on the MDS-UPDRS part III. Those who developed postural instability and difficulty doing hobbies had lower MoCA scores and higher MDS-UPDRS part III scores compared to those who did not develop such future difficulty.

Model validation
It is likely that, of the 998 smartphone-derived features, some are collinear (redundant) or associated with noise. Different sets of feature rankings were thus obtained separately for each of the outcomes of interest. Although multicollinearity can influence model inference, it does not reduce the overall predictive accuracy of the model. The results are displayed in Fig. 3 and Table 3. Using all 998 features, AUC values of greater than 0.90 were achieved for all six outcomes. Using only the top 30 smartphone features, those with the highest predictive power, AUC values were reduced, though they remained greater than 0.75 for both 10-fold CV and LOSO CV. The effect of increasing the number of features on prediction accuracy (for 10-fold CV and LOSO CV) is shown in Figure S1.

Comparison with other models
Logistic regression models described by Ehgoetz et al. and Velseboer et al. were applied to the clinical data from participants who had contributed smartphone recordings and the accuracy of their predictions assessed. 5,9 As a comparator, smartphone recordings from the same participants were used to train machine learning algorithms (random forests) which were validated using LOSO CV (Table 4). For the prediction of future freezing, the LOSO CV model using 30 smartphone features outperforms the logistic regression model. Although to a lesser degree, the same is true when using two smartphone features, equivalent to the number of clinical variables in the logistic regression model. Seven participants who contributed smartphone recordings went on to develop dementia (n = 2) and/or postural instability (n = 6) while at the same time having the baseline clinical variables necessary for the Velseboer et al. model. None of the participants who contributed smartphone recordings had died by their next clinic visit. For the prediction of the composite adverse outcome of dementia, postural instability, or death 9 the AUC for the logistic regression model (0.81) was slightly higher than that garnered through a machine learning approach using the top 30 smartphone features (0.76). When creating models, the lack of any formal CV can result in over-fitting. Indeed, when a similar approach is adopted to Ehgoetz et al., that is, when models are trained and tested on the same data set, without any splitting into independent train and test sets, inflated AUC values of 1 for both the prediction of freezing and the composite adverse outcome are obtained using a machine learning approach.
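The inflation that arises when a model is trained and tested on the same data can be reproduced with a toy Python example. To keep the sketch short and dependency-free, a 1-nearest-neighbour classifier is used as a stand-in for a flexible learner such as a random forest; this substitution, the pure-noise features, and all names are assumptions for illustration and do not reproduce the study's models or data.

```python
import random

def auc(scores, labels):
    """Pairwise AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nearest_label(x, train_x, train_y):
    """1-nearest-neighbour prediction (stand-in for a flexible model)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
    return train_y[dists.index(min(dists))]

rng = random.Random(0)
n, d = 40, 5
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # pure-noise features
y = [i % 2 for i in range(n)]                                 # balanced labels

# Train and test on the same data: each point is its own nearest neighbour,
# so every label is recovered even though the features carry no signal.
resub_auc = auc([nearest_label(x, X, y) for x in X], y)

# Leave-one-out: the held-out point is removed before it is predicted.
loo_scores = []
for i in range(n):
    loo_scores.append(nearest_label(X[i], X[:i] + X[i + 1:], y[:i] + y[i + 1:]))
loo_auc = auc(loo_scores, y)
```

Here `resub_auc` equals 1 by construction, whereas `loo_auc` should sit far below it because the features are noise; the gap between the two is the over-fitting that cross validation is designed to expose.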

Discussion
Prediction of disease progression and its milestones is a challenge for clinicians and patients alike. We describe for the first time the innovative use of a 7-min smartphone test in isolation to predict future clinical and functional change in early Parkinson's. Using a simple smartphone test, we can predict the onset of disease milestones including falls, freezing, postural instability, and cognitive impairment 18 months before occurrence with high levels of accuracy. Predictions are made on an individual basis using data from smartphone tests alone, without the need for clinical examination scores or patient questionnaires. A thorough evaluation of our machine learning approach is performed using two CV schemes; (1) 10-fold CV and (2) LOSO CV as well as (3) (see Table 1). Performance is comparable with/significantly better than that achieved with existing prediction models which use combinations of different clinical questionnaires and assessments, requiring time and skill to administer (see Table 2). 5,9,19,27,28 These promising early results have the potential to provide people with Parkinson's and those partnering in their clinical care with greater information about their disease course, identifying those at risk of developing a future adverse outcome.
Table notes: 1 P-value determined using a two-sample t-test or chi squared test to compare those with and without the future outcome of interest. 2 Data were balanced as described in the methods section, prior to training of machine learning algorithms, treating time windows independently. 3 A MoCA score was not available for two participants within the group that did not develop falling, 4 two participants who did not develop freezing, 5 two participants who did not develop postural instability and 6 two participants who did not develop difficulty with hobbies in the future. The means/SDs for the aforementioned subgroups were calculated across participants for whom data was available.
In the age of limited clinical resources and an aging population, it is conceivable that such risk stratifying tools may assist in remote healthcare delivery, allowing people with well controlled Parkinson's to be seen in person less frequently while those flagged as being at risk of an adverse outcome could be offered closer clinical surveillance. The cost effectiveness of such an approach remains to be elucidated, but those at risk of falls or postural instability could be referred for a falls risk assessment to proactively identify falls risk factors and modify these where appropriate. 29 The earlier detection of those with freezing may also allow medication adjustments to be made or cognitive training to be considered, with potential improvements in quality of life. 30,31

Limitations and strengths
In predicting future falls, freezing, difficulty doing hobbies, and the need for future help, we are limited by our reliance on the accuracy of self-reported answers to subjective questionnaires, with potential recall bias. In recruiting participants diagnosed with Parkinson's in life, we are limited by the diagnostic accuracy of clinicians, itself subject to significant variability, with pooled rates of 73.8% in non-experts rising to 83.9% in experts having refined their diagnoses over time. 32 In screening for cognitive impairment, we chose a definition based on the transition from a total MoCA score ≥ 26, to a MoCA score < 26, 21 in the absence of formal caregiver interviews. A MoCA score < 26 has previously been found to be associated with a sensitivity of 86% and specificity of 72% in screening for individuals with deficits on neuropsychological testing in at least two domains. 33 Additionally, it allows classification of PD Mild Cognitive Impairment according to MDS Task Force Level I criteria, 34 which, although subject to some temporal fluctuation, has been shown to have prognostic value in identifying patients at risk of developing future PD dementia. 35 Although the intention was to detect individuals who may benefit from further cognitive evaluation, neuropsychological assessments were not performed in this study due to time constraints, but would be an interesting avenue of future investigation, potentially alongside the prospective evaluation of predictions in independent cohorts. Although the numbers we report are comparable with similar studies seeking to predict future clinical outcomes, our study includes relatively small numbers of participants who develop the outcomes of interest at their next clinic visit. 5,9 Although our results are encouraging, the LOSO CV results in particular are likely to improve as we continue to accrue data. The prediction accuracies reported in this study were obtained using only a single machine learning algorithm (random forests). As we accrue more data, future studies could explore the efficacy of using a range of different state-of-the-art classifiers that may require larger training samples.
Table note: Confidence intervals for the logistic regression model are calculated across single predictions for each set of clinical data whereas confidence intervals for LOSO CV are calculated using a bootstrapping approach; the two confidence intervals are therefore not directly comparable.
In comparing our smartphone approach with existing clinical prediction models, it was necessary to make some assumptions and extrapolations. The Velseboer et al. model developed by the CARPA study and validated in CamPaIGN was trained to predict a composite adverse outcome of death, dementia and postural instability after a 5-year period using baseline data. 9 As participants in the Discovery study do not perform Mini-Mental State Examination (MMSE) assessments after their third visit, we used a MoCA ≤ 20 with an IQCODE > 52 to define dementia, as opposed to the original cut off of an MMSE < 26. We excluded participants with a baseline MoCA < 26 from analysis. As none of the participants who performed smartphone assessments had died by their next clinic visit, we are unable to assess the accuracy of their model in predicting death. Given the adaptations made, it is encouraging that the AUC value associated with the application of the logistic regression model to our data, as well as that achieved using trained machine learning algorithms, is comparable with those published by the authors of the original study.
In contrast, we were not able to achieve an AUC comparable to the values reported by Ehgoetz et al. when we applied their logistic regression model to our clinical data. Although the individuals who went on to develop freezing at their follow up visit were of a similar number, age and sex ratio, in the Ehgoetz et al. study they were on much larger baseline levodopa equivalent doses and had significantly higher levels of baseline anxiety than those who did not develop freezing, whereas in our study such baseline differences were not observed.
Our aim has been to use smartphone devices to measure symptom severity in a way that is potentially scalable to large healthcare systems where resources are limited. To that end, we elected to collect data in both clinical and home environments without relying on cross-device reproducibility, using a wide range of off-the-shelf consumer grade smartphones (manufactured by major international brands) that were equipped with a tri-axial accelerometer (see Figure 1). Prioritising precision, the vast majority of wearable device studies have recorded data under highly controlled laboratory settings, using exactly the same (often costly) hardware and software across participants. This approach is likely to result in higher quality data, as confounding effects due to variations in the in-built accelerometer are minimized. We would therefore expect our use of different device settings to be associated with slightly worse classification accuracies than if the exact same smartphone model were used across the cohort. Crucially, however, our study is scalable as it does not depend on cross-device reproducibility, leading to greater confidence in our interpretation of the overall classification results.
To the best of our knowledge, this is the largest study of its kind to use consumer-grade smartphones to capture real-world data recordings from 237 people with Parkinson's, studied longitudinally over 18 months. The Oxford Discovery cohort comprises community-ascertained patients and as such should more faithfully recapitulate disease evolution encountered in clinical practice. This study demonstrates the tractability of simple smartphone recordings across large populations, to accurately predict future clinically relevant outcomes. With time, our goal would be to seek the registrations and approvals necessary to translate our findings into routine clinical practice, with the ultimate aim of improving the care of people with Parkinson's.

Supporting Information
Additional supporting information may be found online in the Supporting Information section at the end of the article. Figure S1. Change in prediction accuracy with increase in feature number for 10-fold and leave one subject out (LOSO) cross validation (CV). Table S1. Description of the modelling algorithm.