Estimating longitudinal depressive symptoms from smartphone data in a transdiagnostic cohort

Abstract Background Passive measures collected using smartphones have been suggested to represent efficient proxies for depression severity, but the performance of such measures across diagnoses has not been studied. Methods We enrolled a cohort of 45 individuals (11 with major depressive disorder, 11 with bipolar disorder, 11 with schizophrenia or schizoaffective disorder, and 12 individuals with no Axis I psychiatric disorder). During the 8-week study period, participants were evaluated biweekly with the rater-administered Montgomery–Åsberg Depression Rating Scale (MADRS), completed self-report PHQ-8 measures weekly on their smartphones, and consented to the collection of smartphone-based GPS and accelerometer data to characterize their behavior. We used linear mixed models to predict depression severity on the basis of phone-based PHQ-8 and passive measures. Results Among the 45 individuals, 38 (84%) completed the 8-week study. The average root-mean-squared error (RMSE) in predicting the MADRS score (scale 0–60) was 4.72 using passive data alone, 4.27 using self-report measures alone, and 4.30 using both. Conclusions While passive measures did not improve MADRS score prediction in our cross-disorder study, they may capture behavioral phenotypes that cannot be measured objectively, granularly, or over the long term via self-report.

of quantifying treatment outcomes); the time required for clinical evaluation has precluded widespread use in practice; and clinician ratings contain sources of variance that are unrelated to underlying clinical affect. These scales themselves have been criticized for measuring a narrow range of symptoms that may be overly weighted toward specific illness features, neglecting the multidimensional nature of psychopathology (Insel et al., 2010).
The ubiquity of smartphones presents an opportunity to measure different social, cognitive, and behavioral markers in naturalistic settings. As of February 2018, 95% of Americans own a cellphone of some kind, with 77% owning a smartphone, up from just 35% in 2011 (Mobile Fact Sheet, 2018). With 6.3 billion smartphone subscriptions expected globally by 2021 (Cerwall, 2016), this technology offers an unprecedented opportunity to objectively measure human behavior in naturalistic settings outside of research laboratories and clinics.
Smartphone-based digital phenotyping encompasses the collection of a range of different social and behavioral data, including but not limited to spatial trajectories (via GPS), physical mobility patterns (via accelerometer), social networks and communication dynamics (via call and text logs), and voice samples (via microphone) (Onnela & Rauch, 2016; Torous et al., 2015).
The performance of digital phenotyping has rarely been directly compared to clinical rating scales in trial-like settings, nor has it been examined in a transdiagnostic cohort. To address these gaps, we conducted an 8-week study among psychiatric outpatients with mood and psychotic disorders, as well as healthy controls. Our aim was to assess whether digital phenotyping can complement in-person psychiatric assessments of depressive symptoms, as measured by the Montgomery–Åsberg Depression Rating Scale (MADRS), in a clinical population. We sought to assess to what extent it might be possible to predict a future clinician-rated MADRS score from baseline MADRS assessments, surveys administered on the phone (here, the Patient Health Questionnaire), passively collected smartphone data (here, GPS and accelerometer), or a combination of these measures. More generally, we sought to quantify data completeness, a critical but commonly overlooked issue in digital phenotyping, by examining GPS, accelerometer, and phone survey data over the course of the 8-week study.

| Study design and cohort description
This study used a prospective cohort design and aimed to recruit equal-sized groups of outpatients with major depressive disorder (n = 11), bipolar I or II disorder (n = 11), schizophrenia or schizoaffective disorder (n = 11), and screened healthy controls with no Axis I psychiatric disorder (n = 12). Each participant's primary diagnosis was confirmed by the Structured Clinical Interview for DSM-IV (SCID), Modules A–D (Diagnostic and statistical manual of mental disorders: DSM-IV, 2000). Demographic features of the study cohort are shown in Table 1.
Participants were recruited from outpatient clinics of the Massachusetts General Hospital (Boston, MA) and via advertisements seeking healthy control participants between 2015 and 2018.
All participants signed written informed consent prior to participation. The study protocol was reviewed and approved by the Partners HealthCare Institutional Review Board (protocol #: 2015P000666).
Participants were compensated $50 after the initial baseline visit and an additional $100 upon completion of the study. If a participant withdrew from the study before completing the full 8 weeks, they were compensated $25 in addition to the initial $50. Participants received reimbursement for reasonable parking and travel expenses for each in-person study visit.
All participants were 18 years or older and owned a smartphone running an iOS or Android operating system and were judged likely able to comply with study procedures by the site investigator's estimation. Participants installed the Beiwe application at the baseline visit and provided demographic information. Participants then returned for four follow-up visits over the course of the 8 weeks (for a total of five in-person visits, scheduled approximately every two weeks).

| Longitudinal assessments
At baseline and each follow-up visit, trained raters (AMP and KLH) certified and supervised by psychiatric clinical trialists (HEB and RHP) administered the MADRS. The overall MADRS score ranges from 0 to 60, and the following conventional cutoffs were applied: 0–6 (not depressed), 7–19 (mild depression), 20–34 (moderate depression), and above 34 (severe depression). Participants also responded daily to a 4-question in-app Likert scale survey on overall mood, social interest, sleep quality, and activity level (Table S1).
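For reference, the cutoffs above can be expressed as a small helper function (a sketch in Python; the function name is ours):

```python
def madrs_severity(score: int) -> str:
    """Map a total MADRS score (0-60) to its conventional severity band."""
    if not 0 <= score <= 60:
        raise ValueError("MADRS total score must be between 0 and 60")
    if score <= 6:
        return "not depressed"
    if score <= 19:
        return "mild depression"
    if score <= 34:
        return "moderate depression"
    return "severe depression"
```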
They were also prompted once a week on Saturdays to take an in-app Patient Health Questionnaire (PHQ-8) survey (Kroenke et al., 2009). The question assessing suicidality included in PHQ-9 was omitted because the Partners HealthCare IRB has previously determined that its inclusion would require real-time evaluation of patient data, deemed by the investigators to be infeasible in the present design. To remain enrolled in the study, participants were required to respond to the surveys at least five times a week.

| Beiwe research platform
In this study, we used the Beiwe application for data collection, which is the front-end component of the Beiwe research platform.
We have previously described an earlier version of the Beiwe research platform for high-throughput smartphone-based digital phenotyping in biomedical research use (Torous et al., 2016).
The front end of Beiwe consists of smartphone apps for iOS (by Apple) and Android (by Google) devices. The back-end system, which enables data collection and data analysis and supports study management, makes use of Amazon Web Services (AWS)-based cloud computing infrastructure. While data collection is arguably becoming easier with developing technology, analysis of the collected data is increasingly identified as the main bottleneck in research settings (Iniesta et al., 2016; Kubota et al., 2016; Kuehn, 2016). For this reason, Beiwe includes a growing suite of data analysis and modeling tools orchestrated by the Beiwe data analysis pipeline.
Reproducibility remains a challenge in the biomedical sciences, as fewer than 10% of studies have been found fully reproducible (Prinz et al., 2011). To enhance reproducibility, all Beiwe data collection settings for both active (smartphone surveys and audio samples) and passive (smartphone sensors and logs) data are captured in a single JSON-formatted configuration file, which can be imported to future studies to enable them to use identical data collection. The configuration for this present study is also available.

| Data collection, storage, and security
Each study participant was assigned a randomly generated 8-character Beiwe User ID and a temporary password, and study staff assisted participants with app installation and activation at the time of enrollment. Data collected by the Beiwe application were immediately encrypted and stored on the smartphone until the phone was connected to Wi-Fi, at which point the data were uploaded to the study server and expunged from the phone. The reason for configuring Beiwe to use Wi-Fi rather than cellular data in this study was to avoid charges associated with uploading large volumes of data, roughly 1GB per subject-month, to the cloud. Any potentially identifying data were hashed on the mobile device, and all data were encrypted while stored on the phone awaiting upload, while in transit, and while on the server.
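Beiwe's exact hashing scheme is not detailed here; as an illustration of the general idea of one-way hashing potentially identifying strings on the device before upload, one might use a keyed hash tied to a study-specific secret (the function name and salt handling below are our assumptions, not the platform's API):

```python
import hashlib
import hmac

def hash_identifier(identifier: str, study_salt: bytes) -> str:
    """One-way hash of a potentially identifying string (e.g., a phone number).

    A keyed hash (HMAC-SHA256) maps the same identifier to the same token
    within a study, so network structure is preserved, but the token cannot
    be reversed or linked across studies that use different secrets.
    """
    return hmac.new(study_salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same-input/same-output property is what lets downstream analyses count unique contacts without ever seeing raw phone numbers.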

| Processing of passive data: Phone GPS and accelerometer data
During the time period between the baseline visit and the last follow-up visit, accelerometer and GPS data from participants' smartphones were collected using Beiwe. The GPS measured the phone's latitude/longitude coordinates, while the accelerometer measured its acceleration along three orthogonal axes. To preserve the battery life of the phone (a concern mainly for GPS) and to reduce data volume (a concern mainly for the accelerometer), each sensor alternated between an on-cycle and an off-cycle according to a predefined schedule (10 s on, 10 s off for the accelerometer; 2 min on, 10 min off for GPS). We selected a longer on-period for the GPS because the receiver requires time to acquire the satellite fixes needed for positioning, and we correspondingly selected a longer off-period to reduce battery drain. In Supplemental Material, we describe our procedure for generating covariates for MADRS prediction from raw GPS and accelerometer data. Roughly speaking, the data were first summarized at a daily level, and the daily summaries were then aggregated by type of day (weekend versus weekday).
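The two-step summarization described above (raw streams to daily summaries, then daily summaries to weekday/weekend aggregates) can be sketched as follows; the function name and data layout are ours, not the actual pipeline's:

```python
from datetime import date
from statistics import mean

def aggregate_by_day_type(daily_summaries: dict[date, float]) -> dict[str, float]:
    """Aggregate per-day summaries (e.g., daily distance traveled) into
    weekday and weekend means, mirroring the two-step summarization:
    raw sensor data -> daily summaries -> day-type aggregates."""
    weekday = [v for d, v in daily_summaries.items() if d.weekday() < 5]
    weekend = [v for d, v in daily_summaries.items() if d.weekday() >= 5]
    return {
        "weekday": mean(weekday) if weekday else float("nan"),
        "weekend": mean(weekend) if weekend else float("nan"),
    }
```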
For Android users, in addition to accelerometer and GPS data, we also collected smartphone communication logs (call and text logs).

| Statistical analysis
We used linear mixed models for MADRS prediction. Linear mixed models are an extension of standard linear regression to clustered data, where the clusters here are the multiple MADRS assessments over time for each subject. Importantly, linear mixed models can handle clusters of varying size due to missing data. We considered four main model specifications. Each of them included the baseline MADRS score and the demographic variables as predictors (Table 2). We included the baseline MADRS score as a predictor based on the following rationale. One would ideally like to predict MADRS scores from passive data only, but this would require a large sample size and may not even be possible. The next best approach is to predict future MADRS scores from passive data and some baseline MADRS data. We adopted this latter approach because, if successful, it could reduce the number of times the MADRS needs to be administered, which would help economize healthcare resources. The models differed by which smartphone-based covariates were included as additional predictors: Model A used phone-based PHQ-8 surveys, Model B used weekly summaries of passive smartphone data, Model C used both PHQ-8 surveys and weekly summaries of passive smartphone data, and Model D used neither. In Models A and C, when including the phone-based PHQ-8 survey score as a predictor, we used the survey closest in time preceding the MADRS assessment in question. We chose to include the PHQ-8 survey score as a predictor because of the ease of its completion on a mobile phone by patients, and because of its widespread use as a screen in primary care settings. For Models B and C, we sought to predict the MADRS score based on passive smartphone data collected in the seven days preceding the MADRS assessment. We computed summary statistics using raw GPS and accelerometer data. Our previous work has shown that one needs to impute missing GPS data when constructing summary statistics from GPS data.
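A fixed-effects-only sketch of this setup is shown below. It substitutes ordinary least squares for the actual linear mixed model (which additionally includes per-subject random effects to handle the clustered, repeated assessments) and uses the RMSE metric reported in the Results; the column names are illustrative only:

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root-mean-squared error, the metric used to compare Models A-D."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_and_predict(X_train: np.ndarray, y_train: np.ndarray,
                    X_test: np.ndarray) -> np.ndarray:
    """Ordinary least squares with an intercept: a fixed-effects-only
    stand-in for the mixed model (it ignores the per-subject random
    effects that the actual analysis includes)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

# The four specifications differ only in which phone-based predictors are
# appended to the shared predictors (baseline MADRS + demographics):
model_columns = {
    "A": ["baseline_madrs", "demographics", "phq8"],
    "B": ["baseline_madrs", "demographics", "passive_summaries"],
    "C": ["baseline_madrs", "demographics", "phq8", "passive_summaries"],
    "D": ["baseline_madrs", "demographics"],
}
```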
To generate summary statistics from GPS data, we first imputed missing GPS trajectories using a resampling method (Barnett & Onnela, 2018) that has previously been shown to yield a 10-fold reduction in error, averaged across all mobility features, compared to simple linear interpolation of the data.
After imputing missing data, we then computed several GPS summaries proposed by Canzian and Musolesi (2015), Saeb et al. (2015), and Barnett and Onnela (2020). There were 32 candidate summary statistics computed from smartphone passive data (GPS and accelerometer) (see Table 2 and Table S2). As many of these statistics were correlated, rather than including all 32 statistics as predictors, we applied principal component analysis and used the leading principal component(s) of the passive summaries as predictors in the models that incorporated passive data.
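Collapsing the correlated passive summaries into a leading principal component can be sketched as follows (a minimal numpy version assuming every column has nonzero variance; the actual analysis details may differ):

```python
import numpy as np

def first_principal_component_scores(X: np.ndarray) -> np.ndarray:
    """Standardize the candidate passive summary statistics (columns of X)
    and project onto the first principal component, collapsing correlated
    statistics into a single predictor. Assumes no zero-variance columns."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix; rows of Vt are the PC loading vectors.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[0]
```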

| Participant baseline covariates and MADRS scores
Of the 45 consented participants, we excluded four participants who elected to cease study participation at or before the first follow-up visit (Figure 1). All other participants (n = 41) were included in the analysis, of whom three participants dropped out after the first follow-up visit and 38 fully completed the 8-week study. Table 1 shows the baseline features of these 41 study participants, including age, sex, diagnostic category, race, and baseline MADRS score. There were no missing data for these features. For the participants who completed the study, MADRS scores were available at baseline and at each of the four follow-up visits. Among the three participants who dropped out after the first follow-up visit, MADRS scores were assessed for two participants at the first follow-up visit. For descriptive purposes, Figure S1a

| Assessing completeness of phone data
The completeness of the accelerometer and GPS data was assessed at the participant level. For the accelerometer, we divided the number of minutes of data actually collected by the number of minutes of data expected to be collected. We examined the time period ranging from the day after the baseline visit to the day before the last follow-up visit, i.e., the time period including all full days in the study. Since accelerometer data were scheduled to be collected every minute, the expected number of minutes with data was the number of minutes in the time period. The completeness of GPS data was assessed analogously, except the expected number of minutes of data was 1/6 of the number of minutes in the time period (the 2-min on-cycle is 1/6 of the total cycle). The proportions for accelerometer and GPS are shown in Figure S2a. The proportions are variable, ranging from 0 to 0.99 for the accelerometer and from 0 to 0.87 for GPS. For the accelerometer, 23 out of 41 (56%) participants had proportions of 0.5 or higher. For GPS, 16 out of 41 (39%) participants had proportions of 0.5 or higher. The proportions tended to be greater for accelerometer data than for GPS data. Despite the missingness, a large amount of data was captured over the course of the study, including 674,969,086 accelerometer measurements and 14,733,731 GPS measurements. The quantity of collected data tended to be greater on average for iOS phones than for Android phones.

Figure S2b shows the completion rate for each PHQ-8 survey, indicated by the solid black line. Given a specific survey, its completion rate was defined as the proportion of participants who completed it. If a participant completed Survey t after Survey t + 1 had been sent, they were counted as not having completed Survey t but were counted as having completed Survey t + 1. The completion rate was 95% for the first survey and 80% for the last survey, which took place approximately two months after the baseline visit.
Figure S3 shows a histogram of the number of weeks that the participant completed one or more PHQ-8 surveys. If a participant completed more than one survey during some week (i.e., the participant was late on the previous week's survey), the multiple surveys only contributed 1 to the participant's tally. Overall, 78% of participants completed PHQ-8 surveys on 8 or more weeks.
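The sensor-completeness calculation described above (observed minutes divided by the minutes expected under each sensor's duty cycle) can be sketched as follows; the function names are ours:

```python
def expected_minutes_with_data(total_minutes: int,
                               on_sec: float, off_sec: float) -> float:
    """Expected number of minutes containing at least some sensor data.

    If a full on/off cycle fits inside one minute (accelerometer: 10 s on,
    10 s off), every minute contains some data; otherwise only the
    on-minutes do (GPS: 2 min on, 10 min off -> 1/6 of all minutes)."""
    if on_sec + off_sec <= 60:
        return float(total_minutes)
    return total_minutes * on_sec / (on_sec + off_sec)

def completeness(observed_minutes: float, total_minutes: int,
                 on_sec: float, off_sec: float) -> float:
    """Participant-level completeness: observed / expected minutes."""
    return observed_minutes / expected_minutes_with_data(
        total_minutes, on_sec, off_sec)
```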
As an example of passive data, Figure 2a-d plots the average activity level hour-by-hour (from 12:00 a.m. to 11:59 p.m.) for four randomly chosen participants in the schizophrenia/schizoaffective group on weekdays and weekends. For each participant, the curves were computed using accelerometer data collected throughout their follow-up as described in detail in Supplemental Material. For any given 1-hr window, the average activity level estimates the proportion of time that the participant was active (e.g., walking, using stairs) compared to stationary (e.g., sitting, standing, lying down) during this hour of the day. On weekdays, the participant in Panel A had low activity levels overnight, which began rising around 7 a.m., and hit their highest levels between 9 a.m. and 1 p.m., followed by a decline over the course of the evening. On weekends, their activity level was lower in the morning than on weekdays and was highest at 1 p.m. In interpreting these plots, a caveat is that the participant's activity was missed if the phone was not carried (e.g., it was left on a table). Thus, differences between the participants could be due to differences in their activity patterns, as well as differences in their phone use habits (e.g., how often each participant carried their phone).
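A minimal sketch of the hour-by-hour activity-level computation follows (the full procedure is described in Methods S1; the per-epoch data layout below is our assumption):

```python
def hourly_activity(samples: list[tuple[int, bool, bool]]) -> dict[str, list[float]]:
    """Average activity level by hour of day, split weekday vs weekend.

    Each sample is (hour_of_day 0-23, is_weekend, is_active), where
    is_active marks epochs classified as active (e.g., walking, stairs)
    rather than stationary. Returns, for each day type, the proportion of
    sampled epochs that were active in each of the 24 hours (NaN if no
    data were collected for that hour)."""
    counts = {k: [[0, 0] for _ in range(24)] for k in ("weekday", "weekend")}
    for hour, is_weekend, is_active in samples:
        key = "weekend" if is_weekend else "weekday"
        counts[key][hour][0] += int(is_active)   # active epochs
        counts[key][hour][1] += 1                # total epochs
    return {k: [a / n if n else float("nan") for a, n in v]
            for k, v in counts.items()}
```

As the text notes, hours with the phone left on a table still register as "stationary" epochs, so these proportions conflate activity patterns with phone-carrying habits.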
Data completeness for each passive modality, and for self-report, is summarized in Supplemental Results. In addition, we collected smartphone communication logs from Android devices (no iOS devices were included in this part of the analysis); these data are summarized in Figure 3 and Table 3. The median number of unique phone numbers was lower for the healthy controls, at 18 (IQR: 12–24), versus 28 (IQR: 21–41) for those with a psychiatric diagnosis.

| MADRS prediction
As an exploratory analysis, we evaluated the effect of including the second principal component (PC) as a predictor, which we call Model E. The results are shown in Table 3 in the row entitled Model E.
Comparing the average RMSEs after adding the second PC (Model E) relative to having the first PC only (Model B), the average RMSE slightly improves when there are no other variables in the model or when the other variables are demographics, but slightly worsens when baseline MADRS and demographics are included. We conducted a separate exploratory analysis in which we identified the variables that had the highest loadings in the first PC: distance traveled, maximum diameter, maximum home distance, and radius of gyration on the weekend. Since it is the most interpretable of the four, we used distance traveled on the weekend as the single passive predictor in a new model, Model F, which included no PCs. This led to average RMSEs similar to those obtained using the first PC (Model B) and using the first and second PCs (Model E).

Figure 2(a–d): Average activity level from 12:00 a.m. to 11:59 p.m. on weekdays and weekends for four randomly selected participants in the schizophrenia/schizoaffective diagnostic group. The solid line corresponds to the weekday, and the dotted line to the weekend. The x-axis origin of hour = 0 corresponds to 12:00 a.m. See Methods S1 for details on how these curves were computed.

| DISCUSSION
In this cross-disorder investigation, we found that including passive data as a predictor did not improve the prediction of clinician-rated MADRS scores. While the participant payment employed in this study precludes strong conclusions about acceptability, the high retention rate suggests that, with compensation, participants are willing to adopt this technology as part of a standard clinical assessment model. A similar approach has successfully been used in other settings, such as to study patients with schizophrenia, where the subjects were not paid for app use, not given additional support for app use, and not provided with check-in calls or study staff reminders to use the app.

Both academic researchers and pharmaceutical leaders have suggested that passive measures may replace clinical evaluation in clinical trials as a means of improving signal detection (Harvey et al., 2018). Setting aside the need for clinician involvement to ensure participant safety, our results suggest that more work will be required to replace clinical raters for assessment of MADRS.
Although passive data did not perform as well as phone-based PHQ-8 in terms of average RMSE, it is important to stress that the passive approach requires only a one-time installation of the application which, even if less precise, may be valuable in settings where individuals are unlikely to adhere to a survey protocol, especially for extended time periods and in the absence of financial or other incentives.
One possible explanation for why incorporating passive variables in Model B did not improve the average RMSE compared to using only the baseline MADRS score and demographics in Model D is the varying data quality among participants. For example, Figure S4 shows the availability of accelerometer data for three participants, with good-quality periods interspersed with periods with no data. Using incomplete passive data to predict the MADRS score can be challenging since the timing of the missing gaps may not be random (Figure S4).
When deriving our predictors from passive data, we avoided the naïve approach of taking averages across the available data, which would overweight time intervals during which data tended to be collected. Instead, we utilized a more robust method for handling missingness, which is described in Methods S1. However, the predictors may be inaccurate when the proportions of data collected are low (Figure S2a).
In a meta-analysis of seven smartphone-based digital phenotyping studies, there was no significant difference found in levels of missing data by sex, age, educational background, and phone operating system for either accelerometer or GPS data (Kiang et al., 2019).
Another study found that levels of missing GPS and accelerometer data were predictive of future clinical survey scores in a cohort of patients with schizophrenia, which presents a potential future extension of the analyses presented here.
We note multiple important limitations in considering our results. At the same time, our transdiagnostic approach is consistent with dimensional frameworks of psychopathology (Insel et al., 2010); that is, it may be useful to capture negative valence symptoms such as depression across a range of disorders, not just in major depressive disorder. While such symptoms may be attributed to different underlying processes (e.g., negative symptoms in schizophrenia), our results suggest the ability of a single platform to measure across disorders.

| CONCLUSION
While passively collected smartphone data did not improve the prediction of MADRS scores in our cross-disorder study, we demonstrate their application to capturing features of patients' daily functioning that are otherwise difficult to measure with surveys. These behavioral phenotypes, listed in Table 2 and defined in the Supplement, describe participants' physical activity (e.g., from accelerometer data), spatial isolation (e.g., time spent at home, computed from GPS data), and social isolation (e.g., number of outgoing calls, from Android call log data).

HUMAN SUBJECTS ETHICS STATEMENT
All participants signed written informed consent prior to their inclusion in the study. The study protocol was reviewed and approved by the Partners HealthCare Institutional Review Board (protocol #: 2015P000666). has also received an unrestricted gift from Mindstrong Health, Inc.

AUTHOR CONTRIBUTIONS
All authors have contributed meaningfully to this work and gave final approval to submit for publication.

PEER REVIEW

The peer review history for this article is available at https://publons.com/publon/10.1002/brb3.2077.