EczemaPred: A computational framework for personalised prediction of eczema severity dynamics

Abstract
Background: Atopic dermatitis (AD) is a chronic inflammatory skin disease leading to substantial quality of life impairment with heterogeneous treatment responses. People with AD would benefit from personalised treatment strategies, whose design requires predicting how AD severity evolves for each individual.
Objective: This study aims to develop a computational framework for personalised prediction of AD severity dynamics.
Methods: We introduced EczemaPred, a computational framework to predict patient-dependent dynamic evolution of AD severity using Bayesian state-space models that describe latent dynamics of AD severity items and how they are measured. We used EczemaPred to predict the dynamic evolution of validated patient-oriented scoring atopic dermatitis (PO-SCORAD) by combining predictions from the models for the nine severity items of PO-SCORAD (six intensity signs, extent of eczema, and two subjective symptoms). We validated this approach using longitudinal data from two independent studies: a published clinical study in which PO-SCORAD was measured twice weekly for 347 AD patients over 17 weeks, and another one in which PO-SCORAD was recorded daily by 16 AD patients for 12 weeks.
Results: EczemaPred achieved good performance for personalised predictions of PO-SCORAD and its severity items daily to weekly. EczemaPred outperformed standard time-series forecasting models such as a mixed effect autoregressive model. The uncertainty in predicting PO-SCORAD was mainly attributed to that in predicting intensity signs (75% of the overall uncertainty).
Conclusions: EczemaPred serves as a computational framework to make a personalised prediction of AD severity dynamics relevant to clinical practice. EczemaPred is available as an R package.


| INTRODUCTION
Atopic dermatitis (AD or eczema) is a common chronic inflammatory skin disease characterised by dry and itchy skin. 1 Disease symptoms manifest as relapses and remissions that are often unpredictable, making treatment difficult and increasing patients' burden. Current treatments for mild and moderate AD consist of applying emollients on dry skin and anti-inflammatory creams or ointments (topical corticosteroids and calcineurin inhibitors) on inflammatory skin.
Tailoring treatment strategies to each patient's conditions is essential to achieve maximum effectiveness because treatment responses often differ between and within patients. 2,3 Designing personalised treatment strategies requires predicting future disease states for individual patients as AD symptoms fluctuate dynamically in a highly heterogeneous manner. We have recently demonstrated that it is possible to predict the patient-specific daily evolution of AD severity by developing a mechanistic Bayesian machine learning model. 4 The model captured the patient-specific heterogeneity in dynamic trajectories of AD severity and responsiveness to treatment. However, its predictive performance and clinical applicability were limited because the model was developed using a daily bother score, which is a subjective global measure of distress caused by AD and is not suitable to capture different aspects of AD symptoms reliably. Using a validated objective severity score that combines multiple severity items could improve the predictive performance and make predictive models more relevant for clinical practice.
The Harmonising Outcome Measures for Eczema (HOME) initiative recommended the Eczema Area and Severity Index 5 (EASI) as the core outcome instrument for clinical signs of eczema to be measured in clinical trials. 6 SCORing AD 7 (SCORAD) and its objective component (oSCORAD) have also been validated as outcome instruments, 8 and other scores such as Six Area Six Signs AD 9 (SASSAD) are still routinely used in clinical practice. All these instruments report AD severity as a single score obtained by aggregating the severity scores for multiple severity items, including intensity signs, subjective symptoms and extent (the area affected by eczema). Each severity item captures a different aspect of AD severity and may follow its own dynamics.
In this paper, we introduce EczemaPred, a computational framework to predict patient-specific dynamic evolution of AD severity. It is based on the idea that modelling the evolution of each relevant severity item and aggregating the predictions could improve the performance and the clinical relevance of the prediction of AD severity dynamics. EczemaPred consists of a collection of Bayesian state-space models that describe the item-dependent dynamics of each severity item. The predictions for any AD severity score can be obtained by aggregating the predictions for the relevant severity items made by their Bayesian state-space models.
We use EczemaPred to predict patient-dependent dynamic evolution of the Patient-Oriented SCORAD (PO-SCORAD), 10 a validated self-assessment of SCORAD 11 that can be recorded on a smartphone app. Self-assessments of AD severity are more suitable for tracking the short-term (daily to weekly) evolution of severity dynamics than clinical assessments that can be performed only during clinical consultations of a limited frequency. PO-SCORAD is one of the core instruments recommended by the HOME initiative to measure patient-reported symptoms in clinical practice. 12 We validate the EczemaPred approach using longitudinal datasets from two clinical studies, a dataset from a published study 13 and another dataset we collected in an observational study.

| Observational study for data collection
An observational study (ClinicalTrials.gov, NCT04553224) was conducted from November 2019 to February 2020 in Toulouse (France) following the approval by IEC (CPP Ile de France V, Saint Antoine Hospital, n°582,211). We recruited 16 adult AD patients (mean age 25 years old, SD = 5) whose SCORAD was between 20 and 40 (mean SCORAD 34.6, SD = 4.4 at inclusion). Patients recorded PO-SCORAD every day for up to 12 weeks using an app (https://www.poscorad.com) while continuing their usual treatment. In the case of AD flare (n = 8 patients), medication was changed by the investigators. Informed consent was obtained from all study participants.

| Datasets used for predictive modelling
We used two datasets with daily to weekly measurements of PO-SCORAD and its severity items over a moderately long period. The first dataset, referred to as dataset 1, is from a published study investigating the role of an emollient in children (mean age 3.6 years, SD = 1.3) with mild to moderate AD. 13 We did not use demographic or treatment information in our models because our previous study 4 suggested that including it does not noticeably improve the predictive performance for patient-specific daily evolution of AD severity. In this study, we aimed to develop simple models with a good predictive performance that could be extended later to investigate the effects of demographics or treatment.

| Model overview
We introduce EczemaPred, a collection of machine learning models (Bayesian state-space models) that can be used to describe the data-generating mechanisms of each severity item. Each model assumes that a true latent (unobserved) severity follows its own dynamics and that the recorded severity results from an imperfect measurement of this latent severity (Figure 2A).
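As a rough illustration of this "latent dynamics plus imperfect measurement" structure, the Python sketch below simulates one severity item. The Gaussian random walk on the logit scale, the binomial measurement, the function name `simulate_item` and all parameter values are assumptions chosen for illustration; they are not the model specifications used in EczemaPred.

```python
import math
import random

def simulate_item(n_days, max_score=10, sigma=0.3, seed=0):
    """Simulate a latent severity and its noisy recorded scores (illustrative only)."""
    rng = random.Random(seed)
    logit = 0.0  # latent severity on the logit scale (assumed starting point)
    latent, recorded = [], []
    for _ in range(n_days):
        # Latent dynamics: Gaussian random walk on the logit scale (assumption).
        logit += rng.gauss(0.0, sigma)
        p = 1.0 / (1.0 + math.exp(-logit))  # latent severity as a fraction in (0, 1)
        # Measurement: binomial draw around the latent severity (imperfect measurement).
        score = sum(rng.random() < p for _ in range(max_score))
        latent.append(p * max_score)
        recorded.append(score)
    return latent, recorded

latent, recorded = simulate_item(n_days=14)
```

The recorded scores scatter around the latent trajectory, which is the quantity a state-space model tries to infer from the measurements.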
For predicting SCORAD (and PO-SCORAD, its self-assessment version), EczemaPred consists of nine models, each corresponding to one of the nine severity items of SCORAD (Figure 2B). Predictions from the nine models are aggregated to produce predictions for SCORAD by assuming independence of the severity items. The latent dynamics and measurement distributions of the state-space models were tailored to each severity item (Figure 2C).
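The aggregation step can be sketched with Monte Carlo draws: sample each severity item independently and combine the draws with the standard SCORAD formula A/5 + 7B/2 + C, where A is the extent (0-100), B the sum of the six intensity signs (each 0-3) and C the sum of the two subjective symptoms (each 0-10). In the Python sketch below, the uniform draws are placeholders standing in for the predictive distributions of fitted item models, and the function names are hypothetical.

```python
import random

def scorad(extent, intensity_signs, subjective_symptoms):
    """SCORAD = A/5 + 7B/2 + C from the nine severity items."""
    return extent / 5 + 7 * sum(intensity_signs) / 2 + sum(subjective_symptoms)

def aggregate_predictions(n_samples=1000, seed=1):
    """Combine independent per-item draws into draws of the aggregate score."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        # Placeholder predictive draws (assumption): a fitted model for each
        # item would supply these. Items are sampled independently.
        extent = rng.randint(0, 100)
        signs = [rng.randint(0, 3) for _ in range(6)]
        symptoms = [rng.randint(0, 10) for _ in range(2)]
        draws.append(scorad(extent, signs, symptoms))
    return draws

draws = aggregate_predictions()
```

Dropping the subjective symptoms term from `scorad` would give the corresponding draws for oSCORAD from the same item-level samples.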
We modelled the latent dynamics for the extent as a Markov chain that describes how a small "patch" of the skin transitions from non-lesional to lesional and vice versa. A binomial distribution was used to count the numbers of lesional patches, that is the extent. We

| Model validation
We evaluated the predictive performance of our models in a forward-chaining setting (Figure S3), where we trained the models every 4 days. That is, we first trained the models on the first day's data of a patient and tested them using their data over the next 4 days, then trained the models on the first 5 days' data and tested them on the next 4 days' data, etc.
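The forward-chaining scheme described above can be sketched as a generator of training and testing windows. The function name `forward_chaining_splits` is hypothetical, and the sketch assumes one observation per day for simplicity; it does not handle the irregular or missing measurements present in real data.

```python
def forward_chaining_splits(n_days, horizon=4):
    """Yield (train_days, test_days) pairs of 1-indexed day numbers."""
    train_end = 1
    while train_end < n_days:
        test_end = min(train_end + horizon, n_days)
        train_days = list(range(1, train_end + 1))            # all data seen so far
        test_days = list(range(train_end + 1, test_end + 1))  # next `horizon` days
        yield train_days, test_days
        train_end += horizon  # retrain every `horizon` days

splits = list(forward_chaining_splits(n_days=13))
```

With 13 days of data, this yields three iterations: train on day 1 and test on days 2-5, train on days 1-5 and test on days 6-9, then train on days 1-9 and test on days 10-13.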
The probabilistic predictions of PO-(o)SCORAD and its nine severity items were evaluated using a logarithmic scoring rule, the log predictive density (lpd). 15 We also computed an accuracy metric

| RESULTS
All EczemaPred models and reference models were fitted successfully for all severity items on the two datasets. We found no evidence of an absence of convergence by monitoring trace plots and by checking the potential scale reduction factor (R̂). 16 We conducted posterior predictive checks and found no clear discrepancies between the data and the models' simulations.

| Predictions of severity items
EczemaPred models learned the dynamics of severity items as more data came in (Figures S4-S12, top). A similar predictive performance was confirmed for the models trained with dataset 1 and those with dataset 2, supporting the generalisability of the models. However, predictions of extent and itching appeared to be more difficult with dataset 2 than with dataset 1 (Figures 3, S4 and S11). For example, the lpd for predicting extent is much higher for the EczemaPred model trained with dataset 1 than with dataset 2 (−1.53 ± 0.07 vs. …).
The predictive performance decreased as the prediction horizon increased for all the models investigated. The decrease in lpd when the prediction horizon is increased by a day was similar or smaller for EczemaPred models compared to the reference models with a nonconstant forecast (Figures S4-S12, bottom).
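To make the lpd comparisons concrete, the Python sketch below scores two hypothetical probabilistic forecasts of a 0-10 item against the same observations. Averaging the log probabilities over observations, the function names and the toy forecasts are assumptions for illustration, not the evaluation code used in the study.

```python
import math

def log_predictive_density(predicted_probs, observed):
    """Mean log probability that each forecast assigned to the observed score.

    predicted_probs: list of dicts mapping every possible score to a probability.
    observed: list of observed scores, aligned with predicted_probs.
    """
    logs = [math.log(probs[y]) for probs, y in zip(predicted_probs, observed)]
    return sum(logs) / len(logs)

def uniform_forecast(max_score):
    """Equal probability for every score from 0 to max_score."""
    p = 1.0 / (max_score + 1)
    return {s: p for s in range(max_score + 1)}

observed = [2, 3, 3]  # made-up observations
uniform = [uniform_forecast(10) for _ in observed]
# A sharper toy forecast putting 70% of its mass on the value that was observed.
sharp = [{s: (0.7 if s == y else 0.03) for s in range(11)} for y in observed]

lpd_uniform = log_predictive_density(uniform, observed)
lpd_sharp = log_predictive_density(sharp, observed)
```

A forecast that concentrates probability on what is actually observed gets a higher (less negative) lpd than the uniform forecast, which is why higher is better.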

FIGURE 2 Model overview. (A) Bayesian state-space models in EczemaPred. Each model describes the dynamics of a latent severity (white ovals) and the measurement of the latent severity to obtain the recorded severity (grey ovals). (B) Use of EczemaPred for SCORAD prediction. Predictions from nine models (coloured rectangles), each of which corresponds to one of the nine severity items for SCORAD, are aggregated to provide predictions for SCORAD. (C) Latent dynamics and measurement distributions for the three severity components of SCORAD.

| Predictions of PO-(o)SCORAD
Predictions for PO-SCORAD were derived by aggregating the predictions of the severity items by their respective models (example predictive trajectories in Figure 4).
We confirmed that the performance of PO-SCORAD prediction by EczemaPred improved as more data came in but did not plateau (Figure 5). This suggests that the performance could improve further if more training data were available, and that some model parameters need to be estimated more accurately. In contrast, the performance of the reference models stopped improving, suggesting that the improvement observed for EczemaPred was not due to a change in the data distribution (e.g. due to patients dropping out of the study early).
EczemaPred outperformed the reference models that predict PO-SCORAD directly (rather than aggregating the prediction of severity items as in EczemaPred), supporting our approach. The difference in lpd between EczemaPred and the reference models is less evident in dataset 2 than in dataset 1 for PO-SCORAD prediction. The difference is more evident for PO-oSCORAD prediction (Figure S13).

FIGURE 3 Predictive performance for 4-day-ahead forecasts by EczemaPred models (empty circles) and reference models (filled circles) measured by lpd (the higher, the better). EczemaPred models are a binomial Markov chain model (BinMC) for extent, an ordered logistic random walk model (OrderedRW) for intensity signs, and a binomial random walk model (BinRW) for subjective symptoms. Reference models include a uniform forecast (uniform), a historical forecast (historical), a random walk model (RW), and a Markov chain model (MC). The performance was calculated after training with approximately 80% of the data (77 days' data for dataset 1 and 65 days' data for dataset 2).

The predictive performance of EczemaPred with dataset 1 appeared to be better than that with dataset 2 (Figure 5), although the actual predicted dynamics do not always appear qualitatively different between the two datasets (Figure 4). Several data characteristics (e.g. dataset size, frequency of measurements and demographics) may explain the difference, but it is difficult to pinpoint the main factors without a meta-analysis. It is also possible that the performance with dataset 2 becomes comparable or superior to that with dataset 1 if we allow for a more prolonged training phase, given that the performance did not plateau.
The predictive performance of EczemaPred and the reference models decreased as the prediction horizon increased (Figure S14).

| Decomposition of prediction uncertainty in EczemaPred
We investigated which of the three components of PO-SCORAD

| DISCUSSION
This paper introduced EczemaPred, a computational framework to predict the patient-dependent dynamic evolution of AD severity using machine learning (Figure 2). We validated EczemaPred in predicting PO-SCORAD using two independent datasets with different characteristics.
EczemaPred for predicting PO-SCORAD consists of nine Bayesian state-space models, one for each severity item of PO-SCORAD (extent, six intensity signs, and two subjective symptoms).
EczemaPred models outperformed the reference models we considered for all the severity items (Figure 3). Predictions of PO-SCORAD were produced by aggregating the predictions by the severity item models and outperformed those obtained by standard time-series forecasting models (Figure 5). The prediction accuracy was approximately 72% and 60% for 4-day-ahead forecasts for datasets 1 and 2, respectively. Most of the prediction uncertainty in PO-SCORAD (79% and 72% for datasets 1 and 2, respectively) could be attributed to the intensity signs component, suggesting that improving predictions of intensity signs is the most promising approach to improve PO-SCORAD predictions.
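One way to see why the intensity signs can dominate the uncertainty is a variance decomposition. If the three components are treated as independent, the SCORAD formula A/5 + 7B/2 + C gives Var(SCORAD) = Var(A)/25 + (49/4) Var(B) + Var(C), so the intensity component is weighted by (7/2)². The Python sketch below computes the resulting shares for made-up predictive variances. Using variance as the uncertainty measure and the function name `variance_shares` are assumptions, and this is not necessarily the decomposition used in the study.

```python
def variance_shares(var_extent, var_intensity, var_subjective):
    """Share of Var(SCORAD) due to each component, assuming independence."""
    contributions = {
        "extent": var_extent / 25,            # (1/5)^2 * Var(A)
        "intensity": 49 * var_intensity / 4,  # (7/2)^2 * Var(B)
        "subjective": var_subjective,         # Var(C)
    }
    total = sum(contributions.values())
    return {name: value / total for name, value in contributions.items()}

# Made-up predictive variances, for illustration only.
shares = variance_shares(var_extent=50.0, var_intensity=2.0, var_subjective=8.0)
```

In this toy example, the intensity component carries a smaller raw variance than the extent or the subjective symptoms, yet it accounts for most of the variance of the aggregate score because of its weight.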
FIGURE 5 Learning curves for 4-day-ahead forecasts of PO-SCORAD evaluated by lpd (top) and accuracy (bottom) as a function of the number of training observations (training days), for datasets 1 (left) and 2 (right). EczemaPred models perform better than reference models, including an exponential smoothing model (Smoothing), a mixed effect autoregressive model (MixedAR), an autoregressive model (AR), a random walk model (RW), a historical forecast (historical), and a uniform forecast (uniform).

Modelling the dynamics of each severity item has several advantages when the breakdown of the aggregate severity score is available. It enables us to extract more signals from the data, as the AD severity dynamics for each patient are described by multiple time-series, one for each severity item (nine using PO-SCORAD) instead of one for the aggregate score. This approach also reduces the uncertainty in the aggregate score prediction when some severity items are easier to predict than others (e.g. when they are not very prevalent or do not vary much over time). The models can be tailored to each severity item to reflect the item-dependent data-generating mechanisms with relevant measurement processes and latent dynamics. The models are thus more interpretable and transparent, as predictions of aggregate severity scores can be decomposed into predictions of their components. 18 The models could be used to predict any combination of the severity items (e.g. PO-oSCORAD) without potential inconsistencies in predictions that could arise if each severity score of interest (e.g. oSCORAD and EASI with overlapping severity items) is modelled separately.
EczemaPred can thus be applied to develop predictive models for other AD severity scores, such as EASI and POEM, 19 a self-assessment tool recommended by HOME to evaluate subjective symptoms. 20 EczemaPred has some further advantages, especially for clinical use. The Bayesian framework enables us to make probabilistic predictions by explicitly quantifying uncertainties in parameters and predictions. The state-space models explicitly describe potential and often inevitable errors in the measurement of the severity items.
For example, estimation of the body area affected by eczema is subject to a high inter-rater variability, 21 potentially even more so when it is self-assessed as in PO-SCORAD. 10 The choice of representative sites may also introduce variability in the measurement of intensity signs. Modelling the measurement processes separately from the latent dynamics of the disease severity items also allows us to deal with missing values efficiently as an absence of measurement, while still inferring the latent dynamics. In a practical application of the model, the posterior distributions obtained in this study could be used as a prior for new patients to 'pre-train' the model, shortening the training phase to only a few measurements.
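The treatment of missing values as an absence of measurement can be illustrated with a forward filter for a generic discrete state-space model: the latent state is always propagated through its dynamics, and the measurement update is skipped on days without a recording. The two-state model, its transition and measurement probabilities, and the function name `forward_filter` are made up for this Python sketch and are much simpler than the EczemaPred models.

```python
def forward_filter(observations, prior, transition, emission):
    """Filter a discrete latent state; `None` marks a missing measurement."""
    n_states = len(prior)
    belief = list(prior)
    history = []
    for obs in observations:
        # Predict: propagate the belief through the latent dynamics.
        belief = [sum(belief[i] * transition[i][j] for i in range(n_states))
                  for j in range(n_states)]
        if obs is not None:
            # Update: reweight by the measurement likelihood and renormalise.
            belief = [belief[j] * emission[j][obs] for j in range(n_states)]
            total = sum(belief)
            belief = [b / total for b in belief]
        history.append(belief)
    return history

transition = [[0.9, 0.1], [0.2, 0.8]]  # rows: current state, columns: next state
emission = [[0.8, 0.2], [0.3, 0.7]]    # rows: latent state, columns: recorded value
history = forward_filter([1, None, 0], prior=[0.5, 0.5],
                         transition=transition, emission=emission)
```

The `prior` argument is also where information learned from previous patients could enter: starting a new patient from an informative distribution rather than a flat one is the idea behind "pre-training" mentioned above.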
More generally, the number of data points needed to accurately train the model depends on several factors, including the severity score to predict (e.g. SCORAD or oSCORAD), the performance metric to be optimised (e.g. lpd or accuracy), the target performance (e.g. 60% or 90% accuracy), the prediction horizon (e.g. 1 day or 1 week), the frequency of measurements (e.g. daily or twice-weekly) and potentially other characteristics of the datasets (e.g. demographics).
Limitations of this study include the subjective assessment of PO-SCORAD by patients. The reliability of PO-SCORAD assessment was shown to improve with experience, as patients may need time to learn how to use the PO-SCORAD instrument properly. 11 The severity item models may therefore benefit from specifying a time-varying measurement error. EczemaPred could also be improved by modelling the correlations between the six intensity signs, even though the components of SCORAD (extent, intensity signs and subjective symptoms) are thought to be uncorrelated by design. 7 For instance, dryness, thickening and scratching may covary as they mainly characterise the chronicity of the disease; and redness, swelling and oozing may covary as they represent acute flares. 7 Validation of EczemaPred in a real-world evidence study is also required, as the data used in this article were taken from patients involved in a clinical study in which they may have had a better follow-up than usual. The data also lack severity scores from severe AD patients, who may exhibit different severity trajectory patterns.
In summary, this study introduced EczemaPred as a computational framework to predict the patient-dependent dynamic evolution of AD severity. Patients could benefit from EczemaPred in managing their disease and anticipating their symptoms' change. Notably, EczemaPred could be used to investigate the effects of treatment and environmental factors on the dynamic evolution of AD severity and eczema persistence. For example, the models could be extended to quantify patients' responsiveness to treatment and suggest personalised treatment strategies using Bayesian decision theory. In conjunction with EczemaNet, 22 a computer-vision pipeline to detect and assess eczema severity from camera images, an extended version of EczemaPred could serve as a treatment adjustment tool by providing direct feedback to patients on the likely evolution of their severity and suggesting the most appropriate treatments to manage their condition proactively.