Developing an ML pipeline for asthma and COPD: The case of a Dutch primary care service

A complex combination of clinical, demographic and lifestyle parameters determines the correct diagnosis and the most effective treatment for asthma and Chronic Obstructive Pulmonary Disease (COPD) patients. Artificial Intelligence (AI) techniques help clinicians devise the correct diagnosis and design the most suitable clinical pathway accordingly, tailored to the specific patient conditions. In the case of machine learning (ML) approaches, the availability of real-world patient clinical data to train and evaluate the ML pipeline intended to assist clinicians in their daily practice is crucial. However, it is common practice to exploit either synthetic data sets or heavily preprocessed collections obtained by cleaning and merging different data sources. In this paper, we describe an automated ML pipeline designed for a real-world data set including patients from a Dutch primary care service, and provide a performance comparison of different prediction models for (i) assessing various clinical parameters, (ii) designing interventions, and (iii) defining the diagnosis.

• First, we describe the ML pipeline set up for the exploitation of data of asthma and COPD patients coming from a Dutch primary care service, with particular attention to obtaining an automated pipeline that is easy to deploy and integrate into existing software stacks.
• Then, we compare the performance of different prediction models for various clinical parameters, automated diagnosis and the suggestion of medical interventions.
• Finally, we summarise the "lessons learnt" during our research for the benefit of researchers and clinicians willing to perform similar tasks.
To the best of our knowledge, as summarised in Section 2.3, this is the first research work addressing both asthma and COPD for the comparison of multiple prediction models across different prediction targets, while relying on primary care data.
In the remainder of the paper, Section 2 gives background knowledge on health-risk prediction for asthma and COPD, then Section 3 presents the proposed ML pipeline, detailing the techniques exploited for data preprocessing, model training and scoring, whereas the evaluation of the resulting prediction models is in Section 4. Section 5 summarises the limitations of our study and the "lessons learnt" during our research, and finally Section 6 provides final remarks.

| BACKGROUND AND RELATED WORKS
In this section we provide the reader with the clinical and technological background knowledge necessary to fully understand the context, motivations and goals of our research. Accordingly, we provide a brief account of the current state of the art regarding the usage of digital tools for clinical pathways definition, including literature showing evidence of the effectiveness of ML-based decision support systems (DSS) for predicting various asthma and COPD outcomes (in Section 2.2).
Finally, we overview recent efforts to support clinicians through ML specifically in the case of asthma and COPD patients, and the main differences with respect to our work.

| Digital tools for clinical pathways
Research performed on electronic (or, computerised) clinical pathways can be roughly divided into three macroareas.
Clinical pathways analysis: Digital tools are created either to build an in silico representation of clinical pathways, enabling simulations and "what-if" analysis, or to define machine-readable specifications of constraints on pathways to be enforced in clinical practice. Both efforts share the goal of monitoring compliance, identifying bottlenecks and working out the reasons for underperformance of current care plans. In Reference [12], for instance, process mining is used to mine electronic health records and build clinical pathways post hoc. Others 13 have instead suggested an ontology-based approach for the definition of pathways in a workflow style.
Clinical pathways synthesis: The focus is on the automated definition of clinical pathways, or their enforcement and execution, with the goal of practically assisting clinicians in the definition and execution of care plans. Both References [14,15], for instance, propose a semantic rule-engine configured with domain expert knowledge and a suitable rule-set to suggest adaptations to care plans depending on the specific patient conditions or unexpected events.
DSS within clinical pathways: Here, emphasis is given to DSSs that support both the definition of pathways and their execution, but in specific tasks, as in our case. The work in Reference [16], for instance, proposes a recommender system to suggest to clinicians the next steps of disease management depending on the evolving patient conditions. In the following sections, we overview the issue of predicting clinical variables for asthma/COPD patients and the AI techniques that are increasingly used to support such a task, with a focus on ML approaches.

| Health-risk prediction for asthma and COPD
Due to the large variation of characteristics in asthma and COPD patients, it can be extremely difficult even for experienced physicians to determine the most effective treatment for each patient.
In the case of asthma, besides spirometry, the change in Forced Expiratory Volume in 1 second (FEV1) before and after administration of a large inhaled dosage of bronchodilator medication can confirm the diagnosis. However, the absence of such a change does not necessarily mean the absence of asthma, as many patients have normal lung function during diagnostic assessment. Hence, physicians often determine their diagnosis based on the clinical evaluation of symptoms, or via histamine provocation tests. 17 In the case of COPD, the diagnosis is based on patient history, including pack-years and the presence of a fixed obstruction. An obstruction is determined using the FEV1 and Forced Vital Capacity (FVC). The severity of COPD is based on lung capacity, measured as the proportion of predicted FEV1. However, the burden that patients suffer in daily life activities is only partly determined by the level of obstruction. In addition, symptom questionnaires, such as the Clinical COPD Questionnaire (CCQ) or the COPD Assessment Test (CAT), provide a more comprehensive assessment of the level of disease severity, which is divided into GOLD stages A (least severe), B, C and D (most severe).
The optimal treatment is different for each patient and is based on disease severity, symptoms, patient characteristics and individual risk factors. Personalised treatment has proven effective in improving patient outcomes; however, it is unlikely that physicians, within their limited consultation times, can accurately and comprehensively evaluate all of the many aspects that might affect the worsening of the disease or trigger an exacerbation. The human brain is simply not capable of recognising large amounts of longitudinal patterns and interactions between different predictors, as computers can. 18 For these reasons, AI can be promising, especially when based on large real-life databases and when using ML techniques, for instance, by supporting healthcare providers in predicting the effect of specific treatment approaches. 19 In this way AI can be a valuable aid for healthcare professionals and reduce the risk of treatment side effects for patients. 20 The use of automatic clinical DSSs may improve the diagnosis and ongoing management of chronic diseases, which currently require periodic visits to multiple health professionals, disease and medication monitoring, and modification of patient behaviour. The systematic literature review by Fathima et al. 21 provides evidence of the effectiveness of clinical DSSs in the care of people with asthma. However, they did not find clear evidence for their use in COPD.
Another systematic review of clinical DSSs was performed by Roshanov et al., 22 with the objective of determining whether clinical DSSs improve the process of chronic care (in diagnosis, treatment and monitoring) and associated patient outcomes. The authors identified 55 trials that measured and reported the impact of the clinical DSS on the process of care and/or patient outcomes. Of the clinical DSSs that measured the impact on the process of care, 52% demonstrated a statistically significant improvement, and of the trials that measured patient outcomes, 31% demonstrated benefits. Along the same line, Velickovski et al. 23 propose a clinical DSS in charge of delivering recommendations. Results show a high degree of accuracy in supporting COPD case-finding. Moreover, they demonstrate integration into healthcare providers' workflows through a modular design and a service-oriented architecture that connects to existing health information systems already in use.
However, most of the DSSs considered in the aforementioned reviews do not focus on the prediction of risk and the suggestion of interventions after baseline assessment, as we do, nor on the interoperability and portability of the models embedded in the DSS, as we do. The ML pipeline we describe in Section 3 and evaluate in Section 4 is not a DSS per se, but, being a fully automated ML pipeline implemented in Python, it can be easily embedded into one. Indeed, such a pipeline could be easily served via a DSS, such as the one described in Reference [24].

| ML for asthma/COPD
The goal of best predicting clinical variables related to asthma/COPD with data-driven approaches is shared by many research works in the state-of-the-art literature, and rightfully so, as the benefits that AI techniques may bring to asthma/COPD management as a whole are widely recognised. 25 However, most of the existing literature differs from our contribution in several aspects, such as data provenance and preprocessing, the approach adopted, and the outcome of the research. For instance, many works we mention in the following perform heavy preprocessing of data to select the most likely predictors, or obtain data from specific clinical trials. Also, whereas they aim to identify statistically meaningful predictors based on extensive statistical analysis (e.g., uni/multivariate regression methods), we focus on the performance of a fully automated ML pipeline autonomously building a slew of predictive models (also targeting the suggestion of interventions) from unfiltered primary care data, that is, data fetched "as is" from a Dutch primary care centre, with no a priori filtering or processing. Nevertheless, we here position our work with respect to the most similar works to highlight similarities and differences.
Most of the related research works perform some form of statistical analysis on data collected at various scales (e.g., primary care vs. controlled trials) to evaluate the predictive value of specific attributes. For instance, in Reference [26] the authors perform univariate analysis to identify likely predictors of COPD exacerbation events, then feed such predictors into a stepwise multivariable logistic regression model, and finally carry out a sensitivity analysis with respect to asthma and smoking pack-years. Hence the first difference with our work is that they preselect a prediction model, whereas we compare a slew of them within an automated pipeline. They obtain data from a curated database and further select patients based on eligibility criteria meant to remove confounding variables (e.g., patients with other respiratory diseases, such as bronchiectasis). Also, they preselect relevant candidate predictors based on literature and expert knowledge. In contrast, our work relies on primary care data "as is," and carries out minimal preprocessing automatically, with the purpose of being easily applicable without expert assistance. Finally, their work is aimed at finding the best predictors for COPD exacerbations, whereas our work has a twofold goal: finding good prediction models for a few different asthma- and COPD-related variables, while also striving to develop a reusable and easy-to-deploy ML pipeline. Similar considerations can be made for the work in Reference [27], as the authors select candidate predictors from the literature and through statistical analysis, and then build an a priori model to be evaluated using the receiver operating characteristic (ROC) area under the curve (AUC), the same scoring metric that we use. They rely on data from selected clinical trials and focus on asthma only.
In Reference [28], another statistical retrospective study deals with the issue of determining risk factors associated with asthma and COPD, applying univariate and multivariate logistic regression to determine predictors of future exacerbations. However, they only consider fewer than 400 patients specifically enrolled in the study. Although our work is considerably different in most aspects, such as sample size, data provenance, model comparison and the automation of our ML pipeline, it shares with Reference [28] the inclusion of both asthma and COPD, which complicates the classification task, as noted by the authors themselves, even though they divided the population into subgroups. Nevertheless, our work provides a much broader overview of predictive models for asthma and COPD, and it is immediately applicable to primary care data sets, since our automated ML pipeline can autonomously handle the data preprocessing steps reported in Figure 1, as well as training and scoring models automatically as described in Section 3.
More focussed on automated ML is the recent work started in Reference [29], which is similar to ours in both its goals (providing clinicians with a decision support tool) and its general approach (adopting ML techniques to learn prediction models from data), as the authors aim to build a learning healthcare system exploiting an ML pipeline fed with primary care data to learn different asthma attack prediction models. However, they intend to consider asthma only and focus on a single task, that is, the prediction of a single clinical variable (asthma attack). This notably simplifies the goal to be achieved with respect to our case, where we want to predict different outcomes, also give advice about treatment plans, and finally consider COPD too, which complicates our goal as its symptoms partially overlap with those of asthma. Obviously, as that work is yet to be concluded and fully reported, we cannot compare results yet.
Another recent paper more focussed on ML, and with objectives and methods very similar to ours, is Reference [30], where the authors compare a slew of different ML models to predict the risk of readmission for COPD patients. The authors compare logistic regression and its variants, random forests, linear support vector machines (SVM), gradient boosting, multilayer perceptrons, and also deep learning models including temporal features, such as convolutional neural networks, recurrent neural networks, long short-term memory and gated recurrent units. The work is similar to ours in that the authors build a mostly automated ML pipeline, with the objective of comparing which ML models perform best for the target prediction task; the pipeline has strong reusability and modularity, hence appears easy to adapt to slightly different tasks and to integrate with existing DSSs for usage by clinicians in day-to-day practice, a goal we also have. Furthermore, they have a discussion section quite similar to our Section 5, although ours is a bit broader in scope as it covers both technical and nontechnical aspects. However, Reference [30] also differs in several aspects: first, we consider asthma too, and multiple prediction targets for the same primary care data set, which notably complicates the learning task, whereas they focus on COPD and a single prediction target; second, they rely on data beyond primary care, and require 12 months of clinical data as an inclusion criterion.

FIGURE 1 The machine learning pipeline. In light grey are the data set dimensions (rows × columns). ACQ, Asthma Control Questionnaire; AUC, Area Under the Curve; CCQ, Clinical COPD Questionnaire; COPD, Chronic Obstructive Pulmonary Disease; kNN, k-Nearest Neighbour; LABA, Long-Acting β-Agonists; LAMA, Long-Acting Muscarinic Antagonists; ROC, Receiver Operating Characteristic; SVC, Support Vector Classification
Nevertheless, even given their more focussed scope, their results are mostly worse than ours, as their best-reported performance has a ROC AUC of 0.65; moreover, they do not report precision-recall curves, which may reveal even worse results, as described in Section 3.4.
Finally, two more related works are worth mentioning, despite their notable differences with the present paper. Reference [31] is one of the few works we found that considers asthma and COPD together. However, the learning task and methods are considerably different from ours: they aim to discriminate asthma patients from COPD patients by analysing saliva samples. Furthermore, they rely exclusively on black-box ML models, which may be difficult to introduce in clinical practice due to their low transparency and understandability. One interesting aspect of Reference [31], though, lies in the usage of few-shot (e.g., zero- and one-shot) learning models, which are able to learn from very few data samples. Reference [32], instead, is interesting for two reasons, even if its scope is quite different from ours, as they deal with the prediction of individual asthma persistence for children under the age of 5 years with an incident asthma diagnosis. First, the authors deal with highly imbalanced prediction classes, as we do for many of our prediction targets, by comparing different undersampling techniques: the edited nearest neighbours (ENN) method, which removes instances (of the majority class) whose class label differs from that of the majority of their k-nearest neighbours (kNN); repeated edited nearest neighbours (R-ENN), which repeats the ENN procedure until, for every data point of the majority class, the majority of its kNNs have the same class label as the data point; and the removal of majority class instances in Tomek links, that is, pairs of instances which are each other's nearest neighbour but belong to different classes. Although these techniques may generally improve the prediction outcome, they do so by altering the data set, artificially removing (or, at least, simplifying the handling of) the instances that are most difficult to classify. For these reasons we chose not to exploit such techniques in this paper (as described in Section 3.1), leaving them as future work for comparison with our current results. Second, they use negative predictive value (NPV)-specificity curves, which are similar to the precision-recall curves we use, since they too acknowledge that ROC curves perform poorly for imbalanced classes.
Given the above analysis, our work is, to the best of our knowledge, the first one to consider both asthma and COPD patients for building an automated prediction pipeline based on primary care data.

| MATERIAL AND METHODS
To both develop the ML pipeline and validate the prediction models, we exploited data coming from a Dutch primary care laboratory in the city of Groningen that receives approximately 2000 patients yearly with suspected asthma or COPD, who are referred for assessment and treatment advice. Patients are assessed by a trained laboratory technician according to the American Thoracic Society and European Respiratory Society guidelines, including respiratory testing with reversibility, medical history, smoking behaviour, Body Mass Index (BMI), medication and inhaler technique evaluation. The primary care physician receives the advice from the pulmonologist directly in his/her electronic patient record. If the pulmonologist advises patients to change the medication regime, then patients are advised to have a follow-up assessment after 3 months to evaluate the effect of the new medication. Instead, if the pulmonologist advises to continue the current treatment policy, the patient is rescheduled for a yearly follow-up.
The primary care service shared a data set storing real-life observational data collected between 2007 and 2017. The data set contains the baseline assessment of the clinical conditions of 19,077 patients. Attributes describe data such as age, gender, BMI, family history of the disease, lifestyle habits associated with the disease (such as smoking), spirometry including FEV1, FVC and reversibility measured by a trained laboratory technician, common symptoms such as cough, wheeze and dyspnoea, information about medications including inhalation technique, and symptom questionnaires such as the Asthma Control Questionnaire (ACQ) and CCQ. All these data are assessed remotely by a local pulmonologist. Diagnosis and treatment advice are sent to the general practitioner (GP) of the patient. Besides baseline, follow-ups are also available at different time points and only for some patients (more on this below). This results in 2454 attributes per patient, made of a set of ≈160 attributes repeated over time for each potential follow-up. Amongst these, the primary care service focusses on:
• patient-based health-risk assessment, that is, the prediction of clinical variables relevant for the diagnosis and prognosis of asthma, COPD and asthma-COPD overlap syndrome (ACOS) patients. In particular, the aim is to build predictive models for: the amount of exacerbations at 1 year (0, 1, 2+), predicted after ACQ and CCQ assessment (usually done at baseline assessment); the ACQ category at the 3- and 12-month follow-ups after baseline assessment (controlled, partially controlled and uncontrolled); and the CCQ category at the 3- and 12-month follow-ups after baseline assessment (stable, not entirely stable, unstable and very unstable);
• clinical pathways definition, in the form of suggestions for personalised intervention. Here the aim is to build predictive models for: advising usage of Long-Acting Muscarinic Antagonists (LAMA) after baseline assessment; advising usage of (low/high dosage of) Inhaled Corticosteroids (ICS), or Long-Acting β-Agonists (LABA), or both, after baseline assessment; and advising usage of β2 bronchodilation after baseline assessment.
Automated diagnosis has also been explored, by trying to predict whether the patient has ACOS, COPD or asthma after baseline assessment.
Since the data set is extremely sparse and data are not "clean" (as real-life data), many preprocessing steps were necessary before starting with model training. Section 3.1 describes such steps, while Figure 1 depicts the whole ML pipeline described in this section.

| From format conversion to preprocessing
The data set has been exported from the proprietary IBM SPSS software, which is not natively compatible with Python, our language of choice for the ML pipeline; it has then been converted to CSV, preserving the SPSS metadata. Inspection of the resulting data set was required to confirm correctness, for instance, with respect to data type congruence (e.g., date-time formats and categoricals) and missing values preservation (e.g., custom missing values).
Then, sparsity has been addressed by looking at which follow-ups are the most common amongst the 19,077 patients: Figure 2A shows the results of such analysis, where time points between 2 and 4 months have been grouped in the "3 months" category, those between 10 and 14 in the "12 months" one, and all the rest in "Other." As confirmed by Figure 2B, which shows the number of months elapsed between baseline and follow-up (y-axis) for the three groups above, the 3- and 12-month follow-ups are the most common.
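The grouping of follow-up time points can be sketched as follows (a minimal illustration with made-up month values; the actual pipeline operates on the full data set):

```python
import pandas as pd

# Hypothetical sketch of the follow-up grouping described above:
# time points between 2 and 4 months go to "3 months", those between
# 10 and 14 to "12 months", everything else to "Other".
followup_months = pd.Series([2, 3, 4, 6, 10, 12, 14, 24])  # made-up values

def bin_followup(months):
    if 2 <= months <= 4:
        return "3 months"
    if 10 <= months <= 14:
        return "12 months"
    return "Other"

groups = followup_months.map(bin_followup)
print(groups.value_counts().to_dict())
```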
On the basis of these groupings, patients with missing values for the 3- and 12-month follow-ups have been removed from the data set, as well as attributes with missing values for the corresponding measurements. This resulted in a restricted data set of 3659 patients and 164 attributes, with only ≈5% missing values. On this data set we performed univariate analysis to identify imbalanced attributes, that is, categorical attributes whose classes are not equally represented and can hence skew prediction performance negatively, and multivariate correlation analysis to detect proxy predictors, that is, independent variables with high correlation to dependent ones, which could skew prediction performance positively.
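The row/column filtering step can be sketched as below (the column names and the 50% sparsity threshold are illustrative assumptions, not the actual attribute names or cut-offs of the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the sparsity reduction: drop patients (rows)
# missing the 3- and 12-month follow-up measurements, then drop
# attributes (columns) that are mostly missing for the remaining rows.
df = pd.DataFrame({
    "acq_3m": [1.2, np.nan, 0.8, np.nan],           # illustrative columns
    "acq_12m": [1.0, np.nan, 1.1, 0.9],
    "rarely_measured": [np.nan, np.nan, np.nan, 2.0],
})
kept = df.dropna(subset=["acq_3m", "acq_12m"])  # complete follow-ups only
kept = kept.loc[:, kept.isna().mean() < 0.5]    # drop mostly-missing columns
print(kept.shape)
```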
For instance, one of the predicted variables for patient-based health-risk assessment is the total number of exacerbations at 1 year, whose class distribution is shown in Figure 3: it is extremely imbalanced, hence techniques such as down/oversampling may be necessary to improve prediction performance.
However, applying such techniques is challenging: downsampling further reduces the size of the data set, hindering the learning task, while oversampling reduces the accuracy with which the data set represents the real-world situation. For these reasons, we decided to perform neither, and instead grouped the categories into three bins: 0, 1, and 2 or more exacerbations.
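The binning of exacerbation counts can be sketched as follows (the counts are made up):

```python
import pandas as pd

# Sketch of the chosen alternative to down/oversampling: collapse the
# raw exacerbation counts into three bins (0, 1, 2+).
exacerbations = pd.Series([0, 0, 1, 3, 2, 0, 5, 1])  # made-up counts
binned = pd.cut(exacerbations,
                bins=[-1, 0, 1, float("inf")],
                labels=["0", "1", "2+"])
print(binned.tolist())
```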
For multivariate analysis, instead, we looked at various forms of correlation amongst dependent, independent, and both dependent and independent variables. For instance, Figure 4 examines the co-occurrences of exacerbations and ACQ/CCQ categories, by looking at the percentage of patients with at least one exacerbation per ACQ/CCQ category.
The few peaks at ≈20% are deemed not solid enough to justify eliminating either variable. A similar analysis has been conducted systematically on the entire set of independent variables, both between each other and against dependent ones, so as to identify proxy predictors and opportunities for dimensionality reduction (when independent variables are highly correlated, it may be sufficient to keep only one). For instance, Figure 5 shows the heatmap built from the correlations between a handful of independent variables chosen at random, just to exemplify the heterogeneous situations found in the data set: whereas FEV- and FVC-related measurements are highly correlated, the others are mostly unrelated.
In summary, given the analysis described above, all 164 attributes have been kept for the further preprocessing stages required by the predictive models described in Section 3.2. Such stages include:
• imputation of missing values: for numerical variables the median value has been propagated, whereas for categorical variables a random value has been sampled according to class frequency, that is, more represented classes have higher chances of being drawn. Although we are aware that this choice does nothing to mitigate the class imbalance problem, it preserves the distribution of the original data set;
• one-hot encoding of categorical variables, a required step for all the learning algorithms exploited;
• scaling of numerical variables to a standard normal distribution (mean 0 and standard deviation 1), a requirement for the learning algorithms exploited.
Numerical and categorical variables were already defined as such in the source data set. It is worth emphasising that all the preprocessing steps described above have been automated through Python programming, so as to (i) provide a reusable and configurable pipeline to data scientists, and (ii) be ready for deployment on any target platform (e.g., as a web service).
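The three preprocessing stages can be sketched with scikit-learn building blocks, as below. Note one deviation from the text: scikit-learn's SimpleImputer has no built-in frequency-weighted random sampling for categoricals, so most-frequent imputation stands in for it here; the column names are illustrative, not the real attribute names of the data set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Minimal sketch of the automated preprocessing stage: median imputation
# plus standard scaling for numerical attributes, imputation plus
# one-hot encoding for categorical ones.
df = pd.DataFrame({
    "fev1": [2.1, np.nan, 3.0, 2.7],          # illustrative columns
    "smoker": ["yes", "no", np.nan, "yes"],
})
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer([("num", numeric, ["fev1"]),
                                ("cat", categorical, ["smoker"])])
X = preprocess.fit_transform(df)
print(X.shape)  # one scaled numeric column + two one-hot columns
```

Because every step is a scikit-learn transformer, the whole chain can be fitted once and reused on new patients, which supports the deployment goal mentioned above.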

| Predictive models
On the basis of the above-described reduced data set, we set up a model training pipeline to compare the performance of the following predictive models, all implemented by the well-known scikit-learn Python module; we refer the reader to the referenced literature for a thorough description of each model.
Linear support vector classification (SVC): SVC, that is, classification based on SVMs, 33 with a linear kernel. SVMs are a set of supervised learning methods that are particularly versatile and powerful in high-dimensional spaces, especially thanks to the notion of kernel, which enables plugging different decision functions into the SVM depending on the problem at hand (e.g., linear for Linear SVC).
Radial basis function (RBF) SVC: SVC with an RBF kernel, the default kernel in scikit-learn, which notably performs well on average. 34
kNN: Neighbours-based classifiers do not attempt to build a model of the data; rather, they store instances of the training data and compute predictions on unknown data based on majority voting amongst known data points. kNNs 35 take into account the k data points nearest to the one to be classified (according to a configurable metric), where k is an integer that can be set as a learning parameter.
Random Forest: Ensemble methods combine predictions from several classifiers so as to enhance robustness over a single model. Averaging methods build several independent models over subsets of data and average their predictions into a single one, whereas boosting methods build models sequentially, trying to reduce the bias of the whole sequence. The Random Forest 36 exploits an averaging method with two sources of nondeterminism, meant to reduce overfitting: the subset of data for training is chosen at random, and a random subset of features is chosen when splitting each node of each decision tree.
The set of chosen models is well representative of the most commonly used models for the prediction of categorical variables: the Linear SVC model for its simplicity and ease of interpretation, the RBF kernel because it has been reported to perform well on average, independently of the specific characteristics of the data set, the kNN as a comparison with purely data-driven methods not forcing observations into a predefined model, and the Random Forest as the best-performing ensemble method on average. It is worth noting that we explicitly avoided opaque models with limited explainability, such as neural networks, as the clinicians of the primary care service expressed interest in working with easily interpretable models, for which they can precisely track why a given prediction or suggestion has been delivered by the software.
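As a sketch, the four model families can be instantiated and trained through scikit-learn's uniform estimator interface (shown here on synthetic data with illustrative hyper-parameters; the actual values were tuned as described below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# The four model families compared in the paper, with illustrative
# hyper-parameters (the real ones are tuned via grid search).
models = {
    "Linear SVC": LinearSVC(max_iter=5000),
    "RBF SVC": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=40, random_state=0),
}

# Synthetic stand-in for the preprocessed patient data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
for name, model in models.items():
    model.fit(X, y)  # same fit/predict API for every model
print(sorted(models))
```

The shared fit/predict interface is what makes it cheap to swap models in and out of an automated comparison pipeline.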

| Training and testing models
Each of the aforementioned models exposes several parameters influencing the underlying learning algorithm.
All models have been trained and tested using a train/test split ratio of 0.33, hence 33% of the data set has been left out of training to be used for evaluating the produced prediction models. Also, k-fold cross-validation has been used to assess the performance of the models, with k ranging from 10 to 30 depending on the specific model (for some, more than 10 validation rounds were impractical, either due to excessive memory consumption or running time).
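The evaluation protocol can be sketched as below (synthetic data; k=10 is used here, whereas the paper uses k between 10 and 30 depending on the model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Sketch: a 33% hold-out split for final testing, plus 10-fold
# cross-validation on the training portion to assess model performance.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
cv_scores = cross_val_score(RandomForestClassifier(random_state=0),
                            X_train, y_train, cv=10)
print(len(cv_scores), X_test.shape[0])
```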
Grid search has been exploited to automatically tune the hyper-parameters of the learning algorithms, in particular: the C regularisation factor for the SVC models (both linear and RBF kernels); the number of neighbours k and the "leaf size" property (regulating the trade-off between construction speed and memory usage) for the kNN model; and the number of estimators (trees), the maximum tree depth and the minimum samples required to split a node for the Random Forest.
For further information regarding the meaning of each parameter, we refer the interested reader to the technical documentation available starting from the scikit-learn interactive map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html.
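The grid search step for the Random Forest can be sketched as follows (the grid values are illustrative, not the grids actually searched):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Sketch of hyper-parameter tuning via grid search for the Random
# Forest: number of estimators, maximum depth and minimum samples
# split, as in the paper, but with illustrative grid values.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 40],
                "max_depth": [10, 40],
                "min_samples_split": [0.1, 0.3]},  # fractions of samples
    cv=3)
grid.fit(X, y)
print(sorted(grid.best_params_))
```

GridSearchCV exhaustively cross-validates every combination in the grid and refits the best one, so the tuning requires no manual intervention, in line with the automation goal of the pipeline.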

| Scoring of models
Finally, the following scoring techniques have been used to assess the best models: the F-measure and its variations (e.g., weighted and F-beta), 37 confusion matrices and ROC curves, also with AUC. The literature claims that the ROC should be the standard tool for assessing the performance of clinical risk prediction models. 38 However, it also warns about the misleading results it can provide for imbalanced data sets. 39 Hence, in the following we first report both the ROC and precision-recall curves, which are known to overcome some ROC-related issues, and then stick to the latter.
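On an imbalanced task, the two families of curves can be computed side by side as sketched below (synthetic data with a 90/10 class split; the numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Sketch: score a classifier on an imbalanced binary task with both
# ROC AUC and the precision-recall curve (summarised by average
# precision), as ROC alone can be misleading when classes are skewed.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)
precision, recall, _ = precision_recall_curve(y_te, proba)
ap = average_precision_score(y_te, proba)
print("ROC AUC:", round(auc, 2), "- average precision:", round(ap, 2))
```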

| RESULTS
The whole preprocessing pipeline described in Section 3.1 as well as the predictive models described in Section 3.2 have been applied to the data set subject to our investigation, for both patient-based health-risk assessment and clinical pathways definition. The following sections report on the best-performing models for each.

| Patient-based health-risk assessment
With the aim of predicting health-risk, the primary care service focussed on three variables: exacerbations, ACQ category and CCQ category.

| Exacerbations
We have to deal with the extremely imbalanced multiclass problem of predicting the number of exacerbations at 1 year amongst 0, 1, and 2 or more. Our experiments show that in this case the best-performing models are the Linear SVC and the Random Forest. Figure 6A depicts the confusion matrix comparing true classes (y-axis) against predicted ones (x-axis) for the Linear SVC classifier.
As shown in the graph, the Linear SVC model behaves well in predicting cases with one exacerbation (the centre square). On the other hand, for cases with no exacerbation it works only slightly better than a random predictor, while for cases with two or more exacerbations it mostly fails, predicting 1 instead. Figure 6B depicts the confusion matrix of the Random Forest classifier. The best-performing model has the maximum depth of each tree set to 40, the minimum samples for splitting to 30% of the population, and the number of trees to generate to 40 (hyper-parameters automatically set through Grid search). The model behaves extremely well both for cases with no exacerbation and with 1, but not for two or more exacerbations, which are incorrectly predicted as 1.
As the Random Forest model natively provides probability estimates in scikit-learn, we also report on ROC curves, depicted in Figure 7A. Such curves indicate the true positive rate on the y-axis and the false positive rate on the x-axis, hence their ideal shape is a curve with a steep elbow in the top-left corner. Those curves represent an excellent model, as they follow the mentioned elbow, and the AUC is always well above 0.9. Nevertheless, ROC curves disregard information about baseline probabilities, that is, the relative proportion of the different classes in the data set. In other words, if the model misclassifies a poorly represented class, it still scores high according to the ROC.
For this reason we also report the precision-recall curves, depicted in Figure 7B. They indicate the precision along the y-axis and the recall along the x-axis, hence the ideal shape is, this time, a curve with a steep elbow in the top-right corner. Here, it is more evident how the model sometimes misclassifies poorly represented classes, such as two or more exacerbations. Indeed, the precision-recall curve of the two or more class in particular is far from the ideal shape just described.
However, AUC is above 0.7 for two out of three classes despite class imbalance, and the class with the most errors is also the least represented amongst samples; hence, results are still arguably good.
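The different sensitivity of the two curve families to class imbalance can be sketched on synthetic data (the class ratio and score distributions below are invented for illustration): a rare positive class can yield a high ROC AUC while the precision-recall summary remains much lower.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Invented imbalanced problem: 1000 negatives, only 20 positives.
y_true = np.concatenate([np.zeros(1000), np.ones(20)])
# Positives tend to score higher, but the distributions overlap.
y_score = np.concatenate([rng.normal(0.3, 0.15, 1000),
                          rng.normal(0.6, 0.15, 20)])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # PR-curve summary

# roc_auc looks strong because it ignores the baseline class proportions;
# pr_auc is much lower, exposing the cost of errors on the rare class.
```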

| ACQ category
We have a multiclass classification problem: we are interested in predicting the ACQ category of patients at 3 and 12 months amongst controlled, partially controlled and uncontrolled. Our results show that the best-performing model is the Random Forest, trained with the maximum depth of each tree set to 30, the minimum samples for splitting to 10% of the population, and the number of trees to generate to 40 (obtained through Grid search). Figure 8A shows the confusion matrix, while Figure 8B depicts the precision-recall curves. They confirm that the model behaves very well, as both the controlled and uncontrolled categories have over 0.9 correct predictions and over 0.95 AUC. The only category with results comparable to a random classifier is partially controlled.
It is worth noting, however, that such a prediction is complicated by the nature itself of the class: it represents "edge" cases with no clinical variable clearly hinting at either one of the other two categories, and corresponds to situations difficult to assess even for experienced clinicians. All the other models mentioned in Section 3.2 have similar performances but score lower in every class. Also, predictions at 12 months follow similar patterns but with slightly degraded performance, hence are not reported.

| CCQ category
We have to address the highly imbalanced classification problem of predicting the CCQ category at 3 and 12 months amongst stable, not entirely stable, unstable and very unstable. In fact, the categories not entirely stable and stable are represented 3 to 7 times more than the others.
In this case too, the best-performing model is the Random Forest, trained with fully developed trees, the minimum samples for splitting set to 10% of the population and the number of trees to generate to 90 (again, obtained through Grid search). Figure 9A,B shows, respectively, the confusion matrix and the precision-recall curves with AUC. Predicting stable cases gives the best results, followed by very unstable ones, whereas the unstable and not entirely stable cases are predicted only slightly better than by a random classifier. However, it is worth noting that most misclassifications happen by incorrectly attributing cases to the not entirely stable class; this should not be surprising: similarly to the case of ACQ, this class represents the most uncertain cases, difficult to assess even for experienced clinicians.
Predictions at 12 months show similar results, again with slightly degraded performance, hence are not reported.

| Discussion
For each of the three prediction targets, Random Forest is the approach that gives the best performance. To better understand our results from a clinical perspective, we have to consider the underlying operational context. The training data set after preprocessing has a limited size (3659 samples with 164 attributes, 5% missing values) because most GPs referred patients for a single assessment. Moreover, most tasks involved highly imbalanced class distributions, which complicates the learning task.
Finally, as the whole data processing pipeline, from preprocessing to application of models, is conceived to be as automated as possible, so as to be easily reusable and deployable on different software platforms with minimal effort, undertaking specific operations for specific tasks and on specific portions of the training data set is not always possible. For instance, inspection of data for exploratory analysis and manual preprocessing, or fine-tuning of the learning algorithms under specific conditions "by hand", is not supported by our Python pipeline at the moment. This makes it easier to adapt the pipeline to slightly different data (e.g., coming from similar primary care services), at the cost of possibly sacrificing a bit of accuracy.
Our results show that the proposed models could be introduced in clinical practice as for each problem at least one model produces excellent results in at least one target category. A boosting approach based on Linear SVC and Random Forest can be used to predict cases with no exacerbation or 1. A Random Forest approach can be adopted to predict controlled and uncontrolled ACQ categories. Similarly, a Random Forest model can be used to identify stable cases in CCQ category prediction.
Our Python pipeline can be integrated with Electronic Health Records by delivering its predictions along with their degree of confidence (or even the precision-recall curves as a whole): this is crucial to ensure that clinicians are informed about the confidence of each prediction, and we argue that it is also fundamental to boost adoption of this form of AI-driven support. For some models, the relative importance of the different independent variables in contributing to the prediction may also be available, and will surely be valuable in further informing the clinicians exploiting the predictors. As such, including this aspect in our analysis is already intended as future work.
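As a hypothetical illustration (on synthetic data, not the study's models), both pieces of information are readily available from a fitted scikit-learn Random Forest:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the trained predictor.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-prediction confidence: the class probabilities for a single patient.
proba = model.predict_proba(X[:1])[0]
confidence = proba.max()

# Relative importance of the independent variables in the predictions.
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:3]
```

A prediction shipped to an EHR could then carry `confidence` alongside the predicted label, and the ranked `top_features` as a short explanation.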

| Clinical pathways definition
As regards clinical pathways definition, and in particular the suggestion of interventions, the primary care service wanted to tackle four classification problems: whether or not to advise the usage of LAMA (binary); whether or not to advise the usage of both ICS and LABA (binary); whether or not to advise the usage of β2 bronchodilation (binary); and suggesting the diagnosis of a patient amongst asthma, COPD, asthma/COPD overlap and diagnosis unclear (multiclass). It is worth mentioning that we decided to adopt a supervised approach because our data set is actually labelled, as it contains information about outcomes; hence, we could base predictions on whether a suggestion led to a positive outcome or not. By doing so, a more accurate solution is likely to be obtained than by adopting an unsupervised approach such as, for instance, association rule mining.

| Advise LAMA
The kNN classifier is the best-performing model. Figure 10A shows its confusion matrix on test data. Performance is very good for both the positive and negative classes, even if some samples (cases) are incorrectly not given the suggestion to use LAMA when it was instead due. The model weights neighbours by distance when voting on class labels, with the leaf size and number of neighbours hyper-parameters set to 10 and 4, respectively (auto-tuned through Grid search). The weighted F1 score37 is used as the scoring metric. Both the excellent performance in one case and the misclassifications in the other are confirmed by the precision-recall curves depicted in Figure 10B.
It is worth mentioning that the Random Forest slightly improves performance when correctly not giving the advice, but at the cost of considerably degrading performance when giving it, hence the kNN is chosen as the best model.
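A minimal sketch of the described kNN configuration, on synthetic placeholder data (only the hyper-parameter values are taken from the text), could be:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder for the binary LAMA advice task.
X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

# Distance-weighted voting, leaf_size=10 and n_neighbors=4 as in the text.
knn = KNeighborsClassifier(n_neighbors=4, weights="distance", leaf_size=10)
knn.fit(X_train, y_train)

weighted_f1 = f1_score(y_test, knn.predict(X_test), average="weighted")
```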

| Advise ICS + LABA
The best model is the Random Forest. Figure 11A shows the confusion matrix obtained by applying the Random Forest model to test data, configured with balanced class weights, fully developed trees, and an auto-tuned (through Grid search) number of estimators (80) and minimum samples split (10% of the population). The model does a good job both in delivering the advice when due and in refraining from doing so when unnecessary. The precision-recall curves reported in Figure 11B reveal that most errors are made when predicting the advise use ICS + LABA class.
Achieving better results across classes is complicated by the fact that class representation is highly imbalanced: for ICS and LABA usage we have a ratio of 1:5. Undersampling and oversampling are not deemed to be viable solutions for seeking improvement: the former would leave too few samples to train the model, whereas the latter would overfit the model to artificial data. As such, we prefer to stick with good results actually reflecting real-world data sets as truthfully as possible (e.g., in reference to the amount and quality of data that clinicians have at their disposal).
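The alternative actually adopted here, class weighting, can be sketched on a synthetic 1:5 problem (all data and values below are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Invented 1:5 imbalanced binary problem mimicking the ICS + LABA ratio.
X, y = make_classification(n_samples=600, weights=[5 / 6, 1 / 6],
                           flip_y=0.02, random_state=0)

# Class weights rebalance the loss without touching the data, unlike
# over/undersampling, which would either discard samples or invent them.
clf = RandomForestClassifier(n_estimators=80, min_samples_split=0.1,
                             class_weight="balanced", random_state=0)
clf.fit(X, y)

pred = clf.predict(X)
minority_recall = (pred[y == 1] == 1).mean()
```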

| Advise β 2
The two best-performing models are a Random Forest and a Linear SVC, whose confusion matrices are shown, respectively, in Figure 12A,B.
The two mentioned models are complementary in what they are good at suggesting: the Random Forest is almost excellent when deciding not to advise the usage of β2 bronchodilation, whereas it fails half the time when deciding to deliver the advice. The Linear SVC, instead, is very good when deciding to advise usage, but fails half the time when deciding not to do so.
Given this complementarity, a further attempt at improving classification results has been made with a Voting classifier, that is, a kind of ensemble method (like the Random Forest) with a very simple idea at its core: combine different ML classifiers and use a majority vote (or the average predicted probabilities) to predict the class labels. Unfortunately, no appreciable improvement has been achieved.
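A sketch of such a Voting classifier, on synthetic placeholder data, could look as follows; note that scikit-learn's LinearSVC exposes no probability estimates, so hard (majority) voting is used rather than averaging predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic placeholder for the binary beta-2 advice task.
X, y = make_classification(n_samples=400, n_features=15, random_state=0)

svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
rf = RandomForestClassifier(n_estimators=60, random_state=0)

# Hard (majority) voting over the two complementary classifiers.
vote = VotingClassifier(estimators=[("svc", svc), ("rf", rf)], voting="hard")
vote.fit(X, y)
accuracy = vote.score(X, y)
```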

| Diagnose
Besides the preprocessing presented in Section 3.1, additional filtering of features is necessary for automated diagnosis. In particular, as we are interested in diagnosing COPD, asthma and ACOS, we remove from the training data the features obtained as a consequence of the diagnosis itself (such as repetitions of questionnaires to track the evolution of the condition), as they would inflate results, being essentially a "proxy" for the predicted variable.
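A hypothetical illustration of this filtering step with pandas (all column names below are invented, not the data set's actual fields):

```python
import pandas as pd

# Invented miniature data set; the "_followup" columns stand for features
# that exist only as a consequence of an already made diagnosis.
frame = pd.DataFrame({
    "age": [63, 71, 55],
    "fev1_fvc_ratio": [0.62, 0.55, 0.81],
    "acq_followup_3m": [1.2, 2.1, 0.4],
    "ccq_followup_3m": [1.0, 2.5, 0.3],
    "diagnosis": ["COPD", "COPD", "asthma"],
})

# Drop the proxy features and the target itself from the training inputs.
proxy_columns = [c for c in frame.columns if c.endswith("_followup_3m")]
X = frame.drop(columns=proxy_columns + ["diagnosis"])
y = frame["diagnosis"]
```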
The two best-performing models are a Linear SVC and a Random Forest, shown, respectively, in Figure 13A,B. The Linear SVC is excellent in diagnosing COPD. An extensive search through the hyper-parameter space to find the best regularisation factor (which helps especially in the case of imbalanced problems, such as this one) does not improve classification recall. The Random Forest improves where the Linear SVC is already good (asthma and COPD) but does not help where the Linear SVC fails. This is particularly unfortunate, because otherwise a Voting classifier might have helped by combining the complementary classifications of the different models.

| Discussion
Given that the same considerations discussed in Section 4.1 still hold here, regarding both the limits of the data set used and our focus on pipeline automation, we have to take into account an additional complication: in the case of clinical pathways, establishing when to give the above advice and which diagnosis to make may quickly become very difficult even for experienced clinicians, depending on the specific patient's conditions. This means not only that the task is intrinsically difficult, as no combination of independent variables clearly determines the outcome, but also, and most importantly, that the input data set used for training has some "intrinsic error" that inevitably biases the generated prediction models, as the labelled outcomes assigned by clinicians cannot be assumed to be always correct.
That being said, we have good results across the different tasks. Advising usage of LAMA is accomplished with very good results by a kNN classifier. Advising usage of ICS and LABA is accomplished by a Random Forest classifier, exhibiting good results despite having to deal with a highly imbalanced problem. Advising usage of β2 is jointly accomplished by a Random Forest and a Linear SVC: the former is very good when the advice is not due, whereas the latter is very good at advising usage. Finally, a Linear SVC is excellent in diagnosing COPD, and good in diagnosing asthma. Once again, then, we have at least one model able to provide useful support to clinicians in their daily practice when dealing with asthma and COPD patients. Table 1 summarises the key results we presented and discussed. For each ML task, we report: whether the classification problem is imbalanced ("Imbal.?"), whether it is multiclass ("Multi?"), the best model and its hyper-parameters, the ROC AUC, and the precision-recall curve AUC.

FIGURE 13 Confusion matrices of the best Linear SVC versus the best Random Forest. (A) Diagnosis based on the Linear SVC is excellent at spotting COPD cases and good at spotting asthma cases; (B) the Random Forest improves on the Linear SVC. COPD, Chronic Obstructive Pulmonary Disease; SVC, Support Vector Classification

| Summary of key results
The Random Forest performs well across tasks: it is the most represented model in Table 1. However, its best parameters vary greatly from task to task, hence care and time must be dedicated to exploring the hyper-parameter space (e.g., through automated Grid search, as we have done). It is interesting to note that, in our case, for the binary classification problems where class representation is balanced (the LAMA and β2 tasks), an alternative exists: kNN for the former, which outperforms the other classifiers, and Linear SVC for the latter, which is more or less on par with the Random Forest. Finally, it is worth emphasising that the ROC AUC is very good across tasks, as most of the prediction errors happen in the least represented classes.
In particular, our results show that: (i) for predicting the number of exacerbations at 1 year, the ACQ category at 3 and 12 months, and the CCQ category at 3 and 12 months, the best-performing model is a Random Forest; (ii) for delivering suggestions about the usage of LAMA a kNN model performs best, while for ICS + LABA and β2 a Random Forest does; (iii) for automated diagnosis a Linear SVC performs best. For each task, at least one predictive model exhibits actionable results, that is, good performance according to the clinicians collaborating on the research, and feasibility of deployment according to the data actually available in the primary care centres taken as reference.
Min et al.30 compare the following ML models for predicting the risk of readmission of COPD patients: logistic regression and its variants, random forests, linear SVM, gradient boosting, multilayer perceptron, and deep learning models including temporal features, such as convolutional neural networks, recurrent neural networks, long short-term memory and gated recurrent units. Although they consider COPD only, and focus on readmission risk, as noted in Section 2.3 it is the work most similar to ours, hence the only one amenable to a detailed performance comparison. Their best-performing white-box model is a gradient boosting decision tree achieving a mean ROC AUC of 0.643, whereas their best deep learning model is a gated recurrent unit achieving a mean ROC AUC of 0.65. In our case, as reported in Table 1, all of our models have a mean ROC AUC above 0.73, peaking at a mean of 0.98 for the number of exacerbations, 0.937 for the ACQ category, 0.898 for the CCQ category, 0.87 for advising LAMA, 0.75 for advising ICS + LABA, 0.73 for advising β2, and 0.88 for diagnosis. Looking at the precision-recall curves, the performance of our models degrades, as the class imbalance problem makes the prediction of some classes extremely complicated. In fact, the best models whose ROC AUC has just been described achieve a mean precision-recall AUC of 0.73 for the number of exacerbations, 0.843 for the ACQ category, 0.698 for the CCQ category, 0.75 for advising LAMA, 0.675 for advising ICS + LABA, 0.67 for advising β2, and 0.605 for diagnosis. It is worth noting, then, that even in the case of precision-recall AUC, performance is always better than a random choice and, in all cases but diagnosis, outperforms the state of the art.

| LIMITATIONS AND LESSONS LEARNT
Although our study has shown good generalisation capabilities during validation, as demonstrated by the results described in Section 4, there are limitations. Training data are limited in size (66% of 3659 samples) and only represent patients from a specific Dutch region, hence our results may not hold for a different population. Also, validation data come from the same primary care centre, hence their distribution is the same as that of the training data, which may mask limits in the generalisation capabilities of the models. Another limitation of our study stems from our attention to the automation of the ML pipeline (depicted in Figure 1): since one of our aims is to provide a software package easy to integrate into DSSs or legacy systems used by clinicians, hence fully autonomous in its functioning, we do not currently support some fine-tuning operations that require manual intervention, such as feature engineering. However, we are aware that such interventions may improve predictive performance, and plan to further investigate the automation of more preprocessing steps in our future work. Finally, a limitation of our study concerns the input data set, which lacks imaging features that are notably useful in predicting various aspects of asthma and COPD conditions.8 Nevertheless, our study is the first addressing both asthma and COPD predictions exclusively from primary care data, and the results achieved encourage further research along this line. Before concluding the paper with some final remarks and an outlook on our planned future work, we take the opportunity to share with the reader some lessons we learnt from our experience in building and evaluating the ML pipelines described, so as to deliver recommendations to those walking along the same path, or willing to.
Getting data takes time: Even if the data are technically readily available in a database, actually getting one's hands on them may take a long time depending on the organisational setting of the provider: do not underestimate this. Is there a need for approval by an ethical committee? Add months to the expected delivery time. Is there administrative paperwork to carry out, such as signing nondisclosure agreements? Add weeks. Will the data be handed over from a database/server/file separate from the one used by the provider in daily practice? Again, add weeks. In our case, for example, we needed all of this, hence almost 6 months passed from the day we agreed to get the data to the day we actually got them. Takeaway: plan ahead.

Real-world data are a mess: This may be obvious to state, but a notable amount of ML research happens on either synthetic data sets or carefully curated ones storing only relevant data in a neat and clean format. Real-world data are rarely the same: redundant fields storing the same information in different ways, missing or inconsistent information, wrong data formats, high class imbalance and other technical issues are omnipresent. Furthermore, nontechnical issues complicate the picture too, as they need a domain expert to be resolved: relevant data mixed in with useless data, wrong data, correlations to be confirmed, and so on. In our case, for instance, without the collaboration of the clinical co-authors, making sense of some of the data would have been almost impossible. Takeaway: making data neat requires much more effort than training an ML model.

Exploratory analysis is crucial: Domain experts may know the meaning of the data, for example, what a certain feature means from a clinical perspective, but may ignore the hidden relationships between data, for example, correlations between clinical variables, predictive power, and so forth. Also, there are technical issues with data that only an exploration stage may bring to light, such as proxy predictors, inconsistencies between related features, imbalanced representation of classes and so forth. In our case, the different classes to predict are not uniformly represented. Moreover, domain experts' knowledge is valuable to get insights about the data, but may also (unintentionally) bias the exploratory analysis towards confirmation of already known facts. Takeaway: explore data both with and without domain experts.

Involve domain experts: This point too may appear obvious, but involving domain experts in the process of designing the ML pipeline adds value. Far too many research efforts include domain experts only in the validation stage, to assess the performance of the models. However, domain experts bring added value at all stages of an ML pipeline, from initial conception to deployment in a production environment, passing through design and evaluation. In our case, for instance, clinicians participated from the very first stage of requirement elicitation, to define what goal to pursue with the ML pipeline, through the later stage of choosing which prediction models to compare (mostly selected for explainability), up to the final stage of performance assessment. Takeaway: domain experts add value at each stage of the pipeline.

Overfitting is tempting: During the iterative process of training and validating models, hyper-specialising the model to reach the best possible performance is alluring, but may be misleading. We are not only talking about the well-known problem of model overfitting, but also of the subtle habit of manually manipulating both the data set and the model parameters to squeeze out a negligible performance increase, at the cost of sacrificing generalisation power, portability across populations, and opportunities for automation of the ML pipeline. In our case, for example, while designing the ML pipeline we deliberately stuck with preprocessing and model-building tasks easy to automate, as our concern was not only to find the best models, but also to produce an automated ML pipeline easy to adapt to different populations. Takeaway: privilege automation over specialisation.

Automation is good: Although increasing attention has recently been devoted to the deployment stage of an ML pipeline40 (rather than to the training and evaluation stages only) and to the automation of various stages of the pipeline (e.g., the AutoML movement41), far too many research works still present ML pipelines in which each step is performed manually or with little automation, on an ad hoc basis requiring constant human intervention. Although this may improve the performance of the pipeline in the specific domain and use case for which it is built, such practice limits its reusability across populations and its deployability in legacy environments. That is why we tried to keep the ML pipeline as automated as possible, at the cost of (perhaps) sacrificing some performance. Takeaway: automation adds value.

Handle scoring metrics with care: It is already known that some of the most common scoring metrics used in ML are often abused,42 either by applying them while disregarding the underlying assumptions supporting their validity, or by blindly using them "following the masses" without questioning their appropriateness (e.g., the F-measure43 and the ROC39). Even more so in the case of imbalanced data sets.44 In our case, for instance, we commented on how the ROC AUC may be misleading in the case of imbalanced class representation, and how the precision-recall curve AUC may complement it to give a more comprehensive picture of model performance (see Section 3.2). Researchers must keep in mind at all times that each performance metric (i) is usually meant to assess one facet of model performance, and (ii) comes with its own applicability requirements (or assumptions) dictating when the metric is meaningful. Takeaway: scoring metrics have goals and assumptions, do not ignore them.

| CONCLUSIONS
In this paper, we tackled the problem of building an ML pipeline for clinicians treating asthma and COPD patients. On the clinical side, we developed an automated ML pipeline and compared the performance of several prediction models to predict the number of exacerbations, the CCQ category and the ACQ category of patients, to deliver advice about the usage of LAMA, ICS and LABA, and β2 medications, and to automatically diagnose asthma, COPD or ACOS. On the technical side, we implemented in Python the automated pipeline behind each compared model, from preprocessing to model scoring. We found at least one good prediction model for each task. We emphasise that all the prediction models evaluated in our study have been trained and tested on real data coming from a Dutch primary care service between 2007 and 2017, preprocessed while carefully avoiding the introduction of distortions, such as those due to over/undersampling techniques. As the results achieved are satisfactory, we plan to advance the current work in two directions. First, to further improve prediction performance by first clustering patients based on similarities amongst clinical variables, and then applying classification separately within clusters; we could also consider using neural networks in combination with techniques for explainable AI, as interpretability of the models is of primary importance in the healthcare domain. Second, to embed our automated and configurable Python pipeline into a web service able to serve models as web resources across different platforms.
Finally, we hope that our "lessons learnt" may serve researchers and clinicians willing to start similar investigations as reference guidelines to avoid common pitfalls in ML pipeline design and development.