Developing a prediction model of children asthma risk using population‐based family history health records

Identifying children at high risk of developing asthma can facilitate prevention and early management strategies. We developed a prediction model of children's asthma risk using objectively collected population‐based children and parental histories of comorbidities.

Asthma is a chronic lung disease characterized by airway inflammation and narrowing. In 2019, over 260 million people had asthma worldwide, and there were about 460,000 asthma-related deaths. 1 Asthma is the most common chronic disease in children. 2,3 [5][6][7] Therefore, early identification of children at high risk of developing asthma is essential to inform prevention and early management that will reduce its burden on children, families and healthcare systems.
[10] Other factors have also been reported to increase children's asthma risk. [12][13][14][15] Family history of diseases represents genetics and shared environments, such as socioeconomic status, region of residence, and environmental exposures. Children's and family histories of comorbidities could be important contributors to the risk of asthma in children, either directly or by serving as proxies of other factors. Therefore, including children's and family health histories in prediction models can potentially improve their performance.
Disease prediction is a valuable approach for identifying individuals or groups at high risk of developing a disease. Existing models to predict the risk of childhood asthma have had, at most, moderate predictive performance and modest generalizability. 16,17 Several asthma prediction models have been externally validated and translated into risk scores for risk stratification in children. However, available prediction models have a high risk of bias due to their reliance on self-reported measures and small sample sizes, which may contribute to model overfitting. Moreover, existing asthma prediction models did not consider children's or parental comorbidities beyond allergic diseases as potential predictors. Therefore, we aimed to use population-level health histories and family linkages in electronic healthcare records to test whether objectively measured maternal, paternal, and childhood histories of comorbid conditions improve children's asthma risk prediction. Identifying important variables in asthma risk prediction in children can facilitate risk stratification work to inform prevention and early management strategies.

Hospital discharge abstracts and medical claims were used to ascertain asthma diagnoses, diagnoses for children's comorbidities, and diagnoses for family health histories. The Statistics Canada Census is a Canada-wide survey conducted every 5 years, which contains neighborhood-level information on sociodemographic characteristics such as employment, education, and income; it was used to create a neighborhood-level measure of socioeconomic status (i.e., income quintile).

| Data source and population
The cohort included all children born in Manitoba from April 1st, 1974 to March 31st, 2000 who had linkage to at least one parent; the index date was April 1st of the year following their first birthday.
Children were followed until an asthma diagnosis, age of 18 years, migration out of the province, death or end of study period (March 31st, 2019), whichever occurred first. Children without at least 1 year of continuous health insurance coverage before and after the index date, which could be due to death or moving out of the province, were excluded. Additionally, children without a parent who had at least 1 year of continuous coverage before the index date were excluded. In families with multiple eligible children, only the oldest child was included in the cohort to avoid clustering within families.

Key Message
This work demonstrates the value of health histories for children and their parents in improving asthma risk prediction models and identifies important variables that can be included in future risk prediction models to facilitate disease prevention and treatment strategies.

| Outcome measure
The outcome was a diagnosis of asthma on or after the index date using a case definition adopted from the Canadian Chronic Disease Surveillance System. The definition required one hospitalization or at least two physician visit records with an asthma diagnosis within 2 years. 19 Table S1 details the ICD codes used in the case definitions of the outcome and predictors.
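The case definition can be read as a simple rule over dated records. The sketch below is an illustration only, not the study's SAS implementation: the function name, the list-of-dates inputs, and the reading of "within 2 years" as a 730-day window are assumptions, and the records are assumed to be already restricted to asthma diagnoses on or after the index date.

```python
from datetime import date


def meets_asthma_case_definition(hospitalizations, physician_visits, window_days=730):
    """Return True if the dated records satisfy the case definition.

    hospitalizations, physician_visits: lists of datetime.date for records
    carrying an asthma diagnosis (hypothetical input format).
    window_days: assumed length of the "2 years" window.
    """
    # One hospitalization with an asthma diagnosis is sufficient.
    if hospitalizations:
        return True
    # Otherwise require two physician visits no more than window_days apart.
    # If any two visits fall within the window, some consecutive pair does too.
    visits = sorted(physician_visits)
    return any(
        (later - earlier).days <= window_days
        for earlier, later in zip(visits, visits[1:])
    )


# Two physician visits about 17 months apart satisfy the rule.
example = meets_asthma_case_definition([], [date(2000, 1, 1), date(2001, 6, 1)])
```

Requiring two physician visits but only one hospitalization reflects the usual reasoning that a single outpatient code is more likely to be a rule-out diagnosis than a hospital discharge code.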

| Model predictors
We developed five asthma prediction models: base, children, maternal, paternal and either parent models. First, the base model predictors included the demographic characteristics at index date of sex (male, female), region (urban [residents of cities with large population density], rural [other residents]) and income quintile (Q1 [lowest] to Q5 [highest]). The predictors also included children's diagnoses of allergic diseases (e.g., allergic rhinitis and dermatitis) or respiratory infections and parental asthma diagnosis recorded in hospital or physician visit records prior to the outcome or end of follow-up, whichever was earlier (Table S1). Income quintile was the only predictor with missing values. Children with missing values on income quintile were assigned a separate sixth category "missing" and modeled accordingly.
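The handling of missing income quintile as its own category can be sketched as an indicator encoding. This is a minimal illustration; the level labels, the `income_` prefix, and the choice of Q1 as the reference level are assumptions rather than details given in the study.

```python
# Five observed quintiles plus the explicit sixth "missing" category.
INCOME_LEVELS = ["Q1", "Q2", "Q3", "Q4", "Q5", "missing"]


def encode_income_quintile(value, reference="Q1"):
    """Indicator-encode an income quintile, treating None as its own category."""
    level = "missing" if value is None else value
    if level not in INCOME_LEVELS:
        raise ValueError(f"unexpected income quintile: {value!r}")
    # Drop the (assumed) reference level so the indicators are not
    # collinear with the model intercept.
    return {
        f"income_{name}": int(level == name)
        for name in INCOME_LEVELS
        if name != reference
    }


encoded_missing = encode_income_quintile(None)
```

Keeping "missing" as a category retains every child in the cohort and lets the model estimate a separate effect for children whose neighborhood income could not be assigned.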
Second, children's comorbid conditions were additionally included and comprised 130 categories of diseases based on an adaptation of the Clinical Classification Software (CCS). The CCS is an open-source classification system that classifies all ICD codes into a smaller number of mutually exclusive and clinically meaningful categories. 20 The CCS was adapted for use within the Manitoba Repository to accommodate three versions and a three-digit truncated form of the ICD system, resulting in 130 mutually exclusive CCS categories. 21 The comorbid conditions were considered "present" if their ICD code appeared at least once in hospitalization or physician visit records prior to the outcome or end of follow-up, whichever was earlier. Finally, comorbid conditions of 130 CCS categories in mothers, fathers or either parent were included as predictors in three models: maternal, paternal, and either parent, respectively.
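The "present if coded at least once before the cutoff" rule can be sketched as follows. The three-entry code map and its category names are hypothetical stand-ins for the 130-category CCS adaptation, which is not reproduced here; the sketch also assumes ICD-9-style codes truncated to three digits.

```python
from datetime import date

# Hypothetical three-digit ICD-9 map; the actual adaptation covers 130
# categories across ICD-8, ICD-9-CM and ICD-10-CA.
CODE_TO_CATEGORY = {
    "477": "allergic_rhinitis",
    "691": "atopic_dermatitis",
    "296": "mood_disorders",
}


def comorbidity_flags(records, cutoff):
    """Build binary comorbidity predictors for one person.

    records: iterable of (record_date, icd_code) tuples from hospital or
             physician visit data (hypothetical input format).
    cutoff:  date of the outcome or end of follow-up, whichever is earlier.
    """
    flags = {category: 0 for category in CODE_TO_CATEGORY.values()}
    for record_date, icd_code in records:
        if record_date >= cutoff:
            continue  # only records prior to the cutoff count
        category = CODE_TO_CATEGORY.get(icd_code[:3])  # three-digit truncation
        if category is not None:
            flags[category] = 1  # a single occurrence marks the condition present
    return flags


example_flags = comorbidity_flags(
    [(date(2005, 1, 1), "477"), (date(2010, 1, 1), "691.8"), (date(2004, 1, 1), "999")],
    cutoff=date(2008, 1, 1),
)
```

The same function could be applied to a mother's or father's records to build the parental predictors, with the cutoff still taken from the child's follow-up.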

| Statistical analysis
Cohort characteristics and comorbidities were described using frequencies and percentages or medians and interquartile ranges in the overall cohort and stratified by outcome asthma status. We reported the top 20 comorbidities that had the greatest difference in prevalence between children with and without asthma. Tetrachoric correlations were estimated for all pairs of predictors; in predictor pairs with a correlation estimate higher than 0.8, the predictor with the lower prevalence was excluded. 22 The base model fitted to the data was a logistic regression model.
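The correlation-based exclusion rule can be sketched for binary predictors as below. The study does not state which tetrachoric estimator was used; this sketch substitutes the closed-form cosine approximation based on the odds ratio of the 2x2 table, which is not a maximum likelihood estimate. The function names and the dictionary-of-columns input are assumptions.

```python
import math


def tetrachoric_approx(a, b, c, d):
    """Cosine approximation to the tetrachoric correlation of a 2x2 table.

    a, d: counts of concordant cells (both 1, both 0); b, c: discordant cells.
    Degenerate tables with an empty discordant cell are treated as r = 1.
    """
    if b == 0 or c == 0:
        return 1.0
    if a == 0 or d == 0:
        return -1.0
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))


def drop_correlated(columns, threshold=0.8):
    """columns: {name: list of 0/1}. Returns the predictor names that are kept."""
    prevalence = {name: sum(values) / len(values) for name, values in columns.items()}
    names = sorted(columns)
    dropped = set()
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if x in dropped or y in dropped:
                continue
            pairs = list(zip(columns[x], columns[y]))
            a = sum(1 for u, v in pairs if u and v)
            b = sum(1 for u, v in pairs if u and not v)
            c = sum(1 for u, v in pairs if not u and v)
            d = sum(1 for u, v in pairs if not u and not v)
            if tetrachoric_approx(a, b, c, d) > threshold:
                # Exclude the predictor with the lower prevalence.
                dropped.add(x if prevalence[x] < prevalence[y] else y)
    return [name for name in names if name not in dropped]
```

Dropping the rarer member of a highly correlated pair keeps the predictor that contributes more observations with the condition.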
Subsequent models included children's comorbid conditions, followed by maternal and/or paternal comorbid conditions (Table S2). Two machine-learning approaches were used to fit the models with children's and parental comorbid conditions, least absolute shrinkage and selection operator (LASSO) logistic regression (LR) and random forest (RF) models, with 10-fold cross-validation. [24][25] LASSO LR is a penalized logistic regression model for high-dimensional data (i.e., when the number of predictors is greater than or equal to the number of observations) that shrinks the coefficients of the less important predictors toward zero. 26 The random forest is a robust prediction tool that fits multiple decision trees and splits each subtree using the best predictor among a randomly selected set of predictors at each split. It then selects a class that best aggregates the results of these trees. 27 For the RF models, hyperparameters that were varied included the number of trees (100, 200, and 500) and the number of variables for each tree node (5, 10, and 20). Other model parameters, such as tree depth and node size, were set to default options. Figure S1 details the model building process using LASSO LR and RF.
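For reference, the shrinkage performed by LASSO LR is usually written as a penalized log-likelihood. The form below is the standard textbook objective, not an equation taken from the study; the symbols n (children), m (predictors), x_i, y_i, and the tuning parameter lambda are notation introduced here.

```latex
\hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
  \;+\; \lambda \sum_{j=1}^{m} \lvert \beta_j \rvert ,
\qquad
p_i = \frac{1}{1+\exp\!\big(-(\beta_0 + x_i^{\top}\beta)\big)}
```

Larger values of lambda push more coefficients to exactly zero, which is how the method doubles as a variable selector; lambda is typically chosen by cross-validation.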
For each model, predicted probability thresholds were chosen based on the distribution of predicted probability.The predicted probability threshold that resulted in the best performance measures was selected to report the final performance for all models.
Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
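The four performance measures follow directly from the confusion matrix at a chosen probability threshold. The sketch below is illustrative; the function name and input format are assumptions, the default threshold of 0.2 is the value given in the note to Table 3, and counting a probability equal to the threshold as positive is also an assumption.

```python
def classification_metrics(y_true, y_prob, threshold=0.2):
    """Sensitivity, specificity, PPV and NPV at a probability threshold.

    y_true: iterable of 0/1 outcomes; y_prob: predicted probabilities.
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominators) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


metrics = classification_metrics([1, 1, 0, 0, 0], [0.5, 0.1, 0.3, 0.1, 0.05])
```

Lowering the threshold trades specificity for sensitivity, so comparing models at a common threshold keeps the reported measures comparable.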
Predictor importance was reported using the odds ratios (OR) for the LASSO LR models and the mean decrease in the Gini index for the RF models. Average model performance and variable importance over the 10 folds were reported. The Gini index is used as the splitting criterion in RF models and measures the impurity of the classification based on a variable. 28 The mean decrease in the Gini index measures how each variable decreases the classification impurity and contributes to the homogeneity in the RF model; the greater the mean decrease in the Gini index, the more important the variable is in the model. Three sensitivity analyses were conducted using LASSO LR.
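The Gini-based importance measure described above can be illustrated for a single split on a binary outcome. This is a sketch of the general idea, with hypothetical function names; a random forest implementation accumulates such decreases over every split that uses a variable and averages them over the trees, and the exact weighting may differ between software packages.

```python
def gini_impurity(labels):
    """Gini impurity of a node holding binary (0/1) outcome labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def gini_decrease(parent, left, right):
    """Impurity decrease when a parent node is split into two children."""
    n = len(parent)
    weighted_children = (
        len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
    ) / n
    return gini_impurity(parent) - weighted_children


# A split that separates the classes perfectly removes all impurity,
# while a split that leaves both children mixed removes none.
perfect = gini_decrease([1, 1, 0, 0], [1, 1], [0, 0])
useless = gini_decrease([1, 1, 0, 0], [1, 0], [1, 0])
```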

| Cohort description
A total of 195,666 children were included in the analysis (Figure 1). Table 2 shows the top 20 comorbidities in the study cohort, selected based on having the highest prevalence imbalance between children with and without asthma. Children's joint disorders, mood and anxiety disorders, and menstrual disorders were the comorbidities with the greatest standardized differences and were more common in children without an asthma diagnosis.

FIGURE 1 Flowchart for the development of the study cohort.
Figure S2 shows the receiver operating characteristic (ROC) curve for the base model. There were no differences in the performance of asthma risk prediction models when we included maternal compared to paternal comorbidities (Table 3). Figure S2 shows the ROC curves for offspring and parental models based on LASSO LR. Table S4 summarizes the results of the three sensitivity analyses.
Stratifying the models based on mother's decade of birth showed that model performance varied significantly over the study period. Specifically, sensitivity increased and specificity decreased over time.
Similarly, model performance differed between children with and without paternal linkage, with a lower sensitivity and higher specificity in those with paternal linkage, but similar estimates of PPV. Finally, splitting allergic diseases and respiratory infections into more specific categories did not significantly affect model performance.

| DISCUSSION
We used population-based family health histories from administrative healthcare databases to test their contribution to asthma risk prediction in children. Children's and parental health histories were found to improve the performance of asthma risk prediction models. This improvement was observed with both LASSO LR and RF models. Mood and anxiety disorders and menstrual disorders were among the most important children's comorbidities retained in the prediction models. Lipid metabolism disorders, asthma, and menopausal disorders were the most important predictors among parental comorbidities.
The best-performing predictive model included children's and parental comorbidities and resulted in performance metrics aligned with previously published prediction models. For example, the Pediatric Asthma Risk Score (PARS) resulted in a sensitivity of 0.68 and specificity of 0.77; the Asthma Predictive Index (API) resulted in a sensitivity of 0.57 and specificity of 0.81. 29,30 Although existing prediction models included clinical data such as wheezing, eosinophilia, and skin prick test-confirmed sensitization, our models without clinical data produced comparable metrics. This suggests that the included children's and parental comorbidities collectively added value to predicting asthma risk, either directly or as proxies for clinically important variables. The prediction models in this work achieved a high NPV but a low PPV, indicating the models are better at predicting those who will not have asthma than those who will have asthma.
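The combination of a high NPV and a low PPV follows from Bayes' rule when the outcome is relatively uncommon. The sketch below is an illustration under stated assumptions: it plugs in the sensitivity (0.72) and specificity (0.69) reported for the LASSO LR model with parental comorbidities and assumes the prevalence in each evaluation fold is close to the cohort-wide 17.7% with an asthma diagnosis. The function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity and outcome prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


ppv, npv = predictive_values(sensitivity=0.72, specificity=0.69, prevalence=0.177)
print(round(ppv, 2), round(npv, 2))  # prints 0.33 0.92
```

Because children without asthma outnumber children with asthma by more than four to one, even a moderate false-positive rate yields more false positives than true positives, which keeps the PPV low while the NPV stays high.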
A high NPV combined with a low PPV is common in existing asthma risk prediction models as well; PARS achieved a PPV of 0.37 and an NPV of 0.93. Similarly, API achieved a PPV of 0.26 and an NPV of 0.94. 29,30 Both LASSO LR and RF models produced comparable performance. In addition, the most important predictors of asthma risk were consistent in the two models. Among the comorbidities that contributed to improving asthma risk prediction, children's menstrual disorders and mood and anxiety disorders, and parental lipid metabolism disorders and menopausal disorders, were the variables with the highest contribution in both LASSO LR and RF models. [32][33] This is largely attributed to the fluctuations of sex hormones and their effect on the inflammatory and immune processes involved in asthma pathophysiology. Menopause has been linked to improvement in asthma symptoms; menopausal disorders are a proxy measure for menopause. 31

TABLE 3 Summary of asthma risk prediction models performance.
Note: The performances are based on a predicted probability threshold of 0.2.

The mean decrease in the Gini index measures how each variable decreases the classification impurity and contributes to the homogeneity in the random forest model; the higher the mean decrease in the Gini index, the more important the variable is in the model.
environmental factors. 34,35 Hyperlipidemia and obesity were also linked to asthma; those conditions have genetic and environmental influences shared with parents. 36,37 Except for menopause, which was previously reported to decrease asthma risk, the other comorbidities, such as mental and menstrual disorders, were reported to increase asthma risk. In our models, these comorbidities were inversely associated with asthma risk. Therefore, it is likely that these comorbidities are proxies for other social, environmental or genetic factors that inversely influence asthma risk and access to health care. The most important predictors identified in this study are comorbidities that develop later in childhood, which suggests that the developed models are more useful for predicting asthma in older children.

Finally, there are other machine-learning models that could be explored in future research, such as an artificial neural network. [25]

In summary, this work highlights the value of including health histories for children and their parents in asthma risk prediction models. Health histories can potentially improve predictive models at the population level, where individual clinical data are lacking.
Future work is recommended to identify the biological mechanisms and characterize the association between children and parental comorbidities with asthma risk.
This cohort study used population-based administrative healthcare databases in Manitoba, Canada, contained in the Manitoba Population Research Data Repository (Repository) from 1974 to 2019. 18 The universal and publicly funded healthcare system allows capturing almost the entire Manitoban population of over 1.3 million individuals and most of their healthcare encounters (e.g., hospital and outpatient physician visit records) in the Repository. This study used four databases from the Repository: Manitoba Health Insurance Registry (Registry), hospital discharge abstracts, medical claims, and Statistics Canada Census.

The Registry includes all individuals receiving health insurance in Manitoba since 1970. It contains health registration information; demographic characteristics, such as sex, age, and residence; and family composition information. Families are identified in the Registry using family registration numbers assigned to children and adults living in the same household. These numbers change over time, allowing for tracking changes in family composition. The Registry was used to create the study cohort, ascertain their demographic characteristics and establish parental linkages.

Hospital discharge abstracts capture hospitalization records for the Manitoba population since 1970 and include up to 25 diagnoses recorded using the International Classification of Diseases (ICD) in the eighth revision (ICD-8) from 1970 to 1979, ninth revision with clinical modifications (ICD-9-CM) from 1980 to 2004, and tenth revision with Canadian adaptations (ICD-10-CA) since 2005. Medical claims contain information about outpatient physician visits, including diagnoses and performed procedures, since 1970; a single ICD code is captured in each claim. ICD-8 (in three-digit truncated form) was used from 1970 to 1979; ICD-9-CM has been used since 1980 (in three-digit truncated form until 2015 and full length thereafter).

First, to address potential misclassification due to diagnosis and coding changes over the study period, we stratified the prediction models based on decade of maternal birth year. Second, to assess the influence of selection bias in those without paternal linkages, we compared the prediction models in those with and without linkage to fathers. Finally, we used more granular base model variables by splitting allergic diseases into allergic rhinitis, contact dermatitis and atopic dermatitis, and respiratory infections into acute respiratory infections, pneumonia, and influenza.

All analyses were performed using SAS ® version 9.4 (SAS Institute, Cary, North Carolina, United States). The SAS procedures HPGENSELECT and HPFOREST were used for the LASSO LR and RF models, respectively. This study was approved by the Health Research Ethics Board of the University of Manitoba, and the Manitoba Health Information Privacy Committee approved data access.
Table 1 summarizes the baseline characteristics of the study cohort: 51.3% were males and 56.4% were urban residents. Overall, 13.6% had a parental asthma diagnosis, and 17.7% had received an asthma diagnosis. The median age at diagnosis was 6.0 years (interquartile range 3.0 to 11.0 years). The median duration of follow-up was 16.3 years (interquartile range 16.1 to 16.7 years). Children with asthma were more likely to be males and urban residents, have an allergic disease or respiratory infection, and have parents with asthma diagnoses.

Adding children's comorbidities to the base model improved sensitivity, PPV and NPV (LASSO LR: sensitivity = 0.71 [0.69-0.72], specificity = 0.63 [0.63-0.64], PPV = 0.29 [0.28-0.30], NPV = 0.91 [0.90-0.91]; RF: sensitivity = 0.69 [0.68-0.71], specificity = 0.65 [0.64-0.65], PPV = 0.29 [0.28-0.30], NPV = 0.91 [0.90-0.91]). The performance further improved by additionally incorporating parents' comorbidities (LASSO LR: sensitivity = 0.72 [0.70-0.73], PPV = 0.33 [0.32-0.34], NPV = 0.92 [0.91-0.92]; RF: sensitivity = 0.70 [0.68-0.71], PPV = 0.33 [0.32-0.34], NPV = 0.91 [0.91-0.92]). Specificity also improved (LASSO LR: specificity = 0.69 [0.69-0.70]; RF: specificity = 0.70 [0.69-0.70]).

The study's strengths are the large and diverse population, the wide range of objectively collected health conditions (i.e., physician-diagnosed and recorded in administrative healthcare data) captured over a long period of time, and the comparison of two machine-learning models, each of which can efficiently select important predictors. The limitations include the lack of clinical information in the study data source, such as wheezing, cough, and eosinophilia. Future studies that combine children's and parental comorbidities with such clinical predictors could significantly improve the prediction of childhood asthma. Second, the severity of asthma and other comorbidities is not captured in the study data; the performance of the prediction models may vary by asthma severity. Third, model performance may have been influenced by misclassification of asthma and health histories, which could have resulted from coding errors, changes in diagnostic criteria over time, healthcare access issues, or transitions between ICD versions over the study period. In one of the sensitivity analyses, stratifying the models into four time periods showed variations in model performance over time, with the highest sensitivity and lowest specificity in the most recent study years. The models became less likely to miss true asthma cases, with the caveat of including more false positives (i.e., children without asthma are more likely to be identified as asthma cases in the latest study years). Fourth, including only the oldest child in families with multiple children improves the homogeneity of the cohort in terms of asthma risk factors related to birth order, but it may affect the generalizability of the prediction models to children who are not firstborn. Fifth, the lack of complete family linkages for all the children could result in selection bias. In a sensitivity analysis, we compared maternal and offspring models in those with and without paternal linkages and observed some difference in model performance, although PPV estimates were consistent in both groups. Additionally, the predictive models were not validated in a different population; this validation can contribute to the assessment of generalizability of the study findings. Future research is recommended to evaluate the performance of predictive models in other populations.

Table 5
Frequencies (%) of the top 20 comorbidities in the study cohort (N = 195,666), stratified by asthma diagnosis.
Children's joint disorders, mood and anxiety disorders, and menstrual disorders were the top three predictors. Among parental comorbidities, lipid metabolism disorders, asthma, and menopausal disorders were the strongest predictors of asthma risk in both the RF and LASSO LR models. Unlike parental asthma, parental lipid metabolism and menopausal disorders had an inverse relationship with asthma risk.

TABLE 2 Note: Top 20 comorbidities selected based on having the greatest standardized difference between children with/without asthma. Data are presented as number (percentage).

It is unknown how parental menopause could affect asthma risk in children. Still, it might be related to pathophysiolog-

Odds ratios and 95% confidence intervals for the asthma base model predictors.
Top 20 predictors among child and parental comorbidities (ordered from the most important based on odds ratios or Gini mean decrease).
a Odds ratios <1 indicate decreased asthma risk (protective effect); odds ratios above 1 indicate increased asthma risk. b