Prediction of spontaneous preterm birth using supervised machine learning on metabolomic data: A case–cohort study

To identify and internally validate metabolites predictive of spontaneous preterm birth (sPTB) using multiple machine learning methods and sequential maternal serum samples, and to predict spontaneous early term birth (sETB) using these metabolites.


| INTRODUCTION
Preterm birth (PTB) is a leading cause of mortality in neonates and children under 5 years worldwide. 1,2 PTB can be indicated for medical reasons or it can occur spontaneously. Spontaneous preterm birth (sPTB) accounts for approximately 70% of all PTB and involves multiple pathological processes. 3,4 Despite sPTB being a major contributor to mortality, long-term disability, and healthcare costs, 2,5,6 its pathophysiology is not completely understood. Identifying pregnancies at risk remains an important challenge.
Pregnancy initiates substantial and widespread metabolic changes. 73][14][15][16][17][18][19][20][21][22][23][24] However, metabolomics analyses involve complex data sets, with a vast number of predictors and relatively small sample sizes, known as the 'large p, small n problem'. 25 The sparsity of large-scale metabolomic data and the nonlinear interactions have made traditional statistical methods less suitable. 26 Machine learning methods have become preferred for the statistical analysis of metabolomics data sets because of their inherently nonlinear data representation and their ability to process large and heterogeneous data sets rapidly. 27,28 However, according to a recent systematic review, the methodological quality (e.g. small sample size and selection of predictors based on univariable analysis) and reporting of machine learning-based prediction models for PTB have generally been poor. 29 The aim of the present study was to identify and validate metabolites predictive of sPTB using multiple machine learning methods and sequential maternal serum samples from a large, prospective pregnancy cohort. This study addressed the limitations of previously published PTB models by using rigorous methodology (feature selection based on multiple models, use of multiple temporally separated sampling points, and internal validation).

| Pregnancy Outcome Prediction study
The Pregnancy Outcome Prediction (POP) study was a prospective cohort study of nulliparous women with a viable singleton pregnancy attending the Rosie Hospital, Cambridge, UK, between January 2008 and July 2012. Information on maternal characteristics was collected through questionnaires, clinical records, or linkage to the hospital's electronic databases. The participants underwent phlebotomy and fetal biometry at around 12, 20, 28 and 36 weeks of gestation. Outcome data were retrieved from case records and electronic databases. The study protocol and the cohort have been described in detail elsewhere. 30,313][34] Written informed consent was obtained from all study participants. There was no patient or public involvement, nor was there a core outcome set for the present study.
A case-cohort design within the POP study was used in the metabolomics analysis, described in detail elsewhere. 35 Spontaneous PTB was defined as delivery in the absence of induction of labour or elective caesarean section at or after 28 and before 37 completed weeks of gestation. Among the participants who had the 28-week measurement available, there were 98 sPTB. Term births from the random sub-cohort (n = 297) were used as controls. Additionally, there were four medically indicated preterm births (iPTB) in the sub-cohort.
For the purpose of internal validation using term births from the sub-cohort, spontaneous early term birth (sETB) was defined as delivery in the absence of induction of labour or elective caesarean section at or after 37 but before 39 weeks of gestation. Indicated early term birth (iETB) in the same gestational age window was defined as a competing event.

| Metabolomics analyses
Sequential metabolite measurements from maternal serum samples were made at around 12, 20, 28, and 36 weeks of gestation. An untargeted metabolomic analysis was performed by Metabolon Inc., blinded to the patients' clinical information and pregnancy outcome. Altogether, 837 metabolites of known structural identity were measured. 35 They were expressed as scaled imputed values, i.e. multiples of the median. 36
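As an illustration of what "multiples of the median" means in practice, the following is a minimal sketch for a single metabolite. The function name and the imputation rule (replacing missing values with the smallest observed scaled value) are assumptions made for this example; the section itself does not state how missing values were imputed.

```python
from statistics import median


def scale_to_median(values):
    """Express one metabolite's measurements as multiples of its median.

    Missing values are given as None. They are imputed with the smallest
    observed scaled value, which is an ASSUMPTION of this sketch and not
    a rule stated in the text.
    """
    observed = [v for v in values if v is not None]
    med = median(observed)
    # Divide every observed value by the metabolite's median.
    scaled = [None if v is None else v / med for v in values]
    # Hypothetical imputation rule: use the minimum observed scaled value.
    floor = min(s for s in scaled if s is not None)
    return [floor if s is None else s for s in scaled]


raw = [200.0, 400.0, None, 800.0]  # illustrative raw intensities
scaled = scale_to_median(raw)
print(scaled)
```

After scaling, a value of 2.0 means the measurement was twice the median for that metabolite, which makes metabolites with very different raw intensities comparable.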

| Classification, prognosis and feature selection
Details about classification and prognosis prediction methods and software are given in the Supplementary Methods in Appendix S1. For each classification and prognosis method, the most important metabolites and pathways in relation to sPTB were reported. The intersection between the discriminative features identified by different methods was visualised using UpSet plots. 37 The classification models were retrained with the features that were selected as important by at least two methods.
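The rule of keeping features selected by at least two methods can be sketched as a simple count across per-method feature lists. The method names and metabolite labels below are placeholders for illustration only, not the study's actual selections.

```python
from collections import Counter


def features_selected_by_at_least(top_features_by_method, min_methods=2):
    """Return the features that appear in the top list of >= min_methods methods."""
    counts = Counter()
    for features in top_features_by_method.values():
        # set() guards against a feature being listed twice by one method.
        counts.update(set(features))
    return sorted(f for f, c in counts.items() if c >= min_methods)


# Hypothetical top-feature lists (placeholders, not study results).
top = {
    "random_forest": ["m1", "m2", "m3"],
    "gbm": ["m2", "m3", "m4"],
    "cox": ["m3", "m5"],
}
shared = features_selected_by_at_least(top, min_methods=2)
print(shared)
```

Raising `min_methods` gives a stricter consensus set, analogous to the later restriction to metabolites identified by at least three methods.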

| Internal validation
We further investigated the predictive ability of a selection of metabolites by (1) building a prediction model for sPTB that included a smaller sub-selection of metabolites at 28 weeks of gestation, (2) applying the model coefficients to 12-week and 20-week measurements to predict sPTB, and (3) applying the model coefficients to 28-week and 36-week measurements to predict a different outcome, sETB. Additionally, 36-week measurements were analysed in relation to spontaneous birth at any term gestational age. Details of these steps are described in the Supplementary Methods in Appendix S1.
Briefly, penalised logistic regression and best subsets selection were applied to the 28-week measurements. Model performance was assessed using the Akaike Information Criterion and Bayesian Information Criterion. 38 Optimism-corrected area under the receiver operating characteristic curve (ocAUC) was reported. 39 Predicted probability of sPTB was calculated by applying coefficients from the best models to the 12-week and 20-week measurements (n = 402 and n = 405, respectively). The same coefficients were then applied to predict the probability of sETB, using 28-week and 36-week measurements (n = 266 and n = 264, respectively). AUC was reported with 95% confidence intervals at each week of gestation. Competing risks regression was fitted for predicted probabilities and sETB. 40 Subdistribution hazard ratios (sHR) with 95% confidence intervals and cumulative incidence curves were reported. Finally, the 36-week measurements were analysed in relation to spontaneous birth at any term gestation using Cox regression (Supplementary Methods in Appendix S1).
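For orientation, a commonly used bootstrap form of the optimism correction is shown below. Whether this exact scheme was used here is not stated in this section (the details are in Appendix S1), so the formula should be read as an assumption. The symbols are introduced for this illustration: $B$ is the number of bootstrap resamples, $\mathrm{AUC}_{\mathrm{app}}$ is the apparent AUC of the model fitted and evaluated on the full data, $\mathrm{AUC}_{b}^{\mathrm{boot}}$ is the AUC of the model refitted on resample $b$ and evaluated on that resample, and $\mathrm{AUC}_{b}^{\mathrm{orig}}$ is the AUC of the same refitted model evaluated on the original data.

```latex
\mathrm{ocAUC}
  = \mathrm{AUC}_{\mathrm{app}}
  - \frac{1}{B}\sum_{b=1}^{B}
    \left( \mathrm{AUC}_{b}^{\mathrm{boot}} - \mathrm{AUC}_{b}^{\mathrm{orig}} \right)
```

The subtracted term estimates how much the apparent AUC overstates performance because the model was evaluated on the data used to fit it.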

| Characteristics of participants
The case-cohort study population included 399 participants: 98 (24.6%) sPTB cases, 297 (74.4%) term controls from the sub-cohort and 4 (1.0%) iPTB from the sub-cohort. Selected maternal and pregnancy characteristics of the sPTB cases and the term controls are reported in Table S1. There were no substantial differences in the maternal characteristics (age, body mass index, or self-reported smoking) between the cases and controls, although maternal height was on average about 1 cm lower in the cases (p = 0.02).

| Classification models
The performance of the six machine learning algorithms was described using the AUC (Table 1). When trained on all features, the random forest model performed best (AUC = 0.61), followed by the generalised boosted model (GBM), linear discriminant analysis (LDA) and support vector machine (SVM) models (AUC = 0.59). The top 30 most discriminative metabolites from each method, based on variable importance score, were included in the intersection. Forty-seven metabolites were identified by at least two methods (see Feature selection below). The models were retrained on the top 47 metabolites identified by multiple methods. This significantly improved the performance metric for all models. The random forest remained the best model (AUC = 0.73) followed by the GBM model (AUC = 0.71).

| Prognosis prediction model
No metabolite violated the proportional hazards assumption. The Cox proportional hazards model yielded a C-index of 0.61 (Kaplan-Meier curves in Figure S1). Participants in the lowest quartile of risk had a significantly higher probability of remaining pregnant, i.e. avoiding sPTB, than the rest (p = 0.00012). In a dichotomised analysis (Figure S2), the probability of sPTB for participants in the high-risk group was greater (p = 0.01).

| Feature selection
Forty-seven metabolites were identified as important (based on variable importance score) by at least two methods, including penalised logistic regression, Rpart, SVM/LDA, GBM, random forest and Cox proportional hazards model. SVM and LDA were presented together because they led to the same selection of metabolites, and Cox proportional hazards model was added, giving a total number of six methods. The Rpart and penalised logistic regression models had the highest number of important metabolites in common (n = 8). Seven metabolites were identified by at least four methods. No metabolite was identified by all six methods, probably because of collinearity. The 47 metabolites common to multiple methods are presented in Table S2, and all intersections are visualised in the UpSet plot (Figure 1).

| Internal validation
Of the 47 metabolites, 22 were identified by at least three methods and were included in internal validation. The analysis of 28-week measurements included 98 sPTB cases and 297 term controls. Penalised logistic regression retained 15 out of the 22 metabolites and best subset selection was performed on these 15 metabolites. The best 7-predictor model had the lowest Akaike Information Criterion and the 4-predictor model had the lowest Bayesian Information Criterion. The results for the best 1- to 10-predictor models are listed in Table S3. The ocAUC for the 7-predictor model was 0.707 and the ocAUC for the 4-predictor model was 0.703.
A 4-predictor model was considered preferable for its simplicity and for its performance being similar to that of the 7-predictor model.
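The different choices made by the two criteria follow from their penalty terms. In their standard definitions (given here for orientation; the exact likelihood and parameter count used are described in Appendix S1), with $k$ the number of estimated parameters, $n$ the sample size and $\hat{L}$ the maximised likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Assuming $n = 395$ (the 98 sPTB cases and 297 term controls in the 28-week analysis), $\ln n \approx 5.98$, so each additional parameter is penalised about three times as heavily under BIC as under AIC. This is consistent with BIC favouring the smaller 4-predictor model while AIC favoured the 7-predictor model.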
The model coefficients were applied to the 12-week and 20-week measurements to predict sPTB (measurements were available for n = 110 cases and n = 292 controls at 12 weeks of gestation and for n = 108 cases and n = 297 controls at 20 weeks of gestation). The estimated predicted probability from the 4-predictor model discriminated sPTB at 20 weeks with an AUC of 0.657 (95% CI 0.597-0.717), and the discrimination was weaker at 12 weeks with an AUC of 0.606 (95% CI 0.544-0.667), as expected (Figure 2A). The single most predictive metabolite at 28 weeks of gestation, 1-palmitoleoyl-GPE (16:1)*, had an AUC of 0.630 (95% CI 0.569-0.690) at 20 weeks of gestation and a slightly weaker discrimination at 12 weeks of gestation (AUC 0.609, 95% CI 0.548-0.670) (Figure 2B). This demonstrates that the single metabolite discriminates sPTB cases and controls in a similar way to the 4-predictor model at the earlier gestational ages. The means of the z scores (95% CI) of all four metabolites at each gestation by sPTB status are presented in Figure S4.
The model coefficients were applied to the 28-week and 36-week measurements to predict sETB (measurements were available for n = 13 sETB cases and n = 253 controls at 28 weeks of gestation and for n = 14 cases and n = 250 controls at 36 weeks of gestation). The 4-predictor model performed slightly better in predicting sETB from 36-week measurements than in predicting sPTB from 28-week measurements, with an AUC of 0.727 (95% CI 0.606-0.849) (Figure 2A). The single metabolite 1-palmitoleoyl-GPE (16:1)* had an AUC of 0.739 (95% CI 0.618-0.860) at 36 weeks of gestation for sETB (Figure 2B). The association with the single predictive metabolite was statistically significant at 28 weeks of gestation but the association with the 4-predictor model was weaker. The confidence intervals were wide in the analyses of sETB. However, the discrimination of sETB using 36-week measurements was similar to or better than the discrimination of sPTB using 28-week measurements, although the sETB cases were part of the control group in the sPTB analyses. The means of the z scores (95% CI) of all four metabolites at each gestational age by sETB status are presented in Figure S5.

FIGURE 3
Cumulative incidence of spontaneous early term birth between 37+0 and 38+6 weeks of gestation, using the 4-predictor model (A) and the single most predictive metabolite 1-palmitoleoyl-GPE (16:1)* (B). The group in the highest quintile of predicted risk (solid line) from the 4-predictor model (A) or 1-palmitoleoyl-GPE (16:1)* (B) was compared with all others (dashed line) using the competing risks method.

| Main findings
We applied multiple machine learning methods to identify and internally validate metabolites predictive of sPTB using sequential maternal serum samples from a large, prospective cohort study of nulliparous women. We reported the most important metabolites as those identified by multiple methods (classification and prognosis prediction). The majority of the 22 metabolites identified at 28 weeks of gestation by at least three methods were lipids. The most common sub-pathways were lysolipid and phospholipid metabolism. We reduced the number of predictors further and identified a 4-predictor model that also discriminated sPTB cases and controls when applied to measurements from 12 and 20 weeks of gestation. Furthermore, the same model discriminated sETB cases and controls when applied to measurements from 28 and 36 weeks of gestation. At 36 weeks of gestation, the most predictive metabolite (a lysolipid) discriminated sETB cases similarly to the 4-predictor model.

| Interpretation
The observation that the same predictors were associated with both sPTB and sETB is interesting, although not surprising. Preterm is defined on the basis of a somewhat arbitrary gestational age threshold of <37 weeks of gestation. Many of the sPTB events occurred at around 35-36 weeks of gestation. 2][43] Hence, early term delivery is of clinical interest, and sPTB and sETB are related events on a continuum. Our analysis supports the concept that sPTB and sETB may share final common pathways and the predictive metabolites may be part of one or more of those pathways. The fact that the associations derived from the analysis of sPTB were actually more strongly predictive of sETB is also not unexpected. Generally, biochemical predictors of pregnancy complications in late pregnancy are more strongly associated when measured closer to the clinical manifestation of the complication. 44 In the sPTB analysis, measurements were made at around 28 weeks of gestation but the events occurred up to 9 weeks later. In contrast, in the sETB analysis, measurements were made at around 36 weeks of gestation and the events occurred about 1-3 weeks later. We believe the more proximal sampling time-frame in the latter analysis explains the stronger prediction, even though the model coefficients were actually derived from analysis of sPTB. Developing this argument further, we speculate that 1-palmitoleoyl-GPE (16:1)* might be an even better discriminator when women present with symptoms of preterm labour, and this would be an appropriate area for further study. If this proved to be the case, measurement of this lipid in maternal serum might be useful clinically in triaging women who present with symptoms of preterm labour but where the diagnosis remains uncertain. Accurate triage is a critical element of management of preterm labour as it determines the use of a number of interventions which can be potentially harmful and/or expensive, such as admission to hospital, use of tocolytic therapy and administration of glucocorticoids. 45 In a general pregnant population, our findings may not have a useful clinical application because the predictive performance of the metabolites was moderate. However, the findings can be helpful in terms of understanding the pathophysiology of sPTB.
Physiologically, maternal lipid profiles change throughout pregnancy. In early pregnancy, there is an anabolic phase in the adipose tissue with an increase in lipid synthesis and fat storage, whereas in late pregnancy, lipid metabolism transitions to a catabolic phase characterised by a net breakdown of maternal fat deposits. 46 We could not identify published studies on the most strongly associated metabolite, the lysolipid named 1-palmitoleoyl-GPE (16:1)*. However, through its Human Metabolome Database ID (which is HMDB0011474; https://hmdb.ca/metabolites/HMDB0011474, accessed 11 September 2023) we found that it has a link to a protein (enzyme) Ectonucleotide pyrophosphatase/phosphodiesterase family member 2 (gene ENPP2) which may have a role in induction of parturition. 47 The lipid metabolites that we have identified could be further investigated to identify the specific processes and pathways involved in sPTB. The two metabolites strongly associated with a higher risk of sPTB, 1-palmitoleoyl-GPE (16:1)* and 1-stearoyl-2-docosahexaenoyl-GPE (18:0/22:6)*, were also elevated in the second and third trimesters in women who developed term pre-eclampsia in the POP study. 35 Further studies could investigate the role of these metabolites in the pathophysiology of pregnancy complications.
Published studies on the metabolomics of sPTB have typically relied on a small number of cases. The comparison of the present study with previous studies is difficult because of variation in the design and conduct of the published studies, the sample types, gestational ages when samples were drawn, metabolomics platforms used for analysis, and the statistical or machine learning approaches used. Some published studies have combined sPTB and iPTB as a single outcome, and the reporting of these studies has typically been inadequate, 29 making quantitative comparison difficult. Previous studies have treated sPTB as a binary outcome rather than using a time-to-delivery approach, mainly due to restrictions in study design.

| Strengths and limitations
The identification of common metabolites between the classification models (binary outcome) and the prognosis prediction model (time-to-delivery) is a methodological strength of this study. Additionally, we were able to validate our model internally by applying it to metabolite measurements from different gestational ages in the analysis of (1) the same outcome (sPTB) and (2) a different but biologically related outcome (sETB). These analyses provide strong evidence for the robustness of our findings. Had the associations between the metabolites and sPTB been false discoveries, we would not have expected to see associations with sETB. Moreover, in line with our findings, some of the published studies have suggested that lipid metabolism could be involved in the development of sPTB. 13,19
The present study also has some limitations. First, the POP study was a single-centre study in one city in the UK. Second, 93% of participants were of white European ethnicity, all were nulliparous, and the vast majority lived outside areas of socio-economic deprivation. 35,48 This limitation in diversity may limit the external validity of the findings. However, previous POP study findings have been successfully validated in the Born in Bradford (BiB) study, whose population is significantly more diverse. 11,35,36 For the present study, external validation using the BiB study samples was not performed because of the unavailability of some of the metabolites included in the models developed in the POP study (https://osf.io/axqsu, accessed 11 September 2023). Third, some of the metabolites we reported were not officially confirmed based on a standard. However, Metabolon Inc. is highly confident in their identity. Fourth, we did not consider additional predictors of sPTB reported in the literature, such as maternal smoking. Self-reported smoking obtained from the 20-week questionnaire was not associated with sPTB in the POP study. An objective measurement of tobacco exposure would be preferable. Cotinine (a tobacco metabolite) was among the 812 metabolites included in the machine learning analyses of 28-week measurements. It was not sufficiently predictive of sPTB in the POP study and therefore it was not included among the 47 metabolites identified as important by at least two models. A recent, separate analysis of the cotinine metabolite in the POP study suggested an increased risk of sPTB if the mother smoked consistently throughout pregnancy but there was no evidence for an elevated risk when the smoking pattern was inconsistent. 49 Fifth, our main analysis required the availability of a 28-week maternal serum sample and hence we did not include early sPTB before 28 weeks of gestation in the analysis. We did not perform a separate analysis of earlier samples for sPTB before 28 weeks of gestation because of the inadequate number of such cases in the POP study (n = 5). This outcome would be better studied in a much larger pregnancy cohort or in a meta-analysis of multiple smaller cohorts.
To conclude, we have identified and internally validated metabolites from maternal serum that are predictive of the risk of sPTB. These were largely related to lipid metabolism and require further validation in external populations.

AUTHOR CONTRIBUTIONS
GCSS, DSC-J and US contributed to study concept and design. Analysis and interpretation of data were by YAG, US, LXG, YD and GCSS. The manuscript was drafted by YAG and US, and critically revised for important intellectual content by all authors. Final approval of the version to be published was given by all authors (YAG, US, GCSS, DSC-J, LG, YD).

ACKNOWLEDGEMENTS
We are grateful to the POP study participants, staff who recruited and assessed them, and the laboratory technicians who managed the biological samples.

CONFLICT OF INTEREST STATEMENT
We have the following disclosures outside the area of the submitted work. GCSS reports research support (financial and in kind) from Roche Diagnostics, and financial support of research from GSK and Sera Prognostics. GCSS has been paid to attend advisory boards by GSK and Roche Diagnostics, has acted as a paid consultant to GSK and is a member of a Data Safety and Monitoring Committee for a GSK vaccine trial. DSC-J reports research support (financial and in kind) from Roche Diagnostics. Remaining authors: none declared.

DATA AVAILABILITY STATEMENT
Since the individual patient data contain confidential information, they can be supplied only in an anonymised format to suitably qualified researchers who can make appropriate institutional commitments relating to data security and confidentiality. The corresponding author will on request detail the restrictions and any conditions under which access to some data may be provided. Data requests can be made to the corresponding author.

FIGURE 1
UpSet plot showing the intersection of metabolites identified by models trained on metabolite data.

FUNDING INFORMATION
The work was supported by the National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (Women's Health theme; BRC-1215-20014), the Medical Research Council (United Kingdom; G1100221), and the NIHR Cambridge Clinical Research Facility. Ulla Sovio received additional support from the Cambridge Reproduction Strategic Research Initiative Development Fund. Lana Garmire received funding from the US National Library of Medicine (R01 LM012373, R01 LM12907) and the National Institute of Child Health and Human Development (R01 HD084633). The funders of the study had no role in study design, in the collection, analysis, or interpretation of data, in the writing of the report or in the decision to submit the paper for publication. The NIHR Cambridge Biomedical Research Centre is a partnership between Cambridge University Hospitals NHS Foundation Trust and the University of Cambridge, funded by the NIHR. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising.

Model on top 47 metabolites
Note: The mean AUC value (95% CI) is given. The data set was randomly split into training data (70%) and testing data (30%) 100 times. The mean value and standard deviation of the 100 repeats are shown for the AUC.
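The resampling scheme described in this table note (100 random 70%/30% splits, with the AUC summarised across repeats) can be sketched as follows. This is an illustration only: the "model" is a deliberately trivial single-feature rule, the split is stratified by outcome (an assumption made so that every test set contains both classes), the data are synthetic, and all function names are hypothetical. The study's actual classifiers are described in Appendix S1.

```python
import random
from statistics import mean, stdev


def auc(scores, labels):
    """Rank-based AUC: chance that a random case outscores a random control.

    Ties between a case and a control count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_split_auc(x, y, n_repeats=100, train_fraction=0.7, seed=0):
    """Mean and SD of the test-set AUC over repeated random train/test splits."""
    rng = random.Random(seed)
    cases = [i for i, label in enumerate(y) if label == 1]
    controls = [i for i, label in enumerate(y) if label == 0]
    aucs = []
    for _ in range(n_repeats):
        train, test = [], []
        # ASSUMPTION: split cases and controls separately (stratified split).
        for group in (cases, controls):
            shuffled = rng.sample(group, len(group))
            cut = int(train_fraction * len(group))
            train += shuffled[:cut]
            test += shuffled[cut:]
        # Toy "training" step: learn only the direction of the effect.
        case_mean = mean(x[i] for i in train if y[i] == 1)
        control_mean = mean(x[i] for i in train if y[i] == 0)
        sign = 1.0 if case_mean >= control_mean else -1.0
        aucs.append(auc([sign * x[i] for i in test], [y[i] for i in test]))
    return mean(aucs), stdev(aucs)


# Synthetic data: 60 controls with values 0..9 and 40 cases with values 5..14.
x = [float(i % 10) for i in range(60)] + [float(5 + i % 10) for i in range(40)]
y = [0] * 60 + [1] * 40
mean_auc, sd_auc = repeated_split_auc(x, y)
print(f"mean AUC = {mean_auc:.3f}, SD = {sd_auc:.3f}")
```

Evaluating only on the held-out 30% in each repeat keeps the reported AUC from being inflated by fitting and testing on the same participants; the spread across repeats shows how sensitive the estimate is to the particular split.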