Methods to generate and validate a Pregnancy Register in the UK Clinical Practice Research Datalink primary care database

Abstract

Purpose
Primary care databases are increasingly used for researching pregnancy, eg, the effects of maternal drug exposures. However, ascertaining pregnancies, their timing, and outcomes in these data is challenging. While individual studies have adopted different methods, no systematic approach to characterise all pregnancies in a primary care database has yet been published. Therefore, we developed a new algorithm to establish a Pregnancy Register in the UK Clinical Practice Research Datalink (CPRD) GOLD primary care database.

Methods
We compiled over 4000 Read and entity codes to identify pregnancy-related records among women aged 11 to 49 years in CPRD GOLD. Codes were categorised by the stage or outcome of pregnancy to facilitate delineation of pregnancy episodes. We constructed hierarchical rule systems to handle information from multiple sources. We assessed the validity of the Register to identify pregnancy outcomes by comparing our results to linked hospitalisation records and Office for National Statistics population rates.

Results
Our algorithm identified 5.8 million pregnancies among 2.4 million women (January 1987 to February 2018). We observed close agreement with hospitalisation data regarding completeness of pregnancy outcomes (91% sensitivity for deliveries and 77% for pregnancy losses) and their timing (median 0 days difference, interquartile range 0-2 days). Miscarriage and prematurity rates were consistent with population figures, although termination and, to a lesser extent, live birth rates were underestimated in the Register.

Conclusions
The Pregnancy Register offers huge research potential because of its large size, high completeness, and availability. Further validation work is underway to enhance this data resource and identify optimal approaches for its use.


| INTRODUCTION
Pregnant women are a key study population for many important health questions, including understanding the safety and effectiveness of drugs and vaccines given during pregnancy, effects of other in utero exposures on foetal outcomes, and long-term sequelae of pregnancy complications. Primary care electronic health record (EHR) datasets contain a wealth of maternal and infant data, and their large size enables them to be used to assess rare exposures and outcomes in real-world settings. 1 However, ascertaining the timing and outcomes of pregnancies in such data presents challenges, since the start, end, and trimester dates of pregnancies are not systematically recorded. 2 For studies investigating potential teratogenic risk factors, it is essential to estimate pregnancy start dates accurately, as this enables exposures during the first trimester, the critical period of organogenesis, to be identified.
Primary care datasets are increasingly used for pregnancy research, with researchers developing a multitude of methods to characterise pregnancies therein. [3][4][5][6][7][8][9][10][11] These methods typically involve some of the following components: simple imputation 3,5 (subtracting a fixed duration from the pregnancy outcome date to derive the start date); mapping markers of pregnancy (diagnoses, appointments, or procedures indicative of a pregnancy) to pregnancy outcomes 7,8,12 ; and utilising additional information in patient records to infer the start of pregnancy 3,5,6,9,13 (eg, last menstrual period (LMP) dates, antenatal dating scans, or gestational age at birth).
While some researchers attempt to characterise a variety of pregnancy outcomes, 7,8,13 others restrict to live births. 3,12 Hence, the methods vary in their complexity, accuracy, and the situations in which they can be useful. Validation studies show that utilising multiple sources of information improves date estimation. 2,14 However, even the more complex approaches are limited by the exclusion of pregnancies with no recorded outcome, 7,9 outcomes with no earlier pregnancy marker, 9,13 or conflicting pregnancy records within the same woman. 7,9,13 To date, there has been no published, systematic approach to characterise each documented pregnancy, including those with no recorded outcome, and to use all available pregnancy data in a primary care database.
The UK Clinical Practice Research Datalink GOLD primary care database (henceforth referred to as CPRD) is one of the largest, best-established primary care databases for research. Our study aimed to develop, apply, and validate a new algorithm to identify pregnancies in CPRD, to facilitate and improve the quality of pregnancy research using CPRD. Building on our initial work to identify deliveries in CPRD, 15 the CPRD Mother-Baby link (restricted to live births) and other EHR-based pregnancy algorithms, 2,16,17 we sought to identify all documented pregnancies regardless of the completeness of recording or the type of outcome, to establish a Pregnancy Register in CPRD. Here, we describe our algorithm and the Pregnancy Register it generates, present our validation findings, and highlight key strengths and limitations of this data resource to enable researchers to understand its scope and optimise its use for pregnancy research.

| Data sources
CPRD is a database of routinely collected, anonymised primary care health records for over 15 million patients, representing the UK population in age, sex, and ethnicity. 18 CPRD comprises records of consultations, diagnoses and symptoms, prescriptions, tests, referrals to and feedback from secondary care, health-related behaviours, and all additional care administered as part of routine general practice. In the United Kingdom, general practitioners (GPs) are the main point of contact for nonemergency health issues, including pregnancy. Thus, CPRD is a rich source of pregnancy data relating to antenatal and postnatal care and pregnancy outcomes. A practice-specific family number enables mother-infant pairs to be algorithmically linked (the CPRD

KEY POINTS
• Large primary care databases are valuable sources of pregnancy data for studies of pregnancy.
• Identifying pregnancies, their timing, and outcomes in these databases presents challenges for researchers.
• We developed an algorithm to determine pregnancy episodes in the UK Clinical Practice Research Datalink (CPRD) GOLD primary care database.
• Our algorithm generated a Pregnancy Register comprising 5.8 million pregnancies among 2.4 million women from CPRD GOLD general practices spanning three decades.
• This data resource provides a useful tool to enhance CPRD-based pregnancy research.

| Generating pregnancy code lists
CPRD GOLD codes clinical events using the hierarchical Read classification system. GPs may also record additional, structured data using entity codes. To maximise ascertainment of pregnancy data, we generated lists of Read and entity codes relating to pregnancy. We identified an extensive set of pregnancy-related terms from relevant chapters of the Read hierarchy, used in combination with wildcards, to identify potentially relevant Read codes. We then compared these with existing code lists 7 to identify additional codes. We identified relevant entity codes from the "child health surveillance" and "maternity" chapters.
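As an illustration of the wildcard term search used to build the code lists, the following minimal sketch matches search patterns against term descriptions in a code dictionary. The dictionary entries, the placeholder codes, and the patterns are all hypothetical and are not taken from the actual code lists, which are given in the Supporting information.

```python
from fnmatch import fnmatchcase

# Hypothetical extract of a code dictionary: (code, term description).
# The codes are placeholders, not real Read codes.
DICTIONARY = [
    ("X0001", "Pregnancy confirmed"),
    ("X0002", "Antenatal care: primigravida"),
    ("X0003", "Spontaneous miscarriage"),
    ("X0004", "Fracture of femur"),
    ("X0005", "Normal delivery"),
]

# Illustrative wildcard search terms (assumed, not the study's term list).
SEARCH_PATTERNS = ["*pregnan*", "*antenatal*", "*miscarr*", "*deliver*"]


def find_candidate_codes(dictionary, patterns):
    """Return entries whose description matches any wildcard pattern."""
    hits = []
    for code, term in dictionary:
        lowered = term.lower()  # case-insensitive comparison
        if any(fnmatchcase(lowered, pattern) for pattern in patterns):
            hits.append((code, term))
    return hits


candidates = find_candidate_codes(DICTIONARY, SEARCH_PATTERNS)
candidate_codes = [code for code, _ in candidates]
```

A search of this kind only yields candidates; in practice each hit would still need manual review and categorisation by stage or outcome of pregnancy.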

| Study population
We identified all female patients aged 11 to 49 years from CPRD GOLD practices during the period between January 1, 1987, and February 28, 2018, with individual-level research quality data and with a pregnancy code in their primary care records. We extracted all their pregnancy records and additional data on timing and gestational age at birth from live-born infants identified in the Mother-Baby link. We applied no further restrictions. Hence, records relating to time periods before patients joined a practice or before practices' data were deemed to be of a research quality standard (indicated by the practice up-to-standard date) were included. This enabled us to generate a complete pregnancy profile for each patient.

| Summary of the pregnancy algorithm
The pregnancy algorithm used all available pregnancy data (from Read and entity codes) to determine the timing of pregnancy (start, end, and trimester dates), the outcome (live birth, stillbirth, or early pregnancy loss), and additional details including whether a pregnancy was preterm, postterm, or multiple. The algorithm began by classifying each patient's pregnancy outcome records into distinct pregnancy episodes (combining multiple records relating to the same outcome) and estimating pregnancy end dates. Delivery records were considered separately from early pregnancy loss records. In keeping with UK clinical practice, 19 we considered the onset of the LMP to be the pregnancy start.
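The first step, combining multiple outcome records into distinct episodes, can be sketched as a simple date-clustering pass run separately for delivery records and for early pregnancy loss records. The gap thresholds below are placeholders chosen for illustration; the rules actually used are described in the Supporting information.

```python
from datetime import date

# Placeholder thresholds (assumptions, not the published values): records of the
# same outcome category closer together than this are treated as one episode.
DELIVERY_GAP_DAYS = 175
EARLY_LOSS_GAP_DAYS = 56


def group_into_episodes(record_dates, max_gap_days):
    """Cluster one patient's outcome record dates into episodes.

    A record joins the current episode if it falls within max_gap_days of the
    previous record in that episode; otherwise it starts a new episode.
    """
    episodes = []
    for record_date in sorted(record_dates):
        if episodes and (record_date - episodes[-1][-1]).days <= max_gap_days:
            episodes[-1].append(record_date)
        else:
            episodes.append([record_date])
    return episodes


# Delivery records for one hypothetical patient (unsorted on purpose).
delivery_records = [date(2010, 3, 10), date(2010, 3, 1), date(2012, 6, 15)]
delivery_episodes = group_into_episodes(delivery_records, DELIVERY_GAP_DAYS)
```

Running the same function on the patient's early pregnancy loss records with its own threshold keeps the two outcome types separate, in line with the description above.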
We derived pregnancy start dates from multiple Read and entity codes in the following order of priority: (a) estimated date of delivery (EDD), (b) estimated date of conception (EDC), (c) LMP, (d) antenatal records indicating gestational age, and (e) gestational age at birth (from maternal and infant records). EDD was preferred over EDC and LMP as these codes were considered more likely to derive from an antenatal ultrasound scan, and hence to be more reliable, than a record of LMP. Indeed, codes relating to a scan were used preferentially when available. Codes indicating gestational age during pregnancy or at delivery often specified a range rather than a precise number of weeks; hence, these were positioned further down the hierarchy. Additionally, because codes indicating gestational age at delivery were considered more prone to delayed recording, which could result in a delayed estimated start date, these were used only in the absence of codes in other categories. In the absence of such records, we applied a fixed duration, consistent with the type of pregnancy, to impute the start date. We estimated the timing of trimesters to be LMP onset to 13 completed weeks for the first trimester, weeks 14 to 26 for the second, and week 27 to delivery for the third. The entity codes and associated data fields used to derive pregnancy start and end dates are shown in Table 2. Characteristics of each pregnancy episode, including the type of delivery or pregnancy loss (when recorded), were determined from the pregnancy codes that the algorithm assigned to the episode. Figure 1 illustrates the eight stages of our algorithm.
Full details are provided in the Supporting information.
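The priority order for deriving start dates and the trimester boundaries can be sketched as follows. This is a simplified illustration under stated assumptions: the source labels, the 280-day conversion from EDD to LMP, the fixed durations used for imputation, and the day offsets for the trimester boundaries (second trimester from day 91, third from day 182 after LMP onset) are our reading of the text, not values confirmed by the algorithm specification.

```python
from datetime import date, timedelta

# Priority order from the text; the label strings are hypothetical.
PRIORITY = ["EDD", "EDC", "LMP", "antenatal_gestation", "gestation_at_birth"]

# Placeholder fixed durations for imputation, in days (assumed values).
IMPUTED_DURATION_DAYS = {"delivery": 280, "early_loss": 84}


def start_from_edd(edd):
    """Convert an EDD to an LMP-based start date, assuming EDD = LMP + 280 days."""
    return edd - timedelta(days=280)


def derive_start(candidates, end_date, outcome_type):
    """Pick the start date from the highest-priority source available.

    candidates maps a source label to a candidate start date (or None).
    Falls back to imputation from the end date when no source is available.
    """
    for source in PRIORITY:
        if candidates.get(source) is not None:
            return candidates[source], source
    imputed = end_date - timedelta(days=IMPUTED_DURATION_DAYS[outcome_type])
    return imputed, "imputed"


def trimester_starts(start):
    """First day of each trimester, counting completed weeks from LMP onset."""
    return {
        1: start,
        2: start + timedelta(weeks=13),  # after 13 completed weeks
        3: start + timedelta(weeks=26),  # after 26 completed weeks
    }


candidates = {
    "LMP": date(2015, 1, 10),
    "EDD": start_from_edd(date(2015, 10, 15)),
}
start, source = derive_start(candidates, date(2015, 10, 12), "delivery")
trimesters = trimester_starts(start)
imputed_start, imputed_source = derive_start({}, date(2015, 10, 12), "delivery")
```

Returning the source label alongside the date mirrors the idea of recording how each start date was derived, so that analyses can be restricted to non-imputed dates when needed.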
The Pregnancy Register lists and describes all pregnancies identified in CPRD GOLD by our algorithm. Each record represents a unique

| Internal validation
We assessed the validity of our algorithm to identify pregnancy outcomes occurring in hospital by comparison with linked HES Admitted Patient Care data (HES APC, henceforth referred to as HES).

Data sources and study population
We included women aged 11 to 49 years, who were registered with Deliveries were determined from the HES maternity file, and additional data on pregnancy outcomes were extracted using OPCS codes for end-of-pregnancy or postnatal procedures and ICD codes for early pregnancy loss and combined into pregnancy episodes. Full details of the approach and codes used to determine pregnancy episodes in HES are provided in the Supporting information.

Analysis
We compared the occurrence of deliveries and early pregnancy losses in HES to those captured in the Register and assessed agreement.  and HES were less than 12 weeks apart. We chose 12 weeks to allow for potential errors in date estimation in the Register, assuming that pregnancy outcomes recorded in the two data sources within 12 weeks of each other represent the same event. We calculated the potential positive predictive value (PPV) (recognising that not all women deliver in hospital), completeness of recording, and accuracy of timing of pregnancy outcomes in the Register, using HES as the reference standard. Additionally, for matched deliveries (those in both data sources and less than 12 weeks apart), we assessed concordance on gestational age (completed weeks).
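The 12-week matching rule and the resulting agreement measures can be illustrated with a short sketch. The greedy nearest-date pairing below is an assumption made for illustration; the matching procedure actually used is described in the Supporting information. Completeness is computed with HES as the reference standard, and the PPV as the proportion of Register outcomes with a HES match.

```python
from datetime import date

WINDOW_DAYS = 12 * 7  # outcomes less than 12 weeks apart are treated as the same event


def match_outcomes(register_dates, hes_dates, window_days=WINDOW_DAYS):
    """Greedy one-to-one matching of Register and HES outcome dates.

    Each Register date is paired with the closest unmatched HES date that is
    less than window_days away. Returns a list of (register, hes) pairs.
    """
    unmatched_hes = sorted(hes_dates)
    pairs = []
    for reg in sorted(register_dates):
        best = None
        for hes in unmatched_hes:
            diff = abs((reg - hes).days)
            if diff < window_days and (best is None or diff < abs((reg - best).days)):
                best = hes
        if best is not None:
            pairs.append((reg, best))
            unmatched_hes.remove(best)
    return pairs


def agreement(register_dates, hes_dates):
    """Completeness (sensitivity) and potential PPV, with HES as reference."""
    pairs = match_outcomes(register_dates, hes_dates)
    sensitivity = len(pairs) / len(hes_dates)
    ppv = len(pairs) / len(register_dates)
    return pairs, sensitivity, ppv


# Hypothetical outcome dates, pooled here for illustration only.
register = [date(2014, 5, 1), date(2016, 2, 10)]
hes = [date(2014, 5, 3), date(2017, 9, 1)]
pairs, sensitivity, ppv = agreement(register, hes)
date_differences = [abs((reg - hes_date).days) for reg, hes_date in pairs]
```

In a real analysis the matching would be done within each woman's records before pooling counts across the study population.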
We explored reasons for incomplete matching in sensitivity analyses. First, to allow for possible delays in GPs recording pregnancy outcomes occurring in hospital, we excluded Register-recorded preg-

| External validation
We assessed the validity of the Pregnancy Register estimates of live birth, miscarriage, termination, and prematurity rates by comparing with national vital statistics and published estimates.

Data sources and study population
To ensure comparability with external estimates regarding age, geographic region, and time, we restricted our study population to women aged 15 to 44 years, registered with CPRD GOLD practices in England

Analysis
We estimated live birth and termination rates in 2015, defined as the number of live birth deliveries, and the number of terminations, each per 1000 women-years. We chose person-time as a denominator (rather than the mid-year number of women) to allow for the dynamic nature of our cohort. In secondary analyses, we expanded our definition of termination to include "probable termination" and "unspecified loss" (see example codes in Table 1). We conducted sensitivity analyses, extending follow-up to include patients' first year of registration, to increase ascertainment of live births and terminations among women who joined a practice while pregnant. We In a secondary analysis, we expanded our preterm definition to  1-2 pregnancies) and less than 1% had more than seven pregnancies.
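The rate definition used in this Analysis (events per 1000 women-years, with person-time rather than a mid-year head count as the denominator) amounts to the following calculation. The function name and the example numbers are hypothetical.

```python
DAYS_PER_YEAR = 365.25


def rate_per_1000_woman_years(n_events, follow_up_days):
    """Events per 1000 women-years, given each woman's follow-up in days."""
    woman_years = sum(follow_up_days) / DAYS_PER_YEAR
    return 1000 * n_events / woman_years


# Hypothetical cohort: 50 women each followed for exactly one year, 3 events.
example_rate = rate_per_1000_woman_years(3, [DAYS_PER_YEAR] * 50)
```

Because each woman contributes only the time she was actually under follow-up, women who join or leave a practice part-way through the year are handled without further adjustment.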
The median gestational age was 280 days (IQR 273-280) at delivery and 84 days (IQR 84-84) for early pregnancy losses. For pregnancies with known outcomes, the median gestation at the first antenatal record was 53 days (IQR 40-72). Pregnancy start dates were imputed for 42% of pregnancies with known outcomes, though for relatively fewer deliveries (30%) than for early pregnancy losses (76%).
Pregnancies whose start dates were not imputed had shorter gesta-

| External validation
Our external validation findings comparing pregnancy outcome rates in the Pregnancy Register with external estimates from Office for National Statistics (ONS) and other published evidence are shown in Register, comprising more than 5.8 million pregnancies (more than 1.5 million meeting patient and practice-level data quality standards) among 2.4 million women, spanning three decades, is the first of its kind in a UK primary care database. Our assessment of the internal and external validity of our algorithm to determine pregnancy episodes demonstrates high validity in identifying and dating hospital deliveries (91% sensitivity, 95% with date agreement within 2 days), and 77% sensitivity for hospital-based early pregnancy losses (85% with date agreement within 2 days). Miscarriage rates in the Pregnancy Register of 12% to 13% compared favourably with estimates from external sources, whereas lower rates were observed for terminations and live births. Prematurity rates were lower when based solely on preterm evidence in pregnancy codes but improved markedly when gestational age was taken into account (8%). Overall, the scale and scope of pregnancies captured in the Register and our validation findings demonstrate the potential of this data resource to enhance future CPRD-based pregnancy research.
Our algorithm has several advantages over previous pregnancy identification approaches in CPRD. A key strength is our use of all available pregnancy data across the entire patient record, including additional clinical details (entity codes) recorded in structured data have not addressed. 7,13 While such pregnancy episodes can be challenging to interpret, ignoring them is potentially more problematic and could lead to bias, particularly for studies requiring a denominator of pregnant women such as vaccine uptake studies. When outcomes are recorded, we use all available records to classify the episode, including differentiating between induced and spontaneous abortions whenever possible, rather than combining these episodes in a single "abortion" category as a recent approach has done. 13 A key feature of our algorithm is that it avoids preferentially selecting one type of pregnancy outcome over another. Because delivery episodes are generated separately from early pregnancy loss episodes, we are able to distinguish distinct pregnancy episodes for each type of outcome from multiple records corresponding to the same pregnancy, without choosing between outcomes when a patient has successive records of both (eg, a delivery code followed by a miscarriage code). By contrast, previous approaches discard pregnancy outcome records occurring within a pre-specified time period after a patient's previous outcome, disregarding the type of outcome specified in the records, which could potentially result in incomplete ascertainment of distinct pregnancy episodes of different types. 13 For studies of live birth pregnancies, a clear advantage of our is not registered at the mother's practice during the data collection period, or due to imprecision in delivery or birth date estimates resulting in more than 60 days difference.
The inclusion of some historical deliveries or possible misclassification of some stillbirths as live births in the Register may also partly explain the incomplete linkage.
A further strength of our approach is its transparency. We provide full details of our algorithm stages in the Supporting information, including our complete categorised pregnancy code lists (Read and entity codes). This enables researchers planning to use the Pregnancy Register to understand its scope and assess its applicability for their particular study questions. The provision of a data field "start source" in the Register enables researchers to determine how pregnancy start dates were derived, eg, through imputation or from the available data.
When assessing capture of HES deliveries in the Register, high con-  rate could be due to these practices missing some live births. Any such difference in completeness of recording of live births in HES-linked practices versus practices not linked to HES may limit the generalisability of our internal validation findings to the whole Pregnancy Register.
Other potential reasons for incomplete matching of pregnancy outcomes in the Register and HES include possible reporting delays, or retrospective recording of past pregnancies in primary care. Our sensitivity analysis findings provide some support for this: excluding Register and HES pregnancies in the first and last 6 months of follow-up, and Register pregnancies recorded in the first year of registration with a practice, marginally improved algorithm performance in terms of both PPV and completeness for deliveries and early pregnancy losses. Incomplete capture or misclassification of pregnancy outcomes in either data source could also partly explain the lack of concordance. The median gestation at the first antenatal record among Register pregnancies was 7.6 weeks, which suggests that a proportion of pregnancies resulting in early miscarriage would not be identified in the Register.
The higher agreement we observed with HES for deliveries compared with early pregnancy losses could partly be because women who give birth have more opportunity for their pregnancy outcomes to be recorded, through GP consultations with their babies, than women whose pregnancies do not yield an infant. Furthermore, because of difficulties distinguishing between types of early pregnancy loss in HES, our analysis includes a heterogeneous group of outcomes (miscarriages, terminations, ectopic pregnancies, etc.). However, we would expect miscarriages to be better recorded than terminations because a substantial proportion of terminations are carried out in specialist clinics outside of NHS hospitals.
Our validation findings of similar miscarriage rates yet lower termination rates in the Register compared with external sources reflect this.
There are limitations to our algorithm that are important to consider when using the Register. While our algorithm maximises the use of all available pregnancy data, the downside of this approach is that some pregnancies included in the Register may represent historical events discussed during a consultation and recorded with the current date.
However, researchers can restrict to data occurring within patient registration and practice-level up-to-standard follow-up if required for their particular study question. Our validation analyses were restricted to pregnancies occurring during up-to-standard follow-up; hence, the findings are not necessarily generalisable to pregnancies occurring before the practices' data were deemed up-to-standard.
Uncovering pregnancy episodes with no discernible outcome is also a consequence of our approach to maximise completeness of pregnancy ascertainment. Overall, these "outcome unknown" pregnancy episodes comprise 25% of all research-quality pregnancies in the Register (among women registered for at least 1 year, at an up-to-standard practice). This is consistent with more than 20% of these types of pregnancies identified in an earlier pregnancy record mapping algorithm using CPRD. 8 Our findings suggest that one-third of these pregnancy episodes with unknown outcome are potentially ongoing pregnancies; however, the remainder are more difficult to interpret.
Such episodes may occur for a number of reasons; for example, some may represent undocumented miscarriages requiring no medical intervention or early pregnancy losses with feedback from secondary care that were not captured in the coded data. Variability in the PPV of certain codes used to identify pregnancies, for example, codes which relate to pregnancy planning rather than a current pregnancy, might also explain some of these outcome unknown episodes. Comparing the estimated pregnancy dates in the Register with the date follow-up is censored can help researchers decide whether a pregnancy is potentially ongoing or unlikely to be ongoing.
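One way a researcher might apply the comparison with the censoring date is sketched below. The rule and the 44-week upper bound on gestation are assumptions chosen for illustration, not part of the published algorithm.

```python
from datetime import date, timedelta


def potentially_ongoing(estimated_start, censor_date, max_gestation_weeks=44):
    """Flag an outcome-unknown episode as potentially ongoing.

    The episode is treated as potentially ongoing if follow-up was censored
    before the pregnancy could plausibly have ended (assumed upper bound).
    """
    latest_plausible_end = estimated_start + timedelta(weeks=max_gestation_weeks)
    return censor_date < latest_plausible_end


censor = date(2018, 2, 28)
recent_episode = potentially_ongoing(date(2017, 11, 1), censor)
old_episode = potentially_ongoing(date(2015, 1, 1), censor)
```

Episodes that fail this check are the ones that remain difficult to interpret, since follow-up continued well beyond any plausible end of the pregnancy without an outcome being recorded.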
A further caveat of our approach is that it yields some patients' pregnancy episodes that appear to overlap. While some overlapping episodes may be an artefact of GP recording practices (for example, an apparent pregnancy loss nested within a delivery episode may represent a threatened miscarriage recorded as a "miscarriage," culminating in a later delivery), others may arise from errors in date estimation. While these episodes also present interpretational challenges, we do not attempt to resolve them. Instead, such episodes are flagged as "conflicts" in the Register, allowing researchers to judge how best to handle them in the context of their own study questions. Characterising these "outcome unknown" and overlapping pregnancy episodes and exploring potential reasons for their occurrence are key areas of ongoing validation work to improve the Pregnancy Register.
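Detecting overlapping episodes for one patient can be sketched as a pairwise interval comparison. The episode representation and the function name are hypothetical; the sketch only marks episodes that overlap and does not reproduce the content of the Register's "conflicts" field.

```python
from datetime import date


def flag_overlaps(episodes):
    """Return one boolean per episode, True if it overlaps any other episode.

    episodes is a list of (start_date, end_date) tuples for a single patient.
    """
    flags = [False] * len(episodes)
    for i, (start_i, end_i) in enumerate(episodes):
        for j in range(i + 1, len(episodes)):
            start_j, end_j = episodes[j]
            # Two closed intervals overlap if each starts before the other ends.
            if start_i <= end_j and start_j <= end_i:
                flags[i] = True
                flags[j] = True
    return flags


patient_episodes = [
    (date(2015, 1, 8), date(2015, 10, 15)),   # delivery episode
    (date(2015, 3, 1), date(2015, 3, 20)),    # loss episode nested within it
    (date(2017, 2, 1), date(2017, 11, 5)),    # later, separate delivery episode
]
overlap_flags = flag_overlaps(patient_episodes)
```

Flagging rather than resolving keeps both episodes available, leaving the decision about which to retain to the individual study.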

| CONCLUSIONS
We have described our approach to identifying and characterising pregnancies in the CPRD GOLD database and establishing a new data resource for pregnancy studies using CPRD data. The Pregnancy Register is available to researchers alongside existing CPRD GOLD datasets, upon receipt of ISAC approval of a study protocol. Further work to refine the Register and to extend it to data contributed by practices using EMIS software (CPRD Aurum) is ongoing.

ETHICS STATEMENT
Ethics approval for this study was obtained from the Independent