Validation of cardiovascular outcomes and risk factors in the Clinical Practice Research Datalink in the United Kingdom

Abstract
Purpose: Strategies to identify and validate acute myocardial infarction (AMI) and stroke in primary-care electronic records may impact effect measures, but to an unknown extent. Additionally, the validity of cardiovascular risk factors that could act as confounders in studies on those endpoints has not been thoroughly assessed in the United Kingdom Clinical Practice Research Datalink's (CPRD's) GOLD database. We explored the validity of algorithms to identify cardiovascular outcomes and risk factors and evaluated different outcome-identification strategies using these algorithms for estimation of adjusted incidence rate ratios (IRRs).
Methods: First, we identified AMI, stroke, smoking, obesity, and menopausal status in a cohort treated for overactive bladder by applying computerized algorithms to primary care medical records (2004–2012). We validated these cardiovascular outcomes and risk factors with physician questionnaires (gold standard for this analysis). Second, we estimated IRRs for AMI and stroke using algorithm-identified and questionnaire-confirmed cases, comparing these with IRRs from cases identified through linkage with hospitalization/mortality data (best estimate).
Results: For AMI, the algorithm's positive predictive value (PPV) was >90%. Initial algorithms for stroke performed less well because of inclusion of codes for prevalent stroke; algorithm refinement increased PPV to 80% but decreased sensitivity by 20%. Algorithms for smoking and obesity were considered valid. IRRs based on questionnaire-confirmed cases only were closer to IRRs estimated from hospitalization/mortality data than IRRs from algorithm-identified cases.
Conclusions: AMI, stroke, smoking, obesity, and postmenopausal status can be accurately identified in CPRD. Physician questionnaire-validated AMI and stroke cases yield IRRs closest to the best estimate.


| INTRODUCTION
The usefulness of electronic medical records for epidemiologic research relies on data accuracy. In the United Kingdom, the Clinical Practice Research Datalink (CPRD) contains demographic, clinical, and drug prescription data, constituting a rich source for epidemiological research. [1][2][3][4] Linkage to hospital and death records, generally considered reliable sources for health events, is possible for a proportion of UK primary care practices, enhancing the validity and utility of the CPRD. 1 Formerly, confirmation of events in primary care data often relied on physicians' unstructured notes, known as free-text comments 1,5 ; however, free-text comments from CPRD are no longer available for research. Currently, confirmation of events, treatments, dates, and patient characteristics is possible through questionnaires sent to general practitioners (GPs), a common strategy for researchers to confirm events in the subset of the population for whom linkage to hospital and death records is not possible.
Little research has been conducted to validate patient characteristics that are considered cardiovascular (CV) risk factors and potential confounders for CV outcomes. 6,7 Moreover, the impact of misclassification of myocardial infarction and stroke on effect measures when using primary care data that cannot be linked to hospital or mortality data has not yet been evaluated. The validation component of a larger drug safety study 8 provided an opportunity to evaluate the validity of algorithms for identifying CV risk factors from primary care data and to examine the effect of various outcome-identification strategies reliant on those algorithms when estimating the incidence of CV outcomes. This article has two parts. In the first part, we explored the validity of algorithms to identify acute myocardial infarction (AMI) and stroke that use only structured data from primary care and explored the accuracy of algorithms for recorded smoking, obesity, and menopause. In the second part, we examined the effect of four outcome-identification strategies, applying our validated algorithms, on the estimation of adjusted incidence rate ratios (IRRs) for AMI and stroke.

| Setting
This study was the validation component of a drug safety study 8 ("parent study") that included patients aged ≥18 years who were continuously enrolled in the UK CPRD primary care database (GOLD) for ≥12 months and were newly exposed to antimuscarinic medications to treat overactive bladder (darifenacin, fesoterodine, oxybutynin, solifenacin, tolterodine, or trospium) in 2004 to 2012. Patients with a diagnosis of cancer other than nonmelanoma skin cancer were excluded, as were HIV+ patients because their health service utilization might not be fully captured. Patients were followed until death, disenrollment, cancer diagnosis, HIV+ status, AMI, stroke, or study end, whichever occurred first.

| Data source
This study used information collected in the CPRD. 1
• Algorithms for stroke performed less well (PPV = 56%) because they included codes for prevalent stroke. Excluding these codes increased the PPV to 80% but decreased sensitivity by 20%.
• Smoking, obesity, and postmenopausal status can be accurately identified using algorithms that rely on information in the electronic medical records in CPRD GOLD.
• For subjects whose information comes from general practitioners only, incidence rate ratios that used only cases confirmed by physicians via questionnaire were closer to incidence rate ratios from hospitalization and mortality data (best estimate) than estimates that used all algorithm-identified cases, including cases that were not physician confirmed. This suggests that endpoint validation decreased misclassification bias.
cause of death, and all other causes of death listed on the death certificate, also coded according to ICD-10.
In our cohort of patients treated for overactive bladder, CV outcomes and mortality were ascertained from general practice records in the CPRD GOLD and via linkage to HES APC data and ONS mortality data for the subset of general practices that permit such linkage. 1 For this study, we divided the study population into two groups: subjects with information from general practice records only ("CPRD unlinked") and subjects whose records from general practices could be linked to external data sources, including hospitalizations and mortality data ("CPRD linked"). Approximately 50% of the cohort were in the CPRD-unlinked group, and approximately 50% were in the CPRD-linked group.

| Study participants
Validation was conducted within a subset of CPRD-unlinked data from the parent study. This subset comprised all patients with prescriptions for the least commonly prescribed drugs (darifenacin, fesoterodine, and trospium) and a randomly selected one-third of the patients with a prescription for each of the most commonly prescribed drugs (oxybutynin, solifenacin, and tolterodine). 10
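The subset selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the study's code: the `(patient_id, drug)` layout, the function name `select_validation_subset`, the fixed seed, and the rounding of one-third to the nearest integer are all hypothetical, and each patient is assumed to have a single index drug.

```python
import random

LESS_COMMON = {"darifenacin", "fesoterodine", "trospium"}
MORE_COMMON = {"oxybutynin", "solifenacin", "tolterodine"}

def select_validation_subset(patients, seed=0):
    """Select all users of less common drugs and a random third per common drug.

    patients: iterable of (patient_id, drug) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    selected = []
    by_drug = {}
    for pid, drug in patients:
        if drug in LESS_COMMON:
            selected.append(pid)  # keep every patient on a less common drug
        elif drug in MORE_COMMON:
            by_drug.setdefault(drug, []).append(pid)
    for drug in sorted(by_drug):
        ids = by_drug[drug]
        k = round(len(ids) / 3)  # assumed rounding rule for "one-third"
        selected.extend(rng.sample(ids, k))
    return selected
```

Sampling within each commonly prescribed drug, rather than across the pooled group, keeps every drug represented in roughly the same proportion.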

| Algorithms to ascertain CV events and risk factors
The CV events validated in this study were AMI and stroke (separately for ischemic, hemorrhagic, and unspecified stroke). The CV risk factors validated in this study were smoking (never/current/former smoker), obesity (yes/no, defined as body mass index ≥30 kg/m²), and menopause (yes/no). The computerized algorithms for AMI and stroke were based on clinical definitions for each type of event. 11,12 Out-of-hospital deaths due to these CV events can be difficult to identify in electronic health care data; our algorithms were designed to identify these events as completely as possible (Table 1).
The AMI and stroke algorithms used combinations of coded entries in the appropriate time windows to identify AMI and stroke events and categorize them as definite, probable, or possible cases (details on the combinations of entries and time windows are provided in Table 1).
Patients lacking a specific AMI diagnosis but with the signs and symptoms of AMI listed in Table 1 were categorized as having a potential AMI. Other individuals were considered noncases, further subclassified as noncases alive (if they were alive at the end of the study period) or noncases dead (if their death occurred during the study period). Death and CV outcomes in these patients were ascertained from CPRD GOLD data for the CPRD-unlinked population.
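The definite and probable AMI definitions in Table 1 can be sketched as a small classifier. This is a hedged illustration, not the study's algorithm: the function name, argument layout, and the `"unclassified"` label are assumptions, and the "and/or" in the probable definition is read here as "hospitalization or at least one supporting item". Categories not fully specified in the text (for example, possible or potential AMI) are deliberately not implemented.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=30)  # 30 days before or after the AMI code

def classify_ami(ami_code_date, hospitalized, supporting_event_dates):
    """Return 'definite', 'probable', or 'unclassified' for one patient.

    ami_code_date: date of the Read code for AMI, or None if absent.
    hospitalized: True if the AMI code is accompanied by a hospitalization.
    supporting_event_dates: dates of coded supporting events
        (chest pain, cardiac enzymes, ECG, and so on).
    """
    if ami_code_date is None:
        return "unclassified"
    # Count supporting events that fall inside the 30-day window.
    n_support = sum(
        1 for d in supporting_event_dates if abs(d - ami_code_date) <= WINDOW
    )
    if hospitalized and n_support >= 2:
        return "definite"
    if hospitalized or n_support >= 1:  # assumed reading of "and/or"
        return "probable"
    return "unclassified"
```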
TABLE 1 Definitions to identify and classify AMI and stroke
AMI
Definite AMI: A Read code for AMI with hospitalization AND two or more Read codes for the following events within 30 days before or after the AMI code:
• Characteristic chest pain or symptoms of myocardial ischemia a
• Abnormal results for cardiac enzymes a
• Electrocardiogram with clinical signs of AMI a
• Arteriogram with a recent coronary occlusion a
• Administration of thrombolytic therapy a
• Coronary revascularization procedure a following AMI diagnosis
• Death
Probable AMI: A Read code for AMI with hospitalization and/or one item above within 30 days before or after the AMI code

The algorithms for smoking, obesity, and menopause were developed collaboratively by the three physician epidemiologists and other coauthors of the study (Table 2). For smoking, a key consideration was that a record indicating that a patient was a smoker was given more weight than a record indicating that a patient had never smoked, under the assumption that the smoking habit (recorded in the electronic medical record) might have triggered a medical intervention. Patients with early codes for smoking and more recent codes for not smoking were considered former smokers. The algorithm for obesity was designed to capture the most recent information in the medical record; weight and height measurements were preferred over codes for obesity. Smoking, obesity, and menopause were ascertained at three time points using all information available: the day before cohort entry, the last date with information before the CV outcome, and the date with information closest to the outcome (before or after). Validation was performed for ascertainment at each of the three dates, for comparison.
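The smoking rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Table 2 algorithm itself: the record layout, the status labels, and the `"missing"` return value are hypothetical. The sketch gives a smoking record more weight than a never-smoker record by treating any earlier smoking record followed by a later non-smoking record as evidence of former smoking.

```python
from datetime import date

def smoking_status(records, index_date):
    """Classify smoking status on index_date from dated coded records.

    records: list of (record_date, status) pairs with status in
        {'current', 'former', 'never'} (hypothetical coding).
    """
    prior = [(d, s) for d, s in records if d <= index_date]
    if not prior:
        return "missing"
    latest_date = max(d for d, _ in prior)
    latest_statuses = {s for d, s in prior if d == latest_date}
    if "current" in latest_statuses:
        return "current"
    # Any earlier evidence of smoking outweighs a later "never" record.
    if any(s in ("current", "former") for _, s in prior):
        return "former"
    return "never"
```

Calling the function with each of the three ascertainment dates as `index_date` would reproduce the three time points used for validation, with the caveat that the "closest date before or after" variant needs records after the outcome as well.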
To enhance the ascertainment of postmenopausal status, we conducted a post-hoc analysis of an expanded algorithm that added two proxies for postmenopausal status to the original algorithm: age >50 years with use of hormone replacement therapy (HRT), and age >55 years.
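The expanded rule can be written as a short predicate. This is a sketch only: the function and argument names are hypothetical, and `has_menopause_code` stands in for whatever the original algorithm returns.

```python
def postmenopausal_expanded(has_menopause_code, age_years, uses_hrt):
    """Expanded postmenopausal definition: original algorithm plus two proxies.

    has_menopause_code: result of the original code-based algorithm (assumed boolean).
    age_years: age at the ascertainment date.
    uses_hrt: True if the patient has a record of HRT use.
    """
    if has_menopause_code:
        return True
    if age_years > 55:  # proxy 1: age alone
        return True
    return age_years > 50 and uses_hrt  # proxy 2: age plus HRT use
```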

| Questionnaires
Validation of CV outcomes and risk factors was conducted by comparing the classification reached by using algorithms with that attained by using questionnaires (the gold standard for validation analyses) sent to GPs of the patients identified by the algorithms for the following groups of cases:
• All definite and possible AMI and stroke cases

| Part 2. Understanding outcome misclassification
As a separate research question and to understand the impact of outcome misclassification on a safety question, we compared the IRRs of four outcome-identification strategies using primary care data in the CPRD-unlinked population with the IRRs estimated using primary care, HES APC, and ONS data in the CPRD-linked population.

| Study participants
The four outcome-identification strategies were applied to the study population that was included in Part 1 of this study (CPRD-unlinked population). In addition, we used the CPRD-linked population from the parent study for comparison.

| Outcome-identification strategies and estimation of IRRs
We estimated propensity score-adjusted IRRs of AMI and stroke for oxybutynin (which is commonly used to treat overactive bladder and has been on the market for many years) vs any other study medication using four outcome-identification strategies that made use of the algorithms and validation efforts described in Part 1 of this manuscript, separately for AMI and stroke. The propensity score methods used for this analysis have been previously described. 13 The IRRs estimated using these strategies were compared with IRRs estimated from cases identified in the CPRD-linked population ("best estimate").
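For orientation, the quantity being compared across strategies is an incidence rate ratio. The sketch below computes a crude (unadjusted) IRR with a Wald confidence interval on the log scale; it does not reproduce the propensity score adjustment used in the study, and the function name and arguments are hypothetical.

```python
import math

def incidence_rate_ratio(cases_exposed, py_exposed, cases_reference, py_reference, z=1.96):
    """Crude IRR and Wald confidence limits (log scale, Poisson approximation).

    cases_*: number of events; py_*: person-years at risk.
    Both case counts must be greater than zero.
    """
    irr = (cases_exposed / py_exposed) / (cases_reference / py_reference)
    se_log = math.sqrt(1 / cases_exposed + 1 / cases_reference)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper
```

Changing the outcome-identification strategy changes the case counts fed into such a calculation, which is how outcome misclassification can shift the estimated IRR and its precision.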
We plotted the P value functions 14 of the estimated IRR for these four strategies to summarize two key aspects of this analysis: the estimated effect size (represented by the horizontal location of the peak of each curve) and the degree to which the various strategies were

| Part 1. Validation of CV outcomes and risk factors
The parent study cohort consisted of 119 912 patients, 70% of whom were women. Mean ages at cohort entry were 64.5 years (men) and 61.5 years (women). Among the 26 511 participants in Part 1 of this validation study, the algorithm identified a total of 2658 AMIs and 726 strokes in primary care data.
The initial algorithm for incident stroke resulted in low PPVs (
The algorithm identified smoking in primary care data at three time points: the day before cohort entry, the last date with information before the endpoint, and the date with information closest to the endpoint (before or after). General practitioners were asked to provide the same information via questionnaires on the day of the event (patients with AMI or stroke) or at study end (noncases alive or dead). Responses from questionnaires were the gold standard for this analysis. Percentages are row percentages.
care data. At the date closest to and before the endpoint, 97% of the patients identified as never smokers in primary care data were also never smokers according to questionnaires (Table 4). Likewise, 84% identified as current smokers were also current smokers according to questionnaires, and 77% of former smokers were also former smokers according to questionnaires. Concordance at the three evaluated time points is presented in Table 4.
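The concordance figures above are row percentages: within each algorithm-assigned category, the share of patients placed in each category by the questionnaire. A minimal sketch of that calculation follows; the function name and the paired-label input are assumptions, and no study data are reproduced here.

```python
def row_percentages(pairs):
    """Cross-tabulate algorithm vs questionnaire status as row percentages.

    pairs: iterable of (algorithm_status, questionnaire_status) labels,
        one pair per patient (hypothetical layout).
    Returns {algorithm_status: {questionnaire_status: percent}}.
    """
    counts = {}
    for algo, quest in pairs:
        row = counts.setdefault(algo, {})
        row[quest] = row.get(quest, 0) + 1
    table = {}
    for algo, row in counts.items():
        total = sum(row.values())
        table[algo] = {quest: 100.0 * n / total for quest, n in row.items()}
    return table
```

The diagonal entry of each row, where both sources agree, corresponds to the predictive value of the algorithm for that category when the questionnaire is taken as the gold standard.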
Information on obesity was missing in primary care data for 26% of patients on the date closest to and before the endpoint. Of patients classified as obese in primary care data, 82% were confirmed as obese through questionnaires; 92% of patients classified as nonobese were confirmed as nonobese through questionnaires. Among women identified as postmenopausal at baseline in primary care data, 86% were confirmed as such through questionnaires. The algorithm that [...] with the initial algorithm to 95% with the expanded algorithm.
FIGURE 1 Predictive values of obesity and postmenopausal status. BMI, body mass index; NPV, negative predictive value; PPV, positive predictive value. The algorithm identified patient characteristics in primary care data at three time points: the day before cohort entry, the last date with information before the endpoint, and the date with information closest to the endpoint (before or after). General practitioners were asked to provide the same information via questionnaires on the day of the event (patients with AMI or stroke) or at study end (noncases alive or dead).

| Part 2. Understanding outcome misclassification
The most stringent outcome-identification strategy (strategy A), treating as cases only those that were confirmed through questionnaires, yielded an IRR closer to the best estimates for AMI and stroke in the CPRD-linked population than the other three strategies (Table 5 and Figure 2). For AMI, the estimated effects for strategy A and the best estimate were very similar (ie, the peaks of the two curves are close to each other; Figure 2A); although less precise, results from strategy A were compatible with best estimate results (ie, the red curve is contained within the yellow curve). Strategies B to D were further from the best estimate in both aspects, with maximal peak separation, sharpest function, and least curve overlap for the least stringent strategy, D. For stroke (Figure 2B), the IRR for strategy A also was closest to the best estimate and the curve overlap was maximal, but the peaks of the other strategies did not differ substantially from the peak of the best estimate, and the curves for the various strategies partially overlapped. This study also found support for the use of our algorithms to identify smoking and obesity in primary care data in CPRD GOLD.
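A P value function of the kind compared above can be sketched as follows. This assumes a normal approximation on the log IRR scale, which may differ from the method used to draw Figure 2; the function name and inputs are hypothetical.

```python
import math

def p_value_function(irr_hat, se_log, candidate_irrs):
    """Two-sided P value for each candidate IRR (normal approximation, log scale).

    irr_hat: point estimate of the IRR.
    se_log: standard error of log(irr_hat).
    candidate_irrs: hypothesized IRR values at which to evaluate the curve.
    """
    p_values = []
    for irr0 in candidate_irrs:
        z = (math.log(irr_hat) - math.log(irr0)) / se_log
        # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values
```

The curve peaks (P = 1) at the point estimate, and the candidate values where it crosses P = 0.05 are the 95% confidence limits, so a narrower curve indicates a more precise estimate and overlapping curves indicate compatible results.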

| DISCUSSION
Menopause is poorly captured in CPRD. For postmenopausal status, incorporating additional information, namely age and treatments for postmenopausal symptoms, into the definition increased its validity. The addition of two proxies for postmenopausal status resulted in a relatively small increase in the PPV and a more substantial increase in the NPV. Most importantly, the estimated prevalence of postmenopausal status increased from 24% with the original algorithm to 95% with the expanded algorithm. The latter is much closer to the expected prevalence in this population, in which the mean age of women at cohort entry was 61.5 years.
The AMI and stroke IRRs closest to the best estimate were those estimated using only confirmed cases, providing evidence on the value of outcome validation. The IRR 95% CIs of all proposed outcome-identification strategies generally overlapped. A caveat of this comparison is that the subpopulation with linkage to hospitalization and mortality data used to calculate the best estimate might differ from the subpopulation without such linkage where the validation of outcomes was performed.
A study strength was the validation of key patient characteristics that can confound epidemiologic studies. Additionally, our outcome-identification process relied on only structured (coded) data and questionnaires that continue to be available to researchers using CPRD GOLD.
This validation study was conducted in a population of patients treated for overactive bladder because this validation effort was part of a broader study on the safety of antimuscarinic drugs. 8,9,15 We do not expect that these CV outcomes and risk factors would be better or less well recorded among these patients than in the general population. Furthermore, CV risk factors investigated here might have been recorded in patients' electronic records before treatment started. For these reasons, we believe that the results presented in this paper apply to CPRD GOLD and possibly to other UK health databases with the same coding systems. Relative risks of AMI and stroke are intended to be scientifically generalizable to all patients using antimuscarinic medications to treat overactive bladder. 16

ETHICS STATEMENT
The study was judged to be exempt from review by the Research Trian-

REPRODUCIBILITY
The results of the study were generated by RTI Health Solutions (RTI-HS) using data obtained from CPRD. RTI-HS developed proprietary code to perform the analyses on the data. Researchers desiring access to the data would be required to obtain permission from the study sponsor, obtain a data use agreement with CPRD, and develop their own code.