Impact of different assumptions on estimates of childhood diseases obtained from health care data: A retrospective cohort study

Abstract
Purpose: Accurate estimates of disease incidence in children are required to support pediatric drug development. Analysis of electronic health care records (EHR) may yield such estimates, but pediatric-specific methods are lacking. We aimed to understand the impact of assumptions regarding duration of disease episode and length of run-in period on incidence estimates from EHRs.
Methods: Children aged 0 to 17 years (5-17 years for asthma) registered in the Integrated Primary Care Information database between 2002 and 2014 were studied. We tested the impact of the following: maximum duration of disease episode (0, 14, 30, 60, and 90 days) on recurrent diseases (acute otitis media [common] and acute pyelonephritis [rare]); and database run-in period on chronic diseases: asthma (common) and type 1 diabetes (DM) (rare). We calculated incidence rate ratios with 95% confidence intervals and stratified using 1-year age categories.
Results: Altogether, 503 495 children were registered. The incidence of acute otitis media was highest in <2-year-old children; using a 30-day disease duration as reference, the rate increased by 8% if the duration was 14 days and decreased by 8% when it was extended to 60 days. Disease duration did not impact the incidence of acute pyelonephritis (rare). Compared with a 24-month run-in period (used to exclude prevalent cases), applying no run-in overestimated the incidence rates of asthma and DM by a factor of 2.
Conclusions: Analysis of EHR allows for estimation of disease incidence in children, but assumptions regarding episode length and run-in period impact the incidence estimates. Such assumptions should be routinely explored.

Population-based electronic health care records (EHR) provide an excellent data source for estimating disease occurrence. 5 However, because these data are collected for everyday care rather than for research, specific methodological challenges should be considered.
First, EHRs were introduced only in the last few decades, software systems may change, and patients may move between physicians or health care plans. Therefore, the data often capture only a specific (limited) part of an individual's lifetime. To distinguish incident from prevalent disease, researchers usually apply a look-back (run-in) period, which is often chosen arbitrarily, and the impact of this choice is not investigated or reported. 6 In addition, patients (or their parents) usually visit a physician at the start of a disease but do not return once the disease has resolved, which hampers calculation of the duration of transient diseases.
Secondly, the characteristics of childhood diseases present additional challenges. Children mostly suffer from common, transient, infection-related diseases such as acute otitis media (AOM). At the other end of the spectrum of recurrent diseases, there are rare diseases such as acute pyelonephritis (APN). 7,8 Diseases that affect children may also be chronic and differ in frequency: asthma is common and chronic, whereas type 1 diabetes (DM) is also chronic but less common. 9,10 It is therefore important to understand the impact of different assumptions on estimates of disease occurrence.

[...] Organization-Anatomic Therapeutic Chemical (WHO-ATC) classification system. The database has been proven to be valid for conducting pharmacoepidemiological studies. 13

| Study population
All children aged 0 to 17 years who were registered for at least 1 day between January 1, 2002 and December 31, 2014 were eligible for inclusion in the study. For the investigation of asthma, the minimum age for inclusion was 5 years because the diagnosis of asthma in children under 5 years old is prone to misclassification due to the high incidence of viral infections associated with wheezing. 14,15 Patients entered the study population at the latest of the following dates: start of the study period, date of birth, date of registration in IPCI, or the date on which they reached the age of 5 years (asthma only).
For both asthma and type 1 diabetes, patients needed to have at least 24 months of run-in to be included in the study population.
Exit from the study population occurred at the earliest of the following events: leaving the GP practice, death, subject turned 18 years old, or end of the study period.
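The entry and exit rules above amount to taking the latest of the candidate start dates and the earliest of the candidate end dates. The following is a minimal Python sketch of that logic, not the authors' Java application. The function and parameter names are hypothetical, and it assumes that follow-up ends on the 18th birthday itself, which the text does not specify.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2002, 1, 1)
STUDY_END = date(2014, 12, 31)


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 maps to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def follow_up_window(
    birth: date,
    registration: date,
    left_practice: Optional[date] = None,
    death: Optional[date] = None,
    min_age_years: int = 0,  # use 5 for the asthma cohort
) -> Optional[Tuple[date, date]]:
    """Return (entry, exit) for one child, or None if there is no follow-up."""
    # Entry: latest of study start, birth, registration, and minimum age.
    entry = max(STUDY_START, birth, registration, add_years(birth, min_age_years))
    # Exit: earliest of study end, 18th birthday (assumption), leaving, death.
    candidates = [STUDY_END, add_years(birth, 18)]
    candidates += [d for d in (left_practice, death) if d is not None]
    exit_ = min(candidates)
    return (entry, exit_) if entry <= exit_ else None
```

For a child born on 15 June 2000 who registered on 1 March 2005 and left the practice on 30 September 2010, the general cohort entry is the registration date, whereas the asthma cohort entry is the fifth birthday.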

| Outcome definition and identification
Four outcomes of interest were studied based on different durations (transient or chronic) and frequencies (common or rare). The outcomes were identified based on diagnosis and prescription codes. See appendix 1.
Acute otitis media (AOM) is a transient disease; its systemic and local features usually resolve within 24 to 72 hours. 16,17 One patient can experience more than 1 episode of AOM. 18 Children with AOM were identified through a search on the ICPC AOM disease code H74.
Acute pyelonephritis (APN) is also a transient disease. In the Netherlands, ICPC disease code U70 implies that APN was diagnosed by urine testing. 19 Also, this code distinguishes APN from cystitis which is assigned the ICPC disease code U71 thereby preventing misclassification of both forms of urinary tract infection (UTI). APN may be recurrent. 20
Asthma is a chronic and rather common condition in children. 10 Cases were identified by combining the ICPC disease code (R96) with at least 2 prescriptions for asthma medication (ATC code R03) in the year following the initial diagnosis. 10,21
Type 1 diabetes (DM) is a chronic and rare disease in children. Cases were identified by combining ICPC disease code (T90) and at least 1 prescription for insulin (A10A) in the year following the initial diagnosis. 22
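The case definitions for the two chronic outcomes combine a diagnosis code with a minimum number of prescriptions in the following year. A minimal Python sketch of such a rule is given below. It assumes that the input dates have already been filtered to the relevant codes (R96 with R03 for asthma, T90 with A10A for DM), that "the year following" means 365 days counted from and including the diagnosis date, and that the case date is the first diagnosis date. The function name and parameters are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def first_case_date(
    diagnosis_dates: Iterable[date],
    prescription_dates: Iterable[date],
    min_prescriptions: int,  # 2 for asthma, 1 for type 1 diabetes
    window_days: int = 365,  # assumption: "the year following" = 365 days
) -> Optional[date]:
    """Return the first diagnosis date if the case definition is met."""
    diagnoses = sorted(diagnosis_dates)
    if not diagnoses:
        return None
    first_dx = diagnoses[0]
    window_end = first_dx + timedelta(days=window_days)
    # Count prescriptions from the diagnosis date up to the end of the window.
    n_rx = sum(1 for d in prescription_dates if first_dx <= d <= window_end)
    return first_dx if n_rx >= min_prescriptions else None
```

A child with one qualifying prescription in the window would meet the DM-style rule (at least 1) but not the asthma-style rule (at least 2).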

KEY POINTS
• Limitations arising from lack of standardized methodologies to calculate incidence of disease in children may lead to heterogeneity in reported incidence data that are required for the pediatric investigation plans.
• As part of the Global Research in Pediatrics, we demonstrate the impact of applying different assumptions regarding duration of disease episode and run-in period and provide recommendations for dealing with these.

| Statistical analyses
Overall and age-specific incidence rates (IR) were calculated by dividing the events/outcomes by the total number of person-years [...] and 90 days. Person-time was not censored at diagnosis. For the chronic outcomes, person-time was censored at the date of first diagnosis. The run-in period was reduced from 24 to 12 and 6 months to assess the impact on the incidence rate. Further, the impact of not applying a run-in period was tested. The 95% confidence intervals (CI) around the incidence rates were estimated based on the negative binomial distribution. 23 For the transient outcomes (AOM and APN), age-specific incidence rate ratios (IRR) were calculated by comparing the IRs based on two clinically plausible episode durations: 14 days vs 30 days. For the chronic outcomes (asthma and DM), age-specific IRRs were calculated by dividing the IR obtained without a run-in by the IR obtained with 24 months' run-in.
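The dependence of the incidence rate on the assumed episode duration can be illustrated with a short Python sketch. It assumes that visit dates are expressed as integer days since cohort entry and that the minimum gap is measured from the start of the last counted episode; the text does not state which reference point was used. The function names are hypothetical, and the confidence interval estimation is not shown.

```python
from typing import Iterable


def count_episodes(visit_days: Iterable[int], min_gap_days: int) -> int:
    """Count visits that start a new episode under an assumed episode duration.

    A visit counts as a new episode if it occurs at least `min_gap_days`
    after the start of the previously counted episode (assumption).
    """
    episodes = 0
    last_start = None
    for day in sorted(visit_days):
        if last_start is None or day - last_start >= min_gap_days:
            episodes += 1
            last_start = day
    return episodes


def incidence_rate_per_100py(n_events: int, person_days: float) -> float:
    """Events per 100 person-years, with 365.25 days per year."""
    person_years = person_days / 365.25
    return 100.0 * n_events / person_years
```

With visits on days 0, 10, 20, 45, and 200, the number of counted episodes falls as the assumed duration increases, which is the mechanism behind the lower incidence rates under longer assumed episode durations.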
As presented in Figure 1, point prevalence was calculated on July 1, 2010 by dividing the number of children with the outcome on that date by the total number of children in the study population on that date. The 95% CIs were calculated based on the Wilson score interval. 24 To calculate the age-specific prevalence ratio (PR) for the transient outcomes (AOM and APN), we assumed an event to be new if it occurred ≥30 days after the preceding event, and we divided the resulting point prevalence by the estimate that was based on 14 days' episode duration.
Regarding the chronic outcomes (asthma and DM), age-specific PRs were calculated by dividing the point prevalence obtained with 24 months' run-in by the point prevalence obtained with no run-in. For the point prevalence based on 24 months' run-in, only children with a database history of at least 24 months were included in the denominator. Of these children, only those who met the case definition (as described earlier) prior to July 1, 2010 were included in the numerator. No run-in meant that no database history was required for inclusion. The 95% CIs around the IRRs and PRs were calculated following the negative binomial distribution. 25 Analyses were conducted using a custom-built Java application
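For reference, the Wilson score interval for a proportion can be computed as in the following Python sketch. The formula is the standard one; the function name, the default critical value for a 95% interval, and the example counts in the usage note are illustrative and not taken from the study.

```python
import math
from typing import Tuple


def wilson_interval(cases: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score confidence interval for the proportion cases / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= cases <= n:
        raise ValueError("cases must be between 0 and n")
    p = cases / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width
```

Calling `wilson_interval(50, 100)` gives an interval of roughly 0.404 to 0.596. Unlike the simple normal approximation, the interval stays within 0 and 1 even when the number of cases is very small, which matters for rare outcomes.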

| RESULTS
To investigate AOM and APN, the study population comprised 503 495 children. For asthma (studied in children 5 years or older) and DM, we studied 304 856 and 405 600 children, respectively, with a total follow-up of 710 980 and 1 042 067 person-years (PY). Figure 2 shows the distribution of age (at start of follow-up, no run-in) and duration of follow-up of the population base, without censoring.

Recurrent diseases and impact of episode duration
Under the assumptions that a new event can only occur 0, ≥14, ≥30, ≥60, or ≥90 days after the preceding event, the overall IRs for AOM were 8.2, 7.1, 6.6, 6.2, and 5.9 per 100 PY, respectively (Table 1).
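The relative effect of each assumed duration can be read off by dividing the overall rates above by the 30-day reference rate. The Python sketch below does this with the rounded overall IRs quoted in the text. These ratios are only illustrative and are not the age-specific IRRs reported by the study; the function name is hypothetical.

```python
from typing import Dict

# Overall AOM incidence rates per 100 PY, keyed by assumed episode duration (days).
overall_ir: Dict[int, float] = {0: 8.2, 14: 7.1, 30: 6.6, 60: 6.2, 90: 5.9}


def rate_ratios(rates: Dict[int, float], reference: int) -> Dict[int, float]:
    """Divide each rate by the rate under the reference assumption."""
    ref = rates[reference]
    return {duration: rate / ref for duration, rate in rates.items()}


ratios = rate_ratios(overall_ir, reference=30)
for duration, ratio in sorted(ratios.items()):
    print(f"{duration:>2} days: ratio vs 30 days = {ratio:.2f}")
```

From these rounded overall rates, the 14-day assumption gives a ratio of about 1.08 and the 60-day assumption about 0.94 relative to 30 days.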
The estimates resulting from the shortest assumed duration were the highest, decreasing with increasing length of an episode in all age categories (Figure 3).

The results are summarized in Figure 5.

| DISCUSSION
This study showed that assumptions regarding duration of a disease episode (for transient and recurrent diseases) and run-in period (for chronic diseases) may impact incidence and prevalence estimates of childhood diseases obtained from population-based dynamic EHR databases. While this study was focused on children, some of the investigated issues may also be relevant for adults.
The lack of complete follow-up from birth until 17 years of age, and the fact that the electronic medical record captures only the visit for a disease and not its resolution, affect the estimation of disease occurrence. Epidemiologists usually apply assumptions to deal with these limitations, such as assuming a standard disease episode duration and using a run-in period prior to the start of follow-up to exclude prevalent disease. In this paper, we investigated the impact of these assumptions on the different occurrence measures in children, and we observed a relatively large impact.
General rules can be derived from this exercise. For common recurrent diseases, the impact of the choice of episode duration is relatively high: assuming longer disease episodes leads to lower incidence. For a rare recurrent disease, the impact of a change in episode duration on the incidence is negligible. This is understandable because the probability of having another event of a rare disease is low, and such events are unlikely to occur close together. The impact of an increasing episode duration on the prevalence of a common recurrent disease is the opposite: the point prevalence increases with increasing duration. The impact on the point prevalence of a rare recurrent disease is negligible.
We recommend that studies aiming to estimate the incidence and prevalence of common recurrent diseases explore the impact of the episode length, because the true length cannot be observed in medical record databases: patients do not return to tell the GP that the disease is cured. For rare diseases, the impact of different episode durations may be ignored for both incidence and prevalence estimation. For chronic diseases, varying the run-in affects the incidence but has a negligible impact on the prevalence. We recommend that studies investigating chronic diseases apply the longest possible run-in to avoid misclassifying prevalent cases as incident. We acknowledge that this may reduce the sample size and, depending on the database, may limit the generalizability of the results.
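The effect of the run-in period on incident case counts can be sketched as follows. This is a simplified Python illustration with hypothetical names. It assumes that the run-in is expressed in days (for example, 730 days for 24 months), that children diagnosed during the run-in are treated as prevalent and excluded, and that person-time is censored at first diagnosis as described for the chronic outcomes.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

Child = Tuple[date, date, Optional[date]]  # (registration, exit, first diagnosis)


def incident_cases_and_person_days(
    children: Iterable[Child], run_in_days: int
) -> Tuple[int, int]:
    """Count incident cases and person-days after applying a run-in period."""
    events = 0
    person_days = 0
    for registration, exit_, first_dx in children:
        start = registration + timedelta(days=run_in_days)
        if start > exit_:
            continue  # not enough database history for this run-in
        if first_dx is not None and first_dx < start:
            continue  # diagnosed during run-in: treated as prevalent
        # Censor person-time at the first diagnosis, if any.
        end = min(exit_, first_dx) if first_dx is not None else exit_
        person_days += (end - start).days
        if first_dx is not None and first_dx <= exit_:
            events += 1
    return events, person_days
```

A child whose first recorded diagnosis falls shortly after registration is counted as incident when no run-in is applied but is excluded as prevalent under a 24-month run-in, which is how omitting the run-in can inflate the incidence rate.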
Regarding AOM, there was no difference in incidence when we compared 2 clinically plausible episode durations (14 versus 30 days); 30 days has been applied in a previous study. 7 Based on the natural history of AOM, the actual duration of an episode is not clear. 26 Therefore, we also compared the incidence rates (results not presented) derived from the shortest (0 days) versus the longest (90 days) assumption; these estimates were significantly different. When we performed the same comparisons for APN, the difference was not significant, which further suggests that the episode duration matters only for estimating the occurrence of common outcomes. We observed that assuming an episode duration of 30 vs 14 days significantly affected the prevalence of AOM. The peak in PR observed among children aged 9 to 10 years is probably due to the comparatively low number of events in that age group. To derive the most clinically meaningful estimate of the incidence and/or prevalence of AOM, and to a lesser extent APN, the most plausible assumption for the duration of an episode probably remains 30 days. It is highly unlikely that an episode lasts as short as a few hours to 1 day or as long as 90 days.
We recommend further research to identify the most appropriate assumptions to apply when estimating the occurrence of AOM and other common and recurrent childhood diseases. Prevalence might be a good measure.
Still regarding AOM, the IRR seemed consistent across all ages, but the absolute difference in IR estimates between the assumed disease durations would be much smaller for older than for younger children. In older children, there is minimal or no difference in incidence estimates regardless of the assumed duration of an episode.

b For asthma and type 1 diabetes, subjects that had a minimum 24-month run-in were studied to assess the impact of decreasing the run-in period on the incidence rate.
Regarding both asthma and DM, increasing the run-in period considerably decreased the incidence. This finding is important given that people can be observed for only a part of their lifetime: although the current study covered a 12-year calendar period, the median duration of follow-up for the studied population was 1500 days, which shows how incomplete the follow-up is (Figure 2).

[...] but the impact of the run-in period may be less pronounced in more stable regional or national databases where persons are registered from birth. Lastly, IPCI is a GP database, which may affect the identification of diseases or conditions that are usually diagnosed by a specialist. However, the database contains information on referrals and hospitalizations, so this effect is expected to be negligible.

ETHICS STATEMENT
The study and the access to the database were approved by the IPCI governing board (number 05/2015).