Fitness for purpose of routinely recorded health data to identify patients with complex diseases: The case of Sjögren's syndrome

Abstract Background This study is part of the EU‐funded project HarmonicSS, aimed at improving the treatment and diagnosis of primary Sjögren's syndrome (pSS). pSS is an underdiagnosed, long‐term autoimmune disease that affects particularly salivary and lachrymal glands. Objectives We assessed the usability of routinely recorded primary care and hospital claims data for the identification and validation of patients with complex diseases such as pSS. Methods pSS patients were identified in primary care by translating the formal inclusion and exclusion criteria for pSS into a patient selection algorithm using data from Nivel Primary Care Database (PCD), covering 10% of the Dutch population between 2006 and 2017. As part of a validation exercise, the pSS patients found by the algorithm were compared to Diagnosis Related Groups (DRG) recorded in the national hospital insurance claims database (DIS) between 2013 and 2017. Results International Classification of Primary Care (ICPC) coded general practitioner (GP) contacts combined with the mention of “Sjögren” in the disease episode titles, were found to best translate the formal classification criteria to a selection algorithm for pSS. A total of 1462 possible pSS patients were identified in primary care (mean prevalence 0.7‰, against 0.61‰ reported globally). The DIS contained 208 545 patients with a Sjögren related DRG or ICD10 code (prevalence 2017: 2.73‰). A total of 2 577 577 patients from Nivel PCD were linked to the DIS database. A total of 716 of the linked pSS patients (55.3%) were confirmed based on the DIS. Conclusion Our study finds that GP electronic health records (EHRs) lack the granular information needed to apply the formal diagnostic criteria for pSS. The developed algorithm resulted in a patient selection that approximates the expected prevalence and characteristics, although only slightly over half of the patients were confirmed using the DIS. 
Without more detailed diagnostic information, the fitness for purpose of routine EHR data for patient identification and validation could not be determined.

Primary Sjögren's syndrome (pSS) is an underdiagnosed, long-term autoimmune disease that particularly affects the salivary and lachrymal glands but that may involve any organ and system. 1 Although generally benign, pSS may be characterized by severe rare complications, including non-Hodgkin's lymphoma (NHL), with a non-negligible impact on patients' quality of life. 2,3 To date, health policy and management research on pSS is quite rare, especially on pSS diagnosis and management in primary health care. 4 A study on the epidemiology of Sjögren's syndrome by Patel and Shahane 5 concluded that: "there is no accepted universal classification criterion for the diagnosis of Sjögren's syndrome. There are a limited number of studies that have been published on the epidemiology of Sjögren's syndrome, and the incidence and prevalence of the disease varies according to the classification criteria used. The data is further confounded by selection bias and misclassification bias, making it difficult for interpretation." [p. 247]. In fact, international consensus on the classification criteria for pSS was only reached in 2016, resulting in the American College of Rheumatology/European League Against Rheumatism (ACR/EULAR) classification criteria for pSS, 6 making it difficult to estimate the exact prevalence of the disease. 
Consequently, estimates of the prevalence of pSS vary greatly across studies (ranging from 0.11‰ to 37.9‰), depending on the setting, the definition used, and the population investigated. 7 Besides population, geographical, and diagnostic differences, diagnosis may be delayed or patients may be misclassified as having another rheumatic disease due to the insidious onset and the broad spectrum of clinical manifestations of the disease. In addition, Sjögren's syndrome (SS) can occur on its own (primary SS) or in association with other systemic autoimmune diseases (secondary SS). Given the vast availability of electronic health records (EHRs) for the general population, computational phenotyping may help to improve the diagnosis and timely referral of patients with complex diseases such as pSS to the medical specialist. Computational phenotyping algorithms are automated patient selection algorithms to identify a patient population of interest. 8 Such algorithms are increasingly used to identify and characterize patients with complex medical conditions from heterogeneous EHR data in order to improve efficiency of healthcare delivery and clinical outcomes. 9

1.2 | Primary care data
Primary care EHRs are a rich source of information about people's health and health service utilization. In countries with a gatekeeping system, general practitioners (GPs) have a fixed practice population and they are normally the first point of contact with the health care system. Routinely recorded electronic health care data in primary care may be used to develop early detection models or estimate population prevalences for diseases such as pSS, defined as "complex with rare complications," 10 and in general, to study the disease in a "real life" situation, outside the setting of a specialized clinical center. 11 In the Netherlands, and in many other countries in Europe (eg, the United Kingdom, Italy, and Spain), primary care practices use an EHR system to record the care delivered to their patients and the health problems presented. 12 The diagnoses that are recorded may have been made by the GP, but also by providers in other sectors of the health care system, such as medical specialists. For many diseases, GPs are unable to diagnose the patients themselves, so patients are referred to a medical specialist for diagnosis and treatment. Diagnoses recorded in the GP EHR data are therefore not necessarily diagnoses made by the GPs but also include those of other healthcare specialists.
Two characteristics make it worthwhile to investigate primary care EHR data in relation to Sjögren's syndrome:
1. The GP is the first point of contact with the health care system. This allows us to identify the patient's first symptoms and to analyze the care trajectories that eventually lead to the diagnosis of Sjögren and its treatment in primary care and eventually in secondary care.
2. There is a fixed patient list. This means that the data recorded in primary care are population based and that there is an epidemiological denominator available.
One of the difficulties in identifying patients with pSS, or any other relatively rare disease, from EHRs is the coding system used in primary care. GPs in the Netherlands use the International Classification of Primary Care (ICPC) coding system to record diagnoses and symptoms. The ICPC coding system was especially devised for primary care settings. 13 In contrast with for example the International Classification of Diseases (ICD 14 ) coding system used in secondary care, ICPC has separate entries for symptoms (such as belly ache) and for diagnoses (such as urinary tract infection). However, as there are only about 700 separate entries, the level of granularity of ICPC coded primary care records is lower than that of the ICD coded records in secondary care. 15 Due to the low granularity of the ICPC coding system, there is no separate ICPC code for pSS. pSS is recorded under "Musculoskeletal disease other (L99)," as are for example Systemic Lupus Erythematosus and Systemic Sclerosis, which are autoimmune disorders that can occur in association with Sjögren's syndrome. 16 An important consequence for our purposes is the fact that there is no simple way to identify pSS patients from primary care EHRs and no gold standard available to validate any patient selection made based on alternative rules or criteria.
However, this information may be available from other sources, such as insurance claims data from secondary care.

| Secondary care data
As the GP is the first point of contact in the Netherlands, undiagnosed patients will first visit their GP with any complaints typical for Sjögren's syndrome. When the GP suspects Sjögren's syndrome, the GP will refer the patient to the rheumatologist, internist, or ophthalmologist for specialized care and diagnosis. After formal diagnosis, general care for Sjögren's patients consists of follow-up appointments (medical checkups) with the medical specialist and symptomatic treatment (eg, artificial tears or artificial saliva to reduce the symptoms of dryness). After the first prescription of these drugs by the specialist, repeat prescriptions are generally issued by the GP. The medical specialist informs the GP of the diagnosis made, which is then included by the GP in the patient's primary care EHR. The fact that all suspected Sjögren's patients are eventually referred to secondary care for diagnosis and treatment means that all Sjögren's patients should ultimately show up in secondary care records.
Diagnostic information can be retrieved from hospital claims data using two classification systems: the diagnosis related groups (DRG) for hospital reimbursements and the aforementioned ICD coding system for diseases.
Both systems contain explicit codes for Sjögren's disease.
This study investigates to what extent routinely recorded EHR data can be used to identify patients with complex diseases. To this end, we first examined how formal inclusion and exclusion criteria for pSS could be translated into a computational phenotyping algorithm to identify pSS patients in primary care. As the primary care data do not contain a gold standard to validate the algorithm, we secondly assessed whether secondary care data could be used as an alternative validation method, by comparing the resulting patient selection with DRG and ICD codes retrieved from hospital claims data. In order to assess the overall fitness for purpose of routinely recorded health care data for the identification of patients with complex diseases such as pSS, we finally compared prevalence rates and patients' demographic characteristics to those reported in literature.

For this study, data were extracted for the years 2006 to 2017, containing consultations, diagnoses, prescriptions, referrals, and patient characteristics. 17 Diagnoses are recorded routinely in general practices and GPs use the ICPC classification system. Due to privacy regulations, the database contains no information stored in free text fields, apart from the titles of the disease episodes. This project has been approved by the governance bodies of Nivel PCD under No. NZR-00317.057.

| Hospital claims database
The national claims data set is provided by Diagnosis Related Groups Information System (DIS) and is accessible and linkable through Statistics Netherlands, a government institution that makes data available for policy development and scientific research. The data set includes claims data, using the DRG classification system for hospital reimbursements, 18 for all hospitals in the Netherlands.

| Population
In the Netherlands, all non-institutionalized inhabitants are compulso-

| Developing the algorithm
The first aim of this study was to assess whether primary care electronic health care data could be used to identify pSS patients from primary care electronic health care records. The formal ACR/EULAR classification criteria for pSS were used as a starting point to define inclusion and exclusion criteria for patient selection, but additional information available from the primary care database, such as drug prescriptions and disease episode titles, was also explored.

Inclusion criteria
The ACR/EULAR criteria 6 include patients who report at least one symptom of ocular or oral dryness and score above a certain threshold on certain weighted criteria items. Ocular or oral dryness is assessed by diagnostic questions regarding recent eye complaints, use of artificial tears, reporting of dry mouth, and difficulty swallowing food. The weighted criteria concern labial salivary gland histopathology, anti-SSA/Ro antibodies, ocular staining score, Schirmer's test, and unstimulated whole saliva flow rate.

Exclusion criteria
The ACR/EULAR criteria 6 exclude patients with a prior diagnosis of any of the following conditions: history of head and neck radiation treatment, active hepatitis C infection (with confirmation by polymerase chain reaction), AIDS, sarcoidosis, amyloidosis, graft-vs-host disease, or IgG4-related disease.

Secondary Sjögren's syndrome
In order to distinguish specifically primary Sjögren's syndrome, Systemic Lupus Erythematosus, Systemic Sclerosis, and Rheumatoid Arthritis should additionally be excluded. 16

| Data recorded in primary care
To identify the possible pSS patients, the formal criteria were translated into a set of rules relating to coded diagnoses, comorbidities, and diagnostic test results. We additionally explored drug prescriptions and disease episode titles. These rules were applied in the form of automated queries on the database. Except for the disease episodes title, no free text fields could be used.

ICPC codes
In secondary care, patients with pre-specified diseases can be included and excluded using ICD-10 codes. To apply the ACR/EULAR criteria to primary care data, the ICD-10 codes were converted to the corresponding ICPC codes using the WHOFIC Thesaurus ICPC2-ICD10. 19 The resulting ICPC codes were applied to ICPC-coded GP contacts (eg, consultations, prescriptions) and disease episodes.

Diagnostic test results
The ACR/EULAR criteria 6

Prescriptions
The prescriptions in Dutch primary care are coded using the international Anatomical Therapeutic Chemical (ATC) Classification system for medicines. 21 In order to strengthen the patient selection, we examined the use of certain medication known to be frequently used by pSS patients. 22 These are:
• Artificial tears (ATC S01XA20)
• Hydroxychloroquine (ATC P01BA02)
Especially the combined use of artificial tears, hydroxychloroquine, and pilocarpine was expected to be a strong indicator of pSS.
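As an illustration of how such a combined-use rule could be expressed as an automated query, the following is a minimal sketch in Python. The record layout (pairs of patient id and ATC code), the function name, and the toy records are assumptions made for illustration; they are not taken from Nivel PCD. Only the two ATC codes listed above are used, and no code is assumed for pilocarpine.

```python
from collections import defaultdict

# ATC codes named in the text. Any further code (eg, for pilocarpine) would
# have to be supplied by the reader and is deliberately not assumed here.
ARTIFICIAL_TEARS = "S01XA20"
HYDROXYCHLOROQUINE = "P01BA02"


def patients_with_combined_use(prescriptions, required_codes):
    """Return the ids of patients who were prescribed every required ATC code.

    prescriptions: iterable of (patient_id, atc_code) pairs (hypothetical layout).
    """
    codes_per_patient = defaultdict(set)
    for patient_id, atc_code in prescriptions:
        codes_per_patient[patient_id].add(atc_code)
    required = set(required_codes)
    # A patient qualifies only if the required codes are a subset of their codes.
    return {pid for pid, codes in codes_per_patient.items() if required <= codes}


# Toy records, invented for illustration only.
toy_prescriptions = [
    ("p1", "S01XA20"),
    ("p1", "P01BA02"),
    ("p2", "S01XA20"),
    ("p3", "P01BA02"),
]
combined = patients_with_combined_use(
    toy_prescriptions, [ARTIFICIAL_TEARS, HYDROXYCHLOROQUINE]
)
```

In this toy example only patient "p1" has both prescriptions. The sketch ignores prescription dates, so it does not distinguish concurrent from sequential use.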

Disease episode titles
Finally, a text query was applied to all disease episode titles recorded between 2006 and 2017. The text query was based on a number of spelling variations of the word "Sjögren" (namely "sjogren," "sjorgen," "sjogern," "sjögren," "sjorgren," "sjoegren," "sogren") in the disease episode titles in Nivel PCD. Two of the authors then manually checked the text strings of the cases found and scored whether they described primary Sjögren's syndrome, using three categories: (1) "primary Sjögren"; (2) "perhaps primary Sjögren"; or (3) "not Sjögren" or explicitly "secondary Sjögren." All cases in which the term Sjögren was followed by a question mark were assigned to category 2. Cases explicitly described as secondary were scored as category 3. A score of 1, however, does not necessarily mean that the case is indeed a primary Sjögren case.
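The text query described above can be sketched as follows. This is a minimal illustration in Python, not the query that was actually run on Nivel PCD: the function names are hypothetical, and case-insensitive matching and Unicode normalization are assumptions. The spelling variants are the ones listed in the text. The sketch only pre-selects candidate titles and flags a trailing question mark; the manual scoring by the authors is not replaced by it.

```python
import re
import unicodedata

# Spelling variants of "Sjögren" listed in the text.
VARIANTS = ["sjogren", "sjorgen", "sjogern", "sjögren", "sjorgren", "sjoegren", "sogren"]

_ALTERNATION = "|".join(re.escape(v) for v in VARIANTS)
# Any variant anywhere in the title (case-insensitive: an assumption).
MENTION = re.compile(_ALTERNATION, re.IGNORECASE)
# A variant directly followed by a question mark, eg "Sjögren?".
UNCERTAIN = re.compile(r"(?:" + _ALTERNATION + r")\s*\?", re.IGNORECASE)


def _normalize(title):
    # Compose characters so that "o" + combining diaeresis equals "ö".
    return unicodedata.normalize("NFC", title)


def mentions_sjogren(title):
    """True if the episode title contains one of the spelling variants."""
    return MENTION.search(_normalize(title)) is not None


def is_marked_uncertain(title):
    """True if a spelling variant is directly followed by a question mark."""
    return UNCERTAIN.search(_normalize(title)) is not None
```

Because the variants are matched as substrings, a short variant such as "sogren" can also match inside unrelated words, which is one reason a manual check of the retrieved titles remains necessary.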

| Validating the algorithm
As there is no formal diagnosis available to use as a gold standard to validate the developed algorithm, the second aim of this study was to assess to what extent hospital claims data, which contain more fine-grained DRG treatment and ICD-10 diagnosis codes for Sjögren, might be suitable as an alternative validation method. We additionally compared prevalence rates and demographic characteristics of pSS patients identified in primary and secondary care with those reported in literature.

| Validation scores
Based on the linked data set, it is possible to compare the pSS patients found with the algorithm from Nivel PCD with formal diagnoses based on recorded DRGs and ICD-10 codes related to Sjögren within the health insurance claims data set. Based on the combined data sets, each patient is flagged as a true positive, true negative, false positive, or false negative, as shown in Table 1:
• True positives (Tp): labeled as pSS by the algorithm and confirmed based on DRG codes in the secondary care claims database.
• True negatives (Tn): not labeled as pSS by the algorithm (hence not included in our data set), confirmed based on the absence of a DRG code related to Sjögren in the secondary care claims database.
• False positives (Fp): labeled as pSS by the algorithm but not confirmed based on DRG codes in the secondary care claims database.
• False negatives (Fn): not labeled as pSS by the algorithm, but with DRG codes related to Sjögren recorded in the secondary care claims database.
The total number of Tps, Tns, Fps, and Fns can be used to calculate the accuracy and other performance scores of the algorithm:
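The formulas themselves are not reproduced here. As a hedged illustration, the sketch below computes accuracy together with sensitivity, specificity, and positive predictive value from the four counts, using their standard textbook definitions. Which scores other than accuracy were used in the study is an assumption, and the counts in the example are invented and do not correspond to the study data.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def performance_scores(tp, tn, fp, fn):
    """Standard confusion-matrix scores from the counts of Tp, Tn, Fp, and Fn."""
    return {
        # Share of all patients that the algorithm classified correctly.
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        # Share of claims-confirmed patients that the algorithm found.
        "sensitivity": _ratio(tp, tp + fn),
        # Share of patients without a Sjögren claim that the algorithm left out.
        "specificity": _ratio(tn, tn + fp),
        # Share of algorithm-selected patients that the claims data confirm.
        "ppv": _ratio(tp, tp + fp),
    }


# Invented counts, for illustration only.
scores = performance_scores(tp=40, tn=900, fp=10, fn=50)
```

With a rare disease the true negatives dominate, so accuracy can be high even when sensitivity or positive predictive value is modest; reporting the scores side by side avoids that misreading.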

| Prescriptions
In addition to the ICPC codes, we assessed the use of Artificial tears,  Table 2 remained. The (combined) prescription use thus does not seem to be a feasible selection criterion for pSS.
To gain insight in the prescriptions that were used a lot by possible pSS patients, Table 3  The distinction between primary and secondary Sjögren was not often explicitly made in the episode texts. For only 71 patients Sjögren was specifically defined as primary (indicated by "prim," "prim.," or "primary") and for 65 patients as secondary (indicated by "sec," "sec.", or "secondary"). When the GP was unsure of a patient having Sjögren this was often indicated by a question mark: for example, "Sjögren?" (N = 348). However, as the episode titles are free text fields, the variation in used text strings was high and each patient was assigned to one of the categories manually by taking into account the complete textual context.
Text strings interpreted as primary Sjögren were mainly clear and short statements such as "m. Sjögren," "morbus Sjögren," and "Sjögren's syndrome," without the mention of any secondary diseases (Table 2). Text strings interpreted as perhaps Sjögren contained words such as "suspicion of Sjögren" or "possibly Sjögren," or the use of a question mark. Text strings interpreted as not or secondary Sjögren

| Final algorithm
The selection criteria based on the formal ACR/EULAR classification criteria (listed in Table 2), combined with the mention of "Sjögren" The flowchart in Figure 1 shows the inclusion and exclusion rules

| Validating the algorithm
FIGURE 1 Flowchart of inclusion and exclusion criteria applied to Nivel PCD data (cumulative numbers)

The claims data indicate that on average around 54 000 unique patients per year visit the hospital for a treatment recorded under one of the Sjögren related DRGs or the ICD-10 code for Sjögren. Based on the estimated global prevalence of 61 per 100 000 7 and a total Dutch population of 17 million, we would expect slightly over 10 000 patients. Table 4  For the patients for which no Sjögren ICD code was recorded, the ICD code was mainly missing, or referred to an "Unspecified Illness"
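The expected number of patients mentioned above follows from a simple calculation, written out here as a short Python sketch. The figures are the ones given in the text (61 per 100 000, a population of 17 million, and around 54 000 unique patients per year in the claims data); the variable names are illustrative.

```python
GLOBAL_PREVALENCE_PER_100K = 61   # global prevalence estimate cited in the text
DUTCH_POPULATION = 17_000_000     # total Dutch population, as stated in the text
OBSERVED_PER_YEAR = 54_000        # approximate yearly patients in the claims data

# Expected number of patients if the global prevalence applied to the Netherlands.
expected_patients = DUTCH_POPULATION * GLOBAL_PREVALENCE_PER_100K / 100_000

# How many times larger the claims-based count is than the expected count.
excess_ratio = OBSERVED_PER_YEAR / expected_patients
```

The expected count is 10 370 patients, so the number of patients observed in the claims data is roughly five times higher than expected.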

| Validation scores
The matrix in Table 5

| Patient characteristics
Table 6 shows the mean age and gender of the total population included in Nivel PCD and for the selected pSS patients over the years. It also shows the number of new and known pSS patients for each year. The number of new patients in a given year is the number of patients for which "Sjögren" was mentioned for the first time in the ICPC episode title in that year. The total number of patients in a given year is the number of new patients in that year added to the number of patients known from previous years.
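The counting rule for new, known, and total patients can be sketched as follows. This is a minimal illustration in Python with a hypothetical input layout (a mapping from patient id to the year of the first mention of "Sjögren") and invented toy data. It assumes, as a simplification, that patients remain in the database once known, and it leaves out patients whose first year is unknown.

```python
from collections import Counter


def yearly_counts(first_mention_year, years):
    """Count new, known, and total patients per year.

    first_mention_year: dict mapping patient id to the year in which
        "Sjögren" was first mentioned in an episode title (hypothetical layout).
    years: the years to report, in ascending order.
    """
    new_per_year = Counter(first_mention_year.values())
    rows = []
    known = 0  # patients first mentioned in an earlier reported year
    for year in years:
        new = new_per_year.get(year, 0)
        rows.append({"year": year, "new": new, "known": known, "total": new + known})
        known += new  # this year's new patients are known from next year on
    return rows


# Toy data, invented for illustration only.
toy_first_mention = {"a": 2006, "b": 2006, "c": 2007, "d": 2009}
rows = yearly_counts(toy_first_mention, range(2006, 2010))
```

In the toy data the totals grow from 2 patients in 2006 to 4 patients in 2009, with no new patients in 2008.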
As the first year of diagnosis, we used the first date on which a record was found in the journal in which "Sjögren" was mentioned in the ICPC text episode title. This date was unknown for 810 pSS patients, probably because the diagnosis was made before the patient had been listed as patient in the practice for which data are included in the database. For the patients for which this date could be retrieved, the majority was between 50 and 70 years of age at the first year of diagnosis (see Figure 3), with a mean of 65.8 years (SD = 15.1).
5. In spite of claims regulations, DRG groups in claims data may not represent true pSS patients.

| Prevalence rates
In future research we will first focus on exploring and confirming these possible explanations by comparing the primary and secondary care characteristics of the confirmed and unconfirmed pSS patients.
Second, we aim to fine-tune the patient selection algorithm for primary care and the resulting patient selection by studying the characteristics of the pSS patients that were included in the DIS database but that were not found in Nivel PCD based on the initial selection criteria. This may result in additional pSS identifiers in primary care, to be implemented in an improved, more precise algorithm for the selection of pSS patients in general practice. Third, we will develop a timeline displaying the average combined primary and secondary care trajectory of pSS patients in the Netherlands, using the linked Nivel PCD and DIS data of the confirmed pSS patients. This timeline will provide more insight into the used healthcare and the diagnostic process.
When looking at the prevalence rates based on the Dutch primary care database, we see that the average prevalence based on our final algorithm (0.7‰) is comparable to the global population prevalence of 0.61‰ reported by Qin et al. 7 Our mean age at diagnosis (Figure 3) is comparable to the average age of 56.16 years reported by Qin et al. 7 The female:male ratio in our sample is 7:1, which is to be expected as pSS primarily affects peri- and postmenopausal women. Our female:male ratio is lower than the ratio in the prevalence data reported by of pSS patients in secondary care, however, highly exceeded the number expected based on the general population prevalence. Even when using only ICD-10 codes, which might be a more accurate source of diagnostic information, the prevalence found for the Netherlands still exceeds global estimates. There is not enough information to assess whether this discrepancy can be attributed to the sources and methods used to identify pSS patients in secondary care or to the possibility that the global prevalence rates reported in the literature might not be accurate for the Netherlands. This has a major impact on our study results in that it is unclear whether insurance claims records are a suitable source against which to compare and confirm the results obtained from primary care data and, consequently, we cannot draw unambiguous conclusions regarding the quality of our patient selection and the developed phenotyping algorithm.
This study shows the possibilities of using EHR data for studying complex medical conditions. It is clear that population-based health records provide a lot of longitudinal medical information and insight in the use of care for a large range of diseases. However, the study of patients with low prevalence, uncoded diseases is more challenging, as those cannot be as easily identified from primary care data as patients with more general diseases. The lack of a granular coding system for symptoms and diseases also makes it difficult to apply diagnostic criteria used in secondary care to data recorded in primary care. The possibility to link primary to secondary care databases on patient level allows one to (iteratively) try different patient selection algorithms and compare those to patients referred to specialized care, and to study patient and care characteristics in primary care of patients thus far only known in secondary care. As such, these combined medical data should be considered a rich source of information for the epidemiological study of low prevalence, complex diseases, patients' early symptoms, diagnosis paths, and overall treatment trajectories in primary and eventually secondary care. However, without the formal diagnostic information required to validate the developed phenotyping algorithm and patient selection, we have insufficient information to affirm that routine EHR data is fit for the identification and study of patients with complex diseases such as pSS.