Consistency between the National Patient Register and the Swedish Cancer Register

The Swedish National Patient Register (NPR) is widely used as a data source in epidemiological studies, but the consistency of its cancer diagnoses with those in the Swedish Cancer Register (SCR) remains unclear. Using NPR as a supplementary source for detecting safety signals is beneficial because its data extraction delays are shorter than when SCR is used alone. This study aims to evaluate the consistency of NPR cancer diagnoses compared to SCR and the potential use of NPR in pharmacoepidemiology.

• This study evaluated the consistency of cancer diagnoses in the National Patient Register compared with the Swedish Cancer Register using accuracy measures.
• The consistency depended on cancer site. Higher consistency was observed for breast, male genital, and oral cancers. Lower consistency was observed for female genital cancers and for ill-defined, secondary, and unspecified sites.
• The findings from this study may support the use of the National Patient Register as a supplementary data source to the Swedish Cancer Register to identify cancer outcomes in a timely manner and provide preliminary results for drug safety studies.

Plain Language Summary
The consistency of cancer diagnoses between the Swedish National Patient Register (NPR) and the Swedish Cancer Register (SCR) remains unclear. This study aims to evaluate the consistency between the two registers for cancer diagnoses and its potential use in pharmacoepidemiology.
Patients with a cancer diagnosis recorded in SCR during 2018-2020 were included. Positive predictive value (PPV) and sensitivity were calculated using SCR as the gold standard. As an example of differences in identification of cancer diagnoses between the registers, two nested case-control studies for the association between antidiabetic medications and pancreatic cancer were repeated using the two registers. Conditional logistic regression was performed and the 95% confidence intervals (CI) for the odds ratios (OR) were checked for overlaps. For breast, male genital organs, and oral cancers consistency was high (PPV: 87.5%-97.4%, sensitivity: 82.2%-91.0%), while for female genital organs, thyroid, and ill-defined, secondary, and unspecified sites cancers it was low (PPV: 8.8%-90.0%, sensitivity: 19.9%-32.3%). All the CIs from the nested case-control studies overlapped when pancreatic cancer was identified in NPR or SCR.
Differences in diagnostic processes and coding of cancer in the two registers may explain part of the results.

| INTRODUCTION
Register data is widely used in pharmacoepidemiological research to identify safety outcomes 1 including the risk of cancer, but cancer outcomes may have a delayed onset after drug use. 2 In pharmacoepidemiological studies, data quality is important as it heavily impacts the validity of the estimated associations. 1 In Sweden, data on cancer diagnoses are recorded in the National Patient Register (NPR) and the Swedish Cancer Register (SCR), both population-based registers. NPR was launched in the 1960s 3 and it records one main diagnosis and up to 30 secondary diagnoses, where the same cancer diagnosis may be recorded repeatedly over time, that is, suspected cancers and cancers under treatment. Data on diagnoses registered in NPR for a specific month becomes available for data extraction in the three following months. 4 Data availability to detect safety outcomes in a timely manner is a crucial aspect of observational studies. However, not many pharmacoepidemiological studies use data from NPR to detect safety outcomes in terms of cancer development due to the lack of evidence on data quality of cancer diagnoses in the register. Therefore, if NPR is validated in terms of cancer diagnoses, preliminary results on cancer risk can be provided in a timelier manner when conducting pharmacoepidemiological safety studies investigating cancer outcomes. SCR was established in 1958 to map the incidence and change of cancer over time. 5 The quality of data in SCR is very high, with almost 99% of all cancer diagnoses being morphologically verified and with comprehensive quality assurance being conducted at the regional cancer centers before data is sent to the National Board of Health and Welfare. However, SCR contains detailed information about cancer, including TNM (tumor, node, metastasis) and pathology codes, which requires additional time before the data can be available for research.
Diagnoses registered in SCR for a specific year are not available for data extraction until December of the following year at the earliest. 4 Therefore, the use of SCR to identify suspected cancer signals without delay is limited. A review of studies assessing the validity of inpatient diagnoses included in NPR showed PPVs ranging between 85% and 90% for a variety of diseases, but limited evidence was shown with respect to cancer diagnoses. 6 Some studies have compared cancer diagnoses recorded in SCR with those in the Cause of Death Register or the Hospital Discharge Register; however, these studies used data from 1998 or focused only on death cases. 7,8 To the best of our knowledge, no study has validated diagnoses of either various cancer types or specific cancer types registered in NPR when compared to SCR. Since many specific cancer sites have already been validated, we decided to investigate 12 groups of cancers based on the consistency at the 3-position code level and to present them at a higher level to provide an overview of the consistency between NPR and SCR.
One empirical example of a safety study in pharmacoepidemiology is related to the use of antidiabetic medications (ADMs) and risk of pancreatic cancer. [10][11][12][13][14][15][16] Even though using SCR in safety studies is considered the "gold standard," the timeline for such studies may not align with the data availability of SCR due to longer data production times. Pancreatic cancer was chosen as an empirical example since it is a common endpoint in safety studies of ADM.
The aims of this study were primarily to evaluate the consistency of cancer diagnoses in NPR as compared to SCR, and secondarily to provide an empirical example illustrating the impact of the choice of data source (NPR or SCR) on outcome identification when estimating associations between ADM use and pancreatic cancer.

| Primary aim
A cross-sectional study design was used with SCR as gold standard for calculating accuracy measures due to its national coverage and high completeness. 17 All patients with a cancer diagnosis, C00-C80 (excluding C43-C44) according to the third revision of the International Classification of Diseases for Oncology (ICD-O/3) codes, recorded in SCR between 1 January 2018 and 30 September 2020 were included.

| Secondary aim
Two nested case-control studies were performed, one using SCR and one using NPR, to estimate the association between ADM use and pancreatic cancer. The study population included patients with any cancer diagnosis registered in NPR or SCR between 1 January 2018 and 31 December 2020. The assessment period for all variables of interest was from 1 January 2011 to 31 December 2020 (Figure 1). The outcome was identified between 1 January 2018 and 31 December 2020, and since cancer outcomes may have a delayed onset, the exposure status was assessed before a lag time period. 18 Cases were patients with incident pancreatic cancer registered in NPR or SCR. The date of the diagnosis of pancreatic cancer was defined as the index date. Patients were excluded when they were not residing in Sweden during the whole assessment period, when their cancer diagnoses were found at autopsy or on the day of death, and when there were no matched controls (cases only) due to lack of follow-up time (Figure 2). Controls were selected from patients with cancer other than pancreatic cancer registered in NPR or SCR, and the same index date and exclusion criteria as for cases were applied. Each patient in the case group was matched with 5 controls from the same register, NPR or SCR, by age and sex.
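As an illustration only, the 1:5 matching step could be sketched in Python as follows. The record layout (`id`, `age`, `sex` keys), exact matching on age, sampling without replacement, and the fixed random seed are all assumptions made for this sketch; the text does not specify these details, and further criteria such as the shared index date are omitted here.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, ratio=5, seed=0):
    """Match each case to up to `ratio` controls with the same age and sex.

    cases, controls: lists of dicts with hypothetical keys "id", "age", "sex".
    Controls are drawn without replacement (an assumption of this sketch).
    Returns a dict mapping case id -> list of matched control ids.
    """
    rng = random.Random(seed)

    # Group candidate controls by the matching variables.
    pool = defaultdict(list)
    for control in controls:
        pool[(control["age"], control["sex"])].append(control["id"])
    for ids in pool.values():
        rng.shuffle(ids)

    matched = {}
    for case in cases:
        ids = pool[(case["age"], case["sex"])]
        n_take = min(ratio, len(ids))  # fewer than `ratio` if the pool runs out
        matched[case["id"]] = [ids.pop() for _ in range(n_take)]
    return matched
```

A case whose stratum has no remaining controls receives an empty list, which corresponds to the "cases only" situation that the text lists as an exclusion criterion.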

| Primary aim
The cancer outcome was defined using the 10th revision of the International Classification of Diseases (ICD-10) codes in NPR and using the ICD-O/3 codes in SCR. Cancer diagnoses were categorized into 12 groups by the location of the tumor according to the blocks within the malignancy chapter in ICD-10 19 and were matched with corresponding ICD-O/3 codes using the ICD Coding in the Cancer Registry from The Swedish National Board of Health and Welfare 20 (Table S1).
The following cancer diagnoses recorded in ICD-10 were not included since there are no one-to-one corresponding ICD-O/3 codes: lymphoid, hematopoietic, and multiple sites (C81-C96) and skin (C43-C44). A cancer diagnosis recorded in SCR was considered consistent between the two registers if the same 3-position code (e.g., C34) was found in NPR within ±90 days from the date in SCR.
FIGURE 1 Assessment windows for the variables used in the secondary aim. SES, socioeconomic status.
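The ±90-day consistency rule described above can be sketched as a small Python function. This is a simplified illustration, not the study's implementation: it assumes the SCR code has already been mapped to its ICD-10 equivalent (the study used Table S1 for this), that the window boundary is inclusive, and that NPR records arrive as hypothetical `(code, date)` tuples for a single patient.

```python
from datetime import date


def is_consistent(scr_code, scr_date, npr_records, window_days=90):
    """Return True if NPR holds the same 3-position code within the window.

    scr_code: code from SCR, assumed already mapped to ICD-10 (e.g. "C34.1").
    scr_date: datetime.date of the SCR diagnosis.
    npr_records: iterable of (icd10_code, diagnosis_date) for one patient.
    window_days: half-width of the window; the boundary is treated as
        inclusive, which is an assumption of this sketch.
    """
    target = scr_code[:3]  # 3-position level, e.g. "C34"
    return any(
        code[:3] == target and abs((npr_date - scr_date).days) <= window_days
        for code, npr_date in npr_records
    )


# Hypothetical patient: lung cancer in SCR, one NPR record 75 days later.
npr = [("C34.9", date(2019, 5, 15))]
print(is_consistent("C34.1", date(2019, 3, 1), npr))                  # ±90 days
print(is_consistent("C34.1", date(2019, 3, 1), npr, window_days=30))  # ±30 days
```

Passing `window_days=30` reproduces the narrower window used in the sensitivity analysis.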

| Secondary aim
The outcome was defined by the first diagnosis of pancreatic cancer. Exposure was defined from filled prescriptions of ADMs registered in the Swedish Prescribed Drug Register (PDR). In total, 14 types of ADMs were identified according to the Anatomical Therapeutic Chemical (ATC) codes (Table S2). Patients were defined as exposed when they had at least one filled prescription of ADMs in the exposure assessment year.
Data on covariates was collected from Swedish national registers by linkage of unique personal identifiers (Table S3). The Longitudinal integrated database for health insurance and labor market studies was used to collect sociodemographic data, that is, income, education, employment, marital status, and residential county. Unavailable lifestyle factors including smoking and alcohol use were proxied by related diseases (chronic obstructive pulmonary disease and alcohol disorders) and medications used for alcohol dependency. Diagnoses for diabetes, comorbidities, and number of inpatient and outpatient care visits were identified in NPR. Comorbidities measured during the 5-year period before the exposure assessment were CVDs (heart failure, hypertensive diseases, ischemic heart diseases, and cerebrovascular diseases), cholelithiasis, and obesity. Diabetes was also measured in the 5-year window. From PDR, data on other filled prescriptions coded by ATC codes was identified during the most recent year prior to the exposure (ADMs) assessment period.
Other medications included were acetylsalicylic acid, nonsteroidal anti-inflammatory drugs, selective serotonin reuptake inhibitors, and hydroxymethylglutaryl-CoA (HMG-CoA) reductase inhibitors (statins). Additionally, the total number of filled prescriptions during the year was identified.

| Primary aim
Frequencies of cancers registered in NPR and SCR were compared by group, using proportions of cancer diagnoses in each register.
To assess the consistency of cancer diagnoses in NPR compared to SCR, positive predictive value (PPV), sensitivity, false positive rate (FPR), and false negative rate (FNR) were calculated as measures of accuracy. The PPVs were calculated as the proportions of patients registered in NPR whose cancer was identified in SCR.
Sensitivity was calculated as the proportion of patients who had a cancer diagnosis in both NPR and SCR among all cancer cases in SCR. [22][23][24][25][26] The FPR corresponded to the proportion of patients whose diagnosis was recorded in NPR but not in SCR, or vice versa for the FNR.
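A minimal sketch of the two measures whose denominators are stated above, PPV and sensitivity, with SCR treated as the gold standard. The function name is hypothetical and the counts in the usage example are invented for illustration; they are not study data. FPR and FNR are left out because the text does not state their denominators explicitly.

```python
def ppv_and_sensitivity(n_both, n_npr_only, n_scr_only):
    """Compute PPV and sensitivity of NPR against SCR.

    n_both: patients with the diagnosis in both NPR and SCR.
    n_npr_only: patients with the diagnosis in NPR but not in SCR.
    n_scr_only: patients with the diagnosis in SCR but not in NPR.
    """
    n_npr = n_both + n_npr_only  # all patients registered in NPR
    n_scr = n_both + n_scr_only  # all cancer cases in SCR
    if n_npr == 0 or n_scr == 0:
        raise ValueError("each register needs at least one diagnosis")
    ppv = n_both / n_npr
    sensitivity = n_both / n_scr
    return ppv, sensitivity


# Illustrative counts only (not taken from the study).
ppv, sens = ppv_and_sensitivity(n_both=90, n_npr_only=10, n_scr_only=30)
print(f"PPV = {ppv:.1%}, sensitivity = {sens:.1%}")
```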
A sensitivity analysis was performed to assess whether different lengths of the assessment window for the cancer outcome in NPR affected the accuracy measures. The calculations of each accuracy measure were repeated with the assessment window reduced from within ±90 days to within ±30 days.
FIGURE 2 Flow-chart of the inclusion process for the secondary aim. The number of matched controls does not correspond to the number of matched cases due to the lack of follow-up time for four matched controls.

| Secondary aim
Multivariable conditional logistic regression models were used to estimate associations between pancreatic cancer and ADM use with crude and adjusted odds ratios (ORs). Estimation was repeated for outcomes identified from NPR or SCR. Consistency between results was determined by overlapping 95% CIs for the ORs. When the 95% CIs were overlapping, it was assumed that the choice of data source did not strongly impact the estimated association. The magnitude of the CI overlaps was determined by the length of the overlapping part relative to the total range, defined as the maximum upper confidence limit (from both confidence intervals) minus the minimum lower confidence limit. Missing values that were identified in some confounding variables, especially related to socioeconomic status, were not imputed but categorized into a separate "missing" category and included in the analysis. All covariates were included in the adjusted models.
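The overlap magnitude defined above can be written as a short function. This is a sketch under the assumption that both intervals are supplied on the scale on which they are reported (the text does not say whether a log scale was used); the function name and the example intervals are hypothetical.

```python
def ci_overlap_fraction(ci_a, ci_b):
    """Length of the overlap of two CIs relative to their total range.

    ci_a, ci_b: (lower, upper) confidence limits.
    Total range = max of the upper limits minus min of the lower limits.
    Returns 0.0 when the intervals do not overlap.
    """
    lower_a, upper_a = ci_a
    lower_b, upper_b = ci_b
    total = max(upper_a, upper_b) - min(lower_a, lower_b)
    if total <= 0:
        raise ValueError("intervals must span a positive range")
    overlap = min(upper_a, upper_b) - max(lower_a, lower_b)
    return max(overlap, 0.0) / total


# Hypothetical intervals: one from each register for the same exposure.
print(ci_overlap_fraction((0.8, 1.2), (1.0, 1.4)))
```

Identical intervals give a fraction of 1, and disjoint intervals give 0, so the measure ranges from no agreement to complete agreement of the two intervals.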
As a sensitivity analysis, data from 2020 was excluded to assess whether the lower number of cancer records observed during the COVID-19 pandemic had an impact on the recording of pancreatic cancer in NPR and SCR and on the consistency between estimates of associations.
All analyses were performed using STATA version 16 (StataCorp, College Station, TX, USA), and data management was done using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA).

| Primary aim
In total, 157 594 cancer diagnoses in SCR and 123 654 cancer diagnoses in NPR were identified (Table 1). The 3 most common cancer types in SCR were digestive organs, female genital organs, and male genital organs. In NPR, cancers of digestive organs, male genital organs, and breast were the most common cancer types.
The results from the sensitivity analysis with a narrower assessment window for the cancer outcomes in NPR showed that, for all cancer types, the PPV was still stable but sensitivity and FPR decreased while FNR increased (Table S4).

| Secondary aim
In total, 4246 patients with pancreatic cancer from SCR and 5593 patients from NPR were identified between 2018 and 2020 and included in the analysis along with 21 230 and 27 961 matched controls, respectively (Table 2). Due to the matching by age and sex, these demographic characteristics were homogeneously distributed among cases and controls. However, cases, as compared to controls, were slightly more often employed, more often had any type of diabetes and other comorbidities, and were more often prescribed more than 3 medications, regardless of the register used to identify the outcome.
When compared to SCR, NPR had a higher proportion of people aged 80+ years, with lower educational level, and with missing data for marital status and region.
TABLE 1 Number of cancer diagnoses and proportions in the Swedish Cancer Register (SCR) and the National Patient Register (NPR), sensitivity, positive predictive value (PPV), false positive rate (FPR), and false negative rate (FNR).
When comparing the estimates by overlapping CIs from the crude and adjusted models obtained using SCR and NPR to define pancreatic cancer (Figure 3), all models showed overlapping CIs. The magnitude of the overlaps was relatively larger (85.6%-87.1%) for long-acting insulins, TZD, and combinations of oral ADMs and relatively smaller (30.1%-36.1%) for alpha glucosidase inhibitors, fast-acting insulins, and sulfonylureas (Table S5).
When investigating whether removing data from 2020 affected the consistency between associations estimated using NPR and SCR (Table S6), overlapping 95% CIs were still found for all associations.

| DISCUSSION
Relatively low PPV and sensitivity and high FPR and FNR were observed for female genital organ cancer, thyroid cancer, and cancer of ill-defined, secondary, and unspecified sites, showing inconsistency of the cancer diagnoses registered in NPR compared to SCR. Conversely, the PPV and the sensitivity were high and the FPR and the FNR were low for breast, male genital organs, and oral cancers, indicating consistency. In the nested case-control studies assessing the association between ADM use and pancreatic cancer as an empirical example, all models showed overlaps of the 95% CIs, implying that there is no major impact on the CIs when identifying outcomes from NPR as compared to outcomes identified from SCR.
The lower consistency for the ill-defined, secondary, and unspecified cancer sites was most likely due to differences in the coding systems of the two registers. Some secondary cancers are described by the behavior code in ICD-O/3 (SCR), thus C77 and C78 in the ICD-10 coding system are classified as local cancers C00-C80 in the ICD-O/3 coding system. 27 This could have increased the number of diagnoses categorized as secondary cancers from NPR. Recurrent cancers are not recorded in SCR since it primarily focuses on reporting incident cancers. Furthermore, information regarding metastasis is only available in SCR when the primary tumor is unidentified, [28][29][30] which may also have led to differences in the recording of these cancers between the two registers.
The inconsistency of recorded cancers for female genital organs in NPR might be related to high-grade squamous intraepithelial lesions, including cervical intraepithelial neoplasia lesions, registered in SCR since 2015. 31 These lesions are not reported in NPR as malignant tumors in ICD-10, but are recorded under cervical cancer in SCR using morphology codes that characterize the neoplasm in ICD-O/3. 20,27 The number of cases with high-grade lesions in cytology between 2013 and 2015 approximately matches the difference between the numbers of diagnoses recorded in SCR and NPR identified in the current study. 32
Discrepancies for thyroid and other endocrine cancers may be due to underreporting from primary care. The majority of thyroid cancers are diagnosed by clinical findings and palpation. 33 Additionally, sensitive diagnostic procedures such as cervical ultrasound have been suggested to incidentally detect asymptomatic small thyroid cancers. 34 Cancers that can be detected by less invasive diagnostic processes such as palpation, imaging, and ultrasound may be detected in primary care or private clinics. However, these cancers are more likely to be underreported in NPR because it does not cover primary care and there is significant non-response from private clinics. 35
The lower incidence of bone and soft tissue cancers reported in SCR compared to NPR may be attributed to issues in the reporting and diagnostic processes within SCR. A previous study highlighted a misconception among some clinicians that only histopathologically confirmed cancers should be reported to SCR. 7 Bone cancers typically require tissue samples for confirmation, 36 and due to the above misconception, cases diagnosed based on clinical findings may be underreported in SCR. Furthermore, cancers not detected at an early stage may not undergo aggressive histopathological diagnosis, surgery, and prognosis due to the highly invasive nature of the diagnostic process. In fact, some cancers associated with poor survival are reported in NPR but not in SCR, and found from autopsy. 37 This scenario may be particularly applicable to bone cancer. Regarding soft tissue cancers, underreporting in SCR has been suggested due to confusion about which clinic or pathology department is responsible for reporting cancer diagnoses when samples are referred to central specialty clinics or laboratories. 7
Higher consistency for some cancer types could be explained by available screening programs and easily implementable screening techniques. Oral and prostate cancer screenings are not part of the national screening program but are easily performed with relatively little invasiveness and might follow the same recording pattern as breast cancer.
FIGURE 3 Odds ratios (ORs) and 95% confidence intervals (CIs) for the associations between antidiabetic medications (ADMs) use and pancreatic cancer when outcomes were defined using the Swedish Cancer Register (SCR) and the National Patient Register (NPR).
The overlaps of all CIs for the analyses conducted for the secondary aim suggest consistency between NPR and SCR. However, this should be interpreted with caution. Previous studies found that about 30% of pancreatic cancers were underreported in SCR compared to NPR. 41,42 Cancers that require a highly invasive diagnostic process, such as pancreatic cancer, may not be histopathologically confirmed in elderly patients. Therefore, even when symptoms of pancreatic cancer are present, the cancer may be registered in NPR as a suspected cancer without pathological confirmation but not recorded in SCR. One study found that 12% of pancreatic cancers were detected incidentally at autopsy. 43 Our study excluded cancers detected at autopsy, and consequently the number of incident pancreatic cancer cases identified in SCR was smaller than the number identified from NPR. Despite differences in the number of cases, our results still indicate that the CIs of the association between pancreatic cancer and ADMs in this example are not largely influenced by the choice of data sources.
Although NPR may lack comprehensive details on cancer pathology, it delivers preliminary results for identification of emerging safety signals faster than SCR when focusing on the cancer incidence and not on the detailed information on TNM or pathology codes.Therefore, the supplementary use of NPR is anticipated in safety studies that require timely preliminary results, serving as an additional tool rather than a substitute.

| Strengths and limitations
Strengths of the present study include the use of data from SCR as the gold standard, which covers the entire population and has a high completeness of cancer diagnoses, contributing to the reliability of the results. Additionally, via the secondary aim this study provided a real-world pharmacoepidemiological example of the impact of using data from NPR for cancer identification on associations estimated in safety studies.
One limitation could be the potential underestimation of true positive cancers not identified within the defined assessment window for the primary aim. There might also be an underestimation of false positives due to the identification of diagnoses from NPR based on the date of diagnoses registered in SCR. Moreover, there might be misclassification of the pancreatic cancer outcome since the associations estimated using NPR diagnoses might have included unconfirmed cancer cases. On the other hand, in the models using SCR there may be misclassified patients with pancreatic cancer in the control groups due to potential under-recording of pancreatic cancer in SCR of approximately 30%. 41,42 Finally, in NPR, specific coding may have been induced due to economic incentives, but it was impossible to detect this aspect in the present study. 44,45
Remaining uncertainties about the mechanisms behind the (in)consistency need to be addressed. To date, several factors are known to influence differences in reporting of incident cancer between NPR and SCR. Examples include unexpected cancer findings from autopsy and underreporting to NPR in some small, especially private, clinics. 43,46 Further research is needed to investigate the differences in reporting rates of each cancer at different clinics, the influence of patient sociodemographic factors on detecting and reporting various cancers, and the reporting process of suspected cancer and cancer under treatment. In addition, to ensure NPR's usefulness in pharmacoepidemiological drug safety studies, the consistency of estimates obtained using NPR and SCR should be evaluated with additional empirical examples of different drug exposures and safety outcomes since the therapeutic area may have an impact on the results.

| CONCLUSION
This study identified high consistency of registered diagnoses in NPR compared to SCR for some cancer types. Moreover, the associations between ADMs and pancreatic cancer when outcomes were identified from NPR or SCR showed fairly overlapping CIs. These findings suggest that NPR can be a valid supplementary data source to provide preliminary results and faster detection of cancer as an outcome.
Differences in diagnostic processes and coding differences between the registers may explain part of the inconsistent results and need to be further investigated.

AUTHOR CONTRIBUTIONS
Sakura Sakakibara, Laura Pazzagli, and Marie Linder contributed to this study, which originated as a master's thesis project for Sakura Sakakibara suggested by Marie Linder. Sakura Sakakibara performed data analysis and drafted the manuscript. Laura Pazzagli played a crucial role in methodological contribution, conceptualization, results interpretation, and manuscript review and editing, as well as supervision.
Marie Linder made substantial contributions to conceptualization, data curation, project administration, results interpretation, manuscript review and editing as well as supervision of the project.The final approval was made by all authors.
Since a cancer diagnosis might have been recorded repeatedly in NPR when a person had symptoms of suspected cancer or was undergoing cancer treatment, an incident cancer diagnosis from NPR was defined as the first diagnosis within the ±90- or ±30-day window from the same diagnosis code recorded in SCR.

Pancreatic cancer was identified by the codes C25.0-C25.9 in ICD-10 for NPR and in ICD-O/3 for SCR. The diagnoses from NPR were limited to the first cancer diagnosis per patient since the same cancer diagnosis might be recorded repeatedly in NPR.
TABLE 2 Characteristics of the study population for the nested case-control studies by national register used to define the outcome pancreatic cancer, the Swedish Cancer Register (SCR) and the National Patient Register (NPR).
80+ years, with lower educational level, and with missing data for marital status and region.
[38][39][40] A positive screening result can easily lead to a subsequent prompt confirmative diagnosis, and both the NPR and SCR registers allow the diagnosis to be registered with little time lag. Breast cancer screening, where the diagnostic process involves mammography imaging followed by confirmative diagnosis by biopsy at healthcare facilities, is also likely to have been recorded in both NPR and SCR. Oral and prostate cancer screenings are not part of the national