Exploring the impact of design criteria for reference sets on performance evaluation of signal detection algorithms: The case of drug–drug interactions

Abstract Purpose To evaluate the impact of multiple design criteria for reference sets that are used to quantitatively assess the performance of pharmacovigilance signal detection algorithms (SDAs) for drug–drug interactions (DDIs). Methods Starting from a large and diversified reference set for two‐way DDIs, we generated custom‐made reference sets of various sizes considering multiple design criteria (e.g., adverse event background prevalence). We assessed differences observed in the performance metrics of three SDAs when applied to FDA Adverse Event Reporting System (FAERS) data. Results For some design criteria, the impact on the performance metrics was negligible for the different SDAs (e.g., theoretical evidence associated with positive controls), while others (e.g., restriction to designated medical events, event background prevalence) seemed to have opposing effects, or effects of different magnitude, on the Area Under the Curve (AUC) and positive predictive value (PPV) estimates. Conclusions The relative composition of reference sets can significantly impact the evaluation metrics, potentially altering the conclusions regarding which methodologies are perceived to perform best. We therefore need to carefully consider the selection of controls to avoid misinterpretation of signals triggered by confounding factors rather than true associations, as well as adding biases to our evaluation by “favoring” some algorithms while penalizing others.

• We tested 14 design criteria for reference sets in the case of DDIs, showing that some of them considerably affected the performance and comparative evaluation of different SDAs for DDI surveillance while others did not have a significant effect.
• Overall, this analysis advocates the use of reference sets that are as large as possible, since these are less likely to suffer from overrepresentation of controls that make different SDAs behave in different ways due to confounding. Any decision to restrict the evaluation set using specific design criteria should be carefully justified.

Plain Language Summary
Reporting of suspected side effects experienced by patients following drug approval is a key component of identifying novel drug safety issues. Statistical methods are then used to analyze reports and reveal signals of novel associations between drugs and side effects. Performance evaluation of those methods traditionally relies on custom-made reference sets of limited size that consider ad-hoc exclusion or inclusion criteria to define eligible controls. However, each method can be impacted to a different extent by those criteria, as they can act as potential confounders. This study investigated the impact of 14 criteria on three methods that have been developed to detect signals of potential adverse drug-drug interactions, showing that some of them had opposing effects, or effects of different magnitude, on the performance of the different methods. The relative composition of reference sets can therefore significantly affect the evaluation metrics, potentially altering the conclusions regarding which methodologies are perceived to perform best. The selection of controls should be carefully performed to avoid misinterpretation of signals triggered by confounding factors rather than true associations, as well as adding biases to our evaluation by "favoring" some algorithms while penalizing others.

| INTRODUCTION
Monitoring drug safety issues during the post-approval phase requires reporting of suspected drug-related adverse reactions by healthcare professionals, patients, and pharmaceutical companies. The reports are collected in spontaneous reporting system (SRS) databases, such as the FDA Adverse Event Reporting System (FAERS) database in the US, the EudraVigilance database in the EU, and the Yellow Card database in the UK. These databases form an important part of the pharmacovigilance strategy since they not only contain information on adverse events (AEs) and suspected drugs, but also details regarding concomitant medications, indications, and patient demographics.
By applying statistical methods known as signal detection algorithms (SDAs), novel associations between drugs and AEs (i.e., signals) that have not been identified in clinical trials can be identified in the SRS data. Given the absence of a control group, SDAs predominantly rely on disproportionality analysis, which calculates the degree of disproportional reporting of drug-AE combinations compared to what would be expected if there were no association between them. 1 However, the presence of synthetic associations (i.e., associations induced by causative covariates that have not been taken into account or remain unobserved) can lead to confounding, either upward or downward, thus generating faulty associations between the drug and the AE and complicating the detection of safety signals. [2][3][4] For example, reporting quality issues arising from a poor distinction between symptoms of disease-related AEs and treatment effects of drugs (or drug combinations) are a result of a synthetic association called confounding by indication. 5,6

The practice of using larger clusters of medical terms to perform quantitative signal detection in pharmacovigilance has been widely discussed in the literature. 1,7 Many previous efforts investigated the impact of Medical Dictionary for Regulatory Activities (MedDRA) granularity on signal detection tasks. 8,9 Also, many studies have considered the use of term grouping to identify relevant reports. 10,11 However, recommendations from the IMI-PROTECT project suggest that signal detection at the PT level should be considered the standard approach in real-life pharmacovigilance. 9,12

The development of novel SDAs in pharmacovigilance requires the existence of appropriate reference sets that can be utilized both for absolute performance evaluation and for comparison with existing methodologies.
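To make the idea of disproportional reporting concrete, the sketch below computes a reporting odds ratio (ROR), one of the simplest disproportionality measures, from a 2x2 contingency table of spontaneous reports. The ROR is shown purely for illustration (it is not one of the SDAs evaluated in this study), and the counts are invented:

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR from a 2x2 table of spontaneous reports.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports without the drug, with the event
    d: reports with neither

    Returns the ROR and an approximate 95% CI (Woolf method).
    """
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper

# Invented counts: the event is reported disproportionally often with the drug.
ror, lower, upper = reporting_odds_ratio(a=20, b=980, c=100, d=98900)
```

A signal would conventionally be flagged when the lower bound of the interval exceeds 1; the SDAs compared in this study build on more elaborate models, but the same observed-versus-expected logic underlies them.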
Given that each SDA, depending on the applied modeling, might be impacted to a different extent by a confounder, the performance evaluation might be biased based on the selected benchmarks.
The challenge of building appropriate reference sets in pharmacovigilance has been previously acknowledged in the literature. [13][14][15][16] Most studies have attempted to comparatively evaluate SDAs by testing their performance against custom-made reference sets, often limited in size [17][18][19] or not publicly available, 20,21 which commonly consider ad-hoc inclusion or exclusion criteria to generate positive and negative controls.
Examples of such criteria include those related to AE background prevalence (given that, in disproportionality analysis, the denominator signifies the expected rate of occurrence), 22 disease-related AEs, 23 AE seriousness 23,24 or evidence associated with positive controls. [22][23][24][25][26] The criteria are typically used to attempt to address the limitations of disproportionality analysis and to tackle issues with potential confounders.
In the case of adverse drug-drug interactions (DDIs), signal detection is considered more complicated, with the existing methodology being less mature compared to that for single-drug signals. A previous study has suggested that the detection of DDI-related signals might suffer from multiple confounders. 27 For example, concomitant medications appear to be a significant source of confounding (i.e., the signal associated with a drug combination triggered by drugs that are usually given concomitantly but not signifying true adverse drug-drug-event associations). In addition, only limited efforts exist in the literature to generate reference sets related to two-way DDIs. 17,19,27,28 In this study, we aim to explore the relative impact of different factors that could be potential sources of confounding on the performance evaluation of existing methods for signal detection of DDIs. By utilizing a large and diversified reference set, we were able to create custom-made reference sets considering multiple design criteria to assess any differences observed in the quantitative evaluation of SDAs tailored for two-way DDIs.

| Data mining
We performed the case/non-case analysis at two different levels, based on the reference sets that we utilized. The first one was restricted to the reports that included the PT that was related to each control from the PT Reference Set. The second one considered as cases all the reports that contained any of the PTs that were part of the MC linked to the control in the MC Reference Set.
For example, the case/non-case analysis for a control related to torsade de pointes resulted in two contingency tables: the first one only considered the PT "Torsade de pointes" to retrieve case reports, while the second one included the following terms (as PTs): "Electrocardiogram QT interval abnormal", "Electrocardiogram QT prolonged", "Long QT syndrome", "Torsade de pointes", "Ventricular tachycardia".
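The two case/non-case levels for this torsade de pointes example can be sketched as follows. The report contents are hypothetical, and set intersection guarantees that a report containing several relevant PTs is counted as a case only once:

```python
# Hypothetical reports, each represented as the set of MedDRA PTs it mentions.
reports = [
    {"Torsade de pointes", "Electrocardiogram QT prolonged"},  # case at both levels
    {"Long QT syndrome"},                                      # case at the MC level only
    {"Nausea"},                                                # non-case
]

PT_LEVEL = {"Torsade de pointes"}
MC_LEVEL = {
    "Electrocardiogram QT interval abnormal",
    "Electrocardiogram QT prolonged",
    "Long QT syndrome",
    "Torsade de pointes",
    "Ventricular tachycardia",
}

def count_cases(reports, event_terms):
    # A report is a case if it contains at least one relevant PT; the set
    # intersection counts each report at most once, so there is no double-counting.
    return sum(1 for pts in reports if pts & event_terms)

pt_cases = count_cases(reports, PT_LEVEL)   # 1 for these hypothetical reports
mc_cases = count_cases(reports, MC_LEVEL)   # 2 for these hypothetical reports
```

Non-cases are simply the remaining reports, `len(reports) - mc_cases` at the MC level.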
Non-cases included the reports without the aforementioned PTs, while reports containing more than one of the relevant PTs linked to the MC were not double-counted. Table 2 shows the design criteria that were considered as potential confounding factors, which fall into the following categories:

| PT prevalence
The impact of reference set restriction by PT prevalence on the Area Under the Curve (AUC) estimates was also examined. The PT prevalence was calculated in the curated FAERS data set as the frequency of PTs from reports containing at least one drug. We grouped the 179 PTs from the PT Reference Set using quartile binning of their prevalence. The controls were then stratified into four groups (Groups Q1-Q4) based on their PTs by considering the respective PT prevalence quartile.
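The quartile stratification step can be sketched as follows, with hypothetical PT prevalence values; `statistics.quantiles` supplies the three cut points:

```python
import statistics

# Hypothetical PT prevalence values (fraction of drug-containing reports
# mentioning each PT); the real study used 179 PTs from the PT Reference Set.
prevalence = {"pt_a": 0.0001, "pt_b": 0.0004, "pt_c": 0.002, "pt_d": 0.006,
              "pt_e": 0.01, "pt_f": 0.03, "pt_g": 0.05, "pt_h": 0.09}

# Quartile boundaries (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(prevalence.values(), n=4)

def quartile_group(p):
    """Assign a PT to Group Q1 (rarest) through Q4 (most common)."""
    if p <= q1:
        return "Q1"
    if p <= q2:
        return "Q2"
    if p <= q3:
        return "Q3"
    return "Q4"

groups = {pt: quartile_group(p) for pt, p in prevalence.items()}
```

With pandas available, `pd.qcut` would achieve the same binning in one call.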

| SDAs
Three SDAs that have been previously described in the literature were considered: i. an observed-to-expected shrunk interaction measure (Omega) 32 ; ii. the "interaction coefficient" in a linear regression model with additive baseline (delta_add) 33 ; iii. a measure based on an adapted version of the Multi-item Gamma Poisson Shrinker (MGPS) model, called Interaction Signal Score (IntSS). 17

| Impact of MedDRA granularity on SDA performance evaluation
To assess the impact of MedDRA granularity on the SDAs that were considered in this study, we performed a Receiver Operating Characteristic (ROC) analysis to examine the difference in AUC when considering matched controls from the two reference sets.
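As a rough illustration of the shrinkage idea behind observed-to-expected measures such as Omega, the sketch below applies a +0.5 shrinkage on a log2 scale. This is a deliberate simplification: the published Omega additionally specifies an expected-count model for drug pairs and reports the lower bound of a credibility interval (often denoted Omega_025), neither of which is reproduced here.

```python
import math

def shrunk_oe(observed, expected, shrinkage=0.5):
    """Shrunk observed-to-expected ratio on a log2 scale, in the spirit of
    the Omega measure: the +0.5 terms pull estimates for rare drug-drug-event
    combinations toward zero, damping spurious signals from small counts."""
    return math.log2((observed + shrinkage) / (expected + shrinkage))

# With few observed reports, shrinkage noticeably dampens the signal;
# with many reports, the estimate approaches the raw log2 ratio.
weak = shrunk_oe(observed=3, expected=1)        # ~1.22, well below log2(3)
strong = shrunk_oe(observed=300, expected=100)  # ~1.58, close to log2(3)
```

The same shrinkage principle underlies the MGPS family of methods on which IntSS is based, though their Bayesian formulation differs.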

| Estimation of design criteria impact on SDA performance evaluation
For each reference set and design criterion, we simulated the generation of a constrained reference set by randomly drawing an equal number of positive and negative controls.

TABLE 2 Categories and descriptions of design criteria for reference sets that could affect performance evaluation of SDAs for DDI surveillance.

Evidence level BNF-Study
Interactions where the information is based on formal study including those for other drugs with the same mechanism, for example, known inducers, inhibitors, or substrates of cytochrome P450 isoenzymes or P-glycoprotein.

BNF-Theoretical
Interactions that are predicted based on sound theoretical considerations. The information may have been derived from in vitro studies or based on the way other members of the same class act.

BNF-Anecdotal
Interactions based on either a single case report or a limited number of case reports.

Micromedex-Established
Controlled studies have clearly established the existence of the interaction.

Micromedex-Theoretical
The available documentation is poor, but pharmacologic considerations lead clinicians to suspect the interaction exists; or documentation is good for a pharmacologically similar drug.
Micromedex-Probable
Documentation strongly suggests that the interaction exists, but well-controlled studies are lacking.
Event seriousness*

EMA Important Medical Event (IME) Terms
Any untoward medical occurrence that at any dose:
• results in death,
• is life-threatening,
• requires inpatient hospitalization or prolongation of existing hospitalization,
• results in persistent or significant disability/incapacity, or
• is a congenital anomaly/birth defect.

EMA Designated Medical Event (DME) Terms
Medical conditions that are inherently serious and often medicine-related (e.g., Stevens-Johnson syndrome). This list does not address product-specific issues or medical conditions with high prevalence in the general population.

The statistics of the samples were summarized by fitting a Normal distribution, for which we report the mean and variance. The differences of the means of AUC (AUC_diff) and PPV (PPV_diff), with 95% confidence intervals, were the target measures. The probability of AUC_diff being non-zero, P(|AUC_diff| > 0), was also estimated under the normality assumption from the fitted mean μ, the standard deviation σ, and F_AUCdiff, the normal cumulative distribution function (CDF) of AUC_diff. Figure 1 illustrates the simulation workflow for the calculation of differences in AUC scores and PPV when considering the various design criteria.
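The simulation workflow can be sketched roughly as follows. Control identifiers, scores, and the criterion are hypothetical stand-ins for one SDA applied to one reference set; the AUC is estimated via the Mann-Whitney statistic, and the restricted-versus-full AUC differences are summarized by a fitted normal distribution:

```python
import random
import statistics

def auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of the ROC AUC: the fraction of
    (positive, negative) score pairs ranked correctly, counting ties as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def simulate_auc_diff(scores, pos, neg, criterion, n_iter=500, n_draw=50, seed=7):
    """Repeatedly draw equal-sized subsets of positive and negative controls
    satisfying a design criterion, compute each subset's AUC minus the
    full-set AUC, and summarise the differences with a fitted Normal
    (mean and approximate 95% CI)."""
    rng = random.Random(seed)
    full_auc = auc([scores[c] for c in pos], [scores[c] for c in neg])
    pos_dc = [c for c in pos if criterion(c)]  # DC-restricted positives (p)
    neg_dc = [c for c in neg if criterion(c)]  # DC-restricted negatives (n)
    diffs = []
    for _ in range(n_iter):
        p = rng.sample(pos_dc, n_draw)
        n = rng.sample(neg_dc, n_draw)
        diffs.append(auc([scores[c] for c in p], [scores[c] for c in n]) - full_auc)
    mu, sd = statistics.fmean(diffs), statistics.stdev(diffs)
    return mu, (mu - 1.96 * sd, mu + 1.96 * sd)
```

Running this per SDA and per design criterion, and comparing the resulting AUC_diff distributions, mirrors the comparison illustrated in Figure 1.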

| RESULTS
The total number of positive and negative controls when applying each of the design criteria to the PT Reference Set is presented in Figure 2. Hence, more than 250 positive and negative controls were considered for every design criterion. For the MC Reference Set, the restricted subsets were smaller in size (Table S1). Three design criteria (BNF-Anecdotal, BNF-Theoretical, and AE is an indication-True) were not tested with this reference set, as their N_max was less than or equal to 100.

FIGURE 1 (A) Initial positive and negative control sets (P and N) and their respective restricted subsets (DC-restricted, p and n) when applying a design criterion; (B) Simulation workflow for the differences in AUC (AUC_diff) and PPV (PPV_diff) when considering the specified design criterion.

Figure 3 provides the frequency distribution of PT prevalence.

FIGURE 4 Number of positive and negative controls for groups Q1-Q4 that were formed using PT prevalence quartile binning, with Q1 containing the controls with the lowest prevalence and Q4 the highest.

In terms of PT prevalence (Figure 6), there was a similar trend for Groups Q1-Q3, with the AUC_diff metric increasing for all algorithms as we moved to more common PTs. However, this relationship appears to be reversed in Group Q4, which contains the most frequent PTs. For the highest sensitivity that was considered (0.90), the difference in PPV was in most cases negligible.
Given the inability of SDAs to account for all potential confounding factors that are present in SRS data, each methodology might be impacted to a different extent by a given confounder. At the same time, there might be cases where signals are triggered by those confounding factors. As an illustrative example, the majority of DDI signals identified using IntSS in the original research paper 27 were composed of drug pairs that are usually given concomitantly (e.g., antibiotics). We therefore need to consider the selection of appropriate controls to avoid misinterpretation of signals triggered by confounding factors rather than true associations, as well as adding biases to our evaluation by "favoring" some algorithms while penalizing others. On the other hand, by attempting to completely remove all potential sources of confounding from our evaluation sets, we are more likely to fail to demonstrate the algorithms' utility in real-life application, which should be determined by their ability to perform at a commensurate level when applied prospectively to identify novel signals in SRS databases. 14,15 Overall, this analysis advocates the use of reference sets that are as large as possible for comparative performance assessment, since these are less likely to suffer from the overrepresentation of controls that make different SDAs behave in different ways due to confounding. Also, for novel reference sets, any decision to restrict the evaluation set using specific design criteria should be adequately supported.
A major concern about reference sets used for prospective signal detection in pharmacovigilance revolves around the validity of established (i.e., well-known) positive controls for testing the performance of algorithms. This aspect has been widely discussed in the literature. 14,15,34 It has been acknowledged that the combination of estab-

In terms of event background prevalence, the simulation results suggest that, if we restricted the evaluation set to specific ranges of PT prevalence, the conclusions would change; that is, the sole choice of common PTs would have an inverse impact on the comparative evaluation compared to rare AEs. We know that SRS data are predominantly used in the post-marketing setting to spot rare adverse reactions that have not been revealed during clinical trials. However, the use of SRS data for the detection of DDIs can be considered a different scenario, given that clinical trial data are not sufficient to detect adverse reactions of drug combinations due to inherent limitations (e.g., patient recruitment processes that exclude people taking multiple medications). Hence, the detection of novel DDI-related adverse reactions, even those with a common background rate, in SRS data should be of special interest.
Disease-related AEs are a challenging issue in the effort to generate signals using SRS data, as confounding by indication can occur. A previous study reported that around 5% of the total reports for any drug in FAERS mention a drug's indication as an adverse event. 36 This might be related to poor reporting quality or be intended to report a disease's exacerbation due to a drug. Our results support that the choice of excluding disease-related AEs (i.e., AE is an indication-False) did not have a significant effect on the AUC across the SDAs with the PT Reference Set, while it decreased the performance of all SDAs with the MC Reference Set. On the other hand, Omega demonstrated deteriorated performance in the scenario of detecting controls with AEs that were simultaneously drugs' indications (i.e., AE is an indication-True), while the other two SDAs did not seem to be substantially affected by this design criterion.

Event seriousness has been used to build reference sets and assess SDA performance, as it could be utilized to filter signals in real-life pharmacovigilance settings. 23,24 Our study suggests that, by only considering "significant" events, bias is introduced into the evaluation of SDAs that could potentially be used in routine pharmacovigilance to detect a broader set of events. Also, given that DMEs are rare events (i.e., have low prevalence) with a high drug-attributable risk, it is important to note that this category might have been confounded to an extent by other design criteria categories that were considered in our study, such as the event frequency.

FIGURE 6 AUC_diff values for Groups Q1-Q4 relevant to PT prevalence. The dot size represents the probability of the estimated score, AUC_diff, being non-zero.
Quantitative signal detection is only one aspect of the more complex framework through which a safety signal is validated. In the case of adverse DDI surveillance, previous studies have considered triage filters alongside disproportionality analysis to direct preliminary signal assessment. 37,38 These filters might be less suitable depending on the type of DDI. For example, there are more filters relevant to pharmacokinetic DDIs (e.g., cytochrome P450 activity) than to pharmacodynamic interactions. Although the clinical significance of the differences between SDAs that are reported in this study might be questioned, it is important to note that quantitative methods for adverse DDI surveillance remain far less mature than those for single-drug safety surveillance, also considering the additional complexity that is inherent to DDIs. Consequently, a potential impact on real-world pharmacovigilance cannot be ruled out, as even small changes in the performance of an SDA might have a considerable impact on the number of generated signals that are captured for further evaluation, leading either to missed signals or to large numbers of potential signals that need to be evaluated, thus increasing the manual effort required. It is also important to note that the three SDAs included in our study are not implemented to the same extent in the real world. Omega and IntSS are two of the major methods that we understand to be used for routine pharmacovigilance screening for DDIs. delta_add is a less mature method described in the literature, which, as far as we are aware, is not as widely used in practice.
Although this study provides a novel framework for studying how SDA performance may change by considering different criteria for eligibility of controls, there are some limitations worth mentioning. First, only a single test data set (i.e., FAERS) was utilized for the purposes of this study. Also, CRESCENDDI was the only reference set utilized to generate estimates of the impact on AUC, in the absence of another comprehensive data set that could be used as a comparative source. We acknowledge that, by modifying the CRESCENDDI data set to consider adverse events at the MC level, we ended up with a smaller reference set that only included controls that could be represented by event groups (e.g., angioedema). This can have an impact on the extrapolation of the results and conclusions drawn from our analysis when considering single PTs as opposed to event groups. Additionally, for the determina-

DATA AVAILABILITY STATEMENT
The CRESCENDDI data set that supports the findings of this study is openly available in Figshare at https://doi.org/10.6084/m9.figshare.c.5481408.v1.