Adverse drug reaction or innocent bystander? A systematic comparison of statistical discovery methods for spontaneous reporting systems

Spontaneous reporting systems (SRSs) are used to discover previously unknown relationships between drugs and adverse drug reactions (ADRs). A plethora of statistical methods have been proposed over the years to identify these drug‐ADR pairs. The objective of this study is to compare a wide variety of methods in their ability to detect these signals, especially when their detection is complicated by the presence of innocent bystanders (drugs that are mistaken to be associated with the ADR, since they are prescribed together with the drug that is the ADR's actual cause).

remain undetected until after the launch of the respective drugs. [1][2][3] Spontaneous reporting systems (SRSs) have been established over the years [4][5][6][7] in order to detect these unknown ADRs. Medical professionals and patients can send in a report when a drug is suspected to have triggered an adverse event (AE). These reports are collected, stored, and subsequently analyzed by medical experts.
A plethora of statistical methods have been proposed to aid in the detection of drug-ADR pairs. All of these methods involve two steps: 1. For each drug-AE pair, a score is computed that reflects their "association strength"; 2. These scores are used to compile a shortlist of drug-AE pairs; any pair with a score that exceeds a predefined threshold is forwarded to the experts for further investigation.
All methods except the LASSO 8,9 base their scores on the 2 × 2 contingency tables for each of the drug-AE pairs (see Table 1). The count a denotes the number of reports that contain both the drug and AE of interest, b is the number of reports that mention the drug but not the AE, c is the number of reports that mention the AE but not the drug, and d is the number of reports that mention neither. Some measures that rely on these tables are, for example, the reporting odds ratio 10 (ROR = (ad)/(bc)) and the proportional reporting ratio 11 (PRR = [a/(a + b)]/[c/(c + d)]).
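As a minimal sketch of the two disproportionality measures just defined, the following Python functions compute the ROR and PRR from the four counts of a 2 × 2 table. The function names and the example counts are illustrative assumptions, not taken from the pvm package, and no continuity correction for zero cells is applied.

```python
def ror(a, b, c, d):
    """Reporting odds ratio, ROR = (a * d) / (b * c)."""
    return (a * d) / (b * c)


def prr(a, b, c, d):
    """Proportional reporting ratio, PRR = [a / (a + b)] / [c / (c + d)]."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical table: 20 reports with drug and AE, 80 with the drug only,
# 100 with the AE only, and 9800 with neither.
a, b, c, d = 20, 80, 100, 9800
print(round(ror(a, b, c, d), 2))  # 24.5
print(round(prr(a, b, c, d), 2))  # 19.8
```

Both values are far above 1, so this hypothetical pair would rank high on either measure.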
A potential downside of these disproportionality measures 7,12 is that they are vulnerable to the innocent bystander effect. 2,7,8 This effect refers to a form of confounding where a drug is thought to be associated with a certain ADR, because it is prescribed together with the drug that causes the reaction.
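The innocent bystander effect can be illustrated with a small exact calculation. The sketch below assumes one causal drug, one bystander that is co-reported with it, and an AE that depends only on the causal drug; all probabilities are hypothetical values chosen for illustration. It returns the ROR that the bystander would show in an infinitely large report pool.

```python
def bystander_ror(p_cause, p_by_without, p_by_with, p_ae_baseline, odds_ratio):
    """Exact marginal ROR between an innocent bystander and an AE.

    p_cause        P(causal drug on report)
    p_by_without   P(bystander on report | causal drug absent)
    p_by_with      P(bystander on report | causal drug present)
    p_ae_baseline  P(AE on report | causal drug absent)
    odds_ratio     odds ratio between the causal drug and the AE
    """
    # Probability of the AE when the causal drug is present.
    odds = p_ae_baseline / (1 - p_ae_baseline) * odds_ratio
    p_ae_exposed = odds / (1 + odds)

    # Expected cell proportions of the bystander's 2 x 2 table,
    # summed over absence/presence of the causal drug.
    a = b = c = d = 0.0
    for present, p in ((False, 1 - p_cause), (True, p_cause)):
        p_by = p_by_with if present else p_by_without
        p_ae = p_ae_exposed if present else p_ae_baseline
        a += p * p_by * p_ae
        b += p * p_by * (1 - p_ae)
        c += p * (1 - p_by) * p_ae
        d += p * (1 - p_by) * (1 - p_ae)
    return (a * d) / (b * c)


# Bystander co-reported with the causal drug 90% of the time.
print(bystander_ror(0.05, 0.05, 0.9, 0.01, 5.0))
# Bystander reported independently of the causal drug.
print(bystander_ror(0.05, 0.05, 0.05, 0.01, 5.0))
```

In the first call the bystander's ROR is well above 1 although it has no effect on the AE; in the second call, where the two drugs are independent, the ROR is 1 up to rounding error.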
Several studies in the past compared the performance of some of these statistical methods. Van Puijenbroek et al 10  Other studies rely on simulated data, 18,19 which has the advantage that (a) the associations are truly known and (b) the methods' performance can be explored in a variety of parameter settings, for example, for varying odds ratios between drugs and ADRs.
A disadvantage of simulation studies is that it is unclear to what extent the simulated data reflects reality. The simulated data sets that were used, however, assume independence between drugs, 18 15 Using predefined thresholds might skew the results, since the choice of thresholds could be unfavorable for a particular method.
In this paper we perform a simulation-based study with 24 measures, ranging from simple disproportionality measures, hypothesis tests, and Bayesian shrinkage estimates to sparse regression (see Section 2). The simulations are unique in the sense that they contain innocent bystanders (for a description see Section 3). Instead of using thresholds, we use a threshold-free measure, the area under the precision-recall curve (PRC), for assessing the performance of the presented measures. Our results are presented in Section 4 and critically appraised in the discussion.

2 | STATISTICAL METHODS FOR POSTMARKETING SURVEILLANCE
Table 2 lists the 24 measures considered in this article. The column "Category" contains a categorization of the various measures. The column "Measure" contains the notation for each measure, where we try to stick to the notation commonly used in the literature. The year and the publication in which the measure is mentioned and/or used for the first time can be found in the columns "Year" and "Reference(s)."
KEY POINTS
• A plethora of statistical methods have been proposed over the years to detect associations between drugs and adverse events in spontaneous reporting data.
• Earlier comparison studies were limited in the number of measures considered and ignored confounding by other drugs, that is, the innocent bystander effect.
• Twenty-four measures are compared in their ability to detect associations in a large number of simulated systems allowing for confounding between drugs.
• Hypothesis tests perform best when the associations are weak and there is little to no confounding. Bayesian methods should be preferred for larger effect sizes and/or when the level of confounding increases.

For some measures, it is common to use the lower endpoint of the confidence/credible interval. This is denoted here by the subscripts "025" and "05." In case of "025," the 95% confidence/credible interval is used; in case of "05" the 90% interval. The "Number of Reports" measure, equal to a in Table 1, is the number of times the drug and AE are reported together. This measure was suggested by Norén et al 22 as a "placebo" measure; each measure should at least be able to outperform this basic count. There are two versions of the BCPNN, which we denote here with the superscripts "original" and "alternative." The former refers to the version as it was first proposed by Bate et al. 13 The latter uses a different prior. 1 Yule's Q is not considered here, since it is a rescaling of the ROR 10 and, therefore, performs equally well. All measures are implemented and publicly available as the R package pvm at www.github.com/bips-hb/pvm.
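The statement that Yule's Q is a rescaling of the ROR can be checked directly from the counts of the 2 × 2 table. The short derivation below assumes the usual definition of Yule's Q, which the text does not spell out.

```latex
Q = \frac{ad - bc}{ad + bc}
  = \frac{\frac{ad}{bc} - 1}{\frac{ad}{bc} + 1}
  = \frac{\mathrm{ROR} - 1}{\mathrm{ROR} + 1},
\qquad
\frac{dQ}{d\,\mathrm{ROR}} = \frac{2}{(\mathrm{ROR} + 1)^{2}} > 0 .
```

Since Q is strictly increasing in the ROR, both measures order the drug-AE pairs identically, so any performance measure that depends only on the ranking gives the same result for both.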

3 | METHODS
We first introduce the simulation setup, before we describe how the measures' performance will be assessed. An extensive description of the simulation can be found in Appendix A. Each report to an SRS contains two lists: 1. The AEs the patient experienced, and 2. The drugs suspected to have caused the AEs.
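All the reported drugs in the SRS are represented by binary variables $X_1, X_2, \ldots, X_{500}$, and the AEs by binary variables $Y_1, Y_2, \ldots, Y_{500}$. A report can then be represented by the binary vector

$$(X_1, X_2, \ldots, X_{500}, Y_1, Y_2, \ldots, Y_{500}), \tag{1}$$

where $X_i$ is 1 if the report contains drug $i$, and 0 otherwise. Similarly, $Y_j$ is 1 when the report lists AE $j$, and 0 otherwise. An SRS is a collection of reports:

$$\left\{ \left(X_1^{(l)}, \ldots, X_{500}^{(l)}, Y_1^{(l)}, \ldots, Y_{500}^{(l)}\right) \right\}_{l = 1}^{N}, \tag{2}$$

where $N$ is the number of reports. In the simulations, there are 500 drugs, 500 AEs, and 50 000 reports.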
Some drugs are prescribed more frequently when another drug is prescribed, for example, to suppress side effects. There are two types of drugs: the probability of the drug being listed depends on (a) no other drugs or (b) one other drug. In the first case, the probability that a drug $X_i$ is reported is $P(X_i = 1) = \pi_i$. In case a drug $X_j$ is influenced by drug $X_k$, we specify two conditional probabilities, (a) when drug $X_k$ is not on the report and (b) when drug $X_k$ is on the report, that is,

$$P(X_j = 1 \mid X_k = 0) = \pi_j \quad \text{and} \quad P(X_j = 1 \mid X_k = 1) = \gamma. \tag{3}$$

Note that $\gamma$ is the same for all $j$. The probabilities $\pi_i$ are drawn from a beta distribution, which is a common choice in the field, 13,22 with shape parameters 1 and 20 (mean 1/21 ≈ .05), so that drugs tend to be listed infrequently. In the simulations, $\gamma$ is .5, .75, or .9.
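The two-drug mechanism described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the srsim implementation: the function name, the seeds, and the use of the standard library `random` module are choices made here. The marginal probabilities are drawn from a beta distribution with parameters 1 and 20, as in the text.

```python
import random


def simulate_drug_pair(n_reports, pi_k, pi_j, gamma, rng):
    """Simulate indicators for drug k (independent) and drug j (depends on k).

    P(X_k = 1) = pi_k
    P(X_j = 1 | X_k = 0) = pi_j
    P(X_j = 1 | X_k = 1) = gamma
    """
    x_k, x_j = [], []
    for _ in range(n_reports):
        k = 1 if rng.random() < pi_k else 0
        p_j = gamma if k == 1 else pi_j
        j = 1 if rng.random() < p_j else 0
        x_k.append(k)
        x_j.append(j)
    return x_k, x_j


rng = random.Random(1)          # fixed seed, for reproducibility only
pi_k = rng.betavariate(1, 20)   # marginal probability of drug k
pi_j = rng.betavariate(1, 20)   # baseline probability of drug j
x_k, x_j = simulate_drug_pair(50_000, pi_k, pi_j, gamma=0.9, rng=rng)
print(sum(x_k), sum(x_j))       # number of reports listing each drug
```

With γ = .9, drug j appears on most reports that list drug k, while remaining rare on the other reports.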
There are 500 × 500 = 250 000 drug-AE pairs. The relationship between a drug $X_i$ and an AE $Y_j$ is expressed using a logistic model:

$$\text{logit}\, P(Y_j = 1 \mid X_i) = \beta_j + \log(\text{OR}_{ij})\, X_i, \tag{4}$$

where logit(x) = log[x/(1 − x)], $\beta_j$ is the intercept, and $\text{OR}_{ij}$ is the odds ratio between the i-th drug and the j-th AE. In the case that the drug causes the event (ie, an ADR), $\text{OR}_{ij} \in [1, \infty)$ is drawn from a truncated normal distribution with mean 1.5, 3, or 5. When the drug does not cause the AE, $\text{OR}_{ij} = 1$. The intercept, $\beta_j$, is chosen such that the probability of the AE appearing on the report is small (see Appendix A).

Modeling an innocent bystander requires specifying the relationships between two drugs and one ADR. Let $X_i$ be the drug that causes the ADR $Y_j$ and $X_k$ be the innocent bystander, that is, $\text{OR}_{ij} > 1$ and $\text{OR}_{kj} = 1$.
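The logistic model that links a drug to an AE can be made concrete with a short Python sketch. It assumes the single-drug form of the model (intercept plus log odds ratio times the drug indicator) and a hypothetical baseline AE probability of .01; the actual choice of the intercept is described in Appendix A and is not reproduced here.

```python
import math


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))


def expit(z):
    """Inverse of the logit function."""
    return 1 / (1 + math.exp(-z))


def ae_probability(x_i, beta_j, odds_ratio):
    """P(Y_j = 1 | X_i = x_i) under logit P = beta_j + log(OR_ij) * x_i."""
    return expit(beta_j + math.log(odds_ratio) * x_i)


beta_j = logit(0.01)               # hypothetical baseline: AE on 1% of reports
p0 = ae_probability(0, beta_j, 3)  # drug absent
p1 = ae_probability(1, beta_j, 3)  # drug present, OR = 3
print(p0, p1)
```

With an odds ratio of 1 the two probabilities coincide, which is the "drug does not cause the AE" case in the text.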
The dependence between the innocent bystander, $X_k$, and drug $X_i$ is defined according to Equation (3): $P(X_k = 1 \mid X_i = 0) = \pi_k$ and $P(X_k = 1 \mid X_i = 1) = \gamma$. Only drug $X_i$ causes the ADR $Y_j$. The innocent bystander, $X_k$, is often reported alongside $X_i$ and can, therefore, be mistaken as the cause of $Y_j$.

TABLE 2 An overview of all 24 measures considered. The column "Category" provides a rough categorization. The notation for each measure used in this paper can be found in "Measure." "Year" and "Reference(s)" refer to the first appearance of the respective measure in the field of pharmacovigilance.

The prescription of the first 250 drugs, $X_1$ to $X_{250}$, is not influenced by the prescription of any other drug. The last 250 drugs, $X_{251}$ to $X_{500}$, can be innocent bystanders, for example, $X_{251}$ can be the innocent bystander for the drug-ADR pair $(X_1, Y_1)$. In that case, $X_{251}$ is prescribed more often when $X_1$ is prescribed than when it is not; the parameter $\gamma$ in Equation (3) is set to .5, .75, or .9. The parameter $\gamma$ is the same for all innocent bystanders. We consider three cases:
1. No innocent bystanders. All drugs $X_1, X_2, \ldots, X_{500}$ are independent.
2. One hundred twenty-five innocent bystanders, that is, drug $X_{251}$ is the innocent bystander for drug-ADR pair $(X_1, Y_1)$, $X_{252}$ is the innocent bystander for $(X_2, Y_2)$, etc., up to $X_{375}$ and $(X_{125}, Y_{125})$.
3. Two hundred fifty innocent bystanders, that is, drug $X_{251}$ is the innocent bystander for drug-ADR pair $(X_1, Y_1)$, $X_{252}$ is the innocent bystander for $(X_2, Y_2)$, etc., up to $X_{500}$ and $(X_{250}, Y_{250})$.

The area under the precision-recall curve (PRC) reflects the measures' overall capability to distinguish between associated and not-associated pairs. Although the receiver operating characteristic (ROC) curve is more commonly used than the PRC, it is known to perform rather poorly when the data are imbalanced. 27

4 | RESULTS
[Table of simulation parameters; recoverable row: mean OR of drug-ADR pair, Equation (4): 1.5, 3, or 5]

[…] CI = ± 4.7 × 10⁻²) when the innocent bystander effect is strongest.
The same holds for the other Bayesian measures, which explains the change in ranking in Figure 1.
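For readers who want to reproduce the threshold-free assessment used throughout the results, the area under the precision-recall curve can be estimated by the average precision: the mean of the precision values at the ranks where a truly associated pair is retrieved. The sketch below is one common estimator and an assumption made here; the text does not state how the area was computed, and ties between scores are broken by input order.

```python
def average_precision(scores, labels):
    """Average precision for scores (higher = stronger signal) and 0/1 labels."""
    # Rank the drug-AE pairs from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_positive = sum(labels)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_positive


# Hypothetical scores for four pairs, two of which are true ADRs.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A measure that ranks all true drug-ADR pairs ahead of all other pairs obtains the maximum value of 1, independent of any threshold.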
Even though the LASSO does not excel in any setting, its performance is striking since it is not affected by the appearance of innocent bystanders. By employing the data on all drugs simultaneously, it […]

All Bayesian methods such as the GPS and the BCPNN use the relative report rate (RRR) as their basis. 1,13,24 Figure 1 shows, however, that the RRR is consistently outperformed by the reporting odds ratio. It might be fruitful to explore the possibility of applying Bayesian shrinkage to the ROR rather than the RRR.

By using a threshold-free performance measure, we avoided the problem of choosing an appropriate threshold for the various measures. 29 For some, it might be easier to choose an appropriate threshold than for others. It is, for example, unclear how to set the thresholds for ROR, PRR, and RRR. For the measures based on hypothesis tests, one could employ a multiple testing correction procedure. 25 In case of the Bayesian methods, there are similar procedures to control the false-discovery rate, for example, the study by Ahmed et al. 18

A point of caution is that the simulation setup is a simplification and might differ from the SRSs used in daily practice. Real data sets can contain more noise. The results shown here should, thus, be seen as a best-case scenario. In addition, one should not rely solely on statistical analysis. Clinical and pharmacological knowledge is essential when identifying drug-ADR pairs.
The implementation of the measures, the SR simulator, and the code for this comparison study are publicly available as R packages at www.github.com/bips-hb/pvm, www.github.com/bips-hb/srsim and www.github.com/bips-hb/pvmcomparison.

ETHICS STATEMENT
The authors state that no ethical approval was needed.