Large-scale regression-based pattern discovery: The example of screening the WHO global drug safety database



Most measures of interestingness for patterns of co-occurring events are based on data projections onto contingency tables for the events of primary interest. As an alternative, this article presents the first implementation of shrinkage logistic regression for large-scale pattern discovery, with an evaluation of its usefulness in real-world binary transaction data. Regression accounts for the impact of other covariates that may confound or otherwise distort associations. The application considered is international adverse drug reaction (ADR) surveillance, in which large collections of reports on suspected ADRs are screened for interesting reporting patterns worthy of clinical follow-up. Our results show that regression-based pattern discovery does offer practical advantages. Specifically it can eliminate false positives and false negatives due to other covariates. Furthermore, it identifies some established drug safety issues earlier than a measure based on contingency tables. While regression offers clear conceptual advantages, our results suggest that methods based on contingency tables will continue to play a key role in ADR surveillance, for two reasons: the failure of regression to identify some established drug safety concerns as early as the currently used measures, and the relative lack of transparency of the procedure to estimate the regression coefficients. This suggests shrinkage regression should be used in parallel to existing measures of interestingness in ADR surveillance and other large-scale pattern discovery applications. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 197-208, 2010