Estimating the Null Distribution to Adjust Observed Confidence Levels for Genome-Scale Screening

Authors

  • David R. Bickel

    1. Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology, and Immunology, Department of Mathematics and Statistics, University of Ottawa, 451 Smyth Road, Ottawa, Ontario K1H 8M5, Canada

email: dbickel@uottawa.ca

Abstract

Summary. In a novel approach to the multiple testing problem, Efron (2004, Journal of the American Statistical Association 99, 96–104; 2007a, Journal of the American Statistical Association 102, 93–103; 2007b, Annals of Statistics 35, 1351–1377) formulated estimators of the distribution of test statistics or nominal p-values under a null distribution suitable for modeling the data of thousands of unaffected genes, nonassociated single-nucleotide polymorphisms, or other biological features. Estimators of the null distribution can improve not only the empirical Bayes procedure for which they were originally intended, but also many other multiple-comparison procedures. In some cases, such estimators improve the proposed multiple-comparison procedure (MCP) based on a recent non-Bayesian framework of minimizing expected loss with respect to a confidence posterior, a probability distribution of confidence levels. The flexibility of that MCP is illustrated with a nonadditive loss function designed for genomic screening rather than for validation. The merit of estimating the null distribution is examined from the vantage point of the confidence-posterior MCP (CPMCP). In a generic simulation study of genome-scale multiple testing, conditioning the observed confidence level on the estimated null distribution as an approximate ancillary statistic markedly improved conditional inference. Simulating gene expression data specifically, however, indicates that estimation of the null distribution tends to exacerbate the conservative bias that results from modeling heavy-tailed data distributions with the normal family. To enable researchers to determine whether to rely on a particular estimated null distribution for inference or decision making, an information-theoretic score is provided. As the sum of the degree of ancillarity and the degree of inferential relevance, the score reflects the balance conditioning would strike between the two conflicting terms. The CPMCP and other methods introduced are applied to gene expression microarray data.
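
The general idea of estimating a null distribution from the bulk of the test statistics, rather than assuming the theoretical N(0, 1) null, can be illustrated with a minimal sketch. It is not the estimator of Efron (2004, 2007a, 2007b) and not the CPMCP; it swaps in a simple median and scaled median-absolute-deviation fit as a stand-in. All function names, the simulated mixture, and the 0.001 threshold are assumptions made only for illustration.

```python
import random
import statistics


def estimate_empirical_null(z_scores):
    """Estimate the center and scale of a normal null from the bulk of z-scores.

    Assumption: median and scaled MAD serve as a simple robust stand-in;
    this is not one of Efron's estimators.
    """
    center = statistics.median(z_scores)
    mad = statistics.median([abs(z - center) for z in z_scores])
    scale = 1.4826 * mad  # factor that makes the MAD match the sd under normality
    return center, scale


def two_sided_p_values(z_scores, center=0.0, scale=1.0):
    """Two-sided p-values under a normal null with the given center and scale."""
    null = statistics.NormalDist(mu=center, sigma=scale)
    return [2.0 * min(null.cdf(z), 1.0 - null.cdf(z)) for z in z_scores]


def simulate_z_scores(n_null=9500, n_affected=500, null_sd=1.3, shift=4.0, seed=0):
    """Hypothetical data: unaffected features are overdispersed relative to N(0, 1)."""
    rng = random.Random(seed)
    null_z = [rng.gauss(0.0, null_sd) for _ in range(n_null)]
    affected_z = [rng.gauss(shift, null_sd) for _ in range(n_affected)]
    return null_z + affected_z


z = simulate_z_scores()
center, scale = estimate_empirical_null(z)

p_theoretical = two_sided_p_values(z)               # assumes N(0, 1)
p_empirical = two_sided_p_values(z, center, scale)  # uses the estimated null

n_theoretical = sum(p < 0.001 for p in p_theoretical)
n_empirical = sum(p < 0.001 for p in p_empirical)

print(f"estimated null: center={center:.3f}, scale={scale:.3f}")
print(f"features with p < 0.001: theoretical={n_theoretical}, empirical={n_empirical}")
```

In this hypothetical setting the estimated null is wider than N(0, 1), so fewer features fall below the threshold than under the theoretical null. Whether such an adjustment helps or adds conservative bias on real data is exactly the question the abstract raises.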
