A Bayesian Hidden Markov Model for Motif Discovery Through Joint Modeling of Genomic Sequence and ChIP-Chip Data


  • Jonathan A. L. Gelfond,

    Corresponding author
    1. Department of Epidemiology and Biostatistics, Mail Code 7933, University of Texas Health Science Center at San Antonio, 7703 Floyd Curl Drive, San Antonio, Texas 78229-3900, U.S.A.
  • Mayetri Gupta,

    1. Department of Biostatistics, Boston University, 801 Massachusetts Avenue, Boston, Massachusetts 02118, U.S.A.
  • Joseph G. Ibrahim

    1. Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina 27599, U.S.A.

email: gelfondjal@uthscsa.edu


Summary. We propose a unified framework for the analysis of chromatin (Ch) immunoprecipitation (IP) microarray (ChIP-chip) data for detecting transcription factor binding sites (TFBSs) or motifs. ChIP-chip assays are used to focus the genome-wide search for TFBSs by isolating a sample of DNA fragments with TFBSs and applying this sample to a microarray with probes corresponding to tiled segments across the genome. Present analytical methods use a two-step approach: (i) analyze array data to estimate IP-enrichment peaks, then (ii) analyze the corresponding sequences independently of intensity information. The proposed model integrates peak finding and motif discovery through a unified Bayesian hidden Markov model (HMM) framework that accommodates the inherent uncertainty in both measurements. A Markov chain Monte Carlo algorithm is formulated for parameter estimation, adapting recursive techniques used for HMMs. In simulations and applications to a yeast RAP1 dataset, the proposed method has favorable TFBS discovery performance compared to currently available two-stage procedures in terms of both sensitivity and specificity.
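
As a rough illustration of the kind of recursive HMM technique that can be embedded in a Markov chain Monte Carlo sampler, the sketch below implements generic forward filtering followed by backward sampling of the hidden state path. It is not the model proposed in the paper: the two-state setup (background versus IP-enriched), the Gaussian emission on probe intensities, and all names (`gaussian_log_emis`, `forward_filter`, `backward_sample`) and parameter values are assumptions made for illustration only.

```python
import numpy as np

def gaussian_log_emis(y, means, sds):
    """Log Gaussian density of each observation under each state: shape (T, K).
    Assumption: probe intensities are Gaussian given the hidden state."""
    y = np.asarray(y, dtype=float)[:, None]
    return -0.5 * np.log(2 * np.pi * sds ** 2) - 0.5 * ((y - means) / sds) ** 2

def forward_filter(log_emis, trans, init):
    """Scaled forward recursion.
    Returns filtered state probabilities alpha (T, K) and the log-likelihood."""
    T, K = log_emis.shape
    shift = log_emis.max(axis=1)              # per-row shift for numerical stability
    emis = np.exp(log_emis - shift[:, None])
    alpha = np.zeros((T, K))
    a = init * emis[0]
    c = a.sum()
    alpha[0] = a / c
    loglik = np.log(c) + shift[0]
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * emis[t]  # predict one step, then weight by emission
        c = a.sum()
        alpha[t] = a / c
        loglik += np.log(c) + shift[t]
    return alpha, loglik

def backward_sample(alpha, trans, rng):
    """Draw one hidden state path from its joint posterior given the data."""
    T, K = alpha.shape
    states = np.empty(T, dtype=int)
    states[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * trans[:, states[t + 1]]
        states[t] = rng.choice(K, p=w / w.sum())
    return states

if __name__ == "__main__":
    # Hypothetical toy intensities: state 0 = background, state 1 = enriched.
    y = [0.0, 0.1, 3.0, 2.9, 3.1, 0.0]
    means, sds = np.array([0.0, 3.0]), np.array([0.3, 0.3])
    trans = np.array([[0.9, 0.1], [0.1, 0.9]])
    init = np.array([0.5, 0.5])
    alpha, loglik = forward_filter(gaussian_log_emis(y, means, sds), trans, init)
    print(backward_sample(alpha, trans, np.random.default_rng(0)), loglik)
```

In a full sampler, a path draw of this kind would alternate with updates of the emission and transition parameters conditional on the sampled states; a joint model of the sort described above would additionally let the sequence content inform the state probabilities, which this sketch does not attempt.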