Abstract
- Top of page
- Abstract
- 1. Introduction
- 2. Data, Preprocessing, and Notation
- 3. Model and Priors
- 4. Estimation and Inference
- 5. Application to Experimental Data
- 6. Simulation Study
- 7. Discussion
- 8. Supplementary Materials
- Acknowledgements
- References
- Supporting Information
Summary. ChIP-seq combines chromatin immunoprecipitation with massively parallel short-read sequencing. While it can profile genome-wide in vivo transcription factor-DNA association with higher sensitivity, specificity, and spatial resolution than ChIP-chip, it poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and from variability and biases in its sequence data. We propose a method called PICS (Probabilistic Inference for ChIP-seq) for identifying regions bound by transcription factors from aligned reads. PICS identifies binding event locations by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. It uses precalculated, whole-genome read mappability profiles and a truncated t-distribution to adjust binding event models for reads that are missing due to local genome repetitiveness. It estimates uncertainties in model parameters that can be used to define confidence regions on binding event locations and to filter estimates. Finally, PICS calculates a per-event enrichment score relative to a control sample, and can use a control sample to estimate a false discovery rate. Using published GABP and FOXA1 data from human cell lines, we show that PICS' predicted binding sites were more consistent with computationally predicted binding motifs than the alternative methods MACS, QuEST, CisGenome, and USeq. We then use a simulation study to confirm that PICS compares favorably to these methods and is robust to model misspecification.
1. Introduction
ChIP-seq, which combines chromatin immunoprecipitation with massively parallel short-read sequencing, offers high specificity, sensitivity, and spatial resolution in profiling in vivo protein-DNA association; histones, histone variants, and modified histones; nucleosome positioning; polymerases and transcriptional machinery complexes; and DNA methylation (Holt and Jones, 2008; Park, 2009). While sequencing overcomes certain limitations of profiling with microarrays (ChIP-chip), it raises statistical and computational challenges, some of which are related to those for ChIP-chip, and others that are novel. A typical ChIP-seq data set consists of millions or tens of millions of sequence reads that are generated from ends of immunoprecipitated DNA fragments. The quality of called bases varies along and between reads, and, as the sequencing technology evolves, read lengths and quality are increasing, as are the number of reads generated in a machine run. Current read lengths are generally between 36 and 75 base pairs (bp). While pairs of end reads can be generated from each DNA fragment, current ChIP-seq data typically consist of single-end reads, in which each DNA fragment contributes a directional read from only one randomly selected fragment end ((7–8) in Figure 1).
After read sequences have been aligned to a reference genome (e.g., Li and Durbin, 2009), the aligned read data are transformed into a form that reflects local densities of immunoprecipitated DNA fragments, and, in the work described here, into estimates of locations where transcription factors were associated with DNA in the experimental cellular system. The analysis is complicated by biases and artifacts in local read densities that can be introduced in sequencing and aligning, and by chromatin structure and genome copy number variations (Johnson et al., 2008; Kharchenko, Tolstorukov, and Park, 2008; Rozowsky et al., 2009). As well, repetitive sequences can prevent aligning (mapping) reads to unique genomic locations (Robertson et al., 2008; Rozowsky et al., 2009), and reads that cannot be uniquely aligned are typically removed from the analysis. In typical mammalian experiments, 30–40% of reads may be discarded, but higher rates can be encountered in particular experiments. Because of ChIP-seq's cost-effectiveness, such global losses are usually not an important practical consideration; however, few analysis methods correct for the local biases in aligned read densities that are caused by repetitive regions.
Certain types of biases in read density profiles can be estimated by sequencing a control sample in addition to the immunoprecipitated “treatment” sample, and then using an analysis method that considers the treatment profile relative to the control profile (Kharchenko et al., 2008; Nix, Courdy, and Boucher, 2008; Rozowsky et al., 2009). Control data can be used to help identify false positives, assess numerical background models, and estimate a threshold for segmenting a read density or enrichment profile in order to identify a subset of significantly enriched regions. Analysis methods are described as “two-sample” when a control data set is available and as “single sample” when only treatment data are available. As with ChIP-chip, there are various ways to generate control samples, and we refer the reader to Buck and Lieb (2004) for an overview.
In summary, once reads have been aligned to a reference genome, there are at least four central analysis issues: interpreting the information in local densities of directional reads; identifying local read densities that represent false positives; addressing biases in read densities that arise from local variations in the efficiency with which reads can be aligned to unique genomic locations; and segmenting the enrichment profile in order to identify a statistically and biologically meaningful subset of enriched regions.
Short-read sequencing technology has evolved rapidly since its commercial introduction early in 2007, and, as was the case while ChIP-chip developed as an experimental approach (e.g., Johnson et al., 2006; Gottardo et al., 2008), statistical analysis methods for ChIP-seq are actively being developed. For example, Valouev et al. (2008) introduced QuEST, a method based on kernel density estimates of forward and reverse read counts, which allows estimating DNA fragment lengths. The forward and reverse profiles are then combined to estimate binding site locations and quantify local enrichment. Given control sample data, QuEST can estimate a false discovery rate (FDR). Like QuEST, MACS (Zhang et al., 2008) uses both forward and reverse read profiles to empirically model the “shift size” of ChIP-seq reads, and uses this to improve the spatial resolution of the predicted binding sites. Instead of using kernel density estimates, MACS uses a parametric model based on a dynamic Poisson distribution to identify and quantify locations where the protein of interest binds. Ji et al. (2008) introduced a “CisGenome” analysis pipeline for the analysis of ChIP-chip and ChIP-seq data. Their method is based on a negative binomial background model, and includes functionality not offered by MACS and QuEST (e.g., filtering atypical regions, and different types of FDR estimates). Another method that has been shown to perform well is USeq, which uses window-level binomial p-values followed by an FDR correction to call individual binding events (Nix et al., 2008).
While these methods have established statistical approaches for ChIP-seq analysis, model-based and Bayesian approaches are in earlier stages of development. In the work described here, we introduce PICS, a method for probabilistic inference of ChIP-seq data that is based on a Bayesian hierarchical truncated t-mixture model. PICS integrates four important components. First, it jointly models local concentrations of directional reads. Second, it uses mixture models to distinguish closely adjacent binding events. Third, it incorporates prior information for the length distribution of immunoprecipitated DNA fragments to help resolve closely adjacent binding events and to identify events that have atypical fragment lengths. Fourth, it uses precalculated whole-genome read mappability profiles to adjust local read densities for reads that are missing due to genome repetitiveness. For each binding event, PICS returns an enrichment score that is relative to a control sample when such a sample is available, and it can use a control sample to estimate a false discovery rate. Finally, because it is based on a probabilistic model, PICS can compute measures of uncertainty for binding site estimates; these can be used to define confidence regions for binding site locations and to filter out low-confidence regions.
The article is organized as follows. In Section 2, we introduce the data structure and some notation. In Section 3, we present our Bayesian hierarchical truncated t-mixture model. Section 4 discusses parameter estimation, inference, and detection of binding sites. In Section 5, we apply PICS to two published human ChIP-seq data sets and compare its results to those from QuEST, MACS, CisGenome, and USeq. Section 6 presents the results of a simulation study that evaluates the robustness of our model to misspecification and compares the performance of PICS and the four other methods. In Section 7, we briefly discuss our results and possible extensions.
2. Data, Preprocessing, and Notation
We used two ChIP-seq data sets that have been analyzed by other groups. Zhang et al. (2008), using MACS, identified binding sites of FOXA1 (hepatocyte nuclear factor 3α) in human MCF7 breast cancer cells. Valouev et al. (2008), using QuEST, identified binding sites of the growth-associated binding protein (GABP) in human Jurkat T cells. Each data set consists of single-end reads for a treatment (ChIP) and an input DNA control sample. The FOXA1 data consist of 3,909,507 treatment reads and 5,233,322 control reads, while the GABP data consist of 7,830,602 treatment reads and 17,028,066 control reads.
Because ChIP-seq aligned-read data are usually sparse, consisting largely of regions in which few or no reads are observed, we first preprocess the read data by segmenting the genome into regions, each of which has a minimum number of reads that aligned to forward and reverse strands. We detect such regions using a w-bp sliding window with an s-bp step size, counting the number of forward and reverse strand reads in the left and right half-windows, respectively, and we retain windows that contain at least one forward read and one reverse read. For each chromosome, after merging overlapping windows and removing merged regions that have fewer than two forward or reverse reads, we obtain a disjoint set of candidate regions, each of which we analyze separately. Because DNA fragments are often between 100 and 300 bp long after gel size selection, for the work described here we chose w = 100 bp, and set s = 10 bp for computational convenience. We tested other values for w and s and obtained essentially the same candidate regions.
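The preprocessing step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PICS implementation: the function names are hypothetical, each read is represented by a single coordinate, windows are treated as half-open intervals, and the final filter is read as requiring at least two forward and at least two reverse reads per merged region.

```python
from bisect import bisect_left

def count_in(sorted_pos, lo, hi):
    """Number of positions p with lo <= p < hi in a sorted list."""
    return bisect_left(sorted_pos, hi) - bisect_left(sorted_pos, lo)

def candidate_regions(fwd, rev, chrom_len, w=100, s=10, min_reads=2):
    """Return disjoint (start, end) candidate regions for one chromosome.

    fwd, rev: read start coordinates on the forward and reverse strands.
    w, s: window width and step size in bp (w = 100, s = 10 in the text).
    """
    fwd, rev = sorted(fwd), sorted(rev)
    half = w // 2
    merged = []
    for start in range(0, max(chrom_len - w, 0) + 1, s):
        mid, end = start + half, start + w
        # Keep a window with >= 1 forward read in its left half
        # and >= 1 reverse read in its right half.
        if count_in(fwd, start, mid) >= 1 and count_in(rev, mid, end) >= 1:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = end          # overlaps the previous window
            else:
                merged.append([start, end])  # starts a new region
    # Drop merged regions with too few forward or reverse reads (assumed rule).
    return [(a, b) for a, b in merged
            if count_in(fwd, a, b) >= min_reads
            and count_in(rev, a, b) >= min_reads]

# Hypothetical reads: one well-supported cluster and one isolated read pair.
regions = candidate_regions([100, 110, 120, 500], [160, 170, 180, 560], 1000)
```

With these made-up coordinates, the cluster near position 100 survives as a single merged region, while the isolated pair near position 500 is discarded by the read-count filter.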
From this point, we will use the following terminology. A candidate region is a region obtained by our sliding window approach, i.e., one with enough forward/reverse reads to be processed by PICS. A binding event refers to a location where the protein of interest is associated with DNA. At a binding event the in vivo protein was either interacting with the DNA directly at a binding site, i.e., a DNA motif that the protein recognizes and binds to (D'haeseleer, 2006), or indirectly via membership in a DNA-associated protein complex. We will define a binding site's midpoint as its position or location. Note that a candidate region can result from more than one binding event, and so can contain more than one binding site.
5. Application to Experimental Data
We applied PICS to the two experimental data sets described in Section 2, obtaining 75,451 candidate regions and 86,262 binding events for the GABP data, and 51,843 candidate regions and 53,740 binding events for the FOXA1 data. Web Figure 1 shows histograms of estimated average DNA fragment lengths for both data sets. For the FOXA1 data the estimated average fragment size was approximately 150 bp, consistent with Zhang et al. (2008); it was somewhat smaller for the GABP data. Web Figure 1 also shows that most of the regions had DNA fragment lengths between 50 and 200 bp, which supports our filtering atypical regions by this parameter.
Noting that some of the algorithms responded very differently in terms of estimated FDR (see Web Figure 2), we compared the methods by identifying conserved DNA sequence motifs in the 5,000 top-ranked predictions from each method, using 200-bp-wide regions that were centered on each method's binding site estimates. For motif analysis, we used GADEM (Li, 2009), which can process large sets of ChIP-seq regions on a single CPU, identifies multiple motifs and adjusts motif widths, and returns motifs similar to those from algorithms that are more computationally demanding. We assessed the de novo motifs using STAMP (Mahony, Auron, and Benos, 2007), and retained only expected and biologically relevant motifs. As expected, for all five methods, GADEM identified GABP and Forkhead (FKHR) motifs as dominant in GABP and FOXA1 data sets, respectively. For the FOXA1 data, for all methods, many regions also contained the binding motif for the FOS proto-oncogene protein. The FOS gene family encodes leucine zipper proteins that can dimerize with proteins of the JUN family to form the AP-1 complex (Milde-Langosch, 2008). The AP-1 complex is overexpressed in ER-positive cells (e.g., MCF7) and can interact directly with the ER transcription factor (Cicatiello et al., 2004; Milde-Langosch, 2008). Similarly, the FOXA1 protein is known to play an important role in ER regulation and to interact with ER (Eeckhoute et al., 2006; Lupien et al., 2008). The FOS motif that we identified was consistent with AP-1 enriched motifs reported for ChIP-chip FOXA1 regions (Lupien et al., 2008) and may reflect interactions, possibly indirect, between the FOS and FOXA1 proteins. For the work described here, we used GABP motif occurrences for evaluating GABP results, and both FKHR and FOS motif occurrences for evaluating FOXA1 results.
We evaluated the methods using two criteria: (1) the motif occurrence rate, i.e., the fraction of predicted binding events that contained a biologically expected motif, for which a larger value indicates better performance; and (2) the spatial error, i.e., the distance between a binding site point estimate and a motif location, for which a smaller value indicates better performance. Because a motif can occur more than once in a sequence, we used only the motif instance closest to the predicted binding event when computing spatial error.
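The two criteria can be made concrete with a small sketch. The function name and the single-chromosome integer coordinates are assumptions for illustration; the 100-bp half-width reflects the 200-bp-wide regions centered on each binding site estimate, and summarizing spatial error by its mean is a choice made here, not something stated in the text.

```python
def motif_metrics(sites, motifs, half_width=100):
    """Motif occurrence rate and mean spatial error for predicted sites.

    sites: predicted binding site positions (bp).
    motifs: motif instance positions (bp) on the same chromosome.
    A site "contains" a motif if one lies within half_width bp of it.
    """
    errors = []
    for site in sites:
        dists = [abs(site - m) for m in motifs if abs(site - m) <= half_width]
        if dists:
            # Only the motif instance closest to the prediction is used.
            errors.append(min(dists))
    rate = len(errors) / len(sites) if sites else 0.0
    mean_error = sum(errors) / len(errors) if errors else None
    return rate, mean_error

# Hypothetical example: three predictions, two of them near a motif.
rate, mean_error = motif_metrics([1000, 2000, 3000], [1010, 1040, 2500, 2995])
```

In this toy input the first site has two nearby motif instances and only the closer one (10 bp away) contributes to the spatial error, as described above.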
Figure 4 shows the motif occurrence rate and spatial error as a function of the region rank for the 5,000 top-ranked predictions for the GABP and FOXA1 data. Overall, in terms of occurrence rate, PICS performed better than the other methods. MACS and USeq were relatively close to PICS, while QuEST and CisGenome performed comparably to each other, but were less effective than PICS, MACS, and USeq. For spatial error, PICS and MACS were most effective, followed closely by USeq, and then QuEST and CisGenome.
As stated in Section 3, we use mixture models to address the possibility that the sets of forward and reverse reads within a candidate region were generated by multiple binding events. To assess how well PICS' mixture model detects multiple binding sites within a candidate region, we used our predicted transcription factor binding motifs for the top-ranked 5,000 PICS predictions for the GABP and FOXA1 data. In each case, we determined the percentage of binding events from single- and multiple-component candidate regions (i.e., regions with multiple predicted events) that could be associated with at least one motif site. Table 1 shows these results as a function of the number of PICS predictions (i.e., mixture components) in a candidate region. Roughly four times as many GABP events as FOXA1 events fell in two-component regions (940 vs. 208), and about seven times as many fell in regions with at least three components (236 vs. 33), even though candidate regions had comparable sizes. Because shorter DNA fragments should support detecting more adjacent binding sites, these differences may be related to the smaller average fragment size estimated for the GABP data. For both data sets, the percentage of binding events that was associated with a predicted binding motif was relatively insensitive to the number of predictions in a region.
Because cells can use multiple closely spaced transcription factor binding sites to establish progressive expression responses to cellular signals, we also assessed how effectively PICS can detect closely adjacent binding sites. To compare PICS with the other four methods, we compiled a second table restricted to predicted binding events that had at least one other event within a fixed distance d. Table 2 summarizes the results for d = 250, 500, and 1000 bp. For these data, PICS, CisGenome, and USeq were the most effective at identifying proximal binding events, and large fractions of these events were associated with a predicted motif site. While CisGenome and USeq also predicted large numbers of proximal binding events, a larger fraction of those reported by PICS were associated with predicted binding motifs. In comparison, MACS returned the lowest number of proximal binding events, which suggests that MACS is less effective in discriminating proximal binding events, even though it performs relatively well in terms of the overall number of predictions that are associated with a motif (Figure 4). These results suggest that our mixture model was effective in distinguishing biologically meaningful proximal binding events. Results from the simulation study in Section 6 were consistent with this.
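Selecting proximal binding events for a given distance d can be sketched as follows. The function name is hypothetical, positions are assumed to lie on a single chromosome, and "within d" is taken to mean a separation of at most d bp.

```python
def proximal_events(sites, d):
    """Return the predicted sites that have another site within d bp."""
    s = sorted(sites)
    out = []
    for i, x in enumerate(s):
        # After sorting, only the immediate neighbors need to be checked.
        near_left = i > 0 and x - s[i - 1] <= d
        near_right = i + 1 < len(s) and s[i + 1] - x <= d
        if near_left or near_right:
            out.append(x)
    return out

# Hypothetical predicted positions, evaluated at the three distances used here.
sites = [100, 300, 1000, 1700, 5000]
counts = {d: len(proximal_events(sites, d)) for d in (250, 500, 1000)}
```

The percentage in parentheses in such a table would then be the motif occurrence rate computed on the returned subset only.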
Table 1. Number of 5000 top-ranked PICS predictions from GABP and FOXA1 data in candidate regions that had 1, 2, or at least 3 predicted binding events (i.e., mixture components). The first row gives the number of binding events in each category, while the second row gives the percentage of predicted events that could be associated with an expected motif. For example, 80% of the 940 binding events in two-component GABP regions could be associated with a predicted GABP site.

| Number of components in region | GABP: 1 | GABP: 2 | GABP: 3+ | FOXA1: 1 | FOXA1: 2 | FOXA1: 3+ |
|---|---|---|---|---|---|---|
| Number of events | 3824 | 940 | 236 | 4759 | 208 | 33 |
| % of events with a motif | 81 | 80 | 77 | 86 | 88 | 94 |
Table 2. Number of proximal binding events found in the 5000 top-ranked regions identified by each method in GABP and FOXA1 data, as a function of the motif “proximity” distance d. The numbers in parentheses give the percentage of predicted binding events that could be associated with at least one predicted motif site. For example, in the GABP data, PICS identified 128 binding event locations that were within 250 bp of another location, and 86% of these predicted events could be associated with a motif.

| d (bp) | GABP: PICS | GABP: QuEST | GABP: MACS | GABP: CisGenome | GABP: USeq | FOXA1: PICS | FOXA1: QuEST | FOXA1: MACS | FOXA1: CisGenome | FOXA1: USeq |
|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 128 (86) | 0 | 0 | 0 | 6 (90) | 2 (100) | 0 | 0 | 4 (75) | 4 (83) |
| 500 | 437 (84) | 2 (50) | 18 (83) | 16 (57) | 106 (77) | 62 (85) | 18 (72) | 18 (100) | 34 (82) | 70 (74) |
| 1000 | 517 (81) | 405 (63) | 54 (81) | 76 (63) | 269 (72) | 108 (87) | 74 (76) | 54 (88) | 105 (85) | 149 (77) |
As described in Section 4, PICS can compute approximate standard errors for its model parameter estimates. In particular, we can derive an approximate confidence interval for a given predicted binding event location as $\hat{\mu} \pm c \cdot \mathrm{se}(\hat{\mu})$, where $\hat{\mu}$ is the estimated binding site position, $\mathrm{se}(\hat{\mu})$ is its approximate standard error, and c is a constant to be chosen as a function of the motif coverage desired. For example, assuming that $\hat{\mu}$ is approximately Normal, c = 1.96 should give us an approximate 95% confidence interval for our binding site position.
Using the set of motifs identified by GADEM, we evaluated the actual coverage of our confidence intervals for different values of c. Figure 5 shows the occurrence frequency of GABP motifs (left) and FOXA1 motifs (right) within $c \cdot \mathrm{se}(\hat{\mu})$ of peaks' centers. Using three standard errors, the coverage was approximately 65% and 85% for the GABP and FOXA1 data. While these results indicate that the current version of PICS provides a capable modeling framework, they also suggest that certain biases remain unaddressed. For example, PICS models the binding site as the midpoint between the F/R peak summits, but fragmentation biases during sonication due to local chromatin structure could result in a binding site being asymmetrically positioned with respect to its associated DNA fragment densities. Locations where motifs fall outside event confidence intervals identify a subset of regions on which to focus ongoing work.
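As a rough illustration of this coverage check, the sketch below builds the interval from a position estimate and its standard error, then counts how often the nearest motif falls inside it. The function names and the tuple layout are assumptions made for this example, and the numbers are invented rather than taken from the GABP or FOXA1 data.

```python
def confidence_interval(mu_hat, se, c=1.96):
    """Approximate interval mu_hat +/- c * se for a binding site position."""
    return (mu_hat - c * se, mu_hat + c * se)

def empirical_coverage(events, c):
    """Fraction of events whose nearest motif lies within c standard errors.

    events: list of (mu_hat, se, motif_pos) tuples, where motif_pos is the
    position of the motif instance closest to the predicted site.
    """
    hits = sum(1 for mu_hat, se, motif_pos in events
               if abs(motif_pos - mu_hat) <= c * se)
    return hits / len(events)

# Hypothetical events: (estimated position, standard error, nearest motif).
events = [(1000, 10, 1015), (2000, 5, 2020), (3000, 8, 3004)]
coverage_by_c = {c: empirical_coverage(events, c) for c in (1, 2, 3)}
```

If the standard errors were well calibrated and the Normal approximation held, the coverage at c = 1.96 would be close to 95%; a markedly lower value, as reported above, points to unmodeled bias in the position estimates.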
Finally, we evaluated the effect of the mappability profiles on the parameter estimates. We repeated the analysis while ignoring mappability, and compared the spatial error, i.e., the distance to the closest computationally identified binding site, with and without the mappability correction. A paired t-test showed that our mappability correction significantly (p < 0.05) improved spatial error (data not shown).
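The paired comparison can be sketched with the Python standard library. The helper name and the spatial errors below are invented for illustration; the function returns only the t statistic and its degrees of freedom, so the p-value would still have to be obtained from a t-distribution table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Paired t statistic and degrees of freedom for matched samples x, y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical spatial errors (bp) for the same five binding events.
without_correction = [30, 42, 25, 50, 38]
with_correction = [28, 37, 24, 44, 35]
t_stat, df = paired_t_statistic(without_correction, with_correction)
```

A positive statistic here means the errors were larger without the correction. Pairing matters because both analyses are run on the same binding events, so the event-to-event variability cancels in the differences.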
7. Discussion
We have developed PICS, a probabilistic framework for detecting transcription factor binding events from ChIP-seq experiments. The approach integrates a number of important factors in interpreting aligned read data, including using prior information for input DNA fragment lengths and correcting for reads that are missing due to genome repetitiveness. Working with two published ChIP-seq data sets from human cell lines, we compared PICS to four alternative analysis methods. While additional methods are available (e.g., Fejes et al., 2008; Kharchenko et al., 2008; Rozowsky et al., 2009), the four methods that we used have been shown to perform well, and so offer reasonable performance baselines from a range of algorithms. For both experimental data sets, the binding events predicted by PICS were the most consistent with computationally identified motif sites. Our simulation study confirmed PICS' effectiveness and also showed that its model is robust to misspecifications.
To address read “outliers” due to biological and technical biases, we used a t-distribution as a robust alternative to the Gaussian distribution. We fixed the degrees of freedom parameter at four because this value has been shown to work well in many applications (Lange et al., 1989). To confirm that this was an appropriate choice, we estimated the degrees of freedom for each candidate region and found that it was less than 10 in about 20% of all regions. This suggests that many of the regions contain read outliers and would be poorly modeled by a Gaussian distribution. In addition, many of the candidate regions have very few reads, and it is possible that the number of reads was not large enough to properly estimate the degrees of freedom. We believe that with more sequence reads, the overall percentage of candidate regions with low degrees of freedom would have been even larger. For that reason, and because estimating the degrees of freedom is computationally intensive, we used a fixed four degrees of freedom.
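The robustness argument rests on the heavier tails of the t-distribution. The sketch below compares the standard t density with four degrees of freedom to the standard Gaussian density; it uses unit scale for both, which is a simplification for illustration and not the parameterization of the PICS model.

```python
import math

def t_pdf(x, df=4):
    """Density of the standard Student t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard Gaussian distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# Ratio of the two densities at increasing distances from the center.
ratios = {x: t_pdf(x) / normal_pdf(x) for x in (0, 2, 4)}
```

Four units from the center, the t density with four degrees of freedom is roughly 50 times the Gaussian density, so a read far from the bulk of its cluster is penalized much less severely and pulls the location estimate less.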
We showed that PICS' mixture model addresses multiple adjacent enrichment events, and can fit a different DNA fragment length value for each binding event in a mixture. Our estimation of the number of binding events in a region is based on the Bayesian information criterion (BIC). Although regularity conditions required for consistency of the BIC estimate do not hold for mixture models, there is considerable theoretical and practical support for its use in this context (Roeder and Wasserman, 1997; Fraley and Raftery, 1998). Our simulation study confirmed this, and suggests that our estimation of the number of binding events performs relatively well. In our BIC estimation, we allowed the mixture model to detect up to 15 components per candidate region, but this limit can be readily adjusted.
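BIC-based selection of the number of components can be sketched as below. The function names, the "smaller is better" sign convention, and the log-likelihoods and parameter counts in the example are all assumptions for illustration; the actual number of free parameters per component depends on the model defined in Section 3.

```python
import math

def bic(log_lik, n_params, n_obs):
    """BIC in the 'smaller is better' form: -2 log L + k log n."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def select_n_components(fits, n_obs, max_components=15):
    """Pick the number of mixture components K with the smallest BIC.

    fits: dict mapping K to (maximized log-likelihood, number of free
    parameters) for a K-component fit to one candidate region.
    n_obs: number of reads in the candidate region.
    """
    scores = {k: bic(log_lik, n_params, n_obs)
              for k, (log_lik, n_params) in fits.items()
              if 1 <= k <= max_components}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical fits for one region with 100 reads.
fits = {1: (-520.0, 4), 2: (-500.0, 9), 3: (-498.0, 14)}
best_k, scores = select_n_components(fits, n_obs=100)
```

In this toy example the third component raises the log-likelihood only slightly, so the penalty term outweighs the gain and two components are selected.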
Because it is based on mixture models and accounts for missing reads, PICS is computationally intensive. The results reported here were generated with an implementation of PICS that was written in the R programming language (Ihaka and Gentleman, 1996). Processing a ∼ 10M read data set required an average computing time of 30 minutes on a machine with dual quad-core 3 GHz CPUs. We believe this to be acceptable, but if speed were an issue, PICS could be rewritten in C. PICS will be made freely available via Bioconductor (Gentleman et al., 2004).
Paired-end (PE) data should resolve a subset of read alignments that would be nonunique in single-end (SE) data, and offer direct information on DNA fragment lengths that is not available for SE reads. However, PE experiments require more input DNA and more complex library preparation, and double the sequencing machine time; to our knowledge, all published short read ChIP-seq data are SE rather than PE. We anticipate that PICS' probabilistic approach will remain useful for PE data, where the defined fragment lengths should simplify the modeling framework.
As a first step in implementing a probabilistic approach for ChIP-seq data, we have shown how to incorporate prior information about the DNA fragment lengths using a Bayesian approach. We can extend the PICS framework to include more types of prior information. For example, we could place a prior distribution on μ, the binding site position, and could include in this information about nucleosome occupancy and computationally derived motifs. Such extensions should allow us to improve the detection of biologically relevant binding sites. More generally, we anticipate that extended probabilistic methods for ChIP-seq will contribute to mechanistic biological insights by offering principled ways for addressing backgrounds, noise and biases, and for integrating diverse types of biological information. For example, histone modifications are an important part of cellular biology, and their aligned reads present a wider range of nucleosome-based density distributions than the localized transcription factors that PICS currently addresses. As part of ongoing work in this area, we are extending PICS to infer nucleosome positioning from histone modification data.