Cell growth, development, and the regeneration of tissues and organs are associated with a large number of gene regulation events, which are mediated in part by transcription factors (TFs) binding to cis-regulatory elements in the genome. Predicting the binding affinity and inferring the binding specificity of TF–DNA interactions at the genomic level would be fundamentally helpful for understanding the molecular mechanism and biological implications underlying sequence-specific TF–DNA recognition. In this study, we report the development of a combination method to characterize the interaction behavior of an 11-mer oligonucleotide segment and its mutants with the Gcn4p protein, a homodimeric basic leucine zipper TF, and to predict the binding affinity and specificity of potential Gcn4p binders on a genome-wide scale. In this procedure, a position-mutated energy matrix is created based on molecular modeling analysis of native and mutated Gcn4p–DNA complex structures to describe the position-independent interaction energy profile of Gcn4p with different nucleotide types at each position of the oligonucleotide. The energy terms extracted from the matrix and their interactive terms are then correlated with experimentally measured affinities of 19 268 distinct oligonucleotides using statistical modeling methodology. Subsequently, the best of the resulting regression models is successfully applied to screen for potential high-affinity Gcn4p binders across the complete genome.
The main findings of this study are as follows: (i) The 11 positions of the oligonucleotides are highly interactive and non-additive in their contribution to Gcn4p–DNA binding affinity; (ii) Indirect conformational effects upon nucleotide mutation, together with the associated subtle changes in interfacial atomic contacts, rather than the direct nonbonded interactions, are primarily responsible for the sequence-specific recognition; (iii) The intrinsic synergistic effects among the sequence positions of the oligonucleotides determine Gcn4p–DNA binding affinity and specificity; (iv) Linear regression models in conjunction with variable selection seem to perform fairly well in capturing the internal dependences hidden in the Gcn4p–DNA system, although ignoring nonlinear factors may lead the models to systematically underestimate high-affinity samples and overestimate low-affinity samples.
Eukaryotic gene expression is fundamentally important for a series of cellular events, including growth, development, differentiation, proliferation, regeneration, and tissue repair (1). For example, although adult hepatocytes are long lived and normally do not undergo cell division, they maintain the ability to proliferate and regenerate in response to toxic injury and infection. Of all the steps of gene expression, the most tightly controlled, and hence the rate-determining step for most genes, is initiation, in which the DNA elements around the start of the gene are recognized by a number of nuclear proteins termed transcription factors (TFs) (2). For example, TFs were found to be the central regulators of liver regeneration after hepatectomy, where a number of cell signaling pathways are established from the activation of several key TFs such as STAT3, HNF, and C/EBP, as well as the negative regulatory factors SOCS-3 and TMUB1, to start the liver regeneration event (3,4). To date, a large number of TFs have been identified that function in a wide variety of biomolecular processes. Given the highly uniform character of chromosomal DNA sequence, it remains puzzling how a TF is capable of specifically picking out its recognition element against a background of thousands of other similar sequences and binding to it. Knowing the quantitative specificity of TFs, that is, both the preferred binding sites and the relative binding affinity for different sites, would allow us to approach this open question and would facilitate the understanding of gene expression details (5).
Experimental techniques such as electrophoretic mobility shift assay (5), isothermal titration calorimetry (ITC) (6), and surface plasmon resonance (7) have emerged for quantitatively measuring TF–DNA interactions, but all have relatively low throughput. Recently, chromatin immunoprecipitation with sequencing (ChIP-Seq) (8) and microfluidic affinity analysis (9) were developed for genome-wide mapping of TF–DNA binding profiles. However, such assays are too time-consuming and expensive to implement routinely, leaving a significant gap between experimental capability and practical requirement. Alternatively, bioinformatics provides an in silico route to address this issue, and a number of computational methods have been proposed by different groups for this purpose. These methods can be roughly divided into two categories: sequence-based and structure-based. The sequence-based methods use information derived solely from the primary sequence patterns of a collection of oligonucleotides with known TF-binding affinity to train predictive models and, once the models are constructed, generalize them to candidate DNA segments generated from, for example, the entire genome of a studied species. Several representative sequence-based methods, including multiple regression models (10), position weight matrices (11), and nonlinear probabilistic inference (12), have been successfully applied either to identify potential binding sites for a given TF or to analyze the consensus sequence pattern of known binding sites for a series of homologous TFs. Although the sequence-based methods are widely used in the bioinformatics community and can be very efficient in high-throughput prediction of protein–DNA binding specificity, they completely ignore the presence of the DNA-binding partner TF as well as the information about the interactions between DNA and TF, which probably impairs the predictive power and interpretability of the established models.
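To make the sequence-based category concrete, the following is a minimal sketch of a position weight matrix, one of the representative methods mentioned above. It is an illustration only and is not taken from the cited works: the function names (`build_pwm`, `score`, `best_window`), the uniform background frequency, the pseudocount, and the toy binding sites are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score(window, pwm):
    """Sum the per-position log-odds scores; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_window(sequence, pwm):
    """Scan a sequence and return (score, start index) of the best-scoring window."""
    width = len(pwm)
    scored = [(score(sequence[i:i + width], pwm), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)

# Toy aligned sites, invented for illustration (not experimental data).
toy_sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTCT"]
toy_pwm = build_pwm(toy_sites)
top_score, top_start = best_window("CCCTGACTCAGGG", toy_pwm)
```

Note that the score depends only on the DNA sequence: nothing in the model represents the TF or its contacts with the DNA, which is the limitation discussed above.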
With the rapidly increasing availability of solved protein–DNA complex crystal structures in recent years, it is now possible to exploit structure-based approaches to accurately characterize the binding behavior of a TF to DNA at the atomic level and to explain the molecular mechanism of the binding event directly from the known protein–DNA complex structures. Baker and coworkers made the first attempt to do so; they employed an integrated protocol of crystal structure analysis, side-chain rotamer search, and free energy calculation to dissect the interfacial features and energetic landscape of several DNA-binding proteins interacting with their cognate DNA ligands and, on this basis, developed a simple physical model to predict and design protein–DNA interactions with desired biological functions (13,14). Later, using a modified strategy, the same group successfully redesigned the interface properties and cleavage specificity of endonuclease I-MsoI on its substrate DNA molecules (15). Other methods available for structure-based prediction of protein–DNA binding specificity include, but are not limited to, a variety of knowledge-based potentials (16) as well as molecular mechanical (17) and more exhaustive quantum mechanical (18) calculations, which were used to describe the direct readout energy involved in protein–DNA binding (19), while the thermodynamic effects associated with indirect readout can be characterized by molecular dynamics simulation (20). Moreover, we have recently presented a systematic classification and analysis of themes in protein–DNA recognition, using a protocol that combines automatic methods with manual inspection to build a comprehensive classification tree for currently available high-quality structure data. The knowledge gained would be helpful for understanding the molecular mechanism and biological implications underlying protein–DNA binding and for accurate modeling of affinity values at a mesoscopic level (21).
Despite the considerable success of structure-based methods in all-atom modeling of the energetic components of protein–DNA binding and in the rational design of proteins that recognize target DNA sites with prescribed sequence specificity, applying these methods to high-throughput prediction and inference of potential TF-binding sites at the genomic level remains a great challenge. In this work, we propose a combination strategy that unites the high efficiency of the sequence-based method with the ready interpretability of the structure-based approach to investigate the complete DNA-binding specificity landscape of Saccharomyces cerevisiae Gcn4p, a prototypical homodimeric basic leucine zipper TF that is the master regulator of the amino acid starvation response (22). The procedure starts from the crystal structure of Gcn4p complexed with a 20-bp double-stranded DNA segment. Using virtual mutagenesis, structure optimization, and energy analysis, we quantitatively characterize the interaction energy profile of Gcn4p with all possible single-base mutations within the core region of the DNA segment, and this profile is then used to define a position-mutated energy matrix (PMEM). On this basis, a position-independent/interactive linear equation (PILE) is derived by fitting energy terms extracted from the PMEM to experimental observations for more than 10 000 11-mer oligonucleotides with known Gcn4p-binding affinity. Subsequently, we employ the derived PILE to analyze the binding specificity between Gcn4p and distinct DNA sequence patterns and to infer the potential binding sites of Gcn4p within the S. cerevisiae genome. We also give a preliminary insight into the physicochemical nature and origin of Gcn4p–DNA binding specificity.
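The way a PMEM could feed a PILE can be sketched as follows. This is a hedged illustration rather than the procedure actually used in this work: the matrix layout, the choice to model interactive terms as pairwise products of per-position energies, the function names (`pile_features`, `pile_predict`), and all numerical values are assumptions introduced for this example. In practice the coefficients would be obtained by regression against measured affinities, whereas here they are simply made up.

```python
from itertools import combinations

def pile_features(seq, pmem):
    """Return independent terms followed by pairwise interactive terms.

    pmem[i][b] is a hypothetical interaction energy for base b at position i.
    Interactive terms are modeled as products of two per-position energies,
    which is an assumption of this sketch.
    """
    if len(seq) != len(pmem):
        raise ValueError("sequence length must match the number of PMEM positions")
    independent = [pmem[i][base] for i, base in enumerate(seq)]
    interactive = [independent[i] * independent[j]
                   for i, j in combinations(range(len(seq)), 2)]
    return independent + interactive

def pile_predict(seq, pmem, coeffs, intercept=0.0):
    """Evaluate a linear equation over the independent and interactive terms."""
    feats = pile_features(seq, pmem)
    if len(coeffs) != len(feats):
        raise ValueError("one coefficient is required per feature")
    return intercept + sum(c * x for c, x in zip(coeffs, feats))

# Toy 3-position matrix with invented energies (a real PMEM here would have 11 positions).
toy_pmem = [
    {"A": -1.0, "C": 0.5, "G": 0.0, "T": 0.2},
    {"A": 0.0, "C": -2.0, "G": 0.3, "T": 0.1},
    {"A": 0.4, "C": 0.0, "G": -0.5, "T": 1.0},
]
toy_coeffs = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]  # made up, not fitted

features = pile_features("ACG", toy_pmem)
prediction = pile_predict("ACG", toy_pmem, toy_coeffs)
```

Under these assumptions an 11-mer yields 11 independent terms and 55 pairwise terms, so some form of variable selection is a natural companion to the regression step.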