Genome-wide Inference of Transcription Factor–DNA Binding Specificity in Cell Regeneration Using a Combination Strategy


  • Xiaofeng Wang,

    1. Institute of Hepatobiliary Surgery, Southwest Hospital, Third Military Medical University, Chongqing 400010, China
    2. Department of General Surgery, CAPF General Hospital 100039, Beijing, China
  • Aiqun Zhang,

    1. Department of Hepatobiliary Surgery, PLA General Hospital 100853, Beijing, China
  • Weizheng Ren,

    1. Department of Hepatobiliary Surgery, PLA General Hospital 100853, Beijing, China
  • Caiyu Chen,

    1. Institute of Hepatobiliary Surgery, Southwest Hospital, Third Military Medical University, Chongqing 400010, China
  • Jiahong Dong

    Corresponding author
    1. Institute of Hepatobiliary Surgery, Southwest Hospital, Third Military Medical University, Chongqing 400010, China
    2. Department of Hepatobiliary Surgery, PLA General Hospital 100853, Beijing, China

Corresponding author: Jiahong Dong,


Cell growth, development, and the regeneration of tissues and organs are associated with a large number of gene regulation events, which are mediated in part by transcription factors (TFs) binding to cis-regulatory elements in the genome. Predicting the binding affinity and inferring the binding specificity of TF–DNA interactions at the genomic level would be fundamentally helpful for understanding the molecular mechanism and biological implications of sequence-specific TF–DNA recognition. In this study, we report the development of a combination method to characterize the interaction behavior of an 11-mer oligonucleotide segment and its mutants with the Gcn4p protein, a homodimeric basic leucine zipper TF, and to predict the binding affinity and specificity of potential Gcn4p binders on a genome-wide scale. In this procedure, a position-mutated energy matrix is created from molecular modeling analysis of native and mutated Gcn4p–DNA complex structures to describe the position-independent interaction energy profile of Gcn4p with different nucleotide types at each position of the oligonucleotide. The energy terms extracted from the matrix and their interaction terms are then correlated with the experimentally measured affinities of 19 268 distinct oligonucleotides using statistical modeling. Subsequently, the best of the resulting regression models is applied to screen the complete genome for potential high-affinity Gcn4p binders.
The main findings of this study are as follows: (i) the 11 positions of the oligonucleotides are highly interactive and non-additive in their contribution to Gcn4p–DNA binding affinity; (ii) indirect conformational effects upon nucleotide mutation and the associated subtle changes in interfacial atomic contacts, rather than direct nonbonded interactions, are primarily responsible for sequence-specific recognition; (iii) intrinsic synergistic effects among the sequence positions of the oligonucleotides determine Gcn4p–DNA binding affinity and specificity; (iv) linear regression models combined with variable selection perform fairly well in capturing the internal dependences hidden in the Gcn4p–DNA system, although ignoring nonlinear factors may lead the models to systematically underestimate high-affinity samples and overestimate low-affinity samples.

Eukaryotic gene expression is fundamentally important for a series of cellular events, including growth, development, differentiation, proliferation, regeneration, and tissue repair (1). For example, although adult hepatocytes are long lived and normally do not undergo cell division, they maintain the ability to proliferate and regenerate in response to toxic injury and infection. Of all the steps in gene expression, the most tightly controlled, and hence the rate-determining step for most genes, is initiation, in which the DNA elements around the start of the gene are recognized by a number of nuclear proteins termed transcription factors (TFs) (2). For example, TFs were found to be the central regulators of liver regeneration after hepatectomy, where a number of cell signaling pathways are established from the activation of several key TFs such as STAT3, HNF, and C/EBP, as well as the negative regulatory factors SOCS-3 and TMUB1, to start the liver regeneration event (3,4). To date, a large number of TFs have been identified that function in a wide variety of biomolecular processes. Given the highly uniform character of a chromosomal DNA sequence, it is puzzling how a TF is capable of specifically picking out its recognition element against a background of thousands of other similar sequences and binding to it. Knowing the quantitative specificity of TFs, that is, both the preferred binding sites and the relative binding affinity to different sites, would allow us to approach this open question and facilitate the understanding of the details of gene expression (5).

Experimental techniques such as the electrophoretic mobility shift assay (5), isothermal titration calorimetry (ITC) (6), and surface plasmon resonance (7) have emerged to quantitatively measure TF–DNA interactions, but all have relatively low throughput. Recently, chromatin immunoprecipitation with sequencing (ChIP-Seq) (8) and microfluidic affinity analysis (9) were developed for genome-wide mapping of TF–DNA binding profiles. However, such assays are too time-consuming and expensive to implement routinely, leaving a significant gap between experimental capability and practical requirements. Alternatively, bioinformatics offers an in silico route to address this issue, and a number of computational methods have been proposed by different groups for this purpose. These methods can be roughly sorted into two categories: sequence-based and structure-based. The sequence-based methods use information derived solely from the primary sequence patterns of a collection of oligonucleotides with known TF-binding affinity to train predictive models and, once the models are constructed, generalize them to candidate DNA segments generated from, for example, the entire genome of the species under study. Several representative sequence-based methods, including the multiple regression model (10), position weight matrix (11), and nonlinear probabilistic inference (12), have been successfully applied either to identify potential binding sites for a given TF or to analyze the consensus sequence pattern of known binding sites for a series of homologous TFs. Although the sequence-based methods are widely used in the bioinformatics community and can be very efficient in high-throughput prediction of protein–DNA binding specificity, they completely ignore the presence of the DNA-binding partner TF as well as the information about the interactions between DNA and TF, which probably impairs the predictive power and interpretability of the established models.

With the rapidly increasing availability of solved protein–DNA complex crystal structures in recent years, it is now possible to exploit structure-based approaches to characterize the binding behavior of a TF to DNA at the atomic level and to explain the molecular mechanism of the binding event directly from the known protein–DNA complex structures. Baker and coworkers made the first attempt to do so; they employed an integrated protocol of crystal structure analysis, side-chain rotamer search, and free energy calculation to dissect the interfacial features and energetic landscape of several DNA-binding proteins interacting with their cognate DNA ligands and, on this basis, developed a simple physical model to predict and design protein–DNA interactions with desired biological functions (13,14). Later, using a modified strategy, the same group successfully redesigned the interface properties and cleavage specificity of the endonuclease I-MsoI on its substrate DNA molecules (15). Other methods available for structure-based prediction of protein–DNA binding specificity include, but are not limited to, a variety of knowledge-based potentials (16) as well as molecular mechanical (17) and more exhaustive quantum mechanical (18) calculations, which were used to describe the direct readout energy involved in protein–DNA binding (19), while the thermodynamic effects associated with indirect readout can be characterized by molecular dynamics simulation (20). Moreover, we have recently presented a systematic classification and analysis of themes in protein–DNA recognition using a protocol that combines automatic methods with manual inspection to build a comprehensive classification tree for currently available high-quality structure data. The knowledge gained would be helpful for understanding the molecular mechanism and biological implications of protein–DNA binding and for accurate modeling of affinity values at a mesoscopic level (21).

Despite the considerable success of structure-based methods in all-atom modeling of the energetic components of protein–DNA binding and in the rational design of proteins that recognize target DNA sites with prescribed sequence specificity, it remains a great challenge to use structure-based methods for high-throughput prediction and inference of potential TF-binding sites at the genomic level. In this work, we propose a combination strategy that joins the high efficiency of the sequence-based method with the ready interpretability of the structure-based approach to investigate the complete DNA-binding specificity landscape of Saccharomyces cerevisiae Gcn4p, a prototypical homodimeric basic leucine zipper TF that is the master regulator of the amino acid starvation response (22). The procedure starts from the crystal structure of Gcn4p complexed with a 20-bp double-stranded DNA segment. Using virtual mutagenesis, structure optimization, and energy analysis, we quantitatively characterize the interaction energy profile of Gcn4p with all possible single-base mutations within the core region of the DNA segment, and this profile is then used to define a position-mutated energy matrix (PMEM). On this basis, a position-independent/interactive linear equation (PILE) is derived by fitting energy terms extracted from the PMEM to experimental observations of more than 10 000 11-mer oligonucleotides with known Gcn4p-binding affinity. Subsequently, we employ the derived PILE to analyze the binding specificity between Gcn4p and distinct DNA sequence patterns and to infer the potential binding sites of Gcn4p within the S. cerevisiae genome. We also give preliminary insight into the physicochemical nature and origin of Gcn4p–DNA binding specificity.

Materials and Methods

Gcn4p–DNA binding affinity data

Recently, a new technique called high-throughput sequencing fluorescent ligand interaction profiling (HiTS-FLIP) was described by Nutiu et al. (23) to quantitatively measure protein–DNA binding affinity both effectively and efficiently. This technique incorporates a fluorophore assay into a high-throughput sequencer to measure DNA-binding affinity to a target protein in vitro. Applying HiTS-FLIP to S. cerevisiae Gcn4p yielded millions of binding measurements, enabling determination of dissociation constants for a massive number of oligonucleotides. According to an early report, the core region of the wild-type DNA sequence that Gcn4p binds contains a consensus heptanucleotide motif TGACTCA (24). To include the marginal effects arising from nucleotides beyond the 5′- and 3′-termini of the core heptanucleotide motif, we considered 11-mer oligonucleotides, in which each end carries a two-nucleotide extension from the core heptanucleotide (Figure 1). A large-scale panel of 19 268 distinct 11-mer oligonucleotides, which differ in the extensions at the two ends and/or in mutations of the core heptanucleotide motif, was compiled; their binding affinities to Gcn4p were assayed by Nutiu et al. (23) using the HiTS-FLIP technique and quantified as the dissociation constants Kd of the Gcn4p–oligonucleotide complexes (see Supporting Information, Table S1).

Figure 1.

 (A) Stereoview of Saccharomyces cerevisiae Gcn4p in complex with a 20-bp dsDNA segment (PDB: 1ysa). In the dsDNA, the core heptanucleotide motif and the 5′-/3′-extensions from the core motif are colored pink and purple, respectively. (B) Schematic representation of the considered 11-mer oligonucleotide, which consists of a core heptanucleotide motif and two two-nucleotide extensions, one at each of the 5′- and 3′-termini.

Preparation of Gcn4p–DNA complex structure

The crystal structure of Gcn4p in complex with a 20-bp double-stranded DNA (dsDNA) segment (5′-TTCCTATGACTCATCCAGTT-3′) was determined by Ellenberger et al. (25) using X-ray crystallography. As shown in Figure 1A, the dimerized basic leucine zipper inserts its two helical arms into the dsDNA major groove, forming extensive contacts with the core heptanucleotide motif of the dsDNA; the two regions extending from the 5′- and 3′-termini of the core motif also appear to confer stability and specificity on the protein–DNA architecture. Because this complex crystal structure was solved at only a modest resolution (2.9 Å) and hence may contain unreasonable distortions and interatomic collisions, before analysis we employed the AMBER99 potential (26) as implemented in the TINKER package (27) to perform energy minimization on the complex structure and largely eliminate these structural errors. The minimization procedure was similar to that of a previous study of protein–DNA binding energetics (28). Briefly, explicit hydrogen atoms were first added to the structure and then optimized in the gas phase using the Newton method (29). Subsequently, the crystallographic water molecules were removed, and the complex structure was fully minimized with the GB/SA solvation model (30), limited to 100 steps; this number of steps allows the structure to approach or reach a local minimum in a reasonable amount of time (32).

Position-mutated energy matrix

In the minimized Gcn4p–DNA complex, the 11 base pairs of the core heptanucleotide and its extensions were manually mutated one by one to the other three base-pair types, one position and one type at a time, to create a complete position-mutated profile. As a result, 33 (3 × 11) Gcn4p–DNA mutants were obtained. For each mutant, energy minimization was carried out using the protocol described in the section Preparation of Gcn4p–DNA complex structure, and the interaction energy Emutation between the mutated base pair and Gcn4p was calculated in the following three steps: (i) the interaction energy E1 of the mutated DNA with Gcn4p was computed; (ii) the mutated base pair was artificially removed from the DNA, and the interaction energy E2 of the incomplete DNA with Gcn4p was recomputed; and (iii) Emutation was then defined as E1 − E2. Here, the interaction energy E1 or E2 between Gcn4p and the (complete or incomplete) DNA was obtained from a simple thermodynamic cycle, that is, E1 or 2 = Ucomplex − UGcn4p − UDNA, where Ucomplex is the system energy of the Gcn4p–DNA complex, and UGcn4p and UDNA are the energies of Gcn4p and DNA, respectively, in the isolated state, as computed using the AMBER99 force field (26) in conjunction with the GB/SA solvation model (30). Similarly, the interaction energies between Gcn4p and each of the 11 base pairs in the native structure were calculated using the same strategy. In this way, a total of 44 energy terms, 33 associated with mutated base pairs and 11 with wild-type base pairs, were obtained, and together they define the PMEM, as tabulated in Table 1.
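The three-step energy bookkeeping can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the class name `Energies`, the function names, and all numerical values are hypothetical, and the force-field energies themselves are assumed to come from an external minimization run.

```python
from dataclasses import dataclass


@dataclass
class Energies:
    """Force-field energies (kcal/mol) of one system, taken from an external calculation."""
    u_complex: float  # Gcn4p-DNA complex
    u_protein: float  # isolated Gcn4p
    u_dna: float      # isolated DNA (complete or with one base pair removed)


def interaction_energy(e: Energies) -> float:
    # E = U_complex - U_Gcn4p - U_DNA
    return e.u_complex - e.u_protein - e.u_dna


def mutation_energy(complete: Energies, incomplete: Energies) -> float:
    # E_mutation = E1 - E2, where E2 is computed with the mutated base pair removed
    return interaction_energy(complete) - interaction_energy(incomplete)


# Hypothetical numbers chosen for illustration only (not taken from the study).
complete = Energies(u_complex=-1250.0, u_protein=-800.0, u_dna=-400.0)    # E1 = -50
incomplete = Energies(u_complex=-1200.0, u_protein=-800.0, u_dna=-355.0)  # E2 = -45
print(mutation_energy(complete, incomplete))
```

Repeating this calculation for the 33 mutants and the 11 native base pairs would give the 44 entries of the matrix.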

Table 1.   The position-mutated energy matrix (PMEM), in which each element represents the independent interaction energy between Gcn4p and the base pair whose type and position are given by the matrix row and column, respectively. For convenience, negative signs are omitted.
Base pair   P1   P2   P3   P4   P5   P6   P7   P8   P9   P10   P11

Position-independent/interactive linear equation

In theory, the total binding energy Etotal between Gcn4p and a dsDNA with a specific 11-nucleotide motif can be obtained by summing the energetic contributions of the 11 base pairs, assuming that the contributions are additive and position independent:

$$E_{\mathrm{total}} = \sum_{i=1}^{11} E_i^{t_i} \tag{1}$$

where i denotes the position of a base pair in the 11-nucleotide motif, ti is the type of the base pair located at position i, and $E_i^{t_i}$ therefore represents the interaction energy of Gcn4p with the base pair at position i and of type ti, which can be looked up in Table 1.
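Under the additive assumption, the total energy is a table lookup followed by a sum. The sketch below shows this with a placeholder matrix; the values in `toy_pmem` are invented for illustration and are not the entries of Table 1, and the names `additive_energy` and `toy_pmem` are hypothetical.

```python
BASES = "ACGT"


def additive_energy(seq: str, pmem: dict) -> float:
    """Sum the per-position energies for an 11-mer, assuming additive contributions."""
    n_positions = len(pmem[BASES[0]])
    if len(seq) != n_positions:
        raise ValueError(f"expected a {n_positions}-mer, got length {len(seq)}")
    # pmem[base][i] plays the role of the energy of base type `base` at position i
    return sum(pmem[base][i] for i, base in enumerate(seq))


# Placeholder matrix: A=1, C=2, G=3, T=4 at every position (NOT real PMEM values).
toy_pmem = {base: [float(k + 1)] * 11 for k, base in enumerate(BASES)}

print(additive_energy("TATGACTCATA", toy_pmem))
```

With real PMEM entries in place of the placeholders, the same function would evaluate the position-independent energy of any candidate 11-mer.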

In practice, however, the 11 base pairs, particularly neighboring ones, which exhibit noticeable π-π stacking character, are highly interactive, and their contributions to binding may differ significantly. In addition, the contribution from other factors that were not considered here, such as entropy loss and conformational effects, could be regarded as a constant. In this respect, the binding affinity, which is conventionally represented by the negative logarithm of the dissociation constant (−logKd or pKd) and is approximately linearly correlated with the total binding energy (pKd ∝ Etotal) according to the Gibbs equilibrium equation (31), could be expressed by the following formula:

$$\mathrm{p}K_d = \sum_{a} C_a E_a^{t_a} + \sum_{a<b} C_{ab} E_{ab}^{t_a t_b} + \sum_{a<b<c} C_{abc} E_{abc}^{t_a t_b t_c} + \cdots + C_{ab\ldots k} E_{ab\ldots k}^{t_a t_b \ldots t_k} + C_{\mathrm{constant}} \tag{2}$$

where a, b, c, …, k and tx are the positions and the base-pair types at those positions, as in eqn (1), $E_{ab\ldots k}^{t_a t_b \ldots t_k}$ denotes the interactive term formed from the energies at positions a, b, …, k, and C is the coefficient that characterizes the weight of each term. Eqn (2) is the final form of our PILE, which consists of three components: the position-independent contribution Aindependent, the position-interactive contribution Binteractive, and the constant contribution Cconstant. (i) Aindependent is the sum of 11 weighted terms, each representing an independent contribution from a single position; (ii) Binteractive is the sum of 2036 weighted terms: 55 ($\binom{11}{2}$) binary interactive terms, 165 ($\binom{11}{3}$) ternary interactive terms, 330 ($\binom{11}{4}$) quaternary interactive terms, …, 11 ($\binom{11}{10}$) ten-membered interactive terms, and 1 ($\binom{11}{11}$) eleven-membered interactive term; and (iii) Cconstant, the constant term, represents the additional contribution from other factors. The 2047 (2036 + 11) coefficients C as well as the intercept Cconstant can be determined by linearly fitting eqn (2) to the experimentally measured affinity values pKd of the 19 268 11-mer oligonucleotides listed in Supporting Information, Table S1. This fitting can be readily conducted using partial least squares (PLS) regression (32), a method that has been demonstrated to be a versatile bioinformatic tool for multivariable regression problems (33) and has already been applied successfully to describe TF–DNA binding behavior (34).
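The term counts quoted here follow from enumerating every non-empty subset of the 11 positions. The Python sketch below reproduces the counts and builds one regressor per subset. The text does not state how an interactive term is formed from the single-position energies; taking the product of the member energies is an assumption made here for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb, prod

N_POSITIONS = 11


def term_counts(n: int = N_POSITIONS) -> dict:
    """Number of k-membered terms for k = 1..n, i.e. C(n, k)."""
    return {k: comb(n, k) for k in range(1, n + 1)}


def pile_features(energies: list) -> list:
    """One regressor per non-empty subset of positions.

    ASSUMPTION: an interactive term is the product of the single-position
    energies of its members; the source does not specify the functional form.
    """
    n = len(energies)
    features = []
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            features.append(prod(energies[i] for i in subset))
    return features


counts = term_counts()
n_interactive = sum(v for k, v in counts.items() if k >= 2)
print(counts[2], counts[3], counts[4], n_interactive)
print(len(pile_features([1.0] * N_POSITIONS)))
```

Ordering the features by subset size puts the 11 independent terms first and the 2036 interactive terms after them, which matches the variable numbering 1–11 and 12–2047 used for the variable-importance plot.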

Results and Discussion

Sequence and energy analyses of the 11-mer oligonucleotides

The 11 base pairs are highly interactive in contribution to Gcn4p–DNA binding affinity

According to the experimental measurements, among the 19 268 assayed samples, the 11-mer oligonucleotide TATGACTCATA exhibits the highest binding affinity (Kd = 10.8 nM) to Gcn4p, indicating very strong interactions between this sequence pattern and the Gcn4p protein. Intuitively, one would expect the base at each position of this sequence to interact more strongly than the three alternatives at the same position. However, our energy analysis tells a different story; as can be seen in Table 1, the bases with the largest interaction energy are usually not those present at the corresponding positions of the strongest binder TATGACTCATA, and there are even two positions (P2 and P9) of this strongest binder at which the occupying bases are the lowest energy contributors among the four base types. Evidently, the total binding energy between Gcn4p and an 11-mer oligonucleotide is not simply equal to the sum of the independent interaction energies of the 11 positions with Gcn4p. In other words, the 11 base pairs are highly interactive and non-additive in their contribution to Gcn4p–DNA binding affinity. This conclusion is supported by directly fitting the 11 independent interaction energies to the observed binding affinities of the 19 268 oligonucleotide samples, considering only the position-independent contribution Aindependent and the constant term Cconstant in eqn (2) and ignoring the interactive effect Binteractive among the sequence positions. As might be anticipated, the fit failed; the correlation r between the fitted and observed affinity values was only 0.284, and such a low correlation clearly suggests the non-additive character of Gcn4p–DNA binding. The principal origin of non-additivity is the so-called indirect component of protein–DNA recognition, which is related to the sequence dependence of the DNA deformation induced during complex formation (35).

The direct nonbonded interactions contribute limitedly to Gcn4p–DNA binding specificity

It is intuitively thought that direct nonbonded interactions, such as hydrogen bonds and electrostatic contacts between the protein and the edges of DNA base pairs, provide a large proportion of the specificity in protein–DNA recognition. However, this notion does not seem to be consistent with our analysis of the Gcn4p–DNA complex. As shown in Figure 2, most nonbonded interactions across the complex interface are formed with the phosphates of the DNA backbone, which is invariant under nucleotide mutation, while only very few interactions, including four hydrogen bonds involving the two Gcn4p residues Asn235 and Arg243, are formed with the variable base pairs at positions 4, 6, and 9 of the 11-mer oligonucleotide segment. In fact, a dozen considerable affinity changes are associated with nucleotide mutations at other positions. For example, noticeable affinity reductions were observed upon single-point mutation of the native sequence TATGACTCATC (11.6 nM) to its mutated counterparts TAAGACTCATC (94.4 nM), TATGACACATC (46.3 nM), and TATGACTCAGC (104.8 nM); according to the virtual mutagenesis analysis, these mutations do not break (or create) any nonbonded interaction in the Gcn4p–DNA complex and have only modest effects on the independent interaction energies of these positions with Gcn4p (ΔE = −0.48, 0.94, and −1.33 kcal/mol, respectively), but they subtly affect the native conformation of the complex architecture, thus altering the geometry and strength of existing interactions and the fine atomic contacts at the binding interface. We therefore consider that the indirect conformational effects of nucleotide mutations and the associated subtle changes in interfacial atomic contacts, rather than the direct nonbonded interactions, are primarily responsible for sequence-specific Gcn4p–DNA recognition. 
This finding could also be used to explain why the DNA base pairs are highly interactive in contribution to binding affinity, because single-point mutations would affect the conformations and interactions over even the entire interface of Gcn4p–DNA complex.
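To put the measured affinity changes on the same energy scale as the computed ΔE values, the dissociation constants can be converted to pKd and to a relative binding free energy through the standard relation ΔΔG = RT ln(Kd,mutant/Kd,native). The sketch below applies this to the three mutants quoted in the text. The temperature of 298.15 K is an assumption, since the assay temperature is not given here, and the function names are illustrative.

```python
from math import log, log10

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def pkd(kd_nM: float) -> float:
    """Negative decimal logarithm of Kd, with Kd converted from nM to M."""
    return -log10(kd_nM * 1e-9)


def ddg(kd_mut_nM: float, kd_ref_nM: float, temperature_K: float = 298.15) -> float:
    """Relative binding free energy in kcal/mol; positive means weaker binding.

    ASSUMPTION: 298.15 K is used because the source does not state the temperature.
    """
    return R_KCAL * temperature_K * log(kd_mut_nM / kd_ref_nM)


native_kd = 11.6  # TATGACTCATC, nM
mutants = {"TAAGACTCATC": 94.4, "TATGACACATC": 46.3, "TATGACTCAGC": 104.8}

for seq, kd in mutants.items():
    print(seq, round(pkd(kd), 2), round(ddg(kd, native_kd), 2))
```

Comparing these free-energy differences with the ΔE values from virtual mutagenesis gives a direct sense of how much of the affinity change the independent interaction energies leave unexplained.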

Figure 2.

 Schematic representation of the nonbonded interactions across the Gcn4p–DNA binding interface, in which the region containing the studied 11-mer oligonucleotide is highlighted in a box. This plot was generated from the crystal structure of the Gcn4p–DNA complex (PDB: 1ysa) using the NUCPLOT program (37).

The intrinsic synergistic effect determines Gcn4p–DNA binding affinity and specificity

A typical feature of high-affinity oligonucleotides is the high conservation of their sequence patterns, in particular at the positions that are in close contact with Gcn4p. As seen in Figure 3A, the sequence logo of the strongest binders (Kd < 100 nM) presents a consensus sequence pattern with a bimodal profile; each peak represents a region where Gcn4p, the dimerized leucine zipper, places one of its helical arms in the major groove of the dsDNA, forming intensive atomic contacts at the binding interface. In contrast, the sequence logo of the weakest binders (Kd > 900 nM) shows a distinct pattern (Figure 3B); only the two extended regions at the 5′- and 3′-termini of the 11-mer oligonucleotides exhibit higher conservation, whereas the core heptanucleotide motif is occupied randomly by all four possible nucleotides. By manually examining the apparent relationship between sequence and affinity for the 19 268 samples, we found that high-affinity oligonucleotides are quite limited in number (∼10%) compared with moderate- and low-affinity ones (∼60%) and, more importantly, that the former commonly share a consensus sequence pattern; that is, the strong binders differ from this pattern in at most two core residues or three extended residues. The high sequence conservation of the strong binders reveals the decisive role of the intrinsic synergistic effect among the sequence positions in conferring affinity and specificity on an oligonucleotide: only a very limited number of mutations are tolerated without substantially impairing Gcn4p–DNA binding, and once the number of mutations exceeds the tolerance limit, commonly one to three nucleotides, the binding affinity and specificity of the oligonucleotide are lost almost completely. 
The synergistic feature arises at least partially from nearest-neighbor effects on base-pair steps (36), which were found to have a significant influence on the DNA conformational state and hence to indirectly affect Gcn4p–DNA binding behavior.
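The conservation that a sequence logo displays can be quantified as the information content at each position, that is, the maximum entropy of the alphabet (2 bits for DNA) minus the observed Shannon entropy. The following sketch computes it for a set of aligned sequences. It is a minimal illustration: it omits the small-sample correction that logo software typically applies, and the example sequences are invented.

```python
from collections import Counter
from math import log2


def information_content(seqs: list, alphabet: str = "ACGT") -> list:
    """Per-position information content (bits) of equal-length aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    max_bits = log2(len(alphabet))
    bits = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts.values())
        entropy = -sum((c / total) * log2(c / total) for c in counts.values())
        bits.append(max_bits - entropy)
    return bits


# Invented toy alignment: first two positions fully conserved, last one uniform.
toy = ["ACGT", "ACGA", "ACCC", "ACTG"]
print(information_content(toy))
```

Applied separately to the strong and weak binder subsets, such a profile would show which of the 11 positions carry the conservation seen in each logo.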

Figure 3.

 The sequence logos of the strongest binders (Kd < 100 nM) (A) and the weakest binders (Kd > 900 nM) (B). This figure was created with WebLogo (38).

Development of PILE model

The fact that strong interactive effects exist among the 11 sequence positions of the oligonucleotides explains why simply fitting the position-independent contribution Aindependent and the constant term Cconstant to the observed affinities does not work. Indeed, attempts to date to use linearly weighted approaches to assess the DNA-binding specificity of TFs from protein-binding microarray data have suggested that the energetics of TF–DNA recognition do not follow a simple rule (39). Therefore, we further included the position-interactive contribution Binteractive, which consists of 2036 binary to eleven-membered interactive terms, in the linear regression equation to define the complete form of the PILE model. PLS regression extracted 18 significant latent components (LCs) from the 19 268 × 2047 independent variable matrix (2047 = 2036 interactive terms + 11 independent terms) and used them to correlate linearly with the 19 268 × 1 dependent matrix. As might be expected, the resulting correlation r between fitted and observed affinity values improved substantially compared with that based solely on position-independent terms (0.635 versus 0.284). Specifically, the improved model gave an acceptable standard deviation (SD) of fitting of 0.24, which corresponds to a fluctuation of only about 5% of the model-estimated against the experimentally measured affinities over the 19 268 samples.

Furthermore, we employed an out-of-bag method to assess the importance of each variable in the PILE model; the importance of the kth variable was simply defined as SDk − SD, where SD is the standard deviation of the model built with all variables, and SDk is that of the model with the kth variable removed. The 2047 measures of variable importance are shown in Figure 4, in which bars 1–11 represent the 11 independent terms and bars 12–2047 correspond to the 2036 interactive terms. The results suggest that (i) the absence of any single one of the 2047 variables would not substantially influence the model’s performance; and (ii) the interactive terms, especially the multiposition-interactive terms, seem to be more important to the model than the independent terms.
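The importance measure SDk − SD amounts to refitting the model once per variable with that variable left out. The sketch below expresses this as a loop around a user-supplied fitting routine. The callback `fit_sd`, which is assumed to refit the regression on the given columns and return its SD, and the toy stand-in `toy_fit_sd` with its penalty values are hypothetical and serve only to make the example runnable.

```python
from typing import Callable, List, Sequence


def variable_importance(n_vars: int,
                        fit_sd: Callable[[Sequence[int]], float]) -> List[float]:
    """Return SD_k - SD for every variable k, where SD_k is the SD without variable k."""
    all_vars = list(range(n_vars))
    sd_full = fit_sd(all_vars)
    importance = []
    for k in all_vars:
        reduced = [j for j in all_vars if j != k]
        importance.append(fit_sd(reduced) - sd_full)
    return importance


# Hypothetical stand-in for "refit the model and return its SD":
# removing a variable raises the SD by a fixed, invented penalty.
TOY_PENALTY = {0: 0.02, 1: 0.0, 2: 0.05}


def toy_fit_sd(columns: Sequence[int]) -> float:
    missing = set(TOY_PENALTY) - set(columns)
    return 0.24 + sum(TOY_PENALTY[j] for j in missing)


scores = variable_importance(3, toy_fit_sd)
print(scores)
```

In a real analysis `fit_sd` would wrap the PLS fit, so the cost is one refit per variable.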

Figure 4.

 The relative importance of variables in position-independent/interactive linear equation model. The bars with numbers 1–11 represent the 11 independent terms and with numbers 12–2047 correspond to the 2036 interactive terms. The measure of importance for the kth variable was simply defined as SDk–SD, where the SD is the standard deviation of the model built with all variables, and SDk is that with kth variable removed.

Although the correlation coefficient r and SD suggest a seemingly satisfactory result, these statistics only indicate the internal fitting ability of a statistical model, that is, its performance on the samples that were used to develop it, and say nothing about samples outside model development. It is well known that overfitting is ubiquitous in statistical modeling (40), and many regression methods, such as multiple linear regression (41) and, more significantly, the back-propagation neural network (42), are often associated with this problem. In bioinformatics, the jackknife test is widely applied to verify the reliability and generalization ability of a model (43); it uses a single observation from the original sample set as the validation data and the remaining observations as the training data, and this is repeated such that each observation in the sample is used once as the validation data. Because there are more than 10 000 observations here, for which running the jackknife test would be too time-consuming, we adopted a modified method called cross-prediction (44), which randomly divides the samples equally into three groups and then uses each group once as the test data to validate the model built on the other two groups. In this way, the stability and predictive power of the PILE model were measured by the correlation r = 0.415 and SD = 0.38 between cross-predicted and observed affinities. As can be seen, r and SD differ considerably between fitting and cross-prediction, indicating relatively strong overfitting of the PILE model; this is not unexpected given that some two thousand variables were used in the modeling, which may introduce significant noise and strong collinearity. This problem is addressed in the next section by means of variable selection.
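The cross-prediction scheme is a three-fold split in which every sample is predicted exactly once by a model that never saw it. A minimal sketch follows. The one-variable line fit (`fit_line`, `predict_line`) is a stand-in for the PLS model used in the study, the seed and function names are illustrative, and because 19 268 is not divisible by three the group sizes differ by at most one.

```python
import random
from math import sqrt


def three_fold_groups(n_samples: int, seed: int = 0) -> list:
    """Randomly split sample indices into three groups of (nearly) equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[g::3] for g in range(3)]


def pearson_r(x: list, y: list) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cross_predict(X: list, y: list, fit, predict, seed: int = 0) -> list:
    """Predict each sample with a model trained on the two groups that exclude it."""
    groups = three_fold_groups(len(y), seed)
    predicted = [None] * len(y)
    for g, test_idx in enumerate(groups):
        train_idx = [i for h, grp in enumerate(groups) if h != g for i in grp]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predicted[i] = predict(model, X[i])
    return predicted


# Stand-in model (NOT the PLS model of the study): least-squares line in one variable.
def fit_line(xs: list, ys: list) -> tuple:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx


def predict_line(model: tuple, x: float) -> float:
    slope, intercept = model
    return slope * x + intercept


xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 for x in xs]
y_hat = cross_predict(xs, ys, fit_line, predict_line)
print(pearson_r(ys, y_hat))
```

The correlation and SD between `y_hat` and the observed values are then the cross-prediction statistics, to be compared with their fitting counterparts.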

Improvement of model predictability

As mentioned above, the PILE model built using all 2047 variables appears to be at least partially overfitted. We therefore employed variable selection to address this problem. To date, various variable selection methods have been proposed by different groups to extract the best subset from a crude, large-scale variable pool (45). In this study, we adopted three established methods, stepwise regression (SR) (46), the heuristic method (HM) (47), and the genetic algorithm (GA) (48), to perform variable selection for the PILE model. Briefly, the SR method successively introduces significant variables into, or deletes insignificant variables from, the subset until no further variable can be introduced or deleted; the HM method applies a series of empirical rules to eliminate insignificant, collinear, noisy, and/or redundant variables from the initial variable pool; and the GA method is a parallel, non-numerical algorithm that treats variable selection by mimicking the basic mechanism of Darwinian natural selection in the evolution of populations.
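To make the GA idea concrete, the sketch below evolves a population of on/off masks over the variables with truncation selection, one-point crossover, and bit-flip mutation. It is a generic textbook-style GA and not the implementation or the parameter settings used in the study. The fitness callback is assumed to return a score to maximize (in practice it could be the cross-predicted r of a PLS model built on the selected variables); `toy_fitness` and the set `INFORMATIVE` are invented for the example.

```python
import random
from typing import Callable, List

Mask = List[bool]


def ga_select(n_vars: int, fitness: Callable[[Mask], float],
              pop_size: int = 20, generations: int = 30,
              p_mut: float = 0.05, seed: int = 0) -> Mask:
    """Return the best variable mask found by a simple genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Truncation selection: the fitter half of the population become parents.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_vars)       # one-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation with probability p_mut per variable.
            child = [(not gene) if rng.random() < p_mut else gene for gene in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)    # keep the best mask seen so far
    return best


# Invented toy problem: variables 0-4 are "informative", the rest only add noise.
INFORMATIVE = {0, 1, 2, 3, 4}


def toy_fitness(mask: Mask) -> float:
    hits = sum(1 for j, on in enumerate(mask) if on and j in INFORMATIVE)
    extras = sum(1 for j, on in enumerate(mask) if on and j not in INFORMATIVE)
    return hits - 0.5 * extras


best_mask = ga_select(12, toy_fitness)
print([j for j, on in enumerate(best_mask) if on])
```

Because the best mask is carried forward between generations, the returned score never decreases as the number of generations grows.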

Before modeling, the optimum parameters for these variable selection methods were chosen carefully; owing to space limitations, we do not describe the details of this procedure here. The statistics of the three variable-selected models built separately using SR, HM, and GA in conjunction with PLS are listed in Table 2. It is evident that both the stability and the predictive power of the models that underwent variable selection, whichever selection method was used, improve considerably in comparison with the full-variable model, although the fitting ability of the former is somewhat degraded relative to the latter. An increase in predictive power and a decrease in fitting ability usually go together when variable selection is performed for a regression model, because fitting ability depends only on the number of variables used in the modeling, whereas predictive power is usually determined by the ratio of significant to insignificant variables.

Table 2.   Statistics of three variable-selected models as well as full-variable model

Variable selection   NV(a)   NSLC(b)   Fitting r   Fitting SD   Cross-prediction r   Cross-prediction SD
Full variable        2047    18        0.635       0.24         0.415                0.38

GA, genetic algorithm; HM, heuristic method; SR, stepwise regression.
(a) NV, number of variables engaged in the modeling.
(b) NSLC, number of significant latent components extracted by partial least squares.

The optimal model was built using GA variable selection in conjunction with PLS regression, which extracted 15 significant LCs from the subset of 1037 selected variables and exhibited both stronger fitting ability (r = 0.591) and, in particular, higher predictive power (cross-prediction r = 0.538). The predicted against observed affinities for the 19 268 oligonucleotides in the three rounds of cross-prediction are shown in Figure 5. The affinity profile is clearly unevenly distributed: most samples have low or moderate affinity to Gcn4p, and their affinity values were generally overestimated by the model, whereas the few samples with high binding capability were significantly underestimated. It is not uncommon for a regression model to behave systematically differently on high- and low-affinity samples; we encountered a similar problem in several of our previous studies (49–51). Although the complicated interactive effects involved in the system have been considered here, the nonlinear relationship between the internal interactions and the external affinity of the samples remained largely unconsidered. This could explain why systematic errors exist in the predictions of our model. In addition, ignoring the context specificity of TF–DNA recognition may also lead to some bias in the resulting models (52).

Figure 5.

 The predicted against observed affinities for the 19 268 oligonucleotides separately in the three rounds of cross-prediction using the model of genetic algorithm variable selection in conjunction with partial least squares regression.

Inference of Gcn4p-binding specificity at the genomic level

Predicting the affinity values of unmeasured DNA segments provides an estimate of the rank order of their binding capabilities to a target TF of interest. Even if the values are not accurately quantified, the predicted results can still be used to prioritize the candidate segments. Here, we employed the linear regression model built using GA variable selection in conjunction with PLS regression to infer the binding priority of all possible 11-mer oligonucleotide segments in the S. cerevisiae genome, which contains about 12 Mb of DNA in total, organized into 16 chromosomes (53). The primary sequences of these DNA molecules were retrieved from the GenBank database (54).

It is known that the complete genome of a species contains abundant redundant or ‘junk’ DNA regions, such as poly(nucleotide) tails, terminal repeats, and telomeres, which are distinct from the consensus sequence patterns of TF-binding partners. Therefore, we considered only the effective 11-mer oligonucleotide segments generated from the S. cerevisiae genome; an 11-mer oligonucleotide was defined as an effective segment if its core heptanucleotide motif matches four or more nucleotides in the corresponding positions of the standard motif TGACTCA. As a result, ∼780 000 effective segments were extracted from the S. cerevisiae genome, and the regression model was then used to score these segments. Consequently, more than 10 000 segments were predicted to be high-affinity binders with a dissociation constant <100 nM.
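The filtering rule for effective segments can be sketched as a sliding-window scan. This is a minimal illustration with several assumptions that the text does not specify: the core heptanucleotide is taken to be centered in the 11-mer (two flanking nucleotides on each side, the hypothetical `FLANK` constant), only the given strand is scanned, and windows containing ambiguous bases are skipped. The function names are illustrative.

```python
MOTIF = "TGACTCA"  # standard Gcn4p motif given in the text
FLANK = 2          # assumption: the heptamer core is centered in the 11-mer

def core_matches(segment, motif=MOTIF, flank=FLANK):
    """Count positions where the segment's core heptamer agrees with the motif."""
    core = segment[flank:flank + len(motif)]
    return sum(a == b for a, b in zip(core, motif))

def effective_segments(sequence, length=11, min_matches=4):
    """Yield (start, segment) for every window whose core matches the motif
    at min_matches or more positions. Start offsets are 0-based."""
    seq = sequence.upper()
    for start in range(len(seq) - length + 1):
        segment = seq[start:start + length]
        # Skip windows containing ambiguous bases such as N.
        if set(segment) <= set("ACGT") and core_matches(segment) >= min_matches:
            yield start, segment

# Usage on a toy sequence: only the first window carries the full motif.
for start, segment in effective_segments("CCTGACTCATTA"):
    print(start, segment, core_matches(segment))
```

Each segment retained by such a filter would then be scored by the regression model; scanning the reverse strand as well would require applying the same function to the reverse complement.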

As an example, the density distributions of effective and high-affinity segments along the DNA sequence of S. cerevisiae chromosome I are shown in Figure 6, in which the ordinate represents the density, that is, the number of effective or high-affinity segments per Kb of sequence, and the abscissa denotes the sequence position in Kb. As can be seen, the potential binding partners of Gcn4p in chromosome I are distributed quite randomly and, more interestingly, the density profiles of effective and high-affinity segments do not appear to be consistent at all, implying that no more than three nucleotide mutations in the core heptanucleotide motif of a DNA segment are enough to fundamentally change its binding capability to Gcn4p. This is further supported by the fact that most high-affinity samples have no mutation, or at most one, in their heptanucleotide motif. This observation demonstrates fairly well that a strong synergistic effect is a prerequisite for a high-affinity Gcn4p binder. Furthermore, the top 10 inferred high-affinity 11-mer oligonucleotide segments in the S. cerevisiae genome are tabulated in Table 3, from which it is seen that the 10 sequences share a strikingly similar pattern, in particular with respect to their core heptanucleotide motif: eight of the 10 completely match the standard motif TGACTCA, and the remaining two differ from the standard motif at only one position. In fact, the predicted binding affinities of these samples are very close; their dissociation constants vary within only about 3 nM, ranging from 49.4 to 52.0 nM. In addition, these highest-affinity samples are not evenly distributed among the 16 chromosomes of the S. cerevisiae genome, exhibiting significant heterogeneity. For instance, some chromosomes such as XII and IV hold two potent binders, whereas others such as III and VI have no candidate for strong binding.
In fact, the heterogeneity is present not only across chromosomes but also within a chromosome. As shown at the top of Figure 6, the five strongest binders in chromosome I are located in the beginning and ending regions of the chromosomal DNA sequence, leaving a vast empty area between them. Heterogeneity and inhomogeneity are recognized as basic features of diverse biological phenomena, which may originate from both the differentiation and the reorganization of functional modules through genetic evolution (55).
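The per-Kb density described above can be computed by binning segment start positions. The following is a minimal sketch that assumes non-overlapping 1-Kb windows and 0-based start positions in base pairs; the binning scheme actually used for Figure 6 is not stated in the text, and `density_per_kb` is a hypothetical helper name.

```python
def density_per_kb(positions_bp, chrom_length_bp, window_bp=1000):
    """Return the number of segments per Kb in consecutive non-overlapping windows.

    positions_bp are 0-based segment start positions; the last window may be
    shorter than window_bp, so its count is normalized by its true width.
    """
    n_bins = -(-chrom_length_bp // window_bp)  # ceiling division
    counts = [0] * n_bins
    for pos in positions_bp:
        counts[pos // window_bp] += 1
    densities = []
    for i, count in enumerate(counts):
        width = min(window_bp, chrom_length_bp - i * window_bp)
        densities.append(count / (width / 1000))
    return densities

# Usage: four segment starts on a 3-Kb toy chromosome.
print(density_per_kb([10, 500, 1500, 2999], 3000))
```

For chromosome I, with the length of 230.208 Kb given in the Figure 6 caption, this binning would produce 231 windows, the last covering only 208 bp.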

Figure 6.

 The density distributions of effective and high-affinity 11-mer oligonucleotide segments along the DNA sequence of Saccharomyces cerevisiae chromosome I (complete length = 230.208 Kb). The blue and red curves represent the density distributions of effective and high-affinity segments, which are those with predicted dissociation constant <1000 nM and <100 nM, respectively. The precise sites of the top five inferred segments in the DNA sequence are depicted at the top of the plot.

Table 3.   The top 10 inferred high-affinity 11-mer oligonucleotide segments in the Saccharomyces cerevisiae genome

Oligonucleotide   Chromosome   Site (Kb)   Predicted Kd (nM)


In this work, we developed a combination approach for high-throughput inference of TF–DNA binding specificity. This method combines the high efficiency of traditional sequence-based strategies with the good interpretability of modern structure-based strategies. We employed this method to analyze the position-independent interaction energy profile of an 11-mer oligonucleotide segment with the dimerized leucine zipper Gcn4p, based on a virtual site-directed mutagenesis/molecular mechanics dissection procedure, and to characterize the DNA-binding specificity landscape of Gcn4p using GA variable selection in conjunction with PLS regression modeling. The resulting statistical model was then applied to screen potential high-affinity Gcn4p binders from the complete S. cerevisiae genome. The main findings of this study are as follows: (i) Different positions of the 11-mer oligonucleotides are highly interactive and non-additive in their contribution to Gcn4p–DNA binding affinity. This is supported by analysis of the PMEM of Gcn4p–single nucleotide interactions and by comparison of the statistical quality of regression models that include and exclude nucleotide-interactive effects; (ii) Direct nonbonded interactions are not the dominant factor in determining Gcn4p–DNA binding specificity, whereas indirect conformational effects upon nucleotide mutation, as well as the associated subtle changes in interfacial atomic contacts, are primarily responsible for sequence-specific recognition; (iii) The intrinsic synergistic effects among the sequence positions of the 11-mer oligonucleotides play a decisive role in conferring affinity and specificity on Gcn4p–DNA binding; (iv) The linear PLS models are capable of capturing the internal dependence involved in the Gcn4p–DNA system, although the models ignore some nonlinear factors and consequently underestimate high-affinity samples and overestimate low-affinity samples in a systematic manner; and (v) The stability and predictive power of PLS regression models improve substantially when they undergo variable selection; GA appears to be a good choice for implementing the variable selection.


This work was supported by the National High Technology Research and Development Program of China (863 Program) (No. 2011AA020103).

Conflict of Interest

The authors confirm that this article content has no conflicts of interest.