• archaea;
• ChIP-chip;
• non-coding RNA;
• tiling array;
• transcription

Although prokaryotic transcription mechanisms are known to be complex, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of ∼64% of all genes, including putative non-coding RNAs, in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein–DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3′ ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate with binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes, events that are usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.


Evidence is mounting that the standard model of transcription factor (TF) binding to intergenic regions is not always the rule. Although there is isolated prior evidence for functional consequences of TF binding inside coding sequences, this issue had not been systematically evaluated genome wide. We have conducted a study to investigate the genome-wide consequences of internal TF binding for nearly 10% of all TFs in an archaeal extremophile, Halobacterium salinarum NRC-1. We show that a significant number of TF-binding sites (TFBSs) inside coding sequences are functional and have marked consequences, such as conditionally modulating the architecture of at least 43% of all operons in this organism. We present the integrated analysis of complementary systems-wide data on TFBS locations and dynamic modulation of transcriptome structure that led to this striking discovery.

Using ChIP–chip and the MeDiChI algorithm (Reiss et al, 2008), we precisely located TFBSs and determined their corresponding local false discovery rates (LFDRs) from new and previously reported genome-wide ChIP–chip measurements for 11 TFs: all TFBs (TFBa, TFBb, TFBc, TFBd, TFBe, TFBf and TFBg), one TBP (TBPb) and three transcriptional regulators (TRs) (Trh3, Trh4, VNG1451C) in H. salinarum NRC-1. Our conclusion from this analysis was that as many as 10% of all multi-TFBS loci were within coding regions.
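The bookkeeping behind a statement such as "a given fraction of binding sites fall within coding regions" can be illustrated with a minimal sketch. It is not the MeDiChI procedure itself: it assumes that binding-site positions and their LFDRs have already been estimated, and it only asks which sites fall inside annotated coding intervals. The function names, the LFDR cutoff of 0.1 and the toy coordinates are hypothetical, and strand and genome circularity are ignored.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates, start <= end
Site = Tuple[int, float]    # (binding-site position, local false discovery rate)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent coding intervals into a sorted list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fraction_inside_coding(sites: List[Site],
                           coding: List[Interval],
                           max_lfdr: float = 0.1) -> float:
    """Fraction of confident sites (LFDR <= max_lfdr) lying inside coding intervals.

    The cutoff is an illustrative assumption, not a value taken from the study.
    """
    merged = merge_intervals(coding)
    starts = [start for start, _ in merged]
    kept = [pos for pos, lfdr in sites if lfdr <= max_lfdr]
    if not kept:
        return 0.0
    inside = 0
    for pos in kept:
        # Index of the last merged interval that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= merged[i][1]:
            inside += 1
    return inside / len(kept)


# Toy data (hypothetical coordinates, not from H. salinarum NRC-1).
toy_coding = [(100, 200), (150, 300), (500, 600)]
toy_sites = [(120, 0.01), (350, 0.02), (550, 0.05), (250, 0.5)]
toy_fraction = fraction_inside_coding(toy_sites, toy_coding)
```

In the toy data, the site at position 250 is discarded by the LFDR filter, and two of the three remaining sites lie inside merged coding intervals.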

To show that these TFBSs have significant functional consequences on transcriptional regulation and cellular physiology, we used high-density genome tiling arrays to analyze the transcriptome structure (TS) of H. salinarum NRC-1 at different phases of growth in a batch culture, which is associated with differential regulation of over 65% of all genes. Through this analysis we assigned transcription start sites (TSSs) to 64% of all annotated genes and transcription termination sites (TTSs) to 46% of the genes, verified the expression of 203 operons and discovered 5′ and 3′ UTRs for ∼65% of all genes and operons. Further, by correlating the transcribed units with chromosomal coordinates of predicted genes (Ng et al, 2000) and experimentally mapped peptides from large-scale proteomics studies (Van et al, 2008), we revised the translation start site for 61 genes, detected 10 new protein-coding genes, and discovered 61 new putative ncRNAs. Although the physiological roles and mechanisms of action of specific ncRNAs remain to be uncovered, the bimodal distribution of correlations between the expression of ncRNAs and that of their antisense strands is consistent with the characterized roles of ncRNAs in the regulation of their cognate antisense transcripts. Finally, this analysis also showed a large mRNA population that has variable 3′-end locations and transcripts with extensive overlaps in their 3′ termini.
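The ncRNA–antisense comparison described above amounts to computing, for each ncRNA, the correlation between its expression profile and that of the transcript on the opposite strand, and then inspecting the distribution of those values. A minimal sketch follows, assuming expression profiles are available as equal-length lists of per-condition values. Pearson correlation is used here as an assumption (the synopsis does not name the statistic), and all identifiers and values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def antisense_correlations(
    expression: Dict[str, List[float]],
    pairs: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Correlation of each ncRNA with its antisense transcript across conditions."""
    return {
        (ncrna, antisense): pearson(expression[ncrna], expression[antisense])
        for ncrna, antisense in pairs
    }


# Toy profiles over four growth-phase samples (hypothetical values).
toy_expression = {
    "ncRNA_1": [1.0, 2.0, 3.0, 4.0],
    "gene_A": [8.0, 6.0, 4.0, 2.0],   # falls as ncRNA_1 rises
    "ncRNA_2": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [2.0, 4.0, 6.0, 8.0],   # rises with ncRNA_2
}
toy_pairs = [("ncRNA_1", "gene_A"), ("ncRNA_2", "gene_B")]
toy_correlations = antisense_correlations(toy_expression, toy_pairs)
```

A bimodal outcome would appear as correlations clustering near both the negative and the positive extreme, as the two toy pairs do, rather than around zero.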

By integrating TFBS locations with the TS, we identified internal binding sites that are functional in the conditional modulation of operon organization. We assessed the global prevalence of such operons by devising a quantitative measure for classifying operons as conditional. Specifically, by integrating probe intensities of transcripts hybridized to the genome tiling array with gene-expression correlations derived from expression analysis of H. salinarum NRC-1 in 719 microarray experiments, we found that 43% of all operons are conditionally modulated. Remarkably, there was a strong functional link between transcription-factor binding inside operons and their classification as ‘conditional’ (P < 10⁻⁹). We transcriptionally fused two of these conditionally activated promoters inside coding sequences to a reporter gene encoding a fast-degrading GFP variant optimized for the high-salt cytoplasm of halophilic archaea. FACS analysis of cells harboring these internal promoter–reporter transcriptional fusions provided in vivo validation of growth-phase regulated transcription initiation inside coding sequences.
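An association of this kind, between internal TF binding and classification as conditional, is commonly assessed with a one-sided hypergeometric (Fisher-type) enrichment test. The sketch below shows that calculation under stated assumptions: the synopsis does not specify which test produced the reported P-value, the operon classification is taken as given, and the counts in the example are hypothetical rather than taken from the study.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) for a hypergeometric variable X.

    population: total number of operons
    successes:  operons with at least one internal TF-binding site
    draws:      operons classified as conditional
    k:          operons that are both conditional and internally bound
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("counts must satisfy 0 <= successes, draws <= population")
    total = comb(population, draws)
    upper = min(successes, draws)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so terms that are impossible contribute nothing to the sum.
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


# Hypothetical counts for illustration only.
p_toy = hypergeom_upper_tail(k=5, population=10, successes=5, draws=5)
```

With these toy counts, only one of the 252 possible ways of choosing five operons out of ten selects exactly the five bound operons, so the tail probability is 1/252. Exact integer arithmetic via `math.comb` avoids underflow in the individual terms, which matters when the P-value of interest is very small.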

Although earlier studies have discovered internal promoters within a single gene or operon (Tsui et al, 1994; Guillot and Moran, 2007), we have significantly extended these findings to a genome-wide scale to show that biologically meaningful promoters do exist inside coding sequences at a frequency that is much higher than was previously appreciated. Further, this discovery also shows how a simple prokaryote can use the same set of genes in different combinations to elicit complex responses according to an environmental challenge.

Irrespective of the specific underlying mechanisms, our observations of widespread modulation of operon architecture and of transcription initiation and termination inside genes all constitute evidence that archaea can intersperse regulatory logic within their coding sequences and thus blur the boundaries between coding and non-coding elements. We have shown that it is possible to use new high-throughput technologies to find these biologically important instances where transcriptional regulation does occur within coding sequences and, furthermore, that it is possible to globally characterize specific regulatory mechanisms responsible for these phenomena. Combined with new high-throughput sequencing technologies, our results will expand the view of genetic-information processing that can be investigated at high resolution (Nagalakshmi et al, 2008; Wilhelm et al, 2008). These data will enable construction of mechanistically accurate models for reliable systems re-engineering of biological circuits. Moreover, these findings suggest that incorporating mechanistic accuracy into gene regulatory network (GRN) models would require operons, promoters, and terminators to be treated as dynamic entities.