Characterization of promoters in archaeal genomes based on DNA structural parameters

Abstract The transcription machinery of archaea can be roughly described as a simplified version of that of eukaryotic organisms. The basal transcription factor machinery binds to the TATA box found around 28 nucleotides upstream of the transcription start site; however, some transcription units lack a clear TATA box and still show TBP/TFB binding. This apparent absence of conserved sequences could be a consequence of sequence divergence associated with the upstream region, operon, and gene organization. Furthermore, earlier studies have found that a structural analysis yields more information than a simple sequence inspection. In this work, we evaluated 3630 archaeal promoter sequences from three organisms, Haloferax volcanii, Thermococcus kodakarensis, and Sulfolobus solfataricus, and encoded them as DNA duplex stability, enthalpy, curvature, and bendability parameters. We also split our dataset into conserved TATA and degenerated TATA promoters to identify differences between these two classes of promoters. The structural analysis reveals variations in archaeal promoter architecture; that is, a distinctive signal is observed at the TFB, TBP, and TFE binding sites regardless of whether the promoters are TATA‐conserved or TATA‐degenerated. In addition, the promoter-finding method was validated with upstream regions of 13 other archaea, suggesting that there might be promoter sequences among them. Therefore, we suggest a novel method for locating promoters within the genomes of archaea based on DNA energetic/structural features.

diverse functional and evolutionary mechanisms, such as membrane origin, operon organization, and the proteins devoted to regulating gene expression. Nevertheless, there is a lack of well-annotated archaeal genomic data (Zuo et al., 2015), which leaves ample room for genomic annotation work such as the validation of regulatory sequences.
The transcription of DNA into RNA and its regulation are central processes in the genetic information flux. Research accumulated in the last few years has evidenced that transcription in archaeal organisms can be roughly described as a simplified version of its eukaryotic relatives (Gehring et al., 2016). The initiation process begins with the binding of a TATA-binding protein (TBP) and a transcription factor B (TFB) to a specific DNA segment, defined as a promoter, allowing the recruitment of the RNA polymerase (RNAP) enzyme.
Additionally, the initiation might be optimized by the presence of a transcription factor E (TFE) protein (Ao et al., 2013). Subsequently, an open complex is assembled, followed by the elongation process whereby the RNAP carries out the synthesis of a messenger RNA molecule (mRNA) (Smollet et al., 2017; Soppa, 1999). In general, three main conserved DNA elements devoted to the transcription process have been identified as common to all archaeal groups: (i) an initiator element (INR) around the transcription start site (TSS); (ii) the TATA box element, centered around −26/−27 relative to the TSS; and (iii) an element upstream of the TATA box comprising two adenines at −34 and −33, which is designated the "transcription factor B recognition element" (BRE). These elements, INR, TATA box, and BRE, are crucial to initiating transcription in archaeal genes. They also present a close homology to the eukaryotic transcriptional machinery (Gehring et al., 2016; Soppa, 1999).
An in-depth analysis of archaeal promoter elements will improve the understanding of gene function. For example, advances in biotechnology have employed promoter identification tools to enhance gene regulation and optimize biological processes. A broader understanding of promoter activity could, in theory, enable full control over when the expression of specific genes starts and stops (Kernan et al., 2017). Increased production in biosynthetic processes is related to the control of regulatory pathways (Ren et al., 2020). Clinical biology has likewise benefited from promoter identification, because the increased mutation rate found in regulatory regions may lead to antibiotic resistance.
Evolutionary biology has also applied promoter identification as part of the process to better understand horizontal gene transfer between species of the three domains of life (Khademi et al., 2019).
Bioinformatics tools employ physical properties of the genetic material and relate these to variation in gene expression, enabling the distinction of specific regions such as promoters. The study of DNA structural features may provide more information about promoter activity than a primary sequence analysis (Bansal et al., 2014; de Avila e Silva et al., 2011; Kanhere & Bansal, 2005; Yella & Bansal, 2017). Indeed, comparative analysis of bacterial and eukaryotic promoters has shown that Pribnow and TATA boxes, respectively, differ at the structure and sequence level from other random locations within and around the promoter (de Avila e Silva et al., 2011). When converted into numeric attributes, genetic information provides enough sensitivity to capture even the smallest alterations among the nucleic acids (Benham, 1996). Hence, we consider that the nucleotide conservation found in archaeal promoters (Gribaldo & Brochier-Armanet, 2006; Londei, 2005) will translate into a consistent structural parametrization, enabling the characterization of archaeal promoters. In this work, we selected four structural parameters, namely, DNA duplex stability, enthalpy, curvature, and bendability, which are fundamental to understanding the molecular recognition that happens at a structural level (Ryasik et al., 2018).

| Archaea promoter sequences
To determine the nucleotide composition, a total of 3630 promoter sequences of three archaeal organisms were evaluated, which are divided into 1340 sequences of Haloferax volcanii (Babski et al., 2016), 1248 of Thermococcus kodakarensis (Jäger et al., 2014), and 1042 of Sulfolobus solfataricus (Wurtzel et al., 2009). These particular archaea were selected because they are model organisms and well-studied members of Halobacteriales, Thermococcales, and Sulfolobales, respectively. They also have available transcriptome data (RNAseq), enabling the possibility of retrieving promoter sequences from their published information. Internal and antisense promoters from the transcriptome dataset were not included due to the limitation of data.
The original data cover 1000-nucleotide sequences, spanning from −500 to +500, which contain experimentally identified promoters with their transcription start site (TSS). Only primary TSSs (pTSS) were considered, a category that accounts for the abundant transcripts in the original dataset. A shorter sequence was selected, from 80 nucleotides upstream to 20 nucleotides downstream of the TSS, that is, the core promoter. This shorter region was chosen because it contains the core promoter elements (Aptekman & Nadra, 2018; Haberle & Stark, 2018; Kadonaga, 2012). Accordingly, the core promoter has been described as sufficient to drive archaeal and eukaryotic transcription (Bartlett et al., 2000; Haberle & Stark, 2018; Zuo et al., 2015). Indeed, promoters from halophilic archaea were reported to be located in the range proposed here; their TATA boxes were found at a median distance of 31 base pairs (bp) upstream of the TSS (Babski et al., 2016). Additionally, 96% of the pTSS TATA boxes from T. kodakarensis are located in a median distance

| Conversion in structural parameters
To convert the DNA sequences into structural parameters, four DNA structural features were selected, namely DNA duplex stability (DDS), enthalpy contribution, bendability, and intrinsic curvature. These features are biologically relevant for characterizing promoter regions since they convert DNA information into numeric attributes (Benham, 1996). These four parameters have been used previously and capture specific signals that are not evident at the sequence level (Bansal et al., 2014; de Avila e Silva et al., 2011; Kanhere & Bansal, 2005; SantaLucia & Hicks, 2004; Yella & Bansal, 2017). The selected features can be described as follows:
• The DDS of double-stranded DNA is calculated as the sum of its base-pair free energy. It considers the free-energy values associated with the 16 possible combinations of dinucleotides (Kanhere & Bansal, 2005).
• Enthalpy parameters refer to thermodynamic processes that occur at a cellular level (e.g., chemical bonds, mass transport inside and outside the cell, and heat spawning) that affect the thermostability of the cell (Privalov & Crane-Robinson, 2018). These numeric parameters have been taken from DNA melting studies (SantaLucia & Hicks, 2004).
• DNA bendability is a sequence-dependent measurement that reflects how readily the DNA bends under the effect specific proteins have on the molecule's helical structure. In this way, DNA bending facilitates the assembly of transcription complexes (Leonard et al., 1997). The bend angle of TATA sequences is wider than that of GC-rich sequences; for instance, TA dinucleotides bend the DNA by 6.74°, the largest angle among the 16 dinucleotide combinations (Karas et al., 1996).
• Finally, intrinsic curvature reflects the capacity of DNA to form small circles around its helical axis (Bolshoy et al., 1991). To this end, we used a model based on DNA gel retardation values (BMHT) for its sensitivity toward AT-rich sequences (Bolshoy et al., 1991; Kanhere & Bansal, 2003). The BMHT calculation estimated 16 roll and tilt wedge angles based on independent gel mobility experiments performed on a training set of 54 different sequences (Bolshoy et al., 1991).
All four selected features are sequence-dependent, and their combination yields more information about a sequence (Ryasik et al., 2018). The complete set of promoter sequences was converted into structural parameters through a self-developed Python script (Supplementary Script S1: https://doi.org/10.5281/zenodo.5137597) that adopts the numeric parameters available in Table 1. The structural properties were computed in a one-nucleotide sliding window. All promoters were aligned relative to their TSS, and numerical values were averaged to obtain a value for each position.
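As a minimal sketch of the conversion and averaging described above (not the authors' Supplementary Script S1), the Python snippet below maps each dinucleotide step of a sequence to a numeric value, sliding one nucleotide at a time, and then averages TSS-aligned profiles position by position. The function names and the toy lookup table are assumptions made for illustration; the placeholder scores simply count G/C bases per dinucleotide and are not the Table 1 parameters.

```python
from itertools import product


def toy_dinucleotide_table():
    """Placeholder scores (number of G/C bases per dinucleotide).

    Hypothetical stand-in for a real parameter table such as Table 1.
    """
    return {a + b: float((a in "GC") + (b in "GC"))
            for a, b in product("ACGT", repeat=2)}


def structural_profile(seq, table):
    """Convert a DNA sequence into one value per dinucleotide step."""
    seq = seq.upper()
    # Slide one nucleotide at a time; a sequence of length n yields n - 1 values.
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def average_profiles(seqs, table):
    """Average the profiles of equally long, TSS-aligned sequences per position."""
    profiles = [structural_profile(s, table) for s in seqs]
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all sequences must have the same length")
    return [sum(column) / len(profiles) for column in zip(*profiles)]
```

Swapping the toy table for a dictionary of published dinucleotide parameters would give one averaged profile per feature; characters outside A, C, G, and T raise a `KeyError` in this sketch.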

| Classification of conserved TATA and degenerated TATA sequences
To classify the core promoters as conserved or degenerated TATA, the MEME Suite, a motif-based sequence analysis tool (Bailey et al., 2009), was employed. All the sequences were scanned with MEME, and the motifs it identified were extracted. A key motif for this research would be located at −27/−28, so the search was directed to this specific region to capture the TATA boxes. The following MEME parameters were used for H. volcanii and T. kodakarensis: i) a sequence length of 100 nucleotides, covering the −80 to +20 region, where the core promoter is located (Haberle & Stark, 2018; Kadonaga, 2012); ii) a 0-order background model generated from the supplied sequences; iii) zero or one occurrence (of a contributing motif site) per sequence; iv) 8 motifs were located; v) the width of the motifs varied between six and eight nucleotides (Hausner et al., 1991). The motif discovery had to follow different parameters in S. solfataricus, in which the width of the motifs was widened to range from six to fifteen nucleotides to capture the TATA boxes adequately. Hence, TATA boxes and BRE elements were considered. The combination of these two consensuses was described as a critical feature in Sulfolobaceae family transcription (Le et al., 2017).
TABLE 1 Enthalpy, stability, and bendability parameters for every possible dinucleotide combination
Afterward, the dataset was classified through a self-developed Python script (Supplementary Script S3: https://doi.org/10.5281/zenodo.5137597), dividing it into two groups: conserved TATA, comprising the sequences with a motif identified by MEME, and degenerated TATA, comprising the sequences in which the previously identified motif was not present.
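The two-group split can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Supplementary Script S3: it assumes the MEME sites have already been exported as a dictionary from sequence identifier to 0-based motif start indices (a hypothetical format), that the core promoter spans −80 to +20 with the TSS at +1 and no position 0, and that a site counts when it starts inside an illustrative window of −35 to −20.

```python
def position_to_index(position, upstream=80):
    """Map a TSS-relative position to a 0-based index in the core promoter.

    Assumes the TSS is +1 and there is no position 0, so -80 -> 0 and +20 -> 99.
    """
    if position == 0:
        raise ValueError("there is no position 0; the TSS is +1")
    return upstream + position if position < 0 else upstream + position - 1


def classify_promoters(sequence_ids, motif_sites, window=(-35, -20)):
    """Split promoters into conserved and degenerated TATA groups.

    motif_sites: dict of sequence id -> list of 0-based motif start indices
    (hypothetical export of motif sites). The window bounds are illustrative.
    """
    low, high = (position_to_index(p) for p in window)
    conserved, degenerated = [], []
    for seq_id in sequence_ids:
        starts = motif_sites.get(seq_id, [])
        if any(low <= start <= high for start in starts):
            conserved.append(seq_id)
        else:
            degenerated.append(seq_id)
    return conserved, degenerated
```

For example, a site starting at index 52 (position −28) places its sequence in the conserved group, while a sequence with no site, or with a site far from the expected region, falls into the degenerated group.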

| Statistical tests
Statistical tests were conducted to differentiate the two groups this study hinged on. First, the dataset was found not to be normally distributed; nonparametric tests were therefore performed.
Nucleotide information (sequence logo profiles) is overlaid with signals that represent the core promoter content
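The Figure 7 caption names the nonparametric Kruskal-Wallis test. As a hedged illustration of what that test computes, the pure-Python sketch below ranks the pooled observations (ties receive their average rank) and returns the H statistic. The function names are assumptions, the tie correction factor is omitted for brevity, and no p-value is derived; a statistics library would normally be used for a full analysis.

```python
def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic for two or more groups (no tie correction)."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
```

Larger values of H indicate that the rank distributions of the groups differ more strongly; identical groups give a value of zero.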

| Structural profiles of archaeal promoter sequences vary when transcription factors binding sites
The entire promoter dataset was converted into enthalpy, DNA duplex stability (DDS), bendability, and intrinsic curvature to capture specific signals in a wider genomic context, ranging from −500 to +500. In addition, control sequences were added to highlight the strong signals that promoter sequences have (Figure 2). A zoomed version, encompassing the promoter region only, is shown in Figure 3, where there is a conserved region around the binding sites of the transcription factor proteins: TBP (TATA box, around −28), TFB (BRE, around 2 nucleotides upstream of the TBP site), and TFE, whose binding sites are located at position −10 (PPE, proximal promoter element) and +1, matching the INR (initiator element).

| Definition of a promoter-like profile
Following the profiles shown in Figure 3, a promoter-like profile was built from the average per position (100 nucleotides) of each feature in the validated promoter dataset. Figure 4 was created by combining nucleotide information (sequence logo profiles) with the structural parametrization obtained in this work. In this figure, the strong DDS, enthalpy, bendability, and BMHT curvature signals are overlaid with transcription factor binding sites.

| Validation of the results with 13 other archaea
Upstream regions of thirteen other archaea, comprising four TACK archaea and nine Euryarchaea, were included to test the validity of the findings. Figure 5 shows the genomic information of each archaeon plotted against DNA bendability, BMHT curvature, enthalpy, and DDS.
In all cases, a strong signal was located near the end of the upstream regions.
Moreover, the promoter-like profile established in 3.4 was compared with the upstream regions of the 13 other archaea, split by phylogenetic family (Figure 6). Since the profiles observed in Figure 6 are the same when another physical feature is tested, comparisons following DDS, enthalpy, and bendability are included in Figures A2, A3, and A4, respectively. Analysis of variance tests indicated that each organism is significantly different from the others, with p < 2e-16 in TACK archaea and p < 2e-16 in Euryarchaea.
The statistical analysis of the two archaeal families is visualized in the boxplots available in Figure 7.
FIGURE 5 Structural/energetic upstream profiles in thirteen archaea. Thirteen other archaea were selected from 42 to validate the promoter-like behavior observed. These organisms have 400 nucleotide-long sequences corresponding to upstream sequences where no annotation toward promoter finding was done. The blue lines represent bendability profiles, the purple enthalpy, the green refers to DDS, and the red is BMHT curvature

| Conserved and degenerated TATA groups
The core promoters belonging to conserved and degenerated TATA groups were converted into energetic and structural properties to indicate RNAP action in both groups (Figure 8). The two groups presented overlapping lines with strong signals being located around −28.

| Nucleotide content
The results of this study suggest that TATA boxes vary slightly between organisms, supporting the archaeal diversification reported by DeLong et al. (1994). Additionally, the AT content differed in each archaeon.
When the archaeal promoters were evaluated as having either a conserved or a degenerated TATA consensus, the GC% of each organism explained the conservation found in the TATA boxes: the organism with the highest genome GC% presented the fewest TATA boxes, which is expected. However, this in silico approach detected the binding of TBP, TFB, and TFE to a TATA+BRE motif and of TFE to PPE/INR beyond what a primary sequence inspection reveals, indicating that sequence conservation around these motifs is not mandatory. Moreover, promoter activity is still observed when promoters lack a clear TATA motif (Aptekman & Nadra, 2018). The uneven number of conserved TATA sequences across archaea can therefore be attributed to biological variability. The

| Energetic and structural parameters define promoter-like profiles
Promoter sequences might be defined by a set of strong signals around their transcription factor binding sites (TFBS), that is, those of TFB, TBP, and TFE. In this study, the conversion of genetic information into physical attributes revealed distinctive signals around the TFBS of these proteins, whereas shuffled sequences showed none. These , which matches the TFE binding site, also contributes to promoter definition (Ao et al., 2013). This TF protein was reported to optimize the formation of the PIC in TACK and other families as well (Hanzelka et al., 2001).
The signal located in the −10 region of the three archaea is also an important factor in bacterial transcription (Lloréns-Rico et al., 2015).
Bacteria and archaea share the same last universal common ancestor and, consequently, share similarities despite their evolution taking place in different branches of the tree of life (Gribaldo & Brochier-Armanet, 2006).

FIGURE 7
Boxplots of promoters and upstream regions of thirteen other archaea converted to bendability. The boxplots represent statistical comparisons between the promoter-like profile (red), formed upon experimental data of H. volcanii, S. solfataricus, and T. kodakarensis. The p < 2e-16 values obtained by the nonparametric Kruskal-Wallis test conveyed statistical significance in the averages of both groups. Additional analyses encompassing BMHT curvature, enthalpy, and DDS are found in Figures A5, A6, and A7, respectively
The lack of annotation in the genomes of many archaea creates an opportunity for such methods. When the promoter identification method was tested on upstream regions of thirteen archaea, the same rationale was inferred. Mining published transcript information enabled the definition of a promoter-like profile through a combination of strong signals at the binding sites of TBP, TFB, and TFE (−27, −31, −10, and +1, respectively). When data not restricted to experimentally validated promoter sequences were assessed, strong signals were observed near the end of the sequences, suggesting that there might be promoter elements in these intergenic areas, as identified by .
The observation of Figure

| Promoter signal beyond TATA boxes
TATA boxes are likely the most conserved sites that distinguish archaeal and eukaryotic promoters. The initiation of transcription in archaea has been reported to start with the TBP and TFB proteins attaching to the promoter (Gehring et al., 2016; Blombach & Grohmann, 2017) and to be enhanced by the presence of TFE (Hanzelka et al., 2001); this binding is assisted by the conservation found around the binding sites of these proteins. Promoters have been grouped according to their TATA content by Tirosh et al. (2007) and Yella and Bansal (2017), who both performed structural conversions as this study did. Divergent results could be observed, in that TATA-conserved sequences did not show significant differences when compared with TATA-degenerated ones.

FIGURE 8
Structural/energetic profiles of conserved and degenerated TATA promoters. The conserved and degenerated TATA core promoter profiles are plotted. The lines represent the average value each group and organism showed. The navy-blue lines represent sequences that had a MEME-identified TATA motif, the light blue depicts sequences in which the specific TATA motif was not found
In this study, both TATA-conserved and TATA-degenerated groups showed the same strong signals around the binding sites of TFB, TBP, and TFE. Some differences may reflect mathematical variance, for example, at the TFB and TBP binding sites in the curvature profiles of the three archaea and in the bendability and DDS profiles of H. volcanii. This shared signal characterizes a promoter (whether TATA-conserved or not) as a promoter-like sequence, which offers a novel approach to identifying new promoter sequences in archaea.

| CONCLUSIONS
The results of this study support encoding DNA into energetic/structural attributes, which reveal transcription factor binding sites where a primary sequence inspection fails. Hence, this study proposes a novel method for annotating archaeal promoters in genomes.

ACKNOWLEDGMENTS
We are grateful to the support received from Universidade de Caxias  -209620).

CONFLICT OF INTERESTS
None declared.

ETHICS STATEMENT
None required.