Using a whole genome co- expression network to inform the functional characterisation of predicted genomic elements from Mycobacterium tuberculosis transcriptomic data

A whole genome co- expression network was created using Mycobacterium tuberculosis transcriptomic data from publicly available RNA- sequencing experiments covering a wide variety of experimental conditions. The network includes expressed regions with no formal annotation, including putative short RNAs and untranslated regions of expressed transcripts, along with the protein- coding genes. These unannotated expressed transcripts were among the best- connected members of the module sub- networks, making up more than half of the ‘hub’ elements in modules that include protein- coding genes known to be part of regulatory systems involved in stress response and host adaptation. This data set provides a valuable resource for investigating the role of non- coding RNA, and conserved hypothetical proteins, in transcriptomic remodelling. Based on their connections to genes with known functional groupings and correlations with replicated host conditions, predicted expressed transcripts can be screened as suitable candidates for further experimental validation.


| INTRODUC TI ON
Tuberculosis continues to be a leading cause of death worldwide, causing over 1.5 million deaths, and infecting over 10 million people in 2020 (World Health Organization, 2021). The human-adapted pathogen causing tuberculosis, Mycobacterium tuberculosis (Mtb), has a complex lifestyle that requires rapid adaptation to host defences and immune pressure, including nutritional immunity, hypoxia and lipid-rich environments. In order to eradicate the disease, it is crucial to understand how the pathogen survives attacks from host immune cells and persists in an extended latent state inside the host.
To adapt to these environmental challenges, bacterial cells must make complex transcriptomic adjustments, and these are thought to be complemented and fine-tuned by post-transcriptional regulation.
The mycobacterial genome produces a range of conditionally expressed transcripts, including non-coding RNA, short, unannotated ORFs and untranslated regions at the 5′ and 3′ end of protein-coding sequences, many of which are poorly annotated and understood. In this paper, we extend our focus to include 'non-coding' RNA (ncRNA), here referring to non-ribosomal RNA transcripts not known to be translated into peptides, such as short RNAs (sRNAs) acting on either distant or antisense mRNA targets and the expressed untranslated regions (UTRs) flanking coding regions (which may also contain short open reading frames (sORFs) upstream from coding regions). Noncoding RNA can alter the abundance of RNA and proteins by controlling mRNA stability, processing and access to ribosome binding sites. Discovering the contribution of the non-coding genome to specific adaptation-response pathways may improve our ability to design therapeutics and prevent the evolution of persistent phenotypes.

| Uncovering the role of non-coding RNA in adaptation and transcriptomic remodelling
The proportion of non-ribosomal, ncRNA in the Mtb transcriptome has been shown to increase in stationary and hypoxic conditions, indicating a potential role in adjusting to environmental cues (Aguilar-Ayala et al., 2017;Arnvig et al., 2011;Gerrick et al., 2018;Ignatov et al., 2015). Several mycobacterial ncRNA transcripts (particularly, sRNA) have been extensively studied and found to be associated with regulatory systems controlling adaptation to stress conditions or growth phase, linked to virulence pathways and access to lipid media (Arnvig et al., 2011;Gerrick et al., 2018;Girardin & McDonough, 2020;Mai et al., 2019;Moores et al., 2017;Solans et al., 2014). Non-coding regulation in Mtb appears to function quite differently compared to model organisms, eschewing the use of any known chaperone proteins for RNA-RNA interactions and with few sRNA homologues found outside the phyla (Gerrick et al., 2018;Mai et al., 2019;Schwenk & Arnvig, 2018). The discovery of ncRNA in Mtb has progressed using both molecular biology methods and high-throughput sequence-based approaches (reviewed in Schwenk & Arnvig, 2018) (Stiens et al., 2022).
Using RNA-sequencing (RNA-seq) data to predict ncRNA in the compact Mtb genome is challenging. Paradoxically, more sensitive, high-depth sequencing can make it more difficult to identify the small, low-abundance, functional transcripts above stochastic gene expression and technical noise. Parameters of detection must, therefore, be carefully considered for each data set to account for variation in expression levels. Though RNA-seq-based ncRNA prediction algorithms are often assumed to overpredict putative ncRNAs, especially at the 5′ and 3′ ends of coding genes, there are biological and technical reasons for detecting abundant signal in the unannotated regions of the genome.
Ribosome profiling (Ribo-seq) methods that sequence the ribosomeprotected fragments of mRNA have identified actively translated RNA in the 5′ UTRs of annotated protein-coding mRNA transcripts (Canestrari et al., 2020;D'Halluin et al., 2022;Sawyer et al., 2021;Shell et al., 2015;Smith et al., 2022). These unannotated sORFs may represent functional peptides or function to regulate the translation of the downstream transcript; however, it is impossible to tell the difference between a putative ncRNA and a sORF from RNA-seq signal alone. Additionally, the 3′ ends in mycobacterial RNA-seq data often lack clear signal termination (Dar et al., 2016;D'Halluin et al., 2022;Lejars et al., 2019) and processing of transcripts at the 3′ end may be the norm (Wang et al., 2019). Finally, polycistronic transcripts often include non-coding sequence between the genes of an operon, and this may contain functional elements and/ or processing sites (Martini et al., 2019).
The location of a transcription start site (TSS) in the 5′ end of a predicted transcript supports the biological relevance of a predicted ncRNA. However, the available lists of Mtb TSS sites (Cortes et al., 2013;Shell et al., 2015) have so far been mapped only in starvation and exponential growth and may not include TSSs that are expressed under different experimental conditions. New TSS maps, published subsequent to this analysis may increase the number of predicted transcripts with a TSS (D'Halluin et al., 2022). Furthermore, functional ncRNA elements generated from the 3′ UTRs of coding genes through RNase processing would presumably lack a TSS. 3′ UTRs that are functionally independent from their cognate coding sequence (CDS) have been identified in other bacteria (Desgranges et al., 2021;Menendez-Gil et al., 2020;Ponath et al., 2022). Therefore, it is important to consider predicted UTRs as separate annotated elements from protein-coding transcripts when quantifying differential expression.
To include a complete picture of the interaction of the noncoding genome with coding genes involved in adaptation pathways, we have generated a novel set of ncRNA sequence-based predictions (sRNAs and UTRs) from publicly available data sets using our in-house software package, baerhunter (Ozuna et al., 2019). Some of these predicted non-coding transcripts overlap with those of previous studies, but many represent novel predictions. The expression of these transcripts is quantified along with the protein-coding genes and used in network analysis to provide a more complete picture of the functional groupings involved in adaptation to environmental changes. Including a variety of culture conditions that replicate aspects of the host environment improves the chances that the expression of any ncRNA that is restricted to one or more conditions is included in the network (Ami et al., 2020).

| Using WGCNA to implicate functional associations of non-coding RNA
Weighted gene co-expression network analysis (WGCNA) (Zhang & Horvath, 2005) has been widely used to identify functional groups of genes, called 'modules', through the application of hierarchical clustering to differential expression levels of RNA transcripts in microarray or RNA-seq experiments. Recent studies have focussed entirely on the protein-coding portion of the transcriptome, using WGCNA with RNA-seq to cluster the differentially expressed genes of Mycobacterium marinum in response to resuscitation after hypoxia (Jiang et al., 2020) and Mycobacterium aurum infected macrophages (Lu et al., 2021). Mtb microarray data have been used to cluster protein-coding genes that show differential expression among clinical isolates (Puniya et al., 2013) and in response to two different hypoxic models to identify potential transcription factors (Jiang et al., 2016). Another recent network analysis, using a matrix deconvolution method followed by module clustering, uses a large number of RNA-seq samples including deletion mutants, infection models and antibiotic-treated samples as well as restricted media and culture conditions (Yoo, et al., 2022). Here the authors identify 80 modules of protein-coding genes that each approximate an isolated source of variance, together estimated to account for 61% of the total variance seen in in the data set. This proportion is reportedly lower than results from similar analyses in other organisms, potentially due to the bias in the types of conditions available in the database and/or the complex nature of regulation in Mtb (Yoo, et al., 2022). However, the contribution of regulatory ncRNA elements may be a considerable unexplored source of variance in this complex system. Here we use an alternative, complementary approach by including ncRNA, as well as annotated protein-coding genes, in the modules.
In this study, WGCNA was applied to multiple Mtb H37Rv data sets covering 15 different culture conditions replicating various growth conditions, nutrient sources and stressors encountered in the host environment. We present a global view of the noncoding genome across an extensive WGCNA network and interrogate selected modules to identify functional groupings between protein-coding and non-coding transcripts, as well as between wellcharacterised genes and those with little functional annotation. The correlation of the modules with the various conditions can identify participants in large-scale transcriptomic remodelling programs in response to changes in environmental conditions.

| MATERIAL S AND ME THODS
The overall workflow for this analysis is presented in Figure 1. All scripts for baerhunter, WGCNA and subsequent analysis are available at: https://doi.org/10.5281/zenodo.7709329.

| Data acquisition and mapping
Data sets were downloaded from SRA (https://www.ncbi.nlm.nih. gov/sra/docs/) or Array Express (https://www.ebi.ac.uk/array expre ss/) using the accession numbers listed in Table 1. To minimise batch effects and ensure compatibility with RNA prediction software, we limited analysis to data sets with similar library strategies. Samples were included based on inspection to confirm that (1) samples were from monocultures of wild-type Mtb H37Rv strain and (2) sequencing was using a paired end, stranded protocol. Reads from samples that passed quality control thresholds were trimmed using Trimmomatic (Bolger et al., 2014) to remove adapters and low-quality bases from the 5′ and 3′ ends of the sequences. Trimmed reads were mapped to the H37Rv reference genome (GenBank AL123456.3) using BWA-mem in paired end mode (Li, Heng, 2013 flagging' functions were used for transcript quantification and to identify low expression hits (less than or equal to 10 transcripts per million) in each data set, which were subsequently eliminated. When viewed on a genome browser, coverage at the 3′ ends of putative sRNA and UTRs often appears to decrease gradually, with the actual end of the transcript appearing indistinct, compared to the 5′ end. Prokaryotic ncRNA transcripts may not demonstrate a clear fall-off of expression signal in RNA-seq due to incomplete RNAP processivity and pervasive transcription regulated by the changing levels of Rho protein observed in different conditions (Bidnenko & Bidnenko, 2018;Wade & Grainger, 2014). These very long predictions can mask predicted transcripts in the same region from other samples, obscuring potentially interesting shorter transcripts expressed in different conditions. For this reason, transcripts longer than 1000 nucleotides were eliminated before combining the predictions between data sets. The predicted annotations for each data set were combined into a single annotation file, adding the union of the predicted boundaries to the reference genome for H37Rv (AL123456.3). Predictions that overlapped with annotated ncR-NAs and UTR predictions that overlapped sRNA predictions from a different data set were eliminated. Transcript quantification was repeated on each data set using the resulting combined annotation file and the count data from each data set was merged into a single counts matrix.  figure S3). Samples from experiment PRJEB65014 continue to group together, but as they represent single replicates in unique conditions, it is difficult to estimate the influence of confounding batch effects for these samples. The normalised, batch-corrected data are accessible as an R data object at https://github.com/ jenja ne118/ mtb_wgcna/ tree/maste r/R_data and in a spreadsheet (Supporting Information Table S3).

| Creation of the WGCNA network
The normalised and batch-corrected expression matrix was used to create a signed co-expression network using the R package, WGCNA v1.69 (Langfelder & Horvath, 2008), with the following parameters: corType = 'pearson', networkType = 'signed', power = 12, TOMType = 'signed', minModuleSize = 25, reassignThreshold = 0, mergeCutHeight = 0.15, deepSplit = 2. In this type of network, the 'nodes' are the genes, and the 'edges', or links, are created when gene expression patterns correlate. In contrast to unweighted binary networks where links are assigned 0 or 1 to indicate whether or not the genes are linked, in a weighted network the links are given a numeric weight based on how closely correlated the expression is.
WGCNA first calculates the signed co-expression similarity for each gene pair. The absolute value of this correlation is raised to a power (determined by the user, based on a scale-free topology model that mimics biological systems) (Supporting Information Figure S4) in order to weight the strong connections more highly than the weaker connections. The resulting similarity matrix is used to cluster groups of genes with strong connections to each other in a non-supervised manner (i.e. it does not use any previous information about gene groups or connected regulons). A cluster dendrogram is created (Supporting Information Figure S8) and closely connected branches of the dendrogram are merged into modules based on a cut-off value (also a parameter controlled by the user). Pairwise correlations were calculated between all of the genes in each module, and between module 'hubs' and all of the other genes in the module, using the TA B L E 1 Data sets used in analysis. Accession numbers from SRA and array express. (ME), which explains most of the variance in the expression values in the module. The connectivity of the MEs define the shape of the overall network (Supporting Information Figure S9). The modules can then be tested for potential correlations with experimental conditions while reducing the degree of penalties for multiple testing.
In signed networks, correlation of the module with a condition can be in either the positive or negative direction, as modules include transcripts that are similar in both the degree and direction of correlation, allowing for a more fine-grained analysis than with unsigned networks (Supporting Information Figure S10).
To test correlations of modules with experimental conditions, the individual RNA-seq samples were assigned to a condition based on the experimental description in the project metadata. Some of these conditions were shared among the different projects, so when appropriate, samples from different data sets were assigned the same condition, resulting in 15 tested conditions. For example, late-stage reaeration samples were tested along with exponential growth samples, and samples that tested hypoxia and cholesterol utilisation together were included in multiple conditions. Models of hypoxia differed between the RNA-seq projects, and these samples were assigned to different conditions: 'hypoxia' versus 'extended hypoxia' (Supporting Information Table S1, 'Condition summary' tab).
Network correlations were made using robust biweight midcorrelation tests and all p-values were corrected for multiple testing with the Benjamini-Hochberg (BH) method (Benjamini & Hochberg, 1995).
Significance was evaluated as an adjusted p-value (p adj ) of <0.05.

| Module enrichment
Modules were interrogated for enrichment for Gene Ontology (GO) terms (Ashburner et al., 2000; The Gene Ontology Consortium, 2021), Clusters of Orthologous Groups (COG) (Galperin et al., 2021), KEGG pathway genes (Kanehisa et al., 2022), functional categories and literature searches for known regulons. GO terms, COG term and KEGG pathway enrichment were accessed programmatically using the DAVID web service (Huang et al., 2009b(Huang et al., , 2009aJiao et al., 2012) to query the list of protein-coding genes from each module for enrichment. Enrichment was determined using a modified one-sided Fisher's Exact Test ('EASE' score) with BH correction for multiple testing, with p adj < 0.01 considered significantly enriched for a particular term, pathway or COG term. Enrichment for the 11 functional categories from Mycobrowser annotation (Kapopoulou et al., 2011) was determined using a one-sided Fisher's Exact Test with BH correction for multiple testing. Modules were enriched for a particular functional category if p adj < 0.01. Lists of genes associated with known regulons were mined from literature and enrichment was tested using the same one-sided Fisher's Exact Test as above with a p adj < 0.01 cut-off for enrichment.

F I G U R E 2
Hierarchical dendrogram of rlog transformed and limma batch-corrected expression data by sample. The sample labels are coloured by data set, demonstrating that they are clustering by condition, rather than experiment.

| Mtb expresses an extensive range of ncRNA transcripts over a wide variety of experimental conditions
Mycobacterium tuberculosis RNA-seq data sets were selected from publicly available data to find experiments using the wild-type H37Rv strain and representing a range of growth conditions the pathogen may encounter in a host environment. Four data sets passing our quality standards were subjected to our analysis pipeline (see Section 2) and included 52 samples under 15 different experimental conditions (Supporting Information Table S1, 'Samples' tab). The R package, baerhunter (Ozuna et al., 2019), was used to predict ncRNA in intergenic regions, antisense RNA (opposite a protein-coding gene) and UTRs at both the 5′ and 3′ ends of genes by searching the mapped RNA-seq data for expression peaks outside of the annotated regions in the reference sequence for H37Rv.
Non-coding RNA predictions from each data set were filtered for low expression and combined to create a single set of nonoverlapping annotations that encompassed all predictions made from any sample under any experimental condition. In total, 1283 putative sRNAs were predicted (including both truly intergenic transcripts as well as those antisense to a protein-coding gene, or annotated RNA) and 1715 UTRs, which includes all transcribed regions outside of annotated protein-coding sequences at both 5′ and 3′ ends, as well as the non-coding regions between adjacent genes in operons. All putative ncRNA transcripts (sRNAs and UTRs) were searched for a TSS near the start of the predicted 5′ boundary using previously published annotations (Cortes et al., 2013;Shell et al., 2015). Annotated TSSs were found within 20 nucleotides of the 5′ end in 43% of the predicted sRNA transcripts. Predicted 5′ UTRs had a TSS within 10 nucleotides of the start in 42% of cases, compared with 3% of the predicted 3′ UTRs. Where the UTR covered the entire sequence between two protein-coding regions (labelled as 'between' UTRs), 9% had a TSS in the first 10 nucleotides of the sequence (Table 2 and Supporting Information Table S2 'pu-tative_sRNAs', 'putative_UTRs' tabs).
The predicted sRNAs were further annotated using the ac-

TA B L E 2
Tally of predicted expressed elements in the baerhunter-generated combined annotation file. 4018 proteincoding genes were included in the annotation. 'Between' UTRs cover the entire sequence between two protein-coding regions.  (Cortes et al., 2013;Shell et al., 2015).
3.2 | Module networks represent groups of coexpressed genes and predicted non-coding RNA

| Creation of the WGCNA network
A weighted co-expression network was created from the normalised RNA-seq expression data using WGCNA (Langfelder & Horvath, 2008)  The signed co-expression network presented in this paper con- and growth phase and this can be informative when considering the association of the gene groups with biological processes (Figure 3).

| Well-established regulons cluster together in single modules
In many cases, the gene membership of the modules includes wellestablished regulons or groups of functionally related genes, establishing the biological relevance of the module sub-networks and proof of concept for the application of WGCNA on such a heterogenous data set. For example, the DosR regulon is a well-studied regulon associated with hypoxia and stress responses (Du et al., 2016;Rustad et al., 2008;Voskuil et al., 2004 (Ignatov et al., 2015;Moores et al., 2017).
Unsurprisingly, the 'cyan' module is enriched for the GO term, 'response to hypoxia'; however, a statistically significant correlation was not seen with the hypoxia condition (though it is negatively correlated with the exponential growth condition, bicor = −0.35, p adj = 0.05) (Figure 3). The KstR regulon includes 74 genes under control of the TetR-type transcriptional repressor, KstR, known to be involved in lipid catabolism and upregulated during infection (Kendall et al., 2007(Kendall et al., , 2010Nesbitt et al., 2010). The 'royalblue' module is significantly enriched for known KstR-regulated genes (one-sided

| Antisense RNAs are hubs in modules independent of cognate ORF
It has been observed that the overall abundance of antisense RNA and other non-ribosomal RNA increases upon exposure to stress such as hypoxia and nutrient restriction (Arnvig et al., 2011;Ignatov et al., 2015), and in our network, ncRNA are well-connected in vari-

| Focus on selected module networks
The large-scale transcription analysis presented here is useful for the more global analysis of the overall trends related to ncRNA and transcription, but there is a great deal of information to be gleaned by more fine-grained inspection of individual module groupings. To discover novel associations in such a large and complex data set, we have selected a few modules for closer examination, focussing on those that contain gene groups or regulons related to the tested conditions. Many of the modules that contain interesting correlations or gene regulon enrichments also include an abundance of putative sRNAs and UTRs. Using the 'guilt by association' principle, we can hypothesise that the well-connected ncRNAs found among the module hub elements have a role in transcriptional 'remodelling' in response to changes in environmental conditions such as growth on cholesterol-containing media, restricted iron or hypoxia.

| Detoxification-linked proteins cluster in the module best correlated with cholesterol media condition
The  (Ramage et al., 2009;Sala et al., 2014).
VapBC12, specifically, has been shown to inhibit translation and promote persister phenotypes in response to cholesterol (Talwar et al., 2020). Other detoxification-linked genes in the module, such as the ABC-family transporter efflux system, Rv1216c-1219c, have also been implicated in transcriptomic remodelling in response to cholesterol (Aguilar-Ayala et al., 2017;Pawełczyk et al., 2021).
Together, they extend to overlap the antisense strand of a large portion of Rv1773c, a probable transcriptional regulator in the IclR family, found in a different module ('turquoise'). The 3′ UTR for Rv1772 has been previously identified as an abundant antisense transcript during exponential growth (Arnvig et al., 2011). The start of the predicted sRNA transcript has no known TSS and could instead be an extension of the predicted 3′ UTR (Supporting Information Figure S11). (When combining predicted annotations from different data sets, long predicted UTRs that overlapped shorter sRNA predictions were discarded, see Methods). In E.coli, the IclR family transcriptional regulators demonstrate both activating and repressing activities on targets such as multidrug efflux pumps and the aceBAK operon, which regulates the glyoxylate shunt (Zhou et al., 2012). lipids has been observed in the 'lag phase' after reaeration for increased protein synthesis (Du et al., 2016). The hubs of the 'saddlebrown' module include several predicted sRNAs, and the annotated sRNA, F6. F6/ncRv10243/SfdS is a sigF-dependent ncRNA, which has been shown to be induced in nutrient starvation, oxidative stress, acid stress (Arnvig & Young, 2009;Houghton et al., 2021) and the fatty acid hypoxia model (Del Portillo et al., 2019). In addition to being expressed from its own promoter, F6/SfdS has been proposed to be co-transcribed with the upstream gene fadA2 (Rv0243), a probable acetyl-CoA acyltransferase; however, fadA2 is clustered in a different module from SfdS ('darkred').
One of the predicted sRNAs among the 'saddlebrown' module hubs is antisense transcript ncRv2489/putative_srna:p2801108_2801678 with a TSS at 2801108. This overlaps the 3′ end of PE-PGRS43 (Rv2490c) (Figure 6). There is a short reading frame (30 nucleotides, 10 amino acids) initiating from a Methionine at this TSS that suggests a possible dual-function sRNA or sORF with independent function. A shorter, possibly leadered, sORF was predicted by Shell et al. (2015) Figure S12). Another hub antisense sRNA, putative_sRNA:p3608313_3608866/ncRv3231c, overlaps the 3′ end of the upstream gene, Rv3231c, and has a predicted TSS at 3608313.

| Slow-growth-correlated module is associated with transcriptional remodelling and metal ion homeostasis and enriched for sRNAs
The 'green' module contains genes that are associated with transcriptional remodelling in response to hypoxic or stationary growth conditions. It is positively correlated with hypoxic The 'green' module is enriched for sRNAs (p adj = 0.011). Among the best-connected, are 27 predicted antisense RNAs. One of these hubs, putative_sRNA:p1404640_1404929/ ncRv1257 is antisense to the 3′ end of Rv1257c, a probably oxidoreductase, and another (putative_sRNA:p1771044_1771498/ncRv1546) is antisense to the 5′ end of a trehalose synthetase, treX. Both of these sRNAs have TSSs and are expressed differentially among the tested conditions.
Control of reactive oxygen species and synthesis of trehalose intermediates are important for cells in to survive in hypoxic conditions (Eoh et al., 2017;Harold et al., 2019) and antisense RNA may be involved in fine-tuning these responses.
Another antisense RNA, ncRv1358c (putative_sRNA:m1530046_ 1530745) has a TSS near its start and is found antisense to Rv1359.
Rv1359 and the upstream gene, Rv1358, on the opposite strand are very similar to each other (43.7% identity in 197 aa overlap) and to another gene elsewhere in the genome, Rv0891c (48.5% identity in 204 aa overlap) (Kapopoulou et al., 2011). All three genes are possible LuxR family transcriptional regulators, which are thought to be involved in quorum-sensing adaptations and contain a probable ATP/GTP binding site motif (Chen & Xie, 2011;Modlin et al., 2021) and are found in different modules. Expression of this antisense sRNA appears to suppress the expression of the transcript on the opposite strand to varying degrees in all conditions (Figure 7). In the cholesterol and fatty acid media samples, expression of a shorter transcript appears to begin inside the Rv1359 ORF, where the transcript is not overlapped by the antisense transcript, possibly utilising an internal TSS at 1530774.

| Metal ion homeostasis genes cluster
in module that is negatively correlated with the hypoxia condition The 'darkred' module is negatively correlated with the hypoxia condition (bicor = −0.46, p adj = 0.005, Figure 3). This module contains most of the ESX-3 genes (Rv0282-Rv0292) related to siderophoremediated iron (and zinc) uptake in Mtb (Serafini et al., 2013;Zhang et al., 2020), with nine of these representing hubs in the module.
The module is enriched for the PE/PPE functional category, and includes the two genes preceding the ESX-3 genes, Rv0280 (PPE3) and Rv0281  (fadA2) (Dutta, 2018), and a gene involved in the pentose phosphate pathway, zwf2 (Rv1447c).

F I G U R E 7
Expression of antisense transcript putative_sRNA:m1530046_1530745 (magenta bar) seems to suppress the expression of most of Rv1359 and Rv1358 in cholesterol and fatty acid media. An internal TSS exists inside the Rv1359 CDS at 1530774 near where expression begins. Note, prediction of an individual sRNA is an aggregate of predictions under different conditions, so will not always match the expression of the sRNA in any particular sample. Sample SRR5689230 from PRJNA27860. Strand coverage using the 'second' read of each pair mapping to the transcript strand, visualised using Artemis genome browser (Carver et al., 2012).
There are some well-connected ncRNAs in the 'darkred' module, including a predicted antisense RNA to Rv0281, 'ncRv0281c' (putative_sRNA:m341328_342075). This putative sRNA has a predicted TSS at the 5′ end and is transcribed divergently from Rv0282 (eccA3). This is one of the rarer cases where the antisense transcript and cognate protein-coding gene (Rv0281) are clustered in the same module. The prevailing direction of transcription at this locus may be a result of competition for RNAP binding at a bidirectional promoter in the predicted 5′ UTR of Rv0282, which also clusters in the module. There are several UTRs in the module hubs, including a 3′ UTR for the gene Rv1133c, metE (also found in the module). This UTR was previously identified as abundantly expressed in exponential culture (Arnvig et al., 2011). There is a 3′ UTR for Rv0292 (eccE3, also a hub in the 'darkred' module) that is antisense to a large part of the 3′ end of Rv0293c, which has a converging orientation to Rv0292 (Supporting Information Figure S13). Rv0293c is a hub in a different module ('lightgreen') along with its 3′ UTR. Overlapping 3′ ends of genes could func- genes (Mai et al., 2019). Mcr11 is also found in the module. This sRNA is known to respond to the second messenger 3′,5′-cyclic adenosine monophosphate and has been found to be expressed in hypoxic Mtb cultures and in a mouse infection model (Girardin & McDonough, 2020). Mcr11 regulates the expression of several genes that adapt central carbon metabolism during slow-growth conditions (Girardin & McDonough, 2020).
There are two well-connected intergenic sRNAs predicted in the 'darkturquoise' module. Putative_sRNA:p1164036_1164162/ ncRv11040 is located between PE8 and a possible transposase, Rv1041c, but on the antisense strand. There is a predicted TSS at 1163697, 39 nucleotides upstream of the predicted start sequence.
This transcript is in a converging orientation to the transposase and may be instrumental in regulating horizontal gene transfer (Ellis & Haniford, 2016;Lejars et al., 2019). The other intergenic hub is also upstream from possible transposase, Rv3114, but in diverging orientation on the opposite strand. The TSS is at 3481459, and the sRNA is within a predicted 'MT-complex-specific' genomic island associated with virulence genes (Becq et al., 2007). Rv3112-14 are clustered in the 'salmon' module.
There are several interesting 'independent' UTRs that are wellconnected in the module, but their assumed transcriptional partner clusters in another module. There are several predicted TSS's and transcriptional termination sites (TTS) (D'Halluin et al., 2022) within the predicted boundaries of a 3′ UTR for the gene Rv2081c (putative_UTR:m2337218_2338064) and a predicted sORF based on ribosome profiling (Smith et al., 2022) (Figure 8). The adjacent gene, Rv2081c, is in the 'cyan' module along with most of the DosRregulated genes. The 5′ UTR of Rv0281c is also a hub in the module and contains predicted TSSs, TTS and sORF. It would be interesting to discover whether these UTRs could have dual functions as regulatory RNA elements as well as being translated into short peptides.
Rv2081c is a conserved membrane protein containing a simple sequence repeat of 8 C's and has been identified as a source of sequence variation in Mtb sputum and culture (Shockey et al., 2019;Sreenu et al., 2007).
The best-connected elements in the module are antisense sRNAs, including putative_sRNA:p2081553_2082178/ncRv1835, with a predicted TSS at its start. It is antisense to Rv1835c, the gene for a putative serine esterase clustered in the 'mediumpurple3' module, in particular to the 3′ end of the peptidase domain (Xaa-Pro dipeptidyl-peptidase-like domain) (Blum et al., 2020). Putative_sR-NA:m2497549_2498369/ncRv2225c, with a TSS at 2498368, is antisense to Rv2225, coding for a 3-methyl-2-oxobutanoate hydroxymethyltransferase PanB. This gene clusters in the in 'turquoise' module.

| Comparison with other global Mtb networks
Other regulatory networks have been developed for Mtb that use transcriptomic data to cluster protein-coding genes according to their responses to environmental conditions (Peterson et al., 2014;Yoo et al., 2022). Peterson et al. (Peterson et al., 2014), utilises a 'biclustering' algorithm, cMonkey, that clusters genes and conditions based on co-expression in publicly available microarray data and the presence of common transcription factor binding motifs (Reiss et al., 2006). The network is pruned and shaped by adjusting the weights of particular lines of evidence a priori input such as binding motifs, protein homology, operon groupings and known proteinprotein interactions (PPIs) (Peterson et al., 2014;Reiss et al., 2006).
This network's ability to assimilate both a priori and transcriptomic expression data was tested by its ability to recapitulate known associations and groupings found by overexpression of transcription factors and identification of transcription factor binding motifs.
Thus, a 'parsimonious' network was created that uncovers novel transcriptomic responses to particular environmental conditions that are validated by several lines of evidence (Peterson et al., 2014;Reiss et al., 2006). This approach differs significantly from ours in several important ways. Firstly, the WGCNA network we present relies entirely on transcriptomic data alone-RNA-seq, in particular. RNA-seq is more sensitive than microarray data and is able to detect the expression of novel transcripts that may represent noncoding or unknown protein-coding RNA transcripts. Our network is more comprehensive in an attempt to include every detectable RNA transcript found in the included RNA-seq data sets. These novel transcripts naturally lack any a priori data to shape or reinforce associations, and we have not applied any filtering methods other than evaluating the strength of module membership.
A more recent approach uses a large number of RNA-seq data sets with deconvolution methods to reduce the noise in the network and find clusters of protein-coding genes ('iModulons') that together account for significant chunks of variation in expression levels in response to environmental conditions (Yoo et al., 2022). In both the As all of the RNA-seq data sets included in this WGCNA analysis are also included in the iModulon analysis, overlaps between these two studies are perhaps not surprising. An important distinction between our study and these other approaches is that the network presented here seeks to identify not just groupings of proteincoding genes linked by transcriptional regulation, but associations involving non-coding RNA, as well. For example, the protein-coding hub genes of the 'violet' module overlap with the 'VirS' iModulon, which was linked in Yoo et al. (2022) to response to acid environment and remodelling of cell membrane. In addition to the coding genes that overlap the 'VirS' iModulon, the hubs of the 'violet' module include the non-coding RNA, Mcr7. Mcr7 is a ncRNA known to be activated by the PhoPR regulon, which responds to acid pH (Solans et al., 2014). The hypothetical protein-coding transcript that overlaps this locus, Rv2395A, is found in the 'PhoP' iModulon. The 'violet' module also includes several UTRs among the hub members that may represent important players in this adaptation response. Thus, our approach adds value to these previous methods by including unannotated elements that may have roles in the regulation of gene expression.
One advantage of the deconvolution method over WGCNA is that by filtering for only the strongest associations and allowing genes to be members of more than one iModulon, the modules are less 'noisy'. However, deconvolution methods require extremely F I G U R E 8 The 5′ and 3′ UTRs for Rv2081c (green bars) are overlapped by predicted sORFs (yellow bars). (Cortes et al., 2013;Smith et al., 2022). Shown is sample SRR5689224, exponential growth, from PRJNA27860. Strand coverage using the 'second' read of each pair mapping to the transcript strand, visualised using Artemis genome browser (Carver et al., 2012).
large numbers of samples to perform well, may be subject to batch effect issues between experimental data sets and characterise a limited proportion of the protein-coding transcripts expressed by Mtb (Saelens et al., 2018;Yoo, et al., 2022). In order to include predicted ncRNA in the network, a significant degree of quality control, parameter adjustment and manual curation is required, limiting the number of data sets that could be included in our analysis. Including more data would most likely strengthen the correlations with certain conditions and improve the overall specificity of the WGCNA modules.  (Eoh et al., 2017;Gerrick et al., 2018). Confounders such as dual-function, 'moonlighting', proteins may weaken the correlation of a module with a specific condition and may create noise in otherwise well-connected modules. Rather than using an arbitrary cutoff to decide which module associations are relevant, we utilise a flexible measure of module membership that allows the user to filter the strength of associations. In our discussion, we used a relatively stringent threshold 'module membership' score of 0.8 to identify the transcripts in each module that have the tightest correlation to the module eigengene, but there has been no pruning or editing of the modules, in order to avoid any loss of information.
An important advantage of including ncRNA in a co-expression network is the chance to observe post-transcriptional groupings that result from adaptive responses, as well as the transcriptional responses. By focussing on the best-connected transcripts in various modules, unexpected connections between genes of diverse pathways can be discovered. The work presented here confirms that ncRNA are important players in adaptation responses, and the existence of informative protein-coding co-expression networks can help to implicate these transcripts in adaptive responses and provide context for their activity.

| CON CLUS ION
This paper presents a large-scale network analysis of over 7000 transcripts expressed by Mtb under a variety of conditions. The modules group together clusters of co-expressed protein-coding genes, as well as ncRNA transcripts predicted from RNA-Seq signals. Several modules are statistically enriched for sRNAs, especially those modules positively correlated with hypoxia. The abundance of antisense RNA in conditions of stress has been widely observed, and it is, therefore, not a surprise to find them in the hubs of these modules. However, it is noticeable that the complementary ORF is usually excluded, which leads us to seriously consider antisense transcription as part of strategic regulation of protein production in response to environmental cues through mechanisms of divergent transcription, translational control or by regulating mRNA stability (Vargas-Blanco & Shell, 2020;Warman et al., 2021). If these strategies actually differ among the members of the MTBC, it may have implications for host specificity and virulence (Dinan et al., 2014). By the same logic, 3′ UTR transcripts clustering in modules distinct from their upstream ORF implies independent function from the ORF. sRNAs generated from
The modules discussed in depth in this paper represent a limited snapshot of this extensive co-expression network. Modules of interest can be identified by correlations to experimental conditions, associated GO terms, functional categories, or gene group enrichment.
The supplementary tables (Supporting Information

FU N D I N G S TATEM ENT
This work was supported by a Bloomsbury Colleges PhD studentship to JS.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The code and data to reproduce the analysis in this study are archived on Zenodo (https://www.zenodo.org), DOI: https://doi. org/10.5281/zenodo.7709329. The original code and R data objects are also available on GitHub (https://www.github.com/jenja ne118/ mtb_wgcna/).

E TH I C S S TATEM ENT
None.