Transcription factors: Bridge between cell signaling and gene regulation

Transcription factors (TFs) are key regulators of intrinsic cellular processes, such as differentiation and development, and of the cellular response to external perturbation through signaling pathways. In this review we focus on the role of TFs as a link between signaling pathways and gene regulation. Cell signaling tends to result in the modulation of a set of TFs that then lead to changes in the cell's transcriptional program. We highlight the molecular layers at which TF activity can be measured and the associated technical and conceptual challenges. These layers include post‐translational modifications (PTMs) of the TF, regulation of TF binding to DNA through chromatin accessibility and epigenetics, and expression of target genes. We highlight that a large number of TFs are understudied in both signaling and gene regulation studies, and that our knowledge about known TF targets has a strong literature bias. We argue that TFs serve as a perfect bridge between the fields of gene regulation and signaling, and that separating these fields hinders our understanding of cell functions. Multi‐omics approaches that measure multiple dimensions of TF activity are ideally suited to study the interplay of cell signaling and gene regulation using TFs as the anchor to link the two fields.

modulating the accessibility of its binding sites (including epigenetic processes and cell-type specific chromatin states). Once bound, TFs can open the chromatin for other factors to bind or prevent other factors from binding, and activate or repress the transcription of genes.
Consequently, TFs can be studied at any of these levels, each associated with its own challenges and limited in the specific insights that can be gained.
TFs tend to be of low abundance at the protein level [6][7][8][9], which makes them challenging to detect in proteomic assays. Furthermore, the function of TFs depends on PTMs and binding to interaction partners. Thus, their expression level does not necessarily correlate with their functional activity. On the other hand, the binding of TFs to chromatin can be readily assayed, for example, using chromatin immunoprecipitation followed by sequencing (ChIP-seq) and similar assays, which results in genome-wide maps of TF binding. Yet, it remains a challenge to distinguish functional binding sites from nonspecific binding events [10]. Furthermore, these binding assays are typically limited to one TF at a time, and thus grossly underestimate the complexity of gene regulation. Finally, TF activity can be inferred from assessing the chromatin accessibility around their predicted binding sites [11] or the expression of the genes they regulate [12][13][14][15]. The challenge for the former is that TF binding predictions suffer from a lack of specificity, and for the latter that our knowledge of TF-to-gene mapping is very incomplete.
In this review we focus on how TFs respond to cellular signaling, adopting a broad definition of TFs: we used the carefully curated list by Lambert et al. [2] and, since one aim of this review is to compare different resources that collect TFs, we extended this list with proteins that are judged as TFs by commonly used databases such as TRRUST [16] and DoRothEA [13]. Both databases are more lenient in their definition of TFs: TFs bind DNA within a complex and/or through a DNA-binding domain to regulate gene expression. We review the regulatory and functional aspects of TFs within signaling pathways and gene regulation with a focus on technological and data analysis challenges, and highlight the existence of a strong literature bias in the TF literature. We argue that gene regulation is part of cell signaling and propose that the most comprehensive way of studying the functional role of TFs in signaling is by combining the power of multiple assays that detect signaling activity, TF activity, their genomic localization and potential interaction partners. Therefore, TFs, which can be assayed both in terms of signaling and in terms of their impact on transcriptional regulation, offer a perfect bridge between the signaling and transcriptional regulation fields (Figure 1).

Significance Statement
We review the role of transcription factors (TFs) in signaling and gene regulation, highlighting the importance of multiomics studies to obtain a full picture of TF function. Globally, while a subset of TFs is very well studied for their role in signaling and gene regulation, a significant fraction of TFs has been very much neglected by both fields. Future studies that focus on these "dark TFs" will have the biggest impact in understanding the role of TFs.

POST-TRANSLATIONAL MODIFICATION-BASED REGULATION OF TF ACTIVITY IN CELL SIGNALING
Signaling pathways tend to result in the activation or inactivation of TFs, often through PTMs of the TF, typically without altering their DNA binding specificity. Examples of well-known signaling cascades that lead to the activation of TFs are TGFβ signaling leading to activation of the SMAD family TFs [17][18][19], Jak-STAT signaling activating the STAT TFs [20][21][22][23], Erbb2 signaling typically activating Jun and Myc [24,25], Hippo signaling targeting the TEA domain-containing (TEAD) family (TEAD1-TEAD4) of TFs [26,27] and Notch signaling that induces dissociation of DNA-bound RBPJ from a corepressor complex and recruitment of a coactivator complex instead [28,29].
[...] an intermediary signaling transduction (reviewed in [33][34][35]) [36,37]. One class of nuclear receptors, such as estrogen receptor and glucocorticoid receptor, resides in the cytoplasm until binding to their ligand allows translocation to the nucleus and expression of target genes. Another class of receptors, like the thyroid hormone receptor, is already bound to its specific DNA sequence as a heterodimer with retinoid X receptor, and binding of the ligand replaces corepressors with coactivator complexes to initiate gene expression [33,35]. Coregulators of nuclear receptors serve as important targets, propagators and integrators of PTMs to drive specific gene expression programs [38,39].
The most studied PTM in signaling is phosphorylation, but other PTMs such as sumoylation, ubiquitination, acetylation, glycosylation, and methylation also play a role in signal transduction. PTMs can affect TF localization, stability, activity and interaction with other proteins [4,40,41,42]. See [26,43,44,45] for focused reviews on the effect of different PTMs on subsets of TFs. TFs can carry several modifications at once and the modifications might be dependent on each other, as reviewed by [46,4,47].

An early attempt at curating the complexities of PTM regulation in TFs is the PTM-switchboard database [48], which is unfortunately no longer accessible. Not much progress in understanding TF PTMs has been made since. This is also illustrated by the recently published compendium of TFs, which explicitly refrains from characterizing the PTMs that regulate TFs because only a few studies have systematically annotated and disentangled the complex combinatorial effects of PTMs on the function of TFs [2].
On the other hand, the rapid advancement of mass spectrometry (MS) in recent years has enabled high-throughput measurement of PTMs on proteins. Specifically for phosphorylation, the number of phosphoproteomics datasets increased from 127 annual submissions to PRIDE in 2010 to 2344 in 2019 [49]. These data provide a great resource for investigating the effect of different phosphorylation events on TF function.
To exemplify the potential of this uncharted area, we summarize the phosphosite data available for TFs. We compiled a list of 1967 human TFs using the carefully curated list of 1639 TFs by Lambert et al. [2], supplemented by TFs curated by two commonly used databases that provide TF-gene interactions, DoRothEA [13] and TRRUST [16] (see Table S1). Most TFs in these two databases are also defined as TFs by Lambert et al. However, 328 proteins are not considered TFs by the strict definition of Lambert et al., likely because they do not bind to a specific DNA sequence but are part of the more general gene regulation machinery. Figure S1 shows the intersection of TFs as defined by the three resources. Databases such as PhosphoSitePlus [50,51] and PTMcode2 [52,53] collect and annotate the presence and function of PTMs on proteins in several species. While PTMcode2 specifically curates and predicts functional associations of PTMs between proteins, it covers only a few TFs. Hence, we queried the curated list of TFs in PhosphoSitePlus, specifically collecting information about phosphorylation sites (phosphosites) [51]. Of the 1967 TFs, 1857 (94%) have at least one measured phosphosite and 934 (47%) have more than ten.
However, only 393 TFs (20%) have a known functional phosphosite (i.e., annotated with a functional effect or known process). Among the functional phosphosites in TFs, the most common effects on protein function are related to regulation of molecular association, intracellular localization, protein degradation, protein stabilization and induced activity, while the most common effects on biological processes are altered transcription (both induced and inhibited), cell cycle regulation and altered cell growth (see Table S2).
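The bookkeeping behind these numbers can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for this review: the gene symbols in the toy sets and the helper names `merge_tf_lists` and `percent` are hypothetical, and only the counts (1639, 328, 1857, 934, 393) are taken from the text.

```python
def merge_tf_lists(strict, *lenient):
    """Union of a strict TF list with more lenient resources.

    Returns the merged set and the proteins contributed only by the
    lenient resources (those outside the strict definition).
    """
    merged = set(strict).union(*lenient)
    extras = merged - set(strict)
    return merged, extras


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Toy example with placeholder memberships (hypothetical, for illustration only).
toy_strict = {"TF_1", "TF_2", "TF_3"}
toy_resource_a = {"TF_1", "TF_2", "COFACTOR_X"}
toy_resource_b = {"TF_3", "COFACTOR_X", "COFACTOR_Y"}
toy_merged, toy_extras = merge_tf_lists(toy_strict, toy_resource_a, toy_resource_b)

# Counts quoted in the text.
n_strict = 1639            # curated list of Lambert et al.
n_extra = 328              # proteins outside the strict definition
n_total = n_strict + n_extra

coverage = {
    "at least one phosphosite": percent(1857, n_total),
    "more than ten phosphosites": percent(934, n_total),
    "at least one functional phosphosite": percent(393, n_total),
}
```

With the counts from the text, `n_total` is 1967 and the three coverage values are 94, 47 and 20 percent, which makes the gap between measured and functionally annotated phosphosites explicit.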

REGULATION OF TF-BINDING TO CHROMATIN THROUGH DNA-SEQUENCE AND EPIGENETICS
Once a TF is activated by a signaling cascade, and before it can modulate the expression of its target gene, it has to bind to chromatin. There are multiple ways in which TF binding to chromatin is regulated. We briefly review the major mechanisms below, focusing on sequence-specific TFs (as opposed to general TFs that are part of the basal transcription machinery and bind with less sequence specificity).
[Figure caption fragment] [...] InterPro. Families are sorted by the fraction of proteins with at least one phosphosite (total bar length). The fraction of proteins with at least one functional phosphosite (blue), at least one phosphosite with a high predicted functional score (>0.5) as defined by Ochoa et al. [55] (orange), their intersection (shaded area), and the fraction of proteins with no functionally annotated phosphosite (grey) are shown (n = total number of proteins; n_TF = number of TFs in family). Our curated list of TFs is highlighted in bold. Some InterPro families are a subset of another family with a similar name; those were merged into one set and named according to the larger family (indicated by an asterisk).
Sequence-specific TFs typically bind to specific DNA sequences that can be summarized as TF motifs, as reviewed in Slattery et al. [56]. These motifs are inferred from TF binding assays in vitro (e.g., SELEX [57], protein-binding DNA arrays [58]) or in vivo (e.g., ChIP-seq, or other, more recent chromatin profiling technologies such as ChIP-exo [59], ChIP-nexus [60], CUT&Tag [61], or CUT&RUN [62]), by finding enriched sequences among the TF-bound DNA fragments. These motifs, which are collected in motif databases such as JASPAR [63], HOCOMOCO [64], CIS-BP [65], and others [66], can then be used to predict putative TF binding sites in the genome, using tools like PWMscan [67] or MOODS [68]. However, one of the long-standing challenges in the field is that these predictions suffer from a high false positive rate. An early observation, termed the futility theorem by Wasserman and Sandelin [69], is that knowing only the TF binding motif will not lend any functional insight. This observation still holds true to date, and it is now evident that DNA sequence alone cannot predict TF binding very well in an in vivo setting, as also outlined in a recent review [10]. On the one hand, this is partially due to the lack of accuracy of available data and partially due to a lack of conceptual understanding of TF biology. For example, recent advances in TF-mapping technology combined with deep-learning algorithms to predict TF binding sites have been much better at predicting direct binding for some TFs [70]. Furthermore, a recent conceptual advance suggests that phase separation of multimolecular assemblies can explain transcriptional regulation to some extent, thus suggesting that TF activity can be independent of direct DNA binding [71]. On the other hand, the lack of prediction specificity may simply stem from cell-type specific regulation of TF binding, for example, through mechanisms involving chromatin compaction (see below).
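To make the idea of motif-based binding site prediction concrete, the following Python sketch scores every window of a sequence against a log-odds position weight matrix and reports windows above a threshold. It is a simplified illustration and not the algorithm of PWMscan or MOODS: the 4-bp motif `TOY_PFM`, the uniform background, the threshold and the function names are all hypothetical, only the forward strand is scanned, and the sequence is assumed to contain only A, C, G and T.

```python
import math

# Uniform background base frequencies (an assumption of this sketch).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical position frequency matrix for a 4-bp motif "ACGT",
# one dictionary of base probabilities per motif position.
TOY_PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def log_odds_matrix(pfm, background=BACKGROUND):
    """Convert base probabilities into log2 odds against the background."""
    return [
        {base: math.log2(column[base] / background[base]) for base in "ACGT"}
        for column in pfm
    ]


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every forward-strand window
    whose summed log-odds score reaches the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_matrix(TOY_PFM)
hits = scan("TTACGTAACGTGG", pwm, threshold=4.0)
hit_positions = [start for start, _, _ in hits]
```

In this toy sequence the two exact `ACGT` occurrences (starting at positions 2 and 7) are the only windows above the threshold. Because short motifs match by chance throughout a genome of billions of base pairs, the same procedure applied genome-wide yields many matches without any evidence of in vivo binding, which is the practical content of the false positive problem described above.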
Large-scale efforts, such as ENCODE [72,73], that profile TFs across thousands of cell types [74] and databases collecting experimentally measured TF binding sites (e.g., REMAP [75], ChIP-Atlas [76] or GTRD [77]) are useful to study TF binding in specific cell types. The caveat for these is that they remain blind to cell types that have not been experimentally profiled.
The cell-type specific action of TFs is partially driven by their expression pattern, with a considerable number of TFs showing tissue-specific expression [2,9]. In addition, the same TF can bind different loci depending on the context [78,79], or even change its mode of action (i.e., acting as repressor or activator) in different cell types [11].
This context-specific behavior may be achieved by interactions with other TFs, cofactors and overall changes in DNA accessibility (recently reviewed in Zeitlinger [10]). In a landmark study, Jolma et al. measured the in vitro binding affinity of hundreds of pairs of TFs and found that co-binding of two TFs is much more prevalent than previously appreciated [80]. Following up on this, Ibarra et al. showed that genes bound by pairs of TFs (instead of just one) provide a remarkable specificity in terms of their biological function [81]. These and other works suggest co-binding of TFs as an important mechanism to regulate cell-type specific TF binding [82,81,80,83]. Given the large number of TFs that have phosphosites of unknown function (Figure 2A) [...]
The epigenetic profile of a cell constitutes an additional layer that contributes to context-/cell-type specific TF binding [10]. This includes DNA methylation and chromatin modifications, which are PTMs of histone tails that correlate with functional properties of chromatin [85]. Chromatin modifications are mostly known for their ability to recruit chromatin remodeling complexes, for example polycomb [86], and parts of the basal transcription machinery, such as TFIID [87,88]. Even though a few sequence-specific TFs have also been shown to directly interact with specific histone modifications [89], the main impact of chromatin modifications on TF binding is likely mediated through their effect on DNA accessibility. For example, lysine acetylation neutralizes the positive charge of histone residues and thus decreases nucleosome affinity to DNA [90,91]. This effect has theoretically been described by a nucleosome-mediated cooperativity model [92], which proposes competition for DNA binding between nucleosomes and a set of TFs as a dynamic equilibrium. A recent study has shown experimental evidence for a slightly updated model of TF-nucleosome cooperativity that includes active nucleosome remodeling [93].
This model also implies that TFs play an important role in modulating chromatin accessibility and thereby define the epigenetic landscape of a cell. This is most evident for the class of so-called pioneer TFs, which are defined based on their ability to bind to closed chromatin and make it accessible for other TFs to bind, for example during cell fate decisions (recently reviewed in Zaret [94]). There is also accumulating evidence that non-pioneer TFs can regulate chromatin. For example, in [95] the authors achieved a reasonably accurate prediction of histone modifications across cell lines based only on TF binding data. More recently, a deep-learning framework was able to predict the chromatin accessibility profiles of immune cells based on sequence and thereby discovered the sequence motifs of cell-type specific TFs ab initio [96]. Furthermore, observations that genetic variants that modulate histone modifications tend to disrupt TF binding sites [97,98] suggest a causal (direct or indirect) role of TF binding in regulating histone modifications. Thus, while chromatin modifications and accessibility may determine where TFs can bind, and integrating them is useful for inferring context-specific TF binding, they are also actively being modulated by TFs.
Certain modifications, specifically those related to accessible chromatin (e.g., Histone 3 lysine 27 acetylation (H3K27ac)), can therefore even serve as a direct readout of TF activity, which highlights the tight interconnection between signaling and gene regulation [99].
Recent studies have formalized this relationship to quantify differential TF activity by aggregating changes in histone modifications or chromatin accessibility across the predicted binding sites of a TF (diffTF [11], chromVAR [100]). Together with the development of single-cell chromatin accessibility profiling [101], and in particular its recent commercialization, this will dramatically increase our understanding of cell-type specific TF activity profiles in the future.
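The aggregation idea can be illustrated with a deliberately reduced Python sketch: for each TF, the accessibility changes of all peaks that contain one of its predicted binding sites are averaged and compared with the average over all peaks. This is not the diffTF or chromVAR algorithm; the published tools include further normalization and statistical testing that this sketch omits. The peak identifiers, log2 fold changes, TF names and the function `differential_tf_activity` are hypothetical.

```python
from statistics import mean


def differential_tf_activity(peak_log2fc, tf_to_peaks):
    """Mean log2 fold change of a TF's peaks relative to all peaks.

    peak_log2fc: dict mapping peak id -> log2 fold change in accessibility
                 between two conditions.
    tf_to_peaks: dict mapping TF name -> peak ids with a predicted site.
    """
    background = mean(peak_log2fc.values())
    activity = {}
    for tf, peaks in tf_to_peaks.items():
        values = [peak_log2fc[p] for p in peaks if p in peak_log2fc]
        if values:  # skip TFs without any profiled peak
            activity[tf] = mean(values) - background
    return activity


# Toy input (hypothetical values).
toy_log2fc = {"peak1": 1.0, "peak2": 2.0, "peak3": -1.0, "peak4": -2.0}
toy_sites = {
    "TF_A": ["peak1", "peak2"],      # sites gain accessibility
    "TF_B": ["peak3", "peak4"],      # sites lose accessibility
    "TF_C": ["peak_not_profiled"],   # no usable peak, hence no estimate
}
activity = differential_tf_activity(toy_log2fc, toy_sites)
```

In this toy example `TF_A` receives a positive and `TF_B` a negative activity score, while `TF_C` is dropped. Note that the sign only reflects a change in accessibility at predicted sites; whether the TF acts as an activator or repressor has to be determined separately.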

TFs AS PART OF GENE REGULATORY NETWORKS (GRN)
Conceptually, the final result of activating a TF is the modulation of expression of its direct target genes, a set also referred to as the regulon of a TF. The combined activity of a set of TFs connected to their target genes is referred to as a gene regulatory network (GRN) [102].
These networks are responsible for maintaining cell-type specific transcriptional states and response to signaling. However, the exact nature [...]
The most comprehensive resource for experimentally validated TF-gene interactions is the TRRUST (transcriptional regulatory relationships unravelled by sentence-based text-mining) database [16], which is based on manual curation and currently comprises over 8000 TF-gene interactions. Typically, these links are derived from studies that focus on one TF in one specific context at a time. However, similar to its binding to DNA, the set of genes regulated by a given TF is likely highly context-specific. In fact, most TFs in TRRUST are classified as activator and as repressor almost equally often (Figure 3), suggesting that even the actual function of a TF is highly context-dependent.
An alternative explanation for this is that the data curation underlying the TRRUST database is incomplete. Either way, while it is a great resource for testing individual TF-gene interactions in a given context (i.e., consulting the curated studies), it is not a reliable source for inferring genome-scale GRNs.
One strategy for inferring genome-scale GRNs is based on perturbation studies that alter the activity of a TF (through overexpression, knockdown, knockout or chemical inhibitors) and then measure the resulting changes in DNA binding or target gene expression [105,106]. A number of these studies have been curated within the KnockTF database covering 308 human TFs [107]. Another set of methods is based on co-expression of TFs and genes (e.g., WGCNA [108]), with some variations that use energy-based or information-based measures instead of correlation (e.g., DPM [109], sdcorGCN [110], PIDC [111,112]). These approaches (reviewed in [113][114][115]) are based on the assumption that a change in TF expression level will result in a transcriptional change of its regulon. Despite the significant progress and numerous practical applications of co-expression to GRN inference, the direct interpretation of co-expression networks in terms of gene regulation is limited due to missing directionality. More recently, the use of co-expression to infer modules of jointly regulated genes (regulons) has been combined with prior knowledge of TF binding sites and/or TF perturbation studies to define TF-specific regulons [116,13,117], in some approaches even integrating TF-mediated enhancer activation [118,119], which limits the target genes to those co-expressed with and likely bound by a TF.
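The co-expression principle, and its missing directionality, can be shown with a small Python sketch that links a TF to every gene whose expression profile is strongly correlated with it across samples. This is a bare-bones illustration and not the WGCNA procedure: the expression values, gene names, the cutoff of 0.8 and the helper names are hypothetical, and profiles are assumed to have non-zero variance.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long profiles
    (assumes both have non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, tfs, cutoff=0.8):
    """Return (tf, gene, r) for all pairs with |r| at or above the cutoff."""
    edges = []
    for tf in tfs:
        for gene, profile in expression.items():
            if gene == tf:
                continue
            r = pearson(expression[tf], profile)
            if abs(r) >= cutoff:
                edges.append((tf, gene, r))
    return edges


# Toy expression matrix: five samples per gene (hypothetical values).
toy_expression = {
    "TF1": [1, 2, 3, 4, 5],
    "GENE_A": [2, 4, 6, 8, 10],   # rises with TF1
    "GENE_B": [5, 4, 3, 2, 1],    # falls as TF1 rises
    "GENE_C": [3, 1, 4, 1, 5],    # weakly related
}
edges = coexpression_edges(toy_expression, tfs=["TF1"])
linked_genes = {gene for _, gene, _ in edges}
```

Here `GENE_A` and `GENE_B` are linked to `TF1`, whereas `GENE_C` is not. Because `pearson(x, y)` equals `pearson(y, x)`, the correlation alone cannot tell whether the TF regulates the gene or the reverse; the direction in the edge list comes only from the prior decision of which genes are treated as TFs.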

[Figure panel labels: publication bias (number of PubMed articles with the TF in title or abstract); number of predicted TF binding sites (HOCOMOCO); number of measured binding sites]

THE SAME TFs ARE UNDERSTUDIED AT ALL MOLECULAR LAYERS
For each layer of regulating TF activity there are literature-curated data as well as large-scale measured or inferred data. For example, the collection of phosphosites in PhosphoSitePlus incorporates high-throughput mass-spectrometry screens [51]. In contrast to functional studies that focus on a few proteins at a time, these screens are not biased a priori towards specific sets of proteins. Similarly, TF binding to chromatin as measured by ChIP-seq requires experiments in a specific cell type and context, whereas motif-based predictions of TF binding sites are data-independent. Finally, genes regulated by TFs can be curated from small, functional studies, or inferred based on high-throughput data.
To quantify a potential literature bias in functional annotation of these different measures of TF activity, we defined a measure of how well a TF is studied as the number of PubMed-indexed studies that mention its gene name in their titles or abstracts (query on 09.03.2021, see Table S3). This revealed between 0 and 1,120,174 studies per TF, with 50% of TFs having fewer than 44. Hence, a few TFs are studied very intensively, while most TFs gather little attention. This bias towards a small set of well-studied TFs was already observed over ten years ago by Vaquerizas et al. [9]. Notably, most of the least-cited TFs belong to the Zinc finger C2H2 family. Hence the largest family of TFs (716, Figure 2A) [...] (Figure 4E). Curated TF targets in TRRUST [16] seem mostly available for highly studied TFs, as illustrated by the strong relationship between the number of studies citing a TF and the number of its target genes reported in TRRUST (Figure 4H). A similar relationship between literature bias and number of predicted targets is not observed for more data-driven approaches to link TFs to their targets, such as DoRothEA [13] (Figure 4G), which, in addition to literature curation, also includes ChIP-seq peaks, TF binding site motifs and gene co-expression.
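One way to put a number on such a relationship is a rank correlation between the publication count of each TF and the number of targets assigned to it by a given resource. The Python sketch below implements Spearman's rank correlation for this purpose. It is an illustration only and not necessarily the statistic underlying Figure 4; all counts are invented toy values and the helper names are hypothetical.

```python
import math
from statistics import median


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks
    (assumes neither input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Toy data for five TFs (hypothetical counts, not taken from Table S3).
publications = [0, 12, 44, 300, 25000]
curated_targets = [0, 1, 3, 40, 500]         # grows with publication count
inferred_targets = [120, 80, 150, 90, 110]   # unrelated to publication count

median_publications = median(publications)
rho_curated = spearman(publications, curated_targets)
rho_inferred = spearman(publications, inferred_targets)
```

In this toy example the curated target counts follow the publication counts perfectly (rho of 1), while the inferred target counts do not (rho close to 0). A rank-based measure is a natural choice here because publication counts span several orders of magnitude.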
Thus, many of the measured phosphosites in TFs, their predicted binding sites and inferred target genes await further functional studies (Figure 4). To assess whether the same TFs are well-studied for their role in signaling (i.e., PTM regulation) and their role in gene regulation (i.e., effect on chromatin binding or gene regulation), we compared their literature-curated and predicted/inferred measures of TF activity. As expected, we observe a strong relationship between the number of literature-curated functional phosphosites in PhosphoSitePlus [51] and curated target genes of a TF from TRRUST [16] (Figure 5A).
This relationship is weaker, yet still visible, when comparing functional phosphosites with the number of TF binding sites measured by ChIP-seq [75] (Figure 5B). In contrast, comparing the unbiased measures of phosphosites versus inferred targets from DoRothEA [13] reveals an inverse relationship (Figure 5C), and no relationship is observed with predicted binding sites from HOCOMOCO [64] (Figure 5D).
Overall, this indicates that a small subset of TFs has been studied extensively both in terms of their involvement in signaling pathways (approximated by functionally annotated phosphosites) and in gene regulation (approximated by the number of curated target genes [...]

MULTIOMICS STUDIES CAN EXPLOIT TFs AS BRIDGE BETWEEN SIGNALING AND GENE REGULATION
One group of approaches uses causal reasoning to integrate transcriptomics or, on occasion, phosphoproteomics data with prior knowledge-based protein interaction networks and pathways to infer upstream signaling regulation of the gene networks that are represented in the transcriptomics data [121][122][123][124][125][126][127]. Common to the cited studies is the dependence on a reliable annotation of TFs with their target genes, highlighting once more the pivotal role of TFs in understanding the connection of signaling and gene regulation. CARNIVAL [126], in particular, explicitly estimates TF and pathway activities [128,13] from the gene expression data using prior knowledge, which improves the performance of the method compared to other causal reasoning approaches. Another recent approach, named KPNN (knowledge-primed neural network), uses prior knowledge about signaling and regulatory interactions as a constraint on the architecture of the hidden layers in a neural network, which marginally improved the predictions yet greatly increased stability and biological interpretability of the neural networks [125]. Extending beyond signaling, NicheNet focuses on understanding cell-cell interactions by combining various data sources to create a ligand-to-target gene network, and integrating prior knowledge on signaling and GRNs [122]. It calculates a ligand-target gene interaction score and, based on this, predicts cell-cell interactions.
A complementary approach relying on chromatin accessibility-inferred TF activity to approximate signaling has used a combination of RNA-seq and chromatin accessibility to generate a cell-type specific regulatory network [118] and projected TF activity onto the network. This revealed an enhancer-driven remodeling, which primed patient-derived cells to an aberrant response to TGFβ signaling in pulmonary arterial hypertension. A combination of both causal reasoning and direct use of accessibility data is used in methods like CellOracle [129] and Inferelator 3.0 [130], which use chromatin accessibility data to define the cell-type specific prior regulatory network that is then refined by training a regularized linear regression model on gene expression data.
Other approaches explicitly integrate (phospho)proteomics and transcriptomics datasets and/or other omics data to identify multilayer components that can explain the observed differences across datasets or provide integrated signaling-gene regulation networks that represent the cell function in the studied condition or patient. For example, [131] adapted the TieDIE signaling network diffusion algorithm [132] to phosphoproteomics and transcriptomics data to find druggable kinase pathways. Based on the two datasets they inferred differential TF activity and kinase activity in metastatic castration-resistant prostate cancer patients and proposed new therapeutic targets, which would have been missed in each dataset on its own.
Another study investigated the differentiation of mouse embryonic stem cells into neurons using RNA-seq, ATAC-seq, proteomics and protein-interaction experiments [78]. The authors established the relative importance of each regulatory layer along neuronal differentiation and could show that changes in chromatin accessibility preceded changes in RNA and protein abundance. Furthermore, they uncovered a new role of the canonical pluripotency TF SOX2 as a regulator in differentiated neurons, which was only possible by integrating information from the protein-interaction, chromatin accessibility and gene expression layers.
These studies, together with several others, highlight the additional insights that can be obtained by integrating data of the signaling, epigenetic and gene regulatory layers. [...] TFs upon signaling activation. For more detailed reviews we refer to [133][134][135] and [136].

OUTLOOK/DISCUSSION
TFs provide an excellent anchor for linking signaling studies with gene regulation. Given the importance of the currently well-studied TFs for cell function, the large number of understudied TFs and our poor understanding of the interplay of these regulatory layers, a focus on functionally annotating TFs in the context of cell signaling and gene regulation is likely to prove transformational for our understanding of cell functions.

ACKNOWLEDGMENTS
Open access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
Data for generating the figures are provided in the supplementary tables.