Advances in computational and experimental approaches for deciphering transcriptional regulatory networks

Understanding the influence of cis‐regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell‐type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression‐based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR‐Cas9‐based screening, which have significantly contributed to understanding TF binding preferences and cis‐regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis‐regulatory logic is analyzed. These computational advances have far‐reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.


INTRODUCTION
Gaining a comprehensive understanding of the gene regulatory code (Figure 1) has been a challenge. [3] Several factors contribute to this complexity, including binding site degeneracy, DNA structural constraints, complex [...] regulatory networks (GRNs). Expression-based methods employ gene expression matrices derived from transcriptome sequencing and computational methods such as correlation metrics, probabilistic methods, and regression algorithms. [5] Alternatively, machine learning approaches frame network construction as a binary classification problem and use methods such as support vector machines and decision trees. [6] Although expression-based approaches are more straightforward and intuitive, they cannot model TF binding specificity because they omit sequence information. To address this limitation, alternative methods scan genomic sequences near transcription start sites (TSSs) to identify TF binding sites (TFBSs). Motif analysis and chromatin data can help to model sequence specificity for transcriptional regulatory networks (TRNs). [7]
FIGURE 1 Gene regulatory elements. (A) Gene regulatory elements include enhancers and promoters. They play crucial roles in modulating the transcriptional activity of genes. (B) The structural components of the eukaryotic gene include the upstream regulatory sequence, which includes the promoter region responsible for initiating transcription, the open reading frame (ORF) where the protein-coding sequence resides, and the downstream regulatory sequence involved in post-transcriptional regulation. (Created with BioRender.com).
To systematically study genome-wide TF binding, researchers have employed various experimental methods, such as ChIP-seq (chromatin immunoprecipitation followed by sequencing), CUT&Tag (cleavage under targets & tagmentation), CUT&RUN (cleavage under targets & release using nuclease), [8] and DNA footprinting with DNase I. [9,10] In vitro techniques, like systematic evolution of ligands by exponential enrichment (SELEX) sequencing, allow researchers to derive binding motifs by examining TFBSs. [11,12][15][16] Large-scale, multicenter initiatives, such as the ENCODE consortium, have produced comprehensive genome-wide information maps for TFs across diverse cell types. [17] These experimental approaches have led to the development of algorithms for inferring TF binding preferences at sequence motifs and genomic loci. [20][21] High-throughput assays, such as massively parallel reporter assays (MPRAs), can examine a multitude of cis-regulatory elements in parallel. [2,22,23]
Machine learning techniques and artificial intelligence (AI) have been instrumental in deciphering regulatory relationships. Hierarchical layers within convolutional neural network (CNN) architectures capture increasingly complex features from the input data and facilitate a more comprehensive understanding of cis-regulatory sequences. Subsequently, machine learning-based tools have been used to identify TFBSs, predict promoters, enhancers, and their interactions, and to infer the cis-regulatory grammar in DNA sequences.
Despite the potential of these techniques, their intricate structures and numerous parameters can impede interpretability. Ongoing efforts to build comprehensive catalogs of functional regulatory elements promise to facilitate the development of new machine learning-based tools, improve understanding of the gene regulatory code, and elucidate how specific variants can lead to unique phenotypes. [24] This review will delve into current experimental and algorithmic advancements, their challenges, and their potential to provide insights into cis-regulatory elements and their roles in gene regulation.

TRANSCRIPTION FACTOR NETWORKS
[27][28][29] These networks help identify downstream targets of TFs, support comparative analyses across developmental stages and disease conditions, and locate master regulator TFs, which drive cell differentiation by controlling numerous downstream genes. Despite the significance of transcription factor networks (TFNs), a standardized terminology is lacking in this field, with terms like gene co-expression networks (GCNs), GRNs, and TRNs used interchangeably. [30] In this review, GCNs represent gene expression correlations with undirected edges, while GRNs use directed edges for regulatory interactions (Figure 2). TRNs are specialized GRNs with edges originating exclusively from TFs, representing transcriptional regulation by TF gene-encoded proteins. [30] This terminological distinction aims to reduce ambiguity when categorizing, comparing, and benchmarking network methodologies, although it is worth noting that most GRN construction methods can infer TRNs by restricting regulator genes to TF genes.
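As an illustration of the terminology above, the following minimal Python sketch stores a GCN as a set of undirected edges and a GRN as a set of directed edges, and derives a TRN by keeping only the edges whose regulator is a TF gene. All gene names, the TF list, and the helper name `restrict_to_tfs` are hypothetical placeholders chosen for illustration; they do not come from any dataset or published tool.

```python
# Illustrative sketch only: gene names and edges are invented placeholders.

# GCN: undirected co-expression edges (a frozenset ignores edge direction).
gcn_edges = {frozenset(("TF_A", "TF_B")), frozenset(("TF_B", "GENE_C"))}

# GRN: directed regulatory edges stored as (regulator, target) tuples.
grn_edges = {
    ("TF_A", "TF_B"),
    ("TF_B", "GENE_C"),
    ("KINASE_X", "TF_A"),  # hypothetical regulator that is not a TF
}

# Hypothetical set of genes annotated as transcription factors.
tf_genes = {"TF_A", "TF_B"}


def restrict_to_tfs(edges, tfs):
    """Return the TRN: directed edges whose regulator is a TF gene."""
    return {(regulator, target) for regulator, target in edges if regulator in tfs}


trn_edges = restrict_to_tfs(grn_edges, tf_genes)
print(sorted(trn_edges))
```

In this representation, the edge from `KINASE_X` remains part of the GRN but is excluded from the TRN because its regulator is not a TF.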
There are more than one thousand TF genes within the human genome, [31] and each of these genes can regulate the expression of hundreds to thousands of downstream genes through transcriptional control, resulting in an extensive network with millions of potential edges. Computational methods are essential for constructing and analyzing complex TRNs, and a multitude of approaches rely on supervised and unsupervised learning from next-generation sequencing data using both expression and sequence data (Table 1).
It is worth noting that evolving single-cell technologies have had a considerable impact on building TRNs. [54][55][56][57][58][59] Single-cell technologies present tremendous opportunities for constructing TRNs. Whereas bulk data usually provide tens of samples as data points, single-cell data provide thousands of cells as individual data points. This roughly hundredfold increase in the number of observations provides an enormous advantage for training machine learning methods, opening the door for employing a vast number of supervised learning algorithms, including many different deep learning architectures.
Methodologies developed for bulk data cannot be directly used to infer TRNs from single-cell data, because single-cell data pose unique challenges that do not exist in bulk data. [52,53,59] Dropout events lead to data sparsity, which can hinder data processing. This is generally addressed by data imputation, though this approach can sometimes introduce false interactions. [60] As an additional challenge, the stochastic nature of gene expression in individual cells increases noise, which obscures the biological signal. This undesired effect gradually diminishes as the size of the data grows. [57] However, the immense increase in data size does pose challenges in terms of computational resources, as bulk methods that are tested on tens of samples can fail to complete on a single-cell dataset that includes thousands of cells. Typical single-cell methods tackle this obstacle by reducing the number of genes through filtration of gene expression values. However, this approach carries the potential risk of missing information in the output. [57] Hence, single-cell-based TRN methods need to be designed efficiently to handle these issues and effectively utilize this specific type of data.
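The sparsity and gene-filtering issues described above can be made concrete with a small simulation. The sketch below is a toy illustration using only the Python standard library; the count model, the dropout rate, the detection threshold, and the function names are assumptions made for this example rather than the procedure of any specific tool.

```python
import random

random.seed(0)

n_cells, n_genes = 200, 50

# Toy "true" expression: each gene has its own mean level (hypothetical values).
gene_means = [random.uniform(0.5, 10.0) for _ in range(n_genes)]
true_counts = [
    [max(0, round(random.gauss(mean, 1.0))) for mean in gene_means]
    for _ in range(n_cells)
]


def apply_dropout(counts, rate):
    """Randomly zero out entries to mimic technical dropout events."""
    return [[0 if random.random() < rate else v for v in row] for row in counts]


def fraction_zero(counts):
    """Fraction of matrix entries that are zero (a simple sparsity measure)."""
    total = sum(len(row) for row in counts)
    zeros = sum(v == 0 for row in counts for v in row)
    return zeros / total


def filter_genes(counts, min_frac_cells):
    """Keep indices of genes detected (count > 0) in at least min_frac_cells of cells."""
    n = len(counts)
    keep = []
    for g in range(len(counts[0])):
        detected = sum(row[g] > 0 for row in counts)
        if detected / n >= min_frac_cells:
            keep.append(g)
    return keep


observed = apply_dropout(true_counts, rate=0.6)
kept = filter_genes(observed, min_frac_cells=0.3)
print(f"zeros before: {fraction_zero(true_counts):.2f}, after: {fraction_zero(observed):.2f}")
print(f"genes kept: {len(kept)} of {n_genes}")
```

Raising `min_frac_cells` shrinks the gene set and the computational cost, at the price of discarding lowly detected genes, which mirrors the trade-off noted in the text.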

Expression-based methods for constructing TRNs
[34] The networks that are based on transcriptional correlation in this context are classified as co-expression networks. One of the earliest and most widely used co-expression network methods is WGCNA, [61] which not only builds gene networks but also identifies modules, which are highly correlated clusters of genes. WGCNA's high-dimensional counterpart, hdWGCNA, [62] derives gene networks and modules from single-cell genomics and spatial transcriptomics data, using metacells, which are defined as groups of cells that are transcriptionally similar to one another.
One drawback of co-expression networks is that they lack the ability to establish directional relationships, which is a necessary feature for identifying the downstream targets of TFs in TRNs. Additionally, they cannot distinguish between direct and indirect interactions, making them susceptible to false positives. To mitigate this, alternative approaches employ partial correlations to model gene expression [...]
Another limitation of correlation metrics is that they capture only linear dependencies, whereas gene expression data often contain nonlinear relationships such as feedback loops and sigmoidal responses.
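The two limitations of correlation-based networks discussed above can be illustrated with simulated data. The following sketch, written with the Python standard library only, builds a toy regulatory chain X -> Y -> Z and compares the Pearson correlation of X and Z with their first-order partial correlation given Y; it then shows that a purely quadratic dependence yields a weak linear correlation. The simulated variables, noise levels, and function names are assumptions made for this example.

```python
import math
import random

random.seed(1)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def partial_corr(r_xz, r_xy, r_yz):
    """First-order partial correlation of x and z given y."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


n = 2000
# Toy chain X -> Y -> Z: X and Z are connected only through Y.
x = [random.gauss(0, 1) for _ in range(n)]
y = [a + random.gauss(0, 0.5) for a in x]
z = [b + random.gauss(0, 0.5) for b in y]

r_xy, r_yz, r_xz = pearson(x, y), pearson(y, z), pearson(x, z)
r_xz_given_y = partial_corr(r_xz, r_xy, r_yz)
print(f"corr(X, Z) = {r_xz:.2f}")                 # strong, although there is no direct edge
print(f"partial corr(X, Z | Y) = {r_xz_given_y:.2f}")  # close to zero

# Nonlinear dependence: W is fully determined by X, yet linear correlation is weak.
w = [a ** 2 for a in x]
r_xw = pearson(x, w)
print(f"corr(X, X^2) = {r_xw:.2f}")
```

Conditioning on the intermediate gene removes most of the indirect association, which is the rationale for partial-correlation approaches, whereas the quadratic case shows the kind of dependence that a linear metric can miss.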
In contrast, mutual information (MI) can capture both linear and nonlinear dependencies. [37][38][39] An additional advantage of MI over correlation is that it does not assume a specific distribution of data. However, as a symmetric measure, it still cannot infer the directionality of gene interactions. In addition, because MI is designed to work with discrete data, continuous gene expression values cannot be used directly to calculate it. [58] Hence, the continuous data must be discretized into bins, [68] which can be computationally demanding. [69] More importantly, MI has no predefined bounds: unlike correlation, its range depends on the data distribution, which can make interpretation of the results difficult. Moreover, MI cannot distinguish direct from indirect associations, leading to increased levels of false positives. [70]
[45][46] These regression-based methods model gene expression using multiple regression models, treating the expression of each gene as the dependent variable, influenced by the expression of other genes as independent variables. To simplify the model, they apply regularization penalties to encourage coefficients to approach zero. An alternative approach involves constructing a regression tree to predict gene expression and determine the weight of regulator genes based on their significance within the tree. [71,72] However, this approach can be computationally expensive, leading to longer runtimes as data size increases. To address this challenge, gradient boosting machines have been employed to enhance performance. [51] Recently, a balanced approach for deriving TRNs was developed that combines random forest (RF), extra tree, and support vector regressors through an ensemble regression technique. [73]
TABLE 2 Deep learning methods for TRN construction.
Method; Input data; Learning type
Graph attention neural networks; Expression; Supervised
Structural equation modeling [86][87][88]; Expression; Supervised
Transformer learning model [89]; Expression; Supervised
Denoising diffusion probabilistic models [90]; Expression; Supervised
[42] These Bayesian network methods model each gene's expression as a conditional probability given its parent genes (regulators) and the expression of all genes as a joint probability distribution of the individual distributions. Similar to regression-based methods, Bayesian networks inherently infer directionality. However, a significant limitation is their assumption that the network topology is a directed acyclic graph, which reduces their ability to model loops in the networks. This limitation is critical because TFs can regulate themselves and other TFs in various pairwise or higher-order structures. [74,75]
As an alternative to statistical techniques, various supervised machine learning methods have been applied to predict TRNs from transcriptome data. [47,48] These methods frame TRN construction as a binary classification task, aiming to identify TF-to-target interactions using the expression profiles of both genes. Support vector machine (SVM)-based methods classify these interactions either directly, based on the expression of both the TF and the target gene, [47] or indirectly, by converting the data into graph distance profiles for input to the kernel function. [49] Transfer learning extends the utility of SVM-based methodologies by training models on one organism and transferring knowledge to another. [50] Another variation incorporates positive-only data for training SVM classifiers. [48] A recent study conducted a comprehensive evaluation of supervised learning methods for TRN inference using single-cell expression data. [6] The authors formulated network inference as a binary classification problem and trained SVM, RF, K-nearest neighbor (KNN), naive Bayes, decision tree, and logistic regression algorithms to solve the problem. Notably, these models outperformed unsupervised approaches, with SVM, RF, and KNN emerging as top performers among the supervised methods.
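To make the binary-classification framing concrete, the sketch below trains a small K-nearest neighbor classifier on simulated TF-target pairs using only the Python standard library. It is a toy illustration, not the pipeline of the cited studies: the simulated expression profiles, the two pair-level features (correlation and mean absolute difference), and all function names are assumptions introduced for this example.

```python
import math
import random

random.seed(2)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def make_pair(interacting, n_samples=30):
    """Simulate expression profiles for one (TF, target) pair."""
    tf = [random.gauss(0, 1) for _ in range(n_samples)]
    if interacting:
        target = [0.8 * v + random.gauss(0, 0.5) for v in tf]
    else:
        target = [random.gauss(0, 1) for _ in range(n_samples)]
    return tf, target


def features(tf, target):
    """Assumed pair features: correlation and mean absolute difference."""
    mad = sum(abs(a - b) for a, b in zip(tf, target)) / len(tf)
    return (pearson(tf, target), mad)


def make_dataset(n_pairs):
    """Alternate negative (0) and positive (1) pairs with their features."""
    data = []
    for i in range(n_pairs):
        label = i % 2
        data.append((features(*make_pair(bool(label))), label))
    return data


def knn_predict(train, query, k=5):
    """Majority vote among the k nearest labeled pairs (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


train, test = make_dataset(200), make_dataset(100)
accuracy = sum(knn_predict(train, f) == label for f, label in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice, the labeled pairs would come from a curated set of known interactions, and the quality of that reference set largely determines what a supervised model can learn.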

Deep learning methods for constructing TRNs
Different variations of deep learning architectures have also been employed to infer TRNs, [76] and most use the BEELINE [77] evaluation framework for benchmarking (Table 2). CNNs, known for their unprecedented accuracy in image classification, have been modified for TRN construction. [78] This adaptation involves converting gene expression data into image representations based on normalized empirical probability function matrices and training the classifiers with these images. A hybrid method combining CNNs and recurrent neural networks has been introduced; it takes into account both correlations and pseudotimes for deriving TRNs from single-cell RNA-Seq data. [79] Another approach uses 3D CNNs to predict regulatory interactions from the expression of gene triplets to reduce the effect of noise and dropout. [80]
Structural equation modeling (SEM) techniques, which are used to derive the relationships between observed and latent variables, are also employed in a deep learning context for TRN inference. DeepSEM adapts SEM for inferring regulatory relationships using single-cell gene expression data. [88] DAZZLE follows a similar approach to DeepSEM but introduces a Dropout Augmentation step, which adds random zero values to the expression data during training to increase robustness. [89] MetaSEM couples SEM with a meta-learning approach for learning high-dimensional data features. [90]
Graph neural networks (GNNs) are also utilized for unsupervised network construction. GenKI frames network construction as an unsupervised learning problem by adapting a variational graph autoencoder. [83] GRGNN also follows an unsupervised approach using GNNs, in which the network construction task is handled as a graph classification problem. [84] DeepRIG also utilizes GNNs for TRN construction, but it does so after completing a prior co-expression network building step. [85] Another GNN-based approach is GRINCD, which first generates a graph representation of each gene and then applies the additive noise model to predict causal regulation. [86] GeneLINK uses a graph attention network (GAT) model, which is a specific type of GNN, to infer TRNs from incomplete prior networks through link prediction. [87]
Other deep learning models have also been applied to the TRN inference problem. One of them is STGRNS, which formulates the network inference problem as a binary classification task and employs the transformer architecture, a deep learning model widely used in language models. [91] RegDiffusion frames GRN construction as a supervised regression task and uses Denoising Diffusion Probabilistic Models. In contrast to DAZZLE, RegDiffusion adds Gaussian noise to the data, and the machine learning model is trained to predict the added noise. [92]

Using epigenomic data for constructing TRNs
Expression-based methods for deriving TRNs are frequently validated using relatively simple organisms such as Escherichia coli or yeast, primarily because they are easy to genetically engineer and can facilitate the elegant construction of ground truth networks. However, when these methods are extended to complex organisms, independent benchmarking exposes a notable decrease in accuracy, which can be ascribed to the impact of epigenetic regulation via DNA and chromatin modifications in complex organisms (Figure 1A), a factor not accounted for in expression-based methods. [93,94] Therefore, the cell's epigenetic landscape must be integrated into network models.
Prior studies have used histone modification data [95] to achieve high accuracy at the bulk level, surpassing methods relying on gene expression. [98] To identify TFBSs in accessible chromosomal regions, epigenomic methods incorporate motif or footprinting analysis and link TFBSs to their target genes with proximity or correlation-based approaches. Despite the increased cost associated with using two data modalities, recent experimental approaches offer a potential solution by profiling both gene expression and chromatin accessibility together to generate multiomic data. [99,100] For example, the multiomic-based SCENIC+ employs a modified version of cisTopic, which uses latent Dirichlet allocation, to calculate motif enrichment scores along putative enhancer regions and identify cis-regulatory enhancers and TF-to-target relationships. [82] scREG also uses multiomic data but distinguishes itself by using non-negative matrix factorization (NMF) to derive a lower-dimensional representation for calculating cluster-specific regulatory scores. [97] [...]eNA couples motif analysis with decision tree (DT) regressions to construct TRNs from single-cell multiomic data, [96] while DIRECT-NET uses motif analysis and gradient boosting machines (GBM) to identify cis-regulatory enhancers and TF-to-target interactions. [96,98]
Self-organizing maps (SOMs) are also utilized to construct TRNs with multiomic data. [81] In this deep learning technique, separate SOMs are generated for single-cell RNA-Seq and ATAC-Seq data and then integrated via a linking function to build the regulatory network based on motif enrichment. Additional single-cell multiomics TRN construction methods include Dictys, [101] which performs footprint analysis with single-cell ATAC-Seq data to infer TF-to-target relationships; these are further refined by stochastic process modeling of regulatory relationships based on single-cell RNA-Seq data. Finally, scMEGA [102] integrates scRNA-Seq and scATAC-Seq data and associates chromatin accessibility peaks with genes based on signal correlation; it uses TF binding information to link TFs to their target genes.
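The proximity-based linking step mentioned in this section can be sketched in a few lines. The example below assumes that motif or footprint analysis has already produced a list of accessible regions annotated with a TF, and it links each TF to genes whose TSS lies within a fixed window of the region midpoint. The coordinates, gene and TF names, the 50-kb default window, and the function name `link_by_proximity` are hypothetical choices for this illustration, not parameters of any cited method.

```python
from collections import defaultdict

# Hypothetical accessible regions with a motif match: (chrom, start, end, TF).
motif_peaks = [
    ("chr1", 9_500, 10_100, "TF_A"),
    ("chr1", 250_000, 250_400, "TF_B"),
    ("chr2", 4_800, 5_200, "TF_A"),
]

# Hypothetical transcription start sites: (chrom, position, gene).
tss_list = [
    ("chr1", 12_000, "GENE_1"),
    ("chr1", 400_000, "GENE_2"),
    ("chr2", 5_000, "GENE_3"),
]


def link_by_proximity(peaks, tss_sites, window=50_000):
    """Link each TF to genes whose TSS lies within `window` bp of the peak midpoint."""
    tss_by_chrom = defaultdict(list)
    for chrom, pos, gene in tss_sites:
        tss_by_chrom[chrom].append((pos, gene))

    edges = set()
    for chrom, start, end, tf in peaks:
        midpoint = (start + end) // 2
        for pos, gene in tss_by_chrom[chrom]:
            if abs(pos - midpoint) <= window:
                edges.add((tf, gene))
    return edges


trn_edges = link_by_proximity(motif_peaks, tss_list)
print(sorted(trn_edges))
```

A correlation-based alternative would replace the distance test with a test on the association between peak accessibility and gene expression across cells.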

FUNCTIONAL EXAMINATION OF CIS-REGULATORY ELEMENTS
Due to the aforementioned ambiguity regarding gene regulatory elements, it remains challenging to link variants in regulatory sequences with functional outcomes. [103] Although assays like DNase-seq and ChIP-seq provide comprehensive genome-wide regulatory maps, they do not offer a functional readout for the identified sequences. [2] Recent advancements have allowed for a more systematic and large-scale examination of the functions of cis-regulatory elements, and there are two main groups of methods: (1) massively parallel reporter assays (MPRAs) and self-transcribing active regulatory region sequencing (STARR-seq) and (2) CRISPR-Cas9 techniques (Figure 3).

MPRAs and STARR-seq in gene regulation research
[107] In MPRAs, candidate regulatory sequences are linked to unique barcodes and incorporated into classic promoter or enhancer reporter vectors, enabling the regulatory element to drive its own transcription and that of the associated barcode. Subsequently, the barcode's expression is measured by next-generation sequencing and normalized to the element's DNA abundance to represent cis-regulatory activity. [108] However, traditional MPRAs are conducted outside of the genome in an episomal manner and may not fully represent the in vivo functions of regulatory elements. [2] To address this limitation, novel approaches, such as the lentivirus-based MPRA (LentiMPRA), integrate regulatory elements into the chromosomal context of the genome. [109] In one study, LentiMPRA was employed to compare the functional activities of 2,236 candidate liver enhancers in episomal versus chromosomally integrated contexts, revealing significant differences. These findings have broad implications for the identification, prioritization, and functional validation of cis-regulatory elements. Additionally, MPRAs using adeno-associated virus (AAV) vectors offer solutions to mitigate this concern. [110,111] Klein et al. performed a systematic comparison of MPRA experimental designs and found that sequence length had the greatest effect on the results of MPRAs, followed by assay design and then orientation. [106]
Recent research with MPRAs has advanced understanding of disease mechanisms, regulatory elements, and the functional consequences of genetic variation in the noncoding regions of the human genome. Studies have spanned a wide range of health-related topics, including human traits, [112] vascular disease, [113] inherited retinal degeneration, [114] schizophrenia, [115] Alzheimer's disease, [115] osteoporosis, [116] and early human neurodevelopment. [117] To understand the functional consequences of more than 30,000 single [...] mutagenesis using MPRAs. [118] The resulting dataset of functional measurements for potentially disease-causing regulatory mutations has emerged as a comprehensive resource for the development of predictive tools. The findings underscore the potential of MPRAs for identifying new biomarkers and promising therapeutic targets for diseases including cancer. [119] Recent studies have also highlighted the role of MPRAs in identifying cis-regulatory elements that are evolutionarily conserved [120] or essential for pluripotency. [121]
Although traditional MPRAs require serial assays across different cell types, single-cell MPRAs (scMPRAs) concurrently measure the activity of cis-regulatory sequences at the single-cell level, while also identifying cell identities through transcriptomes. [122] This is accomplished using a two-level barcoding scheme that measures reporter gene copy numbers in single cells based on mRNA. By employing complex random barcodes (rBC) and specific cis-regulatory sequence (CRS) barcodes (cBC), this method minimizes repetition of cBC-rBC pairs within the same cell, ensuring precise measurements of cis-regulatory sequence activity across different cell populations with varying input abundances. The potential of scMPRA to assess subtle genetic variations in cis-regulatory sequences across various cell types was demonstrated in live mouse retinas. [122]
STARR-seq is a type of massively parallel reporter assay that examines putative transcriptional enhancers based on their activity in fragments derived from across the entire genome. [123,124] DNA fragments are cloned downstream of a core promoter and into the 3′ untranslated region of a reporter gene. Active enhancers within these fragments drive transcription and become part of the resulting reporter transcripts, enabling the simultaneous testing of millions of DNA sequences in a complex reporter library. Key features of STARR-seq include its independence from the location of candidate sequences within the genome and its avoidance of position effects that are typically associated with random genomic integration. It provides a quantitative measure of enhancer activity and can generate genome-wide cell type-specific maps of enhancer activity. Although STARR-seq does not directly measure enhancer activity within its endogenous chromatin environment, it identifies functional enhancers that overlap accessible chromatin and bear typical histone modifications, suggesting functionality within their endogenous context. [23] Additionally, it can detect enhancer activity within inaccessible chromatin regions marked by specific histone modifications, which can provide insights into chromatin-mediated silencing mechanisms in gene regulation.
STARR-seq has allowed for the identification and examination of novel enhancers in humans, animals, [124] and plant genomes [125] and for further characterization of genome-wide enhancer-promoter interactions. [126] Whole Human Genome STARR-seq (WHG-STARR-seq) is a powerful method for assessing enhancer activity across the entire human genome and has been utilized to identify active enhancers in open chromatin regions and potentially functional enhancers in inaccessible chromatin regions. [127] Additionally, STARR-seq facilitates high-resolution mapping of tissue-specific regulatory elements, including enhancers with highly biased activity towards the dorsal raphe nucleus in the brain. [128] STARR-seq has also helped researchers understand how alterations in regulatory elements can contribute to diseases like coronary artery disease; [129] this has offered insights into potential drug targets within regulatory regions.
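The barcode normalization described above for MPRAs, in which RNA barcode abundance is compared with the DNA abundance of the same element, can be sketched as follows. This is a simplified illustration under stated assumptions: the counts are invented, a pseudocount of 1 is added to avoid division by zero, barcodes are summarized by the median, and library-size normalization between RNA and DNA samples is omitted. The function name `mpra_activity` is hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical barcode counts: (element, barcode, RNA count, DNA count).
barcode_counts = [
    ("enhancer_1", "AAAC", 400, 100),
    ("enhancer_1", "CCGT", 380, 95),
    ("enhancer_1", "GGTA", 820, 200),
    ("control_1", "TTAG", 100, 100),
    ("control_1", "ACGT", 55, 50),
]


def mpra_activity(records, pseudocount=1.0):
    """Per-element activity: median over barcodes of log2(RNA / DNA)."""
    ratios = defaultdict(list)
    for element, _barcode, rna, dna in records:
        ratios[element].append(math.log2((rna + pseudocount) / (dna + pseudocount)))
    return {element: median(values) for element, values in ratios.items()}


activity = mpra_activity(barcode_counts)
for element, score in activity.items():
    print(f"{element}: log2(RNA/DNA) = {score:.2f}")
```

Using several barcodes per element and a robust summary such as the median limits the influence of any single barcode with unusual behavior.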

CRISPR-Cas9 in gene regulation research
[16] The CRISPR-Cas9 gene-editing system selectively modifies or deletes specific cis-regulatory elements, and researchers can discern functional importance by observing the impact on gene expression. In gene regulation research, CRISPR-Cas9 has been utilized for screens, knockout/knock-in studies, CRISPR activation (CRISPRa), CRISPR interference (CRISPRi), and epigenetic studies.
Studies have highlighted the value of CRISPR screens in general in identifying cis-regulatory elements associated with disease states, [130][131][132][133] including neurodevelopmental disorders and neurodegenerative diseases, [134,135] and host-pathogen interactions for viruses such as SARS-CoV-2, which causes COVID-19. [136] These screens have also pinpointed specific TFs as potential immunotherapeutic targets in various cancer types [137] and have been used to investigate the connection between regulatory elements and drug resistance or sensitivity in cancer cells. [138]
Pooled CRISPR screens, like Perturb-seq, couple genetic perturbations with single-cell transcriptomics to study the effects of these perturbations at the single-cell level. However, their limitations include prohibitive costs and challenges in efficiently measuring lowly expressed genes and small effects. [139,140] To address these drawbacks, targeted Perturb-seq (TAP-seq) amplifies specific genes of interest, significantly increasing the scalability and cost-effectiveness of single-cell genetic and functional CRISPR screens, while also providing improved sensitivity for detecting gene expression changes. [140,141]
CRISPR interference (CRISPRi), which selectively inhibits target gene expression using catalytically inactive Cas9 protein, provides insights specifically into the effects of gene silencing on regulatory networks. To functionally validate distal cis-regulatory elements and link them to their target genes, researchers have combined CRISPRi and RNA-seq. [16,139,142,143] One study introduced CRISPRi-FlowFISH, which combines CRISPRi with RNA fluorescence in situ hybridization and flow cytometry. [144] By applying this technique to a large dataset of potential enhancer-gene connections, researchers developed an Activity-by-Contact (ABC) model; this model significantly improves predictions regarding enhancer-gene connections and offers a systematic way of mapping and predicting relationships based on chromatin state measurements. Over the past year, CRISPRi experiments have investigated regulatory element pathways important in various cancers, [145][146][147][148][149] neurodegenerative diseases, [150,151] systemic lupus erythematosus, [152] and asthma; [153] these and similar studies highlight CRISPRi's potential for therapeutic purposes.
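The idea behind the ABC model can be sketched with a small calculation. In its commonly described form, the score for an element-gene pair is the product of the element's activity and its contact frequency with the gene's promoter, divided by the sum of the same product over all candidate elements near that gene. The sketch below assumes this form; the element names, activity values, and contact values are invented, and the function name `abc_scores` is hypothetical rather than part of any released software.

```python
# Hypothetical candidate elements near one gene: an activity value (for example,
# derived from chromatin accessibility and histone acetylation signal) and a
# contact frequency with the gene's promoter.
elements = {
    "enh_1": {"activity": 8.0, "contact": 0.50},
    "enh_2": {"activity": 4.0, "contact": 0.25},
    "enh_3": {"activity": 1.0, "contact": 1.00},
}


def abc_scores(candidates):
    """Each element's share of the summed activity x contact for the gene."""
    products = {name: e["activity"] * e["contact"] for name, e in candidates.items()}
    total = sum(products.values())
    return {name: value / total for name, value in products.items()}


scores = abc_scores(elements)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: ABC score = {score:.3f}")
```

Because the scores for one gene sum to 1, each value can be read as the predicted relative contribution of that element, so a weak but frequently contacting element can score the same as a stronger, more distant one.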
Although CRISPRi experiments have led to advancements in cis-regulatory research, the vast majority of studies have screened regulatory elements for necessity and not for sufficiency in the endogenous context. [154][161][162][163] However, one limitation of CRISPRa research thus far is that it has primarily been used in an ad hoc manner in workhorse cancer cell lines, not in therapeutically relevant in vitro models. [154] In response to this, Chardon et al. have introduced an experimental framework that combines multiple gRNA perturbations with scRNA-seq and allows for large-scale screening of gRNAs that activate therapeutically relevant genes in a cell type-dependent manner. [154]
Additionally, CRISPR-Cas9-based epigenome editing, specifically with dCas9-KRAB, has shown potential to induce targeted epigenetic changes in regulatory elements in the native genomic context. [164] Researchers successfully silenced multiple globin genes by directing H3K9 trimethylation to the HS2 enhancer, demonstrating precise disruption of enhancer activity and gene expression without significant off-target effects. [165] Klann et al. used the dCas9-KRAB repressor, as well as the dCas9-p300 activator, with lentiviral single guide RNA libraries to perform loss- and gain-of-function screens targeting DNase I hypersensitive sites near specific genes, successfully identifying known and novel regulatory elements. [166] Kabadi et al. also used dCas9-KRAB and dCas9-p300, but investigated Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene regulation across 18 genomic regions, identifying enhancers and repressors and showing potential for enhancing CFTR expression as a therapeutic strategy for cystic fibrosis. [167]

APPLYING MACHINE LEARNING TECHNIQUES TO DECIPHER THE CIS-REGULATORY CODE
Architectures including CNNs, recurrent neural networks (RNNs), and transformers have emerged as powerful machine learning tools for research on gene regulation (Figure 4). Studies have focused on identifying TFBSs and regulatory elements, as well as deciphering the rules for cis-regulatory grammar.

Methods for transcription factor binding identification
Numerous machine learning tools have been developed for identifying TFBSs. [170][171][172] However, these methods have faced challenges related to model interpretability and performance in capturing complex regulatory relationships. In response to these limitations, one algorithm entitled TF-MoDISco [173] leverages explainability techniques, such as DeepLIFT, [174] and uses per-base importance scores to consolidate motifs learned by the network into a nonredundant set, subsequently improving the interpretability of the motifs.
Recognizing the need for an interpretable method that considers TF cooperativity and is based on genome-wide experimental data, BPNet, a CNN, was developed to model the relationship between cis-regulatory sequences and TFBSs at base resolution. [175] It successfully identified composite TFBSs, indirect binding footprints, and TFBS periodicity patterns, with CRISPR-validated motif syntax. Another machine learning framework, AgentBind, utilizes CNNs to predict TF binding, but also assesses the importance of sequence context features. [176] Ultimately, this model highlights the importance of training data quality and of features such as open chromatin for prediction accuracy.
In order to apply deep learning to massively parallel sequencing data, not just sequential data, DeepGRN was developed next. It combines single and pairwise attention modules and utilizes an attention mechanism to capture long-range dependencies from DNA sequences and associated data. [177] More recently, DeepSTF, a unique deep learning architecture for predicting TFBSs that integrates shape and DNA sequence profiles, was designed; it utilizes stacked CNNs and a novel transformer encoder structure. [178] Lastly, to provide user-friendly TFBS prediction from ATAC-seq data, maxATAC, a large suite of deep neural network models, was established. [179]
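The CNN-based tools in this section share a common first step: the DNA sequence is one-hot encoded, and convolutional filters that behave like motif detectors are slid along it. The sketch below illustrates that step with the Python standard library, using a hand-written log-odds filter in place of a learned one. The motif probabilities, the example sequence, and the function names are assumptions made for this illustration and do not reproduce any of the cited models.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator vectors (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def log_odds_filter(prob_matrix, background=0.25):
    """Convert per-position base probabilities into log-odds filter weights."""
    return [[math.log2(p / background) for p in row] for row in prob_matrix]


def scan(encoded, weights):
    """Slide the filter along the sequence, as a 1D convolution without padding."""
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * weights[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


# Hypothetical 4-bp motif preferring "TGAC" (rows: positions; columns: A, C, G, T).
motif_probs = [
    [0.05, 0.05, 0.05, 0.85],
    [0.05, 0.05, 0.85, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
]

sequence = "AACCTGACGGTT"
scores = scan(one_hot(sequence), log_odds_filter(motif_probs))
best = max(range(len(scores)), key=scores.__getitem__)  # max pooling over positions
print(f"best match at position {best}: {sequence[best:best + 4]}")
```

In a trained network the filter weights are learned from data rather than specified, and deeper layers combine the outputs of many such filters, which is why attribution methods are needed to recover the motifs they encode.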

Identifying promoters, enhancers, and their interactions
Recent advancements in machine learning have focused on computationally predicting promoters, [180,181] their strength, [182,183] and mRNA abundance. [184,185] For instance, iSEGnet, a deep CNN integrating epigenetic modifications and RNA-seq data, identified potential regulatory sites within promoters and transcription termination sites and provided insight into specific epigenetic modifications within regulatory regions. [186]92] To assess the functional effects of human-chimpanzee variants in human accelerated regions, a recent study utilized Sei, a deep CNN model, coupled with lentiMPRA and epigenetic experiments. [193] The findings highlighted nucleotide changes, predicted by TF footprints, as the primary source of differences in human-chimpanzee enhancer activity. Additionally, evidence of compensatory evolution to preserve ancestral enhancer activity was observed, which is significant given links between enhancer-active human accelerated regions and neurodevelopmental genes and neuropsychiatric diseases.
The Enformer model incorporates transformers in addition to CNNs to predict enhancers. [190] Transformers have shown promise in various other fields, such as natural language processing (NLP), where an attention mechanism is used to capture the most relevant information. Although there has been a recent push to favor large-scale attention-based transformer models in this field, some researchers have argued that, despite excellent performance in protein structure prediction, text mining, and genomic data analysis, the quality of transformer models can be overestimated under certain test scenarios. [194,195] Concerns also persist regarding their ability to effectively capture long-range interactions. [194] A relevant development has been LegNet, a CNN for modeling short gene regulatory regions that achieved first rank in predicting promoter expression from a gigantic parallel reporter assay at the DREAM 2022 challenge. [195] The authors highlight that fully convolutional networks should be recognized as a dependable method for computationally modeling short gene regulatory regions and predicting the consequences of regulatory sequence modifications. Ultimately, however, it is critical to remember that the effectiveness of machine learning and AI models hinges on the quality of experimental data; current limitations in wet lab techniques make it difficult to define enhancers precisely across the genome and occasionally lead to poor reproducibility even between replicates of the same experiment.
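One standard way to use a sequence-to-activity model for predicting the consequences of regulatory sequence modifications is in silico mutagenesis: every single-base substitution is scored and compared with the reference prediction. The sketch below shows the procedure under stated assumptions. The `toy_model` function is a hypothetical stand-in for a trained network (it simply counts occurrences of the string `GATA`), and none of the names come from LegNet or any other published tool.

```python
BASES = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained model: counts occurrences of 'GATA'."""
    return float(sum(1 for i in range(len(seq) - 3) if seq[i:i + 4] == "GATA"))


def in_silico_mutagenesis(seq, model):
    """Map (position, alternative base) to predicted change versus the reference."""
    reference = model(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the unmodified base
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = model(mutant) - reference
    return effects


ism_effects = in_silico_mutagenesis("CCGATACC", toy_model)
# Positions whose substitution changes the prediction, in sequence order.
sensitive_positions = sorted({pos for (pos, _), delta in ism_effects.items() if delta != 0})
```

With a real model, `model` would wrap a forward pass, and the resulting effect table could be compared against saturation mutagenesis MPRA measurements, subject to the data quality caveats noted above.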

Deciphering the rules governing cis-regulatory sequences
Researchers have been challenged to decipher cis-regulatory grammars, the binding combinations and patterns that dictate regulatory activity in different cellular contexts. Models, ranging from the flexible TF billboard model [196] to the more stringent enhanceosome model, [197] have attempted to explain how regulatory grammar drives enhancer activity. To investigate the impact of TFBS orientation and order, Georgakopoulos-Soares et al. utilized an extensive lentiMPRA library of 209,440 sequences. [198] Their findings indicated that TFBS orientation significantly impacts gene regulatory activity, especially with multiple copies of the same TFBS. Some TFBSs showed increased expression levels with specific orientations, while others performed best with a balanced proportion of orientations. In addition, the study highlighted that the order in which heterotypic TFBSs are placed can significantly influence gene expression. Ultimately, the research concluded that incorporating TFBS orientation into predictive models enhances their performance and may improve understanding of disease-associated genetic variants.
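Orientation and order can be made explicit as model features by scanning both strands for each motif and recording where, and on which strand, each site occurs. The sketch below is a simplified illustration and not the analysis pipeline of the study above. It assumes exact matches to a consensus string instead of position weight matrices, and the motif names `TF_A` and `TF_B`, their consensus sequences, and the test sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, motifs):
    """Return (start, name, strand) tuples sorted by position.

    motifs maps a name to a consensus string; a palindromic consensus
    is reported on the '+' strand only.
    """
    hits = []
    for name, motif in motifs.items():
        rc = reverse_complement(motif)
        for start in range(len(seq) - len(motif) + 1):
            window = seq[start:start + len(motif)]
            if window == motif:
                hits.append((start, name, "+"))
            elif window == rc:
                hits.append((start, name, "-"))
    return sorted(hits)


def grammar_features(hits):
    """Summarize site order and orientation along the sequence."""
    return {
        "order": [name for _, name, _ in hits],
        "orientation": [strand for _, _, strand in hits],
    }


toy_motifs = {"TF_A": "GATAA", "TF_B": "TGACTC"}  # hypothetical consensus motifs
site_hits = find_sites("AAGATAACCGAGTCAAATTATCAA", toy_motifs)
site_features = grammar_features(site_hits)
```

Features of this kind distinguish two sequences that contain the same sites in a different arrangement, which a simple motif-count representation would treat as identical.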

Agarwal et al. also utilized lentivirus-based MPRAs to study the sequence features controlling the activity and cell-type-specific attributes of cis-regulatory elements (CREs) within the human genome. [24] Using lentiMPRAs, they tested over 680,000 sequences representing annotated CREs across three cell types and found that while promoters exhibited significant strand orientation effects, enhancers displayed tissue-specific characteristics. Ultimately, the research generated accurate sequence-based models for predicting CRE function, identified factors influencing cell-type specificity, and provided a comprehensive catalog of functional CREs in commonly used cell lines.
Neural networks have, in addition to MPRAs, been employed to predict cis-regulatory grammar. DeepSTARR, a deep learning model for cis-regulatory grammar, was designed to predict the activities of developmental and housekeeping enhancers in Drosophila melanogaster S2 cells directly from the DNA sequence. [199] This model not only identifies relevant TF motifs, but also discerns the higher-order rules governing functional differences between instances of the same TF motif and facilitates the creation of custom synthetic enhancers.
One deep learning method based on generative adversarial networks, ExpressionGAN, utilizes genomic and transcriptomic data to generate de novo synthetic regulatory DNA. [200] To better understand how mutations in noncoding regulatory sequences impact cis-regulatory grammar, one study developed sequence-to-expression models using deep neural networks with convolutional layers. [201] These models provided a framework for addressing questions in regulatory evolution. The study ultimately concluded that regulatory evolution occurs rapidly and is subject to diminishing returns epistasis, which means that as genetic changes accumulate, their effects become less pronounced.
In addition, residual neural network (ResNet) algorithms, designed to address vanishing gradients in very deep networks, have been utilized to classify enhancer sequences by simulating the sequences with various regulatory architectures, including homotypic/heterotypic clusters and enhanceosomes. [202] Findings demonstrated that ResNets can effectively model regulatory grammars, even with heterogeneity in regulatory sequences and a significant proportion of TFBSs outside regulatory grammars. However, the network's ability to learn regulatory grammar still depends on the nature of the prediction task.

FIGURE 2 Comparison of gene co-expression networks (GCNs), gene regulatory networks (GRNs), and transcriptional regulatory networks (TRNs). (A) GCNs represent gene expression correlations with undirected edges. (B) GRNs use directed edges to represent gene expression correlations. (C) In TRNs, the directed edges originate exclusively from transcription factors. (Created with BioRender.com).

TABLE 1 Statistical and machine learning methods for GRN construction.

nucleotide substitutions and deletions within twenty disease-related promoters and enhancers, Kircher et al. conducted saturation

FIGURE 3 Approaches to studying cis-regulatory elements. (A) Standard reporter assays assess the ability of candidate promoter sequences to drive expression of a reporter gene or the ability of a candidate enhancer and minimal promoter together to drive expression of a reporter gene. (B) Massively parallel reporter assays utilize transcribed barcodes following the reporter gene. (C) Self-transcribing active regulatory region sequencing (STARR-seq) uses the candidate enhancer, which is part of the downstream regulatory sequence, as the transcribed barcode. (D) CRISPR/Cas9 screens use sgRNAs to assess changes in function or expression of a target gene. (E) CRISPRi allows researchers to selectively inhibit the expression of target genes by using Cas9 to block transcription. (Created with BioRender.com).

FIGURE 4 Machine learning approaches to studying cis-regulatory elements. (A) Convolutional neural networks are deep learning models designed for processing structured grid-like data by using filters to detect patterns hierarchically. (B) Recurrent neural networks are specialized in handling sequential data, where information cycles through recurrent connections. (C) Transformers enable complex hierarchical relationships and dependencies within data to be captured. (Created with BioRender.com).

In summary, understanding cis-regulatory grammar has emerged as a challenge in genomics, but recent research has provided valuable information. Advancements in deep learning, such as GNNs and CNNs, have shown promise in inferring TRNs with higher accuracy and the ability to capture nonlinear dependencies. Additional research has demonstrated that integrating both gene expression and epigenetic data enhances our understanding of transcriptional regulation. Functional examination of cis-regulatory elements through techniques like MPRA, STARR-seq, and CRISPR-Cas9 has furthered research on their role in disease and allowed for the identification of biomarkers and therapeutic targets for precision medicine. Future advancements in single-cell genomics will allow researchers to dissect regulatory networks within individual cells, and machine learning models will allow for a more holistic view of cis-regulatory logic by integrating genomics, epigenomics, transcriptomics, and proteomics. A multitude of machine learning techniques rely on black-box AI models, like CNNs and RNNs, which provide accurate predictions but often lack explanations for the underlying mechanisms. [203] Consequently, there is growing emphasis on incorporating more explainable AI techniques into biological data analysis. The ongoing evolution of AI is set to enhance predictive abilities, potentially unveiling novel regulatory motifs and strengthening our understanding of cis-regulatory grammar.