Data‐ and theory‐driven approaches for understanding paths of epithelial–mesenchymal transition

Reversible transitions between epithelial and mesenchymal cell states are a crucial form of epithelial plasticity for development and disease progression. Recent experimental data and mechanistic models showed multiple intermediate epithelial–mesenchymal transition (EMT) states as well as trajectories of EMT underpinned by complex gene regulatory networks. In this review, we summarize recent progress in quantifying EMT and characterizing EMT paths with computational methods and quantitative experiments including omics‐level measurements. We provide perspectives on how these studies can help relating fundamental cell biology to physiological and pathological outcomes of EMT.

approaches to challenge the binary switch model and transforming the concept of EMT into a system with a wide range of cell phenotypes spanning the phenotypic space between E and M states.In particular, several groups used dynamical systems theory and models based on core regulatory network involving a few players (ZEB1,etc.)to show that intermediate, partial, or hybrid EMT states can be stable (Hong et al., 2015;Lu et al., 2013;Steinway et al., 2015;Tian et al., 2013).The notion of intermediate EMT states is supported by experimental data from both cell lines and in vivo samples (Grande et al., 2015;Hong et al., 2015;Pastushenko et al., 2018;Zhang et al., 2014).More recently, quantitative, and high-throughput data as well as theoretical development allowed the characterization of the trajectory of EMT involving complex state transitions.In this review, we will highlight recent innovations in data analytical approaches and theories for studying multistate EMT and provide perspectives for future developments in this field.

| SCORING METHODS FOR THE DEGREE OF EMT
Because EMT is now recognized as a continuous spectrum, there is a need for scoring individual samples or cells for assessing its extent of EMT.Expression levels of phenotypic proteins, such as E-cadherin and Vimentin, as well as key transcription factors, have been used to assess EMT.However, the scoring process is challenged by the diversity and complexity of EMT program.As a form of dramatic phenotypic change, the EMT program involves coordinated regulations of many genes, which may generate multiple stable or meta-stable cell states.Therefore, using individual "marker" expression may produce misleading results because the expression of a specific marker may not be significantly altered in a partial EMT process.Tan et al. proposed an EMT scoring approach based on a curated list of EMT signature genes and Kolmogorov-Smirnov (KS) tests (Tan et al., 2014) (Figure 1a).This approach and the associated gene lists have been widely used for assessing EMT progression.For example, Foroutan et al. found wide score distributions for multiple cancer cell lines with TGF-β induced gene signature (Foroutan et al., 2017).As expected, in most cases when epithelial and mesenchymal signatures were separately considered, cancer cells with varying degrees of EMT have inversely correlated E and M scores.The separation of E and M signatures also allows the identification of EMT trajectories in a twodimensional (2D), E and M space (EM space).For example, using timecourse bulk RNA-seq data, Panchy et al. found that TGF-β induced EMT involves transient states at the high-E-high-M region in the EM space (Panchy et al., 2020).The high-E-high-M states represent most, but not all, breast cancer tumors' expression profiles that show intermediate EMT phenotypes (i.e., not extreme E-or M-like).
While the KS test-based approaches are useful for assessing degrees and trajectories of EMT, there has been a concern on the suitability of these methods for analyzing single-cell transcriptomes due to the efficiency and the bias toward upregulated genes (Noureen et al., 2022).A method based on principal component analysis (PCA) (non-negative PCA) was proposed to rapidly score EMT progression (Panchy et al., 2021).While this method produced similar results in terms of the EMT progression in the EM 2D space when the leading principal component (PC) was used for gene set (E or M) scoring (Figure 1a), it allows the investigation of gene set scores from more than one PC ordered by the explained variance.Because it has been shown the transcriptional program of EMT is context-dependent and the activation of M genes can be diverse (Cook & Vanderhyden, 2020;Watanabe et al., 2019), the advantage of the multi-PC feature is that it can detect divergent EMT trajectories in a single dataset.For example, in the analysis of a time course, single-cell RNA-seq dataset for small cell lung cancer (SCLC), two PCs from a M gene set produced comparable variances explained, and they showed distinct sub-modules of mesenchymal-like transcriptional programs used by different SCLC subtypes that correspond to intermediate EMT states (Groves et al., 2022).
To simplify the assessment of partial or hybrid EMT phenotypes in datasets, a single score for partial EMT was constructed by taking the minimum of E and M scores (Sacchetti et al., 2021).This was F I G U R E 1 Illustration of EMT paths and dynamical systems underpinning EMT/MET.(a) An EMT scoring system allows the identification of intermediate EMT states and paths of EMT.(b) Two regulatory networks each capable of producing a tristable EMT system.(c-e) Three plausible modes of EMT/MET: signal-driven disappearance of stable steady state (c), noise-driven transitions among stable steady states (d), and limit-cycle-like orbits (e).
helpful for identifying partial EMT states during colon cancer progression (Sacchetti et al., 2021).A categorial scoring scheme based on E, M, and E/M phenotypes was also proposed, and it was shown to be useful for predicting survival outcomes of cancer patients (George et al., 2017).
In addition to the scoring methods based on transcriptomic profiles, morphological features can be important to determine cell phenotypes in the EMT spectrum.For example, Wang et al. used a scoring method based on live cell images to track trajectories of EMT, revealing crucial temporal dynamics not captured by snapshot data (Wang et al., 2020).In the future, connecting the morphological space to omics' layers will be helpful to gain insights into the characteristics of the intermediate EMT states.

| DYNAMICAL SYSTEMS THEORY-BASED MECHANISTIC MODEL STUDIES OF PARTIAL EMT
The early notion of binary switch between E and M states resulted in confusions and controversies regarding the role of EMT in metastasis and other pathological processes.This necessitates the rigorous description of EMT as a dynamical system that allows the assignment of cellular states to (stable) steady states representing the long-term phenotypes of cells.This application of theory to biology yielded successful insights that shed light on the complex process of EMT.
Two pioneering models provided the first theoretical frameworks for a stable, hybrid/partial EMT state (Lu et al., 2013;Tian et al., 2013).Both models considered networks containing transcriptional regulations by key EMT factors ZEB1 and SNAIL, as well as post-transcriptional regulations by miR-200 and miR-34.With ordinary differential equation (ODE) models and bifurcation analysis, the authors showed a gradual increase of EMT signals such as TGF-β can trigger a two-step, sequential transition from E to M states.The partial EMT state was experimentally observed by Zhang et al. (2014).
The regulatory network structures supporting the 3-state EMT systems are centered at the regulation of ZEB1, which is involved in several feedback loops with other factors via mutual-repression or mutual-activation circuits (Figure 1b) (Hong et al., 2015).Describing these regulatory units with nonlinear functions (e.g., Hill functions) in ODE models is sufficient to give rise to three stable states.For example, we consider the ODE system regulator, n is the Hill exponent, and K is the repression threshold.
The system describes the network shown in the yellow box of Figure 1b, that is, x, y, and z represent the concentrations of ZEB1, OVOL2, and miR-200, respectively.The model can produce a tristable system with a specific choice of parameter set (Figure 1c).
However, most EMT models used more detailed descriptions for microRNA-mediated regulations.This approach requires the separation of mRNA and protein species in the model, as well as mass-action kinetics as opposed to nonlinear functions with explicitly sigmoidal response curve.For example, the interactions between miR-200 and ZEB1 mRNA harboring four miR-200 binding sites (Figure 1b, pink box) can be described by where R is the concentration of free ZEB1 mRNA, r is the concentration of free microRNA, and C n is the total concentration of complexes with microRNA molecules bound to an mRNA molecule except C 0 ≔ R and C 5 ≔ 0, k R is the transcription rate constant of ZEB1, κ on and κ off are the binding and unbinding rate constants of mRNA and microRNA, γ is the degradation rate constant of free microRNA, and α n and β n are the degradation rate constants of mRNA and microRNA bound to a complex with respect to their free-form degradation rate constants.
The advantage of using this type of model is that one can relate the mechanisms at the level of molecular interactions to the EMT cellular phenotypes.Even more interestingly, this type of model for posttranscriptional regulations can enable 3-state EMT systems even without transcriptional regulations or explicit feedback loops (Nordick, Park, et al., 2022).Equations ( 1) and ( 2) are typically combined to describe both transcriptional and post-transcriptional regulations in a single model (Hong et al., 2015;Lu et al., 2013;Tian et al., 2013), and a recent study shows that this modeling approach can yield up to seven stable states in the EMT spectrum (Nordick, Park, et al., 2022), which supports the notion of an EMT continuum in the gene expression space.While the ODE-based models provide useful mechanistic insights into the basis of hybrid EMT states, it is challenging to parameterize and simulate such systems when the number of variables is large.To consider a more comprehensive EMT system, Steinway et al.
used Boolean models containing up to 69 nodes, including both transcription factors and signaling proteins, to predict multiple EMT states, including hybrid ones, in liver cancer cell lines (Steinway et al., 2015).The team validated their predictions regarding combinatorically activated nodes with experiments (Steinway et al., 2015).In addition to these modeling approaches for single-cell signaling networks, cellular level, and multiscale (gene and cell levels) models have also been used to characterize dynamics of EMT in terms of phenotypic compositions in populations containing proliferating cells at various EMT states (Sha et al., 2020;Ta et al., 2016).
The connection between hybrid cellular states and high-order multistability (more than two co-existing attractors) provides a theoretical basis for studying the complex EMT system.In this framework, transitions among the EMT states can be driven by signals that trigger the disappearance of certain attractors (Wang et al., 2022) (Figure 1c), or stochastic gene expression that permits "jumps" among co-existing attractors (Bhatia et al., 2023;Li et al., 2016) (Figure 1d).A recent work showed an additional type of transitions driven by limit cycle oscillations with diverging periods (Nordick, Yu, et al., 2022) (Figure 1e).Interestingly, this dynamical system can be produced by microRNA-mediated regulations (Equation ( 2)), and it has higher robustness of regenerating diverse gene expression patterns compared with the stochasticity-driven transitions (Nordick, Yu, et al., 2022).Future experiments will be needed to distinguish these different modes of EMT/MET transitions in various biological scenarios.

| UNRAVEL EMT TRANSITION PATHS THROUGH SNAPSHOT CAPTURE OF CELL STATES
To investigate how EMT proceeds, the first question is how to define and distinguish between an epithelial and a mesenchymal phenotype experimentally.Traditionally, one often monitors a small panel of EMT markers, for example, a decrease of E-cadherin and an increase of Vimentin as the epithelial and mesenchymal markers, respectively (Zhang et al., 2014).Accumulating studies reveal that the EMT program is context-dependent and it generally requires a large panel of markers for the state definition (Yang et al., 2020).For example, Taube et al. identified an EMT signature composed of downregulated genes and 87 upregulated genes from analyzing the expression profiles of a panel of clowdin-low and metaplastic breast cancer lines (Taube et al., 2010).Using a small number of EMT markers may mislead interpretation of the experimental results and erroneous conclusions, as exemplified in two studies (Fischer et al., 2015;Zheng et al., 2015).The necessity of using a large panel imposes a technical challenge of using fluorescence-labeling-based live-cell imaging approaches for tracking individual cells over time to monitor the sition process.Advances of fixed single-cell techniques, such as various omics approaches, provide transcriptomic, proteomic, and/or epigenomic profiling of cell states.For these methods cells need to be fixed thus the state of individual cells cannot be tracked over time.
Consequently, numerous computational approaches have been developed to infer the transition dynamics and transition paths from the snapshot data, and several studies on EMT have been reported.Below we will briefly discuss a few such studies.
Pseudotime analysis refers to a class of computational methods of trajectory inference.To perform such analyses, one first specifies a starting cell state and an ending cell state, then uses computational algorithms to assign a scalar value as a metric measuring the similarity between a given cell state and the initial state relative to that between it and the final state.Pseudotime is therefore an unfortunate misno- JAC method that approximates the gene regulation as being linear (Bocci et al., 2022).This approximation is valid near a fixed point such as a stable phenotype excluding cell cycle.With SpliceJAC the researchers demonstrated that they could recover gene regulation relations from single-cell data, and identified driver genes that regulate partial and full EMT.Wang et al. also developed a linearized version of the vector field that applies for cell states away from the fixed points (Wang et al., 2021).They first identified one-dimensional transition paths connecting the epithelial and mesenchymal regions in the cell state space, called reaction coordinates in the context of studying chemical reactions, using an algorithm originally developed for studying chemical reactions.They constructed an approximate linear model describing gene regulation relations and binarized the gene expressions into on and off states.The directed regulations are turned on or off by the switch of genes' on-and-off states.That is, the binarized gene states give rise to dynamical gene regulation networks along the transition paths.Additionally, they grouped the genes into different communities for EMT and several other developmental processes and found that at the stable cell type regions there are more regulations between genes within the same community that corresponds to the functionality of the cell type.This modular structure has been proposed to be a general feature of biological networks for minimizing interferences between different cellular processes (Norman et al., 2013;Ravasz et al., 2002).Interestingly, along a transition path the intercommunity regulations first increase upon moving away from the initial cell state region and then decrease upon approaching the final stable cell state region.This result suggests cell phenotypic transitions require transient cooperation between different cellular programs.
Another group of methods that go beyond the pseudotime analyses use additional information one can extract from the time course data of measured single-cell distributions in the cell state space (Cheng et al., 2023;Schiebinger et al., 2019;Sha et al., 2024;Yeo et al., 2021).Consider a cellular process starts with a swarm of cells Notice that cellular functions are executed through various cellular machineries, and one expects that a cell state change accompanies with noticeable changes of morphological change of cellular structures such as membranes and organelles.Indeed, morphological profiling with fluorescently labeled and label-free optical imaging has been used to characterize cell states complementing to expression profiling methods (Way et al., 2022).For EMT studies, Burute et al. showed that centrosome position is an indicator of EMT progression (Burute et al., 2017).Montell described a framework of using morphological state space to describe EMT (Montell, 2021).Amack reviewed efforts of in vivo live cell imaging on EMT, specifically cytoskeletal reorganization (Amack, 2021).Zhao et al. reviewed efforts of using morphological features as readouts for screening drugs that can reverse EMT (Zhao et al., 2021).Devaraj and Bose (2019) and Mandal et al. (2016) used morphological features to define different EMT states and used Markov transition models to infer transition dynamics between different states from measured cell state distributions.composite cell feature space (Wang et al., 2020;Wang et al., 2022).
The studies reveal two distinct classes of single-cell EMT trajectories, and further studies are needed to examine whether they correspond to the two EMT paths corresponding to G1/S and G2/M arrest from scRNA-seq data analysis (Hu et al., 2023).

| PERSPECTIVE
It is widely accepted that EMT is a stepwise process with a spectrum of intermediate phenotypes (Nieto et al., 2016;Yang et al., 2020;Zhang et al., 2014).That is, EMT transitions may originate from different epithelial states and end at different mesenchymal/partial mesenchymal states through multiple convergent and divergent paths.EMT also couples to various other cellular processes such as metabolism and stemness, further increasing the number of possible transition paths.Traditionally, one pre-select a panel of markers representing each process for subsequent analyses, and biases may be introduced with different choices of the markers.One example is the two controversial reports about the role of EMT on cancer metastasis (Fischer et al., 2015;Zheng et al., 2015).Later studies suggest that the conclusion was misled by using mesenchymal markers incapable of reflecting partial EMT states, which turn out to contribute to metastasis.With advances of single-cell techniques, now we are able to study how the EMT regulatory program couples to and coordinates with other processes systematically.
Consider the question on EMT and stemness that has attracted extensive attention.An influential study of Weinberg and coworkers (Mani et al., 2008) led to an explosive number of studies on the relationship between EMT and cancer stem cell generation (Scheel & Weinberg, 2012), which is hypothesized to contribute to the complexity of EMT landscape.However, mechanistic understanding on the coupling among the processes remains unclear.While it is well established in the EMT field that EMT proceeds through a continuum  et al., 2015;Kröger et al., 2019), the mesenchymal end (Mani et al., 2008), or both epithelial and mesenchymal states (Liu et al., 2014) (Figure 2a).One possible explanation is that coupling between EMT and other cellular programs including stemness results in multiple transition paths, and these paths and associated intermediate states may be populated differently at different cellular contexts (Figure 2b).That is, one contribution to the different observations may come from the oversimplification of using a one-dimensional EMT axis to describe the transition progression.Several modeling studies have analyzed small-scale EMT-stemness regulatory network models and concluded that the partial EMT states are the most stemlike (Jolly et al., 2014).This conclusion resonates with the observation of Wang et al. (2021) that the crosstalk between different cellular programs transiently increases during EMT.With single-cell omics data one can perform systematic studies on how various cellular programs work cooperatively to execute large-scale cell expressions and function reprogramming, such as the context-dependent coupling between EMT programs and stemness regulation programs, as exampled by a few existing such studies (Acosta et al., 2013;Bocci et al., 2021;Hari et al., 2022;Hari et al., 2023;Hu et al., 2023;Watanabe et al., 2019).
Therefore, it is a high priority for EMT studies to have a complete map of the EMT states and their transitions (Yang et al., 2020), or the EMT landscape as it is referred in some literature.Advances of singlecell techniques, including multiomics techniques, spatial genomics, and live cell imaging studies, will provide unprecedent cellular details at both in vivo and in vitro contexts.Analyzing the data within the framework of dynamical systems theories, that is, data-driven systems biology modeling (Xing, 2022), will reveal mechanisms on how the EMT transition is regulated in a context-and microenvironmentdependent manner, and guide effective strategies for modulating the transitions.
mer with no direct relation to time.It corresponds to the reaction coordinate in chemistry, which is a one-dimensional abstract coordinate quantifying the progression of a chemical reaction from a reactant to a product state along a specific reaction route.Another misnomer in the single-cell field is to term such an averaged transition path as a trajectory, while in physics and chemistry objects follow individual trajectories to complete a transition process.McFaline-Figueroa et al. performed pseudotime analysis on MCF10AscRNA-seq data and concluded that EMT proceeds through a onedimensional continuum(McFaline-Figueroa et al., 2019).While pseudotime trajectories are inferred single or multiple average transition paths, several other methods aim to infer individual single-cell transition trajectories to unravel the heterogeneity of transition processes.A Markov state transition model first clusters single cell data into groups, and assumes that unidirectional or bidirectional transitions can take place between neighboring groups in the cell state space.Based on single-cell genomics data, a general method MuTrans uses multiscale reduction of Markov model to identify attractors and reveal multiple transition paths of EMT, as well as transitional cells inbetween(Zhou et al., 2021).One uses computational approaches to infer the transition rates from the measured stationary or time series cluster populations.Karacosta et al.  analyzed time course mass cytometry data of lung cancer cells using a Markov transition model, and concluded similarly that both EMT and the reverse MET proceed through distinct one-dimensional paths(Karacosta et al., 2019).Optimal transport-based approaches are another class of computational approaches that uses the cell state distributions from time series single cell data to infer transition dynamics between different cell states.Using optimal transport analysis, Cheng et al. examined the TGF-β treated MCF10A cells and identified distinct paths leading to three final cell states with different mesenchymal features(Cheng et al., 2023).Using the method of matrix decomposition, Sha et al. studied transition cells along the EMT trajectories based on single-cell data as well as benchmarked the methods using agent-based EMT models (Sha et al., 2020).All the above approaches use the expression information only and unavoidably impose some approximations to infer the cellular dynamics.Different class of approaches uses additional information for the inference.Specifically, La Manno et al. showed that one can estimate the instant velocity of RNA copy number change termed RNA velocities from the snapshop single-cell sequence data (La Manno et al., 2018).The work inspired an explosive number of studies to improve the estimation accuracy.With all these efforts of improving the estimation of the instant RNA velocities, Qiu et al. developed a theoretical framework, Dynamo, which addresses a separate question of how to extract quantitative mechanistic information from the single cell expression states and instant velocities.The method takes (a) the discrete, sparse, and noisy single-cell gene expression state x measured in scRNA-seq and (b) the estimated genome-wide RNA turnover dynamics (i.e., RNA velocity dx/dt) as input and robustly reconstructs a set of dynamical equations dx/ dt = F(x) through machine learning(Qiu et al., 2022).Contrary to some widespread misconception, the input of Dynamo can be RNA velocities estimated from various approaches, for example, the original splicing-based RNA velocities or those from metabolic labeling data.Dynamo provides a continuous and analytical vector field function F, which quantitatively describes (generally nonlinear) gene regulation relations in the single-cell transcriptome.That is, it is a generative model to provide the values of F at cell states not included in the original data points.Therefore, Dynamo is a framework to systematically construct a data-driven mechanistic model such as the one described in Equation (1), but at a larger (genomewide) scale.With the vector field F, one can make in silico predictions on the effects of gene perturbations.For example, the Jacobian analysis can be viewed as an in silico experiment: for a gene i, the Jacobian reveals how changing the expression of gene j, while keeping other genes unperturbed, affects the transcription rate of gene i, with a positive value indicating activation and a negative value for repression.The vector field is analogous to the force field in molecular dynamics simulations, and one can simulate how individual cells evolve over time during processes such as differentiation and reprogramming.Using the approaches, Hu et al.analyzed MCF10A cells treated with low-dose of TGF-β, and revealed that EMT proceeds through either G1/S or G2/M cell cycle arrest(Hu et al., 2023).The work of Qiu et al. set a general framework of analyzing single-cell data under dynamical systems theories, and inspired further development of the detailed algorithms.Bocci et al. provide a Splice- forming a cloud in the cell state space.The cell density distribution evolves over time.Computational approaches have been developed to learn the cell-state transition dynamics of underlying cellular processes including gene-gene regulations and stochastic processes, as well as death and proliferation of cells at specific cell states, from the measured temporal evolution of cell density distributions.Specifically, Sha et al. developed a Trajectory Inference with Growth via Optimal transport and Neural network (TIGON) that infer cell transition dynamics, gene regulation relations, and population growth simultaneously(Sha et al., 2024).They applied to EMT datasets and demonstrated that the method recovers gene regulation relations such as activation of SNAIL1 on mesenchymal genes.Separately, Zhang et al. also developed a graph Dynamo formalism that describes the full dynamics including convection from the vector field, diffusion from stochastic processes, and cell birth-death in a multidimensional cell state space(Zhang et al., 2023).Graph Dynamo also uses both the temporal cell density distributions and RNA velocities to constrain model parameters.The framework also addresses an important technical problem on how to transform the velocity vectors and the equations between different representations, for example, between principal component space and UMAP.5 | UNRAVEL EMT TRANSITION PATHS THROUGH TRACING STATES OF INDIVIDUAL LIVE CELLS OVER TIMEA fundamental limitation of the above approaches is that the dynamical information is inferred from snapshot data under certain model assumptions.Direct experimental measurements of the dynamics of cellular state change require monitoring cells over time.Live-cell imaging with fluorescently labeled proteins or mRNAs allows tracking cell states with molecular specificity but with a limited number of molecular species to define a cell state.
morphological features can be used to specify cell states, particularly in EMT studies.Notice that morphological features can be monitored through transmission light imaging without the need of fluorescent labeling, Wang et al. used combined fluorescent-and label-free imaging to monitor individual single cells over time in a multidimensional spectrum of intermediate states, conflicting reports exist on where cell stemness couples to the EMT axis, from coupling to the epithelial end(Celia-Terrassa et al., 2012), the hybrid (Grosse-Wilde Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/dvg.23591,Wiley Online Library on [03/08/2024].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License