Stochastic Latent Transformer: Efficient Modelling of Stochastically Forced Zonal Jets

We present a novel probabilistic deep learning approach, the 'Stochastic Latent Transformer' (SLT), designed for the efficient reduced-order modelling of stochastic partial differential equations. Stochastically driven flow models are pertinent to a diverse range of natural phenomena, including jets on giant planets, ocean circulation, and the variability of midlatitude weather. However, much of the recent progress in deep learning has predominantly focused on deterministic systems. The SLT comprises a stochastically-forced transformer paired with a translation-equivariant autoencoder, trained towards the Continuous Ranked Probability Score. We showcase its effectiveness by applying it to a well-researched zonal jet system, where the interaction between stochastically forced eddies and the zonal mean flow results in a rich low-frequency variability. The SLT accurately reproduces system dynamics across various integration periods, validated through quantitative diagnostics that include spectral properties and the rate of transitions between distinct states. The SLT achieves a five-order-of-magnitude speedup in emulating the zonally-averaged flow compared to direct numerical simulations. This acceleration facilitates the cost-effective generation of large ensembles, enabling the exploration of statistical questions concerning the probabilities of spontaneous transition events.


Introduction
The computational expense of modelling turbulence has spurred research into more computationally cost-effective alternatives, known as reduced-order models (ROMs), to advance simulations of fluids [1]. In this context, deep learning methods, particularly autoencoders [2,3], have surpassed traditional techniques such as Proper Orthogonal Decomposition (POD) [4] in constructing spatial ROMs, owing to their capacity for nonlinear transformations [5,6]. Deep learning has also been employed to model the temporal evolution of turbulent fluid flows, significantly reducing computational costs during inference [7,8,9,10,11,12], with efforts combining these methods with autoencoders to evolve dynamics in latent space [5,13,14,15] outperforming linear methods such as dynamic mode decomposition (DMD) [16]. The aforementioned studies have primarily focused on deterministic nonlinear partial differential equations (PDEs); however, in many physical applications, including fluid mechanics, state-dependent stochastic terms are often included to represent unresolved processes such as small-scale turbulent eddies.

Figure 1: Latitude-time plots of U(y, t) displaying an ensemble of numerical integrations and neural network emulations. a-h exhibit numerical integrations with identical initial conditions up to t = 10 (indicated by the dotted line) and distinct realisations of random noise, ξ, after t = 10, spanning a forecast period of 500 time units. i-p showcase neural network emulations with the same initial conditions as (a-h), but with different noise histories ϵ ∼ N(0, 1) after t = 10. Numerical integration employs a time step of δt = 4 × 10⁻⁴, with a-h displayed at coarsened time intervals and the neural network operating at unit time intervals, ∆t = 1. Yellow indicates positive U(y, t), signifying eastward jets. Noteworthy events within these jets include nucleation, coalescence, and latitudinal translation. The neural network adeptly captures anticipated system features, offering plausible evolutions.
In this study, we develop a deep learning approach tailored to handle Stochastic Partial Differential Equations (SPDEs). We use a translation-equivariant autoencoder to find a reduced-order representation and employ a stochastically forced transformer to model the latent dynamics in a probabilistic manner. While Operator Learning methods [17] have excelled in modelling PDEs [18,19,20,21], for tackling SPDEs we focus on transformer-based methods [22] due to their proficiency in discerning correlations within an input set using scaled dot-product attention. This allows the model to generate a context-aware weighted sum of its inputs when forecasting the next sequence member [22]. In the context of modelling SPDEs, this attends to previous time histories of the flow as well as the forcing noise to produce a state-dependent weighting of the forcing contribution.
One domain in which there is demand for ROMs of stochastically driven flows is weather and climate modelling, characterised by turbulent and complex multi-scale flows. Parameterisation schemes, often explicitly stochastic, are employed to approximate the effects of important small-scale processes not explicitly resolved by the spatial resolution of the model, such as cloud physics, radiative transfer, and turbulent eddies [23], with stochasticity here accounting for the chaotic behaviour of the unresolved dynamics. Recent studies have leveraged deep learning methods to create stochastic parameterisations for coarsened deterministic models [24,25,26]. These introduce stochasticity on the basis that no deterministic function of the resolved variables can fully account for the effects of unresolved variables. In contrast, our study focuses on employing deep learning to emulate systems that explicitly incorporate a stochastic component that drives the system dynamics.
In this work, we investigate producing a deep learning ROM of a well-studied idealised Geophysical Fluid Dynamics (GFD) system: beta-plane turbulence [27,28]. This system serves as a useful simplest model for many aspects of flows in atmospheres and oceans. It is a single-layer flow in which turbulence is generated via stochastic forcing, ξ(x, y, t), in the vorticity equation:

∂_t ζ + u · ∇ζ + βv = −µζ + (−1)^(n+1) ν∇^(2n)ζ + ξ,   (1)

with the velocity field defined as u(x, y, t) = (−∂_y ψ, ∂_x ψ) (note the sign convention used widely in GFD), where ψ is the stream function and the relative vorticity is ζ = ∂_x v − ∂_y u. The parameter β represents the variation of rotation with latitude. Energy dissipation occurs through linear damping at rate µ at large scales, while hyperviscosity, with coefficient ν and order n, removes vorticity at small scales.
In turbulence, a Reynolds decomposition is typically employed as an averaging procedure to separate coherent structures from fluctuations. Here we apply an eddy-mean decomposition in the zonal direction, u(x, y, t) = ū(y, t) + u′(x, y, t) [29,30], to separate the dynamics, giving equations for the zonally averaged zonal velocity field, U(y, t) = ū(y, t) = (1/L_x) ∫₀^{L_x} u dx (where L_x = 2π is the domain size in x), and the associated vorticity fluctuation field, ζ′(x, y, t):

∂_t U = ⟨v′ζ′⟩ − µU + (−1)^(n+1) ν ∂_y^(2n) U,   (2)

∂_t ζ′ + U ∂_x ζ′ + (β − ∂_y² U) v′ = E − µζ′ + (−1)^(n+1) ν∇^(2n)ζ′ + ξ,   (3)

where ⟨·⟩ denotes the zonal average and E = ⟨u′ · ∇ζ′⟩ − u′ · ∇ζ′ collects the eddy-eddy interactions. Here we see that the evolution depends on the nonlinear two-way interaction between the mean flow and the eddies, and that the applied noise, ξ, acts directly only on the eddy fields and is therefore filtered from the mean-flow equation.
This study focuses on a particular choice of parameters, β = 90, µ = 4 × 10⁻², ν = 100, and an energy injection rate of ε = 10⁻⁴, with a principal forcing wavenumber of k_f = 16 and a time step of δt = 4 × 10⁻⁴, chosen to be in the regime in which coherent zonal jets form. This parameter choice leads to the jets exhibiting a wide range of time-variability behaviour [31]. The latitude-time plots in Figure 1.a-h, generated via numerical integration, illustrate how the system self-organises to produce jets: the inclusion of β disrupts the turbulent energy cascade, leading to jet formation (defining a jet as a local maximum of U). Each plot displays numerical integrations with identical histories up to the dotted line, with the subsequent dynamics driven by different realisations of ξ. These jets exhibit complex phenomena, including spontaneous transition events such as nucleation (the formation of a new jet), coalescence (the merging of two jets), and latitudinal drifting of persistent zonal jets. In this regime it may be argued that the model provides an analogue of week-to-week variations in the large-scale dynamics of the tropospheric jet, a key driver of midlatitude weather variations [32]. Details of the particular numerical scheme used are given in the Appendix.
In looking to model the mean flow using deep learning without explicitly parameterising equation (3), we combine equations (2) and (3):

∂_t U = −µU + G(U, ζ′, ξ) + (−1)^(n+1) ν ∂_y^(2n) U,   (4)

where G is a nonlinear operator that represents the two-way interaction between the eddies and the large-scale flow, providing the evolution of U(y, t).
Existing research on deep learning applied to stochastic parameterisation focuses on acquiring a stochastic specification solely for the eddy effects at any instant in terms of the large-scale flow [24,25]. For the system stated here, this would equate to learning a data-driven form of the second term on the right-hand side of equation (4) offline, before using this closure to solve equation (4) with a PDE solver. By contrast, we propose a novel deep learning approach designed to learn the equivalent of the entire right-hand side of equation (4): a stochastic specification for how the SPDE governing U should be updated in time. We observe that the evolution of the system, ∂_t U, depends on the temporal histories of U, ζ′, and ξ. Consequently, our deep learning model is structured to take short-term histories of U and incorporates an attention mechanism that implicitly learns an effective form of the interaction between these temporal histories and the mean flow, here represented by G, which enables the prediction of possible realisations of ∂_t U.
We demonstrate that this deep learning approach is capable of reproducing both short-term and long-term statistics of the original system. A faithful model over very long time evolutions has not previously been demonstrated using transformers for PDEs or SPDEs, and could not be replicated for this system by other state-of-the-art generative models, such as Variational Autoencoders (VAEs) [3] and adversarial models such as those used in [26,25]. The deep learning approach provides a cost-effective means to generate large ensembles, enabling us to study regimes close to spontaneous transition events, such as nucleation and coalescence events. This allows us to approach previously computationally prohibitive questions, such as quantifying to what extent different events are driven by deterministic or stochastic dynamics.

Methods
Recent advances in fluid modelling using transformers have been constrained by the large number of degrees of freedom of these spatiotemporal systems. In weather-system nowcasting [21,33,34], researchers have omitted using attention to model temporal correlations, focusing instead on spatial features by adapting the vision transformer [35]. In our approach, we maintain the ability to address both spatial and temporal correlations, as necessary for this system. We achieve this by first obtaining a reduced-dimensional representation of the original system [5,6,14,36], as described by equation (4), which is input to a transformer to evolve the stochastic latent dynamics.

Autoencoder -Translation Equivariant Pointwise Convolution
Given the multi-scale nature of fluid flows, Convolutional Neural Networks (CNNs) [37] are commonly used to capture spatial correlations. Each convolutional layer preserves translational equivariance through circular padding. However, it is important to note that while individual convolution operations exhibit translational equivariance, CNNs as an architecture lack global translational equivariance due to the max pooling and striding operations used for dimensionality reduction [38].
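The per-layer equivariance afforded by circular padding can be verified numerically: a periodic convolution commutes with a shift of the input (np.roll standing in for the shift operator). A minimal sketch with an arbitrary kernel, not tied to the trained model:

```python
import numpy as np

def circular_conv(u, w):
    # Periodic (circularly padded) convolution, implemented in Fourier space
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(w, n=len(u))))

rng = np.random.default_rng(0)
u = rng.standard_normal(256)   # a field U(y) on a periodic domain
w = rng.standard_normal(5)     # an arbitrary convolution kernel
shift = 17

# Equivariance: convolving a shifted field equals shifting the convolved field
lhs = circular_conv(np.roll(u, shift), w)
rhs = np.roll(circular_conv(u, w), shift)
assert np.allclose(lhs, rhs)
```

The equivariance is exact here precisely because no pooling or striding follows the convolution, which is the property the pooled CNN loses.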
Assuming spatial homogeneity for the forcing term ξ, the beta-plane system exhibits latitudinal symmetry under the transformation (t, x, y, ψ, ζ) → (t, x, −y, −ψ, −ζ). Translation equivariance in the beta-plane system can be denoted by ϕU(y, t) = U(ϕy, t), where ϕ signifies a shift operator in the y dimension, a result of the system's periodic boundary conditions and inherent symmetries.
Given that this is an important property of the system, maintaining translational equivariance is advantageous (conversely, in cases with different boundary conditions, where this symmetry may not exist, the network should adapt accordingly and not exhibit such equivariance).
To address this, we introduce a phase-equivariant convolutional architecture for both the encoder and decoder, the Translation Equivariant Pointwise Convolution (TEPC), whose latent representation remains equivariant to any shift ∆y in the physical space. We draw inspiration from both the method of slices [39,40], by learning translation-invariant weights, as employed by [5], and the Fourier Neural Operator [17], by performing a convolution as a linear transformation in Fourier space.

Figure 2 (caption excerpt): The weights are translation invariant, as with the TEPC, with the phase, ϕ, at time t removed. The green block represents the random noise vector, ϵ ∼ N(0, 1), which is concatenated with the latent space-time histories Z_{t−L:t}. The architecture consists of N transformer blocks, where 'MHA' is multi-headed attention using scaled dot-product attention (see Figure 3 for details), 'Linear' is a learned linear transformation, and σ is the nonlinear activation function. Further details are outlined in the Methods section.
The method of slices involves taking a discrete Fourier transform, F, to transform the input data, X, at each TEPC layer into the spectral domain, X̂ = F(X) ∈ C. We obtain the phase information, ϕ, from the first Fourier mode using ϕ = arg(X̂(1)). We store this phase and construct a phase-aligned solution, X̂′ = X̂e^(−ϕ•K_y), such that the first Fourier mode is a pure cosine. This step ensures that the weights W are calculated to be invariant to this phase. Subsequently, we apply the convolution operation in the form of a linear transformation in Fourier space, Ĥ′ = W ∗ X̂′. We reintroduce the phase information via Ĥ = Ĥ′e^(ϕ•K_out) before transforming the result back to physical space, H = F⁻¹(Ĥ). We denote the TEPC layer operation as H = Γ(X, W). The encoder and decoder each consist of a two-layer structure, with a nonlinear activation function, σ = GELU [41], following the first layer.
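A minimal NumPy sketch of a single TEPC layer may clarify the sequence of operations (the complex weights W, mode counts, and grid sizes here are illustrative stand-ins, not those of the trained model). Shifting the input by s of n grid points shifts the reduced output of M points by s·M/n points, confirming equivariance through the dimension reduction:

```python
import numpy as np

def tepc_layer(X, W):
    """One TEPC layer (illustrative): phase-align in Fourier space, apply a
    per-mode linear map W (which also truncates to fewer modes), then
    restore the phase and return to physical space."""
    Xhat = np.fft.rfft(X)
    phi = np.angle(Xhat[1])                          # phase of the first Fourier mode
    k_in = np.arange(len(Xhat))
    aligned = Xhat * np.exp(-1j * phi * k_in)        # phase-aligned solution X'
    Hhat_aligned = W * aligned[:len(W)]              # pointwise spectral transform H' = W * X'
    k_out = np.arange(len(W))
    Hhat = Hhat_aligned * np.exp(1j * phi * k_out)   # reintroduce the phase
    return np.fft.irfft(Hhat, n=2 * (len(W) - 1))    # back to physical space

rng = np.random.default_rng(1)
X = rng.standard_normal(64)                               # input field, n = 64 points
W = rng.standard_normal(9) + 1j * rng.standard_normal(9)  # keep 9 modes -> 16 output points
out = tepc_layer(X, W)
out_shifted = tepc_layer(np.roll(X, 8), W)
# A shift of 8/64 of the domain appears as a shift of 2/16 in the reduced output
assert np.allclose(out_shifted, np.roll(out, 2))
```

Because the phase rotation uses integer mode numbers, any 2π ambiguity in arg(X̂(1)) cancels, so the equivariance in the final assertion holds exactly up to floating-point error.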
We can express the encoder as Z = E_ϕ(U) = Γ(σ(Γ(U, W₁)), W₂), transforming R²⁵⁶ → R^(D_M), where ϕ represents the weights W₁ and W₂. The decoder, U = D_θ(Z), with weights θ, takes the same form but transforms from latent space back to physical space, R^(D_M) → R²⁵⁶. The pointwise convolution allows for both local and global convolutions to take place in a single procedure, allowing the TEPC to reduce input dimensions to any desired size, in contrast to CNNs, which rely on pooling operations to achieve a specific reduction factor.
Through experimentation, it was observed that the TEPC only required two layers with a single activation function to establish mappings between physical and latent spaces, as illustrated in Figure 2.b, resulting in fewer operations and faster inference compared to a CNN with the same latent dimension reduction.
The latent space, Z, maintains translational equivariance with respect to the input U. Alternatively, not reintroducing the phase ϕ in the final layer achieves a translation-invariant latent representation of the input field. This approach, when extended to handle 2D and 3D inputs, holds great promise for computer vision tasks where properties such as translation invariance and equivariance are desirable, an example being object recognition. Additionally, it eliminates the need for data augmentation during training, thereby reducing our overall training costs.

Stochastic Latent Transformer
The encoder provides a latent representation, U → Z. In order to evolve the system forward in time, we look to model the stochastically forced dynamical system in latent space, forecasting Z_{t+1} from the time history Z_{t:t−L} and a noise sample ϵ. One motivation for using transformers is that they have been shown to outperform Recurrent Neural Networks (RNNs) [42]: the scaled dot-product attention mechanism effectively handles temporal dependencies by considering the entire sequence, while multi-headed attention, which evaluates multiple attention operations in parallel, allows backpropagation to be parallelised, improving training speed over RNNs.
Figure 2.c illustrates the transformer architecture, which comprises a multi-head attention block and a feed-forward neural network. The attention blocks allow the model to weight the importance of different parts of the input sequence when making predictions, while the feed-forward network (comprising two linear layers with a nonlinear activation function, here σ = GELU, between them) allows it to learn complex nonlinear relationships. The attention block and feed-forward network are each wrapped with skip connections to enhance gradient propagation towards the input during the backpropagation phase of training.
We employ a regression transformer, adapting the original transformer architecture proposed in [22] by exclusively utilising the transformer decoder, which, throughout this article, we shall simply refer to as the transformer. This choice stems from the fact that our input data is a time series of continuous values, rather than a prompt that necessitates semantic comprehension, as in natural language processing. The original transformer was used for the classification of discrete data in natural language tasks, so we adapt the final layers by replacing the softmax function in the feed-forward network with two linear layers with a nonlinear activation function, again σ = GELU, between them to provide a regression transformer.
In the natural language setting, the softmax layer yields probabilities for the next possible token, sampled during inference to make a prediction such that the most probable word is not always selected, producing diverse outputs. By removing the softmax and sampling operations, we must now induce stochasticity in a different manner. Current efforts in stochastic transformers for regression tasks explore different approaches: using stochastic activation functions [43], replacing attention with stochastic alternatives for computational efficiency [44], or introducing noise to all inputs [45]. In contrast, our approach incorporates the forcing noise as an additional sequence element, {Z_{t:t−L}, ϵ}, such that we only impose a single forcing when making a forecast of Z_{t+1}. The attention mechanism in the transformer identifies correlations among the input elements, using this to calculate the state-dependent weighting of the forcing noise ϵ ∼ N(0, 1) when predicting Z_{t+1} based on the latent flow history Z_{t:t−L}. This facilitates ensemble generation with different noise realisations, akin to sampling ξ in equation (4).
We employ two different attention mechanisms: self-attention and cross-attention. While the computation of the attention layer remains consistent across both cases, the inputs to the layer differ; see Figure 3 for a schematic of the cross-attention implemented. The attention operation can be defined as:

Attention(Q, K, V) = softmax(QKᵀ/√d_k)V.

Figure 3 (caption excerpt): Weights, W, linearly transform inputs into Q (Query), K (Key), and V (Value) vectors. The ϵ input is identical for both the K and V vectors. Dotted lines in the A (Attention) matrices separate the additional row introduced from cross-attention, given forcing ϵ, from the self-attention of the latent time histories. This illustrates the case where batch size = 1 and number of heads = 1, with reshaping implemented to extend this to multiple heads.
The non-stochastic self-attention in the following transformer blocks operates in the exact same manner, simply without the concatenation of ϵ.
Here, Q (query vector), K (key vector), and V (value vector) are obtained through linear transformations of their respective inputs: Q = MW_Q, K = NW_K, and V = PW_V. d_k represents the dimension of the key vector, K, and softmax denotes the row-wise softmax function. The application of softmax introduces sparsity to the attention matrix, A, regulating the emphasis placed on each input in the sequence of length L. A notable characteristic of the attention mechanism is that the weightings are calculated dynamically based on the correlating patterns within the key (K) and query (Q) vectors.
In self-attention, the inputs M, N, and P share the same values, which in this context means M = N = P = Z_{t:t−L}. In cross-attention, N and P take the same values, while M differs. Our aim is for the attention mechanism to appropriately weight the significance of the input features, which here take the form of the latent mean flow Z_{t:t−L} and the forcing noise ϵ ∼ N(0, 1). As such, we choose M = Z_{t:t−L}, while N = P = {Z_{t:t−L}, ϵ}. Within cross-attention, we effectively conduct self-attention on Z_{t:t−L}, capturing temporal correlations, while providing an adaptive weighting of the stochastic forcing, ϵ, contingent on the time history Z_{t:t−L}. This can be conceptualised as determining the degree of emphasis to place on ϵ in relation to the current system state. Analogously, this mirrors phase-space regimes in the beta-plane system, with transitions between statistically stable fixed points induced by fluctuations, relative to the current state in phase space [46].
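The cross-attention with an appended noise token can be sketched in a few lines of NumPy (single head, batch size 1, random matrices standing in for the learned weights W_Q, W_K, W_V; dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def noise_cross_attention(Z_hist, eps, W_q, W_k, W_v):
    """Single-head cross-attention with the forcing noise as an extra
    sequence element: M = Z_hist, N = P = {Z_hist, eps}."""
    NP = np.concatenate([Z_hist, eps[None, :]], axis=0)  # (L+1, d): history plus noise token
    Q = Z_hist @ W_q               # queries from the latent history only
    K = NP @ W_k                   # keys include the noise token
    V = NP @ W_v                   # values include the noise token
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # (L, L+1) attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
L, d = 10, 8
Z_hist = rng.standard_normal((L, d))
eps = rng.standard_normal(d)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out, A = noise_cross_attention(Z_hist, eps, W_q, W_k, W_v)
# Each row of A is a distribution over the L history steps plus the noise token,
# so the final column is the state-dependent weight given to the forcing eps
assert np.allclose(A.sum(axis=1), 1.0)
```

The last column of A is the state-dependent weighting of ϵ described above; with trained weights this is where the model decides how strongly the forcing should influence Z_{t+1}.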
It is worth noting that only the initial multi-headed attention layer in the transformer incorporates cross-attention with the forcing term ϵ. Subsequent layers utilise self-attention, again ensuring that we only impose a single forcing when making a forecast of Z_{t+1}, to match how the beta-plane system is forced, as described in equation (1).
We add a time-embedding vector [47] to each input latent-space vector to incorporate temporal-order information into the model, with the exception of ϵ, indicating that ϵ is not part of the temporal sequence. During inference, the transformer evolves autoregressively in the latent space to produce Z^(i)_{t+1:t_max} for each ensemble member i, before using the decoder to produce the evolution U^(i)_{t+1:t_max}; see Figure 2.b for an illustration. We denote the stochastic transformer operation Z_{t+1} = T_φ[Z_{t:t−L}, ϵ], parameterised by weights φ.
As with the autoencoder, we wish for the transformer to also act in a translation-invariant manner. The transformer architecture itself is not translation invariant due to the fully connected layers within the architecture. To overcome this, we only input phase-aligned solutions to the transformer, using the method of slices, and then add the phase back to each output. This way, the transformer handles translated inputs in an identical manner (given identical samples of noise ϵ). Here we separate each value of Z into a phase-aligned solution, Z′, and phase, ϕ. The phase is extracted from the latent time history Z_{t:t−L} using the sequence member at time t only, ϕ(t); removing the phase from the previous time steps in the history would prevent the transformer from learning the latitudinal drifting dynamics that we observe in the system. Each sequence member is then phase-aligned with respect to ϕ(t), Z′_{t:t−L} = Z_{t:t−L} e^(−ϕ(t)•K_z). We reintroduce the phase information to the transformer output Z′_{t+1} via ϕ(t), Z_{t+1} = Z′_{t+1} e^(ϕ(t)•K_z). As the transformer evolves the system forward autoregressively during inference, we must apply this at every step, since the transformer expects inputs where Z_t has zero phase in the first Fourier mode at every time step. We must therefore phase-align solutions when forecasting each autoregressive time step Z_{t+n} before input to the transformer, to ensure that the data given to the transformer during inference is drawn from the distribution of the training data.
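The per-step align-predict-restore loop can be sketched as follows. This is a hypothetical illustration: the `transformer` argument is a stand-in callable (here a trivial persistence-plus-noise rule, not the trained SLT), and the phase rotation is applied via an FFT of a real latent vector, whereas the actual model rotates its latent Fourier coefficients directly:

```python
import numpy as np

def first_mode_phase(z):
    # Phase phi(t) of the first Fourier mode of the latest latent state
    return np.angle(np.fft.fft(z)[1])

def rotate(z, phi, k):
    # Multiply Fourier modes by exp(i*phi*k): a (fractional) cyclic shift
    return np.real(np.fft.ifft(np.fft.fft(z) * np.exp(1j * phi * k)))

def rollout(transformer, Z_hist, n_steps, rng):
    """Autoregressive emulation with per-step phase alignment. `transformer`
    is any callable (Z'_hist, eps) -> Z'_next acting on phase-aligned latents."""
    Z_hist = list(Z_hist)
    d = len(Z_hist[0])
    k = np.fft.fftfreq(d) * d                       # signed mode numbers
    traj = []
    for _ in range(n_steps):
        phi = first_mode_phase(Z_hist[-1])          # phase taken at time t only
        aligned = np.stack([rotate(z, -phi, k) for z in Z_hist])  # Z'_{t:t-L}
        eps = rng.standard_normal(d)                # fresh forcing noise each step
        Z_next = rotate(transformer(aligned, eps), phi, k)        # restore phase
        Z_hist = Z_hist[1:] + [Z_next]              # slide the history window
        traj.append(Z_next)
    return np.stack(traj)

# Stand-in 'transformer': persistence plus weighted noise, for illustration only
dummy = lambda hist, eps: hist[-1] + 0.1 * eps
rng = np.random.default_rng(0)
Z0 = [rng.standard_normal(16) for _ in range(10)]
traj = rollout(dummy, Z0, 5, rng)
assert traj.shape == (5, 16)
```

Each iteration re-extracts ϕ(t) from the newest state before aligning, matching the requirement that every transformer input has zero phase in its first Fourier mode.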

Training
In most research on data-driven modelling of fluid flows, the Mean Squared Error (MSE) has been the predominant choice as the loss function.This is because it aligns with the goal of minimising the discrepancy between predicted outputs and the reference data.However, MSE proves inadequate for handling probabilistic forecasts, which are essential due to the unpredictable and random nature of turbulence, especially in the context of the beta-plane turbulence model.
As a training objective, we employ the Continuous Ranked Probability Score (CRPS) [48], a proper scoring rule and a commonly used metric for probabilistic forecasts in meteorology. [49] demonstrated the use of the CRPS as an alternative to adversarial training methods, which employ competitive learning but can suffer from issues such as non-convergence and mode collapse [50]. The CRPS is a generalisation of the Mean Absolute Error (MAE) for distributional forecasts, and can be expressed as:

CRPS(F, U) = ∫ (F(x) − H(x − U))² dx,

where F is the forecast CDF, U the verifying truth, and H the Heaviside step function. We use the non-parametric estimate of the CRPS [51], estimating F with the empirical cumulative distribution function (CDF) of m independent and identically distributed (iid) samples Ũ ∼ F:

CRPS(Ũ, U) = (1/m) Σᵢ |Ũ^(i) − U| − (1/(2m²)) Σᵢ Σⱼ |Ũ^(i) − Ũ^(j)|,   (8)

where the first term is the MAE and the second is the Ensemble Variation; U is the truth trajectory from the training dataset, and Ũ^(i) is the i-th member of the prediction ensemble, the model producing an ensemble of size m for time step t + 1. The network is trained to minimise the difference between each predicted ensemble member and the truth trajectory U_{t+1}, while simultaneously maximising the dissimilarity between individual ensemble members Ũ^(i)_{t+1}, due to the inclusion of the Ensemble Variation term in equation (8). This promotes output diversity, overcoming the mode-collapse issues seen in adversarial training.
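The non-parametric estimator of equation (8) is a few lines of NumPy (a sketch of the scoring rule itself, not the batched PyTorch loss used in training):

```python
import numpy as np

def crps_ensemble(ens, truth):
    """Non-parametric CRPS estimate from m ensemble members (eq. 8):
    mean absolute error to the truth minus half the mean pairwise
    ensemble spread (the 'Ensemble Variation' term)."""
    ens = np.asarray(ens, dtype=float)
    m = len(ens)
    mae = np.mean(np.abs(ens - truth))
    variation = np.abs(ens[:, None] - ens[None, :]).sum() / (2 * m**2)
    return mae - variation

# A sharp, well-centred ensemble scores lower (better):
truth = 0.0
tight = np.array([-0.1, 0.0, 0.1, 0.05])
wide = np.array([-2.0, -1.0, 1.0, 2.0])
assert crps_ensemble(tight, truth) < crps_ensemble(wide, truth)
# A deterministic, exact forecast achieves the optimal score of zero
assert crps_ensemble(np.zeros(4), truth) == 0.0
```

Because the spread term is subtracted, gradient descent on this quantity simultaneously pulls members towards the truth and pushes them apart, which is the diversity mechanism described above.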
To ensure that the autoencoder approximates the identity function, that is, maps from physical space to the latent space and back without evolving the system forward in time, we introduce an additional term, MAE(U_t, Û_t) = |U_t − Û_t|, where Û_t = D_θ[E_ϕ(U_t)] is the output of the autoencoder without being passed through the transformer. Importantly, the encoder, decoder, and transformer are trained simultaneously, to ensure that the encoder and decoder learn temporal features when finding the latent representation; however, this additional loss constraint ensures that the encoder and decoder only perform a spatial transformation.
While research conducted by [17] found that nonlinear activation functions could recover energy at higher wavenumbers following a filtering operation, our attempts to replicate these findings were unsuccessful. Consequently, the TEPC layers in our model do not function as filters for higher modes; instead, they serve to reduce the number of modes to a latent dimension. Generative deep learning models have been shown to emphasise larger-scale features due to the diminished contribution of smaller-scale features, inadequately capturing higher-frequency modes [52]. In the context of modelling physical data, the preservation of not only large-scale properties but also energy across all scales is essential. To address the recovery of higher wavenumbers, we introduce a term into the loss function, serving as a soft constraint, ensuring the conservation of the absolute value of each Fourier mode. This involves employing the discrete Fourier transform, F, to transform U_t and Û_t and subsequently taking the modulus, giving a spectral loss term MAE(|F(U_t)|, |F(Û_t)|) that ensures the autoencoder preserves energy at all necessary scales.

Non-linear transformations are not distance-preserving; however, this is an advantageous property to have, both for the interpretability of the latent space and, as we observed, for numerical stability during training. We assert that two states that are close together in physical space should also be close together in our latent space. This is not hard-constrained in the model; instead, we add an additional term to the loss function that minimises the distance in latent space between our prediction, Z̃_{t+1}, and the encoded representation Z_{t+1} := E_ϕ(U_{t+1}) from our 'truth' training set, while retaining output diversity; it is therefore appropriate to again use the CRPS, now in latent space. This leads to the following objective function:

L = CRPS(Ũ_{t+1}, U_{t+1}) + CRPS(Z̃_{t+1}, Z_{t+1}) + MAE(U_t, Û_t) + MAE(|F(U_t)|, |F(Û_t)|).

This loss function is applied over each batch, and the gradient with respect to the model parameters is calculated to perform backpropagation updates. Through experimentation, we found there was no requirement to weight these terms to achieve convergence during training.
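The spectral soft constraint has a useful property worth noting: it compares only mode amplitudes, so it penalises lost small-scale energy while remaining blind to translations. A minimal sketch (illustrative field and truncation, not the training pipeline):

```python
import numpy as np

def spectral_mae(U, U_hat):
    """Soft spectral constraint: MAE between the moduli of the Fourier
    modes of the reference field and the reconstruction."""
    return np.mean(np.abs(np.abs(np.fft.rfft(U)) - np.abs(np.fft.rfft(U_hat))))

rng = np.random.default_rng(0)
U = rng.standard_normal(256)

# Zero for any pure translation of the field (phase does not affect |F(U)|) ...
assert np.isclose(spectral_mae(U, np.roll(U, 31)), 0.0)

# ... but positive when small-scale energy is lost (a low-pass filtered copy)
spec = np.fft.rfft(U)
spec[32:] = 0.0
U_filtered = np.fft.irfft(spec, n=256)
assert spectral_mae(U, U_filtered) > 0.0
```

This translation-blindness is consistent with the equivariant design: the pointwise MAE term fixes the phase, while this term guards the energy spectrum.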
We use 2 × 10⁶ unit time intervals for training and 2,000 unit time intervals for validation to monitor overfitting. The network receives randomly sampled mini-batches as inputs, which are short-time evolution histories of length L, U(y, t : t − L), separated by unit time intervals, of the zonally averaged zonal velocity from numerical integration, as defined in equation (4). While a time step of δt = 4 × 10⁻⁴ is required for stability of the numerical integration, sampling at such small intervals is unnecessary for the neural network to capture the dynamics. Therefore, the data input to the network is sampled at unit time intervals (2,500 times larger than the numerical integration time step). Each time step of the system is represented as a one-dimensional vector U(y, t) ∈ R²⁵⁶, with the target output being the subsequent unit time step, U(y, t + 1), also obtained from the numerical integration.
The numerical integrations were conducted using the GeophysicalFlows [53] package in Julia [54].Machine Learning models were built using the PyTorch [55] framework and all analyses were conducted in Python [56].

Model Hyperparameters
We ran a Bayesian hyperparameter sweep [57] to determine the best-performing model. The metric minimised was the Hellinger distance, defined in section 3.2, between PDFs produced via long-time evolutions of 20,000 time units from a numerical integration and from the output of the deep learning model. Through this sweep we found that the following hyperparameters were optimal: batch size: 128; optimiser: Adam; initial transformer learning rate: 5e-4; initial autoencoder learning rate: 2.5e-3; learning rate scheduler: exponential decay with decay rate γ = 0.9825; training epochs: 400; convolution filters in each TEPC layer: 4; CRPS ensemble size: 4; L (length of time history used to make the next forecast): 10; number of transformer blocks: 3.
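The sweep metric can be computed directly from histograms; assuming the standard definition of the Hellinger distance for discrete PDFs (the definition in section 3.2 is outside this excerpt), a sketch is:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete PDFs (e.g. normalised
    histograms of U from the DNS and from the emulator)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Bounded on [0, 1]: 0 for identical PDFs, 1 for disjoint support
assert hellinger([1, 0], [1, 0]) == 0.0
assert np.isclose(hellinger([1, 0], [0, 1]), 1.0)
```

Its boundedness on [0, 1] makes it a convenient scalar objective for a Bayesian sweep, unlike unbounded divergences such as KL.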

Short Term Evaluation
Given the system's stochastic nature, we compare an ensemble of model forecasts to an ensemble of numerical integrations, each with a different realisation of the forcing noise, ξ (shown in Figure 1.a-h), to evaluate the model's ability to learn the evolution of U(y, t), as opposed to a direct comparison between individual trajectories. The bottom 8 plots (i-p) in Figure 1 depict 8 emulations from the SLT, with the same time histories as (a-h) input to the model and subsequent dynamics driven by different realisations of ϵ for 500 time units. The SLT yields plausible evolutions when compared to the numerical integrations, effectively predicting the imminent coalescence event and capturing subsequent nucleation events, additional coalescence events, and latitudinal translation in a realistic manner.
To make a more quantitative assessment of the deep learning model's performance, we employ metrics for short and long forecast periods. For short forecast periods, the trajectory distribution is narrow, rendering the CRPS a suitable probabilistic measure. We compute the CRPS between the emulated ensemble (Figure 1.i-p) and a single truth trajectory, specifically the numerical integration in Figure 1.a. While a lower CRPS indicates better performance, it is most useful to compare the CRPS of the ensemble forecast to that of the ensemble of numerical integrations, both with respect to the same reference trajectory. We separate the CRPS, equation (8), into its MAE and ensemble variation components for visualisation in Figure 4. The SLT exhibits strong agreement with the numerical integration throughout the period. Both reach saturation at ∼t = 250, indicating that trajectories have sufficiently diverged and the MAE is no longer a meaningful metric. We also compare the outputs of a temporal Variational Autoencoder (VAE) [3], a popular generative model architecture, shown in blue. The VAE demonstrates notably poorer performance, as evidenced by the larger MAE values. Performing the same analysis with the ensemble variation term, displayed in Figure 4.b, the SLT again shows good agreement with the numerical integration, while the VAE shows a much greater forecast spread. In the Supplementary Information, we demonstrate the SLT's similar effectiveness for trajectories with different initial conditions.
One goal of this research was to create a model that is computationally more efficient than numerical integration. Each emulation in Figure 1.i-p took 8.35 milliseconds to produce 500 future time units using the SLT, while each numerical integration in Figure 1.a-h took 29.8 minutes to produce a time series over the same period: the SLT is therefore ∼2.13×10^5 times faster on identical hardware (an NVIDIA P100 GPU running CUDA 10.2).

Long Time Evolutions
To assess the SLT's performance over extended evolutions, we conduct an emulation for 20,000 time units, shown in Figure 5.b, and compare it with a numerical integration spanning the same duration, shown in Figure 5.a. Given the rapid divergence of trajectories under different noise realisations, initial conditions become less significant and an ensemble is not required, with the long evolutions capturing the broad spectrum of system dynamics. The results from the SLT in Figure 5.b show good agreement with the numerical integration, demonstrating consistent and expected dynamics throughout the integration period.
This performance, however, could not be replicated with alternative generative models, such as the temporal Variational Autoencoder (VAE) in Figure 5.c. The VAE exhibits nonphysical behaviour at ∼t=5000 while displaying a lower frequency of nucleation events, coalescence events, and latitudinal translation compared to the numerical integration. These discrepancies stem from significant deviations between the VAE outputs and the training data distribution, as evident from the short-term errors in Figures 4.a-b. Although an adversarial VAE in Figure 5.d produces qualitatively plausible results, it experiences mode collapse, resulting in cyclically repeating features over a shorter evolution period. These limitations are not observed with the SLT architecture, trained using the CRPS. Further comparisons between the SLT architecture and these alternative models are discussed in Appendix B.
For a quantitative assessment of longer-term evolutions, we examine statistical properties of the flow. We generate Probability Density Functions (PDFs) for individual U values and their derivatives, ∂yU and ∂tU. These PDFs serve to measure the learned spatial and temporal correlations. We conduct this analysis over a time integration period of 200,000 time units, producing two PDFs: p(U) representing the PDF from numerical integration, U, and q(Ũ) representing the PDF of forecasted values derived from the SLT, Ũ.
In Figures 6.a-c, we present the distributions of the individual constituents of p(U) and q(Ũ): U, ∂yU, and ∂tU, along with their corresponding Ũ equivalents. When assessing the disparity between distributions, the comparison of distribution tails serves as a valuable metric to determine the emulator's efficacy in representing the entire distribution and accurately capturing rare-event statistics. Figures 6.a-c show good agreement between the numerical integration (in green) and the SLT (in red), showcasing an appropriate representation of the distribution tails. This suggests that the SLT successfully captures events with limited representation in the training dataset, demonstrating its capacity to faithfully approximate the entire distribution. In particular, our observations indicate that the transformer has learned the exact form of the temporal correlations, showcasing minimal discrepancy between the PDFs in Figure 6.c. Figure 9, in the Supplementary Figures, displays the joint PDFs between each of these components.
We quantify the distance between these distributions, following [26], using the Hellinger distance, H, as our measure: H(p, q) = (1/√2) [∫ (√p(U) − √q(U))² dU]^{1/2}, where the vector U = (U, ∂yU, ∂tU). In Figure 6.d, we observe how H changes with the size of the latent dimension D_M, with a distinct change in the gradient dH/dD_M when D_M = 64. For all evaluation results presented in this paper, we set D_M = 64, as only marginal improvements in H are observed when further increasing the degrees of freedom. This corresponds to a compression factor of 1024 compared to the numerical integration, which is required to integrate over a 2D domain: R^{256×256} → R^{D_M=64}. Figure 6.d also depicts the H value between two trajectories generated through numerical integration (indicated by the grey dashed line), giving context to the values of H obtained. Using H as a metric, we can directly compare outputs across various models and hyperparameter selections, as detailed in section 2.4.
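In practice H can be estimated from normalised histograms of the long evolutions; a minimal sketch follows (the binning choices here are illustrative, not the paper's).

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete PDFs:
    H = (1/sqrt(2)) * || sqrt(p) - sqrt(q) ||_2, bounded in [0, 1]."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

# histogram two sample sets onto shared bins before comparing
rng = np.random.default_rng(0)
a, b = rng.standard_normal(50_000), rng.standard_normal(50_000)
bins = np.linspace(-5, 5, 101)
H = hellinger(np.histogram(a, bins)[0], np.histogram(b, bins)[0])
```

For the 3D PDFs over (U, ∂yU, ∂tU) the same function applies to flattened `np.histogramdd` counts on a shared grid.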
We can also measure the duration spent in each jet configuration and the occurrence frequency of state transitions (coalescence or nucleation events). PDFs of the number of jets at time t, and of changes in jet count at time t + 1, are generated from longer evolutions spanning 10^6 time units, both from numerical integration and the SLT. In Figure 6.e the frequencies of these PDFs closely align between the numerical integration (solid lines) and the SLT (dashed lines), indicating that the SLT adequately reproduces the underlying system.
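A simple way to build these diagnostics is to count the jets at each observable time and tally the changes between consecutive times. The jet-count proxy below (contiguous eastward bands where U > 0) is our assumption for illustration, not necessarily the paper's exact definition.

```python
import numpy as np

def jet_count(U_y, periodic=True):
    """Count eastward jets as contiguous latitudinal bands with U > 0
    (a simple proxy; counts rising edges of the U > 0 mask)."""
    mask = np.asarray(U_y) > 0
    if not mask.any():
        return 0
    prev = np.roll(mask, 1) if periodic else np.concatenate([[False], mask[:-1]])
    return int(np.sum(mask & ~prev))

def transition_pdf(counts):
    """Empirical probabilities that the jet count increases, decreases,
    or stays the same between consecutive observation times."""
    diff = np.diff(counts)
    n = len(diff)
    return {"up": np.sum(diff > 0) / n,
            "down": np.sum(diff < 0) / n,
            "same": np.sum(diff == 0) / n}
```

Applied to a long latitude-time record, `transition_pdf` yields the increase/decrease/constant frequencies plotted in Figure 6.e.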
Neural networks have been shown to exhibit a spectral bias [52], wherein they tend to excel at capturing low-frequency modes but struggle to represent higher frequencies that contribute less to the overall energy. The spectral loss term, outlined in section 2.3, is employed to combat this by penalising outputs from the decoder that deviate spectrally from the training data. In Figure 6.f, the time-averaged power spectral density is plotted for both the numerical integrations (in green) and the autoregressive outputs of the SLT (in red). We observe a high level of agreement across all scales, paying particular attention to the higher wavenumbers where data-driven approaches often fail. These findings, in conjunction with the aforementioned evaluation metrics, allow one to place confidence in the SLT's capacity to faithfully emulate the dynamics of the system under consideration. This positions the SLT as a valuable tool for investigating dynamical phenomena observed in the beta-plane system, facilitated by its computational speed-up.
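The PSD diagnostic, and a penalty of the kind described, can be sketched as follows. This is a hedged sketch: the paper's exact loss formulation is given in its section 2.3 and may differ from the log-spectrum mean-squared error used here.

```python
import numpy as np

def time_averaged_psd(U):
    """Time-averaged power spectral density of U(y, t) in the
    latitudinal direction; U has shape (time, y)."""
    U_hat = np.fft.rfft(U, axis=1)
    return np.mean(np.abs(U_hat) ** 2, axis=0)

def spectral_loss(U_true, U_pred):
    """Penalise deviation between log-spectra, weighting all
    wavenumbers comparably so high-k errors are not drowned out."""
    eps = 1e-12
    return float(np.mean((np.log(time_averaged_psd(U_true) + eps)
                          - np.log(time_averaged_psd(U_pred) + eps)) ** 2))
```

Taking logarithms before comparing is one common way to counteract spectral bias, since the low-energy high-wavenumber tail would otherwise contribute negligibly to the loss.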

Characterising Transition Events
In chaotic or stochastically driven systems, predicting transition events, such as nucleation and coalescence in the beta-plane system, is of interest. To accurately determine lead times from initial conditions, ensembles are required to account for the system variability; however, if the numerical solvers are too computationally expensive, generating large ensembles for such studies may not be feasible. Previous work has artificially upsampled the frequency of these spontaneous transition events using a rare-event algorithm [58] or by augmenting the training data for deep learning of extremes [59]. The introduction of the SLT now allows us to efficiently generate large ensembles of trajectories without the prohibitive costs associated with numerical solvers.

Figure 7: PDFs of time to coalescence events. a displays a reference coalescence event generated through numerical integration, with dashed lines representing initial conditions located 250, 150, 50, and 10 observable time units away from the event. Note that the x-axis is shifted with 0 at the occurrence of the coalescence event. b-e depict PDFs of the time taken between each initial condition and the coalescence event. The ensemble consists of 10^6 instances generated using the SLT (red) and 10^3 instances obtained through numerical integration (green). The orange dashed lines indicate the 5th and 95th percentiles of the probability density, and we display the corresponding time within this bound. The blue lines represent the 25th and 75th percentiles for the red curve. As the initial conditions approach the reference event, the PDFs become sharper, and the dynamics become increasingly deterministic. The green and red curves show agreement, indicating that the SLT successfully captures the expected distributions.
In this experiment, we use two reference events generated by numerical integration: a coalescence event shown in Figure 7.a and a nucleation event shown in Figure 8.a. For the coalescence event, we analyse initial conditions set at different lead times, 250, 150, 50, and 10 time units before the reference event, indicated by the dashed lines in Figure 7.a. To estimate the probability density function (PDF) of the time interval between each initial condition and a coalescence event, we perform forward integrations from each initial condition using an ensemble of 10^6 trajectories with the SLT. The resulting PDFs are displayed in Figures 7.b-e.
As the initial conditions approach the reference coalescence event, the PDF progressively narrows, signalling a shift towards more deterministic behaviour in the system. This transition can be quantified by examining the variance of each PDF. For instance, with initial conditions at t = 150 units from the reference event, as shown in Figure 7.c, the SLT gives a 90% probability of a coalescence occurring within the time range of 110-273 units, as denoted by the span between the orange dashed lines in Figures 7.b-e. As the initial conditions move closer to the event, this time window tightens to 21-53 units when t = 50 units away, and further to 9-13 units when t = 10 units from the event. In Figures 7.b-e, the blue lines delimit a 50% probability of occurrence, with the width of this interval also contracting progressively as the initial conditions approach the reference event.
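The probability windows quoted above are percentile ranges of the ensemble's first-passage times. A minimal sketch of how such a window is read off an ensemble follows; the gamma-distributed samples are synthetic stand-ins for illustration, not the paper's data.

```python
import numpy as np

def probability_window(event_times, low=5.0, high=95.0):
    """Return the [low, high] percentile window of ensemble
    first-passage times, e.g. a 90% window between the 5th and
    95th percentiles."""
    lo, hi = np.percentile(np.asarray(event_times, float), [low, high])
    return lo, hi

rng = np.random.default_rng(0)
far = rng.gamma(shape=4.0, scale=40.0, size=100_000)    # broad PDF: far from event
near = rng.gamma(shape=25.0, scale=0.5, size=100_000)   # sharp PDF: near the event
```

As in Figures 7.b-e, the window computed from the `near` ensemble is much tighter than that from the `far` ensemble, quantifying the shift towards deterministic behaviour.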
To validate the PDFs produced by the SLT, we generate ensembles of size 10^3 through numerical integration, represented by the green plots in Figure 7. The close agreement observed between the PDFs affirms the reliability of the results and underscores the SLT's effectiveness as an emulator for predicting transition events. Generating each ensemble of size 10^6 for 500 time units in Figure 7.b using the SLT required only 16 minutes on two GPUs (NVIDIA P100). In contrast, producing the 10^3 ensemble through numerical integration took around 10 days on identical hardware, making such analysis infeasible for very large ensemble sizes.
Figure 8.a displays the reference nucleation event. Analysing the PDFs in Figures 8.b-e, we observe a considerably broader dispersion and a distinct skew when the reference event is positioned 200 and 50 time units away, in contrast to the coalescence events in Figure 7. Irrespective of the lead time to the reference event, there is a notable probability of a nucleation event occurring shortly after the initial conditions. This implies that initial conditions with a 2-jet configuration lie in a region of phase space more sensitive to stochastic forcing, and thus more susceptible to undergoing a transition, than a 3-jet state.
Comparing the breadth of each PDF for nucleation and coalescence events reveals notable distinctions. For initial conditions at t = 50 units from the reference nucleation event (see Figure 8.c), the time span corresponding to a 90% probability of occurrence extends to 254 forecasted time units, contrasting with a span of only 33 time units for the coalescence event (see Figure 7.d). When only t = 10 units away from their respective reference events, coalescence (Figure 7.e) can be predicted within a considerably narrower time window than nucleation (Figure 8.e). From this, we conclude that the system exhibits higher variability and reduced predictability preceding nucleation events than preceding coalescence events.

Conclusion and Future Work
In this study, we demonstrate the ability to capture the dynamics of a stochastically driven system, namely beta-plane turbulence, using a probabilistic neural network. We demonstrate that the deep learning model remains a faithful representation of the system, even over very long time evolutions, and achieves a speedup of over five orders of magnitude over traditional numerical schemes. Furthermore, our work showcases the advantages of the Stochastic Latent Transformer, trained using the Continuous Ranked Probability Score and a spectral loss, when compared to existing state-of-the-art methods for probabilistic modelling of fluid flows, namely VAE- and adversarial-based training.
If the beta-plane model is accepted as a very simple analogue of mid-latitude weather variation, then the deep learning approach presented here demonstrates its capability in learning a combined specification of large-scale dynamical evolution that includes the effects of the parameterised eddies, in contrast to work that learns only a specification of the eddy forcing. This appears to provide a successful emulation of weather variability that, importantly, remains faithful over long timescales.
The speedup achieved by the neural network allows for the forecasting and characterisation of spontaneous transition events, providing probability estimates for specific lead times. This provides a tool to address questions previously unattainable with direct numerical integration methods. Our investigation highlights an asymmetry in system predictability: the system selects a default configuration (under the parameters used, a 2-jet state), with nucleation events away from this state highly unpredictable and predominantly driven by stochastic forcing, while coalescence events returning to this default state act in a much more deterministic manner.
To investigate this asymmetry further, the study could be expanded across a range of parameter regimes of β and µ to determine whether the observed results persist under varying preferred jet configurations. Since the neural network is trained in a specific parameter regime, transfer learning [19] could be employed to enable the model's adaptation to alternative regimes with reduced additional training requirements.
Extending this work to model beta-plane turbulence with multiple layers, to capture a system closer to observations, could be done in one of two ways. One method involves explicit stochastic forcing to parameterise the fluctuation fields, which the Stochastic Latent Transformer models, as here. In the other, a system capable of generating turbulence without stochastic forcing could be employed, with the Stochastic Latent Transformer implicitly modelling the unresolved degrees of freedom to capture large-scale spatiotemporal features. The architecture is flexible to both approaches, with modifications required only in the encoder and decoder to accommodate 2D and 3D inputs, while retaining the same stochastic transformer in the latent space. This extension of the probabilistic deep learning model, adaptable to any SPDE, would allow for the exploration of various turbulent flows, including more complex geophysical fluid dynamics phenomena.

Open Research
All of the code used in this paper is available under an Apache license at: https://github.com/Ira-Shokar/Stochastic_Latent_Transformer.
All of the data used in this paper is under an Apache license at: https://zenodo.org/records/10034268.

A Numerical simulations of geostrophic turbulence
Direct numerical simulation of the vorticity in equation (1) is performed on a 2D doubly-periodic square domain (x, y) ∈ [0, 2π]^2, using a pseudo-spectral method with N = 256 grid points in each direction; the full numerical scheme is outlined in [60]. Each of the variables can be written in terms of its discretised Fourier transform relating spatial coordinates x = (x, y) to wavevectors k = (kx, ky), together with a standard 2/3 dealiasing rule. For the case of vorticity, this is ζ(x, t) = Σ_k ζ̂(k, t) e^{ik·x}, with the allowed wavenumbers in each direction kx, ky ∈ {−N/2 + 1, ..., N/2}, where it is assumed that N is even. The vorticity equation is a stochastic partial differential equation that can be written in terms of linear and nonlinear operators, ∂t ζ̂ = L[ζ̂] + N̂[ζ] + ξ̂. Note that the Fourier transform commutes with the linear operator but not with the nonlinear operator. Consequently, the linear and stochastic terms are time-stepped entirely in Fourier space using a fourth-order Runge-Kutta algorithm, whereas the nonlinear terms must be computed in physical space before being transformed back to Fourier space.

To promote the spontaneous emergence of zonal flows, rather than being directly forced, the fluid is stirred using a stochastic vorticity force ξ(x, t) with zero mean (⟨ξ(x, t)⟩ = 0), injected onto an annulus of wavevectors in Fourier space centred around a mean radial wavenumber k_f with thickness δk = 1. To prevent the direct forcing of purely zonal or purely meridional flows, we exclude wavevectors of the form k = (0, k) and k = (k, 0). The stochastic forcing is expressed as a sum over the forced wavevectors, ξ(x, t) = Σ_{k ∈ F} A(ε, N_f) η(k, t) e^{ik·x}, where the amplitude A is set by the energy injection rate ε and N_f, the number of forced wavevectors in Fourier space, with Markovian coefficients η satisfying the Hermitian property (η(−k, t) = η*(k, t)) to ensure that the forcing function ξ(x, t) is real. In the limit of decorrelation time → 0, the forcing takes the form of white noise with η(k, t) = X_n, where the X_n are independent, identically distributed random variables given by X_n = e^{iθ_n} with uniformly distributed random phase θ_n ∈ [0, 2π), and have mean ⟨X_n⟩ = 0 and covariance ⟨X_m X*_n⟩ = δ_mn.

Hyperviscosity, ν_n ∇^{2n} ζ, is introduced to mitigate energy accumulation in small-scale fluctuations near the largest retained wavenumber, k_max, implemented in Fourier space as a damping −ν |k|^{2n} ζ̂, where ν is the hyperviscosity coefficient and n is the hyperviscosity index. Simulations start from rest with u = ζ = 0, and parameters remain constant throughout the integration.
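The annular, Hermitian-symmetric forcing can be sketched in a few lines. The amplitude is left unnormalised and k_f is chosen arbitrarily here; the simulation scales the forcing by the energy injection rate ε.

```python
import numpy as np

def stochastic_forcing(N=256, k_f=16.0, dk=1.0, seed=0):
    """One white-in-time realisation of the stochastic vorticity forcing:
    unit-amplitude random phases on the annulus |k| ~ k_f in Fourier
    space, with purely zonal/meridional wavevectors excluded, and
    Hermitian symmetry enforced so the physical-space field is real."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(N, d=1.0 / N)               # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k_mag = np.hypot(kx, ky)
    forced = (np.abs(k_mag - k_f) <= dk / 2) & (kx != 0) & (ky != 0)
    eta = np.where(forced, np.exp(1j * rng.uniform(0, 2 * np.pi, (N, N))), 0.0)
    # enforce eta(-k) = conj(eta(k)): the index map i -> (-i) mod N
    eta = 0.5 * (eta + np.conj(np.roll(np.flip(eta), 1, axis=(0, 1))))
    return np.fft.ifft2(eta).real

xi = stochastic_forcing()
```

Because the k = 0 mode is never forced, the resulting field has zero spatial mean, matching the ⟨ξ⟩ = 0 requirement above.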
The time for the total kinetic energy to reach a statistically steady state during spin-up depends on the linear damping rate µ; a steady state is typically achieved by µt = 2-3 (here t = 50-75). Only data points beyond this point are used for training and testing the deep learning model.

B Comparisons With Other Architectures
In Figure 5 we can see that the SLT produces qualitatively compelling results. However, using other neural network architectures we could not reproduce this stability over long-time evolutions. An established method for producing probabilistic generative models is the Variational Autoencoder (VAE). Recent research [13] used a combination of a VAE with a Transformer evolving the latent variables. Within this framework, the VAE maps each latent variable to a distribution parameterised by its mean, µ, and standard deviation, σ, as opposed to a single vector. Samples are subsequently drawn from this distribution: Z = µ + σ ⊙ ϵ with ϵ ∼ N(0, 1).
For comparison, we replace the Stochastic Transformer in the SLT architecture with a multi-layer LSTM that takes in Z′_{t−L:t}, after phase alignment. The LSTM outputs µ and σ, from which a sample µ_{t+1} + σ_{t+1} ⊙ ϵ is drawn and passed to a nonlinear layer, followed by a linear layer, to output Z′_{t+1}. The phase ϕ(t) is reintroduced, as with the SLT, before the result is passed to the decoder, as previously.
We found that using a VAE led to instability in autoregressive outputs, as shown in Figure 5.c. Upon investigation, the root cause was determined to be the prevalent blurring effect, a challenge frequently encountered with VAEs. This occurs as the VAE is trained to minimise the Evidence Lower Bound, ELBO = MSE(U, Ũ) + D_KL[N(µ, σ)||N(0, 1)], which is a lower bound on the intractable likelihood. The MSE reconstruction loss in physical space assumes a Gaussian distribution for U and prioritises values near the mean with higher probability rather than those with greater variance, leading to a conservative model. At later time steps, the model is asked to make predictions using the output from a previous time step that lies outside the distribution of the training dataset due to growing errors, leading to nonphysical outputs.
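A minimal NumPy sketch of this objective and the reparameterised sampling follows; the actual baseline is a neural network, and the function names here are ours.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Sample Z = mu + sigma * eps with eps ~ N(0, 1),
    parameterising sigma through its log-variance."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(np.shape(mu))

def neg_elbo(U, U_tilde, mu, log_var):
    """Negative ELBO: MSE reconstruction term plus the closed-form
    KL divergence D_KL[N(mu, sigma^2) || N(0, 1)]."""
    mse = np.mean((np.asarray(U) - np.asarray(U_tilde)) ** 2)
    kl = -0.5 * np.mean(1.0 + log_var - np.square(mu) - np.exp(log_var))
    return mse + kl
```

The KL term vanishes when µ = 0 and σ = 1, so the loss reduces to the pure MSE responsible for the blurring effect discussed above.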
To overcome this, the addition of an adversarial loss term was considered. Adversarial training methods have become very popular for producing probabilistic generative models [61], as they add an additional classifier network that must also converge during training, ensuring that the distribution of the generative model's outputs matches that of the dataset [62]. This classifier informs the generating network (in our case the VAE) whether its samples are similar to those of the training dataset. As this secondary network is also differentiable, backpropagation updates the weights of the generator network to produce results closer to the targets from the training dataset, while the classifier improves in its ability to distinguish between the two classes. Both networks improve until an equilibrium is reached, where the generator produces samples that the classifier cannot reliably distinguish, these samples being drawn from the same distribution as the training dataset.
The classifier network, referred to as the Discriminator with weights χ, is trained using the following objective function: max_χ E_U[log D_χ(U)] + E_Ũ[log(1 − D_χ(Ũ))], (B1), where Ũ is the output from the VAE outlined above. The goal of the discriminator is for the first term in equation (B1), the average prediction of the discriminator on real data sampled from the training dataset, to be as large as possible, corresponding to correctly identifying real data with high probability. The second goal of the discriminator is to correctly identify generated data from the VAE as fake with high confidence, corresponding to log D_χ(Ũ) being as small as possible.
To turn the expression into a maximisation problem, the term log(1 − D_χ(Ũ)) is used in equation (B1). The real data U from the dataset is assigned the label 1, while the fake, generated data Ũ is assigned the label 0. The Discriminator outputs a value in [0, 1], attempting to classify the data correctly.
The goal of the generator (here the VAE) is for log(1 − D_χ(Ũ)) to be as small as possible, corresponding to the Discriminator believing that the generated data is real with high confidence. The following loss function is evaluated, appending an additional adversarial term to the ELBO loss, weighted by a parameter γ: L = ELBO + γ log(1 − D_χ(Ũ)). However, this approach is not without its challenges; predominantly, issues of non-convergence and mode collapse may arise. Non-convergence arises partly because learning the discriminator from data with few steps of an optimisation procedure, and subsequently using it to obtain the gradient of the objective with respect to the generator, leads to biased gradient estimates. Mode collapse is a consequence of unstable adversarial training, in which the generative distribution collapses to a single point. In Figure 5.d we observe mode collapse in the form of cyclically repeating features with a period of 1500 observable timesteps, where the generative model has learned dynamics that fool the discriminator network but are physically unrealistic.
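Equation (B1) and the combined generator objective can be written compactly. A sketch over NumPy arrays follows, where D_real and D_fake stand for discriminator outputs on real and generated batches (names are ours).

```python
import numpy as np

def discriminator_loss(D_real, D_fake):
    """Negative of the objective in equation (B1), so it can be minimised:
    -( E[log D(U)] + E[log(1 - D(U_tilde))] )."""
    return -(np.mean(np.log(np.asarray(D_real)))
             + np.mean(np.log(1.0 - np.asarray(D_fake))))

def generator_loss(elbo, D_fake, gamma=1.0):
    """ELBO plus the adversarial term gamma * E[log(1 - D(U_tilde))];
    fooling the discriminator (D_fake -> 1) drives this term down."""
    return elbo + gamma * np.mean(np.log(1.0 - np.asarray(D_fake)))
```

A confident discriminator (high D_real, low D_fake) attains a lower loss than a confused one, while the generator lowers its own loss precisely by raising D_fake, which is the adversarial tension described above.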

C Supplementary Figures
Figure 9: PDFs displaying joint probabilities of U, ∂yU and ∂tU. a, for visualisation of the 3D PDF used to evaluate the Hellinger distances in Figure 6, we average over ∂tU to obtain the density of values in the U-∂yU space computed from the 10^6 time steps via numerical integration. b shows the same U-∂yU space from an evolution generated using the SLT with latent dimension size 64. c shows the U-∂tU space by averaging over ∂yU for the numerical integration values, and d shows the same as c for an evolution generated using the SLT.
Here we can see that again the SLT shows good agreement with the numerical integration, covering the full space of dynamics.

Figure 3 :
Figure 3: Schematic of stochastic multi-headed attention. The initial transformer block among the N blocks incorporates a stochastic variant of multi-headed attention. Here ϵ ∼ N(0, 1) is introduced as an additional sequence member, L represents the length of the latent time history, and D_M denotes the latent dimension. Weights, W, linearly transform inputs into Q (Query), K (Key), and V (Value) vectors. The ϵ input is identical for both the K and V vectors. Dotted lines in the A (Attention) matrices separate the additional row introduced from cross-attention, given forcing ϵ, from the self-attention of the latent time histories. This illustrates the case where the batch size = 1 and the number of heads = 1, with reshaping implemented to extend to multiple heads. The non-stochastic self-attention in the following transformer blocks operates in exactly the same manner, simply without the concatenation of ϵ.
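The mechanism in Figure 3 can be sketched for a single head: ϵ is appended as an extra sequence member before forming K and V, so the attention matrix gains one extra entry per query that lets the forecast attend to the stochastic forcing. This is a simplified single-head sketch; the exact arrangement of the extra row in the SLT may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stochastic_attention(Z, eps, Wq, Wk, Wv):
    """Single-head stochastic attention sketch.
    Z: (L, D) latent time history; eps: (D,) noise ~ N(0, 1).
    eps is concatenated to the inputs of K and V (not Q)."""
    ZK = np.vstack([Z, eps])                       # (L + 1, D): history + noise
    Q, K, V = Z @ Wq, ZK @ Wk, ZK @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (L, L + 1) attention matrix
    return A @ V                                   # (L, D)

rng = np.random.default_rng(0)
L, D = 4, 8
Z = rng.standard_normal((L, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
out = stochastic_attention(Z, rng.standard_normal(D), Wq, Wk, Wv)
```

Because ϵ enters only through K and V, the deterministic history still attends to itself, while each query additionally weighs the stochastic forcing member.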


Figure 4 :
Figure 4: Performance evaluation over short time scales. a presents the Mean Absolute Error (MAE) for short-term tracking ability. Green represents an ensemble of 7 numerical integrations, while red shows 7 emulations from the SLT with respect to the reference truth trajectory (Figure 1.a). Blue indicates an ensemble produced by a Variational Autoencoder (VAE). b showcases the ensemble variation. Using these evaluation metrics, we observe that the SLT exhibits strong agreement with the numerical integration.

Figure 5 :
Figure 5: Assessing model faithfulness over long-term evolutions. a presents a long-term evolution obtained through numerical integration, spanning 20,000 observable time units (from a longer evolution of t_max = 10^6). b displays a long-term evolution generated by the SLT, showing qualitative similarity to a with the occurrence of nucleation and coalescence events, as well as latitudinal translation. c illustrates a long-term evolution produced by a temporal Variational Autoencoder (VAE) that becomes unstable. d depicts a long-term evolution spanning 5,000 time increments generated by a VAE with adversarial training, which suffers from mode collapse, resulting in cyclically repeating features. These emulations demonstrate the robustness of the autoregressive SLT in capturing and maintaining realistic long-term dynamics, which could not be replicated by existing architectures.

Figure 6 :
Figure 6: Performance evaluation of long time evolutions. a PDF of values of U, from the numerical integration in green and the SLT in red, obtained from long evolutions of 10^6 time units. b same as a for values of ∂yU, to demonstrate learned spatial correlations. c same as a and b, for values of ∂tU, to demonstrate learned temporal correlations. d illustrates the Hellinger distance H(p, q) between PDFs of U, ∂yU, and ∂tU obtained from long evolutions of 10^6 time units from the SLT, as the latent dimension increases (blue). It also shows H between two numerical integrations with different noise realisations for comparison (grey). e depicts the PDF for the likelihood of the system transitioning in the number of jets, based on 10^6 time units of observable time from both numerical integration (solid lines) and SLT (dashed lines). The PDFs are derived from jet-count frequencies at observable time t, showing whether the count increases (red), decreases (blue), or remains constant (green) at observable time t + 1. f shows the time-averaged power spectral density for both the numerical integration and the SLT, displaying the energy content of each wavenumber in the zonally-averaged zonal flow, U. Using these evaluation metrics, we observe that the SLT again exhibits strong agreement with the numerical integration over long time evolutions.

Figure 8 :
Figure 8: PDFs of time to a nucleation event. Here we show the same information as in Figure 7, but for a reference nucleation event, shown in a, with the dashed lines indicating initial conditions 250, 150, 50, and 10 observable time units from the nucleation event. While the PDFs in b-e become much sharper as the initial conditions move closer to the reference event, the time scales over which this occurs indicate that nucleation events are driven much more by stochastic dynamics than the coalescence events. Again, the green and red curves agree, indicating that the SLT successfully captures the expected distributions.

Figure 10 :
Figure 10: Latitude-time plots of U(y, t) showing a different set of initial conditions to those in Figure 1. a-h exhibit numerical integrations with identical initial conditions up to t = 10 (indicated by the dotted line) and distinct realisations of random noise, ξ, after t = 10, spanning a forecast period of 500 time units. i-p showcase neural network emulations with the same initial conditions as (a-h), but with different noise histories ϵ ∼ N(0, 1) after t = 10. Here the initial conditions shown to the model (left of the dashed line) differ from those in Figure 1, displaying model performance from an initial two-jet state, subject to the more stochastically driven dynamics before a nucleation event.

Figure 11 :
Figure 11: Evaluative metrics of short-term evolutions, showing the same information as in Figures 4.a-b, but for the ensembles shown in Figure 10. a presents the Mean Absolute Error (MAE) for short-term tracking ability. Green represents an ensemble of 7 numerical integrations (Figures 10.b-h), while red shows 7 emulations from the SLT (Figures 10.i-o) with respect to the reference truth trajectory (Figure 10.a). Blue indicates an ensemble produced by a Variational Autoencoder (VAE). b showcases the ensemble variation. We again show that the SLT is in very good agreement with the numerical integration.
decoder, consisting of what we shall refer to as Translation Equivariant Pointwise 1D Convolution Layers (TEPC); see Figure 2.a for illustration. Consequently, the resulting latent space, Z ∈ R^{D_M},

Figure 2: Schematic of the Stochastic Latent Transformer (SLT) architecture and its components. a Translation Equivariant Pointwise 1D Convolution (TEPC) Layer. The layer convolves inputs X with learned weights W in the frequency domain, with weights W learned irrespective of the phase, ϕ (ϕ is the argument of the first mode of X). b Stochastic Latent Transformer (SLT) architecture. Solid arrows indicate the forward pass. The encoder employs two TEPC layers with a nonlinear activation function, σ, to encode the short time history of Ut−L:t. The resulting Zt−L:t is fed into the Stochastic Transformer (ST) for forecasting Zt+1. The dotted line indicates the autoregressive flow (Zt−L+1:t+1) for forecasting subsequent steps up to Ztmax during inference. The decoder then transforms Zt+1 (or Zt+1:tmax) back to the physical space Ũt+1 (or Ũt+1:tmax), mirroring the encoder's architecture. c Stochastic Transformer (ST) architecture.
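The phase-alignment idea underlying the TEPC layer, factoring out the translational phase ϕ (the argument of the first Fourier mode), operating on the aligned representation, then reintroducing ϕ, can be sketched as follows. The function names are ours, and the repository's implementation may differ in detail.

```python
import numpy as np

def phase_align(x):
    """Remove the translational phase from a real periodic 1D signal:
    rotate the spectrum so the first Fourier mode has zero argument.
    Returns the aligned signal and the extracted phase phi."""
    x_hat = np.fft.rfft(x)
    phi = np.angle(x_hat[1])                  # phase of the first mode
    k = np.arange(x_hat.size)
    aligned = np.fft.irfft(x_hat * np.exp(-1j * k * phi), n=x.size)
    return aligned, phi

def restore_phase(x, phi):
    """Invert phase_align: translate the signal back by phi."""
    x_hat = np.fft.rfft(x)
    k = np.arange(x_hat.size)
    return np.fft.irfft(x_hat * np.exp(1j * k * phi), n=x.size)

# any circular translation of x maps to the same aligned signal, which is
# what makes learned pointwise operations on it translation-equivariant
y = np.linspace(0, 2 * np.pi, 128, endpoint=False)
x = np.cos(y) + 0.5 * np.cos(2 * y + 0.3)
x_shift = np.roll(x, 17)
```

Since a circular shift multiplies mode k by a pure phase e^{−2πiks/N}, removing the mode-1 phase cancels that factor for every k, so weights applied in the aligned frame act identically on all translates of the same pattern.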