A comparison of data-driven approaches to build low-dimensional ocean models

We present a comprehensive inter-comparison of linear regression (LR), stochastic, and deep-learning approaches for reduced-order statistical emulation of ocean circulation. The reference dataset is provided by an idealized, eddy-resolving, double-gyre ocean circulation model. Our goal is to conduct a systematic and comprehensive assessment and comparison of skill, cost, and complexity of statistical models from the three methodological classes. The model based on LR is considered as a baseline. Additionally, we investigate its additive white noise augmentation and a multi-level stochastic approach, deep-learning methods, hybrid frameworks (LR plus deep learning), and simple stochastic extensions of deep-learning and hybrid methods. The assessment metrics considered are: root mean squared error, anomaly cross-correlation, climatology, variance, frequency map, forecast horizon, and computational cost. We found that the multi-level linear stochastic approach performs best for both short- and long-timescale forecasts. The deep-learning hybrid models augmented by additive state-dependent white noise came second, while their deterministic counterparts failed to reproduce the characteristic frequencies in climate-range forecasts. Pure deep-learning implementations performed worse than LR and its noise augmentations. The skills of LR and its white-noise extension were similar on short timescales, but the latter performed better on long timescales, as LR-only outputs decay to zero in long simulations. Overall, our analysis promotes multi-level LR stochastic models with memory effects, and hybrid models with a linear dynamical core augmented by additive stochastic terms learned via deep learning, as a more practical, accurate, and cost-effective option for ocean emulation than pure deep-learning solutions.


Introduction
Medium-range weather forecast models routinely use computationally expensive Ocean General Circulation Models (OGCMs) that are coupled to the atmosphere model. However, the long timescales of ocean dynamics and the weak influence of the deeper layers of the ocean on the atmosphere for a weather forecast of, say, a couple of days justify the investigation of replacements of expensive OGCMs with low-dimensional data-driven models that can run at negligible cost and emulate the upper ocean. Here, these models are referred to as ocean "emulators", because they emulate statistical properties of the flow rather than simulating the dynamics derived from physical principles. Our definition of emulators is slightly different from that of a number of studies that try to replace components of existing models to reduce computational cost. In our case, we do not aim to emulate a model component but rather the physical system that has generated the data: a dynamical ocean model. Ocean emulators are also useful (i) in long-time climate-type model simulations for process sensitivity studies, (ii) in climate prediction, and (iii) for improving ensemble forecast statistics. Furthermore, ocean emulators can potentially be down-scaled and used for data-driven parameterizations of mesoscale (and even sub-mesoscale) eddies for non-eddy-resolving and eddy-permitting comprehensive OGCMs. Finally, they can also be used as conceptual toy models for process-related studies (e.g., as kinematic flow emulators of material transport).
Physical ocean models have the fundamental advantage that they can operate even without training data, which may simply not exist. The main problems with physical models are that, for some practical applications, they can be prohibitively expensive or may not resolve all important features; some of the involved physics can be inaccurately accounted for; and numerical and discretization errors can be unacceptable. On the other hand, data-driven emulators, which are the focus of this study, can be much cheaper, accurate (for the data-trained regimes), and simple to deal with, which gives them a crucial advantage for many practical problems. However, they can be hard to interpret physically, as many of them are ultimately used as a black box, e.g., machine-learning-based methods. Low-cost emulators can be constructed in terms of the Empirical Orthogonal Functions (EOFs) and their Principal Components (PCs; Lorenz (1956)), but the true governing equations in the EOF space are always unknown. Our approach to ocean emulation is supported by the existing and rapidly developing methodologies for statistical data-driven modeling (reviewed in Rowley and Dawson (2017); Brenner et al. (2019)). We consider three major statistical model types: linear regression (LR), stochastic, and deep learning.
LR belongs to the broader family of regression-only statistical models, in which polynomial surfaces are fit to the data, with the residuals required to be uncorrelated with the fit.
The fitted polynomials are believed to capture "enough" of the dynamics, so that the residuals can be attributed to the uncertainty in the initial conditions and internal variability. The regression-only models benefit from a significant speed-up, as there is no need to identify the covariance matrix and perform the associated calculations, such as covariance inversion, which can be complex and computationally expensive for high-dimensional systems. Being fast, simple, and easy to implement, regression-only models have found numerous applications in climate and environmental sciences (Sexton et al., 2012; P. Holden et al., 2013; Williamson, 2015). Here, we use them in the linear form, due to incomplete knowledge about the nonlinear basis functions of the ocean circulation tendencies, and develop the simplest ocean emulator that provides the baseline.
Next, stochastic models use parameterized noise signals to deal with missing/unknown physics, parameter uncertainties, inaccurate initial conditions, and noise-induced regime transitions (Sardeshmukh et al., 2001; Sura et al., 2005; Berner et al., 2017) in chaotic dynamical systems. They produce random outputs, thus allowing for ensemble statistics for uncertainty estimates. Stochastic terms can be added to any deterministic framework in two common ways, as either additive or multiplicative noise. The additive noise is directly added to the basic equations, whereas the multiplicative noise is added after multiplying it with an amplitude function that depends on the predicted model variables.
The latter models are significantly more complicated, and fitting their parameters is more difficult, yet both have distinct advantages and are applied in ocean and climate modelling for numerous purposes. The additive noise has been used to force linear dynamical models (Farrell & Ioannou, 1993; DelSole & Hou, 1999; Y. Zhang & Held, 1999), to model effects of subgrid-scale turbulence (Farrell & Ioannou, 1995; D'Andrea & Vautard, 2001; P. S. Berloff & McWilliams, 2003; DelSole, 2004; Williams et al., 2016), to provide stochastic climate predictions (Majda et al., 1999; Seiffert & von Storch, 2008, 2010), and to derive stochastic primitive equations for the oceans and atmosphere (Ewald et al., 2007). On the other hand, the multiplicative noise strategy is considered most relevant for modelling non-Gaussian statistics, such as extreme events and tipping points in dynamical systems (Sura, 2011; C. Franzke, 2012; C. L. Franzke, 2013; Sura, 2013), uncertainty estimates in parameterization schemes (Buizza et al., 1999; Juricke et al., 2013, 2017; Ollinaho et al., 2017), stochastic primitive equations (Glatt-Holtz & Ziane, 2008; Debussche et al., 2012), and nonlinear coupling between noise and model variables (Sardeshmukh et al., 2003; C. Franzke et al., 2005). Recently, a multi-level framework of additive nonlinear stochastic models, called Empirical Model Reduction (EMR; Kravtsov et al., 2005, 2009; Kondrashov et al., 2013), has also been developed. It allows the inclusion of inherently important time-delayed (memory) effects, both in the additive state-dependent stochastic forcing and in the dynamical operator, and allows for a more complicated temporal structure of the noise. Data-driven climate models based on the EMR formulation have proven to be highly competitive in prediction and process studies (Strounine et al., 2010; Chen et al., 2016; Ghil et al., 2018).
The Linear Inverse Modeling (LIM) approach (Penland, 1989; Penland & Sardeshmukh, 1995; Penland, 1996) is a specific case of the EMR with a linear propagator and additive white noise. Kondrashov and Berloff (2015) have shown that decadal oceanic variability can be successfully simulated by a linear EMR formulation and a change of basis, namely by using, instead of Principal Component Analysis (PCA) modes, modes identified by Multichannel Singular Spectrum Analysis (M-SSA), which incorporates time-delayed embedding (Ghil et al., 2002). Similar to M-SSA, data-adaptive harmonic decomposition (DAHD; Ryzhov et al., 2020) relies on the eigendecomposition of the lag-covariance matrix. However, unlike M-SSA modes, DAHD modes form an orthonormal set of spatial patterns oscillating harmonically within the time-embedding window, and thus can be modeled by a system of coupled frequency-ranked nonlinear stochastic oscillators.
For this study, in order to make a fair comparison within the class of stochastic methods, we have made a deliberate choice to focus on methods that are commonly used in the PCA basis, and implemented linear stochastic models with white noise, as well as the linear formulation of the multi-level EMR framework. The third statistical model type investigated in this study is deep learning. Deep-learning models approximate the intricate nonlinear functional relationships between the model inputs and outputs by training an extensive parametric network of interconnected nodes, using neither physical knowledge about the system nor the governing differential equations. With the recent advancements in computing power, simple feed-forward Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) models rose to prominence in many science disciplines and helped to find hidden patterns in multi-dimensional datasets. The feed-forward ANN is the most commonly used deep-learning model, and its design includes multiple layers of small blocks of equations communicating in a nonlinear fashion. These ANNs have been used broadly in oceanic and atmospheric studies, ranging from the idealized Lorenz63 (Lorenz, 1963) and Lorenz96 (Lorenz, 1996) models (Dueben & Bauer, 2018; Scher & Messori, 2019) to more realistic situations, such as developing subgrid-scale models (Karunasinghe & Liong, 2006; Rasp et al., 2018; Maulik et al., 2019), learning the inter-dependency between global climate and vegetation fields (P. B. Holden et al., 2015), super-parameterizations (Chattopadhyay, Subel, & Hassanzadeh, 2020), and spotting extreme events in complex weather datasets (Y. Liu et al., 2016).
CNNs (Krizhevsky et al., 2012) form another class of deep-learning methods that has been extensively used in geophysical fluid dynamics to identify (and regress) patterns in turbulent flow regimes by repeatedly convolving the inputs with appropriate kernels (Bolton & Zanna, 2019; B. Liu et al., 2020; Weyn et al., 2020; Chattopadhyay, Mustafa, et al., 2020). LSTMs (Hochreiter & Schmidhuber, 1997) and Reservoir Computing (Jaeger & Haas, 2004; Pathak et al., 2018; Nadiga, 2021) are specific forms of Recurrent NNs that can propagate learned information from one timestep to the next when applied iteratively. This improves the model's time evolution, which makes LSTMs attractive for applications in ocean and atmosphere modelling that exhibit multi-scale and lagged interactions (Q. Zhang et al., 2017; Vlachas et al., 2018; Salman et al., 2018).
Despite first attempts to interpret deep-learning models physically (McGovern et al., 2019; Portwood et al., 2021), they remain mostly black boxes. Therefore, the modern trend is to use them in combination with physical models (Karpatne, Atluri, et al., 2017; Reichstein et al., 2019), referred to as hybrid methods, to potentially increase the forecast skills of an imperfect physical model. Also, the involved neural network may require less training and complexity (Jia et al., 2019). A few applications of hybrid methods using ANNs and LSTMs are found in Krasnopolsky and Fox-Rabinovitz (2006); Rahman et al. (2018); Watson (2019); Pawar et al. (2020). Similarly, hybrid deep-learning stochastic approaches are also being developed (Mukhin et al., 2015; Seleznev et al., 2019), but this area remains understudied, and its full potential has not yet been explored. In this paper, we test all of the above deep-learning models, except for the CNN, which is best suited to image-based datasets and structured grids, whereas, here, we work in the EOF/PC space to achieve significant model order reduction. We also propose novel hybrid stochastic formulations by utilizing residuals from the deep-learning procedure, thus effectively providing nonlinear state-dependent noise.
How do different emulators from the selected three classes compare against each other in terms of their skills? This is a difficult and significantly understudied question, which is central to our study. In particular, since most papers evaluate only a single method on a specific dataset, quantifiable intercomparisons of different methods are often impossible across papers. In the context of geophysical applications, we can expect LR to be the least successful due to its purely linear form and deterministic nature, making it ineffective in accounting for the inherent uncertainties due to, e.g., insufficient resolution, unresolved processes, and parameterization errors, as well as imprecise/incomplete knowledge of many geophysical processes (especially in the reduced-dimension space) and of the scale interactions caused by the nonlinear terms of the differential equations. The stochastic models can deal with model and forecast uncertainties and, therefore, are expected to perform better than LR. However, it is hard to predict their performance relative to the deep-learning methods, which are generally deterministic (note that they can also be Bayesian) but capable of producing accurate and generalizable models. In this paper, we aim to compare a number of ocean emulators for relatively simple ocean circulation data obtained from a long-term simulation of an idealized ocean model. We do not use ocean observations as model data, to avoid problems due to measurement errors, complex coastlines, gaps in observations, biases between different observational products, etc., which would make a fair comparison between the different emulators more difficult. However, the study could easily be repeated with observational datasets. We aim to develop emulators that are computationally cheap, able to reproduce the statistical characteristics of the reference flow, and capable of providing simulations on climate-range time scales.
Section 2 provides details about the datasets; Section 3 discusses all the modelling frameworks; Section 4 presents the model assessment metrics and their outcomes, and Section 5 discusses the results and concludes.

Dataset
The reference data set used in this study was generated by a three-layer double-gyre quasigeostrophic ocean model (P. Berloff, 2015; Ryzhov et al., 2019), which provides an idealized representation of the North Atlantic wind-driven circulation dominated by the subtropical and subpolar gyres, and by the Gulf Stream current. A square ocean basin with side 3840 km is considered; a steady asymmetric wind forcing at the top layer is imposed; the partial-slip boundary condition is used on the lateral boundaries; stratification is imposed with the first and second Rossby deformation radii equal to 40 km and 20.6 km, respectively; and the grid resolution is 7.5 km. Since the model is dynamically eddy-resolving, it qualitatively correctly reproduces the eastward jet extension of the western boundary currents and its adjacent recirculation zones, as well as the mesoscale eddy variability and interdecadal oscillations (of period 17 years). The reference solution of the statistically stationary flow regime is obtained in terms of the evolving potential vorticity and velocity streamfunction, which are related to each other via elliptic inversion. The solution snapshots are saved every 10 days, for a total of about 1400 years (5 × 10^5 days). Both potential vorticity anomaly and streamfunction snapshots (Fig. 1) show two asymmetric gyres of opposite circulations separated by the eastward jet region, which is characterized by the most vigorous flow variability and, therefore, is the focus of our study.
For the reference data set, we chose velocity streamfunction, because it is smoother than potential vorticity anomaly (both of them represent the same dynamical regime).
We deal with the upper isopycnal layer, as it contains the most intensive flow variability and is the most relevant for numerical weather prediction. Next, we re-organize the solution description and reduce the dimensionality of the raw data by using singular value decomposition (SVD), which allows us to find dominant spatial patterns, called Empirical Orthogonal Functions (EOFs), and the corresponding temporal coefficients, called Principal Components (PCs). Before SVD, the reference 3D spatio-temporal dataset (∈ R^{m×m×n}; m = 513, n = 5 × 10^4, with the grid dimension in each direction m and the number of records n) is reshaped into a matrix (∈ R^{n×m^2}), where each row is a snapshot of the flow. It is then smoothed along the temporal dimension with a 100-day (i.e., 10-record) running-average window, in order to focus on the long-timescale tendencies (the window size is truncated at the endpoints), followed by an SVD to obtain n EOFs/PCs. These EOFs/PCs are used for the emulation and analyses in this work. (Fig. 1 caption: positive (red) and negative (blue) values of potential vorticity anomaly correspond to cyclonic and anticyclonic motions, respectively, and oppositely for the streamfunction. The color scales are in dimensionless units; non-dimensionalization is done using length and velocity scales equal to 7.5 km (the grid interval) and 0.01 m/s, respectively, and CGS units.) Each EOF explains a fraction of the total temporal variance, and all EOFs are ranked so that this fraction decreases with the rank k; for complex geophysical data, it decreases exponentially (Ghil et al., 2002). For the purposes of our study, we considered the leading 150 EOFs and their PCs, which all together represent about 90% of the total variance.
Our choice of a relatively small (0.3% of the total number of EOFs), yet dynamically significant, number of EOFs is justified by our goal to gain a foothold in developing and applying a systematic methodology for the comprehensive assessment of model skills. However, we admit that we are building a simplified statistical model of the QG dynamics, which is itself a simplification of the comprehensive general circulation model dynamics; therefore, our study does not represent the real ocean situation, and the generalization of the results presented here is only possible to a certain extent. Fig. 2 shows the top five and the bottom-most EOFs among the 150 EOFs/PCs considered here.
The top-ranked EOFs represent the most dominant patterns along the eastward jet region, which is the focus of this study, as it contains most of the variability, while the last EOF represents high-frequency variability across the basin. (Fig. 3 caption: note that the low-ranked PC is much noisier, with a smaller variance; the y-axis values are dimensionless.)
Our aim is to emulate the leading 150 PCs of the upper-ocean streamfunction (Fig. 3) efficiently and to construct the corresponding reduced-order models.
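As a concrete sketch, the reduction pipeline described above (reshape, running-mean smoothing, truncated SVD) can be written in a few lines of NumPy. The function name is our illustrative choice, and the simple zero-padded running mean below differs slightly from truncating the window at the endpoints, as the paper does.

```python
import numpy as np

def compute_eofs(psi, n_modes=150, window=10):
    """Reduce a (n_snapshots, ny, nx) field to its leading EOFs/PCs via SVD.

    Illustrative sketch of the paper's pipeline, not its exact implementation.
    """
    n = psi.shape[0]
    X = psi.reshape(n, -1)                       # each row: a flattened snapshot
    kernel = np.ones(window) / window            # running-average window
    Xs = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, X)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    eofs = Vt[:n_modes]                          # spatial patterns (EOFs)
    pcs = U[:, :n_modes] * s[:n_modes]           # temporal coefficients (PCs)
    explained = (s[:n_modes] ** 2).sum() / (s ** 2).sum()
    return eofs, pcs, explained
```

Retaining all modes reproduces the smoothed field exactly; truncating to the leading modes keeps the stated fraction of the temporal variance.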

Modelling Frameworks
In this section, we expand on each adopted modelling methodology one by one, by providing the mathematical formulations and the relevant parameters and hyper-parameters, and by explaining the model training processes. Before modeling, we normalized the PCs by their respective standard deviations, such that each of them has zero mean and unit standard deviation. However, the results were found to be robust to other normalizations too, e.g., division by the standard deviation of the top-most PC (and similarly for auxiliary variables, if any). Within each method, we model either the tendencies or the states of the PCs, denoted by y_i; i = 1, 2, 3, . . . , N, where N = 150 is the total number of PCs considered. The tendencies are computed numerically using the finite-difference method with ∆t = 10 days as the time difference between two successive records. Out of the total 50K records of the PCs (which cover a time interval of 500K days, as the snapshots are recorded every 10 days), we set the initial 40K records as the training dataset and the last 10K records as the test dataset. This training length is roughly 400 times the decorrelation time scale of the top-most PC (≈ 2.7 years) and includes nearly 65 periods of the intrinsic low-frequency variability of period 17 years. Note that such a partitioning of training and test datasets may leave the end of the training dataset and the beginning of the test dataset correlated. But this is inconsequential, because the test dataset still possesses roughly 16 cycles of the intrinsic low-frequency variability, and we have considered multiple initial conditions for forecasts and reported their average.
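The normalization, finite-difference tendencies, and 40K/10K split described above can be sketched as follows; the helper name and the forward-difference choice are ours.

```python
import numpy as np

def prepare_pcs(pcs, dt=10.0, n_train=40_000):
    """Normalize PCs, form finite-difference tendencies, and split the records.

    pcs: (n_records, N) array; dt is the record spacing in days.
    Illustrative sketch; the paper does not publish its exact preprocessing code.
    """
    pcs = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)   # zero mean, unit std
    dydt = (pcs[1:] - pcs[:-1]) / dt                   # forward differences
    y = pcs[:-1]                                       # states aligned with dydt
    return (y[:n_train], dydt[:n_train]), (y[n_train:], dydt[n_train:])
```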
We considered the full length of the training dataset for all models, but also experimented with different lengths ranging from 1K to 40K records, as in real-world applications such a large training dataset (over 1000 years) may not be available. Our analysis showed that at least 10% of the training data (corresponding to ≈ 7 cycles of the low-frequency variability) is needed for our models to achieve a performance on the test dataset similar to that reported for the full-length training dataset. Transfer learning may help when training data-driven models for the real ocean with limited observations. In this approach, the models are first trained on historical climate data, such as CMIP model outputs, before fine-tuning on observations and reanalysis data (Ham et al., 2019; Rasp & Thuerey, 2021).

Linear Regression (LR)
This model is expressed as follows:

dy(t)/dt = L y(t),    (1)

where y = [y_1 y_2 . . . y_N]^T is the multivariate PC state vector, and L is an N × N matrix of regression coefficients, referred to as the system matrix. LR is the simplest model considered in this work and is used as a baseline for assessing the performance of the others.
Linear dynamics is pertinent to the gyres and explains a significant proportion of the low-frequency variability in the top PCs. Therefore, we also refer to LR as the 'core dynamics'.
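For illustration, a least-squares estimate of the system matrix L and a forward-Euler integration of the LR model can be sketched as follows; the function names are ours, and the paper does not specify its fitting routine.

```python
import numpy as np

def fit_lr(y, dydt):
    """Least-squares estimate of L in dy/dt = L y (illustrative sketch)."""
    A, *_ = np.linalg.lstsq(y, dydt, rcond=None)   # solves y @ A ~ dydt
    return A.T                                     # so that dydt ~ y @ L.T

def integrate_lr(y0, L, dt, n_steps):
    """Forward-Euler integration of the deterministic LR model."""
    traj = [np.asarray(y0, dtype=float)]
    for _ in range(n_steps):
        traj.append(traj[-1] + dt * (L @ traj[-1]))
    return np.stack(traj)
```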

LR with Additive White Noise (LR-AWN)
This is a stochastic model based on the LR model (1) but additionally equipped with additive white noise. The model is given as:

dy(t) = L y(t) dt + diag(σ) Q dW(t),    (2)

where σ is the vector of standard deviations of the LR residuals r = dy/dt − Ly; dW is an increment of a vector Wiener process; and Q is a lower triangular matrix, obtained by Cholesky decomposition of the correlation matrix C of the residuals r at zero lag, which multiplies dW in order to induce the required correlation between the noise components. Mathematically, C = QQ^T, and Q, C ∈ R^{N×N}. We validated the output of this model against a Linear Inverse Model, which is more widespread, but found almost no difference between the two model outputs.
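The LR-AWN construction (least-squares core, residual statistics, Cholesky factor, and an Euler–Maruyama step) can be sketched as follows; the function names are ours, and the plain least-squares fit is a stand-in for whatever estimator is used in practice.

```python
import numpy as np

def fit_lr_awn(y, dydt):
    """Fit the LR operator and additive-white-noise parameters (sketch)."""
    A, *_ = np.linalg.lstsq(y, dydt, rcond=None)   # dydt ~ y @ A
    L = A.T
    r = dydt - y @ L.T                             # LR residuals
    sigma = r.std(axis=0)                          # per-component amplitudes
    C = np.corrcoef((r / sigma).T)                 # zero-lag correlation
    Q = np.linalg.cholesky(C)                      # lower triangular factor
    return L, sigma, Q

def step_lr_awn(y, L, sigma, Q, dt, rng):
    """One Euler-Maruyama step of dy = Ly dt + diag(sigma) Q dW."""
    dW = np.sqrt(dt) * rng.standard_normal(len(y))
    return y + dt * (L @ y) + sigma * (Q @ dW)
```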

Multi-level Linear Regression (ML-LR)
We adopt a linear formulation of the multi-level stochastic Empirical Model Reduction approach (Kravtsov et al., 2009; Kondrashov & Berloff, 2015), where the regression residuals are not immediately replaced by some noise but instead are modeled by using a stack of levels. The top-most level is similar to (1), and the lower levels regress the higher-level residuals as additional (hidden) state variables, until the lowest-level residual degenerates into white noise, i.e., its autocorrelation approaches zero (the technical implementation of the stopping criterion is based on the fraction of the variance explained by the regression; see Appendix A in Kondrashov et al. (2015)). The complete model can be expressed as:

Level 1:  dy/dt = L y + r^(0)(t),
Level 2:  dr^(0)/dt = L^(1) [y, r^(0)] + r^(1)(t),
Level 3:  dr^(1)/dt = L^(2) [y, r^(0), r^(1)] + r^(2)(t),
. . .
Last level:  dr^(p−1)/dt = L^(p) [y, r^(0), . . . , r^(p−1)] + Q dW(t),    (3)
where, as before, dW is an independent Gaussian white noise process; and Q is the lower triangular matrix obtained by the Cholesky decomposition of the zero-lag correlation between the last-level residuals (same as in Sec. 3.2). In our results, the residuals at the second level r^(1)(t) become sufficiently decorrelated in time (according to the stopping criterion) and are well approximated by spatially correlated white noise, so we used a model with a single additional level. Note that for the prediction experiments in Sec. 4, the initial conditions for the residuals at the various levels need to be determined in a strictly "no-look-ahead" procedure, i.e., only using the model coefficients and the time history of y prior to the forecast start time.
E.g., when using a model with one extra level, we would need to begin from time instant (t−1) to make forecasts from time instant t (see Appendix B in Kondrashov et al. (2015)), initializing r(t−1) = (y(t)−y(t−1))/dt − Ly(t−1). To obtain numerical results we have used the publicly available Stochastic Modeling and Prediction Toolbox (see Acknowledgments).
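A minimal sketch of such a multi-level fit is given below, with a lag-1-autocorrelation stopping rule standing in for the explained-variance criterion of Kondrashov et al. (2015); all names and thresholds are our illustrative choices.

```python
import numpy as np

def lag1_autocorr(r):
    """Mean lag-1 autocorrelation across the components of r."""
    return float(np.mean([np.corrcoef(c[:-1], c[1:])[0, 1] for c in r.T]))

def fit_ml_lr(y, dt, max_levels=3, tol=0.2):
    """Linear multi-level (EMR-style) fit, illustrative sketch.

    Each further level regresses the previous residual's tendency on the
    extended state [y, r0, ...], stopping once the last residual is
    nearly white in time.
    """
    ops = []
    X = y[:-1]                               # predictors accumulate per level
    target = np.diff(y, axis=0) / dt         # level-1 tendencies
    while True:
        A, *_ = np.linalg.lstsq(X, target, rcond=None)
        r = target - X @ A
        ops.append(A)
        if abs(lag1_autocorr(r)) < tol or len(ops) >= max_levels:
            break
        X = np.hstack([X[:-1], r[:-1]])      # extend state with the residual
        target = np.diff(r, axis=0) / dt     # next level's tendencies
    sigma = r.std(axis=0)
    Q = np.linalg.cholesky(np.corrcoef((r / sigma).T))
    return ops, sigma, Q
```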

Artificial neural network (ANN)
A neural network is a computational architecture loosely based on the biological networks of neurons in the human brain. Each neuron is an instance of an activation function (e.g., linear, binary, hyperbolic, sigmoid) that operates on the weighted sum of its inputs with some bias added to it. Multiple neurons can be combined into distinct neural network architectures. A feed-forward Artificial Neural Network (ANN) is among the most common (Nielsen, 2015). It is composed of multiple layers of neurons, so that the outputs from one layer are the inputs to the next layer, with the ultimate goal of approximating the right functional relationship between the inputs and outputs. In a dense network, each neuron in a layer receives inputs from all the neurons in the previous layer and, thus, forms a complete set of connections between the available neurons, inputs, and outputs. Each connection in the network is characterized by its weight, and the goal of ANN training is to optimize the weights using a suitable loss function and optimization algorithm.
An ANN is defined by a few hyper-parameters controlling its performance (e.g., number of hidden layers, number of neurons per layer, activation function, loss function, optimizer). Adjusting the hyper-parameters to their optimal values is a non-trivial task (due to the size of the hyper-parameter space), and the adjustment is mostly based on trial-and-error testing that matches the model complexity to the amount of data available and the complexity of the problem. To improve the ANN forecasts further, we added a spatially correlated white noise to the ANN forecasts to account for the residuals, similar to LR in (2). This model is abbreviated as ANN-AWN and is given as:

y(t+1) = ANN(y(t)) + ζ,    (4)

where ζ ∼ N(0, QQ^T), and Q is the lower triangular matrix obtained by Cholesky decomposition of the covariance of the ANN residuals r(t) = y(t+1) − ANN(y(t)). Note that it is also possible to train the ANN to predict the perturbation dy for a given state y(t), as done in Chattopadhyay, Hassanzadeh, and Subramanian (2020) and Dueben and Bauer (2018), but this approach resulted in unstable solutions for long integrations.
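The noise augmentation itself is independent of the network architecture. In the sketch below, `net` is a stand-in for any trained single-step emulator, and the helper names are ours.

```python
import numpy as np

def make_awn_sampler(residuals, rng):
    """Spatially correlated white-noise sampler built from one-step residuals."""
    Q = np.linalg.cholesky(np.cov(residuals.T))   # lower triangular factor
    return lambda: Q @ rng.standard_normal(Q.shape[0])

def ann_awn_step(y, net, sample_noise):
    """One ANN-AWN step: deterministic forecast plus sampled correlated noise."""
    return net(y) + sample_noise()
```

By construction, draws from the sampler have (in expectation) the same covariance as the supplied residuals.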

Long Short Term Memory (LSTM) Model
LSTM models belong to the class of recurrent NNs and function by passing information from previous timesteps into the calculation of the next timestep when used iteratively. Because these models hold essential dynamical information between successive time steps, they account for long-time correlations between the model states. This is a significant advantage over ANNs for applications with significantly autocorrelated time series.
Like ANNs, LSTMs can also be upgraded using spatially correlated white noise (abbreviated as LSTM-AWN), with the noise parameters inferred from the LSTM residuals.
We used the Keras API to implement a two-hidden-layer densely connected LSTM configuration with 100 neurons in each layer. The model was trained using mean absolute error as the loss function, "Adam" as the optimizer, hyperbolic tangent as the activation function, and the whole 400K days as the training length. Unlike the ANN, the LSTM model takes the states of the PCs at the five previous time steps as input and produces the next state as output; that is, the so-called 'look back' hyperparameter is 5. Higher values of look back did not improve model performance significantly, but the overall computational cost of training/prediction increases manyfold (note that the LSTM is optimized for taking long-time correlation effects into account by construction). For the mini-batch size, we tested 32, 64, and 128 as potential values and found 32 to be optimal, amounting to nearly one year of observations. For all other hyperparameters, we used the same hyperparameter search space as described for the ANN, and the final values were chosen after testing their different combinations and analyzing the resulting model performance on the training data.
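For reference, assembling the look-back-5 input tensors that such an LSTM consumes amounts to a sliding window over the PC records; this helper is illustrative and not taken from the paper's code.

```python
import numpy as np

def make_lstm_windows(pcs, look_back=5):
    """Stack sliding windows: inputs (samples, look_back, N), targets (samples, N)."""
    X = np.stack([pcs[i:i + look_back] for i in range(len(pcs) - look_back)])
    y = pcs[look_back:]                     # next state after each window
    return X, y
```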

Hybrid Modeling
The hybrid model that combines LR, which conveys the linear dynamics, with the deep-learning models, used as a nonlinear correction, state-dependence, and memory term, may be more skilled than the standalone implementations of these methods. Such a hybrid model can potentially also preserve some core dynamics of the system and may benefit from simpler algorithms and architectures, in the spirit of theory-guided data science (Karpatne, Watkins, et al., 2017).
We trained the deep-learning models (say, f) from the previous sections (ANN and LSTM) to emulate the LR residuals r(t) = dy(t)/dt − Ly(t) and, thus, augment the LR output as:

dy(t) = L y(t) dt + r(t) dt,
r(t) = f(r(t−1), y(t−1); r(t−2), y(t−2); . . . ; r(t−l), y(t−l)) + ξ,    (5)

where l is the 'look back' hyperparameter for the LSTM; for the ANN, it is equal to 1 by construction. For the LSTM, we set l = 3 after checking the LSTM-hybrid model performance on the training data for l = 1, 2, . . . , 5 and finding that the model performance does not improve beyond l = 3. The model learning proceeds in three successive steps: (1) LR is used to estimate L; (2) the resulting residual r(t) = dy/dt − Ly(t) is modeled by f(r(t−1), y(t−1); . . . ; r(t−l), y(t−l)), which accounts for nonlinear correction, state dependence, and possibly memory effects; (3) the final residual from the deep-learning minimization is approximated by a spatially correlated white noise process ξ ∼ N(0, QQ^T).
This procedure can be interpreted as incorporating the state-dependent noise r(t) and, as said previously, it is implemented for the two deep-learning methods described in the previous sections, referred to as LR-ANN-AWN and LR-LSTM-AWN. We have also evaluated versions of Eq. (5) with no stochastic forcing, i.e., without the white noise term; these are abbreviated as LR-ANN and LR-LSTM.
Note that, unlike their standalone implementations, the deep-learning models here take both the state y(t) and the residual r(t) as inputs and return the residual r(t+1) as the output. We tested this configuration against several others and found that the current setup performs better on both the training and test datasets.
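The three-step learning procedure can be sketched for the l = 1 case as follows; here `fit_net` is a stand-in for training the ANN/LSTM residual model, and all names are our illustrative choices.

```python
import numpy as np

def fit_hybrid(y, dt, fit_net):
    """Three-step hybrid fit (l = 1 case, illustrative sketch):
    (1) least-squares LR core; (2) a regressor f on [r(t-1), y(t-1)] -> r(t);
    (3) Cholesky factor of the final residual covariance for the noise xi."""
    dydt = np.diff(y, axis=0) / dt
    A, *_ = np.linalg.lstsq(y[:-1], dydt, rcond=None)
    L = A.T
    r = dydt - y[:-1] @ L.T                  # step 1: LR residuals
    X = np.hstack([r[:-1], y[:-2]])          # inputs [r(t-1), y(t-1)]
    net = fit_net(X, r[1:])                  # step 2: residual model f
    xi = r[1:] - net(X)                      # step 3: final residuals
    Q = np.linalg.cholesky(np.cov(xi.T))
    return L, net, Q
```

In practice `fit_net` would train the ANN or LSTM; any regressor factory returning a callable works in this sketch.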

Results
In this section, we consider each model assessment metric separately, summarize them, and report the outcome for all models. For the majority of the assessment metrics, we use the reconstructed spatio-temporal streamfunction field (ψ_1^fcast), obtained by multiplying the forecasted PCs (y_i; i = 1, 2, 3, . . . , 150) with the respective EOFs (φ_i).
We obtain PC forecasts corresponding to a set of initial conditions. The exact number of initial conditions and the lengths of the forecasts differ between the assessment metrics and are provided in the detailed descriptions. Additionally, for stochastic methods, we obtain an ensemble of 100 realizations for each initial condition and calculate the ensemble mean. The reference truth (ψ_1^ref) used for assessing the model outputs belongs to the reduced space spanned by the 150 EOFs/PCs. Below, we present the assessment metrics and detailed analyses of the models.
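The reconstruction of the physical field from forecasted PCs and EOFs amounts to a weighted sum of the spatial modes, and stochastic forecasts are then averaged over the ensemble. A minimal sketch (the array shapes are our own convention, not from the paper):

```python
import numpy as np

def reconstruct_field(pcs, eofs):
    """Rebuild streamfunction snapshots from PCs and EOFs.
    pcs:  (T, k) forecasted principal components y_i(t)
    eofs: (k, m, m) spatial modes phi_i
    Returns psi: (T, m, m) with psi(t) = sum_i pcs[t, i] * eofs[i]."""
    return np.tensordot(pcs, eofs, axes=([1], [0]))

def ensemble_mean_forecast(realizations):
    """realizations: (n_ens, T, k) stochastic PC trajectories started from
    one initial condition; the ensemble mean is what gets assessed."""
    return realizations.mean(axis=0)
```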

Root Mean Square Error (RMSE)
The RMSE is given as RMSE(t) = sqrt[(1/m²) Σ_{i,j} (ψ_1^fcast(i, j, t) − ψ_1^ref(i, j, t))²], and its time series describes the basin-averaged mismatch between the reference and emulated streamfunction snapshots. For perfect forecasts, the RMSE would be zero. In practice, RMSE is small for short forecast lead times and grows until it saturates, as the forecast and reference truth become uncorrelated at long forecast lead times. We considered the state of the PCs at each of the 10K records of the test dataset (10K records correspond to 100K days) as an initial condition and obtained 10-record-long (i.e., 100-day) forecasts for all of them; therefore, we used a total of 9990 initial conditions (as we would not have the reference data for the last 10 initial conditions).
Next, we computed the RMSE for each of these forecasts and then the mean RMSE over all initial conditions; this provides a 10-record-long time series for each model (Fig. 4a).
The motivation for using this forecast length is to study the error growth for short-term forecasts (e.g., on seasonal time scales) as opposed to a change of the mean fields using long-term simulations (e.g., decadal-to-centennial). The short-term predictability for a single initial condition can depend strongly on the underlying flow regime, such as defined by the western boundary current position. It is therefore important to take the average RMSE over a number of initial conditions.
For comparison purposes, we also considered the "persistence" model, in which the memory effect is the strongest and the model state remains constant, equal to the initial condition. The persistence RMSE time series, therefore, characterizes a "constant state" in the absence of a dynamic model.
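The RMSE diagnostic and the persistence baseline can be sketched as follows (a minimal numpy version; the paper uses 9990 initial conditions and 10-record forecasts):

```python
import numpy as np

def mean_rmse(forecasts, truths):
    """forecasts, truths: (n_ic, n_lead, m, m) reconstructed streamfunction.
    RMSE over the basin at each lead time, then averaged over all initial
    conditions -> an (n_lead,) error-growth curve as in Fig. 4a."""
    per_fcast = np.sqrt(np.mean((forecasts - truths) ** 2, axis=(2, 3)))
    return per_fcast.mean(axis=0)

def persistence_forecasts(psi_ref, n_lead):
    """Persistence baseline: the state stays equal to the initial condition.
    psi_ref: (T, m, m); returns matched (forecasts, truths) arrays."""
    n_ic = psi_ref.shape[0] - n_lead
    forecasts = np.repeat(psi_ref[:n_ic, None], n_lead, axis=1)
    truths = np.stack([psi_ref[i + 1:i + 1 + n_lead] for i in range(n_ic)])
    return forecasts, truths
```

A perfect forecast yields a zero curve; persistence gives the "constant state" reference against which the dynamical emulators are compared.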
For 100 days, ML-LR exhibits the best performance with the lowest mean RMSE, followed by the stochastic augmentations of the hybrid models, i.e., LR-ANN-AWN and LR-LSTM-AWN (Fig. 4a). The deterministic hybrid model RMSEs are similar to or worse than LR and its white noise extension; LR-ANN is worse than LR-LSTM, as the RMSE of the former grows more steeply with the lead time and even exceeds the Persistence at lead times beyond 70 days. This suggests that the additive noise improves the deterministic hybrid methods (clearer from Fig. 4b), as the residuals in such models are less correlated and closer to white noise. We also observe that the standalone ANN, in both its deterministic and stochastic versions, performs the worst, with its RMSEs being higher than the Persistence at all lead times. The standalone LSTM also performs similarly poorly (yet better than its ANN counterpart), but its stochastic version produces RMSEs lower than the Persistence at long lead times. Nonetheless, a comparison of standalone and hybrid implementations of deep-learning methods encourages us to use ANN/LSTM as a nonlinear corrector term rather than using them to represent the complete dynamics.
On the other hand, LR and LR-AWN belong to the middle of the RMSE spectrum and show similar performances (see the box plot, Fig. 4b). The similar performances of these models are due to the inability of the noise component to account for the coupled dynamics contained in the LR residuals, which are modeled more efficiently using an extra regression level (as in ML-LR) or deep-learning methods (as in the hybrid methods).
Overall, we conclude that (i) ML-LR and the stochastically augmented hybrid models show better performance than the standalone implementations of LR and deep-learning models, probably because the former include all three major components of a reliable model: core dynamics, memory effects, and model errors accounted for by stochastic noise; (ii) adding simple additive noise to the hybrid models significantly improves their performance.

Anomaly Correlation Coefficient (ACC)
Next, we diagnose the correlation between the forecasted (100-day-long forecasts) and the reference truth spatio-temporal streamfunction datasets in the spatial and temporal domains, referred to as ASCC and ATCC, respectively. We call this an anomaly correlation because we deal with mean-subtracted PCs, and, therefore, the resulting physical fields possess zero mean.
ATCC is a grid-point-wise zero-lag cross-correlation between the forecast and the reference truth over all lead times. It, therefore, gives us a gridded map with the zero-lag cross-correlation value between the forecast and the reference truth at each grid location. As for the RMSE, the ATCC map is computed for each of the 9990 initial conditions and then averaged (Fig. 5).
ASCC is the cross-correlation between the spatial snapshots of the reference truth and the forecast at each lead time, say, t. The snapshots of the forecast and the reference truth are reshaped to vectors before the cross-correlation. ASCC, therefore, returns a time series of length 100 days (the maximum lead time) for each initial condition, and we report their average (Fig. 6).
They are computed as ATCC(i, j) = ⟨ψ_1^fcast(i, j, t) ψ_1^ref(i, j, t)⟩ / (σ_fcast(i, j) σ_ref(i, j)) and ASCC(t) = (ψ_1^fcast(t) ψ_1^ref(t)) / (σ_fcast(t) σ_ref(t)), where t is the forecast lead time, (·) and ⟨·⟩ indicate the spatial and temporal averages, respectively, and σ refers to the standard deviation. Essentially, ATCC conveys the temporal similarity between the forecast and the reference truth over all lead times, as it computes the grid-point-wise temporal correlation between the time series of the two datasets, averaged over all initial conditions. In contrast, ASCC exhibits the spatial structure similarity between the two datasets, as it computes the snapshot-wise correlation between the two fields at a given lead time, normalized and then averaged over all initial conditions.
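The two correlation diagnostics can be sketched as follows for a single forecast/truth pair (the paper additionally averages the resulting maps and series over all 9990 initial conditions):

```python
import numpy as np

def atcc_map(fcst, ref):
    """Grid-point-wise zero-lag temporal correlation over all lead times.
    fcst, ref: (T, m, m) -> (m, m) correlation map."""
    fa = fcst - fcst.mean(axis=0)            # anomalies (already near zero
    ra = ref - ref.mean(axis=0)              # mean for mean-subtracted PCs)
    return (fa * ra).mean(axis=0) / (fa.std(axis=0) * ra.std(axis=0))

def ascc_series(fcst, ref):
    """Pattern (spatial) correlation between snapshots at each lead time.
    fcst, ref: (T, m, m) -> (T,) time series."""
    T = fcst.shape[0]
    f = fcst.reshape(T, -1)
    r = ref.reshape(T, -1)
    f = f - f.mean(axis=1, keepdims=True)
    r = r - r.mean(axis=1, keepdims=True)
    return (f * r).mean(axis=1) / (f.std(axis=1) * r.std(axis=1))
```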
The ATCC maps show that the ML-LR forecasts are the best, followed by the stochastic hybrid models (Fig. 5c,i,k). For most of the models, the correlations are generally higher in the gyre regions than in the eastward jet region, which is expected since the latter area is more turbulent. However, the stochastically improved hybrid methods and ML-LR provide higher correlations in the jet region than the non-stochastic hybrid models (Fig. 5h,j). The pure deep-learning methods and their stochastic extensions (Fig. 5d-g) fail to reproduce the variability entirely, thus resulting in significant basin-wide dissimilarity with the reference dataset. The ATCC maps of LR and its white noise extension (Fig. 5a-b) show similar basin-wide correlations, and these are similar to the deterministic hybrid models (Fig. 5h,j). However, small correlations along the jet region are more pronounced in LR-AWN.
A comparison of the ASCC time series (Fig. 6) tells a similar story. The pure deep-learning methods and their stochastic augmentations show the worst structural similarity with the reference truth (lower than the Persistence at all lead times), whereas ML-LR provides the highest correlation, followed by the noise-augmented hybrid models. An inspection of the correlation decay rates of the models involving LR suggests that, on average, all of them possess a similar decorrelation rate, but the deterministic hybrid models decay faster than the others at longer lead times, with LR-ANN decaying similarly to the Persistence baseline. This suggests that introducing noise in the hybrid models improves both temporal and spatial characteristics.
Overall, we conclude that, on short forecast time scales, both ML-LR and the noise-augmented hybrid methods are the most realistic regarding the spatial and temporal evolution. The pure deep-learning models and their stochastic extensions perform poorly.

Climatology and Variance
Here, we diagnose the mean and variance of the forecasted streamfunction field along the temporal domain. The mean field ψ_1^fcast is also referred to as the "climatology". However, unlike in operational forecasts, the reference climatology is zero, as we have performed an SVD of the mean-subtracted streamfunction field. Therefore, the reconstructed streamfunction ψ_1^fcast should show a climatology of zero. We used these diagnostic metrics to characterize long-timescale forecasts, over 20K records, i.e., 200K days or nearly 547 years. As the results are independent of the initial conditions, we perform simulations from a single initial condition and use only one stochastic realization, wherever applicable. Due to a much longer record of the reference streamfunction compared to the forecast length, we expect a small nonzero value of the predicted time-mean streamfunction field (see Fig. 7a for a reference dataset of the same length), and its value can serve as a reference for the temporal bias introduced by different models. On the other hand, the variance map characterizes the extent of jet reproduction by different methods, as the jet is the most turbulent feature and possesses the maximum fluctuations in the entire domain (Fig. 8a).
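Both long-timescale diagnostics are simple temporal moments of the reconstructed field. The `relative_bias` helper below is our own way of quantifying a bias as a fraction of the original mean-field amplitude, not a formula from the paper:

```python
import numpy as np

def climatology_and_variance(psi):
    """psi: (T, m, m) long reconstructed forecast.
    Returns the time-mean map (the 'climatology'; near zero for
    mean-subtracted fields, so its amplitude measures the temporal bias)
    and the grid-point-wise temporal variance (jet reproduction)."""
    return psi.mean(axis=0), psi.var(axis=0)

def relative_bias(clim, ref_mean_field):
    """Bias amplitude as a fraction of the original (pre-subtraction)
    mean-field amplitude (hypothetical convenience metric)."""
    return np.abs(clim).mean() / np.abs(ref_mean_field).mean()
```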
The climatology maps suggest that LR shows the least temporal bias among all methods (Fig. 7b). However, this is because the LR output decays to zero after a short lead time, and the PCs exhibit a near-zero mean despite the poor forecasts. The multi-level formalism (Fig. 7d) shows the second smallest bias, with stable and non-zero forecasts at all lead times. The AWN extension of LR produces the next smallest overall bias among LR and its stochastic extensions (Fig. 7c).
All standalone deep-learning methods (Fig. 7e-h) produce a relatively large bias in the forecasted streamfunction field. For ANN, the deterministic version produces a higher bias than the stochastic one, whereas LSTM produces a high bias irrespective of the noise. The large bias arises primarily because these methods produce a significant drift in the modeled PCs, which results in a large non-zero temporal mean that is also reflected in the reconstructed streamfunction field after multiplying with the EOFs. Among the hybrid models (Fig. 7i-l), the ANN hybrids generate a smaller basin-wide bias than the LSTM hybrids, and the stochastically improved hybrid models produce a smaller bias than their deterministic counterparts. The latter suggests that the induced drift in the modeled PCs can be contained to some extent by adding stochasticity, which nudges the PCs back towards the reference truth trajectory.
Analyzing the spatial map of the grid-point-wise temporal variance of the model outputs (Fig. 8), we found that LR-AWN, ML-LR, and the stochastically augmented hybrid models (Fig. 8c,d,j,l) best reproduce the reference jet variability (Fig. 8a). The non-stochastic ANN-hybrid model (Fig. 8i) produces correct but overly stretched jet variability in the north-south direction and around the eastward extension, whereas the deterministic LSTM-hybrid model (Fig. 8k) produces an irregular and overly narrow region of jet variability. Among the standalone deep-learning models (Fig. 8e-h), the deterministic versions of both ANN and LSTM produce insufficient jet variability, with ANN being worse than LSTM. However, when augmented with white noise, both produce more variability along the jet region and come nearer to the reference truth, especially ANN-AWN.
The LR fails to produce any variability in the basin (Fig. 8b), as its outputs decay to zero after a short forecast lead time.
Combining the climatology and variance results, we conclude that (i) on long time scales, LR-AWN, ML-LR, and the stochastically augmented hybrid methods best reproduce the jet variability with a small drift in the resulting flow field (approx. 2-3% of the original mean field); (ii) the standalone deep-learning implementations introduce a relatively large bias in the climatology and are inefficient at reproducing the turbulent jet characteristics on long time scales; (iii) the LR model cannot produce climate-like forecasts due to its dissipative nature, and therefore all other models fare better than it on long timescales.

Frequency map
Here, we consider frequency maps of the emulated long-timescale solutions (the same as those used in the previous section) to quantify their spectral frequency characteristics. For each model output, the frequency map is obtained by diagnosing the frequency value locally (i.e., for each grid cell), given by the inverse of the decorrelation time scale of the forecasted streamfunction field. The decorrelation time scale is determined as the lag at which the autocorrelation drops by a factor of e from its zero-lag value.
Because we repeat this calculation for each spatial location, we get a gridded frequency map of size m×m. Overall, we expect higher-frequency variability along the eastward jet and in the boundary regions, due to the vigorous eddy activity, and lower-frequency variability elsewhere. This is demonstrated in the reference frequency map (Fig. 9a), obtained for a randomly chosen 200K-day-long data sample from the reference dataset.
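The frequency-map diagnostic can be sketched as a direct implementation of the e-folding decorrelation rule; the convention that a cell which never decorrelates within the record keeps a frequency of zero is our own assumption:

```python
import numpy as np

def frequency_map(psi, dt_years):
    """psi: (T, m, m) long-run field; dt_years: sampling period in years.
    For each grid cell, find the first lag at which the autocorrelation
    drops below 1/e of its zero-lag value; the local frequency is the
    inverse of that decorrelation time (cycles/year). Cells that never
    decorrelate within the record keep a frequency of zero."""
    T, m, _ = psi.shape
    anom = psi - psi.mean(axis=0)
    freq = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            x = anom[:, i, j]
            c0 = np.dot(x, x)                    # zero-lag autocovariance
            if c0 == 0.0:
                continue                         # flat signal: leave as zero
            for lag in range(1, T):
                if np.dot(x[:-lag], x[lag:]) < c0 / np.e:
                    freq[i, j] = 1.0 / (lag * dt_years)
                    break
    return freq
```

For white noise, which decorrelates after one sample, every cell returns the sampling frequency 1/dt.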
The results suggest that ML-LR most closely resembles the reference frequency map, followed by LR-AWN and the two stochastic hybrid models (Fig. 9d,c,j,l). While ML-LR produces the correct frequency patterns in the gyre regions, the magnitude is lower in the jet region when compared to the reference truth. LR-AWN and the stochastic hybrid models also reproduce the frequencies in the gyre regions but are less accurate in the jet region, with the frequency magnitude even lower than that of ML-LR.
On the other hand, the deterministic hybrid models (Fig. 9i,k) fail to describe the characteristic frequencies throughout the domain and produce frequency maps mostly dominated by low frequencies, more so for the deterministic ANN hybrid model (the frequency units in Fig. 9 are cycles/year). The same is also true for the standalone ANN implementations (Fig. 9e,f), with the stochastic one being better, as it shows higher frequencies in the domain, though not following the correct pattern. Such low-frequency-dominated maps suggest that the individual PC outputs lack correct high-frequency contributions. The LSTM-only models also produce incorrect frequency maps irrespective of the noise (Fig. 9g,h). The deterministic LSTM variant exhibits patterns of low and intermediate frequencies in the entire basin, whereas the stochastic variant produces patches of high frequencies both in the jet and gyre regions that do not resemble the reference truth. This is because the deterministic LSTM outputs (for all 150 PCs) are dominated by low frequencies, and adding noise to them produces somewhat large-amplitude (therefore, higher-variance) but high-frequency-dominated outputs for most of the PCs, which ultimately leads to high frequencies in both the jet and gyre regions when multiplied by the EOFs.
LR completely misses the high-frequency variability around the jet (Fig. 9b), as the solution decays to zero, although it still manages to reproduce the low-frequency variations in the two gyres to a certain extent.
We conclude that (i) as seen for the climatology and variance, LR-AWN, ML-LR, and the stochastic hybrid models perform best regarding the frequency characteristics of long-timescale solutions; (ii) the deterministic hybrid models fail to correctly reproduce the frequency content despite their low climatological bias and nearly correct variance pattern; (iii) the standalone deep-learning methods, especially ANN, produce the most inaccurate and physically unjustified frequency maps.

Forecast Horizon
Here, for each model, we estimate the forecast horizon, i.e., the time scale over which the model produces stable and non-zero forecasts. This information is vital for deciding on the applicability of a method in short-/long-term forecasts. When the system matrix is available, as in LR, the forecast horizon is given by the inverse of its maximum eigenvalue. For the other methods, it is computed using the model outputs, which, for long-timescale forecasts, either saturate to a steady-state value or provide non-zero and stable solutions up to the maximum lead time (equal to 20K records, i.e., 200K days or 547 years). In the first case, the forecast horizon for each PC is given by the time beyond which the solution is trapped in a small-amplitude range, i.e., below a small threshold value. The overall forecast horizon is then set equal to the minimum of the individual PC horizons. In the second case, the model forecast horizon is set to ∞, because such models can produce stable non-zero forecasts at any finite lead time.
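Both horizon estimates can be sketched as follows; `lr_horizon` is one plausible reading of the eigenvalue rule, assuming a stable linear operator dy/dt = Ly so that the least-damped eigenvalue sets an e-folding decay time:

```python
import numpy as np

def lr_horizon(L):
    """For the linear model dy/dt = L y: an e-folding decay time from the
    least-damped eigenvalue; infinite if any mode is non-decaying."""
    lam_max = np.linalg.eigvals(L).real.max()
    return np.inf if lam_max >= 0 else -1.0 / lam_max

def forecast_horizon(pcs, dt, threshold):
    """pcs: (T, k) long-run PC output. Per PC: the time after which |pc|
    stays trapped below `threshold`; if the PC is still active at the last
    record, its horizon is infinite. The model horizon is the minimum."""
    T, k = pcs.shape
    horizons = np.full(k, np.inf)
    for i in range(k):
        above = np.flatnonzero(np.abs(pcs[:, i]) > threshold)
        if above.size == 0:
            horizons[i] = 0.0                  # never exceeded the threshold
        elif above[-1] < T - 1:
            horizons[i] = (above[-1] + 1) * dt  # trapped before the record end
        # else: still active at the final record -> horizon stays infinite
    return horizons.min()
```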
Only LR belongs to the first category, whereas all the other methods belong to the second.

Training and prediction time complexity
Here, we discuss computational costs and their scalability as the number of degrees of freedom is increased. For each model, we have diagnosed the run times for training and inference and refer to them as the "training time complexity" and "prediction time complexity", respectively. In practice, the training is done only once, whereas predictions are made many times; therefore, it makes sense to look at their time complexities separately. Note that the complexity estimates should take into account the different levels of possible optimization of the models. Therefore, we wrote all model codes in Python (except ML-LR, which is implemented in a publicly available Matlab toolbox), implemented them on the same hardware, and optimized them to reasonably high levels, including vectorization, efficient function calls, and efficient data structures. The training complexity estimates do not include data processing and variable declaration/initialization, and only correspond to the time taken for training the models. The prediction time complexity corresponds to the time taken for producing one realization of a 1000-record-long forecast. Since we can only provide estimates for the optimal performance of different methods, we report only the orders of magnitude of the elapsed time (Table 1, columns 3 and 4).
Among the stochastic augmentations of LR, ML-LR is one order of magnitude more expensive to train and forecast compared to LR and its additive white noise extension.
The higher complexity of ML-LR is simply due to the extra regression layer. Similar training and prediction times of LR and LR-AWN are due to the same trained core, i.e., LR coefficients; the white noise parameters in LR-AWN are inexpensive to train.
Mathematically, for LR and its white noise stochastic extension, as the number of PCs (say, n) increases, we expect the training time complexity to increase as O(n^2), which is the size of the trained regression matrix, and the prediction time complexity to increase as O(n). Due to the added levels (say, l) in the multi-level formalism, the corresponding training and prediction times are expected to increase additionally by O(n^2 l + n l^2 + n l), assuming n >> l, and O(l), respectively.
In the standalone deep-learning class, LSTM is an order of magnitude more expensive to train than ANN, but two orders of magnitude costlier to produce forecasts. The higher computational cost of LSTM is due to its more complex architecture and the large number of past states passed as input, set by the 'look back' hyperparameter. However, in the hybrid category, both ANN- and LSTM-based methods have costs of the same order of magnitude. Overall, we conclude that (i) adding simple stochasticity bears a negligible computational cost, but a more complicated red-type noise addition can increase the training and prediction cost by an order of magnitude; (ii) among the deep-learning methods, LSTM is equally or more costly to train and forecast than ANN; (iii) both LSTM and ANN benefit from reduced training time in the hybrid framework compared to their standalone implementations, as a less complex network design suffices for optimal performance.

Discussion and Conclusion
We have presented a comprehensive inter-comparison of Linear Regression (LR), its various stochastic extensions, deep-learning models (ANN and LSTM), and their hybrid formulations with additive noise (see Table 2 for an overview), to obtain a low-cost, reduced-order model for the complex multi-scale spatio-temporal flow of the upper ocean.
LR has the simplest form and, thus, provides a baseline for assessing the performance of the other models. The PC time series exhibit both smooth and noisy patterns, referring to a mixture of low- and high-frequency variability. We have modeled the PCs' dynamics using different methodologies, obtained forecasts, reconstructed the physical spatio-temporal fields using the EOFs, assessed the results using several assessment metrics, and inter-compared the model skills. Training of the models is done using an approximately 1100-year-long dataset with 10 days as the sampling period. We have studied forecasts on both short and long timescales, with lead times on seasonal and centennial time scales, respectively. To assess the accuracy of short-time forecasts, we have used RMSE and Anomaly Cross-Correlations (ACC), and, for long-timescale forecasts, we used climatology, variance, and frequency maps. Additionally, we have diagnosed the stability and computational costs of the methods using the forecast horizon and training/prediction time complexities, respectively. It is also possible to define a few dynamically inspired performance metrics, such as the 'product integral' discussed in Agarwal et al. (2020), for future eddy emulators, but developing and applying such metrics is beyond the scope of this paper.
We have made the following key observations during the assessment:
• On short forecast lead times (e.g., several months), Multi-Level LR (ML-LR) delivers the best performance, whereas the standalone deep-learning models show large RMSEs and poor ACCs both in space and time. The hybrid methods with additive noise are the next best after ML-LR; the ANN-hybrid model performs marginally better than the LSTM-hybrid. The success of ML-LR and the noise-augmented hybrid models reflects the importance of memory effects and stochastic noise when accounting for dynamical interactions with the removed EOFs/PCs (i.e., beyond rank 150). Including these effects, which account for the truncated dynamics, is rigorously justified in the Mori-Zwanzig theory of statistical mechanics, which has also started to attract attention in deep learning (Wang et al., 2020). ML-LR conveys memory effects using an additional LR level, leading to red-type noise, whereas LSTM achieves this by definition. The state dependency of the noise is achieved using additional regressions in ML-LR, whereas in the hybrid models this is accomplished by using the state variables as inputs alongside the LR residuals to forecast the residuals for the next time instant.
Additionally, both models explicitly resolve the linear dependency, which is vital because pure ANN/LSTM (with or without noise) display a performance even worse than persistence on short lead times. The linear dynamics is pertinent to the gyres and must explain a significant proportion of the variance in the top PCs. Representing it using linear regression (i) determines the coefficients accurately, up to machine precision, (ii) leaves fewer terms to be learned, and (iii) improves the signal-to-noise ratio of the residual statistics on which the ANN/LSTM is subsequently trained. Therefore, we argue that the use of bare ANN/LSTM is not useful in situations dominated by linear terms that can be learned via regression (similar to Pyle et al. (2021)). In such cases, it is better to use the ANN/LSTM as a correction (potentially stochastic) term in combination with LR to obtain an optimal closure model for the dynamics of the retained EOFs.
• For long-timescale forecasts (e.g., several centuries), the white-noise extension of LR, ML-LR, and the stochastic hybrid models perform the best, as they correctly resolve both low- and high-frequency variability across the domain (with the right frequency magnitude) and produce a low climatological bias. The stochastic ANN-hybrid produces a lower climatological bias than its LSTM counterpart, but the latter produces a relatively better frequency map; the variance estimates are almost the same for both. LR outputs decay to zero, whereas all standalone deep-learning methods (with or without noise) generate strong climatological drifts and fail to represent the flow variability both in the jet and the gyre regions.
• All models show a better forecast horizon than LR. In particular, all the noise-augmented models produce stable, non-zero, and finite forecasts on climate-like lead times, e.g., centuries and millennia, meaning that adding noise to deterministic models alongside the system variables boosts their stability while keeping the forecasts bounded and physically relevant.
• The simple white noise extension of LR shows a similar training and prediction time complexity to the LR model, but the costs grow for more complex architectures, such as ML-LR. Deep-learning-based methods are the most expensive to train but relatively cheap during forecasts, a potentially useful property from a climate forecasting viewpoint. Note that LSTM models are more expensive to train and forecast than the others, so it is better to use them in the hybrid configuration, where they benefit from a simpler model configuration and, thus, fewer trainable parameters and shorter training (and forecast) times.
Overall, our results demonstrate the advantage of ML-LR and the stochastically augmented hybrid models for building simple, stable, and low-cost reduced-order ocean emulators (within the EOFs/PCs framework) for producing short-/long-timescale forecasts. The success of these methods highlights the importance of including core dynamics, memory effects, and model errors when building reliable emulators. In this application, we have considered the core dynamics as linear and concentrated on the latter two components to show that state-dependent additive noise is an excellent way to account for memory and unknown dynamical model errors in emulators. ML-LR allows only for linear additive state-dependent noise (and memory), but hybrid deep-learning models can potentially learn very general forms of non-linear and multiplicative state-dependent noise. The similar outcomes of ML-LR and the hybrid deep-learning models suggest that ML-LR produces a near-optimal noise configuration, which the deep-learning models learned successfully. Further evidence of the importance of state-dependent noise is the poor performance of a purely red-noise augmentation of LR (not shown), which has memory but no state dependency of the noise.
However, as for any data-driven method, the results presented here are valid for the current training length, and we admit that a longer training period may improve the models' performance. However, using a longer training length can be computationally prohibitive, and sufficiently long ocean observations may not be available in real life. Therefore, the emphasis here is also on identifying the models that perform relatively well even on shorter training data. The use of orthogonal bases, i.e., EOFs/PCs, puts another constraint on the current study, as many fluid dynamical systems may not satisfy this assumption; however, using and comparing different basis transformations, e.g., Dynamic Mode Decomposition (DMD) modes, is beyond the scope of the paper and is left as a future exercise.
The current research can be extended along the following lines. The first and most straightforward direction is to test the performance of the best-performing stochastic models on a more complicated testbed, e.g., on the output of a comprehensive ocean general circulation model or coupled ocean-atmosphere models for emulating, say, ENSO or the Madden-Julian Oscillation, which incorporate significant delayed-time responses. Such an implementation would further ascertain our findings for building reliable emulators for complicated ocean/atmospheric processes. It is also worth looking for ways of imposing physical constraints, such as energy/mass conservation, on the emulators, e.g., using an appropriate penalizing term in the loss function. Secondly, the results obtained here are directly relevant for the emulation of various complex and multi-scale fields in the context of eddy parameterizations in low-resolution ocean models. Finally, a possible sequel to this work is to include more stochastic and deep-learning methods, or a mixture of both, e.g., Stochastic Neural Networks. We have started to develop a rigorous testbed for data-driven models and used it for several model setups, but we will expand this to more complex testbeds, models, and datasets in the future and check whether the conclusions still hold.