Interpretable structural model error discovery from sparse assimilation increments using spectral bias-reduced neural networks: A quasi-geostrophic turbulence test case

Earth system models suffer from various structural and parametric errors in their representation of nonlinear, multi-scale processes, leading to uncertainties in their long-term projections. The effects of many of these errors (particularly those due to fast physics) can be quantified in short-term simulations, e.g., as differences between the predicted and observed states (analysis increments). With the increase in the availability of high-quality observations and simulations, learning nudging corrections from these increments to correct model errors has become an active research area. However, most studies focus on using neural networks, which, while powerful, are hard to interpret, are data-hungry, and generalize poorly out of distribution. Here, we show the capabilities of Model Error Discovery with Interpretability and Data Assimilation (MEDIDA), a general, data-efficient framework that uses sparsity-promoting equation-discovery techniques to learn model errors from analysis increments. Using two-layer quasi-geostrophic turbulence as the test case, MEDIDA is shown to successfully discover various linear and nonlinear structural/parametric errors when full observations are available. Discovery from spatially sparse observations is found to require highly accurate interpolation schemes. While NNs have shown success as interpolators in recent studies, here, they are found inadequate due to their inability to accurately represent small scales, a phenomenon known as spectral bias. We show that a general remedy, adding a random Fourier feature layer to the NN, resolves this issue, enabling MEDIDA to successfully discover model errors from sparse observations. These promising results suggest that with further development, MEDIDA could be scaled up to models of the Earth system and real observations.


Introduction
Numerical solutions of physics-based models are the core of modern weather and climate predictions. However, climate and numerical weather prediction (NWP) models suffer from a variety of parametric and structural errors (model errors, hereafter). These model errors often arise from approximations to forcings and boundary conditions and, more importantly, poor representations of complex, usually small-scale, processes due to the lack of scientific understanding and/or limitations of computational resources (Stevens & Bony, 2013; Masson-Delmotte et al., 2021; Moon et al., 2018; Woldemeskel et al., 2012; Zadra et al., 2018). Reducing these model errors through improving the fundamental understanding of these processes and increasing the numerical resolution has been the subject of extensive past and ongoing efforts (Danforth & Kalnay, 2008; Bonavita & Laloyaux, 2020; Dunbar et al., 2021; Regayre et al., 2023). More recently, the rapid rise in the availability of high-quality, frequent observations, together with algorithmic advances, particularly in data assimilation (DA) and machine learning (ML) (Cheng et al., 2023), has provided another promising direction for reducing such model errors and the resulting biases and uncertainties.
Many model errors for the Earth system are associated with the "fast physics" (Rodwell & Palmer, 2007), and therefore, they can be observed and quantified over short prediction horizons. Such model errors can lead to deviations of the predicted states from the analysis states; such deviations are sometimes referred to as the DA analysis increment (Leith, 1978). Assuming the analysis states to be our best estimates of the true states of the underlying system, hereafter we treat them as truth. In short-time forecasts with NWP models, these deviations are corrected using observations and classical DA, e.g., the predicted states are nudged towards the observations. The analysis increments can also be examined to diagnose the roots of these model errors, potentially helping improve long-term weather forecasts and climate projections as well (Rodwell & Jung, 2008; Palmer & Weisheimer, 2011).
The latter perspective and the recent advances in ML have motivated many recent studies to leverage deviations of short-time predictions from observations and/or high-fidelity simulations to correct biases of NWP and climate models to improve long-term predictions and projections. The key underlying assumption of this approach is that correcting the errors arising from the fast physics can correct the model's predictions over long time scales. In this approach, the imperfect (numerical) models are initialized with the true state or its close estimate (e.g., the analysis state, from observations, or high-fidelity simulations), evolved forward in time for a short period, and the difference between the predicted and the true states, the "analysis increment", is quantified/parameterized as a representative of the model error.
A number of recent studies have followed this approach and used ML to learn the model errors, showing promising results in various systems, from simple toy models to NWP models (Carrassi & Vannitsem, 2011; Mitchell & Carrassi, 2015; Lang et al., 2016; Watson, 2019; Pawar et al., 2020; Pathak et al., 2020; Farchi et al., 2021; Watt-Meyer et al., 2021; Yuval et al., 2021; T.-C. Chen et al., 2022; Bretherton et al., 2022; N. Chen & Zhang, 2023; Mojgani et al., 2022; Gregory et al., 2023; Clark et al., 2022; Arcomano et al., 2023; Bora et al., 2023). However, most of these studies have directly learned the model error using deep artificial neural networks (ANNs): ANNs are trained using many pairs of states and analysis increments from a training set. Then, for out-of-sample states (from the same distribution as those of the training set), the ANN predicts the systematic model tendency correction needed to nudge the state of the NWP or climate model, thus improving the trajectory and potentially the simulated statistics. While ANNs are powerfully expressive, they have a number of major shortcomings: 1) they are difficult to interpret, 2) they do not generalize out of distribution, 3) they are data-hungry, and 4) their predictions, when fed into numerical models, can cause instabilities and unphysical drifts (Guan et al., 2022; Subel et al., 2023; Bretherton et al., 2022; Clark et al., 2022; Slater et al., 2023; Farchi et al., 2023; Pahlavan et al., 2023). Challenges with interpretability hinder understanding the root cause(s) of the model errors and fixing them. This and (2) together significantly limit the applicability of such ANN-based approaches to improving climate change projections.
(3) and (4) also pose major practical and operational limitations.The latter is particularly challenging as rigorous frameworks for quantifying the interactions between different sources of error in hybrid numerical-ANN solvers are lacking.
As an alternative to using ANNs to learn model errors, a smaller number of studies have pursued equation-discovery techniques (Lang et al., 2016; Mojgani et al., 2022; Pachev et al., 2022). In these approaches, one learns a closed-form, parsimonious representation of the analysis increment (an estimate of the model error) in terms of the state vector. For example, in Mojgani et al. (2022), we introduced Model Error Discovery with Interpretability and Data Assimilation (MEDIDA). Such closed-form, parsimonious representations can be highly interpretable and point to the root cause of the model errors. They are also often generalizable, as they explain the underlying physical mechanisms and, in principle, can be connected to the physics of the changing system. This approach is also data-efficient, usually requiring a small training set (Lang et al., 2016; Zanna & Bolton, 2020; Mojgani et al., 2022; Ross et al., 2023; Jakhar et al., 2023). While such model error representations might also lead to instabilities when used to correct the imperfect models, in principle, the existing rigorous tools and concepts for analyzing the stability of numerical methods can be applied to them. Finally, there are potential advantages of such closed-form equations over ANNs in terms of implementation and coupling to traditional numerical solvers.
In Mojgani et al. (2022), the capabilities of MEDIDA for discovering nonlinear structural errors were showcased on a highly chaotic, multi-scale canonical test case, i.e., the Kuramoto-Sivashinsky (KS) equation. In that work, equation discovery was performed using the relevance vector machine (RVM) (Tipping, 2001), which is a Bayesian linear regression, with an interpretable library of linear and nonlinear bases built on the knowledge of domain experts and common physical terms (note that any equation-discovery method can be used in MEDIDA). Furthermore, it was shown that even if observations of the true system are noisy, DA methods such as the ensemble Kalman filter (EnKF) can be used as smoothers with MEDIDA to successfully discover model errors.
In this paper, we first show the capabilities of MEDIDA using a much more complex and climate-relevant test case, the two-layer quasi-geostrophic (QG) turbulence. Second, we scale MEDIDA to work with spatially sparse observations, a challenging task that is essential for using model error discovery methods with imprecise observations such as those from in-situ and remote-sensing measurements. A critical component needed for dealing with sparse observations is an "interpolator". The quality of the interpolation scheme plays a significant role in equation discovery, as inaccuracies can degrade the library terms that involve high-order derivatives and/or strong nonlinearities.
In the context of equation discovery, algebraic interpolators, such as polynomials (Rudy et al., 2017) and splines (L. Sun et al., 2022), have shown success for canonical ordinary differential equation (ODE) or non-chaotic partial differential equation (PDE) test cases. More recently, ANNs have emerged as popular and powerful interpolators to reconstruct the state and discover the governing equations from spatial sparsity (Z. Chen et al., 2021) or partially observed states (Lu et al., 2022). Interpolation using ANNs provides a differentiable learning process that provides a flexible framework to formulate different optimization paradigms (Z. Chen et al., 2021), an encoder to construct partial observations (Lu et al., 2022), or a way to simultaneously estimate noise and the governing equation (Kaheman et al., 2022). However, the success of both algebraic and ANN-based interpolation schemes for highly chaotic and multi-scale test cases remains underexplored. In the case of ANNs, an implicit assumption is that the state and its higher-order derivatives, e.g., dispersion or dissipation terms in PDEs, can be accurately represented by ANNs. This assumption is often justified by simply referring to ANNs as universal approximators (Hornik et al., 1989).
In this paper, we show that for highly multi-scale systems such as turbulent flows, ANNs do not learn the small-scale features (high-wavenumber features), a well-known phenomenon in the ML literature called "spectral bias" (Rahaman et al., 2019; Xu et al., 2022). This bias, which does not show up in simple metrics such as pattern correlation and root-mean-square error (RMSE), can cause significant errors in the calculations of high-order derivatives, hindering successful equation discovery (of model error, and more broadly, other functions). We further show that this spectral bias can be mitigated by employing "random Fourier features (RFFs)" (Tancik et al., 2020), enabling the successful discovery of model errors from sparse observations. Our contributions in this paper are:
1. We scale MEDIDA up to a much more complex and climate-relevant test case, the QG turbulence.
2. We extend the use of MEDIDA to spatially sparse observations.
3. We identify spectral bias as a major shortcoming of using common ANNs as interpolators for equation discovery in widely multi-scale systems such as QG turbulence.
4. We show how this spectral bias can be alleviated using RFFs, enabling successful model error discovery from sparse observations where both algebraic interpolation and the naive use of ANNs fail.
5. We present an example, based on missing hyperdiffusion, showing that short-horizon DA increments may not be able to capture some structural errors, although such errors may have significant long-term effects.
Note that (3)-(4) have implications well beyond the topic of this study (model error, equation discovery), as spectral bias can significantly degrade the performance of ANNs when applied to multi-scale processes in the Earth system, leading to major challenges (e.g., long-term stability of the learned model (Chattopadhyay & Hassanzadeh, 2023)).
This paper is organized as follows. Section 2 summarizes the problem setup, our model error discovery framework, MEDIDA, and the new extensions to address spatial sparsity of observations. The details of our numerical experiments, i.e., the QG flow (as the perfect system), imperfect test cases, and the library of bases, are described in Section 3. The results on model error discovery from both full and sparse observations are presented in Section 4. Section 5 provides a summary and discussion.

Model Error Discovery and Extensions to Sparse Observations
In this section, the elements of MEDIDA are briefly described. Additionally, Section 2.3 presents our proposed approach to estimate the full state from spatially sparse observations.

Model error and state increment
Here, the notions of system (the truth), model (the imperfect representation of the system), and model errors (differences between the system and model) are formalized. The goal is to discover model errors from sporadic and possibly sparse/noisy observations of the true system. Note that hereafter, "system" and "observation" will be used to refer to either the natural system (e.g., the real atmosphere) and the associated observations, or a high-fidelity simulation of this system (e.g., global 1-km runs) and the associated numerical results.
Suppose that the spatio-temporal nonlinear evolution of the system is governed by a PDE, ∂w/∂t = g(w(x, t)), where w(x, t) is the state vector in space and time. We assume that the state can be observed in short temporal intervals, while the governing PDE operator, g(w(x, t)), is unknown to us. Moreover, we assume to have a model of the system, ∂w_m/∂t = f(w_m(x, t)), where f is the (imperfect) model operator. We further assume to have a working numerical solver for the model, such that given an initial state, w_m(x, t_i − ∆t_i), the state at the next time step, w_m(x, t_i), is predicted.
Figure 1: Schematic of MEDIDA for the QG test case. The states, ψ_1 and ψ_2, are observed at (potentially non-uniform) consecutive time steps in each layer, and the model error (in potential vorticity, q) in each layer is approximated from the analysis increment (Step 1). For clarity, only layer 1 is shown. A library of b basis functions is constructed using the observations (Step 2), and the Bayesian regression problem is formulated and solved to discover the closed-form model error (Step 3). For better illustration, the schematic shows using only two pairs of snapshots; h_i from several snapshots are stacked to form h. The calculation of h and Steps 2-3 are repeated for the discovery of the model error in layer 2.
Each sample is constructed from a pair of consecutive observed states, w_o(x, t_i − ∆t_i) and w_o(x, t_i). As described in Mojgani et al. (2022) and shown here in Fig. 1 in the context of the QG test case, in MEDIDA, the numerical solver of the model is initialized from the observed state, w_o(x, t_i − ∆t_i), to predict the state w_m(x, t_i). The model error can then be estimated via the analysis increment as h_i(x) = [w_o(x, t_i) − w_m(x, t_i)]/∆t_i. With precise observations, h_i includes structural/parametric model errors as well as errors from the numerical solver. Here, the structural/parametric model errors originate from the poor understanding of the true physical system (epistemic uncertainty), and numerical errors result from discretization in time and/or space. As our goal is to isolate and quantify the former, the latter has to be minimized, e.g., by choosing small ∆t_i and high spatial numerical resolutions. Treating observations as "precise" is reasonable if they are from high-fidelity simulations, but not if they are from measurements (in situ, remote sensing, etc.). In that case, the observations can be imprecise due to noise and/or spatial sparsity, or due to the unavailability of some of the state variables, the so-called hidden or latent variables (i.e., partial state vector). In Mojgani et al. (2022), we showed that noisy observations can be used in MEDIDA by employing DA techniques such as EnKF. In the current paper, we aim to tackle the challenge of spatial sparsity, as discussed in Section 2.3. Note that in the current paper, we will not address the challenge of partial state vectors; see Majda and Chen (2018) and Mou et al. (2022) for recent advances in that area.
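As a concrete illustration, Step 1 of the framework can be sketched as follows; the `model_step` solver and the normalization of the increment by the time interval (so that h_i has units of a tendency) are assumptions of this toy example, not the paper's exact implementation.

```python
import numpy as np

def analysis_increment(w_obs_prev, w_obs_next, model_step, dt):
    """Estimate one sample h_i of the model-error tendency.

    w_obs_prev, w_obs_next : observed states at t_i - dt and t_i
    model_step             : numerical solver advancing a state by dt
                             (hypothetical stand-in for the QG solver)
    """
    w_model = model_step(w_obs_prev, dt)   # model prediction from observed state
    return (w_obs_next - w_model) / dt     # analysis increment as a tendency

# Toy usage: the "model" under-damps the true linear system, so the
# increment should recover (approximately) the missing damping tendency.
true_rate, model_rate, dt = 1.0, 0.8, 1e-2
w0 = np.array([1.0, 2.0])
w1 = w0 * np.exp(-true_rate * dt)          # "observed" next state
h = analysis_increment(w0, w1,
                       lambda w, s: w * np.exp(-model_rate * s), dt)
# h is close to -(true_rate - model_rate) * w0 = [-0.2, -0.4]
```

Because the model is initialized from the observed state at every sample, the increment isolates the one-step error of the model rather than accumulating trajectory divergence.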

Interpretable model error discovery
The procedure in Section 2.1 can be repeated, given the availability of observation pairs, to collect n samples of h_i. Note that the time intervals, ∆t_i, between the observation pairs do not have to be the same (but need to be short enough to reduce the numerical discretization errors). We assume that the model error h_i(x) can effectively be represented by a pre-defined library of b basis functions ϕ_j, i.e., h_i(x) ≈ Σ_{j=1}^{b} c_j ϕ_j(x), where c_j is the coefficient corresponding to the j-th term. The bases, ϕ_j, can be independent of the state, e.g., to represent external forcings. However, they are often functions of the state variables, e.g., the spatial derivatives, polynomials, and their linear or nonlinear combinations (Rudy et al., 2017; Zanna & Bolton, 2020; Mojgani et al., 2022; Jakhar et al., 2023). Such terms are usually chosen based on the understanding of the underlying physics (Lang et al., 2016; Zanna & Bolton, 2020; Jakhar et al., 2023). That said, there are algorithms that provide more expressive and evolving libraries, e.g., through radial bases in a Gaussian process or symbolic regression, though the full physical interpretation of the models discovered via such algorithms can be complicated (Luo, 2019; Levine & Stuart, 2021; N. Chen & Zhang, 2023; Ross et al., 2023).
Note that in this paper, all calculations, including the derivatives for the library, are performed using the same Fourier spectral method used to solve the governing equations of the system and the model (see Section 4.2).
By assuming the candidate structural forms of the model error as in (5), we have reduced the problem of discovering structural errors to a parameter estimation problem: the coefficients have to be estimated such that the difference between the predicted and observed states is minimized, i.e., a regression problem, c* = argmin_c ‖h − Φc‖_2, where c* is the minimizer vector and Φ is the matrix whose columns are the library terms ϕ_j evaluated on the observations. Here, the n samples of the vector h_i(x) are stacked to form the vector h.
Substituting the discovered model error in (3) leads to the corrected model. To enhance interpretability, it is often favorable to limit the discovered model to its simplest form that can explain the highest variability, i.e., a parsimonious or sparse model (Rudy et al., 2017). Enforcing sparsity is the cornerstone of equation discovery and of many interpretable modeling efforts in statistical learning (Hastie et al., 2015). In this paper, parsimony is achieved using RVM, a Bayesian regression algorithm that has been successfully applied to various ODE/PDE discovery problems (Zhang & Lin, 2018; Zanna & Bolton, 2020; Mojgani et al., 2022; Jakhar et al., 2023). A threshold on the variance of the fitted coefficients is systematically chosen to prune the less relevant terms (those with the highest uncertainty in the coefficient). In this procedure, a range of the model hyper-parameter is swept, leading to a Pareto front, and the elbow of the L-curve is chosen as a compromise between accuracy and sparsity of the model. See Zhang and Lin (2018) and Jakhar et al. (2023) for more details and discussions of PDE discovery using RVM.
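To make the sparse regression step concrete, the sketch below recovers a sparse coefficient vector from a synthetic library; sequential thresholded least squares is used here as a simple stand-in for the paper's RVM, and all names and sizes are illustrative.

```python
import numpy as np

def stlsq(Phi, h, threshold=0.05, iters=10):
    """Sparse regression for h ~ Phi @ c by sequential thresholded least
    squares -- a simple stand-in for the RVM used in the paper."""
    c = np.linalg.lstsq(Phi, h, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(c) < threshold
        c[small] = 0.0                       # prune small (irrelevant) terms
        big = ~small
        if big.any():                        # refit the surviving terms
            c[big] = np.linalg.lstsq(Phi[:, big], h, rcond=None)[0]
    return c

# Toy library: columns play the role of candidate basis functions phi_j
# evaluated on the stacked observation samples.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 8))
c_true = np.array([0.0, 1.5, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0])
h = Phi @ c_true
c_hat = stlsq(Phi, h)   # recovers the two active terms, zeros elsewhere
```

The pruning threshold plays the role of the hyper-parameter swept in the L-curve procedure described above: larger thresholds give sparser, less accurate fits.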
The accuracy of corrected models is quantified using the normalized distance between the coefficients of the model and those of the true system (Rudy et al., 2017; Reinbold et al., 2020; Mojgani et al., 2022), i.e., ε_m = ‖c_s − c_m‖_2/‖c_s‖_2 and ε* = ‖c_s − c*‖_2/‖c_s‖_2, where c_s is the vector of the coefficients in the true system, and c_m and c* are the vectors of coefficients in the model and corrected model, respectively. We acknowledge that this measure is not practical when the true system is unknown; however, it is a commonly used measure to validate a methodology, see, e.g., Rudy et al. (2017) and Reinbold et al. (2020).
In practice, the predictive capability of the discovered model can be evaluated after the discovery step, an a posteriori test. The vectors of coefficients span the library of bases defined in Section 2.2, i.e., they are of size b. A successful discovery leads to ε* ≪ ε_m. Note that although this measure is often used in the equation-discovery literature, it can be misleading when the parameter of a dynamically important term is incorrect but much smaller than some of the other coefficients, or when such a term is entirely missing (structural error). In such cases, long-term simulations of the model can be statistically very different from those of the system even when ε_m is small; thus, ε* ≪ ε_m might fail as a proper criterion. Therefore, examining the statistics of the corrected model would be needed in such cases to provide a better metric. In our study, given that we precisely know the governing equations of the system, it suffices to examine the individual coefficients, as presented in the results section.
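The error measures can be computed in a few lines. The coefficient vectors below are purely illustrative (a hypothetical three-term library), with the drag coefficient −0.07 borrowed from the linear drag discussed in the results.

```python
import numpy as np

def coeff_error(c_s, c):
    """Normalized coefficient distance in percent: ||c_s - c|| / ||c_s||."""
    return 100.0 * np.linalg.norm(c_s - c) / np.linalg.norm(c_s)

# Illustrative coefficient vectors (hypothetical three-term library):
c_s    = np.array([1.0, -0.07,   0.2])   # true system
c_m    = np.array([1.0,  0.0,    0.2])   # model missing the drag term
c_star = np.array([1.0, -0.0699, 0.2])   # corrected model after discovery

eps_m, eps_star = coeff_error(c_s, c_m), coeff_error(c_s, c_star)
# A successful discovery gives eps_star << eps_m.
```

Note the caveat from the text: the missing drag contributes little to ‖c_s‖ here, so ε_m stays deceptively small even though the structural error may matter dynamically.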

Interpolation and discovery from spatially sparse observations
The previous steps assume the availability of observations of the full state. However, observations, e.g., in-situ or from remote sensing, are often sparse in the spatial domain. Therefore, to compute h_i and ϕ_j in MEDIDA, we need an additional "interpolation" step to estimate the full observed state. Then, the estimated state, as the output of this step, will readily, without requiring any further modifications to the framework, replace the fully observed states in Eqs. (5)-(7).
As shown later, the quality of this interpolation plays a major role in the success of equation discovery, in particular because of the presence of high-order spatial derivatives in the library of ϕ_j. Combining prior information with sparse observations to estimate/reconstruct the full state is known as the state estimation step and is a key component of DA algorithms such as Kalman filters/smoothers (Evensen et al., 2022). However, interpolation is known to be particularly challenging for multi-scale systems such as turbulent flows (Willcox, 2006; S. Chen et al., 2021; Beauchamp et al., 2022; Di Leoni et al., 2022; Kelshaw et al., 2022; Karnakov et al., 2023). In fact, our attempts at using EnKF with algebraic interpolations showed that an impractically large ensemble size might be required to achieve the estimation accuracy needed for a successful equation discovery. We emphasize again that the requirement for high accuracy in the interpolation task comes from the need to construct an accurate library of high-order derivatives and strong nonlinearities. Another approach to state estimation/reconstruction that has received substantial attention in recent years involves the use of ANNs as interpolators. This approach has shown success in a number of recent studies involving multi-scale toy models, including those motivated by Earth system-related applications (Z. Chen et al., 2021; Beauchamp et al., 2022).
In this paper, we leverage these advances and use ANN-based interpolators. Details of these ANN-based interpolators are given in Appendix A. Briefly, for a given snapshot at time t_i on a grid of size L, suppose ℓ = 1, 2, ..., L − S is the index of grid points at which the value of a function A(x_ℓ, y_ℓ) is known (i.e., observation datapoints), and η = 1, 2, ..., S is the index of grid points at which the value of this function is unknown (the level of sparsity is S/L). We train an ANN (a 5-layer perceptron) with inputs (x_ℓ, y_ℓ), outputs A(x_ℓ, y_ℓ), and a mean-square-error loss function. Once trained, the ANN estimates the missing values A(x_η, y_η) for inputs (x_η, y_η). This procedure is illustrated in Fig. 2. Note that unlike many other data-driven interpolators (Willcox, 2006; Fukami et al., 2021; Beauchamp et al., 2022), this approach does not need a history of past sparse observations, although using past data may become beneficial as the level of sparsity increases. As a baseline for our comparisons, we use an algebraic (cubic) interpolator trained on the same observations. To summarize, the trained interpolators (either algebraic or ANN-based) for each of the state variables are queried at the locations of the missing observations to estimate the full state at t_i − ∆t_i. This estimate is then evolved using the model (2) to predict w_m(t_i).
Figure 2: Schematic of the training process of the ANN-based interpolator on the observations, followed by the inference step at the locations of the missing observations. An ANN is trained at the locations of the known observation datapoints. After the training phase is complete, the trained model is queried at the locations of the missing observations. The union of the known datapoints and the interpolated states provides us with an estimate of the state necessary for the other steps of MEDIDA.
Moreover, the full observed state at the following time step, w_o(t_i), is similarly estimated from sparse observations. The approximated states are used to construct the library of bases in (5), and finally the sparse regression problem for the model error (6) is solved.
Note that our use of ANNs is restricted to interpolation, and unlike many studies mentioned in Section 1, the ANNs are not used to learn the model error.
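A minimal version of this interpolation step can be written with scikit-learn's MLPRegressor on a toy field; the actual 5-layer architecture and training details are in Appendix A, so everything below (field, sparsity pattern, network size) is an illustrative stand-in.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy doubly periodic "state" on a 32x32 grid.
rng = np.random.default_rng(1)
x, y = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
A = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

# Randomly observe ~70% of the grid points; the rest are "missing".
pts = np.column_stack([x.ravel(), y.ravel()])
mask = rng.random(pts.shape[0]) < 0.7

# Train an MLP interpolator (x, y) -> A(x, y) on the observed points only.
net = MLPRegressor(hidden_layer_sizes=(64, 64, 64),
                   max_iter=2000, random_state=0)
net.fit(pts[mask], A.ravel()[mask])

# Query the trained network at the missing locations.
A_missing = net.predict(pts[~mask])
```

The union of the observed values and `A_missing` then stands in for the full state in the increment and library computations, exactly as in the schematic of Fig. 2.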
The Quasi-Geostrophic System

The true system
As the true system, consider the non-dimensionalized two-layer QG model (Phillips, 1954), a simple baroclinic climate model, where subscripts 1 and 2 represent the upper and lower layers, respectively. Here, q is the potential vorticity, and ψ is the stream function. The equilibrium profile, ψ_R, is defined to enforce a baroclinically unstable jet, and is set to ∂ψ_R/∂y = sech²(y/σ), where σ is the width of the jet. τ_f is the Rayleigh friction time scale in the term that represents surface drag. τ_d1 and τ_d2 are the Newtonian relaxation time scales in the term representing radiative cooling. ν is the hyperdiffusion coefficient in the term that represents unresolved physics. The Jacobian term is included for both layers, k ∈ {1, 2}, and α is a constant we introduce to impose a model error in the nonlinear terms (for the true system, α = 1). The potential vorticity in each layer involves β, the meridional gradient (in the y-direction) of the Coriolis parameter, and R(x, y), the orographic forcing acting only on the lower layer. The orography is represented by a summation of Gaussian features, each centered at [x_i, y_i] with height r_i and variance σ_i², where (x_i, y_i) ∈ {(0, 0), (−10, −5), (0, 20), (0, −20)}, r_i = 20, and σ_i = 5 for i ∈ {1, 2, 3, 4} (depicted in Fig. 4a). Equations (9) are integrated using the pseudo-spectral solver of Lutsko et al. (2015). All parameters are the same as those in Lutsko et al. (2015), and are summarized in Table 1.

The imperfect models
For our numerical experiments, we have defined 12 (imperfect) models, introduced in Table 1. These cases consist of combinations of parametric and structural uncertainties due to different terms, representing different physical phenomena. Cases 1 and 2 include parametric uncertainty in the radiative cooling term, and in both the drag and radiative cooling terms, respectively. In Case 3, the radiative cooling term is completely missing, i.e., a structural uncertainty. Case 4 consists of a parametric uncertainty in the nonlinear Jacobian term. Case 5 includes a combination of parametric and structural uncertainties in all the aforementioned terms. Case 6 has only a large parametric error in the drag term. Cases 7 to 10 include structural uncertainties in different terms. Cases 1-3 and 6-8 represent model errors in linear terms, while Cases 4 and 5 have errors in the nonlinear terms.
In Case 9, β is missing, representing a major structural model error in the dynamics of the flow. In Case 10, the hyperdiffusion term is missing, another major structural error that, as discussed later, does not dominate the analysis increment, as its effects are not significant in the short-term evolution.
In Case 11, the drag term is modeled incorrectly. Inspired by Gallet and Ferrari (2020), the drag term is assumed to have the quadratic form in (13), which replaces the linear drag in layer 2. In this case, a successful discovery has to remove the extra terms and replace them with the correct terms from the library of bases.
The twelve cases described above provide a variety of benchmarks where parametric and structural uncertainty and misrepresentation of the system are incorporated into the model.

The library of bases
The library, ϕ_j as defined in Section 2.2, consists of candidate terms built from the state variables and their first and second derivatives in both the zonal and meridional directions. Any discovered orographic terms are expected to appear in the nonlinear form of Gaussian features, whose candidate centers are uniformly distributed in a 5 × 5 formation in the domain, with variance σ_i = 5. Moreover, nonlinear terms are included to represent the quadratic drag terms, and, to enrich the space of exploration, we also include additional nonlinear combinations. This choice of the library results in 85 candidate terms, consisting of linear and nonlinear terms.
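The library construction can be sketched as follows, with derivatives computed by the same Fourier spectral approach used by the solver; only a handful of candidate columns are shown (the actual library has 85 terms, including the orographic Gaussians), and the specific terms chosen here are illustrative.

```python
import numpy as np

def d_dx(f):
    """Spectral x-derivative of a doubly periodic field on [0, 2*pi)^2."""
    kx = 1j * np.fft.fftfreq(f.shape[1], d=1.0 / f.shape[1])  # integer wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(f, axis=1) * kx, axis=1))

def d_dy(f):
    """Spectral y-derivative of a doubly periodic field on [0, 2*pi)^2."""
    ky = 1j * np.fft.fftfreq(f.shape[0], d=1.0 / f.shape[0])
    return np.real(np.fft.ifft(np.fft.fft(f, axis=0) * ky[:, None], axis=0))

def build_library(psi1, psi2):
    """Stack a few illustrative candidate terms (state, derivatives,
    nonlinear products) as columns of the regression matrix Phi."""
    cand = [psi1, psi2, d_dx(psi1), d_dy(psi1), d_dx(psi2), d_dy(psi2),
            psi1 * d_dx(psi1), psi2 * d_dy(psi2)]
    return np.column_stack([c.ravel() for c in cand])

n = 64
xx, yy = np.meshgrid(np.linspace(0, 2 * np.pi, n, endpoint=False),
                     np.linspace(0, 2 * np.pi, n, endpoint=False))
Phi = build_library(np.sin(xx), np.cos(yy))   # shape (n*n, 8)
```

Because the derivatives are spectral, any smooth error introduced by the interpolation step is amplified at high wavenumbers, which is exactly why interpolation accuracy at small scales matters for the library.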

Numerical Experiments
In the following experiments, the first 500 Earth days of the simulation of the true system are dismissed as the spin-up period. Then, n = 100 pairs of observations (ψ^o_k(x, t_i − ∆t), ψ^o_k(x, t_i)) for k = {1, 2} are collected daily from a 100-day training period. The interpolation step (described in Section 2.3) is applied if the observations are sparse. Next, each imperfect model (described in Section 3.2) is evolved for one ∆t from each ψ^o_k(x, t_i − ∆t), and the analysis increment is calculated. The library of bases is constructed as in Section 3.3, and the minimization problem for discovery is solved as described in Section 2.2.
In Section 4.1, MEDIDA is evaluated for when full observations of the state are available (referred to as "direct"). In Section 4.2, the same full observations are used to train ANNs in order to evaluate the capabilities of ANNs as interpolation schemes. As shown below, we find spectral bias to be a major shortcoming of such interpolators for equation discovery, which we then address using RFFs. Then, in Section 4.3, MEDIDA with ANN+RFF interpolators (and baseline algebraic interpolators) is evaluated for spatially sparse observations.
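The RFF remedy amounts to mapping the input coordinates through fixed random sinusoids before the network. A minimal sketch of the mapping of Tancik et al. (2020) is below; the number of features and the bandwidth `sigma` are tuning choices for illustration, not values from this paper.

```python
import numpy as np

def rff_features(coords, n_features=128, sigma=10.0, seed=0):
    """Random Fourier feature mapping gamma(v) = [cos(2*pi*B v), sin(2*pi*B v)],
    where B is a fixed Gaussian random matrix (Tancik et al., 2020).
    Larger sigma biases the downstream network towards higher frequencies."""
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, sigma, size=(n_features, coords.shape[1]))
    proj = 2.0 * np.pi * coords @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

coords = np.random.default_rng(1).random((100, 2))  # (x, y) inputs in [0, 1)^2
feats = rff_features(coords)                        # fed to the MLP in place of raw (x, y)
```

In the interpolation setting of Section 2.3, the network would then be trained on `rff_features(pts)` rather than on the raw coordinates, which mitigates the spectral bias documented in the next section.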

Discovery from full observations (Direct)
First, we evaluate MEDIDA given full observations of the state variables, ψ_1 and ψ_2. Here, the analysis increments h_i and library terms are constructed directly using measurements of the state variables. The performance of MEDIDA for the first 10 test cases of parametric and structural errors (Table 1) is quantified and summarized in Table 2. In all of the first 9 cases, which involve linear and nonlinear structural errors, the proposed approach reduces the model error significantly: ε_m = O(10)-O(100)% is reduced to ε* = O(0.1)-O(0.01)% (the one exception is Case 5, layer 2, where ε_m = 103.6% is reduced to ε* = 1.3%, which is still a substantial reduction). For selected representative cases, Fig. 3 compares the true, model's, and corrected model's coefficients for some of the library terms (note the logarithmic scale). We see that in all these cases, MEDIDA successfully corrects the model (compare the black, red, and blue bars); this includes a case where the entire Jacobian in layer 2 is missing (Fig. 3b).
Case 10 is the only unsuccessful discovery: in this case, the model error is the missing scale-selective hyperdiffusion, which is needed to remove enstrophy from small scales and ensure the long-term stability of the solver. However, the effects of missing hyperdiffusion are small in short-term evolutions and therefore do not dominate the analysis increment (numerical errors, for example, could be comparable or even larger in magnitude). As a result, MEDIDA, or any method that relies on model error discovery from analysis increments, will fail. This case serves as an important reminder of this shortcoming of such approaches, as there are important processes that have small effects on the short-term evolution but significant effects on the long-term evolution of the climate system (one example is cloud microphysics (Atlas et al., 2023)).
For Cases 11 and 12, the model error discovery is successful. For Case 11, where the structural error is a quadratic surface drag (Eq. (13)) instead of a linear drag (−0.07∇²ψ2), MEDIDA successfully finds a correction that removes the incorrect quadratic term and adds the linear drag with only 0.6% error in τf (ε* < 0.01%).
Finally, in Case 12, where the orography profile R(x, y) is incorrect (compare Figure 4b with Figure 4a), MEDIDA successfully corrects the profile (Figure 4c), reducing the model error to ε* < 0.01%.
As reported in Mojgani et al. (2022) for a much simpler chaotic test case, here we again find that MEDIDA can effectively and robustly discover complex and nonlinear structural model errors.

Table 1: Twelve cases of imperfect models, which suffer from a range of parametric and/or structural errors due to different mis-represented physics. The coefficients of the true system are in the top row (εm = 0). Boxes are drawn around values that represent parametric model errors (P). The ✗ represents structural errors (S), i.e., missing terms. The error of the incorrect models, εm (%), is calculated separately for each layer.

The next question is the performance of MEDIDA with sparse observations. As mentioned before, simply using a linear or cubic interpolation scheme did not lead to satisfactory discoveries in most cases (Table 2). The use of ANN-based interpolators resulted in even worse failures (not shown). To understand the source of these failures, we next examine the performance of ANNs when simply used to "represent" the fully observed states.

Spectral bias in ANN-based representation of broadly multi-scale states
The speed and low cost of inference, and the possibility of automatic differentiation (AD) and differentiable modeling, have made ANNs increasingly attractive for applications involving PDEs. In climate/weather modeling, such applications range from DA (Farchi et al., 2023) to online learning of parameterizations (Frezat et al., 2022), and to ambitious attempts such as the discovery of full governing equations (Z. Chen et al., 2021; Shankar et al., 2023) and more (Shen et al., 2023). Here, we are particularly interested in the interpolation capabilities of ANNs, as described in Section 2.3. In this section, we evaluate the performance of ANNs in representing broadly multi-scale states, as a prerequisite for their use as interpolators for applications involving turbulent flows (Section 4.3). In these experiments, similar to Section 4.1, the state variables are fully observed (without sparsity); however, the model error and the library terms are computed using the output of the trained ANNs, as explained below.
For each observed sample at time t_i, we train one ANN per state variable to fit (i.e., represent) the fully observed state: a map between each grid point in the domain and the state variable ψ^o_k (k = 1 or 2). In other words, using the notation introduced in Section 2.3, the input of the ANN is the coordinate (location) of the observations, (xℓ, yℓ), and the output of the network is the corresponding state variable at that location, ψ^o_k(xℓ, yℓ), where ℓ = 1, 2, ..., L (no sparsity, S = 0). Thus, the network is trained to fit/represent the full state. Next, the output of the trained ANN is used as the observation in MEDIDA. One can still use conventional numerical schemes, such as spectral or finite difference (FD) schemes, on the outputs of the ANN to compute the derivatives and build the library. Alternatively, because this ANN-based representation of the state is "differentiable", one can use AD to compute the derivatives and the library. Note that the same steps are followed later when interpolation is needed to deal with sparsity (where S > 0). Using this approach for fully observed states, we have trained ANNs that, based on common metrics such as the pattern correlation R² and the relative RMSE, seem to provide very accurate representations of the state variables (R² around 0.99, relative RMSE around 0.25%). Next, using these trained ANNs, we have repeated the experiments of Section 4.1, but found the discovery with MEDIDA to fail in all cases, i.e., ε* > 100% (Table 2). This failure occurs regardless of which method (AD, spectral, or FD) is used to compute the derivatives. As a reminder, these observations are not sparse, and the direct method of simply applying MEDIDA works very well, as already discussed in Section 4.1. Note that in these experiments, any failure in model error discovery must be due to these seemingly accurate ANN-based representations of the full states.
A detailed analysis shows that the ANN-based representation is only accurate at the large scales and poorly reproduces the small scales (i.e., the high wavenumbers). To demonstrate this, examples of the power spectra of the state and of its first- and second-order spatial derivatives in the zonal direction are shown in Fig. 5. The ANN has captured the large-scale features of the state accurately (here, for κx < 15, where κx is the wavenumber in the zonal direction), which leads to a high pattern correlation and a small RMSE. However, the spectra clearly deviate from those of the truth at the small scales (here, for κx > 25).
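Spectra of this kind can be computed directly with an FFT along the zonal direction. Below is a minimal sketch (not the paper's code) on a toy doubly periodic field whose energy is confined to two zonal wavenumbers; the grid size and wavenumbers are illustrative.

```python
import numpy as np

def zonal_power_spectrum(field):
    """Mean power spectrum along the zonal (x) direction of a 2D
    field of shape (ny, nx), assumed periodic in x."""
    fhat = np.fft.rfft(field, axis=1)
    return np.mean(np.abs(fhat) ** 2, axis=0)

# Toy field: energy confined to zonal wavenumbers 3 and 10.
nx = 64
x = np.linspace(0, 2 * np.pi, nx, endpoint=False)
X, _ = np.meshgrid(x, x)                       # 64 x 64 grid, X varies zonally
psi = np.sin(3 * X) + 0.1 * np.sin(10 * X)
spec = zonal_power_spectrum(psi)
```

Comparing such spectra for the truth and for the ANN output reveals the missing energy at high κx that the R² and RMSE metrics conceal.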
Figure 3: The true, the model's, and the corrected model's coefficients for a few selected library terms in three representative cases: (a) Case 3 (layer 1), (b) Case 5 (layer 2), and (c) Case 8 (layer 1). The cases are defined in Table 1, and the discovered coefficients correspond to the results reported in Table 2. The discovery is performed on full observations (results shown for Direct and ANN+RFF) and on 5% sparse observations (using the cubic and ANN+RFF interpolation schemes). The legend entries are: Truth, Model, Direct, ANN+RFF, Cubic (5%), and ANN+RFF (5%). The vertical axis is logarithmic, showing the model's and the corrected model's coefficients for each of the selected library terms, i.e., the prominent terms in Eqs. (9) to (11). While the discovery using direct observations is consistently successful for full observations, the plain ANN fit to the same observations consistently fails (results not shown; see the discussion in Section 4.2). The proposed ANN+RFF significantly alleviates this issue by reducing the spectral bias. See the discussion in Section 4.2 on the few incorrectly discovered additional terms with ANN+RFF (e.g., 0.01 × ∇²ψ1 and 0.01 × J2 in (a) and (c)).

The poor representation of the small scales in the state variables significantly degrades the accuracy of the derivatives (and thus of the bases in the library). Figure 5 also compares the accuracy of the AD, spectral, and second-order central FD schemes in computing the first and second derivatives of the output of the trained ANNs. The spectral (Fourier) derivative of the full state (from the numerical solver) is considered the truth. All derivatives are highly inaccurate. Not surprisingly, the FD scheme is the least affected by the inaccuracy at the high wavenumbers, though it is less accurate at the lower wavenumbers compared to the spectral method. Surprisingly, AD is the least accurate method, implying its high sensitivity to the quality of the ANN-based interpolators.
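The differing sensitivities of the spectral and FD derivatives to high wavenumbers can be seen in a one-dimensional toy example (illustrative, not from the paper): differentiating a high-wavenumber sine, the Fourier derivative is exact to machine precision, while the second-order central FD scheme incurs a large dispersive error at this wavenumber.

```python
import numpy as np

n = 128
L = 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
f = np.sin(20 * x)             # a high-wavenumber ("small-scale") signal
df_true = 20 * np.cos(20 * x)

# Spectral (Fourier) derivative: exact for band-limited periodic signals.
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
df_spec = np.real(np.fft.ifft(1j * k * np.fft.fft(f)))

# Second-order central finite difference: dispersive error grows with wavenumber.
dx = L / n
df_fd = (np.roll(f, -1) - np.roll(f, 1)) / (2 * dx)

err_spec = np.max(np.abs(df_spec - df_true))
err_fd = np.max(np.abs(df_fd - df_true))
```

On a clean signal the spectral method wins decisively; the comparison in Fig. 5 shows how this ordering changes once the input itself (the ANN output) is corrupted at high wavenumbers.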
The loss in accuracy at high wavenumbers is similar to using a coarser effective resolution, here almost half of the actual resolution. In other words, the output of a naively trained ANN is only as accurate as the state represented on a coarse grid with half the resolution of the observations. However, there is a subtle but consequential distinction: while representing the state on a coarser grid smoothens the representation, ANNs inject relatively large errors into the high wavenumbers. Since this error is limited to the small scales, it does not show up in common metrics of accuracy such as pattern correlation and RMSE.
The origin of this error (the poor representation of the small scales) is "spectral bias", a well-known phenomenon in the ML literature (Rahaman et al., 2019; Xu et al., 2022). Although ANNs, as universal approximators (Hornik et al., 1989), can fit any observations (given a large-enough network size and training set), it has been shown that the small-scale features are slower to be learned, to the extent that ANNs trained with traditional loss and activation functions, regardless of the architecture, are practically incapable of representing the full range of scales. While this inductive bias can actually be useful in many common ML applications (in which the signal is only in the large scales), it can cause major issues for applications involving physical systems, especially those with broadly multi-scale features and no scale separation. One prominent example of such systems is turbulence. For instance, see Chattopadhyay and Hassanzadeh (2023), who identified spectral bias as the root cause of the long-term instability of data-driven weather models and offered a solution based on a regularized loss function.
There are various ways to reduce spectral bias (Xu et al., 2022; Chattopadhyay & Hassanzadeh, 2023). In the current application, i.e., using ANNs as interpolators, an easy and powerful solution is to equip the ANN with an RFF layer (Tancik et al., 2020), which has been shown to be a promising approach for canonical multi-scale PDEs (Wang et al., 2021). Details of the ANN+RFF approach are presented in Appendix A. Briefly, RFF is a differentiable but non-trainable layer that slightly and randomly perturbs the ANN's input, which in our application is the grid location (xℓ, yℓ). The perturbation is done via a nonlinear mapping using sine and cosine functions with random wavenumbers (Eq. (A1)). Adding the RFF layer helps with learning the small-scale features before the network is over-fitted on (biased towards) the large-scale features.
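As a concrete illustration, the sketch below implements the RFF encoding of Tancik et al. (2020): the input coordinates are mapped through sines and cosines with fixed random wavenumbers drawn from N(0, σ²), and the resulting features are fed to the ANN instead of the raw coordinates. The layer width and function names here are illustrative; the paper's exact setup is given in Appendix A.

```python
import numpy as np

def rff_features(coords, n_features=256, sigma=1.0, seed=0):
    """Random Fourier feature encoding gamma(v) = [cos(2*pi*B @ v), sin(2*pi*B @ v)],
    where B is a fixed (non-trainable) matrix of random wavenumbers ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    B = sigma * rng.standard_normal((n_features, coords.shape[1]))
    proj = 2 * np.pi * coords @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

# Encode 2D grid locations (x, y); the ANN then takes gamma as input
# instead of the raw coordinates.
coords = np.random.default_rng(1).uniform(size=(100, 2))
gamma = rff_features(coords)
```

Because B is fixed, the layer is differentiable with respect to the coordinates (so AD still works) but adds no trainable parameters; σ controls the spread of the injected wavenumbers.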
Figure 5(d) shows that adding the RFF layer significantly reduces the spectral bias in the interpolator, which can now almost perfectly represent the state across the scales, even at the highest wavenumbers. Consequently, the representations of the derivatives (and thus the library) also substantially improve (panels (e) and (f)). Not surprisingly, there are still some differences in the performance of the different differentiation schemes, especially for the higher-order derivatives. Most notably for AD, and to some degree FD, the RFF induces noise at the high wavenumbers, which results in relatively noisy derivatives there. Nevertheless, ANN+RFF with any differentiation scheme is much more accurate than the plain ANN. Overall, the Fourier spectral method shows the best performance, and we use this method hereafter for building the libraries.
Next, we repeat the experiments of Section 4.1, but now using ANN+RFF to represent the fully observed states. As shown in Table 2, all model errors are successfully discovered, with ε* < 2%, which is a significant improvement over when the ANN alone was used (ε* > 100%) and comparable to the performance of the Direct approach (ε* < 1.3%).
In Fig. 3, we further examine the performance of ANN+RFF for a few cases where ε* is O(1)%. Moreover, for layer 1 in Cases 3 and 8, ε* is larger than εm. Studying layer 2 in Case 5 shows that while ε* = 1.45%, it is significantly decreased from the original model error of εm = 103%. No additional term has been incorrectly identified, but a few of the coefficients, e.g., those of ψ1 and ψR, are slightly larger than they should be. Note that in this example, ε* is much smaller than εm, since the largest contribution to this metric is from the Jacobian term, overshadowing the success of the method in also discovering the relatively smaller but important terms, i.e., the coefficients of ψ1, ψ2, and ψR.
Cases 3 and 8 are more interesting: in these cases, while the original model errors have been corrected, a few additional terms have been mistakenly added to the corrected model. In Case 3, the source of the structural model error, the missing radiative cooling, has been correctly discovered with the right τd1 and τd2. However, additional errors, involving ∇²ψ1 and J2, have also been added, leading to an increase in ε*. That said, the incorrectly added terms have small coefficients of O(0.01). As a result, 0.01J2 is not expected to substantially influence the performance of the corrected model, given that a J1 term with an O(1) coefficient already exists in the equations of layer 1. Similarly, in layer 1 of Case 8, MEDIDA has corrected the most important terms, i.e., the coefficients of ψ1, ψ2, and ψR; however, again, additional terms involving ∇²ψ1 and J2 have been identified too, though their coefficients are small (O(0.01)). Overall, this example shows the importance of interpretable model error discovery using ML, as in the end, the user's domain knowledge can be leveraged to understand whether a correction is physical or spurious.
To summarize this section, we have shown that spectral bias can significantly degrade the performance of ANNs when they are used as interpolators for equation-discovery techniques. Furthermore, we have shown that adding an RFF layer to such ANNs substantially improves their performance and leads to successful model error discovery, at least when tested on fully observed states. Next, in Section 4.3, we test this approach on sparsely observed states. But before that, it is worthwhile to further discuss spectral bias and its implications for climate science applications.
Spectral bias is abundant in ML applications to images and patterns, and it is a well-known problem in computer vision, though there it often does not lead to a major loss of accuracy. However, spectral bias can be severe for many climate processes, such as geophysical turbulence, which involve a very broad range of spatial scales. To show this, we compare the Fourier spectrum of a typical natural image (e.g., a fox) to that of snapshots of the solution

Discovery from spatially sparse observations
Leveraging the success of ANN+RFF as an interpolator, we now repeat the experiments of Table 2 but with observations that have 5% random sparsity (S/L = 0.05). ANN+RFF and a baseline bi-cubic interpolation scheme are used, following the procedure described in Section 2.3. The ANN-based interpolators are used only to estimate the missing values; the same spectral method is then used on the estimated full state to compute the derivatives and build the library.
Table 2 presents the discovery using MEDIDA with the two interpolators. In all cases, ANN+RFF outperforms the baseline, often by a huge margin. One exception is Case 9, where both interpolators yield comparable, excellent results. MEDIDA with ANN+RFF successfully discovers the model error in all cases but two, often reducing ε by a factor of 10 to 100. The first exception is Case 10, in which, as discussed before, the missing hyperdiffusion does not affect the analysis increments, failing the discovery. The other is Case 3, in which ε* remains close to εm. Although the structural error (missing radiative cooling) has been discovered and corrected to some degree (by adjusting the coefficients of ψ2 and ψR), the coefficients are still not very accurate, and an extra term (involving ∇²ψ1) has been mistakenly identified (Fig. 3a). Still, the performance of the ANN+RFF interpolator is significantly better than that of the baseline, which cannot identify the structural error at all. Note that other complex cases, such as Cases 5 and 8, show how well MEDIDA with ANN+RFF works in correcting the model error without adding any additional terms (see Figs. 3b and 3c).
Overall, MEDIDA with ANN+RFF successfully discovers model errors from sparse analysis increments, although cases like Case 3 show the challenging nature of sparse observations. In fact, while 5% sparsity might seem low, state reconstruction (super-resolution) from sparse observations of turbulent flows is known to be a challenging task in various areas of science and engineering (Willcox, 2006; Agarwal et al., 2021; Kelshaw et al., 2022; Fukami et al., 2021; Karnakov et al., 2023). Equation discovery from such sparse observations is even more challenging. Here, unlike other studies, we use interpolators trained on only one sample of sparse observations and show promising results once spectral bias is addressed. This progress will enable future studies to focus on developing better interpolators (using several observations and deeper/better architectures) that could handle higher sparsity. For example, Fukami et al. (2021) used deep convolutional neural networks and a sequence of observations to successfully perform super-resolution of turbulent flows with 20% sparsity, although whether the accuracy of the interpolation is sufficient for higher derivatives and equation discovery remains to be investigated.

Discussion and Future Work
In this paper, we scaled our recently proposed framework, MEDIDA (Mojgani et al., 2022), to a much more complex and climate-relevant test case and to spatially sparse observations. MEDIDA discovers, from analysis increments, closed-form equations that represent structural/parametric model errors. As a result, unlike approaches that learn model errors using neural networks, MEDIDA is data efficient and yields physically interpretable corrections (nudging terms). The interpretability of the discovered model errors also increases the likelihood that the corrected models can generalize beyond the training climate, which is a major limitation of deep learning-based model error discovery approaches. Thus, models corrected using short-term evaluations (at weather timescales) might also lead to more accurate long-term climate projections.
The numerical experiments explored in this paper show the possibility of discovering, from analysis increments in a fully turbulent flow, different forms of linear and nonlinear structural (and parametric) model errors that are due to various physical processes. However, we also find an exception, which, while not surprising, has been ignored in most past studies: there are structural errors whose effects do not dominate the short-term evolution (and thus the analysis increments) but have major effects on the long-term evolution of the system. Here, we show such an example based on missing hyperdiffusion, but there are other physical processes, such as cloud microphysics, that can behave in this way as well. While a deeper examination of the analysis increments might provide some information about the effects of such processes, it will be hard to separate their effects from those of numerical errors, at least for the purpose of interpretable model error discovery.
Overall, we find that if the state of the truth is available without any noise or sparsity, and if the model error dominates the analysis increment, then MEDIDA can robustly discover complex model errors. These promising results already suggest that MEDIDA can be a potentially useful method for various model-error-learning applications, because in many such applications, as explored in recent studies but using deep neural networks (Watt-Meyer et al., 2021; Bretherton et al., 2022; Clark et al., 2022; Gregory et al., 2023; Bora et al., 2023), noise-free, full observations of the "truth" are available from reanalysis or high-resolution simulations.
Note that while in this paper we focus on model errors that are missing or incorrect terms in the governing equations, a dominant source of model error in climate science is the lack of numerical resolution, requiring the use of subgrid-scale parameterizations in low-resolution models. Such parameterizations can be learned from data, e.g., from filtered and coarse-grained high-resolution simulations (Zanna & Bolton, 2020; Yuval & O'Gorman, 2020; Y. Q. Sun et al., 2023). However, applying equation-discovery techniques to such filtered, coarse-grained data has robustly shown the emergence of the nonlinear gradient closure model, which is known to lead to instabilities when coupled to low-resolution models (Zanna & Bolton, 2020; Jakhar et al., 2023). One interesting future direction is to apply MEDIDA between low-resolution simulations (model) and high-resolution simulations (truth, system) and discover closed-form parameterizations from the analysis increment. Note that a few studies that use differentiable climate models (Frezat et al., 2022; Kochkov et al., 2023) have recently trained neural network-based parameterizations in a similar "online" fashion: to correct the discrepancy between the trajectories of low-resolution and high-resolution simulations. However, such neural network-based parameterizations, while powerful and already scaled to global circulation models (GCMs), remain uninterpretable and will likely not generalize out-of-distribution. These shortcomings can be addressed by approaches such as MEDIDA, though scalability to GCMs needs further exploration.
In this paper, successful model error discovery from sparse observations is made possible through the use of ANN-based interpolators, but only after a major shortcoming of such interpolators is addressed. In general, the use of ANNs for PDEs is implicitly based on the assumption that well-trained ANNs can accurately represent the states as well as their relevant high-order derivatives. Here, we demonstrate that spectral bias in ANNs is a fundamental hindrance to the accurate representation of states and, more importantly, of higher-order derivatives, when applied to turbulent flows. While ANNs are highly expressive and can fit any arbitrary input-output map (Hornik et al., 1989), they also tend to be biased towards learning the large-scale (low-wavenumber) features of the data, a phenomenon known as spectral bias (Rahaman et al., 2019; Xu et al., 2022). Given that turbulent flows (and some other processes in the climate system) are inherently broadly multi-scale and span a wide range of wavenumbers, a naive application of ANNs to such systems will lead to poor representations of the high wavenumbers. Such small-scale features do not measurably contribute to the most commonly used loss functions, e.g., L2-norms, and are therefore often neglected by most of the commonly used metrics when the output of a network is compared with the truth. As discussed in the paper, spectral bias may not cause major issues in many applications, but here, we show that it is a major shortcoming of ANN-based interpolators for equation discovery (see Chattopadhyay and Hassanzadeh (2023) for an example of the implications of spectral bias for the long-term instability of data-driven weather models).
To reduce the spectral bias and develop accurate ANN-based interpolators, we have demonstrated the effectiveness of adding an RFF layer to the classic ANN architecture. We show that ANN+RFF captures the entire range of scales in turbulent flows, enabling MEDIDA to successfully discover model errors from sparse observations. In dealing with real observations (from remote sensing or in-situ measurements), aside from spatial sparsity, another major challenge is noise (measurement errors). Previously, in a simple chaotic test case, we demonstrated that noise can be adequately reduced using DA techniques, such as ensemble Kalman filters with a large-enough ensemble size, for the successful discovery of model errors with MEDIDA (Mojgani et al., 2022). However, the rate of reduction of noise (as a function of ensemble size) in higher-order derivatives of the state in large-scale problems can become a practical challenge. One solution to this challenge is using efficient surrogates, such as reduced-order models (ROMs) or ANNs with accurate short-term prediction horizons (Chattopadhyay et al., 2023), to increase the number of ensemble members at a low computational cost.
Interpretable model error discovery from noisy, sparse observations is challenging. While the results presented here are promising for this QG test case and for low-level random sparsity, more extensive explorations using more challenging test cases, e.g., intermediate-complexity GCMs and more realistic synthetic observations, are needed. The spectral bias-reduced ANN+RFF interpolators offer a step toward success in this direction.

Open Research Section
All data are generated by running the solver, which is made available at www.github.com/rmojgani/MEDIDAQG. The codes to evolve the model, generate the library of bases, and perform the discovery, as well as to train the neural networks and the interpolation schemes, are made available at www.github.com/rmojgani/MEDIDAQG.

The RFF layer improves the learning of features at every wavenumber by perturbing the eigenvalues of the neural tangent kernel (NTK) towards a well-behaved eigen-decomposition (Tancik et al., 2020). This addition to the network leads to learning the small-scale features before the network is over-fitted on (biased towards) the large-scale features.
In this paper, the ANNs are five layers deep, with 1000 nodes per layer. The Adam optimizer (Kingma & Ba, 2014) with learning rates of 10⁻³ and 5 × 10⁻⁴ is used in all the experiments. We have found σ = 1 to provide the lowest training error in terms of how closely the spectra of the ANN output and the truth match.
200∆t is 1 Earth day. See Zurita-Gotor et al. (2014), Lutsko et al. (2015), and Nabizadeh et al. (2019) for detailed discussions and analyses of the system. The non-dimensionalized zonal and meridional domain widths are Lx = 46 and Ly = 68; 96 × 192 Fourier modes are used in the x and y directions, respectively. The width of the jet is set to σ = 3.5. The boundary effects at the northern and southern ends are damped by a sponge layer. The orographic forcing, when present, is described by (

Figure 4: The orography profile R(x, y) of the true system (a) and of the imperfect model (b). MEDIDA successfully corrects the profile, as shown in (c).

Figure A1: An ANN architecture with a non-trainable RFF layer to reduce the spectral bias in the interpolator. In this paper, separate ANNs are trained for each layer of the QG system, i.e., k = {1, 2}.

Schematic of the main steps of MEDIDA applied to the two-layer QG test case: Step 1: observe and evolve the model; Step 2: build the library of bases; Step 3: Bayesian regression. With these definitions in mind, the goal of MEDIDA is to approximate the structural and parametric model error at time t_i, h_i, by leveraging one or more observation pairs, w^o(x, t_{i−1}) and w^o(x, t_i), as h = Σ_j c_j φ_j(ψ1, ψ2), with the coefficients found from c* = argmin_c ||h − Σ_j c_j φ_j(ψ1, ψ2)||.

Table 2: For the first 10 cases, the errors of the corrected models, ε* (%), versus the errors of the original imperfect models, εm (%). The discovery is performed separately for each layer, and the errors are reported separately. †: In Cases 6 and 7, the error only appears in layer 2. ‡: In Case 10, the effect of the structural error, i.e., the missing hyperdiffusion, is too small in the short term and does not show up in the analysis increment. o: The failure of the ANN (as interpolator) is due to the spectral bias discussed in Section 4.2; this can be addressed by adding an RFF layer (ANN+RFF), extending MEDIDA to sparse observations.