Multiplicative Non‐Gaussian Model Error Estimation in Data Assimilation

Model uncertainty quantification is an essential component of effective data assimilation. Model errors associated with sub‐grid scale processes are often represented through stochastic parameterizations of the unresolved process. Many existing Stochastic Parameterization schemes are only applicable when knowledge of the true sub‐grid scale process or full observations of the coarse scale process are available, which is typically not the case in real applications. We present a methodology for estimating the statistics of sub‐grid scale processes for the more realistic case that only partial observations of the coarse scale process are available. Model error realizations are estimated over a training period by minimizing their conditional sum of squared deviations given some informative covariates (e.g., state of the system), constrained by available observations and assuming that the observation errors are smaller than the model errors. From these realizations a conditional probability distribution of additive model errors given these covariates is obtained, allowing for complex non‐Gaussian error structures. Random draws from this density are then used in actual ensemble data assimilation experiments. We demonstrate the efficacy of the approach through numerical experiments with the multi‐scale Lorenz 96 system using both small and large time scale separations between slow (coarse scale) and fast (fine scale) variables. The resulting error estimates and forecasts obtained with this new method are superior to those from two existing methods.

In the variational Data Assimilation literature, model error is often considered by formulating the forecast model as a weak constraint in the optimization problem (often referred to as Weak constraint 4D-Var; Tremolet, 2006;Zupanski, 1997). One approach to achieve this is through Long-window 4D-Var, where an additive model error term is incorporated as a control variable in the 4D-Var formulation, with initial conditions for each window held fixed (Fisher et al., 2005(Fisher et al., , 2011Tremolet, 2006). These approaches require apriori specification of the model error covariance matrix, while our task is to estimate the characteristics of the model error (Zhu et al., 2017). Estimated the model error covariance online in a particle filter in a 1000-dimensional Lorenz 1996 model. The advantage of a particle filter is, similar to long-window 4D-Var, that the error covariance of the state plays no role in the estimation. The method needs a first guess of the model error covariance matrix, which is then updated over time, with the restriction that only an additive second order moment is estimated.
More complex methods involve estimating a parameterization on-line using linear regression on a pre-defined large set of potential functional forms (e.g., Lang et al., 2016). Furthermore, off-line methods from machine learning, such as Relevant Vector Machines (Bishop, 2006) and Bayesian Symbolic Regression (Jin et al., 2020) have been explored to find structural model errors, and hence missing physics. These methods define a set of basic functions and build model equations from these that fit the data best. All these methods have in common that they do need to define a set of functional relations, after which the fitting is performed, which limits the freedom of structure of model errors. Bonavita and Laloyaux (2020) provide an in-depth investigation of how best to integrate machine learning for model error estimation into existing data assimilation methods. See also the recent work of (Brajard et al., 2021) where neural networks are used to emulate what is viewed as model error from a data assimilation run.
A related line of research involves stochastic parameterization and model reduction methods to account for model errors associated with unresolved sub-grid scale processes in data assimilation (e.g., Berry & Harlim, 2014;Lu et al., 2017;Mitchell & Carrassi, 2015;Mitchell & Gottwald, 2012). This is particularly relevant in weather and climate modeling where the system dynamics evolve on a wide range of spatial and temporal scales. Such model reduction methods involve modeling or parameterizing sub-grid scale processes in a more computationally tractable fashion than solving the true sub-grid differential equations. Often this is achieved through either deterministic or stochastic parameterizations which aim to capture the mean effects of small scale processes on the resolved variables. Several studies have demonstrated the superiority of stochastic over purely deterministic parameterizations in this regard (Buizza et al., 1999;Palmer, 2001).
Methods for stochastic parameterizations of multi-scale systems vary widely; from homogenization methods that are suited to systems with large time scale separations (Pavliotis & Stuart, 2008;Wouters et al., 2016) to fitting stochastic models to sub-grid tendencies (e.g., Arnold et al., 2013;Wilks, 2005). Methods of the form of the latter include that of Crommelin and Vanden-Eijnden (2008), who proposed utilizing a Conditional Markov Chain to represent the evolution of sub-grid tendencies given the state of the resolved variable. The transition matrices are estimated using realizations of the true sub-grid tendencies. Kwasniok (2012) explored a similar approach whereby a clustering algorithm was used to develop a cluster-weighted Markov chain to represent the sub-grid tendencies. Arnold et al. (2013) extended the work of Wilks (2005) by examining the potential of autoregressive error models to effectively parameterize sub-grid tendencies in the multi-scale Lorenz 96 system. This was further extended in Gagne et al. (2020) where a Generative Adversarial Network was trained on data of sub-grid tendencies and coarse scale variables from the 2 scale Lorenz 96 model. Lu et al. (2017) have proposed using a non-Markovian non-linear autoregressive moving average model to characterize model error. All of the aforementioned approaches require knowledge of the sub-grid scale equations, representative data of the sub-grid tendencies and/or full observations of the resolved variables. This reduces their applicability for more realistic data assimilation applications where knowledge of the sub-grid scale processes is unavailable and the resolved variables are only partially observed.
We propose a methodology for model uncertainty estimation that is specifically designed for partially observed systems and does not require knowledge of the sub-grid scale processes. The method is suited to systems where a locality and homogeneity assumption can be invoked, as this is used to regularize the ill-posed problem of estimating model errors from partial observations. In such systems, errors due to sub-grid scale processes are dependent only on neighboring states instead of the full resolved state vector, and the error statistics are the same at each location in space, or over larger parts of state space with similar physics.
The approach first approximates the conditional probability distribution of additive model errors given some informative covariates (e.g., the state of the system). This density is calculated from estimated model error realizations. These are obtained during a training phase by minimizing their variance conditioned on the informative covariates, constrained by available observations. Samples from the estimated distribution can then be combined with forward model simulations to generate a forecast distribution in any ensemble data assimilation framework. The distribution estimate is nonparametric, allowing for the characterization of highly non-Gaussian errors. We demonstrate its efficacy through numerical experiments with the multi-scale Lorenz 96 system. The forecast model in the assimilation experiment is the single layer Lorenz 96, so that model errors arise from the unresolved high frequency fast variables. The proposed approach is compared to two benchmark methods in terms of the ability to recover the true model error structure and the impact on assimilation and forecast quality.
The remainder of this paper is structured as follows. In Section 2, we discuss data assimilation methods and the Ensemble Transform Kalman Filter (ETKF), which is adopted as the assimilation algorithm in this study. Methods for estimating model uncertainty in partially observed systems are discussed in Section 3. The details of the proposed method are provided, along with a long window 4D-Var formulation and an ensemble analysis increment based method, both of which are adopted as benchmarks. In Section 4 we describe the numerical experiments with the multi-scale Lorenz 96' system. We conclude with a summary of the main outcomes and possibilities for future work in Section 5.

Data Assimilation Methods
The general problem setting considered in this study is described as follows. Suppose the system of interest can be represented by the following discrete time continuous state space equation: where −1 ∈ ℝ is the true state vector at time j − 1; ∶ ℝ → ℝ is a Markov Order 1 forecast model; and ∈ ℝ is an additive model error at time j capturing deficiencies in the forecast model M.
Noisy partial observations of the state x j are available, given by the following: where ∶ ℝ → ℝ is a N y × N x matrix consisting of 1's and 0's only (i.e., state components are either directly observed or not at all), ∈ ℝ is the vector of observations at time j, and ∈ ℝ is the observation noise at time j, assumed to be temporally uncorrelated Gaussian with zero mean and known covariance matrix ∈ ℝ × . In this study, we focus on the case N y < N x , that is, the state vector is partially observed. These observations are available at a coarser temporal resolution than the model forecast time step. Throughout the manuscript, the notation v [k] is used to refer to the kth element of some vector v; A [k, l] refers to the element at the kth row and lth column of some matrix A.

of 23
The aim of data assimilation is to optimally combine observations and prior information (usually from a numerical model, e.g., Equation 1) based on their respective uncertainties. We focus on ensemble based data assimilation methods due to their suitability for dealing with non-Gaussian model errors, which is a particular focus of the proposed method. The standard discrete time Kalman filter provides the optimal posterior (in the minimum variance sense) for the special case of linear forecast model and observation operator, and for zero mean temporally uncorrelated Gaussian process and observation noise. Ensemble Kalman Filter (EnKF) methods (amongst others) have been developed for high-dimensional systems where the full covariance matrix is too large to store in a computer, with an additional benefit for the more general case of non-linear and non-Gaussian problems encountered in many applications. The (ETKF; Bishop et al., 2001;Wang et al., 2004) has been widely adopted particularly in meteorological data assimilation due to its computational efficiency and accuracy in high dimensional systems with small ensemble sizes when localization is applied. We will therefore use this as the data assimilation method in our numerical experiments in Section 4.
The ETKF is an extension of the original EnKF proposed by Evensen (1994). It belongs to the class of ensemble square root filters which operate on the square root of the forecast and analysis error covariance rather than the full covariance matrices (Tippett et al., 2003;Vetra-Carvalho et al., 2018). Such methods use a deterministic transformation to map the forecast ensemble to the analysis ensemble, whose statistics are consistent with the Kalman filter update. As noted by Tippett et al. (2003), the linear transformation is not uniquely defined, and is the main distinguishing factor between different ensemble square root methods. Here we present the method of Wang et al. (2004), which is an updated version of the original ETKF proposed by Bishop et al. (2001) that ensures the filter is unbiased. A single cycle of the ETKF is summarized below.
A forecast ensemble at time j (denoted ) is generated by propagating the analysis ensemble from the previous time through Equation 3: where the superscripts f and a denote the forecast and analysis, respectively. The crucial strength of EnKFs is that one can avoid the explicit calculation of the state covariance matrices. In the ETKF, the update is achieved by writing the analysis ensemble deviation matrix ′ in terms of the forecast ensemble deviation matrix where ′ ∶= − ∈ ℝ × , is the ensemble mean, k is a vector of ones and ∈ ℝ × is a transformation matrix given by where U and Σ arise from the SVD of the scaled forecast ensemble observation deviation matrix W, that is, This approach transforms the computations to ensemble space which significantly reduces the required number of operations whenever n ≪ N y (as is typically the case in real world geophysical applications). Finally, the SVD of W is utilized to efficiently calculate the analysis ensemble mean : We will use this data assimilation method in our data assimilation experiments described in Section 4, after we have derived an expression for the model error distribution, as detailed in the next section.

Estimating Model Uncertainty
In the following, we propose a method for estimating model uncertainty in partially observed systems where knowledge of the unresolved processes is unavailable. The approach is specifically for use with Monte-Carlo based sequential filtering techniques such as Ensemble Kalman methods and Particle methods. Existing methods for accounting for model errors that are amenable to the partially observed setting are also discussed in Section 3.3. We first clarify some important notation that will be used from here onwards.

Notation
The notation 1 ∶ 2 is used to indicate the sequence of vectors { } = 1 , 1 +1,…, 2 . Unless otherwise stated, the subscript notation is reserved for the time index and the superscript bracket notation x (i) is used to indicate the ith iteration in an iterative optimization method. The bracket notation [i] is used to indicate the ith element of a vector or set. Similarly [i, j] is used to indicate the element in the ith row and jth column of a matrix and [i, ⋅] is used to indicate the ith row. The hat notation ^ is used to indicate an estimate of a variable. We also let S u and S o denote the set of indices of the unobserved and observed grid points, respectively. The shorthand notation v [S o ] is used to denote a vector whose ith entry is given by

Proposed Method
The proposed method utilizes a training period to obtain estimates of model errors using an optimization procedure and knowledge of some informative covariates (e.g., the state of the system). Estimates are generated by minimizing the conditional sum of squared deviations of the model errors given the covariates, constrained by available observations. These estimates are then used to build a conditional model error probability density using kernel density estimation, which allows for the characterization of potentially non-Gaussian features. In actual data assimilation experiments such as in Section 4 below, model errors are drawn from this conditional distribution. Since the density estimation is computed off-line the cost of incorporating uncertainty in this fashion is kept to a minimum.
The following assumptions are required for the method: 1. The system states are directly but partially observed, that is, H takes the form as described in Section 2 2. The additive error at time j and grid point k, η j [k], is dependent on some informative covariates captured in the matrix Z j of size N x × N c where N c is the number of covariates. For instance, if it is assumed that the error depends only on the state at the previous time and same location, then Z j = x j−1 . Other possibilities include if the error is expected to depend on the states in a neighborhood of the grid point, and if a longer temporal dependence on the states is expected 3. The magnitude of the measurement errors is small in comparison to the magnitude of the model errors, that is, ‖ɛ j ‖ ≪ ‖η j ‖ 4. Additive error statistics are the same in time and space, that is, The key advantages of the proposed approach are it (a) allows for the estimation of complex error structures with minimal apriori knowledge and partial observations; (b) requires no assumptions or specification of a parametric error distribution (e.g., Gaussian errors) and considers the full range of moments (not just bias and covariance); (c) computes all error statistics from data, without the need for numerical tuning; and (d) has sufficient flexibility to incorporate a range of covariates that influence error processes, which will generally be problem dependent.
The aforementioned assumptions could be relaxed for larger scale realistic applications. For instance, in many applications the model grid and observation locations do not align perfectly. Standard interpolation procedures are not likely to be problematic for Assumption (a), so long as the quantities are directly observed. The flexibility of Assumption (b) is a strength of the method. A brief exploration of the impact of Assumption (c) is given in Section 4.3.2, showing that this assumption can be relaxed to a large extent. As mentioned in the Introduction, Assumption (d) could be relaxed by dividing the study area into smaller groups based on certain physical characteristics so that assumption four is valid within each group. These group sizes has to be chosen such that the sample size is sufficiently large.

of 23
The methodology consists of two main steps and is discussed in detail for the remainder of this section.

Step 1. Offline Additive Error Estimation
Given a training period of length T time steps, the aim is to estimate the sequence of errors η 1:T under the assumptions stated above. To this end we solve a constrained optimization problem where the objective function is of conditional sum of squares type:̂1 subject to the constraints , given bŷ where K b is a kernel function with bandwidth b, both of which must be selected. Common choices for the kernel function could be a Gaussian, Uniform or Epanechnikov kernel. Such regularizers are also used in semi-supervised learning where they guide the learning method to find models that respect some underlying structure of the samples. The Levenberg-Marquardt (Levenberg, 1944;Marquardt, 1963) algorithm is used as the minimizer.
Optimizing η 1:T can be prohibitively expensive especially for large T and N x . We therefore use a sequential optimization technique over a sliding time window of length τ, as is also employed in Long window weak-constraint 4D-Var (Tremolet, 2006) and particle smoothing methods (Sarkka, 2013). For a given time t, initial condition estimate ̂− 1 and time window length τ, the optimization problem Equations 9-11 is restricted to j ∈ {t, t + 1, …, t + τ} instead of j ∈ {1, 2, …, T}. This process is then repeated by sliding the window of length τ forward one time step, so that η t+1:t+τ+1 is optimized, where the existing estimates from the previous optimization step are used as an initial guess. The sliding window procedure allows one to avoid specifying the background error covariance matrix or including the initial condition ̂− 1 in the optimization (Tremolet, 2006) as is needed in standard 4D-Var.
The optimization window τ must not be so large that the optimization procedure is computationally infeasible, but large enough to ensure enough points to approximate the conditional variance. Furthermore, it should be large enough so that that inclusion of a new observation at the end of the time window does not influence the initial condition.
This step of estimating model errors given the observations in the training period is summarized in Algorithm 1 and 2. A pictorial representation of the optimization over a single time window is provided in Figure 1. It shows the estimated errors at various stages of the iterative minimization process for the numerical experiment considered in Section 4.

Step 2. Conditional PDF Estimation and Sample Generation
The resulting sample of additive error and states estimates ̂1 ∶ and ̂0 ∶ −1 from Algorithm 1 is now used to derive the conditional probability density for example, p (η j [k]|x j−1 [k]) for a given grid point k. Kernel conditional density estimation methods (Hall et al., 2004;Hyndman et al., 1996) are well suited to such a task, although they are generally data-intensive and suffer from the curse of dimensionality. However, they are sufficient for the class of problems considered herein where the locality assumption greatly reduces the dimension of the response variable and covariates. We adopt the method of Hayfield and Racine (2008) as implemented in the np package in R. For a set of N data points { , } =1∶ for covariate ∈ ℝ and response variable ∈ ℝ , a Kernel estimate of the conditional density is constructed as where K b is a user specified Kernel function with bandwidth b and b x and b y refer to the bandwidths selected for the covariates and response variable, respectively.
As mentioned earlier, it is also possible to include additional covariates that strongly influence the errors at the current time (for example, η j−1 [k] to capture serial dependence, as demonstrated in the numerical experiments in Section 4. The set of covariates is likely to be problem dependent; prior knowledge of the system is required to select them appropriately.

Benchmark Methods
The stochastic parameterization methods for multi-scale systems discussed in Section 1 (e.g., Arnold et al., 2013;Crommelin & Vanden-Eijnden, 2008;Kwasniok, 2012;Lu et al., 2017;Wilks, 2005) require knowledge of the sub-grid scale processes and/or fully observed resolved variables. These approaches are therefore inapplicable for the problem setting considered here. In the remainder of this section, two existing data assimilation based methods that are amenable to our problem setting are discussed. They are also adopted as benchmarks for comparison with the proposed approach.

B1 -Analysis Increment Based Method
Several researchers have investigated the potential of using analysis increments from a data assimilation run to characterize model errors, see for example, (Leith, 1978;Li et al., 2009;Mitchell & Carrassi, 2015). We adopt the recently proposed ETKF-TV of Mitchell and Carrassi (2015) as a representative method of such approaches (hereafter referred to as Method B1). Their method consists of estimating a Gaussian model error distribution by calculating the mean and covariance of the analysis increments over a so-called reanalysis period, via: where and refer to the ith ensemble member at time j obtained from the reanalysis assimilation run for the analysis and forecast respectively. Note Equations 14-16 are derived assuming the analysis interval length is the same in the reanalysis and experimental run (as is the case in this study). This model error distribution is then used to draw model error samples for the the actual data assimilation experiments. Since this estimate of the model error is corrupted by various sources of error (including from the data assimilation algorithm used to generate the analysis increments), the authors include a tuning parameter α leading to a model forecast of the form:

B2 -Error Estimation Using Long Window Weak Constraint 4D-Var
As discussed in Section 3.2, the proposed method for estimation of additive errors relies on ideas from Long window weak-constraint 4D-Var (Tremolet, 2006) to avoid specification of the background covariance matrix. However, it differs in the specification of the cost function, as the 4D-Var method provides the least squares solution for the model error control variable. The second benchmark (Method B2) is taken to be the same as the proposed approach, but with Step 1 (see Section 3.2.1) replaced by Long window weak constraint 4D-Var estimates for the model error. The probability density estimation ( Step 2) remains unchanged. It is worth noting that this is not exactly a "standard" method in its entirety, but is investigated to examine the benefit of the conditional sum of squared deviation minimization aspect of the proposed method. The long window weak constraint 4D-Var method is discussed below.
In variational data assimilation model errors are accounted for using weak constraint 4D-Var. In the formulation where the initial state and model errors are considered as control variables, this amounts to minimizing the following cost function over a time window of length τ (Tremolet, 2006): • window size, τ • initial state, ̂0 • total time series length, T • initial guess for errors on unobserved variables, where x b is the background estimate of the initial state x 0 ; B is the background error covariance matrix associated with x b ; Q is the model error covariance matrix; and x j = M(x j−1 ) + η j for j ∈ {1, …τ}. The assimilation cycle is repeated by then considering the next assimilation window {τ + 1, 2τ}. However, in the long window approach, minimization is performed by shifting the interval by one observation interval rather than the full assimilation window of length τ. This allows one to neglect the background term J B from the cost function after a suitable 1 1: warm up period. The estimate x b would have already converged due to the many iterations of the minimization algorithm from the overlapping windows, meaning its uncertainty is negligible in comparison to the other terms.
The window length should also be chosen to be sufficiently long, such that the inclusion of a new observation at the end of the time window does not affect the initial state (this is relevant for the proposed approach also).
In summary, the B2 method is defined as being the same as the proposed approach, but with the cost function in Equation 9 replaced by minimization of the following cost function for any given time t: where x j = M(x j−1 ) + η j for j ∈ {t, …t + τ} and the initial condition is given by ̂− 1 . The estimated model error distribution is derived in the same way as for the proposed method, as detailed below.
The Levenberg-Marquardt algorithm is again used as the minimizer, for the sake of comparison with the proposed approach. Notice that unlike the proposed approach, the errors in the entire state vector (not just unobserved states) must be optimized.

Multi-Scale Lorenz 96
Here we investigate the efficacy of the proposed method and benchmarks discussed in Section 3 through synthetic experiments using the multi-scale Lorenz 96 model. This system has been used extensively as a toy model of the atmosphere to test new algorithms and to study model errors due to unresolved sub-grid processes. It consists of a coupled system of equations representing the evolution of an atmospheric quantity discretized over a latitude circle at different scales: The { [ ]} =1 variables represent quantities evolving continuously in time on a coarse spatial scale with low-frequency large amplitude fluctuations, where the subscript k refers to the kth grid point on the latitude circle. Each X [k] variable is coupled to N z small-scale variables V [l, k] that are characterized by a high frequency and relatively small amplitude evolution. The variables are driven by a quadratic term that models advection, a linear damping, constant forcing (F) and coupling terms that link the two scales. The system is subject to periodic boundary conditions, so that X The effect of the unresolved fast variables on the slow variables is denoted by the so-called sub-grid tendency U [k]: we use the formulation of the Lorenz 96 Equations 21-22 as provided in (Fatkullin & Vanden-Eijnden, 2004) which makes the time-scale separation between the slow and fast variables (measured by ξ) explicit. Note that this formulation is equivalent to the system originally proposed by Lorenz with the following parameter conversions: = 1 where c = time scale ratio; = − 2 where b = spatial scale ratio and h is the coupling constant; and h z = h.
The behavior of the system can vary considerably depending on the values assigned to the parameters in Equations 21-22. We consider two dynamical regimes to study the robustness of the proposed approach to different model error structures, summarized in Table 1. We first consider a case with large time scale separation (ξ ≈ 0.008) studied by Fatkullin and Vanden-Eijnden (2004). The sub-grid tendency has a complex non-linear dependence on the resolved variable, making it of interest to this study. However, such large time scale separations are not representative of the real atmosphere. We therefore consider a second case that has a much smaller time scale separation (ξ ≈ 0.7) and stronger coupling (h x = −2) than cases considered in previous studies (where ξ is typically 0.4 or 0.5 and h x = 1; e.g., Arnold et al., 2013;Crommelin & Vanden-Eijnden, 2008;Lu et al., 2017). Smaller values of ξ (i.e., larger time scale separation) are generally considered more difficult to parameterize, and the larger magnitude of the coupling term amplifies the effect of model errors. In both case studies, the dynamics are chaotic and give rise to complex non-Gaussian conditional error densities, as shown by the variation of U [k] with X [k] in Figure 2.

Experimental Setup
The available forecast model is the single scale Lorenz 96 model Equation 24, where the forcing term is known perfectly but knowledge of the sub-grid processes V [l, k] is unavailable: our aim is to first characterize the uncertainty in model simulations due to missing physics that is, the subgrid term in Equation 23, where the resolved variables are partially observed. Then we study the effects of uncertainty characterization on forecasts and assimilation.

Training Period
A truth run for the training period was first generated by numerically integrating the full multi-scale system Equations 21-22 using a fourth-order Runge-Kutta scheme with time step Δt = 0.0008. Similar to (Arnold et al., 2013), we use MTU to denote model time units, where 1MTU = 1 Δ . Partial observations of the resolved slow variables were then developed by perturbing the true values with zero mean, temporally and spatially uncorrelated Gaussian noise:  Table 1), x j is the true state at time j where x j [k] is equivalent to the time discretized value of X k in Equation 21, and = 10 −7 where is the identity matrix of size N y . R was chosen such that measurement errors are negligible in comparison to model errors.
The resolved variables are partially observed in space in all experiments (approx. 50% observed, see Table 1), and observations are available at 0.02 and 0.04 MTU for Case Studies 1 and 2 respectively. Based on the work of Lorenz (2006), this corresponds to an observation interval of 2.5 and 5 hr respectively (1 MTU is approximately equivalent to 5 days). These interval lengths where chosen to reflect a realistic observation network whilst also maintaining complex non-Gaussian error structures.
The proposed approach was then applied to the training data. The forecast model Equation 24 was integrated using a time step of Δt = 8 × 10 −4 . We estimate the following probability densities: Note the inclusion of the past value of the model error in Case Study 2, which is related to the presence of time correlations in errors in the Lorenz 96 system (as identified by e.g., Arnold et al. (2013)). In a real system, such choices would be informed by expert knowledge of the error processes.
Window lengths equal to τ = 25 and 50 observation intervals were selected for Case Study 1 and 2, respectively. This was sufficiently long to capture a range of dynamical states and also longer than the system decorrelation time, so that the sliding window approach can be utilized to ignore the background term in the cost function (as discussed in Section 3.3.2). To ensure temporal independence the data for the nonparametric conditional density estimation was generated by sampling the estimated error and states at an interval of 0.3 MTU, where autocorrelation is approximately zero. A Gaussian Kernel function was adopted throughout using the data-driven bandwidth estimation procedure as detailed in (Hayfield & Racine, 2008). The np package in R was used for the bandwidth estimation and the in-built Levenberg-Marquardt algorithm in Matlab was used for optimization. To avoid issues related to bandwidth specification and data sparsity in high dimensions, outlier points in the covariate space were removed from the data used for density estimation in Case Study 2.
This training data was also used in the benchmark methods. For method B2 we used the same window length and density estimation algorithm as for the proposed approach. The process error covariance matrix Q was estimated by calculating the sample covariance of the true errors over the training period. For the B1 method, the inflation parameter used in the ETKF which provides the analysis increments was tuned based on the analysis Root Mean Squared Error (RMSE), whilst the correction factor α (see Equation 17) was selected by evaluating the spread versus RMSE relationships, as it has a greater impact on ensemble spread than accuracy. The optimal value was found to be 0.8; the fact that it is less than one is because the model error spread is overestimated due to the inability of the method to resolve the complicated non-Gaussian error structure (see Figure 3). Localization was not required due to the large ensemble size (n = 1000) relative to the state dimension.

Assimilation Period
Model errors generated using the proposed method and two benchmarks were then assessed in assimilation experiments using the ETKF. The forecast model in the assimilation experiment was also the single scale Lorenz 13 of 23 96 Equation 24; spatio-temporal observation frequency was the same and observations were generated also using Equation 25. Assimilation was undertaken for 30 independent runs of length 100 observation intervals with independent initial conditions. Truth runs were first generated using the same approach as for the training period. The initial conditions were generated by selecting 30 values on the attractor at intervals of 12,500 time steps, which is sufficient to ensure that the autocorrelation in the resolved variables is close to zero (also adopted by Arnold et al. (2013)). Perfect initial conditions were adopted in all experiments as the focus is on the effects of model error. Similarly, a large ensemble size (n = 1000) was adopted to minimize the effects of sampling error and to avoid the use of localization methods. Furthermore, we can avoid the use of inflation for the assimilation experiments with the model errors generated from the proposed method.

Model Error Estimation
In both case studies, the proposed approach recovers the true error estimates from partial observations more accurately than the benchmark methods. This is demonstrated qualitatively in Figure 3, which shows the sample set of additive errors η j [k] against the spatial covariates (resolved variable at the previous observation time, x j−1 [k] for Case Study one; and x j−1 [k − 1] and x j−1 [k] for Case Study 2). In Case Study 1, the B2 method manages to at least partially recover the non-linear relationship between η j [k] and x j−1 [k], but is less precise than estimates from the proposed method (compare Figures 3b and 3c). In Case Study 2, it more closely reflects the true error structure, although an overestimation and underestimation of error values is apparent in key regions of the covariate space (compare Figures 3e-3g). Method B1 produces poor quality error estimates in both case studies; errors are grossly overestimated and the dependence structure between the errors and covariates is poorly represented.
The model error estimation techniques considered here can also be considered as stochastic parameterizations of the sub-grid dynamics. The ability of the methods to replicate key characteristics of the full 2-scale Lorenz 96 model when used in this manner is also assessed. For each case the single-layer Lorenz 96 system is run for 10 5 time steps with a Δt = 8 × 10 −4 , adding draws from the model error pdfs at the observation intervals used to construct these pdfs, that is, 0.02 and 0.04 MTU for Case Study 1 and 2, respectively. We calculate the autocorrelation function of X k , the cross-correlation function between X [k] and X [k + 1], and the marginal probability density of X [k] (see Figure 4). The correlation functions approximate the dynamical transitions of the slow variables whilst the marginal probability density approximates the invariant measure. Again, Method B1 performs poorly in all aspects, particularly in Case Study 2 where temporal correlations are not reproduced, meaning that the dynamical transitions are poorly represented. The results are similar to those from using inflation and localization only, in case studies with a similar time scale separation (e.g., Lu et al., 2017). Improvements of the proposed method over Method B2 are more distinct in Case Study 1 than in Case Study 2, consistent with the greater similarity in error estimates in this case study (see Figure 3). The Proposed Method reproduces all three features relatively accurately in both case studies, and even compares favorably with other methods that rely on data of the sub-grid processes (cf. Figures 5-7 in Crommelin and Vanden-Eijnden (2008) and Figure 1 in Lu et al. (2017)).
The superior performance of the proposed method is attributed to two aspects (a) the formulation of the cost function which aims to minimize the conditional sum of squared deviations of the estimated errors; and (b) optimization of errors over a time window (as is performed in traditional 4D-Var and smoothing methods). First, minimizing the conditional sum of squared deviations of the errors allows one to estimate more complex state dependent error structures, as opposed to the 4D-Var type approach in Method B2 where dependence information is not taken into account. The error estimates from Method B2 give J Q terms (see Equation 20) that are most often lower than J Q of the true data (see Figure 5), meaning that estimates are obtained by minimizing an inappropriate cost function for this setting. Furthermore, the proposed approach has the added benefit of avoiding the specification of a model noise covariance matrix, which is needed in Method B2.
Second, optimization over a time window allows one to more effectively constrain the range of possible errors in the partially observed setting, particularly when errors are time correlated. This partly explains the poor performance of Method B1, which is based on increments from a filter). Furthermore, the error estimates from Method B1 are heavily influenced by the quality of the assimilation algorithm (ETKF with inflation). Poorly specified prior uncertainty in the unobserved variables from the inflation procedure can lead to large incremental updates in observed variables at future times. This ultimately corrupts the error estimates, as the increments are now dominated by initial condition errors in the unobserved variables.

Model Error Estimation With Non-Negligble Observation Error
In the aforementioned experiment, negligible observation errors ( = 10 −7 ) were considered, consistent with assumption 3. This assumption is clearly a limitation for real world applications, and future work will examine how this assumption can be relaxed. As a first step in this direction, we examined the robustness of the procedure by repeating the error estimation procedure described in Section 3.2 for Case Study 1, but with larger observation error = 2 × 10 −5 . With this choice, the model error standard deviation is approximately 3 times the observation error standard deviation. Figure 6 shows that although error estimates from the Proposed method are not as precise as in the previous experiment, they still capture the underlying error structure more effectively than method B2. Both are superior to method B1 even in the presence of negligible observation error (cf. Figure 3d). In cases where the observation error variance is of similar or greater magnitude than the model error variance, the performance of the Proposed method will degrade because of the hard constraint in Equation 10. Future research will involve developing methods to deal with this scenario whilst maintaining the flexibility of being able to detect non-Gaussian error structures.

Forecast Skill
The superior error estimates from the proposed approach leads to improved forecasts compared to the benchmark methods. Representative results of one-step-ahead forecasts for both case studies are provided in Figure 7 and Figure 8. They show the relative histograms of the ensemble anomalies (forecast -truth) for both an observed (left  column) and unobserved (right column) variable for a single assimilation run of 100 cycles. In both case studies, relatively large systematic errors can be seen when using Method B1 compared to the other approaches, which is unsurprising given the results in Figure 3. One step ahead forecasts of the observed variables are relatively similar between the proposed method and the B2 method, although the forecast variance is considerably lower, particularly in Case Study 1. This is a direct consequence of the more precise additive model error estimates obtained from the proposed approach.
Resolving bimodality of the transition errors allows one to generate more accurate analyses (hence initial conditions for subsequent time steps) and forecasts, even in an EnKF setting. This is demonstrated in Figure 9 where initial conditions and forecasts in Method B2 have greater variance than in the proposed method, as it is unable to precisely resolve the two modes of the error density. Differences between forecasts from the proposed and B2 method are much more pronounced for the hidden variables, where both bias and variance are much lower when using the proposed method in both case studies. The conditional sum of squared deviations minimization procedure allows for a more accurate representation of the spatial dependence structure. This means that information from observed variables is more effectively transferred to unobserved variables during the assimilation, thereby contributing to the improved forecasts seen for the Proposed Method compared to Method B2, through better initial conditions. Forecast skill is assessed quantitatively in the remainder of this section. for an observed variable (left column) and unobserved variable (right column) for the different methods for Case Study 1. Forecasts are one-step-ahead (in this case, 0.02 MTU). for an observed variable (left column) and unobserved variable (right column) for the different methods for Case Study 2. Forecasts are one-step-ahead (in this case, 0.04 MTU). A range of forecast metrics were considered to quantify forecast properties including reliability, resolution, accuracy and consistency. Reliability and resolution were quantified using the Continuous Ranked Probability Score (CRPS) and the (negative) Logarithmic Score (LS) given in Equations 26-28: where ( ) is the empirical cumulative distribution function of the forecast of variable y at time j; ( ) is the cumulative distribution function of the observations of y at time j; and ( = ) indicates the value of the forecast probability density function, evaluated at the observation value. For cases where only a single observation of y is available at each time, the Heaviside step function is used to characterize the cumulative distribution function of the observation (see Equation 27).
The CRPS is routinely adopted in forecasting studies, although it can be a poor statistic for complex forecast probability densities (see for example Smith et al. (2015) who showed that the CRPS can give misleadingly good scores to outcomes that fall in between two modes of a bimodal forecast density). Hence, the LS is also calculated, although it has the drawback of heavily penalizing forecasts in which the outcome falls outside the forecast range. Accuracy is measured by the RMSE, which is evaluated on the ensemble mean.
Statistical consistency is characterized using RMS Error versus RMS Spread diagnostic plots, which has been adopted in similar studies (see e.g., Arnold et al., 2013). Ensemble forecasts are considered statistically consistent if the expected ensemble variance equals the expected squared ensemble mean error (assuming unbiasedness and a large enough ensemble size). We separate forecasts into 10 equally populated bins according to their forecast variance, and the mean square spread and mean square error are calculated for each bin prior to taking the square root.
We used the forecast skill score (FSS) to quantify the relative improvement of the proposed approach over the benchmark methods, defined as: where Score Pr indicates the forecast score of the proposed method; Score Be indicates the forecast score of the reference method (i.e., Method B1 or B2); and Score Pe indicates the score associated to a perfect forecast (e.g., a perfect forecast has FSS = 1). A skill score of 0.5 means that the proposed approach provides a 50% improvement over the benchmark, whilst a negative score indicates a degradation in performance.
Overall, the proposed method was found to outperform the benchmark methods in all forecast metrics considered across a range of lead times, in both case studies. This is demonstrated in Figure 10 which shows the space and time averaged forecast score against lead time. Forecasts from the proposed approach have better reliability, resolution and accuracy scores than the benchmarks, and are significantly more skilful at longer lead times (e.g., 0.6 MTU, or approximately 3 days). The observed improvements are robust to different dynamical regimes, as indicated in Figure 11 which shows the mean and standard deviation of skill scores computed over the 30 independent simulations. Relative improvements are greatest when comparing to Method B1, where the proposed approach offers a 70% improvement on average based on the RMSE and CRPS, although a sizable improvement of 30% is still apparent when comparing to Method B2 in Case Study 2 (see Figure 11). Forecast ensembles from the proposed approach also have better consistency properties, as shown by the RMS Error versus RMS Spread diagnostic plots (Figure 12) where the points lie closer to the diagonal in the proposed approach.

Conclusions
Characterizing model error is critical to ensure ensemble Data Assimilation methods produce high quality forecasts and analyses. Accounting for model errors due to unresolved scales is particularly of interest in weather and climate modeling. Numerous stochastic parameterization methods have been proposed for this purpose, although such methods generally rely on data or knowledge of the sub-grid scale processes and/or require observations of all resolved variables. We develop a method that is suited to the more realistic condition where the resolved variables are only partially observed and knowledge of the sub-grid processes is unavailable. It allows for the estimation of complex error structures which depend on known covariates (e.g., state); requires no assumptions or specification or a parametric error distribution (e.g., Gaussian errors); considers the full range of statistical moments (not just bias and covariance); and avoids the need for numerical tuning typical of inflation and localization methods.
The efficacy of the method is demonstrated through numerical experiments on the multi-scale Lorenz 96 model. Comparisons are made to two existing methods that use data assimilation to estimate model errors offline, as these are amenable to the partially observed setting: (a) where the errors are assumed to be Gaussian with mean and covariance estimated from a sample of analysis increments; and (b) where model errors are estimated using long window weak constraint 4D-Var. The proposed approach is shown to recover model errors more precisely than the benchmark methods, thereby making it a more effective parameterization of the sub-grid processes. It is also particularly useful for cases with highly non-Gaussian errors, as considered in this study. Assimilation experiments with the ETKF show that the proposed approach leads to improved forecasts in terms of accuracy, reliability, resolution and consistency. The conditional sum of squares minimization procedure in the proposed method also allows complex error structures to be estimated more precisely than with the least squares type 4D-Var approach. The advantages of accounting for complex state dependent error relationships are also clearly demonstrated by the considerably poorer performance of the constant mean and covariance Gaussian error method.
The proposed method is suited to multi-scale systems where a locality and homogeneity assumption can be made, that is, where errors are influenced by neighboring states instead of the full state vector and the error statistics are the same at each location in space or in parts of the state space with similar dynamics. These assumptions help regularize the ill-posed problem of estimating model errors from partial observations. Future work will investigate systems where such assumptions are inapplicable, although it is expected that other simplifying assumptions would be needed. Finally, the method was applied to a case with negligible observation error, with some Figure 11. Forecast skill scores against lead time for both case studies. Skill scores are first averaged across space (i.e., over all k variables) and time within each independent simulation. The average of all such values over the 30 independent simulations is shown in the plot (square and triangle markers), as well as the standard deviation. More positive skill scores indicate greater relative improvement of the Proposed method compared to the benchmark method.
preliminary work including more prominent observation error. Subsequent work will consider the more complex case of estimating model errors from noisy observations.

Data Availability Statement
The implementation of the proposed method and benchmarks on the Lorenz 96 application detailed in Section 4 can be accessed at https://doi.org/10.5281/zenodo.5820227.