Foundations for Universal Non‐Gaussian Data Assimilation

In many applications of data assimilation, especially when the size of the problem is large, a substantial assumption is made: all variables are well-described by Gaussian error statistics. This assumption has the advantage of making calculations considerably simpler, but it is often not valid, leading to biases in forecasts or, even worse, unphysical predictions. We propose a simple but effective way of replacing this assumption by making use of transforming functions, while remaining consistent with Bayes' theorem. This method allows the errors to have any value of skewness and kurtosis, and permits physical bounds for the variables. As such, the error distribution can conform better to the underlying statistics, reducing biases introduced by the Gaussian assumption. We apply this framework to a 3D variational data assimilation method, and find improved performance in a simple atmospheric toy model (Lorenz-63), compared to an all-Gaussian technique.


Here, we provide a general framework for non-Gaussian data assimilation, in which the error statistics can be chosen freely. In essence, the proposed technique can match the first four moments of any probability density function. It is based on the so-called method of translation (Edgeworth, 1898), generalized by Johnson (1949), which states that one can find a specific function that transforms a variable to an approximately Gaussian variable. This has an obvious advantage: one can retain much of the Gaussian data assimilation foundations, while allowing for skewness and kurtosis of the original variables.
The proposed method has similarities with Gaussian anamorphosis (e.g., Amezcua & van Leeuwen, 2014; Bertino et al., 2003; Bocquet et al., 2010; Brankart et al., 2012), a pre-processing technique that transforms variables to an approximately Gaussian space before performing the analysis step. By doing this, one can avoid making unphysical predictions, but it can introduce biases in the analysis. Crucially, approaches using Gaussian anamorphosis evaluate the median of the posterior distribution, not the maximum a posteriori estimate (Fletcher & Zupanski, 2007). This is because the transformation only preserves the quantiles of the probability density function, not its moments. As such, the median of the Gaussian distribution is mapped to the median of the original distribution. Because the mode, median, and mean coincide for a Gaussian distribution, they all get mapped to the median of the original one, which is not necessarily a good estimate for the mode of the posterior, especially when the distribution is highly skewed. Instead, our method directly evaluates the most likely state of the underlying distribution. This reduces biases in the analysis, leading to improvement in the performance of data assimilation systems with various non-Gaussian variables.

Johnson Family of Distributions
In developing new probability density functions that are able to describe the statistics of observed quantities, Johnson (1949) proposed a family of distributions that is able to conform to any value of the mean, variance, skewness, and kurtosis of the sample. It is based on the idea that, if x is a random Johnson variable, then f[(x − ξ)/λ] is a random Gaussian variable with mean μ and standard deviation σ. This defines the (univariate) probability density function

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma\lambda}\, f'\!\left(\frac{x-\xi}{\lambda}\right) \exp\!\left[-\frac{1}{2\sigma^{2}}\left(f\!\left(\frac{x-\xi}{\lambda}\right)-\mu\right)^{2}\right]. (1)

The transform f should obey a few simple rules: it is (a) continuous, (b) continuously differentiable, (c) strictly monotonic, (d) invertible, and (e) it ranges from −∞ to +∞. According to Johnson (1949), the last rule is not strictly necessary, but highly desirable. It might be possible to relax this rule to obtain finite probabilities at the boundaries of the domain, but we do not consider that situation in the current work. In general, this distribution is controlled by four parameters: ξ, λ, μ, and σ.
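As a concrete illustration of Equation 1 (our own sketch, not code from the paper), the density can be evaluated numerically for any transform satisfying the rules above; here we assume the semi-bounded transform f(u) = log u and purely illustrative parameter values.

```python
import numpy as np

def johnson_pdf(x, f, f_prime, xi, lam, mu, sigma):
    """Evaluate the univariate Johnson-type density of Equation 1 for a
    generic transform f with derivative f_prime (a sketch)."""
    u = (x - xi) / lam
    z = f(u)
    norm = f_prime(u) / (np.sqrt(2.0 * np.pi) * sigma * lam)
    return norm * np.exp(-0.5 * ((z - mu) / sigma) ** 2)

# Example with the semi-bounded transform f(u) = log(u);
# the parameter values are purely illustrative.
x = np.linspace(0.01, 5.0, 200)
p = johnson_pdf(x, np.log, lambda u: 1.0 / u, xi=0.0, lam=1.0, mu=0.0, sigma=0.5)
print(p.max())
```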
Although the definition of Equation 1 is very broadly applicable, Johnson explored four explicit candidates: f: u ↦ log[u/(1 − u)] (bounded), f: u ↦ log u (semi-bounded), f: u ↦ sinh⁻¹ u (unbounded), and f: u ↦ u (Gaussian). The probability density functions corresponding to these transforms are shown in Figures 1a-1d. These four transforms have a broad applicability range; in particular, together they span the whole skewness-kurtosis domain, as shown in Figure 1e, and can thus adjust to the first four moments of any distribution. However, these are not the only possibilities, and they do not necessarily describe any given atmospheric variable well. For example, functions exist that are known to transform humidity to a nearly Gaussian variable (Hólm et al., 2002; Ingleby et al., 2013), and might be good candidates. Moreover, other well-known probability distributions, such as the Gamma distribution, can take the form of Equation 1, by equating the cumulative distribution functions and solving the resulting equation.
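For reference, the four transforms, their derivatives, and their inverses can be collected as follows; this is our own notation and a sketch only, acting on the scaled variable u = (x − ξ)/λ.

```python
import numpy as np

# The four Johnson transforms f, their derivatives f', and inverses f^{-1},
# acting on the scaled variable u = (x - xi) / lambda (a sketch).
TRANSFORMS = {
    "bounded":      (lambda u: np.log(u / (1.0 - u)),
                     lambda u: 1.0 / (u * (1.0 - u)),
                     lambda z: 1.0 / (1.0 + np.exp(-z))),
    "semi_bounded": (np.log,
                     lambda u: 1.0 / u,
                     np.exp),
    "unbounded":    (np.arcsinh,
                     lambda u: 1.0 / np.sqrt(1.0 + u ** 2),
                     np.sinh),
    "gaussian":     (lambda u: u,
                     lambda u: np.ones_like(u),
                     lambda z: z),
}
```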
Note that the semi-bounded Johnson distribution is equivalent to the lognormal distribution when λ > 0 and to the reverse lognormal distribution when λ < 0. These distributions have already been successfully used in different data assimilation settings, such as variational techniques (Fletcher & Zupanski, 2006, 2007; Kliewer et al., 2016), Kalman filters (Fletcher et al., 2023; Van Loon & Fletcher, 2023), and ensemble methods. A major advantage of the bounded and unbounded distributions is that their skewness can change from positive to negative by changing μ, see Figures 1a and 1c. This is a particularly crucial property for variables that have two physical bounds; for example, when relative humidity is low, its distribution tends to be positively skewed, while the opposite is true for high relative humidity (Ingleby et al., 2013). By interpreting relative humidity as a bounded Johnson variable, this change in skewness is inherently described.

A natural question that arises is how to identify the best transform for a specific atmospheric variable, and how to estimate the four parameters that describe its distribution. A 3DVAR data assimilation system, such as described in the next section, finds the optimal solution for one of those parameters (μ) by approximating the mode of the posterior distribution, while keeping the other parameters fixed. Methods based on the Kalman filter produce an estimate of the (co)variance as well (tuned by σ), so only two additional parameters need to be set (ξ and λ).
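To illustrate the sign change of the skewness with μ noted above, a quick Monte Carlo check for the bounded transform can be made (our own sketch, with illustrative parameter values):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

def sample_bounded_johnson(mu, sigma, xi=0.0, lam=1.0, n=100_000):
    # Draw z ~ N(mu, sigma^2) and map back through the inverse bounded transform.
    z = rng.normal(mu, sigma, n)
    return xi + lam / (1.0 + np.exp(-z))

for mu in (-1.0, 0.0, 1.0):
    print(mu, skew(sample_bounded_johnson(mu, sigma=0.5)))
```

The sample skewness is positive for negative μ and negative for positive μ, consistent with the behavior described above.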
In some cases, ξ and λ can be chosen based on the physical properties of the variable. For example, if relative humidity is well-described by a bounded Johnson distribution, then ξ and λ can be tuned to fix the lower bound to zero and the upper bound to supersaturation. In other cases, the choice is less obvious; variables described by an unbounded Johnson distribution do not necessarily have physical properties that set ξ and λ, but they can be based on a climatology (e.g., from a sample mean and standard deviation).
A climatology could also be used to estimate the best transforming function for a certain variable, by investigating its distribution from past data or free model runs. Alternatively, an ensemble method could be used to estimate the distribution from the ensemble at each assimilation time, which would allow the distribution to change over time, but might be computationally expensive. However, even if the actual transform is not known, we expect to see improvements for atmospheric variables that are clearly non-Gaussian, because the four Johnson transforms span the entire skewness-kurtosis plane (see Figure 1e), and can thus conform to the first four (finite) moments of any distribution. We leave the determination of optimal transforms for future work, and focus instead on the development of a non-Gaussian data assimilation technique for a general transform, provided it satisfies the constraints outlined above.
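As a minimal sketch of the climatological approach for an unbounded variable (our illustration, not the paper's procedure), one could set ξ and λ from the sample mean and standard deviation and then estimate μ and σ from the transformed sample:

```python
import numpy as np

def fit_unbounded_params(sample):
    """Hypothetical moment-based fit for the unbounded transform f(u) = arcsinh(u):
    xi and lambda standardize the variable, mu and sigma describe the transformed sample."""
    xi, lam = np.mean(sample), np.std(sample)
    z = np.arcsinh((sample - xi) / lam)
    return xi, lam, np.mean(z), np.std(z)
```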
An important property of the Gaussian distribution is that the sum of two independent Gaussian random variables is again a Gaussian random variable. For Johnson variables, this can be generalized as follows: if X and Y are independent Johnson random variables with transform f, that is, if f(X) ∼ N(μ_X, σ_X²) and f(Y) ∼ N(μ_Y, σ_Y²), then the star product of X and Y is a Johnson random variable with the same transform, that is,

f(X \star Y) \sim \mathcal{N}\!\left(\mu_X + \mu_Y,\; \sigma_X^{2} + \sigma_Y^{2}\right). (2)

Here, the star product is defined as

x \star y = f^{-1}\big(f(x) + f(y)\big). (3)

For example, if f: u ↦ u (Gaussian transform), then x ⋆ y = x + y, such that the additive property of Gaussian variables is recovered. If instead f: u ↦ log u (semi-bounded transform), then x ⋆ y = xy, verifying that the product of two lognormal random variables is again a lognormal random variable. See Appendix A for a formal proof of this statement.
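A small numerical check of the star product (Equation 3) for the two special cases quoted above (our own sketch):

```python
import numpy as np

def star_product(x, y, f, f_inv):
    # x (star) y = f^{-1}(f(x) + f(y)), acting point-wise (Equation 3).
    return f_inv(f(x) + f(y))

x, y = 2.0, 3.0
print(star_product(x, y, lambda u: u, lambda z: z))   # Gaussian transform: 5.0 = x + y
print(star_product(x, y, np.log, np.exp))             # semi-bounded transform: 6.0 = x * y
```

Any other admissible transform can be passed to the same function unchanged.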
In order to use the Johnson distribution in a data assimilation setting, a multivariate form is necessary. The univariate probability distribution (Equation 1) can easily be extended as

p(\mathbf{x}) = \frac{1}{(2\pi)^{N/2}\,|\boldsymbol{\Sigma}|^{1/2}} \left[\prod_{i=1}^{N} \frac{\partial f_i}{\partial x_i}\right] \exp\!\left[-\frac{1}{2}\big(\mathbf{f}(\mathbf{x})-\boldsymbol{\mu}\big)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\big(\mathbf{f}(\mathbf{x})-\boldsymbol{\mu}\big)\right], (4)

where the transform f_i(x_i) now acts on every element separately, such that ∂f_i/∂x_j = 0 for j ≠ i. Importantly, each separate f_i can be a different transform, allowing for different distributions for individual variables. In this definition, x is a vector of size N, μ is the expected value of f(x), and Σ is the covariance matrix of f(x). It should be noted that Σ is not the typical covariance matrix, in the sense that it does not describe the covariances of the random variable x, but instead those of f(x). Changing Σ not only changes the variance, but also the mean, mode, skewness, and kurtosis of the distribution. However, we keep the terminology of "covariance matrix" to be consistent with other, Gaussian-based, data assimilation techniques.
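Assuming the reconstruction of Equation 4 above, its negative log can be evaluated with element-wise transforms and a diagonal Jacobian; the sketch below uses our own function and variable names.

```python
import numpy as np

def neg_log_johnson(x, f_list, fprime_list, mu, Sigma):
    """-log p(x) for Equation 4 (up to an additive constant), where each state
    variable x_i has its own transform f_i with derivative f_i' (a sketch)."""
    z = np.array([f(xi) for f, xi in zip(f_list, x)])
    jac = np.array([fp(xi) for fp, xi in zip(fprime_list, x)])
    d = z - mu
    quad = 0.5 * d @ np.linalg.solve(Sigma, d)
    return quad - np.sum(np.log(jac)) + 0.5 * np.log(np.linalg.det(Sigma))
```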

Variational Data Assimilation
At the heart of most data assimilation techniques lies Bayes' theorem, which relates the probability of the system being in a certain state, given some observations (the posterior), to the likelihood of making those observations and the prior knowledge of the current state (the background). Specifically, assuming background and observational errors are independent, the posterior distribution must have the form

p_a(\mathbf{x}) \propto p_b(\mathbf{e}_b)\, p_o(\mathbf{e}_o), (5)

with e_b and e_o the background and observational errors, respectively. In order to use this expression, an assumption needs to be made on the distribution of the errors; here, we replace the usual Gaussian assumption with the assumption that the errors are Johnson random variables. Because the star product of two independent Johnson random variables is a Johnson random variable with the same transformation, the background error is always a Johnson random variable if the background x_b and true state x_t are Johnson random variables with the same transformation. Although this is not necessarily the case, the true state is never explicitly used (as it is unknown), but is instead approximated by the analysis state, defined in the same vector space as the background. In order to define the errors, we assume that the background and analysis involve the same transformation. The same assumption is made for the observational errors. With this in mind, the errors are defined as

\mathbf{e}_b = \mathbf{x} \oslash \mathbf{x}_b, \qquad \mathbf{e}_o = \mathbf{y} \oslash \mathbf{h}(\mathbf{x}), (6)

with y the observations, h the nonlinear observation operator relating the state variables to observed variables, and the star division ⊘ defined in a similar fashion as the star product, that is,

\mathbf{x} \oslash \mathbf{y} = \mathbf{f}^{-1}\big(\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{y})\big), (7)

acting point-wise on each component.
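The star division and the resulting error definitions (Equations 6 and 7) translate directly into code; the following sketch (our own names) applies them component-wise.

```python
import numpy as np

def star_divide(x, y, f, f_inv):
    # x (star-divide) y = f^{-1}(f(x) - f(y)), component-wise (Equation 7).
    return f_inv(f(x) - f(y))

def errors(x, x_b, y, h, f_b, f_b_inv, f_o, f_o_inv):
    # Background and observational errors of Equation 6 (a sketch).
    e_b = star_divide(x, x_b, f_b, f_b_inv)
    e_o = star_divide(y, h(x), f_o, f_o_inv)
    return e_b, e_o
```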
In a 3D variational data assimilation scheme (3DVAR), Bayes' theorem is used to find the most likely current state (the analysis x_a) given the observations and the background. This can be done by maximizing the posterior distribution, or, equivalently, minimizing the negative log of the posterior. Using Equation 4, with μ = 0 and Σ = B for the background and μ = 0 and Σ = R for the observational distributions, one can write the negative log of the posterior as

J(\mathbf{x}) = \frac{1}{2}\big[\mathbf{f}_b(\mathbf{x})-\mathbf{f}_b(\mathbf{x}_b)\big]^{\mathrm{T}} \mathbf{B}^{-1} \big[\mathbf{f}_b(\mathbf{x})-\mathbf{f}_b(\mathbf{x}_b)\big] - \sum_{j=1}^{N} \log f_{b,j}'(e_{b,j}) + \frac{1}{2}\big[\mathbf{f}_o(\mathbf{y})-\mathbf{f}_o(\mathbf{h}(\mathbf{x}))\big]^{\mathrm{T}} \mathbf{R}^{-1} \big[\mathbf{f}_o(\mathbf{y})-\mathbf{f}_o(\mathbf{h}(\mathbf{x}))\big] - \sum_{\nu=1}^{N_o} \log f_{o,\nu}'(e_{o,\nu}), (8)

where we have used that f_b(e_b) = f_b(x) − f_b(x_b) and f_o(e_o) = f_o(y) − f_o(h(x)). This defines the cost function J(x) that needs to be minimized to find the analysis state. Note that we allow for a different transformation to be used for the background (f_b) and observations (f_o). The background and observation error covariance matrices are denoted as B and R, respectively. The number of state variables is N, and the number of observed variables is N_o.
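Under the reconstruction of Equation 8 given above, a direct state-space evaluation of the cost function might look as follows; all function names are ours, and the transforms are assumed to act element-wise on vectors.

```python
import numpy as np

def cost(x, x_b, y, h, B, R, f_b, fprime_b, f_o, fprime_o, f_b_inv, f_o_inv):
    """Sketch of the non-Gaussian 3DVAR cost function (Equation 8)."""
    db = f_b(x) - f_b(x_b)          # transformed background innovation
    do = f_o(y) - f_o(h(x))         # transformed observation innovation
    e_b = f_b_inv(db)               # background error, x star-divided by x_b
    e_o = f_o_inv(do)               # observation error, y star-divided by h(x)
    J = 0.5 * db @ np.linalg.solve(B, db) + 0.5 * do @ np.linalg.solve(R, do)
    J -= np.sum(np.log(fprime_b(e_b))) + np.sum(np.log(fprime_o(e_o)))
    return J
```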
Readers familiar with the Gaussian 3DVAR cost function might recognize the first and third term in Equation 8, as they are similar in both frameworks. The second and fourth term, however, arise from the normalizing prefactor in the probability density function, Equation 4. In Gaussian anamorphosis, where the variables are transformed before the cost function is defined, these terms would not be present. Yet, they are crucial when one is interested in the most likely state, that is, the mode of the posterior distribution, found by minimizing Equation 8. The state that minimizes a cost function without this second and fourth term is instead the median of the posterior distribution (Fletcher & Zupanski, 2007), which can lead to biases, particularly when the errors are large (Fletcher et al., 2019). The extra terms inevitably introduce nonlinearities into the cost function, complicating the minimization algorithm. However, the monotonicity of f keeps the probability density functions unimodal, guaranteeing a single minimum of Equation 8 as long as the observation operator is monotonic as well.
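A one-dimensional lognormal example illustrates the role of these extra terms (our own check, not from the paper): with log x ∼ N(μ, σ²), dropping the Jacobian term yields the median exp(μ), while keeping it yields the mode exp(μ − σ²).

```python
import numpy as np
from scipy.optimize import minimize_scalar

mu, sigma = 1.0, 0.8
J_median = lambda x: 0.5 * ((np.log(x) - mu) / sigma) ** 2              # quadratic term only
J_mode = lambda x: 0.5 * ((np.log(x) - mu) / sigma) ** 2 + np.log(x)    # plus Jacobian term

for J, label in ((J_median, "median"), (J_mode, "mode")):
    res = minimize_scalar(J, bounds=(1e-6, 50.0), method="bounded")
    print(label, res.x)

print("exp(mu) =", np.exp(mu), " exp(mu - sigma^2) =", np.exp(mu - sigma ** 2))
```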
Solving the Bayesian problem (Equation 5) thus comes down to minimizing the cost function in Equation 8. This can be done directly, or by solving for the root of the gradient of the cost function, ∇_x J(x_a) = 0, defining the analysis state, as long as the Hessian is positive definite at that point. To compute this root, it is useful to first rewrite the cost function in terms of the transformed variables. That is, defining χ = f_b(x), χ_b = f_b(x_b), ψ = f_o(h(x)), and ψ_o = f_o(y), Equation 8 can be written as

J(\boldsymbol{\chi}) = \frac{1}{2}(\boldsymbol{\chi}-\boldsymbol{\chi}_b)^{\mathrm{T}} \mathbf{B}^{-1} (\boldsymbol{\chi}-\boldsymbol{\chi}_b) - \sum_{j=1}^{N} \log S_j(\chi_j) + \frac{1}{2}(\boldsymbol{\psi}_o-\boldsymbol{\psi})^{\mathrm{T}} \mathbf{R}^{-1} (\boldsymbol{\psi}_o-\boldsymbol{\psi}) - \sum_{\nu=1}^{N_o} \log S_{o,\nu}(\psi_\nu), (9)

where S_j and S_{o,ν} denote the Jacobian factors of Equation 8 expressed in terms of the transformed variables. Recalling the chain rule, ∇_x J(x) = (∇_x χ) ∇_χ J(χ), together with the properties that f_b is strictly monotonic and acts on every element separately (such that (∇_x χ)_{jj} = ∂χ_j/∂x_j = ∂f_{b,j}/∂x_j ≠ 0), the solution of ∇_x J(x_a) = 0 is equivalent to the solution of ∇_χ J(χ_a) = 0, with χ_a = f_b(x_a). Component-wise, the gradient of the transformed cost function is given by

\big[\nabla_{\boldsymbol{\chi}} J\big]_j = \big[\mathbf{B}^{-1}(\boldsymbol{\chi}-\boldsymbol{\chi}_b)\big]_j - F_j(\chi_j) - \sum_{\nu=1}^{N_o} \frac{\partial \psi_\nu}{\partial \chi_j} \Big(\big[\mathbf{R}^{-1}(\boldsymbol{\psi}_o-\boldsymbol{\psi})\big]_\nu + F_{o,\nu}(\psi_\nu)\Big), (10)

with F_j(χ_j) = (∂S_j/∂χ_j)/S_j(χ_j) and F_o,ν(ψ_ν) = (∂S_o,ν/∂ψ_ν)/S_o,ν(ψ_ν). Solving for the analysis state in the transformed space has a few advantages. For one, it makes the background part of the cost function (approximately) quadratic, such that the computation can be done efficiently. Moreover, the (inverse) transforms have to be evaluated less often, keeping the additional computational cost to a minimum. Finally, the range of χ is always infinite, and thus an unbounded solver can be used, without the worry of predicting unphysical states.
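The sketch below solves in the transformed space with an off-the-shelf unbounded minimizer, assuming direct observations (h(x) = x, so ψ = χ), a common transform for background and observations, and our reconstruction of Equation 9; the functions log_S and log_S_o must be supplied for the chosen transforms, and all names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def analysis_transformed(chi_b, psi_o, B, R, log_S, log_S_o, f_b_inv):
    """Minimize the transformed cost function (Equation 9) and map back (a sketch).
    Assumes h(x) = x so that psi = chi; log_S and log_S_o return element-wise log S."""
    def J(chi):
        db = chi - chi_b
        do = psi_o - chi
        return (0.5 * db @ np.linalg.solve(B, db)
                + 0.5 * do @ np.linalg.solve(R, do)
                - np.sum(log_S(chi)) - np.sum(log_S_o(chi)))
    res = minimize(J, x0=chi_b, method="BFGS")
    return f_b_inv(res.x)   # analysis state in physical space
```

Starting the minimization from χ_b keeps the first iterate consistent with the background, and the unbounded solver never leaves the transformed space.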
Once the minimum of the cost function is found, a nonlinear model m can be used to predict the background state at the next time from the analysis at the current time, that is,

\mathbf{x}_b(t_{n+1}) = \mathbf{m}\big(\mathbf{x}_a(t_n)\big).

Then, new available observations can be assimilated, and the cycle repeats. In Table 1, examples of the functions S and F, introduced in the cost function and its gradient, are presented for the transforms proposed by Johnson (1949).
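The resulting assimilation cycle alternates forecast and analysis steps; a minimal sketch of the loop structure (our own code) is:

```python
def cycle(x_a0, observations, model, assimilate):
    """Alternate forecast and analysis steps (a sketch).
    `model` advances a state to the next analysis time; `assimilate` returns
    the analysis given a background state and an observation."""
    x_a = x_a0
    analyses = []
    for y in observations:
        x_b = model(x_a)          # forecast: background at the next analysis time
        x_a = assimilate(x_b, y)  # analysis: assimilate the new observation
        analyses.append(x_a)
    return analyses
```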

Experiments
To test the performance of the proposed technique, twin experiments are carried out. In such experiments, artificial observations are created from a known true solution of a given model. These observations are used in a data assimilation scheme and the resulting analysis state is compared with the truth, leading to a measure of the efficacy of the technique. Specifically, we use the Lorenz-63 model (Lorenz, 1963), which provides a simple analog of an atmospheric system that is highly nonlinear and non-Gaussian. The results of these experiments are summarized in Figure 2; the computational details can be found in Appendix B.
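For reference, the Lorenz-63 model with its conventional parameter values can be integrated with a standard fourth-order Runge-Kutta scheme; the values below are the usual textbook choices, not necessarily those of the experiments.

```python
import numpy as np

def lorenz63_rhs(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Right-hand side of the Lorenz-63 equations with conventional parameters.
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def rk4_step(x, dt=0.01):
    # One fourth-order Runge-Kutta step of the Lorenz-63 model.
    k1 = lorenz63_rhs(x)
    k2 = lorenz63_rhs(x + 0.5 * dt * k1)
    k3 = lorenz63_rhs(x + 0.5 * dt * k2)
    k4 = lorenz63_rhs(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```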
The Lorenz-63 model has three state variables, and thus three different transforms can be chosen. For the sake of simplicity, here we only consider the transforms originally proposed by Johnson, as shown in Table 1. We test three different cases: (a) all three state variables follow a bounded Johnson distribution; (b) x_1 is bounded, x_2 is unbounded, and x_3 is semi-bounded; and (c) all three variables are unbounded. All three state variables are directly observed, and artificial observations are generated assuming the given distributions. Multiple twin experiments are performed, using the mean absolute error of the analysis state versus the true solution to determine the performance of the data assimilation method. To decrease the effects of random variations in the realizations of the random variables, each value shown in Figure 2 is averaged over multiple experiments. Results using the correct underlying error distributions are compared to an all-Gaussian approach. Note that in the latter case, the observations are still generated using the underlying Johnson distributions. Therefore, the Gaussian results can differ from panel to panel, as the Gaussian approximation might be worse for some error distributions than others.
In the top row of Figure 2, an observation is made every 20 model time steps. As can be seen, the non-Gaussian technique (blue bars with top left to bottom right hatch) always outperforms the all-Gaussian version (red bars with bottom left to top right hatch). The difference is especially significant for larger observation errors, because in this case the bias introduced by incorrectly assuming Gaussian statistics is larger. In the bottom panels, the same techniques are shown, but now with an observation every 50 time steps. Again, the non-Gaussian scheme has lower absolute errors than the Gaussian one, for all cases shown.

Appendix B: Computational Details

4. The parameters ξ and λ are chosen for each state variable, based on the climatology:
• If f_j is bounded: ξ = min x_j − 2 and λ = max x_j − min x_j + 4. This sets the lower bound at min x_j − 2 and the upper bound at max x_j + 2 (allowing for some leeway in the bounds with respect to the climatology).
• If f_j is unbounded: ξ = mean x_j and λ = std x_j. This approximately standardizes the variable.
5. The background error covariance matrix is estimated by sampling from the climatology every τ_o/dt time steps, calculating the transform of each sample, and calculating the covariance matrix from the transformed variables for all samples in the climatology. A background error covariance matrix for the Gaussian 3DVAR is created in the same way, but without transforming the variables.
6. The observation error covariance matrix is a diagonal matrix with σ_o² for each diagonal element (thus, each observed variable has the same error).
7. Initial conditions for a control run are calculated by perturbing the final state of the climatology by a standardized random Gaussian variate, integrating the model for 1,000 time steps, and selecting the final state of the integration. In this way, the initial conditions x_0 are random, but consistent with the model.
8. A nature run is created by the model from t = 0 to t = N_a × τ_o, with N_a = 250 the number of analysis times, defining the true state x_t.
9. At each time t_n = nτ_o, an artificial observation y_n is generated by (see the sketch after this list):
(a) acting the observation operator on the true state, \tilde{\mathbf{y}}_n = \mathbf{h}(\mathbf{x}_t(t_n));
(b) transforming the observed state, \boldsymbol{\psi}_n = \mathbf{f}_o(\tilde{\mathbf{y}}_n);
(c) generating a random Gaussian variate ζ_n ∼ G(0, σ_o);
(d) inverse transforming the observed, perturbed state, \mathbf{y}_n = \mathbf{f}_o^{-1}(\boldsymbol{\psi}_n + \zeta_n).
Note that in these expressions h(x) = x and f_o = f_b.
10. The initial guess for the data assimilation analysis is chosen by randomly perturbing the initial condition x_0 with a standard Gaussian variate.
11. A 3DVAR data assimilation scheme, outlined in Section 3, is performed, cycling over all available observations.
12. A fully Gaussian 3DVAR data assimilation method is applied to the same initial conditions and observations, to create a baseline against which to compare the proposed non-Gaussian method.
13. The mean absolute error of the entire run is estimated by comparing the analysis state to the known truth, that is,
\mathrm{MAE} = \frac{1}{N_a N} \sum_{n=1}^{N_a} \sum_{j=1}^{N} \big| x_{a,j}(t_n) - x_{t,j}(t_n) \big|.
The same is done for the Gaussian run.
14. Steps (7)-(13) are repeated N_c = 50 times to test the robustness of the method to different initial conditions.
15. The average mean absolute error is calculated over all runs, to generate a single number that estimates the performance of the data assimilation method, that is,
\overline{\mathrm{MAE}} = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{MAE}_c.
This value is shown in Figure 2 for different values of τ_o and σ_o.
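Steps 8-9 above can be sketched as follows (our own code; f_o and its inverse are assumed names for the observation-space transform):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_observation(x_t, h, f_o, f_o_inv, sigma_o):
    """Generate one artificial observation as in step 9 (a sketch):
    apply h, transform, perturb with a Gaussian variate, transform back."""
    y_tilde = h(x_t)                            # (a) observation operator on the truth
    psi = f_o(y_tilde)                          # (b) transform to Gaussian space
    zeta = rng.normal(0.0, sigma_o, psi.shape)  # (c) Gaussian perturbation
    return f_o_inv(psi + zeta)                  # (d) inverse transform
```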

Figure 1. Probability density functions in the Johnson family, for (a) bounded, (b) semi-bounded, (c) unbounded, and (d) Gaussian variables. Here, ξ = 0, λ = 1, σ = 0.5, for different values of μ (see legend in panel b). In (e), the regions in the skewness-kurtosis domain described by the different transforms are shown.

Figure 2. Comparison of Gaussian and non-Gaussian variational data assimilation methods. For different values of the observation variance σ_o², the red (bottom left to top right hatch) bars show the mean absolute error of twin experiments, assuming a Gaussian distribution for all state variables; the blue (top left to bottom right hatch) bars show the results for different Johnson transforms, with transforms f_1, f_2, f_3 as indicated in the figure titles. In the top panels, an observation is made every 20 time steps; in the bottom panels, an observation is made every 50 time steps.

Table 1
Examples of Johnson Transforms and Their Related Functions, Used in the Cost Function and Its Gradient