Dynamical Gaussian, lognormal, and reverse lognormal maximum‐likelihood ensemble filter

We present a non‐Gaussian ensemble data assimilation method based on the maximum‐likelihood ensemble filter, which allows for any combination of Gaussian, lognormal, and reverse lognormal errors in both the background and the observations. The technique is fully nonlinear, does not require a tangent linear model, and uses a Hessian preconditioner to minimise the cost function efficiently in ensemble space. When the Gaussian assumption is relaxed, the results show significant improvements in the analysis skill within two atmospheric toy models, and the performance of data assimilation systems for (semi)bounded variables is expected to improve.

The main difficulty with making the latter techniques operationally viable is twofold: (1) one needs a way to select which distribution to use for each assimilation step, and (2) most numerical weather prediction centres make use of hybrid data assimilation methods, not just variational techniques. To solve the former, Goodliff et al. (2022) proposed using machine learning to switch between distributions, and demonstrated that this improved the analysis skill in the Lorenz-63 model. A crucial step in solving the latter was made by Fletcher et al. (2023), where a mixed Gaussian-lognormal version of the Kalman filter was derived and tested; this was later extended to include reverse lognormal variables (Van Loon & Fletcher, 2024). Here, we expand further on these ideas to develop a non-Gaussian version of the maximum-likelihood ensemble filter (Zupanski, 2005), advancing towards operationally viable non-Gaussian data assimilation.
The maximum-likelihood ensemble filter (MLEF) uses a Hessian preconditioner to minimise the 3DVAR cost function efficiently, while simultaneously estimating the analysis uncertainty and updating the forecast-error covariance matrix through the Kalman filter equations. In order to derive a non-Gaussian MLEF, one has to replace the cost function to allow for other distributions and use the appropriate Kalman filter equations to update the analysis-error covariance matrix. Moreover, we allow the number of Gaussian, lognormal, and reverse lognormal variables to change in time, to reflect the realistic scenario where the underlying error distribution of atmospheric variables is not fixed.
The MLEF does not rely on a linearisation of the model equations or the observation operator. For that reason, there is no need to develop a tangent linear model to use this method, and nonlinear observation operators can be handled consistently. Moreover, because the minimisation of the cost function is performed in ensemble space, the dimensionality of the problem is significantly reduced. Combined with the relaxation of the Gaussian assumption, this makes the MLEF an efficient, nonlinear, and non-Gaussian ensemble-based data assimilation technique.
Using the proposed non-Gaussian version of the MLEF improves the analysis skill of the data assimilation system in the presence of lognormal or reverse lognormal background or observation errors. In particular, we show that correctly assuming a (reverse) lognormal distribution for the assimilation of nonlinear observations in the Lorenz-05 model II (Lorenz, 2005) leads to lower mean absolute errors in the analysis state, when compared with a known truth. Moreover, we test the full power of the dynamical Gaussian, lognormal, and reverse lognormal MLEF in a coupled version of the Lorenz-63 model (Lorenz, 1963; Van Loon & Fletcher, 2024), and demonstrate an improved analysis skill when relaxing the Gaussian assumption.
In Section 2, we provide the derivation of the non-Gaussian MLEF. In Section 3, additional details on the implementation of this technique are given. Twin experiments are performed on two different atmospheric toy models in Section 4. Finally, we present conclusions in Section 5.

FORMULATION
In this section, we derive the equations for the MLEF, allowing for Gaussian, lognormal, and reverse lognormal state variables and observations.

Cost function and notation
Central to the MLEF is the cost function that needs to be minimised to find the most likely estimate of the state variables at every assimilation analysis step. This cost function is defined through the negative log of the posterior probability density function. For the mixed Gaussian-lognormal-reverse lognormal distribution, discussed in Goodliff et al. (2023) (see also Fletcher, 2022), the associated cost function is

J(x) = (1/2) [ψ(x) − ψ(x_b)]^T P_f^{-1} [ψ(x) − ψ(x_b)] + 1̃^T ψ(x) + (1/2) [ψ^(o)(H(x)) − ψ^(o)(y)]^T R^{-1} [ψ^(o)(H(x)) − ψ^(o)(y)] + 1̃_o^T ψ^(o)(H(x)).   (1)

In this definition, and in all of the following, we use Greek letters to denote vectors in "mixed-variable space", and Roman letters for vectors in the original state or observation space. That is, we define χ = ψ(x) and χ_b = ψ(x_b), with x the parameter of the cost function and x_b the background state, both vectors of size N, the number of state variables. The observations are contained in y (a vector of size N_o), and H is the nonlinear observation operator mapping the state variables to observation space. The operator ψ acts on every element separately (such that [ψ(x)]_i = ψ_i(x_i) for i ∈ {1, … , N}), and is defined as ψ_i : x_i → x_i for Gaussian variables, ψ_i : x_i → log x_i for lognormal variables, and ψ_i : x_i → log(τ_{f,i} − x_i) for reverse lognormal variables. Without loss of generality, one can sort the Gaussian, lognormal, and reverse lognormal variables in the state vector, x = (x^(g), x^(l), x^(r))^T, such that it is possible to write explicitly

ψ(x) = (x^(g), log x^(l), log(τ_f − x^(r)))^T.   (4)

Here, x^(g) is a vector of size n_g containing all Gaussian variables, x^(l) is a vector of size n_l containing all lognormal variables, and x^(r) is a vector of size n_r containing all reverse lognormal variables, with n_g + n_l + n_r = N. Note that the sorting is only done for notational simplicity, and is not necessary in practice. The parameter τ_f (a vector of size n_r) is specific to the reverse lognormal distribution, and sets the upper bounds for the state variables. The operator ψ^(o) is defined in a similar way, but acts on variables in observation space, instead of state space. Specifically, sorting the Gaussian, lognormal, and reverse lognormal observations, y = (y^(g), y^(l), y^(r))^T, one can define

ψ^(o)(y) = (y^(g), log y^(l), log(τ_o − y^(r)))^T,   (5)

where y^(g) is a vector of size n_{o,g} containing all Gaussian variables, y^(l) is a vector of size n_{o,l} containing all lognormal variables, and y^(r) is a vector of size n_{o,r} containing all reverse lognormal variables, with n_{o,g} + n_{o,l} + n_{o,r} = N_o. Again, the reverse lognormal distribution requires an additional parameter τ_o (a vector of size n_{o,r}) that sets the maximal value that the observations can have.
In Equation 1, we have introduced the forecast-error covariance matrix P_f and the observation-error covariance matrix R. Note that these contain covariances of the mixed variables, not the original variables, and are thus defined in mixed-variable space. Finally, 1̃ = (0_{n_g}, 1_{n_l}, 1_{n_r})^T is a vector of size N, with zeroes for all Gaussian elements and ones for all lognormal and reverse lognormal elements; 1̃_o = (0_{n_{o,g}}, 1_{n_{o,l}}, 1_{n_{o,r}})^T is a vector of size N_o, with zeroes for all Gaussian elements and ones for all lognormal and reverse lognormal elements.
Compared with an all-Gaussian approach, the cost function has extra terms that are only present for the lognormal and reverse lognormal variables (the second and fourth terms in Equation 1). They originate from a normalising factor in the (reverse) lognormal distribution, such that the minimiser of J(x) coincides with the maximiser of the distribution (the mode). Without the presence of these terms, the minimiser of the cost function would instead describe the median of the distribution; see Fletcher and Zupanski (2007) for a detailed discussion.
Note that, when n_g = N and n_{o,g} = N_o (and thus n_l = n_r = n_{o,l} = n_{o,r} = 0), the cost function reduces to the Gaussian cost function in the context of 3DVAR (Lorenc, 1986).
When n_l = N and n_{o,l} = N_o, it reduces to a fully lognormal cost function (Fletcher & Zupanski, 2007), while when n_r = N and n_{o,r} = N_o all variables follow a reverse lognormal distribution (Goodliff et al., 2023). Although we do not write it explicitly, the partitioning of the variables into the different distributions can change over time (so n_g = n_g(t), n_l = n_l(t), n_r = n_r(t), and similarly for the numbers of observed variables).
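As an illustration (not part of the formulation itself), the element-wise transform ψ and the four-term cost function of Equation 1 might be sketched in Python as follows. The function names, the "g"/"l"/"r" distribution labels, and the convention that τ_f and τ_o are stored as full-length vectors are illustrative choices only.

```python
import numpy as np

def psi(x, dist, tau_f):
    """Mixed transform (Equation 4): identity for Gaussian ('g'), log for
    lognormal ('l'), and log(tau_f - x) for reverse lognormal ('r') entries."""
    out = np.empty_like(x, dtype=float)
    g, l, r = dist == "g", dist == "l", dist == "r"
    out[g] = x[g]
    out[l] = np.log(x[l])
    out[r] = np.log(tau_f[r] - x[r])
    return out

def cost(x, x_b, y, dist_x, dist_y, tau_f, tau_o, Pf_inv, R_inv, H):
    """Mixed cost function J(x) of Equation 1: quadratic plus modal terms for
    both the background and the observations."""
    one_x = (dist_x != "g").astype(float)      # tilde-1: zero for Gaussian entries
    one_y = (dist_y != "g").astype(float)      # tilde-1_o
    d_b = psi(x, dist_x, tau_f) - psi(x_b, dist_x, tau_f)
    d_o = psi(H(x), dist_y, tau_o) - psi(y, dist_y, tau_o)
    return (0.5 * d_b @ Pf_inv @ d_b + one_x @ psi(x, dist_x, tau_f)
            + 0.5 * d_o @ R_inv @ d_o + one_y @ psi(H(x), dist_y, tau_o))
```

Here dist_x and dist_y are per-element label arrays, which may change from one assimilation time to the next, reflecting the dynamical partitioning described above.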

Error covariance matrices
Whereas in 3DVAR the cost function is used directly to find the optimal estimate, in the MLEF a clever transformation is performed that introduces ensemble members. Using the Kalman filter equations (Fletcher et al., 2023; Jazwinski, 1970; Van Loon & Fletcher, 2024), the forecast-error covariance matrix P_f can be calculated by evolving the analysis-error covariance matrix P_a from the previous analysis step with the linearised model M, and written in a square-root form. Here, a perfect model is assumed, such that no model-error term is present. The square-root analysis-error covariance matrix is written as a rectangular matrix whose columns are the ensemble perturbations, P_a^{1/2} = (π_1, π_2, … , π_S), where S is the number of ensemble members and σ ∈ {1, … , S}. Every p_σ is a vector of size N, containing perturbations of the state variables around the analysis from the previous assimilation step, and π_σ is its mixed-variable representation. Thus, P_a^{1/2} is a matrix of size N × S. Similarly, the square-root forecast-error covariance matrix is written as P_f^{1/2} = (β_1, β_2, … , β_S). Here, x_a is the state that minimises the cost function (the analysis state) at the previous analysis time, b_σ are the ensemble perturbations after the model run, and β_σ are the perturbations in mixed-variable space. To simplify the notation, we have defined the "star product" ⋆ (Equation 11), which represents a sum in mixed-variable space, with ⊙ representing a pointwise product and ψ^{-1} the inverse transformation operator (Equation 12) that maps the mixed variables back to the original state variables. Instead of using the linearised model M, the forecast-error covariance matrix is calculated using the nonlinear model ℳ (see Equation 10). One can recover the original, linearised, forms of the Kalman filter equations (Equation 6) by using the additive tangent linear model for the Gaussian variables, and the geometric tangent linear model (Fletcher & Jones, 2014) for the (reverse) lognormal variables.
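To fix ideas, a minimal sketch of the star product and of the column-by-column construction of P_f^{1/2} is given below. It reuses the psi helper from the previous sketch, adds its inverse, and assumes one particular reading of the text: that the star product is x ⋆ p = ψ^{-1}(ψ(x) + ψ(p)), and that each column of P_f^{1/2} is the mixed-variable difference between a perturbed and an unperturbed nonlinear forecast.

```python
import numpy as np

def psi_inv(chi, dist, tau_f):
    """Inverse mixed transform (Equation 12): identity, exp, or tau_f - exp."""
    out = np.empty_like(chi, dtype=float)
    g, l, r = dist == "g", dist == "l", dist == "r"
    out[g] = chi[g]
    out[l] = np.exp(chi[l])
    out[r] = tau_f[r] - np.exp(chi[r])
    return out

def star(x, p, dist, tau_f):
    """Star product: a sum in mixed-variable space, i.e. ordinary addition for
    Gaussian entries and (in effect) a pointwise product for the others."""
    return psi_inv(psi(x, dist, tau_f) + psi(p, dist, tau_f), dist, tau_f)

def sqrt_Pf(x_a, perts, model, dist, tau_f):
    """Square-root forecast-error covariance (N x S): column sigma is
    psi(M(x_a star p_sigma)) - psi(M(x_a)), with M the nonlinear model."""
    chi_b = psi(model(x_a), dist, tau_f)
    cols = [psi(model(star(x_a, p, dist, tau_f)), dist, tau_f) - chi_b
            for p in perts]
    return np.column_stack(cols)
```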

Hessian preconditioner
A Hessian preconditioner is introduced to transform the cost function to ensemble space and make the minimisation procedure more efficient. To formulate the preconditioner, one could try to follow the same method as in the Gaussian MLEF of Zupanski (2005). However, due to the logarithms, the Hessian ∇_x^2 J(x) would not be a good choice, for multiple reasons. For one, the modal terms in the cost function lead to extra contributions in the Hessian that make it impossible to write the preconditioner in a computationally efficient way, negating the benefits of the MLEF. Second, using a change of variables as defined in Zupanski (2005) would never lead to a simple cost function. Instead, an alternative form of the preconditioner should be defined, one that directly replaces the mixed variables χ in the cost function, not x. To do this, one can calculate the second derivative of the cost function with respect to χ, the variable we want to replace. This leads to the approximate Hessian given in Equation 14. In calculating the derivatives of the cost function, the Jacobian of the observation operator, H_x = ∇_x H(x), appears in the equations. Within the mixed representation, due to the chain rule, this Jacobian is scaled by diagonal matrices containing the derivatives of the transforms ψ and ψ^(o).
In the calculation of the Hessian, we have omitted the vector terms that come from the second derivative of the observation operator. This is equivalent to saying that ∇_χ^2 ψ^(o)(H(x)) ≈ 0, which is only true if ψ^(o)(H(x)) is linear in χ. The validity of this assumption will be discussed at the end of this section. The approximation to the Hessian, Equation 14, can be written compactly by defining a new matrix in ensemble space, C_χ. By using the additive tangent linear model for the Gaussian variables and the geometric tangent linear model for the lognormal and reverse lognormal variables, the adjoints can be removed and the matrix C_χ is approximated as in Equation 18. Note that C_χ is a matrix of size S × S, and its square root has size N_o × S. The approximation in Equation 18 has a clear benefit: it avoids the direct calculation of any tangent linear model or adjoint, which significantly reduces the complexity of the data assimilation method. This important property of the MLEF can thus be extended to the (reverse) lognormal formulation. It should be noted, however, that this approximation is limited to small ensemble perturbations in observation space, which is not necessarily the case, especially for highly nonlinear systems.
Within this framework, one can define a change of variables based on the inverse square root of the Hessian (Equation 19), where ζ is a vector of size S. As such, the transformed cost function J(ζ) and its gradient ∇_ζ J(ζ) are obtained in ensemble space. In the observational modal term, the evaluation of the Jacobian of the observation operator can be bypassed by using the same ensemble approximation as in Equation 18. Because of the nonlinearity of the method, the preconditioner defined in Equation 19 is not exact. However, with the assumption ∇_χ^2 ψ^(o)(H(x)) ≈ 0, used in the approximation of the Hessian of the original cost function, the Hessian of the cost function in ensemble space is approximately the identity, ∇_ζ^2 J(ζ) ≈ I_{S×S}, confirming that a good preconditioner was used.
The cost function is minimised in ensemble space, defining ζ_a = argmin_ζ J(ζ). Equation 19 can then be used to find χ_a. This is the mixed-variable representation of the analysis state, x_a = ψ^{-1}(χ_a), the optimal estimate of the state variables.
Finally, the analysis-error covariance matrix is updated with the inverse Hessian of the original cost function (Equation 22), which defines the ensemble perturbations around the analysis.
For highly nonlinear observation operators, the approximation ∇_χ^2 ψ^(o)(H(x)) ≈ 0 used throughout the derivation is not generally true. In the case of Gaussian statistics, it is only exactly valid if H is linear. For (reverse) lognormal errors, this condition can be somewhat relaxed, as the (reverse) lognormal transformations allow for a specific kind of nonlinearity in the observation operator, thus broadening the scope of validity. However, even if this approximation does not hold exactly, the Hessian preconditioner is still good close to the background state, where the linear expansion is locally valid. In other words, the change of variables in Equation 19 makes the cost function J(ζ) approximately quadratic around the background state χ = χ_b, such that a minimum of J(ζ) can still be found efficiently as long as χ_a − χ_b is not too large. Finally, because the Hessian is evaluated at the analysis state in Equation 22, P_a^{1/2} is a good estimate of the square-root analysis-error covariance matrix for arbitrary nonlinear observation operators (see Zupanski, 2005).

IMPLEMENTATION
The general framework of the MLEF, outlined in the previous section, can be implemented and separated into three components: forecast, transform, and update. These stages are explained in detail in this section.

Forecast
In order to make future predictions from a given analysis state, a forecast needs to be made. Additionally, this forecast is used as the background state in the next data assimilation step, when new observations are available. Because the MLEF is an ensemble method, a forecast needs to be made for all ensemble members. First, the forecast or background state at time t = t_k is computed as x_b^k = ℳ(x_a^{k−1}), with x_a^{k−1} the analysis state at time t = t_{k−1}. To estimate the forecast-error covariance matrix, each ensemble member needs to be run through the model, to define the forecast ensemble members g_σ^k = ℳ(x_a^{k−1} ⋆ p_σ^{k−1}) (Equation 24), where the star product ⋆ is defined in Equation 11. Because every ensemble member is independent, these calculations can be done in parallel.
Ensemble methods are sensitive to the choice of the initial errors at the start of the data assimilation cycle (Zupanski et al., 2006). There are multiple ways to address the initiation of the ensemble members; here, we make use of a lagged forecast method (Suzuki et al., 2017; Zupanski, 2021). First, a model run is performed, centred around the starting time of the data assimilation cycle (t = 0), that is, from t = −T to t = T. Then, the initial perturbations are calculated as differences between the free-run states at times t_σ and the initial state x(0), where x(t) represents the output of the free model run at time t, S is the total number of ensemble members, and t_σ are specific time steps evenly spread out over the domain [−T, T].
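A minimal sketch of this lagged-forecast initialisation is shown below, under the assumption that the perturbations are simple differences between evenly spaced snapshots of the free run and the state at t = 0; any additional normalisation is omitted, and the function name is illustrative.

```python
import numpy as np

def lagged_perturbations(free_run, S):
    """Initial ensemble perturbations from a free model run over [-T, T].
    `free_run` has shape (n_times, N) and is centred on t = 0."""
    n_times = free_run.shape[0]
    x0 = free_run[n_times // 2]                        # state at t = 0
    lags = np.linspace(0, n_times - 1, S).astype(int)  # evenly spread over [-T, T]
    return [free_run[i] - x0 for i in lags]
```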

Transform
The model outputs the original state variables (Roman letters), while the data assimilation system uses the mixed representation (Greek letters), defined through the transformations ψ and ψ^(o) given in Equations 4 and 5. Thus, one has to decide which state variables are considered Gaussian, lognormal, or reverse lognormal.
Once that decision is made, each variable can be transformed to mixed-variable space. Defining χ_b = ψ(x_b) and χ_σ = ψ(g_σ), the forecast-error covariance matrix (see Equation 10) consists of the ensemble perturbations β_σ = χ_σ − χ_b. After the analysis state is found, one can perform an inverse transform back to the original state variables; see Equation 12.
The cost function is minimised in ensemble space by introducing the Hessian preconditioner of Equation 19. First, C_{χ_b}^{1/2} is calculated from the mixed variables, which defines C_{χ_b} = C_{χ_b}^{T/2} C_{χ_b}^{1/2}. Then, the inverse square root of Q = I_{S×S} + C_{χ_b} is computed by using an eigenvalue decomposition, Q = VWV^T, such that Q^{-1/2} = VW^{-1/2}V^T. Finally, the preconditioning can be performed to solve for the minimum of the cost function in ensemble space.
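The inverse square root of Q only involves a small S × S eigenvalue problem; a sketch is given below, assuming C_{χ_b}^{1/2} (size N_o × S) has already been assembled from the observation-space ensemble perturbations (Equation 18). The function name is illustrative.

```python
import numpy as np

def inv_sqrt_Q(C_sqrt):
    """Given C^{1/2} (N_o x S), form C = C^{T/2} C^{1/2} and return the inverse
    square root of Q = I + C via the eigenvalue decomposition Q = V W V^T."""
    S = C_sqrt.shape[1]
    Q = np.eye(S) + C_sqrt.T @ C_sqrt          # S x S, symmetric positive definite
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(w ** -0.5) @ V.T        # Q^{-1/2} = V W^{-1/2} V^T
```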

Update
The minimiser of the cost function, that is, the solution of ∇_ζ J(ζ_a) = 0, can be used to update the analysis state, x_a = ψ^{-1}(χ_a), which can then be used in the next forecast step. The matrix C_{χ_a} is then computed at the minimum and used to compute the inverse square root of the Hessian of the cost function, as outlined above. This then defines the square root of the analysis-error covariance matrix, which contains the ensemble perturbations p_σ. Again, these perturbations initialise the next data assimilation step, through Equation 24.
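Putting the pieces together, the update step might be sketched as follows. The use of a BFGS minimiser, the explicit form of the change of variables (χ_a = χ_b + P_f^{1/2} Q^{-1/2} ζ_a, our reading of Equation 19), and the function names are illustrative assumptions; psi_inv is the helper sketched earlier.

```python
import numpy as np
from scipy.optimize import minimize

def update(J_ens, grad_J_ens, chi_b, Pf_sqrt, Qinv_sqrt, dist, tau_f):
    """Minimise the ensemble-space cost function J(zeta) and map the minimiser
    back to the original state variables through the preconditioner and psi_inv."""
    S = Pf_sqrt.shape[1]
    res = minimize(J_ens, np.zeros(S), jac=grad_J_ens, method="BFGS")
    zeta_a = res.x
    chi_a = chi_b + Pf_sqrt @ (Qinv_sqrt @ zeta_a)   # change of variables
    return psi_inv(chi_a, dist, tau_f), zeta_a       # analysis state x_a and zeta_a
```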

RESULTS
To test the mixed MLEF, we perform twin experiments on two atmospheric toy models. First, we use a model introduced by Lorenz (2005) (Lorenz-05 model II) with a nonlinear observation operator to mimic (reverse) lognormal signals in the observations. Second, we use a coupled version of the Lorenz-63 model (Lorenz, 1963; Van Loon & Fletcher, 2024), as it contains non-Gaussian state variables that switch between Gaussian, lognormal, and reverse lognormal distributions (Goodliff et al., 2020). Moreover, a machine learning algorithm has been developed for this model, which can decide on the optimal underlying statistics (Goodliff et al., 2022).

Lorenz-05 model II
The Lorenz-05 model II is an atmospheric toy model that describes an atmospheric variable on N = 240 points along a latitude circle (a grid spacing of 1.5°) and can simulate wave dynamics. More details can be found in Lorenz (2005). The model has two parameters that need to be set; we choose K = 8 and F = 15. Although we work with nondimensional time steps, one time unit can be thought of as five days (the typical damping time of the model). We integrate the model using a fourth-order Runge-Kutta scheme, with a time step of dt = 1/120, coinciding with one hour. With these settings, the state variables seem to be well described by a Gaussian distribution, which was checked by plotting histograms of the variables after a long control run (see Figure 1a for an example). For each experiment, a preparatory model run of 1000 time steps is performed from random initial conditions. The final state of that run is used as the initial condition for a climatology run, on which we base the initialisation of the ensemble members, as explained in Section 3.1. The climatology is run over 5000 time steps and ends at t = 0, which defines the initial condition for the simulated nature run. The initial perturbations are created with a lagged forecast with T = S × Δ_o, where Δ_o is the number of time steps between observations, which we choose to be Δ_o = 12. Finally, a nature run is performed over N_a × Δ_o time steps, which is used as the true state of the twin experiments. Here, N_a = 250 is the number of assimilation times.
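All model integrations in these experiments use a standard fourth-order Runge-Kutta step; a generic sketch is shown below, where lorenz05_tendency stands for a hypothetical, user-supplied tendency function implementing the Lorenz-05 model II equations.

```python
def rk4_step(f, x, dt):
    """One fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Example: twelve hourly steps (dt = 1/120) between successive observations.
# for _ in range(12):
#     x = rk4_step(lorenz05_tendency, x, 1.0 / 120.0)
```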
From the true state, we create artificial observations every Δ_o time steps, such that every experiment contains N_a = 250 observation times. We perform experiments with two different observation operators, in order to simulate (reverse) lognormal signals. Specifically, we define

H_j(x) = φ(d_j),   d_j = (1/6) ∑_{i=6(j−1)+1}^{6j} x_i,   for every j ∈ {1, … , N_o}.

Thus, an observation is made from the mean of the state variables on six grid points (such that N_o = N/6 = 40), and a nonlinear transform φ is applied to make the observations non-Gaussian. The artificial observations are generated from the true state x_t by perturbing it with a random error. First, the mean true state d_j is taken over every six grid points. Second, a random sample of length N_o is taken from an unbiased Gaussian distribution, r ∼ G(0, σ_o), with σ_o the observation error. Finally, the observation is generated as y = φ(d + r). The observation-error covariance matrix R is then a diagonal matrix with σ_o^2 for all diagonal elements. The error is added before performing the transform φ to remain consistent with the error distributions.
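A sketch of this observation operator and of the observation generation is given below; the function names are illustrative, and the transform phi is supplied separately (the two choices used here are described next).

```python
import numpy as np

rng = np.random.default_rng(1)

def obs_operator(x, phi, n_avg=6):
    """H_j(x): mean of the state over blocks of six grid points, then phi."""
    d = x.reshape(-1, n_avg).mean(axis=1)
    return phi(d)

def simulate_obs(x_true, phi, sigma_o, n_avg=6):
    """Add Gaussian noise to the six-point means of the true state *before*
    applying phi, so the observation errors are consistent with the transform."""
    d = x_true.reshape(-1, n_avg).mean(axis=1)
    return phi(d + rng.normal(0.0, sigma_o, size=d.size))
```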
To test a static lognormal MLEF, we define φ : d_j → exp[(d_j − μ)/ν], with μ = 2.7 and ν = 5.8. The values μ and ν are chosen to approximately standardise the state variables, so as not to create too large values for the observations after exponentiating; they were calculated by estimating the sample mean (μ) and sample standard deviation (ν) from a long control run. Assuming the mean over six grid points is a Gaussian random variable, the exponential of that random variable behaves like a lognormal random variable. This is visualised in Figure 1. In Figure 1a, the distribution of the average of the first six state variables (d_1 = ∑_{i=1}^{6} x_i/6) is shown. As the Lorenz-05 model II is cyclic, the other state variables behave similarly. Visually, the state variables of the model behave as approximately Gaussian. In Figure 1b, the transform φ(d_1) = exp[(d_1 − μ)/ν] is applied, such that the observations behave approximately lognormally. Thus, for this experiment, we expect the MLEF with lognormal observation errors to outperform the fully Gaussian one.
For each experiment, the mean absolute error (MAE) of the analysis state is calculated over all assimilation windows, by comparing it with the known true state. For a given standard deviation σ_o and number of ensemble members S, such an experiment is run N_c = 50 times, and the MAE is averaged over all runs, to test the robustness of the technique to random noise, although noise could still be present in the results shown.
In general, a lower MAE in these experiments should indicate a better analysis and forecast skill.
The results are presented in Figure 2, for different ensemble sizes S. For all situations shown, the lognormal MLEF outperforms the Gaussian approach. The higher MAE values for the Gaussian method with σ_o = 0.01 might be due to an underestimation of the actual observation errors, such that the method overfits the observations. If one assumes the Lorenz-05 model II is well described by Gaussian statistics (see Figure 1), this result is to be expected. Indeed, the approximation of Gaussian background errors works well, but the observation errors are designed to be lognormal. Thus, blindly applying the Gaussian assumption to the observation errors introduces biases in the analysis state; these biases are resolved when applying lognormal statistics correctly to the observational part of the cost function.
To test the dynamical capabilities of the proposed non-Gaussian MLEF, we next use the transforming function φ : d_j → {tanh[(d_j − μ)/ν] + 1}/2, with μ = 2.7 and ν = 5.0. Here, ν is lowered with respect to the previous case to allow for more spread in the observations. Now, the transforming function maps variables to a bounded region between 0 and 1. The resulting probability density is shown in Figure 1c. Because the observations now have an upper and a lower bound, they can change between lognormal, Gaussian, and reverse lognormal behaviour. Specifically, we expect the observations to be well approximated by a lognormal distribution when y_j is close to zero, but by a reverse lognormal distribution when y_j is close to one.
To test this hypothesis, we run four different MLEF twin experiments, all using the same observations: a fully Gaussian MLEF, a fully lognormal MLEF, a fully reverse lognormal MLEF, and a dynamical MLEF in which an observed variable is treated as lognormal when y_j < 0.3, Gaussian when 0.3 ≤ y_j ≤ 0.7, and reverse lognormal when y_j > 0.7.
The latter experiment tries to mimic the expectation that the observations are lognormal if they are close to zero, and reverse lognormal if they are close to one. The cut-offs 0.3 and 0.7 are chosen so as to have about the same number of Gaussian, lognormal, and reverse lognormal observed variables. The results are summarised in Figure 3. As can be seen, the dynamical method, where the distribution of the observation errors is allowed to change, generally outperforms the other methods. The only exception occurs when σ_o = 0.5, where the Gaussian technique has the lowest mean absolute error, although the differences are small and might be due to random noise. The (reverse) lognormal methods often perform worse than the Gaussian one. This is because assuming a lognormal distribution when the observation is close to the upper bound (where it is expected to behave more reverse lognormally) is worse than the Gaussian assumption. The same argument applies to the reverse lognormal method when the observations are close to the lower bound.
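For reference, the switching rule used in the dynamical experiment can be written in a few lines; the sketch below assumes the cut-offs are applied to the observation values, as described above, and the labels follow the earlier convention.

```python
def obs_distribution(y_j, low=0.3, high=0.7):
    """Select the assumed observation-error distribution for a tanh observation:
    lognormal near the lower bound, reverse lognormal near the upper bound."""
    if y_j < low:
        return "l"      # lognormal
    if y_j > high:
        return "r"      # reverse lognormal
    return "g"          # Gaussian in between
```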
Two conclusions can be drawn from these results: (1) allowing for non-Gaussian errors can improve the analysis skill in the Lorenz-05 model II; and (2) the choice of the distribution is important. Even when allowing for lognormal and reverse lognormal errors, the use of these distributions is still an approximation to the actual statistics of the observations. The dynamical MLEF described in this work expands the scope of non-Gaussian methods, but the choice of which underlying distribution to use depends on the specific application. For example, although switching between Gaussian, lognormal, and reverse lognormal errors seems to work well for observations generated through a hyperbolic tangent, it might be more appropriate to use a bounded distribution for all times (Van Loon & Fletcher, 2023).

Coupled Lorenz-63
The three-parameter Lorenz-63 model (Lorenz, 1963) has been shown to contain non-Gaussian statistics (Goodliff et al., 2020). However, because of its simplicity, it is not well suited for experiments with the non-Gaussian MLEF proposed here. Therefore, we simulate a higher-dimensional version of the Lorenz-63 model, similar to Akhmet and Fen (2015), and used in Van Loon and Fletcher (2024) in the context of the mixed Kalman filter. The model is governed by coupled differential equations for (x_j, y_j, z_j), with j ∈ {1, … , n_p}. If the original Lorenz-63 model describes three atmospheric variables (x, y, and z) at a given location, then the coupled Lorenz model describes these same three variables at n_p different locations, with coupling between neighbouring grid points and cyclic boundary conditions. The variable n_p sets the number of grid points, and the coupled system contains a total of N = 3n_p state variables. In our experiments, we choose n_p = 10, (σ, ρ, β) = (10, 28, 8/3), and c_x = c_y = c_z = 0.1; these parameter values are known to produce chaotic behaviour in the Lorenz-63 model, to which we add a weak coupling, with the same value for each variable for simplicity. We solve the equations using a fourth-order Runge-Kutta scheme with dt = 0.01. The coupled Lorenz model can be used to test the dynamical component of the proposed filter: the z_j components change their distribution over time, while x_j and y_j are well described by Gaussian statistics. To decide when to use which distribution, we make use of a predictor based on a machine learning model (Goodliff et al., 2020, 2022); see the Appendix for more details. The trained machine learning model can be used to predict the skewness class (negative, neutral, or positive) of z_j at a given analysis time, given x_j and y_j at that time. It has one free parameter, δ, which controls the width of the neutral class and thus the sensitivity to skewness. If δ = 0, the neutral class is nonexistent, such that only negative or positive classes are predicted. If δ → ∞, the neutral class dominates, and the model never predicts negative or positive classes. If the variable is predicted to be in the positive class, we use a lognormal distribution for the errors; if the prediction is neutral, a Gaussian distribution is used; and if it is negative, a reverse lognormal distribution is used. For x_j and y_j, a Gaussian distribution is always assumed.
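Since the coupled equations are not reproduced here, the sketch below shows one plausible realisation: standard Lorenz-63 dynamics at every grid point plus a weak diffusive coupling to the two cyclic neighbours. The exact form of the coupling terms is an assumption made for illustration; only the parameter values follow the text.

```python
import numpy as np

def coupled_l63_tendency(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0,
                         c_x=0.1, c_y=0.1, c_z=0.1):
    """Tendency of a ring of n_p weakly coupled Lorenz-63 systems.
    `state` is ordered as (x_1..x_np, y_1..y_np, z_1..z_np)."""
    x, y, z = state.reshape(3, -1)
    couple = lambda v: np.roll(v, 1) + np.roll(v, -1) - 2.0 * v   # cyclic neighbours (assumed form)
    dx = sigma * (y - x) + c_x * couple(x)
    dy = x * (rho - z) - y + c_y * couple(y)
    dz = x * y - beta * z + c_z * couple(z)
    return np.concatenate([dx, dy, dz])

# Example: advance the coupled model by one step of dt = 0.01 with rk4_step above.
# state = rk4_step(coupled_l63_tendency, state, 0.01)
```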
Similar twin experiments are performed as outlined in Section 4.1. This time, the observation operator is linear and all variables are observed, that is, H(x) = x. An observation is made every Δ_o = 20 time steps. Observations are simulated by first letting the machine learning model decide the distribution from the true state, and then sampling from the corresponding distribution. Specifically, an observation error σ_o is chosen, representing the standard deviation of H_j(x) (j ∈ {1, … , N_o}), and an observation is sampled from a given distribution as y_j ∼ D(μ, σ). Here, D(μ, σ) is either a Gaussian, lognormal, or reverse lognormal distribution with parameters μ and σ, depending on the skewness class (neutral, positive, or negative, respectively). If D is Gaussian, one sets μ = H_j(x_true) and σ = σ_o. If D is lognormal, one sets μ = log[H_j(x_true) r] and σ = √(log r), with r the only real root larger than 1 of the equation r^4 − r^3 − (σ_o/H_j(x_true))^2 = 0. If D is reverse lognormal, one sets μ = log[(τ_o − H_j(x_true)) r] and σ = √(log r), with r the only real root larger than 1 of the equation r^4 − r^3 − σ_o^2/(τ_o − H_j(x_true))^2 = 0. These values are calculated by setting the observed true state to the mode of the distribution, H_j(x_true) = e^{μ − σ^2}, equating the observation variance to the variance of the distribution, σ_o^2 = e^{2μ + σ^2}(e^{σ^2} − 1), and solving for μ and σ. Finally, the observation-error covariance matrix R is a diagonal matrix with σ^2 as its diagonal elements. Because σ^2 depends on the value of H_j(x_true) and D is selected by the machine learning model, we allow R to change every assimilation time.
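This sampling recipe can be implemented directly; the quartic for r is solved numerically, and the reverse lognormal draw is reflected about the upper bound τ_o. The variable names and the use of numpy's polynomial root finder are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_obs(h_true, skew_class, sigma_o, tau_o=None):
    """Sample one observation whose mode is h_true and whose variance is
    sigma_o**2, for the predicted skewness class 'g', 'l', or 'r'."""
    if skew_class == "g":                       # neutral -> Gaussian
        return rng.normal(h_true, sigma_o)
    s = h_true if skew_class == "l" else tau_o - h_true
    # r is the only real root larger than 1 of r^4 - r^3 - (sigma_o/s)^2 = 0
    roots = np.roots([1.0, -1.0, 0.0, 0.0, -(sigma_o / s) ** 2])
    r = roots[(np.abs(roots.imag) < 1e-10) & (roots.real > 1.0)].real[0]
    mu, sig = np.log(s * r), np.sqrt(np.log(r))
    draw = rng.lognormal(mu, sig)
    return draw if skew_class == "l" else tau_o - draw
```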
Again, we perform twin experiments with four different techniques: a Gaussian-fits-all approach; an "alternate" configuration, in which {z_1, z_3, z_5, z_7, z_9} are taken to be lognormal and {z_2, z_4, z_6, z_8, z_10} reverse lognormal; a "split" configuration, in which {z_1, … , z_5} are taken to be lognormal and {z_6, … , z_10} reverse lognormal; and a dynamical version, in which the machine learning model is used to switch between distributions.
The resulting MAEs are shown in Figure 4.
In the top row of Figure 4, the machine learning sensitivity parameter is chosen to be δ = 1.28 (see the Appendix for the reasoning behind this choice). For the values shown, the non-Gaussian approaches almost always outperform the Gaussian approach, with the exception of the machine learning version for 10 ensemble members and σ_o = 0.5. This indicates that the z_j components of the coupled Lorenz-63 model are non-Gaussian, in line with previous results (Goodliff et al., 2020, 2022; Van Loon & Fletcher, 2024). Surprisingly, the machine learning model usually performs worse than the cases where the z_j components are kept (reverse) lognormal for all time windows, except for the tests where the observation error is small (σ_o = 0.01 and σ_o = 0.02) and the ensemble size is large (S = 20 and S = 30). The ML(δ = 1.28) method seems to be very sensitive to changes in σ_o. As the machine learning model is used both to generate the observations and to select the distribution at the assimilation time, for larger errors the observation might stray too far from the true state, and the wrong prediction might be made. In other words, at a given location and time, the machine learning model may select a different distribution for the true state (used to generate the observation) than for the observation in the assimilation run.
To investigate the cause of the disappointing performance of the dynamical MLEF, the bottom row of Figure 4 shows the results of the same experiments, but using δ = 0.0. In this case, the machine learning model never predicts Gaussian statistics for z_j, but switches between lognormal and reverse lognormal distributions dynamically. As can be seen, the dynamical MLEF now performs as well as or better than the other non-Gaussian methods, and always better than the Gaussian method. Moreover, the ML(δ = 0.0) method is less sensitive to changing observation errors than the ML(δ = 1.28) method. This indicates that, in contrast to the three-parameter Lorenz-63 model (Goodliff et al., 2022), the Gaussian assumption is never good for z_j, and using (reverse) lognormal statistics improves the analysis skill of the MLEF with the coupled Lorenz-63 model.
Finally, in Figure 5, we compare the non-Gaussian MLEF with similar non-Gaussian methods based on the Kalman filter (Fletcher et al., 2023; Van Loon & Fletcher, 2024) and 3DVAR (Goodliff et al., 2022, 2023). In these experiments, we use the machine learning model with δ = 0.0 to select the underlying distribution of all z_j. In all cases tested, the MLEF outperforms the Kalman filter and 3DVAR, with only 10 ensemble members. Based on the results shown in Figure 4, we expect the MLEF to perform even better when more ensemble members are used.
The results show that allowing for non-Gaussian errors always improves the analysis in the coupled Lorenz-63 model, with particularly significant improvements at larger ensemble sizes. The ensemble members at the assimilation time describe perturbations around the forecast, assuming a certain distribution. Thus, using more ensemble members gives a larger sample size to approximate the error distribution. If the wrong assumption for the error distribution is made, this may lead to relatively higher inaccuracies in the analysis. Conversely, if the correct error distribution is selected, then increasing the ensemble size will lead to a stronger reduction of analysis errors. This suggests that the importance of selecting the correct error distribution increases with increasing ensemble size, as seen in the experiments with the coupled Lorenz-63 model (Figure 4).

CONCLUSION
We extended the original, Gaussian, formulation of the maximum-likelihood ensemble filter (Zupanski, 2005) to a non-Gaussian approach, by allowing the background and observation errors to follow a mixed distribution of Gaussian, lognormal, and reverse lognormal random variables. The non-Gaussian MLEF was tested on two different atmospheric toy models, and shows significant improvement over the Gaussian-fits-all approach in all cases investigated. Our conclusions are in line with previous studies, which indicated improved performance in variational (Fletcher, 2010; Fletcher & Zupanski, 2006a; Goodliff et al., 2022; Van Loon & Fletcher, 2023), Kalman filter based (Fletcher et al., 2023; Van Loon & Fletcher, 2024), or ensemble based (Bishop, 2016) data assimilation methods when allowing for non-Gaussian error statistics. In particular, using lognormal or reverse lognormal errors resolves biases introduced when trying to assimilate bounded observations with skewed distributions.
The main limitation of the MLEF is that it can only assimilate observations at the analysis time. To overcome this, Hurst et al. (2022) introduced the maximum-likelihood ensemble smoother, allowing for a time component in the cost function. There, minimising the cost function finds the optimal initial condition that fits the observations throughout the assimilation window. Within four-dimensional variational data assimilation, it has been demonstrated that the Gaussian assumption can be lifted to include lognormal (Fletcher, 2010) and reverse lognormal variables. The derivation of the non-Gaussian MLEF is easily extendable to a 4D formulation, to create a fully nonlinear, non-Gaussian ensemble smoother; we leave this for future work.
FIGURE 1 Histograms of observed state variables using different transforming functions with the Lorenz-05 model II. Probability densities are estimated through a long control run (10^6 time steps), using the first six state variables, that is, d_1 = ∑_{i=1}^{6} x_i/6. (b) μ = 2.7 and ν = 5.8; (c) μ = 2.7 and ν = 5.0.

FIGURE 2 Mean absolute errors from twin experiments with different versions of the MLEF for the Lorenz-05 model II with exponential errors, changing observation error σ_o for different ensemble sizes S. The Gaussian-fits-all approach is shown in red (bottom left to top right hatch), while the blue bars (top left to bottom right hatch) depict the situation where the observations are considered lognormal.
FIGURE 3 Mean absolute errors from twin experiments with different versions of the MLEF for the Lorenz-05 model II with observations generated through a hyperbolic tangent, changing observation error σ_o for different ensemble sizes S.

FIGURE 4 Mean absolute errors from twin experiments with different versions of the MLEF for the coupled Lorenz-63 model, changing observation error σ_o for different ensemble sizes S. The Gaussian-fits-all approach is shown in red (bottom left to top right hatch); for the blue bars (top left to bottom right hatch), {z_1, z_3, z_5, z_7, z_9} were taken to be lognormal and {z_2, z_4, z_6, z_8, z_10} to be reverse lognormal ("alternate"); for the green bars (crosshatch), {z_1, … , z_5} were taken to be lognormal and {z_6, … , z_10} to be reverse lognormal ("split"); for the yellow and brown bars (dots), the machine learning (ML) model is used to switch between distributions. In the top row, the ML sensitivity parameter δ = 1.28; in the bottom row, δ = 0.0.

FIGURE 5 Comparison of different data assimilation techniques with the coupled Lorenz-63 model. The maximum-likelihood ensemble filter (MLEF) uses S = 10 ensemble members. All methods use the machine learning model to switch between distributions, with δ = 0.0.