A Machine Learning Augmented Data Assimilation Method for High‐Resolution Observations

The accuracy of initial conditions is an important driver of the forecast skill of numerical weather prediction models. Increases in the quantity of available measurements, particularly high‐resolution remote sensing observational data products from satellites, are valuable inputs for improving those initial condition estimates. However, the traditional data assimilation methods for integrating observations into forecast models are computationally expensive. This makes incorporating dense observations into operational forecast systems challenging, and it is often prohibitively time‐consuming. Additionally, high‐resolution observations often have correlated observation errors which are difficult to estimate and create problems for assimilation systems. As a result, large quantities of data are discarded and not used for state initialization. Using the Lorenz‐96 system for testing, we demonstrate that a simple machine learning method can be trained to assimilate high‐resolution data. Using it to do so improves both initial conditions and forecast accuracy. Compared to using the Ensemble Kalman Filter with high‐resolution observations ignored, our augmented method has an average root‐mean‐squared error reduced by 37%. Ensemble forecasts using initial conditions generated by the augmented method are more accurate and reliable at up to 10 days of forecast lead time.


Introduction
The accuracy of operational weather forecasts is highly dependent on the quality of the initial conditions provided to the model (Bauer et al., 2015).To correct drift and maintain the robustness of forecasts, model initial conditions are regularly updated based on measurements (Bannister, 2017;Edwards et al., 2015).These measurements include both in-situ data, such as weather station measurements, and remotely sensed data (Choi et al., 2017;Zhang & Tian, 2021).Observations are noisy and may not be aligned with the model grid or state variables; the task of identifying optimal initial conditions consistent with all available information is therefore challenging and computationally expensive.Data assimilation (DA) methods are employed to do this.With the proliferation of high-resolution data sets, often at resolutions higher than that of the forecast models, otherwise useful data is regularly ignored and not assimilated into operational models due to time or computational constraints (Eyre et al., 2022).Correlations in observation errors often require thinning of observations before they are assimi-Filter (EnKF), in which a set of model realizations are simulated to quantify covariance structures before assimilating observations (Evensen, 2003;Hoteit et al., 2018).Both methods require running multiple simulations of the full forecast model, a step that is computationally expensive and time-consuming.To capture the information of a high-resolution measurement, the model itself must at least match the resolution of the measurement-further increasing the cost of this step as the model run time scales with resolution.In addition to the necessity of using a higher resolution model (run multiple times), the physics of smaller scale dynamics create more complex correlation structures, and a larger ensemble size is required to actually improve the forecast (Miyoshi et al., 2015).
An additional cost associated with assimilating high-resolution observations is the observation operator.Both variational and sequential methods assess the error of a model forecast for a given initial condition in observation space.This requires, for each observation, explicitly mapping between forecast model space and observation space.For some observations, this is straightforward.For others, particularly for remotely sensed data such as high-resolution satellite measurements, this calculation itself is a physics-based model that can be a computational bottleneck (Eyre et al., 2022).In some operational models, the trade-off between the speed and accuracy of these observation operators is already an important avenue of research for improving the performance of their DA systems even before currently unusable high-resolution data is considered (Shahabadi & Buehner, 2021).The observation operator calculation must be performed for each data point and so also scales with the number of discrete observations, again increasing its cost.
Remotely sensed observations are integral to the performance of numerical weather forecasts, especially since in-situ observations of large portions of the atmosphere and surface are sparse (Bannister, 2017;Cardinali, 2009).Efforts to incorporate satellite and other remotely sensed observations into assimilation systems have been effective at improving model initialization and forecast accuracy (Buehner et al., 2018;Shahabadi & Buehner, 2021).However, as a result of the expense associated with assimilation much of the potential of these high-resolution measurements for improving state initialization in forecast models has not been realized.Currently employed DA methods are simply not efficient enough to sufficiently quickly ingest this data to be useful in an operational setting.Machine Learning (ML) methods may provide a potential solution.
ML techniques have been increasingly used in earth science applications, including DA (Abarbanel et al., 2018;Bonavita et al., 2021;Penny et al., 2022;Sonnewald et al., 2021).They are appealing for this particular problem mostly due to their speed.While the training process is often expensive, once trained ML methods are very fast compared to weather forecast models.Since many of the bottlenecks in traditional DA methods are related to computational efficiency, as described above, much of the effort in employing ML to improve DA has been targeted at the most computationally expensive parts of the process.
One obvious place to look is at the model simulations themselves.Attempts to use deep learning, in which neural networks comprised of many layers are used to capture complex structures, have proven successful.The basic approach is to train the ML model on the output of a traditional physics-based model (Kim et al., 2019;Pathak et al., 2017).The result is more computationally efficient but at the cost of accuracy.In the context of DA, it has been demonstrated that model surrogates can be successfully trained iteratively using DA state estimates (Brajard et al., 2020;Gottwald & Reich, 2021).
Extending this approach beyond demonstrating that ML can capture the dynamics of complex and chaotic systems, augmented approaches that use model surrogates only to represent scales unresolved by the physical model (Brajard et al., 2021;Gottwald & Reich, 2021) have shown that the integration of ML as a model surrogate can generate improvements over traditional DA methods.Other work has demonstrated the utility of using ML model surrogates to increase the ensemble size beyond what would be otherwise practical (Wu et al., 2021;Yang & Grooms, 2021).Related to the issue of unresolved scale and model resolution, ML has been employed to successfully generate parameterizations used to capture unresolved physics, making the generation of the tangent linear models needed in many variational DA methods more efficient (Hatfield et al., 2021).
The observation operator, which can be another computational bottleneck, has also been targeted using ML (Geer, 2021;Jung et al., 2010;X. Liang et al., 2022;J. Liang et al., 2023;Stegmann et al., 2022;Wang et al., 2022).Other efforts have used ML methods to identify regions of tropical cyclone activity to target high-resolution modeling in a subdomain (Lee et al., 2019), to perform bias correction of model forecasts before they are fed into the assimilation algorithm (Arcucci et al., 2021;Chen et al., 2022), and apply time-varying localization to the covariance structure of the system (Lacerda et al., 2021).

10.1029/2023MS003774
3 of 16 In contrast, relatively limited efforts have been directed toward using ML to perform the assimilation step directly.Rather than using ML to replace pieces of the DA process, we propose an augmented DA method in which a ML model is trained offline to assimilate high-resolution measurements.Convolutional neural networks (CNN) are particularly good candidates for assimilating spatial data and learning the spatial correlation structure of the system of interest, as their design and main demonstrated use cases rely on their ability to identify spatial patterns and localize input data (Dong et al., 2021;Mallat, 2016).
In a real-world scenario, computational resource bottlenecks require some high-resolution observations to be either thinned before being assimilated or discarded entirely.As a proof-of-concept demonstration for our proposed method, we use a synthetic system with observations available at regular time intervals.The observations are alternatively high-resolution or low-resolution.Low-resolution observations are always assimilated using the EnKF, and in the augmented method, high-resolution observations are assimilated using a CNN trained on analyses produced in an offline setting where computational constraints are less pressing.This study will explore these two hypotheses: 1.A shallow CNN can be successfully trained to reproduce the analysis of the EnKF offline 2. When used online to assimilate otherwise ignored high-resolution data, with the traditional EnKF used for low-resolution data, assimilation performance will be improved with the chaotic Lorenz-96 model as the test system.Yet the relevance is broader with applications in weather and climate predictions.Section 2 describes the Lorenz-96 modeling system, the ML augmentation of EnKF framework and the explainable AI methodology.Section 3 presents the results from the experiments and analyses performed.Section 4 summarizes the results and discussion and Section 5 concludes.

Lorenz-96 System
The Lorenz-96 system is described by a set of N discrete differential equations, designed to mimic some behaviors of the atmosphere (Lorenz, 2005).It is commonly used for testing DA methods.It is defined as a 1-D analog of an atmospheric state variable at discrete points evenly spaced in the zonal direction, with its dynamics governed by: for i ∈ [1, N] and F a constant forcing term.The system is cyclically symmetric with grid point i = N + 1 equal to grid point i = 1.We use F = 8, a value for which the system is known to be chaotic, and N = 40, a typical value for testing DA methods.
We numerically integrated Equation 1 forward to generate a reference trajectory for our experiments.A fifth-order Runge-Kutta method was used, with a variable time step to control error assuming fourth-order accuracy (Dormand & Prince, 1980), as implemented in the SciPy package (Virtanen et al., 2020).The maximum allowable relative error was set to 0.001 and the maximum allowable absolute error to 10 −6 .The system is in an unstable equilibrium when all variables are equal to F; initial conditions were set by perturbing one of the variables to a value of 8.01.The system was integrated out to t = 2,000 and output was generated at time intervals of Δt = 0.05 generating data for a 40 variable vector at 40,000 time steps, or 800,000 data points representing the true time evolution of the system.Synthetic observations were generated by adding normally distributed random noise with a standard deviation equal to 30% of the standard deviation of the reference state.This level of observation noise is consistent with other work done using the Lorenz-96 system for testing DA methods (Brajard et al., 2020;Hatfield et al., 2018;Hoteit et al., 2008).

The Ensemble Kalman Filter
Data assimilation combines model forecasts and observations and solves the filtering problem.The filtering problem is to generate a minimum-variance estimate of a state vector,    , conditional on a noisy forecast and a noisy observation.The state vector evolves in time with model dynamics represented by a forward operator M. The time evolution of the system is defined iteratively; the system states at times t i and t i+1 ,    and   +1 are related via: where μ is the assumed model forecast error.Observations y i at time t i are related to the state vector via an observation operator, H: where ν is the assumed observation error.
The solution to the filtering problem is referred to as the analysis.When forecast and observation errors are unbiased, normally distributed, and independent, and the forecast model and observation operator are linear, the Kalman filter (KF) provides the closed-form optimal solution to the filtering problem (Kalman, 1960).
In earth system applications, the system's time evolution and thus the forecast models are often non-linear.The EnKF is an extension of the KF that accommodates nonlinear models at the cost of being an approximate, rather than exact, solution to the filtering problem by using an ensemble of model forecasts (Evensen, 2003).The EnKF analysis equation is: Here, X a is a matrix whose column vectors are analysis ensemble members.X f is a matrix whose column vectors are individual forecasts.C is the sample covariance of the ensemble forecast, X f , used as an approximate representation of the true covariance, R is the observation error covariance matrix, and Y is a matrix whose columns are the observation vector y i .To ensure that the covariance does not systematically underrepresent the true error, random Gaussian noise with covariance R is added to the observation matrix Y (Burgers et al., 1998;Evensen, 2003).
The EnKF assumes normality for the forecast and observation errors,    and    , although has been demonstrated to be somewhat robust to non-Gaussian distributions (Reichle et al., 2002).Also relevant for earth system models in which the state space is very large, the EnKF can be effective even when the number of ensemble members is much smaller than the size of the state space.This contrasts with more exact methods, such as particle filters, which often exhibit stability problems in such situations (Farchi & Bocquet, 2018;Hoteit et al., 2008).
The EnKF is known to be vulnerable to issues associated with the fact that it approximates a PDF with samples represented by a finite number of ensemble members (Ehrendorfer, 2007).These issues include spurious correlations, in which deviations of the sample covariance from the true covariance degrade filter performance, and covariance collapse, in which the ensemble becomes sharply clustered at a point in state space far away from the true state.To address spurious correlations, localization is often used, a technique that has been shown to improve performance by reducing the impact of spurious correlations of the system state at distant grid points (Evensen, 2003).For this application, we used a step function to localize the covariance, setting any covariance between variables greater than five grid points apart equal to zero (Houtekamer & Mitchell, 1998).
Covariance inflation, in which all covariance values are multiplied by a factor greater than one before computing Equation 4, is another technique used for improving the stability and performance of the EnKF by reducing the risk of covariance collapse.Our base settings did not include covariance inflation as our initial experiments did not show significant improvements employing it.Both approaches can improve performance in some circumstances by preventing covariance collapse (Evensen, 2003).However, since both address issues created by the finite size of the ensemble, they become less necessary with larger ensemble sizes and must be tuned (Miyoshi et al., 2015).As such, they are appropriate parameters to vary in our sensitivity analysis in order to identify optimal values.
We use the EnKF here for two purposes: to generate training data for a CNN and as a benchmark to evaluate the performance of our augmented method.After assimilating the synthetic observations with the EnKF using the settings described above, the following data are available at all 40,000 time steps: the true model state, the model forecast, the synthetic observation, and the EnKF analysis.Other combinations of settings were also used to assess sensitivity.These are outlined in Table 1.Observation error is specified as the standard deviation of the added noise used to generate the synthetic observations as a fraction of the system standard deviation.Note.Observation error is presented as observation noise standard deviation as a fraction of the total system standard deviation.

Table 1
Sensitivity Settings for the Ensemble Kalman Filter Runs 10.1029/2023MS003774 5 of 16

CNN Architecture and Training
The ML model is a CNN with several hidden layers.Its architecture is shown in Figure 1.The input layer has three channels: the model ensemble forecast mean, the difference between the forecast mean and the observation (the innovation), and the ensemble forecast standard deviation.The output layer contains two channels: the ensemble analysis mean and standard deviation, generating a probabilistic prediction of the analysis state.All input and output variables are drawn from the same time step.
The following parameters for each layer define a CNN: the filter size, the number of feature maps, and the activation function (Alzubaidi et al., 2021).
The weights of each convolutional filter are independent of space and are applied uniformly across the domain.Each feature map in a hidden layer is assigned different filter weights and a constant bias weight.The ReLu activation function is used for hidden layers and no activation function is applied to the output, which is then a linear combination of the activation values in the second hidden layer plus the value of the bias weight.
Output from a convolution has a smaller dimension than its input, since locations on the edge of the domain don't have a neighboring point on one side.
In image recognition and other similar tasks, zero-padding addresses this issue by artificially adding zeros to the input on the edges of the domain.
Here, with a cyclically symmetric system, zero-padding is not an appropriate solution.Instead, we implemented cyclic padding such that the neighboring spatial nodes for i = 1 are i = N and i = 2. Similarly, the neighboring spatial nodes for i = N are i = 1 and i = N − 1.The data from spatial locations on one end of the domain are therefore concatenated to the other before being passed to the input layer.This maintains the cyclic nature of the Lorenz-96 system and ensures that the size of the CNN output matches the dimensions of the system.
The training data comprises the first half of the EnKF analysis states, for times t = 0 to 1,000 covering 20,000 individual time steps.The data set from the best-performing EnKF sensitivity run settings described in Table 1 was used for training.The Adam optimizer (Kingma & Ba, 2015) was used to train the model, using 25 training epochs and 100 batches per epoch.We treated the filter size, number of hidden layers, and number of feature maps per hidden layer as tunable hyperparameters and tested various combinations to identify a set that performed well.We trained CNN's with 96 combinations of hyperparameters, with filter sizes of 1, 3, 5, or 7, the number of hidden layers varying from 1 to 4, and the number of feature maps per hidden layer varying from 2 to 7. Based on the results of this tuning, the chosen CNN has a filter size of 3, 5 hidden layers, 7 feature maps per hidden layer, and 576 trainable weights, with an input shape of 50 × 3 and an output shape of 40 × 2.

Augmented Method and Experimental Setup
The augmented method is designed to be applicable to a scenario in which high spatiotemporal resolution observations are available but not fully assimilated due to computational resource constraints.This can be addressed by thinning the observations spatially and/or temporally before assimilation (Zeng et al., 2018).To create an analog of this scenario with the Lorenz-96 system, we created three sets of thinned observations for the second half of the time series (t = 1,000.05to t = 2,000): 1. Data temporally thinned by discarding every other time step and spatially thinned by observing a randomly selected 25% of the variables at retained time steps 2. No temporal thinning is used.Observations are spatially thinned by randomly selecting 12.5% of the variables at each retained time step 3.No temporal thinning is used.Observations are spatially thinned by randomly selecting 25% of the variables at each retained time step.
The EnKF using base settings was then used to assimilate each set of observations, which will subsequently be referred to as EnKF SparseObs, EnKF SparseObs2, and EnKF SparseObs3, respectively.The method used to generate the training data, in which no thinning was applied to observations before being assimilated by the EnKF, will be referred to as "EnKF AllObs." The augmented method uses the EnKF to assimilate low-resolution spatially thinned observations.Before the temporal thinning is applied to the observations in EnKF SparseObs, on alternating time steps, a "high-resolution" observation is available, including observations of 100% of the variables.For EnKF SparseObs these observations are assumed to be prohibitively computationally expensive to assimilate and are therefore ignored during thinning.The forecast continues on to the next time step where the next thinned observation is available and assimilated by the EnKF.In the augmented method, the CNN assimilates the high-resolution observations not assimilated by the EnKF.A flow chart demonstrating the operation of the augmented method is shown in Figure 2.
The CNN takes the ensemble forecast mean, ensemble forecast standard deviation, and the innovation as input and returns the analysis ensemble mean and standard deviation as output, creating a probabilistic estimate of the analysis state.At this stage, an ensemble must be re-created to generate an ensemble forecast for the next time step (where the EnKF will be used to assimilate the low-resolution observation).One potential method would be to assume a diagonal analysis covariance matrix and re-sample the ensemble using the CNN-predicted means and standard deviations.Initial testing showed this approach to be ineffective, likely due to ignoring off-diagonal non-zero terms in the covariance resulting in a degraded ensemble forecast.
Instead, we create a new ensemble by scaling the forecast ensemble perturbations by the ratio of CNN-predicted standard deviations to forecast standard deviations.Defining the adjusted perturbation of ensemble member i at location j as: provides new perturbations that when added to the analysis mean will produce an updated ensemble.Here  δ is the adjusted perturbation, δ f is the forecast perturbation, X f is the forecast ensemble, σ f is the forecast standard deviation, and σ a is the analysis standard deviation, with i and j indexing ensemble members and location.This technique will also ignore some off-diagonal information, and a blended perturbation created by taking a weighted average of the forecast and adjusted perturbations proved more effective.This weighting parameter was tuned and the optimal value was 0.3, with 0 giving full weight to the forecast perturbations and 1 full weight to the adjusted perturbations (Figure S1 in Supporting Information S1).
Figure 2. Flow chart of the experimental setup and augmented method.Ensemble Kalman Filter (EnKF) AllObs is provided with full observations at every time step.EnKF SparseObs is provided with observations 25% of the variables at every other time step.The augmented method is identical to EnKF SparseObs but is additionally provided with observations of 100% of the variables on alternating time steps and uses the trained convolutional neural networks to assimilate these.

Explainable AI
ML models are fast to run and accurate when sufficient training data are available.In many earth system science applications, the computational efficiency of traditional tools is a significant bottleneck and available training data is voluminous.These models have a major drawback, however: models are a black box and it is therefore often not clear how they are generating their predictions (Gevaert, 2022).Using testing and validation data sets can provide some level of confidence in the models by demonstrating their level of accuracy on data not used for training.If they are to be deployed in something like an operational weather forecast system, however, such demonstrations may not provide a sufficient level of confidence.Out-of-sample input data cannot be guaranteed never to occur, and no guarantees can be made about the behavior of the ML model when presented with such data.
A variety of tools are available to make otherwise opaque data-driven models more transparent, collectively referred to as Explainable AI (XAI) methods (Linardatos et al., 2021).Shapely Additive Explanations, or SHAP, is one such tool.SHAP quantifies the impact of a specific input variable on the output generated by a model.The method is model-agnostic and is equally applicable to a simple linear regression model or a deep neural network with millions of trainable parameters.Full details and a formal definition can be found in Lundberg and Lee (2017).Heuristically, a SHAP value is the anomaly in an output variable attributable to the anomaly in an input variable.It provides a way of apportioning the deviation from the mean in the output to each input variable.This information can increase confidence that the trained model is reliable and provides insights into the structure of the underlying system.
We apply it here to analyze how the trained CNN assimilates observations and ensemble forecasts.In a DA context, the behavior of the CNN should be predictable and consistent with our understanding of the dynamics of the Lorenz-96 system; it should not, for example, heavily weight forecasts from highly spatially distant nodes.
Evaluating the CNN in this way can provide confidence in its ability to perform well when presented with new data.

Sensitivity and Training
The results of the base run and 9 sensitivity runs using the EnKF are outlined in Table 2.These runs are intended to identify optimal settings for generating training data, with the EnKF assimilating all observations at every time step (i.e., the high-resolution observation at every time step).The performance for each run is evaluated as the mean analysis root-mean-squared error (RMSE) divided by the standard deviation of the observation error.For all 10 cases, the EnKF analysis has a lower error than the observation error, as expected, with all runs achieving better than 24% on this metric.As the EnKF approximates the optimal solution by using the first two moments of the forecast ensemble to represent a normal distribution, errors caused by the finite size of the ensemble are expected to decrease with ensemble size.This is evident in our results, which show larger ensemble sizes producing lower errors.
Localization and covariance inflation can improve EnKF performance by mitigating errors related to finite ensemble size but can be detrimental for larger ensemble sizes as such errors become less important.We expect the performance to be dependent on localization and inflation settings but it is not clear a priori which values will be optimal.The best-performing combination of settings was run s9 with localization of 7 grid spaces and covariance inflation factor of 1 (i.e., no inflation).These are the EnKF settings used for training the CNN and used in the augmented method.A run using no localization was also tested, which resulted in filter divergence (not shown).
The results from the training process are shown in Figure 3.The training targets were the EnKF ensemble analyses and standard deviations produced using observations of all variables as described in Section 2.3.CNN error can be considered in terms of how well CNN output matches the EnKF results and the deviation of its predictions from the true state.Errors with respect to true states in the second half of the time series, that is, the set not used for training, are treated as validation metrics.As the CNN produces probabilistic predictions, we evaluated several error metrics to assess performance, shown over 25 training epochs in Figure 3. Figures 3a and 3b show the mean absolute error and RMSE of the state predictions, with the latter shown with respect to both truth and training data.Over-fitting would be indicated by an increase of the validation error in Figure 3b even as the training error remained flat or decreased.This is not evident here, demonstrating that our trained CNN is not overfitting and produces reliable predictions when presented with data from outside its training set (Ying, 2019).
Low mean absolute error is also a desirable property, a feature evident in Figure 3b.
To evaluate the quality of the ensemble standard deviation predictions as well as the state predictions, we additionally normalize the errors by the predicted standard deviation.The normalized mean absolute errors and the standard deviation of the normalized errors are shown in Figures 3c and 3d, respectively.The normalized mean absolute error should be small compared to 1 (with 1 indicating that the mean absolute error is 1 standard deviation from the true value).The standard deviation of the normalized errors would be 1 if the predicted standard deviation perfectly characterized the error.Therefore, a well-trained model will produce a value close to 1 for this metric.Both of these conditions are met.All metrics plotted show a small decrease in performance from epoch 24 to 25, and the weights from epoch 24 were therefore used for the final trained model.
Another check on the trained CNN is to compare its RMSE on the validation data set to the observation error.
To improve the state estimate, the CNN must perform better than observation error statistics alone.Using EnKF AllObs forecasts and residuals, the final validation RMSE with respect to the true state in Figure 3 is 23% of the observation standard deviation, compared to 21% for the EnKF AllObs analyses.
The sensitivity of the augmented method to different settings was tested.These results are shown in Table 2.The CNN trained on results using the s9 run settings was used for all runs.The s1 and s2 run settings were not tested; the CNN was trained on data with the observation error specified by s9 settings and the s1 and s2 settings are therefore not applicable for the augmented method.For those cases that are applicable, the augmented method performance was compared to the EnKF assimilating only the low-resolution observations (EnKF SparseObs).For all sensitivity settings, the augmented method outperforms EnKF SparseObs (Table 2).

Performance Comparison
Having shown that the trained CNN does not over-fit and that its error is 23% of observation error, we can now assess the performance of the augmented method using this CNN.Base settings from Table 1 were used for all runs presented in this section.In addition to the augmented method and EnKF SparseObs, the performance of EnKF AllObs assimilating observations of all variables at every time step is included for comparison.The analysis RMSE for all three methods is shown in Figure 4. Time is shown as earth-years with 1 model time unit is equivalent to 5 real days (Lorenz, 2005).EnKF AllObs performs best with an average RMSE of 0.22.Considering only the density and frequency of assimilated observations, this comparative overperformance is unsurprising as EnKF AllObs assimilates more data than the other methods.More interestingly, the augmented method performs better than EnKF SparseObs, with an average RMSE of 0.57 compared to 0.88, representing an improvement of 35%.F-tests and Levine tests performed on the errors are significant at the p < 0.01 level, and a Mann-Whitney U test performed on the RMSEs is also significant at this level.
Other than the time-averaged RMSE, the other thing to note is the variability of errors for EnKF SparseObs and the augmented method.The time series on the left is smoothed, and in this rolling average the augmented method consistently outperforms SparseObs at nearly all time steps.The histogram on the right shows the distribution of errors at all time steps for EnKF AllObs, EnKF SparseObs, and the augmented method.There is substantial overlap in the distribution of errors; using unsmoothed data, the augmented method outperforms in 45% of coincident time steps.However, the EnKF SparseObs errors have a notably fatter tail in the histogram.These spikes are periods where the state estimate diverged from reality, generating instabilities in the EnKF that resulted in large errors.
Despite having a similar error distribution to EnKF SparseObs, the augmented method does not have the same fat tail.It is better at maintaining the state estimate close to the true state, preventing instabilities and periodic spikes in the analysis error.This accounts for the consistent over-performance in the smoothed time series.The improved stability is an important factor in evaluating the relative performance and suggests that the augmented method is more reliable than what would be otherwise assessed based on the fact that it only outperforms EnKF SparseObs in 45% of time steps.The improved stability and reduced mean RMSE are clear benefits of exploiting all available data in assimilation using an efficient but possibly sub-optimal technique (the CNN) compared to using only the EnKF on spatially thinned observations.
The results of applying the EnKF to alternatively thinned observations described in Section 2 are not shown in Figure 4 for visual clarity but will be described here.Those results are included in the supplemental information (Figures S2 and S3 in Supporting Information S1).The average RMSE for EnKF SparseObs2 is 0.95, and the average RMSE for EnKF SparseObs3 is 0.56.Notably, the latter is comparable to the results for the augmented method.To the extent that the number of observations assimilated by a traditional method drives computational cost, EnKF SparseObs3 assimilates twice as many observations with the EnKF as the other methods on a per unit time basis.Despite this, the augmented method performs similarly at potentially lower computational cost.
Another way of assessing performance is to generate forecasts using analyses as initial conditions.Forecast skill over time can then be compared.These results are shown in Figure 5.The mean error of ensemble forecasts from a sample of initial conditions is plotted out to 10 days of lead time.As with the results in Figure 4, EnKF AllObs performs best, generating better forecasts for all lead times.The augmented method again outperforms EnKF Sparse-Obs.Out to greater than 10 days of forecast lead time, the RMSE of forecasts generated using initial conditions from the augmented method is statistically significantly better at the p < 0.05 level.Graphically, this is immediately evident as the 95% confidence interval bands do not overlap.
Figure 5 also shows another metric for evaluating the reliability of the ensemble forecasts.By definition, when errors are unbiased, the standard deviation of the error and the RMSE are equivalent.If the ensemble spread reflects the true error, then the actual RMSE should equal the ensemble standard deviation (Gneiting & Katzfuss, 2014;Leutbecher & Palmer, 2008).If the ensemble is overprecise with estimated errors smaller than actual, it is considered underdispersive.If the ensemble is under-precise, its spread overestimates actual errors and precision and is said to be overdispersive.Here, the ensemble standard deviation for all three methods is less than the RMSE for all times shown, indicating underdispersive ensemble forecasts that do not adequately represent the true forecast error.
For a more detailed examination of the reliability of ensemble state estimates, we examine the distribution of actual versus expected analysis errors.These Figure 5.The forecast accuracy out to 10 days using initial conditions produced by the augmented method as well as the Ensemble Kalman Filter (sparse and all obs).95% confidence intervals are included for all three methods based on the root-mean-squared error standard deviation across 1,000 randomly selected initial conditions.The mean ensemble standard deviations are also shown as dashed lines.Filter (EnKF) AllObs, EnKF SparseObs, and augmented methods.The mean RMSE for each method is represented by a horizontal dashed line.The first half of the time series includes only results for EnKF AllObs, which is used for training the convolutional neural networks.The second half of the time series includes EnKF SparseObs and augmented as well, with both initialized using the last analysis produced by the EnKF AllObs.The time series data is smoothed with a moving window of 60 days for readability.On the right, a histogram of the distribution of RMSE for all three methods is shown using the same axis as the time series plot.This plot uses unsmoothed data and as a result, the tails extend beyond the range of time series traces.
results (as opposed to the distributions of RMSE) can indicate whether forecasts are biased or poorly distributed.If errors are assumed to be Gaussian, then if scaled by the ensemble standard deviation the error distribution should follow a unit normal with a mean of zero.Conversely, if a normal distribution does not well represent the errors, or the ensemble standard deviations don't reflect the true analysis error statistics, the scaled distribution will diverge from the reference unit normal.This comparison is shown in Figure 6.
Figure 6a suggests that both EnKF SparseObs and the augmented methods produce slightly underdispersive forecasts, with the PDF peak lower and tails higher than the reference unit normal PDF.These features are also apparent in Figure 6b, which displays the same data but as a CDF instead.For negative errors, the CDF is higher than the reference distribution; for positive errors, it is lower.The augmented method better represents the true statistics of its error than EnKF SparseObs, with a standard deviation of 1.08 compared to 1.18, while a perfectly dispersive ensemble would have a standard deviation of 1.
Rank histograms are an alternative way of visualizing the dispersion of ensemble forecasts or analyses (Candille & Talagrand, 2005;Hamill, 2001;Hatfield et al., 2018;Weigel, 2011).For each forecast (or analysis), the percentile of the true value within the ensemble is calculated.When the distribution of the percentile values is plotted, a uniform distribution indicates a well-dispersed forecast.Errors of a given size occur as frequently as expected if the ensemble spread represents the true error statistics.A U shape is underdispersive, with errors outside the range of the ensemble over-represented.A tilt indicates a biased ensemble forecast, with positive errors more or less likely to occur than negative errors.Figure 7 includes rank histograms for both methods.It is more visually apparent here that both methods produce underdispersive forecasts.Small and large percentile frequencies are clearly larger than frequencies at or around the 50th percentile.

Explainable AI: SHAP Values
We now return to the behavior of the CNN in producing state estimates from forecasts and innovations.In an operational setting, allowing black-box operators to produce new initial conditions is not tenable.There must be some confidence that the system will not generate unrealistic results when presented with out-of-sample data, and some understanding of how it is producing its state estimates.Here we use SHAP values to estimate the impact of input variables on the outputs generated by the CNN (Figure 8).
Figures 8a-8f show the mean absolute SHAP values of each input variable (vertical axis) on each output variable (horizontal axis).The first column shows the impact of each of the three input variables on the analysis, and these plots show that the largest contributors to the state estimate of a variable are the forecast and the innovation of that variable.This is an excellent first sanity check on the CNN.In estimating a state variable, it weights the forecast and observation of that variable more heavily than forecasts or observations of nearby variables.
Figures 8g and 8h are identical to the plots from Figures 8a and 8f but with the first input variable not shown.The long-term spatial correlation structure of the Lorenz-96 system is important to note at this point.Since the system is symmetric, without loss of generality we can consider the dynamics and correlation between locations only in terms of absolute spatial lag.The dynamics of a state variable are nonlinearly dependent on the state variables at spatial lags of 1 and 2 as described in Equation 1.This dependence defines the temporal derivative.In non-differential terms, considering the variable values rather than rates of change, the correlation extends further than two grid points.At spatial lags of 1-4, the long-term  absolute correlation coefficients are 0.05, 0.33, 0.11, and 0.03.Variables at a distance of 2 and 3 grid points removed from one another are more highly correlated than immediately adjacent variables one grid point removed (Lorenz, 2005).It is also important to note that the convolutional layers will tend to create a binomial distribution here: larger lags have fewer paths of influence.This means that we should expect the SHAP results at increasing spatial lags to be a combination of a binomial distribution and the Lorenz-96 correlations.Figure 8 demonstrates monotonically decreasing SHAP values with increasing spatial lag consistent with the binomial influence of and forecast ensemble standard deviation on the predicted analysis state and standard deviation as a function of spatial offset.The impact of the forecast ensemble mean on the analysis mean and standard deviation is shown in panels (a) and (b), respectively.The impact of the innovation on the analysis mean and standard deviation is shown in panels (c) and (d), respectively.The impact of the forecast ensemble standard deviation on the analysis mean and standard deviation is shown in panels (e) and (f), respectively.Panels (g) and (h) show the same data shown in panels (a) and (f), respectively, with the results for a spatial lag of 0 omitted.the CNN. Figure 8d shows a peak SHAP value at a spatial lag of 2, consistent with the correlation structure of Lorenz-96.

Discussion
The augmented method's results are encouraging, showing that it outperforms a traditional approach using only the EnKF.However, considering the training process of the CNN makes some limitations apparent.First: the network is trained using only the ensemble mean and standard deviation, rather than the entire ensemble, as input.As a result, it can only learn the diagonal elements of the covariance structure to the extent that they are dependent on location in the state space.Other factors, most obviously the time since the last observation and analysis, will impact forecast covariance.The trained CNN cannot include such factors.
Even within the confines of the experiment we have set up, the training data's limitations impact performance.When employed online in the augmented method the CNN is provided as input a forecast initialized by assimilating only 25% of the variables.In comparison, all variables are observed in the EnKF configuration used to generate training data.In the online setting, therefore, the initial condition error will be larger and the forecast precision lower than in the training set.In the experimental setup, the augmented method has an RMS of 0.56, more than double the RMSE when it is simply applied to forecasts from the validation time period generated offline.This partly reflects that the augmented method simply assimilates less data.On a time-averaged basis, it is observing 62.5% of the observations assimilated in the training set, but that does not fully explain the greater than 2-fold increase in RMSE.The remaining decrease in accuracy is attributable to the smaller forecast errors in the training data compared to forecast errors in the online setting.
While the reduced performance of the CNN applied to an online setting as opposed to input data generated offline is unavoidable to some extent, future opportunities for improvement may be found by allowing the CNN to better approximate forecast accuracy.Providing additional input to the network, such as other portions of the forecast covariance or the entire forecast ensemble itself, may improve performance.Our results here provide no indication either way whether a neural network would be able to learn effectively from other input data, or how complex the network would have to be, but it is a potential avenue of further exploration.
The results from the SHAP analysis provide additional insights into the possible extensions of the approach.Localization is widely used to improve the performance of many assimilation systems.The SHAP values demonstrate that the trained CNN has applied localization to the forecast.The CNN also has learned the long-term correlation structure (teleconnections) of the system, applying a localization structure to the innovations consistent with that of the Lorenz-96 system.These are both reasons to think it is plausible that in future extensions convolutional layers may be able to generate spatial estimates that blend forecasts and observations in a way that is both reliable and skillfully reflects underlying system behavior and dynamics.

Conclusion
This study demonstrated a proof-of-concept augmented assimilation methodology in which ML was used to directly assimilate high-resolution observations for potential improvement of the performance of an assimilation scheme.Significant quantities of observational data, particularly from remote-sensing platforms, go unused in operational forecast models due to the computational cost and time required for to incorporate them into the model.The potential viability of training a ML model offline to assimilate this data could have a significant impact-improved state initialization has real and notable impacts on forecast quality, and the ability to use the vast amounts of newly available observational data products to that end is of clear benefit.
To demonstrate the potential feasibility of such an approach, we trained a small CNN to replicate the results generated by the EnKF on synthetic observations.Using the EnKF on low-resolution observations and the trained CNN on the high-resolution observations outperformed an EnKF assimilating only spatially and temporally thinned data.More specifically, in an experimental setting using the Lorenz-96 model, the analyses generated by the augmented method have a mean RMSE 37% lower than using the EnKF on only low-resolution observations.Forecasts using analyses generated by the augmented method as initial conditions produce lower RMSE up to a forecast lead time of 10 days.
Additionally, using an explainable AI method, we demonstrate that the trained CNN effectively applies localization and learns the underlying system's correlation structure (teleconnections) via training.Distant observations do not impact its estimates.The natural tendency of convolutional layers to exploit local spatial correlations in this way is encouraging for potential extensions to more realistic applications.It also generates confidence that such a method would both be reliable and generate physically realistic results when presented with new data.
Further studies are needed to demonstrate the ability of this approach to work in more complex systems and at scale.Testing using a quasigeostrophic model and more realistic observational data would be a logical next step.The demonstrated feasibility of the general approach in this proof-of-concept study will hopefully encourage additional efforts to address the large quantity of data that is currently unusable in an operational forecast setting using ML approaches.

Figure 3 .
Figure 3. Training performance of the chosen convolutional neural network's architecture over all 25 training epochs.Panel (a) shows the mean absolute prediction error in the state, panel (b) shows the prediction root-mean-squared error of the state, panel (c) shows the absolute mean prediction error normalized by the predicted standard deviation, and panel (d) shows the standard deviation of the state prediction errors normalized by the predicted standard deviation.Training error is with respect to Ensemble Kalman Filter analysis, validation error is with respect to truth.

Figure 4 .
Figure 4.A time series (left) and histogram (right) of the root-mean-squared error (RMSE) of the Ensemble KalmanFilter (EnKF) AllObs, EnKF SparseObs, and augmented methods.The mean RMSE for each method is represented by a horizontal dashed line.The first half of the time series includes only results for EnKF AllObs, which is used for training the convolutional neural networks.The second half of the time series includes EnKF SparseObs and augmented as well, with both initialized using the last analysis produced by the EnKF AllObs.The time series data is smoothed with a moving window of 60 days for readability.On the right, a histogram of the distribution of RMSE for all three methods is shown using the same axis as the time series plot.This plot uses unsmoothed data and as a result, the tails extend beyond the range of time series traces.

Figure 6 .
Figure 6.The distribution of augmented (red) and Ensemble Kalman Filter SparseObs (blue) analysis errors normalized by ensemble standard deviation as probability density plots (a) and cumulative probability plots (b).A reference unit normal distribution is also included in both plots in black.

Figure 7 .
Figure 7. Rank histograms of the observations with respect to the ensemble of analyses for Ensemble Kalman Filter SparseObs (left) and augmented (right) methods.The percentile in which an observation falls is on the x-axis, with the normalized frequency on y-axis.

Figure 8 .
Figure8.Estimated mean absolute SHAP values showing the impact of the forecast ensemble mean state, innovation, and forecast ensemble standard deviation on the predicted analysis state and standard deviation as a function of spatial offset.The impact of the forecast ensemble mean on the analysis mean and standard deviation is shown in panels (a) and (b), respectively.The impact of the innovation on the analysis mean and standard deviation is shown in panels (c) and (d), respectively.The impact of the forecast ensemble standard deviation on the analysis mean and standard deviation is shown in panels (e) and (f), respectively.Panels (g) and (h) show the same data shown in panels (a) and (f), respectively, with the results for a spatial lag of 0 omitted.
Note.Root-mean-squared error (RMSE) is presented as a fraction of the observation standard deviation for EnKF AllObs results to allow for comparison between different observation error settings.For SparseObs and augmented, raw RMSE values are used.Ensemble Kalman Filter (EnKF) AllObs, EnKF SparseObs, and Augmented Method Sensitivity Results