Machine Learning Enables Real‐Time Proactive Quality Control: A Proof‐Of‐Concept Study

To improve the forecast accuracy of numerical weather prediction, it is essential to obtain better initial conditions by combining simulations and available observations via data assimilation. It is known that some observations degrade the forecast accuracy. Detecting and discarding such detrimental observations via proactive quality control (PQC) could improve the forecast accuracy. However, conventional methods for diagnosing observation impacts require future observations to define a reference state, so PQC cannot be performed in real time in general. This study proposes using machine learning (ML) trained by a time series of analyses to obtain a reference state without future observations and thereby enable real-time ML-based PQC. This study presents a proof-of-concept using a low-dimensional dynamical system. The results indicate that ML-based and model-based estimates of observation impacts are generally consistent. Furthermore, ML-based real-time PQC successfully improves the forecast accuracy compared to a baseline experiment without PQC.


Introduction
Numerical weather prediction (NWP) is a core infrastructure of society and plays an important role in mitigating severe damage due to extreme weather. To improve the forecast accuracy of NWP, it is essential to obtain accurate initial conditions by statistically combining information from simulations and observations via data assimilation (DA). Operational centers such as the Japan Meteorological Agency (JMA) and the National Centers for Environmental Prediction (NCEP) routinely utilize many observations, such as radiosondes, aircraft, weather radars, and satellite measurements.
It is of interest to quantify to what extent each observation improves the forecast accuracy because such information, known as forecast sensitivity to observations or observation impacts, indicates the value of each observation. Observation impacts can be estimated by a variational method (Langland & Baker, 2004), an ensemble Kalman filter-based method known as ensemble forecast sensitivity to observations (EFSO, Kalnay et al., 2012), or a hybrid of the two (Buehner et al., 2018). The basic concept of these methods is to estimate the forecast error reduction at future time t + 1 due to the assimilation of a target observation at current time t (Figure 1a). When assimilating many observations, these methods can provide the contributions of all observations at the same time and are more efficient than performing many observation denial experiments (e.g., Yamazaki et al., 2021). Here, the forecast error is evaluated against future observations (e.g., Sommer & Weissmann, 2016; Todling, 2013) or a future reference state (analysis) that is obtained by assimilating future observations (e.g., Kalnay et al., 2012; Lien et al., 2018; Ota et al., 2013).
It is known that some observations degrade the forecast accuracy (e.g., Lorenc & Marriott, 2014). Such detrimental observations can be identified by their observation impacts. In particular, Hotta, Chen, et al. (2017) proposed proactive quality control (PQC), which denies detrimental observations diagnosed by EFSO. Previous studies demonstrated that PQC successfully improves the forecast accuracy in a low-dimensional dynamical system and in a low-resolution operational NWP system (Chen & Kalnay, 2019, 2020; Hotta, Chen, et al., 2017). However, in general, it is impossible to perform EFSO and PQC in real time because they require future observations. As a practical remedy, Hotta, Chen, et al. (2017) proposed using an early analysis obtained by assimilating only the observations that arrive at an earlier time.
Recently, machine learning (ML) has received much attention for improving NWP via various applications (e.g., Bonavita et al., 2021). Pioneering works by Pathak et al. (2017, 2018) demonstrated that an ML method known as reservoir computing (RC, Jaeger, 2001; Jaeger & Haas, 2004; Lukoševičius & Jaeger, 2009) successfully predicts the evolution of low-dimensional dynamical systems and their chaotic behavior. Tomizawa and Sawada (2021, hereafter TS21) indicated that the forecast accuracy of RC trained by a time series of analyses can outperform that of a physics-based model if the model is imperfect and contains a non-negligible bias. Similarly, Arcomano et al. (2020, 2022, 2023) showed that RC surrogate models trained on reanalysis data and a low-resolution NWP model can provide reliable global weather predictions. More recently, Bi et al. (2023) and Lam et al. (2023) have demonstrated that neural-network-based ML models can provide more skillful deterministic predictions than an operational physics-based NWP model.
Inspired by TS21, this study proposes using ML to obtain an estimate of the future reference state for EFSO and PQC. Namely, an ML surrogate model trained by a time series of DA outcomes (analyses) is used for predicting future analyses (Figure 1b), which could be more accurate than the physics-based model's prediction if the model is imperfect, as in most NWP models. This ML-based approach does not require future observations; it could therefore provide an estimate of observation impacts and enable real-time PQC. As the first proof-of-concept study, we compare the performance of ML-based EFSO with that of model-based EFSO and evaluate its application to PQC using a low-dimensional chaotic dynamical system known as the Lorenz96 model (Lorenz, 1996; Lorenz & Emanuel, 1998). Among ML methods, this study uses RC because RC has been successfully applied to the Lorenz96 model (TS21; Penny et al., 2022) and the training of RC is easier than that of more complicated ML architectures such as deep neural networks.
The rest of this article is structured as follows. Section 2 describes the experimental designs, including the DA system, ML, EFSO, and PQC. Section 3 provides results and discussion. Section 4 summarizes our findings and remarks.

The Lorenz96 Model
The Lorenz96 model (Lorenz, 1996; Lorenz & Emanuel, 1998) is a chaotic dynamical system and has been used in numerous DA studies (e.g., Hotta, Kalnay, et al., 2017; Kotsuki et al., 2017; Kurosawa & Poterjoy, 2021). The governing equation of this system is as follows:

dx_j/dt = (x_{j+1} − x_{j−2}) x_{j−1} − x_j + F,     (1)

Geophysical Research Letters
10.1029/2023GL107938
where x_j is the state variable at the jth grid point and F is a forcing term that characterizes the chaotic behavior. The boundary condition is cyclic. Similar to TS21, we used F = 8.0 for the true dynamical system (also referred to as the nature run), whereas a different value of F was used for the forecast model in DA. As commonly done, the model dimension was fixed at 40 (i.e., 1 ≤ j ≤ 40). Following Chen and Kalnay (2019), Equation 1 was integrated by the fourth-order Runge-Kutta method with a time step (Δt) of 0.05.
We performed the nature run by integrating the Lorenz96 model for 300,000 time steps. To initiate the chaotic dynamics, we set the initial conditions for the nature run at 8.0 for all grid points except the 21st grid point, which was set at 8.01. Observations were assumed to be available at all 40 grid points every Δt. As commonly done in previous studies with the Lorenz96 model (e.g., TS21; Miyoshi, 2005; Kotsuki et al., 2017; Kurosawa & Poterjoy, 2021), we generated observations by adding Gaussian noise N(0.0, 1.0) to the nature run.
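The setup above (Equation 1 with cyclic boundaries, fourth-order Runge-Kutta, Δt = 0.05, a slightly perturbed initial state, and N(0, 1) observation noise) can be sketched as follows. This is a minimal illustration, not the authors' code; the run length and random seed are shortened and arbitrary.

```python
import numpy as np

def lorenz96_tendency(x, F):
    # dx_j/dt = (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F with cyclic boundaries
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, F, dt=0.05):
    # Fourth-order Runge-Kutta integration with the time step used in the text
    k1 = lorenz96_tendency(x, F)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, F)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, F)
    k4 = lorenz96_tendency(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Nature run: F = 8.0, 40 grid points, 21st point perturbed to trigger chaos
x = np.full(40, 8.0)
x[20] = 8.01                       # 21st grid point (0-based index 20)
n_steps = 1000                     # 300,000 in the paper; shortened here
nature = np.empty((n_steps, 40))
for t in range(n_steps):
    x = rk4_step(x, F=8.0)
    nature[t] = x

# Observations: nature run plus Gaussian noise N(0.0, 1.0) at every grid point
rng = np.random.default_rng(0)
obs = nature + rng.normal(0.0, 1.0, size=nature.shape)
```

Using `np.roll` keeps the cyclic boundary condition implicit, which is why no special-case indexing is needed at j = 1 or j = 40.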

Data Assimilation
This study uses the local ensemble transform Kalman filter (LETKF, Hunt et al., 2007; Miyoshi & Yamane, 2007) as the DA method. As mentioned above, we used an imperfect forecast model with F = 10.0 for the DA experiments.
As in Chen and Kalnay (2019), to avoid potential complexity due to localization and its advection, the ensemble size was set at 40 and no localization was applied. We assimilated the generated observations every Δt. The initial ensemble was randomly drawn from the nature run between t = 50,000 and t = 90,000.
As a baseline, we performed a DA experiment without PQC (hereafter referred to as No-PQC). We started No-PQC at t = 100,000 and assimilated the observations until t = 300,000 (see Figure S1 in Supporting Information S1). The multiplicative covariance inflation factor was set at 1.62 after sensitivity experiments.
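With no localization, the LETKF configuration described above reduces to a single global ETKF update (Hunt et al., 2007). A minimal sketch of one analysis step is given below, assuming an identity observation operator and diagonal observation errors as in the experiments; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def etkf_analysis(Xf, y, obs_err=1.0, infl=1.62):
    """One global ETKF analysis step: identity observation operator,
    no localization, multiplicative inflation (a sketch of Hunt et al., 2007).
    Xf is the (n, k) forecast ensemble; y is the (n,) observation vector."""
    n, k = Xf.shape
    xf = Xf.mean(axis=1)
    Pert = (Xf - xf[:, None]) * np.sqrt(infl)      # inflated perturbations
    Y = Pert / obs_err                             # R^{-1/2} H X' with H = I
    d = (y - xf) / obs_err                         # normalized innovation
    # Analysis covariance in ensemble space: [(k-1) I + Y^T Y]^{-1}
    C = np.linalg.inv((k - 1) * np.eye(k) + Y.T @ Y)
    w = C @ Y.T @ d                                # mean-update weights
    # Symmetric square root of (k-1) C for the perturbation transform
    evals, evecs = np.linalg.eigh((k - 1) * C)
    W = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    xa = xf + Pert @ w
    return xa[:, None] + Pert @ W                  # analysis ensemble
```

The symmetric square root keeps the transform deterministic, which matters for EFSO because analysis perturbations are reused later.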

Machine Learning: Reservoir Computing
This study employs a simple recurrent neural network known as RC. Our RC aims to predict the next state of the Lorenz96 system from a given current state. When training RC, the input state at time t and the target output state at time t + 1 are obtained from training data. The training data are the time series of the analyzed ensemble mean from No-PQC because these data are the best estimate from ensemble Kalman filter systems (e.g., Section 3.1 in Houtekamer & Zhang, 2016). During the training process, many combinations of input and output states at different times are used for minimizing the objective function (see Equation 14 of TS21) by solving a ridge regression. When making predictions, RC's output serves as its input at the subsequent time, so that RC becomes recurrent.
In general, RC consists of three layers: an input layer, a reservoir layer, and an output layer (see Figure S2 in Supporting Information S1). The input layer and the reservoir are connected by an input weight matrix W_in, which maps an input vector u to the reservoir state r. RC uses the reservoir layer for memorizing past states and predicting next states. Namely, an output vector v at the next time t + 1 is obtained by

v_{t+1} = W_out f(r_{t+1}),     (2)

where W_out is an output weight matrix, f( ) is a nonlinear transformation proposed by Chattopadhyay et al. (2020), and the subscripts denote time. The time evolution of r is given by

r_{t+1} = (1 − l) r_t + l tanh(A r_t + W_in u_t),     (3)

where u is an input vector, A is an adjacency matrix that defines the connections among reservoir states, and l is the leaky rate that characterizes the reservoir dynamics. Notably, RC trains W_out only and uses fixed W_in and A, so that its training is not computationally demanding compared to more complicated ML architectures such as deep convolutional neural networks.
The architecture of our RC generally follows that of TS21 (Text S1 in Supporting Information S1). After training and optimizing hyperparameters (Text S2 and S3 in Supporting Information S1), our RC surrogate model achieved higher forecast accuracy than the imperfect Lorenz96 model (Figure S3 in Supporting Information S1), which is consistent with TS21.
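The reservoir update (Equation 3) and ridge-regression training of W_out can be sketched as below. The reservoir size, leaky rate, spectral radius, input scaling, and the squaring variant of the Chattopadhyay et al. (2020) readout transformation are all illustrative choices, not the paper's tuned hyperparameters (those are in Texts S1-S3 of its Supporting Information).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 40, 500              # reservoir size: illustrative choice
leak = 0.6                         # leaky rate l: illustrative choice
W_in = rng.uniform(-0.1, 0.1, size=(n_res, n_in))
A = rng.uniform(-1.0, 1.0, size=(n_res, n_res))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # scale spectral radius to 0.9

def step_reservoir(r, u):
    # r_{t+1} = (1 - l) r_t + l tanh(A r_t + W_in u_t)   (Equation 3)
    return (1 - leak) * r + leak * np.tanh(A @ r + W_in @ u)

def features(r):
    # One common variant of the nonlinear readout transformation f( ):
    # square every other reservoir component (assumed form, for illustration)
    f = r.copy()
    f[::2] = f[::2] ** 2
    return f

def train_readout(U, V, beta=1e-4):
    """Ridge regression for W_out only; W_in and A stay fixed.
    U is (T, n_in) inputs, V is (T, n_in) one-step-ahead targets."""
    r = np.zeros(n_res)
    R = np.empty((len(U), n_res))
    for t, u in enumerate(U):
        r = step_reservoir(r, u)
        R[t] = features(r)
    # W_out = V^T R (R^T R + beta I)^{-1}
    return V.T @ R @ np.linalg.inv(R.T @ R + beta * np.eye(n_res))
```

In autonomous prediction mode, the output `W_out @ features(r)` would be fed back as the next input `u`, which is what makes RC recurrent.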

EFSO and PQC
EFSO (Kalnay et al., 2012) is an ensemble-based method for diagnosing observation impacts. As shown in Figure 1a, EFSO measures the forecast error reduction due to the assimilation of observations at the current time (t in Figure 1a). To do so, EFSO uses the analysis obtained by assimilating future observations (t + 1 in Figure 1a), or the future observations themselves (Kotsuki et al., 2019; Sommer & Weissmann, 2016; Todling, 2013), as a reference state at future time t + 1. The time interval between the current time and the future time is referred to as the EFSO lead time. In Figure 1, the EFSO lead time is assumed to be one for simplicity, but it could be set longer. Indeed, Chen and Kalnay (2019) showed that the optimal EFSO lead time was between 16 and 21 time steps in their experiments with the bias-free Lorenz96 model. Sensitivity to the EFSO lead time will be discussed in Section 3.3.
The forecast errors of the two forecasts initiated at times t − 1 and t, both valid at time t + 1, are denoted as follows:

e_{t+1|t−1} = x^f_{t+1|t−1} − x^ref_{t+1},  e_{t+1|t} = x^f_{t+1|t} − x^ref_{t+1},

where e_{t+1|t−1} and e_{t+1|t} are the forecast errors valid at t + 1 for the forecasts initiated at t − 1 and t, respectively. Following Kotsuki et al. (2019), this study re-initiates the latter baseline forecast at t from the ensemble mean of the ensemble forecast initiated at t − 1 (i.e., the first guess at t) (see Figure 8 of Kotsuki et al., 2019). x^ref is the future reference state valid at t + 1, which is not available at t. This study obtains an estimate of x^ref using ML without observations at t + 1. This ML prediction, initiated from the analysis at time t, is expected to be more accurate than the forecast by the physics-based imperfect model initiated from the same initial conditions (see Figure S3 in Supporting Information S1), so that the ML prediction can serve as a reference state for EFSO.
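Given the two forecast errors above, EFSO partitions the total forecast error reduction among observations. A hedged sketch of the per-observation impact computation, following the form of the EFSO estimate in Kalnay et al. (2012), is shown below; the function name and argument layout are illustrative, and a diagonal R is assumed as in the experiments.

```python
import numpy as np

def efso_impacts(innov, R_diag, Ya, Xf, e_late, e_early, K):
    """Per-observation EFSO impacts, a sketch of Kalnay et al. (2012).

    innov   : (p,)   innovation y - H(xf_mean) at time t
    R_diag  : (p,)   diagonal observation error variances
    Ya      : (p, K) analysis ensemble perturbations in observation space
    Xf      : (n, K) forecast ensemble perturbations valid at t + 1
    e_late  : (n,)   e_{t+1|t},   error of the forecast initiated at t
    e_early : (n,)   e_{t+1|t-1}, error of the forecast initiated at t - 1
    Returns : (p,)   impact per observation; negative means beneficial
                     (the observation reduced the forecast error).
    """
    # Delta e^2 ~ (1/(K-1)) dy^T R^{-1} Ya Xf^T (e_late + e_early),
    # partitioned observation by observation via an elementwise product
    proj = Ya @ (Xf.T @ (e_late + e_early)) / (K - 1)
    return (innov / R_diag) * proj
```

In ML-based EFSO, only `e_late` and `e_early` change: the reference state x^ref used to compute them comes from the RC prediction instead of a future analysis.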
PQC (Hotta, Chen, et al., 2017) is an EFSO-based quality control (QC) method that denies some of the observations characterized by detrimental (positive) EFSO impacts. Detrimental observations contribute to deviating the forecast from the truth (see Figure 7 in Chen & Kalnay, 2019). By removing the analysis increments associated with detrimental observations, PQC aims to improve the analysis and subsequent forecasts. Hotta, Chen, et al. (2017) showed that PQC successfully improves the forecast accuracy with an operational global NWP system.
This study performs two PQC experiments in addition to the No-PQC experiment. The first experiment performs PQC with future observations and is referred to as model-based PQC. The other experiment uses the trained ML (RC) surrogate models for PQC; it does not require future observations and is referred to as ML-based PQC. In ML-based PQC, this study uses ML only for obtaining a reference state in PQC and employs the physics-based imperfect Lorenz96 model for ensemble and extended forecasts, as in model-based PQC. Similar to Chen and Kalnay (2019, 2020), the two PQC experiments use the forecast improved by PQC as the first guess when assimilating the next observations. We performed these two PQC experiments from the analyzed ensemble of No-PQC at t = 200,000 for 2,000 cycles using the imperfect Lorenz96 model. The resulting ensemble-mean analyses of the last 1,000 cycles were used as the initial conditions of deterministic forecasts. This study evaluates the forecast accuracy of these forecasts.
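The denial step itself is simple once impacts are available: rank observations by EFSO impact, drop the most detrimental one(s), and repeat the analysis without them. The sketch below illustrates this selection; the repeat-the-analysis strategy is one option, and Hotta, Chen, et al. (2017) also discuss cheaper ways of subtracting the associated increments. Names and the interface are hypothetical.

```python
import numpy as np

def pqc_deny(impacts, y, H_rows, n_deny=1):
    """Select and remove the n_deny most detrimental observations
    (largest positive EFSO impact); the PQC analysis is then obtained
    by re-running the ensemble update with the reduced observation set.

    impacts : (p,)   per-observation EFSO impacts (positive = detrimental)
    y       : (p,)   observation values
    H_rows  : (p, n) rows of the observation operator
    Returns the reduced y and H and the denied indices."""
    order = np.argsort(impacts)            # ascending: most beneficial first
    deny = order[-n_deny:]                 # largest (most detrimental) impacts
    keep = np.setdiff1d(np.arange(len(y)), deny)
    return y[keep], H_rows[keep], deny
```

With `n_deny=1` this matches the 1-of-40 (2.5%) denial used in the experiments of Section 3.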

Comparison Between Model-Based EFSO and ML-Based EFSO
Before evaluating the performance of ML-based PQC, we compare the EFSO impacts obtained from model-based EFSO and ML-based EFSO (Figure 2). The EFSO lead time is set at 10 time steps for the moment; the sensitivity to this choice will be discussed in Section 3.3. Surprisingly, the observation impacts from model-based and ML-based EFSO are generally similar, with a correlation coefficient of 0.64, although ML-based EFSO tends to slightly underestimate the absolute values. This indicates that ML-based EFSO can provide useful estimates of observation impacts even though it requires no future observations. ML-based EFSO cannot always detect the most detrimental observation accurately, but the observations it detects are considerably detrimental in most cases. Namely, the most detrimental observation diagnosed by ML-based EFSO is included in the worst 10% of detrimental observations diagnosed by model-based EFSO with a probability of >70% (see Figure S4 in Supporting Information S1). This high probability is quite encouraging because Chen and Kalnay (2019) indicated that the worst 10% of detrimental observations mainly contribute to degrading the forecast accuracy. However, it is difficult for ML to detect all of the worst 10% of observations at the same time (Figure S5 in Supporting Information S1). Therefore, we chose to deny only 1 of 40 observations (i.e., 2.5%) in the PQC experiments.

Application to PQC
Next, we evaluate the performance of ML-based PQC for improving the forecast accuracy compared with model-based PQC. As shown by the magenta and black curves in Figure 3, ML-based PQC successfully improves the forecast accuracy compared to No-PQC. The forecast improvement is especially large near the forecast time of 10 time steps, which corresponds to the EFSO lead time. As expected, the forecast improvement is smaller than that obtained by model-based PQC (gray dashed curve in Figure 3) because ML-based EFSO cannot always detect the most detrimental observations accurately (Figure S4 in Supporting Information S1).
Although the above result is promising, the forecast improvement by ML-based PQC becomes unclear if the number of denied observations is four (i.e., the worst 10%) (Figure S6 in Supporting Information S1). This differs from the findings of Chen and Kalnay (2019), who showed that the largest improvement by model-based PQC was found when denying the most detrimental 10% (or more) of observations. The different response of ML-based PQC can be explained by its limited accuracy in estimating observation impacts. Namely, the worst 10% of observations diagnosed by ML-based EFSO are included in those diagnosed by model-based EFSO with a probability of 10% at best (Figure S5 in Supporting Information S1). Nevertheless, denying only one detrimental observation by ML-based PQC actually improves the forecast accuracy and is still promising because it requires no future observations and can be performed in real time.

Sensitivities to EFSO Lead Time and the Forcing Term F
The forecast improvement by ML-based PQC depends on the EFSO lead time (Figure 4). Here, the relative improvement of forecast errors is defined as

Improvement (%) = (RMSE_No-PQC − RMSE_PQC) / RMSE_No-PQC × 100,

where RMSE is the forecast root-mean-squared error averaged over 1,000 forecasts in each experiment and the subscript PQC denotes either model-based PQC or ML-based PQC. As indicated by Hotta, Chen, et al. (2017) with model-based PQC, the relative improvement by ML-based PQC is broadly distributed around the EFSO lead time.
The largest improvement by ML-based PQC is approximately 7% for an EFSO lead time of 10 time steps (green curve in Figure 4).
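The relative-improvement metric above is a one-liner; the sketch below makes its sign convention explicit (positive means PQC beats No-PQC).

```python
def relative_improvement(rmse_pqc, rmse_nopqc):
    # Improvement (%) = (RMSE_No-PQC - RMSE_PQC) / RMSE_No-PQC * 100
    # Positive values mean the PQC experiment has lower RMSE than No-PQC.
    return (rmse_nopqc - rmse_pqc) / rmse_nopqc * 100.0
```

For example, an RMSE of 0.93 under PQC against 1.0 without it gives a 7% improvement, matching the scale of the largest gain reported above.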
The improvement by ML-based PQC depends on the forecast model bias. This is because the advantage of ML predictions over those of imperfect physical models relies on the degree of the model bias (TS21). Indeed, ML-based PQC fails to improve the forecast accuracy when using the perfect model with F = 8.0 (see Figure S7 in Supporting Information S1). When using the forecast model with F = 6.0, the improvement by ML-based PQC is slightly larger (see Figure S8 in Supporting Information S1), probably due to reduced forecast error growth. Since operational NWP models are certainly imperfect and have non-negligible biases, our proposed ML-based real-time PQC can be applied to operational NWP.

Summary and Concluding Remarks
To improve the forecast accuracy of NWP, it is vital to assimilate available observations. Recent studies have indicated that some observations actually degrade the forecast accuracy and that denying such detrimental observations via PQC may improve the forecast accuracy. However, estimating observation impacts on forecasts with conventional methods requires a reference state obtained from future observations and cannot be real-time in general. This study proposes using an ML method known as RC to obtain an estimate of the future reference state.
By doing so, it would be possible to estimate observation impacts and to deny detrimental observations as real-time ML-based PQC.
As the first proof-of-concept study, we have implemented ML-based EFSO into a Lorenz96 DA system. RC surrogate models have been trained by a time series of the analyses from No-PQC. A comparison of EFSO impacts between model-based EFSO and ML-based EFSO shows good agreement with a high correlation coefficient. As expected, the most detrimental observations diagnosed by the two methods are not always the same, but the most detrimental observations diagnosed by ML-based EFSO are actually detrimental in model-based EFSO in most cases.
Encouraged by these promising results, we have evaluated the impacts of ML-based PQC on the forecast accuracy. The results indicate that ML-based PQC clearly improves the forecast accuracy compared to No-PQC.
The relative forecast improvement against No-PQC is approximately 7% for an EFSO lead time of 10 time steps. It should be emphasized that this forecast improvement is achieved in a real-time manner without using future observations. Furthermore, the improvement by ML-based real-time PQC is large when the forecast model has a bias, which is the case for operational NWP models.
Our proposed ML-based method enables real-time PQC for the first time. While we have used RC for the proof-of-concept, other ML architectures could be used for ML-based PQC if predictions by ML trained with a time series of analyses can outperform those by a biased physics-based model (TS21). In this regard, recent studies have shown that other ML architectures, such as deep neural networks trained by long-term reanalysis data, can achieve better forecast accuracy than an operational NWP model (e.g., Bi et al., 2023; Lam et al., 2023). By using these trained ML models, the proposed method could be implemented into operational NWP systems that routinely diagnose observation impacts. Another interesting research direction is to use ML not only for PQC but also for other components of DA (e.g., Penny et al., 2022; Tsuyuki & Tamura, 2022; Yasuda & Onishi, 2023).

Although we have used ML only for obtaining a reference state for PQC, it would be technically possible to replace ensemble and extended forecasts by a physics-based model with those by ML. However, it is unclear whether ML can respond to small perturbations in the initial conditions of ensemble forecasts as appropriately as a physics-based model. Indeed, Penny et al. (2022) indicated that the error growth of their ML model was insufficient at very short time scales even though the forecast error correlations were well reproduced by ML. In addition, Lam et al. (2023) showed that their ML model exhibited blurring at longer lead times despite its good forecast accuracy in terms of skill scores. A recent study by Price et al. (2023) has addressed these issues and enabled reliable ML-based probabilistic forecasts at low resolution. Yet, more importantly, as pointed out by Lam et al. (2023), ML largely depends on accurate training data provided by DA with a physics-based model. Therefore, it would not be practical to immediately replace all forecasts by a physics-based model with ML. Nevertheless, ML could contribute to improving the forecast accuracy of NWP in various ways, and the method proposed in this study is one such possibility. Combining ML and physics-based NWP models, rather than relying solely on either of them, would be a promising way of further improving NWP.
Our proposed ML-based PQC can be regarded as an indirect method of bias correction. Namely, ML-based PQC aims to align forecast states closer to ML predictions with a small bias by rejecting detrimental observations. Therefore, it would be important to compare the performance of ML-based PQC with other bias correction methods (e.g., Amemiya et al., 2023; Bhargava et al., 2018; Chen et al., 2022; Danforth et al., 2007; Farchi et al., 2021) in future research. A drawback of ML-based PQC is that it cannot be used for online bias correction, in which model tendencies are directly corrected. A potential advantage of ML-based PQC over these methods is that it does not require the preparation of long-term analysis increments for estimating the model bias or for training ML, as in those studies. Although training ML models for ML-based PQC requires long-term analysis data, this requirement can be bypassed by employing already-trained ML models, such as GraphCast (Lam et al., 2023).

Figure 1. Schematic of (a) ensemble forecast sensitivity to observations (EFSO) without ML and (b) EFSO with ML. Gray dashed curves are the true time evolution of an atmospheric state variable. Red marks are analyses. Blue marks are observations. The green arrow in (b) denotes the ML prediction initiated at the current time t. Black curves are forecasts initiated at the previous time t − 1 and at t. Forecast error reduction is evaluated using the difference of the two forecasts against (a) the analysis at future time t + 1 obtained by assimilating observations at t + 1 and (b) an ML-based reference state obtained without assimilating the observations at t + 1. In this example, the EFSO lead time (the time difference between the current and future times) is assumed to be one for simplicity.

Figure 2. Scatter plot of observation impacts estimated by model-based EFSO and ML-based EFSO. The correlation coefficient (R) of these observation impacts is 0.68. EFSO lead time is 10 time steps (one step is Δt = 0.05). The gray dashed line is the diagonal reference line.

Figure 3. Time evolution of the forecast root-mean-squared errors averaged over 1,000 forecasts by the same imperfect Lorenz96 model (F = 10.0) initiated from the analysis ensemble mean in the (magenta) ML-based PQC, (black) No-PQC, and (gray, dashed) model-based PQC (not real-time) experiments. EFSO lead time is 10 steps (one step is Δt = 0.05). The number of observations denied by PQC is one.

Figure 4. Time evolution of the relative improvement (%) of forecast root-mean-squared errors averaged over 1,000 forecasts initiated from the analysis ensemble mean in ML-based PQC against those in No-PQC. The model forcing F for the DA system (nature run) is 10.0 (8.0). The ensemble size and inflation coefficient for the DA system are 40 and 1.62, respectively. EFSO lead time is (blue) 1, (yellow) 5, (green) 10, (red) 20, and (purple) 30 time steps. Colored stars indicate the highest improvement (%) of model-based PQC (not real-time) against No-PQC for each EFSO lead time. The location of each star on the forecast-time axis corresponds to when the improvement in model-based PQC becomes highest for that EFSO lead time.