Building Tangent-Linear and Adjoint Models for Data Assimilation With Neural Networks

We assess the ability of neural network emulators of physical parametrization schemes in numerical weather prediction models to aid in the construction of linearized models required by four-dimensional variational (4D-Var) data assimilation. Neural networks can be differentiated trivially, and so if a physical parametrization scheme can be accurately emulated by a neural network then its tangent-linear and adjoint versions can be obtained with minimal effort, compared with the standard paradigms of manual or automatic differentiation of the model code. Here we apply this idea by emulating the non-orographic gravity wave drag parametrization scheme in an atmospheric model with a neural network, and deriving its tangent-linear and adjoint models. We demonstrate that these neural network-derived tangent-linear and adjoint models not only pass the standard consistency tests but can also be used successfully to perform 4D-Var data assimilation. This technique holds the promise of significantly easing the maintenance of tangent-linear and adjoint models.

The success of the 4D-Var approach is crucially dependent on the construction of accurate tangent-linear and adjoint versions of every model component, especially for the physical parametrization schemes that represent, for example, the unresolved part of the fluid dynamics. Such an effort is a focus of significant activity at centers that use 4D-Var, such as ECMWF, Météo-France (Janisková, 1999), and the UK Met Office (Payne, 2021). Constructing these linearized models can either be achieved by manually differentiating the nonlinear code or by using an automatic differentiation tool. The comprehensive linear physics package of ECMWF was constructed through the former, manual approach after first constructing simplified and "regularized" versions of the nonlinear physics, which are computationally cheaper and give more stable behavior after linearization (Janisková & Lopez, 2013). This technique has been used successfully so far for the radiation, vertical diffusion, unresolved gravity wave drag, convection, and clouds and precipitation schemes, but could present limits to the overall scalability of the data assimilation system in the future (Bauer et al., 2020). Due to the strict nature of the linear code, namely that the nonlinear, tangent-linear and adjoint models must be formulated in a mutually consistent way, it is harder to port these model components to novel computational hardware, such as graphics processing units, in their current form. The main alternative to 4D-Var, the ensemble Kalman filter, does not require tangent-linear or adjoint models (e.g., Bonavita et al., 2015; Hamrud et al., 2015; Houtekamer et al., 2016; Schraff et al., 2016). However, this approach is not viable for ECMWF at this stage, for reasons outlined by Bonavita et al. (2017). It is therefore worthwhile to explore means to keep the 4D-Var algorithm competitive by alleviating the aforementioned scalability issues.
In this study, we propose a novel method of constructing tangent-linear and adjoint models for use in 4D-Var that relies on machine learning. Our new method consists of training a neural network to emulate one or more physics schemes and then, instead of coding tangent-linear and adjoint models of the physics scheme manually or automatically, using the tangent-linear and adjoint versions of the neural network emulator. The neural network code is significantly easier to linearize compared with a normal physical parametrization scheme, owing to the inherently trivial differentiability of neural networks and their regular structure. Furthermore, the same code can in principle be reused for the tangent-linear and adjoint versions of multiple model components, provided that the nonlinear counterparts can all be accurately emulated by the same neural network architecture. This also applies if one of those model components is changed: the emulator could simply be retrained to emulate the new scheme and the resulting parameters reused in the original tangent-linear and adjoint neural networks, without the need for further coding. Here we test this idea with the physics scheme that describes how gravity waves generated by non-orographic processes (e.g., tropospheric convection) travel to the stratosphere and break, thereby acting as a drag on the atmospheric flow.
A number of previous studies have successfully attempted to represent the effects of unresolved physical processes through neural networks (e.g., Brenowitz & Bretherton, 2018; Chevallier et al., 1998; Krasnopolsky et al., 2005; Rasp et al., 2018; Yuval et al., 2021) and we refer the reader to Krasnopolsky (2020) for a summary of the state-of-the-art. Some studies went further in quantifying the accuracy of neural networks by inspecting their Jacobians, that is, the matrix of first derivatives of all output variables with respect to all input variables. This is relevant to our effort here as the Jacobian and its transpose can be used in 4D-Var in a similar way to the tangent-linear and adjoint models. Aires et al. (1999, 2004) demonstrated that an otherwise accurate neural network can have a noisy Jacobian. This is because the cost functions typically employed when training neural networks do not explicitly encourage neural networks with a smooth first derivative, and so the authors proposed to adopt certain regularization constraints during the training step to alleviate this problem. Chevallier and Mahfouf (2001) found a similar issue with their own neural network emulation of the ECMWF radiation scheme, described by Chevallier et al. (2000), and so they used pre-computed mean Jacobian matrices in their 4D-Var experiments. This approach was used operationally at ECMWF for a time (Janisková et al., 2002). Finally, Krasnopolsky (2006) proposed to minimize noise in the Jacobians by averaging across an "ensemble" of neural network-derived Jacobians. The subject of neural network Jacobians is therefore well established in the literature. However, there are comparatively few recent studies on this key feature of neural network-based parametrization schemes. A recent study, Nonnenmacher and Greenberg (2021), considered the use of "flow-dependent" neural network Jacobians (as opposed to the pre-computed mean Jacobians of Chevallier & Mahfouf, 2001).
In this important work, the authors use neural network emulators of numerical model components to generate Jacobians, and test these Jacobians in a number of contexts including 4D-Var data assimilation. However, these experiments only considered the Lorenz '96 toy model, not a full complexity NWP model as in our case. For reasons we discuss in Section 3, the use of explicit Jacobian matrices in data assimilation is generally not practical for NWP, so here we attempt to use traditional techniques employed at ECMWF to construct tangent-linear and adjoint models of our neural network.
The experiments here are intended to complement those of our earlier publication Chantry et al. (2021). There we considered the use of neural networks in the nonlinear model (used to perform forecasts and compute trajectories in 4D-Var). Here we focus on the use of neural networks in the tangent-linear and adjoint models, continuing to use the original physics scheme in the nonlinear model. We specifically want to know if the derivatives of the neural network can be used in data assimilation, and we wish to decouple that from the exploration of neural network emulators for other purposes. Our approach, where only the derivatives of the neural network are used, may also be a natural choice in many circumstances, and we discuss this further in Sections 5 and 6.
In Section 2, we describe the chosen physics scheme and how we emulate it with a neural network. In Section 3, we describe how to construct the tangent-linear and adjoint models of a neural network. In Section 4, we explain the standard procedure for testing tangent-linear and adjoint model changes at ECMWF and apply these procedures to our neural network. In Section 5, we show results from data assimilation and weather forecast experiments using the neural network. Finally, we conclude in Section 6.

Neural Network Emulation of the Non-Orographic Gravity Wave Drag Scheme
The physical parametrization scheme that we focus on here describes how gravity waves generated in the troposphere by non-orographic processes, such as convection and frontogenesis, travel to the middle atmosphere and break, thereby acting as a drag on the resolved flow. This phenomenon is known to influence the dynamics of the stratosphere and the levels above, including the Brewer-Dobson circulation and the quasi-biennial oscillation. Since model cycle 35r3 of the ECMWF model, the Integrated Forecasting System (IFS), went into operations in 2009, the effects of non-orographic gravity wave drag have been parametrized (Orr et al., 2010), based on the scheme of Scinocca (2003). This model upgrade delivered a much improved representation of the middle atmosphere, compared with the previous scheme based on a crude Rayleigh friction term.
In this study we chose the non-orographic gravity wave drag scheme as our case study as we had prior experience in neural network emulation for this scheme. We developed a neural network-based emulator which in principle can be used as a drop-in replacement for the original scheme. Our companion paper, Chantry et al. (2021), focuses on the development of the emulator, including the network design, training data generation and training procedure, and its testing through forecast and climatology verification. Here we summarize the relevant details of the emulation.
The neural network architecture we chose is the fully connected multilayer perceptron, which is arguably the simplest architecture which can be used for deep learning. Viewed generally, this class of neural network can be written as a vector-valued function of a vector that maps an input x to an output y:

    y = N(x)    (1)

In this study we deal exclusively with multilayered networks of N nonlinear layers and 1 linear output layer, with each nonlinear layer (i.e., each of the input and hidden layers) having a constant width. If W_i is the matrix containing the weights of the ith layer (where i = 0 indicates the input layer), b_i is the vector containing the biases in the ith layer and h is the vector-valued function of a vector that evaluates the activation function for each element of its input, then the operation of the neural network up to the ith layer can be written recursively as

    y_0 = W_0 x + b_0,    y_i = W_i h(y_{i-1}) + b_i    for i = 1, ..., N    (2)

By this definition y = y_N. Note that it is not conventional to define the partial output of a neural network, y_i, as including the application of layer i's weights and biases, as we have done here. Usually y_i is defined as the immediate output of the activation function. We use this formulation to allow brevity in the equivalent tangent-linear and adjoint expressions given below.
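The recursion just described can be sketched in a few lines of NumPy. This is our own illustration under the paper's convention (a linear input transformation, activations applied between layers, and a linear output layer), not the authors' IFS code; it returns every partial output y_i because the tangent-linear and adjoint expressions reuse them.

```python
import numpy as np

def forward(x, weights, biases, h=np.tanh):
    """Partial outputs y_0..y_N of the MLP; the network output is the last entry.

    Convention: y_0 = W_0 x + b_0 and y_i = W_i h(y_{i-1}) + b_i, so the
    output layer is linear (no activation is applied after the final matmul).
    """
    ys = [weights[0] @ x + biases[0]]
    for W, b in zip(weights[1:], biases[1:]):
        ys.append(W @ h(ys[-1]) + b)
    return ys
```

Keeping the intermediate y_i around costs a little memory but avoids recomputing them when evaluating the tangent-linear and adjoint at the same linearization point.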
In this study, we propose to replace one or more of the physical parametrization schemes used in a numerical model of the weather with a neural network. Under this scheme, the function N in Equation 1 would map a vector of variables describing the atmospheric state, x, to a vector containing the contributions to the tendencies (i.e., time derivatives) of variables, y, from the physical process encoded in N. For the specific case study of non-orographic gravity wave drag, x contains vertical columns of zonal and meridional wind velocities and temperature along with pressure and geopotential at the surface. Then, y contains the contribution to the tendencies of zonal and meridional wind velocities from non-orographic gravity wave drag. Note that this parametrization scheme acts on each atmospheric column independently and in isolation. It can therefore be perfectly parallelized in the horizontal directions.
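Because of this column independence, an MLP emulator can in principle be applied to every grid column at once by batching the matrix multiplications over the column dimension. A hypothetical NumPy sketch (the function name and layout are ours, not the IFS's):

```python
import numpy as np

def emulate_columns(X, weights, biases, h=np.tanh):
    """Apply the MLP to all columns at once.

    X has shape (n_columns, n_inputs); each row is one atmospheric column's
    input vector. Returns an array of shape (n_columns, n_outputs).
    """
    Y = X @ weights[0].T + biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        Y = h(Y) @ W.T + b   # final layer is linear
    return Y
```

Since columns never interact, the result for any single column is identical whether it is processed alone or in a batch, which is what makes the horizontal parallelization exact.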
We created a data set for training, validating and testing our neural network by modifying cycle 45r1 of the IFS (operational in 2018). We ran the IFS in 30-day integrations over a three-year period from 2015 to 2017, starting every 30 days. Every 20 h, the inputs and outputs of the subroutine executing the non-orographic gravity wave drag scheme were saved to disc. We then subsampled the data to reduce its volume. The training, validation, and testing portions of the data set were taken from the years 2015, 2016, and 2017, respectively. Note that here we trained the neural network on the "high complexity" version of the non-orographic gravity wave drag scheme, as described in Chantry et al. (2021). This high complexity scheme is more accurate than the scheme used operationally at ECMWF (which serves as the reference model for the experiments below), but is also more expensive as a result. Nevertheless, a neural network of a fixed architecture has the same computational cost regardless of which parametrization scheme it emulates. Since we only evaluate a single neural network in this paper, we decided to emulate the most accurate scheme. We refer the reader to Chantry et al. (2021) for the full details of the data set.
We used the hyperbolic tangent as the activation function for all layers of the network, as its derivative can be computed easily and is completely smooth. We note that this function is not as computationally efficient as, say, a piecewise linear activation function, but it is sufficient for this first demonstration. We manually tuned the layer width (keeping a constant width for each layer except the output layer) and the number of layers, aiming to minimize the root-mean-square error over the training data set while keeping the total number of degrees of freedom fixed. Ultimately we settled on a network of 6 hidden layers (so that the total number of layers including the output layer, N, is 7), each with a width of 43 neurons. For each variable with a vertical dimension, we considered only the top N_lev = 93 out of a total of 137 vertical model levels, as the parametrization scheme is only active on these levels. As mentioned, the input consists of the zonal and meridional wind velocities and temperature (three-dimensional fields) and surface pressure and geopotential (two-dimensional fields). The input layer therefore has a width of 3 × 93 + 1 + 1 = 281 elements. The output consists of two three-dimensional fields, namely the wind velocity tendency contributions in the zonal and meridional directions, and therefore the output layer has a width of 2 × 93 = 186 elements. To illustrate the accuracy of our trained neural network, Figure 1 shows how well it predicts four unseen cases, chosen randomly from the testing data set (the data covering 2017). Note that the neural network considered here is trained on model data comprising 137 vertical levels, as opposed to the 91 vertical levels used in our companion paper Chantry et al. (2021).
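The stated layer widths can be checked with a short sketch. The variable names and the packing order of the input vector below are our own illustration; the text does not specify the actual ordering used in the IFS interface.

```python
import numpy as np

n_lev = 93                                    # levels on which the scheme is active

# Profile inputs (one value per level) and surface scalars; zeros as placeholders.
u, v, T = (np.zeros(n_lev) for _ in range(3)) # zonal wind, meridional wind, temperature
p_surf, z_surf = 0.0, 0.0                     # surface pressure and surface geopotential

# Hypothetical packing of one column's input vector.
x = np.concatenate([u, v, T, [p_surf, z_surf]])
assert x.size == 3 * n_lev + 1 + 1 == 281     # input layer width

n_outputs = 2 * n_lev                         # du/dt and dv/dt tendency contributions
assert n_outputs == 186                       # output layer width
```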

Linearizing Neural Networks
In order to use the neural network in 4D-Var data assimilation, as described in Section 1, we require its tangent-linear and adjoint. In other words, we also require linear functions F and F* such that

    δy = F(x, δx)    (3)

    ∂J/∂x = F*(x, ∂J/∂y)    (4)

where δx and δy are perturbations to x and y, respectively, and J is an arbitrary scalar function of the output of the neural network, y. F can be written explicitly as the matrix-vector multiplication

    δy = K(x) δx    (5)

where K(x) is the Jacobian matrix of partial derivatives of N evaluated at x. In this case F*(x, ∂J/∂y) can also be obtained simply by transposing K(x) and computing K(x)^T ∂J/∂y. In practice this "Jacobian" approach to 4D-Var is only feasible for low-dimensional models; it is rarely feasible to store the actual Jacobian in memory for a weather forecasting model due to its high dimension. Fortunately, we can still compute Equations 3 and 4 as a series of matrix-vector multiplications without ever representing the Jacobian explicitly. This is the approach we take here to construct the tangent-linear and adjoint models of the neural network, as it is for the IFS model at ECMWF generally.
The tangent-linear of the neural network can be found by taking the derivative of Equation 2 and using the chain rule. Before we present this, we first define two useful functions. First, h' is the vector-valued function of a vector that evaluates the derivative of the activation function for each element of its input. Second, G(v) is the diagonal matrix whose diagonal is h'(v). Then, we can write an explicit form for Equation 3, the tangent-linear of the neural network:

    δy = W_N G(y_{N-1}) W_{N-1} G(y_{N-2}) ... W_1 G(y_0) W_0 δx    (6)

Note that left-multiplying the matrix G with a vector is equivalent to an element-wise vector product between the diagonal of G and that vector, and this is how we implement the operation in the code. We write G here as a matrix only for notational simplicity. An explicit form for Equation 4, the adjoint of the neural network, can be found by taking the transpose of Equation 6:

    ∂J/∂x = W_0^T G(y_0) W_1^T G(y_1) ... W_{N-1}^T G(y_{N-1}) W_N^T ∂J/∂y    (7)

Here, we have essentially reversed the order of the matrix multiplication and taken the transpose of each matrix (G is diagonal, so it is its own transpose). The symmetry between Equations 6 and 7 is clear.
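Equations 6 and 7 amount to a sequence of matrix-vector products and element-wise scalings, with the Jacobian never formed. A minimal NumPy sketch of this (our own illustration, using tanh as the activation h):

```python
import numpy as np

def h(v):  return np.tanh(v)
def dh(v): return 1.0 - np.tanh(v) ** 2          # derivative of tanh

def forward(x, Ws, bs):
    """Partial outputs y_0..y_N under the convention y_i = W_i h(y_{i-1}) + b_i."""
    ys = [Ws[0] @ x + bs[0]]
    for W, b in zip(Ws[1:], bs[1:]):
        ys.append(W @ h(ys[-1]) + b)
    return ys

def tangent_linear(x, dx, Ws, bs):
    """Equation 6: propagate a perturbation dx forward through W_i and G(y_{i-1})."""
    ys = forward(x, Ws, bs)
    dy = Ws[0] @ dx
    for W, y_prev in zip(Ws[1:], ys[:-1]):
        dy = W @ (dh(y_prev) * dy)               # G(y_{i-1}) dy as element-wise product
    return dy

def adjoint(x, dJdy, Ws, bs):
    """Equation 7: apply the transposed factors in reverse order."""
    ys = forward(x, Ws, bs)
    dJdx = dJdy
    for W, y_prev in zip(Ws[:0:-1], ys[-2::-1]):
        dJdx = dh(y_prev) * (W.T @ dJdx)         # G is diagonal, hence self-transpose
    return Ws[0].T @ dJdx
```

The two routines share the forward pass, so in practice the partial outputs y_i would be computed once per linearization point and reused.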
As in Chantry et al. (2021), we trained our neural network using TensorFlow, in Python, but integrated the network back into the IFS by writing our own Fortran module. This module reads the network parameters generated by TensorFlow and executes the neural network on the inputs provided by the IFS, acting as a drop-in replacement for the existing non-orographic gravity wave drag scheme. This allowed us to subject our neural network code to the standard tangent-linear and adjoint model evaluation for model changes in the IFS.

Tangent-Linear and Adjoint Model Evaluation
When evaluating a model change at ECMWF we must always verify that the nonlinear, tangent-linear and adjoint models remain consistent with each other. If there is any inconsistency between the tangent-linear and adjoint models due to a code bug, for example, then this can be expected to disrupt the convergence of the 4D-Var cost function minimization. In our case we are essentially proposing an alternative scheme to be used for non-orographic gravity wave drag and so we follow the same procedure in testing the linearized models that we would follow for any other model change.
We began by evaluating the neural network subroutines in isolation, namely those that are intended to replace the existing non-orographic gravity wave drag subroutines. These subroutines, the nonlinear, tangent-linear and adjoint versions of the neural network, must satisfy principally two tests before we can consider integrating them back into the IFS itself. As before, if the nonlinear, tangent-linear and adjoint subroutines are denoted by N, F and F*, respectively, then first N and F must satisfy the expression

    lim_{λ→0} ‖N(x_0 + λ δx) − N(x_0)‖ / ‖F(x_0, λ δx)‖ = 1    (8)

where x_0 is an arbitrary point about which the linearization is performed, λ is a number that tends to zero and δx is an arbitrary vector. This test essentially measures the accuracy of the tangent-linear model as a first-order Taylor approximation to the nonlinear model. In practice this test is performed by fixing δx and evaluating the ratio in Equation 8 for a sequence of decreasing values of λ, checking that it approaches 1 before rounding error dominates.
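The mechanics of this test can be illustrated numerically with a toy stand-in for the parametrization (the functions below are our own example, not the actual scheme): the ratio in Equation 8 should approach 1 as λ shrinks.

```python
import numpy as np

def nl(x):
    """Toy nonlinear 'scheme' standing in for N."""
    return np.sin(x) + 0.5 * x ** 2

def tl(x0, dx):
    """Its hand-derived tangent-linear F(x0, dx), linear in dx."""
    return (np.cos(x0) + x0) * dx

x0 = np.array([0.3, -1.2, 2.0])
dx = np.array([1.0, 0.5, -0.7])
for lam in (1e-1, 1e-3, 1e-5):
    ratio = np.linalg.norm(nl(x0 + lam * dx) - nl(x0)) / np.linalg.norm(tl(x0, lam * dx))
    print(lam, ratio)    # the ratio tends to 1 as lam decreases
```

For very small λ the finite difference in the numerator is dominated by rounding error, so in practice the ratio is inspected over a range of λ values rather than at a single one.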

Journal of Advances in Modeling Earth Systems
The second test performed on the subroutines in isolation was to verify that the adjoint identity is satisfied, namely that the expression

    ⟨F(x, δx), δy⟩ = ⟨δx, F*(x, δy)⟩    (9)

holds, where δy is an arbitrary vector. In fact, given that the operators F and F* are actually computer subroutines employing floating-point arithmetic, the identity in Equation 9 will not hold exactly but instead be violated somewhat (Hatfield et al., 2020). When using double precision, the left-hand side and right-hand side can agree to at most about 16 decimal digits. We verified that this was the case, observing no difference between the left-hand and right-hand sides of Equation 9 up to the machine epsilon. The code for these two tests is available online. Having verified that our neural networks pass the tangent-linear and adjoint tests at the individual subroutine level, we then applied similar tests at the level of the entire IFS model. These tests are far more stringent and provide a strong indication as to whether the code can be used to perform data assimilation or not.
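The adjoint identity test can likewise be sketched with a stand-in linear operator (our own example): for any matrix A implementing the tangent-linear, the adjoint is A^T, and the two inner products in Equation 9 must agree to within rounding error.

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((5, 3))     # stand-in tangent-linear operator F
F  = lambda dx: A @ dx
Fs = lambda dy: A.T @ dy            # its adjoint: the transpose

dx = rng.standard_normal(3)
dy = rng.standard_normal(5)
lhs = np.dot(F(dx), dy)             # <F dx, dy>
rhs = np.dot(dx, Fs(dy))            # <dx, F* dy>
assert abs(lhs - rhs) <= 1e-12 * max(1.0, abs(lhs))   # agreement near machine epsilon
```

An incorrectly coded adjoint (e.g., a missing term or a transposition error) typically breaks this identity by many orders of magnitude, which is what makes the test such an effective debugging tool.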

Tangent-Linear Test
If the nonlinear model operator that evolves a model state forward over a certain time period (usually 12 h in the case of 4D-Var) is given by ℳ then the tangent-linear model is given by M. This tangent-linear test compares a perturbation δx evolved by the tangent-linear model linearized about a reference state x_0 against the difference between the two states evolved by the nonlinear model initialized at x_0 and x_0 + δx:

    ε = ⟨ | M δx − [ℳ(x_0 + δx) − ℳ(x_0)] | ⟩    (10)

where the angle brackets denote a spatial average (e.g., global or zonal) and the vertical bars represent taking the absolute value. This test is essentially a model-wide version of the isolated subroutine tangent-linear test of Equation 8, performed using perturbations with a magnitude comparable to actual analysis increments that the tangent-linear model will encounter when used to perform 4D-Var data assimilation. The temperature variables in δx here have magnitudes of several Kelvin, for example. Note also that this test does not consider the physical accuracy of the nonlinear or tangent-linear models. Even a physically inaccurate nonlinear and tangent-linear model pair can perform well in this test, as long as they correspond closely according to Equation 10.
The value of ε in general will not be zero, even when the tangent-linear model is coded perfectly, as atmospheric evolution is often nonlinear and so violates the first-order Taylor approximation, especially for the relatively large values of δx we consider here. Instead we aim to show that the ε computed using the neural network non-orographic gravity wave drag scheme (labeled NN) is comparable to the ε computed from a reference (labeled REF), namely the operational configuration of the IFS. Specifically, we look at the relative change in ε when switching from the reference model to the neural network model,

    ε_rel = (ε_NN − ε_REF) / ε_REF.

We use the same NN and REF labels for all following experiments in the paper.
We present the results of the tangent-linear test for two cases, which we call the "easy test" and the "hard test." For the easy test, both the nonlinear model  and the tangent-linear model M use the neural network for the non-orographic gravity wave drag scheme. For the hard test, we use the neural network for M but the original physics scheme for . The latter test is relevant for the weather forecast tests we present later, in which we only use the neural network for the tangent-linear and adjoint model, not the nonlinear model. It is a harder test, as the tangent-linear neural network is tested against the original nonlinear non-orographic gravity wave drag scheme, not the nonlinear neural network as in the easy test. This test therefore also indirectly measures the accuracy of the neural network with respect to the original physics scheme. We performed these tests at a resolution of TCo399 (a triangular truncation with a maximum total and zonal wavenumber of 399 in spectral space with a cubic-octahedral reduced Gaussian grid in grid point space), the same resolution at which we generated the training data for the neural network. This corresponds to roughly 25 km grid spacing at the Equator.
The results for both the easy and hard tangent-linear tests are shown in Figure 2. These tests were run over 20 start dates; the central lines and shaded areas in Figure 2 summarize the distribution over these start dates. For the easy test, ε_rel is around zero for most of the atmosphere, except for the mesosphere where it is actually negative. This indicates that, in this region, the nonlinear and tangent-linear models actually correspond more closely for NN than for REF. This could be due to the nature of the tangent-linear model used by REF, which is based on that of the operational IFS. This setup uses a regularized version of the non-orographic gravity wave drag scheme for the tangent-linear case in which, for example, momentum fluxes for high phase speed spectral components are zeroed. This helps to achieve numerical stability in the 4D-Var minimization (Janisková & Lopez, 2013). The tangent-linear model for NN, on the other hand, is derived directly from an emulation of the nonlinear scheme without any of these regularizations. Hence, the formulations of the nonlinear and tangent-linear models are closer for NN, and this likely explains the negative values of ε_rel.
For the hard test, ε_rel is again mostly close to zero, but is positive in the mesosphere, indicating that the nonlinear and tangent-linear models correspond less well for NN than for REF there. For the hard test, the nonlinear and tangent-linear models for NN are formulated differently. Clearly this difference is bigger for NN than for REF, hence the positive values of ε_rel. Through further training and hyperparameter exploration, ε_rel could probably be brought closer to zero. However, instead of spending further effort refining the neural network, we will proceed to use this configuration in our later data assimilation experiments. An increase in ε_rel will not necessarily result in a degraded analysis, as there is always an unavoidable discrepancy between the nonlinear model and the tangent-linear and adjoint models in practice, due to the fact that the latter run at a lower resolution than the former to save computational resources. Whether the increase in ε_rel is acceptable will be determined by whether there is a degradation in the analysis, which we investigate in Section 5. We now move on to the adjoint test.

Adjoint Test
If the tangent-linear model is given by M then the adjoint model is given by M* such that the expression

    ⟨M δx, δy⟩ = ⟨δx, M* δy⟩    (11)

holds, where δy is another arbitrary model state perturbation. This is equivalent to the isolated subroutine test of Equation 9, though in practice we consider a correspondence of at least 10 consecutive equal digits between the left-hand and right-hand sides of Equation 11 to be a "pass." A greater mismatch than this can occur if the numerical precision is too low (e.g., single precision), but it can also be evidence of an incorrectly coded tangent-linear or adjoint model.
We ran this test for experiment NN across 20 different dates and for 10 different configurations of the model; the results are shown in Figure 3. Again, we used a resolution of TCo399, the "native" grid of the neural network. Each date and model configuration pair produces a single number: the number of consecutive matching digits between the inner products on the left-hand and right-hand sides of Equation 11. We do not detail all 10 of these configurations here, but essentially the first two configurations include the full physical parametrization suite of the IFS with slightly different options, whereas the other configurations have different parametrization schemes switched on and off to help in debugging the tangent-linear and adjoint code. Across most dates and physics configurations there is a good tangent-linear and adjoint correspondence, with at least 12 consecutively equal digits. From operational experience, we can say that failures in the data assimilation process become likely when there is a correspondence of 8 digits or fewer, but we do not observe any such cases. We are therefore confident that the adjoint of the neural network is also coded correctly, and we move on to the data assimilation and weather forecasting experiments.
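A digit count of this kind might be computed as follows. This is a hypothetical helper of our own, not the actual IFS diagnostic code, and it estimates the number of shared leading decimal digits from the relative difference of the two inner products.

```python
import math

def matching_digits(a, b, max_digits=16):
    """Approximate count of consecutive equal leading decimal digits of a and b."""
    if a == b:
        return max_digits                   # full double-precision agreement
    if a == 0.0 or b == 0.0 or (a < 0) != (b < 0):
        return 0                            # no common leading digits
    rel = abs(a - b) / abs(a)
    return min(max_digits, max(0, int(-math.log10(rel))))
```

Applied to the left-hand and right-hand sides of Equation 11, a result of 12 or more would count as a comfortable pass under the threshold described above.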

Data Assimilation and Weather Forecasting Experiments
Though we know from experience that model changes need to satisfy the tests in Section 4 to be used in 4D-Var data assimilation, these tests alone are not sufficient evidence that 4D-Var can actually be performed successfully. Here we demonstrate the suitability of our neural network-based tangent-linear and adjoint models for data assimilation by performing an actual cycled 4D-Var data assimilation experiment. Our experiment covered the winter period from December 1, 2018 to February 28, 2019. In this experiment, consecutive 4D-Var data assimilation cycles were run every 12 h and 10-day forecasts were issued from the final analysis of each assimilation cycle. In total, 177 forecasts were performed. The horizontal resolution of the nonlinear integrations (the 4D-Var trajectory and the actual forecasts) was TCo399. The horizontal resolutions of the tangent-linear and adjoint model integrations were TL95 (a triangular truncation of 95 with a linear reduced Gaussian grid), TL159 and TL255 for the first, second and third inner loops of the 4D-Var minimization, respectively. The model had 137 vertical levels for all integrations. For reference, the high-resolution operational configuration of the IFS runs at a resolution of TCo1279 (roughly 8 km grid spacing at the Equator) for the 4D-Var trajectories and has four inner loops at resolutions of TL255, TL319, TL399, and TL399, respectively. There are 137 vertical levels in all cases. Our experiments were performed with the IFS at model cycle 46r1.
We evaluate the performance of the neural network based on the quality of the weather forecasts initialized from the analysis products of the data assimilation cycle. This is done by reference to the operational configuration of the IFS, which uses hand-coded tangent-linear and adjoint models of the non-orographic gravity wave drag scheme. We also present diagnostics evaluating how well the model of experiment NN fits observations, compared with REF. As mentioned earlier when explaining the hard tangent-linear test, we only use the neural network in the tangent-linear and adjoint models. We do not use it in the nonlinear model used to generate the 4D-Var trajectories, nor in the nonlinear model used to run the actual forecasts (these two models should be the same to avoid readjustment problems in the early stages of the forecast). If we did use the neural network here too, we would not be able to directly attribute any changes in forecast scores between NN and REF to the data assimilation system alone. A detailed verification of the nonlinear neural network emulator can be found in our companion paper, Chantry et al. (2021).
Our headline result is shown in Figure 4. This shows the relative difference in root-mean-square error of forecasted temperature, averaged in the zonal direction and across all forecasts in the experimental period, for a number of forecast lead times. The difference in error shown is between the two experiments NN and REF, with red or blue colors indicating that NN is worse or better than REF, respectively; hatching marks differences that are significant with 95% confidence. For both experiments, the forecast error is computed with respect to the experiment's own analysis. We repeated the verification with the ECMWF operational analysis as a common reference for both experiments but found no significant difference in the result (not shown). The absolute values of the differences amount to no more than 4%; they are neither decisively positive nor negative and in any case are not statistically significant, as demonstrated by the sparsity of the significance hatching. In other words, experiment NN does not perform statistically significantly differently from experiment REF. We also looked at the fields of geopotential height, relative humidity, zonal wind and meridional wind, and there too the conclusion is the same (not shown). Additionally, we detected no significant difference in the variability of the forecast errors between the two experiments, as measured by the relative change in the standard deviation of the error.
In Figure 5 we compare both the analysis and the background (the first guess of the atmospheric state at the beginning of the 4D-Var process) of each experiment with brightness temperature observations from the Advanced Technology Microwave Sounder (ATMS) instrument flying on the Suomi NPP and NOAA-20 polar-orbiting satellites, both of which provide global coverage. This instrument is sensitive primarily to atmospheric temperature and moisture content, depending on the channel. For our purposes, channels 10-15 are of particular interest, as these are sensitive to temperature in the stratosphere where the non-orographic gravity wave drag scheme is active (Bormann et al., 2013). We compute the root-mean-square error between the observations and the analysis, and then the relative change of this error between experiments REF and NN; we then repeat this procedure for the background. These statistics are averaged across the whole globe and all data assimilation cycles. 95% confidence intervals are also shown, computed according to the procedure of Geer (2016). Note that channels 16 and 17 are missing from these figures, as those channels are not assimilated at ECMWF.
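The departure statistics described above reduce to a simple computation per channel. The following is an illustrative numpy sketch with synthetic departure data (the arrays and their sizes are hypothetical), not the ECMWF verification software:

```python
import numpy as np

def rms(dep):
    """Root-mean-square of departures (observation minus model equivalent)."""
    return np.sqrt(np.mean(dep ** 2))

def relative_rms_change(dep_ref, dep_nn):
    """Relative change in RMS departure of NN with respect to REF, in percent.
    Negative values mean NN fits the observations more closely."""
    return 100.0 * (rms(dep_nn) - rms(dep_ref)) / rms(dep_ref)

rng = np.random.default_rng(0)
# Synthetic analysis departures for one ATMS channel, pooled over all
# assimilation cycles and observation locations (illustrative only).
dep_ref = rng.normal(0.0, 0.20, size=100_000)  # REF departures [K]
dep_nn = rng.normal(0.0, 0.20, size=100_000)   # NN departures [K]

print(f"relative RMS change: {relative_rms_change(dep_ref, dep_nn):+.2f}%")
```

The same computation is repeated with the background departures in place of the analysis departures.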
For almost all channels there is no significant difference between the two experiments in the root-mean-square of their analysis and background departures. For several channels the change in departure root-mean-square lies outside the 95% confidence interval: channels 8, 14, and 15 for the analysis departures and channels 8, 9, and 10 for the background departures. However, the confidence intervals computed by the ECMWF verification software do not account for temporal autocorrelation, which could inflate these intervals by up to 30%. We are therefore not concerned about these apparently statistically significant differences. We also considered observations from another microwave sounder, the Advanced Microwave Sounding Unit-A (AMSU-A), but there too we could not detect any significant changes in experiment NN compared with experiment REF.
Finally, in Figure 6 we show a plot equivalent to Figure 5 but for Global Positioning System radio occultation (GPSRO) bending-angle observations, which are also sensitive to the quality of the stratospheric analysis. These instruments measure the bending of radio waves emitted by GPS satellites as they pass through the atmosphere, and have globally uniform coverage. The observations are given as a function of the "tangent height," the lowest altitude that the ray reaches as it passes from source to receiver. In the stratosphere, GPSRO observations are primarily sensitive to temperature, as this influences the refractive index of the atmosphere.
As in Figure 5, there is almost no discernible difference between experiments NN and REF: the relative change in root-mean-square departure is not significantly different from 0% at most tangent heights. The only exception is around 2 km, where there is an apparent degradation for experiment NN. At this level experiment NN permitted slightly more observations to pass quality control than REF, and these observations happened to match the analysis and background more poorly than at other tangent heights. In any case, GPSRO observations are most reliable between 10 and 30 km tangent height, where there is no discernible change in root-mean-square error, so we are not concerned about the apparent difference at 2 km.

Conclusion
Techniques from the field of machine learning have a number of potential applications within NWP. Here we have demonstrated one use for neural networks in variational data assimilation that takes advantage of their easy differentiability. We began by training an emulator for a physical parametrization scheme representing drag in the middle atmosphere due to the breaking of gravity waves forced by non-orographic processes. We then developed tangent-linear and adjoint versions of this neural network and implemented them in an operational NWP model. We tested our neural network-based parametrization scheme according to the standard internal testing procedure for tangent-linear and adjoint model changes at ECMWF, and demonstrated that it can also be used successfully in a data assimilation and weather forecast experiment. Our experiment using a neural network showed no statistically significant difference in forecast scores, even near analysis time, compared with a reference that did not use a neural network.
Considering that we trained our neural network to emulate the chosen physics scheme at a horizontal resolution of TCo399, a comment on the generalizability of the emulation approach to other horizontal resolutions is warranted. We have already presented results in which the neural network operated successfully across a range of horizontal resolutions and grids, from the lowest resolution of TL95, used in the first inner loop of the 4D-Var minimization, to the highest of TCo399, used for the tangent-linear and adjoint tests and the 4D-Var nonlinear trajectories. This adaptability is natural: the scheme we are emulating operates on a single atmospheric column and so is in principle agnostic to the horizontal context. The success of the data assimilation experiments is empirical evidence for this, but we nevertheless present in Appendix A a repeat of the tangent-linear and adjoint tests at a resolution of TL159. The results are equivalent to those of the tests run at TCo399.
If this approach could be extended to other physical parametrization schemes, it holds the potential to significantly simplify tangent-linear and adjoint model maintenance, not just at ECMWF but also at other forecasting centers around the world that use 4D-Var. The evaluation of the neural network is nothing more than a large batch of matrix-matrix multiplications, with an activation function evaluated after each hidden layer. It is therefore well suited to hardware optimized for machine learning workloads, such as graphics processing units. Thanks to this improved hardware portability, the technique outlined here may help keep 4D-Var competitive as we enter the age of heterogeneous supercomputer architectures. Furthermore, as demonstrated here, the neural network can be trained to emulate the full-complexity nonlinear model component, with the weights reused in the tangent-linear and adjoint versions. This contrasts with the existing approach, in which physical processes must occasionally be simplified for reasons of computational cost. The formulation of the linearized physics can therefore be kept closer to that of the nonlinear physics, and in some cases could include physical processes usually omitted.
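The forward, tangent-linear and adjoint evaluations described above can be sketched as follows. This is a minimal, illustrative numpy example, with random weights and hypothetical dimensions rather than a trained emulator or the IFS implementation; it also includes the standard adjoint identity test and the tangent-linear (Taylor) test:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fully connected emulator: one hidden tanh layer. In practice the
# weights come from training; here they are random for illustration.
W1, b1 = 0.1 * rng.normal(size=(64, 40)), rng.normal(size=64)
W2, b2 = 0.1 * rng.normal(size=(40, 64)), rng.normal(size=40)

def nonlinear(x):
    """Forward (nonlinear) model M(x): matrix multiplies plus activation."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def tangent_linear(x, dx):
    """TL model: Jacobian-vector product dM/dx(x) @ dx, reusing W1 and W2."""
    h = W1 @ x + b1
    return W2 @ ((1.0 - np.tanh(h) ** 2) * (W1 @ dx))

def adjoint(x, dy):
    """Adjoint model: the transpose of the TL operator applied to dy."""
    h = W1 @ x + b1
    return W1.T @ ((1.0 - np.tanh(h) ** 2) * (W2.T @ dy))

x, dx, dy = rng.normal(size=40), rng.normal(size=40), rng.normal(size=40)

# Adjoint test: <TL(dx), dy> must equal <dx, AD(dy)> to machine precision.
lhs = tangent_linear(x, dx) @ dy
rhs = dx @ adjoint(x, dy)
assert abs(lhs - rhs) < 1e-8 * (abs(lhs) + 1.0)

# Tangent-linear (Taylor) test: the ratio tends to 1 as eps shrinks.
eps = 1e-6
ratio = np.linalg.norm(nonlinear(x + eps * dx) - nonlinear(x)) / \
        np.linalg.norm(eps * tangent_linear(x, dx))
print(ratio)  # close to 1.0
```

Note that the adjoint reuses exactly the same weight matrices as the forward model, only transposed, which is why the linearized versions require no separate maintenance.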
In our data assimilation and forecasting experiment NN, for the reasons explained above, we used the neural network only in the tangent-linear and adjoint models, not in the nonlinear model used for generating the 4D-Var trajectories and running the forecasts. In an operational context it may be desirable to use the neural network there too, if it can deliver better scores at a reduced computational cost. However, there are reasons why such an approach might not be adopted for operational weather prediction. First, it may be that a neural network cannot compete with the existing scheme in the nonlinear model but still provides a sufficiently accurate gradient for the tangent-linear and adjoint models; in that case the neural network could be used in the linear models alone. In fact, the integration of the tangent-linear and adjoint models dominates the total cost of 4D-Var, so this is in any case where efforts to reduce the cost of data assimilation should first be concentrated. Second, there are cases where intermediate physical quantities from the nonlinear forecast model are required (e.g., entrainment rates from the convection scheme may be disseminated as a forecast product). The neural network cannot provide these quantities, as their computation is bypassed entirely, so in these cases the original nonlinear code would have to be run. Ultimately, operational concerns will determine which approach to take.
As mentioned earlier, another motivation for using simplified physics and additional regularization procedures in tangent-linear and adjoint models is to improve their operational stability. The perturbations and gradients (the inputs to the tangent-linear and adjoint models, respectively) actually encountered in operational data assimilation are far from infinitesimal, and the models must therefore be modified ("regularized") to handle them without producing unphysical outputs (Janisková & Lopez, 2013). Furthermore, some physical processes are modeled by discontinuous functions, with the derivative ill-defined at the threshold. In principle a neural network emulator would suffer from the same problems. A step function is automatically smoothed when emulated by a neural network, so in that case the network at least remains continuously differentiable. However, the issue of large gradients would remain, which could render the tangent-linear and adjoint models unstable in cases of strong nonlinearity, just as for the conventional paradigm of manually linearized physics. We did not encounter any such issues in our experiments, despite having trained the neural network on the unregularized non-orographic gravity wave drag scheme, perhaps because this scheme is not sufficiently nonlinear for them to occur. However, a similar emulation procedure applied to the cloud scheme may present problems: in the existing cloud scheme, a manual intervention in the tangent-linear and adjoint models, drawing on domain knowledge of cloud processes, was required to ensure stability (Tompkins & Janisková, 2004). If such an intervention were also necessary for the linearization of the neural network emulator, this would arguably defeat the purpose of an automated approach.
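The smoothing of a discontinuity by a network with smooth activations, and the large-gradient problem that smoothing alone does not solve, can be illustrated with a sigmoid surrogate for a step function (a hypothetical toy example, not the actual scheme):

```python
import numpy as np

def step(x, threshold=0.0):
    """Discontinuous process: the derivative is undefined at the threshold."""
    return np.where(x > threshold, 1.0, 0.0)

def smooth_step(x, threshold=0.0, width=0.1):
    """Sigmoid surrogate of the kind a tanh/sigmoid network learns:
    continuously differentiable everywhere."""
    return 1.0 / (1.0 + np.exp(-(x - threshold) / width))

def smooth_step_grad(x, threshold=0.0, width=0.1):
    """Analytic derivative of the surrogate; its maximum is 1/(4*width)."""
    s = smooth_step(x, threshold, width)
    return s * (1.0 - s) / width

# Smoothing restores differentiability, but a sharp fit (small width)
# still yields large gradients that could destabilize the TL/adjoint models.
print(smooth_step_grad(0.0, width=0.1))   # 2.5
print(smooth_step_grad(0.0, width=0.01))  # 25.0
```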
To address the problem of strong gradients, it may be necessary in future to adopt one or more simple regularizing constraints during the training of the neural network. Strong gradients could be discouraged by introducing an additional term in the objective function that penalizes large gradient norms. Another regularization might tackle the problem of inaccurate emulated gradients more generally, through a term based on the error of the gradient, provided that gradient training data are available. There is, after all, no guarantee that an accurate neural network emulation will also produce an accurate gradient when differentiated, if the training objective function is based only on a simple error norm. Further work on other physical parametrization schemes within the IFS, including the convection and large-scale cloud schemes, will help to answer these questions.
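A gradient-norm penalty of the kind suggested above might look like the following sketch, assuming a small one-hidden-layer tanh emulator whose input Jacobian can be written analytically; the weights, dimensions and penalty weight `lam` are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = 0.1 * rng.normal(size=(32, 10)), np.zeros(32)
W2, b2 = 0.1 * rng.normal(size=(1, 32)), np.zeros(1)

def emulator(x):
    """Toy one-hidden-layer tanh emulator."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def emulator_jacobian(x):
    """Analytic input Jacobian dNN/dx of the emulator."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ ((1.0 - h ** 2)[:, None] * W1)

def penalized_loss(xs, ys, lam=1e-3):
    """Mean-squared emulation error plus a penalty on the squared
    Frobenius norm of the input Jacobian, discouraging excessively
    large emulated gradients during training."""
    mse = np.mean([(emulator(x) - y) ** 2 for x, y in zip(xs, ys)])
    pen = np.mean([np.sum(emulator_jacobian(x) ** 2) for x in xs])
    return mse + lam * pen

# Synthetic training pairs, purely for illustration.
xs = [rng.normal(size=10) for _ in range(8)]
ys = [np.array([np.sum(x ** 2)]) for x in xs]
print(penalized_loss(xs, ys))
```

A gradient-error term would follow the same pattern, replacing the Jacobian norm with the norm of the difference between `emulator_jacobian(x)` and a reference gradient from the original scheme, if such training data were produced.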
In our companion paper, Chantry et al. (2021), we briefly investigate the use of more complex neural network architectures than the simple fully connected multilayer perceptron (MLP) we used here. Specifically, we constructed an architecture combining convolutional layers with fixed weights across all model levels (a local layer) and weights that vary with model level height (a fully connected layer). For a fixed number of model parameters, this hybrid neural network performed much better than the MLP. However, translating this architecture into Fortran and developing tangent-linear and adjoint versions, as we did for the MLP, would be somewhat more complicated. This is a strong motivation for developing greater interoperability between Fortran and Python, the latter of which is used for most model training in the field of machine learning (Ott et al., 2020).
Going forward, in addition to giving more attention to the matter of regularization, we intend to identify other benefits of the neural network approach. As mentioned, the tangent-linear and adjoint models of the neural network need only be written once; they can then in principle be reused for other model components, provided that accurate neural network emulators of those components can be trained. Other physical parametrization schemes could be considered individually, or they could be grouped together. For example, the orographic and non-orographic gravity wave drag schemes could be emulated together. This would simplify the existing tangent-linear and adjoint code and potentially also provide a computational acceleration, though only if the issues with emulating the orographic gravity wave drag scheme outlined in  can be resolved. The computational acceleration provided by the neural network is itself another avenue to explore. The computational cost of 4D-Var, dominated by the tangent-linear and adjoint model integrations, is no less of a concern than that of the forecast model, as data assimilation accounts for around half of the total wall-clock time for generating a forecast at ECMWF. As discussed in the companion paper, on central processing units (CPUs) the neural network costs roughly the same as the existing scheme. We did not focus on this here, as the non-orographic gravity wave drag scheme accounts for only around 1% of the total cost of the model evaluation. However, given that the physical parametrizations as a whole often account for 25% of the total cost of a model integration (Bauer et al., 2020), the computational cost savings from neural network emulators could prove to be their main asset, and this applies equally to tangent-linear and adjoint model integrations.
The idea that we have promoted in this paper, the automatic generation of tangent-linear and adjoint models via neural networks, demonstrates that machine learning techniques provide auxiliary benefits beyond the simple replacement of model components by emulators. We believe that neural networks could become an essential tool for keeping forecasting and data assimilation methods based on tangent-linear and adjoint models competitive, especially as we enter the exascale era.
Figure A1. A repeat of the tangent-linear test shown in Figure 2 but at TL159 resolution.

Figure A2. A repeat of the adjoint test shown in Figure 3 but at TL159 resolution.