Dynamic Bayesian Networks for Evaluation of Granger Causal Relationships in Climate Reanalyses

We apply a Bayesian structure learning approach to study interactions between global climate modes, thereby illustrating its use as a framework for developing process‐based diagnostics with which to evaluate climate models. Homogeneous dynamic Bayesian network models are constructed for time series of empirical indices diagnosing the activity of major tropical, Northern and Southern Hemisphere modes of climate variability in the NCEP/NCAR and JRA‐55 reanalyses. The resulting probabilistic graphical models are comparable to Granger causal analyses that have recently been advocated. Reversible jump Markov Chain Monte Carlo is employed to provide a quantification of the uncertainty associated with the selection of a single network structure. In general, the models fitted from the NCEP/NCAR reanalysis and the JRA‐55 reanalysis are found to exhibit broad agreement in terms of associations for which there is high posterior confidence. Differences between the two reanalyses are found that involve modes for which known biases are present or that may be attributed to seasonal effects, as well as for features that, while present in point estimates, have low overall posterior mass. We argue that the ability to incorporate such measures of confidence in structural features is a significant advantage provided by the Bayesian approach, as point estimates alone may understate the relevant uncertainties and yield less informative measures of differences between products when network‐based approaches are used for model evaluation.


Plain Language Summary
To produce reliable forecasts and projections, climate models should accurately reproduce the observed behavior of the different processes that play a role in Earth's climate, including the relationships between them. Statistical methods can be used to describe these interactions in models and in observations, which can then be compared to evaluate how well a given model captures the observed relationships. However, networks obtained from estimates of the true historical state of the climate, known as reanalyses, are also affected by the properties of the systems used to create these estimates, as well as by random variability, and hence may have significant uncertainties. Using what are known as Bayesian statistical methods, we estimate the uncertainties associated with particular interactions in two widely used reanalyses. Interactions that are found to be very likely to be present in one reanalysis but not the other may be due to systematic differences in the two reanalysis systems and need to be kept in mind when these state estimates are used to evaluate climate models. It is therefore important to account for the uncertainty associated with each relationship when analyzing state estimates and further employing them to evaluate climate models.
The globally contiguous observational record of the climate system, and in particular of the subsurface ocean, extends only over a few decades.
This process-oriented approach, in which attention is focused on a comparatively small set of physical processes, has been widely applied in the climate science community. On the one hand, the complexity of the full coupled climate system means that it is typically only feasible to focus on specific subsystems in any given analysis, or otherwise necessitates some form of dimension reduction. For instance, specific targeted process-oriented diagnostics (Maloney et al., 2019) permit the representation of these processes in models to be evaluated against observations in order to drive model development (Eyring et al., 2019). At the same time, the existence of distinct teleconnections, that is, recurrent large-scale modes of variability, has long been recognized (Ångström, 1935), motivating simplified models of the climate in terms of a small number of interacting modes. For example, atmospheric teleconnection patterns such as the Arctic Oscillation (AO) (Thompson & Wallace, 1998), the North Atlantic Oscillation (NAO) (van Loon & Rogers, 1978; Walker, 1923), the Pacific North American (PNA) (Barnston & Livezey, 1987; Horel & Wallace, 1981; Wallace & Gutzler, 1981) and Pacific South American (PSA) patterns (Lau et al., 1994; Mo & Ghil, 1987; O'Kane, Monselesan, & Risbey, 2017), and the Southern Annular Mode (SAM) (Rogers & van Loon, 1982; Thompson & Wallace, 2000) constitute important sources of low-frequency variability (Hannachi et al., 2017) and are associated with wide-ranging impacts (see, e.g., Gillett et al., 2006; Hurrell et al., 2003; Leathers et al., 1991; Mo & Paegle, 2001; Thompson & Wallace, 1998). On interannual time-scales, modes that express in the tropical oceans such as the El Niño Southern Oscillation (ENSO) (Bjerknes, 1969; Walker, 1924) and the Indian Ocean Dipole (IOD) (Saji et al., 1999) emerge as dominant sources of variability, with important implications for global weather and climate (see, e.g., McPhaden et al., 2021; Schott et al., 2009).
As these large-scale modes tend to be relatively persistent, understanding their evolution and dynamics may enable more skillful forecasting over longer time-scales (Goddard et al., 2001; Hannachi et al., 2017; A. G. Marshall et al., 2014). Thus, beyond simply being convenient for reducing the dimension of the problem in model evaluation studies, simplified models based on these physically observed modes provide an approach for better understanding the properties and interactions of key sources of climate variability.
Of particular interest when investigating the interactions of different modes is establishing the cause and effect relationships between processes. While causal inference based on observational data is in general challenging, definitions of causality based on predictability (Wiener, 1956) provide a useful starting point for such studies. Granger causality (Granger, 1969) formalizes this more restricted sense of causality. In simple terms, a given variable is traditionally defined to Granger cause another if the inclusion of information about the first variable leads to better predictions of the second, compared to predictions that do not use this information. This definition has the advantage of immediately leading to quantitative tests for identifying such relationships from observational data, and has recently been advocated for in climate science (McGraw & Barnes, 2018) over simpler lagged regression approaches.
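In the linear setting, a test of this kind reduces to comparing nested autoregressive models. The following minimal sketch (the function name and parameter choices are ours, for illustration only) compares restricted and unrestricted least-squares fits with an F-statistic, the standard quantitative form of the Granger test:

```python
import numpy as np

def granger_test(x, y, p=2):
    """F-test for whether x Granger-causes y, via nested AR(p) models.

    Restricted model: y regressed on a constant and its own p lags.
    Unrestricted model: additionally includes p lags of x.
    A large F-statistic indicates that lags of x improve predictions of y.
    """
    n = len(y)
    Y = y[p:]
    own_lags = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    x_lags = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    ones = np.ones((n - p, 1))
    X_r = np.hstack([ones, own_lags])           # restricted design
    X_u = np.hstack([ones, own_lags, x_lags])   # unrestricted design

    def rss(X):
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        return np.sum((Y - X @ beta) ** 2)

    rss_r, rss_u = rss(X_r), rss(X_u)
    df_den = (n - p) - X_u.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / df_den)

# Synthetic example: x drives y at lag 1, but not the other way around.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
f_xy = granger_test(x, y)  # large: lags of x improve prediction of y
f_yx = granger_test(y, x)  # small: lags of y add little for predicting x
print(f_xy, f_yx)
```

Conditioning on the target's own lags is what distinguishes this from a simple lagged regression, and is the point emphasized by McGraw and Barnes (2018).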
It is ordinarily not possible to fully understand the behavior of an individual mode of variability in isolation or based only on its relationship to one other process. Similarly, even when focusing on only a few individual processes, it is generally difficult to directly attribute model biases to problems in the representation of a single process (Eyring et al., 2019). To capture the intrinsically coupled nature of the system in simplified models, network-based approaches have become increasingly popular (Donges et al., 2009b; Steinhaeuser, Chawla, & Ganguly, 2011; Tsonis & Roebber, 2004; Tsonis, Swanson, & Roebber, 2006). In this framework, the climate system is represented in terms of a set of nodes, corresponding to appropriately defined subsystems or processes of interest, and edges describing the interactions between these nodes. The subsystems, for example, may be identified with one or more spatial gridpoints (Bello et al., 2015; Fountalis et al., 2018), or predefined modes characterized by empirical indices (Tsonis, Swanson, & Kravtsov, 2007). In either case, the full climate system is then modeled in terms of a collection of individual, non-linear dynamical systems interacting with their neighbors in the constructed network. Such networks have variously been applied to study synchronization and climate shifts (Tsonis, Swanson, & Kravtsov, 2007; G. Wang et al., 2009), to investigate the collective spatial structure of the statistical relationships between fields and their changes over time (Donges, Schultz, et al., 2011; Gozolchiani, Havlin, & Yamasaki, 2011; Gozolchiani, Yamasaki, et al., 2008; Guez et al., 2012; Radebach et al., 2013; Steinhaeuser, Ganguly, & Chawla, 2012; Tsonis, Swanson, & Wang, 2008; Y. Wang et al., 2013; Yamasaki et al., 2008), and to serve as tools for automatic mode identification (Bello et al., 2015) or dimension reduction (Falasca et al., 2019; Fountalis et al., 2018) via community detection methods.
When attempting to build a network representation of the climate, the structure of the network is generally not known beforehand and must be inferred by some means. A typical approach is to add or remove edges from the network on the basis of the level of statistical interdependence of pairs of nodes, quantified by, for example, the correlation (Tsonis & Roebber, 2004) or mutual information (Donges et al., 2009a) between time series associated with each node. As these measures of association will almost always be non-zero in finite samples, some level of thresholding or pruning must also be applied in order to exclude edges corresponding to weak or spurious associations. In the simplest case, the result is an undirected network; that is, the presence of an edge between two nodes indicates some level of mutual association, but does not provide information on any possible directionality in the relationship.
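This correlation-and-threshold construction can be sketched in a few lines; the helper below and its threshold value are illustrative, not taken from the works cited:

```python
import numpy as np

def correlation_network(data, threshold=0.5):
    """Undirected network from thresholded pairwise Pearson correlations.

    data: (n_times, n_nodes) array of node time series. An undirected
    edge is retained only where the absolute correlation exceeds
    `threshold`, mimicking the pruning step described in the text.
    """
    corr = np.corrcoef(data, rowvar=False)
    adj = np.abs(corr) > threshold
    np.fill_diagonal(adj, False)  # exclude self-loops
    return adj

# Demo: nodes 0 and 1 share a common signal, node 2 is independent noise.
rng = np.random.default_rng(42)
s = rng.standard_normal(1000)
data = np.column_stack([s + 0.3 * rng.standard_normal(1000),
                        s + 0.3 * rng.standard_normal(1000),
                        rng.standard_normal(1000)])
adj = correlation_network(data, threshold=0.5)
```

Because the adjacency matrix is symmetric, the resulting graph carries no information about directionality, which is the limitation discussed next.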
Networks constructed in this way have proven to be very useful but do have some important limitations. In particular, while correlation graphs allow for comparisons between modeled and observed associations (Falasca et al., 2019), as noted above, it is often of interest whether a particular set of variables is causally related to another set. If there is a causal mechanism, we may also wish to quantify the magnitude of the effects of those causal factors; commonly, climate networks based on the above approaches do not provide direct access to measures of effect. In practice, to answer these sorts of questions it is necessary to imbue the networks with additional structure. This can be naturally achieved by identifying the original graph with an underlying statistical model, that is, by working in the context of a (probabilistic) graphical model (Koller & Friedman, 2009). In a graphical model, the graph encodes, in a well-defined way, the set of qualitative independence relationships between the random variables, corresponding to nodes, in the model (Jordan, 2004). Given a particular functional form for the interactions between variables in terms of the joint probability density function (PDF), quantitative questions can also be formulated and addressed using standard algorithms (Dechter, 1999; Koller & Friedman, 2009; Pearl, 1982, 1988). Graphical models may utilize undirected or directed edges, or even a mixture of both, corresponding to different ways of constructing the joint PDF of the model. Models based on directed acyclic graphs (DAGs), in which all edges also have an associated direction and the graph does not contain any directed closed loops, provide an intuitive representation of complex systems in terms of conditional independence relationships between quantities (Jordan, 2004; Pearl, 1988; Spiegelhalter et al., 1993). Compared to undirected networks, directed graphs have the advantage that edges in the graph can often be interpreted causally.
The conditional independence relationships encoded by the graph structure amount to statements of Granger (non-)causality between nodes in the graph (Eichler, 2012; Runge, 2018a) and thus can be regarded as Granger causal relationships, although stronger statements about causal mechanisms will usually require further assumptions (Peters et al., 2017). As a result, graphical models based on DAGs provide a powerful tool for studying causal relationships (Pearl, 1995), and form a useful basis for other forms of causal inference (Greenland & Brumback, 2002). Bayesian, or belief, networks (BNs), as such models are usually called, have therefore received increasing attention from the climate science community (Ebert-Uphoff & Deng, 2012a), having variously been used for forecasting and risk assessment based on expert systems (e.g., Abramson et al., 1996; Boneh et al., 2015; Catenacci & Giupponi, 2009; Leonard et al., 2014; Peter et al., 2009), for learning independence relationships and possible causal interactions in observations (Di Capua et al., 2020; Ebert-Uphoff & Deng, 2012b; Harwood et al., 2021; Horenko et al., 2017; Kretschmer, Coumou, et al., 2016; Kretschmer, Runge, & Coumou, 2017; Li et al., 2018; Pfleiderer et al., 2020; Runge, 2015, 2018a, 2018b; Runge, Bathiany, et al., 2019; Runge, Nowack, et al., 2019; Runge, Petoukhov, Donges, et al., 2015; Runge, Petoukhov, & Kurths, 2014; Saggioro et al., 2020; Samarasinghe, Deng, & Ebert-Uphoff, 2020; Samarasinghe, McGraw, et al., 2019) and in models (Deng & Ebert-Uphoff, 2014; Ebert-Uphoff & Deng, 2017), and, most recently, for model evaluation (Nowack et al., 2020; Vázquez-Patiño et al., 2020).
One approach for using BNs as tools for model evaluation is to learn the network structure in a model and in observations and assess the agreement between the two. Learning the structure of climate networks, in the absence of expert knowledge, has primarily been achieved by utilizing constraint-based algorithms (Colombo & Maathuis, 2014; Spirtes & Glymour, 1991), in which the set of edges is determined starting from a series of conditional independence tests (Ebert-Uphoff & Deng, 2012a; Runge, Bathiany, et al., 2019) together with predefined constraints, allowing for non-linear dependence structures to be estimated from data. However, the inclusion of edges on the basis of an (initially arbitrary) significance level together with multiple testing adjustments makes assessing the level of confidence in the inferred networks difficult. Instead, various strategies such as varying the hyperparameters of the structure learning algorithm (Ebert-Uphoff & Deng, 2012a) or the form of independence test used (Hlinka et al., 2013) are often employed to identify robust links in a fitted network.
While we may use BNs as tools for model evaluation by comparing differences between modeled and observed networks, it is necessary to assess whether any mismatch is likely due to a genuine bias, or whether a slightly different structure simply happens to be more compatible with the given data than the "true" structure, even if the latter is also consistent with those data. In other words, an additional level of uncertainty quantification is required. While a great deal of insight can be derived from the structure of the estimated networks, quantifying both the sign and magnitude of the interaction between nodes is often relevant as well. When this is achieved by estimating parameters conditional on the inferred structure, characterization of the associated uncertainties is difficult, and (depending on the approach used) may result in insufficiently conservative estimates (Draper, 1995). For the purposes of model evaluation, it is usually of interest whether a model captures both the existence of a link and its correct strength. Unreliable estimates for the uncertainty in fitted interaction strengths may result in an inability to determine if an interaction is present but differs significantly in the model compared to observations. On top of this, the additional complexity involved in implementing such two-stage fitting procedures appears to have discouraged their use (McGraw & Barnes, 2018) compared to simpler model-based analyses framed in terms of Granger causality.
In this article, we investigate possible solutions to the above limitations through the use of Bayesian methods (Uusitalo, 2007). It must be emphasized, of course, that any method will be inherently limited by the finite length of the observational record or model run; the approaches we consider here do not address those uncertainties arising due to, for example, longer-term variability that is not covered by the record, or due to the restricted class of models considered. Nevertheless, given the available data and a particular set of models, Bayesian methods provide a well-founded means of quantifying uncertainties in the estimated structures and parameters conditional on the observed data and modeling assumptions. In the Bayesian framework for learning the structure and effect measures (Madigan et al., 1995; Spiegelhalter & Lauritzen, 1990), it is natural to use the posterior probability of a given network and its associated parameters as a score to measure model fitness (Buntine, 1991; Cooper & Herskovits, 1992; Geiger & Heckerman, 1994; Heckerman et al., 1995). Existing knowledge and constraints may be incorporated into such a score-based approach through the use of suitable prior distributions, although in practice this must be balanced against computational feasibility. Bootstrap (Friedman, Goldszmidt, & Wyner, 1999) or sampling-based (Godsill, 2001; Madigan et al., 1995; Madigan & Raftery, 1994) methods provide some measure of the uncertainties in model selection as well as allowing the predictions of multiple models to be combined via model averaging (Madigan & Raftery, 1994). Consequently, the Bayesian approach provides a principled quantification of the uncertainties associated with estimation of the network structure and parameters, conditional on the observed data and model class.
Models typically used for testing for Granger causal relationships, that is, linear autoregressive models, can be straightforwardly expressed in graphical terms (Arnold et al., 2007;Lèbre, 2009), and, at least under a choice of conjugate priors, analyzed efficiently using closed form expressions for the desired posterior densities (Geiger & Heckerman, 1994). In this sense, the model specification is largely familiar, reducing the barriers to use in climate applications. While there is additional complexity associated with the inference scheme, we believe that, where a Bayesian approach is feasible, the ability to arrive at intuitive uncertainty estimates is a strong reason to employ these methods.
To illustrate the utility of Bayesian methods for structure learning, we consider the application of fitting BNs to a set of climate indices derived from two reanalysis datasets. Our purpose in doing so is two-fold. First, as the score-based approach has not been widely used in analyses of this type, we wish to investigate the suitability of the method for networks of the size encountered in realistic data. Our second, and more important, aim is to perform a comparison of different reanalysis products on the basis of the fitted networks. Structure learning methods have recently been applied as a model evaluation tool by Nowack et al. (2020), in which a constraint-based approach was used to compare the graph structures obtained in CMIP5 models (Taylor et al., 2012) to those found from reanalyses. In contrast, here we focus primarily on the reanalyses themselves, with the goal of characterizing the robust similarities and differences between the products using the score-based method described above. In addition to allowing for possible biases and differences in the reanalyses at the level of interactions between modes to be studied, differences in the networks from different reanalyses give some additional indication of uncertainties in the observed networks that model runs are evaluated against. This, in turn, must be taken into account when deciding what level of disagreement between models and observations can be taken to indicate clear model biases.
While our focus here will be on the benefits of the Bayesian approach to structure learning, we should emphasize that this approach is of course not without its own limitations. In particular, obtaining distributional estimates is necessarily significantly more computationally intensive than analogous constraint-based approaches, especially for the more complex models that may be needed to adequately characterize systems with non-trivial temporal behavior. Consequently, this approach is largely restricted to smaller systems, while for higher-dimensional problems constraint-based algorithms offer better performance. In practice, the choice of method will depend on considerations of system size and properties, and whether a point estimate is sufficient or if distributional information is required. The strengths of constraint-based approaches, and where they might be used, have been extensively discussed in the literature, to which we refer the reader (see, e.g., Runge, Nowack, et al. (2019), and references therein); here we describe the use of an alternative framework in the case where Bayesian approaches are feasible and offer complementary information to that obtained using constraint-based algorithms.
The remainder of this paper is structured as follows. In the next section we provide a brief review of Bayesian network models and the inference methods used to fit such models. In Section 3 we describe the datasets and diagnostics that we study, together with our choice of prior distributions. In Section 4 we present the results of fitting the network models. Finally, we summarize our findings and discuss their implications for follow-up comparisons in Section 5.

Structure Learning
As noted above, a graphical model is simply a statistical model that has associated with it a graph encoding the relationships between the variables in the model. Each random variable in the model is represented by a node in the graph, with the allowed conditional dependence relationships between variables indicated by edges between nodes. Under suitable assumptions (see, e.g., Koller and Friedman (2009) for a review) translating between graphs and joint PDFs that constitute a model can be achieved using a prescribed set of rules.
For BNs and other graphical models based on DAGs, the graph structure implies a factorization of the joint PDF into conditional density functions. Given a set of random variables Y_1, …, Y_n and a DAG G, we denote by pa_G(Y_i) the set of nodes in G that have a directed edge connecting to Y_i. The graph G then provides a representation of the joint PDF P(Y_1, …, Y_n) if the PDF admits a factorization of the form

P(Y_1, …, Y_n) = ∏_{i=1}^{n} P(Y_i | pa_G(Y_i), θ_i),

where θ_i denotes any parameters required to characterize the conditional density. For example, when all of the variables in {Y_i} ∪ pa_G(Y_i) are discrete, the conditional density P(Y_i | pa_G(Y_i), θ_i) is specified by a conditional probability table (CPT). Generally, in geophysical applications, the random variables of interest exhibit non-trivial spatial and temporal correlations. In our case, these variables are a collection of (continuous) climate indices, some of which (e.g., ENSO) show substantial autocorrelation. Feedback loops or temporal dependence of this form cannot be represented in a BN with a single node for each index Y_i, due to the requirement for the graph G to be acyclic (Uusitalo, 2007). To handle this, the set of nodes is expanded to consist of the values of the random variables at the current time t, Y_t^i, as well as the values of the variables Y_{t−τ}^i at previous times t − τ (Friedman, Murphy, & Russell, 1998; Kjaerulff, 1995; K. Murphy & Mian, 1999; K. P. Murphy & Russell, 2002), up to some maximum lag τ_max. Temporal dependencies are described by edges between the nodes corresponding to, say, Y_t^i and Y_{t−τ}^j, and a full time series of observations is described by a graph at each point in time relating the values of the variables at that time point to those in the previous time-slices. Similar graphical models for multivariate time series have also been introduced as time series graphs (Eichler, 2012). The resulting model is referred to as a dynamic Bayesian network (DBN).
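The expansion of each index into a current-time node plus lagged copies can be sketched as follows; the helper name and interface are hypothetical, not code from the paper:

```python
import numpy as np

def make_lagged_nodes(series, tau_max):
    """Expand a multivariate time series into DBN node values.

    series: (n_times, n_vars) array. Returns (targets, lagged):
    targets[t] holds the current-time values Y_t, while lagged[t]
    concatenates the candidate parent values Y_{t-1}, ..., Y_{t-tau_max},
    so each row pairs one time-slice with its possible lagged parents.
    """
    n, d = series.shape
    targets = series[tau_max:]
    lagged = np.hstack([series[tau_max - k:n - k]   # lag-k copies of all vars
                        for k in range(1, tau_max + 1)])
    return targets, lagged

# Example: 3 indices, 100 time steps, maximum lag of 2.
X = np.arange(300, dtype=float).reshape(100, 3)
targets, lagged = make_lagged_nodes(X, tau_max=2)
```

Acyclicity is then automatic for a homogeneous DBN in which edges only run from lagged nodes to current-time nodes.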
In the simplest case, the structure of the graph and the associated parameters remain the same across all time-slices, so that the full time series is modeled by repeating the graph at each time t. The corresponding DBN is said to be (time-)homogeneous (K. P. Murphy & Russell, 2002). However, the assumption of homogeneity is often violated in geophysical applications. Interactions between spatially separated processes, for example, may be seasonally dependent, while on longer time-scales the possibility of climate regime changes, for example, in association with tipping points (Lenton et al., 2008), in response to anthropogenic forcing has recently become a key concern. Non-homogeneous DBNs, in which the graph structure, the parameters, or both are allowed to change over time, admit the possibility of modeling features such as secular trends and regime changes (Wu et al., 2018), at the cost of a significant increase in complexity in terms of model specification and inference. Here we focus on the simpler case of homogeneous models for the purposes of investigating the usefulness of Bayesian methods for assessing model uncertainty; the more complicated case of non-homogeneous models will be described in a separate study.
Writing the full model for observed data D as P(D | G, θ), where θ denotes the collection of all parameters for the conditional PDFs, the learning process can be conveniently divided into two steps. In the first, structure learning stage, the structure of the graph G is sought, independent of specific values of the parameters. Structure learning methods for BNs can be roughly categorized as constraint-based or score-based. The former attempt to reconstruct the graph structure on the basis of conditional independence tests and available prior knowledge and constraints, using, for example, the PC-algorithm (Spirtes & Glymour, 1991) and its extensions (e.g., Colombo & Maathuis, 2014; Runge, 2018a; Runge, Nowack, et al., 2019). As discussed in Section 1, most recent examples in climate science have made use of constraint-based algorithms to learn an initial structure, followed in some cases by a separate parameter learning step. In contrast, in score-based approaches the graph G is estimated by maximizing a suitable score function (Cooper & Herskovits, 1992; Geiger & Heckerman, 1994; Heckerman et al., 1995), such as the marginal likelihood P(D|G) or an information criterion. However, when there may be significant uncertainty associated with the selection of a single model, rather than finding a single optimal model, it may be preferable to account for this model uncertainty by sampling from the full posterior distribution of possible graphs P(G|D) (Madigan et al., 1995). Estimates for derived quantities of interest Δ may then be computed by averaging over the posterior distribution (Draper, 1995; Madigan & Raftery, 1994),

E[Δ | D] = Σ_G E[Δ | G, D] P(G | D).

For structurally modular priors, of the form P(G) ∝ ∏_i P(pa_G(Y_i)), the posterior over graphs also factorizes, as P(G | D) ∝ ∏_i Ψ_i(D; G) P(pa_G(Y_i)), where Ψ_i(D; G) denotes the local marginal likelihood obtained by integrating the conditional density for Y_i over its parameters, so that each factor can be computed independently, up to an overall normalization.
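A minimal sketch of such posterior sampling over structures, here restricted to the parent set of a single node with a symmetric single-edge proposal, is given below. The `score` callable stands in for the log local marginal likelihood plus log structure prior, and the toy score in the demonstration is purely illustrative:

```python
import numpy as np

def mc3_parent_sets(score, n_candidates, n_iter=5000, max_parents=4, seed=0):
    """Metropolis sampler over parent sets of a single node.

    `score(parents)` must return the log marginal likelihood plus log
    structure prior for a frozenset of candidate-parent indices. The
    proposal toggles one edge at random; since this is symmetric, the
    acceptance probability depends only on the score difference.
    Returns the fraction of iterations spent in each visited parent set,
    an estimate of the posterior over parent sets.
    """
    rng = np.random.default_rng(seed)
    current, current_score = frozenset(), score(frozenset())
    counts = {}
    for _ in range(n_iter):
        j = int(rng.integers(n_candidates))
        proposal = current ^ frozenset([j])  # add or remove edge j
        if len(proposal) <= max_parents:
            prop_score = score(proposal)
            if np.log(rng.random()) < prop_score - current_score:
                current, current_score = proposal, prop_score
        counts[current] = counts.get(current, 0) + 1
    return {k: v / n_iter for k, v in counts.items()}

# Toy score that strongly favors the single parent {0}.
posterior = mc3_parent_sets(
    lambda ps: 5.0 if ps == frozenset([0]) else 0.0, n_candidates=3)
best = max(posterior, key=posterior.get)
```

Edge posterior probabilities are then obtained by summing the estimated mass over all parent sets containing a given edge.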

Choice of Conditional Densities
The conditional densities P(Y_i | pa_G(Y_i), θ_i) can, in principle, be chosen to model arbitrary relationships between the random variables in the graph, consistent with the independence assumptions embodied by the graph structure. In practice, however, a general choice prevents marginalizing out the graph parameters (i.e., evaluating Ψ_i(D; G) analytically), so that it is not possible to sample from the marginal posterior distribution P(G|D) directly. It is then necessary to construct a Markov Chain Monte Carlo (MCMC) sampler that samples from the joint posterior P(θ, G|D) using, for example, reversible jump MCMC (RJMCMC) (Green, 1995) or related methods (Carlin & Chib, 1995; Godsill, 2001). A summary of the samplers that we use is given in Appendix A.
In special cases, the necessary integrals can be evaluated in closed form, allowing for the posterior P(G|D) to be efficiently sampled after marginalizing out the conditional PDF parameters. For continuous data, a widely used example for which this is possible is the linear Gaussian regression model (Lèbre et al., 2010; Punskaya et al., 2002). In this model, which can be regarded as a specialization of the BGe model for continuous data (Geiger & Heckerman, 1994), the conditional mean of each variable Y_t^i is given by a linear function of the parent variables pa_G(Y_t^i). The local marginal likelihoods Ψ_i(D; G) and posterior distributions for the parameters of a given graph can be analytically evaluated provided that conjugate normal-gamma priors are assumed for the conditional precisions and regression coefficients, where a_τ, b_τ, and ν² are prior hyperparameters. Similar linear models have previously been applied (Di Capua et al., 2020; Kretschmer, Coumou, et al., 2016; Kretschmer, Runge, & Coumou, 2017; Saggioro et al., 2020) to perform estimation of the interaction strengths after an initial stage of constraint-based structure learning. Using the underlying generative model, Equation 11, posterior estimates for both the structural features and model parameters can be obtained within a single sampling scheme. Appropriate choices for the hyperparameters a_τ, b_τ, and ν² allow varying levels of regularization to be imposed so as to yield more reliable, if more conservative, estimates given relatively short and noisy time series. Alternatively, they may be allowed to vary and another level of priors specified for the unknown hyperparameters.
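As an illustration of the closed-form marginalization, the sketch below computes the log marginal likelihood of one node's linear Gaussian model under a conjugate normal-gamma prior. The parameterization shown, with precision τ ~ Gamma(a_τ, rate b_τ) and coefficients β | τ ~ N(0, ν²/τ I), is one common convention and may differ in detail from the exact choices made in the paper:

```python
import math
import numpy as np

def log_marginal_likelihood(X, y, a_tau=1.5, b_tau=10.0, nu2=1.0):
    """Closed-form log marginal likelihood for a linear Gaussian node.

    Assumed conjugate prior (a standard parameterization, not necessarily
    the paper's): tau ~ Gamma(a_tau, rate=b_tau), beta | tau ~ N(0, nu2/tau I).
    Under conjugacy, both beta and tau integrate out analytically.
    """
    n, p = X.shape
    L0 = np.eye(p) / nu2                  # prior precision matrix (times tau)
    Ln = L0 + X.T @ X                     # posterior precision (times tau)
    mun = np.linalg.solve(Ln, X.T @ y)    # posterior mean of beta
    an = a_tau + 0.5 * n
    bn = b_tau + 0.5 * (y @ y - mun @ (Ln @ mun))
    return (-0.5 * n * math.log(2 * math.pi)
            + 0.5 * (np.linalg.slogdet(L0)[1] - np.linalg.slogdet(Ln)[1])
            + a_tau * math.log(b_tau) - an * math.log(bn)
            + math.lgamma(an) - math.lgamma(a_tau))

# Compare scores for a true parent versus an irrelevant candidate.
rng = np.random.default_rng(3)
x_true = rng.standard_normal(200)
x_null = rng.standard_normal(200)
y = 0.8 * x_true + 0.2 * rng.standard_normal(200)
lml_true = log_marginal_likelihood(x_true[:, None], y)
lml_null = log_marginal_likelihood(x_null[:, None], y)
```

Scores of this kind, one per candidate parent set, are exactly what a sampler over structures requires.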
In the presence of significant non-linearity, linear models of this form may no longer be appropriate. In this case, one strategy for capturing the underlying non-linear relationships is to first discretize the original data and employ models for discrete data, for example, the analytically tractable BDe model (Buntine, 1991; Cooper & Herskovits, 1992; Heckerman et al., 1995). However, this comes at the cost of necessarily losing some information, generally yielding only a coarse approximation of the original continuous distribution (Friedman & Goldszmidt, 1996). Additionally, choosing an appropriate discretization scheme is in general difficult, as there is an inevitable trade-off between having sufficient resolution to describe the data and the exponential growth in the number of parameters required to specify the conditional probability table (CPT) of each variable. Hence, in the following we focus on the case of continuous data.
For a given choice of model, after performing any initial pre-processing, fitting the homogeneous DBN model to the observed indices data then consists of sampling from the posterior distribution P(G|D) using the known marginal likelihood P(D|G) in order to derive a set of candidate networks. For a given graph drawn from P(G|D), the posterior distribution for the parameters, and hence summary statistics and credible intervals, can be computed analytically or obtained via standard within-model MCMC methods.

Toy Model Example
To make the above discussion more concrete, it is helpful to consider a simple toy example. A standard problem in climate studies is to determine the direction of the relationships, if any, between some set of variables. For example, during the positive (El Niño) phase of ENSO, anomalously warm Pacific sea surface temperatures (SSTs) drive elevated mean surface temperatures over North and South America (McGraw & Barnes, 2018). As a simplified example of this sort of driver-response relationship, McGraw and Barnes (2018) considered a two-dimensional system consisting of two observables D_t and R_t that evolve according to

D_t = α D_{t−1} + ε_t^D,    R_t = γ D_{t−τ} + ε_t^R.    (13)

The innovations ε_t^D and ε_t^R are taken to be independent Gaussian noise drawn from a standard normal distribution. Typically, in climate analyses, the relationships between the system variables would be studied by regressing the postulated response (e.g., surface temperature anomalies or R_t) on lagged values of the driver (SST anomalies or D_t), and vice versa to test for the possibility of the reversed relationship. However, this approach is susceptible to detecting spurious relationships when one or both processes exhibit substantial autocorrelation. As noted above, this often occurs in climate applications, where, for example, the driver may correspond to relatively slowly varying boundary conditions such as SST driving an atmospheric response, as in the ENSO-surface temperature example. To account for this, autoregressive models of the form

R_t = Σ_k a_k R_{t−k} + Σ_k b_k D_{t−k} + ε_t    (14)

(and similarly for the dependence of D_t on R_t) may be used instead to test for the presence of Granger causal links between processes.
Depending on the system, however, there can be considerable uncertainty associated with selecting one of these models over the other. We can apply the sampling-based approach described in the previous sections to better quantify this. The two models, Equations 14 and 15, correspond to particular choices of parent set under the linear Gaussian regression model. Thus, to illustrate the method we consider the results of learning the structure of a DBN describing the system Equation 13 under a linear Gaussian model. For given values of the system parameters, we generate a random realization of the system and fit a DBN after standardizing the input data to zero mean and unit variance. For the prior hyperparameters, we take, for example, a τ = 1.5, b τ = 10, and ν² ≈ 43.3, yielding a weakly informative t prior with 3 degrees of freedom for each coefficient, with 90% prior credible intervals of −4 ≤ β ≤ 4. The prior 1% and 99% percentiles for the conditional precision (variance) are 0.57 (0.02) and 56.7 (1.74), respectively. Alternative choices with a heavier tailed distribution for the coefficients and much broader priors for the conditional precision, for example, a τ = 0.5, b τ = 10, and ν² ≈ 2, tend to yield similar parameter estimates but less sparse models. The set of models considered includes all lags of up to 6 time steps and at most four parent nodes for each variable. Posterior samples are obtained using the MC³ algorithm described in Appendix A, with the choices for the structure priors and proposal densities as described in Section 3.3.
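Because the linear Gaussian model admits a closed-form marginal likelihood, competing parent sets can be scored directly. The sketch below assumes a conjugate normal-gamma prior consistent with the hyperparameter description in the text (β | τ² ~ N(0, (ν²/τ²)I), τ² ~ Gamma(shape a τ, scale b τ)); the exact parameterization used in the paper may differ:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, a_tau=1.5, b_tau=10.0, nu2=43.3):
    """Closed-form log marginal likelihood for one node's parent set.

    Assumed conjugate model (a sketch; details may differ from the paper):
        y | beta, tau2 ~ N(X beta, (1/tau2) I)
        beta | tau2    ~ N(0, (nu2/tau2) I)
        tau2           ~ Gamma(shape=a_tau, scale=b_tau)
    """
    n, p = X.shape
    Lam0 = np.eye(p) / nu2                 # prior precision of beta (up to tau2)
    Lam_n = Lam0 + X.T @ X                 # posterior precision (up to tau2)
    m_n = np.linalg.solve(Lam_n, X.T @ y)  # posterior mean of beta
    a_n = a_tau + 0.5 * n
    rate0 = 1.0 / b_tau                    # Gamma rate parameter
    rate_n = rate0 + 0.5 * (y @ y - m_n @ Lam_n @ m_n)
    _, logdet0 = np.linalg.slogdet(Lam0)
    _, logdet_n = np.linalg.slogdet(Lam_n)
    return (-0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * (logdet0 - logdet_n)
            + a_tau * np.log(rate0) - a_n * np.log(rate_n)
            + gammaln(a_n) - gammaln(a_tau))

# Score two candidate parent sets for R_t on synthetic driver-response data.
rng = np.random.default_rng(1)
T = 500
D = np.zeros(T)
R = np.zeros(T)
for t in range(1, T):
    D[t] = 0.2 * D[t - 1] + rng.standard_normal()
    R[t] = (D[t - 2] if t >= 2 else 0.0) + rng.standard_normal()
y = R[2:]
ml_true = log_marginal_likelihood(y, D[:-2, None])   # pa(R_t) = {D_{t-2}}
ml_alt = log_marginal_likelihood(y, R[1:-1, None])   # pa(R_t) = {R_{t-1}}
```

The true parent set receives the higher score; normalizing the exponentiated scores over all candidate parent sets would give the corresponding posterior probabilities under a uniform structure prior.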
Graphical summaries of the results of fitting this Gaussian DBN to realizations of the system Equation 13 for (α, γ, τ) = (0.2, 1, 2), (0.5, 10, 2), and (0.9, 4, 2) are shown in Figure 1.
The true system, Equation 13, corresponds to the graph given in Figure 1a, with the edges labeled by the values of the standardized regression coefficients. The remaining panels show the estimated posterior probabilities p̂ for each edge, computed as in Equation 4, for each realization of the system; for clarity, only edges for which p̂ > 0.5 are shown. Where an edge also appears in the maximum a posteriori (MAP) estimate for the structure, the posterior 95% highest density interval (HDI) is also shown for the corresponding coefficient.
For low levels of autocorrelation and noise, the model recovers the correct edges with virtual certainty; in this case, the MAP structure consists of the (true) parent sets pa(D t ) = {D t−1 } and pa(R t ) = {D t−2 }. Recovery of the true dependence structure is more difficult in the presence of large amounts of noise (e.g., γ = 10) or high autocorrelation (e.g., α = 0.9). McGraw and Barnes (2018) note that, in the latter case, a bivariate Granger causality analysis exhibits reduced power for detecting the D t−τ → R t edge, while lagged regressions too frequently identify spurious wrong-way relationships in which R drives D. For high levels of noise, both methods show reduced power for detecting the driver-response relationship, although accounting for autocorrelation successfully controls the false-positive rate for the reversed relationship. This difficulty in identifying the true underlying model is clearly evident in the lower panels of Figure 1 from the reduced estimated posterior probability of a dependence of R t on D t . In both cases, although the MAP structure suggests that R t depends on lagged values of D t , the lag at which this occurs is misidentified, with pa(R t ) = {D t−1 }. Across all sampled models, the estimated posterior probability for edges from lagged values of D t to R t reflects greater uncertainty in the model structure, and hence in the selection of a single optimal model. For instance, while the MAP parent set for R t when α = 0.5, γ = 10 is found to contain only D t−1 , the next most likely parent sets (for this particular sample, pa(R t ) = ∅ and pa(R t ) = {D t−2 }) also have non-negligible posterior probabilities, so that in practice it may be important to take this model uncertainty into account. A similar uncertainty measure is considerably more difficult to construct and interpret in the testing-based approach of McGraw and Barnes (2018).
As time series with long memory or large noise levels are frequently encountered in the climate analyses that we are interested in, the ability to provide some formal estimate of model uncertainty is an important advantage of the approach adopted here.
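Given a retained sample of structures, the posterior edge probabilities used above reduce to simple relative frequencies. A minimal sketch (the parent-set encoding is an illustrative assumption):

```python
from collections import Counter

def edge_posterior_probabilities(samples):
    """Fraction of retained MCMC draws containing each edge (cf. Equation 4).

    Each element of `samples` is one draw of a node's parent set, encoded
    here (an illustrative assumption) as a frozenset of (variable, lag) pairs.
    """
    counts = Counter()
    for parent_set in samples:
        counts.update(parent_set)
    n = len(samples)
    return {edge: c / n for edge, c in counts.items()}

# Hypothetical sample of 100 parent sets for R_t:
samples = ([frozenset({("D", 1)})] * 60
           + [frozenset()] * 25
           + [frozenset({("D", 2)})] * 15)
p_hat = edge_posterior_probabilities(samples)
# Only the edge from D at lag 1 reaches the p_hat > 0.5 display threshold.
```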

Data and Methods
We now describe an application of the above methods to sets of reanalysis-derived climate indices. To do so, for each reanalysis we apply the methods illustrated in the toy example above to obtain a sample from the posterior distribution of possible structures. From this sample, we can compute summary statistics of interest, such as the posterior probability of the existence of one or more edges, for the purpose of comparing samples derived from different reanalyses. Alternatively, one may consider the properties of individual graphs of interest in the sample, such as the MAP structure. The whole process is sketched in Figure 2. In the following, we largely focus on summary statistics computed over the full sample, but we also briefly consider the differences between the MAP network structures. In addition to uncertainties due to model and parameter selection, in practice the reanalysis datasets that we use have associated uncertainties as well.
Comparison of the networks derived from different products over a common timespan allows the role of these differences to be investigated and provides a baseline against which free-running model simulations can be compared.

Data
The data that we analyze are obtained from the Japanese 55-year Reanalysis (JRA-55) and the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis 1 (NNR1).
The NCEP/NCAR Reanalysis 1 (Kalnay et al., 1996) is an atmospheric reanalysis covering the years 1948 to present. The data assimilation system employs a global spectral model with a T62 resolution on 28 vertical levels, and assimilates surface and atmospheric observational data. While a fixed analysis and forecast system is used for the duration of the reanalysis, changes in observing systems still have an impact and, consequently, the reanalysis is less reliable in the first decade than at later times (Kistler et al., 2001). NNR1 represents a first generation reanalysis providing a multidecadal record of the atmospheric state, albeit with several known errors (Kistler et al., 2001) and biases, particularly in data-sparse regions in the high latitudes and the Southern Hemisphere (SH) (see, e.g., Bromwich & Fogt, 2004; Bromwich et al., 2007; Greatbatch & Rong, 2006; Hertzog et al., 2006; Hines et al., 2000; Lindsay et al., 2014; G. J. Marshall, 2002; G. J. Marshall & Harangozo, 2000). For the purposes of our analysis, global fields of daily mean 500 hPa geopotential height (Z g (500 hPa)), zonal winds at 850 hPa and 200 hPa (u 850 hPa and u 200 hPa ), mean sea level pressure (MSLP), and surface zonal and meridional winds (u sfc and v sfc ) are obtained on the provided 2.5° × 2.5° latitude-longitude grid. Daily mean top-of-atmosphere outgoing longwave radiation (OLR) fields are provided on a T62 Gaussian grid and are subsequently regridded to a 2.5° × 2.5° latitude-longitude grid using a bilinear interpolation scheme. To compute indices of tropical variability based on SST data for NNR1, we use version 1.1 of the HadISST SST data set (Rayner et al., 2003), which provides monthly global SST on a 1° × 1° latitude-longitude grid from 1870 to present.
The JRA-55 reanalysis (Kobayashi et al., 2015), covering the period from 1958 to present, is a more recent atmospheric reanalysis product that aims to take advantage of ongoing improvements in forecasting systems and available observations. As for the NNR1 reanalysis, a frozen analysis system is employed and atmospheric and surface observations are assimilated. The assimilation system used for JRA-55 employs a TL319 resolution operational system with 60 vertical levels. The use of a higher resolution model, together with other updates to the system, has been found to yield improvements in the representation of the synoptic scale atmospheric circulation compared to the previous generation JRA-25 reanalysis (Onogi et al., 2007), although there remain known issues (Harada et al., 2016). Daily mean Z g (500 hPa), u 850 hPa , u 250 hPa , u sfc , v sfc , MSLP, and OLR fields are obtained on a 1.25° × 1.25° latitude-longitude grid. For SST fields, the model surface brightness temperature provided on a 1.25° × 1.25° latitude-longitude grid is used. Where required by the definition of the index as noted below, we regrid the initial fields to a 2.5° × 2.5° latitude-longitude grid using a bilinear interpolation method.

Indices
From the full gridded fields, we compute a set of indices diagnosing the activity of a selection of major global climate modes, which will form the nodes in the fitted graphical models. In a fully data-driven approach, the definitions of such indices might be automatically determined by using community detection methods (Bello et al., 2015; Steinhaeuser, Chawla, & Ganguly, 2011). This has the advantage of accounting for differences between datasets or models in the representation of particular modes, for example, due to shifts in the geographic centers of action. While these approaches have been employed in studies of causal effect networks (Kretschmer, Runge, & Coumou, 2017), here we use a set of fixed, empirical definitions for the climate indices. We do so for two reasons. First, this allows for a simpler evaluation of the performance of the models defined in Section 2, as the features in the fitted networks can be directly compared to well-studied relationships between traditional indices. Such a comparison would be more difficult to perform when using automatically extracted indices, as some differences might arise solely from the definition of the index. Additionally, positional shifts and other differences in the expression of particular modes with respect to a predefined diagnostic are themselves of interest from the point of view of comparing and evaluating models, and so we prefer to use a single set of reference definitions.

We choose a set of indices that provides reasonably comprehensive coverage of the dominant modes of variability active on intraseasonal through to interannual time-scales. Where anomalies are required in the definition of an index, for consistency across the different datasets we compute all anomalies as differences from the daily or monthly climatology calculated with respect to the reference period January 1, 1979-December 30, 2001. Where available, we have checked that our calculation of an index matches publicly available results for the same index. To assist with the identification of individual modes, the spatial patterns corresponding to the indices we include are given in Figures S1-S12, from which the approximate regions corresponding to the various indices may also be deduced.
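The anomaly computation described above can be sketched as follows for a monthly index; the helper function and synthetic series are hypothetical illustrations:

```python
import numpy as np
import pandas as pd

def monthly_anomalies(series, ref_start="1979-01-01", ref_end="2001-12-30"):
    """Anomalies of a monthly series relative to the monthly climatology
    computed over the reference period quoted in the text."""
    ref = series.loc[ref_start:ref_end]
    clim = ref.groupby(ref.index.month).mean()
    return series - series.index.month.map(clim).values

# Hypothetical usage on a synthetic index with an annual cycle:
idx = pd.date_range("1960-01-01", "2005-11-30", freq="MS")
rng = np.random.default_rng(0)
values = np.sin(2.0 * np.pi * idx.month.values / 12.0)
values = values + 0.1 * rng.standard_normal(len(idx))
series = pd.Series(values, index=idx)
anom = monthly_anomalies(series)
```

By construction, the anomalies average to zero for each calendar month within the reference period.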
As measures of tropical variability, we include an updated version (Zhang et al., 2019) of the multivariate ENSO index (MEI) (Wolter & Timlin, 1993), the dipole mode index (DMI) to characterize Indian Ocean dipole (IOD) activity (Saji et al., 1999), and the Wheeler-Hendon Madden-Julian oscillation (MJO) index (Wheeler & Hendon, 2004), denoted below as RMM1 and RMM2. For both the MEI and the RMM1 and RMM2 indices, all of the input fields are evaluated on a common 2.5° × 2.5° latitude-longitude grid. Where required, monthly MJO indices are defined as the monthly mean of the corresponding daily index.
In the extratropical atmosphere, we include indices of the AO, the SAM, the PNA, the PSA1 and PSA2 modes, and a set of modes associated with blocking in the North Atlantic and western Europe. We define the AO and SAM as the leading empirical orthogonal functions (EOFs) (Lorenz, 1956) of anomalies of monthly mean Z g (500 hPa) poleward of 20°N and 20°S, respectively. All anomalies are weighted by the square root of the cosine of the gridpoint latitude when computing the EOFs. Corresponding AO and SAM indices are calculated by projecting the (area-weighted) daily or monthly anomalies onto the leading EOF and normalizing by the standard deviation of the monthly leading principal component (PC).
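The weighted EOF construction just described can be sketched with a singular value decomposition; the helper below is a hypothetical illustration applied to a synthetic anomaly field, with the index normalized over the full record rather than a reference period:

```python
import numpy as np

def leading_eof(anom, lats):
    """Leading EOF of an (n_time, n_points) anomaly matrix.

    Each grid point is weighted by the square root of the cosine of its
    latitude, as described in the text; the index is the projection of the
    weighted anomalies onto the leading pattern, normalized by its standard
    deviation.
    """
    weights = np.sqrt(np.cos(np.deg2rad(lats)))
    weighted = anom * weights                    # broadcast over grid points
    _, _, vt = np.linalg.svd(weighted, full_matrices=False)
    eof = vt[0]                                  # leading spatial pattern
    pc = weighted @ eof                          # principal component series
    return eof, pc / pc.std()

# Hypothetical demo: recover a planted pattern from a noisy synthetic field.
rng = np.random.default_rng(0)
n_time, n_points = 200, 50
lats = np.linspace(20.0, 88.0, n_points)
pattern = np.sin(np.linspace(0.0, np.pi, n_points))
signal = rng.standard_normal(n_time)
field = 3.0 * np.outer(signal, pattern) + rng.standard_normal((n_time, n_points))
eof, index = leading_eof(field, lats)
```

Note that the sign of an EOF is arbitrary, so the recovered index may be anticorrelated with the planted signal.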
The PNA pattern is taken to be the leading mode obtained after performing a VARIMAX rotation (Kaiser, 1958) of the first 10 EOF modes of monthly standardized anomalies of monthly mean Z g (500 hPa) polewards of 20°N during boreal winter, taken to be December, January, and February (DJF). The PNA index is then the projection of the standardized height anomalies onto the resulting pattern, standardized by the monthly mean and standard deviation within the climatology reference period. The analogous modes in the SH, PSA1, and PSA2, are defined as the second and third modes in an EOF analysis of year-round anomalies of daily mean Z g (500 hPa) polewards of 20°S, projecting onto each mode and normalizing by the standard deviation of the corresponding PC over the reference period to obtain associated indices.
Following the method presented in Straus et al. (2017), we define a set of Euro-Atlantic circulation regimes via a k-means clustering analysis of the leading 24 PCs of boreal winter anomalies in daily mean Z g (500 hPa) in the sector 20°N-80°N, 90°W-30°E, after applying a 10 day running mean smoothing. This method has the advantage of better capturing spatial asymmetries present in opposing phases of the NAO.
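The k-means step can be sketched as follows, here applied to synthetic two-dimensional "PC" data rather than the leading 24 PCs used in the text:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical illustration: four well-separated clusters in a 2-D PC space.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])
pcs = np.concatenate([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

np.random.seed(0)  # kmeans2 draws from NumPy's global state when unseeded
centroids, labels = kmeans2(pcs, 4, minit="++", iter=100)
# Each centroid plays the role of a circulation-regime composite; a regime
# index is then the projection of anomalies onto the corresponding centroid.
```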
Using k = 4 clusters, we obtain a set of patterns that correspond to the positive and negative NAO phases, NAO + and NAO − , as well as Atlantic Ridge (AR) and Scandinavian blocking (SCAND) patterns, which are associated with blocking events in the Atlantic and western Europe, respectively. The index for each cluster is obtained by projecting the daily or monthly height anomalies onto the anomaly composite associated with that cluster (i.e., the cluster centroid). Each index is then standardized using the appropriate monthly mean and standard deviation of the monthly index over the reference period. All of the indices are available over a common period extending from 1960 to November 30, 2005, and this full time period is used for fitting DBNs. This time period has been chosen to facilitate later comparisons with historical model simulations. As some indices (e.g., SAM) exhibit significant trends over this period, we estimate and remove a linear trend for every index beforehand. Each index time series is then standardized to have zero mean and unit variance over the fitting period. It should be noted that the use of monthly mean data for atmospheric processes, in conjunction with the exclusion of contemporaneous edges, may give rise to edges that do not reflect direct physical processes but are instead due to unmodeled interactions taking place on time-scales of less than 1 month. While these choices reduce the cost of fitting the models, their impact should be kept in mind when attempting to interpret the results physically. For our purposes, that is, of comparing the overall structure of networks derived from two reanalyses, differences in unresolved processes that manifest as otherwise spurious edges are still of interest in terms of highlighting differences between the two products. This is the case even if direct attribution to an underlying process is not possible at the chosen temporal resolution.
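The detrending and standardization steps applied to each index can be sketched as:

```python
import numpy as np

def detrend_standardize(x):
    """Remove a least-squares linear trend, then standardize to zero mean and
    unit variance over the fitting period, as described in the text."""
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    return (resid - resid.mean()) / resid.std()

# Hypothetical usage on a synthetic trending index (~46 years of monthly data):
steps = np.arange(552)
raw = 0.01 * steps + np.sin(steps / 5.0)
raw = raw + 0.3 * np.random.default_rng(0).standard_normal(552)
index = detrend_standardize(raw)
```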

Priors and Sampler Settings
As noted in Section 2, when excluding the possibility of contemporaneous edges, it is natural to choose structurally modular priors, such that the prior for a graph G decomposes into independent priors for the parent sets pa G (Y i t ) of each of the n indices included in the model. We fix a maximum allowed lag τ max , such that the nodes in the network at a given time t are the n(τ max + 1) variables Y i t−τ , for i = 1, …, n and τ = 0, …, τ max . For networks based on monthly indices, we take τ max = 6 months, corresponding to the approximate e-folding time of the MEI. To enforce some degree of sparsity in the networks, we also impose a constraint on the maximum parent set size for each index, |pa G (Y i t )| ≤ p max . Subject to these constraints, in the absence of additional information we adopt uniform priors over the set of possible parent sets for each index, that is, P(pa G (Y i t )) ∝ 1 for each allowed parent set. The MCMC sampling schemes described in Appendix A also require an appropriate proposal density for proposing updates to the structure of the model. We choose to adopt a uniform proposal density on graphs G′ in the neighborhood of the current graph G, q(G′|G) = 1/|nhd(G)| for G′ ∈ nhd(G). The neighborhood of a graph G, nhd(G), is defined as the set of graphs that can be reached from that structure by a single move in a predefined move set. Note that, when an explicit distinction is made between structure and parameter updates, as in the basic RJMCMC scheme in Appendix A, we do not include the current graph itself in the neighborhood. We take the set of possible moves to consist of the addition of a single edge, the deletion of a single edge, or an exchange of two edges (Grzegorczyk & Husmeier, 2011). The neighborhood of a graph contains those graphs that can be reached from it by performing one of these three moves, subject to the imposed condition on the maximum parent set size. A structure update move thus consists of determining the neighborhood of the current graph based on the available moves before selecting with equal probability one graph from this set.
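The move set defining the neighborhood can be sketched per node as follows; the (variable, lag) encoding of parents is an illustrative assumption:

```python
def neighborhood(parent_set, candidates, max_parents=4):
    """Single-move neighborhood of one node's parent set: all single-edge
    additions, deletions, and exchanges, subject to the maximum parent-set
    size. Parents are encoded as (variable, lag) pairs (an illustrative
    convention)."""
    pa = frozenset(parent_set)
    outside = [c for c in candidates if c not in pa]
    nbrs = set()
    if len(pa) < max_parents:
        for c in outside:                       # additions
            nbrs.add(pa | {c})
    for p in pa:                                # deletions
        nbrs.add(pa - {p})
    for p in pa:                                # exchanges: drop p, add c
        for c in outside:
            nbrs.add((pa - {p}) | {c})
    return nbrs

# For candidates D and R at lags 1 and 2 and current parent set {D at lag 1}:
candidates = [("D", 1), ("D", 2), ("R", 1), ("R", 2)]
nbrs = neighborhood({("D", 1)}, candidates)
# 3 additions, 1 deletion (the empty set), and 3 exchanges give 7 neighbors;
# a structure update proposes one of them uniformly at random.
```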
As the models considered here allow for the node parameters to be marginalized out, we use the MC³ sampling algorithm (Algorithm 2 in Appendix A) with the above proposal for fitting the model. For each index, posterior samples were obtained by running eight chains of length 10⁷ samples. The number of samples to be retained for analysis was chosen based on the estimated convergence rate from short initial runs following Brooks et al. (2003), where we required the thinning to be such that the resulting dependence between samples was reduced by a factor of 100 compared to the dependence between successive samples. To provide some assessment of chain convergence, homogeneity of the distribution of parent sets within chains was monitored using χ² and Kolmogorov-Smirnov tests (Brooks et al., 2003) for each index, although we note that, as always, these tests are not sufficient alone to determine approximate convergence (Sisson, 2005). In the following, both the posterior distribution over possible models and that over individual features is of interest. Thus, in addition to applying these tests to the distribution of overall model indicators, we also monitored the results for the case when each model was labeled by a binary indicator for the presence of each possible edge, so as to check whether inferences for the individual posterior edge probabilities were homogeneous across chains. Where required, samples for the node parameters were drawn from the conditional posterior distributions by Gibbs sampling.
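A cross-chain homogeneity check of the kind described can be sketched with a chi-squared contingency test; the counts below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

def chain_homogeneity_pvalue(chain_counts):
    """Chi-squared test that the distribution of sampled parent sets (or of a
    binary edge indicator) is homogeneous across independent chains;
    `chain_counts` is an (n_chains, n_categories) array of counts."""
    stat, pvalue, dof, expected = chi2_contingency(np.asarray(chain_counts))
    return pvalue

# Hypothetical counts of three parent-set labels in two chains:
homogeneous = [[500, 300, 200], [510, 290, 200]]
divergent = [[500, 300, 200], [200, 300, 500]]
p_ok = chain_homogeneity_pvalue(homogeneous)
p_bad = chain_homogeneity_pvalue(divergent)
```

A small p-value flags chains whose parent-set distributions differ, indicating that the sampler has not yet mixed adequately.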
The associations between the continuous valued indices are modeled using the linear Gaussian conditional density, Equation 11. For each index, we take the hyperparameter values a τ = 1.5, b τ = 20, and ν i ² ≈ 3, for i = 1, …, n. This corresponds to independent t 3 marginal priors for the regression coefficients, with 95% prior HDI −1 ≤ β ≤ 1. The 1% and 99% percentiles for the conditional precision are 1.1 and 113.4, respectively. Prior simulations suggest that this choice of priors yields a reasonable scale for the prior predictive distribution for the 1-step ahead forecast values, while not being so heavily regularized that relatively large values of the coefficients and indices are excluded. Similarly, the conditional precision hyperparameters are chosen to yield typical monthly innovation variances of order 1 or less. As these choices lead to somewhat informative priors, to assess the sensitivity of the results to the hyperparameter values we also perform fits with the much more weakly informative choices of a τ = 0.5, b τ = 10, and ν i ² ≈ 2 (corresponding to a 90% prior HDI of −4 ≤ β ≤ 4 and prior 1% and 99% percentiles for τ² of 7.6 × 10⁻⁴ and 33.2, respectively), which we find lead to qualitatively similar results (see supporting information).
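The quoted implications of these hyperparameter choices can be checked numerically, assuming the conditional precision prior τ² ~ Gamma(shape a τ, scale b τ) and a marginal t prior with 2a τ degrees of freedom and scale √(ν²/(a τ b τ)) for each coefficient; this parameterization is inferred from the quoted intervals and may not match the paper's exactly:

```python
import numpy as np
from scipy import stats

# Hyperparameters for the main fits quoted in the text (assumed
# parameterization: precision ~ Gamma(shape=a_tau, scale=b_tau)).
a_tau, b_tau, nu2 = 1.5, 20.0, 3.0

# 1% and 99% percentiles of the conditional precision (quoted as 1.1, 113.4):
q01, q99 = stats.gamma.ppf([0.01, 0.99], a=a_tau, scale=b_tau)

# Half-width of the 95% prior interval for a coefficient (quoted as +/- 1):
half_width = stats.t.ppf(0.975, df=2.0 * a_tau) * np.sqrt(nu2 / (a_tau * b_tau))
```

Under these assumptions the computed percentiles and interval half-width reproduce the values quoted in the text to within rounding.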

Results
The result of the structure MCMC algorithm described above is a sample from the approximate posterior distribution of possible graph structures for a given model class. From this sample, we may derive a large range of summary statistics with which the two reanalyses can be compared. Below, we will primarily focus on comparing the estimated posterior probabilities for individual edges in the structures, and hence show summary DAGs indicating the posterior probability estimated for each edge. While this representation of the sample preserves the information of interest about the lagged relationships found in the data, the spatial structure of the networks (implicit in the definitions of the indices) is less evident. Given a DAG derived from the posterior samples, we note that, where needed, some idea of the spatial structure can be obtained by mapping the individual nodes to the regions in which modes of variability are localized. An example of this is given in Figure 3, where we show the location of the nodes in a network derived for Northern Hemisphere (NH) modes during boreal winter. In Figure 3, the temporal structure is collapsed to show an edge only if two nodes have an edge between them with estimated posterior probability >0.5. Doing so amounts to a loss of information regarding the fitted temporal structure, but does highlight the role of within-hemisphere and tropical to extratropical interactions; the presence of an (apparently spurious) cross-hemispheric edge is discussed further in Section 4.2.

Full-Year Networks for Monthly Indices
We first consider the results of applying the above methods to year-round monthly indices for the two reanalyses. To better display the spatial structure of interactions, we separate the full network into subgraphs consisting of those indices corresponding to NH extratropical, tropical, and SH extratropical modes, together with their inferred parent sets. In each network, nodes corresponding to the indices are sorted according to the approximate latitude at which they occur, from north to south (from left-to-right or top-to-bottom, as appropriate), to provide some indication of the hemispheric distribution of interactions. In these subgraphs, an edge with high posterior probability indicates a strong association between the parent and child node, either through strong autocorrelation or Granger causal links.
Strong evidence of long-range dependence on lagged values, reflecting the expected high levels of autocorrelation, is apparent for indices representing ENSO and the MJO, with the fitted models for both products featuring posterior probabilities near one for dependence of the MEI and both RMM indices on lags of up to 4 months (Figures 4 and 5). In both cases, the preferred lags are consistent with the expected time-scales for these modes, noting that a maximum lag of 6 months is imposed when fitting the model. Purely atmospheric modes, on the other hand, exhibit little memory beyond time horizons of several weeks to a month. This is apparent in the fitted networks for both models in the absence of strong evidence for dependence of the NH extratropical (Figure 6) and SH extratropical indices (Figure 7) on lagged values of themselves beyond lags of 1 month. For instance, in both NNR1 and JRA-55 there is found to be little, or at most relatively weak, evidence for serial dependence on monthly time-scales for the AR and SCAND indices, which respectively represent blocking in the Atlantic and Scandinavia with a characteristic time-scale of 7-10 days. Although the DMI diagnoses Indian Ocean SST variability, the partial autocorrelation structure of the monthly mean DMI is consistent with an AR(1) process, and hence only the edge DMI t−1 → DMI t is found to have appreciable posterior mass.
Where a high posterior probability edge is inferred between distinct indices, the edge generally matches well-known associations or teleconnections among the modes included in the fit. This is highlighted, for example, in Figure 8, where the posterior probabilities for edges entering the AO, NAO + , and NAO − nodes in both reanalyses are shown. The close association between the AO and NAO is clearly evident in both reanalyses, and the estimated dependence relationships involving just these two indices are in essentially exact agreement. For both NNR1 and JRA-55, the presence of an edge in the network from the AO at lag 1 to each of the child nodes AO t , NAO + t , and NAO − t is inferred with very high confidence; the same is true for the corresponding edges from the lag 1 NAO + and NAO − nodes. This clear interdependence of the AO and NAO is, of course, very well established, to the extent that the existence of the former as a distinct physical mode has been debated (Ambaum et al., 2001; Deser, 2000). The appearance of this relationship in the learned structures does, however, provide a useful check that the method recovers the expected relationships among particular modes. That these features agree between the two reanalyses also suggests that the relationships between these modes are consistent in the separate datasets. Similarly, the fitted parent sets for the monthly PNA index in NNR1 and JRA-55, shown in Figure 9, both include with posterior probability >0.5 dependence on the value of the PNA index at a lag of 1 month (p̂ ≈ 0.95 and p̂ ≈ 0.91 in NNR1 and JRA-55, respectively) and on the MEI at lag one (p̂ ≈ 0.71 and p̂ ≈ 0.59 for NNR1 and JRA-55, respectively). In this case, the results of the fit indicate that in both reanalyses there is reasonable evidence that the PNA is associated with tropical forcing, here captured by the lagged value of the MEI, consistent with previous observational and modeling studies (e.g., Franzke, Feldstein, & Lee, 2011; Hoskins & Ambrizzi, 1993; Trenberth et al., 1998).
The posterior mass for this edge is similar in NNR1 and JRA-55, suggesting that, as for the AO and NAO, this relationship is also consistent in the two products, in the sense that there is comparable evidence for the association across the two datasets.
Although the structures estimated from the NNR1 and JRA-55 datasets agree well in the above examples, there are notable differences in the estimated edge probabilities as well. Moreover, several of these differences involve features that are assigned appreciable posterior probability (i.e., p̂ ≥ 0.5) based on the data from one reanalysis and not the other. That is, the differences are not limited only to associations that are weakly supported in both datasets. For example, in Figure 9, the estimated structure for NNR1 contains an edge from the lag one monthly mean RMM1 index to the monthly PNA index at lag 0 with high posterior probability. The same feature in JRA-55 is found to have a posterior probability p̂ ≈ 0.38, that is, less than 40% of that found for NNR1. Given the observed evidence for interactions between tropical convection and extratropical modes such as the PNA on intraseasonal time-scales (Lau & Phillips, 1986), in this case the weaker evidence for an MJO-PNA relationship in the JRA-55 data may arise due to the suppressed MJO observed in the JRA-55 reanalysis (Harada et al., 2016). It should be noted, however, that directly attributing the difference to this bias is not straightforward; for example, other known associations between the MJO and other modes of variability are identified in the JRA-55 fits, as discussed below. Confirming whether there is a difference in this particular feature, and the underlying source of the difference, would require a closer examination of the representation of the two processes in the reanalyses, which is left for further studies. Nevertheless, this does highlight that comparison of the learned parent sets between the two reanalyses allows differences in the captured interactions between modes to be identified, particularly when we may have confidence that such a difference is in fact present.
By adopting a Bayesian approach to structure learning, the level of confidence in this difference can be estimated: while the edge RMM1 t−1 → PNA t may be found in individual models for the time-evolution of the PNA index in both reanalyses, the existence of this feature is ∼2.5 times more likely in NNR1 than in JRA-55 given the class of models and possible parent nodes that we consider. The presence of this edge with reasonable confidence in one product and not the other is suggestive of an underlying bias in one reanalysis, rather than simply being due to sampling variability.
Other differences in edges with large posterior mass, however, do not have quite as clear possible sources in specific underlying biases. In both reanalyses, the parent set of one of the NAO nodes is estimated to contain the monthly PSA2 index at lag one, but with a posterior probability that exceeds 0.5 in only one of the two products; in the other, the estimated probability is <0.5. An edge PSA1 t−1 → AR t is also present in both reanalyses. O'Kane, Risbey, et al. (2016) and O'Kane, Monselesan, and Risbey (2017) concluded that the PSA modes predominantly reflect dynamics localized to within the SH waveguide, such that a direct physical interaction between the PSA2 mode and the NAO is unlikely. Thus, in contrast to the RMM1 t−1 → PNA t feature, one might expect that the differing level of confidence in this edge between the two reanalyses may be due to differences in relevant factors that are omitted from this simple analysis. In particular, as the fit shown in Figure 8 is based on data from all seasons, the presence of such a feature may reflect seasonal covariations in the extratropical circulation in both hemispheres that are otherwise unaccounted for here. While we consider the effects of seasonality in Section 4.2 below, ultimately direct determination of the root of these differences must be based on detailed evaluation of the two products. Here we seek only to highlight the use of the fitted networks for learning possible dependence relationships, and hence their utility for guiding comparative analyses on the basis of identifying relationships that can be inferred to be present or absent with reasonable confidence.
Features for which there is lower confidence tend to differ more between the two reanalyses, although in these cases it is less clear whether they indicate substantive differences. This is most evident in the subgraphs corresponding to the parent sets for the SH extratropical indices in Figure 7. While there is agreement between the models fitted to NNR1 and JRA-55 in features such as the 1-month memory in the SAM and the presence of an association (even if not a direct interaction) between the PSA1 and tropical forcing captured by the MEI, a larger number of features with posterior mass p̂ ≥ 0.5 are found using the NNR1 data, involving a larger set of indices as parents. This greater disagreement between reanalyses, and overall lower confidence in the inferred non-independence relationships, may in part be due to the disparate resolutions and configurations of the respective atmospheric models in combination with relatively sparse observations in the SH prior to the satellite period. Spurious associations may arise as a result of the low monthly mean resolution data used here, as well as from the effects of omitted factors such as seasonal changes in the circulation, and low signal-to-noise ratios outside of the SH winter. For instance, the posterior distributions over parent nodes for the SAM index in the two reanalyses are summarized in Figure 10. In both NNR1 and JRA-55, non-zero associations are found with the MEI at lags of 4 and 6 months, albeit with somewhat higher posterior probabilities in JRA-55, and in NNR1 an edge AR t−5 → SAM t is identified with p̂ ≈ 0.52. The same edge in JRA-55 is found to have p̂ ≈ 0.42.
Noting that a similar, comparatively low-confidence relationship is found between the NAO + and the PSA1 at a lag of 4 months in NNR1, and absent a mechanism for such interactions, it appears likely that in this case the association is an artifact arising from the use of year-round, monthly mean data; we confirm that this is the case in the following section. As noted above, more generally for processes that evolve on time-scales shorter than a month, the use of monthly mean data without allowing contemporaneous edges is not sufficient to fully capture the relevant fast processes that are unresolved in this analysis. Together with the seasonal effects discussed in Section 4.2, this may give rise to spurious interactions and must be kept in mind when interpreting these plots. In practice, while we may reasonably expect to capture aspects of the longer time-scale variability present in these modes, to better understand differences arising from shorter time-scale processes it is necessary to use data with higher temporal resolution. It is worth noting, however, that the overall low posterior weight for these edges in both reanalyses allows them to be identified as lacking robustness. This in turn is of use for guiding model evaluation, where differences from observations in low-probability relationships are of potentially less relevance.
It is important to bear in mind that the structures discussed above summarize the marginal posterior probability for the presence or absence of each individual edge over all possible models, rather than the presence or strength of an association between two indices within a single model. In addition to inspecting the estimated marginal distributions for each feature, given a sample from the approximate posterior distribution it may also be of interest to consider aspects of the sample that involve the joint occurrence of one or more edges, as well as the posterior distribution over complete models. In particular, point estimates for the parent set, analogous to those obtained using constraint-based approaches, can be obtained as the maximum a posteriori (MAP) estimate, that is, the structure with the largest posterior probability. As all parent sets are assigned equal prior probability, this structure is simply the one that maximizes the marginal log-likelihood. Conditional on a particular MAP structure, estimates for the parameters under the model may be obtained simply by sampling from the conditional distributions P(θ|G, D), which in the case of the linear Gaussian model can be evaluated in closed form. Thus, in the sampling-based approach we may also obtain estimates of the strength of an association, conditioned on a particular model, in addition to the above estimates for the probability of the presence of the corresponding edge in the structure.
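To make this distinction concrete, the following sketch (with hypothetical index names and a toy sample, not our actual fits) computes marginal edge probabilities by model averaging alongside the empirical MAP parent set from a sample of structures:

```python
# Hypothetical sketch: given a sample of parent sets drawn by a structure
# sampler (e.g., MC^3), estimate marginal edge probabilities and the
# empirical MAP parent set. Index names and the toy sample are
# illustrative stand-ins, not the fits reported in the text.
from collections import Counter

def edge_probabilities(sampled_parent_sets):
    """Marginal posterior probability of each edge, averaged over models."""
    n = len(sampled_parent_sets)
    counts = Counter()
    for parents in sampled_parent_sets:
        counts.update(parents)
    return {edge: c / n for edge, c in counts.items()}

def map_parent_set(sampled_parent_sets):
    """Most frequently sampled complete parent set (empirical MAP)."""
    counts = Counter(frozenset(p) for p in sampled_parent_sets)
    best, _ = counts.most_common(1)[0]
    return set(best)

# Toy sample of parent sets for a single node:
sample = [{"AO_t-1"}, {"AO_t-1", "PSA2_t-1"}, {"AO_t-1"},
          {"AO_t-1", "PSA2_t-1"}, {"AO_t-1"}]
probs = edge_probabilities(sample)
# AO_t-1 appears in every sampled model, PSA2_t-1 only in a minority,
# so the two edges receive very different marginal probabilities even
# though only one complete structure is the MAP estimate.
```

Note that an edge can appear in the MAP structure while still having modest marginal probability, which is exactly the situation described for the entries of order ∼0.3 in Table 1.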
For example, the MAP parent sets for the NH extratropical modes in each reanalysis are summarized in Table 1, where we show both the estimated posterior probability for each edge in the parent set and the posterior mean and 95% HDI for the corresponding coefficient in Equation 11. In general, edges present in the MAP parent sets tend to have at least moderately high posterior probabilities, that is, they correspond to non-independence relationships that are also present in a large fraction of the sampled models. Edges found in the MAP parent set are not necessarily found in the majority of possible models, however, with several edges in Table 1 having posterior probabilities of order ∼0.3. For these edges, while their inclusion leads to a good fit for this particular sample, the data do not provide especially strong evidence for their presence compared to other edges in the MAP structure, given the set of all other possible predictors. To some extent this also reflects the fact that multiple models with parent sets that do not include these edges may still provide a reasonable fit to the observed data, despite not maximizing the marginal likelihood, and hence can account for non-negligible posterior mass. Considering only the single most probable structure may therefore fail to take into account relevant model uncertainty. Notably, differences between the MAP structures for the two reanalyses are again not restricted only to edges with low posterior mass, suggesting that at least some differences may be due to systematic effects rather than to minor differences in the two samples. For example, the absence of the edge RMM1 t−1 → PNA t in the MAP parent set from JRA-55 indicates not only that there is weak evidence for this relationship in the data set, but also that it is not required in order to produce a good fit to the observed time series.
Where an edge is present in the MAP parent set for both reanalyses, there is good agreement between the two products in terms of the estimated coefficient. For the NH modes, the two reanalyses yield very similar estimates for the strength of each association, while there are somewhat larger differences for the tropical and SH extratropical modes (see supporting information).

Seasonal Networks for Monthly Indices
Overall, the fitted networks based on full-year data show good agreement between the two reanalyses. Some of the differences noted in the previous section may arise due to confounding or spurious associations generated by, among other things, the omission of relevant variables in the fit (i.e., failure of causal sufficiency, in the case that the models are interpreted as being causal). An obvious possible factor is the seasonal variation in relationships between nodes that arises as a result of seasonal changes in the background flow, and hence in the available pathways for propagation of disturbances (e.g., Ambrizzi et al., 1995; Hoskins & Ambrizzi, 1993). In general, this seasonal dependence should be treated by appropriately modeling the seasonal nature of the interactions, for example by including a systematic seasonal component together with seasonal indicators as nodes within the graph (Harwood et al., 2021) and/or allowing for time-varying network structures. For simplicity, however, here we consider the results of restricting the analysis to a single season, so as to account for the larger part of the interseasonal variation. We repeat the above analysis when restricted to data in the 3-month winter season for each hemisphere. Note that, as lags of up to 6 months are still allowed, observations entering into these fits also include lagged values of the indices during the previous season. For brevity, we restrict our attention to only a subset of the major climate modes within each hemisphere.
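The seasonal restriction described above (winter targets, with lagged predictors allowed to reach back into the preceding months) can be sketched as follows; the helper function and the toy monthly series are purely illustrative and are not part of the actual analysis pipeline.

```python
# Illustrative sketch of the seasonal restriction: the *target* observation
# is limited to DJF months, but lagged predictors of up to 6 months may
# fall in the preceding autumn, as noted in the text.
def winter_lagged_rows(series, months, max_lag=6, season=(12, 1, 2)):
    """Build (target, lagged predictors) pairs for winter-season targets.

    series: values ordered in time; months: calendar month of each value.
    """
    rows = []
    for t in range(max_lag, len(series)):
        if months[t] in season:  # restrict only the target month to DJF
            lags = [series[t - k] for k in range(1, max_lag + 1)]
            rows.append((series[t], lags))  # lags may lie outside DJF
    return rows

# Two years of monthly data, with values equal to the month number so the
# seasonal selection is easy to see:
months = [m for _ in range(2) for m in range(1, 13)]
rows = winter_lagged_rows(months, months)
# The first usable DJF target is December of year one, with lagged values
# reaching back through June-November.
```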
In Figure 11, the estimated posterior probabilities for the parent sets of the NAO + and NAO − indices during DJF are shown. As for the full-year fits, in both reanalyses there is strong evidence for an association between the AO and the two phases of the NAO, with approximately equal posterior mass assigned to the corresponding edges. The edge from the lagged PSA2 index into the NAO identified in the full-year fits, on the other hand, is no longer found to have appreciable posterior mass (≈0.1 in both reanalyses), consistent with this edge arising as a result of the use of full-year data. In both NNR1 and JRA-55, a relationship between the MJO, via the value of the RMM1 index at a lag of 2 months, and the positive phase of the NAO is found in a large fraction of sampled models (≈0.67 in NNR1 and ≈0.83 in JRA-55). Interactions between the MJO and the NAO have previously been reported in winter season observations (e.g., Lin et al., 2009) arising from known dynamical mechanisms (e.g., Frederiksen & Frederiksen, 1993). The inferred presence of this relation, with moderately high confidence, indicates that both products are consistent in capturing this association, although it should again be noted that monthly mean data are being used here, in contrast to the higher temporal resolutions at which such interactions are usually studied. A number of further edges are found with estimated probabilities of ≈0.55, 0.51, and 0.54 in JRA-55, respectively, and ≈0.49, 0.43, and 0.45 in NNR1. As these features are present in both reanalyses with relatively similar posterior weights, there is some, if comparatively weak, evidence for these associations from both products, and the two reanalyses thus appear to be consistent. More notably, the feature DMI t−6 → NAO t is estimated to have a posterior probability of ≈0.52 based on the indices computed using NNR1 and HadISST data, while when fitted to JRA-55 the same feature is assigned a posterior mass of only ≈0.09, implying that substantially weaker evidence is found for the presence of this edge in the JRA-55 data.
We note that, while the two reanalyses may be consistent with respect to some of these remaining less strongly supported edges, it is difficult to envisage a direct physical mechanism acting at these time-scales. In these cases, it seems likely that the resulting associations are spurious, or arise as artifacts from the use of monthly mean data, as is reflected by the lower posterior probability assigned to them. To the extent that understanding such differences at the process level is of interest, it is important to recognize that the temporal resolution of the data and the underlying model form will play an important role in determining whether this can be done. For the purposes of performing an initial comparison of the two reanalyses, the use of monthly mean data is sufficient to identify several major teleconnections between different modes; we plan to investigate the use of higher temporal resolution in concert with non-homogeneous models in future work.
For fits based on austral winter (June-July-August, JJA) data, similar results are found, in that several apparently spurious associations cease to be present in a large fraction of the sampled structures. In Figure 12, the estimated posterior probabilities for parent nodes of the PSA1 index during JJA are shown for the two reanalyses. Compared to the previous, full-year fits in Figure 7, the set of edges with high posterior mass from NNR1 data no longer includes long-range dependence on the positive phase of the NAO, or on the PSA2 index. Instead, an edge RMM1 t−5 → PSA1 t is found with posterior probability ≈0.60, with the same feature being obtained from the JRA-55 data with ≈0.77. The JRA-55 fit also contains an edge RMM2 t−6 → PSA1 t, with an estimated posterior probability roughly twice that in NNR1, where the corresponding feature is found to have ≈0.42. Associations between the MJO and the PSA modes on intraseasonal time-scales during winter have previously been noted (Mo & Paegle, 2001), although the fitted models here assign greater posterior weight to dependence at longer lags. While the posterior probability for the presence of RMM1 as a predictor of the monthly mean PSA1 is roughly consistent between NNR1 and JRA-55, the approximate factor of two difference in the sampled posterior probability for the RMM2 edge is more sizable and may provide weak evidence of a difference between the two reanalyses in terms of this relationship. A similar statement may be made for the MEI t−1 → PSA1 t edge, for which the fitted posterior probabilities are ≈0.59 in JRA-55 and ≈0.22 in NNR1.
Interestingly, approximately half (≈0.49) of the sampled structures in NNR1 instead contain an edge PNA t−1 → PSA1 t , which may reflect the effects of a common dependence on tropical forcing. Overall, there are a larger number of features with high posterior mass in the JRA-55 fits. As usual, the precise underlying reasons for these differences are not determined by the fits alone; possible biases, such as a poorer representation of the wintertime SH circulation in one product compared to the other, would require further detailed follow-up. Our purpose here has been to highlight the use of Bayesian structure learning as a tool to identify possible differences, and to estimate the level of uncertainty associated with each.

Summary
Probabilistic graphical models provide a natural and intuitive framework with which to describe the complicated interactions between climate processes. As a result, they are increasingly being applied for the purposes of studying potential causal relationships and for model evaluation (Nowack et al., 2020; Vázquez-Patiño et al., 2020). In this latter application, models are generally evaluated on the basis of structural comparisons between graphs inferred from observations and from model runs. Additionally, the strength of associations may be compared by performing a second stage of model fitting, conditional on the inferred structure. The use of constraint-based approaches allows high-dimensional systems with substantial memory to be treated efficiently, and often with better performance than traditional Granger causal analyses. As such, these methods constitute a powerful lens for examining differences between models and observations.
That being said, constraint-based approaches do have limitations when used as tools for model evaluation. One challenge arises from the difficulty of assessing the sampling properties of the method (Draper, 1995; Madigan & Raftery, 1994), with the result that it can be difficult to understand the uncertainties associated with a particular choice of structure. When feasible, this difficulty can in principle be overcome by employing a Bayesian approach to structure learning. By learning a posterior distribution over possible structures, rather than selecting a single graph, overall model uncertainty can be quantified and accounted for. This can be particularly important where multiple different structures may all be nearly equally well supported by the data, in which case selection of a single model may overestimate the confidence warranted in particular features. Given a sample from the model posterior distribution, by averaging over the set of possible models the posterior credibility of given features may instead be estimated in the Bayesian approach to identify edges that are well supported by the data. Subsequent estimation of the model parameters conditional on a given structure is straightforward, and provides a basis for comparing the magnitude of relationships between different processes. However, the additional computational cost of this fully Bayesian approach restricts its use to lower dimensional systems in practice; for higher-dimensional problems, constraint-based methods remain the preferred option.
The result of the sampling-based structure learning algorithms is a sample from the set of possible models, from which posterior probabilities for particular features can be derived. In this way, robust features for which there is high confidence may be identified, and the set of such edges may in turn form the focus of model comparisons. To illustrate this approach, we have applied an MCMC-based approach to learn DBNs describing associations between climate modes in two different reanalyses, with the goals of identifying robust structural differences between the two products and establishing a set of baseline estimates for subsequent model evaluation studies.
In general, features in the networks derived from NNR1 and JRA-55 data that have high estimated posterior probabilities agree reasonably well. While not surprising, as both reanalyses attempt to provide an estimate of the same climate state, this provides some reassurance that consistent results are obtained using a sampling algorithm, and that the expected associations and dependence structures, such as long memory in oceanic modes and close correspondence between the AO and NAO, are recovered with reasonable levels of certainty. Differences between the models estimated from the two reanalyses are not limited only to edges with low posterior mass, however. In some cases, these differences involve modes for which there are known biases in one reanalysis or the other; the lack of evidence for a dependence of the PNA on the MJO in JRA-55 is one such example. In other cases, these features may arise as a result of omitted confounding effects, such as seasonal cycles and other common drivers, that nevertheless differ between the two reanalyses. Some evidence for this is found by considering networks fitted from data restricted to only the winter season in each hemisphere. In these fits, apparently spurious cross-equatorial dependencies present in fits to year-round data are no longer found to have strong supporting evidence. A greater number of differences between the networks derived from the two reanalyses are found for edges that have lower posterior probabilities, with this being particularly noticeable for the SH modes. This may in part be due to greater differences in the representation of these modes, as well as lower signal-to-noise ratios, at least outside of the austral winter.
It is important to note that in this study our aim has not been to perform a detailed evaluation of the differences between the two reanalysis products. Extensive characterization of the performance and biases of both NNR1 and JRA-55 has been done in the past, and further investigations of the differences found here would require additional detailed study of the involved processes in each product. Rather, our purpose has been to demonstrate the applicability of the Bayesian approach to structure learning in deriving graphical models suitable for use as tools for process-based model evaluation. We argue that an important aspect of this application is the need to account for inevitable model uncertainty in order to identify differences that are likely to be robust and hence represent genuine model biases. Independent of the context of the analysis, this can be naturally achieved in the Bayesian approach. By considering a relatively straightforward initial application to reanalysis data, for which there exists (at least some) consensus on the interactions between modes, we have shown that a score-based sampling approach recovers the expected relationships, while also providing additional benefits over constraint-based approaches in the form of estimates for the posterior distribution over models and features.
Given this, we have primarily focused on a single analysis of year-round monthly mean data, or data within a single season. While sufficient to illustrate the relevant features of the results, in order to utilize this approach in the context of causal discovery it would be necessary to further extend the analysis presented here. In addition to careful selection of the relevant variables, time periods, and temporal resolution, further consideration should be given to the form of the likelihood and priors used in defining the model. In this work, the simplest case of a linear model with conjugate priors on the parameters defining the conditional PDFs has been used, together with priors on the structures to ensure structural modularity. No prior restriction has been enforced to ensure stationarity of the resulting autoregressive model. Additionally, no attempt has been made to incorporate pre-existing or expert knowledge into the definition of the chosen priors. More complex forms for the conditional PDF, as well as the use of non-conjugate priors and inclusion of additional constraints, may be directly handled using a generic reversible jump MCMC instead of the more specialized MC³ sampler used here.
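For concreteness, the closed-form local marginal likelihood of a linear Gaussian node can be sketched under one standard conjugate choice of priors. The specific prior form assumed below (β | σ² ∼ N(0, σ²ν²I), σ² ∼ IG(a, b)) and the hyperparameter names are illustrative assumptions, not necessarily the exact configuration used in our fits.

```python
# A minimal sketch (not the paper's actual code) of the closed-form local
# marginal likelihood for a linear Gaussian node under conjugate
# normal-inverse-gamma priors: beta | sigma^2 ~ N(0, sigma^2 * nu2 * I),
# sigma^2 ~ IG(a, b). Hyperparameter names are assumptions.
import math
import numpy as np

def log_marginal_likelihood(X, y, nu2=1.0, a=1.0, b=1.0):
    """log p(y | parent set), with parents' lagged values as columns of X."""
    T, p = X.shape
    V0_inv = np.eye(p) / nu2                  # scaled prior precision
    Vn_inv = V0_inv + X.T @ X
    Vn = np.linalg.inv(Vn_inv)
    mu_n = Vn @ (X.T @ y)                     # posterior mean of beta
    a_n = a + 0.5 * T
    b_n = b + 0.5 * (y @ y - mu_n @ Vn_inv @ mu_n)
    _, logdet_Vn = np.linalg.slogdet(Vn)
    logdet_V0 = p * math.log(nu2)
    return (-0.5 * T * math.log(2 * math.pi)
            + 0.5 * (logdet_Vn - logdet_V0)
            + a * math.log(b) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a))

# Comparing candidate parent sets for a synthetic node: a strong true
# parent should receive a higher local score than an unrelated predictor.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
unrelated = rng.standard_normal(200)
y = 0.9 * x1 + 0.3 * rng.standard_normal(200)
score_true = log_marginal_likelihood(x1[:, None], y)
score_unrelated = log_marginal_likelihood(unrelated[:, None], y)
```

With a uniform prior over parent sets, comparing such scores directly corresponds to comparing posterior probabilities of the candidate structures.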
Whether or not the resulting model has analytic structure, care must be taken in the design of the sampler and in assessing the approximate convergence of the simulation to the target posterior distribution. For the results presented here, the non-parametric convergence diagnostics of Brooks et al. (2003) have been used to monitor (non-)convergence, but in general assessing convergence for trans-dimensional MCMC remains challenging (Sisson, 2005). In our case, trace plots and convergence diagnostics applied to individual edge indicators suggested that (model-averaged) estimates derived from individual chains were consistent and stable, but there was evidence for non-homogeneity in the posterior distribution over structures across chains based on 10⁷ samples. For this reason, we have avoided making definitive statements with respect to Bayes factors and other model-specific quantities, recognizing that further sampling would be required for these to be reliably determined. Poor mixing and multi-modal posterior distributions over models may also be problematic, with the sampled chains remaining trapped in the vicinity of a single mode. Given the large model space considered here, this is a concern, as ensuring that the chains have adequately explored the full posterior distribution is unlikely to be feasible, raising the possibility that the fits have converged to a local mode in the model space. For generating causal hypotheses, it may be more suitable to restrict attention to smaller systems of variables. On the other hand, the relatively close agreement between the fits for the two reanalyses shown here suggests that the simulations explore sufficiently similar regions of the model space for useful qualitative comparisons to still be performed. As in this paper only two data sets were compared, direct inspection of the individual fitted networks, amounting to visual inspection of the sampled posterior distributions, was sufficient. In more extensive evaluation studies, this may be extended by the use of appropriate summary measures computed on the fits for different models.
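As one simple example of such a summary, cross-chain agreement for a single edge indicator can be screened with a chi-square statistic on the per-chain inclusion counts. This is only a rough sketch in the spirit of the diagnostics discussed above, not the Brooks et al. (2003) procedure itself.

```python
# Rough sketch of a cross-chain agreement check for one edge indicator:
# the chi-square statistic for a 2 x C contingency table of inclusion
# counts across C chains. Purely illustrative.
def edge_indicator_chi2(inclusion_counts, chain_lengths):
    """Chi-square statistic for edge in/out counts across chains.

    inclusion_counts[c]: sampled structures in chain c containing the edge;
    chain_lengths[c]: total structures sampled in chain c.
    """
    total_in = sum(inclusion_counts)
    total = sum(chain_lengths)
    p_hat = total_in / total              # pooled inclusion frequency
    stat = 0.0
    for n_in, n in zip(inclusion_counts, chain_lengths):
        for observed, expected in ((n_in, n * p_hat),
                                   (n - n_in, n * (1 - p_hat))):
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat  # compare against a chi-square(C - 1) critical value

# Chains agreeing on ~50% inclusion give a small statistic; a chain stuck
# near a different mode inflates it sharply.
consistent = edge_indicator_chi2([500, 510, 490], [1000, 1000, 1000])
stuck = edge_indicator_chi2([500, 510, 950], [1000, 1000, 1000])
```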
The DBN models obtained for the NNR1 and JRA-55 reanalyses form, in principle, a set of ground truth results against which free-running models may be compared; we are currently performing such a comparison over the historical period. However, to perform detailed evaluations of models it is essential to consider the relevant time-scales involved and the possibility of time-varying coupling between climate modes. In this respect, while sufficient for the purpose of illustrating the benefits of Bayesian approaches to structure learning, the homogeneous models employed here for monthly data are too restricted. One of the most notable features of the observed climate over the historical period is the existence of secular trends in the behavior of particular modes (O'Kane, Risbey, et al., 2016). Key open questions remain as to whether there is evidence for accompanying changes in the underlying interactions, on top of the seasonally varying coupling between different climate modes noted above. Similarly, it is important to assess whether models indicate the possibility of changes in these relationships under future forcing scenarios and, if so, whether these changes take the form of regime shifts (Franzke, O'Kane, et al., 2015) or more smoothly occurring transitions. Addressing these questions requires the use of models that adequately incorporate time-varying interactions (Harwood et al., 2021; Saggioro et al., 2020). Non-homogeneous network models based on higher temporal resolution data, in which either the model parameters or structure are allowed to vary, represent one possibility for extending the current approach to enable more meaningful evaluations of climate models. In a forthcoming article, we perform just such a comparison against the reanalysis-derived networks for a subset of the CMIP5 ensemble to study the existence of regime transitions in projections. Assessing the evidence for, say, the presence of sudden structure changes versus slow variation in the underlying network parameters does not entail any additional conceptual changes, making the approach presented here well-suited for investigating such questions.

Appendix A: MCMC Sampling Methods
In this appendix we summarize the MCMC methods used for structure learning. We employ the composite parameter space formulation described in detail in Godsill (2001). For the models considered in the main text, in the absence of same-time edges the networks are structurally modular and the parent set associated with each index can be inferred independently. However, in general a given structure G will describe the joint distribution of all of the indices simultaneously at a given time. To accommodate such models, we keep the notation relatively general.
The observed data take the form of a time series of one or more indices, D = {y_1, …, y_T}. The allowed set of models ℳ used to describe these data is taken to be specified a priori, with each model G ∈ ℳ having an associated set of parameters. Let θ denote the collection of all such parameters across all of the possible models in ℳ. For example, for the set of all linear Gaussian models, the vector θ would contain all of the possible regression coefficients β_{i,j} associated with the possible edges, as well as the conditional precisions τ²_i and intercept parameters β_{0,i}. Any given model G will only use a small subset of all of the possible parameters. The subset of the parameters used by a structure G will be written θ_G = {θ_k : k ∈ I_G}, where I_G is an appropriate index set indicating which components of the complete parameter vector θ are required by G. The remaining parameters not used by G will be denoted by θ_{−G}, and the space of pairs (G, θ) is the composite model space (Godsill, 2001).
To infer the network structure and parameters given data D, we aim to sample from the posterior distribution P(G, θ|D) ∝ P(D|G, θ)P(G, θ). The likelihood P(D|G, θ) is taken to depend only on the subset of parameters θ_G associated with G, while the prior distribution P(G, θ) is taken to be of the form P(G, θ) = P(θ_G|G)P(θ_{−G}|G)P(G), where P(θ_{−G}|G) corresponds to a set of proper pseudo-priors (Carlin & Chib, 1995) for the parameters not used by G, and may otherwise be chosen essentially freely. The samplers that we use correspond to Metropolis-Hastings schemes in the composite model space (Godsill, 2001), with a proposal density of the form q(G′, θ′; G, θ) for a move from (G, θ) to (G′, θ′) and corresponding acceptance probability α = min{1, [P(D|G′, θ′)P(G′, θ′)q(G, θ; G′, θ′)] / [P(D|G, θ)P(G, θ)q(G′, θ′; G, θ)]}. When the class of models considered does not admit analytic evaluation of any of the required integrals, we make use of the simple reversible jump MCMC scheme given in Algorithm 1. At each iteration, either an update to the parameters associated with the current structure G is chosen, with some fixed probability, or a structure update move is proposed.
Note that, when the probability of a parameter update move is the same in both states, this is simply an ordinary Metropolis-Hastings update for a single model. If instead a structure update move is chosen, a new structure G′ is proposed according to the proposal distribution q_G(G′; G). Any parameters that are common to both G′ and G are held fixed at their current values, while any new parameters θ_{G′\G} are sampled from an additional proposal density q_θ(θ_{G′\G}). Parameters that are either present in the initial structure but not in the proposed structure, or are not used by either, are left unchanged for simplicity. The acceptance ratio for this move is then r = [P(D|G′, θ_{G′})P(θ_{G′}|G′)P(G′)q_G(G; G′)q_θ(θ_{G\G′})] / [P(D|G, θ_G)P(θ_G|G)P(G)q_G(G′; G)q_θ(θ_{G′\G})]. This structure update move is just a particular case of the general reversible jump move (Green, 1995) with a unit Jacobian. For more general mappings from the current to proposed parameters, a non-trivial Jacobian factor would remain following the change of variables in the proposal density Equation A1.
The parameters associated with the current structure can be updated via a standard Metropolis-Hastings or Gibbs step, as in the parameter update move for the simple reversible jump scheme. Parameter and structure updates can either be proposed randomly or performed in a fixed order. For the models presented in Section 4, the conditional posterior distribution for all of the parameters can be evaluated analytically, that is, the set of newly proposed parameters θ_{G′\G} is empty. In this case, the scheme reduces to the MC³ scheme of Madigan et al. (1995), with the dependence on the graph parameters dropping out entirely. The acceptance ratio for a structure drawn according to q_G(G′; G) is in this case r = [P(D|G′)P(G′)q_G(G; G′)] / [P(D|G)P(G)q_G(G′; G)], where the likelihood P(D|G) can be written in terms of the local marginal likelihoods, as in Equation 8. Particular choices of the proposal density and structure priors are described in Section 3.3. We summarize the resulting sampling scheme for the structures in Algorithm 2.
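A minimal sketch of an MC³-style sampler over the parent sets of a single node might read as follows, assuming a function score(parents) that returns the local log marginal likelihood under a uniform structure prior. The single-edge toggle proposal used here is symmetric, so the Hastings correction cancels; the toy score and index names are hypothetical, and this is not the sampler configuration used for the fits in the main text.

```python
# Highly simplified MC^3-style sketch over parent sets of a single node,
# with a symmetric add/remove-one-edge proposal and a uniform structure
# prior, so acceptance depends only on the marginal-likelihood scores.
import math
import random

def mc3_parent_sets(score, candidates, n_iter=10_000, seed=0):
    rng = random.Random(seed)
    current = frozenset()              # start from the empty parent set
    current_score = score(current)
    sample = []
    for _ in range(n_iter):
        flip = rng.choice(candidates)  # propose toggling one edge
        proposal = current ^ {flip}    # symmetric difference: add or remove
        proposal_score = score(proposal)
        # Metropolis acceptance on the local marginal likelihood:
        if math.log(rng.random()) < proposal_score - current_score:
            current, current_score = proposal, proposal_score
        sample.append(current)
    return sample

# Toy log-score strongly favoring inclusion of one hypothetical parent,
# with a mild complexity penalty on parent-set size:
def toy_score(parents):
    return 3.0 * ("AO_t-1" in parents) - 1.0 * len(parents)

sample = mc3_parent_sets(toy_score, ["AO_t-1", "PSA2_t-1", "MEI_t-4"])
freq_ao = sum("AO_t-1" in s for s in sample) / len(sample)
# The strongly supported edge appears in most sampled structures, while
# the penalized edges appear in a minority, mirroring the marginal edge
# probabilities discussed in the main text.
```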
For each sampling method, we run multiple chains, discarding the first half of each sample as burn-in. For the MC³ sampler, approximate convergence of the chains to the target distribution is assessed using the χ² and Kolmogorov-Smirnov tests proposed in Brooks et al. (2003). The implied marginal prior for each regression coefficient is a Student-t distribution with scale ν²b_τ/a_τ and 2a_τ degrees of freedom, β ∼ T_1(0, ν²b_τ/a_τ, 2a_τ). For the models shown in Section 4, a_τ, b_τ, and ν²_i are taken as fixed hyperparameters. In practice, the signal-to-noise parameter ν²_i may be poorly known, in which case it is possible to also sample it from a conjugate inverse gamma prior using the sampling schemes for models with partial analytic structure discussed in Appendix A. For a given par-