A Bayesian Deep Learning Approach to Near-Term Climate Prediction

Since model bias and the associated initialization shock are serious shortcomings that reduce prediction skill in state-of-the-art decadal climate prediction efforts, we pursue a complementary machine-learning-based approach to climate prediction. The example problem setting we consider consists of predicting the natural variability of North Atlantic sea surface temperature on interannual timescales in the pre-industrial control simulation of the Community Earth System Model (CESM2). While previous works have considered the use of recurrent networks, such as convolutional LSTMs and reservoir computing networks, in this and similar problem settings, we focus here on feedforward convolutional networks. In particular, we find that a feedforward convolutional network with a DenseNet architecture is able to outperform a convolutional LSTM in terms of predictive skill. Next, we consider a probabilistic formulation of the same network based on Stein variational gradient descent and find that, in addition to providing useful measures of predictive uncertainty, the probabilistic (Bayesian) version improves on its deterministic counterpart in terms of predictive skill. Finally, we characterize the reliability of the ensemble of ML models obtained in the probabilistic setting using analysis tools developed in the context of ensemble numerical weather prediction.


Introduction
The climate system consists of diverse yet interconnected components, such as the atmosphere and oceans, each of which can exhibit complex, multiscale, and chaotic behaviors.
Additional interactions and feedbacks among these subsystems drive dynamic evolution over an enormous range of spatial and temporal scales in the climate system (IPCC, 2007; Stocker et al., 2013; Masson-Delmotte, 2021). In this setting, comprehensive climate models have emerged as a powerful tool in helping unravel and better comprehend the myriad processes underlying climate and climate change. Moreover, studies using such models have greatly improved the understanding of climate system processes over the past few decades (Masson-Delmotte, 2021).
Importantly, comprehensive climate models have helped to better anticipate the climate system's response to external forcings, such as those stemming from increased greenhouse gases, that typically are realized on a timescale of a few decades or longer. At shorter timescales, at which natural variability plays an increasingly important role, however, improvements in the ability to predict climate are not commensurate with advances in understanding dynamics and processes (IPCC, 2007; Stocker et al., 2013; Masson-Delmotte, 2021). Notably, improvements in predicting the El Niño-Southern Oscillation (ENSO) remain more of an exception than the rule. Poor predictive skill at the shorter timescales is due to the fact that sources of predictability at these timescales reside in modes of natural variability of the climate system, and models have difficulty in representing and capturing such modes and their timing with adequate accuracy.
Modes of natural variability in the climate system often are associated with delicate balances between multiple physical processes, and realizing those same balances in a climate model is difficult. This leads to biases in a model's representation of the modes of variability.
These model biases also exist in the representation of the mean climate state. While the downstream dynamical consequences of such model biases tend to be both complicated and manifold, from a dynamical systems perspective, an overall consequence tends to be that the model attractor is biased as well.
One way to understand poor predictive skill at shorter timescales is in terms of bias in the model's representation of the climate attractor: when initialized predictions attempt to realize the predictability associated with natural variability by initializing the model state to be consistent with an observed climate state, the biased model attractor quickly pulls the model state away. This leads to the model trajectory exhibiting a jump away from the observed trajectory toward the biased model attractor, a jump that typically involves complicated nonlinear dynamics. The almost invariable effect is a loss of predictive skill (Nadiga, Verma, et al., 2019).
The current approach for dealing with this loss of skill consists of statistically correcting the predictions in a post-processing step. Given the complicated, nonlinear dynamics that effect the dramatic readjustment of the flow field (e.g., Sanchez-Gomez et al., 2016), namely, the jump-like behavior of the initialized prediction trajectory, it is unlikely that statistical post-processing of the predictions can correctly compensate for these dynamics.
Given the problems associated with the comprehensive-climate-model-based approach to near-term predictions, we are interested in investigating and developing alternative data-driven approaches to such predictions. Herein, we split future climate into "near-term" and "long-term" and define near-term to mean the period over which initial conditions (IC) matter. Thus, while long-term predictability is solely determined by boundary conditions (BC) and/or forcing, near-term predictability is affected by both BC/forcing and IC.
In particular, we are investigating the utility of an approach for predicting near-term variations in a quantity of interest (QoI) that is based on learning the spatiotemporal variability of that QoI in a controlled setting. Such learning can be achieved using both feedforward and recurrent neural networks (FNNs and RNNs, respectively), as well as the transformer networks that are beginning to outperform RNNs in at least certain applications. Using RNNs to learn spatiotemporal variability can be traced back to the application of optical-flow-based computer vision techniques to extrapolate radar echo images for nowcasting convective precipitation (e.g., see Sakaino, 2012). Further developments along these lines, wherein precipitation nowcasting is formulated in the general framework of a "sequence-to-sequence" learning problem (transforming a sequence of past radar maps into a sequence of future radar maps), quickly led to the proposal of the convolutional long short-term memory (convLSTM) architecture (Xingjian et al., 2015). In essence, a convLSTM network determines the future states of a QoI at a spatial location using past states of a local neighborhood and other inputs. Subsequently, convLSTM has emerged as a machine learning (ML) technique that delivers good performance in various applications. This is especially evident in previous work involving the climate-relevant settings of predicting interannual variations of global surface temperature and of sea surface temperature in ocean basins (Nadiga, Jiang, & Farimani, 2019; Park et al., 2019; Jiang et al., 2019). As such, even as other sequence-to-sequence architectures (e.g., attention-based transformers) have emerged and are displacing convLSTM as the state of the art, this work is restricted to considering feedforward architectures and comparing their performance to that of convLSTM. We will report on ongoing work using attention-based methods elsewhere.
Another contribution of the present work is to consider ML-based prediction of near-term climate variations in a probabilistic, as opposed to deterministic, fashion. In the context of numerical weather prediction (NWP), the chaotic nature of atmospheric dynamics necessitates considering the evolution of an ensemble of trajectories in order to make reliable forecasts (of the one trajectory that actually is realized in the observed weather system). Starting with the statistical-dynamical prediction methods of Epstein (1969), and more widely adopted at NWP centers across the globe since the early 1990s, probabilistic forecasts using an ensemble prediction system (EPS) have proven valuable in improving the skill of weather forecasts.
Likewise, we expect that probabilistic ML models of the spatiotemporal variability of climate will be both more skillful and more useful than deterministic ML models. However, probabilistic ML remains in its infancy. As such, developing and applying efficient probabilistic deep learning models is difficult, and studies examining their utility and performance are few. In this context, assuming that the network parameters (weights and biases) are random variables and applying Bayes' rule provides the theoretical basis for inferring the posterior distribution of the network parameters that best fit the training data. Here, we note that (parametric) variational inference (VI) methods were developed to efficiently approximate such inference computationally by minimizing the Kullback-Leibler (KL) divergence between an approximate posterior and the true posterior. Subsequently, to extend the use of VI beyond the specialized families of distributions that enjoy particular conjugacy properties, approaches to nonparametric VI have been developed. Our study considers the Stein variational gradient descent (SVGD) approach to nonparametric VI. By adapting and applying this probabilistic deep learning approach to the climate prediction problem being considered, we find that, as in the context of NWP, a probabilistic ML approach serves to improve on the skill of a deterministic ML approach.
Next, we comment on the nature of the ensemble in a probabilistic ML setting. For this, it is useful to note that in the NWP setting (a first-principles-based setting), two kinds of ensembles typically have been used in EPSs: (1) initial-condition ensembles (ICE), where the model is assumed to be perfect, its configuration is held fixed, and uncertainty in the estimation of the state of the system is represented by an ensemble of initial conditions; and (2) perturbed-physics ensembles (PPE), where the initial condition is held fixed, but the parameterizations used to represent unresolved processes are perturbed to represent uncertainty related to model error. In the current data-driven probabilistic ML setting, the uncertainty represented by the ensemble may be thought of in the PPE sense, as the diversity of predictions can be traced back to perturbations of the weights and biases that constitute the model's parameters. In this data-driven setting, while it is true that the probabilistic learning algorithm is trying to learn generalities over a diverse set of ICs to infer the required perturbations of the ML model parameters, the IC diversity in the training data is not a representation of uncertainty in state estimation as would be required for an ICE.
Finally, we make novel use of diagnostics developed for NWP-EPS in an ML context. This is motivated by the fact that the goal of ensemble prediction, whether it is in the more traditional context of ensemble prediction systems or in the current probabilistic ML context, is for the prediction to span the range of likely outcomes given the uncertainties (Leith, 1974). These diagnostics are based on the joint analysis of error and ensemble variance.
To the best of our knowledge, we use these diagnostics for the first time in the context of probabilistic ML to gain added insight into both the network architecture and the process of probabilistically inferring the weights of the network. In this context, we note, however, that the joint analysis of error and ensemble variance can be carried out in many ways, and we consider only the most elementary of such methods. Using the error-spread and rank-histogram diagnostics, we find, in a global sense, that the ML prediction ensemble is underdispersed, and that this behavior persists as the size of the ensemble is enlarged. This leads us to further consider the reliability diagnostics in a spatially localized, or fine-grained, sense. On doing so, a more complicated picture emerges: there are some regions, such as the subpolar North Atlantic, where the ML ensemble is actually overdispersed, but other, larger regions, such as the equatorial and tropical North Atlantic, where the ensemble is underdispersed; in the aggregate, an overall underdispersive behavior emerges. As such, making changes to the probabilistic ML methodology that further improve the reliability of the prediction ensemble and make it optimally reliable is tricky.
The rest of the paper is organized as follows. Section 2 presents the details of the data and the definition of the prediction problem. Section 3 discusses the proposed Bayesian deep learning model, including the key derivation, architecture design, training and testing procedures, and implementation guidelines. Section 4 presents a comparative study of the different models and covers a qualitative and quantitative examination of the climate predictions.
Finally, conclusions and suggestions for future developments are provided in section 5.

Data: North Atlantic SST
We consider the spatiotemporal variability of sea surface temperature (SST) in the North Atlantic over the last 800 years of the pre-industrial control simulation, or piControl, a simulation in which external forcing is held fixed, from the Community Earth System Model (CESM2) (Danabasoglu et al., 2020), run as part of the sixth phase of the Coupled Model Intercomparison Project (CMIP6). CESM2 is a global coupled ocean-atmosphere-land-land-ice model, and the piControl simulation considered herein uses the Community Atmosphere Model (CAM6) and the Parallel Ocean Program (POP2) at a nominal 1° horizontal resolution in both the atmosphere and the ocean; readers can refer to Danabasoglu et al. (2020) for details. These data are publicly available from the CMIP archive at https://esgf-node.llnl.gov/projects/cmip6 and its mirrors. The monthly data display variability on a large range of spatial and temporal scales. The largest spatial variation is in the meridional (i.e., latitudinal) direction, while the largest temporal variation is at the annual timescale and represents the seasonal cycle. Because both variations are easily learned and predicted, we preprocess the data to remove these components: the latitudinal variation is eliminated by subtracting the time-mean SST at each geographical location, and the seasonal cycle is removed by considering a 12-month moving average, also at each geographical location.
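As a concrete illustration, the preprocessing just described can be written in a few lines; the sketch below assumes the monthly SST fields are available as a numpy array, and the function name, array shapes, and the use of a centered moving average are illustrative rather than taken from any released code.

```python
import numpy as np

def interannual_anomaly(sst):
    """sst: (n_months, H, W) monthly-mean SST fields."""
    anom = sst - sst.mean(axis=0, keepdims=True)   # remove time-mean map at each location
    kernel = np.ones(12) / 12.0                    # 12-month moving average filters out
    return np.apply_along_axis(                    # the seasonal cycle at each location
        lambda ts: np.convolve(ts, kernel, mode="same"), 0, anom)
```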

Formulation of the Learning Problem
Without loss of generality, we cast the near-term climate prediction problem in a video-prediction format, with the model input-output relationship described by a mapping of the form

M : X ∈ R^(n_x × H × W) → Y ∈ R^(n_y × H × W),    (1)

where X and Y denote the respective model input and output, n_x and n_y represent the numbers of samples of x and y along the temporal dimension (chronologically ordered and at a constant sampling frequency), and SST is considered on a regular latitude (H) and longitude (W) spatial grid. Equivalently, the prediction problem may be written as

Y = M(X),    (2)

where x_k denotes the current state: a direct prediction of the future states Y = [x_{k+1}, ..., x_{k+n_y}] is made given a sequence of past and current states X = [x_{k−n_x}, ..., x_k]. The schematic in Figure 1 outlines the prediction problem.
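For concreteness, a minimal sketch of how one input-output pair is assembled from the anomaly fields under the above definitions (function and variable names are ours, not from any released code):

```python
import numpy as np

def make_pair(fields, k, n_x, n_y):
    """fields: (T, H, W) anomaly maps; returns one (X, Y) pair around time k."""
    X = fields[k - n_x : k + 1]       # past and current states [x_{k-n_x}, ..., x_k]
    Y = fields[k + 1 : k + 1 + n_y]   # future states [x_{k+1}, ..., x_{k+n_y}]
    return X, Y
```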


Methodology
The goal is to develop efficient probabilistic deep learning models for near-term climate prediction using advanced inference methods in the context of deep neural networks.
This section describes our Bayesian learning strategy, the network architectures used, and other specifics regarding the training and testing procedure.

Bayesian Deep Learning
Estimation and quantification of the various sources of uncertainty is critical to establish the reliability of an ML model and provide an assessment of confidence in its predictions (Ghahramani, 2015). This aspect of modeling is particularly important in the context of deep learning because of the large number of parameters that have to be learned in the DL setting. To that end, we consider a probabilistic formulation that allows for characterizing uncertainties associated with both the data and the model (Kendall & Gal, 2017).
Bayesian deep learning (BDL) treats the network parameters w as random variables generated from a prior distribution p(w); as discussed below, we assume the weights have a fully factorized Gaussian prior with zero mean and a precision α that is Gamma-distributed. By constructing the likelihood function p(D|w) from the given training data set D = {x_i, y_i}_{i=1}^{N_train}, Bayes' rule can be used to infer the posterior distribution of the network parameters w:

p(w|D) = p(D|w) p(w) / p(D).    (3)

Subsequently, the predictive distribution p(y|x, D) can be obtained by averaging p(y|w; x) over samples of the posterior, w ∼ p(w|D).
For the regression problem stated in Section 2.2, consider a deterministic neural network

y = f(x, w),    (4)

where x is the input, y is the output, and the parameters w include both the weights and biases. While deterministic DL models treat the network parameters w as deterministic unknowns, BDL considers w as random variables to account for the epistemic uncertainty induced both by limitations of the model (hypothesis set) considered and by limited sampling of the data. A further additive noise term n is used to model the irreducible aleatoric uncertainty in the data, leading to y = f(x, w) + n.

Prior definition
As little is known about the network parameters before training, a non-informative prior typically is suggested to minimize the bias associated with the introduction of a prior (Neal, 2012). Assuming that the prior is a fixed distribution independent of the input, we find that imposing a sparsity-inducing prior on the weights w via a hierarchical Bayesian model provides good performance. In particular, the epistemic uncertainty of the model parameters is described by a fully factorized Gaussian with zero mean and Gamma-distributed precision:

p(w | α) = N(w | 0, α^{-1} I),  α ∼ Gamma(α | a_0, b_0).    (5)

Marginalizing over the precision α results in a prior with a Student's t-distribution centered at zero. By tuning the shape parameter a_0 and the rate parameter b_0, one can obtain a distribution that is wider and heavier-tailed than a standard Gaussian (Luo & Kareem, 2020; Zhu & Zabaras, 2018). In this study, a_0 = 1 and b_0 = 0.05 are the values taken for the shape and rate parameters. Aleatoric uncertainty, capturing the noise in the data, is assumed to be homoscedastic.
Here, we prescribe the same additive noise n for all output pixels/grid points, for lower memory use and faster computation. Explicitly, the noise term is defined as n = σε, where σ is a scalar denoting the standard deviation of the noise and ε is standard Gaussian, i.e., ε ∼ N(0, I). In this work, the noise precision β = 1/σ² is modeled as a random variable with a conjugate (Gamma) prior,

β ∼ Gamma(β | a_1, b_1),    (6)

to better reflect real-world applications. In most applications, the prior noise variance is assumed to be very small, e.g., 1 × 10^{-6}, which provides a good initial guess for the prior hyperparameters (Gramacy & Lee, 2012). Here, we take the shape and rate parameters to be a_1 = 2 and b_1 = 1 × 10^{-6}. It is worth noting that β is a learnable parameter inferred from the data. Consequently, in the posterior-estimation step, we learn both the model parameters w and the data parameter β. Unless otherwise specified, θ = {w, β} denotes all the uncertain parameters for brevity.
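A minimal sketch of the resulting log prior may help fix ideas: marginalizing the Gamma-distributed precision α out of the zero-mean Gaussian in Equation (5) yields the Student's t log density coded below (up to an additive constant). The function name and tensor handling are illustrative; the hyperparameter values are those quoted in the text.

```python
import torch

a0, b0 = 1.0, 0.05   # shape and rate for the weight-precision Gamma prior
a1, b1 = 2.0, 1e-6   # shape and rate for the noise-precision Gamma prior

def log_prior_w(w):
    """Student's t log prior on the weights, obtained by marginalizing the
    Gamma(a0, b0)-distributed precision alpha out of N(w | 0, 1/alpha)."""
    return -(a0 + 0.5) * torch.log1p(w.pow(2) / (2.0 * b0)).sum()
```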

Posterior estimation
The second step of Bayesian learning is to estimate the posterior given the predefined prior distributions. One standard way to obtain the approximate posterior is to use sampling methods (Neal, 2012). Given a large number of network parameters, e.g., tens or hundreds of millions in a modern deep learning model, this approach can be slow to converge. In recent years, significant progress has been made using VI methods as an alternative for approximating high-dimensional posterior distributions (Blei et al., 2017).
Let D = {x_i, y_i}_{i=1}^N be the training data. With a specified prior and a specified functional form for the likelihood, VI casts the Bayesian inference problem as an optimization problem. For a given likelihood p(y | x, θ) and prior p(θ), it seeks to minimize the KL divergence between a proxy distribution q(θ) and the posterior distribution p(θ | D):

q* = arg min_{q ∈ Q} KL( q(θ) ‖ p(θ | D) ),  with  p(θ | D) = p(D | θ) p(θ) / Z,    (7)

where p(D | θ) p(θ) is the unnormalized posterior and Z = ∫ p(D | θ) p(θ) dθ is the normalizer, also called the model evidence. In practice, the normalization constant need not be evaluated in the KL-divergence minimization (Blei et al., 2017). Also, the proxy distribution q(θ) usually is parameterized within a specified family of distributions Q, inevitably introducing deterministic biases (Blundell et al., 2015). In this work, SVGD, a nonparametric VI algorithm, is adopted (Liu & Wang, 2016). Rather than defining a variational approximation family as parametric VI methods do, SVGD employs a set of independent, identically distributed particles θ^1, θ^2, ..., θ^M and minimizes the KL divergence between the empirical measure of these particles and the true posterior. The central idea is to iteratively move the set of particles toward the true posterior using gradient descent,

θ_j^{t+1} = θ_j^t + ε φ*(θ_j^t),  j = 1, ..., M,    (8)

where ε is a small number representing the step size in the updating scheme and φ*(·) is the optimal perturbation direction that gives the steepest descent of the KL divergence,

φ*(θ) = (1/M) Σ_{j=1}^M [ k(θ_j^t, θ) ∇_{θ_j^t} log p(θ_j^t | D) + ∇_{θ_j^t} k(θ_j^t, θ) ],    (9)

with k(·, ·) denoting a positive-definite kernel; we choose a standard radial basis function (RBF) kernel. In Equation (9), the gradient term pushes the particles toward high-density regions of the target distribution, while the repulsive-force term imposes diversity and prevents particle collapse (Liu & Wang, 2016). Overall, the SVGD updating procedure can be summarized in five steps. Step 1: compute the log joint density log p(θ_j^t, D) = Σ_{i=1}^N log p(y_i | θ_j^t, x_i) + log p(θ_j^t).
Step 2: calculate the gradient ∇_{θ_j^t} log p(θ_j^t, D) by backpropagation.
Step 3: compute the kernel matrix k(θ_j^t, θ_i^t) and its gradient ∇_{θ_j^t} k(θ_j^t, θ_i^t). Step 4: evaluate the optimal perturbation direction φ* of Equation (9). Step 5: update each particle according to Equation (8).
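These five steps map directly onto a few lines of code. The following sketch performs one SVGD update for flattened particle parameters, assuming the log-joint gradients of Step 2 have already been obtained by backpropagation; the analytic RBF-kernel gradient and the median-heuristic bandwidth are standard choices (Liu & Wang, 2016), though the details here are ours rather than the paper's released implementation.

```python
import torch

def svgd_step(theta, grad_logp, eps=1e-3):
    """One SVGD update. theta: (M, d) flattened particles; grad_logp: (M, d)
    gradients of the log joint density from Step 2 (backpropagation)."""
    M = theta.shape[0]
    sq = torch.cdist(theta, theta) ** 2                  # pairwise squared distances
    h = sq.median() / torch.log(torch.tensor(M + 1.0))   # median-heuristic bandwidth
    K = torch.exp(-sq / h)                               # Step 3: RBF kernel matrix
    # Step 3 (cont.): analytic kernel gradient, the repulsive term of Eq. (9)
    grad_K = (theta * K.sum(1, keepdim=True) - K @ theta) * (2.0 / h)
    phi = (K @ grad_logp + grad_K) / M                   # Step 4: Equation (9)
    return theta + eps * phi                             # Step 5: Equation (8)
```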

Predictive distribution
On completion of training, the model can be applied to make probabilistic predictions for unseen data samples (x*, y*); the predictive distribution is given by

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ.

Note that the SVGD algorithm provides a sample representation of the posterior p(θ | D), meaning that one can use the learned SVGD particles to estimate the predictive distribution rather than carrying out the integration directly. Effectively, the BDL model can be viewed as an ensemble of deep learning models in which the number of ensemble members is given by the particle number M (e.g., M = 20 in this work), and the moments of the predictive distribution can be approximated by the Monte Carlo method. For instance, the ensemble-mean prediction is given by the mean over the particles,

ŷ* ≈ (1/M) Σ_{i=1}^M f(x*, w_i),

and the uncertainty is estimated using the second moment of the predictive distribution,

Var(y*) ≈ (1/M) Σ_{i=1}^M [ f(x*, w_i)² + β_i^{-1} ] − (ŷ*)².

With the estimated mean and variance, we can make predictions for new data samples and quantify the associated predictive uncertainty.
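A sketch of the corresponding Monte Carlo estimates, treating the M particles as an ensemble of networks; the function signature and the handling of the noise precisions are illustrative assumptions.

```python
import torch

@torch.no_grad()
def predictive_moments(particles, x, betas):
    """Monte Carlo moments from the SVGD ensemble. particles: list of M
    networks sharing one architecture; betas: (M,) learned noise precisions."""
    preds = torch.stack([f(x) for f in particles])   # (M, ...) member predictions
    mean = preds.mean(dim=0)                         # ensemble-mean prediction
    # second moment plus aleatoric noise, minus squared mean
    var = preds.pow(2).mean(dim=0) - mean.pow(2) + (1.0 / betas).mean()
    return mean, var
```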

Architecture Design
Even as the success of applying deep learning to problems in science and engineering depends crucially on the choice of network architecture, designing efficient and effective networks remains problem-specific and requires human expertise. In this context, we note that climate-relevant data typically are high dimensional and geographically heterogeneous, and most often result from dynamical and other physical interactions over a diverse range of spatial and temporal scales (Reichstein et al., 2019).
In regard to the high-dimensional nature of climate data, recent studies reveal that the intrinsic dimensionality captured by dimensionality-reduction techniques tends to be too low and, therefore, insufficient (Kashinath et al., 2021). Consequently, rather than rely on dimensionality reduction, we use an architecture that considers the full extent of the spatial degrees of freedom present in the data (Xu et al., 2021). Next, motivated by the fact that common yet important fluid-dynamic processes, such as advection and diffusion, are represented by regular stencils in the numerical solution of the partial differential equations governing the climate system, we use convolution layers as an integral aspect of the network. Finally, to permit the learning of multiscale interactions, e.g., both local and remote interactions, we employ a bottleneck of sufficiently high dimension, with additional optional fully connected layers in the bottleneck. Notably, this design maintains the deep learning promise of automatically extracting features, allowing them to interact appropriately for the task at hand, and subsequently projecting them back at the required resolution in an end-to-end fashion.
In Figure 2, the down- and up-sampling learning modules greatly reduce the number of network parameters, thereby accelerating the training process. Specifically, convolution operations are performed to reduce the data size and extract features. Non-adjacent connections then are established for aggregating the extracted features (He et al., 2016a). As most deep learning models are data-intensive, a densely connected convolutional network structure, known as a dense block, is adopted in our encoder-decoder architecture to reduce the number of network parameters and support more stable learning (Huang et al., 2017). Consequently, each layer can reuse the features extracted by all preceding layers in the dense block. Inside each dense block, a layer is defined as a set of composition operations, usually convolution, nonlinear activation, batch normalization, and dropout (He et al., 2016b). We combine image-resizing techniques and convolution in the upsampling learning module. In particular, transposed convolutions commonly are performed to upsample the extracted features to the desired spatial dimensions. However, Odena et al. (2016) find that transposed convolutions with uneven overlap cause a checkerboard pattern of artifacts. Therefore, image-resizing techniques, such as nearest-neighbor or bilinear interpolation, serve as good alternatives; bilinear interpolation, adopted here, discourages high-frequency artifacts via an implicit weighting filter. Lastly, we observe that pooling operations, usually implemented between successive convolution layers to reduce the size of the feature maps, can deteriorate prediction performance. Since climate data by nature differ from typical computer vision data (e.g., handwritten digits), we argue that max or average pooling may discard the distinctive features needed for fine pixel-wise regression. Hence, pooling operations are not considered; instead, a convolution operator with a non-unit stride is used to manage the feature sizes (Dumoulin & Visin, 2016), as sketched below.
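The down- and up-sampling choices above translate into standard PyTorch modules; a minimal sketch with illustrative channel counts:

```python
import torch.nn as nn

# Downsampling via a strided convolution (no pooling), and upsampling via
# bilinear resizing followed by a convolution (no transposed convolution).
down_module = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)
up_module = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),
)
```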

In a multilayer perceptron (MLP), every output unit is connected to every input unit; thus, the influence of long-distance locations is seamlessly integrated into the MLP architecture. However, an MLP model can be memory-demanding and computationally prohibitive because the total number of parameters increases too quickly, i.e., as the cumulative product of the number of perceptrons in each layer. A better, more efficient way to account for long-distance effects is to add a fully connected linear layer in the feature space, ideally at the bottleneck level. The combination of convolutional and fully connected layers has shown its effectiveness in many computer vision tasks, such as object detection and image segmentation (Dosovitskiy et al., 2020; Rasp et al., 2020). Hence, we also consider fully connected layers between the densely connected encoder and decoder. In such a case, it is anticipated that the fully connected layers will relieve part of the burden on the encoder-decoder parts of the network to learn remote interactions, freeing them to better represent spatiotemporally local interactions, tasks at which convolution layers excel.
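A sketch of such a fully connected bottleneck (module name and dimensions are illustrative):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Fully connected layer between encoder and decoder: flatten the
    feature maps, mix all locations globally, and reshape back."""
    def __init__(self, c, h, w):
        super().__init__()
        self.shape = (c, h, w)
        self.fc = nn.Linear(c * h * w, c * h * w)  # learns remote interactions

    def forward(self, z):
        out = self.fc(z.flatten(start_dim=1))
        return out.view(z.size(0), *self.shape)
```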
Following such reasoning, we develop our deep learning (DL) model and its Bayesian version (BDL) for climate prediction. We also include the state-of-the-art dynamics-forecasting model, ConvLSTM, in the study for comparison. All three models have a convolutional encoder to condense information. Of note, DL and BDL share the same architecture; BDL instantiates a predefined number of particles (network copies) whose parameters are inferred via SVGD. We defer the network parameter details, including the size of the convolving kernel, the stride of the convolution, the zero-padding size, etc., to Appendix A. Figure 2 offers a graphic illustration of the proposed model.

Network Training
The goal of network training is to minimize the mismatch between a prediction ŷ = f(x) and the correct output y. For DL and ConvLSTM, ŷ denotes the model output; for BDL, ŷ is defined as the predictive mean over the Bayesian particles. The mean squared error (MSE) is selected as the criterion to measure the difference between ŷ and y in each grid cell. Training details for all models are as follows: (1) the data set is split into a training set consisting of 1280 paired samples ({x_i, y_i}_{i=1}^{1280}) and a test set containing 128 samples in all experiments; (2) the data are standardized by removing the mean and scaling to unit variance; (3) the batch size of both the training and test sets is set to 128 for the non-probabilistic models and to 32 for the Bayesian models; (4) the Adam stochastic gradient descent algorithm is used as the default optimizer, with the weight decay set to 5 × 10^{-4} to regularize the weights via an L2 penalty (Kingma & Ba, 2014), which helps the model generalize better to unseen data; (5) the initial learning rate is set to 0.001, with a dynamic scheduler that reduces the learning rate by a factor of 10 when the monitored metric has stopped improving; and (6) dropout is used after each convolutional layer to further reduce overfitting and improve generalization (Hinton et al., 2012), with the probability of an element being zeroed set to 0.5.
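Items (4)-(6) correspond to standard PyTorch components; a schematic training loop under these settings is sketched below, where model, train_loader, n_epochs, and validate are placeholders rather than names from the paper's code.

```python
import torch

# (4) Adam with L2 weight decay; (5) plateau-based learning-rate scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
criterion = torch.nn.MSELoss()

for epoch in range(n_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)   # MSE between prediction and target
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))     # reduce LR when the metric plateaus
```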

Implementation Notes
We use Python 3.7.3 and PyTorch 1.7.0 to implement the methodology. All data processing and model training are carried out on an NVIDIA Tesla V100 graphics processing unit (GPU) with 16 GB of high-bandwidth memory (HBM). For full reproducibility of the results, the computer codes and data will be made available at https://xihaier.github.io/ upon publication.

Results and Discussions
Figure 3 shows a measure of the magnitude of the interannual internal/natural variability that we are interested in predicting. It is computed as the standard deviation of the interannual anomaly of SST; we also refer to this as the climatological standard deviation.
This variability is seen to be geographically heterogeneous, with the largest variations in the subpolar North Atlantic and in the isolated part of the Eastern Pacific (related to ENSO) that enters the domain. We nondimensionalize the prediction error using the climatological standard deviation to make the errors geographically commensurate and to facilitate comparison of prediction error across different regions.

Figure 4 compares the predictions of SST in the North Atlantic at a lead time of six months for a randomly selected test case. The target spatial distribution of SST is shown in the left panel; we recall that both the strong time-mean latitudinal variability and the seasonal cycle have been removed in the current study so as to focus largely on the interannual component of variability. As such, the main variability seen in the target distribution is related to the spatial heterogeneity of the interannual variability. On comparing the predictions in the other two panels with the target distribution, it is seen that both the convLSTM prediction (center panel) and the ensemble mean of the BDL prediction (right panel) successfully capture the main aspects of the spatial distribution, such as the warm spot near the Grand Banks, the cooler temperatures of the subpolar gyre, the warm anomaly in the East Greenland Current, and the lower variability in the subtropical gyre region. However, whereas the convLSTM predictions tend to be more diffuse, features in the BDL prediction are better correlated with the target and tend to be sharper, even though we are considering an ensemble average. This is suggestive of better performance of the probabilistic BDL system as compared to the deterministic convLSTM system (Xingjian et al., 2015; Park et al., 2019; Xu et al., 2021).

Figure 5 shows the corresponding maps of non-dimensional root-mean-square error at the six-month lead time, and Figure 6 compares the prediction errors of DL and BDL for the same test sample at a lead time of twelve months. Two features are seen in this comparison: the error is lower in the Bayesian model, and the error in the deterministic model has smaller-scale features. This suggests that the deterministic DL model is more prone to overfitting and that the averaging inherent in the Bayesian approach acts to regularize the BDL predictions. This, in turn, reiterates the need to assess model and data uncertainty using probabilistic modeling techniques, particularly when considering deep networks (Ghahramani, 2015).

Figure 7 compares the domain-averaged error as a function of prediction lead time for the various models considered. For convLSTM, DL, and BDL, the NDRMSE averaged over the domain and over the test set is shown with filled circles. A further fit of the individual points with a function of the form NDRMSE = α (1 − exp(−β τ)), where τ denotes the prediction lead time, is also shown; the fit is obtained by minimizing the residual in a nonlinear least-squares problem using a trust-region algorithm (Conn et al., 2000). Along with the previously mentioned models, the damped-persistence fit, obtained by fitting a first-order autoregressive model, is shown and indicated as AR1. First, it is seen that for the most part, irrespective of whether they are FNNs or RNNs, and whether they are deterministic or probabilistic, the ML models perform better than damped persistence. Next, it is interesting to note that DL and BDL, both feedforward networks (FNNs), outperform the convLSTM model, a recurrent network (RNN) that previously has been seen to provide good performance in a variety of temporal prediction settings. Furthermore, the probabilistic Bayesian deep learning model performs better than its deterministic counterpart, as discussed earlier. Finally, for BDL, uncertainty is estimated as the standard deviation of the ensemble spread and is shown by the red envelope in Figure 7; the uncertainty in the prediction is seen to increase with lead time. In an RNN setting, in contrast to the current FNN setting, it is typical to train the recurrent network to produce a one-step prediction, with predictions at longer lead times then produced from both the input at the current time and the output at the previous time.
Thus, the compounding of error and uncertainty with increasing lead time explains the increase of both error and uncertainty with lead time in an RNN setting. In the current FNN setting, on the other hand, this straightforward compounding mechanism is absent; the observed increase of error and uncertainty with lead time therefore constitutes an independent validation of the current FNN approach. Indeed, we consider the nature of these increases with lead time further in the following section.
In terms of computational cost, our implementation of the DL model (primarily based on convolution operations) is approximately 20 times faster than the ConvLSTM model. We have further verified this speed-up in the context of a larger data set (Park et al., 2019; Xu et al., 2021). It is worth noting that the time complexity of training a Bayesian network is theoretically O(n), where n denotes the number of particles in the SVGD algorithm (Liu & Wang, 2016). In all experiments, we find that the BDL model scales better than this linear complexity and requires less training time than the ConvLSTM model.
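As an aside, the saturating-exponential fit of NDRMSE shown in Figure 7 can be reproduced with a standard trust-region least-squares routine; in the sketch below, ndrmse is a placeholder for the lead-time-dependent averaged errors, and the initial guess is illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(tau, alpha, beta):
    return alpha * (1.0 - np.exp(-beta * tau))   # NDRMSE = alpha (1 - exp(-beta tau))

tau = np.arange(1, 19)   # prediction lead times, in months
# ndrmse: domain- and test-set-averaged errors at each lead time
params, _ = curve_fit(saturating, tau, ndrmse, p0=[1.0, 0.2], method="trf")
```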

Uncertainty in the Bayesian Deep Learning Model
Now that we have a probabilistic prediction system that takes into account the possibility of a range of models fitting the training data in a Bayesian framework, we are able to generate a range of outcomes for the test data as well. Such a distribution of predictions has multiple uses, including obtaining information about alternative future evolutions and the possibility of predicting extreme events. However, given the experimental nature of the probabilistic prediction system considered, we presently confine ourselves to examining the quality of the predictions and their utility in providing insights into the methodology itself.
In addition to the slight improvement in prediction skill when compared to the deterministic deep learning models (DL and ConvLSTM), BDL provides a means of quantifying the uncertainty inherent in the data and the model. For example, in the particular approach we consider, learning the posterior distribution of β, a parameter related to data uncertainty whose prior distribution is given in (6), serves to quantify data uncertainty. Likewise, learning the posterior distribution of the model parameters w, the priors for which are given in (5), serves to capture uncertainty in the model itself. While we have already shown and briefly discussed the behavior of uncertainty in the BDL predictions, we analyze it further in this section to see what additional insight it may yield into the workings of the Bayesian model. The rightmost panel in Figure 8 shows an estimate of uncertainty in the prediction of SST for a particular instance, defined as the standard deviation of the ensemble.
Prediction uncertainty varies depending on the underlying dynamical state, and the dynamics underlying SST vary significantly from the tropics to the midlatitude and subpolar regions. This is reflected in the spatial heterogeneity of the uncertainty estimated by the BDL scheme.
The heterogeneity of the estimated uncertainty is in agreement with the heterogeneity of the magnitude of interannual variability shown in Fig. 3. Finally, the figure also shows that the higher (lower) level of error in the subpolar (tropical) region is accompanied by a higher (lower) level of uncertainty. We previously saw the growth of uncertainty with increasing prediction lead time in a domain-averaged sense in Fig. 7; Figure 9 shows the spatial distribution of the growth of uncertainty with prediction lead time. The spatial heterogeneity is related to the heterogeneity of the dynamics governing the evolution of SST, as discussed previously. The higher level of uncertainty in the isolated patch of the Pacific in the southwest corner of the domain is likely due to the dynamics in that region being controlled more by processes in the rest of the Pacific, which is not considered presently. Next, we focus attention on the nature of the relationship between prediction error and uncertainty. The reason for doing so is that in a realistic situation we do not have verification data, and as such we cannot directly estimate the error in (or skill of) the predictions.
So the question is: to what extent can the uncertainty in the prediction be used as a measure of prediction error? It is natural to expect the future state to be close to the ensemble mean when the dispersion of the ensemble is small and, conversely, the accuracy of the ensemble mean to be low when the ensemble is highly dispersed. As such, we proceed to quantify the relationship between ensemble spread and prediction accuracy of BDL when verification data are present.
The inset in the top-left panel of Fig. 10 shows a scatterplot of the error of the ensemble mean against the ensemble spread at each geographical location, for each of the test instances, and at each prediction lead time between one and eighteen months. In this inset plot, a large degree of scatter is seen, and error and spread are only moderately correlated; the Spearman correlation coefficient is 0.33. (We prefer the rank-based Spearman correlation coefficient because (a) it is a nonparametric measure of the monotonicity of the relationship between the two variables, and (b) unlike the Pearson correlation, it does not assume that the variables are normally distributed, which makes it more robust.) Here we note that even in idealized experiments where the prediction model is perfect (in the sense that it does not have any biases), for statistical reasons, the spread-error correlation need not be large (e.g., see Barker, 1991; Houtekamer, 1993). The inset at the bottom-right of Fig. 10 shows the Spearman correlation coefficient as a function of the prediction lead time; the correlation coefficient is seen to decrease largely monotonically with lead time, and the low correlation at long lead times corresponds to the prediction reverting to climatology.

The large scatter in the spread-error plot is due to considering the predictions at the highest level of detail available. Given the usual reduced predictability of the smaller spatiotemporal scales (equivalently, the increased predictability of the larger spatiotemporal scales), the spread-error correlation improves when the data are aggregated, with a distinct plateau of correlation over a wide range of intermediate bin sizes (main panel of Fig. 10). We also examine the reliability of the ensemble in a spatially localized sense using rank-histogram-based counts of where the verification falls relative to the M = 20 ordered ensemble members. To examine this, we present three heatmaps showing the counts of elements contained in the left exterior (#0), interior (#1-19), and right exterior (#20) bins, respectively. Focusing attention on the center plot, a more detailed picture emerges wherein the ensemble is well dispersed, or even over-dispersed, in certain locations (darker shades of red) and under-dispersed in other regions (lighter shades of red to white). As such, we were unable to use this diagnostic to improve the performance of the probabilistic ML methodology to the extent we originally anticipated. Nevertheless, we find it a useful diagnostic that succinctly characterizes the behavior of the ensemble predictions produced by a probabilistic ML framework.
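For reference, the two diagnostics used above, the rank-based spread-error correlation and the rank-histogram bin counts, are straightforward to compute; in this sketch, spread, err, ensemble, and truth are placeholders for the quantities described in the text.

```python
import numpy as np
from scipy.stats import spearmanr

# spread and err: flattened arrays of ensemble spread and ensemble-mean error
# over locations, test instances, and lead times
rho, _ = spearmanr(spread, err)   # rank-based spread-error correlation

def rank_histogram(ensemble, truth, M=20):
    """Bin counts #0..#M of where the verification ranks among the M members.
    ensemble: (M, N) member predictions; truth: (N,) verifying values."""
    ranks = (ensemble < truth).sum(axis=0)
    return np.bincount(ranks, minlength=M + 1)
```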

Conclusions
Following concerted national and international efforts over the past 70 years to model climate, comprehensive climate models have emerged as a powerful tool in helping unravel and better understand the myriad processes underlying climate and climate change. For example, such models now help us better anticipate the climate system's response to external forcings, such as those due to increased greenhouse gases, on timescales longer than a few decades. However, efforts aimed at nearer-term predictions are still only in a nascent stage, given the difficulty that comprehensive climate models have in representing and capturing, with adequate accuracy, the internal modes of variability that are relevant at these shorter timescales. Furthermore, comprehensive climate models demand extensive infrastructure and are computationally very intensive. In this context, the increasing reliance on predictions of future climate for a wide variety of purposes, ranging from integrated assessment to developing mitigation strategies to developing resilience and adaptation strategies, makes the availability of computationally efficient and accurate surrogates of comprehensive earth system models highly attractive.
In this paper, we add to the growing body of efforts to build surrogates, first considering a recently proposed convolutional network architecture to develop such a surrogate and then integrating Bayesian inference into this architecture to further assess predictive uncertainty. We show that the resulting Bayesian deep learning model, while marginally improving prediction accuracy, also provides a quantification of the uncertainty inherent in the data and of that arising from the model itself, given the particular architecture (inductive bias) considered.
The probabilistic climate prediction framework we develop has multiple uses, including obtaining information about alternative future evolutions and the possibility of predicting extreme events. Given the experimental nature of the work, however, we used diagnostics developed in the context of probabilistic weather prediction to examine the quality of the probabilistic ML predictions and their utility in providing insights into the methodology itself. The use of such diagnostics allowed us to examine certain characteristics of the prediction ensemble, such as its reliability, a property that permits the use of the ensemble spread to estimate prediction error. Indeed, we find that the error-spread relation of the prediction ensemble is not optimal and suggest that efforts to drive such diagnostic relationships toward optimality are one way to improve the probabilistic ML methodology itself.
Figure 1. (a) The North Atlantic domain is shown in light blue. (b) Data preprocessing: the temporal mean and seasonal cycle are removed because they are easy to predict. While the original data span a range of [−1.79, 30.41] °C, the interannual SST anomaly fields span a range of [−1.68, 1.35] °C. The anomaly fields are obtained by removing the temporal mean and the mean seasonal cycle at each location. (c) Statement of the interannual SST anomaly prediction problem.
Figure 2. (a) Overall architecture: a densely connected encoder-decoder built from dense blocks, with downsampling (DownLM) and upsampling (UpLM) learning modules.

Figure 3. The standard deviation of the interannual anomaly of SST, plotted here as a measure of the magnitude of interannual variability (in degrees Centigrade).


Figure 4. Prediction results of BDL and ConvLSTM for a randomly selected test sample at a lead time of six months. While both predictions correlate reasonably with the target, the BDL predictions are seen to be sharper, even while exhibiting lower errors.

Figure 5. Prediction error maps showing the non-dimensional root-mean-square error at a lead time of six months. The BDL errors are seen to be lower.


Figure 6. A comparison of errors between the deterministic and Bayesian predictions at a prediction lead time of one year. The ensemble-mean BDL prediction is seen to have smaller errors than its deterministic counterpart.


Figure 7. Domain-averaged and test-set-averaged (root-mean-square) non-dimensional error as a function of prediction lead time for the various methods considered. The baseline model AR1 (first-order autoregressive process) corresponds to damped persistence. All ML models are seen to be better than the baseline. The deterministic and Bayesian versions of the convolution-based FNN are seen to be better than convLSTM, a sophisticated RNN architecture. The Bayesian version of the FNN is slightly better than its deterministic counterpart, while simultaneously providing a measure of uncertainty in the prediction.

Figure 8. Error and uncertainty captured by the BDL model for the randomly selected sample presented in Fig. 4.

Figure 10. Analysis of the spread-error relationship for BDL. The inset at the top-left shows a scatterplot of error vs. spread over the full period of prediction (1-18 month lead times). The data are highly dispersed, and the Spearman correlation coefficient is a modest 0.33. The inset at the bottom-right shows the correlation coefficient as a function of prediction lead time; the correlation is seen to decay largely monotonically with increasing lead time. The main panel shows the correlation for a wide range of bin sizes when outliers are eliminated; the blue and red dots correspond to two different thresholds for determining outliers. A distinct plateau of correlation is seen over a wide range of intermediate bin sizes. This analysis suggests that the uncertainty information from the BDL system can be used to estimate prediction error to a certain degree.

Table A2. DL with MLP at the bottleneck.