Detecting Climate Signals Using Explainable AI With Single-Forcing Large Ensembles

It remains difficult to disentangle the relative influences of aerosols and greenhouse gases on regional surface temperature trends in the context of global climate change. To address this issue, we use a new collection of initial‐condition large ensembles from the Community Earth System Model version 1 that are prescribed with different combinations of industrial aerosol and greenhouse gas forcing. To compare the climate response to these external forcings, we adopt an artificial neural network (ANN) architecture from previous work that predicts the year by training on maps of near‐surface temperature. We then utilize layer‐wise relevance propagation (LRP) to visualize the regional temperature signals that are important for the ANN's prediction in each climate model experiment. To mask noise when extracting only the most robust climate patterns from LRP, we introduce a simple uncertainty metric that can be adopted to other explainable artificial intelligence (AI) problems. We find that the North Atlantic, Southern Ocean, and Southeast Asia are key regions of importance for the neural network to make its prediction, especially prior to the early‐21st century. Notably, we also find that the ANN predictions based on maps of observations correlate higher to the actual year after training on the large ensemble experiment with industrial aerosols held fixed to 1920 levels. This work illustrates the sensitivity of regional temperature signals to changes in aerosol forcing in historical simulations. By using explainable AI methods, we have the opportunity to improve our understanding of (non)linear combinations of anthropogenic forcings in state‐of‐the‐art global climate models.

In addition to the influence of internal variability, the effective radiative forcing from anthropogenic aerosol emissions also remains uncertain over the historical period (Bellouin et al., 2020;Booth et al., 2018;Thorsen et al., 2020). In a novel experiment design, Dittus et al. (2020) assessed the sensitivity of a climate model to a plausible range of historical aerosol forcings. They found better agreement between the observed global mean surface temperature record and an experiment with smaller net aerosol forcing than the standard configuration of the GCM. Consequently, this suggests that temperature signals may be highly sensitive to small changes in aerosols, even when the aerosol forcing in GCMs is constrained to fall within observational estimates (Dittus et al., 2020). This also could be one explanation for the higher climate sensitivities found in CMIP6 models (Flynn & Mauritsen, 2020;Meehl, Senior et al., 2020).
Recent advances in computational power have led to the development of a growing number of initial-condition large ensembles for assessing climate change and variability (Deser, 2020;. Within a single large ensemble GCM simulation, one can obtain the forced response (i.e., climate signal) by averaging across individual ensemble members that differ by only a small random perturbation error. Thus, if the model is correct, observations of the real world should fall within the ensemble spread in order to reflect both a common forced signal (climate change) and the unpredictable noise of the atmosphere. In other words, the statistical characteristics of internal variability should be similar between the real world and the individual model ensemble members. However, although numerous statistical methods have been proposed to further extract the forced response from internal variability (e.g., Barnes et al., 2019Barnes et al., , 2020Deser et al., 2016;Hegerl et al., 1996;Santer et al., 2019;Sippel et al., 2019Sippel et al., , 2020Wills, Battisti et al., 2020), the problem of climate pattern attribution still remains difficult (Wills, Sippel et al., 2020).
To improve our understanding of the forced signals from individual anthropogenic climate drivers amidst the noise of internal variability, we implement a method of explainable artificial intelligence (XAI) using data from a novel set of single-forcing large ensemble experiments. The adoption of machine learning applications for geoscience issues continues to rapidly grow (Boukabara et al., 2020;Ebert-Uphoff et al., 2019;McGovern et al., 2019;Rasu et al., 2019;Toms et al., 2020;Watson-Parris, 2020), especially due to an increasing number of XAI methods (Montavon et al., 2018;Samek et al., 2017Samek et al., , 2020. Recently, machine learning models have been used for diverse applications in mesoscale meteorology (e.g., Gagne et al., 2019;Lagerquist et al., 2020), numerical weather prediction (e.g., Rasp et al., 2020;Weyn et al., 2020), simulating cloud and radiation processes in GCMs (e.g., Rasp et al., 2018), turbulence and convection parameterizations (e.g., Beucler et al., 2019;Zanna & Bolton, 2020), attribution of global climate change (e.g., Barnes et al., 2019;Mansfield et al., 2020;Sippel et al., 2020), and reconstructions of historical temperature trends (Kadow et al., 2020). To explore how machine learning models are making their predictions, we focus on using XAI techniques in order to gain new scientific insights for climate science.
In this study, we use artificial neural networks (ANN) in association with an explainability method called layer-wise relevance propagation (LRP) on data from climate model simulations. By comparing the LRP results between ANNs, we investigate climate patterns that are related to different combinations of external forcings, namely, greenhouse gases and industrial aerosols. Finally, we assess the utility of the ANNs by evaluating them on real world observations and introduce a metric to mask noise in assessing the LRP visualizations.  Hurrell et al., 2013) covering 1920 to 2080. CESM1 is a fully coupled GCM and is run with 30 vertical levels and a horizontal resolution of 1°. The atmospheric model is the Community Atmosphere Model version 5 (Neale et al., 2012), which is coupled to interactive land, ocean, and sea ice components.
Here, we first analyze the widely used 40-member large ensemble as described in Kay et al. (2015), which we refer to as "ALL" (for all-forcing). The large number of ensemble members is useful for characterizing atmospheric internal variability (or noise) in the climate system Maher et al., 2019). Each of the ensemble members have the same external forcing, but are generated from a small random round-off difference in the atmospheric initial conditions. Historical forcing is imposed from 1920 to 2005, and thereafter Representative Concentration Pathway 8.5 (RCP8.5; Vuuren et al., 2011) is used to simulate a worst-case climate scenario through the end of the 21st century (Peters & Hausfather, 2020). Land use/land cover changes, biomass burning, and stratospheric ozone concentrations also evolve with time in the ALL simulation. Although large uncertainties exist, CESM1's total aerosol effective radiative forcing falls within one standard deviation of observational evidence (Bellouin et al., 2020;Deser, Phillips et al., 2020;Zelinka et al., 2014). We will return to this last point later in the study.
In addition, we also use a set of two new single-forcing simulations from CESM1 that are both run with 20 ensemble members (Deser, Phillips et al., 2020). These large ensembles have the same GCM, initialization protocol, and external forcing as ALL, but differ by one time-evolving forcing agent that is withheld per simulation. In particular, greenhouse gas concentrations are held fixed to 1920 levels in one experiment (AER+), and industrial aerosols are held fixed to 1920 levels in another (GHG+). While our notation in this study reflects the dominant external forcing agent per simulation (either greenhouse gases [GHG] or industrial aerosols [AER]), we do note that there are other important climate feedbacks and natural variability included in each experiment (hence, the "+" sign) that may contribute to our interpretation of the ANN results (e.g., Deng et al., 2020;Hawkins et al., 2017;Lehner et al., 2020;Luyssaert et al., 2014;Maher et al., 2020;Milinski et al., 2020). Since we only focus on one GCM (CESM1) with historical and RCP8.5 forcing, differences between the simulations cannot be due to emission scenario uncertainties or model structural uncertainties that would arise from using, for instance, CMIP5/6 (Hawkins & Sutton, 2009;Knutti & Sedlacek, 2013;Lehner et al., 2020).
After taking into account the smaller ensemble size of the single-forcing runs, we only consider the first 20 members of ALL. However, this does not affect the skill of the ANN for training and testing data (not shown). We apply a bilinear interpolation to the three sets of large ensembles so that they share a slightly coarser latitude by longitude global grid (1.9° × 2.5°). We only consider fields of monthly near-surface air temperature (°C) to calculate seasonal and annuals means from model output. An overview of the climate model simulations used in this study can be found in Table S1.

Observations
To understand the effect of training on climate model simulations with different external forcing, we test the ANN on observations using the new National Oceanic and Atmospheric Administration/Cooperative Institute for Research in Environmental Sciences/Department of Energy Twentieth Century Reanalysis (20CR) version 3 (20CRv3; also referred to here as "observations") (Slivinski et al., 2019). Updates to 20CRv3 include an 80-member ensemble size for confidence estimation, a four-dimensional incremental analysis data assimilation scheme , and a higher resolution core model (described in Slivinski et al., 2019). These improvements lead to a reduction in biases of near-surface temperature, sea surface temperature, and sea level pressure compared to older versions of 20CR, especially in the early to mid-20th century (Compo et al., 2011;Giese et al., 2016). Further, 20CRv3 was found to be in close agreement with other independently derived reanalysis data sets, including the European Centre for Medium-Range Weather Forecasts (ECM-WF) ERA-20C and CERA-20C (Slivinski et al., 2019(Slivinski et al., , 2020.

10.1029/2021MS002464
We analyze monthly fields of 2-m air temperatures (°C) from 20CRv3 after interpolating (bilinear) onto a common grid of 1.9° latitude by 2.5° longitude for consistency with the climate model simulations. 20CRv3 was selected for our analysis due to its temporally and spatially complete fields of 2-m temperature that are available globally from 1920 to 2015. Similar results were also obtained from the ANN after evaluating on the ECMWF ERA5 reanalysis (Hersbach et al., 2020) for the more recent 1979 to 2019 period. However, in this study, we focus our attention on 20CRv3 for consistency with the historical climate model output. A summary of the observations can be found in Table S2.

Neural Network Framework
In this analysis, we adopt a neural network architecture that was first introduced in Barnes et al. (2020) and is further illustrated here in Figure 1. We compare the impact of time-evolving greenhouse gases and industrial aerosols on a classification task of predicting the decade (year) from input maps of temperature. Each unit of the ANN input layer represents one grid point from a 2-m temperature map (13,824 units per map with dimensions of 96 latitudes by 144 longitudes), and our output layer represents the probabilities of a particular decade class (e.g., 2000-2009).
Our ANN is set up with two hidden layers that each contain 20 hidden units (relatively shallow). We find that increasing the number of layers does not improve the skill of the model ( Figure S1), and this architecture supports the interpretability of the fully connected neural network for scientific discovery. In particular, we apply the Rectified Linear Unit (ReLU; Agarap, 2018) activation function to all hidden layer nodes before the output layer, which is defined as f(x) = max(0, x). ReLU is well equipped for use in LRP visualization, since it tests whether individual neurons have been activated . We also apply a soft-max function to the output layer, which remaps the decadal class probabilities so that they add up to one. Both ReLU and soft-max functions are common in ANN classification problems such as ours (e.g., Goodfellow et al., 2016;Lecun et al., 2015;Samek et al., 2020).
To retrieve the predicted year (output) by the ANN from the maps of 2-m temperature (input), we use a method called fuzzy classification encoding and decoding (Amo et al., 2004;Zadeh, 1965). This occurs during the ANN's output layer (see Barnes et al., 2020).   (Zadeh, 1965) to assign each prediction year to the probability of it occurring in a single decade (e.g., within 2000-2009) ). An example heatmap using layer-wise relevance propagation (LRP; Bach et al., 2015) is also illustrated here. LRP highlights the regions of greater relevance for the ANN to predict the year by propagating an output sample backward through the frozen nodes of the ANN until it reaches the input layer .
year by computing the weighted sum of the decadal class probabilities (decode).  Figure 2 of Barnes et al. (2020). Given our approach using both LRP and fuzzy classification, we do not explore the more typical method of multiple linear regression in this work. However, that approach has been explored in Barnes et al. (2019Barnes et al. ( , 2020 for CMIP temperature and precipitation data. Before the maps are fed into the ANN, all training data are standardized by their standard deviation across all ensemble members and years at each grid point. Each ANN is then trained using a randomly selected subset of 80% of the climate model simulation data (16 ensemble members) and tested on the remaining 20% (4 ensemble members). During training, our loss function uses binary cross-entropy/log loss, which acts to penalize the ANN when the prediction is wrong, but the model confidence is still high. The ANN are trained using the Nesterov method (momentum = 0.9) for stochastic gradient descent (Ruder, 2016) for 500 epochs. While the interpretability results are not sensitive to our selection in hyperparameters, we set our learning rate to 0.01 and a batch size to 32 for each ANN used to generate the following figures.
To overcome the problem of overfitting the input data, we use L 2 ridge regularization (Friedman, 2012). The L 2 parameter is set to 0.01 and applied to the weights of the first hidden layer. L 2 regularization imposes a penalty on the model by adding a coefficient to the loss function that is proportional to the sum of the squares of the feature weights. Thus, L 2 regularization leads to weights that are more smoothly distributed across the model and are not as sensitive to outliers in the input data. Importantly, and in relation to standard climate science tools, the inclusion of this parameter accounts for spatial autocorrelation that can exist in the 2-m temperature fields. L 2 also improves the interpretation of the LRP heatmaps for identifying key regions that are relevant for the ANN to make its prediction (e.g., see Figure 3 in Barnes et al., 2020).

Layer-Wise Relevance Propagation
The motivation for this work is to reveal the underlying climate patterns that are learned by the ANN from climate model simulations with different combinations of external forcing. As we will show, using XAI LABE AND BARNES 10.1029/2021MS002464 5 of 18  1920-1959 (a, e, i), 1960-1999 (b, f, j), 2000-2039 (c, g, k), and 2040-2079 (d, h, l) for the ensemble means of three climate model simulations (AER+; a-d, GHG+; e-h, ALL; i-l). Statistically significant trends are shown with shaded contours at the 95% confidence level following the Mann-Kendall test (Bevan & Kendall, 1971;Mann, 1945), while those that are not are masked out using black hatch marks. tools alongside existing climate science methods have the potential to bring new insights for interpreting projections of climate change in GCMs.
For this work, we use an interpretation method called LRP (Bach et al., 2015;Montavon et al., 2018) for tracing the decisions determined by the ANN. While there are an increasing number of LRP routines, we use a form here (alpha-beta rule) that works well for ReLU networks and is related to Taylor series expansion . By propagating information backward until the first layer of the ANN is reached, we learn about the individual input units (features) that are "relevant" to make the ANN's prediction.
While a detailed overview of using LRP in the geosciences in provided in Toms et al. (2020), we briefly describe the method here: (a) the weights and biases of the ANN are frozen after training, (b) a single prediction output (prior to the soft-max function) is conserved and propagated backward through each node of the ANN based on the frozen weights and biases, (c) the feature relevance is learned until the propagation reaches the input layer, and (d) the final output of LRP retains the original dimensions of the input data by showing the relevance for each pixel (i.e., gridded latitude by longitude points on a map). This process is repeated for every sample. Hence, we are left with a spatial heatmap (unitless) showing the regions of importance for the ANN to determine the decade (see Figure 1).
In this study, our heatmaps are composites of both training and testing sample data, because we are interested in where the ANN is learning regional indicators to make all predictions. However, our LRP results are nearly the same when only using a composite of testing data (e.g., Figure S13). Since our output layer can return multiple probabilities of a 2-m temperature map occurring in a particular decade (fuzzy classification encoding and decoding), we only propagate the output value with the highest probability of belonging to a particular decade. Again, LRP can only propagate one single output node backwards at a time. However, previous work has found that this does not affect the interpretation of the LRP output . One final note about our use of LRP is that it returns information that positively contributes to the ANN's predicted likelihood (i.e., increases confidence in the prediction). Other XAI methods explore ways to interpret contributions that lead to less confident predictions (e.g., Botari et al., 2020), but that is beyond the scope of this analysis. To interpret the heatmap figures in this study, the higher relevance values indicate greater importance for the ANN's prediction. Lastly, we introduce a method to mask noise (i.e., relevance) in the LRP output (section 3.2). LABE AND BARNES 10.1029/2021MS002464 6 of 18

Evolution of Simulated and Observed Trends
We first evaluate the three large ensemble experiments (AER+, GHG+, and ALL) using more traditional climate science methods (i.e., trend analysis, signal-to-noise ratios, and timing of emergence) to understand the spatial patterns of the 2-m temperature response. Figure 2 shows annual maps of temperature trends over four separate 40-year periods for the ensemble mean of each experiment. In the historical period, there is an observed cooling for AER+ (time-evolving aerosols; constant greenhouse gases) for all continental regions and most of the world's oceans (Figures 2a and 2b). However, there is a notable statistically significant region of warming over parts of the North Atlantic and Southern Ocean (Figure 2b). These areas of warming may be connected to a strengthened Atlantic Meridional Overturning Circulation (Dagan et al., 2020;Keil et al., 2020;Menary et al., 2020). The global signature of cooling prior to 2000 is associated with an increase in industrial aerosol emissions. Trends in aerosol optical depth are driven by an increase in emissions over Southeast Asia, North America, and Europe in the first half of the 20th century (see Figure 2 in Deser, Phillips et al., 2020). However, a decrease in aerosol optical depth is observed in North America and Europe closer to present-day with the largest aerosol forcing remaining over Southeast Asia. As industrial aerosols are reduced over the 21st century, there is a net warming trend globally in AER+ through 2080 (Figures 2c  and 2d). Notably, the temperature trend in the North Atlantic reverses and resembles the "North Atlantic Warming Hole." In agreement with earlier studies (e.g., Dagan et al., 2020), this suggests an important role for aerosols in North Atlantic climate variability. Figures 2e-2h reveals the global warming signature due to the dominant greenhouse gas forcing in GHG+ (time-evolving greenhouse gases; constant aerosols), along with a cooling patch in the North Atlantic. Relative to GHG+, statistically significant warming trends emerge later in ALL (Figure 2i), which is due to its greater aerosol forcing prior to 1960 (net cooling effect). As trends in optical aerosol depth decrease by 2040, there are larger global temperature trends in ALL (Figure 2l) compared to GHG+ (Figure 2h).
We compare the simulated temperature trends with observations by showing the observed (using 20CRv3) 2-m temperature trend (annual mean) for two 40-year periods in Figure S3. However, the observations only reflect one possible realization of internal variability combined with the forced response. Therefore, they are not directly comparable with the ensemble mean trends presented in Figure 2. Regardless, we still find some common temperature signatures emerge. By the second half of the 20th century ( Figure S3b), we find statistically significant warming across the majority of the tropics and parts of North America. We also find the cooling trend over the North Atlantic detectable in observations for the 1960-1999 period.
To understand the patterns of forced climate signals, we compute signal-to-noise (SNR) maps in Figure S4. Here, the SNR is computed as the absolute ensemble mean trend divided by the standard deviation of the individual ensemble member trends for each 40-year period. We observe the highest SNR in the tropics, which is a result of the smaller internal variability in this region. High values of SNR ( >3) emerge as early as the 1920-1959 period in GHG+ from the Amazon to the Indian Ocean (Figure 4e), but do not appear until the later half of the 20th century in ALL (Figures S4j and S4k). SNR values are also high in the tropics for the AER+ simulation, but there is little to no forced response (SNR < 1) in the extratropics and polar regions ( Figures S4a-S4d). This is likely a result of the small temperature trends in AER+ (compared to GHG+ and ALL), which make up a small fraction of internal variability at higher latitudes. While the global warming signal overwhelms internal variability in GHG+ and ALL beginning in the 2000-2039 period, SNR values remain lower ( ∼1-2) in the subpolar Atlantic.
The effect of aerosols has a consequential role in identifying patterns and the temporal evolution of forced climate signals. Figure S5 shows the timing of emergence (ToE) of annual mean temperature for each large ensemble simulation. Following Lehner et al. (2017), the maps of ToE are computed as the first year that the 10-year running-mean temperature exceeds and stays above the mean 1920-1949 reference temperature by more than two standard deviations. ToE is computed for every grid point in each ensemble member before taking the ensemble mean. While there are numerous definitions and metrics for detecting ToE (Mahlstein et al., 2012), here we are interested in a baseline to compare with our interpretable ANNs. Consistent with the SNR maps, we find that ToE is delayed by nearly a decade in ALL ( Figure S5c) compared to GHG+ ( Figure S5b) due to the effect of aerosol masking. This is particularly found across parts of Southern Asia. The North Atlantic does not emerge in GHG+ and ALL until at least the mid-21st century. Although ToE is found to be later in AER+ ( Figure S5a), this is only a result of reduced aerosol optical depth in the 21st century, since there is no time-evolving greenhouse gas forcing in the simulation.
In summary, increases in industrial aerosol loading (e.g., prior to 1960) can mask the ToE of greenhouse gas-induced warming, particularly in the extratropics. Therefore, to further compare the patterns of responses that are driven by anthropogenic climate drivers, we now turn to our interpretable ANN architecture. One advantage to using our ANN is that we can address potential nonlinearities in regional patterns that evolve over time, which would not be captured in the standard methods of trend and SNR/ToE analysis that are conducted grid point by grid point. Figure 3 shows the predictions by the ANN after separately training and testing on each of the three large ensemble experiments. Here, we use fuzzy classification decoding to show how well the ANN can predict the year from the input maps of 2-m temperature. It is clear that the ANN closely predicts the year on the climate model data (blue shading), especially after 1980 ( Figures S1 and S2). We also note that the ANN predicts the correct year similarly as well in AER+ compared to ALL for testing (Figures S1a and S1g), despite the fact that there is no time-evolving greenhouse gas forcing and consequently smaller global mean temperature trends.  To assess the utility of our ANNs that are trained only on climate model data, we test their performance on observations by inputing 2-m temperature maps from 20CRv3. By testing on observational data, we find striking differences between the ANN predictions. The ANN has no skill in predicting the year for observations after training on AER+ (Figure 3a). Since the real world features a large greenhouse gas-induced warming signal, the ANN does not learn regional indicators that are in common with observations. For the ANN trained on ALL, there is an improvement for predicting the order of the years after 1980 (Figure 3c). Considering that a forced temperature response has not clearly emerged from the background noise (see Figures S4i and S4j), we infer that this is why the ANN is less able to predict the correct ordering of the years before 1980.

Predictions by the ANN
In contrast, the ANN performs quite well after training on GHG+ for predicting the order of all of the years in observations (Figure 3b). Since the real world does consist of both direct and indirect effects of greenhouse gases and aerosols, it is somewhat surprising to see that the ANN trained on GHG+ has a higher correlation to the actual year than for the predictions trained on ALL (Figure 4a). In fact, the observations approximately parallel the 1:1 line in GHG+, but are offset by about four decades. This means that the patterns of forced responses are similar, but may emerge later in the climate model data compared to observations. This offset could also arise from a difference in Earth's mean temperature that is common between climate models and reanalysis data sets (Hawkins & Sutton, 2016). Therefore, we compare our results in Figure 3 to ANNs trained using input data with the global mean temperature removed from each map (Figure S6). While the correlation is weaker, the overall results of the observations are quite similar. The ANN is still more skillful in predicting the order of the years for observations on the ANN trained using GHG+. This evidence suggests that the ANN is learning regional temperature signals and not just differences in the global mean temperature to make its predictions, as discussed further in section 3.3.
We investigate the robustness of our observational predictions in Figure 3 by using 100 unique ANNs trained on different combinations of training and testing data sets (i.e., individual ensemble members) for six different L 2 and epoch hyperparameter choices. Since L 2 regularization imposes a degree of spatial autocorrelation in the weights, we wanted to see if the skill of the observational predictions could change by using different parameters for each large ensemble ANN. We then test our observational data on each of these 100 iterations of every ANN architecture and plot a histogram of their correlation between the ANN-predicted year and the actual year. Our conclusions remain the same as Figure 3. We find that the median correlation is closer to 1 for GHG+ in the six ANN architectures evaluated here. Figure S7 also shows a comparison between the best correlations in GHG+ and ALL, which again confirms that the median correlation in GHG+ is higher than the ALL.
Proceeding with the L 2 and epoch parameters outlined earlier (e.g., Figure 4a), we also plot a histogram of the predicted (linear) slopes for our observational data in Figure S8. In agreement with our single trained ANNs in Figure 3, we find that the observations tested on the ANN using GHG+ performs the closest to the 1:1 (or perfect correlation) line with little variability between each iteration. Once again, there is no skill in predicting the year of the observations for the ANN trained on the AER+ simulation. In ALL, the median slope is greater than the 1:1 line likely due to the fact that a forced temperature signal does not emerge until after the middle of the 20th century.
While the results in Figures 3 and 4 show predictions based on maps of annual mean 2-m temperature, we also investigate differences by calculating seasonal means before training and testing the ANN. Figure S9 show the results of predicting the year for boreal winter (January-February-March; JFM) and boreal summer (July-August-September; JAS) in the ANNs using GHG+ and ALL, respectively. Once more, we find that the correlation of the predicted year of observations is higher for the ANN trained on GHG+. Notably, we also find a higher correlation for observations in JAS relative to JFM for both GHG+ and ALL trained ANNs. This may be a result of greater internal variability of 2-m temperatures in the Northern Hemisphere during JFM. In other words, the indicator patterns in common between observations and the climate model data may be weaker in boreal winter compared to summer.
To understand how the ANN is making its predictions, we utilize LRP for evaluating regional climate patterns of interest. In particular, we investigate why the ANN predictions of observations are better correlated to the actual year after training on a climate simulation without time-evolving aerosols. As a reminder, the LRP heatmaps indicate areas of "relevance" (or importance) for the ANN to make a prediction. Therefore, greater relevance does not necessarily correspond to the locations of greatest climate forcing. Additionally, the locations of higher relevance may change over time.

Uncertainty in Layer-Wise Relevance Propagation
The LRP algorithm employed here provides output (relevance) for all grid points of every sample (e.g., Figure 1). However, it can be difficult to distinguish physically meaningful regions of importance to the ANN, especially for identifying known climate signals. To limit noise in our LRP maps, we compute a threshold (or statistical significance) using a baseline relevance value. In other words, we determine the maximum feature relevance that could be expected from an ANN that is trained on random noise. While other uncertainty metrics for LRP have been proposed (e.g., Bykov et al., 2020;Fabi & Schneider, 2020), our simple method can be employed without modifying the existing ANN architecture or LRP algorithm and takes a common approach applied by climate scientists.
We compute this baseline relevance threshold as follows: (1) we randomly shuffle the individual ensemble member and year dimensions of the ALL input data while keeping the true year fixed (not shuffling), (2) we proceed with training and testing using the same ANN architecture and hyperparameters as section 2.3, (3) each output sample is then propagated backward into the ANN to compute the relevance map, (4) we repeat steps 1-3 for 500 iterations of the ANN by using unique random initialization seeds and taking different combinations of the training and testing data, and (5) finally, we compute the 95th percentile from the distribution of LRP values at all grid points that are obtained from this procedure. Thus, this bootstrapping-like method determines the distribution of LRP values that could be expected from climate data with no serial autocorrelation or temporal trends from forced signals. Figure 5 displays a histogram of this distribution of LRP values after 500 unique iterations of the shuffled ANN. We also test our observations (20CRv3) on the ANN trained by the shuffled ensemble from steps (1)-(5). As expected, the ANN cannot predict the year (median linear slope near 0), since it is unable to learn any forced climate signals from the shuffled data. Figure S10  the linear fit of observations compared to the median R 2 of observations trained on either AER+, GHG+, or ALL (section 3.1.2). We also show an example of a LRP map from a single iteration of the ANN trained on the shuffled ensemble, which highlights the lack of relevant regions for the ANN to make a decision on this synthetic data ( Figure 5).
As an additional check of our methodology, we create a "large ensemble" of random numbers drawn from a normal distribution. This large ensemble of random noise has the same dimensions as our real data (20 ensembles, 161 years, 96 by 144 spatial grid points). After repeating steps (2)-(5), we find that the 95th percentile of the random noise LRP is in close agreement with our baseline calculated from Figure 5 (not shown). Figure 6 show the LRP heatmaps for the individual ANN's trained on AER+, GHG+, and ALL input data of annual mean 2-m temperature masked using the method outlined in section 3.2. Our LRP maps are averaged for every prediction sample (ensemble member) that is accurate to within ±2 years of the actual year . In Figure 6, we show the temporal evolution of relevance for the four periods we have considered in this study (e.g., Figure 2). These LRP maps are composites after masking out the relevance below our new uncertainty threshold (see Figure 5). To compare the influence of our LRP uncertainty metric introduced in section 3.2, we also show the same LRP heatmaps in Figure S11, but without using a mask. Comparing Figures 6 to S11, we now see several climate regions of interest (e.g., North Atlantic and Southeast Asia) that are more clearly distinguishable from the background noise.

Regions of Climate Signal
The North Atlantic is a key region of relevance between all three large ensembles, but is largest in GHG+ during the 2000-2039 period (Figure 6g). The LRP maps also reveal Southeast Asia as an important region for the AER+ and ALL neural networks. The relevance is largest in Southeast Asia for AER+ during the early 20th (Figure 6a) and early 21st centuries (Figure 6c). Again, although the regions of relevance do not directly correspond to surface forcing, we infer that the emissions of anthropogenic aerosols over Southeast Asia and India are important indicators for the ANN to predict the year in the AER+ and ALL large ensembles. We also find that the Southern Ocean is a significant region of relevance for the large ensembles that observe time-evolving greenhouse gases (GHG+ and ALL). Notably, this Southern Ocean signal appears along the Antarctic sea-ice edge. However, in agreement with Barnes et al. (2020), we find that the Arctic LABE AND BARNES 10.1029/2021MS002464 11 of 18  1920-1959 (a, e, i), 1960-1999 (b, f, j), 2000-2039 (c, g, k), and 2040-2079 (d, h, l) for the three large ensemble experiments (AER+; a-d, GHG+; e-h, ALL; i-l). Higher LRP values indicate greater relevance for the artificial neural network's prediction. Relevance values less than the 95th percentile threshold (see text) have been masked out (gray shading).
is not a region of importance for predicting the year in any of the large ensemble simulations. Despite the effects of Arctic amplification, the lack of relevance to the ANN prediction is likely a result of the large atmospheric internal variability in the high latitudes relative to the tropics ( Figure S4).
To compare the differences in LRP maps between seasonal and annual mean input data, we show their relevance composites over 1960-2039 in Figure 7. This period is selected due to the greater differences in the ToE of forced signals between the three large ensembles (section 3.1.1). For the LRP maps based on the annual mean data (Figures 7a, 7f and 7k), we observe higher relevance in the North Atlantic for AER+, GHG+, and ALL neural networks. This area of relevance is largest in the ANN trained on GHG+ and is somewhat consistent between seasons. In agreement with Figure 6, this shows that the North Atlantic is a particularly important region for the neural network to predict the year. For AER+ and ALL, we observe a relevance hotspot over India and Southeast Asia, which is distinct during JFM and October-December. This is likely due to the local influence of time-evolving aerosols in these climate model simulations, which are absent in the ANN trained on GHG+. Although there are some regional and seasonal differences in Figure 7, the primary climate indicators ("relevance hotspots") remain similar. Thus, we focus on the annual mean input data for the rest of our analysis.
As previously discussed (e.g., in Figure 4), we test the robustness of our results by running 100 unique iterations of each large ensemble ANN for different combinations of training and testing data. Figures S12a-S12c shows a composite LRP heatmap that is averaged over all 100 possible iterations of the ANN from 1920 to 2080 compared to a composite of ANNs using a smaller L 2 regularization parameter and larger epoch parameter (Figures S12d-S12f). The conclusions remain the same. The regions of greatest relevance are consistent with Figure 6 and point to the North Atlantic and portions of Southeast Asia (only in AER+ and ALL) as essential to the ANN's predictions. This highlights that the regional signals are robust, even after considering different combinations of individual ensemble members and a smaller regularization parameter. Figure 8 shows the distribution of relevances from the 100 unique ANN iterations for the mean relevance value   of these areas as key climate indicator patterns that are learned by our nonlinear ANN. We find weaker relevance over Southeast Asia ( Figure 8a) and India ( Figure 8b) for GHG+, which is likely a result of its industrial aerosols being held fixed to 1920 levels. Thus, the temperature signals in these regions (e.g., absence of local cooling due to aerosols) are not as important for the ANN prediction. In contrast, GHG+ observes the greatest relevance in the North Atlantic, while AER+ observes the smallest relevance in this same area ( Figure 8c). Interestingly, the North Atlantic distribution for ALL falls between AER+ and GHG+. The relevance signals across Central Africa (Figure 8d) and the Southern Ocean (Figure 8e) are mostly consistent between large ensemble simulations. Nevertheless, we note that there is a slight tendency for the Southern Ocean to be more important for the ANN when there is a larger relative contribution from greenhouse gas forcing (GHG+ and ALL). These LRP results highlight the key importance of the North Atlantic and Southeast Asia for the ANNs to make their predictions. To further compare their spatial differences of relevance, Figure S14 shows the difference in LRP heatmap composites for AER+ minus ALL and GHG+ minus ALL.
The largest contrasts in LRP are highlighted across Southeast Asia, the subpolar Atlantic, and parts of Central Africa.
Finally, to understand where the ANN focuses its attention when making predictions on real world data, Figure 9 shows LRP maps for the observations that are input into the ANNs. Similar to the previous LRP maps of the climate model training and testing data, we find several common relevance regions emerge (e.g., North Atlantic and Southeast Asia). However, recall that the prediction of the years for observations are strikingly different between each large ensemble ANN (Figure 3). In particular, the GHG+ neural network is more skillful in predicting the order of the years than by ALL. While there is somewhat greater relevance using observations across the North Atlantic and Southern Ocean for the ANN trained on GHG+ (Figures 9c and 9d) compared to ALL (Figures 9e and 9f), the general patterns between the LRP maps are similar. This indicates that the neural networks are using different combinations of these regional temperature signals to predict the observations. This also suggests that the GHG+ network may be more skillful by focusing on greenhouse gas-induced responses that are closer to real world data, rather than the temperature patterns which are modulated by industrial aerosol forcing in the AER+ and ALL large ensembles. Hence, the LRP maps reveal how industrial aerosols can either mask or augment detection of greenhouse gas-induced warming signals on local to regional scales.   1920-1959 (a, c, e) and 1960-1999 (b, d, f) for observations (OBS) tested separately on each large ensemble artificial neural network (ANN) (AER+; a and b, GHG+; c and d, ALL; e and f).
Higher LRP values indicate greater relevance for the ANN's prediction. Relevance values less than the 95th percentile threshold (see text) have been masked out (gray shading).

Discussion and Conclusions
Due to complex interactions between internal and external forcings in the climate system, it remains difficult to estimate the local and regional influence of human-induced climate change on surface air temperatures (Deser et al., 2012;McKinnon & Deser, 2018;Schneider & Held, 2001). Our work demonstrates the utility of XAI methods for extracting patterns of climate signals due to varying external forcing, which adds to an existing set of statistical techniques for evaluating signal-to-noise in the Earth system (e.g., Wills, Sippel et al., 2020). By leveraging a XAI tool as a novel pattern recognition method, we aim to understand how a nonlinear ANN makes a prediction by learning regional climate signals.
We build off of ANN results from Barnes et al. (2019Barnes et al. ( , 2020 by investigating the role of different anthropogenic external forcings on temperature patterns relative to the influence of atmospheric internal variability. Using climate model data from a new set of large ensemble experiments, we compare different combinations of human-induced climate drivers (greenhouse gases and industrial aerosols) on forced temperature signals over the 20th and 21st centuries. The large number of ensemble members from one fully coupled climate model (CESM1) allow us to disentangle forced changes from internal variability. In particular, we use LRP to investigate how the ANN learns regional climate patterns in order to predict the year from inputs of 2-m air temperatures. Importantly, LRP allows us to investigate the time-evolving relevance (from 1920 to 2080) of input features (maps of 2-m temperature) for the ANN to make an accurate prediction. We also introduce a simple metric to further extract the key relevance regions from the LRP maps. Lastly, we test our nonlinear ANN on observations from a new 20th century atmospheric reanalysis data set (20CRv3) in order to understand how the effect of different external climate forcings impact the prediction of our ANN after testing on real world data.
While efforts are underway to constrain observational uncertainties for the effective radiative forcing of aerosols (e.g., Bellouin et al., 2020;Bender, 2020;Yoshioka et al., 2019), the net influence of aerosols on regional temperature variability remains highly uncertain in historical and future climate model simulations (Bauer et al., 2020;Dittus et al., 2020;Peace et al., 2020). Surprisingly, we found that our ANN trained on a climate model simulation with fixed industrial aerosols (set to 1920 levels; GHG+) made predictions of real world temperature observations that correlated higher with the actual year. In contrast, the ANN trained on a large ensemble with the most realistic external forcing configuration (ALL) was less likely to correctly identify the order of the years for observations. The LRP maps based on observations indicate that the temperature signal in the North Atlantic is particularly relevant for the predictions by the ANN trained on GHG+ compared to ALL. We also note that the spatial features of the LRP maps are similar to areas of anomalously late or early temperature signals in the ToE maps (relative to the rest of the globe), especially across Southeast Asia, Central Africa, and the North Atlantic. XAI methods, such as LRP, may be another promising tool to explore for identifying the emergence of other climate variables in future work.
Our ANN results suggests that CESM1 is highly sensitive to combinations between external forcings when simulating the variability and timing of emergence of global climate signals, such as the North Atlantic Warming Hole, compared to observations. While we focus on only one set of single-forcing large ensembles, we recommend that additional experiments are conducted to fully understand the sensitivity of GCMs to aerosol radiative forcing and subsequently simulate realistic temperature trends and variability.