Detection of Forced Change Within Combined Climate Fields Using Explainable Neural Networks

Assessing forced climate change requires the extraction of the forced signal from the background of climate noise. Traditionally, tools for extracting forced climate change signals have focused on one atmospheric variable at a time; however, using multiple variables can reduce noise and allow for easier detection of the forced response. Following previous work, we train artificial neural networks to predict the year of single- and multi-variable maps from forced climate model simulations. To perform this task, the neural networks learn patterns that allow them to discriminate between maps from different years—that is, the neural networks learn the patterns of the forced signal amidst the shroud of internal variability and climate model disagreement. When presented with combined input fields (multiple seasons, variables, or both), the neural networks are able to detect the signal of forced change earlier than when given single fields alone by utilizing complex, nonlinear relationships between multiple variables and seasons. We use layer-wise relevance propagation, a neural network explainability tool, to identify the multivariate patterns learned by the neural networks that serve as reliable indicators of the forced response. These “indicator patterns” vary in time and between climate models, providing a template for investigating inter-model differences in the time evolution of the forced response. This work demonstrates how neural networks and their explainability tools can be harnessed to identify patterns of the forced signal within combined fields.

few. Recently, neural networks have also entered the fold. Neural networks are machine learning algorithms that are able to detect complex, nonlinear relationships between input and output data (Abiodun et al., 2018). Because neural networks are able to detect highly complex relationships, they are useful for many high-dimensional problems and have become prevalent in several atmospheric science research fields, such as weather forecasting (e.g., Lagerquist et al., 2019; Lee et al., 2021; Weyn et al., 2020), climate model parameterizations (e.g., Brenowitz & Bretherton, 2018; Gettelman et al., 2021; Silva et al., 2021), and, most relevant to the focus of this study, detection of a forced climate response (e.g., Barnes et al., 2019, 2020; Labe & Barnes, 2021; Madakumbura et al., 2021). To detect patterns of forced change, Barnes et al. (2020) trained a neural network to predict the year label of maps of annual-mean temperature (or precipitation) from climate model simulations for forced historical and future scenarios. Given that the internal variability in any given year differs between the various climate models, the neural network had to learn patterns of the forced climate response. Using neural network explainability methods, they then visualized the regions that were the most reliable indicators for identifying change across the Coupled Model Intercomparison Project phase 5 (CMIP5) models. Barnes et al. (2020) demonstrated that neural networks, and their explainability methods, are powerful tools for extracting forced patterns from climate data. This neural network method is a natural approach for isolating the forced climate response. While many other methods require assumptions to be made about the time evolution of the forced signal and internal variability within the system, neural networks do not (Barnes et al., 2019). Following Barnes et al.
(2020), neural networks have since been used to explore the sensitivity of regional temperature signals to aerosols and greenhouse gases using single-forcing large ensembles, and to detect the signal of extreme precipitation in observational data sets (Labe & Barnes, 2021; Madakumbura et al., 2021). Though many climate signal detection studies focus on single variables, such as annual-mean temperature or a single season of precipitation (Gaetani et al., 2020; Li et al., 2017; Santer et al., 1996, 2019), there are benefits to studying climate change through a multivariate lens (Bindoff et al., 2013; Bonfils et al., 2020; Mahony & Cannon, 2018). Many variables in our atmosphere are closely interconnected, so when the variables are intelligently selected, signals of change within multiple variables may be detected earlier than in single variables alone. For example, departure from natural variability can be seen decades earlier in bivariate maps of summertime temperature and precipitation than in either variable alone (Mahony & Cannon, 2018). Similarly, Fischer and Knutti (2012) found that climate model biases in the signal of relative humidity and temperature are negatively correlated such that climate model simulations of their combined quantity, heat stress, have considerably less spread. Combined variables have also been used to identify the impacts of anthropogenic forcings on climate in observational data sets by identifying the multivariate patterns that enhance the signal of change relative to the underlying noise (e.g., Barnett et al., 2008; Marvel & Bonfils, 2013). Understanding how the patterns of the forced response take shape through multiple atmospheric variables also allows for a deeper understanding of the physics at play, as in Bonfils et al. (2020). They explored the evolution of the climate fingerprint by analyzing the leading combined empirical orthogonal functions of temperature, precipitation, and a climate moisture index.
This multivariate approach illuminated two cross-variable patterns of change: intensification of wet-dry patterns and meridional shifts in the ITCZ associated with interhemispheric temperature contrasts. Neither pattern can be fully explained by a single variable which highlights the utility of combining variables when identifying patterns of the forced response.
Combining fields can be useful for identifying patterns of forced change that do not reveal themselves in single fields alone, but this added information does not come without its drawbacks. Many variables covary in complex and nonlinear ways, such as sea surface temperature and precipitation (Lu et al., 2015), drought indices (Wu et al., 2017), and snowpack, soil moisture and flood risk (Swain et al., 2020), often requiring complex statistics to isolate these interactions. Identifying nonlinear correlations within climate fields introduces another issue, namely in explaining the complex interplay between fields. These drawbacks highlight the need for methods that are both complex and explainable in multivariate climate analyses.
Providing a method for both nonlinear and multi-variable analysis of the forced response, this study extends the neural-network approach of Barnes et al. (2020) to combined fields of input. Combined fields could mean the same variable for different temporal segments (e.g., seasons), or different geophysical variables, both of which are explored here. For the sake of consistency and comparability, this study largely follows the methodology of Barnes et al. (2020); however, there are some departures. We standardize the input fields differently, which improves the predictive skill of the neural networks. We also use a slightly simpler neural network architecture.

Data

CMIP6 Climate Models
We use climate model output from the sixth phase of the Coupled Model Intercomparison Project (CMIP6). Specifically, we focus on monthly-, seasonal-, and annual-mean fields of 2-m air temperature (K), precipitation rate (kg m⁻² s⁻¹), and precipitation rate from very wet days (kg m⁻² s⁻¹), hereafter referred to as temperature, precipitation, and extreme precipitation, respectively. We use the meteorological seasons of December-January-February (DJF), March-April-May (MAM), June-July-August (JJA), and September-October-November (SON) for calculating seasonal-mean fields. Defining seasons in this way allows for the earliest detection of forced change (see Figure S1 in Supporting Information S1 for more details).
Very wet days are defined as days that exceed the 95th percentile of all days with precipitation over a pre-defined baseline period (Donat et al., 2016). This is a popular index for measuring changes in extreme precipitation (Cui et al., 2019; Kim et al., 2020) and is used as an indicator of climate change by the U.S. Global Change Research Program (USGCRP, 2018). We define the baseline as the 40 years from 1980 to 2019, a period for which daily precipitation data exist in both the climate models and the observations. To remove the instances in which climate models simulate sub-trace daily precipitation totals, we only include days with at least 1 mm of precipitation when calculating the 95th percentile of all days with precipitation (Dai et al., 2007).
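The index above can be sketched in a few lines of NumPy. This is our illustrative reading of the definition, not the authors' code: the function names are ours, and interpreting "precipitation rate from very wet days" as the threshold-exceeding total averaged over all days is an assumption.

```python
import numpy as np

def very_wet_threshold(daily_pr_mm, years, base=(1980, 2019), wet_min=1.0):
    """95th percentile of precipitating days (>= wet_min mm) in the baseline period."""
    in_base = (years >= base[0]) & (years <= base[1])
    wet_days = daily_pr_mm[in_base]
    wet_days = wet_days[wet_days >= wet_min]  # drop sub-trace totals before the percentile
    return np.percentile(wet_days, 95)

def very_wet_rate(daily_pr_mm, threshold):
    """Precipitation from days exceeding the threshold, averaged over all days
    (one plausible reading of 'precipitation rate from very wet days')."""
    return daily_pr_mm[daily_pr_mm > threshold].sum() / daily_pr_mm.size
```

In practice this would be applied independently at each grid point of each model's daily output.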
The neural networks are trained on CMIP6 climate model data. One ensemble member is selected for each of the 37 CMIP6 climate models analyzed so that each climate model is only represented once in the training and testing data. Since daily output is required to calculate very wet days, we are limited to 32 models for extreme precipitation (Figure S3 in Supporting Information S1). We analyze the climate model data from 1920 to 2098 under historical forcing and the shared socioeconomic pathway 585 (SSP585) scenario. SSP585 represents the highest development pathway within the CMIP6 scenarios (O'Neill et al., 2016), combining SSP5 and representative concentration pathway 8.5.
Our neural network methodology requires that all climate model fields have the same shape. To accommodate this we regrid the climate model fields from their native resolutions using the second-order conservative remapping method in the Climate Data Operators package from MPI (Schulzweida, 2019). This regridding step reduces the spatial resolution of the data for most climate models. For temperature and precipitation, the data is regridded to 4° latitude by 4° longitude. We elect to use lower resolution data to reduce the computational expense of training neural networks over global maps of temperature and precipitation. Since the domain for extreme precipitation is smaller than the domain for temperature and precipitation (see the following paragraph), and higher resolution data may better capture regional extreme precipitation patterns, the data for extreme precipitation is regridded to a slightly higher resolution: 1.5° latitude by 1.5° longitude.
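For reference, this regridding step can be reproduced with CDO's second-order conservative operator and a plain-text grid description; the grid values below (a 4° × 4° global lon-lat grid) and the file names are illustrative assumptions, not the exact configuration used in the paper.

```
# target_grid.txt — CDO grid description for a 4° × 4° global lon–lat grid
gridtype = lonlat
xsize    = 90
ysize    = 45
xfirst   = 0.0
xinc     = 4.0
yfirst   = -88.0
yinc     = 4.0

# usage: cdo remapcon2,target_grid.txt model_native.nc model_4x4.nc
```

The same pattern with a 1.5° grid description would apply to the extreme precipitation fields.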
Two spatial domains are considered in the results of this paper. For temperature and precipitation, the neural networks are trained on all land north of 60°S. Here, we choose to focus on land grid points because that is where humanity lives and will acutely feel the impacts of changing surface temperatures and precipitation. We also exclude Antarctica where climate models and reanalyses struggle to accurately simulate temperature and precipitation. Each map of temperature and precipitation has 948 unique data points. For extreme precipitation, the neural networks are trained on North and South America (land grid points bounded by 90°N, 55°S, 170°W, and 25°W). Here, we choose to narrow the regional scope to show that neural networks are powerful tools for identifying the forced response even when the spatial domain, and thus the available data, is limited. Each map of extreme precipitation has 2,314 unique data points.

Observations
While this work largely focuses on the results of neural networks trained and tested on climate model data, we show that neural networks trained on climate model data can be applied to observational data as well. For temperature, we use the Berkeley Earth Surface Temperature data set (Rohde & Hausfather, 2020). This data set provides both a temperature climatology and the anomalies at monthly resolution from 1850 to the present. We added the anomalies to the climatology to reconstruct the absolute temperature (K) at each grid point for all months between 1920 and 2019. Monthly observational precipitation fields are obtained from the NOAA Global Precipitation Climatology Project (GPCP), version 2.3, for 1979 to the present (Adler et al., 2018). Since daily precipitation fields are required to calculate extreme precipitation, and daily GPCP precipitation observations are only available back to October 1996, we elected to calculate observed extreme precipitation using the European Centre for Medium-Range Weather Forecasts' ERA5 global reanalysis (Hersbach et al., 2020) at 6-hr resolution from 1980 to present. All observations are regridded in the same way as the climate model data for each respective variable.

Neural Network Design
To identify indicator patterns of the forced response for combined fields we first develop artificial neural networks that, given maps of CMIP6 climate model output from every simulated year from 1920 to 2098, are tasked to predict the year that is being simulated. The results for neural networks trained on 10 different input vectors are explored in the following two sections. The input vectors include annual-, seasonal-, and monthly-mean data for temperature, precipitation, and temperature and precipitation combined, as well as seasonal-mean maps for extreme precipitation over the Americas. We use this diverse selection of input vectors to compare neural network performance and indicator patterns for single-field and combined-field inputs.
The neural network architecture is illustrated in Figure 1. Each unit of the input layer corresponds to a different grid point in the input fields. For example, a neural network that uses seasonal-mean maps of temperature and precipitation as input (two variables and four seasons for a total of eight maps, 948 grid points per map) would have an input vector with 7,584 units. In all cases, this input layer is followed by two fully connected hidden layers with 10 nodes each. The hidden layers are followed by an output layer that consists of 22 classes, one corresponding to each decade midpoint between 1905 and 2115 (e.g., 1905, 1915, 1925, …, 2115). A softmax function is applied to the outputs to convert them to units of likelihood, where the sum of the output vector is one.
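As a concrete sketch, the forward pass of this architecture fits in a few lines of NumPy. The randomly initialized weights stand in for trained values, and the tanh activation is our assumption (the paper defers activation choices to its Supporting Information).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Convert raw outputs to likelihoods that sum to one."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# seasonal-mean temperature + precipitation input: 2 variables x 4 seasons x 948 points
n_in, n_hidden, n_classes = 7584, 10, 22

# untrained stand-in weights for the two hidden layers and the output layer
w1, b1 = rng.normal(0, 0.01, (n_in, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(0, 0.30, (n_hidden, n_hidden)), np.zeros(n_hidden)
w3, b3 = rng.normal(0, 0.30, (n_hidden, n_classes)), np.zeros(n_classes)

def predict_likelihoods(x):
    h1 = np.tanh(x @ w1 + b1)            # activation choice is an assumption here
    h2 = np.tanh(h1 @ w2 + b2)
    return softmax(h2 @ w3 + b3)         # one likelihood per decade-midpoint class

likelihoods = predict_likelihoods(rng.normal(size=(1, n_in)))
```

The output vector has 22 entries, one per decade midpoint from 1905 to 2115, and sums to one after the softmax.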
Neural networks with this architecture learn the patterns of forced change well, and more complicated architectures do not substantially improve neural network performance (see Figure S2 in Supporting Information S1). It is also notable that this neural network architecture performs better than multiple linear regression, especially when trained on precipitation, and thus using nonlinear techniques improves our ability to detect the year via patterns of forced change (Figure S2 in Supporting Information S1). This architecture is also widely accessible to most in the climate science community as it can be trained on a personal laptop; highly complex architectures can be prohibitively computationally expensive (Chen et al., 2020). These neural networks were trained on a standard desktop computer with 16 GB of RAM and a 3.1 GHz, 6-core processor. Training a single network took anywhere between 2 and 10 min depending on the size of the input field. More details on the neural network design and hyperparameter tuning can be found in Supporting Information S1.
The neural network is tasked with "predicting the year" rather than "predicting the decade" as the output layer may suggest. To translate between decade midpoints and individual year labels, we use fuzzy encoding (Zadeh, 1965) such that each year can be mapped to one or more neighboring classes with varying degrees of membership (encoded as likelihood). This differs from traditional methods, which would map each year to a single decade midpoint. In the traditional case, 2040 and 2049 would be considered members of the same class since they are in the same decade, and information would be lost as there is no way to distinguish whether the samples come from the beginning or the end of the decade. Using fuzzy encoding, the information of where a sample lies within each decade is retained. We use a triangular membership function (Zadeh, 1965) with a width equal to one decade such that each year has partial membership in one or two neighboring decade classes, and the total membership sums to one. Following this method, any year directly on a decade midpoint has membership in that class only, while years that fall between decade midpoints have membership in the two neighboring classes. The year 1925, for example, is mapped to a likelihood of one for the class 1925 and a likelihood of zero in all other classes. The year 2078 is mapped to a likelihood of 0.7 for the 2075 class and a likelihood of 0.3 for the 2085 class. Note that decoding class likelihoods back to a year is simply the likelihood-weighted sum of the decade midpoints: 0.7 × 2075 + 0.3 × 2085 = 2078. A visualization of the encoding/decoding process can be found in Figure 2 of Barnes et al. (2020).
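The triangular encoding and its decoding reduce to a few lines; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

MIDPOINTS = np.arange(1905, 2116, 10)  # the 22 decade-midpoint classes

def encode_year(year):
    """Triangular membership with a one-decade width: each year has weight in
    at most two neighboring classes, and the weights sum to one."""
    return np.clip(1.0 - np.abs(MIDPOINTS - year) / 10.0, 0.0, 1.0)

def decode_year(likelihoods):
    """Likelihood-weighted sum of the decade midpoints."""
    return float(likelihoods @ MIDPOINTS)

encode_year(2078)               # 0.7 in the 2075 class, 0.3 in the 2085 class
decode_year(encode_year(2078))  # recovers 2078.0
```

The same `decode_year` step is what converts the network's softmax output back to a single predicted year.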

Neural Network Training
For each input vector we train 100 neural networks that differ only in which climate models are randomly split into the training and testing sets. Partitioning so that each climate model's samples are all part of either the training set or the testing set avoids issues with autocorrelation (i.e., near-identical data appearing in both the training and testing sets). One hundred neural networks provide a range of results across multiple combinations of training and testing simulations and offer confidence that the results are consistent across CMIP6 climate models and do not overfit to any one training set. Each neural network is trained over the entire 1920-2098 period on 80% of the climate model simulations, and then tested on the remaining 20%. This leads to a training set of 30 simulations and a testing set of 7 simulations for temperature and precipitation fields, and a training set of 26 simulations and a testing set of 6 simulations for extreme precipitation fields. We train the neural networks using the binary cross-entropy loss (see Barnes et al., 2020) between the predicted class likelihoods and the correct class membership weights, such that the loss function is minimized when the two are equal. Properties of the neural network training process, such as the learning rate and activation functions, can be found in Supporting Information S1.
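The model-level split can be sketched as follows; the function name and seeding scheme are ours, but the key property matches the text: every sample from a given climate model lands entirely in one set.

```python
import numpy as np

def split_by_model(sample_models, test_frac=0.2, seed=0):
    """Boolean train/test masks over samples, holding out whole climate models
    so near-identical maps never straddle the two sets."""
    rng = np.random.default_rng(seed)
    models = np.array(sorted(set(sample_models)))
    rng.shuffle(models)
    n_test = max(1, round(test_frac * len(models)))
    test_models = set(models[:n_test])
    is_test = np.array([m in test_models for m in sample_models])
    return ~is_test, is_test
```

Varying `seed` across the 100 trained networks would reproduce the different random train-test partitions.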
The neural networks have several hidden nodes which enable them to learn complicated relationships between the input and output data. However, with limited training data, many of these learned relationships will capture patterns of the noise in the training data set, which can lead to overfitting (Srivastava et al., 2014). Atmospheric science data is also highly correlated in space, and this collinearity can cause complications in the interpretation of the learned weights (Newell & Lee, 1981). Thus, to reduce overfitting and address these issues, we apply ridge regularization (L2 regularization, see Barnes et al., 2020) to the weights of the first hidden layer. Ridge regularization adds a penalty (called the ridge penalty) proportional to the square of the weights, so the solution is penalized for having large weights. Through training, this acts to shrink the largest weights, thus spreading the weight out more evenly across multiple grid points. In our application this results in a more even distribution of importance across regions with strong spatial correlation and improves the performance of the neural networks on data they were not trained on, namely the models in the testing set (elaborated on in Figure 3, Section 4 of Barnes et al., 2020).

Figure 1. Schematic of the fully connected neural network architecture. Inputs from multiple maps of data are flattened into an input layer vector (the size of the input layer ranges from 948 to 22,752). These inputs are fed through two hidden layers with 10 nodes each. The neural network is trained to predict the year that the data came from, outputting the likelihood that the input data came from each decade midpoint between 1905 and 2115. This is then converted to a year via fuzzy classification.
Unlike classical approaches, which tune the neural network to reduce the mean squared error (MSE) between the predicted and truth outputs in the testing set (in our case, the MSE between the truth and predicted years), we select the ridge penalty that minimizes the time of emergence (TOE) of the forced climate signal (see Section 3.3). Using TOE, rather than MSE, to identify the appropriate ridge penalty ensures that we are encouraging the neural networks to learn the patterns of the forced response across all decades. When a small ridge penalty is used, the neural networks are able to predict the year at the end of the twenty-first century almost perfectly, at the expense of the predictive skill in earlier decades. This results in a later calculation of TOE for the testing set. Slightly increasing the ridge penalty can allow the neural networks to detect the climate change signal slightly earlier (Figure S4 in Supporting Information S1). The ridge penalty used for each input vector can be found in Supporting Information S1. We use the same ridge penalty for all 100 neural networks trained on each input vector.
All input fields (for climate models and observations) are standardized to assist with the training and overall performance of the neural networks. We subtract the 1980-2019 mean at each grid point of the input fields for each climate model independently. This recasts each input field to measure the change relative to the 1980-2019 mean, rather than the raw magnitudes, which improves the predictive skill of the neural networks and is also appropriate for identifying indicator patterns of forced change. Since values for precipitation change are often on the order of 10⁻⁶, while values for temperature change are on the order of 10⁰, we normalize the data so the inputs to the neural network all have a similar magnitude. To do this, the data from 1980 to 2019 at each grid point for each climate model are detrended using ordinary least squares linear regression. We then take the multi-model mean of the standard deviation of the detrended 1980-2019 data for each grid point. The input fields are then divided by this field of standard deviations so the inputs are of the same magnitude and fall in a reasonable range for training the neural networks. Since all our observational data sets include the years 1980-2019, we standardize the observations as if they were additional climate models: each observational data set has its own 1980-2019 mean subtracted and is divided by the same multi-model standard deviations that were used to standardize the CMIP6 data.
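A sketch of this standardization in NumPy, under the assumption that the fields arrive as a (model, year, gridpoint) array:

```python
import numpy as np

def standardize(fields, years, base=(1980, 2019)):
    """Subtract each model's own baseline mean, then divide by the multi-model
    mean of the detrended baseline standard deviation, per grid point."""
    in_base = (years >= base[0]) & (years <= base[1])
    anomalies = fields - fields[:, in_base].mean(axis=1, keepdims=True)

    # OLS-detrend each model's baseline segment at every grid point
    t = years[in_base].astype(float)
    X = np.stack([np.ones_like(t), t], axis=1)          # (n_base_years, 2)
    n_models, n_base, n_grid = fields[:, in_base].shape
    Y = fields[:, in_base].transpose(1, 0, 2).reshape(n_base, -1)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta

    # multi-model mean of the per-model detrended std, one value per grid point
    std = resid.std(axis=0).reshape(n_models, n_grid).mean(axis=0)
    return anomalies / std
```

Observations would be passed through the same anomaly step but divided by the standard-deviation field computed from the models.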

Time of Emergence Calculation
The TOE of the forced climate response is the time at which the forced response signal is distinguishable from the background climate by the neural network. Specifically, we define the TOE as the year when the neural network is able to distinguish that year's map from any map over a historical baseline period. In this work, we define this baseline period as 1920-1959 and, under this definition, the earliest possible TOE estimate is 1960. The TOE is estimated for each climate model simulation independently, and a schematic of how the TOE is estimated is presented in Figure 2. First, we calculate the maximum of the neural network-predicted years over 1920-1959 for each model, which is referred to as the baseline maximum. We then identify the TOE as the earliest year in which a map, and all subsequent maps, permanently exceed the baseline maximum. In Figure 2, sample model 1 has a baseline maximum of 1966 and permanently exceeds this prediction threshold in 2028. Sample model 2 has a baseline maximum of 1981 and permanently exceeds this threshold in 1989. Thus, the TOE for sample models 1 and 2 are estimated as 2028 and 1989, respectively. In the following sections we present the TOE for the testing set; however, TOE estimates are similar for both the training and testing sets.

Figure 2. The TOE is defined as the earliest year in which a map, and all subsequent maps, permanently exceed the maximum predicted year from the baseline period. The baseline maximum for each model is indicated by the horizontal lines, the last year that falls below the baseline maximum is circled, and the TOE is indicated by the vertical lines. Sample model 1 (dark red) has a baseline maximum of 1966 and permanently exceeds this threshold in 2028. Sample model 2 (light green) has a baseline maximum of 1981 and permanently exceeds this threshold in 1989. Thus, the TOE for sample models 1 and 2 are estimated as 2028 and 1989, respectively.
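The TOE rule (the first year whose prediction, and every later prediction, exceeds the baseline maximum) can be written directly; the reverse cumulative-AND trick and the names below are ours:

```python
import numpy as np

def time_of_emergence(predicted_years, years, base=(1920, 1959)):
    """Earliest year from which predictions permanently exceed the maximum
    prediction over the baseline period; None if they never do."""
    in_base = (years >= base[0]) & (years <= base[1])
    baseline_max = predicted_years[in_base].max()
    exceeds = predicted_years > baseline_max
    # True where this year AND all subsequent years exceed the baseline maximum
    permanent = np.flip(np.logical_and.accumulate(np.flip(exceeds)))
    if not permanent.any():
        return None
    return int(years[np.argmax(permanent)])
```

A brief exceedance that later dips back below the baseline maximum does not count, which is what "permanently" enforces.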
Figure 3. Year predicted by the neural network (y-axis) versus the truth year (x-axis) for temperature (a, d, g), precipitation (b, e, h), and temperature and precipitation combined (c, f, i). Input maps include annual-mean data (a-c), seasonal-mean data (d-f), and monthly-mean data (g-i). Testing data is shown in color and observations are shown in white.

Layer-Wise Relevance Propagation
To visualize the patterns learned by the neural networks we apply layer-wise relevance propagation (LRP), which highlights the regions that were most relevant in the neural network's decision-making process (Bach et al., 2015; Montavon et al., 2019). Toms et al. (2020) discuss in detail how LRP can be used for neural network explainability in the geosciences, though the most relevant details of LRP are described here.
LRP is a neural network explainability method that traces how information flows through the pathways of a trained neural network. The values in a single-sample input vector (in our case, a single year) are passed forward through the neural network. Using the same weights and activations used in the forward pass, LRP then propagates a single-valued output back through the neural network to infer the extent to which each of the values in the input layer contributes to the output (see Figure 2 in Bach et al., 2015). We refer to this quantity as relevance. Through this backpropagation process the output value is conserved such that the sum of all relevance is equal to the output. At first order, relevance can be likened to the product of the regression weights and input map in a linear model. This quantity is natively unitless, but we convert it to a fraction by dividing by the output value. This way, we can consider the relevance of a single pixel in terms of its fractional contribution to the predicted class. Since LRP propagates only a single output value at a time, we propagate relevance only for the decade class with the highest likelihood. While the relevance maps detected by these networks evolve from year to year, this evolution is slow, so we find visualizing the highest-likelihood decade is sufficient.
There are several LRP decomposition rules which provide different methods of visualizing neural networks (Lapuschkin, 2019; Mamalakis et al., 2021). In our application we use the αβ-rule, which propagates positive relevance (regions that act to increase the class likelihood) and negative relevance (regions that act to decrease the class likelihood) separately. Using the parameters α = 1 and β = 0, we choose to propagate only positive relevance, thus highlighting the regions that added to the likelihood of the selected decade class. We also examined the relevance maps for β = 1 and found that propagating negative relevance did not impact the conclusions.
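For intuition, the α = 1, β = 0 (z+) rule for fully connected layers can be implemented in a few lines. This is a generic sketch rather than the paper's implementation: bias terms are ignored, a small stabilizer guards the division, and the demo network and weights are invented for illustration.

```python
import numpy as np

def lrp_alpha1beta0(layer_inputs, weights, relevance_out, eps=1e-9):
    """Redistribute the output relevance back to the input, layer by layer,
    using only positive contributions (the z+ rule)."""
    relevance = relevance_out
    for a, W in zip(reversed(layer_inputs), reversed(weights)):
        Wp = np.maximum(W, 0.0)        # keep positive weights only (alpha=1, beta=0)
        z = a @ Wp + eps               # each unit's total positive input
        s = relevance / z
        relevance = a * (s @ Wp.T)     # share of relevance each input earned
    return relevance

# tiny demo: 4 inputs -> 3 hidden (ReLU) -> 2 outputs, nonnegative activations
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, (1, 4))
W1 = rng.uniform(0.05, 1.0, (4, 3))
W2 = rng.uniform(0.05, 1.0, (3, 2))
h1 = np.maximum(x @ W1, 0.0)

R_out = np.array([[1.0, 0.0]])  # propagate only the highest-likelihood class
R_in = lrp_alpha1beta0([x, h1], [W1, W2], R_out)
```

With nonnegative activations, the total relevance is (approximately) conserved from the output back to the input, which is the property the text describes.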

Signal-to-Noise Ratio Calculation
In Section 4, we compare the LRP relevance maps to maps of S/N ratio, a more conventional method for identifying indicator patterns of the forced response. The S/N ratio comprises three distinct components: the forced signal, which is divided by the combined noise due to internal variability and due to climate model disagreement. A higher S/N ratio indicates that the signal of the forced response within the climate models is large relative to the underlying noise. We evaluate the S/N ratio for each grid point separately, following the methodology in Hawkins and Sutton (2012). First, we smooth the data from 1920 to 2098 for each climate model using a fourth-order polynomial fit. The signal is defined as the difference between 2090 and 1920 in the smoothed data, internal variability is defined as the standard deviation of the residuals from the smoothed data, and climate model disagreement is defined as the standard deviation of the signals calculated for all the climate models. The S/N ratio is calculated by dividing the climate signal by the 90% confidence interval of the noise: internal variability plus climate model disagreement. The S/N ratio and its components can be seen in Figure S8 in Supporting Information S1.
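Under our reading of this recipe, the per-grid-point calculation looks like the following; combining the two noise terms by simple addition and using a 1.645 factor for the 90% interval are assumptions on our part, not details confirmed by the text.

```python
import numpy as np

def signal_to_noise(series, years, fit_degree=4):
    """series: (model, year) values at one grid point.
    Smooth each model with a quartic fit; signal = smoothed 2090 minus
    smoothed 1920; noise = internal variability + model disagreement."""
    signals, internal = [], []
    for y in series:
        coefs = np.polyfit(years, y, fit_degree)
        smooth = np.polyval(coefs, years)
        signals.append(np.polyval(coefs, 2090) - np.polyval(coefs, 1920))
        internal.append((y - smooth).std())       # residual std = internal variability
    noise = np.mean(internal) + np.std(signals)   # summing the terms is a simplification
    return np.mean(signals) / (1.645 * noise)     # 1.645 ~ 90% interval, an assumption
```

Mapping this function over every grid point would produce the S/N field compared against the relevance maps.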

Time of Emergence
Across all input vectors of temperature and precipitation, the neural networks are able to learn patterns of the forced response. In the early-to-mid twentieth century, the forced signal is small and undetectable by the neural networks amidst the noise of internal variability and model disagreement, which leads to poor predictive skill (Figure 3). However, as the signal increases in magnitude into the late-twentieth and twenty-first centuries, the neural networks are able to detect the patterns of the forced response and distinguish between maps in different years. These patterns of the forced response detected by the neural networks are generalizable across CMIP6 models, and as a result the neural networks have predictive skill for seen data (the training set, see Supporting Information S1) as well as unseen data (the testing set). These behaviors are shown in Figure 3, which presents the predicted years from one trained neural network for each combination of global precipitation and temperature input fields. Across all input vectors, a similar story of the forced signal unfolds. Prior to the TOE, the neural network is unable to identify patterns that allow it to accurately predict the year. As a result, the neural network is equally confident (or unconfident) about which year, between 1920 and the TOE, each input came from, so it predicts years right around the middle of the twentieth century. After the TOE, the predicted years tend to follow a 1:1 line with the truth years, indicating that the neural network has identified reliable indicators of change for this period.
Although the neural networks are trained on climate model simulations, their learned patterns can be used to predict the year for observational data as well. When observations are used as input, the predicted years increase with time, just as they do for climate model input (Figure 3). This means that the indicators of change derived by the neural networks trained on climate model simulations are largely consistent with the real world. Pearson correlations (r) of the actual years with the years predicted by each neural network are shown in Figure 4. All correlations are positive, indicating that the years predicted by the neural networks increase with time. These correlations are strongest for temperature and combined observations (r ≈ 0.9), but still quite high for precipitation (r ≈ 0.8). Correlations of actual years with predicted years are slightly higher for the combined temperature and precipitation observations than for temperature observations alone (Figure S5 in Supporting Information S1), suggesting that the multivariate indicator patterns derived from climate model data are useful for understanding trends in the present-day climate. Across all variables, the highest observational correlations are found by the neural networks trained on seasonal-mean data. The correlations of actual years with predicted years for precipitation observations are sensitive to the data set of choice, which is expanded on in Section S4 and Figures S5 and S6 in Supporting Information S1.
The average TOEs, calculated from the climate models in the testing sets of all 100 trained neural networks for each input field (Figure 5), reveal that the forced response can be detected earlier in maps of temperature than in maps of precipitation (Figures 5a-5c). When presented with combined fields the neural networks are, in many cases, able to detect the forced signal even earlier than when given single fields alone (Figures 5b and 5f). The TOE is generally earlier for the neural networks trained on seasonal-mean data than for the neural networks trained on annual-mean data (Figures 5d-5f). This is most notable for precipitation fields, likely because there are large seasonal precipitation responses muted by taking the annual mean (Tabari & Willems, 2018; Zappa et al., 2015). The TOEs are earlier for temperature and precipitation combined than temperature alone when using seasonal-mean maps (Figure 5b), but are approximately equal when using annual-mean or monthly-mean maps (Figures 5a and 5c), which suggests that precipitation only improves upon the detectability of the forced temperature signal when seasonal-mean fields are used. While annual-mean precipitation may mute seasonal precipitation signals, monthly-mean precipitation is noisy. In this case, seasonal means emerge as the appropriate temporal segments for detecting precipitation change, underlining the importance of intentional and intelligent selection of neural network inputs.

Figure 4. Correlations were computed for all years beginning in 1980 where observational data exists for all variables. The box plots indicate the first, second, and third quartile statistics, and the whiskers denote 1.5 times the interquartile range, or the minimum/maximum value, whichever is less extreme. Outliers are excluded for clarity, but can be found in Figures S5 and S6 in Supporting Information S1.
The neural networks identify the earliest TOEs when trained on seasonal-mean temperature and precipitation combined (Figures 5b and 5f). The TOE results for all 100 seasonal-mean neural networks are summarized in the box plots in Figure S7 in Supporting Information S1. While the improvement in forced response detection is small when precipitation is combined with temperature, it is still notable given that the forced signal of temperature is much clearer than the forced signal of precipitation. We use these variables as an initial example for employing this neural network methodology. We anticipate that more robust results might be found for combinations of variables that have more distinct combined signals, such as humidity and temperature (Fischer & Knutti, 2012).

Figure 5 caption (fragment): …and monthly-mean (c) input fields, and neural networks trained on temperature (d), precipitation (e), and temperature and precipitation combined (f). A total of 100 neural networks with different train-test splits were trained for each input field. Each dot represents the mean TOE for all climate models in the testing set for a single trained neural network, ranked from earliest to latest. Note the change in the y-axes between panels, and that the TOE results for each set of neural networks appear once in panels (a-c) and once in panels (d-f).

Indicator Patterns for Combined Variables
Having shown that the neural networks are able to predict the year given seasonal means of temperature and precipitation (Figures 3 and 5), we now identify and explore the spatial indicator patterns used by the neural networks to make correct predictions. By understanding the neural networks' decision-making process, we can identify which regions act as combined (multi-seasonal and multi-variable) indicators of forced change amidst a background of internal variability and climate model disagreement. To identify these indicator patterns, we apply LRP to all climate model samples in the training and testing sets from the year 2090 for the seasonal-mean combined neural networks. Averaging the LRP results for each season and variable, we highlight the regions that have the highest mean relevance across the 37 CMIP6 climate models and 100 trained neural networks. The relevance maps for temperature (precipitation) are shown in Figures 6a-6d (7a-7d) and indicate the importance of each region in the neural networks' predictions of the year 2090.
LRP identifies temperature over North Africa and Central Asia in JJA (Figure 6c) and the Andes and Central Africa in SON (Figure 6d) as the most relevant regions for predicting the year. For precipitation, the regions of highest relevance are found in Canada and Russia in DJF and SON (Figures 7a and 7d) and in Central Africa and India in JJA and SON (Figures 7c and 7d). That is, these are the regional patterns identified by the neural networks that indicate the presence of forced change across the CMIP6 climate models. The scales of the color bars differ between Figures 6 and 7, such that the darkest regions in the temperature maps are approximately one order of magnitude more relevant than the darkest regions in the precipitation maps. Hence, the neural networks rely more heavily on the temperature inputs than the precipitation inputs to accurately predict the year. This is not surprising because the forced signal of temperature is clearer than the forced signal of precipitation (Figure SPM.7 in Field et al., 2014). Even so, including seasonal precipitation allows the neural networks to detect forced change earlier within combined fields than in temperature fields alone (Figure 5b). The improvement in neural network performance provided by precipitation (alongside temperature) is particularly noteworthy given that the S/N ratio for temperature is larger than the S/N ratio for precipitation in all seasons and regions (Figures 6e-6h and 7e-7h, discussed further in this section). In other words, the forced temperature signal is always more pronounced than the forced precipitation signal, but the precipitation signal is still useful for detecting forced change.
LRP is designed to highlight the regions that were most relevant for predicting the correct class (in our case, the correct decade class). These LRP indicator patterns for 2090 are not the time-mean patterns of the forced response; rather, they are the patterns used by the neural network to distinguish the end of the twenty-first century from all other decades. This is distinctly different from the S/N ratio, which identifies the regions where the forced change from 1920 to 2090 is largest relative to internal variability and climate model spread. Maps of S/N ratio for temperature are shown in Figures 6e-6h, and the corresponding maps for precipitation are shown in Figures 7e-7h, where a higher S/N ratio (darker green) indicates a clearer forced signal. These regions of high S/N ratio are consistent with other related studies (e.g., Hawkins et al., 2020). For the most part, the indicator patterns identified by LRP correspond with the regions with the highest S/N ratios. Calculating the Spearman's rank correlation (ρ) between each map of relevance and S/N ratio, we find that there is generally a strong positive correlation (0.71 ≤ ρ ≤ 0.77) between the LRP indicator patterns and the S/N ratios for temperature, and a moderate positive correlation (0.30 ≤ ρ ≤ 0.56) for precipitation. The exact correlation coefficients between each map are displayed in the subtitles of Figures 6e-6h and 7e-7h.
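The comparison between relevance and S/N maps reduces to a rank correlation of two spatial fields. The following sketch is not the paper's code; the map shape and noise model are assumptions. It flattens two synthetic maps and computes Spearman's ρ, as reported in the subtitles of Figures 6e-6h and 7e-7h.

```python
import numpy as np
from scipy.stats import spearmanr

def map_rank_correlation(relevance_map, sn_map):
    """Spearman rank correlation between two flattened spatial maps."""
    rho, _ = spearmanr(relevance_map.ravel(), sn_map.ravel())
    return rho

# Synthetic 36 x 72 (lat x lon) maps in which the relevance loosely
# tracks the S/N ratio, as found for temperature in Figures 6e-6h.
rng = np.random.default_rng(2)
sn = rng.gamma(shape=2.0, scale=1.0, size=(36, 72))
relevance = sn + rng.normal(0.0, 0.8, sn.shape)

rho = map_rank_correlation(relevance, sn)
print(round(rho, 2))
```

Because Spearman's ρ compares ranks rather than magnitudes, it is insensitive to the order-of-magnitude difference in scale between the relevance and S/N fields.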
Given that precipitation contributes only a small amount of relevance compared to temperature, it is perhaps unsurprising that there are several regions where the S/N ratio for precipitation is high but the relevance is low (e.g., Alaska in JJA, Figures 7c and 7g, or South Africa in SON, Figures 7d and 7h). Most likely, the forced signal of temperature is clear enough that these regions do not add to the predictive skill of the neural networks. Regions also exist where the S/N ratio for temperature is high despite low relevance (e.g., North Africa in DJF, Figures 6a and 6e), although these are rarer, as hinted by the strong correlation between the temperature maps of S/N ratio and relevance. In contrast, there are fewer regions with high relevance despite low S/N ratios, but they do occur (e.g., SON temperatures in northern South America, Figures 6d and 6h). These high-relevance, low-S/N ratio regions confirm that the indicator patterns identified by LRP capture more than the local S/N ratio. Some reasons a region/variable/season may be important in terms of LRP, but not in terms of S/N ratio, are: (a) LRP may be identifying places in our data where a signal exists only in the combination of regions/seasons/variables, which would not be captured by this definition of S/N ratio. (b) Since LRP highlights the patterns the neural networks use to predict the correct decade over all other decades, it may be capturing abrupt nonlinear changes in the climate that are filtered out by the century-long analysis of S/N ratio. In the next section, we discuss further applications of neural network-derived indicator patterns and task the network with the much harder problem of identifying changes in extreme precipitation over the Americas.

Figure 6. Combined indicator patterns of the forced response (temperature). Average temperature LRP results for the seasonal-mean combined neural networks (left, in yellow) and signal-to-noise (S/N) ratio (right, in green) for 2090. Darker shading indicates regions of temperature that are more relevant for the neural network's prediction or have a higher S/N ratio. The Spearman's rank correlation (ρ) between corresponding maps of relevance and S/N ratio is shown in the subtitles of panels (e-h).

Extreme Precipitation Over the Americas
We now task the neural networks to predict the year given combinations of seasons for a single variable: extreme precipitation over the Americas. We choose to shift our focus for a few reasons. First, we wish to demonstrate that this neural network approach can be extended to variables that have considerable noise (like extreme precipitation, see Figure S8 in Supporting Information S1), and to data sets that do not cover the globe. Second, extreme precipitation has major implications for human health (Ali et al., 2019; Eekhout et al., 2018; Rosenzweig et al., 2002), but there is considerable disagreement between climate models in its signal (Figure S8 in Supporting Information S1). This neural network approach can be used to identify agreed-upon patterns despite climate model spread. Further in this section, we will demonstrate that LRP maps can be used to investigate climate model differences and better understand the time evolution of the forced response.

Figure 7. Combined indicator patterns of the forced response (precipitation). Average precipitation layer-wise relevance propagation results for the seasonal-mean combined neural networks (left, in yellow) and signal-to-noise (S/N) ratio (right, in green) for 2090. Darker shading indicates regions of precipitation that are more relevant for the neural network's prediction or have a higher S/N ratio. The Spearman's rank correlation (ρ) between corresponding maps of relevance and S/N ratio is shown in the subtitles of panels (e-h).
The extreme precipitation signal is not as pronounced as the temperature signal, and using the Americas rather than the full globe limits the amount of unique information in the input field. Nevertheless, the neural networks are still able to detect patterns of forced change. Figure 8 depicts the years predicted by one neural network trained on seasonal-mean extreme precipitation. As in Figure 3, the neural network is unable to accurately predict the year given CMIP6 data prior to the TOE around 2010, whereafter the predicted years generally follow the 1:1 line with the truth years, indicating that the neural network has identified reliable indicators of change for this period. All Pearson correlations of the actual years with the predicted years for extreme precipitation in observations are positive (r ≈ 0.4), demonstrating that the indicator patterns found in climate models can be successfully applied to observations (Figure 4). These correlations are not as strong as those for mean precipitation observations, due in part to the magnitude of climate model disagreement in extreme precipitation as well as to the observational data set used: ERA5. As shown in Figure S6 in Supporting Information S1, the correlations of actual with predicted years for ERA5 precipitation observations are far smaller than those for GPCP observations. ERA5 tends to perform poorly in remote regions such as northern North America and northwestern South America (Bell et al., 2021), which may be responsible for these low correlations. The correlations between actual years and neural network-predicted years for extreme precipitation observations are explored in much more detail by Madakumbura et al. (2021).
To investigate the indicator patterns used by the neural networks to predict the year when the forced signal first emerges from the background noise, we apply LRP to all climate model samples in the training and testing sets for all 100 neural networks at the TOE (using the TOE calculated for each climate model and neural network individually, see Figure S9 in Supporting Information S1). LRP points to western South America in DJF and British Columbia in MAM and SON as the most relevant regions when the neural networks first detect the forced response (Figures 9a-9d). These LRP maps exhibit a more even distribution in relevance across each region and season than the end-of-the-twenty-first-century LRP maps of global temperature and precipitation (Figures 6a-6d and 7a-7d). Predicting the year at the TOE, when the signal has just barely emerged from the background climate, likely requires the neural networks to use all of the information available to them.
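For readers unfamiliar with LRP, its core operation redistributes a layer's output relevance back to its inputs in proportion to each input's contribution to the pre-activation. The minimal numpy sketch below implements the basic z-rule for a single dense layer; it is an illustration of the general technique, not the specific rule or implementation used in this study.

```python
import numpy as np

def lrp_dense_z(a, W, R_out, eps=1e-9):
    """LRP z-rule for one dense layer: input relevance
    R_i = a_i * sum_j(W_ij * R_j / z_j), where z_j = sum_i(a_i * W_ij)."""
    z = a @ W                              # pre-activations, shape (n_out,)
    s = R_out / (z + eps * np.sign(z))     # stabilized relevance ratio
    return a * (W @ s)                     # input relevance, shape (n_in,)

# Toy layer: 4 inputs, 3 output units.
rng = np.random.default_rng(3)
a = rng.random(4)
W = rng.normal(size=(4, 3))
R_out = np.array([0.5, 0.3, 0.2])

R_in = lrp_dense_z(a, W, R_out)
print(R_in.sum())  # relevance is conserved: approximately R_out.sum() = 1.0
```

Applying such a rule layer by layer, from the output back to the input maps, yields relevance heatmaps like those averaged in Figures 9a-9d.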
Up to this point, we have only considered the mean LRP maps across climate models. Since the neural networks are nonlinear by nature, they can identify multiple patterns that differ between climate models for a given decade. We apply k-means clustering to all 3200 LRP maps at the TOE (32 climate model samples, 100 neural networks) to identify two distinct indicator patterns associated with different climate models (Figures 9e-9l; see Supporting Information S1 for more details on k-means clustering). Taking the difference between the mean LRP maps for clusters one and two reveals that the Amazon in JJA is a highly relevant region in cluster one, while western Canada in DJF is a highly relevant region in cluster two (Figures 9m-9p). With the sole exception of MPI-ESM1-2-HR, all 100 LRP maps for each individual climate model fall cleanly into one cluster or the other, suggesting that there are two distinct ways in which the forced signal emerges in the CMIP6 simulations (Figure 10). Interestingly, when k-means is instructed to identify 32 unique clusters within the LRP maps, each cluster contains all 100 relevance maps for exactly one of the 32 climate models. In other words, the pathway used by the neural networks to predict the year is unique to each climate model and distinguishable from all other climate models, regardless of whether the climate model samples appear in the training or testing sets (further investigated by Labe and Barnes (2022)).
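The clustering step itself is standard k-means applied to flattened relevance maps. The sketch below uses synthetic maps (the domain size and the two artificial "emergence pathways" are assumptions for illustration) to show how maps separate into two clusters and how a cluster-difference map like Figures 9m-9p is formed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for LRP maps: two groups of maps whose relevance
# concentrates in different halves of a small spatial domain.
rng = np.random.default_rng(4)
n_maps, ny, nx = 50, 10, 20
maps = rng.random((2 * n_maps, ny, nx)) * 0.1
maps[:n_maps, :, : nx // 2] += 1.0     # group 1: western half relevant
maps[n_maps:, :, nx // 2 :] += 1.0     # group 2: eastern half relevant

X = maps.reshape(2 * n_maps, -1)        # flatten each map to a vector
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The difference of the cluster-mean maps highlights the regions that
# distinguish the two clusters (cf. Figures 9m-9p).
diff = (X[labels == 0].mean(axis=0)
        - X[labels == 1].mean(axis=0)).reshape(ny, nx)
print(labels[:3], labels[-3:])
```

In this idealized case, every map in each group lands in the same cluster, mirroring how nearly all CMIP6 relevance maps fall cleanly into one of two clusters.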
In the same way that indicator patterns can differ between models, indicator patterns are also able to evolve through time (e.g., Barnes et al., 2020; Labe & Barnes, 2021; Madakumbura et al., 2021). Comparing the LRP maps at the TOE (Figures 11a-11d) with those at the end of the twenty-first century (Figures 11e-11h) highlights the regions that become more important for predicting the year over time. The difference plots in Figures 11i-11l reveal that the neural network learns to focus on Alaska during MAM, JJA, and SON, Greenland in JJA and SON, and Quebec in MAM and SON as the forced response becomes stronger. These regions are more important for predicting the year at the end of the twenty-first century than the early twenty-first century. While further exploration is required, there are several reasons a region may become more relevant over time. For example, it may be that the region does not initially have a clear forced signal, but following some abrupt change (e.g., an ice-free Arctic) the forced signal becomes extremely pronounced. It may also be that the region has a signal that is consistently agreed upon by the majority of CMIP6 climate models, and becomes more relevant compared to other regions as climate model projections in those other regions drift apart.

Figure 9. Relevance map clusters at the time of emergence (TOE) for extreme precipitation. Average layer-wise relevance propagation results for extreme precipitation at the TOE (a-d), each cluster identified by k-means (e-l), and the difference between the clusters (m-p). In panels (a-l), darker shading indicates regions of extreme precipitation that are more relevant for the neural networks' prediction of the year at the TOE. In panels (m-p), blue shading indicates the regions that are more relevant in cluster 1, while red shading indicates the regions that are more relevant in cluster 2. Note that panels (a-d) are identical to panels (a-d) in Figure 11.
These time-varying patterns support the idea that combined indicators are effective for identifying dynamically evolving patterns of forced change.

Conclusions
Neural networks are powerful tools for identifying patterns of forced change in the climate system. When tasked with predicting the year given climate model simulations of temperature, precipitation, or extreme precipitation, artificial neural networks learn patterns of forced change that allow them to distinguish between maps from different years. In combined fields, such as multiple variables, seasons, or both, the forced response can be detected earlier than in single fields alone. By visualizing the decision-making process of the neural networks with an explainability method, we extracted reliable, multivariate patterns of forced change. These neural network-derived combined indicator patterns are complex and nonlinear and capture more than the local S/N ratio. Explainability methods take a huge step toward disentangling the relationships learned by neural networks by pointing out which inputs contributed most to the final prediction, but they stop short of explaining why.
Expanding on previous work by Barnes et al. (2020), we used k-means clustering in tandem with LRP to study the relationships learned by the neural networks. This approach revealed two distinct ways in which the extreme precipitation response emerges in CMIP6 data. While combining neural network explainability methods with other statistical techniques can enhance explanations of neural network decisions, there is still a large gap between what the neural network has learned and what we can explain post hoc. Some unanswered questions, such as "why does temperature in Region A combine with precipitation in Region B to improve the signal of the forced response?" may be better answered with a different architectural approach, such as neural network designs that are inherently interpretable and do not require post-hoc approaches like LRP (Rudin, 2019). This is a natural next step for future work. The flexibility and accessibility of this framework provide several other future research directions. Given that this predict-the-year approach can be applied to observational data, one possible extension of this work could involve exploring the observed features of forced change that are consistent with climate model simulations. There is also space for these methods to be used to determine which definitions of seasons are optimal for detecting forced change.

Figure 10. Climate models in each relevance map cluster at the time of emergence (TOE). The number of times each climate model appears in each cluster when k-means is applied to the maps of relevance at the TOE for 100 ANNs trained on extreme precipitation over the Americas. Only the relevance maps for MPI-ESM1-2-HR appear in both clusters. All other relevance maps for each climate model are found in one cluster or the other.
While we used meteorological seasons here, there may be more appropriate definitions, such as unique definitions of the wet and dry seasons, or the shoulder seasons, that vary between variables and regions. Furthermore, this framework should be expanded to other variables, regions of focus, and climate change scenarios, to identify the combined indicators that best elucidate the forced signal. For example, extreme precipitation and extreme drought may combine to capture the increased volatility in precipitation extremes that is expected with climate change (O'Gorman, 2015). Further application of this technique to compound climate extremes, such as heat wave intensity, drought duration, and flood frequency, may reveal that explainable neural networks are useful for assessing societal impacts and improving climate change preparedness.

Figure 11. Time evolution of extreme precipitation relevance. Average layer-wise relevance propagation results at the time of emergence (a-d), 2090 (e-h), and the difference between them (i-l). Darker shading in panels (a-h) highlights regions that were more relevant for the neural networks' prediction of the year. In panels (i-l), red shading indicates regions where the relevance has increased over time, while blue shading indicates regions where the relevance has decreased over time. Note that panels (a-d) are identical to panels (a-d) in Figure 9.

Data Availability Statement
All data used in this study are publicly available and referenced throughout the paper. The CMIP6 simulations used in this study can be accessed via the Earth System Grid Federation (ESGF, https://esgf-node.llnl.gov/projects/cmip6/). Monthly temperature observations are available through Berkeley Earth (http://berkeleyearth.org/data/). Global Precipitation Climatology Project monthly global precipitation fields are available through the NOAA Physical Sciences Laboratory (https://psl.noaa.gov/data/gridded/data.gpcp.html). Monthly, daily, and sub-daily precipitation reanalyses were provided by the European Centre for Medium-Range Weather Forecasts (ERA5: https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) and the National Center for Atmospheric Research (JRA55: https://climatedataguide.ucar.edu/climate-data/jra-55). Python code used in this work has been made publicly available at https://doi.org/10.5281/zenodo.6780638.