Changes in United States Summer Temperatures Revealed by Explainable Neural Networks

To better understand the regional changes in summertime temperatures across the conterminous United States (CONUS), we adopt a recently developed machine learning framework that can be used to reveal the timing of emergence of forced climate signals from the noise of internal climate variability. Specifically, we train an artificial neural network (ANN) on seasonally averaged temperatures across the CONUS and then task the ANN to output the year associated with an individual map. In order to correctly identify the year, the ANN must therefore learn time-evolving patterns of climate change amidst the noise of internal climate variability. The ANNs are first trained and tested on data from large ensembles and then evaluated using observations from a station-based data set. To understand how the ANN is making its predictions, we leverage a collection of post hoc feature attribution methods from explainable artificial intelligence (XAI). We find that anthropogenic signals in seasonal mean minimum temperature have emerged by the early 2000s for the CONUS, with the earliest emergence in the Eastern United States. While our observational timing of emergence estimates are not particularly sensitive to the spatial resolution of the training data, we find a notable improvement in ANN skill using a higher resolution climate model, especially for its early twentieth century predictions. Composites of XAI maps reveal that this improvement is linked to temperatures around higher topography. We find that increases in spatial resolution of the ANN training data may yield benefits for machine learning applications in climate science.


Introduction
The detection of a forced signal rising above the background of internal climate variability, often referred to as the "timing of emergence" (ToE), can be a potentially useful metric for societal and ecological planning to account for changes in weather and climate that exceed the known variability over a particular region (IPCC et al., 2021). Over the last few decades, numerous studies have investigated how to quantify the ToE for different climate variables (e.g., Fischer & Knutti, 2014; Giorgi & Bi, 2009; Hawkins et al., 2014, 2020; Hawkins & Sutton, 2012; King et al., 2015; Mahlstein et al., 2012; Mora et al., 2013; Satoh et al., 2022; Schlunegger et al., 2020), but the precise definition is still quite sensitive to the choice of data set, future greenhouse emission scenario, baseline reference period, spatial scale, temporal filtering, consideration of internal climate variability, and statistical testing for ToE consistency. The ToE can even be examined through a lens of compound events and combined variables (François & Vrac, 2023; Mahony & Cannon, 2018; Rader et al., 2022). While it is often implied to be associated with anthropogenic climate change, the ToE definition can also be influenced by multidecadal variability in the climate system (Lehner et al., 2017). To partially alleviate this issue, we build on recent neural network approaches for estimating the ToE of temperature change, examining ANN performance and its possible dependence on the size of the climate map training data.

Climate Model Large Ensembles
For the main results of this study, we use a collection of 30-member initial condition large ensemble simulations from a fully coupled GCM called the Seamless System for Prediction and EArth System Research (SPEAR; Delworth et al., 2020). SPEAR is the newest seasonal to multidecadal prediction and projection system from the National Oceanic and Atmospheric Administration (NOAA) Geophysical Fluid Dynamics Laboratory (GFDL). SPEAR uses the same atmospheric model code (AM4) and land model code (LM4) as in the GFDL CM4 model (Held et al., 2019; Zhao et al., 2018a, 2018b), but employs a different configuration of the MOM6 ocean code (Adcroft et al., 2019) to optimize the model for seasonal to decadal predictions and projections. SPEAR incorporates 33 vertical atmospheric levels and can be designed for different atmosphere/land resolution configurations ranging from 0.25° to 1.0°. All model versions include a common ocean grid of approximately 1.0° spacing (though refined to 0.33° around the equator). The SPEAR system has already been successfully used for several studies in evaluating the predictability of temperature variability and heat extremes across North America (e.g., Jia et al., 2022; Yang et al., 2022).
Because SPEAR_MED (atmosphere/land of 0.5°) and SPEAR_LO (atmosphere/land of 1.0°) provide 20 more individual ensemble members to train, validate, and test our neural network than the configuration with the highest atmospheric resolution, we focus on simulations from only these two configurations. Ensemble members in each are initialized from conditions in an 1850 control simulation that are branched 20 years apart. Both SPEAR_MED and SPEAR_LO are forced with CMIP6 historical forcing through 2014 (Eyring et al., 2016). From 2015 to 2100, they are then forced with future projections from the Shared Socioeconomic Pathway 5-8.5 scenario (SSP5-8.5; Kriegler et al., 2017; Riahi et al., 2017). Recent work has shown that SSP5-8.5 is likely an unrealistically extreme future emission scenario (e.g., Burgess et al., 2020; Hausfather & Peters, 2020; Peters & Hausfather, 2020). Although our study here is mostly focused on the ToE in the recent and historical past, we also compare the sensitivity of our machine learning results to training and testing on SPEAR_MED simulations conducted with more probable SSP scenarios (Pielke et al., 2022), including the low-end SSP1-1.9 and more moderate SSP2-4.5 (O'Neill et al., 2016).
Finally, we compare our SPEAR climate change simulations with a large ensemble experiment starting from the same initial conditions in 1921 as SPEAR_MED, but holding all anthropogenic forcings (i.e., greenhouse gases, anthropogenic aerosols, land use/land change) fixed at 1921 levels. This experiment, referred to as SPEAR_MED_NATURAL, is instead prescribed with only natural radiative forcings, such as those due to solar irradiance and volcanoes (historical to 2014, hypothetical thereafter; see Delworth et al. (2022)). By comparing SPEAR_MED and SPEAR_MED_NATURAL, we can extract the role of anthropogenic forcing on changes in summertime temperatures in the climate model and gain insights for understanding how well the neural network performs by training on data without a long-term anthropogenic signal.
To summarize, we consider collections of 30-member large ensembles from either SPEAR_MED or SPEAR_LO for designing our machine learning architecture. These simulations are conducted on two different horizontal atmosphere/land grids, which we will now refer to as either MED (nominal 0.5° grid) or LOW (nominal 1.0° grid) throughout the text. This highlights a key focus of this work, which is comparing the effect of higher spatial resolution on the performance of the neural network. Given the limited availability of other fully coupled large ensembles with high-resolution atmospheric models, we can only assess the MED grid using simulations conducted by SPEAR_MED or its previous generation version called the Forecast-Oriented Low Ocean Resolution (FLOR) system (Vecchi et al., 2014). FLOR is a fully coupled GCM based upon GFDL's CM2.5 (Delworth et al., 2012); its large ensemble includes 30 members with CMIP5 historical forcing from 1921 to 2004 and Representative Concentration Pathway 8.5 (RCP8.5; Riahi et al., 2011; Vuuren et al., 2011) thereafter from 2005 to 2100. FLOR has a land-atmosphere resolution of 0.5° using the AM2.5 and LM3 model components (Milly et al., 2014), but includes a coarser ocean from OM2.1 (Gnanadesikan et al., 2006) at a nominal resolution of 1.0°. Since FLOR does not offer a corresponding LOW version of the large ensemble like SPEAR, we simply bilinearly interpolate its temperature maps to the LOW grid (denoted as FLOR (LO)) to again compare the effect of inputting more (or less) spatial information into our neural network framework.
We also briefly make use of three more GCM large ensembles, which offer a similar horizontal resolution and data availability as SPEAR_LO (i.e., data from at least 1921 to 2100 and a grid size of approximately 1.0° × 1.0°). These include the 40-member Community Earth System Model Large Ensemble Project Version 1 (CESM1-LE; Hurrell et al., 2013; Kay et al., 2015) (CMIP5 class; RCP8.5), the 100-member CESM2-LE (Danabasoglu et al., 2020; Rodgers et al., 2021) (CMIP6 class; SSP3-7.0), and the 50-member large ensemble using the sixth version of the Model for Interdisciplinary Research on Climate (MIROC6-LE; Tatebe et al., 2019; Shiogama et al., 2023) (CMIP6 class; SSP5-8.5). While other climate model large ensembles are available from the multimodel large ensemble archive that contain at least 30 members (Deser et al., 2020; NCAR, 2020), their horizontal resolution is generally too coarse for our regional deep learning approach, particularly when considering the three smaller geographic areas of the United States (Suarez-Gutierrez et al., 2020). In other words, an insufficient number of grid points nearly reduces the problem to a change-point time series task. This then limits the real utility of the neural network methodology, which here is to exploit any possible nonlinear temperature patterns across the maps in order to identify the emergence of forced climate signals.
For the climate model data, we leverage monthly mean near-surface daily maximum, minimum, and average temperature data (i.e., TMAX, TMIN, TAVG) and then calculate the seasonal mean over June to August (JJA) using only grid points across the conterminous United States. A summary of the large ensemble data can be found in Tables S1 and S2 in Supporting Information S1, which include the final horizontal resolution elected to be used as input to the neural network.
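For illustration, this seasonal averaging step can be sketched as follows. This is a minimal example only, assuming monthly data arranged as a (years, 12, lat, lon) NumPy array on a January-start calendar; the actual preprocessing pipeline and variable names are not specified in the text.

```python
import numpy as np

def jja_mean(monthly, month_axis=1):
    # Average June-August (zero-based month indices 5, 6, 7) along month_axis.
    jja = np.take(monthly, [5, 6, 7], axis=month_axis)
    return jja.mean(axis=month_axis)

# Toy example: 3 years of monthly TAVG on a 2 x 2 grid
rng = np.random.default_rng(0)
monthly_tavg = rng.normal(size=(3, 12, 2, 2))
seasonal = jja_mean(monthly_tavg)  # one JJA map per year, shape (3, 2, 2)
```

The same function would apply unchanged to TMAX or TMIN fields.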

Observations
To evaluate the ToE of summertime surface temperatures in the United States, we primarily use the NOAA Monthly U.S. Climate Gridded Data set (NClimGrid; Vose et al., 2014a, 2014b), which is a station-based gridded product of temperature (and precipitation) across land areas of the conterminous United States since 1895. NClimGrid is based on the interpolation of quality-controlled station data onto 5 km latitude/longitude grids using records from the Global Historical Climatology Network (GHCN; Durre et al., 2010; Menne et al., 2012). The homogenized data set from NClimGrid also accounts for bias correction of artificial station breaks, such as for changes in weather station locations, instruments, and other temporal inconsistencies (Menne & Williams, 2009). Area-average NClimGrid temperature errors are larger over CONUS prior to 1990, but generally still within 1°C for both TMAX and TMIN (Vose et al., 2014a, 2014b). Here we focus on the period of 1921-2022, which overlaps with the output from the SPEAR climate model simulations.
For comparing the sensitivity of our observational neural network predictions to the use of NClimGrid, we also briefly evaluate our results with two atmospheric reanalysis data sets: the European Centre for Medium-Range Weather Forecasts (ECMWF) fifth generation atmospheric reanalysis (ERA5), available from 1940 to 2022, and the NOAA/Cooperative Institute for Research in Environmental Sciences/Department of Energy Twentieth Century Reanalysis (20CR) version 3 (20CRv3), available from 1836 to 2015. Using ECMWF's Integrated Forecast System release 41r2 and four-dimensional variational analysis as a data assimilation scheme, ERA5 provides global data at a horizontal resolution of 31 km in near-real time (Hersbach et al., 2020). It is constrained by numerous observations, like land-based weather stations, satellites, radiosondes, and other aircraft records. We focus on near-surface temperature (2-m height) from ERA5 through its entire available temporal period. To compare with a longer reanalysis record, we use near-surface temperature (2-m height) from 20CRv3 (Slivinski et al., 2019) during the overlapping period from 1921 to 2015. Unlike ERA5, this 20CR product only assimilates surface pressure observations (Compo et al., 2011), which is completed through four-dimensional incremental analysis updates and an 80-member ensemble Kalman filter approach (Lei & Whitaker, 2016; Slivinski et al., 2019). 20CRv3 uses the coupled atmosphere-land National Centers for Environmental Prediction Global Forecasting System version 14.0.1 with boundary conditions from prescribed sea surface temperatures and sea-ice concentration. Overall, 20CRv3 is an improvement over its predecessor (20CRv2c) for simulating synoptic dynamics and other long-term surface climate fields (Slivinski et al., 2021), though work continues to improve temperature data earlier in the twentieth century due to greater uncertainties (Gillespie et al., 2023).
We use monthly mean temperature output from all observational and reanalysis data sets to calculate the JJA seasonal mean. The neural network used here requires the input data to have the same latitude and longitude dimensions. Therefore, we bilinearly regrid NClimGrid and ERA5 onto the MED (0.5°) and LOW (1.0°) spatial maps, which are also used by the climate model large ensemble data. The coarser 20CRv3 data set is instead only interpolated onto the LOW resolution grid. Importantly, these three observationally based products encompass a wide range of different structural methodologies and uncertainties, which therefore provide ample opportunity to test the robustness of the neural network results on out-of-sample data (Table S3 in Supporting Information S1). A comparison of average JJA CONUS temperature anomalies is also presented in Figure S1 for NClimGrid, ERA5, and 20CRv3.

Neural Network Framework
We adopt a machine learning ToE method first proposed by Barnes et al. (2019), which uses a neural network to input geographic maps of climate variables and then to output the year associated with each map. While this is quite a simple prediction problem, it has been shown that the neural network must learn to leverage time-evolving patterns of forced climate signals in order to correctly identify the year of a single map (Barnes et al., 2020). This method has since been used in a wide range of climate applications (e.g., Anderson & Stock, 2022), such as for disentangling the role of aerosols and greenhouse gases in single-forcing large ensembles (Labe & Barnes, 2021), quantifying anthropogenic signals in extreme precipitation (Madakumbura et al., 2021), and identifying the ToE of combined variables such as precipitation and temperature (Rader et al., 2022). For this work, we take a similar approach, but build upon these previous efforts by focusing on a narrower application. Here we train on high-resolution climate model data and evaluate the ToE over smaller spatial regions. This ToE approach is also calculated for a region with an observed absence of a daytime warming trend (e.g., boreal summer in the Central United States) (Partridge et al., 2018). While previous machine learning efforts in climate science have usually interpolated data to coarser grids for reasons such as computational limitations, we are particularly interested in considering whether the neural network skill (or the actual ToE) changes by training and evaluating on higher resolution data.
For this study, we use an ANN, which is a statistical algorithm that can learn to approximate nonlinear functions from large quantities of data. ANNs have become increasingly popular tools for Earth science prediction problems and in numerical modeling (Boukabara et al., 2021; Chantry et al., 2020; Irrgang et al., 2021). ANNs are fully connected networks, which in their simplest form are composed of an input layer, a set number of hidden layers and nodes, and an output layer (i.e., the final prediction). Every node in this feed-forward architecture receives information from the previous layer and can be individually computed by weighting the sum of the inputs and an added bias term. The weights and biases are updated iteratively until the training process is finished, such as when the loss function (i.e., a measure of machine learning model error) is minimized. Given enough available data and limited overfitting, the ANN can then be used to make skillful predictions on data it did not see during training (i.e., testing data). More thorough introductions to neural networks can be found in, for example, LeCun et al. (2015), Goodfellow et al. (2016), and Neapolitan and Jiang (2018). Domain-specific tutorials for machine learning applications, such as in atmospheric science, are also provided by Chase, Harrison, Burke et al. (2022) and Chase, Harrison, Lackmann, and McGovern (2022).
Figure 1 shows the ANN architecture used for this study. The ANN receives vectorized latitude by longitude maps of JJA temperatures (TMAX, TMIN, or TAVG) with either 10,080 input values (70 × 144) per sample for the MED resolution maps or 2,520 input values (35 × 72) for the LOW resolution maps. In addition, we also evaluate ANNs using regional map inputs for the Western, Central, and Eastern CONUS as depicted in Figure 2. We focus on only land areas and therefore mask all other areas by assigning values of zero, which the ANN then learns to ignore. This vector is fed into the ANN hidden layers, and the output is the probability of a particular decade midpoint (a classification problem). These output values are then translated to a particular year (a regression problem) using fuzzy classification (Zadeh, 1965), which is described in detail in Barnes et al. (2020). Briefly, by denoting the central year of a particular decade (e.g., 1995 for 1990-1999), a particular year (e.g., 1994) can be mapped to more than one decade class. Using triangular membership functions (Zadeh, 1965) with a width of one decade, the weighted sum of the decadal class probabilities can finally be mapped to a specific year. For example, the year 1994 has a probability of 0.9 for falling within the decade class midpoint of 1995 and a probability of 0.1 for the decade class midpoint of 1985. We refer to the regression problem of the predicted year throughout the rest of the study for evaluating the ANN skill and ToE calculations.
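The fuzzy-classification decoding described above can be illustrated with a short sketch. The decade midpoints (1925 through 2095, spanning the 1921-2100 simulation period) are an assumption for this example; the paper's own implementation follows Barnes et al. (2020).

```python
import numpy as np

def year_to_decade_probs(year, midpoints, width=10):
    # Triangular membership (Zadeh, 1965) of a year in each decade-midpoint
    # class: linearly decaying with distance, zero beyond one decade.
    m = np.maximum(0.0, 1.0 - np.abs(np.asarray(midpoints, dtype=float) - year) / width)
    return m / m.sum()

def decode_year(class_probs, midpoints):
    # Weighted sum of decadal class probabilities -> a single predicted year
    return float(np.dot(class_probs, midpoints))

midpoints = np.arange(1925, 2105, 10)         # 1925, 1935, ..., 2095
p = year_to_decade_probs(1994, midpoints)     # 0.9 on 1995, 0.1 on 1985
decoded = decode_year(p, midpoints)           # decoded year is approximately 1994
```

This reproduces the worked example in the text: 1994 maps to memberships of 0.9 and 0.1 for the 1995 and 1985 midpoints, and the weighted sum recovers the year.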
We select our ANN architecture for different input maps by considering the effect of spatial region and grid resolution. The final ANNs are selected by identifying the lowest median Mean Absolute Error (MAE) on validation data after considering 20 ANNs (randomized combinations of training, testing, and validation data and initialization seeds) over architectures that range in complexity by the number of hidden layers and nodes (see Figures S5 and S11). In other words, this is related to the number of parameters that the model can use to learn the relevant climate patterns to more accurately predict the year of a map (i.e., variations of deeper or shallower neural networks). For CONUS inputs on the MED grid, we use an ANN with three hidden layers of 10 nodes each. For CONUS inputs on the LOW grid, we use an ANN with three hidden layers of 20 nodes each. Finally, for the regional CONUS maps on both the MED and LOW grids, we use an ANN with two hidden layers of 100 nodes each. Despite selecting different ANN architectures, we find that our results are generally robust across minor changes in hyperparameter options. The rectified linear unit (ReLU; Agarap, 2018) activation function is used for the nonlinear transformation in the hidden layers, and a softmax operator is included in the output layer to ensure that the decadal class probabilities of the output vector sum to one. All ANNs here use the binary cross-entropy loss function, stochastic gradient descent optimizer (Ruder, 2016) with Nesterov momentum set to 0.9 (Nesterov, 1983), a learning rate of 0.01, and a batch size set to 32.

Figure 1. Schematic of the artificial neural network (ANN) used to take an input map of average June to August (JJA) temperatures over the contiguous United States (CONUS) and then output the likelihood that the map is from a particular decade. Fuzzy classification (Zadeh, 1965) is used to decode this decadal likelihood to associate each map with a single year (i.e., our final predicted output). The ANN consists of different combinations of hidden layers and nodes depending on the horizontal resolution of the training data (see Section 3 for the architecture and hyperparameter choices). Explainable artificial intelligence attribution methods are then used to reveal the regions that acted to increase or decrease the likelihood of the ANN's predicted year.
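The feed-forward computation can be sketched in a few lines of NumPy. The input size (10,080) and the three hidden layers of 10 nodes match the CONUS/MED configuration described in the text; the 18 decade classes (1920s-2090s) and the small random weight initialization are assumptions for illustration, and the authors' actual trained model is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# Layer sizes: 10,080 inputs (70 x 144 map), three hidden layers of 10 nodes,
# and an assumed 18 decade-midpoint output classes
sizes = [10080, 10, 10, 10, 18]
weights = [rng.normal(0.0, 0.01, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    # ReLU in the hidden layers, softmax over decade classes in the output
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

decade_probs = forward(rng.normal(size=(1, 10080)))  # shape (1, 18), sums to 1
```

In practice the weights and biases would be fit with the stochastic gradient descent configuration given above rather than left at their random initialization.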
Unless otherwise stated, we train on 24 ensemble members, validate on four ensemble members, and test on two ensemble members. To limit overfitting on the training data, we apply a few different methods. First, we apply early stopping, which ends the ANN training process if there is no improvement in the validation loss for 25 consecutive epochs. The epoch with the best performance is ultimately selected. Next, we apply ridge regularization (set to 0.001) to the weights of the first hidden layer (L2; Friedman, 2012), which acts to limit how sensitive the weights are to outliers in the input data. This also helps to smooth out any spatial autocorrelation that exists in the temperature maps and improve overall interpretability (Barnes et al., 2020; Sippel et al., 2019, 2020). The sensitivity of our ANN results to the choice of the L2 parameter is shown in Figures S5 and S11.
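The early-stopping rule (a patience of 25 epochs, keeping the best epoch) can be sketched as a simple selection loop. This is a schematic of the logic only, not the authors' training code, and the toy loss values are hypothetical.

```python
def early_stopping_select(val_losses, patience=25):
    # Track the epoch with the lowest validation loss and stop once
    # `patience` consecutive epochs pass without improvement.
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch

# With patience=2, training stops after two epochs without improvement,
# and the best epoch (index 2) is kept.
best = early_stopping_select([3.0, 2.0, 1.0, 1.5, 1.4, 1.3], patience=2)
```

Deep learning frameworks provide this behavior directly (e.g., an early-stopping callback that restores the best weights).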
Before inputting the data into the ANN, we standardize all climate model temperature maps by subtracting the training data mean and dividing by the training data standard deviation over the 1981 to 2010 climatological baseline. This is computed separately at every grid point. Note that similar skill is found for training and testing data using other reference periods, such as 1951-1980. An example of the training mean and standard deviation for 24 ensemble members in SPEAR_MED is shown in Figure 2. Due to mean state biases that may exist between the climate model large ensembles and observations (Suarez-Gutierrez et al., 2020), we separately standardize the observations by their own mean and standard deviation over 1981 to 2010 before making ANN inferences. However, as we discuss later, it is still possible that differences in the amount of mean warming between the climate model simulations and observations could influence the machine learning skill and related ToE results.
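A minimal sketch of this grid-point standardization is given below; the function name and interface are hypothetical, but the logic (baseline statistics over 1981-2010, computed separately for model training data and for observations) follows the description above.

```python
import numpy as np

def standardize(maps, years, base=(1981, 2010), stats=None):
    # Standardize each grid point by a climatological mean and standard
    # deviation. Model inputs reuse the training-data statistics (`stats`);
    # observations pass stats=None and use their own 1981-2010 baseline.
    maps = np.asarray(maps, dtype=float)
    if stats is None:
        sel = (np.asarray(years) >= base[0]) & (np.asarray(years) <= base[1])
        stats = (maps[sel].mean(axis=0), maps[sel].std(axis=0))
    mean, std = stats
    return (maps - mean) / std, stats

# Toy example: 102 annual maps (1921-2022) on a 4 x 5 grid
years = np.arange(1921, 2023)
rng = np.random.default_rng(1)
maps = rng.normal(loc=10.0, scale=2.0, size=(years.size, 4, 5))
z, train_stats = standardize(maps, years)
```

Reusing `train_stats` for the test ensemble members, while letting observations supply `stats=None`, mirrors the separate treatment described in the text.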
In addition to evaluating our ANN and ToE predictions, we consider several post hoc attribution methods from XAI. Explainability methods have increasingly been shown to aid in building trust and understanding for the decision-making process of neural networks, including for climate science applications (e.g., Diffenbaugh & Barnes, 2023; Labe & Barnes, 2022; Mamalakis et al., 2023; Rampal et al., 2022; Shin et al., 2022; Sonnewald & Lguensat, 2021). Output from XAI attribution methods describes the contribution of every input sample's latitude and longitude grid point (described here as "relevance") to the overall prediction of the ANN. In other words, the XAI algorithms return a relevance heatmap (unitless) for every input year. To evaluate the sensitivity of our explainability results across different methods (Bommer et al., 2023; Mamalakis et al., 2022), we consider three different attribution techniques: the layer-wise relevance propagation z-rule (LRPz; Bach et al., 2015), the LRP epsilon-rule (LRPε; Bach et al., 2015), and Integrated Gradients (Sundararajan et al., 2017). However, given the similarity of the results across the XAI methods, we only show relevance figures from LRPz and Integrated Gradients for brevity in the main results. A more detailed overview of an application of LRP to a geoscience problem can be found in Toms et al. (2020). Lastly, we caution that although these XAI techniques are very useful for outlining the important climate patterns learned by the ANN, they do not imply causation, such as for the specific physical drivers.
Our XAI heatmaps are based on composites of the testing ensemble members (or observations), where positive areas of relevance can be interpreted as regions that pushed the ANN toward its predicted year.Negative areas of relevance are subsequently interpreted as the converse, that is, locations that tried to push the ANN away from making its yearly prediction.We only consider relevance maps for testing data predictions that are accurate to within 5 years of the actual year, but note that the XAI results are not sensitive to this threshold (not shown).
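The compositing step (averaging relevance maps over samples predicted within 5 years of the truth) can be sketched as below. The relevance maps themselves would come from an XAI library; the toy values here are illustrative only.

```python
import numpy as np

def composite_relevance(relevance, y_true, y_pred, tol=5):
    # Average relevance heatmaps over samples whose predicted year falls
    # within `tol` years of the true year.
    good = np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= tol
    return np.asarray(relevance, dtype=float)[good].mean(axis=0)

# Toy example: 4 relevance maps on a 2 x 2 grid; only samples 0 and 2 are
# predicted within the 5-year accuracy threshold.
rel = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
y_true = np.array([1990, 1991, 1992, 1993])
y_pred = np.array([1991, 2005, 1990, 1970])
comp = composite_relevance(rel, y_true, y_pred)  # mean of rel[0] and rel[2]
```

Positive values in `comp` would correspond to regions that pushed the ANN toward its predicted year, and negative values to regions that pushed it away.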

Calculated Timing of Emergence for Observations
An annotated graphic of our ToE approach is shown in Figure 3. The ANN in this example is first trained and tested on data from SPEAR_MED using inputs of CONUS maps of mean JJA TAVG. The testing ensemble predictions are shown using blue scatter points with the actual year of a TAVG map on the x-axis, and the ANN predicted year is on the y-axis. A perfect prediction thereby follows along the gray 1:1 line. Predictions for maps of mean JJA TAVG from NClimGrid data are then shown with red markers, which are the predictions related to observed CONUS maps that are input into the ANN after the training and testing on SPEAR data is complete. This inference step using the observational data forms the basis of our ToE calculation and is fundamentally similar to Mora et al. (2013). These ToE estimates are then calculated in our study as the year that a map of temperature first departs the bounds of historical climate variability and continues to do so for all future years (e.g., the red shading in Figure 3). This general approach was also demonstrated using ANNs in Barnes et al. (2019) and Rader et al. (2022). Here we define the ToE as the first predicted year that is greater than the maximum prediction during our historical baseline of 1921-1950 (gray shading in Figure 3). In addition to 1921-1950 being the earliest 30-year period available from SPEAR_MED, it more importantly overlaps with the observed anomalous warmth of the 1930s in the United States (Program, 2018; Schubert et al., 2004) (Figure 4). Thus, we can directly compare whether an observed JJA temperature forced signal has emerged outside of this historical record that includes the extreme heat of the Dust Bowl era.
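This ToE definition (the first predicted year exceeding the maximum baseline prediction and remaining above it thereafter) can be sketched directly. The function name is hypothetical, and the idealized prediction series below stands in for actual ANN output.

```python
import numpy as np

def toe_from_predictions(years, preds, base=(1921, 1950)):
    # Threshold: the highest ANN-predicted year during the 1921-1950 baseline
    years = np.asarray(years)
    preds = np.asarray(preds, dtype=float)
    thresh = preds[(years >= base[0]) & (years <= base[1])].max()
    above = preds > thresh
    # ToE: first year whose prediction exceeds the threshold and continues
    # to do so for every subsequent year (no later dip back below)
    for i in range(len(years)):
        if above[i:].all():
            return int(years[i])
    return None  # no emergence within the record

# Idealized example: predictions track the true year with a one-year low bias,
# so the first prediction exceeding the baseline maximum occurs in 1951
years = np.arange(1921, 2023)
toe = toe_from_predictions(years, years - 1.0)
```

Real ANN predictions are noisy rather than monotonic, which is why the "continues to do so" condition matters for the observed ToE estimates.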
We also compare our results with a more traditional baseline ToE estimate using SPEAR_MED data. For a given ensemble member, the ToE is the first year that the 10-year running JJA mean consistently stays above the 1921 to 1950 climatological period by greater than two standard deviations (Lehner et al., 2017). This variability is again based on the 10-year running mean temperature in 1921-1950. The actual ToE composites from this method are then calculated from the mean across all ensemble members and at every grid point.
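For a single grid point or area average, this traditional estimate can be sketched as follows. Labeling each 10-year window by its final year and computing baseline statistics from windows lying fully inside 1921-1950 are implementation assumptions; the paper follows Lehner et al. (2017).

```python
import numpy as np

def running_mean(x, n=10):
    return np.convolve(x, np.ones(n) / n, mode="valid")

def toe_traditional(years, temps, base=(1921, 1950), n=10, k=2.0):
    # First year the n-year running mean stays above the baseline
    # running-mean climatology by more than k standard deviations.
    years = np.asarray(years)
    rm = running_mean(np.asarray(temps, dtype=float), n)
    rm_years = years[n - 1:]  # label each window by its final year (assumption)
    sel = (rm_years >= base[0] + n - 1) & (rm_years <= base[1])
    thresh = rm[sel].mean() + k * rm[sel].std()
    above = rm > thresh
    for i in range(len(rm)):
        if above[i:].all():
            return int(rm_years[i])
    return None

# Idealized example: flat baseline, then steady 0.1 degC/yr warming after 1950
years = np.arange(1921, 2101)
temps = np.where(years <= 1950, 0.0, 0.1 * (years - 1950))
toe = toe_traditional(years, temps)
```

Applying this per grid point and averaging the resulting years across ensemble members yields the ToE composites described above.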

Changes in United States Summertime Temperatures
Before estimating the ToE by the neural network framework, we start by assessing changes in temperature in observations and as simulated by SPEAR_MED. Figure 4 shows the time series of JJA temperature anomalies averaged over the CONUS for TMAX, TMIN, and TAVG from 1921 to 2022. We find that observations from NClimGrid lie outside the ensemble spread of SPEAR_MED in all three temperature metrics during the Dust Bowl of the mid-1930s. This is especially prominent for JJA TMAX, which reaches values of more than 1°C greater than the warmest ensemble member from SPEAR_MED. The peak TAVG during this early twentieth century period was reached in 1936 (Cowan et al., 2017, 2020), but it is now statistically tied with 2021 as the hottest summer on record (within 0.01°C) over the CONUS (Thompson et al., 2022) (Figure 4c). The large climate response following the eruption of Mount Pinatubo (Parker et al., 1996), however, is well captured by SPEAR_MED. The year 1992 subsequently remains the coldest mean summer TMAX in the NClimGrid observational record (since at least 1921) (Figure 4a). In more recent years, temperatures from NClimGrid have remained consistently below the ensemble mean and therefore exhibit less net warming than the forced response in SPEAR_MED, especially for TMAX. We also compare NClimGrid with ERA5 and 20CRv3 reanalysis products for TAVG anomalies in Figure S1. While NClimGrid and ERA5 agree well in capturing the interannual variability and long-term trends, we find a larger discrepancy in TAVG prior to 1975, where 20CRv3 shows larger warm anomalies. These discontinuities have been pointed out in previous studies using older model generations of 20CR (Ferguson & Villarini, 2014), which were found to be largest in the mid-twentieth century for the Central United States (Ferguson & Villarini, 2012).
Figure S2 shows the time series of TAVG anomalies broken down by the three regions of interest, including the Western, Central, and Eastern United States. This more clearly distinguishes the anomalous heatwaves of the mid-1930s across the Central United States, which again fall outside the ensemble spread of SPEAR_MED (Figure S2b in Supporting Information S1). All three regional domains experience substantial interannual temperature variability in the NClimGrid record and reveal less long-term warming compared to the SPEAR_MED ensemble mean over the 1990 to 2022 period.
Figure 5 provides spatial maps of the TAVG trends for 1921 to 1989 and 1990 to 2022. Statistically significant cooling trends in NClimGrid are found over the southern United States (Figure 5a), which are close to the warming hole region for this earlier period (Mascioli et al., 2017). The observational trends are also compared to SPEAR_MED and its parallel natural forcing-only simulation (SPEAR_MED_NATURAL) in Figures 5b and 5c, which does not simulate any long-term TAVG trends in the ensemble means. However, greater warming is found in observations for the more recent past (1990-2022; Figure 5e), which is largest over the Southwestern United States. The warming hole spatial pattern is again found over the Central United States (Figure 5e). In comparison to observations, SPEAR_MED simulates greater ensemble mean warming over the entire CONUS and shows the largest TAVG trends over the northern Rocky Mountains (Figure 5f). Without anthropogenic forcings, such as greenhouse gases or aerosols, Figure 5g reveals little to no warming simulated over CONUS as found in the ensemble mean of SPEAR_MED_NATURAL. Statistical significance of the trends in Figure 5 is assessed (Bevan & Kendall, 1971; Mann, 1945) at the 95% confidence level. Lastly, comparing over the entire 1921 to 2022 historical record, we find that the ensemble mean of SPEAR_MED simulates slightly greater warming trends over CONUS (Figure S3b) and does not show evidence of the warming hole as in Figure S3a, although it remains unclear whether this is simply due to internal variability in the observational record, which would not be captured in a composite of the ensemble mean trends of a large ensemble (Eischeid et al., 2023). Due to this issue when comparing a single realization of internal variability (i.e., real-world observations) with the ensemble mean, we subsequently show the trends in the SPEAR_MED ensemble member with the highest pattern correlation against the observed trends in each epoch (Figures 5d and 5h). Although this particular ensemble member better captures the slower rate of recent warming across the Midwest and Southwest (Figure 5h), the positive trend in TAVG is still overestimated across the western half of the United States.
Focusing on JJA mean maximum and minimum temperatures, Figure 6 shows larger recent trends in TMAX than TMIN across the Northern and Western United States in the observed record. Consistent with previous findings, the warming hole is also more prominent in TMAX. This is reflected by an area of insignificant cooling across the Southeastern United States (Figure 6a). Although SPEAR_MED again simulates greater positive trends in TMAX (Figure 6b) and TMIN (Figure 6d) for the CONUS (exceeding 1°C per decade at its local maximum), there are similarities in the spatial pattern of mean warming compared to NClimGrid. This includes a relative maximum in warming over the Western United States and a relative minimum over the Southeastern United States (Figures 6b and 6d).
Lastly, we show in Figure S4 the JJA mean CONUS TAVG for NClimGrid compared to a collection of 30-member large ensembles of SPEAR_MED, but using different radiative forcing scenarios from 2015 to 2100. This includes future projections from SSP1-1.9, SSP2-4.5, and SSP5-8.5, which are compared to the natural-only forcing experiment of SPEAR_MED_NATURAL from 1921 to 2100. The forced response in the SPEAR_MED historical simulations only begins to clearly rise outside the variability in the natural forcing simulation between 1990 and 2000. This occurs a decade later when comparing NClimGrid to SPEAR_MED_NATURAL. While there is a large range in projected ensemble mean JJA TAVG change across the climate change scenarios to 2100, the uncertainty due to internal variability alone is almost 2°C across the ensemble spreads. Notably, we also find that the ensemble mean TAVG begins to cool by 2040 for the more aggressive climate mitigation scenario of SSP1-1.9, and this cooling continues through the end of the twenty-first century. We also point out that there are only negligible differences across future forcing scenarios in the ensemble means of JJA TAVG until between 2030 and 2040, but after 2080, there is very little overlap in their ensemble spreads due to the greater effects of projection scenario uncertainty. Additional comparisons between observations and SPEAR_MED for summertime temperatures across the CONUS can be found in McHugh et al. (2023) and Eischeid et al. (2023).

Predictions by Neural Networks
We now turn to the machine learning results in Figure 7, which shows the skill of the ANN for predicting the year of CONUS maps of TAVG, TMAX, or TMIN. Note that Figure 7a is from the same ANN as the one displayed in the annotated schematic in Figure 3. Again, we focus our results on only the testing ensemble members from SPEAR_MED, which are data that the ANN has never seen before. The testing data predictions (blue scatter points) closely follow the 1:1 line (i.e., perfect prediction) in all three ANNs, which suggests that the ANN is able to distinguish individual JJA temperature maps despite the background noise of internal climate variability. The robustness of these results to different ANN architectures and ridge regularization parameters is shown in Figure S5 for TAVG; each configuration is assessed for 20 ANN iterations that used different combinations of training, validation, and testing ensemble members and random initialization seeds. The median MAE score from this distribution of ANNs is displayed in Figure S5g for the architecture used to produce Figure 7 (see Section 3.1). This helps to ensure that our high skill is not simply due to the chance that our ANN performed well on only one subset of testing data or overfit on the training ensemble members.
The results for the observational predictions are also shown in Figure 7. To restate from earlier, these predictions are obtained by inputting maps of JJA temperature into the ANN after it has already been trained and tested on the climate model large ensemble data. However, unlike the predictions for SPEAR_MED, we do not find that the ANN can correctly predict the year during most of the twentieth century for TAVG, TMAX, or TMIN. As described in Rader et al. (2022), when the ANN is not confident in predicting the year of a given temperature map, it tends to predict around the same year in the middle of the entire time series (i.e., to reduce its potential error penalty in the loss function). However, especially for TAVG and TMIN, we find that the ANN observational predictions begin to lie on the 1:1 line after around 1995. One measure of whether the ANN has identified patterns of forced change is the order of the predicted years (Labe & Barnes, 2021). By this measure, the ANN appears to be identifying common patterns of forced climate change in NClimGrid in more recent years, patterns that were learned from the SPEAR_MED temperature maps. As discussed later, we relate this point to the ToE of observed temperature change.
Since the SSP5-8.5 radiative forcing may be an unrealistic future climate scenario (Peters & Hausfather, 2020), we examine our results using ANNs trained on TAVG maps from the same historical forcing in SPEAR_MED, but then following either the SSP1-1.9 pathway (Figure S6a) or the SSP2-4.5 pathway (Figure S6b). Overall, we find very similar skill for the testing ensemble member predictions across the SSP scenarios relative to SSP5-8.5 (Figure S6c), which is used throughout the remainder of the study. The predictions for inputs of TAVG from NClimGrid are also strikingly similar. We do point out that there is somewhat higher testing data error toward the end of the twenty-first century, especially for SSP1-1.9 (Figure S6a), which suggests that the forced patterns of change may become less prominent after climate mitigation efforts (Figure S4). This implies that the ANNs are learning to extract time-evolving climate signals, including from within a single ensemble member's realization of internal climate variability. Both of these detection outcomes are not as easily addressed by traditional signal-to-noise time-mean statistics.
Another possibility is that our ANN inferences made on the observational maps are sensitive to the choice of data product. We evaluate this prospect in Figure S7 by inputting maps from either ERA5 reanalysis (Figure S7a) or 20CRv3 reanalysis (Figure S7b). The results from ERA5 are very similar to NClimGrid and once again indicate that only by about 2000 is the ANN able to identify the order of the years of the CONUS maps with a high degree of accuracy. On the other hand, we find greater divergence in predicting the year for maps of 20CRv3. This could be due to the lower resolution of the training and testing data (i.e., LOW grid; Section 2) and/or a result of the discontinuity in the TAVG anomalies prior to 1980 (Figure S1). As a result of the greater uncertainties in the early 20CR data and the lack of available data for the Dust Bowl era in ERA5, we focus on NClimGrid for the remainder of this study. We will also return to this point regarding the grid resolution size and its possible effects on the ANN skill.
To identify the spatial regions that are important for the ANN testing skill for SPEAR_MED, we evaluate composites of the XAI relevance maps in Figure S8 for the LRPz, LRPε, and Integrated Gradients methods. These composites are assessed over the entire time series from 1921 to 2100. Again, positive areas can be interpreted as regions that were more relevant for the overall ANN yearly predictions. Although there are some small differences between the LRP methods and Integrated Gradients, all three composites reveal that the northern Rocky Mountains, such as in western Montana, are an important indicator region. Other relevant temperature regions include areas in the Eastern United States, especially southern Florida and the leeward side of the Appalachian Mountains. There is also a notable gradient along this topographic boundary, with negative areas of relevance (i.e., locations that pushed the ANN to predict another year/decade) stretching from the Ohio Valley to western New York State. Note that we also compare the XAI results for a shallower ANN in Figure S9 and for an ANN with no ridge regularization in Figure S10, which continue to highlight the greater relevance in the western United States. Nevertheless, the shallower ANN has slightly worse skill in predicting the early twentieth century period for SPEAR_MED (Figure S15), and thus we do see some differences in the relevance regions, likely resulting from the different pathways of information the neural network leverages to make predictions compared to the more complex ANN structure used in the rest of the analysis.
Although the ANNs are clearly able to learn a climate signal that distinguishes one climate model temperature map year from another, this does not necessarily imply it is related to anthropogenic forcing. We therefore explore this possibility in Figure 8b, which shows the predictions for an ANN trained on SPEAR_MED compared to the simulation with only natural forcing (Figure 8c). This reveals that the ANN is no longer able to make an accurate prediction of the year when trained on maps from SPEAR_MED_NATURAL. Similarly, there is a much larger temporal spread in predictions after inputting NClimGrid data into this trained ANN (Figure 8c) compared to the SPEAR_MED network. That is, the ANN is likely using the response to external forcings, such as those prescribed in SPEAR_MED, to more skillfully predict the year of summertime temperature maps even when temperature trends are weaker prior to 1990 (Figure 5). Having said that, there is slightly smaller spread in the earlier yearly predictions of SPEAR_MED_NATURAL when the predictions are closer to the 1:1 line, although they are mainly still predicted later than the actual year. This could be due to the ANN detecting a minor influence of solar or volcanic forcings on regional time-evolving climate patterns in the SPEAR model, but this remains an open question that is left for future work. We did briefly explore training on a simulation of SPEAR with anthropogenic aerosols held fixed to 1921 levels (not shown), but found similar yearly map predictions as those from using SPEAR_MED, which implies a limited role for anthropogenic aerosols in our ANN ToE results.
As suggested by the results when evaluating 20CRv3, a last possibility is that training on the high spatial resolution of SPEAR_MED plays an important role in the skill of the testing ensemble members. Put another way, the ANN could be more likely to weight fine spatial information, such as temperatures around topographical gradients, for identifying the relevant climate indicators. We can assess this effect by training on data with the LOW grid from the SPEAR_LO configuration, which is demonstrated in Figure 8a. In addition to higher Root Mean Squared Error (RMSE) for SPEAR_LO predictions before and after 1990, there is also a greater range in prediction years found above and below the 1:1 line in the late twentieth century after inputting NClimGrid TAVG maps. Similar to earlier, the robustness of the ANN skill for training and testing on SPEAR_LO across different architectures is shown in Figure S11. The overall effect of grid size is explored more in Section 4.2.2.

Regional Variations in Timing of Emergence
So far, we have demonstrated that an ANN can distinguish the year of a given map of summertime temperatures across the contiguous United States after training on a high-resolution climate model large ensemble (SPEAR_MED). Consistent with recent work (e.g., Barnes et al., 2020; Labe & Barnes, 2021; Rader et al., 2022), the ANNs here are learning time-evolving temperature patterns associated with external forcing to differentiate each individual year and in the correct sequential order. Moreover, the ANNs can make skillful predictions on the order of temperature map years from out-of-sample observations, but only in the last decade or two.
To characterize the period when the ANN predictions for observations begin to fall along the 1:1 prediction line, we compute the observed ToE following the methods in Section 3.2 and outlined in Figure 3. In short, we find the maximum predicted year during the 1921-1950 reference period and then identify the ToE as the point where the forward-looking ANN predictions no longer fall below this historical maximum. This is first calculated using NClimGrid maps of CONUS that are seasonally averaged for JJA. To ensure the robustness of our observed ToE estimates, we train 100 ANNs on SPEAR_MED using the same architecture as previously outlined. The uncertainty spread in these ToE predictions, as displayed in Figure 9a for TAVG, TMAX, and TMIN, can be attributed to differences in the ensemble members used for training and validation and to the choice of 100 different random initialization seeds. Figure 9b displays the Spearman's rank correlation (R) calculated between the actual and predicted years across the NClimGrid inputs as a measure of the ANN's skill in correctly identifying the order of the years (Labe & Barnes, 2021). Note that, by construction, earlier ToEs will also correspond to higher correlation coefficients.
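The ToE rule described above (maximum prediction in the reference period, then the first year after which all later predictions stay above it) is simple enough to sketch directly, alongside a rank-based skill score. Both functions below are hypothetical reimplementations from the description in the text, demonstrated on toy predictions that mimic the observational behavior (flat mid-record guesses early on, then locking onto the 1:1 line); the tie handling in the rank helper is simplified relative to a full Spearman calculation.

```python
import numpy as np

def toe_from_predictions(years, preds, ref_end=1950):
    """First year after which all later ANN predictions exceed the
    maximum prediction from the reference period (here ending ref_end)."""
    years = np.asarray(years)
    preds = np.asarray(preds, dtype=float)
    ref_max = preds[years <= ref_end].max()
    above = preds > ref_max
    for i in range(len(years)):
        if above[i:].all():          # emergence must persist to the end
            return int(years[i])
    return None                      # no emergence within the record

def spearman_r(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks
    (ties broken by order of occurrence; fine for this illustration)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Toy observational predictions: a constant mid-record guess before 1995,
# then predictions that fall on the 1:1 line afterward.
years = np.arange(1921, 2023)
preds = np.where(years < 1995, 1970.0, years.astype(float))
print(toe_from_predictions(years, preds))
print(round(spearman_r(years, preds), 2))
```

With these toy inputs the emergence year is 1995, the point after which every prediction exceeds the 1921-1950 maximum of 1970; this also illustrates the noted caveat that an earlier ToE mechanically raises the rank correlation.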
The earliest median ToE is found for maps of TMIN, which is calculated to be 2003 (R = 0.78). On the other hand, the latest median ToE occurs in 2018 for TMAX (R = 0.31). An important caveat, however, is that these ToE estimates could be biased early. This is mainly an issue for late ToE predictions, like for TMAX, where by construction there are few subsequent years to compare against the historical 1921-1950 maximum. We also cannot rule out temporary reductions in temperature for a single future JJA year as a result of internal climate variability (Maher et al., 2020) or from the influence of an event like an explosive volcanic eruption (Sear et al., 1987).
We now investigate the ToE for the three selected regions across the United States by separately training on mean JJA maps of TAVG, TMAX, and TMIN from SPEAR_MED, but only over each smaller domain (outlined in Figure 2). Figure 10 shows the yearly predictions for these testing ensemble members and for regional inputs from NClimGrid. If a forced signal has emerged in the observational record according to our definition, then the ToE is annotated for each region. Note that if the estimated ToE is within 5 years of present day (2022), then cautionary asterisks are included next to the ToE year, given the greater uncertainty that predictions over the next several years will remain above the base period maximum. Across all three regions of CONUS, the earliest ToE occurs for TMIN, especially for the Eastern United States at an estimate of 1998 (R = 0.84) (Figure 10f). We also find that the results for TMIN closely follow the 1:1 line, especially after 1950 for the Central and Eastern United States (Figures 10e and 10f). This suggests that these regional climate signals learned by the ANN after training on SPEAR_MED are generalizable to NClimGrid. Most regions have not observed the emergence of a signal in JJA TMAX (Figures 10a-10c), and the earliest possible estimate here is for the Western United States (ToE = 2014, R = 0.28) (Figure 10a). These results are consistent with recent station-based studies finding greater warming rates in TMIN than TMAX (Abatzoglou & Barbero, 2014; Meehl et al., 2009, 2016; Program, 2018).
In addition to the observational results, we find that the testing predictions from SPEAR_MED closely follow the 1:1 line, especially after 1990. The lowest testing RMSE is found for the Western United States, and generally the worst ANN ensemble member skill is found for TMAX and TAVG in the Eastern United States (Figures 10c and 10i). At the same time, the overall skill in the early to mid-twentieth century continues to be surprising, especially given the lack of ensemble mean warming for all three regions (Figure S2). Therefore, in order to provide a baseline with a more traditional linear method of calculating the ToE at each grid point, we compare the ANN predictions with the estimated ToE shown in Figure S12 following Lehner et al. (2017) (see Section 3.2). [Figure 10 caption: ANN yearly predictions for the Western, Central, and Eastern United States, with the actual year on the x-axis and the predicted year on the y-axis; red markers denote predictions after inputting maps from NClimGrid, and a solid gray line denotes a perfect (1:1) prediction; the Root Mean Squared Error (RMSE) for the SPEAR_MED testing ensemble members is given for years before and after 1990; if the observed ToE occurs for the NClimGrid predictions, it is denoted with a bright red marker and vertical line, with two asterisks added when five or fewer years of predictions exist after the calculated ToE; panels (a-c) show TMAX, (d-f) TMIN, and (g-i) TAVG.] For most regions of CONUS, the more conventional ToE in SPEAR_MED for TAVG, TMAX, and TMIN occurs in the 1990s or 2000s, though there is some evidence of an earlier ToE across the higher elevations of the Western United States (Figure S12).
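For contrast with the ANN-based approach, a conventional grid-point ToE in the spirit of the Lehner et al. (2017) baseline can be sketched as a signal-to-noise exceedance test. The specific choices below (10-year running mean as the low-frequency signal, reference-period standard deviation as the noise, a 2-sigma threshold, and a synthetic forced trend) are illustrative assumptions for a single grid-point series, not the study's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def conventional_toe(years, series, ref_slice, n_sigma=2.0):
    """Signal-to-noise ToE at one grid point: ToE is the first year after
    which the low-frequency signal stays outside +/- n_sigma * noise,
    where noise is the reference-period standard deviation."""
    series = np.asarray(series, dtype=float)
    noise = series[ref_slice].std()
    base = series[ref_slice].mean()
    k = 10                                    # running-mean window (years)
    signal = np.convolve(series - base, np.ones(k) / k, mode="same")
    outside = np.abs(signal) > n_sigma * noise
    for i in range(len(years)):
        if outside[i:].all():                 # exceedance must persist
            return int(years[i])
    return None

# Toy grid-point series: no trend before 1980, then steady forced warming,
# on top of year-to-year internal variability.
years = np.arange(1921, 2101)
forced = 0.02 * np.clip(years - 1980, 0, None)
series = forced + 0.3 * rng.normal(size=years.size)
print(conventional_toe(years, series, slice(0, 30)))
```

Because this metric compares time-filtered means against a fixed noise estimate, it emerges only once the trend is well established, which is one reason the text contrasts it with the pattern-based ANN predictions.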
Finally, we evaluate the XAI conclusions for NClimGrid (after training on SPEAR_MED) by compositing the heatmaps over 2005 to 2022 in Figure 11a. This temporal range corresponds more closely to the period when a temperature signal has emerged (Figure 9). As discussed in Section 3.1, the inclusion of ridge regularization can be useful for limiting the amount of overfitting. Likewise, it is also useful for interpreting the explainability results for how an ANN is making its predictions, as it is analogous to spatial smoothing for removing spurious outliers (Barnes et al., 2020; Sippel et al., 2019). These NClimGrid relevance maps are shown in Figure 11 for ANNs trained on SPEAR_MED but using different regularization parameters and no additional smoothing filters. For lower ridge parameters, we find greater noise when interpreting the XAI maps, but overall higher positive relevance areas over portions of the Western United States in the vicinity of topography. As the ridge parameter increases (i.e., penalizing larger weights to spread out the importance more evenly), the spatial patterns of the XAI maps are smoothed, but note that the error of the ANN testing data skill also begins to grow (Figure S5). For the NClimGrid relevance maps in Figures 11d-11f, we find that the ANN is leveraging temperature patterns in the Eastern and Western United States (positive relevance) to make its yearly predictions. Notably, this spatial pattern of relevance somewhat resembles the summertime warming hole (Mascioli et al., 2017). The effect of setting the ridge parameter to 0.0 is also shown in Figure S10 and underscores that these XAI heatmaps are not interpretable, since the ANN may be more likely to overfit on neighboring temperatures.
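The smoothing effect of the ridge penalty can be illustrated with a minimal toy fit, under assumptions that are ours rather than the paper's: two highly correlated neighboring "gridpoints" carry the same signal, and with little regularization the fit is free to load unevenly on one of them, whereas a larger L2 penalty spreads the weight between them, the one-dimensional analogue of the spatially smoother relevance maps described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear predictors sharing one signal (toy "neighboring
# gridpoints"), plus a noisy target driven by that same signal.
n = 500
s = rng.normal(size=n)
x1 = s + 0.05 * rng.normal(size=n)
x2 = s + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = s + 0.1 * rng.normal(size=n)

def ridge_weights(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_small = ridge_weights(X, y, 1e-6)   # almost unregularized
w_large = ridge_weights(X, y, 100.0)  # strong L2 penalty

# The penalty damps the difference between the two weights, i.e., it
# "smooths" the importance across the correlated neighbors.
print(abs(w_small[0] - w_small[1]), abs(w_large[0] - w_large[1]))
```

The same tradeoff appears in the text: the penalty evens out importance across correlated neighbors (better interpretability) while shrinking the total fit (slightly worse skill).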

Influence of Horizontal Resolution
At this stage, it is evident that the ANNs are leveraging different spatial features to predict the year depending on the availability of fine detail on a given map grid, especially at temperature gradients around high and low geographic elevations. To evaluate this finding more closely, we now compare our regional results using ANNs trained and tested on coarser regional maps from SPEAR_LO in Figure 8. [Figure 11 caption: the L2 value used for the main results of the paper is labeled in bold font; positive relevance indicates regions that pushed the artificial neural network (ANN) toward its predicted year, and negative relevance suggests areas that pushed the ANN away from its predicted year; no Gaussian filter is applied to this set of explainable artificial intelligence heatmaps.] Figures 8d-8f show these regional predictions of JJA TAVG and overall indicate poorer testing ensemble member skill for the Western, Central, and Eastern United States. In fact, the ANN is unable to find any time-evolving signals in SPEAR_LO for the Eastern United States until the late 1990s. There is also greater spread in the yearly predictions after inputting LOW grid size maps from NClimGrid. As such, this result is consistent with Section 4.2, which found lower skill in predicting the year of CONUS maps after training on SPEAR_LO compared to SPEAR_MED. To better compare the prediction skill between the two different spatial resolutions of SPEAR_MED and SPEAR_LO, we show in Figure 12 the distribution of scores using MAE across 20 different ANNs with the same architecture, but for different combinations of training, testing, and validation data and random initialization seeds. We also compare the MAE of ANNs trained on the FLOR climate model large ensemble (MED grid) to ANNs trained on maps from FLOR interpolated onto the LOW grid spacing (FLOR (LO); see Section 2.1). For both of these ANN experiments, we again find lower MAE scores (i.e., better skill) for the ANNs trained on the maps with higher spatial resolution (MED grid). This performance is also contrasted with the experiment without any anthropogenic forcing (not shown). For SPEAR_MED_NATURAL, the median MAE score over 1921 to 2100 for its best performing ANN architecture is about 22 years, more than four times the MAE of the ensemble with anthropogenic forcing (the lowest MAE for any one ANN iteration/seed is still more than 18 years).
Figure 12 also compares the skill using SPEAR and FLOR to other climate model large ensembles, but this is only possible at their coarser resolution (LOW grid). Overall, the largest error in predicting the year of summertime TAVG maps is found for MIROC6-LE. Notably, and despite the coarser resolution, the lowest error across all climate model large ensembles is found for CESM2-LE, both before and after 1990 (Suarez-Gutierrez et al., 2020). This is examined more closely in Figure S13 for a single ANN using MIROC6-LE, CESM1-LE, and CESM2-LE, along with the predictions after inputting observations from NClimGrid after training each network. Taking into consideration that there are more ensemble members for these three climate models (compared to 30 total members in SPEAR and FLOR), one possibility for the better skill in CESM2-LE is the greater availability of training data. Therefore, we conduct three more ANN experiments that are shown in Figures S13d-S13f.

Early 20th Century Temperature Signals in SPEAR
A question that remains across all of the results is how the ANN is able to distinguish summertime temperature maps for the climate model large ensemble data prior to the late twentieth century. Stated another way, what signals existed before 1990, when the forced ensemble mean warming trend had not yet clearly emerged (e.g., Figure 4)? One advantage of using feature attribution XAI methods is that a relevance map is obtained for each input sample. Accordingly, it is then possible to take XAI composites over different temporal periods to better understand the time evolution of the most relevant temperature signals. Figure 13 shows the relevance maps using the LRPz and Integrated Gradients methods for composites of SPEAR_MED testing data before and after 1990. For the 1921 to 1989 period, both XAI methods reveal a hotspot over western Colorado and generally muted relevance elsewhere across the United States. A similar relevance pattern is also found for the XAI results based on inputs of TMAX (Figure S14). Apart from that, the relevance maps for 1990 to 2100 are more similar to those described in Figure S8.
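The per-sample attribution idea behind these composites can be made concrete with a minimal Integrated Gradients sketch (the method of Sundararajan et al., used here on a toy differentiable function rather than the study's trained ANN). The attribution for each input feature is the feature's displacement from a baseline times the path-averaged gradient, approximated by a Riemann sum; summing the attributions recovers the change in the model output (the completeness property).

```python
import numpy as np

# Toy differentiable "model": a tanh of a fixed linear combination.
w = np.array([0.5, -1.0, 2.0])

def f(x):
    return np.tanh(w @ x)

def grad_f(x):
    # Analytic gradient of tanh(w.x) with respect to x.
    return (1.0 - np.tanh(w @ x) ** 2) * w

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the Integrated Gradients
    path integral from the baseline to the input."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 0.5, -0.3])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
print(attr, attr.sum(), f(x) - f(baseline))
```

In the study's setting each input feature is a gridpoint temperature, so the attribution vector reshapes into exactly the kind of relevance map that is composited over 1921-1989 versus 1990-2100 in Figure 13.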
We compare the effect of regularization parameter and ANN architecture choices on ANN skill for predicting the year of temperature maps from only 1921 to 1989. This is displayed in Figures S15 and S16 for ANNs trained on SPEAR_MED and SPEAR_LO, respectively. As previously discussed in Section 4.2.2, we again find better skill for the higher resolution spatial grid and when using a lower ridge parameter. Relevance maps for ANNs using different regularization parameters are composited for SPEAR_MED in Figure S17, which reveals the effect of the ANN using finer regional information, like over the Rocky Mountains, for predicting the year of JJA temperature maps. Unlike earlier (e.g., Figure 13 and Figure S8), we do not apply any smoothing filters to these XAI maps in order to better contrast the differences when interpreting the areas of higher relevance. Larger ridge parameters bring worse ANN skill (Figure S5) but improved interpretability for the XAI maps (Figures S17d-S17f). Correspondingly, it is likely that the ANN is learning temperature indicators from finer spatial information, especially across western Colorado, during this early twentieth century period. To better understand whether the higher relevance values are due to particular years or are more consistent through time, we compute the variance of the relevance fields from 1921 to 2100 in Figure S18. While some of the regions of higher variability are located near the relevance hotspots outlined in Figure S17a across the western United States and southern Florida, there are some subtle differences in the spatial patterns. It is therefore likely that the ANN is leveraging combinations of both consistent climate patterns and time-evolving temperature signatures associated with large anomalies.
In addition to testing data from SPEAR, the ANNs were also able to predict the year of CONUS maps from other climate model large ensembles (Figure 12 and Figure S13). We again turn to XAI methods in Figure S19, but here for evaluating the temperature signals leveraged by the ANN after inputs of CESM2-LE. The spatial patterns of positive relevance in the early twentieth century are different for CESM2-LE over 1921-1989 (Figures S19a and S19b) compared to SPEAR_MED (Figures 13a and 13b). Instead, there are positive areas of relevance derived from CESM2-LE over the Southeastern United States and again across portions of the Rocky Mountains, but located north of the previous Colorado hotspot.
Although there are no long-term temperature trends in the time-mean of SPEAR_MED for the Western United States prior to around 1990 (Figure S2a), it is still possible that there are spatial patterns of temperature change. Figure S20 shows the linear TAVG trends for NClimGrid, SPEAR_MED, and SPEAR_MED_NATURAL from 1921 to 1950. While observations reveal a few areas of cooling across the West, most of the CONUS does not have any statistically significant temperature trends. In contrast, we find a patch of warming over western Colorado in SPEAR_MED, which closely aligns with the previously identified XAI relevance hotspot (Figures 13a and 13b). Conspicuously, this warming is absent for TAVG trends in SPEAR_MED_NATURAL (Figure S20c), which indicates it could be caused by anthropogenic forcing in SPEAR. We compare trends for TMAX and TMIN in Figure S21 and also find broadly similar results for the small warming trend over western Colorado in the ensemble mean of SPEAR_MED.
To understand the possible physical drivers for this local warming in SPEAR_MED, we calculate averages over a small box in western Colorado (outlined in Figure 2). We find that this warming in SPEAR is strongly correlated with decreasing evaporation rates (Figures S22a and S23a), greater surface runoff, and even higher land surface temperature warming (not shown). Despite this evidence, we do not find any corresponding changes to precipitation or antecedent wintertime snowfall that could completely explain this effect (not shown). It is necessary to consider, however, that this small warming trend is still insignificant compared to the large spread of internal climate variability as simulated by the individual ensemble members over the region (Figures S22b-S22d). These trends are not found for the ensemble mean of SPEAR_MED_NATURAL (Figure S23b); it is important to recall that land use and land change fields are also set to 1921 levels in SPEAR_MED_NATURAL. We hence hypothesize that this small warming signal over western Colorado could be related to the land surface forcing in SPEAR_MED, such as through the prescription of interactive vegetation influencing the surface energy budget, but this is outside the scope of this study, and more work would be needed to answer this question.
Finally, we wish to make one last point: although the ANN uses this temperature signal over western Colorado to help predict the year of a given map, it is not the only source of the high 1921-1989 skill (Figure 12).

10.1029/2023EF003981
This is reflected in the results from Figure 10, which shows that the ANN can still reasonably predict the year even for regions of CONUS that do not include western Colorado.

Discussion and Conclusions
In this study, we used a machine learning approach to identify the ToE of summertime temperatures across the CONUS. There are several differences in this methodology compared to the more traditional signal-to-noise metrics used in earlier ToE works. One advantage is that the ANN needs to learn time-evolving patterns of forced change to make an accurate prediction, instead of only comparing time-mean metrics over different epochs. In fact, we show that after training, the ANN is able to resolve these forced signals even within a single ensemble member, that is, one realization of internal climate variability. Changes in these temperature signals can then be visualized by applying XAI methods from one year to the next. Further, the ANN has the ability to learn and identify nonlinear relationships across the entire spatial map, which ultimately contribute to the different ToE predictions. These temperature patterns could differ from metrics calculated at the point-by-point level or averaged over larger domains, either of which could subsequently impact ToE estimates. By design, if observational predictions fall close to the 1:1 line, this suggests that climate change patterns in the training data are generalizable to the real world. This is found to be the case in our ANN framework for the last two decades across the different large ensemble climate models used for training.
We calculate the ANN-derived ToE by comparing to an early twentieth century baseline period, which encompasses the record heatwaves of the Dust Bowl era. Nonetheless, we find the emergence of a forced signal as early as the late 1990s for observed TMIN in the Eastern United States. More broadly, we also find the emergence of summertime TAVG and TMIN across the entire conterminous United States. In other words, the ANN can still distinguish a climate signal in JJA temperature maps during recent years, despite the overall observed mean not exceeding the record warmth of the mid 1930s. It is also possible that if the size of the spatial domain of the temperature maps were increased, such as to consider all of North America, the ToE may be identified even earlier than found in this work (e.g., Barnes et al., 2019; Labe, Barnes, & Hurrell, 2023; Sippel et al., 2020).
The ANNs are also able to make accurate predictions of the year for a given temperature map using the climate model large ensemble data, and they are skillful even before the temperature response to greenhouse gas forcing overwhelms internal variability later in the twentieth century. This suggests that the ANN can leverage patterns of temperature indicators for distinguishing these temperature maps. We find that this is related to the higher spatial resolution of the training data, particularly in the vicinity of complex topography. We infer that this is not simply related to more available data samples, as we do not find any skill improvement when training on additional individual ensemble members using the other GCMs. Rather, it is more likely related to the ANN learning information from the high-resolution grid and thus to the ability of a climate model to represent finer temperature structures and gradients.
To consider this point further, these results indicate that there is considerable potential to use the machine learning methods described in this study to detect the emergence of climate signals much earlier than with conventional methods, but the ability to realize this potential is likely hindered by climate model error or biases.For example, this could be analogous to the difference between potential predictability and forecast skill in climate prediction studies.If you calculated the ToE in the same way described here but using individual ensemble members as truth, you could define a "potential ToE" that would be much earlier than the actual ToE (again, analogous to predictability).The difference between the actual and potential ToE likely would reflect errors in the simulation of certain physical processes that the climate model deems important for distinguishing forced changes from internal variability.The ability to realize that potential ToE depends, however, on whether those processes are realistic and if the gap between simulated and observed changes can be narrowed.In this work, the gap also seems to be related to GCM resolution, as the earliest potential ToE generally occurs in the higher-resolution simulations.Similarly, we are also limited by the ability of most climate models to accurately simulate the temperature variability realized over the twentieth century across the CONUS (i.e., the warming hole spatial pattern) (Eischeid et al., 2023).Lastly, we do find that that other lower resolution large ensembles (e.g., CESM2-LE) have equivalent or even better ANN prediction skill than SPEAR_MED, but it's unclear whether this skill would also be further improved with greater resolution in these other GCMs.This suggests that differences in the forced response or internal variability between observations and individual GCMs could also be influencing the regional ToE, not just the horizontal resolution.Recent studies have just started to begin addressing 
this type of question for disentangling internal variability from forced climate change using XAI (e.g., Gordon et al., 2023; Po-Chedley et al., 2022), and we hope to investigate the utility of our machine learning approach for this method of GCM evaluation in future work.
Moving forward, our findings have several potential broader implications for using machine learning methods in climate science applications. While many XAI applications have regridded to coarser inputs because of lower computational cost, this may come at the expense of machine learning model performance that could potentially be achieved using high-resolution data. For example, it might be interesting to further explore this effect for applications of machine learning in subseasonal to decadal prediction (Meehl et al., 2021; Merryfield et al., 2020), where neural networks may be able to derive more information from small features such as simulated mesoscale eddy activity. However, this remains an active area of research even for the climate model development community (Hewitt et al., 2017; Scaife et al., 2019). It could also be interesting to leverage these XAI tools for diagnosing biases in GCMs, such as the utility briefly explored here for identifying an unexpected temperature response in western Colorado, which may be related to the land surface forcing fields.
However, having a greater number of input features (e.g., higher resolution input maps) can raise the risk of statistical overfitting. It also tends to lower the interpretability of the machine learning models, even after applying XAI attribution methods (Barnes et al., 2020; Samek et al., 2019; Toms et al., 2020). Likewise, there is also the question of whether a more complex machine learning architecture is actually needed, especially when it yields only a small improvement in skill or when computational cost is a concern. There are hence tradeoffs to balance in all of these machine learning design choices, and the answers are largely dependent on the purpose of the task. Although our observational ToE estimates are found to be robust across a range of ANN experiments and training data sets, we propose that more work investigating the sensitivity of machine learning model skill and the interpretability of different XAI methods to variations in input data would be helpful across a variety of climate science applications.

Data Availability Statement
Atmospheric reanalysis data are openly available for ERA5 by Hersbach et al. (2023), which is supported by the Copernicus Climate Change Service (C3S; Thépaut et al., 2018) Climate Data Store. Twentieth Century Reanalysis Project version 3 (20CRv3) data are provided by the NOAA/OAR/ESRL PSL, Boulder, Colorado, USA through L. Slivinski et al. (2023). Monthly U.S. Climate Gridded Data set (NClimGrid) data are provided by NOAA/NCEI through Vose et al. (2014a, 2014b). Climate model large ensembles used in this study are available from the Climate Data Gateway (NCAR, 2023), Earth System Grid Federation (Laboratory, 2023), Facility for Weather and Climate Assessments (Murray et al., 2023), and the SPEAR data portal (GFDL, 2023). Citations for documentation papers describing each data set are provided throughout the main text and in the reference list.

Figure 2. (a) Composite of JJA average near-surface temperature (TAVG) from an example of the training data mean used to standardize the input maps, which is calculated here across 24 ensemble members from SPEAR_MED and over the period of 1981-2010. (b) As in panel (a), but for the training data standard deviation. See Section 3.1 for more details. The vertical yellow longitude lines are displayed at 104°W and 85°W to differentiate the three regions of the CONUS considered for this work (i.e., the Western USA, Central USA, and Eastern USA). The thin yellow box outlines the western Colorado (W.CO) region of interest used for this study (37°N-41°N and 108°W-105°W).
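The per-grid-point standardization described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the array shapes and temperature values are synthetic stand-ins for the SPEAR_MED JJA training data.

```python
import numpy as np

# Synthetic stand-in for the training data with dimensions
# (ensemble member, year, lat, lon); here 24 members over a
# 30-year reference period, as in the Figure 2 example.
rng = np.random.default_rng(0)
train = rng.normal(25.0, 2.0, size=(24, 30, 10, 20))

# Per-grid-point mean and standard deviation computed across
# ensemble members and years (panels a and b of Figure 2).
mean_map = train.mean(axis=(0, 1))
std_map = train.std(axis=(0, 1))

def standardize(temp_map):
    """Convert one input temperature map to standardized anomalies."""
    return (temp_map - mean_map) / std_map

z = standardize(train[0, 0])  # one standardized input map
```

Standardizing with statistics from the training data only (rather than the full record) keeps the testing and observational inputs independent of the fitting procedure.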

Figure 3. Schematic of the output provided by the artificial neural network (ANN) and the subsequent calculation of the timing of emergence (ToE) based on CONUS maps from observations. Blue scatter points denote ANN predictions based on SPEAR_MED (historical + SSP5-8.5 forcing) testing ensemble members for the inputs of TAVG averaged over JJA. The actual year is shown on the x-axis, and the predicted year is shown on the y-axis. Red markers are used for ANN predictions after inputting maps from NClimGrid. A perfect prediction (1:1 slope) is annotated behind all ANN predictions with a solid gray line. To first calculate the ToE of the NClimGrid maps, the largest predicted year in the 1921 to 1950 climatological period (vertical gray shading) is identified (left, bright red marker). The actual ToE (right, bright red marker) is then the first year when all subsequent predictions (vertical red shading) exceed the year of the 1921-1950 maximum. The observed ToE from this ANN is 2005.
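The ToE rule described in this caption can be expressed as a short function. The sketch below is an assumption-laden stand-in (function and variable names are ours, and the predictions are idealized), not the authors' implementation.

```python
import numpy as np

def timing_of_emergence(years, predicted, base_start=1921, base_end=1950):
    """ToE rule from Figure 3: find the largest predicted year within the
    climatological baseline, then return the first year after which every
    subsequent prediction exceeds that threshold."""
    years = np.asarray(years)
    predicted = np.asarray(predicted)
    base = (years >= base_start) & (years <= base_end)
    threshold = predicted[base].max()
    for i, yr in enumerate(years):
        if yr <= base_end:
            continue
        if np.all(predicted[i:] > threshold):
            return yr
    return None  # no emergence within the record

# Idealized predictions: noise-like (flat) until 2005, then clearly later
# than the baseline maximum thereafter.
years = np.arange(1921, 2023)
predicted = np.where(years < 2005, 1950.0, 2010.0)
toe = timing_of_emergence(years, predicted)  # -> 2005
```

Requiring that *all* subsequent predictions exceed the baseline maximum makes the estimate conservative: a single later dip back below the threshold delays the ToE.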

Figure 4. (a) Time series of mean JJA maximum temperature (TMAX) anomalies over the CONUS from 1921 to 2022 for the ensemble mean of SPEAR_MED (dark green line) compared to observations from NClimGrid (dashed red line). The spread across SPEAR_MED ensemble members is shown with the light green shading. Anomalies are computed for each data set with respect to their own 1981-2010 climatological mean. (b) As in panel (a), but for the mean JJA minimum temperature (TMIN). (c) As in panel (a), but for the mean JJA average temperature (TAVG). Note that the Dust Bowl period from 1932 to 1938 (Schubert et al., 2004) is annotated with a vertical gray shading, and the year of the Mount Pinatubo eruption (1991) is highlighted with a vertical black line.

Figure 5. Linear least squares trends of average JJA TAVG from 1921 to 1989 (a-c) and 1990 to 2022 (e-g) for NClimGrid (a, e), the ensemble mean from SPEAR_MED (b, f), and the ensemble mean from SPEAR_MED_NATURAL (c, g). Trends are first calculated at every grid point and in each individual ensemble member before averaging together for the ensemble mean maps (b, c, f, g). The SPEAR_MED ensemble member with the highest pattern correlation (corr) of TAVG trends relative to those in NClimGrid is plotted in panel (d) for 1921-1989 and (h) for 1990-2022. For maps of NClimGrid, black hatch marks indicate TAVG trends that are not statistically significant following a Mann-Kendall test (Bevan & Kendall, 1971; Mann, 1945) at the 95% confidence level.
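The per-grid-point trend and significance screening described in this caption can be sketched for a single grid point as below. This is an illustrative stand-in: the Mann-Kendall trend test is approximated here via Kendall's tau computed against time (`scipy.stats.kendalltau`), which is equivalent for series without ties, and the synthetic warming series is our own assumption.

```python
import numpy as np
from scipy import stats

def trend_and_significance(years, series, alpha=0.05):
    """Least-squares trend plus a Mann-Kendall-style monotonic-trend test
    (Kendall's tau of the series against time) at one grid point."""
    slope, intercept = np.polyfit(years, series, 1)
    tau, p_value = stats.kendalltau(years, series)
    return slope, p_value < alpha

# Synthetic noise-free warming of 0.3 degC per decade over 1990-2022.
years = np.arange(1990, 2023)
warming = 0.03 * (years - years[0])
slope, significant = trend_and_significance(years, warming)
```

In practice this would be applied at every grid point, with non-significant trends masked by hatching as in the figure.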

Figure 6. (a) Linear least squares trends of average JJA TMAX from 1990 to 2022 for NClimGrid. (b) As in panel (a), but for the ensemble mean from SPEAR_MED. (c, d) As in panels (a, b), but for TMIN. Trends are first calculated at every grid point and in each individual ensemble member before averaging together for the ensemble mean maps (b, d). Black hatch marks indicate trends that are not statistically significant following a Mann-Kendall test at the 95% confidence level for NClimGrid data.

Figure 7. (a) Predictions of SPEAR_MED testing ensemble members by the artificial neural network (ANN) for input maps of TAVG averaged over JJA. The actual year is denoted on the x-axis and the predicted year on the y-axis. Red markers are shown for ANN predictions after inputting maps from NClimGrid. A red line is displayed showing the linear least squares regression through the NClimGrid predictions along with its corresponding R² value. A perfect prediction (1:1 slope) is annotated behind all ANN predictions with a solid gray line. (b) As in panel (a), but for an ANN trained and tested on maps of TMAX. (c) As in panel (a), but for an ANN trained and tested on maps of TMIN.

Figure 8. (a) Predictions of SPEAR_LO testing ensemble members by the artificial neural network (ANN) for input CONUS maps of TAVG averaged over JJA. The actual year is denoted on the x-axis and the predicted year on the y-axis. Red markers are shown for ANN predictions after inputting maps from NClimGrid. A perfect prediction (1:1 slope) is annotated behind all ANN predictions with a solid gray line. The Root Mean Squared Error (RMSE) for the SPEAR_LO testing ensemble members is included for predictions over the actual years before and after 1990. (b) As in panel (a), but for SPEAR_MED. (c) As in panel (a), but for SPEAR_MED_NATURAL. (d) As in panel (a), but for SPEAR_LO predictions based on input maps of only the Western USA. (e) As in panel (a), but for SPEAR_LO predictions based on input maps of only the Central USA. (f) As in panel (a), but for SPEAR_LO predictions based on input maps of only the Eastern USA.

Figure 9. (a) Distribution of timing of emergence (ToE) predictions for inputs of CONUS maps using NClimGrid after training ANNs on SPEAR_MED data of TAVG, TMAX, or TMIN. The median ToE is shown with a thin white horizontal line. The mean ToE is shown with a dashed black line. Each distribution of ToE years is constructed from 100 artificial neural network (ANN) iterations (use of random initialization seeds and different combinations of training, testing, and validation ensemble members). (b) As in panel (a), but for the distribution of Spearman's rank correlation coefficients (R) between all the actual years and predicted ANN years for NClimGrid data.

Figure 10. Artificial neural network (ANN) predictions of SPEAR_MED testing ensemble members of TMAX maps for the Western USA (a), Central USA (b), and Eastern USA (c). The actual year is denoted on the x-axis and the predicted year on the y-axis. Red markers are shown for ANN predictions after inputting maps from NClimGrid. A perfect prediction (1:1 slope) is annotated behind all ANN predictions with a solid gray line. The Root Mean Squared Error (RMSE) for the SPEAR_MED testing ensemble members is included for predictions over the actual years before and after 1990. If the observed timing of emergence (ToE) occurs for the NClimGrid predictions, then it is denoted for each region with a bright red marker and vertical line. If five or fewer years of predictions exist after this calculated ToE, then it is annotated with two added asterisks. (d-f) As in panels (a-c), but for input maps of TMIN. (g-i) As in panels (a-c), but for input maps of TAVG.

Figure 11. (a-f) Composite of LRPz heatmaps from predictions based on NClimGrid, which are averaged over 2005 to 2022 for ANNs trained on SPEAR_MED data using different L2 regularization values (0.001, 0.01, 0.1, 0.5, 1, 5). The L2 value used for the main results of the paper is labeled in bold font. Positive relevance indicates regions that pushed the artificial neural network (ANN) to make its predicted year. Negative relevance suggests areas that tried to push the ANN away from its predicted year. Note that no Gaussian filter is applied to this set of explainable artificial intelligence heatmaps.
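The effect of the L2 regularization values swept in this figure can be illustrated with a minimal ridge-regression example. This is a deliberately simplified stand-in for the ANN's penalized weights (the data and penalty values below are our own assumptions): a larger L2 penalty shrinks the learned weights, which in the ANN translates to smoother, less noisy relevance patterns.

```python
import numpy as np

def ridge_weights(X, y, l2):
    # Closed-form minimizer of ||X w - y||^2 + l2 * ||w||^2
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(n_features), X.T @ y)

# Synthetic regression problem with known weights plus small noise.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(100, 5))
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_small = ridge_weights(X, y, l2=0.001)  # weak penalty, near least squares
w_large = ridge_weights(X, y, l2=5.0)    # strong penalty, shrunken weights
```

The same tradeoff appears in the figure: very small L2 values yield noisier heatmaps, while very large values over-smooth them.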

Figure 12. (a) Distribution of Mean Absolute Error (MAE) scores for validation predictions over 1921-1989 based on inputs of CONUS maps for the overall artificial neural network (ANN) architecture with the lowest median MAE (e.g., Figure S6) after training neural networks on individual climate model large ensembles with the MED resolution (see Section 3.1) (SPEAR_MED or Forecast-Oriented Low Ocean Resolution (FLOR)). Each distribution of scores (red points) is constructed from 20 ANN iterations (different combinations of training, testing, and validation ensemble members and random initialization seeds). The median score is shown with a thin black horizontal line. (b) As in panel (a), but for MAE scores calculated over 1990-2100. (c) As in panel (a), but for MAE scores calculated over 1921-2100. (d-f) As in panels (a-c), but for ANNs trained on individual large ensembles with the LOW resolution (SPEAR_LO, FLOR (LO), CESM1-LE, CESM2-LE, MIROC6-LE).
, but use the same number of training and testing data ensemble members as done with SPEAR. Comparable results are found, and for that reason it is unlikely that the differences in skill are due to more training data ensemble members. Similar to the testing data results, there is greater spread in the NClimGrid predictions after training on MIROC6-LE (Figures S13a and S13d). Meanwhile, the predictions after training on CESM1-LE and CESM2-LE agree broadly well with those from the SPEAR ANNs, but a TAVG signal again only emerges by the late 1990s (i.e., predictions closer to the 1:1 line).

Figure 13. (a) Relevance heatmaps of TAVG using LRPz and (b) Integrated Gradients for testing ensemble members from SPEAR_MED composited over 1921 to 1989. The composited heatmaps are smoothed using a Gaussian filter to improve interpretability. Positive relevance indicates regions that pushed the artificial neural network (ANN) to make its predicted year. Negative relevance suggests areas that tried to push the ANN away from its predicted year. (c, d) As in panels (a, b), but for composites over 1990-2100.
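The Gaussian smoothing applied to the composited heatmaps in this figure can be sketched with `scipy.ndimage.gaussian_filter`. This is an illustrative example, not the authors' settings: the heatmap values and the `sigma` width below are assumptions for demonstration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic stand-in for a composited relevance heatmap on a lat-lon grid.
rng = np.random.default_rng(1)
heatmap = rng.normal(0.0, 1.0, size=(40, 80))

# Smooth with a Gaussian kernel; larger sigma removes more small-scale
# noise at the cost of blurring finer relevance structures.
smoothed = gaussian_filter(heatmap, sigma=2.0)
```

Smoothing suppresses grid-scale noise in the composite while preserving the broader regions of positive and negative relevance that the interpretation relies on.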