On the need of bias adjustment for more plausible climate change projections of extreme heat

The assessment of climate change impacts in regions with complex orography and land‐sea interfaces poses a challenge related to shortcomings of global climate models. Furthermore, climate indices based on absolute thresholds are especially sensitive to systematic model biases. Here we assess the effect of bias adjustment (BA) on the projected changes in temperature extremes focusing on the number of annual days with maximum temperature above 35°C. To this aim, we use three BA methods of increasing complexity (from simple scaling to empirical quantile mapping) and present a global analysis of raw and BA CMIP5 projections under different global warming levels. The main conclusions are (1) BA amplifies the magnitude of the climate change signal (in some regions by a factor 2 or more) achieving a more plausible representation of future heat threshold‐based indices; (2) simple BA methods provide similar results to more complex ones, thus supporting the use of simple and parsimonious BA methods in these studies.

Heat extremes are very likely to be more frequent and intense in the future (Seneviratne et al., 2021), mainly as a direct consequence of the increase in mean temperature (Fischer & Schär, 2010;Schär et al., 2004). Global climate models (GCMs) are the fundamental tools producing future climate projections for impact and adaptation studies. However, uncertainties still remain for key large-scale processes (see e.g., Fernandez-Granja et al., 2021) and sub-grid scale processes, which are often misrepresented due to the coarse resolution of GCMs (e.g., Maraun, 2016). This is particularly true for regions with complex orography, intricate coastlines and/or small islands (e.g., Karl et al., 1999;Peterson et al., 2001;Sanjay et al., 2017), leading to large uncertainties and biases for extreme events.
Absolute threshold-based temperature indices are largely sensitive to systematic model biases, and therefore they cannot be reliably calculated from raw GCM outputs. Bias adjustment (BA) methods are often used to correct specific statistical properties and reduce these biases (see e.g., Li et al., 2020;Maraun, 2016;Matthews et al., 2017;Teutschbein & Seibert, 2012). A proper application of BA (Ehret et al., 2012;Maraun et al., 2017) provides an improved and more robust signal (Dosio, 2016) through the reduction of the multimodel ensemble spread (see e.g., Zhao et al., 2015), by placing all models on equal footing, at the expense of additional uncertainty related to the BA method (Casanueva et al., 2020a;. Previous studies have shown the high sensitivity of threshold-based index projections to the BA method, although these are mostly limited to regional models or/and regional spatial scales (Ahmed et al., 2013;Dong Dosio, 2016;Schmith et al., 2021). Here, we provide a global analysis of the effects of BA on an extreme temperature index (the annual number of days with maximum temperature above 35 C, TX35), using three fit-for-purpose BA methods applied to the GCM simulations of the Coupled Model Intercomparison Project Phase 5 (CMIP5) from historical and RCP8.5 experiments.
The effect of BA on the projected TX35 changes is then evaluated for different global warming levels (GWLs) analysing regional differences focusing on the updated IPCC-WGI reference regions .
2 | DATA AND METHODS

| Model data (CMIP5)
Daily maximum temperature from 28 GCMs from CMIP5 (Taylor et al., 2012, curated version used for IPCC-AR5) was used in this work considering both the historical and RCP8.5 experiments (see Table 1). All the simulations were downloaded from the IPCC Data Distribution Centre (https://www.ipcc-data.org/sim/gcm_monthly/AR5/ index.html; last accessed, 31 December 2019). For comparability, all simulations were interpolated to a common 2 grid. The common grids and land/sea masks used are available in the ATLAS GitHub repository. 1 The period 1986-2005 was considered as the historical baseline while +1.5 C, +2 C, and +3 C GWLs (with respect to the pre-industrial 1850-1900 mean value, see for example, Nikulin et al., 2018) were used for future projections. The corresponding time periods for each GCM are computed using 20-year moving windows. Table 1 shows the central years (n) of the 20-year window where the warming is first reached. The GWL period is thus taken as [n À 9, n þ 10]. The use of a 20-year moving window is selected to be consistent with 20-year time slices typically used for near-term (2021-2040), mid-term (2041-2060), and long-term (2081-2100) future projections. The reference GWLs (and additional supplementary materials and reproducibility scripts) are available in the ATLAS GitHub repository. 2

| Observational data
Daily maximum temperature of W5E5 (Cucchi et al., 2020;Lange, 2019;Weedon et al., 2014) was used as the observational reference to calibrate the GCM output. This dataset was developed as part of the Phase 3b of the Inter-Sectoral Impact Model Intercomparison Project (https://www. isimip.org/), being the observational reference for the calibration of the GCMs considered in the third phase of this initiative, which is focused, among others, on the detection and attribution of observed impacts following the definition established by the IPCC-WGII (Cramer et al., 2014). W5E5 is a global daily dataset with 0.5 horizontal resolution covering the period 1979-2016. It is a merged dataset which combines WFDE5 data (Cucchi et al., 2020;Weedon et al., 2014) over land with ERA5 (Hersbach et al., 2020) over the ocean.
To avoid spurious effects due to the scale gap between model and observations, W5E5 was regridded to the same common 2 resolution grid used for the GCMs before training the BA methods. This way, the downscaling effect is avoided, being thus BA used as a mere adjustment (see Casanueva et al., 2020a, for a discussion on this).

| Bias adjustment
In this study, we use three BA methods of increasing complexity. The simplest parametric methods adjust only the mean (referred to as MA) and the mean and variance (MVA), respectively (similar to RaiRat-M6 and RaiRat-M7 in the Cost Action VALUE intercomparison experiment Gutiérrez et al., 2019, see Appendix A1). These methods are applied on a monthly basis, that is, the parameters are adjusted separately for each month.
Empirical quantile mapping (EQM) is a popular BA method and consists in calibrating a transfer function over the control period to map the quantiles from the empirical cumulative distribution function of the model output onto the corresponding observed distribution. Here we use the implementation from Déqué (2007) which fits 99 empirical percentiles and uses constant extrapolation for out-ofsample values (i.e., values below and above the calibration range). The EQM implementation is similar to that in the Cost Action VALUE Gutiérrez et al., 2019), with a slight modification in the moving window size in order to alleviate the computational demand of the method for the global domain (here EQM is applied on a monthly basis, consistently with MA and MVA). The reader is referred to  for an overall evaluation of these methods over Europe.
The intercomparison of the three BA methods presented here allows to assess the suitability of simple (parsimonious) versus complex BA methods for this particular problem. Casanueva et al. (2013) showed that adjusting the mean (MA) reduces to a large extent biases in high-and lowtemperature percentiles and these are close-to-zero after the second-order correction (MVA). Here we further analyse the practical implications for heat indices depending on absolute temperature thresholds (annual number of days with maximum temperature above 35 C, TX35). Moreover, this comparison allows to assess the effect of inflation in the results.
Whereas MVA and EQM produce an inflation of the variance, thus modifying the climate change signal, the simplest method (MA) does not affect the variance and preserves trends (trend-preserving method).
The BA methods were trained over the historical 1986-2005 period and subsequently applied to the 20-year GWL periods (Table 1) considering every land gridbox for each GCM separately. Finally, values of TX35 were calculated from both the raw and BA daily maximum temperature time series. These methods are implemented in the R package downscaleR  through the function biasCorrection, using the optional arguments method = "eqm" or method = "scaling" or method = "mva", and window = c(30,30). Further details and worked examples of the EQM application for absolute threshold, temperaturebased climate indices are given in Iturbide et al. (2019) and companion materials.

| Model biases in the historical period
When compared with the observed data in the calibration period, the EQM exhibits very low biases of the ensemble mean TX35 (see Figure 1, first column) with less than F I G U R E 1 The first row shows the number of mean annual days with maximum temperature above 35 C-TX35-For the W5E5 observational reference (first column) and the raw CMIP5 ensemble for the historical 1986-2005 period (second) and +2 C global warming level (third). Rows 2-4 show the results corresponding to the different bias adjustment methods (simple mean bias adjustment-MA-, mean-variance bias adjustment-MVA-And empirical quantile mapping-EQM-, respectively), representing the bias (first column) and the differences between the raw and bias-adjusted values (adjustment factor) in the training period (middle column) and a representative test period (right). The overlaid map shows the updated IPCC-WGI reference regions (see Figure 4a) 3 days/year bias over most of the global surface. This is not surprising since this method adjusts the percentiles of the distribution. Moreover, the results are very similar for the MVA method, indicating that adjusting just the mean and variance indirectly produces a good adjustment of the upper quantiles (e.g., those related to TX35). The simplest method, MA results in higher differences but still smaller than the adjusted biases (as shown in column 2). Therefore, parametric methods seem to be convenient for the assessment of the TX35 index, related to high percentiles of the maximum temperature.
The ensemble mean maps of the historical (training) period exhibit large differences between raw and biasadjusted TX35 over sizeable global land areas (see Figure 1, middle column), highlighting the large effect of the BA step.
These differences, expressed as the absolute difference BA À raw (days), are remarkable affecting pre-eminently the intertropical range, for which the bulk of land area is concentrated on Eastern Africa and Sahara Desert with positive differences over 50 days (IPCC-WGI regions SAH, WAF, CAF, NEAF, see Figure 4a), South America (mixed pattern with positive differences over NWS and SAM and mostly negative in NSA-Amazon Basin-and NES), Central America (SCA), Arabian peninsula (ARP, with positive/ negative differences in the northern/southern parts), and Southern Asia (mostly negative in SAS). Some important differences are also found in some adjacent areas of the extratropic in North/South America (NCA/SES regions), Northern/Southern Africa (MED/SWAF), the Middle East (WCA), and Australia. In South America, positive differences F I G U R E 2 Changes in mean number of annual days with maximum temperature above 35 C-TX35-For three future 20-year periods corresponding to +1.5 C, +2 C, and +3 C global warming levels (GWL) (in columns) relative to the historical 1986-2005 period. The first row shows the results corresponding to the raw model data and the second to fourth rows show the results for the MA, MVA and EQM bias adjustment methods, respectively. Hatching represents multimodel uncertainty (hatched areas correspond to weak model agreement, that is, less than 80% of the models agreeing on the sign) are located over the major mountainous areas of the continent, namely, the Andes (NWS, SAM, SES) and Brazilian highlands (SAM), while the negative differences are strengthened in low-lying areas of the Amazon (NSA) and Paran a (SES) River Basins.
Overall, we find a distinctive spatial pattern of the difference between raw and BA maps resembling the spatial pattern of CMIP5 biases in extreme values of mean temperature described in previous studies (see e.g., Zhao et al., 2015). This pattern is consistent among the different BA methods (Figure 1), particularly for EQM and MVA which exhibit an almost identical pattern worldwide. The simpler MA method yields some differences of low magnitude in a few regions; Eastern South America (NES), F I G U R E 3 Historical and RCP8.5 time series of the individual models (thin lines) and the multimodel mean (solid lines) of regional TX35 for the Southeast Asia (SEA) region for the (a) raw and (b-d) MA, MVA and EQM bias-adjusted model data. The red shaded area indicates the multimodel range Central Africa (CAF) and Indian Peninsula (SAS), and Southeast Asia (SEA), for example, but in general these differences are weak and restricted to small areas. This result supports the use of simple and parsimonious BA methods over more complex EQM involving multiple parameters.

| Future projections of extreme heat
The differences between raw and BA projections are increasingly reinforced consistently for the three BA methods from the historical experiment to future projections, as shown in Figure 1 (right column) which corresponds to +2 C GWL. In the Amazon basin (NSA, SAM) the negative difference in the historical period is inverted showing a strong positive increment at +2 C GWL. These differences tie in with the increment of the projected changes from lower to higher levels of warming shown in Figure 2, which unveils a progressive southward displacement toward South American, South African and SEA regions, more accentuated with increasing GWLs (Figure 2).
In general, there is a strong multimodel agreement over the bulk of land areas, notably improved in the landsea transitions after BA application over most of the world coastal regions like in west Africa (western SAH, CAF, WSAF) and Indian Ocean coasts (NEAF, SEAF, ARP, SAS, SEA, NAU, SAU), as indicated by limited areas with hatched pattern in Figure 2. The low multimodel agreement depicted in the higher latitudes of the northern hemisphere (less than 80% of models bear the same delta change sign) are due to the very low delta changes in the number of days above the 35 C threshold, ranging from small negative to small positive values around zero. Differently, multimodel agreement is met (all zero values) in those areas where this temperature threshold is never reached by any model (e.g., Antarctica and Greenland).
To gain a better insight into the effect of BA, Figure 3 shows the multimodel time series of the spatially averaged TX35 over SEA. Here, the climate change signal is largely reinforced after BA application. The trend of the ensemble mean is clearly more pronounced for the BA series, with a quasi-linear increment until reaching +125 days/year by the end of the 21st century, this is, more than double the raw projection (50 days/year). Furthermore, the ensemble spread of the historical period and the near-future projections is drastically reduced with EQM and MVA (not that much with MA), yielding a more robust ensemble projection than the raw version. This particular result is consistent with the overall improvement in the multimodel agreement found after BA application in this region, as indicated by the reduction of hatched areas in Figure 2. Figure 4 summarizes the main results by showing regional averages of the projected change signal at +2 C GWL before and after BA (Figure 4b,c), and the magnitude of the adjustment for the +2 C GWL (BA À raw, Figure 4d). This synthesis of the information clearly shows that the magnitude of the correction is, in absolute value, similar or even larger than the raw TX35 climate change signal (for the policy relevant +2 C GWL) in many IPCC-WGI reference regions, mostly within the tropical range.
This result highlights the paramount importance of the BA step in order to obtain credible TX35 change projections. Overall, the BA projections in this case yield much more intense future heat in the equatorial range in F I G U R E 4 (a) Land subset of the updated IPCC-WGI reference regions (see Iturbide et al., 2020, for details). (b, c) Climate change signals of TX35 for the 2 C warming level w.r.t. the historical values, for raw and BA (MVA) data. (d) Differences between bias-adjusted and raw data for the 2 C warming level Africa and South America, regions where the raw projections may dangerously underestimate future impacts. Moreover, these results suggest that the change in the climate change signal is mostly due to the nonlinear transformation between high temperatures and number of threshold excesses (a desired effect) with variance inflation playing a small role in this case.

| DISCUSSION AND CONCLUSIONS
Even though fundamental model errors cannot be improved by BA, and process-informed BA is in general a preferable approach (Maraun et al., 2017), BA could still be justified for highly biased climate indices such as those defined using absolute thresholds, for which the raw signal is unreliable (Dosio, 2016), as we show using TX35. In this context, some BA methods allow, by construction, the modification of the raw climate change signal at the cost of introducing some additional uncertainty inherent to the statistical adjustment of the raw model outputs. The BA application of the TX35 CMIP5 products here presented constitutes a suitable example of this, allowing for more credible future extreme heat projections.
The added value of the correction is noteworthy since the observational dataset (W5E5) has a higher native resolution allowing for a better representation of orographical features, even after the degradation of its resolution prior to BA. Similarly, over islands and predominantly insular regions, the trend is reinforced after BA application (see e.g., Figure 3, SEA region). This effect can therefore be considered as an actual "correction" of the raw model outputs based on the more reliable representation of observed conditions of the reference dataset (here W5E5) used for the adjustment.
While EQM applies a specific adjustment factor for each of the 99 percentiles of the distribution, MA acts on just one parameter (the mean) and MVA on two (mean and variance). Our results show that the three methods yield similar results (virtually identical in the case of EQM and MVA), thus supporting the use of simpler and more parsimonious BA methods (MVA in this case) for threshold-based indices. This work also paves the way for further analyses with alternative BA techniques for handling climate indices based on absolute thresholds. The increasing temperature trends throughout the 21st century projections (particularly accentuated for the RCP 8.5) pose a non-stationarity problem that is tackled in the case of the EQM using a constant correction for the outlying values of the latest percentile. However, alternative techniques may prove to be better suited to this particular extrapolation problem, and also in preserving the warming signal trends (Casanueva et al., 2020a).
This study demonstrates the need of BA for achieving a more plausible representation of future climate impacts, since observational references can aid to improve the poor GCM representation of these features due to their coarse original resolution. These results unveil a stronger and more rapid increase of the frequency of heat extremes in the future than that one may expect using the raw model outputs alone. The projected changes affect to large world land areas, some of them highly populated and vulnerable, stressing the compelling need for adaptation and mitigation strategies to face unprecedented heat extremes in the next decades.