Assessing the impact of bias correction approaches on climate extremes and the climate change signal

We assess the impact of three bias-correction approaches on present-day means and extremes, and on the climate change signal, for six climate variables (precipitation, minimum and maximum temperature, radiation, vapour pressure and mean sea level pressure) from dynamically downscaled climate simulations over Queensland, Australia. Results show that all bias-correction methods are effective at removing systematic model biases, although performance varies by variable and season. Importantly, our results are based on fully independent cross-validation, an advantage over similar studies. Linear scaling preserves the climate change signals for temperature, while quantile mapping and the distribution-based transfer function modify the climate change signal and patterns of change. The Perkins score for values above the 95th percentile and below the 5th percentile was used to evaluate how well the climate models match the observational data; bias correction improved the Perkins score for extremes for some variables and seasons. We rank the bias-correction methods based on the Kling-Gupta efficiency (KGE) score calculated over the validation period. We find that linear scaling and empirical quantile mapping are the best approaches for mean climatology over Queensland; on average, bias correction improved the KGE score by 23% annually. For extreme values, however, quantile mapping and the statistical distribution-based transfer function approaches perform best, and linear scaling tends to perform worst. Our results show that all approaches except linear scaling impact the climate change signal.


| INTRODUCTION
The outputs of global and regional climate models (GCMs and RCMs) are often biased and need to be bias-corrected against observational or reanalysis datasets for use in impact studies and end-user applications, such as assessing the impacts of climate change on climate extremes and hydrology (Eccles et al., 2021; Trancoso et al., 2020). GCMs simulate the behaviour of atmospheric circulation patterns at coarse spatial resolution (100-200 km). Dynamical downscaling with RCMs is one method to refine the spatial resolution of climate simulations, which is required for many impact studies (Andela et al., 2019; Chiew et al., 2022; Dickinson et al., 1989; Giorgi, 1990; Yang et al., 2010). Compared with GCMs, RCMs are more reliable at reproducing mesoscale patterns of local precipitation, including topographic effects on precipitation (Frei et al., 2006; Grose et al., 2019), sea-land contrasts and the representation of land cover (Boé & Terray, 2014; Suh & Lee, 2004; Syktus & McAlpine, 2016; Tian et al., 2013). However, RCMs are still subject to considerable biases, due to imperfect boundary conditions provided by reanalysis or GCM data and to systematic model errors, such as those inherited from the parameterization of climate processes (Ehret et al., 2012; Teutschbein & Seibert, 2012). Reducing model bias is important for many applications, such as hydrological, agricultural and fire risk models, as rainfall, temperature and other required climate variables need to be representative of the observational data for the modelling to be reliable (Jiang et al., 2019; Nahar et al., 2017). Bias correction is therefore a routine prerequisite of most climate change impact studies and is very commonly used (Gunavathi & Selvasidhu, 2021; Maraun, 2016; Oruc, 2022; Shen et al., 2020; Tong et al., 2021; Vogel et al., 2023).
Bias correction (or bias adjustment) is the process of scaling climate model outputs to account for their systematic errors and improve their fit to observations. Several bias-correction approaches, ranging from simple linear scaling to sophisticated quantile mapping, have been developed (Casanueva et al., 2020; Chen et al., 2013; Dowdy, 2020; Enayati et al., 2020; Feigenwinter et al., 2018; Heo et al., 2019; Lafon et al., 2013; Maraun, 2016; Mehrotra et al., 2018; Miao et al., 2016; Nahar et al., 2018; Piani et al., 2010; Smitha et al., 2018; Teutschbein & Seibert, 2012; Wu et al., 2022). Linear scaling (LS) corrects projections based on monthly errors; however, it only corrects the mean and can perform poorly for extremes and for other statistical characteristics, such as variability (Teutschbein & Seibert, 2012). Distribution mapping approaches (sometimes termed probability mapping, quantile-quantile mapping, statistical downscaling or histogram equalization) can address some of these limitations and correct variability as well as extremes (Teutschbein & Seibert, 2012). However, some of these methods have been found to inflate simulated extremes, and their use for climate change applications where extremes are important, such as flooding, has therefore been questioned (Huang et al., 2014; Pastén-Zapata et al., 2020). Most bias-correction methods also assume stationarity, that is, that the same bias-correction algorithm applies in the future as in the present (Teutschbein & Seibert, 2012; Themeßl et al., 2012). Despite these issues, evaluations of the impact of bias correction on extremes and the climate change signal are limited (Dosio, 2016).
Bias-correction methods may change the climate change signal due to intensity-dependent biases (Teutschbein & Seibert, 2012; Themeßl et al., 2012). Some authors regard the change in the climate change signal from the removal of intensity-dependent biases as an improvement over the original signal (Gobiet et al., 2015; Ivanov et al., 2018). Others regard changes to the climate change signal from bias correction as undesirable, as they may modify climate model sensitivity and distort relationships between meteorological variables (Cannon et al., 2015; Ivanov et al., 2018). Impacts on the climate change signal may be more pronounced for extremes. For this reason, when assessing bias-correction methods, it is important to evaluate the impact on the climate change signal. A further issue is that most studies only examine the impact of bias correction on temperature and precipitation; however, other variables are required in impact studies. Examples include radiation for crop models, evapotranspiration for hydrological models, and wind and humidity for fire risk models. Despite their importance, studies evaluating multiple variables are limited, and most were published only recently (e.g., Dieng et al., 2022; Guo et al., 2019; Mehrotra & Sharma, 2019; Tootoonchi et al., 2022; Van De Velde et al., 2022; Wilcke et al., 2013). There is also limited understanding of the performance of distinct bias-correction approaches across different climate variables, and as a result, modelling groups and practitioners make subjective decisions on the choice of bias-correction technique. There is therefore a need to systematically evaluate these aspects to underpin impact studies and provide more reliable advice for climate adaptation policies. The objectives of this research are to: a. assess the performance of three bias-correction methods in removing biases in mean and extreme climate for six climate variables; b. assess the impact of the bias-correction methods on the climate change signal; and c. rank the approaches based on these criteria and provide recommendations.
We undertake this assessment using an ensemble of 11 CMIP5 downscaled climate models, representative of the envelope of the GCMs' climate change signal, with gridded observational data as a benchmark over a climatic gradient (tropical, equatorial, subtropical, temperate and arid) across the state of Queensland, Australia.
Traditionally, assessments of bias-correction methods have focused on the seasonal mean climatology. However, most climate change impacts occur as climate extremes. Thus, this research examines the tails of a daily-based probability density function (PDF), which encapsulate the extreme events manifested as droughts, floods, heatwaves and bushfires across the study area. We also demonstrate how the use of different bias-correction approaches modifies the future climate change signal.

| Study area
We evaluate bias-correction methods over Queensland, Australia (Figure 1). Queensland provides a useful study area for evaluating bias-correction methods due to its diverse environment, including equatorial, tropical, subtropical, temperate and arid regions, and the presence of mountainous and coastal areas.

| Climate models
We dynamically downscaled 11 representative CMIP5 GCMs (Table 1) to a 10-km spatial resolution over Queensland using the global variable-resolution Conformal Cubic Atmospheric Model (CCAM) developed by CSIRO (McGregor, 2005; McGregor & Dix, 2008). The dynamical downscaling consisted of a two-step process. First, we ran CCAM at a globally uniform 50-km resolution (Grose et al., 2019) with bias-corrected SSTs and sea ice from the host CMIP5 GCMs for the period 1950-2099. Then, the global variable-resolution CCAM was used with a maximum spatial resolution of 10 km over Queensland. These simulations were run with the spectral nudging approach (Thatcher & McGregor, 2008), nudged with 6-hourly data from the 50-km CCAM simulations for the 1980-2099 period. For additional details on dynamical downscaling with CCAM, refer to Katzfey et al. (2016), Trancoso et al. (2020), Chapman et al. (2020), Grose et al. (2019) and Syktus and McAlpine (2016). The primary reason for selecting these CMIP5 GCMs for downscaling was the availability of monthly sea surface temperature and sea ice for the historical period and the two emission scenarios from the same model simulations in the existing CMIP5 archive. In addition, CSIRO (CSIRO & BOM, 2015) has assessed the skill of CMIP5 models in reproducing the historical climate, and the representativeness of their climate change signal for Australia, which informed our selection of the 11 CMIP5 models for downscaling. The climate models used in this study are listed in Table 1.
We bias-corrected daily precipitation, minimum and maximum temperature, mean sea level pressure, solar radiation and vapour pressure, and examined the impact on the climate change signal for RCP8.5, comparing 1986-2005 to 2079-2098, using three distinct bias-correction approaches. CCAM does not output daily average vapour pressure, so we derived it from the screen-level mixing ratio and mean sea level pressure. We note that CCAM outputs are based on its own physics rather than being representative of the host model physics, so simulation biases are expected to be mostly driven by the regional model. The biases from individual CCAM runs are nevertheless heterogeneous (see the ACCESS1-0 and HadGEM2 models in Figures S3 and S4), while remaining representative of biases generated by the regional model. This offers a unique opportunity to evaluate bias-correction approaches for simulations whose biases are dominated by regional model physics rather than by GCMs.

| Observations
We used the Australian Climate Project observational dataset (also known as SILO). The SILO dataset interpolates Australian Bureau of Meteorology (BoM) weather station data and additional observational sources to provide daily gridded surfaces for Australia (see Figure 1). The gridded SILO datasets cover the period from the 1880s to the present and are updated daily (Jeffrey et al., 2001). Since the SILO datasets have a 5-km resolution, SILO was regridded to 10 km using a conservative remapping approach prior to performing bias correction, to match the grid of the regional projections from CCAM. We used the Climate Data Operators (CDO) software to perform the regridding, which provides four options (nearest neighbour, linear, cubic and conservative remapping). Conservative remapping uses all grid cells from the original dataset that lie inside the new grid cell and does a better job of preserving integrals of the data between the source and target grids. We ran a few tests with the remapping types before deciding that the CDO operator "remapcon" was the best one for our case. Note that for vapour pressure, SILO only provides gridded datasets based on daily 9 am weather station observations (daily means are not available); we use this dataset as the reference to bias-correct our simulated vapour pressure (Jeffrey et al., 2001).

| Bias-correction approaches
We evaluated how three bias-correction approaches (and variations of them) performed for individual regional model runs for six daily variables, using SILO as the benchmark. The 1981-2000 period is used for training the bias-correction methods (calibration) and the 2001-2015 period for cross-validation assessment. The reported metrics are therefore based on 15 years of independent data (not used for calibration), an advantage over similar studies that confers higher reliability on the results. We bias-correct the historical period (1986-2005) and the RCP8.5 scenario (2079-2098) to examine the impact of bias correction on the climate change signal. The bias-correction approaches used differ substantially in complexity. Linear scaling (LS) is the simplest method, using only the monthly mean to correct model output. The statistical distribution-based transfer function (Dis_T_F) technique is the most sophisticated, using all individual quantiles (daily values) to construct the transfer functions. In between these two lie various parametric and non-parametric quantile matching variations, which use different quantiles and parameter forms to construct transfer functions.

| Linear scaling (LS) using monthly data
Linear scaling is a parametric method, which corrects model data using monthly correction factors based on the differences between observed and modelled data. Minimum and maximum temperature and mean sea level pressure are corrected by an additive term. All other climate variables are adjusted with a multiplier. The advantage of LS is its simplicity and effectiveness for mean climatology, and the disadvantage is its inadequacy for correction of extreme values (Teutschbein & Seibert, 2012).
For variables corrected using an additive term, the monthly mean is first obtained for each climate model in the calibration period (1981-2000) for comparison with observations over the same period. The difference between the observed and modelled monthly means is then added to the projections in the future period (or, similarly, the validation period), following Equation (1):

V_i,m,corr(d) = V_i,m(d) + ΔV_i,m,   (1)

where V_i,m(d) is the model value in month m on day d for climate model i, ΔV_i,m is the difference between the observed monthly mean and the mean of climate model i in month m, and V_i,m,corr(d) is the corrected temperature (or sea level pressure) in month m on day d for climate model i.
Given the positivity constraints on precipitation, vapour pressure and radiation, we correct these variables using a multiplicative factor (Equations 2 and 3):

V_i,m,corr(d) = c_i,m × V_i,m(d),   (2)

c_i,m = μ_obs,m / μ_i,m,   (3)

where c_i,m is the ratio between the observed monthly mean and the mean of climate model i in month m, and μ_m denotes the monthly mean of the observed or modelled daily data in the calibration period.
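To make the method concrete, the monthly scaling described above can be sketched for a single grid cell as follows (our own illustrative Python, not code from the study; array and function names are ours):

```python
import numpy as np

def linear_scaling(model_cal, obs_cal, model_target, months_cal, months_target,
                   additive=True):
    """Monthly linear scaling: additive for temperature and pressure,
    multiplicative for positive-definite variables such as precipitation."""
    corrected = np.array(model_target, dtype=float)
    for m in range(1, 13):
        cal = months_cal == m
        tgt = months_target == m
        if not cal.any() or not tgt.any():
            continue
        if additive:
            # add the (observed - modelled) monthly-mean difference, Equation (1)
            corrected[tgt] = model_target[tgt] + (obs_cal[cal].mean() - model_cal[cal].mean())
        else:
            # scale by the ratio of observed to modelled monthly means, Equations (2)-(3)
            corrected[tgt] = model_target[tgt] * (obs_cal[cal].mean() / model_cal[cal].mean())
    return corrected
```

In practice the same correction factors, fitted on the calibration period, are applied unchanged to the validation and future periods.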

| Quantile matching (QM)
The first two QM versions are parametric: they define a transfer function (Equation 4), which is then used to adjust the model data:

V_i,m,corr(d) = a_i,m × V_i,m(d) + b_i,m.   (4)

First, a cumulative distribution function (CDF) is calculated for each grid cell for the model and for the observations.
For the first implementation (QM_monthly), the CDF for each month is calculated using daily data for that month, drawn from all years in the calibration period. For example, January daily data from all years are used to compute the CDF for January. A linear regression is then fitted between the model and observed CDFs, which provides the parameters a_i,m and b_i,m in Equation (4). With these parameter values, model bias is corrected in the future period or in the validation period. This method is related to the ISI-MIP2b method (Hempel et al., 2013), albeit using quantile matching rather than a transfer function (Piani et al., 2010). Hempel et al. use a non-linear transfer function, assuming precipitation follows a gamma distribution; here, we use linear regression instead. Figure S1 (Supplementary Material) shows the cumulative distribution functions for one grid cell for each variable, demonstrating that the linear regression method can adjust the model values to match the observed values well.
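A minimal sketch of this parametric fit (our own illustrative code; the quantile count and least-squares fit through paired quantiles are our assumptions, not specifics from the study):

```python
import numpy as np

def fit_qm(model_cal, obs_cal, n_quantiles=99):
    """Fit a linear transfer function in the spirit of Equation (4): regress
    observed quantiles on modelled quantiles over the calibration period."""
    qs = np.linspace(0.01, 0.99, n_quantiles)
    # paired quantiles of the modelled and observed calibration CDFs
    a, b = np.polyfit(np.quantile(model_cal, qs), np.quantile(obs_cal, qs), 1)
    return a, b

def apply_qm(a, b, model_values):
    # corrected value = a * model value + b
    return a * model_values + b
```

The fitted (a, b) pair is stored per grid cell (and per month for QM_monthly) and reused on validation and future data.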
The second version of quantile matching (QM) lumps together all daily data in the calibration period. It is identical to the first version except that the CDFs are calculated from the daily data of the whole calibration period rather than from monthly data.
The third quantile matching version is non-parametric: empirical quantile matching (EQM) using monthly data (EQM_Monthly). In EQM, CDFs are calculated for each month for individual models and observations. It differs from parametric QM in that, rather than using linear regression, it uses an interpolation scheme (Feigenwinter et al., 2018; Ivanov & Kotlarski, 2017). There are different variants of empirical QM (e.g., robust EQM and smoothing splines); in a comparative study, Enayati et al. (2020) found that the empirical QM and robust QM approaches performed best for correcting RCM-simulated rainfall, while all QM methods except a parametric QM performed relatively well for RCM-simulated temperature. In this work, we used a minimization scheme to find unavailable quantile values, which can be regarded as nearest-neighbour interpolation. For values outside the training range (extrapolations), the highest (or lowest) percentile is adopted. We note that Dowdy (2020, 2023) proposed a method called Quantile Matching for Extremes (QME), which handles extreme values better than our current implementation; our tests of this method with precipitation for two models (ACCESS1-3Q and ACCESS1-0Q) confirmed that it produced better results (see Table S3 in Supplementary Material).
Given a specific daily model value V_d^model for a variable in the validation or future period, EQM finds the closest model quantile (CDF value) from the calibration period, which is then used to correct the model bias according to Equation (5):

V_d,corr = CDF_obs^-1(CDF_model(V_d^model)).   (5)
The EQM algorithm can be summarized as follows: a. compute and store the CDFs of simulations and observations for all grid cells over the calibration period; b. for each daily model value in the validation or future period, find the nearest model quantile in the calibration CDF; and c. correct the value using the observed value at that quantile, as in Equation (5).
Note that for extreme data at both ends of the distribution, extrapolation has to be employed in non-parametric QM bias correction. This is done via constant extrapolation: all simulated values below the 1st percentile are corrected with the bias attached to the 1st percentile, and simulated values above the 99th percentile are corrected with the bias attached to the 99th percentile.
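The nearest-neighbour quantile lookup with constant tail extrapolation can be sketched as below (our own illustrative code; the 1st-99th percentile resolution is our assumption):

```python
import numpy as np

def eqm_correct(model_cal, obs_cal, model_target, n_quantiles=99):
    """Empirical quantile matching: map each target model value to the observed
    value at the nearest modelled calibration quantile."""
    qs = np.linspace(0.01, 0.99, n_quantiles)
    mq = np.quantile(model_cal, qs)   # modelled quantiles, calibration period
    oq = np.quantile(obs_cal, qs)     # observed quantiles, calibration period
    # nearest modelled calibration quantile for each target value; values beyond
    # the outermost quantiles automatically map to the first/last quantile,
    # i.e., constant extrapolation at both tails
    idx = np.abs(np.asarray(model_target, dtype=float)[:, None] - mq[None, :]).argmin(axis=1)
    return oq[idx]
```

With nearest-neighbour lookup, the output is restricted to the stored observed quantiles, which is why the tails of the corrected distribution are capped at the outermost calibration percentiles.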

| Statistical distribution-based transfer functions (Dis_T_F)
The last bias-correction approach we evaluated is based on the statistical bias-correction technique of Piani et al. (2010). The original method is parametric, whereas we modified the approach to use a non-parametric technique.
In this methodology, observed and modelled daily data are sorted in ascending order and paired, so that each observed value has a corresponding simulated value. This one-to-one mapping defines a raw transfer function over the range of observed and modelled values, which is then numerically smoothed by fitting a spline through the points. Transfer functions are computed independently for each grid cell.
A number of adjustments and constraints must be imposed to ensure the transfer mapping is reliable for precipitation: (i) simulated values below a threshold (0.1 mm/day) are discarded, along with the corresponding observed values, to ensure the vectors of model and observed data have the same length; and (ii) both the modelled and observed datasets must have a proportion of rain days equal to or exceeding the equivalent of 20 days/decade, where a rain day is any day with rainfall exceeding the 0.1-mm threshold. If either dataset does not have sufficient rain days, a transfer function is not constructed (our dry/wet day counting indicates that only a minority, below 5%, of grid cells fall into this category). In that case, bias correction is instead done with simple linear scaling: an additive term Δ is computed as the difference between the observed and modelled means over the calibration period, the bias-corrected values are given by V_model + Δ, and negative values are truncated to zero.
In this study, transfer functions were not calculated for each month; a single transfer function was calculated from daily data for the whole calibration period, following the approach of Piani et al. (2010).
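An illustrative sketch of the sorted one-to-one pairing (our own code, not from the study; piecewise-linear interpolation via np.interp stands in for the spline smoothing, and the precipitation-specific thresholding is omitted):

```python
import numpy as np

def build_transfer_function(model_cal, obs_cal):
    """Pair sorted modelled values one-to-one with sorted observed values and
    return a callable transfer function over that range."""
    x = np.sort(np.asarray(model_cal, dtype=float))  # modelled values, ascending
    y = np.sort(np.asarray(obs_cal, dtype=float))    # observed values, ascending
    # np.interp interpolates between the paired points; a spline fit would
    # smooth them instead, as done in the study
    return lambda v: np.interp(v, x, y)
```

The returned function is then applied to daily model values in the validation or future period; one such function is built per grid cell.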

| Wet day frequency correction
We also examined the impact on precipitation biases of correcting wet day frequency prior to correcting wet day intensity using a dynamical thresholding approach. We compared the results for Dis_T_F for two climate models (ACCESS1-0 and HadGEM2-CC) with and without dynamical thresholding applied prior to the intensity correction.
Our dynamical thresholding approach corrects both wet days (the drizzle effect) and dry days (Ivanov et al., 2018; Van De Velde et al., 2022). It is based on the dry day counts of the observed and modelled precipitation datasets. Instead of a simple static threshold (e.g., 0.1 mm), the dynamical thresholds are grid-, model- and month-dependent. They are determined by comparing the modelled and observed distribution functions in the calibration period and finding the model threshold value for each grid cell such that the numbers of dry (wet) days match the observations. If the model has more wet days than the observations, the threshold can be much larger than 0.1 mm; if it has fewer, the threshold can be much smaller than 0.1 mm; and if the wet day counts are equal, the threshold is set to 0.1 mm. Through such frequency corrections, the numbers of dry and wet days match between model and observations. Most climate models have too many wet days, and correcting for this effect is common in bias-correction studies (Van De Velde et al., 2022). Excessive dry days relative to observations is unusual and thus rarely corrected; however, in areas where the model has too many dry days, leaving this uncorrected can cause quantile mapping approaches to produce a systematic wet bias after bias correction (Themeßl et al., 2012). These corrections are usually performed before the intensity bias correction. For this experiment, however, we want to understand the impact of each bias-correction approach on precipitation in isolation; had we applied a pre-correction, we could not have separated the effects of the pre-correction and the bias-correction approaches in the performance scores.
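The threshold search can be sketched as a single quantile lookup per grid cell and month (illustrative code; function and variable names are ours):

```python
import numpy as np

def dynamic_threshold(model_pr, obs_pr, obs_threshold=0.1):
    """Find the model precipitation threshold that equalizes the dry-day
    fraction between model and observations for one grid cell."""
    dry_frac_obs = np.mean(obs_pr <= obs_threshold)
    # the model value at the observed dry-day fraction: zeroing all model
    # values at or below it leaves the model with the observed number of wet days
    return np.quantile(model_pr, dry_frac_obs)
```

Model values at or below the returned threshold are then set to zero before the intensity correction, so the corrected series inherits the observed wet-day frequency.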
We evaluated bias-correction performance using the Kling-Gupta efficiency (KGE; Gupta et al., 2009) for seasonal mean climatology. The KGE combines the three components of the Nash-Sutcliffe efficiency, namely correlation, bias and variability, into one metric (Equation 6), which is then used to rank models and bias-correction methods for the six climatic variables:

KGE = 1 - sqrt((r - 1)^2 + (β - 1)^2 + (γ - 1)^2),   (6)

where r is the correlation component (Pearson's correlation coefficient), β measures bias and is the ratio of estimated and observed means, and γ measures variability and is the ratio of estimated and observed coefficients of variation (γ = α/β), with α the ratio of estimated and observed standard deviations (STD). Note that seasonal (DJF/JJA) correlations are based on all grid points, not on the area average per month, so the sample size is substantially greater than 3.
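A minimal sketch of the KGE computation for one variable and season (our own illustrative code):

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency with the variability term expressed as a ratio
    of coefficients of variation (gamma = alpha / beta)."""
    r = np.corrcoef(sim, obs)[0, 1]    # correlation component
    beta = sim.mean() / obs.mean()     # bias: ratio of means
    alpha = sim.std() / obs.std()      # ratio of standard deviations
    gamma = alpha / beta               # ratio of coefficients of variation
    return 1.0 - np.sqrt((r - 1.0)**2 + (beta - 1.0)**2 + (gamma - 1.0)**2)
```

A perfect simulation gives KGE = 1; each unit of error in correlation, bias or variability pulls the score down symmetrically.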
In addition, we used the Perkins skill score (Perkins-Kirkpatrick et al., 2007) to evaluate bias-correction methods at daily time steps. The Perkins skill score evaluates performance based on the similarity between the modelled and observed probability density functions (PDFs). The binning of data to construct histograms was based on the distribution of the observed data. If a model simulates the observed PDF poorly, the Perkins skill score will be close to zero; if the PDF is well simulated, the score will approach its maximum of 1. The Perkins metric has the advantage of measuring the ability of a climate model across the whole PDF, and it can also be applied to subsets of values, for example below the 5th percentile or above the 95th percentile (i.e., the lower and upper tails of the simulated distribution). The 5th and 95th percentiles are very commonly used indices, recommended by the Expert Team on Climate Change Detection and Indices (ETCCDI), so it is useful to know how bias correction affects them.
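The score and its tail-based variants can be sketched as follows (our own illustrative code; the bin count is our assumption, not a value from the study):

```python
import numpy as np

def perkins_score(sim, obs, n_bins=50):
    """Perkins skill score: sum of the bin-wise minimum of the two empirical
    PDFs, with bins defined on the observed data. 1 means identical PDFs."""
    bins = np.histogram_bin_edges(obs, bins=n_bins)
    p_sim, _ = np.histogram(sim, bins=bins)
    p_obs, _ = np.histogram(obs, bins=bins)
    # normalize by sample size so values outside the observed range score zero
    return np.minimum(p_sim / len(sim), p_obs / len(obs)).sum()

def perkins_tail(sim, obs, q=0.05, upper=False):
    """Tail-based variant: score computed only on values below the q-th
    (or above the (1-q)-th) observed quantile."""
    if upper:
        thr = np.quantile(obs, 1.0 - q)
        return perkins_score(sim[sim > thr], obs[obs > thr])
    thr = np.quantile(obs, q)
    return perkins_score(sim[sim < thr], obs[obs < thr])
```

Because the tails hold few samples, these tail scores are noisier and typically lower than the whole-distribution score, as the results below illustrate.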

| Biases for historical mean climatology
Figure 2 shows the annual ensemble mean bias for the six climatic variables. After bias correction, there is better agreement between the climate models and observations for all six climatic variables except precipitation, and the performance of all bias-correction methods is similar for the mean climate. For precipitation, linear scaling and EQM perform best, and wet biases are present for QM, QM_Monthly and Dis_T_F. The wet biases are most prevalent in areas where the climate is dry. We also found that in some of these arid regions the climate models have fewer dry days than the observations (Supplementary Material, Figure S2). We tested the impact of including a wet day frequency correction for the Dis_T_F method and found that it reduced the precipitation biases (Supplementary Material, Figures S3 and S4). While correcting for too many dry days is uncommon, it is likely important in parts of Australia, owing to the arid climate. Ensemble mean seasonal biases for all variables are in the Supplementary Material (Figure S5a-f). After bias correction, the model biases are reduced substantially for the majority of variables and seasons, excluding precipitation. The three bias-correction methods based on monthly data (QM_Monthly, LS and EQM_Monthly) yield larger improvements in DJF and JJA than the bias-correction methods that use all available data (QM and Dis_T_F). This is most apparent for minimum temperature in JJA, where large biases remain even after correction, though the approaches that use monthly data have biases half the size of those that do not. For precipitation, all bias-correction approaches reduce biases in DJF (austral summer); however, QM, QM_Monthly and Dis_T_F are not as effective as the other two approaches at reducing biases in JJA (austral winter). Performance is also improved for Dis_T_F when dynamical thresholding is used to correct wet day frequency prior to correcting wet day intensity.
Figure 3 shows the ensemble mean for correlation, variability and KGE score for the bias-correction methods annually, and for DJF and JJA. The ensemble mean correlation and variability are improved by all methods: on average, correlation improves by 3% annually and variability by 16% annually. This, together with the bias improvement, leads to an average improvement in the KGE score of 23% annually.
In terms of correlation (Column 1, Figure 3), most bias-correction methods improve correlation compared with the raw data. The correlation statistics for the five bias-correction approaches indicate better correlations than RAW CCAM (except for JJA precipitation), though the RAW CCAM models have nearly perfect correlations for most variables and seasons.
In terms of variability (Column 2, Figure 3), the raw models have higher variability than the observations for solar radiation. Most bias-correction approaches improve the agreement with observations compared with RAW CCAM, with some exceptions (e.g., precipitation in JJA and minimum temperature in DJF). For variability (expressed as the coefficient of variation, CV), several RAW CCAM models (models 2, 5, 7, 8 and 9) outperformed two bias-correction approaches (QM and Dis_T_F) for precipitation in JJA and mean sea level pressure in DJF, showing that bias correction can sometimes degrade simulations rather than enhance them.
In terms of KGE (Column 3, Figure 3), there are significant improvements after bias correction for all variables annually and in DJF. In JJA, two bias-correction approaches (QM and Dis_T_F) degrade performance for daily precipitation, but for the other variables, bias correction improves upon RAW CCAM. RAW CCAM has the highest KGE scores for maximum temperature, for which bias correction has a large impact only in DJF. Performance improvements in KGE scores are more apparent for solar radiation, mean sea level pressure, vapour pressure and minimum temperature.
Additional results for individual model runs are provided in the Supplementary Material (Figure S7). The biases from individual CCAM runs can be quite different. For example, the two ACCESS models (one wet and one dry) have different biases and KGE scores for precipitation; likewise, the two GFDL models differ. Generally, the greater the biases of an individual model, the lower its KGE score. There is no single best bias-correction technique: the KGE scores are variable-, model- and season-dependent. Bias correction improves the KGE scores for most models and variables, with several exceptions (e.g., precipitation in JJA).

| Assessment of bias-correction methods for extreme climate
At daily frequency, we used the Perkins skill score to assess the entire distribution as well as the values below the 5th percentile and above the 95th percentile for the six variables (Figure 4). The Perkins skill score evaluates models based on the similarity between the modelled and observed probability density functions (PDFs) and is a useful way to evaluate model performance at daily time steps.
The overall Perkins score results (Row 1, Figure 4) show significant improvements in the PDFs after bias correction for most variables. However, for precipitation, performance is degraded after bias correction for LS and Dis_T_F. Bias correction improves the Perkins score for temperature (minimum and maximum), mean sea level pressure, solar radiation and vapour pressure. For minimum temperature, RAW CCAM performs worst in JJA; however, this is improved by 10%-20% with bias correction.
The tail-based Perkins scores (Perkins below the 5th percentile and above the 95th percentile, in the second and third rows of Figure 4) were devised to quantify how well a model replicates the observed extremes. While the overall Perkins scores encompass the tails of the distributions, the frequency of extreme events within the tails is much lower than elsewhere in the PDF, so the tails contribute little to the overall scores. Generally, tail-based Perkins scores are much lower than overall Perkins scores for both raw and bias-corrected CCAM results. This is particularly true for Perkins below the 5th percentile, indicating the challenge of accurately modelling extreme values in the lower tail. For example, in DJF, two bias-correction methods (QM and QM_Monthly) improve the Perkins score below the 5th percentile for rainfall; in JJA, however, performance is the same or worse for most methods. For minimum temperature, only two methods (LS and Dis_T_F) improve performance in DJF, though performance remains quite poor. In JJA, only linear scaling matches the performance of the raw CCAM models, and all other methods degrade performance. For radiation, all methods improve performance in DJF; in JJA, performance is the same or worse for all methods. Vapour pressure is improved by all methods in both DJF and JJA. We also calculated the Perkins scores for values below the 5th percentile for individual models; the results are shown in Figure S6a,b in the Supplementary Material.
Similarly, the Perkins scores for the values above the 95th percentile are shown in the last row of Figure 4 (see Figure S6c,d in the Supplementary Material for individual model results). The majority of bias-correction methods improve performance for all variables for Perkins above the 95th percentile annually. The results are more mixed when examined seasonally. In general, scores are lower in DJF for both RAW CCAM and the bias-correction methods. For rainfall, only EQM-Monthly improves upon RAW CCAM. For minimum and maximum temperature, radiation and vapour pressure, most methods improve upon RAW CCAM in DJF. For mean sea level pressure, no method improves upon RAW CCAM; QM and Dis-T-F perform particularly poorly, with a score of 0.2. In JJA, the RAW CCAM result for rainfall is quite high (0.9), and no bias-correction method improves upon it. For all other variables, most methods improve the Perkins above the 95th percentile score during JJA.
The poorer performance for precipitation dry extremes compared with the high extremes may be due to the lack of correction for dry-day frequency. The worse performance in JJA than in DJF supports this, as JJA is the dry season and should have more dry days. The poor performance of linear scaling for Perkins above the 95th percentile may be due to the limited number of percentiles in the tail region of the precipitation cumulative distribution functions; variability there is much larger, so parametric bias-correction approaches might capture such behaviour better than non-parametric approaches.

| Impact of bias-correction methods on climate change signal
We investigated the impact of the bias-correction methods on the climate change signal (Figure 5). All methods preserve the direction of change in the climate change signal; however, for some variables and methods, the magnitude of change was altered. For maximum temperature, most methods, excluding linear scaling, slightly decreased the magnitude of the climate change signal: prior to bias correction, the area-averaged ensemble mean increase in maximum temperature was 4.1°C; after bias correction (excluding linear scaling), the increase was between 3.7 and 3.9°C. For minimum temperature, most methods slightly increased the magnitude of the climate change signal: prior to bias correction, the area-averaged ensemble mean increase in minimum temperature was 4.4°C; after bias correction (excluding linear scaling), the increase was between 4.6 and 5.2°C.
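Additive linear scaling leaves the projected change untouched by construction, because the same calibration-period offset is applied to both the present-day and future series. A toy sketch (illustrative values, not the study's implementation):

```python
import numpy as np

def linear_scaling_additive(series, model_hist_mean, obs_mean):
    # One constant additive offset, estimated over the calibration period.
    return series + (obs_mean - model_hist_mean)

hist = np.array([20.0, 21.0, 19.5])   # model, present day (toy values)
fut = hist + 4.1                      # model, future: +4.1 degree signal
obs_mean = 18.0
corr_hist = linear_scaling_additive(hist, hist.mean(), obs_mean)
corr_fut = linear_scaling_additive(fut, hist.mean(), obs_mean)
# corr_fut.mean() - corr_hist.mean() still equals the raw 4.1 signal,
# while corr_hist.mean() now matches obs_mean.
```

Quantile-based methods, by contrast, apply different corrections at different quantiles, which is why they can alter the magnitude of the change signal.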
For precipitation, some bias-correction methods increased the magnitude of change, while others decreased it. All methods preserved the sign of the climate change signal. Prior to bias correction, the area-averaged ensemble mean change in annual rainfall was 2.6%; after bias correction, it was between 0.3% and 3.2%. Precipitation changes were more pronounced in JJA; however, as this is the dry season, present-day precipitation is low, so this represents a small change in absolute terms.
For mean sea level pressure and radiation, the change in the climate change signal is similar regardless of bias-correction method. Prior to bias correction, the ensemble mean change in mean sea level pressure is −0.6 hPa; after bias correction, it is between −0.6 and −0.5 hPa. For vapour pressure, the bias-correction methods tend to decrease the magnitude of the climate change signal. Prior to bias correction, the ensemble mean change is 4 hPa; after bias correction, it is between 3.5 and 4.1 hPa.
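For context on the vapour pressure figures above, the Clausius-Clapeyron relation implies that saturation vapour pressure grows by roughly 6%-7% per degree of warming at typical surface temperatures. A quick check using the Bolton (1980) approximation (illustrative; not code from the study):

```python
import math

def saturation_vapour_pressure(t_c):
    """Bolton (1980) fit: saturation vapour pressure in hPa, t_c in deg C."""
    return 6.112 * math.exp(17.67 * t_c / (t_c + 243.5))

# Fractional increase per degree of warming near 25 deg C.
rate = saturation_vapour_pressure(26.0) / saturation_vapour_pressure(25.0) - 1.0
print(f"{rate:.1%} per degree C")  # prints "6.1% per degree C"
```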

| Overall bias-correction method ranking
We ranked the bias-correction methods using the KGE score (Table S1, Supplementary Material). The methods that incorporate monthly data perform the same as, or better than, those that do not, for all variables (see Table S1). For mean climatology, RAW CCAM is ranked lowest, while LS, EQM-Monthly and QM-Monthly are the top three performing methods.
The total KGE scores for the ensemble average of the six climate variables are listed in the last two columns of Table S1. Overall, the EQM-Monthly and LS approaches have the highest score for mean climatology (16.4), followed by QM-Monthly (16), QM (15.4) and Dis-T-F (15.3). RAW CCAM has the lowest score for mean climatology (12.5). Note that 18 is the perfect score across six variables and three seasons.
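The KGE behind these rankings combines correlation, variability and bias terms. A sketch of the modified form (Kling et al., 2012), in which the variability term is a ratio of coefficients of variation, consistent with the metrics listed for Figure 3 (illustrative, not necessarily the study's exact formulation):

```python
import numpy as np

def kge_prime(sim, obs):
    """Modified Kling-Gupta efficiency: 1.0 is a perfect score."""
    r = np.corrcoef(sim, obs)[0, 1]                 # correlation term
    gamma = ((np.std(sim) / np.mean(sim)) /
             (np.std(obs) / np.mean(obs)))          # variability (CV) ratio
    beta = np.mean(sim) / np.mean(obs)              # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (gamma - 1) ** 2 + (beta - 1) ** 2)
```

Summing one such score per variable and season (six variables, three seasons) gives the maximum total of 18 quoted above.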
The Perkins score is useful for examining extremes, and the ranks for individual variables are provided in Table S2 (a-c; columns 2-7). Based on the PDF comparisons against the observation-based SILO dataset, EQM-Monthly ranked best for most variables except precipitation. For precipitation, the two parametric bias-correction approaches perform better for the extreme values. Linear scaling is ranked last based on the Perkins score and is not recommended for extreme value analysis: it cannot capture the tail behaviour for precipitation at all, as it uses only the monthly mean to correct the whole distribution.
For the majority of models and variables, bias correction improves the Perkins scores over the RAW CCAM models. The total Perkins score and the ranking of the bias-correction methods are shown in Table S2 (last two columns) of the Supplementary Material. While performance is variable-dependent, EQM-Monthly has the highest total score when the entire PDF, the values below the 5th percentile and the values above the 95th percentile are considered together. QM-Monthly ranks second by total Perkins score, followed by QM and Dis-T-F. Linear scaling ranks last in total score, though it is still better than RAW CCAM.

| DISCUSSION
We evaluated and ranked the performance of three bias-correction methods for mean and extreme climate for six variables: minimum and maximum temperature, precipitation, radiation, vapour pressure and mean sea level pressure. We found that the majority of bias-correction methods improved performance in the validation period; however, QM, QM-Monthly and Dis-T-F struggled with precipitation.
The poor performance for precipitation is directly linked to wet-day frequency. When dynamical thresholding was used to correct both excess wet days and excess dry days in the climate model prior to correcting wet-day intensity, biases were reduced. Most climate models have excessive wet days (the 'drizzle effect'), and this is often corrected in bias-correction studies (Ivanov et al., 2018; Van De Velde et al., 2022). Correction for excessive dry days is less common, partly because it is a less common issue than excess wet days and partly because the correction is more complex than that for the drizzle effect. However, these results show that in arid parts of Australia, CCAM simulations may have overestimated the number of dry days, and correcting for this may be required before correcting rainfall intensity to avoid a systematic wet bias (Themeßl et al., 2012).
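One simple way to realise a non-static wet-day threshold is to pick, per model series, the model quantile that reproduces the observed wet-day frequency. This is an illustrative sketch of the idea only, not the authors' exact dynamical-thresholding algorithm; in particular, the too-many-dry-days case would need an additional step that adds wet days, which is not shown:

```python
import numpy as np

def match_wet_day_frequency(model_pr, obs_pr, obs_wet_threshold=0.1):
    """Zero out model drizzle so the modelled wet-day frequency matches
    the observed one (handles the 'drizzle effect' only)."""
    obs_wet_frac = np.mean(obs_pr > obs_wet_threshold)
    # Model threshold = the model quantile at the observed dry-day fraction.
    thr = np.quantile(model_pr, 1.0 - obs_wet_frac)
    return np.where(model_pr > thr, model_pr, 0.0), thr
```

Applying this before correcting wet-day intensity avoids mapping drizzle days onto genuine observed rainfall amounts.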
We note that other approaches exist, such as Singularity Stochastic Removal (SSR; Vrac et al., 2012), which can improve bias-correction performance by correcting both occurrence and intensity of precipitation; SSR treats too many wet days and too many dry days in the same way. Lehner et al. (2020, 2021) provided an algorithm that follows the bias adjustment and adds additional wet days in order to reproduce the observed precipitation sums and wet-day frequency. Tschöke et al. (2017) proposed a methodology for handling null precipitation values in order to improve the dry-day counts in the simulations. Casanueva et al. (2020) tested several standard and state-of-the-art bias adjustment methods with different dry/wet-day corrections. These methods handle wet-day frequency differently, but all use a fixed wet-day threshold of 1 mm, unlike the non-static (dynamical) thresholding technique used in this work, which is possibly more appropriate when both issues are present.
For extremes, bias correction improved Perkins below the 5th percentile for the ensemble mean for minimum temperature, radiation and vapour pressure. However, for precipitation, maximum temperature and mean sea level pressure, bias correction degraded performance. For Perkins above the 95th percentile, the majority of bias-correction methods improved performance. We also ranked the performance of the bias-correction methods, and found that linear scaling ranked last for extremes, while EQM-Monthly ranked best according to the model-averaged Perkins scores for the six variables.
In previous evaluation studies, Ivanov and Kotlarski (2017) pointed out that QM methods outperform simple correction of the mean bias (similar to linear scaling in this work), especially with respect to distribution-tail statistics. Despite using different implementations of quantile matching, our findings are generally consistent with previous studies. The performance of the EQM-Monthly method is generally better than that of the parametric QM or linear scaling methods for most variables. This study supports the growing body of evidence that empirical QM is one of the best-performing BC methods (refer to the comparative evaluation studies by Luo et al. (2018) and Enayati et al. (2020) for temperature and precipitation). In a recent study, Niranjan et al. (2022) found that empirical QM methods are relatively better at correcting the quantiles, with calibrated precipitation close to the observed cumulative distribution. On the other hand, for extreme rainfall, the skill of precipitation calibrated through parametric QM methods seems promising and suitable for flood forecasting applications.
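Empirical quantile mapping replaces each model value with the observed value at the same empirical quantile. A minimal sketch (illustrative; a monthly variant such as EQM-Monthly would presumably restrict the calibration samples to each calendar month):

```python
import numpy as np

def eqm_correct(x, model_calib, obs_calib):
    """Map x through the model's empirical CDF, then invert the observed CDF."""
    # Empirical non-exceedance probability of each value of x in the model climate.
    probs = np.searchsorted(np.sort(model_calib), x, side="right") / len(model_calib)
    # Read the same probabilities off the observed distribution.
    return np.quantile(obs_calib, np.clip(probs, 0.0, 1.0))
```

Because the mapping is quantile-specific, it corrects the tails differently from the centre of the distribution, unlike a constant linear-scaling offset.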
We found that only linear scaling consistently preserved the climate change signal among the examined bias-correction methods. For the other bias-correction methods, the direction of change was preserved but the magnitude was not; the effect, however, was generally small. In most applications (Cannon et al., 2015; Lehner et al., 2020; Lehner et al., 2021), trend preservation may be regarded as an advantage. However, other studies argue that, provided the application of intensity-dependent bias correction is scientifically appropriate, modification of the climate change signal (CCS) can be a desirable effect (Ivanov et al., 2018).
The impact of bias correction on the climate change signal for vapour pressure, however, may be important. Under RCP8.5, the 11 GCM/CCAM simulations projected an increase of approximately 15%-29% (ensemble average increase of 25%) by the end of the 21st century (2079-2098 relative to 1986-2005). This corresponds to an annual increase in vapour pressure of about 6% °C⁻¹ of warming for the ensemble average, consistent with the Clausius-Clapeyron relationship. Seasonally, there are some variations, but the increases are estimated to lie in the range of 22%-26% for model-averaged vapour pressure in the study area. While for the ensemble mean the increase in vapour pressure was generally lower after bias correction than before, this was not the case for all models: for some models, some of the bias-correction methods resulted in a larger increase in vapour pressure, above the Clausius-Clapeyron relationship. Depending on the application of the bias-corrected outputs, this impact may be important to be aware of (e.g., if the bias-corrected data were then used in a dynamic climate model). Cannon et al. (2015) investigated how well quantile mapping methods preserve changes in quantiles and extremes, arguing that projected trends should be preserved following bias correction so that the underlying model's climate sensitivity is maintained. For precipitation, preserving the relative change signal is necessary to maintain physical scaling relationships with model-projected temperature changes (e.g., the Clausius-Clapeyron equation); thus, for precipitation, it might be an advantage to maintain model-projected relative changes. Quantile mapping algorithms, however, have been shown to modify the magnitude of projected trends in mean precipitation (Hagemann et al., 2011; Maraun, 2013; Maurer & Pierce, 2014; Themeßl et al., 2012; Tong et al., 2021). Cannon et al. (2015) presented an innovative form of quantile mapping, quantile delta mapping (QDM), which explicitly preserves relative changes in all quantiles of a distribution. They also compared QDM against detrended quantile mapping (DQM), which preserves changes in the mean, and against the standard QM algorithm. Their results indicated that QM can inflate relative trends in precipitation extremes indices, whereas DQM and QDM are less prone to this issue. Thus, quantile delta mapping has advantages over standard QM in preserving the CCS and extreme values, at the price of slightly greater computational cost.
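The QDM idea can be sketched in a few lines for the multiplicative (precipitation-style) case; this is an illustrative reading of Cannon et al. (2015), not their reference implementation:

```python
import numpy as np

def qdm_multiplicative(fut, model_hist, obs_hist):
    """Quantile delta mapping: bias-correct while preserving relative change."""
    # Empirical quantile of each future value within the future model climate.
    tau = np.searchsorted(np.sort(fut), fut, side="right") / len(fut)
    tau = np.clip(tau, 1e-6, 1.0)
    # Model-projected relative change at that quantile.
    delta = fut / np.quantile(model_hist, tau)
    # Bias-correct via the observed quantile, then re-apply the change.
    return np.quantile(obs_hist, tau) * delta
```

When the historical model already matches the observations, the output reduces to the raw future series; rescaling the observations rescales the output while leaving the relative change at every quantile intact.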
It is important to use several bias-correction methods for cross-checking in climate and hydrological assessment studies, even though the performance of bias-correction methods can depend on the RCM, region, topographic conditions and reference dataset, among other factors (Oruc, 2022).

| CONCLUSION
This research assessed how three bias-correction approaches applied to six climate variables affect mean climatology, extremes and the climate change signal. Novelties of the assessment include: (i) a higher number of climate metrics, (ii) a higher number of ensemble runs, (iii) a focus on extreme events, (iv) evaluation of the impact on the climate change signal and (v) a more robust experimental design with fully independent cross-validation.
We found all bias-correction methods improved results compared with RAW CCAM. LS and EQM performed slightly better for most seasons and climate variables for mean climatology. QM-Monthly linear regression is generally applicable to all the climatic variables, but the bias-corrected results are not equally good for precipitation. Correction of the drizzle effect, and of too many dry days, is required in Australia to avoid a systematic wet bias in arid areas. Our dynamical thresholding method can correct both dry-day and wet-day occurrence for precipitation and can reduce precipitation model biases substantially.
For future climate projections, linear scaling (with an additive term) strictly preserves the climate change signal, which may be preferable under some circumstances. However, for analysis of extremes, linear scaling is not recommended, as it performs worst for most variables. EQM-Monthly did well for most variables and is therefore the best candidate when all three aspects studied are considered together. We therefore recommend its use for univariate bias correction, despite some relative weaknesses, such as its low Perkins scores for precipitation values below the 5th percentile.

FIGURE 1 Map showing domain of analysis (bounding box) and elevation, derived from SRTM (Shuttle Radar Topography Mission; Gallant et al., 2009). mASL = metres above sea level. Climate zones were based on the modified Köppen climate classification (Bureau of Meteorology, 2005).

FIGURE 2 Annual mean percentage bias maps before and after bias correction, for precipitation, min/max temperature, mean sea level pressure, solar radiation and vapour pressure. Validation period is from 2001 to 2015. RAW and bias-corrected CCAM results are shown for the ensemble mean of the 11 CMIP5 models. Observational data (SILO) are shown on the left. Note that the frequency of wet days was not corrected prior to correcting wet-day intensity. Average biases are annotated on the figure panels for reference.

FIGURE 3 Heatmaps for bias-correction methods and six climatic variables across three austral calendar seasons for the validation period (2001 to 2015) using correlation, variability (coefficient of variation) and KGE for annual (row 1), DJF (row 2) and JJA (row 3). KGE scores indicate the overall performance of the bias-correction approaches/models. Performance metrics are shown for ensemble means.

FIGURE 4 Perkins scores for six daily climatic variables across annual (ANN), summer (DJF) and winter (JJA) seasons in Queensland. Validation period is from 2001 to 2015. Results shown for the ensemble mean. Models were compared with the SILO observational dataset.

FIGURE 5 (a, b) Heatmaps of the climate change signal (percentage difference for precipitation; absolute difference for other variables, between future and present) from 1986-2005 to 2079-2098. RAW and bias-corrected CCAM results are shown in each column; individual models and the ensemble mean are shown in each row.
List of CMIP5 models downscaled by the Conformal Cubic Atmospheric Model (CCAM) for this study.