Evaluation of some distributional downscaling methods as applied to daily precipitation with an eye towards extremes

Statistical downscaling (SD) methods used to refine future climate change projections produced by physical models have been applied to a variety of variables. We evaluate four empirical distributional type SD methods as applied to daily precipitation, which because of its binary nature (wet vs. dry days) and tendency for a long right tail presents a special challenge. Using data over the Continental U.S., we employ a 'Perfect Model' approach in which data from a large-scale dynamical model is used as a proxy for both observations and model output. This experimental design allows for an assessment of expected performance of SD methods in a future high-emissions climate-change scenario. We find performance is tied much more to configuration options than to the choice of SD method. In particular, proper handling of dry days (i.e., those with zero precipitation) is crucial to success. Although SD skill in reproducing day-to-day variability is modest (~15-25%), about half that found for temperature in our earlier work, skill is much greater with regard to reproducing the statistical distribution of precipitation (~50-60%). This disparity is the result of the stochastic nature of precipitation, as pointed out by other authors. Distributional skill in the tails is lower overall (~30-35%), although in some regions and seasons it is small to non-existent. Even when SD skill in the tails is reasonably good, in some instances, particularly in the southeastern United States during summer, absolute daily errors at some gridpoints can be large (~20 mm or more), highlighting the challenges in projecting future extremes.

While both temperature and precipitation are expected to change, there are some fundamental differences. Temperature is eventually expected to rise everywhere, whereas the sign of the precipitation response varies by location, season, and between models. Indeed, different RCMs forced by the same GCM can yield opposite-signed precipitation responses (Karmalkar, 2018; Holtanova et al., 2019). Hence, projecting changes in precipitation is particularly challenging, especially at the regional scale.
Variations in temperature and precipitation differ fundamentally in that temperature varies more smoothly, both spatially and temporally, while precipitation is often discontinuous in both space and time. Accordingly, the distribution of precipitation is often highly asymmetrical and skewed, whereas that of temperature is usually better behaved. Additionally, precipitation is characterized by two aspects, the (binary) occurrence and the distribution of amounts on wet days, which makes statistical modelling of precipitation much more difficult.
Of particular relevance is the historical and projected increase in extreme precipitation events (IPCC, 2013). Compounding this is the disproportionate contribution of rare heavy events (Pendergrass and Knutti, 2018): globally, the wettest 12 days typically account for approximately half the annual total, with this concentration projected to increase in the future.
To mitigate the deficiencies of physical climate models and provide information better suited for local areas to policymakers, a wide variety of statistical downscaling (SD) techniques have been developed (Maraun and Widmann, 2018). Recently, this author team, members of the Geophysical Fluid Dynamics Laboratory (GFDL) Empirical Statistical Downscaling (ESD) team (https://www.gfdl.noaa.gov/esd_eval), has focused on evaluating some of these techniques. For an SD overview and our evaluation philosophy see our earlier works and cited references (Dixon et al., 2016; Lanzante et al., 2018; Lanzante et al., 2019a, hereafter L19a; Lanzante et al., 2019b, hereafter L19b). As a caution we note that even the best SD methods will fail to produce credible results if the driving physical climate model is flawed in its representation of circulation features (Hall, 2014; Maraun et al., 2017). Furthermore, large-scale models such as GCMs may not realistically represent the sub-grid processes needed to simulate extreme precipitation (Giorgi et al., 2016; Maraun et al., 2017), in which case high-resolution models may be needed.
We use a Perfect Model (PM) approach to test the 'stationarity assumption' inherent to all SD methods, which implicitly assume that relationships defined during a historical period, used to calibrate the method against observations, remain valid in a future epoch when the climate has changed. The PM provides 'future observations' which do not exist in the real world. However, as discussed below (see 2.1), the idealized nature of our PM design does not allow us to assess all sources of non-stationarity. This is a follow-up to our recent SD work which assessed and improved representation of tails (L19a) and assessed daily maximum temperature (L19b). The methods we consider here and previously (L19a; L19b) are from a class of SD techniques that operate distributionally, and thus are expected to be better suited than other SD techniques for reproducing extremes (i.e., tails). It is worth noting that this exercise is at a severe disadvantage (Maraun, 2013) because deterministic methods, such as those used here, cannot bridge the scale mismatch between GCM and observations, particularly for precipitation, which has considerable sub-grid-scale variability (Chen and Knutson, 2008). Nevertheless there is value in our assessment because: (a) SD output from these methods is widely used in impact studies, (b) SD methods can provide bias correction, and (c) SD methods are often embedded in multivariate methods capable of bridging the scale mismatch.
2 | DATA AND METHODOLOGY

2.1 | Data

GFDL-HiRAM-C360 model data were introduced by Dixon et al. (2016) and used by Lanzante et al. (2018), L19a and L19b. We provide only a brief description; the reader is referred to these earlier works, especially Dixon et al. (2016), for more details, as well as to appendix A of L19a for data availability.
Daily precipitation values covering a rectangular region surrounding the Continental United States (CONUS), excluding oceanic points, constitute our PM data. Thirty years of data from a GCM driven by historical forcings cover the period 1979-2008. Thirty years of data, based on three 10-year ensemble members driven by forcings from a high-emissions scenario (RCP8.5), cover the period 2086-2095.
Via our shorthand we refer to the high-resolution data as 'observations' or 'OBS' (O_h and O_f for the historical and future periods, respectively) and to the spatially averaged data as 'model' or 'GCM' output (M_h and M_f), even though all are GCM data; D_f denotes the downscaled future values.
Note that SD methods are typically faced with two challenges: (a) the spatial scale mismatch between model and OBS and (b) model biases. In common usage the term 'statistical downscaling' is something of a misnomer, lacking explicit mention of (b). The fact that model values represent spatial averages gives rise to (a); strictly speaking, the SD methods we use are bias correction methods. Since we use a single physical model to generate both model and pseudo-observations, our PM design explicitly introduces only challenge (a); thus we are not able to assess non-stationarities resulting from model biases in mean state or in climate change signals. More complex PM designs, deriving 'OBS' and 'GCM' values from two different physical models, would also explicitly introduce challenge (b). However, by way of spatial averaging, our approach can introduce biases implicitly by altering distributions. We have chosen the simpler design in this initial work as it facilitates easier diagnosis.

2.2 | Downscaling methodology
Guided by L19b, we use the two best performing methods for daily maximum temperature, one conceptually simple, quantile delta mapping (QDM) (Cannon et al., 2015), and one more complex, kernel density distribution mapping (KDDM) (McGinnis et al., 2015). We also use bias correction quantile mapping (BCQM) (Lanzante et al., 2018), which is both conceptually simple and very widely used. Finally, we consider PresRat, a modification of QDM designed specifically for precipitation (Pierce et al., 2015). Note that two of L19b's methods utilized the anomaly approach, which is inappropriate for precipitation, a positive-definite quantity. Below we briefly introduce the four methods; the interested reader is referred to L19b and references therein for more details.

2.2.1 | BCQM
BCQM, one of the most widely used SD methods, is often referred to as 'quantile mapping', although some use this term more generally in reference to various distributional SD methods. It is computed as:

D_f = F_{O_h}^{-1}( F_{M_h}(x) )    (1)

where F is the cumulative distribution function (CDF), F^{-1} its inverse, and x the M_f value to be downscaled.
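As a concrete illustration of Equation (1), a minimal empirical implementation might look like the following. This is an illustrative sketch using empirical CDFs, not the code used in this study; the function and variable names are our own.

```python
import numpy as np

def bcqm(x, m_hist, o_hist):
    """Sketch of bias-corrected quantile mapping, Equation (1):
    D_f = F_Oh^-1(F_Mh(x)). Maps x through the empirical CDF of the
    historical model values and the inverse CDF of the observations."""
    m_sorted = np.sort(m_hist)
    # tau = F_Mh(x): empirical non-exceedance probability of x
    tau = np.searchsorted(m_sorted, x, side="right") / m_sorted.size
    tau = np.clip(tau, 1.0 / m_sorted.size, 1.0)
    # F_Oh^-1(tau): the corresponding quantile of the observations
    return np.quantile(o_hist, tau)
```

With `o_hist` exactly twice `m_hist`, downscaling a historical model value of 5.0 returns approximately 10.0, as expected for a doubling bias.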

2.2.2 | KDDM
KDDM uses a complex, multi-step algorithm involving kernel density estimation to smooth O_h and M_h, which have first been standardized to zero mean and unit variance, separately for each month of each year. Subsequent integration of the generated distribution functions followed by inverse standardization yields the desired transfer functions. We use R code kindly supplied by the KDDM authors (https://github.com/sethmcg/climod).

2.2.3 | QDM
QDM can be thought of conceptually as using M_f as a first guess, but modifying it via a correction factor, which varies by position in the distribution. The correction factor, while additive for most variables, is multiplicative for precipitation. Its additive form is:

D_f = x + [ F_{O_h}^{-1}(τ) − F_{M_h}^{-1}(τ) ]    (2)

and its multiplicative form is:

D_f = x × F_{O_h}^{-1}(τ) / F_{M_h}^{-1}(τ)    (3)

where τ = F_{M_f}(x) is the percentile of the value x within the M_f distribution. We use R code made available by the QDM authors (https://github.com/cran/MBC).
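The multiplicative form can be sketched empirically as follows. This is our own illustrative code, not the MBC package implementation; names are hypothetical.

```python
import numpy as np

def qdm_mult(x_f, m_hist, o_hist, m_fut):
    """Sketch of multiplicative QDM: the future model value x_f is the
    first guess, scaled by the historical OBS/GCM quantile ratio
    evaluated at x_f's percentile within the future model distribution."""
    n = len(m_fut)
    # tau = F_Mf(x_f): percentile of x_f in the future model distribution
    tau = np.searchsorted(np.sort(m_fut), x_f, side="right") / n
    tau = np.clip(tau, 1.0 / n, 1.0)
    # correction factor: F_Oh^-1(tau) / F_Mh^-1(tau)
    return x_f * np.quantile(o_hist, tau) / np.quantile(m_hist, tau)
```

If the historical observations are everywhere three times the historical model values, the correction factor is 3 at every percentile, so a future model value of 8.0 is downscaled to 24.0.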

2.2.4 | PRAT
PresRat, hereafter referred to as PRAT, is a simple modification of QDM which preserves the model-predicted future change in mean precipitation. Each value of D_f computed from (3) is multiplied by a calendar-month specific correction factor K:

K = ( \overline{M_f} / \overline{M_h} ) / ( \overline{D_f} / \overline{O_h} )    (4)

where bars indicate climatological means over a specific calendar month.
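The mean-preserving rescaling can be sketched in code as follows; this is our own illustration, with hypothetical names, of a factor that makes the downscaled mean change match the model-predicted mean change.

```python
import numpy as np

def presrat_factor(m_hist, m_fut, o_hist, d_fut):
    """Sketch of a PresRat-style correction factor K: the ratio of the
    model-predicted mean change (mean M_f / mean M_h) to the change
    implied by the downscaled values relative to historical OBS."""
    model_change = np.mean(m_fut) / np.mean(m_hist)
    downscaled_change = np.mean(d_fut) / np.mean(o_hist)
    return model_change / downscaled_change
```

Multiplying every D_f value by K forces mean(K·D_f)/mean(O_h) to equal mean(M_f)/mean(M_h), i.e., the downscaled output inherits the model-predicted mean change.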

2.2.5 | Configuration options
Although some authors have fixed specific options in their particular implementations, here we make a distinction between SD methods and configuration options: we consider an SD method to be associated with overarching principle(s), whereas configuration options are specific choices made in implementation. We evaluate configuration choices since in much prior work the consequences of these choices have not been considered. Some of our motivation stems from Vrac et al. (2016), hereafter V16, who did assess some configuration options, although not in a PM setting. The added complexity of precipitation yields several additional choices, four of which we consider. The first is whether to use additive (A) versus multiplicative (M) scaling. Conventional wisdom has dictated the latter for precipitation since additive correction can yield negative values, although these can be reset to zero. We consider the additive option since we are not aware of any prior studies that have evaluated this approach for precipitation.
Another option is frequency adjustment (freqadj), in which a threshold (at or above the US trace value of 0.01 in.) is chosen to yield the same fraction of dry days in M_h as found in O_h, by setting M_h values below the threshold to zero. The same threshold is applied in the future. Frequency adjustment is aimed at accounting for the well-known GCM 'drizzle bias' (Stephens et al., 2010), hereafter DBIAS, the widespread tendency for GCMs to simulate an excess number of small amounts of precipitation compared with real-world observations.
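A sketch of this adjustment follows; this is our own illustration, in which the trace value of 0.01 in. is expressed as 0.254 mm.

```python
import numpy as np

def frequency_adjust(m_hist, o_hist, trace_mm=0.254):
    """Sketch of freqadj: choose a threshold (at or above the trace)
    equal to the model quantile at the observed dry-day fraction, then
    zero out model values below it so the dry-day fractions match."""
    dry_frac_obs = float(np.mean(o_hist == 0.0))
    thresh = max(trace_mm, float(np.quantile(m_hist, dry_frac_obs)))
    return thresh, np.where(m_hist < thresh, 0.0, m_hist)
```

The same threshold would then be applied to M_f, as described above.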
A fundamental issue in precipitation SD is how to deal with days having zero precipitation. One could ignore them and apply SD only to non-zero values, or SD could be applied to all values, including the zeros. We introduce the option ignore0, which when on (off) corresponds to the former (latter) case. A further issue arises with ignore0 off, namely the potential for a large number of identical values (i.e., zeros). Cannon et al. (2015) added a small random value (distributed uniformly over [0, trace/2]) to each zero before downscaling. We refer to this option as below trace noise (BTN), which can either be on or off.
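The below-trace-noise idea can be sketched as follows; this is illustrative code of our own, with the half-trace upper bound following the description above.

```python
import numpy as np

def below_trace_noise(precip, trace=0.254, seed=0):
    """Sketch of BTN: replace exact zeros with small uniform random
    values on [0, trace/2) so that distributional mapping does not see
    a large block of tied zero values."""
    rng = np.random.default_rng(seed)
    out = np.asarray(precip, dtype=float).copy()
    zeros = out == 0.0
    out[zeros] = rng.uniform(0.0, trace / 2.0, zeros.sum())
    return out
```

Nonzero amounts pass through unchanged; only the zeros are perturbed, and all perturbations stay below the trace value.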
Note that the four configuration options are not applicable to all of our SD methods. The choice of additive versus multiplicative is not applicable to BCQM or KDDM by nature of their algorithms. Furthermore, freqadj and ignore0 are 'baked into' the complex KDDM code. For QDM and PRAT all four options are viable, which is an extension to the methodology given by the original authors.
Although we consider four configuration options, our list is not exhaustive. For example, time windows used for SD training and evaluation can vary: 12-monthly or 4-seasonally non-overlapping, or overlapping moving windows of different lengths, to name a few possible choices. There is a tradeoff: wider windows yield larger sample sizes for training but narrower ones are better able to resolve seasonally varying relationships. Window choice may introduce artefacts (Dixon et al., 2016; especially Figure 4b). Hence, configuration options beyond those considered here may have consequences.

2.2.6 | Tail schemes
Special attention to the tails is warranted, as they present a greater challenge than the remainder of the distribution. We examined this issue in considerable detail and devised special procedures (L19a) which were evaluated extensively (L19b). We use the limited tail adjustment scheme (LIM) as per L19a and L19b. LIM is applied only to tail values, after initial application of any arbitrary SD method.
For application of LIM the user decides a priori how many values at the end of the distribution are to be modified by specification of the parameter tail-length (TLN). The user also specifies, via parameter NPT, the number of values to be used in performing the tail adjustment. Previously (L19b) we found TLN = 10 and NPT = 10 yield good results for temperature. While smaller values of NPT produce poorer results, increasing NPT beyond 10 generally yields little gain.
Conceptually LIM operates by computing a constant correction factor from the NPT points and applies it to the TLN points. For example, with NPT = 5 and TLN = 10 to apply LIM to the right tail the correction factor is computed using the 11th through 15th largest values and then applied to the 10 largest values. The correction factor is either additive, used for most variables such as temperature, or multiplicative, assumed more appropriate for precipitation. Here our LIM results are multiplicative, following conventional wisdom, with NPT = 10 and TLN = 10, based on L19a. We only apply LIM to the right tail as left-tail values are small, marginally larger than the trace value.
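One plausible reading of this procedure is sketched below; this is our own illustration, in which the factor is derived from a generic reference sample, whereas in practice it would come from the SD training relationship (see L19a for the actual scheme).

```python
import numpy as np

def lim_right_tail(values, reference, tln=10, npt=10):
    """Sketch of a LIM-style multiplicative tail adjustment: compute a
    single correction factor from the NPT values just inside the right
    tail (ranks TLN+1 .. TLN+NPT) and apply it to the TLN largest values."""
    v = np.sort(values)[::-1].copy()   # descending order
    r = np.sort(reference)[::-1]
    # multiplicative factor from the NPT reference points
    factor = np.mean(r[tln:tln + npt]) / np.mean(v[tln:tln + npt])
    v[:tln] *= factor
    return v
```

With the reference everywhere double the input, the factor is 2, so the 10 largest values are doubled while ranks beyond TLN are untouched.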

2.3 | Evaluation procedure
Evaluation statistics are presented by treatment, which we define as a combination of an SD approach (one SD method, either the base algorithm alone or additionally with LIM adjustment in the right tail) and a set of configuration options as detailed in section 2.2.5. Each of the three 10-year future ensemble members is downscaled separately and verification statistics are averaged over the ensemble members. SD is performed separately at each gridpoint and for each of four standard seasons: DJF (December, January, February), MAM, JJA, and SON. Results are presented mostly as averages over the four seasons, with some limited results given for DJF and JJA. Use of seasons rather than months (as in our earlier works for daily temperature) is aimed at providing adequate sample sizes, given that for some approaches the presence of dry days reduces the available sample size, sometimes substantially.
Our primary metric is the mean absolute error (MAE), which we use in two different ways. In the traditional synchronous application MAE is based on differences between values of D_f and O_f occurring on the same day. We also apply it asynchronously (MAE-ord) such that paired values of D_f and O_f represent the same order statistics. While MAE is a measure of how well SD represents day-to-day weather variations, MAE-ord measures how well SD reproduces the statistical distribution of values. MAE-ord is motivated by guidance provided by Maraun and Widmann (2018) and Maraun et al. (2019) regarding statistical model evaluation. Contrasting results based on these two similar but distinct metrics will help illustrate issues raised by Maraun (2013) regarding the stochastic nature of precipitation. After computing MAE or MAE-ord over a season, or averaging the four seasonal values, it is converted to a standard skill score (Wilks, 2006; L19a; L19b):

Skill = 100% × (1 − E_D / E_M)

where E_D and E_M denote the error (MAE or MAE-ord) of the downscaled values (D_f) and of the raw model values (M_f), respectively. We then average skill over all non-ocean gridpoints. Averaging utilizes the biweight mean (Lanzante, 1996), which guards against the effects of outliers. The biweight is data-adaptive, behaving more like the arithmetic mean for 'well-behaved' data and more like the median otherwise.
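The two metrics and the skill conversion can be expressed compactly; this is illustrative code of our own, with the reference error taken from the raw model, consistent with negative skill meaning no improvement over the raw GCM.

```python
import numpy as np

def mae(pred, obs):
    """Synchronous MAE: values paired by day."""
    return float(np.mean(np.abs(pred - obs)))

def mae_ord(pred, obs):
    """Asynchronous MAE-ord: values paired by order statistic."""
    return float(np.mean(np.abs(np.sort(pred) - np.sort(obs))))

def skill(e_downscaled, e_raw):
    """Standard skill score (%): 100 * (1 - E_D / E_M)."""
    return 100.0 * (1.0 - e_downscaled / e_raw)
```

A downscaled series that is merely a shuffled version of OBS has zero MAE-ord (a perfect distribution) but nonzero MAE, illustrating the distinction drawn above.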
As in L19a and L19b we compute separate verification statistics for different portions of the distribution, referred to as distributional categories (CATs): CAT 1 (CAT 9) consists of the lowest (highest) value in the sample, CAT 2 (CAT 8) the second and third lowest (highest) values, CAT 3 (CAT 7) the fourth through sixth lowest (highest) values, and CAT 4 (CAT 6) the seventh through tenth lowest (highest) values. Finally, CAT 5 consists of all values in the sample. When considering extremes we devote our attention to the right tail, as values in the left tail are very small. Some results are presented as averages over the entire right tail (CAT 6-9), weighted by the number of values in each category.
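For concreteness, the right-tail categories can be extracted from a sorted sample as follows (illustrative code of our own):

```python
import numpy as np

def right_tail_cats(sample):
    """Right-tail distributional categories: CAT 9 is the largest value,
    CAT 8 the 2nd-3rd largest, CAT 7 the 4th-6th largest, and CAT 6 the
    7th-10th largest."""
    s = np.sort(np.asarray(sample, dtype=float))[::-1]
    return {9: s[0:1], 8: s[1:3], 7: s[3:6], 6: s[6:10]}
```

A weighted right-tail average (CAT 6-9) then weights the categories by their number of values: 1, 2, 3, and 4, respectively.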
Because of the binary nature of precipitation occurrence we utilize an additional metric, a fractional error in the number of dry days, computed separately for M_f and D_f as the number of days in error divided by the total number of days. For example, in the case of D_f, if D_f and O_f are both wet or both dry on a given day there is no error; if one is wet and the other dry we count that day as an error. As above, given fractional errors for both M_f and D_f we compute a skill.
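The dry-day error metric reduces to a disagreement rate on wet/dry state (illustrative code of our own):

```python
import numpy as np

def dry_day_frac_error(pred, obs):
    """Fraction of days whose wet/dry state disagrees with OBS: an error
    is counted when one series is dry (zero) and the other is wet."""
    return float(np.mean((np.asarray(pred) == 0.0) != (np.asarray(obs) == 0.0)))
```

Days where both series are wet, or both dry, contribute no error regardless of the amounts involved.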
In order to estimate statistical significance we use the same procedure developed previously, referring the reader to L19a (especially appendix B) for details; only a brief overview is given here. We first compute the mean skill over our domain, separately for the two SD approaches of interest. The difference between these two skills is the quantity for which significance is sought. We perform two separate Monte Carlo simulations, with 1,000 trials each, to derive a distribution of differences in skill. The position of the actual difference within this randomly derived distribution determines the significance level.
The first step in the process is to estimate the spatial degrees of freedom, in order to account for the fact that gridpoint values are not independent of one another. For each trial we apply random translational shifts in both the north-south and east-west directions to the pair of maps. Next we pattern-correlate the original and shifted pairs of maps and use the distribution of correlations to infer an effective block size. In the second step, for each trial we randomly shuffle blocks of gridpoints between the two original maps and compute a difference in mean skill between the permuted maps. The distribution of these differences is used to assess significance. In order to ensure robust results we report three significance estimates, based on conservative and liberal objective criteria as well as a very conservative subjective choice (L19a). We modify the subjective choice of effective block dimensions of 4 × 2 (latitude by longitude gridpoints) used in our earlier works for temperature to 7 × 4 for precipitation, based on Huang et al. (1996) and Richman (1986).
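The second Monte Carlo step can be sketched as follows. This is a simplified illustration of our own: block handling at map edges and the translational-shift step used to estimate block size are omitted.

```python
import numpy as np

def block_perm_significance(map_a, map_b, block=(7, 4), n_trials=1000, seed=0):
    """Swap whole blocks of gridpoints between two skill maps at random
    to build a null distribution of differences in domain-mean skill;
    the rank of the observed difference gives the significance level."""
    rng = np.random.default_rng(seed)
    obs_diff = map_a.mean() - map_b.mean()
    nlat, nlon = map_a.shape
    diffs = np.empty(n_trials)
    for t in range(n_trials):
        a, b = map_a.copy(), map_b.copy()
        for i in range(0, nlat, block[0]):
            for j in range(0, nlon, block[1]):
                if rng.random() < 0.5:  # swap this block between the maps
                    blk = a[i:i+block[0], j:j+block[1]].copy()
                    a[i:i+block[0], j:j+block[1]] = b[i:i+block[0], j:j+block[1]]
                    b[i:i+block[0], j:j+block[1]] = blk
        diffs[t] = a.mean() - b.mean()
    # two-sided: fraction of permuted differences at least as large
    return float(np.mean(np.abs(diffs) >= abs(obs_diff)))
```

For two identical maps the observed difference is zero and every permuted difference is zero, so the returned significance level is 1.0, i.e., no evidence of a difference.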

3 | RESULTS
3.1 | Evaluation of skill over the entire distribution

Table 1 summarizes skill (based separately on MAE and MAE-ord) averaged over the entire domain for various SD treatments. Rather than considering every possible combination, we limit consideration to a manageable number from which we can draw conclusions considered representative of the class of SD methods examined in our PM framework. Our shorthand for treatments uses the first letter of the SD method followed by the ordinal row number from the first column of Table 1, which specifies configuration options. During discussion we also refer to Table 2, with results from a limited number of significance tests, pairwise between two treatments. For convenience, Table 2 lists group numbers (i.e., G1-G7) for sets of related significance tests.
The most noteworthy aspect of Table 1 is the clear distinction between skills based on MAE versus MAE-ord, with the former substantially lower. Higher skill for MAE-ord reflects the fact that while quantile mapping substantially improves the distribution of values compared with the raw GCM output, it exhibits less skill in reproducing the day-to-day weather sequencing, as measured by MAE. This can be explained by the arguments of Maraun (2013) that attempting to bridge the scale mismatch between OBS and GCM using a deterministic method such as quantile mapping yields results with a corrupted time sequence. As such, MAE skill levels are disappointingly low (~20-25%), about half those found for temperature in our earlier work (L19a; L19b). Sub-grid-scale variability, which is at the core of the problem, is much less of an issue for temperature than it is for precipitation. It is worth noting that the relative ordering of skills by treatment is generally quite consistent between the two metrics: the correlation of MAE and MAE-ord skills across treatments exceeds 0.9 for CAT 5 and CAT 6-9. Thus, comparisons between treatments, one of the main motivations for this work, are not much affected by the choice of metric. For the remainder of this work we draw conclusions based primarily on MAE-ord.
Next we examine overall skill (CAT 5) for four different configurations of BCQM (B1-B4) based on combinations of freqadj on/off and ignore0 on/off. BCQM was chosen for this purpose since it is the simplest of our methods, has the fewest possible options, and is a very commonly used distributional SD method. Three of the configurations (B2-B4) yield skill of the same order whereas B1 performs much more poorly. Significance tests in Table 2 for group G1 indicate that the outlier treatment (B1) is highly significantly worse than the others while the two better treatments (B2 and B4) are not different from one another. The key factor is that the better treatments have ignore0 off. However, use of freqadj can substantially mitigate the effect of having ignore0 on (B3), although this treatment is still significantly worse than the two better treatments. Below (3.5.2) we explore the reasons for this behaviour, explaining why it is configuration-specific but not SD method-specific. Finally, we consider K5 which has skill a bit lower than B4, but not significantly so.
[Table 1: Skill (%) averaged over the domain for various SD methods and configurations (T/F/I/B), referenced by treatment number (first column). CAT 5 is for the entire distribution, CAT 6-9 is averaged over the right tail, and CAT 6-9 L is based on the LIM adjustment averaged over the right tail. Columns give skill based on MAE and MAE-ord for CAT 5, CAT 6-9, and CAT 6-9 L, plus dry-day skill (D-DRY).]

Next we examine results based on a variety of treatments for QDM, chosen for this purpose since previously (L19b), for temperature, it was found to be as good or better than any of the tested distributional methods, and it has the most configuration options. In the literature, conventional wisdom from a variety of SD methods suggests that multiplicative scaling rather than additive adjustment should be used for precipitation. One of the reasons for this is that the additive approach may result in negative precipitation, which then must be reset to zero. Since this assumption is rarely tested we have applied a number of such treatments with different configurations. Indeed, the multiplicative approach is superior in all cases. Not coincidentally, the best approach for QDM is Q6, the default configuration in the code made publicly available by the creators of QDM (Cannon et al., 2015). However, while the multiplicative and additive QDM treatments tend not to differ significantly from one another, the best of the former group (Q6) is marginally significantly better than the best of the latter group (Q12), as seen in G2. Finally we note that the best versions of BCQM (B4) and QDM (Q6) do not differ significantly (G2).

The final method under consideration is PRAT, which (see above) is a variant on QDM. Having used QDM to explore various configurations we limit the number of PRAT treatments. Not surprisingly the best (P19) has the same configuration as the best QDM (Q6). Although P19 is not better than B4, it is marginally significantly better than both Q6 and K5 (see G3), and certainly better than the poorest version of PRAT (P17), which has ignore0 on.

[Table 2: Significance level (%) testing the hypothesis that there is no difference in skill between the two specified SD treatments. Treatments are labelled as in Table 1.]

Next we consider the effect of the BTN option. Although we have not made extensive tests, there are several pairings that differ only in the BTN setting (Q6/Q11, Q8/Q9, Q13/Q14, and Q15/Q16). In general there is very little difference.
Finally, Table 3 gives a limited comparison between our findings and those for the four configuration scenarios of V16: Positive Correction (PC), Direct Approach (DA), Threshold Adaptation (TA), and Singularity Stochastic Removal (SSR). These are analogous to our configurations as follows: PC (freqadj = off, ignore0 = on), DA (freqadj = off, ignore0 = off, BTN = off), TA (freqadj = on, ignore0 = on), and SSR (freqadj = off, ignore0 = off, BTN = on). Although V16 utilized a single downscaling method (CDFt; Michelangeli et al., 2009), not used here because of sub-par performance (L19b), our use of multiple methods enables us to demonstrate the much greater sensitivity of results to configuration than to SD method.
In agreement with V16 we find that, of the methods tested here using our PM experimental design, PC is by far the worst approach, with the other three yielding fairly similar results, although TA may be slightly poorer than DA and SSR. We also agree that SSR (also used by Zhang et al., 2009 and Cannon et al., 2015) is preferable because of its flexibility: it deals equally well with cases in which M_h has more wet days than O_h (i.e., the DBIAS) and with the inverse. Recall that freqadj can only remedy the DBIAS, not the inverse. The SSR approach is also simpler in that it avoids separate corrections for occurrence and amount.

3.2 | Evaluation of skill in the tails of the distribution
Skill based on MAE-ord averaged over the right tail using the base algorithm (CAT 6-9 in Table 1) is about half that for CAT 5 (~25-30% vs. ~50-60%). Other than the B1 outlier, differences in CAT 6-9 skill between treatments are generally small, and few if any are likely to be significant, as evidenced by the B4 versus B3 comparison (G4), which exhibits fairly typical differences. Application of the LIM adjustment yields small improvements which are likely to be mostly insignificant, again with not much difference between treatments (G4). Furthermore, tail skill for our preferred treatment (P19) does not differ significantly from that of other leading treatments (G5).
Skill based on MAE shows a very different pattern. While CAT 6-9 skill is much poorer than for CAT 5, LIM yields substantial improvement. However, skill in the tails for both the basic algorithm and the LIM adjustment is negative, indicating that neither provides improvement over the raw GCM. Here the results in the right tail are strikingly different from those found previously for temperature (L19b), for which skill in the tails was comparable to that for the whole distribution.
For further perspective on tail performance, Figure 1 shows MAE and MAE-ord, along with their associated skills, for P19 as a function of distributional category. High skill in the left tail is of little practical importance since the values there are quite small. Aside from consistency with the general points made above, this figure demonstrates the rapid increase in error going farther out in the right tail. Although P19 is likely as good as any of the tested treatments and shows considerable improvement over the raw GCM, errors in the tail (representing domain averages) are still quite large, ~5-20 mm.

3.3 | Evaluation of dry-day skill
The right-most column of Table 1 gives skill based on the dry-day error metric. Other than for B1, which is highly significantly worse than the other treatments (G6), differences between treatments tend to be small. For instance, while the skill of P19 differs by more than two percentage points from both B4 and K5, these differences are not even close to being significant (G7). Note that the actual fractional error (not shown) does not differ much between SD treatments, being ~0.27 for M_f and ~0.18 for D_f.

3.4 | Spatial patterns of skill and MAE
The pattern of DJF skill for CAT 5 in Figure 2a shows lower skill in coastal regions, as identified previously for temperature downscaling. Although diagnosis of this feature is beyond the scope of this work, the mechanistic explanations from our earlier work would not seem applicable here. In the tails (Figure 2c,d) the patterns are roughly similar to their corresponding CAT 5 patterns, but with a considerable reduction in overall skill; as a consequence, substantial areas of the domain have near-zero or negative skill. For MAE-ord the DJF patterns for CAT 5 and CAT 6-9 (Figure 3a,c) both have higher values over the Southeast and along the extreme West Coast, which correspond roughly to the DJF climatology (not shown). For JJA, while the CAT 6-9 pattern (Figure 3d) also corresponds reasonably well to the JJA climatology (not shown), the CAT 5 pattern (Figure 3b) does not, with the largest errors in the higher latitudes of interior North America. The domain-averaged MAE-ord of ~5 mm for CAT 6-9 seen in Figure 1 masks the strong regionality seen in Figure 3d, where the tail errors along the East Coast, and particularly the Southeast, are typically several times larger (~20 mm), even in areas exhibiting considerable skill (Figure 2d).
There is an interesting contrast between Figures 2b and 3b that relates to precipitation frequency. While the lowest skill is found both in the mountainous West and the Northern Interior, the former (latter) has relatively low (high) MAE-ord. As illustrated below (3.5.2), having a very large number of dry days (i.e., zero values) complicates distributional mapping. The extreme aridity of the Far West results in either an insufficient sample for analysis or very small errors, while the Interior has just enough precipitation, in both amount and frequency, to trigger these complications.

3.5.1 | Jupiter and Coaldale
The mountainous West has some distinctive variations in skill and MAE-ord, characterized by sharp gradients aligning with topography (Figures 2a and 3a). In the West, the basic nature of this pattern for CAT 5 is similar among the four SD methods and all seasons (not shown) except JJA, which differs somewhat due to climatologically much drier conditions. These variations can be explained in terms of the interaction between topography and non-stationarity introduced by climate change. To illustrate this, two relatively nearby points are chosen for which the behaviour is strikingly different. Near Jupiter, CA (Figure 4), CAT 5 MAE-ord skill for P19 is highly negative (−62.0%), yet at a gridpoint near Coaldale, NV (Figure 5), skill is impressively high (90.9%). The use of different seasons and ensemble members at the two points (see captions of Figures 4 and 5) accentuates the disparity, but is not crucial to the conclusions.
Jupiter is upwind of the Sierra Nevada in a region of rapidly rising elevation from west to east, whereas Coaldale is downwind on the plateau. As a result, Jupiter is located near an orographically forced climatological local maximum of precipitation while Coaldale is near a climatological minimum. Accordingly, for Jupiter (Coaldale) the much larger GCM footprint encompasses areas of less (greater) precipitation in the surrounding areas. As seen in Table 4, at Jupiter (Coaldale) the mean precipitation of OBS is greater (less) than that of the GCM. Although the climate change signal at both locations is one of drying, this effect is more pronounced at Jupiter, which, since it is wetter, has more to lose via climate change.

A better understanding is had by examining quantile-quantile (qq) plots at the two locations. These plots are like traditional x-y plots except that instead of each x-y pair coming from the same point in time, they come from the same relative location in their respective distributions. For example, the right-most (left-most) point consists of the maximum (minimum) x-value paired with the maximum (minimum) y-value. To complement the qq plots we also show corresponding plots of CDF curves. For added clarity, both types of plots are shown zoomed in (top) on the most salient features as well as over the full range (bottom) of values (with nonlinear scaling).
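Constructing qq pairs is straightforward (illustrative code of our own): rather than pairing by time, pair by quantile.

```python
import numpy as np

def qq_pairs(x_sample, y_sample, n=50):
    """Pair values at the same relative position in each distribution;
    e.g., the right-most point pairs the two sample maxima."""
    probs = np.linspace(0.0, 1.0, n)
    return np.quantile(x_sample, probs), np.quantile(y_sample, probs)
```

Note that the paired samples need not have the same length, since both are evaluated at a common set of probabilities.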
The qq plots (Figures 4a,c and 5a,c) consist of three curves, each with GCM on the abscissa and OBS or DWN on the ordinate; black (red) depicts the historical (future) relationship. Thus, black represents what is available to 'train' the downscaling method while red represents 'truth'. The cyan curve represents what the SD method generated as its rendering of the future. One can think of the green line, with a slope of 1, as the starting point; downscaling moves from it to the cyan curve, with the best possible result lying on the red curve. The point at which a curve crosses y = x shows where the bias changes sign: for points above (below) the green line OBS is greater (less) than GCM. Changes from black (historical) to red (future) indicate nonstationarity.
In the case of Jupiter (Figure 4a,c) the downscaled curve (cyan) departs significantly from the truth (red) over most of the distribution. To understand the downscaling operation we use QDM, the results for which (not shown) are quite similar, only slightly worse. However, the basic principles apply to any of the methods used herein that operate via analogous bias-correction principles. Downscaling via QDM consists of using M_f as a 'first guess' and then modifying it through a multiplicative correction factor (Equation (3)), evaluated at the percentile of the M_f value within its own distribution. At first glance it seems curious that downscaling does so poorly given that the red curve is not that different from the black one. But these curves represent only relative relationships. The key lies in the fact that the distributions have shifted significantly towards lower values in the future (Figure 4b,d) due to drying (Table 4). Another important factor is that the black curve shifts from a local ratio (OBS to GCM) much larger than 1 (i.e., above the green line) at the high end of the distribution to values less than 1 at the low end. This ratio is proportional to the multiplicative correction factor applied by QDM. Note that ratio values less than 1 at the very low end of the distribution are to be expected as per the well-known DBIAS, which results from the GCM having a larger footprint than the OBS.
Consider an arbitrary value of M_f which is to be downscaled. The correction factor (Equation (3)) is determined by applying the percentile of this value within the M_f distribution to the O_h and M_h distributions. However, because of considerable drying from the historical to the future period, the quantile (i.e., amount of precipitation) corresponding to this percentile will be higher in the historical than in the future period. Thus, the correction factor which is applied will be biased towards the high end of the historical distributions. Since, as noted above, the O_h/M_h ratio increases with increasing precipitation amount, the resulting correction factor will be too large.
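The multiplicative QDM step just described can be sketched as follows. This is our own simplified rendering of the logic (not the authors' Equation (3) verbatim): the future model value supplies a percentile, and the ratio of historical-observed to historical-modelled quantiles at that percentile scales the first guess. Real implementations also handle zeros, trace amounts, and tail extrapolation.

```python
import numpy as np

def qdm_multiplicative(m_f_value, o_h, m_h, m_f):
    """One QDM downscaling step for a ratio variable such as precipitation.

    m_f_value : future model value to downscale (the 'first guess')
    o_h, m_h  : historical observed and modelled daily samples
    m_f       : future model daily sample (defines the percentile)
    """
    # Percentile (tau) of the value within the future model distribution
    tau = (m_f <= m_f_value).mean()
    tau = min(max(tau, 0.001), 0.999)   # avoid degenerate end quantiles
    o_h_q = np.quantile(o_h, tau)       # historical observed quantile
    m_h_q = np.quantile(m_h, tau)       # historical modelled quantile
    eps = 1e-9                          # guard against division by zero
    return m_f_value * (o_h_q + eps) / (m_h_q + eps)
```

With a uniform wet bias (O_h exactly twice M_h), the step simply doubles the input; the problems discussed in the text arise precisely because the O_h/M_h ratio varies across the distribution while drying shifts the percentile lookup.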
There is an equivalence between the explanations based on the qq plots and the CDFs (Figure 4b,d).
Next consider Coaldale (Figure 5), located in the orographically induced downwind 'rain shadow' region. The qq curves are below the green line due to the DBIAS but, unlike at Jupiter, they do not cross above it at higher values because the DBIAS operates only for low values of precipitation. For larger amounts of precipitation, especially convective, a localized intense area of precipitation is more likely to be surrounded by less intense precipitation, leading in effect to an inverse of the DBIAS. The fact that Coaldale is so arid (compare the y-axis extents in Figures 4a,c and 5a,c) precludes it from reaching the inverse-DBIAS regime which Jupiter is able to attain. Furthermore, although there is drying, it is less dramatic than at Jupiter. The combination of the drying and the percentile-shift effect does force the cyan curve above the black curve, just as was the case for Jupiter, but the amount is much less. This shift has a positive effect, however, pushing the cyan curve closer to the red curve, in contrast to Jupiter, for which it pushed the cyan curve away from the red curve. In passing we point out that the discontinuity in the cyan curve is a reflection of the SD adjustment for the DBIAS (i.e., converting some wet days to dry days). It is more prominent here because of the very dry climate, which results in a 'zooming in' on the low end of the distribution.
In summary, drying due to climate change and the inevitable DBIAS operate in opposite fashion between the two locations. At Jupiter they conspire to yield a poor downscaling result. Because the magnitudes are smaller at Coaldale they have a smaller effect, but fortuitously they combine to produce an exceptionally good result.

| Alpena
Motivation for this example comes from the results shown earlier (Table 1) comparing four variants of BCQM. Treatment B1, with ignore0 on and frequency adjustment off, produced much worse results. Maps (not shown here) reveal considerable similarity between the patterns of seasonal variation in B1 skill and the patterns of seasonal variation in climatological amount and frequency of precipitation. Namely, relatively low skill for B1 corresponds with climatologically small amounts and daily frequencies of precipitation, with the poorest performance during JJA. Furthermore, the seasonal patterns of skill for our favoured approach of P19 are quite similar to those for the better-performing variants of BCQM (B2-B4).
The qq plots in Figure 6a,c for a point near Alpena, KS typify the poor performance of B1. The relationship between OBS and GCM is similar between the historical (black) and future (red) periods. When configured optimally, with ignore0 off, BCQM (B4) and PRAT (P19) yield similar and reasonable results (violet and cyan circles, respectively) that follow the O_h/M_h and O_f/M_f curves quite well. On the other hand, turning ignore0 on, with all other settings the same, again yields similar, but this time extremely poor, results for both BCQM and PRAT (violet and cyan pluses, respectively). It is striking that skill at this location for B1 is −26.9% while that for B4 is 52.9%.
In our PM experimental design, we find that poor performance with ignore0 on and frequency adjustment off is not specific to the downscaling method but instead is much more likely to occur when two conditions are met: (a) a large difference in dry-day frequency between GCM and OBS and (b) a high dry-day frequency (~80% or greater). As shown in Table 5, at Alpena during both time periods condition (a) is met, with a disparity of ~18-24%. Differences of ~10-20% are common at a majority of gridpoints, and in some locations/seasons are as high as ~40-75%. Condition (b) is met since more than 80% of the days are dry for OBS in both the historical and future periods.
The reason for the problem (B1) is that when (a) is met, mapping occurs between disparate portions of the OBS and GCM distributions. Condition (b) exacerbates the problem by reducing the sample used to define the distributions. For example (Table 5), at Alpena the upper 42.8% of the M_h distribution is mapped to the upper 18.9% of the O_h distribution with ignore0 on and frequency adjustment off. However, if frequency adjustment were invoked with ignore0 on, the upper 18.9% of both distributions would be used in the mapping. Finally, if both options were off then 100% of both distributions would be used in the mapping. As seen in Table 5, an additional consequence of the more equitable mapping (B4) is a good representation of precipitation frequency (89.5 vs. 87.4), as opposed to the mismatched case (B1) in which a large DBIAS remains (64.8 vs. 87.4).
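The three configurations just contrasted can be summarized in a small sketch. This is purely illustrative bookkeeping in our own notation (the actual SD codes differ); it shows which upper portions of the GCM and OBS distributions are paired under each treatment, using the Alpena-like wet-day fractions from Table 5.

```python
import numpy as np

def wet_fraction(x, thresh=0.0):
    """Fraction of days with precipitation above a dryness threshold."""
    return (x > thresh).mean()

def mapped_portions(gcm_wet_frac, obs_wet_frac, ignore0, freq_adjust):
    """Upper portions of the GCM and OBS distributions that get mapped
    to each other under the configurations discussed in the text.

    Returns (gcm_portion, obs_portion) as fractions of each full
    distribution.
    """
    if not ignore0:
        return 1.0, 1.0                    # B4-style: all values, zeros included
    if freq_adjust:
        return obs_wet_frac, obs_wet_frac  # wet-day frequency matched to OBS
    return gcm_wet_frac, obs_wet_frac      # B1: disparate portions mapped

# Alpena-like numbers: 42.8% of GCM days wet vs. 18.9% of OBS days wet.
print(mapped_portions(0.428, 0.189, ignore0=True, freq_adjust=False))
```

In the B1 case the printed portions differ (0.428 vs. 0.189), which is precisely the disparate mapping blamed for the poor performance; the other two configurations pair equal portions.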
We can visualize the mechanisms for the poor performance of B1 via the qq plot for Alpena (Figure 6a,c).

| DISCUSSION AND CONCLUSIONS
We have compared a number of distinct distributional downscaling methods as applied to daily precipitation in a Perfect Model context, as a follow-up to our recent studies involving daily maximum temperature (L19a; L19b). Applying a more stringent metric (MAE), geared towards assessing agreement in day-to-day variability, yields skill of ~20-25%, which is barely half of that found in our earlier studies for temperature. Because of the more stochastic nature of precipitation (Maraun, 2013) we emphasize results based on a metric that assesses agreement of distributions (MAE-ord). By this metric skill is ~50-60% overall and about half of that in the right tail. This is distinctly different from temperature, for which skill in the tails could be boosted to a level comparable to that for the remainder of the distribution (L19a; L19b).
Although downscaling overall yields useful MAE-ord skill (~30-35%) for values in the right tail of the distribution, there are considerable seasonal and regional variations. More importantly, even when skill is attained the magnitude of the errors can be considerable, for example ~15-25 mm in the southeastern U.S. during summer. We remind the reader that our Perfect Model design is somewhat idealized, accounting only for the mismatch in spatial scale between OBS and GCM but not for differences in the underlying climate states; thus, real-world downscaling performance may differ. Compared with temperature, downscaling of precipitation via distributional methods is more complex, having more configuration choices. Some of these may be more consequential than the choice of SD method. These configuration choices result from the fact that precipitation consists of two aspects: (a) the binary occurrence of precipitation (dry vs. wet days) and (b) the distribution of precipitation conditional on occurrence. An equivalent but simplifying approach is to treat dry days as having zero precipitation, yielding a single distribution. Ultimately, how these zero values are handled is crucial.
In our PM framework the poorest performance occurs when SD methods train and apply a transfer function to only the non-zero daily precipitation values, without a frequency adjustment to account for the DBIAS. However, the use of a frequency adjustment when downscaling only the non-zero values largely remedies the situation. The best configuration occurs when downscaling all values (zeros included) without a frequency adjustment. With regard to the use of configuration options, our findings are in general agreement with V16.
Using optimal configurations, comparisons among the several downscaling methods do not always yield conclusive differences. PresRat, which involves a tweak to QDM, was found to be comparable to BCQM, while KDDM and QDM were found to be comparable to one another; the former pair was deemed marginally, though statistically significantly, better than the latter pair.
For diagnostic purposes we have provided some examples which highlight the mechanisms responsible for good or bad performance in our PM framework. Poor performance can result from nonstationarity introduced via climate change, as was the case at Jupiter. Surprisingly, especially good performance can result when nonstationarity due to climate change by chance compensates for a deficiency in the downscaling method (Coaldale); thus, two wrongs can make a right. Finally, we demonstrated that when all dry days are excluded from the SD, for locations (such as Alpena) with infrequent precipitation (typically wet on less than 20% of days) the nearly ubiquitous DBIAS leads to an inappropriate mapping between the OBS and GCM distributions and hence to very poor performance.
Our case studies identified an intrinsic weakness of distributional methods when applied to precipitation. Because of the DBIAS, for low values of precipitation the ratio of O_h to M_h will be less than 1. However, for larger values, often the bulk of the distribution, this ratio will typically be greater than 1 because of the spotty nature of precipitation and the larger GCM footprint. This effect is accentuated in convective regimes. Furthermore, this ratio often increases with increasing precipitation amount, again due to convection, which tends to produce greater, more isolated 'bullseye' values.
An inherent weakness of the class of quantile mapping methods is that while distributional methods operate via mappings between relative positions within distributions, certain physical constraints operate on absolute amounts of precipitation via spatial scale. A perturbing factor such as climate change (Jupiter) or excessively infrequent precipitation (Alpena) can distort the mapping in a physically inconsistent manner; that is, the DBIAS and its inverse get mapped to each other. Other perturbing factors, which have yet to be identified, may exist as well. The underlying characteristics of precipitation which often lead to poor results are not inherent to other, better behaved variables such as temperature.
It is intriguing that for some seasons and locations errors in the right tail can be quite large even when downscaling has demonstrable skill. One wonders what effect these errors might have on extreme value analysis (EVA) of precipitation, which has frequently been applied to raw GCM output. Recently, Lopez-Cantu et al. (2020) performed EVA on CONUS precipitation and found large differences among five downscaled datasets. In future work we intend to explore this issue in a PM context.
Finally, as a bridge back to our earlier PM SD evaluations for daily maximum temperature (tasmax), which were based solely on MAE skill, here we have computed MAE-ord skill for tasmax for a limited number of cases from L19b. In this comparison we report only the averages of the three SD methods that correspond most closely to B4, Q6 and K5. For the basic approach for CAT5, going from MAE to MAE-ord, skill (%) increases from ~42 to ~67 for tasmax compared with ~22 to ~58 for precipitation (see Table 1). Using the LIM adjustment and averaging over the tail (CAT6-9), skill increases from ~46 to ~57 for tasmax compared with approximately −8 to 32 for precipitation. Hence, the improvement in skill based on MAE-ord over that for MAE is greater for precipitation than for temperature, as expected given the more stochastic nature of the former. Furthermore, the improvement in the tails is much greater for precipitation, even though tail performance relative to overall (CAT5) performance is much poorer for precipitation than for temperature.