Mean and extreme precipitation over Aotearoa New Zealand: A comparison across multiple different estimation techniques

This study provides a comprehensive evaluation of mean and extreme precipitation features over Aotearoa New Zealand (hereafter New Zealand or NZ) in station data and in six merged satellite‐gauge products, five reanalyses and two in situ gridded products over a period of 10 years inclusive (2001–2010). All products show a similar mean precipitation pattern with a clear maximum over alpine regions. However, only the higher spatial resolution datasets capture the details of the west–east gradient across the South Island accurately. For extremes, we used a set of climate indices to examine the wet tail of the precipitation distribution in the different products. The products have significant differences in these indices, especially over the West Coast in the South Island, indicating substantial uncertainty in these gridded precipitation extremes. Overall, MSWEP and BARRA‐R perform well in all the statistical tests, including higher values of the nonparametric Kling Gupta efficiency (NP‐KGE) metric, favouring their utility for capturing the main characteristics of both mean and extreme precipitation across New Zealand.


| INTRODUCTION
Extreme heavy daily precipitation events are extremely disruptive to New Zealand property and infrastructure (Frame et al., 2020;Pastor-Paz et al., 2020). The warming from anthropogenic climate change is expected to intensify the hydrological cycle globally, with a general increase in the intensity of extreme daily precipitation and thus in the frequency of exceedance of historical thresholds (Allen and Ingram, 2002). In order to accurately monitor precipitation and its extremes, we need to assess the quality of different observational products. Despite New Zealand having a relatively high density of observational rain gauges relative to many land areas, it is still insufficient to sample the entire country at a fine spatial scale. The high-altitude regions in particular are not well populated with observation sites which makes it Abbreviations: ENI, east North Island; ESI, east South Island; WNI, west North Island; WSI, west South Island. difficult to quantify precipitation. Additionally, the large variation in topography and its impact on the precipitation as observed in the Southern Alps (Tait et al., 2012;Caloiero, 2014) further complicates the problem and thus we need to examine other options available to us.
Globally, two different types of data products are available to measure precipitation-direct and indirect. Surface observations from rain and snow gauges contribute direct products by measuring the depth of the water (or water equivalent) at a point as it accumulates over time. A key shortcoming of gauge measurements is that they provide incomplete areal coverage as they are often sparsely distributed over the ocean and remote regions of the world. They are also prone to a number of errors including spatial inconsistency, sampling and systematic errors, instrument error and other environmental and evaporative effects associated with wind (Peterson et al., 1998;Michelson, 2004;Sun et al., 2018).
Indirect precipitation estimates are obtained from satellite retrievals that are inferred from infrared (IR) or microwave (MW) sources. Active sensors used in the spaceborne radars such as the one used in various missions including TRMM, GPM, CloudSat and others also offer a unique three-dimensional view of the Earth's hydrological cycle. The quantitative precipitation estimates derived from satellites provide a more spatially and temporally consistent coverage over most parts of the globe, including the oceanic regions (Dinku et al., 2007;Kucera et al., 2013;Rana et al., 2013), thereby providing a key advantage over rain gauges. However, their accuracy can be limited, due to measuring quantities such as cloud top temperature and thermal radiance instead of precipitation directly. To offset the biases observed in satellite products, merged products that combine the benefits of both satellite and rain gauges have been developed that have great potential to provide information about the nature of precipitation variability. For example, gauge corrected radar derived rainfall merged with satellite information is found to be superior to satellite estimates alone, partially due to its high spatiotemporal scale (Cole and Moore, 2008;Qiu et al., 2020;Yu et al., 2020). Using merged products could be very useful for detecting the location of intense precipitation and can be directly applied to hydrological applications such as flood forecasting (He et al., 2011). However, this method presents its own challenges such as limitations in our understanding of the Z-R relationship used to convert radar reflectivity (Z) into rainfall (R) (Harrison et al., 2000;Scofield and Kuligowski, 2003;McKee and Binns, 2016) and also the variations that exist in the Z-R relationship between the locations and the drop-size distribution (Atlas et al., 1999).
Another valuable type of data that is commonly used is reanalysis products. The reanalyses generate precipitation through a data assimilation scheme by incorporating information from various sources such as satellite, radar, radiosonde along with a background model forecast to derive precipitation over a uniform grid (Bosilovich et al., 2008). Even though reanalysis products provide snapshots of the weather conditions at regular intervals over long periods and have a better spatial consistency, they still suffer from errors and uncertainties arising due to poor data quality, lack of observations and deficiencies in model parameterizations (Cui et al., 2017), and thus may not be suitable for climate assessment studies (Bengtsson et al., 2004;Thorne and Vose, 2010;Donat et al., 2014). However, some recent studies incorporating the latest generation of reanalysis data such as the work done by Ashrit et al. (2020) and Li et al. (2021) suggest the potential utility of the latest generation of reanalysis data for climate studies.
The precipitation products detailed above have been used in numerous studies for multiple purposes, including documentation of climate variability, evaluation of different climate models at global and regional scales, and climate change attribution. However, the observed precipitation is uncertain among the datasets due to the different data sources, quality control schemes, and estimation algorithms. Moreover, the poor consistency between the products is far more significant when it comes to deriving extreme precipitation (Simmons et al., 2010;Trenberth, 2011) and in capturing the extreme daily rainfall properties (Hofstra et al., 2010;AghaKouchak et al., 2011;King et al., 2013).
Evaluation of precipitation products over NZ has so far focused on bespoke assessment of new regional products, such as VCSN and BARRA-R, without comparison across the full range of regional and global products. For instance, Pirooz et al. (2021) studied the ability and accuracy of three global (ERA-Interim, ERA-5, and ERA-5 Land) and one regional reanalysis (BARRA-R) products in capturing the variability over New Zealand by analysing a range of meteorological parameters such as the mean and gust wind speeds, total precipitation, and surface air temperature. The authors noted that the regional reanalysis dataset BARRA-R performs better for precipitation and temperature than global reanalysis products such as ERA-Interim over New Zealand. However, BARRA-R fails to capture the frequency of high wind gusts, unlike ERA-Interim and its successor ERA5, over most of the mountainous region over the South Island.
This study comprehensively evaluates the ability of a wide range of daily precipitation products obtained from satellite, reanalysis, and gauge based products to capture the mean and extreme precipitation features over New Zealand. The two goals of this study are to (a) provide an intercomparison of various daily precipitation datasets and evaluate the consistency among them at annual and seasonal timescales and (b) examine the ability of different products to represent the wet tail distribution using a set of climate extreme indices. This study also provides valuable guidance for choosing a reliable product for mean and extreme precipitation features that can be potentially used as a reference dataset for model evaluation studies (Schwalm et al., 2013).
The 10-year 2001-2010 analysis period has been selected as the only overlapping period between all the datasets, such that the analysis can be performed consistently across all products. Nominally, 10 values (years) is short for accurate estimation of correlation coefficients (and other metrics) between too partly dependent series. However, the observational products we examine here (discussed in section 2) are supposed to be measuring the same quantity, hence the null hypothesis for any statistical significance estimation would be that the correlation coefficients are 1. Sizeable deviations from r = 1 are only possible if at least one of the data sets has sizeable errors, hence finding correlations much less than 1 is clear evidence of errors, while r = 1 is a necessary (but not statistically sufficient) condition for consistency with a lack of relative error between the two data sets. In other words, our analyses on the 10 years of data can be sufficient for correct identification of poor precipitation products, but not necessarily for correct identification of accurate precipitation products.

| Study region
New Zealand lies in the mid-latitude southwest Pacific between latitudes 34 S and 47 S and is thus exposed to subtropical anticyclones to the north and the westerly wind belt, or "Roaring Forties" to the south. The location and intensity of these circulation features significantly affect the country's rainfall pattern (Watts, 1947;Garnier et al., 1950;Tait and Fitzharris, 1998;Jiang et al., 2013;Prince et al., 2021;Reid et al., 2021). In addition to these circulation patterns, the country's rainfall is also strongly influenced by the local topography, producing a distinctively regionalized rainfall pattern (Salinger, 1980). For instance, the mountainous spine of the South Island with a southwest-northeast orientation provides a significant barrier to the mean westerly flow (Figure 1). The moisture-laden flow from the west deposits large quantities of rain over the west coast as it ascends while the eastern side of the Southern Alps is characterized by a rain shadow, thereby making these regions the "wettest" and the "driest" over the entire country, respectively. The west-east gradient across a fairly short distance is larger over the South Island than in the North Island. This is clearly evident in Figure S1, Supporting Information where the precipitation gradient between Milford and Dunedin (MF and DN in Figure 1, respectively) is about 6,000 mmÁyear −1 (dashed line T2), while the difference between New Plymouth and Gisborne (NP and GS in Figure 1) is only about 390 mmÁyear −1 (dashed line T1).
Given the strong dependence of rainfall on topography and to obtain a regional perspective on precipitation variability, we break up the station data into four coherent regions: west North Island (WNI), east North Island (ENI), west South Island (WSI), and east South Island (ESI) as shown in Figure 1 using a simple classification based on previous subjective classifications detailed in Salinger (1980), Mullan (1998), and Zheng and Renwick (2003). Very approximate land areas occupied by individual regions are calculated and displayed in Figure 1 caption. The state values are approximate because the regions are defined based on station membership rather than boundaries.  This product is based on spatial interpolation of the reference station data using a thin plate smoothing spline model Tait and Turner (2005) and Tait et al. (2006) NZ station data Gauge data G https://cliflo.niwa.co.nz/ Abbreviations: G, gauge data; R, reanalysis; S, satellite data.

| Datasets
This study examines several datasets from in situ, reanalysis, satellite-based, and merged products at various spatial resolutions in order to quantify the daily precipitation estimates over New Zealand. Further, we also use daily data for the calculation of extreme climate indices (discussed in detail in section 2.4) to capture the signature of extremes among different datasets. By restricting ourselves to daily data we disregard highintensity convective rainfall events that may only have short durations; these are not as well captured by operational networks (Schroeer et al., 2018;Alexander et al., 2019). Using daily data for rare extremes might also suffer from aliasing issues due to the difference in the 24-hr accumulation period of records between the national products such as VCSN and station data (0900-0900) and the global products (generally 1200-1200). Although this issue is minimized when statistics are taken over long periods or for low to moderate extremes, it may influence the more extreme indices with our 10-year sample (Timmermans et al., 2019). Since New Zealand mainly experiences large-scale precipitation, especially over the west coast of the South Island, as shown in Figure S2, daily measurement frequency is generally sufficient for assessing many rainfall-related hazards. The different datasets and their most salient features are listed in Table 1. Owing to the differences in time periods covered by different datasets, only products including data from 2001 to 2010 are considered in this study. The rationale for the selection of these gridded precipitation products for evaluation is that they represent the state of the art precipitation datasets available. Further, we also explore a range of horizontal resolutions (coarser [GPCP] to finer [CHIRPS and IMERG]) and their ability to resolve the precipitation over complex topography and its variability over short distances across NZ. Overall, it is anticipated that the intercomparison of these products will help in the identification of preferred and reliable datasets for capturing the mean and extreme characteristics of precipitation over NZ.

| Reference data
We compare individual datasets against a "ground truth" which in our case are station data from New Zealand's National Climate Database (https://cliflo.niwa.co.nz). Although the station data are probably the best choice for benchmarking our evaluation, they still suffer from errors and uncertainties which may compromise their use if not corrected. The station data obtained from the National Climate Database have already had standard quality control measures recommended by WMO applied. Other rigorous quality control measures using machine learning algorithms are used operationally on the station data to detect gaps and flag anomalies in the dataset (https:// niwa.co.nz/news/from-sky-to-server). More information highlighting the data quality control measures and other potential errors associated with the station sites can be found in Ministry for the Environment (2020). In addition, we applied further selection criteria to obtain the final dataset used. Stations were selected as follows: 1. Only daily records from stations that have a minimum of 12 hr of data on a daily timescale are considered in this study. This ensures that the daily rainfall accumulations are consistent with values from other rainfall products. 2. From the stations that satisfy criterion 1, data from 2001 to 2010 were then used.
The above selection criteria resulted in a set of 88 AWS (Automatic weather stations) operated station sites from around New Zealand. Station IDs and their corresponding coordinates are detailed in Table 2. The gauge network over New Zealand is limited in some areas, particularly over the high alpine region in the South Island, where observing sites are particularly sparse. Using station data as a reference will also bias our evaluation towards those products that have incorporated these data. Hence, agreement, or the lack thereof, of various products with the station data will be considered in context.

| Methods
To evaluate the skill of various precipitation products with respect to the station data, we use a range of parametric and nonparametric tests, such as the Pearson correlation coefficient (PC), Spearman rank order correlation (SR), and the mean bias (MB). These metrics help us examine the performance of different datasets when compared with the station data. Additionally, to quantify the differences among the various datasets in capturing the accumulated seasonal precipitation pattern, we calculated the relative difference percentage (RDP) of all the datasets using station data as a reference in each defined region as shown in Table S1. The formulation of these metrics is detailed below. 2. Spearman rank order correlation

Pearson correlation coefficient
3. Mean bias Here, D and S are different precipitation products and station data, respectively. Cov and σ denote the covariance and standard deviation between a station point and the nearest dataset grid point. T, n denote the total time period (3,652 days) and the number of observations, respectively.
The metrics were calculated on daily precipitation time series between each of the 14 products and the station data over the period of 2001-2010. For the gridded products, the nearest grid point method was used to calculate all metrics relative to the station location. Using the nearest grid point method is often a good choice for these studies since the magnitude of precipitation is preserved over the calculated grid as opposed to using other methods such as bi-linear interpolation schemes that tend to smooth the precipitation field, resulting in higher bias for larger rainfall regions (Acharya et al., 2019). However, uncertainties do exist when comparing precipitation estimates from gridded products especially from satellites that capture "snapshots" of area averages as opposed to the point value of precipitation accumulation over a period recorded by the station data.
We use a series of indices of climate extremes as shown in Table 3 to investigate the ability of different products to represent and capture the signature of precipitation extremes. These were developed by the CLIVAR Expert Team on Climate Change Detection and Indices (ETCCDI) (Easterling et al., 2003;Zhang et al., 2011;Alexander et al., 2019).
To assess the intensity and frequency of extremes, we calculate the simple daily intensity index (SDII), R10mm, and the R99P indices across the different datasets. The indices are quantified for every calendar year for all products. The individual yearly values (10 in total) are then averaged to obtain the climatological mean value defined over 2001-2010. For duration-based indices, such as consecutive wet and dry days (CDD and CWD, respectively), the climatological mean value is defined over 2001-2009 in order to allow cross-calendar sequences across the end points. Only wet-days (≥1 mm) are considered in the calculation of these indices. The base period for the percentile based indices, such as the R95P and R99P, is defined over 2001-2010. Moreover, we omit CHIRPS from these calculations due to the dataset's poor ability to represent the count of wet days on which the calculation of many of these indices is dependent.
We use the Kling-Gupta efficiency (KGE) (Gupta et al., 2009) metric to assess the overall performance of different precipitation products relative to the station data. Although there are different forms of the KGE metric, we use its nonparametric form, hereafter termed NP-KGE. This metric encapsulates three statistical properties of the datasets, namely the Spearman rank correlation coefficient (SM), the coefficient of variability (CV), and mean bias ratio (MB) (Pool et al., 2018). The NP-KGE metric is effectively normalized between 1.0 and −1.0. The best value of NP-KGE is 1.0, which indicates a perfect match between station data and gridded product while a value of −1 suggests a poor match between the station and gridded product. The expression for NP-KGE can be written as The Spearman rank correlation (SM) evaluates the degree of similarity in the time series between the station point and the gridded precipitation product. Using SM as opposed to the Pearson correlation coefficient (used in the parametric form of the KGE metric) gives an added advantage of being less sensitive to extreme values and the presence of outliers, leading to a more robust characterization of the correlation (Krause et al., 2005). The coefficient of variability assesses the ratio of standard deviation to the mean of the time series, while the mean bias evaluates the ratio of bias relative to the reference station. Previous studies have used the KGE metric to evaluate the overall performance and accuracy of different precipitation products, mainly reanalysis datasets (Siqueira et al., 2018;Beck et al., 2019;Avila-Diaz et al., 2021).

| Spatial distribution
We begin our analysis by comparing the spatial distribution of annual mean precipitation in New Zealand for the period 2001-2010 in Figure 2 using the datasets listed in Table 1. Overall, there is a reasonable agreement regarding rainfall distribution with the highest rainfall on the western side of the Southern Alps in all datasets, while the lowest amounts occur over the east coast of the South Island. Although the datasets reveal a similar large-scale pattern, they present significant differences at a finer scale, especially over the mountainous areas. Over WSI, both VCSN and CPC underestimate the total mean precipitation amount relative to the station data, although the relative difference between VCSN and station data was more substantial than CPC. Tait et al. (2012) also noted a similar underestimation in VCSN at higher altitudes due to the sparse observations there. We expect reanalysis and satellite products to estimate higher precipitation totals in high-altitude areas, because they include measurement/simulation over the peaks, and they do, especially over the Southern Alps. The precipitation simulated over the high terrain in the South Island from all five reanalyses (see Figure 2 midpanel) is higher, with BARRA-R values exceeding 12 mmÁday −1 . Closer inspection of the reanalysis products reveals large differences between MERRA-TP and MERRA-PC in capturing the orographic precipitation, with MERRA-TP capturing a similar pattern to other reanalysis and station data, but MERRA-PC underestimating the precipitation amounts. The underestimation of precipitation by MERRA-PC and overestimation by MERRA-TP is consistent with previous work detailed in Reichle et al. (2017) which showed similar biases globally over high terrain. Higher-resolution satellite merged products such as IMERG and MSWEP marginally overestimate the precipitation amounts relative to the station data, while the coarser products such as CMORPH and GPCP depict lower precipitation totals over WSI. Figure 3 shows accumulated seasonal precipitation (DJF, MAM, JJA, and SON) over the four regions during the study period. For a complete seasonal cycle, the analysis was performed using data from December 2000 to February 2011 except for GPM-IMERG, where the seasonal cycle is considered from January 2001 due to spurious outliers found in the dataset for December 2000.
Over the entire North Island (WNI and ENI), we note that a higher fraction ($33%) of total annual precipitation falls in JJA (winter), while the remaining seasons contribute to about 22-25% of the total annual precipitation depending on the dataset. The higher fraction of winter precipitation can be attributed to the increase in the number of frontal systems encased in the midlatitude westerlies that pass over NZ (Watts, 1947;Salinger, 1980;Shu et al., 2021). Conversely, the regions in the South Island are far enough south to be embedded in the westerlies and are thus less influenced by the frequency of frontal passages (Tait and Fitzharris, 1998). Over the South Island (WSI and ESI), we see an opposite pattern with higher precipitation in the summer (DJF) than in winter. However, we note that the higher fraction of summer precipitation contributing to the total annual precipitation is more prominent over ESI than over WSI where the mountain ranges and the prevailing westerlies act as the major drivers of precipitation.
Over WNI, the precipitation values are underestimated year-round by CHIRPS, GSMaP, and CMORPH, while IMERG and GPCP overestimate precipitation with more significant differences during the winter season when the largest fraction of total precipitation falls in this region. Significant overestimation of winter precipitation over WNI also occurs in BARRA-R, with higher differences during the spring and summer than in the winter season. In the Southern Alps, most satellite products, including GSMaP, CMORPH, and GPCP, underestimate the magnitude of precipitation, while the suite of reanalysis products, excluding MERRA-PC, overestimates precipitation amounts year-round relative to the station data. One possible reason for the underestimation of precipitation could be poor detection of shallow orographic convection by the satellite products at higher altitudes by the retrieval algorithms used (Kubota et al., 2007;Kwon et al., 2008;Sohn et al., 2010), while reanalysis products likely suffer from model deficiencies in the representation of orographic precipitation and issues with the lack of upper air data assimilated into the model cycle, leading to their overestimation (Lorenz and Kunstmann, 2012;Rana et al., 2013).

| How often does it actually rain?
Given the sporadic nature of precipitation, it is also important to quantify not just the accumulated and mean precipitation pattern, but also the frequency and duration of rainfall events. Here we examine the total number of wet days for different rainfall intensities expressed as a percentage of the total period, 3,652 days (10 years). For this study we choose three different thresholds: 0.5, 1.0, and 2 mmÁday −1 , respectively. We chose 0.5 mmÁday −1 as the lowest threshold since this is the practical limit for detecting precipitation from space-based instruments (Trenberth and Zhang, 2018).
Broadly, the geographic patterns of the frequency at various thresholds (see Figures 4,S3, and S4) align closely with the accumulated precipitation (not shown). This implies that highest precipitation frequency occurs where the highest rainfall amounts are recorded, which is on the west coast of the Southern Alps and a few other areas with steep west-facing topography, as in the central North Island. Overall, among the different satellite-based products except for CHIRPS, there is a reasonable similarity in terms of the spatial distribution of precipitation with the highest >0.5 mmÁday −1 frequency of about 60-65% over the Southern Alps as shown in Figure S3. In the case of high-resolution datasets such as IMERG and GSMaP, the frequency of rainfall occurrence looks similar over the land regions, while they differ over the oceans. One possible reason for this difference over the oceanic regions could be the lack of high-quality in situ observations to validate the precipitation estimates from satellites over ocean regions (Burdanowitz et al., 2015). However, at higher rainfall thresholds, as shown in Figures 4 and S4, the frequencies over the ocean and land areas have minor differences when compared to each other. Overall, GPCP shows the lowest wet day frequency of all the satellite-based products likely due to its coarser grid size, thereby leading to a larger and wider swath path, while MSWEP depicts the highest frequencies over the wettest regions. A similar observation regarding the low frequency of wet days by GPCP over highprecipitating areas was also noted by May (2004).
Previous inspection of the CHIRPS dataset revealed that the mean precipitation pattern agreed well with the other products, whereas the frequency of occurrence at various thresholds exhibits significant differences. Katsanos et al. (2016), studying Cyprus, also noted a high mean spatial correlation of rain gauges with the corresponding CHIRPS grid cells, but poor detection of wet day frequency. Over New Zealand, the region with highest wet-day frequency detected by CHIRPS is located over the east of the Southern Alps, a dry region with little precipitation. This discrepancy in terms of spatial shift can likely partially be explained by the cold cloud duration (CCD) algorithm used by CHIRPS to derive precipitation rate (Funk et al., 2015). The CCD algorithm relies on the assumption that convective cloud is the main source of rainfall, thereby more convective precipitation. However, over the east of the Southern Alps, the large-scale precipitation is the dominant type of precipitation as shown in Figure S2, which could likely explain the low count of wet days and thus is not suitable for assessing frequency of wet days over NZ.
On a seasonal timescale, an assessment of wet-day frequency estimates as RDP relative to station data is shown in Figure 5. Most of the features discussed in section 3.1 can be directly observed and related in terms of wet-day frequency. In particular over WSI, we see that the reanalyses products tend to model a higher count of wet days, leading to higher precipitation amounts than satellite merged products during the summer season. This tendency of simulating higher precipitation amounts over the mountainous regions by the global reanalysis datasets has been reported in other studies (Cui et al., 2017). However, in the rain shadow zone over the ESI, the reanalyses tend to show higher frequencies than the station data in all four seasons, which is consistent with our previous findings on the distribution of seasonal mean precipitation patterns.

| Statistical metrics
The Pearson correlation coefficient (PC) averaged across the four geographical regions on an annual time scale is shown in Figure 6a. MSWEP is in excellent agreement with station data, with coefficients exceeding 0.85 over most of New Zealand. The global reanalysis and the gauge-based products are also strongly correlated with the station data across both islands. However, other products such as GSMaP, GPCP, and CHIRPS tend to show lower (or very low) correlations, especially over ESI. CPC and VCSN tend to show a higher correlation at most station points among the land-based products. This is expected since the values are solely interpolated based on the station data and thus are not at all independent.
In terms of mean bias (MB) relative to the station data, we note that there is significant variability in the magnitude and sign of MB based on different products (see Figure 6b). For instance, over WSI, we see that reanalysis products such as BARRA-R, ERA5, and MERRA-TP have a large relative wet bias in the high orographic areas, while MERRA-PC has a dry relative bias. This variability in the nature of the mean biases over the WSI is also exhibited in the satellite products, where IMERG and MSWEP show a smaller positive bias (overestimation) while GSMaP, CMORPH, and GPCP show larger negative biases relative to the station data. One possible reason for this discrepancy in the mountainous regions for all the reanalysis and satellite-gauge products could be the remote sensing instruments' poor detection of nonconvective precipitation (Ebert et al., 2007;Tian et al., 2007) in addition to the poor sampling of precipitation in the high-altitude regions where there is a paucity of in situ observations. In addition to this, some differences between the station observations and satellite products would also arise because satellite instruments measure areal averages as opposed to the point observations recorded by stations.
Further, we also computed the nonparametric coefficients Spearman's rank order correlation (SM) and Kendall's tau-b (KT) to quantify the spatial coherence F I G U R E 5 Seasonal and regional frequency occurrence in different datasets expressed as relative difference percentage (RDP, %) calculated using station data as reference [Colour figure can be viewed at wileyonlinelibrary.com] and the strength of association between the datasets and the station data. As shown in Figure S5, MSWEP has the highest correlation among all the products irrespective of the statistical measure used and over any region.

| Climatology of indices
A 10-year climatology of each of the extreme indices in Table 2 was analysed. The climatology is calculated at every station and later averaged across all the stations belonging to the same geographical region as shown in Figures 7 and S6, respectively. Overall, when compared with station data, we note clear differences and interproduct spread (defined as the ratio of standard deviation to the mean) among different datasets for most of the indices. However, the spread among the datasets belonging to a particular data type (satellite, reanalysis, or in situ) also varies considerably and depends strongly on the nature ("moderate" extreme vs. "extreme" extreme) and calculation of the particular index. For instance, over WSI which is the wettest region of NZ, we see that the spread within each type of product for a "moderate" extreme index, such as the simple daily intensity index (SDII), is fairly small, ranging between 13 and 17%, while for a more "extreme" index like the maximum 1-day precipitation (Rx1day), the interproduct spread for datasets belonging to the same type is   about 18-30% and varies drastically depending on the specific products. Regional differences were noted in the extreme indices that were similar to those in the mean precipitation estimates among the different data types. Over the North Island, we note a clear difference between most of the satellite and reanalysis estimates wherein the satellite products such as IMERG, GSMaP, and GPCP overestimate the intensity of precipitation (as seen in Figure 7a) while all the reanalyses underestimate it. Furthermore, for the count-based indices such as the R10mm as shown in Figure 7b, the satellite products tend to form into "wet" (IMERG, MSWEP, and GPCP) and "dry" (GSMaP and CMORPH) groups, thereby either over or underestimating the count with reference to the station data. In contrast, the reanalysis products, except BARRA-R, tend to slightly underestimate the duration of R10mm. For more "extreme" count based indices, such as R30mm, both the satellite and reanalysis clusters again form into "wet" and "dry" groups, a pattern which is more apparent over the WSI than in other regions. Figures 8 and S7 show the NP-KGE scores between the individual precipitation datasets and the station data at both annual and seasonal timescales. Overall, MSWEP, BARRA-R, and VCSN show higher scores (0.8-0.9), while GPCP performs poorly with negative KGE values at both annual and seasonal timescales. For indices belonging to the same category (intensity, frequency, or percentilebased), there are clear differences over the defined geographical regions (WNI, ENI, WSI, and ESI) among some of the products. For example, the west east difference in the magnitude of the NP-KGE metric for intensity based indices such as the Rx1day (see Figure S7a) for both satellite and reanalysis products is very large across both the islands, although it is more prominent over the South Island likely due to the forced orographic trigger that modulates the intensity of precipitation. SDII, however, shows minor differences in NP-KGE score depending of data choice. This regional difference between indices belonging to the same category (in this case intensity) could, as noted previously, be to do with the "extreme" nature of the Rx1d index compared to the SDII, which is a more "moderate" measure.

| Performance evaluation
Similarly, IMERG has moderate to high KGE scores in the range of 0.5-0.95 over the North Island for R10mm (Figure 8b), while there is a drop in the statistical KGE scores for the more extreme R30mm index ( Figure S7b). Reanalysis and gauge based products also reveal moderate to high values of KGE score over much of New Zealand in detecting the frequency of R10mm, while the performance of the products is degraded for the R30mm index. This is not surprising given the known inability of reanalysis products to capture local precipitation extremes especially in the high terrain regions (Dulière et al., 2011).
Percentile indices such as the R95P and R99P have moderate to high KGE scores over the whole of New Zealand for all products except the GPCP dataset, though some regional and seasonal differences exist.   Figure 10 displays the overall performance ranking for both mean and extreme precipitation based on the NP-KGE values for all precipitation products in the defined geographical regions. For the mean features, the overall rank of the metric is based on the KGE scores for mean precipitation only while for extremes, the overall rank is calculated as the mean of the KGE scores obtained from all the indices on an annual timescale. The largest overall skill independent of the geographical domain was demonstrated by MSWEP, followed by BARRA-R and VCSN. However, a closer examination reveals some domain dependent performance issues in a few satellite and reanalysis products. For instance, on comparing WSI and ESI, we note a significant difference in the performance of individual products. While IMERG and CMORPH performed moderately well with an overall skill score of 0.6 over the WSI, this value reduces to 0.2-0.3 over the rain shadowed region on ESI. In the case of GSMaP, we see a reversed pattern with slightly better performance over the ESI than in WSI. The group of reanalysis products consistently performs well, similar to BARRA-R, with overall scores of 0.5-0.7 over WSI. However, their scores drop marginally over ESI, which is particularly apparent in the case of coarser resolution products, such as MERRA-PC and CFSR. Thus, the usage of various products in different regions is therefore informed by its ability to capture the precipitation variability and its performance relative to the station data measured using the NP-KGE metric.

| DISCUSSION
In this study, we have evaluated 14 precipitation products from multiple sources, mainly the gauge based, merged satellite, and reanalysis datasets in capturing the mean and extremes over New Zealand. We compared the annual and seasonal mean precipitation across datasets to investigate the ability of the products to capture the regional features of precipitation relative to the station data. We also examined the overall skill of the products in representing mean and the extremes using the nonparametric Kling Gupta efficiency metric.
Based on the statistical scores, we deem MSWEP to be the best all-round dataset over New Zealand, with a correlation coefficient of greater than 0.85 at most gauge locations. This may be because the product is unique in the way that it merges gauge, satellite and reanalysis data to obtain its precipitation estimates (Beck et al., 2019). Furthermore, this dataset incorporates gauge reporting times to minimize the temporal mismatch between the satellite-reanalysis estimates and gauge observation. Beck F I G U R E 1 0 Overall average of NP-KGE metric for both mean and extremes calculated for different datasets relative to station dataset. For mean, the overall average KGE includes the contribution from only mean precipitation while for extremes, the average is obtained using the KGE scores from all the climate indices as discussed in Table 3 et al. (2017) also highlighted the suitability of MSWEP across New Zealand, given its long temporal record, full global coverage and good performance in both densely gauged and ungauged regions such as over the Southern Alps. Among the reanalysis products, BARRA-R and ERA-5 seem to perform better in all regions and seasons.
The VCSN and CPC gauge data are consistent with the station data, with higher correlation coefficients, as expected since they directly incorporate the gauge data. For extremes, the nonparametric Kling Gupta efficiency index (NP-KGE) succinctly summarizes the product to product variability for all the extreme indices over different regions as shown in Figures 8 and S7. Among satellite merged products, MSWEP again outperforms the other datasets with NP-KGE scores of about 0.8-0.9, while BARRA-R and VCSN perform well among the reanalysis and gauge based products, respectively. The main limitation of any study like this is that it requires a reference for comparison, and that reference may itself have a systematic bias. We have adopted station gauge data because it is the only in situ type of measurement, it is generally considered authoritative, and it is usually used as the reference for hydrological and other applications. However, the distribution of stations is known to underestimate the total moisture budget in mountainous areas of NZ (Tait et al., 2012). In that sense, certain biases relative to the stations might be considered desirable, for instance that the reanalyses predict more precipitation over the Southern Alps, notwithstanding some ambiguity on how much more precipitation is desirable.
A further issue arises when considering rare extremes. Whether or not a daily extreme is deemed to have occurred on a specific day depends on synchronization of the extreme 24-hr accumulation with the 24-hr period of record. These periods differ between products, being 0900-0900 for national products (stations and VCSN) and generally 1200-1200 for the global products. This phase shift issue reduces when statistics are taken over long periods, but may influence the more extreme indices with our 10-year sample (Timmermans et al., 2019). Another limitation is that our regional definitions are relatively simplistic and have been selected using a simple subjective analysis. This potentially introduces sampling biases especially in the high terrain areas because rain gauges tend to be located at lower elevations. This also means that the representativeness of the stations used is different in the different regions (WNI, ENI, WSI, and ESI).

| CONCLUSIONS
On an annual time scale, all the datasets, irrespective of their source, can reproduce most of the precipitation features over New Zealand, such as the highest precipitation amounts over the west coast in the Southern Alps and in areas with elevated topography. However, only the datasets with higher spatial resolution, such as the IMERG, MSWEP, GSMaP, BARRA-R, and ERA-5, capture the distinct west-east gradient in the South Island, while the coarser spatial resolution datasets, such as GPCP, fail to represent this gradient well.
We expect that high-resolution datasets that are not interpolating over topography (i.e., not VCSN) should have higher wet day frequency over the highest-altitude locations. As expected, the total count of wet days simulated by both global and regional reanalyses over mountainous regions is higher than the station data. This overestimation in reanalyses might be associated with model deficiencies in accurately representing precipitation over complex terrain and the lack of upper air data used in the assimilation cycle (Cui et al., 2017). On the other hand, only the high-resolution satellite products, such as IMERG and GSMaP, can capture the higher frequency of wet days over the high orographic areas while the coarser resolution datasets, such as GPCP, fail to represent this pattern. Global maps of precipitation and its trends, such as in reports of the Intergovernmental Panel on Climate Change, generally use observational products and climate models with spatial resolution no finer than GPCP's, or otherwise present averages across NZ. Our study finds that these datasets representation of precipitation is flawed over NZ likely due to contributions from the strong precipitation gradients across NZ.
One key feature that stands out in the analysis is the wet day occurrence observed in CHIRPS. Although the mean precipitation pattern in the CHIRPS dataset was consistent with the other products, it shifts the location of heavy precipitation occurrence to the east in the South Island. This is contrary to all of the other products and the station data. We consider that CHIRPS should not be used for characterizing daily precipitation over NZ, and, moreover, should probably not be used for characterizing monthly or annual averages, because of its inability to accurately describe the underlying daily features.
So which dataset should be used to monitor mean and extreme precipitation over New Zealand? This investigation suggests that it depends on the particular aspect of the climate system and the region of interest that is under consideration. For example, if one is interested in understanding the mean precipitation features over the whole of New Zealand, then MSWEP outperforms all other products, followed by IMERG, ERA-5, and BARRA-R. However, for extremes, BARRA-R appears better than ERA-5 especially over the North Island, perhaps due to its higher spatial resolution and assimilation of local station observations in its model cycle (Pirooz et al., 2021). MSWEP is also, however, a strong performer in representing the extremes. BARRA-R and MSWEP both have the best overall performance score, suggesting their potential reliability for capturing both the mean and extreme characteristics of precipitation across New Zealand. Furthermore, we have also noted some domain-dependent product performance that should be of interest to researchers and stakeholders investigating the precipitation variability over specific regions.