Assessing the performance of parametric and non‐parametric tests for trend detection in partial duration time series

The detection of nonstationarities in partial duration time series (PDS) depends on several factors, including the length of the time series, the selected statistical test, and the heaviness of the tail of the distribution. Because of the more limited attention received in the literature when compared to the trend detection on block maxima variables, we perform a Monte Carlo simulation study to evaluate the performance of different approaches: Spearman's rho, Mann–Kendall, ordinary least squares (OLS), Sen's slope estimator (SEN), and the nonstationary generalized Pareto distribution fit to identify the presence of trends in PDS records characterized by different sample sizes (n), shape parameter (ξ) and degrees of nonstationarity. The results point to a power gain for all tests by increasing n and the degree of nonstationarity and by reducing ξ. The use of a nonparametric test is recommended in samples with a high positive skew. Furthermore, the use of sampling rates greater than one to increase the PDS sample size is encouraged, especially when dealing with small records. The use of SEN to estimate the magnitude of a trend is preferable over OLS due to its slightly smaller probability of occurrence of type S error when ξ is positive.


| INTRODUCTION
The changes in the frequency and intensity of extreme hydrological events, such as floods and droughts, have motivated studies to detect trends to support water resources management planning, either by the scope of the influence of their effects (e.g., economic, environment, risks to human life), or by the need of anticipating present actions in face of a changing and uncertain future (e.g., Kundzewicz et al., 2008).Thus, parametric and nonparametric statistical tests have been widely used in trend detection analyses involving hydrometeorological variables, such as flow, temperature, and precipitation (e.g., Alkama et al., 2013;Bayazit, 2015;Lettenmaier et al., 1994;Madsen et al., 2014;Vogel et al., 2011).However, the ability of these methods to identify the presence of a nonstationarity is intimately related to sample characteristics (e.g., size, variability, and asymmetry, e.g., Chiew & McMahon, 1993;Pittock, 1980;Yue et al., 2002).The lack of knowledge about how these factors affect the detection of trends can lead to errors in the identification of nonstationarity and has been the subject of numerous studies (e.g., Chiew & McMahon, 1993;Pittock, 1980;Radziejewski & Kundzewicz, 2004;Yue et al., 2002;Yue & Pilon, 2004).For instance, Yue et al. (2002) investigated the power of the Mann-Kendall (MK) (Kendall, 1975;Mann, 1945) and Spearman's rho (e.g., Lehmann & D'Abrera, 1975) when applied for trend detection in independent and noncorrelated time series following different distributions (i.e., normal, log-normal, pearson type III, and Generalized Extreme Values [GEV] type 1 [Gumbel], type 2 [Fréchet], and type 3 [Weibull]) with different values of variance, samples sizes, trend magnitude, and significance level.They found that the power of both tests increases as the magnitude of the trend and sample size increases, and/or the sample variance reduces.The statistical distribution and skewness also exert influence on the power of the test.
In addition to Yue et al. (2002), Totaro et al. ( 2019) compared the performance of parametric and nonparametric methods for trend detection when applied to GEV variates with different sample sizes, scales, and shape parameters.As a result, a higher percentage of true trend identification was obtained by using the likelihood ratio (LR) test.In general, the presence of nonstationarity was detected by the existence of statistically significant differences in the likelihood function when selecting between the GEV stationary and nonstationary models.They also reported that the power of the Akaike Information Criterion Ratio test (AIC R ) is always larger than what estimated by the MK test, except when the shape parameter was highly positive (ξ = + 0.4).Moreover, a significant difference of power between MK, LR, and AIC R was noted when the sample size was small, and even more when the GEV distribution is heavy-tailed (ξ = +0.4).Gelman and Carlin (2014) proposed a novel and complementary approach for trend detection.They suggested that the evaluation of stationarity can be complemented by the estimation of Type S (signal) and Type M (magnitude) errors by using Sen's slope estimator (SEN;Sen, 1968).Instead of using exclusively the conventional approach (Type-I and Type-II errors), Gelman and Carlin (2014) pointed out that is possible to erroneously estimate the signal and trend's magnitude, especially for highly noisy and small samples.
Despite the effort to improve our understanding of the limitations of the methodologies commonly used in trend detection analysis, most of the literature is focused on block maxima (i.e., one event per block) and neglects important aspects related to the power of tests when dealing with POT series.Widely used in a variety of analysis (e.g., Gyarmati-Szab o et al., 2017;Mailhot et al., 2013;Mallakpour & Villarini, 2015;Prosdocimi et al., 2015;Roth et al., 2012;Villarini et al., 2013), the POT approach was reported to be advantageous in comparison to the block maxima in estimating Gumbel quantiles when the rate of a Poisson process is equal to or greater than 1.65 (λ ≥ 1.65) (Taesombut & Yevjevich, 1978) and in flood frequency analysis applications (Bezak et al., 2014).Furthermore, the use of partial duration series was strongly recommended by Cunnane (1973) when dealing with hydrological variables (especially for samples with less than 10 years of records), despite the difficulties in selecting the POT series to ensure independence between successive events and to select the threshold that produces the lowest bias of quantile estimates (Rutkowska et al., 2017).
Based on this overview, the detection of trends in POT records, albeit important, has received much less attention in the literature.Therefore, our goal is to characterize the strengths and weaknesses of different methods to detect trends in partial duration time series.Here, we use a Monte Carlo (MC) approach and evaluate the performance of different parametric (i.e., ordinary least squares [OLS] and generalized Pareto distribution [GPD] fit) and nonparametric (i.e., MK, Spearman rho, and Sen's slope) tests to detect nonstationarities in POT series.

| Generalized pareto distribution (GPD)
In the POT method, a variable X over a certain threshold μ (X > μ) is assumed to follow the GPD.The GPD cumulative distribution function (F μ,σ,ξ ) can be expressed as: where σ > 0, ξ ℝ are parameters of scale and shape, respectively.
Here, we set μ equal to zero and ξ as constant.Only the scale σ parameter is a linear function of time (t) with an exponential link function (Coles et al., 2001): where β is equal to 1 and β 1 represents the slope of the nonstationary linear scale parameter term.The vector T is obtained assuming that the time between the occurrence of two consecutive POT events (t i ) can be described by an exponential distribution, because the events are obtained from a Poisson process with a rate λ on [0, ∞).
In practice, T is a result of the accumulation of exponentially distributed values of t i randomly generated, with i being the number of intervals between two consecutive POT events and equal to the product of the record size (or the length of samples when λ = 1) and the Poisson process rate minus one (i max = n b Â λÀ1).In other words, each T i is equal to the summation of all t j , with j ≤ i.Furthermore, the summation of all times between consecutive POT events is equal to the record length independent of the rate λ ( P t i ¼ n b , 8λ j ).

| Monte Carlo simulations
We perform MC simulations to evaluate the trend detection capabilities of five commonly used approaches to detect trends in POT records: Spearman's rho, MK test, ordinary least squared regression (Rao et al., 1973), Sen's slope estimator, and nonstationary GPD fit (Coles et al., 2001).While the first three are nonparametric (i.e., MK, SP, and SEN), the other two methods are nonparametric (i.e., OLS and GPD_NS).Furthermore, the necessary steps to perform this MC are outlined in Figure 1 and were carried out by using R scripts (R Core Team, 2022).After randomly generating the t i 's values (Step 1, see Section 2.1) and computing the associated T vectors (Step 2, see also Section 2.1), we generate independent GPD variates (Step 3).This process is repeated 1000 times (i.e., Steps 1-3) for each possible combination between the different record sizes (n b = 30, 40,50,60,70,80,90,100), expected average number of POT events per year/ block (λ = 1, 2, 3, 4), slope coefficient of the GPD scale parameter (β 1 = 0, 0.01, 0.03, 0.05, 0.09), and shape parameters (ξ = À0.30,0, +0.30).The selection of the ξ values is based on the recommendations by Martins and Stedinger (2000), who defined the interval between À0.30 and + 0.30 as reasonable in hydrological applications.Moreover, the random GPD variates are paired with their respective T values (i.e., a vector containing the time of each POT event occurrence), which is needed for the Sen's slope and all parametric tests considered here (i.e., OLS and GPD_NS) when assessing the presence of trends.
Then, the parametric and nonparametric tests are applied to the GPD variates (Step 4) and the rejection rate of the null hypothesis (H 0 : no trend) is computed as (Step 5): where N is the total number of simulation experiments (i.e., 1000) and N rej is the number of experiments for which we reject the null hypothesis at the 5% significance level.For the GPD_NS method, the rejection of the stationary null hypothesis (H 0 ) occurs when the estimated value of β 1 is evaluated as statistically different from 0 (β 1 ≠ 0), considering that the test statistic (T ¼ β 1 =s b , s b is the standard error of the β 1 estimate) follows a Student-t distribution with n Â λ À 2 ð Þdegrees of freedom (Naghettini & Pinto, 2007).
In terms of type S Error, we perform an evaluation of the SEN and OLS methods considering only the generated times series that are nonstationary by both approaches.Therefore, the type S error is estimated as (Step 6): where N Type S error is the number of times that the estimated trend signal by SEN and OLS methods is negative (i.e., different from the positive β 1 assumed during the time series generation process) and N non_stat is the number of nonstationary series according to the MK and OLS, respectively.

| Nonparametric and parametric statistical tests
In this section, we present a brief description about the nonparametric and parametrical statistical tests used here to identify the presence of monotonic trends in random generated GPD variates.

| Mann-Kendall test
The MK statistic S is given as: where: When the null hypothesis is true (i.e., S = 0), Kendall (1975) showed that for values of n > 10, the distribution of S can be approximated by a Normal distribution, S $ N(0, σ 0 2 ), with null mean and variance given by: where t j is the tth tie in the series and m is the number of tie groups.Then, the test statistics Z c is computed as: Z c follows a standard normal distribution.A positive (negative) value of Z signifies an upward (downward) pattern.A significance level α is also utilized for testing either an upward or downward monotonic trend.
F I G U R E 1 Six-step flowchart of the proposed methodology to generate (1)-( 3), identify presence of nonstationarities (4), estimate the percentage of rejection ( 5) and probability of type S errors (6) in GPD variates with different sample size, sampling rates, shape parameter, and trend magnitude.

| Spearman rho test
The test static of Spearman rho (D) and the standardized test statistic Z SR are expressed as follows: and: where R i is the rank of the ith observation X i and n is the length of the time series.Positive values of Z SR indicate upward trends, while negative Z SR indicates downward trends in the time series.When jZ SR j > t (n-2,1-α/2) , the null hypothesis is rejected and a significant trend exists in the time series.

| Sen's slope test
We compute for each of the n(n -1)/2 possible sample observation pairs (X j , X i ) the ordered slopes (d ij ), which are given by: where i and j are the time indices of the sample positions occupied, respectively, by X i and X j , where 1 ≤ i ≤ j ≤ n.
The estimator of Sen's slope (β SEN ) is defined as the median of all d ij calculated.Therefore: Here, we use the t-test to assess whether β SEN is statistically significantly different from zero.

| T-test: Ordinary least squares and generalized pareto distribution fit
The estimated slope for the OLS quantiles and for the GPD_NS shape parameter (Equation ( 2)) have their statistical significance assessed by the application of the t-test.In other words, we determine whether these slope coefficients are different from zero at α significance level.
The statistic of the t-test (T) follows a Student-t distribution with n Â λ À 2 ð Þdegrees of freedom, and is computed by the ratio between the estimated slope ( β) of the linear regression model (i.e., X = α + βΤ + ε, with respect to time t, random errors ε, and constant intercept α) and its standard error SE(β) as follows: where the estimates of β and SE(β) are given by: where RSS is the sum of squared residual, which is expressed by: If jTj >> t (n-2,1-α/2) , we reject the null hypothesis of stationarity; otherwise, there is no evidence to reject H 0 .

| RESULTS AND DISCUSSION
Figure 2 shows the direct relation between the sample size (n) and the estimated rejection rate by three nonparametric (i.e., SP, MK, SEN) and two parametric (i.e., OLS and GPD_NS) tests, assuming λ = 1 and β 1 = 0.01 for different shape parameters.An increase in n b results in a progressive increase in the ability of these different methodologies to identify the presence of nonstationarities, when β 1 and the GPD's shape parameter (ξ = À0.30,0 and + 0.30) are held constant.Overall, the GPD_NS is the method with the highest percentage of rejections of the stationarity assumption independent of n b and ξ, reaching values that are almost three times larger than those estimated for the other four methods (e.g., n b = 30, ξ = À0.30,%Rej OLS = 7.6 and % Rej GPD_NS = 18.6).The difference between the rejection rates estimated by the GPD_NS and the other methods reduces as the sample size and the shape parameter increase, achieving its lowest value when ξ = +0.30and n = 100.
An increase in the value of the shape parameter results in decreases in rejection rates for all five methods.In addition, Figure 2 suggests that these methods are differently affected by the value of the shape parameter, where the GPD_NS and OLS tests experienced greater variations (increases or decreases) in the percentage of rejection when ξ changed for the same n.For example, when n b = 60 and ξ increases from À0.30 to +0.30, the decrease of the percentage of rejections for GPD_NS is equal to 25.3%, while it drops to 10.5% for the MK test.A decreasing pattern is observed for the OLS test for increasing ξ.The percentage of rejections related to the OLS drops abruptly from 67% (ξ = À0.30) to 25% (ξ = +0.30)when n = 100.For the same conditions, the MK test rejection rate ranged from 47.8% to 28.8%.
A similar performance of nonparametric tests is observed for all sample sizes and ξ evaluated here.Although less impacted by differences in the shape parameter, SP, MK, and SEN consistently exhibit lower rejection rates when compared to GPD_NS and OLS, except when ξ = +0.30.For example, the percentage of rejection of the nonparametric tests is close to the significance level (or 5%) for sample sizes smaller than 40 (Figure 2) regardless of the ξ value.All evaluated methods show rejection rates of less than 50% for samples with a size smaller than or equal to 70 years of record when β 1 = 0.01, highlighting the difficulties in detecting nonstationarities when the signal is small.
Although different in the magnitude of rejection rates, the same dependences between sample size (n), shape parameter (ξ), and test discussed in Figure 2 are observed for values of λ greater than 1, as shown in Supplementary Figures S1-S3.
An increase in β 1 is accompanied by an increase in the rejection rate and varies according to the value of the shape parameter (ξ) and the method used to assess the presence of nonstationarity in a given time series (Figure 3).There are higher rejection rates for all β 1 (0.01 (0.02) 0.09) and ξ (À0.30, 0, +0.30) scenarios when using GPD_NS, and the sample size is fixed at n b = 50.There is an inverse relationship between the rejection rate and ξ for all methods.However, the OLS is the approach most impacted by the increase in ξ.For example, considering n b = 50 and β 1 = 0.09, the difference in the percentage of rejections decreases from 89.7% to 52.2% (i.e., 37.5% reduction in the % Rej).For the same scenario, SP, MK, SEN, and GPD_NS rejection rates reduced from 81.7%, 82.7%, 82.2%, and 96.4% to 62.1%, 62.2%, 61.7% and 72.2%, respectively (i.e., drop by 19.6%, 20.5%, 20.5% and 24.2%).Among the different methods considered here, the better performance of the GPD_NS should not come as a surprise, given that we generated GPD variates.However, we can compare its performance to that the other approaches, and Figure 4 presents a direct comparison between the performance of MK and SEN tests with respect to the GPD_NS.Overall, these two nonparametric tests perform in a similar way.However, their power is lower than that of the GPD_NS, especially when the shape parameter is negative.As ξ increases, their performance approaches that of the GPD_NS.For example, the ratio between the percentage of rejection of the MK and GPD_NS increases from 0.39 to 0.65 by increasing ξ from À0.30 to 0.30 when n = 30, β 1 = 0.03, and λ = 1.
We obtain similar patterns when considering SP and OLS (Figure S4), even though the power of OLS converges faster to the GPD_NS one.Different from what we observe for the nonparametric tests (i.e., MK, SEN, SP), the median value of the ratio between the percentage of rejection for the OLS and GPD_NS test is lower for each n when ξ = 0.30.In this case, the maximum value of this ratio is approximately 0.90 (i.e., the rate rejection of the OLS test is equal to 90% of that of the GPD_NS), tending to 1 for sample sizes greater than 90.In addition, the variability of the percentage of rejection decreases as the sample size increases, and it depends on the value of the shape parameter.
Figure 5 presents a comparison between the MK and OLS tests, focusing on the percentage of rejection estimated for each possible GPD variate and stratified by GPD shape parameter (ξ = À0.30,0, 0.30).When ξ varies between À0.30 and 0, the percentage of rejection obtained using OLS is greater than observed for the MK test regardless the values of λ, n, and β 1 .On the other hand, MK consistently reaches higher rejection rates when ξ = 0.30.The differences in the power of the tests tend to be greater as the percentage of rejection increases, except in the region where the rejection rate values start to converge to 1.0 regardless of ξ (i.e., when % Rej OLS $ 0.75).Furthermore, the curves in Figure 5 associated with ξ = À0.30and 0.30 represent outer envelopes for rejection rates for in-between values of the shape parameter.Similar pattern observed in MK test is noted for the Spearman Rho's and SEN test, as shown in Supplementary Figures S5, S6.
The influence of ξ on the percentage of rejections is explained by considering that the GPD distribution with a positive shape parameter is unbounded above, leading to time series with lower signal-to-noise-ratio and, consequently, to a less favorable scenario for all methods to detect the presence of a true trend.Amorim (2018) reported a similar behavior for time series following a GEV distribution with different sample size, scale, and shape parameters (Vogel et al., 2011).More specifically, we can consider the maximum values of each of the thousand time series generated considering n = 100, β 1 = 0, λ = 1, and different shape parameters (i.e., ξ = À0.30and + 0.30) (Figure 6): there is an inferior limit at X ≥ μ for the GPD variates regardless of the ξ value, while there is a superior limit equal to X ≤ μÀσ/ξ when ξ < 0. The existence of these limits explains why the maximum GPD quantiles will never be greater than 6.7 when n = 100, ξ = À0.30, and λ = 1.On the other hand, the quantiles related to a positive shape parameter only have a lower bound.On average, by increasing the ξ value from À0.30 to +0.30, the change in the coefficient of variation (C v ) is equal to approximately 70% (C v(À0.30)= 0.82 and C v (+0.30)) = 1.48).As reported by Pittock (1980) and Chiew and McMahon (1993), the presence of a higher noise (or higher C v values) associated with a constant signal reduces the ability to detect an actual trend in a given sample.This explains the difference in the percentage of rejection rates observed for samples with the same n and β 1 , but different ξ.
The relationship between the rejection rate (or power), base sample size (n b ), shape parameter (ξ), and the product between λ and the scale parameter slope (β 1 ) is summarized in Figure 7.The combination of β 1 and λ in a single variable is based on the idea of allowing a direct comparison between the power of samples that came from a record with the same length but with different sampling rates per block/year (or λ).The base sample size is defined as the length of a sample when λ ¼ 1 (i.e., when the number of years in a given dataset coincides with the sample size).The power of tests is an increasing function of the base sample size and the product λβ 1 .This means that as n b and/or λβ 1 increase, our ability to identify the presence of a true trend also increases.This becomes more evident in Figure 7 when considering fixed values of λβ 1 and ξ, the power associated with a specific curve increases as the values of n b increase.As an example, if λβ 1 = 0.05 and ξ = À0.30, the power of the MK test is 0.15 and 0.67, respectively, when n b = 30 and 60.We observe the same behavior for the increase of λβ 1 : as the values of λβ 1 increase, the power of MK increases for the same n b .Figure 7 also shows that the distance between power curves for ξ = À0.30 is slightly smaller than for ξ = +0.30,considering the same value of n b .This fact reinforces the idea that the relationship between power and shape parameter is negative and Comparison between the (%) percentage of rejection estimated by the Mann-Kendall (MK) and ordinary least squares (OLS) tests and the nonstationary generalized Pareto distribution fit (GPS_NS) for all different λ, samples sizes, ξ and β 1 .The interpolated curves were developed using the LOESS method with a span equal to 0.75.its sensitivity to λβ 1 increases as ξ increases, especially for smaller sample sizes.For example, when n b = 30, λβ 1 = 0.15, and the shape parameters are equal to À0.30 and + 0.30, the detection powers are 67% and 42%, respectively.Figure 7 also suggests that the incremental increase in power by adding more years of observations progressively reduces as n b gets larger.Furthermore, for large base sample sizes, small changes in λβ 1 result in more significant gains in the percentage of rejection.It is also important to note that the existence of a small signal of change (or β 1 ) can be balanced by an increase in λ for the same n b .For example, if n b = 40, ξ = + 0.30, and β 1 = 0.03, the MK power of detection increases from 22% (λ = 1) to 37% (λ = 2), or to 91% (λ = 10).
Figure 8 presents the MK results, considering different values of λβ 1 and fixed base sample size.The six distinct regions, ranging from n b = 30 (darkest gray) to 80 (lightest gray), have their upper (UL) and lower limits (LL) associated with the percentage of rejection estimated for a fixed λβ 1 and n b , for ξ = À0.30and + 0.30, respectively.This means that the width of the shaded areas reflects the expected decrease in the power of the MK test due to the increase in ξ from À0.30 to +0.30 for a given λβ 1 and n b .As expected, the region associated with the smallest base sample size (n b = 30) exhibits the absolute maximum width.In other words, the difference between UL and LL is greater than that observed for the other five regions, reaching its peak for a λβ 1 close to 0.20.For a fixed n b , there is less variability in the rejection rate for lower values of the λβ 1 product (or when % rejection is close to the significance level) or the UL approaches 1.This means that the increment in power decreases as the percentage of rejections approaches its maximum for a given n b .For example, when n b = 30 and λβ 1 = 0.20, the difference between UL and LL is equal to 24%, while, for the same base sample size and considering λβ 1 = 0.04 and 0.50, the UL-LL is equal to 3.4% and 6.8%, respectively.Figure 8 points to the low power of trend detection in series with small n b values and moderately high λβ 1 .For example, when n b = 30, the estimated power of the MK for λβ 1 = 0.20 and ξ = +0.30 is equal to 0.50.It is equivalent to saying that there is, on average, a 50% chance of not rejecting the null hypothesis when β 1 ¼ 0:05 (i.e., when the scale parameter is changing linearly at a rate of 5% in a 30-year period) and the sample size is equal to 120 observations (or λ ¼ 4).We can obtain the same rejection rate by the MK when combining n b = 70, λ = 1, ξ = + 0.30 and β 1 = 0.05 or by using the GPS_NS method with n b = 60 instead of using the traditional nonparametric methods, such as SP, MK, or SEN.
Figure 9 presents the difference in trend detection power between a sample composed by n values obtained from (a) a time series with n years of record (i.e., n b or λ = 1); or (b) increasing the number of values sampled by X (λ = X, for X > 1) in a time series with n/X years of observation.Considering the same sample size (e.g., 60), F I G U R E 6 Plot of 1000 maximum variates (X max ) of time series following a generalized Pareto distribution with constant sample size (n = 100), scale parameter slope (β 1 = 0), λ = 1, and ξ = À0.30(red dots) and ξ = + 0.30 (black dots).The black and red dashed lines represent the theoretical inferior (X ≥ μ) and superior (X ≤ μÀσ/ξ) limits for GPD quantiles when ξ < 0.
we show the difference in the percentage of rejection between n = n b = 60 and λ = 1, n/2 = 30 and λ = 2, n/3 = 20 and λ = 3, and n/6 = 10 and λ = 6.In general, Figure 9 suggests that an increase in the sample size results in higher rejection rates.However, the percentage of rejections obtained for samples when λ > 1 is bounded above and below by the values of the rejection rate curve λ = 1 and the significance level of the test (α = 0.05), respectively.Comparing samples of size n, as the value of λ increases, the probability of detecting an actual trend decreases.In other words, the power of detection is also a function of λ, where increasing its value to extend the record length results in power gains, but significantly lower than if the additional data (nÀ n b Â λ) were extracted from a longer time series.The vertical dashed black lines limit the rejection zones associated with a specific sample size (n) and different values of β 1 (β 1 = 0.05 and 0.09).If the curve for λ = 1 represents the maximum theoretical rejection rate for any n, the maximum power, on average, is determined by its intersection with these vertical lines for a specific sample size (n), scale parameter slope coefficient value (β 1 ), and using MK.For example, considering n = 40 and β 1 = 0.09, the maximum percentage of rejection expected is equal to 67% (λ = 1).The analogous can be done for different values of λ.For example, the previously mentioned maximum of 67%  = 30, 40, 50, and 80) and β 1 = 0.05 (top) and 0.09 (bottom).The intersection between the vertical black dashed lines and the λ = X curves represents the maximum rejection rate for a specific n and λ.The shaded area represents the 95% confidence level interval related to a loess smoothing curve with span = 0.60 and λ = X.drops to 39.8% when λ = 2.The position of the vertical lines is a function of β 1 and the methodology applied to assess the presence of nonstationarity in a POT time series.
In general, regardless of the method considered, as β 1 increases, the power of the test tends to increase.This fact is illustrated by the differences in the position of the vertical lines between the top and bottom panels in Figure 9. Therefore, for the same sample size (n), an increase in β 1 results in a higher signal of change and, consequently, a greater percentage of true trend detection for all methods.For example, the maximum percentage of rejection expected when β 1 = 0.05 and n = 40 (42%) is approximately equal to the maximum rejection rate estimated for n = 30 and β 1 = 0.09 (41.1%) when using MK.
Up to this point, we did not consider whether the sign of the detected trend was correctly identified.This aspect is addressed in Figure 10, which shows the relationship between the probability of occurrence a type S error using SEN and OLS to estimate the magnitude of trends in GPD time series with different base sample sizes (n b = 30 (10) 100), shape parameters (ξ = À0.30,0 and + 0.3), sampling rates (λ = 1, 2, 3), and a fixed scale parameter slope (β 1 = 0.01).Bars with the same fill colors for a given n b represent the probability of occurrence of type S error for different values of λ, which are plotted in an ascending order of magnitude (i.e., λ = 1, 2, and 3).Regarding the sample size, the trend signal estimated by OLS and SEN is more likely to be correctly estimated as n increases.Thus, type S error is a decreasing function of the sample size.As an example, for ξ = +0.30,β 1 = 0.01, and λ = 1, the probability of occurrence of type S error is equal to $19% for n b = 30, decreasing to 6% when n b = 50.As noted for the ability to detect the presence of trends, the use of longer data records is preferable when compared to an increase in λ to match a certain base sample size.Despite the reduction in the probability of type S error occurrence due to the increase in the sampling rate (λ), the variation observed in power by adding 10 years of record to n b (n b = 40-50) is equivalent, in terms of type S error, to the incorporation of 40 additional events keeping the same n b (or λ = 2).
A similar behavior observed for the sample size occurs with an increase in β 1 .This means that as the values of β 1 increase, a reduction in the probability of occurrence of type S error is observed (Figure 10).when β 1 increases from 0.01 to 0.05. Figure 10 also suggests that the probability of type S error occurrence is more concerning when it comes to trends of lower magnitude (β 1 = 0.01), reaching values close to zero for a relatively small increase in β 1 .
The inverse relation between the rejection rate (or power) and the probability of occurrence of type S error was already expected.Small sample size, low β 1 , and high ξ values are examples of GPD time series characteristics that tend to reduce the ability of trend detection methods to accurately identify the presence of a true nonstationary signal.This indicates that isolated or simultaneous variations in n, β 1 , λ, and ξ that tend to increase the detection power (or make the signal of change more evident for the methods tested here) result in a reduced probability of estimating a negative trend signal of change when, in fact, it is positive or vice versa.
The OLS and SEN exhibited similar performance when applied to time series generated considering all possible combinations of n b , β 1 , ξ, and λ.While the OLS performed better in detecting the presence of changes in situations where the ξ ≤ 0, the same behavior is not observed for the type S errors.For this specific scenario, the probability of occurrence of type S error by using SEN and OLS is the same regardless β 1 and n b .We observe small and uniform differences when ξ = +0.30,with SEN performing slightly better than the OLS.

| CONCLUSIONS
We have performed a MC simulation study to explore the power of different parametric and nonparametric approaches to detect trends in partial duration records and map potential issues in estimating the signal of change when a nonstationarity is identified (Error type S).We assumed that the partial duration datasets followed a GPD with different sample sizes (n), shape (ξ), and scale (σ t ) parameter, which was defined as a linear time-dependent parameter, with β 1 ≥ 0 as the slope coefficient.
In agreement with previous studies (Totaro et al., 2019;Yue et al., 2002) developed for Gaussian-and GEV-distributed time series, a large sample size (n), high β 1 magnitude, and high negative ξ values resulted in an increased percentage of stationary null-hypothesis (H 0 ) rejections for all methods tested here-Spearman's rho (SP), MK test, SEN, OLS and a nonstationary GPD distribution fit (GPD_NS).Moreover, the GPD_NS test is the method that performed the best and obtained the highest values of rejection rate, considering all possible configurations of GPD-distributed variates (n, λ, β 1 , and ξ), consistent with the fact that we generated GPD variates.
Using the performance of the GPD_NS as a reference, we showed that OLS performed better than all parametric tests when detecting trends for small or negative values of ξ.On the other hand, the use of a nonparametric test is recommended when dealing with records that exhibit large values of ξ.
With respect to the change in the sample size (n), we observed a significant difference in power increase between using a longer record datasets and extending a times series by increasing the sampling rate (λ).However, we encourage the use of λ greater than 1 (i.e., selecting more than just one event per year) to increase the POT sample size, especially when dealing with small records.In this case, we noted gains in power of detection and a reduction in the probability of type S error occurrence, especially when ξ < 0 (i.e., bounded distribution).
In terms of type S errors, the results also suggested that the probability of estimating a trend with the opposite sign of the true trend (when a statistically significant trend is detected) is not negligible and depends on the power of the test, assuming values greater than 19% for sample sizes commonly used in trend detection analysis.In addition, the use of Sen's slope estimator is preferable over OLS to estimate the magnitude of trends in a GPD time series.Despite the similar performance between these two methods, SEN exhibited a slightly lower probability of type S error occurrence when ξ is negative, proving to be more robust to shape parameter fluctuations.
In a practical sense, our results and recommendations represent an advance in the understanding of the weaknesses and strengths of the most used approaches to identify trends when dealing with extremes.Moreover, we recognize that accurately detecting trends is the first step to address questions involving changes in hydrometeorological extremes (e.g., flooding), which have important socio-economic consequences.Design concepts (e.g., return period, hydrological risk) and design life level are concepts that should be redefined (or adapted) in a changing world, as well as the best practices in the management of hydraulic structures or decision-making in water resources.Therefore, guidance documents and methods to incorporate the effects (or uncertainties) of the violations of the stationarity assumption in the design and planning phases are needed, representing a topic of future research on which our results shed light.

F
I G U R E 4 Comparison between the (%) percentage of rejection estimated by the Mann-Kendall (MK) (a) and Sen's slope estimator (SEN) (b) tests and the nonstationary Generalized Pareto distribution fit (GPD_NS) for all different λ, samples sizes, ξ and β 1 .The colorbar represents the value of the ratio between the rejection rates estimated by the MK or SEN and the GPD_NS.The interpolated curves were developed using the locally weighted scatterplot smoothing (LOESS) method with a span equal to 0.75.

F
I G U R E 7 Power curves of the Mann-Kendall test for different λβ 1 , base sample size (n b ) and shape parameters (ξ = À0.30(top), 0, and + 0.30 (bottom)), considering the level of significance equal to 5% (α = 0.05).F I G U R E 8 Variability of the percentage of rejection for the Mann-Kendall test, considering different base sample sizes (n b = 30 (10) 80) and λβ 1 .The lower and upper limits of the regions are the value of the power of trend detection associated with ξ = +0.30and À 0.30, respectively.The darker the colors, the smaller the sample sizes.F I G U R E 9 Comparison between the (%) percentage of rejection estimated by the Mann-Kendall test (MK) for different λ (1, 2, 3, and 4), samples sizes (n Considering n b = 40 and ξ ¼ þ0:30, the probability of occurrence of type S error drops from approximately 10% to 0% F I G U R E 1 0 Probability of occurrence of Type S error using the ordinary least squares (OLS) and Sen's slope estimator (SEN), considering that the time series was already declared as nonstationary by the MK test.The results refer to ξ = À0.30(a), 0 (b), +0.30 (c), λ = (1, 2, 3), n b = (40, 50, 60, 70, 80, 90, 100), and β 1 = 0.01 (left panels) or β 1 = (0.01, 0.03, 0.05, 0.07, 0.09) and n b = 40.For each method and specific n b (or β 1 ) the three bars refer to λ = 1, 2, and 3 (from left to right).