Testing for differences between two distributions in the presence of serial correlation using the Kolmogorov–Smirnov and Kuiper's tests

Testing for differences between two states is a staple of climate research, for example, applying a Student's t test to test for the differences in means. A more general approach is to test for differences in the entire distributions. Increasingly, this latter approach is being used in the context of climate change research where some societal impacts may be more sensitive to changes further from the centre of the distribution. The Kolmogorov–Smirnov (KS) test, probably the most widely‐used method in distributional testing, along with the closely related, but lesser known Kuiper's (KU) test are examined here. These, like most common statistical tests, assume that the data to which they are applied consist of independent observations. Unfortunately, commonly used data such as daily time series of temperature typically violate this assumption due to day‐to‐day autocorrelation. This work explores the consequences of this. Three variants of the KS and KU tests are explored: the traditional approach ignoring autocorrelation, use of an ‘effective sample size’ based on the lag‐1 autocorrelation, and Monte Carlo simulations employing a first‐order autoregressive model appropriate to a variety of data commonly used in climate science. Results indicate that large errors in inferences are possible when the temporal coherence is ignored. The guidance and materials provided here can be used to anticipate the magnitude of the errors. Bias caused by the errors can be mitigated via easy‐to‐use ‘look‐up’ tables or more broadly through application of polynomial coefficients fit to the simulation results.


| INTRODUCTION
In climate science, one of the most fundamental pursuits is determination of the significance of differences between two states or sets of conditions. For example, the two states may be characterized by opposite phases of the North Atlantic Oscillation (NAO), by dry versus wet phases of the Asian Monsoon, or by historical and future climate states as simulated by a Global Climate Model (GCM). Typically, a statistical test is applied to a limited sample of data derived from each of the two states. The Student's t test is perhaps the most commonly applied test and can be used to infer differences between the means of the two states. Although the t test is quite useful, sometimes an analyst wants information regarding higher-order differences beyond the means. For example, in a climate change context an increase in the upper tail of a temperature or precipitation distribution may be more impactful for some applications than a change in the mean. Thus, in some situations there is a desire to test for a general difference in distributions.
The Kolmogorov-Smirnov (KS) test has been used widely to perform such distributional testing. The KS test operates by quantifying the distance between the empirical distribution functions derived from two different samples of data. Since it is nonparametric it makes no assumptions regarding the underlying distributions from which the samples are drawn. However, like most commonly used statistical tests, there is an underlying assumption that the values within each sample are statistically independent. Violation of this assumption typically leads to an excessive rate of rejections of the null hypothesis that there is no difference. Because most common physical variables of interest to a climate scientist exhibit non-trivial correlation spatially (i.e., between nearby gridpoints or stations) and temporally (i.e., from one observation in time to the next), vigilance in hypothesis testing is warranted. It should be noted that although this work deals only with the effects of temporal coherence, once local (e.g., gridpoint or station) significance has been established by appropriate means, recent advances allow for addressing the problem of spatial coherence in a straightforward manner (Wilks, 2016).
With regard to addressing serial correlation in significance testing, it appears that Laurmann and Gates (1977) introduced to the atmospheric sciences community the notion of an 'effective sample size', n eff , in relation to the actual sample size, n. They proposed that an estimate of n eff , based on the lag-1 autocorrelation in the data, could be used to adjust an estimate of the variance in testing the difference in means between two samples, thereby accounting for serial correlation. Thereafter, Thiebaux and Zwiers (1984) demonstrated that the concept of an 'effective sample size' is nebulous with no unique way to estimate it and cautioned against its use in the general application of statistical tests. Later work by Zwiers and von Storch (1995) explored various ways in which serial correlation could be taken into account in testing for the difference between the means of two samples.
In spite of the complexities and cautions raised, substitution of n eff for n in various statistical tests to account for autocorrelation in time series has proliferated. In particular, the simplified expression for n eff based on the assumption of a first-order autoregressive process (Laurmann and Gates, 1977) has become the de facto means for dealing with serial correlation in significance testing. Surprisingly however, accounting for serial correlation in application of the KS test is often lacking. While several studies have examined this issue (Weiss, 1978; Durilleul and Legendre, 1992; Xu, 2013) they have not resulted in a general approach that can be readily implemented. The purpose of this work is to fill that void.
This work examines three competing strategies in application of the KS test, and the closely related Kuiper's (KU) test, both of which are detailed below:
• the traditional approach of ignoring serial correlation
• substitution of an effective sample size, n eff , for n
• use of Monte Carlo simulations to account for serial correlation.
The third approach, which is the main focus, is analogous to that of Zwiers and von Storch (1995) who provided revised critical values for use in application of the Student's t test based on the lag-1 autocorrelation. As a shorthand, subscripts 't' for traditional, 'e' for effective, and 's' for simulation (using a polynomial fit to smooth the results) are affixed to KS, KU, or simply K when referring to both tests generically.
In what follows, before delving into the details of the methodology, Section 2 presents examples based on both actual observations as well as realistic synthetic data in order to motivate this work. The details of the methodology for the three approaches are given in Section 3 along with thorough instructions for implementation. Section 4 explores some of the properties of the simulation approach in comparison with the other two approaches. Finally, Section 5 concludes with a summary and outlines several different ways in which the results from this study can be applied in practice. Supplementary material contains the coefficients and translation tables that can be used to implement the K s distributional testing procedures.

| MOTIVATIONAL EXAMPLES
Before delving into the details of the various methodologies some examples, based on both real-world observations as well as synthetic data, are presented to motivate this work. Daily maximum (T max ) and minimum (T min ) surface air temperature from Central Park in New York City, New York were obtained from the National Centers for Environmental Information (NCEI) (https://www.ncdc.noaa.gov/cdo-web/datatools/findstation) spanning the time period 1869 to 2019. These data are from the daily Global Historical Climatology Network (GHCN-Daily) dataset prepared by Menne et al. (2012). Figure 1 displays the time series of monthly averaged Central Park T max and T min . Although both series exhibit distinct warming trends, the smoothed curves highlight the considerable interdecadal variability. This low-frequency variability provides a convenient testbed for application of the KS and KU tests. A series of tests are applied to adjacent 20-year periods, separately for T max and T min , for January and July, and for daily values and monthly means. The first set consists of testing 1870-1889 versus 1890-1909. Shifting the starting points of the sets by 5 years per set produces a total of 23 sets terminating with 1980-1999 versus 2000-2019. An example of the test results is shown in Figure 2 for the KS test applied to Central Park July daily T min with probabilities given for three variants of the tests (KS t , KS e , KS s ) for each of the 23 sets. Note that the probabilities here, and throughout the rest of this paper, are expressed as the cumulative probability, from minus infinity to the point in question. For example, a probability of .95 is equivalent to a one-tailed significance level of .05, typically interpreted as reason to reject the null hypothesis that the two samples are drawn from the same distribution.
The K tests are one-tailed because a probability in the lower tail is indicative of a very close agreement between the two distributions, hence irrelevant to rejection of the null hypothesis of no difference.
The results in Figure 2 show a consistent pattern in which K t (K e ) indicates the most (least) significant results with K s in between; however, the separation between the three tests varies considerably. That K t is consistently the most significant is expected because any degree of autocorrelation, which should be considerable for daily data, reduces the amount of independent information available. Since K t neglects this it is overly confident. However, these examples do not explicitly display the amount of dependence due to serial correlation.
In order to provide a perspective from which to interpret these results in a more controlled environment, synthetic daily data have been generated. Four sets of data consisting of 10,000 months each are derived from a first-order autoregressive process (AR1) with lag-1 autocorrelations of 0.0, 0.3, 0.6, and 0.9. While the details of the data generation are given below, the relevant point to note here is that since these data were generated in the same fashion as the data used in the Monte Carlo simulations to derive K s , by construction results for KS s and KU s represent the 'truth'.
In evaluating K t and K e in the virtual world of synthetic data the customary probability of .95, corresponding to a significance level of .05, is adopted as an example. Thus, a K t or K e probability exceeding .95 defines a significant result. If the corresponding K s probability falls below .95 this is denoted as a false positive (FP). Similarly, when K s is significant but K t or K e are not, this represents a false negative (FN). Because of the predictable nature of the biases of K t and K e as seen in Figure 2, virtually all errors for K t (K e ) are of the FP (FN) type; round-off error for very close results leads to the rare exceptions.
Figure 3 displays the FP rates for K t (black) and FN rates for K e (red) as solid (KS) and dashed (KU) lines. For small values of serial correlation the FP and FN rates are modest. However, as the serial correlation increases the FP rate for K t increases nonlinearly, reaching error rates of 30 to 50% for values of the lag-1 autocorrelation that may not be uncommon in many physical variables of interest, such as daily temperature. The K e approach performs much better, especially for greater serial correlation, but still has error rates of ~5 to 10%. For comparison the values averaged over all of the Central Park cases are plotted as symbols. For monthly data, which typically have negligible correlation from year to year, the error rates are near zero. However, for the daily data much larger error rates of ~25% (~15%) are seen for K t (K e ). Note that choosing a fixed value (.95) to denote significance yields a simple, but narrow, framework in which to compare the three approaches. Below, the differences between the three tests are explored more fully and it is shown that the much higher error rate for K t compared to K e seen here is not universal; in other circumstances the roles may be reversed.

| Introduction
Distributional testing involves the comparison of empirical Cumulative Distribution Functions (CDFs) derived from two separate samples and can be divided into two classes: supremum and quadratic (Stephens, 1986). The former are based on the maximum difference between the two CDFs while the latter are based on the squared differences between the two CDFs. The most widely used is the KS test which utilizes the supremum approach. Note that the two-sample KS test is sometimes referred to as the Smirnov test (Wilks, 2006). Closely related to it is the lesser-known KU test. Two popular quadratic tests are the Cramer-von Mises (CM) and Anderson-Darling (AD). While the KS test is more sensitive to differences near the middle of the distributions, with the CM often yielding similar results, the KU test is equally sensitive across the distribution while the AD test is more sensitive in the tails (Stephens, 1970, 1986). The AD test suffers from the fact that critical values are not as readily available since they depend on sample size (Pettitt, 1976).
This work focuses on the KS test because of its widespread use and also the KU test because it is so closely related to the KS test and is complementary with regards to its sensitivity. Details for the implementation of each of these tests are given below in three ways: the traditional approach, use of an effective sample size, and results derived from Monte Carlo simulations.
It is important to note that all of the results herein pertain to tests applied to two distinct samples. In some contexts it is desirable to compare the CDF from a sample with the CDF from a parametric fit to the same sample. Unfortunately, this violates the basic assumption that the two CDFs are derived from independent samples of data. In such instances, another approach, such as the Lilliefors test (Lilliefors, 1967) or Monte Carlo simulation, is required.

| The traditional approach (K t )
The first step in performing the K tests involves creating empirical CDFs, F 1 (x) and F 2 (x), from the two samples of data (Press et al., 1992) and defining:

D(x) = F 1 (x) − F 2 (x)   (1)

For the KS test define:

D KS = max x |D(x)|   (2)

and for the KU test define:

D + = max x D(x), D − = max x [−D(x)]   (3)

V = D + + D −   (4)

In summary, the KS test is based on the maximum distance between the two CDFs while the KU test is based on the sum of the largest distance above and the largest distance below.
Next, if n 1 and n 2 are the two sample sizes, compute the combined effective sample size:

N e = n 1 n 2 /(n 1 + n 2 )   (5)

Finally, define the KS (λ KS ) and KU (λ KU ) test statistics:

λ KS = [√N e + 0.12 + 0.11/√N e ] D KS   (6)

λ KU = [√N e + 0.155 + 0.24/√N e ] V   (7)

While the significance levels corresponding to the λ K test statistics are available from software packages, they can be estimated via summation series. First, define the significance level (α):

α = 1 − p   (8)

where p is the probability summed to the upper tail (i.e., the CDF value). For the KS test:

α KS = 2 Σ_{j=1..∞} (−1)^{j−1} exp(−2 j^2 λ KS ^2)   (9)

and for the KU test:

α KU = 2 Σ_{j=1..∞} (4 j^2 λ KU ^2 − 1) exp(−2 j^2 λ KU ^2)   (10)

Although these are infinite series they generally converge rapidly and no more than 100 terms are necessary (Press et al., 1992). The summations can be terminated when either of two conditions is met:
• when the absolute value of the current term in the summation is ≤10^−3 times the absolute value of the previous term or
• when the absolute value of the term is ≤10^−8 times the current sum.
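As an illustrative sketch (the function names are mine, not from the paper), the full K t calculation, from the empirical CDFs through the series probabilities, can be coded as follows; the correction constants in the λ scalings follow Press et al. (1992) and Stephens (1970), and a fixed 100-term truncation is used in place of the adaptive stopping rules described above.

```python
import numpy as np

def ks_ku_stats(x1, x2):
    """Two-sample KS (D_KS) and Kuiper (V) statistics from the
    empirical CDFs F1 and F2 of samples x1 and x2."""
    grid = np.sort(np.concatenate([x1, x2]))
    # Empirical CDFs evaluated at every pooled data point
    f1 = np.searchsorted(np.sort(x1), grid, side='right') / len(x1)
    f2 = np.searchsorted(np.sort(x2), grid, side='right') / len(x2)
    d = f1 - f2                      # D(x)
    d_ks = np.max(np.abs(d))         # maximum absolute distance (KS)
    v = np.max(d) + np.max(-d)       # largest distance above plus below (KU)
    return d_ks, v

def prob_ks(lam, nterms=100):
    """Upper-tail significance alpha for the KS statistic lambda."""
    j = np.arange(1, nterms + 1)
    s = np.sum(2.0 * (-1.0) ** (j - 1) * np.exp(-2.0 * j**2 * lam**2))
    return float(min(max(s, 0.0), 1.0))

def prob_ku(lam, nterms=100):
    """Upper-tail significance alpha for the Kuiper statistic lambda."""
    j = np.arange(1, nterms + 1)
    s = np.sum(2.0 * (4.0 * j**2 * lam**2 - 1.0) * np.exp(-2.0 * j**2 * lam**2))
    return float(min(max(s, 0.0), 1.0))

def k_tests(x1, x2):
    """Traditional (K_t) KS and KU cumulative probabilities p = 1 - alpha."""
    n1, n2 = len(x1), len(x2)
    ne = n1 * n2 / (n1 + n2)                       # combined sample size
    d_ks, v = ks_ku_stats(np.asarray(x1, float), np.asarray(x2, float))
    lam_ks = (np.sqrt(ne) + 0.12 + 0.11 / np.sqrt(ne)) * d_ks
    lam_ku = (np.sqrt(ne) + 0.155 + 0.24 / np.sqrt(ne)) * v
    return 1.0 - prob_ks(lam_ks), 1.0 - prob_ku(lam_ku)
```

For example, `k_tests(sample1, sample2)` returns the cumulative probabilities for the KS and KU tests, which can then be compared against a threshold such as .95.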
3.3 | Use of an effective sample size (K e )

The approach introduced here based on an effective sample size is a simple extension of the traditional approach employing an assumption used widely in climate science. Following Laurmann and Gates (1977), if one assumes an AR1 process an effective sample size (n eff ) can be defined in terms of the actual sample size (n) and the lag-1 autocorrelation (r):

n eff = n(1 − r)/(1 + r)   (11)

Note that when r < 0 here we set n eff = n. This is a conservative approach which prevents n eff from exceeding n. Some may prefer a less conservative approach in which n eff is allowed to exceed n, reflecting the reduced sampling variability that would be encountered. Once effective sample sizes have been estimated they can be substituted for n 1 and n 2 in (5), proceeding with the traditional approach as outlined above.
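Under the AR1 assumption the adjustment is brief; the sketch below (function names assumed, not from the paper) estimates the lag-1 autocorrelation and the resulting n eff , flooring negative r at zero as described above.

```python
import numpy as np

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation r of a time series."""
    a = np.asarray(x, dtype=float)
    a = a - a.mean()
    return float(np.sum(a[:-1] * a[1:]) / np.sum(a * a))

def effective_sample_size(n, r):
    """Effective sample size n_eff = n(1 - r)/(1 + r) under the AR1
    assumption (Laurmann and Gates, 1977); negative r is floored at
    zero so that n_eff never exceeds n."""
    r = max(r, 0.0)
    return n * (1.0 - r) / (1.0 + r)
```

The two effective sizes would then be substituted for n 1 and n 2 when computing the combined sample size.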

| Monte Carlo simulation and polynomial approximation (K s )
The simulation procedure employed here is based on one critical assumption, namely that the data to be tested follow an AR1 process:

X t = r X t−1 + e t   (12)

where X t is the value at time t, r is the lag-1 autocorrelation, and e t is a zero-mean Gaussian random variable with constant variance. Note that while (12) represents a zero-mean process, this does not limit the generality of the results. Strictly speaking, if the AR1 model is not appropriate for the data in question the results of this work cannot be applied. However, in geophysics many common physical variables can be approximated by an AR1 process.
The procedure for the simulations is straightforward, relying on brute-force computing power. A series of simulations, each consisting of 1,000 trials, were performed in which two samples were generated independently from the same AR1 model. For each trial the λ values from (6) or (7) were computed and saved. Based on the distribution of 1,000 λ values, quantiles were extracted corresponding to a series of percentiles in the CDF (0.1, 0.…). Although all of the trials were based on large sample sizes of n 1 = n 2 = 20,000 in order to reduce sampling error, this was not crucial.
One of the simplifying features of the K tests is that the sample size dependence is captured in the λ values via (5)-(7) so that λ critical values are essentially independent of sample size. This was verified by running some simulations for sample sizes of 10, 100, 1,000, and 10,000. This is consistent with the statement by Press et al. (1992) that the approximation in (9) is quite good even for n as small as 8, noting as well that (9) and (10) have no sample size dependence. In addition, the fact that the two samples had equal sizes does not diminish the wide applicability of the results since differing sample sizes are also taken into account by (5).
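A minimal version of the simulation procedure might look as follows; the function names and the reduced sample and trial counts are mine (the trials described above used n 1 = n 2 = 20,000 and 1,000 trials per simulation), but the logic, generating pairs of AR1 samples with a common lag-1 autocorrelation and extracting quantiles of the resulting λ KS null distribution, is the same.

```python
import numpy as np

def ar1_series(n, r, rng):
    """Generate a zero-mean AR1 series X_t = r*X_{t-1} + e_t."""
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - r**2)   # start in the stationary distribution
    for t in range(1, n):
        x[t] = r * x[t - 1] + e[t]
    return x

def simulate_lambda_quantiles(r, n=2000, ntrials=1000,
                              probs=(0.90, 0.95, 0.99), seed=1):
    """Null distribution of the KS statistic lambda_KS for AR1 data.
    Both samples share the same lag-1 autocorrelation r; returns the
    lambda quantiles at the requested CDF levels."""
    rng = np.random.default_rng(seed)
    ne = n * n / (n + n)                      # combined sample size
    scale = np.sqrt(ne) + 0.12 + 0.11 / np.sqrt(ne)
    lams = np.empty(ntrials)
    for i in range(ntrials):
        x1, x2 = ar1_series(n, r, rng), ar1_series(n, r, rng)
        grid = np.sort(np.concatenate([x1, x2]))
        f1 = np.searchsorted(np.sort(x1), grid, side='right') / n
        f2 = np.searchsorted(np.sort(x2), grid, side='right') / n
        lams[i] = scale * np.max(np.abs(f1 - f2))
    return np.quantile(lams, probs)
```

For r = 0 the simulated quantiles should approach the asymptotic critical values (e.g., λ ≈ 1.36 at the .95 level), while increasing r pushes the quantiles to the right, mirroring the behaviour described in Section 4.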
In order to make the simulation results more readily available for application a series of polynomials were fit to the raw quantile values described above. Each CDF for a given lag-1 value was subdivided into three segments, with a separate polynomial fit to each segment. The three segments correspond to CDF ranges of 0.0-0.10, 0.10-0.50, and 0.50-0.99 for KU. For KS the cut-off of 0.5 is mostly replaced by 0.6, the choice being dictated by a better fit. For the lower two segments the polynomial is linear while the upper one is cubic. All of the coefficients and ranges for which they apply are given in Tables S1 and S2 (supporting information), along with an example for their use. These can easily be used to create computer code that returns significance levels in the user's programming language of choice. Note that in application the user would apply (1)-(7) as for the traditional approach, but use probabilities from the polynomials in lieu of (9) and (10). Throughout this paper, results referred to as 'simulation' (K s ) are based on the values derived from the polynomial fits.
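The mechanics of applying such piecewise fits can be sketched as follows; the breakpoints and coefficients below are invented placeholders purely for illustration, and the actual values must be taken from Tables S1 and S2.

```python
def prob_from_polynomials(lam, segments):
    """Translate a lambda value into a CDF probability using piecewise
    polynomial fits. `segments` is a list of (lam_lo, lam_hi, coeffs)
    tuples with coeffs in ascending-power order; probabilities beyond
    the last segment are censored at 0.99 as recommended in the text."""
    for lam_lo, lam_hi, coeffs in segments:
        if lam_lo <= lam <= lam_hi:
            return min(sum(c * lam**k for k, c in enumerate(coeffs)), 0.99)
    return 0.99 if lam > segments[-1][1] else 0.0

# Placeholder segments (NOT the published values): two linear pieces
# and a cubic, mimicking the three-segment structure described above.
demo_segments = [
    (0.00, 0.60, (0.0, 0.1667)),                 # CDF ~ 0.0-0.10, linear
    (0.60, 1.20, (-0.40, 0.8333)),               # CDF ~ 0.10-0.60, linear
    (1.20, 2.20, (0.5245, 0.0, 0.0, 0.04372)),   # CDF ~ 0.60-0.99, cubic
]
```

In practice `segments` would be built from the published coefficients for the lag-1 autocorrelation at hand, and the resulting probability replaces the series estimates of the traditional approach.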
There is one additional consideration in application of the polynomial fits, namely how to handle the lag-1 autocorrelations. All of the simulations use the same lag-1 for both samples as it would not have been practical to simulate all of the 126 combinations of the 17 levels used. In most applications one would anticipate that the autocorrelations for the two samples would not be too dissimilar. Given lag-1 autocorrelations of r 1 and r 2 there would seem to be three reasonable choices, two of which involve a consensus value r being applied to the polynomials:
a. r = max(r 1 , r 2 )
b. p = average(p 1 , p 2 )
c. r = average(r 1 , r 2 )
A conservative approach (a) would be to apply the larger of the two autocorrelations to the polynomial fit.
This would tend to underestimate the level of significance. The second approach (b) would be to apply r 1 and r 2 separately and average the resulting probabilities. The third approach (c), the recommended one here, is to apply the average of the two autocorrelations to the polynomial fit, using Fisher's z transformation (Zar, 2010) for the averaging:

z i = ½ ln[(1 + r i )/(1 − r i )], i = 1, 2; r = tanh[(z 1 + z 2 )/2]   (13)

As further incentive for use of the results produced herein, a series of 'look-up tables' have been produced, which although less accurate, are easier to apply than the full set of polynomials. Tables S3-S6 (Supporting Information) allow the user to enter the probability arrived at based on either the traditional or n eff approaches, along with the lag-1 autocorrelation, to yield the probability one would obtain based on the polynomial fits. An example illustrating their use is included in the tables. A drawback of this approach is that the ability to discriminate between levels of significance varies by probability and lag-1, as discussed in Section 5.
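The recommended averaging (c) is compact in code; in the sketch below (function name assumed) the z transformation is expressed through its arctanh/tanh form.

```python
import numpy as np

def average_lag1(r1, r2):
    """Consensus lag-1 autocorrelation via Fisher's z transformation:
    average z_i = arctanh(r_i) and back-transform with tanh."""
    z = 0.5 * (np.arctanh(r1) + np.arctanh(r2))
    return float(np.tanh(z))
```

Because the z transformation stretches values near ±1, the consensus value slightly exceeds the arithmetic mean when the two autocorrelations differ, e.g., for r 1 = 0.3 and r 2 = 0.9 the consensus is about 0.71 rather than 0.6.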

| Polynomial fits
Curves displaying the polynomial fits to the Monte Carlo simulation results are shown in Figure 4. Both the lower and upper portions of the CDF exhibit highly nonlinear behaviour. Although the linear fits in the lowest segment are far from ideal (see below), it was decided that the considerable effort needed (i.e., more simulations and much higher order curve fitting) for a proper rendering is not justified since this is a region of extreme non-significance; the small errors incurred will not change any inferences. At the high end of the distribution the user would be advised not to extend results beyond a CDF of 0.99. When a λ exceeds the upper limit (as indicated in Tables S1 and S2) for the given polynomial the probability should be censored to a value of 0.99. Again, in this region results that are highly significant will not be changed by this approach.
The other noteworthy nonlinearity concerns the spacing between the curves representing different levels of autocorrelation. As autocorrelation increases the curves are farther apart, indicating a greater sensitivity to the lag-1 value. Taking a big-picture look at the results, consider the fact that the bottom-most curve (lag-1 = 0) represents critical values used in the traditional approach. It is clear that for lag-1 values typical of daily surface temperature for example (i.e., ~0.6 to 0.7, the upper orange and lower green curves) the departure is considerable.
4.2 | Polynomial fits vs. Numerical Recipes for lag-1 = 0

As a means of verifying the validity of the Monte Carlo and curve-fitting approach, simulations were performed for lag-1 = 0. In theory, these results should be the same as those from the traditional approach, which in this case was carried out using Numerical Recipes in FORTRAN (Press et al., 1992). As seen in Figure 5, for the bulk of the distributions (0.10-0.99) the results are virtually indistinguishable. At the lower end (0.0-0.1) the error incurred by the linear approximation is obvious, although significance at this end is usually of little practical importance. However, Stephens (1986) points out that very small probabilities indicate that the small differences between the two CDFs are unlikely to be random. Such results are called superuniform and sometimes indicate that the data have been tampered with.
Figure 6 displays the raw Monte Carlo values along with the polynomial fitted curves for some select values of autocorrelation. It can be seen that the fitted curves represent the raw values quite well. In the far right tail, particularly for higher values of autocorrelation, the fitted curves smooth out noise expected at the end of the distribution.

| Comparisons of distributions of K t , K e , and K s critical values
The CDFs for the K t , K e and K s critical values are shown in Figure 7 for four values of autocorrelation. The three approaches yield essentially identical values when lag-1 = 0, except for p < .10 where the poorer linear fit is used for K s . As autocorrelation increases, the curves for K e and K s move farther to the right, indicating a greater disparity with the traditional approach which does not take into account temporal coherence.
Note also how the relative positions involving the three approaches vary as a function of autocorrelation, especially for higher autocorrelation. For large probabilities (i.e., more significant results) K s is closer to K e than K t . This is consistent with the result from Figure 3 indicating smaller errors for the K e than the K t approach. However, results from Figure 3 are biased towards the high end of the distribution since a value of p = .95 was used in determining the FP and FN rates. In Figure 7 it can be seen that for lower probabilities K s is closer to K t than K e . Thus, the reader should not be misled by the specific example in Figure 3; in fact, which is better, K t or K e , is a function of the position in the CDF and is modulated by the amount of autocorrelation.
As a complement to Figure 7, Figure 8 shows the corresponding probability density functions (PDFs). Determination of a PDF is not as straightforward as a CDF because it first requires the estimation of the local slope of the CDF followed by application of a somewhat arbitrary kernel density smoothing operation. Here the smoothing was chosen to present a reasonable appearance. The step-function appearance at the low end of the K s curves is due to the use of a linear fit there. As in Figure 7, K t and K e are identical for lag-1 = 0, with K s nearly the same except for p < .1. This figure makes it clear how the three approaches diverge as autocorrelation increases. For the largest autocorrelation, there is not a lot of overlap between the three distributions.

| DISCUSSION AND CONCLUSIONS
This work has examined the problem of testing for differences between distributions based on two samples of data using both the widely used KS test as well as the closely related but lesser known KU test. While the former is more sensitive to differences near the middle of the distributions the latter is equally sensitive over the entire distribution. The question explored here is to what extent does lack of independence of the values within each sample, as quantified by the lag-1 autocorrelation, affect the outcomes of the tests? Three approaches to application of these tests were examined to address this question: the traditional approach ignoring the autocorrelation, a modified version using a simple estimate of the effective sample size, and Monte Carlo simulation.
[Figure 7 caption: traditional (black), simulation (cyan), and use of n eff (red). For the simulation and n eff sets, from left to right, the four curves correspond to lag-1 autocorrelations of 0.0, 0.3, 0.6, and 0.9. By definition, there is only one curve for the traditional approach, corresponding to lag-1 = 0; for lag-1 = 0 the n eff approach is identical to the traditional approach and the simulation CDF is obscured for most of the range since it is nearly identical to the traditional CDF.]
Some examples using both real as well as synthetic data demonstrated that substantial errors can arise when the temporal coherence is ignored. Further diagnosis showed that the differences between the three approaches vary depending on the amount of autocorrelation and the degree to which the two distributions under consideration differ.
The test based on the simulation results is recommended for use. To facilitate implementation, coefficients of polynomials fit to the raw simulation results are provided (Supporting Information). As long as the data samples being tested conform to a reasonable extent to the assumed model used in the simulations the new test will eliminate the bias found in the traditional and effective sample size implementations. Also provided are look-up tables that are easier to use than the full implementation based on the polynomials.
The results from this work can be used in a tiered fashion, depending on how much effort a user is willing to invest. At the lowest tier, the information can provide guidance as to how seriously results will be affected by applying the traditional approach. In some cases, the user may decide that any adverse effects would be minimal. At the second level the user can, with only a little additional effort, gain considerable mitigation of the bias by way of look-up tables. At the highest level, which is the recommended route, the user can fully implement the simulation results via the polynomials.
Given the results shown in Figure 3 one might be tempted to use the K e approach as a quick-fix alternative to K t . However, the efficacy of such a strategy varies and may be more appropriate when the goal is to identify the most highly significant differences. As seen in Figure 7, K e is closer to K s than K t only for high probabilities (at least ~0.9 to 0.95). In instances in which accuracy for less significant cases is important, for example when using a set of significance levels in assessing field significance (Wilks, 2016), K t might be a better choice. In either circumstance, augmenting with use of the look-up tables will render more accurate results, but again results will vary.
If one examines the look-up tables for K t (Tables S3 and S4) it is possible to achieve a high level of significance only for low values of lag-1 autocorrelation; much larger K t probabilities, beyond the limits of the tables, would be needed. Conversely, for K e (Tables S5 and S6), while high significance is possible, probabilities are censored at .99; however, the smallest probabilities are rather high in most cases. To achieve a more realistic range of probabilities in these two contrasting cases would require extensions into the more highly nonlinear regions of the relationships. In summary, while the look-up tables can be useful both diagnostically as well as for some applications, they are not a panacea. More generally, the full implementation, utilizing the polynomial fits, is recommended.
In closing, there is one further but important consideration for the potential user. Specifically, how appropriate is the statistical model used in the simulations, namely an AR1 process, for the data at hand? The AR1 assumption is appropriate for a broad range of meteorological and climate processes (Thiebaux and Zwiers, 1984; Zwiers and von Storch, 1995; von Storch and Zwiers, 2001). On the other hand some phenomena, particularly those of a quasi-periodic nature, such as the El Niño Southern Oscillation (ENSO), the stratospheric Quasi-Biennial Oscillation (QBO), and the Madden–Julian Oscillation (MJO), to name a few, are better characterized as AR2. In other instances, the Gaussian assumption may not be valid. Ultimately, a judgement needs to be made as to whether any perceived violations of the assumptions are outweighed by the benefits of accounting for the autocorrelation effect, which, as we have seen, can at times be quite large. If the simulation model cannot be accepted and there is considerable autocorrelation then it would seem that the final course of action would be a set of Monte Carlo simulations appropriate for the data at hand. Finally, the user should keep in mind the famous quote by statistician George Box: 'all models are wrong, but some are useful' (Box, 1976).