Cramér-von Mises Tests for Change Points

We study two nonparametric tests of the hypothesis that a sequence of independent observations is identically distributed against the alternative that the distribution changes at a single change point. The tests are based on the Cramér-von Mises two-sample statistic computed at every possible change point. One test uses the largest such statistic over all possible change points; the other averages over all possible change points. Large sample theory for the average statistic is shown to provide useful p-values much more quickly than bootstrapping, particularly in long sequences. Power is analyzed for contiguous alternatives. The average statistic is shown to have limiting power larger than its level against such alternative sequences; evidence is presented that this is not true for the maximal statistic. Asymptotic methods and bootstrapping are used to construct the null distribution, and the performance of the tests is checked in a Monte Carlo power study for various alternative distributions.


Introduction
Consider a sequence of independent observations X_1, ..., X_n. We propose tests of the null hypothesis that the X_i are independent and identically distributed (iid) with unknown continuous distribution H against the change point alternative that there is some (unknown) c with 1 ≤ c < n such that X_1, ..., X_c are iid with continuous distribution F while X_{c+1}, ..., X_n are iid with some other continuous distribution G. We consider tests based on two-sample empirical distribution function tests for equality of distribution, focusing on the two-sample Cramér-von Mises test.
If the time c of the potential change point were specified in advance we could test the hypothesis that F = G = H using any two-sample test for equality of distributions; the two-sample Cramér-von Mises test is one well known possibility. Notation is simpler if we use the shorthand d = n − c. Let F_c be the empirical distribution function of the first c observations and G_d be the empirical distribution function of the remaining d observations. The combined empirical distribution function H_n of the entire sample is H_n(x) = {cF_c(x) + dG_d(x)}/n. The two-sample Cramér-von Mises test of the hypothesis F = G is based on the statistic W_n(c) = (cd/n) ∫ {F_c(x) − G_d(x)}² dH_n(x). For a thorough discussion of this nonparametric test, and a simple computing formula in terms of the ranks of the first c values of X in the whole sample, see Anderson (1962). The distribution of the test statistic does not depend on H under the null hypothesis provided H is continuous. A number of authors have suggested adapting this statistic to the change point problem; see, for instance, Picard (1985) and Brodsky and Darkhovsky (1993), where the two natural test statistics considered herein are suggested and studied briefly. The first of these can be used both to assess the existence of a change point and to estimate the location of the change if it exists. The statistic in question is W_max ≡ max_{1≤c≤n−1} W_n(c).
We shall also use W_max to define the estimated change point ĉ_n = arg max_{1≤c≤n−1} W_n(c); thus ĉ_n is the value of c achieving the maximum. (We remark that the statistic W_n is discrete and in small samples there is some modest probability that ĉ_n will not be unique; this lack of uniqueness plays no role in the hypothesis testing problem.) We prefer, however, the average statistic W̄_n = W̄_n(X_1, ..., X_n) ≡ {1/(n − 1)} Σ_{c=1}^{n−1} W_n(c).
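For concreteness, the three statistics just defined can be computed directly from their definitions. The sketch below is our own illustration, not the authors' code, and the function names are ours: it evaluates W_n(c) from the two empirical distribution functions, then forms the average and maximal statistics.

```python
import numpy as np

def w_stat(x, c):
    """Two-sample Cramér-von Mises statistic comparing x[:c] with x[c:].

    W_n(c) = (c*d/n^2) * sum over all n sample points of (F_c - G_d)^2,
    i.e. (cd/n) times the integral of (F_c - G_d)^2 against the pooled
    empirical distribution function H_n.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = n - c
    # EDFs of the two segments evaluated at every point of the pooled sample
    F = np.searchsorted(np.sort(x[:c]), x, side="right") / c
    G = np.searchsorted(np.sort(x[c:]), x, side="right") / d
    return c * d / n**2 * np.sum((F - G) ** 2)

def w_bar(x):
    # Average statistic: mean of W_n(c) over all candidate change points.
    n = len(x)
    return np.mean([w_stat(x, c) for c in range(1, n)])

def w_max_and_chat(x):
    # Maximal statistic and the estimated change point achieving it.
    n = len(x)
    vals = [w_stat(x, c) for c in range(1, n)]
    c_hat = 1 + int(np.argmax(vals))
    return max(vals), c_hat
```

Because the statistics depend on the data only through ranks, `w_bar(x)` is unchanged by any strictly increasing transformation of the data, a handy sanity check.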
We offer several potential rationales for our choice: • In many goodness-of-fit contexts quadratic statistics like ours outperform maximal statistics. For instance, the Cramér-von Mises goodness-of-fit test is generally more powerful than the Kolmogorov-Smirnov test; see, for instance, Stephens (1986).
• Quadratic statistics such as we propose often have simpler large sample theory than do maximal statistics like the Kolmogorov-Smirnov test. Generally speaking the former have limiting distributions which are linear combinations of chi-squares while the latter have limiting laws which are those of the supremum of a Gaussian process. The actual laws of these suprema are known only in special cases (although inequalities can often provide useful upper bounds on p-values).
• The large sample theory in question often provides a more accurate approximation for quadratic statistics than it does for maximal statistics; see, for example, Mohd Razali and Yap (2011) and Büning (2002).
In Section 2 we present large sample distribution theory under the null hypothesis, show how to compute p-values based on this large sample theory, and demonstrate that the asymptotic approximations are quite accurate for n ≥ 100, particularly in the important lower tail. Section 3 presents a short power study showing that over a wide range of alternatives the statistic W̄ is more powerful than W_max. Section 4 presents asymptotic power calculations against contiguous sequences of alternatives; these permit useful approximations to the power of W̄ in cases where the null is not obviously false. By contrast, the limit theory for W_max does not lend itself to easy power calculations. We conjecture, however, that against contiguous alternatives the statistic W_max has the defect that, unlike W̄, its power converges to its level. Section 5 presents further Monte Carlo studies relevant to contiguous sequences of alternatives. Finally we present some discussion in Section 6. We give proofs and evidence for the conjecture in the Appendix.

Null limit theory
Suppose that the null hypothesis holds and X_1, ..., X_n are iid with continuous cdf H. Then for all c the statistic W_n(c) depends on the data only through the ranks of the observations, so its null distribution does not depend on H. Thus in computing distribution theory under the null we may, and will, assume that H is the uniform distribution; to emphasize the point we let U_1, U_2, ... be an iid sequence of Uniform[0,1] random variables; the joint law of (H(X_1), ..., H(X_n)) is the same as that of (U_1, ..., U_n).
Large sample theory for the two-sample Cramér-von Mises statistic is well known: if c depends on n in such a way that c/n → s ∈ (0, 1) (or even just min{c, n − c} → ∞) then W_n(c) converges in distribution to Σ_{i=1}^∞ Z_i²/(π²i²), where the Z_i are iid standard normal; see Anderson (1962). (Notice that the limit is free of s.) Our statistic has a related limit given as follows.
Theorem 1 As n → ∞ we have, under the null hypothesis, W̄_n converging in distribution to W_∞ ≡ Σ_{j=1}^∞ Σ_{k=1}^∞ Z_jk²/{j(j + 1)π²k²}, where the Z_jk are iid standard normal.
The theorem is a consequence, as usual, of a suitable weak convergence result which we now present; the Gaussian process limit we derive is mentioned in Picard (1985); the specific weights in Theorem 1 do not seem to have been previously described.
We begin by defining the partial sum empirical process (van der Vaart and Wellner, 1996, p. 225) Z_n(s, t) = n^{−1/2} Σ_{i≤ns} {1(U_i ≤ t) − t}. Our statistic can be described in terms of this process. It will prove useful to introduce the notation B_n(s, t) = Z_n(s, t) − sZ_n(1, t); notice that B_n(c/n, t) = (cd/n^{3/2}){F_c(t) − G_d(t)}. We now define a process W_n(s, t) for 0 < s < 1 and 0 ≤ t ≤ 1 by W_n(s, t) = B_n(s, t)/√{s(1 − s)}. For given c our two-sample test statistic is then W_n(c) = ∫_0^1 W_n²(c/n, t) dH_n(t). The processes Z_n and W_n have well known weak limits given in the following theorem.
Theorem 2 Under the null hypothesis: 1. As n → ∞, Z_n converges weakly to Z_∞, a mean 0 Gaussian process with covariance function ρ_Z(s, t; s′, t′) = (s ∧ s′)ψ(t, t′), where ψ(t, t′) = t ∧ t′ − tt′; 2. As n → ∞, B_n converges weakly to B_∞, a mean 0 Gaussian process with covariance function ρ_B(s, t; s′, t′) = ψ(s, s′)ψ(t, t′); 3. As n → ∞, W_n converges weakly to W_∞, a mean 0 Gaussian process with covariance function ρ_W(s, t; s′, t′) = χ(s, s′)ψ(t, t′), where χ(s, s′) = ψ(s, s′)/√{s(1 − s)s′(1 − s′)}. The process B_∞ is called a Brownian pillow by some writers, or a four-side tied-down Brownian motion; see, for instance, Zhang (2014) or McKeague and Sun (1996). The process Z_∞ is a Blum-Kiefer-Rosenblatt process; see Blum et al. (1961).
We now record well known facts about the eigenvalues of the covariance ρ_W. The covariance kernel ψ is that of a Brownian bridge; it has eigenvalues 1/(π²k²) for k = 1, 2, ... with corresponding orthonormal eigenfunctions f_{ψ,k}(u) = √2 sin(πku). The covariance kernel χ arises in the study of the Anderson-Darling goodness-of-fit test; it has eigenvalues 1/{j(j + 1)} for j = 1, 2, ..., and the corresponding orthonormal eigenfunctions are associated Legendre functions, the jth involving a polynomial q_j of degree j − 1 defined recursively. It follows that the eigenvalues of ρ_W consist of all possible products λ_jk = 1/{j(j + 1)π²k²}, with corresponding eigenfunctions f_{jk}(s, t) = f_{χ,j}(s)f_{ψ,k}(t). The expansion in Theorem 1 is then Parseval's identity with Z_jk = ∫∫ W_∞(s, t)f_{jk}(s, t) dt ds / √λ_jk.
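As a quick numerical sanity check on the eigenvalue products (a sketch; the truncation limits are arbitrary choices of ours), the partial sums of the λ_jk should approach the mean of W_∞, namely 1/6, since Σ_j 1/{j(j+1)} = 1 and Σ_k 1/(π²k²) = 1/6.

```python
import math

def lam(j, k):
    # Eigenvalues of rho_W: products of the Anderson-Darling kernel
    # eigenvalues 1/(j(j+1)) and the Brownian bridge eigenvalues 1/(pi^2 k^2).
    return 1.0 / (j * (j + 1) * math.pi**2 * k**2)

J = K = 4000
total = sum(1.0 / (j * (j + 1)) for j in range(1, J + 1)) * \
        sum(1.0 / (math.pi**2 * k**2) for k in range(1, K + 1))
# total approaches 1/6 as J, K grow; the truncation error is O(1/J + 1/K)
```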

Numerical Work
The distribution of W_∞ can be computed numerically in order to provide approximate, asymptotically valid, p-values. Our desired approximation to the p-value is P(W_∞ ≥ w_obs), where w_obs is the value of W̄_n observed in the data. Recall that λ_jk = 1/{π²j(j + 1)k²}.
In practice we truncate the infinite sum defining W_∞, retaining the M terms with the largest values of λ_jk, and replace the neglected terms by their expected value. So we write W_∞ = S_M + T_M, where S_M is the sum of the retained terms and T_M the sum of the rest. We then approximate T_M by its expected value: since the mean of W_∞ is Σ_{j,k} λ_jk = 1/6, the mean of T_M is 1/6 minus the sum of the retained λ_jk. Our approximation becomes P(W_∞ ≥ w_obs) ≈ P{S_M ≥ w_obs − E(T_M)}. The latter quantity may now be computed by numerical Fourier inversion following Imhof (1961). The R package CompQuadForm (see Duchesne and Lafaye de Micheaux, 2010) implements this computation in the function imhof; we use this software in our numerical work below. We have evaluated the quality of our asymptotic approximation to the null distribution of W̄ in a small Monte Carlo study. Since this distribution does not depend on H when the null hypothesis holds, we generated N = 10,000 samples of each of the sizes n = 200, 500, 1000. Figure 1 shows a Q-Q plot of these 10,000 p-values for n = 200 to check the uniformity of their distribution; specifically, we plot the order statistics against the uniform plotting points 1/(N + 1), ..., N/(N + 1). Figure 2 is an enlargement of the smallest 10% of these values, since the quality of the approximation is most important for small p-values. In both cases the approximation is seen to be excellent. For completeness, however, we note that the hypothesis of exact uniformity of these 10,000 p-values is rejected (P ≈ 0.01) by the Anderson-Darling test. Applied to the smallest 1,000 p-values, rescaled so that p-value number 1,001 from the bottom becomes 1, the Anderson-Darling p-value is 0.99. We conclude that the uniform approximation is very good at reasonable sample sizes, particularly in the important lower tail. For p-values over 0.5 we believe the truncation required to compute the limit law is slightly off, but inaccuracy in the upper tail of p-values is not very consequential.
Figure 2: The lower 10% of the distribution of the ordered p-values plotted against uniform quantiles for 10,000 iid Monte Carlo samples from a continuous distribution. The blue line is the uniform cumulative distribution function; exact p-values have a uniform distribution, and the graph shows the approximation is very good in the important lower tail.
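The truncation scheme can be sketched in a few lines. The paper computes P(S_M ≥ ·) by Imhof's Fourier inversion via the R function imhof; the Python sketch below (our own illustration, with arbitrary truncation and replication choices) substitutes plain Monte Carlo simulation of the truncated quadratic form, which is slower and rougher but makes the logic explicit.

```python
import numpy as np

def p_value(w_obs, M=100, reps=20000, seed=1):
    # Keep the M largest eigenvalues lambda_{jk} = 1/(j(j+1) pi^2 k^2),
    # replace the discarded tail T_M by its mean 1/6 - sum(kept), and
    # approximate P(W_inf >= w_obs) by P(S_M >= w_obs - E T_M).
    lams = sorted((1.0 / (j * (j + 1) * np.pi**2 * k**2)
                   for j in range(1, 120) for k in range(1, 120)),
                  reverse=True)[:M]
    lams = np.array(lams)
    tail_mean = 1.0 / 6.0 - lams.sum()
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((reps, M))
    s_m = (z**2 * lams).sum(axis=1)   # simulated truncated quadratic form
    return float(np.mean(s_m >= w_obs - tail_mean))
```

In practice the Imhof inversion used in the paper is both faster and more accurate than this simulation; the sketch is only meant to show how the truncation and mean correction fit together.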

Monte Carlo Power approximations
We undertook a variety of Monte Carlo simulation studies to compare the power of W̄_n to W_max. In Table 1 we show the percentage of samples rejected in 10,000 trials by the two methods at the levels α = 0.05 and α = 0.1. We consider samples of size n ∈ {20, 50, 100}. In one experiment recorded in the table we generated data from Gamma distributions whose parameters change at c = n/2. In another experiment we change from a Gamma distribution to a Normal distribution at c = n/2; in this case neither the mean nor the variance changes. While our tests are designed to detect single change points, we have included two trials in which there are three segments which change between various Gamma distributions. One changes from shape 1, scale 2 to shape 2, scale 1 at the 40% point and then to shape 0.5, scale 4 at the 60% point; all three of these distributions have the same mean. The other changes from shape 1, scale 2 to shape 2, scale 3, and back to shape 1, scale 2; the changes happen after 30% and then 70% of the data. Finally we present two experiments with samples from the normal distribution; in one the mean changes at c = n/2 and in the other the standard deviation changes at the same point. In all these trials the parameter values in a given segment do not change as the sample size changes; this may be compared with the further Monte Carlo results in Section 5.
It will be seen that, except for very small samples, when there is a single change point the test using W̄_n has better power than W_max. Since it is also far faster to compute p-values for W̄_n using the highly accurate asymptotic law, we recommend W̄ over W_max. At the same time we note that the procedure is specifically designed to choose between one change point and none, not to detect or estimate multiple change points. In particular, for one of the alternatives in Table 1 with two change points the statistic W_max is usually more sensitive than W̄_n.
The results presented here show how the powers grow with sample size when the two distributions are fixed. Other experiments, not reported here, show that both statistics have better power when the change is near the center of the sequence.
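A miniature version of such a power experiment can be run as follows. This is a sketch only: the sample size, mean shift, and replication counts are ours and far smaller than the 10,000-trial studies reported in the text, and null-simulated critical values stand in for the asymptotic ones.

```python
import numpy as np

def w_stat(x, c):
    # Two-sample Cramér-von Mises statistic comparing x[:c] to x[c:].
    n, d = len(x), len(x) - c
    F = np.searchsorted(np.sort(x[:c]), x, side="right") / c
    G = np.searchsorted(np.sort(x[c:]), x, side="right") / d
    return c * d / n**2 * np.sum((F - G) ** 2)

def both_stats(x):
    vals = [w_stat(x, c) for c in range(1, len(x))]
    return np.mean(vals), np.max(vals)   # (average statistic, maximal statistic)

rng = np.random.default_rng(0)
n, reps, alpha = 60, 400, 0.05

# critical values estimated from samples simulated under the null
null_bar, null_max = zip(*(both_stats(rng.standard_normal(n)) for _ in range(reps)))
crit_bar = np.quantile(null_bar, 1 - alpha)
crit_max = np.quantile(null_max, 1 - alpha)

# power against a mean shift of 1.5 at c = n/2
hits_bar = hits_max = 0
for _ in range(reps):
    x = rng.standard_normal(n)
    x[n // 2:] += 1.5
    b, m = both_stats(x)
    hits_bar += b > crit_bar
    hits_max += m > crit_max
power_bar, power_max = hits_bar / reps, hits_max / reps
```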
More Monte Carlo power calculations are presented in Section 5 below with a focus on contiguous alternatives.

Power approximations: contiguous alternatives
We now develop approximate distribution theory for W̄_n when the null hypothesis is false and the change at the change point is big enough to be detectable but not obvious; that is, we study situations where the best possible power in large samples stays away from 1. To do so we consider a sequence of alternatives indexed by n and assume that these alternatives are contiguous to a sequence for which the null hypothesis of no change holds. To be specific, our null hypothesis sequence has X_i iid for 1 ≤ i ≤ n with density h and cdf H. For the alternative we suppose that there is a value c_0 such that for 1 ≤ i ≤ c_0 the X_i are iid with density f and for c_0 + 1 ≤ i ≤ n the X_i are iid with density g. All of f, g, h, and the true change point c_0 may depend on n, but the dependence will be hidden in our notation. Under the null hypothesis the joint density of X_1, ..., X_n is Π_{i=1}^n h(x_i); under the alternative it becomes Π_{i=1}^{c_0} f(x_i) Π_{i=c_0+1}^n g(x_i). The log-likelihood ratio of these two is Λ_n = Σ_{i=1}^{c_0} log{f(X_i)/h(X_i)} + Σ_{i=c_0+1}^n log{g(X_i)/h(X_i)}. The sequence of alternatives f_1n is contiguous to the null sequence f_0n if, computing under the null hypothesis, the likelihood ratios remain suitably stochastically bounded. It is convenient to transform to U_i = H(X_i): under the null the U_i are iid Uniform[0,1], while under the alternative U_1, ..., U_{c_0} are iid with density f̃(u) = f(H^{−1}(u))/h(H^{−1}(u)) and U_{c_0+1}, ..., U_n are iid with density g̃(u) = g(H^{−1}(u))/h(H^{−1}(u)). Since our test statistics are invariant to a monotone transformation applied to each individual data point, we will take H to be Uniform[0,1] and then drop the tildes from our notation. We impose two conditions: A1, a local alternative condition under which √n(f − 1) and √n(g − 1) converge in L²(0, 1) to limits φ_f and φ_g; and A2, that there is a u ∈ (0, 1) such that lim_{n→∞} c_0/n = u.
Theorem 3 Suppose conditions A1 and A2 hold. Then as n → ∞ we have, under the sequence of alternative hypotheses specified by f, g, and c_0:
1. The log-likelihood ratio Λ_n converges in distribution to a normal law with mean −τ²/2 and variance τ², where τ² = u∫φ_f²(t) dt + (1 − u)∫φ_g²(t) dt;
2. The process W_n converges weakly to a Gaussian process with covariance function ρ_W and a mean function µ determined by φ_f, φ_g, and u; and
3. W̄_n converges in distribution to Σ_{j,k} λ_jk(Z_jk + µ_jk/√λ_jk)², where the Z_jk are iid standard normal and µ_jk = ∫∫ µ(s, t)f_{jk}(s, t) dt ds.
As with the null distribution, this limiting alternative distribution for W̄ can be computed using the R package CompQuadForm. As an example we take f to be standard normal and g to be normal with mean µ and standard deviation σ. The two parameters are assumed to depend on n in such a way that √nµ → γ_1 and √n(σ − 1) → γ_2.
It is convenient to take h = f. Under the null the data X_1, ..., X_n are iid standard normal. The functions f̃ and g̃ are then given by f̃ ≡ 1 and g̃(u) = g(Φ^{−1}(u))/h(Φ^{−1}(u)), where Φ is the standard normal cdf. Under these conditions we may check that condition A1 holds with φ_f = 0 and φ_g(u) = γ_1Φ^{−1}(u) + γ_2[{Φ^{−1}(u)}² − 1].
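The limit φ_g can be checked numerically. The sketch below is our own illustration (the helper name g_tilde and the tolerances are ours): for large n, √n{g̃(u) − 1} should be close to γ_1 z + γ_2(z² − 1) with z = Φ^{−1}(u).

```python
import math
from statistics import NormalDist

def g_tilde(u, mu, sigma):
    # g~(u) = g(H^{-1}(u)) / h(H^{-1}(u)) with h standard normal and
    # g normal with mean mu and standard deviation sigma.
    z = NormalDist().inv_cdf(u)
    return NormalDist(mu, sigma).pdf(z) / NormalDist().pdf(z)

n, g1, g2 = 10**8, 1.0, 0.5
rn = math.sqrt(n)
for u in (0.1, 0.3, 0.5, 0.7, 0.9):
    z = NormalDist().inv_cdf(u)
    approx = rn * (g_tilde(u, g1 / rn, 1.0 + g2 / rn) - 1.0)
    exact = g1 * z + g2 * (z * z - 1.0)
    # agreement improves as n grows; the discrepancy here is O(1/sqrt(n))
```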

Large sample behaviour of W max
The statistic W_max is more challenging to analyze because the weak convergence result in Theorem 2 asserts convergence in ℓ∞_loc((0, 1) × [0, 1]). By ℓ∞_loc((0, 1) × [0, 1]) we mean the space of functions on (0, 1) × [0, 1] which are bounded on compact subsets of their domain, given the topology of uniform convergence on compacts; see van der Vaart and Wellner (1996). Our proof of Theorem 1 shows that our statistic is a continuous function on a subset of ℓ∞_loc((0, 1) × [0, 1]) to which sample paths of W_∞ almost surely belong. We are not able to establish the corresponding result for W_max. Traditionally this problem has been handled either by fixing a small ε > 0 and redefining W_max by maximizing only over {c : ε ≤ c/n ≤ 1 − ε}, or by careful analysis of the behaviour of the process and the test statistic for c/n close to 0 or to 1. For instance, Jaeschke (1979) considers a weighted Kolmogorov-Smirnov test for the uniform distribution and shows that the supremum of the weighted empirical process has, after suitable normalization, an extreme value distribution. We have not pursued either of these ideas but offer here some evidence that this statistic has some important defects. First we look at a small simulation study. We generated 10,000 samples of sizes 100 and 500 from the null hypothesis. In Figure 3 we plot histograms of the value ĉ_n which maximizes W_n(c) over 1 ≤ c ≤ n − 1. Observe that as the sample size grows the histogram concentrates near 0 and 1 (though the convergence is slow). We can prove: Proposition 1 Under the null hypothesis and under any sequence of contiguous alternatives, min(ĉ_n, n − ĉ_n)/n → 0 in probability. Under the null hypothesis, the distribution of ĉ_n/n converges to a Bernoulli(0.5) law.
This means that, even for data from detectable (but not obvious) alternatives, the statistic W_max usually compares the distribution of a tiny fraction of the data to that of the vast majority, even when the true change point is in the middle of the sequence. We also conjecture: Conjecture 1 For any sequence of contiguous alternatives the difference between the power and the level of a test based on W_max goes to 0 as n → ∞.
Table 2: Powers (percentage) for a change from Gamma(shape = 1 + b/√n, scale = 1) to Gamma(1, 1) at the indicated breakpoint, n/2 in the top and 3n/10 in the bottom. Powers are based on 10,000 samples and use either Monte Carlo critical points (based on 100,000 samples) or asymptotic critical points, as indicated by 'MC' or 'Asym'. All tests are at level α = 0.05.
Table 3: Powers (percentage) for a change from Normal(0, σ = 1 + b/√n) to Normal(0, 1) at the indicated breakpoint, namely n/2 in the top and 3n/10 in the bottom, for n = 10, 50, 100, 200, 500. Powers are based on 10,000 samples and use either Monte Carlo critical points (based on 100,000 samples) or asymptotic critical points, as indicated by 'MC' or 'Asym'. All tests are at level α = 0.05.
Here is some Monte Carlo evidence from a simulation study. In Tables 2 and 3 we study four alternatives at sample sizes n = 10, 50, 100, 200, 500. For each sample size we draw 10,000 samples of size n. The first c observations in each sample have some parameter of the form a + b/√n and the remaining n − c have parameter a. We used the Gamma and normal distributions and tried c = 0.5n and c = 0.3n for each distribution. In the Gamma case we changed the shape parameter with a = 1 while holding the scale parameter at 1. The tables show the expected convergence (although we have not computed the power predicted by our theory in Section 4). For the statistic W_max the tables show, in the normal case, the power declining towards the level (5% here). For the Gamma cases studied here the power is rising, but slowly, for distant alternatives (large values of b) and declining very slowly for less distant alternatives (smaller values of b). Our experience in general is that for more distant alternatives larger sample sizes are required before the power of W_max begins to drop.
Our conjecture is motivated by an analogy with Lockhart (1991), where it is shown that goodness-of-fit test statistics which depend only on o(n) tail order statistics have the property asserted in the conjecture. In the Appendix we prove the proposition and provide partial details showing how we would hope to prove the conjecture.
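The endpoint concentration of ĉ_n asserted in Proposition 1 is easy to observe directly. The sketch below is ours (far fewer replications than the Figure 3 study, and helper names are ours); it records how often ĉ_n/n falls within 0.1 of an endpoint under the null.

```python
import numpy as np

def w_stat(x, c):
    # Two-sample Cramér-von Mises statistic comparing x[:c] to x[c:].
    n, d = len(x), len(x) - c
    F = np.searchsorted(np.sort(x[:c]), x, side="right") / c
    G = np.searchsorted(np.sort(x[c:]), x, side="right") / d
    return c * d / n**2 * np.sum((F - G) ** 2)

def c_hat(x):
    # Location of the maximum of W_n(c) over 1 <= c <= n-1.
    return 1 + int(np.argmax([w_stat(x, c) for c in range(1, len(x))]))

rng = np.random.default_rng(3)
n, reps = 100, 200
fracs = np.array([c_hat(rng.standard_normal(n)) / n for _ in range(reps)])
# fraction of null samples whose estimated change point lies near an endpoint
edge_frac = float(np.mean((fracs < 0.1) | (fracs > 0.9)))
```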

Discussion
It is a general principle that procedures with optimal frequentist properties are found by searching among Bayes procedures. It is also generally the case that optimal Bayes procedures involve averaging rather than maximizing. These heuristics motivate testing for change points using statistics which average over possible change points rather than maximize. In this paper we have used this heuristic to motivate an averaged two-sample goodness-of-fit statistic for detecting general changes in distribution, rather than simple changes in mean, in a sequence of independent data points. We have shown the resulting test statistic has computable large sample theory which can be used to provide very accurate p-values. Moreover we have shown that averaging over possible change points is generally more sensitive to alternatives than maximizing over possible change points.
The basic idea can be used in other contexts. Consider, for instance, testing for a change in mean. We describe first the unrealistic situation in which the standard deviation is known and then how to handle estimation of that SD. Suppose X_1, ..., X_n are independent and we wish to test the null hypothesis that they are iid with unknown mean µ and known standard deviation σ (which we take to be 1 for notational convenience) against the alternative that the mean changes after data point number c. The usual Z statistic for a change at c is Z_n(c) = (X̄_c − X̄′_d)/√(1/c + 1/d), where X̄_c is the mean of the first c observations and X̄′_d is the mean of the remaining d = n − c. Our proposal would be to use the two-sided test statistic T̄_n = {1/(n − 1)} Σ_{c=1}^{n−1} Z_n²(c). This statistic has mean 1 under the null hypothesis of no change in mean. Arguments similar to those in Section 2 show that this statistic has the same limiting null distribution as the well known Anderson-Darling goodness-of-fit statistic.
In the more realistic case where the (assumed common) standard deviation is unknown we use the same statistic with σ replaced by an estimate s which is consistent under the null hypothesis. The sample standard deviation is one possibility, though it can be badly biased under the alternative. An estimate which is rather less precise, but still likely to be quite accurate under the alternative hypothesis, is the difference-based estimate s² = Σ_{i=1}^{n−1}(X_{i+1} − X_i)²/{2(n − 1)}. Notice that under the alternative hypothesis all but one term in this average is an unbiased estimate of σ²; the bias in the estimator is approximately ∆_µ²/(2n), where ∆_µ denotes the change in mean at the true change point. Under the null our estimate is unbiased. The resulting statistic T̄_s also has the same limiting distribution as the Anderson-Darling goodness-of-fit statistic when the null holds.
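A sketch of this change-in-mean variant (our own illustration; the function name and defaults are ours): the squared Z statistics are averaged over all candidate change points, with σ² either supplied or replaced by the difference-based estimate discussed above.

```python
import numpy as np

def t_bar(x, sigma2=None):
    # Average of the squared change-in-mean Z statistics over all candidate
    # change points. With sigma2=None the difference-based variance estimate
    # is used; under a single mean change only one of the n-1 squared
    # differences contributing to it is biased.
    x = np.asarray(x, dtype=float)
    n = len(x)
    if sigma2 is None:
        sigma2 = np.sum(np.diff(x) ** 2) / (2.0 * (n - 1))
    cs = np.cumsum(x)
    total = cs[-1]
    out = 0.0
    for c in range(1, n):
        d = n - c
        z = (cs[c - 1] / c - (total - cs[c - 1]) / d) / \
            np.sqrt(sigma2 * (1.0 / c + 1.0 / d))
        out += z * z
    return out / (n - 1)

rng = np.random.default_rng(4)
# Under the null (iid N(0,1), sigma known) each Z_n(c)^2 has mean exactly 1,
# so the Monte Carlo average of t_bar should be close to 1.
null_mean = float(np.mean([t_bar(rng.standard_normal(100), sigma2=1.0)
                           for _ in range(400)]))
```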
Other nonparametric goodness-of-fit tests can be used instead of the Cramér-von Mises test: for example, a Bayesian test (Labadi et al., 2014), likelihood tests (Csörgö et al., 1997), or other two-sample tests (Büning, 2002). The sample size, the kind of alternative distribution from which we expect the data to come, and the expected index of the change point should guide the choice of test. Finding the asymptotic distribution for less well-known tests can be difficult; bootstrapping can be used instead. This deserves further research.
Appendix

Recall that our statistic can be described in terms of the partial sum empirical process: with W_n(s, t) = B_n(s, t)/√{s(1 − s)} for 0 < s < 1 and 0 ≤ t ≤ 1, the two-sample statistic for given c is W_n(c) = ∫_0^1 W_n²(c/n, t) dH_n(t). Let ν_n be the probability measure on (0, 1) putting mass 1/(n − 1) on each point of the form c/n for 1 ≤ c ≤ n − 1. Our statistic is then W̄_n = ∫∫ W_n²(s, t) dH_n(t) dν_n(s). We now break the proofs of Theorems 1 and 2 into steps consisting of a statement followed by a detailed proof. In each case the assertions are intended to hold under the null hypothesis and the assumption that the common distribution H is continuous.
Step 3: For any sequence c_n with ε_n ≡ c_n/n → 0 we have {1/(n − 1)} Σ_{c≤c_n} W_n(c) + {1/(n − 1)} Σ_{c≥n−c_n} W_n(c) → 0 in probability. Under the null hypothesis the mean of W_n(c) is 1/6 + 1/(6n); see Anderson (1962). The expected value of the indicated quantity is thus at most {2c_n/(n − 1)}{1/6 + 1/(6n)}, which tends to 0. Step 4: The integral ∫_0^1 ∫_0^1 W_∞²(s, t) dt ds is almost surely finite; since all the variates involved are non-negative we may compute its expectation term by term. Step 5: For any sequence ε_n tending to 0 as n → ∞ we have, by taking expectations, (∫_0^{ε_n} + ∫_{1−ε_n}^1) ∫_0^1 W_∞²(s, t) dt ds → 0 in probability.
Step 6: The tensor product kernel ρ_W = χ ⊗ ψ, with ρ_W(s, t; s′, t′) = χ(s, s′)ψ(t, t′), is compact and has eigenvalue-eigenfunction pairs indexed by j, k each running from 1 to ∞. It follows as usual that the family Z_jk = ∫∫ W_∞(s, t)f_{jk}(s, t) dt ds/√λ_jk is a family of independent standard normal variables, and Parseval's identity gives ∫∫ W_∞²(s, t) dt ds = Σ_{j,k} λ_jk Z_jk². Step 7: For each fixed ε > 0 the difference between ∫_ε^{1−ε} ∫_0^1 W_n²(s, t) dH_n(t) dν_n(s) and the corresponding average of the W_n(c) tends to 0 in probability. This is an easy consequence of the fact that for i/n ≤ s < (i + 1)/n we have ∫_0^1 W_n²(s, t) dH_n(t) = W_n(i).
Step 8: For each fixed ε > 0 we have ∫_ε^{1−ε} ∫_0^1 W_n²(s, t) dH_n(t) dν_n(s) − ∫_ε^{1−ε} ∫_0^1 W_n²(s, t) dt ds → 0 in probability. Under the null hypothesis H_n converges weakly to the uniform law on the unit interval, and ν_n converges weakly to Lebesgue measure on the unit interval. The weak convergence result in Step 2 above uses a topology of uniform convergence on compacts such as the set [ε, 1 − ε] × [0, 1], and this implies the desired result.
Step 9: For each fixed ε > 0 we have ∫_ε^{1−ε} ∫_0^1 W_n²(s, t) dt ds converging in distribution to ∫_ε^{1−ε} ∫_0^1 W_∞²(s, t) dt ds. This is a direct consequence of weak convergence and the continuous mapping theorem.
Step 10: There is a metric d on the set of probability measures on the real line for which the metric topology is the topology of weak convergence. For each fixed ε > 0 the distance between the law of ∫_ε^{1−ε} ∫_0^1 W_n²(s, t) dt ds and the law of its limit tends to 0. There is then a sequence ε_n → 0 so slowly that this convergence continues to hold with ε replaced by ε_n, and so that the convergences in Steps 7 and 8 continue to hold. Notice that, by Step 5, the neglected edge contribution also vanishes for this sequence.
Step 11: For the sequence ε_n chosen in Step 10 we therefore have W̄_n − ∫_{ε_n}^{1−ε_n} ∫_0^1 W_n²(s, t) dt ds → 0 in probability. In view of Step 1 we see that W̄_n converges in distribution; by Step 6 the law of the limit is that of Σ_{j,k} λ_jk Z_jk². This completes the proofs of Theorems 1 and 2.
Proof of Theorem 3. This is standard so we present only an outline. Conditions A1 and A2 can be used to prove that Λ_n − S_n → 0 in probability under the null, where S_n is the corresponding sum of score terms. The Lindeberg central limit theorem then establishes the first conclusion of the theorem; for more detailed arguments in a similar context see Guttorp and Lockhart (1988). Thus, under the conditions of the theorem, the sequence of alternatives is contiguous to a sequence for which the null holds. Contiguity implies that tightness under the null sequence extends to tightness under the alternative sequence; this proves tightness, under the alternative, of the sequence of processes W_n. Thus we need only compute the limiting finite dimensional distributions under the alternative sequence. As usual we apply Le Cam's third lemma (again, similar arguments are in Guttorp and Lockhart (1988)) to reduce the problem to studying the joint law, under the null hypothesis, of Λ_n and the vector (W_n(s_1, t_1), ..., W_n(s_k, t_k)) for arbitrary points (s_1, t_1), ..., (s_k, t_k) in (0, 1) × [0, 1].
The null distribution theory presented above (see Step 1 in the proof of Theorem 2) shows that, under the null hypothesis, (W_n(s_1, t_1), ..., W_n(s_k, t_k)) is asymptotically multivariate normal with mean 0 and covariance matrix R_W, the k × k matrix with (i, j)th entry ρ_W(s_i, t_i; s_j, t_j). The Lindeberg central limit theorem may now be used to show that the vector (S_n, W_n(s_1, t_1), ..., W_n(s_k, t_k)) converges in distribution to a multivariate normal with mean vector (−τ²/2, 0, ..., 0) and a variance-covariance matrix whose diagonal blocks are τ² and R_W and whose off-diagonal block is a vector c. Here c is the limiting covariance between S_n and the W_n values, found after some algebra. This completes the proof of the second assertion of the theorem. The third step is standard; Guttorp and Lockhart (1988) treats similar problems.
This will prove Proposition 1. To this end fix 0 < ε < δ and define I_n(ε) = {c : min(c, n − c) ≤ nε}. While we expect a full law of the iterated logarithm to hold we have not tried to prove anything along those lines. We will establish instead the lower bound lim sup_{s→0} ∫_0^1 π²B²(s, t)/[2 log{log(1/s)}s(1 − s)] dt ≥ 1 almost surely, which is enough to imply (2). We enumerate the steps needed: 1. Let I_B(s) = ∫_0^1 B²(s, t) dt and I_Z(s) = ∫_0^1 Z²(s, t)/s dt.
From this we deduce that it is enough to show that lim sup_{s→0} π²I_Z(s)/[2 log{log(1/s)}] ≥ 1 almost surely. 2. For each fixed s the variable I_Z(s) has the law of W(s) ≡ Σ_j λ_j Z_j²; in this representation the Z_j are iid standard normal and the eigenvalues λ_j are given, for j = 1, 2, ..., by λ_j = 1/(π²j²). 3. The process Z has independent increments in s, and for each 0 < s < s′ the increment process Z(s′, ·) − Z(s, ·) is independent of Z(s, ·). All of the variables in question have the law of W(s) described above.
5. Fix ε > 0. Let A_n be the event W*_n > 2(1 − ε)λ_1 log(log(1/s_n)) and B_n be the event W_{n+1} ≤ 2(1 + ε)λ_1 log(log(1/s_n)). We will show that we can choose r small enough so that (a) The event that A_n occurs infinitely often (i.o.) has probability 1.
(b) The event that B n occurs for all large n has probability 1.
6. So the event A_n ∩ B_n i.o. has probability 1. 7. On the event A_n ∩ B_n we have W_n ≥ 2(1 − ε)λ_1 log(log(1/s_n)), so this event occurs infinitely often.
This establishes (3). The definition of contiguity is that any sequence of events whose probabilities converge to 0 under the null also has probabilities converging to 0 under the alternative. This finishes the proof of Proposition 1.
Proposition 1 establishes that there is a sequence ε_n → 0 such that lim_{n→∞} P(ĉ_n ∈ I_n(ε_n)) = 1.
We now outline the steps in our strategy for proving the conjecture before giving some evidence for each step.
Step 1: There are constants a_n and b_n and a random variable V such that a_n W_max − b_n converges in distribution to V, where V has a continuous distribution.
Step 2: Consequently a_n max{W_n(c) : c ∈ I_n(ε_n)} − b_n also converges in distribution to V.
Step 3: There are random variables W̃_n(c) such that under the null hypothesis a_n max{|W_n(c) − W̃_n(c)| : c ∈ I_n(ε_n)} → 0 and such that for each c ∈ I_n(ε_n) the variable W̃_n(c) is measurable with respect to the σ-field generated by the observations with indices in I_n(ε_n). To be specific we define, for c < n/2, W̃_n(c) = (cd/n) ∫_0^1 {F_c(u) − u}² du, replacing both G_d and H_n by the uniform cdf (recall the shorthand d = n − c); the definition for c > n/2 is symmetric. Step 4: Define Λ̃_n by restricting the sums in Λ_n to indices outside I_n(ε_n). The log-likelihood ratio Λ_n satisfies Λ_n − Λ̃_n → 0 in probability under the null hypothesis.
Step 5: Since W̃_max is independent of Λ̃_n we may apply Le Cam's third lemma to show that under the sequence of contiguous alternatives a_n W̃_max − b_n converges in distribution to V. Step 6: Since this limit law is the same as under the null, the power minus the level must tend to 0.
For some of these steps we can offer partial evidence. For Step 1 we would hope to follow the ideas in Jaeschke (1979) to show that the limit V has an extreme value distribution. In that paper the maximum of the uniform empirical process, standardized by dividing by its standard deviation, is shown to have an extreme value limit with constants analogous to a_n and b_n involving √(log log n) and log log log n.
Step 2 is a consequence of Step 1 and (4). In Step 3 we would hope to use the closeness of H_n to the uniform distribution to convert the dH_n(u) integrals to du integrals. We then write W_n(c) − W̃_n(c) as a sum of terms T_1 and T_2 together with a cross term T_3. The integrals in T_1 and T_2 are both one-sample Cramér-von Mises statistics, so they are of order 1. For any sequence c = c_n such that c_n/n → 0 the coefficient in front of T_2 is o(1), so T_2 is negligible relative to T_1; the Cauchy-Schwarz inequality then shows T_3 is also negligible relative to T_1. There is a parallel argument when c_n/n → 1.
Step 4 is not conjecture; its proof is straightforward from the assumptions of the conjecture. Steps 5 and 6 are exactly parallel to the arguments in Lockhart (1991).