Exact change point detection with improved power in small‐sample binomial sequences

To detect a change in the probability of a sequence of independent binomial random variables, a variety of asymptotic and exact testing procedures have been proposed. Whenever the sample size or the event rate is small, asymptotic approximations of maximally selected test statistics have been shown to be inaccurate. Although exact methods control the type I error rate, they can be overly conservative in these situations due to the discreteness of the test statistics. We extend approaches by Worsley and Halpern to develop a less discrete test with increased power. Building on ideas from binary segmentation, the proposed test utilizes otherwise unused information in the binomial sequences to impose an ordering on test statistics of equal value. The exact distributions are derived under side conditions that arise in hypothetical segmentation steps and do not depend on the type of test statistic used (e.g., log likelihood ratio, cumulative sum, or Fisher's exact test). Using the proposed exact segmentation procedure, we construct a change point test and prove that it controls the type I error rate at any given nominal level. Furthermore, we prove that the new test is uniformly at least as powerful as Worsley's exact test. In a Monte Carlo simulation study, the gain in power can be remarkable, especially in scenarios with small sample sizes. With a clinical database example about pin site infections and an example assessing publication bias in neuropsychiatric drug research, we demonstrate the wide-ranging applicability of the test.

INTRODUCTION

Change point detection schemes can broadly be divided into retrospective (offline) analyses of completed sequences and sequential (online) schemes that are frequently used in quality control. Here, we consider the former, an ordered sequence of independent binomial variables $X_1, \dots, X_n$, with $X_i \sim \mathrm{Bin}(m_i, p_i)$ being the number of events occurring in $m_i$ subjects at risk. We are interested in testing the null hypothesis $H_0$ of constant event probabilities $p_i$ ($i = 1, \dots, n$) equal to $p$ against the alternative
$$H_1: \; p_i = \begin{cases} p, & i = 1, \dots, \tau, \\ p', & i = \tau + 1, \dots, n, \end{cases} \qquad p \neq p',$$
for some period $\tau$ in the range $1, \dots, n-1$ denoting an unknown change point. Most commonly, tests to detect such a change point in the sequence are based on taking the maximum of a test statistic $T_{1:n}(k)$ that is designed to find differences for fixed candidate change points $k = 1, \dots, n-1$. Those statistics include the log likelihood ratio, the cumulative sum and variations thereof (Pettitt, 1980), and the p-value of Fisher's exact test (Halpern, 1999); but also statistics based on Doob's martingale decomposition (Brostrom, 1997) as well as Bayesian statistics may be used (Assareh, Smith, & Mengersen, 2015; Smith, 1975). These maximally selected test statistics $T^{\max}_{1:n} = \max_{k=1,\dots,n-1} T_{1:n}(k)$ arise not only in change point detection but also in various other applications. A real-world problem might be the precise assessment of the probability of an unfavorable realization in a random ordering of (clusters of) binary events. One such example is the detection of manipulations of fixture lists in sports, which define the order of (weak or strong) opponents to be matched against. Another example may be the assessment of potential context effects in surveys; the context effect relates the order of questions asked to a bias in the overall thinking and answers of survey respondents. Applications of change point models in epidemiology and medicine are common and have led to ongoing methodological developments.
Examples include the epidemic wave model (Boulesteix & Strobl, 2007; Siegmund, 1986), the assessment of genetic recombination (Halpern, 1999), dose-response models (Lausen, Lerche, & Schumacher, 2002), calendar time effects in clinical registries (Friede & Henderson, 2003), and clinical trials with adaptive designs (Friede & Henderson, 2009). Maximally selected test statistics are also used as cutpoint methods for dichotomization, although these should primarily be used when an underlying change can truly be regarded as abrupt. Otherwise, these methods may lack statistical power and alter the effect estimates in comparison to continuous regression methods (see, e.g., Royston, Altman, & Sauerbrei, 2006) as long as the latter are correctly specified. Still, the simplicity of the considered change point model avoids instabilities in the parameter estimation in these scenarios.
Asymptotic distributions of maximally selected test statistics in binary sequences were derived by a number of authors (see Miller & Siegmund, 1982; Pettitt, 1979, 1980). For small sample sizes, however, these approximations perform poorly and exact methods are to be preferred (Friede, Henderson, & Kao, 2006; Halpern, 1982). Exact null and alternative distributions were given by Worsley (1983) for log likelihood ratio and cumulative sum test statistics. Halpern (1999) proposed to use Fisher's exact test and compared the different approaches with regard to their statistical power. Hirotsu (1997) gave exact distributions in case of two-way layouts with interaction effects. While these exact methods are designed for small sample sizes, they often fail to exhaust the nominal level in exactly these scenarios: due to the discreteness of the test statistic, the significance level cannot be used to the full extent. This adds a degree of conservativeness to the test procedure (Ross, Tasoulis, & Adams, 2013). Several approaches have been discussed to overcome not only the implied loss in power, but also the lack of precision as a methodological disadvantage (Zhou, Zou, Zhang, & Wang, 2009). The trivial solution of randomizing the test statistic is only of theoretical interest to achieve a uniformly most powerful test, but cannot be recommended for practical application because of a lack of reproducibility, among other reasons. The same applies to approaches that use Monte Carlo simulation techniques to obtain the required probabilities; see, for example, Ross et al. (2013). An unconditional version of Worsley's test addressing the problem was suggested by Ellenberger and Friede (2016) to gain power with less discrete test statistics. Unlike Worsley's test, this test does not condition on the observed total number of events. The nuisance parameter is dealt with by maximizing the p-value over the nuisance parameter.
We aim at developing a hypothesis test that also uses information in the sequences left and right of each possible change point. We can thus define an ordering of different sequences that all yield the same p-value in Worsley's test. The ordering is based on binary segmentation ideas and is used to create less discrete test statistics. To this end, we develop in Section 5 exact null distributions under certain side conditions. These are used to obtain exact p-values on both subsequences left and right of the potential change point $\hat\tau = \arg\max_k T(k)$, conditional on $\hat\tau$. With these conditional p-values, we define a new test in Section 6 by applying a combination function such that both p-values are merged into a single meaningful p-value. The performance of the proposed test is assessed by Monte Carlo simulations in Section 7, and the test is applied to the two motivating examples introduced in Section 2. One example searches for change points in a clinical database of orthopedic surgeries using external fixators in children; the other example is concerned with the assessment of publication bias in neuropsychiatric drug research. We close with a brief discussion of the findings in Section 8.

MOTIVATING EXAMPLES

Pin site infections in pediatric orthopedic surgery

Data from a clinical database of orthopedic surgeries were analyzed by Friede et al. (2006) to investigate the effectiveness of the introduction of a new procedure for pin site care. Measures against pin site infections needed to be taken because those were frequent. The data suggest that the introduction of the new procedure is strongly associated with a decrease in infections. We are now interested in whether this association holds for a subgroup of boys who had surgeries with external fixators to treat fractures at their feet. This subgroup of 26 pediatric patients with a total of 20 infections appears to be homogeneous in terms of reason for surgery and other characteristics. Since the covariates age and reason for application were found to be noninformative in previous analyses (Friede et al., 2006), these were not considered here.
The binary sequence of pin site infections is shown as 1s and 0s in Panel A of Figure 1. The log likelihood ratio statistic is displayed for all possible change points $k$.

Publication of FDA-approved neuropsychiatric drugs
Publication bias is increasingly recognized as a major problem in scientific publishing. Several authors have addressed the matter and investigated its extent. For clinical trials of neuropsychiatric drugs, Zou et al. (2018) have conducted an extensive literature search and analyzed trends over time. In the past decades, regulatory authorities have taken measures to prevent negative results from not being reported or published. The US Congress passed the FDA Modernization Act (FDAMA) in 1997, which mandated the public registry ClinicalTrials.gov that was established 3 years later. In 2005, the International Committee of Medical Journal Editors (ICMJE) enacted a policy requiring trial registration as a prerequisite for publication in member journals, leading to an increase in the number of registered studies. Zou et al. (2018) point out that FDAMA did not require registration of all studies, and the ICMJE recommendation continued to allow the publication of unregistered studies as compliance was voluntary. In 2007, the US Congress passed the FDA Amendments Act (FDAAA), which applies to all non-phase-I studies with FDA-regulated drugs and requires sponsors and investigators to register all such trials in ClinicalTrials.gov prior to enrolment and report the results to ClinicalTrials.gov within 30 days post approval. Inappropriately delayed registration and reporting of results, as well as reporting of incorrect results, can be punished by fines and possible loss of funding. The FDAAA applies to trials initiated after September 27, 2007, and to earlier trials in progress as of December 26, 2007. Zou et al. (2018) have studied in detail the registration, results reporting, and publishing of clinical trials supporting FDA approval of neuropsychiatric drugs. They investigated the possible effects of the FDAAA on the publication of negative or unequivocal findings.
Regarding this publication bias, the authors found statistically significant evidence that the rate of publishing negative findings has increased from the pre-FDAAA era to the post-FDAAA era. In the latter, all trials were published, though some recent trials were also found to report results inconsistent with the FDA approval assessment. In contrast to investigating only this comparison with a date of change that is allegedly already known, it is also of interest to carry out analyses that consider all possible change points in the chronological order of the drug approvals. We therefore carried out the respective change point analyses on the data. We considered all drug agents from different pharmaceutical companies that were approved by the FDA but had at least one trial with either negative or questionable results in the FDA's reports. With the data by Zou et al. (2018), we tested the outcome of whether all trials of the approved drug were published in a scientific journal, as should be mandatory. These analyses also investigate the willingness of drug companies to publish older negative studies on an approved drug, whose retrospective reporting did not become mandatory through the FDAAA. Similarly, further calendar time effects in pivotal studies for the FDA approval of new drugs are the subject of ongoing investigations (Zhang et al., 2020).

MODEL AND WORSLEY'S TEST
We consider the problem of investigating the existence of a change point in a subsequence of interest starting at $a$ and ending at $b$ with $1 \le a < b \le n$. Let $S_{h:k} = \sum_{i=h}^{k} X_i$ be consecutive sums ($h$th to $k$th) of binomially distributed event numbers $\{X_i : i = 1, \dots, n\}$, and $M_{h:k} = \sum_{i=h}^{k} m_i$ the corresponding sums of the numbers of trials $\{m_i : i = 1, \dots, n\}$, also called bin sizes. The indices $h\!:\!k$ can be dropped when the whole sequence of interest is referred to, in order to simplify the statistical notation. So, $N = S_{a:b}$ and $M = M_{a:b}$ are the total numbers of events and subjects within the relevant subsequence. Common change point methods for binomial data are based on maximally selected test statistics for the 2×2 tables contrasting events and non-events before and after each candidate change point $k$. Let $T_{a:b}(k)$ denote such a statistic and $T^{\max}_{a:b} = \max_k |T_{a:b}(k)|$ be the maximum over $k = a, \dots, b-1$. Conditional on $N$ ($= S_{a:k} + S_{k+1:b}$) and all bin sizes fixed, $T_{a:b}(k)$ is dependent on $S_{a:k}$ only and we may simply write it as a function $T_{a:b}(S_{a:k} \mid N)$. Worsley (1983) gives exact distributions of such maximally selected test statistics, which we want to generalize. All inference is made conditional on $N$, with $N/M$ being sufficient for the event probability $p$ to eliminate this nuisance parameter. The only regularity assumption for $T_{a:b}$ requires some monotonicity in $S_{a:k}$, which is usually fulfilled for sensible choices of $T(\cdot)$. While one-sided test statistics naturally are monotone, two-sided test statistics have to be monotone for all decreasing $S_{a:k} \le s_0$ and for all increasing $S_{a:k} \ge s_0$ separately, with $s_0$ being the $\arg\min$ of $T$ for a given $T(\cdot)$. Usually, $s_0/M_{a:k}$ is close to $N/M$. With this assumption, it is guaranteed that events $\{T_{a:b}(k) < t\}$ can be expressed as sets being intervals $\{l_k \le S_{a:k} \le u_k\}$ with $l_k = \inf\{s : T_{a:b}(s \mid N) < t\}$ and $u_k = \sup\{s : T_{a:b}(s \mid N) < t\}$. The test statistics given in Worsley (1983) were the log likelihood ratio and the cumulative sum statistic. For fixed $k$, writing $S_k$ and $M_k$ for $S_{a:k}$ and $M_{a:k}$, the log likelihood ratio statistic is
$$LR_k = S_k \log\frac{S_k}{M_k} + (M_k - S_k)\log\frac{M_k - S_k}{M_k} + (N - S_k)\log\frac{N - S_k}{M - M_k} + (M - M_k - N + S_k)\log\frac{M - M_k - N + S_k}{M - M_k} - N\log\frac{N}{M} - (M - N)\log\frac{M - N}{M}.$$
The statistic was first used by Hinkley and Hinkley (1970). The cumulative sum statistic is
$$CS_k = \frac{S_k - M_k\,\hat p_0}{\hat\sigma_0\sqrt{M}}$$
with $\hat p_0 = N/M$ and $\hat\sigma_0 = \sqrt{\hat p_0(1 - \hat p_0)}$. $\max_k |CS_k|$ has the same distribution as the Kolmogorov-Smirnov statistic $D_{N, M-N}$ (Gibbons, 1985; Pettitt, 1979).
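As an illustration, the following sketch computes the profiles of the two statistics above for a given sequence. It is a hypothetical helper, not the authors' implementation: the LLR is evaluated without a constant factor (which does not affect the maximization), and the CUSUM is standardized by the pooled rate $\hat p_0$ and $\hat\sigma_0$ as defined above.

```python
from math import log, sqrt

def xlogx(x):
    # x * log(x) with the convention 0 * log(0) = 0
    return x * log(x) if x > 0 else 0.0

def change_point_profile(events, sizes):
    """Profiles of the log likelihood ratio and the standardized CUSUM
    statistic for every candidate change point k = 1, ..., n-1, together
    with the (leftmost) argmax of the LLR profile.
    Assumes 0 < sum(events) < sum(sizes)."""
    n, N, M = len(events), sum(events), sum(sizes)
    p0 = N / M
    sigma0 = sqrt(p0 * (1 - p0))
    llr, cusum = [], []
    s = mk = 0
    for k in range(1, n):
        s += events[k - 1]
        mk += sizes[k - 1]
        # LLR of the 2x2 table: (s of mk) on the left, (N-s of M-mk) on the right
        llr.append(xlogx(s) + xlogx(mk - s) - xlogx(mk)
                   + xlogx(N - s) + xlogx(M - mk - N + s) - xlogx(M - mk)
                   + xlogx(M) - xlogx(N) - xlogx(M - N))
        # standardized cumulative-sum deviation from the pooled rate p0
        cusum.append(abs(s - mk * p0) / (sigma0 * sqrt(M)))
    khat = max(range(1, n), key=lambda k: llr[k - 1])
    return llr, cusum, khat
```

For a sequence with a clean split, such as three events followed by three non-events with unit bin sizes, both profiles peak at the true change point $k = 3$.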
Also briefly mentioned in Worsley (1983) is the usual Pearson $\chi^2$ statistic (Miller & Siegmund, 1982) for testing the equality of $p$ and $p'$. It is equivalent to a two-sample version of the Anderson-Darling statistic (Darling, 1957; Halpern, 1999). Exact two-sided statistical inference with the Pearson statistic yields exactly the same results as using the z-statistic, which some authors refer to as z-pooled (Mehrotra, Chan, & Berger, 2003). Since $z^2 = \chi^2$ holds, it is sufficient to use the Pearson statistic when two-sided testing is carried out. Another statistic introduced by Halpern (1999) is the p-value of Fisher's exact test, which is computed from the hypergeometric distribution of $S_{a:k}$ given $N$ (Fay, 2010). The log likelihood ratio, cumulative sum, Pearson, and Fisher's exact test statistics are the ones most prominently used in the literature, although many more statistics for testing a change in probability are available, for example, variations of z-pooled with separate variance estimation, variations of the two-sided Fisher's exact test (Fay, 2010), rank statistics (Hothorn & Lausen, 2003; Hothorn & Zeileis, 2008; Lausen & Schumacher, 1992), martingale statistics (Brostrom, 1997), or Bayesian approaches (Smith, 1975).
Let $F_{a:b}(t)$ be the distribution function of any of the above-mentioned $T^{\max}_{a:b}$ under $H_0$, conditional on the number of events $N$ in the (sub)sequence. Worsley (1983) gives an exact iterative procedure to calculate $F_{a:b}(t)$ as follows. For each candidate change point $k$, the event $\{T_{a:b}(k) < t\}$ can be expressed in terms of the sets $A_k = \{l_k \le S_{a:k} \le u_k\}$ as defined above. The event $\{T^{\max}_{a:b} < t\}$ is thus equivalent to $\bigcap_k A_k$. Given all bin sizes and $N$ fixed, let
$$Q_{k+1}(s) = \mathbf{1}\{l_{k+1} \le s \le u_{k+1}\} \sum_{j=l_k}^{u_k} Q_k(j)\, h_k(s, j),$$
where $h_k(s, j)$ is the probability function of the hypergeometric distribution, defined for $0 \le s - j \le m_{k+1}$ ($= M_{k+1:k+1}$) and $0 \le N - s \le M - M_{a:k} - m_{k+1}$ as
$$h_k(s, j) = \frac{\binom{m_{k+1}}{s-j}\binom{M - M_{a:k} - m_{k+1}}{N - s}}{\binom{M - M_{a:k}}{N - j}}$$
and probability zero otherwise. This result can now be used iteratively for $k = a, \dots, b-2$ to find $F_{a:b}(t)$ and, finally, produces the p-value $1 - F_{a:b}(t_{\mathrm{obs}}) = P(T^{\max}_{a:b} \ge t_{\mathrm{obs}})$. Besides the exact approach, several approximating distributions exist. In case of large sample sizes, asymptotic procedures may be preferred since exact methods may be time-consuming in their calculation. We will use the Brownian bridge approximation for maximally selected $\chi^2$ statistics by Miller and Siegmund (1982) as a comparator. The approximating probability calculates as
$$P\Big(\max_k \sqrt{\chi^2_k} \ge c\Big) \approx \varphi(c)\Big(c - \frac{1}{c}\Big)\log\frac{t_2(1 - t_1)}{t_1(1 - t_2)} + \frac{4\varphi(c)}{c},$$
where $\varphi$ denotes the standard normal density and $t_1 < t_2$ delimit the searched proportion of the sequence.
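The iterative procedure can be sketched in code. The following is a minimal illustrative implementation (not the authors' software) that uses the log likelihood ratio as the underlying statistic, propagates the partial sums through the hypergeometric transitions, and applies a small numerical tolerance so that floating-point ties of the discrete statistic are treated as equal:

```python
from math import comb, log

def xlogx(x):
    # x * log(x) with the convention 0 * log(0) = 0
    return x * log(x) if x > 0 else 0.0

def llr(s, Mk, N, M):
    # log likelihood ratio for splitting N events in M trials into
    # (s of Mk) on the left and (N - s of M - Mk) on the right
    return (xlogx(s) + xlogx(Mk - s) - xlogx(Mk)
            + xlogx(N - s) + xlogx(M - Mk - N + s) - xlogx(M - Mk)
            + xlogx(M) - xlogx(N) - xlogx(M - N))

def worsley_pvalue(events, sizes):
    """Exact p-value of the maximally selected LLR statistic, conditional
    on the total number of events: a forward recursion over the
    multivariate hypergeometric distribution of the partial sums."""
    n, N, M = len(events), sum(events), sum(sizes)
    cum_m = [0]
    for m in sizes:
        cum_m.append(cum_m[-1] + m)
    s, t_obs = 0, 0.0
    for k in range(1, n):
        s += events[k - 1]
        t_obs = max(t_obs, llr(s, cum_m[k], N, M))
    # f[s] = P(S_k = s and T_j below t_obs for all j < k), given N
    f = {0: 1.0}
    for k in range(1, n + 1):
        m_k, rem = sizes[k - 1], M - cum_m[k - 1]
        g = {}
        for s, pr in f.items():
            for y in range(0, min(m_k, N - s) + 1):
                if (N - s - y) > rem - m_k:
                    continue  # not enough trials left for the remaining events
                w = comb(m_k, y) * comb(rem - m_k, N - s - y) / comb(rem, N - s)
                g[s + y] = g.get(s + y, 0.0) + pr * w
        if k < n:  # candidate change point: keep paths with T_k below t_obs
            g = {s: pr for s, pr in g.items()
                 if llr(s, cum_m[k], N, M) < t_obs - 1e-9}
        f = g
    return 1.0 - f.get(N, 0.0)
```

For unit bin sizes, the result can be verified against full enumeration of all binary sequences with the same number of events, each of which is equally likely under $H_0$ conditional on $N$.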
The term $t_1$ hereby represents the proportion of trials up to the first considered change point. Since we have binomial data, the first bin should contain all the observations for which we are not interested in seeking a change point (or maximizing our test statistics). In applications, this is often used to shift the change point detection mildly away from the tails of the sequence. Regarding the Brownian bridge approximation, this is, however, a crucial parameter. In Bernoulli sequences, we restricted our search to the central 90% of the sequence as, for example, in Friede et al. (2006).
With the described procedure by Worsley, one can calculate the exact size of a change point test using any of these statistics. For all $N$ and all bin sizes, we can calculate the probability of $T^{\max}_{a:b}$ exceeding a certain threshold, that is, the exact 95% or 99% quantile conditional on $N$, or a quantile from an (asymptotic) approximation. With $N$ being drawn from a binomial distribution under $H_0$, we can assess to what extent the nominal significance level is actually used. For Bernoulli sequences of small sample sizes, we see in Figure 2 a decline in size that relates to a loss in statistical power. The various test statistics lead to very similar exact sizes, with no clear favorite. The Brownian bridge approximation is mostly conservative for our choices of parameters, and its type I error probability converges only slowly to the nominal level of 5%. Here, it is only applied to the log likelihood ratio statistic and the Pearson statistic, since those are "maximally selected $\chi^2$" distributed. The approximation was developed by Miller and Siegmund (1982) for the latter.
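To make the size computation concrete, the following enumeration sketch (hypothetical code, feasible for small $n$ only) computes the exact size of the conditional test for a simple unstandardized CUSUM statistic: for each value of $N$, the conditional p-value of every sequence is obtained by counting equally likely sequences with at least as large a maximum, and the rejection probabilities are then mixed over the binomial distribution of $N$.

```python
from math import comb
from itertools import combinations

def max_cusum(seq):
    # unstandardized CUSUM deviation |S_k - k * N / n| maximized over k
    n, N = len(seq), sum(seq)
    s, t = 0, 0.0
    for k in range(1, n):
        s += seq[k - 1]
        t = max(t, abs(s - k * N / n))
    return t

def exact_size(n, p, alpha=0.05):
    """Exact size of the change point test with conditional (on N) exact
    p-values, by full enumeration of all Bernoulli sequences of length n."""
    size = 0.0
    for N in range(1, n):                       # N = 0 or n: p-value is 1
        stats = []
        for ones in combinations(range(n), N):  # all sequences with N events
            seq = [1 if i in ones else 0 for i in range(n)]
            stats.append(max_cusum(seq))
        total = comb(n, N)
        # reject iff the conditional p-value #{u >= t} / total is <= alpha
        rejected = sum(
            1 for t in stats
            if sum(1 for u in stats if u >= t - 1e-12) <= alpha * total)
        size += comb(n, N) * p ** N * (1 - p) ** (n - N) * rejected / total
    return size
```

The resulting size stays strictly below the nominal level, illustrating the conservativeness caused by the discreteness of the test statistic.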

BINARY SEGMENTATION
Binary segmentation is an iterative procedure to hierarchically split sequences (Scott & Knott, 1974), usually applied in order to detect multiple change points. Initially, the entire data set is searched for one change point. Once a change point is detected, the data are split into two subsegments: one to the left and one to the right of the detected change point. Subsequent change point detections are then performed on either subsegment, possibly resulting in further splits. This iterative procedure continues until a stopping criterion is met, for example, until significance cannot be achieved at a prespecified level. A plethora of criteria when to split and when to stop has been suggested. The choice has complex implications for consistency. The general trade-off to be made was described by Scott and Knott (1974): "Choosing an appropriate value [of the splitting threshold] is difficult. If [it] is too small, the splitting process will terminate too soon, while if [it] is too large, the process will go too far and split homogeneous sets of means." Vostrikova (1981) showed consistency of binary segmentation for the number and locations of change points, with rates of convergence of the location estimators, under certain technical conditions. For Gaussian processes, Venkatraman (1992) relaxed these conditions on the number and locations of the change points. Furthermore, a simulation study was done to assess the performance of various multiple change point detection methods specifically in small samples, demonstrating real-world applicability. The theory is outlined and discussed for nonnormal cases as well.
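The recursion above can be summarized in a compact sketch of such a "standard" binary segmentation. This is hypothetical illustration code, not the paper's procedure: it uses an asymptotic pooled z-test in place of the exact statistics, stops when no significant split is found or a segment is uninformative, and resolves tied maxima by the leftmost rule.

```python
from math import sqrt, erfc

def z_pvalue(events, k):
    """Two-sided pooled z-test p-value for a split after position k
    (normal approximation; a stand-in for any of the exact statistics)."""
    N, M = sum(events), len(events)
    s = sum(events[:k])
    p0 = N / M
    se = sqrt(p0 * (1 - p0) * (1 / k + 1 / (M - k)))
    z = (s / k - (N - s) / (M - k)) / se
    return erfc(abs(z) / sqrt(2))

def binary_segmentation(events, alpha=0.05, offset=0):
    """Standard binary segmentation: split at the best candidate change
    point while the test is significant, then recurse on both halves.
    Returns the detected change point positions in the full sequence."""
    M, N = len(events), sum(events)
    if M < 2 or N == 0 or N == M:      # uninformative segment: stop
        return []
    pvals = [(z_pvalue(events, k), k) for k in range(1, M)]
    pmin, khat = min(pvals)            # leftmost minimum on ties
    if pmin >= alpha:
        return []
    return (binary_segmentation(events[:khat], alpha, offset)
            + [offset + khat]
            + binary_segmentation(events[khat:], alpha, offset + khat))
```

On a sequence of ten non-events followed by ten events, the sketch splits once at position 10; on a homogeneous alternating sequence, it detects no change point.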
Usual descriptions of the method do not consider that the distribution of the maximized statistics in any subsegment is conditional on all previously found change points. For example, Scott and Knott (1974, p. 509) write: "This starts with the best split into two groups, based on the between groups sum of squares, and then applies the same procedure separately to each subgroup in turn. The subdivision process is continued until the resulting groups are judged to be homogeneous."
Fryzlewicz (2014) gives an algorithm for "standard" binary segmentation in pseudocode, where the same segmentation step is recursively called at each splitting of the data into a left and a right subinterval, without any constraints implied through previous steps. To the best of our knowledge, authors have so far not considered any impact of these constraints throughout the repeated splitting steps, since consistency follows from asymptotic results (Vostrikova, 1981) as well as from simulation studies. Here, we condition on preceding steps and provide exact distributions in all steps throughout the segmentation procedure, taking into account that subsequences are no longer completely random under $H_0$ (apart from the initial step). In the context of developing asymptotic results for multiple change point segmentations of the data, this aspect may be negligible and is proven to be asymptotically correct under some regularity constraints. Therefore, failure to consider this aspect during subsequent steps of the procedure does not question the validity or the usefulness of such a "standard binary segmentation", as we will refer to it in the following. For our purpose, however, we need rigorous and exact distributions at all steps. We will show that the usage of binary segmentation steps without considering the conditional distribution given the (pseudo) change points found beforehand can lead to conservative but also to liberal results in some scenarios.

EXACT NULL DISTRIBUTIONS UNDER SIDE CONDITIONS
In this section, we extend the iterative procedure by Worsley (1983) for calculating the distribution of $T^{\max}_{a:b}$ when side conditions are present. Let $d$ be the number of side conditions, which are denoted as $\mathcal{C}_j$, $j = 1, \dots, d$; for their intersections, the notation $\bigcap_{j=1}^{d} \mathcal{C}_j$ is used (to distinguish them from intersections $\bigcap_k A_k$ representing possible runs within the sequence). The conditions $\mathcal{C}_j$ shall be measurable with regard to $S_{a:k}$ for $k = a, \dots, b$. Similarly to the definition of the $A_k$ in Section 3, we define sets of the form $B_k = \{l'_k \le S_{a:k} \le u'_k\}$ with $l'_k = \inf\{s : P(S_{a:k} = s \mid \bigcap_j \mathcal{C}_j) > 0\}$ and $u'_k = \sup\{s : P(S_{a:k} = s \mid \bigcap_j \mathcal{C}_j) > 0\}$. We restrict the $B_k$ to be intervals, as we did for the $A_k$. Intersections $A_k \cap B_k$ for any $k$ are thus also intervals. This regularity condition should always hold whenever the $A_k$ and $B_k$ arise from meaningful test statistics. The assumption could eventually be dropped, but the notation and implementation would become more complex.
Side conditions $\mathcal{C}_j$ naturally arise whenever tests were carried out (in a hierarchical fashion) beforehand within the binomial sequence. We can regard $\mathcal{C}_j$ as all information received in previous steps to condition on. When steps of a binary segmentation are done sequentially, the conditioning should account for the initial change point test on the sequence $1, \dots, n$ with a test statistic (possibly different from $T_{1:n}$). The indices in $\mathcal{C}_j$ will also reflect the subsequence $a'\!:\!b'$ (with $1 \le a' \le a$, $b \le b' \le n$) of the $(d - j + 1)$th step in the binary segmentation procedure. Each condition fixes the attained maximum value of the test statistic at the observed $t_{\max} = \max_k \{T(S_{1:k}, M_{1:k}, S_{k+1:n}, M_{k+1:n})\}$ at the position of the detected change point. Considering the number of events as random within the subsequence $k = a, \dots, b$, the initial step $j = d$ will impose restrictions of the form $\{T(k) \le t_{\max} \text{ for all } k\}$. We investigate the distribution of $S_{a:k} \mid \mathcal{C}_j$ conditional on fixed $S_{a:b}$, the number of events in the subsequence, as Worsley's method does. Then all $\{X_i : i = 1, \dots, a-1, b+1, \dots, n\}$ outside of the considered subsequence are fixed (thus also $S_{1:a-1}$ and $S_{b+1:n}$). Hence, the statistic is random only through $S_{a:k}$; denote it $T(S_{a:k})$. For the hypothetical binary segmentation procedure, attention must be paid to the decision rule where to split when a maximum is not unique, that is, when it is attained at multiple possible change points $k$. A variety of such rulings can be considered, from preferring an early or late change point to splitting the sequence directly into multiple subsegments. The ruling we chose picks the change point that is most to the left, which corresponds to the earliest change point if the ordering is by time. Any decision rule used will impact the $B_k$, since the side conditions might allow the case of equality $T(S_{a:k}) = t_{\max}$. When the decision rule is to take the left change point in step $d - j + 1$, this forbids the case of equality only on the left subsequence but allows further maxima to be attained on the right subsequence in the following steps.
Let $e_k$ be an indicator function that is 1 when equality is allowed and 0 otherwise. We then define, in the case of equality ($e_k = 1$), $B_k = \{s : T(s) \le t_{\max}\}$, and in the case of inequality as above with strict inequality. We now want to calculate the probability $P(T^{\max}_{a:b} < t \mid N, \bigcap_{j=1}^{d} \mathcal{C}_j)$ under $H_0$, conditional on the fixed, observed number of events of our sequence and conditional on the additional restrictions $\{\mathcal{C}_j\}_{j=1}^{d}$ arising through the hierarchical steps in binary segmentation. Similarly to Section 3, we define the probability of a partial maximum $T^{\max}_{a:k}$ not exceeding $t$ by the corresponding recursion over the intervals $A_k \cap B_k$.

Exact binary segmentation steps
The rigorous derivation of the conditional distributions allows the realization of exact distributions of the steps in a (hypothetical) binary segmentation procedure. Since binary segmentation usually refers to a multiple change point detection method, we use the term exact binary segmentation steps, as we do not discuss definitions of a stopping criterion, which would define such a procedure; see, for example, Vostrikova (1981), Venkatraman (1992), and Fryzlewicz (2014). Conversely, we refer to binary segmentation using unconditional distributions as standard binary segmentation steps, since this procedure relies on asymptotic results. Although multiple constraints arise through change points found beforehand, the method remains a one-dimensional optimization problem in the search for further possible change points. With the number of side conditions increasing only with the depth of the exact binary segmentation steps, it stays a greedy procedure that is solvable in polynomial time, whereas approaches that rely on all possible $2^n$ combinations of the input sequence are only solvable via simulation techniques, as suggested, for example, by Ross et al. (2013).

PROPOSED TEST UTILIZING AN ORDERING OF SEQUENCES
Let $x$ be an actual instance of a sequence of binomial variables on which we want to investigate a (single) change point test or some other maximally selected statistic. The sequence is defined by its bin sizes and events $\{m_i\}(x)$, $\{X_i\}(x)$, and the attained maximum of the test statistic is $t_{\max}(x)$, whose distribution we derived conditional on $N(x)$. A randomized p-value would be achieved with a uniform variable $U \sim U[0, 1]$ on the unit interval by
$$p_{\mathrm{rand}} = P(T^{\max} > t_{\max}(x)) + U \cdot P(T^{\max} = t_{\max}(x)).$$
Randomization yields full size and forms uniformly most powerful test statistics. When testing a change point, we still have unused information in the sequences. The sequences form a natural order regarding the likelihood of further separability in a hypothetical binary segmentation procedure. Fully conditional on the initial change point test, we can use the results in Section 5 to determine p-values of further change points as a "secondary" dimension. The sequence is split at the initial change point $\hat\tau$ into a left subsequence from $a_{\mathrm{left}} = 1$ to $b_{\mathrm{left}} = \hat\tau$ and a right subsequence from $a_{\mathrm{right}} = \hat\tau + 1$ to $b_{\mathrm{right}} = n$. Theorem 5.1 is used to determine p-values $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$ conditional on the initially estimated change point $\hat\tau$. To pool $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$, a combination function $C(\cdot, \cdot)$ for p-values will be considered. This approach is used in adaptive clinical trials (Brannath, Posch, & Bauer, 2002) but also in meta-analysis. Fisher's product test (1932) is one possibility to combine $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$, with the function
$$C(p_{\mathrm{left}}, p_{\mathrm{right}}) = p_{\mathrm{left}} \cdot p_{\mathrm{right}}.$$
Another popular approach is the inverse normal method (Lehmacher & Wassmer, 1999),
$$C(p_{\mathrm{left}}, p_{\mathrm{right}}) = 1 - \Phi\big(w_1 \Phi^{-1}(1 - p_{\mathrm{left}}) + w_2 \Phi^{-1}(1 - p_{\mathrm{right}})\big),$$
also called Stouffer's method (1949), with possible weights $0 \le w_1, w_2 < 1$ and $w_1^2 + w_2^2 = 1$. Many other combination functions have been proposed, some also specifically for discrete p-values. Kincaid (1962) compares methods to pool discrete p-values. Still, these methods are not easily adopted to the setting considered here, since they need full derivations of the exact discrete distributions (which depend on the sequence).
The calculations required would be of exponential order and are therefore not considered further. If $C(p_{\mathrm{left}}, p_{\mathrm{right}})$ is not already a valid pooled p-value, it is defined via the distribution of the combination under $H_0$. It is guaranteed that under $H_0$ the pooled value is a valid p-value, $P(C(p_{\mathrm{left}}, p_{\mathrm{right}}) \le u \mid N) \le F_{U[0,1]}(u)$, since $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$ are conditionally independent. For our purposes, we will use Fisher's product test in the following, since it is better suited for p-values that are not continuous. For instance, p-values need to be truly smaller than 1 for the inverse normal method; besides the numerical instability, the strong impact of p-values close to 1 would lead to less homogeneous pooling. Furthermore, we only pool the left and the right p-values if both subsequences are informative. This way, subsequences with no events or the maximal number of events are not considered further and receive weight zero. If both subsequences are uninformative, the pooled p-value is set to 1. With the known distribution for Fisher's product test, we obtain the pooled p-value
$$1 - F_{\chi^2_4}\big(-2(\log p_{\mathrm{left}} + \log p_{\mathrm{right}})\big),$$
which is again a valid p-value, since $C(p_{\mathrm{left}}, p_{\mathrm{right}})$ is determined conditional on the initial change point test, including $t_{\max}$, and is thus independent of it. Thus, by construction, the test keeps prespecified significance levels and is at least as powerful as Worsley's exact test, since the new test is less discrete and its p-value never exceeds Worsley's. The described test operates in depth 1, but can easily be extended by applying the same approach recursively to the left and right subsequences. This hierarchical procedure resembles binary segmentation and forms some sort of "segmentation p-value" rather than a single change point p-value. The new test uses an exchange of information through different depths to make the test at a given depth more precise. Even when further segmentation is not of any interest, the approach is natural, since it favors sequences with a sharp change in the empirical frequency of events.
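For the Fisher combination, the pooled p-value has a simple closed form, because $-2\log(p_{\mathrm{left}}\, p_{\mathrm{right}})$ follows a $\chi^2$ distribution with 4 degrees of freedom under independence. A small sketch (hypothetical helper name), valid for p-values in $(0, 1]$:

```python
from math import log, exp

def fisher_combine(p_left, p_right):
    """Fisher's product combination of two independent p-values:
    under H0, q = -2 * log(p_left * p_right) is chi-squared with 4 df,
    whose survival function has the closed form exp(-q/2) * (1 + q/2)."""
    q = -2.0 * (log(p_left) + log(p_right))
    return exp(-q / 2.0) * (1.0 + q / 2.0)
```

The paper's rule of assigning weight zero to an uninformative side, and of setting the pooled p-value to 1 when both sides are uninformative, would be handled outside this helper.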
In order to focus on such local sharp changes, it may also be advisable to change the underlying test statistic from two-sided to one-sided, such that the subsequence on the side of the change point with a low event frequency is searched for a frequency that increases with distance from the change point. Conversely, the subsequence on the side with a high frequency is searched with the one-sided test in the opposite direction of a decreasing frequency. We refer to this procedure in the following as swapped one-sided alternatives. In our definition of the test, we do not provide a specific stopping criterion regarding the segmentation other than the prespecified depth, or more precisely, the forced stop whenever the resulting subsequence consists exclusively of events or nonevents. We point out that the stopping criteria defined this way were chosen with the intent to obtain a less discrete test. They are neither valid nor sensible criteria for detecting multiple change points. The latter is a separate setting, for which we refer to, for example, Scott and Knott (1974) and Vostrikova (1981); only a rigorous stopping criterion will yield a binary segmentation procedure in the original sense. A variety of other methods for (direct) detection of multiple change points exist, many of them proven to provide better results than binary segmentation procedures in certain multiple change point applications (see, e.g., Frick, Munk, & Sieling, 2014; Zou, Yin, Feng, & Wang, 2014). We simply use the idea of a segmentation procedure (without a stopping criterion) to obtain less discrete test statistics. The derived exact conditional distributions, however, can be used to evaluate any given segmentation procedure (with well-defined splitting and stopping criteria) that is based on Worsley's test.

NUMERICAL STUDIES
In the following, we will explore the properties of the proposed methods by means of a simulation study and by application to the motivating examples introduced in Section 2. To assess the size of the new test, a simulation study on randomly generated Bernoulli sequences under $H_0$ was done. Figure 3 shows the results for various depths in the creation of the sequential ordering. A higher depth always leads to a less discrete and therefore smaller p-value and thus a higher size of the test. The log likelihood ratio test and the cumulative sum test were used as underlying test statistics, as proposed in Worsley (1983). Also, Fisher's exact test is used, with the two-sided version for the initial test but with swapped one-sided versions, as described in Section 6, for higher depths. All test statistics show about equally large size irrespective of the depth, the nominal level (1% or 5%), and the true event probability ($p = .3$ or $p = .5$). The lowest line shows the performance of the original Worsley's test, which can be regarded as having depth zero. The new approach leads to a strong increase in size already at depth 1. Even for sample sizes up to 25, this increase can be above one-fifth of the nominal level. Only when the event probability is very small (or high) does the effect diminish, since many randomly generated data sets become rather trivial. When the search depth is further increased, the size improves slightly for depth 2 and very little for depth 3. Search depths beyond three only very occasionally undiscretize a p-value, and then to an almost unnoticeable extent. Therefore, the gain in size beyond depth 3 is close to zero. Exact calculations of the size (as displayed in Figure 2) are no longer feasible for depths greater than zero, since the number of distinguishable sequences is of exponential order ($2^n$).
When the p-value is obtained by steps of standard binary segmentation, the statistical test becomes predominantly liberal in the case of swapped one-sided alternatives, whereas otherwise it is often conservative (see Figures 3-5).

Simulation studies
The power of the test statistics is displayed in Figures 6 and 7. The gain in power depends strongly on the simulation scenario and the test statistic used. Scenarios whose alternatives consist of a single change point are displayed in the top row. In the bottom row, alternatives with two change points were used; here the new segmentation-based method can benefit even more when rejecting the null hypothesis of no change point. The gain in power can be substantial, as shown for the log likelihood ratio statistic, which has large statistical power in the tails of the sequence, as well as for the cumulative sum statistic, which has large statistical power in the center of the sequence.

Motivating examples revisited
In the clinical data example concerning the likelihood of pin site infections in orthopedic surgeries, the estimated change point is located before observation 12, that is, before the introduction of the new procedure after observation 17, as displayed in Panel A of Figure 1. Worsley's test gives a p-value of p0 = .0707 for the log likelihood ratio statistic and p0 = .0819 for the cumulative sum statistic. With the new approach of depth 3, the respective p-values are undiscretized to p3 = .0565 and p3 = .0702. The data indicate no change at the time point when the new procedure was introduced, but rather suggest multiple changes in care prior to observation 17. It is plausible that the new procedure in care was in part tested and used beforehand. Also, the preference for hospital-based (and community-practice) versus ambulatory pin site care changed in this period.
In the publication bias example based on data from Zou et al. (2018), when using the likelihood ratio statistic we found an estimated change point between observation 10, the drug Viibryd approved on January 21, 2011, and observation 11, the drug Aubagio approved on September 12, 2012, as displayed in Panel B of Figure 1. In contrast, the cumulative sum test statistic attains its maximum after observation 7. We found only a small gain regarding the coarseness of the p-value, from p0 = .234 to p3 = .216, for the two-sided likelihood ratio test statistic. However, if the one-sided likelihood ratio test is used to detect an increasing event probability, the new test achieves a one-sided p-value of p3 = .083, while Worsley's one-sided p-value is p0 = .117. A delay of the change point relative to the FDAAA in 2007 is not implausible, because the drug approval process usually comprises multiple trials, so a delay of over 3 years is likely. In particular, negative findings in the development process will possibly lead to a longer delay.
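The estimated change points reported above are the maximizers of the underlying statistic over all candidate splits. As a minimal sketch (helper names are ours), the maximally selected log likelihood ratio for a 0/1 sequence can be computed as follows:

```python
from math import log

def loglik(e, n):
    """Binomial log likelihood at the MLE p = e/n, with 0*log(0) := 0."""
    if e == 0 or e == n:
        return 0.0
    p = e / n
    return e * log(p) + (n - e) * log(1 - p)

def max_lrt(x):
    """Return (k_hat, statistic): the split k maximizing the log
    likelihood ratio for a change before index k (the classical
    maximally selected statistic; p-value computation not included)."""
    n, m = len(x), sum(x)
    base = loglik(m, n)  # log likelihood under no change
    best_k, best = None, float("-inf")
    for k in range(1, n):
        e1 = sum(x[:k])
        stat = loglik(e1, k) + loglik(m - e1, n - k) - base
        if stat > best:
            best_k, best = k, stat
    return best_k, best
```

The statistic alone only locates the candidate change point; for valid p-values, Worsley's exact method or the proposed exact segmentation procedure is required.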

DISCUSSION
In this paper, we extended the proposal by Worsley by considering a sequential ordering to augment test statistics that compare "before" versus "after" by means of 2 × 2 contingency tables. The ordering we defined originates from binary segmentation procedures, and to obtain an exact test, we first needed to derive the exact null distributions of the single steps of such procedures. With the "standard" approach, which does not account for the conditional distributions, the type I error can be inflated. With the derived exact binary segmentation steps, however, a new test could be defined that is often able to attain a statistical power much closer to that of the randomized version of Worsley's test. Another promising application of the described exact methods is in building decision or regression trees with binomial outcomes. When selecting input features, different variables repeatedly compete to be best suited to partition the predictor space into various strata. Here, the p-value can serve as a selection criterion, and the absence of statistical significance subsequently as a possible stopping criterion. In this context, the developed exact methods for binary segmentation steps are promising because they are rigorous. First, current methods, just like standard binary segmentation steps, do not adjust for data splits (referred to as internal nodes) that have taken place in advance; exact methods would increase the validity and objectivity of the procedure. Second, when the explanatory variables are split repeatedly to create a multitude of strata (so-called tree branches), the sample sizes naturally become small, so increases in power similar to those of the test developed in Section 6 would be desirable. The simultaneous handling of many covariates in building decision trees is, however, not straightforward and will require some assumptions regarding their dependence structure.

ACKNOWLEDGMENTS
We thank Constance Zou, Joseph Ross, and colleagues for providing us with the data from their publication Zou et al. (2018) and fruitful discussions on the topic. Furthermore, we thank the Reviewers for their suggestions leading to an improved manuscript.
BL acknowledges support from grant number ES/L011859/1, from The Business and Local Government Data Research Centre, funded by the Economic and Social Research Council, UK, to provide researchers and analysts with secure data services.
Open access funding enabled and organized by Projekt DEAL.

CONFLICT OF INTEREST
The authors have declared no conflict of interest.

OPEN RESEARCH BADGES
This article has earned an Open Data badge for making publicly available the digitally shareable data necessary to reproduce the reported results. The data are available in the Supporting Information section. This article has also earned a "Reproducible Research" badge for making publicly available the code necessary to reproduce the reported results. The reported results were only partially reproduced, owing to their computational complexity.