Relating and comparing methods for detecting changes in mean

In recent years, there have been a large number of proposed approaches to detecting changes in mean. A natural question for an analyst is which method is most appropriate for their applications. Answering this question is difficult because current empirical studies often give conflicting conclusions. This paper aims to show the similarities and differences between different changepoint methods. We highlight that there are two aspects to estimating changepoints: estimating the number of changes and estimating their locations, and that comparisons should separately evaluate these two aspects. We perform an extensive comparison of different methods across a range of simulation scenarios and provide code and full results for an interested practitioner to extend this comparison to more methods or different scenarios.

Most existing methods come with theoretical guarantees, which have motivated default threshold values. However, this theory, and the validity of the default thresholds, is often based on strong distributional assumptions for the data. In particular, defaults are based on assuming that the noise in the data is independent, identically distributed (IID), and Gaussian or sub-Gaussian. Our results show that when this assumption is true, all methods perform reasonably well in terms of estimating the number and position of the changepoints. However, all methods are sensitive to violations of this assumption, and the presence of heavier tailed noise and/or positive auto-correlation in the noise can lead to over-estimation of the number of changepoints. Methods that compare the fit to the data across a range of threshold values seem to be more reliable across a range of scenarios.

DETECTING A SINGLE CHANGE
To understand the ideas behind various multiple changepoint methods, and their similarities and differences, it is helpful to first consider the simpler problem of detecting a single change. We focus on detecting changes in mean of univariate data, and so we assume we have data $y_{1:n} = (y_1, \ldots, y_n)$, with $y_t = \mu_t + \epsilon_t$ for $t = 1, \ldots, n$. Here, $\mu_t$ is the mean of the $t$th data point, and $\epsilon_{1:n}$ is a zero-mean error process. We wish to compare two models for the data: (M0) there is no changepoint, so $\mu_t = \mu_0$ for $t = 1, \ldots, n$; (M1) there is a single changepoint at some time $\tau$, so $\mu_t = \mu_1$ for $t \le \tau$ and $\mu_t = \mu_2$ for $t > \tau$.
If we decide there is a changepoint, we would then wish to estimate its location, $\tau$.
The most common approach to this problem is to model the noise process as consisting of IID Gaussian random variables with variance $\sigma^2$ and then perform a likelihood ratio test between the two models. If we define
$$\mathcal{C}(y; \mu, \sigma) = \frac{1}{\sigma^2}(y - \mu)^2 + \log \sigma^2,$$
to be minus twice the log-likelihood, up to an additive constant, for an observation, $y$, under our Gaussian model with mean $\mu$ and variance $\sigma^2$, then the log-likelihood ratio statistic for testing between M0 and M1 if $\sigma^2$ is known is
$$\mathrm{LR}_1 = \sum_{t=1}^{n} \mathcal{C}(y_t; \bar{y}_{1:n}, \sigma) - \min_{\tau} \left\{ \sum_{t=1}^{\tau} \mathcal{C}(y_t; \bar{y}_{1:\tau}, \sigma) + \sum_{t=\tau+1}^{n} \mathcal{C}(y_t; \bar{y}_{\tau+1:n}, \sigma) \right\} = \max_{\tau} C_\tau^2,$$
where $\bar{y}_{s:t}$ denotes the sample mean of $y_{s:t}$. As $\mathcal{C}$ is defined in terms of the negative log-likelihood, we minimize it, which is equivalent to maximizing the log-likelihood. The last equality comes from straightforward algebraic manipulation and motivates the cusum statistic
$$C_\tau = \sqrt{\frac{\tau(n-\tau)}{n}} \, \frac{\left| \bar{y}_{1:\tau} - \bar{y}_{\tau+1:n} \right|}{\sigma},$$
which is the natural estimate of the absolute change in mean of the data before and after $\tau$, scaled to have a variance that does not depend on $\tau$ or $n$. The likelihood ratio statistic, $\mathrm{LR}_1$, is monotonically increasing in $\max_\tau C_\tau$.
If we perform the likelihood ratio test under the assumption that $\sigma$ is unknown, straightforward calculation gives the test statistic
$$\mathrm{LR}_2 = n \log\left( \frac{\sum_{t=1}^{n} (y_t - \bar{y}_{1:n})^2}{\min_{\tau} \left\{ \sum_{t=1}^{\tau} (y_t - \bar{y}_{1:\tau})^2 + \sum_{t=\tau+1}^{n} (y_t - \bar{y}_{\tau+1:n})^2 \right\}} \right).$$
Since the numerator exceeds the denominator by $\sigma^2 \max_\tau C_\tau^2$, this test statistic is also monotonically increasing in $\max_\tau C_\tau$.
To perform a test, we then need to specify a threshold, $\lambda$ say, such that we detect a change if, for example, $\mathrm{LR}_1 > \lambda$. If we detect a change, the natural estimator of the location of the change is the arg-max over $\tau$ within the calculation of the likelihood ratio statistic or, equivalently, the value of $\tau$ for which $C_\tau$ is largest. As both test statistics are monotonic in $\max_\tau C_\tau$, the form of the test will be the same regardless of whether we assume $\sigma^2$ is known or not. Furthermore, both of these tests are equivalent to a test based on the cusum statistic, that is, one that detects a changepoint if $\max_\tau C_\tau$ is larger than some pre-specified threshold. It is also clear that there is a correspondence between thresholds such that a test based on $\mathrm{LR}_1$ will be identical to one based on $\mathrm{LR}_2$ and also to one based on $\max_\tau C_\tau$.
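To make the test concrete, here is a minimal sketch of the cusum computation and threshold test (our own illustrative code, not from the original paper; it assumes IID Gaussian noise with known $\sigma$, and the function names and threshold value are ours):

```python
import numpy as np

def cusum(y, sigma=1.0):
    """Cusum statistic C_tau = sqrt(tau*(n - tau)/n) * |mean(y[:tau]) - mean(y[tau:])| / sigma
    for tau = 1, ..., n-1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    taus = np.arange(1, n)
    cum = np.cumsum(y)
    mean_left = cum[:-1] / taus                      # mean of y_{1:tau}
    mean_right = (cum[-1] - cum[:-1]) / (n - taus)   # mean of y_{tau+1:n}
    return np.sqrt(taus * (n - taus) / n) * np.abs(mean_left - mean_right) / sigma

def detect_single_change(y, threshold, sigma=1.0):
    """Return the arg-max location of C_tau if max C_tau exceeds the threshold, else None."""
    c = cusum(y, sigma)
    tau_hat = int(np.argmax(c)) + 1  # c[0] corresponds to tau = 1
    return tau_hat if c[tau_hat - 1] > threshold else None
```

Because both likelihood ratio statistics are monotone in the maximum cusum statistic, thresholding the cusum statistic directly, as here, is equivalent to either likelihood ratio test for a suitably transformed threshold.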
To specify an appropriate threshold for one of these tests requires knowledge of the distribution of the test statistic under the null assumption of no changepoint. This distribution will depend on the distribution of the noise variables. The likelihood ratio test is non-regular, and so asymptotically does not have a chi-squared distribution under the null, even if the noise variables are IID Gaussians. Instead, under the Gaussian assumption the null distribution can be related to the maximum of a scaled Brownian bridge process (e.g., Hinkley, 1971). Importantly, these distributions may be inaccurate in practice if either of the Gaussian or independence assumptions do not hold.

DETECTING MULTIPLE CHANGES
We now focus on the more challenging problem of detecting multiple changes in mean. As before, we have data $y_{1:n}$ with $y_t = \mu_t + \epsilon_t$ for $t = 1, \ldots, n$. Let $m^*$ be the, unknown, number of changepoints and $\tau^*_{1:m^*}$ their locations. If we further define $\tau^*_0 = 0$ and $\tau^*_{m^*+1} = n$, then we can define $m^* + 1$ segments, with the $j$th segment containing observations $y_{\tau^*_j + 1 : \tau^*_{j+1}}$ for $j = 0, \ldots, m^*$. We let $\mu^*_j$ be the mean for data within this segment, so $\mu_t = \mu^*_j$ if and only if $\tau^*_j < t \le \tau^*_{j+1}$. We now overview a range of methods for estimating $m^*$ and $\tau^*_{1:m^*}$. Each of these can be viewed as a way of extending the approach, described above, for detecting a single change in mean to this multiple changepoint setting. To aid comparison, for some methods, we will state asymptotic theoretical properties. These will tend to relate to the performance of the method under in-fill asymptotics, that is, where we fix $m^*$, the proportion of the data within each segment, that is, $\tau^*_j / n$, and the segment means as $n \to \infty$. Unless stated otherwise, results will further assume the noise random variables are IID Gaussian. A method is consistent if its estimators $\hat{m}$ and $\hat{\tau}_{1:\hat{m}}$ satisfy $\hat{m} = m^*$ and $\max_j |\hat{\tau}_j - \tau^*_j| / n \to 0$ with probability tending to 1 as $n \to \infty$.

Binary segmentation and variants
Possibly the simplest way of extending a method for detecting a single changepoint to the multiple changepoint setting is to repeatedly apply such a method. We first apply the method to detect a single change in the data. If we detect one, we can split the data at the inferred changepoint location and apply the method separately to data before and after the change. This can be repeated each time we detect a change, until no further changepoints are found. This approach is called binary segmentation (Scott & Knott, 1974).
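The recursion just described can be sketched in a few lines. The following is our own minimal illustration, not the implementation used in the comparisons below; the function names, the threshold, and the assumption of known $\sigma = 1$ are ours:

```python
import numpy as np

def cusum_stat(y, sigma=1.0):
    """Cusum statistic for each candidate split tau = 1, ..., n-1."""
    n = len(y)
    cum = np.cumsum(y)
    taus = np.arange(1, n)
    left = cum[:-1] / taus
    right = (cum[-1] - cum[:-1]) / (n - taus)
    return np.sqrt(taus * (n - taus) / n) * np.abs(left - right) / sigma

def binary_segmentation(y, threshold, sigma=1.0, start=0):
    """Recursively split the data wherever the cusum test fires;
    returns sorted changepoint locations (index of the last point of a segment)."""
    y = np.asarray(y, dtype=float)
    if len(y) < 2:
        return []
    c = cusum_stat(y, sigma)
    tau = int(np.argmax(c)) + 1
    if c[tau - 1] <= threshold:
        return []  # no detectable change in this stretch of data
    return (binary_segmentation(y[:tau], threshold, sigma, start)
            + [start + tau]
            + binary_segmentation(y[tau:], threshold, sigma, start + tau))
```

Each recursive call sees a shorter stretch of data, so the procedure terminates once no sub-segment produces a cusum statistic above the threshold.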
Implementing binary segmentation requires specifying a threshold for our test statistic that we use for detecting a change. If we are willing to make assumptions on the distribution of the noise, this choice can be guided by the asymptotic distribution for the test of a single change.
However, even in this case, the choice of threshold is still complicated by the multiple tests that are being performed within binary segmentation, which makes it hard to relate the significance level of the test for a single change to inferential properties of the procedure when detecting multiple changepoints. There is theory showing that binary segmentation can consistently estimate the number of changepoints under in-fill asymptotics (Venkatraman, 1992;Fryzlewicz, 2014).
One issue with binary segmentation is that, because it adds changes one at a time, it can have low power to detect short segments in the data (Venkatraman, 1992; Olshen, Venkatraman, Lucito, & Wigler, 2004). This has led to a number of adaptations of binary segmentation (Olshen et al., 2004; Fryzlewicz, 2014; Baranowski, Chen, & Fryzlewicz, 2019; Anastasiou & Fryzlewicz, 2019). One of these is wild binary segmentation (Fryzlewicz, 2014). Wild binary segmentation randomly chooses $M$ regions of data, $y_{l_i:r_i}$ for $i = 1, \ldots, M$, with $l_i < r_i$. It then applies the cusum test for a change to each $y_{l_i:r_i}$ and records the maximum of the cusum statistic and the estimated changepoint location for each region. The regions are then ordered by the value of the test statistic and processed in that order. As we process the $i$th ordered region, we add the detected changepoint from the test for that region, provided the cusum test statistic is above our threshold and no previously detected changepoint lies within the $i$th region.
The rationale for this procedure is that, if M is large enough, then for each true changepoint there should be at least one of the M regions that contains it and no other changepoint. We hope that our test for detecting a single change will have sufficient power to detect that change within one of these regions. When processing the results of the test for the regions, we ignore those in which we have already detected a change to avoid potentially estimating multiple changes associated with just one true change. Extensions of wild binary segmentation include seeded binary segmentation (Kovács, Li, Bühlmann, & Munk, 2020) that uses a deterministic choice of regions of the data, Isolate-Detect (Anastasiou & Fryzlewicz, 2019) that chooses regions and searches for changes from the start or end of the data, and a recursive version (Fryzlewicz, 2018a) that uses information from estimated changepoint locations to help choose new regions.
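A minimal sketch of the procedure just described, under the same assumptions as before (known $\sigma = 1$, IID Gaussian noise); the number of intervals $M$, the threshold, and all names are illustrative choices of ours:

```python
import numpy as np

def cusum_stat(y, sigma=1.0):
    """Cusum statistic for each candidate split within a stretch of data."""
    n = len(y)
    cum = np.cumsum(y)
    taus = np.arange(1, n)
    left = cum[:-1] / taus
    right = (cum[-1] - cum[:-1]) / (n - taus)
    return np.sqrt(taus * (n - taus) / n) * np.abs(left - right) / sigma

def wild_binary_segmentation(y, threshold, M=200, sigma=1.0, seed=0):
    """Draw M random intervals, record the best cusum split in each, then accept
    splits in decreasing order of test statistic, skipping any interval that
    already contains an accepted changepoint."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    records = []
    for _ in range(M):
        l = rng.integers(0, n - 1)          # interval start
        r = rng.integers(l + 2, n + 1)      # interval end; needs at least 2 points
        c = cusum_stat(y[l:r], sigma)
        tau = int(np.argmax(c)) + 1
        records.append((c[tau - 1], l, r, l + tau))
    accepted = []
    for stat, l, r, cp in sorted(records, reverse=True):
        if stat <= threshold:
            break  # all remaining statistics are below the threshold
        if not any(l < a < r for a in accepted):  # region must be free of accepted changes
            accepted.append(cp)
    return sorted(accepted)
```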
One nice property of binary segmentation and wild binary segmentation is that, rather than pre-specifying a threshold for the test statistic, it is quick to consider how the estimated number and locations of changepoints would vary as we vary the threshold. We can then use other procedures to choose between our different estimates of the number and locations of the changes. In particular, if $\hat{\tau}^m_{1:m}$ denotes the estimated locations of the changepoints from, say, wild binary segmentation, with the threshold specified so that we estimate $m$ changepoints, then Fryzlewicz (2014) suggests estimating the number of changes, for some suitably specified $\zeta$, by
$$\hat{m} = \arg\min_m \left\{ \sum_{j=0}^{m} \sum_{t=\hat{\tau}^m_j + 1}^{\hat{\tau}^m_{j+1}} \mathcal{C}\left(y_t; \bar{y}_{\hat{\tau}^m_j + 1 : \hat{\tau}^m_{j+1}}, \sigma\right) + m\zeta \right\}, \qquad (1)$$
where $\mathcal{C}(y; \mu, \sigma)$ is minus twice the log-likelihood under a Gaussian model, and we have defined $\hat{\tau}^m_0 = 0$ and $\hat{\tau}^m_{m+1} = n$. This criterion is similar to a penalized log-likelihood or information criterion. Fryzlewicz (2014) shows that such a procedure will be consistent under in-fill asymptotics if $\zeta$ increases with $n$ faster than $\log n$ but slower than $n$. Specifically, Fryzlewicz (2014) suggests $\zeta = 2(\log n)^{1.01}$, which is almost identical to the choice of $2 \log n$ that would correspond to using a Schwarz/Bayesian information criterion. There are also other ways of post-processing the output from a binary segmentation based algorithm. For example, the Isolate-Detect method of Anastasiou and Fryzlewicz (2019) suggests using the locations of estimated changepoints to prune changepoint estimates that are too close to each other.
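This model-choice step, trading fit against a penalty $\zeta$ per changepoint, can be sketched as follows, assuming we already have candidate changepoints ranked by importance (for example, by their cusum statistics from wild binary segmentation); the helper names and the assumption $\sigma = 1$ are ours:

```python
import numpy as np

def gaussian_cost(y, cps, sigma=1.0):
    """Sum over segments of the Gaussian cost (y - segment mean)^2 / sigma^2
    + log sigma^2, i.e. twice the negative log-likelihood up to a constant."""
    y = np.asarray(y, dtype=float)
    bounds = [0] + sorted(cps) + [len(y)]
    cost = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = y[a:b]
        cost += np.sum((seg - seg.mean()) ** 2) / sigma**2 + len(seg) * np.log(sigma**2)
    return cost

def choose_m(y, ranked_cps, zeta):
    """Pick m minimising cost(first m candidate changes) + m * zeta,
    mimicking the criterion of Fryzlewicz (2014)."""
    scores = [gaussian_cost(y, ranked_cps[:m]) + m * zeta
              for m in range(len(ranked_cps) + 1)]
    return int(np.argmin(scores))
```

True changes reduce the cost by far more than $\zeta$, while spurious candidates typically reduce it by less, so the minimum of the criterion sits at the true number of changes.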
Binary segmentation can be viewed as a divisive clustering method, as it initially clusters all data together in a single segment and then recursively considers splitting segments. Fryzlewicz (2018b) considers an approach to estimating changepoints based on an agglomerative clustering approach: Initially, each data point is in its own segment, and these are recursively merged together to create fewer and larger clusters.
Such a method requires the choice of a threshold that determines when to stop merging segments. For in-fill asymptotics, Fryzlewicz (2018b) shows that for a suitable choice of threshold, the method will be consistent. The theory behind this consistency result uses the standard $2 \log n$ bound for the segment means for independent standard normal data (Yao, 1988), which also motivates the use of a $2 \log n$ penalty in the penalized cost methods we now introduce.

Penalized cost methods
An alternative way of extending the test for detecting a single changepoint to the multiple changepoint setting is by relating the test to the following optimization problem:
$$\min_{m, \tau_{1:m}} \left\{ \sum_{j=0}^{m} \sum_{t=\tau_j + 1}^{\tau_{j+1}} \mathcal{C}\left(y_t; \bar{y}_{\tau_j + 1 : \tau_{j+1}}, \sigma\right) + m\lambda \right\}, \qquad (2)$$
where, as previously, we use the notation that $\tau_0 = 0$ and $\tau_{m+1} = n$. If we minimize subject to the constraint that $m \le 1$, then the value of $m$ that achieves the minimum will correspond to whether a test for a single changepoint, using the test statistic $\mathrm{LR}_1$ and threshold $\lambda$, would detect a changepoint ($m = 1$) or not ($m = 0$).
If we wish to estimate multiple changepoints we could perform the minimization of (2) without any constraint on the value that m can take.
The value of $m$ and the associated values $\tau_{1:m}$ that minimize (2) are then our estimates of the number and locations of the changes. As $\mathcal{C}$ is defined in terms of the negative of the log-likelihood for a model for the data, this can be viewed as a penalized likelihood approach. However, it generalizes to other scenarios; for example, Fearnhead and Rigaill (2019) suggest using the bi-weight loss,
$$\mathcal{C}(y; \mu, \sigma) = \min\left\{ \frac{1}{\sigma^2}(y - \mu)^2, \; K^2 \right\}, \qquad (3)$$
for some chosen $K > 0$, so as to make detection of changepoints robust to point outliers. Thus, the general approach of minimizing (2) is often termed a penalized cost method. The costs we are using assume that the noise standard deviation, $\sigma$, is known; in practice, this often has to be estimated, for example, based on a robust estimate of the variance of the first differences of the data (see, e.g., Fryzlewicz, 2014). How important the method for estimating $\sigma$ is has had little attention in the literature, but it is simple to show that this commonly used estimator is sensitive to deviations from the IID Gaussian assumption, and this is one reason why methods based on comparing segmentations with differing numbers of changepoints (as described above and below) may be preferred in real applications.
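The penalized cost (2) can be minimized exactly by a standard O(n²) dynamic programme over the position of the last changepoint. The sketch below is our own illustrative implementation, using the squared-error cost with $\sigma = 1$; a robust cost such as the bi-weight loss could be substituted for the per-segment cost, at the price of a more involved segment fit:

```python
import numpy as np

def optimal_partitioning(y, lam, sigma=1.0):
    """Exact minimisation of the penalized cost: F[t] is the best cost for
    y[0:t], computed as min over s of F[s] + segment cost of y[s:t] + lam."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cum = np.concatenate([[0.0], np.cumsum(y)])
    cum2 = np.concatenate([[0.0], np.cumsum(y**2)])

    def seg_cost(s, t):
        # residual sum of squares of y[s:t] about its mean, scaled by sigma^2
        mu = (cum[t] - cum[s]) / (t - s)
        return (cum2[t] - cum2[s] - (t - s) * mu**2) / sigma**2

    F = np.full(n + 1, np.inf)
    F[0] = -lam  # so the first segment does not pay a penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        cands = [F[s] + seg_cost(s, t) + lam for s in range(t)]
        s_star = int(np.argmin(cands))
        F[t], last[t] = cands[s_star], s_star
    cps, t = [], n   # backtrack the optimal changepoints
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return sorted(cps)
```

With `lam` set around $2 \log n$, this corresponds to the SIC/BIC-style default discussed in this section.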
There is a close relationship between this penalized cost approach and the binary segmentation approach. Binary segmentation is equivalent to a greedy approach to approximately minimizing (2): At each iteration, we add the changepoint, if any, that reduces our objective function by as much as possible. Furthermore, the approach suggested by Fryzlewicz (2014) for choosing the number of changepoints given the output of wild binary segmentation involves minimizing a very similar objective function. The main difference is that for wild binary segmentation, we have a pre-chosen set of changepoints corresponding to each value of $m$, whereas when minimizing (2), we are choosing the changepoint locations that are best in terms of minimizing this criterion. As with the binary segmentation methods, minimizing the penalized cost has been shown to give consistent estimates under in-fill asymptotics, but with tighter results on the choice of $\lambda$, which needs to be at least $2 \log n$, rather than increasing faster than $\log n$.
As with binary segmentation methods, rather than pre-specifying the threshold/penalty, $\lambda$, we may wish to compare possible segmentations obtained with differing $\lambda$ or differing $m$. Haynes, Eckley, and Fearnhead (2017) give an efficient way of solving (2) for all values of $\lambda$ within any interval, though, unlike with binary segmentation or wild binary segmentation, this involves re-running a dynamic programming algorithm a number of times (corresponding roughly to the number of different segmentations that are optimal as $\lambda$ varies within the interval).
Alternatively, though at slightly higher computational cost, one can solve the minimization problem (2) for a fixed value of m and repeat this for m = 0, 1, … , m max , for some chosen m max (Auger & Lawrence, 1989;Rigaill, 2015). For each m, the output will be the choice of locations of the m changepoints that minimizes the cost or, for our change in mean problem, maximizes the log-likelihood.
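This fixed-$m$ dynamic programme (often called segment neighbourhood search) can be sketched as follows. This is an illustrative implementation of ours using the unpenalized squared-error cost; `Q[k][t]` is the best cost for fitting the first $t$ points with $k$ changepoints:

```python
import numpy as np

def segment_neighbourhood(y, m_max):
    """Best residual sum of squares for exactly m = 0..m_max changepoints,
    plus a function recovering the optimal changepoints for each m."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cum = np.concatenate([[0.0], np.cumsum(y)])
    cum2 = np.concatenate([[0.0], np.cumsum(y**2)])

    def rss(s, t):  # residual sum of squares of y[s:t] about its mean
        mu = (cum[t] - cum[s]) / (t - s)
        return cum2[t] - cum2[s] - (t - s) * mu**2

    Q = [[rss(0, t) if t > 0 else 0.0 for t in range(n + 1)]]
    back = []
    for k in range(1, m_max + 1):
        row, brow = [np.inf] * (n + 1), [0] * (n + 1)
        for t in range(k + 1, n + 1):
            # s is the position of the kth (last) changepoint
            cands = [Q[k - 1][s] + rss(s, t) for s in range(k, t)]
            s_star = int(np.argmin(cands)) + k
            row[t], brow[t] = cands[s_star - k], s_star
        Q.append(row)
        back.append(brow)

    def changepoints(m):
        cps, t = [], n
        for k in range(m, 0, -1):
            t = back[k - 1][t]
            cps.append(t)
        return sorted(cps)

    return [Q[m][n] for m in range(m_max + 1)], changepoints
```

The returned cost sequence is exactly the input needed for the penalty- and elbow-based choices of $m$ discussed in this section.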
Considering a range of values for $m$ is needed for penalties that are not linear in the number of changes $m$ (Lebarbier, 2005; Arlot, Celisse, & Harchaoui, 2012). In that case, the goal is to minimize
$$\min_{m, \tau_{1:m}} \left\{ \sum_{j=0}^{m} \sum_{t=\tau_j + 1}^{\tau_{j+1}} \mathcal{C}\left(y_t; \bar{y}_{\tau_j + 1 : \tau_{j+1}}, \sigma\right) + C \, \mathrm{pen}(m) \right\}, \qquad (4)$$
where $\mathrm{pen}(m)$ is the penalty and $C$ a constant. For example, the penalty of Lebarbier (2005) is $C \, \mathrm{pen}(m) = \sigma^2 m \left( c_1 \log(n/m) + c_2 \right)$; simulations suggest taking $c_1 = 2$ and $c_2 = 5$. Importantly, non-asymptotic results show that these penalties control the risk (Massart, 2007), retrieve the correct number of changepoints, and estimate the changepoint locations at the optimal rate (Garreau & Arlot, 2018).
Using such a penalty is non-trivial, as the penalty is only known up to a constant, $C$. This constant can be determined using the slope heuristic (Lebarbier, 2005; Arlot, Brault, Baudry, Maugis, & Michel, 2016; Arlot, 2019) by looking at how the segmentations vary as we change $C$. Two variants exist in the R package capushe. The first, called Djump, looks for the unique largest jump of the function mapping $C$ to the $m$ minimizing (4). The second, called Ddse, considers segmentations with many changes and uses the fact that, as we add false changepoints, the empirical risk reduces roughly linearly with $m$. Deviations from this linear behaviour indicate when real changes are being identified. These heuristics, although theoretically justified in some cases, sometimes fail (see Arlot, 2019, for example). Also, it is important to estimate segmentations with a maximum number of changepoints, $m_{\max}$, that is at least two to three times larger than $m^*$: too small a choice of $m_{\max}$ is the reason why this approach performs poorly in some of the empirical results of Fryzlewicz (2018b).
Given either a set of estimated changepoint locations corresponding to different choices of $m$ or different choices of $\lambda$, Haynes et al. (2017) and Lavielle (2005) suggest looking at how the cost for these segmentations, defined as the right-hand side of (2) with $\lambda = 0$, changes as the number of changepoints increases. The idea is that the cost should decrease more when detecting true changes than when detecting spurious ones. Thus, we can estimate the number of changepoints, $m$, by locating the "elbow" in the plot. Figure 1 shows examples of such plots for two data sets. The left-hand plots show the ideal scenario for such an approach to estimating $m$, where we have data simulated with a number of similar sized changes. Here, an elbow is clearly seen and corresponds to the true number of changepoints. The right-hand plots show a more challenging, and potentially more realistic, scenario where there are many changes of differing sizes. In this case, identifying the location of an elbow is more subjective.
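One crude way to automate locating the elbow, purely for illustration (this is our own rule of thumb, not a published method), is to pick the number of changes after which the cost reductions collapse, by maximizing the ratio of successive cost drops:

```python
def elbow(costs, eps=1e-9):
    """Given costs[m] for m = 0, 1, ..., return the m maximising
    drop(m) / drop(m + 1), where drop(m) = costs[m-1] - costs[m]."""
    drops = [costs[m - 1] - costs[m] for m in range(1, len(costs))]
    ratios = [drops[i] / max(drops[i + 1], eps) for i in range(len(drops) - 1)]
    return 1 + max(range(len(ratios)), key=ratios.__getitem__)
```

On the idealized cost sequence of the left-hand scenario, where true changes give large drops and spurious ones small drops, this rule recovers the true number of changes; in the harder right-hand scenario, it inherits the same ambiguity as reading the plot by eye.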
Estimating m by eye from such plots has obvious disadvantages, in particular such an approach cannot be applied to automated analyses of large numbers of data sets, as is often required in applications. However, this informal approach has been formalized into an automated method: It is similar to the idea behind the Ddse approach described above, the slope heuristic methods of Lebarbier (2005) and Lavielle (2005), and the steepest drop method of Fryzlewicz (2018a).

Scan statistics and multi-scale methods
Another way of detecting multiple changepoints is to apply a test for a single change to regions of data of a chosen fixed width and then move these regions through the data from start to end. Such approaches are called scan statistics (Yau & Zhao, 2016), as they can be viewed as scanning through the data. One example is the MOSUM procedure of Eichinger and Kirch (2018). For the change in mean problem, this involves using the cusum-based test for a change at $t$ based on data from a region of width $2h$, say, centred on $t$. The cusum test statistic can be calculated for each $t$, and the idea is that changes can be detected as local maxima of the cusum statistic, provided each such maximum is above a suitable threshold. One of the challenges with this approach is avoiding estimating multiple changes around a single changepoint, which may happen due to the inherent fluctuations in the cusum statistic as $t$ varies. This can be achieved by not allowing two estimated changepoints to be too close to one another.
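A sketch of the MOSUM idea follows. This is our own simplified version; the actual procedure of Eichinger and Kirch (2018) differs in details such as boundary handling and threshold calibration:

```python
import numpy as np

def mosum(y, h, sigma=1.0):
    """MOSUM statistic at each t: the scaled difference between the means of
    y[t-h:t] and y[t:t+h]; sqrt(h/2) standardises the mean difference."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cum = np.concatenate([[0.0], np.cumsum(y)])
    stats = np.zeros(n)
    for t in range(h, n - h + 1):
        left = (cum[t] - cum[t - h]) / h
        right = (cum[t + h] - cum[t]) / h
        stats[t] = np.sqrt(h / 2) * abs(right - left) / sigma
    return stats

def mosum_changepoints(stats, h, threshold):
    """Local maxima of the MOSUM statistic above the threshold, with detected
    changes forbidden from being closer than h to one another."""
    order = np.argsort(stats)[::-1]  # candidate times, largest statistic first
    cps = []
    for t in order:
        if stats[t] <= threshold:
            break
        if all(abs(t - c) >= h for c in cps):
            cps.append(int(t))
    return sorted(cps)
```

The minimum spacing `h` between accepted changes is the simple device, mentioned above, for avoiding multiple detections around a single changepoint.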
Scan statistic ideas can be extended to multi-scale methods by considering regions of differing lengths. A multi-scale version of MOSUM is suggested by Cho and Kirch (2019); but perhaps the most common multi-scale method is SMUCE (Frick, Munk, & Sieling, 2014), which brings together aspects of scan statistics and the penalized cost methods. SMUCE splits the data into regions and performs a test for the presence of a change on each of these regions. These regions are chosen to range in size, and the test statistics are aggregated appropriately across these differing scales, hence the multi-scale nature of the procedure. The number of changepoints is estimated as the smallest number of changes needed so that there is at least one changepoint within each region of data for which our test has detected a change. The locations of the changes are then estimated by minimizing a penalized cost (2) subject to requiring changes in regions of data where it was indicated by the tests.
One benefit of the SMUCE approach is that, as well as estimating the number and locations of the changes, it is able to quantify uncertainty in these estimates. For example, by appropriately choosing the threshold, and hence the significance level, for the tests, this approach immediately gives frequentist bounds on the probability of over-estimating the number of changes. However, SMUCE does not perform well for applications where there are many changes, particularly if some of these have small signal, as it tends to be too conservative in detecting changepoints.
In some situations (e.g., Li, Munk, & Sieling, 2016), it can substantially underestimate the number of changepoints. To overcome this, Li et al. (2016) adapt the criteria for estimating the number of changepoints. They propose a method called FDR-Seg that controls a version of the false discovery rate for the estimated changepoints.

COMPARISON OF METHODS
We now present a comparison of the performance of these different methods across a range of scenarios. Presenting these comparisons is challenging because we wish to consider a number of distinct scenarios for the mean (differing numbers of changes, sizes of change, lengths of segment, etc.); different models for the noise (IID Gaussian, heavy-tailed, and auto-correlated); different measures of accuracy (error in estimating the segmentation of the data or the mean function); and the large number of methods that have been proposed for this problem, each of which can have a different tuning parameter that affects the number of changepoints detected.
As a result, we present indicative results in this section, with the complete results available in the Supporting Information. Moreover, the Supporting Information is available as a markdown document with associated code, and the full results are presented in a way that makes it possible to extend them to additional methods and/or additional scenarios. The indicative results we present below are based on comparing some distinct approaches, focusing on those that appear to be most accurate.

Results: Known number of changepoints
To isolate the accuracy of a method at estimating the changepoints' locations from the tuning of the method to estimate the number of changepoints, we first present results where we assume the number of changepoints is known. We consider 4 of the 11 scenarios for the mean function that are given in the Supporting Information (see the top row of Figure 2). These are representative of the different behaviours and include a scenario with a random mix of segment lengths and sizes of change; one structured to have increasing segment sizes with decreasing sizes of change; and two with small segments and a constant size of change, the first alternating up and down changes and the latter having only up changes. All these scenarios are taken from studies in recent changepoint papers.
For each scenario, we consider varying the number of data points, $n$. This is done so that a doubling of $n$ means twice as many data points per segment, but the variance of the noise for each data point is also doubled; hence, the amount of information about the mean is roughly constant as we vary $n$. Furthermore, we compare three cases for the noise: first, IID Gaussian; second, IID t-distributed noise with 10 degrees of freedom; and third, an AR(1) noise process with lag-1 correlation of 0.3. The marginal variance of the noise is the same in each case. These cases relate in turn to the model most commonly assumed in theory for changepoint methods, a heavier tailed alternative, and a model with auto-correlation.
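For concreteness, the three noise processes can be simulated as follows. This is a sketch of ours; the exact design used for the reported results is in the Supporting Information, and the variance-matching conventions here are our assumptions:

```python
import numpy as np

def simulate_noise(n, kind, rng):
    """Generate the three noise processes compared here, each with
    (approximately) unit marginal variance."""
    if kind == "gaussian":
        return rng.normal(0.0, 1.0, n)
    if kind == "t10":
        # Var(t_df) = df / (df - 2) = 1.25 for df = 10, so rescale to variance 1
        return rng.standard_t(10, n) / np.sqrt(10 / 8)
    if kind == "ar1":
        phi = 0.3
        # innovations scaled so the stationary marginal variance is 1
        e = rng.normal(0.0, np.sqrt(1 - phi**2), n)
        x = np.zeros(n)
        x[0] = rng.normal(0.0, 1.0)
        for t in range(1, n):
            x[t] = phi * x[t - 1] + e[t]
        return x
    raise ValueError(kind)
```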
The results in Figure 2 give a comparison of four methods in terms of their mean square error of estimating the mean function, though qualitatively similar results are seen when the methods are compared in terms of the accuracy of the segmentation of the data that they obtain.
The four methods we compare include three methods that measure fit to data based on the square error loss (or equivalently based on the log-likelihood for IID Gaussian noise): the best fitting segmentation in terms of the square error loss as obtained by dynamic programming; binary segmentation; and wild binary segmentation. The final method is the best segmentation in terms of minimizing the bi-weight loss (3). For the scenarios in the first three columns of Figure 2, there is a consistent picture. The fit given by binary segmentation is substantially and consistently worse than the other methods. The other methods give very similar performance except that the bi-weight loss performs noticeably better for the t-distributed noise, which is as expected, as this loss is designed to give robustness to outliers. Perhaps more surprising is that, for the IID Gaussian noise case, using the bi-weight loss does not lead to a noticeable loss of accuracy relative to the square error loss.
The results for the scenario in the final column are substantially different, as binary segmentation is competitive with the alternative methods in this case. This is consistent with theory suggesting that binary segmentation loses power to detect changes in scenarios with changes of alternating signs, whereas this scenario has changes that are all of the same sign.

Figure 3 compares methods that also estimate the number of changepoints, including methods with threshold constants set to 0.9 and FDR-Seg with a putative false discovery rate of 5%. To see the impact of estimating the number of changes, we also compare with the results from minimizing the square error loss when $m$ is known.

Results: Unknown number of changepoints
The first two scenarios (two left columns of Figure 3) give similar results. First, all methods perform well when the noise is IID Gaussian: There is little to choose between the methods, and all perform nearly as well as in the case where $m$ is known. This is because the model we simulated from satisfies the standard assumptions underpinning the theory for the approaches for estimating $m$, and in these scenarios, the changes are relatively straightforward to detect. We see some drop off in performance for FDR-Seg and wild binary segmentation for the t-distributed noise, and a much larger drop in performance for the auto-correlated noise case: These and other simulation results suggest that FDR-Seg is particularly sensitive to its noise assumptions being correct. The pattern of performance of wild binary segmentation is largely due to our choice of threshold for estimating the number of changes, $m$. The results we present are for the version of wild binary segmentation that performed best in the IID Gaussian case, but it tends to over-estimate the number of changes when we have auto-correlated or heavy-tailed noise. By comparison, using the Schwarz information criterion, that is, (1) with $\zeta = 2 \log n$, estimates fewer changes and gives better performance for the auto-correlated noise case (see Supporting Information), highlighting how conclusions about a changepoint algorithm can be sensitive to the choice of method for estimating $m$.
The latter two scenarios (two rightmost columns of Figure 3) give quite different results. In particular, all methods tend to perform better when there is correlated noise than when the noise is IID Gaussian. This counter-intuitive behaviour has a simple explanation. These two scenarios have many short segments. All the methods we compare tend to under-estimate $m$ in these cases, and this under-estimation of $m$ is the reason for the big difference in performance compared with a method that assumes $m$ is known. By comparison, the impact of auto-correlation in the noise is to encourage methods to fit too many changes: Fluctuations in the data caused by auto-correlation have patterns similar to adding a small change in mean. In these two scenarios, the effects seem to cancel each other out. Across all scenarios, we can see that using the slope heuristic, that is, comparing fit to data as we vary the number of changepoints, performs well, and this type of method for estimating $m$ appears to be promising and to give some robustness to the type of scenario and/or to violation of the IID Gaussian assumption on which most default thresholds for estimating $m$ are based.
The results in the Supporting Information show that both the IDetect and MOSUM approaches have very strong performance for many scenarios when we have either auto-correlated or heavy-tailed noise. Results for the four scenarios we are focusing on are shown in Figure 4. The reason for this appears to be that both methods have an implicit minimum segment length, due to rules that mean they avoid detecting changes that are too close to each other. For scenarios with segments longer than this minimum segment length, this appears to make the methods robust to violations of the IID Gaussian assumptions. To demonstrate this, we also show in Figure 4 the strong performance of minimizing a penalized version of the square error loss with an explicitly imposed minimum segment length of 4, and we can directly see the improvement obtained relative to the same approach without any minimum segment length. The results for MOSUM are somewhat erratic for the fourth scenario in Figure 4, which appears to be driven by how the choice of window length used in the scan statistic relates to the actual segment lengths as we vary $n$.
In some applications, the computational cost of the algorithm used is important. We present timing results for the changepoint algorithms in the Supporting Information, with a subset of results presented in Figure 5. First, we observe that most algorithms have an empirical computational cost that is close to linear in the number of observations, $n$, though across these there is an order of magnitude difference in computational cost. Exceptions include methods that use the slope heuristic and require multiple runs of an algorithm to estimate segmentations with differing numbers of changepoints. Often, the number of different segmentations used by these methods is linear in $n$, and thus, these methods have a computational cost that is closer to quadratic in $n$. Also, the FDR-Seg method has a computational cost that is empirically worse than quadratic; for large $n$, FDR-Seg was substantially slower than all of its competitors. Furthermore, FDR-Seg requires simulation to choose appropriate threshold values; the timings in Figure 5 ignore this considerable upfront cost.

DISCUSSION
Our results suggest that care is needed when comparing different changepoint methods if one wishes to understand when and why one method is better than another, as it is rare that any method is uniformly better than another.
First, it is important to recognize that detecting changes has two aspects: estimating the number of changes and estimating their locations.
Many published studies look at these together or compare different algorithms each with a different way of estimating the number of changes.
In such results, it is hard to determine whether one algorithm is better than another or whether it is the estimate of the number of changes that is better. We would recommend that comparisons between algorithms should include comparisons which assume the number of changes is known.
Second, most methods have default settings that are based on theory for the case of IID Gaussian noise. These settings may not work well in applications, where often there will be features such as auto-correlation or heavy-tailed noise. Furthermore, the presence of auto-correlation can even affect the commonly used estimate of the marginal variance of the noise, obtained by the median-absolute-deviation estimate from the first differences of the data, which is used to standardize the data. We think that methods that estimate segmentations for a range of values of the number of changepoints are more reliable in situations where the IID Gaussian assumption does not hold, though the best way of using the information from these different segmentations is still an open problem. In applications where there is training data, using such data to train the tuning parameters so as to give reliable estimates of the number of changepoints (Hocking, Rigaill, Vert, & Bach, 2013) may be best.
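The estimator referred to here can be written down, and its sensitivity to auto-correlation seen, directly. The sketch below is ours; the constants follow the usual Gaussian-consistency convention:

```python
import numpy as np

def mad_sigma(y):
    """Estimate the noise standard deviation as the scaled median absolute
    first difference: differencing removes a piecewise-constant mean (except
    at the few changepoints), doubles the variance (hence the sqrt(2)), and
    1.4826 makes the median absolute deviation consistent for a Gaussian."""
    d = np.diff(np.asarray(y, dtype=float))
    return 1.4826 * np.median(np.abs(d)) / np.sqrt(2)
```

Under AR(1) noise with positive lag-1 correlation $\phi$, the first differences have variance $2\sigma^2(1 - \phi)$ rather than $2\sigma^2$, so this estimator under-estimates $\sigma$ by a factor of roughly $\sqrt{1 - \phi}$; data standardized with it then appears over-dispersed, pushing methods towards fitting too many changes, consistent with the over-estimation reported above.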
Finally, assuming a minimum segment length, if the assumption is correct, can noticeably improve performance and also give some robustness to heavy-tailed noise or auto-correlation. Care is needed when comparing methods that either explicitly or implicitly assume a minimum segment length with those that do not, as often it will be whether or not this assumption holds that is the primary reason why a method is more or less accurate.