Properties of bootstrap tests for N‐of‐1 studies

N‐of‐1 study designs involve the collection and analysis of repeated measures data from an individual not using an intervention and using an intervention. This study explores the use of semi‐parametric and parametric bootstrap tests in the analysis of N‐of‐1 studies under a single time series framework in the presence of autocorrelation. When the Type I error rates of bootstrap tests are compared to Wald tests, our results show that the bootstrap tests have more desirable properties. We compare the results for normally distributed errors with those for contaminated normally distributed errors and find that, except when there is relatively large autocorrelation, there is little difference between the power of the parametric and semi‐parametric bootstrap tests. We also experiment with two intervention designs: ABAB and AB, and show the ABAB design has more power. The results provide guidelines for designing N‐of‐1 studies, in the sense of how many observations and how many intervention changes are needed to achieve a certain level of power and which test should be performed.


Introduction
N-of-1 study designs involve the collection and analysis of repeated measures of an individual unit using an intervention and not using an intervention. The design for an N-of-1 study is often called the single case experiment design or single subject experiment design. The data from N-of-1 studies typically consist of T repeated measures, y t , t = 1, . . ., T, from a single subject, and dummy variables, x t , indicating whether or not there is an intervention at time t. The ultimate goal of N-of-1 studies is to investigate the effect of an intervention on an individual unit, and they have been applied in areas such as psychology and education (Shadish & Sullivan, 2011), and medicine (Howick et al., 2011).
Over the years, various analysis methods for N-of-1 studies have been developed and modified for more effective and simpler approaches to detecting intervention effects between periods that are subject to no interventions (phase A) and those that are subject to interventions (phase B). By and large these methods can be divided into two categories: non-regression-based (Borckardt, Nash, Murphy, Moore, Shaw, & O'Neil, 2008;Nourbakhsh & Ottenbacher, 1994;Parker, Vannest, & Brown, 2009); and regression-based McKnight, McKean, & Huitema, 2000). The former methods are simpler and easier to implement without formal statistical modelling, while the latter are based on regression theory, where parameters are formally estimated. Given the increasing adoption of N-of-1 studies for evidence-based analyses (Kratochwill et al., 2013), we concentrate on regression-based methods in this study. In particular, we estimate the statistical power of semi-parametric and parametric bootstrap tests under two single case designs, aiming to address the issue of lack of power analyses in the current literature.
We use a sample collected by a mobile phone app called "POWeR Tracker" (Morrison et al., 2014) to illustrate the power of the Wald test and bootstrap tests. Table 1 lists an extract of the data from an N-of-1 study to understand the impact on physical activity levels of using a smartphone application for weight management. It is a record of total steps of one participant over the period of 28 days. It has an ABAB experimental design (7 days without, 7 days with, 7 days without, 7 days with an intervention). In phase A, the participant had access to a web-based intervention (POWeR) only. In the intervention phase (phase B), the participant had access to both the web-based intervention and appbased intervention (POWeR tracker). During both phases, daily steps were recorded via a blinded pedometer. We initially consider the following general regression model for an N-of-1 study: where e t ¼ qe tÀ1 þ z t ; t ¼ 2; . . .; T ; are autocorrelated, order-one residuals, with z t~N (0, r 2 ), and e 1~N (0, r 2 /(1 À q 2 )). Before formally discussing the methodology, we introduce five possible alternative underlying mean behaviour patterns for two phases which can be specified by model (1) (Figure 1). A is the phase before an intervention and B is the phase after an intervention has been introduced. In Figure 1a, there is no change in the intercept or slope following the intervention (b 1 = b 2 = b 3 = 0). In Figure 1b and d there is a change in the intercept but not in the slope (b 1 6 ¼ 0, b 3 = 0). The difference between these two figures is that the former has a zero slope (b 2 = 0). In Figure 1c, there is constant increase over time, that is, no change in the slope (b 1 = 0, b 2 6 ¼ 0, b 3 = 0). No intervention changes could be detected in this figure since it is a trend developed in phase A continuing into phase B. Figure 1e represents a change in both the intercept and slope (b 1 6 ¼ 0, b 2 6 ¼ 0, b 3 6 ¼ 0). A regression-based N-of-1 study analyses a single interrupted time series that is subject to no interventions and interventions. It has two common methodological difficulties: autocorrelation and a small sample. McKnight et al. (2000) designed a double bootstrap methodology to tackle autocorrelation bias in the context of small samples. They use the first bootstrap to obtain asymptotically consistent estimates of the autocorrelation and other parameters in the model by utilizing Durbin's two-stage method, and use the second bootstrap to estimate the variance-covariance matrix of the estimated parameters. Their method reduced biases in the estimated autocorrelation and standard errors of the coefficients, and hence provided tests that have Type I error rates closer to the nominal rate and comparable statistical power to that when the true value of the autocorrelation is used. However, their estimation process is extremely computer-intensive by construction, which may limit the potential applications of method in practice. The current study attempts to deal with the issues of autocorrelation and small sample using a single parametric bootstrap within a generalized least squares (GLS) framework. Our work uses the restricted maximum likelihood (REML) estimation method in R (R Development Core Team, 2014) to detect an effect between two phases (phase A has no intervention, phase B has an intervention) where the underlying data series is autocorrelated. Parameters estimated under a GLS approach are consistent, but may suffer bias from underestimated standard errors (Park & Mitchell, 1980) due to the small sample size. We use semi-parametric and parametric bootstrap tests to reduce the effect of small sample bias in test statistics in an attempt to achieve better inferences from estimated parameters than the Wald test. Our method provides a simpler option that deals with the autocorrelation and small sample issues. It is less computerintensive and easier to implement when compared to the double bootstrap method.
Motivated by Borckardt et al. (2008), we consider a simple case design that explicitly assumes there is no slope in our model and hence concentrates on the differences among two phases (A and B). This is a realistic assumption as empirical experiments may not have a trend in phase A or B (see our motivating example). Our null hypothesis is displayed in Figure 1 and does not include a trend. Our alternative hypothesis is in Figure 1b. We use a dummy variable to detect a phase effect between A and B in one single time series as in standard linear regression analysis. A dummy variable that is not significantly different from zero indicates there is no phase effect. Further, we use simulation to calculate and compare statistical properties of bootstrap tests and Wald tests under various autocorrelations and phase effects. Despite new methods continually being developed to carry out N-of-1 studies, there is limited evidence on the power of these tests. This is the first attempt, to the best of our knowledge, to investigate the statistical power of semi-parametric and parametric bootstrap tests within a single time series setting in the context of N-of-1 studies. The results on statistical power provide guidelines for designing N-of-1 studies, in the sense of how many days and how many intervention changes are needed to achieve a certain level of power.
The rest of the paper is organized as follows. Section 2 introduces the regression model for detecting phase effects, the concepts of Type I error rate and statistical power, the construction of bootstrap tests and the estimation of the Type I error rate and power functions. Section 3 presents empirical results from two intervention designs (AB and ABAB), a discussion of these results and a power function illustration using the sample data introduced above. Our conclusions are summarized in Section 4.

Regression model
We now consider a simpler version of model (1) for an N-of-1 study: where e t ¼ qe tÀ1 þ z t ; t ¼ 2; . . .; T ; with z t~N (0, r 2 ), and e 1~N (0, r 2 /(1 À q 2 )). Recall that in this model, y t is a repeated measure at time t = 1, 2, . . ., T, x t is a phase dummy taking the value of 1 for the intervention and 0 for the non-intervention phase. The phase effect is b, with a large (small) absolute value of b indicating a large (small) phase effect. As mentioned, the problems of small sample size and autocorrelation may violate the underlying assumptions of no autocorrelation and large sample size for a standard linear regression analysis, which may lead to incorrect inferences, such as an incorrect Type I error rate and low statistical power. In order to overcome the problem of autocorrelation, we 'use GLS with REML to fit the models. Motivated by McKnight et al.'s (2000) bootstrap method, in order to address the small-sample problem, we suggest constructing semiparametric and parametric bootstrap tests of the null hypothesis H 0 : b = 0. For these tests, rather than comparing the Wald test statistics to its asymptotic null distribution, N(0,1), which henceforth we refer to as the Wald test, we compare this test statistics to a bootstrapped sample. See Section 2.3 for details. We compare the properties of the bootstrap tests to those of a Wald test for coefficients estimated by using GLS with REML. The properties under investigation are the Type I error rate and statistical power. By doing so, we aim to uncover the actual magnitude of Type I error rate, and how close it is to the nominal rate of 5%. Both Wald tests and bootstrap tests are carried out using the data simulated as described in Section 3. All estimates are calculated using the GLS REML routine in R (R Development Core Team, 2014).

Statistical properties
Two properties of a statistical test are discussed in this study: the Type I error rate and statistical power. The Type I error rate is the probability of incorrectly rejecting the null hypothesis when it is true. It is an important property which we like to control accurately. A standard acceptable Type I error is 5%.
Statistical power, or the power of a significance test, refers to the probability of rejecting the null hypothesis when it is false. Given a valid procedure, we like the power as high as possible when the null hypothesis is false (Cohen, 1988). It is an important consideration in an N-of-1 study and gives guidance on the length and frequency of interventions to reach a desirable power level, such as 80% (Cohen, 1988). It can also be used to detect whether two or more individuals are required in the trial (N-of-k studies).

Construction of bootstrap tests
The concept of the bootstrap is to replace the population with the empirical population (non-parametric) or estimated population (semi-parametric and parametric). Suppose our target is to draw inference about a population parameter h and we have observed a random sample of size T (y 1 , y 2 , . . . y T ) from this population with sample statisticsĥ. We can deriveĥ Ã b , a random quantity which represents the same statistics, but computed on a bootstrap sample b drawn from the empirical or estimated population. Computingĥ Ã b for B different bootstrap samples, we can then deriveĥ Ã 1 ,ĥ Ã 2 , . . .ĥ Ã B . The empirical bootstrap distribution ofĥ Ã b proves to be a fairly good approximation of the distribution ofĥ. In this study, we adopt parametric bootstrap tests (Efron & Tibshirani, 1993, p. 53) and semi-parametric bootstrap tests (Davison & Hinkley, 1997, pp. 389-391), with B = 100. Since we are not estimating a p-value, just performing a bootstrap test, B = 100 should be sufficient. However, to assess the effect of using a large B, we also calculate the Type I errors for bootstrap tests with B = 200 and compare the results. Under the alternative hypothesis as in model (2), theâ A ,b A ,q A andr 2 A from the observed sample are estimates of population values of a, b, q and r 2 . We also estimateâ 0 ,q 0 andr 2 0 from the observed sample under the null hypothesis that b is zero, that is, under the following model: where e t ¼ qe tÀ1 þ z t ; t ¼ 2; . . .; T ; with z t~N (0, r 2 ), and e 1~N (0, r 2 /(1 À q 2 )).
For the parametric tests, we simulate bootstrap samples under the null hypothesis. Then model (2) is fitted to the bootstrap samples to generate bootstrap estimates of . For each of the bootstrap simulations, the absolute value of the Wald test statistic based onb Ã b is compared to the Wald test statistics based on c b A estimated under the alternative hypothesis (model 2). This comparison is repeated B times for each of the simulatedb Ã b . The p-value of the bootstrap test is calculated as the percentage of times out of the total B that the bootstrapped Wald statistics generated from model (3) is more extreme than the observed statistics from model (2).
For the semi-parametric test, rather than simulating the errors, z t , from a normal distribution, they are sampled with replacement from the estimate residuals from model (3):ẑ t ¼ê t Àqê tÀ1 , t = 2, . . ., T, transformed to have mean zero and variancesr 2 0 ; see Davison and Hinkley (1997, pp. 389-391) for more details.

Estimating the Type I error rate and power functions
We estimate and compare the properties of bootstrap tests and those of the Wald tests. The Type I error rate and the power function of both tests are estimated. As mentioned, we desire the actual Type I error rate to be close to nominal rate of 5% and high statistical power. We start by simulating a data set Y t , t = 1, . . ., T, following model (2), with predetermined values of a, b, q and r 2 , where a = 0, r 2 = 1, and follow the structure of data collected from a study of POWeR Tracker (Morrison et al., 2014), with an ABAB design. Wald and bootstrap tests are then performed on the simulated data. We repeat this process 10,000 times for the Wald test and the bootstrap tests, and estimate the Type I error rate and power function for the given b and q. The actual Type I error rate is estimated as the percentage of times that the p-values of estimatedb are <5% when b is set to zero. The statistical power is the percentage of times that the p-values are <5% when b is not zero. A power function is the power as the corresponding values of b and q vary.
We expect the Type I error to be close to the nominal size of 5% for the parametric bootstrap test we have constructed. We also expect low statistical power in our study as autocorrelation and the small sample in N-of-1 studies are long-standing issues in behaviour change research in psychology (Cohen, 1988).
In order to assess the impact of the normal assumption, we also simulate the residuals from a contaminated normal distribution with a random 15% of the residual generated with an increased variance of 25 and repeat the simulation study. However, the parametric bootstrap was still based on the assumption of normality as before.

Simulation study
The Type I error rate and statistical power functions for both the Wald test and the parametric bootstrap tests are estimated by using Monte Carlo simulation. As noted above, we use 10,000 simulations for all tests and B = 100 for the bootstrap tests. Two designs of interventions are calculated: the first design (D1), as in the POWeR Tracker study, has an ABAB structure, with each of the four phases set up for 7 days (as in column 3 in Table 1); the second design (D2) has an AB structure, with both phases lasting for a period of 14 days. Both designs have a total duration of 28 days. The results give guidance on the design of N-of-1 studies, in the sense of whether it is better to have one long intervention period or several shorter intervention periods. Table 2 presents the estimated Type I error rates for the Wald, parametric and semiparametric tests. Four values of q are considered (0, .2, .5, .7) with two error distributions (normal and contaminated normal) and two designs (D1 and D2). For the bootstrap tests we present the results for B = 100 and 200. When interpreting these estimates, it should be borne in mind that if the true proportion is .05 then under repeated sampling approximately 95% of the estimated proportions based on a sample size of 10,000 would be in the tolerance interval (.0457, .0543).

Type I error rates
For all but one of the scenarios, the estimated Type I error rate for the Wald test is greater than the upper limit of the 95% tolerance interval (.0543), indicating that Wald test does not have the correct Type I error rates. However, for normal errors, the estimated Type I error rates for the parametric and semi-parametric bootstrap tests are within the 95% tolerance interval.
For the contaminated errors, the majority of the estimated Type I errors for the parametric and semi-parametric tests are closer to the nominal value of .05 than those for the Wald test, although fewer are within the 95% tolerance interval than for the normal errors. Furthermore, the estimate Type I error rates for the semi-parametric tests are closer to the normal value than those for the parametric tests, particularly for D2 and q = .5 and .7, indicating that in the presence of contaminated errors, the semi-parametric bootstrap tests perform better.
For both bootstrap tests and both error distributions, the results for B = 100 and 200 are very similar, supporting our initial belief that B = 100 should be sufficient. In particular, note that for the parametric bootstrap with contaminated errors, the Type I  error rates with the larger B are not uniformly closer to the nominal rate than those with B = 100 (Table 2).

Power
Tables A1-A6 (Appendix) present estimates of power functions for the Wald, parametric and semi-parametric tests with B = 100. Four values of q are considered (0, .2, .5, .7) with two error distributions (normal and contaminated normal) and two designs (D1 and D2). Note that D1 has three change points, whereas D2 has only one. Figure 2 presents a range of these power functions. Figure 2 presents the power functions of the three tests under the two designs with q = 0 and normal errors ( Figure 2a) and contaminated errors (Figure 2b). Although the Wald has the incorrect Type I error rate, it is the scenario which is closest to the nominal rate and therefore is included for comparison. From these two graphs, we conclude that there is no substantial differences in power when q = 0. Figure 2c presents the power functions of the bootstrap tests under the two designs with q = .5 and normal errors. Again there is no difference between the two tests for each design, but they are considerably more powerful under D1.  Figure 2d presents the same power functions as in Figure 2c, except that the errors are now contaminated. Again there is no difference between the parametric and semiparametric bootstrap test under D1, whereas under D2 the semi-parametric test is less powerful. However, recall that in this case the semi-parametric test has estimated Type I error rates closer to the nominal.
Comparing Figure 2a and b with Figure 2c and d reveals that the power decreases as q increases. Inspection of Tables A1-A6 reveals this is the case in all the scenarios considered. Comparing the two designs reveals that the power is lower for the contaminated errors; again see also Tables A1-A6. For D2 with an autocorrelation value of .2 and normal errors, to achieve a power of .8 for the parametric bootstrap test a b-value >1.5 is required; for a larger autocorrelation value, .5 or .7, a b of 2.5 or 3 is required (Table A3). These results suggest the power under D2 is low. Further comparison between D1 and D2 reveals that the bootstrap tests for both designs generally have similar Type I error rate, but for D1 they are at least as powerful as for D2 and tend to become more powerful as q or b increases. This result indicates that the shorter repeated intervention design works better than the longer period of intervention without repeat. This may be due to the impact of autocorrelation.

Bias inq
As a by-product of the simulation study, we are able to assess the bias inq as q, b, the design and the error distribution vary. As expected, inspection of the results showed that the bias did not vary with b. Therefore, in Table 3, we present the estimated bias inq as q, the design and the error distribution vary. For a particular value of q, the bias inq is very similar for both designs and both error distributions. However, the magnitude of the bias increases as q increases. This may in part explain the inflated Type I error rates for the Wald test (Table 2). However, as noted above, the bootstrap tests perform well, despite this increase in bias.

Discussion
It is clear from the above study that the bootstrap tests are more desirable for N-of-1 studies when autocorrelation is present. Under a single case design involving one individual over a period of 28 days, the statistical power is low. The comparison between ABAB and AB designs indicates that under the presence of autocorrelation, shorter and repeated interventions (ABAB design) seem to be more effective than longer and unrepeated interventions (AB design). This result lends support for the single-case intervention research design standards (Kratochwill et al., 2013) where the AB design does not meet the standard.

Conclusions
This study explores the properties of semi-parametric and parametric bootstrap tests in a single subject experiment design, or N-of-1 study, aiming to account for small sample sizes under the GLS regression framework. This is the first attempt, to the best of our knowledge, to examine the properties of N-of-1 studies in such a setting. We find the bootstrap tests are more accurate with regard to Type I errors when compared to the Wald test, and hence more desirable. We recommend the use of a parametric bootstrap with B = 100, except when both relatively large autocorrelation and contaminated normally distributed errors are thought possible. Our results can also be used to facilitate various experimental designs and provide guidelines for future N-of-1 studies. Further, we compare two different intervention designs of the same total duration and find that the tests under the design with more change points (D1) have better properties. This provides support for designs with three changes in the intervention as set out in the single case intervention research design standards (Kratochwill et al., 2013). The bootstrap methods used in the study examine an intervention effect under the assumption of no trend (Figure 1a). They can also be applied to the case where the model under the null hypothesis includes a trend (Figure 1c) and the model under the alternative hypothesis has a trend and a phase effect on the intercept (Figure 1d). Therefore, the results in this paper can also be used when designing a study to detect a phase effect on the intercept irrespective of whether or not there is a trend. The scenario our study has not covered is the case where the model under the alternative hypothesis has a phase effect on the slope (Figure 1e), although the method could easily be modified to handle this situation by adding a trend and trend by phase interaction to model (1). The method could also be extended to the situation where more than one individual is studied (N-of-k study, k > 1) by appropriately modifying model (1) to account for between-individual differences.