Single‐case experimental designs for child neurological rehabilitation and developmental disability research

Single-case experimental designs (SCEDs) are a group of methodologies of growing interest, aiming to test the effectiveness of an intervention at the single-participant level, using a rigorous and prospective methodology. SCEDs may promote flexibility in how we design research protocols and inform clinical decision-making, especially for personalized outcome measures, inclusion of families with challenging needs, measurement of children's progress in relation to parental implementation of interventions, and focus on personal goals. Design options for SCEDs are discussed in relation to an expected on/off effect of the intervention (e.g. school/environmental adaptations, assistive technology devices) or, alternatively, an expected carry-over/maintenance of effects (interventions aiming to develop or restore a function). Randomization in multiple-baseline designs and 'power' calculations are explained. The most frequent reasons for not detecting an intervention effect in SCEDs are also presented, especially in relation to baseline length, trend, and instability. The use of SCEDs on the front and back ends of randomized controlled trials is discussed.

'Single-case experimental designs' (SCEDs) or 'single-subject research designs' (1) are a group of methodologies of growing interest, aiming to test the effectiveness of an intervention at the single-participant (or very small n) (2) level, using a rigorous and prospective methodology.
Although different SCED designs exist (Figure 1a,b,c,d), their principle is essentially the same: (1) studying prospectively and intensively a single person or small group of persons over time; (2) measuring the outcome repeatedly and frequently in all phases of the study; and (3) sequentially applying and/or withdrawing an intervention in a randomized order or at randomized time points. 3 Repeated measures allow participants to serve as their own controls by comparing each individual's performance at baseline 'A' (i.e. before the intervention is introduced) with their performance during the intervention 'B'. The experimental design must be able to demonstrate that the participant improves as a result of the specific intervention and not for other reasons (a retest effect, spontaneous recovery, a developmental effect, a time-with-therapist effect, or other placebo effects, which may be explored during baseline A), and that the effect can be replicated at least three times in order not to be considered a coincidence. 4 The AB quasi-experimental design (Figure 1a) leads to stronger evidence of treatment effects than pre/posttest designs (where a participant is tested only once before and once after an intervention). However, it still does not provide sufficient control of biases to be considered a true SCED 3-5 according to SCRIBE (Single-Case Reporting guideline In BEhavioural interventions): if a participant makes progress in phase B (compared with phase A), this may be due to the intervention being tested, but also to events occurring concurrently with the intervention, or simply to an improvement in the patient's condition with time. Causality therefore cannot be differentiated from coincidence. (3) In 'true' SCEDs, additional successful replications provide more confidence that the treatment is functionally related to improvements in the outcome measure.
Although terminology and classification vary in the literature, SCRIBE guidelines 5,8 propose four main families of SCED (Figure 2). Readers unfamiliar with SCEDs can refer to a short presentation of the main SCED types, 3 to the review by Byiers 9 using examples of SCEDs in paediatric interventions, to a practical guide to SCEDs in rehabilitation, 10 and to Tanious and Onghena 11 for a comprehensive description and references of the numerous other SCEDs used in healthcare. In this paper, to examine SCED methodological challenges for developmental disabilities and child rehabilitation, the two most useful SCED families will be presented in detail: (1) multiple-baseline design (MBD), of which MBD across subjects and MBD across behaviours are of particular interest for rehabilitation and developmental interventions aiming at effects maintained over time; and (2) N-of-1 and related methodologies, such as alternating treatment design and introduction/withdrawal (ABAB) designs, which require the intervention being tested to have immediate effects, a short washout, and on/off effects. To explore treatment benefits (i.e. to answer the question 'does this intervention help?' 12 ), whenever feasible, multiple (i.e. aggregated) N-of-1 trials should be preferred to other SCEDs and to group randomized controlled trials (RCTs), because they (1) have a higher level of evidence according to the Oxford Centre for Evidence-Based Medicine; 12 (2) increase power at the group level; 13 (3) provide a personalized statistical significance at the single-participant level; (4) rely on multiple trials on the same participant, so they are more relevant to the participant than decisions based on mean expected responses inferred from the best-quality RCT meta-analysis; and (5) may contribute to increased participant confidence and commitment to long-term management. 14 As high-quality guidelines already cover all the important themes for N-of-1 trials, 1,14 and because N-of-1 methodology is well recognized, 15,16 this paper will focus mainly, but not exclusively, on MBD.
Although reporting guidelines 8 and SCED quality appraisal standards 4 can guide researchers, SCEDs are still underused (or under-published). Some possible reasons may be that the choice of SCED is inadequate for the intervention or condition being tested, or that the design (the baseline features and the frequency of the repeated measures in particular) does not allow one to detect an intervention effect even when one exists.
The aims of this paper are to (1) provide researchers and clinicians with lesser-known SCED methodological options, particularly randomization for MBD; (2) caution against the most frequent reasons for not detecting an intervention effect and provide ways of overcoming them; (3) clarify SCED terminology and its implications in terms of the hypothesis being tested; and (4) encourage flexibility in how we design research protocols in developmental disabilities through SCEDs.

RANDOMIZATION
Depending on the SCED being used, randomization can refer to the order in which different participants begin the intervention, which goals are trained first, the duration (defined in terms of number of measurements) of the baseline and intervention phases, the order in which alternating interventions are delivered, and so on. The more elements of randomization are embedded, 17 the greater the methodological benefit.
Randomization in SCEDs may have three purposes: (1) a methodological purpose, 18 to ensure that it is the effect of introducing the intervention that is tested, not confounding factors, which would be over-represented if the intervention started at a time deemed favourable by the therapist or the participant (e.g. if the participant were more motivated or less fatigued); (2) data analysis through randomization tests, 18-21 (4) which can be performed only if the design was randomized; 23 and (3) randomization-driven 'power' calculations, which are the main solution available for justifying the number of measures in MBD and which may use different approaches 23 (Table 1 and Table S1).
The logic behind randomization tests is quite simple: the more possible assignments (intervention starting points), the lower the minimum achievable p-value. The lowest possible p-value in a randomization test equals the inverse of the number of possible randomizations. Therefore, a randomization test will not allow one to show an intervention effect when there is only a small number of starting points, even if the data and other statistics strongly suggest intervention effectiveness (i.e. even if the difference between the mean of phase B scores and the mean of phase A scores is large and has the highest rank, and the intervention effect is obvious in visual analysis 24 ). For example, if n = 3 participants are included with three possible starting points through a Wampold-Worsham randomization, the total number of possible assignments will be n! = 3 × 2 × 1 = 6, and the lowest p-value that can be achieved by randomization tests is 1/6 ≈ 0.17. Randomization tests are not a true way of calculating power: they can only indicate the lowest p-value achievable under a given number of possible assignments (i.e. a minimum of 20 assignments is required to achieve a p-value of 1/20 = 0.05 (5) ); they do not guarantee that an effect will be detected if it exists, because power calculation based on randomization tests does not include an estimation of the expected standard deviation and mean differences.
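The ranking logic of a start-point randomization test (detailed in endnote 4) is simple enough to sketch in a few lines of code. The sketch below is illustrative only, with fabricated data and hypothetical start points; it is not the ExPRT implementation cited later.

# Start-point randomization test for a single AB case: the B - A mean
# difference is computed for every admissible intervention start point,
# and p is the rank of the actually assigned start point divided by the
# number of possible start points (rank 1 = largest difference).
import numpy as np

def randomization_test_p(y, possible_starts, actual_start):
    diffs = {s: np.mean(y[s:]) - np.mean(y[:s]) for s in possible_starts}
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    return (ranked.index(actual_start) + 1) / len(ranked)

scores = np.array([3, 4, 3, 4, 3, 4, 7, 8, 8, 9, 8, 9])  # fabricated data
p = randomization_test_p(scores, possible_starts=range(4, 10), actual_start=6)
print(f"p = {p:.3f}")  # with only 6 possible start points, p can never be < 1/6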
Several questions can guide researchers and help them choose the best randomization option. Here, 'best' refers to the optimal compromise between power (number of assignments) and maximal baseline length (and therefore study length and participation time per participant) that is acceptable and feasible.

[Figure legend (fragments):] Introduction/withdrawal designs, including N-of-1 trials (which provide high-level evidence in medicine when they are multicycle, blinded, and randomized), assume that an intervention will be applied over a period of time (either always of the same length, typically the case in N-of-1 trials, or of randomized length), with multiple consecutive measurements, and then withdrawn from the patient (or implemented in a modified form/dosage).

Multiple baseline design (MBD): the intervention is introduced sequentially to different patients, settings, or behaviours. MBDs can be viewed as multiple AB designs, with as many AB designs as there are target patients, settings, or behaviours. The evidence from such designs comes from demonstrating that change occurs when, and only when, the intervention is directed at that patient, that setting, or that behaviour. Multiple baseline designs eliminate the need to return to baseline and are therefore particularly suited to the evaluation of interventions with long-lasting effects, such as rehabilitation effects.
(1) How frequently is it possible and relevant to measure the outcome variable? (2) What is the minimum baseline length according to (a) recommendations (e.g. a minimum of three 26 or five 4 points in baseline), (b) pilot data (e.g. giving an idea of the variability of the outcome measure), and (c) specific requirements of the planned statistical analysis (e.g. multilevel analysis requiring more data points than procedures such as non-overlap of all pairs)? (3) What is the maximum baseline length that is ethically and financially feasible and acceptable to the participant and to the researcher? (4) How many participants is it possible to include concurrently (if a concurrent design is decided upon (6) )?
Having answered these questions, one can use Table 1 (and Table S1) to choose the best randomization option. Different user-friendly and free resources are available to randomize SCEDs and calculate randomization tests, including the ExPRT randomization macro 22 for Microsoft Excel and R packages. 29-31
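As a minimal illustration of how these options trade off, the sketch below computes the assignment counts behind Table 1 and the corresponding minimum p-values; the labels and worked numbers follow Table 1, but the code itself is not from the paper.

# Number of equally likely assignments, and hence the lowest achievable
# p-value, for two of the randomization procedures in Table 1.
from math import factorial

n = 3  # participants (or goals/behaviours) in the MBD
k = 4  # possible start points per participant (regulated randomization)

case_only = factorial(n)            # Wampold-Worsham: n! = 6
regulated = factorial(n) * k**n     # Koehler-Levin: n! * k^n = 384

print(f"case randomization: {case_only} assignments, min p = {1/case_only:.2f}")
print(f"regulated randomization: {regulated} assignments, min p = {1/regulated:.4f}")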

Baseline content
Baseline phases are control periods during which a participant does not receive the target intervention that the SCED aims to evaluate. During the baseline, the participant is measured repeatedly, with the same measures and at the same frequency as the repeated measurements that will take place in the intervention phase. However, the content of the baseline phase is variable. Baselines may consist of (1) a period without any intervention (equivalent to a waiting list); (2) a period of care as usual; or (3) a period of intervention useful to the participant but with a different aim from the intervention that will then be tested in the intervention phase. As with the control intervention in group studies, the baseline content influences the conclusions of the trial. If there is no intervention at all during the baseline phase, and improvement occurs in phase B, we can assume that intervention B is more effective than no intervention (which may be the aim of the study, especially in the many fields that lack evidence-based interventions). More rigorous baselines, however, control for major non-specific effects (i.e. time with therapist, goal setting, amount of encouragement, intervention intensity, etc. are kept equal between the baseline and intervention phases) so that only the specific intervention-related effects are measured in the intervention phase.

Baseline phase length
It is important that researchers are aware of the effects of short or unstable baseline phases, or of baselines showing a strong improvement trend, because these baseline characteristics are among the design features that most affect the ability to detect an intervention effect. Researchers and clinicians prefer the shortest possible baseline phases: to introduce the intervention as soon as possible, to keep the protocol sufficiently short (financial reasons), and to avoid participants and therapists 'wasting their time' in the baseline phase (ethical reasons). However, there are clinical benefits for a participant in a long baseline phase, such as a better understanding of the participant's condition, of performance fluctuations across time, and of the factors influencing performance before the introduction of the intervention being tested. Furthermore, longer baseline phases (where clinically, ethically, and financially acceptable) allow more reliable modelling of the baseline phase mean (typical performance without intervention), variability, and trend.
TABLE 1 Three randomization procedures and associated power calculations for multiple-baseline design (adapted from Levin et al. 23 )
Wampold-Worsham case randomization: as many starting points as participants; the n cases are randomly assigned to the n staggered positions of the MBD. Number of possible assignments: n!
Start-point randomization: more possible starting points than participants; some starting points are not used.
Koehler-Levin regulated randomization: a combination of case randomization and start-point randomization. Theoretically the largest power, 23 but only if more than two possible starting points per participant are available; it seems best suited to MBD across behaviours, when it is logical to delay the introduction of consecutive interventions, and is feasible only with a baseline phase of sufficient length (a strict minimum of three starting points per participant and three participants to achieve power). Number of possible assignments: n! × k^n
Notation: n, the number of participants/contexts/behaviours participating in the MBD (preferably concurrently); k, the number of possible starting points per participant. An example of useful parameters may be n = 3 goals and k = 4 starting points per goal (leading to a total of 12 starting points in the MBD), with a minimal baseline length of three measurements and a maximal baseline length of 17 measurements. This design gives a possible number of assignments of n! × k^n = 6 × 4^3 = 384, so the p-value can be as low as 1/384 ≈ 0.0026 (see Table S1 for an illustration).

It is commonly recognized and logical to consider spontaneous improvement in the baseline phase, in order to extract the improvement specifically due to the intervention. This is precisely why SCEDs are a stronger design than pre/posttest designs, which do not control for baseline improvement. Several solutions have been proposed to calculate the trend in baseline phases 32 and to subtract that baseline trend from the observed improvement at the end of the intervention. If the baseline shows a clear improvement trend, demonstrating an effect of the intervention will be harder, as the experiment will have to show that the trend accelerates (i.e. becomes steeper) in the intervention phase (Figure 3). This is not problematic if the trend is stable across time, but it is highly problematic if the trend is expected to change during the natural history of the child (Figure 4). For example, in the early recovery from an acquired brain injury, fast recovery probably takes place over the first weeks or months, but this spontaneous recovery usually does not last, and even an effective intervention will have difficulty keeping the child on the same trend of progress as the initial spontaneous recovery. Similarly, if there is a strong retest effect that reaches a plateau towards the end of the baseline phase, it is unlikely that the baseline trend (mainly influenced by the retest effect) will maintain the same rate of improvement in the intervention phase. Finally, when children with no or less intensive usual care are included in an SCED, time spent with the therapist in the baseline phase, family motivation, definition of common goals with the family, and other non-specific clinical management components may trigger an improvement during the baseline phase linked to an enthusiasm that is likely to fade over time, but which may result in a strong trend during baseline. This may lead one to conclude not only that the intervention is ineffective, but even that the intervention is deleterious because it appears to decrease spontaneous recovery (see Vibrac et al. 33 for an example).
To overcome this problem, SCED methodology insists on longer baseline phases, on using SCEDs in stable conditions only (i.e. not in the acute phase of an acquired brain injury), or even on not initiating the intervention until the baseline is stable. 26 Response-guided and randomized introduction of interventions 34 (i.e. randomizing the introduction of the intervention only after baselines have been deemed stable by an external judge) is probably the best option whenever feasible. (7) Figure 5 proposes a simplified description of the available statistical analyses and simple rules that can guide design parameters if baselines are expected to be unstable.
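As an illustration of the baseline-trend corrections mentioned above, the sketch below fits a Theil-Sen regression (the robust trend estimator also cited in Figure 5) to a fabricated baseline and subtracts the extrapolated trend from the intervention-phase scores. This is a sketch of the general idea, not the specific procedure of reference 32.

# Theil-Sen baseline detrending: estimate the baseline trend, extrapolate it
# into the intervention phase, and subtract it so that only improvement
# beyond the spontaneous trend remains. Beware (as Figure 5 warns) that
# extrapolation can push corrected scores to impossible values.
import numpy as np
from scipy.stats import theilslopes

baseline = np.array([2.0, 3, 3, 4, 4, 5])       # improving before intervention
intervention = np.array([6.0, 7, 8, 9, 10, 11])

t_base = np.arange(len(baseline))
t_int = np.arange(len(baseline), len(baseline) + len(intervention))

slope, intercept, _, _ = theilslopes(baseline, t_base)
corrected = intervention - (intercept + slope * t_int)  # beyond-trend gain
print(np.round(corrected, 2))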

THE RIGHT DESIGN FOR THE RIGHT QUESTION
Although terminology varies, choosing the right SCED design depends on the plausibility of an on/off effect of the intervention, which is why, in the introduction, I categorized most SCEDs into two main families. For some interventions aiming to compensate for a function, an on/off effect is predictable (prosthetics; drugs with on/off effects such as methylphenidate 36 or intrathecal baclofen; adaptive devices and assistive technology such as switches, 37 speech synthesizers, schooling adaptations, smartphones, etc.): ABAB designs (Figure 1d), alternating treatment designs (Figure 1c), and N-of-1 trials are to be used. Conversely, for most rehabilitation or learning interventions, an on/off effect is not expected (and not desirable), because the aim of the intervention is learning a new skill that cannot be unlearned, developing a function, or creating brain plasticity with maintained effects: MBD across participants is the best option (Figure 1b). Finally, some interventions are borderline: a splint may have an on/off effect in compensating for a function, but its repeated use can induce brain plasticity leading to maintained effects after its removal (see the SCED used by ten Berge et al. 38 of a thumb opponens splint that had an on/off effect for some children, whereas others showed persistent gains in functional tasks without the opponens splint following the phase in which the splint was used intensively). Having a clear idea about the probability of an on/off effect has important implications for choosing an SCED design and selecting the appropriate statistical analysis.

FIGURE 3 Influence of baseline trends on the conclusion drawn from single-case experimental design data.

Evaluating maintenance of effect or showing a return to baseline after intervention removal?
A classic design in rehabilitation practice comprises at least three phases, where a baseline A is followed by an intervention phase B aiming to produce long-lasting effects, which is then followed by a phase without intervention to test for maintenance of the rehabilitation effect tested in B. Some papers call this kind of design an ABA design, which may erroneously lead readers to consider that the child's behaviour is expected to be the same in both A phases (as in ABAB phase designs and N-of-1 trials, where the second A corresponds to a withdrawal, also termed washout, 15 of an intervention with on/off effects). Such a third phase could rather be called by a different letter (e.g. FU for follow-up, C for consolidation, M for maintenance). Although this may seem a merely terminological recommendation, it has many implications, as the hypothesis being tested is different. In ABAB/introduction-withdrawal designs for interventions with on/off effects, the hypothesis is that removing the intervention will make the child return to baseline levels, and therefore the aim is to show a negative statistical difference between B and the second A phase (withdrawal). Showing a lack of maintenance (i.e. the production of an off effect) aims to provide a rationale for keeping the intervention on/active. Repeating the on/off phases in N-of-1 trials increases the probability that the effect is due to the intervention being tested. Conversely, in an AB-follow-up design, the hypothesis is that the child will maintain the intervention effect, and therefore the aim is to show that there is no difference in level (i.e. phase mean score) between the B and follow-up phases, or to explore a change in trend. In pilot studies, and for interventions with unclear underlying mechanisms, SCEDs can also be very useful to determine whether there is an on/off effect. Follow-up phases in MBD can also serve this purpose.
Finally, SCEDs are a highly efficient tool for testing the personal evolution profile after an intervention is introduced or removed, which is very likely to vary from one participant to the next. After the introduction of an intervention, some participants show an immediate effect, others a delayed one, and yet others a gradual improvement (see Gharebaghy et al. 39 for an example). Similarly, after an intervention is removed, some participants will show persistent effects, whereas others will need frequent booster sessions to maintain gains. Grouping the results of participants with such different evolution profiles (as is classically done in group RCTs) may lead to erroneous or unhelpful conclusions based on means, when there is no reason to think that all children respond in the same way to an intervention or to its removal.

FIGURE 5 Representing results and selecting the most appropriate statistical analysis. To access an overview of the statistical analyses available for SCEDs, readers can refer to Manolov and Moeyaert 60,61 or to the specific options for N-of-1 trials. 1,14 Abbreviations: MBD, multiple-baseline design; SCED, single-case experimental design. [Panel text, recovered in part:] Widely used overlap indices 56 should be interpreted with caution because (1) they do not control for baseline trends (with the risk of concluding that an intervention effect exists when it is only the maintenance of a spontaneous improvement already present at baseline) (Figure 3); (2) they tend to underestimate an intervention effect if the effect is delayed after the intervention starts; and (3) they provide an effect size of overlap, not of the magnitude of change (two adjacent phases that present no overlap will both have an effect size of 1, whether the mean difference between phases A and B is 0.01 m/s or 25 m/s). A major challenge is to include delayed effects in the statistical analysis. Statistics based on regressions 57 are of particular interest here, as they allow one to model the trend and to take into account the effect observed at the end of the intervention (which is more relevant than immediate effects for rehabilitation or learning interventions that require time to show improvement); the baseline trend is expressed by β1 and the effect of the treatment on the time trend by β3. Baseline correction (e.g. using a Theil-Sen regression) can bring intervention-corrected measures to non-existent values. Longer baselines and more frequent measurements allow more reliable modelling of random and fixed effects in piecewise regression analysis. Baseline and intervention phase lengths are determined not so much by their duration in time as by the number of measurements performed; measuring participants at the same frequency throughout the phases allows one to plot and statistically interpret SCED data easily (although more complex statistical models can take unequal measurement frequency into account).
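The piecewise regression referred to in Figure 5 can be sketched as follows, assuming the common four-parameter single-case model in which β1 captures the baseline trend and β3 the change in trend once the intervention starts; the data and phase lengths are invented for illustration.

# Piecewise (interrupted time series) regression for one AB case:
# y = b0 + b1*t + b2*phase + b3*(t - t_start)*phase.
import numpy as np

y = np.array([2, 2, 3, 3, 4, 4, 5, 7, 9, 11, 13, 15], dtype=float)
t = np.arange(len(y), dtype=float)
t_start = 6.0                              # first intervention measurement
phase = (t >= t_start).astype(float)       # 0 in baseline, 1 in intervention

X = np.column_stack([np.ones_like(t), t, phase, (t - t_start) * phase])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"baseline trend b1 = {b1:.2f}, change in trend b3 = {b3:.2f}")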

Comparing different interventions
Some SCEDs can be used to compare the effects of two interventions. Alternating treatment designs are rigorous SCEDs for this purpose: they are particularly useful for estimating not only the on/off effect, by comparing conditions, but also the learning or developmental effect, by examining the improvement trend and comparing it across conditions. In the example shown in Figure 1c, adapted from Diamond and Ottenbacher, 40 the effectiveness of three types of ankle-foot orthosis (AFO) is tested on walking capacity: in every physiotherapy session, the order of the conditions is randomized. Results are typically represented by joining the points of each condition together. In Figure 1c, we see that step length varies across measurement days (possibly owing to fatigue, pain, participant motivation, random fluctuations, or measurement error), with some measures overlapping. Within this intraindividual variability, the dynamic AFO results in better performance than the traditional AFO, and the traditional AFO is better than walking barefoot. The participant shows gradual spontaneous increases in step length when barefoot; step length does not increase with the traditional AFO, while the dynamic AFO leads to gradual progress. Note that this design is feasible only because each condition has an on/off effect between conditions, even if a gradual improvement is present (with the dynamic orthosis). Such designs, besides being useful in research, are a clinically relevant tool for providing evidence-based optimal care at the single-participant level, for example in choosing the best orthotic or assistive technology solution for a given child. Alternating treatment designs can also be applied to more complex interventions and outcome measures: the study by Logan et al. 41 gives an excellent example of how a subjective outcome measure (happiness) can be transformed into an objective, reliable, repeated measure, and of how SCEDs can extract an intervention effect despite very large participant variability in complex clinical situations (here, in children with profound multiple disabilities). On the other hand, ABC designs are phase designs in which B and C are two different interventions and a series of measurements (classically a minimum of three or five) is taken consecutively in each condition. When used for interventions aiming to restore or develop a function (i.e. where an off effect is not expected), they do not allow one to compare the relative effectiveness of the two interventions. These designs only allow researchers to explore whether adding an intervention C after a skill learned in intervention B adds effectiveness to intervention B. In no way do they allow a comparison of the relative effectiveness of B and C, as phase C is a combination of the new intervention C and the carry-over effects of intervention B.
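For completeness, here is a minimal sketch of one widely used overlap index, non-overlap of all pairs (NAP), applied to two fabricated alternating-treatment conditions echoing the AFO example above; the caveats about overlap indices listed in Figure 5 still apply.

# Non-overlap of all pairs (NAP): the proportion of all (a, b) cross-condition
# pairs in which b exceeds a, with ties counted as 0.5.
from itertools import product

def nap(lower_condition, higher_condition):
    pairs = list(product(lower_condition, higher_condition))
    score = sum(1.0 if b > a else 0.5 if b == a else 0.0 for a, b in pairs)
    return score / len(pairs)

traditional_afo = [30, 32, 31, 33, 32, 34]  # fabricated step lengths (cm)
dynamic_afo = [35, 36, 34, 37, 36, 38]
print(f"NAP = {nap(traditional_afo, dynamic_afo):.2f}")  # close to 1: little overlap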

Other useful SCEDs
Other SCEDs have been used and can provide 'the right design for the right question'. Most of them are variations of the designs presented above.
MBD across settings/contexts (see Feeney 42 for an example) is particularly useful for demonstrating the effectiveness of schooling adaptations: by implementing school adaptations sequentially in different contexts (e.g. different school subjects/teachers of a single participant), the effect of these adaptations can be replicated three times or more, namely across three or more school subjects in a single participant.
Multiple probe designs are similar to MBD. However, in multiple probe designs, the measures are not taken continuously but are probed at specific time intervals (for example, every month, with a probe of nine consecutive daily measures followed by 19 days without measures). The probe procedure can reduce logistical constraints and the impact of the test-retest threat. 43 Repeated acquisition designs 44 are a variation of alternating treatment designs applied to non-reversible behaviours (e.g. learning a list of 10 new words, which cannot be unlearned). They involve repeated and rapid measurement of irreversible discrete skills or behaviours (e.g. learning a set of pictorials or gestures in developmental language disabilities) through single pre- and post-intervention probes, across equivalent sets of stimuli (e.g. lists of 10 gestures/pictorials of equivalent difficulty, utility, frequency of use, etc.). This design is particularly useful for research on learning. However, establishing a large pool of substitutable, or equivalent, stimuli and documenting their equivalence can be very challenging for researchers.
The use of changing criterion designs (and range-bound changing criterion designs) has been advocated for healthcare research, 7 but they are rarely used outside behavioural analysis research. They combine testing a specific intervention with the effect of setting a predefined (behavioural) goal level to be attained thanks to the intervention. The attainment of a goal (over a predefined number of successive successful measures) acts as the criterion for changing phase, with a new, more challenging goal (i.e. criterion) set for the next phase. In the classic changing criterion design, a single-point criterion (i.e. a value) is set that has to be met by the participant, whereas in the range-bound changing criterion design an acceptable range of values is specified. In both designs, the researcher (and/or participant) has to predefine the criterion to be met. Therefore, this design is better suited to clinically implementing an intervention with a known magnitude of effectiveness (i.e. the performance a participant may be expected to reach with the intervention) than to exploring novel interventions. Closely related is the changing intensity design, which can be illustrated by the formula AB1B2B3, where the Bs with subscript numerals indicate successive intervention phases in which stepwise changes in intervention parameters are implemented. Graham et al. argue that these designs are not effective for establishing causality 45 but can provide useful information in demonstrating change in patients' performance over time. Behaviourists, on the other hand, argue that when the rate of the target behaviour changes with each stepwise change in the criterion, the therapeutic change is replicated and experimental control is demonstrated. Conversely, control is clearly compromised when the behaviour does not remain within the prespecified range.

Flexibility in outcome measures
Contrary to most methodologies, SCEDs do not require outcome measures with previously published validation, norms, and known metrological properties. In fact, the repeated measure in SCEDs is defined as a 'target behaviour' 8 that is usually not a normed score. Common tests can be used as repeated measures in SCEDs but are often not practical, because they take a long time to administer (they are not compatible with a high frequency of measurement) and are often not personalized to the child's unique objective. In SCEDs, the design itself tests the reliability of the measure: (1) repeated measures in the baseline phase offer a personalized quantification of the measure's retest effect; (2) the SCED quality recommendation of calculating interrater reliability on 20% of the repeated measures ensures measurement reliability; (3) comparing participants with themselves circumvents the need to compare the child's scores with those of typically developing children; (4) the repeated measure is usually a behaviour directly targeted by the intervention, and thus its content validity is by definition high; and (5) SCED-associated measures (especially generalization measures) contribute to showing concurrent validity (see, for example, Feeney and Ylvisaker's ABAB design, 46 where the frequency and intensity of challenging behaviour are measured, but also the percentage of schoolwork completed).
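The 20% interrater check mentioned in point (2) is straightforward to operationalize. Below is a minimal sketch with fabricated ratings and an illustrative one-point tolerance; simple percent agreement is used here, though kappa or an intraclass correlation could be substituted.

# Draw a random 20% of the repeated measures for independent double scoring
# and compute simple percent agreement within a tolerance of one point.
import random

random.seed(1)
primary = [random.randint(0, 10) for _ in range(40)]          # fabricated ratings
checked = random.sample(range(len(primary)), k=len(primary) // 5)
second = {i: primary[i] + random.choice([-1, 0, 0, 0, 1]) for i in checked}

agreement = sum(abs(primary[i] - second[i]) <= 1 for i in checked) / len(checked)
print(f"{len(checked)} of {len(primary)} measures double-scored, "
      f"agreement = {agreement:.0%}")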

SCEDs and goal-focused rehabilitation
The future of SCEDs probably lies in progressing from the usual habit of measures applicable to all children with the same condition towards personalized target behaviours that measure what matters most to the child in question and to their family, or at least what is known to matter most for the child's development. One way to do this is by using goal attainment scaling (GAS). GAS 47 is a systematic method for the development of personalized evaluation scales. It provides an individualized measure of a person's goal achievement that can be used as an outcome measure to quantify progress towards person-centred goals on an ordinal scale. 48 GAS, however, has its own challenges, independent of its use within SCEDs, in ensuring adequate reliability. 49 GAS uses a single outcome measure for all participants (the GAS score), but the content of the GAS can be personalized, allowing conclusions about intervention effectiveness both at the single-participant level and at the group level (e.g. for four participants included in the same MBD). Clinically, it is current and recommended practice not to pursue too many goals at the same time; therefore, a staggered introduction of goals (and related goal-based interventions) in a single participant, through an MBD across behaviours (each behaviour being a goal) (Figure 1b), is a research design close to current clinical practice, bridging research and clinical care.
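Where GAS scores from several goals are combined, the standard Kiresuk-Sherman T-score can be used; this formula comes from general GAS practice rather than from this paper, and the example values are invented.

# Kiresuk-Sherman GAS T-score: goals scored on the -2..+2 attainment scale,
# optionally weighted, combined assuming an inter-goal correlation rho = 0.3.
from math import sqrt

def gas_t(attainment, weights=None, rho=0.3):
    w = weights or [1] * len(attainment)
    numerator = 10 * sum(wi * xi for wi, xi in zip(w, attainment))
    denominator = sqrt((1 - rho) * sum(wi**2 for wi in w) + rho * sum(w) ** 2)
    return 50 + numerator / denominator

# Three goals: one better than expected (+1), one as expected (0), one below (-1).
print(f"T = {gas_t([1, 0, -1]):.1f}")  # 50.0: overall attainment at expectation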

SCEDs for testing intervention implementation
SCED repeated measures allow monitoring of both a child's progress and the implementation of the intervention by parents, teachers, and any 'everyday people' who might be involved in it. An excellent example is the SCED used by Cosbey and Muldoon 50 in autism, which shows both parental learning of strategies during meal times and the impact on children's acceptance of food and challenging behaviour.

SCEDs for challenging family needs
In RCTs, participants are usually selected by strict inclusion criteria to ensure that the groups are as homogeneous as possible and to exclude children and families who are likely to be lost to follow-up, unlikely to understand or follow the protocol, or unlikely to stay committed. Inclusion criteria may include being motivated and available: although testing an intervention in motivated and available participants is a good start, interventions face the challenges of discouraged, tired, or fed-up children and families, challenging parenting styles, and parental psychosocial factors, all of which can impede the child from engaging effectively in the intervention. Interventions for these difficult-to-reach or difficult-to-motivate families are desperately needed (an intervention that has proved effective in 'motivated' families may show no effect in those presenting such challenges), but the risk of loss to follow-up in RCTs discourages researchers from including this population. SCEDs are a fantastic tool for reaching them: as each participant is compared with their own baseline, there is no need for homogeneity in the participant population, and SCEDs allow the researcher to modulate and adapt the intervention to the participant's unique challenges while still measuring its effectiveness.

SCEDs on the front end or back end of RCTs
SCEDs can also be extremely powerful on the front end or back end of RCTs. On the front end, intervention procedures can be systematically observed and modified on a small scale before extensive resources (financial, time, personnel) are committed to an RCT. By repeatedly measuring participants, SCEDs provide knowledge of the number of sessions needed to observe an effect, of the responsiveness and variability of the outcome measure, and of the types of participants who respond to the intervention (note that SCEDs allow the use of heterogeneous samples because each participant is compared with themselves). On the back end, SCEDs can examine intervention characteristics to support minimal responders or non-responders to typical treatment. They may support the optimization of the timing, intensity, and adaptation of interventions at a personalized level. They allow experimentation with interventions in participants who do not match the (often very strict) inclusion criteria used in the original RCT. Finally, SCEDs allow effective individual therapeutic decision-making. 51 This is because, although meta-analyses of RCTs give reasonable arguments for predicting that an intervention will be effective, both patients and their therapists are more convinced when decision-making is based on a personalized trial. 52,53 This may contribute to better adherence to interventions. A typical example is the use of methylphenidate in attention-deficit/hyperactivity disorder: although the literature has established its effectiveness, an individual N-of-1 trial can convince the participant, family, teachers, and doctors of its benefit for a particular person.

CONCLUSION
As already noted by Romeiser Logan et al., SCEDs 'may bridge the gap between research and clinical practice, with robust design options that better identify and preserve patterns of responsiveness to specific interventions'. 54 Whenever feasible, N-of-1 trials and alternating treatment designs should be preferred to RCTs. A design that holds great opportunities is MBD across behaviours, applied to an individual child's personalized goals worked on sequentially. It is particularly useful both for informing optimal clinical care at the single-participant level and for testing goal-focused interventions in research.

ACKNOWLEDGEMENTS
I am grateful to Rumen Manolov from Barcelona University for advice on randomization and multilevel analysis and to the three reviewers for their comments and suggested references.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable: no new data were generated; the article describes entirely theoretical research.

ORCID
Agata Krasny-Pacini https://orcid.org/0000-0003-0875-9591

ENDNOTES
1 The term SCED should be preferred to the frequently used term 'single-subject research design' (SSRD), because SCEDs are also very valuable outside research, to determine the best therapeutic option for a given patient through experimental introduction/withdrawal of an intervention within usual clinical care. 1,2
2 Note that other small n designs exist that are not SCEDs; therefore, the term 'small n design' 3 should be avoided when describing SCEDs.
3 However, SCRIBE recommendations are rather historical: limited empirical evidence actually supports this claim. For example, recent work suggests that replicating the effect twice may be sufficient, 6 and a growing literature supports the use of AB quasi-experimental randomized designs. 7
4 Randomization-test procedures for MBD incorporate a comparison between the mean (or the trend or the variability, although randomization tests often lack power for these 22 ) of the baseline phase observations (A) and that of the intervention phase observations (B). Specifically, for each procedure a set of B − A differences is constructed to form a randomization distribution: the B − A mean difference is computed for all possible intervention starting points. The mean differences are then ordered, and the difference that corresponds to the starting point that was actually randomly assigned is given a rank. The probability p is calculated by dividing this rank by the total number of possible starting points. Therefore, if the B − A mean difference is ranked third highest out of 48 possible starting points, p = 3/48 = 0.06. If it is ranked highest, p = 1/48 = 0.02.
5 Note that the 0.05 threshold p-value for deciding that results are significant carries major risks in terms of the false discovery rate, and a growing literature suggests that it should be abandoned. 25 However, this issue does not relate specifically to SCEDs.
6 Note that although the concurrent inclusion of participants (i.e. all participants starting the baseline at the same time) has been a criterion standard in SCEDs, a recent review 27 suggests that the deprecation of non-concurrent designs is not well justified: several arguments support non-concurrent designs, 28 and most of the multiple N-of-1 literature uses them. Non-concurrent designs can ease protocol constraints and allow more randomization options.
7 Note that recent research 35 using machine learning questions the necessity of waiting for baseline stability. However, this finding is based on Monte Carlo simulated data, was found only with machine learning, and was based on measuring type I errors only. Traditional analysis methods produced fewer type I errors when using response-guided decision-making (i.e. waiting for stability). Type II errors were not assessed.