Cumulative sum (CUSUM) assessment and medical education: a square peg in a round hole


Cumulative sum (CUSUM) techniques were developed by the Statistical Research Group of Columbia University as part of the war effort during the 1940s, in a bid to increase productivity by ensuring quality manufacture of equipment, e.g. ordnance production [1]. In the last decade there has been a growing interest in the application of these techniques to medical practice. In particular, there have been several articles in recent years – including in this journal – evaluating and often promoting the use of CUSUM assessment of performance in three main areas: trainees’ acquisition of competence in procedural skills [2–4]; quality control at departmental or organisational level [5]; and performance of specialists [6, 7].

The CUSUM is a sequential analysis statistical tool that is particularly suited to the identification of small changes, or changes in counts of events, in one direction or another. CUSUM analyses the output of a process over multiple repetitions [8] and thus ‘monitors’ the probability density of the results, so that when a process starts to go awry the change is rapidly identified, corrected and, if necessary, the process stopped. For example, if we consider the manufacture of tracheal tubes, we can expect some common (random) variation in the quality of the tubes. CUSUM allows special-cause variation in quality (for example, due to wear or developing imperfections in the machinery) to be detected, signalling that manufacture may have to stop while the cause of any imperfections is investigated. Importantly, CUSUM avoids the need for multiple repeated hypothesis testing, with its attendant risk of false alarms and reduced productivity.
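As a concrete illustration of these mechanics, the following is a minimal sketch of our own (not taken from any cited source) of a one-sided CUSUM with a holding barrier at zero; the function name and the target, slack and threshold values are all hypothetical choices.

```python
# Hypothetical sketch of a one-sided CUSUM monitoring a defect indicator
# (1 = defective tube, 0 = acceptable tube); all parameter values are
# illustrative, not taken from the editorial.

def cusum(observations, target, slack, threshold):
    """Return the CUSUM path and the index at which it first signals."""
    c, path, signal_at = 0.0, [], None
    for i, x in enumerate(observations):
        # Accumulate deviations above target, less a slack allowance;
        # the max(0, ...) is the holding barrier at zero.
        c = max(0.0, c + (x - target) - slack)
        path.append(c)
        if signal_at is None and c > threshold:
            signal_at = i
    return path, signal_at

# A process with a low defect rate that drifts out of control from item 10.
obs = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
path, signal_at = cusum(obs, target=0.1, slack=0.05, threshold=2.0)
# The sustained run of defects from item 10 pushes the CUSUM past the
# threshold at item 11.
```

Note how the two isolated early defects are absorbed (the path decays back towards zero), whereas the sustained run after item 10 accumulates and triggers the signal, without any repeated hypothesis testing.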

In this era of competency-based assessment in medical education, CUSUM apparently offers an objective, and therefore reliable, measure of a trainee’s procedural expertise. However, it is of concern that there has been little consideration of how sound and appropriate CUSUM techniques actually are within the context of medical education. Ultimately, CUSUM assessment promises a binary response to the question of whether a trainee is competent or not. Before we question the wisdom behind these assumptions, it is worth remembering what CUSUM actually is and how it is calculated.

In medicine, three CUSUM techniques are commonly utilised: the CUSUM-graph, the CUSUM-test and the sequential probability ratio test (SPRT), all three of which are frequently confused [7]. The CUSUM-graph is probably the most familiar. Here, a graph plots the cumulative sums of the difference between the actual and ideal process outcomes, thus creating a learning curve. As an extension of this, the CUSUM-test tests the null hypothesis that a process is ‘in-control’ vs an alternative hypothesis where the process is ‘out-of-control’. In the CUSUM-test, an upper boundary is created statistically (see Appendix, eqn 6) above which the CUSUM signals an out-of-control process that may need to be stopped. The SPRT, however, has both an upper and a lower boundary (see Appendix, eqn 7). In the SPRT, an in-control process is signalled when the plot crosses the lower boundary. In contrast, the CUSUM-test is designed so that an improving CUSUM never passes below a holding barrier of zero; i.e. the CUSUM does not diverge further and further from the upper, out-of-control boundary, which would delay identification of a subsequent deterioration in performance. Therefore, a process is never assumed to be in-control when using the CUSUM-test. Importantly, the CUSUM-test should be used to monitor processes that have reached a steady-state, i.e. not a novice anaesthetist’s procedural expertise, which is out-of-control at the outset.
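The practical difference between the two tests can be seen in a short simulation of our own (the step size s = 0.15 is an assumed value, in the spirit of the worked example later in this Editorial):

```python
# Illustrative contrast (assumed step size s = 0.15) between the SPRT, whose
# plot can drift ever further below zero, and the CUSUM-test, whose holding
# barrier keeps the plot at or above zero.

def sprt_path(failures, s):
    """SPRT: each success steps down by s, each failure steps up by (1 - s)."""
    c, path = 0.0, []
    for failed in failures:
        c += (1 - s) if failed else -s
        path.append(c)
    return path

def cusum_test_path(failures, s):
    """CUSUM-test: same steps, but held at a barrier of zero."""
    c, path = 0.0, []
    for failed in failures:
        c = max(0.0, c + ((1 - s) if failed else -s))
        path.append(c)
    return path

outcomes = [False] * 10 + [True] * 3    # ten successes, then three failures
sprt = sprt_path(outcomes, s=0.15)
held = cusum_test_path(outcomes, s=0.15)
# After ten successes the SPRT sits at -1.5, so three failures only lift it to
# 1.05; the CUSUM-test was held at zero, so the same failures take it to 2.55.
```

Because the held plot starts its climb from zero, a deterioration is flagged sooner; this is precisely why the CUSUM-test suits steady-state monitoring.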

Setting up a CUSUM sequential analysis for training purposes in anaesthesia

A detailed discussion of the mathematics required to derive the necessary equations (see Appendix, eqns 1–7) is beyond the scope of this Editorial. For curious readers, both Wald and Siegmund [1, 9] provide treatises on the subject. Suffice it to say that the formulae use natural logarithms of the probability ratios to facilitate presentation and interpretation of the data.

Both the CUSUM-test and SPRT have been used to infer a trainee’s competence or incompetence based on whether the plot crosses the lower or upper boundary, respectively [10]. As an example, say we wished to monitor the performance of a trainee at arterial cannulation. Four variables are needed to generate a CUSUM plot. First, as with any other hypothesis test, we must set our α and β values, i.e. decide how concerned we are about mislabelling a trainee as incompetent (α error) or identifying as competent someone who is not (β error). In practice, for clarity in the appearance of CUSUM plots, the α and β values are often chosen to be equal so that the horizontal boundary lines, h1 (lower boundary) for acceptable and h0 (upper boundary) for unacceptable performance, are mirror images either side of zero (see Fig. 1 and Appendix, eqns 1, 2, 6 and 7) [10, 11]. In our arterial line example, we will let α = β = 0.1, i.e. we will accept a 10% chance of wrongly identifying a trainee as incompetent and a 10% risk of identifying competence when it is not present. Second, we need to define our acceptable (p0) and unacceptable (p1) failure rates (see Appendix, eqns 3 and 4); we might decide that p0 = 5% and p1 = 30%. These are all the data we need to define our CUSUM plot. At this stage we can enter the values for these variables into our CUSUM equations (see Appendix), which generate the boundary values (h0 and h1) for the graph, as well as the value s, the size of the step down that occurs after a success, and the value (1 − s), the step up taken after a failure. In this example, we have used a calculator developed by Runcie [6]. Entering the values suggested above generates the following results: the interval between the boundary lines h0 and h1 is 2.10 and the value of s is 0.15. Starting from zero, each success therefore pushes the plot down by 0.15 and each failure pushes it up by 0.85.
By using these figures, we can then plot the sequential performance of arterial cannulation by two virtual trainees (see Fig. 1). The limitations of this plot will be discussed in the following section.
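For readers who wish to check these numbers, the calculation can be reproduced in a few lines of Python (a sketch of our own that maps Wald’s SPRT formulae onto the equation numbers as we read the Appendix; it is not Runcie’s calculator [6]):

```python
import math

# Reproduces the arterial cannulation example (alpha = beta = 0.1, p0 = 0.05,
# p1 = 0.30); variable names follow the Appendix equations.

def sprt_parameters(alpha, beta, p0, p1):
    a = math.log((1 - beta) / alpha)       # eqn 1
    b = math.log((1 - alpha) / beta)       # eqn 2
    P = math.log(p1 / p0)                  # eqn 3
    Q = math.log((1 - p0) / (1 - p1))      # eqn 4
    s = Q / (P + Q)                        # eqn 5: step down per success
    h0 = b / (P + Q)                       # eqn 6: upper boundary
    h1 = -a / (P + Q)                      # eqn 7: lower boundary
    return s, h0, h1

s, h0, h1 = sprt_parameters(alpha=0.1, beta=0.1, p0=0.05, p1=0.30)
# s ≈ 0.146 (quoted as 0.15) and the boundary interval h0 - h1 ≈ 2.10,
# matching the values in the text.
```

Because α = β here, a = b and the boundaries h0 and h1 fall symmetrically either side of zero, as in Fig. 1.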

Figure 1.

 Sequential probability ratio test (SPRT) plot for arterial cannulation by two trainees: Trainee 1 (•) and Trainee 2 (▪). The dotted lines represent the upper boundary (h0) and lower boundary (h1), respectively. Success is displayed as a decrease in the CUSUM, while failure is displayed as an increase. Two additional outcomes for Trainee 2 are modelled by the dashed lines from attempts 14 and 24, respectively.

CUSUM as a tool for the assessment of trainees

There is no doubt that the CUSUM techniques produce interesting results and provide a useful insight into the varying experience required by doctors to learn new procedural skills. Indeed, assessment by CUSUM, or any modification thereof [4], is appealing in that it initially appears to satisfy many of the criteria for utility of assessment, i.e. reliability, validity, acceptability (by trainees and assessors), cost and educational impact (although there are some fairly complex statistics to be calculated first) [12]. However, we argue here that in fact, there is little evidence to support the utility of CUSUM as an assessment tool in medical education.

Assessment within medical education is often described as either formative or summative. Formative assessment seeks to give feedback to learners so that they can be guided towards improved performance, and it has a major educational impact; the importance of well-placed feedback in this learning process cannot be overstated, as its absence will render learning and subsequent performance absent or inefficient [13]. On the other hand, summative assessment seeks to answer a question, e.g. pass or fail, competent or incompetent; learning is of secondary importance in summative assessment.

Reliability refers to the reproducibility of assessment outcomes and represents a major ‘quality index’ of assessment data [14]. In practical terms, in a programme including CUSUM analyses, learners might self-assess their procedural outcomes against a standard of practice to determine whether success was achieved or not, and then plot their CUSUM chart [1]. Self-assessment can be a valuable learning tool. However, factors such as gender and personality traits can affect an individual’s self-perception, such that learners may not be able to compare their performance accurately with a given standard [15, 16]. For example, one trainee’s successful performance of a task may be considered a failure by another. Thus, self-assessment is generally regarded as unreliable except in those who are appropriately trained [15]. Furthermore, self-assessment comprises three stages: a description of what happened; an analysis thereof; and finally, application of what was learned. Learners are rarely able to progress to the analysis and application phases of self-assessment [17]. Therefore, if CUSUM assessment is to provide a measurable educational impact, trainees must be observed, guided and formatively assessed by a senior colleague to ensure that performance flaws are accurately identified and described, and that sensible strategies to overcome them are developed [17]; the increase in the trainer’s workload would be considerable. One could argue that the current raft of workplace-based assessments offers no better solution. However, data exist to support their use as formative assessment tools; they do not require complex statistical calculations, they appear to be acceptable to the majority of trainers and trainees, and they facilitate reflective practice [18–20].

Validity has been defined as ‘the evidence presented to support or refute the meaning or interpretation assigned to assessment results’ [21]. This is commonly reduced to ‘does the assessment measure what we actually need to know?’ and is the cornerstone underlying the move away from reliance on, for example, multiple true/false examination questions, and towards actual clinical practice. To ensure the validity of CUSUM assessment of competence, it will be important to come to an agreed, expert view as to what constitutes success and failure at any specific skill; more exactly, what failure rate is acceptable, and how the seniority of the trainee and the complexity of the task at hand affect the judgement of success or failure. Some of these issues could be addressed by the use of further statistical adjustments [4]. However, these re-calculations still would not address the issues of reliability, validity and educational impact. It is fair to say that it would be an enormous task for the Royal College of Anaesthetists (RCoA) to determine and defend successfully what constitutes specific procedural success and failure, and acceptable and unacceptable failure rates, for the breadth of procedural expertise across basic, intermediate and higher training grades. At first glance, successful arterial cannulation, for example, might appear easy to define but in practice it is not. Defining success might require defining the maximum number of attempts, the time taken, the number of sites used, and so on. It has been suggested that to identify some of these parameters, actuarial expertise and the advice of speciality colleges should be sought, which could represent a significant investment [11].

As a summative assessment tool, the apparent objectivity, and therefore reproducibility and reliability, of tools such as CUSUM techniques is highly attractive. However, CUSUM techniques offer false reassurance [22]. If we consider the CUSUM plot of Trainee 2 in Fig. 1, it crosses the lower, in-control boundary (h1) after the 14th attempt at arterial cannulation and he/she is thus deemed competent. However, despite a similar performance between attempts 14–28, Trainee 1 is not considered competent because three early failures (attempts 2, 4 and 7) pushed his/her plot far beyond the upper, out-of-control boundary (h0). Thus, an unreliable assessment of the competence of Trainee 1 at that particular time point will be made. Indeed, had the plot been returned to zero after the failed 7th attempt, Trainee 1 would subsequently have been deemed competent during his/her run of successful cannulations. On the other hand, if we imagine that Trainee 2 was not so dextrous and consistently failed after attempt 14, we can see that it takes three failures to signal that the trainee is out-of-control, i.e. incompetent. However, if this pattern of failure commenced after attempt 23, it takes four failures to be labelled as incompetent. In other words, previous successful performance can mask subsequent ineptitude. In mitigation, the failure to signal competence or incompetence in a timely fashion may be corrected by returning the CUSUM to zero whenever a boundary is crossed.
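The masking effect can be made concrete with a small sketch of our own (the boundary h0 = 1.05 and step s = 0.15 approximate the worked example; the two starting depths are hypothetical choices, not values read off Fig. 1):

```python
# Hypothetical illustration of masking: the further an SPRT plot has drifted
# below zero on past successes, the more consecutive failures are needed
# before it crosses the upper, out-of-control boundary.

def failures_to_signal(start, h0, s):
    """Count consecutive failures needed to push the plot above h0."""
    c, n = start, 0
    while c <= h0:
        c += 1 - s   # each failure steps the plot up by (1 - s)
        n += 1
    return n

print(failures_to_signal(-1.05, h0=1.05, s=0.15))  # 3, from a fresh h1 crossing
print(failures_to_signal(-1.90, h0=1.05, s=0.15))  # 4, after a longer run of successes
```

Resetting the plot to zero once a boundary is crossed, as suggested above, removes this dependence on accumulated credit.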

A further constraint of CUSUM techniques that has been recognised is that the ‘unbundling’ of a clinical episode into discrete, more objective competencies risks losing the ability to assess the whole task [23]. Therefore, on paper a trainee may appear competent at multiple tasks in isolation – for example, by having an acceptable number of accidental dural punctures or successful arterial cannulations – but his/her overall clinical performance may still be lacking. Furthermore, CUSUM only focuses on the end result of a process, i.e. success or failure, and not on how that result was achieved. Therefore, review of past performance using CUSUM techniques does not allow for assessment of the non-technical expertise that is required for safe, professional practice.

In conclusion, CUSUM techniques were never developed with education in mind, and as such they are a square peg in a round hole. Specifically, they do not satisfy the major requirements of assessment tools in education, those being reliability, validity and educational impact. In addition, their narrow focus and tight definitions of success and failure limit their usefulness. However, if we return to the initial premise of CUSUM techniques as a means of statistical process control, we can begin to consider the potential of CUSUM as a means of monitoring the performance of an anaesthetic department or surgical/anaesthetic teams with respect to patient safety, medical error and quality of care [24–27]. Several parameters of performance within anaesthesia could be assessed and risk-adjusted [26], including measures of patient safety, in-session utilisation, length of stay in recovery, or overnight admissions of day-case patients due to pain or nausea and vomiting [28]. Indeed, with the advent of revalidation it is likely that these datasets of anaesthesia-related outcomes will become increasingly relied upon.

Competing interests

AN, currently Lead Regional Adviser for the RCoA, represents regional advisers on the RCoA Training Committee and has had input into the RCoA 2010 curriculum and RCoA planning for revalidation. He is also Quality Lead and formerly Training Programme Director for the Nottingham and East-Midlands School of Anaesthesia (NEMSA). RM is currently Core Training Programme Director, NEMSA.


Appendix

Formulae for calculation of the CUSUM after Wald [1]:

Equation 1: $a = \ln\left(\dfrac{1 - \beta}{\alpha}\right)$

Equation 2: $b = \ln\left(\dfrac{1 - \alpha}{\beta}\right)$

Equation 3: $P = \ln\left(\dfrac{p_1}{p_0}\right)$

Equation 4: $Q = \ln\left(\dfrac{1 - p_0}{1 - p_1}\right)$

where ln is the natural logarithm, $p_0$ is the acceptable failure rate, $p_1$ is the unacceptable failure rate, $\alpha$ is the risk of a type-1 error, and $\beta$ is the risk of a type-2 error.

Equation 5: $s = \dfrac{Q}{P + Q}$

where $s$ is the downward step with each successful episode and $(1 - s)$ is the upward step with each failed episode.

Equation 6: $h_0 = \dfrac{b}{P + Q}$

Equation 7: $h_1 = \dfrac{-a}{P + Q}$

where $h_0$ and $h_1$ define the upper and lower boundary lines for unacceptable and acceptable performance, respectively; with $\alpha = \beta$, they are mirror images either side of zero.