Measuring the quantitative performance of surgical operating lists: theoretical modelling of ‘productive potential’ and ‘efficiency’


Correspondence to: Dr Jaideep J. Pandit


We previously defined surgical list ‘efficiency’ as: maximising theatre utilisation, minimising over-running, and minimising cancellations. ‘Efficiency’ maximises output for input; ‘productivity’ emphasises total output. We define six criteria that any measure of productivity (better termed ‘quantitative performance’) needs to satisfy. We then present a theoretical analysis that fulfils these by incorporating: ‘speed’ of surgery (with reference to average speeds), ‘patient contact’ (synonymous with minimising gaps between cases), and ‘efficiency’ (as previously defined). ‘Speed’ and ‘patient contact’ together constitute a ‘productive potential’. Our formula satisfies the pre-set criteria and yields plausible results in both hypothetical and real data sets. To be productive in these quantitative terms, teams in any specialty need to achieve minimum quality standards defined by their sub-specialty; to plan their lists to utilise the time available with no cancellations or over-runs; and to work at least as fast as average with minimal gaps between cases. ‘Productive potential’ combined with ‘efficiency’, yielding ‘actual productivity’ in our theoretical analysis, more completely describes quantitative surgical list performance than any other single measure.

With nearly seven million hospital operations performed each year in England and Wales and an annual NHS budget of > £1 billion, operating theatres are a significant expense [1]. As all stakeholders – patients, politicians, staff, managers – seek to maximise benefits from them, proper measurement of processes and outcomes becomes necessary.

Previously, we concluded that surgical operating list ‘efficiency’ is the completion of all the scheduled operations (that is, with no cancellations) whilst utilising properly the time available (that is, no under- or over-runs) [2, 3]. In response, Sanders et al. observed that this notion of ‘efficiency’ was possibly incomplete [4]. It is theoretically possible for two teams to be equally efficient, fully utilising their time without over-runs or cancellations, and yet for one team consistently to complete more work than the other. Thus, there is a need to define ‘productivity’ in addition to ‘efficiency’ to complete the picture of performance [5]. However, it would seem challenging to develop a measure encompassing all surgical teams, undertaking as they do different combinations of procedures.

The word ‘productivity’ has various connotations in economics, engineering, or administration. The Organisation for Economic Co-operation and Development (OECD) offered more than 20 meanings, recognising that ‘productivity’ and ‘efficiency’ are often confused and emphasising that each industry needs to develop its own measures [6, 7]. For surgical operating lists, we therefore use the engineering sense: ‘efficiency’ measures how well the service functions (that is, work output for costs or effort); ‘productivity’ measures the output. It is theoretically possible for a service to be highly efficient but have low total work output, and conversely to work at high output, but extremely inefficiently [4, 5].

An objective measure of productivity would enable healthcare professionals to confirm they work as hard and effectively as reasonable within the time available, and help them achieve this aim. Other stakeholders such as patients, managers or politicians also need to be satisfied that the investments they make and the reliance they place in operating theatres are used as effectively as is reasonable.

Two measures of productivity that have been applied to hospital settings are ‘balanced scorecard’ and ‘data envelopment analysis’ (DEA). The former consists of first listing all relevant, desirable or possible (that is, quantitative and qualitative) measures of performance (such as utilisation, cancellation rates, complication rates, satisfaction, etc) and then assigning to each a score. While this encompasses a wide range of performance criteria, the scoring systems are necessarily arbitrary and weighting factors need to be employed, which may not be the same across all hospitals or even across different surgical teams [8, 9].

DEA is a very detailed mathematical process that attempts to quantify the performance of teams or units across a range of chosen indices with reference to the ‘best performing units’. However, the analysis requires conceptualisation of abstract ‘efficient frontiers’ and complex linear programming. Although some authors have argued for the validity of DEA in the context of healthcare [10, 11], it does suffer from important limitations. First, DEA yields results only relative to the currently best-performing teams, i.e., there is no absolute reference point or ideal. Thus a team only knows its performance if all other teams have themselves been analysed by DEA: it is not meaningful to conduct DEA for one team in isolation. Second, the mathematical complexity of DEA means its results are often expressed in terminology that makes it difficult to strive for a simple goal. Applying DEA and solving the required mathematical equations is generally outside the scope of most surgical-anaesthetic teams.

The broad aim of this paper is to develop a simpler, yet still rational alternative to balanced scorecard or DEA approaches to measure surgical productivity. In a step-by-step approach we (1) first define the core criteria that any ideal measure needs to satisfy. Then (2), we develop a theoretical measure that fulfils these. Finally (3), we assess the utility of this theoretical measure by application to hypothetical and real data sets. The step from (1) to (2) is a logical process: that is, if the criteria are acceptable, then the measure is deducible from them. However, the step from (2) to (3) is a practical one: i.e., a test of the measure’s ability to apply in practice. Thus, ours is primarily a theoretical analysis, which we extend to apply meaningfully to real surgical lists.



The duration of an operation is the time from the start of anaesthesia to the time the anaesthetist completes handover to recovery staff. The gap time is the sum of the ‘idle times’ between completion of handover of one patient to recovery staff and the start of anaesthesia in the next patient. The actual duration of a list is the time from the start of anaesthesia in the first patient on the list to the arrival of the last patient in recovery (minus any established breaks for rest or lunch). The scheduled duration of a list is the time available for the list (for example, lists are usually scheduled for a half day, 4 h, or a full day, 8 h). A list over-run occurs when the actual duration of the list exceeds its scheduled duration. A list under-run occurs when the actual duration of the list is less than its scheduled duration. An over-booked list is one where the sum of estimated durations of individual operations (plus any expected gap time) on that list exceeds the scheduled duration of the list. An under-booked list is one where this sum of estimated durations is smaller than the scheduled duration of the list.

Efficiency as used throughout this paper specifically refers to the notion of a team utilising its scheduled list duration fully, without over-running or cancellation. Quantitative performance is a general term we use to describe the overall ‘output’ or ‘actual productivity’ of the surgical list, distinct from quality measures such as patient satisfaction, communication, teamwork, etc (accepting implicitly that these can contribute indirectly to overall performance). Productive potential is a term used to refer to the activity (speed and degree of active anaesthetic-surgical contact with the patient), which ‘enables a team to be productive’. We explain these concepts further below.

Core criteria that an ideal measure of quantitative performance needs to satisfy

The first step of our analysis was to define the general criteria that any mathematical descriptor needs to satisfy. Where two teams undertake the same operation (for example, knee replacement) without over- or under-run or cancellations, then an acceptable measure of their relative productivity is simply the number of operations undertaken on the list (‘operations per hour’). However, this is not a universal measure as teams rarely undertake just one operation and do not all work equally efficiently (one team may complete more cases but cancel more patients). It is also biased in favour of shorter operations, just as other measures such as ‘income’ arbitrarily favour the operations priced highest. Avoiding such bias enables comparison of teams across specialties.

These preliminary considerations led us to define – a priori – the following six core criteria for an ideal quantitative performance measure:

  • 1 the measure should not be influenced by casemix: the type of operations conducted – or co-morbidities prolonging surgery commonly present in associated patient groups – should not influence whether a team is regarded as ‘productive’. Thus, intrinsic procedure duration should not be influential (that is, short and long procedures are equally potentially productive) [12];
  • 2 adoption of new techniques to achieve the same surgical aim should not be influential. For example, even after any ‘learning curves’, laparoscopic techniques (argued to improve safety, pain scores or postoperative stay) often take longer to perform [13]. A team which previously completed, say, three open nephrectomies per list and which now completes two laparoscopically is not automatically less productive;
  • 3 notwithstanding point (2), for any given surgical procedure, productive potential should partly be related to the speed with which the operation is completed: the faster the team operates, the more productive it is likely to be. Thus, a team consistently completing a hernia repair in 30 min should correctly be viewed as having a higher productive potential than a team that always takes 1 h undertaking exactly the same procedure. A team wishing to be more productive should rationally try to operate more quickly (so long as this preserves quality of care);
  • 4 the total anaesthetist-surgeon contact time with the patient should be reflected in the measure of quantitative performance. Although other aspects of the surgical process (such as cleaning theatre, stocking sutures, preparing equipment) are undeniably important, maximising anaesthesia-surgical contact minimises any idle gaps;
  • 5 where time savings might be made by improved practices, reducing idle gaps or greater speed, these should only contribute to performance if extra cases are accommodated into the saved time. Earlier finishes in themselves do not yield extra income (which accrues from cases) nor generally save staffing costs (if staff are employed for fixed base units of time);
  • 6 quantitative performance is only meaningful in the context of efficiency (the latter being the notion that all scheduled list time is utilised, with no over-runs or cancellations). The reason is that high levels of apparent ‘productivity’ are easily achieved by overwork and over-expense. Overbooking a 4 h list with, say, five nephrectomies (each taking 2 h), over-running by > 4 h to complete four cases (but cancelling the remaining patient) certainly achieves more work than booking and completing just two nephrectomies within the scheduled time. But – as a matter of policy – it is the latter that is properly better performing.

We emphasise another caveat already alluded to: measuring quantitative performance is only meaningful where predefined quality or safety standards are met. An increasing number of specialties, most notably cardiac surgery, now define minimum standards for risk-adjusted mortality rates [14]. It is a deficiency that few sub-specialties offer such clear standards, but applying productivity measures is inappropriate for teams that fail to meet existing criteria. By analogy, a factory making more televisions than any other is unproductive if its televisions do not work.

A formula describing quantitative list performance

If each of the above criteria is accepted then, logically, it is possible to fulfil them all by the following empirical formula:


in which ‘productive potential’ (as used by Schmenner [12, 15] to refer to ‘flow’ of activity within a service industry) is:
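The text describes productive potential as the product of ‘speed’ and ‘patient contact’. Written out from that verbal definition alone (the typesetting is ours), Equation (2) would read:

```latex
\[
\text{productive potential} = \text{speed} \times \text{patient contact}
\tag{2}
\]
```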


In these formulae, ‘patient contact’, ‘efficiency’ and ‘speed’ are each expressed as decimal fractions, the first from 0–1, the last two can be > 1. Multiplying the result in large square brackets by the constant 111, as shown, yields an index of performance (actual productivity) expressed in %, in which an ‘ideal’ team (one with efficiency, speed and patient contact all equal to 1.0) has an actual productivity of 100%. We now consider each of the terms of the equation in detail.


Efficiency

This is as detailed in our previous paper [2], where:
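The three quantities that make up efficiency are described in the paragraph that follows. On our understanding of the earlier formulation [2], and as an assumption about how the three terms combine rather than a quotation, Equation (3) can be sketched as:

```latex
\[
\text{efficiency} =
\left(
  \text{fraction of scheduled time utilised}
  - \text{fraction of scheduled time over-running}
\right)
\times \text{fraction of scheduled operations completed}
\tag{3}
\]
```

Under this form an on-time list with no cancellations scores 1.0, and under-running, over-running or cancelling each lowers the score.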


The ‘fraction of scheduled time utilised’ means that for a list scheduled for 8 h which finishes in 6 h, this quantity = 0.75 and the ‘fraction of scheduled time over-running’ = 0. The ‘fraction of scheduled time over-running’ means that for a list scheduled for 8 h which over-runs by 2 h, this quantity = 0.25, and the fraction of scheduled time utilised = 1. Thus, the first two terms operate mutually exclusively; a single list cannot be at once both under- and over-utilised. The finish time of the list is the time of arrival of the last patient in recovery; the start time of the list is the scheduled start of anaesthesia for the first patient on the list. The ‘fraction of scheduled operations completed’ means that if four out of five of the patients booked onto the list have their operations (that is, one patient is cancelled), this quantity = 0.80.
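The three worked values in this paragraph can be checked with a short Python sketch. It assumes that the terms combine as (utilised − over-running) × completed, which is our reading of the description and not a formula quoted from the text; the function name `efficiency` and its parameters are hypothetical.

```python
def efficiency(scheduled_min, actual_min, booked, completed):
    """Efficiency as a fraction.

    ASSUMPTION: the three fractions combine as
    (utilised - over_run) * completed; this form is not quoted from the text.
    """
    if scheduled_min <= 0 or booked <= 0:
        raise ValueError("scheduled time and booked cases must be positive")
    # Utilisation is capped at 1; time beyond the schedule counts as over-run,
    # so the two terms are mutually exclusive for any single list.
    utilised = min(actual_min, scheduled_min) / scheduled_min
    over_run = max(actual_min - scheduled_min, 0) / scheduled_min
    return (utilised - over_run) * (completed / booked)


print(efficiency(480, 360, 5, 5))  # 8 h list finishing in 6 h
print(efficiency(480, 600, 5, 5))  # 8 h list over-running by 2 h
print(efficiency(480, 480, 5, 4))  # on-time list, one of five patients cancelled
```

On this assumed form, an under-run of 2 h and an over-run of 2 h on an 8 h list are penalised equally.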

Productive potential

It is the product of ‘speed’ and ‘patient contact’ (Equation (2)), implying that any team consistently working fast with few or no gaps is potentially more productive than a team that is always slower and/or has long gaps between cases. The emphasis is on potential productivity because the factors detailed in the equations will determine if this is translated into actual productivity.


Speed

If an operation normally takes 1 h to complete and a team’s average for this is 30 min, the team is twice as fast as the norm and its relative speed is 2.0 (200%). If the team takes 2 h to complete the procedure, its speed is just 0.5 (50%). For each operation, the relevant formula for speed is thus:
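Consistent with the two worked values above (2.0 for 30 min against a norm of 1 h, and 0.5 for 2 h), Equation (4) can be written as the ratio below. The term ‘reference duration of operation’ is the one used in the text; ‘actual duration of operation’ is our wording.

```latex
\[
\text{speed} =
\frac{\text{reference duration of operation}}{\text{actual duration of operation}}
\tag{4}
\]
```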


To obtain the speed for the team, the above relative speeds for individual operations are averaged across all operations undertaken. Thus if a team undertakes n1 operations with a relative speed X and n2 operations at relative speed Y in the time period under consideration, then:
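Using the symbols of the sentence above (n1 operations at relative speed X and n2 at relative speed Y), the averaging step can be written as follows; the extension to more than two types of operation is assumed to follow the same pattern.

```latex
\[
\text{team speed} = \frac{n_1 X + n_2 Y}{n_1 + n_2}
\tag{5}
\]
```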


The source of the ‘reference duration of operation’ in Equation (4) is not prescribed: it could variously be national databases, published literature [16–18], or local data (such as the team’s own past operating times). Calculating average speed means that prolonged surgery due to occasional patient co-morbidity has little influence on ‘team speed’. Co-morbidity inherent to the type of surgery (such as cardiac disease needing invasive monitoring in cardiac surgery) is automatically taken into account by the longer reference times for those operations. However, if data are unobtainable for reference times, it is also possible to assume that the team operates at ‘average’ speed, with a value of 1.0.
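A minimal Python sketch of the speed calculation follows, assuming that relative speed is the reference duration divided by the actual duration and that team speed is the mean of the per-operation values. The function names are hypothetical, and the fallback to 1.0 mirrors the suggestion above for teams without reference data.

```python
def relative_speed(reference_min, actual_min):
    """Relative speed of one operation: reference time / actual time."""
    if actual_min <= 0:
        raise ValueError("actual duration must be positive")
    return reference_min / actual_min


def team_speed(operations):
    """Mean relative speed over (reference_min, actual_min) pairs.

    With no usable data the team is assumed to work at 'average' speed (1.0).
    """
    speeds = [relative_speed(ref, act) for ref, act in operations]
    if not speeds:
        return 1.0
    return sum(speeds) / len(speeds)


print(relative_speed(60, 30))             # 2.0: twice as fast as the norm
print(relative_speed(60, 120))            # 0.5: half the normal speed
print(team_speed([(60, 30), (60, 120)]))  # 1.25
```

Because each operation contributes one value, a single prolonged case on a long series of lists shifts the mean only slightly.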

Patient contact

This is the time actually spent conducting anaesthesia and surgery, expressed as a proportion of the total actual list time. If a list started promptly at 09.00 h and the last patient arrives in recovery at 17.00 h the actual list duration has been 8 h. If six operations each of which lasted 1 h were completed, the patient contact is thus 6/8 = 0.75 (that is, 75% of the list time was spent with anaesthetist and/or surgeon in contact with the patient). The formula is:
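From the worked example above (6 h of operating in an 8 h list giving 0.75), Equation (6) can be written as below. The wording of the numerator and denominator is ours; the denominator is taken to run from the scheduled start of the list to the arrival of the last patient in recovery.

```latex
\[
\text{patient contact} =
\frac{\text{sum of durations of all operations on the list}}
     {\text{actual duration of the list}}
\tag{6}
\]
```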


The converse of patient contact index is the ‘gap time’; in the example above, 25% of the list involved no patient contact.

Equation (6) correctly recognises late starts as gaps. In the example above, if the scheduled list start time was 08.00 h, then the denominator in Equation (6) is 9 h (the first hour was a ‘gap’), and patient contact is 6/9 or 0.67 (67%), with a corresponding gap time of 33%. In addition to reducing gaps, patient contact could be increased by (a) maximising cases on a list (which should also increase utilisation and so increase efficiency); (b) parallel processing of anaesthesia (as we describe further below), or (c) operating more slowly (a strategy which is perverse, and which would adversely influence the ‘speed’ term of the formula).
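The effect of a late start on patient contact can be illustrated with a small Python sketch. It assumes that the denominator runs from the scheduled start to the arrival of the last patient in recovery; the function name and the representation of clock times as minutes after midnight are hypothetical choices, and early starts and rest breaks are not handled.

```python
def patient_contact(operation_minutes, scheduled_start, last_arrival):
    """Fraction of list time spent in anaesthetic or surgical contact.

    Times are minutes after midnight. Measuring from the scheduled start
    means that a late start is counted as a gap.
    """
    list_duration = last_arrival - scheduled_start
    if list_duration <= 0:
        raise ValueError("last arrival must be after the scheduled start")
    return sum(operation_minutes) / list_duration


six_one_hour_cases = [60] * 6
prompt = patient_contact(six_one_hour_cases, 9 * 60, 17 * 60)  # scheduled 09.00
late = patient_contact(six_one_hour_cases, 8 * 60, 17 * 60)    # scheduled 08.00
print(f"{prompt:.2f} {late:.2f}")  # 0.75 0.67
```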

Overview of the quantitative performance index

Equation (1) reflects that the potential to be productive (embedded in a team’s operating speed and degree of contact with the patient) is translatable into actual productivity by an efficient system. We can draw an analogy with oxygen partial pressure, which can only be translated into actual oxygen content of blood if there is sufficient haemoglobin.

The six criteria defined and Equation (1) have some analogy with measuring performance quantitatively in companies that undertake skilled, complex tasks (for example, antique clock restoration [19, 20]), as opposed to unskilled, repetitive tasks. High productivity results when the company estimates its workload accurately, for example by accepting enough clocks for repair to fill its time, but not too many that it needs to cancel or postpone orders or pay staff overtime. This is akin to sensible booking of patients onto operating lists. Staff should spend as much of their time working on the clocks, rather than in gaps or breaks (a notion akin to maximising patient contact). Finally, for any given clock, staff should ideally take no longer than the average time in repair for the complexity of the task. This is akin to our notion of speed.

Measuring quantitative performance in hypothetical and real data sets

We applied the formulae described above to some hypothetical but realistic scenarios to assess if they fulfilled the criteria required. We applied Equation (1) to the data from five arbitrarily-selected surgical teams in the region (three gynaecology, one urology and one cardiac team). For each list we identified the scheduled and actual start time, the number of cases scheduled and completed (and so the cancellation rate), the duration of each case performed (which we compared against mean/median published duration of those cases [12, 16–18] using Equations (4) and (5)), and the time of the last patient’s arrival in recovery. These data enabled us to apply Equations (1)–(6). For each team we examined ten consecutive lists. We combined both half-day and full-day lists for the teams as we [2, 15] and others [3] have previously shown that timing data from such lists are proportional and equivalent.

These applications of the formulae to the hypothetical and real data sets were not a test of the ability of the formulae to meet the predefined criteria. Rather, they were designed to probe the limits of the formulae when applied to practical scenarios. We were especially interested in discovering any internal contradictions created by applying the formulae (for example, a list with a high cancellation rate nevertheless appearing to perform well).


Graphical representation of quantitative performance

Figure 1a shows the theoretical distributions for the index of quantitative performance as a family of parallel efficiency curves when plotted against a wide range of productive potentials. A list or team represented as a point can move from one efficiency curve to another by changes in its utilisation (that is, over- or under-running) or cancellation rate; it can move along any given efficiency curve by changing its speed and/or patient contact. Thus, performance increases as speed and/or patient contact increase, but asymptotes to a value ultimately determined by the prevailing efficiency.

Figure 1. (a) Plot of Performance Index against Productive Potential as yielded by Equation (1) in text. (b) The curve for 85% Efficiency and vertical line for Productive Potential of 0.9 divides the plot into four quadrants, as discussed in text.

Recalling the analogy with oxygen transport, a rise in oxygen partial pressure only increases content if there is sufficient haemoglobin, or if the starting point is on the steep part of the curve. Similarly in our plot, increasing speed and/or patient contact is very beneficial if prevailing speed/patient contact are low, but less beneficial if these are already high. The mechanistic interpretation of this limiting ‘diminishing return’ is that any time savings through increasing speed or reducing gaps need to be of sufficient magnitude as to be able to accommodate an extra case.

‘Ideal’ or desired performance objectives may be displayed. We previously suggested a minimum desirable efficiency of 85% [2]. If teams are expected to work at least as fast as average (speed = 1.0), and if patient contact should ideally represent > 0.9 (< 10% of list time should be idle gaps) then the productive potential is 0.9 (Equation (2)). Notwithstanding the imprecision of these goals, the efficiency curve and the vertical line representing these divide the graph into four quadrants (of unequal size; Fig. 1b). A team that is efficient and productive lies in the top right quadrant. An efficient team which is not potentially productive will lie in the top left. A potentially productive but inefficient team lies in the bottom right. An inefficient and potentially unproductive team lies in the bottom left. Improvements in efficiency drive the points and curves upwards while increases in productive potential drive the point to the right, along a given efficiency curve.
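The quadrant logic can be sketched in Python as below. This is a simplification under two assumptions of ours: that a point lies above the 85% efficiency curve exactly when the list’s efficiency is at least 0.85, and that the boundaries themselves count as meeting the objective. The function name and default thresholds are hypothetical, taken from the illustrative goals stated above.

```python
def quadrant(efficiency, productive_potential,
             min_efficiency=0.85, min_potential=0.9):
    """Place a list or team in one of the four quadrants of Fig. 1b.

    ASSUMPTION: lying above the 85% efficiency curve is treated as
    equivalent to efficiency >= min_efficiency.
    """
    vertical = "top" if efficiency >= min_efficiency else "bottom"
    horizontal = "right" if productive_potential >= min_potential else "left"
    return f"{vertical} {horizontal}"


print(quadrant(0.90, 1.0 * 0.95))  # efficient and potentially productive
print(quadrant(0.90, 0.7 * 0.95))  # efficient, but slow
print(quadrant(0.75, 0.5 * 0.8))   # inefficient, slow and with long gaps
```

Productive potential is passed as the product of speed and patient contact, so the third call represents a team working at half the expected speed with 80% patient contact.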

Hypothetical scenarios

A hypothetical team operates slowly (50% expected speed) and suffers long gaps (patient contact 80%). It is also inefficient (75%) due to overbooking, over-runs and cancellations. The team’s quantitative performance is low (X1, Fig. 2, index = ∼44%). By simply accepting its own slowness and more realistically scheduling fewer patients onto lists, this team would finish on time, reduce cancellations and move higher to point X2 (Fig. 2). Further gains could then be made by increasing speed and/or reducing gaps (to point X3, Fig. 2). However, from its original situation X1, increasing speed or reducing gaps alone achieves relatively little as movement is along the efficiency curve (point X4 index << X2).

Figure 2. Plots of Performance Index vs Productive Potential for one hypothetical team as discussed in text.

If by adopting a new, inherently slower surgical technique a team continues to complete all its cases and accomplish similar list utilisation as it did before the new technique, then Equation (1) will yield it the same performance, regardless of the number of patients or cases completed on the list (data not plotted). If a team has high operating speed and/or minimal gaps between cases, it will have spare time. Accommodating extra cases into this time will increase patient contact and so increase performance. If, however, the team chooses instead to finish early regularly then Equation (3) will reflect this as consistent ‘under-utilisation’ and low performance. Faster operating per se only yields ‘credit’ by opportunity for more patient contact.

In summary, consideration of hypothetical scenarios yields very plausible and reasonable results which fulfil the criteria defined. These hypothetical scenarios are not dependent upon (or require any knowledge of) the surgical subspecialty of the teams or the details of the operations they perform.

Analysis of real data sets

Table 1 shows one example list and how data were extracted to construct Table 2. If we crudely plot simply the ‘number of cases scheduled per list’ (as do some hospitals, [21]) then Team C appears to perform well, while Teams A, B and E less so (Fig. 3a). Since this measure might be affected by 8 h vs 4 h list allocations or casemix (for example, Team E is cardiac), a perhaps more relevant plot is ‘cases/hour completed’ (Fig. 3b). This increases the apparent performance of all teams except Teams B and E whose performance remains ‘poor’.

Table 1. Example of analysis of one of the urology (half day cystoscopy) lists. The columns show, in turn, the time (scheduled start time = 09.00 h), the operations conducted or activity taking place, operation time, duration of any gaps between cases, the reference times for the operation (from references [16, 17]) and thus the relative speed for the operation. Note that start time = start of anaesthesia; end time = arrival of patient in recovery. Below the table are the calculated indices for this list, using Equations (1)–(6) in text.

Time | Activity | Operation time (min) | Gap (min) | Reference time for operations (min) | Relative speed
09.00 | Late start | | 5 | |
09.05 | Start of cystoscopy | 25 | | 33 | 1.32
| Gap | | 5 | |
09.35 | Start of cystoscopy | 34 | | 33 | 0.97
| Gap | | 1 | |
10.10 | Start of cystoscopy | 45 | | 33 | 0.73
| Gap | | 10 | |
11.05 | Start of cystoscopy | 30 | | 33 | 1.10
| Gap | | 5 | |
11.40 | Start of cystoscopy | 25 | | 33 | 1.32
| Gap | | 5 | |
12.10 | Start of cystoscopy | 40 | | 33 | 0.83

Summary data
Scheduled time for list = 09.00 h–13.00 h: 240 min
Number of patients scheduled/cancelled: 6/0
Late start: 5 min
Under-run: 10 min
Gap time = 31 min = 31/240 = 13%
Patient contact index: 87%
Weighted mean speed: 1.05 (105%)
Productivity index: 0.93 (93%)
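The summary values for this list can be recomputed from the raw columns with a short Python sketch. It assumes relative speed = reference time / operation time, and patient contact = total operating time divided by the time from scheduled start to the last patient’s arrival in recovery; the variable names are hypothetical. The performance index is not recomputed here because it depends on Equation (1).

```python
operation_min = [25, 34, 45, 30, 25, 40]  # six cystoscopies, in list order
gap_min = [5, 5, 1, 10, 5, 5]             # late start plus gaps between cases
reference_min = 33                        # reference time per cystoscopy
scheduled_min = 240                       # 09.00 h to 13.00 h

gap_total = sum(gap_min)                        # 31 min
contact_total = sum(operation_min)              # 199 min
actual_duration = gap_total + contact_total     # 230 min (09.00 h to 12.50 h)
under_run = scheduled_min - actual_duration     # 10 min

patient_contact = contact_total / actual_duration
speeds = [reference_min / t for t in operation_min]
mean_speed = sum(speeds) / len(speeds)
productive_potential = mean_speed * patient_contact

print(gap_total, actual_duration, under_run)  # 31 230 10
print(f"gap fraction      {gap_total / scheduled_min:.0%}")
print(f"patient contact   {patient_contact:.0%}")
print(f"mean speed        {mean_speed:.3f}")
print(f"productive pot.   {productive_potential:.3f}")
```

The unrounded mean speed comes out at about 1.045, which agrees with the tabulated 1.05 to within rounding of the individual relative speeds.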
Table 2. Data from five actual teams (Teams A, C, D in gynaecology; Team B in urology; Team E cardiac: ten consecutive lists for each team).

Team | Start time (min) | Utilisation (%) | Efficiency (%) | Speed term (%) | Patient contact index (%) | Cases completed/cases booked (cancellation rate, %) | Performance index (%)
A | −3 (−8–+6) [−10–+15] | 91 (80–106) [64–127] | 84 (60–93) [48–95] | 117 (91–148) [58–247] | 96 (89–100) [80–100] | 36/39 (7.6) | 84 (47–108) [32–111]
B | +9 (+4–+15) [−5–+23] | 91 (88–99) [86–101] | 89 (88–93) [25–99] | 111 (99–116) [88–122] | 92 (88–97) [67–100] | 41/42 (2.4) | 88 (73–107) [56–128]
C | −8 (−18–+3) [−30–+5] | 88 (65–93) [58–98] | 69 (39–82) [36–89] | 120 (108–153) [84–187] | 93 (89–93) [84–99] | 66/84 (21.4) | 68 (22–84) [14–96]
D | −8 (−11–+1) [−25–+15] | 120 (108–131) [85–144] | 75 (58–83) [45–90] | 78 (69–97) [39–103] | 94 (85–96) [64–98] | 43/46 (6.5) | 63 (23–79) [14–89]
E | 0 (−4–+4) [−5–+20] | 92 (83–95) [55–130] | 85 (73–92) [28–95] | 88 (83–95) [65–124] | 98 (97–99) [72–100] | 19/21 (9.5) | 79 (62–88) [6–95]

Teams A and D are half-day (4 h) lists; Teams B and C are full-day (8 h) lists; Team E is a full-day (10 h) list. Values are medians (interquartile range) [range], except where indicated. Start time is given with reference to the scheduled start time (+ indicates late start, − early start). Efficiency is calculated using Equation (3) in text; Speed using Equations (4) and (5); Patient contact index using Equation (6) and Performance index using Equation (1). Cases completed are shown as a fraction of those booked (with cancellation rate %).

Figure 3. (a) Boxplots of the number of cases booked by each of the actual teams (A–E) from which data were analysed: the horizontal line represents the median, the edges of the box the interquartile ranges, the error bars (where different from the latter) the 10th and 90th centiles, and the symbols any outlier data. On this measure, the rank order of performance is: Team C > D > B > A > E. (b) Boxplots of the number of cases completed per hour on lists of the teams (A–E): On this measure, the rank order of performance is: Team A > D > C > B > E.

Table 2, however, shows more detailed data in which Team B (although weakest on start times) appears to perform considerably better across a range of measures, especially cancellation rate. Team E (though poor on cancellation rate in percentage terms) shows utilisation closest to 100% and acceptable efficiency. It also has the highest patient contact index due to ‘parallel processing’ of anaesthesia, in which anaesthetic induction and invasive monitoring are started while the previous patient’s surgery is being completed [22, 23]. These disparate measures can now be combined meaningfully into a single measure using our formulae.

First, the plots of efficiency vs utilisation (Fig. 4) indicate that Teams B and E (the worst-performing in Fig. 3) generally complete all booked cases on time, with no over-run yet near-full utilisation. Team A (which also performed rather poorly in Fig. 3) is also quite efficient, perhaps with a slight tendency to under-run. The teams performing apparently well in Fig. 3 (Teams C and D) are, in fact, inefficient. Team C suffers cancellations on every single list and also consistently under-runs. It is most probably over-booking its lists and then managing this by prompt cancellation of cases to avoid over-run. This interpretation is supported by its high rates of booking (Table 2). The main problem with Team D is prolonged and consistent over-runs. Like Team C, it likely over-books its lists but manages them with gross over-running to complete the list rather than cancellation.

Figure 4. Plots of Efficiency vs Utilisation of lists (see Equation (3) in text). The ‘triangle’ represents the ‘isopleth’ where no cancellations occur: thus, any point lying on the lines of the triangle represents a list where no cancellation has occurred; any point lying within the triangle represents a list where a cancellation has occurred. The horizontal line represents the desirable objective of at least 85% efficiency [1]. Points < 100% on the x-axis represent under-utilisation of the list; points lying at > 100% on the x-axis represent over-utilisation. Each data point represents a single list.

Figure 5 offers further detail. Teams A, B and E show generally acceptable performance (Team A efficiency is very slightly < 85% threshold; Team E productive potential is slightly < 0.9). For Teams A and B, significant improvements in performance will not be achieved by their working any faster or reducing gaps as they lie on the flat part of the relationship (that is, their speed is already better than average for the operations they perform and the gaps are broadly acceptable). Team A could do better instead by improving its cancellation rate (Table 2). Team E could, by contrast, improve by working slightly faster (though its patient contact time is already high). As a cardiac team it has a low absolute number of cases, so the percentage impact of even the occasional cancellation is magnified in Table 2: nonetheless it is desirable to prevent even these.

Figure 5. Performance Index vs Productive Potential (from Equation (1) in text) for data from each of the real teams A–E (each team’s efficiency is stated in %). As in Fig. 1b, the vertical line and the curve for 85% Efficiency divide the graph into quadrants. The point for a ‘perfect team’ (efficiency 100%, speed 1.0 and patient contact 1.0) is also shown (see text).

Team C’s performance is quantitatively very poor, despite its misleadingly impressive plot in Fig. 3. Like Teams A and B, it also cannot improve by simply working faster or eliminating gaps as it also lies on the flat part of its relationship (that is, its speed is faster than the average and it has few gaps). Since Team C books the most cases onto its lists and has the highest cancellation rate (> 21%), scheduling fewer cases is a rational improvement.

The main problem for Team D may be its low average speed (Table 2), which might be due to teaching or more complex surgical methods needing further investigation. Increasing speed would also reduce the magnitude of over-runs and help improve both efficiency and productive potential.

In summary, neither the hypothetical nor the real data sets revealed any internal contradictions when subjected to our formulae. Instead, the interpretations seem reasonable and could be applied to compare teams within and across specialties.


The six core criteria properly reflect what is required of a quantitative measure of performance, and they apply to any surgical list. We present formulae that fulfil these criteria. Finally we demonstrate that this analysis can apply to a range of hypothetical and actual scenarios.

Comments on the mathematical aspects

To its credit, our measure avoids complex variables or misleading surrogates such as ‘numbers of operations done’. It is, however, possible that we failed to identify additional criteria beyond the six we defined, which should be met by an ideal measure of quantitative performance. The Audit Commission and related bodies suggested ∼100 individual measures of ‘performance’, combining these by a ‘balanced scorecard’ [1, 24–26]. While 100 indicators outnumber six, the complexity of many end-points – and the inevitable arbitrariness of any scorecard – makes the methodology unwieldy, without necessarily increasing its power to describe performance. The real gains of adding further end-points may be small.

Although the individual variables contributing to ‘efficiency’ and ‘productive potential’ are relatively easily obtained, combining these is non-linear, which may limit the ‘popularity’ of our approach. Some complexity is inevitable, as simple combinations (such as product or sum) of efficiency and productive potential just do not yield sensible results. Nonetheless, Equation (1) is much simpler than attempts to incorporate 100 performance measures, or than DEA methods. The non-linearity we propose is no more complex than, say, equations involved in oxygen delivery calculations (including the oxyhaemoglobin dissociation curve).

We do not know if an alternative, simpler formula exists. Strictly, our primary aim was simply to demonstrate that the six criteria could be described mathematically to satisfaction. Even if only a close approximation of what ‘quantitative performance’ means, our formula is a significant advance on the use of colloquial terminology in which efficiency, productivity, utilisation, etc are treated as synonymous.

Limitations and utility of our analysis

It is in the application of our measure, rather than in its mathematical construct, that we encounter certain practical limitations. Perhaps the weakest part of our measure is the accuracy with which ‘speed’ of surgery can be measured. Our regret that so little publicly-accessible data exists on how long operations take to perform is shared by Macario, who argued that lack of data was a key factor in poor scheduling of cases to fit the available time [9]. A hospital could use its own local data sets, but statistical bias will favour the team as its own data generate the reference times. Also, new hospitals may be disadvantaged as they have no historical data. Thus how ‘reference times’ are acquired might critically influence the results of the formula, and ideally each team’s ‘reference times’ must be collected in the same manner or from equivalent sources to make them comparable. For example, they should reflect subtle variants in procedure and significant co-morbidity in patient groups (even very closely-coded operations could differ in duration by as much as 1 h [9]). If the ‘aim’ of the productivity calculation is to compare teams (perhaps within a hospital to generate internal competition or as ‘league tables’ across hospitals) it is essential that the ‘reference times’ are robust. This will be difficult to achieve, but may develop [27] as more precise data is needed to inform changes in hospital funding to activity-based measures such as Payment-by-Results [28].

If, on the other hand, the aim of the productivity score is to engender self-reflective team improvement (akin to a golfer working on a handicap) reference times might usefully be based on teams’ own historical averages. These data may have little or no external validity, but should nonetheless focus local efforts on performance improvement.

One potential criticism of our notion of relative speed is that, by definition, 50% of teams will be ‘below average’. Although potentially contentious (no team likes to be ‘slow’), speed does need to be modelled. We think that in practice actual differences in speed will be quite small because speed is averaged over the whole range of the team’s operations over a sustained period of time. The team will have to be consistently slow by a large margin and if so, then perhaps this is something important to identify. Finally, the 50% cut-off is unreasonable: it is teams lying beyond, say, the 95% or 99% confidence intervals for speed that might reasonably be judged ‘outliers’. At worst (for example, if speed data proved too difficult to obtain), the ‘speed’ element in Equation (1) could be ignored by assigning it a value of 1.0 and so assume that all teams’ speeds are average.
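The distinction between being ‘below average’ and being an ‘outlier’ can be illustrated with a minimal Python sketch. It assumes that each team’s relative speed is already available as a single number (1.0 = average), and it approximates the 95% interval as the mean ± 1.96 standard deviations of the teams’ speeds, which is only one possible reading of the cut-off described above. The function names and the example speeds are hypothetical and are not drawn from Table 2.

```python
from statistics import mean, stdev


def speed_or_neutral(speed):
    """Return the measured relative speed, or the neutral value 1.0 if none exists."""
    return 1.0 if speed is None else speed


def flag_speed_outliers(speeds, z=1.96):
    """Return the teams whose relative speed lies outside mean ± z standard deviations.

    `speeds` maps a team label to its relative speed (1.0 = average speed).
    z = 1.96 approximates a 95% interval; z = 2.58 approximates 99%.
    """
    values = list(speeds.values())
    centre, spread = mean(values), stdev(values)
    low, high = centre - z * spread, centre + z * spread
    return {team: s for team, s in speeds.items() if s < low or s > high}


# Hypothetical relative speeds for ten teams (illustrative values only).
example_speeds = {
    "T1": 1.00, "T2": 1.02, "T3": 0.98, "T4": 1.01, "T5": 0.99,
    "T6": 1.03, "T7": 0.97, "T8": 1.00, "T9": 1.00, "T10": 0.60,
}

outliers = flag_speed_outliers(example_speeds)
print(outliers)  # only T10 is flagged; T3, T5 and T7 are below 1.0 but not outliers
```

In this sketch several teams work slightly slower than average, yet only the team that is slow by a large margin is identified, which is the behaviour argued for above.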

We cannot factor in quality criteria any more precisely than by emphasising that teams must meet any minimum levels set by their own sub-specialty. We cannot stipulate that a complication rate of, say, 3% is inherently better than, say, 5% because sub-specialties themselves (at best) only define a minimum threshold or range for such measures (for example, ‘complication rates should be < 6%’) [29, 30]. It is beyond the scope of this paper to set quality criteria for specialties, but automatically assigning ‘zero performance’ to any team that fails to meet its own specialty’s standards ensures that teams are not encouraged to place speed above quality.

Our notion of ‘patient contact’ is less novel than it might first seem. The American Association of Anesthesia Clinical Directors describe the proportion of time on a list that a patient is physically in the operating room (including induction of anaesthesia) as a ‘productivity index’ [31]. A small bias exists as patient contact (or conversely gap time) is related to the number of cases performed on a list. Thus, a list consisting entirely of one long (8 h) case has by definition zero gaps, in contrast to a list of many small cases where gaps are inevitable. But, as we discuss later, focussing on long cases introduces its own problems for achieving high productivity by our measure, such that the two effects broadly balance each other.
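As a hedged illustration of this proportion, the following Python sketch computes patient contact from the times at which each patient enters and leaves the operating room. It assumes that the denominator is the time from the first patient’s entry to the last patient’s exit (so that a gap at the end of the list counts against utilisation rather than patient contact); the function name, the minute-based representation and the example lists are hypothetical.

```python
def patient_contact(cases):
    """Fraction of the used list time during which a patient is in the operating room.

    `cases` is a list of (start_min, end_min) pairs, in chronological order and
    non-overlapping, measured in minutes from the start of the list.
    """
    if not cases:
        raise ValueError("at least one case is required")
    in_room = sum(end - start for start, end in cases)
    used_span = cases[-1][1] - cases[0][0]  # first patient in to last patient out
    return in_room / used_span


# One long 8 h case: by definition there are no gaps between cases.
single_long_case = [(0, 480)]

# Four shorter cases separated by 10 min gaps (illustrative values only).
several_short_cases = [(0, 110), (120, 230), (240, 350), (360, 480)]

print(patient_contact(single_long_case))     # 1.0
print(patient_contact(several_short_cases))  # 0.9375
```

The comparison shows the small bias described above: the list of short cases cannot reach 1.0 however well it is run, although the shortfall is modest.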

Our assumption that all relevant operation times are taken from the start of anaesthesia to the arrival of patient in recovery is consistent with the majority of reports in the field [2, 16–18, 32]. While some authors have argued in favour of simply ‘surgical’ or ‘cutting’ time as the relevant measure [33], this clearly ignores essential components of the surgery such as anaesthesia and positioning.

It is implicit in our analysis that teams have the constructive aim of improving performance, rather than the destructive desire of simply ‘gaming’ the system. All measures of system performance (such as waiting lists, waiting times, utilisation, cancellation rates, efficiency, etc) are open to cynical manipulation or frank fraud [34] and our index is no exception. Our formulae penalise a large ‘gap’ at the end of a list somewhat more than they do a gap of equal magnitude in the middle of the list (utilisation is weighted more than patient contact). Therefore, a team repeatedly suffering this ‘penalty’ might conceivably transfer a gap from the end of the list to the middle and therefore finish ‘on time’. This emphasises that our measure should not be used as a means of benchmarking activity or as a basis for reward or punishment [35]. Instead it can only correctly be used as a means of identifying areas for improvement. Indeed, we anticipate that our measure will more commonly identify teams (such as Team B; Table 2) which are performing well and help defend them against misclassification as ‘poor performers’ or against misguided attempts at further performance improvement.

Interpreting the quantitative performance index

Whether the graphical presentation of our formulae adds more insight than simple perusal of the raw data as spreadsheets (for example, Table 2) probably depends on the experience of the analyst. An experienced physiologist need only glance at individual values for haemoglobin, saturation, partial pressure, cardiac output, etc, to understand an aberration in oxygen delivery. But this experience itself derives from knowledge of the inter-relationships. A novice on the other hand often finds it helpful to perform the calculations consciously step-by-step (for example, starting with a plot of oxygen saturation vs partial pressure, etc). Similarly, our process formally describes a method of how to analyse the data in Table 2.

Even experienced analysts may be misled by cursory inspection of spreadsheets. From Table 2 it does not seem obvious that the late starts of Team B are in fact trivial; or that Team A’s cancellation rate (though needing improvement) little impairs overall performance; or that slow operating speeds of Team D (but not Team E) are significant; or that the cancellation rate of Team E is simply a function of a small absolute number of cases, as expected in a cardiac team.

Targets or objectives can result in perverse outcomes. A well-intentioned policy of never over-running theatres may lead to an undesirable increase in cancellations [36]. A focus on operating speed may raise complication rates. Maximising theatre utilisation alone may lead to over-run and/or cancellation. In contrast to single aims, our formula balances relevant variables to provide a summary statistic, without creating perverse incentives.

Our analysis can only be meaningfully applied in healthcare systems that are fundamentally rational and subscribe to the six core criteria we defined. Hospitals are free instead to choose alternative priorities. If the only outcome judged important were, say, completing as many cases as possible (regardless of case type), or to start on time (regardless of how subsequent time is utilised), then our analysis is irrelevant to achieving such isolated objectives.

Consistency of our formula with other data

Our approach is consistent with – and extends the results of – a number of other studies. In our model, late starts have only modest influence on performance. We were surprised to find that so many lists started on or before time (Table 2). Although often blamed for inefficiencies, objective studies indicate that factors such as overbooking dwarf any measurable impact of late starts [4, 5, 37, 38]. Starts late by as much as 15 min have no discernible impact [39–41] and some authorities have suggested that even delays of < 45 min be classed as acceptable [9]. We find this last conclusion surprising, but it underlines the difficulty in showing objectively that late starts have a detrimental impact in the face of other, more influential adverse factors [40–42].

Similarly, by our formulae gaps between cases have only a modest effect on performance. Gaps are subjectively felt to be important [37] but objective analysis suggests that gap (turnover) times are in practice modest, so that even eliminating these has little overall effect [38, 42–45]. Some authors have gone as far as describing analysis of turnover times as ‘meaningless’, implying that the effort is rarely worth the expense [46]. We found that even the poorest performing teams (C and D in our data) had gap times < 10% of scheduled list time (Table 2) which is similar to that reported by Cook [38] and the ideal objective suggested by Macario [9]. This is close to the Audit Commission’s suggested target for gaps of < 8% of list time, and represents a maximum of ∼40 min [1, 24–26]. Even if all this time were ‘saved’ it is difficult to see which procedures could be comfortably accommodated. Thus, the way we modelled gap/patient contact time appropriately reflects the limited impact of reducing modest gap times, while at the same time reflecting the utility of reducing very large gap times.
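The arithmetic behind these percentages can be checked with a short Python sketch. The list durations used are assumptions chosen for illustration (an 8 h list is the case that reproduces the figure of ∼40 min), and the function name is hypothetical.

```python
def gap_minutes(list_hours, gap_fraction):
    """Minutes of gap time implied by a given fraction of scheduled list time."""
    return list_hours * 60 * gap_fraction


# Gap time implied by the 8% target and the 10% ceiling for several list lengths.
for hours in (4, 8, 10):
    at_target = round(gap_minutes(hours, 0.08), 1)
    at_ceiling = round(gap_minutes(hours, 0.10), 1)
    print(f"{hours} h list: 8% = {at_target} min, 10% = {at_ceiling} min")
```

For an 8 h list the 8% target corresponds to about 38 min in total, spread across all the gaps on the list, which is why eliminating it entirely rarely frees a usable block of time.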

Previous work has indicated that faster operating can improve throughput of cases, but only relatively modestly [47, 48], which is consistent with the notion that the lists in these studies lay near the top of the asymptotic curve (Fig. 1).

Comparison with economic measures of productivity

It is beyond the scope of this article to consider our notion of productivity in the context of formal economic measures. Nonetheless, there are some parallels. Sumanth distinguished ‘efficiency’ (that is, how well the business works) from ‘production’ (that is, how many goods are produced) [49]. Scott argued that ‘productivity’ and ‘profitability’ should be distinct, as the latter can be influenced by demand and complex pricing issues (for example, price-fixing) [50]. By avoiding the latter, our approach seems consistent with Scott’s view. Our graphical plot of Equation (1) closely resembles an asymptotic function used by Wen in plotting ‘business productivity’ against ‘input’. Wen demonstrated a family of curves (similar to our efficiency curves), each representing a discrete pattern of economic growth or institutional organisation [51]. Increasing ‘input’ (for example through faster or greater work) could increase business productivity only up to the curve’s asymptote. Further increases in productivity could then only be achieved by major structural change in the business (or as we describe it, ‘efficiency’).

Summary: how teams can perform well

To perform well in quantitative terms by our measure, teams need to achieve the following: first, to satisfy any minimum quality or safety criteria defined by their own sub-specialty; next, to plan lists to utilise the time available, with no cancellations or over-runs; then – and best only then – to focus on working at least at average speeds with minimal gaps between cases, using any time saved to accommodate extra cases. These appear very reasonable aspirations, independent of type of surgery, institution or healthcare system in which the work takes place.

Fitting in an extra case may be (inevitably) somewhat easier for surgical subspecialties where short procedures are common than it is for those with long cases. However, this does not bias against the latter, for the following reason. Our measure is able to distinguish a well- from a poorly-performing cardiac team. We can also distinguish a well- from a poorly-performing day case hernia team. Furthermore, we can make direct comparisons of any of these four teams. However, it is generally more difficult for the already well-performing cardiac team to perform even better, than it is for an already well-performing hernia team. This is because (as implicit in our measure) cardiac teams generally need to save > 2 h of time to accommodate an extra case, while hernia teams need to save only ∼30 min. The bias is not large because both teams, already performing well, lie on the flat part of the performance curve (Fig. 5). So any real gains for the hernia team over the cardiac team are necessarily small. Furthermore, it is possible for a cardiac (or any other) team to adjust the duration of their list (for example, from 8 h to 10 h etc) to accommodate an optimal number of cases [3].
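The asymmetry between long-case and short-case teams can be made concrete with a small Python sketch. The case durations are illustrative assumptions based on the approximate figures quoted above (just over 2 h for a cardiac case, about 30 min for a hernia repair), and the function name is hypothetical.

```python
def extra_cases(time_saved_min, typical_case_min):
    """Number of whole additional cases that fit into the time saved."""
    return int(time_saved_min // typical_case_min)


CARDIAC_CASE_MIN = 130  # assumption: 'more than 2 h'
HERNIA_CASE_MIN = 30    # assumption: 'about 30 min'

for saved in (40, 150):
    print(
        f"{saved} min saved: "
        f"{extra_cases(saved, HERNIA_CASE_MIN)} extra hernia case(s), "
        f"{extra_cases(saved, CARDIAC_CASE_MIN)} extra cardiac case(s)"
    )
```

Under these assumptions a saving of 40 min allows one extra hernia repair but no extra cardiac case, whereas the cardiac team benefits only once the saving exceeds the length of one of its own cases.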

It is, however, similarly easy for currently poorly-performing cardiac and hernia teams to improve as both these teams lie on the steep part of their curves (Fig. 1); that is, the ‘poor performance’ of the cardiac team implies that a time saving of ∼2 h is realistic. Thus, the real value of the asymptotic function is to emphasise that improvements in performance are reasonably confined to those teams lying on the steep part of the performance curve (Fig. 1).

Conclusion and areas for future work

If it is desirable to develop a measure of quantitative performance of a surgical list that applies to all surgical subspecialties, rather than being specific to only some, then our six criteria define what that measure needs to do.

Equation (1) fulfils these defined criteria and we have shown it is useful when applied to actual and hypothetical surgical lists, with no internal contradictions in the results. There are clear areas for further work. First, subjecting data from more operating lists from a variety of centres to our measure will reveal any unforeseen weaknesses or features that require modification. Second, wider publication of operating times will assist in the development of ‘reference times’. Finally, more accurate information on costs involved in each team’s operating lists will enable testing of the hypothesis that our index of performance is related to financial success. Any disparity between these two might indicate perverse incentives within the system, which would be important to rectify.

Science works by future work adapting past hypotheses. We hope to stimulate readers to offer alternatives as better descriptors of productivity. Perhaps crude in some minor details, our formula seems to work rather well for a range of hypothetical and real lists. We suspect any improvement in accuracy will require very complex mathematics, but we think the only way of finding out if this is so is for others, on viewing our analysis, to set about trying to improve it.


We thank Professor Alex Macario, Professor of Anesthesia & Health Research Policy, Stanford University, CA, USA and Mr Andrew Vincent, Director, Medicology Ltd, Specialists in Organisational Performance Through People, for their helpful comments on the manuscript.