2.1. A typology of performance measurement
Hood (2007) has described three types of system of performance measurement: measurement used as general intelligence; measurement in relation to targets; and measurement aggregated so that organizations can be ranked. This section considers the development of each type and its application to health care.
Intelligence systems have a long history in health care, going back to the publication of Florence Nightingale's analyses of hospital mortality rates (Nightingale, 1863). Since the 1990s, following technological advances in computing and the Web, there has been an explosion in the development of intelligence systems publishing clinical outcome indicators, but without resolving the problems that were identified in Florence Nightingale's analyses. Spiegelhalter (1999) pointed out that she clearly foresaw, about 130 years earlier, the three major problems identified by Schneider and Epstein (1996) in their survey of the publication of surgical mortality rates:
‘the inadequate control for the type of patient, data manipulation and the use of a single outcome measure such as mortality’.
Iezzoni (1996) pointed out that issues raised in the debate following publication of Nightingale's hospital mortality rates echo those which are cited frequently about contemporary efforts at measuring provider performance. These included problems of data quality, measurement (risk adjustment and the need to examine mortality 30 days after admission rather than in-hospital deaths), gaming (providers avoiding high risk patients because of fears of public exposure) and the public's ability to understand this information. These unresolved problems meant that, in the 1990s, there was intense and polarized debate about the benefits of publication of information on hospital performance, with different camps describing this as essential, desirable, inevitable and potentially dangerous (Marshall, Shekelle, Brook and Leatherman, 2000).
Although the use of targets also has a long history (Barber, 2007; Hood, 2007), Hood identified New Zealand as pioneering the comprehensive introduction of a system of target setting across government in the 1980s. This offered a model for the Blair government which, following its election in 1997, introduced the systematic setting of public service agreement targets, as part of an implicit contract between Her Majesty's Treasury and spending departments for their budgets for public services (James, 2004). (It was during this period that both England and Wales introduced the category A 8-minute target for ambulance trusts.)
In the early 1990s (before devolution), the Conservative government introduced the ranking system of league tables of performance of schools in terms of examination results, across the countries of the UK (West and Pennell, 2000; Department for Education and Skills, 2004). Hood (2007) has argued that what was distinctive and novel in the Blairite approach to performance measurement of public services was the development of government-mandated ranking systems. The differences between approaches to performance measurement in the UK are illustrated by the decisions, following devolution, of the government in England to maintain the publication of school league tables, and of the governments in Wales and Scotland to abandon their publication (Hood, 2007). Bird et al. (2005) observed that school league tables are published in some states in the USA (California and Texas), but there has been legislation against their publication in New South Wales in Australia and in the Republic of Ireland. For the NHS in each country, the government in England introduced the ranking system of star ratings, whereas the governments in Wales and Scotland, in developing performance measurement, deliberately eschewed published ranking systems.
2.2. Comparisons of systems of hospital performance measurement
Using resources for performance measurement, rather than delivery, of health care can only be justified if the former has an influence on the latter: there is little justification on grounds of transparency alone if this has no effect. Spiegelhalter (1999) highlighted criticism by Codman (1917) of the ritual publication of hospital reports that gave details of morbidity tables and lists of operations which were intended to ‘impress the organisations and subscribers’ but were ‘not used by anybody’. The first systematic review of evaluations of systems of performance measurement, by Marshall, Shekelle, Brook and Leatherman (2000) and Marshall, Shekelle, Leatherman and Brook (2000), commented on the contrast between the scale of this activity and the lack of rigorous evaluation of its effects. A recent systematic review (Fung et al., 2008) made the same point, emphasizing that the studies they had identified still focused on the same seven systems that had been examined by Marshall, Shekelle, Brook and Leatherman (2000); in particular on the cardiac surgery reporting system (CSRS) of New York State Department of Health. These systematic reviews produced evidence that enables us to examine three pathways through which performance measurement might result in improved performance. The first two of these, the change and selection pathways, were proposed by Berwick et al. (2003) and used by Fung et al. (2008). The change pathway assumes that providers are knights: that simply identifying scope for improvement leads to action, without any incentive other than the provider's innate altruism and professionalism; thus there is no need to make the information available beyond the providers themselves. As Hibbard (2008) observed, the evidence suggests that this is a relatively weak stimulus to action. This finding was anticipated by Florence Nightingale in the 1850s, when she sought to convey to the government the urgent need to improve the living conditions of army barracks in peacetime: her statistical analysis, comparing mortality rates with those of the civilian population outside, showed that these conditions were so appalling that
‘1,500 good soldiers are as certainly killed by these neglects yearly as if they were drawn up on Salisbury plain and shot’.
She continually reminded herself that ‘reports are not self executive’ (Woodham-Smith (1970), pages 229–230). The selection pathway assumes that providers respond to the threat of patients, as consumers, using information in selecting providers; but the systematic reviews by Marshall, Shekelle, Brook and Leatherman (2000), Marshall, Shekelle, Leatherman and Brook (2000) and Fung et al. (2008) found that patients did not respond as consumers in this way. In presenting the findings from the latest systematic review, at a seminar at the Health Foundation in London in January 2008, Paul Shekelle observed that many of these studies were in the USA and showed that patients there did not use this information as consumers; if that response has not materialized in the USA, with its emphasis on markets, then it is highly unlikely to be observed in other countries.
The systematic review of the evidence of effects of performance measurement systems by Fung et al. (2008) suggests that neither of the two pathways that were proposed by Berwick et al. (2003) for these systems to have an influence is effective. Hibbard (2008) has argued, however, that a third pathway, designing performance measurement that is directed at reputations, can be a powerful driver of improvement. She has led research for over a decade into the requisite characteristics for a system of performance measurement to have an effect (see, for example, Hibbard et al. (1996, 1997, 2001, 2002, 2003, 2005a, b, 2007), Hibbard and Jewett (1997), Hibbard and Pawlson (2004) and Peters et al. (2007)). Hibbard et al. (2002) showed, in a controlled laboratory study, that comparative performance data were more likely to be used if they were presented in a ranking system that made it easy to discern the high and low performers. Hibbard et al. (2003) proposed the hypothesis that, for a system of performance measurement to have an effect, it needs to satisfy four requisite characteristics: it must be
- (a) a ranking system,
- (b) published and widely disseminated,
- (c) easily understood by the public (so that they can see which providers are performing well and poorly) and
- (d) followed up by future reports (that show whether performance has improved or not).
Hibbard et al. (2003, 2005b) tested this hypothesis in a controlled experiment based on a report that ranked the performance of 24 hospitals in south central Wisconsin in terms of quality of care. This report used two summary indices of adverse events (deaths and complications): within broad categories of surgery and non-surgery; across three areas of care (cardiac, maternity, and hip and knee). The report showed material variation (unlike the insignificant differences in ranking in league tables) and highlighted hospitals with poor scores in maternity (eight) and cardiac care (three). The effects of reporting were assessed across three sets of hospitals: public report, private report and no report. For the public report set, a concerted effort was made to disseminate the report widely to the public: the report was available on a Web site; copies were inserted into the local newspaper, distributed by community groups and made available at libraries; the report attracted press coverage and generated substantial public interest. For the private report set, the report was supplied to managers only; the no-report set was not supplied with the report. This research design enables comparisons of the effects of the three pathways. If the change pathway were powerful, then there ought to be no difference between the public report and private report hospitals, but the public report set made significantly greater efforts to improve quality than the other two sets (Hibbard et al., 2003, 2005b). The managers of hospitals in the public report set discounted the importance of the selection pathway: they did not see the report as affecting their market share (Hibbard et al., 2003). Later analysis showed that these managers were correct:
‘There were no significant changes in market share among the hospitals in the public report from the pre to the post period … no shifts away from low-rated hospitals and no shifts toward higher-rated hospitals in overall discharges or in obstetric or cardiac care cases during any of the examined post-report time periods’
(Hibbard et al., 2005b). The reputation pathway, however, was crucial: the managers of hospitals in the public report group that had been shown to be performing poorly took action, because of their concerns over the effects of the report on their hospitals’ reputations. We now undertake two further tests of the hypothesis that, for a system of performance measurement to have an effect, this needs to be via the reputation pathway, through two comparisons, each between two hospital performance measurement systems, with reference to Hibbard's four requisite characteristics.
The first comparison is between two systems of reporting clinical outcome indicators. One is the much-studied CSRS of New York State Department of Health, which began in 1989 as the first statewide programme to produce public data on risk-adjusted death rates following coronary artery bypass graft surgery, and is the longest-running programme of this kind in the USA (Chassin, 2002). The other is the annual reports from the Clinical Resource and Audit Group (CRAG) in Scotland, which, when they began in 1984, were at the forefront in Europe of public disclosure of such information (Mannion and Goddard, 2001; Clinical Resource and Audit Group, 2002).
The CSRS produces annual reports of observed, expected and risk-adjusted in-hospital 30-day mortality rates, by hospital and surgeon. Green and Wintfield (1995) observed
‘CSRS became the first profiling system with sufficient clinical detail to generate credible comparisons of providers’ outcomes. For this reason, CSRS has been recognized by many states and purchasers of care as the gold standard among systems of its kind.’
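The relationship between the observed, expected and risk-adjusted rates that such reports present can be sketched as follows; this is the standard indirect standardization construction, and the details of the CSRS's own risk model are not reproduced here:

\[
\mathrm{EMR}_h \;=\; \frac{1}{n_h}\sum_{i=1}^{n_h}\hat{p}_i ,
\qquad
\mathrm{RAMR}_h \;=\; \frac{\mathrm{OMR}_h}{\mathrm{EMR}_h}\,\bar{m},
\]

where, for hospital (or surgeon) \(h\), \(\hat{p}_i\) is the predicted probability of death for patient \(i\) from a patient-level risk model, \(n_h\) is the number of cases, \(\mathrm{OMR}_h\) is the observed mortality rate, \(\mathrm{EMR}_h\) is the rate expected given the casemix and \(\bar{m}\) is the statewide rate. A provider whose observed rate exceeds that expected for its casemix therefore has a risk-adjusted rate above the statewide average, and outliers are typically flagged where an interval estimate for \(\mathrm{RAMR}_h\) excludes \(\bar{m}\).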
The CSRS satisfied three of the above four requisite characteristics: its annual reports are published and widely disseminated and, although performance is not ranked, statistical outliers are identified (New York State Department of Health, 2006). The CSRS was used by hospitals and had an influence. There is controversy over whether the dramatic improvements in reported performance represented real benefits: Chassin (2002) observed that
‘By 1992 New York had the lowest risk-adjusted mortality rate of any state in the nation and the most rapid rate of decline of any state with below-average mortality’;
Dranove et al. (2003) found, however, that such
‘mandatory reporting mechanisms inevitably give providers the incentive to decline to treat more difficult and complicated patients’.
Of particular interest here is that, in his account of how four hospitals went about the task of improvement, Chassin (2002) emphasized that the selection and change pathways had no effect. The key driver of change was the reputation pathway, through adverse publicity when the CSRS identified outlier hospitals performing poorly (Chassin, 2002):
‘Market forces played no role. Managed care companies did not use the data in any way to reward better performing hospitals or to drive patients toward them. Nor did patients avoid high-mortality hospitals or seek out those with low mortality … the impetus to use the data to improve has been limited almost entirely to hospitals that have been named as outliers with poor performance … hospitals not faced with the opprobrium attached to being named as poorly performing outliers have largely failed to use the rich performance data to find ways to lift themselves from mediocrity to excellence.’
CRAG's reports aimed to provide a benchmarking service for clinical staff by publishing comparative clinical outcome indicators across Scotland. The final report, for 2002 (Clinical Resource and Audit Group, 2002), included two kinds of hospital clinical indicators (which used the only data the NHS collected routinely on outcomes following discharge from hospital): emergency readmission rates (for medical and surgical patients); and mortality (or survival) after hospital treatment (for hip fracture, acute myocardial infarction, stroke and selected elective surgery). The CRAG reports essentially assumed a change pathway as the means through which the information that they produced would be used. These reports, which began before, and continued after, the internal market was introduced, were explicitly designed not to damage hospitals’ reputations: the last CRAG report (Clinical Resource and Audit Group (2002), page 2) emphasized that its information did not ‘constitute a “league table” of performance’. The CRAG reports were evaluated by a CRAG-funded Clinical Indicators Support Team (Clinical Resource and Audit Group (2002), pages 223–229) and by Mannion and Goddard (2001, 2003). Despite the enormous effort that went into the production of these statistics, these evaluations found that they lacked credibility, because of the familiar problems of poor quality of data and inadequate adjustment for variation in casemix. These evaluations also found that the reports were difficult to interpret, lacked publicity and were not widely disseminated. Hence these reports did not satisfy Hibbard's four requisite characteristics. The two evaluations found that they had little influence: Mannion and Goddard (2003) found that these data were rarely used by staff in hospitals, by the boards to which the hospitals were accountable, or by general practitioners in discussions with patients.
The second comparison is a natural experiment between a ranking system, the star rating system in England, which was dominated by performance against targets for waiting times, and the target systems for waiting times in Wales and Scotland, neither of which was part of a ranking system.
The star rating system in England satisfied Hibbard's four requisite characteristics and was designed to inflict reputational damage on hospitals performing poorly. Ranking through annual star ratings was easy to understand, and the results were widely disseminated: they were published in national and local newspapers and on Web sites, and featured on national and local television. Mannion et al. (2005a) emphasized that the star rating system stood out from most other systems of performance measurement in that hospital staff seemed to be highly engaged with the information that was used in star ratings. They attributed this to ‘the effectiveness of the communication and dissemination strategy’ and ‘the comprehensibility and appeal of such a stark and simple way of presenting the data’. Star ratings obviously mattered for chief executives, as being zero rated resulted in damage to their reputations and threats to their jobs. In the first year (2001), the 12 zero-rated hospitals were described by the then Secretary of State for Health as the ‘dirty dozen’; six of their chief executives lost their jobs (Department of Health, 2002a). In the fourth year, the chief executives of the nine acute hospitals that were zero rated were ‘named and shamed’ by the Sun (on October 21st, 2004), a newspaper with a circulation of over 3 million in Britain: a two-page spread had the heading ‘You make us sick! Scandal of Bosses running Britain's worst hospitals’ and claimed that they were delivering ‘squalid wards, long waiting times for treatment and rock-bottom staff morale’; a leader claimed that, if they had been working in the private sector, they would have ‘been sacked long ago’ (Whitfield, 2004). Mannion et al. (2005a) highlighted the pervasive effect on hospital staff of the damage to reputations caused by poor scores in star ratings. For one hospital, the effect of having been zero rated was described as ‘devastating’, as having ‘hit right down to the workforce—whereas bad reports usually hit senior management upwards’, and resulted in
‘Nurses demanding changing rooms because they didn't want to go outside [in uniform] because they were being accosted in the streets’.
Those from a one-star hospital described this as making people ‘who are currently employed here feel that they are working for a third class organisation’. More generally, star ratings were reported to affect recruitment of staff:
‘a high performance rating was “attractive” in that it signalled to potential recruits the impression that the trust was a “good” organisation to work for. In contrast, “low” performing trusts reported that a poor star rating contributed to their problems as many health professionals would be reluctant to join an organisation that had been publicly classified as under-performing.’
In Wales and Scotland, the target systems for waiting times relied on the change pathway: that hospitals would know how their performance compared with the targets set by government, and that this alone would be enough to drive improvement. In each country there was neither systematic reporting to the public that ranked hospitals’ performance in a form analogous to star ratings, nor clarity in published information on waiting times: in Wales, breaches of targets were tolerated but not publicized (Auditor General for Wales (2005a), page 36); in Scotland, large numbers of patients actually waiting for treatment were excluded from published statistics (Auditor General for Scotland, 2001; Propper et al., 2008). Each government's system of performance measurement lacked clarity in the priority of the various targets. In Wales, there was confusion over the relative priority of the various targets in the Service and Financial Framework, and the government's targets for waiting times had ‘not always been clearly and consistently articulated or subject to clear and specific timescales’ (Auditor General for Wales (2005a), pages 36 and 41). In Scotland, the performance assessment framework was criticized for being ‘overly complex and inaccessible’ for the public and those working in the NHS (Farrar et al. (2004), pages 17–18). Both governments continued to reward failure. In Wales there were ‘neither strong incentives nor sanctions to improve waiting time performance’, and the perception was that
‘the current waiting time performance management regime effectively “rewarded failure” to deliver waiting time targets’
(Auditor General for Wales (2005a), pages 42 and 40). In Scotland, there was the perception of
‘perverse incentives … where “failing” Boards are “bailed out” with extra cash and those managing their finances well are not incentivised’
(Farrar et al. (2004), pages 20–21 and 4).
The natural experiment between star ratings in England, which satisfied the above four requisite characteristics, and the target systems in Wales and Scotland, which did not, has been the subject of several studies examining their effects on performance in waiting times, both over time and across countries at the national level: for England, Scotland and Wales (Bevan and Hood, 2006b; Auditor General for Wales, 2005a); for England and Wales (Bevan, 2008a); and, in a detailed econometric analysis, for England and Scotland (Propper et al., 2008). These comparisons have shown dramatic improvements in performance in England; initial deterioration in Wales and Scotland; and performance in England continuing to outstrip that of the other countries. Another cross-national comparison, by Willcox et al. (2007), of different countries’ attempts to tackle the problem of waiting times compared Australia, Canada, England, New Zealand and Wales over the 6-year period 2000–2005. They summarized their finding as
‘Of the five countries, England has achieved the most sustained improvement, linked to major funding boosts, ambitious waiting-time targets, and a rigorous performance management system’
(Willcox et al., 2007). There is a question of the extent to which the effects of the English system were due not only to its capacity to inflict reputational damage but also to the threats to the jobs of chief executives. This requires further research.
These various comparisons suggest that hospital performance measurement systems that satisfied Judith Hibbard's four requisite characteristics, and hence had the capacity to inflict damage on the reputations of hospitals performing poorly, had significant effects, whereas systems that lacked that capacity had little or no effect. The importance of reputational damage as a key driver of change was also identified by Mannion and Davies (2002) in their interviews with experts in the USA on reporting health care performance: where reports of performance did have an effect, the underlying incentives were perceived to be not financial but ‘softer issues such as reputation, status and professional pride’. Hence Mannion and Davies highlighted the power of systems of ‘naming and shaming’ and argued that public reporting mattered, whether or not it was used by consumers, because ‘it makes providers pay more attention because they don't want to look bad’. Mannion and Goddard (2003) also emphasized that the US evidence is that public reporting has an impact ‘particularly where the organization is identified as performing poorly’. Florence Nightingale well understood reputational damage as a means of putting pressure on government to take action: she coined the battle-cry of those seeking reforms in the sanitary conditions of the peacetime army, ‘Our soldiers enlist to death in the barracks’ (Woodham-Smith (1970), page 229).
2.3. Criticisms of star ratings
The star rating system has been examined by the House of Commons Public Administration Select Committee (2003) and as part of the examination of systems of performance measurement by a working party of the Royal Statistical Society (Bird et al., 2005). National auditors have examined responses to targets for hospital waiting times in England (National Audit Office, 2001a, b, 2004; Audit Commission, 2003), Wales (Auditor General for Wales, 2005a, b) and Scotland (Auditor General for Scotland, 2001, 2006a); and responses to targets for ambulance response times in Scotland (National Audit Office, 1999; Auditor General for Scotland, 2006b), Wales (Auditor General for Wales, 2006) and England (Audit Commission, 2007). The CHI has published a detailed commentary on its star rating of the NHS in 2003 (Commission for Health Improvement, 2004a); it also reported on concerns over the way that trusts responded to targets for hospital waiting times (Commission for Health Improvement, 2004b) and ambulance response times (Commission for Health Improvement, 2003c), which were identified in the course of inspections of the implementation of the systems and processes of clinical governance in each NHS organization. There is already a substantial scholarly literature on the star rating system: Alvarez-Rosete et al. (2005), Bevan (2006, 2008a), Bevan and Cornwell (2006), Bevan and Hood (2004, 2006a, b), Brown and Lilford (2006), Burgess et al. (2003), De Bruijn (2007), Friedman and Kelman (2007), Hauck and Street (2007), Heath and Radcliffe (2007), Hood (2006, 2007), Jacobs et al. (2006), Jacobs and Goddard (2007), Jacobs and Smith (2004), Kelman and Friedman (2007), Klein (2002), Mannion et al. (2005a, b, c), Mannion and Goddard (2002), Marshall et al. (2003), Mays (2006), Patel et al. (2008), Propper et al. (2008), Rowan et al. (2004), Smith (2002, 2005), Snelling (2003), Spiegelhalter (2005a), Stevens (2004), Stevens et al. (2006), Sutherland and Leatherman (2006) and Willcox et al. (2007). Some of this literature has shown reported performance improving in England against the most important targets. All the criticisms of star ratings recognize their undeniable effect, but have also identified six significant general problems: measuring what matters; selection of targets; the nature of measures; aggregation for ranking; gaming; and damage to morale (Table 1). The first four of these are essentially statistical; the last two raise questions about the behavioural influence of star ratings, which we see as consequences of any system that satisfies Judith Hibbard's four requisite characteristics and is thereby designed to inflict reputational damage on organizations performing poorly.
Table 1. Six problems with star ratings
| Measuring what matters | Often the most important aspects of performance cannot be measured, and hence what is measured becomes important. School league tables based on examination results are an exemplar: examination results serve as a proxy measure of teacher performance because of the difficulty of measuring the benefits of education derived from teaching. The general problem, of creating incentives for agents to respond to targets that omit key dimensions of performance, was modelled by Holmstrom and Milgrom (1991), who showed that neither a limited set of good measures nor a larger set of poor measures would produce results that are free from significant distortion by gaming. |
| Selection | Assessing the performance of complex organizations raises problems of selecting what ought to be included in (and hence excluded from) the set of targets. For primary care organizations in England, for example, it is difficult to see a good rationale for the set of about 50 targets and indicators used in star ratings to cover their complex set of responsibilities (for providing primary and community health services, improving public health and commissioning secondary care) (Bevan, 2006). |
| Nature of measures | Performance indicators are often ‘tin openers’ rather than ‘dials’: ‘they do not give answers but prompt investigation and inquiry, and by themselves provide an incomplete and inaccurate picture’ (Carter et al., 1995). |
| Aggregation for ranking | There are problems with methods of producing aggregate measures of performance for ranking systems (Jacobs and Goddard, 2007). League tables have been shown to be statistically unsound (Goldstein and Spiegelhalter, 1996; Marshall and Spiegelhalter, 1998). One criterion in the design of the star rating system was to avoid the volatility from statistical noise that produces substantial variations in reported performance from one year to the next (see the simulation sketch after this table). But the methods of determining whether an organization was three or two star were arcane (Klein, 2002) and so complex that it was difficult for an organization to understand why its ranking had changed (Spiegelhalter, 2005a). |
| Gaming | When targets have been backed by high powered incentives in response to success and failure, a common effect has been gaming, both in centrally planned economies (Nove, 1958; Berliner, 1988; Kornai, 1994) and in the public sector (Smith, 1995). |
| Morale | Publicly reporting that hospitals were ‘failing’ (‘zero rated’ or one star) damaged the morale of their staff (Horton, 2004; Mannion et al., 2005a, b). |
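To illustrate the 'aggregation for ranking' problem in Table 1, the following sketch (hypothetical numbers, not NHS data) simulates a league table of hospitals whose underlying mortality rates are identical: binomial sampling noise alone is enough to shuffle the rankings substantially from one year to the next.

```python
# Hypothetical illustration of league-table volatility: hospitals with identical
# underlying mortality rates are ranked on their observed rates in two successive years.
import numpy as np

rng = np.random.default_rng(0)
n_hospitals = 24        # assumed number of hospitals being ranked
cases_per_year = 500    # assumed annual caseload per hospital
true_rate = 0.03        # identical underlying mortality rate for every hospital


def observed_ranks():
    """Rank hospitals (0 = lowest observed mortality) from one simulated year."""
    deaths = rng.binomial(cases_per_year, true_rate, size=n_hospitals)
    return np.argsort(np.argsort(deaths))


rank_year1 = observed_ranks()
rank_year2 = observed_ranks()
moved = np.abs(rank_year1 - rank_year2)
print(f"Mean change in rank between years: {moved.mean():.1f} places")
print(f"Hospitals moving by more than a quartile (6 places): {(moved > 6).sum()}")
```

Presentations that attach interval estimates to institutional comparisons (Goldstein and Spiegelhalter, 1996) are designed to make this kind of instability visible rather than to rank through it.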
If tackling failure were to have been ruled out on the grounds of damaging morale, then presumably the notorious scandals that beset the NHS in the late 1990s would have been allowed to continue for even longer than they did (Bevan, 2008b). Any system that satisfies Hibbard's four requisite characteristics will damage morale of those identified as performing poorly. Indeed it can be argued that damaging morale is necessary in the short term for creating the different atmosphere that is required to achieve improvement in the long term. Because star ratings were taken so seriously, they did encourage gaming, which was also a concern with the CSRS in New York. There are two corollaries of this. First, if we were only to adopt systems of performance measurement in which there were no incentive to game, these would be unlikely to be taken seriously, and hence would fail to provide the external discipline to drive improvement of government services. Second, if we do take systems of performance measurement seriously, and design these to have an effect, then developing systems to counter gaming ought to be integral to the design of such systems.
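The trade-off behind this second corollary, and behind the Holmstrom and Milgrom (1991) result cited in Table 1, can be seen in a stripped-down multitask model; the notation here is illustrative and simplifies their formulation.

\[
\max_{t_1,\,t_2}\;\; \alpha t_1 - C(t_1, t_2)
\;\;\Rightarrow\;\;
\frac{\partial C}{\partial t_1}=\alpha,\quad
\frac{\partial C}{\partial t_2}=0,
\qquad
\frac{\mathrm{d}t_2}{\mathrm{d}\alpha}
= \frac{-\,C_{12}}{\,C_{11}C_{22}-C_{12}^{2}\,} \;<\;0
\;\;\text{when } C_{12}>0,
\]

where \(t_1\) is effort on the measured task (rewarded at rate \(\alpha\)), \(t_2\) is effort on an unmeasured task and \(C\) is a strictly convex private cost function (subscripts denote partial derivatives; an interior solution is assumed). Sharpening the incentive on what is measured then draws effort away from what is not whenever the two tasks compete for the agent's attention (\(C_{12}>0\)): a designer cannot escape choosing between muting incentives and accepting some distortion, which is why countering gaming needs to be built into the design rather than assumed away.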
The next section of this paper examines the effects of targets for ambulance response times to emergency calls in the various UK countries. There are two reasons why this is particularly interesting. First, we have a natural experiment between different approaches to using performance measurement: the category A 8-minute target was common to each UK country, but only in England was this subject to a system of performance measurement with Hibbard's requisite characteristics to effect change through the reputation pathway. Approaches in the other UK countries relied on the change pathway: that ambulance trusts would know how their performance compared with the target and that this alone would be enough to drive improvement. Second, the four statistical problems of star ratings that were identified above seem much easier to resolve for ambulance services than, for example, for organizations as complex as acute care hospitals. A target for rapid responses by ambulances to life threatening emergency calls looks to be a good measure of what matters, as it reflects what ought to be a principal goal of those services. Indeed, the rationale for the selection of the category A 8-minute target was that
‘Clinical evidence shows that achievement of the target could save as many as 1,800 lives each year in people under 75 years suffering acute heart attacks’
(Healthcare Commission, 2005b). Furthermore, measuring performance against this target requires only data that ought to be collected routinely and appears, in principle, to be straightforward.
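As a minimal sketch of that measurement (hypothetical records and field layout; it assumes the clock runs from receipt of the call to the arrival of the first vehicle, and in practice much turns on precisely when the clock is deemed to start and stop):

```python
# Minimal sketch (hypothetical records): the proportion of category A calls receiving
# a response within 8 minutes, assuming the clock runs from call receipt to the
# arrival of the first vehicle at the scene.
from datetime import datetime, timedelta

# (time call received, time first response arrived) -- illustrative values only
calls = [
    (datetime(2004, 5, 1, 9, 0, 10), datetime(2004, 5, 1, 9, 7, 45)),
    (datetime(2004, 5, 1, 9, 15, 0), datetime(2004, 5, 1, 9, 24, 30)),
    (datetime(2004, 5, 1, 9, 40, 5), datetime(2004, 5, 1, 9, 47, 59)),
]

TARGET = timedelta(minutes=8)
within = sum((arrived - received) <= TARGET for received, arrived in calls)
print(f"Category A performance: {within / len(calls):.1%} responded to within 8 minutes")
```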