Discussion on the paper by Stone
Peter C. Smith (University of York)
It is an honour to propose the vote of thanks for Professor Stone's paper. The problem that it addresses can be summarized simply. Using the paper's notation, it is to indicate a measure of efficiency by constructing an index V where, for observation i,

V_{i} = Σ_{k} v_{k} y_{ik}/x_{i},

if necessary taking into account the environmental circumstances in which the unit of observation must operate.
The paper rightly alludes to vexed questions concerning the legitimacy and the advisability of constructing such an index, and the intellectual rationale underlying the endeavour. However, once the decision to construct the index has been taken, two key issues must be addressed. Which measures of output y should be chosen? And what weights v should be used? The productivity analysis research industry seeks to provide technical solutions to these problems. However, I wish to argue that these questions are essentially political rather than technical issues.
Spottiswoode's (2000) report, which motivated Professor Stone's paper, emanated from the influential Public Services Productivity Panel and was warmly endorsed by the Chief Secretary to the Treasury. It recommended two specific technical solutions to calculating V : data envelopment analysis (DEA), which allows the weights v to vary freely between units, and stochastic frontier analysis (SFA), which—for each output—uses as a weight a statistical estimate of the sample average cost of securing an extra unit of output. Underlying these beguilingly simple constructs are some profound methodological difficulties, which the paper does a masterful job in summarizing.
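For readers unfamiliar with the mechanics, the first of these constructs can be sketched as a small linear programme: each unit chooses the output and input weights that flatter it most, subject to no unit in the sample scoring above 1 under those weights. The following is a minimal illustration only; the data, and the use of scipy, are assumptions of the sketch and not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: five units, one input (cost) and two outputs each.
X = np.array([[10.], [12.], [8.], [15.], [9.]])                    # inputs
Y = np.array([[5., 7.], [6., 5.], [4., 6.], [9., 8.], [3., 8.]])   # outputs

def dea_score(o, X, Y):
    """CCR multiplier-form DEA efficiency for unit o."""
    n, m = X.shape
    _, s = Y.shape
    # Decision variables: output weights u (s of them), then input weights v (m).
    c = np.concatenate([-Y[o], np.zeros(m)])                 # maximise u . y_o
    A_eq = np.concatenate([np.zeros(s), X[o]]).reshape(1, -1)
    b_eq = [1.0]                                             # normalise v . x_o = 1
    A_ub = np.hstack([Y, -X])                                # u . y_j - v . x_j <= 0, all j
    b_ub = np.zeros(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (s + m))
    return -res.fun                                          # efficiency in (0, 1]

scores = [dea_score(o, X, Y) for o in range(len(X))]
```

Units on the empirical frontier score exactly 1; every other unit is scored against the weights most favourable to it, which is exactly why DEA offers no guidance on whether those weights are acceptable.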
DEA is easy to use and requires no specification of functional form. But it offers no guidance on the quality of the model specification. Have the ‘right’ outputs been chosen (Smith, 1997)? And are the weights that are implicit in the analysis acceptable (Pedraja-Chaparro et al., 1997)? Everything is left to the judgment of the analyst. SFA requires a choice of a functional form. It can then offer some guidance on model choice in the form of the usual parametric selection and specification tests, but much is still left to judgment. To demonstrate the scale of the problem, Table 1 reproduces from Jacobs (2001) the rank correlation coefficients between the efficiency rankings for five DEA models and their SFA counterparts for 232 National Health Service hospitals. It is possible to offer plausible justification for adopting any of these models, yet their policy messages differ profoundly. How are we to choose our recommended model?
Table 1. Pearson correlation coefficients of results for 232 hospitals from five DEA specifications and their SFA counterparts†

        DEA1    DEA2    DEA3    DEA4    DEA5    SFA1    SFA2    SFA3    SFA4    SFA5
DEA1    1.0000
DEA2    0.2298  1.0000
DEA3    0.3729  0.6340  1.0000
DEA4    0.7575  0.3513  0.5372  1.0000
DEA5    0.4722  0.6062  0.8352  0.6149  1.0000
SFA1    0.4274  0.4667  0.5946  0.5166  0.5756  1.0000
SFA2    0.0957  0.6209  0.4231  0.1831  0.4038  0.6354  1.0000
SFA3    0.2154  0.4318  0.5975  0.3165  0.4852  0.8297  0.6917  1.0000
SFA4    0.4192  0.4835  0.6583  0.5543  0.5998  0.8763  0.6815  0.8065  1.0000
SFA5    0.3399  0.5195  0.6557  0.4633  0.6343  0.9496  0.6535  0.8731  0.8217  1.0000
Similar results arise when we confront technical choices concerned with the transformation of outputs, treatment of environmental factors and missing data, and the use of panel data. To understand the scope for debate, we have only to examine the response to the ‘World health report 2000’ produced by the World Health Organization (2000). This document used productivity models to rank the health systems of 191 countries and has generated an intense technical and policy exchange that shows no sign of abating (Williams, 2001).
In short, numerous technical judgments must be made in the application of productivity techniques, and these judgments seriously affect policy conclusions. This is not to say that methods such as DEA and SFA cannot offer useful insights into the characteristics of a complex data set. Indeed I believe that their careful deployment can often form an important element in the development of evidence-based policy. However, any particular model specification can be readily challenged, and there is only a remote possibility of being able to come to a definitive judgment on performance by using technical apparatus alone.
So do we throw up our hands and say that it is all impossible, or do we seek out some other way forward? Professor Stone advocates an approach which he characterizes as ‘value-based analysis’. The idea is simple. It retains the formulation of the efficiency index given above but advocates the use of ‘societal weights’ rather than a set of weights emerging from arcane technical analysis. Yet how are these weights to be determined? The objectives and priorities that are attached to public services vary enormously between individuals, and there is no golden rule to say that some societal measure of central tendency is in any sense correct.
In my view the choices of outputs and weights are essentially political rather than technical problems. Political choices in this domain can of course be guided by technical analysis, in particular the use of well-designed population surveys. But in the end it is the job of elected politicians, rather than statisticians, to reconcile the often diverse popular views concerning the objectives and priorities of public services.
The index of efficiency discussed here passes a stark finality of judgment on public service organizations. I believe strongly that—at least in the criminal justice sector—its construction is a legitimate and important undertaking. But if done properly it requires clarity in the definition of political priorities, and—as Benjamin Disraeli noted—‘finality is not the language of politics’. The Spottiswoode report and many similar endeavours betray a regrettable inclination to shy away from tough political choices, and to hide political values behind a technical smokescreen. Professor Stone has succeeded admirably in exposing the confused thinking underlying this trend in public policy. His paper makes an important contribution to the public debate and it gives me great pleasure to propose the vote of thanks.
Andrew Chesher (University College London)
Professor Stone's masterly exposition of Michael Farrell's work on the measurement of technical efficiency and of the stochastic frontier analysis (SFA) approach to the problem is timely. (There is an interesting account of the history of the development of data envelopment analysis (DEA) in Førsund and Sarafoglou (2000).) The questions that he raises about the suitability of these methods for measuring the efficiency of delivery of public services are most pertinent.
The focus of Professor Stone's paper is the measurement of efficiency in the public sector. Similar issues arise in the regulated private sector. For example to inform recent price control reviews Oftel has commissioned studies employing DEA and SFA to measure the efficiency of production of fixed line telephone services, comparing British telephone service providers with around 50 US fixed line operators.
Here, as in public sector applications, the samples are small, the data are contaminated by measurement error and transitory variation, and the accuracy of estimates may be low and poorly captured by conventional summary statistics developed from asymptotic approximations.
Crucially, the identification of policy relevant magnitudes rests on very fragile foundations, as Professor Stone's critique of SFA makes clear. Identification in the context of DEA is particularly problematic since DEA is not usually cast in the context of a model of behaviour and a model of data generation in which policy relevant magnitudes can be defined and the way in which data are informative about them can be understood.
Farrell (1957) considered only productive efficiency, giving no attention to the issue of efficiency in the choice of amounts to produce. In the context of public service provision this choice is important, because incentive mechanisms aimed at promoting productive efficiency may influence the choices of amounts of outputs to produce.
In the private sector, where prices clear markets, prices serve as signals of consumers’ valuations of outputs at the margin. In the public sector signals do not usually come via prices. They must be provided by the Government. Let w(y) be the Government's valuation of an output vector y. Then, from the Government's point of view, optimal production is
y* = arg max_{y} {w(y) − x(y)},

where x(y) is the minimum cost function. (In practice this optimization is likely to be subject to a budget constraint, C(y) ≤ C^{*}.)
In this context a meaningful measure of the ‘inefficiency’ of a public service provider, i, facing cost function x_{i}(y) and producing an output vector y_{i} is

{w(y*) − x(y*)} − {w(y_{i}) − x_{i}(y_{i})}.   (1)
Measured by this yardstick, rankings of service providers facing the same value and minimum cost functions (w(⋅) and x(⋅)) depend only on w(y_{i})− x_{i}(y_{i}). The minimum cost function is irrelevant unless we wish to know the extent of the loss caused by inefficiency and/or the cause of inefficiency. The latter is addressed by noting that the inefficiency measure can be decomposed as
{w(y*) − x(y*)} − {w(y_{i}) − x(y_{i})}   (2a)

     + {x_{i}(y_{i}) − x(y_{i})}.   (2b)

The two terms (2a) and (2b) are necessarily non-negative, measuring the loss due respectively to a suboptimal choice of output levels (2a) and to production of those outputs at more than minimum cost (2b).
DEA and SFA aim at measuring term (2b).
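The decomposition can be made concrete with a toy numerical example. The value function, minimum cost function, provider cost function and output level below are all hypothetical choices made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical ingredients (all names and numbers are illustrative).
w = lambda y: 10 * np.sqrt(y)   # Government's valuation of output y
x_min = lambda y: y             # minimum cost function x(y)
x_i = lambda y: 1.3 * y         # provider i's cost function, 30% above minimum
y_i = 4.0                       # provider i's chosen output level

# Optimal production: y* maximising w(y) - x(y); analytically y* = 25 here.
res = minimize_scalar(lambda y: -(w(y) - x_min(y)),
                      bounds=(0.01, 100), method='bounded')
y_star = res.x

loss = (w(y_star) - x_min(y_star)) - (w(y_i) - x_i(y_i))       # total inefficiency
term_2a = (w(y_star) - x_min(y_star)) - (w(y_i) - x_min(y_i))  # output-choice loss
term_2b = x_i(y_i) - x_min(y_i)                                # productive-inefficiency loss
# loss == term_2a + term_2b, and both terms are non-negative.
```

Here most of the provider's shortfall comes from producing far too little output rather than from producing it expensively, which is exactly the component DEA and SFA do not measure.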
In this setting Professor Stone's value efficiency measures are φ_{i} = V_{i}/max_{j}(V_{j}) where

V_{i} = w(y_{i})/x_{i}(y_{i}),

and as they are calculated on a per unit achieved cost basis they will generally yield rankings that differ from those generated by expression (1). Ranking becomes problematic when comparing service providers facing different value and minimum cost functions, perhaps because they operate in different geographical regions, but could perhaps be based on the contributions that each makes to their total across all service providers.
Both the VBA-based value efficiency measures and Professor Stone's φ_{i} on their own provide no information about the extent of the loss due to inefficiency or about the cause of inefficiency. To address these issues the minimum cost function x(y) must be estimated. This is what DEA and SFA attempt to do. Research effort should be devoted to improving on these methods.
A careless use of efficiency measures as the basis for rewards or penalties may lead to suboptimal output choices. There will generally be underproduction of outputs that are omitted from the value function w(y). Even if all outputs are included, if service providers do not face the minimum cost function then there will be a tendency to specialize in the production of outputs that providers can produce relatively efficiently and service may be withdrawn from users who are costly to serve. Universal service obligations placed on regulated private utilities are one response to the latter problem.
Briefly, on one statistical issue, there are now bootstrap-based inferential methods for DEA (e.g. Simar and Wilson (2000a, b)). The identification problem which plagues SFA may be eased with repeated observations, either gathered through time, if inefficiency is time invariant within enterprises, or at a point in time within subunits of the units being considered.
Professor Stone's paper raises serious questions regarding the usefulness of DEA and SFA methods in measuring the efficiency of public service provision. His paper will stimulate research in this topical and very important area and I have much pleasure in seconding the vote of thanks.
The vote of thanks was passed by acclamation.
James Foreman-Peck (Her Majesty's Treasury, London)
Professor Stone advocates value weights, ‘valuti’, instead of weights chosen by data envelopment analysis or stochastic frontier analysis for indices of public service efficiency. This position seems more justified in the case that he examines, where there are multiple (public service) outputs, than in Farrell's original example, where there were many inputs and one output. In the first it is more reasonable to suppose that ‘society’ would or should have such weights. The second is a technical matter, where a variety of processes may be appropriate to different circumstances, perhaps best understood by the units under study themselves.
A very simple diagrammatic representation of the value point can be made by considering two outputs of policing—crime prevention and crime detection—and one input—police time (Fig. 6). The frontier XX^{′} shows the maximum that can be achieved from given police time. A technically efficient police force would ensure that reallocating police time between the two activities could not increase one without reducing another. The gradient of the frontier indicates how much of one activity must be given up to achieve an increase in the other by reallocating labour. Without output ‘prices’ or values there is no way of judging whether any such shift would be advantageous or not.
If a police force were ‘socially efficient’, the marginal social value of a police officer in crime prevention should be equal to that in criminal detection. To calculate such efficiency some form of value weight is necessary to convert ‘crimes prevented’ into ‘crime-detected equivalents’—otherwise the two indices are incommensurable. These values are Professor Stone's valuti. Social preferences based on these valuti, and higher valued preferences, are represented respectively as YY^{′} and Y^{′′}Y^{′′′} in Fig. 6.
Police force B would be identified as technically efficient by a frontier method. Police force A would be technically inefficient by comparison. But ‘society’ prefers the inefficient police force; A is on a higher valued preference curve.
Much improvement of public services, however, depends on establishing the effectiveness of different input combinations for achieving a given output, e.g. various mixes of police foot and car patrols in the prevention of burglaries. This is a matter of what works in different circumstances, not a question of relative values. In such instances data envelopment analysis and stochastic frontier analysis may be helpful.
Firstly I congratulate Professor Stone on this intellectually vibrant paper. However, I believe that staff costs, operating costs and capital consumption costs as inputs (Section 7) will not yield very elegant results, especially when we undertake international comparisons. I suggest that total police hours worked would be a better input variable. Moreover, for a more insightful analysis, the total police hours worked could be divided into hours consumed for administration purposes, namely, recording crimes etc., and those relating to the maintenance of law and order, and solving crimes. With regard to the environmental variables, I would have thought that the number of young men out of work would yield more insightful results than simply the number of young men, since those in work would not have much time to indulge in criminal activities, unless crimes are committed in a state of drunken stupor after office hours. Finally, the author should also consider the influence of violence portrayed on television on the actual violence in society.
V. T. Farewell (Medical Research Council Biostatistics Unit, Cambridge)
First let me say how pleased I am that Professor Stone has been able to present his paper. I congratulate him on both the paper and his perseverance in the clarification of issues in the measurement of efficiency for public services.
I chaired the Royal Statistical Society's Official Statistics Section meeting which was prompted by Spottiswoode's (2000) report. At that time, I summarized the situation as being that the report advocated data envelopment analysis and stochastic frontier analysis models with some minimal acknowledgement of ‘dissenting advice’ but some individuals felt that a wider discussion than that enabled by the report would be valuable. We have, perhaps, not moved far from this position. Thus I welcome this meeting.
I should like to suggest that it may be appropriate to take some reasonably large steps backwards before proceeding too much further forwards in this area. For example, there is surely a need for a fairly extensive preliminary data analysis of any variables which are to be used in efficiency measures and this should be in the public domain.
From a more methodological perspective, general considerations of multiplicity are relevant. Data envelopment analysis and stochastic frontier analysis appear to be rather extreme examples of the ‘summary measure’ approach to multiplicity. What consideration has been given to others, in particular to the use of marginal procedures which retain the individuality of some responses? For the reporting of clinical trials in rheumatoid arthritis, the use of five outcome measures is recommended as no single summary measure captures the complexity of response to treatment. The measurement of the efficiency of a police force is surely at least as complicated. Substantial input from police forces should help to direct thinking in this area.
The valuable work by Goldstein and Spiegelhalter (1996) on institutional rankings should be considered in light of the possible end use of efficiency measures. Also, since the measurement of outcomes of primary interest is difficult, the potential pitfalls of a reliance on surrogate measures deserve discussion.
Finally, there should be published comparisons of any suggested procedure for efficiency measurement with a variety of others. Professor Stone's valuebased analysis should be pursued. However, even simpler approaches, such as O'Brien's (1984) procedure which simply sums the ranks when a large number of outcome measures are involved, are transparent and could be the basis of informative comparisons.
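The transparency of O'Brien's rank-sum procedure is easy to demonstrate. A minimal sketch follows; the six units and four outcome measures are entirely hypothetical, with higher values taken to be better.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical scores for 6 units on 4 outcome measures (higher = better).
scores = np.array([[3.1, 0.8, 12., 0.50],
                   [2.4, 1.1, 15., 0.42],
                   [3.5, 0.9, 11., 0.61],
                   [1.9, 0.7,  9., 0.38],
                   [2.8, 1.3, 14., 0.55],
                   [3.0, 1.0, 13., 0.47]])

# O'Brien (1984): rank each outcome across units, then sum the ranks per unit.
ranks = np.apply_along_axis(rankdata, 0, scores)  # rank within each column
rank_sum = ranks.sum(axis=1)                      # one transparent score per unit
ordering = np.argsort(-rank_sum)                  # best unit first
```

Every step is auditable by hand, which is precisely the property that makes the procedure a useful benchmark against which to compare more opaque efficiency measures.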
D. R. Cox (Nuffield College, Oxford)
Professor Stone has produced a searching and original paper on an important topic.
A small technical point is that if one were to use stochastic frontier analysis, and I totally share Professor Stone's general reservations, then even in the most favourable circumstances considerable care would be needed with the statistical analysis. Barndorff-Nielsen and Cox (1994), page 110, showed that even in a highly simplified version of the problem the likelihood function, although not technically irregular, is unlikely to be well behaved.
On a much broader and more important aspect, it might be argued that data envelopment analysis will give valuable answers provided that the number of performance measures (output variables) is kept very small. This has major disadvantages even within the context of finding a single measure of efficiency but more broadly can be very dangerous. Performance measures clearly have other possible objectives, e.g. as tools for local management of various kinds, including steering an organization towards certain objectives, and the provision of information for the public. All this will tend to point towards multidimensionality, combined with much more use of sampling to reduce burden and a careful analysis for rational dimension reduction. This is not least to avoid such situations as the ranking of hospitals on the basis of criteria that do not include an assessment of the success of the care provided, or of university teaching departments without considering what is taught and the attractiveness and success of the teaching. If a one-dimensional score is essential then the method of Professor Stone's Appendix A seems appealing, with, however, quite frequent updating of the weights.
The UK Government recognized its need for more input from independent statistical experts at least by mid-1999, when the Performance and Innovation Unit began a review of ‘quantitative analysis and modelling in central government’. Their report concluded that government needed, above all, to make ‘better use of links to the academic world’ (http://www.cabinetoffice.gov.uk/innovation/2000/adding/coiaddin.pdf, page 70).
More recently, the Civil Service Commissioners published their 2000–2001 annual report, which described an appeal from a civil servant regarding statistics on progress towards a Government target. They conclude that there is
‘an obligation on civil servants to take reasonable measures to ensure that the way that they present data [to Parliament and the public] does not have the effect of being deceptive or misleading’.
The experience of Professor Stone, and others, in advising the Treasury on its proposal for measuring the efficiency of police forces suggests that the Government continues to have great difficulty in accepting independent advice and scrutiny regarding statistics. The new framework for National Statistics simply codifies the problem. Ministers will decide what counts as a ‘national statistic’ and is therefore subjected to quality assurance; and it appears that much regarding the performance measurement of public services will be excluded.
There is no kind of national statistics of greater interest to the public and Parliament than statistics on the performance of public services. Indeed, no other kind of statistics is so central to the functioning of Parliamentary democracy. If Parliament and the public cannot be confident of the quality of the statistics that they receive on the performance of Government and public services, then they cannot exercise effective oversight of Government.
It is as inadvisable to allow governments to present accounts of their performance in whatever form they like, with no independent scrutiny, as it would be to allow companies to present accounts of their performance in whatever form they liked, and unaudited. The Statistics Commission must exercise its power to recommend legislation to secure a truly independent framework for National Statistics, on the grounds of Parliament's right to determine what information it and the public require from the Government and with what assurance of validity. In the meantime, the statistics community must do everything that it can to help Parliament and the public to evaluate critically all the statistics presented to them by the Government.
Chris Tofallis (University of Hertfordshire Business School, Hatfield)
Mervyn Stone does a great service in drawing the attention of the statistics community to the area of assessing efficiency. Despite the fact that Farrell's seminal work of the 1950s appeared in the Journal of the Royal Statistical Society, statisticians have done little to further this work. Most research on data envelopment analysis (DEA) appears in the literature of operational research or management science, and for stochastic frontier analysis mainly in the econometrics literature. In both cases very little is critical. This may be because papers offering critiques are refereed by the people whose work is being criticized, and so they never see the light of day. Perhaps that is where the statistics community may play a useful role in being both critical and constructive, and this paper offers a clear example.
It includes a very clear section on ‘feasible performances’; this is the ‘production possibility set’ of efficiency theory. By emphasizing the three operations of mixing, worsening and rescaling, comprehension of how such sets are defined is greatly assisted.
Professor Stone is rightly concerned by the fact that when using DEA the discrimination between the units being compared may be very low when the number of output variables is high. Yet users of statistical methods face similar problems. There are things that we can do to help to improve discrimination in DEA. An obvious one is to aggregate some of the variables. Secondly, we can increase the number of data points by including data from two or more time periods. This simultaneously allows progress to be measured at individual units as well as more clearly delineating the set of feasible performances.
Allowing a single figure to represent the performance of a unit producing a variety of outputs is a summary statistic in extremis. Moreover a DEA score of 100% does not mean that a particular unit is outstanding in all areas. Thus a different approach to improving discrimination is to apply DEA to subsets of variables to obtain an ‘efficiency profile’ for each unit (Tofallis, 1997). Each subset may represent a particular aspect of performance—these might be things that appear in the mission statement. Such a profile more easily identifies areas of weakness within each unit.
Finally, the very generous approach that DEA adopts in attaching weights means that it is more useful in identifying poor performance than good. It is a method that bends over backwards to make each unit appear in a good light. If, despite this effort, the resulting score is still poor then we have strong evidence that the unit in question appears to be a laggard and needs further investigation. However, DEA can underestimate efficiency if there is a convex region in the underlying production function.
Jane Galbraith (London School of Economics and Political Science)
Professor Stone's admirable paper does not suggest that data envelopment analysis would ever be appropriate for measuring the efficiency of public services, but it does indicate clearly that it is particularly inappropriate if the number of output variables is large. Unfortunately the smaller the number of output variables the greater the risk that the units being assessed will manipulate them to increase their score on the performance indicator.
Where this is done by creative accounting (e.g. by changing the ways that reported crimes are categorized) little harm may be done. But where resources are reallocated to improve the specific output variables this can be to the detriment of the overall provision of service. In either case the performance indicator's validity will be undermined.
Therefore it is important in devising a performance indicator that the output variables are chosen so that, when units try to optimize their scores, not too much harm is done to the service and the indicator retains some validity.
Ben Torsney (University of Glasgow)
The illustration considered in the paper focused on seeking an efficiency measure in respect of the provision of police services in England and Wales alone. Of course the Treasury has no direct responsibility for provision in Scotland, so such an exclusion is forgivable, and, indeed, could possibly be exploited.
The issue that I wish to address concerns the fact that all police forces in England and Wales are presumed to be included in the analysis—a population rather than a sample.
Whatever method of analysis is adopted, a fundamental question is what does residual variation represent? It could possibly be variation explained by nonincluded factors or yeartoyear variation.
However, I think that there is scope for a more substantive consideration of the issue. It is one which I face in a study of outreach provision of health services at health centres (from which one or more general practitioner practices can operate) in Scotland. This includes the provision of electrocardiogram or Xray equipment or specialist consultant clinics; see Milne and Torsney (1992, 1993, 1994, 1997, 2001) and Torsney and Milne (1999).
A model for a binary response is needed here and potential explanatory variables are available, but on what basis can inferences about parameters be judged? I have no great wisdom to offer here. One possibility might be a superpopulation version of whatever model or method is adopted. Does the author have any advice on this?
Of course, if the observed units do represent a sample from a wider grouping, this issue is less pressing. For example if the English and Welsh police forces could be viewed as representative of the UK, as a whole, then inferences or predictions could possibly be made about Scotland and their accuracy assessed. However, such an extrapolation might be unreasonable if, unlike the Scottish Executive, the UK Treasury does not consult experts!
Greg Phillpotts (Department of Health, London)
I would like to return to the point raised by Juanita Roche, that of the trust that the public have in official statistics and statistical methods. My point is that the Government has provided a framework to build trust in National Statistics. This includes the need for a quality review of key national statistics and of the methods used to produce them, incorporating outside expertise in the review process. I do not know from my position at the Department of Health whether the efficiency measures that are produced by the Home Office and are the subject of this paper are part of National Statistics. There is a meeting to be held here at the Society on January 28th, 2002, at 3 p.m. about the draft National Statistics code of practice and this question of coverage of National Statistics could be picked up then.
Stephen Senn (University College London)
This paper is, in the best traditions of this Society, bringing the results of skilful investigations of statistical theory to bear on a matter of practical importance. I have one question concerned with whether it has any lessons for us as statisticians.
It is now about 70 years since two statisticians, who were associated with the institution at which I work, in the department which Professor Stone once headed and in which he now has an honorary appointment, were faced with a problem analogous to that considered in this paper. They were considering the choice of statistical tests and characterized such tests by using two properties only: a far simpler case than that of judging police forces that is considered here. In fact, since they restricted their consideration (perhaps misleadingly for future development) to fixed sample sizes, there is a further simplification, in that to make the analogy we would have to consider police forces operating with the same budget.
The statisticians were, of course, Jerzy Neyman and Egon Pearson and the two outputs from statistical tests that they considered were type I and type II error rates (Neyman and Pearson, 1933). They recognized the impossibility of simultaneously minimizing these and adopted instead the approach of fixing one error rate and minimizing the other. If imitation is the measure of success, this procedure has turned out to be a stupendous success but it seems to involve, if anything, even more squeamishness about combining outputs than data envelopment analysis does.
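The fix-one-error-rate-and-minimize-the-other recipe is easily made concrete. A sketch for a one-sided z-test follows; the sample size, effect size and significance level are arbitrary illustrative numbers, not drawn from the paper.

```python
from scipy.stats import norm

# One-sided z-test of H0: mu = 0 against H1: mu = 1, known sigma = 1.
alpha, n, mu1 = 0.05, 4, 1.0            # illustrative numbers only
crit = norm.ppf(1 - alpha)              # fix the type I error rate at alpha ...
beta = norm.cdf(crit - mu1 * n ** 0.5)  # ... and accept the type II error it implies
power = 1 - beta                        # the quantity that is then maximised
```

Note that no attempt is made to weigh the two error rates against each other: one output is pinned down by convention and the other is optimized, which is the squeamishness about combining outputs referred to above.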
My question for Professor Stone is this: the lessons of his paper for the Government seem clear; are there any for statisticians?
The following contributions were received in writing after the meeting.
Rolf Färe, Shawna Grosskopf and Valentin Zelenyuk (Oregon State University, Corvallis)
We thank M. Stone for bringing the efficiency literature to the attention of the statistics community, but we take issue with some of the remarks raised in his ‘unreserved criticism of the data envelopment analysis (DEA) technique’ (Section 1). The laundry list of pitfalls in the approach would provide an admirable checklist for any empirical analysis, e.g. avoiding specification errors and paying attention to the sample size. His proposed alternative approach is also perfectly consistent with reasonable DEA studies, although we would substitute ‘Choose an appropriate model specification’ for ‘fix the valuti’.
However, we disagree that the Farrell–DEA technique produces ‘self-defining efficiency measures—functions of the database alone and determinable without reference to context’ (Section 8.1). Over the last decades we have made considerable efforts to point out the connections between the DEA model and axiomatic production and duality theory as pioneered by Shephard (1953, 1970), including the reference cited by the author. These provide the theoretical underpinnings of the DEA model and give an economic context. Exploiting the links to economic theory and the flexibility of the activity analysis (DEA) model allows the analyst to customize the specification to the application. For example, we have long endorsed using Shephard's (1974) cost indirect model for applications in the public sector. In that model the benchmark objective is to maximize services (or outcomes or activities) subject to the given budget. Here, in contrast with Stone's example, input prices are explicitly included and the solution yields the cost minimizing allocation of inputs consistent with the budget. (Farrell's original model does this as well; in fact, Farrell's major contribution is considered to be the fact that he decomposed the cost inefficiency into the technical component focused on by Stone and a price-related allocative efficiency component.) We also advocate exploiting duality theory to derive the associated shadow prices of the outputs to provide information concerning the ‘values’ that are implicit in the observed mix of outputs. Stone may also be interested in a variation which explicitly includes a utility function as part of the DEA problem; see Färe et al. (2002).
In addition, recent results also demonstrate that DEA has respectable statistical properties as an estimator, and with the aid of bootstrapping techniques it can be readily adapted to undertake statistical inference and hypothesis testing as well (see Simar and Wilson (2000a)). In conclusion, we would argue that Stone is condemning a discipline for what he sees as the shortcomings of a particular application.
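For statisticians unfamiliar with the mechanics, the core DEA calculation is a small linear programme solved once per unit. The following is a minimal sketch of the input-oriented, constant-returns-to-scale model (the function name and data layout are my own; this illustrates the basic technique, not the customized specifications discussed above):

```python
import numpy as np
from scipy.optimize import linprog

def dea_input_efficiency(X, Y, i):
    """Input-oriented, constant-returns DEA efficiency of unit i.
    X: (n, m) array of inputs; Y: (n, s) array of outputs; rows are units.
    Solves: min theta s.t. sum_j lambda_j y_j >= y_i,
                           sum_j lambda_j x_j <= theta * x_i, lambda >= 0."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision variables: theta, then the n intensity weights lambda
    c = np.zeros(1 + n)
    c[0] = 1.0                       # minimise theta
    A_ub = np.zeros((m + s, 1 + n))
    b_ub = np.zeros(m + s)
    # Input constraints: sum_j lambda_j x_j - theta * x_i <= 0
    A_ub[:m, 0] = -X[i]
    A_ub[:m, 1:] = X.T
    # Output constraints: -sum_j lambda_j y_j <= -y_i
    A_ub[m:, 1:] = -Y.T
    b_ub[m:] = -Y[i]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (1 + n))
    return res.x[0]
```

Each unit's score is obtained by re-solving the programme with that unit as the target; the bootstrap inference of Simar and Wilson is built on resampling such scores.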
Paul Hewson (Devon County Council, Exeter)
I would like to start by asking a naïve question. With the exception of the British Crime Survey data, seven of the variables proposed as police outputs are counts, and one is a percentage. Presumably there is a loss of analytical subtlety when the environmental variables are dealt with as a modified input rather than an offset to some of these count variables. In a statistical context, one would imagine that the variability in estimates of counts increases at least as fast as the underlying mean value for the count, whereas the variability in the percentage estimates may decrease as we approach either end of the scale. I am therefore intrigued by the possibility that, in treating data of this kind empirically, the set of feasible performances will be dominated by outliers. The author's point about the need for a small set of indicators is well made in Section 8.2, but I question the way in which these items were selected. Given a large range of data, is there any scope for dimension-reducing techniques to allow the construction of either lower dimension projections or regular reselection of a smaller number of variables which can act as proxies for a wider range of variables?
Turning to the value-based analysis, I would firstly like to ask about non-decision-making possibilities, such as whether there is any need to have a single consensus vector of valuti. I feel that there could be explorative potential in studying efficiency with a range of valuti, including valuti that represent organizations’ stated priorities, locally determined valuti and valuti for different subsections of the population. This may not avoid the ‘benevolent dictatorship’ problem alluded to, but it might help to inform the way in which improvements in efficiency and policy generally impacted on different sections of the community.
Finally, given the developments in Bayesian statistics since 1957, I would like to finish by asking what potential may exist for approaching value-based analysis in a modelling context. The obvious first advantage is in dealing more accurately with random variation throughout the data, and presumably the value-based element can be dealt with in terms of loss functions or utilities.
It is noteworthy that this paper was presented a month after the National Audit Office (2001) reported on some unacceptable ways in which hospital waiting list figures may have been adjusted. Some of the risks of naïve use of performance indicators may be demonstrated in this report, and it is clear that any further analysis should seek not to add additional dysfunctionalities.
Gary Koop (University of Glasgow) and Mark F. Steel (University of Kent at Canterbury)
Stochastic frontier analysis (SFA), data envelopment analysis (DEA) and the proposed value-based analysis (VBA) involve different sets of assumptions. In applications, these assumptions may or may not be reasonable. Stone is very critical of SFA and we thought it appropriate to note that these criticisms are, perhaps, not as damaging as he suggests. There is a burgeoning literature on SFA with multiple outputs, including some of our own Bayesian work (Fernández et al., 2000, 2002). In economic applications, prices often provide us with the output weights that Stone desires. If these are not available, we would argue that, with a careful choice of outputs, the data-based SFA and DEA approaches will, in many cases, be less objectionable (and more practical) than somehow choosing ‘societal weights’ for outputs (especially given Stone's desire to ‘retain the goodwill of workers’ and need to ‘obtain from each unit … how its own input costs … should be notionally divided’ and ‘negotiate the value … with unit managers’).
We briefly comment on the ‘widely recognized weaknesses in … SFA’ (Section 6).
‘Ignore errors in outputs’
This feature is shared by all methods, and it seems that formal errors-in-variables methods can relatively easily be used in the statistical context of SFA.
‘Make an arbitrary choice of the distribution of u and v’
The use of longitudinal or panel data (Schmidt and Sickles, 1984; Koop et al., 1997; Fernández et al., 1997) can substantially reduce the sensitivity of our results and Bayesian methods allow formal model comparisons and averaging. The question is whether the assumptions are appropriate for the empirical question at hand. For example, SFA typically assumes that ‘measurement error is Normally distributed, independently of efficiency’, whereas DEA–VBA typically assumes that ‘measurement error is identically zero’.
There is a large literature on flexible functional forms which may be sensible in a given application (Stone's use of the Cobb–Douglas form really sets up a straw man) and restrictions of economic theory (e.g. monotonicity and concavity) can trivially be imposed through the prior. Alternatively, nonparametric or semiparametric methods can be used. See Koop et al. (1994).
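To make concrete what even the simplest frontier fit involves, here is a corrected ordinary least squares (COLS) sketch of a single-output Cobb–Douglas frontier. This is a deliberately crude deterministic-frontier illustration of my own (function and variable names invented), not the Bayesian SFA of the references cited:

```python
import numpy as np

def cols_frontier_efficiency(log_x, log_y):
    """Corrected-OLS sketch of a single-output Cobb-Douglas frontier.
    log_x: (n,) logged input (e.g. log expenditure); log_y: (n,) logged output.
    Fits log_y = a + b*log_x by least squares, shifts the intercept up to
    the most favourable residual, and reads efficiency off the gap."""
    A = np.column_stack([np.ones_like(log_x), log_x])
    coef, *_ = np.linalg.lstsq(A, log_y, rcond=None)
    resid = log_y - A @ coef
    # Shift the fitted line so that it envelops the data from above
    u = resid.max() - resid          # shortfall below the shifted frontier
    return np.exp(-u)                # efficiency scores in (0, 1]
```

A more flexible functional form would simply add columns to the design matrix A (e.g. squares and cross-products of the logged inputs, as in a translog).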
Finally, it is important to note that statistical methods allow for probability statements and confidence or credible intervals, which can be of great practical importance (see Kim and Schmidt (2000)).
Emmanuel Thanassoulis (Aston University, Birmingham)
Data envelopment analysis can work with value judgments
In Fig. 7 the data envelopment analysis (DEA) weights are estimates of resources that an efficient force would use per burglary and violent crime cleared (Thanassoulis, 1996, 2001). We refer to them as marginal resource levels (MRLs). We have three properly defined sets of MRLs, one for each of the output mixes reflected on AB, BC and CD. The MRLs of AB render forces on AB 100% efficient, and so on.
Stone asks in Section 8 whether we can really take as a measure of G's efficiency the proportion of its expenditure that the MRLs of BC explain, when they may not necessarily reflect the (unknown) worth of violent crime or burglary clear-ups. There is no reason to limit ourselves to this technical DEA efficiency measure. If the valuti exist, or relative valuti ranges can be derived from stakeholders, then why would we not also use DEA models with weight restrictions, or those for allocative efficiency (Thanassoulis (2001), section 4.7.2), to derive overall (value-based) efficiency measures? Surely it is valuable to know the part of the value shortfall of a force that is due to the incompatibility of its output mix with stakeholder worths (allocative inefficiency) and the part that is due to its inability to gain maximum outputs relative to other forces (technical inefficiency). That is to say nothing of the identification by DEA of suitable role model forces which an inefficient force can emulate, whether it is to gain maximum technical and/or allocative efficiency. On another point, in Fig. 7 force A may not be an efficient peer for any inefficient force, whereas B and C can be peers for many forces. Yet the MRLs of AB render both force A and force B 100% (technically) efficient. Is force A necessarily less efficient in value or technical terms than force B? There is a difference between efficient performance and whether or not the output mix of a force is shared by any other forces.
The problem surely is not the technical efficiency and much other managerial information that DEA yields, but rather how to solicit from stakeholders information on valuti that would enrich the assessment. DEA is a tool for using such valuti information as can be gathered; it does not preclude its use.
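Farrell's decomposition, invoked both here and by Färe and his colleagues, can be stated in a few lines of arithmetic. All the numbers below are hypothetical; with input prices w, cost efficiency factorizes exactly into technical and allocative components:

```python
import numpy as np

# Hypothetical two-input illustration of Farrell's decomposition:
# cost efficiency = technical efficiency * allocative efficiency
w = np.array([1.0, 2.0])        # hypothetical input prices
x_obs = np.array([4.0, 4.0])    # observed input use of the assessed unit
x_tech = np.array([2.0, 2.0])   # radial projection of x_obs onto the frontier
x_min = np.array([3.0, 1.0])    # cost-minimising point on the frontier

cost_eff = (w @ x_min) / (w @ x_obs)    # overall (cost) efficiency
tech_eff = (w @ x_tech) / (w @ x_obs)   # radial contraction, price-independent
alloc_eff = cost_eff / tech_eff         # what remains is allocative
```

Here the unit could produce its output with half its inputs (technical efficiency 0.5), and even the technically efficient input mix costs more than the cost-minimising one (allocative efficiency 5/6).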
The author replied later, in writing, as follows. I want to thank the Society's Programme Committee for accepting this paper for ‘reading’, thereby allowing things to be said that were not ‘read’. Things said by discussants, especially the critical bits, will be the first and perhaps only things read in the printed version. In aggregate their good sense will efficiently add useful public service to an otherwise academic occasion.
The contributions from Professor Smith, Professor Chesher, Professor Farewell, Sir David Cox, Dr Roche and Mrs Galbraith will speak volumes without any significant comment from me. But I cannot refrain from lightly questioning the pessimism in the first and the optimism in the second.
Professor Smith points out that there is no ‘golden rule’ with which to fix global societal priorities in the shape of the valuti {v_i}, and he thinks that we must rely on elected politicians to do that job. At present, police forces with their police authorities claim to be exercising local priorities—and, in this, a traditionally trusting society does not demur significantly. So it would be a small step towards transparency to put Chief Constables (rather than politicians) into conclave with a modicum of statisticians—until they emerge with agreed global valuti and a formula for environmental priority adjustment of the index V (not forgetting adjustment for documentable deficiencies in the local criminal justice system).
Professor Chesher wants to see research effort devoted to establishing a ‘minimum [over output profiles] cost function’, the same for all forces, so that we would know how much inefficiency there is above the largest V_i. Until that is done, we have to live (do we not?) with Farrell's preference for a yardstick based on observed performances, albeit one that is less empirical and more normative than Farrell's.
Unless practices change, optimism may also be present in Dr Roche's suggestion that National Statistics should embrace public service statistics. A less ambitious suggestion (Stone, 2002) is that mandarins should be free to relax the constraints of the usual tendering and contracting procedures—to bring into play, informally and at low cost, a wider range of outside judgment, in cases where broad judgment might help at the start of any project and where either convergence or divergence of outside views would be informative.
Dr Foreman-Peck has thrown additional light on the important difference between what motivated Farrell in his agricultural economics example and what now clearly motivates society in its assessment of police force outputs. It is the difference between an arguably irrelevant technical efficiency and an overall efficiency based on valuti that includes a compensatingly irrelevant allocative efficiency.
I see Dr Tofallis's ‘efficiency profiling’ as providing something smoother than the current practice of presenting public service units with a large number of performance targets, as a means of provoking improvements here and there but without making clear where political masters may choose to intervene. Provided that interest can be limited to technical efficiency and it is applied with the perceptiveness of its inventor, the technique will be preferable to such abusive practice.
Mr Hewson has beaten up a number of hares that would require more ground than I have on which to pursue them properly. His suggestion of using different sorts of valuti is interesting, as is the reminder of a particular dysfunctionality associated with the ‘political ownership’ of National Health Service performance indicators.
Professor Thanassoulis simply repeats Farrell in pointing out that, when valuti are fixed, overall value efficiency can be factorized into allocative and technical efficiency—usefully for Farrell. Here, I cannot see the logic in letting technical-efficiency-based data envelopment analysis (DEA) determine force-specific valuti for V, when the valuti are not fixed but specified by ‘weight restrictions’, e.g. intervals (which is what the Spottiswoode report recommends). Clearly the option of somehow using intervals is a generalization of using single values, but the value of a generalization depends on how it is exploited.
Professor Senn has milked a nice metaphor for all that it can give. For the B-word he wants me to use, I refer him to the contribution of Dr Koop and Professor Steel. Their first reference shows how far we have come since the simple kinetics of a billiard ball set the Bayesian ball rolling. If we try to apply the Bayesian stochastic frontier analysis (SFA) model of Fernández et al. (2000) to the case in hand (with T = 1 for 1 year of data and our s taken to be 30), and reversing the order in which that paper builds the likelihood but without changing the model, we would have the following.
 (a)
The distribution of the shape statistics S of the 30 logged outputs L = log(Y) would be boldly specified as that of the degenerate 30-parameter log-Dirichlet distribution ‘LD(s)’. (S is the maximal invariant under affine transformations a + qL for q > 0. Writing a = Cα where Σα_j = 1, this specification fixes q and α in the identification of the distributions of Cα + qL and LD(s), but not the distribution of C.)
 (b)
The distribution of C would be fixed (and hence that of Y) by giving θ =_def {α_1 Y(1)^q + … + α_30 Y(30)^q}^{1/q} the distribution of exp{β log(x) − γ + σE}, where γ > 0 and E is N(0, 1).
Proper priors would then be given to s, q, α, β, γ and σ of the sort (large variances etc.) that are mistakenly thought to evade all the problems of improper priors. In (b), θ indexes the ‘aggregate’ or ‘technologically equivalent’ output profiles, with respect to which exp(−γ) is taken to represent the ‘efficiency’, whose posterior distribution given an observed performance profile (x, y) would then be calculated by Markov chain Monte Carlo algorithms.
I fear the introduction of such econometrically motivated models into the practical definition and measurement of public service efficiency: their intimidating complexity, barely understood by their creators, would inhibit the necessary wider understanding of their weak points. Even before thinking about (b) (in which the α and q of (a) mysteriously resurface without justification), the model should be required to pass some nonBayesian test of the goodness of fit of the sample of 43 realizations of S to the family of distributions prescribed for it. In their banking efficiency study, Fernández et al. (2000) did not venture such a test, which would be in the spirit of George Box's approach to scientific Bayesian modelling (Box, 1980). Is it fanciful to think that, if you tried to implement such a model, you might be charged with ‘obstructing the police’ in their modest effort to improve?
I am sure that Dr Torsney has very good reasons (to do with ‘understanding, intervention and prediction’ rather than ‘evaluation and assessment’) for wanting to model the health service processes that he studies with such care. But having now seen that context-free theoretical modelling may be either misleading or hopelessly ambitious, I prefer here to ‘pass’ on the challenging questions that he has raised.
Professor Färe and his colleagues chide me with lack of respect for powerful theory, as being too blinded by a particular application—one that happens to absorb nearly £8 billion of public expenditure. They do not tell me where I have shown disrespect. What I do not respect is the opinion of a DEA advocate in a hospitable Home Office seminar—that I must be wrong about the Spottiswoode report because there are now at least 5000 doctoral theses on the DEA–SFA techniques that it recommended.