Do Performance Measures of Donors' Aid Allocation Underperform?

Indices of donor performance abound. Their recent popularity has occurred within the context of pessimism over aid's impact and optimism over the effect of changes in donor behaviour. Rankings of donor allocative performance aim to change donor behaviour, either through direct pressure on governments or indirectly through public engagement. The indices themselves rely on descriptive measures, and typically claim methodological superiority over positive alternatives due to their simplicity. However, there are two problems. First, measures do not seem robust to simple variations in methodology. Second, correlation amongst competing indices is low, leading to a host of contradictory judgements. This offers neither clear technical guidance nor consistent political pressure. The advantages and disadvantages of the approach are discussed, building upon the more general critique of aggregate indices. I suggest a graphical solution that embraces the advantages of the descriptive approach (including ease of public communication) while avoiding some of its major weaknesses (which typically stem from aggregation).

2006). The RPI resembles both the MPI and API, but policy 1 replaces population as a factor of interest. Population does influence another part of the broader index in determining proliferation, but is not incorporated into the measure of selectivity at all. The policy and poverty weights are found separately in a similar fashion to the MPI and API, and then multiplied together to obtain the weight used in the index. The fourth, EW (Easterly and Williamson, 2011), does not sit within the traditional index method, but rather uses the headcount measure: it calculates the percentage of a donor's aid that meets a given criterion. Three variables are chosen and given additive weights: low-income countries (50 per cent), politically free countries (25 per cent, based on Polity IV data) and less corrupt countries (25 per cent, based on ICRG data). Thus any aid to a country that meets all three criteria receives a weight of 1, with weights of 0.75, 0.5, 0.25 and 0 for other countries, depending on which criteria they meet. The fifth measure, KRE (Knack et al., 2011), uses the average of three coefficient estimates from a sparsely specified regression: aid is regressed on income, policy 2 and population (where all variables are logged). The coefficient estimates are then combined to give a final score.
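The headcount logic of the EW measure can be illustrated with a minimal sketch. It assumes only the additive weights stated above (0.5, 0.25 and 0.25); the recipient labels, the aid amounts and the function names are hypothetical and are not taken from Easterly and Williamson (2011).

```python
from dataclasses import dataclass

# Additive weights as described in the text for the EW measure.
W_LOW_INCOME, W_FREE, W_LESS_CORRUPT = 0.50, 0.25, 0.25


@dataclass
class Recipient:
    name: str           # hypothetical label, not a real country
    low_income: bool    # meets the low-income criterion
    free: bool          # meets the political freedom criterion
    less_corrupt: bool  # meets the corruption criterion


def recipient_weight(r: Recipient) -> float:
    """Sum the weights of the criteria a recipient meets (0, 0.25, ..., 1)."""
    return (W_LOW_INCOME * r.low_income
            + W_FREE * r.free
            + W_LESS_CORRUPT * r.less_corrupt)


def headcount_score(aid: dict, recipients: list) -> float:
    """Share of a donor's aid that counts, given the recipient weights."""
    total = sum(aid.values())
    return sum(aid[r.name] * recipient_weight(r) for r in recipients) / total


# Hypothetical donor portfolio (aid in arbitrary units).
recipients = [
    Recipient("A", True, True, True),     # weight 1.0
    Recipient("B", True, False, False),   # weight 0.5
    Recipient("C", False, False, False),  # weight 0.0
]
aid = {"A": 20, "B": 30, "C": 50}

print(headcount_score(aid, recipients))
```

In this illustration only 35 per cent of the donor's aid counts towards its score, because half of the portfolio goes to a recipient that meets none of the criteria.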

a. The Aim and Approach of Rankings
Measures of donor allocative behaviour typically aim to change the behaviour they measure (Easterly and Williamson, 2011;Knack et al., 2011). Birdsall (2011) states three ways in which this might happen: by changing the focus of a debate, by highlighting certain technical issues and by encouraging advocacy through public communication. These three ways clearly overlap, and can be subsumed under the general heading of policy influence. Roodman (2011b) states that their main use is in educating the public, which is the first step in creating political pressure for a change in behaviour. For this purpose, rankings offer clear advantages: they condense a large amount of information into an easily understood format. Furthermore, they are media-friendly sources of national pride, shame and controversy. This feeds into the aim of changing donor behaviour through influencing public opinion.
The aim of public education explains not only the relatively simple methods of presentation, but also the preference for simple methods of measurement. The selection of simple, easy-to-understand methods coincides with a growing distrust of regressions (Roodman, 2007). This distinguishes the descriptive research from explanatory research (e.g. Alesina and Dollar, 2000; Berthélemy and Tichit, 2004; Clist, 2011), which seeks to explain allocations by fully specifying a regression. By contrast, the descriptive approach merely describes a certain aspect of an allocation, meaning it is more ad hoc and flexible. In avoiding the need to fully specify regressions, descriptive measures often claim to be methodologically simpler and by implication more trustworthy. This claim can be found in Roodman (2004, p. 18), for whom a descriptive approach 'minimizes questions about proper modeling specification'; in Nunnenkamp and Thiele (2006, p. 1179), 'we follow Roodman . . . who stresses the risk that cross-country regression models are misspecified and, thus, favours a simpler approach'; and in Easterly and Williamson (2011, p. 2), 'Once the Pandora's box of conditioning factors is opened, it is very hard to decide where to begin or where to stop'. However, this claim is more widely repeated than well established. While it is clear that these approaches do not claim to know the data generating process, this does not automatically equate with minimising questions of proper specification. It merely means choosing a different set of questions: it is with those questions that this section deals.

b. What to Measure
The variables chosen in any measure seek to reflect a factor that is thought to constitute part of the latent variable under examination. In many cases, the latent variable is rather grand and difficult to define, but here it is donor allocative performance: the desirability of a distribution of a given amount of aid among a given set of potential recipients. Some of the indices I discuss have a broader focus, of which I examine merely the selectivity subcomponent. Having decided the latent variable, there are two questions related to what to measure: the factors and variables. Which factors make up allocative performance is surprisingly contentious. Clist (2011) introduced the 4P framework (poverty, population, policy and proximity) to explain aid allocation, the first three of which are the most commonly mentioned factors in normative accounts (Llavador and Roemer, 2001;Collier and Dollar, 2002). Poverty has the broadest appeal, and is typically measured using income per capita given incomplete poverty data. This proxies for both the amount of poverty in a country and the resources for dealing with this poverty. All five measures discussed here include it. Population is a measure both of the amount of poverty (in combination with a measure of income per capita) and potentially the marginal effectiveness of aid. 3 Most normative accounts discuss population and most measures include it; however, the RPI and EW are completely insensitive to population in the selectivity part. This leads to some surprising implications; insensitivity to population means that all donors could improve their RPI selectivity score by reallocating all of their aid to the Pacific island of Kiribati (with a population of around 100,000).
The idea that the policy environment of a country influences the marginal effectiveness of aid is widely recognised due to Collier and Dollar (2002), but also hotly contested. An alternative rationale for allocating aid in response to policy is to incentivise certain policy environments. While the RPI, EW and KRE include policy, the API and MPI do not. In both cases the exclusion of policy is explained as the direct result of a normative belief that aid should be allocated on income (and population) grounds alone (White and McGillivray, 1995). However, the selection of what to measure is influenced by a normative vision in every case, be it explicit or implicit. All measures of donor allocative performance will necessarily contain a judgement as to the correct factors to include, which will reflect certain ideals and beliefs. While these three factors (Poverty, Population and Policy) are the most commonly discussed, none of the five measures are always sensitive to all of them (Anderson and Clist, 2011). This is due to both deliberate decisions resulting from normative differences and technical features of the measures.
It is worth noting that measures of policy or governance differ from measures of income and population in that they are necessarily more subjective: there is disagreement over the underlying factor, not just the specific variable. While there may be technical disagreements regarding the correct measure of income or population, there is greater room for philosophical and political disagreements regarding what constitutes a good policy or institutional environment. Disagreement on the donor side is obvious: for example, there are clear differences of opinion between the United States and Nordic donors over the correct type of regulation in an industry. Between researchers creating indices, this disagreement can be more subtle. For example, Easterly and Williamson (2011) and Knack et al. (2011) both purport to measure policy selectivity. To represent this, the latter use the World Bank's in-house measure of policy (CPIA) that emphasises public sector and economic management, whereas the former are more focused on political freedom and corruption. The difference in variable choice reflects a more fundamental disagreement. Both sets of authors argue that they are measuring selectivity, despite very different theoretical conceptualisations of what constitutes policy selectivity.
P. CLIST

c. How to Measure
One dividing line between explanatory and descriptive research is that descriptive research typically measures absolute characteristics, whereas explanatory research measures conditional characteristics. The desirability of controlling for other factors depends upon the goal of the research. Knack et al. (2011) argue, from within the descriptive tradition, that controlling for other factors is useful if the judgements are to be used to influence policy, as it controls for the limitations on donor actions. However, the approach of KRE appears flawed as they control for only some aspects of allocation but leave out important confounding factors such as historical, linguistic and commercial links between recipient and donor. As such, it cannot be thought of as a fully specified regression, but neither does it constitute a test that is always sensitive to the three factors it measures (Anderson and Clist, 2011). In this way, the approach of KRE appears to miss out on the advantages of both the descriptive and explanatory approaches. Easterly and Williamson (2011) argue that examining absolute sensitivity provides an informative alternative, complementing explanatory work. This seems a sensible role for descriptive research: while the positive literature is more econometrically advanced, controlling for other factors may mask population, policy and income sensitivity in absolute terms. Imagine that the positive literature finds that a given donor has low income-sensitivity but high concern for former colonies: clearly a donor's willingness to support its former colonies may be the best explanation of its aid allocation. However, if these colonies are also poor, the positive coefficient on colonies will dampen the coefficient on income, which may lead the reader to underestimate the absolute poverty-selectivity of the donor.
If donors are to decrease fragmentation they should not be punished for allowing colonial and linguistic links to influence allocation patterns, as this promises to reduce proliferation and therefore transaction costs. This makes absolute judgements of poverty sensitivity a useful tool in judging donors, as conditional judgements of poverty-sensitivity may mislead. Any conditional effect is clearly best estimated within the explanatory approach, and the descriptive approach should then concentrate on its advantage of absolute measures. Klugman et al. (2011) discuss the importance of scale invariance within the context of the human development index (HDI). If a measure is not scale invariant, the way it is measured (e.g. using a five-point or 10-point policy scale) will affect the rankings. The old HDI formula, like the MPI, API and RPI, tries to avoid this problem by rescaling the index with reference to minima and maxima. However, this introduces the problem of sensitivity to those bounds. This is then another source of 'methodological uncertainty' that some purport to minimise by using the descriptive approach. The most common index approach to scaling data is to use the form \( x = \frac{y_i - y_{\min}}{y_{\max} - y_{\min}} \): this is found in the RPI and MPI. This attaches a weight (\(x\)) to the recipient based on its desirability as an aid recipient. To understand the effect of this, consider the weight for the three common factors (population, policy and income) for the median potential recipient in a hypothetical index. For population, using 2008 data, \(N_{\max}\) is 1.3 billion (China), \(N_{\min}\) is 31,000 (Palau), and so the median recipient (Laos, 6.3 million) has a section weight of 0.005 (calculated using logged population, where China would have a weight of 1 and Palau 0). For income, Guatemala is the median recipient with an income of US$4,285 (per capita): the section weight is 0.14 (using logged income data).
For policy (WGI), the median recipient is Malawi with −0.33: the section weight is 0.45. We can think of these as discounts of 99.5, 86 and 55 per cent, respectively. Therefore, because the typical index scaling method relies on minima and maxima, and population is heavily positively skewed, any aid to the median recipient in terms of population is essentially dismissed. 4
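A minimal sketch of the min-max rescaling may help to show how strongly the bounds drive the section weight. It is applied here to the raw population figures quoted above, which is the assumption under which the result rounds to 0.005; the `log` switch and the function name are hypothetical additions that show how much the treatment of logs matters for a skewed variable.

```python
import math


def minmax_weight(y, y_min, y_max, log=False):
    """Rescale y to [0, 1] using the sample minimum and maximum.

    `log` is a hypothetical option: if True, all three values are
    logged before rescaling.
    """
    if log:
        y, y_min, y_max = math.log(y), math.log(y_min), math.log(y_max)
    return (y - y_min) / (y_max - y_min)


# Population figures quoted in the text (2008 data).
N_MAX = 1.3e9    # China
N_MIN = 31_000   # Palau
N_MED = 6.3e6    # Laos, the median recipient

raw_weight = minmax_weight(N_MED, N_MIN, N_MAX)
log_weight = minmax_weight(N_MED, N_MIN, N_MAX, log=True)

print(f"raw population:    {raw_weight:.4f}")
print(f"logged population: {log_weight:.4f}")
print(f"implied discount on the raw weight: {1 - raw_weight:.1%}")
```

Because the maximum is set by a single very large country, the median recipient sits almost at the bottom of the raw scale. Any change to the sample that alters the minimum or maximum would shift every recipient's weight.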

d. How to Aggregate
Once the different aspects of allocative performance have been decided upon and measured, descriptive approaches commonly aggregate them to create a ranking based on a single number. While in theory the different factors can be weighted, an equal weighting is often used. This is chosen more as a default rather than for any inherent attractiveness or theoretical justification. Ravallion (2012, p. 15) argues that the 'degree of robustness to weights depends on the intercorrelations among the components. If these are perfectly correlated then . . . the result is entirely robust to the choice of weights'. Figure 1 is a scatter graph of poverty and policy for potential recipients of aid. It shows that the two variables are very highly correlated, but this correlation is negative (the correlation coefficient between income and policy is 0.72). Thus the relative implicit weight given to the two factors will greatly affect any final rankings. The weight is not just the statistical parameter that is called weight, but rather the sensitivity of rankings to changes in that part of the data. From the example above, it is clear that the index method of rescaling the data gives greater weight to population even if this is aggregated using an 'equal' weighting system. Two approaches to combining the various factors are common: geometric and arithmetic averages. For example, the geometric average is used by RPI, and an arithmetic average is used by EW. The simple difference between the two approaches is that under the geometric approach, donors are punished more for simultaneously performing badly in multiple areas. In contrast, under the arithmetic average, a given portion of the overall weight is decided by a specific factor: for example, EW give 50 per cent of the weight to income and 50 per cent to policy. To illustrate the difference between the two approaches, imagine a recipient exhibited the median characteristics described above. 
Using a geometric average, any aid would receive a weight of 0.000315, whereas under an arithmetic average it would receive 0.199. Thus, aid given to this imaginary median recipient would either be discounted by 99.99 per cent or 80.1 per cent. The heavily skewed nature of population determines a third of the arithmetic weight, but completely dominates the geometric average. A further problem with the geometric average arises when the individual components are allowed to be negative. This could potentially happen with the RPI, as the minima and maxima are set in relation to data from the first year of its calculation. Thus, a recipient with either a lower governance score than Afghanistan (in 2000) or higher per capita income than Singapore (in 2001) will receive a negative score. This is troubling as a slight change in one variable could have a large effect on the index. While unlikely in the case of the RPI, this approach can give perverse results if two components are negative. In this case, a recipient would receive a high weight because of its exceptionally poor characteristics.
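The contrast between the two aggregation rules can be checked with a short sketch. It assumes that the geometric aggregate is the simple product of the three section weights, which is the reading that reproduces the 0.000315 figure; the arithmetic mean of the rounded section weights comes out at about 0.198, close to the 0.199 reported, with the small gap presumably due to rounding of the inputs. The negative values in the second half are hypothetical and serve only to illustrate the sign problem.

```python
import math

# Section weights for the median recipient, as quoted in the text.
section_weights = {"population": 0.005, "income": 0.14, "policy": 0.45}


def multiplicative(weights):
    """Product of the component weights (assumed form of the geometric rule)."""
    return math.prod(weights)


def arithmetic(weights):
    """Equal-weighted arithmetic mean of the component weights."""
    weights = list(weights)
    return sum(weights) / len(weights)


geo = multiplicative(section_weights.values())
ari = arithmetic(section_weights.values())
print(f"multiplicative: {geo:.6f}")
print(f"arithmetic:     {ari:.3f}")

# Hypothetical two-component case: a recipient outside both bounds
# (two negative components) versus an ordinary recipient.
outside_bounds = multiplicative([-0.2, -0.5])
ordinary = multiplicative([0.2, 0.3])
print(outside_bounds > ordinary)
```

Under the multiplicative rule the recipient with two negative components is valued more highly than the ordinary one, which is the perverse result described above.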

SOME SIMPLE TESTS OF ROBUSTNESS
The preceding section separates out the decisions of what to measure, how to measure and how to aggregate. However, the effect of each decision needs to be understood in conjunction with the others. The relative importance of one factor is not determined solely by how factors are aggregated, but is also influenced by decisions over variables and measurement. This section examines what effect these decisions have on final rankings, by varying some of these decisions. It is not a full sensitivity analysis, as the number of possible descriptive measures makes this impractical (a case of combinatorial explosion). Instead, I show the effect of simple, minimal diversions from a measure's implementation, beginning with the EW and KRE measures. EW uses a headcount method, and so trade-offs between factors are quite easy to understand. EW calculate the amount of aid going to low income (using a World Bank classification), free (indicated by a Polity IV democracy score of 8 to 10) and less corrupt countries (from the ICRG dataset). This could be viewed as an index method, where potential recipients receive a weight of 0, 0.25, 0.5, 0.75 or 1. There is no change to the weight of a recipient unless it crosses a boundary: two recipients that have polity scores of 1 and 7 are equally dismissed, and recipients with polity scores of 8 and 10 are equally valued. In the same way, all middle-income countries are discounted, and all low-income countries valued. Unfortunately, given the proprietary nature of the ICRG data used, I am not able to calculate the effect of small changes in methodology. In lieu of this, I discuss the specific problems and anomalies of the headcount approach in this setting. Table 1 lists the relevant characteristics of six countries, along with the implied weight this recipient would receive (this abstracts from the ICRG corruption data, and so 0.25 is left off all recipients).
This illustrates the problem of thresholds in the headcount approach, where large differences in characteristics can mean no difference in the weight, and yet conversely, small differences that cross a threshold can mean very different weights. So, Senegal and Argentina are equally dismissed as non-low-income countries whereas Guinea-Bissau and Kenya are equally valued as low-income countries. Only Argentina and Ghana are valued for their Polity IV score, with all others receiving the same discount for their low scores. The table also illustrates a specific problem with the headcount measure in this instance. As shown in Figure 1, income and policy data is typically highly positively correlated. Because of this, a headcount measure is more likely to value recipients that just meet a given criterion. In other words, of all recipients that meet the low income criterion, the relatively richer are the most likely to meet the policy criteria.
What effect does varying a threshold have? I investigate the effect of changes in the threshold that determines a country's classification as 'free', as a test of sensitivity of EW rankings to a minimal methodological change. Easterly and Williamson (2011) use Polity IV data, where a score of 8 or above is classified as free. I calculate the effect of varying this threshold to be 7 or above, with original and alternative rankings shown in columns 1 and 2 of Table 2. These rankings only refer to the freedom part of their selectivity measure. There are substantial rises in the rankings for Austria, Japan and the UK, and large falls for Luxembourg, Sweden and the USA. Only four donors do not move, and the average change in ranking is 2.8 places; the judgement of donor sensitivity to democratic principles is sensitive to threshold choice. This choice is arbitrary, with little justification or guidance for choosing one number over another. This sensitivity is perhaps to be expected with a headcount approach. However, because of the positive correlation between income and governance data (Easterly and Williamson, 2011) the threshold choices appear crucial.
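The mechanics of this threshold test can be sketched with toy data. Everything below is hypothetical: the three donors, the three recipients, their Polity-style scores and the aid amounts are invented for illustration and do not reproduce Table 2. The sketch only shows how moving the cut-off from 8 to 7 can reorder donors when some aid goes to a recipient sitting just below the original threshold.

```python
# Hypothetical Polity-style scores for three recipients (not real data).
polity = {"R1": 9, "R2": 7, "R3": 4}

# Hypothetical aid allocations (arbitrary units) for three donors.
aid = {
    "D1": {"R1": 50, "R2": 0, "R3": 50},
    "D2": {"R1": 30, "R2": 60, "R3": 10},
    "D3": {"R1": 40, "R2": 20, "R3": 40},
}


def free_share(allocation, threshold):
    """Share of a donor's aid going to recipients at or above the threshold."""
    total = sum(allocation.values())
    free = sum(a for r, a in allocation.items() if polity[r] >= threshold)
    return free / total


def ranking(threshold):
    """Rank donors (1 = best) by their share of aid to 'free' recipients."""
    scores = {d: free_share(alloc, threshold) for d, alloc in aid.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {d: place for place, d in enumerate(ordered, start=1)}


rank_8 = ranking(8)  # original cut-off: 8 or above counts as free
rank_7 = ranking(7)  # alternative cut-off: 7 or above counts as free

# Average number of places moved between the two rankings.
avg_change = sum(abs(rank_8[d] - rank_7[d]) for d in aid) / len(aid)

print(rank_8)
print(rank_7)
print(round(avg_change, 2))
```

In this toy case the best and worst donors swap places although no allocation has changed, which mirrors the kind of movement reported for the actual EW rankings.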
The KRE differs from the other measures as it uses a regression to determine the scores of donors. The final score is an arithmetic average of the three coefficient estimates, and as the  Table 2 alongside the original ranking in column 3. Only three donors have the same ranking (Belgium, Luxembourg and New Zealand), and the average change is 2.6 places. Korea and Norway appear to be much worse donors when using the WGI data rather than the CPIA, dropping from 11th to 17th and from 6th to 12th, respectively. Conversely, Finland is viewed more favourably, climbing from 16th to 10th. These differences could be the result of differences in samples, as the coverage of the two variables differs. The WGI and CPIA share very similar aims, the former even incorporating the latter in its assessment. For the donors that look worse because the CPIA is used, the WGI is clearly preferable. For others however, it is difficult to see how an objective decision could be made, and how one should choose between the two sets of results is far from clear.  there is no guidance on selecting the sample). While specific details may differ, 6 these arguments apply to the API and MPI in general terms. The marginal rate of substitution (MRS) is a useful concept in understanding the three indices. It describes the amount of one variable that must be increased to compensate for a shortfall in another. The index method of rescaling the data means that the maxima and minima in each variable are crucial as they determine the MRS. The line in Figure 1 represents the MRS used in the RPI, running from the maximum to the minimum. It can be thought of as an isoquant, where any recipients lying on the line are equally valued. The best recipient would reside in the south east corner of the graph, and be both poor and well governed. For GDP per capita, the bounds are $21,869 and $81 given by the 2001 scores of Singapore and the Democratic Republic of Congo, respectively. 
For the governance score, the range is −2.25 to 1.44, given by the scores of Afghanistan and Singapore (2000 data were used as 2001 data do not exist). The RPI equates the two scales: thus $21,788 per capita (logged) is equated to 3.69 on the WGI policy score. Using a different sample, variable or base year would change the minima and/or maxima, which would in turn change the MRS and ultimately the relative performance of donors. I examine each of these methodological choices in turn.
The choice of sample may affect the MRS through deciding the maxima and minima of the measure. In the RPI, a number of countries are excluded as being rich 7 in a decision made by the researchers in the first year of the CDI's operation, on the basis of which countries were plausible aid recipients. However, Israel is excluded despite being a very large aid recipient. Neither is it a strict implementation of an income threshold in 2002, as countries of similar incomes to those excluded are left in. For example, Portugal (21,372) is excluded as rich whereas Singapore 8 (36,076) and Cyprus (23,590) are included. Singapore, a high income country, determines one end of both policy and income scales, and the inclusion of these high income countries leads to some surprising implications. For example, Poland received a selectivity weight of 0.47 as a donor in 2009 (Roodman, 2011a), but in that same year it received a score of 0.59 as a potential recipient (Roodman, 2011b). Thus, by ceasing to allocate aid to other countries and instead giving all its aid to itself, Poland would improve its selectivity score by around 25 per cent. This is not an isolated case, as several of the donors that are included in the sample would receive a higher selectivity score if they reallocated all their aid to themselves: the Czech Republic would increase from 0.58 to 0.59, Turkey from 0.40 to 0.47 and Hungary from 0.57 to 0.61.
While the RPI methodology does not change and the base year is fixed, it updates every year such that new data for 2001 are allowed (if applicable) to update the reference points of maxima and minima. The sample of excluded countries does not include Bermuda, which is included as a potential recipient for 2009 data, despite having a GDP per capita of 66,268 (Roodman, 2011a). 9 Currently, missing data for other variables in 2002 mean that Bermuda does not set the income maximum in the base year (with 59,699). If the data were provided, it would have a large effect on the MRS, with 3.32 on the governance scale equating to 59,618 on the income scale. Table 3 shows the rankings using original methodology in column 1, and those allowing Bermuda to set the maximum income in column 2. Of the 23 donors, seven are unchanged, with an average change in ranking of 1.4 places. The seemingly small decision to include Bermuda means that Italy and Australia fall by three places and France and Germany rise by three and four places, respectively. These changes are not the result of a large change in sample, and Bermuda's exclusion appears to be the result of data coverage rather than a deliberate decision. Thus, even this small change of sample affects the rank of the majority of bilateral donors. The decision of which recipients to include is a small detail, was taken without any theoretical guidance and was not the strict application of a threshold. I do not wish to argue that the RPI made the wrong choice in selecting the sample, but rather a more pessimistic point: there is no obviously superior way of deciding which countries in the world are potential recipients, and yet this decision has real effects in determining the MRS and the subsequent rankings of donors.
Column 3 of Table 3 reports the alternate rankings found if the Polity IV variable used by EW is used instead of the WGI. As noted above, changing the variable changes the sample due to differences in data coverage. Thus, it is to be expected that the measure is more sensitive to the change in variable than to allowing one extra data point (Bermuda in 2002). Even taking this into account, the differences are striking: an average change of seven places. 10 Greece and Portugal switch places due to the change in variable: between 4th and 23rd. Large changes can also be seen in the rankings of Ireland, Korea, Italy and Spain. If the Polity IV data were used instead of the WGI, Greece, Italy and Spain would see criticism replaced with praise, with the opposite effect for Ireland, Portugal and Korea.
Note: (i) The RPI original column was calculated using the maxima and minima reported in Roodman (2011b).
(ii) In columns 4 and 5, as well as Figure 3, I use the actual data and so rankings differ (see footnote 5 for one such difference).
(iii) Average change refers to the average number of places difference between the original and different methodologies.
6 Most notably, the RPI includes policy and not population as a factor, with the associated problem of greater subjectivity.
7 Specifically, they are Austria, Belgium, Denmark, France, Germany, Italy, Netherlands, Norway, Portugal, Sweden, Switzerland, UK, Finland, Iceland, Ireland, Luxembourg, Greece, Spain, Canada, USA, Israel, UAE, Japan, Taiwan and Australia. Thanks are due to David Roodman for clarifying this, and several other points (personal correspondence).
8 In documentation of the RPI, Singapore's 2002 GDP per capita is stated to have been 21,869. However, current data show it as 36,076. It is not clear whether Singapore is actually included in the base year, where this figure comes from, nor what effect it has had, if any.
Turning to the base year, if 2002 were chosen instead of 2000, the bottom of the policy range would be Iraq (−1.88), and the top of the range would be unchanged (given by Singapore with 1.44). Assuming the income range was stable, this would equate $21,788 per capita (logged) with 3.32 on the WGI policy scale. To investigate the effect of such changes, I recalculated the scores of each donor using base years between 1996 and 2008: columns 4 and 5 of Table 3 report the highest and lowest rank of the bilateral donors. The RPI uses 2001 as the base year as these were the most recent data available in the year the index started, meaning this exercise can be thought of as a test of how robust the rankings are to the year in which the Commitment to Development Index started. Four donor rankings are unaffected by such a change: Belgium, Denmark, Luxembourg and Switzerland. For the remaining 19, their performance relative to other donors depends on which of the 13 years is used as the reference point. The changes may not seem large: the largest difference is four places, and the average change is 1.4 places. However, remember that nothing has changed apart from the base year. The choice of base year determines which of New Zealand, Italy and Greece was named the worst donor in terms of selectivity, keeping all other methodological aspects constant. These changes come from changes in the MRS that result from different minima and maxima being used. The sensitivity of rankings to the arbitrary choice of base year is worrying: even this seemingly innocuous decision alters rankings. The cumulative effect of multiple small decisions can only be larger, which questions the robustness of the five approaches to even minimal changes in methodology.
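The base-year arithmetic above can be reproduced in a few lines. The sketch uses only the bounds quoted in the text; the function names are hypothetical, and the recipient with a governance score of 0 is an invented example used to show how the same score is rescaled differently once the minimum moves.

```python
# Bounds quoted in the text.
INCOME_MAX, INCOME_MIN = 21_869, 81           # Singapore, DR Congo
POLICY_MAX = 1.44                             # Singapore
POLICY_MIN_BY_BASE_YEAR = {2000: -2.25,       # Afghanistan
                           2002: -1.88}       # Iraq

income_range = INCOME_MAX - INCOME_MIN


def policy_range(base_year):
    """Width of the governance scale implied by a given base year."""
    return POLICY_MAX - POLICY_MIN_BY_BASE_YEAR[base_year]


def rescaled_policy(score, base_year):
    """Min-max rescaled governance score under a given base year."""
    return (score - POLICY_MIN_BY_BASE_YEAR[base_year]) / policy_range(base_year)


print(income_range)                      # income range equated to the policy range
for year in (2000, 2002):
    print(year, round(policy_range(year), 2),
          round(rescaled_policy(0.0, year), 3))
```

The income range is equated to a narrower governance range when 2002 sets the minimum, so each governance point carries more weight relative to income, and the hypothetical recipient's rescaled score falls even though its raw score is unchanged.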

DISPARITY OF RANKINGS
Despite the pessimism of Section 3, it is possible that the sensitivity of donor allocative performance measures is academic. If the popular measures concur, then any methodological difference is perhaps not overly troublesome. For this reason, I recalculate the five measures and display the results graphically, following Høyland et al. (2012). However, while they display results of sensitivity tests, Figure 2 shows the (rescaled) raw scores and rankings using the canonical methodology of the five measures. They were calculated using recent data (2010 for aid, 2009 for all independent variables) for the 23 OECD/DAC bilateral donors. The exception is Easterly and Williamson (2011), who use a proprietary dataset for governance, so their original raw scores and rankings are used for the 22 donors that they include (they exclude Korea) using 2008 data. The left of the figure shows the raw scores, which are rescaled to fit between 0 (the worst donor) and 100 (the best donor). The right of the figure shows the actual rankings from 23 to 1 (the actual scores are shown in Table A1 in the Appendix). This is not a full sensitivity analysis, but rather a subset of possible judgements. If more indices were included or the full gamut of possible rankings were explored, we would expect the range to increase dramatically. It is not a sensitivity test and is more likely to produce false positives than false negatives, so a failure to show an agreement on this simple test would be particularly worrying.
In Figure 2, the donors are ranked in descending order of average (rescaled) donor performance: the UK is considered to be the best donor, and the worst is Greece. If the measures tended to concur, there would be a clear diagonal line running from top right to bottom left in both raw scores and rankings. Instead, we see a great disparity of rankings. The two donors with the smallest variability are Greece and Ireland; for all others there appears to be real disagreement about how good a donor's allocation is. The best single donors according to the RPI, MPI, and KRE and EW measures (Portugal, New Zealand and the USA, respectively) are ranked, on average, among the worst. For the majority of donors, it is not clear which half of the distribution they should reside in, and there are few consistent orderings of one donor over another. It is worth stressing that this is not a test of the effect of the full range of methodological uncertainty: just the results of five popular/recent measures. The disparity of descriptive results is not a recent trend: White and McGillivray (1995) proposed two measures that met their criteria yet gave results with a rank correlation of zero.
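Agreement between two rankings can be summarised with a rank correlation. The following is a minimal sketch of Spearman's coefficient for rankings without ties; the five donors and the three sets of rankings are hypothetical and do not correspond to any of the measures discussed here.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, no ties."""
    n = len(rank_a)
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of five donors under three invented indices.
index_1 = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
index_2 = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}   # exact reversal
index_3 = {"A": 2, "B": 1, "C": 4, "D": 3, "E": 5}   # neighbours swapped

print(spearman_no_ties(index_1, index_2))
print(spearman_no_ties(index_1, index_3))
```

A coefficient near 1 would indicate that two measures broadly concur, a value near zero that knowing a donor's rank on one says little about its rank on the other, and a negative value that the two measures tend to reverse each other's judgements.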

a. Undermined by Their Own Flexibility
The aim of the various indices is to change the behaviour they measure. It is often envisaged that this acts through education of the public, which in turn leads to political pressure. Alternatively, some see indices highlighting certain technical matters to the governments themselves. Regardless of the mechanism, the aim is undermined by the disparity of rankings: a mass of contradictory judgements offers neither clear technical advice nor consistent political pressure. The preference for methodological simplicity, in the hope that this minimises methodological uncertainty, seems somewhat misguided. The sensitivity of measures and the disparity of judgements are the natural consequence of methodological uncertainty within the descriptive tradition. The choices of what to include, how to measure these factors and how to aggregate the findings are important decisions which greatly affect the final rankings, and must be taken with little theoretical direction. It is hard to believe that the goal of advocacy through increased public awareness would survive a debate on methodological differences of indices, with popular opinion settling upon a preferred index. The result is a multitude of fragile and contradictory judgements which are all potentially valid, their differences stemming from justifiable choices of what to include, how to measure those variables and the aggregation methodology. Today, a donor that is ranked poorly by one measure need not argue that the ranking is based on value judgements; it could merely point to its high scores from a different measure.

FUTURE DIRECTIONS
The evidence presented here is not positive regarding indices: the rankings they produce are neither robust nor in agreement. This section discusses possible future directions in the light of this evidence. First, I discuss whether more aggregation might be a viable way forward. Second, I discuss the opposite direction: whether disaggregation might sidestep the most intractable problems of the descriptive approach. Third, I question the value of the approach itself, asking whether the approach has anything to offer despite its limitations.
a. More Aggregation
Høyland et al. (2012) follow the Worldwide Governance Indicators and seek to resolve the problem of aggregation by resorting to more aggregation, in quality and quantity. Thus, rather than aggregating merely the point estimates of different measures, they seek to aggregate the range of estimates. This allows a statement regarding the sensitivity to certain assumptions. I do not choose this route. Section 3 shows that indices are not robust to even small changes in methodology. Section 4, dealing with what is in essence a small subset of a sensitivity analysis, shows that there are very large bands of uncertainty in final rankings. A fuller sensitivity analysis would vary what factors are included (population, policy and poverty), which variables are chosen to represent them (e.g. logged real GDP per capita or GNI in current prices), the scale used to measure them (e.g. the typical index method, a regression or headcount approach), the aggregation method employed (geometric or arithmetic) and other methodological choices (e.g. the base year and sample used). This would clearly give much greater variance of possible judgements than the five measures displayed in Figure 2. Ironically, given the deliberate choice to move away from an econometrically advanced approach, descriptive research may soon be aptly described by Leamer's (1983, p. 37) classic critique of econometric work: 'hardly anyone takes anyone else's data analyses seriously'.

b. Less Aggregation
An alternative is to present a disaggregated index. This does not avoid all of the problems of the descriptive approach as there is still uncertainty over which factors are relevant and the best way to measure them; both could lead to disparate judgements. However, disaggregation does avoid the pitfall of deciding upon how best to aggregate them, and therefore does not need to decide upon the appropriate MRS. This in itself makes the methodology of measurement less important, as it no longer determines the relative importance of each factor. Within the descriptive tradition, Nunnenkamp and Thiele (2006) is the only disaggregated presentation of donor allocative performance of which I am aware. This approach is presumably unpopular because of the difficulty in assimilating the information for the reader, due to the lack of a ranking mechanism. For this reason, I follow White and Woestman (1994) who, in a similar debate (on general donor performance) on the wisdom and method of aggregation, promoted a graphical alternative to aggregation used by Ashuvud (1986). The graph used was a four-axis 'diamond' where the distance from the centre denoted the size of a measure (exactly the same type of graph that is now used by Birdsall and Kharas, 2010). A downside of the graph is that it is not easy to display several donors on the same graph. Donor allocative performance is related to just three factors, making it easier to display graphically. Figure 3 succinctly displays the sensitivity of 23 donors to poverty, policy and population in absolute terms. Specifically, they are the coefficients taken from three bivariate regressions per donor: logged aid regressed on logged population, logged (PPP) income per capita and policy (Worldwide Governance Indicator average). The x-axis represents the income coefficient (multiplied by -1 so that it represents poverty, not income), the y-axis is the policy coefficient and the size of the circle reflects the population coefficient.
Thus a donor with high poverty, policy and population sensitivity would have a large circle in the north-east corner of the graph. France has a population coefficient of around 1, whereas Portugal has no circle, as its population coefficient is negative. In Figure 3, only two donors have positive coefficient estimates for policy: Luxembourg and Austria. Greece is a particularly worrying donor here as it gives more aid to richer and less well-governed recipients in absolute terms. The other 20 donors all reside in the south-east quadrant, a clear sign that donors are income sensitive but not policy sensitive (a common finding in the positive literature, see Easterly, 2007; Hout, 2007; Clist, 2011). This point is underlined when examining the scales of the axes. Some donors are remarkably similar: Canada, Belgium, Finland and Japan all overlap. There is also a clear trend for smaller donors to have a greater small country bias, perhaps as they seek to concentrate on recipients for whom they can give relatively substantial aid flows.
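The three bivariate regressions per donor, and the mapping of their coefficients to graph coordinates, can be sketched as below. This is an illustrative sketch only: the recipient data are invented, the function and key names are hypothetical, and it assumes strictly positive aid to every recipient so that logs are defined (how zero-aid recipients are treated is not covered here).

```python
import math


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def donor_coordinates(aid, population, income_pc, policy):
    """Run three separate bivariate regressions for one donor.

    Logged aid is regressed in turn on logged population, logged income per
    capita and a policy score (assumed already on an index scale, not logged).
    """
    log_aid = [math.log(a) for a in aid]
    pop_coef = ols_slope([math.log(p) for p in population], log_aid)
    inc_coef = ols_slope([math.log(i) for i in income_pc], log_aid)
    pol_coef = ols_slope(policy, log_aid)
    return {
        "x_poverty": -inc_coef,         # sign flipped: poverty, not income
        "y_policy": pol_coef,
        "circle_population": pop_coef,  # a negative value would mean no circle
    }


# Hypothetical recipients of one made-up donor. Aid is constructed to fall
# with income at an elasticity of -0.5 purely so the result is easy to check.
income_pc = [500.0, 1000.0, 2000.0, 4000.0, 8000.0]
aid = [1000.0 * inc ** -0.5 for inc in income_pc]
population = [10e6, 50e6, 5e6, 100e6, 20e6]
policy = [-1.0, -0.5, 0.0, 0.5, 1.0]

coords = donor_coordinates(aid, population, income_pc, policy)
print(coords)
```

Because the invented recipients are both richer and better governed as one moves along the list, this made-up donor is poverty sensitive (positive x) but has a negative policy coefficient, placing it in the south-east quadrant.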
While not a perfect solution, this approach maintains the ability for non-specialist readers to quickly assimilate the information. The disaggregated form also makes clear the trade-offs that donors have made: some may be more poverty sensitive, whereas others are more population sensitive. By aggregating, the researcher imposes a marginal rate of substitution, and the donors are ranked according to that MRS. The graphical approach makes clear the MRS that each donor implicitly uses. Roodman (2011b, p. 483) admits that 'aggregation hides at least as much as it reveals'. While the approach taken here is not amenable to an index with more factors, the graphical presentation reveals more than a simple index in the case of donor allocative performance. It is said that a picture is worth a thousand words: Figure 3 is worth some 69 bivariate regressions.
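To illustrate how the choice of aggregation imposes a trade-off between components, the sketch below scores two hypothetical donors with an equally weighted arithmetic mean and an equally weighted geometric mean. The donors, scores and weights are invented for illustration and do not correspond to any of the indices discussed; the only point is that the same disaggregated scores can yield opposite orderings.

```python
import math


def arithmetic_index(components, weights):
    """Weighted arithmetic mean: a constant trade-off between components."""
    return sum(w * c for w, c in zip(weights, components))


def geometric_index(components, weights):
    """Weighted geometric mean: the trade-off varies with component levels."""
    return math.prod(c ** w for w, c in zip(weights, components))


# Hypothetical (poverty, policy) sensitivity scores on a 0-1 scale.
donors = {"P": (0.9, 0.2), "Q": (0.5, 0.5)}
weights = (0.5, 0.5)

arith = {d: arithmetic_index(c, weights) for d, c in donors.items()}
geom = {d: geometric_index(c, weights) for d, c in donors.items()}

best_arith = max(arith, key=arith.get)
best_geom = max(geom, key=geom.get)
print(best_arith, best_geom)
```

Here the unbalanced donor P comes first under arithmetic aggregation and the balanced donor Q comes first under geometric aggregation, even though neither donor's underlying scores have changed.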

c. Abandon Indices
Given the evidence presented earlier, it is worth fundamentally questioning the value of the descriptive approach. Final rankings are fragile to minimal changes in methodology, and popular measures have a very low degree of consensus. Is a graphical option (or other disaggregated approach) a credible way forward, given that it does not solve all of the problems associated with current rankings? There are two aspects of the descriptive approach that make it (potentially) worthwhile: one technical and the other pragmatic. First, I argue, like Easterly and Williamson (2011), that the advantage comes from using absolute judgements to complement explanatory approaches. Descriptive analysis is unique in that it can answer questions regarding absolute poverty, population and policy sensitivity, where more econometrically advanced methods are only able to answer such questions conditional on other factors. Second, while rankings (and graphical presentations) based on simple measures may not be a first-best solution, they have become popular because of an understandable desire to judge donor performance. Furthermore, it is clear that the descriptive approach can be more simply understood by non-specialists, in a way that is not the case with more econometrically advanced work. This advantage threatens to be undermined by fragile and contradictory judgements, but the potential advantage should not be dismissed.
Source (Figure 3): Author's calculations, using 2009 data. Coefficients taken from 3 bivariate regressions.
The reader must judge whether these two reasons are enough to counteract the weaknesses of the approach. Pragmatically, however, it appears likely that descriptive approaches will continue to be used regardless of their fragility. It is in this context that I recommend graphical presentations that avoid aggregation but maintain a simple presentation of donor performance: not as a first-best solution but as an improvement on fragile rankings.

CONCLUSION
The aim of descriptive measures is laudable: to improve aid effectiveness by encouraging good allocative practice. Because the imagined mechanism for changing allocative behaviour includes increased public awareness of the issues, simplicity in measurement and presentation has been valued. The latter has been fruitful, as rankings have successfully increased public discussion and involvement in the issues of donor allocative behaviour. However, it appears naive to claim that in favouring simpler methods, the descriptive literature has minimised methodological uncertainty. I have shown that even minimal changes to methodology result in different rankings, and often in very different rankings. Choices such as whether to include or exclude a single recipient, the variable used and the base year lead to non-trivial changes. The choice of factors, measurement approach and aggregation technique would logically lead to even larger differences in final rankings. The aggregate effect of the inherent whimsy in popular descriptive measures is that of substantial sensitivity to methodology. Similarly, competing measures, implemented using the canonical model, give contradictory judgements. The range of possible opinions in a full sensitivity analysis of selectivity would surely be even larger than the actual range of opinions discussed here. The lack of clear theoretical or technical dominance of one approach over another means a plurality of opinions are potentially valid. This sensitivity to small technical details and disagreement in final rankings threatens to undermine increased public awareness, as methodological ambiguity belies the false certainty of unambiguous rankings. In future, a donor would not need to claim that a poor judgement of its performance was a 'certain value judgement'; it could merely point to a competing measure that praised its performance. Indeed, Kihara (2012, pp. 3-4) recently defended Japanese aid using such an argument: 'some empirical studies, including studies presented here, seem to contradict the findings of the CGD. A number of recent studies have ranked aid donors by various indicators of their aid-giving and in some of these Japanese aid has been ranked toward the upper end of donor countries'.
The problems with indices notwithstanding, their advantages mean demand for them is likely to continue, and with this in mind, a graphical approach is recommended as a way of avoiding some of the largest problems. Much, but certainly not all, of the disagreement in final rankings can be avoided if distinct elements of an index are left in disaggregated form. This approach has not been embraced as disaggregation has typically meant poor presentation, which would undermine the intermediate goal of public communication and engagement. Section 5 proposes a graphical solution to the resulting impasse, combining disaggregation with a presentation that is easy to understand. The graph displays the poverty, policy and population sensitivity of 23 donors in absolute terms (although the graph could equally display conditional results). In doing so, the marginal rate of substitution is not imposed on the rankings, but can be inferred by the reader. In fact, the graph does not even impose a decision on the appropriate factors, as a reader could choose to ignore one of the three dimensions.
I argue that the descriptive approach is a useful complement to explanatory research. In that vein, recent findings of low policy sensitivity and high income sensitivity in explanatory research (Easterly, 2007; Hout, 2007; Clist, 2011) are echoed in Figure 3. This enables researchers to have more confidence in the result as this is found in both conditional and absolute terms using explanatory and descriptive approaches. Measuring absolute sensitivity (i.e. not controlling for other factors) also makes clear the inherent dichotomy that donors face (see Figure 1): richer countries tend to be better governed. As such, the advice of Collier and Dollar (2002) to be both income and policy sensitive is easier to give than to follow; the graphical method confirms that donors generally choose the former. The method makes clear the difficult choice that donors face, but also highlights the donors that fare poorly. By measuring donor performance better, it is hoped that unprofitable debate over the correct index is avoided, and the focus remains on increasing donor performance by measuring it.