Have You Read This? An Empirical Comparison of the British REF Peer Review and the Italian VQR Bibliometric Algorithm

This paper determines the ranking of the units of assessment submitted to the REF, the UK research evaluation carried out in 2014, which would have been obtained had their submissions been evaluated with the bibliometric algorithm used by the Italian evaluation agency, ANVUR, in its assessment of the research of Italian universities. We find a very high correlation between the two methods, especially with regard to the funding allocation, with a headline figure of 0.9997 for the funding attributed to institutions.


Introduction
The week before Christmas 2014, university common rooms and PR offices up and down the country were abuzz with discussions and dissections of the freshly published results of the Research Excellence Framework (REF). This peer review based evaluation was the last in a series of such exercises, which have taken place at approximately regular intervals since the initial dummy run held in 1986. The raison d'être of the exercise is twofold: on the one hand, to ensure accountability for the taxpayer's investment in academic research and to persuade the public of its benefits; on the other, to form the basis for the selective allocation of the annual "block" budget for research to institutions. The funds allocated on the basis of the results of the REF amount to around one quarter of all the funds transferred from the taxpayer to higher education institutions.
Following the 2008 exercise, the funding agency ran a pilot study with a view to replacing peer review, considered very expensive, with an evaluation based on a bibliometric algorithm, but concluded that "bibliometrics are not sufficiently robust at this stage to be used formulaically or to replace expert review in the REF" (HEFCE, 2009); the 2014 exercise therefore continued to rely on peer evaluation of academic output, although assessors could choose to use citation information to inform their expert review. The estimated overall cost of the 2014 exercise is approximately £246m (Farla and Simmonds, 2015), comparable to the annual budget of a medium-sized university, and working out at around £4,000 per academic assessed. The next exercise, planned for 2021, will also be conducted via peer review, partly because of UK academia's continued opposition to an increased role for mechanical methods of evaluating research output, even though several other countries do adopt a bibliometric evaluation, as highlighted in the survey by Wang, Vuolanto, and Muhonen (2014). To the extent that considerable cost savings could be achieved by a bibliometric approach, it is not surprising that the literature has addressed the question of the closeness between a peer review and a bibliometric approach. Thus Bertocchi, Gambardella, Jappelli, Nappi, and Peracchi (2015) report on the working method of the economics and management assessment panel in the Italian 2004-10 assessment, which randomly selected some of the journal articles assigned to bibliometric evaluation to be peer reviewed as well, precisely to assess the correspondence between the two methods (see also Baccini and De Nicolao (2016) and the reply, Bertocchi, Gambardella, Jappelli, Nappi, and Peracchi (2016)). Mryglod, Kenna, Holovatch, and Berche (2015) assess the correlation of the score and the rank obtained by each institution with the corresponding "departmental h-index" (Hirsch, 2010). The latter paper examines a broader range of research areas than Bertocchi, Gambardella, Jappelli, Nappi, and Peracchi (2015), and reports good correlations in the various subject areas, between 0.36 and 0.89. However, it uses a different set of articles from those evaluated by the REF panels, and indeed, as we explain below, it includes articles written by academics who were not submitted as part of the group evaluated by the relevant REF panel. In the same vein, Harzing (2017) has shown that ranking UK departments according to the "departmental h-index" correlates with the REF power ranking at 0.97.
In detail, we assess the papers which were submitted to the UK REF, and are included in the Scopus database, using the bibliometric criteria which ANVUR, the Italian evaluation agency, used to assess the outputs submitted for the Italian evaluation exercise which assessed outputs published from 2011 to 2014. Thus there are two important differences with the literature mentioned above. Firstly, we consider all the research areas, and, secondly, we only assess journal articles submitted to the relevant panel of the REF, and hence, at least in principle, we compare the two approaches, bibliometric and peer review, on the basis of the same set of research outputs.
We stress at the outset an important limitation of the exercise, which makes its contribution more a template for more thorough analysis than policy advice in its own right: books and book chapters, which constitute an important form of output in some research areas, cannot be assessed by the ANVUR algorithm; there are also several other specific differences between the two evaluations (illustrated in Table 3). We did not make any adjustment to the algorithm to account for these.
Such adjustments would have an ad hoc nature, and one criterion of choice among them would inevitably be whether or not they improve the correlation between the rankings; as such, they would bias our exercise. Even so, we find a remarkable correspondence between the methods: in the 18 REF research areas where at least 75% of the outputs submitted to the REF could be evaluated bibliometrically, the average correlation between the average quality of departments in the REF peer review score and the corresponding measure calculated with the ANVUR algorithm is 0.81, and the average rank correlation is 0.76; for the full sample, the figures are 0.63 and 0.6. The correlation is much higher for other measures of departmental research quality, which consider the size of the unit as well as its average quality: of particular interest to policy makers is the correlation in the funding that would be attributed by the two methods, which stands at 0.995 for the departments where at least 75% of the outputs could be evaluated bibliometrically, and at 0.986 for the whole sample. Even when stacking the deck against the comparison by applying the algorithm without making any allowance for the type of outputs submitted, we show that, had the annual funding to institutions been allocated following the ANVUR assessment methods, the outcome would have differed relatively little. The summary result of the correlation in institutional funding is most striking: if the outputs submitted had been evaluated with the bibliometric algorithm used in the Italian evaluation of research quality (Valutazione della Qualità della Ricerca, VQR), with peer review assessing the rest of the institutional submission, the correlation between the actual funding assigned to each institution and the funding it would have received if calculated with the VQR score would have exceeded 0.9997, and hence the difference in funding would have been minuscule.
We close the paper with a simple attempt to uncover associations between the closeness of the two measures and other institutional variables. We find very little systematic variation: only two such variables appear to explain some of the difference in the scores of the two assessment methods: first, the size of the submission, with larger units of assessment appearing to have been slightly penalised by the REF peer review relative to the bibliometric VQR algorithm; and, second, the number of units in the institution as a whole, with universities with many departments performing a little better with the REF than they would have done with the VQR bibliometric algorithm.

This paper is organised as follows. In Section 2 we describe the REF evaluation, and in Section 3 we present the bibliometric algorithm adopted in the Italian VQR.
Section 4 describes the data used to evaluate the REF journal articles, and Section 5 reports the results. A brief conclusion ends the paper.

The 2014 Research Excellence Framework
The REF2014 exercise evaluated the research conducted by 52,000 academic researchers belonging to 1,911 units of assessment in 154 UK higher education institutions. The assessment was carried out by 36 expert panels, one for each area of research, in turn grouped into four "main panels" corresponding to very broad disciplinary areas: medicine and biology, the other sciences and engineering, the social sciences, and the arts and humanities; the full list is in Table 5 below. The 36 panels comprised over 1,000 assessors in total, three quarters of them academics, the rest non-academic "users" of the research. The grouping of the disciplines differs in the two exercises we consider, the VQR and the REF. It may therefore be useful to fix terminology for the rest of the paper: we denote as "subject areas" the 350 subject categories in Scopus: this is the finest classification of topics. We then denote as "VQR research areas" and "REF research areas" the groups of subject areas which were assessed by the 16 VQR panels (known as GEV, gruppi esperti valutatori) and by the 36 REF panels. In the formal analysis we index the subject areas by h and the research areas by i. Having announced the assessment criteria well in advance, the panels determined, on the basis of a peer review of each output submitted, the percentage of each of the three dimensions of activity of each submission to be assigned to each of the five quality categories, ranging from the best, 4-star, "quality that is world-leading in terms of originality, significance and rigour", to the worst, 0-star, "quality that falls below the standard of nationally recognised work". On Thursday 18 December 2014, the panels' assessments of each dimension of activity of every institution were made public, together with the aggregate profile, obtained as a weighted average of the outputs, environment, and impact components, with weights 0.65, 0.15, and 0.2. The unit of assessment is the group of researchers submitted by an institution to a given panel. Outputs can be submitted by an institution as long as the author is employed by that institution on the REF census date, 31 October 2013, irrespective of where the author was when the paper was written or published. The expert panels assessed the output component of each submission by carrying out peer-review evaluations of the originality, significance and rigour of each output submitted.
The environment component is a written submission describing the achievements of the academic department, together with data on research grant income and PhD completions. Finally, impact is assessed by considering written 'case studies', one for every eight academics submitted, accompanied by supporting evidence which shows how the research of the department has brought benefits outside of academia, through, for example, influence on government policy or industry practice. Unlike output, impact is attributed to the institution where it was carried out irrespective of which institution is currently employing the researcher responsible for it at the census date.
The measures of environment and impact have no exact correspondence in the Italian VQR, and obviously cannot be the object of a bibliometric approach, so we limit our comparison to the output component of the REF.
Unlike its Italian counterpart, the UK funding agency does not present a single score which would immediately determine a ranking of institutions. Commentators and the public have therefore stepped in, variously aggregating the profiles into single numbers so as to draw rankings of units of assessment and institutions in national league tables. The most commonly used are the grade point average, GPA, and the research power, RP (Forster, 2015). The GPA is a weighted average of the scores, with the proportion in each category as weight: the GPA of the submission of institution $k$ to research area $i$ is calculated simply as
$$GPA_{ik} = \sum_{s=0}^{4} s\,\pi^{s}_{ik},$$
where $\pi^{s}_{ik}$ is the proportion of the activity of institution $k$'s submission to research area $i$ which was assessed to be of $s$-star quality. Table 1 reports the grade point average (GPA). It shows that the correlation between the three components is high, but not so high as to make it meaningless to assess the three components separately. The other measure widely used to rank departments is research power, which again has no official status. It is simply the product of the GPA and the number of staff submitted:
$$RP_{ik} = n_{ik}\,GPA_{ik},$$
where $n_{ik}$ denotes the number of full-time equivalent researchers submitted by institution $k$ to panel $i$. Thus the GPA measures the average quality, without reference to the size of the unit of assessment, which is instead taken into account by the RP measure.
There is an obvious trade-off between the two: excluding a relatively weak member of staff would increase the GPA but reduce the research power.
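To fix ideas, here is a minimal sketch, in Python, of the two measures just defined; the profile and the FTE figure in the example are made up for illustration and are not REF data.

```python
def gpa(profile):
    """Grade point average of a unit of assessment.

    `profile` maps the star rating s = 0..4 to the proportion of the
    submission assessed at that level (the proportions sum to 1).
    """
    return sum(s * share for s, share in profile.items())


def research_power(profile, fte_submitted):
    """Research power: GPA multiplied by the number of FTE staff submitted."""
    return gpa(profile) * fte_submitted


# Illustrative (made-up) profile: 30% 4*, 45% 3*, 20% 2*, 5% 1*, 0% unclassified.
example_profile = {4: 0.30, 3: 0.45, 2: 0.20, 1: 0.05, 0: 0.00}
print(gpa(example_profile))                   # 3.0
print(research_power(example_profile, 42.5))  # 127.5
```

The example also makes the trade-off visible: dropping a below-average member of staff raises the average in `gpa` but lowers `fte_submitted`, and hence the research power.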
While less prominent in the media, the government, by the very fact of basing the research funding allocations on the results of the REF, does in practice determine a further single measure, which can be used to rank departments within units of assessment, and subsequently aggregated to institutions. This is the funding score, FS, which is used to calculate how to allocate the overall "quality related" funding made available to the sector each year. Unlike the funds distributed by the research councils, which are strictly linked to specific projects, universities are free to spend this funding as they wish, with no link to projects or even disciplines. When designing the funding formula the government intended to provide incentives towards high quality research, and so it gave a high weight to 4-star output, specifically four times higher than the weight given to 3-star output, and no weight to output judged below 3-star. With the above notation, an institution's funding in year $t$, until the following evaluation exercise, is given by
$$FS_{kt} = \Phi_t \sum_{i} \Gamma_i\, n_{ik}\left(4\pi^{4}_{ik} + \pi^{3}_{ik}\right),$$
where $\Phi_t$ is a coefficient (in the jargon, the "QR unit funding") which varies from year to year and depends on the overall public funding for universities, and $\Gamma_i$ is a research area specific weight which takes the value 1.6 for STEM subjects, 1.3 for intermediate-cost subjects, and 1 for the remaining UoAs.
the Italian VQR rules. Given the many differences between the sets of rules used in the two assessment methods, as illustrated in Table 3, this seems unlikely.
Differences between the results of the two assessment methods could spring from two sources. On the one hand, there could be structural differences between the methods, which would be the case if a substantial fraction of the highly cited papers published in prestigious journals were, rightly or wrongly, considered to be of poor quality by the peer reviewers, or, vice versa, if peer review assessed as being of top quality many papers published in obscure journals and with low citation counts. On the other hand, differences could simply reflect random, non-systematic factors affecting individual assessments. To apply the VQR algorithm to the outputs submitted to the REF, it is necessary to know the world distribution of citations and impact metrics at the earliest available date after the REF census date.
We purchased from Scopus bibliometric information (namely the number of citations and the SCImago Journal Rank) on 1/1/2015 for each of the papers submitted to the REF; given the suggestive nature of the exercise, rather than obtaining detailed information on the world distributions of impact metrics and citations, we opted to use data made available by ANVUR, which included these distributions as of 1/1/2017. This might generate a measurement error, which however is systematic only to the extent that there are different trends in the citation patterns and in the impact metrics of the journals where certain institutions are more inclined to publish.
In the next step of the procedure, the unit square $[0,1]^2 \subseteq \mathbb{R}^2$ is divided into five subsets, as shown in Figure 1, by four parallel downward sloping straight lines, in such a way that the dark green area is 0.1, the light green and yellow areas are both 0.2, the orange area is 0.3, and the red area is 0.2 (the procedure is described in greater detail in Anfossi, Ciolfi, Costa, Parisi, and Benedetto, 2016).

Figure 1: Allocations of products to quality classes

Simple computations determine the boundary lines; these are given by $y = a_{it} - b_{it}x$, where $a_{it}$ is the solution in $a$, for $\sigma = 0.1, 0.3, 0.5, 0.8$, of
$$\int_0^1 \min\bigl\{1,\max\{0,\,1 - a + b_{it}x\}\bigr\}\,dx = \sigma, \tag{4}$$
that is, $a_{it}(\sigma, b_{it})$ is chosen so that the area of the unit square lying above the corresponding line equals $\sigma$. The normalisation with the percentiles ensures that the distribution is uniform in the unit square. In (4), $b_{it}$ is the slope used to assess outputs in VQR research area $i$ in year $t$: it is chosen subjectively by each panel, to reflect the trade-off between the visibility of an article and the prestige of the publishing journal in their research area, and the manner in which it changes with time. In practice, the slope $b_{it}$ varied from year to year and from VQR research area to VQR research area, to account for the different citation patterns and for the fact that more recent papers have had less opportunity to collect citations than equally influential articles published five years earlier, and so for more recent papers the impact metric of the journal was given a higher weight. Because of these considerations, the slopes separating the areas in Figure 1 increased in absolute value with the year of publication, so as to reduce the importance of citations for younger articles. (A separate table reports the slopes of the lines in Figure 1 for the different VQR research areas and years; its first four columns report the coefficients used in the VQR.)

Given an article's position $(x, y)$ in the unit square of Figure 1, determined by the percentiles of its two indicators, the article's score is given by
$$s(x,y) = \begin{cases} 1 & \text{if } y \ge a_{it}(0.1, b_{it}) - b_{it}x,\\ 0.7 & \text{if } a_{it}(0.3, b_{it}) - b_{it}x \le y < a_{it}(0.1, b_{it}) - b_{it}x,\\ 0.4 & \text{if } a_{it}(0.5, b_{it}) - b_{it}x \le y < a_{it}(0.3, b_{it}) - b_{it}x,\\ 0.1 & \text{if } a_{it}(0.8, b_{it}) - b_{it}x \le y < a_{it}(0.5, b_{it}) - b_{it}x,\\ 0 & \text{otherwise,} \end{cases}$$
where, in each row, the dependence of $a_{it}$ on $\sigma$ and $b_{it}$ defined in (4) is made explicit.
In words, an article is considered "excellent" (score 1) if it is among the best 10% in the world joint distribution of citations and journal metric; it is assessed as "good" (score 0.7) if it falls between the top 10% and 30%; it is considered "fair" (score 0.4) if it falls between 30% and 50%, and "acceptable" (score 0.1) if it falls between 50% and 80% of the world distribution. The remaining papers are labelled "limited", and receive a score of 0.
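The following sketch illustrates the mechanics of this classification. It assumes the two indicators have already been converted into percentile positions (x, y) in the unit square of Figure 1, recovers the intercepts a_it(σ, b_it) by solving the area condition in (4) numerically by bisection rather than in closed form, and then assigns the score of the region the article falls in; the function names and the example slope are ours, not ANVUR's.

```python
def area_above_line(a, b, n=10_000):
    """Area of {(x, y) in [0,1]^2 : y >= a - b*x}, computed by midpoint
    integration of min(1, max(0, 1 - a + b*x)) over x in [0, 1]."""
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        total += min(1.0, max(0.0, 1.0 - a + b * x))
    return total / n


def intercept(sigma, b, tol=1e-6):
    """Solve area_above_line(a, b) = sigma for the intercept a by bisection.

    The area decreases in a: a = 0 gives area 1, a = 1 + b gives area 0."""
    lo, hi = 0.0, 1.0 + b
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if area_above_line(mid, b) > sigma:
            lo = mid  # area still too large: move the line further up
        else:
            hi = mid
    return 0.5 * (lo + hi)


def vqr_score(x, y, b):
    """Score of an article at position (x, y) in the unit square of Figure 1,
    given the slope b chosen by the panel for that research area and year."""
    for sigma, score in [(0.1, 1.0), (0.3, 0.7), (0.5, 0.4), (0.8, 0.1)]:
        if y >= intercept(sigma, b) - b * x:
            return score
    return 0.0


# Illustrative call with a made-up slope b = 1: an article near the top-right
# corner of the square falls in the best 10% region and scores 1.
print(vqr_score(0.95, 0.90, b=1.0))
```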
Approximately 70% of the outputs submitted to the REF are published in journals which the VQR had allocated to one or more VQR research areas. We allocated the remaining ones, for example journals in the social sciences, arts and humanities, to close VQR research areas, possibly more than one, by exploiting information on the frequency of publications in journals of a given Scopus subject area by the academics submitted to a VQR research area. The entire allocation procedure was such that around 46% of the outputs submitted to the REF and contained in Scopus were published in journals associated with multiple VQR research areas.
Depending on where it falls in the version of Figure 1 of each VQR research area, a given output could thus receive different scores. In the event, 7,068 outputs, around 5% of those we assessed, were given different values by the algorithm. When this happened, we assessed the given output in all the selected VQR research areas, and then chose the highest evaluation score. After each output was assigned to the corresponding class, the scores could be aggregated by averaging or adding up the scores of all the articles submitted by members of each unit assessed (department, faculty or university). The corresponding score of the submission of institution $k$ to research area $i$, evaluated according to the VQR algorithm, is given by
$$GPA^{VQR}_{ik} = 4\pi^{1}_{ik} + 3\pi^{0.7}_{ik} + 2\pi^{0.4}_{ik} + \pi^{0.1}_{ik}, \tag{5}$$
where $\pi^{s}_{ik}$ is the proportion of the articles submitted by institution $k$ to research area $i$ to which the algorithm assigned a score $s^{VQR} = s$, $s = 1, 0.7, 0.4, 0.1$. Note that $\sum_s \pi^{s}_{ik} \le 1$: it can be strictly less than 1, as some outputs may score zero. In (5), we calculate the GPA with the weight vector (4, 3, 2, 1, 0) used in the REF, rather than the VQR weight vector, which was (1, 0.7, 0.4, 0.1, 0); the overall correlation between the two resulting measures, at 0.998, is very high.
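A minimal sketch of the aggregation in (5), applying the REF weight vector (4, 3, 2, 1, 0) to the VQR quality classes as described above; the article scores in the example are invented.

```python
# Map each VQR quality class (article scores 1, 0.7, 0.4, 0.1, 0) to the
# REF weight vector (4, 3, 2, 1, 0), as in equation (5).
REF_WEIGHT = {1.0: 4, 0.7: 3, 0.4: 2, 0.1: 1, 0.0: 0}


def vqr_gpa(article_scores):
    """Average REF-weighted score of the articles submitted by one unit.

    `article_scores` is the list of VQR scores (1, 0.7, 0.4, 0.1 or 0) of
    the unit's articles; the result is the GPA-like measure in (5)."""
    if not article_scores:
        return 0.0
    return sum(REF_WEIGHT[s] for s in article_scores) / len(article_scores)


# Illustrative (made-up) submission of six articles:
print(vqr_gpa([1.0, 1.0, 0.7, 0.4, 0.1, 0.0]))  # (4+4+3+2+1+0)/6 = 2.33...
```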

The data
The list of all the outputs submitted to the REF is available from the REF website (www.ref.ac.uk/2014) as Excel files. The total number of outputs assessed is 190,962, of which 81.09% (154,854) are journal articles; the remainder consists mainly of chapters in books (7.5%) and books (5.4%). There are many other output types, each representing a tiny fraction of the total, such as compositions (0.35%), patents (0.06%), exhibitions (0.65%), or scholarly editions (0.19%). For each output, the file contains the type of output (journal article, book, working paper, etc.), the institution that submitted the output, and the unit of assessment it was submitted to, as well as standard bibliographic information such as the DOI, the publication year, the number of co-authors, the title, the place of publication, and so on. Table 5 presents summary statistics of the output data: as one would expect, the research areas with the highest proportion of outputs that can be assessed using the VQR bibliometric algorithm are the STEM research areas and those, like economics, where the typical publication outlets are refereed journals.

Results
Our main results are reported in Table 6. The UoAs of the REF are ordered according to the percentage of outputs that we have been able to assess using the VQR algorithm, reported in the fourth column of Table 5.
Column (1) of Table 6 reports the correlation between the GPA scores calculated for the outputs of the various institutions which submitted to the corresponding UoA using the VQR algorithm (the formula in (5)) and the scores awarded to these units by the REF expert panel. Column (2) reports the rank correlation between these sets of scores. These two sets of correlations are themselves highly correlated (0.973). All the correlations are positive, and many, especially for the UoAs where a large percentage of the products submitted could be assessed with the bibliometric algorithm of the VQR, are very high; this is true both for the correlations between values and for the rank correlations. GPA scores are averages, and so are independent of the number of academics submitted. When the latter is allowed into the picture, the correlations increase markedly, as shown in columns (3) and (4), which report the correlations in RP, and even more so in columns (5) and (6), which report the correlations in the FS measure, the funding attributed to each unit submitted.
The results for the rank correlations are less extreme. Given that the aim of the UK exercise is to assess research, not to rank institutions, this is the less relevant of the two correlation measures. Its lower value is likely to be due to the fact that many scores are very tightly bunched, so that small measurement errors change little in the absolute scores, but may have a large impact on the ranking. The same message emerges from Figure 2, which illustrates the correlations and the rank correlations in the various units of assessment according to the various measures: in panel (1) for the GPA, in panel (2) for the research power, and in panel (3) for the funding score; the funding scores obtained with the two methods of assessment are illustrated by the green dots.
While we stress once again the highly stylised nature of our computations, it might nevertheless be intriguing to verify, along the lines of Harzing (2017), how close the two methods are when funding is computed at the level of whole institutions. The funding of institution $k$ is proportional to
$$\sum_{i \in I_k} \Gamma_i\, n_{ik}\left[0.65\left(4\pi^{4,OUT}_{ik} + \pi^{3,OUT}_{ik}\right) + 0.15\left(4\pi^{4,ENV}_{ik} + \pi^{3,ENV}_{ik}\right) + 0.2\left(4\pi^{4,IMP}_{ik} + \pi^{3,IMP}_{ik}\right)\right],$$
where $\pi^{s,X}_{ik}$ is the proportion of activity $X$ submitted by unit $i$ in institution $k$ assessed to be of quality $s$-star, $s = 3, 4$, with $X$ taking values OUT (output), ENV (environment) and IMP (impact), $\Gamma_i$ the cost adjustment parameter taking values 1.6, 1.3 or 1, as explained above, and $I_k$ the set of units of assessment submitted by institution $k$.
The correlation between the levels of funding with the two methods is 0.9997, both when all units of assessment are taken into account and when only those where at least 75% of the outputs could be assessed with the VQR algorithm are included. Obviously some of the correlation is due to the fact that the environment and impact components are the same in the two terms, but if these are removed, the correlation between an institution's portion of the funding due to the output component and the same portion when outputs are assessed with the VQR algorithm is still extremely high, at 0.9940, or 0.9937 when considering only the units of assessment where at least 75% of the outputs could be assessed with the VQR algorithm.
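To make the institutional calculation concrete, the sketch below combines, for a hypothetical institution, each unit's output, environment and impact profiles with the 0.65/0.15/0.2 weights and the 4:1 weighting of 4-star over 3-star activity, scales each unit by its FTE volume and cost weight, and sums across units; the data structure and all numbers are illustrative, not taken from the REF.

```python
def unit_quality(profiles):
    """Weighted quality of one unit of assessment.

    `profiles` maps each component ('OUT', 'ENV', 'IMP') to the pair of
    proportions of 4* and 3* activity, e.g. {'OUT': (0.30, 0.45), ...}.
    Only 4* and 3* activity attracts funding, with weights 4 and 1; the
    components are combined with weights 0.65, 0.15 and 0.2.
    """
    component_weight = {"OUT": 0.65, "ENV": 0.15, "IMP": 0.20}
    return sum(w * (4 * profiles[x][0] + profiles[x][1])
               for x, w in component_weight.items())


def institution_funding(units, qr_unit_funding):
    """Institutional funding: sum over units of cost weight x FTE x quality,
    scaled by the year's QR unit funding (the coefficient Phi_t)."""
    return qr_unit_funding * sum(
        u["cost_weight"] * u["fte"] * unit_quality(u["profiles"]) for u in units)


# Illustrative institution with two units of assessment (all numbers made up):
units = [
    {"cost_weight": 1.6, "fte": 40.0,
     "profiles": {"OUT": (0.25, 0.50), "ENV": (0.50, 0.40), "IMP": (0.30, 0.50)}},
    {"cost_weight": 1.0, "fte": 25.0,
     "profiles": {"OUT": (0.15, 0.45), "ENV": (0.25, 0.50), "IMP": (0.20, 0.40)}},
]
print(institution_funding(units, qr_unit_funding=1000.0))
```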
We end the paper by trying to uncover whether there are any links between the discrepancies between the two measures, our calculation using the VQR bibliometric algorithm and the REF peer review evaluation, and observable characteristics of institutions and departments. We are well aware that there is no possibility of establishing any causal effect, and so the results presented in Table 7, which reports the estimated coefficients of various specifications of the following equation, should be read as simple associations:
$$\Delta_{ik} = \beta_0 + \beta^{\prime} x_{ik} + \varepsilon_{ik},$$
where $x_{ik}$ is a vector of observable characteristics of the submission and of the institution, and $\Delta_{ik}$ is the difference in a given measure of research quality, or in the corresponding rank, between the outcome measured by the VQR algorithm and that assessed by the REF peer review: $\Delta_{ik} > 0$ indicates that the submission to REF research area $i$ made by university $k$ did better with the VQR algorithm than it was judged to have done by the peer reviewers.
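A minimal sketch of how such a specification can be estimated by ordinary least squares; the regressors follow the variables discussed in the text, but the design matrix and the values of Δ are made up, and plain numpy is used in place of any particular econometrics package.

```python
import numpy as np

# Illustrative data: one row per (research area, institution) submission.
# Columns: constant, size of the submission (FTE), number of other
# submissions made by the institution, panel membership dummy.
X = np.array([
    [1.0, 35.0, 20.0, 1.0],
    [1.0, 12.0,  8.0, 0.0],
    [1.0, 50.0, 25.0, 1.0],
    [1.0,  9.0,  5.0, 0.0],
    [1.0, 22.0, 14.0, 0.0],
])
# Made-up differences between the VQR-algorithm score and the REF score.
delta = np.array([0.15, -0.05, 0.20, -0.10, 0.02])

# OLS estimate of beta in delta = X @ beta + error.
beta, residuals, rank, _ = np.linalg.lstsq(X, delta, rcond=None)
print(beta)
```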
In the upper part of Table 7 the dependent variable of the OLS regression is the score: the GPA in column (1), the log of the research power in column (2), and the log of the funding score in column (3); the lower part repeats the regressions using the rank instead of the score or its log, and the restricted sample includes only the UoAs where the VQR algorithm could assess at least 75% of the outputs submitted (those above the line in Table 6). Table 7 suggests that the explanatory variables have very little explanatory power, and where they do, as with the size of the submission, the number of other submissions made by the institution, and the presence of a member of the department on the peer review panel, they appear to explain only some of the difference in the rankings. Overall, the differences in scores and rankings between departments in the two exercises, the British REF and the Italian VQR, seem to be due mostly to random, non-systematic factors.

Concluding remarks
We have performed in this paper a simple exercise to compare the outcome of the assessment of the research carried out in British universities in the course of the 2014 REF with the outcome that would have resulted had the publications included in the submissions been evaluated, where possible, using the VQR bibliometric algorithm employed in the corresponding exercise for Italian universities.
While we are keenly aware of the rough and approximate nature of our analysis, whose aim is chiefly to highlight a possible route to be followed in light touch, cost effective evaluation, rather than to suggest that the measures we obtain are an accurate description of the relative standing of UK institutions in the various subject areas, we find the closeness of the outcomes, especially when comparing size-sensitive measures, strongly suggestive that the method could be used to assess publications at least in the research areas where the main outlets are refereed journals.
Of course the nature of the research output might itself be affected by the manner in which it is measured, in a coarse, macroscopic version of the Heisenberg Uncertainty Principle. A statement that only journal articles will be considered worthwhile output for assessment would obviously direct academics to try to publish mainly in these outlets, even though they might not be the most suitable ones for their research. This effect could be particularly strong for early career researchers, many of whose outputs were submitted in the form of working papers in some subject areas (in the economics and econometrics unit of assessment, institutions submitted 2,386 journal articles and 168 working papers), and who might decide, or be persuaded, to submit their work to less prestigious journals rather than risk being unable to submit outputs which the rules deem of lower quality.