Nature and Purpose of Peer Review
Peer review is an established component of professional practice, the academic reward system, and the scholarly publication process. The fundamental principle is straightforward: experts in a given domain appraise the professional performance, creativity, or quality of scientific work produced by others in their field or area of competence. In most cases, reviewer identity is hidden (single-blind review) to encourage frank commentary by protecting against possible reprisals by authors; and, in some cases, author identities will be masked from reviewers (double-blind review) to protect against forms of social bias. The structure of peer review is designed to encourage peer impartiality: typically, peer review involves the use of a “third party” (Smith, 2006, p. 178), someone who is neither affiliated directly with the reviewing entity (university, research council, academic journal, etc.) nor too closely associated with the person, unit, or institution being reviewed; and peers submit their reviews without, initially at least, knowledge of other reviewers' comments and recommendations. In some cases, however, peers will be known to one another, as with in vivo review, and may even be able to confer and compare their evaluations (e.g., members of a National Science Foundation [NSF] review panel).
Peer review, broadly construed, covers a wide spectrum of activities, including but not limited to observation of peers' clinical practice; assessment of colleagues' classroom teaching abilities; evaluation by experts of research grant and fellowship applications submitted to federal and other funding agencies; review by both editors and external referees of articles submitted to scholarly journals; rating of papers and posters submitted to conferences by program committee chairs and members; evaluation of book proposals submitted to university and commercial presses by in-house editors and external readers; and assessments of the quality, applicability, and interpretability of data sets (Lawrence, Jones, Matthews, Pepler, & Callaghan, 2011; Parsons, Duerr, & Minster, 2010). To this list one might add promotion and tenure decisions in higher education for which an individual's institutional peers and select outside experts determine that person's suitability for tenure and/or promotion in rank, and also the procedures whereby candidates are admitted to national academies, elected fellows of learned societies, or awarded honors such as the Fields Medal or Nobel Prize.
In many ideal depictions, peer review processes are understood as providing “a system of institutionalized vigilance” (Merton, 1973, p. 339) in the self-regulation of knowledge communities. Peer expertise is coordinated to vet the quality and feasibility of submitted work. Authors, in the anticipation of the peer evaluation of their work, aim to conform to shared standards of excellence out of expediency and in accordance with an internalized ethos (Merton, 1973). The norms and values to which peers hold each other are conceived as being universally and consistently applied to all members, where these norms and values pertain to the content of authors' evidence and arguments independently of their social caste or positional authority (Merton, 1973). When these norms and values are impartially interpreted and applied, peer evaluations are understood as being fair. It is the impartial interpretation and application of shared norms and standards that make for a fair process, which—psychologically (Tyler, 2006) and epistemologically—legitimizes peer review outcomes, content, and institutions.
This is why critics' charge of bias in peer review is so troubling: Threats to the impartiality of review appear to threaten peer review's psychological and epistemic legitimacy. Although there are a few exceptions (Lamont, 2009; Lee, in press; Mallard, Lamont, & Guetzkow, 2009), variations in the interpretation and application of epistemic norms and values are almost always conceived of as problematic. Failures in impartiality lead to outcomes that result from the “luck of the reviewer draw” (Cole, Cole, & Simon, 1981, p. 885), fail to uphold the meritocratic image of knowledge communities (Lee & Schunn, 2011; Merton, 1973), protect orthodox theories and approaches (Travis & Collins, 1991), insulate “old boy” networks (Gillespie, Chubin, & Kurzon, 1985; McCullough, 1989), encourage authors to “chase” disputable standards (Ioannidis, 2005, p. 696), and mask bad faith efforts by reviewers who also serve as competitors (Campanario & Acedo, 2005). Perceived partiality leads to dissatisfaction among those whose professional success or failure is determined by review outcomes (Gillespie, Chubin, & Kurzon, 1985; McCullough, 1989; Ware & Monkman, 2008).
The charge of bias also threatens the social legitimacy of peer review. Peer review signals to the body politic that the world of science and scholarship takes seriously its social responsibilities as a self-regulating, normatively driven community. The enormity and complexity of contemporary science and its ramified institutional arrangements are such that peer review has, in the words of Biagioli (2002, p. 34), been “elevated to a ‘principle’ — a unifying principle for a remarkably fragmented field.” As a consequence, the system is held to almost impossibly strict standards and routinely exposed to intense scrutiny by insiders and outsiders alike, including elected politicians (Gustafson, 1975; Walsh, 1975).
Does the “mundane reality of peer review” depart radically from its “mythology” (Biagioli, 2002, p. 13)? Given that human fallibility and venality are inescapable facts of life, it seems unreasonable to imagine that “the flywheel of science” (Chubin & Hackett, 1990, p. 5) could function flawlessly. Throughout the literature, charges of systematic bias—not just isolated incidents—are repeatedly aired. Such concerns need to be addressed in an open and thoroughgoing fashion to ensure that trust in the integrity of peer review is maintained. In this spirit, our review seeks to articulate notions of impartiality and bias that are faithful to concerns raised by quantitative research on peer review; characterize major genres of research on bias by their methods, assumptions, and concerns; report their results; and indicate how alternative forms of peer review might ameliorate various forms of bias.
Our discussion will draw on literature on the origins, purpose, and mechanics of scientific peer review across multiple genres including journal articles, grant proposals, and fellowship applications (e.g., Bornmann & Daniel, 2007; Bornmann, Mutz, & Daniel, 2007, 2008, 2009; Campanario, 1998a, 1998b; Chubin & Hackett, 1990; Daniel, Mittag, & Bornmann, 2007; Hames, 2007; Holbrook & Frodeman, 2011; Kronick, 1990; Marsh, Bornmann, Mutz, Daniel, & O'Mara, 2009; Shatz, 2004; Spier, 2002); the growing body of empirical and meta-analytic research on the reliability and predictive validity of the peer review process (e.g., Bornmann, 2011a, 2011b; Peters & Ceci, 1982a), including the various kinds of presumptive biases (e.g., institutional, gender, cognitive) associated with different types of review systems (e.g., Alam et al., 2011; Blank, 1991; Budden et al., 2008); survey research on scientists' attitudes towards peer review (e.g., Sense About Science, 2009; Ware & Monkman, 2008); and debates surrounding the relative merits of open peer review in light of new experimental, web-based systems (e.g., Delamothe & Smith, 2002).
History of Peer Review
The origins of scholarly peer review are commonly associated with the formation of national academies in 17th-century Europe, although some have found foreshadowing of the practice. Biagioli (2002, p. 31) has described in detail “the slow differentiation of peer review from book censorship” and the role state licensing and censorship systems played in 16th-century Europe. A few years after the Royal Society of London (1662) and the Académie Royale des Sciences of Paris (1699) were established, both bodies created in-house journals, the Philosophical Transactions and Journal des Sçavans, respectively. These prototypical scientific journals gradually replaced the exchange of experimental reports and findings via correspondence, formalizing a process that up until then had been essentially personal, informal, and nonassured in nature. In London, Henry Oldenburg was appointed Secretary to the Royal Society and became the journal's first editor, gathering, reporting, and editing the work of others (Manten, 1980). From these early efforts gradually emerged the process of independent review of scientific reports by acknowledged experts that persists to this day. Indeed, as early as 1731 the Royal Society of Edinburgh had adopted a review process in which materials sent to it for publication were vetted and evaluated by knowledgeable members (Spier, 2002, p. 357).
This was the era of the amateur scientist and armchair philosopher who “produced reliable knowledge in and through a moral economy patterned upon the conventions of gentlemanly conversation” (Shapin, 1995, p. 290). But professional science is not conducted by “logically omniscient lone knowers” (Kitcher, 1993, p. 59), and mechanisms thus evolved to formalize the ways in which the trustworthiness of scientific findings could be verified and promulgated to a wider audience. Over time, three principal forms of journal peer review evolved: single-blind, double-blind, and open. Of these, single-blind (the author's identity is known to the reviewer while the reviewer's is concealed from the author) is the most widely used, not least because it is a less onerous and less expensive system to operate than double-blind, for which considerable (often unsuccessful) effort is required in order to remove all traces of the author's identity from all parts of the manuscript/proposal under review (e.g., Blank, 1991; Nature, 2008).
Commitment to and Dependence on Peer Review
Today there are literally thousands (estimates vary considerably) of peer-reviewed journals in existence, although the stringency and consistency with which peer review procedures are applied across this population are variable (Mabe, 2003). In any given year these journals publish, at a conservative estimate, a million articles (Björk, Roos, & Lauri, 2009). Each one of those articles will, in all likelihood, have been read by at least one, often two, and sometimes three or more reviewers, selected by journal editors, and most of those submissions will have undergone multiple rounds of review prior to eventual publication in a journal of record. In addition to the million or so published articles there will be at any given moment a very sizeable pool of rejected articles moving through the system, as many (but not all) leading journals have high rejection rates (Schultz, 2010a, 2010b). These rejected papers will also have consumed a great deal of reviewer time (Hamermesh, 1994; Vines, Rieseberg, & Smith, 2010). Moreover, at least some of those rejected papers will be resubmitted to a different journal (possibly more than one) in an effort to see the light of day (Cronin & McKenzie, 1992). As Kravitz and Baker (2011, para. 1) put it: “each submission of a rejected manuscript requires the entire machinery of peer review to creak to life anew,” creating, in effect, “a journal loop bounded only by the number of journals available and the dignity of the authors.”
But that is only part of the story. Research councils, foundations, universities, and other grant-awarding bodies also need to call upon the services of peer experts to review the millions of research proposals, intra- and extramural, seeking funding in any given year. In the U.S. alone, the National Institutes of Health (NIH) and the NSF, the two principal funding agencies, together receive nearly 90,000 research proposals each year and fund less than one quarter of these (National Institutes of Health, 2011; National Science Foundation, 2011). Many of those who review NSF and NIH research proposals are probably at the very least also regular reviewers of papers for a range of academic journals and conferences and also occasional reviewers of promotion and tenure dossiers, tens of thousands of which require careful scrutiny by multiple reviewers every academic year. It is not hard to grasp the enormity of the burden placed upon members of the scientific community, both junior and senior (Vines, 2010), by a system that, with very few exceptions (see Engers & Gans, 1998), operates on a voluntary, unremunerated basis (Kravitz & Baker, 2011).
With advances in technology, scientific research has become highly sophisticated, collaborative, distributed, and capital intensive in recent years: as a result many manuscripts are now accompanied by large amounts of supplementary materials that require careful scrutiny, placing an even greater burden on conscientious reviewers. As the commercial and career stakes rise, in what Ziman (2000, p. 211) has termed the age of “post-academic science,” so does the burden placed on the shoulders of those individuals refereeing for the world's leading scientific journals.
The competition for both pecuniary resources and attention in the marketplace of ideas has intensified to such an extent that reviewers need to be ever alert to the possibility of fraud (e.g., data fabrication, data trimming), credit misallocation (e.g., unearned/gift authorship), and potential conflicts of interest (e.g., undeclared commercial or consulting ties) in the publications they evaluate. Although unethical practices have been documented repeatedly in the medical and biomedical fields (e.g., Biagioli, 1998; Cronin, 2002; Sismondo, 2009), there is also suggestive evidence that chicanery and corner-cutting may be on the rise in some of the social sciences (Shea, 2011).
More than ever, we need to rely on peer review in the efficient and effective evaluation of knowledge claims. Research on bias in peer review seeks to identify ways in which it fails to do so. However, as Bornmann (2008) notes, the focal concept of bias has not been defined unambiguously in the literature, perhaps because there is presumed to exist a shared, albeit tacit, understanding of this term. In what follows, we will articulate a general notion of bias, defined as the violation of impartiality in peer evaluation, that draws the empirical literature's normative concerns together. We will then identify different categories of bias research by their hypothesized source of partiality and (in some cases) by the methods and assumptions adopted to study that type of bias.
Bias in the Peer Review Process
In the context of quantitative research on bias in peer review, reviewer bias is understood as the violation of impartiality in the evaluation of a submission. We define impartiality in peer evaluations as the ability of any reviewer to interpret and apply evaluative criteria in the same way in the assessment of a submission. That is, impartial reviewers arrive at identical evaluations of a submission in relation to evaluative criteria because they see the relationship of the criteria to the submission in the same ways. And, so long as the evaluative criteria have to do with the cognitive content of the submission and its relationship to the literature, impartiality ensures evaluations are independent of the author's and reviewer's social identities and independent of the reviewer's theoretical biases and tolerance for risk.
There are many reasons to challenge this ideal notion of impartiality in peer review. Lamont (2009) and Mallard et al. (2009) argue that evaluative criteria should not be subject to unifying, transdisciplinary interpretations. Lee (in press) argues that impartiality in peer evaluations may not be possible since definitions of evaluative criteria underdetermine their interpretation and application in both multidisciplinary and disciplinary contexts. Lamont (2009) argues that the cognitive value of submissions cannot and should not be assessed in ways that are dissociated from the reviewer's “sense of self and relative positioning” with respect to the submission's content. Likewise, it is not clear that a reviewer's theoretical or methodological orientations should be looked upon as normatively problematic. However, we articulate this notion of impartiality in an effort to identify an underlying ideal that aligns different genres of quantitative research on bias in peer review.
These genres of quantitative research can be categorized by differences in their conception of the primary source of bias: (a) error in assessing a submission's “true quality,” (b) social characteristics of the author, (c) social characteristics of the reviewer, and/or (d) content of the submission. In what follows we will characterize these genres of work (and their subgenres), identify assumptions and methods adopted to undertake this quantitative research, and provide a selective review of their findings. In all these genres and subgenres, bias is deemed problematic qua partiality. However, when critics implicitly or explicitly express additional grounds for normative complaint, we identify them throughout.
Bias as Deviation From “True Quality” Value
Some quantitative research conceives of bias as a kind of error in identifying “the true quality of the object being rated” (Blackburn & Hakal, 2006, p. 378). Errors in identifying the true quality of submissions violate the ideal of impartiality in peer review by demonstrating that reviewers—in succumbing to error—can fail to interpret and apply evaluative criteria in consistent ways. The assumption that there exists such a value along a single dimension is commonplace within psychometric research, which measures single-dimension constructs such as intelligence and creativity (Hargens & Herting, 1990; Rust & Golombok, 2009). Improvement in peer review practices, from this perspective, involves improving the reliability with which reviewers identify the true quality value of submissions.
Bias as deviation from proxy measures for true quality
In one subgenre of this research, studies seek to assess the construct validity of peer review as a test/process by comparing its outcomes to proxy measures for manuscript quality. Proxy measures include reviewers' pooled mean rating (Goodman, Berlin, Fletcher, & Fletcher, 1994), ratings by super experts (Gardner & Bond, 1990), editor/panel decisions (Marsh, Jayasinghe, & Bond, 2008; van Rooyen, Godlee, Evans, Smith, & Black, 2010), ratings by readers of a journal (Justice, Cho, Winker, & Berlin, 1998), citation counts (Hagstrom, 1971; Campanario, 1995; Daniel, 2005; Bornmann & Daniel, 2009; Gottfredson, 1978), and subsequent publication (Bornmann & Daniel, 2008b; Bornmann, Mutz, Marx, Schier, & Daniel, 2011). Reliance on proxies for quality is especially common in research on peer review in medicine, which seeks to carry out randomized controlled trials to identify practices that improve peer review processes and outcomes (Godlee, Gale, & Martyn, 1998; Justice et al., 1998; van Rooyen, Godlee, Evans, Smith, et al., 2010).
For example, studies have investigated the citation patterns of “rejected-then-published-elsewhere” articles. A high subsequent citation rate is used to indicate error in the original decision to reject a manuscript (Bornmann & Daniel, 2008b, 2009; Bornmann, Mutz, Marx, et al., 2011). Bornmann and Daniel (2009) found, when comparing citation counts, that 15% of accepted papers and 15% of rejected papers (that were subsequently published elsewhere) should not have been accepted/rejected at a top chemistry journal. Bornmann, Mutz, Marx, et al. (2011) found that acceptance by the original journal was a good predictor of later citation success and that rejection was a good predictor of limited citation, thereby validating editorial decisions.
In some studies, subsequent publication of a rejected manuscript in a more prestigious journal is also used as an indication of error in the original publication decision (Bornmann, Mutz, Marx, et al., 2011). However, Cronin and McKenzie (1992) note relatively few instances of “upward migration” and challenge the notion that such cases may reflect error on the part of the original publication decision: upward migration may sometimes result from the manuscript's finding a better fit—in terms of “focus, scope or style”—with a more prestigious journal (p. 316).
Bias as low inter-rater reliability
From a psychometric perspective, in order for peer review to be a valid test of submission quality, reviewer judgments must be reliable with respect to each other (Hargens & Herting, 1990; Rust & Golombok, 2009, p. 72). Some researchers have suggested that inter-rater reliability for two reviewers on a single submission should be about 0.8–0.9 (Marsh et al., 2008, p. 162), which is similar to the rate found for intelligence and personality tests (Rust & Golombok, 2009). Unfortunately, agreement between reviewers is very low (e.g., Bornmann & Daniel, 2008a; Ernst & Resch, 1999; Jackson, Srinivasan, Rea, Fletcher, & Kravitz, 2011; Rothwell & Martyn, 2000), with agreement “barely beyond chance” (Kravitz et al., 2010, p. 1) and comparable to rates found for Rorschach inkblot tests (Lee, in press).
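To make the phrase "barely beyond chance" concrete, the following minimal sketch computes Cohen's kappa, one common chance-corrected agreement statistic for two raters. It is offered only as an illustration: the studies cited above may rely on other coefficients (for example, intraclass correlations), and the function name and the ten recommendations are hypothetical, invented for this example.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each rater assigned categories independently,
    # at his or her own base rates.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical recommendations for ten manuscripts (not real data).
reviewer_1 = ["accept", "reject", "reject", "accept", "reject",
              "reject", "accept", "reject", "reject", "reject"]
reviewer_2 = ["reject", "reject", "accept", "accept", "reject",
              "reject", "reject", "reject", "accept", "reject"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this invented example the two reviewers agree on 6 of 10 manuscripts, yet because both reject about 70% of submissions, roughly 58% agreement would be expected by chance alone, so kappa is close to zero. Raw percentage agreement can therefore overstate how consistently reviewers apply evaluative criteria.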
Research has demonstrated that inter-reviewer agreement is improved when reviewers evaluate more rather than fewer grant applications, suggesting improvement via learning/training (Jayasinghe, Marsh, & Bond, 2003). Research has also shown improvements in inter-rater reliability with the addition of more reviewers per grant application (Jayasinghe et al., 2003). Such improvements are important to psychometrically oriented researchers since they decrease the chance that review outcomes vary dramatically as a function of which reviewers are chosen (Cole, Cole, & Simon, 1981). However, empirical study suggests that increasing the number of reviewers per journal manuscript does not significantly affect final decisions (Schultz, 2010a).
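One standard psychometric way to reason about why adding reviewers raises reliability is the Spearman-Brown formula, which gives the reliability of an averaged rating from the reliability of a single rater. The sketch below is an illustration under the assumption that reviewers are interchangeable ("parallel") raters; it is not necessarily the analysis used in the studies cited above, and the single-reviewer reliability of 0.2 is a hypothetical value chosen for the example.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n_raters interchangeable raters."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-reviewer reliability (assumed, not taken from the text).
single = 0.2
for k in (1, 2, 4, 8, 16):
    print(f"{k:2d} reviewers -> reliability of pooled rating "
          f"{spearman_brown(single, k):.2f}")
```

Under these assumptions, reliability grows with panel size but with diminishing returns: starting from 0.2, the pooled rating reaches 0.5 with four reviewers and 0.8 only with sixteen. Note that this concerns the consistency of pooled ratings, which is a separate question from whether additional reviewers change final editorial decisions.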
Inter-rater reliability research focuses on recommendation outcomes without studying other qualities of reviews, such as their length, tone, and presence of references. Without considering the nature and language of the review, it is difficult to assess whether systematic bias is present and what type of bias it may be (epistemic, language, etc.). When “we shift focus away from the numerical representation of a reviewer's assessment to the content upon which such assessments are grounded, we can identify” (Lee, in press, p. 5) ways in which inter-rater disagreement might reflect normatively appropriate disagreements. Editors and grant program officers may seek reviewers who can evaluate different aspects of a submission according to their own subspecialization and expertise (Bailar, 1991), and we would not expect high inter-rater reliability in cases where quality along these different aspects diverges (Hargens & Herting, 1990). These considerations suggest that “diversity of opinion among referees may be desirable and beneficial” (Chubin & Hackett, 1990, p. 102)—disagreement is thus sometimes normatively desirable and appropriate (Harnad, 1982; Hirschauer, 2009; James, Demaree, & Wolf, 1984; Lee, in press).
Philosophical and qualitative sociological research challenges the psychometric assumption that peer review involves assessments along a single dimension of evaluation. Peer review criteria—such as novelty, soundness, and significance—may be open to different, normatively appropriate interpretations (Lamont, 2009; Lee, in press) and fail to reduce to a single dimension of evaluation. If this is the case, then we would expect normatively appropriate disagreement between reviewers. That normative credibility is conceptually different from high inter-rater reliability is also demonstrated by Bornmann, Mutz, and Daniel (2010), who found that studies with high levels of inter-rater agreement turned out to be less statistically credible than those with low levels of agreement.
Bias as a Function of Author Characteristics
Among Merton's classical norms of science is universalism, the ideal that knowledge claims be evaluated according to “preestablished impersonal criteria” that assess the excellence or originality of a person's ideas, rather than on particular facts about their social identity and status (Merton, 1973, p. 269). As expressed by Peters and Ceci (1982b), universalism in the context of peer review requires that an author's research be “judged on the merit of [his/her] ideas, not on the basis of academic rank, sex, place of work, publication record, and so on” (p. 252). Social bias is the differential evaluation of an author's submission as a result of her/his perceived membership in a particular social category. Social bias challenges the thesis of impartiality by suggesting that reviewers do not evaluate submissions—their content and relationship to the literature—independently of the author's (perceived) identity.
Some view this type of bias as malicious in nature. For example, acknowledging the problem of ad hominem bias, Nature's review policies warn that reviewer anonymity cannot be protected “in the face of a successful legal action to disclose identity in the event of a reviewer having written personally derogatory [review] comments about the authors” (Nature, 2012, para. 41). However, bias that violates the norm of universalism need not be ill-intended or conscious at all (Lee & Schunn, 2011). An individual may sincerely espouse norms of equality and invoke normatively appropriate criteria to justify biased evaluations. Implicit biases in evaluation, which result from automatic and subconscious processes, are not usually blocked by the conscious, deliberative processes by which egalitarian beliefs are formed and sustained (Bargh & Williams, 2006; Chaiken & Trope, 1999). For example, hiring studies demonstrate that, despite identical curricula vitae, male applicants are judged to have qualities superior to those of female applicants (Steinpreis, Anders, & Ritzke, 1999). Ironically, evaluators given the opportunity to disagree with blatantly sexist statements were more likely to reject women for stereotypically male jobs (Monin & Miller, 2001).
Much (but not all) research in this genre assumes that the quality of work by individuals across different social groups (e.g., prestigious vs. not, male vs. female) is, in the aggregate, roughly comparable. As a result, we should expect the rate at which members of less powerful social groups enjoy successful peer review outcomes to be proportionate to their representation in submission rates. Researchers infer the existence of bias when a difference is discovered and infer the lack of bias when no difference is discovered. Very few studies are able to demonstrate that their submission pools are similar to or representative of the larger population of researchers (for an exception, see RAND, 2005). Some models refine comparisons across groups by controlling for additional factors that might correlate with submission quality. For example, some studies control for factors such as type of institution (Blank, 1991; Xie & Shauman, 1998), experience (RAND, 2005), and rank (Ley & Hamilton, 2008) since these are acknowledged as affecting the resources and expertise needed to do quality work. Studies that distribute for review submissions that are identical in all respects except for the perceived social category to which the stated author belongs control for quality most adequately (Borsuk et al., 2009; Peters & Ceci, 1982b). However, which author attributes should and do correlate with indicators of manuscript quality are questions that deserve further theorizing and testing.
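The basic inference pattern described here, comparing success rates across author groups, can be sketched with a two-proportion z-test. This is a minimal illustration under stated assumptions, not the model used by any study cited above: the function name and the submission counts are hypothetical, and the test only detects a difference in rates. Whether that difference reflects bias still depends on the assumption that submission quality is comparable across groups.

```python
import math


def two_proportion_z(accepted_a, submitted_a, accepted_b, submitted_b):
    """Two-sided z-test for a difference between two acceptance rates."""
    rate_a = accepted_a / submitted_a
    rate_b = accepted_b / submitted_b
    # Pooled acceptance rate under the null hypothesis of equal rates.
    pooled = (accepted_a + accepted_b) / (submitted_a + submitted_b)
    variance = pooled * (1 - pooled) * (1 / submitted_a + 1 / submitted_b)
    if variance == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (rate_a - rate_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (not real data): group A has 120 of 400 submissions
# accepted (30.0%), group B has 45 of 200 accepted (22.5%).
z, p_value = two_proportion_z(120, 400, 45, 200)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Studies that control for institution type, experience, or rank would replace this simple comparison with a regression model that includes those covariates; the two-group test is shown only because it is the simplest instance of the logic.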
As Merton observed, prestige-based bias calls attention to a “class structure” in science, where those rich in prestige disproportionately accumulate limited resources (e.g., grant monies, publication space, awards), which allows them to garner yet more prestige in a process of cumulative advantage (Merton, 1973, p. 443; Price, 1976). The preferential evaluation of contributions by the prestigious versus the nonprestigious has been dubbed “the Matthew effect” (Merton, 1968). Some researchers perceive that prestige-bias affects peer review: surveys report that applicants to the NSF and NIH are concerned about “old boy” networks (McCullough, 1989, p. 82; Gillespie et al., 1985, p. 49) and bias against researchers in nonmajor universities (Gillespie et al., 1985, p. 49).
A study of peer-reviewed grant decisions for awarding long-term fellowships to postgraduate researchers in biomedicine discovered that funding rates decreased correlatively with institutional prestige; however, the effect was small and not statistically significant (Bornmann & Daniel, 2006). In a much discussed and highly cited study, Peters and Ceci (1982b) investigated whether “researchers affiliated with prestigious institutions will tend to fare better than colleagues at less prestigious ones” (p. 748). To control for the quality of submissions, they resubmitted published articles by prestigious individuals from prestigious institutions under fictitious names associated with less prestigious institutions. They found that resubmitted manuscripts were rejected 89% of the time (higher than the journals' 80% rejection rate) on the grounds that the studies contained “serious methodological flaws” (p. 187).
Reviewers do not necessarily use the prestige of an author as direct grounds for their recommendations. For example, Bornmann, Weymuth, and Daniel (2010) investigated the content of reviews of rejected articles to identify which negative comments were the best predictors of future success at subsequent high- and low-ranked journals. They found that reviewers cite relevance and design of research rather than social factors (such as affiliation and institution).
Affiliation bias occurs when reviewers and authors/applicants enjoy formal or informal relationships. This bias may be classified as a kind of bias that varies as a function of reviewer characteristics, since affiliation is shared between authors and reviewers. Affiliation bias may be a form of prestige bias in cases where reviewers and authors enjoy formal or informal relationships due to shared, prestige-marked characteristics (e.g., institutional affiliation). Wennerås and Wold (1997) discovered that postdoctoral fellowship applicants with personal ties to reviewers were assessed as more competent than those who were not affiliated but equally productive. A replication of this work found a 15% affiliation bonus for both male and female applicants (Sandström & Hällsten, 2008). However, affiliation does not always result in favorable outcomes for authors and applicants: Oswald (2008) found that two journals housed at top economics departments did not favor, or even discriminated against, authors from the journal's parent institution.
Many studies have found that journals favor authors located in the same country as the journal (e.g., Daniel, 1993; Ernst & Kienbacher, 1991; Link, 1998), some highlighting a particularly strong degree of preferential attachment in the U.S. (Ernst & Kienbacher, 1991). Yet other studies suggest that American reviewers are more critical of their compatriots and more lenient when assessing grant applications of non-American authors (Marsh et al., 2008). These studies use current author address as a proxy for nationality; however, doing so conflates current affiliation or address with country of origin and ethnicity. Others worry that nationality bias may reflect prose quality and not nationality per se (Cronin, 2009). However, it may also be that language and writing style are cited as problems for manuscripts written by non-native speakers even when there is nothing problematic about the prose (Herrera, 1999).
The potential for language bias has been examined both in terms of acceptance rates and as a dependent variable when blinding reviews. An examination of reviews in medical research demonstrated a significant difference in the acceptance rates for abstracts written by authors from English- and non-English-speaking countries (Ross et al., 2006). The difference diminished when the editors instituted blind review: “Blinding significantly attenuated the association between language and likelihood of abstract acceptance” (p. 1679). Tregenza (2002) found that acceptance rates at ecology and evolution journals were higher for first authors living in wealthy English-speaking nations versus wealthy non-English-speaking nations. However, another study found that language was not an important criterion for acceptance or rejection, noting no significant difference in acceptance rates for “linguistically criticized” manuscripts compared with those that did not receive such criticism (Loonen, Hage, & Kon, 2005, p. 1,469).
In light of the gender gap in STEM (science, technology, engineering, and medicine) fields (Budden et al., 2008; Wennerås & Wold, 1997), the prevailing assumption has been that men are overall more favorably treated than women in peer review. Although empirical research on gender bias in publication and grant outcomes has produced “data and interpretations which at times are contradictory” (Rees, 2011, p. 140), recent meta-analysis suggests that claims of gender bias in peer review “are no longer valid” (Ceci & Williams, 2011, p. 3,157).
For example, if there is gender bias in review, we would expect double-blind conditions to increase acceptance rates for female authors. However, this is not the case (Blank, 1991). Nor are manuscripts by female authors disproportionately rejected at single-blind review journals such as Journal of Biogeography (Whittaker, 2008), Journal of the American Medical Association (Gilbert, Williams, & Lundberg, 1994), Nature Neuroscience (Nature Neuroscience, 2006), and Cortex (Valkonen & Brooks, 2011). Even when the quality of submissions is controlled for, manuscripts authored by women do not appear to be rejected at a higher rate than those authored by men (Borsuk et al., 2009).
Wennerås and Wold (1997) found that female biomedical postdoctoral fellowship applicants had to be 2.5 times more productive than male applicants to receive the same competence score. However, replications of the study at comparable institutions in the U.K. (Grant, Burden, & Breen, 1997), Canada (Friesen, 1998), and Germany (Bornmann & Daniel, 2006) failed to discover statistically significant gender bias in the awarding of the same type of postdoctoral fellowship. A later replication at the same institution found that gender-based allotments had reversed (Sandström & Hällsten, 2008).
Meta-analyses (Marsh et al., 2009) and large-scale studies (Marsh, Jayasinghe, & Bond, 2011; RAND, 2005) of grant outcomes found no gender differences after adjusting for factors such as discipline, country, institution, experience, and past research output. One study found that female applicants received only 63% of the funding that their male colleagues received from the NIH (RAND, 2005). However, a later study found funding success rates were nearly equal for men and women at NIH when controlling for research rank/stage (Ley & Hamilton, 2008).
Bias as a Function of Reviewer Characteristics
Bias as a function of reviewer characteristics challenges the impartiality of peer review by demonstrating that reviewers fail to evaluate a submission's content and relationship to the literature independently of reviewer characteristics. Such bias is demonstrated by showing that specific classes of reviewers are systematically tougher or softer on identical submissions (e.g., Jayasinghe et al., 2003) or across multiple submissions (e.g., Gilbert et al., 1994).
Evaluative strictness or leniency can be idiosyncratic to individual reviewers (Casati, Marchese, Ragone, & Turrini, 2009; Marsh & Ball, 1989; Thurner & Hanel, 2010). Strictness or leniency can also vary systematically as a function of the social categories to which reviewers belong. Studies show significant differences in the patterns of reviewing by gender, with female reviewers being stricter than their male colleagues (e.g., Borsuk et al., 2009; Jayasinghe et al., 2003; Lane & Linden, 2009; Wing, Benner, Petersen, Newcomb, & Scott, 2010). Female editors have been found to reject more submissions than their male colleagues (Gilbert et al., 1994; Lane & Linden, 2009), although the reverse phenomenon has also been discovered (Wing et al., 2010). Toughness may also vary by disciplinary affiliation: Lee and Schunn (2011) found philosophers' reviews were more negative in tone and more likely to lead to rejection than those written by psychologists. Wood (1997) found American reviewers to be more lenient than their colleagues from the U.K. or Germany. Marsh et al. (2008) found American reviewers to be more lenient than Australians and suggested that the leniency of American reviewers results from “a culture that is comfortable being generous in their evaluations” (p. 163).
This observation raises interesting normative questions about the role that cultural differences play in reviewer style and strictness. Because Marsh et al. (2008) work within a psychometric framework, they conceive of cultural differences in peer evaluations as sources of contamination or error in assessments of a submission's true quality value. However, when evaluative cultures are specific to disciplines, it is less clear whether such differences should be understood as a form of problematic bias. Lamont (2009) and Mallard et al. (2009) argue that discipline-specific evaluative cultures articulate appropriate ways to approach theory/method and provide the proper epistemic grounds for fairly evaluating grant proposals.
Content-based bias involves partiality for or against a submission by virtue of the content (e.g., methods, theoretical orientation, results) of the work. Since different types of content-based bias challenge the thesis of impartiality in different ways, we will save analysis of these challenges for the discussion of the subtypes. Content-based bias is primarily studied in the context of scientific disciplines. This is because the overarching concern motivating research on content-based bias is whether peer review is capable of the kind of self-regulation that encourages scientific progress and the achievement of other scientific goals. Most studies attempt to demonstrate content-based bias by showing that review outcomes vary as a function of the submission's content. However, when such studies are not available, surveys or anecdotal evidence from researchers or grant program managers are appealed to instead.
Many hypothesize that reviewers will evaluate more favorably the submissions of authors who belong to similar “schools of thought,” a form of “cognitive cronyism” (Travis & Collins, 1991, p. 323). The perception that cognitive cronyism is at play in peer review contexts is evidenced by conversations among grant committee members at the U.K. Science and Engineering Research Council, which reveal attempts to contextualize reviewer recommendations by identifying theoretical and subdisciplinary affiliations between reviewers and proposal authors (Travis & Collins, 1991). Sandström (2009) operationalized cognitive cronyism in reviews by examining the relationships between key noun phrases appearing in the titles and abstracts of papers being reviewed and papers written by the reviewers, hypothesizing that reviewers would favor work that was similar to their own. The data did not support the hypothesis.
At what point does cognitive difference become discrimination? Travis and Collins (1991) contrast cognitive cronyism with bias based on social status. For Travis and Collins, cognitive cronyism is not pernicious like social status bias so long as the boundaries of cognitive communities and social hierarchies do not coincide. However, in cases where they do coincide, outsiders may find “old-boy networks” that control journal and conference content (Hull, 1988, p. 156) and citation networks (Ferber, 1986) difficult to penetrate for social reasons disguised as purely cognitive ones (Lee & Schunn, 2011).
If reviewers prefer research that is similar in cognitive orientation and content to their own, then we would expect that, on the whole, reviewers disfavor research inconsistent with their theoretical orientation as well as research falling outside the mainstream, including interdisciplinary and transformative research.
In the psychological literature, confirmation bias is the tendency to gather, interpret, and remember evidence in ways that affirm rather than challenge one's already held beliefs (Nickerson, 1998). Historical and philosophical analyses have demonstrated the obstructive and constructive role that confirmation bias has played in the course of scientific inquiry, theorizing, and debate (Greenwald, Pratkanis, Leippe, & Baumgardner, 1986; Solomon, 2001). In the context of peer review, confirmation bias is understood as reviewer bias against manuscripts describing results inconsistent with the theoretical perspective of the reviewer (Jelicic & Merckelbach, 2002). As such, confirmation bias can also be classified as a type of bias that varies as a function of reviewer characteristics. Confirmation bias challenges the impartiality of peer review by questioning whether reviewers evaluate submissions on the basis of their content and relationship to the literature, independently of their own theoretical/methodological preferences and commitments. Confirmation bias also challenges the impartiality of scientists qua scientists by questioning their ability to evaluate scientific hypotheses on the basis of the evidence independently of their “desires, value perspectives, cultural and institutional norms and presuppositions, expedient alliances and their interests” (Lacey, 1999, p. 6).
Empirical study suggests reviewers are vulnerable to confirmation bias. Ernst, Resch, and Uher (1992) found that referees who had published work in favor of a controversial clinical intervention judged a manuscript whose data supported the use of that intervention more favorably than those who had published work against it. Confirmation bias for or against manuscripts may be rooted in biased assessments along more specific dimensions of evaluation. For example, Mahoney (1977) found that reviewers judged the methodological soundness, data presentation, scientific contribution, and publishability of a manuscript to be of higher quality when its data were consistent with the reviewer's theoretical orientation. However, consistency between a reviewer's theoretical orientation and a manuscript's reported results does not automatically lead to confirmation bias. Hull's (1988) analysis of reviewer recommendations for Systematic Zoology demonstrates that, during a time of warring schools of taxonomy, confirmation bias among reviewers was “far from total” (p. 333) since allies can disagree on fundamental tenets and wish to prevent the publication of weak papers that could become easy targets for rivals.
Peer review is often censured for its conservativism, that is, bias against groundbreaking and innovative research (Braben, 2004; Chubin & Hackett, 1990; Wesseley, 1998). Conservativism violates the impartiality of peer review by suggesting that reviewers do not interpret and apply evaluative criteria in identical ways since what count as the proper criteria of evaluation—and their relative weightings—are disputed. Although some challenge the suggestion that conservativism is epistemically problematic (Shatz, 2004), most argue that conservativism threatens scientific progress by stifling the funding and public articulation of alternative and revolutionary scientific theories (Stanford, 2012). More locally, conservativism violates explicit mandates, articulated by journals and granting institutions, to fund and publish innovative research (Frank, 1996; Horrobin, 1990; Luukkonen, 2012).
Many have voiced concern about conservativism in peer review, including past directors at the NSF and NIH (Carter, 1979; Kolata, 2009) and applicants to these institutions (Gillespie et al., 1985, p. 49; McCullough, 1989, p. 83). Research suggests that authors proposing unorthodox as opposed to orthodox claims must meet a higher burden of proof: Resch, Ernst, and Garrow (2000) demonstrated that studies supporting unorthodox medical treatments were rated less highly even though the supporting data were equally strong. Qualitative research reveals another possible source for conservativism: for many grant panelists, “frontier” research is understood as “paradigm-shifting” and “revolutionary” (Luukkonen, 2012, p. 54), while “excellent” research is understood as involving “methodological rigour and solid quality of the research” (Luukkonen, 2012, p. 54). Because of the uncertainty surrounding the pursuit of novel methods and theories (and the need for multiple contingency plans should a new experiment or project not go as planned), it may be more difficult for frontier research to appear excellent qua methodologically rigorous or solid.
There is a paucity of quantitative work on whether and where conservativism arises in peer review. This gap indicates a crucial area for future research—one facing methodological and conceptual challenges. Since all manuscripts and grant proposals aim to be novel in some respect, studies on conservativism must find ways to measure degrees of novelty and/or parse out how different types of novelty (e.g., in methods, theory, application context, research question, or statistical analyses) impact peer evaluations.
Bias against interdisciplinary research
Some researchers have expressed concerns about bias against interdisciplinary research since, it is thought, disciplinary reviewers prefer mainstream research (Travis & Collins, 1991). Bias against interdisciplinary research, if discovered, would violate the impartiality of peer review by suggesting that reviewers do not interpret and apply evaluative criteria in identical ways because what count as the proper criteria of evaluation (and their relative weightings) are disputed. Bias against interdisciplinary research would also be problematic since many of the most important social and scientific problems require multiple disciplinary perspectives to address (Metzger & Zare, 1999, p. 642).
Efforts to demonstrate interdisciplinary bias have been mixed. Porter and Rossini (1985) found that interdisciplinary proposals at the NSF received lower ratings. However, no difference in peer rating for interdisciplinary research was found by the Finnish Research Council (Bruun, Hukkinen, Huutoniemi, & Klein, 2005) or by the International Review Committee for Physics (Rinia, van Leeuwen, van Vuren, & van Raan, 2001). The perception of bias remains, however: grant panelists at the European Research Council gave their favorite interdisciplinary projects the highest rating in a strategic effort to counterbalance anticipated bias (Luukkonen, 2012, p. 56). Rather than endorse this kind of gaming behavior by reviewers, the Public Library of Science (PLoS) and the U.K. Research Integrity Office recommend seeking the expertise of a larger number of reviewers, a practice undertaken by the Royal Society (Science and Technology Committee, 2011), to ensure that interdisciplinary work is evaluated by individuals with the appropriate skills and expertise.
Publication bias is the tendency for journals to publish research demonstrating positive rather than negative outcomes, where “positive outcomes” include results that have a positive direction (Bardy, 1998), are statistically significant irrespective of the direction of result (Dickersin, Min, & Meinert, 1992), or both (Fanelli, 2010; Ioannidis, 1998). The controversy surrounding publication bias demonstrates that scientists disagree about the evaluative merits of research reporting negative outcomes (Ioannidis, 2005; Palmer, 2000). More commonly, publication bias is understood as normatively problematic because it leads to exaggerated effect size measurements in later meta-analyses (Ioannidis, 2005; Palmer, 2000), creates publication patterns that conflict with overall disciplinary goals (Lee, 2012), and encourages the practice of “burying” or “redressing” negatives as positives in distorting ways (Chan, Hróbjartsson, Haahr, Gøtzsche, & Altman, 2004; Gerber & Malhotra, 2008). There is work suggesting that publication bias is the result of reviewer and editor preferences for positive outcomes: for example, the Journal of the American Medical Association was more likely to accept statistically significant results on the primary outcome (Olson et al., 2002). However, other work suggests it is authors, anticipating the rejection of negative outcomes, who are primarily responsible for the disproportionate publication of positive outcomes (Dickersin, 1990; Easterbrook, Berlin, Gopalan, & Matthews, 1991) as well as for the increased time lag in the publication of negative results (Ioannidis, 1998).
Conclusions and Future Research
Impartiality ensures both the consistency and meritocracy of peer review. Research on bias in peer review—predicated on the ideal of impartiality—raises not just local hypotheses about specific sources of partiality, but much broader questions about whether the processes by which knowledge communities regulate themselves are epistemically and socially corrupt. Contra impartiality, the evidence suggests that peer evaluations vary as a function of author nationality and prestige of institutional affiliation; reviewer nationality, gender, and discipline; author affiliation with reviewers; reviewer agreement with submission hypotheses (confirmation bias); and submission demonstration of positive outcomes (publication bias).
However, a closer look at the empirical and methodological limitations of research on bias raises questions about the existence, extent, and normative status of many hypothesized forms of bias. Psychometrically oriented research is predicated on the questionable assumption that disagreement among reviewers is not normatively appropriate or desirable. Research on bias as a function of author characteristics adopts the untested assumption that authors belonging to different social categories submit manuscripts and grant proposals of comparable quality. Despite vocal concerns about conservativism in science, there is no empirical evidence (beyond anecdote) to buttress or belie such worries. And the evidence for bias against interdisciplinary research is mixed, as is the evidence for bias against female authors and authors living in non-English-speaking countries.
Research on bias in peer review also suggests that peer review is social in ways that go beyond the social categories to which authors and reviewers belong: Relationships between individuals in the process impact outcomes (e.g., affiliation bias), and individuals make decisions conditioned on beliefs about what others value (e.g., publication bias). Future research might usefully investigate these complex and dynamic social relations. Consider, for example, how the editor's relationships with, and beliefs about, other actors may have an impact on his/her decisions. On the basis of previous experience with reviewers, the editor may differentially value and preferentially assign reviewers to manuscripts, which may alter final recommendations. Authors who publish frequently in the journal, or whose work is highly sought after, may develop a privileged relationship with the editor and with potential reviewers. Editors may feel peer pressure when evaluating manuscripts submitted by frequent reviewers and editorial board members (Lipworth, Kerridge, Carter, & Little, 2011). The readership may function as an invisible hand in the selection of authors and manuscript content, since the editor will need to be cognizant of the needs and wants of the marketplace. An editor may also be influenced by his/her relationship with the editorial board and/or publisher (commercial, academic, or society). The editor's strategy or vision for the journal may have a bearing on which manuscripts are reviewed and ultimately accepted for publication. As Chubin and Hackett (1990, p. 92) note, “[t]he journal editor occupies a delicate position between the author and reviewers, alternating among the roles of wordsmith and gatekeeper, caretaker and networker, literary agent and judge.”
Not all of these sources of social influence impact peer review in problematic ways. For example, the ways in which authors, reviewers, and editors anticipate each other's scrutiny and judgment may serve to improve the quality of each of their contributions (Bailar, 1991; Hirschauer, 2009), and editors' personal connections allow them to learn about and capture high-impact papers for publication (Laband & Piette, 1994). These examples suggest that the sociality of peer review can be structured “to enrich, rather than threaten” the well-being of peer review (Lipworth et al., 2011, p. 1,056). A natural direction for future research includes articulating and assessing alternative normative models that acknowledge reviewer partiality, with a focus on the epistemic and cultural bases for reviewer disagreement; the ways editors and grant program managers anticipate, capitalize on, and manage reviewer disagreement; and the ways publication venues and funding opportunities should be structured to accommodate reviewer differences (Hargens & Herting, 1990; Lee, ). Finally, the inescapable sociality and partiality of peer evaluation raise questions about whether impartiality can or should be upheld as the ideal for peer review.