Blinded vs. unblinded peer review of manuscripts submitted to a dermatology journal: a randomized multi-rater study

Authors


  • Funding sources
    Departmental Funds, Department of Dermatology, Section of Cutaneous and Aesthetic Surgery, Northwestern University.

  • Conflicts of interest
    The authors have no conflicts of interest to disclose. Neither Northwestern University nor any of the investigators or study personnel received or expect to receive compensation, supplies, equipment or any other inducements, directly or indirectly, from the manufacturers or distributors of any of the study agents.

Murad Alam.
E-mail: m-alam@northwestern.edu

Summary

Background  Submissions to medical and scientific journals are vetted by peer review, but peer review itself has been poorly studied until recently. One concern has been that manuscript reviews in which the reviewer is unblinded (e.g. knows author identity) may be biased, with an increased likelihood that the evaluation will not be strictly on scientific merits.

Objectives  The purpose of this study was to compare the outcomes of blinded and unblinded reviews of manuscripts submitted to a single dermatology journal via a randomized multi-rater study.

Materials and methods  Forty manuscripts submitted to the journal Dermatologic Surgery were assessed by four reviewers, two of whom were randomly selected to be blinded and two unblinded regarding the identities of the manuscripts’ authors. The primary outcome measure was the initial score assigned to each manuscript by each reviewer characterized on an ordinal scale of 1–3, with 1 = accept; 2 = revise (i.e. minor or major revisions) and 3 = reject. Subgroup analysis compared the primary outcome measure across manuscripts from U.S. corresponding authors and foreign corresponding authors. The secondary outcome measure was word count of the narrative portion (i.e. comments to editor and comments to authors) of the reviewer forms.

Results  There was no significant difference between the scores given to manuscripts by unblinded reviewers and blinded reviewers, both for manuscripts from the U.S. and for foreign submissions. There was also no difference in word count between unblinded and blinded reviews.

Conclusions  At least in the case of one dermatology journal, blinding during peer review does not appear to affect the disposition of the manuscript. To the extent that review word count is a proxy for review quality, there appears to be no quality difference associated with blinding.

Peer review is the gold standard for assessment of manuscripts submitted to scientific journals. Until the past 20 years, however, the integrity of the process of peer review was rarely studied.1 Since then, concerns have been raised regarding the potential pitfalls of peer review, including the possibility that well-known or established authors may have an unfair advantage in obtaining manuscript acceptances from journals.2–4 To the extent that manuscript reviewers are aware of authors’ identities, this argument suggests that reviewers could potentially tilt their reviews to favour well-regarded authors. The biases of individual reviewers against specific authors may also influence peer review integrity. One means for eliminating such unconscious or conscious bias would be to blind reviewers to author identity. The problem with this approach is that it is time-consuming, and may be difficult to accomplish given the abundance of self-referential material in most papers. Reviewers familiar with a given specialty area may also be aware of conflicts which were not revealed by an author, such that unblinded reviews could actually provide extra protection against fraud and deceit. It would be helpful for editors to know whether blinding is important before incurring the expense and inconvenience of taking this additional step. There are as yet no studies in dermatology addressing this question.

The purpose of this study was to compare the outcome of blinded vs. unblinded peer review for a single dermatology journal, Dermatologic Surgery. This was done via a randomized multi-rater study, in which the same manuscripts were sent to both blinded and unblinded reviewers.

Materials and methods

Participants (reviewers)

This study was approved by the Northwestern University IRB. Declaration of Helsinki protocols were followed, and all participants provided written, informed consent at the inception of the study. Participants were 20 existing Dermatologic Surgery reviewers. They were selected from the list of active reviewers who had previously reviewed at least five articles each for the journal, and who cumulatively reflected a range of expertise in dermatological surgery. The participating reviewers were told that if they consented, they would be asked to review up to approximately 10 papers each over the ensuing several months.

Protocol

We conducted a randomized rater-blinded prospective study in which four reviewers were randomly selected from a pool of 20 volunteer reviewers (selected as above) to evaluate each of 40 consecutive manuscripts submitted for publication to Dermatologic Surgery and assigned to one assistant editor (unpaid volunteer position) (M.A.) for processing.

The types of manuscripts reviewed included a mixture of original research studies, case reports, evidence-based reviews, and descriptions of surgical technique. As such, they were representative of the typical mix of submissions routinely received by the journal. Consecutive submissions assigned to the assistant editor were included in the study, and no vetting or selection was performed.

Assignment

For each manuscript, four reviewers were randomly selected from the pool of 20 reviewers. Two reviewers were randomly selected to receive blinded versions of the manuscript, and the other two received unblinded versions.
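The assignment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the investigators actually used: the reviewer labels, the function name `assign_reviewers`, the fixed seed and the use of Python's `random` module are all hypothetical.

```python
import random

def assign_reviewers(pool, rng):
    """Pick four distinct reviewers at random and split them 2 blinded / 2 unblinded."""
    chosen = rng.sample(pool, 4)  # four distinct reviewers, in random order
    return {"blinded": chosen[:2], "unblinded": chosen[2:]}

# Hypothetical labels standing in for the pool of 20 reviewers.
pool = [f"R{i:02d}" for i in range(1, 21)]

# Fixed seed only so that this sketch is reproducible.
rng = random.Random(0)

# One independent assignment for each of the 40 manuscripts.
assignments = {m: assign_reviewers(pool, rng) for m in range(1, 41)}
```

Because `rng.sample` returns the selected reviewers in random order, taking the first two as blinded and the last two as unblinded is itself a random split.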

Masking

Unblinded manuscripts were provided to the reviewers exactly as they were uploaded to the journal submission site. Blinded manuscripts were identical, except that all known identifiers, including author and institution identifiers in the title page, main text, figures and tables, had been deleted (by N.A.K. or J.H.) electronically. To avoid the problem of inadvertently revealing authorship, the entire bibliography/endnotes section was removed before the blinded manuscripts were sent for review. (During the routine editorial process, Dermatologic Surgery typically has unblinded reviews.)

Participant flow and follow-up

When a newly submitted manuscript was uploaded into the assistant editor’s inbox by the journal’s editorial staff, four reviewers were assigned as above, and each of them received either a blinded or unblinded paper copy of the manuscript for review, as noted above. Each reviewer also received an email asking them to enter the results of their review into an electronic version of the Dermatologic Surgery reviewer form. Completed forms were returned as email attachments to the assistant editor’s email inbox. The reviewer forms included a number of checkboxes regarding manuscript quality, a series of checkboxes for suggesting disposition of the manuscript (i.e. accept, minor revisions, major revisions, reject), and narrative sections for comments to the editors and authors, respectively. Reviewers of the blinded articles were not specifically queried as to whether they thought they could identify the authors of the article to which they were assigned; however, none volunteered that they could.

Analysis

The primary outcome measure was the initial score assigned to each manuscript by each reviewer. Because three different recommendations were possible, each recommendation was characterized on an ordinal scale of 1–3, with 1 = accept, 2 = revise (i.e. minor or major revisions) and 3 = reject. Subgroup analysis entailed comparison of the primary outcome measure across manuscripts from U.S. corresponding authors vs. foreign corresponding authors. The secondary outcome measure was word count of the narrative portion (i.e. comments to editor and comments to authors) of the reviewer forms.

Data pertaining to the primary outcome measure, specifically the ordinal ratings of each manuscript by each reviewer, are presented as response profiles, which are three percentages: (%Accept, %Revise, %Reject). Cumulative logit logistic regression5 analysis was used to compare response profiles. Mean word count was analysed using repeated measures analysis of variance. The sample size of 40 manuscripts has 84% power to detect a difference of 25 percentage points (30% vs. 55%) in the percentage of accepted manuscripts between two raters who both rate each manuscript, where the statistical test is done at P < 0·10. A sample size of 40 manuscripts allows any percentage in the response profile to be estimated with a standard error no larger than 8%.
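The 8% bound on the standard error can be checked with a short calculation. The sketch below assumes the simple binomial formula for the standard error of a proportion, with 40 independent observations; the text does not state how the original figure was derived, and the function names are hypothetical.

```python
import math

def standard_error(p, n):
    """Binomial standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def max_standard_error(n):
    """The standard error is largest at p = 0.5, which gives the worst case."""
    return standard_error(0.5, n)

# Worst case for 40 manuscripts: about 0.079, i.e. just under 8%.
se_max = max_standard_error(40)
```

Any other proportion in the response profile gives a smaller value, which is consistent with the statement that no percentage has a standard error above 8%.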

Results

Forty manuscripts submitted to the journal Dermatologic Surgery were each assessed by four reviewers, two of whom were blinded and two unblinded regarding the identities of the authors. All 20 reviewers initially approached to participate in this study consented, none declined, and none dropped out before completion of the study. All of the reviewers were from the U.S. None of the reviewers was assigned an article that they declined to review, either because of a conflict of interest, because they were authors, or for some other reason. None of the reviewers evinced any difficulty in reviewing assigned blinded manuscripts without full access to the bibliography.

Twenty-six of the manuscripts had corresponding authors from the U.S., and the remaining 14 had foreign corresponding authors. The foreign authors were from a variety of countries, including those in Asia (especially Turkey, Korea, Japan, China), Europe (including the U.K., Italy, Germany, France) and South America (including Brazil), as well as Mexico, Canada and Australia. No one country accounted for more than 15% of foreign submissions.

In a total of six cases, reviewers failed to complete their reviews within 1 month of the date of assignment; a different reviewer was then randomly assigned to replace each of these. Replacement reviewers were substituted for three blinded and three unblinded reviewers, and in no instance were multiple replacement reviewers required for a given manuscript. With 40 manuscripts and four reviews per manuscript, there were a total of 160 ratings. The overall response profile for these 160 ratings was 35·0%, 48·1%, 16·9% (accept, revise, reject). Statistics for ratings by blinding status and rater are presented in Table 1. Comparison of blinded vs. unblinded reviewers revealed a response profile of 37·5%, 48·75%, 13·75% for the blinded reviewers, and 32·5%, 47·5%, 20·0% for the unblinded reviewers (P = 0·32). Statistics for ratings by blinding status and country are presented in Table 2. For U.S. manuscripts only, comparison of blinded vs. unblinded reviewers indicated a response profile of 40·4%, 48·1%, 11·5% for the blinded reviewers, and 38·5%, 42·3%, 19·2% for the unblinded (P = 0·50). For foreign manuscripts only, comparison of blinded vs. unblinded reviewers found a response profile of 32·1%, 50·0%, 17·9% for the blinded reviewers, and 21·4%, 57·1%, 21·4% for the unblinded (P = 0·45).

Table 1.   Summary of ratings by all reviewers; all percentages are row percentages, n (%)

                        Number of ratings   Accept      Revise       Reject
All blinded reviews     80 (100·0)          30 (37·5)   39 (48·75)   11 (13·75)
All unblinded reviews   80 (100·0)          26 (32·5)   38 (47·5)    16 (20·0)
All ratings             160 (100·0)         56 (35·0)   77 (48·1)    27 (16·9)

Table 2.   Summary of ratings by blinding status and country; all percentages are row percentages, n (%)

                        Number of ratings   Accept      Revise       Reject
U.S.
  Blinded reviews       52 (100·0)          21 (40·4)   25 (48·1)     6 (11·5)
  Unblinded reviews     52 (100·0)          20 (38·5)   22 (42·3)    10 (19·2)
  All U.S. reviews      104 (100·0)         41 (39·4)   47 (45·2)    16 (15·4)
Foreign
  Blinded reviews       28 (100·0)           9 (32·1)   14 (50·0)     5 (17·9)
  Unblinded reviews     28 (100·0)           6 (21·4)   16 (57·1)     6 (21·4)
  All foreign reviews   56 (100·0)          15 (26·8)   30 (53·6)    11 (19·6)
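As an arithmetic check, the response profiles reported for Table 1 can be recomputed from the raw counts. The following is a minimal sketch; the counts are taken from Table 1, while the function and variable names are hypothetical.

```python
def response_profile(accept, revise, reject):
    """Return (%Accept, %Revise, %Reject) as row percentages."""
    total = accept + revise + reject
    return tuple(100 * c / total for c in (accept, revise, reject))

# Counts of (accept, revise, reject) ratings from Table 1.
counts = {
    "blinded": (30, 39, 11),
    "unblinded": (26, 38, 16),
}

profiles = {group: response_profile(*c) for group, c in counts.items()}

# Pool both groups to obtain the profile for all 160 ratings.
all_counts = tuple(b + u for b, u in zip(counts["blinded"], counts["unblinded"]))
overall = response_profile(*all_counts)
```

The same function applies to the country subgroups in Table 2 once their counts are substituted.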

When U.S. manuscripts were compared with foreign manuscripts, the response profile for all U.S. submissions was 39·4%, 45·2%, 15·4% vs. 26·8%, 53·6%, 19·6% for foreign manuscripts (P = 0·14). The mean number of words of the reviews received from the blinded reviewers was 120·7 [95% confidence interval (CI) 95·4–146·1], and from the unblinded reviewers 138·5 (95% CI 106·3–170·7) (P = 0·59). Comparisons were also made separately by type of decision. For accepted manuscripts, the mean number of words in the reviews received from the blinded reviewers was 43·1 (95% CI 17·4–68·9), and from the unblinded reviewers 37·4 (95% CI 22·1–52·8) (P = 0·83). For revised manuscripts, the mean number of words in the reviews received from the blinded reviewers was 177·2 (95% CI 140·4–214·0), and from the unblinded reviewers 205·6 (95% CI 160·8–250·3) (P = 0·50). For rejected manuscripts, the mean number of words in the reviews received from the blinded reviewers was 132·2 (95% CI 66·9–197·5), and from the unblinded reviewers 143·6 (95% CI 46·9–240·4) (P = 0·87). There was a significant difference in mean word count across decision type (P < 0·0001).

Discussion

The results of this randomized multi-rater study suggest that, for Dermatologic Surgery, there is no difference in the likelihood of manuscript acceptance associated with blinding reviewers to the identities of manuscript authors. In subgroup analysis, this lack of difference held for U.S. manuscripts and foreign manuscripts when they were analysed separately; notably, the subgroup analyses were probably underpowered to detect such differences. Additionally, there appears to be no difference in the length of the narrative reviews provided by blinded vs. unblinded reviewers. To the extent that word count is a proxy for review quality, our results suggest that blinded reviewers are no more or less likely to provide quality reviews than unblinded reviewers. Overall, papers accepted initially had the shortest reviews by word count, papers sent for revision had the longest reviews, and the reviews of rejected manuscripts were intermediate in length; however, these findings held true for both blinded and unblinded reviews, and there was no significant length difference associated with blinding in any of the three decision categories. Finally, there was no significant difference in likelihood of acceptance associated with the provenance of the manuscript, as the response profiles of U.S. and foreign manuscripts were similar.

Although peer review of scientific manuscripts has a 300-year history, the study of its validity has only recently been undertaken.1 In the 1980s and 1990s, several preliminary reports suggested possible bias in the peer review process in medical journals. One emergent concern was that unblinded manuscripts from well-known authors or prestigious institutions could have an unfair advantage in the review process. This concern stimulated further research interest, which in turn led to the seminal International Congress on Biomedical Peer Review and Global Communications held in Prague in September 1997.1 Interestingly, papers that were presented at this meeting and subsequently published did not detect significant reviewer bias. In one study, a previously accepted paper was modified to contain several flaws and then sent out to numerous blinded and unblinded reviewers; there was no difference in the quality of reviews, and blinded reviewers were less likely to recommend rejection.2 Another investigation tracked several hundred submissions that were each sent to one blinded and one unblinded reviewer. Among the 89% of papers that were successfully reviewed in this manner, no blinding-associated differences were detected in reviewer recommendations regarding publication.3 Subsequent research findings have been mixed, with some papers confirming that blinding reviewers does not affect manuscript acceptance and others finding the opposite. A recent large study of abstracts submitted to the American Heart Association annual meeting found that abstracts were more likely to be accepted if reviewers were blinded, and that blinding moderately reduced bias against foreign authors.4 In view of the apparently equivocal benefit of blinding reviewers to prevent acceptance bias, some have asserted that blinding is too onerous and unnecessary a process. These authors have also noted that the perception of bias may exist, but there is little proof to substantiate this fear.5 Indeed, it has been argued that even when journals have tried to blind reviewers, the integrity of this process is a ‘myth’,6–9 given the likelihood of inadvertent or deliberate self-identification of the authors in the text, figure legends and bibliography, or the chance that the reviewer has personal knowledge of the source of the manuscript.

Our study was designed to address some of these concerns by including a strict procedure to remove author identifiers from blinded manuscripts. To avoid the risk of inadvertent disclosure of identifiers via an electronic interface, only edited and reviewed paper copies were provided to reviewers. Even at the cost of reducing manuscript clarity, all author or institution identifiers were removed from the text, figures and bibliographies. Another change from prior studies was the use of two blinded and two unblinded reviewers per manuscript. To the best of our knowledge, the few prospective studies that previously assessed blinding never used more than one blinded and one unblinded reviewer per paper. We believe that inclusion of multiple blinded and unblinded reviewers for each manuscript reduces the risk of data being skewed by individual ‘zealot’ or ‘assassin’ reviewers, who may be prone to have a special affinity or dislike for particular submissions. Similarly, random assignment of reviewers from the pool to review various manuscripts reduces the potential bias associated with editors assigning reviewers to manuscripts. Finally, our study is the first such investigation in the dermatology literature. Overall, our results are in accordance with the preponderance of prior prospective studies, which have noted that blinding reviewers leads to a small increase in the likelihood of acceptance and decrease in bias against foreign manuscripts. In our study, these differences were nominal but not statistically significant.

This study has several limitations. Firstly, the reviewers rated only the first version of each manuscript submitted to the journal. In most cases, revisions were required before a manuscript was eventually accepted, and it is certainly possible that final acceptance or rejection in certain instances deviated from the expected outcome based on the initial reviewer ratings. Secondly, it is possible that some of the differences suggested but not detected by this study could have been statistically significant had the sample size been larger. The study was adequately powered to detect a large effect size, but may have been underpowered to detect a true moderate effect. Thirdly, random assignment of reviewers to manuscripts could have resulted in some reviewers evaluating papers on topics on which they had limited expertise; however, dermatological surgery is a relatively defined field, and most established reviewers for this journal are competent to evaluate almost any manuscript submitted to it. Finally, participating reviewers did know they were being studied, and this may have changed their behaviour. The disclosure to the reviewers and the obtaining of their consent were mandated by the Northwestern University IRB.

Importantly, although the secondary outcome measure treats word count as a proxy for review quality, and this relationship may hold in general, we are aware of no evidence that definitively establishes a correlation between the two variables. For instance, when articles of very high quality are reviewed, there may be very little to say in terms of constructive comments, and reviews may be quite short. In that instance there is a correlation between a short review and high quality of the paper. To the extent that reviewers for Dermatologic Surgery are asked by editors to provide very short reviews for articles that are either nearly ready for publication or clear rejects, it may be surmised that there is some correlation between review length and quality for the remainder of articles which are sent for revision. Indeed, subgroup analysis revealed that the articles sent for revision were associated with the longest reviews, but that there was no difference in word count in this category between blinded and unblinded reviews.

In conclusion, it appears that for the journal Dermatologic Surgery, blinded peer review is not necessary to protect lesser-known authors and institutions from deleterious bias that could reduce the likelihood of manuscript acceptance. The substantial cost of blinding reviewers, and the risk that important information could be lost during the blinding process, must be weighed against the modest potential benefits of reviewer blinding, in particular the nominal (but not statistically significant) reduction in the bias against foreign manuscripts.

What’s already known about this topic?

• Research findings have been mixed, with some papers confirming that blinding reviewers does not affect manuscript acceptance and others finding the opposite.

What does this study add?

  •  At least in the case of one dermatology journal, blinding during peer review does not appear to affect the disposition of the manuscript.
  •  Additionally, to the extent that review word count is a proxy for review quality, there appears to be no quality difference associated with blinding.