Characteristics of Imperial College London's COVID‐19 research outputs

Abstract We identified 651 research outputs on the topic of COVID‐19 in the form of preprints, reports, journal articles, datasets, and software/code published by Imperial College London authors between January and September 2020. We sought to understand the distribution of outputs over time by output type, peer review status, publisher, and open access status. Searches of Scopus, the institutional repositories, GitHub, and other databases identified relevant research outputs, which were then combined with Unpaywall open access data and manually verified associations between preprints and journal articles. Reports were the earliest output type to emerge [median: 103 days, interquartile range (IQR): 57.5–129], but journal articles were the most commonly occurring output type over the entire period (60.8%, 396/651). Thirty preprints were identified as connected to a journal article within the set (15.8%, 30/189). A total of 52 publishers were identified, of which 4 account for 59.6% of outputs (388/651). The majority of outputs were available open access through a gold, hybrid, or green route (66.1%, 430/651). The presence of exclusively non‐peer-reviewed material from January to March suggests that demand could not be met by journals in this period, and the sector supported this with enhanced preprint services for authors. Connections between preprints and published articles suggest that some authors chose to use both dissemination methods and that, as some publishers also serve across both models, traditional distinctions between output types might be changing. The bronze open access cohort brings widespread ‘free’ access but does not ensure true open access.


INTRODUCTION
The novel coronavirus (SARS-CoV-2), the disease it causes, and its implications for society have been described as the fastest-moving production of knowledge in our time (Kupferschmidt, 2020) and are estimated to have resulted in tens of thousands of papers produced in a 6-month period (Teixeira da Silva, Tsigaris, & Erfanmanesh, 2020). At Imperial College London, a large, research-intensive science, technology and medicine university with substantial biomedical and public health expertise, researchers began sharing research on the topic in January 2020 (World Health Organization, 2020b). The forecasts of one report (Ferguson et al., 2020) were widely cited as having changed multiple national government responses to the pandemic (Bruce-Lockhart, Burn-Murdoch, & Barker, 2020; Landler & Castle, 2020; Boseley, 2020). This output received phenomenal media and online attention (https://www.altmetric.com/details/77704842). Many other researchers and groups at Imperial have produced COVID-19 research in a variety of formats and open access models. We sought to understand the quantity and characteristics of all of Imperial's contributions to COVID-19 research in order to provide data for the institution to understand its outputs, as well as to provide an institutional cohort perspective to complement the global-level analyses in other studies of COVID-19 research (Di Girolamo & Meursinge Reynders, 2020; Fraser et al., 2020; Helliwell et al., 2020; Shuja, Alanazi, Alasmary, & Alashaikh, 2020; Teixeira da Silva et al., 2020).
The institution's commitment, as a signatory of the San Francisco Declaration on Research Assessment, to 'consider the value and impact of all research outputs (including datasets and software) in addition to research publications' (SF Dora, 2012) led us to adopt the widest interpretation of research outputs that was still feasible to collect using bibliographic and data search methods, resulting in journal articles, preprints, reports, datasets, and software/code forming the dataset.

OBJECTIVES
We sought to understand the volume and characteristics of the research from Imperial College London on the novel coronavirus in a publication period of 1st January to 30th September 2020.
The following research aims were identified:
• Identify the volume of publications and their distribution over the time period by research output type.
• Determine what proportion of preprints went on to be published as journal articles, and the average time for this.
• Identify open access trends.
• Demonstrate the distribution of outputs between publishers.

METHODS
This was a cross-sectional study of Imperial College London-authored research outputs related to COVID-19. The data were extracted in October 2020.

Software definitions
For the equivalent of a publication date, the earliest date found in the repository referring to the release of, or any documented action on, the output was taken as a proxy publication date. Anonymous authorship practices in software communities introduce uncertainty around authorship and institutional affiliation; outputs identified from non-institutionally managed repositories were therefore manually verified to have Imperial authors before inclusion. Multiple versions of the same software/code published in the same repository were considered as one entity, dated to the earliest version found.
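The proxy-date rule above can be sketched as follows (a minimal illustration with hypothetical dates and function names, not the authors' actual pipeline):

```python
from datetime import date

def proxy_publication_date(event_dates):
    """Take the earliest documented repository event (release, tag,
    commit, etc.) as a proxy for the publication date. Multiple
    versions of the same software/code collapse to one entity dated
    to the earliest version found."""
    if not event_dates:
        raise ValueError("no documented repository events")
    return min(event_dates)

# Hypothetical event dates gathered for one repository:
events = [date(2020, 3, 14), date(2020, 2, 2), date(2020, 5, 9)]
print(proxy_publication_date(events))  # 2020-02-02
```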

Preprint definitions
Multiple versions of the same preprint that shared a common DOI were counted as a single output, but versions with different DOIs or hosted on different servers or repositories were counted as individual outputs. We could not find a systematic way to identify preprints that also existed as journal articles, so we had to identify these connections manually by similarity of title and author composition. We chose to move the contents of the Unpaywall 'publisher' field into the 'journal name' field for preprints and manually entered the owner of the server into the 'publisher' field; for example, 'journal name' becomes 'medRxiv' and 'publisher' becomes 'Cold Spring Harbor Laboratory'.
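The deduplication and field-remapping rules above can be sketched as follows (a minimal illustration; the record fields, DOI, and function names are assumptions, not the actual workflow used in the study):

```python
def dedupe_by_doi(records):
    """Count multiple versions sharing one DOI as a single output;
    versions with different DOIs remain separate outputs."""
    seen, unique = set(), []
    for rec in records:
        if rec["doi"] not in seen:
            seen.add(rec["doi"])
            unique.append(rec)
    return unique

def normalise_preprint_fields(record, server_owners):
    """Move the Unpaywall 'publisher' value into 'journal_name' and
    set 'publisher' to the server owner, e.g. 'medRxiv' becomes the
    journal name and 'Cold Spring Harbor Laboratory' the publisher."""
    rec = dict(record)
    server = rec.get("publisher")
    rec["journal_name"] = server
    rec["publisher"] = server_owners.get(server, server)
    return rec

owners = {"medRxiv": "Cold Spring Harbor Laboratory"}
preprint = {"doi": "10.1101/example", "publisher": "medRxiv"}
print(normalise_preprint_fields(preprint, owners))
```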

RESULTS
A total of 651 outputs were identified from the search. These included journal articles, preprints, software/code, reports, and datasets. See Table 1 for full details.

Volume of publication by month
Month-on-month change in the volume of publication was observed across the period, with some instances of no change (Fig. 1).

Days to publication by output type
Assuming the first instance of a publication to be Day 1, reports were the earliest output type to emerge [median: 103 days, interquartile range (IQR): 57.5–129] (Fig. 2).
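The day-numbering convention, with the set's first publication counted as Day 1, can be sketched as follows (hypothetical dates, not the study's data):

```python
from datetime import date

def day_number(pub_date, first_pub_date):
    """Number each output's publication day, counting the date of the
    first output in the set as Day 1."""
    return (pub_date - first_pub_date).days + 1

first = date(2020, 1, 10)  # hypothetical date of the first output
print(day_number(date(2020, 1, 10), first))  # 1
print(day_number(date(2020, 4, 21), first))  # 103
```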

Peer reviewed and non-peer reviewed outputs
Classifying outputs as peer reviewed (PR) or non-peer reviewed (NPR) revealed that outputs from January to March were exclusively NPR, but that across the entire time period the majority of outputs were PR (60.8%, 396/651).

DISCUSSION
Although the majority of outputs over the entire time period were journal articles, the exclusive presence of NPR outputs (reports and preprints) between January and March, which were not surpassed by PR content until May, suggests that authors needed a faster form of dissemination than journals could offer in the early months of the coronavirus pandemic (Kupferschmidt, 2020), similar to those working in other global health emergencies (Zhang, Zhao, Sun, Huang, & Glänzel, 2020). As authors chose to disseminate research in preprint form, the sector responded: PubMed Central adapted to include coronavirus preprints (www.ncbi.nlm.nih.gov/pmc/about/nihpreprints/), and other preprint servers adapted to prioritize this research or were established solely for the crisis (Lu Wang et al., 2020).
Journal publishers responded to the crisis: a decrease in the number of days between submission and publication has been observed for some medical journals publishing on the topic (Horbach, 2020), as have announcements of reduced peer review times by publishers (Redhead, 2020). However, whether the likely contradictory demands of reducing peer review and editorial time whilst retaining quality (Kwon, 2020) are sustainable or achievable is yet to be evaluated in the long term. There is some indication that this pressure is changing journal publisher attitudes to preprints, seen in the explicit encouragement of preprints on the topic at The New England Journal of Medicine (Rubin, Baden, Morrissey, & Campion, 2020), the reference to the pandemic as a reason for The Lancet's decision to make its 'Preprints with the Lancet' SSRN platform permanent in September 2020 (Kleinert & Horton, 2020), and the introduction of a default preprint policy for COVID-19 submissions at eLife (Eisen, Akhmanova, Behrens, & Weigel, 2020). Publication platforms such as Wellcome Open Research and F1000 further disrupt traditional distinctions in the journal and peer review process.
As preprints shift closer to the centre field of established scholarly communications, either the infrastructure and data standards supporting them need to develop, or bibliographic tools need to adapt to accommodate them. The complicated method of preprint data collection in this study (searching the institution's CRIS records, a search function only available to administrators at the institution, and then supplementing this with a second search on Dimensions) was necessary because, although some databases index preprints (Europe PMC, Dimensions), the contributor affiliation data associated with preprints are not of sufficient quality or sufficiently widespread to enable a comprehensive search with verified affiliations. This is not a fault of the databases but rather a dependency on structured and parsable metadata from preprint servers that is not always available. The lack of accessible methods for finding connections between preprints and published journal articles, perhaps also due to missing metadata identifiers, prevents large-scale or automated data collection and requires associations to be identified manually, as in this study. This constraint may be preventing the rich insight that could come from easily accessible mapping of preprint and article networks.
The presence of 52 publishers is an indication that authors are served with competitive options from which to choose their preferred outlet for dissemination and are safeguarded against 'lock-in' to any one provider. Whilst the majority of publishers predominantly serve one output type, e.g. journal publishers to journal articles, some are represented across more than one type; for example, the institutional repository publishing as 'Imperial College London' is represented amongst datasets (1), preprints (1), and reports (29). This could be a positive indicator that artificial distinctions in the research life cycle are being replaced with more holistic solutions that offer dissemination for all outputs of research. However, others have raised concern that the representation of commercial publishers across output types poses a threat to equity and value in the research production cycle (Posada & Chen, 2018). The opening of publisher content as bronze open access is positive but has limitations: the access is not ensured in perpetuity and could be revoked in the future (Elsevier, 2020), and the conditions of rights are not consistently clarified. Areas of particular need in this crisis that free access alone does not ensure are machine access for text- and data-mining purposes, which is needed to apply artificial intelligence and machine-learning techniques to COVID-19 research (Shuja et al., 2020), and translation rights for dissemination in a global public health event.
This study of a single institution's outputs was undertaken with an awareness that Imperial is not the largest contributor by publication volume to COVID-19 research (Hook & Porter, 2020) and is certainly not the only institution to have produced impactful results. Despite suggestions that the pressures of adapting research practices to accommodate lab closures and the demand for rapid results led to smaller teams and fewer international collaborative partners in the early months of the pandemic (Fry, Cai, Zhang, & Wagner, 2020), we understand that coronavirus research demands collaboration at every level (Apuzzo & Kirkpatrick, 2020) and that any institutional-level analysis should be interpreted in relation to organisation size, mission, and resources.

LIMITATIONS
We recognize the limitations of comparing output types without adjusting for their characteristics or context. For example, comparing the publication times of journal articles and preprints is not truly fair given the vastly different time investments each type requires; nor is comparing the open access models of output types that are mandated to be open access (e.g. articles) with those that are not (preprints, datasets, reports, software/code).
The green open access share of the data may underrepresent the true number of articles self-archived, an action that is mandated by the institution's open access policy. This is because outputs would only be classified green when there is no publisher-hosted option available (Piwowar et al., 2018), so it is possible that some of the bronze open access items also exist as repository-archived green open access, but the Unpaywall hierarchy gives authority to the bronze publisher-hosted version in classification.
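The classification hierarchy described above, in which a free publisher-hosted copy takes authority over a repository-archived copy, can be sketched as follows (a simplified illustration, not Unpaywall's actual algorithm):

```python
def classify_oa(publisher_free, open_licence, repository_copy):
    """Simplified sketch of the hierarchy: a free publisher-hosted
    copy takes precedence, so an article that is both bronze (free on
    the publisher site, no open licence) and archived in a repository
    is classified bronze; 'green' is recorded only when no
    publisher-hosted copy exists."""
    if publisher_free:
        return "gold/hybrid" if open_licence else "bronze"
    if repository_copy:
        return "green"
    return "closed"

print(classify_oa(publisher_free=True, open_licence=False, repository_copy=True))   # bronze
print(classify_oa(publisher_free=False, open_licence=False, repository_copy=True))  # green
```

This precedence is why some self-archived articles may be hidden inside the bronze count, as noted above.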

CONCLUSION
Authors were served with options to publish rapidly in non-peer-reviewed form and under open access models throughout the entire period, and from January to March these options were used exclusively. Across the entire period, however, the most commonly observed output type was the journal article. The association of some preprints with journal articles suggests that the status of peer review versus non-peer review is, for some outputs, not binary. This increasing connectedness between the two can also be seen in the presence of publishers serving both types.
That the majority of outputs were published under some form of open access is positive; however, whether the bronze OA cohort is truly compliant with the long-term needs of this global challenge (World Health Organization, 2020a; Wellcome, 2020) is not clear. The inclusion of reports, preprints, datasets and software/code as output types permits a richer and more accurate description of the institution's activities and talents than considering journal articles alone. There is a need for bibliographic methods to adapt to better identify and classify these valuable non-journal output types.