A Sanger sequencing protocol for SARS‐CoV‐2 S‐gene

Abstract We describe a Sanger sequencing protocol for the SARS‐CoV‐2 S‐gene, the Spike (S)‐glycoprotein product of which, composed of receptor‐binding (S1) and membrane fusion (S2) segments, is the target of vaccines used to combat COVID‐19. The protocol can be used in laboratories with basic Sanger sequencing capabilities and allows rapid “at source” screening for SARS‐CoV‐2 variants, notably those of concern. It has been applied both for surveillance, using clinical specimens collected in either nucleic acid preservation lysis‐mix or virus transport medium, and for research involving cultured viruses, and can yield data of public health importance in a timely manner.

Gene sequencing is key for surveillance of SARS-CoV-2 and for monitoring the emergence of mutated strains of the virus, which may have altered behaviour and infectivity/transmissibility characteristics that affect spread of the disease and/or disease severity, 3,4 together with the potential to escape protective immunity induced by vaccination (thereby reducing vaccine efficacy) and/or previous infection, 5 and/or to escape methods of virus detection such as real-time RT-PCR (rtRT-PCR) assays. Infectivity/transmissibility characteristics and immune evasion are particularly relevant to the S-gene, which encodes the surface S-glycoprotein (Spike) responsible for initiating infection by binding to the host cell ACE2 receptor, 6,7 fusion of virus and cell membranes, and release of the virus genome, which then uses cellular machinery to produce progeny virus that disseminates within the host and fuels virus transmission to new hosts. These activities, together with Spike being the major inducer of host neutralising antibody responses, make it a target for therapeutic strategies and vaccine development. 8,9 In this age of next generation sequencing (NGS) methodologies, the great majority of protocols developed (e.g., ARTIC [https://artic. (n = 1 177 811), which relate to sequence quality, and "collection date complete" (n = 1 151 800), which is an important criterion for data analyses. The latter number represents a 26% reduction in the number of "quality" sequences. Further, many of these "quality" sequences contain significant runs of four or more undefined/missing nucleotides (n: see below for a focus on the S-gene) that cause major issues for alignment programmes such as MAFFT (https://mafft.cbrc.jp/alignment/software/), resulting in alignments with sizeable gaps, often encompassing coding regions of important S-glycoprotein domains (e.g., the receptor-binding domain and associated antigenic sites, and the S1/S2 cleavage site).
In developing the set of 16 primers reported here for an S-gene-specific Sanger sequencing approach, 10 823 "quality" sequences from the initial stages of the COVID-19 pandemic were downloaded from the EpiCoV™ database on 2020-04-29 and a MAFFT-generated alignment was made, from which the S-gene coding section (with some flanking sequence) was extracted. Having extracted the S-gene coding region, sequences with runs of four or more n were removed, leaving 8 429 (an additional 22% reduction), which were re-aligned. Further, many of these remaining sequences contained significant numbers of nucleotide ambiguity codes, often occurring in runs, that possibly relate to the base-calling capabilities of the NGS platform and the quality of the bioinformatics pipeline used, 10,11 together with the amount and quality of SARS-CoV-2 RNA recovered from clinical specimens.
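The filtering step described above can be sketched in Python. This is a minimal illustration only, not the pipeline used by the authors: the function names are hypothetical, sequences are assumed to be held as plain strings keyed by name, and the set of IUPAC ambiguity codes counted (everything other than a, c, g, t and n) is an assumption about what "ambiguity codes" covers.

```python
import re

# Runs of four or more undefined bases ("n"): the removal criterion stated in the text.
N_RUN = re.compile(r"n{4,}", re.IGNORECASE)

# IUPAC two- and three-base ambiguity codes (assumption: "n" is handled separately).
AMBIGUITY_CODES = set("rywskmbdhv")


def has_long_n_run(seq: str) -> bool:
    """Return True if the sequence contains a run of four or more n."""
    return N_RUN.search(seq) is not None


def filter_sequences(records: dict[str, str]) -> dict[str, str]:
    """Keep only the sequences that have no run of four or more n."""
    return {name: seq for name, seq in records.items() if not has_long_n_run(seq)}


def count_ambiguity_codes(seq: str) -> int:
    """Count IUPAC ambiguity codes (excluding n) in a sequence."""
    return sum(1 for base in seq.lower() if base in AMBIGUITY_CODES)


def percent_reduction(before: int, after: int) -> float:
    """Percentage of sequences lost when going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy records for illustration only (not real SARS-CoV-2 data).
    toy = {
        "seq_ok": "atgtttgtttttcttgtt",
        "seq_short_n_run": "atgtttnnnttcttgtt",
        "seq_long_n_run": "atgtttnnnnnttcttgtt",
        "seq_ambiguous": "atgRttgYttttcttgtt",
    }
    kept = filter_sequences(toy)
    print(sorted(kept))
    print({name: count_ambiguity_codes(seq) for name, seq in kept.items()})
    # Counts reported in the text: 10 823 sequences before filtering, 8 429 after.
    print(f"{percent_reduction(10823, 8429):.1f}% reduction")
```

Applied to the counts given in the text, `percent_reduction(10823, 8429)` is consistent with the stated reduction of about 22%.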
Sanger-based sequencing protocols for the S-gene are available, for example, a commercially available set of 24 M13-tagged primers 12 linked to use of specified equipment (https://assets.thermofisher.com/TFS-Assets/GSD/brochures/sequencing-sars-cov-2-spike-geneprotocol.pdf). However, a significant number of National Influenza Centres (NICs) within the Global Influenza Surveillance and Response System (GISRS) in low- and middle-income countries (LMICs) do not have the resources or in-country support either to upgrade their existing Sanger sequencing facilities or to implement and maintain NGS in a cost-effective manner (which is dependent on a high throughput of samples).
The Sanger sequencing approach described here has been shared with a small number of NICs, where it has been implemented successfully based on the existing methods and capabilities they have developed for influenza surveillance. Table 1 gives details of the primer set together with primer pairings used to produce three overlapping fragments (each of It is being used routinely to screen for potential Spike amino acid substitutions and/or polymorphisms that may emerge during adaptation of SARS-CoV-2 to propagation in cell-lines used in the laboratory, and to validate virus stocks generated for use in high throughput assays, for example, virus neutralisation assays used to screen for potential escape of new variants from antibody responses induced by vaccination. 5,13,14 In addition, NGS has been performed by mixing the three fragments, performing library preparation with QIAGEN QIAseq FX DNA Library Kits (#180475) and running products on Illumina MiSeq platforms, allowing a more in-depth assessment of the presence of minority variants than is available through Sanger sequencing alone.
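When adapting a design based on overlapping fragments, it can be useful to confirm that the fragments tile the target region without gaps. The sketch below shows one way to do this. The function name `check_tiling` and the fragment coordinates in the example are hypothetical placeholders, not the coordinates of the fragments in Table 1. The S-gene coordinates are assumed to be positions 21563 to 25384 of the Wuhan-Hu-1 reference genome (NC_045512.2).

```python
def check_tiling(target, fragments, min_overlap=1):
    """Check that fragments cover the target and overlap their neighbours.

    `target` and each fragment are (start, end) tuples with 1-based,
    inclusive coordinates. Returns True only if the fragments together
    span the whole target and each consecutive pair (ordered by start)
    shares at least `min_overlap` positions.
    """
    ordered = sorted(fragments)
    if not ordered:
        return False
    start, end = target
    # The first fragment must begin at or before the target start,
    # and some fragment must reach at least the target end.
    if ordered[0][0] > start or max(e for _, e in ordered) < end:
        return False
    # Each fragment must overlap the next one by at least min_overlap bases.
    for (_, end1), (start2, _) in zip(ordered, ordered[1:]):
        if end1 - start2 + 1 < min_overlap:
            return False
    return True


if __name__ == "__main__":
    # Assumed S-gene coordinates in the Wuhan-Hu-1 reference (NC_045512.2).
    s_gene = (21563, 25384)
    # Hypothetical fragment coordinates, for illustration only.
    fragments = [(21500, 22900), (22800, 24200), (24100, 25450)]
    print(check_tiling(s_gene, fragments, min_overlap=50))
```

The same check can be rerun with a larger `min_overlap` if a longer shared region between neighbouring fragments is wanted for assembling the reads.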

CONCLUSION
Sanger sequencing remains the "gold standard" for accuracy of base-calling and thereby quality of sequences generated for surveillance and research purposes. Accuracy of sequencing is essential to prevent databases being flooded with poorly curated sequences, with the consequent difficulties, and potentially erroneous outcomes, when applying bioinformatic software to mine the data. Further, gene sequencing alone cannot provide all the information required to fully understand virus evolution and make truly informed vaccine strain selections: phenotypic characterisation of viruses remains paramount when making such decisions. A Spike-specific Sanger sequencing approach, like that described here, can facilitate rapid identification of clinical specimens containing variants of concern/interest/high consequence

CONFLICT OF INTEREST
All authors declare no conflicts of interest.

ETHICAL STATEMENT
The conception and execution of this work did not require ethical approval.

PATIENT CONSENT
All clinical specimens were collected with consent to be used for diagnostic and virus characterisation purposes.

PEER REVIEW
The peer review history for this article is available at https://publons.