SARS‐CoV‐2 variants preferentially emerge at intrinsically disordered protein sites helping immune evasion

The SARS‐CoV‐2 pandemic is maintained by the emergence of successive variants, highlighting the flexibility of the protein sequences of the virus. We show that experimentally determined intrinsically disordered regions (IDRs) are abundant in the SARS‐CoV‐2 viral proteins, making up to 28% of disorder content for the S1 subunit of spike and up to 51% for the nucleoprotein, with the vast majority of mutations occurring in the 13 major variants mapped to these IDRs. Strikingly, antigenic sites are enriched in IDRs, in the receptor‐binding domain (RBD) and in the N‐terminal domain (NTD), suggesting a key role of structural flexibility in the antigenicity of the SARS‐CoV‐2 protein surface. Mutations occurring in the S1 subunit and nucleoprotein (N) IDRs are critical for immune evasion and antibody escape, suggesting potential additional implications for vaccines and monoclonal therapeutic strategies. Overall, this suggests the presence of variable regions on S1 and N protein surfaces, which confer sequence and antigenic flexibility to the virus without altering its protein functions.


Introduction
Intrinsically disordered proteins (IDPs) are a widespread class of diverse proteins characterized by the lack of a fixed 3D structure [1]. IDPs are well-known players in multiple biological processes, such as nucleic acid binding, signalling and cell cycle regulation, and play a central role in a large number of physiological and pathological processes [2]. Although IDPs are widely distributed in eukaryotes, the highest disorder content is found among viruses [3], where IDPs have evolved to support virus-related biological functions [4,5]. Disordered proteins represent an important class of antigens in a variety of human pathogens and can be targets of protective antibody responses [6].
The presence of protein intrinsic disorder was also highlighted in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) proteome [7,8,9]. In particular, both the spike glycoprotein (S) and the nucleoprotein (N) are now well known to contain functionally relevant intrinsically disordered regions (IDRs) [7,8,9]. Since the onset of the COVID-19 pandemic, several SARS-CoV-2 variants have been identified worldwide [10], affecting the epidemiology of the virus and playing an important role in pandemic surveillance and control [11,12]. Variants carrying mutations that potentially impact disease transmission and severity are referred to as variants of concern (VOC) and variants of interest (VOI), and the scientific community is increasingly dedicated to monitoring the emergence of new viral lineages worldwide. The most variable proteins are spike and nucleoprotein, which are also the major antigenic proteins [13].
In this work, we use manually curated structural data to describe the disordered regions of SARS-CoV-2, as a collaboration between leading data resources (UniProt [14], ViralZone [15] and DisProt [16,17]), focusing on the spike protein and nucleoprotein. Many different SARS-CoV-2 variants have been observed: 1737 lineages were described in PANGO (https://cov-lineages.org/index.html/cite) as of December 2021. We chose to analyse the 13 variants of concern (VOC) and variants of interest (VOI), including Omicron, as they represent the most widespread variants and those best adapted to humans (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). We analyse mutation localization for these 13 major variants of the SARS-CoV-2 virus and uncover hotspots that correlate not only with disordered regions but also with immune evasion. Finally, we highlight the role of flexible regions in the major antigenic site of the spike protein, suggesting a role of intrinsic disorder in escaping the host immune response.

SARS-CoV-2 spike and nucleoprotein are enriched in IDRs
Intrinsically disordered proteins are characterized by the presence of unstructured segments, that is, intrinsically disordered regions (IDRs), that lack a stable tertiary structure. Intrinsic disorder in proteins can be identified by several experimental techniques, including biophysical and biochemical methods, the most widely used being X-ray crystallography, nuclear magnetic resonance (NMR), circular dichroism and small-angle X-ray scattering [18,19]. Using the information available in DisProt, the major repository of manually curated literature data on IDPs and IDRs, we investigated the presence of IDRs in the SARS-CoV-2 proteins, along with their interactions and functions [16,17]. By analysing published structures and raw experimental data, we investigated IDRs in the nucleoprotein, spike, E protein, ORF1ab, ORF3a and ORF7a proteins. We focused our analysis on the proteins that play a crucial role in the virus-host interaction and are targets of vaccine and antibody development, that is, spike and nucleoprotein [20,21].
Analysis revealed that several regions are omitted in the structures of SARS-CoV-2 spike glycoprotein (protein S, DisProt: DP02772) due to their flexibility. No apparent density can be detected for region 455-490 [7]: this region of the Receptor-Binding Motif (RBM) is indeed unstructured and flexible in the unbound conformation [7,8] and undergoes folding-upon-binding in the ACE2-bound form [22,23].
The IDR between S1 and S2 (673-686) [7] is required for the proteolytic processing essential for viral entry into host cells [24]. An insertion at positions 680-687, which includes the specific furin-like cleavage motif RRxR, has been shown to be absent in other beta coronaviruses such as SARS-CoV [25].
Several sterically accessible complex-type glycans were identified inside the IDRs of SARS-CoV-2 spike glycoprotein (N74, N149 and three positions in the unstructured C terminus: N1158, N1173 and N1194), as characterized by mass spectrometry experiments [26]. Protein glycosylation is a well-established strategy adopted by viruses to evade host immunity [27], and molecular dynamics simulations highlighted that glycans extensively shield the spike protein surface from antibody recognition [28]. Nevertheless, we found no significant correlation between glycan sites and IDRs in the spike protein.
SARS-CoV-2 nucleoprotein (protein N, DisProt: DP03212) is a 419-residue multidomain protein characterized by 52% of disorder content, which includes the unstructured N- and C-termini, along with a disordered flexible linker connecting the RNA-binding domain (RBD) and the dimerization domain [29]. The disordered N terminus plays a role in liquid-liquid phase separation (LLPS) of protein N: its deletion strongly decreases phase separation in the presence of RNA, while slightly increasing turbidity and droplet formation in the absence of RNA [30]. Similarly, a deletion of the flexible linker (region 174-247) strongly reduces LLPS-associated droplet formation and turbidity [30]. globular domains, indicate that the interaction strength decreases in the phosphorylated form. Moreover, phosphorylation of full-length nucleoprotein affects its RNA-induced phase separation, resulting in a weaker interaction of protein N with RNA and an increased diffusion of the phosphorylated species inside polyU-induced droplets [31]. The C-terminal IDR, instead, is not required for nucleoprotein condensation with RNA via LLPS [31]. The N-terminal and C-terminal IDRs were also found to be involved in the binding of nucleocapsid-targeting single-domain antibodies (sdAbs), sdAb-N5 and sdAb-N6, whose interaction with the nucleoprotein requires the presence of its intrinsically disordered termini [32]. Size-exclusion chromatography studies of the nucleoprotein in RNA-bound and RNA-free states showed that truncations of its N-terminal IDR impair RNA binding and that both the N-terminal and C-terminal IDRs contribute to the RNA-binding activity of the SARS-CoV-2 nucleoprotein [33]. Finally, the C-terminal disordered region seems to play a role in droplet formation [33].

S1 and N mutation hotspots cluster in unstructured regions
Since late 2020, the SARS-CoV-2 pandemic has been driven by the emergence of variants [34]. These lineages carry fixed mutations that increase the viral fitness while enhancing the spread of the virus at population level. Our analysis reveals that nonsynonymous mutations tend to cluster in hotspots (Fig. 1,2), suggesting the presence of variable disordered regions. Such features in viral surface proteins may influence viral antigenicity and/or tropism. The external loop domain III of dengue virus envelope protein is disordered and plays a role in selective host binding ([35], DisProt: DP00876). Moreover, it is the major target of highly neutralizing and protective serotype-specific antibodies [36]. Similarly, the HIV-1 glycoprotein is characterized by multiple variable loops that are intrinsically disordered [37] and play a role in immune evasion [38] and coreceptor binding [39]. To assess the presence of variable disordered regions in SARS-CoV-2, we compared the substitutions/deletions found in the 13 major variants classified by WHO (January 2022) (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/) with the experimentally determined IDRs (Fig. 1,2,3), identifying a strong correlation between mutations and disordered regions in SARS-CoV-2 spike protein and nucleoprotein.
[Figure caption, partial] Parentheses indicate variants whose prevalence is < 80% but > 10%. The main regions are annotated: S1 with N-terminal domain (NTD) and receptor-binding domain (RBD); S2 with fusion peptides, heptad repeat 1 (HR1) and 2 (HR2) and the transmembrane domain (TM) [73]. Major antigenic sites are shown below with the NTD supersite [56], and monoclonal antibody-binding regions for sotrovimab [74], casirivimab and imdevimab [75,76].
For instance, mutations in the S1 subunit of the spike glycoprotein tend to cluster in hotspots at the N terminus and occur in its unstructured regions: 32 out of 45 mutated positions (71%) are localized inside S1 IDRs, whereas mutations in the S2 chain do not show this clustering (Table 1). Similarly, 16 out of 18 mutated positions in SARS-CoV-2 nucleoprotein (N) are localized inside its IDRs, accounting for 89% of the mutated positions in protein N (Table 1). For all the other SARS-CoV-2 proteins for which we gathered intrinsic disorder data, the observed mutations either did not correlate with known IDRs, or there were too few mutations to be significant. Here, we provide an insight into the intrinsic disorder and mutation content of SARS-CoV-2 ORF3a, E protein, ORF7a and ORF1ab (Table 2, Fig. 4,5).
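The percentages quoted above follow from counting mutated positions that fall inside IDR intervals. The sketch below shows one way such a count could be made; it is not the authors' pipeline. The function names are hypothetical, the two intervals are spike IDRs quoted in this article, and the four toy positions are illustrative only, not the list of 45 positions from Table 1.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive, 1-based residue range


def in_any_interval(position: int, intervals: Iterable[Interval]) -> bool:
    """Return True if the residue position lies inside at least one interval."""
    return any(start <= position <= end for start, end in intervals)


def percent_in_idrs(positions: Iterable[int], idrs: Iterable[Interval]) -> float:
    """Percentage of mutated positions that fall inside the given IDRs."""
    positions = list(positions)
    idrs = list(idrs)
    if not positions:
        raise ValueError("no mutated positions supplied")
    hits = sum(in_any_interval(p, idrs) for p in positions)
    return 100.0 * hits / len(positions)


# Toy input (assumption): two spike IDRs mentioned in the text and four
# illustrative positions, NOT the full data behind Table 1.
toy_idrs = [(455, 490), (673, 686)]
toy_positions = [484, 501, 677, 681]
toy_percent = percent_in_idrs(toy_positions, toy_idrs)

# Arithmetic check of the counts reported in the text.
s1_percent = 100.0 * 32 / 45  # S1: 32 of 45 mutated positions
n_percent = 100.0 * 16 / 18   # N: 16 of 18 mutated positions
print(f"S1: {s1_percent:.0f}%  N: {n_percent:.0f}%")
```

Intervals are treated as inclusive at both ends, which matches the residue ranges as they are written in this article.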
ORF3a (DisProt: DP03003): electron cryomicroscopy experiments on the protein shed light on the intrinsic disorder of its N- and C-terminal regions [40]. ORF3a point mutants with a disrupted TRAF-binding region (residues 36-40) lack the ability to activate either IL-1β or IL-8-Luc secretion, highlighting the role of ORF3a in NF-κB and NLRP3 inflammasome activation [41]. The ORF3a unstructured N terminus is also responsible for its subcellular localization; for instance, a deletion of the first 41 residues increases its expression in the plasma membrane while impairing localization to internal membranes [40]. Finally, 42% of the mutations affecting ORF3a in the variants described here are localized in its disordered N- and C-termini: T9I (specific to the Omicron variant), I20M (Mu), S26L (Delta and Kappa), S253P (Gamma), del257 and V259L (Mu).
ORF7a protein (DisProt: DP03460): X-ray crystallography of the SARS-CoV-2 ORF7a ectodomain (PDB: 7CI3, residues 14-96) shows that this protein (121 aa) is characterized by a well-defined structure and visible electron density from residues 14 to 82. Residues 83-96 are instead not visible in the electron density map, indicating the presence of structural disorder in the ORF7a protein, followed by a transmembrane domain (97-116) and an ER-retention signal (117-121) not included in the crystal structure [43]. No mutations are found inside the IDR of ORF7a identified so far.
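The ORF7a layout described above can be written down as a small lookup table, which makes it easy to check where a given residue falls. This is a minimal sketch: the boundaries are the ones quoted in the paragraph, while the table name, function name and label strings are hypothetical. Residues 1-13 are not described in the text and are reported as such rather than guessed.

```python
# Residue ranges (inclusive, 1-based) as quoted in the text for the
# 121-residue ORF7a protein; the labels are illustrative wording.
ORF7A_REGIONS = [
    (14, 82, "ordered ectodomain (visible electron density)"),
    (83, 96, "disordered (no visible electron density)"),
    (97, 116, "transmembrane domain"),
    (117, 121, "ER-retention signal"),
]
ORF7A_LENGTH = 121


def annotate_orf7a(position: int) -> str:
    """Return the region label for an ORF7a residue position."""
    if not 1 <= position <= ORF7A_LENGTH:
        raise ValueError(f"position {position} is outside 1-{ORF7A_LENGTH}")
    for start, end, label in ORF7A_REGIONS:
        if start <= position <= end:
            return label
    return "not described in the text"


for residue in (10, 50, 90, 100, 120):
    print(residue, annotate_orf7a(residue))
```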

Omicron variant
While this paper was under submission, the Omicron variant appeared [49]. This variant is unusual in that it has more than 30 mutations localized in the spike glycoprotein, so many that it escapes most therapeutic monoclonal antibodies and, to a large extent, vaccine-triggered antibodies [50,51]. The variant presents a large number of mutated positions in the S1 region (n = 31), with a substantial proportion mapping to disordered regions (53%), although this is lower than for the 12 previous variants (71%) (Table 3). This may be due to the tremendous acceleration of evolution that led to the emergence of Omicron, which is not yet completely understood [52].

Antigenic drift is closely associated with SARS-CoV-2 IDRs
The major SARS-CoV-2-specific antibody responses target the spike glycoprotein (S1 subunit) [8,53]. Two major antigenic regions are present in the S1 subunit: the receptor-binding domain (RBD) and the N-terminal domain (NTD) [54]. The RBD is the main antigenic site to which neutralizing antibodies bind, and this region includes three IDRs. Many neutralizing antibodies target the receptor-binding motif (RBM, pos. 438-506) in the RBD [8,55]. They act by preventing binding to the host receptor or reducing attachment to the host cell [54,55]. The inner part of this region is unstructured (pos. 455-490) [7,8] and it folds when interacting with the ACE2 receptor [22,23].
The NTD contains an antigenic supersite to which neutralizing antibodies bind [56]. Interestingly, this supersite corresponds to the first three IDRs where most of the variation occurs [54,57]. These three regions behave similarly to the variable loops in flavivirus envelope or HIV gp120: unstructured regions that allow the virus to escape immunity through a high potential for variation [56,58].
Antibody recognition of disordered epitopes is particularly sensitive to epitope variation [6]. A recent study analysed viral mutations that occurred in immunocompromised patients and found that most mutations are observed in either the NTD supersite or the RBM [59]. The flexibility of the IDRs allows variants to escape neutralization by many antibodies, as shown by the resistance of the Beta and Gamma variants to bamlanivimab and casirivimab treatments [50]. In particular, the E484K substitution, localized in the IDR within the RBM, triggers immune evasion against the monoclonal antibody casirivimab [60]. In addition, Q677H and deletion 246-253 in the Eta and Lambda variants confer increased resistance to neutralizing antibodies [61].
A superantigen-like motif, absent in other SARS family beta coronaviruses, has been identified in the spike of SARS-CoV-2. This motif, corresponding to the furin cleavage site at position 681-684 (PRRA) [62], was proposed to be a high-affinity site for the T-cell receptor (TCR) β-chain and may play a crucial role in the immune inflammation responsible for severe cases of COVID-19 [63]. Strikingly, this motif at position 681-684 maps to an intrinsically disordered region of the spike protein; moreover, P681 is a mutational hotspot in the SARS-CoV-2 variants Alpha, Delta, Kappa and Mu (Fig. 1,3).
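The statements above that E484K lies in the RBM IDR and that P681 lies in the furin-site motif inside an IDR can be checked mechanically from the residue ranges quoted in this article. The following is a minimal sketch under that assumption: the region table and function names are hypothetical, only spike ranges given in the text are included, and the parser handles simple labels such as E484K or P681 but not deletions.

```python
import re
from typing import List, Tuple

# Spike regions quoted in the text (inclusive residue ranges).
SPIKE_REGIONS = [
    ("RBM IDR (455-490)", 455, 490),
    ("S1/S2 IDR (673-686)", 673, 686),
    ("furin-site motif (681-684)", 681, 684),
]

# Reference residue, position, and an optional substituted residue.
_LABEL = re.compile(r"([A-Z])(\d+)([A-Z]?)")


def parse_mutation(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'E484K' or 'P681' into (ref, position, alt)."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unsupported mutation label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def regions_hit(label: str) -> List[str]:
    """Names of the listed spike regions that contain the mutated position."""
    _, position, _ = parse_mutation(label)
    return [name for name, start, end in SPIKE_REGIONS if start <= position <= end]


for label in ("E484K", "Q677H", "P681"):
    print(label, regions_hit(label))
```

A position can belong to more than one region, as with P681, because the furin-site motif is nested inside the S1/S2 IDR.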
The nucleocapsid is the second major antigen of SARS-CoV-2 [64]. Early studies on SARS-CoV showed that the immunodominant epitopes are located in regions 1-69, 153-235 and 354-422 [65], corresponding to the three disordered domains conserved in both SARS-CoV and SARS-CoV-2.
Collectively, these findings suggest that the immunodominant epitopes of the S1 subunit and of the N protein are closely associated with the disordered regions in the SARS-CoV-2 proteins.

Discussion
Intrinsically disordered regions (IDRs), protein regions characterized by a lack of stable three-dimensional structure, are present and abundant in native SARS-CoV-2 proteins. The IDRs described here were identified by screening the associated scientific literature, and the data retrieved were subsequently manually curated into DisProt and integrated with information from ViralZone. These IDRs have been shown to be associated with hotspots of mutations in spike S1 protein and nucleoprotein. Substitutions and deletions falling inside unstructured regions are likely to have a minor impact on protein folding [66,67]. Moreover, we show that these disordered regions overlap with major antigenic sites. IDRs are known to be specific targets of antibody recognition [6] and this variability might have an impact on antibodies' binding specificity. Our results suggest that SARS-CoV-2 displays IDRs on the spike S1 subunit and on the N protein, and that these regions are enriched in mutations that could provide the virus with an advantage both for genetic and antigenic drift. These findings are particularly important in light of emerging variants, such as the Delta subvariant AY.4.2, which is being monitored by the European Centre for Disease Prevention and Control (ECDC, https://www.ecdc.europa.eu/) and the World Health Organization (WHO, https://www.who.int/). The major mutation associated with the AY.4.2 variant, Y145H, is located in an IDR of the spike glycoprotein and is structurally close to the known immunodominant epitope at position 153-235 (Fig. 1,3), possibly playing a role in viral immune defence. The Omicron variant has a higher number of mutations in S1 IDRs (20) than any other variant. It combines all the high-consequence mutations identified in previous variants and has an unexpected ability to evade vaccine protection.
In addition, it has an enormous number of mutations (19) in structured regions of the protein, making it distinctly different from previous variants. This suggests that Omicron arose under different selective pressures. Indeed, early studies suggest that Omicron may have arisen in chronically infected COVID-19 patients [52] or infected animals [68].
The proposed correlation of intrinsic disorder with mutational hotspots and major antigenic sites may have potential implications for the management of the SARS-CoV-2 pandemic and associated disease. Treatment of patients with severe COVID-19 depends on monoclonal antibodies, whose efficacy in turn relies on their ability to recognize specific epitopes. Mutations in the targeted epitopes may inhibit the binding of monoclonal antibodies and reduce the therapeutic effect of this treatment [69]. Given the link between IDRs and mutation hotspots, it may be beneficial in the long term to select monoclonal antibodies that target ordered regions. Similarly, vaccine development could benefit from knowing where the key variable regions of the spike protein are located.

Identification and mapping of mutations on IDRs
The analysis of SARS-CoV-2 mutations, both missense mutations and deletions, relies on variants of concern (VOC), that is, Alpha, Beta, Gamma, Delta and Omicron, and variants of interest (VOI), that is, Epsilon, Zeta, Eta, Theta, Iota, Kappa, Lambda and Mu, retrieved from the outbreak.info resource (https://outbreak.info/). Mutations with a prevalence of at least 80% were considered for the analysis and then mapped on the previously identified IDRs in the spike glycoprotein and nucleoprotein of SARS-CoV-2.
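The selection step described above (keep mutations with a prevalence of at least 80%, then map them on IDRs) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually run against outbreak.info: the record type, function names, mutation labels, prevalence values and the single IDR interval are all invented toy inputs. The intermediate class (above 10% but below 80%) mirrors the parenthesised entries mentioned in the figure caption and is not mapped here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Mutation:
    label: str         # hypothetical label, e.g. "A10V"
    position: int      # 1-based residue position
    prevalence: float  # fraction of lineage sequences carrying it, 0-1


ANALYSED_THRESHOLD = 0.80  # from the text: prevalence of at least 80%
MINOR_THRESHOLD = 0.10     # from the figure caption: > 10% but < 80%


def prevalence_class(mutation: Mutation) -> str:
    """Classify a mutation by prevalence."""
    if mutation.prevalence >= ANALYSED_THRESHOLD:
        return "analysed"
    if mutation.prevalence > MINOR_THRESHOLD:
        return "parenthesised"
    return "ignored"


def map_to_idrs(mutations: Iterable[Mutation],
                idrs: List[Tuple[int, int]]) -> Dict[str, bool]:
    """For each analysed mutation, report whether it falls inside an IDR."""
    mapped = {}
    for m in mutations:
        if prevalence_class(m) != "analysed":
            continue
        mapped[m.label] = any(start <= m.position <= end for start, end in idrs)
    return mapped


# Invented toy data (assumption): none of these values come from the paper.
toy_idrs = [(5, 25)]
toy_mutations = [
    Mutation("A10V", 10, 0.95),
    Mutation("G20D", 20, 0.40),
    Mutation("L30F", 30, 0.85),
    Mutation("K15R", 15, 0.05),
]
mapped = map_to_idrs(toy_mutations, toy_idrs)
print(mapped)
```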
The trimeric spike protein structure (PDB: 6ZGG [70]) was built using Chimera to display mutations specifically affecting viral immune escape and antibody evasion [71]. The disordered region 677-689, missing from the spike structure, was modelled on chain A starting from the sequence using RANCH [72].
for helpful discussions. The Graphical Abstract was partially created with BioRender.com. Open Access Funding provided by Universita degli Studi di Padova within the CRUI-CARE Agreement.

Conflict of interest
The authors declare no conflicts of interest.

Author contributions
PLM and FQ conceived the study. ES, FQ and PLM performed the data curation and analysed the data. PLM and SCET supervised the project. FQ, ES, MC, GM, SCET and PLM contributed to writing, critically reviewing and editing the manuscript.

Peer review
The peer review history for this article is available at https://publons.com/publon/10.1111/febs.16379.