Raman Molecular Fingerprints of SARS‐CoV‐2 British Variant and the Concept of Raman Barcode

Abstract The multiple mutations of the severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) virus have created variants with structural differences in both their spike and nucleocapsid proteins. While the functional relevance of these mutations is under continuous scrutiny, current findings have documented their detrimental impact in terms of affinity with host receptors, antibody resistance, and diagnostic sensitivity. Raman spectra collected on two British variant sub‐types found in Japan (QK002 and QHN001) are compared with that of the original Japanese isolate (JPN/TY/WK‐521), and marked vibrational differences are found. These differences concern: i) fractions of sulfur‐containing amino acid rotamers, ii) hydrophobic interactions of the tyrosine phenol ring, iii) apparent fractions of RNA purines and pyrimidines, and iv) protein secondary structures. Building upon molecular‐scale results and their statistical validation, the authors propose to represent virus variants with a barcode specifically tailored to the Raman spectrum. Raman spectroscopy enables fast identification of virus variants, while the Raman barcode facilitates electronic recordkeeping and translates molecular characteristics into information rapidly accessible by users.

lected at different locations over areas of ~2 mm² for each type of sample and averaged. A schematic draft explaining the sample setup and the sample/probe interaction is given in the accompanying figure. As seen in the draft, agglomerations of virions were spotted into semi-ellipsoidal pools (enclaves) of paraformaldehyde, typically 50 to 100 μm in diameter. Under the microscope, the agglomerated virions appeared as black/grey "dust" with variable focal positions. Accordingly, the micrometric laser spot was focused at selected locations of strong signal to collect the Raman spectrum. A preliminary z-scan was needed to focus the laser exactly on the virions and to maximize their spectrum with respect to the broad formaldehyde background. Spectra were collected on different enclaves and averaged for each sample.
In pre-treating the collected Raman spectra, baseline subtraction was first optimized and standardized by comparing three different methods: polynomial fitting [1], the asymmetric least-squares method [2], and penalized spline smoothing based on vector transformation [3]. In background removal by polynomial fitting [1], the background was fitted to a low-order polynomial by iteratively determining the polynomial parameters that minimize a least-squares criterion. This method is time-consuming but conceptually the simplest, and it is widely used in Raman spectroscopy. Other options exist when baseline drifts in Raman spectra mainly arise from fluorescence noise, which generates a broad background that overlaps with the Raman spectrum. This was indeed the case here, owing to the overlapping broad but otherwise featureless spectrum of paraformaldehyde. However, baseline shifts differed at different locations, or even at the same location when different wavenumber intervals were selected. Baselines drifted either slightly or severely, and the spectral preprocessing was optimized to be insensitive to the noise morphology. In the present Raman experiments, algorithms built to reproduce those proposed in Refs. 2 and 3 appeared both suitable and reliable for Raman analyses.
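The iterative polynomial background removal described above can be illustrated with a minimal sketch. It assumes the generic "fit, clip, refit" scheme in which points lying above the current polynomial are replaced by the polynomial before the next fit; the function name `polynomial_baseline`, the polynomial order, and the stopping tolerance are illustrative choices, not necessarily those of Ref. 1 or of the authors.

```python
import numpy as np

def polynomial_baseline(wavenumber, intensity, order=3, max_iter=100, tol=1e-3):
    """Estimate a smooth background by iterative low-order polynomial fitting.

    Hypothetical helper: at each pass a least-squares polynomial is fitted,
    and points lying above it (Raman bands) are clipped down to the fit so
    that they stop pulling the next fit upward.
    """
    x = np.asarray(wavenumber, dtype=float)
    y = np.asarray(intensity, dtype=float)
    # Rescale the abscissa to [-1, 1] to keep the fit well conditioned.
    xs = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    work = y.copy()
    baseline = work.copy()
    for _ in range(max_iter):
        coeffs = np.polyfit(xs, work, order)      # least-squares criterion
        baseline = np.polyval(coeffs, xs)
        clipped = np.minimum(work, baseline)      # suppress bands above the fit
        change = np.linalg.norm(clipped - work) / max(np.linalg.norm(work), 1e-12)
        work = clipped
        if change < tol:                          # negligible update: stop
            break
    return baseline

# Synthetic example (made-up numbers): sloping background plus one band.
x = np.linspace(400.0, 1800.0, 701)
background = 50.0 + 0.02 * (x - 400.0)
band = 100.0 * np.exp(-((x - 1000.0) / 8.0) ** 2)
corrected = background + band - polynomial_baseline(x, background + band, order=2)
```

On noise-free data this scheme tends to settle slightly below the true background, which is one reason the asymmetric least-squares and penalized-spline alternatives are attractive for strongly drifting baselines.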
The collected Raman spectra were treated with a baseline subtraction procedure and automatically deconvoluted into series of Voigtian sub-bands using commercial software (LabSpec 4.02, Horiba/Jobin-Yvon, Kyoto, Japan). In the deconvolutive spectral analyses, we applied a machine-learning approach using an in-house automatic solver, S_av(ν), exploiting a linear polynomial expression of Voigtian functions, V(Δν, σ, γ), where ν, Δν, σ, and γ represent the Raman frequency, the shift in frequency from each sub-band's maximum (ν_0), the standard deviation of each Voigtian component, and the full width at half maximum of the Lorentzian component, respectively. An algorithm searching for the minimum value of the difference between the experimental and the fitted spectrum was then set (Eq. (m-1)), in which the index i locates each compound in a series of n compounds contributing to the overall spectrum, and the index j locates each Voigtian sub-band in a series of m sub-bands in the Raman spectrum of each of the n compounds. A computer program was set to optimize the above algorithm after choosing a series of Voigtian sub-bands from the deconvoluted spectra of pre-selected compounds included in a database of key biomolecules in aqueous solution and in the solid state, according to the chemical and structural peculiarities of virions. A pre-selection was made according to the literature, from spectra collected in aqueous solution. After picking spectral sub-bands of elementary compounds from the library, the algorithm pinpointed the closest matches to the experimental spectra according to the following criteria: (i) preserving relative intensities (β_ij), and (ii) assigning spectral positions (ν_0) and bandwidth values (σ and γ) for specific sub-bands from each elementary compound within ±3 cm⁻¹ (i.e., to allow for possible alterations of the molecular structure within the virion structure).
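As a hedged sketch of the minimization described above, one form consistent with the stated definitions is given here. The symbol S_e(ν) for the experimental spectrum and the squared-difference norm are assumptions; the original Eq. (m-1) may differ in detail, for instance by carrying an overall weight for each compound.

```latex
\min_{\beta_{ij},\,\nu_{0,ij},\,\sigma_{ij},\,\gamma_{ij}}
\sum_{\nu} \left[ S_{e}(\nu) - S_{av}(\nu) \right]^{2},
\qquad
S_{av}(\nu) = \sum_{i=1}^{n} \sum_{j=1}^{m} \beta_{ij}\,
V\!\left(\Delta\nu_{ij},\, \sigma_{ij},\, \gamma_{ij}\right),
\qquad
\Delta\nu_{ij} = \nu - \nu_{0,ij}
```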
The conditions imposed on band positions, relative intensities, and bandwidths provided the mathematical constraints required to deconvolute the experimental spectra uniquely. The computational work produced two outcomes: (i) spectra could be screened automatically and an appropriate deconvolution suggested by finding the closest match with the experimental spectrum through Eq. (m-1), while additionally identifying the molecules contributing primarily to each sub-band; and (ii) sub-bands whose signal intensity originated primarily (>90%) from a single reference molecule could be isolated. A number of reference spectra from basic molecules used in the above-described machine-learning algorithm can be found in Ref. 39 of the main text.
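The two outcomes above can be illustrated with a simplified sketch. It assumes that each reference compound contributes a template made of Voigt sub-bands with fixed relative intensities, scaled by one free weight per compound (that weight is an assumption, as are all names and band parameters below); the ±3 cm⁻¹ refinement of positions and widths is omitted. The Voigt profile is evaluated by direct numerical convolution, which is adequate only when the Lorentzian width is not much smaller than the Gaussian one.

```python
import numpy as np

def voigt(delta_nu, sigma, gamma, n_quad=801):
    """Voigt profile as a numerical Gaussian-Lorentzian convolution.

    sigma: standard deviation of the Gaussian part (cm^-1);
    gamma: full width at half maximum of the Lorentzian part (cm^-1).
    """
    delta_nu = np.atleast_1d(np.asarray(delta_nu, dtype=float))
    t = np.linspace(-6.0 * sigma, 6.0 * sigma, n_quad)
    dt = t[1] - t[0]
    gauss = np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    hwhm = 0.5 * gamma
    lorentz = (hwhm / np.pi) / ((delta_nu[:, None] - t[None, :]) ** 2 + hwhm ** 2)
    return (lorentz * gauss[None, :]).sum(axis=1) * dt

def compound_template(nu, bands):
    """Sum of Voigt sub-bands (nu0, sigma, gamma, beta) with fixed beta."""
    return sum(beta * voigt(nu - nu0, sigma, gamma)
               for nu0, sigma, gamma, beta in bands)

def fit_compound_weights(nu, spectrum, library):
    """Least-squares weight of each compound template (assumed free scale)."""
    names = list(library)
    templates = np.column_stack([compound_template(nu, library[k]) for k in names])
    weights, *_ = np.linalg.lstsq(templates, spectrum, rcond=None)
    return dict(zip(names, weights)), templates @ weights

def dominant_share(nu0, library, weights):
    """Compound with the largest share of the fitted intensity at nu0."""
    point = np.array([nu0])
    contrib = {k: weights[k] * compound_template(point, library[k])[0]
               for k in library}
    best = max(contrib, key=contrib.get)
    return best, contrib[best] / sum(contrib.values())

# Hypothetical two-compound library; band parameters are made up.
library = {
    "compound_A": [(650.0, 4.0, 6.0, 1.0), (720.0, 5.0, 8.0, 0.4)],
    "compound_B": [(800.0, 4.0, 6.0, 1.0), (725.0, 5.0, 8.0, 0.3)],
}
nu = np.arange(600.0, 900.0, 1.0)
measured = (2.0 * compound_template(nu, library["compound_A"])
            + 0.5 * compound_template(nu, library["compound_B"]))
weights, fitted = fit_compound_weights(nu, measured, library)
name, share = dominant_share(650.0, library, weights)
single_source = share > 0.9   # criterion (ii): >90% from one reference molecule
```

Keeping the relative intensities fixed inside each template is what makes the problem linear in the compound weights; allowing positions and widths to vary within the stated tolerance would require a nonlinear optimizer instead.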

Table S-II:
Comparison between purine and pyrimidine fractions computed from Raman (R) and genome (G) analyses in different isolates.

S-3. Statistical analysis of Raman spectra according to the Pearson's correlation coefficient
The so-called Pearson's correlation coefficient (PC), r, also referred to as the spectral similarity coefficient, can be calculated according to the following equation [16]:

$$ r = \frac{\sum_{i=1}^{n} (Q_i - \bar{Q})(s_i - \bar{s})}{\sqrt{\sum_{i=1}^{n} (Q_i - \bar{Q})^{2} \, \sum_{i=1}^{n} (s_i - \bar{s})^{2}}} \qquad \text{(s-1)} $$

where Q represents an average spectrum from a given variant/sub-type in the database, s is the spectrum to be assessed, Q_i and s_i are their intensities at pixel i (with means \bar{Q} and \bar{s}), and n is the total number of CCD pixels in each of the two spectra. Note that the wavenumber interval to be compared and the pixel intervals should be exactly the same. A perfect match and a complete mismatch between the database average spectrum Q and the spectrum under assessment s give r = 1 and r = 0, respectively. PC values and related standard deviations, as calculated according to the above Eq. (s-1), are listed in Table S
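Pearson's coefficient in its standard mean-centred form can be computed in a few lines. The sketch below assumes both spectra are sampled on the same pixel grid; the function name `pearson_similarity` is an illustrative choice.

```python
import numpy as np

def pearson_similarity(reference, spectrum):
    """Pearson's correlation coefficient between two equally sampled spectra."""
    q = np.asarray(reference, dtype=float)
    s = np.asarray(spectrum, dtype=float)
    if q.shape != s.shape:
        raise ValueError("both spectra must cover the same pixels")
    qc = q - q.mean()   # mean-centred database average spectrum
    sc = s - s.mean()   # mean-centred spectrum under assessment
    return float(qc @ sc / np.sqrt((qc @ qc) * (sc @ sc)))
```

Because the coefficient is invariant to intensity scaling and constant offsets, two spectra that differ only in overall signal level still give r = 1. In general r can also take negative values, down to -1 for perfectly anti-correlated inputs.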

S-4. Comparison of Raman spectra collected at different locations
Comparisons between average Raman spectra and spectra collected at individual locations on the same variant/sub-type are shown hereafter (Figs. S-2 to S-11). The figures include sub-band deconvolution analyses for fractions of sulfur-containing amino acid rotamers, the tyrosine phenol ring, apparent fractions of RNA purines and pyrimidines, and protein secondary structures. Individual spectra collected at different locations generally agreed with the corresponding average spectrum, consistent with Pearson's statistical assessments (Table S-III). However, some interesting variations in local spectra could be found. Although lacking statistical significance, such anomalous spectra could unveil additional structural features. Looking, for example, at the morphology of the tyrosine doublet in average spectra, namely, the I_854/I_826 ratio as a sensor of the hydrophobic/hydrophilic balance at the virion surface, the QK002 sub-type of the British variant showed on average a high I_854/I_826 ratio (~1.8; cf. Figs. 3 and S-9(a)). This is a label for a hydrophilic tyrosine configuration and reveals an acidic environment at the virion surface experienced by the majority of the virions belonging to this species (cf. Figs. S-9(a) and (c)). However, some anomalous locations in the QK002 sample showed the opposite trend (cf. Fig. S-9(c)), which indicates some inhomogeneity in the virion population. Interestingly, the anomalous QK002 locations with an I_854/I_826 ratio <1 also presented a different distribution of protein secondary structures, with a clearly larger relative fraction of random coil configuration (cf. the sub-band at ~1676 cm⁻¹ in Fig. S-11(c)) as compared to α-helix. A similar trend could be observed for the original Japanese isolate JPN/TY/WK-521 (cf. Figs. S-4(c) and S-6(c)). However, a peculiar anomaly in the spectrum of the original isolate consisted in the presence of a tyrosinate band at 836 to 840 cm⁻¹.
This anomaly was never found in either of the two British variant sub-types. The anionic form of tyrosine is expected at high values of interfacial pH, namely, only in spectra with tyrosine ratios >1. However, this was not always the case (cf. Figs. S-4(b) and (c)). The presence of the ~840 cm⁻¹ tyrosinate sub-band in anomalous spectra with tyrosine ratios <1 indeed confirms the higher heterogeneity of the original Japanese isolate, since it can only be explained by the concurrent presence of different virion populations in the same sample.
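The location-by-location screening of the tyrosine doublet discussed above can be sketched as follows. The sketch assumes that the fitted sub-band intensities at ~854 and ~826 cm⁻¹ are already available for each location; the function names, location labels, and intensity values are hypothetical.

```python
def tyrosine_ratio(i_854, i_826):
    """Intensity ratio I_854/I_826 of the tyrosine doublet."""
    if i_826 <= 0:
        raise ValueError("I_826 must be positive")
    return i_854 / i_826

def flag_anomalous(locations, average_ratio):
    """Locations whose ratio lies on the other side of 1 than the average."""
    flagged = []
    for name, (i_854, i_826) in locations.items():
        ratio = tyrosine_ratio(i_854, i_826)
        if (ratio > 1.0) != (average_ratio > 1.0):
            flagged.append(name)
    return flagged

# Made-up sub-band intensities for three locations on one sample.
locations = {"loc1": (1.8, 1.0), "loc2": (0.7, 1.0), "loc3": (2.0, 1.1)}
anomalous = flag_anomalous(locations, average_ratio=1.8)
```

Using the threshold of 1 mirrors the distinction drawn in the text between ratios >1 and <1; any finer classification would need criteria not stated here.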