Comparative analysis of the genome structure and organization of the Middle East respiratory syndrome coronavirus (MERS‐CoV) 2012 to 2019 revealing evidence for virus strain barcoding, zoonotic transmission, and selection pressure

Summary The Middle East respiratory syndrome coronavirus (MERS‐CoV) emerged in late 2012 in Saudi Arabia. For this study, we conducted a large‐scale comparative genome study of MERS‐CoV from both humans and dromedary camels from 2012 to 2019 to map any genetic changes that emerged over these 8 years. We downloaded 1309 submissions, including 308 full‐length genome sequences of MERS‐CoV, available in GenBank from 2012 to 2019. We used bioinformatics tools to describe the genome structure and organization of the virus and to map the most important motifs within various regions/genes throughout the genome. We also monitored variations/mutations among these sequences since the emergence of the virus. Our phylogenetic analyses suggest that the clustering of African camel isolates is driven by the S gene. We identified some prominent motifs within ORF1ab, the S gene, and ORF‐5 that may be used for barcoding the African camel lineages of MERS‐CoV. Furthermore, we mapped some sequence patterns that support the zoonotic origin of the virus from dromedary camels. Other sequence patterns indicated selection pressure, particularly within the N gene and the 5′ UTR. Further studies are required for careful monitoring of the MERS‐CoV genome to identify any potentially significant mutations in the future.

belong to the Betacoronaviruses. 3 SARS-CoV emerged in 2003, MERS-CoV was discovered in 2012, and SARS-CoV-2 was reported in late 2019. The gap between the emergence of SARS-CoV and MERS-CoV is around 9 years, and that between MERS-CoV and SARS-CoV-2 is 7 years. The continuous emergence of new coronaviruses may be attributed to features of their genomes. Several factors may contribute, including the low fidelity of their RNA-dependent RNA polymerases (RdRp), the possibility of recombination, and the high level of expression of their receptors in many mammalian and avian species. 4 Thus, there is an urgent need for regular monitoring of the genome sequences of coronaviruses from various species of bats, other animals, and birds. 5-12 The main goal of the current study was to do a com-
includes the majority of viral isolates from human and some from camel origins. 14 Clade-C includes viruses of camel origin isolated from various countries in Africa, including Egypt, Morocco, Nigeria, Burkina Faso, and Kenya. 13,15 Interestingly, results from the virus neutralization assay revealed that all three clades are closely related, which supports the notion that one vaccine may be able to protect against all three clades. 13 It is believed that MERS-CoV originated in bats and spilled over to humans via an intermediate host, the dromedary camel. 16,17 Dromedary camels remain the main known reservoir. 18 Genome analysis of MERS-CoV isolates of human and dromedary camel origin revealed a close relationship between them, suggesting the zoonotic origin of the virus. 14 Like other coronaviruses, MERS-CoV continues to show changes at the genome level. Thus, new virus clades and sub-clades of both human and dromedary camel origin have recently been reported. 19 Regular monitoring of the genetic makeup of the virus is very important for tracking any potential mutation or recombination.
The main goal of this study was to perform an in-depth bioinformatic analysis of the MERS-CoV genome sequences available in GenBank from 2012 to 2019 to understand the evolution of the virus and map any potential changes across the viral genome. The development of diagnostic assays, vaccines, or therapies should keep pace with any such changes. This monitoring may also contribute substantially to the control of the virus by identifying the clades currently circulating in a given community. An earlier study showed the circulation of three different genotypes of the virus in some patients during 2013 in KSA. 20 One year later, the same group reported the presence of four MERS-CoV clades during the early emergence of MERS-CoV. Later, during 2014, three clades were no longer contributing to the reported human cases, suggesting their extinction. 19 These shifts were mainly attributed to dynamic changes in the S gene of the virus. 19,20 Studies on MERS-CoV isolates from various countries in northern and central Africa revealed that the strains circulating in dromedary camels from these countries belong to lineage-C. This lineage is different from the other two lineages reported earlier in the Arabian Peninsula. 13 Although the three lineages have some genetic variations, their antigenic properties remain identical, as shown by the virus neutralization test. 13 Isolates from dromedary camels collected from Nigeria, Burkina Faso, and Morocco clustered together into a new sublineage-C1 due to shared genetic signatures, including deletions in ORF4b. 13 Thus, there is an urgent need for continued study not only of MERS-CoV but also of other coronaviruses in the context of the human-animal interface, and to understand the biological diversity of coronaviruses.

| Identification of the full-length genome submissions
MERS-CoV submissions were considered full-length genomes only if they met three criteria. First, the length of the sequence must be greater than or equal to 30 kilobases. Second, the submission must have the full 5′ UTR sequence. Third, the submission must have a poly-A tail, even if represented by a single adenine (A) nucleotide.
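The three criteria above can be sketched as a simple filter. This is a minimal illustration and not the procedure used in the study: the function name `is_full_length` and the `utr5_start` argument (the first bases of a reference 5′ UTR, supplied by the user) are assumptions, since the text does not say how completeness of the 5′ UTR was checked.

```python
MIN_LENGTH_NT = 30_000  # criterion 1: at least 30 kilobases


def is_full_length(seq: str, utr5_start: str) -> bool:
    """Return True if a submission passes the three full-length criteria.

    `utr5_start` is a hypothetical stand-in for the leading bases of a
    reference 5' UTR; a submission is treated as having the full 5' UTR
    only if it begins with that string.
    """
    s = seq.upper()
    long_enough = len(s) >= MIN_LENGTH_NT        # criterion 1
    has_utr5 = s.startswith(utr5_start.upper())  # criterion 2 (assumed check)
    has_poly_a = s.endswith("A")                 # criterion 3: one A suffices
    return long_enough and has_utr5 and has_poly_a
```

A submission that is long enough but ends in a base other than A, or that lacks the expected 5′ start, is rejected by this sketch.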

| Multiple sequence alignment and single nucleotide polymorphism density analysis
Multiple sequence alignment was conducted using the MAFFT tool (http://mafft.cbrc.jp/alignment/software/). Single nucleotide polymorphism (SNP) density (excluding indels) was counted from the multiple sequence alignment of 544 ORF1ab and 744 S gene sequences using an in-house script written in Python. For data visualization, Geneious (version 7.1.8) was used.
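The in-house script is not reproduced in the text, so the following is only a sketch of one way such a count could be made. It assumes that a column counts as a SNP when a non-majority base is carried by at least two aligned sequences (matching the "found in more than one strain" definition used in the Results), that gaps and ambiguous bases are ignored, and that the 50-bp window slides one column at a time. The function names are hypothetical.

```python
from collections import Counter


def snp_columns(aligned, min_strains=2):
    """Return 0-based alignment columns holding a SNP.

    A column qualifies when some base other than the majority base occurs
    in at least `min_strains` sequences. Gaps and non-ACGT symbols are
    ignored, so indels never count.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    cols = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in aligned)
        bases = {b: n for b, n in counts.items() if b in "ACGT"}
        if len(bases) < 2:
            continue
        major = max(bases, key=bases.get)
        if any(n >= min_strains for b, n in bases.items() if b != major):
            cols.append(i)
    return cols


def snp_density(aligned, window=50, min_strains=2):
    """Return the SNP count in each sliding window (step of one column)."""
    length = len(aligned[0])
    flags = [0] * length
    for i in snp_columns(aligned, min_strains):
        flags[i] = 1
    # Prefix sums make each window count a constant-time lookup.
    prefix = [0]
    for f in flags:
        prefix.append(prefix[-1] + f)
    last_start = max(1, length - window + 1)
    return [prefix[min(s + window, length)] - prefix[s] for s in range(last_start)]
```

The maximum of the returned list corresponds to the upper end of a reported density range, under the assumptions stated above.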

| Identification of the putative ORFs
To avoid losing open reading frames (ORFs) within different sequences, ORFs were collected by retrieving regions flanked by conserved sequences, as shown in Table 1. Conserved sequences were obtained from the multiple sequence alignment of 308 complete MERS-CoV genomes. Different lengths of ORF4b and ORF8b were calculated at their minimum possible size (300 nt), allowing any start codon (ATG, TTG, or CTG) to initiate the ORF.
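As an illustration of the two steps described above, the sketch below retrieves a region lying between two conserved flanking sequences and then scans it for ORFs that start with ATG, TTG, or CTG and are at least 300 nt long. The flanking sequences of Table 1 are not reproduced here, so `left_flank` and `right_flank` are placeholders, and both function names are hypothetical. Only the forward strand is scanned, and ORF length is counted including the stop codon; both are assumptions.

```python
START_CODONS = ("ATG", "TTG", "CTG")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def region_between(seq, left_flank, right_flank):
    """Return the sequence between the first `left_flank` and the next
    `right_flank`, or None if either flank is missing."""
    i = seq.find(left_flank)
    if i == -1:
        return None
    start = i + len(left_flank)
    j = seq.find(right_flank, start)
    if j == -1:
        return None
    return seq[start:j]


def find_orfs(seq, min_len=300, starts=START_CODONS):
    """Return (start, end) pairs, 0-based and half-open, for forward-strand
    ORFs of at least `min_len` nt (stop codon included)."""
    orfs = []
    for frame in range(3):
        open_at = None  # index of the first start codon since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if open_at is None and codon in starts:
                open_at = i
            elif open_at is not None and codon in STOP_CODONS:
                if i + 3 - open_at >= min_len:
                    orfs.append((open_at, i + 3))
                open_at = None
    return orfs
```

Anchoring retrieval on conserved flanks rather than on annotated start and stop positions means that a gene carrying a deletion or a premature stop is still collected, which is the motivation stated in the text.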

| Identification of the cleavage sites for non-structural proteins
Mapping the cleavage sites of the NSPs among gene-1 was carried out as previously described. 21

| Phylogenetic analysis
Phylogenetic trees were constructed using MEGA X software 22 using multiple sequence alignment of the whole genome, ORF1ab, S gene, ORF4b, ORF5, E gene, M gene, N gene and ORF8b sequences. The trees were constructed using maximum likelihood methods and the Tamura-Nei model. Bootstrap analysis (100 pseudo-replicates) was conducted to evaluate the statistical significance of the inferred trees, and only values greater than 50 were displayed.
Since most isolated sequences did not meet the parameters used in this study for the whole-genome sequences, we conducted the analysis on individual genes.
For the comparative study of phylogenetic trees, we initially considered 493 sequences longer than 30 kb, since most of the isolated sequences (especially isolates from African camels) did not meet the criteria used in this study for whole-genome sequences. Then, to make the tree simpler and more readable, we narrowed  We used 308 full-length genome sequences after applying the filtering parameters. All relevant data for these sequences are summarized in Table S1.  (Table 4).
ORF1ab consists of two overlapping ORFs (ORF-1a and ORF-1b). ORF-1b is produced by ribosomal frameshifting, in which the ribosome steps back one nucleotide and continues reading to produce ORF-1b (Figure 1). We analyzed 544 sequences of ORF1ab (Table S1).
ORF1ab is 21 236 nt in length, and the ribosomal frameshift is located at position 13 433 of JX869059. The distribution of polymorphisms (SNPs found in more than one strain in a 50-bp sliding window) showed that SNP density in ORF1ab varies from 0 to 11 (Figure S1A). From the multiple sequence alignment, we identified eight variants that seem to be restricted to MERS-CoV isolated from African camels. All these variants existed only in the studied African camels  (Table S2).  Table 3). Details about the NSPs, their locations, and their sizes are shown in Table 2.
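The ribosomal frameshift described above, in which the ribosome steps back one nucleotide and continues reading, means that one nucleotide is read twice when the ORF1ab coding sequence is assembled. The following toy sketch shows that joining step. The function name and its coordinate convention (0-based, half-open) are assumptions for illustration, and the example uses a made-up sequence rather than JX869059 coordinates.

```python
def join_minus1_frameshift(seq, start, slip_end, stop):
    """Assemble a coding sequence across a -1 ribosomal frameshift.

    seq[start:slip_end] is read in the original frame. The ribosome then
    steps back one nucleotide, so reading resumes at slip_end - 1 and the
    nucleotide at that index appears twice in the joined sequence.
    """
    if not (start < slip_end <= stop <= len(seq)):
        raise ValueError("expected start < slip_end <= stop <= len(seq)")
    return seq[start:slip_end] + seq[slip_end - 1:stop]


# Toy example: the 'A' at index 5 is read twice.
toy = "ATGAAACCCGGGTTT"
joined = join_minus1_frameshift(toy, start=0, slip_end=6, stop=14)
```

The joined sequence is one nucleotide longer than the genomic span it covers, which is why the two ORFs overlap on the genome yet translate as a single polyprotein.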

| The spike glycoprotein (S)
We analyzed 744 S gene sequences (Table S1). The full-length S gene is about 4062 nt. The organization of the MERS-CoV S gene is described in Figure 1B (Table 3). We conducted an SNP density analysis to test whether the three observed variants of the S gene in African camel samples fall outside highly variable regions.
The distribution of polymorphisms (SNPs found in more than one strain in a 50-bp sliding window) showed that SNP density in the S gene varies from 0 to 12 (Figure S2). The variant V26A is located in a region with an SNP density of five, while both the R1020Q and A1158S/L substitutions are located in regions of low SNP density (SNP density = 2).
The total number of different nucleotides between samples in the phylogenetic tree of the S gene sequences is shown in Table S2.
The phylogenetic tree of the 57 MERS-CoV S gene sequences is presented in Figure 2C. Compared to the phylogenetic tree of MERS-CoV whole-genome and ORF1ab sequences (Figure 2A (Table S1). No sample carried the two mutations (I529T and D510G) at the same time.

| ORF-3
In the case of ORF-3, we analyzed 566 sequences (Table S1). The ORF3 gene is 312 nt in length. However, in some submissions, we found truncated versions of the protein encoded by this ORF due to deletions varying from 9 nt to 41 nt (Table 3). Two variants were

| ORF-4a
In the case of ORF-4a, we analyzed 568 sequences (Table S1). The ORF4a gene is 330 nt in length, and the ORF4a and ORF4b genes overlap by 89 nt. ORF4a is highly conserved, but SNP density increases in the last 45 nt (data not shown).

| ORF-4b
We analyzed 575 sequences of the ORF-4b gene (    Figure 4 revealed that all models of the defective gene ORF4b and the wild type gene ORF4b could produce an identical potential ORF of 90 A.A. in length.

| ORF-5
We analyzed 569 MERS-CoV sequences of ORF5 (Table S1). The full-length ORF5 gene is 675 nt, which encodes a protein of 224 AA in length. We found two strains encoding a truncated ORF5 protein due to deletions causing a frameshift and stop codon (  (Figure 2 and Figure S4).

| The envelope (E) and membrane (M) proteins
We analyzed 579 sequences of the E gene and 575 sequences of the M gene (Table S1). The E and M proteins are 82 AA and 219 AA in length, respectively. Most of the E and M gene sequences were highly conserved, with no notable variation at either the nucleotide or amino acid level. The few variants we observed were not considered significant because they occurred either in only one sample or only in sequences generated by the same group.

| The nucleocapsid (N) protein
The number of retrieved and analyzed sequences of N gene was 565 (Table S1). The full-length of gene N is 1242 nt, which encodes a  Table 3).

| ORF-8b
We analyzed 598 sequences of gene ORF-8b (Table S1). The full-length wild-type ORF8 gene is 339 nt, which encodes a protein of 112 AA in length. This ORF is located within the N gene (from 28 762 to 29 100 of the JX869059 genome sequence; Figure 1).  (Table S1). On the other hand, 13 strains have a truncated ORF8 protein (34 AA shorter than the wild type) due to a mutation at position 28 996 of JX869059 that changes TCA to TGA (a stop codon). One of these sequences was isolated from a human in Qatar during 2013, while the others were of human origin from KSA in 2014 (one of them was sequenced in Indiana-KJ813439; Table S1).
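The coordinates given above are enough to check the reported truncation by arithmetic. The sketch below assumes that position 28 996 marks the first nucleotide of the affected codon, which is consistent with its offset from the ORF start (234 nt) being divisible by 3. The variable and function names are illustrative only.

```python
# Coordinates on JX869059, as stated in the text (1-based, inclusive).
ORF8B_START = 28_762
ORF8B_END = 29_100
MUTATION_POS = 28_996  # assumed to be the first base of the TCA -> TGA codon


def protein_length(orf_nt):
    """Amino acids encoded by an ORF of `orf_nt` nt, stop codon excluded."""
    return orf_nt // 3 - 1


orf_nt = ORF8B_END - ORF8B_START + 1    # 339 nt
wild_type_aa = protein_length(orf_nt)   # 112 AA
offset = MUTATION_POS - ORF8B_START     # 234 nt into the ORF
truncated_aa = offset // 3              # 78 codons translated before the stop
shortfall = wild_type_aa - truncated_aa # 34 AA

print(orf_nt, wild_type_aa, truncated_aa, shortfall)
```

Under this assumption, the new stop codon is the 79th codon of the ORF, so the truncated protein is 78 AA long, which agrees with the reported shortfall of 34 AA relative to the 112-AA wild type.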
All studied African camels (25 samples

| Classification of African camel MERS-CoV lineage is mainly driven by ORF1ab and S gene sequences
MERS-CoV isolated from African camels has been shown to be phylogenetically distinct from those circulating in the Arabian Peninsula. 13,26,27 In this study, we showed that the cluster of MERS-CoV African camels into    (Table 3). We tested whether a defective version of the ORF8 gene could produce an ORF8 protein with a length close to the wild type by starting from frames other than frame 1; all possible ORFs were tested. No defective form of the ORF8 gene produced an ORF8 protein with a length close to the wild type (Figure S6). However, the biological consequence of the variation in ORF8 protein length is still unknown.

| CONCLUSIONS
Comparative genome sequence analysis of the MERS-CoV of both dromedary camels and human origins revealed significant evidence for potential barcoding of the African clades based on the S gene sequences. It also provided evidence for the zoonotic origins of the virus from dromedary camels to humans and highlighted the role of selection pressure and compensatory mechanisms in virus genome evolution.