Compositional biases in RNA viruses: Causes, consequences and applications

Abstract If each of the four nucleotides were represented equally in the genomes of viruses and the hosts they infect, each base would occur at a frequency of 25%. However, this is not observed in nature. Similarly, the order of nucleotides is not random (e.g., in the human genome, guanine follows cytosine at a frequency of ~0.0125, or a quarter the number of times predicted by random representation). Codon usage and codon order are also nonrandom. Furthermore, nucleotide and codon biases vary between species. Such biases have various drivers, including cellular proteins that recognize specific patterns in nucleic acids and that, once triggered, induce mutations or invoke intrinsic or innate immune responses. In this review we examine the types of compositional biases identified in viral genomes and current understanding of the evolutionary mechanisms underpinning these trends. Finally, we consider the potential for large scale synonymous recoding strategies to engineer RNA virus vaccines, including those with pandemic potential, such as influenza A virus and Severe Acute Respiratory Syndrome Coronavirus 2. This article is categorized under: RNA in Disease and Development > RNA in Disease RNA Evolution and Genomics > Computational Analyses of RNA RNA Interactions with Proteins and Other Molecules > Protein‐RNA Recognition

While the focus of this review is on genome compositional biases of RNA viruses, often the leading research in a specific area has been undertaken using a DNA virus as a model system, and so where appropriate this research is also described. It is important to note that the concepts discussed have been evaluated using diverse virus systems, often with fundamentally different replication strategies. Exposure to cellular factors is expected to vary depending on where in the cell a virus replicates, the extent of protection of viral genomes from the cellular environment by nucleoproteins, the kinetics of virus replication, as well as the host species and the cell type infected. Nevertheless, all viruses produce mRNAs that are translated in the cytoplasm, so some generalities are likely to exist, as well as differences.
2 | TYPES OF GENOME COMPOSITIONAL BIAS

| Nucleotide bias
If all bases were represented equally in a genome, each would be recorded at a frequency of 25%. However, biases in individual base frequencies are seen across all genomes, including viral. This is often facilitated by codon degeneracy. Of 20 amino acids, two are encoded by a unique codon (Met, Trp); nine by two codons (Phe, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys); Ile is encoded by three codons; five amino acids are encoded by four codons (Val, Pro, Thr, Ala, Gly) and three are encoded by six codons (Leu, Arg, Ser). Representation of each of the degenerate codons can be highly skewed. For example, across the HIV-1 genome, ~37% of bases are adenine, and adenines are heavily selected for at degenerate positions (Kypr & Mrázek, 1987). This bias is at least partly induced by the cellular factor APOBEC3G (Sheehy et al., 2002), which deaminates cytidine to uridine in the negative sense ssDNA produced during virus replication as an intrinsic antiretroviral defense. Uridine mimics thymine, and so when positive sense DNA is synthesized during genome replication, this reverse complement strand incorporates adenine in place of guanine. This bias is, in other words, mutationally driven. Conversely, enrichment for adenine at specific sites is thought to reduce the impact of ribosomal frame-shift events due to introduction of out-of-frame stop codons, as modeled using bacterial genomes (Abrahams & Hurst, 2017) (i.e., driven by selection). Other types of nucleotide biases are also described, such as the 70% GC content of the rubella virus genome, largely attributed to the use of C bases at degenerate positions (Zhou et al., 2012). By contrast, extensive C to U mutations (in comparison to other base changes) are seen in the genome of SARS-CoV-2 (Rice et al., 2020; Simmonds, 2020). The mechanisms driving these latter two biases are, at present, poorly understood.
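The degeneracy classes listed above can be checked directly against the standard genetic code, and base frequencies can be tallied in the same pass. The following is a minimal sketch rather than an established tool: the function names `degeneracy_classes` and `base_frequencies` are hypothetical, and the codon table is the standard code written with one-letter amino acid symbols.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def degeneracy_classes():
    """Map 'number of synonymous codons' to the sorted amino acids in that class."""
    codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    classes = {}
    for aa, n in codons_per_aa.items():
        classes.setdefault(n, []).append(aa)
    return {n: sorted(aas) for n, aas in sorted(classes.items())}


def base_frequencies(seq):
    """Fraction of each base in a sequence (DNA input is read as RNA)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(b for b in seq if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or U bases")
    return {b: counts[b] / total for b in BASES}
```

Calling `degeneracy_classes()` groups the 20 amino acids by their number of synonymous codons, while `base_frequencies` applied to a genome sequence gives the per-base proportions that would be compared against the 25% expectation.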

| Codon usage biases
Usage of degenerate codons is nonrandom, with some codons used frequently and others rarely. Codon preferences vary by host and by viral species, and even by gene. In humans, codon usage biases are stronger in genes that are more highly expressed. The greater exposure of the transcripts from these genes to the drivers of selection may generate stronger biases (Urrutia & Hurst, 2003). Commonly expressed genes use codons which are decoded by abundant tRNAs, whereas during stress the tRNA pool changes to increase abundance of rare tRNAs, as stress response genes are more likely to use rare codons (Torrent et al., 2018). Within-gene biases are also evident; for example, evolutionarily constrained exonic splice enhancer sites demonstrate different codon usage patterns to other coding regions (Savisaar & Hurst, 2018).
In virology, how well a virus reflects the codon usage of its host can be calculated using the Codon Adaptation Index (CAI) metric. Key to genome composition variation is how long a virus has been adapting to its host; for example, a virus that has recently switched host may change its genome composition profile as it adapts to a new host (Babayan et al., 2018; Greenbaum et al., 2008). In CAI scoring, each codon is weighted relative to the most frequently used synonymous codon in the host, so that the most frequently used codons score 1 and rarer codons score below 1. The scores are then averaged (as a geometric mean) across an ORF or a proteome. CAI scores vary between 0 and 1, with higher scores representing more frequently used codons with respect to the host (Sharp & Li, 1987). Viral genomes display codon usage biases, but these do not necessarily mimic their host. This may arise as a consequence of nucleotide biases; for example, HIV-1 and rubella virus display very different codon usage profiles to each other and to the human genome as a result of the nucleotide biases they exhibit (van der Kuyl & Berkhout, 2012; Zhou et al., 2012). Genome architecture and virus ecology may also be important for driving codon usage preferences: codon usage biases may be more evident in segmented and aerosol-borne viruses than in vector-borne viruses (Jenkins & Holmes, 2003), as vector-borne viruses must also be able to replicate in their invertebrate hosts (Fros et al., 2021). Within viral genomes, codon usage preferences may also vary. For some large DNA viruses, distinct temporal phases of infection occur; usage of rare codons in late genes of large DNA viruses has been proposed as a mechanism of gene expression regulation (Shin et al., 2015; Zhou et al., 1999). In the SARS-CoV-2 genome, E ORF and ORF10 encode a high proportion of disfavoured codons, whereas in other genes, codon usage is more reflective of the human host (Digard et al., 2020; Rice et al., 2020).
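A CAI-style calculation can be sketched as follows, using the geometric-mean form associated with Sharp & Li (1987), in which each codon is weighted against the most used synonymous codon in a reference set. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the reference counts in the usage note are invented toy values, and codons absent from the reference are simply skipped, which is a simplification.

```python
import math
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(orf):
    """Split an in-frame ORF into codons, dropping any trailing partial codon."""
    orf = orf.upper().replace("T", "U")
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def relative_adaptiveness(reference_counts):
    """Weight each codon by its count relative to the most used synonymous codon."""
    best = {}
    for codon, n in reference_counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[CODON_TABLE[codon]]
            for codon, n in reference_counts.items() if n > 0}


def cai(orf, weights):
    """Geometric mean of codon weights across an ORF."""
    logs = []
    for codon in split_codons(orf):
        aa = CODON_TABLE.get(codon)
        # Met, Trp and stop codons carry no synonymous choice; unknown codons are skipped.
        if aa in (None, "*", "M", "W") or codon not in weights:
            continue
        logs.append(math.log(weights[codon]))
    if not logs:
        raise ValueError("no scorable codons in ORF")
    return math.exp(sum(logs) / len(logs))
```

With toy reference counts such as `{"GCU": 10, "GCC": 40, "AAA": 30, "AAG": 10}`, an ORF built only from the preferred codons (`"GCCAAA"`) scores 1, and replacing a preferred codon with a rarer synonym lowers the score. In this form every codon weight lies between 0 and 1, so the resulting index does too.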
While the reason(s) underlying codon preferences are somewhat speculative, successive codons encoding the same amino acid are more likely to use the same degenerate base and so the same tRNA, possibly allowing for faster recycling of tRNAs, if tRNA diffusion away from the ribosome happens slower than the rate of translation (Cannarozzi et al., 2010). A nonexclusive alternative is that use of rare codons slows translational rate, which in turn can affect how a protein folds (Kimchi-Sarfaty et al., 2007).

| Dinucleotide biases
In 1981 it was first proposed that nucleotide and codon preferences might be explained by dinucleotide biases (Nussinov, 1981). A dinucleotide is defined as two adjacent nucleotide bases joined by a phosphate bridge, on the same strand of nucleic acid (i.e., in cis). Given the four bases of RNA, adenine (A), cytosine (C), guanine (G) and uracil (U), all possible combinations give rise to 16 possible dinucleotides (Figure 1a). In the conventional notation for dinucleotides, "CpG," for example, refers to a cytosine 5′ to a guanine base and joined to it by a phosphate ("p") bridge (Figure 1b).
In a given sequence, how often a given dinucleotide would occur if nucleotide sequence was random can be calculated by simply multiplying observed base frequencies together. By then counting how many times the chosen dinucleotide occurs in a given sequence, over- or under-representation of any dinucleotide can be calculated. This is referred to as the observed:expected (O:E) ratio, represented by the formula:

$$\mathrm{O{:}E}_{XpY} = \frac{f_{XpY}}{f_X \times f_Y}$$

where $f_{XpY}$ is the observed frequency of the dinucleotide XpY, $f_X$ and $f_Y$ are the observed frequencies of the individual bases, and X and Y represent the two nucleotides of choice. A ratio above 1 indicates that the observed frequency is higher than expected, and so the dinucleotide is over-represented, whereas a ratio below 1 indicates an under-represented dinucleotide. This simplest method of calculating dinucleotide representation does not take into consideration potential sources of exogenous bias such as amino acid composition and codon bias, although software accounting for such factors has been released (Simmonds, 2012); in our experience of analyzing viral genomes, the results delivered by different models are very similar.
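The simplest form of the O:E calculation described above can be sketched in a few lines. This is an illustrative implementation only: the function name `dinucleotide_oe` is hypothetical, dinucleotides are counted in overlapping windows along a single strand, and no correction is made for amino acid composition or codon bias.

```python
def dinucleotide_oe(seq, x, y):
    """Observed:expected ratio for the dinucleotide XpY in a single-stranded sequence."""
    seq = seq.upper().replace("T", "U")
    x = x.upper().replace("T", "U")
    y = y.upper().replace("T", "U")
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    # Expected frequency: product of the observed single-base frequencies.
    expected = (seq.count(x) / n) * (seq.count(y) / n)
    if expected == 0:
        raise ValueError("one of the bases does not occur in the sequence")
    # Observed frequency: XpY occurrences over all n - 1 adjacent base pairs.
    pairs = sum(1 for i in range(n - 1) if seq[i] == x and seq[i + 1] == y)
    observed = pairs / (n - 1)
    return observed / expected
```

For example, `dinucleotide_oe(genome, "C", "G")` gives the CpG ratio and `dinucleotide_oe(genome, "U", "A")` the UpA ratio, where `genome` stands for any sequence string supplied by the reader.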

| CpG dinucleotides
Vertebrate genomic dinucleotide composition has been studied since the 1960s, when the striking observation was made that CpG motifs are under-represented in vertebrate genomes (Swartz et al., 1962;Josse et al., 1961). The human genome has a CpG O:E ratio of around 0.25 (Bird, 1980), similar to other mammalian species (Jabbari et al., 1997) (i.e., CpGs occur at a quarter of the frequency one would expect, given individual cytosine and guanine frequencies in the human genome). Little if any CpG suppression is seen in the genome of invertebrates (Josse et al., 1961;Simmonds et al., 2013) (Figure 2), although CpG suppression is seen in plant genomes (Bougraa & Perrin, 1987;Ibrahim et al., 2019). In vertebrates, genomic CpG suppression is thought to have arisen due to the epigenetic regulation of transcription occurring in part through the methylation of cytosines in the CpG conformation. Methylated cytosines are prone to undergo spontaneous deamination and so conversion to thymine (i.e., TpG), which is proposed to have resulted in a loss of CpG motifs from vertebrate genomes over evolutionary time (Cooper & Krawczak, 1989). Methylation of cytosines in invertebrate genomes is restricted or entirely absent (Bird & Tweedie, 1995), providing an explanation for the contrasting lack of CpG suppression in these organisms. The reasons for CpG suppression in plant genomes are unclear, as they do not support methylation (Bougraa & Perrin, 1987). In the 1990s it was reported that the genomes of small, but not large, viruses infecting eukaryotes also underrepresent CpG (Karlin et al., 1994). A more detailed analysis (Simmonds et al., 2013) showed that generally in viruses of mammals, single stranded RNA (ssRNA) viruses under-represent CpG, whereas dsRNA and large DNA viruses do not (Simmonds et al., 2013). The under-representation of CpG in the IAV PR8 genome described above is therefore characteristic of its class of RNA viruses. 
By comparison, CpG suppression is less apparent or entirely absent in invertebrate viruses (Simmonds et al., 2013). Viral CpG content can therefore be approximated using the genome type-based Baltimore classification of viruses (Baltimore, 1971) except in the case of dsDNA viruses, where size matters (Simmonds et al., 2013). Viruses under-representing CpG in their genomes include the groups of +ssRNA, −ssRNA, small dsDNA, ssDNA (which generally have small genome sizes), positive sense ssRNA reverse transcriptase viruses, and dsDNA reverse transcriptase viruses, while those that do not are dsRNA and large dsDNA viruses (Figure 3a). Overall, for RNA viruses, the extent of CpG bias is considered to be reflective of host (Simmonds et al., 2013). The mechanistic underpinnings giving rise to varied rates of CpG suppression are likely to vary between, and even within, different Baltimore group virus classifications due to the differing cellular environments each type of viral genome is exposed to, as well as the different ways in which viruses regulate the cellular environment.
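The mutational route to CpG loss proposed for vertebrate genomes (deamination of a methylated cytosine in a CpG, giving TpG) can be illustrated with a toy string operation. This is a deliberately simplified sketch: the function name `deaminate_methylated_cpg` and the `is_methylated` predicate are hypothetical, only one strand is modelled, and every methylated site is converted in a single step rather than at a realistic rate.

```python
def deaminate_methylated_cpg(dna, is_methylated=lambda position: True):
    """Convert the C of each methylated CpG to T (CpG -> TpG) on one DNA strand.

    `is_methylated` receives the 0-based position of the cytosine and returns
    True when that site should be treated as methylated (assumption: all sites).
    """
    bases = list(dna.upper())
    for i in range(len(bases) - 1):
        if bases[i] == "C" and bases[i + 1] == "G" and is_methylated(i):
            bases[i] = "T"
    return "".join(bases)


def count_cpg(dna):
    """Number of CpG dinucleotides on the given strand."""
    dna = dna.upper()
    return sum(1 for i in range(len(dna) - 1) if dna[i:i + 2] == "CG")
```

Applied to a sequence in which every CpG is treated as methylated, the CpG count falls to zero while the TpG count rises, which is the pattern of depletion described for vertebrate genomes over evolutionary time.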

| UpA dinucleotides
Dramatic dinucleotide suppression in the genome of vertebrates is unique to CpG. However, the TpA dinucleotide is modestly under-represented in the genomes of vertebrates, invertebrates (Simmonds et al., 2013) and plants (Bougraa & Perrin, 1987). Both RNA and DNA viruses mimic their host by displaying moderate suppression of the UpA dinucleotide (Di Giallonardo et al., 2017), but to varying extents (Figure 3b).

| Codon pair bias
FIGURE 2 GC content vs CpG ratio for various invertebrate (blue circle) and vertebrate (pink circle) species. In blue from left to right: Spodoptera exempta (African armyworm), Drosophila melanogaster (fruit fly), Bombus bombus (bumble bee), Anopheles gambiae (mosquito). In pink from left to right: Danio rerio (zebrafish), Halichoerus spp (seals), Phocoena spp (porpoise), Didelphis virginiana (opossum), Homo sapiens (human), Rattus norvegicus (brown rat), Takifugu rubripes (pufferfish), Ornithorhynchus anatinus (platypus)

During translation, a ribosome decodes two codons simultaneously, and so as well as codon usage, codon order is also important. Some codon pairs are used more frequently than others, and this is considered as a separate phenomenon of "codon pair bias." Codon pairs may occur at different frequencies to those expected given the individual codon frequencies within a proteome (Buchan & Stansfield, 2005; Irwin et al., 1995) and in many organisms, some codon pairs are heavily underused, or "disfavoured." The phenomenon was first described in 1985 in Escherichia coli (Yarus & Folley, 1985) and has since been summarized for three domains of life (bacteria, archaea, and eukaryotes) (Tats et al., 2008) (Table 1).
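The comparison of observed codon pair frequencies with those expected from individual codon frequencies can be sketched as a log ratio per pair. This is a simplified illustration, not the scoring scheme of any particular study: the function name `codon_pair_scores` is hypothetical, and the expectation here is based on codon frequencies alone, without the additional normalization for amino acid pair frequency that published codon pair scores may apply.

```python
import math
from collections import Counter


def split_codons(orf):
    """Split an in-frame ORF into codons, dropping any trailing partial codon."""
    orf = orf.upper().replace("T", "U")
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def codon_pair_scores(orfs):
    """ln(observed / expected) for every adjacent codon pair seen in a set of ORFs."""
    codon_counts = Counter()
    pair_counts = Counter()
    for orf in orfs:
        codons = split_codons(orf)
        codon_counts.update(codons)
        pair_counts.update(zip(codons, codons[1:]))  # pairs within one ORF only
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (first, second), observed in pair_counts.items():
        # Expected count if codon order were independent of codon identity.
        expected = (codon_counts[first] / total_codons) \
            * (codon_counts[second] / total_codons) * total_pairs
        scores[(first, second)] = math.log(observed / expected)
    return scores
```

Negative scores mark pairs used less often than expected (candidates for "disfavoured" pairs) and positive scores mark over-used pairs. Pairs that never occur are absent from the result, so a real analysis would need a large reference set of ORFs.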
Codon pair biases impact translation elongation rate (Gamble et al., 2016). In bacteria, over-represented codon pairs are translated more slowly than under-represented codon pairs (Irwin et al., 1995). Conversely, in eukaryotic cells, 17 specific codon pairs impede translation (Table 2), and reversing their order abrogates the effect (Gamble et al., 2016). These 17 codon pairs were all associated with wobble decoding interactions, that is, non-Watson-Crick interactions between the third base of the codon and the first base of the tRNA anticodon. None of these codon pairs are common to those listed in Table 1.
Codon pair biases have also been linked with determining efficiency of protein folding and the co-ordinated expression of functionally grouped proteins (reviewed in Novoa & Ribas de Pouplana, 2012).  The first study of codon pair bias deoptimization of a virus genome determined that in poliovirus, artificially introduced rare codon pairs (relative to host) were translated more slowly (Coleman et al., 2008); this finding has been recapitulated in other virus systems including Marek's disease herpesvirus (Eschke et al., 2018) and IAV (Groenke et al., 2020).
We have described four different types of bias observed in genomes of organisms and the viruses that infect them: nucleotide bias, codon bias, dinucleotide bias and codon pair bias. Let us reconsider the HIV-1 genome, in which the A base is highly over-represented, occurring with a frequency of ~37% (Kypr & Mrázek, 1987). If we did not know the underlying mechanism causing this bias, we may have difficulty determining which type of bias we were looking at, because all four may look similar (Figure 4). In order to deconvolute these types of bias, we need to understand the mechanisms underlying their presence in more detail.
3 | DRIVERS OF VIRAL GENOME COMPOSITIONAL BIAS

As described above, genomic composition biases may arise through a variety of evolutionary selection pressures, both positive and negative. These potential drivers of bias are summarized below and in Figure 5.

| Biases driven by factors influencing translational rate
The efficiency with which different codons and codon pairs are translated in resting cells compared with stressed cells (e.g., during virus infection) varies depending on the tRNA pool available (Buchan et al., 2006). In a study that examined translational efficiency of a library of 217 synonymously recoded GFP sequences, codon usage and GC content of genes were both found to influence translational efficiency, mRNA splicing efficiency and mRNA subcellular localization. In addition, in resting cells, high GC content of a gene increases its transcriptional rate (Kudla et al., 2006). Whether these features influence the translational efficiency of viral genes, and whether viral genes have evolved specific composition traits to regulate transcription and translation, is unknown, but the hypotheses are reasonable. Use of codons or codon pairs which require wobble decoding is known to increase the likelihood of mistranslation events (Patil et al., 2012), and mistranslation events are more frequent during cellular stress (Mohler & Ibba, 2017). Wobble decoding contributes to increased access to alternative reading frames (Drummond & Wilke, 2009; Ou et al., 2019), and so may be relevant for viruses which encode overlapping reading frames, but whether these events are physiologically important for viral replication is also unknown. Translational fidelity can nevertheless shape virus evolution (Ou et al., 2019); for example, some mitochondrially replicating mitoviruses avoid use of tryptophan codons, which mirrors avoidance of their use by the mitochondrial genome of their fungal hosts (Nibert, 2017). RNA modifications (e.g., m6A methylation) may also regulate translation (reviewed elsewhere; Roundtree et al., 2017) and the frequency of such modifications is related to biases in individual base frequencies.

TABLE 2 Codon pairs which are inefficiently translated and associated with wobble decoding (columns: Codon pair; First codon wobble; Second codon wobble)

| Biases driven by factors influencing mutation
FIGURE 4 Four types of bias are described in the genomes of organisms and the viruses they are infected with

FIGURE 5 Compositional biases in viral genomes may be driven by three types of evolutionary pressure: translational, selection and mutational. Translationally derived biases arise due to the different translational efficiencies of transcripts with varying composition in different cell conditions (e.g., resting vs. stress). Biases driven by selection arise through viral genomes avoiding encoding specific motifs that may be recognized by components of the innate immune response. Biases driven by mutation arise through editing of viral genomes or transcripts by host cell proteins

Mutations arise in viral genomes either through the actions of host cell editors (i.e., direct mutation), or by copying errors that then become fixed in the viral genome (selection). We have already considered the A-rich genome of HIV-1, and understand that this has arisen due to the mutational activities of the cellular protein APOBEC3G. Similarly, the cellular proteins of the adenosine deaminase acting on RNA (ADAR) family convert adenosine to inosine; evidence for ADAR acting on virally derived nucleic acids was first reported in the genome of vesicular stomatitis virus (O'Hara et al., 1984) but has since been identified in the genomes of a range of other viruses (Samuel, 2012). There are numerous other APOBEC and ADAR family members with potential to act on viral genomes (Christofi & Zaravinos, 2019). The observation that the SARS-CoV-2 genome is extremely uracil-rich (Rice et al., 2020; Simmonds, 2020) has been speculatively attributed to the editing activities of cellular mutators such as APOBEC (originally reported to edit DNA, but also reported to act on RNA; Sharma et al., 2016) and ADAR (Simmonds, 2020; Di Giorgio et al., 2020), but could also be attributable to an as-yet unidentified cellular protein.
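One way to look for the footprint of host editing is to tally substitution types between two aligned sequences and ask whether one direction of change dominates. The following is a minimal sketch, assuming the two sequences are already aligned and of equal length; the function name `substitution_spectrum` is hypothetical and the usage example is a toy input, not real data.

```python
from collections import Counter


def substitution_spectrum(reference, variant):
    """Count each substitution type (e.g. 'C>U') between two aligned RNA sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned and of equal length")
    reference = reference.upper().replace("T", "U")
    variant = variant.upper().replace("T", "U")
    spectrum = Counter()
    for ref, alt in zip(reference, variant):
        # Gaps and ambiguous characters are ignored in this sketch.
        if ref != alt and ref in "ACGU" and alt in "ACGU":
            spectrum[f"{ref}>{alt}"] += 1
    return spectrum
```

An excess of `C>U` changes over the reverse `U>C` would be consistent with cytidine deamination, and an excess of `A>G` with adenosine-to-inosine editing, because inosine is read as guanosine. Neither pattern on its own identifies the responsible factor.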

| Biases driven by factors influencing selection
Selection pressure might also arise due to the activities of a cellular protein that, for example, recognizes a specific viral motif or pathogen-associated molecular pattern (PAMP). In general, recognition of a viral PAMP by a host cell protein (or a "pattern recognition receptor"; PRR) triggers type I interferon signaling; these PRRs may themselves be upregulated by interferon, and in this case are known as interferon stimulated genes (ISGs) (reviewed in Kumar et al., 2011). The concept of PAMPs being recognized by PRRs during the innate immune response was first hypothesized by Charles Janeway in 1989 (Janeway, 1989). As he predicted, the first PRR identified was Xa21, a gene that protects rice from bacterial infection (described in 1995) (Song et al., 1995). Of the many current examples of PRRs, some recognize specific viral nucleic acid signatures and thus may contribute to driving genome compositional biases. The 10 Toll-like receptors (TLRs) identified in humans are heavily evolutionarily conserved across vertebrates (Oshiumi et al., 2008) and some can recognize pathogen nucleic acids. The clearest example of this relevant to compositional biases is that TLR9 recognizes unmethylated CpG motifs in DNA (Bauer et al., 2001; Krug et al., 2004; Tabeta et al., 2004), and genomic suppression of CpG in murine herpesvirus 68 to evade detection by TLR9 has been reported (Pezda et al., 2011). Examples for RNA viruses are less clear-cut, but TLR7 recognizes purine-rich viral ssRNA (Gantier et al., 2008; Zhang et al., 2016). Thus, deselection of these PAMPs over evolutionary time may be due to the selection pressures applied by these PRRs, as well as as-yet-unidentified cellular factors.

| Mechanistic understanding of how viral CpGs are selected against
The suppression of CpG dinucleotides in the genomes of viruses and their hosts illustrates a fascinating contrast between mutational versus selection pressure. As described above, over evolutionary time the deamination of methylated CpG motifs in vertebrate genomes has resulted in their removal by mutation (biases driven by mutation). Viral mimicry of genomic CpG suppression was hypothesized to be due to aberrant CpG frequency sensing by an as yet unidentified PRR, and thus CpG motifs had been deselected in viral genomes (biases driven by selection). This hypothesis was strengthened in 2017, when a breakthrough paper reported that the product of the cellular ISG, zinc-finger antiviral protein (ZAP), senses CpG motifs in viral RNA (Takata et al., 2017). ZAP has long been identified as a suppressor of some but not all viruses by inducing degradation of specific viral mRNAs through an unknown targeting mechanism (Gao et al., 2002; Guo et al., 2007; Bick et al., 2003; Zhu et al., 2011). This more recent study used the HIV-1 genome as a model system in which to synonymously enrich CpG frequencies, and while the mutant virus was replication defective in normal cells, that defect was fully abrogated in a ZAP knockout system (Takata et al., 2017). Similarly, enrichment of CpGs in the echovirus 7 genome also caused a replication defect that could be restored by ZAP knockout. An inhibitory role for ZAP against human cytomegalovirus has also been shown, which correlated with CpG-content dependent inhibition of viral Immediate Early 1 protein expression (Lin et al., 2020), further strengthening evidence that ZAP acts as an antiviral PRR through sensing high CpG frequencies in viral mRNAs. ZAP is encoded by the ZC3HAV1 gene, which generates multiple isoforms via alternative splicing. Two isoforms are expressed to levels readily detectable by western blotting: the long (ZAPL) and short (ZAPS) forms.
From the N terminus, both major isoforms incorporate four zinc fingers implicated in RNA binding (Guo et al., 2004), a TiPARP Homology (TPH) domain, also containing a zinc finger (Kerns et al., 2008), and a WWE domain predicted to mediate interactions with proteins that facilitate post-translational conjugations (Aravind, 2001). In comparison with ZAPL, ZAPS lacks the catalytically inactive poly(ADP-ribose) polymerase (PARP)-like domain, which enhances antiviral activity against an alphavirus and a retrovirus (Kerns et al., 2008). ZAPL is considered to be the constitutively expressed isoform, whereas ZAPS is an ISG which itself triggers IFN (Hayakawa et al., 2011;Ryman et al., 2005;Marcello et al., 2006) and is implicated in CpG recognition (Takata et al., 2017). Accordingly, here we only consider ZAPS (and refer to it simply as "ZAP"). The original paper reporting ZAP as a CpG sensor demonstrated the specific binding of ZAP at CpG sites using cross-linking followed by immunoprecipitation (CLIP) and sequencing (Takata et al., 2017). Crystallographic resolution of the structure of the N-terminus of ZAP bound to CpG motif-containing RNA revealed that the four zinc fingers of ZAP fold in a specific architecture to enable extensive RNA interactions which were diminished by mutation either of RNA CpG sites, or of ZAP at the zinc finger motifs (Luo et al., 2020;Meagher et al., 2019).
Following ZAP recognition of CpG-containing RNA, antiviral activity arises by inhibition of virus gene expression, either by mRNA degradation and/or inhibition of translation (Guo et al., 2007; Zhu et al., 2011). ZAP may inhibit translation by disrupting interactions between the translation initiation factors eIF4A and eIF4G (Zhu et al., 2012). ZAP also recruits transcripts to stress granules (Law et al., 2019). Degradation of viral mRNA is thought to occur through multiple routes, including via recruitment of the RNA exosome complex and/or the major cytoplasmic exoribonuclease, Xrn1 (Guo et al., 2007; Goodier et al., 2015; Todorova et al., 2014; Zhu et al., 2011). ZAP directly interacts with several exosome components, and their depletion by siRNA knockdown resulted in diminished antiviral activity by ZAP (Guo et al., 2007), confirming an essential role for the exosome in ZAP-mediated RNA degradation. During exosome-mediated RNA degradation, mRNAs must be deadenylated and then decapped to yield a monophosphorylated RNA, which can then also be digested by Xrn1 (Chang et al., 2019). Interactions between ZAP and poly-A specific ribonuclease (PARN) may direct deadenylation of the mRNA, while interactions between Xrn1 and the decapping enzymes necessary for 5′→3′ RNA degradation are indirect, via the RNA helicase DDX17 (Zhu et al., 2011). Xrn1 also digests endonucleolytically cleaved RNAs (Gatfield & Izaurralde, 2004), but it is not definitively known whether ZAP binding leads to internal mRNA cleavage events. In support of this possibility, ZAP binds to the cellular protein KHNYN, and its inhibitory activity against CpG-enriched transcripts is dependent on this protein, which unlike ZAP, does possess endonuclease activity (Ficarelli et al., 2020; Ficarelli et al., 2019). This is summarized in Figure 6.
How ZAP feeds back into the interferon pathway is uncertain. ZAP has been shown to interact with the cytoplasmic PRR RIG-I and to augment innate immune signaling in response to a variety of artificial RNA stimuli (Hayakawa et al., 2011). However, this study was performed prior to ZAP's identification as a CpG sensor, and focussed on recognition of 5′-triphosphate RNA moieties; it remains to be determined if CpG-rich RNA signals through the same mechanisms.
FIGURE 6 Possible mechanisms by which ZAP activity leads to viral transcript degradation. CpG motifs in viral RNA (red) are bound by the cytoplasmic PRR ZAP, which can lead to recruitment of 5′ decapping enzymes (Dcp1/2 complex), the 3′ deadenylation enzyme PARN and potentially the KHNYN RNA endonuclease, followed by 5′-3′ degradation mediated by Xrn1 and/or 3′-5′ degradation mediated by the RNA exosome. Interactions between ZAP and RIG-I and/or TRIM25 may also lead to innate immune signaling

Alternatively, ZAP-mediated innate immune responses may themselves be mediated through interactions with ZAP's cofactor TRIM25. ZAP is directly bound by TRIM25, itself an RNA binding protein and also an E3 ubiquitin ligase (Zou & Zhang, 2006), and this interaction is required for ZAP's antiviral activity (Li et al., 2017; Zheng et al., 2017). TRIM25 binds ZAP through TRIM25's SPRY domain (a protein interaction module characterized by a sequence repeat; D'Cruz et al., 2013) and ubiquitinates ZAP, although ubiquitination is not required for ZAP antiviral activity (Choudhury et al., 2017). TRIM25 was originally understood to be essential for activation of the RIG-I-dependent pathway for interferon activation (Gack et al., 2007), but recently it was shown that RIPLET and not TRIM25 ubiquitinates RIG-I, and that RIPLET is sufficient for the ubiquitination and activation of RIG-I (Cadena et al., 2019). It is therefore unclear how important TRIM25 (or by extension, the interaction between ZAP and TRIM25) is during virally induced activation of the interferon response.
CpG suppression may be more nuanced than the blanket genome-wide suppression described above, which has consequent implications for the mechanisms of, and viral counteractivity to, ZAP. In the genomes of Betaherpesviruses, immediate early genes suppress CpG, whereas this is not seen in the rest of the genome (Lin et al., 2020). The authors hypothesized that immediate early gene product(s) are able to abrogate ZAP activity, thus removing any selection against high CpG frequencies in viral genes that are activated at later timepoints during infection. Conversely, in the SARS-CoV-2 genome, CpG is over-represented in E (envelope) ORF and in ORF10, whereas other genes, as expected, suppress CpG (Digard et al., 2020; Rice et al., 2020). Why these ORFs are able to buck the trend seemingly imposed on the rest of the genome is unknown; possibly, high CpG frequencies invite turnover by ZAP, thereby regulating protein production. Alternatively, these ORFs may have been acquired through recombination events and had an ancestral origin not previously subject to the same translational, mutational or selection pressures.

| CpG context may be an important driver of biases imparted by selection
For ZAP to function as an innate immune sensor and/or effector for foreign RNAs containing high CpG content, there must be a mechanism to limit activation of the system by cellular RNAs that also contain CpG dinucleotides (as all do). Since ZAP recognizes CpG motifs in ssRNA, it is possible that the secondary structure of RNA (i.e., CpG context) is an important factor in determining whether CpG motifs can be recognized by ZAP, and there is some evidence indicating this. First, in the crystallography paper characterizing ZAP-RNA binding, the optimal binding motif for ZAP on RNA was found to be C(n7)G(n)CG (Luo et al., 2020). ZAP was found to bind to multiple sites on an RNA, and in considering the stoichiometry of RNA degradation complex recruitment, the authors concluded that, owing to the small size of ZAP relative to RNA degradation complexes, several bound ZAP molecules must be required for this. Therefore the number and spacing of CpG dinucleotides is likely to be important.
Context effects for CpG deselection have also been identified in an evolutionary context. Greenbaum et al. found that since the emergence of the 1918 H1N1 pandemic strain of IAV in humans, CpG motifs have gradually been lost from the viral genome as it became endemic in humans. They asked whether specific nucleotides were more likely to flank the CpGs that were deselected, by measuring the relative frequencies of (C/G)CG(C/G), (A/U)CG(A/U), (A/U)CG(C/G) and (C/G)CG(A/U) in H1N1 genomes over time (Greenbaum et al., 2009). No reduction in (C/G)CG(C/G) motifs was seen, whereas all three of the other motifs declined in frequency, with the strongest reduction seen in the (A/U)CG(A/U) motif. The authors speculated that the severe disease attributed to infection with the 1918 virus was caused by the aberrantly high CpG frequency present in the viral genome provoking a cytokine storm.
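Counting CpGs by the class of their flanking bases, as in the four motif groups above, can be sketched with a regular expression. This is an illustrative sketch only: the function name `cpg_flank_classes` is hypothetical, CpGs at the very start or end of the sequence (lacking a flanking base) are ignored, and no normalization for base composition is attempted.

```python
import re
from collections import Counter

CLASSES = ("(C/G)CG(C/G)", "(A/U)CG(A/U)", "(A/U)CG(C/G)", "(C/G)CG(A/U)")


def flank_label(base):
    """Label a flanking base as C/G or A/U."""
    return "(C/G)" if base in "CG" else "(A/U)"


def cpg_flank_classes(seq):
    """Count CpG dinucleotides grouped by the identity of their 5' and 3' neighbours."""
    seq = seq.upper().replace("T", "U")
    counts = Counter({name: 0 for name in CLASSES})
    # A lookahead lets overlapping motifs (e.g. in CGCGCG runs) all be counted.
    for match in re.finditer(r"(?=([ACGU])CG([ACGU]))", seq):
        left, right = match.group(1), match.group(2)
        counts[f"{flank_label(left)}CG{flank_label(right)}"] += 1
    return dict(counts)
```

Running this over genomes sampled at different times, and dividing each count by genome length, would give the kind of time series in which a decline of one class relative to the others could be looked for.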
A similar observation has been recapitulated in vitro. Using echovirus 7 as a model system, a replicon was recoded to maintain CpG frequency (n = 51) but add AACGAA or UUCGUU motifs (Fros et al., 2017). The UUCGUU mutant was fivefold more impaired than a CpG enriched transcript (Fros et al., 2017). Thus, there is a growing body of evidence that CpG context is important for innate sensing.
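The flanking-context comparison of Greenbaum et al. described above can be sketched as a simple tally of the nucleotides on either side of each CpG. This is a minimal illustration rather than the authors' method: the S (C/G) and W (A/U) labels and the function name are conventions adopted here, and no normalisation for base composition is attempted.

```python
from collections import Counter

STRONG = set("CG")  # S: C or G
WEAK = set("AU")    # W: A or U


def cpg_flank_contexts(rna: str) -> Counter:
    """Count CpGs by flanking class: SCGS, WCGW, WCGS or SCGW."""
    rna = rna.upper().replace("T", "U")
    counts = Counter()
    # Start at 1 and stop early so both flanking positions exist.
    for i in range(1, len(rna) - 2):
        if rna[i:i + 2] != "CG":
            continue
        left, right = rna[i - 1], rna[i + 2]
        # Skip CpGs flanked by ambiguous or unknown symbols.
        if left not in STRONG | WEAK or right not in STRONG | WEAK:
            continue
        key = ("S" if left in STRONG else "W") + "CG" + ("S" if right in STRONG else "W")
        counts[key] += 1
    return counts
```

Running this on genomes sampled in different years, and dividing each count by sequence length, would give a rough picture of which contexts are lost over time.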

3.6 | UpA dinucleotide sensing as a driver of bias
Two possible explanations have been put forward to date to explain genomic UpA suppression. First, it was originally reported in 1981 (and subsequently verified) that UpA dinucleotides are cleaved by the cellular ISG RNaseL (Wreschner et al., 1981; Karasik et al., 2021), which could explain their deselection over evolutionary time. However, the authors further reported that RNaseL also cleaves RNA at UpU dinucleotides, and TpT/UpU are generally not under-represented in animal genomes or in the viruses that infect them, so the specificity and impact of RNaseL on genomic TpA/UpA content is questionable. So far, one study using echovirus 7 as a model found that the reduced replication of an artificially UpA-enriched virus could be rescued by RNaseL removal, but it appears that the pathway is not specific to RNaseL, as ZAP depletion also complemented the defect in virus replication. While both CpG and UpA dinucleotide suppression may be driven by co-regulated factors of the interferon response, the extents of CpG and UpA suppression within a virus genome do not necessarily correlate (Figure 7).
The second nonexclusive idea to explain TpA/UpA suppression is the propensity for this dinucleotide to introduce a stop codon. Stop codons are encoded by UAG, UAA and UGA nucleotide triplets, and so deselection of UpA motifs in the first and second codon positions reduces the risk of aberrant stop codon introduction. However, 6 of 10 disfavoured codon pairs encode a UpA motif across the codon boundary (Table 1), and so deselection of UpA motifs in this context is evident and may therefore be important for translation regulation. Therefore, multiple constraints may be acting which, together, reduce UpA representation in the genomes of organisms and their infecting viruses.
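To separate the stop-codon argument from the codon-boundary argument, UpA dinucleotides can be tallied by their position within the reading frame. The sketch below is illustrative only: it assumes the input is an in-frame ORF starting at codon position 1, and the position labels and function name are chosen for this example.

```python
STOP_CODONS = ("UAA", "UAG", "UGA")

# Stop codons that carry UpA in codon positions 1-2.
STOPS_WITH_UPA = [codon for codon in STOP_CODONS if codon.startswith("UA")]


def upa_by_codon_position(orf: str) -> dict[str, int]:
    """Count UpA dinucleotides by frame position.

    '1-2' and '2-3' lie within a codon; '3-1' spans the boundary
    between one codon and the next.
    """
    orf = orf.upper().replace("T", "U")
    labels = ("1-2", "2-3", "3-1")
    counts = {label: 0 for label in labels}
    for i in range(len(orf) - 1):
        if orf[i:i + 2] == "UA":
            # i % 3 gives the codon position of the U (0-based).
            counts[labels[i % 3]] += 1
    return counts
```

Under the stop-codon explanation, depletion would be expected mainly in the "1-2" class; depletion in the "3-1" class points instead to a constraint acting across codon boundaries.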

| CONSEQUENCES OF ALTERING VIRAL GENOME COMPOSITIONAL BIASES
To study the biological relevance of under-represented nucleotides, dinucleotides, codons and codon pairs, synonymous recoding has been undertaken for a wide range of viruses (summarized in Table 3). In these studies, viral sequence has been either deoptimized, to alter sequence composition in a direction away from that of the host, or optimized, to look more like the host genome. Generally, deoptimization of any of these parameters results in virus attenuation, whereas optimization usually does not improve replication.

| Codon pair bias recoding
The first study (published in 2008) to draw significant attention to the subject of large scale genome recoding examined the effects of modifying codon pair bias in the poliovirus genome (Coleman et al., 2008). Deoptimization of codon pairs resulted in virus attenuation, and the extent of recoding correlated with the extent of attenuation. The authors found that introduction of disfavoured codon pairs decreased protein translation rates (assayed using a luciferase reporter construct) and yielded viruses that were attenuated in mice, but still offered protection from homologous virus challenge. Following on from this work, several papers (by the same research group and others) expanded this concept by applying the same recoding strategy to other viruses (Table 3). The potential for codon pair bias recoding as a vaccine development strategy has also been demonstrated in work using IAV as a model system. The PR8 strain of H1N1 IAV (the backbone of which is used to make live attenuated IAV vaccines) was recoded to increase disfavoured codon pair usage. Three viral genome segments (2 [PB1], 4 [HA], and 5 [NP]) were modified and tested separately or in combination for their effects on viral growth characteristics in cells and vaccine potential in mice. Single or combinatorial segment modifications all displayed an approximately 10-fold defect in multicycle replication assays in vitro. However, in BALB/c mice, the triple reassortant had a 3000-fold reduction in virus titre at 24 h post-infection. The triple reassortant virus was further tested for its 50% protective dose (PD50; i.e., the inoculum dose required to protect from infection upon challenge), displaying a 50% lethal dose (LD50)/PD50 ratio 1000-fold higher than that of wildtype PR8 virus. This result, and others from the same lab (Yang et al., 2013; Broadbent et al., 2016), emphasized the potential of large scale genome recoding as an approach to live-attenuated vaccine development.

FIGURE 7 Comparison of CpG and UpA suppression in the genomes of various viruses. RNA viruses: BTV, bluetongue virus; EBOV, ebola virus; FMDV, foot and mouth disease virus; HCV, hepatitis C virus; RSV, respiratory syncytial virus; SARS2, severe acute respiratory syndrome coronavirus 2. DNA viruses: adeno, adenovirus; HCMV, human cytomegalovirus; HSV-1, herpes simplex virus 1; Parvo, canine parvovirus 2

TABLE 3 Synonymous recoding strategies which have been applied to RNA viruses are summarized

| Virus | Recoding strategy | Region recoded | Findings | References |
| --- | --- | --- | --- | --- |
| Adeno-associated virus | Codon pair bias deoptimization | Rep | The negative regulatory signal imparted on adenovirus by AAV was diminished, and so adenovirus replication was enhanced | Sitaraman et al. (2011) |
| Dengue virus | Codon pair bias deoptimization to match insect bias | E/NS3/NS5 | Mutants grow well in insect cells but not well (if at all) in mammalian cells. LD50 was 10^(2-3.5)-fold higher in mice | Shen et al. (2015) |
| | Bioinformatic analyses showed that the above recoding strategy also increased CpG frequency | | This re-analysis suggested that attenuation of viral replication in mammalian cells might result from increased CpG content rather than increased codon-pair bias | Simmonds et al. |
| Influenza A virus | Codon pair bias deoptimization | HA and NA | 10^5-fold attenuation in mice and clinical attenuation in ferrets | Yang et al. (2013) and Broadbent et al. (2016) |
| | CpG and UpA dinucleotide deoptimization | NP | 10^(1-2)-fold reduction in titre in cell culture and disease attenuation in mice | Gaunt et al. (2016) |
| | CpG and codon pair bias deoptimization | NA | Codon pair bias dramatically decreased replication whereas increased CpG dinucleotides did not | Groenke et al. |
A question often raised when considering large scale recoding of viruses to either mimic or deviate from the patterns seen in host genomes is what happens in the case of vector-borne viruses that replicate in both invertebrate and vertebrate hosts. This was investigated for dengue virus, which replicates effectively in both the main insect vector Aedes aegypti, and in humans (Olson et al., 1996). Recoding of the dengue virus genome to align its codon pair usage with insect genome preferences resulted in a virus that replicated as well as wildtype in insect cells, but experienced a 1-2 log10 decrease in replication in some mammalian cells. In mice, this recoding resulted in a 2-3 log10 increase in LD50. Curiously, the recoded virus replicated normally in BHK-21 cells.
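Codon pair recoding of the kind described in this section changes which codons sit next to each other while leaving the encoded protein, and ideally the overall codon usage, untouched. The sketch below shows only that core idea, by randomly shuffling existing codons among positions that encode the same amino acid. It is an assumption-laden illustration, not the published algorithms, which additionally score candidate sequences against a host codon pair table; all names are hypothetical and the standard genetic code is assumed.

```python
import random
from collections import defaultdict

# Standard genetic code, built with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(orf: str) -> list[str]:
    """Split an in-frame ORF (DNA or RNA letters) into codons."""
    orf = orf.upper().replace("U", "T")
    if len(orf) % 3:
        raise ValueError("ORF length must be a multiple of three")
    return [orf[i:i + 3] for i in range(0, len(orf), 3)]


def translate(orf: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(orf))


def shuffle_synonymous_codons(orf: str, seed: int = 0) -> str:
    """Permute codons among positions encoding the same amino acid.

    Protein sequence and codon usage are preserved; codon pairs
    (and junction dinucleotides) generally change.
    """
    codons = split_codons(orf)
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    rng = random.Random(seed)
    shuffled = codons[:]
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)
```

Because the shuffle also alters dinucleotides at codon junctions, a sequence produced this way changes codon pair and dinucleotide composition together, which is the confound discussed later in this review.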

| Dinucleotide recoding
Large scale recoding of virus genomes using dinucleotide deoptimization was first reported in 2006, using poliovirus as a model system (Burns et al., 2006). In this paper, the authors set out to recode poliovirus by deoptimizing codon usage, but observed that in the process, they introduced 207 CpG dinucleotides across the capsid region. In doing this, virus titres were reduced 65-fold.
The same group went on to specifically study the impact of CpG and UpA enrichment on poliovirus replication. Addition of CpG or UpA were both found to diminish replication, and when both dinucleotide frequencies were simultaneously increased, the effects were found to be synergistic (Burns et al., 2009).
These first papers investigating dinucleotide deoptimization were confounded by lack of corrections for nucleotide and codon usage biases, which as noted above (Figure 4), are inter-related. More recent works have enriched CpG and UpA dinucleotides without altering nucleotide or codon frequencies. When CpGs or UpAs were introduced into the echovirus 7 genome in a way that controlled for these other variables, CpG introduction reduced virus fitness more strongly than UpA introduction. Conversely, using IAV as a model system, UpA introduction was more detrimental to virus replication than CpG addition. This IAV work also demonstrated that a sub-clinical dose of CpG-enriched virus protected from challenge with a potentially lethal dose of the wildtype PR8 strain in mice (Gaunt et al., 2016), directly demonstrating the potential of dinucleotide deoptimization as a vaccine development strategy.
CpG and UpA dinucleotide optimization and deoptimization have been characterized in the genomes of various other viruses (Table 3), but the bulk of these studies were undertaken prior to the discovery of ZAP as a CpG sensor. The defect imparted by CpG enrichment has been abrogated by ZAP knockout in echovirus 7 and HIV-1 systems (Ficarelli et al., 2020; Takata et al., 2017), although for echovirus 7 the defect was also relieved by RNaseL knockout, and CpG enrichment in HIV-1 impacted splicing events. The role of ZAP in CpG sensing of viral RNAs requires further clarification.

| Is codon pair bias an artifact of dinucleotide bias?
Of the top 10 most avoided codon pairs across bacteria, archaea and eukaryotes, three contain a CpG motif at the codon boundary and six contain a UpA motif (Table 1). Thus, the two phenomena of codon pair bias and dinucleotide bias are interlinked and at one extreme could simply be two ways of measuring the same effect. This has proven to be a contentious issue (Simmonds et al., 2015; Futcher et al., 2015). Deconvoluting the two has been achieved using the echovirus 7 system to make a panel of mutants which were either codon pair bias deoptimized or dinucleotide bias deoptimized, without altering the other parameter. Using this system, codon pair bias did not impact virus replication kinetics, whereas dinucleotide composition did (Tulloch et al., 2014). This finding was supported by a bioinformatics study from an independent laboratory which reached the same conclusion (Kunec & Osterrieder, 2016). However, when the authors of this latter study used IAV as a model system to experimentally test their predictions, they found that codon pair bias was far more important than dinucleotide bias (Groenke et al., 2020). In this report, the authors found that codon pair deoptimization resulted in diminished mRNA stability (Groenke et al., 2020); however, the codon pair deoptimization also increased UpA dinucleotide frequencies, and as UpA is reported to be cleaved by RNaseL (Wreschner et al., 1981; Karasik et al., 2021), this could also explain the outcome. A bioinformatics study that used nucleotide patterns of viruses to predict host species found the two features to be discrete, but that dinucleotide bias was far more accurate than codon pair bias in identifying viral host species (Babayan et al., 2018).
The field therefore remains divided about whether codon pair and dinucleotide bias are synonyms or are discrete phenomena. The confusion between codon pair bias and dinucleotide bias is compounded by a proportion of the described studies not including control constructs, e.g., re-ordering codons without altering dinucleotide frequencies (Fros et al., 2017; Gaunt et al., 2016; Ibrahim et al., 2019; Odon et al., 2019). Such controls are imperfect: for example, a deoptimized virus may have inadvertently acquired a mutation that alters an uncharacterized RNA functional element, whereas the recoding in the control virus did not. Such controls are nevertheless still helpful for strengthening mechanistic conclusions, even if redundant for purely empirical attempts to attenuate a virus. Furthermore, not all studies fully investigate the mechanism of attenuation; for example, few have assessed translation rates (impacted by codon pair biases) or RNA turnover (impacted by dinucleotide biases). Ultimately, better control strategies are needed to deconvolute these two phenomena properly. Whether dinucleotide bias and codon pair bias are distinct can only be resolved if we fully understand the mechanism(s) by which these biases attenuate virus propagation.
The discovery of ZAP as a CpG sensor provides the opportunity for researchers to validate CpG enrichment studies. If CpG enrichment results in a defect which can be abrogated with ZAP knockout, the impairment phenotype can sensibly be concluded to be a result of ZAP activity and therefore the introduction of CpGs (rather than an unintended side effect such as introduction of disfavoured codon pairs). Publications on this so far report mixed results (Ficarelli et al., 2020; Odon et al., 2019; Ficarelli et al., 2019) wherein ZAP is not the only sensor whose depletion results in fitness reconstitution, or off-target effects of CpG enrichment are seen. Nevertheless, ZAP knockout looks like a promising test of whether CpG enrichment is the key to why a CpG-enriched virus has a replication defect. However, no such "rescue system" exists for codon pair bias studies. Perhaps, if the limitation is in translational efficiency, tRNA supplementation would rescue the system, but we are not aware of any study attempting this.

| DISCUSSION: POTENTIAL FOR DINUCLEOTIDE MODIFICATION AS A VACCINE DEVELOPMENT STRATEGY
The detailed observations on the large-scale recoding of RNA virus genomes have led researchers to suggest repeatedly that these methods may offer a potential live attenuated vaccine development strategy, as described above for both codon pair bias and dinucleotide bias deoptimization.
A critical consideration for live attenuated vaccine development, regardless of the virus system being explored, is virus yield. For a vaccine candidate to be successful, it must be possible to produce the vaccine virus in large quantities. However, the described large-scale recoding strategies, while attenuative and in some cases protective from heterologous virus challenge, result in marked defects in virus production levels. The discovery of ZAP as a CpG sensor could provide a potential route to circumvent this issue. Vaccine candidate viruses could be grown in ZAP knockout systems, thus recapitulating wildtype virus titres, assuming other unintended effects of mutagenesis (e.g., on genome replication and/or packaging as discussed above) are avoided.
Let us consider the example of IAV as a candidate virus for which a large scale recoded virus could be developed for vaccination. IAV live attenuated vaccines are most commonly produced in embryonated hen's eggs, although there is a move to switch production to cell culture-based systems (Perdue et al., 2011). A CpG enriched virus could be, and has been (Gaunt et al., 2016), produced. One can therefore envisage synthesis of a CpG enriched IAV that replicates to wildtype levels in a CpG-sensor knockout system (whether this virus is manufactured in a cell culture system or in embryonated hen's eggs; there is emerging technology for creating gene edited chickens; Long et al., 2019; Idoko-Akoh et al., 2018). This CpG-enriched virus may offer enhanced immunogenicity (Gaunt et al., 2016) and so the amount of vaccine virus required per dose might also be reduced.
IAV offers a very attractive vaccine target for which synonymous recoding must be a serious consideration. For IAV live attenuated vaccines, the PR8 strain (which is nonpathogenic in humans) is used as the backbone, and it is straightforward to switch synonymously recoded segments into this backbone using well established reverse genetics systems (Fodor et al., 1999; Neumann et al., 1999). By contrast, if we were to recode viable segments of a recombinant virus such as SARS-CoV-2 and use that as a vaccine (Digard et al., 2020), there is the risk that this virus can recombine and revert to virulence even with large scale recoding. For poliovirus, the same caveats apply, as well as concerns that this is a neurotropic virus and a CpG-enriched virus may not be subject to the same replicative losses in the immunoprivileged replication sites (Gao et al., 2015).
It is critical that we fully understand the mechanism of attenuation imparted by synonymous recoding before we apply this technology to vaccine development. For example, we can ask whether the addition of CpGs really allowed greater visibility of the virus to the innate immune system because of those additional CpG motifs, or whether some unpredicted defect in replication has been introduced, such as disruption or creation of an alternative open reading frame, packaging signal or splice junction. A candidate vaccine virus may be a lot closer to reversion than it appears in phenotyping, as all these examples could potentially be overcome by a single reversion mutation. If this happened when a dinucleotide modified strain were used as a live attenuated vaccine, this could create a virus adept at replication in humans, but which is highly immunogenic and therefore pathogenic.

| CONCLUSION
Three drivers shape the genome composition of viruses: translation, mutation and selection. These result in four types of bias: nucleotide, codon, dinucleotide and codon pair. Systematic recoding of viral genomes to disrupt these frequencies almost universally leads to virus attenuation. Synonymous recoding offers a highly attractive vaccine development strategy with the potential to overcome the yield issues currently thwarting live attenuated vaccine production efforts. CpG dinucleotide deoptimization alone has a rescue system available in which a vaccine virus could be amplified to wildtype virus titres. No such system exists (yet) for any other deoptimization strategy. However, further work must be undertaken to fully understand the mechanisms impacted by this recoding before we can consider using this approach commercially.

ACKNOWLEDGMENTS
We are grateful to Dr Finn Gray, Dr Grzegorz Kudla, Professor Laurence Hurst and Dr Sander Granneman for helpful discussions. Figures 2, 4, 5 and 6 were created using BioRender. We are thankful to Professor Peter Simmonds for provision of SSE software used to analyze the data represented in Figures 1, 3, and 7.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.