Evolution of genomes, host shifts and the geographic spread of SARS-CoV and related coronaviruses.

Severe acute respiratory syndrome (SARS) is a novel human illness caused by a previously unrecognized coronavirus (CoV) termed SARS-CoV. There are conflicting reports on the animal reservoir of SARS-CoV. Many of the groups that argue carnivores are the original reservoir of SARS-CoV use a phylogeny to support their argument. However, the phylogenies in these studies often lack outgroup and rooting criteria necessary to determine the origins of SARS-CoV. Recently, SARS-CoV has been isolated from various species of Chiroptera from China (e.g., Rhinolophus sinicus), thus leading to reconsideration of the original reservoir of SARS-CoV. We evaluated the hypothesis that SARS-CoV isolated from Chiroptera are the original zoonotic source for SARS-CoV by sampling SARS-CoV and non-SARS-CoV from diverse hosts including Chiroptera, as well as carnivores, artiodactyls, rodents, birds and humans. Regardless of alignment parameters, optimality criteria, or isolate sampling, the resulting phylogenies clearly show that SARS-CoV was transmitted to small carnivores well after the epidemic of SARS in humans that began in late 2002. The SARS-CoV isolates from small carnivores in Shenzhen markets form a terminal clade that emerged recently from within the radiation of human SARS-CoV. There is evidence of subsequent exchange of SARS-CoV between humans and carnivores. In addition, SARS-CoV was transmitted independently from humans to farmed pigs (Sus scrofa). The SARS-CoV isolates from Chiroptera are basal to the SARS-CoV clade isolated from humans and carnivores. Although sequence data indicate that Chiroptera are a good candidate for the original reservoir of SARS-CoV, the structural biology of the spike protein of SARS-CoV isolated from Chiroptera suggests that these viruses are not able to interact with the human variant of the receptor of SARS-CoV, angiotensin-converting enzyme 2 (ACE2).
In SARS-CoV we study, both visually and statistically, labile genomic fragments and putative key mutations of the spike protein that may be associated with host shifts. We display host shifts and candidate mutations on trees projected in virtual globes depicting the spread of SARS-CoV. These results suggest that more sampling of coronaviruses from diverse hosts, especially Chiroptera, carnivores and primates, will be required to understand the genomic and biochemical evolution of coronaviruses, including SARS-CoV. © The Willi Hennig Society 2008.

Severe acute respiratory syndrome (SARS) is a recently described human infectious disease caused by a previously unrecognized coronavirus, SARS-CoV (Ksiazek et al., 2003). Between November 2002 and August 2003, there were 8422 cases and 916 deaths from SARS (WHO, 2003). These numbers are not on the scale of major epidemics such as seasonal forms of influenza infecting humans, but in an era of rapid globalization, the potential for a pandemic was significant. SARS-CoV infection has not been reported among humans since the early days of 2004. However, there remain conflicting reports on the animal reservoir of SARS-CoV. Guan et al. (2003) and Kan et al. (2005) implicated small carnivores, whereas Li et al. (2005) and Lau et al. (2005) asserted that Chiroptera are the animal reservoir of SARS-CoV. In a comprehensive review of CoV among Chiroptera, Tang et al. (2006) argued that the origin of SARS-CoV remains unknown.
Among humans, serological surveys indicate that SARS-CoV viruses were circulating at subepidemic levels in 2001 in residents of Hong Kong (data from mainland China are not available) (Zheng et al., 2004). Also, in describing the world's largest SARS epidemic in Beijing, Pang et al. (2003) point out that "It is possible that some SARS cases were not counted before mid-April 2003 when the extent of the outbreak was fully recognized." In a search for the animal reservoir of SARS-CoV outside of urban areas, Kan et al. (2005) surveyed farmed Paguma larvata (Himalayan palm civet) in 25 farms spread over 12 provinces in South-east China and found no evidence of SARS-CoV infection. SARS-CoV in carnivores was confined to animals in the Xinyuan market, in the suburbs of Guangzhou, China. Vijaykrishna et al. (2007) make the argument that Chiroptera are a reservoir for a wide variety of coronaviruses (SARS and non-SARS) that affect humans and animals. Before the SARS outbreak, coronaviruses were known primarily from animals of agricultural importance in which they cause respiratory and enteric infections (Siddell et al., 1983). The human strains CoV-229E and CoV-OC43, which are distantly related to SARS-CoV, cause mild respiratory illnesses similar to the common cold (Mahony and Richardson, 2005). Recently Dominguez et al. (2007) have shown that Chiroptera (Myotis occultus and Eptesicus fuscus) from the Rocky Mountains of Colorado, USA, carry group 1 coronaviruses. Our preliminary analyses show that these CoVs from Rocky Mountain Chiroptera are very closely related to group 1 CoV that infect humans (e.g., .
Several, but not all, of the genomes of the coronaviruses isolated from small carnivores contain a specific 29-nucleotide region (CCTACTGGTTACCAACCTGAATGGAATAT, e.g., positions 27869-27897 in AY304488) in a protein with an unknown function. It was initially reported that this 29-nucleotide region was absent from all human SARS-CoV isolates sequenced, with the notable exception of one isolate from Guangdong that contains the 29-nucleotide region (GD01, GenBank accession AY278489) (Guan et al., 2003); however, several human isolates were later discovered to contain the region. Owing to the perceived potential of the 29-nucleotide region as a clue to the animal origins and subsequent adaptation of SARS-CoV to human hosts, this 29-nucleotide region garnered media attention as early as May 2003 as a "29-nucleotide deletion" in human SARS-CoV that enabled animal to human transmission (Bradsher and Altman, 2003; Enserink, 2003).
SARS-CoV isolates from Chiroptera contain a different 29-nucleotide sequence (CCAATACATTACTATTCGGACTGGTTTAT, e.g., positions 27866-27894 in DQ648857, Bat coronavirus BtCoV/279/2005) in a protein with an unknown function. This fragment from isolates of SARS-CoV derived from Chiroptera is in an orthologous genomic position to the 29-nucleotide region described above for some SARS-CoV isolated from small carnivores and humans. When the 29-nucleotide regions from Chiroptera versus human and carnivore hosts are compared, 12 nucleotide positions are polymorphic (Lau et al., 2005). Under the current sampling of SARS-CoV, this fragment is exclusive to SARS-CoV isolated from Chiroptera.
The Chinese SARS Molecular Epidemiology Consortium (2004) published an analysis of molecular evolution of SARS-CoV within humans during the 2002-03 epidemic. This study included the release of many new genomic sequences of SARS-CoV from humans infected in the early stages of the outbreak in southern China (footnote 1).
A human SARS-CoV associated with a re-emergent case of SARS in Guangzhou, Guangdong Province, China was isolated December 22, 2003. The sequence of this SARS-CoV spike gene was released in February 2004 (SARS-CoV GD03T0013; GenBank accession AY525636). Song et al. (2005) released many full and partial genome sequences of SARS-CoV isolated from humans and palm civets collected in southern China into the public domain in 2005 (footnote 2). Kan et al. (2005)

Receptor binding studies
Li et al. (2006) provide a review of the structural biology of the SARS-CoV spike protein and the variation of the receptor for spike protein on host cells, angiotensin-converting enzyme 2 (ACE2), among human and carnivore hosts. These authors point out via pairwise alignment that the spike protein of SARS-CoV isolated from Chiroptera lacks a stretch of amino acid residues and has mismatches among other residues that form the receptor-binding motif for the human variant of ACE2.
There is also empirical evidence concerning the relative affinity of various spike proteins to ACE2 from various hosts. The SARS-CoV spike proteins tested include: an early epidemic, 2002-03, human isolate (SARS-CoV, TOR 2), a human isolate tied to sporadic infections in 2003-04 (SARS-CoV, GD03T0013), and a carnivore isolate (P. larvata, SZ3) from 2003 (Li et al., 2005). Li et al. (2005, 2006) describe an "expected" result for SZ3 and an "unexpected" result for GD03T0013: both of these spike proteins bound P. larvata ACE2 better than they bound human ACE2. Spike protein from TOR 2 bound ACE2 from P. larvata and human equally well. The unexpected nature of their results is tied to the perception that SARS-CoV was adapting from carnivores to humans, as suggested by prevailing phylogenetic studies of the time (e.g., Guan et al., 2003; Chinese SARS Molecular Epidemiology Consortium, 2004; Kan et al., 2005; Song et al., 2005).

Demarcation of sequence characters
We compared nucleotide sequences for whole and partially sequenced genomes that were in the public domain as of January 1, 2005. This data set included 83 viruses from a wide host and geographic range (Table 1). First, we compared these genomes with ClustalW under default settings (i.e., gap opening penalty 15, gap extension penalty 6.66, DNA transition weight 0.5) (Thompson et al., 1994) and developed a set
Footnote 1: GenBank accession numbers for SARS-CoV sequences released in January 2004: AY394978, AY394979, AY394980, AY394981, AY394982, AY394983, AY394984, AY394985, AY394986, AY394987, AY394989, AY394990, AY394991, AY394992, AY394993, AY394994, AY394995, AY394996, AY394997, AY394999, AY395000, AY395001, AY395002, AY395003, AY395004.
We used the same ClustalW settings to produce an updated aligned data set of whole and partially sequenced genomes that were in the public domain as of July 21, 2006. The updated data set includes 157 viruses, many of which were isolated from Chiroptera and small carnivore hosts (Table 2). We then split the genomes along 66 boundaries and removed all gaps inserted by ClustalW, thus forming an updated set of 67 sequence fragment characters for POY3.
We produced a data set of 113 whole genomes of SARS-CoV from human, Chiroptera, swine and carnivore hosts (Table 3) that were available to the public as of July 21, 2006. We used a single outgroup, human coronavirus NL63 (GenBank accession no. AY567487). The sequences in this data set were similar enough to align without splitting them into sequence fragment characters. Together these 114 complete genome sequences were aligned using default settings in ClustalW. This alignment was analyzed with standard tree search methods.
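The fragment-splitting step described above (cutting the aligned genomes at shared boundaries and then removing the gaps) can be illustrated with a minimal Python sketch. The sequences, the isolate names and the two boundary positions below are hypothetical toy values, not data from this study, and the function name is an assumption introduced only for illustration.

```python
def split_into_fragments(aligned, boundaries):
    """Cut one aligned sequence at the given column boundaries and strip gaps.

    k boundaries yield k + 1 unaligned fragments (66 boundaries give 67).
    """
    cuts = [0] + sorted(boundaries) + [len(aligned)]
    return [aligned[a:b].replace("-", "") for a, b in zip(cuts, cuts[1:])]


# Hypothetical toy alignment (two isolates, 14 columns).
alignment = {
    "isolate_A": "ATG--CCGTA-GGT",
    "isolate_B": "ATGTTCCG-AAGGT",
}
boundaries = [5, 10]  # hypothetical column boundaries shared by all isolates

fragments = {
    name: split_into_fragments(seq, boundaries)
    for name, seq in alignment.items()
}
```

Because the boundaries are defined on the alignment, fragment i of one isolate corresponds to fragment i of every other isolate, while the gap-free fragments themselves can differ in length.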

Sensitivity analysis plus tree fusion under direct optimization
Direct optimization (Wheeler, 1996) works by creating parsimonious hypothetical ancestral sequences at internal nodes of a cladogram. The key difference between direct optimization and multiple alignment is that in direct optimization evolutionary differences in sequence length are accommodated, not by the use of gap characters, but rather by allowing insertion-deletion events between ancestral and descendant sequences. In direct optimization, evolutionary base substitution and insertion-deletion events are treated with the same edit costs that are used in standard studies using static alignment followed by search for a set of optimal tree(s). However, in direct optimization, alignment is dynamic in that a novel set of putative sequence homologies is considered each time a novel topology is considered.
The best set(s) of homologies is discovered by searching for the topology(ies) that minimizes the global cost of substitution and indel events. Moreover, we varied alignment parameter sets across five sets of edit costs ranging from unitary costs for nucleotide insertion-deletions, transversions and transitions to costs with upweighted insertion-deletions and transversions (Tables 4 and 5) (Wheeler, 1995). This process of parallel direct optimization across many edit costs not only allows for analysis of whether the results are sensitive to parameter choice, but when also coupled with a genetical algorithm can shorten the computation time necessary to find satisfactory results (treated below).
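To make the role of the edit costs concrete, the following Python sketch computes the minimum edit cost between two sequences under separate indel, transversion and transition weights. It is a pairwise illustration only: direct optimization in POY3 minimizes such costs over a whole tree, which this sketch does not attempt. The function names are assumptions introduced here, and each gapped position is assumed to cost one indel.

```python
PURINES = {"A", "G"}


def substitution_cost(x, y, transversion, transition):
    """Cost of replacing base x by base y."""
    if x == y:
        return 0
    # A change within purines or within pyrimidines is a transition.
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def edit_cost(s, t, indel=1, transversion=1, transition=1):
    """Minimum weighted edit cost between s and t (dynamic programming)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, x in enumerate(s, start=1):
        curr = [i * indel]
        for j, y in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,      # base of s has no partner in t
                curr[j - 1] + indel,  # base of t has no partner in s
                prev[j - 1] + substitution_cost(x, y, transversion, transition),
            ))
        prev = curr
    return prev[-1]


# Unitary costs versus the upweighted set (indels 8, transversions 2, transitions 1).
unit = edit_cost("ACGT", "AGT")
biased = edit_cost("ACGT", "AGT", indel=8, transversion=2, transition=1)
```

Running the same pair of sequences under several cost sets, as in the last two lines, is the pairwise analogue of the sensitivity analysis described above.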

Genetical algorithms under direct optimization
Next, we used POY3 to perform tree fusion, a search heuristic first presented in a phylogenetic context by Goloboff (1999) to address the problem of composite optima. With a set of various near suboptimal trees such as produced during direct optimization analysis, often some taxa are in an optimal configuration in some of the trees but no one tree is optimal for all taxa. We applied the following POY3 commands to a concatenated file named "ALL.TREES" containing trees collected under various edit costs (POY3 commands: -parallel -fitchtrees -treefuse -fusemingroup 5 -fusemaxtrees 10 -fuselimit 100 -slop 5 -checkslop 10 -maxtrees 10 -topofile ALL.TREES -molecularmatrix $ALIGNMENTPARAMETERS).

Standard tree search for aligned data
For the 114 isolate multiple alignment we ran a new technology search in TNT (Goloboff et al., 2003b) under equally weighted parsimony and stabilized the consensus 10 times (Fig. 6). We also ran these data under maximum likelihood under the GTR + GAMMA and CAT models of nucleotide substitution for 1000 randomly generated maximum parsimony trees in RAxML (Stamatakis, 2006) on a computing cluster.

Character optimization on flat trees
We optimized the position of the animal SARS-CoV isolates in the best tree(s) produced by tree fusion in each parameter set with the program MESQUITE (Maddison and Maddison, 2004) using the option: trace character history: parsimony ancestral states. All best trees from the parameter study were used for study of the relative topological position of isolates in various hosts (Tables 4 and 5).
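As an illustration of parsimony optimization of a host character on a fixed tree, the sketch below implements the down-pass of Fitch parsimony for an unordered character on a binary tree. It is a toy example under stated assumptions: the tree, the isolate names and the host assignments are hypothetical, and the code is not a description of how MESQUITE is implemented.

```python
def fitch(tree, states):
    """Fitch down-pass on a binary tree given as nested 2-tuples of tip names.

    Returns (state set at the node, number of state changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch(left, states)
    rset, rsteps = fitch(right, states)
    common = lset & rset
    if common:
        return common, lsteps + rsteps
    # Disjoint state sets imply one additional change at this node.
    return lset | rset, lsteps + rsteps + 1


# Hypothetical toy tree: one bat isolate sister to a human clade
# that contains two nested civet isolates.
tree = ("bat1", (("human1", ("civet1", "civet2")), "human2"))
hosts = {
    "bat1": "Chiroptera",
    "human1": "human",
    "human2": "human",
    "civet1": "carnivore",
    "civet2": "carnivore",
}

root_states, steps = fitch(tree, hosts)
```

In this toy case the node joining the human and civet isolates resolves to a human host, so the civet isolates are read as a shift from humans to carnivores; the root itself remains ambiguous between Chiroptera and human.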
For flat tree presentation of the optimization of various 29-nucleotide fragments, key amino acid mutations, and host character states, we used MESQUITE with trees for the 83 (Figs 1 and 4) and 157 isolate data sets (Figs 2 and 5, and supplemental data at http:// For flat tree and geographic visualization studies (treated next) we used a binary version (using the TNT command randtree*) of the 114 isolate strict consensus tree produced by ClustalW alignment and parsimony search (Figs 3 and 6).

Projection of a tree, key mutations and metadata into a virtual globe
We used the methods described in Janies et al. (2007) to project a binary representation of the tree found for 114 isolates in TNT into a virtual globe (http://supramap.osu.edu/cov/janiesetal2008covsars.kmz). One subtle difference was that in this case we used an apomorphy list derived from PAUP* (version 4.0b10; Swofford, 2002) using the command describe trees:output list of apomorphies. We drew data on host and date of isolation from Lau et al. (2005), GenBank, or the International Committee on Taxonomy of Viruses database (http://www.ncbi.nlm.nih.gov/ICTVdb).

Spike protein mutations
Not all nucleotide records for coronaviruses in GenBank had translations to proteins. To get amino acid data of interest we translated nucleotide records into proteins in the Genetic Data Environment (http://www-bimas.cit.nih.gov/gde_sw.html) and checked these translations against reference amino acid sequences from GenBank. Amino acid sequences were aligned with ClustalW. Amino acid positions 479 and 487 of the spike protein were optimized on a tree using apomorphy commands of PAUP* for tree projections. Optimizations of these amino acid positions were also conducted in MESQUITE for flat tree visualization (supplemental data at http://supramap.osu.edu/cov).

Genotype-phenotype correlation studies
We used the options: trace and chart of MACCLADE (Maddison and Maddison, 2000) to perform the concentrated changes test (Maddison, 1990) with the presence of the region CCTACTGGTTACCAACCTGAATGGAATAT as the independent character and the infection of carnivores as the dependent character. Any ambiguities in the optimization were resolved using the DELTRAN option. The CCT was performed using a simulation sample size of 100 000 iterations.
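The logic of asking whether changes in one character are concentrated on particular branches can be illustrated with a deliberately simplified calculation. The Python sketch below is not Maddison's (1990) concentrated changes test as implemented in MACCLADE: it ignores tree topology and reconstructed ancestral states and treats every branch as equally likely to carry a change, which reduces the question to a hypergeometric tail probability. All numbers and names are hypothetical.

```python
from math import comb


def cooccurrence_tail(n_branches, n_marked, n_changes, n_observed):
    """Probability that at least n_observed of n_changes fall on marked branches.

    Simplifying assumption: the n_changes are placed on distinct branches
    chosen uniformly at random from n_branches, of which n_marked are
    distinguished (e.g., branches that carry the genomic region).
    """
    total = comb(n_branches, n_changes)
    upper = min(n_marked, n_changes)
    favourable = sum(
        comb(n_marked, k) * comb(n_branches - n_marked, n_changes - k)
        for k in range(n_observed, upper + 1)
    )
    return favourable / total


# Hypothetical example: 10 branches, 3 marked, 2 host shifts, both on marked branches.
p = cooccurrence_tail(n_branches=10, n_marked=3, n_changes=2, n_observed=2)
```

A small value of p indicates that the observed co-occurrence would be unusual if changes were scattered at random, which is the same intuition that underlies the tree-aware test.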

Sensitivity analysis of outgroup choice
Rooting an evolutionary tree is a critical step to polarize the temporal sequence of genomic and phenotypic changes and clarify the relationships of the organisms. Unlike Snijder et al. (2003), who used an equine torovirus outgroup (as the taxonomy suggests might be suitable; http://www.ncbi.nlm.nih.gov/ICTVdb/Ictv/index.htm), we could not verify the suitability of an outgroup from outside the coronaviruses. Our investigation using BLAST (Altschul et al., 1997) [default values as implemented in GenBank, http://www.ncbi.nlm.nih.gov (i.e., expect = 10)] indicated that no arterivirus or torovirus genome in GenBank bears significant nucleotide similarity with any coronavirus. As outgroups, we used genomes and partial genomes from non-SARS coronaviruses (Tables 1, 2 and 3). We chose many candidate outgroup taxa to maximize host and antigenic diversity. Clades formed by antigenic group 1, group 2, and group 3 coronaviruses have significant branch lengths between each other and the SARS-CoV clade.
Finding the ingroup root when the available outgroups are markedly divergent can be challenging. The divergence can be a result of rapid mutation rates, recombination events, inadequate sampling, multiple evolutionary origins, or a combination of these phenomena. Thus we performed several experimental searches in which a random outgroup selected from non-SARS taxa was used. The results of these searches were assessed to see whether our phylogenetic and host evolution results were affected by outgroup choice. To perform these randomization experiments, we output an implied alignment (Wheeler, 2003) resulting from each parameter set and best tree (POY3 commands: -phastwincladfile $IMPLIEDALIGNMENT.phast -topodiagnoseonly -topofile $ALIGNMENTPARAMETERS.TREE). Next, for each implied alignment we used 1000 replicate new technology tree searches (TNT command: XMULT 2) (Goloboff et al., 2003b). In each search replicate, we randomly deleted a subset of the outgroup taxa and assessed: (1) whether the most basal taxon in the SARS ingroup was stable, and (2) whether the most basal taxon of the SARS ingroup was ever an isolate from an animal host (scripts available from the authors).
Fig. 4. Phylogenetic tree produced by direct optimization of 83 coronavirus isolates based on whole and partial genomes (sampling in Table 1). The evolution of hosts is optimized on the genome-based tree as shown by the colors traced on the branches. Note that the SARS-CoV isolates from carnivores (purple trace: civet cat Paguma larvata, raccoon dog Nyctereutes procyonoides, and ferret badger Melogale moschata) and artiodactyls (light blue trace: pig, Sus scrofa) are nested within a large clade of SARS-CoV isolates from humans (yellow trace: Homo sapiens), which are basal among SARS-CoV. The search method for the genomic data was direct optimization. Parsimony optimization was used for the host data. The edit costs were indels 1, transversions 1, transitions 1.
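The random deletion of outgroup taxa across search replicates, described in the methods text above, can be sketched in Python as follows. This is a hypothetical illustration and not the authors' scripts: the outgroup labels are placeholders, the rule of always retaining at least one outgroup is an assumption, and the tree search itself (run in TNT in the study) is outside the scope of the sketch.

```python
import random


def random_outgroup_subsets(outgroups, replicates, seed=0):
    """For each replicate, draw a random non-empty subset of outgroups to retain."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(replicates):
        keep = rng.randint(1, len(outgroups))  # assumption: keep at least one
        subsets.append(sorted(rng.sample(outgroups, keep)))
    return subsets


def root_is_stable(basal_taxa):
    """True if every replicate recovered the same most-basal ingroup taxon."""
    return len(set(basal_taxa)) == 1


# Hypothetical placeholder labels for non-SARS outgroup genomes.
outgroups = ["group1_A", "group1_B", "group2_A", "group3_A"]
subsets = random_outgroup_subsets(outgroups, replicates=1000)
```

Each retained subset would be combined with the full ingroup before a search replicate, and the most basal ingroup taxon recorded per replicate could then be passed to root_is_stable.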

Resampling
We performed jackknife GC resampling in TNT (Goloboff et al., 2003a,b) on the ClustalW alignment of the 114 isolate data set and the implied alignment from unitary costs for the 83 and 157 isolate data sets as specified by the following commands: resample jak rep1000 [xm = lev5 rep5] from 0.
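For readers unfamiliar with jackknife resampling, the Python sketch below shows the character-deletion step on a toy alignment: each column is independently removed with a fixed probability to produce one pseudoreplicate data set. The deletion probability of 0.36 used here is an assumption chosen for illustration, the alignment is hypothetical, and the sketch does not reproduce the tree searches or the GC summary computed by TNT.

```python
import random


def jackknife_columns(alignment, p_delete=0.36, seed=0):
    """Return one pseudoreplicate with each column dropped with probability p_delete."""
    rng = random.Random(seed)
    n_cols = len(next(iter(alignment.values())))
    kept = [c for c in range(n_cols) if rng.random() >= p_delete]
    return {
        name: "".join(seq[c] for c in kept)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment; every sequence has the same number of columns.
alignment = {
    "isolate_A": "ATGCCGTAGGTA",
    "isolate_B": "ATGTCGAAGGTA",
    "isolate_C": "ATGTCGTAGCTA",
}

replicate = jackknife_columns(alignment, seed=1)
```

Repeating this with different seeds, searching each pseudoreplicate for trees, and tallying how often each group is recovered gives the kind of support values reported as resampling results.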

Direct optimization searches
Best tree lengths for the direct optimization searches under various parameters are reported for the 83 isolate data set in Table 4 and for the 157 isolate data set in Table 5. The resampling values are reported as supplemental data at http://supramap.osu.edu/cov/.

Multiple alignment to standard tree search
For the 114 isolate data set, a best score of 22 363 steps under equally weighted parsimony was hit 107 times and 87 trees were retained. A strict consensus of 59 nodes was stabilized 10 times (Fig. 6). The best RAxML tree for this alignment was found under GTRGAMMA at -ln likelihood of 111006.264984. RAxML trees with host character optimization and resampling values are available in supplemental data at http://supramap.osu.edu/cov/.

Evolution of host shifts among coronaviruses
In the 83 isolate data set in all parameter sets considered, we found the SARS-CoV isolates from P. larvata, N. procyonoides (Carnivora) and Sus scrofa (Artiodactyla) to occur in terminal positions of the trees, nested well within a large clade of SARS-CoV isolated from humans (Fig. 4, Table 4). Thus, based on genomic evidence, SARS-CoV occurred in P. larvata, N. procyonoides and S. scrofa after SARS-CoV occurred in humans (Fig. 4). The shift of SARS-CoV from human hosts to the S. scrofa host is independent of the shift from human hosts to small carnivore hosts (P. larvata and N. procyonoides).
Fig. 5. Phylogenetic tree produced by direct optimization of whole and partial coronavirus genomes of 157 isolates (sampling in Table 2). Note that the SARS-CoV isolates from Chiroptera (black trace: Rhinolophus sinicus, Rhinolophus ferrumequinum, Rhinolophus macrotis and Rhinolophus pearsoni) are basal among the entire SARS-CoV clade. SARS-CoV isolates from small carnivores (purple trace) and artiodactyls (light blue trace) are nested within a clade of SARS-CoV isolates from humans (yellow trace), although there were several exchanges between humans and carnivores. The search method for the genomic data was direct optimization. Parsimony optimization was used for the host data. The edit costs were indels 1, transversions 1, transitions 1.
In the 83 isolate tree recovered under unitary costs, the polarity of host shift is ambiguous between the SARS-CoV isolate from N. procyonoides (HC/SZ/61/03) and the SARS-CoV isolate GD03T0013 from humans. GD03T0013 is closely related to SARS-CoV isolated from civets served in a restaurant in Guangzhou, China in late 2003 and early 2004. No epidemiological data link the GD03T0013 human case to exposure to laboratory isolates of SARS-CoV (Wang et al., 2005). In the 157 isolate data set, under all parameters we found the SARS-CoV isolates from P. larvata, N. procyonoides and S. scrofa were terminal, nested well within a large clade of SARS-CoV isolated from humans (Fig. 5, Table 5). In the analysis of these data under most parameter sets the SARS-CoV isolated from Chiroptera were basal to SARS-CoV isolated from humans, carnivores and swine. A solitary exception to this pattern occurred under an extremely biased edit cost model of indels 8, transversions 2, transitions 1 (Table 5). In this analysis, two of four isolates of SARS-CoV from Chiroptera occur in terminal rather than basal positions.
In the 157 isolate tree recovered under unitary costs, the human SARS-CoV isolate GD03T0013 is closely related to civet as well as human isolates of SARS-CoV. This is consistent with the result that there were bidirectional exchanges of SARS-CoV between humans and carnivores.
The 114 isolate trees that result from analyses using multiple alignment and standard tree searches under parsimony and maximum likelihood show a pattern of host shifts similar to those described for the direct optimization searches. SARS-CoV isolated from Chiroptera are basal to SARS-CoV under alignment plus parsimony search or alignment plus maximum likelihood search. In all results from the 114 isolate data set, SARS-CoV isolated from carnivores are terminal and nested within a large clade of SARS-CoV isolated from humans, and there is evidence of bidirectional exchange of SARS-CoV between humans and carnivores (Fig. 6 and supplemental data at http://supramap.osu.edu/cov).

Evolution of a labile region of the SARS-CoV genome
In all three isolate sampling regimes the first insertion of the 29-nucleotide region, CCTACTGGTTACCAACCTGAATGGAATAT, occurs phylogenetically basal to the clade exhibiting the earliest host shifts among humans and carnivores. However, the result of whether this region covaries with host shifts is dependent on isolate sampling regime.

Locus insertion and deletion among SARS-CoV from various hosts in the 83 isolate data set
We present the phylogeny for 83 isolates found under unitary costs with tracing depicting the complex pattern of presence and absence of the 29-nucleotide region CCTACTGGTTACCAACCTGAATGGAATAT (Fig. 1). The pattern of insertion and deletion of the 29-nucleotide region includes four to eight insertions and zero to four deletions. However, two host shifts from human to carnivore occur in concert with insertions of the 29-nucleotide region (Fig. 4). Using Maddison's (1990) concentrated changes test, we find statistically significant correlation between this 29-nucleotide region and host shifts (CCT = 0.0123).

Locus insertion and deletion among SARS-CoV in the 157 isolate data set
We optimized the presence of the 29-nucleotide regions CCTACTGGTTACCAACCTGAATGGAATAT and CCAATACATTACTATTCGGACTGGTTTAT over the tree calculated for 157 isolates under unitary costs (Fig. 2). The region CCAATACATTACTATTCGGACTGGTTTAT occurs in all wholly sequenced genomes of SARS-CoV isolated from Chiroptera and is well correlated with this host. In contrast, the region CCTACTGGTTACCAACCTGAATGGAATAT is inserted seven to eight times and deleted four to five times. In terms of host use in this tree, there are five shifts from carnivore to human hosts and two changes from human to carnivore hosts (Fig. 5). Among all these changes in the presence of the 29-nucleotide region, CCTACTGGTTACCAACCTGAATGGAATAT, and changes in host use, there is only one branch where these two changes occur concurrently. This results in a CCT value of 0.108. Thus the CCTACTGGTTACCAACCTGAATGGAATAT region is not significantly correlated with the host shift in the 157 isolate data set.

Locus insertion and deletion among SARS in the 114 isolate data set
We optimized the presence and absence of the 29-nucleotide regions CCTACTGGTTACCAACCTGAATGGAATAT and CCAATACATTACTATTCGGACTGGTTTAT on a binary representation of the strict consensus resulting from parsimony search of the 114 isolate data set (Fig. 3). There are no branches where a host shift (Fig. 6) is coincident with an insertion or deletion of this fragment. This result indicates that, as in the 157 isolate data set, the insertion of this 29-nucleotide region is not significantly correlated with a host shift. Moreover, just as in the 157 isolate data set, the region CCAATACATTACTATTCGGACTGGTTTAT occurs in all wholly sequenced genomes of SARS-CoV isolated from Chiroptera and is well correlated with this host.

Mutations in the spike protein
Li et al. (2005) interpret the distribution of states and polarity of change of position 479 of the SARS-CoV spike protein as follows. Viruses infecting carnivores contain a basic residue, arginine (R) or lysine (K). Next, mutation to a small uncharged residue, asparagine (N), allowed infection of humans.
However, in the 157 isolate tree we see a different distribution of genotypes and polarities of change. SARS-CoV isolated from carnivores exhibit three genotypes at position 479: asparagine (N), arginine (R) or lysine (K). SARS-CoV infecting humans have two genotypes at position 479: asparagine (N) and arginine (R). SARS-CoV infecting Chiroptera contain exclusively serine (S) at position 479. SARS-CoV isolated from the artiodactyl contain asparagine (N). Considering the tree in the 157 isolate data set, we observe the following mutations at position 479 of the spike protein: N479K, N479R, S479N, R479N (supplemental data at http://supramap.osu.edu/cov). Li et al. (2005) We observe essentially the same diversity of genotype at position 487 with some additions. SARS-CoV infecting Chiroptera contain primarily valine (V) at position 487 with the exception of one isolate that contains an isoleucine (I). SARS-CoV isolated from the artiodactyl exhibits a threonine (T). However, we observe different polarities of change than those inferred by Li et al. (2005). We observe the mutations V487I, V487T and T487S based on the tree from the 157 isolate data set (supplemental data at http://supramap.osu.edu/cov).
We found a statistically significant covariation of mutation T487S in the spike protein with carnivore hosts (Fig. 5 and supplemental data at http://supramap.osu.edu/cov). The CCT is 0.019 with DELTRAN optimization and 0.018 with ACCTRAN optimization.
We find no correlation of the mutations N479K and N479R in the spike protein with change from human to carnivore hosts (Fig. 5 and supplemental data at http://supramap.osu.edu/cov) as there are no branches that share these mutations and a shift in host.

Outgroup choice
As presented in Figs 1-6 and supplemental figures at http://supramap.osu.edu/cov, we rooted our phylogenies on non-SARS coronaviruses. Due to the long internal branches (e.g., ranging from 1680 to 3332 steps in the 83 isolate data set) between any antigenic groups and SARS, we decided to use this rooting only for visualization.
The rooting we can present in a figure does not fully represent the extent of our analyses. Our sensitivity tests showed that our results were not affected by outgroup choice. SARS-CoV isolates from human hosts were consistently basal to any SARS-CoV isolate from a carnivore host irrespective of outgroup choice.

Discussion
Based on the SARS-CoV data released as of July 2006, the polarity of host shifts from human to carnivore hosts and from human to artiodactyl hosts is clear. Simply put, the SARS-CoV sequence data from animal hosts that have been released as of July 2006 are the results of two zoonotic events that occurred after the 2002-03 outbreak of SARS in humans: one major shift from human to carnivore hosts (with subsequent reversals that were not significant to human outbreaks) and one shift to an artiodactyl. SARS-CoV isolated from Chiroptera are consistently basal to clades containing SARS-CoV from human, carnivore and artiodactyl hosts.

Outgroup choice and presentation
Many of the reports that argue for carnivores as the original reservoir of SARS-CoV use a phylogeny to support their arguments (Guan et al., 2003; Chinese SARS Molecular Epidemiology Consortium, 2004; Kan et al., 2005; Song et al., 2005; Zhang, C et al., 2006). However, the phylogenies in these studies lack outgroup and rooting criteria necessary to derive such evidence for the origins of SARS-CoV. Outgroups chosen from outside of SARS-CoV are necessary to test the monophyly of the SARS-CoV ingroup (Barriel and Tassy, 1998). Moreover, in optimal trees, non-SARS-CoV outgroups will join the region of the SARS-CoV subtree that is closest to the ancestor of SARS and provide a point suitable for rooting and subsequent character analysis (Grandcolas et al., 2004).
In the case of Guan et al. (2003; see their figs 2 and S2) and the Chinese SARS Molecular Epidemiology Consortium (2004; see fig. S7 of their supplemental materials), these researchers simply force the root position on their drawings such that they represent SARS-CoV isolates from animal hosts as ancestral. In other drawings, no outgroup is designated (Chinese SARS Molecular Epidemiology Consortium, 2004, fig. 2) or a human SARS-CoV outgroup is used and the animal SARS-CoV isolates are omitted from the tree (Chinese SARS Molecular Epidemiology Consortium, 2004, fig. S6). In the case of Song et al. (2005a), human SARS-CoV is designated as the outgroup. Regression methods are used to construct a rooted tree in which the date of the most recent ancestor is reconstructed as December 2002 (Song et al., 2005). Song et al. (2005) conclude that a source of disease common to humans and civets must be in the environment and that further surveys of the CoV in the Guangdong region are warranted. In the case of Zhang, C et al. (2006, fig. 1; and pers. comm.), an outgroup was used for tree construction but not for tests of selection.
Many researchers agree that SARS represents a previously unrecognized fourth lineage of coronaviruses (Marra et al., 2003; Rest and Mindell, 2003; Rota et al., 2003). Thus, the non-SARS coronaviruses can serve as outgroups to SARS-CoV. This can be revisited if and when data on viruses closely related to SARS-CoV become available. Alternatively, other researchers used a torovirus and/or okavirus outgroup(s) to place SARS-CoV as sister to group 2 coronaviruses (Snijder et al., 2003; Liò and Goldman, 2004). However, based on the data in GenBank, toroviruses and okaviruses bear little sequence similarity to any coronavirus. The danger in use of such distant outgroups is well documented (Wheeler, 1990; Graham et al., 2002). In essence, distant outgroups act as if they are random sequences, resulting in spurious attraction to the longest branch available among the ingroup. Indeed the branch lengths between the major clades of coronaviruses in the 83 and 157 isolate data sets of this paper are long. This problem is addressed in the 114 isolate data set. The best approach going forward is to extend sampling of diverse coronavirus genomes to search for outgroups of SARS-CoV in humans, especially from Chiroptera, carnivores and non-human primates.

Taxonomic sampling affects analyses
The lack of a good outgroup to SARS-CoV is tied to (1) poor sampling of non-SARS coronavirus genomes before the 2002-03 SARS outbreak, and (2) the preoccupation with animals in Chinese markets, farms and restaurants after the outbreak without regard to highly diverse species traded as bush meat in South-east Asia (Bell et al., 2004). Before the SARS epidemic, the small number of animal coronaviruses that had been sequenced were selected primarily from animals of agricultural importance or model organisms. This lack of sampling of coronaviruses from wild animals is changing as viral surveys of Chiroptera, camelids and bovids are published and in preparation (Chu et al., 2006; Dominguez et al., 2007; Jina et al., 2007; Zhang, X et al., 2007).

Insertion of the 29-nucleotide regions
Presence of the region CCTACTGGTTACCAACCTGAATGGAATAT is correlated with host switching between human and carnivore hosts in the 83 isolate data set but is not significantly correlated with switches from human to carnivore hosts in the larger (114 and 157 isolate) data sets. The concentrated changes test (CCT; Madison, WI) evaluates whether a change in one character (e.g., insertion or deletion of the 29-nucleotide region) and a change in another character (e.g., host phenotype) co-occur on the same branches of a tree more often than expected by chance. In the case of the 83 isolate data set we observe a significant correlation between the presence of this 29-nucleotide region and carnivore hosts. In the case of the 157 isolate data set the correlation is not significant. In the case of the 114 isolate data set we do not observe changes that strictly co-occur. However, we do observe that host shifts in the 114 and 157 isolate data sets occur in the region of the tree in which changes in the 29-nucleotide region occurred more basally. Thus, the presence of the 29-nucleotide region may predispose a lineage to host shifts or be part of a suite of genomic changes associated with them. In light of these results, it is of interest to implement a relaxed CCT. This test could examine the branches in the vicinity of the change of interest for a correlated change in a second character.
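One way such a relaxed test might be framed is sketched below. This is an assumption-laden illustration, not the CCT as implemented in any existing package: the tree, the branch labels, the neighborhood parameter `k` and the permutation null (placing the same number of changes of the first character uniformly at random on branches) are all hypothetical choices made for the example. A host shift is counted as associated with a change in the 29-nucleotide region if that change lies on the same branch or at most `k` branches rootward of it.

```python
import random

def branches_rootward(parent, node, k):
    """Branches (named by their child node) from `node` up to k steps rootward."""
    out = []
    cur = node
    for _ in range(k + 1):
        if cur not in parent:  # the root has no subtending branch
            break
        out.append(cur)
        cur = parent[cur]
    return out

def relaxed_cooccurrence(parent, a_changes, b_changes, k):
    """Count B-changes with an A-change on the same branch or within k branches rootward."""
    return sum(
        any(br in a_changes for br in branches_rootward(parent, b, k))
        for b in b_changes
    )

def permutation_p(parent, a_changes, b_changes, k, n_perm=10000, seed=0):
    """Permutation p-value: A-changes are re-placed uniformly at random on branches."""
    rng = random.Random(seed)
    branches = sorted(parent)
    observed = relaxed_cooccurrence(parent, a_changes, b_changes, k)
    hits = 0
    for _ in range(n_perm):
        shuffled = set(rng.sample(branches, len(a_changes)))
        if relaxed_cooccurrence(parent, shuffled, b_changes, k) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical tree given as child -> parent; each key names one branch.
parent = {
    "n1": "root", "t5": "root",
    "n2": "n1", "t1": "n1",
    "n3": "n2", "t2": "n2",
    "t3": "n3", "t4": "n3",
}
region_changes = {"n2"}      # hypothetical gain of the 29-nucleotide region
host_shifts = {"t3", "t4"}   # hypothetical host shifts on terminal branches

strict = relaxed_cooccurrence(parent, region_changes, host_shifts, k=0)
relaxed, p_value = permutation_p(parent, region_changes, host_shifts, k=2)
print(strict, relaxed, p_value)
```

With `k=0` the statistic reduces to strict co-occurrence on the same branch; larger values of `k` admit host shifts that follow a more basal change in the region, which is the pattern described for the 114 and 157 isolate data sets. The appropriate null model and the choice of `k` would need to be justified for real data.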

Mutations of the spike gene
Our phylogenetic results shed fresh light on the polarity of mutations and diversity of genotypes in the spike protein of SARS-CoV. Our results differ from those of Zhang, C et al. (2006) who, using CODEML (Yang, 1997) and HYPHY (Kosakovsky Pond and Frost, 2005) for a tree-based analysis of spike nucleotide sequences, showed that the codon for amino acid position 479 was under positive selection and the codon for amino acid position 487 was not. The trees used to derive these results reflect the same bias seen in other studies: the assumption that transmission of SARS-CoV was from carnivore to human hosts.

Geographic visualization
The pattern of geographic spread of SARS-CoV is similar to that of avian influenza (H5N1; Janies et al., 2007) in that both viral lineages that have caused recent outbreaks have their origins in Southern China. However, H5N1 and SARS-CoV contrast in the rapidity with which they moved across the planet. The recent outbreak lineage of H5N1 spread from Asia to Europe, the Middle East, and Africa during the period 1996-2005 and has not yet arrived in North America. In contrast, SARS-CoV spread not only from Asia to Europe but also to North America in a matter of months (November 2002-March 2003). These differences are perhaps associated with the fact that SARS-CoV infected carnivores in urban markets and a cosmopolitan human population with access to world travel. In contrast, H5N1 is currently infecting primarily avian populations and humans who live in rural settings and come into close contact with birds via subsistence farming and food processing.

Further directions
In order to better understand the molecular epidemiology of SARS-CoV we must develop research programs that include comprehensive sampling and phylogenetic analyses of many whole viral genomes, including outgroups that are closely related to SARS-CoV. Because coronaviruses pose a previously unrecognized zoonotic threat, several groups have embarked on large-scale sequencing projects on coronavirus genomes isolated from diverse animal hosts, especially Chiroptera, carnivores and primates. These efforts will help us pinpoint the zoonotic origins of SARS-CoV and develop an understanding of the zoonotic potential of coronaviruses as well as the genomic changes that underlie host shifts among coronaviruses.

spike.aa.pos479.pdf. Phylogenetic tree of 157 coronavirus isolates based on whole genomes (sampling in Table 2). This is the same tree as Figs 2 and 5 in the body of the paper except that in this instance the amino acid states at position 479 in the spike locus are traced.

spike.aa.pos487.pdf. Phylogenetic tree of 157 coronavirus isolates based on whole genomes (sampling in Table 2). This is the same tree as Figs 2 and 5 in the body of the paper except that in this instance the amino acid states at position 487 in the spike locus are traced.

cov114.host.raxmltree929.names.pdf. RAXML search under GTRGAMMA for 114 isolates. Character optimization was conducted under equally weighted parsimony.

cov114.host.raxmltree929boot.nex. Tree with bootstrap values for RAXML search. To be viewed with MESQUITE.

r1000.cov114.jackknife.log. Jackknife values for 114 isolate data set under equally weighted parsimony. To be viewed with a text editor.

r1000.cov83.jackknife.log. Jackknife values for 83 isolate data set under equally weighted parsimony. To be viewed with a text editor.

r1000.cov157.jackknife.log. Jackknife values for 157 isolate data set under equally weighted parsimony. To be viewed with a text editor.

janiesetal2008covsars.kmz. Keyhole Markup file depicting the spread of 114 isolates of SARS-CoV over geography. To be opened with Google Earth. See also readmesarskml.pdf.
released many spike gene and three full genome sequences for SARS-CoV isolated from human, raccoon dog and civet cat hosts into the public domain in July, 2006 3. Li et al. (2005) 4 published SARS-CoV nucleoprotein and spike gene sequences (some recently updated as whole genomes) isolated from Chiroptera: Rhinolophus sinicus, Rhinolophus ferrumequinum, Rhinolophus macrotis and Rhinolophus pearsoni. Lau et al. (2005) 5 published three complete SARS-CoV genomes isolated from the bat Rhinolophus pearsoni and SARS-CoV polymerase sequences from Rhinolophus sinicus. Poon et al. (2005) 6 published sequences of RNA-dependent RNA polymerase (RdRp), polyprotein, and spike genes of a non-SARS-CoV isolated from the bat Miniopterus pusillus. Tang et al. (2006) 7 published a review of bat coronaviruses in August, 2006 and released three genomes and 70 gene fragments in July, 2006.

Fig. 3. Binary representation of strict consensus tree produced by multiple alignment followed by tree search under parsimony of 114 whole coronavirus genomes. Branches with black traces indicate presence of the 29-nucleotide region, CCTACTGGTTACCAACCTGAATGGAATAT (e.g., positions 27869-27897 in AY278489) in an uncharacterized protein of variants of the SARS-CoV that infect small carnivores and humans. Branches with green traces indicate the presence of the 29-nucleotide region CCAATACATTACTATTCGGACTGGTTTAT (e.g., positions 27866-27894 in DQ648857) in an uncharacterized protein of all SARS-CoV isolated from Chiroptera. White traces indicate the absence of either region. In this analysis the evolution of insertions and deletions of these regions is simple.

Fig. 6. Note that the SARS-CoV isolates from Chiroptera (black trace) are basal to the entire SARS-CoV clade. The SARS-CoV isolates from carnivores (purple trace) and artiodactyls (light blue trace) are nested within a large clade of SARS-CoV isolates from humans (yellow trace), although there were exchanges of SARS-CoV between humans and carnivores. The tree search and character optimization were conducted under equally weighted parsimony.
also describe diversity and polarity of change for position 487 of the spike protein of SARS-CoV. They report that SARS-CoV isolated in 2002-03 contains threonine (T) at position 487, whereas SARS-CoV isolated from humans and carnivores in 2003-04 contains serine (S).

Table 1
GenBank accession numbers and descriptions of genomes and partial genomes of virus exemplars considered in the 83 isolate data set

Table 2
GenBank accession numbers and descriptions of genomes and partial genomes of virus exemplars considered in the 157 isolate data set

Table 3
GenBank accession numbers and descriptions of whole genomes of virus exemplars considered in the 114 isolate data set

Table 4
Phylogenetic position of carnivore and swine relative to human SARS-CoV isolates in trees calculated under various edit costs under direct optimization for the 83 isolate data set

Fig. 1. Phylogenetic tree produced by direct optimization of 83 coronavirus isolates based on whole and partial genomes (sampling in Table 1). Branches with black traces indicate presence of the 29-nucleotide region, CCTACTGGTTACCAACCTGAATGGAATAT (e.g., positions 27869-27897 in AY278489) in an uncharacterized protein of variants of the SARS-CoV that infect small carnivores and humans. White traces indicate the absence of this region. In this analysis, the evolution of insertions and deletions of this region is labile and complex.

Fig. 2. Phylogenetic tree produced by direct optimization of whole and partial coronavirus genomes of 157 isolates (sampling in Table 2). Branches with black traces indicate presence of the 29-nucleotide region, CCTACTGGTTACCAACCTGAATGGAATAT (e.g., positions 27869-27897 in AY278489) in an uncharacterized protein of variants of the SARS-CoV that infect small carnivores and humans. Branches with green traces indicate the presence of the 29-nucleotide region CCAATACATTACTATTCGGACTGGTTTAT (e.g., positions 27866-27894 in DQ648857) in an uncharacterized protein of all SARS-CoV isolated from Chiroptera. White traces indicate the absence of either region. In this analysis, the evolution of insertions and deletions of these regions is labile and complex.