Genotypic analysis of two hypervariable human cytomegalovirus genes

Most human cytomegalovirus (HCMV) genes are highly conserved in sequence among strains, but some exhibit a substantial degree of variation. Two of these genes are UL146, which encodes a CXC chemokine, and UL139, which is predicted to encode a membrane glycoprotein. The sequences of these genes were determined from a collection of 184 HCMV samples obtained from Africa, Australia, Asia, Europe, and North America. UL146 is hypervariable throughout, whereas variation in UL139 is concentrated in a sequence encoding a potentially highly glycosylated region. The UL146 sequences fell into 14 genotypes, as did all previously reported sequences. The UL139 sequences grouped into 8 genotypes, and all previously reported sequences fell into a subset of these. There were minor differences among continents in genotypic frequencies for UL146 and UL139, but no clear geographical separation, and identical nucleotide sequences were represented among communities distant from each other. The frequent detection of multiple genotypes indicated that mixed infections are common. For both genes, the degree of divergence was sufficient to preclude reliable sequence alignments between genotypes in the most variable regions, and the mode of evolution involved in generating the genotypes could not be discerned. Within genotypes, constraint appears to have been the predominant mode, and positive selection was detected marginally at best. No evidence was found for linkage disequilibrium. The emerging scenario is that the HCMV genotypes developed in early human populations (or even earlier), becoming established via founder or bottleneck effects, and have spread, recombined and mixed worldwide in more recent times.

J. Med. Virol. 80:1615-1623, 2008. © 2008

KEY WORDS: herpesvirus; variation; genotype; chemokine; glycoprotein

INTRODUCTION
Human cytomegalovirus (HCMV; family Herpesviridae, subfamily Betaherpesvirinae, genus Cytomegalovirus, species Human herpesvirus 5) is ubiquitous and host-specific. Infection is asymptomatic for most people, but can result in serious disease in immunocompromised individuals and congenitally infected newborns. The minimally passaged Merlin strain is considered best to represent wild-type HCMV, and has a 236 kbp genome that is predicted to contain approximately 165 protein-encoding genes [Dolan et al., 2004].
Most genes are highly conserved in sequence between HCMV strains, but a number of genes predicted to encode membrane-associated or secreted proteins are characterized by a striking degree of variability, as revealed by examination of individual genes [reviewed in Pignatelli et al., 2004] and by whole genome comparisons [Murphy et al., 2003; Dolan et al., 2004]. Various studies have attempted to make connections between the genotypes of hypervariable genes and disease outcome, but overall the conclusions reached are unclear or contradictory [reviewed in Puchhammer-Stöckl and Görzer, 2006]. The apparently random association of genotypes at different loci (that is, the absence of linkage disequilibrium), presumably as a result of recombination during HCMV evolution, limits any conclusions to the specific gene under study, except in some cases where genes are very close to each other or encode interacting hypervariable proteins [Rasmussen et al., 2002, 2003]. A further complicating factor is the occurrence of multiple HCMV genotypes in individuals, including immunocompromised patients, pregnant women and congenitally infected newborns [for recent papers, see Rasmussen et al., 2003; Hassan-Walker et al., 2004; Stanton et al., 2005; Iwasenko et al., 2007].
One of the most variable HCMV genes is UL146, which encodes a chemokine designated vCXC-1. This gene is variable throughout its length [Penfold et al., 1999; Prichard et al., 2001; Hassan-Walker et al., 2004; Arav-Boger et al., 2005; Stanton et al., 2005; He et al., 2006; Lurain et al., 2006], and 14 genotypes have been catalogued [Dolan et al., 2004]. In strain Toledo, UL146 encodes a functional chemokine that is capable of inducing neutrophil degranulation, chemotaxis and calcium mobilization. This protein contains an ELRCXC motif, which has been shown to be essential for receptor binding and IL-8 activity [Clark-Lewis et al., 1991]. vCXC-1 binds to human CXCR2 and is comparable in its activities to the CXC chemokines IL-8 and Gro-α [Penfold et al., 1999]. The function of vCXC-1 may be to facilitate dissemination of the virus through its ability to attract monocytes to the initial site of infection. Thus, the virus could undermine the effectiveness of antiviral immunity by manipulating the host chemokine system and, together with other virus-encoded molecular mimics, suppressing the immune system.
The most variable gene in the vicinity of UL146 is UL139, which is located 5.2 kbp distant and is predicted to encode a type I membrane glycoprotein. Variability is concentrated in a region of the ectodomain [Dolan et al., 2004]. A recent study of 26 HCMV strains isolated in China described three major genotypes, with two of these divided into subtypes.
The aims of the present study were to investigate whether additional UL146 genotypes exist in a large number of clinical samples obtained from a wide range of locations and clinical settings, and to define the range of UL139 genotypes in these samples. Ancillary interests were to examine the relative frequencies and geographical distribution of genotypes, to assess whether infections with more than one HCMV strain are common, and to investigate the evolution of UL146 and UL139.

MATERIALS AND METHODS
Virus DNA Collection
A collection of 184 DNA samples was derived from 179 anonymized clinical samples obtained in various geographical locations in accordance with local ethical guidelines, plus 5 commonly used laboratory strains (Davis, Merlin, TB40/E, Toledo and Towne). Details of the 171 samples in the collection that yielded sequence data are available on request, and include the age, sex, and pathology of the patient, the clinical source of the sample, and the UL146 and UL139 genotypes determined. The samples numbered 18 from Australia, 10 from Hong Kong, 6 from Germany, 13 from England, 18 from The Gambia, 24 from Hungary, 7 from Italy, 6 from The Netherlands, 41 from Scotland, 5 from the USA, 8 from Wales, and 15 from South Africa. A minority of strains (40) had been passaged in human fibroblast cell culture, either as routine diagnostic specimens or as laboratory strains. DNA was extracted by standard methods from body tissues, urine, saliva or infected cells. The South African samples were obtained from the saliva of mothers (10 of whom tested HIV-negative) attending rural clinics in KwaZulu-Natal [Dedicoat et al., 2004]. Since these samples were available in very limited amounts and contained low numbers of HCMV genomes, whole genome amplification using a REPLI-g kit (Qiagen, Crawley, UK) was carried out prior to PCR amplification.

PCR Amplification
UL146 and UL139 were amplified separately by single round or nested PCR, using primers in conserved regions (Table I). Single (and first) round PCR of UL146 using AB4 and A162 generated a product of approximately 1 kbp, and second round PCR using UL146-4A and UL146-3A yielded an 800 bp product. Single (and first) round PCR of UL139 using AB1 and AB2 generated an 800 bp product, and nested PCR using UL140-3A and UL140-11A yielded a 500 bp product. UL140-11A is located within the UL139 coding region, and as a consequence the sequences obtained using nested PCR (approximately 40% of the total) lacked 29 amino acid-encoding codons from the highly conserved C terminus.
For the single (and first) round, 1 µl of DNA was added to the PCR reaction mixture, which consisted of 40 µl of water, 5 µl of buffer, 1 µl of 10 mM dNTPs, 1 µl of each of the two primers (10 µM) and 1 µl (1 U) of DNA polymerase (Advantage 2, BD Clontech, Basingstoke, UK). The conditions for amplification were 95°C for 2 min followed by 35 cycles of 95°C for 2 min, 60°C for 30 sec and 68°C for 1 min. Second round PCR utilized 1 µl of first round PCR products as template, amplified under the same conditions. PCR reactions were set up in a dedicated, PCR product-free room. Approximately one-third of the samples were tested on three separate occasions to assess reproducibility.

Purification, Cloning, and Sequencing of PCR Products
PCR products were separated by agarose gel electrophoresis. Appropriate DNA fragments were excised, purified using a Geneclean turbo kit (Q Biogene, Cambridge, UK), and eluted using 100 µl of nuclease-free water. The single round or second round primers were used for direct sequencing.
In some cases, including those where direct sequencing indicated the presence of more than one genotype of UL146 or UL139, fragments were cloned using a pGEM-T kit (Promega, Southampton, UK). Following ligation and transformation into chemically competent E. coli TOP10 cells, 5 recombinant colonies were picked and grown overnight at 37°C in 2YT broth containing 100 µg/ml ampicillin. Plasmid DNA was purified using a QIAprep Spin miniprep kit (Qiagen). Plasmid inserts were sequenced using universal forward and reverse primers. Sequencing was carried out on both DNA strands using a BigDye terminator kit (Applied Biosystems, Warrington, UK) in an ABI 3730 instrument. Samples containing multiple strains were identified by the derivation of plasmids representing different genotypes of UL146 or UL139.

Sequence Analysis
Sequence chromatograms were viewed using Editview (Applied Biosystems) and analyzed using Pregap4 and Gap4 [Staden et al., 2000]. Nucleotide and imputed amino acid sequences were aligned using CLUSTAL W [Thompson et al., 1994] and MAFFT [Katoh et al., 2005]. Full-length sequences were used for the UL146 data and a subset of the UL139 data, and another subset of the UL139 data was analyzed using sequences lacking the conserved C terminus. MEGA4.0 [Tamura et al., 2007] was used for the generation of phylogenetic trees. Frequencies of nonsynonymous and synonymous differences per site (dN and dS, respectively) and degree of sequence variability (nucleotide and amino acid) were investigated using Swaap 1.0.1 [Pride, 2004], DnaSP 4.10.9 [Rozas et al., 2003], and MEGA4.0. dN/dS ratios and probabilities of positive selection were assessed using PAML 3.15 [Yang, 1997]. Signal peptide and transmembrane sequences were predicted using Phobius [Kall et al., 2004].

Statistical Analysis
Sample origin was divided into four regions (Africa, Asia, Europe, and Australia) for assessment of the geographical distribution of genotypes. Chi-square tests were used to assess the significance of variability of genotype frequencies among regions. Yates' correction for continuity was applied to chi-square tests in cases where the expected values fell below 5. Similarly, chi-square tests with Yates' correction were applied to 60 samples where single genotypes were detected for both UL146 and UL139, in order to test for linkage disequilibrium. Samples containing mixed infections were excluded from this analysis.
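The chi-square test with Yates' correction can be illustrated for the simplest case, a 2×2 table. The following Python sketch is not the analysis code used in this study; the function name `chi_square_2x2` is hypothetical, and the example counts (UL146 G13 in 11 of 45 African sequences and 17 of 104 European sequences) are those reported in the section on geographical distribution. The correction is applied here purely for illustration, irrespective of the expected values.

```python
import math

def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test for the 2x2 table [[a, b], [c, d]] (1 degree of freedom).

    Returns (statistic, p_value). With yates=True, 0.5 is subtracted from
    each |observed - expected| before squaring (Yates' continuity correction).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(0.0, diff - 0.5)
            stat += diff * diff / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# G13 versus all other UL146 genotypes: Africa (11 of 45), Europe (17 of 104).
stat, p = chi_square_2x2(11, 45 - 11, 17, 104 - 17)
print(f"chi-square = {stat:.3f}, p = {p:.3f}")
```

With these counts the corrected statistic falls well short of the 5% critical value for one degree of freedom (3.84), which is in line with the absence of a statistically significant geographical association.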

RESULTS
UL146 and UL139 Sequences
The UL146 and UL139 genotypes in 184 samples were investigated by PCR and sequencing using primers in conserved regions. UL146 was amplified from 159 samples and sequences were determined from 134, and UL139 was amplified from 168 samples and sequences were determined from 131. A total of 13 samples failed to yield products from either gene. Since some samples contained more than one virus strain, totals of 182 UL146 sequences and 183 UL139 sequences were obtained. Alignment and phylogenetic analyses involved the 350 UL146 sequences and 300 UL139 sequences derived from the present study or reported by others in the literature [Cha et al., 1996; Davison et al., 2003; Dolan et al., 2004; Arav-Boger et al., 2005; Stanton et al., 2005; He et al., 2006; Lurain et al., 2006; Qi et al., 2006] or

UL146 Genotypes
The UL146 coding sequences range in length from 114 to 126 codons, and phylogenetic analyses indicated that all fall into the 14 genotypes defined previously and designated G1-G14 [Dolan et al., 2004]. Amino acid sequence variation among genotypes is high (π = 0.521, where π is protein diversity from MEGA4.0), whereas within each genotype it is low (π = 0-0.051 with a mean of 0.017) (Table II).
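To make the diversity measure concrete, the sketch below computes a mean pairwise diversity for a set of aligned amino acid sequences. It assumes that diversity is the average proportion of differing positions (p-distance) over all sequence pairs; the exact settings used in MEGA4.0 are not stated here, and the function names and sequences are hypothetical.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ.

    Positions with a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0

def mean_pairwise_diversity(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical aligned amino acid fragments (not real UL146 data).
within_genotype = ["MRLIFGALII", "MRLIFGALII", "MRLIFGTLII"]
print(round(mean_pairwise_diversity(within_genotype), 3))
```

Applied to sequences within one genotype, a measure of this kind stays close to zero, whereas comparisons across genotypes give much larger values.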
Differences in overall genotypic frequencies were observed (Table II). For example, G7 was detected in 16% of samples and G6 in fewer than 1%.

UL139 Genotypes
The UL139 coding sequences range in length from 124 to 148 codons, and phylogenetic analyses indicated that all fall into 8 genotypes designated G1-G8. Figure 1A shows a predicted amino acid sequence alignment of the primary translation products of one representative of each genotype. Figure 1B shows a phylogenetic tree constructed using these sequences.
The protein encoded by each HCMV UL139 genotype contains a putative signal peptide sequence and a transmembrane region. Variation is concentrated in the N-terminal portion of the protein. Amino acid sequence variation between genotypes is high (π = 0.275), whereas within each genotype it is low (π = 0.007-0.095 with a mean of 0.025) (Table III). Variation within genotypes tends to be higher in UL139 than in UL146, but that among genotypes is lower. Sequences in G1 exhibit a greater level of variation than those in the other genotypes.
Differences in overall genotypic frequencies were observed (Table III). For example, G2 was detected in 27% of samples and G8 in fewer than 3%.

Assessment of Positive Selection
In order to assess positive selection (i.e., for amino acid sequence diversity), the dN/dS ratio was calculated for each UL146 and UL139 genotype (Tables II and III). Positive selection was detected at the 1% significance level only in UL139 G1, and at the 5% level in UL146 G7 and G1. UL139 G6 and G7 and UL146 G2 and G13 also had values of dN/dS > 1, but these were not statistically significant. Only in UL139 G1 was an amino acid residue identified as under positive selection, although this was at position 12 in the predicted signal peptide sequence. Thus, evidence for positive selection is marginal, and it seems unlikely that this mode of diversification has featured in the evolution of UL139 and UL146 since the genotypes arose. No strong evidence emerged for positive selection in formal comparisons among genotypes (i.e., as a factor in emergence of the genotypes), but it must be registered that variation was so large as to confound reliable sequence alignments.
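The dN/dS comparison rests on distinguishing nucleotide differences that change the encoded amino acid (nonsynonymous) from those that do not (synonymous). The Python sketch below classifies codon differences between two aligned coding sequences using the standard genetic code. It is a deliberate simplification and not the PAML procedure used in this study: it counts raw codon differences and does not normalize by the numbers of synonymous and nonsynonymous sites or correct for multiple substitutions. The function name and sequences are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def classify_codon_differences(seq_a, seq_b):
    """Count synonymous and nonsynonymous codon differences.

    Both inputs are aligned, in-frame coding sequences of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    Returns (synonymous, nonsynonymous).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, in frame, and of equal length")
    syn = nonsyn = 0
    for pos in range(0, len(seq_a), 3):
        codon_a = seq_a[pos:pos + 3].upper()
        codon_b = seq_b[pos:pos + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn

# Hypothetical sequences: codon 2 differs silently (CTG/CTA, both Leu),
# codon 4 changes the amino acid (GAT Asp versus GGT Gly).
a = "ATGCTGAAAGATTGG"
b = "ATGCTAAAAGGTTGG"
print(classify_codon_differences(a, b))
```

A dN/dS ratio above 1 indicates that amino acid-changing substitutions have accumulated faster than silent ones, which is why values above 1 are taken as a signal of positive selection, subject to statistical support.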

Geographical Distribution of Genotypes
The sequence data derived in the present work were divided into four groups representing strains obtained from Africa, Asia, Australia, and Europe. Insufficient sample numbers were obtained from America to war- [...] Europe) for which a larger sample size was tested displayed greater genotypic diversity. Nonetheless, UL146 G13 appears somewhat more common in African samples than in European samples (11 out of 45 sequences were detected in the former and 17 out of 104 in the latter), UL146 G10 and G11 appear to be restricted to Europe (8 and 3 samples, respectively), and the single sample of UL146 G6 originated from Asia.

Fig. 1 (legend fragment). Completely conserved residues are indicated in the consensus row (con). Below this is the CCMV sequence, which is included to illustrate conservation of the SETTTGTSSNSS motif (underlined). The CCMV sequence [Davison et al., 2003] provided is the C-terminal portion (final 12 residues not shown) of a larger protein, the N-terminal portion of which lacks a counterpart in HCMV but is related to a protein (encoded by gene rh174) in rhesus cytomegalovirus. B: [...]
Identical nucleotide sequences were frequently obtained from geographically distant and presumably epidemiologically unrelated patients. For example, certain samples from The Gambia, Scotland, and Hungary contained identical UL146 G12 sequences. Also, UL139 G2, which was identified in 27% of samples, was represented by identical sequences from Hungary, the UK, and The Gambia.

Linkage Disequilibrium Between UL146 and UL139
Potential linkage disequilibrium was investigated in 60 strains for which single genotypes of both UL146 and UL139 were obtained. Of 112 possible genotype pairs, 41 were observed at least once (Table VI). No statistical significance was obtained for the observed distribution of genotype pairs versus a null hypothesis of independent assortment, indicating an absence of linkage disequilibrium.
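The arithmetic behind this test can be sketched as follows. The 112 possible pairs arise from 14 UL146 genotypes combined with 8 UL139 genotypes, and under independent assortment the expected count for each pair is the product of the two marginal totals divided by the number of strains. The Python example below is illustrative only; the function name `expected_counts` and the small 3×2 table are hypothetical and are not taken from Table VI.

```python
def expected_counts(table):
    """Expected cell counts under independence for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Figures stated in the text: 14 UL146 x 8 UL139 genotypes, 60 strains.
possible_pairs = 14 * 8
mean_expected = 60 / possible_pairs
print(possible_pairs, round(mean_expected, 2))

# Hypothetical 3 x 2 table of paired genotype counts (not the paper's Table VI).
toy = [[4, 2], [3, 3], [1, 5]]
for row in expected_counts(toy):
    print([round(x, 2) for x in row])
```

With 60 strains spread over 112 possible pairs, the average expected count per pair is well below 1, so most cells are expected to be sparsely populated or empty.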

Infections With Multiple Strains
Multiple genotypes in one or both genes were detected in at least 14% of samples upon first analysis (rising to 29% when repeat experiments were included), distributed among immunocompetent and immunocompromised individuals. More than one genotype was detected in 11% of European samples, 16% of Gambian samples, 47% of South African samples, and 10% of Hong Kong samples (rising to 24%, 33%, 60%, and 60%, respectively, when repeat experiments are included).

DISCUSSION
This study focused on the genotype definitions, frequencies, occurrence in mixed infections, geographical distribution and evolution (in terms of linkage disequilibrium and mode of selection) of two hypervariable HCMV genes, UL146 and UL139. Totals of 182 UL146 and 183 UL139 sequences were obtained from a large panel of clinical isolates collected from Africa (South Africa and The Gambia), Asia (Hong Kong), Australia and Europe (various countries). These were used in all analyses, and were supplemented by 168 previously published UL146 and 117 UL139 sequences in analyses of genotype definitions, frequencies and mode of selection. The UL146 sequences fell into the 14 genotypes described previously [Dolan et al., 2004], and no new genotypes were discovered. Twelve genotypes contained the ELRCXC motif, which has been shown to be essential for receptor binding and IL-8 activity [Clark-Lewis et al., 1991], and 2 contained the NGRCXC motif, which has been shown to be important for interaction with T and B cells [Baggiolini et al., 1997]. The latter genotypes (G5 and G6) are relatively rare, being present in approximately 5% of samples. It is not known whether the UL146 genotypes possess different biological properties, and studies to investigate this question are required.
The UL139 sequences grouped into eight genotypes. A recent analysis of 26 clinical samples described three major groups (G1, G2, and G3), two of which were divided into subgroups (G1 into G1a, G1b and G1c and G2 into G2a and G2b). Subgroups G1b and G1c in the previous study correspond to G1 in the present study, subgroup G1a corresponds to G4, subgroups G2a and G2b correspond to G6 and G2, respectively, and G3 is named identically in both studies. Thus, apart from the differences in nomenclature, the subgroups correlate with a subset of the genotypes in the present study, except that the closely related subgroups G1b and G1c in the former are amalgamated as G1 in the latter. Most of the variation in UL139 is due to substitutions or deletions of variable size near the N terminus. This region is rich in S and T residues that are potentially susceptible to O-glycosylation, and also contains NXS or NXT motifs that are potentially susceptible to N-glycosylation. This suggests that selection may have focused primarily on glycosyl side chains rather than the underlying amino acid sequence. A similar feature characterizes other variable glycoprotein genes, such as UL73 (encoding glycoprotein N (gN)) and UL74 (encoding glycoprotein O (gO)) [Pignatelli et al., 2001, 2003; Mattick et al., 2004].
A region of sequence identity (SETTTGTSSNSS in Fig. 1A) has been noted between the HCMV UL139 protein and CD24, a cellular glycosyl phosphatidylinositol-linked glycoprotein that is involved in B cell activation. It is difficult to assess the significance of this observation, especially since 9 of the 12 residues are S or T and the region is not conserved in CD24 orthologues from other mammals. However, the sequence is present in all of the UL139 genotypes identified in the present study, except for G5, and also in the homologous protein from chimpanzee cytomegalovirus (CCMV) (Fig. 1A). Variation in glycosylation has been observed in CD24 and has been linked to differences in cell and tissue specificity [Poncet et al., 1996]. Additional roles for CD24 in apoptosis and cell adhesion have also been suggested, and more recently in regulating the responsiveness of a chemokine receptor, CXCR4 [Schabath et al., 2006; Smith et al., 2006]. The possibility that UL139 may be a CD24 homologue remains intriguing, but, in the absence of functional data, unproven.
Studies of HCMV genotype frequency, including the present one, are usually based on the use of conserved PCR primers, and face limitations as a result. Firstly, there is no guarantee that all genotypes will be detected, since primers are chosen on the basis of alignments of available sequences. Secondly, samples containing more than one strain yield mixed sequences, which when cloned are recovered approximately in proportion to their abundance (although stochastic processes may introduce bias during PCR). Therefore, the absence of a genotype from a particular sample cannot be assured. If any UL146 or UL139 genotypes have escaped recognition, they may emerge from future studies involving different primers or from whole genome sequencing exercises.
As found in previous studies [reviewed in Puchhammer-Stöckl and Görzer, 2006], mixed infections with different HCMV strains were common. In some samples, a single UL139 genotype and multiple UL146 genotypes, or vice versa, were detected. This could be due to different strains happening to contain the same genotype at one locus but not at the other, or to the limitations of amplifying sequences present as mixtures in unequal proportions. Some samples tested more than once were found to contain additional genotypes not detected in the first experiment, suggesting that the number of mixed infections was underestimated by the methodology used. Mixed infections were more frequently detected from certain regions, namely Hong Kong, South Africa and, to a lesser extent, The Gambia. It is possible that this is a result of higher transmission frequencies. In one study [Beyari et al., 2005], a higher seroprevalence of HCMV in children in Malawi compared to European countries and the USA was taken as possibly reflecting greater opportunities for transmission, although multiple genotypes were detected in only a small number of samples. The occurrence of mixed infections is being recognized increasingly as potentially significant to the biology of HCMV. This feature adds to the limitations inherent in studies of whether particular genotypes are associated with disease outcome; other features include the number, origin and pathological categorization of samples, the choice of gene, the absence of linkage disequilibrium, and host factors. In light of these limitations, our opinion is that robust evidence in favor of any association between genotype and pathology has proved elusive in the literature.
Further work utilizing genotype-specific approaches is required to explore the true frequency of mixed infections, both to validate studies of this type and to determine whether mixed infections have geographical or biological correlates.
Similar to the conclusions drawn from a study on UL73 (encoding gN) [Pignatelli et al., 2003], no statistically significant association of UL146 or UL139 genotypes with geographical origin arose from the analysis. However, this may reflect low sample numbers (albeit much larger than those utilized in previous studies on UL146 and UL139) and the lack of detailed information on ethnic origin. Likewise, investigation of linkage disequilibrium between UL146 and UL139 genotypes was compromised by the small sample number (60) relative to the large number of possible genotype combinations (112). However, no evidence for linkage disequilibrium was obtained, indicating the involvement of recombination in HCMV evolution since the genotypes arose. Taking into account the size of the HCMV gene complement, we agree with the view that very many strains are likely to be circulating in the world [Rasmussen et al., 2003].
The extensive divergence between genotypes and the consequent inability to produce reliable sequence alignments for both UL146 and UL139 in the hypervariable regions compromised assessments of the role of positive selection in generating the genotypes. In contrast, variation within genotypes is low, and identical nucleotide sequences were obtained from geographically distant individuals. The analysis suggests that constraint has been the predominant factor in evolution within genotypes, with positive selection detected only marginally. A previous study [Arav-Boger et al., 2005] involving 30 sequences also concluded that UL146 has evolved under constraint. The sequences of hypervariable genes are stable on short timescales in patients [Stanton et al., 2005] and cell culture [Lurain et al., 2006], consistent with the perception of herpesvirus genomes as relatively slowly evolving [McGeoch et al., 2006]. The most likely scenario for the evolution of HCMV emerging from the literature and from the present study is that the genotypes developed in early human populations (or even earlier), becoming established via founder or bottleneck effects, and have spread, recombined and mixed worldwide in more recent times, with mixed infections being common.

ACKNOWLEDGMENTS
IJK was a recipient of a FEMS Research Fellowship and a FEMS-ESCMID Joint Fellowship, and KRA was a recipient of a DAAD Fellowship (German Academic Exchange Service). We thank Mark Schleiss for providing the virus from which one of the samples (a BAC) was generated. We are grateful to Duncan McGeoch for comments on a draft of the manuscript.