Correspondence: Isak S. Pretorius, The Australian Wine Research Institute, PO Box 197, Glen Osmond, Adelaide, SA 5064, Australia. Tel.: +61 8 83036610; fax: +61 8 83036601; e-mail: firstname.lastname@example.org
Many industrial strains of Saccharomyces cerevisiae have been selected primarily for their ability to convert sugars into ethanol efficiently despite exposure to a variety of stresses. To begin investigation of the genetic basis of phenotypic variation in industrial strains of S. cerevisiae, we have sequenced the genome of a wine yeast, AWRI1631, and have compared this sequence with both the laboratory strain S288c and the human pathogenic isolate YJM789. AWRI1631 was found to be substantially different from S288c and YJM789, especially at the level of single-nucleotide polymorphisms, which were present, on average, every 150 bp between all three strains. In addition, there were major differences in the arrangement and number of Ty elements between the strains, as well as several regions of DNA that were specific to AWRI1631 and that were predicted to encode proteins that are unique to this industrial strain.
Saccharomyces cerevisiae can be regarded as the first domesticated microorganism. For thousands of years, humans have used this yeast for baking, brewing and winemaking due to its ability to ferment sugars into alcohol and carbon dioxide efficiently (Verstrepen et al., 2006). Over time, passaging and selection strategies have improved the performance of these industrial strains, producing genetically distinct yeasts that are highly suited for specific industrial applications.
Variation between strains of S. cerevisiae represents an attractive model for research on the role of selection in shaping a eukaryotic genome. Furthermore, while industrial strains represent a diverse group of organisms, many other strains of S. cerevisiae have been isolated for use in nonindustrial settings, such as fundamental biological research. These laboratory yeasts represent important ‘out-groups’ for comparison of industrial strains. While industrial strains were isolated for their fermentative ability under stressful environmental conditions (low pH, low nutrient availability, high ethanol concentrations and fluctuating temperatures), laboratory strains, such as S288c, were selected for fast and consistent growth in nutrient-rich laboratory media (Mortimer & Johnston, 1986). Comparison of genome structures between industrial and laboratory strains should, therefore, highlight how differences in selection regimes have differentially shaped the genomic structure of these strains and assist in the identification of genomic loci that play important roles in regulating key industrial phenotypes.
Indeed, comparative genomic studies of yeast strains have already shown that there can be substantial nucleotide variation within the S. cerevisiae species. A recent genome sequencing study revealed that S. cerevisiae strain YJM789, which was isolated from the lung of an AIDS patient, displayed around 70 000 discrete nucleotide variations when compared with S288c, in addition to several ORFs that were unique to YJM789 (Wei et al., 2007). It has also been estimated that nucleotide variation among laboratory strains can be as high as one variant nucleotide in every 300 bp of sequence (Schacherer et al., 2007). This nucleotide variation, when translated into phenotypic differences, most likely underlies the ability of yeast to adapt to varied environmental conditions and to be efficiently selected and bred for the many different roles for which yeast is currently used. Thus, the identification and characterization of the genomic diversity that exists within S. cerevisiae is vital to understanding the selective potential of this species.
To begin the investigation into how industrial yeasts differ from other strains of S. cerevisiae, we have sequenced the genome of a haploid wine yeast, AWRI1631, and compared this sequence with both the laboratory strain S288c and the pathogenic isolate YJM789 (Goffeau et al., 1996; Wei et al., 2007). AWRI1631 was shown to be substantially different from both strains, especially at the level of single-nucleotide polymorphisms (SNPs), which were present, on average, every 150 bp. There were also major differences in the arrangement and number of Ty elements between the strains, in addition to several regions of DNA that were specific to AWRI1631.
Materials and methods
Isolation of the haploid wine yeast AWRI1631
The haploid S. cerevisiae strain AWRI1631 (MATa; ΔHO) is descended from the diploid industrial wine strain N96 (similar to strains known in the trade as EC1118 and Prise de Mousse). Briefly, N96 (Anchor Yeast, Cape Town, South Africa) was sporulated and haploid progeny were isolated using microdissection. As N96 is homothallic, these haploid progeny formed homozygous diploid strains via mating-type switching and several of these homozygous diploid lines were isolated for further manipulation. Homologous recombination was then used to delete the HO locus from each of these strains (Walker et al., 2005). The deletion strains were then resporulated and stable haploid (ΔHO) progeny were isolated. Several progeny from each homozygous diploid line were then tested for their winemaking properties, with AWRI1631 selected as being one of several strains with winemaking characteristics matching those of the original diploid parent (data not shown).
Genome sequencing, assembly and analysis
Shotgun genome sequencing was performed using the GS-FLX system (Roche, Mannheim, Germany) by the Australian Genome Research Facility. Sequence contigs were assembled de novo and scaffolded using the existing S288c sequence (Goffeau et al., 1996). Sequences were then manually edited using seqman pro (DNAstar, Madison) with data from Sanger sequencing reactions used to clarify several problematic regions. Final alignments between the S288c (Goffeau et al., 1996), YJM789 (Wei et al., 2007) and AWRI1631 genomes were performed using mavid (Bray & Pachter, 2004) with nucleotide variation (SNPs, insertions and deletions) catalogued using seqman pro. Final sequence contigs have been deposited at DDBJ/EMBL/GenBank under the Whole Genome Shotgun project accession number ABSV00000000. The version described in this study is the first version, ABSV01000000.
Genome sequencing provides the most complete understanding of the genome structure of an organism and allows for the most in-depth comparisons to be made between related species. A haploid wine yeast, AWRI1631, was recently developed for use as a model strain for the analysis of wine yeast systems biology (A.R. Borneman et al., unpublished data). AWRI1631 was produced from the common commercial wine strain N96, which, like most industrial yeasts, is a homothallic diploid. This haploid strain has retained the robust fermentation kinetics of its parent while producing wine with a composition and flavour profile that is also equivalent to N96. However, due to its stable haploid genome, it is far easier to manipulate genetically.
In order to provide a sound basis for future investigations, in addition to obtaining a thorough understanding of the genomic differences between an industrial wine yeast and a laboratory counterpart, we sequenced the genome of this model wine yeast strain. AWRI1631 was subjected to one round of GS-FLX sequencing, producing approximately sevenfold genomic coverage (supporting Table S1). Following de novo assembly, 2971 contigs were aligned to the existing S288c genome sequence and manually edited. This reduced the total number of contigs to 2489, which cover 11 088 986 bp of the S288c genome (92%), in addition to 113 115 bp of sequence, which could not be assigned to any of the S288c chromosomes.
Initial nucleotide alignments of AWRI1631 and S288c showed that there were 68 290 instances of nucleotide variation between the strains. The majority of these events (57 463) were SNPs, with the remainder comprising insertions and deletions (indels). While there was, on average, a nucleotide variant every 162 bp between the strains, neither SNPs nor indels were evenly distributed across the genome. Several regions displaying substantially increased polymorphism density were interspersed among regions of relatively high interstrain conservation (Fig. 1a–c).
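The average spacing quoted above can be checked with a short calculation. The sketch below assumes that the denominator is the 11 088 986 bp of S288c sequence covered by AWRI1631 contigs; the text does not state the denominator explicitly, so this is an assumption, although it does reproduce the reported figure. The function and variable names are illustrative only.

```python
# Figures quoted in the text for the AWRI1631 vs. S288c comparison.
ALIGNED_BP = 11_088_986   # S288c bases covered by AWRI1631 contigs (assumed denominator)
TOTAL_VARIANTS = 68_290   # SNPs plus indels reported by the initial alignment


def bp_per_variant(aligned_bp, variants):
    """Average number of aligned bases per variant."""
    return aligned_bp / variants


def percent_divergence(aligned_bp, variants):
    """Variants expressed as a percentage of aligned bases."""
    return 100.0 * variants / aligned_bp


spacing = bp_per_variant(ALIGNED_BP, TOTAL_VARIANTS)
divergence = percent_divergence(ALIGNED_BP, TOTAL_VARIANTS)
print(f"one variant every {spacing:.0f} bp ({divergence:.2f}% of aligned bases)")
```

Under this assumption the spacing rounds to 162 bp, which corresponds to roughly 0.6% of aligned positions.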
Remarkably, the level of nucleotide polymorphism observed between AWRI1631 and S288c (0.6%) is very similar to that separating S288c and YJM789, a strain of S. cerevisiae isolated from the lung of an AIDS patient (Wei et al., 2007). To determine whether the similar nucleotide divergence observed between the laboratory and nonlaboratory isolates was due to AWRI1631 and YJM789 sharing a common set of SNP alleles relative to S288c, we performed a three-way genomic comparison of the strains. The three yeast genomes were aligned using mavid (Bray & Pachter, 2004) and each chromosome was then cropped to exclude the telomeric regions which, due to their repetitive nature, were generally absent or misaligned in either AWRI1631 or YJM789. Analysis of the three-way alignments showed that SNP density between the three strains was approximately equal, although AWRI1631 displayed a slightly higher level of nucleotide divergence when compared with both S288c and YJM789 (Fig. 1d, Table S2). The three strains, therefore, represent equally distant lineages, with the two nonlaboratory strains being no closer at the nucleotide level than their laboratory counterpart.
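The pairwise comparison underlying a three-way analysis of this kind can be illustrated with a minimal sketch. It is not the procedure used with mavid and seqman pro in this study; it simply counts mismatched columns for each pair of rows in a multiple alignment that has already been produced, skipping columns in which either row has a gap or an ambiguous base. The toy sequences and the function name are hypothetical.

```python
from itertools import combinations


def pairwise_snp_counts(alignment):
    """Count mismatched columns for every pair of aligned sequences.

    `alignment` maps strain name -> gapped sequence; all rows must have
    the same length. Columns where either row has '-' or 'N' are skipped.
    Returns {(strain_a, strain_b): (snps, columns_compared)}.
    """
    if len({len(seq) for seq in alignment.values()}) != 1:
        raise ValueError("aligned sequences must have equal length")
    counts = {}
    for a, b in combinations(sorted(alignment), 2):
        snps = compared = 0
        for x, y in zip(alignment[a].upper(), alignment[b].upper()):
            if x in "-N" or y in "-N":
                continue
            compared += 1
            if x != y:
                snps += 1
        counts[(a, b)] = (snps, compared)
    return counts


# Toy three-way alignment (invented sequences, not real data).
toy = {
    "S288c":    "ACGTACGTAC-TAGC",
    "YJM789":   "ACGTTCGTACGTAGC",
    "AWRI1631": "ACCTACGTACGTAGA",
}
for pair, (snps, compared) in pairwise_snp_counts(toy).items():
    print(pair, f"{snps} SNPs in {compared} compared columns")
```

Dividing each SNP count by the number of compared columns gives a per-pair density that can be compared across the three strain pairs.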
Intraspecific differences in large genomic rearrangements
In addition to SNP variation between strains, there were 8155 differences between AWRI1631 and S288c due to insertion and deletion events. These ranged from single base pair mutations (5637) to large indels, which covered up to several kilobases each (Fig. 2a). The location of 3299 of these indels overlapped with indels present in YJM789 when compared with S288c, and 2978 matched YJM789 indels in both their exact location and size. The large indels were generally associated with the differential presence of Ty transposons in the genomes of each strain (Fig. 2b–d), although there was at least one large insertion shared between AWRI1631 and YJM789 that is not associated with a transposable element. Of the 52 Ty elements in the S288c genome, there were clear data to indicate that 42 were absent from the AWRI1631 genome, with the majority of these also being absent in YJM789. In addition to those Ty elements found in S288c, there were seven additional Ty elements present in the genome of YJM789, of which only one was present in AWRI1631.
While very similar in their Ty content and distribution, YJM789 and AWRI1631 differ in the presence of a large inversion on chromosome XIV and an interchromosomal translocation between chromosomes VI and X, both of which are present in YJM789 but absent from both S288c and AWRI1631 (Wei et al., 2007).
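The distinction drawn above between indels that overlap in location and indels that match in both location and size can be sketched as follows. The record layout (closed S288c coordinates, signed size) and the function name are assumptions made for illustration, not the format produced by the software used in this study.

```python
from collections import namedtuple

# Hypothetical record: closed interval [start, end] in S288c coordinates;
# size > 0 for an insertion relative to S288c, size < 0 for a deletion.
Indel = namedtuple("Indel", "chrom start end size")


def classify_shared(query, reference):
    """Return (overlapping, exact) counts for `query` indels.

    An indel is 'overlapping' if its interval intersects any reference
    indel on the same chromosome, and 'exact' if an identical record
    (same chromosome, coordinates and size) exists in the reference.
    """
    by_chrom = {}
    for ref in reference:
        by_chrom.setdefault(ref.chrom, []).append(ref)
    overlapping = exact = 0
    for q in query:
        hits = [r for r in by_chrom.get(q.chrom, [])
                if r.start <= q.end and q.start <= r.end]
        if hits:
            overlapping += 1
        if any(r == q for r in hits):
            exact += 1
    return overlapping, exact


# Invented example records, each relative to S288c.
awri = [Indel("IX", 100, 100, 3), Indel("IX", 500, 520, -21), Indel("X", 40, 45, -6)]
yjm = [Indel("IX", 100, 100, 3), Indel("IX", 510, 530, -21), Indel("XIV", 40, 45, -6)]
print(classify_shared(awri, yjm))
```

The linear scan per chromosome is adequate for a few thousand indels; a sorted list or interval tree would be preferable for much larger sets.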
Gene content of ‘natural’ vs. laboratory isolates
In order to obtain ORF sequences from AWRI1631 for comparison with those of other strains, the gene prediction program glimmer (Delcher et al., 1999) was used to identify a total of 5687 ORFs from those AWRI1631 sequences that could be aligned to the S288c genome (Table S3). As the AWRI1631 genome sequence is comprised of a large number of separate sequencing contigs, there were many instances where genomic regions from AWRI1631 that were homologous to a single S288c ORF spanned a sequencing gap. In these situations, there were often multiple, nonoverlapping ORFs in the AWRI1631 sequence that each matched to different parts of a single S288c ORF such that 5204 (90%) of the AWRI1631 ORFs could be matched to 4634 S288c proteins by amino acid homology. This represents 81% coverage of the predicted S288c proteome if dubious ORFs are disregarded. From the remaining 483 AWRI1631 proteins, blast searches performed against the nonredundant GenBank protein database showed that these either lacked homology to any other protein sequence or represented matches to dubious S288c ORFs and are most likely nonfunctional. The only exception to this was orf_09_0030, which is predicted to encode a protein homologous to the combined products of the S288c genes YIL167W and YIL168W. These two genes are presumed to be nonfunctional in S288c, but are fused in some S. cerevisiae strains and other members of the Saccharomyces sensu stricto group where the hybrid gene encodes an l-serine dehydratase (Seufert, 1990; Kellis et al., 2003; Godard et al., 2007). Like these strains, AWRI1631 also appears to contain a functional YIL167W–YIL168W serine dehydratase enzyme that is absent from S288c.
Conserved genomic loci
Before examining the levels of protein conservation for loci conserved between AWRI1631 and S288c, we first assessed whether there was any evidence for the selection of specific protein variation at the nucleotide level through the analysis of nonsynonymous mutation (dN) rates (Yang & Bielawski, 2000). clustalw protein alignments of each pair of AWRI1631 and S288c orthologues were converted to codon-gapped nucleotide alignments (Suyama et al., 2006) and the number of nonsynonymous mutations was determined (Yang, 1997). While there was a large amount of variation in nonsynonymous mutation rates, when these individual values were grouped by GO categories, only a few groups displayed a concerted difference in their rates of nonsynonymous mutation, which could indicate the presence of selective pressure on their protein-coding sequences (Fig. 3). Of these groups, those involved in translation (GO terms ribosome, translation regulator activity and translation) showed the largest effect, with these three groups consistently displaying a significant decrease in the frequency of nonsynonymous mutations. This is consistent with the well-known constraints that exist on the sequence of these highly conserved proteins. Of the remaining GO categories, proteins located in the cell wall (GO component) and those involved in signal transduction (GO function) displayed slightly higher nonsynonymous mutation rates compared with the other categories. Unlike the situation observed for the ribosomal proteins, increased levels of nonsynonymous mutations may indicate diversifying selection that favours protein sequences other than those present in S288c.
When the amino acid, rather than nucleotide, sequences of each pair of AWRI1631 and S288c orthologues were compared, the two predicted proteomes were found to be very similar, displaying an average amino acid identity of 99.3% and a similarity of 99.6%. However, there were large variations in the level of conservation observed on a protein-by-protein basis (Fig. 4). Accompanying the differences due to amino acid substitutions, there were also 1118 and 54 instances where the predicted protein products of AWRI1631 were truncated or extended at either the amino or carboxyl termini by more than 10% compared with S288c, respectively. It must be noted, however, that many of the truncations observed in AWRI1631 might be due to gaps in sequencing contigs as 636 lack either an intact start or stop codon, indicating that they were located at the extremities of their respective contig sequences.
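The two steps described above, threading codons onto a gapped protein alignment and then tallying amino acid-changing differences, can be illustrated with a minimal sketch. This is a simple codon-by-codon count, not the likelihood-based estimation of Yang (1997) and not the software of Suyama et al. (2006); it does not normalize by the number of synonymous and nonsynonymous sites. The function names and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def thread_codons(gapped_protein, cds):
    """Map a coding sequence onto one row of a gapped protein alignment.

    Each residue receives its codon and each gap becomes '---'.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(ch != "-" for ch in gapped_protein)
    if len(codons) < residues:
        raise ValueError("coding sequence is shorter than the protein")
    it = iter(codons)
    return ["---" if ch == "-" else next(it) for ch in gapped_protein]


def count_codon_changes(codons_a, codons_b):
    """Return (synonymous, nonsynonymous) counts of differing codon pairs."""
    syn = nonsyn = 0
    for a, b in zip(codons_a, codons_b):
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # identical, gapped or ambiguous codons are ignored
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy orthologue pair (invented sequences).
row_a = thread_codons("MK-LV", "ATGAAACTTGTT")
row_b = thread_codons("MKALI", "ATGAAGGCTCTTATT")
print(count_codon_changes(row_a, row_b))
```

In the toy pair, AAA to AAG leaves lysine unchanged (synonymous), whereas GTT to ATT replaces valine with isoleucine (nonsynonymous).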
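The per-protein measures discussed above can be sketched in a few lines. The example assumes that identity is computed over alignment columns in which both proteins have a residue, and that the 10% threshold is measured relative to the length of the S288c protein; the text specifies neither detail, so both are assumptions. The sketch also considers overall length only, rather than which terminus is affected.

```python
def percent_identity(aln_a, aln_b):
    """Percentage of identical residues over columns without a gap in either row."""
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def length_class(query_len, ref_len, threshold=0.10):
    """Classify a predicted protein by its length relative to the reference."""
    change = (query_len - ref_len) / ref_len
    if change < -threshold:
        return "truncated"
    if change > threshold:
        return "extended"
    return "similar"


# Invented examples.
print(percent_identity("MKT-LV", "MKSALV"))
print(length_class(80, 100), length_class(115, 100), length_class(95, 100))
```

A fuller treatment would also check for an intact start and stop codon, since predictions lacking either may simply have run into the end of a contig.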
Nonconserved genomic loci
From the 10 large insertions in AWRI1631 that were shared with YJM789, only one region, located on chromosome IX, was not primarily composed of transposon-related sequences. This region is predicted to encode the KHR1 heat-sensitive killer toxin, a gene that has been identified previously as being present in YJM789 but not in S288c (Goto et al., 1990; Wei et al., 2007). In addition to insertions that were shared with YJM789, there were another 10 AWRI1631-specific insertions that were not Ty associated. However, none of these regions were predicted to contain functional ORFs.
As mentioned previously, in addition to AWRI1631-specific genomic regions that could be mapped to a chromosomal location by conserved flanking regions, there were 29 contigs that could not be assigned to a particular chromosome. This 113 kb of AWRI1631-specific sequence was predicted to encode an additional 37 proteins, of which 27 had a significant (E < 1e-10) match to at least one other protein in the GenBank nonredundant protein database (Table 1). Despite the fact that these 27 predicted ORFs all lie within the AWRI1631 sequence that could not be clearly aligned to the S288c genome, 20 of the 27 showed significant similarity to S288c protein sequences. Of these 20 homologous ORFs, nine represent a highly conserved, reciprocal best match of an S288c protein. As such, the presence of the conserved ORF sequence within divergent telomeric sequences is probably interfering with the ability to accurately locate the surrounding contig with respect to the S288c genome (Table 1; S288c orthologues). Thus, while these ORFs most likely represent orthologues of S288c proteins, they are located within a sequence that is highly divergent between AWRI1631 and S288c.
Table 1 (surviving fragments): hypothetical protein Kpol_1004p1, Vanderwaltozyma polyspora DSM; chrI telomere, cell wall mannoprotein.
Unlike the nine ORFs that represented reciprocal best hits of an S288c protein, the 11 remaining AWRI1631-specific ORFs (out of the 20 that displayed homology to proteins from S288c) were nonreciprocal best hits, with at least one other AWRI1631 ORF showing significantly higher similarity to the same S288c sequence (Table 1; S288c paralogues). Thus, these 11 ORFs appear to encode distinct paralogues of an S288c–AWRI1631 orthologue pair and are more closely related to proteins from species other than S. cerevisiae (Fig. 4). These proteins include a predicted aspartate transaminase homologous to the protein encoded by YNL027C from S288c (Fig. 5a) and three ORFs that are located on a single, 15.5-kb AWRI1631-specific contig and that are homologous to YBR043C (major facilitator drug transporter), YNL247C (glyoxylate reductase) and YKL215W (5-oxo-l-prolinase) from S288c, respectively (Fig. 5b). The presence of this gene cluster is particularly interesting because, while all three of these neighbouring genes show their highest homology to sequences outside S. cerevisiae, there is no clear single species to which all three genes show the greatest similarity. Thus, the origin of this cluster remains to be determined. It is impossible to ascertain whether these genes were present in the common ancestor of AWRI1631, S288c and YJM789 and then subsequently lost in all but AWRI1631, or whether the genes were acquired by gene duplication (followed by sequence divergence) or by horizontal gene transfer.
The seven remaining AWRI1631-specific ORFs displayed significant homology (E < 1e-10) only to proteins not present in S288c (Table 1). While the predicted function of many of these proteins remains elusive due to a lack of characterized homologous sequences, one putative protein is predicted to have a role in neutral amino acid transport (orf_c1103_0030), while a second (orf_c1110_0010) is predicted to encode a highly conserved homologue of the Mpr1 protein of S. cerevisiae strain Σ1278b (Shichiri et al., 2001). MPR1 was shown to be specific to Σ1278b when compared with S288c and has been implicated in tolerance to both ethanol and cold stress (Du & Takagi, 2005, 2007). This gene may, therefore, play a role in providing increased stress tolerance to wine yeasts when compared with laboratory strains.
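The orthologue/paralogue distinction based on reciprocal best hits can be illustrated with a minimal sketch. It assumes that similarity scores (for example, blast bit scores) are already available in both search directions; the ORF names, scores and function names below are hypothetical, and the sketch does not apply the significance threshold used in the study.

```python
def best_hit(hits):
    """Return the key with the highest score, or None if there are no hits."""
    return max(hits, key=hits.get) if hits else None


def classify_orfs(awri_to_s288c, s288c_to_awri):
    """Label each AWRI1631 ORF by its relationship to S288c proteins.

    Both arguments map a query name to {subject name: similarity score}.
    """
    labels = {}
    for orf, hits in awri_to_s288c.items():
        target = best_hit(hits)
        if target is None:
            labels[orf] = "no S288c homologue"
        elif best_hit(s288c_to_awri.get(target, {})) == orf:
            labels[orf] = "orthologue"   # reciprocal best hit
        else:
            labels[orf] = "paralogue"    # another AWRI1631 ORF matches better
    return labels


# Invented scores: two AWRI1631 ORFs both resemble YNL027C, but only one
# of them is the best AWRI1631 match of YNL027C in the reverse search.
forward = {
    "orf_core": {"YNL027C": 850.0},
    "orf_extra": {"YNL027C": 410.0},
    "orf_novel": {},
}
reverse = {"YNL027C": {"orf_core": 850.0, "orf_extra": 410.0}}
print(classify_orfs(forward, reverse))
```

With these inputs, orf_core is labelled an orthologue, orf_extra a paralogue and orf_novel as having no S288c homologue.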
This study has shown that while the yeast genome displays substantial conservation throughout a core set of genes, there are many regions that display high degrees of interstrain variation. These variable regions include genes with increased levels of nucleotide substitutions, in addition to entire strain-specific loci. These strain-specific and highly variable genes may provide the means for phenotypic diversification and specialization without impeding the basic growth and metabolism of the yeast cell. However, despite the large amount of data produced in this study, it is still unclear as to exactly which genomic variation is most important for the phenotypic divergence between wine yeasts and other S. cerevisiae strains.
Sequencing of additional strains of S. cerevisiae will be required to recognize the full size and scope of the ‘global’ S. cerevisiae genome. These genome sequences will also provide important points of comparison for identifying variation that can be confidently associated with desirable phenotypic traits. It is fortunate that there is now at least one large-scale study underway that is looking to broadly catalogue this variation through low-coverage sequencing of a large number of S. cerevisiae strains (http://www.sanger.ac.uk/Teams/Team71/durbin/sgrp/). However, with rapid advances in DNA sequencing technology (Bennett, 2004; Margulies et al., 2005), the ability to fully sequence large numbers of yeast genomes is now within the reach of individual laboratories, thereby providing researchers with the ability to characterize large numbers of yeast strains at the genomic level.
The Australian Wine Research Institute (AWRI), a member of the Wine Innovation Cluster in Adelaide, is supported by Australia's grapegrowers and winemakers through their investment body the Grape and Wine Research Development Corporation with matching funding from the Australian Government. Systems biology research at the AWRI was performed using resources provided as part of the National Collaborative Research Infrastructure Strategy, an initiative of the Australian Government, in addition to funds from the South Australian State Government.