In the original S. cerevisiae genomic annotation (c. 1993–1996), protein-encoding genes were simply annotated as the longest possible open reading frame of 100 or more codons. These annotations have now been subjected to a decade of testing by thousands of scientists worldwide, using a wide range of experimental and comparative methods. In particular, the genome-wide comparisons published by Brachat et al. (2003), Cliften et al. (2003), and Kellis et al. (2003) provided an excellent opportunity to review the entire S. cerevisiae gene model, both in sequence and interpretation. In these studies, the sequenced species were so closely related to S. cerevisiae as to allow the expectation of very close conservation of ORF size, location and intron/exon structure. Not surprisingly, there have been many suggested changes: new ORFs have been identified, and existing ORFs have been ‘removed’ and revised (Figure 1).
Most newly identified ORFs have been smaller than 100 codons. This is simply because the S. cerevisiae genome sequencing project did not annotate ORFs of fewer than 100 codons unless they had significant sequence similarity to a previously identified gene. This approach was necessary because there is a high probability that ORFs of this size are just fortuitous sequences of nucleotides: only 342 (2%) of the 15 000 ORFs in the genome between 50 and 99 codons in length are currently thought to encode proteins within the yeast cell. As a consequence, any ORF under 100 codons is treated as spurious until proved otherwise through either experimental or comparative work.
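The original annotation rule (longest possible ORF, at least 100 codons) can be illustrated with a minimal sketch. This is not the code used by the sequencing project; the function names, the choice to count codons without the stop codon, and the reporting of coordinates on the scanned strand are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orfs(seq, min_codons=100):
    """List the longest ORF ending at each stop codon, in all six frames.

    Each hit is (strand, frame, start, end, n_codons). Coordinates are
    0-based, end-exclusive, and relative to the strand that was scanned
    (an illustrative convention). n_codons excludes the stop codon,
    which is an assumption about how the threshold is counted.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            first_atg = None  # most upstream ATG since the previous stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and first_atg is None:
                    first_atg = i
                elif codon in STOP_CODONS:
                    if first_atg is not None:
                        n_codons = (i - first_atg) // 3
                        if n_codons >= min_codons:
                            hits.append((strand, frame, first_atg, i + 3, n_codons))
                    first_atg = None
    return hits


# Toy sequence with two in-frame ATGs: only the upstream one is reported,
# mirroring the "longest possible ORF" convention.
toy = "ATG" + "GCT" * 4 + "ATG" + "GCT" * 2 + "TAA"
print(longest_orfs(toy, min_codons=5))
```

Because the most upstream ATG is always chosen, this rule systematically favours the longest reading frame, which is why some start codons later required revision.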
However, length alone does not guarantee that an ORF is genuine, and the total number of biologically significant S. cerevisiae ORFs has been the subject of debate since the completion of the genomic sequence (Termier and Kalogeropoulos, 1996; Zhang and Wang, 2000; Malpertuy et al., 2000; Wood et al., 2001; Mackiewicz et al., 2002; Brachat et al., 2003; Cliften et al., 2003; Kellis et al., 2003). At the heart of this debate is the basic principle that it is virtually impossible to demonstrate experimentally that an ORF is non-functional; there is always a chance that a suspect ORF encodes a protein of extremely low abundance or that is produced only under some specific environmental condition. Fortunately, the availability of genomic sequences from other fungi provides a positive test for the relevance of experimentally uncharacterized ORFs: evolutionary conservation among very closely related species. This has allowed for a separation of significant ORFs from those that are likely to be spurious.
Even many bona fide ORFs have required updating. Revisions of ORF annotation fall into two major categories: those in which the nucleotide sequence is corrected; and those in which the nucleotide sequence remains the same but its interpretation is altered. Changes in the first category often affect the start codon, stop codon, reading frame or coding sequence for that ORF, while changes in the second category include annotation of different start codons and intron/exon structure.
Although automated data processing is an important element in the process of revising and updating genomic sequence annotation, human evaluation is also essential. In making any changes to the genome sequence, SGD curators evaluate and synthesize all available types of evidence, including that generated by individual gene-specific experiments, by large-scale analyses and by cross-species comparisons.
Because SGD strives to provide rapid access to new information, individual updates are integrated into the genome sequence and released to the community as soon as possible. As a result, genome updates have been made gradually and released continually, rather than as rare scheduled updates encompassing multiple changes. While this approach provides the fastest means of disseminating the updates, alerting the research community to the changes has proven to be a continuing challenge. Here, we describe the types of changes that have been incorporated into the S. cerevisiae genome annotation, how SGD handles each type of change and how the research community can access the updated information.
Results and discussion
Over the last decade, 522 new ORFs have been added to the S. cerevisiae gene catalogue. Prior to the year 2001, most new small ORFs were discovered individually during the course of focused experimental research. These ORFs were annotated because they encoded proteins that were isolated from complexes (e.g. TIM9/YEL020W-A; Koehler et al., 2000), discovered in traditional genetic screens (e.g. SAE3/YHR079C-A; McKee and Kleckner, 1997) or identified in focused comparative analyses (e.g. YAL044W-A; Valerie Wood, personal communication). More recently, researchers have applied large-scale approaches, both computational and experimental, to the problem of finding the biologically significant small ORFs (Basrai et al., 1997; Blandin et al., 2000; Kumar et al., 2002; Oshiro et al., 2002; Brachat et al., 2003; Cliften et al., 2003; Kessler et al., 2003). These large-scale studies produced 65% of the new additions to the S. cerevisiae ORF catalogue.
SGD curators examined each proposed new ORF to ensure its validity as a potential gene. In most instances, the new ORF was accepted as proposed, but some cases required more extensive analysis. For example, several of the new ORFs proposed by Blandin et al. (2000), Brachat et al. (2003) and Cliften et al. (2003) contained introns; while these three groups often predicted new intron-containing ORFs in the same regions, they sometimes differed on the exact location of the exon/intron boundaries. These conflicts were resolved by examining the sensu stricto Saccharomyces data published by Kellis et al. (2003) and determining which proposed exon/intron structure was conserved in other closely related species. In a few other cases, the new ‘ORFs’ were subsequently shown to be part of previously annotated ORFs rather than independent new ORFs.
Classification of open reading frames
The ascomycete species sequenced by Brachat et al. (2003), Cliften et al. (2003) and Kellis et al. (2003) largely contain the same ORFs as does S. cerevisiae, in the same order. Thus, lack of conservation in the closely related species constitutes evidence against the biological significance of an S. cerevisiae ORF. All three of these groups applied this test independently, using their own datasets, and generated three partially overlapping lists of potentially spurious ORFs. Brachat et al. (2003), Cliften et al. (2003) and Kellis et al. (2003) recommended that 368, 496 and 515 ORFs, respectively, be deleted.
Because even sophisticated computation is no substitute for actual laboratory experiments, SGD takes a cautious approach towards the removal of ORFs from the S. cerevisiae genomic catalogue. ORFs recommended for deletion are not actually eliminated from the genome annotation, but are simply labelled ‘dubious’. This approach results in an S. cerevisiae gene model of relatively high certainty, while still allowing further testing on the set of questionable, ‘dubious’ ORFs. The ‘dubious’ designation is prominently displayed on Locus Summary pages and is indicated by colour on graphical displays of chromosome maps. Dubious ORFs are also excluded from sets of ORFs considered biologically significant; they are not included in the comprehensive file of S. cerevisiae Gene Ontology annotations (gene_association.sgd) that SGD provides to the public, and they are not included in the S. cerevisiae reference sequence (RefSeq) entries that SGD maintains and provides to NCBI.
During the initial analysis, individual ORFs were designated ‘dubious’ if they met all of the following criteria: (a) the ORF was identified as potentially spurious by at least one of the comparative studies above; (b) there were no well-controlled, small-scale, published experiments demonstrating that detectable mRNA or protein was produced from this ORF; (c) any mutant phenotype described for the ORF could be ascribed to mutation of an overlapping gene; and (d) the ORF did not contain an intron. The last condition was necessary because none of the three groups annotated introns in the related fungal species, and comparison of ‘spliced’ S. cerevisiae ORFs with exon fragments in other species could result in the artificial appearance of non-conservation. The majority of the ORFs identified as spurious by Brachat et al. (2003), Cliften et al. (2003) and Kellis et al. (2003) met these four criteria and were assigned a ‘dubious’ designation by SGD. For a small number of the ORFs identified as spurious, however, SGD curators found evidence suggesting that they represented functional genes. For example, all three groups concluded that AUA1/YFL010W-A is not a protein-encoding ORF because it is not conserved and has substantial overlap with a characterized gene, WWM1/YFL010C. However, the transcription and mutant phenotype of AUA1 have been characterized (Sophianopoulou and Diallinas, 1993) and were not easily attributed to WWM1.
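The four criteria can be read as a single conjunction, as in the following sketch. The record type and field names are hypothetical and do not correspond to any SGD data structure; in practice each flag was the outcome of curator review rather than an automated test.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Hypothetical summary of the evidence reviewed for one ORF."""
    flagged_by_comparative_study: bool  # (a) flagged by at least one study
    has_expression_evidence: bool       # (b) small-scale mRNA/protein data exist
    has_unexplained_phenotype: bool     # (c) phenotype not ascribable to an overlapping gene
    has_intron: bool                    # (d) ORF contains an intron


def meets_dubious_criteria(e: OrfEvidence) -> bool:
    """True only when all four criteria (a)-(d) are satisfied."""
    return (
        e.flagged_by_comparative_study
        and not e.has_expression_evidence
        and not e.has_unexplained_phenotype
        and not e.has_intron
    )


# Flags chosen to mirror the AUA1/YFL010W-A case described in the text:
# flagged as spurious, but with characterized transcription and phenotype.
aua1_like = OrfEvidence(True, True, True, False)
print(meets_dubious_criteria(aua1_like))
```

An ORF with no described mutant phenotype satisfies criterion (c) trivially, which is why the sketch records only whether an unexplained phenotype exists.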
At the same time that SGD began labelling spurious ORFs ‘dubious’, we also implemented a further classification of conserved ORFs, according to the certainty that they actually encode proteins. ORFs that contained an intron, or that were identified as conserved by all three of the large-scale comparative studies, were designated either ‘uncharacterized’ or ‘verified’, depending on available experimental evidence. Because the S. cerevisiae nomenclature system allows yeast ORFs to be assigned a genetic name only after being described in a publication, named ORFs were automatically classified as ‘verified’. Unnamed ORFs were designated ‘uncharacterized’ unless there were published data supporting a ‘verified’ classification, such as mRNA or protein detection, or a mutant phenotype not ascribable to an overlapping gene.
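The rule for conserved ORFs can be sketched in the same spirit. The function and parameter names are illustrative assumptions, and the sketch returns None for ORFs that fall outside the rule described in this paragraph.

```python
from typing import Optional


def classify_conserved_orf(
    has_intron: bool,
    conserved_in_all_three_studies: bool,
    has_genetic_name: bool,
    has_published_evidence: bool,
) -> Optional[str]:
    """Return 'verified' or 'uncharacterized', or None if the rule does not apply.

    has_published_evidence stands for data such as mRNA or protein
    detection, or a mutant phenotype not ascribable to an overlapping gene.
    """
    if not (has_intron or conserved_in_all_three_studies):
        return None  # handled by other criteria, not by this rule
    if has_genetic_name or has_published_evidence:
        # Named ORFs count as verified because a genetic name is
        # assigned only after the ORF is described in a publication.
        return "verified"
    return "uncharacterized"


print(classify_conserved_orf(False, True, False, False))
```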
Unfortunately, the comparative analyses done by Brachat et al. (2003), Cliften et al. (2003) and Kellis et al. (2003) were concurrent with many of the other large-scale analyses that identified new small ORFs. As a consequence, most of these new ORFs have not yet been assessed for conservation in closely related species. In addition, many of the new ORFs overlap with other genes, making analysis of conservation problematic. When clear evidence for conservation was not available, new ORFs that overlapped existing ORFs were assigned ‘dubious’ designations, while all others were classified as ‘uncharacterized’.
Thus, all S. cerevisiae ORFs are now categorized into one of three groups: ‘dubious’, referring to those ORFs that are unlikely to encode a protein; ‘uncharacterized’, those that are likely, but not yet fully established, to encode a protein; and ‘verified’, those for which there is clear experimental evidence for the presence of a protein-encoding gene. It should be noted that these ORF classifications are not static properties and are expected to change as new data become available for each ORF. In the almost 3 years since the original analysis, the classifications of 299 ORFs have been updated; 90% of these changes have been from ‘uncharacterized’ to ‘verified’. Very few ‘dubious’ ORFs (19 of 832 nuclear ORFs) have been reclassified as either ‘uncharacterized’ or ‘verified’. Experimental evidence supporting the validity of these classifications is beginning to accumulate. For example, Raisner et al. (2006) reported that the variant histone protein H2A.Z is associated with the 5′ ends of ‘verified’ and ‘uncharacterized’ ORFs, but not with the 5′ ends of silenced genes or ‘dubious’ ORFs.
Sequence changes and ORF revision
Any large-scale analysis will include some percentage of errors, and large-scale sequencing projects are no exception. During the last decade, a total of 185 ORFs have been revised due to the correction of demonstrated sequencing errors (Figure 2).
The ORF revisions and underlying sequence corrections vary widely in nature. They range from single nucleotide changes that alter the nature of a single critical amino acid (e.g. MCM6/YGL201C; Andrea Duina, personal communication; GenBank Accession No. AY258324); to multiple changes, insertions and deletions resulting in a C-terminal extension and a new stop codon (e.g. SAL1/YNL083W; Belenkiy et al., 2000; Brachat et al., 2003); to the insertion of a 220 bp region that had not been included in the original sequence (HSP150/YJL159W; Moukadiri and Zueco, 2001; Brachat et al., 2003).
As with new small ORFs, the errors in the reference sequence were typically discovered during the course of focused experimental research. However, the recent large-scale genomic comparisons have allowed for much more rapid identification of a particular subset of sequencing errors. When identifying orthologues in closely related species, Blandin et al. (2000), Brachat et al. (2003), Cliften et al. (2003) and Kellis et al. (2003) noticed many cases in which a gene was largely conserved across species in sequence and position, but the S. cerevisiae gene contained extensions or deletions relative to its predicted orthologues, suggesting that sequencing errors might have led to incorrect annotation of its 5′ or 3′ boundary.
In many instances, the authors tested their predictions by resequencing genes themselves (Brachat et al., 2003; Kellis et al., 2003). In some additional cases, other researchers independently predicted, tested and confirmed the same sequencing errors (Schmalix and Bandlow, 1994; Beh et al., 2001; Xiao et al., 1998; Treton et al., 2000; Angus-Hill et al., 2001; Moukadiri and Zueco, 2001; Kaliraman et al., 2001; Palmer et al., 2001; Robben et al., 2002; Jaspersen et al., 2002; Denis and Cyert, 2002; Muller et al., 2003; Charlie Boone, personal communication; Jim Brown, personal communication; Clyde Denis, personal communication; Tim Formosa, personal communication; Claude Gaillardin and Aaron P. Mitchell, personal communication; Gerard Manning, personal communication). In all remaining cases, SGD curators examined and tested the recommended sequence changes. Upon close examination of the sequence alignments and available literature for each gene, some of the proposals were rejected due to inadequate or unconvincing alignments with related fungal sequences, but in most cases, it was straightforward to predict a sequence change that would produce a highly conserved ORF.
Annotation changes and ORF revision
In the original S. cerevisiae genomic annotation, each ORF was simply annotated as the longest possible reading frame. However, comparison with closely related species suggested that for some ORFs, the methionine codon that produced the longest possible reading frame might not actually represent the translational start. In these cases, the conserved start codon in the orthologues aligned with a downstream, in-frame methionine codon, rather than the start codon annotated in S. cerevisiae. Changing the S. cerevisiae annotation to use the downstream, conserved start codon effectively produces a 5′ truncation of these ORFs, relative to their previous annotation. Kellis et al. (2003) recommended 120 such changes. In some cases published data, such as protein size determination or N-terminal sequencing, corroborated the new predictions (Adzuma et al., 1984; Taylor et al., 1987; Dean-Johnson and Henry, 1989; Hanes et al., 1989; Sanni et al., 1991; Wang et al., 1992; Poon and Storms, 1994; Sanders and Herskowitz, 1996; Horazdovsky et al., 1997; Nothwehr and Hindes, 1997; Zheng et al., 1997; Mori et al., 1998; Davis et al., 2000; Kurtz et al., 2002; Willer et al., 2003; Rodney Rothstein, personal communication). In the absence of published experimental data, the new start codon was accepted only if it was the predicted start in at least three of the four available Saccharomyces sensu stricto species (S. bayanus, S. paradoxus, S. mikatae or S. kudriavzevii). Of the recommended start site changes, 87 (72%) met these criteria and were incorporated into SGD. Four more were later added because they were confirmed by Zhang and Dietrich (2005), who also discovered an additional four start codon changes that Kellis et al. (2003) had not predicted. Although this number is small in comparison, it does illustrate the point that the work done by Kellis et al. (2003) was not saturating, and we can expect that focused experimental work may identify even more start codon corrections.
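The acceptance rule for a proposed downstream start codon can be expressed as a short sketch. The function name and the dictionary layout are assumptions for illustration; the decision itself follows the rule stated above (published support, or agreement in at least three of the four sensu stricto species).

```python
def accept_downstream_start(has_published_support, species_agreement, min_agreeing=3):
    """Decide whether a proposed downstream start codon is accepted.

    species_agreement maps a species name to True when the predicted
    start in that species aligns with the proposed downstream ATG.
    """
    if has_published_support:
        # e.g. protein size determination or N-terminal sequencing
        return True
    agreeing = sum(1 for agrees in species_agreement.values() if agrees)
    return agreeing >= min_agreeing


# Hypothetical alignment outcome for one ORF.
example = {
    "S. bayanus": True,
    "S. paradoxus": True,
    "S. mikatae": True,
    "S. kudriavzevii": False,
}
print(accept_downstream_start(False, example))
```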
The original annotation for the budding yeast genome contained 225 genes with introns. Introns are rare in yeast, tend to be in the extreme 5′ end of the gene, and typically include a perfect match to the branch site consensus (UACUAAC; Spingola et al., 1999). Since 1996, only 39 new introns and exons have been identified. The majority of these were identified by Brachat et al. (2003) and Cliften et al. (2003), who proposed that a combined total of 24 existing ORFs be updated with new introns and exons, such that the reading frame of the original ORF was preserved but the new intron and exon effectively added an extension at either the 5′ or the 3′ end. In some instances, the intron/exon predictions were directly tested (Brachat et al., 2003). For the remainder, SGD curators examined the sensu stricto Saccharomyces data published by Kellis et al. (2003), which were not used for the intron predictions by Brachat et al. (2003) and Cliften et al. (2003). The new intron/exon structure was annotated only if the reading frame, the start and stop codons and the branch site splicing signals were conserved in the other species.
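As a simple illustration of one of the signals checked, the sketch below scans a DNA sequence for perfect matches to the branch site consensus. It covers only the branch site; the function name is hypothetical, and a real assessment would also consider the splice sites, reading frame and conservation, as described above.

```python
BRANCH_SITE_RNA = "UACUAAC"  # consensus cited in the text (Spingola et al., 1999)


def find_branch_sites(dna):
    """Return 0-based start positions of perfect branch site matches.

    The consensus is given as RNA, so it is converted to its DNA
    equivalent (TACTAAC) before searching the coding strand.
    """
    motif = BRANCH_SITE_RNA.replace("U", "T")
    dna = dna.upper()
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)
    return positions


print(find_branch_sites("GGTACTAACAGTTTACTAACAG"))
```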
In a few cases, examination of the evidence led to revision of the proposed change. For example, based on sequence conservation between Ashbya gossypii and S. cerevisiae, Brachat et al. (2003) proposed an intron and a new 3′ exon for SEF1/YBL066C. However, when the SEF1 sequences from four Saccharomyces sensu stricto species were compared to S. cerevisiae, the comparative data argued against the presence of an intron. Instead it appeared that the S. cerevisiae sequence contained a large number of sequencing errors in this gene. SGD resequenced the 150 base pairs spanning the divergent region and found that 37 nucleotide insertions and four nucleotide substitutions were necessary to correct the reference sequence. Once these errors were corrected, the S. cerevisiae SEF1 ORF displayed close conservation with the other Saccharomyces sensu stricto orthologues, none of which was predicted to contain an intron.
Sequence and annotation changes are announced regularly on SGD's homepage and in our quarterly newsletter. All changes are also tracked and posted in a more permanent manner, on SGD web pages and at our FTP site.
The Locus Summary page, the basic unit of the SGD website, includes a ‘Sequence Information’ section located near the bottom of the page. This section lists sequence and coordinate details for that feature, including the dates when each was last updated. A detailed description of each update is provided on the Locus History page (accessible from a tab at the top of the Locus Summary page).
The Locus Summary provides focused update information on a gene-by-gene basis, but this information is also available via the web in more comprehensive forms. The Chromosome History pages (http://www.yeastgenome.org/chromosomes) provide a complete list of changes for each chromosome. The Advanced Search tool can be used to generate lists of all currently annotated ORFs of each classification (verified, uncharacterized, dubious) as well as lists of any other type of annotated chromosomal feature.
Comprehensive information is also available for download via the SGD site (ftp://ftp.yeastgenome.org/yeast/). Sequences for these features, as well as for entire chromosomes and intergenic regions, can be found in the ‘genomic_sequence’ directory.
Incorporation of sequence and annotation changes over the past decade has resulted in a significantly more accurate reference sequence for S. cerevisiae. However, although the recent large-scale comparative analyses (Blandin et al., 2000; Brachat et al., 2003; Cliften et al., 2003; Kellis et al., 2003) have provided a bonanza of sequence and annotation corrections, we expect that more errors will be discovered lurking within the reference sequence. The broad scope of these analyses revealed gross errors in genomic annotation, such as mistakes in intron/exon structure or ORF boundaries. A narrower focus will be required for the detection of more subtle errors that likely exist in both coding and intergenic regions, and we anticipate a continually refined reference sequence and its annotation.
This work was made possible by the support of Gavin Sherlock in performing nucleotide sequencing within his laboratory. We appreciate the efforts of the many scientists who have corresponded with SGD about errors in the genome sequence and who have, in many cases, re-sequenced genomic regions in order to ensure the quality of the reference sequence. We thank Maria C. Costanzo and all the members of the SGD group for helpful discussions and critical reading of the manuscript. SGD is supported by a P41 grant, Genome Research Resource (No. HG001315), from the National Human Genome Research Institute at the US National Institutes of Health.
DNA accession numbers
The S. cerevisiae S288C genome sequence and annotations are maintained by SGD and archived at NCBI within the Reference Sequence (RefSeq) collection. The Accession Nos for the 16 nuclear chromosomes and the mitochondrial genome are: NC_001133, NC_001134, NC_001135, NC_001136, NC_001137, NC_001138, NC_001139, NC_001140, NC_001141, NC_001142, NC_001143, NC_001144, NC_001145, NC_001146, NC_001147, NC_001148, and NC_001224.