Deoxyribonucleic acid (DNA) sequences from human chromosomes and chromosomes of other organisms are enabling a detailed look at the structure and organization of protein-coding information in the context of genomes as a whole. Groups of genes can now be examined in relation to broader landscape features such as the guanine plus cytosine (G+C) content, interspersed genome-wide repeats and syntenic relationships between species.
As predicted in the 1950s, bacteria, plants and animals share a genetic language. This language encompasses the genetic code, wherein open reading frames of 61 trinucleotide codons are used to specify the amino acid sequences of proteins. The central dogma postulated by Francis Crick nearly 50 years ago is still predominantly true, namely, that information required for making proteins is stored in DNA, transcribed into portable messenger RNAs and translated into chains of amino acids. These in turn are folded into proteins and protein complexes that function as the molecular machines of the cell. In addition to protein-coding genes, there are genes that only code for ribonucleic acids (RNAs), such as transfer RNAs, ribosomal RNAs and RNAs involved in the splicing machinery. See also Splicing of Pre-
The language of gene regulation (that is, the cellular interpretation of DNA sequence that governs which proteins are made when) is more complex and less well understood than the genetic code. However, it is likely to function by grammatical rules that are shared across diverse species. The desire to crack regulatory codes is motivating the sequencing of genomes that occupy diverse branches of the phylogenetic tree. Evolutionarily conserved blocks of noncoding sequence identified through comparative analyses across species provide starting points for experimentation.
In this overview, the definition of a gene is discussed from the perspective of structural genomics, and results from the sequence of the human genome are summarized.
Anatomy of a Gene
Because controversy surrounds every attempt to define precisely what a gene is, the subject will be approached from the bottom up, by starting with features that structural genomicists consider to be relevant parts of a gene.
Introns, exons and alternative splice forms
In the late 1970s, it was discovered that the open reading frames for genes of multicellular organisms are typically interrupted by regions of noncoding DNA sequence called introns. This means that when the gene is transcribed, the primary transcript spans more genomic terrain than would be indicated by the length of the coding sequence. Introns are initially transcribed but then removed from the mature messenger RNA (mRNA) by a process called splicing. The sequence that remains in the mRNA consists of what are called exons (Figure 1). When spliced correctly, exons reconstruct the open reading frame that contains the coding information required for translation of the mature mRNA into a functional polypeptide.
Analysis of thousands of intron–exon boundaries has shown that there are consensus sequences for splice-site recognition (Figure 2). The bundling of coding sequences into exons increases the complexity of the coding capacity of a genome owing to a process called alternative splicing. Different mature mRNAs can be produced from a primary transcript by leaving out one or more exons or by using different splice sites in an exon (Figure 1). The differences in the polypeptide sequences translated from these mRNAs often result in proteins with diverse functions. For example, alternative splicing can result in both membrane-bound and secreted forms of receptors or can alter the binding properties of receptors, depending on which exons are present in the mature transcript. In some cases (e.g. a family of genes expressed in brain called neurexins) up to thousands of different isoforms (i.e. splice forms) can potentially be generated from the same primary transcript.
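The combinatorial effect of exon skipping can be illustrated with a minimal Python sketch. The function name, the toy exon sequences and the choice of which exons are optional are all hypothetical, and only whole-exon skipping is modeled (alternative splice sites within an exon are ignored).

```python
from itertools import combinations

def skipped_exon_isoforms(exons, optional):
    """Return every mature mRNA obtainable by skipping any subset of the
    optional exons. `exons` lists exon sequences in transcript order and
    `optional` holds the indices of exons that may be left out."""
    optional = sorted(optional)
    isoforms = []
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            kept = [seq for i, seq in enumerate(exons) if i not in skipped]
            isoforms.append("".join(kept))
    return isoforms

# Toy transcript: four exons, of which the two middle ones are optional.
exons = ["AUGGCU", "GAA", "CCG", "UAA"]
isoforms = skipped_exon_isoforms(exons, {1, 2})
for mrna in isoforms:
    print(mrna)
```

With n independently skippable exons this scheme yields 2^n splice forms, so about ten optional exons are already enough to reach a thousand isoforms.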
Starts, stops and untranslated regions of the RNA transcript
Synthesis of a primary transcript for a gene starts with the promoter, or start site, and ends at or near the polyadenylation signal sequence, which tells the cellular machinery to add a stretch of adenosine residues onto the end of the message. Because the primary transcript includes the introns, it can be quite large, sometimes more than a megabase (Mb) for some human genes. After splicing, translation from the mature messenger RNA begins with the codon for methionine, AUG, and ends with one of the three stop codons UAA, UGA or UAG (Figure 3). In addition to coding sequence, primary transcript RNAs contain signal sequences required for proper processing, which includes capping at the 5′ end of the message and polyadenylation at the 3′ end.
The so-called 5′ and 3′ untranslated regions (UTRs) are found at the beginning and end of the mRNA. Depending on where translation begins and ends, the UTRs may be derived from more than one exon and may constitute a considerable portion of the length of the processed transcript. See also mRNA Untranslated Regions (UTRs), 3′ UTRs and Regulation, and 5′-
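Reading a mature mRNA from the start codon to a stop codon can be sketched in Python as follows. The codon table is the standard genetic code written in one-letter amino acid symbols; the function name and the example sequence are illustrative, and taking the first AUG as the start site is a simplifying assumption.

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position;
# "*" marks the three stop codons UAA, UAG and UGA.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon.
    Sequence before the AUG and after the stop is untranslated."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Short untranslated stretches flank the coding sequence AUG GCU GAA UAA.
print(translate("GGCAUGGCUGAAUAAGCC"))
```

Of the 64 entries in the table, 61 specify amino acids and 3 are stop codons.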
Regulatory sequences and networks
Synthesis of RNA requires activity of the basal transcription apparatus of the cell, a complex of proteins that includes RNA polymerase. Transcription starts from a short region of DNA sequence called the promoter. Initiation of transcription is usually regulated by the presence of transcription factors, proteins that bind to specific sequence motifs in the genomic DNA called cis-regulatory elements (Figure 4). Transcriptional regulation is a principal determinant of the amount of RNA that is available for translation into specific polypeptides. If an RNA transcript is present in a given cell at a given time in response to specific stimuli, then the gene that directed the synthesis of that RNA is said to be expressed. Whereas some genes are expressed all the time in all cells because the proteins that they encode are essential to the life of the cell (the so-called housekeeping genes), other genes are expressed only in response to specific needs such as the specialized functions of differentiated cells or the transient timing mechanisms at work in embryonic development.
Determining how gene expression is controlled is a matter of intense interest. Proteins binding to cis-regulatory elements have the effect of either activating transcription or repressing it (Figure 4). For example, a gene may require the presence of three different transcription factors before it can be expressed. For each of these transcription factors to be present, the genes encoding them must in turn be expressed and the proteins must be able to find and bind to the relevant cis-regulatory sequences. Thus, the expression of sets of transcription factors and the organization of cis-regulatory elements into modules are crucial to the regulation of many genes. In addition, expression of a given gene can be regulated by more than one module, thus allowing a flexible response to the needs of the cell.
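The logic of modular regulation can be caricatured as a Boolean rule. In the Python sketch below, which is a deliberately crude model and not a description of any real gene, a module is satisfied when all of its activating factors and none of its repressing factors are present, and the gene is expressed when at least one module is satisfied. The factor names are hypothetical.

```python
def is_expressed(modules, present):
    """Return True if at least one cis-regulatory module is satisfied.
    Each module is a pair (activators, repressors) of sets of factor names;
    `present` is the set of transcription factors available in the cell."""
    for activators, repressors in modules:
        if activators <= present and not (repressors & present):
            return True
    return False

# Hypothetical gene with two modules: the first needs three factors at once,
# the second needs one factor and is blocked by a repressor.
modules = [
    ({"TF_A", "TF_B", "TF_C"}, set()),
    ({"TF_D"}, {"TF_R"}),
]

print(is_expressed(modules, {"TF_A", "TF_B"}))
print(is_expressed(modules, {"TF_A", "TF_B", "TF_C"}))
print(is_expressed(modules, {"TF_D", "TF_R"}))
```

Real regulation is quantitative rather than all-or-nothing, so the sketch captures only the combinatorial idea.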
Cis-regulatory elements are typically found in the genomic DNA sequence adjacent to the 5′ side of the transcription start site. But they can also be found in the introns, 3′ UTRs, the regions between genes and even in other genes. The basic idea is that cis (from Latin, meaning on the same side) regulatory elements are within or in close proximity to the gene and generally refer to features of the DNA sequence.
In contrast, the term trans-regulatory element refers to an entity whose origin lies outside the gene that is being regulated (trans, from Latin, meaning beyond). For example, the transcription factors that bind to cis-regulatory elements can properly be called trans-regulatory elements, because they are encoded by other genes. Other factors such as magnesium ion concentration (which affects protein binding or mRNA stability) have also been referred to as trans-regulatory elements because they can have an effect on the expression of specific genes.
To complicate matters even further, so-called epigenetic phenomena may affect gene expression. For example, some genes are imprinted, which means that they are expressed solely from the maternally inherited or the paternally inherited copy of the chromosome. In general, details pertaining to the tertiary structure of chromosomes can have a decided effect on transcription and, therefore, have a role in gene regulation. See also Gene Expression Databases, and Genomic Imprinting at the Transcriptional Level
Definition of a gene from a structural perspective
There is no universally agreed definition of a gene. Interpreted functionally, genes do something: for example, they transmit inherited traits, make polypeptides or compete for evolutionary dominance. Interpreted structurally, genes are something: for example, they are stretches of nucleic acid sequence whose informational content has the potential to do one of the functions listed above. Because different perspectives on what a gene is give different results for enumerating genes, what is considered to count as a gene must be made clear for a given context of investigation.
Almost any interpretation of the concept of a gene leads to some counterintuitive consequences. For example, if a gene were defined strictly as only those portions of the sequence that encode a specific polypeptide, then introns would not be considered as part of the gene even though they are part of the primary transcript. Moreover, DNA sequence encoding an alternatively spliced exon would be part of the gene when the exon remains in the messenger RNA but not part of the gene when it is spliced out. Along the same line, an exon could be considered as part of several genes if alternative splice forms translate into polypeptides with different amino acid sequences. Because at least 30% of mammalian primary transcripts are alternatively spliced, this would make the number of genes in the human genome both enormous and uncountable in the light of our current incomplete knowledge of alternative splicing.
Alternatively, if a gene were defined loosely as all of the sequence that is required for generating a specific polypeptide, then cis-regulatory elements, promoters, noncoding exons and introns would be included. Arguably, the sequences that regulate and encode all of the relevant transcription factors would have to be included as well, because genes cannot be expressed without them. Given that regulatory sequences can be anywhere, genes would overlap in highly complex structures if this definition were adopted. Indeed, one would be identifying regulatory networks that would colocalize to many chromosomes. Because gene regulation is not well understood, an inclusive definition such as this, although perhaps functionally satisfying, would render the task of structural identification impossible for most genes.
For the purpose of mapping genes and analyzing their structure and organization, the following definition is used: a gene is all of the genomic sequence that would be represented in the primary (i.e. preprocessed) RNA transcript if the gene were expressed. In this interpretation, the 5′ and 3′ UTRs, noncoding exons and introns are included as part of the gene but the regulatory sequences, unless they are transcribed, are not. This definition has one potentially objectionable implication, that is, that all alternative splice forms derived from a given primary transcript originate from the same gene even if different variations of the polypeptide sequence are produced. In other words, one gene can be expressed as more than one polypeptide.
Using this structural definition, the chromosomal location of genes can be identified on the basis of comparisons between the genomic sequences and full-length mRNA transcripts as long as both types of sequence are available. This definition is also useful for organizing data pertaining to gene count, gene size, intron–exon statistics, cross-species gene comparisons and generalizations about gene structure in the context of chromosomal landscape features such as G+C content and genome-wide interspersed repeats. Thus, it will be used for the remainder of this overview.
Features of Genes
Computational approaches to gene identification fall into two broad categories: sequence comparisons and gene predictions. Both strategies have limitations and can produce erroneous results. Putative genes that do not exist may be identified (false positives) and genes that are real may fail to be found (false negatives).
Identification using sequence comparison
Several different algorithms have been developed for aligning two sequences to each other, calculating their percentage similarity and determining the probability that the similarity would occur by chance. Perhaps the most commonly used algorithm is the basic local alignment search tool (BLAST). For gene identification and localization, the most reliable sets of sequences to compare are full-length complementary DNAs (cDNAs) against long stretches of finished sequence derived from chromosomes or whole genomes of the same species.
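To make the idea of a local alignment score concrete, the Python sketch below uses the exact dynamic-programming method of Smith and Waterman rather than BLAST itself, which is a faster heuristic. The scoring values (match, mismatch, gap) are illustrative assumptions, and the sketch returns only the best score, not the alignment, the percentage similarity or a chance probability.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b
    (Smith-Waterman recurrence with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which lets an alignment start
            # anywhere in either sequence.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

# The two sequences share only the block ACGTA, flanked by unrelated bases.
print(local_alignment_score("GGGACGTACCC", "TTACGTATT"))
```

The shared five-base block scores 5 x 2 = 10 under these parameters, and the flanking mismatches are simply left out of the alignment.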
A full-length cDNA clone contains sequence corresponding to all of the exons, including the 5′ and 3′ untranslated regions, of a spliced mRNA. Thus cDNAs are markers for expressed genes. Libraries of cDNA clones can be prepared from specific cells or tissues by using the enzyme reverse transcriptase to copy the mRNA into complementary DNA, and then ligating the cDNA sequences into appropriate cloning vectors. As for genomic sequence, contiguous stretches on the order of a megabase in length are now available for most of the human chromosomes. Sequencing of the whole human genome is scheduled to be finished by Spring of 2003. (The sequences of the genomes of Drosophila melanogaster (a fruitfly), Caenorhabditis elegans (a roundworm) and Arabidopsis thaliana (a common weed) are already essentially finished. Draft sequences currently exist or are underway for rice, mouse, rat, mosquito, pufferfish, zebra fish and sea squirt, and there are additional genomes in the sequencing pipeline.)
With both a full-length cDNA and finished genomic sequence in hand, a gene structure can be identified on the basis of matches corresponding to the exons (Figure 5). When the cDNA is less than full length, or when there are gaps or misassemblies in the corresponding genomic sequence, genes can still be identified, but conclusions about the gene structure may be erroneous.
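The matching of a cDNA against genomic sequence can be sketched in Python as a greedy search for successive exact blocks. This is a toy under strong assumptions (error-free sequences from the same strand, exons in order, no use of splice-site signals); the function name, the `min_exon` parameter and the sequences are hypothetical.

```python
def map_cdna_to_genome(cdna, genome, min_exon=3):
    """Return exon intervals (start, end), 0-based and end-exclusive, found by
    repeatedly taking the longest prefix of the remaining cDNA that occurs
    in the genome downstream of the previous exon."""
    exons = []
    g = c = 0
    while c < len(cdna):
        length = len(cdna) - c
        while length >= min_exon:
            pos = genome.find(cdna[c:c + length], g)
            if pos != -1:
                break
            length -= 1
        else:
            raise ValueError("cDNA cannot be placed on this genomic sequence")
        exons.append((pos, pos + length))
        g = pos + length
        c += length
    return exons

# Toy locus: flank, exon 1, intron (GT...AG), exon 2, flank.
genome = "TTTT" + "ATGGCC" + "GTAAGTTTTCAG" + "CTGAAATAA" + "CCCC"
cdna = "ATGGCC" + "CTGAAATAA"
print(map_cdna_to_genome(cdna, genome))
```

When the first bases of an intron happen to match the start of the next exon, exact matching alone cannot place the boundary, which is one reason the splice-site consensus is taken into account in practice.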
A less reliable but still useful approach to gene identification uses cDNAs from different species. Because the similarity to the genomic sequence is not exact, this strategy works best for single-copy genes. If there is more than one copy of the gene, owing to a process called segmental duplication, it may be unclear which copy best corresponds to the cDNA.
Some genes are expressed rarely or their transcripts are unstable; thus, databases of cDNA sequences underrepresent the complete set of genes. To augment the list of genes, comparisons between genomic DNA from two species, such as human and mouse, are also used to identify gene candidates on the basis of similarity between evolutionarily conserved sequence blocks that are likely to be exons. Another strategy is to translate stretches of genomic sequence into amino acid sequences and search for similarity to characterized protein sequences in other species. This approach is used for species that are too diverged evolutionarily for nucleotide sequence similarity to be meaningful.
Identification using gene prediction programs
Algorithms have been written to identify potential exons and build gene models from genomic sequence data. Examples include Genscan, Grail, FGENESH and Genewise. These programs use features of known genes to make predictions. For example, a program can find open reading frames surrounded by canonical splice sites and mark them as potential exons. Various neural nets and Hidden Markov models have been used to refine the programs and increase their reliability. See also Cytochrome P450 (CYP) Gene Superfamily, RNA Gene Prediction, and Olfactory Receptors
Gene prediction programs generally fail for genes with large (>100 kb) introns, as spurious exons are identified and gene models are hard to build with confidence. At the other extreme, in gene-dense regions errors are often made when sorting the exons into gene models: two real genes may be combined into one predicted gene or one real gene may be split into two predicted genes.
Because gene prediction programs extrapolate from features of known genes, there also is a worry that truly novel genes will fail to be predicted, especially in species whose genomes are not well understood.
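The simplest rule mentioned above, an open reading frame bounded by canonical splice sites, can be sketched in Python. The sketch assumes the common GT...AG intron boundaries (so a candidate internal exon follows an AG and precedes a GT) and is not the method of any of the named programs; the function names, the `min_len` threshold and the sequence are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_frames(seq):
    """Reading frames (0, 1 or 2) in which seq contains no stop codon."""
    return [frame for frame in range(3)
            if not any(seq[i:i + 3] in STOP_CODONS
                       for i in range(frame, len(seq) - 2, 3))]

def candidate_exons(dna, min_len=9):
    """List (start, end, frames) for stretches that follow an AG, precede a GT
    and are open in at least one reading frame."""
    starts = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    ends = [m.start() for m in re.finditer("(?=GT)", dna)]
    hits = []
    for start in starts:
        for end in ends:
            if end - start >= min_len:
                frames = open_frames(dna[start:end])
                if frames:
                    hits.append((start, end, frames))
    return hits

# Toy sequence: end of an intron, a 12-base exon, start of the next intron.
dna = "TTTCAG" + "GCTGAACCGGCA" + "GTAAGT"
for hit in candidate_exons(dna):
    print(hit)
```

Even on this toy sequence the rule reports a second, overlapping candidate that runs into the downstream intron, a small-scale version of the spurious exons that trouble predictions across large introns.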
Current estimates of gene number
Using a combination of gene predictions and sequence comparisons, investigators have estimated the number of genes in organisms whose genomes have been sequenced (Table 1).
About 20000–25000 of the predicted human genes have strong evidence to support their existence. Although some investigators think that humans have many more genes, analyses from several different perspectives have converged on a figure of about 35000.
A significant portion of the human genome, perhaps as much as 5%, is involved in segmental duplications, either on the same chromosome or on different chromosomes. Gene duplication has created multigene families whose members have evolved or are capable of evolving subtle functional differences. For example, there are five copies of a 4-kb gene encoding the serine protease trypsinogen, each of which is embedded in a 10-kb repeated segment of human chromosome 7. Because the nucleotide sequences of the copies are about 91–93% similar to each other, there are minor variations in the amino acid sequence that affect properties of the trypsinogen protein such as responsiveness to trypsin inhibitor. Given gene duplication, there are fewer distinctly different types of genes in the human genome than is indicated by the current gene count.
Accounting for complexity
The observation that humans may have nearly the same number of different types of gene as the pufferfish (Fugu rubripes) is one of the surprising insights to emerge from the Human Genome Project. If true, the complexities inherent to human physiology and development must be explained in large part by differences in how genes or parts of genes are modified and expressed. For example, RNA splicing seems to be prevalent and extensive in vertebrate genomes. RNA editing, whereby a chemical modification of a base changes the amino acid sequence translated from an mRNA, also serves to increase the information content obtainable from a gene.
In addition, posttranslational modification of proteins, through phosphorylation or glycosylation, can alter the specificity and types of interaction in which the proteins participate in the cell.
The real action, however, may be in gene regulation. Consider that humans and chimpanzees have genomes that differ by only 1–2% in their sequence, but there are clear and distinct morphological and neurological differences between the species. It is likely that the differences will be explained ultimately by subtleties at the level of gene expression, that is, by subtleties in cis-regulatory networks and their interacting transcription factors and in the epigenetic factors that influence transcription.
Statistics pertaining to human genes
The extensive amount of finished human genomic sequence, along with thousands of full-length cDNAs, has enabled statistical analyses of the properties of human genes. The results are currently biased in favor of gene-rich regions, because the finishing of these regions has proceeded more quickly than that of the gene-poor regions of the genome. Thus, the numbers will change as more chromosomes are finished and cDNAs for the rarely expressed genes become available.
According to the first round of global analyses of the human genome, the average human gene spans about 27 kb. It has between eight and nine exons, which are typically small (~200 bases). The mature mRNA has about 2400 bases, of which around 1350 is coding sequence. Some genes, for example, many histones and G-protein-coupled receptors, have only one exon. In some cases, it may be difficult to distinguish functional single-exon genes from what are called processed pseudogenes. A processed pseudogene is the result of an mRNA being reverse transcribed and reinserted at random back into the genome. Processed pseudogenes frequently have poly(A) tails at the end of the apparent coding sequence.
The biggest human gene identified to date is dystrophin (DMD), on chromosome X, at about 2.3 Mb. The gene with the most coding sequence (80.8 kb) encodes the cytoskeleton protein titin (TTN) on chromosome 2. Indeed, one of the exons of titin is 17106 bases! Aside from families of interspersed repeats (e.g. Alu elements), the largest gene family is the olfactory receptors, which has more than 900 members.
Information pertaining to individual genes and genomes can be found on several websites. For example, GeneCards
Genes are not evenly distributed across the genome (Figure 6). Before the sequencing of the human genome, a correlation was noted between chromosomal bands staining lightly and darkly and the density of genes. Light bands, which are higher in G+C content, were found to contain clusters of closely spaced small genes, whereas dark bands, which are poor in G+C content, had many fewer genes mapping to them. The available sequence data have now established that there is a correlation between the overall percentage of G+C and the number of genes per 100 kb.
The average G+C content of the human genome is about 41%; the smaller and more closely spaced genes are found at higher percentages of G+C, whereas the big genes and the gene deserts are found at lower percentages of G+C. The most gene-dense region of the human genome identified so far is the major histocompatibility complex (MHC) class III region on 6p21.3. Here, 60 genes containing over 500 exons occupy 706 kb. In contrast to a genome average of 3–4% coding sequence, the MHC class III region is more than 13% coding sequence. As expected, the average G+C content of this region is high (51.8%). Chromosome 19 has the highest gene density (estimated at 23 genes per megabase) and its average G+C content is 49%. See also Major Histocompatibility Complex (MHC) Genes
So-called gene-poor regions, in which the G+C content is about 34–38%, can be divided into three distinct categories in terms of gene organization: gene deserts, widely spaced genes and big genes. Gene deserts are megabase-sized regions of chromosomes that seem to be devoid of genes. For example, there is a 4-Mb region of chromosome 21q21 in which no genes have been found. The mouse counterpart to this region is also devoid of clearly identifiable genes. The functional significance of gene deserts, if any, is unknown. Some chromosomal stretches, although not true gene deserts, have very few, widely spaced genes. For example, there is a 2-Mb region on chromosome 7p22 that contains only seven genes ranging from 964 bases to 148.7 kb. In this region, only about 10% of the genomic DNA is organized into genes. See also Gene Distribution in Human Chromosomes
Perhaps the most interesting category of gene-poor region is that of the big genes, that is, the genes whose primary transcripts span over 500 kb. As the average size of the coding sequence of the big genes is only about twice the genome average, it is clear that the genes are big because of extra-large introns. About 200 big genes account for at least 5% of the whole human genome. Because big genes take many hours to be transcribed, their products are likely to be found only in slowly dividing or nondividing cells such as neurons. When cells divide in less time than is required for transcription, the protein coded for by the mRNA is not made because transcription is prematurely terminated. This might serve as a mechanism for the timing of gene expression in developing cells, as has been found for the fruitfly.
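G+C content is simply the fraction of bases in a stretch of sequence that are G or C, usually computed in windows along a chromosome. A minimal Python sketch follows; the function names, window size and toy sequence are hypothetical, and ambiguous bases such as N are counted in the length but not as G or C.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, size):
    """G+C fraction of consecutive, non-overlapping windows of `size` bases.
    A final window shorter than `size` is dropped."""
    return [gc_content(seq[i:i + size])
            for i in range(0, len(seq) - size + 1, size)]

# Toy sequence: an A+T-rich stretch followed by a G+C-rich stretch.
sequence = "ATATATTA" + "GCGCGGCC"
print(gc_content(sequence))
print(gc_windows(sequence, 8))
```

Plotting such window values against the number of genes per window is one way to examine the correlation described here.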
How much of the human genome is genes?
In terms of coding sequence, about 3–4% of the information contained in the human genome is used for translation into protein. It is less clear what percentage of the genome is actually transcribed (intragenic), as opposed to being between genes (intergenic). Assuming 35000 genes, an average gene size of 27 kb and 3200 Mb (3.2 Gb) of genomic DNA, it would follow that only about 30% of the genome is transcribed. This is most probably an underestimate, however, because the average gene size is increasing as the finished sequences are getting longer and as more of the big genes are identified. If the average gene size turns out to be 50 kb and there are 40000 genes, then it will follow that 62% of the genome is transcribed.
About 45% of the human genome consists of common transposable element repeats. Repeats are found both within and between genes. Whether repeats are just junk or instead serve a useful purpose is still unclear. The finding of higher than expected numbers of the short so-called Alu repeat sequence in GC-rich, gene-dense regions of the genome leads some investigators to think that the repeats are functionally important. See also Genome Organization of Vertebrates
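The arithmetic behind these two estimates can be checked with a few lines of Python. The sketch takes the genome size as 3200 Mb, the value for which the quoted percentages come out, and assumes that genes do not overlap; the function name is hypothetical.

```python
def transcribed_fraction(n_genes, mean_gene_kb, genome_mb=3200):
    """Fraction of the genome covered by primary transcripts, assuming
    non-overlapping genes of the given mean size."""
    transcribed_mb = n_genes * mean_gene_kb / 1000  # kb converted to Mb
    return transcribed_mb / genome_mb

# 35000 genes of 27 kb: 945 Mb of 3200 Mb.
print(f"{transcribed_fraction(35000, 27):.1%}")
# 40000 genes of 50 kb: 2000 Mb of 3200 Mb.
print(f"{transcribed_fraction(40000, 50):.1%}")
```

The first case gives 945/3200, or about 30%, and the second gives 2000/3200, or 62.5%, which the text quotes as 62%.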
A Look to the Future
The genomic architectures being revealed by the sequences of many diverse species pose interesting puzzles and questions. Why are big genes big? Do the genes found in gene-dense clusters lie together for a reason? Why have some transposable elements died in the human genome but not in rodents? What are the mechanisms for segmental duplication? If repeat sequences are important then why does the pufferfish (Fugu rubripes) genome, at about 350 Mb, contain only about 4% of repetitive DNA? Does the pufferfish also have big genes and gene deserts?
In coming years, many more species will have their genomes sequenced. Out of this work will come a better understanding of the dynamic evolutionary forces that have shaped genomes and the relationships between chromosome structure and the functions of genes. Perhaps even the regulatory codes will be cracked eventually, through a combination of experimentation and genome structure analysis.