Author for correspondence: M. Udvardi Tel: +49 331567 8149 Fax: +49 331567 8150 Email: email@example.com
Functional genomics is transforming the way biological research is done in the 21st century. Functional genomics brings together high-throughput genetics with multiparallel analyses of gene transcripts, proteins, and metabolites to answer the ultimate question posed by all genome-sequencing projects: what is the biological function of each and every gene? Functional genomics is driving a shift in the research paradigm away from vertical analysis of single genes, proteins, or metabolites, towards horizontal analysis of full suites of genes, proteins, and metabolites. By identifying and measuring many, if not all, of the molecular players that participate in a given biological process, functional genomics offers the prospect of obtaining a truly holistic picture of life. This review describes the tools that are currently being used for functional genomics work and considers the impact that this new discipline is likely to have in the future.
Experimental organismic biology is built on a foundation of reductionism, where individual research groups specialize in small, well-defined areas of research, and where the greatest challenge is often to collect and synthesize the results of many groups into useful models of how the various parts of a cell or complex, multicellular organism interact to produce a viable organism. Although functional genomics is also built upon reductionism, it is nonetheless transforming the way we think about and perform biological science. It has captured the imagination of thousands of scientists, perhaps in part because it empowers individual groups to work on a grander scale that promises local (and possibly patentable) insights into the ‘big picture’ of biology. The multiparallel approaches of functional genomics allow one to query the ‘behaviour’ of thousands of genes, proteins, and metabolites in a single, controlled experiment in a way that allows sensible connections to be made within and between the various levels of biological organization. The main challenge in functional genomics may no longer be acquisition of comparable data from many different experiments in many different research groups, but rather the analysis of huge, internally consistent local data sets. This analysis promises an unprecedented, holistic picture of the molecular basis of life.
The functional genomics toolbox
Although some of the tools of functional genomics existed and were being applied before the completion of the first bacterial and eukaryotic genome projects, the discipline of functional genomics emerged largely in response to the challenge posed by complete genome sequences. This challenge is to understand the biochemical and physiological function of every gene product, and the complex interplay between them all. The activity of genes is manifest at a number of different levels, including RNA, protein, and metabolite levels, and analyses at these levels can provide insight not only into the possible function of individual genes, but also into the cooperation that occurs between genes and gene products to produce a defined biological outcome. Global analyses of the various levels of molecular organization have been facilitated by remarkable developments in high-throughput technologies. In fact, it is probably a sign of the immaturity of functional genomics as a discipline that it is still largely driven by technology rather than by novel hypotheses. This, of course, means that there are great opportunities for those who come to the discipline with good questions. The high-throughput technologies that currently define functional genomics include: DNA or oligonucleotide array technologies for determining mRNA levels for thousands of genes simultaneously (transcriptome analysis); 2D-gels, mass spectrometry (MS), and other methods for identifying and quantifying thousands of different proteins (proteome analysis); and gas or liquid chromatography-mass spectrometry (GC-MS or LC-MS) for identifying and quantifying hundreds of different metabolites in different cell, tissue, or organ types. High-throughput methods for forward and reverse genetics are also integral to functional genomics approaches (Fig. 1).
Fast forward and reverse genetics
Genetics plays a central role in functional genomics, although it represents a potential bottleneck for high-throughput analysis of gene function. Traditional forward genetics has, over the past century, linked hundreds of mutant alleles with specific phenotypes in many different model organisms. Isolating the gene affected in a specific mutant is often problematic, however, especially when a point mutation (e.g. a single base-pair change) is responsible for the phenotype, and especially when the organism has a large and complex genome, as do higher plants. Map-based cloning has been the only effective way of identifying alleles containing point mutations in plants, and the fact that relatively few plant genes have been isolated in this way in the past is an indication that it has been a major bottleneck to gene discovery. This has begun to change, however, especially in Arabidopsis where the complete genome sequence of the Columbia ecotype together with the near-complete sequence of a second ecotype, Landsberg erecta, has led to high-density single nucleotide polymorphism (SNP) maps that promise to reduce the time needed for isolating mutant alleles to a few months (Lukowitz et al., 2000). An alternative, and in the past more powerful, approach to isolating interesting mutant alleles is insertion mutagenesis, where a foreign piece of DNA (either a transposon or Agrobacterium T-DNA) is used as the mutagen. The mutant allele can be relatively easily recovered using the foreign DNA as a tag, by inverse PCR, plasmid rescue, or Southern blot approaches (Dilkes & Feldmann, 1998). Many genes have now been isolated in this way. Although it is now easier than ever before to isolate interesting genes using forward genetics, the approach has one important intrinsic limitation: it can only be applied to genes that produce a discernible phenotype when mutated.
Technologies that allow us to monitor changes in the transcriptome, proteome, and metabolome will no doubt help us to discern subtle molecular phenotypes that elude detection at the physiological or morphological levels. Such technologies may even help to discriminate the roles of genes from multigene families, which will be valuable when functional redundancy shrouds their physiological roles.
Complete genome sequences represent an amazing resource for systematic biology. Reverse genetics is one way to harvest systematically the information inherent in such sequences. Reverse genetics begins with an isolated gene and works ‘backwards’ to obtain the phenotype associated with impaired function of that gene. In many model organisms, including bacteria, yeast, and even some lower plants, such as the moss Physcomitrella patens, it is possible to knock out the function of a gene by replacing it with a mutant allele using homologous recombination (Schaefer & Zryd, 1997). Usually, a selectable marker, such as an antibiotic-resistance gene, is inserted into the inactive replacement gene to enable positive selection. Unfortunately, homologous recombination of introduced genes in higher plants is not yet a viable technology. Most reverse genetics experiments in higher plants have so far relied on antisense RNA suppression or cosuppression, both of which reduce the level of endogenous transcript of the target gene (van der Krol et al., 1988; Brusslan et al., 1993). These technologies represent a double-edged sword. They rarely lead to the complete absence of the target gene transcript and therefore do not produce a truly null phenotype. On the other hand, this can be an advantage, especially when a null phenotype would be lethal. In addition, a range of levels of transcript and associated protein activity can be a useful resource, for instance when one is interested in the level of control exerted by a particular enzyme in a metabolic pathway. Most antisense and cosuppression experiments have utilized constitutive promoters to drive transcription of introduced genes in stable transgenic lines. If the construct is lethal or severely affects reproduction, then it may be difficult to obtain the desired transgenic lines. This potential problem can be avoided by using inducible promoters that are silent until specifically triggered by the investigator.
A novel technology that is also based on post-transcriptional gene silencing is being developed using viruses as gene delivery and expression systems (Baulcombe, 1999). This technology offers the potential to switch-off expression of the target gene quickly at any time during plant development. It is not dependent on stable, germline transformation and reproductive development and, because it can be deployed at any time during development, it could side-step potential problems that might result from constitutive silencing of critical genes. This is definitely an emerging technology to watch closely in the future.
Another approach that is facilitating reverse genetics and functional genomics is the production of libraries of insertion or deletion mutants that contain mutations in many, if not all of the genes of a model organism. For example, there are now libraries of yeast mutants that contain deletions in every predicted open reading frame (e.g. see: http://www.rz.uni-frankfurt.de/FB/fb16/mikro/euroscarf/index.html). For the yeast researcher, it is now a simple matter to go ‘shopping’ for a mutant affected in the gene of interest. The bottleneck for functional genomics in yeast is therefore no longer in getting a mutant in one of the approximately 6000 different genes, but rather in thinking of and implementing intelligent screens to uncover phenotypes associated with the various mutant genes. An early example of a ‘genome-wide’ screen for genes that affect fitness in yeast was described by Smith et al. (1996). The transposon, Ty1, was used to generate a complex population of mutants, which was subsequently grown in different selective media. The fate of individual mutant lines in the complex population was followed by PCR, using a primer complementary to part of the inserted Ty1 transposon together with a mixture of hundreds of gene-specific primers. Of the 268 different mutants analysed, more than half were found to have reduced fitness. In this way, phenotypes were linked to nearly one hundred previously uncharacterized genes.
Libraries of T-DNA and transposon-tagged mutants have also been created in many different plant species, including of course Arabidopsis. Such libraries are now being used by many groups to determine the phenotype(s) associated with a wide assortment of genes. Most of these libraries must still be screened physically at the DNA level in order to identify mutants of interest (Krysan et al., 1999; Tissier et al., 1999). However, several groups have begun to sequence the endogenous sequences flanking the inserted foreign DNA and to create directories of Flanking Sequence Tags (FSTs) that will enable plant researchers to identify mutant lines in silico (e.g. Parinov et al., 1999). This will remove a significant bottleneck to functional genomics in higher plants. The new limitation to progress may then become our ability to devise clever, high-throughput phenotypic screens. These screens need not be based on macroscopic phenotypes, as should become clear from the following sections.
A recent and potentially potent addition to the reverse genetics armoury is a procedure called Targeting Induced Local Lesions IN Genomes (TILLING) (McCallum et al., 2000). This procedure identifies a plant containing a point mutation in the gene of interest by PCR-amplifying wild-type and mutant alleles from pools of plants, denaturing and annealing the amplified DNA to produce heteroduplex DNA, and finally detecting this DNA by a procedure such as denaturing HPLC. Plant pools giving rise to heteroduplex DNA are subdivided and reanalysed until the mutant plant is isolated. The procedure can be scaled up for high-throughput screening of mutants and can be applied to all plant species. It will be particularly useful for species where no transposon system is available, where T-DNA mutagenesis is not applicable, or where the numbers of available insertion mutants are too low to ensure full coverage of the genome.
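The pool-subdivision logic behind TILLING can be sketched as a simple recursive search. The following is a minimal, illustrative sketch only: the heteroduplex assay is mocked as a set-membership test, and all plant names are invented; in a real screen the positive/negative call for each pool would come from denaturing HPLC of PCR-amplified, re-annealed DNA.

```python
def pool_has_heteroduplex(pool, mutants):
    """Mock assay: a pool yields heteroduplex DNA if it contains a mutant plant."""
    return any(plant in mutants for plant in pool)

def find_mutant(pool, mutants):
    """Recursively subdivide positive pools until a single mutant plant remains."""
    if len(pool) == 1:
        return pool[0]
    mid = len(pool) // 2
    for half in (pool[:mid], pool[mid:]):
        if pool_has_heteroduplex(half, mutants):
            return find_mutant(half, mutants)
    return None  # no mutant allele in this pool

plants = [f"plant_{i}" for i in range(96)]      # one hypothetical 96-plant pool
print(find_mutant(plants, mutants={"plant_57"}))  # -> plant_57
```

Because each round halves the candidate pool, one mutant among 96 plants is pinpointed in about seven assay rounds rather than 96 individual tests, which is what makes the pooling strategy scalable.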
Several technologies are now in use to identify differentially expressed genes and to analyse gene expression levels on a genomic scale. The application of these technologies to plants has been reviewed recently (Baldwin et al., 1999; Kuhn, 2001). Gel-based techniques such as differential display (Liang & Pardee, 1992) and amplified fragment length polymorphism (AFLP) (Bachem et al., 1996) are accessible to all labs and have proven to be useful in the identification of differentially expressed genes (Baldwin et al., 1999). One completely dry approach utilizes the increasing availability of expressed sequence tag (EST) data to estimate expression levels by ‘Electronic Northern analysis’ (Rafalski et al., 1998). Sequencing-based methods such as serial analysis of gene expression (SAGE; Velculescu et al., 1995) and massively parallel signature sequencing (MPSS; Brenner et al., 2000) also utilize the fact that the more abundant mRNA species are more highly represented in a cDNA library. These methods allow the simultaneous analysis of thousands, or even millions, of transcripts by generating short nucleotide signature sequences.
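The underlying arithmetic of these sequencing-based methods is simply tag counting: the fraction of a library's tags assigned to a gene estimates that transcript's relative abundance. A toy sketch, with invented gene names and assuming tag-to-gene assignment has already been done:

```python
from collections import Counter

# Toy 'library' of sequence tags already assigned to genes (invented data).
tags = ["rbcS", "rbcS", "actin", "rbcS", "cab", "actin", "rbcS"]

counts = Counter(tags)
total = sum(counts.values())
for gene, n in counts.most_common():
    # Relative abundance = tag count / library size
    print(f"{gene}: {n} tags ({100 * n / total:.1f}% of library)")
```

Real SAGE or MPSS libraries contain tens of thousands to millions of tags, so even rare transcripts are sampled, but the abundance estimate is exactly this proportion.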
The final and currently most widespread technology for transcriptome analysis is DNA array technology, which relies on a reverse-Northern approach in which DNA is attached to a solid support, and then hybridized to labelled cDNA probes derived from cellular mRNA (Fig. 2). The amount of label bound to each immobilized spot of DNA provides a measure of the relative abundance of the transcript (mRNA) of that gene in the given sample. Different types of DNA are used to create arrays for hybridization experiments. Oligonucleotides with high specificity for defined transcripts can be synthesized directly onto glass slides using light-directed chemical synthesis (Affymetrix GeneChips) or the so-called ‘ink-jet’ synthesis (Agilent). Alternatively, DNA fragments can be spotted onto nylon filters or onto glass slides. This has been the most common approach for plant transcriptome analysis, in part because accurate sequence data, which is not available for many plant species, is not necessarily required. The DNA fragments can originate from anonymous clones, from conventionally synthesized oligonucleotides, from EST clones, or from PCR amplification of open reading frames when genome sequence information is available. As already noted, the complete sequence of the Arabidopsis genome is now known, and national and international consortia are planning to amplify portions of every predicted gene to create a unigene set for array production (for example, see GARNet, http://www.york.ac.uk/res/garnet/garnet.htm). DNA fragments are spotted from 96- or 384-well plates onto the surface of the different support media by robotic systems. DNA arrays are then probed with labelled cDNA reverse transcribed from poly(A)+ RNA or total RNA isolated from the sample of interest. The cDNA is labelled with radioactive (33P) or fluorescent dNTPs, usually by including the label in the reverse transcription reaction.
Following hybridization and washing of arrays, the amount of probe bound to each spot is quantified using either a high-sensitivity PhosphorImager or fluorescence imaging equipment.
The relative merits of the different types of immobilized DNA, support material, and probes have been compared in several recent reviews (Jordan, 1998; Lemieux et al., 1998; Marshall & Hodgson, 1998; Baldwin et al., 1999; Duggan et al., 1999). To summarize briefly, nylon filters are inexpensive and accessible to most labs. In addition, the design of these filters is flexible and can be adjusted to suit individual experiments, and the results are highly reproducible. Arraying on glass slides also provides reproducible results, and it allows very high densities of spotted DNA and small hybridization volumes, although the associated costs are currently prohibitive for most laboratories. Sensitivity on glass slides is very high for both oligonucleotides (Wodicka et al., 1997) and DNA fragments (Ruan et al., 1998). Oligonucleotides can be made gene specific and thus avoid problems associated with repeat sequences and cross-hybridization between members of gene families. Immobilized DNA fragments can be screened using probes labelled with two, and perhaps in the near future four, different fluorescent dyes, allowing simultaneous comparison of different samples.
Once obtained, raw DNA array data must be processed before biologically relevant information can be extracted from it. This is one of the most challenging aspects of transcriptome analysis. There are many sources of variation in DNA array experiments. These include the amount of DNA delivered to each individual spot on the array, which in turn can be affected by PCR efficiency, pin geometry, and the degree of DNA fixation to the array surface. In addition, the quality of starting RNA, reverse transcription, and labelling efficiency can vary, as can the hybridization efficiency, amount of nonspecific hybridization, amount of overlap from neighbouring spots, and image analysis (Schuchhardt et al., 2000). Using at least three replicates in an experiment reduces the number of false positives and false negatives, providing more reliable results overall (Lee et al., 2000). In addition, normalization methods, usually based on a set of housekeeping genes or other control genes, can be applied in an attempt to account for this inherent variability (Hegde et al., 2000; Schuchhardt et al., 2000).
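One simple normalization of the kind described above can be sketched as follows: each array is rescaled by the median signal of its control genes, so that between-array differences in labelling or hybridization efficiency cancel when spot intensities are compared. All gene names and intensity values below are invented for illustration; real pipelines use more sophisticated schemes.

```python
# Hypothetical housekeeping/control genes assumed to be constant across samples.
controls = ["ACT2", "UBQ10", "EF1a"]

def normalize(signals, controls):
    """Divide every spot intensity by the median control-gene intensity."""
    ctrl = sorted(signals[g] for g in controls)
    median = ctrl[len(ctrl) // 2]
    return {gene: value / median for gene, value in signals.items()}

# Two arrays with different overall labelling efficiency (invented values).
array1 = {"ACT2": 200.0, "UBQ10": 180.0, "EF1a": 220.0, "geneX": 400.0}
array2 = {"ACT2": 100.0, "UBQ10": 90.0, "EF1a": 110.0, "geneX": 300.0}

a1, a2 = normalize(array1, controls), normalize(array2, controls)
print(a1["geneX"], a2["geneX"])  # -> 2.0 3.0 (geneX higher in sample 2)
```

Note that before normalization the raw geneX signal (400 vs 300) points the wrong way; the control-gene scaling reverses the conclusion, which is exactly why normalization matters.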
Knowledge of where and when proteins are expressed is essential for an understanding of their biological functions. Taking a census of the proteins that are present in a biological sample can be done in several different ways, once the proteins have been separated. Traditionally, polyacrylamide gels, either one- or two-dimensional, have been the method of choice for separating and visualizing proteins. Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE), which separates proteins by isoelectric point (pI) in the first dimension and by molecular weight (MW) in the second dimension, was first used a quarter of a century ago (O’Farrell, 1975) and can resolve several thousand proteins. Such gels are a powerful way to visualize changes in an organism’s proteome during development or in response to changes in the environment, as well as to identify differences between a wild-type and a mutant. Thus, 2D-PAGE is a useful screening tool for finding potentially interesting proteins. Advances in protein sequencing technologies, together with burgeoning DNA sequence databases, have simplified the subsequent task of identifying proteins of interest. Proteins can be sequenced either chemically, by N- or C-terminal degradation, or by tandem mass spectrometry (MS-MS), which provides the sequence of small peptide fragments from a protein that has been enzymatically digested (Fig. 3) (Pandey & Mann, 2000; Yates, 2000). Both methods provide short stretches of amino acid sequence, usually between 10 and 20 amino acids, which can often be enough to find a match with a protein from the same or a related species. These matches are performed in silico by comparing the query protein sequence with predicted protein sequences derived from DNA databases. The chances of finding an exact match to the query protein sequence increase with the number of genes or cDNA clones that are sequenced from the species being studied.
For those bacterial and few eukaryotic genomes that have been completely sequenced, it is a straightforward matter to identify the protein that contains the partial amino acid sequence of interest. Often, a biochemical function can also be ascribed to the protein, although this is by no means always the case even for the simplest model organisms. The combination of 2D-PAGE and protein sequencing has been put to good effect in a number of plant studies. It has been used to identify plasma membrane proteins from Arabidopsis (Santoni et al., 2000); hypoxia-induced proteins in maize (Chang et al., 2000); chloroplast proteins from pea (Peltier et al., 2000); and proteins associated with the peribacteroid membrane of soybean root nodules (Panter et al., 2000).
Whilst protein sequencing is perhaps the best way to identify proteins from organisms for which only partial genomic sequence is available, peptide mass fingerprinting using MS is generally superior when applied to species for which the entire genome sequence is known. This approach relies on the fact that the set of peptides produced by protease digestion of a specific protein is nearly always unique to that protein. Mass spectrometry allows the masses of each of the peptide fragments to be determined quickly: the resulting set of peptide masses provides a fingerprint that can be matched against a database of DNA that has been translated into protein and digested in silico. Although, in practice, not all protease cleavage sites of a real protein are accessible to digestion, and proteins are often post-translationally modified, there is usually sufficient overlap between the real and in silico fingerprints to allow unambiguous identification of a protein (Peltier et al., 2000).
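The matching step can be illustrated in miniature: digest each database protein in silico with trypsin (which cleaves C-terminal to Lys or Arg, but not before Pro), compute peptide masses from standard monoisotopic residue masses, and score proteins by how many observed masses they explain. The protein sequences and observed masses below are invented; real search engines also model missed cleavages and modifications.

```python
import re

# Monoisotopic residue masses (Da) for the amino acids used in this toy example.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
           "R": 156.10111, "F": 147.06841}
WATER = 18.01056  # mass of H2O added to a free peptide

def tryptic_peptides(seq):
    """In silico trypsin digest: cleave after K or R, except when followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]

def peptide_mass(pep):
    return WATER + sum(RESIDUE[aa] for aa in pep)

def score(protein, observed, tol=0.01):
    """Count observed masses matched by the protein's in silico digest."""
    digest = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - d) <= tol for d in digest) for m in observed)

database = {"protA": "AGFKSVLRTPK", "protB": "GGSSKAAVR"}  # invented sequences
observed = [peptide_mass("AGFK"), peptide_mass("SVLR")]    # 'measured' masses

best = max(database, key=lambda name: score(database[name], observed))
print(best)  # -> protA (explains both observed masses; protB explains none)
```

Even this toy version shows why partial overlap suffices: protA is identified although only two of its three tryptic peptides were 'observed'.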
Proteome analysis using mass spectrometry can also provide valuable information about covalent modifications of proteins, which, of course, can affect their activity. The presence of covalent modification can be inferred by an increase in the mass of one or more of the peptides derived from a protein, compared to in silico predictions or real, nonmodified controls. The mass of the attached group can also give clues to the chemical nature of the modification. Phosphorylated peptides produce characteristic ions during MS, which can be used to identify such peptides. By sequencing such peptides, the residues likely to be phosphorylated (ser, thr) can be identified. This approach has been used successfully to identify phosphorylated proteins from thylakoid membranes of Arabidopsis (Vener et al., 2001). Methods to identify carbohydrate side chains of proteins using MS have also been developed (Packer & Harrison, 1998).
Current 2D-PAGE protocols suffer from some problems that limit their use for proteomics: proteins with extreme pI and molecular weight, low abundance, or high hydrophobicity (especially membrane proteins) are rarely seen on 2D-PAGE gels. Other separation technologies are being developed that avoid these problems. One very promising approach relies on two-dimensional liquid chromatography (2D-LC) to separate peptides derived from a complex mixture of proteins. Different types of chromatography (ion exchange, reverse phase, size exclusion) can be used in tandem provided they are largely independent, and provided components resolved in the first dimension remain resolved in the second. Using 2D-LC coupled to MS-MS, Washburn et al. (2001) recently identified 1484 proteins from yeast, including a fair representation of low-abundance proteins, proteins with extremes of pI and MW, and integral membrane proteins. In addition to being largely unbiased, the method, called Multidimensional Protein Identification Technology (MudPIT), is fully automated and superior to 2D-PAGE-based methods in both speed and sensitivity.
Other high-throughput methods for protein identification are also becoming available, such as BIA-MS in which proteins are removed from a complex mixture by affinity to a substrate (e.g. a peptide, nucleic acid, or antibody) that is immobilized on a chip. These proteins can then be subjected to MS-MS to identify them (Sonksen et al., 1998; Nelson et al., 2000).
Most of the proteomics methods employed at present are not quantitative and comparisons between different biological samples are difficult to make. However, the development of isotope-coded affinity tags (ICAT) makes it possible to compare the quantities of a given protein in two different samples (Gygi et al., 1999). Quantitative comparisons are achieved by labelling the proteins of one sample with a light isotope version of a chemical tag and proteins of the second sample with a heavy isotope version of the same tag. The samples are mixed, protease digested, and affinity purified to capture tagged peptides, which are then separated and identified by LC-MS-MS. Identical peptides derived from the two different samples are distinguished by the mass difference conferred by each ICAT, and intensities of the mass peaks reflect the relative abundance of the original protein in the two samples.
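The quantitation logic of the ICAT approach can be sketched simply: the same tagged peptide from the two samples appears as a pair of MS peaks separated by the fixed mass difference between the light and heavy tag forms, and the ratio of the pair's intensities gives the relative protein abundance. The spectrum below is invented, and an 8 Da light/heavy shift is assumed for illustration (the approximate difference for the original d0/d8 ICAT reagent).

```python
MASS_SHIFT = 8.0  # assumed Da difference between light and heavy tag forms

def pair_ratios(peaks, tol=0.05):
    """Find light/heavy peak pairs and return sample1:sample2 intensity ratios."""
    ratios = {}
    for mz, intensity in peaks.items():
        for mz2, intensity2 in peaks.items():
            # A peak exactly MASS_SHIFT heavier is the same peptide, heavy-tagged.
            if abs(mz2 - (mz + MASS_SHIFT)) <= tol:
                ratios[mz] = intensity / intensity2
    return ratios

# Invented spectrum: one peptide pair at 1250.6 (light) / 1258.6 (heavy),
# plus an unpaired peak that is ignored.
peaks = {1250.6: 30000.0, 1258.6: 10000.0, 1400.2: 5000.0}
print(pair_ratios(peaks))  # -> {1250.6: 3.0}: 3x more abundant in sample 1
```

The key design point is that quantitation is internal to a single MS run: because both samples are mixed before digestion and separation, losses during handling affect the light and heavy forms equally and cancel in the ratio.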
The genomic resources now available are driving proteomics into new areas. For example, the availability of large numbers of full-length cDNAs has led some groups towards large-scale heterologous expression approaches. Individual cDNAs are expressed in bacteria, yeast or phage systems in an arrayed format. Fusion to an epitope tag allows subsequent protein purification. The purified products of these expression libraries can then be screened for biological or enzymatic activity, or interaction with a target protein; alternatively, they can be used to screen recombinant antibody libraries (Martzen et al., 1999; Albala et al., 2000; Emili & Cagney, 2000; Larsson et al., 2000). Another approach has been to synthesize arrays of peptides of known sequence on an immobilized support or in multiwell plates. Such arrays can be screened for biological activity, for a target sequence of a protein-modifying enzyme or for antigenic epitopes to a given antibody (Emili & Cagney, 2000).
A dream of the proteomicist may be to obtain a dynamic map of an organism’s entire proteome that reveals how it changes during development and in response to abiotic and biotic challenges to survival. Such a map could have more intrinsic value than the corresponding transcriptome map, which is a more remote indicator of the biochemical activity of the organism. However, proteome maps will be more complex than transcriptome maps, and probably more difficult to achieve, because post-translational modifications of proteins can produce many variants of a single gene product and because proteome maps will ultimately need to incorporate information about the location of each protein in a cell.
The metabolome is the complete set of all metabolites in an organism. It is a dynamic entity, which changes with time and space, both within an individual cell and among cells of multicellular organisms. Although primary metabolism involves only a few hundred metabolites, the metabolome of a higher plant may well include tens of thousands of different metabolites that differ in concentration among cell compartments, cells, tissues, and organs. Clearly, the task of analysing an organism’s metabolome is a daunting one. Nonetheless, there are now a handful of intrepid biological chemists, armed with GC-MS, LC-MS, and other analytical equipment, who have begun to explore the metabolome of plants and other organisms. GC-MS was used recently to quantify 326 different compounds from Arabidopsis, and more than half of these could be assigned a chemical structure, based on their GC retention time and mass spectrum (Fiehn et al., 2000). The immediate challenge for metabolomics is to increase the numbers of compounds that can be separated and quantified, and to identify their chemical structures. Deconvolution software (Stein, 1999) helps to extract more usable data from the complex mass spectra associated with plant chromatograms, and aids compound identification. Other analytical techniques, such as NMR, will also be necessary to assign structures to compounds that cannot be identified by their retention times and mass spectra.
Despite the challenges that lie ahead, metabolic profiling has begun to show its value in providing metabolic fingerprints of wild-type, mutant, and transgenic plants. For instance, metabolic profiling (using GC-MS) of two different Arabidopsis ecotypes, Col-2 and C24, and a mutant of each ecotype, showed distinct profiles for each genotype (Fiehn et al., 2000). The metabolic phenotypes of the two ecotypes were more divergent than were the phenotypes of the single-locus mutants and their parental ecotypes. This presumably reflected the greater genetic divergence between the two ecotypes. Importantly, metabolic profiling has the potential to uncover subtle molecular effects of mutations that are not evident at the macroscopic or whole-plant level. This information may help to decipher the role of many genes of unknown function in plants.
Metabolic profiling is also likely to uncover novel metabolic pathways in plants, and novel locations for known pathways. Recent metabolic profiling of potato tubers from wild-type and transgenic plants indicated that tubers may contain the full complement of enzymes for producing amino acids (Roessner et al., 2001). Therefore, tubers may be less reliant on transport of amino acids from the leaves than previously thought. Metabolic profiling could also uncover metabolic regulation of certain enzymes and might therefore be useful in screening for regulatory mutants (Roessner et al., 2001).
High-throughput transcript, protein, and metabolite profiling is now generating huge amounts of data of different kinds. There are numerous challenges for bioinformatics in the postgenomics era, both in the area of data management and data analysis. A major issue for data management at present is the need to create standard formats for different types of data so that it can be shared and compared between labs. It is likely that central repositories of profiling data will emerge in the near future, analogous to databases like GenBank and SwissProt, which will enable the scientific community to share and discuss experimental results in a standardized way. Data formats also need to be compatible with software in order to make the content accessible to automated methods. Data formats must also be built upon sound ontologies, which map entities like genes, proteins, pathways and experimental conditions to unique identifiers without ambiguity (Ashburner et al., 2000). The Microarray Gene Expression Database (MGED) consortium (http://www.mged.org) has proposed a data format (MAML/MIAME) that is already used by some applications and is likely to find broad acceptance by public repositories and commercial software suppliers.
Data format guidelines will also affect experimental design (Brazma, 2001). For instance, a standard set of growth conditions may be recommended for many types of experiments. At the least, precise details of the many environmental variables that affect growth and development will be required along with profiling data. Statistical considerations will probably also influence which data can and cannot be accepted into public databases (Lee et al., 2000).
A greater challenge for bioinformatics lies in the area of data analysis. How will a holistic picture of the interplay between biomolecules be obtained from raw profiling data? Bioinformaticians are currently using various clustering algorithms to group raw data in an unbiased way (Eisen et al., 1998; Tamayo et al., 1999). However, these methods are generally sensitive to ‘noisy’ data and therefore provide only a crude picture of gene or other interactions. The bioinformatics community is still searching for the ultimate tool to subdivide the multidimensional space of high-throughput data into digestible pieces (Herrero et al., 2001).
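The simplest form of such clustering groups genes whose expression profiles rise and fall together across conditions, typically measured by Pearson correlation. A minimal sketch, with invented profiles and a similarity threshold chosen arbitrarily (real analyses apply hierarchical or k-means clustering to thousands of profiles, as in Eisen et al., 1998):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented expression values across four conditions or time points.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],   # induced
    "geneB": [1.1, 2.1, 3.9, 8.2],   # co-regulated with geneA
    "geneC": [8.0, 4.0, 2.0, 1.0],   # repressed
}

# Group gene pairs whose profiles are highly correlated (threshold assumed).
pairs = [(a, b) for a in profiles for b in profiles
         if a < b and pearson(profiles[a], profiles[b]) > 0.95]
print(pairs)  # -> [('geneA', 'geneB')]
```

The noise sensitivity noted above is visible even here: a single outlying measurement in one profile can pull the correlation below any fixed threshold, which is why robust variants and replicate filtering matter in practice.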
Clustering of gene transcripts, proteins, and metabolites according to how they change during development or in response to changes in the environment may help to identify the role of many genes of unknown function, especially if the roles of some genes, proteins, or metabolites in the cluster are already known. However, this ‘guilt by association’ approach is likely to reach its full potential only if it is possible to integrate information from many different organizational levels, not just the molecular level. Although this is currently done on a gene-by-gene basis in many individual labs, the task for bioinformatics will be to extend this to all genes and proteins, using all the information that is currently available. Such efforts are already under way. For example, in order to elucidate the functions of unknown open reading frames (ORFs) in yeast, Marcotte et al. (1999) used different data sources, including transcript profiling, to assign a general function to more than half of the previously uncharacterized proteins in this organism. Once again, it is obvious that such an undertaking must be founded on data formulated using a controlled vocabulary with a sound ontology. Recent initiatives, such as the KEGG database for metabolic pathways (Kanehisa & Goto, 2000) and the EcoCyc/MetaCyc project (Karp et al., 2000), provide us with a glimpse of how useful future data models might be.
The young discipline of functional genomics is transforming experimental biology at present. However, it will not, and in fact cannot replace older disciplines such as biochemistry, molecular and cell biology, and physiology, which complement each other and functional genomics as well. In the short term, functional genomics will generate more questions and hypotheses than answers, and this will invigorate the older disciplines, which will help to provide answers. In the long term, functional genomics promises a more holistic picture of life that not only reveals the biological function of many, if not all biomolecules, but also the significant interactions between them. In the accompanying article, we describe how the tools of functional genomics are being applied to the study of symbiotic nitrogen fixation.
We would like to thank Thomas Altmann, Oliver Fiehn, Sebastian Kloska, and Megan McKenzie for useful comments on the manuscript. We also thank the Max Planck Society and the Alexander von Humboldt Foundation for generous support.