The biochemical and functional analysis of proteins with unknown functions can be a difficult task that requires endurance and a knowledge of sometimes ‘old-fashioned’ methods. Moreover, without a sequenced genome, it takes a long time to identify the DNA sequence coding for the protein of interest. When I was a diploma student working on the filamentous fungus Aspergillus nidulans, it took me weeks or months to identify the genomic sequence corresponding to a purified protein, because the genome sequence was not publicly available. One of the strategies in ‘former days’ was to purify the protein, blot it onto a membrane, perform N-terminal protein sequencing, construct degenerate primers, screen a genomic library, subclone fragments and sequence them. Nowadays, the protein can be digested with trypsin and the resulting peptides analysed by MALDI-TOF mass spectrometry; thanks to the increasing number of finished genome projects, the gene is then identified by an automatic database search against the genome of interest. This methodology shortens the procedure by several weeks and allows a much higher throughput in the identification of gene functions.
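The first step of such a database search, the theoretical trypsin digest of every predicted protein, can be sketched in a few lines. This is only a minimal illustration, assuming the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); the function name and the example sequence are hypothetical, and real search engines additionally handle missed cleavages, modifications and peptide masses.

```python
def tryptic_digest(protein):
    """Return the peptides from an idealized, complete trypsin digest.

    Assumed rule: cleave after K or R, but not when the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        # Residue following the current one ("" at the C-terminus).
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal peptide if the sequence does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Hypothetical toy sequence, not a real protein.
example_peptides = tryptic_digest("MKRPAGKLLE")
print(example_peptides)
```

In a real workflow, the masses of such theoretical peptides would be compared with the measured MALDI-TOF peaks to find the best-matching gene product.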

However, genome projects were not initiated merely to ease the identification of the coding sequence of a single interesting protein. In fact, the sequencing of genomes has opened the era of ‘omics’, starting from ‘gen’omics, continuing with ‘transcript’omics and enhancing the informative value of ‘prote’omics. Now it is possible to compare genomes of different organisms (which genes are universal and which are specific), to look for changes in transcript levels (e.g. after applying an environmental stress) and to identify modifications of proteins and their abundance under defined conditions. These massive amounts of data (especially from transcriptomics) create a ‘Garden of Eden’ for bioinformaticians, who can perform statistical analyses on the data sets to evaluate their significance and develop new methods for hierarchical cluster analyses. The use of mutual information matrices on time response studies allows the identification of genes that can be grouped into regulons. These analyses aim to suggest new hypotheses on the connection of different pathways and their communication. The challenge for biologists is to test these hypotheses experimentally and thereby verify the predictions.
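To make the idea of a mutual information matrix concrete, the following sketch computes the mutual information between discretized expression profiles of gene pairs. It is a simplified illustration under stated assumptions: the gene names, the time-course values and the two-state discretization (‘up’/‘down’) are invented for the example, and published methods typically use finer binning and significance testing.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) of two equally long discrete profiles."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # Sum over observed state pairs: p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


# Hypothetical discretized time courses (one state per time point).
profiles = {
    "geneA": ["up", "up", "down", "down"],
    "geneB": ["up", "up", "down", "down"],  # follows geneA exactly
    "geneC": ["up", "down", "up", "down"],  # unrelated to geneA
}

# Pairwise matrix, stored as a dictionary keyed by gene pair.
mi_matrix = {
    (g1, g2): mutual_information(profiles[g1], profiles[g2])
    for g1, g2 in combinations(profiles, 2)
}
for pair, value in mi_matrix.items():
    print(pair, round(value, 3))
```

Gene pairs with high mutual information are candidates for a shared regulon, but, as argued here, such groupings remain hypotheses until they are tested experimentally.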

This sounds like a ‘beautiful new world’, because ‘life’ becomes computable, and it works well for pathways in which all genes and proteins involved are already known. However, these analyses and predictions are hampered by several problems, in particular: (i) the countless ‘hypothetical proteins’ and genes of ‘unknown function’, and (ii) the genes and proteins that were annotated solely on the basis of their sequence identity to other, already characterized proteins. Analyses dealing with those data rapidly reach a dead end and lead to a sobering conclusion: bioinformatics cannot substitute for the wet lab.

Transcriptional analyses, the prediction of the number of genes within a genome, the comparison of genomes, the location of proteins and other features depend on a correct genome annotation. However, the functional annotation of proteins is very frequently misleading, sometimes with profound consequences. A single example, although many more could be listed: methylisocitrate lyase is a key enzyme of the fungal methylcitrate cycle and specifically cleaves (2R,3S)-2-methylisocitrate into succinate and pyruvate. The enzyme is highly specific for its natural substrate and does not accept isocitrate in its active site. Fungal isocitrate lyases, specific enzymes from the glyoxylate bypass, share about 35–50% identity with fungal methylisocitrate lyases, but are hardly active with methylisocitrate. On the basis of ‘identity’ alone, most fungal methylisocitrate lyases are incorrectly annotated as isocitrate lyases. This is exemplified by the so-called isocitrate lyase 2 from Saccharomyces cerevisiae, which was formerly denoted as a ‘non-functional’ isocitrate lyase, because no activity was observed with isocitrate as a substrate (Heinisch et al., 1996). However, a re-characterization revealed significant methylisocitrate lyase activity (Luttik et al., 2000), which led to the correction of the annotation. Nevertheless, several methylisocitrate lyases in fungal genome annotations are still denoted as isocitrate lyases. Although this might sound like a minor problem, the missing methylisocitrate lyase would lead to an incomplete methylcitrate cycle. Metabolic flux analyses, however, depend on the knowledge of all pathways present within a cell, and a missing pathway may lead to incorrect data evaluations.

Genes of unknown function or hypothetical proteins cause even bigger problems. A strong upregulation or downregulation of gene expression suggests that the gene is important under the applied condition. However, due to the complexity within a cell, it is difficult to predict whether the change is a direct consequence of the applied stress or results from a subsequent adaptation mechanism. Systems biology tries to answer this question by hierarchical clustering of the changes, which provides hints on additional genes that may be involved in the same pathway. However, without a detailed molecular biological and/or biochemical analysis of each gene, its true function remains unresolved. Therefore, ‘omic’-researchers need to continue to collect large data sets, but should keep in mind that the characterization of genes in the laboratory is still an essential approach, in particular for those of unknown function, but also for those with only a predicted function.

Therefore, I propose to give the characterization of genes of unknown function a higher priority. Currently, it seems that the value of such research is not appreciated, and getting financial support for this kind of research is difficult. A few years ago I published a paper dealing with the biochemical characterization of several enzymes and the impact of a gene deletion on cellular physiology (Brock and Buckel, 2004). It turned out to be quite difficult to get it published, because one reviewer commented: ‘. . . the manuscript mainly deals with methods used in the sixties and seventies and it is questionable, whether such investigations are still suitable for publication.’ Of course it might be less ‘sexy’ for readers to be confronted with ‘old methods’, but the time may have come to go back to the roots and to elucidate the function of genes of unknown function in order to drive the advancing field of biological science, including computer-based technologies, forward.


Acknowledgements

I thank Bernhard Hube for the helpful feedback and discussion on this issue.


References
  • Brock, M., and Buckel, W. (2004) On the mechanism of action of the antifungal agent propionate. Eur J Biochem 271: 3227–3241.
  • Heinisch, J.J., Valdes, E., Alvarez, J., and Rodicio, R. (1996) Molecular genetics of ICL2, encoding a non-functional isocitrate lyase in Saccharomyces cerevisiae. Yeast 12: 1285–1295.
  • Luttik, M.A., Kotter, P., Salomons, F.A., Van Der Klei, I.J., Van Dijken, J.P., and Pronk, J.T. (2000) The Saccharomyces cerevisiae ICL2 gene encodes a mitochondrial 2-methylisocitrate lyase involved in propionyl-coenzyme A metabolism. J Bacteriol 182: 7007–7013.