Unpublished but public microbial genomes with biotechnological relevance
Roland J. Siezen,
Kluyver Centre for Genomics of Industrial Fermentation; TI Food and Nutrition, 6700AN Wageningen, The Netherlands; NIZO Food Research, 6710BA Ede, The Netherlands; CMBI, Radboud University Nijmegen, 6500HB Nijmegen, The Netherlands.
In the past few years, the number of microbial genome sequencing projects worldwide has rapidly increased, for both single species and microbial consortia (metagenomes). The development of several new high-throughput sequencing platforms (Hall, 2007; Marsh, 2007), together with an enormous reduction in costs, means we can expect to have thousands of complete and incomplete genome sequences available to us in the coming years. Many of these microbial genomes are of biotechnological interest, and several have spectacular properties in relation to their growth requirements, the metabolites they produce, their potential for environmental clean-up or survival in extreme environments. One of the ideas behind sequencing and analysis of whole genomes, or substantial parts of them, is that the information will enable a more targeted construction of mutant strains for the improvement of industrial processes. This is in contrast to the more common procedure of generating random mutations and then screening for the desired phenotype.
The sheer number of newly completed genomes, estimated at about 1 per day in 2008, makes it impossible to publish all this information in regular scientific journals. So how do we keep track of which genome sequences are known or are upcoming, and where can we find all this sequence data to do data mining and comparative genomics in search of leads for our own research on biotechnologically interesting microbes?
Genome sequencing and databases
To make genome datasets publicly available, they are initially submitted to the public sequence data repositories GenBank (Benson et al., 2008), EMBL (Cochrane et al., 2008) and DDBJ (Sugawara et al., 2008). This genome data is then further processed by curation, annotation and comparison, and ends up in a variety of microbial genome data resources, as reviewed recently (Markowitz, 2007). A very complete and up-to-date status of genome sequencing can be found in the Genomes Online Database (GOLD; http://www.genomesonline.org) (Liolios et al., 2008), a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects, as well as metagenomes and metadata, around the world. The entry page links to the GOLD tables, each containing a summary of a different kind of sequencing project: completed genomes, ongoing genomes (archaeal, bacterial or eukaryotic) or metagenomes. Links are provided to each organism, genome sequence, institution, funding agency and scientific journal publication, and much more. Clicking on the ‘Download’ button at the top of a table gives access to a wealth of metadata for each microbial genome, such as species/strains/serovars, phenotype, habitat, origin of isolation, pH and temperature regimes, etc.
The GOLD statistics report that most of the recent genome sequencing data for bacteria and archaea comes from large high-throughput sequencing centers such as the Joint Genome Institute (25%) and the J. Craig Venter Institute (23%) in the USA. Many of these genomes are part of major large-scale microbial sequencing programs funded by government agencies such as the National Institutes of Health (NIH), the National Science Foundation (NSF) and the Department of Energy (DOE) in the USA.
Unpublished public genomes
At the end of 2007, over 700 completed genomes were listed that can be accessed in public databases, and the large majority of those were of bacterial and archaeal origin. ‘Complete’ means a single complete sequence for each chromosome. Up to 2004, nearly all of these complete genomes were also reported in scientific journals, and these are referred to as ‘published public’ genomes (Figure 1). After 2004, the number of newly published public genomes has remained rather steady at 60–70 per year, while the number of ‘unpublished public’ genomes has increased rapidly. In 2007, over 200 new genomes were released to public databases, but two-thirds of those did not appear in scientific publications. These are the genomes that remain ‘invisible’ to the general reader who relies only on PubMed searches or other literature alert services. One way of getting a quick insight into recent ‘unpublished public’ genomes is to read Michael Galperin's two-monthly brief summaries in the Genomics Update section of Environmental Microbiology (Galperin, 2007a,b).
It is understandable that the more recent depositions may have no publication accompanying them yet, but it is surprising that almost 41% (243) of the completely sequenced microbial genomes catalogued in GOLD remain as yet unpublished. These organisms were sequenced to be used in comparative genomics studies, but either these analyses are still ongoing or they have been completed and the findings have not been reported. As the genomes are all in public databases, it would be possible to do the comparison ‘in house’. Some of the sequenced genomes have been carefully investigated and, although not published in the scientific literature, they have been used in patent applications submitted by the commissioning scientists and organizations.
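As a rough check on these figures, the count and the percentage quoted above can be turned into an implied total. The short Python sketch below is illustrative only: the function name is our own, and because the share is given as ‘almost 41%’, the results are approximate.

```python
def implied_total(count, percent):
    """Total number of items implied by a count and the percentage it represents."""
    return count / (percent / 100.0)


unpublished = 243    # unpublished complete microbial genomes (from the text)
share_percent = 41   # 'almost 41%' in the text, so treat results as approximate

total = implied_total(unpublished, share_percent)
published = total - unpublished

print(f"Implied complete microbial genomes in GOLD: about {round(total)}")
print(f"Implied published genomes: about {round(published)}")
```

The implied total of roughly 590 to 600 complete microbial genomes is consistent with the statement that the large majority of the more than 700 completed genomes are bacterial or archaeal.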
Over 1500 additional genomes of bacteria and archaea were listed as ‘ongoing’ or incomplete in the GOLD tables at the end of 2007, and none of those have yet been reported in the scientific literature. Many of these genomes can also be considered as ‘unpublished public’ because access is provided to preliminary sequence data, usually consisting of multiple sequence contigs. GOLD is therefore the place to go to find out what is being sequenced, who is doing it, and what the status of each sequencing project is.
GOLD also ranks microbial genomes according to biomedical, biotechnological, environmental, agricultural, or evolutionary relevance (with some overlap of categories) (Figure 2). For readers of this journal the category ‘Biotechnological relevance’ is the most interesting to scrutinize in more detail. For the last 6 months of 2007, the GOLD table lists 28 such genomes, of which 24 are still ‘unpublished’ (Table 1). Some interesting examples are Fervidobacterium nodosum from hot springs, whose amylolytic enzymes have great potential; Alkaliphilus (Clostridium) oremlandii, which reduces arsenate to arsenite, making it potentially useful in bioremediation of contaminated soils and waters; and Petrotoga mobilis from 60°C water near oil wells, which may help in cleaning up oil contamination. Properties of a few other relevant microbes and their applications are described in more detail below.
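For readers who want to make this kind of selection themselves, a downloaded table can be filtered by relevance category and publication status with a few lines of Python. The sketch below is a minimal example under stated assumptions: the tab-delimited layout, the column names (`organism`, `relevance`, `publication_status`) and the sample rows are hypothetical placeholders, not the actual GOLD export format, and would need to be adapted to the real download.

```python
import csv
import io


def select_genomes(tsv_text, relevance, status="unpublished"):
    """Return organisms in a given relevance category with a given publication status.

    Assumes a tab-delimited table with the hypothetical columns
    'organism', 'relevance' (comma-separated categories) and 'publication_status'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        # A genome may belong to several categories, so split and normalize them.
        categories = {c.strip().lower() for c in row["relevance"].split(",")}
        if (relevance.lower() in categories
                and row["publication_status"].strip().lower() == status):
            selected.append(row["organism"])
    return selected


# Made-up placeholder rows, for illustration only (not real GOLD entries).
SAMPLE = (
    "organism\trelevance\tpublication_status\n"
    "Organism A\tBiotechnological\tunpublished\n"
    "Organism B\tBiomedical\tunpublished\n"
    "Organism C\tBiotechnological, Environmental\tpublished\n"
    "Organism D\tEnvironmental, Biotechnological\tunpublished\n"
)

print(select_genomes(SAMPLE, "Biotechnological"))
```

Splitting the relevance field on commas reflects the overlap of categories mentioned above: a genome tagged as both environmentally and biotechnologically relevant is still selected.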
Table 1. Microbial genomes of biotechnological relevance made public in July–December 2007 (adapted from GOLD On-line Database v 2.0; http://www.genomesonline.org).
Cellulose is a complex plant polysaccharide that is not easy to degrade. Several clostridia achieve this using a mixture of enzymes (endoglucanases and glucanases) which are held together in a large complex on the cell surface known as the cellulosome (Bayer et al., 2004; Doi and Kosugi, 2004). Clostridia are anaerobes mostly isolated from soils, where they adhere to decaying plant material. Some also inhabit other niches such as the stomach of ruminants and the human colon. One which has recently had its genome sequenced is Clostridium phytofermentans ISDg (ATCC 700394), isolated from a damp silt bed in a forested area (Warnick et al., 2002). This strain is special in that it can anaerobically ferment a vast array of plant sugars, starches and cellulose to produce economically substantial amounts of ethanol and acetate. It produces two to four times more ethanol than acetate, which suggests that it contains unusual fermentation pathways. In fact, the genome of Clostridium phytofermentans contains over 100 ABC-type transport systems, and 52 of these appear to be dedicated to transporting carbohydrates into cells. Some of these are monosaccharide transporters, but others are involved in the transport of disaccharides (e.g. cellobiose), tri- and tetrasaccharides (Leschine and Warnick, 2007). The polymer-hydrolyzing lifestyles of this organism and a distant relative, Clostridium thermocellum, are currently the object of a comparative genomics effort. The composition of the cellulosome in relation to the substrate to which the organism has been adapted was the subject of a proteomics study, which showed that different glucanases were incorporated into the cellulosome (Gold and Martin, 2007). Another genome that has been sequenced but not yet completely assembled is that of Clostridium cellulolyticum H10.
Comparative genomics should help to explain the differences in the fermentative capacity of these organisms, which are all very useful as biomass fermenters producing substantial amounts of ethanol but also other compounds such as acetate and lactate. The comparative analysis may also help to explain the differences that occur during biofilm formation by these organisms, as the formation of biofilms may have dramatic effects on subsequent cellulose decomposition. It is possible that biofilm formation will increase ethanol production in some species (Clostridium phytofermentans) and reduce it in others (Clostridium cellulolyticum) (Desvaux et al., 2000). The spin-off company SunEthanol (http://www.sunethanol.com) has been established to exploit the biofuel-producing potential of Clostridium phytofermentans.
Fine chemicals production
Actinobacillus succinogenes strain 130Z (ATCC 55618) was isolated from the bovine rumen. It is a Gram-negative, facultatively anaerobic, pleomorphic bacterium belonging to the family Pasteurellaceae which, in addition to the genus Actinobacillus, includes Mannheimia, Haemophilus and Pasteurella. These bacteria are generally pathogenic or commensal. A. succinogenes is thought to serve a commensal role by producing organic acids that are used as an energy source by the cow. The major end product of its fermentative metabolism is succinate (Guettler et al., 1999), which has many industrial fine chemical uses. Succinate is currently produced mostly by petrochemical means, through butane oxidation at high temperatures with catalysts. Succinic acid can be converted into a number of industrially important chemicals such as butanediol, tetrahydrofuran, γ-butyrolactone, adipic acid, succinate ester solvents, 2-pyrrolidone, succinimide, maleic anhydride and polybutylene succinate. As a specialty chemical, it is a flavour and formulating ingredient in food processing and a pharmaceutical ingredient, and it is also used as a surfactant. The market potential for succinate is substantial, and in future it will be used in many white technologies, e.g. for producing bulk chemicals, stronger-than-steel plastics, ethylene diamine disuccinate (a biodegradable chelator) and diethyl succinate (a green solvent for replacement of methylene chloride). World-wide sales of biobased products have increased more than two-fold in the last 10 years, and the projection is for a continued increase (Committee on Biobased Industrial Products, National Research Council, 2000).
A. succinogenes is the best known natural succinate producer, and it can utilize a wide range of substrates including glucose, cellobiose, lactose, xylose, arabinose and fructose. It also has the potential to fix CO2, as every mole of succinate made by A. succinogenes requires a mole of CO2. It should be possible to couple industrial succinate fermentation to industrial ethanol fermentation by capturing the CO2 waste from the ethanol fermentation. The draft genome sequence was put to use in the filing of a patent application (Zeikus et al., 2007a) which claimed the genes from the organism for the production of chemicals from the C4 pathway. The genome sequence has also allowed modeling of metabolic pathways. Such modeling will help to identify leads for processes that change metabolic fluxes and control circuits, diverting carbon flux away from other end products and thereby increasing the production of succinate. In another patent application, the genome-based metabolic model was used to define a minimal growth medium for A. succinogenes (Zeikus et al., 2007b). The genome (Hong et al., 2004) and a genome-scale metabolic model (Kim et al., 2007) are also available for another succinate producer, Mannheimia succiniciproducens, and it should be interesting to compare the metabolic capacities of the two organisms.
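The scale of such a coupling can be estimated with simple stoichiometry. The Python sketch below is a back-of-the-envelope upper bound, not a process model. It combines the 1:1 CO2-to-succinate ratio stated above with the textbook ethanol fermentation stoichiometry (one glucose yields two ethanol and two CO2, i.e. one mole of CO2 per mole of ethanol). The function name and the `capture_efficiency` parameter are our own assumptions, and the estimate ignores the sugar substrate and any other limits on succinate fermentation.

```python
# Molar masses in g/mol (standard values).
ETHANOL_G_PER_MOL = 46.07
SUCCINIC_ACID_G_PER_MOL = 118.09

# Glucose -> 2 ethanol + 2 CO2, so 1 mol CO2 is released per mol ethanol.
CO2_PER_ETHANOL = 1.0
# From the text: each mole of succinate requires one mole of CO2.
CO2_PER_SUCCINATE = 1.0


def max_succinic_acid_kg(ethanol_kg, capture_efficiency=1.0):
    """Upper bound on succinic acid (kg) supported by CO2 from ethanol fermentation.

    capture_efficiency is a hypothetical fraction (0 to 1) of the CO2 that is
    actually captured and fixed; CO2 is assumed to be the only limiting input.
    """
    ethanol_mol = ethanol_kg * 1000.0 / ETHANOL_G_PER_MOL
    co2_mol = ethanol_mol * CO2_PER_ETHANOL * capture_efficiency
    succinate_mol = co2_mol / CO2_PER_SUCCINATE
    return succinate_mol * SUCCINIC_ACID_G_PER_MOL / 1000.0


# Example: CO2 released while producing 1000 kg of ethanol, fully captured.
print(f"{max_succinic_acid_kg(1000.0):.0f} kg succinic acid (upper bound)")
```

Because succinic acid has a much higher molar mass than ethanol, the CO2 from one tonne of ethanol could in principle support more than two tonnes of succinic acid, provided enough sugar substrate is available for the succinate fermentation itself.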
There are currently three completely sequenced strains of Shewanella baltica (OS195, OS185, OS155), while another (OS233) is in the draft phase. These bacteria were originally isolated from Baltic Sea water and were reclassified from Shewanella putrefaciens to Shewanella baltica (Ziemke et al., 1998). Many Shewanella have also been isolated from fish kept in cold storage, where they out-compete other bacteria. This genus is considered to have great value for bioremediation: its members are able to reduce metals and so could be used to remove contamination from sites polluted with heavy metals. OS195 is highly versatile with respect to its ability to use many electron acceptors and donors. It is fast-growing and easily cultivated, can survive long periods of starvation, and grows quickly once nutrients are supplied. This strain was isolated from an anoxic basin in deep water in the Baltic Sea and formed the most populous clone of Shewanella isolated. Comparative genome analysis of the Shewanella strains will help in our understanding of their biogeochemical potential and of the specific ecology of the Baltic Sea, and these strains may also prove very useful as bioremediation organisms.
What to do with all this gold?
All these sequenced genomes and no descriptive publications: it seems a bit like Fort Knox, with vast vaults of precious metal but not much being made out of it. The challenge for the comparative genomics field, and not just the comparative biotech consortia, is to explain what all this sequencing has accomplished, to tell us what it means and what it predicts for the future. There is today much concern that using foodstuffs for biofuel production is immoral. There is also great concern that, in the push to cut dependence upon fossil fuels, the means of producing the biofuel may be even more damaging to the environment (Cramer Commission Report, 2007).
Surely, the comparative analysis of all these biotechnologically relevant micro-organisms can produce new leads, cleaner methods, less energy-demanding processes and sustainable production of biobased products, something which the whole world requires.