• Open Access

Interplay of metagenomics and in vitro compartmentalization


*E-mail mferrer@icp.csic.es; Tel. (+34) 91 585 4928; Fax (+34) 91 585 4760;

**E-mail aaharoni@bgu.ac.il; Tel. (+972) 8 647 2645; Fax (+972) 8 647 9218.


In recent years, the application of approaches for harvesting DNA from the environment, the so-called ‘metagenomic approaches’, has proven to be highly successful for the identification, isolation and generation of novel enzymes. Functional screening for the desired catalytic activity is one of the key steps in mining metagenomic libraries, as it does not rely on sequence homology. In this mini-review, we survey high-throughput screening tools, originally developed for directed evolution experiments, which can be readily adapted for the screening of large libraries. In particular, we focus on in vitro compartmentalization (IVC) approaches and address the potential advantages and problems that the merger of culture-independent and IVC techniques might bring to the mining of enzyme activities in microbial communities.


According to the well-established dilemma of environmental microbiology, only a minority of microorganisms are readily culturable (Raes et al., 2007). For this reason, a wide range of approaches collectively described as ‘environmental genomics’ or ‘metagenomics’ have been developed to study such communities without culturing individual organisms. The term ‘metagenomics’ has been used broadly to encompass research ranging from examining environmental DNA in enzyme screening and drug discovery to randomly sampling, by whole sequencing, the genomes of a small subset of the organisms present in an environment (Gabor et al., 2007).

The main task of sequencing-based metagenomics, initiated through shotgun sequencing and now becoming feasible through 454 pyrosequencing, is to reconstruct the metabolism of the organisms comprising the community and to predict their functional roles in the ecosystem: 134 whole metagenomic projects using the 454 technology platform are running at the moment (Liolios et al., 2008). Near-complete genome reconstructions were achieved for the dominant members of communities with a small number of organism types and for a few highly abundant organisms from diverse communities (see examples in Tables 1 and 2). The genome sequence information also serves as a cornerstone of enzyme discovery, because virtual sequence homology screening gives access to an immense repertoire of millions of known and unknown proteins predicted from the environmental sequence information (Ferrer et al., 2007). However, the clear disadvantage of this approach for enzyme discovery is its full reliance on existing genome annotations: comprehensive in silico functional annotation of proteins remains a difficult, error-prone, machine-based process, and the options for comparative analyses are limited (Hallin et al., 2008). Better annotation quality and curated functional information would enable improved gene function predictions in newly sequenced organisms and environmental samples, and allow high-throughput (HT) evaluation of context-dependent expression and function (Segota et al., 2008).

Table 1. Analysis of microbial communities through shotgun metagenomic sequencing (only environmental samples are shown).

| Sample | Library size | Host or vector system used | Average insert size (kbp) | Biodiversity | References |
| --- | --- | --- | --- | --- | --- |
| Sargasso Sea | 1 985 561 (a) | BstXI-linearized pBR322 derivative | 2–6 | Samples were dominated by genes from Proteobacteria (primarily subgroups Alpha, Beta and Gamma) with moderate contributions from Firmicutes, Cyanobacteria and species in the CFB phyla (Cytophaga, Flavobacterium and Bacteroides). Poor sequencing coverage enabled the assembly of only two near-complete genomes. Here, 1.6 Gbp of unique metagenomic DNA sequences were obtained. | Venter et al. (2004) |
| Soil | 1129 (Bacteria) (a); 527 (Archaea) (a); 919 (Fungi) (a); 4577 (Viruses) (a) | pCR®2.1-TOPO (bacterial, archaeal and fungal); pSMART (viral) | 0.49 (Viruses) | This is the first study to use sequencing to characterize soil viral communities. Within each of the four microbial groups, data showed minimal taxonomic overlap between sites, suggesting that soil archaea, bacteria, fungi and viruses are globally as well as locally diverse. | Fierer et al. (2007) |
| Acid mine drainage biofilm | 103 462 (a) | pUC18 | 3.2 | Authors report the reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II and partial recovery of three other genomes. | Tyson et al. (2004) |
| Global Ocean | 7 697 926 (a) | BstXI-linearized pBR322 derivative | 2 | Authors report a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analysed as part of the Sorcerer II Global Ocean Sampling expedition. The resulting 7.7 million sequencing reads from 41 samples provide an unprecedented look at the great diversity and heterogeneity in naturally occurring microbial populations. | Rusch et al. (2007) |
| Soil | 1 186 200 (b) | pJN105/pCF430 (small inserts); pBeloBAC11 (big inserts) | 2.7–45 | Authors designed a metagenomic analysis to isolate antibiotic resistance genes from six soil libraries. They identified nine clones expressing resistance to aminoglycoside antibiotics and one expressing tetracycline resistance. | Riesenfeld et al. (2004) |

(a) Number of reads produced.
(b) Number of independent clones.

Table 2. Metagenomic populations characterized through the 454 pyrosequencing technology (only environmental samples are shown).

| Sample | 454-library size (a) | Average length of reads | Biodiversity | References |
| --- | --- | --- | --- | --- |
| Solar saltern | 582 681 | ≈100 bp | 151 genomic fragments were dominated by different halophilic archaea and by Salinibacter. | Krause et al. (2008) |
| Soudan Mine | 334 386 (Red sample); 388 627 (Black sample) | 106 bp (Red sample); 99.1 bp (Black sample) | 76 16S rDNA fragments dominated by Alpha- and Gammaproteobacteria (Red sample) and 24 16S rDNA fragments dominated by Actinobacteria (Black sample). | Edwards et al. (2006) |
| Coral Porites astreoides | 316 279 | 102 bp | The most prominent bacterial groups were Proteobacteria, Firmicutes, Cyanobacteria and Actinobacteria. | Wegley et al. (2007) |
| North Atlantic Deep Water and Axial Seamount | 118 778 | < 120 bp | Nearly 50% of the population corresponds to divergent Epsilonproteobacteria. | Sogin et al. (2006) |
| Ocean surface waters | 414 323 (DNA); 128 324 (cDNA) | 110 bp (DNA); 114 bp (cDNA) | The genus Prochlorococcus and Alphaproteobacteria (genus Pelagibacter) were the two most highly represented taxonomic groups in both DNA and cDNA libraries. | Frias-Lopez et al. (2008) |
| Global soil | 314 041 | 96.4 bp | Results indicate that crenarchaeota may be the most abundant ammonia-oxidizing organisms in soil ecosystems on Earth. | Leininger et al. (2006) |
| Marine virome of four oceanic regions | 1 768 297 | 102 bp | Metagenomic analyses of 184 viral assemblages collected over a decade and representing 68 sites in four major oceanic regions. This work provides evidence that the composition of viral assemblages varies in different geographic regions. | Angly et al. (2006) |
| Northwest Atlantic & Eastern Tropical Pacific Seawater | | ≈100 bp | Analysis of 7.7 million sequencing reads (6.3 billion bp) from microbes collected across a marine transect of several thousand km. | Yooseph et al. (2007) |
| Surface and hypersaline marine, freshwater samples | | ≈100 bp | Metagenomic analysis of 37 samples. Results showed that most of the 154 662 viral peptide sequences identified were not similar to those in the current database and that only a few thousand genes encoding metabolic and cellular functions could be unambiguously identified. | Williamson et al. (2008) |

(a) Number of reads.

For this reason, the selection-based approach, which involves the construction of small- to large-insert expression libraries (especially those made in lambda phage, cosmid or copy-control fosmid vectors) that are then subjected to direct activity screening (Gabor et al., 2007), is the best option for enzyme discovery. In fact, the exploitation of natural microbial diversity for the identification and isolation of novel enzymes is currently an extremely active field in microbiology (see examples in Beloqui et al., 2008). Further, the power of metagenomic approaches can be combined with directed evolution methodologies to generate a biotechnological platform that will allow the discovery and engineering of novel enzymes (Fig. 1). Here, a critical step is to screen for genes encoding novel enzymes among millions or billions of unrelated DNA sequences, which highlights the need for high-throughput screening (HTS) methodologies in order to increase the chance of identifying, isolating and generating novel enzymes. We will review important HTS strategies that enable the screening or selection of extremely large libraries for a variety of enzymatic activities, with an emphasis on the use of in vitro compartmentalization (IVC).

Figure 1.

Flow chart illustrating the identification, isolation and further engineering of novel enzymes using metagenomic and directed evolution approaches. Newly identified enzymes from metagenomic libraries can serve as an ideal starting point for the directed evolution of enzymes with improved properties. High-throughput screening for enzymatic activity can be used, both for screening large metagenomic libraries and subsequently, for directed evolution experiments.

Metagenome analysis requires HTS strategies

Two distinct mining strategies are used in metagenomics to capture the largest possible share of the enzymatic resources available in environmental DNA (Fig. 2). The first is the sequence-based approach, which involves the design of PCR primers or hybridization probes for target genes, derived from conserved regions of already known protein families (Gabor et al., 2007). Using microarrays to profile libraries gives this strategy an effective way to characterize many clones rapidly in an HT fashion (Sebat et al., 2003). This format is referred to as a metagenome microarray (MGA). In the MGA format, the ‘probe’ and ‘target’ roles are reversed relative to general cDNA and oligonucleotide microarrays: targets (fosmid clones) are spotted on a slide, and a specific gene probe is labelled and used for hybridization. This format may offer an effective HTS approach for rapidly identifying clones from metagenome libraries without laborious screening procedures for each target gene. As an example, Sebat and colleagues (2003) and Park and colleagues (2008) used microarray platforms to screen microbial genomes and whole community genomes. However, this approach has two limitations: high hybridization efficiency is difficult to achieve, and probes derived from conserved regions of already known protein families reduce the chances of obtaining fundamentally new proteins (Gabor et al., 2007). On the other hand, based on the few metagenome surveys of microbial and viral communities completed to date, a consensus is emerging that environmental communities are extraordinarily diverse and contain a high proportion (over 60%) of novel sequences with unknown functions that are relatively distant from better known representatives in sequenced genomes (see examples shown in Tables 1 and 2).

Figure 2.

Mining genomes and metagenomes for novel enzymes. A gene library is created from environmental samples (Steps 1–3) and used to screen for novel genes (Step 4) cloned into bacteria, which can be sequenced (Step 5a). The encoded proteins, expressed in an appropriate host, are then subjected to structure-function analyses (central panel). Alternatively, large-scale sequencing of bulk DNA is used for archiving and sequence homology screening purposes to capture the largest amount of the available genetic resources present in environmental samples (Step 5b).

The alternative, activity-based approach, available at HT scale by using chromogenic or fluorogenic substrates, analyses tens of thousands of clones in a single screen in a sequence-independent way (Fig. 2). The shortcomings of the functional-screening approach are that even large libraries capture only a small fraction of the environmental diversity and that novel enzymes occur at low frequency among vast numbers of unrelated DNA sequences. The diversity that can be accessed by metagenomic analysis is overwhelming: it was shown that a pristine soil sample contains more than 10⁴ different microbial species and over a thousand million open reading frames, many of which encode putative enzymes (Raes et al., 2007). To generate an effective metagenomic library, improved HT DNA extraction methods and cloning strategies are required so that the individual genomes in the library provide an acceptable representation of the entire metagenome (Ferrer et al., 2007). The target genes encoding novel enzymes represent a tiny fraction, in some cases less than 0.01%, of the total nucleic acid extracted from environmental sources, and therefore the low abundance of the gene in the library may play a critical role in mining programmes (Gabor et al., 2007). For this reason, several enrichment methodologies currently exist that are either dependent on or independent of sequence information, such as culture enrichment, size-selective filtration and stable isotope probing, to cite some. However, the main drawbacks of culture enrichment and stable isotope probing are, respectively, the danger of enriching fast-growing microorganisms that do not utilize the supplied nutrients, and inefficient labelling together with the difficulty of preventing cross-feeding or recycling of the label within the microbial community (Schloss and Handelsman, 2003).
Other crucial factors in the generation and expression of a metagenomic library are the insert size, the gene or operon length, and the use of an adequate host organism that is able to express the target gene and to supply in trans other factors that facilitate expression and folding (such as chaperones and cofactors) (Galvao et al., 2005). Collectively, these factors highlight the need for HT screens that are independent of cloning and expression in order to increase the chance of isolating novel enzymes. It is therefore conceivable to screen environmental DNA rapidly by applying different IVC-like strategies that combine single microdroplets and cell-free translation systems with high-speed sorting (see below). These may complement the low-to-medium-throughput colorimetric and genetic-trap screens, which are not covered here and have been extensively reviewed elsewhere (see Ferrer et al., 2007; Gabor et al., 2007 and references therein).
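
To give a feel for why library size and insert size matter so much, the following sketch applies the standard Clarke-Carbon estimate of how many clones are needed for a given locus to be present in a library with a chosen probability. This formula is not taken from the works cited above; it is a textbook approximation that assumes every genome is equally abundant and equally clonable, which real environmental DNA is not. The function name, the 4 Mbp average genome size and the 40 kbp insert size are illustrative assumptions; only the figure of 10⁴ species comes from the text.

```python
import math


def clones_needed(insert_bp: float, metagenome_bp: float, p_cover: float = 0.99) -> int:
    """Clarke-Carbon estimate: N = ln(1 - P) / ln(1 - f), with f = insert / metagenome.

    Hypothetical helper; assumes uniform representation of all genomes.
    """
    if not 0.0 < p_cover < 1.0:
        raise ValueError("p_cover must lie strictly between 0 and 1")
    fraction = insert_bp / metagenome_bp
    if not 0.0 < fraction < 1.0:
        raise ValueError("insert must be smaller than the metagenome")
    # log1p keeps precision when the insert is a tiny fraction of the metagenome.
    return math.ceil(math.log(1.0 - p_cover) / math.log1p(-fraction))


# 10^4 species (from the text) x 4 Mbp per genome (assumed) = 4e10 bp of metagenome.
soil_metagenome_bp = 1e4 * 4e6
# 40 kbp inserts are an assumed, fosmid-like size.
print(clones_needed(40_000, soil_metagenome_bp, p_cover=0.99))
```

Under these assumptions the estimate comes out at several million clones, and it grows in inverse proportion to the insert size, which is one way to see why small-insert libraries of a complex community cover so little of it.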

High-throughput functional screening or selection assays

In the sections below, we outline the principles of HTS approaches, with a focus on the use of IVC.

The fundamentals of HTS or selection

Screening and selection methodologies should meet the following demands: (i) They should, if possible, directly select for the property of interest –‘you get what you select for’ is an important rule in screening for desired activity. Thus, the substrate should be identical, or as close as possible to the target substrate, and product detection should be under multiple turnover conditions to ensure the selection of effective catalysts. (ii) The assay should be sensitive over the desired dynamic range. The first rounds of screening large libraries demand isolation with high recovery. All variants, including those that exhibit relatively low detectable catalytic activity, should be recovered. The more advanced rounds must be performed at higher stringency so as to ensure isolation of the best variants. A limited dynamic range seems to be a drawback of most selection approaches. (iii) The procedure should be applicable to a HTS format.

Numerous assays enable detection of enzymatic activities in agar colonies or crude cell lysates by the production of a fluorophore or chromophore (see Bornscheuer, 2002; Goddard and Reymond, 2004; Andexer et al., 2006). Assays on agar-plated colonies typically enable the screening of > 10⁴ variants in a matter of days, but are often limited in sensitivity. Soluble products diffuse away from the colony and hence only very active variants are detected. Assays based on insoluble products show higher sensitivity, but their scope is rather limited (for example, see Khalameyzer et al., 1999). The range of assays that are applicable to crude cell lysates is obviously much wider, but their throughput is rather restricted. In the absence of sophisticated robotics, which are usually unavailable to academic laboratories, only 10³–10⁴ variants are typically screened (Geddie et al., 2004). These low-to-medium throughput screens have proved effective for the isolation of enzyme variants with improved properties or for the isolation of enzymes from pre-enriched metagenomic libraries, as described in a number of reviews (Beloqui et al., 2008). However, a far more efficient sampling of sequence space is required for the isolation of rare variants from large metagenomic libraries or of variants with dramatically altered phenotypes in directed evolution experiments. Here, methods based on the screening of in silico (virtual) libraries with a size of up to 10⁸⁰ variants have recently been applied successfully (Fox et al., 2007).
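
The link between throughput and the chance of recovering a rare clone can be made concrete with a small calculation. The sketch below computes the probability of seeing at least one positive when n clones are screened and each clone is a hit independently with frequency f, that is 1 − (1 − f)ⁿ. The hit frequency of one in a million is a purely hypothetical value chosen for illustration, and the function name is likewise an assumption; the throughputs of 10³ and 10⁴ are the ones quoted above for lysate-based screens.

```python
import math


def p_at_least_one_hit(n_screened: float, hit_frequency: float) -> float:
    """Probability of >= 1 positive among n independent clones: 1 - (1 - f)^n.

    Written with log1p/expm1 so that very small frequencies do not lose precision.
    """
    return -math.expm1(n_screened * math.log1p(-hit_frequency))


HYPOTHETICAL_HIT_FREQUENCY = 1e-6  # assumed for illustration, not a measured value

for n in (1e3, 1e4, 1e7):
    p = p_at_least_one_hit(n, HYPOTHETICAL_HIT_FREQUENCY)
    print(f"{n:.0e} clones screened -> P(at least one hit) = {p:.4f}")
```

With a hit frequency this low, screens of 10³ to 10⁴ clones would most likely return nothing, whereas a screen of around 10⁷ clones would almost certainly contain a positive; the exact numbers depend entirely on the assumed frequency.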

In vitro compartmentalization

In vitro compartmentalization is based on water-in-oil emulsions, in which the water phase is dispersed in the oil phase to form microscopic aqueous compartments. Each droplet contains, on average, a single gene, and serves as an artificial cell in which transcription, translation and the activity of the resulting proteins take place. The oil phase remains largely inert and restricts the diffusion of genes and proteins between compartments (Fig. 3). The droplet volume (∼5 femtolitres) enables a single DNA molecule to be transcribed and translated (Griffiths and Tawfik, 2006), as well as the detection of single enzyme molecules (Griffiths and Tawfik, 2003). The high capacity of the system (> 10¹⁰ droplets in 1 ml of emulsion), the ease of preparing emulsions and their high stability over a broad range of temperatures render IVC an attractive system for HTS of enzymes, as well as for many other HT genetic and genomic manipulations (for a recent review, see Griffiths and Tawfik, 2006).
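
The droplet numbers quoted above can be checked with a short calculation. The sketch below does two things: it estimates how many droplets fit into 1 ml of emulsion from the droplet volume, and it models gene loading as a Poisson process to show what "on average, a single gene" per droplet implies for empty, singly and multiply occupied droplets. The Poisson model and the 5% aqueous volume fraction are assumptions made here for illustration and are not stated in the cited works; the function names are hypothetical.

```python
import math

FEMTOLITRES_PER_ML = 1e12  # 1 ml = 1e-3 l and 1 fl = 1e-15 l


def droplets_per_ml(droplet_volume_fl: float, aqueous_fraction: float) -> float:
    """Number of aqueous droplets in 1 ml of emulsion for a given water fraction."""
    return aqueous_fraction * FEMTOLITRES_PER_ML / droplet_volume_fl


def droplet_occupancy(mean_genes_per_droplet: float) -> dict:
    """Poisson fractions of droplets holding zero, exactly one, or several genes."""
    lam = mean_genes_per_droplet
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    return {
        "empty": p_empty,
        "single": p_single,
        "multiple": 1.0 - p_empty - p_single,
    }


# 5 fl droplets (from the text) at an assumed 5% aqueous fraction.
print(f"{droplets_per_ml(5.0, 0.05):.1e} droplets per ml")

# Mean of one gene per droplet, as in the text, under the Poisson assumption.
for label, fraction in droplet_occupancy(1.0).items():
    print(f"{label}: {fraction:.3f}")
```

Under the Poisson assumption a mean of one gene per droplet leaves roughly a third of droplets empty and about a quarter carrying more than one gene, so a lower mean loading trades capacity for a cleaner genotype-phenotype link.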

Figure 3.

Selections by FACS sorting of double emulsion droplets. A gene library is transformed into bacteria, and the encoded proteins are expressed in the cytoplasm, the periplasm, or on the surface of the cells (Step 1). The bacteria are dispersed to form a water-in-oil (w/o) emulsion, with typically one cell per aqueous microdroplet. Alternatively, an in vitro transcription/translation reaction mixture containing a library of genes is dispersed to form a w/o emulsion with typically one gene per aqueous microdroplet. The genes are transcribed and translated within the microdroplets (Step 2). Proteins with enzymatic activity convert the non-fluorescent substrate into a fluorescent product and the w/o emulsion is converted into a water-in-oil-in-water (w/o/w) emulsion (Step 3). Fluorescent microdroplets are separated from non-fluorescent microdroplets using a fluorescence activated cell sorter (FACS) (Step 4). Bacteria or genes from fluorescent microdroplets which encode active enzymes are recovered and the bacteria are propagated or the DNA is amplified using the polymerase chain reaction. These bacteria or genes can then be re-compartmentalized for further rounds of selection.

While IVC provides a facile means for co-compartmentalizing genes and the proteins they encode, the selection of an enzymatic activity requires a link between the desired reaction product and the gene. One possible selection format is to have the substrate, and subsequently the product, of the desired enzymatic activity physically linked to the gene. Enzyme-encoding genes can then be isolated by virtue of their attachment to the product, while other genes that encode an inactive protein carry the unmodified substrate. The simplest applications of this strategy come with the selection of DNA-modifying enzymes, where the gene and substrate comprise the same molecule. Indeed, IVC was first applied for the selection of DNA-methyltransferases (MTases) (Tawfik and Griffiths, 1998). Other applications include the selection of restriction endonucleases (Doi et al., 2004) and DNA polymerases (Ghadessy et al., 2001; 2004).

The first application of IVC beyond DNA-modifying enzymes was demonstrated by a selection of bacterial phosphotriesterase variants (Griffiths and Tawfik, 2003). The selection strategy used was based on two emulsification steps. In the first step, microbeads, each displaying a single gene and multiple copies of the encoded protein variant, were formed by translating genes immobilized to microbeads in emulsion droplets and capturing the resulting protein via an affinity tag. The microbeads were isolated and re-emulsified in the presence of a modified phosphotriester substrate. In the second step, the product and any unreacted substrate were subsequently coupled to the beads. Product-coated beads, displaying active enzymes and the genes that encode them, were detected using fluorescently labelled anti-product antibodies and selected by fluorescence-activated cell sorting (FACS). Selection from a library of > 10⁷ different variants led to the isolation of an enzyme variant with a very high kcat value (> 10⁵ s⁻¹).

Some IVC selection modes take advantage of the fact that this system is solely used in vitro, allowing for selection of substrates, products and reaction conditions that are incompatible with in vivo systems. However, cell-free translation must be performed under defined pH, buffer, ionic strength and metal ion composition. In the selection for phosphotriesterase using IVC described above (Griffiths and Tawfik, 2003), translation was completely separated from catalytic selection by using two sequential emulsification steps, allowing selection for catalysis under conditions that are incompatible with translation. However, it is also possible to use a single emulsification step, and to modify the contents of the droplets without breaking the emulsion once translation is completed (Bernath et al., 2005).

IVC in double emulsions

The need to link the product to the enzyme-coding gene complicates and restricts the scope of selection, especially for non-DNA modifying enzymes. Recently, an alternative strategy was developed, based on compartmentalizing and sorting single genes, together with the fluorescent product molecules generated by their encoded enzymes. The technology makes use of double, water-in-oil-in-water (w/o/w) emulsions that are amenable to sorting by FACS (Fig. 3). FACS technology, originally developed by Diversa, enables the identification of biological activity within a single cell by incorporating a laser with multiple wavelength capabilities with the ability to screen up to 50 000 clones per second, or over 1 billion clones per day. This, therefore, circumvents the need to tailor the selection for each substrate and reaction, and allows the use of a wide variety of existing fluorogenic substrates. The making and sorting of w/o/w emulsion droplets does not disrupt the content of the aqueous droplets of the primary w/o emulsions. Further, sorting by FACS of w/o/w emulsion droplets containing a fluorescent marker and parallel gene enrichment have been demonstrated (Bernath et al., 2004).
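
The sorting rate quoted above translates into daily throughput by simple arithmetic. The sketch below converts a sorting rate in events per second into events per day and into the time needed to sort a library of a given size. The rate of 50 000 per second is the figure from the text and is treated here as an upper bound ("up to"); the library size of 10¹⁰ droplets and the function names are illustrative assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(rate_per_second: int) -> int:
    """Events sorted in 24 h of uninterrupted sorting at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY


def sorting_time_hours(library_size: float, rate_per_second: float) -> float:
    """Hours of uninterrupted sorting needed to pass every library member once."""
    return library_size / rate_per_second / 3600.0


MAX_RATE = 50_000  # events per second, the upper bound quoted in the text

print(events_per_day(MAX_RATE))  # 4320000000
print(f"{sorting_time_hours(1e10, MAX_RATE):.1f} h for an assumed 1e10-droplet library")
```

At the quoted maximum rate the daily figure is about 4.3 billion events, consistent with the statement of "over 1 billion clones per day"; sustained rates in practice may be lower, which would lengthen the sorting time proportionally.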

The w/o/w emulsions were also applied for the directed evolution of two different enzymatic systems. New variants of serum paraoxonase (PON1) with thiolactonase activity (Aharoni et al., 2005a) and new enzyme variants with β-galactosidase activity were selected from libraries of > 10⁷ mutants (Mastrobattista et al., 2005). The β-galactosidase variants were translated in vitro, as in previously described IVC selections (Tawfik and Griffiths, 1998). In the case of PON1, intact Escherichia coli cells in which the library variants were expressed were emulsified and FACS sorted, thereby demonstrating the applicability of double emulsions for single-cell phenotyping and directed enzyme evolution (Aharoni et al., 2005b). The same strategy has recently been applied for the selection of lactonase activity (Amitai et al., 2007) using an oxo-lactone substrate, the enzymatic hydrolysis of which generates a thiol that was subsequently detected with a fluorogenic probe (Khersonsky and Tawfik, 2006). Detailed protocols for the preparation and sorting of double emulsions are available (Miller et al., 2006).

Perspectives of HT-IVC screens for metagenome mining

Recent progress has revealed that the capture of genetic resources of complex microbial communities in metagenome libraries allows the discovery of a richness of new genomic and metabolic diversity that had not previously been imagined. Activity-based screening of such libraries has demonstrated that this new diversity does not simply consist of variations on known sequence themes, but rather includes entirely new sequence classes and novel functionalities (Ferrer et al., 2007). IVC analysis of bulk environmental DNA may serve as a powerful tool for selecting and dynamically analysing those new sequences through the ability to check biochemical parameters in millions of single cells or microdroplets, each containing an individual ‘metagene’. The application of IVC-FACS technologies would enable the screening of billions of bp for enzymatic activity in a few days, a processing speed that is impossible to achieve using conventional HT screening approaches.

The only example in the literature so far of using the power of FACS to sort metagenomic libraries is substrate-induced gene expression screening (SIGEX). In this methodology, metagenome fragments are ligated into an operon-trap vector (e.g. p18GFP), and the cells are then separated and analysed by HT-FACS to select for GFP-expressing cells (Uchiyama and Watanabe, 2008). The SIGEX approach enriches for functional genes but does not select directly for enzymatic activity, and it is restricted to cloning in E. coli and to the use of its transcription–translation machinery. These drawbacks can be partially circumvented by cell-free IVC systems, which allow direct screening for enzymatic activity and offer the advantages of controlling the reaction conditions and introducing novel chemical groups into enzymes (new metals, cofactors and unnatural amino acids). Yet the full recovery of ‘metaenzymes’ by IVC-FACS is limited by its reliance on cell-free systems comprising the essential components for transcription and translation, and by the low availability of such systems (Griffiths and Tawfik, 2006).


Exploiting the vast environmental genetic diversity by metagenomic approaches is a promising strategy for the identification and isolation of novel enzymes. Identification of such enzymes will enhance our knowledge of the structure, function and evolution of enzymes and will allow definition of many new enzyme families. These enzymes could be highly valuable for industrial applications, especially in the development of new biocatalytic processes. The application of HTS approaches for selecting metagenomic libraries, and especially IVC assays, will enable us to identify novel enzymes for a wider scope of biotransformations. These newly identified enzymes can serve as the ideal starting point for the directed evolution of novel enzymes with improved properties. We believe that in the near future, a new biotechnological platform will emerge, combining the fields of metagenomics and directed evolution for the generation of novel biocatalysts.


This research was supported by the Spanish MEC BIO2006-11738 and CSD2007-00005 projects. A.B. thanks the Spanish MEC for a FPU fellowship. A.A. acknowledges funding support from the Israel Science Foundation (ISF) Morasha program.