The study of protein function usually requires the use of a cloned version of the gene for protein expression and functional assays. This strategy is particularly important when the information available regarding function is limited. The functional characterization of the thousands of newly identified proteins revealed by genomics requires faster methods than traditional single-gene experiments, creating the need for fast, flexible, and reliable cloning systems. These collections of ORF clones can be coupled with high-throughput proteomics platforms, such as protein microarrays and cell-based assays, to answer biological questions. In this tutorial, we provide the background for DNA cloning, discuss the major high-throughput cloning systems (Gateway® Technology, Flexi® Vector Systems, and Creator™ DNA Cloning System) and compare them side-by-side. We also report an example of a high-throughput cloning study and its application in functional proteomics. This tutorial is part of the International Proteomics Tutorial Programme (IPTP12).
The term clone was coined in 1903 and derived from the Greek word κλών (klōn, “twig”), referring to the process where a twig is used to create a new plant, genetically identical to the twig donor . Currently, “clone” is used with a broader meaning, designating not only the production of identical organisms, but also cells and even DNA fragments. The earliest DNA clone was created in 1972 by Paul Berg, when a DNA segment of the galactose operon was inserted into the SV40 virus . In 1973, Stanley Cohen and Herbert Boyer generated the first organism expressing a recombinant DNA . Although DNA was discovered as the source of the genetic information in 1944 , the cloning of the first DNA fragment awaited isolation of the enzymes necessary to manipulate nucleic acids. Only after the identification of DNA ligase in 1967  and restriction enzymes (REs) in 1970 [6-8] did genetic engineering become possible.
During the early 1970s, molecular cloning was done blindly, without any sequence information about the DNA fragments used in the process. This scenario changed with the development of DNA sequencing methodologies in the late 1970s [9-11], allowing the cloning of specific genes/sequences. The first completely sequenced genome was that of bacteriophage ΦX174, with only 5375 bp [12, 13]. Since then, the number of completely sequenced genomes has increased exponentially and includes the following: Haemophilus influenzae as the first free-living organism with a fully sequenced genome; Saccharomyces cerevisiae as the first eukaryotic genome; Caenorhabditis elegans, the first multicellular eukaryotic genome; and the human genome in 2001 [17, 18].
The explosion of full genome sequences has identified thousands of genes encoding proteins with unknown or poorly known activity. The rapid elucidation of their functions will rely on flexible high-throughput cloning methods. Virtually all technologies routinely employed for the study of protein function begin with protein expression, either in vitro and/or in vivo, using a cDNA copy of the ORF for the gene of interest (GOI). The expressed protein is then used in a broad variety of functional assays. In this approach, to study a protein, one must have the corresponding GOI clone, generating a new need for systematic and high-throughput cloning methodologies.
These high-throughput cloning methods benefit from a number of important characteristics. Typically, ORFs are captured in a common configuration allowing the same basic reagents and steps to operate on all ORFs. The key is to avoid any need to individualize cloning steps based on the ORF. Ideally, the cloning steps operate with molecular conservation, avoiding amplification, which could introduce errors. In this manner, once a sequence-verified ORF is introduced into the system, there will be no need to resequence clones after any transfer steps. A distinguishing characteristic among various cloning methods is the mechanism for transferring ORFs from one plasmid to another, all of which aspire to be simple, rapid, reliable, and highly efficient. The best of these methods can be automated.
The creation of large collections containing many thousands of genes is essential to supply the necessary tools for functional genomics and proteomics. The first comprehensive DNA clone collection produced contained nearly the entire S. cerevisiae transcriptome with more than 6000 genes . This library was constructed through the gap-repair method, in which the GOI is amplified using primers with adapter sequences; in this case a sequence homologous to the vector. The final product was an ORF flanked by 50 bp of vector sequence. Transformation of both vector and amplified ORF into yeast cells allowed homologous recombination in vivo and the consequent generation of the vector coding for the ORF [20, 21]. This collection was used in many assays to address protein function [22, 23], protein–protein interactions [24, 25], protein phosphorylation , and glycosylation .
Since this first comprehensive collection, many more libraries were created using different cloning strategies . New approaches were designed to overcome the major problems of the gap-repair methodology, including the need for very long and expensive primers for the insert amplification; the high rate of empty vectors at the end of the screening; and mostly, the fact that once a gene is cloned into a vector it cannot be easily transferred to another vector without restarting the entire cloning process.
In this tutorial, we will focus on the three most commonly used high-throughput cloning systems (Gateway® Technology, Flexi® Vector Systems, and Creator™ DNA Cloning System), and briefly describe a few other systems. A detailed comparison among these systems is presented in the last section.
2 Basic concepts
2.1 Main enzymes used in cloning methods
Distinct cloning strategies were developed based on enzymes capable of manipulating DNA in different ways. Among the most relevant enzyme classes for cloning are the following: DNA polymerases, REs, ligases, and recombinases. DNA-dependent DNA polymerases are used for the amplification of desired sequences in the PCR (Supporting Information Fig. S1A), whereas RNA-dependent DNA polymerases copy the information from RNA molecules into DNA, referred to as cDNA (Supporting Information Fig. S1B). Type II REs cleave within specific DNA recognition sequences and ligases rejoin two DNA strands, reconstituting the phosphodiester bonds (Supporting Information Fig. S2). Site-specific recombinases have, within a single complex, the intrinsic activity to recognize specific sequences, cut and ligate DNA fragments (Supporting Information Fig. S3). They are able to exchange fragments between two DNA molecules. The length and composition of recognition sequences determine how frequently they occur within GOIs, and thus have an impact on how universally their associated enzymes can be applied without disrupting ORFs. More detailed information about all these enzymes can be obtained in the Supporting Information.
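The link between recognition-site length and how often a site interrupts an ORF can be illustrated with a rough calculation. The sketch below assumes a random sequence with equal base frequencies and treats positions as independent; real ORFs deviate from both assumptions, so the numbers are only indicative. The function name and the 1500 bp example length are illustrative choices, not values from this tutorial.

```python
def site_probability(site_length: int, orf_length: int) -> float:
    """Chance that one fixed recognition site of `site_length` bp occurs at
    least once in a random ORF of `orf_length` bp.

    Assumptions (simplifications): equal base frequencies and independent
    positions along the sequence.
    """
    positions = max(orf_length - site_length + 1, 0)
    per_position = 0.25 ** site_length
    return 1.0 - (1.0 - per_position) ** positions


if __name__ == "__main__":
    # Compare a 4 bp, a 6 bp, and an 8 bp site in a hypothetical 1500 bp ORF.
    for n in (4, 6, 8):
        print(f"{n} bp site: {site_probability(n, 1500):.3f}")
```

Under these assumptions a 6 bp site is expected in roughly a third of 1500 bp ORFs, whereas an 8 bp site is expected in only a few percent of them, which is why longer recognition sequences are preferred when one enzyme must work across a whole ORF collection.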
2.2 Insert selection
The preparation of ORF clones for eukaryotic genes favors the use of mRNA as a starting template in order to omit introns.
Lacking splicing machinery, prokaryotic cells require a contiguous ORF, without introns and untranslated regions (UTR), for appropriate protein expression. In eukaryotic cells, however, the presence of an intron can increase the expression levels of the insert. Using the natural introns present in the gene would make the cloning process more difficult, because it would considerably increase the size of the insert to be cloned. To overcome this problem, eukaryotic expression vectors with an artificial intron at the end of the cloning site were designed, enabling the use of ORFs for the generation of transcripts with introns.
The easiest way to obtain the desired ORF is through amplification by PCR using a pair of primers designed to anneal specifically to the beginning and end of the coding sequence (CDS). The template choice for the PCR is a key component of the cloning (Fig. 1). Genomic DNA is one option; however, it is recommended only for genes without introns, as in prokaryotes and simple eukaryotes. For genes with splicing, it is more straightforward to use cDNA as template; however, cDNA for many genes may be difficult to capture because their mRNAs may be rare or missing from the tissue source used, making it necessary to use mRNA purified from several cell lineages to cover the entire transcriptome/proteome of an organism. Another starting point for the ORF amplification is a preexisting plasmid with the CDS of interest. This clone might be available from large cloning initiatives like Mammalian Gene Collection (http://mgc.nci.nih.gov), Riken collection (http://dna.brc.riken.jp/search/), DNASU (http://dnasu.asu.edu/DNASU/), and Kazusa mammalian cDNA collection; or by request from individual laboratories. Finally, a new approach that avoids amplification from template is de novo synthesis of ORFs through chemical or PCR-based methods [31, 32]. Today it is possible to buy a synthetic ORF of 1 kb for less than 300 dollars; however, the price and the difficulty of synthesis increase with the length of the ORF. In a small cloning project this price is not prohibitive, but in high-throughput projects the cost of traditional ORF amplification methods is a fraction of the cost of de novo synthesis, making this new approach a second choice. However, it is worth mentioning that one advantage of de novo synthesis is that one can design the clone with optimal codon usage. That is, this approach uses the codons that correspond to the tRNAs most abundant in the species where the protein will be produced, and is therefore more likely to give higher yields of protein.
2.3 Vector features
Prior to discussing vector features, it is important to establish an appropriate nomenclature to distinguish plasmids that contain the sequence of interest (referred to as clone or DNA clone) from the plasmid without the insert (known as vector or empty vector). This nomenclature will be used throughout this tutorial.
In functional genomics, vectors with many distinct features and applications are available. Some of the basic elements are common to all vectors, including the replication site, selectable marker(s), and cloning site(s). More information about the basic elements can be found in the Supporting Information and Supporting Information Fig. S4.
Vectors that possess only the basic features are known as entry vectors or master vectors. These vectors are often engineered so that they cannot express the insert and are usually intermediate steps in the cloning process. Avoiding protein expression is particularly useful for genes that are toxic in bacteria cells, allowing the propagation of the clone without any negative selection against the host cell carrying the gene. Entry clones are stable, with no natural selection occurring in the gene sequence, and are used as a source of fully sequenced ORFs. Transferring a gene from an entry clone into an experimental vector of interest is a simple process that can be accomplished using the cloning sites present in the entry vector, without the need of reamplification or full-length sequence validation of the insert. Entry clones are especially useful when ORFs must be transferred into several vectors for distinct experimental applications (Fig. 2).
Expression vectors possess extra features, along with the common elements, allowing them to express the protein encoded in the insert. The main element of the expression vector is the promoter, responsible for driving the transcription of the ORF. Any promoter that fits the following rules can be used for protein expression: (1) the GOI must be cloned downstream of the promoter, (2) the promoter has to be compatible with the tissue/species/system where the gene will be expressed, and (3) transcription strength and activity of the promoter must be appropriate for the assay; some experiments require constitutive expression, while others demand a time-specific expression, which can be obtained with inducible promoters. Numerous promoters with a broad range of transcriptional strength and specificity have been identified and are used routinely for protein expression [34, 35].
For some applications, it is desirable to add a fusion tag to the expressed protein, a small polypeptide or domain with defined structure that can be recognized by antibodies or other well-characterized reagents. Tags are especially important for high-throughput assays, allowing the detection of all proteins in a single experiment in an easy and cost effective way. The DNA sequence coding for the tag is inserted in the plasmid, in frame with the ORF of interest, and during the transcription and translation both proteins are expressed as a single chimeric entity. It is possible to create fusion proteins with the tag on the amino-terminal (N-terminal) or carboxyl-terminal (C-terminal) side of the ORF. N-terminal tagged proteins require a start codon prior to the tag, while the C-terminal tagged proteins can use the natural start codon of the ORF; however, the natural stop codon present in the ORF must be removed to allow the translation of the carboxyl-tag. ORFs from which the stop codon was removed and which allow the translation to continue through them are known as open or fusion ORFs, in contrast to closed ORFs (natural stop codon intact).
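The distinction between closed and open (fusion) ORFs can be expressed as a simple sequence operation. The following is a minimal sketch, assuming the ORF is supplied as an in-frame DNA string that starts at the start codon; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_closed(orf: str) -> bool:
    """True if the in-frame ORF ends with a stop codon (closed ORF)."""
    orf = orf.upper()
    return len(orf) % 3 == 0 and orf[-3:] in STOP_CODONS


def open_orf(orf: str) -> str:
    """Return the open (fusion) form of an ORF by dropping a terminal stop
    codon, so that translation can continue into a C-terminal tag."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    return orf[:-3] if orf[-3:] in STOP_CODONS else orf


if __name__ == "__main__":
    closed = "ATGGCTGAATAA"
    print(is_closed(closed), open_orf(closed))
```

In a clone collection, storing this flag per ORF makes it straightforward to decide which clones are suitable for C-terminal tagging.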
2.4 Expression systems
Using expression clones, proteins can be produced in cell-based (in vivo) or cell-free (in vitro) systems. Cell-based methods rely on introducing the plasmid into the cells, where the cellular transcriptional and translational machinery is responsible for the protein synthesis. After expression, the protein of interest can be purified from cell lysates, cell membranes, or the media, using antibodies or affinity ligands designed against the protein itself or any tag that is present. Prokaryotic and eukaryotic systems have been used successfully for protein expression, among them bacteria, insect, yeast, plant, and mammalian cells. The choice of the appropriate system for exogenous protein expression depends on the importance of protein folding and functional PTMs, as well as cost. Prokaryotic systems are known for the production of recombinant proteins in high yields, but PTMs are not common; whereas eukaryotic systems show the opposite trend [36, 37].
A cell-free system for protein expression is a mixture of all necessary components for translation and requires RNA as a template. Cell-free protein expression can be performed in two steps (RNA transcription followed by translation) or combined into a single step, using coupled systems. The former usually has higher yields but is less convenient. In the coupled system, DNA (PCR product or expression clones) can be used as a template, simplifying the overall process. Currently, these systems are available from Escherichia coli, wheat germ extract, and mammalian cell extracts, including human and rabbit [38-42].
2.5 Scaling up: High-throughput cloning
Scaling for high-throughput cloning requires planning and strategy to handle thousands of genes simultaneously. High-throughput cloning is usually a multistep protocol that starts with the gene capture reaction in which the ORF is inserted in the entry vector, and it is followed by the transfer from the entry clone to the expression vector of choice.
Several distinct aspects of the cloning strategy should be analyzed prior to beginning the project and they will help guide the decision of which cloning system is most suitable for the desired application.
One of the first aspects to consider is the enzyme involved in each step of the cloning process and its fidelity, efficacy, and stability. DNA polymerases, whether DNA or RNA dependent, have the lowest fidelity among all cloning-related enzymes, introducing errors in the DNA sequence at rates that vary from 10⁻⁵ to 10⁻⁷ mutation frequency/bp. Those enzymes are irreplaceable for the amplification of genes from DNA or RNA molecules. For the gene capture (cloning into the master vector) and transfer (from the master clone to the expression vector), cloning protocols that use REs and/or recombinases are the best option, as those enzymes do not alter the insert, operating only on selected regions in the DNA used for the ligation or recombination.
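To get a feel for what these polymerase error rates mean for a clone, the quoted range can be converted into the fraction of inserts expected to be error free. This is only a sketch: it treats the quoted figure as the overall per-base chance that a finished insert carries a mutation, and it ignores the number of PCR cycles and any sequence-dependent effects. The 1500 bp insert length is an illustrative assumption.

```python
def error_free_fraction(error_rate_per_bp: float, insert_length_bp: int) -> float:
    """Fraction of inserts expected to carry no mutation, assuming each base
    is altered independently with probability `error_rate_per_bp`."""
    return (1.0 - error_rate_per_bp) ** insert_length_bp


if __name__ == "__main__":
    for rate in (1e-5, 1e-6, 1e-7):
        fraction = error_free_fraction(rate, 1500)
        print(f"rate {rate:.0e}/bp, 1500 bp insert: {fraction:.4%} error free")
```

Even at the low-fidelity end of the range most inserts are expected to be correct, but across thousands of ORFs a noticeable number of mutated clones will still appear, which is why full-length sequencing of the entry clone remains necessary.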
Efficiency is another important aspect for a cloning enzyme. This is defined as the percentage of the selected genes that can be captured. In addition, in systems that include the master vector/expression vector paradigm, transfer efficiency from the master clone to expression vector is important because this process will occur more frequently during the lifetime usage of the plasmid as new expression vectors are created and large numbers of clones are transferred for additional biological applications. When scaling up, the goal is to approach 100% efficiency in the capture as well as subsequent transfers. Procedures with efficiencies as low as 90% will require additional reaction rounds to succeed for all clones, adding cost and time in the overall project .
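The effect of efficiency on project effort can be estimated with simple arithmetic. The sketch below assumes that every round succeeds independently with the same efficiency for every remaining clone; in practice some failures are clone specific and repeat from round to round, so this is an optimistic estimate. The project size and the stopping threshold are hypothetical.

```python
def rounds_needed(n_clones: int, efficiency: float, max_missing: float = 0.5) -> int:
    """Number of cloning rounds until the expected number of clones still
    missing drops to `max_missing` or below."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must be strictly between 0 and 1")
    remaining = float(n_clones)
    rounds = 0
    while remaining > max_missing:
        remaining *= 1.0 - efficiency  # clones that failed this round too
        rounds += 1
    return rounds


if __name__ == "__main__":
    for eff in (0.90, 0.99):
        print(f"efficiency {eff:.0%}: {rounds_needed(10_000, eff)} rounds")
```

For a hypothetical 10 000-clone project, this model needs five rounds at 90% efficiency but only three at 99%, and the first repeat round alone involves about 1000 clones in the former case versus about 100 in the latter.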
High-throughput cloning enzymes should also be stable and robust. Ideally the shelf life should be long and the lot-to-lot variation should not affect the reaction efficiency. The enzyme should also be fairly independent of the concentration of substrates. In high-throughput operations, the quantification and normalization of the substrates (PCR product for instance) are not practical and increase the cost of the project.
An essential part of the process is validating the clones. The newly generated clones are sent for DNA sequencing, the reads are assembled into contigs and then aligned against their reference sequences, to report any discrepancies and their consequence to the polypeptide. The insert in the entry vector should be fully sequenced to detect any mutation introduced during the PCR reaction. Users should define a set of acceptable discrepancy thresholds to consistently accept or reject the plasmids. Once the entry clone with the desired quality is obtained, the insert may be subcloned to many expression vectors. Successful transfers can be verified by restriction analysis, colony PCR, or a single sequencing read that covers the junction between insert and vector. These quick transfer verification experiments can be employed as long as the transfer method is conservative and does not involve any amplification steps.
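The reporting step, comparing a clone against its reference and stating the consequence for the polypeptide, can be sketched as follows. This simplified example handles substitutions only: it assumes that the assembled clone sequence and the reference are already aligned, in frame, and of equal length, whereas a real pipeline would first align the contigs and also handle insertions and deletions. The category names are illustrative.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_discrepancies(reference: str, clone: str) -> list:
    """Return (codon_number, ref_codon, clone_codon, kind) for each codon
    that differs between the reference and the clone."""
    reference, clone = reference.upper(), clone.upper()
    if len(reference) != len(clone) or len(reference) % 3 != 0:
        raise ValueError("sketch expects equal-length, in-frame sequences")
    report = []
    for i in range(0, len(reference), 3):
        ref_codon, obs_codon = reference[i:i + 3], clone[i:i + 3]
        if ref_codon == obs_codon:
            continue
        ref_aa, obs_aa = CODON_TABLE[ref_codon], CODON_TABLE[obs_codon]
        if ref_aa == obs_aa:
            kind = "silent"
        elif obs_aa == "*":
            kind = "nonsense"
        elif ref_aa == "*":
            kind = "stop_lost"
        else:
            kind = "missense"
        report.append((i // 3 + 1, ref_codon, obs_codon, kind))
    return report


if __name__ == "__main__":
    print(classify_discrepancies("ATGGCTGAATAA", "ATGGCCGAATAA"))
```

An acceptance rule can then be stated on top of this report, for example accepting clones with silent changes only; the actual thresholds are a project decision.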
The entire cloning process, from PCR to the final validation, should be automation friendly, avoiding human errors and allowing tracking of the location and status of each plasmid . The automation can be accomplished with any liquid handling system able to process 96-well plates and also rearray selected wells into a new plate. The latter is especially important for the design of new plates with sequence verified clones. A laboratory information management system tracks clones with a combination of barcoded plates and maps of which clones are in the wells of those plates. Each transfer to a new plate adds a barcode to the plate's history so that a user can track each plate and well visited by a clone from concept to completion.
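The plate-and-well bookkeeping described above can be pictured with a small data structure. This is a toy sketch rather than a real laboratory information management system; the class, the barcode strings, and the row-wise fill order of the new plate are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CloneRecord:
    """One clone and every (plate barcode, well) position it has occupied."""
    clone_id: str
    history: list = field(default_factory=list)

    def move(self, barcode: str, well: str) -> None:
        self.history.append((barcode, well))

    @property
    def location(self):
        return self.history[-1] if self.history else None


def rearray(records: dict, picks: list, new_barcode: str) -> None:
    """Place the selected clones into consecutive wells (A01, A02, ...) of a
    new 96-well plate and extend each clone's history."""
    wells = [f"{row}{col:02d}" for row in "ABCDEFGH" for col in range(1, 13)]
    if len(picks) > len(wells):
        raise ValueError("a 96-well plate holds at most 96 clones")
    for well, clone_id in zip(wells, picks):
        records[clone_id].move(new_barcode, well)


if __name__ == "__main__":
    records = {cid: CloneRecord(cid) for cid in ("ORF001", "ORF002", "ORF003")}
    for well, cid in zip(("A01", "A02", "A03"), records):
        records[cid].move("PLATE-A", well)
    # Only ORF003 and ORF001 passed sequence validation in this toy example.
    rearray(records, ["ORF003", "ORF001"], "PLATE-B")
    print(records["ORF003"].history)
```

Because each move is appended rather than overwritten, the full path of a clone through the project can be reconstructed from its record.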
As one can expect, informatics support is necessary for any high-throughput project, and cloning is no exception. The use of a laboratory information management system is recommended for the tracking of all steps in the cloning process. The data analysis for the validation is easier if it can be performed automatically in a platform that allows the analysis of thousands of clones simultaneously. Both systems should be integrated to permit the identification of clones that were not captured in the first round and require another attempt at cloning. By analyzing thousands of clones at once, time and money are dramatically reduced .
3 Examples of high-throughput gene cloning systems
3.1 Restriction enzyme-based cloning
The very first protocol available for DNA cloning was based on RE and ligase [2, 3]. This classic method is not effective for high-throughput cloning because the commonly used REs have recognition sequences that occur too frequently within ORFs. The exception to this is a RE-based method called Flexi cloning system from Promega.
3.1.1 Flexi® cloning system
The Flexi cloning system relies on the use of the two most rare-cutting REs in the human ORFeome (SgfI and PmeI), with a combined digestion frequency lower than 1.2% in human ORFs. This system is not limited to the cloning of human collections since those enzymes are rare cutters in many organisms, such as mouse (1.2%), S. cerevisiae (2.96%), Arabidopsis thaliana (2.4%), and E. coli (6.35%).
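Whether a given ORF is compatible with this system can be checked by scanning it for internal sites before primers are ordered. The sketch below uses the PmeI site given in this tutorial (GTTTAAAC) and assumes GCGATCGC for SgfI, which should be confirmed against the enzyme supplier's documentation. Both sites are palindromic, so scanning one strand is sufficient. The function names are hypothetical.

```python
FLEXI_SITES = {"SgfI": "GCGATCGC", "PmeI": "GTTTAAAC"}  # SgfI site is an assumption


def find_all(seq: str, site: str) -> list:
    """0-based start positions of every (possibly overlapping) occurrence."""
    hits, start = [], seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


def find_internal_sites(orf: str) -> dict:
    """Map each enzyme name to the positions of its site within the ORF."""
    orf = orf.upper()
    return {name: find_all(orf, site) for name, site in FLEXI_SITES.items()}


def is_flexi_compatible(orf: str) -> bool:
    """True if neither enzyme cuts inside the ORF."""
    return not any(find_internal_sites(orf).values())


def fraction_cut(orfs: list) -> float:
    """Fraction of a collection that contains at least one internal site."""
    return sum(not is_flexi_compatible(o) for o in orfs) / len(orfs)


if __name__ == "__main__":
    print(find_internal_sites("ATGGCGATCGCTTAA"))
```

Applying `fraction_cut` to an ORFeome gives the kind of combined digestion frequency quoted above for each organism.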
To clone a gene into the Flexi system (Fig. 3A), primers with adaptor sequences are used. The SgfI site is added 5′ of the start codon, while the PmeI site is added at the 3′ end of the CDS, keeping the cloning process directional. After the PCR amplification, the GOI, flanked by the restriction sites, and the Flexi compatible vector are digested with a mixture of SgfI and PmeI. The digestion of the Flexi vector releases a lethal gene, which selects against the parental vector, and prepares the vector backbone to receive the insert. Subsequently, the digested products are ligated and transformed into competent cells, and colonies are sequenced to isolate clones with the desired insert. Typically, the number of false positives is low and the sequencing of just a few colonies is enough to ensure the acquisition of the desired clone.
The Flexi cloning system has slightly different facets depending upon whether and where an in-frame fusion tag is used (Fig. 3B). N-terminal tagged proteins are generated from vectors that position the tag sequence upstream of the SgfI site (tag – SgfI – GOI – PmeI). The translation starts from the AUG of the tag, continues through the SgfI site, which encodes the tripeptide Ala-Ile-Ala, and ends with the natural stop codon of the GOI (if it is included), or at the stop codon provided by the vector backbone (if it is not). To create proteins with carboxyl-terminal tags using the Flexi system, the tag sequence must be downstream of the PmeI site (SgfI – GOI – PmeI – tag). The PmeI restriction site (GTT-TAA-AC) encodes Val-Stop-Thr, requiring the removal of the stop codon to enable the ribosomes to continue translation of the tag. The PmeI RE produces a blunt cut in the middle of the stop codon, which can be ligated to any other blunt product generated by a different RE (e.g., EcoICRI or its isoschizomer Eco53kI). When the PmeI-digested PCR product is ligated to the EcoICRI-digested vector, the new sequence (GTT-TCT-C) has no stop codon and encodes Val-Ser-X, with X being an amino acid encoded by the codon CNN, where N is a nucleotide present in the vector.
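The codon-level reading of these junctions can be checked directly against the standard genetic code. The short sketch below translates only the complete codons of each sequence quoted in this section; the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_complete_codons(dna: str) -> str:
    """Translate in frame from the first base, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def residues_for(prefix: str) -> set:
    """All amino acids encoded by codons that start with `prefix`."""
    return {aa for codon, aa in CODON_TABLE.items() if codon.startswith(prefix)}


if __name__ == "__main__":
    print(translate_complete_codons("GTTTAAAC"))  # intact PmeI site -> V*
    print(translate_complete_codons("GTTTCTC"))   # PmeI/EcoICRI junction -> VS
    print(sorted(residues_for("C")))              # CNN -> ['H', 'L', 'P', 'Q', 'R']
```

The intact PmeI site reads Val-Stop, whereas the GTT-TCT junction reads Val-Ser with no stop codon. The partial third codon of the junction (CNN) can only encode His, Leu, Pro, Gln, or Arg, depending on the vector bases that follow.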
Once the ORF is inserted into a Flexi vector, this clone can be used as a donor clone. The only exception would be Flexi clones with carboxyl-terminal tags, in which the PmeI site was lost during the cloning process. The donor clone is digested with SgfI and PmeI to release the GOI, while the acceptor vector is digested by the REs that flank the lethal gene, SgfI/PmeI, or SgfI/EcoICRI. Following the digestions, ligation and transformation are performed. Distinct selection markers between donor and acceptor vectors and negative selection by a lethal gene permit the recovery of only the clones with the GOI in the acceptor vector. Theoretically, sequencing is not required after this transfer, as no amplification or change in the frame occurs; however, we recommend at least a single sequencing reaction to confirm the presence of the insert.
The Flexi cloning system was successfully used in a high-throughput project, where more than 3500 human ORF clones, larger than 4000 bp each, were generated and expressed both in cell-free expression systems and in vivo [46, 47].
3.2 Recombination based cloning
The most popular high-throughput cloning systems are based on recombinational cloning , and several groups implemented this technology to construct expression libraries [48-52].
Recombination cloning is mediated by recombinases at site-specific sequences and eliminates the use of REs and ligases. These sites are long sequences, ranging from 30 bp up to a couple hundred bps, which makes them extremely rare in the genome. The infrequency of these recombination sites enables virtually any insert DNA to be cloned without fear of interrupting the CDS. Recombinase-based systems have a high efficiency rate, making them robust, fast, and reliable. They also do not require individual inspection of each target sequence. These characteristics make the entire process easily automated.
The first collection produced using recombination employed the natural homologous recombination property of S. cerevisiae. This in vivo strategy was replaced by more efficient in vitro recombination cloning systems, such as Creator™ DNA Cloning System (Clontech) and Gateway® Technology (Life Technologies).
3.2.1 Creator™ cloning system (Cre-loxP)
The Creator technology employs the Cre-loxP site-specific recombination system isolated from phage P1. Cre is a 38 kDa enzyme that catalyzes the recombination between two loxP sites, each a 34 bp sequence composed of two 13 bp inverted repeats separated by an 8 bp spacer region [54, 55]. The asymmetric nature of the spacer permits only unidirectional recombination, making it compatible with high-throughput DNA cloning.
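The architecture of the loxP site can be verified on its sequence. The sequence used below is the commonly cited wild-type loxP site; it is not given in this tutorial and is included here as an assumption for illustration.

```python
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"  # commonly cited wild-type loxP (assumption)


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_loxp(site: str):
    """Split a 34 bp site into left arm (13 bp), spacer (8 bp), right arm (13 bp)."""
    if len(site) != 34:
        raise ValueError("expected a 34 bp site")
    return site[:13], site[13:21], site[21:]


if __name__ == "__main__":
    left, spacer, right = split_loxp(LOXP)
    # The arms are inverted repeats: one is the reverse complement of the other.
    print("inverted repeats:", right == reverse_complement(left))
    # The spacer is not its own reverse complement, which gives the site a direction.
    print("asymmetric spacer:", spacer != reverse_complement(spacer))
```

Because the spacer differs from its reverse complement, the site reads differently on the two strands, and this is what gives each loxP site an orientation.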
3.2.1.1 Master clones for the Creator system
The Creator system requires the generation of master clones. The strategy used in this process is independent of the Cre-loxP system, but the final product must have the GOI flanked by loxP sites, one upstream and another downstream. These two recombination sites are required to transfer the insert into the expression vector. Many protocols can be used to create the master clones, but we will focus on the In-Fusion PCR Cloning (Clontech).
The In-Fusion system is a technology designed to clone linear DNAs that share at least 15 bp of identity at the ends (Fig. 4A). In this methodology, the entry vector should be linearized, with any RE, at the site where the GOI will be inserted. Primers to amplify the GOI should add a 15 bp overlap with the ends of the digested vector plus enough gene-specific sequence to amplify the gene. Each primer will match a different end of the digested vector; therefore, the PCR product will be inserted directionally into the entry vector. In the In-Fusion reaction mix, the ends of the linear dsDNA (insert and vector) are partially digested to expose the overlapping sequence. The single-stranded overlapping sequences anneal and the two DNA molecules are noncovalently joined together. The In-Fusion product is transformed into E. coli for DNA repair (covalent vector-insert join) and clone selection.
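The primer design rule described here (15 bp of vector overlap plus a gene-specific part) can be sketched as follows. The vector ends are given as top-strand sequences immediately flanking the linearization site; the 20-base gene-specific length and all sequences in the usage example are hypothetical, and a real design would also check melting temperatures.

```python
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def infusion_primers(vector_left: str, vector_right: str, orf: str,
                     overlap: int = 15, gene_specific: int = 20):
    """Return (forward, reverse) primers, both written 5' to 3'.

    vector_left  : top-strand vector sequence ending at the cut site
    vector_right : top-strand vector sequence starting at the cut site
    """
    vector_left, vector_right, orf = vector_left.upper(), vector_right.upper(), orf.upper()
    if min(len(vector_left), len(vector_right)) < overlap:
        raise ValueError("vector ends are shorter than the requested overlap")
    if len(orf) < gene_specific:
        raise ValueError("ORF is shorter than the gene-specific region")
    forward = vector_left[-overlap:] + orf[:gene_specific]
    reverse = reverse_complement(orf[-gene_specific:] + vector_right[:overlap])
    return forward, reverse


if __name__ == "__main__":
    left = "GGATCCGAATTCAAGCTTGCGGCCGC"   # hypothetical vector end
    right = "CTCGAGTCTAGAGGGCCCTTCGAA"    # hypothetical vector end
    orf = "ATG" + "GCT" * 20 + "TAA"      # hypothetical ORF
    print(infusion_primers(left, right, orf))
```

Because the two primers carry different vector ends, the resulting PCR product can only anneal to the linearized vector in one orientation.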
A couple of different entry vectors for Creator are available; however, they all share the same cloning cassette (loxP – MCS – asCmR – loxP), where MCS is the multiple cloning site and asCmR is the chloramphenicol resistance gene cloned in the anti-sense direction. The anti-sense orientation of the CmR is important later for the selection of the expression clone and will be discussed in the next section. After the recombination, the entry clone will contain the GOI in the multiple cloning site. Other In-Fusion like systems are available, such as the Choo-Choo PCR Cloning  or Cold Fusion Cloning Kit .
3.2.1.2 Creator expression clones
A Creator compatible expression vector must have the loxP site downstream of a promoter to allow the expression of the GOI. The usual structure of this class of vectors includes a cassette composed of promoter – loxP – asProkaryotic promoter, with the latter being a promoter cloned in the antisense direction. Cre mediates the recombination between loxP sites present in the entry clone and expression vectors, allowing the insertion of the elements flanked by the loxP in the entry clone (loxP – GOI –asCmR – loxP) into the loxP site of the expression vector. The desired product of the Cre reaction is a new expression clone with an expression cassette composed of: promoter – loxP – GOI –asCmR – loxP – asProkaryotic promoter. Selection with chloramphenicol demands that the prokaryotic promoter from the expression clone aligns correctly with the chloramphenicol gene from transfer cassette ensuring the desired recombinant clone (Fig. 4A).
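The rearrangement of elements during the Cre reaction can be followed with a purely symbolic model. The sketch below represents each cassette as an ordered list of element names taken from the text and does not model the enzymology; the function names and the simple selection rule are illustrative assumptions.

```python
EXPRESSION_VECTOR = ["promoter", "loxP", "asProkaryotic promoter"]
ENTRY_CASSETTE = ["loxP", "GOI", "asCmR", "loxP"]


def cre_transfer(vector: list, entry_cassette: list) -> list:
    """Insert the loxP-flanked elements of the entry clone at the single loxP
    site of the expression vector."""
    site = vector.index("loxP")
    transferred = entry_cassette[1:]  # GOI, asCmR, and the second loxP
    return vector[:site + 1] + transferred + vector[site + 1:]


def chloramphenicol_selectable(cassette: list) -> bool:
    """CmR is read in the antisense direction, so it is only expressed when an
    antisense prokaryotic promoter lies to its right."""
    if "asCmR" not in cassette or "asProkaryotic promoter" not in cassette:
        return False
    return cassette.index("asProkaryotic promoter") > cassette.index("asCmR")


if __name__ == "__main__":
    product = cre_transfer(EXPRESSION_VECTOR, ENTRY_CASSETTE)
    print(" – ".join(product))
    print(chloramphenicol_selectable(product))
```

In this model neither parental cassette is selectable on chloramphenicol by itself; only the recombinant arrangement places the antisense promoter where it can drive the resistance gene.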
The Creator system can be used to express proteins with tags (Fig. 4B). Proteins with N-terminal tags are obtained from expression clones where a tag is present between the promoter and the loxP site. The translation will start at the tag and progress through the loxP site, which adds 12 amino acids between the upstream tag and the GOI. Creator can also be used for the expression of proteins with carboxyl-terminal tags in eukaryotic cells, through an RNA splicing mechanism. The splice donor sequence (SD), located at the carboxyl end of the GOI, and the splice acceptor sequence (SA), situated on the other side of the prokaryotic promoter in the expression clone (loxP – GOI – SD – asCmR – loxP – asProkaryotic promoter – SA – C-terminal tag), are used by the cellular splicing machinery. After the splicing, the RNA sequence between SD and SA is removed and the mature RNA expresses a fusion protein with a C-terminal tag. This system is very efficient and works well for cells with splicing machinery, such as tissue culture cells. For expression systems that lack splicing machinery, the only way to express a C-terminal tagged protein is to include the tag already linked to the GOI in the entry clone as part of the transferred ORF.
3.2.2 Gateway cloning systems (phage lambda—attB/P)
Analogous to the Creator system, the Gateway® Cloning System (Life Technologies) is based on a recombination event that occurs naturally, in this case between the bacteriophage λ (phage λ) and its host E. coli. The phage λ is a virus that infects bacteria and either propagates inside the host, leading to cell lysis, or integrates into the host genome and stays in a dormant form. The integration occurs between specific sites, named attachment sites (att sites), in a reaction mediated by a set of enzymes expressed by both phage and bacteria. This process is reversible; during cell stress, the phage is excised from the host genome and reenters the lytic cycle. The enzymes necessary for phage integration/excision and the DNA sites recognized in those processes are used in the Gateway system.
3.2.2.1 Gateway entry clones (BP reactions)
The generation of the entry clones for the Gateway system mimics the phage integration into the host genome. The phage DNA is flanked by the attP site (phage attachment site) and integrates into the bacterial att site, named attB. The two sites are distinct in length and sequence; however, they share a 15 bp core sequence that is used for the recombination. The attB and attP sites are cleaved within the core sequence, recombined with the other site, and the two distinct DNA molecules are religated, creating a new DNA molecule that possesses half of each parental att site (Fig. 5A). In the Gateway system, this reaction is recreated in vitro with bacteriophage λ integrase (Int) and E. coli integration host factor (IHF) proteins (BP Clonase™ enzyme mix) in a process called the BP reaction. In order to keep the BP reaction directional, the attB and attP sites were slightly modified by creating different sets of core sequences to produce attB1/attB2 and attP1/attP2 sites, respectively, so that attB1 recombines only with attP1 and attB2 only with attP2.
For ORF cloning using Gateway, the first step is acquisition of the GOI flanked by attB1/attB2 sites, which are the shortest of the att sequences. The easiest way to obtain this molecule is through PCR amplification using primers with adaptors for the attB sites. Usually, the PCR is performed in two rounds, mainly because the attB site is 25 bp long and often an equal number of bases is needed to amplify the GOI. Combining both sequences in a single primer can therefore increase the cost of cloning and the likelihood of obtaining a mutated or truncated primer. In our experience, error rates in primers increase rapidly as primers exceed 45 bases in length. In the first PCR, a primer with a gene-specific sequence and a portion of the attB site is used to amplify the GOI. This product is then the template for the second PCR, which uses a universal primer set that overlaps with the att sequences only and adds the full attB sequences. High-quality universal primers can be produced in large batches and tested, creating a reagent that can be used for many subsequent PCR reactions. Moreover, a single PCR reaction using both sets of primers (gene specific and universal) can be performed, saving time and money.
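The two-round primer layout described above can be sketched in a few lines of Python. This is only an illustration: the adapter strings are placeholders and not the real attB1/attB2 sequences, and the function name, the 20-base gene-specific stretch, and the 12-base overlap are assumptions chosen for the example.

```python
# Placeholder adapters: NOT the real attB sites. Substitute the sequences
# given in the vendor manual before ordering any primers.
ATTB1_PLACEHOLDER = "AAAACCCCGGGGTTTTAAAACCCCG"  # 25 nt stand-in for attB1
ATTB2_PLACEHOLDER = "TTTTGGGGCCCCAAAATTTTGGGGC"  # 25 nt stand-in for attB2

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def two_round_primers(orf, adapter_fwd, adapter_rev, gene_len=20, overlap=12):
    """Return ((fwd1, rev1), (universal_fwd, universal_rev)) for one ORF.

    Round 1 primers carry only the 3' portion of each adapter plus a
    gene-specific stretch; round 2 uses the full adapters, which anneal
    to the partial adapters added in round 1.
    """
    fwd1 = adapter_fwd[-overlap:] + orf[:gene_len]
    rev1 = adapter_rev[-overlap:] + reverse_complement(orf[-gene_len:])
    return (fwd1, rev1), (adapter_fwd, adapter_rev)


# Hypothetical toy ORF used only to exercise the function.
orf = "ATGAAAGCTCTGGAACGTCTGATTGCCGAAGGTAAATAA"
(fwd1, rev1), (uni_fwd, uni_rev) = two_round_primers(
    orf, ATTB1_PLACEHOLDER, ATTB2_PLACEHOLDER
)
single_primer_length = len(ATTB1_PLACEHOLDER) + 20  # one-round alternative
print(len(fwd1), len(rev1), single_primer_length)
```

With these assumed lengths, each gene-specific primer is 32 bases, whereas a single primer carrying the whole adapter would be 45 bases, the length at which the text reports primer error rates starting to climb.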
After the GOI has been amplified with the complete attB sites, the BP reaction inserts it into the entry vector. In the Gateway system, all entry vectors possess a common death cassette (attP1 – ccdB – CmR – attP2), in which ccdB codes for a DNA gyrase inhibitor that blocks cell growth in most cell types. ccdB selects against the empty entry vector, whereas CmR is important to propagate the empty entry vector in ccdB-resistant cells. During the BP reaction, the attB and attP sites recombine, generating a new att site called attL, and the death cassette is replaced by the GOI. The new cassette of the entry clone is attL1 – GOI – attL2. The clones of interest are selected in ccdB-sensitive cells in the presence of the antibiotic for which the entry vector carries a resistance gene.
3.2.2.2 Gateway expression clones (LR reactions)
Cloning into the expression vector mimics the biology of the phage λ during the viral DNA excision from the host genome. The recombination occurs between attL and attR sites, and is therefore called the LR reaction, following the same principles described previously for the attB/attP sites (Fig. 5B). The LR reaction is catalyzed by bacteriophage λ Int and Excisionase (Xis) proteins, and the E. coli IHF protein (LR Clonase™ enzyme mix).
Empty expression vectors with their death cassette (attR1 – ccdB – CmR – attR2) are used in the LR reaction in conjunction with an entry clone that includes the GOI flanked by attL sites (attL1 – GOI – attL2). The final product is the expression clone with the newly reconstituted attB sites flanking the GOI (attB1 – GOI – attB2). Once again, the selection of the clones of interest is accomplished using the antibiotic resistance genes present in the vector backbones and the death cassette. Vectors with the wrong selection marker (entry vector), or carrying the death cassette (empty expression vector) are eliminated.
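The bookkeeping of att sites and selection markers across the BP and LR reactions can be summarized with a small model. This is a sketch, not a simulation of the enzymology: the `Construct` class, the function names, and the kanamycin/ampicillin backbone markers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construct:
    backbone: str  # resistance carried by the backbone ("" for a linear PCR product)
    site: str      # att site type flanking the payload
    payload: str   # "GOI" or the death cassette "ccdB-CmR"


# attB x attP gives attL + attR (BP); attL x attR gives attB + attP (LR).
PARTNER = {"attB": "attP", "attP": "attB", "attL": "attR", "attR": "attL"}
SITE_AFTER = {"attB": "attL", "attP": "attR", "attL": "attB", "attR": "attP"}


def recombine(x: Construct, y: Construct):
    """Swap the payloads between two backbones and convert the flanking sites."""
    if PARTNER[x.site] != y.site:
        raise ValueError(f"{x.site} does not recombine with {y.site}")
    return (
        Construct(y.backbone, SITE_AFTER[x.site], x.payload),
        Construct(x.backbone, SITE_AFTER[y.site], y.payload),
    )


def grows(c: Construct, plate_antibiotic: str) -> bool:
    """Colony forms in ccdB-sensitive cells only without ccdB and with the right marker."""
    return c.backbone == plate_antibiotic and "ccdB" not in c.payload


# Assumed markers: Kan on the entry backbone, Amp on the expression backbone.
pcr_product = Construct("", "attB", "GOI")
entry_vector = Construct("Kan", "attP", "ccdB-CmR")
entry_clone, bp_byproduct = recombine(pcr_product, entry_vector)      # BP reaction

expression_vector = Construct("Amp", "attR", "ccdB-CmR")
expression_clone, lr_byproduct = recombine(entry_clone, expression_vector)  # LR reaction

for c in (entry_clone, expression_vector, expression_clone, lr_byproduct):
    print(c, "grows on Amp:", grows(c, "Amp"))
```

In this model only the expression clone (attB – GOI – attB on the Amp backbone) survives on the ampicillin plate, which mirrors the selection logic described in the text.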
N- or C-terminal tagged proteins can be obtained using the Gateway system. The N-terminal tag is placed in the expression vector upstream of the attR1 site, whereas the C-terminal tag is located downstream of the attR2 site. The transcription and translation will occur through the respective att site, adding nine extra amino acids between the tag and the gene.
The Gateway cloning system has been employed for several high-throughput ORF cloning projects, including human [48, 62-66], E. coli, several distinct viruses, Francisella tularensis, Vibrio cholerae, A. thaliana [51, 70], C. elegans [71, 72], and Plasmodium falciparum, among others. Because the system is so widely used, many expression vectors with distinct features and for different species are available.
3.3 Alternative high-throughput cloning methods
Here, we present some alternative cloning methods that either have been employed in high-throughput cloning, such as In-Fusion and ligation-independent cloning, or have the potential to be used in such applications in the future.
3.3.1 In-Fusion cloning system
The In-Fusion cloning system was discussed previously as an option for cloning genes into Creator entry vectors; however, this system is very flexible and can be used independent of the Creator cloning system. Any vector can be used in the In-Fusion system with the only requirements being the vector linearization and use of inserts with end sequences homologous to the vector ends, which can be easily obtained by PCR (Fig. 6A).
Besides ORF cloning into vectors, another important application of the In-Fusion system is DNA assembly, where several fragments are cloned simultaneously. For example, those fragments can be part of a single gene (capture of a long transcript using tiling fragments, insertion, or deletion of DNA fragments), a gene promoter for reporter assays (generation of promoters with different combinations of domains), or vector features (promoter, selection marker, tags) combined to generate a new vector.
3.3.2 Gibson assembly method
The Gibson method is a simple cloning strategy that employs three commonly used enzymes: a 5′ exonuclease (which digests the 5′ ends of the dsDNA, exposing ssDNA), a DNA polymerase, and a ligase (Fig. 6B). Multiple DNA fragments with overlapping sequences (20 bp) can be cloned. The principle is that DNA fragments treated with the 5′ exonuclease expose their overlapping sequences, allowing the fragments to anneal to each other. Because the exonuclease can expose more than just the overlapping sequence, DNA polymerase is used to fill the resulting gaps. Finally, ligase covalently joins the DNA fragments and generates the final cloning product. This strategy was successfully used to assemble DNA molecules as large as 900 kb [75, 76].
The Gibson assembly method is very similar to the In-Fusion cloning system, with the advantage of a more than tenfold cost saving; all enzymes required are available from several vendors at accessible prices.
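Before ordering fragments for an overlap-based assembly, it can help to confirm in silico that each fragment ends with the sequence that starts the next one. The sketch below performs that check and joins the fragments; the function name and the toy sequence are assumptions, and it models only exact overlaps on one strand, not the enzymatic steps.

```python
def assemble(fragments, overlap=20):
    """Join fragments in order, requiring an exact `overlap`-bp shared end."""
    product = fragments[0]
    for nxt in fragments[1:]:
        if product[-overlap:] != nxt[:overlap]:
            raise ValueError("adjacent fragments do not share the expected overlap")
        # Keep the shared sequence once, then append the rest of the fragment.
        product += nxt[overlap:]
    return product


# Hypothetical 72-nt target split into three fragments with 20-bp overlaps.
target = "ATGACCGGTAAAGCTCTGGAACGTCTGATTGCCGAAGGTAAACTGCATCACCATCACCATCACTAAGGATCC"
fragments = [target[:40], target[20:60], target[40:]]
print(assemble(fragments) == target)
```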
3.3.3 Ligation independent cloning (LIC)
LIC is a cloning method that does not employ REs, ligases, or recombinases, making this process cheap and easily adaptable to high throughput [77, 78]. This method relies on the annealing of complementary ssDNA and the capability of bacterial cells to close gaps in dsDNA (Fig. 6C).
The initial step of LIC is the PCR amplification of the insert with primers that carry a 12-nucleotide adapter sequence lacking one nucleotide type, for example cytosine, so that the complementary strand of the adapter lacks guanine. Directional cloning is achieved using two distinct adapters, one at each end. To generate ssDNA ends, the PCR products are treated with T4 DNA polymerase in the presence of a single nucleotide type, in this example dGTP. In the absence of the other three nucleotides, T4 DNA polymerase acts as a 3′→5′ exonuclease, removing nucleotides from the 3′ end of each strand until it reaches a position where it can act as a polymerase and reinsert the one base it has available. At this position the exonuclease and polymerase activities balance and the enzyme does not advance to the next nucleotide. The adapter region, designed to lack cytosine in the primer strand, is thereby converted into ssDNA ends.
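The chew-back step can be illustrated with a toy model in which the exonuclease trims a strand from its 3′ end and stalls at the first base matching the supplied nucleotide. The adapter and insert sequences below are hypothetical, and the insert is assumed to begin with C and end with G so that the stall falls exactly at the adapter boundary.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def chew_back(strand_5_to_3: str, dntp: str = "G") -> str:
    """Trim from the 3' end until the first base equal to `dntp` is reached.

    Returns an empty string if the strand contains no such base.
    """
    stall = strand_5_to_3.rfind(dntp)
    return strand_5_to_3[: stall + 1]


# Hypothetical 12-nt adapters lacking cytosine (as they appear in the primers).
ADAPTER_1 = "GATGGTAGTAGG"
ADAPTER_2 = "TTAGGAGTTGAG"
insert = "CATGAAACCCGGGTTTTAG"  # toy insert: starts with C, ends with G

top = ADAPTER_1 + insert + reverse_complement(ADAPTER_2)
bottom = reverse_complement(top)

top_trimmed = chew_back(top)        # 3' end recessed by the adapter length
bottom_trimmed = chew_back(bottom)
print(len(top) - len(top_trimmed), len(bottom) - len(bottom_trimmed))
```

Each strand loses 12 bases from its 3′ end, leaving the two 5′ adapter sequences single stranded and available to anneal to a vector prepared with the complementary ends.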
The second step is the preparation of the vector for the cloning. A vector, to be compatible with the LIC method, needs to display single stranded ends complementary to the adapter sequence present in the insert. Two distinct approaches can be used for the acquisition of these vectors: (1) PCR reaction followed by T4 DNA polymerase treatment, similar to what was described earlier for the insert [77, 78]; or (2) insertion of the adapter sequences in the vector, flanking a restriction site; the vector is digested with the RE and then treated with T4 DNA polymerase . The second approach allows the preparation of large concentrations of ready-to-be-used vector, without the PCR amplification, and is more suitable for high-throughput applications.
Inserts and vectors are combined after the T4 DNA polymerase treatment and their complementary single strands align and form a double strand with a nick on each side. This product is introduced in bacteria, where the nicks are closed and the recombinant clone can be replicated.
Several alternatives to the traditional LIC protocol have been described, including sequence and ligation independent cloning (SLIC), polymerase incomplete primer extension, and enzyme-free cloning.
SLIC relies on treatment of insert and vector with T4 DNA polymerase in the absence of nucleotides, allowing the use of any sequence as the cohesive ssDNA ends. This is especially important for cloning of multiple PCR fragments into a single vector, where the addition of adapter sequences at the end of each PCR product would alter the desired insert sequence. High cloning efficiency can be achieved by using RecA during the annealing of vector and insert. RecA is a recombinase associated with DNA repair, and in the SLIC reaction it is used to repair the newly generated clone.
The polymerase incomplete primer extension methodology takes advantage of the fact that a PCR reaction generates a population of partially single-stranded DNA fragments as a result of incomplete primer extension. Inserts amplified with primers containing the appropriate adapter sequences, complementary to the ends of the vector of interest, can be cloned into PCR-amplified vectors or LIC-compatible vectors. This approach was successfully used in semi-high-throughput cloning (500 genes) and mutagenesis.
Enzyme-free cloning is a strategy designed to generate PCR fragments with overhanging ends, without the need for REs or exonuclease treatment. The PCR is performed with a combination of short and long primers, with the latter containing the entire short primer sequence plus the desired overhanging sequence. Each fragment is amplified in two independent PCR reactions, each using one short and one long primer, for example, a short forward primer and a long reverse primer. The PCR products are combined, denatured, and reannealed, generating hybrid molecules with single-stranded extremities. The hybrid molecules can be used as the cloning insert for a LIC-compatible vector.
LIC methodologies are very flexible platforms that can be used in several cloning applications, including single genes into multiple distinct vectors [84, 85], gene assembly , and mutagenesis .
3.3.4 Golden Gate assembly method
The Golden Gate assembly method relies on the use of type IIS REs (IISREs) (Fig. 6D), which are enzymes able to cleave DNA at a defined distance from their nonpalindromic recognition sites. Since the cleavage occurs outside the recognition site, the overhangs can be of any nucleotide sequence, depending solely on the DNA sequence that follows the recognition site.
Golden Gate compatible vectors generally present a death cassette flanked by IISRE restriction sites that create two distinct overhanging sequences, one on each side, making the cloning strategy directional. The insert is amplified with adapter sequences for the IISRE of choice, including the overhanging sequences present in the vector. The vector and insert are designed in such a way that, after digestion with the appropriate IISRE, the recognition site remains in the by-products and only the overhanging sequence is present in the desired DNA fragments (vector and insert). The digestion/ligation of vector and insert can be performed in a single-tube reaction; the finished clone with the correct sequence lacks the IISRE site and is resistant to digestion. This approach was successfully used for DNA shuffling, cloning nine DNA fragments simultaneously in more than 20 different arrangements, and for modular cloning to create multigene constructs [89, 90].
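How a type IIS enzyme leaves user-defined overhangs while removing its own recognition site can be shown with a short sketch. BsaI is used here purely as an example enzyme (the text does not name one); it recognizes GGTCTC and cuts one base downstream on the top strand and five bases downstream on the bottom strand, giving a 4-nt 5′ overhang. The function name and the toy amplicon are assumptions.

```python
SITE = "GGTCTC"     # BsaI recognition site, reading into the insert
SITE_RC = "GAGACC"  # the same site on the opposite strand, at the other end


def release_insert(amplicon: str):
    """Return (left_overhang, fragment, right_overhang) in top-strand notation.

    `fragment` spans both 4-nt overhangs; the recognition sites stay on the
    discarded end pieces.
    """
    left = amplicon.find(SITE)
    right = amplicon.rfind(SITE_RC)
    if left == -1 or right == -1 or right < left:
        raise ValueError("amplicon is not flanked by inward-facing sites")
    start = left + len(SITE) + 1  # skip the site and one spacer base
    end = right - 1               # stop one spacer base before the reverse site
    fragment = amplicon[start:end]
    return fragment[:4], fragment, fragment[-4:]


# Hypothetical amplicon: site, spacer, overhang AATG, insert, overhang GCTT,
# spacer, reverse site.
amplicon = "TT" + "GGTCTC" + "A" + "AATG" + "CCCGGGAAATTT" + "GCTT" + "T" + "GAGACC" + "AA"
left_oh, fragment, right_oh = release_insert(amplicon)
print(left_oh, right_oh, SITE in fragment or SITE_RC in fragment)
```

Because the released fragment carries no recognition site, a correctly ligated product is no longer a substrate for the enzyme, which is what allows digestion and ligation to run in the same tube.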
Slightly modified versions of the Golden Gate assembly method have been proposed. In fragment exchange cloning (FX cloning), for example, the GOI is initially cloned into an entry vector that possesses an IISRE recognition site. The site is oriented facing the death cassette, so that the recognition site and overhanging sequence remain in the vector after digestion. The FX entry clone may then be used for the transfer of the ORF to several FX-compatible expression vectors. At this stage, the reaction is exactly as described previously for the Golden Gate assembly method. Another Golden Gate-based method, named fully automatic single-tube recombination, uses crude PCR extracts for the cloning reaction. This was made possible by using a DNA polymerase inhibitor (aphidicolin) and digesting the template DNA with a methylation-specific RE.
The major limitation of all IISRE-based cloning systems is the possibility that the vector or insert is itself a substrate of the IISRE of choice, requiring the use of alternative cloning approaches, such as site-directed mutagenesis of the endogenous restriction site or generation of vectors compatible with a different IISRE. The main advantage is the lack of long cloning sequences at the extremities of the GOI.
3.3.5 Univector plasmid fusion system
The univector system is a Cre-based system that utilizes loxP sites and the Cre enzyme to clone the GOI into the expression vector (Fig. 6E). The main difference between univector and Creator is the presence of a single loxP site in the entry vector; the empty expression vector likewise carries a single site. Recombination between the two loxP sites results in the fusion of the two vectors, generating the expression vector with the insert. This system was used in high throughput to clone ORFs from Treponema pallidum, the causative agent of syphilis.
3.4 Generation of new expression vectors
Regardless of the cloning system adopted, the generation of new expression vectors follows the same principles. Basically, any standard expression vector can be converted into a high-throughput expression vector by the simple insertion of the appropriate cassette in the position where the GOI should be situated. The differences among the cassettes are responsible for the vector compatibility with the relevant system. The Flexi cloning system uses the cassette SgfI – Death cassette – PmeI, for expression with N-terminal tag, or SgfI – Death cassette – EcoICRI, for carboxyl-terminal tags; the Creator cassette is loxP – asProkaryotic promoter; whereas, Gateway uses the death cassette attR1 – ccdB – CmR – attR2.
Different strategies can be employed to insert the cassette into a vector of interest, such as (i) standard digestion and ligation cloning techniques, (ii) PCR amplification of the cassette, with adaptors for RE sites, followed by digestions and ligations into the vector of interest, or (iii) In-Fusion or LIC-based methods. It should be noted that when constructing vectors for systems that utilize death cassettes, such as Gateway or Flexi, a resistant strain must be used to propagate the vector. It is also possible to modify particular features of the expression vector, such as promoters or tags, to further customize the vector for a particular need. The same set of strategies mentioned before can be used in this scenario. As a final note, the best strategy for developing high-throughput vectors is to design and test the vector using standard molecular biology methods. Once the vector is optimized, then insert the specialized cassette as a last step.
4 Successful example
As an example of high-throughput cloning for protein expression and functional proteomics, we will discuss the cloning of the V. cholerae transcriptome . This high-throughput cloning project used the Gateway cloning system. Prior to the cloning of the entire collection, the protocol was tested on a small selection of the genes (10–20) with a range of size and GC content, to analyze the efficacy of the process and optimize any necessary steps.
The initial step was the identification of all ORFs from the V. cholerae genome by comparing the genome annotation from two distinct sources: the National Center for Biotechnology Information (NCBI) and The Institute for Genomic Research (TIGR). Genomic annotations often contain errors, and the use of multiple sources minimizes this problem and increases the ORFeome coverage. Any errors in the genome annotation can lead to errors in primer design and ultimately impact the overall success of the project. A total of 3887 ORF sequences were identified; each one was assigned a unique clone number, and all the information available regarding the gene and the protein product was collected. The unique clone number is vital to allow the identification of the clone within the tracking system.
The next step was the design of gene-specific primers with the appropriate adapters for the Gateway system. All the primers were designed automatically using in-house software that calculated the primer melting temperature and length for an annealing temperature of 58°C, using the nearest neighbor algorithm [95, 96]. As mentioned before, the PCR for the amplification of the insert with the full attB site was performed as a two-step reaction: first, gene-specific primers carrying half of the attB site were used, and then a second amplification with universal primers completed the full attB site. This kept the average primer length under 50 bases, reducing the error rate. All PCR products were confirmed on agarose gels and the data were collected in the database. Overall, 98% of the amplifications were successful.
Based on a pilot study showing an approximately 80% first-pass cloning success rate (unsuccessful clones showed mostly unexpected inserts), one entry clone for each gene was selected and sequenced. For validation, the insert was fully sequenced and compared with the reference sequence; acceptable clones had either a perfect match or a maximum of one amino acid change. Genes longer than 600 bp had internal primers designed for additional sequencing reactions to allow full coverage of the insert. After the first cloning round, 82% of the ORFs were captured into the entry vector. ORFs that failed the first round had additional colonies screened, whenever possible, or underwent de novo cloning using a different set of primers. Overall 93% of all ORFs were captured without any mutation, 2.1% had a silent mutation and 5% showed a single amino acid change, totaling 97% of the nonredundant cholera transcriptome. The 3% missing are mainly due to genes wrongly annotated in the genome, including reading frames that are not multiples of 3 or insertions/deletions within the gene sequence.
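The acceptance rule described above (perfect match, or at most one amino acid change) can be expressed as a small classifier. This is a sketch under simplifying assumptions: the clone read is already aligned and the same length as the reference, any length difference is treated as an insertion/deletion and rejected, and the function names and category labels are invented for the example.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order (first, second, third base).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons of an upper-case DNA string."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_clone(reference: str, clone: str, max_aa_changes: int = 1) -> str:
    """Classify a sequenced insert against its reference ORF."""
    if len(clone) != len(reference):
        return "reject"  # insertion or deletion
    if clone == reference:
        return "perfect"
    changes = sum(a != b for a, b in zip(translate(reference), translate(clone)))
    if changes == 0:
        return "silent"
    return "aa_change_accepted" if changes <= max_aa_changes else "reject"


reference = "ATGGCTAAATAA"  # hypothetical ORF encoding M-A-K-stop
for clone in ("ATGGCTAAATAA", "ATGGCCAAATAA", "ATGGCTGAATAA", "ATGGATGAATAA"):
    print(clone, classify_clone(reference, clone))
```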
These entry clones were transferred to an expression vector with a C-terminal GST tag and the proteins were expressed on protein microarrays (Fig. 7). Protein microarrays were generated using the nucleic acid programmable protein array (NAPPA) method. In this platform, the DNA coding for the gene is coprinted with an antibody against the tag, in this case GST. The DNA is expressed in vitro using a cell-free transcription and translation system and the freshly expressed protein is captured by the antibody. In the V. cholerae array, we demonstrated that more than 92% of the proteins were successfully expressed. These arrays can be used for the study of the immune response, and preliminary data were recently published.
5 Current limitations and practical considerations
All cloning systems possess limitations inherited from their adopted cloning strategy. Weaknesses and strengths in this context are relative terms that may change roles depending on the intended use(s) of the clones. Therefore, choosing the best system depends on the application and must be determined by the user. A detailed comparison of the Flexi, Creator, and Gateway systems is shown in Table 1.
Table 1. Comparison of the three major cloning systems: Flexi (Promega), Creator (Clontech), and Gateway (Life Technologies)
Flexi: 8 steps (PCR, digestion, DNA purification, annealing, transformation, colony picking, DNA extraction, sequencing)
Creator: 12 steps (PCR, vector digestion, DNA purification/cloning enhancer, In-Fusion, transformation, colony picking, DNA extraction, sequencing, Cre reaction, transformation, colony picking, DNA extraction, clone validation)
Gateway: 12 steps (PCR1 + PCR2 as a single-tube reaction, DNA purification, BP reaction, transformation, colony picking, DNA extraction, sequencing, LR reaction, transformation, colony picking, DNA extraction, clone validation)
Size of cloned genes
Gateway: <3 kb genes, BP reaction; >3 kb genes, alternative approach
Sequence of the gene of interest
Flexi: dependent; genes with either SgfI or PmeI sites should be cloned using alternative protocols
Creator: alternative: use loxP site (34 nt) as adapter sequence in the PCR reaction
Gateway: alternatives: use attL site (100 nt) as adapter sequence in the PCR reaction; or an expression vector with attP sites, which will add 100 nt on each side of the ORF
Primer length (adapter sequence only)
Flexi: 8 nt (restriction enzyme recognition site)
Creator: 15 nt (vector extremity sequence) or 34 nt (loxP sequence; cloning directly to the expression vector)
Gateway: 25 nt (attB site) or 42 nt (attB site + Shine-Dalgarno sequence + Kozak sequence) for the expression of native protein
Potential number of additional amino acids (aa) introduced via the cloning site
Flexi: N-terminal: 3 aa (SgfI site); C-terminal: 3 aa (EcoICRI site)
Creator: N-terminal: 12 aa (loxP site); C-terminal: 0 (loxP site is removed by splicing)
Gateway: N-terminal: 9 aa (attB1 site); C-terminal: 9 aa (attB2 site)
Number of false positives
Gateway: BP reaction: low (<3 kb genes), high for larger genes
Bacterial competency needed
Number of expression vectors compatible with the system
Number of clone libraries compatible with the system
One of the major advantages of the Flexi system is that the product obtained after the first round of ligations is a functional expression clone, without the need to generate an archive entry clone. This simplified protocol, combined with short primers and enzymes available from more than one vendor, makes the cloning cost per gene one of the lowest available. Another advantage of the Flexi system is the insertion of comparatively few exogenous amino acids between the protein of interest and the tag. Some systems, such as Gateway and Creator, add as many as 12 aa, which may result in lower expression levels or suboptimal protein function.
A drawback of the Flexi system is a somewhat reduced cloning efficiency when compared to Gateway, for instance. It requires the selection of multiple clones per insert to ensure the acquisition of the correct clone, making it less efficient in genome-scale projects (unpublished data). Another disadvantage is that some GOIs may be cleaved by SgfI or PmeI, so an individual cloning approach has to be adopted for these genes. In these cases, silent mutagenesis can be done to remove the restriction site without changing the protein sequence. Alternatively, RecA can be used to protect the restriction site present in the GOI and prevent the digestion [97, 98]. Both methods add extra steps to the cloning process and increase the final cost. The transfer of the GOI into another expression vector is also susceptible to error. During the ligation step, it is possible that both vector backbones will be joined together, generating a construct that is viable under the selection. One simple strategy to eliminate these false-positive clones is to design vector backbones that share at least 184 bp adjacent to the death cassette. After a vector–vector ligation, a large palindromic sequence is generated, and it is known that E. coli cannot replicate vectors with palindromic sequences greater than 184 bp. This strategy was successfully used to decrease the number of vector–vector ligations in the Flexi system.
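Genes that need the individual treatment described above can be flagged before any primers are ordered by scanning each ORF for internal SgfI (GCGATCGC) and PmeI (GTTTAAAC) sites. The sketch below does this with plain string searches; both sites are palindromic, so scanning one strand is sufficient. The function name and the toy sequences are assumptions.

```python
FLEXI_SITES = {"SgfI": "GCGATCGC", "PmeI": "GTTTAAAC"}


def internal_sites(orf: str, sites=FLEXI_SITES):
    """Return {enzyme: [0-based positions]} for every site found in the ORF."""
    hits = {}
    for name, site in sites.items():
        positions = []
        pos = orf.find(site)
        while pos != -1:
            positions.append(pos)
            pos = orf.find(site, pos + 1)  # allow overlapping occurrences
        if positions:
            hits[name] = positions
    return hits


# Hypothetical ORFs: the first contains both sites, the second neither.
flagged = internal_sites("ATGGCGATCGCAAAGTTTAAACTTTTAA")
clean = internal_sites("ATGAAAGCTCTGGAACGTCTGTAA")
print(flagged, clean)
```

ORFs returning a non-empty result would be routed to silent mutagenesis, RecA protection, or a different cloning system.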
The ability to clone any gene, regardless of the sequence, in a universal protocol is very appealing for high-throughput projects, making the cloning systems that employ recombination very popular. Cloning using recombination requires the construction of an entry vector, which may be seen as a drawback of the system because it adds many extra steps to the cloning process. However, the entry vector prevents negative selection against toxic genes that may be expressed (even if only through leaky expression) under a bacterial promoter. Depending on the intended set of genes to be cloned, those extra steps might be necessary. Another major advantage is the possibility of transferring ORFs to expression vectors without PCR amplification and consequently without the need to obtain the full-length gene sequence for clone validation.
Gateway was the first high-throughput recombinational cloning system available on the market. Its cloning efficiency is very high for genes smaller than 3 kb; for larger genes, however, the BP reaction (the capture reaction) is not efficient. An alternative to overcome the BP limitation is to mix and match steps from different systems, for example by incorporating In-Fusion cloning into the Gateway protocol. The LR reaction (the transfer reaction), in contrast, is less sensitive to gene size and is one of the most robust and efficient reactions among all cloning systems. Once in the entry vector, even large genes can be transferred efficiently to the expression vectors using the LR reaction. Since each gene has to be captured just once and then transferred to several expression vectors, the high efficiency of the LR reaction has made the Gateway system very popular. One disadvantage of this system is the inclusion of the peptide coded by the recombination sites in any protein with a fusion tag. Note that native protein can be made if no tags are desired.
Some of the disadvantages of the Creator system overlap with those of Gateway, such as a requirement for entry clones and the addition of amino acids from the expression of the recombination sites in fusion proteins. However, the biggest disadvantage is that RNA splicing is required to add common carboxyl tags via the expression vector. This limits the number of expression systems that can be used with C-terminal tagged proteins.
Unlike the Gateway and Flexi systems, where an expression clone can be used as a template for the transfer of the GOI into another entry or expression vector, the Cre reaction in the Creator system is not reversible: once a gene is in the expression vector, it cannot be easily transferred to another vector. Besides being irreversible, the Cre reaction is less efficient than other enzymes/systems. In a typical experiment, the number of background colonies is high and the sequencing of multiple clones may be required to obtain the desired expression clone (unpublished data).
Overall, the Gateway system has seen the widest adoption. Factors such as its early release, its ease in handling both N- and C-terminal tags, and the near-perfect efficiency of transfer from entry clone to expression vectors seem to have outweighed the difficulties encountered in creating the first entry clone. The collection of Gateway-compatible clones and vectors is the largest among all high-throughput cloning systems and is continuously growing.
This project was supported by NIH grants U01CA117374 and U01AI077883, and by the Virginia G. Piper Foundation.
The authors have declared no conflict of interest.