J. L. Markley, Biochemistry Department, University of Wisconsin-Madison, 433 Babcock Drive, Madison, WI 53706, USA Fax: +1 608 262 3759 Tel: +1 608 263 9349 E-mail: email@example.com Website: http://uwstructuralgenomics.org
We describe a platform that utilizes wheat germ cell-free technology to produce protein samples for NMR structure determinations. In the first stage, cloned DNA molecules coding for proteins of interest are transcribed and translated on a small scale (25 µL) to determine levels of protein expression and solubility. The amount of protein produced (typically 2–10 µg) is sufficient to be visualized by polyacrylamide gel electrophoresis. The fraction of soluble protein is estimated by comparing gel scans of total protein and soluble protein. Targets that pass this first screen by exhibiting high protein production and solubility move to the second stage. In the second stage, the DNA is transcribed on a larger scale, and labeled proteins are produced by incorporation of [15N]-labeled amino acids in a 4 mL translation reaction that typically produces 1–3 mg of protein. The [15N]-labeled proteins are screened by 1H-15N correlated NMR spectroscopy to determine whether the protein is a good candidate for solution structure determination. Targets that pass this second screen are then translated in a medium containing amino acids doubly labeled with 15N and 13C. We describe the automation of these steps and their application to targets chosen from a variety of eukaryotic genomes: Arabidopsis thaliana, human, mouse, rat, and zebrafish. We present protein yields and costs and compare the wheat germ cell-free approach with alternative methods. Finally, we discuss remaining bottlenecks and approaches to their solution.
One of the most important tasks in biotechnology today is the development of improved systems and strategies for synthesizing any desired protein or protein fragment in its folded, soluble form on a preparative scale. This task is fundamental to the success of structural genomics projects, which promise to capitalize upon numerous advances in science and technology to change the appreciation and understanding of biological systems. Structural genomics implies a move away from hypothesis-driven research to a system of solving structures first and using these structures and other structures modeled from them as the source of hypotheses for further research. The medical incentives for understanding protein structure are great. Many diseases are caused by defects in a single protein that alter its folding, stability, or activity. The structures of proteins involved in diseases will move us a step closer to improving disease treatment, diagnosis, and prevention. Beyond their specific medical applications, structural genomics projects are teaching fundamental lessons about the structural basis of life on this planet.
Protein production remains a bottleneck in proteomics, for both structural and functional studies. Most structural biology groups and structural genomics centers utilize cell-based, heterologous protein production from Escherichia coli. However, this approach fails with many individual proteins, particularly those from eukaryotes. Failures result from no or low expression, low solubility, or degradation. Expression levels can be improved by producing the protein of interest as a cleavable fusion with a highly expressing protein. Low solubility can result from failure of the protein to fold properly, aggregation of folded protein, or from unfavorable properties of the construct (intrinsic insolubility of the native sequence or insolubility introduced by a non-native sequence, such as a purification tag or other cloning artifact). As indicated in TargetDB, the target registration database for structural genomics (http://targetdb.pdb.org/), the proportion of targets that code for ‘unique proteins’ that yield soluble protein is only about one-third for prokaryotic proteins and much lower for eukaryotic proteins. In this context, a unique protein is defined as one with a peptide sequence exhibiting ≤ 30% sequence identity to the sequence of any protein with a three-dimensional structure deposited in the Protein Data Bank. Solubility can be improved greatly by producing the protein of interest as a cleavable fusion with a highly soluble protein. This strategy may enable the protein to fold properly without aggregation so that it stays in solution following cleavage. Many eukaryotic proteins are ‘natively disordered’, that is they do not adopt a single, stable, folded structure. Some natively disordered proteins require an additional factor for folding: a metal ion, a small molecule cofactor, another peptide chain, or an oligonucleotide. Other proteins may require extensive post-translational modification to achieve their native folded state. 
Platforms for structural investigations must support the production of proteins on the scale of 2–10 mg. For efficient structure determination by NMR spectroscopy, the proteins must be labeled with stable isotopes (15N or 13C+15N, or for larger proteins 2H+13C+15N). For X-ray crystallography, proteins normally are labeled with selenomethionine (Se-Met) to support multiwavelength anomalous dispersion data collection for phase determination. Because protein production and labeling on this scale is expensive, it is important to screen targets first on a smaller scale to identify which constructs are expressed, soluble without aggregation, folded, and stable under the conditions used for NMR structure determinations or crystallization trials.
In vitro cell-free methods for protein synthesis with extracts from prokaryotic  or eukaryotic  cells offer an alternative to the E. coli cell-based platforms. Cell-free approaches have a number of potential advantages over other alternatives to heterologous expression in E. coli cells. Stable isotope or Se-Met labeling is easier with cell-free systems than with yeast, mammalian, or insect cell systems [3–5]. Cell-free systems may permit successful production of proteins that undergo proteolysis [6,7] or accumulate in inclusion bodies  in cells. Cell-free systems support selective labeling strategies [9–12] that cannot be achieved in bacterial whole cell systems. An important emerging approach is the incorporation of stereo-array isotope labeled (SAIL) amino acids , chemically synthesized amino acids with stereo-specifically arrayed stable isotope (2H and 13C) labeling patterns that are optimal for NMR spectroscopy. SAIL amino acids are being commercialized by a start-up company in Japan (Sail Technologies, Inc., Yokohama, Japan) and when available will raise the threshold for high-throughput NMR structure determinations from 20 kDa to 40 kDa and above . The SAIL amino acids must be incorporated into proteins by in vitro synthesis so as not to disturb the labeling pattern.
Cell-free systems have been used for the production of various kinds of proteins, including membrane proteins  and proteins that are toxic to cells [8,15]. It is possible to collect NMR spectra of [15N]-labeled proteins prior to isolation from the cell-free protein synthesis mixture [16,17]. One of the features of cell-free protein production is that only the protein of interest is labeled, so that contaminating proteins do not show up in normal multinuclear NMR spectra. Cell-free protein production protocols are streamlined compared to cell-based protocols, in that they do not require cell harvesting or cell lysis. Protein purification is usually simpler, because the protein of interest starts out more concentrated and is isolated from a smaller set of contaminants.
The RIKEN Structural Genomics Center in collaboration with Roche has pioneered the use of cell-free protein production through a coupled transcription-translation system employing E. coli extracts [18–22]. It has been found, however, that most of the proteins that produce well in E. coli cell-free systems are the same ones that are produced successfully from E. coli cells . Thus, despite other potential advantages, the E. coli cell-free approach may not greatly expand the range of proteins that can be produced in soluble, folded state, although it may be possible to overcome this limitation by redesigning the gene sequence (see below), by adding chaperones or other factors [22,24], or by reengineering ribosomal proteins .
One of the first in vitro translation systems to be investigated was prepared from wheat germ extracts, but yields from this eukaryotic extract were low . Y. Endo and his group at Ehime University (Matsuyama, Japan) achieved a breakthrough in this technology by finding that an inhibitor of ribosomal protein synthesis, tritin, is associated with the coat of the wheat embryo . They developed a process for removing this contaminating inhibitor and patented this process along with methods for utilizing the improved wheat germ extract [26–31]. Endo founded a company, CellFree Sciences Co., Ltd. (Yokohama, Japan), to commercialize the technology. We found this approach to be promising and formed a cooperative undertaking with Ehime University and the CellFree Sciences Co., Ltd. with the goal of investigating the potential of wheat germ cell-free protein production as an enabling technology in our structural genomics project, the Center for Eukaryotic Structural Genomics (CESG; Madison, WI). As discussed here, we have found this technology to be robust, and our wheat germ cell-free pipeline now supports high-throughput screening for protein production and solubility and provides stable isotope labeled protein samples for the majority of the NMR structures determined at CESG [32,33].
CESG's wheat germ cell-free platform
Our detailed protocol for wheat germ cell-free protein production is available elsewhere . In short, the approach consists of four steps (Fig. 1A): (1) creation of a plasmid used for in vitro transcription, (2) small scale (25–50 µL) screening to assay the level of protein production and solubility, (3) larger scale (4–12 mL) production of [U-15N]-protein used to evaluate whether solution conditions can be found that render the target suitable for NMR structure determination (soluble, monodisperse, folded, and stable), and (4) production of sufficient [U-13C,15N]-protein for multidimensional, multinuclear magnetic resonance data collection. We purchase the wheat germ extract from CellFree Sciences, Inc., the RNA polymerase from Promega (Madison, WI), and the labeled amino acids from Cambridge Isotope Laboratories (Andover, MA). Details about these and other reagents and supplies are found in our publications [32–34].
The purification workflow diagram is shown in Fig. 1B. In step (1), a defined series of cloning procedures is used to create a DNA plasmid containing the target gene and 5′ and 3′ extensions that promote efficient transcription and translation. In step (2), small scale protein expression and purification trials are carried out, generally in a 96 well format. Successful candidates from these screens (those estimated to yield > 0.5 mg·mL−1 target protein with solubility > 75%) are then selected for larger scale protein production with incorporation of [15N]-labeled amino acids. Purified [U-15N]-protein samples produced in step (3) are then assayed by 1H-15N correlation spectroscopy (1H-15N HSQC) for their suitability as structural candidates (they must be folded, monodisperse, and stable at room temperature for at least 14 days). The solution conditions can be refined as part of this step. Targets that pass these tests are then prepared as [U-13C,15N]-protein samples, step (4).
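The small scale selection rule in step (2) can be written out as a short sketch. This is an illustration only, not CESG's software: the record layout, the names `ScreenResult`, `soluble_fraction`, and `passes_small_scale_screen`, and the use of gel band intensities as inputs are assumptions made here. Only the two thresholds (> 0.5 mg·mL−1 and > 75% soluble) come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text for advancing a target to larger scale production.
MIN_YIELD_MG_PER_ML = 0.5
MIN_SOLUBLE_FRACTION = 0.75


@dataclass
class ScreenResult:
    """One small scale screening result (hypothetical record layout)."""
    target_id: str               # illustrative identifier
    total_intensity: float       # gel band intensity in the total protein lane
    soluble_intensity: float     # gel band intensity in the soluble protein lane
    est_yield_mg_per_ml: float   # estimated target protein per mL of reaction


def soluble_fraction(result: ScreenResult) -> float:
    """Estimate solubility as the soluble-to-total band intensity ratio."""
    if result.total_intensity <= 0:
        return 0.0  # no detectable expression, so nothing is counted as soluble
    return result.soluble_intensity / result.total_intensity


def passes_small_scale_screen(result: ScreenResult) -> bool:
    """Apply both thresholds; a target must exceed each one to advance."""
    return (
        result.est_yield_mg_per_ml > MIN_YIELD_MG_PER_ML
        and soluble_fraction(result) > MIN_SOLUBLE_FRACTION
    )
```

In practice the intensities would come from densitometry of the gel scans, and the yield estimate would require a calibration standard; neither step is modeled here.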
We have tested the wheat germ cell-free platform in the context of NMR-based structural genomics of eukaryotic proteins and have compared it with our parallel E. coli cell-based platform. Our experience is summarized briefly as follows. (a) Targets can be screened more quickly and more economically for protein expression and solubility by the cell-free approach than by the cell-based approach. The efficiency of this process is important, because we need to screen many targets or multiple constructs of a given target in order to find one that produces a protein that is soluble and well folded. As an example of multiple screening of a given target, we have screened targets with a noncleavable His6 tag, with a cleavable His6 tag, and with a cleavable glutathione S-transferase (GST) tag and have shown complementary success with these . (b) Because of the smaller volumes involved, the isolation and purification of 1–5 mg quantities of labeled protein for NMR structural studies is faster and less labor intensive with proteins prepared by the cell-free approach than the cell-based approach. (c) Proteins produced with the wheat germ extract from CellFree Sciences and labeled amino acids generally show high levels of enrichment by mass spectrometry: > 95% 15N/(14N+15N) or 13C/(12C+13C). These high levels are excellent for NMR spectroscopy. (d) The cell-free system supports the production of proteins with a variety of labeling patterns: uniform labeling with 2H, 13C, and 15N, selective labeling by residue type, and SAIL (discussed above).
We recently carried out a detailed comparison of the wheat germ cell-free and E. coli cell-based approaches to protein production for NMR structure determination . In this study 96 randomly chosen Arabidopsis thaliana targets were carried through CESG's wheat germ cell-free and E. coli cell pipelines. If possible, [15N]-labeled versions of each protein were produced for analysis by 1H-15N correlation NMR spectroscopy. Of the 96 targets started with, only eight from the cell-free pipeline and five from the cell-based pipeline were found suitable for NMR structural analysis on the basis of the NMR results. In this comparison, the five targets that proved successful by the E. coli cell-based approach also were successful by the cell-free approach.
Our wheat germ cell-free approach appears to have advantages over published in vitro protein production protocols that utilize E. coli S30 extract. (a) Cell-free protocols utilizing E. coli extract usually call for the testing of multiple plasmids with sequence differences outside the protein coding region to determine one that produces protein in high yield . By contrast, with the wheat germ cell-free protocol we have found no advantage of modifying the plasmid sequence outside the coding region, and hence utilize a single plasmid construct for all targets. (b) Protocols for E. coli S30 cell-free synthesis typically employ additives, such as polyethylene glycol to improve protein yields . These additives need to be removed prior to NMR structural studies. No such additives are required with the wheat germ cell-free approach probably because the wheat germ extract contains chaperones and other factors that contribute to higher protein yields. (c) To achieve a high level of label incorporation from E. coli S30 extract it may be necessary to take pains to remove endogenous unlabeled amino acids . (d) Proteins prepared from E. coli S30 extract may be heterogeneous as the result of incomplete cleavage of the N-terminal methionine. This heterogeneity can lead to doubling of NMR peaks . An effective solution is to make all proteins with a cleavable N-terminal sequence. This complication does not occur with proteins produced in vitro from wheat germ extract. (e) Wheat germ extracts contain chaperones, and do not require the addition of chaperones as sometimes needed for high yields from E. coli S30 extract [37,38]. A comparison of protein production from wheat germ extract and E. coli S30 extract  demonstrated that a significantly higher proportion of multiple domain eukaryotic proteins were soluble when translated by wheat germ extract.
All of the cell-free operations can be carried out by hand, and this is how we started using the technology. Because of the small volume requirements for screening (25–50 µL) and protein production for structural studies (4–12 mL), cell-free methods have proved amenable to automation. CESG makes use of a CellFree Sciences GeneDecoder1000™ robotic system (Fig. 2) in automating the small scale screening of constructs for protein production and solubility. This unit makes it possible to carry out as many as 1052 small scale (25 µL) screening reactions per week. CESG has two prototype robotic units developed by CellFree Sciences for larger scale protein production (Fig. 2). The Protemist10™ robotic system requires preparation of the mRNA off-line, whereas the newer Protemist100™ starts with DNA and produces the mRNA transcript prior to the translation step. Each of these systems supports 24 transcription and translation reactions of 4 mL each per week. Typical yields for the Protemist runs are 0.3–0.5 mg purified protein per mL reaction mixture. These robotic systems handle the many steps that are tedious to carry out by hand, and work through the night. They have greatly reduced the manpower requirements of cell-free screening and protein production.
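The figures quoted above (4 mL reactions, 0.3–0.5 mg purified protein per mL) allow a rough planning estimate of how many reactions a sample of a given size would need. The sketch below is illustrative only: the function names are hypothetical, and it assumes that yield scales linearly with reaction volume and that a given target falls within the quoted yield range, neither of which is guaranteed.

```python
import math

REACTION_VOLUME_ML = 4.0             # Protemist reaction volume quoted in the text
YIELD_RANGE_MG_PER_ML = (0.3, 0.5)   # typical purified yield quoted in the text


def reactions_needed(target_mg, yield_mg_per_ml, volume_ml=REACTION_VOLUME_ML):
    """Whole number of reactions needed to reach target_mg of purified protein."""
    if target_mg <= 0:
        return 0
    if yield_mg_per_ml <= 0 or volume_ml <= 0:
        raise ValueError("yield and volume must be positive")
    return math.ceil(target_mg / (yield_mg_per_ml * volume_ml))


def reaction_range(target_mg):
    """(best case, worst case) reaction counts across the quoted yield range."""
    low_yield, high_yield = YIELD_RANGE_MG_PER_ML
    return (
        reactions_needed(target_mg, high_yield),
        reactions_needed(target_mg, low_yield),
    )
```

For example, `reaction_range(5.0)` gives between 3 and 5 reactions for a 5 mg sample, a small share of the 24 reactions per week that each system supports.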
Success rates with eukaryotic targets
The centers involved in the NIH Protein Structure Initiative (USA) are generating information about success rates in going from a selected target gene to a completed and deposited three-dimensional protein structure. The overall success rates still tend to be quite low, in the range of 2% to 20%, depending on the center and the types of targets selected. It is clear from all centers that the yields of structures for eukaryotic targets are much lower than for prokaryotic targets. In the interest of efficiency and cost savings, it is important to analyze where failures occur and to devise strategies to minimize these. The most effective routes for improvement involve a combination of bioinformatics and small scale screening. Bioinformatics relies on prior information and mathematical models for correlating success rates with gene sequences. Small scale screening offers the most economical way of testing whether a cloned and sequenced target will proceed through the critical stages leading to a structure. The initial screening step determines the level of gene expression and the solubility of the product. As described above, CESG's wheat germ cell-free platform supports rapid and economical small scale screening for expression and solubility. We currently test constructs with and without an N-terminal tag and have shown success in rescuing failed targets by truncating the N- and/or C-termini. The second screening operation relevant to NMR structure determinations is the screening of the [15N]-labeled protein target by 1H-15N HSQC spectroscopy. This test, which is repeated after one week to determine if the protein is stable in solution, is highly diagnostic for the success of an NMR structure determination. Proteins that pass this test are then produced with [15N+13C]-labeling.
We have accumulated experience in using the cell-free platform to produce proteins from several eukaryotic genomes. These include over 722 different structural genomics targets from human, mouse, and Arabidopsis (Table 1). Most of the targets selected for testing have coded for proteins less than 25 kDa, because this is the size limit for high-throughput structure determinations by NMR spectroscopy. In addition, we have carried out small scale wheat germ cell-free screening of approximately 150 larger proteins (25–70 kDa), and the success rates for expressing soluble proteins appear to be comparable to our earlier results with smaller targets presented in Table 1.
Table 1. Statistics on eukaryotic proteins produced by CESG's wheat germ cell-free platform.
Small scale (µg), automated 96 well format production overnight
Large scale (mg), automated 8 × 4 mL production overnight
We define ‘highly soluble’ as ≥ 75% of the total protein being present in the soluble fraction. Of the same proteins produced with N-terminal GST tags and N-terminal (His)6 tags, 9% more were highly soluble with the GST tag. Only ≈ 5% of proteins soluble as GST fusions became insoluble following cleavage and removal of the GST tag. Thus the results show that proteins fused to GST can be more highly soluble and that the advantage may persist after the tag is removed (presumably through improved folding of the purified fusion protein prior to cleavage).
We have gathered statistics specific to human proteins. Of 174 human targets (most with unknown function) that were successfully cloned, 135 (78%) showed expression at levels suitable for structural investigations. Of these expressed proteins, 55 (41%) were soluble at levels needed for NMR spectroscopy. Of these, 36 (66%) gave [15N]-labeled samples at levels that could be evaluated by NMR spectroscopy. To date, nine of these human proteins yielded NMR structures. In total, CESG has determined NMR structures of 18 eukaryotic proteins produced by this methodology (Fig. 3). The average yield of purified, labeled, human proteins made for NMR structural studies has been 0.3 mg·mL−1 reaction mixture.
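The attrition through the human target pipeline can be tabulated directly from the counts given above. The following sketch is a convenience calculation rather than part of the CESG workflow; the stage labels and the function name `stage_rates` are chosen here for illustration. Step-wise rates computed this way come out close to the rounded percentages quoted in the text.

```python
# Counts taken from the text; the stage labels are paraphrased for illustration.
HUMAN_FUNNEL = [
    ("cloned", 174),
    ("expressed", 135),
    ("soluble", 55),
    ("15N sample evaluated by NMR", 36),
    ("NMR structure", 9),
]


def stage_rates(stages):
    """Return (stage, count, rate vs. previous stage, rate vs. first stage)."""
    first_count = stages[0][1]
    rows = []
    for (_, previous_count), (name, count) in zip(stages, stages[1:]):
        rows.append((name, count, count / previous_count, count / first_count))
    return rows


if __name__ == "__main__":
    for name, count, step_rate, cumulative_rate in stage_rates(HUMAN_FUNNEL):
        print(f"{name:30s} {count:4d}  step {step_rate:6.1%}  overall {cumulative_rate:6.1%}")
```

The last row gives the overall rate from cloned target to NMR structure, 9 of 174, or about 5%, which falls within the 2% to 20% range of overall success rates cited earlier for structural genomics centers.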
Labor savings, coupled with the high level of incorporation of labeled amino acids and the high yield of folded protein samples, make the overall cost of the wheat germ cell-free method comparable to that of the E. coli cell-based approach for NMR structure determinations of eukaryotic proteins. One of the main advantages of the automated wheat germ cell-free protein expression system is that the overall process requires much less time and effort compared to our current cell-based methods. Not including the cloning steps, it generally takes 48 h (using the GeneDecoder1000™), or 72 h (manually), to screen 96 targets for expression and solubility on the small scale. The purification protocols also require less time and effort than cell-based protocols because of the smaller volumes (4–12 mL versus 500–1000 mL) and higher initial purity. Using the latest in General Electric Healthcare HIS-TRAP purification technology (Piscataway, NJ), immobilized metal affinity chromatography (IMAC) purification of His tagged proteins requires 40 min of processing time and results in protein samples that are 75–85% pure. Gel filtration adds an additional 3 h and can increase the purity to > 95% for proteins < 15 kDa and to 90% for proteins < 20 kDa. GST purification results in > 95% purity regardless of size; however, the minimal time to process the sample is greater than 10 h.
Because stable isotope labeled amino acids required for NMR structure determinations are expensive, it is important that the protein yield per quantity of amino acid supplied be high. With cell-free systems (E. coli or wheat germ) ≈ 10% of the labeled amino acid mixture supplied is incorporated into the protein produced and purified.
Although the cell-free approach is much less labor intensive in comparison to our E. coli cell-based pipeline, it requires more expensive reagents and supplies. Current limitations of the method stem from the restricted availability and high cost of highly active wheat germ extract. These problems should ease as the wheat germ cell-free approach becomes more widespread and as increasing demands for cell-free extract stimulate improvements in production technology. The costs of stable isotope labeled amino acids also may be expected to decrease as demand accelerates. Average supplies costs currently are: US$47 per target for cloning and expression solubility testing (with unpurified reaction mixture assayed by SDS/PAGE), US$370 per mg for Se-Met protein, US$390 per mg for [15N]protein, and US$470 per mg for [13C,15N]protein (with proteins isolated and purified).
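The average supplies costs listed above can be combined into a rough budget for a screening campaign followed by labeled sample production. The sketch below is an illustration under stated assumptions: the function name and label keys are hypothetical, the costs are treated as strictly linear in the number of targets and milligrams, and labor and instrument costs are not included.

```python
# Average supplies costs quoted in the text (US dollars).
SCREENING_COST_PER_TARGET_USD = 47.0
COST_PER_MG_USD = {
    "Se-Met": 370.0,
    "15N": 390.0,
    "13C,15N": 470.0,
}


def supplies_cost_usd(targets_screened, mg_by_label):
    """Estimate supplies cost for screening plus purified labeled protein.

    mg_by_label maps a labeling scheme (a key of COST_PER_MG_USD) to the
    number of milligrams of purified protein required.
    """
    total = targets_screened * SCREENING_COST_PER_TARGET_USD
    for label, milligrams in mg_by_label.items():
        if label not in COST_PER_MG_USD:
            raise KeyError(f"no cost figure for labeling scheme {label!r}")
        total += milligrams * COST_PER_MG_USD[label]
    return total
```

For instance, `supplies_cost_usd(96, {"15N": 2.0, "13C,15N": 2.0})` evaluates to 6232.0, that is, US$4512 for screening one 96 well plate plus US$780 and US$940 for 2 mg each of [15N]- and [13C,15N]-labeled protein.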
The major advantages of the wheat germ cell-free method over the E. coli cell-based pipeline are that it supports the production of a larger fraction of targets as folded, soluble protein and that it is much faster to prepare additional samples or truncated samples as needed for successful structure determinations. The E. coli approach has a cost advantage when its protein yields are much higher than cell-free. The overall costs of each approach appear to be similar for NMR structure determinations.
Because of the complementarity of cell-free and cell-based methods, we envision that it will be most efficient to screen each new target by both methods. Initially, we did not have an easy way to screen targets by the two approaches, because the cell-based pipeline was using ligation-independent cloning technology, whereas the cell-free pipeline used ligation cloning into the pEU vector. To remedy this, we recently implemented a cloning strategy that enables efficient small scale screening by cell-free and cell-based methods ; this approach utilizes Promega's Flexi®Vector technology to transfer the target gene from one plasmid to another. By comparing the small scale screening results from the two platforms, we can now choose the one more likely to be successful. If the cell-based approach is selected for an NMR target, we make use of a self-induction medium developed for producing [15N] or [13C+15N]-labeled protein from E. coli cells .
The largest remaining bottlenecks associated with the wheat germ cell-free protocol are the limited solubility, aggregation, or limited stability exhibited by many targets. Improvements in any of these areas would greatly lower the costs of structure determinations. Our ongoing research is aimed at investigating reasons for failures of these types and at developing approaches for rescuing failed targets. Some structural genomics centers start multiple constructs for each target selected (different N- and C-termini, different fusions, or different vectors and hosts) and choose the one that yields the most soluble protein. We have initiated a pilot study aimed at determining whether the initial production of constructs with multiple N- and C-termini for small scale screening would be more efficient than our current approach of redesigning failed constructs.
Currently, CESG's X-ray structure pipeline requires on the order of 10 mg of Se-Met protein for each target. We anticipate that as reliable small scale crystallization screening methods become available, the wheat germ cell-free method could become part of the X-ray crystallography pipeline. We have already determined by mass spectrometry that the wheat germ cell-free approach supports high level incorporation of Se-Met, and we have made small quantities of Se-Met-labeled proteins for use in chip-based (Fluidigm, South San Francisco, CA) crystallization screening.
We gratefully acknowledge the work of all CESG staff members and collaborators and fruitful interactions with Professor Y. Endo and his group at Ehime University, Matsuyama, Japan, and staff members of CellFree Sciences Co., Ltd. (Yokohama, Japan) in adapting their technology to research and production environments. Supported by NIH grants 1U54 G074901 (which supports CESG), and P41 RR02301 (which supports the National Magnetic Resonance Facility at Madison, where NMR spectroscopy was carried out).