A diverse set of family 48 bacterial glycoside hydrolase cellulases created by structure-guided recombination

Authors

  • Matthew A. Smith,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
    • These authors contributed equally to this work
  • Andrea Rentmeister,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
    2. Department of Chemistry, University of Hamburg, Hamburg, Germany
    • These authors contributed equally to this work
  • Christopher D. Snow,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
    2. Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, USA
  • Timothy Wu,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
  • Mary F. Farrow,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
  • Florence Mingardon,

    1. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
  • Frances H. Arnold

    Corresponding author
    • Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA

Correspondence

F. H. Arnold, Division of Chemistry and Chemical Engineering, Mail Code 210-41, California Institute of Technology, Pasadena, CA 91125, USA

Fax: +1 626 568 8743

Tel: +1 626 395 4162

E-mail: frances@cheme.caltech.edu

Website: http://cheme.che.caltech.edu/groups/fha/

Abstract

Sequence diversity within a family of functional enzymes provides a platform for elucidating structure–function relationships and for protein engineering to improve properties important for applications. Access to nature's vast sequence diversity is often limited by the fact that only a few enzymes have been characterized in a given family. Here, we recombined the catalytic domains of three glycoside hydrolase family 48 bacterial cellulases (Cel48; EC 3.2.1.176) – Clostridium cellulolyticum CelF, Clostridium stercorarium CelY, and Clostridium thermocellum CelS – to create a diverse library of Cel48 enzymes with an average of 106 mutations from the closest native enzyme. Within this set, we found large variations in properties such as the functional temperature range, stability, and specific activity on crystalline cellulose. We showed that functional status and stability were predictable from simple linear models of the sequence–property data: recombined protein fragments contributed additively to these properties in a given chimera. Using this, we correctly predicted sequences that were as stable as any of the native Cel48 enzymes described to date. The characterization of 60 active Cel48 chimeras expands the number of characterized Cel48 enzymes from 13 to 73. Our work illustrates the role that structure-guided recombination can play in helping to identify sequence–function relationships within a family of enzymes by supplementing natural diversity with synthetic diversity.

Abbreviations

BG, β-glucosidase
CBM, cellulose-binding module
Cel48, glycoside hydrolase family 48 cellulase
DUF, domain of unknown function
GOX, glucose oxidase
HRP, horseradish peroxidase
IPTG, isopropyl thio-β-D-galactoside
PDB, Protein Data Bank
TMB, tetramethylbenzidine

Introduction

Cellulolytic anaerobic bacteria use macromolecular structures known as cellulosomes to hydrolyze recalcitrant cellulosic substrates [1]. Within the cellulosome, cellulases and other glycoside hydrolases [2, 3] are assembled onto multidomain scaffoldin proteins for efficient degradation of cellulosic substrates [4]. Cellulosome assembly is achieved by binding of dockerin domains from enzymes to cohesin domains in scaffoldin, and interaction with the substrate is mediated by one or more carbohydrate-binding modules (CBMs) on the scaffoldin [1, 5].

The modularity of cellulosomes has spurred interest in ‘designer cellulosomes’ [4, 6], whereby different cellulases are synthetically combined for a specific application. Within a given glycoside hydrolase family, a diverse pool of potential cellulases would be beneficial for designer cellulosomes by providing a suite of enzymes with differing properties and an extensive platform for further enzyme engineering. Glycoside hydrolase family 48 cellulases (Cel48; EC 3.2.1.176) are ideal candidates for designer cellulosomes. As one of the most important families of bacterial cellulases [7, 8], they are usually a major constituent of bacterial cellulosomes [9, 10]. Of the 116 bacterial Cel48 genes currently predicted in the CAZy database (http://www.cazy.org/) [11], only 13 have been characterized.

Here, we used SCHEMA recombination to synthesize a diverse set of new Cel48 sequences. SCHEMA [12] is a structure-guided, site-directed protein recombination method that has been used to generate thousands of novel cytochrome P450s [13], β-lactamases [14], and fungal cellulases [15, 16]. SCHEMA identifies optimal crossover locations for shuffling homologous genes, based on minimizing structural disruption in the resulting chimeric proteins. The chimeric proteins that are made by recombining natural sequences differ from the parent sequences at many amino acid positions, and provide a convenient platform for structure–function studies. The new Cel48 enzymes described here are chimeras of the catalytic domains of three native Cel48 enzymes from mesophilic and thermophilic Clostridia. Sequence–function analysis of this synthetic enzyme library demonstrates a high degree of additivity in the sequence–stability relationship, as observed in previous studies [15, 17]. This simple relationship between the sequence block identity and its contribution to chimera stability has allowed us to predict highly stable, highly active Cel48 enzymes. We have also investigated the relationship between thermostability and optimal catalytic temperature in this enzyme family.

Results

Cel48 parental enzymes

Three extensively characterized Cel48 cellulases were chosen as parents for construction of the SCHEMA recombination library: CelF [10] from the mesophile Clostridium cellulolyticum ATCC 35319, CelY [18] from the thermophile Clostridium stercorarium, and CelS (also known as CelA) [19] from the thermophile Clostridium thermocellum ATCC 27405. All three enzymes are known to act on crystalline cellulose in a processive manner [10, 11]. Crystal structures of CelF and CelS show that the Cel48 catalytic domain adopts an (α/α)₆-barrel fold. The sequence and structural similarities of the catalytic domains (Fig. S1) suggest that these enzymes can be recombined to make functional catalytic domain chimeras.

Outside of the catalytic domain, however, the parent enzymes exhibit significant structural variations. CelF and CelS consist of a 70-kDa catalytic domain connected to their organisms' respective dockerin domains, whereas CelY is a noncellulosomal 103-kDa protein with its N-terminal catalytic domain attached via a 10-kDa domain of unknown function (DUF) to a 17-kDa cellulose-binding domain (CBM3) [18]. Thus, CelY can directly bind cellulose, whereas CelF and CelS bind their respective scaffoldins.

As the noncatalytic domains (dockerin, scaffoldin, and CBM) differ among the parent enzymes, we chose to construct the library by using the C. thermocellum architecture. Having a single architecture for the cellulases enables fair comparison of the chimeric cellulase catalytic domains. A miniscaffoldin consisting of a C. thermocellum cohesin and a CBM was constructed as previously described [20], and the CelS dockerin domain was fused to the C-termini of the catalytic domains of CelF, CelS, and CelY (see Experimental procedures). The parental constructs, with the added C. thermocellum dockerin domain, are referred to as CelF-1, CelS, and CelY-2, and are highlighted (boxed) in Fig. 1. These constructs can attach to the miniscaffoldin to produce minicellulosomes. Another CelY construct was created by the addition of its DUF. Because the presence or absence of this DUF did not affect activity of the CelY constructs (Fig. S2B), the DUF was excluded in constructing the recombination library.

Figure 1.

Architectures of parent Cel48 enzymes and derived constructs. Wild-type CelS and CelF consist of an N-terminal catalytic domain and a C-terminal dockerin that binds specifically to its cohesin. Miniscaffoldin (black) consists of a C. thermocellum cohesin and CBM. Construct CelF-1 contains a C-terminal C. thermocellum dockerin and binds to the miniscaffoldin. CelY from C. stercorarium consists of an N-terminal catalytic domain, a DUF, and a CBM. CelY constructs CelY-1 and CelY-2 contain the CelY catalytic domain and a C-terminal dockerin from C. thermocellum. CelY-1 also contains the DUF. CelY-1 and CelY-2 bind to the miniscaffoldin. All constructs used to prepare the chimera library (boxes) have the C. thermocellum dockerin and bind the C. thermocellum miniscaffoldin.

We first characterized and compared the activities on crystalline cellulose of the parental enzymes with and without the miniscaffoldin. For all cellulases with a dockerin, activity was substantially higher in the presence of miniscaffoldin than without it (Fig. 2A–C). Thus, cohesin–dockerin binding occurs, and CBM-mediated attachment to cellulose enhances the rate of sugar release from crystalline cellulose, as observed previously [21]. Figure 2D directly compares the activity profiles for the dockerin-containing cellulases in the presence of C. thermocellum miniscaffoldin. Under these conditions, CelY and CelS displayed the highest activity at 70–80 °C and very low activity below 50 °C. In contrast, CelF was most active at ~ 50 °C, but quickly lost activity at higher temperatures. In a previous study, we compared the activities of three homologous bacterial glycoside hydrolase family 9 CBM3c cellulases from mesophilic and thermophilic organisms over a range of temperatures. They all displayed similar activities at lower temperatures, and that activity increased with temperature until the enzyme was no longer stable [20]. Here, in contrast, the Cel48 cellulase from the mesophilic organism is significantly more active than its two thermophilic homologs at the lower temperature.

Figure 2.

Activities of purified Cel48 enzymes as a function of temperature, in the presence and absence of equimolar amounts of miniscaffoldin. Activities were determined from the total glucose equivalent released, with the enzymatic glucose assay (see Experimental procedures), in a 1-h reaction with 0.2 μM enzyme and 10 g·L⁻¹ Avicel. All activities are normalized to the activity of CelS at its maximum, at 80 °C. (A) CelF-1, (B) CelS, and (C) CelY-2, along with the native CelY enzyme. (D) Temperature profiles of CelF-1, CelS and CelY-2 constructs with miniscaffoldin. CelS and CelY-2 were most active at 75–80 °C, whereas CelF-1 was most active at 50 °C.

SCHEMA recombination library design

A structure-guided computational approach to designing a library of chimeric genes, SCHEMA identifies crossover sites for recombination of homologous proteins that maximize the likelihood that proteins in the resulting library will retain their folded structure [12]. Contacts (residues that are < 4.5 Å from one another) are identified from one or more of the crystal structures, and the SCHEMA energy (E) for a given chimera is calculated by counting the number of residue–residue contacts that are disrupted by recombination. Recombination sites are chosen to minimize the average SCHEMA energy, <E>, of all possible sequences made by recombining those sequence fragments.
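The contact-counting idea behind the SCHEMA energy can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the commonly used convention that a contact counts as disrupted only when the residue pair present in the chimera does not occur at those two positions in any parent. The function name, the toy parent sequences and the contact list are all invented for the example.

```python
def schema_energy(chimera, parents, contacts):
    """Count contacts (i, j) whose residue pair in the chimera is absent from every parent."""
    broken = 0
    for i, j in contacts:
        pair = (chimera[i], chimera[j])
        # Assumed convention: the contact is preserved if any parent carries
        # the same pair of residues at positions i and j.
        if all((p[i], p[j]) != pair for p in parents):
            broken += 1
    return broken


# Hypothetical aligned parents (four residues each) and structural contacts.
parents = ["AKLM", "AKVW", "GKLW"]
contacts = [(0, 1), (0, 2), (1, 3)]

# Chimera built from residues 0-1 of the third parent and residues 2-3 of the second.
chimera = parents[2][:2] + parents[1][2:]
energy = schema_energy(chimera, parents, contacts)
print(chimera, energy)
```

In this toy case only the contact between positions 0 and 2 pairs residues that never occur together in a parent, so a single contact is counted as broken; a parent sequence always scores zero.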

We designed the recombination library of Cel48 catalytic domains by using the raspp algorithm [22] to identify crossover sites that minimized <E> [12]. raspp returned a set of candidate library designs (Fig. S3). The chosen library has crossovers located before residues Pro122, Ala260, Asp292, His348, Gly396, Asn437, and Leu556, based on the numbering of CelS [Protein Data Bank (PDB): 1L2A]. This library has an <E> of 31, and an average number of mutations from the closest parent (<m>) of 106. The individual structural elements (‘blocks’) for this design, shown in Fig. 3A, are not obvious from the secondary or domain structure. Crossovers between blocks B and C, C and D, and G and H, for example, lie within α-helices. This design, however, sequesters as many residue–residue contacts as it can within blocks, given limitations on block size (Fig. 3B).
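The library-level statistic <m> can be illustrated with a short sketch. It assumes that the number of mutations from the closest parent is the minimum Hamming distance between a chimera and any aligned parent, averaged over every block combination. The helper names, toy sequences and block boundaries are hypothetical.

```python
from itertools import product


def build_chimera(parents, blocks, boundaries):
    """Concatenate block k from parent blocks[k]; boundaries hold block start indices plus the end."""
    return "".join(parents[p][boundaries[k]:boundaries[k + 1]]
                   for k, p in enumerate(blocks))


def mutations_from_closest_parent(chimera, parents):
    """Minimum Hamming distance between the chimera and any parent (m)."""
    return min(sum(a != b for a, b in zip(chimera, p)) for p in parents)


def average_m(parents, boundaries):
    """Average m over every chimera in the library, parents included."""
    n_blocks = len(boundaries) - 1
    values = [
        mutations_from_closest_parent(build_chimera(parents, blocks, boundaries), parents)
        for blocks in product(range(len(parents)), repeat=n_blocks)
    ]
    return sum(values) / len(values)


# Hypothetical toy library: three aligned parents, two blocks of two residues.
parents = ["AKLM", "AKVW", "GKLW"]
boundaries = [0, 2, 4]
print(average_m(parents, boundaries))
```

For the real design the same calculation would run over eight blocks and the full catalytic-domain alignment.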

Figure 3.

Sequence blocks in Cel48 chimeras designed with SCHEMA. (A) Structure of CelS with color-coded blocks A–H. (B) Residue–residue contact map showing the combined contacts from 17 Cel48 structures. The positions of the blocks are indicated with colored squares. Most contacts are sequestered within the blocks, and cannot be broken upon recombination.

Chimeric genes were assembled from 24 gene fragments, representing the eight blocks from each of the three parents, with the sequence-independent site-directed chimeragenesis method [23] to generate a gene library of 3⁸ (6561) different sequences (Table S1; Fig. S4). A C. thermocellum dockerin was attached to the C-terminus of each chimeric sequence during reassembly. The methods used to express, purify and identify functional chimeras are described in detail in Experimental procedures.
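The library size follows directly from choosing one of three parents at each of eight blocks. As a small sketch, the full set of block combinations can be enumerated as follows; the one-letter parent labels are an illustrative convention, not the naming used in the paper.

```python
from itertools import product

PARENTS = "FYS"   # hypothetical one-letter labels for CelF, CelY and CelS
N_BLOCKS = 8      # blocks A-H

# Each library member is written as an eight-letter string, one letter per block.
library = ["".join(blocks) for blocks in product(PARENTS, repeat=N_BLOCKS)]
print(len(library))
print(library[0], library[-1])
```

The enumeration includes the three parental sequences themselves (all blocks from one parent).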

Characterization of chimeric Cel48 cellulases

Upon screening 4872 library members with a 96-well plate cellulase activity assay (see Experimental procedures), we identified the functional enzymes, from which we purified and characterized 50 unique, novel Cel48 enzymes. As shown in Fig. 4, these enzymes have, on average, > 80 mutations from the closest parent cellulase. Their SCHEMA E-values range from 8 to 36, and they have 12–142 mutations from the closest parent cellulase. Sequences from all three parental enzymes are well represented at each block in the functional chimeras, except for CelF, which is underrepresented in blocks E, G, and H.

Figure 4.

Representation of three Cel48 parents and 60 active chimeras, with CelF in white, CelY in gray, and CelS in black. SCHEMA E-values, number of mutations from closest parent (m), T50, Topt and Arel are also provided. T50 is the temperature at which an enzyme loses 50% of its activity in a 10-min incubation. Topt is the temperature at which a cellulase liberates the most glucose from crystalline cellulose in a 2-h hydrolysis assay. Arel is the cellulase's specific activity at its respective optimal temperature measured in a 1-h assay with 0.2 μM enzyme and 0.2 μM miniscaffoldin in 10 g·L⁻¹ Avicel. Values are normalized relative to the specific activity of CelS.

We measured the thermostabilities (T50) and optimal catalytic temperatures (Topt) of the 50 Cel48 chimeras and their three parents; these values are reported in Fig. 4. T50 is the temperature at which an enzyme loses 50% of its activity after a 10-min incubation (see Experimental procedures), and is a measure of its ability to resist temperature-induced irreversible inactivation. Topt is the temperature at which a cellulase is most active during a 2-h assay (see Experimental procedures), and is a measure of its ability to remain active at elevated temperature. Thermostability, the ability to withstand denaturation, is necessary but not sufficient for increasing an enzyme's optimal catalytic temperature. In the chimeras, both these measured properties extend beyond the range of the parents. Many of the chimeras are very stable: indeed, this experiment has added 35 new Cel48 enzymes with a Topt of > 60 °C to the six natural thermostable cellulases that have been characterized to date: C. thermocellum ATCC 27405 CelS [24], C. thermocellum F7 CelS [25], C. thermocellum ATCC 27405 CelY [26], Thermobifida fusca YX CelF [27], C. stercorarium CelY [28], and Anaerocellum thermophilum DSM 6725 CelA [29].
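To make the T50 definition concrete, the sketch below estimates the temperature at which residual activity falls to 50% by linear interpolation between the two bracketing measurements. The interpolation scheme is an assumption made for illustration (the procedure actually used is described in Experimental procedures), and the data points are invented.

```python
def t50_from_curve(points):
    """Interpolate the temperature at which residual activity crosses 0.5.

    `points` is a list of (temperature, residual_activity) pairs, with activity
    expressed as a fraction of the unheated control.
    """
    pts = sorted(points)
    for (t_lo, a_lo), (t_hi, a_hi) in zip(pts, pts[1:]):
        if a_lo >= 0.5 > a_hi:
            # Straight line between the two measurements that bracket 50%.
            return t_lo + (a_lo - 0.5) * (t_hi - t_lo) / (a_lo - a_hi)
    raise ValueError("residual activity does not cross 50% in the measured range")


# Hypothetical residual activities after a 10-min incubation at each temperature.
curve = [(60, 1.00), (65, 0.90), (70, 0.60), (75, 0.20), (80, 0.02)]
print(t50_from_curve(curve))
```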

We also measured the specific activities of all the Cel48 chimeras at their respective optimal catalytic temperatures (Figs 4 and 5A). The chimeras tend to have specific activities that are similar to or slightly less than the parent enzymes. We did not observe a correlation between Topt and specific activity at that temperature for all of the sampled chimeras (Fig. 5B). However, recombination may have compromised the activities of many of the chimeras. If only the most active enzymes are considered, there does appear to be a correlation between Topt and specific activity (Fig. 5B, dotted line), whereby increasing temperature leads to higher specific activity.

Figure 5.

Specific activities of chimeric cellulases. (A) Specific activities of the Cel48 enzymes at their respective optimal catalytic temperatures (tabulated in Fig. 4). The activities were measured in a 1-h assay, with 0.2 μM enzyme and 0.2 μM miniscaffoldin in 10 g·L⁻¹ Avicel at the respective optimal temperature. The activities are normalized to the maximum specific activity of CelS (Topt = 77.7 °C). The parent enzymes are highlighted in bold. (B) The normalized specific activities versus the optimal catalytic temperatures of the cellulases. The parent enzymes are highlighted as black diamonds, and the possible correlation among the most active cellulases is indicated with a dotted line.

Modeling and predicting the function of chimeric cellulases

As previously demonstrated for fungal CBHI and CBHII cellulases [15, 16], we can use information from a small number of sequences to predict the properties of all the chimeras in the recombination library. To demonstrate this for Cel48, we built predictive models of T50 and Topt based on the sequences and SCHEMA E-values of the 50 functional chimeric cellulases and the three parental enzymes. We modified the simple sequence–stability linear regression model first used by Li et al. [17] to include an additional parameter for second-order SCHEMA contacts in the chimeras (Eqn S1). As shown in Fig. 6A, the thermostability model fits the T50 measurements of all 53 enzymes well (r² = 0.88), and is an improvement over the simpler model that does not include the SCHEMA E-parameter (r² = 0.82), as illustrated in Fig. S5.

Figure 6.

Modeling thermostability and thermoactivity. (A) Predicted T50 values from a simple linear model closely correlate with the measured T50 for 53 Cel48 enzymes over a range of almost 30 °C. (B) Stabilizing or destabilizing effects of each sequence block, for CelF (gray) and CelS (white), relative to CelY for the T50 model. Most blocks are destabilizing with respect to the most thermostable parent, CelY. Blocks A, F and G from CelS and, to a lesser extent, blocks C, F and G from CelF are predicted to be stabilizing. The effect of the SCHEMA E-value on the T50 predictions is −0.29 °C per disrupted structural contact (black). (C) Predicted Topt values from the same linear model also correlate with the measured Topt over a similar range. (D) Stabilizing or destabilizing effects of each block, for CelF (gray) and CelS (white), relative to CelY for the Topt model. Block contributions are similar in magnitude to those in the T50 model.

With this model, we were able to identify the contribution that each sequence block makes to stability (Fig. 6B). When trained on Topt measurements, the same additive block model also accurately predicts the measured values (Fig. 6C), and the block contributions to optimal catalytic temperature are very similar to their contributions to thermostability (Fig. 6D). These models trained on data from the sample set can be used to predict the T50 and Topt of all the remaining chimeras in the library.
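An additive block model of this kind can be fitted by ordinary least squares. The sketch below assumes a form in which the predicted T50 is an intercept (the reference parent) plus one coefficient per block and non-reference parent, plus a coefficient multiplying the SCHEMA E-value; Eqn S1 is not reproduced here, so this form is an assumption. The toy library, E-values and coefficients are invented, and the small solver is included only to keep the example free of external dependencies.

```python
from itertools import product


def encode(blocks, energy, n_parents, ref=0):
    """Intercept, one indicator per (block, non-reference parent), then the E-value."""
    row = [1.0]
    for parent in blocks:
        row.extend(1.0 if parent == p else 0.0
                   for p in range(n_parents) if p != ref)
    row.append(float(energy))
    return row


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy library: 2 blocks x 3 parents, with made-up E-values.
E = {(0, 0): 0, (1, 1): 0, (2, 2): 0, (0, 1): 3, (1, 0): 3,
     (0, 2): 5, (2, 0): 5, (1, 2): 4, (2, 1): 4}
# Invented coefficients: intercept, block 1 (parents 1, 2), block 2 (parents 1, 2), E.
true_w = [70.0, -5.0, 2.0, -3.0, 1.0, -0.3]

chimeras = list(product(range(3), repeat=2))
rows = [encode(c, E[c], n_parents=3) for c in chimeras]
t50 = [sum(w * x for w, x in zip(true_w, r)) for r in rows]  # noise-free synthetic data

fitted = fit_least_squares(rows, t50)
print([round(w, 3) for w in fitted])
```

Because the synthetic T50 values are exactly additive, the fit recovers the invented coefficients; with real measurements the residuals would indicate how well additivity holds.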

We wished to construct and test the chimeric cellulases that are predicted to be the most thermostable. Not every chimeric cellulase, however, is functional. To investigate how recombination leads to nonfunctional sequences, we analyzed 28 unique inactive chimeras identified during the activity screen. A chimera was defined as nonfunctional if, upon a five-fold increase in enzyme concentration, from 0.2 to 1 μM, no detectable activity was measured between 45 °C and 80 °C. These nonfunctional cellulases are all soluble proteins of the correct length on an SDS/PAGE gel (data not shown). Using CD, we analyzed 17 of the 28 nonfunctional chimeras at 25 °C, and found that all gave a similar signal to the parent enzymes (Fig. S6), suggesting that nonfunctional chimeras are folded and have a similar secondary structure to functional ones.

Inspired by the success of the additive block models for thermostability and thermoactivity, we took a similar approach to modeling and predicting chimera functional status. We constructed a linear model in which each block contributes independently to whether a chimera is functional or not. As with thermostability, we also included the SCHEMA E-value as a parameter. The output from the model should be a value between 0 and 1 to represent the probability that a chimera is active. To do this, we augmented the output of the linear model by using a linking function, flink, which scales outputs of the model to the required range (Eqn S2). The coefficients for this model can be found by linear regression (Table S2), although, unlike the thermostability model, the block contributions are only additive under the linking function.
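The role of the linking function can be illustrated as follows. This sketch assumes a logistic form for flink, which is one common way to map an additive score onto the interval between 0 and 1; the actual form is given in Eqn S2 and is not reproduced here. All coefficients are invented.

```python
import math


def f_link(z):
    """Logistic function (assumed form of the linking function): maps a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_active(blocks, energy, intercept, block_terms, e_coef):
    """Probability that a chimera is active; block terms add on the linear-score scale."""
    z = intercept + sum(block_terms[k][p] for k, p in enumerate(blocks)) + e_coef * energy
    return f_link(z)


# Invented coefficients for a toy library of 2 blocks x 3 parents.
block_terms = [{0: 0.0, 1: 0.8, 2: -1.2}, {0: 0.0, 1: -0.5, 2: 0.4}]
p = p_active((1, 2), energy=10, intercept=2.0, block_terms=block_terms, e_coef=-0.1)
print(round(p, 3))
```

Block contributions add in the linear score z, not in the probability itself, which is the sense in which they are additive only under the linking function.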

We trained the activity model on 81 cellulases (53 active; 28 inactive), and assessed its predictive ability by cross-validation, comparing the predicted functional status of each chimera with its measured status. The model successfully predicted the functional status of 88% of the chimeras (Table S3). A low SCHEMA E-value is known to increase the likelihood of a chimera being active [14], but E alone correctly predicted the functional status of only 77% of these chimeras under the same cross-validated conditions. Running the functionality model on all block combinations, we predict that the library contains more than 3000 unique active Cel48 enzymes.
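The cross-validated comparison can be sketched for the simplest case, a classifier that uses the E-value alone. The example assumes a leave-one-out scheme and a single E cutoff chosen on the training data; the source does not specify the cross-validation scheme, so both choices are assumptions, and the data are invented.

```python
def fit_threshold(train):
    """Pick the E cutoff (midpoint between sorted E-values) with the best training accuracy."""
    energies = sorted({e for e, _ in train})
    candidates = [(a + b) / 2 for a, b in zip(energies, energies[1:])]

    def accuracy(cutoff):
        return sum((e <= cutoff) == active for e, active in train) / len(train)

    return max(candidates, key=accuracy)


def leave_one_out_accuracy(data):
    """Hold out each chimera in turn, fit on the rest, and score the held-out prediction."""
    correct = 0
    for k, (e, active) in enumerate(data):
        train = data[:k] + data[k + 1:]
        cutoff = fit_threshold(train)
        correct += (e <= cutoff) == active  # predicted active if E is at or below the cutoff
    return correct / len(data)


# Hypothetical (E-value, is_active) pairs.
data = [(5, True), (10, True), (15, True), (30, False), (35, False), (40, False)]
print(leave_one_out_accuracy(data))
```

The same loop applies to the full block-plus-E model by swapping in a different fitting and prediction step.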

Using the T50 model trained on the 53 experimentally active sequences in combination with the functionality model, we predicted the 13 most stable enzymes that are also expected to be catalytically active. These were constructed and characterized. Ten of the 13 were active (Table S4); these sequences and their stabilities are reported in Fig. 4. As shown in Fig. 7A, their stabilities closely matched the predictions. Five of these variants were slightly more stable than the most stable parental enzymes. Interestingly, two of the highly stable chimeras also hydrolyzed more cellulose than the most active parental enzyme, CelY-2, both in a 1-h assay (Figs 5 and 7C,D) and in a 48-h assay (Fig. 7B), demonstrating the potential utility of these chimeric enzymes for the construction of designer cellulosomes.

Figure 7.

Predicting the most stable Cel48 chimeras. (A) The T50 model trained on all 53 active parent and chimeric test cellulases (crosses) was used to predict 10 very stable chimeras that were subsequently constructed. All 10 are very stable (triangles). (B) Activities of the two most stable, most active chimeras and the most stable, most active cellulosomal parent sequence, CelY-2. Activities were measured in the form of reducing-end sugars released (reported as cellobiose equivalents released) over a 48-h period, with 0.2 μM enzyme and 0.2 μM miniscaffoldin in 10 g·L⁻¹ Avicel at 75 °C. All measurements were carried out in triplicate. (C) Temperature–activity profiles for the two most stable, most active chimeras and the most stable, most active cellulosomal parent sequence, CelY-2. Activities were measured in a 1-h assay, with 0.2 μM enzyme and 0.2 μM miniscaffoldin in 10 g·L⁻¹ Avicel. The activities are normalized to the maximum activity of CelS. (D) The maximum activities of the three parent constructs and two of the most active chimeras. The activities were measured in a 1-h assay, with 0.2 μM enzyme and 0.2 μM miniscaffoldin in 10 g·L⁻¹ Avicel. The activities are normalized to the maximum activity of CelS. Activities were measured both by the number of reducing-end sugars released and the total glucose released.

Probing biochemistry with synthetic diversity

With 60 active cellulase chimeras in hand, we next examined the relationship between the optimal temperature for catalytic activity (Topt) and resistance to temperature-induced denaturation (T50) over a broad range of temperatures. These two properties are closely correlated (Fig. 8), indicating that engineering Cel48 enzymes for greater thermostability increases their optimal catalytic temperatures. Some of the chimeric cellulases have a Topt higher than their T50. We believe that this reflects the stabilizing effect of cellulose substrate, because the substrate is present in the Topt assays but not in the denaturation step of the T50 assays. This effect can be seen in Fig. S7, where T50 values in the presence of cellulose are ~ 2 °C higher than in its absence.

Figure 8.

The correlation between optimum operating temperature for a 2-h assay (Topt) and thermostability (T50) for all 63 chimeric and parent Cel48 enzymes in this study. There is a strong correlation (r² = 0.83): chimeras with greater stability tend to be most active at higher temperatures. The parents are highlighted in black.
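A correlation of the kind summarized by r² can be computed as the squared Pearson coefficient between paired T50 and Topt values. The sketch below is a generic calculation with invented numbers, not the data behind Fig. 8.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical paired measurements in degrees Celsius.
t50 = [55.0, 60.0, 66.0, 71.0, 75.0]
topt = [57.0, 61.0, 69.0, 72.0, 78.0]
print(round(r_squared(t50, topt), 3))
```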

Discussion

The dearth of characterized Cel48 enzymes with different properties is an impediment to their use in designer cellulosomes for specific engineering applications, and inhibits the discovery of sequence–function relationships for these important enzymes. We have used structure-guided protein recombination to expand the diversity of characterized Cel48 enzymes. Using SCHEMA to identify suitable crossover locations for shuffling sequence blocks among the three parent Cel48 catalytic domains, we have generated a large set of novel, active cellulases that have the same architecture and are expressed under the same conditions in the same Escherichia coli host, where they are straightforward to characterize and compare. As expected, we found that properties such as Topt (the ability to remain active at elevated temperature), T50 (the ability to withstand denaturation at high temperature) and the specific activity at Topt vary greatly among these novel enzymes. We also found that functional status, T50 and Topt can be predicted from simple linear models built from sequence–function data from a small sample of the library. This has enabled us to efficiently identify stable chimeras, some of which have high cellulolytic activities.

This set of related enzymes can contribute to our understanding of how sequence affects Cel48 properties. The thermostability model illuminates stabilizing blocks of amino acids, whether they exist in the most stable proteins or not. Two of the most stabilizing blocks are predicted to be from the parent CelS at positions F and G. These blocks are located in the C-terminus of the catalytic domain, close to where the dockerin attaches, which suggests an important stabilizing interaction between these blocks and the C. thermocellum dockerin. When the dockerin binds the cohesin, the linker between the catalytic domain and dockerin is pleated, and this brings the dockerin into close contact with the catalytic domain [30]. A CelS dockerin–cohesin crystal structure would be valuable for identifying specific stabilizing interactions between these two domains.

With this work, we also address another biochemical question with important engineering implications. Using this accessible set of related enzymes, we investigated the correlation between the temperature at which an enzyme is most active and the temperature at which it denatures irreversibly. We found that Cel48 chimeras with greater thermostability also have their activity optima at higher temperatures, and that these temperatures are closely related. In other words, the ability to withstand temperature-induced denaturation at ever-higher temperatures leads to increases in the optimum temperature for activity. It is not necessarily the case that increased structural stability and resistance to denaturation and irreversible inactivation will result in the ability to catalyze the reaction efficiently at higher temperatures, particularly if local instability or dynamics influence catalysis [31]. Among the Cel48 chimeras, however, there is sufficient structural stability in key catalytic regions to make T50 a good surrogate for Topt.

We found that two of the predicted thermostable chimeras had higher specific activities at Topt than the most active parental enzyme, CelY-2. When assayed over a 48-h period, they hydrolyzed twice as much cellulose as CelY-2. These chimeric enzymes, which we have analyzed in a cellulosomal construct, may find potential uses in designer cellulosomes. An important next step will be to determine whether they provide an enhanced cellulolytic capability to a system such as the C. thermocellum cellulosome.

Experimental procedures

Parental enzyme constructs

The genes encoding CelF and CelS were PCR-amplified from genomic DNA with Phusion polymerase, with primers CTHE312.40 and CTHE2453.40 for CelS, and CCEL786.41 and CCEL2864.41 for CelF, introducing HindIII and SacI sites at the 5′-end, as well as a NotI site at the 3′-end (Table S5). Taq polymerase was used to add A-overhangs for TA-cloning into pGEM-T Easy (Promega, Madison, WI, USA). The resulting plasmids were called pGEMT–CTHEwt and pGEMT–CCELwt. The CelS dockerin was added to the CelF catalytic domain to create the plasmid pGEMT–CCELmut1. These constructs were cloned into pET-22(+) by the use of NdeI and NotI sites.

We designed a synthetic gene for CelY from C. stercorarium on the basis of available sequence information but with restriction sites NdeI, HindIII, BsaXI, PstI and SapI removed. The gene was codon-optimized for expression in E. coli by DNA 2.0 (Doc. S1). The CelY gene was cloned into pET-22(+) by the use of NdeI and NotI restriction sites. The resulting construct was termed pET22b+CSTEwt, and contains the catalytic domain, the DUF, and the CBM. Two more constructs were made from the CelY gene: CelY-1, containing the catalytic domain, the DUF and the C. thermocellum dockerin, and CelY-2, containing only the catalytic domain and the C. thermocellum dockerin. Products were cloned into pET-22(+) by the use of NdeI and NotI restriction sites.

An XbaI restriction site was introduced by overlap extension PCR into all parental constructs between the catalytic domain and the dockerin, which allowed swapping of catalytic domains and dockerins. The XbaI site did not affect activity (Fig. S2A).

Recombination library design

The SCHEMA library was designed with the tools available on the Arnold group homepage (http://www.che.caltech.edu/groups/fha/). The catalytic domains of CelF, CelY and CelS were aligned with ClustalW, from Tyr40 to Phe661, based on the numbering of CelS. We analyzed all available structures without point mutations of the catalytic domains of CelS and CelF [CelF PDBs – 1F9O, 1FAE, 1FBO, 1FCE, and 1G9G; CelS PDBs – 1L1Y (six chains), and 1L2A (six chains); a total of 17 chains]. Of the 3035 unique residue–residue contacts in all 17 structures, on average 73% are conserved between any CelF structure and CelS structure, as compared with an average of 80% of contacts conserved between any two CelF structures, and an average of 80% of contacts conserved between any two CelS structures. As contacts between structures of the same enzyme vary almost as much as contacts between structures of CelF and CelS, we made use of all 17 available structures in designing the library. The average SCHEMA energy for a library (<E>) was calculated for each structure, and libraries were evaluated on the basis of the average <E> from all 17 structures. Seven crossover sites were chosen with the RASPP algorithm [22], with a minimum fragment size of 30 residues. RASPP returned a set of candidate libraries characterized by <E> (the average number of contacts broken within a library for a given structure), ≪E≫ (the average of <E> for a given library across all 17 structures), and <m> (the average number of amino acid substitutions from the closest parent within a library). Figure S3A shows ≪E≫ as a function of <m>. We removed solutions without a conserved amino acid at the designated crossover sites (Fig. S3B). To obtain libraries with mutations more evenly distributed among blocks, we also calculated the standard deviation of the average number of mutations per block for each library. Lower values indicate more evenly distributed mutations. Figure S3C shows ≪E≫ as a function of the standard deviation of block mutations. From this set, we picked a library that was expected to contain a large number of active enzymes with high sequence diversity: the chosen library has an ≪E≫ of 31.3 and an <m> of 106. Calculated for each of the 17 structures, <E> for the library varies from 28 to 34.
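The two library statistics used above can be illustrated with a minimal Python sketch. It assumes the usual SCHEMA convention, which this section does not spell out: a contact counts as broken when its two residues come from different parents and the resulting residue pair occurs in no parent at those positions. The toy parents, block map and contact list are hypothetical and are not the Cel48 data; with several structures, the same calculation would be repeated per contact map and the results averaged.

```python
from itertools import product

def chimera_sequence(choice, block_of, parents):
    """Build a chimera: position k takes its residue from the parent chosen for its block."""
    return "".join(parents[choice[block_of[k]]][k] for k in range(len(block_of)))

def broken_contacts(choice, block_of, parents, contacts):
    """Count contacts (i, j) whose residue pair in the chimera occurs in no parent."""
    broken = 0
    for i, j in contacts:
        pi, pj = choice[block_of[i]], choice[block_of[j]]
        if pi == pj:
            continue  # both residues come from one parent, so the pair is native
        pair = (parents[pi][i], parents[pj][j])
        if all((p[i], p[j]) != pair for p in parents):
            broken += 1
    return broken

def mutations_from_closest_parent(choice, block_of, parents):
    """Minimum number of substitutions separating the chimera from any parent."""
    seq = chimera_sequence(choice, block_of, parents)
    return min(sum(a != b for a, b in zip(seq, p)) for p in parents)

def library_averages(block_of, parents, contacts):
    """Average broken contacts and average mutations over every chimera in the library."""
    n_blocks = max(block_of) + 1
    choices = list(product(range(len(parents)), repeat=n_blocks))
    avg_e = sum(broken_contacts(c, block_of, parents, contacts) for c in choices) / len(choices)
    avg_m = sum(mutations_from_closest_parent(c, block_of, parents) for c in choices) / len(choices)
    return avg_e, avg_m

# Toy data (hypothetical): three aligned 6-residue parents, two blocks, four contacts.
PARENTS = ["ACDEFG", "ACDKLM", "HIDEFM"]
BLOCK_OF = [0, 0, 0, 1, 1, 1]          # alignment position -> block index
CONTACTS = [(0, 3), (1, 4), (2, 5), (0, 1)]

print(broken_contacts((2, 1), BLOCK_OF, PARENTS, CONTACTS))
print(library_averages(BLOCK_OF, PARENTS, CONTACTS))
```

Contacts within a single block never break in this scheme, which is why the placement of crossover sites determines the average disruption of a library.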

Construction of chimeras

Chimeric genes were assembled from 24 gene fragments, representing the eight blocks from each of the three parents, with the sequence-independent site-directed chimeragenesis method [23]. The following consensus sites were used for the crossover sites: (a) CCG; (b) GCC; (c) GAC; (d) CAT; (e) GGT; (f) AAC; and (g) TTA (Table S6). Mini-libraries were cloned into pGEMT by the use of SpeI and SacII sites. Full libraries were made by isolating large amounts of DNA from plasmids digested with SpeI and SacII, not by PCR amplification. Instead of SapI, the isoschizomer LguI was used. A C. thermocellum dockerin was attached to the C-terminus of each chimeric sequence during reassembly. The genes were expressed in pET-22(+) under the control of an isopropyl-thio-β-d-galactoside (IPTG)-inducible T7 promoter in E. coli BL21(DE3). A similar approach was used for construction of the specific chimeras predicted to be thermostable, except that only the blocks required for the desired chimera were used in the ligation steps.
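As a small illustration of the combinatorics, eight blocks drawn independently from three parents give 3^8 = 6561 possible block combinations from the 24 fragments. The Python sketch below enumerates them and joins fragments in block order; the fragment labels are hypothetical placeholders, not real gene sequences, and the function names are chosen only for this example.

```python
from itertools import product

PARENT_NAMES = ("CelF", "CelY", "CelS")
N_BLOCKS = 8

def all_block_choices(n_parents=len(PARENT_NAMES), n_blocks=N_BLOCKS):
    """Yield every chimera as a tuple of parent indices, one index per block."""
    return product(range(n_parents), repeat=n_blocks)

def assemble(choice, fragments):
    """Join fragments in block order; fragments[p][b] is block b from parent p."""
    return "".join(fragments[p][b] for b, p in enumerate(choice))

# Placeholder fragments (hypothetical): a parent letter plus the block number.
FRAGMENTS = [[letter + str(b + 1) for b in range(N_BLOCKS)] for letter in "FYS"]

n_fragments = sum(len(blocks) for blocks in FRAGMENTS)
n_chimeras = sum(1 for _ in all_block_choices())
print(n_fragments, n_chimeras)
print(assemble((0, 1, 2, 0, 1, 2, 0, 1), FRAGMENTS))
```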

Quality of library

We completely sequenced 61 randomly chosen chimeras in order to assess the frequency of library construction artefacts, including point mutations, deletions, and insertions. Eighty-nine per cent of the sequenced chimeras (54 of 61) contained no amino acid mutations, no insertions, and no deletions. We found a single insertion, and two sequences were missing one-half of the library. Two sequences were back-to-front in the vector, and two sequences contained one remaining tag. Every block from every parent was found in the randomly sequenced chimeras, but CelF block E appears to be underrepresented in the library. The distribution of each block is shown in Table S7.

Protein expression in 96-well plates

In 96-well shallow-well plates, 300 μL of LB medium (10 g of tryptone, 5 g of yeast extract, 10 g of NaCl) containing 100 mg·L−1 ampicillin were inoculated with a single colony of E. coli BL21(DE3) carrying the cellulase gene on a pET-22(+) plasmid. Plates were grown overnight in an orbital shaker at 37 °C and 250 rpm. In a 96-well deep-well plate, 900 μL of TB medium (12 g of tryptone, 24 g of yeast extract, 4 mL of glycerol, in 1 L of H2O with 17 mM KH2PO4 and 72 mM K2HPO4) containing 100 mg·L−1 ampicillin were inoculated with 50 μL of the overnight culture, and grown in an orbital shaker at 37 °C until the D600 nm reached 1.6–1.8. Plates were cooled to < 17 °C, induced with a final concentration of 50 μM IPTG, and grown at 17 °C for 16 h. Cultures were harvested by centrifugation at 5000 g for 10 min, and stored at −20 °C.

Cellulase activity assay in 96-well plates

Cells were resuspended in 300 μL of lysis buffer (10 mM Tris, pH 8.0, 10 mM MgCl2, 0.7 mg·mL−1 lysozyme, 4 U·mL−1 DNase) per well, and incubated for 60 min at 37 °C. Plates were centrifuged for 5 min at 5000 g at 4 °C. From the supernatant, 100 μL was transferred to a 96-well PCR plate with 50 μL of a 10 g·L−1 Avicel suspension in reaction buffer (50 mM succinate, pH 6.0, 1 mM CaCl2) and 0.2 μM purified miniscaffoldin (Fig. S8). Hydrolysis proceeded overnight at both 50 °C and 75 °C. Plates were centrifuged for 3 min at 200 g at 4 °C, and from each well 50 μL of supernatant was transferred to a new plate. The amount of reducing ends was determined with the Park–Johnson assay.

Park–Johnson activity assay

Reagent A comprised 0.5 g·L−1 K3Fe(CN)6 and 0.2 M K2HPO4 (pH 10.6). Reagent B comprised 5.3 g·L−1 Na2CO3 and 0.65 g·L−1 KCN. Reagent C comprised 2.5 g·L−1 FeCl3, 10 g·L−1 poly(vinylpyrrolidone), and 1 M H2SO4. In a 96-well PCR plate, 50 μL of test sample was mixed with 150 μL of a 2 : 1 A/B mixture (i.e. 100 μL of reagent A and 50 μL of reagent B). The plate was sealed, heated to 95 °C for 15 min, and then cooled to 4 °C. From this plate, 180 μL was transferred to a transparent flat-bottomed screening plate containing 90 μL of reagent C. The plate was incubated in the dark for 1–3 min before the A520 nm was measured in a TECAN plate reader. If glucose equivalents were determined, a calibration curve made from solutions of defined glucose concentrations was included on each plate [32].
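Converting absorbance readings to glucose equivalents with an on-plate calibration curve can be sketched as follows. This is a minimal Python illustration that assumes a linear absorbance response over the range of the standards; the source does not state how the curve was fitted. The standard concentrations, absorbance values and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def glucose_equivalents(absorbance, slope, intercept, dilution=1.0):
    """Invert the calibration line and correct for any sample dilution."""
    return (absorbance - intercept) / slope * dilution

# Hypothetical glucose standards (concentration units as prepared) and readings.
standards = [0.0, 50.0, 100.0, 200.0]
readings = [0.05, 0.15, 0.25, 0.45]

slope, intercept = fit_line(standards, readings)
print(glucose_equivalents(0.35, slope, intercept))
```

The `dilution` argument covers cases in which the supernatant is diluted before the assay, so that the reported value refers to the undiluted reaction mixture.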

Enzymatic glucose activity assay

The β-glucosidase (BG) solution comprised 0.25 g·L−1 almond BG in 50 mM sodium acetate (pH 5.0). The tetramethylbenzidine (TMB) solution comprised 0.8 g·L−1 TMB in double-distilled H2O. The horseradish peroxidase (HRP) solution comprised 0.15 g·L−1 HRP in 50 mM sodium acetate (pH 5.0). The glucose oxidase (GOX) solution comprised 0.1 g·L−1 GOX in 50 mM sodium acetate (pH 5.0). In a transparent flat-bottomed screening plate, 100 μL of test sample was mixed with 50 μL of BG solution. If glucose equivalents were determined, a calibration curve made from solutions of defined glucose concentrations was included on each plate. The plate was sealed, and incubated for 16 h at 37 °C. For development, 50 μL of TMB solution and 20 μL each of HRP solution and GOX solution were added to the plate. After 5 min, the A650 nm was measured in a TECAN plate reader.

Protein purification

Each cellulase was purified from E. coli BL21(DE3) carrying the cellulase gene with a C-terminal His-tag on a pET-22(+) plasmid under the control of an IPTG-inducible promoter. The cells were grown in TB medium (12 g of tryptone, 24 g of yeast extract, 4 mL of glycerol, in 1 L of H2O with 17 mM KH2PO4 and 72 mM K2HPO4) at 37 °C with 100 mg·L−1 ampicillin. Cells were induced with a final concentration of 50 μM IPTG, grown for 16 h at 17 °C, and harvested by centrifugation for 10 min at 5000 g. Pellets were resuspended in buffer A (20 mM Tris, pH 7.4). The cells were lysed by sonication, and the lysate was centrifuged at 75 000 g for 30 min to sediment cell debris. The supernatant was loaded onto a 1-mL Ni2+–nitrilotriacetic acid His-trap column (GE Healthcare, Little Chalfont, UK), and purified by washing with 1% buffer B (20 mM Tris, pH 7.4, 100 mM NaCl, 300 mM imidazole) for 15 column volumes, followed by a gradient elution (increase to 80% buffer B in 10 column volumes). Cellulase-containing fractions were pooled, and concentrated with protein concentrators with cellulose-free membranes (Vivaproducts, Middleton, MA, USA). Buffer was changed to 10 mM Tris (pH 8.0) by repeated refills. Purified proteins were flash frozen, and stored at −20 °C for up to 3 months. Protein concentration was determined with the Bradford assay, with BSA as the protein standard. Protein purity was determined from SDS/PAGE gels. The amounts of isolated protein were 15–60 mg·L−1 for dockerin-containing constructs and 120 mg·L−1 for CelY.

Thermostability assay (T50 measurements)

For each well of a 96-well PCR plate, 50 μL of a 20 g·L−1 Avicel suspension in reaction buffer (50 mM succinate, pH 6.0, 1 mM CaCl2) was mixed with 25 μL of 0.8 μM miniscaffoldin and spun down for 10 min at 5000 g. In a different PCR plate, 30 μL of 0.8 μM cellulase in reaction buffer was pipetted per well. Plates were incubated for 10 min in a gradient PCR cycler at the indicated temperatures, and then placed on ice. Heat-treated cellulases were transferred (25 μL per well) to the Avicel-containing PCR plate, and the reaction was run for 60 min at the indicated temperature. Plates were spun down for 3 min at 200 g. Then, 50 μL of supernatant was transferred to a new 96-well PCR plate and tested with either the Park–Johnson assay or the enzymatic glucose assay.
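One simple way to extract a T50 value from such a heat-inactivation series is sketched below in Python. It takes T50 to be the temperature at which residual activity falls to half of its maximum after the 10-min heat treatment, and estimates it by linear interpolation between the two bracketing temperatures. Both the definition and the interpolation are assumptions made for this illustration, as the section does not describe the fitting procedure; the data are hypothetical.

```python
def t50_by_interpolation(temps, activities):
    """Return the temperature at which residual activity crosses half of its maximum.

    temps must be in ascending order; activities are the matching residual activities.
    Returns None if the half-maximal level is never crossed.
    """
    half = 0.5 * max(activities)
    points = list(zip(temps, activities))
    for (t_lo, a_lo), (t_hi, a_hi) in zip(points, points[1:]):
        if a_lo >= half > a_hi:
            # Linear interpolation between the two points that bracket half-maximal activity.
            fraction = (a_lo - half) / (a_lo - a_hi)
            return t_lo + fraction * (t_hi - t_lo)
    return None

# Hypothetical residual activities after heat treatment at each temperature (°C).
temps = [60.0, 65.0, 70.0, 75.0, 80.0]
activities = [1.00, 0.95, 0.80, 0.20, 0.05]
print(t50_by_interpolation(temps, activities))
```

With noisy data, fitting a sigmoidal curve to all points would be a more robust alternative to interpolating between two of them.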

Temperature profiles (Topt measurements)

A final concentration of 0.2 μM enzyme or 0.2 μM enzyme plus 0.2 μM miniscaffoldin was added to a preheated suspension of 10 g·L−1 Avicel in reaction buffer (50 mM succinate, pH 6.0, 1 mM CaCl2). The hydrolysis was performed at a range of temperatures for 2 h in duplicate. Samples were spun down for 1 min at 200 g at 4 °C. From each well, 50 μL of the supernatant was transferred to a 96-well PCR plate, and analyzed with either the Park–Johnson assay or the enzymatic glucose assay. The Topt was determined from the temperature profiles of the chimeras.

Forty-eight-hour activity assay

A final concentration of 0.2 μM enzyme plus 0.2 μM miniscaffoldin was added to a preheated suspension of 10 g·L−1 Avicel in reaction buffer (50 mM succinate, pH 6.0, 1 mM CaCl2) at 75 °C. At regular intervals, the Avicel was resuspended, and a sample of the reaction mixture was removed and cooled to 4 °C. Samples were spun for 1 min at 200 g, and 50 μL of a 1 : 10 dilution of the supernatant was analyzed with the Park–Johnson assay. The measurements were performed in triplicate.

CD

CD measurements were carried out with an Aviv Model 62DS spectrometer with 6 μM protein samples. Wavelength scans to determine the ellipticity were carried out at 25 °C.

Linear regression

Regression models for T50 and Topt were trained with MATLAB's ‘regress’ function. The regression model for functionality was trained with L1-regularized logistic regression from the glmnet toolbox for MATLAB [33, 34].
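The additive block model behind these regressions can be illustrated with a short Python sketch. It is a stand-in for the MATLAB fit, not the authors' code: each chimera is encoded with an intercept plus indicator variables for the non-reference parent at each block, and the weights are obtained by ordinary least squares through the normal equations. The choice of reference parent, the two-block toy library and the weight values are all hypothetical.

```python
from itertools import product

def encode(chimera, n_parents=3):
    """Intercept plus one indicator per block for each non-reference parent (parent 0 is the reference)."""
    row = [1.0]
    for parent in chimera:
        row.extend(1.0 if parent == k else 0.0 for k in range(1, n_parents))
    return row

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_additive(chimeras, values):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    rows = [encode(c) for c in chimeras]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve(xtx, xty)

def predict(chimera, weights):
    return sum(w * x for w, x in zip(weights, encode(chimera)))

# Hypothetical two-block library with perfectly additive T50 values (°C).
TRUE_WEIGHTS = [60.0, 2.0, -3.0, 5.0, 1.0]
held_out = (2, 1)
training = [c for c in product(range(3), repeat=2) if c != held_out]
weights = fit_additive(training, [predict(c, TRUE_WEIGHTS) for c in training])
print(predict(held_out, weights))
```

Because block contributions are additive in this model, a chimera absent from the training set can still be predicted as long as each of its blocks appears in some training chimera.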

Acknowledgements

This work was supported by the Department of the Interior through grant D10AP00065 from the Defense Advanced Research Projects Agency to F. H. Arnold. M. A. Smith is supported by a Resnick Sustainability Institute fellowship, A. Rentmeister by a DFG postdoctoral fellowship, and T. Wu by a CIT summer undergraduate research fellowship (SURF).
