Dissecting the Determinants of Domain Insertion Tolerance and Allostery in Proteins

Abstract Domain insertion engineering is a promising approach to recombine the functions of evolutionarily unrelated proteins. Insertion of light‐switchable receptor domains into a selected effector protein, for instance, can yield allosteric effectors with light‐dependent activity. However, the parameters that determine domain insertion tolerance and allostery are poorly understood. Here, an unbiased screen is used to systematically assess the domain insertion permissibility of several evolutionarily unrelated proteins. Training machine learning models on the resulting data allows dissecting the features informative for domain insertion tolerance and reveals sequence conservation statistics as the strongest indicators of suitable insertion sites. Finally, extending the experimental pipeline toward the identification of switchable hybrids results in opto‐chemogenetic derivatives of the transcription factor AraC that function as single‐protein Boolean logic gates. The study reveals determinants of domain insertion tolerance and yields multimodally switchable proteins with unique functional properties.


Supplementary Text

Note S1. Analyzing AlphaFold2 structure predictions of domain insertion variants.
In light of the recent advances in protein structure prediction, a logical question was whether AlphaFold2 (AF2) could guide the identification of promising domain fusions. We chose the pLDDT metric as a starting point, as it was previously shown to be correlated with flexible protein regions [1-3] and could hence serve as a potential indicator of suitable domain insertion sites. Analysis of the pLDDT scores of individual amino acids from an AF2-derived structure of wildtype AraC revealed a trend towards lower pLDDT values at enriched sites, although the resulting correlation was very weak (Spearman's r of -0.26; Figure S13).
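As an illustration of the statistic quoted above, the following is a minimal sketch of Spearman's rank correlation written with the Python standard library only. The function names and the example values are hypothetical and are not data from this study; the actual analysis may well have used a statistics package instead.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-residue values, for illustration only (not data from the study).
plddt = [92.1, 88.4, 61.0, 55.3, 90.7, 70.2]
enrichment = [-1.2, -0.8, 0.9, 1.4, -1.0, 0.3]
rho = spearman(plddt, enrichment)
```

In this toy example the two rankings are exactly reversed, so the coefficient is at its lower bound of -1; the value of -0.26 reported above corresponds to a far weaker monotonic relationship.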
Next, we predicted AF2 structures of all possible PDZ insertions into AraC (Figure S14A). Representing all amino acid-wise pLDDT scores corresponding to AraC from each fusion protein in a heatmap allowed us to investigate the effect each insertion has on the pLDDT scores of AraC (Figure S14B). Most prominent in the resulting representation is a diagonal of decreased pLDDT values corresponding to the residues neighboring the respective position of the PDZ insertion.
These lower values could indicate structural flexibility around the respective insertion site. This interpretation is backed by the fact that the unstructured loops of AraC are also visible as vertical regions with decreased pLDDT scores. We note that the structure of the N-terminal β-barrel (AA 20-100) is implicitly visible in the upper left quarter of the heatmap as a symmetric pattern of locally decreased pLDDT scores that indicates its loop regions. Thus, the pLDDT scores reflected structural features of AraC and potentially local conformational effects of insertions, although these findings remain speculative at this point. However, the pLDDT score changes did not correlate with the experimentally determined enrichment scores (Figure S14C).
In line with the pLDDT values, the structural differences between predicted models of wildtype AraC and the corresponding parts of AraC-PDZ hybrid structures exhibited a similar trend (Figure S14D). When, in turn, the PDZ insert was compared to its wildtype conformation (Figure S14E), misfolding of the domain was predicted for several hybrids, although the corresponding insertion sites did not necessarily correspond to regions of significant depletion in our screen.
Taken together, the exploration of predicted hybrid protein structures suggested that AF2 is not able to capture the functional effects of domain insertions in a meaningful way. Given the generally lower performance of AF2 on multi-domain proteins [4], domain insertion engineering might still be beyond the scope of AF2 and similar state-of-the-art structure prediction methods. Nonetheless, AF2 predictions do reflect diverse structural features of AraC.

Note S2. Optogenetic AraC variants and single-protein Boolean logic gates.
Boolean logic computations are typical elements of genetic circuits and programs used in synthetic biology. They are usually implemented at the transcriptional and/or translational level [5,6], which causes delays in the signal relay and integration. Protein-based logic computation, in contrast, does not suffer from such delays and thus holds great potential for the custom control of cellular processes and the implementation of computational circuits in cells [7,8]. However, increasingly complex circuit designs require the use of a large number of individual protein components as well as their efficient communication via protein-protein interactions [7,8]. A recent preprint, for instance, impressively demonstrated the design of neural-network computations on the protein level [9], showcasing the increasing power of artificial protein networks in living cells. Such complex cellular compute programs, however, show considerable noise and potentially cross-talk within the system and thus require efficient processing at the level of the individual protein components.
In contrast to such commonly used protein logic gate designs based on several separate protein components, our AraC-LOV2 fusions represent single-protein Boolean logic gates (Figure 4B).
AraC-I113-LOV2 acts as an AND-gate, integrating blue light and arabinose as inputs, while AraC-S170-LOV2 represents a NIMPLY-gate. We have found only two other examples of engineered, single-protein logic gates in the literature. The first was constructed by fusing LOV2 and uniRapR domains to a kinase [10], resulting in OR-gate behavior, while the second comprises an engineered transcription factor that responds to light and temperature changes [11]. A particular advantage of these single-protein logic gates is that they provide a direct wiring from the input signals to the output computation within a single polypeptide chain, a computation that would otherwise require several separate protein components. Building on this unique feature could greatly simplify the design of compute circuits and their operation in living cells. Our combination of the naturally existing allosteric signaling in AraC with an artificial, second input might be an engineering approach that could be easily adapted to other proteins and thereby facilitate the implementation of Boolean logic for demanding computational operations in cells.
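For readers less familiar with the gate terminology, the truth tables of the two gate types named above can be sketched as follows. This is a purely logical illustration; the assignment of arabinose to input A and blue light to input B is an assumption made for this example, and the text above does not specify which input plays the inhibitory role in the NIMPLY gate.

```python
from itertools import product


def and_gate(a: bool, b: bool) -> bool:
    """Output is on only when both inputs are present."""
    return a and b


def nimply_gate(a: bool, b: bool) -> bool:
    """A NIMPLY B: output is on only when A is present and B is absent."""
    return a and not b


def truth_table(gate):
    """Evaluate a two-input gate on all four input combinations."""
    return {(a, b): gate(a, b) for a, b in product((False, True), repeat=2)}


# Assumption for illustration: input A = arabinose, input B = blue light.
for name, gate in (("AND", and_gate), ("NIMPLY", nimply_gate)):
    print(name)
    for (arabinose, light), output in truth_table(gate).items():
        print(f"  arabinose={arabinose!s:5} light={light!s:5} -> output={output}")
```

Both gates are on for exactly one of the four input combinations; they differ only in whether the second input is required or must be absent.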
Taken together, the ability to integrate complex information and wire it to desired actuations is one of the main goals in synthetic biology. Integrating more functions into single amino acid chains therefore has the potential to simplify molecular networks, relieve the host cell of metabolic burden [12] and reduce noise derived from stochastic fluctuations of the individual components [13]. In this way, single-protein logic gates could contribute to future generations of synthetic biology approaches to program and re-wire cells.

Figure S2. Enrichment of active effector-insert hybrids from candidate libraries via FACS. (A) Schematics of the reporter assays for AraC, the Flp recombinase, the TVMV protease and SigF are shown. (B, C) Histograms depicting the RFP fluorescence distribution during FACS-based library enrichment. Representative histograms generated from 25,000 gated events of (B) the initial library and (C) after the first enrichment are shown. The negative controls (-) carried a plasmid expressing a different candidate protein not activating the reporter construct.

Figure S3. Domain insertion profiling outcomes are highly reproducible. (A, B) The enrichment scores of biological replicate 1 are plotted against the respective scores of replicate 2 for the different effector-PDZ libraries (A) and the additional AraC libraries with varying insert domains (B). Only variants that were not fully depleted during enrichment are shown. A linear fit with 95 % confidence intervals is included and Pearson correlation coefficients are indicated. Rep., replicate; norm., normalized. (C) The heatmap shows pairwise Pearson correlations between all domains inserted into AraC. Enrichments of the AraC-LOV2 library in darkness and under light induction (ind.) were assessed and are depicted separately.

Figure S4. Cross-validation of the domain insertion screen by experimental characterization of individual insertion variants. Individual domain insertion variants were cloned and their activity was assessed using the respective RFP reporter assays. Boxplots indicate the resulting normalized fluorescence for enriched and depleted candidates. Individual data points correspond to the mean of three biological replicates, each of which reflects three underlying technical replicates. The interquartile range (IQR) is marked by the box and the median is represented by a red line. Whiskers extend to the 1.5-fold IQR or to the value of the smallest or largest enrichment, respectively.
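The boxplot convention described in this caption can be sketched in a few lines. The helper below is hypothetical and assumes two things the caption does not state: quartiles are obtained by linear interpolation between data points, and each whisker ends at the most extreme data point that still lies within 1.5 IQR of the box (the convention used by common plotting libraries).

```python
from statistics import quantiles


def box_stats(values, k=1.5):
    """Quartiles, IQR and whisker ends for a Tukey-style boxplot."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between data points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers stop at the most extreme data point within the fences.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }
```

For the values 1 to 8 plus a single value of 30, the box spans 3 to 7, the whiskers end at 1 and 8, and 30 is reported as an outlier.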

Figure S5. Domain insertion tolerance depends on the identity of the insert. Results from insertion screens of AraC with the ERD, LOV2, uniRapR and eYFP insert domains are shown. Enrichments are mapped to the respective insertion site, indicated by the position of the AraC residue preceding the insertion. Light green, dark green: individual replicates. Grey: variants with zero reads after enrichment. Red: variants missing in the initial library.

Figure S6. Positions with insertion tolerance are clustered at distinct, locally confined surface sites.

Figure S8. Insertion-permissive regions are scattered across AraC and depend on the insert domain. The AF2-derived structure of AraC is colored by the SD of the min-max-scaled enrichment scores from all insert libraries corresponding to five different insertion domains. Functionally critical residues are highlighted in grey.
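The per-site quantity used for coloring in this figure can be sketched as follows. The function names are hypothetical, and the sketch assumes the population standard deviation and complete data for every site; the caption does not specify whether the sample or population SD was used or how missing variants were handled.

```python
from statistics import pstdev


def min_max_scale(scores):
    """Rescale one library's scores linearly to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant input: no spread to scale
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def per_site_sd(libraries):
    """SD across libraries of the scaled score at each insertion site."""
    scaled = [min_max_scale(lib) for lib in libraries]
    if len({len(lib) for lib in scaled}) != 1:
        raise ValueError("all libraries must cover the same insertion sites")
    # zip(*scaled) regroups the values site by site.
    return [pstdev(site_values) for site_values in zip(*scaled)]
```

Scaling each library separately before taking the SD keeps a library with a wide score range from dominating the comparison; a high value then marks a site whose tolerance differs between insert domains.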

Figure S9. AlphaFold2 predictions accurately capture the structures of the candidate proteins. (A-C) Structural alignments between experimentally resolved structures (grey) and AlphaFold2 predictions (green) are shown for AraC (A), Flp (B) and the TVMV protease (C). The RMSD of the aligned residues as well as the RMSD for all amino acids are shown. PDB-IDs: 2ARA, 2K9S, 1FLO, 3MMG.
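The RMSD values reported in this figure follow the usual definition, which is sketched below for coordinate pairs that are already matched and superposed. The function is a hypothetical helper: it performs no structural alignment itself, and the superposition and residue matching used for the figure are not described in this caption.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))
```

Two atoms that are each displaced by one unit along z, for example, give an RMSD of exactly one unit.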

Figure S10. Correlations between the enrichment scores and surface accessibility or secondary structures. (A) Scatter plot showing the relation between variant enrichment and the average surface exposed area (ASA) of the residues neighboring an insertion site. (B) The insertion scores in regions with the respective secondary structure element are shown. For each insertion site, the secondary structure assignments of the amino acids before and after the insertion were considered. The IQR is

Figure S11. Successful domain insertion cannot be predicted from amino acid identity. (A-D) The enrichment score distribution for each amino acid is shown as boxplots for the PDZ libraries of AraC (A), Flp (B), TVMV protease (C) and SigF (D). Both residues neighboring an insertion site were taken into account for the calculations. The interquartile range (IQR) is marked by the box and the median is represented by a line within the box. Whiskers extend to the 1.5-fold IQR or to the value of the smallest or largest enrichment. Colors indicate the different amino acid categories as indicated underneath the plots. Pos., positively charged. Neg., negatively charged.

Figure S12. Heatmap of pairwise Spearman correlations between all investigated positional features.

Figure S14. Correlations of AF2 structure predictions with domain insertion susceptibility.

Figure S15. Gradient boosting models trained on positional features can infer insertion tolerance for individual proteins. Performance metrics of gradient boosting classifiers that were trained on the PDZ datasets for Flp, TVMV protease and SigF with five-fold cross-validation are shown. The ROC (top) and precision-recall curves (bottom) are depicted for individual folds. The mean ROC is shown in red and the mean AUC is marked in light red.

Figure S16. Full comparison of the trained classifier to baseline predictors. (A, B) The mean AUROC (A) and average precision (B) are shown. The values were calculated on a previously withheld test set. The performance of the gradient boosting classifier is compared to all individual features.
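The two metrics named in this caption can be computed from labels and scores alone, which is also how a single feature can serve as a baseline predictor: its raw values are used directly as scores. The sketch below uses only the Python standard library; the function names are hypothetical, and the study itself presumably relied on a machine learning library rather than on code like this.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count winning pairs; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive.

    Tied scores are ranked in input order here, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive label")
    return sum(precisions) / len(precisions)
```

An AUROC of 0.5 corresponds to a random ranking, whereas the chance level of the average precision equals the fraction of positives, so the two panels are read against different baselines.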

Figure S17. Alignment-derived statistics are key predictors of insertion tolerance. (A) The decrease in accuracy upon random permutation of the respective features is presented for the gradient boosting model trained on the complete dataset. (B) Bar plot indicating the Gini importance of each feature of the reduced model. (C) The permutation importance of training features of the reduced model is shown. (A, C) The results were calculated individually for each structure in the cross-validation dataset. The IQR is marked by the box and the median is represented by a red line. Whiskers extend to the 1.5-fold IQR or to the value of the smallest or largest score, respectively. Outliers are shown as points.
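The permutation importance shown in panels A and C measures how much the accuracy drops when the values of one feature are shuffled across samples. A minimal sketch of that idea is given below, assuming rows are tuples of feature values and the model is any callable that returns a class label; `toy_model`, the number of repeats and the seed are hypothetical choices for this example and do not reflect the settings used in the study.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this column permuted.
            shuffled = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in for a trained classifier (hypothetical): it only looks at feature 0.
def toy_model(row):
    return int(row[0] > 0.5)


rows = [(i / 10, (i * 7 % 10) / 10) for i in range(10)]
labels = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores the second feature, shuffling that column never changes a prediction and its importance is exactly zero, which is the behavior expected for an uninformative feature.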

Figure S18. Distribution of light-switchable variants in the AraC-LOV2 dataset. (A) Enrichment scores of AraC-LOV libraries that were sorted following incubation in darkness (upper panel) or under blue-light exposure (lower panel) are mapped onto the corresponding insertion sites of AraC (preceding the indicated residue). Values for the light exposed sample correspond to a single experiment. For the sample incubated in the dark, light green and dark green

Figure S19. AlphaFold2 predicts different conformations for the lead AraC insertion variants. (A, B) AF2 predictions of AraC-I113-LOV2 (A) and AraC-S170-LOV2 (B) are shown. AraC is depicted in green and the AsLOV2 domain in blue. Residues that bind to the operator are highlighted in pink, key residues for dimerization in the induced state in red and the amino acids that are important for arabinose binding in vermilion.

Figure S20. Point mutations improve the performance of the AraC-S170-LOV light switch. (A-C) Cultures were inoculated from precultures carrying plasmids encoding an RFP reporter and the indicated AraC-I113-LOV (A), AraC-S170-LOV (B) and AraC (C) point mutants. The samples were incubated for 16 h under light exposure or in darkness at an arabinose concentration of 8 mM, followed by plate reader measurements of RFP fluorescence and OD600. Bars represent means from three independent biological replicates. Error bars show the SD.

Figure S21. Gating strategy used during sorting. (A) Scatter plot indicating how cells were selected via their forward and side scatter. (B) Scatter plot of side scatter height and width showing the gate that was set for the selection of singlets. (C) The population of red fluorescent bacteria was sorted as indicated in the scatter plot and the histogram of the measured RFP fluorescence.
KLD: Kullback-Leibler divergence (KLD) calculated from MSA
Insertion frequency: Insertion frequency in related natural sequences at the respective position
Deletion frequency: Deletion frequency in related natural sequences at the respective position
Mean ins. len.: Mean insertion length in related natural sequences at the respective position
Median ins. len.: Median insertion length in related natural sequences at the respective position
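As an illustration of the first feature listed above, a per-column Kullback-Leibler divergence can be computed from an alignment as sketched below. The uniform background distribution, the base-2 logarithm, the omission of gap characters and the optional pseudocount are all assumptions made for this example; the feature description only states that the KLD was calculated from the MSA.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_kld(column, background=None, pseudocount=0.0):
    """KL divergence (in bits) of one MSA column from a background distribution."""
    if background is None:
        # Assumption: uniform background over the 20 standard amino acids.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(c for c in column.upper() if c in background)  # skips gaps
    total = sum(counts.values()) + pseudocount * len(background)
    if total == 0:
        raise ValueError("column contains no scorable residues")
    kld = 0.0
    for aa, q in background.items():
        p = (counts[aa] + pseudocount) / total
        if p > 0:  # terms with p = 0 contribute nothing
            kld += p * log2(p / q)
    return kld


def msa_kld(sequences):
    """Per-column KLD for equally long, aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_kld("".join(col)) for col in zip(*sequences)]
```

Under these assumptions a fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column in which all 20 amino acids occur equally often scores zero, so higher values indicate stronger conservation.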