Domain shuffling of a highly mutable ligand‐binding fold drives adhesin generation across the bacterial kingdom

Abstract Bacterial fibrillar adhesins are specialized extracellular polypeptides that promote the attachment of bacteria to the surfaces of other cells or materials. Adhesin‐mediated interactions are critical for the establishment and persistence of stable bacterial populations within diverse environmental niches and are important determinants of virulence. The fibronectin (Fn)‐binding fibrillar adhesin CshA, and its paralogue CshB, play important roles in host colonization by the oral commensal and opportunistic pathogen Streptococcus gordonii. As paralogues are often catalysts for functional diversification, we have probed the early stages of structural and functional divergence in Csh proteins by determining the X‐ray crystal structure of the CshB adhesive domain NR2 and characterizing its Fn‐binding properties in vitro. Despite sharing a common fold, CshB_NR2 displays an ~1.7‐fold reduction in Fn‐binding affinity relative to CshA_NR2. This correlates with reduced electrostatic charge in the Fn‐binding cleft. Complementary bioinformatic studies reveal that homologues of CshA/B_NR2 domains are widely distributed in both Gram‐positive and Gram‐negative bacteria, where they are found housed within functionally cryptic multi‐domain polypeptides. Our findings are consistent with the classification of Csh adhesins and their relatives as members of the recently defined polymer adhesin domain (PAD) family of bacterial proteins.


| INTRODUCTION
Many bacteria present filamentous polypeptides on their surfaces termed adhesins. These extracellular proteins enable binding to biotic and abiotic surfaces, often via an intimate and highly specific interaction with a partner ligand or receptor. 1 Adhesin-mediated interactions make important contributions to colonization and pathogenicity, 2 with elucidation of the structures and functions of these proteins considered critical for informing our fundamental understanding of the mechanisms of microbial adhesion, and in directing future efforts in the development of antiadhesin-based therapies or antimicrobial surface coatings. 3 Most bacterial adhesins possess an equivalent gross architecture comprising a cell wall tethered elongated stalk with an adhesive tip.
These polypeptides are generally classified as belonging to one of two distinct groups: pili or fibrils. 1 Pili are homomeric assemblies typically composed of covalently or noncovalently linked precursor proteins. 4 In contrast, fibrillar adhesins usually comprise a single polypeptide chain. 2 Fibrillar adhesins have been found widely across the bacterial tree of life 5 and vary significantly in their molecular architectures and ligand-binding specificities. Despite their diversity, subsets of adhesive domains from fibrillar adhesins exhibit some degree of identity.
Examples include the adhesive domains of the Ag I/II-like polypeptides SspB, SpaP, PrgB, and GbpC, which constitute the recently designated polymer adhesin domain (PAD) superfamily, 6 and those from the adhesins Sgo0707 and CshA, which exhibit significant structural identity despite their disparate amino acid sequences. 7,8 The Streptococcus gordonii (S. gordonii) fibrillar adhesin cell surface hydrophobicity protein A (CshA) is a ~259 kDa cell wall anchored polypeptide that confers fibronectin (Fn)-binding properties to this bacterium. 9,10 CshA comprises an N-terminal "nonrepeat" (NR) region, which houses the adhesive CshA_NR2 domain, fused to a stalk-forming "repeat region" (RR), incorporating 17 sequentially arrayed ~100 amino acid domains (Figure 1A). 11 In previous studies of the CshA nonrepeat region, we demonstrated that Fn binding by CshA occurs via a distinctive "catch-clamp" mechanism, where the intrinsically disordered NR1 domain of the protein functions to "catch" Fn, through the formation of a rapidly assembled but readily dissociable precomplex, enabling its neighboring NR2 domain to form a tight-binding interaction with Fn, "clamping" the two polypeptides together (Figure 1B). 8 Subsequent elucidation of the crystal structure of CshA_NR2 revealed a distinctive β-sandwich core fold, composed of two antiparallel β-sheets decorated by α-helices. These studies also demonstrated that CshA_NR2 possesses a highly negatively charged pocket on its surface consistent with a role in Fn binding (Figure 1C).
Complementary studies demonstrated that CshA_NR2 binds tightly to Fn in vitro, even in the absence of CshA_NR1, confirming its implicit role in Fn binding.
Here we expand the scope of our studies of Csh proteins by focusing on the S. gordonii CshA paralogue CshB. This polypeptide is antigenically related to CshA, but is encoded for by a gene resident at a distant chromosomal locus. 11 CshB has previously been shown to play a role in Fn binding by S. gordonii, with ΔcshA strains retaining some Fn-binding capability as compared to a ΔcshA/ΔcshB double mutant that does not adhere to Fn-decorated surfaces. 11 In addition, deletion of either of these genes has been shown to abrogate colonization of the murine oral cavity by S. gordonii. 10 In this study, we have used in vitro binding assays of WT and mutant CshA polypeptides to unambiguously identify the Fn-binding pocket of CshA_NR2. Building on these findings we have elucidated the X-ray crystal structure of CshB_NR2, which is found to possess a core fold analogous to that of CshA_NR2. Inspection of the ligand-binding pocket of CshB_NR2 reveals significant variations in the amino acid composition of this region of the polypeptide as compared to CshA_NR2. Complementary in vitro binding studies demonstrate that these changes translate into a reduced Fn-binding affinity. To investigate the generality of our findings, a bioinformatics analysis was undertaken in an effort to identify more distantly related homologues of CshA/B_NR2 and assess their distribution in bacterial adhesins. This analysis reveals an unexpectedly broad distribution of CshA/B_NR2-like adhesive domains in both Gram-positive and Gram-negative fibrillar adhesins. These polypeptides appear to arise as a consequence of domain shuffling, yielding novel modular architectures of variable lengths. 5,6,12 Our analyses also reveal that well-characterized structural homologues of CshA/B_NR2 possess an equivalent β-sandwich core fold but vary in binding cleft architecture and partner selectivity. These findings are consistent with an evolutionary diversification of function, originating from a highly mutable progenitor fold. In undertaking this analysis, we also identify the adhesive domains of Sgo0707 and CshA/B as members of the previously reported PAD family of bacterial polypeptides 6 and demonstrate that domain shuffling within PAD family members functions as a mechanism for the generation of adhesins in both Gram-positive and Gram-negative species.

FIGURE 1 Structure and function of CshA. (A) Domain architecture of CshA, which comprises from N-terminus to C-terminus, an "FSIRK" sorting signal, the nonrepeat domains NR1, NR2, and NR3, 17 repeat domains, and an LPxTG anchoring motif. N- and C-terminal domain boundaries are indicated (sequence positions designated "N" and "C"). (B) Illustration of the "catch-clamp" mechanism of Fn binding by the CshA_NR1/NR2 domains. Fn is depicted in green, CshA domains are in purple. It is unknown which of the Fn domains are targeted by CshA, thus this illustration represents a simplified depiction. (C) Ribbon representation (left) and electrostatic surface potential (right; red, negative; blue, positive; white, neutral) of the CshA_NR2 domain. The Poisson-Boltzmann electrostatic scale bar shown depicts a potential of ±5 kT/e (red to blue, respectively). The Fn-binding site of CshA_NR2 is indicated by a dotted box.

| Gene cloning and protein production
Recombinant CshA_NR2 was expressed and purified as described previously. 8 For CshB_NR2, a DNA fragment corresponding to residues 211-528 of the full-length CshB polypeptide was PCR-amplified from S. gordonii DL1 chromosomal DNA using CloneAmp™ HiFi PCR premix (Takara Bio), employing appropriate primers incorporating consensus sequences to enable cloning into the expression vectors pOPINE or pOPINF (Table S1). 13 Resulting PCR products were ligated into either pOPINE, pre-cut with PmeI and NcoI, or pOPINF, pre-cut with KpnI and HindIII, using the In-Fusion™ system (Takara Bio). The resulting plasmids, termed cshB_NR2::pOPINE and cshB_NR2::pOPINF, encode a C-terminally hexa-histidine-tagged variant of CshB_NR2 and an N-terminally hexa-histidine-tagged variant of CshB_NR2, respectively.
Protein produced using cshB_NR2::pOPINE was used for crystallization studies, with protein produced using cshB_NR2::pOPINF employed in binding studies. For the CshA_NR2 single mutants D365W, E367W, M481W and the CshA_NR2 triple mutant D365W/E367W/M481W, codon optimized synthetic genes were sourced from the GeneArt service of Thermo Fisher. Synthetic genes were amplified by PCR using appropriate primers (Table S1)

2.2 | Assessment of protein secondary structure by circular dichroism spectroscopy
Circular dichroism (CD) spectra of WT and mutant CshA_NR2 polypeptides were collected using a Jasco J-1500 spectrophotometer fitted with a Peltier temperature control unit. Proteins were dialysed overnight at 4 °C into buffer comprising 10 mM sodium phosphate, 100 mM sodium fluoride, pH 7.4, prior to analysis. CD spectra were collected from 300 μL samples of CshA_NR2 WT, CshA_NR2 D365W, CshA_NR2 E367W, CshA_NR2 M481W and CshA_NR2 D365W/E367W/M481W, each at 1.0 mg/mL, at 4 °C, using a 1 mm path length cuvette (Hellma Analytics) pre-purged with nitrogen gas to remove any contaminants prior to sample analysis. For all experiments, a high-tension (HT) voltage of <700 V was taken as the quality threshold for the CD signal. Final spectra were generated as averages of four repeat scans, with appropriate protein-free buffer spectra subtracted.
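The scan averaging, buffer subtraction, and HT quality threshold described for the CD spectra can be sketched as follows. This is a minimal illustration rather than the processing actually used: the function name, the data layout (each scan as a list of ellipticity values on a shared wavelength grid), and the choice to mask points at or above 700 V instead of discarding whole spectra are all assumptions.

```python
from statistics import mean

HT_LIMIT_V = 700.0  # quality threshold stated in the text (HT < 700 V)

def process_cd(sample_scans, buffer_scans, ht_voltage):
    """Average repeat scans, subtract the averaged buffer, mask high-HT points.

    sample_scans / buffer_scans: lists of scans, each scan a list of
    ellipticity values on the same wavelength grid (hypothetical layout).
    ht_voltage: HT voltage recorded for the sample at each wavelength point.
    Returns one value per wavelength point, or None where HT >= HT_LIMIT_V.
    """
    # zip(*scans) regroups the data so each item holds one wavelength point
    sample_avg = [mean(point) for point in zip(*sample_scans)]
    buffer_avg = [mean(point) for point in zip(*buffer_scans)]
    return [
        s - b if ht < HT_LIMIT_V else None
        for s, b, ht in zip(sample_avg, buffer_avg, ht_voltage)
    ]
```

Masking individual points keeps the reliable part of a spectrum usable when only the far-UV end exceeds the HT threshold.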

| CshB_NR2 crystallization and structure elucidation
Conditions supporting the growth of CshB_NR2 crystals were initially identified using the sitting drop method of vapor diffusion at 20 °C, employing commercially available screens (Molecular Dimensions Ltd.). Diffraction-quality crystals were grown in 0.1 M SPG buffer (succinic acid, sodium dihydrogen phosphate and glycine, in the molar ratio 2:7:7) pH 5, 25% w/v PEG 1500. Crystals selected for diffraction data collection were mounted in appropriately sized LithoLoops (Molecular Dimensions) and flash-frozen in liquid nitrogen without the addition of extraneous cryoprotectant. Diffraction data were collected at Diamond Light Source, UK, on beamline I03. Data were processed with xia2 14 and DIALS 15 and the CshB_NR2 structure was determined by molecular replacement using PHASER, 16 employing the CshA_NR2 structure (5L2D) as a search model. The initial CshB_NR2 model was subjected to iterative rounds of model building (COOT 17 ) and refinement (REFMAC 18 ) undertaken using the CCP4i2 software suite. 19,20 The structure of CshB_NR2 has been deposited in the PDB 21 with the accession code 6YZG. Structural figures have been prepared using PyMOL. 22 Electrostatic surface analyses have been performed using the APBS plugin 23 in PyMOL following molecule preparation with pdb2pqr using PropKa 24 applying an AMBER 99 force field. APBS was implemented with a grid spacing of 0.50 and a surface visualization of the Poisson-Boltzmann electrostatic potential of ±5 kT/e.

was used as a negative control and showed no evidence of binding to cFn. All measurements were performed in triplicate. MST traces and binding data were analyzed using the MO.Control and MO.Affinity software.

| CshA/B_NR2 sequence conservation analysis
For sequence conservation analysis, homologues of CshA_NR2 were identified using PSI-BLAST 25 searching against nonredundant protein sequences in the organism category of Streptococcaceae (taxid:1300) in NCBI. 26 The search was limited to the generation of 500 matches with an E-value cut-off of 0.001. Searches applied a word size of 3, a BLO-

2.6 | CshA/B_NR2 sequence homologue acquisition, alignment, and phylogenetic analysis
For phylogenetic analyses, sequences of CshA_NR2 homologues, including those of distant relatives, were identified by performing 10 iterative PSI-BLAST searches executed with no restriction on the number of sequences returned and without taxon exclusions, applying the same parameters as outlined above. This approach returned 2056 sequence matches. Hits were aligned using MAFFT 7.471 and constructed into a phylogenetic tree using rapidnj 28 with default parameters, and trees were subsequently visualized in CLCGW12. Sequences from resultant clusters were manually picked (200 hits total) for full architecture annotation using InterPro. 29 Batch Entrez 30 was used to retrieve sequences of full-length CshA homologues (1804 sequences returned, 252 rejected) with low-to-moderate sequence identity to the polypeptide. Candidates were triaged using sequence alignments generated using MUSCLE 3.8.31, 31 as implemented in the EMBL-EBI API 32 with default parameters. Clusters found to contain polypeptides housing adhesin associated domains were color coded using CLCGW12 in a circular cladogram and annotated using GIMP. 33 Following annotation with InterPro, a subset of sequences was selected and visualized using BioRender 34 in an effort to establish the diversity of domain architectures.
To identify structural homologues of CshA/B_NR2, searches were conducted using the DALI web server 35 employing the monomeric forms of CshA_NR2 and CshB_NR2 as search models. Known adhesive domains with a DALI Z score of >6.0 were selected for comparative structural analysis. MUSCLE was used as before to produce a sequence identity matrix.
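As an illustration of what a sequence identity matrix computed from a multiple alignment contains, the sketch below derives pairwise percent identity from pre-aligned sequences. It is not the MUSCLE or EMBL-EBI implementation; the function names are hypothetical, and the choice to count only columns where both sequences carry a residue is an assumption (other tools normalize differently).

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    # Drop any column in which either sequence has a gap
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def identity_matrix(aligned):
    """aligned: dict mapping a sequence name to its gapped, aligned sequence."""
    names = list(aligned)
    return {
        p: {q: percent_identity(aligned[p], aligned[q]) for q in names}
        for p in names
    }
```

Because gapped columns are excluded from the denominator, two domains that differ mainly by insertions can still score highly, which is worth bearing in mind when comparing identities across folds with extra β-strands.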

| CshA_NR3 domain boundary reannotation
To unambiguously establish the C-terminal domain boundary of the previously assigned CshA_NR3 region, archetype CDD 36 consensus sequences of NR3 domains and their associated downstream regions of sequence in the 2056 PSI-BLAST library were aligned against the sequence of full-length CshA using EMBOSS Needle. 32 Alignments with a Needle score > 100.0 were used to predict the annex sites of NR3. Based on these analyses the two domains constituting the NR3 region of CshA and its relatives were subsequently termed the 'GEVED' domain, and CshA repeat region 1 (RR1).

| Bioinformatic characterization of CshA/ B_NR2 containing polypeptides
The 2056 homologues of CshA_NR2 identified using an iterative PSI-BLAST search were aligned using MUSCLE and an NR2 hidden Markov model (HMM) was created using the HMMER tool (hmmer.org,

| Structure predictions for NR2-like domains
Structure predictions for NR2-like domains were performed using AlphaFold2 employing the ColabFold server. 42 Selected sequences identified from the PSI-BLAST pool of NR2-like homologues were input into the server with default settings applied. These corresponded to: no template mode enabled, msa_mode: MMseqs2 (UniRef+Environmental), pair_mode: unpaired+paired, model_type: auto, num_recycles: 3. The server generated five .pdb models for each input sequence, with the highest-ranking model selected for further analysis.

| Elucidation of the Fn-binding pocket of CshA_NR2
In previous studies, we identified CshA_NR2 (residues 223-540 of the full-length polypeptide) as a high-affinity Fn-binding domain. 8 Figure S1). In addition, analysis by size exclusion chromatography demonstrated that each of the four mutant proteins exhibited hydrodynamic properties equivalent to those of the WT polypeptide (Figure S1).
For MST studies, cFn was mixed with increasing concentrations of monomeric WT CshA_NR2 or each of the four mutants, to produce binding curves from which the equilibrium dissociation constant of each protein for cFn could be determined. Our MST binding studies revealed that in contrast to WT CshA_NR2, which was found to bind to cFn with a KD value of 302 ± 44 nM, we observed no evidence of cFn binding by our CshA_NR2 mutants (Figure 2). Together, these data demonstrate that disruption of the highly charged surface cleft of CshA_NR2 through mutation abrogates cFn binding, thus establishing this region of the polypeptide as the Fn-binding site of CshA_NR2. It is notable that the KD value for cFn binding by CshA_NR2 is lower than that previously determined using Biolayer Interferometry (BLI; 500 nM), 8 a disparity that may reflect the increased sensitivity of MST as compared to BLI. 43
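To make concrete how an equilibrium dissociation constant is extracted from a titration of this kind, the sketch below fits a standard 1:1 mass-action binding model to fraction-bound data. It is an illustration only: whether this matches the exact model and settings applied in MO.Affinity is an assumption, the function names are hypothetical, and the fixed cFn concentration used in any example is a placeholder because the source does not state it. A coarse log-spaced grid search stands in for a proper nonlinear least-squares routine.

```python
import math

def fraction_bound(titrant_nM, fixed_nM, kd_nM):
    """Fraction of the fixed partner bound in a 1:1 mass-action model."""
    s = titrant_nM + fixed_nM + kd_nM
    return (s - math.sqrt(s * s - 4.0 * titrant_nM * fixed_nM)) / (2.0 * fixed_nM)

def fit_kd(titrant_nM, observed, fixed_nM, lo=1e-2, hi=1e6, steps=2000):
    """Return the KD (nM) on a log-spaced grid that minimizes squared error.

    titrant_nM: concentrations of the titrated protein (here CshA_NR2).
    observed:   normalized fraction-bound values, one per concentration.
    fixed_nM:   concentration of the fixed partner (here cFn; placeholder value).
    """
    best_kd, best_sse = None, float("inf")
    for i in range(steps + 1):
        kd = lo * (hi / lo) ** (i / steps)  # log-spaced candidate KD
        sse = sum(
            (fraction_bound(c, fixed_nM, kd) - y) ** 2
            for c, y in zip(titrant_nM, observed)
        )
        if sse < best_sse:
            best_kd, best_sse = kd, sse
    return best_kd
```

A mutant that shows no binding produces a flat response, for which no meaningful KD can be fitted; in that case the grid search would simply return a value at or near the upper bound.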

| X-ray crystal structure of CshB_NR2
To further advance our understanding of Csh adhesins, we next sought to elucidate the X-ray crystal structure of the NR2 domain of the CshA paralogue CshB (residues 211-528 of the full-length polypeptide). Although the two adhesins share 70% sequence identity, and CshB has been previously shown to contribute to Fn binding in S. gordonii, 10  is decorated with six α-helices, which are located between strands β3/ β4 (α1 and α2), β9/β10 (α3, α4, and α5) and after β14 (α6; Figure 3A). In contrast to previous crystallographic studies of CshA_NR2, electron density was observed corresponding to the CshB_NR2 polypeptide chain beyond the final β-strand of the protein (β14). This density corresponds to an extended loop punctuated by a short α-helix (α6), which traverses the connecting loops between β6/β7, β8/β9, β10/β11, and β12/β13, and occupies a groove formed by the packing of α1 against the larger of the two β-sheets.

| Reduced Fn-binding affinity in CshB_NR2 correlates with a loss of electrostatic charge in the ligand-binding site
Having confirmed the identity of the Fn-binding pocket of CshA_NR2, we next sought to investigate the Fn-binding properties of CshB_NR2. Inspection of the CshB_NR2 structure reveals the presence of an equivalent binding pocket, however, the amino acid composition of this region differs from that seen in CshA_NR2 (Figure 3C). The CshB_NR2-binding pocket is appreciably less negatively charged, predominantly as a consequence of the substitution of residues Asp300, Asp337, and Asp338 in CshA_NR2 with tyrosine (Tyr288), methionine (Met326) and asparagine (Asn327), which occupy equivalent positions in CshB_NR2. To investigate the impact

TABLE 1 Summary of X-ray data collection and refinement statistics.

Sequences were aligned and conserved residues (>95% conservation) mapped onto the CshA_NR2 structure (Figure 5). This analysis identified several regions of high sequence conservation in CshA_NR2, CshB_NR2 and their relatives. These include the hydrophobic interface between the two central β-sheets and the connecting regions between β2/β3, β3/β4, β6/β7, β7/β8, and β9/β10. Aligned sequences of the identified homologues were visualized in CLC Genomics Workbench 12 and revealed frequent amino acid insertions in strands β1 and β4, and in the regions that correspond to the interface between β3/β4 and β9/β10 (Figure S2). Notably, significant sequence variability was observed in the β11/β12 loop, which caps the Fn-binding pocket, and in the Fn-binding pocket itself, implying that these functionally important regions are highly mutable.

recalcitrant to identification using sequence-based methods, as distant homologues may be sufficiently divergent in their sequences to preclude detection. 47 In such instances, more distantly related polypeptides may only be identifiable using searches based on structural rather than sequence identity. 48 To address this, we performed DALI searches using the structures of monomeric CshA_NR2 and

FIGURE 5 Residue conservation in CshA/B_NR2 homologues. Identified residues have been mapped onto the structure of monomeric CshA_NR2. Residues with a percentage conservation of ≥95% across all 500 sequences analyzed are colored dark blue.
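The per-residue conservation threshold used for this mapping can be illustrated with a short sketch that scans the columns of a multiple sequence alignment. This is not the tool used in the study: the function name is hypothetical, and counting gaps against conservation (the denominator is the total number of sequences) is an assumption.

```python
from collections import Counter

def conserved_columns(aligned_seqs, threshold=0.95, gap="-"):
    """Return 0-based alignment columns whose most common residue reaches the threshold.

    aligned_seqs: list of equal-length, gapped sequences from one alignment.
    The fraction is taken over all sequences, so gaps lower conservation.
    """
    n = len(aligned_seqs)
    hits = []
    # zip(*aligned_seqs) yields one tuple of residues per alignment column
    for i, column in enumerate(zip(*aligned_seqs)):
        counts = Counter(res for res in column if res != gap)
        if counts and counts.most_common(1)[0][1] / n >= threshold:
            hits.append(i)
    return hits
```

Column indices returned here refer to the alignment, so they would still need to be translated to residue numbers of the reference sequence before being mapped onto a structure.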

FIGURE 6 Topology diagrams of CshA/B_NR2 structural homologues. Arrows denote β-strands, rectangles denote α-helices and lines represent loops. The conserved β-sandwich core of each fold is colored dark blue. Structural elements colored light blue (with dotted outlines) are those that are absent from the fold relative to CshA/B_NR2, those colored red are present in the fold relative to CshA/B_NR2. β-strands are numbered relative to the CshA_NR2 model, with additional elements labeled with roman numerals. Structural elements that constitute capping loops of the ligand-binding pocket of each of the domains shown are highlighted in yellow boxes.

(Table S2), each of the highest-scoring structural homologues retains the same core β-sandwich fold as that observed in both CshA_NR2 and CshB_NR2. 7,49-52 Notably, however, in some instances, this fold is augmented via the acquisition of up to three additional β-strands (Figures 6 and 7A). All identified structural homologues of CshA/B_NR2 present a sizeable negatively charged pocket on their surface, analogous to the Fn-binding site of CshA/B_NR2 (Figure 7B). In addition, in each instance access to this site is regulated by a flexible capping loop. Significant variation is seen in the size and amino acid composition of both the binding pocket and capping loop in different NR2 structural homologues, with the former being largely dictated by the region preceding β3, and that which connects β11-β12 (relative to our CshA/B_NR2 structures; Figure 6). The observed structural differences between these domains appear to enable the acquisition or loss of partner binding specificity as a consequence of mutation within the gated binding pocket and/or extension of the core β-sandwich fold via atrophy and elaboration of the central β-sheets. 53 The high structural identity of CshA_NR2, CshB_NR2, and the N1 domain of Sgo0707 to adhesive domains within the recently defined PAD family of bacterial polypeptides 6 indicates that these three polypeptides should be classified as members of this group.

| Domain shuffling drives adhesin diversification
As our adhesin library represents proteins found in a variety of both Gram-positive and Gram-negative species (Figure 9D), we next

| DISCUSSION
In previous studies, we investigated the molecular basis of

We have elucidated the X-ray crystal structure of a strand-

In studies focused on the identification of structural homologues of CshA/B, we have identified a number of structurally related adhesive domains from other well-characterized adhesins. Notably, these structural homologues exhibit low sequence identity (13%-15%) to both CshA_NR2 and CshB_NR2, and include members of the recently defined PAD family. 6 Each of the identified structural homologues resides in an adhesin known to bind an ECM substrate (Table 3)

Our HMM studies indicate that domain shuffling acts as a primary driver of adhesin assembly. This process affords a mechanism by which bacteria can generate novel adhesins by integrating distinct domains, tailored to bind to specific targets, with appropriate partner