Expansin Engineering Database: A navigation and classification tool for expansins and homologues

Expansins have the remarkable ability to loosen plant cell walls and cellulose material without showing catalytic activity and therefore have potential applications in biomass degradation. To support the study of sequence‐structure‐function relationships and the search for novel expansins, the Expansin Engineering Database (ExED, https://exed.biocatnet.de) collected sequence and structure data on expansins from Bacteria, Fungi, and Viridiplantae, and expansin‐like homologues such as carbohydrate binding modules, glycoside hydrolases, loosenins, swollenins, cerato‐platanins, and EXPNs. Based on global sequence alignment and protein sequence network analysis, the sequences are highly diverse. However, many similarities were found between the expansin domains. Newly created profile hidden Markov models of the two expansin domains enable standard numbering schemes, comprehensive conservation analyses, and genome annotation. Conserved key amino acids in the expansin domains were identified, a refined classification of expansins and carbohydrate binding modules was proposed, and new sequence motifs facilitate the search for novel candidate genes and the engineering of expansins.


Introduction
Expansins are plant cell wall loosening proteins without apparent catalytic activity, which have been identified in a broad range of organisms [1][2][3][4]. The loosening mechanism is still elusive, but it has been suggested that the non-covalent interactions between cellulose microfibrils are weakened and the microfibrils are moved against each other, so that the tight cellulosic structure is loosened 1. The interactions between expansins and the plant cell wall, which consists of lignin, hemicellulose, and cellulose, require further investigation 5. Expansins were first discovered in plants and were described as proteins mediating pH-dependent extension and stress relaxation of cell walls 6. Based on phylogenetic analysis, it has been proposed that expansins in Bacteria and Fungi resulted from multiple horizontal gene transfers from plants to microbes 7, but it is also possible that the microbial expansin subfamily evolved first in ancient marine microorganisms and then diversified into distinct terrestrial plant subfamilies 8.
Expansins consist of two tightly packed protein domains, connected by a short linker and preceded by a signal peptide 9 (Figure 1). Both expansin domains need to be connected for effective wall extension activity and weakening of filter paper 10,11. The C-terminal domain of EXLX1 (expansin-like X) from Bacillus subtilis dominates the binding to cellulose and to matrix polysaccharides of cell walls through electrostatic or polar interaction 10. The Zea mays β-expansin (Zm EXPB1) primarily binds glucuronoarabinoxylan, the major matrix polysaccharide in grass cell walls, and loosens it 12.
dependent" e-values), a minimal hit length of 60 amino acids, and a maximal ratio of bias over domain-based score of 10%.

Sequence hierarchy in the ExED
The initial twenty-five seed sequences comprise six bacterial, one fungal, and seventeen plant expansins, as well as one expansin-like swollenin sequence. The BLAST hits for each of these seed sequences were assigned to a corresponding superfamily named 'Bacterial expansins', 'Fungal expansins', 'Plant expansins', and 'N-terminal domains'. Hence, the division of the identified protein sequences into the different superfamilies was based on sequence identity, and not on phylogenetic relationships. Herein, the term family refers to a group of sequences sharing a certain degree of similarity, i.e. a cluster of similar sequences rather than a clade in a phylogenetic tree. Homologous families were created by a cutoff of 60% pairwise sequence identity as determined by the Needleman-Wunsch algorithm implemented in the EMBOSS software suite (version 6.6.0), with gap opening and extension penalties of 10 and 0.5, respectively 32,33. All sequence entries which shared at least 98% global sequence identity were assigned to a single protein entry. For each sequence entry, the respective superfamily, homologous family, and protein entry were annotated together with the identifiers of the original source database.
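The two identity cutoffs described above can be illustrated with a minimal sketch. It computes the percent identity of an already aligned sequence pair (identical columns divided by the alignment length) and groups sequences whose pairwise identity reaches a cutoff. The function names, the toy identity values, and the single-linkage grouping via union-find are assumptions made for illustration; beyond the cutoffs, the grouping procedure used for the ExED is not specified in this section.

```python
from itertools import combinations

def percent_identity(aln_a: str, aln_b: str) -> float:
    """Identical columns divided by alignment length; gap columns count as mismatches."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for a, b in zip(aln_a, aln_b) if a == b and a != "-")
    return 100.0 * matches / len(aln_a)

def group_by_identity(ids, identity, cutoff):
    """Single-linkage grouping (an assumed linkage rule): join ids with identity >= cutoff."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if identity.get((a, b), 0.0) >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# Toy pairwise identities in percent (hypothetical values, not taken from the ExED).
identity = {("s1", "s2"): 99.0, ("s1", "s3"): 72.0, ("s2", "s3"): 71.0}
ids = ["s1", "s2", "s3", "s4"]

protein_entries = group_by_identity(ids, identity, 98.0)  # 98% cutoff
families = group_by_identity(ids, identity, 60.0)         # 60% cutoff
```

With these toy values, s1 and s2 collapse into one protein entry, while s1, s2, and s3 share a homologous family and s4 remains on its own.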

Profile HMMs
A profile hidden Markov model (HMM) 31 was derived for each expansin domain from a multiple sequence alignment built from twenty-eight representative protein sequences, including twenty-two of the twenty-five seed sequences mentioned above, two fungal sequences, and four sequences with known structure (Table S2). To determine the region of the two domains in a multiple sequence alignment, four crystal structures of expansins were superimposed (PDB entries 1n10, chain A; 2hcz, chain X; 4fer, chain B; and 4jjo, chain A). The structure-based multiple sequence alignment (Figure S1) was generated by the Clustal Omega package 34 (version 1.2.1-1) and STAMP 35 (version 4.4), and visualized by PyMOL 36 (version 4.60, Schrödinger, New York, NY, USA). Based on the structural alignment and on annotations of secondary structures in Pfam 37 (entries PF03330.17 for the N-terminal domain and PF01357.20 for the C-terminal domain), the respective domains were manually retrieved. The individual profile HMMs for the N- and C-terminal expansin domains were built by HMMER from the multiple sequence alignments. The input multiple sequence alignments were aligned against the derived output profile HMMs with the hmmalign command from HMMER in order to determine whether there are shifts between the input and output alignments. Shifted alignment columns were refined manually with respect to the positions of known secondary structure elements. The refined profile HMMs of the N- and C-terminal expansin domain comprise 95 and 75 positions, respectively (Figures S2 and S3), and are available together with their underlying alignments at https://doi.org/10.18419/darus-623.

Standard numbering schemes
For the N- and C-terminal expansin domains, standard numbering schemes were introduced to annotate equivalent positions 38. The B. subtilis expansin, Bs EXLX1 (PDB entry 4fer), was used as the reference sequence for the assignment of standard position numbers to the sequence entries in the ExED upon alignment against the respective profile HMM and subsequent transfer of position numbers. The standard positions range from 11 to 105 for the N-terminal expansin domain and from 114 to 186 for the C-terminal expansin domain. Insertions with respect to the reference sequence, such as loops, were specified by subsequent decimals. Thus, all position numbers mentioned herein are based on the reference Bs EXLX1, unless otherwise stated. Due to insertions in the reference sequence of Bs EXLX1, some regions in the underlying multiple sequence alignments of the standard numbering schemes appeared inaccurate, i.e. these regions could not be aligned properly. In the N-terminal expansin domain, inaccurate positions are from 14.1 to 17, 39.1 to 47, and 104.1 to 105; in the C-terminal expansin domain, inaccurate positions are from 162.1 to 164 and 185.1 to 186.
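The transfer of position numbers, including decimals for insertions, can be sketched as follows. The function takes a pairwise alignment of the reference and a target sequence (as it would result from aligning both against the same profile HMM) and labels each target residue. The function name and the toy alignment are hypothetical, and the sketch does not handle insertions that precede the first reference residue.

```python
def transfer_numbering(ref_aln: str, tgt_aln: str, first_ref_number: int):
    """Label target residues with reference-based position numbers.

    Columns where the reference has a gap are insertions in the target and
    receive decimal labels (e.g. 14.1, 14.2) after the preceding reference position.
    """
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    ref_number = first_ref_number - 1
    insert_count = 0
    for ref_res, tgt_res in zip(ref_aln, tgt_aln):
        if ref_res != "-":
            ref_number += 1
            insert_count = 0
            if tgt_res != "-":
                labels.append((tgt_res, str(ref_number)))
        elif tgt_res != "-":
            insert_count += 1
            labels.append((tgt_res, f"{ref_number}.{insert_count}"))
    return labels

# Toy alignment (hypothetical sequences); the reference fragment starts at position 12.
numbered = transfer_numbering("TFY-GA", "TWYGG-", first_ref_number=12)
```

In this toy case the target residue aligned to the reference gap is labelled 14.1, and the target gap opposite reference position 16 yields no label.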

Conservation analyses
The two standard numbering schemes were used to analyze the amino acid frequencies for the two expansin domains. The domains were annotated by using hmmscan against all sequence entries of the ExED and deploying the match criteria mentioned above. Each annotated domain position was analyzed for conserved amino acids. Groups of amino acids with similar biochemical properties, such as charge or polarity, were also taken into account 39,40. Conservation analyses were performed separately for each superfamily of the ExED, and additionally for EXPA, EXPB, EXLA (expansin-like A), and EXLB (expansin-like B) (Tables S3 and S4). An amino acid or amino acid group at a given position was defined as conserved if it occurred in at least 70% of all annotated sequence entries. Conserved positions were compared with the positions in the structures of two bacterial expansins (PDB entries 4fer, chain B and 4jjo, chain A) and two plant expansins (PDB entries 1n10, chain A and 2hcz, chain X) to predict their functional relevance.
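The 70% criterion can be illustrated with a minimal sketch that reports, for each standard position, either the single conserved amino acid or, failing that, a conserved amino acid group. The non-polar and polar groups follow the grouping given in the table caption of this article; the function name, the toy columns, and the choice to count gaps in the denominator are assumptions.

```python
from collections import Counter

NONPOLAR = set("ACFGILMPVW")
POLAR = set("DEHKNQRSTY")

def conserved_positions(columns, threshold=0.70):
    """Map each position to its conserved residue or residue group, if any.

    columns: dict of position label -> list of residues observed at that position
    ('-' marks a sequence without a residue there; it counts in the denominator).
    """
    result = {}
    for position, residues in columns.items():
        total = len(residues)
        counts = Counter(r for r in residues if r != "-")
        if total == 0 or not counts:
            continue
        residue, n = counts.most_common(1)[0]
        if n / total >= threshold:
            result[position] = residue  # a single amino acid takes precedence
        elif sum(counts[r] for r in NONPOLAR) / total >= threshold:
            result[position] = "nonpolar"
        elif sum(counts[r] for r in POLAR) / total >= threshold:
            result[position] = "polar"
    return result

# Toy columns with ten sequences each (hypothetical data).
columns = {
    "12": list("TTTTTTTTSA"),  # T in 8 of 10 sequences
    "59": list("STNQSTDEKA"),  # 9 of 10 polar, no single residue dominates
    "60": list("GGGAAVVLST"),  # 8 of 10 non-polar
    "75": list("EEEGGGKKAA"),  # neither a residue nor a group reaches 70%
}
conserved = conserved_positions(columns)
```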

Co-evolution of expansin domains
For comparison of the co-occurrence of the two expansin domains, all sequence entries from the ExED were aligned against the two profile HMMs for the expansin domains. Profile-to-sequence alignments were performed with the hmmsearch command from the HMMER software suite with the --max option to collect all domain-based scores for each possible alignment. The lists of domain-based scores were sorted by sequence identifiers to ensure comparability, and in case of multiple hits, only the maximal bit score was kept. The bivariate histogram was visualized as a heat map for bit scores greater than zero in MATLAB (version R2019a, The MathWorks, Natick, MA, USA).
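The bookkeeping behind the bivariate histogram can be sketched in a few lines: keep the maximal bit score per sequence for each domain, pair the scores by sequence identifier, and count pairs per two-dimensional bin. The function names, the bin width, and the toy hits are assumptions; the original analysis was performed in MATLAB.

```python
from collections import Counter

def best_scores(hits):
    """Keep only the maximal bit score per sequence identifier."""
    best = {}
    for seq_id, score in hits:
        if seq_id not in best or score > best[seq_id]:
            best[seq_id] = score
    return best

def bivariate_histogram(n_best, c_best, bin_width=10.0):
    """Count sequences per (N-terminal bin, C-terminal bin) for positive bit scores."""
    counts = Counter()
    for seq_id in sorted(set(n_best) & set(c_best)):
        n_score, c_score = n_best[seq_id], c_best[seq_id]
        if n_score > 0 and c_score > 0:
            counts[(int(n_score // bin_width), int(c_score // bin_width))] += 1
    return counts

# Toy domain hits as (sequence identifier, bit score); hypothetical values.
n_hits = [("a", 60.0), ("a", 12.5), ("b", 85.0), ("c", -3.0)]
c_hits = [("a", 19.0), ("b", 44.0), ("b", 51.0), ("c", 20.0)]

n_best = best_scores(n_hits)
c_best = best_scores(c_hits)
histogram = bivariate_histogram(n_best, c_best)
```

Sequence c is dropped here because its N-terminal score is not greater than zero.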

Sequence length distributions
To compare the lengths of the sequence entries in the ExED, histograms and boxplots were created with MATLAB (version R2019a, The MathWorks, Natick, MA, USA; Statistics and Machine Learning Toolbox version 11.5) to visualize frequency distributions and to identify possibly fragmented or artificial sequences. The whisker length in a boxplot was chosen as 1.5 times the interquartile range.
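The whisker rule can be written out as a short sketch: values more than 1.5 times the interquartile range below the first or above the third quartile fall outside the whiskers. The function name and the toy lengths are hypothetical, and the quartile definition of Python's statistics module (method "inclusive") is an assumption that may differ slightly from the one MATLAB applies.

```python
import statistics

def outside_whiskers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in sorted(values) if v < lower or v > upper]

# Toy sequence lengths in amino acids (hypothetical values).
lengths = [150, 250, 255, 260, 262, 265, 270, 600]
outliers = outside_whiskers(lengths)
```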

Protein sequence networks
Protein sequence networks visualize large sequence datasets as nodes in an undirected graph with edge weights to derive relationships between different clusters or communities. The protein sequences in the ExED were sorted by decreasing sequence length and were subsequently clustered using the USEARCH algorithm (UCLUST) with a threshold of 90% sequence identity (without terminal gaps) to determine a reduced set of centroid sequences (representative sequences) 30. For each centroid sequence, the N- and the C-terminal expansin domains were annotated by the two profile HMMs with the filter criteria mentioned above. Pairwise sequence identities between two sequences were derived from global Needleman-Wunsch alignments as described above and used as edge weights. Protein sequence networks were generated with edge weights of pairwise sequence identity, filtered by a pre-defined threshold. Metadata of the nodes (e.g. the sequence ID) and of the edges (i.e. the edge weights) were summarized in GraphML files by applying the NetworkX library in Python (version 1.9) for an automated assignment of node and edge attributes 41. The GraphML files are available at https://doi.org/10.18419/darus-624. Protein sequence networks were visualized with Cytoscape version 3.7.2 42 using a prefuse, force-directed layout with respect to the edge weights.
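How thresholded identities become a GraphML file can be sketched as follows. The study reports using NetworkX for this step; the sketch below instead uses the XML writer from the Python standard library as a dependency-free stand-in, so the function name, the key identifier d0, and the toy identities are assumptions rather than the original implementation.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def build_graphml(nodes, identities, threshold):
    """Serialize an undirected graph whose edges have identity >= threshold."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the edge attribute that carries the pairwise sequence identity.
    ET.SubElement(root, "key", {"id": "d0", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for (source, target), identity in sorted(identities.items()):
        if identity >= threshold:
            edge = ET.SubElement(graph, "edge", source=source, target=target)
            ET.SubElement(edge, "data", key="d0").text = str(identity)
    return ET.tostring(root, encoding="unicode")

# Toy centroid identifiers and pairwise identities in percent (hypothetical values).
nodes = ["a", "b", "c"]
identities = {("a", "b"): 62.0, ("a", "c"): 31.5, ("b", "c"): 48.0}

strict = build_graphml(nodes, identities, threshold=50.0)
relaxed = build_graphml(nodes, identities, threshold=30.0)
```

Lowering the threshold from 50% to 30% retains all three toy edges instead of one, which mirrors how the edge threshold controls the density of the networks.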
For the networks showing the relationships between CBM63s and expansin homologues, and between GH45s and the N-terminal expansin domain homologues, CD-HIT (version 4.7) was used with a clustering threshold of 90% and a word size of 5 (instead of UCLUST) 43,44. The GH45 sequences were downloaded from the protein family database (Pfam, version 32.0, accession PF02015) 45, whereas the CBM63 sequences were downloaded from the carbohydrate-active enzymes (CAZy) database on June 3, 2019 46. In the CAZy database, 633 individual CBM63 sequences were deposited, but only 582 NCBI accessions were available at the time of writing, as some of the records were moved or entries were merged. Members of CBM63 were annotated by the profile HMMs for the two expansin domains (https://doi.org/10.18419/darus-625).

Identification of expansin domains in actinobacterial genomes
Five actinobacterial genomes were selected to show the application of the ExED for the identification of expansin domains. An Illumina MiSeq sequencer was used to sequence the genomes (NGS facility, University of the Western Cape, South Africa). Due to the high G+C content of actinobacterial DNA, a 10% PhiX spike was included in the run. The genomes were assembled using the A5-miseq pipeline 53.
The two newly created profile HMMs mentioned above were applied to search the five actinobacterial genomes for the occurrence of expansin domains. Nucleic acid sequences were translated using the default codon usage table available in the transeq tool from the EMBOSS software suite (version 6.6.0 54). Translated amino acid sequences with fewer than 60 consecutive amino acid symbols were discarded to reduce computation time.
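The translation and length filter can be illustrated with a minimal sketch that uses the standard genetic code, splits the translation at stop codons, and keeps only segments of at least 60 residues. The study used transeq from EMBOSS; the function names and the frame parameter below are assumptions, and which reading frames were translated is not stated here.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG codon ordering.
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def long_segments(protein: str, min_length: int = 60):
    """Split at stop codons and keep segments with at least min_length residues."""
    return [segment for segment in protein.split("*") if len(segment) >= min_length]

# Toy contig: 60 alanine codons, a stop codon, then 59 cysteine codons.
contig = "GCC" * 60 + "TAA" + "TGC" * 59
kept = long_segments(translate(contig))
```

Only the 60-residue alanine segment is kept; the 59-residue segment falls just below the length criterion.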
The hmmscan tool from the HMMER software suite (version 3.1b2, http://www.hmmer.org, Howard Hughes Medical Institute, Chevy Chase, MD, USA) was used to scan the translated amino acid sequences with profile HMMs. The hits from hmmscan were filtered by a minimal domain-based score of 35 and a minimal coverage of 75% (defined as the ratio of hit length without insertions divided by the length of the profile HMM).
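The two filter criteria can be expressed as a short sketch. The profile lengths of 95 and 75 positions are those reported for the two profile HMMs; the class, its field names, and the toy hits are hypothetical, and the step that parses hmmscan output into such records is not shown.

```python
from dataclasses import dataclass

# Number of positions per profile HMM, as reported for the two expansin domains.
HMM_LENGTH = {"N-terminal": 95, "C-terminal": 75}

@dataclass
class DomainHit:
    sequence_id: str
    domain: str
    score: float        # domain-based bit score
    match_columns: int  # hit length without insertions

    def coverage(self) -> float:
        """Hit length without insertions divided by the profile HMM length."""
        return self.match_columns / HMM_LENGTH[self.domain]

def passes_filter(hit: DomainHit, min_score: float = 35.0, min_coverage: float = 0.75) -> bool:
    return hit.score >= min_score and hit.coverage() >= min_coverage

# Toy hits (hypothetical values).
hits = [
    DomainHit("contig1", "N-terminal", 60.0, 93),  # passes both criteria
    DomainHit("contig1", "C-terminal", 28.0, 70),  # score below 35
    DomainHit("contig2", "N-terminal", 41.0, 50),  # coverage below 75%
]
kept_hits = [hit for hit in hits if passes_filter(hit)]
```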
The matches for the profile HMMs of expansin domains were extended to find the adjacent start methionine and stop codon along the contig sequence of each match. The first or last available amino acid position in a contig was used to extend the hits, in case of a missing start or stop codon, respectively. The extended hit sequences are available for download under https://doi.org/10.18419/darus-699.

The Expansin Engineering Database (ExED)
The current version of the ExED contains 15,089 sequence entries, 12,400 protein entries, and twenty-one protein structures (Tables 1 and S5), which, based on global sequence similarity, were assigned to four superfamilies (comprising 12,404 sequence entries, 9,954 protein entries, and seventeen structures). Three superfamilies include expansin homologues with two domains and were named according to their dominant source organisms: superfamily 1 'Bacterial expansins' (1,172 sequences, ten structures), superfamily 2 'Fungal expansins' (543 sequences, no structure), and superfamily 3 'Plant expansins' (8,269 sequences, six structures). The members of superfamily 4 'N-terminal domains' consist of the N-terminal expansin domain only (2,420 sequences, one structure). This superfamily comprises eukaryotic and bacterial sequences, e.g. from Magnoliophyta (A, B, and C), Actinobacteria, Oomycetes, and Basidiomycota. The remaining 2,685 sequences (corresponding to 2,446 protein entries) and four structures could not be assigned to the four superfamilies and were thus collected in an unclassified fifth superfamily, which was omitted from further investigations.
The sequence lengths in the superfamilies 'Bacterial expansins', 'Fungal expansins', and 'Plant expansins' vary between 40 and 1400 amino acids with a sharp peak between 250 and 270 amino acids and two minor peaks at 150 and at 600 amino acids (Figure S4). The sequence length distributions differ for each of the four superfamilies (Figure S5). For further analysis of whole expansin sequences and comparison with expansin-like proteins, only sequences with a length between 210 and 300 amino acids were considered (7,706 sequences from the superfamilies 'Bacterial expansins', 'Fungal expansins', and 'Plant expansins') (Figure S4).

Sequence space of expansin domains
Two profile HMMs for the N-terminal and the C-terminal expansin domains were derived and used for annotation of the two domains in all 12,404 classified sequences of the ExED (superfamilies 1, 2, 3, and 4), independent of their sequence lengths. For the superfamilies 'Bacterial expansins', 'Fungal expansins', and 'Plant expansins', the N- and the C-terminal expansin domains could be annotated in 9,470 out of 9,984 sequences and in 8,896 out of 9,984 sequences, respectively (Table S5). In 2,182 out of the 2,420 sequences from the superfamily 'N-terminal domains', only the N-terminal expansin domain was annotated.
Based on the annotated domains in the classified superfamilies, two protein sequence networks were generated. The sequence network of N-terminal expansin domains is dominated by three large clusters (Figure S6). The sequence network of the C-terminal expansin domain is dominated by six large clusters from 'Plant expansins', previously annotated as EXPA, EXPB, EXLB, and EXLA (clusters A-C and E-G), one cluster from 'Fungal expansins' (D, Hfam 7), and three clusters from 'Bacterial expansins' (H-J, Hfams 1, 3, 4, 6) (Figure S7). In each of the two domain-based networks, one bacterial sequence was found in a cluster from 'Plant expansins': Streptomyces acidiscabies (NCBI accession GAQ55178.1) in EXPA (Figure S6), and Soehngenia saccharolytica (NCBI accession TJX44964.1) in EXLA (Figure S7).
The N- and C-terminal expansin domains have not evolved independently, but have co-evolved, as indicated by the correlation of sequence similarities of the two domains to the respective profile HMM (Figure 3). The shift with respect to the diagonal indicates a higher conservation of the N-terminal expansin domain than of the C-terminal expansin domain.
According to our conservation analysis, five of the previously proposed six cysteines 14 were highly conserved in the superfamily 'Plant expansins', three conserved cysteines were found in the superfamily 'Fungal expansins', and none in the superfamily 'Bacterial expansins' (Table 2 and https://doi.org/10.18419/darus-735). The conserved cysteines at standard positions C23 and C52, C55 and a cysteine upstream of the N-terminal expansin domain standard numbering, and C60 and 61.8 were proposed to form disulfide bonds 14.
For further comparison, 582 protein sequences of the carbohydrate-binding module family 63 (CBM63) with a sequence length between 57 and 746 amino acids were downloaded from the CAZy database 46. Interestingly, 511 of these sequences contained both expansin domains and were therefore already annotated in the ExED in the superfamilies 'Bacterial expansins' and 'Fungal expansins'. Four CBM63 sequences contained only the C-terminal expansin domain, whereas 58 CBM63 sequences contained only the N-terminal expansin domain and shared a sequence identity of over 60% with N-terminal expansin domains of the superfamily 'Bacterial expansins' (Figure S6). A protein sequence network including the whole CBM63 sequences and expansin sequences from the superfamilies 'Bacterial expansins', 'Fungal expansins', and 'Plant expansins' revealed the similarity of CBM63 sequences to 'Bacterial expansins' and also to 'Fungal expansins' from homologous family 7 (Figure 5). The members of the superfamily 'N-terminal domains' consist of the N-terminal expansin domain only. Similarly, loosenin (NCBI ADI72050.2), EXPN from Endogone sp. FLAS-F59071 (NCBI accession RUS20349.1), the expansin-like protein found in the nematode Heterodera glycines (NCBI ADL29728.1), and cerato-platanin from Ceratocystis platani (NCBI accession CAC84090.2) consist only of the N-terminal expansin domain (Table S7). At a threshold of 60% sequence identity, the N-terminal domains of loosenin and Basidiomycota cluster with fungal sequences from Hfam 7 and plant sequences from Hfam 11 (Figure S6). In contrast, swollenin was found to possess only a distantly related C-terminal expansin domain (Table S7).

Annotation of expansin domains in actinobacterial genomes
As a case study for the application of the ExED in genome sequence annotation, actinobacterial genomes from various South African habitats were analyzed for the presence of expansin domains and conserved amino acid positions, using the profile HMMs of the expansin domains (Tables S8 and S9). In general, the sequence regions identified for the N-terminal expansin domains emerged with higher HMMER scores, whereas the C-terminal domains seemed less conserved (compare with Figure 3). Despite the lower scores for the C-terminal expansin domain, the coverage for the underlying profile HMM was still high (90%). One genome hit was identified in sediment samples collected at Gamka River in the Swartberg Mountain Range. It was identical to an expansin homologue from Streptomyces swartbergensis (NCBI accession WP_086602418), which matched the profile HMM of the N-terminal expansin domain well (score: 60, 98% coverage) and the profile HMM of the C-terminal expansin domain moderately (score: 19, 89% coverage). The sequence from S. swartbergensis contains amino acids that are conserved in the superfamily 'Bacterial expansins' (threonine 12, glycine 21, alanine 36, glycine 53, tyrosine 55, proline 74, aspartate 82, leucine 83, phenylalanine 88, and glycine 97 in the N-terminal expansin domain; lysine 119, tryptophan 126, tryptophan 149, tyrosine 157, and glycine 179 in the C-terminal expansin domain) and also amino acids that are conserved in the superfamilies 'Fungal expansins' or 'Plant expansins' (tyrosine 14, cysteine 23, cysteine 52, and cysteine 73).

Discussion
Expansins typically consist of about 225 amino acids (about 26 kDa) and an N-terminal signal peptide 2, in total 250 to 275 amino acids 55, which is in agreement with the average sequence length of 262 amino acids identified in this study. Thus, sequences shorter than 210 amino acids or longer than 300 amino acids were excluded from global sequence analyses (Figure S4). However, sequences with a length of about 600 amino acids contained replications of expansin domains as fusion proteins or due to sequencing errors, leading to expansin sequences that contained each domain two or three times. Since the two expansin domains have a length between 80 and 90 amino acids, shorter protein sequences can be considered as fragments or incomplete expansin domains.
The occurrence of expansins in major taxa in the tree of life (after Fig. 1 in 8, where a comprehensive phylogenetic analysis of expansin genes across all kingdoms of life is shown) is comparable to the results obtained in this study (https://doi.org/10.18419/darus-693). For twelve out of ninety groups that were compared, the results are different, e.g. the archaeon Halomicroarcula sp. LR21 can be found in the ExED and contains one expansin homologue for which both expansin domains are annotated, whereas the previous study 8 did not find a putative expansin in Archaea. Other, apparently hitherto unknown, occurrences of putative expansins in the ExED include thirty-six sequences of Fibrobacteres, one sequence of Ignavibacteria in which both expansin domains can be found, seven sequences of Discosea, one sequence of Discoba, and one sequence of Acidobacteria, but without domain annotations. Further expansin sequences that were not included in this study but mentioned in 8 are from the taxa Verrucomicrobia, Chlorobi, Tubulinea, Glaucophyta, Haptophyta, Dinoflagellata, and Phaeophyta.
The protein sequence networks confirmed the nomenclature and classification of expansins into the three kingdoms Bacteria, Fungi, and Viridiplantae and the subclassification of plant expansins into EXPA, EXPB, EXLA, and EXLB 56 (Figure 2). Despite the differences on the global sequence level, the protein sequence networks of expansins from Bacteria and Viridiplantae share similarities on a domain-based sequence level (Figures S1, S6, and S7). The N-terminal expansin domain is more conserved than the C-terminal expansin domain (Figure 3 and https://doi.org/10.18419/darus-735). When expansin homologues from more diverse backgrounds are discovered in the future, updated profile HMMs will provide more insight into the possible co-evolution of both expansin domains.
A conservation analysis revealed and confirmed positions with an essential functional or structural role in expansin homologues. Glycine is structurally relevant, as it mediates the formation of short loops 57 and is frequently observed at the N- and C-caps of α-helices to increase helix stability 58. As observed previously for other protein families 59,60, glycine is the most conserved amino acid in both expansin domains. In expansins, all four conserved glycines are located in loop regions (Table 2, compare with Figure 1). The conservation of threonine 12 and aspartate 82 in Bacteria, Fungi, EXPA, and EXPB confirms their functional role 10. Interestingly, at standard position 75, which plays a moderate role in cell wall extension activity of Bs EXLX1 10, a glutamate is conserved in the superfamily 'Bacterial expansins', and a glycine in EXPA and EXLB. In contrast, standard position 75 is not conserved in the superfamily 'Fungal expansins', in EXPB, and in EXLA (Table 2 and https://doi.org/10.18419/darus-735). Aspartate 71, which has been proposed as important but not essential for wall extension activity of Bs EXLX1 10, is conserved in the superfamilies 'Bacterial expansins' and 'Fungal expansins', and in EXPB, EXLA, and EXLB (Table 2 and https://doi.org/10.18419/darus-735). However, three other proposed key amino acids for cell wall extension activity (threonine 14, serine 16, and tyrosine 73 10) are not conserved in expansins from Bacteria, Fungi, or Viridiplantae (Table S3 and https://doi.org/10.18419/darus-735), indicating the importance of an increased sample size for conservation analysis. The large number of expansin sequences investigated here also provided a deeper insight into the structural or functional relevance of disulfide bridges in the different superfamilies. Previously, three disulfide bridges were proposed to stabilize the tertiary structure of the N-terminal expansin domain of EXPA and EXPB 14,15. Five of the proposed six cysteines could be confirmed as highly conserved in the superfamily 'Plant expansins' (Table 2 and https://doi.org/10.18419/darus-735). The sixth cysteine is located directly before the linker to the C-terminal expansin domain and is therefore not included in our profile HMM for the N-terminal expansin domain. Against expectations, the additional highly conserved fourth cysteine pair in plant α-expansins reported in 15 was not found in our analysis (https://doi.org/10.18419/darus-735). Only three conserved cysteines were found in the superfamily 'Fungal expansins', thus not all fungal expansin homologues possess three disulfide bridges, as had been concluded from the expansin Sc Exlx1 16. None of the six cysteines was conserved in the superfamily 'Bacterial expansins' (Table 2), which is in accordance with previous observations of bacterial expansins lacking disulfide bridges 13.
Through the use of conservation analysis, previously published family-specific motifs were confirmed: in the N-terminal expansin domain, the T(F/W)YG motif was present in the two superfamilies 'Bacterial expansins' and 'Fungal expansins' (standard positions 12-14 and 14.1), and the motifs GGACG (20-24) and HFD (80-82) in the superfamily 'Plant expansins' 9,55 (Table 3). We suggest extending the GGACG motif to a GGACGYG motif and the HFD motif to an HFDL motif in plant expansins. In bacterial and fungal expansins, these two plant motifs are slightly different: in the superfamily 'Fungal expansins', the GGACGYG motif is shorter (GGxC), and in fungal and bacterial expansins the HFDL motif is replaced by HLDL. The HLD motif as well as the GGACS motif were already described for the fungal expansin Sc EXLX1 16. Newly proposed motifs in the N-terminal expansin domain are VpGP (58-61) in the superfamily 'Bacterial expansins' and GTAnS (34-38) in the superfamily 'Fungal expansins' (Tables 3 and S3), where p and n denote polar and nonpolar amino acids, respectively. In expansins from Fungi, the proline of the VpGP motif is replaced by a non-polar amino acid. The previously described CDRC motif at the amino terminus of EXLA 55 is located beyond the boundaries of our profile HMM for the N-terminal expansin domain.
The large number of expansin sequences used for analysis not only improved the identification of motifs, but also shed light on evolutionary relationships. Interestingly, when searching with the newly established profile HMMs for expansin domains within the CBM63 protein sequences from CAZy, 510 out of the 582 CBM63 protein sequences were found to contain both expansin domains (Table S10). Only four sequences had a similarity to the C-terminal expansin domain, while missing the N-terminal expansin domain, as suggested previously 1, and 58 CBM63 sequences contained only the N-terminal expansin domain.
The observation of four bacterial sequences being found in clusters of plant expansins supports the hypothesis that microbial expansins were derived via horizontal gene transfer from plants to microbes 7 (Figures 2, S6, and S7). The two bacterial sequences in clusters of the superfamily 'Plant expansins' (Figure 5) are from the plant pathogens Kutzneria sp. 744 (NCBI accession EWM10128.1) and Streptomyces acidiscabies (NCBI accession WP_050370046.1), which are both actinobacteria, as described previously 2.
With the chosen filter criteria, the sequence of the fungal swollenin does not contain any expansin domain. As the score for the C-terminal expansin domain is far below the chosen criteria, the swollenin sequence resembles a distantly related C-terminal expansin domain (Table S7), but we found no N-terminal expansin domain within the protein sequence of swollenin. This is due to the short N-terminal expansin domain in the swollenin from Trichoderma reesei and confirms the rather low sequence similarity between swollenin and expansins 47.
On a global sequence level, GH45s and N-terminal expansin domains share less than 30% pairwise sequence identity (Figure 4), and neither the profile HMM search of the N- and C-terminal expansin domains in the 542 GH45 sequences nor the profile HMM search of the GH45 profile HMM from Pfam (https://pfam.xfam.org/family/PF02015/hmm) in the 15,089 sequences of the ExED resulted in a match. In comparison to N-terminal expansin domains, GH45 sequences are longer due to several inserts and longer loop regions (179-208 amino acids as compared to 90-115 amino acids of the N-terminal expansin domains). Despite these differences, the evolutionary relationship between the two protein families is underlined by conserved amino acids. Both the conserved threonine and aspartate at standard positions 12 and 82, and the HFDL motif (standard positions 80-83) were found in the GH45 protein sequences.
This study confirms the observation that microbial expansins comprise two protein domains and are widely distributed across diverse lineages of Archaea, Bacteria, Fungi, other eukaryotic microbes 8, and Viridiplantae. Therefore, the ExED can serve as a basis for a more detailed phylogenetic analysis in order to elucidate the origin of expansins and ancient evolutionary dynamics. Furthermore, the ExED can be used to search for expansin genes in virulent fungal and bacterial plant pathogens.

Sfam | Sfam name | Hfams | Proteins | Seq

The conserved amino acids or groups of amino acids according to the standard numbering scheme for the N- or C-terminal expansin domain, with the sequence of Bacillus subtilis (PDB accession 4fer) as reference sequence. All positions that are conserved to at least 70% are listed separately for superfamilies 1 "Bacterial expansins", 2 "Fungal expansins", and 3 "Plant expansins". Positions marked in the standard numbering scheme as inaccurate are excluded (described in the Methods section). The last column names the function and the motif known from literature 9,55. If a single amino acid is conserved to at least 70%, the conservation of the respective amino acid group is not mentioned. Amino acid groups: non-polar (A, C, F, G, I, L, M, P, V, W) 40; polar (D, E, H, K, N, Q, R, S, T, Y)

Figure 1 Functionally relevant positions in the expansin domains from the representative protein structure of Bacillus subtilis expansin Bs EXLX1 (PDB entry 4fer, chain B) are labelled with standard position numbers (numbering according to 13) and shown as sticks. The substrate cellohexaose is depicted above the C-terminal expansin domain in green.

Figure 3 Bivariate histogram of co-occurring HMMER bit scores of the N- and C-terminal expansin domains. The greyscale bar represents the relative frequency of the bit scores. The black diagonal line is the bisecting line.

Figure 4 Protein sequence network showing the protein sequence space of GH45 sequences from Pfam 45 (accession PF02015) and the sequence regions annotated as N-terminal expansin domains from the superfamilies 'Bacterial expansins', 'Fungal expansins', 'Plant expansins', and 'N-terminal domains'. The colors representing the origin of the expansin sequences correspond to the scheme in Figure 2, with GH45 sequences colored in blue. The threshold for the nodes is 90% sequence identity (clustered with USEARCH) and the threshold for the edges is 30% pairwise sequence identity (determined by Needleman-Wunsch alignments). This network consists of 4,031 nodes and 2,182,810 edges.

Figure 5 Protein sequence network showing the protein sequence space of CBM63 sequences from CAZy and the protein sequences of the superfamilies 'Bacterial expansins', 'Fungal expansins', and 'Plant expansins' with a sequence length between 210 and 300 amino acids (Figure S4). In contrast to the four big clusters from 'Plant expansins' (EXPA (A), EXPB (B), EXLA (C), and EXLB (D)), where no CBM63 sequences can be found, the clusters from the superfamilies 'Bacterial expansins' and 'Fungal expansins' show many connections to sequences of CBM63. The colors representing the origin of the expansin sequences correspond to the scheme in Figure 2, with CBM63 sequences colored in cyan. The threshold for the nodes is 90% sequence identity (clustered with USEARCH) and the threshold for the edges is 50% pairwise sequence identity (determined by Needleman-Wunsch alignments). This network consists of 3,344 nodes and 844,280 edges.