Multicopper oxidases: modular structure, sequence space, and evolutionary relationships

Multicopper oxidases (MCOs) use copper ions as cofactors to oxidize a variety of substrates while reducing oxygen to water. MCOs have been identified in various taxa, with notable occurrences in fungi. The role of these fungal MCOs in lignin degradation sparked an interest due to their potential for application in biofuel production and various other industries. MCOs consist of different protein domains, which led to their classification into two‐, three‐, and six‐domain MCOs. The previously established Laccase and Multicopper Oxidase Engineering Database (https://lcced.biocatnet.de) was updated and now includes 51 058 sequences and 229 structures of MCOs. Sequences and structures of all MCOs were systematically compared. All MCOs consist of cupredoxin‐like domains. Two‐domain MCOs are formed by the N‐ and C‐terminal domain (domain N and C), while three‐domain MCOs have an additional domain (M) in between, homologous to domain C. The six‐domain MCOs consist of alternating domains N and C, each three times. Two standard numbering schemes were developed for the copper‐binding domains N and C, which facilitated the identification of conserved positions and a comparison to previously reported results from mutagenesis studies. Two sequence motifs for the copper binding sites were identified per domain. Their modularity, depending on the placement of the T1‐copper binding site, was demonstrated. Protein sequence networks showed relationships between two‐ and three‐domain MCOs, allowing for family‐specific annotation and inference of evolutionary relationships.

MCOs in nature is diverse and depends on the source organism and the environmental conditions. They have been found in many organisms such as animals, plants, insects, crustaceans, fungi, bacteria, and archaea. [1][2][3] Due to their broad substrate spectrum, stability under varied conditions (pH, temperature, presence of inhibitors, and presence of organic solvents), and ability to act in synergy with a wide range of enzymes, MCOs, especially fungal laccases and bacterial laccase-like enzymes, are of particular interest for textile, dye, pulp and paper, and lignocellulose-based biorefinery industries. 4,5 MCOs are composed of two (2dMCO), three (3dMCO), or six (6dMCO) cupredoxin-like domains, which typically consist of eight β-strands in a Greek key β-barrel fold. 6,7 Oxidation of substrates takes place at a mononuclear copper site containing a type-1 copper (T1 copper), which is called the blue copper because of its spectroscopic features. 8 Six substrate binding loops have been described for MCOs: loops Ia, Ib, and II located in the second domain of 3dMCOs, and loops III, IVa, and IVb located in the third domain. 9 For each substrate, one electron is abstracted by the T1 copper and transported via a His-Cys-His pathway to the trinuclear cluster (TNC) which is composed of two type-3, and one type-2 copper. The TNC catalyses the reduction of molecular oxygen to water. 4,10 In 3dMCOs, the T1 copper is located in the third domain, also referred to as the C-terminal domain. 11 In 2dMCOs, it is located either in the second (C-terminal) domain (type B), in the first (N-terminal) domain (type C), or in both domains (type A) ( Figure 1). In 6dMCOs, it is located in the second, fourth, and sixth domain. 12 The TNC lies at the interface between the first (N-terminal) and the third (C-terminal) domain in 3dMCOs and between the first and the sixth domain in 6dMCOs. In contrast, the TNC in 2dMCOs is located between the first (N-terminal) and the second (C-terminal) domains of different monomers in a 2dMCO homotrimer. These findings led to hypotheses about the evolution of MCOs, suggesting that 3dMCOs and 6dMCOs may have evolved from 2dMCOs. 12,13 All domains occurring in MCOs belong to the same cupredoxinlike fold, but can be separated into three slightly different categories based on their secondary structure and copper binding ability. 13 Domain N is the N-terminal domain for all 3dMCOs and 2dMCOs; Domain C is the C-terminal domain in 3dMCOs and 2dMCOs (third domain of 3dMCOs and second domain of 2dMCOs; Figure 1). The second domain of 3dMCOs is placed between domains N and C, which, for the purpose of this study, is referred to as domain M. It has a strong similarity to domain C on secondary structure level, but does not occur in 2dMCOs and lacks the ability to bind copper. 13 Since it only occurs in 3dMCOs and is not involved in electron transfer or copper binding, domain M was not investigated in this study. The previously described Laccase and Multicopper Oxidase Engineering Database (LccED) serves as a tool for the systematic analysis of sequence-structure-function relationships. 14 In 2011, it contained 2828 protein sequences belonging to 2297 proteins (defined as sequences from the same source organism with more than 98% sequence identity). They were grouped into 56 homologous families based on phylogeny, which were in turn assigned to 11 superfamilies.
Over the past few years the total number of gene sequences from genome and metagenome sequencing projects has substantially increased. Therefore, the LccED was updated, increasing the number of MCOs sequences by twentyfold. In this study, these new sets of sequences were systematically analyzed. Each of the resulting 1219 representative sequences was used as query sequence for individual blastp searches with an E-value cut-off F I G U R E 1 Modular structures of A, 3dMCOs and B, the three different groups of 2dMCOs. Each triangle represents a domain, domain N is pictured in red, domain C in light green and domain M in dark green. Arrows indicate the order of the connected domains along the sequence and each yellow dot corresponds to one copper. Single yellow dots in a loop are the T1 copper binding sites and three coppers next to each other stand for a trinuclear cluster. 2dMCOs naturally occur (and are therefore represented) as trimers of 10 −10 against the nonredundant protein sequence database at the NCBI. 16,17 For each blastp hit, the protein sequence, the source organism, and the protein description were extracted and loaded into the BioCatNet database system. 18 In contrast to the previous version of the LccED, sequences sharing more than 98% identity were assigned to one protein entry, regardless of the respective source organism. A sequence was assigned to a homologous family if it had a sequence identity above 40% to one of the query sequences in this family. Any database entry that could not be assigned unambiguously to a higher hierarchy level was deleted. Structures of MCOs were searched by blastp against the Protein DataBank, 19 using an E-value cut-off of 10 −10 , and added to the LccED. Multiple sequence alignments including annotations and phylogenetic trees were generated for each superfamily and each homologous family. All sequences and structures are available for download from the LccED website (https://lcced.biocatnet.de).

| Standard numbering scheme
Standard numbering schemes were established for N-and C-terminal domains of MCOs (domain N and C, respectively) as described previously. 20,21 Structures that were described as active MCOs in literature were selected as reference structures, covering all superfamilies and homologous families of the LccED that contain structural information on 2-or 3dMCOs (Table S1). For the domain N standard numbering scheme, 21 structures representing twelve homologous families, spread over nine superfamilies, were selected as reference structures. For the domain C standard numbering scheme, the set of reference structures was reduced from 21 to 18 structures representing nine homologous families, spread over eight superfamilies, due to the occurrence of additional long loops and mismatches between secondary structure elements that hindered a reliable superimposition.
For extracting individual protein domains from the proteins, the structures were visualized by PyMOL version 1.3 (Schrödinger, New York, New York) and secondary structure elements as described previously 13 were used to identify the respective domains of each structure (Table S1). The identified protein domains were used for structure-based multiple sequence alignments (Figures S1 and S2) using STAMP (version 4.4). 22 To select reference alignment columns, profile hidden Markov models (HMMs) were derived from the multiple sequence alignments using the hmmbuild command from the HMMER software package (Version 3.1b2, http://www.hmmer.org, Howard Hughes Medical Institute, Chevy Chase, Maryland). 23,24 The initial alignment that was used to derive a profile HMM was refined manually to check for shifts in the alignment columns after alignment against the profile HMM itself, such as loop regions and secondary structure elements. The protein sequence from PDB entry 1GYC (laccase from Trametes versicolor) was used for the assignments of standard position numbers of domains N and C. All sequences of the LccED were aligned against the respective profile HMM, and the position number of 1GYC was transferred. 25 For each domain, position counting starts from one. We note that some parts of the numbering scheme were less reliable due to a low conservation within those regions: for the domain N numbering, the N-terminal region (standard positions 1-27) and the C-terminal region (standard positions 123 to the end) were discarded. The domain C numbering has inaccuracies at the N-terminal region (standard positions 1-45), at the standard positions 84 to 93, and at the C-terminal region (from 144 to the C-terminus).

| Conservation analysis
The two standard numbering schemes were used for analyzing the amino acid composition of the two domains in 2dMCOs and 3dMCOs.
The three 2dMCO types were analyzed separately. An amino acid was defined as conserved if it occurred in at least 70% of all sequences of the respective sample. In order to predict the functional role of conserved positions, a survey on published results from mutagenesis studies was performed.

| Protein sequence networks
Pairwise sequence alignments were used to derive protein sequence networks, that is, undirected graphs with nodes that represent sequences and edges that are weighted by pairwise sequence identity.
All MCO sequences, both overall sequences and domain regions, were clustered with 60% sequence identity using the USEARCH (UCLUST) algorithm (version v11.0.667) to derive a reduced set of representative sequences (also named centroids). 15 For the domain-based networks, the domain regions were extracted from the overall protein sequences with the profile HMMs created and used for the standard numbering scheme, using the hmmsearch command from the HMMER software package (Version 3.1b2, http://www.hmmer.org, Howard Hughes Medical Institute). A maximal E-value of 10 −10 and a hit length of at least 100 positions were defined as filter criteria.
The obtained representative sequences were used for performing pairwise global sequence alignments using the Needleman-Wunsch algorithm in the EMBOSS software suite with default gap opening and extension penalties of 10 and 0.5, respectively (version 6.6.0). 26,27 The resulting lists of sequence pairs were filtered with a fixed sequence identity cut-off to select the edges of the networks. The visualization of the protein sequence networks was performed with Cytoscape (version 3.4.0) using a prefuse, force-directed layout, which takes the edge weights into account and tends to place pairs of higher identity in closer proximity to each other. 28

| Co-evolution of protein domains
All sequence entries from the updated LccED were aligned against the profile HMMs for domains N and C by using the hmmsearch command from the HMMER software package (Version 3.1b2, http://www. hmmer.org, Howard Hughes Medical Institute) with the max option in order to retrieve all possible bit scores for each profile-to-sequence alignment. The additive bit score for a profile-to-sequence alignment depends on the length of the profile HMM. Since both profile HMMs of domains N and C cover 119 amino acid residues each, the values of the bit scores can be compared for both domains. The lists of bitscores were sorted by sequence identifiers, and in case of multiple hits, only the maximal bit score was kept. The bivariate histogram was visualized as a heat map for bit scores greater than zero in MATLAB (version R2019a, The MathWorks, Natick, Massachusetts).  in a single homologous family. Due to its comparably small sample size and the lack of more detailed literature information, this superfamily was excluded from the following investigations on domain-level.

| Conserved positions in MCO domains N and C
Conserved positions are derived from multiple sequence alignments and are structurally or functionally relevant. All domains N and all domains C were aligned, two standard numbering schemes were generated, and standard position numbers were assigned to structurally equivalent residues (Tables S2 and S3 are conserved in more than 70% of the sequence entries from the updated LccED (Table 2). Domain C is less conserved with 53 conserved positions in type A 2dMCOs, 24 in type B 2dMCOs, 14 in type C 2dMCOs, and 13 in 3dMCOs (Table 3).   (Table S2). Positions are numbered after domain N standard numbering scheme according to domain N of the multicopper oxidase from Trametes versicolor (PDB accession 1GYC). Amino acids present at the respective position in at least 70% of the sequences are defined as conserved. Additional amino acids at the same position are shown if they are found in at least 10% of the sequences. Gaps are indicated with an "−". If no amino acid is conserved in more than 70% of the sequences, the respective table entry is written in italics and included to be compared with the conserved amino acids from other MCO types at the same position.

T A B L E 3
Conserved positions in domain C protein sequences of 3dMCOs, type A 2dMCOs, type B 2dMCOs, and type C 2dMCOs with additional information of the respective function   (Table S3). Positions are numbered after domain C standard numbering scheme according to domain C of the multicopper oxidase from Trametes versicolor (PDB accession 1GYC). Amino acids present at the respective position in at least 70% of the sequences are defined as conserved. Additional amino acids at the same position are shown if they are found in at least 10% of the sequences. Gaps are indicated with an "−". If no amino acid is conserved in more than 70% of the sequences, the respective table entry is written in italics and included to be compared with the conserved amino acids from other MCO types at the same position.

| Conserved glycines and prolines
Most of the seven conserved glycines and two prolines in domain N and four glycines and two prolines in domain C are located in loops, and therefore are probably involved in folding or structure stabilization.

| Copper binding motifs
In all 2dMCOs and 3dMCOs, the TNC binding site is located at the interface between a domain N and a domain C. Therefore, both domains are expected to contain a conserved TNC binding motif. The  Table 4).  Two further sets of protein sequence networks were generated based on pairwise sequence alignments of domain N at a cut-off of 45% sequence identity ( Figure S3) and domain C at a cut-off of 40% sequence identity ( Figure S4). Due to the higher sequence conservation of domain N, most nodes form a single cluster at 45% sequence identity, whereas the homologues of domain C form two predominant clusters, already at a slightly lower cut-off of 40% sequence identity.

| Sequence space of MCOs
In case of the domain N networks, the largest cluster comprises mostly homologues of domain N from 3dMCOs, with distinct but connected communities of both bacterial and eukaryotic origin and few archaeal sequences. For the domain C networks, the two major clusters comprise mostly homologues of 3dMCOs, too, with a community of 2dMCO homologues appearing connected to the 3dMCO homologues in one cluster, whereas the other predominant cluster is F I G U R E 2 Representation of the copper binding motifs (one letter code of amino acids) each in a domain without and with a T1 copper binding site. A nonspecific domain is shown as a triangle. Three green dots represent the trinuclear cluster, while all amino acids binding to it also shown in green. The blue dot in a loop stands for the T1 copper binding site, and if it occurs in a domain, additional amino acids (colored in blue) are conserved and can be included in the motifs. An "x" represents a position that is not conserved Overview of the copper binding motifs and the ability to bind the T1 copper for each investigated MCO group and for the domains N and C, respectively

| Co-evolution of domains N and C
All sequence entries from the updated LccED were aligned against the two profile HMMs for domains N and C. Co-occurring pairs of both domains, that is, cases where both domains were annotated in a common sequence, were compared against each other, showing higher bit scores for domain N than for domain C in most of the sequences ( Figure 4). Thus, it appeared that the N-terminal domain was overall more conserved than its C-terminal counterpart, which was in agreement with the finding that sequence networks for domain N appeared more connected, whereas sequence networks for domain C appeared less connected (Figures S3 and S4).  Table S4, are amino acids that are not conserved, but may be in proximation of a conserved residue. Mutations affecting the highly conserved residues are typically focused on the T1, T2, and/or T3 copper binding sites. For most of these sites, a mutation either results in a dramatic loss of activity or total loss of activity, with the exception of the residue at position 138 in domain C. The 138 (C) position is the axial ligand involved in co-ordination of the T1 copper: in 63% of 3dMCOs, it is a methionine, while in 19% of 3dMCOs, it is a leucine; in 89% of type B 2dMCOs, it is also a methionine (Table S4). When the phenylalanine (standard position 138 (C)) of the T. versicolor 3dMCO was mutated to a methionine, it only resulted in a slight change in redox potential, while mutating the methionine (standard position 138 (C)) of the bilirubin oxidase (3dMCO) produced by Albifimbria verrucaria invariably resulted in a total loss of activity or a decrease in redox potential. Interestingly, mutation of the leucine residue in the Botrytis aclada 3dMCO to a phenylalanine resulted in an increase in catalytic activity and an increase in optimal pH (presumably due to an increase in redox potential), allowing for the application of this laccase in an enzymatic biofuel cell (Table S4). It is therefore clear that the type of amino acid residue targeted by the mutation also plays a role, for example, for the 3dMCO produced by Saccharomyces cerevisiae, mutation from a leucine to methionine or phenylalanine, resulted in no significant loss of activity, whereas mutation to valine, alanine or isoleucine (aliphatic residues) resulted in a dramatic loss of activity, or even total loss of activity when the leucine was mutated to lysine. For the bacterial type B 2dMCOs, a mutation from methionine to alanine, phenylalanine, or leucine resulted in a 33% loss of activity, a slight change in activity, and a decrease in catalytic activity (towards 2,6-DMP), but increase in redox potential, respectively.

| Mutations reported in the literature
Other conserved residues targeted in multiple studies include posi-  Table S4, which include 3dMCOs as well as type B 2dMCOs). These observations could be compared thanks to the application of the standard numbering scheme. shows a T1 copper binding cysteine between two TNC binding histidines, whereas the copper binding histidine is three amino acids apart from the TNC binding HxH motif. On sequence level, the two motifs of each domain are separated by 40 to 50 amino acids, but in the structure, the distance between the T1 copper binding histidines of the two motifs is only 3.2 Å (PDB entry 1GYC) ( Figure S5). In addition, the TNC is bound between two domains, so in total eight histidines (two of each of the four HxH motifs) are involved in TNC binding.

| DISCUSSION
Although a domain N and a domain C equivalent are involved in all MCO groups, respectively, the two motifs that occur in each of these domains seem to be opposite to the two motifs of the other domain in the overall structure, regardless of the different orientations of the domains ( Figure S6). It is conceivable that the symmetrical properties of the HxH motifs make it possible to create the common binding site with an opposite domain containing the same motifs.

| Evolutionary relationships of MCOs
The modular domain organization of MCOs (Figure 1) was discussed previously with respect to evolutionary relationships between MCOs themselves and proteins related to MCOs. 7 Nakamura et al. proposed already in 2003 that types B and C 2dMCOs evolved from type A 2dMCOs by loss of the T1 copper site in domains N and C, respectively. 12 In the same work, it was proposed that 3dMCOs evolved from type B 2dMCOs via recruitment of an additional protein domain encoded between domains N and C of 3dMCOs. Overall, with the help of the protein sequence networks (Figure 3), the conclusion is that 2dMCOs are more similar to the bacterial 3dMCOs than to the eukaryotic ones. This could be due to the fact that almost all 2dMCOs in the LccED are annotated with bacterial source organisms, which could support the existing theory that the 3dMCOs originated from type B 2dMCOs. 12 The increased sample size of sequences in the current LccED verified these previous suggestions on domain organization and showed that the modular structure is a valuable criterion for comparing MCO sequences.
The updated LccED enabled a more thorough analysis of conserved positions within domains N and C. It was found that type A 2dMCOs comprise more conserved residues than type B and C MCOs, an aspect also reflected in the source organisms, with type A 2dMCOs occurring in Archaea, the presumably oldest prokaryotic lineage. Even though conservation does not necessarily imply that type A 2dMCOs are evolutionary older, it is still accepted in the field that type A 2dMCOs resulted in the evolution of types B and C 2dMCOs due to the regression/loss of blue copper-binding sites. 6,7 In addition, domains N and C appeared to be co-evolved with domain N being more conserved than domain C ( Figure 4). When such conservation is observed in domain combinations, it is usually due to the fact that they descend from a common ancestor. 30 In a study by Rydén and Hunt, 11  varied biological activities, for example, laccases, ferroxidases, bilirubin oxidases, and so on. 6,7 Besides MCOs, different modular domain organizations were also observed for other protein families such as thiamine diphosphatedependent decarboxylases 21 and α/β-hydrolases, 32 underlining the importance of the assumed recruitment and arrangement of different protein domains for protein families in general.

| Standard numbering and mutagenesis studies
In this study, a standard numbering scheme was implemented in order to gain insight into the amino acid residues typically targeted in mutagenesis studies. The application of standard numbering will allow researchers to compare MCOs with those that have already been subjected to mutagenesis studies and allow for a deeper insight into which type of approaches should be taken (eg, iterative saturation mutagenesis, random mutagenesis, or site-directed mutagenesis). In addition, insights can be gained from the various studies reported in literature about amino acid residues near or in the substrate binding site in order to change substrate specificity, 33-36 about residues located near the T1 copper for improved redox potential, 36,37 about residues located in the secondary coordination sphere of the T1 copper that may play a role in substrate binding and redox potential, 33,38,39 about residues located near the TNC for a greater understanding of the reduction of dioxygen to water, 40 about the role of the C-terminal tail in solvent access to the TNC, 41 as well as about the role of surface-located residues that either contribute towards improved stability or result in improved expression and secretion in a recombinant host. 41,42 The mutations at equivalent standard positions (Table S4)