AlphaFold2‐guided description of CoBaHMA, a novel family of bacterial domains within the heavy‐metal‐associated superfamily

Three‐dimensional (3D) structure information, now available at the proteome scale, may facilitate the detection of remote evolutionary relationships in protein superfamilies. Here, we illustrate this with the identification of a novel family of protein domains related to the ferredoxin‐like superfold, by combining (i) transitive sequence similarity searches, (ii) clustering approaches, and (iii) the use of AlphaFold2 3D structure models. Domains of this family were initially identified in relation with the intracellular biomineralization of calcium carbonates by Cyanobacteria. They are part of the large heavy‐metal‐associated (HMA) superfamily, departing from the latter by specific sequence and structural features. In particular, most of them share conserved basic amino acids (hence their name CoBaHMA for Conserved Basic residues HMA), forming a positively charged surface, which is likely to interact with anionic partners. CoBaHMA domains are found in diverse modular organizations in bacteria, existing in the form of monodomain proteins or as part of larger proteins, some of which are membrane proteins involved in transport or lipid metabolism. This suggests that the CoBaHMA domains may exert a regulatory function, involving interactions with anionic lipids. This hypothesis might have a particular resonance in the context of the compartmentalization observed for cyanobacterial intracellular calcium carbonates.


| INTRODUCTION
Superfolds are folds observed in a large number of evolutionarily unrelated protein domain superfamilies 1 and are characterized by compact super-secondary structure patterns. 2,3 One of these superfolds is the ferredoxin-like fold, found in 62 superfamilies according to the Structural Classification of Proteins (SCOPe) (fold d.58 4) and present in many domains with various functions. 5,6 It is made of a repeated β-α-β super-secondary structure, forming a four-stranded β-sheet, with the two α-helices packed into an α-β sandwich. As for other superfolds, the ferredoxin-like fold is subject to circular permutations, a mechanism which allows adaptation and the emergence of new functions. This can be visualized by the ligation of the amino- and carboxyl-termini and subsequent cleavage at another site. 7,8 Among the 62 superfamilies comprising the ferredoxin-like fold, the heavy-metal-associated (HMA) superfamily (SCOPe d.58.17) contains the eponymous HMA family (Figure 1A). Domains of this HMA family contain two conserved cysteine residues involved in heavy metal binding. They are found in a variety of metal-trafficking proteins, which play essential roles in transport and homeostasis. 9,10 While they can be found alone, HMA domains are also part of diverse domain architectures, especially associated with P1B-ATPases, which are integral membrane proteins allowing transport of metals across cell membranes. 11

We recently identified a novel family of domains belonging to the HMA superfamily, called CoBaHMA (after Conserved Basic residues HMA). This discovery originates from the characterization of a novel family of two-domain proteins, named calcyanin, which is associated with intracellular biomineralization of calcium carbonates by Cyanobacteria. 12 Calcyanins share a common architecture consisting of a conserved C-terminal domain, made of a threefold repeated, unusually long glycine zipper motif ((GlyZip)3). Glycine zippers themselves consist of repeated GXXXG motifs, commonly found in transmembrane domains (TMDs) and bacterial toxins. 13,14 The calcyanin (GlyZip)3 domain is preceded by a variable N-terminal domain, specific to each of the four distinct calcyanin subgroups identified to date, which are found in distinct clades of Cyanobacteria. The N-terminal domain of one of these calcyanin subgroups is a CoBaHMA domain. It is represented by 15 sequences. 12 It shares significant sequence similarities with HMA domains, as well as plant integrated HMA domains 15 and YajR, AraEP, MBD (YAM) domains found in the C-terminal part of the bacterial YajR, an integral membrane protein which belongs to the major facilitator superfamily. 16,17 These three families of domains share the same three-dimensional (3D) structure, but only the first one (HMA) is referenced in the SCOPe classification (d.58.17.1 4).
Sequence analysis and molecular modeling 12 have revealed specific features of the CoBaHMA domain family, including the presence of an additional strand β0 at the N-terminus of the domain, which replaces the C-terminal β4 strand, thus conforming to the circular permutation scenario described before (Figure 1B). This comparative modeling, considering multiple templates to build the novel topology of CoBaHMA domains, was further supported by the AlphaFold2 (AF2) predictions. 12 CoBaHMA domains in calcyanins are also characterized by the presence of conserved basic amino acids (AAs) in β0

FIGURE 1 Topology diagrams and ribbon representations of the three-dimensional (3D) structures of (A) an HMA domain (Human CopZ; experimental 3D structure, PDB 2QIF); (B) a CoBaHMA domain (Synechococcus calcipolaris calcyanin; AlphaFold2 model). Regular secondary structures are colored rainbow, from the N-terminus (blue) to the C-terminus (red), with the exception of the additional strand β0, specific to the CoBaHMA domain (pink). Conserved amino acids, specific to the two families, are highlighted in a ball-and-stick representation. CoBaHMA, Conserved Basic residues HMA; HMA, heavy-metal-associated.
Here, we searched for the presence of CoBaHMA domains in proteins distinct from calcyanins. 18-20 For this purpose, we propose a novel methodological framework, taking advantage of the structural information provided by the AF2 3D structure models 21 of the retrieved sequences, now widely available in the AlphaFold Protein Structure Database (AFDB). 22 This allows both sensitive and specific detection of CoBaHMA domains within the HMA superfamily.
As a result, we describe a wide diversity of modular organizations in which CoBaHMA domains are included, with some yet-to-be-characterized regions. Moreover, the CoBaHMA domain family appears to be specific to bacteria and frequently associated with TMDs and soluble domains involved in transport of substrates and lipid metabolism, suggesting a regulatory role with regard to these functions, possibly via an interaction with charged lipids.

| MATERIALS AND METHODS
The workflow developed and used in this study is shown in Figure 2. We searched the UniClust30/UniRef30 databases (version 2022_02, downloaded from https://gwdu111.gwdg.de/compbiol/uniclust/2022_02/) 23 using the standalone version of HHblits. 24 The 15 sequences of CoBaHMA domains identified by Benzerara et al. 12 were used as individual probes. For each sequence probe, the sequence similarity search consisted of eight iterations, retaining hits covering at least 50% of the sequence length with an E-value threshold of 1e-3. For each iteration and each probe, we gathered the sequences extracted from the output a3m multiple sequence alignment (MSA). We removed trivial redundancy using an in-house python3 script, by merging entries that had identical identifiers and whose sequences were identical over at least 50% of their respective lengths. In order to find which entries had a 3D model available in the AFDB (https://alphafold.ebi.ac.uk/), 22 we downloaded the accession_ids.csv file from http://ftp.ebi.ac.uk/pub/databases/alphafold/. Based on this file, we downloaded the 3D models corresponding to our entries from AFDB v4.
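The lookup of available models can be illustrated with a minimal sketch. It is not the in-house script used in this study. It assumes that each row of accession_ids.csv lists a UniProt accession, the first and last residue indices, the AFDB identifier, and the latest version, and that model files follow the public AFDB file naming pattern; the function name `afdb_model_urls` is hypothetical.

```python
import csv
import io


def afdb_model_urls(accession_csv_text, wanted_accessions, version=4):
    """Return {accession: model URL} for wanted accessions present in the CSV.

    Assumed row layout (not confirmed by the text above):
    accession, first_residue, last_residue, afdb_id, latest_version
    """
    urls = {}
    for row in csv.reader(io.StringIO(accession_csv_text)):
        if len(row) < 5:
            continue  # skip malformed or empty rows
        accession, afdb_id = row[0], row[3]
        if accession in wanted_accessions:
            # Assumed AFDB file naming pattern for PDB-format models
            urls[accession] = (
                f"https://alphafold.ebi.ac.uk/files/{afdb_id}-model_v{version}.pdb"
            )
    return urls
```

In this sketch, entries absent from the CSV are simply skipped, which mirrors the fact that only identifiers with an available model can be assessed structurally.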
FIGURE 2 Workflow of the current analysis. (A) The 15 sequences of CoBaHMA domains were used as probes for sequence similarity searches with HHblits against the Uniclust30 database. The AF2 3D structure models of the sequences that were identified by HHblits were considered in order to assess the presence of strand β0, a feature that distinguishes true CoBaHMA domains. The new CoBaHMA domain sequences were then used as additional probes with HHblits to perform transitive searches. (B) Full-length proteins with at least one CoBaHMA domain were annotated relative to transmembrane regions, calcyanin signature, structural domains, and domain families using DeepTMHMM, pCALF, DomainMapper, and InterProScan, respectively. In addition, the taxonomy of all sequences was retrieved through the UniProt API. (C) Sequences were clustered using mmseqs2 and representative sequences were searched against themselves. Resulting alignment scores were used to build a similarity network. The network was refined iteratively to ensure that all sequences within a community share a similar length with a maximal amplitude of 50 residues. Finally, nodes and edges were rendered using the edge-rendering and weighted spring embedded layout from Cytoscape. AF2, AlphaFold2; CoBaHMA, Conserved Basic residues HMA; HMA, heavy-metal-associated.

We cropped each 3D model based on the boundaries of the CoBaHMA domain candidates identified with the HHblits search.
From the 3D coordinates of these sub-3D models, we computed the secondary structures associated with each AA using the Define Secondary Structure of Proteins (DSSP) algorithm. 25,26 The eight-state output of DSSP was converted to three states using the EVA convention. 27 Following this convention, α-helix, 3₁₀ helix and π-helix are converted to helices (H), extended strand and isolated β-bridge to extended (E), and turn, bend and other to coils (C). To validate the presence of a secondary structure, we set a threshold of at least three consecutive AAs with the same secondary structure assignment. As a result, we could infer the string of regular secondary structures of the submodels. We set four criteria to identify a CoBaHMA domain, as follows.
First, it should contain the secondary structure pattern EEHEEH(E) typical of CoBaHMA domains. 12 Second, it should contain neither the Cys-X-X-Cys pattern, typical of HMA domains, 10 nor the Cys-X-X-X-Cys pattern typical of TRASH domains. 28 Third, the secondary structure pattern should not start with EH, as observed in canonical HMA domains. Fourth, the first two β-strands should be separated by a coil of 1 or 2 AAs, to avoid confusing a β0 strand of a CoBaHMA domain with the β4 strand of an HMA domain, which is followed by a second HMA domain starting with a β1 strand. A final manual check based on the alignment of sequences was conducted to remove the remaining false positives. Overall, entries meeting these four criteria and passing the manual check were considered as CoBaHMA domains.
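The secondary structure reduction and the four automated criteria can be sketched as follows. This is an illustrative reading of the text, not the code used in the study: the function names are hypothetical, the input is assumed to be the one-letter DSSP string of a cropped model together with its sequence, and the coil length of criterion 4 is measured between the first two strands of the matched EEHEEH pattern. The manual check is not represented.

```python
import re

# Eight-state DSSP codes mapped to three states as described in the text;
# every other code (turn, bend, blank, etc.) becomes coil (C).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Convert an eight-state DSSP string to the three states H, E, C."""
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp)


def ss_segments(ss3, min_len=3):
    """List (state, start, end) for H or E runs of at least min_len residues."""
    return [
        (m.group()[0], m.start(), m.end())
        for m in re.finditer(r"H+|E+", ss3)
        if m.end() - m.start() >= min_len
    ]


def looks_like_cobahma(sequence, dssp):
    """Apply the four automated criteria to one cropped domain candidate."""
    segs = ss_segments(reduce_dssp(dssp))
    pattern = "".join(state for state, _, _ in segs)
    # Criterion 1: EEHEEH(E) pattern (the trailing strand is optional)
    if "EEHEEH" not in pattern:
        return False
    # Criterion 2: neither Cys-X-X-Cys (HMA) nor Cys-X-X-X-Cys (TRASH)
    if re.search(r"C.{2,3}C", sequence):
        return False
    # Criterion 3: the pattern must not start with EH (canonical HMA)
    if pattern.startswith("EH"):
        return False
    # Criterion 4: coil of 1 or 2 residues between the first two strands
    first = pattern.index("EEHEEH")
    gap = segs[first + 1][1] - segs[first][2]
    return gap in (1, 2)
```

Working on run-length segments rather than on the raw per-residue string keeps the pattern test independent of strand and helix lengths, which vary between domains.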
We performed transitive searches with newly identified CoBaHMA domains. This process was iterated six times. We removed the redundancies from the six iteration outputs by merging the entries that had the same identifiers and shared at least 50% of their sequences.
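The transitive search can be pictured as a bounded loop in which each round reuses only the newly validated domains as probes. The sketch below is schematic: `search` and `is_valid` are hypothetical callables standing in for an HHblits run and for the structure-based validation, respectively, and the stopping rule (no new domain found) is an assumption.

```python
def transitive_search(initial_probes, search, is_valid, max_rounds=6):
    """Collect validated hits reachable from the probes in at most max_rounds.

    search(probe) returns an iterable of hit identifiers (placeholder for a
    sequence similarity search); is_valid(hit) returns True for hits that
    pass validation (placeholder for the structural criteria).
    """
    validated = set(initial_probes)
    probes = set(initial_probes)
    for _ in range(max_rounds):
        new_hits = set()
        for probe in probes:
            for hit in search(probe):
                if hit not in validated and is_valid(hit):
                    new_hits.add(hit)
        if not new_hits:
            break  # nothing new to use as probe
        validated |= new_hits
        probes = new_hits  # only new domains are searched in the next round
    return validated
```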
Finally, since HHblits alignments only gave the portions of the sequences that match the probes, we downloaded from UniProt (https://www.uniprot.org/) 29 the full sequences of each identifier that had at least one CoBaHMA domain in its sequence.

| Annotation of the full-length sequences (Figure 2B)
The TaxID and organism name associated with the sequences were retrieved using the UniProt REST API by downloading UniProt ttl files and serializing sequence information. TaxonKit 30 was used to convert TaxIDs into taxonomic lineages based on the NCBI taxonomy.
Sequence functional features were assessed using several annotation tools. First, sequences were annotated with InterProScan 31 and InterPro 32 (interproscan-5.59-91.0). In addition, DomainMapper 33 (version 3.0.2) and the Evolutionary Classification of Protein Domains (ECOD) were used to complete the sequence annotation. This HMM parsing algorithm also detects non-contiguous, insertional, and circularly permuted domains. Results from both tools were filtered with an E-value threshold of 1e-10. Transmembrane (TM) regions were annotated using DeepTMHMM 34 (web version 1.0.20). Finally, calcyanin sequences were detected using pCALF (see below and Supplementary Figure 1). Hydrophobic cluster analysis (HCA 35,36) was also used to assess the foldability of the analyzed sequences. 37,38

| Calcyanin detection and classification (Supplementary Figure 1)
Sequences of calcyanins were detected in our dataset using a dedicated in-house tool called pCALF (standing for python CALcyanin Finder). pCALF uses four hidden Markov model (HMM) profiles describing the C-terminal glycine zipper triplication specific to calcyanins, called the (GlyZip)3 motif. 12 The (GlyZip)3 HMM profile describes the whole triplication, while the Gly1, Gly2, and Gly3 HMMs describe each glycine zipper individually. These HMM profiles were searched against AA sequences using pyHMMER. 39,40 In addition, a set of domains (7 Y-type, 1 X-type, 14 Z-type, and 15 CoBaHMA), described as N-terminal domains of known calcyanins, was considered to annotate the N-terminal region of the sequences. A sequence is classified as calcyanin when it has a significant hit against the (GlyZip)3 HMM profile (sequence coverage threshold: 60%; E-value threshold: 1e-20) and significant hits against the individual Gly1, Gly2, and Gly3 zippers in this specific order (sequence coverage threshold: 70%; E-value threshold: 1e-10). A sequence is also classified as calcyanin when the second glycine zipper is missing and it contains a Y-type N-terminal domain.
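The decision rule can be written compactly, as in the sketch below. It is not pCALF itself: the hit representation (dictionaries with `name`, `start`, `coverage`, and `evalue` keys) and the function names are hypothetical, thresholds are treated as inclusive, and the whole (GlyZip)3 hit is assumed to be required in both branches of the rule, which the text does not state explicitly.

```python
def significant(hit, min_coverage, max_evalue):
    """True if a hit exists and passes both coverage and E-value thresholds."""
    return (
        hit is not None
        and hit["coverage"] >= min_coverage
        and hit["evalue"] <= max_evalue
    )


def is_calcyanin(glyzip3_hit, zipper_hits, has_y_type_nter):
    """Classify one sequence from its profile hits (illustrative only).

    glyzip3_hit: hit against the whole (GlyZip)3 profile, or None.
    zipper_hits: hits against the Gly1, Gly2 and Gly3 profiles.
    has_y_type_nter: whether a Y-type N-terminal domain was annotated.
    """
    if not significant(glyzip3_hit, 0.60, 1e-20):
        return False
    kept = [h for h in zipper_hits if significant(h, 0.70, 1e-10)]
    # Order of the individual zippers along the sequence
    order = [h["name"] for h in sorted(kept, key=lambda h: h["start"])]
    if order == ["Gly1", "Gly2", "Gly3"]:
        return True
    # Second zipper missing: accepted only with a Y-type N-terminal domain
    return order == ["Gly1", "Gly3"] and has_y_type_nter
```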

| Classification of the full-length sequences (Figure 2C)
Full-length proteins were clustered using mmseqs2, 41 considering a sequence coverage threshold of 97% (coverage mode: 0, cluster mode: 0, identity threshold: 0%). This threshold was the lowest that kept the length difference between the shortest and the longest sequences of a cluster below 50 residues, ensuring that it remained smaller than the length of a domain. Sequences representative of each cluster, including singletons, were extracted and a self-versus-self search was performed using mmseqs2 (alignment mode: 0). Alignment results were filtered using a reciprocal coverage threshold above 70%, an E-value threshold below 1e-10 and a sequence length amplitude lower than 50 AAs. The bitscore was normalized (NB) by the length of the shortest sequence in the alignment. 42 Finally, alignment results were filtered with an NB threshold above 0.44, which corresponds to the median value of all NBs.
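The filtering and normalization of the alignments can be summarized by the sketch below. It is a minimal illustration, not the pipeline used here: the field names (`query`, `target`, `qcov`, `tcov`, `evalue`, `bitscore`) are hypothetical, and the median is recomputed from the input rather than fixed at 0.44.

```python
from statistics import median


def filter_alignments(alignments, lengths, min_cov=0.70, max_evalue=1e-10,
                      max_len_diff=50):
    """Keep alignments passing all filters and annotate them with NB.

    alignments: list of dicts with query, target, qcov, tcov, evalue, bitscore.
    lengths: dict mapping sequence identifier to sequence length.
    """
    scored = []
    for aln in alignments:
        len_q, len_t = lengths[aln["query"]], lengths[aln["target"]]
        if aln["qcov"] <= min_cov or aln["tcov"] <= min_cov:
            continue  # reciprocal coverage must exceed the threshold
        if aln["evalue"] >= max_evalue:
            continue
        if abs(len_q - len_t) >= max_len_diff:
            continue
        # Normalized bitscore: bitscore divided by the shortest sequence length
        scored.append({**aln, "nb": aln["bitscore"] / min(len_q, len_t)})
    if not scored:
        return []
    threshold = median(a["nb"] for a in scored)
    return [a for a in scored if a["nb"] > threshold]
```

Normalizing by the shortest sequence makes scores comparable between short single-domain proteins and long multidomain ones.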
A similarity network was built with networkX 43 using alignment results from both clustering and search, with sequences as nodes and similarities as edges, weighted by NB. The network was refined iteratively in two steps: community detection and edge removal. First, the best partition was found using the Louvain Community Detection Algorithm. 44 Then, we focused on communities where the longest and shortest sequences/nodes had a length difference of more than 50 AAs. We removed the node that was the furthest from the median length of the community. We repeated this process until the difference between the longest and the shortest nodes in the considered community fell below the 50 AAs threshold. All the nodes removed in this way were labeled as invalid nodes. Finally, we removed all edges between invalid and valid nodes. Community detection and edge removal were repeated until no more edges could be removed.
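The length-based trimming applied to a single community can be sketched as follows. Only this inner step is shown; community detection and the removal of edges between valid and invalid nodes are left out. The function name is hypothetical, and ties between equally distant nodes are resolved arbitrarily (first encountered), which is an assumption.

```python
from statistics import median


def trim_community(lengths, max_amplitude=50):
    """Split one community into valid and invalid nodes by sequence length.

    lengths: dict mapping node identifier to sequence length.
    Returns (set of valid nodes, list of invalid nodes in removal order).
    """
    valid = dict(lengths)
    invalid = []
    while valid and max(valid.values()) - min(valid.values()) > max_amplitude:
        med = median(valid.values())
        # Node whose length is the furthest from the community median
        worst = max(valid, key=lambda node: abs(valid[node] - med))
        invalid.append(worst)
        del valid[worst]
    return set(valid), invalid
```

Removing one node at a time and recomputing the median avoids discarding more sequences than needed when a community contains a single outlier.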
The partitioning quality of the final network was assessed. Network layout (edge-rendering and weighted spring embedded layout, with weight interpreted as normalized values, "strength of a disconnected spring" set to 0.01 and "strength to apply to avoid collisions" set to 10) and rendering were produced with Cytoscape. 45

| Multiple alignment of representative sequences of the CoBaHMA family
In order to build an MSA that is representative of CoBaHMA's diversity and to avoid the over-representation of a subgroup of CoBaHMA domains, the 2305 CoBaHMA sequences were clustered with mmseqs2 (coverage mode: 0, cluster mode: 0, identity threshold: 60%, sequence coverage threshold: 80%). 41 The representatives of the clusters and singletons were gathered, amounting to a total of 1434 CoBaHMA sequences. The sequences were aligned using MAFFT v7.487 46 (maxiterate 1000; localpair), with some manual correction.
The MSA was viewed and analyzed with Jalview 2.11.2.6. 47 WebLogo v2.8.2 48 was used on the Berkeley server (https://weblogo.berkeley.edu/logo.cgi) to build the logo of the MSA, restricted to the CoBaHMA β-strands. In order to identify the AA conservation patterns, we removed the indels from the alignment before building the logo.
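One possible way to remove indel positions before building a logo is to drop alignment columns that contain gaps. The sketch below assumes this interpretation, which the text does not detail; the function name and the tolerated gap fraction are hypothetical parameters.

```python
def drop_gapped_columns(msa, max_gap_fraction=0.0, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    msa: list of aligned sequences of identical length.
    """
    if not msa:
        return []
    n_seqs = len(msa)
    keep = [
        col
        for col in range(len(msa[0]))
        if sum(seq[col] in gap_chars for seq in msa) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in msa]
```

With the default value, only columns free of gaps are retained, so that every remaining logo position is computed from all sequences.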

| Analysis of the 3D structure models
The AF2 3D structure models were manipulated and visualized using Chimera. 49 Following ref. 50, the predicted local distance difference test (pLDDT) values, describing the model confidence at the AA level, were split into four categories: pLDDT ≥ 90 (very high confidence); 70 ≤ pLDDT < 90 (high confidence); 50 ≤ pLDDT < 70 (low confidence); and pLDDT < 50 (non-interpretable). The associated predicted aligned error (PAE), extracted from AFDB, 22 was also considered in order to evaluate interdomain contacts. Each position of the PAE matrix gives the uncertainty, in Å, on the relative positions of two AAs, whose positions in the sequence are given by the x and y coordinates.
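The four confidence classes translate directly into a small helper, sketched here with hypothetical function names. The per-residue pLDDT values are assumed to be already extracted from the model files.

```python
from collections import Counter


def plddt_category(plddt):
    """Return the confidence class of one per-residue pLDDT value."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "high"
    if plddt >= 50:
        return "low"
    return "non-interpretable"


def plddt_profile(values):
    """Count residues per confidence class for one model or region."""
    return Counter(plddt_category(v) for v in values)
```

Such a per-region count is one way to state that the core helices and strands of a domain are predicted with high or very high confidence.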
Structural similarities were searched using Foldseek. 51 3D structure models were analyzed in light of the MSAs of each community, built using MAFFT v7.487 46 and rendered using ESPript3. 52

| RESULTS

| Identification of CoBaHMA domains not contained in calcyanins
We propose here a methodology dedicated to the specific identification of CoBaHMA domains, as members of a novel family within the large HMA domain superfamily. Starting from the sequences of CoBaHMA domains from the 15 calcyanins reported by Benzerara et al. 12 as queries, we performed an iterative, profile-based sequence similarity search combined with the consideration of structural features provided by AF2 models (see Materials and Methods; Figure 2). Considering these structural features during the search process improved the discrimination between HMA and CoBaHMA domains. However, this specificity was achieved at the cost of a lower recovery of CoBaHMA domains, since not every sequence in the UniClust30 database had a model available in the AFDB at the time of our study.
We increased the size of the sequence dataset by considering transitive searches (six iterations) and using newly detected CoBaHMA domains as probes for additional searches. A total of 38 444 distinct sequence identifiers were recorded by these searches. Among these identifiers, 28 918 had AF2 models. The remaining ones mostly corresponded to UniParc sequences. CoBaHMA-specific features were considered using the AF2 models, restricting the set to a total of 2358 domains corresponding to 2280 sequences, the manual inspection of

The β-sheet displays several conserved features, spread over all β-strands, except strand β4, which is not present in every CoBaHMA domain. By contrast, the two α-helices are highly variable and could not be aligned. Strand β1 possesses two highly conserved arginine residues: 88% of the sequences had both arginine residues and 94% had at least one of them. These two arginine residues are accompanied by another basic residue (arginine or lysine) at the C-terminus of strand β1 as well as a fairly conserved histidine (65% of the sequences) in strand β0; together, they form a basic patch at the surface of the β-sheet. The full motif HxxxxRxRxR that was originally identified in calcyanins 12

| Structural and functional analyses of the full-length sequence communities
Communities of full-length sequences have been segmented based on a similarity network. The robustness of the affiliation of full-length sequences to a given community was ensured by the refinement method described in Section 2 (Section 2.4). (iii) they cannot be annotated. An additional processing step was added, by considering together communities sharing identical annotations and/or distinct annotations but corresponding to the same major functional families. We describe the communities using this analysis workflow, adding information about their 3D structures from AF2 models, as well as conserved motifs identified from MSAs (see Supplementary Data 1). We focused our analysis on the 74 large communities, containing at least five sequences. Altogether, they represent a total of 1679 sequences (i.e., 73% of the total number of CoBaHMA domains).
A good agreement between these two sources of annotations is observed. Moreover, it should be noted that among the 365 sparsely populated communities (composed of n < 5 sequences), 154 communities have IPR or ECOD annotations shared with the larger communities (see below for details).
The position of the communities relative to the functional families/phyla and within the network are described in Supplementary Fig-

| Single-CoBaHMA domain proteins
Nine of the 74 large communities fall into this category, scattered over several phyla (Figure 5 and Supplementary Figure 7). They include a total of 422 sequences, with 90% included in 5 communities.
The lengths of these sequences are around 100 AAs. Superimposition of the AF2 3D models of their representative members (in which all the core α-helices and β-strands are predicted with very high/high pLDDT values) indicates that these single-CoBaHMA domain proteins possess extra regular secondary structures (mostly α-helices) at the N- and/or C-terminus of the domain, which pack against the core (Figure 5). Some variations are observed in the β-sheet conserved motifs, with the particular case of community C397 (Figure 5F), in which the conserved basic AAs are not present and the extra N-terminal helix takes the place of helix α2. Considering this topological difference, community C397 should thus be considered apart, probably not belonging to the CoBaHMA family.

FIGURE 3 Amino acid conservation in the CoBaHMA family sequences. Amino acids are colored according to their properties (black: apolar, green: polar, yellow: small, orange: aromatic, blue: basic, red: acid, cyan: histidine). The most conserved amino acids, outside aliphatic ones participating in the hydrophobic core, are displayed on the AlphaFold2 three-dimensional (3D) structure model of a CoBaHMA domain, for which per amino acid pLDDT values are mostly very high (UniProt A0A545SE61, community C299). CoBaHMA, Conserved Basic residues HMA; pLDDT, predicted local distance difference test.

It is difficult to estimate how many sequences of single-CoBaHMA domain proteins are included in sparsely populated communities (n < 5), given that sequences of similar length (e.g., C444, mean length 123.4, Supplementary Figure 6) can have additional secondary structures decoupled from, instead of associated with, the CoBaHMA domain, as illustrated below.

| CoBaHMA domains in multidomain proteins
• IPR-annotated communities
Only a few large communities show IPR annotations. They exclusively correspond to membrane proteins. We grouped together the IPR categories relating to the same protein families (Supplementary Table 1A). Of note, small communities (n < 5), when annotated, are mostly covered by the same IPR categories, with very few new IPRs found there (Supplementary Figure 8 and Supplementary Table 1B).
(A) P-type ATPases
P-type ATPases account for the majority of the total, with 345 sequences in the large communities (Supplementary Table 1A).

Below, we describe the general features of P-type ATPases and how CoBaHMA-containing P-type ATPases are clearly distinguishable from the already well-characterized members of this superfamily.

Several IPR entries (Supplementary Table 1A) are available to annotate the P-type ATPases over their full-length common core or domains, while other IPRs provide annotations for specific P-type subsets. Of note, some of the P1B-ATPases with CoBaHMA domains are detected by the cd07550 (P-type_ATPase_HM) profile from the Conserved Domain Database. IPR entries specific to HMA domains are also found, outside the limits of the CoBaHMA domains, accompanying some P1B-type ATPases.
Nineteen communities with at least five sequences and scattered over several phyla (Figure 4) are annotated as P-type ATPases (Supplementary Table 1A and Supplementary Figure 7), among which 15 belong to the P1B-type. Indeed, the AF2 models of their representative sequences (Figure 6) include an S-domain specific to this subset, comprising two TM helices, a long and curved MA helix and a kinked MB helix with an amphipathic MB′ segment at the cytoplasmic membrane interface, lining the ion entry point. 53,54 Three communities lack the N- and/or P-domains, with the truncations matching the limits of these domains. Hence, the C20 community (Figure 6D), specific to cyanobacteria, lacks the P-domain and has degenerated consensus motifs in domains A and N, in contrast to the other communities which preserve these critical sequences (Supplementary Data 1). C744 (not shown) lacks the N-domain and the two C-terminal helices of the T-domain, while still bearing a P-domain. C710 (not shown) lacks both the N- and P-domains, as well as the two C-terminal helices of the T-domain.
A remarkable case is that of community C369, which is not detected by any P-type ATPase profile: it consists of a CoBaHMA domain, the T-domain and the N-domain, but lacks the MA-MB-M1-M2 block, the A-domain and the P-domain.
Finally, the last two communities correspond to P1B-ATPases including other domains in addition to an N-terminal CoBaHMA domain: C550 (Figure 6E), having a C-terminal, well predicted but yet uncharacterized domain (green), and C658 (Figure 6F), which possesses multiple helices (predicted, however, with low pLDDT values) between the CoBaHMA and S-domains and after the S-domain (green).
In the remaining P-type ATPases, the C413 community (Figure 6G) possesses a single N-terminal CoBaHMA domain but has an atypical MA-MB segment, which does not match the usual topology encountered in P1B-ATPases and could not be modeled accurately.
Finally, three communities (illustrated in Figure 6H with C556) also possess a CoBaHMA domain in the N-terminus but do not belong to the P1B-type. Instead, they are annotated by cation N-terminal (IPR004014) and cation C-terminal (IPR006068) profiles (Supplementary Table 1A), which are found in several cation-transporting P2A-ATPases (Na+, K+, Ca2+). Inspection of the AF2 model indicated a conserved calcium-binding site in the T-domain, 11,55 including a conserved central glutamate.
Members of the P1B-type subsets were described heretofore as specific to the translocation of heavy metal ions. They are divided into several groups based on conserved sequence motifs (in the unwound part of M4, but also in M5 and M6) and the selectivity of the transported metal ion. 56 The 15 communities of P1B-ATPases highlighted here possess conserved motifs in the unwound part of M4, which differ from one community to the other, and sometimes within large communities, in which sub-communities can be distinguished based on this feature. For instance, the biggest community C366 can be divided into two sub-communities, according to these motifs.

Finally, it should be noted that P-type ATPases are also abundant in sparsely populated communities (Supplementary Table 1B and Supplementary Figure 8). This indicates a very high level of diversity that far exceeds that described based on the analysis of the more populated communities.
In conclusion, our results indicate that CoBaHMA domains form a novel family found frequently in association with P1B-ATPases, similarly to HMA domains.However, sequence signatures of heavy metal binding in M4 are not systematically found in these P1B-ATPases, giving way to other signatures, including conserved acidic and basic residues and suggesting that P1B-ATPases are not exclusively transporting heavy metals.

(B) ABC exporters
CoBaHMA domains are also found in the N-terminus of two large communities corresponding to Type I ABC exporters (Supplementary Table 1A and Supplementary Figure 7). These proteins, now renamed Type IV exporters, 57 transport a wide variety of substrates across membranes. They consist of a TMD with six TM helices and an NBD, which form homo- or hetero-dimers, with a swapped arrangement of two TMs. The experimental 3D structures closest to the AF2 models of the communities' representative sequences (Foldseek searches) are those of bacterial ABC exporters involved in the transport of various substrates, including lipid A (MsbA, pdb 7PH4), peptides (TmrAB, pdb 6RAI), or multiple drugs (Sav1866, pdb 2HYD). The CoBaHMA-containing ABC exporter sequences exhibit canonical ABC conserved motifs in the NBD (Walker A, Walker B, ABC signature), suggesting that they are active transporters (Supplementary Data 1). In contrast to other communities with CoBaHMA domains, the ABC exporter communities do not have the conserved histidine in strand β0, while strands β0 and β1 include several basic AAs (Figure 7A). Linkers of variable length separate the N-terminal CoBaHMA domain from the TMD. Depending on the ABC exporter sequence, these are predicted with lower pLDDT values as random coils or, most often, as TMD hairpins (e.g., A0A2W6BRH6, community C538) or both (e.g., A0A6G4WXH4, community C455 or A0A7V8NHJ8, community C499). It is precisely the length of the linker that makes the difference between the two most populated communities (Figure 7A).
Of note, four small communities, united under the common denominator EcsC (after the name of the gene locus: Effect on exoproteins, defect in competence and sporulation) proteins (Supplementary Table 1A and Supplementary Figure 7, illustrated in Figure 7B with community C663) are related to ABC transport systems. Indeed, in Bacillus subtilis, EcsC is found in an operon with EcsA and EcsB, which are components of an ABC transport system. 58 The AF2 model of the EcsC domain folds as an a priori soluble bundle of TM helices, with most of the AAs characterized by low pLDDT values (Figure 7B). A Foldseek search did not highlight any significant similarity with known 3D structures, suggesting that this domain adopts an as yet uncharacterized fold. The EcsC CoBaHMA domains possess the characteristic His/Arg signature on the first two β-strands.
In conclusion, our results indicate that CoBaHMA domains are also found in a few members of another family of membrane proteins, the ABC exporters, as well as in uncharacterized components of an ABC transport system.

(C) Type 2 phosphatidylglycerol phosphatases (PAP2)
CoBaHMA domains are present in the N-terminus of some PAP2 in two cyanobacterial communities (Supplementary Table 1A and Supplementary Figure 7). Integral membrane proteins from the PAP2 family dephosphorylate a variety of compounds, including lipids and carbohydrates. 59 They consist of a core TMD, with six tightly packed TM helices connected by extramembrane loops, two of which interact together to form the catalytic site (Figure 7C). The sequences of the CoBaHMA-containing PAP2 belong to the lipid phosphatase/phosphotransferase family, as they all contain the conserved tripartite active site motif (KX6RP, PSGH, SRX5HX3D) 60 (Supplementary Data 1). Members of this family modify several types of lipids in Gram-negative bacteria, for example, phosphatidylglycerophosphate for PgpB, 61 or lipid A for LpxE. 62 The AF2 model of the representative sequence resembles the 3D structure of B. subtilis bsPgpB (pdb 6FMX (A); Prob 1.00, 26.9% identity), which contains eight TM α-helices, six of them (α1, α4-α8) being tightly packed, while the α2 helix is amphiphilic, lying at the surface of the lipid bilayer on the active site side (Figure 7C). The linker separating the CoBaHMA domain, located in the intracellular milieu, from the terminal TM α1-helix is variable. It is predicted as a random coil or even as integrating additional TM α-helices (e.g., A0A433NH73, community C542), always with very low pLDDT values. Again, the PAP2 CoBaHMA domains possess the characteristic His/Arg signature on the first two β-strands.

(D) Diacylglycerol kinases (DAGKs)
Although we mostly focused on large communities, two other small communities of cyanobacterial proteins caught our attention, as they are related to bacterial DAGKs (Supplementary Table 1A and Supplementary Figure 7). These enzymes convert diacylglycerol (DAG), formed by the turnover of membrane phospholipids, to phosphatidic acid (PA). 63 In these proteins, the CoBaHMA domain is located at an unusual C-terminal position (Figure 7D,E 63 ). The 3D structure of DgkB has a two-domain architecture, similar to that found in E. coli YegS, which phosphorylates phosphatidylglycerol in vitro. 64 CoBaHMA DAGKs share the conserved P-loop, but exhibit slight differences in the two other conserved motifs, as described in the work of Miller et al. 63 (Figure 7).

However, the HMA_2 profile is longer than CoBaHMA domains, with a total length of 180-190 AAs, and includes in its C-terminal part a conserved region generally predicted by AF2 as two contiguous helices. Matches to this C-terminal region are observed for sequences in 13 large communities (Supplementary Figure 7). For communities having a C-terminal region associated with high AF2 pLDDT values (≥70), the two helices, often predicted as TM segments, pack together to form a hairpin (Figure 8A). However, no obvious similarity with any known 3D structure could be detected by Foldseek in this case, apart from helix hairpins belonging to larger assemblies.
Matches with the HMA_2 profile were also detected in other communities, however, limited to the N-terminal CoBaHMA domain. Figure 7). We analyzed the regions C-terminal to the HMA_2 N-ter/CoBaHMA domain of the remaining 18 large communities, in particular to detect possible distant relationships to the HMA_2-Cter profile (Supplementary Figure 7). Three communities (illustrated in Figure 8B) also possess a hairpin of two helices, often predicted as TM segments, with conserved AAs. However, they differ from those defining the HMA_2-Cter profile, as the C-terminal regions are characterized by highly conserved histidine residues and basic (arginine, lysine) residues. Moreover, a Foldseek search against PDB detected significant similarities (Prob 0.97, 19% identity) between the C202 representative sequence and the long alpha-hairpin domain of a manganese/iron superoxide dismutase (pdb 4BR6), which forms a four-helix bundle through protein dimerization and provides histidine residues to the ion-binding site at the interface with the preceding domain. 65 Four communities possess two HMA_2 N-ter/CoBaHMA domains (Figure 8C). For three out of the four communities, the first of these domains is also followed by a helical hairpin, predicted with low pLDDT values. This helical hairpin contains conserved charged and aromatic AAs. The second CoBaHMA domain is also followed by a helical region, although less defined.
A few communities are characterized by an HMA_2 N-ter/CoBaHMA domain, followed by one or several non-packed α-helices predicted with low pLDDT values (Figure 8D). The C-termini of sequences from three communities are apparently disordered (Figure 8E). However, the first two contain conserved charged residues accompanying hydrophobic clusters, suggesting a hidden fold (Supplementary Figure 9). Interestingly, two communities have a helical hairpin, with conserved charged/polar residues, at their N-terminus, upstream of the HMA_2-Nter/CoBaHMA domain (Figure 8F).
The four remaining communities possessing HMA_2-Nter/CoBaHMA domains followed by more complex C-terminal architectures (C4, C93, C192, C623) are described in the next two sections. Finally, we extended our analysis (in search of hairpin-like motifs) to CoBaHMA-containing sequences that do not match the HMA_2 profile and found five communities with such C-terminal helical hairpins (Figure 8G) or other helical segments (Figure 8H).
In conclusion, our results indicate the presence of a large number of sequences with a region downstream of the CoBaHMA domain predicted to possess two helices, which have a strong propensity to form a hairpin and some of which possess charged residues.
(B) Calcyanins: A (GlyZip)3 motif detected by a dedicated tool (pCALF)
One community (C192) corresponds to calcyanins, in which the CoBaHMA domain was first identified. Some of the CoBaHMA domains of this community match the HMA_2 N-ter profile described above. Calcyanins contain a C-terminal domain, consisting of a threefold repeat of a large glycine-zipper (GlyZip) motif. This large GlyZip motif itself corresponds to a duplicated smaller glycine-zipper motif, interrupted in its middle part by a conserved Gly-Pro dipeptide. AF2 modeled this GlyZip motif as a hairpin of tightly packed helices, consistent with the presence of conserved glycine residues repeated every four residues (Figure 8I). 14,66 The low/very low pLDDT values may be due to the very large sequence distance between this GlyZip motif and known hairpins of this type, present in different architectures (as explored with Foldseek). However, AF2 fails to assemble these GlyZip motifs in a consistent way.

• Other communities
Last, very few large communities other than those described above have been detected. Protein segments associated with CoBaHMA domains correspond to disordered or ordered regions. The order (foldability) in these regions was assessed by examining AF2 predictions, in particular the segments associated with high/very high pLDDT values (≥70), as done for the HMA_2 N-ter-containing proteins. In this way, we also retrieved helical hairpins in communities other than those containing HMA_2 CoBaHMA (Figure 8G). In contrast, low pLDDT values are sometimes indicative of disorder, although in some cases they are associated with genuinely well-folded regions (predicted as folded or as random coil) for which the prediction cannot be supported. This is the case, for example, for new folds and sequences lacking homologs. 37,38 One way to reveal these "hidden" folded domains is to assess foldability using HCA. 37,38 Therefore, we investigated unannotated sequences by combining AF2 structure predictions with HCA of the protein sequences, when needed.
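The selection of segments with high pLDDT values described above can be illustrated by a short sketch. It is not the code used in this study: the function name and the `min_length` parameter are hypothetical, and per-residue pLDDT values are assumed to be on the 0-100 scale used by the AlphaFold database, with 70 as the "high confidence" cutoff.

```python
def confident_segments(plddt, threshold=70.0, min_length=1):
    """Return runs of consecutive residues whose pLDDT is >= threshold.

    Each run is reported as a (start, end) pair of 0-based, inclusive indices.
    Runs shorter than min_length residues are discarded.
    """
    segments = []
    start = None
    for i, value in enumerate(plddt):
        if value >= threshold:
            if start is None:
                start = i  # a new confident run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(plddt) - start >= min_length:
        segments.append((start, len(plddt) - 1))
    return segments


# Hypothetical per-residue values for a nine-residue stretch.
example = [35, 42, 81, 90, 88, 65, 72, 75, 40]
print(confident_segments(example))                # [(2, 4), (6, 7)]
print(confident_segments(example, min_length=3))  # [(2, 4)]
```

Residues outside the returned segments are the ones for which a complementary, sequence-based assessment such as HCA would be needed to distinguish disorder from a folded region that is merely poorly predicted.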
The most populated (non-HMA_2 N-ter) community is predicted by AF2 as a CoBaHMA domain with a C-terminal extension comprising three strands completing the core β-sheet (Figure 9A). A four-helix bundle predicted with high pLDDT values is inserted between the CoBaHMA domain and this C-terminal extension. Interestingly, a Foldseek search with this four-helix bundle, which contains strictly conserved histidine, arginine, and aromatic residues, indicated a possible structural relationship with the MA-MB-M1-M2 block of P1B-ATPases (pdb 4UMW, Prob 0.30, 11.2% identity, Figure 9A). The MB kink (resulting in MB′) is not present in the four-helix bundle, whereas the basic residues (histidine and arginine) are located at the entry site of the P1B-ATPase.
A second community (HMA_2 N-ter) is also predicted by AF2 as composed of an N-terminal CoBaHMA domain and a four-helix bundle (Figure 9B), with conserved AAs (especially arginine and histidine residues) located at the tip of the bundle. However, in this case, no significant structural similarity was found between the representative member of the community and any experimental 3D structure using Foldseek. A third community possesses a C-terminal HMA_2 N-ter CoBaHMA domain, preceded by a globular all-α domain with conserved residues, which likewise shares no obvious similarity with any available experimental 3D structure (Figure 9C). Repeated CoBaHMA domains within a single protein sequence seem to be a relatively frequent feature of the family, since we also observed in a fourth community a C-terminal HMA_2 N-ter/CoBaHMA domain preceded by a tandem of CoBaHMA-like domains (devoid of the conserved basic signature) (Figure 9D). However, it did not contain any other folded domain based on AF2 modeling and HCA. Finally, a fifth community with an HMA_2 N-ter, present in Cyanobacteria, possesses a well-folded C-terminal domain, which shares a clear structural relationship with the bacterial permeability-increasing (BPI) protein-like family, as indicated by Foldseek (e.g., A0A1Z4TPU0 versus pdb 1BP1 [BPI 67], Prob 0.98, 7.5% identity, Figure 9E). Members of the BPI-like family share a common fold, consisting of a long α-helix wrapped in a highly curved sheet made of long, antiparallel β-strands, and display a tubular cavity in which lipids bind. 68,69

| DISCUSSION
The large-scale predictions of 3D structures now enabled by artificial intelligence (AI)-based approaches make it possible to functionally annotate large sets of proteomes at the AA level and to identify new folds. 70 The 3D models provided by AF2 via a dedicated database, 22 connected to UniProt, 29 offer an unprecedented tool for extending the exploration of the universe of protein domains and studying their evolutionary trajectory. Here, we have taken advantage of this large-scale structural information, combined with sequence similarity search and clustering tools, to identify the members of a new family of domains called CoBaHMA. This domain family shares a common evolutionary origin with HMA domains, as evidenced by the signature of a common hydrophobic core. However, CoBaHMA can be discriminated from HMA based on an additional, external β-strand (called β0) and, in many cases, a specific sequence signature, conferring a positive electrostatic charge on one side of the β-sheet, which is likely to be associated with a specific function. The AF2 models thus provided a structural constraint throughout the workflow we built, enabling a fine discrimination between CoBaHMA domains and the rest of the very large HMA superfamily. It should be noted, however, that the structural features we have automatically defined are somewhat sensitive to the choice of cutoffs. As a result, it is possible that highly divergent members of the CoBaHMA family with large loops between the first two β-strands may be overlooked. The methodology developed is also dependent on the availability and accuracy of AF2 models. Finally, the proteins identified are extracted from a non-redundant set of sequences (UniRef30), limited to sequences with no more than 30% identity. Therefore, the CoBaHMA family described in this work constitutes a minimum set, limited to these non-redundant sequences and excluding members for which no AF2 models are available or which are highly divergent.

FIGURE 9 CoBaHMA domains with unknown regions. AF2 three-dimensional (3D) structure models of the representative proteins from communities of CoBaHMA domains associated with unknown regions (ribbon representations), colored according to the pLDDT values. Proteins are referenced with their UniProt accession numbers: (A) C233: U2SKL7, the CoBaHMA domain is followed by a helical bundle, which superimposes with the MA/MB/M1/M2 bundle of P1B-ATPases (left); (B) C4: A0A1H0C0W7, the CoBaHMA domain is followed by a helical bundle, with no striking similarities with any known 3D structures; (C) C93: A0A552LCK1, the CoBaHMA domain is preceded by an all-α domain, with no striking similarities with any known 3D structures; (D) C654: A0A0M0SSI8, three CoBaHMA domains are separated from each other by disordered linkers. The first two domains (italics) lack the basic signature in strands β0 and β1. (E) C623: A0A1Z4TPU0, the CoBaHMA domain is followed by a domain belonging to the BPI family (TULIP superfamily). The AF2 3D structure model is compared after superimposition to the experimental 3D structure of human BPI with two bound phosphatidylcholine molecules (pdb 1BP1). The MSAs of the communities, together with the AF2-predicted secondary structures, are provided in Supplementary Data 1. AF2, AlphaFold2; BPI, bactericidal permeability-increasing protein; CoBaHMA, Conserved Basic residues HMA; TULIP, TUbular LIpid-binding Protein; pLDDT, predicted local distance difference test.
The protocol developed here for the identification of CoBaHMA domains could be applied to any family of globular domains, provided it possesses at least one additional regular secondary structure that distinguishes it from the other members of the superfamily. However, it depends on the availability of good-quality 3D structure models in the AlphaFold database, which in turn depends on the inclusion of homologous sequences, at least in the AF2 predictive scheme. Finally, it is likely that the protocol will have difficulty working with small domains (with only two or three regular secondary structures) and domains made of repeated sequences. Interestingly, we could consider extending this protocol to the definition of new families of domains starting from the register of super-secondary structures or ancient peptides, at the basis of the constitution of superfolds. 71,72

Besides providing information about a novel family of domains, our study illustrates how evolution may operate within a superfold to provide broad functional diversity. Nevertheless, the function of the CoBaHMA family of domains has yet to be defined. Further prediction of this CoBaHMA-specific function, or at least of the biological environment in which it is performed, can be aided by analyzing the domain architecture of the proteins containing the CoBaHMA domain. Here, deciphering this architecture was again aided by the AF2 predictions, combined with domain database (InterPro) annotations. A large part of the CoBaHMA communities corresponds to single CoBaHMA domain proteins. This is reminiscent of the single-HMA domain proteins, which behave as chaperones. 73 CoBaHMA domains are also found in association with a limited number of protein families, which mostly correspond to membrane proteins, at least for those annotated through InterPro profiles and predictors of TM segments. These families were especially analyzed in the large communities we have described here, and they also represent the bulk of the sparsely populated communities (after analysis of the InterPro annotations; see Supplementary Table 1B and Supplementary Figure 8 for details). Only a few other IPRs, in addition to the previously described EcsC and DAGK, are found in these last communities, generally limited to singletons. Of note, CoBaHMA domains are principally located at the N-terminus in multidomain proteins. CoBaHMA domains appear to be specific to bacteria, in contrast to HMA domains, which are found in bacteria, archaea, and eukaryotes. Overall, these observations suggest a membrane-related function of the CoBaHMA domains, specific to bacteria. Moreover, CoBaHMA domains show lineage-specific expansions, as communities are restricted to certain species or phyla. This suggests that they can accommodate unique function(s), linked to specific environments.
Like HMA domains, CoBaHMA domains are particularly abundant in P1B-ATPases. The latter are commonly defined as integral membrane proteins that couple ATP hydrolysis to the transport of metal cations, such as copper, zinc, and cobalt. 56 The specificity of P1B-ATPases toward heavy metals is linked to conserved motifs in the middle of the fourth TM helix (TM4). These motifs directly coordinate the ion through cysteine/histidine side chains. Besides, soluble N- and C-terminal metal-binding extensions (known under the generic term heavy metal-binding domains [HMBDs]), also rich in cysteine/histidine and including HMA domains, seemingly play a regulatory role. 53,54,74 In particular, HMBDs interact with the amphipathic helix MB′, lying at the membrane-cytosol interface at the end of a P1B-specific MA and MB membrane hairpin. 75 This amphipathic MB′ helix is connected to the high-affinity ion-binding site through a conserved electronegative funnel. HMBDs also interact with the cytosolic domains (A- and P-domains), thus playing a potential regulatory role by interfering with conformational changes coupling ATP hydrolysis with ion transport across the membrane. 75 Here, we show that P1B-ATPases are not restricted to heavy metals. Indeed, the proteins identified here share a P1B-specific MA and MB membrane hairpin but contain CoBaHMA domains instead of the usually encountered HMA domains and have conserved motifs in TM4 different from those coordinating heavy metals. These motifs vary depending on the community considered, often including charged (acidic and/or basic) or polar (serine/threonine) residues. Interestingly, a few communities contain the TM4 CPC motif, typical of heavy metal transport, together with a CoBaHMA domain only, and no HMA domain at the N-terminus. This suggests that the coupling of HMA with the TM4 Cys-rich motif is not required, contrary to what was initially thought. Moreover, a large sequence/structure diversity is observed at the level of the MA/MB hairpin, probably linked with the substrate diversity of CoBaHMA-containing P1B-ATPases, as also evidenced by the diversity of the M4 conserved motifs. Finally, it is worth noting that CoBaHMA domains are also found associated with members of the P2A-ATPase family (SERCA, Ca2+ ATPases) and are thus not restricted to the P1B subgroup.
The question remains as to what are the substrate specificities of these P-type proteins associated with CoBaHMA domains, and what is the role of the CoBaHMA domains in this specific modular organization. A regulatory role similar to that played by HMA could be expected, supported by the fact that some AF2 models displayed significant interfaces with A-domains, involving AAs outside the conserved basic patch (Supplementary Figure 10). The identity of the ligands for these charged AAs on the CoBaHMA domain surface of P-type proteins remains to be explored.
Accessory domains provide an additional, often regulatory effect on the functions of ABC exporter cores, which are formed by TMDs and NBDs. 76 For instance, the cytosolic cystathionine β-synthase domains, at the C-terminus of the osmoregulatory OpuA, inhibit the transporter activity by binding to cyclic-di-AMP. 77 This protein is gated by ionic strength, which modulates the interaction of positively charged AAs in the NBDs with negatively charged lipids. 77 Although, as for the P-type ATPases, the specific function of CoBaHMA domains in this ABC context remains to be discovered, a number of points can be considered. First, the ABC transporters with the highest sequence identities (30%) include lipid transporters such as MsbA, suggesting that (i) the CoBaHMA-domain-containing ABC transporters might be involved in lipid transport and (ii) the CoBaHMA domain might be directly involved in the uptake of lipids. Second, the specific position of the domain, N-terminal to the NBD, places it at the right location to interact with the polar heads of lipids, as observed for instance with the lasso domain found in some ABCC transporters such as the cystic fibrosis transmembrane conductance regulator protein (ABCC7) (see Reference 78 for a review). This suggests a specific role in contacting the membrane via the conserved basic patch at the surface of the domain.
The predicted presence of additional TM helices between the CoBaHMA domain and the TMD in most of the ABC communities, as in the ABCC transporters ABCC1 (MRP1) and ABCC8 (SUR1), 57 might serve an additional regulatory purpose. This "lipid hypothesis" is also interesting to consider with regard to the function of the CoBaHMA domain in the context of P1B-ATPases, particularly as specific transport of lipids is carried out by another class of P-type ATPases (P4-type 79,80).
An interesting domain architecture is observed in community C623, which includes a domain belonging to the BPI protein-like family. This family, comprising lipopolysaccharide-binding protein and the lipid transfer proteins CETP and PLTP, includes one or two tandem copies of a fold providing a tubular cavity for the binding of lipids. 68 It was extended to more divergent groups of proteins, including the synaptotagmin-like, mitochondrial, and lipid-binding protein domains, which are associated with eukaryotic membrane processes. 69 All the families so described were grouped into a single superfamily called TUbular LIpid-binding Protein (TULIP). 69,81,82 Cyanobacterial proteins of community C623 bear only one copy of a BPI-like domain, with the CoBaHMA domain likely positioned near a potential lipid-binding site.
The hypothesis of the CoBaHMA domain serving as a binding module for negatively charged lipids (phospholipids) in bacteria is appealing considering the wide knowledge about phospholipid-binding domains in eukaryotes. Indeed, the tray of basic AAs offered by CoBaHMA domains resembles that observed in C2 domains, for instance, which interact with membranes in a Ca2+-dependent manner through a polybasic cluster, with specificity for phosphatidylinositol-4,5-bisphosphate. 83,84 The phosphatidylinositol phosphates that are largely recognized in eukaryotic membranes are also the targets of some bacterial proteins acting as effectors or toxins (e.g., the lipid raft targeting domain of the Bordetella pathogens, 85 also see Reference 86 for a review).
However, bacterial membranes are distinct in lipid composition from eukaryotic membranes, and their lipid-binding modules are far less well known. Phosphatidylglycerol (PG) might be a target for the CoBaHMA domain. This phospholipid is present in both Gram-negative and Gram-positive bacteria and plays a central role in the synthesis of cardiolipin (diphosphatidylglycerol), lysophosphatidylglycerol, and oligosaccharides. 87 Phosphatidic acid (PA) is another potential candidate. It is linked to the activity of DAGK and serves as a precursor for glycerolipids. 88 An appealing hypothesis is that, in addition to being specifically recognized by the CoBaHMA domains, these phospholipids could be transported by membrane systems in which CoBaHMA is included (e.g., ABC exporter), a mechanism which could contribute to the general lipid homeostasis. However, such hypotheses remain highly speculative and need to be extensively tested.
Phosphoglycerolipids are far less abundant in cyanobacterial membranes, with PG being the only phospholipid present. 89 It is present in the thylakoid membranes in low proportion (10%) relative to the more abundant glycerolipids monogalactosyldiacylglycerol, digalactosyldiacylglycerol, and sulfoquinovosyldiacylglycerol. 89 The PG proportion is regulated in response to phosphate availability, 90 and PG is essential not only for photosynthesis 91 but also for cell division and metabolism. 92 We note that CoBaHMA domains are found in cyanobacteria-restricted communities of PAP2 and DAGKs. Both types of enzymes are involved in the biosynthesis of lipids starting from PA.
One of the outstanding features of our domain grammar analysis was the co-occurrence of the CoBaHMA domain not only with already annotated, membrane-specific domains, but also with hairpins of two consecutive helices. These helical hairpins show both a conserved structural motif, as revealed by AF2 models, and a wide variety of sequences: some include many strongly hydrophobic AAs, as in the HMA_2 C-ter profile, and are predicted as forming TM segments; others include small but also globally apolar residues. However, most of them include conserved, charged residues, in particular histidine and basic residues. This suggests that these hairpins may serve as platforms on which AAs can be grafted to interact with specific ions or ligands. From an evolutionary point of view, it is tempting to speculate that these hairpins can be used as basic units for integrating more complex architectures, such as ABC exporter TMDs or those present in calcyanins (three repeats of a glycine-zipper helical hairpin). These calcyanin glycine-zippers are structures characterized by very compact assemblies due to the presence of glycine every four residues, but it remains to be determined whether they are soluble or membrane-bound. One open question is to what extent the MA-MB hairpin specific to P1B-ATPases may have evolved from this basic module. Finally, from a methodological point of view, it would be interesting to consider this CoBaHMA-specific grammar to improve sequence similarity searches, as done for instance by Terrapon et al. 93 and Faure and Callebaut 18 or, more recently, by Buchan and Jones 94 using natural language word embedding techniques.

(H) and β1 (R/K) strands, forming a charged patch on one side of the β-sheet. Despite the clear identification of all these specific sequence and structural features, the function(s) of CoBaHMA domains remain(s) unknown.

2.1 | Structure-guided detection of CoBaHMA domain sequences
(Figure 2A) which identified 68 false-positive domains. Within this whole set, we thus identified a total of 2305 CoBaHMA domains (2290 domains from Uniclust30 + 15 initial probes) within 2227 different proteins (2212 sequences from Uniclust30 + 15 initial probes). The accuracy of the AF2 3D structure models was supported by the average pLDDT values calculated over the whole CoBaHMA domains (Supplementary Figure 2). The vast majority of them are characterized by average values above 70, while only 12 models have values below 50. Nevertheless, the sequences associated with these models possess the CoBaHMA signature and align perfectly with those of well-predicted domains. Most of the sequences were identified during the first two iterations (see Supplementary Figure 3 for details). Only 8 phyla are represented by more than 20 sequences, all being affiliated with Bacteria. Among them, the Proteobacteria, Cyanobacteria, Bacillota, and Actinomycetota phyla are the most represented. During the whole iterative similarity search process, only one sequence from Eukaryotes was detected, during the first transitive search (Chordata, Chondrichthyes class, UniRef100_A0A401TJW7).
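The per-domain average pLDDT mentioned above is a simple mean over the residues of the domain. A minimal sketch follows; the function names and the class labels are hypothetical, the domain boundaries are assumed to be 1-based and inclusive, and the two cutoffs (70 and 50) are the ones quoted in the text, on the 0-100 pLDDT scale.

```python
from statistics import mean


def domain_mean_plddt(plddt, start, end):
    """Mean pLDDT over residues start..end (1-based, inclusive boundaries)."""
    if not (1 <= start <= end <= len(plddt)):
        raise ValueError("domain boundaries fall outside the model")
    return mean(plddt[start - 1:end])


def confidence_class(value):
    """Bin a mean pLDDT with the two cutoffs quoted in the text.

    The labels themselves are illustrative, not taken from the article.
    """
    if value >= 70:
        return "well predicted"
    if value < 50:
        return "poorly predicted"
    return "intermediate"


# Hypothetical six-residue model with a domain spanning residues 2 to 5.
toy_plddt = [40, 60, 80, 90, 100, 30]
domain_value = domain_mean_plddt(toy_plddt, 2, 5)
print(domain_value, confidence_class(domain_value))
```

Averaging over the domain rather than the full-length protein avoids penalizing a well-predicted domain that is flanked by disordered linkers.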

Figure 3 illustrates the conserved sequence patterns of the CoBaHMA domains, derived from the alignment of 1434 sequences representative of the whole domain diversity (see Section 2).
was present in only 27% of the CoBaHMA sequences. A continuum can thus be highlighted in the CoBaHMA family, from sequences that do not have the basic patch to sequences that have the full HxxxRxRxR motif, hinting at the possible existence of sub-families with specific features. Besides this basic patch, strand β1 has a conserved proline-glycine dipeptide at its N-terminus. Strand β3 has an array of conserved small AAs (G/A/T/S) at its N-terminus, as well as an aromatic (Y/F/H) position at its C-terminus, which is oriented toward the hydrophobic core of the CoBaHMA domain in the 3D structure model. Considering their nature (small or apolar) and/or their position (loops or orientation toward the hydrophobic core), all these conserved AAs are likely of structural importance. Finally, two polar positions, occupied by an asparagine (C-terminus of strand β2) and an acidic residue (C-terminus of strand β3), are also worth mentioning. Interestingly, strand β2, which is the farthest from the basic patch located on strands β0 and β1, appears to have less sequence conservation, except for a central position sometimes occupied by a basic residue. This further strengthens the hypothesis that the functional feature of the CoBaHMA domain is carried by strands β0 and β1.
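The continuum between sequences lacking the basic patch and sequences carrying the full HxxxRxRxR motif can be made concrete by counting how many motif positions are satisfied. The sketch below is illustrative only: it assumes that the motif is contiguous in sequence, it accepts lysine as well as arginine at the three basic positions (following the R/K alternative noted for strand β1), and the function names and toy sequences are hypothetical.

```python
import re

# Full motif as written in the text: H, three residues, then R at every
# second position.
FULL_MOTIF = re.compile(r"H.{3}R.R.R")


def has_full_motif(sequence):
    """True if the strict HxxxRxRxR pattern occurs anywhere in the sequence."""
    return FULL_MOTIF.search(sequence.upper()) is not None


def best_basic_score(sequence):
    """Highest number (0 to 4) of motif positions satisfied in a 9-residue window.

    Position 0 must be H; positions 4, 6 and 8 may be R or K (an assumption).
    """
    sequence = sequence.upper()
    best = 0
    for i in range(max(0, len(sequence) - 8)):
        window = sequence[i:i + 9]
        score = int(window[0] == "H") + sum(window[j] in "RK" for j in (4, 6, 8))
        best = max(best, score)
    return best


full = "MSHAAARLRVRGG"     # toy sequence carrying the complete motif
partial = "MSHAAAKLEVRGG"  # toy sequence with one basic position replaced by E
print(has_full_motif(full), best_basic_score(full))
print(has_full_motif(partial), best_basic_score(partial))
```

A graded score of this kind is one simple way to sort sequences along the continuum before looking for sub-families; a position-specific profile derived from the alignment would be the more rigorous alternative.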
, depending on whether the CoBaHMA domain alone constitutes the entire protein (what we call a "single-CoBaHMA domain protein"), or the protein containing it is longer. Moreover, in the latter case, several scenarios can be distinguished: (i) the regions of the proteins apart from the CoBaHMA domain are already annotated by reference to profiles contained in domain databases (InterPro [IPR], Pfam); (ii) they are related to the calcyanin GlyZip motifs (pCALF tool, see Section 2); for the representative sequences of the communities and Supplementary Data 2 for the details in each community). The mean features of the communities and the main features of sequences possessing CoBaHMA domains can be found in Supplementary Data 3 and Data 4, respectively. Most of these 74 large communities are composed of sequences affiliated with different phyla (Supplementary Figure 4). However, 26 of them are specific to one bacterial phylum (Bacillota [8 communities], Proteobacteria [9 communities], Cyanobacteria [9 communities]). Twenty InterPro (IPR) accessions and 11 ECOD accessions were detected in more than one full-length sequence within 23 out of the 74 large communities (Supplementary Table ures 5 and 6, respectively.

FIGURE 4 Modular organization of the full-length CoBaHMA domain proteins. Each line illustrates the representative sequence of a large community (the vertical order follows that of Supplementary Figure 4). Line labels represent the accession of the representative sequence, the community number, and the dominant phylum. Functional (IPR) and structural (ECOD) annotations are indicated along the sequence by thin and large shaded areas, respectively. Each functional category of domain is highlighted by the following color code: CoBaHMA (green), HMA_2 (lime), HMA (yellow), P-type (cyan) and SERCA (dark blue), ABC (orange), Calcyanin Gly-Zip (red), PAP2 (magenta). Membrane regions as identified by deepTMHMM are indicated by gray areas. ABC, ATP-binding cassette; CoBaHMA, Conserved Basic residues HMA; HMA, heavy-metal-associated; SERCA, sarcoplasmic/endoplasmic reticulum Ca2+-ATPase; PAP2, type 2 phosphatidylglycerol phosphatase.

P-type ATPases are composed of a common core of three conserved domains: (i) a discontinuous transport domain (T-domain) made of six membrane-spanning helices (M1-M6) providing the substrate translocation pathway; (ii) an ATP-binding domain (ATPBD, between M4 and M5), which includes the nucleotide-binding domain (NBD, N-domain) and the phosphorylation domain (P-domain); and (iii) an actuator domain (A-domain, between M2 and M3), which is believed to transmit changes in the ATPBD to the T-domain and to drive dephosphorylation. 11 Two additional domains can complete this common core, depending on the P-type ATPase subset considered: (i) the S-domain, which is an auxiliary membrane unit providing support to the T-domain and is located at various positions in the sequence (N- or C-terminal relative to the T-domain); (ii) regulatory (R) domains, which are located at the N-terminus and/or C-terminus and act as intramolecular inhibitors, sensors for transported cations, and/or regulators of cation affinities. 11 Transport is accomplished via a so-called Post-Albers cycle, in which phosphorylation of a conserved aspartate residue in the ATPBD causes the protein to cycle between high (E1)- and low (E2)-affinity ion-binding states. InterPro entries (IPR, Supplementary Table
However, despite this unequivocal structural connection, only 8 out of these 15 communities match the P1B-type IPR profile. The 15 P1B-type communities differ by the architecture of their whole proteins: 5 communities have an N-terminal CoBaHMA domain (illustrated in Figure 6A with C366), 1 community has an N-terminal CoBaHMA + HMA couple (Figure 6B) and 4 communities (illustrated in Figure 6C with C140) are characterized by tandems of CoBaHMA domains forming a continuous β-sheet, with two additional β-strands in the sequence linking them. However, only the C-terminal CoBaHMA domain was detected by our search, while the N-terminal one lacked most of the basic residues.

FIGURE 6 P-type ATPases. AF2 three-dimensional (3D) structure models of the representative proteins from eight P-type ATPase communities (ribbon representations), colored according to modular organization (top). These eight communities summarize the different architectures observed in large communities of P-type ATPases. Domains are designated as in the study of Palmgren and Nissen.11 At the bottom are shown the CoBaHMA domains (left) and the M4 transmembrane helices (right), colored according to the pLDDT values and with the conserved amino acids shown as ball-and-sticks. Proteins are referenced with their UniProt accession numbers: (A) P1B-ATPases with an N-terminal CoBaHMA domain: C366: A0A3P1Y6T0 (for M4, the AF2 3D structure model of D8F5K4 is also shown at right, representative of another C366 sub-community whose M4 signature sequence differs from that of the sub-community to which A0A3P1Y6T0 belongs). C366 is representative of the architecture of C429, C712, C40, and C656. (B) P1B-ATPases with an N-terminal tandem of CoBaHMA + HMA domains: C525: A0A5C7KBI4; (C) P1B-ATPases with an N-terminal tandem of two CoBaHMA domains: C140: A0A1Y6CVZ1. C140 is representative of the architecture of C7, C154, C692. (D) Truncated P1B-ATPases: C20: A0A3S1CNK6. C744 and C710 also belong to this group (lack of N-domain and N- and P-domains, respectively). (E,F) P1B-ATPases with additional domains (E: C550: K8GFG5, F: C658: R5Q7W6); (G) P1B-ATPases with unusual S-domain: C413: R1CS51; (H) cation-transporting ATPases: C556: A0A7C4R520. C556 is representative of the architecture of C42 and C98. The MSAs of the communities together with the AF2-predicted secondary structures are provided in Supplementary Data 1. Consensus motifs of the A-, P- and N-domains are [TS]-G-[DE], DKTGT, and HP, respectively. A-domain, actuator domain; AF2, AlphaFold2; CoBaHMA, Conserved Basic residues HMA; HMA, heavy-metal-associated; N-domain, nucleotide-binding domain; P-domain, phosphorylation domain; pLDDT, predicted local distance difference test.
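The consensus motifs quoted in the Figure 6 legend for the A-, P- and N-domains ([TS]-G-[DE], DKTGT, and HP) are written in a PROSITE-like notation. As a minimal sketch of how such motifs could be located in a protein sequence, the following Python snippet translates the notation into a regular expression and reports 1-based match positions. The function names and the toy sequence are hypothetical illustrations and are not part of the analysis pipeline used in this study.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-like motif (e.g. '[TS]-G-[DE]') into a Python regex.

    Supported syntax (an assumption of this sketch): elements separated by '-',
    'x' for any residue, '[..]' for alternatives, and an optional '(n)' repeat.
    """
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional repeat count, e.g. x(2)
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = m.group(1), m.group(2)
        if core.lower() == "x":
            core = "."  # 'x' stands for any amino acid
        parts.append(core + (f"{{{count}}}" if count else ""))
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched segment) for every occurrence, overlaps included."""
    # A lookahead with a capturing group lets finditer report overlapping matches
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    toy_sequence = "MASGDLLDKTGTLTHPAA"  # made-up sequence, for illustration only
    for motif in ("[TS]-G-[DE]", "DKTGT", "HP"):
        print(motif, find_motif(toy_sequence, motif))
```

The same helper accepts the x(n) repeat notation, so longer signatures can be scanned in the same way. In practice, motif hits should be interpreted in the context of the multiple sequence alignments and the 3D models.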

[…] (Figure 6A). A C-P-C motif is also found conserved in community C525, consistent with the presence of an HMA domain (in addition to the CoBaHMA domain). A large part of the M4 motif within the P1B communities included a D-[YF]-x-[TC]-x(2)-[KRH] sequence, with different […]

[…]), while the N-terminal domain corresponds to the catalytic DAGK (Foldseek matches with the putative DAGK from Bacillus anthracis, pdb 3T5P(B), Hou et al. unpublished) and the DAGK DgkB from Staphylococcus aureus (pdb 2QVL(A) […] They share a common domain in between the DAGK and the CoBaHMA domains, modeled with very low pLDDT values as a long helix or a two-helix hairpin. The C515 single-member community differs from the other DAGKs by an additional C-terminal domain, which is related to the GlyZip motif described below. The DAGK CoBaHMA domains possess the characteristic His/Arg signature in the first two β-strands.

FIGURE 7 ABC exporters, EcsC proteins, PAP2, DAGK. AF2 three-dimensional (3D) structure models of the representative proteins from communities including ABC exporters, EcsC proteins, PAP2, and DAGK, colored according to the pLDDT values. CoBaHMA domains are shown at the bottom, with conserved amino acids shown as ball-and-sticks. Proteins are referenced with their UniProt accession numbers: (A) C538: A0A2W6BRH6, (B) C663: A0A552EVV8, (C) C419: A0A1U7ILD7, (D) C515: A0A0S3UCN3, (E) C553: A0A841V906. The 3D structures of the ABC exporter dimer model (A) were built after superimposition of the AF2 model single chain on the experimental 3D structure of TM287/TM288 (pdb 4Q4H, best hit in a HH-PRED search, respecting the distance between the two swapped TMs and the TMD core). The MSAs of the large communities, together with the AF2-predicted secondary structures, are provided in Supplementary Data 1. CoBaHMA DAGK share the conserved P-loop (ϕ-x-x-G-G-D-G-T-ϕ, where ϕ represents a hydrophobic AA), but exhibit slight differences in the two other conserved motifs (ϕ-ϕ-x-N-P-x-S/A-G instead of ϕ-ϕ-x-N-P-x-S-G and ϕ-ϕ-P-x-G-T-x-N-A-ϕ-x-N instead of ϕ-ϕ-P-x-G-T-x-N-D-ϕ-x-R; side and top of the nucleotide-binding site, respectively). ABC, ATP-binding cassette; AF2, AlphaFold2; CoBaHMA, Conserved Basic residues HMA; DAG, diacylglycerol; DAGK, diacylglycerol kinases; PAP2, Type 2 phosphatidylglycerol phosphatase; pLDDT, predicted local distance difference test.

• Other annotated, non-IPR, communities

(A) HMA_2

Nine hundred and sixty-three CoBaHMA domains overlap the Pfam profile HMA_2 (PF19991). Some communities are almost entirely covered by the profile, while others are only partially covered. This profile is described as distantly related to HMA domains in its N-terminal part, containing in particular the conserved histidine we also highlighted in this study. Not all the CoBaHMA domains match the N-terminal part of the HMA_2 profile, indicating that proteins matching this profile constitute a subset of the CoBaHMA family.

A few of them correspond to communities including single CoBaHMA domain proteins and P-type ATPases (see before, Supplementary […]

FIGURE 8 CoBaHMA domains associated with helical segments, including helical hairpins. AF2 three-dimensional (3D) structure models of the representative proteins from communities of CoBaHMA domains associated with helical segments, including hairpins (ribbon representations), colored according to the pLDDT values. CoBaHMA domains are shown in the same orientation. The communities are grouped according to their matching with the HMA_2 N-ter/C-ter profile (Pfam19991) and the specificities of the segment accompanying the CoBaHMA domain. The AF2 models of the representative members of communities indicated in bold are shown. Proteins are referenced with their UniProt accession numbers: (A) C393: A0A1Z4FYZ0; (B) C202: A0A5C7T941; (C) C331: Q8YVH2 (C628* lacks the central helical segment present in the three other communities); (D) C341: A0A6M0J010; (E) C352: A0A7Y4FMF1; (F) C636: A0A1M4UKF2; (G) C487: H8GQW7G; (H) C740: A0A4R3PQS8; (I) C192: Q8DMV2. The MSAs of the communities together with the AF2-predicted secondary structures are provided in Supplementary Data 1. AF2, AlphaFold2; CoBaHMA, Conserved Basic residues HMA; pLDDT, predicted local distance difference test.