UniLectin, A One‐Stop‐Shop to Explore and Study Carbohydrate‐Binding Proteins

All eukaryotic cells are covered with a dense layer of glycoconjugates, and the cell walls of bacteria are made of various polysaccharides, putting glycans in key locations for mediating protein‐protein interactions at cell interfaces. Glycan function is therefore mainly defined as binding to other molecules, and lectins are proteins that specifically recognize and interact non‐covalently with glycans. UniLectin was designed based on insight into the knowledge of lectins, their classification, and their biological role. This modular platform provides a curated and periodically updated classification of lectins along with a set of comparative and visualization tools, as well as structured results of screening comprehensive sequence datasets. UniLectin can be used to explore lectins, find precise information on glycan‐protein interactions, and mine the results of predictive tools based on HMM profiles. This usage is illustrated here with two protocols. The first one highlights the fine‐tuned role of the O blood group antigen in distinctive pathogen recognition, while the second compares the various bacterial lectin arsenals that clearly depend on living conditions of species even in the same genus. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
The next frontier in understanding protein function is deciphering the "PTM code." Glycosylation, as the most variable post-translational modification, complicates the picture with its own challenging glycocode borne by eukaryotic surface glycoconjugates or microbial membrane polysaccharides. The ability of lectins to non-covalently bind glycans makes these proteins the key readers of information encoded by glycans (Gabius, Siebert, Andre, Jimenez-Barbero, & Rudiger, 2004). The roles of glycans and glycoconjugates are manifold and revealed in various medical, biochemical, and biotechnological applications. Glycoscience is now reaching out to other omics and systems biology following recent technological progress in carbohydrate structure resolution and ab initio synthesis, as well as functional screening methods.
Lectins are diverse within and across all species. This particular class of proteins is considered a major player in many biological processes taking place on the surface of cells. Lectins have been studied for many years using a broad range of technologies, from X-ray crystallography to various screening methods, e.g., microarrays; nonetheless, they are poorly annotated in genomes and no reference database was available before the launch of the UniLectin platform in 2019 (https://www.unilectin.eu).
Structuring lectin knowledge has been challenged by the lack of a reliable classification system that is robust with respect to the continuous production of new sequences and the slow elucidation of corresponding 3D structures. Several criteria for grouping lectins have been proposed based on phylogeny or binding specificity, but none have proven satisfactory (Bonnardel, Perez, Lisacek, & Imberty, 2020). A key aspect is that glycan binding relies on multivalence, which gives lectins a unique capacity to interact with surface glycans. β-Propellers in which each blade contains a glycan-binding site are examples of this situation.
The initial content of UniLectin consisted of manually curated lectin 3D structures together with their interacting glycans grouped in the UniLectin3D module. Then, the potential for prediction from sequence data was validated with tandem repeat lectins. As a proof of concept, β-propellers were categorized by their propeller content, and sequence databases were screened with corresponding Hidden Markov Model (HMM) profiles. Prediction was proven reliable and experimentally tested. These results were compiled in the PropLec module. In 2020, accumulated data set the basis for a robust hierarchical classification built on 35 protein folds and expanding into 109 classes. The latter were used to define 109 HMM profiles and extensively screen NCBI-nr and UniProt. This resulted in the LectomeXplore module, which includes approximately 500,000 lectin predictions across all kingdoms (Bonnardel, Mariethoz, Perez, Imberty, & Lisacek, 2021). This analysis revealed that species distribution is wider than expected. For instance, calnexin, calreticulin, and malectin, involved in quality control during glycoprotein biosynthesis, are not limited to animal species, and cytolysin-like lectins are found elsewhere than fungi. Figure 1 visually describes the chronology of development of the UniLectin platform.
UniLectin is available at https://www.unilectin.eu/. The homepage displays the modules, each one being dedicated to a specific aspect of lectins. A simple cross-module search can be launched from the top window; otherwise, each module includes an advanced search tool. UniLectin relies on standard protein cross-references (e.g., UniProt and PDB) as well as on standard representations of glycans. The latter comply with both the IUPAC condensed and the SNFG (Standard Nomenclature For Glycans) simplified notation (Neelamegham et al., 2019). The SNFG symbolism is now popularized in the literature and in the main chemical compound resources. It has been implemented in PubChem (Kim et al., 2019) since 2020, and is expected in ChEBI in 2022 (Hastings et al., 2016).
In these protocols, usage of the UniLectin3D and LectomeXplore modules is presented.

SEARCHING FOR THE STRUCTURAL DETAILS OF LECTINS BINDING THE O BLOOD GROUP ANTIGEN
UniLectin3D is a curated database that classifies and describes 3D structures of lectins by origin and fold, with cross-links to literature and other databases in glycoscience. Among the 2330 entries (September 2021), more than 60% are complexes between a protein and a carbohydrate ligand, bringing invaluable information on the molecular basis of specificity and affinity. Many lectins recognize glycans present on human tissues, playing a role in cancer, development, immunity, and infection. Structural variations of some of these glycans differentiate individuals, defining for example the basis of the ABO human blood group system (Marionneau et al., 2001). Some pathogens, such as Vibrio cholerae or noroviruses, infect sub-populations with given blood group phenotypes more efficiently, and the lectins involved in their recognition have been well studied (Heggelund, Varrot, Imberty, & Krengel, 2017). This protocol uses bacterial lectins binding the blood group O oligosaccharide (also called the H epitope) as an example to demonstrate how the search function of UniLectin3D supports the analysis of the structural details of this interaction.

Necessary Resources
Hardware

Computer with internet connection
Software

Up-to-date web browser such as Chrome or Firefox. Safari or Edge may be less reliable.
2. Select the UniLectin3D module by clicking on the corresponding blue button on the homepage. This action opens https://www.unilectin.eu/unilectin3D/. The UniLectin3D landing page, shown in Figure 2, offers different search modes (keywords, fields, or glycan ligands). It also displays visual summaries of the module content in the form of sunbursts enabling data browsing. Finally, it shows a hierarchical chart that can be expanded to detail the underlying lectin classification built on folds branching to classes branching to families. In all cases, these charts are interactive and open new windows for accessing the lectin pages.

Searching UniLectin3D with a glycan-based query for H-type 1 epitope
3. Oligosaccharide structures can be searched through the "Glycan Search" option of the UniLectin3D interface. Clicking on this button opens https://www.unilectin.eu/unilectin3D/glycan_search and displays the table of monosaccharides shown in Figure 3. [...] matching 2D structures recorded in ChEBI and in the SNFG symbolic notation in Figure 4. With this information, it is then possible to click on any one of the three monosaccharides that are part of the blood group O(H) epitope or to select all three simultaneously, as displayed in Figure 3 (green boxes). This outputs all oligosaccharides that contain these three monosaccharides and are present in binding sites of lectin structures in UniLectin3D.
5. Click on the Fuc(a1-2)Gal(b1-3)GlcNAc motif to display all lectins for which a three-dimensional structure is available in complex with this ligand or with a longer oligosaccharide comprising this epitope. By default, the result shows lectins in all species, for a total of 11 entries, two of bacterial and nine of viral origin, mainly from noroviruses that attach to blood group O in the intestine during infection. The query can be limited to bacteria only by selecting "Bacterial lectins" in the "Origin" box. This search results in two structures from two different species of Burkholderia, as represented in Figure 5. These two lectins belong to different classes, one adopting a β-propeller fold (AAL-like class) and the other one a β-sandwich/TNF-like fold (TNFα-like fold). Both species are opportunistic bacteria from the Burkholderia cepacia complex responsible for lung infection in immunocompromised or cystic fibrosis patients (Mahenthiralingam, Baldwin, & Dowson, 2008).
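The idea of retrieving a ligand "or a longer oligosaccharide comprising this epitope" can be illustrated offline with a minimal Python sketch. It assumes linear IUPAC condensed strings and uses plain substring matching; this is not how UniLectin3D implements its glycan search, and the function name and the list of ligand strings are hypothetical, chosen only for illustration.

```python
def contains_motif(glycan: str, motif: str) -> bool:
    """Return True if a linear IUPAC condensed motif occurs in the glycan string.

    Assumption: plain substring matching. Branched structures written with
    brackets, and residues whose names share a prefix (e.g. Gal vs. GalNAc at
    the end of a motif), would need a real glycan parser.
    """
    return motif in glycan


H_TYPE_1 = "Fuc(a1-2)Gal(b1-3)GlcNAc"

# Hypothetical ligand strings, for illustration only
ligands = [
    "Fuc(a1-2)Gal(b1-3)GlcNAc",                    # the H-type 1 epitope itself
    "Fuc(a1-2)Gal(b1-3)GlcNAc(b1-3)Gal(b1-4)Glc",  # a longer chain comprising it
    "Fuc(a1-2)Gal(b1-4)GlcNAc",                    # H-type 2, different linkage
]

hits = [g for g in ligands if contains_motif(g, H_TYPE_1)]
for glycan in hits:
    print(glycan)
```

Only the first two strings are kept, since the third differs in the Gal-GlcNAc linkage (b1-4 instead of b1-3).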

Searching UniLectin3D by field search query for H-type 2 epitope
6. For expert users familiar with the IUPAC condensed representation of glycans, a shortcut is exemplified here using the H-type 2 oligosaccharide. In this case, the "Fuc(a1-2)Gal(b1-4)GlcNAc" motif is entered in the "IUPAC condensed" box of the "Search by field" menu and can be directly associated with "Bacterial lectins" in the "Origin" box. As shown in Figure 6, this results in two lectins from two different bacteria, i.e., Burkholderia ambifaria (as seen in step 4) and Photorhabdus asymbiotica, an insect pathogen also able to infect humans (Waterfield, Ciche, & Clarke, 2009). Both adopt a β-propeller fold, but they belong to different classes, i.e., AAL-like (6-bladed β-propeller) for BambL and PLL-like (7-bladed β-propeller) for PHL. [...] Burley et al., 2021), and GlyConnect (Alocci et al., 2018), or by exploring the structural information displayed on the page with two different viewers. Litemol software (Sehnal et al., 2017) is integrated to show the lectin 3D structure, which can be manipulated by holding the mouse to turn the molecule around and, for example, bring the ligand to the fore. PLIP software (Salentin, Schreiber, Haupt, Adasme, & Schroeder, 2015) is integrated to detail atomic interactions between the lectin and the ligand. Mousing over one of the listed interactions on the left updates the view on the right to show where that interaction is located in the structure. In the same way, details of the same BambL interacting with the Fuc(a1-2)Gal(b1-4)GlcNAc type 2 ligand are found at https://unilectin.eu/unilectin3D/display_structure?pdb=3ZZV, and a comparison of both ligands is displayed in Figure 7.
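For users who keep a list of PDB identifiers, the structure page address quoted above can be assembled programmatically. The sketch below is a minimal example that assumes the `display_structure?pdb=` pattern shown for 3ZZV applies to other entries as well; the function name is hypothetical, and the code only builds the address without contacting the server.

```python
from urllib.parse import urlencode

# URL pattern taken from the 3ZZV example in the text; assumed to generalize
BASE = "https://unilectin.eu/unilectin3D/display_structure"


def structure_url(pdb_id: str) -> str:
    """Build the UniLectin3D structure page address for a 4-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB identifier: {pdb_id!r}")
    return f"{BASE}?{urlencode({'pdb': pdb_id})}"


print(structure_url("3zzv"))
```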

Visualization of the structural basis for blood group O recognition by bacterial lectins
In the end, a user can shape new assumptions on blood group O recognition based on a clearer understanding of specificity. Moreover, structural data can be downloaded for further computational studies.

COMPARING THE LECTOMES OF RELATED ORGANISMS IN DIFFERENT ENVIRONMENTS
LectomeXplore is a module of UniLectin dedicated to the exploration of predicted lectins from each of the 109 classes of UniLectin3D. Translated genomes (proteomes) stored in UniProtKB and RefSeq sequence databases have been screened to generate all putative lectins (lectome) for each available species. LectomeXplore can be searched by assessing the extent of a class of lectins in the kingdoms of life. In this protocol, we demonstrate how to analyze the lectome of a specified organism and how to compare it with other species of the same genus, but living in different environments. This is applied to investigating the lectomes of two Pseudomonas species that have been widely characterized, P. syringae, a plant pathogen, and P. fluorescens, a ubiquitous bacterium present in soil with much less pathogenic behavior. More specifically, P. syringae is a plant pathogen that causes damage in a wide range of plants, for example, with large impact on kiwi fruit production in New Zealand and in Italy (Donati et al., 2020). In contrast, P. fluorescens is an environmental bacterium that grows in soil and is in general non-pathogenic. The database also contains strains of P. cichorii, a bacterium first isolated on endives and known to cause green midrib rot of greenhouse-grown lettuces (Cottyn et al., 2009).
The protocol also shows how to investigate a specific lectin class common to Pseudomonas species. The OAA-like lectins were chosen. These lectins were first structurally described in the cyanobacterium Oscillatoria agardhii and considered of pharmaceutical significance due to an antiviral activity related to their high affinity for mannose (Koharudin & Gronenborn, 2011), and were later characterized in Burkholderia and Pseudomonas species.

Necessary Resources
Hardware

Computer with internet connection
Software

Up-to-date web browser such as Chrome or Firefox. Safari or Edge may be less reliable.

Access and search LectomeXplore with a query
1. LectomeXplore is accessible from the UniLectin portal via its corresponding button (https://www.unilectin.eu/predict/). On the LectomeXplore homepage, click on the "search by field" button. The corresponding interface displays search fields spanning lectin characteristics (fold, class) or phylogenetic properties (kingdom, species, ...) to be filled individually or in combination (Fig. 8). Note that a second bioinformatics-dedicated box offers the option to query LectomeXplore with protein or protein properties using accession numbers of different databases.

Data mining for species-specific genomes
2. Fill in the species field of the search window to launch the P. syringae lectome search, and then click on the "Load predicted lectin" button. The search returns 65 strains with predicted lectins spanning six different classes. The same search is then performed replacing the species name by P. fluorescens, yielding 76 strains spread across 13 classes. These distinct distributions are shown in Figure 9. This cross-species comparison seems to indicate that in the narrower niche of plant pathogens, the variety of lectins is reduced. To strengthen this statement, a third Pseudomonas species is searched by typing P. cichorii in the species search field. Only 11 strains and two lectins in two distinct classes are found. Taken together, these three searches show that the lectome of Pseudomonas species living in soil with a more saprophytic and opportunistic behavior, such as P. fluorescens, is more diverse. This diversity could potentially be involved in adhesion to various surfaces.
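The three searches of this step can be summarized side by side. The short Python sketch below only re-tabulates the strain and class counts reported above and ranks the species by class diversity; the data class and variable names are illustrative and not part of LectomeXplore. Note that the numbers of strains differ between species, so this is a descriptive comparison rather than a normalized one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LectomeSummary:
    species: str
    strains: int   # strains with predicted lectins
    classes: int   # distinct lectin classes found


# Counts as reported in step 2 of this protocol
summaries = [
    LectomeSummary("P. syringae", 65, 6),
    LectomeSummary("P. fluorescens", 76, 13),
    LectomeSummary("P. cichorii", 11, 2),
]

# Rank species from the most to the least diverse lectome
ranked = sorted(summaries, key=lambda s: s.classes, reverse=True)
for s in ranked:
    print(f"{s.species:15s} strains={s.strains:3d} classes={s.classes:2d}")
```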

Searching for a specific lectin in selected genomes
3. The search can then be focused on a lectin of interest, such as the OAA-like lectin that was predicted in both P. syringae and P. fluorescens lectomes (see Fig. 9). In the search window, type OAA-like in the "class" field and Pseudomonas in the "genus" field. This results in the listing of 90 genes, in exactly 90 Pseudomonas genomes in the database, as shown in Figure 10. The monogenic character of this lectin is rather unusual. Interestingly, 20 of these 90 genes are not correctly annotated, and the lectin is not functionally identified in Pfam (Mistry et al., 2021).
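The observation of 90 genes in exactly 90 genomes amounts to checking that each genome carries a single copy. Assuming the result list has been exported as (genome, protein) pairs, a minimal check could look like the sketch below; the export format, the identifiers, and the function names are hypothetical.

```python
from collections import Counter


def genes_per_genome(hits):
    """Count predicted lectin genes per genome from (genome_id, protein_id) pairs."""
    return Counter(genome for genome, _ in hits)


def is_monogenic(hits) -> bool:
    """True if at least one genome is listed and every genome has exactly one gene."""
    counts = genes_per_genome(hits)
    return bool(counts) and all(n == 1 for n in counts.values())


# Hypothetical identifiers, for illustration only
example_hits = [("genome_A", "prot_1"), ("genome_B", "prot_2"), ("genome_C", "prot_3")]
duplicated = example_hits + [("genome_A", "prot_4")]

print(is_monogenic(example_hits))  # one gene in each genome
print(is_monogenic(duplicated))    # genome_A now carries two genes
```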

Analyzing the quality of the prediction
4. As mentioned above, 20 out of 90 lectins are poorly annotated. Clicking on the magnifying glass icon next to "uncharacterized protein" will display all corresponding entry summaries. Then, clicking on the top right corner of the listed P. putida entry opens a new page with further information on the alignment of this putative lectin with the reference of the class, here OAA-like. This "Amino acid conservation" also singles out the specific amino acids involved in carbohydrate binding. Figure 11 details this information and shows the match of five out of six binding residues. The sequence logo is another guide for assessing the quality of the alignment. As a matter of fact, many entries in LectomeXplore are named "uncharacterized," "putative," or "hypothetical" proteins (approximately 20% of lectins predicted with a score above 0.25). The present example illustrates how lectins of potential interest can be "mined" in genomes where they were not precisely identified.
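The "five out of six binding residues" check can be reproduced on any pairwise alignment once the binding columns of the reference are known. The following Python sketch assumes two aligned strings of equal length and a list of 0-based alignment columns; the sequences and column indices are invented for illustration and do not correspond to the OAA-like reference or to the P. putida entry.

```python
def binding_site_matches(ref_aln: str, query_aln: str, binding_columns):
    """Count binding positions where the query carries the same residue as the reference.

    ref_aln and query_aln are aligned sequences of equal length ('-' marks a gap);
    binding_columns are 0-based alignment columns of the reference binding residues.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    matched = sum(
        1
        for col in binding_columns
        if ref_aln[col] != "-" and ref_aln[col] == query_aln[col]
    )
    return matched, len(binding_columns)


# Invented alignment and binding columns, for illustration only
ref_aln = "WGRNQEGHLY"
query_aln = "WGKNQEGHLF"
binding_columns = [0, 2, 3, 5, 7, 8]

matched, total = binding_site_matches(ref_aln, query_aln, binding_columns)
print(f"{matched} of {total} binding residues conserved")
```

In this toy case, only column 2 differs (R in the reference, K in the query), so five of the six binding positions are conserved.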

GUIDELINES FOR UNDERSTANDING RESULTS
UniLectin was released for the first time in 2019, and unavoidably still faces teething problems in 2021. The content has grown substantially, and issues tend to be specific to each module gradually added to the platform.
The UniLectin3D module is a database of annotated three-dimensional structures of lectins, and, as such, strongly depends on information stored in the PDB. In this context, users should be aware that while crystallography is the main method for structure determination, oligosaccharides are not always treated with the same care as their protein counterparts. Frequent errors in ring shapes and glycosidic linkage conformation have been spotted (Agirre, Davies, Wilson, & Cowtan, 2015). However, cooperation between glycoscientists and the PDB has been fruitful, and recent structures have been subjected to more stringent controls.
The LectomeXplore module is destined to evolve rapidly. For the time being, lectome prediction is based on 109 classes reflecting currently available structural information. Consequently, prediction misses potential motifs that are present in lectins that have not been structurally characterized. The number of classes should grow with the inclusion of new glycan-binding information that will undoubtedly be generated in the years to come. Furthermore, the identification of putative lectins is based on HMM models, and therefore on sequence similarity over the whole domain of interest, but the definition of this similarity may be hampered by the frequent occurrence of tandem repeats, i.e., repeats of more or less conserved stretches of 30-60 amino acids. This situation occurs in several families of lectins in which multivalency is achieved by tandem repeats, probably through a gene duplication process (Notova, Bonnardel, Lisacek, Varrot, & Imberty, 2020). This problem was overcome by designing HMM motifs based on aligning individual repeats in the lectins of interest, and resulted in creating two modules: PropLec, which covers β-propellers and whose predictions supported the identification of a novel architecture, and TrefLec, dedicated to β-trefoil lectin prediction.

Figure 11
Entry page of the OAA-like lectin predicted to occur in the P. putida genome.
Finally, motif detection is run in parallel for each class and the highest score is kept. In this way, similarity reflected in high HMM scores does ensure that the predicted and target proteins share the same fold, but not necessarily the same function, i.e., glycan binding ability and specificity. This can result in the prediction of a false positive, especially in the case of ubiquitous folds occurring in some lectins and many other proteins, such as the immunoglobulin domain. In such cases, the conservation of not only the whole domain sequence, but also of the amino acids involved in carbohydrate binding, should be carefully considered.
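The rule that "the highest score is kept" across classes can be written in a few lines. The sketch below assumes that per-class scores for one protein are available as a dictionary; the class names are taken from this article, whereas the score values are invented and the function name is hypothetical.

```python
from __future__ import annotations


def best_class(scores: dict[str, float]) -> tuple[str, float]:
    """Return the lectin class with the highest score for one protein."""
    if not scores:
        raise ValueError("no class scores provided")
    name = max(scores, key=scores.get)
    return name, scores[name]


# Invented scores for a single protein, for illustration only
scores = {"OAA-like": 0.62, "AAL-like": 0.18, "PLL-like": 0.07}

assigned_class, assigned_score = best_class(scores)
print(assigned_class, assigned_score)
```

As stressed above, the retained class indicates a shared fold only; conservation of the binding residues still needs to be inspected before inferring glycan binding.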

Background Information
The UniLectin platform was designed to address the data integration and classification issues encountered by researchers in glycobiology. Several other databases had been previously developed. The Lectin Frontier Database collecting 400 lectins with curated information (Hirabayashi, Tateno, Shikanai, Aoki-Kinoshita, & Narimatsu, 2015) is now integrated in GlyCosmos, a newly developed portal for accessing and exploring knowledge in glycobiology (Yamada et al., 2020). LectinDB is focused on plant lectins (Chandra et al., 2006). SugarBindDB describes pathogen lectins and their corresponding glycan targets (Mariethoz et al., 2016). Most recently, ProCarbDB was released with more than 5000 3D X-ray crystal structures of protein-carbohydrate complexes and annotated carbohydrate ligands (Copoiu, Torres, Ascher, Blundell, & Malhotra, 2020). Most of these lectin-dedicated databases only partially cover the topic, and, in general, are not based on a solid classification. Lastly, they do not necessarily comply with the latest standards for glycan nomenclature and visualization.
We recently demonstrated the efficiency of LectomeXplore for exploring the genomes of organisms in relation to biological questions. The analysis of the vaginal microbiome and its variations that impact women's health demonstrated that vaginal bacterial species associated with infection and inflammation produce a larger variety of lectins. The size of the predicted lectome and the variety of putative specificities correlated with pathogenicity. The UniLectin portal also contains the MycoLec module, a publicly available and searchable database for fungal lectomes, based on genomes available in the JGI MycoCosm (Grigoriev et al., 2014) and NCBI (Sayers et al., 2020) databases. The lectome content of this module led to correlation of phylogeny and the very diverse lifestyles of fungi, from saprophytes to symbionts and pathogens (Lebreton et al., 2021).
Curated and predicted lectin data are now available to the community and have already fed several machine-learning approaches to detect the structural basis of specificity or novel lectin function. In particular, the bulk data of UniLectin3D was recently used in a comprehensive analysis of lectin-glycan interactions aiming at revealing the determinants of lectin specificity (Mattox & Bailey-Kellogg). LectinOracle combined a powerful glycan-specific deep-learning engine (Burkholz, Quackenbush, & Bojar, 2021) with a transformer-based protein sequence language model to predict glycan-protein interactions (Lundstrøm, Korhonen, Lisacek, & Bojar, 2021). The sequence language model was built from UniLectin3D and LectomeXplore.

Critical Parameters
Except for UniLectin3D, modules of UniLectin were created with HMM profiles for each specific class in each context. A critical parameter is therefore the prediction score, which involves the HMM p-value (<10⁻²) and the evaluation of the amino acid sequence alignment generated by HMMER during the search. More specifically, at each position of the alignment, a cumulative counter is incremented depending on amino acid similarity to set the basis of a normalized score (see Bonnardel et al., 2021, for details). Predictions above 0.5 are highly reliable. Nonetheless, predicted sequences with a score between 0.20 and 0.50 can reveal unknown lectins. Further checks are then needed, such as the cautious examination of other potentially present domains and of amino acid conservation at binding residue positions.

Table 1 lists potential errors that may arise with the protocols in this article, along with their potential sources and solutions.
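The score ranges given under Critical Parameters can be turned into a simple triage rule when post-processing a list of predictions. The sketch below assumes the normalized score lies between 0 and 1 and treats a score of exactly 0.5 as belonging to the lower range; both points, as well as the function name and the returned labels, are assumptions made for illustration.

```python
def triage(score: float) -> str:
    """Label a prediction according to the score ranges discussed in the text.

    Assumption: the normalized score lies in [0, 1].
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score outside the assumed [0, 1] range: {score}")
    if score > 0.5:
        return "highly reliable"
    if score >= 0.20:
        return "candidate: check other domains and binding residues"
    return "below the ranges discussed"


for value in (0.62, 0.35, 0.10):
    print(value, "->", triage(value))
```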