The C-type lectin-like domain superfamily

Authors

  • Alex N Zelensky,

    1. Computational Proteomics and Therapy Design Group, John Curtin School of Medical Research, Australian National University, Canberra, Australia, Subdivision: Proteomics
  • Jill E Gready

    1. Computational Proteomics and Therapy Design Group, John Curtin School of Medical Research, Australian National University, Canberra, Australia, Subdivision: Proteomics

J. E. Gready, Computational Proteomics and Therapy Design Group, Division of Molecular Bioscience, John Curtin School of Medical Research, PO Box 334, Canberra ACT 2601, Australia
Fax: (+)61 2 6125 0415
Tel.: (+)61 2 6125 8303
Website: http://jcsmr.anu.edu.au/dbmb/gready/gready.htm

Abstract

The superfamily of proteins containing C-type lectin-like domains (CTLDs) is a large group of extracellular Metazoan proteins with diverse functions. The CTLD structure has a characteristic double-loop (‘loop-in-a-loop’) stabilized by two highly conserved disulfide bridges located at the bases of the loops, as well as a set of conserved hydrophobic and polar interactions. The second loop, called the long loop region, is structurally and evolutionarily flexible, and is involved in Ca2+-dependent carbohydrate binding and interaction with other ligands. This loop is completely absent in a subset of CTLDs, which we refer to as compact CTLDs; these include the Link/PTR domain and bacterial CTLDs. CTLD-containing proteins (CTLDcps) were originally classified into seven groups based on their overall domain structure. Analyses of the superfamily representation in several completely sequenced genomes have added 10 new groups to the classification, and shown that it is applicable only to vertebrate CTLDcps; despite the abundance of CTLDcps in the invertebrate genomes studied, the domain architectures of these proteins do not match those of the vertebrate groups. Ca2+-dependent carbohydrate binding is the most common CTLD function in vertebrates, and apparently the ancestral one, as suggested by the many humoral defense CTLDcps characterized in insects and other invertebrates. However, many CTLDs have evolved to specifically recognize protein, lipid and inorganic ligands, including the vertebrate clade-specific snake venoms, and fish antifreeze and bird egg-shell proteins. Recent studies highlight the functional versatility of this protein superfamily and the CTLD scaffold, and suggest further interesting discoveries have yet to be made.

Abbreviations
CRD, carbohydrate recognition domain
CTLD, C-type lectin-like domain
CTLDcp, CTLD-containing protein
DC-SIGN, dendritic cell-specific ICAM-grabbing nonintegrin
EST, expressed sequence tag
MBP, mannose-binding protein
NK, natural killer cell
PSP, pulmonary surfactant protein
PTR, protein tandem repeat

Introduction

The superfamily of proteins containing C-type lectin-like domains (CTLDs) is a large group of extracellular Metazoan proteins with diverse functions. It has been the subject of a few general reviews [1,2] and of many more that focus on particular functions (e.g. [3,4]). There are also several systematic studies [5–9]. A classification of the family members based on the overall domain architecture of the CTLD-containing proteins (CTLDcps), which was introduced by Drickamer in 1993 [2] and updated recently [6], has served as a useful framework for studies of the superfamily. However, despite a voluminous literature describing some of the family's properties in great detail, we feel that a fresh critical review would be useful, as the previous review of this scale was published more than a decade ago [2]. Our approach has several main goals, outlined below.

The literature is strongly biased towards several groups of mammalian proteins, many of them of particular biomedical interest. In this review we have tried to capture the superfamily in all its variety, rather than to describe the known members in proportion to the amount of published data. In particular, we wanted to integrate the results of the systematic studies of CTLDs from lower vertebrates, such as snake venom proteins and fish CTLDs, with the classification of mammalian CTLDs. The recent inclusion of new CTLDcp groups inspired a critical reassessment of the principles on which the current domain-based classification was built. We also wanted to summarize the functional data on invertebrate CTLDs, which to our knowledge have never been reviewed previously at a general level.

In addition, numerous structural studies of CTLDs in the last decade have provided much information on the inner workings of the fold and the mechanisms of Ca2+-dependent carbohydrate binding. We have attempted to generalize these data and outline the most common elements of the domain. An important correlation between the residue composition of the primary carbohydrate-binding site and its basic specificity towards mannose- or galactose-group monosaccharides was discovered early in the history of CTLD studies and remains the most useful means for CTLD-function prediction. However, several models suggested to explain the mechanisms of such a correlation had to be rejected as the volume of data grew, and no comprehensive explanation of this fundamental phenomenon has been published. Our goal was to analyze the current state of the literature on this problem, to see if an explanation is apparent.

Finally, we wanted to address the inconsistencies of the terminology of the CTLDcp superfamily which exist in the literature, and to suggest clear definitions for the relevant terms.

The CTLD superfamily

A brief history of discovery

C-type lectins were among the first animal lectins discovered. Bovine conglutinin, which belongs to the collectin group of C-type lectins, has been known since 1906, and agglutinating activity of the snake venom lectins was first described much earlier, in 1860 [10]. In 1988 Drickamer proposed organizing animal lectins into several categories, and classified Ca2+-dependent lectins structurally similar to the asialoglycoprotein receptor as the C-type lectin group [11]. Since then, the known family has grown significantly, and now includes more than a thousand identified members (including those known from genome sequences only) from different animal species, most of which lack lectin activity.

Term definitions: CTLD, CRD, C-type lectin

The terms ‘C-type lectin’, ‘carbohydrate recognition domain’ (CRD), ‘C-type lectin domain’ (CTLD) and ‘C-type lectin-like domain’ (also abbreviated as CTLD) are often used interchangeably in the literature, which may be a source of confusion. The history of the introduction and the common meanings of the terms are outlined below, followed by the definitions we will use in this review.

The term ‘C-type lectin’ was introduced to distinguish a group of Ca2+-dependent (C-type) carbohydrate-binding (lectin) animal proteins from the other (Ca2+-independent) types of animal lectins. When the structures of C-type lectins were established biochemically and functions of different domains were defined, it was found that carbohydrate-binding activity was mediated by a compact module – the ‘carbohydrate-recognition domain’ (CRD) – which was present in all Ca2+-dependent lectins but not in other types of animal lectins [11–13]. Comparison of CRD sequences from different C-type lectins revealed conserved residue motifs characteristic of the domain [2,11,13], which allowed discovery of many more proteins that contained it. At the same time, crystallographic studies confirmed that the CRD of the C-type lectins has a compact globular structure, which was not similar to any known protein fold [14]. This domain has been called ‘C-type CRD’ or ‘C-type lectin domain’. As the number of determined sequences grew, it became clear that not all proteins containing C-type CRDs can actually bind carbohydrates or even Ca2+. To resolve the contradiction, a more general term, ‘C-type lectin-like domain’, was introduced to refer to such domains [1,3]. The usage of this term is, however, somewhat ambiguous, as it is used both as a general name for the group of domains with sequence similarity to C-type lectin CRDs (regardless of their carbohydrate-binding properties), and as a name for the subset of such domains that do not bind carbohydrates, with the subset that does bind carbohydrates being called C-type CRDs [6,8]. Also, both ‘C-type CRD’ and ‘C-type lectin domain’ are still being used in relation to the C-type lectin homologues that do not bind carbohydrate (e.g. [15–17]), and the group of proteins containing the domain is still often called the ‘C-type lectin family’ or ‘C-type lectins’, although most of them are not in fact lectins.
The abbreviation CRD is used both in the meaning of ‘C-type carbohydrate-recognition domain’ and in a more general meaning of ‘carbohydrate-recognition domain’, which encompasses domains from different lectin groups [8]. Occasionally CRD is also used to designate the short amino-acid motifs (i.e. amino-acid domain) within CTLDs that directly interact with Ca2+ and carbohydrate (e.g. [18]).

Structure comparisons add another meaning to the definition of the C-type lectin domain, as structural similarities have been discovered between C-type lectin CRDs and protein domains that did not show significant sequence similarity to any of the known C-type lectins but adopted a similar fold [19–23]. As the fold is very unusual, these domains have been separated into a common group in structure classification databases. For example, in the SCOP database [24] C-type lectins and structurally related domains are grouped at the fold level (‘C-type lectin-like fold’), which is the second level from the top of the classification hierarchy. However, although the structural similarity is often acknowledged in the literature, the common meaning of the C-type lectin-like domain does not include these domains [1,6].

Here we will use the term ‘C-type lectin-like domain’ (CTLD) in its broadest definition to refer to protein domains that are homologous to the CRDs of the C-type lectins, or that have a structure resembling that of the prototypic C-type lectin CRD. Proteins harboring this domain will be called CTLD-containing proteins (CTLDcps) instead of the more common ‘C-type lectins’, as the latter implies a carbohydrate-binding ability that most of the CTLDcps are not known to possess.

Phylogenetic distribution, groups

With a few exceptions, which will be discussed below, CTLDs are found only in Metazoa, where they occur extracellularly. The domain has been a very popular evolutionary framework for generating new functions and is found in various structural and functional contexts. CTLDcps are ubiquitous in multicellular animals, and are found in a broad range of species, from sponges to human [6,25]. CTLDcp-encoding genes have been found in all fully sequenced Metazoan genomes, generally in large numbers. For example, the CTLD is the 7th most abundant domain family in Caenorhabditis elegans [26]. The family shows both evolutionary flexibility and conservation. Whole-genome studies have shown that although there are virtually no similarities between CTLDcps from worm, fruit fly and vertebrates [8], relatively few modifications occurred within the vertebrate lineage during evolution from fish to mammals [9], with some members showing sequence conservation approaching that of histones.

Non-Metazoan CTLDs

There are several interesting examples of non-Metazoan CTLDcps, which can be divided into two groups. Members of the first group come from parasitic bacteria and viruses; these are involved in interactions with the animal host and are either hijacked host proteins or their imitations. This group includes bacterial toxins (pertussis toxin [23] and proaerolysin [22]), outer membrane adhesion proteins (intimin from enteropathogenic Escherichia coli [21] and invasin from Yersinia pseudotuberculosis [27]) and viral proteins. Viral CTLDcps are either transmembrane proteins or structural envelope proteins, and include, for example, eight ORFs in the fowlpox virus genome [28], proteins from vaccinia virus [29,30], African swine fever virus [31], cowpox virus [32], avian adenovirus gal1 [33], myxoma virus [34], molluscum contagiosum [35], Epstein-Barr virus [36], and alcelaphine herpesvirus [37]. Unlike bacterial CTLDs, which were assigned to the CTLD superfamily on the basis of structural similarity only, viral proteins contain a canonical CTLD with significant similarity to those in mammalian CTLDcps.

While the presence of CTLDcps in parasites has an obvious rationalization, the origin of another group of non-Metazoan CTLDcps is unclear. We have found three proteins that can be assigned to this group: two proteins from plants, and a putative protein encoded by an ORF from a marine planctomycete Pirellula sp. (GenBank ID:32443381). The latter sequence, which is 7716 amino acids long and is encoded by the biggest ORF in the genome of that bacterium [38], contains several C-type lectin-like, laminin G and cadherin domains, all of which are found almost exclusively in Metazoa. The most parsimonious explanation of the presence of all these domains in the Pirellula genome is horizontal gene transfer, but the function of the protein harboring them is a mystery, as Pirellula are free-living species. The plant CTLDcp sequences originate from the Arabidopsis thaliana genome annotation (transcript IDs At4g22160 and At1g52310) and are not characterized functionally. At1g52310 is a transmembrane protein with a typical CTLD in the extracellular domain and a protein kinase domain in the cytoplasmic part; it has a well-conserved orthologue in the rice genome sequence.

It is not absolutely clear whether the CTLD superfamily is monophyletic, as homology between the canonical and some of the compact CTLDs (see below) cannot be confidently established. There seems little doubt that the Link domain group of CTLDs has emerged as a result of a deletion of the long loop region from an ancestral canonical CTLD, because the Link domains have a much narrower phylogenetic distribution (only found in vertebrates), are less diverse, and show detectable sequence similarity to the canonical CTLDs [19]. However, the evolutionary relationship of the compact CTLDs from the bacterial toxins to the animal CTLDs is uncertain [39]. These domains could either have been acquired by horizontal transfer or could have arisen by convergent evolution, as mimicry of host proteins.

The CTLD fold

The CTLD fold has a double-loop structure (Fig. 1). The overall domain is a loop, with its N- and C-terminal β strands (β1, β5) coming close together to form an antiparallel β-sheet. The second loop, which is called the long loop region, lies within the domain; it enters and exits the core domain at the same location. Four cysteines (C1-C4), which are the most conserved CTLD residues, form disulfide bridges at the bases of the loops: C1 and C4 link β5 and α1 (the whole domain loop) and C2 and C3 link β3 and β5 (the long loop region). The rest of the chain forms two flanking α helices (α1 and α2) and the second (‘top’) β-sheet, formed by strands β2, β3 and β4. The long loop region is involved in Ca2+-dependent carbohydrate binding, and in domain-swapping dimerization of some CTLDs (Fig. 2), which occurs via a unique mechanism [40–44].

Figure 1.

CTLD structure. A cartoon representation of a typical CTLD structure (1k9i). The long loop region is shown in blue. Cystine bridges are shown as orange sticks. The cystine bridge specific for long form CTLDs (C0-C0′) is also shown.

Figure 2.

Variation of the long loop region structure. Three common forms of the CTLD long loop region are shown. Panels (A) and (C) show canonical CTLDs in which the long loop region is tightly packed (A) or flipped out to form a domain-swapping dimer (C). A compact CTLD from human CD44 Link domain is shown in panel (B). The core domain and long loop region are colored green and blue, respectively.

The conserved positions involved in CTLD fold maintenance and their structural roles have been discussed in detail elsewhere [5]. In addition to the four conserved cysteines, one other sequence feature needs to be mentioned here: the highly conserved ‘WIGL’ motif. It is located on the β2 strand and serves as a useful landmark for sequence analysis.

Variations of the fold: canonical, compact, long, short

Structurally, CTLDs can be divided into two groups: canonical CTLDs having a long loop region, and compact CTLDs that lack it (Fig. 2). The second group includes Link or protein tandem repeat (PTR) domains [19,20] and bacterial CTLDs [27,39,45]. Another family usually included in the CTLD superfamily is that of endostatin [1,24,46]. However, in the comparative structure analysis [5], we did not find substantial similarity between the CTLD and endostatin folds, apart from the general topology. As sequence similarity between endostatin and CTLDs is also absent, we do not regard the endostatin fold as an example of a CTLD and do not consider it further.

Another subdivision of CTLDs is based on the presence of a short N-terminal extension, which forms a β-hairpin at the base of the domain (Fig. 1). The CTLDs containing such an extension are called ‘long form’. The hairpin is stabilized by an additional cystine bridge, and the presence of these two additional cysteines at the beginning of the CTLD sequence is used to distinguish between long and short form CTLDs in sequence analysis. No systematic study of the N-terminal extension, or of its possible roles, has been published.
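The cysteine-based distinction between long and short forms can be illustrated with a minimal Python sketch. It is a toy heuristic under stated assumptions, not a published method: the input is assumed to be already trimmed to the domain boundaries; the literal string `WIGL` is used as the β2 landmark (real sequences may deviate from it); and C1 is assumed to be the only conserved cysteine N-terminal to that landmark in a short-form domain, so that a long-form domain carries three (C0, C0′ and C1). The function name and the example sequences are hypothetical.

```python
def classify_ctld_form(domain_seq, landmark="WIGL"):
    """Guess 'long' or 'short' form from cysteines N-terminal to the landmark.

    Returns None when the landmark is absent or when the cysteine count is
    neither 1 (C1 only) nor 3 (C0, C0' and C1), i.e. when the heuristic
    cannot decide.
    """
    seq = domain_seq.upper()
    pos = seq.find(landmark)
    if pos == -1:
        return None  # no landmark: nothing to anchor the count on
    n_cys = seq[:pos].count("C")
    return {1: "short", 3: "long"}.get(n_cys)


# Made-up toy sequences, not real CTLDs.
print(classify_ctld_form("AACAAWIGLAA"))    # one upstream cysteine
print(classify_ctld_form("CACAACAWIGLAA"))  # three upstream cysteines
```

Any additional, non-conserved cysteine upstream of the landmark makes the sketch return `None` rather than guess, which is the intended conservative behaviour.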

Secondary structure element numbering

Although the CTLD fold is very well conserved among its known representatives, there is no general agreement on the numbering of CTLD secondary structure elements in the literature. The secondary structure element numbering scheme in the first solved CTLD structure (rat MBP-A [14]) included five strands, two helices and four loops. However, this description turned out to be insufficient, as MBP lacks some secondary structure elements that are present in long-form CTLD structures, while other small strands were not defined. Other reports describing the structures of CTLDs that have a different number of secondary structure elements than MBP-A either introduced their own numbering (β strands 1–6 in the asialoglycoprotein receptor, ASGPR [47]; six β strands in the Link module, with labeling not consistent with ASGPR or MBP-A [20]; β1–β7 in NKG2D [48]; β1–β8 in EMBP [49]), or extended the secondary structure element naming scheme used for MBP-A (Ly49A secondary structure element numbering is consistent with that in MBP-A [50]). For consistency we will use a universal numbering scheme ([5], Fig. 3), taking the same approach as was used in the Ly49A structure; this allows both direct reference to the most studied CTLD structures (MBP-A and -C) and assigns individual numbers to the elements that are present throughout the family. Other elements will be given derived names and numbers: the β strand specific for the long-form CTLD is labeled β0, the short β strand between α1 and α2 is labeled β1′, and the two β strands forming a hairpin C-terminal to β2 are labeled β2′ and β2′′.

Figure 3.

CTLD secondary structure element numbering. Ribbon diagrams for a compact (intimin, 1f00) and a canonical (E-selectin, 1g1t) CTLD structure. The long loop region in E-selectin, and the short α helix, which replaces the long loop region in compact CTLDs, are shown in black. Secondary structure elements are numbered according to the universal numbering scheme [5].

Ca-binding sites

Four Ca2+-binding sites are found in CTLDs

Four Ca2+-binding sites recur in CTLD structures from different groups (Fig. 4). The site occupancy depends on the particular CTLD sequence and on the crystallization conditions [14,51]; in different known structures zero, one, two or three sites are occupied. Sites 1, 2 and 3 are located in the upper lobe of the structure, while site 4 is involved in salt bridge formation between α2 and the β1/β5 sheet.

Figure 4.

Ca-binding sites in CTLDs. Shown are ribbon diagrams of two representative CTLD structures, rat MBP-A and human ASGPR-I, demonstrating the four typical locations of calcium ions in the CTLD. Ca2+ ions are shown as black spheres, and numbers referenced to the different sites in the text are indicated next to the arrows.

Sites 1 and 2 were observed in the structure of rat MBP-A complexed with holmium, which was the first CTLD structure determined [14]. Site 3 was first observed in the MBP-A complex with Ca2+ and oligomannose asparaginyl-oligosaccharide [51]. It is located very close to site 1 and all the side chains coordinating Ca2+ in site 3 are involved in site 1 formation. As biochemical data indicate that MBP-A binds only two calcium atoms [52], Ca2+-binding site 3 is considered a crystallographic artifact [51]. However, in many CTLD structures where site 1 is occupied, a metal ion is also found in site 3; examples include the structures of DC-SIGN and DC-SIGNR [53], invertebrate C-type lectin CEL-I [54], lung surfactant protein D [55] and the CTLD of rat aggrecan [56]. It is interesting to note that molecular dynamics simulations of the MBP-A/mannose complex suggested that Ca2+-3 is involved in the binding interaction [57].

Ca-binding site 2 is involved in carbohydrate binding

Residues with carbonyl side chains involved in Ca2+ coordination in site 2 form two characteristic motifs in the CTLD sequence, and together with the calcium atom itself are directly involved in monosaccharide binding. The first group of residues, the ‘EPN motif’ in MBP-A (E185, P186, N187), is contributed by the long loop region and contains two residues with carbonyl side chains separated by a proline in cis conformation. The carbonyl side chains provide two Ca-coordination bonds, form hydrogen bonds with the monosaccharide and determine binding specificity. The cis-proline is highly conserved and maintains the backbone conformation that brings the adjacent carbonyl side chains into the positions required for Ca2+ coordination. The second group of residues, the ‘WND motif’ (positions 204–206), is contributed by the β4 strand. Although only the asparagine and aspartate are involved in Ca-coordination, the tryptophan immediately preceding them is a highly conserved contributor to the hydrophobic core (position β4W [5]) and is a useful landmark for detecting the motif in a sequence. In the MBP-A structure, Asn205 and Asp206 provide three Ca-coordination bonds (two from the side chains, one from the backbone carbonyl of Asp) and also form hydrogen bonds with the sugar. One more carbonyl side chain is involved in site 2 formation. It belongs to the residue preceding the second conserved cysteine at the end of the long loop region (Glu193 in MBP-A), and forms one coordination bond with the Ca2+ ion.

As no other Ca-binding site except for site 2 is known to be involved in sugar binding, and as the site 2 residue motifs can be confidently detected in the sequence, it is common in the literature to associate the predicted Ca2+/carbohydrate binding properties of an uncharacterized sequence with the presence of these motifs (e.g. [7,8]). Although this is a useful simplification, it should be noted that the absence of the motifs associated with Ca2+-binding site 2 does not indicate that the CTLD is incapable of binding Ca2+, as there are two independent sites (1 and 4). Also, the presence of these motifs does not guarantee lectin activity for the CTLD, as there are numerous examples of CTLDs that contain the conserved motifs but are not known to bind monosaccharides (see below).
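The motif-based association described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the prediction procedure used in the cited genome studies: it looks for an `EPN` tripeptide (or `QPD`, the galactose-type counterpart named later in this review) followed within a window by `WND`. The function name, the window size `max_gap` and the toy sequences are assumptions introduced for the example, and the caveats of the preceding paragraph apply in full: a hit does not establish lectin activity, and a miss does not exclude Ca2+ binding at sites 1 or 4.

```python
import re

# Long-loop tripeptide -> specificity class associated with it in this review.
LONG_LOOP_MOTIFS = {"EPN": "mannose-type", "QPD": "galactose-type"}


def predict_site2(sequence, max_gap=40):
    """Return (motif, predicted_class) if a site 2 signature is found, else None.

    A signature is an EPN/QPD tripeptide with a WND tripeptide starting no
    more than `max_gap` residues after its end (max_gap is an arbitrary
    illustrative window, not a published parameter).
    """
    seq = sequence.upper()
    # Lookahead so that overlapping candidate positions are all examined.
    for match in re.finditer(r"(?=(EPN|QPD))", seq):
        motif = match.group(1)
        after = match.start() + 3
        window = seq[after : after + max_gap + 3]
        if "WND" in window:
            return motif, LONG_LOOP_MOTIFS[motif]
    return None


# Made-up toy sequences, not real CTLDs.
print(predict_site2("AAEPNAAAAAAAAAAAAAAAAWNDAA"))
print(predict_site2("AAQPDGGGGWNDAA"))
print(predict_site2("AAEPNGGGG"))
```

In practice the motifs would be located in a multiple alignment against characterized CTLDs rather than by raw string matching, since a tripeptide can occur by chance elsewhere in a sequence.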

Sites 1, 2 and 4 play structural roles

Despite their spatial proximity, from the evolutionary and structural points of view Ca2+-binding sites 1 and 2 should be considered as independent. Crystallographic studies of rat MBP-A CTLD crystallized at a low metal ion concentration (0.325 mM Ho3+ instead of the 20 mM used to obtain the CTLD complexed with mannose) have shown that site 1 has higher affinity for Ca2+, as it remains occupied and its Ca2+-coordination geometry is retained while site 2 loses its metal ion [58]. On the other hand, in the fourth CTLD of the human macrophage mannose receptor, Ca2+-binding site 1 is less stable than site 2 [41,59]. This is also the case for the rat pulmonary surfactant protein A (SP-A), where only some of the required ligands for Ca2+-1 are present and these can provide only three coordination bonds to the Ca2+. In one of the two solved SP-A structures (PDB 1r14) both site 1 and site 2 are occupied by metal atoms, while in the other (PDB 1r13) only site 2 is occupied [60]. SP-A is a particularly good example supporting the mutual independence of sites 1 and 2 because in its close homologue – pulmonary surfactant protein D – sites 1, 2 and 3 are occupied by Ca2+ (PDB 1pw9, 1pwb [61]). Independence of Ca2+-binding site 1 is also supported by the fact that in several CTLD structures site 1 is missing, while site 2 contains a calcium ion and is involved in carbohydrate binding. Examples of such structures are human E- and P-selectins (PDB 1esl, 1g1t, 1g1q, 1g1r, 1g1s) [62,63] and tunicate lectin TC14 (PDB 1byf, 1tlg) [64].
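The independence argument can be made concrete by tabulating the site occupancies quoted in this paragraph and asking which structures have site 2 occupied without site 1. The short Python sketch below does only that; the dictionary is a transcription of a few of the examples given in the text (one PDB entry per protein), it has not been re-checked against the deposited coordinates, and the names `OCCUPIED_SITES` and `site2_without_site1` are hypothetical.

```python
# Occupied Ca2+ sites as stated in the text above (not re-checked against
# the PDB entries themselves).
OCCUPIED_SITES = {
    "SP-A, 1r14": {1, 2},
    "SP-A, 1r13": {2},
    "SP-D, 1pw9": {1, 2, 3},
    "E-selectin, 1esl": {2},
    "TC14, 1byf": {2},
}


def site2_without_site1(occupancy):
    """Names of structures in which site 2 is occupied and site 1 is not."""
    return sorted(
        name for name, sites in occupancy.items()
        if 2 in sites and 1 not in sites
    )


for name in site2_without_site1(OCCUPIED_SITES):
    print(name)
```

The converse case, site 1 occupied with site 2 empty, is the low-metal MBP-A structure [58]; it is left out of the table because the text does not give a PDB code for it.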

Ca2+-binding site 4 was first observed in the structure of the factor IX/X-binding protein from the venom of Trimeresurus flavoviridis, where it was the only location of Ca2+ ions [40]. It is occupied by Ca2+ in several other snake venom CTLD structures. Two observations suggest that this site is a property of the CTLD in general rather than restricted to the snake venom group of CTLDs. First, it is present in the human asialoglycoprotein receptor I [47], which is a very remote homologue of the snake venom CTLDs. Second, as shown by comparative analysis of CTLD structures [5], Ca2+-4 is involved in a stabilizing interaction that is a highly conserved structural feature observed in virtually all CTLD structures. It can be mediated by salt bridge formation between charged groups and by metal ion coordination. In one structure, that of the galactose-specific C-type lectin from the rattlesnake Crotalus atrox (PDB 1jzn, 1muq [65]), Na+ was found instead of Ca2+ in site 4.

A stabilizing effect of bound Ca2+ on CTLD structure has been reported for a number of proteins from different CTLD groups [52,66,67]. Ca2+ removal greatly increases CTLD susceptibility to proteolysis and changes physical properties of the domain such as circular dichroism spectra and intrinsic tryptophan fluorescence. Structures of the apo forms of human tetranectin [68] and rat MBP-C, and of the one-ion form of rat MBP-A [58], have demonstrated the mechanism underlying these changes. In these structures the compactness of the long loop region is disrupted, leading to multiple conformational changes including a cis-trans isomerization of the conserved proline. However, not all CTLDs require Ca2+ to form a stable long loop region structure. NMR studies of the tunicate CTLD TC14 have shown that its loops maintain their compact fold when Ca2+ is removed [69].

Role of Ca2+ in CTLD function

The most important functional role of the bound Ca2+ in CTLDs is monosaccharide binding. This function is limited to site 2 and is discussed in detail in the following section. However, in several cases, which are described below, Ca2+-binding sites participate in interactions that do not involve carbohydrate recognition.

In proteins, Ca2+ is found in 7- or 8-coordinated form. Because of the metal's ability to simultaneously interact with multiple ligands within the protein, its binding can orchestrate dramatic rearrangements in the tertiary structure of the protein. At the same time, the reversible nature of the binding and its dependence on different parameters of the milieu (e.g. ion concentration, pH) provide mechanisms to control the structural transformations induced by metal binding.

There are several examples of CTLD functions that are mediated by Ca2+-induced structural changes, namely the destabilization of the long loop region caused by Ca2+ removal, rather than its involvement in monosaccharide binding. It is thought that the destabilization of the loops caused by pH-induced Ca2+ loss plays a physiological role in the function of the CTLDs in endocytic proteins such as asialoglycoprotein receptors [52,70] and macrophage mannose receptor [41,59]. Transition of the receptor-ligand complex from the cell surface into the acidic environment of a lysosome leads to Ca2+ loss and to the release of the bound ligand. After release, the ligand is processed by the lysosomal enzymes, while the receptor is recycled to the cell surface.

Another example of functional CTLD transformation induced by Ca2+ is human tetranectin. Although in the CTLD of tetranectin Ca2+-binding sites 1 and 2 are present, the CTLD is not known to bind carbohydrates. The domain, however, interacts with several kringle domain-containing proteins, including plasminogen, and the interaction involves several residues from the Ca2+-binding site 2. Moreover, the interaction with kringle domain 4 of plasminogen is only possible when Ca2+ is lost from the binding site [71], which leads to changes in the long loop region conformation similar to those observed in the apo-MBP-C [58,68]. The physiological role of Ca2+ as an inhibitor of the tetranectin/plasminogen interaction is, however, unclear.

The antifreeze protein (AFP) from Atlantic herring provides an interesting example of a CTLD in which Ca2+ bound in site 2 is involved in an interaction with a noncarbohydrate ligand [72]. Ewart et al. have shown not only that the antifreeze activity of the protein is Ca2+-dependent [73], but also that it is disrupted by minor changes in the geometry of the Ca2+-binding site 2 introduced by replacing the original galactose-type QPD motif with a mannose-type EPN motif [72]. This strongly suggests that the Ca2+ site 2 in the herring antifreeze protein interacts directly with the ice crystal, altering its growth pattern.

Ligand binding

CTLDs selectively bind a wide variety of ligands. As the superfamily name suggests, carbohydrates (in various contexts) are primary ligands for CTLDs and the binding is Ca2+-dependent [74]. However, the fold has been shown to specifically bind proteins [75], lipids [76] and inorganic compounds including CaCO3 and ice [72,77–79]. In several cases the domain is multivalent and may bind both protein and sugar [80–82].

Carbohydrate binding is, however, a fundamental function of the superfamily and the best studied one. The first characterized vertebrate CTLDcps were Ca2+-dependent lectins, and most of the functionally characterized CTLDcps from lower organisms were isolated because of their sugar-binding activity. As the number of CTLDcp sequences grows, it is becoming clearer that the majority of them do not possess lectin properties (according to Drickamer, ∼85% of C. elegans and 81% of Drosophila CTLDcps are predicted to be noncarbohydrate binding [9]), yet CTLDcps are still regarded as a lectin family. Unlike many other functions of the CTLDcps, Ca2+-dependent carbohydrate binding is found across the whole phylogenetic distribution of the family, from sponges to human, and thus is likely to be the ancestral function. Also, Ca2+/carbohydrate-binding CTLDs from different species show a striking similarity in the mechanisms of sugar binding. Systematic studies by Drickamer and his colleagues have provided an in-depth understanding of many aspects of this mechanism.

The results of this theoretical and experimental work established a basis for developing bioinformatics techniques for predicting CTLD sugar-binding properties with substantial reliability by sequence analysis [83]. Whole-genome studies of the CTLD family published by Drickamer and his colleagues focused on the evolution of the carbohydrate-binding properties and used these prediction methods [6–8]. Although our approach for the Fugu rubripes genome was somewhat different [9], for carbohydrate-binding prediction we used the techniques developed by Drickamer and coworkers. An overview of the literature on the mechanism of Ca2+-dependent monosaccharide binding by CTLDs is given next.

Ca2+-dependent monosaccharide binding

The mechanism of Ca2+-dependent monosaccharide binding by several CTLDs has been studied in great detail by X-ray crystallography, site-directed mutagenesis and biochemical methods. The first crystallographic study of a complex between a CTLD and a carbohydrate was carried out on rat MBP-A and the N-glycan Man6-GlcNAc2-Asn [51]. In the structure obtained, a ternary complex between the terminal mannose moiety of the oligosaccharide, the Ca2+ ion bound in site 2 and the protein was observed. The complex is stabilized by a network of coordination and hydrogen bonds: oxygen atoms from the 4- and 3-hydroxyls of the mannose form two coordination bonds with the Ca2+ ion and four hydrogen bonds with the carbonyl side chains that form the Ca2+-binding site 2 (Fig. 5). This bonding pattern is fundamental for CTLD/Ca2+/monosaccharide complexes, and is observed in all known structures. It is also a major contributor to the binding affinity, especially in CTLDs specific for the mannose group of monosaccharides. For example, in MBP-A, mannose atoms form very few interactions with the protein other than hydrogen/coordination bond formation by the two equatorial hydroxyls, and extensive mutagenesis screening has shown that the only other significant contributor to mannose binding is Cβ from His189, which forms a hydrophobic interaction with the sugar [84].

Figure 5.

Ca2+-dependent monosaccharide binding by CTLDs. (A) A schematic representation of a Ca2+-hexose-CTLD complex. Two hydroxyl oxygens and the ring of the hexose are shown. The Ca2+ atom is shown as a large grey sphere, and oxygens as empty circles and ovals. Protein groups that act as hydrogen donors and acceptors are not shown. Arrows show the direction of hydrogen bonds in mannose-specific CTLDs, while light-grey arrows indicate changed directions in galactose-specific CTLDs. (B) A stereoview of the MBP-A complex with mannose (PDB 2msb). Coordination bonds are orange. Hydrogen bonds where sugar hydroxyl acts as acceptor and donor are red and blue, respectively. The Ca2+ atom is shown as a blue sphere.

The positioning of hydrogen donors and acceptors in the binding sites has two important features. First, it determines the overall positioning and orientation of the ligand in the binding site. It may seem from Fig. 5A that the sugar-binding site of CTLDs has a twofold symmetry axis relating the sugar hydroxyls, and the hypothetical sugar shown can be rotated by 180° without introducing any changes to the bonding scheme. It is now known that this is indeed the case, although some early modeling and mutagenesis studies assumed that the orientation of the sugar was fixed. When the structure of the complex of rat MBP-C with mannose was determined, the orientation of the bound mannose was opposite to that observed in MBP-A [85], and further studies revealed some of the factors that determine the preferred orientation [86]. Although the rat MBPs are the only established example of a CTLD that can bind carbohydrates in both orientations, it is known that different CTLDs bind the same monosaccharide in different orientations (e.g. galactose-binding MBP-A mutant and CEL-I vs. TC-14 lectin).

The second constraint imposed by the Ca2+-coordination site on the ligand determines the properties of the carbohydrate hydroxyls that the site can accept, and this is best demonstrated by the mechanism by which CTLDs discriminate between the mannose group and the galactose group of monosaccharides. As noted previously, early in the history of CTLDs an important correlation was observed between the specificity for either galactose or mannose and the residues flanking the conserved cis-proline in the long loop region, which are involved in Ca2+-binding site 2 formation. In all mannose-binding proteins known at that time, the sequence of the motif was EPN (E185 and N187 in MBP-A), while in the galactose-specific CTLDs it was QPD. In a series of elegant mutagenesis experiments Drickamer and coworkers have shown that replacing the EPN sequence in MBP-A with a galactose-type QPD sequence was enough to switch the specificity to galactose [87], and that further modifications around the binding site (mainly introduction of a properly positioned aromatic ring to form a hydrophobic interaction with the apolar face of the sugar) can increase the affinity and specificity of the mutant MBP-A for galactose to the level observed in natural galactose-binding CTLDs [88].

Crystallographic analysis of the galactose-specific MBP-A mutant showed that the EPN to QPD change does not cause any serious restructuring of the Ca2+-binding site 2 geometry [89]; this suggested that the key switch in the specificity was induced by swapping the hydrogen-bond donor and acceptor across the monosaccharide-binding plane and changing the hydrogen-bonding pattern from the mannose-type asymmetrical (Fig. 5A, dark-grey arrows) to galactose-type symmetrical (Fig. 5A, light-grey arrows). The same distribution of hydrogen-bonding partners was observed in the galactose-binding lectin TC-14 from the tunicate Polyandrocarpa misakiensis [64]. The TC-14 CTLD contains an unusual EPS motif in the long loop region, which is similar to the motifs of the mannose-binding proteins but contains a serine as a hydrogen-bond donor instead of the asparagine in MBP-A. The crystal structure revealed that, owing to a compensatory change on the opposite side of the ligand-binding site (the ‘WND’ motif is changed to LDD) and a 180° rotation of the galactose residue compared with the orientation observed in the galactose-binding MBP-A mutant, the symmetrical pattern of the hydrogen bonding is maintained.

Although many of the determinants of the monosaccharide-binding specificity have been established experimentally, the mechanism underlying them is still unclear. Mutual spatial disposition of bonded hydroxyls, which was initially suggested to be the main contributor to the specificity, is no longer considered so important; a growing number of crystal structures of CTLDs with the MBP-A-like (‘asymmetrical’) distribution of hydrogen-bond donors and acceptors have shown that the core binding site is compatible not only with any two equatorial hydroxyls (3- and 4-OH of mannose and glucose, 2- and 3-OH of fucose), but also with a combination of axial and equatorial hydroxyls (3- and 4-OH of fucose, as in E- and P-selectin structures). A comparative study of different lectin-carbohydrate complexes published by Elgavish and Shaanan [90] suggests that additional stereochemical factors need to be taken into consideration. Elgavish and Shaanan noted the unique clustering of hydrogen-bond donors and acceptors around the 4-OH hydroxyl in all compared structures, which was not observed for other hydroxyls: in a Newman projection along the O4-C4 bond, hydrogen-bond acceptors are never gauche to both vicinal ring carbons (C3 and C5), and thus the 4-OH proton is always pointing outside the ring. Poget et al. [64] confirmed this observation and noted that in CTLDs the same rule also holds for the 3-OH proton. However, no explanation of the unique stereochemistry of the 4-OH binding orientation has been offered.
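The geometric test behind the Newman-projection rule can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: it computes the torsion angles acceptor-O4-C4-C3 and acceptor-O4-C4-C5 from Cartesian coordinates and reports whether the acceptor is gauche to both ring carbons, which is the arrangement reported as never observed. The function names, the ±30° window around 60° used to define gauche, and the idealized coordinates are all hypothetical and are not taken from [90] or from any crystal structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_deg(p0, p1, p2, p3):
    """Signed torsion angle p0-p1-p2-p3 in degrees (atan2 formulation)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def is_gauche(angle_deg, tolerance=30.0):
    """Assumed definition: |torsion| lies within `tolerance` of 60 degrees."""
    return abs(abs(angle_deg) - 60.0) <= tolerance


def gauche_to_both(acceptor, o4, c4, c3, c5):
    """True if the acceptor is gauche to both C3 and C5, viewed along O4-C4."""
    return (is_gauche(torsion_deg(acceptor, o4, c4, c3))
            and is_gauche(torsion_deg(acceptor, o4, c4, c5)))


def on_circle(angle_deg, z):
    """Point at unit distance from the O4-C4 axis (the z axis) at height z."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a), z)


# Idealized toy geometry: O4-C4 along z, C3 and C5 staggered at +60/-60 degrees.
O4, C4 = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
C3, C5 = on_circle(60.0, 1.5), on_circle(-60.0, 1.5)

over_ring = on_circle(0.0, -0.5)    # hypothetical acceptor between C3 and C5
off_ring = on_circle(120.0, -0.5)   # hypothetical acceptor anti to C5

print(gauche_to_both(over_ring, O4, C4, C3, C5))
print(gauche_to_both(off_ring, O4, C4, C3, C5))
```

In this toy geometry the first acceptor position is gauche to both C3 and C5, the case excluded by the rule, whereas the second is gauche to C3 but anti to C5.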

Other contributions to monosaccharide binding affinity and specificity

Although the network of interactions between the Ca2+ ion, the carbonyl residues that coordinate it and the sugar hydroxyls determines the basic binding affinity and specificity for either mannose-type or galactose-type monosaccharides, other structural elements in the binding sites increase the affinity to the level required for efficient binding, impose steric limitations on the orientation of the ligand and introduce selectivity for particular members within the mannose or galactose groups.

Structural determinants of specificity for particular monosaccharides from both mannose and galactose groups were studied by protein engineering on the MBP framework [91,92] and by mutagenesis of several wild-type proteins (mechanisms of discrimination between Glc and GlcNAc by chicken hepatic lectin [93], contribution of His189 to the mannose-binding affinity in MBP-A [84], mutations affecting MBP-A binding of mannose [94], discrimination between GalNAc and Gal by ASGPR [95], increasing the mutant MBP-A affinity towards galactose [88], role of van der Waals interaction with Val351 in fucose recognition by human DC-SIGN [96] and residues affecting pH-dependent ligand release by ASGPR [70]). These additional contributors to binding, however, are variable even between close homologues, which combined with the inherent plasticity of the core binding site makes any predictive modeling questionable.

Reliability of Ca2+/carbohydrate-binding prediction

As noted above, the molecular mechanism of Ca2+-dependent carbohydrate binding is conserved in all family members studied; the amino acids that form the core of the binding sites form characteristic motifs (‘EPN’ and ‘WND’) that can be identified by sequence similarity and are indicative of the binding specificity (mannose vs. galactose). These observations provide a simple and very popular approach to predicting whether a CTLD of unknown function is likely to bind sugar (‘EPN’ and ‘WND’ present) and whether it would preferentially bind mannose- or galactose-type ligands (‘EPN’ vs. ‘QPD’). This simple prediction technique is widely used and has proven to be reliable in many cases. However, its development was based on comparison of a limited set of well-characterized domains, whereas the number of uncharacterized sequences to which it is applied is growing quickly, as is the evolutionary distance between the characterized and new sequences. It is therefore important, especially for studies involving large-scale CTLD sequence analysis, to take into account the assumptions on which this approach is based, and its possible limitations.
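The simple motif rule described above can be sketched in a few lines. This is a minimal illustration and not the published prediction method [83]: it assumes that a plain substring search is adequate, whereas in practice the motifs would be located at aligned positions in the long loop region and strand β4. The function name and the toy sequences are hypothetical.

```python
def predict_sugar_binding(ctld_sequence: str) -> str:
    """Apply the simple EPN/QPD + WND rule to a one-letter CTLD sequence.

    Assumption: motifs are found by substring search, not by alignment.
    """
    seq = ctld_sequence.upper()
    has_epn = "EPN" in seq
    has_qpd = "QPD" in seq
    has_wnd = "WND" in seq

    # Both a long loop motif and WND are required for a positive prediction.
    if not has_wnd or not (has_epn or has_qpd):
        return "not predicted to bind sugar"
    if has_epn and has_qpd:
        return "ambiguous"
    return "mannose-type" if has_epn else "galactose-type"


# Toy sequences (hypothetical, not real CTLDs).
print(predict_sugar_binding("AAEPNGGWNDKK"))
print(predict_sugar_binding("AAQPDGGWNDKK"))
print(predict_sugar_binding("AAEPKGGWNDKK"))
```

A rule of this kind returns a negative prediction for any sequence with a variant motif (the third toy sequence carries EPK), which is one reason the assumptions behind it need to be examined.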

The three main assumptions are: (a) the presence of Ca2+-binding site 2 strongly suggests sugar-binding activity; (b) Ca2+-dependent sugar binding involving Ca2+-binding site 2 is the only (major) mechanism of monosaccharide binding by CTLDs; and (c) the positioning of hydrogen-bond donors and acceptors flanking the conserved proline in the long loop region determines specificity for either mannose- or galactose-type monosaccharides.

As described above, the presence of the residue motifs associated with Ca2+-binding site 2 does not guarantee that the CTLD will bind carbohydrates. Several examples exist in the literature where sugar-binding activity and specificity predicted from the sequence were not confirmed by experiment. The CTLD of human tetranectin contains a galactose-type QPD motif and binds two Ca2+ ions, but the only demonstrated carbohydrate-binding activity of this protein [97] is not associated with the CTLD [98]. Antifreeze protein from Atlantic herring also contains a galactose-type QPD motif and binds Ca2+, but does not bind carbohydrate [99]. Although human macrophage mannose receptor CTLDs 4 and 5 both contain mannose-type EPN motifs and other positions typically involved in Ca2+-binding are occupied by identical or similar residues [100], monosaccharide-binding activity could be demonstrated only for CTLD 4 [101]. On the other hand, lung surfactant protein A has an EPK motif in the long loop region, but binds Ca2+ at site 2 and also monosaccharides from the mannose group [102,103].

As to the second assumption, there is no firm evidence that an alternative mechanism of monosaccharide binding by CTLDs exists, but we found several examples in the literature that may suggest this possibility. These were: (a) The existence of a secondary site was proposed for rabbit and rat hepatic lectins based on binding data [104]. (b) In a study using a photo-activatable galactose derivative to map the binding site of a galactose-specific lectin from acorn barnacle (BRA-3), the labeled regions were not adjacent to the Ca2+-binding site 2 [105]. (c) A secondary binding site was observed in one of the MBP-C crystals soaked with a high concentration (1.3 M) of α-methyl-mannose [85]. Although the second binding site was not observed at lower monosaccharide concentration (0.2 M) and electron density for the sugar could only be assigned for one of the two copies in the asymmetric unit, it has been suggested that the secondary binding site may be a part of an extended site that has significant affinity only for larger ligands [85]. Interestingly, the monosaccharide bound at the alternative site is in contact with the regions corresponding to the regions labeled in the acorn barnacle lectin study. (d) Although the CTLD of human thrombomodulin does not contain the typical Ca2+-binding sequence signature, aggregation of melanoma cells mediated by it is abolished by Ca2+ removal or by addition of mannose, chondroitin sulfate A or chondroitin sulfate C [106], which suggests a Ca2+-dependent carbohydrate-binding activity. (e) Dectin-1, which is reported as a macrophage β-glucan receptor [107], does not contain Ca2+/carbohydrate-binding motifs or require Ca2+ for carbohydrate binding; residues required for binding are located in strand β3 [108]. Other group V CTLDcps may have similar properties [81].

The evidence for an alternative mechanism of sugar binding by CTLDs is scarce and does not show any common trend. On the other hand, a surprisingly large proportion (> 80%) of CTLDs from invertebrates are predicted not to bind sugar. It is possible that some of these proteins use an alternative mechanism for sugar binding. In this regard the example of the Link group of proteins is pertinent. These proteins do not contain a long loop region but nevertheless bind carbohydrates via a different mechanism.

As to the last assumption, there is no compelling explanation of the correlation between donor-acceptor positioning in site 2 and the discrimination between galactose and mannose, although it is supported by the majority of the CTLDs discovered since the observation was made. However, the example of the Polyandrocarpa lectin shows that the correlation is not absolute [64].
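The exceptions discussed in this section can be collected into a small check of the long loop motif rule. The sketch below is an illustration only: the rule is reduced to the EPN/QPD tripeptide, the observed activities are simplified from the descriptions in the text (with `None` standing for "no CTLD-mediated sugar binding demonstrated"), and all variable names are hypothetical.

```python
# Simplified rule from the text: long loop motif -> predicted specificity.
LONG_LOOP_RULE = {"EPN": "mannose-type", "QPD": "galactose-type"}

# (protein, long loop motif, sugar binding as summarized in the text).
# None is an assumed encoding for "no CTLD-mediated sugar binding demonstrated".
CASES = [
    ("rat MBP-A", "EPN", "mannose-type"),
    ("human tetranectin CTLD", "QPD", None),
    ("herring antifreeze protein", "QPD", None),
    ("mannose receptor CTLD 5", "EPN", None),
    ("lung surfactant protein A", "EPK", "mannose-type"),
    ("TC-14 (Polyandrocarpa)", "EPS", "galactose-type"),
]


def predicted(motif):
    """Return the specificity implied by the motif, or None if no rule applies."""
    return LONG_LOOP_RULE.get(motif)


mismatches = [name for name, motif, observed in CASES
              if predicted(motif) != observed]

for name in mismatches:
    print("rule and experiment disagree for:", name)
```

Apart from MBP-A, which is included only as a reference case, every entry is a case where the motif-based prediction and the reported activity differ. The list is not a representative sample and says nothing about the overall error rate of the rule.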

Groups of vertebrate CTLDcps

In a review of the C-type lectin family published in 1993 Drickamer separated the CTLDcps known at that time into seven groups (I to VII) based on their domain architecture and showed that such grouping correlates well with the results of phylogenetic analysis of the CTLD sequences and captures functional similarities between the proteins [2]. The classification was revised in 2002 [6] with the addition of seven new groups (VIII to XIV). Whereas the first seven groups of CTLDcps have a substantial history and are widely referenced in the literature, the new groups were only briefly outlined in the work introducing them. Along with the updated classification, a link to the ‘World-wide web-based resource for animal lectins’ (http://www.imperial.ac.uk/research/animallectins/default.html) was published, where some additional information on the new groups can be found, including the lists of database identifiers for the sequences that were used to define them. However, no functional description of the CTLDcps from the new groups similar to the description of the groups I to VII has been published. The domain architecture of the CTLDcps in different groups is shown in Fig. 6. In addition to the 14 groups present in Drickamer's updated classification, three new groups (XV to XVII) are shown, which we have added to accommodate the novel vertebrate proteins we identified in the study of Fugu CTLDcps [9]. Table 1 summarizes the literature on the vertebrate CTLDcp groups, focusing on the structural and functional features of the CTLDs.

Figure 6.

Domain architecture of vertebrate CTLDcps, with mammalian homologues, from different groups. Group numbers are indicated next to the domain charts. I – lecticans, II – the ASGR group, III – collectins, IV – selectins, V – NK receptors, VI – the macrophage mannose receptor group, VII – REG proteins, VIII – the chondrolectin group, IX – the tetranectin group, X – polycystin 1, XI – attractin, XII – EMBP, XIII – DGCR2, XIV – the thrombomodulin group, XV – Bimlec, XVI – SEEC, XVII – CBCP.

Table 1. Summary of the structural and functional features of the vertebrate CTLDcp groups. See also Fig. 6.
Genomic distribution, members | Functions | CTLD ligands and function | CTLD structure*
  * Presence of Ca2+ in the four Ca2+-binding sites is indicated with filled (Ca2+ site is occupied) or unfilled (Ca2+ site is unoccupied) circles with the corresponding site numbers.

I Lecticans | Large extracellular proteoglycans containing mainly chondroitin sulfate side chains. Historically divided into three globular domains (N-terminal G1 and G2, and a C-terminal G3) and a central extended region to which glycosaminoglycan chains are attached. G1 and G2 contain 2–4 Link type CTLDs, while G3 contains a canonical CTLD.
Vertebrate genomes encode 4 lecticans: aggrecan, brevican, versican and neurocan | Cell adhesion, tissue integration. | Regulate intracellular processing and trafficking of the protein. G3 region shown to promote glycosaminoglycan chain attachment and aggrecan secretion [109,110]. Galactose/fucose specificity shown for aggrecan C-terminal CTLD [111] as well as protein ligands [112–115]. | X-ray structure of complex of tenascin-R and aggrecan G3 CTLD [56]. ❶❷❸➃
II Asialoglycoprotein and DC receptors | Type II transmembrane proteins containing a short cytoplasmic tail, a transmembrane domain, an extracellular stalk region, and Ca2+/carbohydrate binding CTLD. Length of stalk region, involved in oligomerization, varies greatly among different members. Large and heterogeneous group, significantly expanded recently.
Asialoglycoprotein receptor (ASGR) subgroup: ASGR, MGL
Encoded by a gene cluster: Two genes (ASGR1 and ASGR2) found in many mammals encode subunits of heterotrimeric ASGR (‘hepatic lectin’). In rodents two ASGR2 glycoforms initially considered as separate proteins (rat hepatic lectin (RHL) 2 and 3 or RHL2/3 [116]).
Two genes of macrophage galactose-binding lectin (MGL) in mouse (mMGL1 [117] and mMGL2 [118]), and one gene in human (hMGL, also called human macrophage lectin, HML [119]). This subgroup is not present in fish [9]. Our sequence analysis showed the so called ‘chicken hepatic lectin’ [120] is more similar to DC-SIGN subgroup proteins, consistent with its specificity for mannose-type ligands [121,122].
ASGR [123,124]: heterotrimer, expressed exclusively on liver parenchyma; binds and internalizes galactose-terminated oligosaccharides of desialylated glycoproteins. After ligand dissociation in acidic lysosomes, recycled to cell surface. One of first C-type lectins discovered [10,125].
Rat spermatogenic cells express unusual ASGR2 oligomer (sperm galactosyl receptor), consisting of a full-length and a truncated form lacking C-terminal part of CTLD [18,126].
In contrast to ASGRs, other ASGR-gene cluster members are expressed by macrophages. Recombinant mMGL1 and mMGL2 CTLDs show differences in carbohydrate-binding specificities [118].
Ca2+/carbohydrate binding. Primary specificity for galactose. Galactose-binding mechanism unusual for CTLDcps, as subunits have different monosaccharide specificity [127], and bind the same complex carbohydrate molecule simultaneously. Heterooligomeric structure is essential for high-affinity binding and internalization [128,129]. | X-ray structure of ASGR1 CTLD [47]. ❶❷➂❹
DC-SIGN subgroup: DC-SIGN, CD23, LSECtin
Dendritic cell-specific ICAM-grabbing nonintegrin (DC-SIGN, CD209) and its close homologue DC-SIGNR (DC-SIGN receptor) are an actively evolving gene family, with significant differences among mammals. Two genes (DC-SIGN and DC-SIGNR) identified in human, a group of paralogues found in nonhuman primates [130], and five DC-SIGN homologues found in mouse [131] (DC-SIGN, SIGNR1, SIGNR2, SIGNR3 and SIGNR4). In the fish genome the DC-SIGN group is also expanded [9].
A new protein LSECtin encoded by the DC-SIGN gene cluster recently characterized [132].
mDC-SIGN identified as hDC-SIGN orthologue based on proximity of its gene to the mCD23 gene. In human genome hCD23 and hDC-SIGN are closely linked.
DC-SIGN is responsible for HIV particle transfer and in-trans infection of T-cells [133]. Also a receptor for other pathogens, such as Mycobacterium tuberculosis [134], hepatitis C virus [135], Ebola virus [136], and human cytomegalovirus [137].
CD23 [82,138,139] (low affinity IgE receptor) is a glycoprotein expressed on several cell types including lymphocytes, eosinophils, platelets, and macrophages, and also found in a soluble form produced by proteolysis. A key molecule of B-cell activation and growth. Oligomerization via coiled-coil stalk region significantly increases its affinity for IgE [140].
LSECtin found in sinusoidal endothelial cells of human liver and lymph node [132]; this is similar to the expression profile of DC-SIGNR.
Carbohydrate recognition plays a central role in the DC-SIGN binding of pathogens.
CTLD of CD23 is involved in both protein-protein and protein–carbohydrate interactions. Although human CD23 binds IgE in a carbohydrate-independent manner [141], recognition of another ligand (CD21) and CD23-induced cell aggregation require Gal-terminated glycan chains [142–144]. Predicted Ca2+-binding site 2 motifs of human CD23 CTLD are EPT and WND (EPN in mouse, rat and horse), which are typical for mannose-binding CTLDs, so galactose specificity of CD23 is unexpected.
The CTLD of LSECtin contains an EPN motif and, as expected, preferentially binds mannose-type ligands.
X-ray structures of DC-SIGN and DC-SIGNR complexed with oligosaccharides [53,96]. ❶❷❸➃
NMR structure of human CD23 defining its interactions with IgE and CD21 [145]. ❶➁➂➃
Macrophage receptors: MCL, Mincle, DLEC, DCIR, DCAR, Dectin-2
Gene cluster [146,147] (human 12p13; mouse 6F2), closely linked to NK cell receptor complex (group V). Encodes several CTLDcps expressed by macrophages and dendritic cells: macrophage C-type lectin (MCL [148]), macrophage-inducible C-type lectin (Mincle [149]), dendritic cell immunoreceptor (DCIR [150]), dendritic cell lectin (DLEC or BDCA-2 [151,152]) and dendritic cell-associated lectin-2 (Dectin-2 [153,154]). Rodent-only: DCIR paralogues (DCIR2-DCIR4 [146]), and a dendritic cell immunoactivating receptor (DCAR [155]). | Subgroup members discovered only relatively recently, and their functions poorly characterized [153]. | Only available information on carbohydrate-binding properties from two studies on Dectin-2, which gave conflicting results: in one case Ca2+-dependent binding to mannose observed [156], while in other the protein did not bind carbohydrate [153]. In all group members, a putative Ca2+-binding site 2 motif is present, although in some cases has unusual sequence (e.g. EPK, ESN, EPD in rat, mouse and human MCL, respectively).
Langerin and Kupffer cell receptors
Cluster of two genes (human 2p13, mouse Ch 6D1), closely linked to NK cell receptor complex (group V): Langerin (CD207) and Kupffer cell receptor (KuCR).
KuCR locus in human genome lacks 3′-terminal exon, which truncates CTLD at beginning of long loop region; hence, suggestion that human receptor is a pseudogene [157]. However, a full-length cDNA (AK096429) for hKuCR is now available in GenBank (63% identity with rat).
Langerin is an endocytic receptor uniquely expressed by Langerhans cells and associated with Birbeck granules in human [158] and mouse [159]. Long stalk region involved in trimerization (coiled-coil).
Kupffer cell receptor structure is similar to Langerin, but expressed in liver and functions as endocytic receptor for fucose-terminated glycoproteins [160,161].
CTLD of Langerin has typical motifs associated with mannose binding, and protein indeed shown to bind mannose-group monosaccharides [162].
Rat KuCR contains a galactose-type QPD motif, but interestingly it binds fucose with relatively high affinity [157].
 
Scavenger receptor with a CTLD (SRCL)
SRCL (human Ch 18p11 [163,164]) has unusual structure for group II proteins. Contains a collagen domain and coiled-coil region, and thus was described as a placental collectin (CL-P1 [165]; HUGO name COLEC12). However, except for collagen region, the domain structure is analogous to other group II CTLDcps; our phylogenetic analysis of CTLD alignments confidently places SRCL into group II. | Endocytic receptor; binds Gram-negative and Gram-positive bacteria, yeasts, oxidized low density lipoprotein [165]. | CTLD of SRCL is similar to ASGR CTLDs, including all elements shown to contribute to high-affinity galactose binding by ASGR (QPD motif, a tryptophan and glycine-rich loop [89]). SRCL indeed binds galactose-type ligands [166] and has unusually high selectivity for glycans containing LewisX epitope [167].
III Collectins | Soluble CTLDcps that contain a collagen domain and function as the first line of the innate immune defense [168].
Serum mannose-binding protein(s) (MBP) and pulmonary surfactant proteins (PSP). Other members: human liver collectin CL-L1 [169] (unusual as only found in cytoplasm). Bovidae collectins: conglutinin, CL-43 and CL-46; genes physically linked with MBP and PSP [170]. Highly conserved vertebrate collectin identified in the Fugu whole-genome study [9]. | Innate immunity: recognition of pathogen carbohydrates, complement activation via lectin pathway, activation of phagocytosis [e.g. 171–173]. | Unique binding specificity and spatial organization of CTLDs in oligomeric complexes allows collectins to recognize ordered arrays of carbohydrates specific to the surfaces of microorganisms (pathogen associated molecular patterns (PAMPs)) [174,175]. MBP and PSP also bind nucleic acids via both the CTLD and collagen region [176], and PSP binds phospholipids via the CTLD [177].
As discussed in text, MBP wild-type and mutant structures used in classical studies of Drickamer, Weis and coworkers on the mechanism of carbohydrate recognition by CTLDs.
Numerous X-ray structures for wild-type and mutant MBP and complexes [e.g. 14,86,89,178]. ❶❷[❸/➂]➃
X-ray structures for rat PSP-A [60] and human PSP-D, including sugar complex [55, 61]. ➀❷➂➃,❶❷➂➃, ❶❷❸➃
IV Selectins | Type I transmembrane proteins containing CTLD, EGF and 2–9 complement control protein (CCP) domains.
Three selectins: L- (leukocyte), P- (platelet) and E- (endothelial); encoded by a compact gene cluster; this organization is conserved among vertebrates [9]. | Cell adhesion [179,180]. Involved in the first step (initial attachment (tethering) and subsequent movement (rolling)) of leukocyte recruitment from the blood stream into sites of inflammation and lymphatic tissues. | CTLDs bind the carbohydrate sialyl LewisX (SLeX) with low affinity; different high-affinity glycoprotein ligands also identified [reviewed in 180–182]. Binding occurs via an extended site on the CTLD; in addition to fucose binding at the primary site, electrostatic and hydrogen-bond interactions are formed with other monosaccharide moieties of SLeX [62]. | X-ray structures of E- and P-selectin complexes with SLeX and other ligands [62]. ➀❷➂➃
V ‘NK cell receptors’ | Non-Ca2+-binding type II transmembrane CTLDcps. Despite the common group definition of ‘type II NK cell receptors’, many are not (exclusively) expressed by NK cells: CD72 is expressed on B-cells [183]; CD69 on various hematopoietic cells [184]; KLRG1/MAFA on basophils and NK cells [185]; LOX-1 on vascular endothelial cells [186,187]; DCAL1, CLEC-1, KLRL1 on dendritic cells [188,189]; Dectin-1 on macrophages and dendritic cells [190,191]; MDL-1 exclusively on monocytes and macrophages [192]; and CLEC-2 in liver [188].
This group is evolutionarily young and unambiguously identified only in higher vertebrates. Mostly encoded by a single large ‘NK gene cluster’ [193] (human Ch 12p13, mouse Ch 6F3), but uniquely CD72 is on Ch 9 in human and Ch 4 in mouse [194], and MDL-1 on Ch 7p33 in human [192].
Substantial variability between rodents and human: large mouse gene family (Ly49-A – Ly49-x, official symbols KLRA1-KLRA28; some are alleles), even larger in rat (at least 36 genes [195]) but only a single gene (Ly49L/KLRA1) in human, encoding a truncated protein, which lacks the distal part of CTLD [196].
Majority belong to killer cell lectin-like receptor [KLR; unofficial names NKG2 (NK cell group 2) and Ly49] group. Variously associated with inhibition or activation of NK cells, although exact function of many is unknown. KLRs form homo- (e.g. KLRK1/NKG2D) or heterodimers (e.g. CD94 and KLRC1/NKG2A). | Most have protein ligands.
Some are multivalent and bind both carbohydrate and protein ligands [81,107]; most striking example is Dectin-1, characterized as a macrophage β-glucan receptor [107]. Polysaccharide binding to Dectin-1 is cation-independent, and mutagenesis studies suggest binding site is not at typical CTLD carbohydrate-binding site [108].
X-ray structures of NKG-2D (human and mouse) with their MHC-like ligands [197,198], Ly49C [199], Ly49A [50], Ly49I [200], CD69 [201], CD94 [202], and LOX-1 [203]. ➀➁➂➃
VI Multi-CTLD endocytic receptors | Type I transmembrane proteins with an N-terminal ricin-like domain, a fibronectin type 2 domain and 8 or 10 (Dec205) CTLDs in the extracellular domain, and a short cytoplasmic domain.
4 members (from fish to human): Endo180, phospholipase A2 receptor (PLA2R), macrophage mannose receptor (MManR), and Dec205 (reviewed recently in [204]). | Recycling endocytic receptors. | Monosaccharide binding demonstrated only for Endo180 [205] and MManR [66]; in both cases, activity is limited to a single domain (2 and 4, respectively), other domains being required for high-affinity binding of multivalent ligands [101,206]. Most CTLDs of group VI proteins do not contain residue motifs associated with Ca2+-binding site 2. | X-ray structure of MManR CTLD 4 [41]. ➀❷➂➃
VII Reg group | CTLD preceded by a short N-terminal peptide. In the initial classification this group included all other soluble single-CTLD proteins.
Four subgroups [207, 208] with a gene cluster on human 2p12 (mouse 6C3) and a single gene (Reg4) on human 1p12 (mouse 3F3). The group outlier Reg4 has much less sequence similarity to the other group members. As discussed in text section on snake CTLDcps, Reg4 proteins represent the ancestral member of the mono-CTLD group. | First member of family, now known as Reg1, identified simultaneously by several groups in different functional contexts. Reflected in alternative names [209]: pancreatic stone protein (PSP), as was isolated from pancreatic stones; lithostathine, as was considered an inhibitor of calcite crystal growth [77,210] (not confirmed by other studies [211,212]); and regenerating gene (Reg), as overexpression was observed in regenerating pancreatic islets [213].
No Reg family member contains a characteristic Ca2+-binding motif. Although involvement shown in various physiological and pathological processes, molecular mechanisms of action largely unknown [214]. Lithostathine also studied due to its ability to form amyloid fibrils and possible involvement in early stages of Alzheimer's disease development [215–217].
X-ray structures of polymerized lithostathine protofibrils [216], and monomer [218,219]. ➀➁➂➃
VIII Chondrolectin, Layilin | Type I transmembrane proteins with a single CTLD.
Two members: Layilin and Chondrolectin (CHODL/MT75). | Layilin expressed in wide range of cell lines and tissues [220] and may function as either an endocytic receptor or adhesion molecule [221]. | CTLDs of Layilin and CHODL contain a motif associated with Ca2+-binding, although the motif is unusual (EPS). Layilin binds hyaluronan via the CTLD and intracellular proteins from the ERM family (talin and radixin). No hyaluronan binding could be detected for CHODL [222].
IX Tetranectin | Soluble proteins with a long N-terminal α-helical domain involved in coiled-coil formation. This structure resembles the structure of the C-terminal domain of collectins. SCGF also contains an N-terminal mucin-like Ser/Thr rich region.
Three identified members: tetranectin, stem cell growth factor (SCGF, LSCLCL [223–226]) and CLECSF1. Similarity between group IX and group III is further supported by gene structure (intronless CTLD) and molecular phylogeny reconstruction based on CTLD alignment. SCGF identified in two forms (α and β), difference (78 residues [223]) cannot be explained by alternative splicing, as located within exon 3 encoding the CTLD [225]. | Tetranectin is involved in tissue remodeling, activates plasminogen (main ligand), and is expressed in developing tissues [227].
Expression data only available for CLECSF1 [228]. Shark homologue reported (called shark tetranectin [228,229]).
SCGF detected in culture medium of a human myeloid cell line [230]; mitogen [223,231].
All contain a motif that would satisfy requirements for Ca2+ but not carbohydrate binding in CTLD. Tetranectin binds Ca2+ and carbohydrate (heparin) independently [98]. Ca2+ competes with plasminogen for binding to tetranectin CTLD [71].
As the truncated (SCGFβ) form was reported to be an active growth factor, it is unlikely that this activity is mediated by CTLD.
X-ray structure of tetranectin in Ca2+-bound [232] and Ca2+-free forms [68]. ❶❷➂➃
➀➁➂➃
X Polycystin 1
Large multidomain protein with 11 membrane-spanning regions, thought to be involved in cell–cell or cell–matrix interactions. The extracellular domain of PKD1 is ∼3000 amino acids long and contains 16 PKD domains, which have an Ig-like fold [233], a leucine-rich repeat domain, a putative carbohydrate-binding WSC domain, a CTLD and a domain homologous to the sea urchin receptor egg jelly protein 1 (suREJ1 [234]).
PKD1 (polycystin-1) has several homologues in vertebrates, but not all of them contain a CTLD. Its sea urchin homologue (suREJ1) lacks most of PKD1's domains and contains only one TM region, but two CTLDs [234]. suREJ1 paralogues (suREJ2 and suREJ3) isolated later contain one CTLD [235,236]. Thus, the PKD1 group may be the most ancient group of vertebrate CTLDcps as it can be traced back to the early evolution of deuterostomes. Of several PKD1 paralogues identified in mouse and human, only two (PKD1L2 and PKD1L3) contain CTLDs [237].
PKD1 initially identified as one of two genes in which mutations are responsible for onset of autosomal dominant polycystic kidney disease (ADPKD) [238,239]. The function of the CTLD in polycystin 1, as well as the function of the protein itself, remains unknown.
Study of a GST-fused recombinant PKD1 CTLD showed binding to unsubstituted carbohydrate matrices (Sepharose and Sephadex G25), as well as to several extracellular matrix proteins, with high affinity and in a Ca2+-dependent manner [240]. This is intriguing, taking into account the sequences of the Ca2+-binding site 2 motifs (EPH and WCNT).
Unfortunately, the results cannot be interpreted unambiguously as the GST domain was not cleaved from the CTLD.
 
XI Attractin
Glycoprotein expressed in transmembrane or soluble form due to alternative splicing [241]; contains a CUB domain (found in many developmentally regulated proteins), four EGF-like domains, and four PSI domains (found in plexins, semaphorins and integrins).
Orthologue of attractin found in C. elegans, but lacks the CTLD [242]. Interestingly, a CTLD occurs very frequently in combination with a CUB domain in C. elegans. A well-conserved vertebrate attractin paralogue can be found in sequence databases (hypothetical protein KIAA0534); no description of this protein has been published.
Expressed by hematopoietic cells [242]. In mouse associated with the mahogany mutation, which affects the melanocortin signaling pathway [243]; and in rat, the zitter mutation in tremor rats [244].
Unknown.
XII Eosinophil major basic protein (EMBP)
Soluble protein containing a highly basic CTLD and an acidic pro-peptide, which is cleaved off in the active form.
Paralogue of EMBP, EMBP-2, identified in mouse [245] and human [246,247].
Estimated pI of 11; major component of crystalloid core of eosinophil-specific granules; functions as a cytotoxic agent against parasites. Despite its highly basic nature, the CTLD of EMBP has a typical CTLD fold [49]. Ligand-binding functions unclear, but heparin binding has been found and may be involved [49].
X-ray structure of EMBP [49]. ➀➁➂➃
XIII DGCR2
Type I transmembrane protein containing vWF, CTLD and LDL domains in extracellular region.
One member, DGCR2/IDD/Sez12, which was localized in the DiGeorge syndrome (OMIM 188400) critical region [248–250].
Function of protein unknown.
CTLD of DGCR2 does not contain characteristic Ca2+-binding motifs.
XIV Thrombomodulin
Type I transmembrane proteins with a short intracellular domain and an extracellular part that includes a CTLD, a domain referred to as hydrophobic or sushi-like, one or more EGF domains, and low complexity Ser/Thr-rich regions, which are targets for O-glycosylation.
Four members: thrombomodulin (TM), Endosialin/TEM1, CD93/C1qRP/AA4 and a novel member we named CETM [9].
All expressed on vascular endothelium. Endosialin only found on tumor vascular endothelium [251–253, but see 254], while C1qRP and TM expressed more broadly. Based on EST data, CETM is ubiquitously expressed. TM is a very well characterized CTLDcp, due to its importance in the coagulation pathway. Thrombin binding to EGF domains 5 and 6 of TM promotes protein C activation (up to 1000×), which makes TM a potent tissue anticoagulant [255]. TM and CD93 are involved in cell adhesion and inflammation control [106,256,257].
Thrombomodulin fragment referred to as ‘lectin domain’, which includes CTLD and the following 67-residue ‘hydrophobic region’, is required for Ca2+-dependent thrombomodulin-mediated cell adhesion; this is inhibited by mannose and chondroitin sulfate A or C [106]. However, the TM CTLD does not contain a typical carbohydrate-binding motif.
The CETM CTLD sequence contains a putative carbohydrate-binding motif (EPN), normally associated with mannose specificity [9].
 
XV Bimlec
Type I transmembrane protein with neck region and CTLD in extracellular region.
New group created as predicted in our Fugu whole-genome analysis [9], and supported by a database cDNA sequence (named ‘Bimlec’), linked to DEC-205.
Function unknown. Expressed as fusion protein with DEC-205 in Hodgkin's lymphoma cells [258].
XVI SEEC
Soluble protein containing SCP, EGF, EGF and CTLD domains (SEEC [9]). The sperm-coating glycoprotein (SCP) domain, which is present in organisms from yeast and plants to mammals, but whose function is unknown [259], is rarely observed in combination with other domains in proteins. SCP/CTLD combination is observed in only one other known protein – Nowa from hydra [260].
New group created as predicted in our Fugu whole-genome analysis [9]; well conserved between human and fish. Supported by available cDNAs.
Function unknown.
CTLD has potential Ca-carbohydrate-binding motif (QPD) characteristic of galactose specificity.
XVII CBCP/Frem1/QBRICK
Large proteoglycan (∼2100 residues) containing a set of chondroitin sulphate proteoglycan (CSPG) repeats (homologous to the NG2 ectodomain [261]), a calcium-binding Calx-β domain and a CTLD. CBCP (Calx-β and CTLD containing Protein [9]).
New group created as predicted in our Fugu whole-genome analysis [9]. Novel member of protein family not reported previously to have members containing CTLDs; examples of this family include human MCSP/CSPG4 [262] and mouse Fras1 [263] genes.
Gene independently discovered twice experimentally in mouse and called Frem1 [264] and QBRICK [265]. Found expressed widely in developing embryo in regions of epithelial/mesenchymal interaction and epidermal remodeling, and appeared to act as mediator of basement membrane adhesion [264]. Also found as adhesive ligand of basement membrane recognized by cells in embryonic skin and hair follicles through integrins [265].
CTLD of CBCP lacks Ca-binding residues, and its long loop region is short, resembling that of group V CTLDs.

Proteins with Link/PTR domain

Link domain or protein tandem repeat (PTR) is a special variety of CTLD, which lacks the long loop region. The major function of Link domains is binding hyaluronan. Although proteins containing it have different domain architecture, their number is small, and they have not been divided into subgroups. Group I CTLDcps contain both canonical and Link-type CTLDs. Other Link-domain containing proteins have four types of domain architecture (Fig. 7). The domain composition of Link proteins is similar to that of the N-terminal part of Group I (lecticans). Four Link protein-encoding genes have been identified in mammals, each physically linked with one of the lectican genes; this suggests that lecticans and Link proteins are of a common evolutionary origin [266]. Link proteins and lecticans are also functionally associated: cartilage Link proteins bind both aggrecan and hyaluronan stabilizing the proteoglycan/glycosaminoglycan network [267]. CD44 and its recently identified close homologue Lyve-1 are type I transmembrane molecules and cell surface receptors for hyaluronan [268,269]. Tumor necrosis factor-inducible protein (TSG-6) is a soluble protein with a CUB and a Link domain [270,271]. The structure of the latter in free and hyaluronan-bound states has been determined by NMR [20,272]. Stabilin-1 and -2 (also known as FEEL-1/-2 [273–275]) are scavenger receptors which have the capacity to internalize conventional scavenger ligands such as low density lipoprotein, bacteria and advanced glycation end products. Unlike Stabilin-2, Stabilin-1 does not bind hyaluronan or other glycosaminoglycan [276].

Figure 7.

Domain architecture of the proteins containing Link domain.

Criteria for classification of novel CTLDcps

The classification created by Drickamer has proven to be a useful instrument for studies of vertebrate CTLD evolution. Its wide acceptance makes it important to have a clear definition of the principles on which the classification is based. As discussed below, some of the apparently minor contradictions that were present in the original grouping often made assignment of newly discovered family members to the existing groups, as well as creation of the new groups, ambiguous and led to some confusion in the literature. While the arguments presented below may seem rather arcane, this critical review was inspired by the problems we routinely encountered in adding newly found mammalian CTLDcps to the classification while working on the whole-genome analysis of the family in Fugu [9].

First, it is not absolutely clear what the classification is based on – phylogenetic relationships between CTLDs, or the domain architecture of the proteins containing the CTLDs. Although the latter is generally considered to be the case, even Drickamer's [2] initial grouping contained a set of CTLDcps with identical domain architecture which was split into two groups – Group V and Group II – based on phylogenetic and functional considerations. At the same time, all soluble single-domain CTLDcps, which are only distantly related to each other and perform different functions, were included in one group (VII), which also contained single-CTLD proteins from invertebrates and lower vertebrates [2]. The problem of Group VII phylogenetic heterogeneity was resolved in the updated classification [6], where group VII was split into several groups: VII (lithostathine), IX (tetranectin), and XII (EMBP). Although there are differences in sequence composition and size of the regions outside the CTLD, these regions do not contain defined, conserved protein domains. For example, EMBP in processed functional form contains only a CTLD, while an acidic region, similar to EMBP's acidic pro-peptide, is found in a stem cell growth factor, which is a member of the tetranectin family. Thus, the phylogeny of the CTLD sequence, rather than overall domain architecture, is the defining factor in classification of these sequences. Sometimes domain architecture can be misleading, as in the case of scavenger receptor with a CTLD (SRCL), which contains a CTLD, a collagen domain and coiled-coil regions – a combination observed otherwise only in collectins. SRCL was first described as a placental collectin [165], although both the CTLD sequence and the presence of a transmembrane domain indicate that SRCL should be assigned to group II rather than group III; the group II re-assignment has been made in the online database described in the recently published revised classification [6].

However, the clearest example of using phylogenetic information as a classification basis is the division of type II transmembrane proteins with a single CTLD into groups II and V – a division not consistent with the general idea of using domain architecture as a basis for classification. Historically the division of Group II and V CTLDcps was made when the initial human ‘NK-cell cluster’ of genes was discovered; although having domain architecture analogous to Group II CTLDcps, such as asialoglycoprotein receptor, they appeared significantly different, for instance lacking a conserved Ca2+-binding site 2 and having an ITIM intracellular motif. However, as more members of both groups were discovered this basis for distinction broke down. On the other hand, phylogenetic analysis of CTLD sequence alignments strongly indicates that there are indeed two distinct sets of type II transmembrane CTLDcps in mammals, even though it is not possible to correlate this differentiation with any structural (properties of the neck region) or obvious functional (oligomerization pattern, tissue expression profile) properties. For example, although group V is often called ‘NK-cell receptors’, according to phylogenetic analysis it should also include proteins expressed on nonlymphoid cells, e.g. oxidized low density lipoprotein receptor (expressed on endothelial cells) or Dectin-1 (macrophages, neutrophils, dendritic cells). Also, there is no obvious distinction between domain structures of the group members which could be used as a basis for (sub)classification. Differentiation cannot be based on oligomerization properties, as some members of group V function as monomers (LOX-1, Dectin-1).

Therefore, it seems that the current paradigm of CTLD classification is based on phylogeny rather than domain architecture, and that the former is more evolutionarily stable than the latter. We followed this paradigm in annotation of the Fugu CTLDcps and in grouping the newly found mammalian proteins [9].

Clade-specific vertebrate CTLDcps

Although most of the CTLDcp groups are common to all studied vertebrates from fish to human, there are several groups that are specific to particular vertebrate clades. We found that of the 17 groups listed above, most of which were defined in the absence of sequence data from lower vertebrates, only groups V and VII are specific to mammals [9]. However, groups of vertebrate CTLDs that do not have clear homologues in mammals have been known for decades. These include snake CTLDcps and fish antifreeze proteins.

Clade-specific groups of vertebrate CTLDcps have several features in common. They comprise close homologues produced by recent gene duplication, and almost always lack the characteristic Ca2+-binding motifs in the CTLD sequence. Parallels can be drawn between expansion and functional diversification within these groups and the emergence and evolution of the specialized physiological systems they are involved in.

Snake CTLDcps

The largest and best-studied group of nonmammalian vertebrate CTLDcps is the so-called snake venom CTLDs. Based on their domain architecture (single CTLD with no other domains), these proteins are normally assigned to group VII. However, phylogenetic analysis performed by others [277,278] and ourselves shows that the snake venom CTLD group is phylogenetically heterogeneous, and that its members do not have orthologues among the members of the mammalian CTLD groups.

Snake CTLDcps can be divided into three subgroups: (a) the domain-swapped dimeric haemostasis modulators from Viperidae venoms, (b) phospholipase A2 inhibitors (PLI), and (c) sugar-binding CTLDcps. The domain-swapped subgroup is the largest one, with more than 80 known members [277]; these are very similar to each other but are not clear homologues of other vertebrate CTLDcps (Fig. 8). The group is characterized by a unique dimerization mechanism in which two monomers swap portions of the long loop region forming a stable functional unit (Fig. 2C). Dimers can further aggregate with each other to form higher-order oligomers, or, as in the case of the Russell's viper factor X activating enzyme [279], form covalently linked complexes with a metalloproteinase chain. Despite sharing the same general function, which is disruption of the victim's coagulation pathway leading to serious hemostatic disturbances, the members of the group have different targets and mechanisms of action. They act as either pro- or anticoagulants, or as agonists or antagonists of platelet aggregation.

Figure 8.

Phylogenetic relationships between clade-specific vertebrate CTLDs. The phylogenetic tree was built with the neighbour-joining method using the mega3 program [319] (pairwise distance estimation, Poisson correction, 1000 bootstrap) from a multiple alignment of CTLDs from clade-specific vertebrate CTLDcp groups discussed in the text and from mouse and human CTLDcps from groups III, IV (included as an outgroup), IX and XII. Bootstrap values are indicated for the key nodes discussed in text.
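
The distance measure named in the Figure 8 caption can be illustrated with a short sketch. Poisson correction converts the observed proportion p of differing aligned residues into an estimated number of substitutions per site, d = -ln(1 - p). The Python below is an illustrative assumption, not the mega3 implementation: the function names are hypothetical, gapped columns are skipped pair by pair (one of several possible gap treatments), and any input strings are expected to be pre-aligned.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over aligned columns without gaps.

    Assumption: columns with a gap in either sequence are ignored
    (pairwise deletion); other gap treatments are possible.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        # Every compared site differs: the correction is undefined (infinite).
        return math.inf
    return -math.log(1.0 - p)
```

For two aligned toy strings that differ at one of five positions, `p_distance` gives 0.2 and `poisson_distance` gives -ln(0.8), roughly 0.223. A matrix of such pairwise distances is the input that a neighbour-joining procedure would then cluster.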

The sugar-binding CTLDcps from snake venoms are not known to form domain-swapped dimers. The crystal structure of one of the subgroup members, rattlesnake venom lectin (RSL), revealed a decameric protein composed of two fivefold symmetric pentamers [65]. All members of the subgroup contain sequence motifs associated with Ca2+/carbohydrate binding, which in all but one case (CTLLP2 from Bungarus fasciatus) are associated with specificity for galactose. Although the physiological roles of the sugar-binding snake venom CTLDs have not been studied in detail, they are known to be potent haemagglutinins [280,281], and may contribute to haemostasis disruption induced by the other venom components. In our phylogenetic analysis, the sugar-binding CTLDcps are confidently placed in the same clade as the Reg4 protein (Fig. 8), and not with other mammalian members of group VII or with the domain-swapped CTLDcps, although Reg4 lacks the characteristic Ca2+-binding motifs. We found a homologous sequence from Xenopus laevis in the database, but no homologues were found in fish [9]. Therefore, Reg4 and snake sugar-binding CTLDs represent the ancestral group of single-CTLD proteins, which appeared after the split between the Actinopterygian and Sarcopterygian lineages, while domain-swapped snake venom CTLDs and the mammalian Reg gene cluster are clade-specific expansions of these (or closely related) genes.

Unlike the other two subgroups of snake CTLDs, PLIs are not a component of venom, but are thought to defend the snake against its own weapon – phospholipases A2, which are one of the major toxins of snake venoms. The first PLIs (α- and β-) were isolated from the blood plasma of Trimeresurus flavoviridis [282] and the subgroup now includes five members. Our phylogenetic analysis showed that the subgroup is not closely related to the other subgroups of snake lectins. In fact, the PLI clade was confidently (bootstrap values >70) placed within the collectin CTLDs (Fig. 8). Interestingly, PLI sequences contain an unusual ‘WIGL’ motif (YLVV), which is somewhat similar to the atypical motifs of collectins (e.g. FLSM and YVGL in human PSP-D and -A, respectively). Also of interest is the presence of a proline flanked by carbonyl residues in the long loop region (EPN/QPN motifs). However, the other positions that contribute to the Ca2+-binding at site 2 do not match the consensus. Finally, the discovery of a PLI-α homologue, which despite having 70% sequence identity to PLI-α from Trimeresurus flavoviridis lacks PLA2-inhibitory activity [283], suggests that PLIs may have broader functions.

Based on the occurrence of domain-swapped and sugar-binding CTLDs in the venoms of snakes from the Viperidae and Elapidae families, Fry and Wuster suggested that two different groups of CTLDs were recruited independently as venom toxins [278]. If self-protection from venom toxicity is indeed their main function, PLIs are the third group of CTLDcps that independently evolved to support a newly acquired clade-specific function. The abundance and functional diversity of the CTLDcps from these subgroups provide a good example of the suitability of the domain for rapid generation of new physiological activities, and a parallel to the independent expansion of the group V CTLDcps in mammals.

Other vertebrate CTLDcps

Two major groups of fish-specific CTLDcps have been identified. The first is the type II antifreeze proteins (AFPs) found in plasma of cold-water living fish species (sea raven [284,285], Atlantic herring [99], and smelt [73]). The most distinctive feature of the AFPs is the presence of five cysteine bridges in their structure; three of these correspond to the conserved cysteine bridges of the long form CTLDs [286]. Smelt and herring AFPs, whose sequences are 86% identical, contain motifs characteristic of Ca2+-binding site 2 and galactose specificity, and bind ice in a Ca2+-dependent manner [72], while sea raven AFP, which is only ∼40% identical to the other two AFPs, does not contain a Ca2+-binding site 2 motif, and does not require Ca2+ for ice crystal growth inhibition [287]. Apart from the three antifreeze proteins, a number of fish sequences have been found that have the same simple domain architecture and whose sequences are more similar to the sequences of AFPs than to any other vertebrate CTLDcp (Fig. 8); we classified them as AFP-like proteins [9]. The AFP-like sequences, however, do not contain the additional cysteine residues that characterize AFPs.

The second distinct group of fish CTLDcps that has not been found in other vertebrates is that of the dual-CTLD proteins of unknown function predicted by our Fugu genome analysis [9]. We suggested that these proteins may be homologous to the invertebrate proteins of similar domain structure, and that they appeared at the earliest stages of CTLD family evolution.

A group of single-domain CTLDcps has been found as a major component of bird eggshell matrix. Known members include ovocleidin-17 from chicken, ansocalcin from goose and two ostrich proteins, SCA-1 and SCA-2 [78,288,289]. These proteins are expressed and secreted into uterine fluid and are thought to control calcite crystal growth and aggregation during egg shell calcification [289]. None of the eggshell matrix CTLDcp sequences contains Ca2+-site 2 motifs, and the phylogenetic relationship of the group to other single-CTLD proteins is uncertain (Fig. 8). Interestingly, CTLDcps interacting with the Ca-rich mineral phase have been reported in very distant Metazoan species: mollusc [290], sea urchin [291] and mammals [77,210]. Of these, only mammalian lithostathine can be considered homologous to the bird sequences, so these disparate examples of CTLD involvement in calcite-containing biocomposite formation appear to be the result of convergent evolution.

Invertebrate CTLDcps

Whole-genome studies on Drosophila and C. elegans demonstrated that the CTLD superfamily is as abundant and diverse in invertebrates as it is in vertebrates. In the absence of adaptive immunity, the whole burden of antipathogen defense falls on the innate immune system, and recognition of characteristic carbohydrate structures is considered the most universal and effective method of distinguishing self from nonself. The need for an effective repertoire of defense lectins may be one explanation for the abundance of CTLDcps in invertebrates, which is especially noticeable when expressed in relative terms (in the C. elegans genome the CTLD is the 7th most abundant domain compared with 43rd in human). However, very little is known about the function of invertebrate CTLDcps and not much can be inferred, due to absence of detectable homology not only to the well-studied vertebrate CTLDcps but also among different invertebrate organisms.

Insect CTLDcps

The majority of characterized CTLDcps from insects are humoral defense proteins induced in response to injury. Immulectins-1 and -2 (IML-1 and IML-2) purified from the hemolymph of bacterially challenged larvae of the tobacco hornworm Manduca sexta [292–294] and proteins BmLBP from Bombyx mori [295] and Hdd15 from Hyphantria cunea [296] belong to a group of lepidopteran lipopolysaccharide-binding humoral pattern recognition receptors that share a common dual-CTLD domain architecture (group B, [8]) and some functional characteristics: expression in response to pathogen and induction of phenoloxidase defense pathways. Despite functional and structural similarity between the proteins from this group, sequence similarity is only moderate (∼20–30%), and it is not clear whether they are homologous. Moreover, it has been reported that the C-terminal CTLD of IML-1 is most similar to a CTLD from Periplaneta americana CTLDcp, which has a different domain architecture [292,297].

Other CTLDcps involved in immunity have single-CTLD domain architecture (group A1 or A3) and include Sarcophaga lectin from the flesh fly Sarcophaga peregrina, which was the first insect CTLDcp discovered [298,299] and is an acute phase galactose-binding defense protein expressed in response to injury [299] that may also be involved in developmental regulation [300]. The CTLD of Sarcophaga lectin contains an NPD motif in the position corresponding to the motif for Ca2+-binding site 2 contributed by the long loop region; this is unusual but consistent with the postulated hydrogen bond donor/acceptor distribution required for galactose specificity. A trimeric single-CTLD (151 amino acids) protein, Sarcophaga peregrina granulocytin, is constitutively synthesized and stored in granulocytes and released into hemolymph in response to injury in fly larva [301]. Similarly to Sarcophaga lectin, granulocytin activates murine macrophages by interacting with carbohydrate chains of an unidentified glycosylated receptor(s) on their surface [302]. Two Ca-dependent humoral lectins from the cockroach Blaberus discoidalis (BDL1 and BDL2; [303]) are involved in activation of the phenoloxidase system [304] and pathogen phagocytosis [305]; a BDL1 homologue was also isolated from the mosquito Anopheles stephensi [306].

An interesting example of CTLDs related to insect immunity is the CTLDcp CrV3 from symbiotic insect polydnaviruses; these are integrated into genomes of certain Braconid endoparasitic wasps and transmitted exclusively vertically in the integrated form. Viral particles are formed only in ovaries and injected into host hemocoel together with parasitoid egg; their presence is essential for the host immune response suppression that is required for survival of the parasitoid larva [307]. Interestingly, the lectin activity of CrV3, determined by erythrocyte agglutination assay, was dependent on Mg2+ and Mn2+, but not Ca2+ [this is consistent with the absence of the residue motifs associated with Ca2+-binding in CTLDs (KPS and LDN)], and was not inhibited by simple sugars or bacterial lipopolysaccharide.

Insect CTLDcps involved in cellular adhesion and developmental regulation were also reported. The fw (furrowed) Drosophila gene (assigned to group E2 by Dodd and Drickamer [8]; has a highly conserved Anopheles orthologue) encodes a protein with domain composition and organization very similar to those of vertebrate selectins (type I transmembrane protein with a CTLD and 10 complement-like repeats in the extracellular part). Although fw mutants have a distinct phenotype, with severe impairment of sensory organ development, the function of the protein and its CTLD are unknown [308]. Despite the similarity in domain organization between fw and selectins, homology between these sequences was not supported by our molecular phylogeny analysis, and both the CTLD and the complement repeat region sequences are significantly more similar to vertebrate proteins other than selectins (data not shown). Another insect CTLDcp with a potential role in control of development is regenectin, which is expressed selectively during cockroach tissue regeneration [309,310].

Other invertebrate CTLDcps

Findings of studies of CTLDcps from other invertebrates are similar to those for insect CTLDcps. The majority of characterized invertebrate CTLDs are humoral sugar-binding proteins, with demonstrated or suggested involvement in innate immunity (e.g. from mollusc [311], crustacean [312,313], flat worm [314], nematode [315]). Transmembrane and extracellular matrix CTLDcps putatively involved in cell adhesion have also been reported, as well as regulatory soluble molecules (e.g. from Porifera [25], flat worm [316], cnidarian [317]). Finally, CTLDs are a part of specialized proteins such as the Nowa protein involved in nematocyst formation in hydra [318] or the perlucin protein from the abalone nacre [290].

Conclusions and future directions

As for other Metazoan protein domains that were abundantly recruited during protein evolution in the different branches of the animal kingdom, the CTLD combines a robust fold-forming core, whose main features are highly conserved between even very distant homologues, with regions that can be substantially modified to generate new ligand specificities and oligomerization patterns. Comparison of numerous known CTLD structures combined with analysis of CTLD sequences from different animals reveals a set of highly conserved positions, which define the domain sequence signature and have clear structural roles [5]. At the same time, in some uncharacterized invertebrate CTLD sequences, such positions (including, for example, two absolutely conserved cysteines) are occupied by residues that are not compatible with these roles ([5], and Zelensky, Hulett and Gready, unpublished results). If these domains do, nonetheless, adopt the typical CTLD fold, then determining their structures would significantly extend our understanding of the ‘anatomy’ of the domain. On the other hand, there is at least one highly conserved position, the glycine from the WIGL motif, that so far cannot be ascribed any structural role by comparing the X-ray structures alone [5]. Given the very high conservation of this position, it would be very interesting to study its potential involvement in folding of the domain or some other dynamic processes required for CTLD stability.

Whole-genome studies and systematic analysis have allowed creation of a detailed picture of CTLD evolution and raised several interesting questions. The first two sequenced invertebrate genomes (D. melanogaster and C. elegans) showed that the CTLDcp repertoires of these species are drastically different from each other, and from the known vertebrate groups. On the other hand, the superfamily has undergone very few changes in the 450 million years of vertebrate radiation [9]. These observations have important implications for understanding of the origins and evolution of the functional systems CTLDcps are involved in, primarily innate immunity. The fish genome study suggested a tentative link between the vertebrate and invertebrate CTLD families. CTLDcps with domain architecture similar to the dual-CTLD group we found in several fish species are present in different invertebrates. However, sequence similarity within this putative group is low, and as the domain architecture is simple and could have arisen independently in different taxa by tandem domain duplication, the homology between these proteins could not be confidently established. Studies of CTLD genomics on fully or partially sequenced genomes of other organisms that occupy crucial positions in the animal evolution tree, such as the protochordate Ciona intestinalis or the primitive deuterostome Strongylocentrotus purpuratus, will allow the suggested link to be tested, as well as possibly illuminating new connections. Another important question concerns the origin of the superfamily. Although there are reasons to think that CTLDs first appeared after the emergence of multicellular animals, sporadic occurrences of the domain in plant and prokaryotic genomes deserve very close attention. Suggestions that these non-Metazoan genes were acquired by horizontal transfer need to be tested by phylogenetic analysis, and their origins traced. However, if phylogenetic studies suggest, for example, that the plant sequences are of independent origin, many current assumptions about the evolution of CTLDs, and the functions they are associated with, will need to be rethought.

The mechanisms underlying the specificity and affinity of Ca2+-dependent carbohydrate binding by several CTLDs have been studied in great detail, and a number of common features have been revealed. However, the shallow structure of the primary binding site, the relatively few contacts made between the protein and the bound ligand, and the possibility of alternate modes of monosaccharide binding at the primary site limit the power of sequence analysis in predicting binding specificity. Sequence determinants of the primary binding-site specificity for mannose or galactose-group monosaccharides, postulated early by Drickamer, and confirmed by numerous later studies, remain the only reasonably reliable means available for predicting binding specificity. Although a large amount of empirical data supports the hypothesis that the symmetric vs. asymmetric distribution of hydrogen bond donors and acceptors is responsible for discrimination between mannose and galactose group monosaccharides by the primary binding site, to our knowledge, there is no satisfactory complete explanation of the physicochemical basis of this phenomenon, which must also include energetic considerations. Further computational and experimental studies are needed to resolve this question, which is of general importance for the glycobiology field.
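
The motif-based rule of thumb referred to above can be written down as a small lookup. The sketch below is only an illustration of that rule, using the two long loop region motifs this review associates with a monosaccharide group (EPN with mannose, QPD with galactose); the function name and the 'unassigned' label are hypothetical, and for the reasons given in the preceding paragraph such a lookup is at best a first guess, not a reliable predictor.

```python
# Long loop region tripeptide motifs mentioned in this review and the
# monosaccharide group each is normally associated with. Assumption: any
# other tripeptide (e.g. EPS, EPH, NPD) is left unassigned rather than guessed.
MOTIF_TO_GROUP = {
    "EPN": "mannose",
    "QPD": "galactose",
}


def predict_monosaccharide_group(long_loop_motif):
    """Return 'mannose', 'galactose' or 'unassigned' for a tripeptide motif."""
    motif = long_loop_motif.strip().upper()
    if len(motif) != 3 or not motif.isalpha():
        raise ValueError("expected a three-letter amino acid motif")
    return MOTIF_TO_GROUP.get(motif, "unassigned")
```

Called with "EPN" the function returns "mannose", with "QPD" it returns "galactose", and with an atypical motif such as "EPS" it returns "unassigned", which mirrors the point that sequence alone says little once a CTLD departs from the two canonical motifs.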

Overall, the CTLD superfamily is an interesting and fruitful subject for systematic studies of the mechanisms of protein–carbohydrate interactions, as well as other aspects of protein structure and function evolution, such as fold adaptation to bind other ligands.
