The astacin protein family in Caenorhabditis elegans

Authors


R. Zwilling, Zoologisches Institut, Universität Heidelberg, Im Neuenheimer Feld 230, D-69120 Heidelberg, Germany. Fax: + 49 6221 544913, Tel.: + 49 6221 545887,
E-mail: RobertZwilling@t-online.de

Abstract

In the nematode Caenorhabditis elegans, 40 genes code for astacin-like proteins (nematode astacins, NAS). The astacins are metalloproteases present in bacteria, invertebrates and vertebrates, and serve a variety of physiological functions such as digestion, hatching, peptide processing, morphogenesis and pattern formation. With the exception of one distorted pseudogene, all C. elegans astacins are expressed and are evidently functional. For 13 genes we found splicing patterns differing from the Genefinder predictions in WormBase, sometimes markedly. The GFP expression pattern for NAS-4 shows a specific localization in anterior pharynx cells and in the whole digestive tract (as the secreted form). In contrast, NAS-7 is found in the head of adult hermaphrodites, but not in pharynx cells or in the lumen of the digestive tract. In embryos, NAS-7 fluorescence becomes detectable just before hatching. In C. elegans astacins, three basic structural and functional moieties can be discerned: a prepro portion, the central catalytic chain and long C-terminal extensions with presumably regulatory functions. Within the regulatory moiety, EGF-like, CUB, SXC, and TSP-1 domains can be distinguished. Based on structural differences of the regulatory unit we established six NAS subgroups, which seemingly represent different functional and evolutionary clusters. This pattern, deduced exclusively from the domain arrangement in the regulatory moiety, is perfectly reflected in an evolutionary tree constructed solely from amino acid sequence information of the catalytic chain. Related catalytic chains tend to have related regulatory extensions. Notably, the gene NAS-39 shows a striking resemblance to human BMP-1 and the tolloids.

Abbreviations
cDNA, complementary DNA
dsRNA, double-stranded RNA
EST, expressed sequence tag
GFP, green fluorescent protein
L1-4, larval stage 1–4
OST, open reading frame sequence tag
RNAi, RNA interference
RT-PCR, reverse transcription-polymerase chain reaction
NAS, nematode astacin

The first evidence for the existence of the astacin protein family can be traced back to 1967, when one of us (R. Zwilling) observed a proteolytic activity in the digestive fluid of the decapod crayfish Astacus astacus that was different from all other proteases known at that time [1]. Investigations of the cleavage and inhibition specificity confirmed this notion [2–4], and the elucidation of its unique amino acid sequence demonstrated definitively that the crayfish protease represented a new protein family [5]. In subsequent studies, the X-ray crystal structure of the Astacus protease, for which we proposed the denomination ‘astacin’, was solved to a resolution of 1.8 Å [6]. Astacin was recognized to be a metalloprotease exhibiting a penta-coordinated zinc ion in its active center [7]. In addition, the site of biosynthesis [8], genome organization [9], and mode of activation [10,11] have been elucidated, which made the crayfish protease a prototype for the astacins.

A second member of the astacin protein family was identified when Wang et al. and Wozney et al. (1988) studied the human bone-inducing factor BMP-1, into which a domain with high resemblance to crayfish astacin is inserted [12,13]. After that, many more astacin-like proteins or genes were described in rapid succession in vertebrates, invertebrates and even in prokaryotes [14], where they serve physiological functions as diverse as food digestion, hatching, peptide processing, morphogenesis and pattern formation (for an overview see [15]). In the crayfish Astacus astacus, a second astacin gene can be found in the embryo that is activated only during a narrow time window just before hatching [16].

In the model organism Caenorhabditis elegans, metalloproteases are present in great variety, as we have seen in data bank analyses (see also [17]). On the other hand, we have shown recently that the bulk of total proteolytic activity found in crude extracts of mixed-stage populations consists of acidic aspartyl proteases [18,19]. However, with regard to the number of expressed astacin genes, C. elegans surpasses any other organism studied so far. This investigation was therefore stimulated by the question of why this 959-cell organism would need more than 30 different and active astacin genes.

Materials and methods

Preparation of C. elegans

The C. elegans wild-type strain N2 variant Bristol was grown as a liquid culture in S-medium [20] supplemented with Escherichia coli OP50 as food source. The cultures were incubated at 18 °C for 6–8 days under vigorous shaking. When the E. coli food source appeared to have been nearly exhausted, the nematodes, representing a mixed population of adults, all four larval stages and eggs, were harvested and separated from bacteria as described elsewhere [20].

RNA purification

For the isolation of RNA, 100 µg fresh or frozen nematode pellets from a liquid culture were ground with a pestle in a mortar containing liquid nitrogen. Total RNA was extracted from the resulting powder following the protocol of Chomczynski and Sacchi [21]. Contamination by genomic DNA was avoided by treating total RNA with DNase I (RNase-free, Boehringer). Poly(A)-rich RNA was isolated by the Oligotex© mRNA procedure (Qiagen, Germany).

DNA purification

Genomic DNA was isolated from 1 mL fresh nematodes from a liquid culture using a standard protocol [22].

PCR amplification and cloning

Polyadenylated RNA (1 µg) was converted into single-stranded cDNA using a d(T)17 primer or a random hexamer primer as described [23]. For the amplification of the predicted astacin-like cDNA fragments specific oligonucleotide primers derived from the genome sequencing data were used. Primer sequences are available at http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig1.htm.

PCR amplification was performed on single-stranded cDNA, or on genomic DNA as a control, with 2 U high fidelity Taq DNA polymerase (Invitrogen, Germany) to diminish the mutation rate inherent to the PCR reaction. The cycling conditions were an initial denaturation at 94 °C for 3 min, followed by 35 cycles of 94 °C for 40 s, 55 °C for 40 s and 68 °C for 1 min per kb, and a final extension at 68 °C for 8 min. After PCR, samples were analyzed in 2% agarose gels, and discrepancies between the expected and observed size of any PCR product were readily detected on visual inspection of the gels. The PCR products were then excised from 1.5% agarose gels and purified with a NucleoSpin© gel-extraction kit (Macherey and Nagel, Germany). The purified fragments were subjected to the SureClone© Ligation procedure and cloned into a pUC18 vector according to the manufacturer's instructions (Pharmacia, Sweden).
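The cycling program above implies a run time that scales with amplicon length. The following is a minimal sketch of that arithmetic, assuming the 35 cycles apply to the denaturation, annealing and extension steps and ignoring ramp times between temperatures; the function name and the example amplicon lengths are illustrative and not part of the original protocol.

```python
def pcr_runtime_seconds(amplicon_kb, cycles=35):
    """Estimate the hold time of the cycling program (ramp times ignored)."""
    initial_denaturation = 3 * 60      # 94 °C for 3 min
    denaturation = 40                  # 94 °C for 40 s
    annealing = 40                     # 55 °C for 40 s
    extension = 60 * amplicon_kb       # 68 °C for 1 min per kb
    final_extension = 8 * 60           # 68 °C for 8 min
    per_cycle = denaturation + annealing + extension
    return initial_denaturation + cycles * per_cycle + final_extension


# Hypothetical amplicon lengths, chosen only for illustration.
for kb in (1.0, 2.0):
    minutes = pcr_runtime_seconds(kb) / 60
    print(f"{kb:.1f} kb amplicon: about {minutes:.0f} min of hold time")
```

Under these assumptions the extension step dominates for longer products, which is why the extension time is specified per kb rather than as a fixed value.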

Plasmid DNA was prepared and subsequently nucleotide sequences were determined by double-strand sequencing according to the dideoxynucleotide chain-termination method, using T7 DNA polymerase (Amersham, Sweden). Universal M13 primers were used for sequencing. All sequences have been deposited in EMBL/GenBank/DDBJ under accession numbers AJ561200, AJ561201, AJ561202, AJ561203, AJ561204, AJ561205, AJ561206, AJ561207, AJ561208, AJ561209, AJ561210, AJ561211, AJ561212, AJ561213, AJ561214, AJ561215, AJ561216, AJ561217, AJ561218, AJ561219, AJ561220, AJ561221.

GFP fusion genes for expression studies

The genomic sequence data in WormBase [24] were used to identify a genomic DNA fragment suitable for fusion to a GFP reporter gene. To make sure that the gene-specific promoter and all cis-elements necessary for guiding tissue-specific expression were included in the reporter, the whole upstream region between the gene of interest and the neighboring upstream gene was used. For PCR amplification of the genomic DNA fragment the forward primers NAS-4:GFP/SacI/F1 (5′-CGA GCT CTT GAG TGA AGA TGC CAA GA-3′), NAS7:GFP/BamHI/F1 (5′-CGG GAT CCT TCC GCC AAA GTC ATT TAG-3′), NAS-15:GFP/PstI/F1 (5′-AAC TGC AGC TTT TCG GAA GAC TTT TGC-3′), NAS33:GFP/KpnI/F1 (5′-GGG GTA CCC CGG ACC ACA GTA AAG AAT-3′) and the corresponding reverse primers NAS4:GFP/KpnI/R1 (5′-GGG GTA CCC TGA CAC GCT GAC CCA TAC-3′), NAS7:GFP/KpnI/R1 (5′-GGG GTA CCC GATC CTC GCA TTC TA-3′), NAS15:GFP/KpnI/R1 (5′-GGG GTA CCC GCT GGG TAG TGG AGT TG-3′), NAS33:GFP/SacI/F1 (5′-CGA GCT CTG ACA AGA AAG GCA CAA AG-3′) were used. An 8–10 kb PCR fragment containing approximately 3–5 kb of upstream sequence down to the last 30–50 codons of the astacin gene was fused in frame to the reporter gene GFP. Thus, the intergenic region as well as the protein coding regions of the astacin-like genes NAS-4, NAS-7, NAS-15 and NAS-33 were amplified with 2 U Elongase© DNA polymerase (Invitrogen, Germany), gel purified (NucleoSpin© gel-extraction kit, Macherey and Nagel, Germany) and cloned in frame into a pBD95.85 vector (having the S65C mutation and artificial introns to increase the expression of GFP; A. Fire Vector Kit, Baltimore, USA) according to standard protocols [23,25]. The molecular details of all fusion constructs are available on request. Each construct, together with the marker plasmid pBx, was introduced into pha-1 hermaphrodites, and the worms carrying the constructs as extrachromosomal arrays were isolated at 25 °C and observed for GFP fluorescence under a Zeiss Axiovert 200 microscope.
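Each primer name above includes the restriction enzyme whose site was added for cloning. A quick consistency check, sketched below, confirms that every primer sequence contains the recognition sequence of the enzyme it is named after. The primer sequences are copied from the text; the recognition sequences are the standard ones for SacI, BamHI, PstI and KpnI, and the short labels and helper name are illustrative.

```python
# Standard recognition sequences (5'->3') of the enzymes named in the primers.
SITES = {"SacI": "GAGCTC", "BamHI": "GGATCC", "PstI": "CTGCAG", "KpnI": "GGTACC"}

# (label, enzyme, sequence as printed in the text); labels are illustrative.
PRIMERS = [
    ("NAS-4 forward", "SacI", "CGA GCT CTT GAG TGA AGA TGC CAA GA"),
    ("NAS-7 forward", "BamHI", "CGG GAT CCT TCC GCC AAA GTC ATT TAG"),
    ("NAS-15 forward", "PstI", "AAC TGC AGC TTT TCG GAA GAC TTT TGC"),
    ("NAS-33 forward", "KpnI", "GGG GTA CCC CGG ACC ACA GTA AAG AAT"),
    ("NAS-4 reverse", "KpnI", "GGG GTA CCC TGA CAC GCT GAC CCA TAC"),
    ("NAS-7 reverse", "KpnI", "GGG GTA CCC GATC CTC GCA TTC TA"),
    ("NAS-15 reverse", "KpnI", "GGG GTA CCC GCT GGG TAG TGG AGT TG"),
    ("NAS-33 reverse", "SacI", "CGA GCT CTG ACA AGA AAG GCA CAA AG"),
]


def has_site(sequence, enzyme):
    """Return True if the primer contains the enzyme's recognition sequence."""
    return SITES[enzyme] in sequence.replace(" ", "").upper()


for label, enzyme, sequence in PRIMERS:
    status = "ok" if has_site(sequence, enzyme) else "MISSING"
    print(f"{label:15s} {enzyme:6s} {status}")
```

In every case the site sits near the 5′ end of the primer, preceded by one or two extra bases, which is a common way to help the enzyme cut close to the end of a PCR product.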

Sequence analysis and phylogenetic studies

To identify metalloprotease genes in the genome of C. elegans, we used representative vertebrate and insect proteins, or their conserved domains according to the PFAM [26] and PRINTS [27] databases, as queries for BLAST searches [28,29] of WormBase [24]. For astacin genes, the astacin domain, the zinc binding motif or the Met-turn sequences, as listed by PRINTS, were used to repeatedly screen the whole C. elegans genomic sequence available from WormBase.
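As a simplified illustration of motif-based screening, the sketch below scans a protein sequence for the two astacin signatures mentioned above. It is not the BLAST procedure used in this work. The consensus patterns (HEXXHXXGFXHEXXRXDR for the zinc binding motif and SXMHY for the Met-turn) are those commonly cited in the astacin literature rather than quoted from this text, and the test sequence is a toy construct, not a real protein.

```python
import re

# Consensus signatures as commonly cited for astacins (X = any residue).
# These patterns are an assumption of this sketch, not taken from the text.
MOTIFS = {
    "zinc_binding": re.compile(r"HE..H..GF.HE..R.DR"),
    "met_turn": re.compile(r"S.MHY"),
}


def find_motifs(protein):
    """Return the 0-based start positions of each motif in a protein sequence."""
    protein = protein.upper()
    return {name: [m.start() for m in pattern.finditer(protein)]
            for name, pattern in MOTIFS.items()}


# Toy sequence built only to exercise the patterns (hypothetical).
toy = "AAA" + "HELMHAIGFYHEHTRMDRD" + "AAAA" + "SIMHY" + "AAA"
hits = find_motifs(toy)
print(hits)
```

A regular expression only finds exact matches to the consensus, whereas a BLAST or profile search also tolerates substitutions, so a scan like this would be expected to miss divergent family members.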

DNA sequences of all astacin genes were further analyzed using the HUSAR package [30], and the predicted gene structures were compared to the Genefinder predictions as annotated in WormBase and to the alternative GenieGene open reading frame predictions of Kent and Zahler [31]. The splicing patterns were subsequently refined using the EST/OST sequences available in the latest WormBase release (WormBase97, 7 March 2003) and the cDNA sequences resulting from this work. Discrepancies between the WormBase and GenieGene predictions and our own cDNA sequences were communicated to those annotating the sequences (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm). The corrected cDNA sequences were translated into amino acid sequences using the HUSAR package and aligned using CLUSTAL [32]. Where splicing patterns remained unconfirmed, we used for further analysis those protein predictions that agreed with the protein family alignment, that is, those showing no exceptional insertions, deletions or frame shifts (EMBL:ALIGN_000543).
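The translation step can be illustrated with a minimal sketch based on the standard genetic code. This is a stand-in for the HUSAR tools, whose interface is not described here; the function names and example sequences are hypothetical. Counting internal stop codons is one simple way to flag a frame shift or a wrongly predicted splice site.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cdna):
    """Translate a cDNA in frame 1; unknown codons become 'X', stops become '*'."""
    cdna = cdna.upper().replace("U", "T")
    usable = len(cdna) - len(cdna) % 3   # ignore an incomplete trailing codon
    return "".join(CODON_TABLE.get(cdna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def internal_stops(protein):
    """Count stop codons other than those at the very end of the translation."""
    return protein.rstrip("*").count("*")


# Hypothetical toy sequences, not taken from any astacin gene.
print(translate("ATGGCCTGA"))                     # intact open reading frame
print(internal_stops(translate("ATGTAAGCCTGA")))  # one premature stop
```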

For identification and annotation of protein domains and the analysis of domain architectures the tools of the SMART [33], PFAM [26], ProDom [34] and INTERPRO [35] protein domain databases were used.

For phylogenetic studies the active protease domains, covering the region from Ala-1 to Leu-200 in the prototype crayfish astacin, from the C. elegans astacins and selected other astacin family members were aligned using CLUSTAL [32] and imported into GeneDoc [36] for further manipulation. The alignment is available from the EMBL database under accession number ALIGN_000543. Phylogenetic analyses were carried out using the neighbor-joining method and the Bayesian phylogenetic method. For neighbor-joining analysis the PHYLIP 3.5 software package [37] was used. Distances between the pairs of protein sequences were calculated and corrected for multiple changes according to the PAM001 distance matrix. The reliability of the tree was tested by bootstrap analysis with 100 replications. Bayesian phylogenetic analysis [38,39] was performed with the MrBayes 3.0beta4 program [40] using the WAG matrix [41] and assuming a gamma distribution of substitution rates. Prior probabilities for all trees and amino acid replacement models were equal; the starting trees were random. Metropolis-coupled Markov chain Monte Carlo sampling was performed with one cold and three heated chains that were run for 50 000 generations. Trees were sampled every 10th generation. Posterior probabilities were estimated on 2000 trees (burnin = 3000). The tree presented here was visualized using TreeView [42].
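The number of trees used for the posterior probabilities follows directly from the sampling settings. A minimal sketch of this bookkeeping, assuming the burn-in is counted in sampled trees and the starting tree is not counted as a sample (the function name is illustrative):

```python
def retained_trees(generations, sample_every, burnin_samples):
    """Number of sampled trees left after discarding the burn-in."""
    sampled = generations // sample_every
    if burnin_samples > sampled:
        raise ValueError("burn-in exceeds the number of sampled trees")
    return sampled - burnin_samples


# Settings as stated in the text: 50 000 generations, every 10th sampled,
# burn-in of 3000 sampled trees.
print(retained_trees(50_000, 10, 3000))
```

With these settings 5000 trees are sampled in total, so the burn-in discards the first 60% of the run.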

Results and discussion

Astacin homologue proteins in C. elegans

During a preliminary data base survey we observed in 1996 that the 959-cell organism C. elegans accommodates a surprising number of gene sequences coding for astacin-like proteins, while for other species with a much larger genome not more than 2–3 astacin genes had been reported (G. Geier and R. Zwilling, unpublished).

The complete sequencing of the 97 megabase genome of C. elegans by the C. elegans Sequencing Consortium in 1998 [43] then made a thorough analysis possible. The latest WormBase release (WormBase97, 7 March 2003) now contains 21 437 coding sequences when counting 1891 alternate splice forms. Of these, the MEROPS protease database (latest release 6.11: 20 January 2003) lists 382 protease genes (E.C.3.4), of which 158 genes belong to the group of metalloproteases (E.C.3.4.24). The metalloproteases of C. elegans can be arranged into 11 protein clans and subdivided into 27 protein families, according to the nomenclature of Barrett et al. [44]. Our own BLAST searches in WormBase, using protein family consensus sequences according to the PFAM or PRINTS databases as queries, revised the number of identified genes provisionally listed by MEROPS (see Table 1). BLAST searches based on the whole astacin domain, the zinc binding motif or the Met-turn sequence revealed further astacin genes in C. elegans in addition to those listed by MEROPS so far, bringing the total number of astacin genes in C. elegans to 40 (Tables 1 and 2).
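The totals quoted for Table 1 can be cross-checked with a short tally. The sketch below encodes the per-family gene counts as read from Table 1 (family names shortened to their MEROPS identifiers) and sums them by clan; the grouping of families into clans follows the table layout and should be treated as this sketch's reading of it.

```python
# Gene counts per protease family, grouped by clan, as read from Table 1.
METALLOPROTEASES = {
    "MA(E)": {"M1": 12, "M2": 1, "M3A": 2, "M13": 23, "M41": 3},
    "MA(M)": {"M8": 1, "M10A": 6, "M12A": 40, "M12B/C": 10},
    "MC": {"M14A": 9, "M14B": 3},
    "ME": {"M16A": 5, "M16B": 3, "M16X": 2},
    "MF": {"M17": 2},
    "MG": {"M24A": 5, "M24B": 3},
    "MH": {"M18": 1, "M20A/B": 5, "M28B": 2, "M28X": 4},
    "MJ": {"M38": 1},
    "MK": {"M22": 2},
    "MM": {"M50": 1},
    "MX": {"M48A": 1, "M49": 1, "M67": 3},
}

n_clans = len(METALLOPROTEASES)
n_families = sum(len(families) for families in METALLOPROTEASES.values())
n_genes = sum(sum(families.values()) for families in METALLOPROTEASES.values())
astacin_share = METALLOPROTEASES["MA(M)"]["M12A"] / n_genes

print(n_clans, n_families, n_genes)
print(f"astacins (M12A): {astacin_share:.0%} of metalloprotease genes")
```

The tally reproduces the 11 clans, 27 families and 151 genes stated in the text and the table caption, and shows that the astacin family alone accounts for roughly a quarter of all metalloprotease genes.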

Table 1. One hundred and fifty-one genes coding for metalloproteases in C. elegans. Identification of genes was based on data available in MEROPS (The protease database, release 6.11: 20 January 2003, http://merops.sanger.ac.uk) and subsequently corrected by BLAST searches using the genome sequencing data of C. elegans. Nomenclature is according to Barrett et al. [44].
Clan | Protease family | Number of genes
MA(E) | M1 aminopeptidase | 12
MA(E) | M2 peptidyl-dipeptidase | 1
MA(E) | M3A oligopeptidase | 2
MA(E) | M13 neprilysin | 23
MA(E) | M41 E. coli endopeptidase | 3
MA(M) | M8 leishmanolysin | 1
MA(M) | M10A MMP | 6
MA(M) | M12A Astacin | 40
MA(M) | M12B/C ADAM | 10
MC | M14A carboxypeptidase A | 9
MC | M14B carboxypeptidase E | 3
ME | M16A pitrilysin | 5
ME | M16B mitochondrial processing peptidase | 3
ME | M16X | 2
MF | M17 leucyl aminopeptidase | 2
MG | M24A methionyl aminopeptidase I | 5
MG | M24B aminopeptidase P | 3
MH | M18 aminopeptidase I | 1
MH | M20A/B glutamate carboxypeptidase | 5
MH | M28B aminopeptidase Y | 2
MH | M28X | 4
MJ | M38 beta-aspartyl dipeptidase | 1
MK | M22 O-sialoglycoprotein endopeptidase | 2
MM | M50 S2P protease | 1
MX | M48A Ste24 endopeptidase | 1
MX | M49 dipeptidylpeptidase | 1
MX | M67 proteasome regulatory subunit RPN11 | 3
Table 2. Denomination of astacin genes in C. elegans, database entries, approximate genetic map position and matching EST or OST clones (see WormBase release 94, 24 January 2003 [24]). For RT-PCR sequences (fmNAS-x) resulting from this work see http://www.zoo.uni-heidelberg.de/moehrlen. For further explanation see text.
Gene name | Wormpep name | EMBL/GenBank | Genetic map position | EST/OST | RT-PCR sequencing | Comment
NAS-1 | F45G2.1 | Z93382 | III:22.1 | OST | | Aberrant splice, corrected full-length sequence (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm)
NAS-2 | F56A4.1 | AC006645, AC006722 | V:13.99 | | No PCR product | Expression confirmed by microarrays only, translation fits best with GenieGene prediction g-V-409
NAS-3 | K06A4.1 | Z70755 | V:1.98 | | fmNAS-3 | cDNA fits best with GenieGene prediction g-V-1836
NAS-4 | C05D11.6 | U00048 | III:1.33 | OST | fmNAS-4 | cDNAs fit best with GenieGene prediction g-II-1042
NAS-5 | T23H4.3 | Z83240 | I:4.03 | | No PCR product | Expression confirmed by microarrays only
NAS-6 | 4R79.1 | AL031254 | IV:30.16 | | fmNAS-6 | Translation fits best with GenieGene prediction g-IV-3005
NAS-7 | C07D10.4 | U13072 | II:0.41 | | fmNAS-7 | cDNA fits best with GenieGene prediction g-II-1703
NAS-8 | C34D4.9 | U58755 | IV:3.29 | EST | fmNAS-8 |
NAS-9 | C37H5.9b | U88315 | V:6.52 | EST | | Full-length sequence is confirmed by overlapping cDNAs
NAS-10 | K09C8.3 | Z68006 | X:2.51 | EST | |
NAS-11 | K11G12.1 | U23525 | X:2.66 | EST | |
NAS-12 | C24F3.3 | Z81055 | IV:4.54 | | fmNAS-12 |
NAS-13 | F39D8.4 | Z69791 | X:21.46 | | fmNAS-13 | Translation fits best with GenieGene prediction g-X-2412
NAS-14 | F09E8.6 | Z73896 | IV:8.02 | | fmNAS-14 | Translation fits best with GenieGene prediction g-IV-2471
NAS-15 | T04G9.2 | U41274 | X:19.12 | EST | fmNAS-15 | cDNAs and translation fit best with GenieGene prediction g-X-2732
NAS-16 | K03B8.1 | Z74039 | V:3.16 | | No PCR product | Expression confirmed by microarrays only
NAS-17 | K03B8.2 | Z74039 | V:3.16 | | No PCR product | Expression confirmed by microarrays only
NAS-18 | K03B8.3 | Z74039 | V:3.16 | | No PCR product | Expression confirmed by microarrays only
NAS-19 | K03B8.5 | Z74039 | V:3.16 | | fmNAS-19 |
NAS-20 | T11F9.3 | Z74042 | V:3.2 | | fmNAS-20 | cDNA and translation fit best with GenieGene prediction g-V-2325
NAS-21 | T11F9.5 | Z74042 | V:3.2 | | fmNAS-21 | Aberrant splice, corrected sequence (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm)
NAS-22 | T11F9.6 | Z74042 | V:3.2 | | fmNAS-22 | Aberrant splice, corrected sequence (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm), translation fits best with GenieGene prediction g-V-2327
NAS-23 | D1022 unassigned | U23517 | II:0.45 | | fmNAS-23 | Not in WormBase, predicted using Genscan (see http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm)
NAS-24 | F20G2.4 | Z79753 | V:5.42 | | fmNAS-24 | Translation fits best with GenieGene prediction g-V-2804
NAS-25 | F46C5.3 | Z54281 | II:0.92 | EST | fmNAS-25 |
NAS-26 | T24A11.3 (toh-1) | Z49072 | III:4.54 | EST | cDNA | Fits best with GenieGene prediction g-V-483
NAS-27 | T23F4.4 | AF025466 | II:13.27 | | fmNAS-27 |
NAS-28 | F42A10.8 | U10414 | III:1.38 | OST | fmNAS-28 | Aberrant splice, corrected full-length sequence is confirmed by overlapping cDNAs (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm)
NAS-29 | F58A6.4 | U53339 | II:1.98 | | fmNAS-29 | Translation fits best with GenieGene prediction g-II-1160
NAS-30 | Y95B8 A1 | AC024877 | I:20.88 | | No PCR product | Expression confirmed by microarrays only
NAS-31 | F58B4.1 | Z74038 | V:2.87 | EST | fmNAS-31 | cDNAs fit best with GenieGene prediction g-V-2200, possible alternative splice site
NAS-32 | T02B11.7 | AF022979 | V:19.07 | EST | fmNAS-32 |
NAS-33 | K04E7.3 | U39666 | X:2.93 | | fmNAS-33 |
NAS-34 | F40E10.1 (hch-1) | D85744, Z69792 | X:19.9925 | EST | | Full-length cDNA confirmed by Hishida et al.
NAS-35 | R151.5 (toh-2) | U00036 | III:0.76 | EST | | Full-length sequence is confirmed by overlapping cDNAs
NAS-36 | C26C6.3 | Z72503 | I:2.05 | EST | |
NAS-37 | C17G1.6 | Z78415 | X:1.48 | EST | |
NAS-38 | F57C12.1 | U41554 | X:19.47 | EST | |
NAS-39 | F38E9.2 | U46668 | X:23.83 | EST | |
NAS-40 | F54B8 unassigned | Z93383 | V:9.77 | | No PCR product | Not in WormBase, pseudogene

The nomenclature proposed for these 40 C. elegans astacin genes is in accordance with suggestions of the C. elegans Sequencing Consortium. In Table 2 we have numbered these C. elegans astacins (nematode astacins, NAS) from 1 to 40. The two proteins NAS-23 and NAS-40 (located on cosmids D1022 and F54B8, respectively) are not recorded in the WormPep database (predicted proteins from WormBase) but could be detected by a genomic TBLASTN search and the use of the program GENSCAN. However, for NAS-40 GENSCAN did not predict a complete protein but rather an 88 amino acid fragment that is interrupted by two stop codons.

Hishida et al. [45] reported that HCH-1 (= F40E10.1, NAS-34) is required for normal hatching and neuroblast migration in C. elegans. For all other astacin genes, no details were known beyond the Genefinder protein predictions in WormBase and the partial transcription analysis by the EST or open reading frame sequence tag (OST) projects. As a first step, it was therefore indispensable to confirm the existence of expression products for each gene.

Transcriptome analysis

Comparing all genomic DNA sequences of astacin genes identified by our BLAST search to the cDNA data of WormBase it became evident that for 12 of the total of 40 genes EST or OST clones [46,47] were already known (WormBase release 57, 17 December 2001). This confirmed that the 12 genes in question were expressed on the mRNA level.

The remaining 28 genes were analyzed by RT-PCR followed by sequencing of the DNA fragments in order to demonstrate their transcriptional activity. For each gene, specific primer pairs were synthesized, the gene fragments amplified by PCR and the products analyzed on agarose gels (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig1.htm). In each case the PCR reaction with reverse-transcribed RNA was accompanied by a control reaction with genomic DNA. Introns within the amplified DNA regions gave rise to correspondingly larger genomic DNA fragments when compared to the cDNA fragments. For unambiguous identification and for the correction of erroneous splicing pattern predictions, the PCR products of all DNA fragments were eluted from an agarose gel, blunt-end cloned into the vector pUC18 and subsequently sequenced (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm).
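The size comparison between the cDNA product and the genomic control can be made explicit with a small calculation. The sketch below assumes, for simplicity, that the primers sit at the outer ends of the first and last exon listed; the exon and intron lengths are hypothetical and do not describe any particular astacin gene.

```python
def expected_product_sizes(exon_lengths, intron_lengths):
    """Return (cDNA size, genomic size) in bp for the amplified region."""
    cdna_size = sum(exon_lengths)
    genomic_size = cdna_size + sum(intron_lengths)
    return cdna_size, genomic_size


# Hypothetical gene model: three exons separated by two introns.
cdna_bp, genomic_bp = expected_product_sizes([120, 200, 150], [50, 300])
print(f"cDNA product: {cdna_bp} bp, genomic control: {genomic_bp} bp")
print(f"size shift caused by introns: {genomic_bp - cdna_bp} bp")
```

A cDNA product that runs at the genomic size would point to genomic DNA contamination or to an intron-free region, which is one reason the genomic control is run alongside each reaction.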

In combination with the recently available EST and OST sequences (WormBase release 97, 7 March 2003) we found for 13 genes (Table 2) splicing patterns differing from the Genefinder predictions in WormBase, sometimes markedly. In these cases, the experimental cDNA transcripts were in good accordance with the alternative GenieGene open reading frame predictions of Kent and Zahler [31] (Table 2 and http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm). For NAS-1, NAS-21, NAS-22 and NAS-28 we observed splice sites that deviate from both the Genefinder and the GenieGene predictions. The manually corrected cDNA sequences can be found at http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm. All new sequence data including corrected gene structures have been submitted to WormBase and the EMBL/GenBank/DDBJ databases (for accession numbers, see footnote). The genes NAS-2, NAS-5, NAS-16, NAS-17, NAS-18 and NAS-30 showed no apparent PCR product in our RT-PCR analysis (Table 2, http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm). However, the microarray projects of Hill et al. [48,49], Kim et al. [50], and Jiang et al. [51] (for an overview see WormBase) support the expression of these genes. We would like to point out that microarray analysis cannot reliably verify either the identity or the splicing pattern of a gene, because no sequence data are produced.

Nevertheless, in summary, transcriptional activity could be confirmed for all 39 astacin genes other than the pseudogene NAS-40.
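The evidence behind this summary can be tallied from Table 2. The sketch below groups the 40 genes by the strongest kind of expression evidence reported there; the set names are illustrative, and "sequence-level" is used here as shorthand for genes with an EST/OST clone or an RT-PCR sequence.

```python
ALL_GENES = {f"NAS-{i}" for i in range(1, 41)}

# Genes without an RT-PCR product, supported by microarray data only (Table 2).
MICROARRAY_ONLY = {"NAS-2", "NAS-5", "NAS-16", "NAS-17", "NAS-18", "NAS-30"}
PSEUDOGENE = {"NAS-40"}

# Everything else has an EST/OST clone or an RT-PCR sequence in Table 2.
sequence_level = ALL_GENES - MICROARRAY_ONLY - PSEUDOGENE
expressed = ALL_GENES - PSEUDOGENE

print(f"sequence-level evidence: {len(sequence_level)} genes")
print(f"microarray evidence only: {len(MICROARRAY_ONLY)} genes")
print(f"expressed in total: {len(expressed)} of {len(ALL_GENES)} genes")
```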

Functional analysis

We attempted to analyze the function of selected astacin genes in C. elegans by investigating the expression pattern of four representative astacin genes from different subgroups (see section on Structural and phylogenetic analysis, Fig. 2) using GFP-fusion constructs. All astacin-GFP fusions were assayed for expression in animals from embryonic stages onwards. At least three independent transgenic lines were generated from at least two independent clones of each of the astacin-GFP fusion constructs to control for PCR-induced sequence errors. The reporter gene fusions NAS-15::GFP and NAS-33::GFP failed to give detectable expression in any life stage. The fusion protein NAS-4::GFP showed extensive GFP fluorescence throughout the digestive tract in larval stages and in adult worms (Fig. 1A). At higher magnification, we saw GFP staining within pharynx cells of the procorpus, metacorpus, isthmus and terminal bulb, and extracellular staining in the lumen of the terminal bulb (Fig. 1B, arrows). Therefore, NAS-4 is most likely secreted by the pharynx cells into the lumen and is then found in secreted form all the way down the lumen of the gut. We conclude from this expression pattern that NAS-4 is associated with digestive functions. Of special interest is the observation that NAS-4 and the digestive enzyme astacin from crayfish [8] have a similar domain arrangement, both lacking a C-terminal extension (see section on Structural and phylogenetic analysis). They also cluster in the phylogenetic tree (Fig. 3), suggesting that they have similar functions. These considerations might be extended to the whole subgroup I (Fig. 2, NAS-2–6), which shares these features.

Figure 2.

Schematic representation of homologues and domain structures in astacin genes in C. elegans. Pre-pro sequences, catalytic domain and presumably regulatory appendices. Diagram scale is related to amino acid length. Presequences, purple shaded boxes; prosequences, grey oval; astacin domain, red box; six cysteines, SXC; EGF-like, yellow oval; CUB domains, CUB; thrombospondin-1 like, TSP1; low complexity sequences, striped boxes; not specified, open boxes.

Figure 1.

GFP expression pattern images for NAS-4 (A, B) and NAS-7 (C, D). (A) Extensive GFP fluorescence throughout the digestive tract in an adult hermaphrodite and an L2 larva for a NAS-4::GFP fusion gene; 100 × magnification. (B) Higher magnification of the head of an adult hermaphrodite showing GFP expression for the same construct in pharynx cells and in the lumen of the terminal bulb; 400 × magnification. (C) GFP expression of a NAS-7::GFP fusion gene is found in the head of adult hermaphrodites, but not in pharynx cells or in the lumen of the digestive tract; 300 × magnification. (D) In embryos NAS-7::GFP reporter gene fluorescence became detectable just before hatching; 400 × magnification.

Figure 3.

Phylogenetic relationship of the astacins, including all C. elegans astacin proteins (shaded yellow) and selected examples from other organisms. The tree was deduced by Bayesian and neighbor-joining analysis based on the alignment of the amino acid sequences of the catalytic chain. At branching points, Bayesian posterior probabilities and bootstrap values greater than 50 of 100 replications (values in parentheses) are given as an indication of the confidence in the tree presented. The scale bar represents a distance of 0.1 accepted point mutations per site (PAM). Evolutionary subgroups of the astacin protein family are indicated on the right side. The schematic representation of the protein domains (colored bars) corresponds to that in Fig. 2. Meprin domains: MAM domain, MAM; MATH domain, MATH; I-domain, I; intervening sequence, inter; transmembrane domain, TM; cytoplasmic domain, c. For an overview, see [66]. Abbreviations and Swissprot/TREMBL/PIR accession numbers of the astacins: AA Astacin, Astacus astacus (crayfish) astacin (P07584); AC TBL-1, Aplysia californica TBL-1 (P91972); AJ EHE-4, Anguilla japonica (fish) EHE-4 (Q90Y89); CC Nephrosin, Cyprinus carpio (fish) Nephrosin (O42326); DM Tolloid and Tolkin, Drosophila melanogaster Tolloid (P25723) and Tolkin (Q23995); FM Flavast, Flavobacterium meningosepticum Flavastacin (Q47899); HS BMP-1, Homo sapiens bone morphogenetic protein 1 (Q14874); HS Meprin A and B, Homo sapiens Meprin α (Q16819) and β (Q16820); HS TLL and TLL-2, Homo sapiens Tolloid like 1 (Q9NQS4) and 2 (Q9UQ00); HV HMP-2, Hydra vulgaris (Cnidaria) Metalloprotease 2 (Q9XZG0); MM BMP-1, Mus musculus BMP-1 (I49540); MM Meprin A and B, Mus musculus Meprin α (P28825) and β (Q61847); OL LCE and HCE-1, Oryzias latipes (fish) low choreolytic enzyme (P31581) and high choreolytic enzyme 1 (EMBL:M96170); PC PMP-1, Podocoryne carnea (Cnidaria) Metalloprotease 1 (O62558PL); PL BP-10, Paracentrotus lividus (sea urchin) blastula protease 10 (P42674); SP BMP-H, Strongylocentrotus purpuratus (sea urchin) BMP-1 homolog (P98069); SP SPAN, Strongylocentrotus purpuratus (sea urchin) SPAN (P98069); TR MP, Takifugu rubripes (fish) HCE-1 (AAL40376); XL BMP-1, Xenopus laevis BMP-1 (P98070).

By contrast, NAS-7::GFP staining was observed only in the head of adult hermaphrodites, but not within pharynx cells (Fig. 1C). The expressing cells are located outside of the pharynx, around the metacorpus and the terminal bulb, and could include neurons, cells of the excretory system or gland cells of still unknown function [20]. Reporter gene expression also became detectable in the embryo before hatching (Fig. 1D). While the function of the gene expressed in the adult remains open at present, in the embryo it could possibly serve as a hatching enzyme.

To further characterize the function of astacin genes in C. elegans we examined the genome-wide RNAi analyses of Gonczy et al. [52], Fraser et al. [53], Maeda et al. [54], Kamath et al. [55,56], Ashrafi et al. [57], Lee et al. [58] and Pothof et al. [59]. Although nearly all astacin genes have been investigated for gene silencing by RNAi, most of them lack an obvious phenotype and no function could be deduced from the attempted inactivation. Whether this reflects incomplete dsRNA interference or a redundancy in function among the high number of expressed astacin genes remains to be established. Strong RNAi phenotypes were observed for NAS-9, -11 and -37 only, revealing these three astacin genes to be essential. Inactivated NAS-9 showed 6% embryonic lethality [54], NAS-11 showed retarded growth [56] and NAS-37 showed a long body phenotype and a molt defect [54,56]. As a rule, all known astacin gene inactivations had only little, if any, effect. One explanation for this could be that C. elegans astacins have overlapping functions, which is also suggested by structural homologies.

Structural and phylogenetic analysis

All known sequence data of astacin-like proteins are derived from cDNA and genomic sequences, with the exception of crayfish astacin, which in addition had been completely sequenced by Edman degradation [5].

The present analysis is based on protein sequences available from the SwissProt, TrEMBL, EMBL, and GenBank databases. If necessary, open reading frames of DNA sequences were translated into amino acid sequences with the HUSAR package. For C. elegans we used the Genefinder or GenieGene predictions corrected by our cDNA data (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig1.htm). Altogether, we found over a hundred complete sequences of astacin-like proteins known at present (http://www.zoo.uni-heidelberg.de/moehrlen/docs/WebFig2.htm). Considering only the completely sequenced eukaryotic genomes, six astacin genes are found in human and mouse and 12 in Drosophila melanogaster. However, the tiny 959-cell organism C. elegans exhibits the striking number of 40 astacin genes, far more than in any other organism studied to date. With the sole exception of the pseudogene NAS-40, all these genes are expressed and seem to have specific functions. Therefore, these findings not only allow the study of an extraordinary divergence of a protein family within one single organism, but also shed light on a multiple functional fine modulation evolving from a common structural source.

In the astacins, three basic structural and functional moieties can typically be discerned: a pre-pro portion, the catalytic astacin chain, and long C-terminal extensions, which presumably contain messages for proper function (Fig. 2). Pro-sequences are found in all functional C. elegans astacins, while presequences (signal peptides) are lacking in nine genes (Fig. 2). The absence of signal peptides in these genes may reflect specific intracellular functions of nonsecreted proteins. On the other hand, it could also reflect problems with the still unconfirmed 5′-gene predictions of Genefinder or GenieGene, as the sequencing data produced here have been limited to PCR-derived fragments and to the reanalysis of EST and OST fragments. In some rare cases in other organisms, prepro structures may be lacking completely, often combined with an N-terminally truncated catalytic domain [Coturnix coturnix (quail) CAM-1, Swissprot P42326; Drosophila melanogaster CG6974, TrEMBL Q9VFD6; Hydra vulgaris FARM-1, TrEMBL Q9U4X9], but in C. elegans (with the exception of the nonexpressed pseudogene NAS-40) this feature was never seen. In the central domain of all C. elegans astacin genes, the amino acid residues that have been identified in crayfish astacin as essential for catalytic activity [6,7,60,61] are preserved without exception. From this it may be concluded that all C. elegans astacins potentially have catalytic activity, too.

C. elegans astacins are typically characterized by long, complex C-terminal extensions adjacent to the catalytic domain, which presumably define the time and place of their activity (Fig. 2). Based on homology criteria, CUB, EGF, SXC, and TSP-1 domains can be discerned within these extensions, while other sequences must be classified as ‘nonspecific’ or as having ‘low compositional complexity’ (LC). LC regions are often Ser/Thr-rich, are found in many astacins and could serve as sites for O-glycosylation. EGF domains are epidermal growth factor-like modules (PFAM accession number: PF00008). CUB domains (SMART accession number: SM0042) are named after their occurrence in complement components C1r/C1s, embryonic sea urchin protein Uegf, and BMP-1 [62]. These domains may be involved in calcium binding and in protein–protein or enzyme–substrate interactions [63]. The SXC (six-cysteine) motif was observed in several hypothetical C. elegans proteins [64,65] but was originally described in metridin, a toxin from sea anemone, and is also called the ShK toxin domain (SMART accession number: SM0254). TSP-1-like domains are thrombospondin type 1 repeats (SMART accession number: SM0209), which are present in several families of metalloproteases, notably in the ADAM-TS proteases (ADAM-TS, a disintegrin-like and metalloproteinase with thrombospondin type I motifs; family M12B/C, see Table 1). TSP-1 domains are reported here for the first time for astacins.

According to the structural differences in their C-terminal extensions, we arranged all 40 C. elegans genes into subgroups I–VI (Fig. 2). Subgroup I comprises five genes with no C-terminal extension (NAS-1) or with short, nonspecific extensions that probably cannot accommodate specific signals. The 10 genes of subgroup II contain exclusively SXC domains; other domain types are completely lacking. The SXC domain appears in a single, double or triple arrangement, and the domains may be attached directly to the catalytic chain or separated from it and from each other by short, nonspecific sequences. A tandem-like arrangement is seen only with these SXC domains, while other domain types are represented only once in a regulatory chain (for an exception see subgroup VI). Subgroup III combines 15 genes that typically have an EGF-like domain directly attached to the catalytic chain, followed by a CUB domain. In NAS-18 the CUB domain is missing, and in NAS-21 the EGF-like domain is missing. In subgroup IV (two genes) an SXC domain, and in subgroup V (six genes) a TSP-1 domain, is added to the EGF and CUB domains, which are arranged as in subgroup III. Subgroup VI is a special case: its only entry, NAS-39, shows a striking similarity to the human bone-inducing factor BMP-1. A comparison of the two proteins reveals a sequence identity of 74% between the catalytic chains, whereas for other nematode astacins this value reaches only 40% on average. Xolloid (Xenopus), tolloid and tolkin (Drosophila) and TBL-1 (Aplysia) have corresponding structures as well. The number and arrangement of CUB and EGF domains are identical in these genes. NAS-39 is by far the longest of the C. elegans astacins. It will be interesting to see what physiological role a factor almost identical to human BMP-1 performs in C. elegans; this could also give some insight into the primordial functions from which human BMP-1 has evolved. The distinctive and complex pattern that appears in subgroups I–VI seems to provide a specific function for each C. elegans astacin gene. Members of the same subgroup might have similar or identical functions.
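
The subgroup rules described above can be restated as a small decision procedure. The following Python sketch is only an illustration of those rules, not the procedure used for Fig. 2: the function name `classify_subgroup`, the domain labels (`"EGF"`, `"CUB"`, `"SXC"`, `"TSP1"`) and the example inputs are assumptions introduced here, and the treatment of any CUB repeat as subgroup VI is a simplification of the BMP-1-like arrangement.

```python
from collections import Counter

# Hypothetical labels for the four domain types named in the text.
KNOWN_DOMAINS = {"EGF", "CUB", "SXC", "TSP1"}


def classify_subgroup(domains):
    """Assign subgroup I-VI from the list of C-terminal domains.

    `domains` lists only recognized domains; short nonspecific or
    low-complexity stretches are simply left out of the list.
    Domain order is not checked in this sketch.
    """
    counts = Counter(domains)
    unknown = set(counts) - KNOWN_DOMAINS
    if unknown:
        raise ValueError(f"unknown domain labels: {sorted(unknown)}")

    if not counts:
        return "I"    # no extension, or only nonspecific sequence
    if counts["CUB"] > 1:
        return "VI"   # simplification: repeated CUB domains, as in NAS-39
    if counts["EGF"] == 0 and counts["CUB"] == 0:
        if set(counts) == {"SXC"}:
            return "II"   # one to three SXC domains and nothing else
        raise ValueError("combination not covered by subgroups I-VI")
    if counts["SXC"] and counts["TSP1"]:
        raise ValueError("combination not covered by subgroups I-VI")
    if counts["SXC"]:
        return "IV"   # EGF/CUB plus an SXC domain
    if counts["TSP1"]:
        return "V"    # EGF/CUB plus a TSP-1 domain
    return "III"      # EGF and/or CUB only (covers NAS-18 and NAS-21)


if __name__ == "__main__":
    examples = {
        "no extension": [],
        "double SXC": ["SXC", "SXC"],
        "EGF + CUB": ["EGF", "CUB"],
        "EGF + CUB + TSP-1": ["EGF", "CUB", "TSP1"],
    }
    for label, doms in examples.items():
        print(label, "->", classify_subgroup(doms))
```

Treating an extension with only an EGF-like or only a CUB domain as subgroup III follows the text's placement of NAS-18 and NAS-21; combinations the text does not describe raise an error rather than being assigned silently.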

We constructed a phylogenetic tree comprising all 39 expressed C. elegans astacins and, in addition, selected astacin proteins from a variety of other organisms (Fig. 3). The tree is based on a multiple alignment of the amino acid sequence of the active protease domain, covering the region from Ala1 to Leu200 in the prototype, crayfish astacin. The alignment was corrected with the help of the known secondary structures and conserved regions of crayfish astacin. It has been submitted to the EMBL databank under accession number ALIGN_000543.

Phylogenetic relationships were initially established by the neighbor-joining method using the PHYLIP program package. As outgroup we used the phylogenetically most remote sequence, flavastacin from bacteria. An isolated occurrence of an astacin sequence in a single bacterial species could, however, be due to lateral gene transfer, which would render this sequence unsuitable as an outgroup. Because at least one more astacin-like protein has recently been detected in bacteria (http://www.zoo.uni-heidelberg.de/moehrlen), lateral gene transfer is most unlikely. Moreover, we also tried the phylogenetically remote Cnidaria astacins (HMP-2 and PMP-1) as an outgroup, which gave exactly the same phylogenetic tree. For statistical verification a consensus tree including 100 sequences was calculated and bootstrap values were established for each point of divergence. However, the phylogenetic tree based on the neighbor-joining method showed rather low bootstrap values (< 50) for the most ancestral nodes (Fig. 3). Pro-sequences could not be used in addition to strengthen these branching points because they differ extremely in length, change rapidly or are lacking completely. A similar consideration applies to the C-terminal extensions. The robustness of the tree was therefore verified additionally by the Bayesian phylogenetic method. This analysis significantly increased the confidence in the tree and resulted in high posterior probabilities. The evolutionary tree presented in Fig. 3 summarizes all of the above-mentioned approaches and is therefore the most reliable.
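
Bootstrap values of the kind mentioned above are conventionally obtained by resampling the columns of the alignment with replacement and rebuilding the tree from each pseudo-replicate. The following Python sketch shows only that resampling step together with a simple uncorrected distance; it is not the PHYLIP workflow used here, the tree-building step is omitted, and the function names and the toy alignment are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one pseudo-replicate: alignment columns drawn with replacement.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty in number and equal in length")
    n_cols = lengths.pop()
    # The same column picks are applied to every sequence, so columns stay intact.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def p_distance(a, b):
    """Fraction of differing positions, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy alignment with made-up sequences, for illustration only.
    toy = {"seqA": "HELGHAIG", "seqB": "HEIGHALG", "seqC": "HE-GHVLG"}
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for _ in range(3):      # the replicate count is arbitrary here
        replicate = bootstrap_columns(toy, rng)
        print(replicate, round(p_distance(replicate["seqA"], replicate["seqB"]), 3))
```

In a full analysis, each pseudo-replicate would be passed to a tree-building method, and the support for a node would be the percentage of replicate trees that contain it.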

From this analysis it becomes evident that similar sequences of the catalytic chain tend to have similar C-terminal extensions (Fig. 3). All 39 complete NAS proteins can be subdivided into two types: one with CUB domains in the regulatory moiety, and another in which these are lacking completely (see also Fig. 2). This pattern is clearly reflected in the phylogenetic tree based on amino acid sequences, where all NAS proteins with a CUB domain fall closely together in one cluster (Fig. 3). The CUB domain is almost always preceded by an EGF domain (exception: NAS-21). Either no further segments are attached to these (subgroup III), or an SXC domain (subgroup IV) or a TSP-1 domain (subgroup V) follows. The second cluster comprises the NAS-1 to NAS-15 proteins, characterized by having no distinct extensions (subgroup I) or by one, two or three SXC domains (subgroup II). NAS-39 (subgroup VI) is strikingly different from all other C. elegans astacins, but fits perfectly into the BMP-1/tolloid group, on the basis both of sequence homology and of the complex but identical arrangement of the five CUB and two EGF segments (Figs 2 and 3).

One might wonder about the expression of such a large number of related but different astacin genes in a 959-cell organism. Potentially all these genes could have different functions, as they show at least clear, and in some cases marked, structural differences. However, much of this divergence seems to be due to relatively recent gene duplications. In the closely related species Caenorhabditis briggsae, the genes NAS-16, -18, -19, -22, -24 and the pseudogene NAS-40 are missing. C. elegans and C. briggsae do share, however, the neighboring genes NAS-17, -20, and -21. In addition, these genes show a tandem-like arrangement in clusters and are all located on chromosome V, where NAS-16, -17, -18 and -19 form one cluster and, separated from it by various other genes, NAS-20, -21 and -22 form a second cluster. These notions are also supported by the position of these genes in the evolutionary tree (see Table 2, and Figs 2 and 3). It therefore seems reasonable to assume that these genes, which make up one half of subgroup III, resulted from recent gene duplications, which implies that they might have more or less similar functions. If one extends this reasoning, with some caution, to all of the analyzed C. elegans astacins, one could conclude that only the six established subgroups actually represent major functional differences, as these are based on marked differences in their regulatory units. This would reduce the number of functionally different gene types to six, a number close to that found for astacins in other organisms. Nevertheless, the fact remains that each NAS gene is expressed and is structurally distinct from the others. This constitutes a favorable starting point for the rapid acquisition of new functions, a capacity that might be a prerequisite for the ubiquitous occurrence of C. elegans in nearly all soil types. Most NAS genes, however, are dispersed over all six chromosomes of C. elegans, which indicates a long evolutionary history of the astacin protein family in the nematodes. The identical and complex arrangement of the seven regulatory domains in NAS-39 and BMP-1 suggests furthermore that this distinct structure has been retained unchanged for long periods and was already present in the common ancestor of nematodes and vertebrates.

Acknowledgements

This study was supported by a grant from the Deutsche Forschungsgemeinschaft, Bonn, to RZ (Zw 17/14–2). We also wish to thank Thorsten Burmester, University of Mainz, Germany for supporting the Bayesian phylogenetic analysis.
