Protein universe containing a PUA RNA-binding domain

Authors

  • Carolina S. Cerrudo,

    1. Laboratory of Genetic Engineering and Cellular and Molecular Biology, Quilmes National University, Bernal, Buenos Aires, Argentina
  • Pablo D. Ghiringhelli,

    Corresponding author
    1. Laboratory of Genetic Engineering and Cellular and Molecular Biology, Quilmes National University, Bernal, Buenos Aires, Argentina
    • Correspondence

      P. D. Ghiringhelli, Laboratory of Genetic Engineering and Cellular and Molecular Biology, Department of Science and Technology, Quilmes National University, Roque Saenz Peña 352, Bernal 1876, Buenos Aires, Argentina

      Fax: +54 11 4365 7132

      Tel: +54 11 4365 7100 (ext. 4152)

      E-mail: pdg@unq.edu.ar

  • Daniel E. Gomez

    1. Laboratory of Molecular Oncology, Quilmes National University, Bernal, Buenos Aires, Argentina

  • Pablo D. Ghiringhelli and Daniel E. Gomez contributed equally to this work.

Abstract

Here, we review current knowledge about pseudouridine synthase and archaeosine transglycosylase (PUA)-domain-containing proteins to illustrate progress in this field. A methodological analysis of the literature about the topic was carried out, together with a ‘qualitative comparative analysis’ to give a more comprehensive review. Bioinformatics methods for whole-protein or protein-domain identification are commonly based on pairwise protein sequence comparisons; we added comparison of structures to detect the whole universe of proteins containing the PUA domain. We present an update of proteins having this domain, focusing on the specific proteins present in Homo sapiens (dyskerin, MCT1, Nip7, eIF2D and Nsun6), and explore the existence of these in other species. We also analyze the phylogenetic distribution of the PUA domain in different species and proteins. Finally, we performed a structural comparison of the PUA domain through data mining of structural databases, determining a conserved structural motif, despite the differences in the sequence, even among eukaryotes, archaea and bacteria. All data discussed in this review, both bibliographic and analytical, corroborate the functional importance of the PUA domain in RNA-binding proteins.

Abbreviations
Cbf5

dyskerin name in yeast

DENR/DRP

density-regulated protein

eIF2D

eukaryotic translation initiation factor 2D

G5K

glutamate-5-kinase

Gar1

H/ACA ribonucleoprotein complex subunit 1

HMM

hidden Markov model

MCT1

malignant T-cell amplified sequence 1

Nip7

nuclear import 7

Nsun6

NOL1/NOP2/Sun domain family member 6

NTE

N-terminal extension

PUA

pseudouridine synthase and archaeosine transglycosylase

RBDs

RNA binding domains

RNP

ribonucleoprotein

TGTs

transglycosylases

Introduction

RNA-binding proteins are fundamental for many aspects of gene expression and cellular function, including important post-transcriptional processes, and are involved in each step of RNA metabolism. Most of them are composed of small RNA-binding domains (RBDs) that are needed for their recruitment to specific RNA targets. In recent years, determination of the structures of RNA–protein complexes by crystallography and NMR has led to increased knowledge about these proteins and their domains. RNA-recognition modes have been extensively studied and are highly versatile. Currently, diverse RBD families are known (Table 1), including the pseudouridine synthase and archaeosine transglycosylase (PUA) domain [8, 9]. Table 1 shows all the RBDs found in three protein databases (Pfam, SMART and PROSITE).

Table 1. RNA-binding domains.
Domain | Brief description | Pfam ID (no. of sequences) | PROSITE ID (no. of sequences) | SMART ID (no. of sequences) | Protein examples
RNA recognition motif (RRM, RBD, RNP domain) | Found in a variety of RNA-binding proteins. Also appears in a few single-stranded DNA-binding proteins [1] | PF00076 (31837) | PS50102 (1393) | SM00360 (8767) | U1A, PABP, HuD, protein components of small nuclear RNPs
Double-stranded RNA binding motif (dsRBD) | Found in a variety of RNA-binding proteins with different structures and exhibiting a diversity of functions [2, 3] | PF03368 (341) | PS50137 (627) | SM00358 (3061) | RNA adenosine deaminase, endoribonuclease Dicer
K homology RNA-binding domain, RBD (KH) | First identified in the human heterogeneous nuclear ribonucleoprotein (hnRNP) K. Binds RNA and may function in RNA recognition [4, 5] | PF00013 (11484) | PS50084 (1186) | SM00322 (10163) | PNPases, RS3 ribosomal proteins, human onconeural ventral antigen-1 (NOVA-1)
Piwi, Argonaute and Zwille (PAZ) | Function unknown, but has been suggested to mediate complex formation between proteins of the Piwi and Dicer families by heterodimerization [6, 7] | PF02170 (1208) | PS50821 (90) | SM00949 (724) | Argonaute and Piwi proteins, proteins involved in gene silencing
Pseudouridine synthase and archaeosine transglycosylase (PUA) | Highly conserved and found in a wide range of archaeal, bacterial and eukaryotic proteins. Has a common RNA recognition surface, with some versatility in the way in which the motif binds to RNA [8, 9] | PF01472 (3208) | PS50890 (578) | SM00359 (2664) | Enzymes that catalyze tRNA and rRNA post-transcriptional modifications, dyskerin, eIF2D
Oligonucleotide/oligosaccharide binding (OB fold) | Has a five-stranded β sheet coiled to form a closed β barrel. This barrel is capped by an α helix located between the third and fourth strands [10] | CL0021 (clan, 42 families) | - | SM00316, SM00959 | RNase E, RNase II, NusA, BEM-5, ribonuclease G
Pumilio homology domain (PUF) | Multimeric domain of eight tandem repeats that are sufficient for sequence-specific RNA binding [11, 12] | PF00806 (4584) | PS50302 (63) | SM00025 (1024) | Fly Pumilio, worm FBF–1, worm FBF–2
U1-like zinc finger (Znf) (ZnF_U1) | Family of C2H2-type Znf. Bind DNA, RNA, protein and/or lipid substrates [13, 14] | PF12171 (1309) | PS50157 (1871) | SM00451 (1566) | TIS11d, HIV–1 NC, U1 small nuclear ribonucleoprotein C
Co-antiterminator domain (CAT_RBD) | Found at the N–terminal end of transcriptional antitermination proteins. Binds to ribonucleotidic antiterminator hairpin [15, 16] | PF03123 (1721) | - | SM01061 (734) | BglG, SacY, LicT
Telomerase_RBD (TRBD) | Involved in formation of the telomerase holoenzyme in addition to recognition and binding of RNA [17] | PF12009 (311) | - | SM00975 (170) | Telomerase
S4 RBD (S4) | Small domain consisting of 60–65 amino acid residues [8, 18] | PF01479 (24844) | PS50889 (1854) | SM00363 (15026) | Ribosomal proteins S4 and S9, RNA methylases
RBD abundant in apicomplexans (RAP) | Consists of multiple blocks of charged and aromatic residues in eukaryotic proteins [19, 20] | PF08373 (397) | PS51286 (26) | SM00952 (341) | Human hypothetical protein MGC5297, mammalian FAST
Thiouridine synthases and methylases (THUMP) | Adopts an α/β-fold similar to the C–terminus of translation initiation factor 3 [21, 22] | PF02926 (3319) | PS51165 (481) | SM00981 (2143) | 4-thiouridine, pseudouridine synthases and RNA methylases

Several enzymes act upon RNA and, although they cannot be considered a single protein family, they share the function of enzymatically modifying RNA. In these proteins, the enzymatic domain itself binds RNA and contributes to the enzyme specificity [2]. This group includes pseudouridine (Ψ) synthase, which catalyzes the isomerization of uridine to Ψ in a variety of RNA molecules and may function as an RNA chaperone. The domain most commonly associated with these proteins, the PUA domain, is also found in a diverse range of other proteins, including archaeal sulfate reductases, bacterial and yeast glutamate kinases, and proteins involved in ribosome biogenesis and translation initiation.

Ψ synthases from archaea, bacteria and eukarya can be classified into six families: five named after the Escherichia coli enzymes RluA, RsuA, TruA, TruB and TruD, and a sixth, Pus10 [23, 24]. These proteins are substrate specific, recognizing uridine through the sequence or structural context of the target site. Only the TruB family has a C–terminal PUA domain. In bacteria, all TruB family synthases can perform their function without accessory factors; by contrast, their eukaryotic and archaeal counterparts function as part of ribonucleoprotein (RNP) complexes. The defining molecules of these RNPs are their small nucleolar RNAs (snoRNAs), which can be divided into H/ACA and C/D classes, and function predominantly in the modification of rRNA. H/ACA RNPs have been identified as (sno)RNPs, but depending on their site of maturation and action, they can also be classified as small Cajal body RNPs [25, 26]. There are ~ 200 different H/ACA RNAs, each of which associates with the same four core proteins, dyskerin (Cbf5 in yeast, Nap57 in rodents and Nop60B in Drosophila melanogaster), Gar1, Nhp2 (L7Ae in archaea) and Nop10, to form an H/ACA RNP [27, 28]. These proteins are evolutionarily highly conserved, with orthologs in eukaryotes and archaea. The Ψ synthase dyskerin provides the catalytic activity for the H/ACA RNP complex [29]. Accessory proteins are required for it to bind its target RNA and to trigger its enzymatic activity [30]. Although the main function of H/ACA RNPs is Ψ synthase activity [29], these proteins also participate in specific cleavage of the precursor to 18S rRNA, and dyskerin is also needed for the normal function of the telomerase RNP and telomere maintenance [31].

The main characteristics of PUA domains, their biological functions and architecture were last reviewed in 2007 [9]. Here, we provide an update (with bioinformatics tools and data mining analysis), including a complete list of all known proteins with a PUA domain, a description of their biological functions, sequence homology and underlying structures. Moreover, we intend to expand knowledge of the molecular basis for RNA recognition and evolution of the PUA domain in different proteins.

Results and discussion

Major features of the PUA domain

The PUA domain is a highly conserved RNA-binding motif found in a wide range of archaeal, bacterial and eukaryotic proteins. This group includes proteins involved in ribosome biogenesis and translation, enzymes that catalyze tRNA and rRNA post-transcriptional modifications, as well as enzymes involved in proline biosynthesis. This domain was detected in archaeal and eukaryotic Ψ synthases, archaeal archaeosine synthases (TGTs) and a family of predicted archaeal and bacterial rRNA methylases. In addition, the PUA domain was detected in a family of eukaryotic proteins that comprise a novel type of translation factor, and also in bacterial and yeast glutamate kinases (G5Ks) [8].

PUA was initially defined as a compact domain that presents a common RNA recognition surface with α/β architecture [9]; however, the length of the PUA domains currently detected by Pfam [32], PROSITE [33] and CDD [34] varies between 64 and 96 amino acids, depending on the protein family. The PUA domain contains highly conserved motifs that center on stretches of hydrophobic residues [8], and also three highly conserved positions occupied by glycines or small amino acids (Fig. 1). The overall sequence logo (Fig. 1) created by alignment of all retrieved PUA domain sequences shows conservation of these residues. All proteins studied presented between 10 and 14 polar residues, most of which are glycines (five of these glycines are conserved in position in the overall alignment), and between 2 and 7 conserved acidic residues. Determination of the crystal structure of archaeosine tRNA–guanine transglycosylase from Pyrococcus horikoshii has shown that the PUA domain consists of two α helices and six β strands, and folds into a β–sandwich structure similar to the oligonucleotide-binding fold [35].

Figure 1.

PUA sequence logo. The sequence logo shows the sequence conservation of the PUA domain in all proteins of the three superkingdoms; and in all archaeal, bacterial and eukaryotic proteins. The height of the whole letter stack indicates the information content at that position, and the height of each letter is proportional to the relative frequency of the amino acid that the letter represents. Hydrophobic residues (L, V, I, W, M, F, P) are green, basic residues (R, K) are red, acidic residues (D, E) are blue, and all other residues are black.
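The relationship between stack height and letter height described in this caption can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a 20-letter amino acid alphabet, ignores gaps, applies no small-sample correction, and uses a made-up toy alignment rather than the PUA sequences analyzed here. The function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # Maximum possible entropy minus observed entropy.
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment):
    """For each column, map residue -> letter height (frequency x information)."""
    heights = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        info = column_information(column)
        counts = Counter(residues)
        heights.append({r: (n / len(residues)) * info for r, n in counts.items()})
    return heights


# Hypothetical toy alignment (not real PUA sequences).
toy_alignment = ["GLVK", "GIVR", "GLAK", "G-VE"]
for position, stack in enumerate(logo_heights(toy_alignment), start=1):
    print(position, {r: round(h, 2) for r, h in stack.items()})
```

A fully conserved column, such as the first one in the toy alignment, reaches the maximum stack height of log2(20) bits, whereas variable columns produce shorter stacks split among several letters.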

At the time of writing, the PUA domain was known to be present in 2084 species and 3208 stored sequences (Pfam database). This domain showed interactions with five other domains stored in Pfam (domain Nop10p, TGT, TruB_N, DKCLD and DUF1947) and 26 structures stored in the PDB contain the domain.

PUA domain and global proteome

As mentioned above, the PUA domain is found in proteins from different species of the three superkingdoms. From this universe of species, we selected a representative set of 20 to analyze the distribution and characteristics of the PUA domain: Homo sapiens, Rattus norvegicus, Mus musculus, Bos taurus, Gallus gallus, Danio rerio, Drosophila melanogaster, Arabidopsis thaliana, Anopheles gambiae, Aedes aegypti, Caenorhabditis elegans, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Kluyveromyces lactis, Escherichia coli, Salmonella enterica, Shigella dysenteriae, Pyrococcus furiosus, Pyrococcus abyssi and Methanocaldococcus jannaschii. We found eight groups of PUA-domain-containing proteins, five of which are present in most of the organisms studied. Table 2 gives a summary of the function of these characteristic proteins and the specific role of the PUA domain in each. It can be seen that PUA-domain-containing proteins might have different global functions. This is because these proteins have modular domains and the conserved PUA domain appears as a single architectural unit or in diverse combinations with a variety of other domains (Fig. 2), although still retaining its ability to bind to RNA.

Table 2. Function and features of the eight characteristic proteins with PUA domain. Alternative names and abbreviation utilized in this work are indicated for each protein, in addition to the protein function. PUA domain role in the specific protein and the organisms in which they are found are also indicated.
Protein name | Alternative names | Function | Role of PUA domain | Organisms
Dyskerin/Cbf5 | H/ACA small nucleolar ribonucleoprotein (H/ACA small nucleolar RNP) subunit 4, Cbf5, NAP57, Nop60B | Pseudouridine synthase and H/ACA ribonucleoprotein | Binds to the ACA motif of H/ACA RNA. Implicated in the stable anchoring of dyskerin to tRNAs [9] | All
MCT1 | MCT-1 (multiple copies T-cell malignancies 1); MCTS-1 protein (malignant T-cell-amplified sequence 1), Tma20 | Ribosome biogenesis and translational regulation | Required for cap-complex binding. The interaction with m7GTP through the PUA domain requires the presence of eIF4E [36] | All except E. coli, Salmonella enterica and Shigella dysenteriae
eIF2D | Ligatin; eukaryotic translation initiation factor 2D; hepatocellular carcinoma-associated antigen 56, Tma64 | Ribosome biogenesis and translation regulation. Trafficking receptor for phosphoglycoproteins | Interacts with the 40S subunit (binds to dsRNA in tRNA or rRNA) [37] | All except Schizosaccharomyces pombe, E. coli, Salmonella enterica, Shigella dysenteriae, P. furiosus, P. abyssi and M. jannaschii
NIP7 | 60S ribosome subunit biogenesis protein NIP7 homolog; KD93 | Ribosome biogenesis and translation regulation | Mediates Nip7 interaction with pre-rRNA [38] | All except E. coli, Salmonella enterica and Shigella dysenteriae
NSUN6 | NOL1/NOP2/Sun and PUA domain-containing protein 1; NOL1/NOP2/Sun domain family member 6 | RNA methyltransferases | Responsible for recognition of one substrate molecule (the ribosomal RNA fragment and ribosomal protein complex) [39] | Not in: Anopheles gambiae, A. aegypti, C. elegans, S. cerevisiae, Schizosaccharomyces pombe, K. lactis
G5K | Glutamate 5-kinase; gamma-glutamyl kinase; GKs | Gamma-glutamyl kinase | Modulates the function of the GK domain. It is not essential for the metabolic activity of the enzyme. RNA targets are unknown [40] | Only in: S. cerevisiae, Schizosaccharomyces pombe, K. lactis, E. coli, Salmonella enterica and Shigella dysenteriae
TGTs | Archaeal type archaeosine-specific transglycosylases; 7-cyano-7-deazaguanine tRNA-ribosyltransferase | Archaeosine biosynthesis enzymes, tRNA-guanine transglycosylase | May be specifically responsible for archaeosine incorporation at different sites in tRNA molecules [41] | Only in: P. furiosus, P. abyssi and M. jannaschii
Phosphoadenosine phosphosulfate reductase | PUA domain-phosphoadenosine phosphosulfate reductase proteins | Involved in taxa-specific RNA modifications | Involved in the bases biosynthesis through sulfate activation and reduction [9, 42] | Only in: P. furiosus, P. abyssi and M. jannaschii

Figure 2.

Architecture of the eight proteins containing a PUA domain. Identified conserved domains in different protein sequences with PUA domains. Dyskerin- and Nsun6-like proteins are present in species of the three superkingdoms; MCT1- and Nip7-like proteins are present only in archaeal and eukaryotic species; eIF2D-like proteins are present only in eukaryotic species; G5K with a PUA domain is present in S. cerevisiae, Schizosaccharomyces pombe, K. lactis, E. coli, Salmonella enterica and Shigella dysenteriae; phosphoadenosine phosphosulfate reductase and TGTs are only present in archaea. Homo sapiens, P. furiosus, E. coli and S. cerevisiae sequences were taken as reference sequences to show the different architectures.

The eight groups are given below.

TGTs and phosphoadenosine phosphosulfate

These proteins are only present in archaea [9] and are distantly related to the family of queuine-specific TGTs [35, 41] and to the proteins that contain the phosphoadenosine phosphosulfate reductase domain [42] of bacteria and eukaryotes, which lack the PUA domain.

Gamma-glutamyl kinases

Glutamate-5-kinases (G5K) are present in all species studied, but only the G5Ks of Schizosaccharomyces pombe, S. cerevisiae, K. lactis, Salmonella enterica, Shigella dysenteriae and E. coli have a PUA domain. G5K catalyzes the controlling first step in the synthesis of the osmoprotective amino acid proline, which in turn feedback-inhibits G5K. In E. coli, the PUA domain of G5K modulates the function of the amino acid kinase domain and is capable of exposing new surfaces upon proline binding. In higher eukaryotes, there is a bifunctional Δ(1)-pyrroline-5-carboxylate synthase, in which the PUA domain is absent [40, 43].

Dyskerin-like proteins

Dyskerin is a highly conserved nucleolar protein. Much is known about dyskerin because mutations in this protein cause dyskeratosis congenita (DKC), an inherited disorder with clinical manifestations of skin hyperpigmentation (dark patches), oral leukoplakia (white spots inside the mouth) and nail dystrophy (lack of nails). These mutations cause a reduction in telomerase activity, leading to limitations in the proliferative capacity of stem cells. This reduction in proliferative capacity for high turnover cells leads to low counts for blood and immune cells, resulting in aplastic anemia [44]. Like all small nucleolar RNP-associated proteins, dyskerin is phylogenetically highly conserved, and is found in all studied species. Data from studies in mice, rats and humans indicate that dyskerin is involved in at least three basic processes: maintenance of telomere integrity, biogenesis and function of the ribosome, and pseudouridylation of various cellular RNAs. Furthermore, recent reports suggest a role for dyskerin in the regulation of a subset of microRNAs [45] and in translational control of mRNA [46]. Human dyskerin is a 514 amino acid protein; its homologs in archaea (Cbf5, 340 amino acids) and bacteria (TruB, 314 amino acids) are shorter. All three proteins contain the TruB_N and PUA domains (in TruB of E. coli, the PUA domain is currently identified as TruB–C2 by different databases). In addition, dyskerin and Cbf5 contain a dyskeratosis congenita-like domain, which is of unknown function but is typical of this protein family (Fig. 2). Furthermore, human dyskerin contains nuclear and nucleolar localization signals which, logically, are not present in its homologs from archaea or bacteria. TruB_N is the catalytic domain (TruB family pseudouridylate synthase N–terminal domain) and harbors conserved aspartic acid 125, which marks the active site.

The dyskeratosis congenita-like domain is an N–terminal domain associated with the TruB_N/PUA domain of dyskerin-like proteins [47]. As mentioned in the Introduction, members of the TruB_N family are involved in modifying bases in RNA molecules. This group consists of eukaryotic, bacterial and archaeal pseudouridine synthases similar to human dyskerin, S. cerevisiae Cbf5 and D. melanogaster Mfl (minifly protein), and includes TruB [48, 49]. Nuclear and nucleolar localization signals have very similar amino acid compositions, but these two signals are recognized as being different by the cell [50]. Proteins containing the joint nucleolar–nuclear localization signal, such as human dyskerin, can cross the nuclear envelope and accumulate in the nucleolus [51]. However, although human dyskerin was first defined as a nucleolar protein that showed strict nucleolar localization (initially localized in the nucleoplasm, followed by a sequential translocation to the nucleoli and to the coiled bodies [52]), a recent study has indicated that the human DKC1 gene encodes a new alternatively spliced mRNA (isoform 3) which can direct the synthesis of a variant form of dyskerin that has an unexpected cytoplasmic localization and lacks the C–terminal nuclear localization signal [53]. Supporting these data, expression of dyskerin has been described in both the cytoplasm and nuclei of basal or parabasal cells in cervical lesions [54], and dyskerin staining was localized mainly in the nuclei of tumor cells and partly in the cytoplasm [55]. Further research is required to learn more about the nucleolar and cytoplasmic forms of dyskerin protein. The biological functions attributed to dyskerin may all be essentially related to its ability to bind and stabilize H/ACA snoRNAs. For this stabilization, binding of dyskerin to the snoRNAs via the PUA domain is essential.

MCT1-like proteins

Malignant T-cell amplified sequence 1 (MCT1) is an oncogene initially identified in a human T-cell lymphoma and can induce cell proliferation as well as activate survival-related pathways. MCT1 contains two domains, the N–terminus of eIF2D, malignant T cell-amplified sequence 1 and related proteins (eIF2D_N) and the C–terminal PUA domain (Fig. 2). The eIF2D_N domain, initially called DUF1947 by Pfam, is also found in eIF2D and uncharacterized archaeal proteins, and was first identified in archaeal MCT1 homologs [56]. MCT1 protein interacts with the cap complex via its PUA domain and then recruits the density-regulated protein (DENR/DRP) that contains the SUI1/eIF1 domain (also found in the translation initiation factor eIF1) [57]. Consequently, the mRNA translational profile of the human cell is altered; thus, the oncogenic activity of MCT1 may be linked with target RNA translation initiation/regulation [58]. Using a comparative genomics approach, Matte-Tailliez et al. [59] identified a homolog of MCT1 in the archaeon P. abyssi, and in our search we found MCT1 homologs in all species studied except E. coli, Salmonella enterica and Shigella dysenteriae. Therefore, MCT1 seems to be a highly conserved oncogene with the critical biological function of promoting cell proliferation. MCT1 and DENR in combination (MCT1/DENR), together with eIF2D (initially named ligatin), are three related mammalian proteins that have a dual function [36]. These proteins have the potential to substitute for the canonical initiation factor eIF2 in circumstances when its activity is downregulated.

eIF2D-like proteins

Although the functional role of eIF2D remains obscure, it is known to participate in the translation initiation of specific mRNA(s) [37]. However, we cannot exclude the participation of eIF2D in the regulation of some general translational events (initiation, elongation or termination), under specific conditions or in specific cells [60]. We found no similar proteins in the archaea and bacteria tested. This protein contains the eIF2D_N domain, the PUA domain and the C–terminal domains SWIB/MDM2 and SUI1/eIF1 (also called eIF2D_C domain). These C–terminal domains also occur in DENR. eIF2D, MCT1/DENR and Tma20/Tma22 (yeast orthologs) have been implicated in translation on the basis of bioinformatic, proteomic and overexpression/silencing analyses [57, 60]. Many PUA domains bind dsRNA through either the major or the minor groove, but others, including those of MCT1, eIF2D and Nip7 [38], lack most or all of the conserved basic residues involved in these interactions; their potential contacts with rRNA must thus involve different residues and may have an altered specificity. The eIF2D PUA domain has six prolines and five leucines that are highly conserved and, unlike the others, shows no great conservation of acidic and basic residues. The eIF2D_N domain in P. horikoshii PHO734.1 has an exposed electropositive cluster, also present in MCT1 and eIF2D, which might mediate an interaction with the ribosome.

Nip7-like proteins

The Nip7 protein is involved in ribosome biogenesis, being required for proper 27S pre-rRNA processing and 60S ribosome subunit assembly. Yeast Nip7 interacts with nucleolar proteins. Homo sapiens Nip7 (also called KD93) belongs to the UPF0113 family and its possible molecular function is related to RNA binding. The crystal structures of P. abyssi and H. sapiens Nip7 revealed a monomeric protein composed of two interlinked α/β domains. The N–terminal domain is formed by a five-stranded antiparallel β sheet surrounded by three α helices and a 3₁₀ helix, whereas the C–terminus corresponds to the conserved PUA domain. It has been shown experimentally that the PUA domain in yeast and archaea mediates the interaction between Nip7 and RNA [38]. However, there are differences in the amino acid composition and electrostatic surface between the RNA-binding regions of human Nip7 and other PUA-containing proteins, indicating that they may have different RNA-binding modes [61]. The Nip7 PUA domain, like that of dyskerin, also has conserved basic residues in addition to the characteristic residues of the PUA domain (7–10 residues).

Nsun6-like proteins

Nsun6 is present in most of the species studied, the exceptions being Anopheles gambiae, Aedes aegypti, C. elegans, S. cerevisiae, Schizosaccharomyces pombe and K. lactis. This is the least characterized of the eight proteins presented in Table 2. It has two domains, NOL1/NOP2/Sun and PUA (detected with Pfam and PROSITE), and it is detected as a class I S–adenosyl-l–methionine-dependent methyltransferase by SCOP and CDD. Of the proteins belonging to the NOL1/NOP2/Sun family, Nsun2 is the most studied because it is related to diseases such as autosomal recessive intellectual disability [39, 62]. Phylogenetic analysis of proteins belonging to this family leads to the conclusion that Nsun6 is more closely related to NOP2 than to the other members [39]. S-Adenosyl-l–methionine-dependent methyltransferases, as typified by the hypothetical protein MJ1653 from M. jannaschii, are characterized by an N–terminal RNA-binding PUA domain and a C–terminal AdoMet-MTase catalytic domain of the class I Rossmann-fold-like type [63].

Comparative analysis of the PUA domain between the different organisms

Many common RNA-binding protein folds are observed in ancient RNA-binding proteins (such as tRNA synthases). However, despite the obvious structural similarities, there is sometimes insufficient sequence homology to detect a direct evolutionary relationship between RNA-binding motifs. Therefore, to produce a more comprehensive review, we carried out a ‘qualitative comparative analysis’ following the guidelines of the Center for Reviews and Dissemination [64]. We searched for similar sequences using several different methods, including sequence similarity, generation of hidden Markov models (HMM) and structural similarity. Once all sequences of the 20 species had been recovered, we tried to construct a phylogeny based only on the PUA domain sequences. We made many attempts using different approaches and algorithms, including distance, maximum parsimony or maximum likelihood, but we could not consistently cluster the entire collection. This may have been due to the very variable levels of sequence identity between some of the species (5–100%), which did not allow us to obtain phylogenetic trees with significant node consistency.
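The pairwise sequence identity referred to above can be illustrated with a short Python sketch. It is not the procedure used in this study: it assumes the sequences are already aligned to equal length, scores only columns in which both sequences carry a residue (other conventions for the denominator exist), and uses invented toy sequences. The function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_table(aligned):
    """Return {(name_a, name_b): percent identity} for every unordered pair."""
    names = sorted(aligned)
    table = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            table[(name_a, name_b)] = percent_identity(aligned[name_a], aligned[name_b])
    return table


# Hypothetical aligned fragments (not real PUA sequences).
toy = {"seq1": "GLVKAG", "seq2": "GIVRAG", "seq3": "G-VKSG"}
for pair, value in identity_table(toy).items():
    print(pair, round(value, 1))
```

With values spanning a range as wide as 5–100%, many pairs fall well below the level at which alignment columns can be trusted, which is consistent with the poor node support reported for the tree-building attempts.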

We eventually turned to blastclust [65] and clans [66] as more appropriate tools to generate clusters in this highly divergent set of sequences. Both programs showed similar results, allowing us to observe that the PUA domain clusters correlate quite well with the different protein architectures (Fig. 3). PUA domain sequences from dyskerin, MCT1, eIF2D, Nip7 and Nsun6 proteins formed distinct clusters at higher cut-offs, and are thus more closely related. Moreover, in the clans analysis, there is a higher-level cluster that comprises those of the first three proteins mentioned above. The PUA domain sequences from Nip7 and Nsun6 of M. jannaschii, and Nsun6 of A. thaliana, are not grouped in any cluster using blastclust, but the clans program includes them in the NIP7 and Nsun6 clusters. Furthermore, the PUA domain sequences from MCT1 proteins of B. taurus, H. sapiens, M. musculus and R. norvegicus are more closely related, and cannot be separated by blastclust, as they are clustered even at 100% identity. The same happens with the PUA domain sequences of H. sapiens and B. taurus dyskerin proteins, and R. norvegicus and B. taurus NIP7 proteins.
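The effect of raising or lowering the identity cut-off can be illustrated with a generic threshold-based single-linkage clustering sketch in Python. This is only an illustration of the general idea, not the blastclust or clans implementation; the sequence names and identity values below are invented, and `single_linkage_clusters` is a hypothetical helper.

```python
def single_linkage_clusters(names, identity, threshold):
    """Group names so that any pair with identity >= threshold shares a cluster."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in identity.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical names and percent identities, for illustration only.
names = ["dysk_A", "dysk_B", "mct1_A", "mct1_B", "nip7_A"]
identity = {
    ("dysk_A", "dysk_B"): 92.0,
    ("mct1_A", "mct1_B"): 100.0,
    ("dysk_A", "mct1_A"): 18.0,
    ("nip7_A", "mct1_B"): 22.0,
}
print(single_linkage_clusters(names, identity, 90.0))
print(single_linkage_clusters(names, identity, 15.0))
```

In this toy example, a high cut-off separates the groups and leaves one sequence unclustered, whereas a low cut-off merges everything into a single cluster; sequences that are 100% identical remain together at any cut-off.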

Figure 3.

PUA sequence clusters. Clustering by blastclust and clans was based on the average pairwise sequence identity for all PUA domain sequences. Protein family names are indicated (Dyskerin-like, Nip7-like, MCT1-like, Nsun6-like and eIF2D-like). Solid black lines show the relationships detected using blastclust. Gray shading shows the clusters identified by clans. Gray dotted lines indicate sequences that clans assigned to a cluster but that were absent from the blastclust results.

The PUA domain of dyskerin shows a high level of evolutionary conservation. As can be seen in Fig. 3, the TruB proteins form a cluster separate from the rest of the dyskerin-like proteins. This is consistent with the literature because, despite the low sequence similarity between them, dyskerin shows strong structural similarity to TruB (bacterial pseudouridine synthase) and for that reason is considered to be the pseudouridine synthase of H/ACA small nucleolar RNPs [67, 68]. However, these proteins differ in some respects: the PUA domain of TruB makes nonsequence-specific contacts with the acceptor stem of tRNA, the PUA domain of dyskerin is considerably larger than that of TruB, and the angle formed between the PUA domain and the core Ψ synthase fold extends the active site cleft in the other direction [30].

Structural comparison of PUA-domain-containing proteins

The PUA domain structure consists of six β strands (β1–β6) and two short α helices (α1–α2). The DALI server [69] detected a total of 197 protein structures (Z-score > 3), each of which belongs to one of the groups listed in Table 2. Structural homologs (Z-score > 7) identified by the DALI server show high similarity in overall PUA domain structure. To further illustrate the conformational similarity, we aligned the structures of the PUA domain using the structure of S. cerevisiae dyskerin as a template (Fig. 4C,D). The 16 PUA domain structures can be closely aligned with an r.m.s.d. of < 1.6 Å.
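For readers unfamiliar with the r.m.s.d. measure quoted above, the following Python sketch shows how it is computed for two sets of equivalent atom coordinates. It assumes the structures have already been superposed and that atoms are paired one to one; it does not perform the superposition or the residue matching that a structural alignment server carries out. The coordinates are invented, and the result is in the same length unit as the input.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D points (already superposed)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the two equivalent atoms.
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical C-alpha coordinates for three equivalent positions.
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(template, model))
```

Because every pair of equivalent atoms contributes equally, a low value over a whole domain indicates that the backbone traces follow the same path throughout, not merely in one region.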

Figure 4.

Comparison of structures containing the PUA domain. (A) Structural conservation mapping of the superposition of the complete, currently known protein structures with the PUA domain. The PUA domain core (~ 95 residues) is retained across the five protein families with known structure: dyskerin, MCT1, Nip7, Nsun6 and G5K. (B) Superposition of the PUA core domain from the abovementioned proteins (represented here by the domains from PDB structures: 3UAIA, 3ZV0C, 1T5YA, 3U28A, 1SQWA, 3LWPA, 2EY4A, 3R90A, 3HAXA, 2J5TD, 1K8WA, 3MQKA, 2APOA, 2AUSA, 2P38B, 2FRXA); β strands are shown in yellow and α helices in red. (C) Sequence conservation mapping of the PUA domain core. For both sequence and structural conservation, red positions are maximally conserved and blue positions are not conserved. (D) Structure-based multiple sequence alignment of representative PUA domains. Secondary structure is indicated by H for α helices and E for β strands; color is used for amino acids that have a frequency > 0.5 in a column. The names of the proteins, species and PDB code are indicated. Also listed for each structure are the Z-score, the r.m.s.d. and the % identity, obtained by comparison with the structure of Dysk_Sc (3UAIA) using the DALI server. Sc, S. cerevisiae; Hs, H. sapiens; Pf, P. furiosus; Pa, P. abyssi; Mj, M. jannaschii; Ec, E. coli.

Compared with its archaeal counterparts, the eukaryotic dyskerin protein is longer and contains an N-terminal extension (NTE) and a C-terminal extension [70]. These two extensions are highly conserved in eukaryotes and harbor many pathologic mutations. Using the archaeal dyskerin structure as a template, a structural model for human dyskerin was proposed by Rashid et al. [30]. Although incomplete, because the archaeal protein lacks the NTE and C-terminal extension regions mentioned above, the model shows that most dyskerin pathologic mutations converge on the same side of the PUA domain [30]. This suggests that these mutations may affect the binding to substrate RNAs, or to an as yet unidentified partner of the complex. The crystal structure of yeast dyskerin in complex with Gar1 and Nop10 showed that the NTE folds into a new structural layer covering the PUA domain, whereas the C-terminal extension is disordered [71]. The structures supplied by Li and colleagues [70, 71] revealed the role of the NTE. The NTE is partially structured and forms a new architectural layer that covers the β barrel of the PUA domain and expands upon the eukaryotic PUA domain. A portion of the NTE is encircled by the N-terminal residues of the PUA domain and packs against the surface of the β barrel via hydrophobic interactions [70]. A structural model of human dyskerin built using yeast Cbf5 as the template would certainly provide useful information.

Concluding remarks

The number of databases has increased from 2007 to the present, and the available information about proteins and their functions has increased significantly. Analysis of the updated databases allowed us to detect the existence of all PUA domain proteins, corroborating previous findings by Pérez-Arellano et al. [9], but also adding important information that was unknown at that time.

This review illustrates the diversity in the distribution, protein architecture and sequence characteristics of PUA-domain-containing proteins. We surveyed public sequence and structure databases, and used multiple search engines to identify all proteins containing the PUA domain. The PUA domain is found in a wide variety of proteins and we detected eight protein groups in different species from the three superkingdoms. Five of them (dyskerin, MCT1, eIF2D, Nsun6 and Nip7) are present in H. sapiens and have orthologs in most species, whereas the other three (G5K, TGTs and phosphoadenosine phosphosulfate) are characteristic of archaea or bacteria. This domain occurs as a single copy and might appear as a single architectural unit or in diverse combinations with a variety of other domains. The PUA-containing proteins function as post-transcriptional RNA-modifying enzymes, translation initiation factors, glutamate kinases (in bacteria and yeast) and proteins involved in ribosome biogenesis.

Structural analysis of the PUA domain reveals high levels of conservation in different PUA-domain-containing proteins, even among different species, which testifies to the importance of proteins belonging to the PUA group. Because PUA-containing proteins have an important role in the cell, it is expected that their deregulation results in aberrant cell phenotypes. Thus, better understanding of PUA domain functioning in pathology, as well as in normal cell physiology, might lead to novel therapies.

With the wealth of information described about the group of PUA-containing proteins, one of the main issues to be addressed in the future is how these PUA domains are part of proteins with diverse architectures, despite maintaining similar molecular RNA-binding properties. In addition, how the activity of the PUA domain is regulated and coordinated by the cell is largely unknown and thus needs to be investigated. Further research is warranted to improve our understanding of this domain and its impact on physiological and pathological processes.

Materials and methods

Sequence search

Protein sequences corresponding to dyskerin, MCT1, eIF2D, NIP7 and Nsun6 were retrieved from the GenBank database using blastp and psi-blast programs [65]. Sequences belonging to the following species were recovered: H. sapiens, R. norvegicus, Mus musculus, Bos taurus, Gallus gallus, Danio rerio, D. melanogaster, A. thaliana, Anopheles gambiae, Aedes aegypti, C. elegans, Schizosaccharomyces pombe, S. cerevisiae, K. lactis, E. coli, Salmonella enterica, Shigella dysenteriae, P. furiosus, P. abyssi and M. jannaschii.

HMM and phylogeny

In order to retrieve all proteins containing the PUA domain from the databases, HMMs were generated. HMMs were built from alignments of PUA domains, using the hmmbuild program and searches were carried out using the hmmsearch program from the hmmer package [72]. Results from hmmsearch and blast were compared and additional PUA domains were selected. Phylogenetic and molecular evolutionary analyses were conducted using mega v. 5 [73].
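The search-and-compare step described above can be sketched in Python. This is only an illustration under stated assumptions: the file names, the E-value threshold, the helper names and the example table are hypothetical and are not taken from the study. The sketch assembles the hmmbuild and hmmsearch command lines (without running them), reads target names from hmmsearch `--tblout` output, and lists hits that a blast search did not report.

```python
def hmmer_commands(alignment, hmm_file, database, table_file):
    """Return hmmbuild and hmmsearch command lines as argument lists (not executed)."""
    build = ["hmmbuild", hmm_file, alignment]
    search = ["hmmsearch", "--tblout", table_file, hmm_file, database]
    return build, search


def parse_tblout(text, max_evalue=1e-3):
    """Map target name -> full-sequence E-value for hits at or below max_evalue.

    The threshold is an assumption made for this sketch.
    """
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue:
            hits[target] = evalue
    return hits


def only_in_hmm(hmm_hits, blast_ids):
    """Targets found by the HMM search but absent from the blast results."""
    return sorted(set(hmm_hits) - set(blast_ids))


# Made-up table showing only the first seven --tblout columns:
# target, accession, query, accession, E-value, score, bias.
EXAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
protA          -          PUA         -          1.2e-20  75.3   0.1
protB          -          PUA         -          3.0e-05  22.8   0.0
protC          -          PUA         -          0.9      5.1    0.0
"""

if __name__ == "__main__":
    hits = parse_tblout(EXAMPLE_TBLOUT)
    print(only_in_hmm(hits, blast_ids=["protA"]))
```

Keeping command construction separate from parsing lets the comparison logic be checked without HMMER installed; the argument lists could be passed to `subprocess.run` where the programs are available.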

Cluster analysis of the PUA domain

Clustering of the PUA domain dataset was performed using blastclust [65] and clans [66]. blastclust was run with a length coverage threshold of 70% of the sequence length for comparison. Cluster analysis with the clans program was performed using PSIBLAST and BLOSUM62 parameters, and a cut-off value of 9.4926.
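The idea behind identity-based clustering can be illustrated with a minimal single-linkage sketch in Python. It is not the blastclust implementation: the function name, the sequence labels and the identity and coverage values below are invented for illustration. Two sequences are linked when their pairwise identity and length coverage both reach the chosen thresholds, and clusters are the connected groups that result.

```python
def single_linkage(names, pairs, min_identity, min_coverage=0.7):
    """Group names into clusters by single linkage.

    pairs: iterable of (a, b, identity_percent, coverage_fraction).
    A pair is linked when both thresholds are met.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in pairs:
        if identity >= min_identity and coverage >= min_coverage:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical labels and values, for illustration only.
NAMES = ["MCT1_Hs", "MCT1_Mm", "Dysk_Hs", "Dysk_Sc"]
PAIRS = [
    ("MCT1_Hs", "MCT1_Mm", 100.0, 1.00),
    ("Dysk_Hs", "Dysk_Sc", 80.0, 0.95),
    ("MCT1_Hs", "Dysk_Hs", 25.0, 0.80),
]

if __name__ == "__main__":
    for cutoff in (30.0, 100.0):
        print(cutoff, single_linkage(NAMES, PAIRS, min_identity=cutoff))
```

Raising the identity cut-off splits loosely related groups first, so sequences that stay together at the strictest cut-off are the most closely related ones in the set.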

Structure alignment

Structure similarity searches were conducted using the dali program [69] with the structure of S. cerevisiae (3UAIA) as the input, both as the complete protein and as the isolated PUA domain. One hundred and nine entries, with a Z-score > 8, were retrieved from the PDB using the dali structure superposition search. Fifteen additional sequences comprising PUA domain structures were selected (PDB ID: 3ZV0D, 1T5YA, 3U28A, 1SQWA, 3LWPA, 2EY4A, 3R90A, 3HAXA, 2J5TD, 1K8WA, 3MQKA, 2APOA, 2AUSA, 2P38B and 2FRXA). Structural superpositions, sequence alignments and r.m.s.d. values were generated using dali and refined by manual inspection and adjustment. Molecular representations were produced using jmol (SourceForge, Dice Holdings Inc.; Phoenix, AZ, USA) and pymol (Schrödinger, LLC; Portland, OR, USA), which can be downloaded from http://www.jmol.org and http://www.pymol.org, respectively.
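For readers unfamiliar with the r.m.s.d. measure, the following Python sketch shows how it is computed for two sets of equivalent atoms that have already been superposed. It is a generic illustration, not the procedure used by dali; the function name and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the two structures are already superposed and that the points
    are listed in matching (aligned) order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two invented pairs of equivalent atoms; coordinates in Å.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(round(rmsd(a, b), 3))
```

Because the value depends on which residues are treated as equivalent, r.m.s.d. figures are only comparable when they are computed over the same aligned positions.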

Acknowledgements

This study was supported by grants from Quilmes National University (Argentina). DEG and PDG are researchers from CONICET. DEG is a member of the National Cancer Institute of Argentina.
