A comprehensive structural and functional in silico analysis of the medium-chain dehydrogenase/reductase (MDR) superfamily, including 583 proteins, was carried out by use of extensive database mining and the blastp program in an iterative manner to identify all known members of the superfamily. Based on phylogenetic, sequence, and functional similarities, the protein members of the MDR superfamily were classified into three different taxonomic categories: (a) subfamilies, consisting of a closed group containing a set of ideally orthologous proteins that perform the same function; (b) families, each comprising a cluster of monophyletic subfamilies that possess significant sequence identity among them and may or may not share common substrates or mechanisms of reaction; and (c) macrofamilies, each comprising a cluster of monophyletic protein families with protein members from the three domains of life, which includes at least one subfamily member that displays activity related to a very ancient metabolic pathway. In this context, a superfamily is a group of homologous protein families (and/or macrofamilies) with monophyletic origin that share at least barely detectable sequence similarity and show the same 3D fold.
The MDR superfamily comprises three macrofamilies, with eight families and 49 subfamilies. These subfamilies exhibit great functional diversity, including noncatalytic members with different subcellular, phylogenetic, and species distributions. This results from constant enzymogenesis and proteinogenesis within each kingdom, and highlights the huge plasticity that MDR superfamily members possess. Thus, through evolution a great number of taxa-specific new functions were acquired by MDRs. The generation of new functions fulfilled by proteins can be considered the essence of protein evolution. The mechanisms of protein evolution inside MDR are not constrained to conserve substrate specificity and/or chemistry of catalysis. In consequence, MDR functional diversity is more complex than sequence diversity.
MDR is a very ancient protein superfamily that existed in the last universal common ancestor. It had at least two (and probably three) different ancestral activities related to formaldehyde metabolism and alcoholic fermentation. Eukaryotic members of this superfamily are more related to bacterial than to archaeal members; horizontal gene transfer among the domains of life appears to be a rare event in modern organisms.
polyketide synthase-independent associated protein
quinone oxidoreductase-like 1
sensing starvation protein
quinone oxidoreductase involved in tracheary element differentiation in plants
unweighted pair-group method using arithmetic averages
yeast alcohol dehydrogenase
NAD(P)-dependent alcohol dehydrogenase (ADH) activity is widely distributed in nature and is carried out by three main superfamilies of enzymes that arose independently throughout evolution. Their amino acid identity is 20% or less and they exhibit different structures and reaction mechanisms. The first superfamily corresponds to the Fe-dependent ADHs and makes up the smallest and least studied family of alcohol dehydrogenases [2–4]. The second group includes the short-chain dehydrogenase/reductase superfamily; this large family of enzymes does not require a metallic ion as cofactor [5,6]. The third superfamily is composed of zinc-dependent ADHs, and is preferentially named medium-chain dehydrogenases/reductases (MDRs) [7,8]. These enzymes usually require zinc atom(s) as cofactor and the family includes the classical horse liver ADH. In addition to these three NAD(P)-dependent ADH families, other minor families of ADH exist, which use different cofactors such as FAD and pyrroloquinoline quinone, among others; however, the distribution of these minor families is limited to some bacterial groups.
To date, nearly 1000 protein sequences have been identified as MDR superfamily members [8–10]. Identification of new members of the MDR superfamily is performed with high statistical significance using tools such as blastp or fasta [12,13]. However, efforts to assign proteins to families and/or subfamilies within the MDR superfamily have not been equally successful. Public protein databases use different criteria to classify proteins, and therefore several inconsistencies in the identification of protein subfamilies and families have been observed. Recently, Nordling et al., based on analysis of five complete eukaryotic genomes and Escherichia coli, constructed an evolutionary tree of the MDR in which at least eight families can be distinguished: dimeric ADHs in animals and plants; tetrameric ADHs in fungi (Y-ADHs); polyol dehydrogenases (PDHs); quinone oxidoreductases (QORs); cinnamyl alcohol dehydrogenases (CADHs); leukotriene B4 dehydrogenases (LTDs); enoyl reductases (ERs); and nuclear receptor binding proteins (NRBPs). ERs and NRBPs were originally described as acyl-CoA reductases (ACRs) and mitochondrial respiratory function proteins (MRFs), respectively; the Results section discusses why the names of these enzymes are described differently here.
Because the MDR protein families proposed by Nordling et al. were identified considering only a few genomes, it is possible that other protein families of the MDR may be identified if complete sets of their protein sequences are used. Furthermore, a larger set of MDRs will allow us to make a more detailed taxonomic analysis. Therefore, in this report we analysed MDR taxonomy on the basis of the entire set of currently known MDR members, and completed the work initiated by Nordling et al. with identification of further protein subfamilies that comprise each protein family within the MDR superfamily. To contribute to validation of the eight protein families previously identified, we grouped protein sequences employing a different method from that used by Nordling et al. Indeed, the limited number of protein sequences employed by Nordling et al. precluded them from identifying protein subfamilies.
Finally, we analysed evolution of the MDR superfamily and identified some putative selective forces that directed their enzymogenesis. This analysis is valuable as a paradigm of protein evolution and provides information to understand previously defined concepts such as protein family, subfamily, and superfamily, and their relationships to several protein classification efforts. Furthermore, recruitment of selected members of this superfamily may offer clues about the evolution of some metabolic pathways, and show the evolutionary history of different organisms: for example, ER was recruited from MDR and incorporated into the multifunctional enzyme fatty acid synthase from animals (not fungi or plants); additionally, the capacity for retinoic acid synthesis, a powerful regulator of genetic expression active only in vertebrates, evolved in parallel to evolution of animal ADHs; and animal ADHs are involved in the synthetic or catabolic route of paramount modulators such as epinephrine, serotonin, and dopamine.
Materials and methods
Extensive database searches for zinc-dependent ADH, sorbitol dehydrogenase, threonine dehydrogenase, CADH, mannitol dehydrogenase, ER, and QOR were performed. Protein sequence data were taken from the SWISSPROT + TrEMBL protein databases and the GenBank nonredundant protein sequence database at the National Center for Biotechnology Information (NCBI). Access to NCBI databases was achieved by means of the integrated database retrieval system ENTREZ. The gapped blastp program with default gap penalties and the blosum62 substitution matrix was employed. Starting from selected protein sequences that belong to each of the subfamilies that compose the MDR superfamily, a blastp search for homologous sequences was performed for each selected sequence to identify new members of MDRs not yet recognized. Whenever a new sequence was identified (P < 0.00001), the blastp search was repeated with that sequence as query, seeking closer relative sequences. The procedure was repeated iteratively until no new members of MDRs were recognized.
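The iterative search described above amounts to expanding a seed set until a full round of searches adds nothing new. The following is a minimal sketch of that loop, not the authors' actual pipeline: `search` is a hypothetical callable standing in for a blastp run (it is assumed to return pairs of hit identifier and P-value), and `TOY_HITS` is invented toy data used only to make the example runnable.

```python
from collections import deque

P_CUTOFF = 1e-5  # significance threshold quoted in the text (P < 0.00001)


def iterative_search(seeds, search, cutoff=P_CUTOFF):
    """Expand a seed set until no search returns a new significant hit.

    `search(query)` is a hypothetical stand-in for one blastp run; it is
    assumed to yield (hit_id, p_value) pairs for the given query.
    """
    members = set(seeds)
    queue = deque(seeds)  # sequences that still have to be used as queries
    while queue:
        query = queue.popleft()
        for hit_id, p_value in search(query):
            if p_value < cutoff and hit_id not in members:
                members.add(hit_id)
                queue.append(hit_id)  # a new member becomes a new query
    return members


# Invented toy hit lists (not real sequences), keyed by query identifier.
TOY_HITS = {
    "A": [("B", 1e-40), ("X", 0.5)],
    "B": [("A", 1e-40), ("C", 1e-8)],
    "C": [("B", 1e-8)],
    "X": [("Y", 1e-50)],
}


def toy_search(query):
    return TOY_HITS.get(query, [])


found = iterative_search(["A"], toy_search)
print(sorted(found))
```

In the toy data, sequence C is reached only through B, which is the situation the iteration is meant to cover: a distant member that a single search from the seed would miss.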
Progressive multiple protein sequence alignment was calculated with the clustal_x package using secondary structure-based penalties and corrected according to results of gapped blastp. Dendrograms were calculated using clustal_x and displayed with treeview. Phylogenetic analyses were performed with mega2 software, using both maximum parsimony (MP) and distance-based methods [UPGMA and neighbour-joining (NJ)], with the Poisson correction distance method, and gaps treated by pairwise deletion. Confidence limits of branch points were estimated by 1000 bootstrap replications.
The procedure to define protein subfamilies and families is explained in detail in the Results section.
A total of 656 nonredundant sequences (allelic forms excluded) were identified as members of the MDR superfamily. Of this total, 73 sequences were excluded from final analysis for one of the following reasons: (a) sequences with fewer than 75 amino acids; (b) isozymes with 100% identity; (c) multiple sequences corresponding to orthologous genes identified in several species from the same genus, because they were considered redundant for the phylogenetic analysis; and (d) duplicated information, for example, two fragments of proteins in Streptomyces coelicolor (CAB53403 and CAB55521) were identified as the N- and C-terminus, respectively, of the same protein (kindly confirmed by S. Bentley, Sanger Institute, Hinxton, Cambridge, UK; personal communication). Thus, 583 nonredundant protein sequences were considered for phylogenetic analysis; of these, 21 proteins belong to archaea, 234 to bacteria, 11 to protista, 62 to fungi, 148 to plants, and 107 to animals.
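For readers unfamiliar with the distance setting named above, the Poisson correction converts the observed proportion p of differing sites into d = -ln(1 - p), and pairwise deletion skips only those alignment columns in which one of the two compared sequences has a gap. The sketch below illustrates the calculation on two aligned strings; it is an illustration under those definitions, not a reimplementation of mega2, and the example sequences are invented.

```python
import math

GAP = "-"


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance between two aligned protein sequences.

    Columns where either sequence has a gap are ignored for this pair only
    (pairwise deletion). Returns math.inf when every compared site differs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            continue  # pairwise deletion of gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    if differing == 0:
        return 0.0
    p = differing / compared  # observed proportion of differing sites
    if p >= 1.0:
        return math.inf
    return -math.log(1.0 - p)


# Invented aligned fragments: one gapped column, one substitution.
print(poisson_distance("ACDEFG", "AC-EFH"))
```

Here five columns are compared and one differs, so p = 0.2 and the corrected distance is -ln(0.8), slightly larger than the raw proportion because multiple substitutions at a site are accounted for.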
The 583 sequences permitted construction of the unrooted tree shown in Fig. 1. Protein sequences were ascribed to different subfamilies, as indicated in the SWISSPROT database. Conserved groups with high degree of identity can be identified easily (e.g. class III ADH, plant ADHs, animal ADHs), as well as poorly conserved subfamilies, such as sorbitol dehydrogenase, ER, or QOR. Conserved protein subfamilies are identified because distances between their members are short, and appear as a group of branches that join among themselves far from the centre of the tree. In comparison, poorly conserved subfamilies with low identity among themselves, resemble groups of long branches that depart close to the centre of the tree. However, the latter, more than being an inherent property of these subfamilies, might be due to problems concerning particular aspects with regard to reliability of database information, because a significant fraction of functional annotations in databases is dubious or even incorrect [21,22]. This problem arises because there are many noncharacterized sequences. An especially illustrative example is the case of the QOR/ζ-crystallin subfamily, in which many protein sequences are assumed to be QOR only by sequence similarities with the well-characterized animal QOR/ζ-crystallins. Thus, other noncharacterized distantly related sequences are assumed to be also QOR only by similarity to the second group of QOR-related sequences.
In summary, GenBank reports might be produced before characterization is completed and/or published; usually, authors do not update the original GenBank report after publication. Therefore, many proteins have already been characterized, but this information is not recorded in GenBank or other protein databases. Thus, to record reliable functional identification for most proteins, an extensive search for published papers by authors who made contributions to GenBank for each of the MDRs was carried out. This functional identification, plus the statistically significant degree of similarity calculated with blast (E-value), allowed us to identify many additional small subfamilies as members of the MDR superfamily. The E-value represents the number of alignments with an equivalent or greater score that would be expected to occur purely by chance.
Table 1 lists the main protein families that are found within the MDR superfamily, as stated by several public protein databases. Several inconsistencies in the nomenclature for protein subfamilies, families and superfamilies are observed: for example, Pfam does not attempt to identify families or subfamilies in the MDR superfamily; prosite uses motifs to identify two protein families in the MDR superfamily; PIR [26,27] uses distance-based criteria to identify 119 families in MDR; CATH [28,29] uses structural data to identify six superfamilies in MDR; COG [30–32] uses phylogenetic criteria to identify six families; and SYSTERS uses a non-distance-based method to identify 80 families. This discrepancy is due to the different criteria used for defining each of these terms.
Table 1. Protein families/subfamilies within the medium-chain dehydrogenase/reductase (MDR) superfamily, as indicated in several public databases.
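The E-value definition given above can be made concrete with the standard Karlin-Altschul relation for a normalized (bit) score S', E = m · n · 2^(-S'), where m and n are the query and database lengths. The sketch below is a simplified illustration: the function name is hypothetical, and the effective-length corrections that BLAST applies to m and n are deliberately ignored.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplifying assumption: raw lengths are used, whereas BLAST applies
    effective-length (edge) corrections before computing E.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit of score halves the number of chance alignments.
e_50 = expected_chance_hits(50, query_len=350, db_len=1_000_000)
e_51 = expected_chance_hits(51, query_len=350, db_len=1_000_000)
print(e_50, e_51)
```

The relation makes clear why the same alignment score yields a larger E-value against a larger database: the search space m · n grows while the score stays fixed.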
Protein families/subfamilies considered within MDR
adh_zinc Include 80 clusters (families), organized into superfamilies; the main superfamilies are:
Superfamily of cluster O60787: includes six additional clusters with sequences from animal ADH, plant ADH, class III ADH (equivalent to COG1062)
Superfamily of cluster N60795: includes 13 additional clusters with sequences from CADH, fungi ADH, DHSO, TDH, secondary ADH among others (equivalent to COG1063 plus COG1064)
Superfamily of cluster N60499: includes five additional clusters with sequences from QOR/ζ-crystallin and related (equivalent to COG0604)
Superfamily of cluster O59495 and O59531: includes other nonrelated clusters (equivalent to COG3321).
To clarify this, we have defined a protein subfamily as a set of homologous (ideally orthologous) protein sequences that (a) performs the same function and (b) forms a closed group in which identity, similarity, and statistical significance between any two members of the closed group are higher than to any other protein sequence outside the subfamily, i.e. clusters of proteins with blast reciprocal best hits. Often, members of protein subfamilies share more than 30% sequence identity and an E-value of approximately 10⁻³⁰ or less. It should be mentioned that all-vs.-all blast-based searches have recently been used to find orthologs [33–36], and that these methods bypass multiple alignments and construction of phylogenetic trees, which can be slow and error-prone steps in classical ortholog detection.
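The reciprocal-best-hit criterion mentioned above can be sketched in a few lines. This is an illustrative simplification rather than the procedure used to build the subfamilies: it assumes an all-vs.-all table of E-values keyed by (query, subject), takes the lowest E-value as the best hit, and uses invented identifiers and values.

```python
def best_hits(evalues):
    """Map each query to its best (lowest E-value) non-self subject."""
    best = {}
    for (query, subject), e_value in evalues.items():
        if query == subject:
            continue  # self-hits carry no information
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(evalues):
    """Return the pairs in which each sequence is the other's best hit."""
    best = best_hits(evalues)
    return {
        frozenset((query, subject))
        for query, subject in best.items()
        if best.get(subject) == query
    }


# Invented all-vs.-all E-values for four hypothetical sequences.
TOY_EVALUES = {
    ("a1", "a2"): 1e-80, ("a1", "b1"): 1e-20,
    ("a2", "a1"): 1e-80, ("a2", "b2"): 1e-15,
    ("b1", "b2"): 1e-60, ("b1", "a1"): 1e-20,
    ("b2", "b1"): 1e-60, ("b2", "a2"): 1e-15,
}

pairs = reciprocal_best_hits(TOY_EVALUES)
print(sorted(sorted(pair) for pair in pairs))
```

No identity cutoff appears anywhere in the sketch: the pairs emerge from the ranking of hits alone, which is the property the text contrasts with distance-based clustering.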
The previously mentioned definition of subfamily is nearly identical to the approach employed in the SYSTERS database to define protein families or clusters of protein sequences [38–40], but with the additional condition that all sequences in a cluster must (ideally) share the same function. This functional criterion is necessary because true orthologous proteins must perform the same function; if this last condition is not true, then the proteins are paralogous. In contrast, paralogous proteins do not necessarily possess different functions, in that by definition, two proteins are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event [41–44]. Therefore, initially a duplication event will produce two proteins possessing identical properties, and only after evolution might they acquire different functions. This explanation is obligatory because some papers provide inexact definitions [45–47].
This non-distance-based method allows us to sort MDR sequences into nonoverlapping clusters (subfamilies), in which the granularity of this clustering is determined by data and not by a user-supplied data-dependent cut-off. Identification of closed groups of protein sequences, or perfect clusters (in agreement with SYSTERS nomenclature), is advantageous over distance-based clustering methods because it is not necessary to set an arbitrary identity cutoff value to define a subfamily (or families in the SYSTERS database), and permits identification of both highly and poorly conserved groups of orthologous proteins. Furthermore, Krause & Vingron showed that this method is highly conservative, as the probability of obtaining a false positive is extremely low, i.e. we almost never observe sequences that do not belong to a cluster being included.
On the other hand, this subfamily definition fits with the widely used nomenclature proposed by Persson et al. for the MDR superfamily. Thus, only closed groups with at least one characterized protein were listed as true protein subfamilies in this work. This criterion excluded some minor clusters without characterized proteins, or protein sequences located in the twilight zone, which cannot be assigned with certainty to a protein subfamily. Furthermore, there is always the possibility that the best match in a database search is solely a well-conserved paralog that in reality belongs to a related, but different, protein subfamily.
As a consequence of application of these criteria, subfamilies identified in this work are equivalent to a carefully crafted, manually curated version of the clusters of proteins proposed in the SYSTERS database. Figure 2 shows an unrooted tree constructed with all the MDR protein sequences identified in bacteria and archaea, with recognized protein subfamilies indicated. Figure 3 shows an equivalent unrooted tree constructed with protein sequences identified in eukaryota. In both trees, the main subfamilies of the MDR superfamily are easily visualized. Comparison of Figs 2 and 3 clearly shows that in addition to the well-characterized protein subfamilies that exist simultaneously in several phylogenetic lineages, there are additional subfamilies associated with only one phylogenetic lineage, suggesting a more recent evolutionary origin.
It can also be observed that several protein subfamilies are formed by clusters of related subfamilies (Figs 2 and 3). According to the previous proposal for protein subfamilies, we define a protein family as a set of protein subfamilies in which identity and/or similarity of proteins in the family is higher among them than when compared with other proteins belonging to a different family. Therefore, a family is composed of a closed group of subfamilies in which the closest relative of one subfamily is always another subfamily member from the same family. However, although the protein subfamily definition used in this work comprises (ideally) a natural unit (orthologous proteins with the same function), the protein family is not a straightforward concept, as it is necessary to set author-defined cutoff criteria to identify it. In fact, with tools such as blastp, identification of the protein superfamily to which one new protein belongs is easy and accurate. An additional functional analysis of the new protein permits recognition of the orthologous group (subfamily) to which this protein belongs. Nonetheless, at present there are no universal criteria to classify proteins into intermediate categories located between subfamily and superfamily. Indeed, a universally accepted protein family definition does not exist; thus, different authors use different concepts with a different emphasis, e.g. homology in sequence, structure, and/or function.
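The closed-group condition for families stated above (the closest relative of every subfamily lies inside the same family) can be checked mechanically once some measure of distance between subfamilies is available. The sketch below assumes such a measure exists as a nested dictionary; the subfamily names and distance values are hypothetical placeholders, and the choice of distance (for example, one derived from E-values or identity) is left open.

```python
def nearest_subfamily(distances, name):
    """Return the subfamily closest to `name`, excluding itself."""
    others = {k: v for k, v in distances[name].items() if k != name}
    return min(others, key=others.get)


def is_closed_family(family, distances):
    """True if every member's closest relative is also in `family`."""
    return all(nearest_subfamily(distances, sub) in family for sub in family)


# Hypothetical subfamilies s1..s4 with invented pairwise distances.
TOY_DISTANCES = {
    "s1": {"s2": 0.4, "s3": 1.2, "s4": 1.5},
    "s2": {"s1": 0.4, "s3": 1.1, "s4": 1.4},
    "s3": {"s1": 1.2, "s2": 1.1, "s4": 0.6},
    "s4": {"s1": 1.5, "s2": 1.4, "s3": 0.6},
}

print(is_closed_family({"s1", "s2"}, TOY_DISTANCES))
print(is_closed_family({"s1", "s3"}, TOY_DISTANCES))
```

A family with a single subfamily can never pass this test, because its nearest relative is necessarily outside it; such cases would need a separate rule, which is one reason the family level requires author-defined criteria.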
Therefore, using blast to compare E-values and identity/similarity values among different protein subfamilies, we can identify several clusters of protein subfamilies in the MDR superfamily. In this way, at the highest level of integration, we herein identify three great clusters or macrofamilies in the MDR superfamily (see Figs 2 and 3). At lower levels of integration, we identify the six clusters of orthologous groups of proteins (COGs) that comprise the MDR superfamily (according to the COG database proposed by Koonin & Tatusov (see Table 1) [30–32]), or the eight protein families recently proposed by Nordling et al. To illustrate the criteria used to identify clusters of protein subfamilies, Fig. 4 shows schematically the main relationships among the different subfamily members that comprise macrofamily II in Figs 3 and 4 (this big cluster is equivalent to COG1064, and comprises the Y-ADH and CADH families from Nordling et al.). Similar data were obtained with the other protein subfamilies (not shown).
Additionally, the proposed taxonomic categories (subfamilies, families, and macrofamilies) were validated by bootstrap analysis with conventional phylogenetic methods, using both distance-based methods (neighbour-joining and UPGMA) and character-based methods (maximum parsimony). To perform this phylogenetic analysis, only subsets of the MDR superfamily were utilized (the complete set demands excessive computing resources). Initial subsets employed for phylogenetic analysis included protein sequences that belong to only one kingdom (archaea, bacteria, animals, plants, or fungi). These kingdom-specific subsets were used to validate by bootstrap analysis the proposed taxonomic categories: macrofamilies and families. Later, subsets of proteins that belong to each of the proposed three macrofamilies, or eight families, were used to validate by bootstrap analysis the proposed 49 protein subfamilies. Figure 5 shows a phylogenetic tree constructed with protein sequences belonging to macrofamily II of the MDR superfamily. The additional phylogenetic trees constructed with protein sequences pertaining to macrofamilies I and III, and to each of the kingdoms to which the MDR proteins belong (archaea, bacteria, fungi, animals or plants), are not shown.
Table 2 shows a comparison of the proposed protein families that comprise the MDR superfamily, according to the COG database, the Nordling et al. paper, and the three macrofamilies or main clusters identified in this work. It is clear that information in addition to sequence data is needed to define the true protein families comprising the MDR superfamily. Consensus agreements among protein taxonomists must be reached before setting up intermediate categories between ideally true orthologous clusters (subfamilies in this paper) and superfamilies. Sequence data alone are not enough to set up true protein families with a real biological sense. It is important to point out that the intermediate categories proposed in the COG database, the Nordling et al. paper, and in this work create a congruent pattern despite the different criteria used to define them in each study.
Table 2. Comparison of the protein families included within the MDR superfamily according to the COG database, Nordling et al., and the three macrofamilies or main clusters of protein subfamilies identified in this work. The distribution of MDR subfamilies inside each protein family is indicated, as well as their distribution into the eukaryota, bacteria, and archaea domains.
1 This family was formerly denominated by Nordling et al. as the mitochondrial respiratory function proteins (MRF) family. 2 This subfamily probably comprises two or more related paralogous groups. 3 Nordling et al. inappropriately named this family acyl-CoA reductase (ACR).
Tables 3–8 present lists of subfamilies in the eight families of the MDRs, and their distribution into the different kingdoms, with a brief summary for each subfamily (a complete list with all protein sequences and consulted references was included as supplementary material and can be requested from the publisher or the authors).
Table 3. Main subfamilies that comprise the PDH family of MDR (COG1063) and their occurrence in eukaryota, archaea and bacteria.
a The members of this subfamily receive the official name of L-iditol 2-dehydrogenase, and possess alternative names such as glucitol dehydrogenase, xylitol dehydrogenase or polyol dehydrogenase, in addition to sorbitol dehydrogenase. This subfamily catalyzes the reversible oxidation of D-sorbitol and other polyalcohols, like xylitol and L-iditol, to the corresponding keto-sugars [149–152]. b N-terminus is similar to diverse DHSO; C-terminus is probably an NAD(P)H oxidoreductase, which belongs to the GFO_IDH_MocA family. It is related to synthesis of exopolysaccharides. c Two enzymes have been purified and characterized: formaldehyde dehydrogenase from Pseudomonas putida, and formaldehyde dismutase, also from Pseudomonas putida. However, Oppenheimer et al. recently demonstrated that formaldehyde dehydrogenase from P. putida is a functional alcohol dehydrogenase that conducts the efficient dismutation of a wide range of aldehydes (including formaldehyde), where NADH production represents a pH-dependent burst. Thus, both enzymes can be considered formaldehyde dismutases. d For bacteria and archaea, only sequences that can be unambiguously assigned to one subfamily are considered in the table. References are included in Table S2 of the supplementary material.
FADH (formaldehyde dehydrogenase-independent of cofactor-/formaldehyde dismutase)
Proteobacteria (γ subdivision)
Proteobacteria (β subdivision)
Table 4. Main subfamilies that comprise the ADH family of MDR (COG1062) and their occurrence in eukaryota, archaea and bacteria.
a This belongs to a highly conserved gene cluster encoding haloalkane catabolism on the plasmid Prtl1. b This shows affinity for a wide range of (substituted) aromatic alcohols, but is not capable of oxidizing aliphatic alcohols. c This subfamily comprises eight different classes involved, besides ethanol metabolism, in the synthesis and catabolism of several endogenous metabolites that regulate growth, metabolism, differentiation, and neuroendocrine functions [15,50,54]. d Some animal ADHs are also heterodimers (e.g., isozymes from human class I ADH). e Only class VIII ADH from Rana perezi uses NADP(H) rather than NAD(H) [49,50]. See final note (d) in Table 3.
Table 5. Main subfamilies that comprise the CADH family and Y-ADH family of MDR (COG1064) and their occurrence in eukaryota, archaea, and bacteria.
See final note (d) in Table 3. a Induced by several elicitors, such as pathogens, ozone, and wounding. b Proteins described with different activities: cinnamyl alcohol dehydrogenase, benzyl alcohol dehydrogenase, or mannitol dehydrogenase. Induced by fungal pathogens, wound, salicylic acid, and leaf senescence; shows a down-regulation by sugar or salt stress. c Shows broad substrate specificity; carbon source stimulated.
CADH and related (cinnamyl alcohol dehydrogenase)a
Table 6. Main subfamilies that comprise the QOR family and NRBP family of MDR (COG0604) and their occurrence in eukaryota, archaea and bacteria.
See final note (d) in Table 3. a In animals, NRBP1 is translocated to the nucleus by a piggyback mechanism. In rat, it interacts with peroxisome proliferator-activated receptor α, PPARα; thyroid hormone receptor, TR; retinoic acid receptor, RAR; retinoid-X receptor, RXR; and hepatocyte nuclear factor-4, HNF-4. Fungi lack nuclear receptors; in yeast, it is a single-stranded DNA-binding protein that fulfills a role as transcription factor. b Several activities for ζ-crystallin/QOR have been reported; however, the relative importance of any remains an enigma. Nevertheless, all ζ-crystallins retain NADPH binding capacity as a common character. c PIG3 in humans seems to be a redox-related protein involved in the formation of reactive oxygen species in response to p53-induced apoptosis. d Bifunctional protein in plants; monofunctional protein in Euglenozoa. In plants, it is a defense protein whose synthesis is activated in response to pathogen inoculation. In Euglenozoa, its functional role is not resolved. e VAT-1 forms a high-molecular-mass complex within the synaptic vesicle membrane, is composed of three or four VAT-1 subunits, displays an ATPase activity, and binds calcium with low affinity. f Probable monofunctional enoyl reductase involved in biosynthesis of actinomycete aromatic polyketides in a multicomponent (type II) polyketide synthase complex. g Monofunctional enoyl reductase associated with iterative multidomain type I polyketide synthase. h Expressed mainly in heart, brain, and skeletal muscle, and moderately expressed in placenta, kidney, and pancreas. i Dinap1 protein is one of the quantitatively major nuclear proteins in the dinoflagellate Crypthecodinium cohnii. Although Dinap1 did not bind directly to DNA, it activated basal transcription activity. j Protein highly expressed during fruit ripening, or induced in response to auxin treatment. k These proteins are expressed in plant roots, where light induces negative regulation. They are involved in biosynthesis of antimicrobial or allelopathic quinones. l They are included inside plasmids that contain a bacteriocin production region.
NRBP1 (nuclear receptor binding protein/transcription factor) a
Table 7. Main subfamilies that comprise the LTD family of MDR (COG2130) and their occurrence in eukaryota, archaea and bacteria.
See final note (d) in Table 3. a This subfamily in animals corresponds to proteins with two different activities, indicating that enzymes are capable of carrying out reduction of a double bond, as well as oxidation of a hydroxy group. b Enzymes efficient for dehydrogenation of secondary allylic alcohols and reduction of azodicarbonyl compounds and quinones. Induced by various oxidative-stress treatments. c Bacterial and archaea proteins show 40.2 ± 2.5% (SD, n = 36) average identity with animal LHD family, and a 39.6 ± 2.4% (SD, n = 36) with plant AADH family.
LTD (Leukotriene B4 12-hydroxydehydrogenase)/PGR (15 oxoprostaglandin 13-reductase) a
Table 8. Main subfamilies that comprise the ER family of MDR (COG3321) and their occurrence in eukaryota, archaea and bacteria.
See final note (d) in Table 3. a This enoyl reductase domain belongs to a multifunctional polypeptide of approximately 2500 aa that contains seven enzymatic domains. b This enoyl reductase domain belongs to a multifunctional polypeptide with modular organization where each module designates a repeated unit whose functional domains resemble a single type I fatty acid synthase. c This enoyl reductase domain belongs to a multifunctional polypeptide whose functional domains resemble a single type I fatty acid synthase. In fungi, PKS is involved in mycotoxin biosynthesis. d This enoyl reductase domain belongs to a multifunctional polypeptide of 8243 aa that contains 21 enzymatic domains in Cryptosporidium parvum. Three ER domains are organized inside three modules, each containing a complete set of six enzymes for elongation of fatty acid C2-units (i.e., one ER/module).
Enoyl reductase (modular polyketide synthase -PKS-) b
Proteobacteria (δ subdivision)
Proteobacteria (γ subdivision)
Proteobacteria (α subdivision)
Enoyl reductase (iterative polyketide synthase -PKS-) c
NADP+/NADPH (by similarity to modular PKS and FAS)
ER-FAS: alveolata (enoyl reductase from type I fatty acid synthase in alveolata) d
NADP+/NADPH (by similarity to modular PKS and FAS)
Interestingly, archaea protein sequences appear to be concentrated in only two families (macrofamily I: PDH family, COG1063, and macrofamily II: Y-ADH family, COG1064), suggesting that these two families, with a universal distribution, are the probable ancestral protein families in the MDR superfamily. However, in macrofamily III, a small uncharacterized cluster related to crotonyl-CoA reductase (CCAR) subfamily also possesses archaea members, also suggesting an ancient group.
In bacterial phyla, the taxa with sequences most related to eukaryota are firmicutes (Gram-positive) and proteobacteria (γ subdivision); see Tables 3–8. However, this proximity could simply be due to the fact that these bacterial clades possess the greatest number of completely sequenced genomes. Table 9 shows the number of identified genes that belong to the MDR in completely sequenced species. There is great variability with respect to the total number of genes identified in each organism, even within the same taxonomic category, as well as variability with respect to the number of genes identified in the MDR superfamily.
Macrofamily I: PDH family (COG1063): DHSO, TDH, and related subfamilies
This family was formerly denominated by Nordling et al. as PDH (polyol dehydrogenase) family; however, after including bacteria and archaea members, it is clear that less than half of their subfamily members possess an activity related to polyol metabolism. The PDH family is composed of 12 subfamilies (Table 3). Their characterized members contain zinc, show dehydrogenase or reductase activities, bind NAD(H), except secondary ADHs that use NADP(H), and are cytosolic proteins, with the exception of the bi-domain oxidoreductase subfamily (BDOR), which appears to be represented by transmembrane proteins. They are organized as homotetramers or homodimers that are involved in several metabolic roles, but only two correspond to anabolic activities: BDOR, involved in exopolysaccharide biosynthesis, and 2-desacetyl-2-hydroxyethyl bacteriochlorophyllide-a dehydrogenase subfamily (BCHC), in bacteriochlorophyll-a biosynthesis in proteobacteria. Remaining enzymes in PDH family show catabolic activities related either to aryl/alkyl metabolism (FDEH, secondary ADH, and BDH), formaldehyde metabolism (FADH, formaldehyde dismutase), carbohydrate catabolism (DHSO, SORE, GATD, and archaea GDH), and threonine and derivative compound catabolism (TDH and SSP). Five subfamilies have polyphyletic distribution and simultaneously exist in at least two domains (eukaryota and bacteria, or archaea and bacteria). Of these five subfamilies, four include tetrameric proteins and three are present in archaea.
Macrofamily I: ADH family (COG1062): class III ADH and related subfamilies
This family includes classical ADHs from animals and plants. ADH family comprises seven subfamilies absent in archaea (Table 4). Only one subfamily has a broad distribution: class III ADH, which is present in animals, plants, fungi and bacteria (cyanobacteria and proteobacteria). Proteins belonging to these subfamilies are cytoplasmic, although class III ADHs in animals are also nuclear. They contain zinc, bind NAD(H), except animal ADH8 from Rana perezi that uses NADP(H) [49,50], and show dehydrogenase or reductase activities, with the exception of hydroxynitrile lyase (HNL) in plants. They are homodimers and only mycothiol-dependent formaldehyde dehydrogenase is atypically reported as a homotrimer [51–53].
With the exception of HNL, involved in cyanogenesis in plants, all enzymatic activities fulfilled by the MDR subfamilies in the ADH family are catabolic activities related either to aryl/alkyl metabolism (benzyl ADH, firmicute aryl/alkyl ADH), or formaldehyde metabolism (class III ADH, mycothiol-dependent FADH). It is likely that the function of plant and animal ADHs, although typically associated with ethanol metabolism, is more complex, in that these comprise an intricate system with a broad diversity of enzymatic forms. The animal ADH subfamily, in addition to ethanol oxidation, participates in oxidation or reduction of diverse endogenous substrates involved in retinoic acid and bile acid synthesis, norepinephrine, leukotriene, serotonin, and dopamine catabolism, or in detoxification of cytotoxic products of lipoperoxidation such as 4-hydroxynonenal (reviewed in ). Thus, it is difficult to accept that this complex enzymatic system with its broad diversity of enzymatic forms and substrates (up to eight ADH classes in vertebrates) [49,54] was produced in the course of vertebrate evolution with the sole purpose of oxidizing ethanol, an exogenous metabolite found in minimal quantities under regular conditions: in fact, there are several endogenous substrates metabolized by this complex of enzymatic forms with an efficiency at least one thousand times higher than that of ethanol . A similar history probably occurred in plants. Plant ADHs comprise a complex subfamily with numerous enzymatic forms expressed in a developmental and tissue-specific manner; it was suggested recently that these participate in flooding tolerance, anther development, fruit ripening, disease resistance, and stress response (reviewed in ).
Macrofamily II: CADH family (COG1064): ELI3, CADH and related subfamilies
The CADH family comprises two subfamilies; only one shows a broad distribution (Table 5). Their members are oxidoreductases and use zinc. All are dimeric proteins and bind NADP(H), except ELI3 in celery. Enzymes in the CADH subfamily perform anabolic functions and participate in biosynthesis of cinnamyl alcohols, the monomeric precursors of lignin in plants. In bacteria, in which lignin is absent, CADH-related proteins participate in biosynthesis of the lipids composing the bacterial cell envelope; in fungi, they could participate in ligninolysis and fusel alcohol synthesis pathways [56,57].
Elicitor-inducible defense-related proteins (ELI3) are present only in eudicot plants, and show different, but related, defense activities: CADH, benzyl alcohol dehydrogenase, or mannitol dehydrogenase. ELI3 expression is elicited by fungal pathogens, wounds, salicylic acid, and leaf senescence. In celery, there is down-regulation by sugars or salt stress [62–64].
Macrofamily II: Y-ADH family (COG1064): yeast ADH, and related subfamilies
The Y-ADH family comprises four subfamilies; two show broad distribution (Table 5). Their members are oxidoreductases and use zinc. This family contains tetrameric proteins that use NAD(H) and have catabolic functions, involved mainly in metabolism of ethanol or short-chain alcohols (typical yeast ADH, broad ADH, and fungal-secondary ADH), or metabolism of mannitol (fungal MTD). The most ancient subfamily is probably the broad ADH; it is present in archaea and bacteria, and its members exhibit broad substrate specificity.
Macrofamily III: QOR family (COG0604): QOR and related subfamilies
Members of this family lack zinc and use mainly NADP(H) as cofactor. It is the most complex and divergent family, with 16 subfamilies (Table 6). Twelve subfamilies are found in only one taxon, suggesting intensive and recent enzymogenesis. In functional and structural terms, this is a highly divergent family and its members, in addition to oxidoreductase activity, act as lyases, nuclear-associated proteins, membrane traffic proteins (that participate in subcellular protein distribution), and integral membrane proteins with ATPase activity and calcium-binding capacity. This family is nearly absent in archaea; only Halobacterium sp. and Sulfolobus solfataricus have proteins related to CCARs. It is likely that CCAR and related proteins are the most ancient subfamily of macrofamily III, because they have the widest distribution (archaea, bacteria, and eukaryota) and because it is the only subfamily with a physiologic role related to primary metabolic pathways.
Macrofamily III: NRBP family (COG0604): NRBP1 subfamily and related
This small family comprises only nuclear receptor binding protein 1 (NRBP1) and related subfamily (Table 6). It has broad distribution, and is present in animals, plants, fungi and bacteria. Its members are homodimers, with both nuclear and cytosolic location. This family was formerly designated by Nordling et al. as the mitochondrial respiratory function proteins (MRF) family; however, this name is unfortunate in that members of this family probably do not have enzymatic activity. In animals these proteins are nuclear receptor co-operators; in the cytosol, in presence of the appropriate ligand, they interact with several nuclear hormone receptors, such as peroxisome proliferator-activated receptor α, thyroid hormone receptor, retinoic acid receptor, retinoid-X receptor, and hepatocyte nuclear factor-4. Later, the NRBP1-activated nuclear receptor complex is translocated by a piggyback mechanism to the nucleus, where it acts as a transcription factor. Although fungi and bacteria lack nuclear receptors, in Saccharomyces cerevisiae, MRF1_YEAST (P38071), a single-stranded DNA-binding protein, has acquired the activity of a transcription factor [66,67]. Indeed, it is a transcriptional regulatory protein of certain genes whose products are necessary for the functional assembly of mitochondrial respiratory proteins. In bacteria, uncharacterized related proteins are reported in Corynebacterium glutamicum and Xanthomonas campestris. Thus, it is likely that in the course of evolution, NRBP1 acquired a new function to work with nuclear receptors. This family appears to have evolved from members of the QOR family (COG0604).
Macrofamily III: LTD family (COG2130): LTD/AADH and related subfamilies
This is a small family with only three subfamilies (Table 7). Members lack zinc and have a preference for NADP(H) over NAD(H). Two subfamilies are found in only one taxon: leukotriene B4 12-hydroxydehydrogenase (LTD)/15-oxoprostaglandin 13-reductase (PGR), found in animals, and allyl alcohol dehydrogenase (AADH), found in plants. Both subfamilies clearly have their origin in an uncharacterized protein subfamily (LTD/AADH related) with broad distribution. This protein family is closely related to the QOR family (COG0604) (Figs 2 and 3).
Macrofamily III: ER family (COG3321): enoyl reductases
This family contains four related subfamilies comprising multifunctional polypeptides that enclose an MDR domain with ER activity (Table 8). ER domains in MDR enzymes use NADP(H) and lack zinc. These subfamilies show limited distribution and are involved in biosynthesis of fatty acids and polyketides. Nordling et al. inappropriately named this family as acyl-CoA reductase (ACR). As they identified correctly the enoyl-acyl carrier protein (ACP) reductase domain contained in multifunctional fatty acid synthase from animals, or enoyl-ACP reductase domain from iterative polyketide synthase in fungi, the generic name enoyl reductase is preferable. The enzyme ACR is absent in fatty acid synthase; this latter multidomain enzyme uses ACP as carrier for intermediates, not coenzyme A. ACR is usually a membrane-bound enzyme involved in the biosynthesis of fatty alcohols and waxes, and it is clearly a different enzyme that does not belong to the MDR superfamily [68,69].
Animal fatty acid synthases are closer to fungal iterative polyketide synthases than to any other fatty acid synthases from fungi, plants, or bacteria. The latter kingdoms possess one ER that does not belong to the MDRs. As can be seen in Figs 2 and 3, this protein family is also closely related to the QOR family (COG0604).
We will focus our discussion on five topics: criteria used to define a protein family; mechanisms of evolution in MDR; whether eukaryota inherited their enzymatic machinery mainly from bacteria; ancestral activities of MDR; and taxonomy within MDR superfamily.
Criteria used to define a protein family: sequence over functional similarities
Generally, the term protein family describes ‘a group of homologous (frequently orthologous) enzymes that catalyse the same reaction (mechanism and substrate specificity)’. However, in addition to their primary activities, enzymes often have other secondary activities with lower efficiency and different substrates and mechanism of reaction. For example, horse ADH also exhibits aldehyde dismutase [71,72] and esterase activities; yeast ADH additionally shows methylformate synthase activity. Therefore, it is clear that through evolution several proteins acquired, with only a few point mutations, activities that differed from the primary activity. This implies the existence of several structurally related proteins with high identity or similarity, but different functional roles. These proteins (closely related paralogs, but with a different mechanism of reaction and/or substrates) might even show higher similarity than the most distant phylogenetic derivatives in the same protein family (true orthologs) with the same activity, substrates, and mechanism of reaction. For example, identity and similarity between plant ADHs and class III ADHs from plants (paralogous proteins with different substrates) are higher than identity and similarity between class III ADHs from plants and bacteria, although both orthologous proteins have the same activity, substrates, and mechanism of reaction [indeed, identity between ADH1_MAIZE (P00333) and ADHX_MAIZE (P93629) (paralogous proteins) is 59%, but identity between ADHX_MAIZE (P93629) and FADH_PARDE (P45382) (orthologous proteins) is 55%]. Based on this type of data, it is clear that several proteins exhibit significant similarity (>30–40% identity), but have different functional roles. Therefore, sequence data alone cannot be used as the sole criterion to define protein families, because without functional data, orthologous and paralogous groups cannot be accurately identified.
On the other hand, protein function cannot be the main criterion used to define a protein family because one domain might have several catalytic activities. In fact, LTD subfamily shows two different and equally efficient catalytic activities: leukotriene B4 12-hydroxydehydrogenase, which catalyses oxidation of a hydroxyl group, and 15-oxoprostaglandin 13-reductase, which carries out reduction of a double bond. In contrast, there are several examples where the same function can be fulfilled by several nonrelated proteins with distinct domains, constituting analogous enzymes [75,77,78]. The MDR and the short-chain dehydrogenase/reductase (SDR) superfamilies contain several analogous enzymes. Thus, the SDR superfamily contains an analogous alcohol dehydrogenase found in Drosophila, a glucose dehydrogenase from Bacillus [80,81], an ER from bacteria and plants [82–84], a sorbitol dehydrogenase from Klebsiella, and a threonine dehydrogenase in animals. These enzymes represent different protein structure solutions to the same activities observed in MDRs.
In summary, phylogenetic data cannot be overlooked as a criterion for identification of a protein family. All families recognized inside the MDR superfamily are made up of clusters of phylogenetically related paralogous proteins, which may or may not conserve their original substrates or mechanisms of reaction. All paralogous proteins are generated by duplication events, and initially possess the same function; selective pressures and evolutionary forces shape the functional role that duplicated proteins will perform. A change in the functional role of a protein is not necessarily related to a change in substrates or mechanism of reaction. Recruitment of a duplicated protein into a different metabolic pathway, a different physiological role, or even a change in the spatiotemporal pattern of expression, expressing a protein in novel tissues and/or developmental stages, could be a good evolutionary reason to conserve the duplicated protein, and result in a novel paralogous protein with a different functional role.
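The identity comparison above can be illustrated with a minimal Python sketch. It assumes two sequences that have already been aligned (gaps written as "-") and counts identical residues over the columns where neither sequence has a gap; other conventions for the denominator exist, so this is one possible definition rather than the procedure used in this work. The function name and the toy fragments are hypothetical and are not real MDR sequences; only the 59% and 55% values are taken from the text.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns in which neither sequence has a gap.

    Assumption: the two strings come from the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned fragments (hypothetical, NOT real MDR sequences).
toy_a = "MKT-AVLG"
toy_b = "MKSQAV-G"
toy_identity = percent_identity(toy_a, toy_b)

# Identity values quoted in the text for the maize/Paracoccus example.
paralog_identity = 59.0   # ADH1_MAIZE vs ADHX_MAIZE (different substrates)
ortholog_identity = 55.0  # ADHX_MAIZE vs FADH_PARDE (same activity)

# Ranking by identity alone would group the paralogs before the orthologs.
paralogs_rank_closer = paralog_identity > ortholog_identity
```

Here `paralogs_rank_closer` is true, which is the point made in the text: a grouping based on identity alone would place the maize paralogs together ahead of the functionally equivalent orthologs.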
Therefore, we propose that the condition of performing the same function (with one, two, or more catalytic activities) must be assigned solely at a more specific (or restricted) taxonomic level, such as at the subfamily level (employed in this work). A protein family must be defined based mainly on sequence similarities, but in conjunction with other biological criteria different from function, such as phylogenetic data, since minor changes in amino acid sequence may induce changes of function.
Mechanisms of evolution in MDR superfamily
Enzymogenesis. Currently, two different evolutionary scenarios are envisioned for enzyme evolution. New catalytic functions of enzymes can evolve by: (a) changing the chemistry of catalysis, while retaining the binding capacity for a common ligand (hypothesis initially proposed by Horowitz) or (b) retaining the chemistry of catalysis while changing the substrate specificity. Interestingly, we found several enzymes of the MDR superfamily that conserved their chemistry of catalysis, but changed their substrate specificity, e.g. plant ADH and animal ADH subfamilies that evolved both from class III ADH subfamily; or secondary ADH from fungi and mannitol-1-phosphate dehydrogenase from fungi (Fungi MTD), that evolved both from yeast ADH subfamily. In contrast, we could not find two related enzymes of MDR superfamily that maintained their binding capacity for a common ligand, but with modification in their chemistry of catalysis. This possibility, described as retrograde evolution or substrate-driven evolution, suggests that metabolic pathways evolved in a backward manner, i.e. divergent members of the same protein family catalyse successive reactions inside a metabolic pathway. To our knowledge, only a few examples have been reliably identified to date: two pairs of enzymes in tryptophan and histidine biosynthesis [47,88].
The data presented in our manuscript enlarge perspectives on protein evolution, because in addition to the previously mentioned mechanism of enzyme evolution, we showed that preexisting enzymes can be recruited to form novel pathways in which proteins acquire new activities by changing both their binding capacity and their chemistry of catalysis. This last possibility is in concordance with a novel third hypothesis, recently proposed by Gerlt & Babbitt, which does not require conservation of either substrate specificity or chemical mechanisms; instead, they proposed that an active site is able to support an alternate reaction that may use some functional groups of the active site in a different mechanistic and metabolic context; in this proposal, only active site architecture is conserved. We discuss below one interesting example to support this third hypothesis. A divergent plant ADH with an acetone cyanohydrin lyase activity (P93243) has been described in flax (Linum usitatissimum) [90–93]. This protein belongs to a novel class of hydroxynitrile lyases (HNLs), and its amino acid sequence shows no overall homology to any cloned HNLs. Indeed, HNLs from plants form a heterogeneous group of proteins differing in molecular mass, quaternary structure, presence or absence of flavin adenine dinucleotide, as well as glycosylation. They have convergently evolved from FAD-dependent oxidoreductases, α/β hydrolases, and MDRs. Interestingly, HNL from flax is a zinc-containing protein and conserves all amino acid residues important for structural integrity or coordinating zinc [91,92]; however, flax HNL neither displays ADH activity nor is inhibited by reagents interfering with zinc coordination.
This information, together with the fact that flax HNL is more related to plant-, animal- and class III ADH, suggests that flax HNL evolved late from a plant-/class III ADH, which was recruited for cyanogenesis in plants, a recent secondary pathway used as a defence mechanism against herbivores. Existence of multiple phylogenetically independent HNLs in plants supports this proposal. Therefore, this novel activity within MDR superfamily was acquired without conservation of the original binding capacity and the chemistry of catalysis. In conclusion, proteins exhibit a huge unrecognized plasticity.
Another, different mechanism for enzyme evolution, also observed in members of MDR superfamily, corresponds to modular construction or gene fusion, in which separate gene products join together and generate new genes containing two or more domains with novel activities [75,96]. Examples of this modular construction within the MDR superfamily are as follows: bi-domain oxidoreductase (BDOR) involved in biosynthesis of exopolysaccharides; bifunctional QOR in plants, with an N-terminal domain related to short-chain dehydrogenase/reductase superfamily [98,99]; fatty acid synthase (FAS), a multifunctional polypeptide with seven enzymatic domains from animals or alveolata (protozoa); modular polyketide synthase from bacteria, and the iterative polyketide synthase from fungi [102,103]. All of them possess modular architecture. In this sense, it is important to mention that oligomerization is not conserved among members of MDR superfamily. For example, monomers, homodimers, homotrimers, homotetramers and heterodimers are present in this superfamily, and it has been proposed that degree of oligomerization might be involved with changes in the functional role developed by proteins [75,96].
Taken together, we conclude that the deep-rooted statements ‘one enzyme, one function’ and ‘one protein family, one function’ are not accurate for many enzymes. Several secondary activities might exist in one protein, as in the previously mentioned animal ADH or yeast ADH subfamilies (see the first topic in the Discussion section), and this can be the point of departure to gain novel and completely different functions. Indeed, we point out the fact that two different and equally efficient catalytic activities can be a feature of a single protein, as described for LTD/PGR subfamily. This catalytic promiscuity has been recognized as a vital springboard from which new catalytic activities can emerge from existing folds and active sites [70,104].
Data presented in this paper reinforce the idea that a protein can gain or lose a function through a limited number of amino acid changes, and several such examples from natural protein evolution are shown. MDR belongs to the limited number of protein superfamilies that possess both different mechanisms of reaction and substrate specificity [47,75]. Indeed, several laboratories [45,88,105] have mimicked the evolution of paralogous proteins in vitro, showing generation of new catalytic or binding properties by modifications of a preexisting protein scaffold; it is easy to forget that evolution has already carried out many such successful experiments.
Proteinogenesis vs. enzymogenesis.
Several subfamilies within the MDR superfamily evolved as nonenzyme homologs, i.e. novel proteins that have lost their original catalytic activity. ζ-Crystallin/QOR is probably the most well-investigated example. This protein is expressed in a taxon-specific fashion in the lens of the phylogenetically distant guinea pig, camel, and Japanese tree frog (Hyla japonica) [106–109], and constitutes approximately 10% of total water-soluble proteins of the lens. Other examples of nonenzymes within the MDR superfamily are: (a) NRBP1 that functions as a transcription factor in yeast [66,67], or nuclear receptor co-operator in animals; (b) dinoflagellate nuclear-associated protein (DINAP) that corresponds to the quantitatively major nuclear protein in Crypthecodinium cohnii, and although DINAP did not bind directly to DNA, it activated basal transcription activity [110,111]; and (c) the membrane traffic protein (AST) in fungi.
On the other hand, subcellular location is not conserved across members of the MDR superfamily. Although the great majority are soluble cytoplasmic proteins, some of them are located in mitochondria (yeast ADH), and nuclei (DINAP; NRBP1; class III ADH in animals), and others have a membrane location (VAT-1, and probably BDOR), or function as a structural protein (ζ-crystallin/QOR).
All these examples serve as a cogent reminder that Nature is not restricted to chemically or substrate-conserved strategies for divergent evolution; instead, divergent evolution is opportunistic and one active site architecture can be used to develop mechanistically distinct catalytic or noncatalytic functions. In other words, inside one protein superfamily (e.g. MDR), functional diversity is more complex than sequence diversity.
Eukaryota inherited MDR from bacteria
Our analysis of MDR superfamily shows that most MDR subfamilies in eukaryota are more closely related to their counterparts in bacteria than in archaea. This supports the idea that in eukaryota, although the machinery for DNA duplication, transcription, and protein synthesis is more related to archaea (informational genes), the enzymatic machinery is more related to bacteria (operational genes). This agrees with the generally accepted notion that eukaryotic cells are the symbiotic result of bacteria (the symbiont) and archaea (the host). Therefore, horizontal gene transfer of operational genes had a significant role in development of metabolic pathways in eukaryotes. In bacterial taxa, phylogenetic relationships that can be established within each protein subfamily suggest a significant horizontal gene transfer. In fact, it is calculated that nearly 20% of Escherichia coli genes were acquired by lateral transfer events in the last 100 million years. This contrasts with the nearly complete absence of recent examples of horizontal gene transfer between species that belong to different domains of life (eukaryota, bacteria, and archaea) in MDRs. Thus, although horizontal gene transfer among bacterial taxa appears to be a recurrent event, horizontal gene transfer between bacteria and eukaryota or between bacteria and archaea is a rare event (at least in MDRs). Only two clear-cut examples were identified: the first corresponds to the previously reported horizontal gene transfer of a secondary ADH from anaerobic bacteria to the protist Entamoeba histolytica, and the second, not previously reported, corresponds to horizontal gene transfer of an LTD/AADH-related protein from firmicutes (Gram-positive bacteria) to the archaea Halobacterium sp. NRC-1 (NCBI accession no. AAG19273). This latter example is shown in Fig. 2, where the LTD/AADH subfamily contains some bacterial sequences that are more related to the archaea sequence (coloured in dark blue) than to other bacterial sequences within the same subfamily, obtaining a phylogenetically discordant pattern that displays a distribution compatible with horizontal gene transfer. Furthermore, this archaea sequence is the only sequence in which its branch departs far from the centre of the unrooted tree (see Fig. 2).
Is there an MDR ancestral activity?
A preliminary answer to this question can be approached from several directions, but it is clear that ancestral activity (within a protein subfamily) should be related to a primary (also ancient) metabolic pathway with (an ideally) broad phylogenetic distribution. Thus, protein subfamilies with restricted phylogenetic distribution involved in secondary metabolic pathways cannot be considered as ancestral subfamilies.
Glutathione-dependent formaldehyde dehydrogenase activity of class III ADH in ADH family (COG1062). This has been proposed as the ancient activity from which both animal and plant ADHs are derived . However, this activity cannot be the ancestral function for the remaining subfamilies within the MDR superfamily, as shown by several pieces of evidence. First, glutathione (GSH) does not show the universal distribution observed for MDRs, inasmuch as GSH is restricted to proteobacteria, cyanobacteria, and eukaryotes [117,118]. Second, in organisms in which the mycothiol (MSH) molecule fulfils the functions of GSH, as in firmicutes, formaldehyde dehydrogenase activity exists in any event, but now as a mycothiol-dependent activity. A third cofactor-independent formaldehyde dehydrogenase subfamily (FADH) exists, present either in proteobacteria (with GSH), firmicutes (with MSH), and archaea (without GSH or MSH). Overall, data suggest that formaldehyde dehydrogenase activity in MDRs is very ancient and predates the origin of GSH or MSH. This is reasonable if we consider that formaldehyde reacts spontaneously with GSH or MSH to form S-hydroxymethyl-glutathione or S-hydroxymethyl-mycothiol, the true substrates for glutathione-dependent formaldehyde dehydrogenase (class III ADH) or mycothiol-dependent formaldehyde dehydrogenase, respectively. Furthermore, the FADH subfamily also shows formaldehyde dismutase activity and the capacity to catalyse a dismutation reaction has been conserved in animal ADH, a subfamily derived from class III ADH. Consequently, it is probable that ADH family (COG1062), absent in archaea, forms a paralogous group derived from FADH subfamily, which in turn exhibits more ample distribution than ADH family (COG1062).
Another interesting option for ancestral activity within MDR superfamily is ER; it is necessary in one of the primary (and ancient?) anabolic pathways, i.e. synthesis of fatty acids. However, little evidence supports this proposal. First, archaea contain membranes with isoprenoid-based ether lipids, lacking fatty acids. Furthermore, gene(s) for fatty acid synthase complex (FAS), as occurs in both bacteria and eukaryotes, is (are) absent in Methanococcus jannaschii, as well as in other completely sequenced archaea genomes such as Aeropyrum pernix K1, Archaeoglobus fulgidus, Methanobacterium thermoautotrophicum, Pyrococcus abyssi, and P. horikoshii (in agreement with our blast results). Thus, although archaea possess some members of MDR superfamily, ER activity probably cannot be the ancestral activity of this superfamily because archaea lacks known FAS, as well as medium-chain ER. Second, different types of FASs exist and each possesses different and unrelated ER. Thus, the ER member of the MDR superfamily is one of seven activities that comprise type I multifunctional fatty acid synthase in animals . ER present in type II fatty acid synthase characteristic of bacteria and plants belongs to short-chain dehydrogenase/reductase (SDR) superfamily, not to MDR as occurs in type I animal fatty acid synthase and some bacterial polyketide synthases. Additionally, ER present in fungi (type I fatty acid synthase α6β6 complex) does not show significant homology either to medium-chain ER or to short-chain ER (calculated with blast), suggesting the existence of a third class of ER. Indeed, the finding that multifunctional FAS protein exists in two distinct architectural forms, the α2 animal FAS and the α6β6 yeast FAS, with protein domains arranged in a different order, is compatible with the idea that FAS complexes evolved independently several times and that they are a late acquisition in metabolic evolution of organisms, subsequent to the split of major kingdoms. 
Thus, both arguments strengthen the idea that ER is not an ancestral activity of the MDR superfamily. Furthermore, extensive similarity between each domain in FAS and polyketide synthase (PKS), the presence of medium-chain ER, and the order in which the domains are arranged in these multifunctional complexes  suggest that animal FAS is more closely related to PKS than to any other FAS from fungi, plants, or bacteria. In conclusion, there is no one member in ER family (COG3321) that can be considered as an ancestral group.
According to heterotrophic theory, the only theory with experimental support to substantiate the origin of the first metabolic pathways , the most ancient catabolic activities should be semienzymatic fermentative routes fed by stable and available prebiotic compounds. Thus, glycolysis, proposed as the first catabolic route , should have been preceded by simpler versions. The upper part of glycolysis, from hexoses to trioses, appeared as a late adaptation because glucose 6-phosphate and aldopentoses are unlikely prebiotic compounds due to rapid decomposition on a geological timescale . Additionally, the step from glucose to glyceraldehyde 3-phosphate is not a universal pathway; it is absent in archaea, while there are other alternatives to transform glucose into triose derivatives [123–125]. On the other hand, the lower part of glycolysis, from glyceraldehyde 3-phosphate to pyruvate is universally conserved, and glyceraldehyde is one of the most attractive intermediates as an energy source for primitive organisms provided with nascent glycolysis. Some advantages of glyceraldehyde are: (a) it can be produced from formaldehyde under plausible prebiotic conditions [126–128]; (b) through glycolysis, it is an energy source for living purposes; (c) it is an important metabolite in photosynthesis; (d) it can be used in prebiotic condensation reactions [129,130]; and (e) it is a source of glycerol, necessary for synthesis of glycerolipids, the precursors of biomembranes.
Furthermore, results of Fukuchi & Otsuka  suggest that the glycolytic stage from glyceraldehydes 3-phosphate to pyruvate corresponds to one of the most ancient catabolic pathways, because genes involved in this stage of glycolysis exhibit the highest similarity to nucleotide sequences of ribosomal RNA and/or transfer RNA gene clusters, clearly predating the origin of proteic enzymes in the ancient RNA world and strongly suggesting that these metabolic pathways were developed by chance assembly of enzyme proteins generated from pre-existing genes. If this is true, it is clear that fermentative activity should be an early metabolic development to sustain activity of the ancient stage of glycolytic pathway to dispose of generated NAD(P)H. Alcoholic fermentation has been suggested as an early pathway, considering that ethanol permeates the membrane and is easily eliminated by the cell. Lactic acid fermentation should be a later development, in that lactate is a nonpenetrant product, hence retained inside the cell to be utilized to regenerate carbohydrates when autotrophic pathways became available . Therefore, one ancestral activity of the MDR superfamily is probably related to an ancient alcoholic fermentative activity, such as actually observed in some subfamilies like broad ADH (from the Y-ADH family), present in eukaryota, bacteria, and archaea [133,134]; these enzymes catalyse oxidation of a broad variety of substrates, which includes primary and secondary, linear- and branched-chain, aliphatic and aromatic alcohols, in addition to several of their corresponding aldehydes and ketones. Moreover, theoretical studies predict that primordial enzymes were nonspecific, with broad substrate specificity, and showing different activities characterized by slow reaction rates [120,135]. Indeed, some MDRs fulfil all these requirements (e.g. broad ADH subfamily [133,136,137], or animal ADH subfamily [15,138]).
Finally, we cannot disregard other activities, such as threonine dehydrogenase (TDH) or crotonyl CoA-reductase (CCAR), present both in archaea and bacteria. These activities are also probably ancient. TDH is involved in amino acid metabolism, and CCAR in benzoate catabolism, acetate assimilation, and interestingly, in the supply of precursors for polyketide biosynthesis. In animals, TDH initiates a minor degradative pathway, and the enzyme does not belong to the MDR superfamily. The animal enzyme forms a small subfamily whose distribution is restricted to animals, and was recruited from the short-chain dehydrogenase/reductase superfamily (bacterial UDP-glucose 4-epimerase, according to our blast analysis). On the other hand, the supply of precursors for fatty acid synthesis in bacteria and eukaryota is provided by acetyl-CoA carboxylase, an ancient enzyme also present in archaea. This suggests that the origin of acetyl-CoA carboxylase predates that of fatty acid synthesis, because fatty acids are absent in archaea. Apparently, the role of acetyl-CoA carboxylase in the supply of precursors for fatty acid synthesis is a later recruitment in the evolution of this enzyme. Thus, TDH and CCAR probably belong to ancient metabolic pathways subsequently substituted by other metabolic pathways.
Taxonomy within the MDR superfamily
Use of the complete set of known MDR proteins, together with the criteria and procedures described under the Results section, has allowed us to identify 49 subfamilies within the MDR superfamily, and two additional taxonomic levels containing eight families and three macrofamilies. Of these three taxonomic levels, only the subfamily level, as defined by us, comprises a natural unit that can be used to sort protein members of a protein superfamily with clear-cut rules. Thus, each subfamily encloses a set of ideally orthologous proteins that perform the same function, and delineates a closed group (see Results).
Two specific examples of subfamilies containing highly related paralogous rather than orthologous proteins are the animal ADH and plant ADH subfamilies. Both subfamilies originated by successive gene duplications from an ancient class III ADH. Animal ADH evolved only in vertebrates, and plant ADH only in tracheophytes. Within the former subfamily, fishes possess one animal ADH, while amphibia, reptiles, and birds appear to have at least two enzymes, and mammals up to six. It seems that animal ADH enzymogenesis developed in parallel to vertebrate evolution. Animal ADHs conserved the same mechanism of reaction, and share the same substrates; their main differences occur in their pattern of expression. Today, the functional roles developed by the different animal ADHs overlap, and this functional redundancy allows the individual to tolerate mutational or environmental perturbations. Absence of one ADH can be overcome by the existence of other members of the animal ADH subfamily. This partial functional redundancy contributes to a more general phenomenon designated ‘canalization’, which is the genetic capacity to buffer developmental pathways against deleterious perturbations; similar advantages can be described in plant ADHs. Therefore, these singular subfamilies comprise clusters of highly related paralogous proteins that share functional roles.
A protein family, as discussed previously, must comprise a cluster of monophyletic subfamilies, i.e. highly related paralogous proteins, that all derive from a common ancestor. They possess significant sequence identity and/or similarity, and may or may not share common substrates or mechanisms of reaction.
In contrast, a protein macrofamily within MDR comprises a cluster of related protein families with broad phylogenetic distribution, i.e. with protein members from the three domains of life, that originate from a common ancestor (i.e. they are monophyletic). Furthermore, within each macrofamily at least one subfamily possesses a physiological role related to primary metabolic pathways (with a probable ancient origin). Thus, the advantage of clustering protein families into macrofamilies lies in the fact that not all families are equally related, probably because some protein families are more ancient than others. Indeed, within each MDR macrofamily, there is a probable ancestral group (see the previous section) that might be traced to the last universal common ancestor. If the latter is true, the number of macrofamilies within the MDR superfamily reflects the original number of MDR proteins that existed in the last universal common ancestor. It is important to mention that Castresana, after analysing the phylogenetic distribution and evolution of bioenergetic pathways, concluded that the last universal common ancestor contained several members of each gene family. This agrees with the idea that the last universal common ancestor was a metabolically sophisticated organism.
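The two testable parts of this macrofamily definition can be written as a simple predicate. The following is only a minimal sketch: the `Subfamily` record, its field names, and the function name are hypothetical, the `primary_metabolism` flags are illustrative annotations rather than data from this work, and monophyly is assumed to have been established separately because it cannot be checked from these fields.

```python
from dataclasses import dataclass

# The three domains of life referred to in the macrofamily definition.
DOMAINS_OF_LIFE = frozenset({"archaea", "bacteria", "eukaryota"})


@dataclass(frozen=True)
class Subfamily:
    """Hypothetical record for one subfamily (field names are assumptions)."""
    name: str
    domains: frozenset          # domains of life in which members were found
    primary_metabolism: bool    # illustrative flag: role in a primary pathway


def meets_macrofamily_criteria(subfamilies):
    """Check distribution and functional criteria for a candidate macrofamily.

    Monophyly of the cluster is NOT tested here; it is assumed to have been
    established beforehand by phylogenetic analysis.
    """
    covered = set()
    for subfamily in subfamilies:
        covered |= subfamily.domains
    spans_all_domains = DOMAINS_OF_LIFE <= covered
    has_primary_role = any(s.primary_metabolism for s in subfamilies)
    return spans_all_domains and has_primary_role


# Distributions follow the text; the boolean flags are illustrative only.
broad_adh = Subfamily("broad ADH", DOMAINS_OF_LIFE, True)
animal_adh = Subfamily("animal ADH", frozenset({"eukaryota"}), False)

print(meets_macrofamily_criteria([broad_adh, animal_adh]))
print(meets_macrofamily_criteria([animal_adh]))
```

A cluster that fails either test would remain at the family level under this reading of the definition.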
Finally, it is interesting to point out that, in comparison with the other taxonomic categories, the superfamily concept is not the focus of extensive discussion, and there is near consensus that, in addition to sequence similarity and a common evolutionary origin, 3D structure data should be taken into consideration. Thus, a superfamily can be considered as a group of homologous protein families (and/or macrofamilies) with a monophyletic origin that share at least a barely detectable sequence similarity and show a similar 3D structure [144,145].
Inclusion of phylogenetic criteria to define subfamilies, families, macrofamilies, and superfamilies is in line with the present tendency to construct a natural taxonomy of proteins and protein families. Figure 6 illustrates the relationships among the different taxonomic categories defined in this work.
Having developed the MDR molecular taxonomy, we propose application of the methodology employed in this paper to other protein superfamilies for several reasons. First, use of the blastp program in an iterative manner allows for identification of all members of any protein superfamily. Second, use of all-vs.-all blast-based searches within one protein superfamily, together with extensive database mining, allows members of any protein superfamily to be sorted into subfamilies, i.e. closed groups of orthologous proteins with blastp reciprocal best hits. This procedure provides an advantage over classical methods for ortholog detection because it permits use of all available protein sequence members of one superfamily, bypassing global multiple alignments and construction of phylogenetic trees, which can contain slow and error-prone steps. Thus, by employing this faster and clear-cut procedure, one can benefit from all the available information without the need to select representative proteins and/or genomes. In addition, the different taxonomic categories proposed in this work (subfamily, family, and macrofamily) can be applied to other protein superfamilies, once formal definitions for each taxonomic rank are provided.
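The reciprocal-best-hit grouping step can be illustrated with a minimal sketch. It assumes that all-vs.-all blastp results have already been parsed into (query, subject, bit score) tuples and that each protein is mapped to its species; the protein identifiers, scores, and function names below are hypothetical. Treating connected components of the reciprocal-best-hit graph as the closed groups is a simplification, not the exact procedure described under Results, and a real analysis would also apply a score or E-value cutoff.

```python
from collections import defaultdict


def best_hits(hits, species):
    """Map (query, target species) to the top-scoring (subject, bit score)."""
    best = {}
    for query, subject, bitscore in hits:
        if species[query] == species[subject]:
            continue  # ignore self hits and within-species paralogs
        key = (query, species[subject])
        if key not in best or bitscore > best[key][1]:
            best[key] = (subject, bitscore)
    return best


def reciprocal_pairs(best, species):
    """Return the pairs of proteins that are each other's best hit."""
    pairs = set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, species[query]))
        if back is not None and back[0] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def group_components(proteins, pairs):
    """Group proteins into connected components of the pair graph (union-find)."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for pair in pairs:
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical toy data: identifiers and bit scores are invented.
species = {
    "adh_human": "human", "adh_mouse": "mouse", "adh_ecoli": "ecoli",
    "tdh_ecoli": "ecoli", "tdh_bsub": "bsub",
}
hits = [
    ("adh_human", "adh_mouse", 500.0), ("adh_mouse", "adh_human", 500.0),
    ("adh_human", "adh_ecoli", 300.0), ("adh_ecoli", "adh_human", 300.0),
    ("adh_mouse", "adh_ecoli", 290.0), ("adh_ecoli", "adh_mouse", 290.0),
    ("adh_human", "tdh_ecoli", 60.0), ("tdh_ecoli", "adh_human", 60.0),
    ("tdh_ecoli", "tdh_bsub", 400.0), ("tdh_bsub", "tdh_ecoli", 400.0),
]

best = best_hits(hits, species)
pairs = reciprocal_pairs(best, species)
for group in group_components(species, pairs):
    print(group)
```

In this toy input the weak hit between `tdh_ecoli` and `adh_human` is not reciprocal, because `adh_human` has a better-scoring hit in the same species, so the two candidate groups stay separate without any alignment or tree building.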
We thank R.N. Ondarza (Instituto Nacional de Salud Pública, México), H. Weiner (Purdue University, USA), S. Bentley (Sanger Institute, UK), R.F. Doolittle (University of California, La Jolla, USA), A. Steinbüchel (Wilhelms-Universität, Münster, Germany), M. Pharr (North Carolina State University, USA), A. Gómez-Puyou and M. Tuena de Gómez-Puyou (Instituto de Fisiología Celular-UNAM, México), K. Yazaki (Kyoto University, Japan), A. Sosa-Peinado (Facultad de Medicina-UNAM, México), X. Parés, J. Farrés, J. A. Biosca and their collaborators (Universitat Autònoma de Barcelona, Spain), and three anonymous referees for helpful critical review of this manuscript and/or discussions. This work was supported by grants 34823-M from CONACyT, México, and IN214101 from DGAPA-UNAM, México. H. R. R. has been supported by a graduate fellowship from DGEP-UNAM and CONACyT, México.