Is gene duplication a viable explanation for the origination of biological information and complexity?



All life depends on the biological information encoded in DNA with which to synthesize and regulate various peptide sequences required by an organism's cells. Hence, an evolutionary model accounting for the diversity of life needs to demonstrate how novel exonic regions that code for distinctly different functions can emerge. Natural selection tends to conserve the basic functionality, sequence, and size of genes and, although beneficial and adaptive changes are possible, these serve only to improve or adjust the existing type. However, gene duplication allows for a respite in selection and so can provide a molecular substrate for the development of biochemical innovation. Reference is made here to several well-known examples of gene duplication, and the major means of resulting evolutionary divergence, to examine the plausibility of this assumption. The totality of the evidence reveals that, although duplication can and does facilitate important adaptations by tinkering with existing compounds, molecular evolution is nonetheless constrained in each and every case. Therefore, although the process of gene duplication and subsequent random mutation has certainly contributed to the size and diversity of the genome, it is insufficient on its own to explain the origination of the highly complex information pertinent to the essential functioning of living organisms. © 2011 Wiley Periodicals, Inc. Complexity, 2011


1.1. The Efficacy of Natural Selection

One of the singular issues in molecular biology and evolution concerns the origins of the distinct exonic sequences and motifs that contribute to the functionality of the genome and to organismic complexity. Indeed, the cause of such a huge proliferation of genetic information, coding for polypeptides ranging from the 49-residue echistatin to titin, a gigantic protein found in muscle tissue and consisting of over 30,000 amino acids, remains an elusive and unsolved problem in the study of biological origins. It is presumed that the genes of all extant and extinct species have evolved from a life-form with a protogenome1. Natural selection per se is a poor candidate to explain such an evolution of complexity2, as it is disposed to conserve the existing structure and organization of genes, and their essential information content, resulting in functional stasis3.

Research into the evolution of genes has shown that the peptides they code for are of a finicky and precarious nature, both marginally stable and prone to aggregation4. Protein folding happens to be a highly complex and synergistic process, involving a number of epistatic relationships among many residues. This phenomenon, compounded with the issue of interactions between protein molecules, can significantly complicate adaptive evolution such that in the majority of cases the overall effects on reproductive fitness are very slight5, 6. Many arguably “beneficial” mutations have been observed to incur some sort of cost and so can be classified as a form of antagonistic pleiotropy7.

Indeed, the place and extent of natural selection as a force for change in molecular biology have been questioned in recent years8. Detecting the incidence of any beneficial substitutions in genes has so far relied on statistical inferences as empirical evidence is less readily available. In many instances, nonsynonymous changes and shifts in allelic diversity may be induced by factors that can serve to imitate selective effects—biased gene conversion, mutational and recombinational hotspots, hitchhiking, or even neutral drift being among them9. Moreover, several well-known factors such as the linkage and the multilocus nature of important phenotypes tend to restrain the power of Darwinian evolution, and so represent natural limits to biological change10. Selection, being an essentially negative filter, tends to act against variation including mutations previously believed to be innocuous11. For example, PABPC1 is a polyadenylate-binding protein used in translation initiation in both humans and mice12. Although there are 92 nucleotide differences in the translated region of the respective orthologous genes, these are all synonymous except in just two codons where Asp has been replaced at residue 209 with Glu and Thr with Ser at residue 576—both similar amino acids. However, it is also clear that the gene's role is essential and that any functional divergence in this particular case is unnecessary.

1.2. Duplication as a Potential Driving Force Behind Molecular Evolution

Gene duplication offers the prospect of a respite from stringent purifying/negative selection13. This is because only one gene locus needs to be functional, meaning that any paralog is freer to diverge allowing for changes, promoted by near neutral drift, which would not normally be tolerated in the case of a singleton. It is thought that suboptimal and deleterious changes may become fixed and accumulate through a more permissive regime of selection14, such that they fortuitously combine to produce a novel adaptive function. However, any evolutionary development must be tempered by the impact of any changes on protein structure and stability15 and not just the peptide sequence itself.

Although it may be inefficient and costly for the cell to produce identical surplus proteins, which can lead to unwanted overexpression and harmful phenotypes16, duplication can also prove beneficial by providing a useful double dosage17. The role of duplicate genes in facilitating alternative metabolic pathways and regulatory interactions18 is another important factor.

Duplication, including instances of intragenic amplification, can occur by way of unequal chromosomal crossovers, the retropositioning of spliced mRNA, and copying of a whole chromosome or even an entire genome—the persistence of entire gene networks helps to explain the presence of polyploidy in plants19.

However, genomic studies have revealed that active duplicates may nonetheless be selected for their redundant utility20, as they can serve as backups when a mutation inflicts damage to a sister site21, 22. This means that any changes made to them are liable to be selected against if they impair this masking ability and its contribution to genomic robustness. This may explain, in part, the huge effect of duplicates in shaping both prokaryotic and eukaryotic genomes, and their evolutionary preservation23. Another phenomenon involved in the retention of duplicate genes is “subfunctionalization,” namely the differential partitioning of function or expression24. Here, redundant functions will degenerate at random from the daughter copies until their joint function matches that of the parent gene25.

Were selection to be completely relaxed and any manner of changes permitted, this would only serve to guarantee complete degeneration. It would invariably lead to the introduction of null and nonsense mutations, scrambling the open reading frame (ORF) and degrading the cis-regulatory elements involved in transcription, and so to the gene's pseudogenization. Thus, a measure of purifying/stabilizing selection seems necessary for duplicate preservation, and any evolutionary divergence would proceed under a relaxed regime rather than none at all.

Moreover, in terms of population size, Kimura's diffusion approximation26 makes it abundantly clear that in diploid populations of a normal size, typically for those of N > 10,000, even the slightest degree of negative selection is sufficient to prevent any deleterious allele from surviving and increasing in frequency to the point of fixation or near fixation. This would mean that any major changes in both gene singletons and duplicates alike would tend to occur in smaller populations, where drift is much stronger and selection is weaker.
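The population-size argument can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: it evaluates the standard diffusion-approximation formula for the fixation probability of a single new mutant, assuming additive selection and an effective population size equal to the census size N. The function name and the selection coefficient s = -0.001 are hypothetical choices made only for illustration.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Fixation probability of one new mutant copy in a diploid population.

    Assumptions (illustrative): additive selection with coefficient s,
    effective size equal to census size n, initial frequency 1/(2n).
    Formula: (1 - exp(-2s)) / (1 - exp(-4ns)).
    """
    if s == 0.0:
        return 1.0 / (2 * n)  # neutral limit
    exponent = -4.0 * n * s
    if exponent > 700.0:
        return 0.0  # strongly deleterious: avoid overflow, result is ~0
    # expm1 keeps precision when s is close to zero
    return math.expm1(-2.0 * s) / math.expm1(exponent)


s = -0.001  # hypothetical, slightly deleterious allele
for n in (100, 10_000):
    relative = fixation_probability(s, n) / fixation_probability(0.0, n)
    print(f"N = {n}: fixation probability relative to neutral = {relative:.3g}")
```

Under these assumptions, the slightly deleterious allele fixes at roughly four fifths of the neutral rate when N = 100, but its chance of fixation is vanishingly small when N = 10,000.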


The purpose of this study is to determine the existence and extent of any novel information produced as a consequence of gene duplication. At stake is whether there is sufficient supporting evidence that the digitally communicated instructions27 encoded in DNA could have been constructed through known evolutionary processes, or whether the data suggests that an alternative explanation is required as in all codified nonbiological information. Therefore, this would serve as a means of assessing the current arguments regarding the origins of biological and genomic complexity.

2.1. The Information Conundrum

Although the nucleotide sequences in DNA are commonly understood to carry/convey biological “information”28, a precise scientific delineation for the term in the context of genetics is often found to be lacking. Therefore, it is impossible to test any hypothesis regarding the creation of new genetic information without offering at least a conceptual definition of what information means and what the criterion is for identifying it. In Shannon's theory29 of communication, information is termed the “reduction in uncertainty,” where entropy is the measure of any stochastic dependencies—the greater the level of uncertainty that exists in a particular situation, the less likely it is to predict the behaviors and outcomes because of the presence of random noise. Therefore, information is that which denotes a degree of determinism in a known relationship, although this would also have to involve a large measure of contingency to permit as many possible combinations to be conveyed. In the framework of molecular biology, information would refer to the inherent functionality of gene products: i.e., how they interact with the biochemical environment in which they operate.
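For readers unfamiliar with the Shannon measure referred to above, a minimal sketch follows. It computes only the entropy of an empirical symbol distribution, in bits per symbol, and says nothing about function. Applying it to a single alignment column is an assumption made here for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbol frequencies in `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical alignment columns: one fully conserved, one maximally variable.
conserved_column = "AAAA"
variable_column = "ACGT"
print(shannon_entropy(conserved_column), shannon_entropy(variable_column))
```

A fully conserved column carries no uncertainty (0 bits), whereas four equally frequent nucleotides give the maximum of 2 bits per site.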

Therefore, I have decided to define any gain in exonic information as: “The qualitative increase in operational capability and functional specificity with no resultant uncertainty of outcome.” The two parts of the statement are complementary, because an appreciably great degree of specificity is required to reduce any uncertainty and problems regarding behavior and effect: this is especially true in the case of enzymes that catalyze only particular reactions, and to the exclusion of all others. A random mutation in the active site could well lead to an “advantageous” outcome in a particular environment owing to a shift in catalytic activity. However, the evidence suggests this would entail an alteration in the particular specificity pattern30. Therefore, it would mean that an increase of uncertainty and more erratic behavior, with respect to the overall and net effect(s), is a consequence of such a development.

2.2. The Relationship Between Sequence, Function, and Evolutionary Divergence

Usually, it is safe to say that homologs share basically the same function and that many changes in sequence are not consequential. However, this is very much a general rule. A single amino acid replacement in a carboxyl esterase in blowflies confers organophosphorus insecticide resistance31, although this is because of a loss in the primary enzymatic activity. Many synonymous changes have indeed been identified with codon usage bias, contributing to splicing and translational efficiency32. A study has found that there exists a threshold at ∼ 50% sequence similarity below which functional divergence is enhanced33. Orthologs performing the same function should be under the same selective constraints and evolve at the same rate. But in the case of paralogs, there is a relaxation of purifying selection, and distinguishing loss of constraint from rapid evolution driven by adaptation is difficult because the loss of constraint often precedes any potential neofunctionalization34.

2.3. Testing for the Role of Natural Selection in the Creation of Novel Functionality

Detecting the effect of Darwinian positive selection—whereby an allele is supposed to increase in frequency because it confers a reproductive advantage—is not an exact science by any means, and it relies on statistical-based inferences that leave much to interpretation. Even if adaptive mutations have been prominent in a gene, it is not accurate to necessarily infer that any new functionality has arisen. All it means is that an allele has contributed to a gain in reproductive fitness, and nothing beyond that. In many instances, as with the example above, a loss of function and regulation in a harsh or unusual environment can have a beneficial outcome and thus be selected for—bacteria tend to evolve resistance to antibiotics in such a way through mutations that would otherwise adversely affect membrane permeability35. The magnification of the importance of one or more loci is tantamount to artificial selection, but occurs in some cases during drastic environmental catastrophes, where a single trait might make a difference between survival or not.

Population genetics methods typically involve measuring levels of heterogeneity and polymorphism at sites including and in proximity to the one under investigation36. It can lead to confusing results because the effects of Darwinian selection are often the same as those of background selection—the purging of neutral alleles due to their spatial proximity to deleterious ones37: the case of the gene implicated in microcephaly likely being a controversial example of this38. Sequence alignment methods are preferred, especially where data from a sample of a population is not available. As such, three ratios were determined and used throughout to detect the probability of functional change39.

  • (i) The ratio of nonsynonymous to synonymous substitutions, dN/dS (ω), is regarded as the most obvious indication of adaptive change and functional shift40. In the case of neutral evolution, it would be around 1:1, but the proportion is skewed in favor of the former if positive selection is prevalent, whereas purifying selection is inferred when this is reversed. When comparing singletons in different phylogenetic lineages, this is a very powerful method, but in the case of duplicates more caution is required. As has been previously mentioned, there is an appreciably relaxed regime of selection in paralogous genes because only one need maintain the original function(s). As such, the rate of nonsynonymous substitutions may be much higher, not on account of adaptive evolution, but because purifying selection is far less stringent than it is for singletons.
  • (ii) The transition to transversion ratio, ts/tv (κ), is also a useful test. Although there are twice as many possible transversions as there are transitions, the molecular mechanisms by which they are generated mean that transitions (e.g., purine to purine) are more frequent than transversions (e.g., purine to pyrimidine). Notwithstanding mutational bias, the ratio can be seen as evidence for adaptation if the transversions greatly exceed transitions41.
  • (iii) The ratio of radical to conservative replacements, KR/KC, is a measure of the nature of the evolutionary changes in peptides. As many amino acids are chemically similar, they may also be relatively interchangeable (as with Val, Ile, and Leu) and so can be regarded as essentially neutral substitutions. Therefore, dN/dS may not reflect the significance of any divergence. If KR/KC is >1, then this could be suggestive of the fixation of beneficial mutations. However, such is the nature of context specificity within protein domains that a suboptimal but still conservative replacement at one site could require a compensatory42 and more radical change at another. Although widely used, the method has been criticized for being too simple and for showing nothing about actual changes in the behavior of the protein43.
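The distinction between transitions and transversions used in the second ratio can be sketched in code. This is an illustrative helper only, not the procedure used in the study: it assumes two pre-aligned nucleotide sequences of equal length and simply skips gaps and ambiguous bases. The function name and example sequences are hypothetical.

```python
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap, or ambiguous base
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


# Hypothetical toy alignment: A>G and C>T are transitions, G>T is a transversion.
ts, tv = count_ts_tv("ACGTAC", "GCTTAT")
print(ts, tv)
```

Dividing the first count by the second gives the raw ts/tv ratio; correcting for multiple substitutions at the same site would require a substitution model, which this sketch deliberately omits.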

2.4. Aims of Investigation and Materials Used

Several familiar and exemplary cases of evolution following an initial gene duplication were chosen and categorized according to known mechanisms of divergence that include fusion, frameshift mutations, retroposition, internal amplification, and de novo recruitment. There is, of course, considerable overlap between these various mechanisms, although the primary focus is different for each case. The scope and remit of the investigation was limited to exonic sequences within the translated regions, thus largely avoiding regulatory areas and introns, where retrotransposon insertions are believed to be significant44. Although gene regulation and expression are important, it is the regions that code for protein sequences that comprise by far the primary source of biological information. All pertinent sequence data, both nucleotide and amino acid, were downloaded from the NCBI database or taken from the sources cited in the relevant literature. Standard alignments for analyzing and illustrating the data were produced using BLAST, and more advanced pairwise alignments using the ClustalW2 algorithm together with EMBOSS.


3.1. Duplication and Gene Fusion: The Case of Sdic

Sdic is believed to be a flagellar dynein gene found only in Drosophila melanogaster, and is an example of a tandem duplicated chimeric gene “caught in the act” of evolving45. It was formed when two adjacent genes, AnnX (coding for a cell adhesion protein) and Cdic (encoding a cytoplasmic intermediate chain dynein), were first duplicated and one pair subsequently underwent a deletion-mediated fusion. Sdic is found to be composed of four paralogs, having itself been duplicated twice over. The 5′ untranslated region (UTR) and part of the promoter sequence of the gene derive from AnnX, whereas the translated part and all 300 base pairs (bp) of the 3′ UTR come from the Cdic gene. A sequence comparison of Sdic2 and Cdic reveals that 522 out of 527 residues (99%) can be aligned without difficulty. Sdic has been observed to be expressed in the testes and incorporated into the sperm tail, and this is because it has acquired a testis-specific core element, homologous with those of other promoter sequences, from the 3′ UTR of AnnX46. It is unclear whether the element is a translational enhancer or has some other regulatory role in the AnnX gene such as, for example, in mRNA localization. Either way, the gene would seem to contribute to greater fecundity.

However, it is the loss of over 100 codons from Cdic's N-terminus47, involving at least two domains, that deprives the Sdic protein of the motifs necessary to enable it to interact with dynactin (a basic characteristic of cytoplasmic dyneins) and that represents the principal functional shift. Thus, Sdic is axonemal almost by default owing to the mass deletion of exonic information pertinent to cytoplasmic-specific operation (Figure 1). The gene's promoter has simply acquired features from preexisting coding sequences and information present in AnnX, whereas its translated region is virtually identical with the corresponding part of Cdic.

Figure 1.

The alignment of Cdic and Sdic (2 and 4) reveals the virtual identity of the corresponding coding regions in the genes. The N-terminus of Cdic, consisting of 100 codons, is missing in Sdic, and this means that Sdic lacks the motifs necessary to interact with dynactin. An intronic recruitment at the amino end has led to the exonization of 16 codons, whereas another deletion downstream, this time involving the loss of 16 codons, is present within the fourth domain from the 5′ end. This development has resulted in a frameshift that provided five novel characters in the sequence.

The distal and proximal conserved elements are also found to be very similar to those of the Cdic promoter. In addition, the 16 codons present at the N-terminus of Sdic, recruited from Cdic's third intron along with an 11 bp insertion, bear a tenuous resemblance to the amino ends of axonemal intermediate chain dyneins such as those for oda6 and AclC348. It is reasonable to assume that this small amount of exonization, allowing a previously noncoding region to become the start site and initial part of the Sdic gene, is adaptive. As such, this could be interpreted as evidence for the de novo creation of novel information.

Further evidence for the role of selection in the development of Sdic includes a possible sweep found in the low levels of polymorphism across neighboring loci and a skewed frequency distribution of allelic variation. However, it is noted that a reduced level of heterozygosity in a region of low recombination, such as at the base of the X-chromosome where Sdic is located, is also consistent with background selection because of the effect of deleterious mutations49. Both analyses could in fact be correct. Although the number of nonsynonymous differences is greater than synonymous ones, as would be expected in a basic test for adaptive evolution, this is due to a bulk deletion and resultant frameshift occurring in the fourth domain (inherited from Cdic) that produced a string containing at least five novel characters. As this domain is believed to be nonfunctional in Sdic, it is more logical to infer the existence of a relaxed regime and decrease in selective constraints, than to assume any adaptive change. Therefore, the initial loss of information at the N-terminus because of relaxed selection was then compensated for by the recruitment of sequences from an intron of Cdic and the exons of AnnX. In this way, a nonfunctional cytoplasmic dynein “evolved” into an axonemal one through a process of copy, cut, and paste.

Divergence between the Sdic paralogs themselves has been very limited, such that the translated regions of Sdic2 and Sdic4 actually share 100% nucleotide identity and are functionally redundant. Although the gene is considered to be young, having formed within the last 2–3 million years, the short generational span of the fruit fly (∼ 2 weeks) means that the evolutionary timescale may actually be rather long (∼ 50 million generations).

It is possible that Sdic contributed to speciation and the emergence of the melanogaster line50. The most likely scenario involves a population bottleneck, migration, or founder effect51. Any reduction in effective population size would also produce a further relaxation of selective constraints as (nearly) neutral drift would predominate.

It appears that deletion in this instance was one of the necessary factors involved in gene fusion. As such, Sdic is shorter than Cdic, and this is true also for the hominoid oncogene, TRE2, which is 200 residues shorter than one of its parents, USP3252. This presents a problem in terms of explaining any accretion of cistron size with reference to the most naturally applicable evolutionary process. Deletion-mediated fusion also means that usually one of the genes is far less preserved than the other. In the case of Kua-UEV, however, the effect is additive because it has retained the original and separate functions of both its parents53. Although it may behave slightly differently, particularly with respect to intracellular localization, the information content has not appreciably changed.

3.2. Duplication and Frameshift Mutation

Already briefly mentioned in the previous section, another potential means by which new genes, with new exonic information, might arise is by way of a frameshift resulting in an entirely different ORF and peptide sequence. A case of just such a development was proposed by Ohno54 in the case of a nylon oligomer hydrolase found in bacteria near sites involved in the production of the synthetic material. However, a study by Negoro et al.55 found that the likely source was actually an esterase containing a β-lactamase fold. Two amino acid replacements in the catalytic cleft greatly increased the Ald-hydrolytic activity, in some measure already provided by a serine active site, necessary for the degradation of the oligomers. However, this does appear to have come at some cost to part of the esterolytic function and the enzyme does not have nearly the specificity constant and efficiency, with respect to its alternative functionality, of a hydrolase such as aminoacylase56. Therefore, although there is an appreciable gain in operational capability, no new information was generated that specified oligomer degradation.

Scherer and coworkers57, using a search on BLAST, found that as many as 470 duplicated genes in humans had been affected by frameshift translation. However, frameshifts induced either by indels or by transposons (mobile elements) are themselves poor candidates for the generation of novel information because they almost inevitably incur premature stop codons58, leading to protein truncation, in addition to scrambling part of the original reading frame. This is indeed evident in some of the genes presented in their study. HTR3D is a hydroxytryptamine (serotonin) receptor in humans, which is essentially the carboxyl terminus remnant of HTR3C. However, owing to the inherent modularity of a gene, the truncated daughter copy has retained at least part of the parental functionality, whereas the rest has been essentially ignored by purifying selection. Protein truncation in duplicates can also occur by way of a nonsense mutation resulting in a premature stop in translation: the G-type cyclin CCNG1, involved in the regulation of cell cycle kinases, is found to be missing an important “PEST” sequence at the C-terminus that is present in its paralog, CCNG259.

The authors cite as one such example of a possible frameshift the gene SLC25A37, a member of the mitochondrial solute carrier family. Indeed, an analysis reveals that SLC25A37 was created, as shown in Figure 2, as a result of a bulk deletion together with a single nucleotide insertion in a copy of the likely parent, SLC25A28—although the exact sequence of events cannot be determined. As a result of the frameshift, 54 novel characters were generated but 22 were also deleted, casting doubt on the biochemical importance of this resulting minisequence. This would suggest that despite the extent of nonsynonymous differences evident in SLC25A37, these are likely to have been the result of relaxed purifying selection rather than any beneficial increase in information.

Figure 2.

Divergence by way of a frameshifting event in SLC and FUT genes in Homo sapiens. The regions within each gene sequence affected by indels are shown above. In the case of the mitochondrial solute carrier gene, SLC25A28, a 16-nt deletion at the N-terminus has occurred in a duplicated copy of it. This alone would have truncated the gene into two separate reading frames: 1-252 and 252-1079. However, the insertion of adenine at nt position 48 suppresses any gene fission and restores the length of the original reading frame, giving rise to SLC25A37. In the FUT genes, a combination of deletions and at least one insertion in FUT3 and FUT6 caused a significant divergence in sequence from a common ancestor whose translated region would have resembled FUT5. In both cases, the reading frame is altered for a short region, involving the loss of many codons, before being reconstituted downstream, thus demonstrating a conservation of information.
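The frame restoration described in the caption follows from simple modular arithmetic, sketched below. This is an illustration only: the helper name is hypothetical, insertions are encoded as positive lengths and deletions as negative lengths, and the result describes the reading frame downstream of all listed indels (the stretch between them is still read out of frame).

```python
from typing import List


def net_frame_offset(indel_lengths: List[int]) -> int:
    """Net reading-frame offset (0, 1, or 2) downstream of a set of indels.

    Insertions are positive lengths, deletions negative. An offset of 0
    means the original reading frame is restored.
    """
    return sum(indel_lengths) % 3


# Values from the caption: a 16-nt deletion and a single-nucleotide insertion.
deletion_only = net_frame_offset([-16])
deletion_plus_insertion = net_frame_offset([-16, +1])
print(deletion_only, deletion_plus_insertion)
```

The deletion alone leaves a nonzero offset, whereas the deletion combined with the single inserted nucleotide removes a net 15 nt, a multiple of three, so the downstream frame is restored.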

It would be useful to test for the effect of natural selection in the 302 codons of the gene downstream of the frameshift, where the original reading frame has been restored. Accordingly, it was observed that 213 aligned residues were identical and that the ratio (ω) of nonsynonymous to synonymous base pair substitutions was greater than 1.0 (169:114). The ratio (κ) of transitions to transversions was roughly equal (135:148), as was the ratio of radical to conservative amino acid substitutions (49:40). Taken together, these values more likely reflect an overall structural realignment, possibly to offset the radical changes and deletions at the N-terminus, than any major functional shift.
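The counts quoted above can be checked with a few lines of arithmetic. This is a sketch of the raw count ratios only; a formal dN/dS (ω) estimate would additionally normalize by the numbers of nonsynonymous and synonymous sites, which the text does not report. The variable names are hypothetical.

```python
# Counts reported in the text for the 302 codons downstream of the frameshift.
ALIGNED_CODONS = 302
IDENTICAL_RESIDUES = 213

counts = {
    "nonsynonymous:synonymous": (169, 114),
    "transitions:transversions": (135, 148),
    "radical:conservative": (49, 40),
}

# Raw ratios of the paired counts (not per-site normalized).
ratios = {name: first / second for name, (first, second) in counts.items()}

# Internal consistency: both nucleotide-level pairs sum to the same total,
# and the amino acid replacements account for all non-identical residues.
nucleotide_totals = {sum(counts[k]) for k in list(counts)[:2]}
replaced_residues = ALIGNED_CODONS - IDENTICAL_RESIDUES

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
print(nucleotide_totals, replaced_residues, sum(counts["radical:conservative"]))
```

The raw ratios come to about 1.48, 0.91, and 1.2, respectively. Both nucleotide-level pairs sum to 283 substitutions, and the 89 amino acid replacements match the 302 aligned codons minus the 213 identical residues, so the reported figures are internally consistent.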

Evolutionary divergence by frameshift mutation, and several other mechanisms, has also taken place in the FUT gene family in humans60. All but one of the nine genes are monoexonic and all code for the enzyme—fucosyltransferase—that transfers fucose on the terminal residues of glycans, albeit on a different variety of substrates. FUT3 and FUT6 are believed to be the most expressed members within the family and share a >90% nucleotide identity, displaying no discernibly significant functional differences. Both have diverged from a common ancestor, quite possibly FUT5 itself, by way of a 40-bp deletion and resultant frameshift at the N-terminus—in much the same manner as the previous example. This is consistent with an inference for the relaxation of selective constraints and partial degeneration followed by a suppressing mutation.

Therefore, although frameshifts have the potential to cause more rapid sequence divergence than can individual point mutations, it is wrong to assume that they can produce any novel information even if they do result in the emergence of novel characters within proteins. A divergence in sequence need not result in a change in functionality or affect behavior, as the same information can be constructed using a number of different amino acid arrangements. In duplicates, and also singletons, changes may be compensatory and in response to prior degeneration rather than representing any innovation.

3.3. Gene Duplication and Retroposition: The Case of Jingwei and Adh

Another gene of interest to researchers of molecular evolution, found in Drosophila yakuba and D. teissieri, is Jingwei (jgw). Like Sdic, it is a chimeric gene, except that it was formed from the retropositioning (by reverse transcription) of one gene into the duplicated copy of another61. This constitutes a type of ectopic recombination, otherwise known as exon shuffling. The first three exons are considered to be derived from a duplicated copy (ynd) of a gene that is expressed uniquely in the testes (ymp). Therefore, the N-terminal domain of ynd has donated the non-Adh portion of jgw and this appears to be well preserved by purifying selection, indicative of the retention of functionality and also of the modular structure and organization of the gene62.

Adh is an alcohol dehydrogenase that occurs in many organisms and facilitates the interconversion between alcohols and aldehydes. The retrosequence of the gene was copied and inserted into the third intron of ynd, and nine downstream exons of ynd became pseudoexons because transcription stopped at the terminating signal encoded in the Adh-derived exon. Initially, this led researchers to believe that Jingwei was nothing other than a pseudogene, and its exact function is still unknown. As such, the first 68 codons of the translated region are derived from the ymp/ynd gene, whereas the remaining 255 are derived from the original 272 codons of the translated region in the ancestral Adh gene. Betrán63 and others speculate that the number of nonsynonymous changes should be regarded as evidence for rapid adaptive evolution. Indeed, only 92 of the original 272 residues remain (almost the minimum proportion to identify a homology), whereas the ratio (ω) of nonsynonymous to synonymous changes is astonishing (332:58). The ratios of transitions to transversions (κ; 154:236) and of radical to conservative amino acid replacements (113:49) are also indicative of a substantial functional shift. But is that really what has happened? Is there another explanation to account for this?

Clearly, selective constraints have been relaxed as nine of the exons of the ynd gene were silenced by the initial act of retroposition, whereas the C-terminus of the intronless Adh retrosequence itself has been truncated by a frameshift with the resultant loss of 15 codons, for which a single nucleotide insertion would appear to be responsible. This loss of information is to be expected in a model of relaxed selection. The distribution of nonsynonymous changes in the Adh part of Jingwei is also found to be relatively uniform and not clustered in one particular region—the active site of Adh is indeed well preserved. This is either suggestive of widespread directional selection or, rather, random degeneration and destabilization: i.e., a failure to preserve the integrity of the sequence. However, the actual situation is likely to be more nuanced than either scenario would suggest. The introduction of deleterious changes could also set off a process and chain reaction of ensuing compensatory mutations observed in other genes64, 65—a proclivity toward physical stability being inherent in the nature of all proteins. Compensatory mutations would thus make up for any suboptimal or potentially damaging amino acid replacements elsewhere in the sequence, as opposed to back mutations that simply restore the ancestral residue. In this way, evolutionary divergence need not result in any change in the information content and functionality even if the resultant peptide sequence is substantially altered. Moreover, the effect of compensation following partial degeneration would be indistinguishable from any functional innovation because both are beneficial.

In vitro experiments, using a bacterial host species, appear to show that jingwei is a dehydrogenase dimer that catalyzes like Adh but with altered and diversified substrate binding activity and utilization66, 67. This is congruent with other research into the evolution of duplicates, such as within the xanthine dehydrogenase family68. One possibility to account for this may be that the gene product folds abnormally and so has lost functional specificity. In any case, as with all chimeric genes, jingwei has retained the core functionality of one or both of its parents but with a reduced pattern of expression.

Retroposed Adh mRNA features in two other chimerical genes, Adh-Twain and Adh-Finnegan, where it has been inserted in different species of Drosophila. Interestingly, 230 of the 255 residues contained in the corresponding Adh sequences are identical in Jingwei and both Adh-Twain and Adh-Finnegan. Begun and Jones69 suggest that some sort of convergent adaptation could be at work, but that seems unlikely given that these genes have markedly different patterns of expression70; it is perhaps more reasonable to infer that the Adh part has undergone the same level of initially relaxed selection followed by reparative compensation. The observed incidence of parallel evolution, as can be seen in Figure 3, something found to be relatively widespread in genetics71, might be because of a common mutational susceptibility, for which the initial loss of introns associated with the Adh part72 and the need for priority readjustments may be a factor. Indeed, research tends to suggest that the presence of introns does have a significant effect on mRNA stability73. It is interesting that Begun and Jones infer a burst of evolutionary activity in the early stages but a noticeable slowing down later on. This is consistent with a model of initially relaxed selection in a population increasing in size following a bottleneck. The probability of fixation in a diploid population is 1/(2N) for neutral alleles having no selective (dis)advantage, and so fixation is more likely to occur in a smaller population.
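The dependence of neutral fixation on population size can be made concrete with a short Python sketch. It evaluates only the 1/(2N) expression stated above for a single new neutral allele in a diploid population of N individuals; the function name and the example population sizes are illustrative assumptions, not values taken from the studies cited.

```python
def neutral_fixation_probability(population_size: int) -> float:
    """Fixation probability of one new neutral allele in a diploid
    population of N individuals, i.e. 1 / (2N)."""
    if population_size <= 0:
        raise ValueError("population_size must be a positive integer")
    return 1.0 / (2 * population_size)


# Hypothetical population sizes, e.g. during and after a bottleneck.
for n in (50, 500, 5000):
    print(f"N = {n:>5}: P(fixation) = {neutral_fixation_probability(n):.5f}")
```

Each tenfold increase in N lowers the probability tenfold, which is why a burst of neutral fixations would be expected soon after a bottleneck and a slowdown as the population recovers.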

Figure 3.

A comparison of retroposed Adh in Jingwei (Jgw), Adh-Twain (Atw), and Adh-Finnegan (Afg) in different species of Drosophila. The ancestral sequence (Adh) is also given. Adh has evolved in a parallel fashion, where it has been inserted into a host gene by retroposition. A frameshift at the C-terminus has truncated the protein in Jgw and the others. Clearly, the insertion of the retroduplicated gene has resulted in a very similar pattern of molecular evolution within these separate species. This would suggest a mutational convergence that does not involve adaptation other than compensation.

3.4. Classical Duplication and Divergence

Perhaps the best example of how duplication and classical evolutionary divergence can facilitate ecological adaptation is the unique case of concerted evolution in colobine monkeys. The animals have adjusted to a predominantly leaf-eating diet by evolving a variant pancreatic ribonuclease (pRNase) recruited to perform a particular role as a digestive enzyme in foregut fermenters74. The data suggest that two pRNase paralogs (1A and 1B), both 156 residues long, have been selected for in the colobine monkeys, with one adapting to its role through the loss of positive charge, namely arginine residues. In Colobus polykomos, the number of acidic residues in this gene product has increased from 13 to 15, whereas the number of basic residues has decreased from 20 to 17. A test for selection revealed evidence in support of a partial gain in function. A total of 15 bp substitutions were identified, 13 of which replaced the ancestral amino acid. However, the ratio of transitions to transversions was as expected from neutral evolution (11:4), and only five of the residue changes were radical ones.
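The loss of positive charge described above can be checked with simple counting. The sketch below treats every basic residue as +1 and every acidic residue as -1, which is a deliberate simplification: it ignores pH, the chain termini, and the partial protonation of histidine, and the function name is an illustrative assumption.

```python
def net_charge(basic_residues: int, acidic_residues: int) -> int:
    """Crude net charge estimate: +1 per basic residue, -1 per acidic residue."""
    return basic_residues - acidic_residues


# Residue counts quoted in the text for Colobus polykomos.
ancestral_charge = net_charge(basic_residues=20, acidic_residues=13)
derived_charge = net_charge(basic_residues=17, acidic_residues=15)

print("ancestral net charge:", ancestral_charge)
print("derived net charge:  ", derived_charge)
print("change:              ", derived_charge - ancestral_charge)
```

Under these assumptions the protein remains positively charged overall but considerably less so, which fits the description of a modified rather than a novel enzyme.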

Therefore, this classical model of gene duplication, mutation, and natural selection would appear to demonstrate how evolutionary processes can modify and optimize existing information to meet new environmental pressures. However, this also shows how evolutionary divergence is limited and results in closely related and not entirely novel functions. This may also be true for the nuclear receptor family, which comprises ligand-mediated regulators of gene expression. It is inferred that “molecular tinkering,” entailing modifications in ligand specificity due to subtle changes in the ligand pocket, where the signaling compound binds, led to associations in various duplicate members with other hormones and signals75.

Another interesting case of classical divergence within a gene family concerns the tetrameric oxygen-binding protein, hemoglobin, found in the red blood cells of vertebrates. Five variants of hemoglobin exist at the β-globin locus cluster in both humans and chimpanzees, all under the control of a single regulatory region76; each member is differentially expressed throughout the development of the organism: Epsilon (HBE), for example, is normally expressed only in the embryonic yolk sac. It is precisely for this reason that gene duplication may have been involved in dividing and specializing the original functions of a gene among different paralogs, as the organism cannot exactly wait for the gene pertinent to the next developmental stage of oxygen metabolism to evolve. The five genes present at the locus (including two HBG variants) are highly similar in sequence and could be the functional equivalents of alternatively spliced isoforms of the original gene.

Indeed, there are reasonable grounds to suppose that gene duplication and mutation may be functionally comparable with the action of alternative splicing in general77. For example, in certain species of the genus Drosophila, an ancestral sex-biased gene, JanusA, uses alternative splicing to encode two slightly different proteins, one present in multiple tissues of both sexes and the other present only in sperm. Duplication of JanusA created JanusB, which then specialized to encode a sperm-specific protein very similar to the function of the former spliced variant78. Therefore, in this situation, no new information was produced.

Subfunctionalization, whereby the information content of a parent gene is differentially partitioned amongst its daughters, is believed to be a common occurrence among surviving duplicates79. Here, duplication allows the original functionality of a gene to be spread across more stretches of DNA, while conserving the basic information content contained in the ancestral sequence. Subfunctionalization constitutes a loss in functional redundancy, due to the combination of both complementary degeneration and stabilizing selection, and helps explain why knocking out certain paralogs can have a harmful effect. However, the benefit is that a degree of functional specialization can be achieved, which can bring gains in efficiency in certain circumstances. In baker's yeast, Saccharomyces cerevisiae, two galactose regulatory genes (GAL1 and GAL3) are believed to have evolved from a single bifunctional gene in an ancestral species, resulting in greater flexibility80.

3.5. Duplication and Intragenic Amplification: The Case of an AFGP in Notothenioids

All of the examples above involve evolution within the existing kind as opposed to any divergence that would lead to the emergence of a new type of gene. The first clear attempt at explaining how an old protein gene could spawn a new gene coding for an entirely new protein, and with a distinctly different function, is the case of a trypsinogen to antifreeze glycoprotein (AFGP) conversion in the notothenioid species, Dissostichus mawsoni81. The ice-binding AFGP that circulates in the blood of the Antarctic fish enables them to avoid freezing in their perpetually icy environment. This crucial survival protein is believed to have evolved from a pancreatic trypsinogen-like protease—a digestive enzyme. Indeed, both proteins are observed to be biosynthesized and secreted in the pancreas82, and this is reflected in the shared regulatory features found in the UTR and signal peptide. The AFGP is characterized by repeats of two 3-residue components: TAA and TPA. These comprise about 60% of the 362-residue protein, Dm3l, one member of the AFGP family. The reasons given for the possible origin of the AFGP from a protease ancestor are:

  • (i) Exon 1 (containing the secretory signal and 5′ UTR) in both AFGP and trypsinogen genes is almost identical, as is the 3′ UTR of both genes.
  • (ii) The sequence of intron 1 of the trypsinogen gene is included as two parts within intron 1 of the AFGP gene.
  • (iii) A 9-nt element in the trypsinogen gene, acagcggca (TAA), that straddles intron 1 and exon 2 comprises the main repeating unit of the AFGP gene.
  • (iv) The topological proximity of both genes on the same chromosome indicates the likelihood of tandem duplication.
  • (v) The discovery of a chimeric AFGP-protease gene (Dm7m) that may be intermediate83.

Cheng et al. speculate that the ancestral protease gene was converted into the AFGP through a process that involved four major steps: a bulk deletion, intronic (de novo) recruitment, repeated internal amplification, and finally illegitimate recombination. However, this proposed mechanism is unlikely to have occurred for the following reasons:

  • (i) The authors readily acknowledge that the bulk deletion of four exons and four introns is not likely to be tolerated even in a redundant duplicate, as it results in an entirely nonfunctional copy. This would make it liable to complete disintegration by null mutations, rather than to an apparently miraculous reincarnation as an entirely new gene.
  • (ii) The AFGP promoter elements at the 5′ flanking sequence upstream of exon 1 are believed to be different from those found in the trypsinogen gene. The two proteins are produced in different amounts and are also expressed in a different manner. The proper function and behavior of the glycoprotein depends on changes made or added to the promoter sequences.
  • (iii) Intron 1 in Dm3l is 1908 bp long, whereas the corresponding intron is 238 bp in the trypsinogen gene. There is no explanation provided for this eightfold difference in size, or for the information in the additional intronic sequence, other than a huge insertion (e.g., a retrotransposon, for which there is no trace) or a case of repeated intronic amplification.
  • (iv) The authors propose, implausibly, that the repeating TAA and TPA elements, hardly a unique sequence, could have been produced by successive polymerase replication slippage or unequal intragenic recombination dozens of times over. However, this process is both indiscriminate and inefficient84, and there is no reason to suppose it would selectively and exactly repeat the 9-nt elements, with no resultant frameshift causing a premature termination. The positioning of the proline residues is important as far as protein stability and folding is concerned85.
  • (v) There is no origin given for the inclusion of the important spacer sequence elements, LIF/LNF/FNF/LNL86, other than an unsubstantiated and unfalsifiable claim that they could have been introduced through an as yet unspecified “recombinatory event.” There is also a nonhomogeneous pattern of repetition observed that is not exactly what one would expect from successive amplification.
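The reading-frame point raised in item (iv) can be illustrated with a minimal Python sketch. It translates the 9-nt element acagcggca with the standard genetic code, then compares exact tandem copies of the element with a copy that carries one extra nucleotide. The extra nucleotide and its position are hypothetical and chosen only to show the arithmetic; nothing here models the actual slippage mechanism or its likelihood.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(dna: str) -> str:
    """Translate DNA in frame 0; stop codons appear as '*'.
    Trailing nucleotides that do not fill a codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


UNIT = "acagcggca"  # 9-nt element quoted in the text

# Note: "TAA" below is the peptide Thr-Ala-Ala, not the stop codon TAA.
unit_peptide = translate(UNIT)

# Exact tandem copies keep the frame because 9 is a multiple of 3.
exact_repeat = translate(UNIT * 3)

# Hypothetical imprecise copy: one extra 'c' between two units shifts the frame.
shifted_repeat = translate(UNIT + "c" + UNIT)

print(unit_peptide, exact_repeat, shifted_repeat)
```

Only insertions or deletions whose length is a multiple of three leave the downstream repeats in frame, which is the constraint the argument in item (iv) turns on.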

A key problem associated with the Darwinian mechanism of evolution is that many of the putative incipient and intermediate stages in the development of a biological trait may not be useful themselves and may even be harmful. This is exactly the problem with Cheng's proposed conversion. The incipient stage consists of a bulk deletion that would almost certainly be selected against, despite it being in a gene copy, as the cistron's core information, and any useful functional redundancy it may have offered, would have been entirely lost. The resultant protein would be liable to misfold anyway. It is also extremely doubtful that the initial intronic recruitment and its subsequent amplification would have been in any way functional, as far as binding to ice crystals or glycosylation is concerned, or would have had any exaptive utility. The hypothesized metamorphosis would have required widespread and related changes that must have been coordinated and synchronized, and so would represent something to the effect of a directional saltation. However, this is not something that a blind, unsupervised process can achieve. It is, however, plausible to suggest that the commonality shared between both genes at their respective termini is indicative of the possibility, at least, that the glycoprotein was derived from an ancestral protease template.

Moreover, the antifreeze proteins that have been found in Arctic cod87 are completely different in sequence and organization from their Antarctic cousins; this means that the same trypsinogen-like gene could not have been the ancestral gene in this case. Although this is passed off as evidence of “convergent evolution,” it serves only to raise another problem: how a gene believed to be of a more recent origin could have evolved.

3.6. De Novo Recruitment Without Duplication

Although duplication is central to the modern evolutionary synthesis, in recent years the possibility has been raised that previously extragenic, noncoding regions of DNA could be recruited wholesale to become translated as functioning proteins, as opposed to just the minor exonization observed in the formation of the amino end of the Sdic gene. This represents a return to the idea of the hopeful monster88 at the molecular level. For example, such origination has been proposed in the case of the yeast gene BSC489 (of unknown function); and the human upregulated gene CLLU190 that is believed to have some role in the pathogenesis of chronic lymphocytic leukemia and shares structural motifs with the cytokine, IL-4, that is used in the immune system91. In the case of CLLU1, a single nucleotide deletion of adenine in a stretch of DNA orthologous with chimpanzees has created a frameshift and expanded ORF, large enough to be fully functional when translated as a protein. However, this inference may be incorrect. Rather than the deletion creating a new stretch of translated DNA, it is likely that a back mutation restored the original ORF that became essentially divided in two as a result of an insertion, a very common phenomenon observed in indel-induced frameshifts92. Thus, far from being a case of bulk de novo recruitment of ncDNA, CLLU1 in humans is a gene that may have been fully reactivated while still inactive in other primate lineages. The corresponding gene in chimpanzees, if transcribed and regulated, may still be partially functional as two potential 42-codon reading frames are preserved at either terminus. Thus, the de novo and fortuitous origination of entire reading frames may be a profound misinterpretation of cases of pseudogenes being reactivated.
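The two readings of the CLLU1 data differ only in the direction of the single-nucleotide change, which a toy example can make explicit. In the Python sketch below, the sequence is entirely hypothetical (it is not the CLLU1 sequence) and the function name is an illustrative assumption; the code simply counts in-frame codons from the first ATG to the first stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length_in_codons(dna: str) -> int:
    """Count codons from the first ATG up to (not including) the first
    in-frame stop codon, or to the end of the sequence if none is found."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return 0
    count = 0
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


# Hypothetical intact reading frame: ATG GCT AAG CTG GAA AAT TAA
intact = "ATGGCTAAGCTGGAAAATTAA"

# A one-nucleotide insertion creates a premature in-frame stop (ATG GCC TAA ...).
with_insertion = intact[:4] + "C" + intact[4:]

# Deleting that same nucleotide restores the original frame.
restored = with_insertion[:4] + with_insertion[5:]

print(orf_length_in_codons(intact))
print(orf_length_in_codons(with_insertion))
print(orf_length_in_codons(restored))
```

Seen from the truncated sequence alone, the restoring deletion looks like the sudden appearance of a longer ORF, which is why sequence comparison by itself may not distinguish de novo origination from reactivation.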

Alternatively, functional sections of noncoding DNA, or perhaps even “dormant” reading frames, have become translated into proteins that perform a particular task. There is indeed evidence for the existence of ORFs within introns93 and other regions of noncoding DNA94 that may be the result of transposition events. However, another possibility is that instead “junk” sequences of ncDNA are accidentally transcribed and translated into nonfunctional products that are fixed by neutral evolution, and which serve no purpose, other than perhaps being assigned to the cell's garbage collection and recycling system. In any case, as a mechanism for the creation of novel motifs and protein domains, de novo recruitment of noncoding DNA would seem extremely improbable and implausible.


Gene duplication and subsequent evolutionary divergence certainly add to the size of the genome and in large measure to its diversity and versatility. However, in all of the examples given above, known evolutionary mechanisms were markedly constrained in their ability to innovate and to create any novel information. This natural limit to biological change can be attributed mostly to the power of purifying selection, which, despite being relaxed in duplicates, is nonetheless ever-present. The reason for this stabilization of function is not obvious, although the role of duplicates in compensating for deleterious loss-of-function mutations at paralogous sites may be an important factor. Likewise, ancestral functions are preserved through a differential division of labor among duplicates, namely subfunctionalization. Moreover, neither the possibility of nor the opportunity for beneficial changes leading to major functional innovations was found to be especially convincing. For example, duplicate enzyme-coding genes tend to retain the same ancestral catalytic activity and simply apply that function to different substrates, often by partial degradation of function and the loss of the precise specificity of the parent. However, these may prove to have an important adaptive value in response to environmental challenges such as temperature, drought, pathogens, and UV radiation.

Where substantive sequence evolution had occurred, it could have been because a respite in selective constraints led to significant degeneration. In the case of Sdic and Jingwei, both genes evolved from duplicates affected by significant deletions or the silencing of exonic information and were then co-opted for use in a different context. This development has likely been misinterpreted in many cases as evidence of a gain in information under positive Darwinian selection, especially when extensive compensatory changes are involved that can amplify sequence divergence in the process. In this sense, a proclivity toward functional stability and the conservation of information, as opposed to any adventurous innovation, predominates.

The various postduplication mechanisms entailing random mutations and recombinations considered were observed to tweak, tinker, copy, cut, divide, and shuffle existing genetic information around, but fell short of generating genuinely distinct and entirely novel functionality. Contrary to Darwin's view of the plasticity of biological features, successive modification and selection in genes does indeed appear to have real and inherent limits: it can serve to alter the sequence, size, and function of a gene to an extent, but this almost always amounts to a variation on the same theme, as with RNASE1B in colobine monkeys. The conservation of all-important motifs within gene families, such as the homeobox or the MADS-box motif, attests to the fact that gene duplication results in the copying and preservation of biological information, and not its transformation into something original.

The case of evolution in notothenioid fish, entailing the speculative conversion of a protease duplicate into an AFGP, only serves to demonstrate the huge problem of supposing that cumulative random changes would contrive to produce novel information, especially if major deletions and other degenerative mutations were involved.

Although the focus here has been on the information within exons that code for the amino acid sequences in proteins, noncoding DNA—which comprises the vast majority of the molecule—also contains information necessary for the regulation and expression of gene products. Changes in these regions can have a profound effect on an organism's evolution. But, although important, without a repertoire of proteins with which to regulate, this is ancillary in effect. For example, it is impossible for an organism to develop vision without the exons coding for light-sensitive opsins or feathers for flight without the presence of keratins in the skin.

Gradual natural selection is no doubt important in biological adaptation and for ensuring the robustness of the genome in the face of constantly changing environmental pressures. However, its potential for innovation is greatly inadequate as far as explaining the origination of the distinct exonic sequences that contribute to the complexity of the organism and diversity of life. Any alternative/revision to Neo-Darwinism95 has to consider the holistic nature and organization of information encoded in genes, which specify the interdependent and complex biochemical motifs that allow protein molecules to fold properly and function effectively.