• Open Access

Menzerath's law at the gene-exon level in the human genome

Authors

  • Wentian Li

    Corresponding author
    1. The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, 350 Community Drive, New York 11030

Abstract

A previous discussion of a linguistic law called Menzerath's law (the longer a word, the shorter the syllables) in the genomic context focused on the genome-chromosome-base level (the more chromosomes in a genome, the smaller the chromosome size). We apply this linguistic metaphor to the more appropriate levels of gene, exon, and base. Using human gene data, we found that Menzerath's law holds at these levels: the more exons in a gene, the shorter the averaged exon size. Since this negative correlation could be a trivial consequence of a constant size of the messenger RNA coded by the gene, we exclude this possibility by showing that messenger RNA size increases with the number of exons. This increase of messenger RNA size is, however, not fast enough for genes with a large number of exons to maintain a constant exon size. © 2011 Wiley Periodicals, Inc. Complexity, 2011.

INTRODUCTION

In any complex system, there exist multilevel structures. Menzerath's law [1] describes a statistical relationship between the numerical measurements of structures at two different levels. Although it is often stated as a two-level relationship, a third level is also required, which can be the lowest possible level with the “terminal elements” [2]. If a higher level-1 object contains x lower level-2 objects, and on average each level-2 object contains y of the lowest level-3 objects, Menzerath's law proposes a negative correlation between x and y [1]; in the original formulation, the three levels are words, syllables, and phonemes. A more specific mathematical relationship between x and y was suggested in [3], but we focus mainly on the fact of negative correlation rather than on fitting the data with a particular formula.

It was suggested in an interesting article [4] that Menzerath's law applies to genomes at the genome-chromosome level (the level-3 object is the nucleotide base). If x is the number of chromosomes in a genome and y is the averaged chromosome size in the unit of nucleotide bases, it was found in [4] that y is negatively correlated with x. However, this work was criticized by [5] on two grounds. First, as x · y is simply the genome size measured in bases, an inverse relationship between x and y is a trivial consequence of the genome size of different species being a constant. Secondly, the choice of the two genomic units in [4], i.e., genome and chromosome, does not seem to have a strong biological meaning. Genome size can increase by incorporating repetitive sequences, and variations in chromosome number can be caused by fusion, fission, or other chromosome structural dynamic processes (some aspects of genome size and genome karyotype evolution are discussed in [6]). Consequently, applying a linguistic metaphor to these units is less convincing.

Here, we would like to examine a different choice of genomic units in the human genome: gene, exon, and nucleotide base. Figure 1 shows a generic structure of a gene, which consists of multiple exons and introns. Exon sequences are retained in the mature messenger RNA (mRNA), whereas intron sequences are transcribed but then removed by splicing. mRNA is further translated to a protein sequence, which then folds into a three-dimensional (3D) structure. The function of a protein depends very much on the features of the 3D structure, which in turn depend on the mRNA sequence and its constituent exons. Exons are not only units in DNA sequences; they also often match the domains or subunits of protein structure [7].

Figure 1.

A schematic illustration of gene and exons. Typically, a human gene consists of several exons (e.g., 6 here) interrupted by introns (e.g., 5 here). After transcription, intron sequences are spliced out, and only the exons remain in the mature messenger RNA (mRNA). mRNA is further translated to a protein sequence, which folds into a unique three-dimensional structure. Many details are not shown in this figure, such as the chromosome pairs in diploid genomes, the double-stranded nature of DNA sequences, the single-stranded nature of RNA sequences, 5′ and 3′ untranslated regions, alternative splicing, and the three-dimensional structure of chromatin.

Further evidence that exons are independent units is exon skipping [8, 9] (one form of alternative splicing [10]), where an exon is skipped during splicing as if it were part of an intron. Exon skipping, as well as other mechanisms of alternative splicing, greatly increases the number of possible protein sequences through the combinatorial options in choosing a subset of exons, and plays an important role in development and evolution [11, 12].

In the gene-exon-base tiers, if a negative correlation is indeed found between x and y, we would then like to check the trivial situation of a constant x · y value, which is the mRNA size in the unit of bases. If all human genes have the same mRNA size, then we should expect a simple y ∼ 1/x relationship. Even if x · y is not a constant, we would like to investigate the source of the negative correlation.

Data on human genes were obtained from the refGene.txt file downloaded from http://hgdownload.cse.ucsc.edu/downloads.html for the Feb. 2009 assembly (hg19), or equivalently, Genome Reference Consortium Human Reference 37 (GRCh37). A preprocessing step discarded refGene entries in any of the following categories: (1) mapped to alternative haplotypes or unaligned sequences; (2) on the X and Y chromosomes; (3) non-protein-coding genes such as RNA genes or pseudogenes; (4) incomplete entries that do not have the correct start or stop codon, or whose coding sequence length is not a multiple of 3; and (5) alternatively spliced forms of an included gene. After this preprocessing, 18153 human genes remain.
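The per-gene quantities used in the analysis (number of exons, mRNA size, and averaged exon size) can be derived from one refGene.txt record as in the following minimal sketch. It assumes the tab-separated UCSC genePred layout with a leading bin column (accession in column 1, chromosome in column 2, exonCount in column 8, exonStarts and exonEnds in columns 9 and 10, gene symbol in column 12). The names `Gene`, `parse_refgene_line`, and `passes_basic_filters` are hypothetical, and the filter is only a rough proxy for categories (1) to (3) above; categories (4) and (5) are not implemented here.

```python
from dataclasses import dataclass

# Autosomes only: drops chrX, chrY, alternative haplotypes, and unaligned contigs,
# whose names in hg19 do not match chr1..chr22.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


@dataclass
class Gene:
    accession: str   # RefSeq accession, e.g. NM_... (assumed column 1)
    symbol: str      # gene symbol (assumed column 12, "name2")
    chrom: str
    n_exon: int      # number of exons, x in the text
    mrna_size: int   # summed exon length in bases

    @property
    def mean_exon_size(self) -> float:
        """Averaged exon size, y in the text."""
        return self.mrna_size / self.n_exon


def parse_refgene_line(line: str) -> Gene:
    """Parse one tab-separated record (assumed genePred layout with bin column)."""
    f = line.rstrip("\n").split("\t")
    starts = [int(s) for s in f[9].split(",") if s]
    ends = [int(e) for e in f[10].split(",") if e]
    if not (len(starts) == len(ends) == int(f[8])):
        raise ValueError(f"inconsistent exon fields for {f[1]}")
    mrna_size = sum(e - s for s, e in zip(starts, ends))
    return Gene(accession=f[1], symbol=f[12], chrom=f[2],
                n_exon=len(starts), mrna_size=mrna_size)


def passes_basic_filters(gene: Gene) -> bool:
    """Rough proxy for categories (1)-(3): autosomal, protein-coding (NM_) entries."""
    return gene.chrom in AUTOSOMES and gene.accession.startswith("NM_")


def load_refgene(path: str) -> list:
    """Read a refGene.txt-style file and keep entries passing the basic filters."""
    with open(path) as handle:
        genes = [parse_refgene_line(line) for line in handle if line.strip()]
    return [g for g in genes if passes_basic_filters(g)]
```

A full reproduction would additionally need the start/stop codon checks and the selection of one transcript per gene, which require the coding-region columns and the genome sequence.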

Figure 2 shows the scatter plot of the number of exons of a gene (x) vs. the averaged exon length (in the unit of bases) of that gene (y). Four different versions of the same data are shown in Figure 2, depending on whether a logarithmic scale is imposed on the x and/or y axes. The apparent negative trend between the two measurements is confirmed by the nonparametric Spearman rank correlation coefficient of ρ = −0.61 (P-value < 2.2 × 10⁻¹²). Linear correlation between the two measurements is also calculated: Pearson correlation coefficients are equal to −0.35, −0.57, −0.47, and −0.65, respectively, for the four plots in Figure 2.
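The two kinds of correlation reported above can be illustrated with a short, dependency-free sketch. It computes the Pearson coefficient under the four axis scalings and the Spearman coefficient as the Pearson coefficient of average ranks. The function names are hypothetical and the choice of base-10 logarithms is an assumption (the Pearson coefficient does not depend on the base); this is not necessarily the software used for the published values, and no P-value is computed.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def pearson_four_scalings(xs, ys):
    """Pearson coefficient for the four axis choices (x, y must be positive)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    return {
        "linear-linear": pearson(xs, ys),
        "log(x)-linear": pearson(lx, ys),
        "linear-log(y)": pearson(xs, ly),
        "log-log": pearson(lx, ly),
    }
```

Because the Spearman coefficient depends only on ranks, it is identical for all four scalings, which is why a single ρ is quoted while the Pearson coefficient takes four different values.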

Figure 2.

The scatter plots of the number of exons per human gene (x, $n_{\mathrm{exon}}$) vs. the averaged exon size of that gene (y, $L_{\mathrm{exon}}$, in the unit of bases). The four versions of the same result are: linear–linear, log(x)-linear, linear-log(y), and log–log. The Pearson correlation coefficients (cc's) and Spearman rank correlation coefficient (ρ) are shown. In the log–log plot, means, medians, and geometric means of genes with the same number of exons (x = 1, 2, …, 50) are marked.

The averaged exon size (averaged over all exons in a gene) is further averaged over all genes with a fixed number of exons (x = 1, 2, 3, …, 50). Three versions of averaging are used: arithmetic mean, median, and geometric mean. Though not perfect, a roughly linear trend can be seen in the log–log plot of Figure 2 between the average of averaged exon sizes and the number of exons. If a linear regression is fitted in the log–log scale, the slope is around −0.5.
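The grouping and log–log regression described above can be sketched as follows. The code groups genes by exon count, computes the three averages per group, and fits an ordinary least-squares line to the logarithms. The function names, the use of unweighted least squares, and the cutoff parameter `max_exons` are assumptions made for illustration; the original fitting procedure is not specified in the text.

```python
import math
from collections import defaultdict
from statistics import mean, median


def geometric_mean(values):
    """Geometric mean of positive values, computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def binned_averages(n_exons, sizes, max_exons=50):
    """Average a per-gene size over all genes sharing the same exon count."""
    groups = defaultdict(list)
    for n, s in zip(n_exons, sizes):
        if 1 <= n <= max_exons:
            groups[n].append(s)
    return {
        n: {
            "mean": mean(v),
            "median": median(v),
            "geometric_mean": geometric_mean(v),
        }
        for n, v in sorted(groups.items())
    }


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


def slope_of_binned(n_exons, sizes, which="geometric_mean", max_exons=50):
    """Slope of the chosen binned average versus exon count in log-log scale."""
    binned = binned_averages(n_exons, sizes, max_exons)
    xs = list(binned)
    ys = [binned[n][which] for n in xs]
    return loglog_slope(xs, ys)
```

The same functions apply to any per-gene size, so passing mRNA sizes instead of averaged exon sizes gives the corresponding slope for the mRNA trend.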

To check the possibility that all human genes are transcribed to mRNAs of more or less the same length, so that the average exon size is inversely proportional to the number of exons, we plot the mRNA size (in the unit of bases) as a function of the number of exons in Figure 3 (in log–log scale). Figure 3 shows that the mRNA size does not remain constant; it actually increases with the number of exons. When we plot the average (arithmetic mean, median, and geometric mean) of mRNA size over genes with the same number of exons (x = 1, 2, …, 50), a roughly linear trend can be found in the log–log scale with slope ∼0.5 in Figure 3.

Figure 3.

The scatter plot of the number of exons per human gene (x, $n_{\mathrm{exon}}$) vs. the size of the transcribed mRNA of that gene (y, $L_{\mathrm{RNA}} = L_{\mathrm{exon}} \cdot n_{\mathrm{exon}}$).

The negative correlation between the average exon size and the number of exons can be understood by this simple argument: the mRNA size $L_{\mathrm{RNA}}$ increases with the number of exons $n_{\mathrm{exon}}$ roughly as $L_{\mathrm{RNA}} \sim n_{\mathrm{exon}}^{0.5}$; the averaged exon size is then $L_{\mathrm{exon}} = L_{\mathrm{RNA}}/n_{\mathrm{exon}} \sim n_{\mathrm{exon}}^{0.5}/n_{\mathrm{exon}} \sim n_{\mathrm{exon}}^{-0.5}$, which decreases with $n_{\mathrm{exon}}$. That mRNA size is not constant across human genes can be checked directly: Figure 4 shows the histogram of logarithmic-transformed mRNA sizes (and that of gene size on genomic DNA, as measured by either start-codon to stop-codon length or transcription-start to transcription-stop length). The peak between 930 and 940 bases is mainly due to the large number of single-exon genes in the olfactory receptor gene family. Although the normal-like distribution in Figure 4 may indicate that mRNA size is limited, its constraint is imposed on the logarithmic scale.
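The scaling argument can be written for a general exponent, which makes the trivial and nontrivial cases explicit. The symbol $\alpha$ is introduced here for illustration only and does not appear in the original text; the value $\alpha \approx 0.5$ is the approximate log–log slope reported for Figure 3.

```latex
\begin{align}
L_{\mathrm{RNA}} &\sim n_{\mathrm{exon}}^{\alpha}, \\
L_{\mathrm{exon}} &= \frac{L_{\mathrm{RNA}}}{n_{\mathrm{exon}}}
                   \sim n_{\mathrm{exon}}^{\alpha - 1}.
\end{align}
% alpha = 0   : constant mRNA size, the trivial case L_exon ~ 1/n_exon
% alpha = 1   : mRNA size proportional to n_exon, constant exon size
% alpha = 0.5 : observed trend, L_exon ~ n_exon^{-0.5}
```

Any $\alpha < 1$ yields an averaged exon size that decreases with the number of exons, so the negative correlation requires only a less-than-linear growth of mRNA size, not a constant mRNA size.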

Figure 4.

Histograms of logarithmic-transformed mRNA size (intron excluded) and log-transformed gene sizes (intron included). Two versions of gene sizes are used: one from the translation-start to translation-stop, another from the transcription-start to transcription-stop.

Applying linguistic and information-theoretic metaphors to biology is not new [13]. Many standard terms in molecular biology were borrowed from these fields (e.g., genetic code, transcription, translation). But one always has to remember that a metaphor is not the real thing. Menzerath's law in quantitative linguistics describes the relationship between word length (in the unit of syllables) and syllable length (in the unit of phonemes), both of which vary within a very narrow range (e.g., 2–3). In the genomic context, our exon number ranges from 1 to more than a hundred, and the averaged exon size ranges from 50 to 10,000 bases; both ranges are much wider than those in the original linguistic context.

The difference between our choice of genomic units and those in [4] has another consequence. In order to collect samples with different genome sizes, the data in [4] are an assemblage across species. In our case, however, all genes are from one species, Homo sapiens. Returning to the original linguistic context, our data are more similar to a word-syllable-phoneme analysis within a single language, as opposed to one across different languages.

In summary, using the three-level genomic units of gene, exon, and base, we confirm a negative correlation between the number of exons and the averaged exon size, analogous to Menzerath's law in quantitative linguistics on the relation between the number of syllables in a word and the syllable size in the unit of phonemes. Although this negative correlation is not a trivial consequence of a constant messenger RNA size, it can be understood as the result of a less-than-linear increase of messenger RNA size with the number of exons. Due to this less-than-linear trend, the normalized messenger RNA size, i.e., the averaged exon size, decreases with the number of exons.

Acknowledgements

The author thanks the Robert S. Boas Center for Genomics and Human Genetics for support, and Jan Freudenberg and Luis Rodriguez for discussions.
