Complex systems typically have structures at multiple levels. Menzerath's law describes a statistical relationship between numerical measurements of structures at two different levels. Although it is often stated as a two-level relationship, a third level is also required, which can be the lowest possible level, that of the “terminal elements”. If a higher level-1 object contains x lower level-2 objects, and the level-2 objects contain on average y lowest level-3 objects, Menzerath's law proposes a negative correlation between x and y; in Menzerath's original setting, the three levels were words, syllables, and phonemes. A more specific mathematical relationship between x and y has also been suggested, but we focus mainly on the fact of the negative correlation rather than on fitting the data with a particular formula.
An interesting article suggested that Menzerath's law applies to genomes at the genome-chromosome level (the level-3 object being the nucleotide base). If x is the number of chromosomes in a genome and y is the average chromosome size in nucleotide bases, y was found to be negatively correlated with x. However, this work was criticized on two grounds. First, since x · y is simply the genome size measured in bases, an inverse relationship between x and y is a trivial consequence of the genome size of different species being constant. Second, the choice of the two genomic units in that work, i.e., genome and chromosome, does not seem to have a strong biological meaning. Genome size can increase by incorporating repetitive sequences, and variation in chromosome number can be caused by fusion, fission, or other dynamic processes of chromosome structure (some aspects of genome size and karyotype evolution have been discussed elsewhere). Consequently, attaching a linguistic metaphor to these units is less convincing.
Here, we examine a different choice of genomic units in the human genome: gene, exon, and nucleotide base. Figure 1 shows the generic structure of a gene, which consists of multiple exons and introns. Exon sequences are retained in the mature messenger RNA (mRNA), whereas intron sequences are removed by splicing. The mRNA is then translated into a protein sequence, which folds into a three-dimensional (3D) structure. The function of a protein depends strongly on the features of this 3D structure, which in turn depend on the mRNA sequence and its constituent exons. Exons are not only units in DNA sequences; they also often match the domains or subunits of protein structure.
Further evidence that exons are independent units comes from exon skipping [8, 9] (one form of alternative splicing), in which an exon is left out of the mature mRNA as if it were part of an intron. Exon skipping, like other mechanisms of alternative splicing, greatly increases the number of possible protein sequences through the combinatorial options in choosing a subset of exons, and it plays an important role in development and evolution [11, 12].
For the gene-exon-base tiers, if a negative correlation between x and y is indeed found, we would then like to check the trivial situation of a constant x · y, which is the mRNA size in bases. If all human genes had the same mRNA size, we would expect a simple y ∼ 1/x relationship. Even if x · y is not constant, we would like to investigate the source of the negative correlation.
Data on human genes were obtained from the refGene.txt file downloaded from http://hgdownload.cse.ucsc.edu/downloads.html for the Feb. 2009 assembly (hg19), or equivalently, Genome Reference Consortium Human Reference 37 (GRCh37). A preprocessing step discarded refGene entries in any of the following categories: (1) mapped to alternative haplotypes or unaligned sequences; (2) located on the X or Y chromosome; (3) non-protein-coding genes such as RNA genes or pseudogenes; (4) incomplete entries that do not have the correct start or stop codon, or whose coding sequence length is not a multiple of 3; and (5) alternatively spliced forms of an already included gene. After this preprocessing, 18153 human genes remain.
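The preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used: it assumes the standard 16-column UCSC refGene.txt layout, treats an `NM_` accession as a proxy for a protein-coding entry, uses the `cdsStartStat`/`cdsEndStat` flags as a proxy for correct start and stop codons (the sequence itself is not inspected), and keeps the first transcript encountered for each gene symbol, since the text does not say which spliced form was retained. The function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Column order of UCSC refGene.txt (genePredExt with a leading bin column).
FIELDS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]
# Whitelisting chr1..chr22 drops sex chromosomes, haplotypes, and
# unplaced or unaligned sequences in one test.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def coding_length(row: Dict[str, str]) -> int:
    """Number of exonic bases that fall inside [cdsStart, cdsEnd)."""
    cds_start, cds_end = int(row["cdsStart"]), int(row["cdsEnd"])
    starts = [int(s) for s in row["exonStarts"].split(",") if s]
    ends = [int(e) for e in row["exonEnds"].split(",") if e]
    return sum(max(0, min(e, cds_end) - max(s, cds_start))
               for s, e in zip(starts, ends))


def filter_refgene(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield refGene rows that survive filters (1) to (5)."""
    seen_genes = set()
    for line in lines:
        row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
        if row["chrom"] not in AUTOSOMES:            # filters (1) and (2)
            continue
        if not row["name"].startswith("NM_"):        # filter (3), assumed proxy
            continue
        if row["cdsStartStat"] != "cmpl" or row["cdsEndStat"] != "cmpl":
            continue                                  # filter (4), assumed proxy
        n_coding = coding_length(row)
        if n_coding == 0 or n_coding % 3 != 0:        # filter (4), length check
            continue
        if row["name2"] in seen_genes:                # filter (5), assumption:
            continue                                  # keep first transcript
        seen_genes.add(row["name2"])
        yield row
```

A typical call would be `genes = list(filter_refgene(open("refGene.txt")))`. The resulting count need not reproduce the 18153 genes reported above, because the proxies used here may differ from the original criteria.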
Figure 2 shows the scatter plot of the number of exons of a gene (x) vs. the averaged exon length in bases of that gene (y). Four versions of the same data are shown in Figure 2, depending on whether a logarithmic scale is used for the x axis, the y axis, or both. The apparent negative trend between the two measurements is confirmed by the nonparametric Spearman rank correlation coefficient of ρ = −0.61 (P-value < 2.2 × 10^−12). Linear correlations between the two measurements were also calculated: the Pearson correlation coefficients are −0.35, −0.57, −0.47, and −0.65, respectively, for the four plots in Figure 2.
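The two kinds of correlation can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names and the dictionary keys are hypothetical, and the order of the four Pearson variants is not meant to match the order of the panels in Figure 2. Spearman's ρ is obtained as the Pearson correlation of average ranks, so it is unchanged by taking logarithms of either variable.

```python
import math
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def correlations(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Spearman's rho plus Pearson's r for the four axis scalings."""
    log_x = [math.log10(v) for v in x]
    log_y = [math.log10(v) for v in y]
    return {
        "spearman": spearman(x, y),
        "pearson_linear_linear": pearson(x, y),
        "pearson_logx_linear": pearson(log_x, y),
        "pearson_linear_logy": pearson(x, log_y),
        "pearson_logx_logy": pearson(log_x, log_y),
    }
```

Here `x` would hold the exon count of each gene and `y` its averaged exon length. The P-value is not computed in this sketch.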
The averaged exon size (averaged over all exons in a gene) is further averaged over all genes with a fixed number of exons (x = 1, 2, 3, …, 50). Three versions of averaging are used: arithmetic mean, median, and geometric mean. Though not perfect, a roughly linear trend can be seen in the log–log plot of Figure 2 between the average of the averaged exon sizes and the number of exons. If a linear regression is fitted on the log–log scale, the slope is around −0.5.
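The binning and the log–log fit can be sketched as follows. This is a minimal illustration assuming an ordinary least-squares fit on base-10 logarithms; the text does not state which regression method or which of the three averages was used for the quoted slope, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import geometric_mean, mean, median
from typing import Dict, Iterable, Sequence, Tuple


def group_averages(x: Sequence[int], y: Sequence[float],
                   max_exons: int = 50) -> Dict[int, Dict[str, float]]:
    """Average y over all genes sharing the same exon count x (1..max_exons)."""
    groups = defaultdict(list)
    for n_exon, size in zip(x, y):
        if 1 <= n_exon <= max_exons:
            groups[n_exon].append(size)
    return {
        n_exon: {
            "mean": mean(sizes),
            "median": median(sizes),
            "geometric_mean": geometric_mean(sizes),
        }
        for n_exon, sizes in sorted(groups.items())
    }


def loglog_slope(points: Iterable[Tuple[float, float]]) -> float:
    """Least-squares slope of log10(value) against log10(n)."""
    points = list(points)
    log_n = [math.log10(n) for n, _ in points]
    log_v = [math.log10(v) for _, v in points]
    mean_n, mean_v = mean(log_n), mean(log_v)
    s_nn = sum((a - mean_n) ** 2 for a in log_n)
    s_nv = sum((a - mean_n) * (b - mean_v) for a, b in zip(log_n, log_v))
    return s_nv / s_nn
```

For example, `loglog_slope((n, s["median"]) for n, s in group_averages(x, y).items())` gives the slope for the median curve; the other two averages are handled the same way.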
To check the possibility that all human genes are transcribed to mRNAs of more or less the same length, so that the average exon size is inversely proportional to the number of exons, we plot the mRNA size (in bases) as a function of the number of exons in Figure 3 (on a log–log scale). Figure 3 shows that the mRNA size does not remain constant; it actually increases with the number of exons. When we plot the average (arithmetic mean, median, and geometric mean) of the mRNA size over genes with the same number of exons (x = 1, 2, …, 50), a roughly linear trend with slope ∼0.5 can be found on the log–log scale in Figure 3.
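For concreteness, the three per-gene quantities (exon count x, averaged exon size y, and mRNA size x · y) can be derived from the exon coordinates of a refGene entry as in the sketch below. It assumes the UCSC convention of comma-terminated coordinate lists with 0-based, half-open intervals, so that an exon's length is its end minus its start, and it takes the mRNA size to be the sum of all exon lengths. The function names are hypothetical.

```python
from typing import List, Tuple


def exon_lengths(exon_starts: str, exon_ends: str) -> List[int]:
    """Exon lengths from refGene's comma-terminated coordinate strings."""
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    return [end - start for start, end in zip(starts, ends)]


def gene_measurements(exon_starts: str, exon_ends: str) -> Tuple[int, float, int]:
    """Return (x, y, mRNA size): exon count, mean exon length, total length."""
    lengths = exon_lengths(exon_starts, exon_ends)
    n_exon = len(lengths)
    mrna_size = sum(lengths)
    return n_exon, mrna_size / n_exon, mrna_size
```

By construction x · y equals the mRNA size, which is why a constant mRNA size would force y ∼ 1/x.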
The negative correlation between the average exon size and the number of exons can be understood by a simple argument: the mRNA size $L_{\mathrm{RNA}}$ increases with the number of exons $n_{\mathrm{exon}}$ roughly as $L_{\mathrm{RNA}} \sim n_{\mathrm{exon}}^{0.5}$; the average exon size is then $L_{\mathrm{exon}} = L_{\mathrm{RNA}}/n_{\mathrm{exon}} \sim n_{\mathrm{exon}}^{0.5}/n_{\mathrm{exon}} \sim n_{\mathrm{exon}}^{-0.5}$, which decreases with $n_{\mathrm{exon}}$. That the mRNA size is not constant across human genes can be checked directly: Figure 4 shows the histogram of log-transformed mRNA sizes (and that of gene sizes on genomic DNA, measured either from start codon to stop codon or from transcription start to transcription stop). The peak between 930 and 940 bases is mainly due to the large number of single-exon genes in the olfactory receptor gene family. Although the normal-like distribution in Figure 4 may indicate that mRNA size is limited, the constraint is imposed on the logarithmic scale.
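The scaling argument can be written out with a general exponent. The symbol $\alpha$ is introduced here only for illustration; the fits in Figures 2 and 3 suggest a value of roughly 0.5.

```latex
\begin{align}
L_{\mathrm{RNA}} &\sim n_{\mathrm{exon}}^{\alpha}, \\
L_{\mathrm{exon}} &= \frac{L_{\mathrm{RNA}}}{n_{\mathrm{exon}}}
  \sim n_{\mathrm{exon}}^{\alpha - 1}.
\end{align}
```

With $\alpha = 0$ (constant mRNA size) this reduces to the trivial case $L_{\mathrm{exon}} \sim 1/n_{\mathrm{exon}}$. With $\alpha \approx 0.5$ it gives $L_{\mathrm{exon}} \sim n_{\mathrm{exon}}^{-0.5}$, consistent with the log–log slope of about −0.5 in Figure 2. Any $\alpha < 1$ yields a decreasing average exon size, whereas $\alpha = 1$ would give no dependence on the number of exons.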
Applying linguistic and information-theoretic metaphors to biology is not new. Many standard terms in molecular biology were borrowed from these fields (e.g., genetic code, transcription, translation). But one must always remember that a metaphor is not the real thing. Menzerath's law in quantitative linguistics describes the relationship between word length (in syllables) and syllable length (in phonemes), both of which vary within a very narrow range (e.g., 2–3). In the genomic context, the exon number ranges from 1 to more than a hundred, and the averaged exon size ranges from 50 to 10,000 bases; both ranges are much wider than those in the original linguistic context.
The difference between our choice of genomic units and that of the genome-chromosome study has another consequence. To collect samples with different genome sizes, the data in that study had to be assembled across species. In our case, however, all genes are from one species, Homo sapiens. In terms of the original linguistic context, our data are more similar to a word-syllable-phoneme analysis within a single language than to one across different languages.
In summary, using the three-level genomic units of gene, exon, and base, we confirm a negative correlation between the number of exons and the averaged exon size, analogous to Menzerath's law in quantitative linguistics on the relation between the number of syllables in a word and the syllable size in phonemes. Although this negative correlation is not a trivial consequence of a constant messenger RNA size, it can be understood as the result of a less-than-linear increase of messenger RNA size with the number of exons. Because of this less-than-linear trend, the normalized messenger RNA size, i.e., the averaged exon size, decreases with the number of exons.