## 1. INTRODUCTION

Attempts to demonstrate that statistical patterns of language have a trivial explanation have a long history that goes back at least to the research by G. A. Miller and collaborators questioning the relevance of Zipf's law for word frequencies around 1960 [1–3]. Zipf's law states that the curve that relates the frequency of a word, *f*, and its rank, *r* (the most frequent word having rank 1, the second most frequent word having rank 2, and so on), should follow $f \sim r^{-\alpha}$ [4]. Miller argued that if monkeys were chained “*to typewriters until they had produced some very long and random sequence of characters*” one would find “*exactly the same ‘Zipf curves’ for the monkeys as for the human authors*” [3]. Under his view, Zipf's law would be an inevitable consequence of the fact that words are made of units, e.g., letters or phonemes. The typewriter argument has been revived many times since then [5–8]. However, rigorous analyses indicate that the curves do not really look the same and that parameters of this random typing model giving a good fit to real word frequencies are not forthcoming [9, 10]. Here, we review a recent claim that the finding of another statistical pattern of language, Menzerath's law, is also inevitable [11].
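The random typing model behind Miller's argument can be sketched in a few lines. The following is only an illustrative simulation, not the setup used in the cited analyses: the function name, the five-letter alphabet, the space probability `p_space`, and the text length are all hypothetical choices made for this example. It produces the rank-frequency list that one would compare against real word frequencies.

```python
import random
from collections import Counter


def random_typing_rank_frequency(n_chars=200_000, alphabet="abcde",
                                 p_space=0.2, seed=0):
    """Simulate a 'monkey' typing n_chars characters.

    Each keystroke is a space with probability p_space, otherwise a
    letter drawn uniformly from alphabet. All parameter values are
    assumptions made for illustration only.
    """
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # Words are the maximal runs of letters between spaces.
    words = "".join(chars).split()
    counts = Counter(words)
    # Frequencies sorted in decreasing order: position i holds rank i + 1.
    return sorted(counts.values(), reverse=True)


freqs = random_typing_rank_frequency()
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on doubly logarithmic axes, next to the rank-frequency list of a real text, is one way to inspect how closely the two curves actually agree.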

P. Menzerath hypothesized that “the greater the whole, the smaller its constituents” (“*Je größer das Ganze, desto kleiner die Teile*”) in the context of language [12] (p. 101). Converging research in music and genomes [13–16] suggests that Menzerath's law is a general law of natural and human-made systems. In this article, we reserve the term Menzerath-Altmann law for the exact mathematical dependency that has been proposed by the quantitative linguistics tradition for the relationship between *x*, the size of the whole (in parts), and *y*, the mean size of the parts, i.e. [17],

$$y = a x^{b} e^{c x} \qquad (1)$$

where *a*, *b*, and *c* are the parameters of Menzerath-Altmann law.

In the pioneering research by Wilde and Schwibbe [14] and later work [15, 20], Menzerath's law emerged as a negative correlation between $L_c$ and $L_g$, where $L_c$ is the mean chromosome length (the size of the constituents) and $L_g$ is the chromosome number (the size of the construct measured in constituents). More recently, the law has been found in the dependency between mean exon size (the size of the constituents) and the number of exons of human genes (the size of the construct) [16].

However, it has been argued that this negative correlation is trivial [11]: the definition of $L_c$ as a mean, i.e. $L_c = G/L_g$, where $G$ is the genome size, leads (according to Ref. [11]) unavoidably to $L_c \sim L_g^{b}$ with *b* = −1, which is supported by the fact that mammals and plants give values of *b* that are very close to −1 (*b* = −1.04 for mammals and *b* = −1.07 for plants [11]). In the present article, ∼ is used to indicate proportionality. Furthermore, it has also been argued that a proper connection between human language and genomes cannot be established a priori using genomes as wholes and chromosomes as parts, due to the fluid nature of chromosomal arrangements and the vast dominance of noncoding DNA, which has no parallel in language [11].
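To make the inevitability argument concrete, the sketch below computes the log-log slope that the definition $L_c = G/L_g$ yields under two hypothetical assumptions about the genome size $G$: a constant $G$, and a $G$ that grows as the square root of $L_g$. The chromosome numbers, the genome sizes, and the square-root dependency are invented for illustration and are not data from the cited studies.

```python
import math
import random


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


rng = random.Random(0)
# Hypothetical chromosome numbers for 500 imaginary species.
L_g = [rng.randint(2, 60) for _ in range(500)]

# Assumption 1: genome size G is the same constant for every species.
G_const = [3.0e9 for _ in L_g]
L_c_const = [G / n for G, n in zip(G_const, L_g)]
b_const = loglog_slope(L_g, L_c_const)

# Assumption 2 (hypothetical): G grows as the square root of L_g.
G_scaled = [1.0e8 * n ** 0.5 for n in L_g]
L_c_scaled = [G / n for G, n in zip(G_scaled, L_g)]
b_scaled = loglog_slope(L_g, L_c_scaled)

print(round(b_const, 6), round(b_scaled, 6))
```

Under the first assumption the slope equals −1 up to floating-point error, whereas under the second it equals −0.5. In this toy setting, the exponent that follows from defining $L_c$ as a mean therefore depends on how $G$ varies with $L_g$.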

Re-examining those arguments is critical for musicology, quantitative linguistics, and genomics. If they were correct, the relationship between the mean size of the constituents (*y*) and the number of constituents (*x*), which has been the subject of many studies [13, 16–18], would be a trivial consequence of the definition of the size of the constituents as a mean. Following Miller's argument, producing Menzerath's law would be as easy as producing Zipf's law by monkeys chained to a typewriter. More precisely, the inevitability of $L_c \sim 1/L_g$ [11] predicts that Menzerath-Altmann law must always be Eq. (1) with *b* = −1 and *c* = 0 when defining the size of the parts as a mean. If such inevitability is correct, exponents deviating significantly from *b* = −1 should be the exception, not the rule, in language, music, and genomes.

Here we address the challenge of Menzerath's law in genomes [14–16] and beyond [13, 17, 18] by reviewing Solé's criticisms [11]: his mathematical and statistical arguments, essentially the inevitability of $L_c \sim 1/L_g$ (Section 2), as well as his conceptual arguments, mainly the mismatch between human language and genomes (Section 3). Finally, we will discuss some general questions that are crucial for understanding the recurrence of Menzerath's law (Section 4).