Estimating Linguistic Diversity on the Internet: A Taxonomy to Avoid Pitfalls and Paradoxes



Both UNESCO and the OECD have recognized the public policy benefit of publicizing information on linguistic diversity on the Internet. However, the published methodologies for estimating “linguistic diversity” or “Internet statistics (by language)” interpret these key terms differently. This article proposes a new taxonomy that defines and contrasts user activity, user profile, web presence, and diversity index in order to distinguish among the various indicators used to estimate language usage on the Internet. This taxonomy facilitates comparisons of the available methodologies, whose limitations are then critiqued. It also helps to resolve the apparent paradox of whether the use of English on the Internet has declined rapidly or has remained fairly stable. The study concludes that the best estimates of web presence can be obtained by direct measurement: randomly addressing and analyzing a representative sample of all public websites. However, this approach will only suffice if the language detection software used is progressively extended to recognize all the world’s written languages.