Global biogeographic regions in a human-dominated world: the case of human diseases

Since the work of Alfred Russel Wallace, biologists have sought to divide the world into biogeographic regions that reflect the history of continents and evolution. These divisions not only guide conservation efforts, but are also the fundamental reference point for understanding the distribution of life. However, the biogeography of human-associated species (such as pathogens, crops, or even house guests) has been largely ignored or discounted. As pathogens have the potential for direct consequences on the lives of humans, domestic animals, and wildlife, it is prudent to examine their potential biogeographic history. Furthermore, if distinct regions exist for human-associated pathogens, they would provide possible connections between human wellbeing and pathogen distributions and, more generally, between humans and the deep evolutionary history of the natural world. We tested for the presence of biogeographic regions for diseases of humans due to pathogens using country-level disease composition data and compared the regions for vectored and non-vectored diseases. We found discrete biogeographic regions for diseases, with a stronger influence of biogeography on vectored than non-vectored diseases. We also found significant correlations between these biogeographic regions and environmental or socio-political factors. While some biogeographic regions reflected those already documented for birds or mammals, others reflected colonial history. From the perspective of diseases caused by pathogens, humans have altered but not evaded the influence of ancient biogeography. This work is the necessary first step in examining the biogeographic relationship between humans and their associates.


INTRODUCTION
Biogeographic regions have been delineated for many animal and plant taxa and are a cornerstone of large spatial scale biology. However, human-associated species, such as pests, domesticates, and pathogens, tend to be excluded from consideration when discerning biogeographic regions. These human associates are often expected to be ubiquitous and, on average, do indeed have larger geographic ranges than most other species (Dunn and Romdal 2005, Olden et al. 2006, Smith et al. 2007). Even relatively large human associates, such as rats and house flies, spread around the world with early western colonization (West 1951, He et al. 2009), just as West Nile virus and avian influenza H5N1 spread more recently (Centers for Disease Control and Prevention 2003, Spielman et al. 2004, Fauci 2005, Olsen et al. 2006). Delineation of biogeographic regions for taxa that are truly everywhere, as appears to be the case for aquatic protists (Fenchel and Finlay 2004), would not be particularly useful. It is possible that the easy and repeated spread of human associates might blur any interesting or useful biogeographic pattern in their composition. However, not all human associates have spread to all inhabited continents (Gonçalves et al. 2003), and some human-associated species, particularly those that live outdoors for part of their life cycles, are directly influenced by the presence of vectors, alternate hosts, climate, or other environmental conditions (Wilcox and Colwell 2005). It is therefore possible that historic and climatic influences still lead to discrete biogeographic regions for these species.
While scientific curiosity and conservation have motivated most studies of biogeographic regions (Wallace 1876, Grenyer et al. 2006, Holt et al. 2013), the biogeography of human associates in general and pathogens in particular has the potential for direct human consequences. Others have suggested that the diversity and prevalence of human pathogen species affect human politics, the likelihood of war, and religion (Fincher and Thornhill 2008), among other aspects of socio-politics (Nettle 2009), such that the biogeographic distribution of pathogens has the potential to pervasively affect human life and societies. If the limits of pathogen distributions are determined not only by climate and attempts at disease control but also by the ancient biogeographic distribution of vectors and hosts (e.g., Stensgaard et al. 2013), the differences among regions in their pathogens (and consequently religions, behaviors, and socio-politics) may be relatively persistent features of those regions.
To the extent that biogeographic regions for human pathogens exist, they might be expected to differ as a function of the biology of the pathogens. A priori it seems likely that vectored and non-vectored pathogens might differ in their biogeographic regions. Vectored pathogens require at least the presence of a suitable vector and, in some cases, a reservoir host. Non-vectored pathogens, on the other hand, many of which evolved relatively recently (Wolfe et al. 2007), can be transmitted either person to person or via contaminated water, such that their global spread might be less likely to be influenced by environmental differences among regions of the world. We could have divided pathogens into other groupings, but we chose to focus on vectored and non-vectored taxa inasmuch as this division allowed us to test a priori predictions (e.g., Smith and Guégan 2010), to rely on relatively robust categorizations of pathogens (whether or not a pathogen is vectored tends to be reasonably well known), and to obtain groups with sample sizes sufficient to justify the development of biogeographic regions. If biogeographic regions exist for either grouping or overall, we hypothesize they might be due to differences in climate, biogeographic history (e.g., the movement of the continents, the chance dispersal of host lineages, etc.), or human history and culture.
Here we test whether human pathogens are distributed globally in distinct biogeographic regions. We first consider 301 human diseases caused by pathogens from the Global Infectious Diseases and Epidemiology Online Network (GIDEON; http://gideononline.com) database and then consider diseases caused by vectored (n = 93) and non-vectored pathogens (n = 208) separately. GIDEON defines vectors as the agent in which a pathogen is transmitted from one host to another. The GIDEON disease data are described in detail in Smith et al. (2007). They are not complete (diseases, particularly rare ones, can be missed), but they are, to our knowledge, the most complete disease data available. As with any global dataset, it is likely that the data are poorer in regions with less well developed public health systems, such that pathogens unique to those regions might be missed. This includes both rare diseases that have not yet been detected in particular regions and emerging pathogens that have yet to be detected anywhere. As in other studies of the biogeography of non-human associates, we use a hierarchical clustering algorithm (e.g., Wang et al. 2003, Xie et al. 2004, Oliver and Irwin 2008), specifically Ward's hierarchical agglomerative method (Ward 1963), to evaluate whether biogeographic regions exist. We then test the robustness of these biogeographic regions using a complementary and more computationally intensive approach: community detection (Lancichinetti and Fortunato 2009). Finally, once biogeographic regions were identified, we compared, individually, the environmental and socio-political variables associated with those regions for both vectored and non-vectored diseases caused by pathogens.

Data collection
Human disease data.-Data on the presence and absence of 301 diseases caused by pathogens (Appendix: Table A1) of humans in 229 countries (Appendix: Table A2) from GIDEON were the basis for the majority of our analyses. GIDEON also provided basic data on the life history of the pathogens that cause each disease, particularly whether or not each is vector-borne. We excluded from our analyses GIDEON entries that were not recorded in the database as currently present in any country, were not associated with a pathogen, or were difficult to assign as vector-borne or non-vector-borne. We used the broader pathogen literature to guide our classification decisions; on this basis we classified Mycobacterium ulcerans as vector-borne (Marsollier et al. 2002), but none of the other Mycobacteria. GIDEON's data have now been used in a number of papers on the biogeography of disease (Møller et al. 2009, Yang et al. 2012) and represent the highest quality data available at the global scale, except those for particular pathogens (e.g., malaria).
Environmental and socio-political variables data.-To compare different biogeographic regions, we focused on environmental variables known to be associated with the distribution of individual pathogens. Toward this end, we extracted data on the minimum temperature, maximum temperature, daily precipitation frequency, annual precipitation volume, total precipitation volume in the month with the minimum amount of precipitation, and total precipitation volume in the month with the maximum amount of precipitation for each country from the Tyndall Center for Climate Change Research at the University of East Anglia (Mitchell et al. 2002). Each of these variables has been suggested to influence the distributions of at least some pathogens (Guernier et al. 2004, Jones et al. 2008, Bonds et al. 2012) or their vectors (Lafferty 2009). In addition, we used derived estimates of human population density (persons per square kilometer in 2010), gross domestic product (2012 US dollars), and land area (square kilometers) per country from the Population Division of the United Nations (United Nations 2013), supplemented as needed from the CIA World Factbook (http://www.cia.gov). Human population density has the potential to influence the persistence of pathogens, but it has been suggested that there is also an association between the diversity of human pathogens and the diversity of birds and mammals, with areas of higher bird and mammal diversity tending to have more diseases caused by pathogens (Dunn et al. 2010). For our analyses we chose to use only native mammal species richness data, since global bird, mammal, and plant richness are highly correlated (Qian and Ricklefs 2008) and our qualitative result should be similar regardless of which of these variables we analyze. Mammal diversity estimates were based on the International Union for Conservation of Nature's native mammal species richness data (IUCN 2013).

Statistical analysis
Using political boundaries as our unit of analysis, we demarcated biogeographic regions based on the composition of diseases caused by pathogens. We employed hierarchical clustering to identify potential biogeographic regions from three different schemas of the pathogen composition by country data: (1) full suite of diseases, (2) vector-borne diseases, and (3) non-vector-borne diseases. We then used a second statistical procedure, community detection, to validate our findings. Ideally, one might consider fine grain data on human diseases, or data on individual taxa of pathogens (e.g., distinguishing the distribution patterns of strains of Bartonella rather than simply the presence of Bartonellosis), but such data exist only for a minority of pathogens. Amazingly, we remain more ignorant about the distribution of human pathogen taxa than we do about rare birds.
All of our analyses were performed in R 2.15.1 (R Core Team 2013) except where otherwise noted. Biogeographic regions were identified using Ward's agglomerative hierarchical clustering method (R function: hclust), as Ward's method is a preferred method in many biogeographic studies (Kreft and Jetz 2010). Regions were created by clustering countries based on their composition of diseases caused by pathogens (i.e., presence and absence). The major benefit of using Ward's method is that the algorithm joins groups while minimizing within-cluster variance. However, agglomerative hierarchical methods produce results (in the form of a dendrogram) without a clear indication of the optimal number of biogeographic regions (k). Therefore, we used a Mantel-based algorithm (Borcard et al. 2011) to determine k for each schema (i.e., vector-borne pathogens, non-vector-borne pathogens, and the full suite of 180 pathogens; R package: cluster; function: daisy [Maechler et al. 2013]). This method aims to maximize the correlation between the original (unclustered) distance matrix and the distance matrices computed by cutting the dendrogram at various levels (Borcard et al. 2011).
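The Mantel-based choice of k can be illustrated with a minimal sketch. It is written in Python rather than the R used for the analyses, and it rests on several assumptions that are not taken from the study: the Jaccard dissimilarity stands in for whatever distance was computed with daisy, the "distance matrix" of a dendrogram cut is taken to be a 0/1 indicator of whether two countries fall in different clusters, and the countries, diseases, and candidate cuts are hypothetical toy values. The function names are illustrative only.

```python
from itertools import combinations
from math import sqrt


def jaccard_distance(a, b):
    """Jaccard dissimilarity between two sets of diseases (assumed distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mantel_score(presence, labels):
    """Correlate pairwise distances with a 0/1 'different cluster' indicator."""
    pairs = list(combinations(sorted(presence), 2))
    dist = [jaccard_distance(presence[i], presence[j]) for i, j in pairs]
    split = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    return pearson(dist, split)


# Hypothetical toy data: six countries and the diseases recorded in each.
presence = {
    "A": {"d1", "d2", "d3"},
    "B": {"d1", "d2"},
    "C": {"d7", "d8"},
    "D": {"d7", "d8", "d9"},
    "E": {"d4", "d5"},
    "F": {"d4", "d5", "d6"},
}

# Hypothetical partitions, as if one dendrogram had been cut at k = 2, 3, 4.
cuts = {
    2: {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1, "F": 1},
    3: {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2},
    4: {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3},
}

# The retained k is the cut whose indicator matrix best matches the distances.
scores = {k: mantel_score(presence, labels) for k, labels in cuts.items()}
best_k = max(scores, key=scores.get)
```

In this toy case the three-cluster cut reproduces the distance structure exactly, so it receives the highest correlation; with real data the correlations would typically be lower and closer to one another.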
We validated the biogeographic regions found by hierarchical methods using community detection. In particular, we used a modularity maximization algorithm, Fast Unfolding (Blondel et al. 2008), to divide the network of countries into biogeographic regions. Modularity is the fraction of connections that occur within regions minus the fraction expected given a particular network. In the network, countries were connected if they shared the presence of a disease, and that connection was weighted by the number of diseases they had in common. To compare our two sets of regions (i.e., clustering and community detection), we calculated a Rand similarity coefficient (Rand 1971). We assessed the strength of the match by comparing the observed Rand similarity coefficient to a post hoc distribution of the coefficient generated with randomization tests, and then calculated the p-value directly from this distribution.
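The comparison of the two sets of regions can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the Rand coefficient follows its standard definition (the fraction of country pairs on which two partitions agree), while the randomization scheme (shuffling one set of labels among countries), the number of permutations, and the toy region assignments are all assumptions made for the example.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both partitions."""
    agree = 0
    total = 0
    for i, j in combinations(sorted(labels_a), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # True counts as 1
        total += 1
    return agree / total


def permutation_p_value(labels_a, labels_b, n_perm=999, seed=0):
    """One-sided p-value from shuffling labels_b among items (assumed scheme)."""
    rng = random.Random(seed)
    observed = rand_index(labels_a, labels_b)
    items = sorted(labels_b)
    values = [labels_b[i] for i in items]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)
        if rand_index(labels_a, dict(zip(items, values))) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical assignments: three regions from clustering, two from community
# detection, with the two-region split matching the first dendrogram branch.
ward = {"c1": 0, "c2": 0, "c3": 0, "c4": 1, "c5": 1, "c6": 2, "c7": 2, "c8": 2}
community = {"c1": 0, "c2": 0, "c3": 0, "c4": 1, "c5": 1, "c6": 1, "c7": 1, "c8": 1}

observed, p_value = permutation_p_value(ward, community)
```

Because the Rand coefficient counts pairs rather than matching labels, it remains defined when the two methods return different numbers of regions, as they did here.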
After determining the optimal number of biogeographic regions, we used multinomial and binomial models (R package: nnet) to assess how environmental and socioeconomic factors correlate with these biogeographic regions (Venables and Ripley 2002). Multinomial logistic regressions predict placement in a category, here biogeographic region, based on multiple independent variables, here environmental (number of native mammal species, maximum temperature, minimum temperature, temperature difference, precipitation frequency, total annual precipitation) and socio-political (GDP, population density) variables. To corroborate the multinomial models, each schema and covariate combination was evaluated in isolation using a series of binomial logistic regressions in MATLAB.

RESULTS
Using Ward's clustering algorithm and a Mantel optimality procedure, we determined the following number of biogeographic regions for each schema: vectored diseases (n = 7), non-vectored diseases (n = 5), and the full suite of 301 diseases (n = 2 [Appendix: Table A2]). We discerned seven biogeographic regions when considering vectored diseases (Fig. 1A; Appendix: Fig. A1), but only five regions when considering just non-vectored diseases (Fig. 1B; Appendix: Fig. A3), implying that the former have more biogeographic structure. In addition, the differences among biogeographic regions for vectored diseases were much greater than those for non-vectored diseases (Appendix: Figs. A1, A2). The differences between the most distinct biogeographic regions for non-vectored diseases were akin to those found within the biogeographic regions of vectored diseases (Appendix: Figs. A1, A2). Though community detection methods discerned similar biogeographic regions, the exact number of regions was different for the two subset schemas. For example, three regions were found when the vector-borne diseases were considered using community detection, four fewer than found using Ward's clustering. However, the regions found using community detection match the first branches in Ward's clustering (Appendix: Figs. A1, A2), and the regions detected using the two approaches were much more similar than expected by chance (p < 0.001) on the basis of Rand similarity coefficients.
Vectored biogeographic regions were significantly correlated with climate, biodiversity, and social variables (Table 1). Non-vectored regions were correlated with all variables except minimum temperature (Table 1). Biogeographic regions from the full suite of pathogens were correlated with native mammal species richness and GDP (Table 1).

DISCUSSION
We have shown that biogeographic regions exist for human diseases caused by pathogens. Although those species that live on our bodies, in our homes, or in our backyards are among those with the most direct effects on our health and well-being, the distribution of these organisms remains poorly understood, perhaps in part because they are assumed to live everywhere we do. For diseases and the pathogens that cause them, this assumption is wrong. The biogeographic regions for diseases caused by pathogens are robust to the statistical approaches used and are as distinct as the biogeographic regions of, for example, vertebrates or plants. In other words, not only do biogeographic regions exist for diseases and disease-causing pathogens (and likely other human associates), they are comparable in their delineation to the other established biogeographic regions, regions typically described as existing due to biogeographic history before the actions and movement of humans.
The biogeographic regions for vectored diseases coincide in many respects with those recently derived for non-human vertebrates (Holt et al. 2013). Where it exists, the coincidence of these regions suggests that the same historic factors that influence the composition of wild birds, mammals, and amphibians also influence (whether directly or indirectly) which diseases are present in any particular region. While previous work has suggested strong links between climate and the diversity of pathogens or diseases globally (Guernier et al. 2004, Bonds et al. 2012), here we suggest something different: namely that, in addition to climate, history and geography have strong effects on which pathogens are where. That the impacts of history and geography are nearly as strong on diseases and their pathogens as on organisms such as mammals, which are superficially less mobile between regions, is both novel and somewhat surprising. The biogeographic regions of birds, mammals, and other animals are the result of the geographic position of landmasses, plate tectonics, and chance dispersal events (e.g., Holt et al. 2013). Ultimately, these same factors must also play a role in the distribution of the pathogens that cause diseases, whether directly or via their effects on alternate hosts and vectors. As a result, the precise mix of vectored diseases in any particular place is the result of not only climate, human migration, and attempts at disease control, but also millions of years of tectonics and chance dispersal (or failure to disperse). So long as our attempts to control vectored pathogens are incomplete, human health and wellbeing, culture, and even political stability are likely to continue to be influenced by this ancient history.
However, the biogeographic regions for vectored diseases did depart from those of vertebrates in several interesting ways. For example, one of the biogeographic regions clearly defined for vectored diseases, Region 6 (Fig. 1), included portions of the Holarctic, but also included historically British colonized countries, such as Australia. From the perspective of vectored diseases, Australia is part of the same biogeographic region as England even though Australia is one of the most distinctive historical biogeographic regions in terms of birds or mammals (Wallace 1876, Holt et al. 2013). Similarly, India and Bangladesh, despite being climatically and historically very different from any Holarctic region, are part of a biogeographic region (Region 7) that is most similar to Region 6 (the region containing Britain [Appendix: Fig. A1]). Clearly, human history can and has altered the biogeographic regions of diseases, even if it has not fully obscured those ancient regions. The history of colonization from other regions also seems to influence the biogeographic regions, albeit somewhat less clearly. The composition of diseases in Italy, for example, was similar to that in its former colonies (e.g., Eritrea, Ethiopia, Somalia, etc.). The relationships between colonies and former colonies and states are doubtless very complex, the consequence of the layered influence of colonial movement, socio-politics, and disease biology, and yet the existence of pathogen biogeographic regions that correspond to colonial histories is striking. The ancient biogeography of vectored pathogens can be altered and, inasmuch as the Holarctic (including British) pathogen composition is a relatively benign one, altered to the ends of human benefit. We also found some outwardly unusual pairings of countries. For example, vectored Region 1, with member countries including Afghanistan, Malta, Monaco, Qatar, and others, seems unusual upon initial review, as the climate, history, and biogeography of the nations within this region are diverse. However, similarity indices, such as those used in our hierarchical clustering, tend to unite sites with low diversity (Boyle et al. 1990), and these countries were in the lowest 15% of vectored disease richness. Such low richness occurs both in countries of small size, where sampling is likely to be incomplete, and in countries where, as in Qatar, the climate is generally not conducive to life. In other words, these countries share the attribute of hosting a relatively low diversity of recorded, vectored diseases.
Table 1 note: The missing value for temperature difference for the full suite data schema is expected due to two interacting properties of the analysis: the optimal number of biogeographic regions (n = 2), and the fact that the variable, temperature difference, is a linear combination of two other variables, maximum and minimum temperature. *p < 0.05; **p < 0.01; ***p < 0.001.
Non-vectored diseases were clustered into fewer biogeographic regions than were vectored diseases regardless of our statistical approach, and those regions were far less different from one another than were the biogeographic regions for vectored diseases. One of those regions was composed of the temperate Palearctic and Nearctic (together, Holarctic) regions of the world. The other four were composed largely of, respectively, the countries that were islands (Region 2), large proportions of sub-Saharan Africa (Region 3), Central and South America (Region 4), and Southeast Asia (Region 5). The links between colonial territories and states are dampened in the non-vectored regions compared to the vectored regions.
Like the pathogen subset schemas (vectored and non-vectored), the full suite of diseases (Appendix: Figs. A3, A4) is distributed into statistically significant biogeographic regions, but the nature of the regions is different from the subsets. We found just two biogeographic regions when considering the full set of diseases and these regions broadly reflected the division between Holarctic and Australia and the rest of the world. At the coarsest grain, in other words, the world can be divided, from the perspective of diseases, into just two pieces with many consequences, all of those associated with diseases, following from this division.
Biogeographic regions of vectored and non-vectored diseases were associated with both climatic and socio-political variables, suggesting that, in addition to the influence of history, climate and socio-politics have influenced the distribution of diseases. Though much is understood about the individual natural histories of vectors (Qiu et al. 2002, Foley et al. 2007), less is known about general trends in the histories of the vectored diseases. Given that the biogeographic regions of vectored disease are associated with climate variables, it seems likely that the precise boundaries of these regions will shift as climate changes, as has been suggested to be the case for individual pathogens (Pascual and Bouma 2009). Just how they will shift is likely to be difficult to predict, given that such shifts will represent the cumulative effect of the movement of hundreds of pathogen species, their vectors, and alternate hosts. Not surprisingly, given the less distinct biogeographic divisions of the full suite of diseases, they were not strongly associated with environmental and socio-political variables.
A key question that emerges from our work is why diseases, particularly those with vectors, can be grouped into biogeographic regions akin to those for birds and mammals. Clearly, climate influences the distribution of both vectors and pathogens, as we found here, and as shown in many other studies (e.g., Mordecai et al. 2013, Stensgaard et al. 2013). Yet, if climate were the sole influence on the biogeographic regions of vectored pathogens, we would expect those regions to simply reflect the climatic zones of the world. They do not. Instead, they reflect both the influence of climate and, it appears, the influence of the biogeographic history of regions. This influence implies that many pathogens have been unable to disperse to all of the regions for which they and their vectors are climatically suited. In addition, for those pathogens that require alternate hosts (e.g., Chagas disease), the spread of pathogens may be limited by the need for dispersal of those hosts, the vector, and the pathogen. If true, these mechanisms bear obvious consequences for the future distributions of pathogens. As novel, non-vectored pathogens emerge, the relative lack of biogeographic regions for such pathogens suggests that time, rather than climate and history, is likely to influence where they will occur. Conversely, for vectored pathogens, it suggests that both of these barriers to dispersal may be more persistent.
More broadly, we have shown that diseases caused by pathogens can be persistently influenced by ancient evolutionary history and climate to such an extent as to cluster globally into biogeographic regions. But we suspect similar (though probably not identical) biogeographic regions exist for other human associates such as house guests, pests, and even mutualist species. Given that such species are those with the most direct effect on human fitness and some of the most direct effects on politics and economies, their biogeographic regions are fundamental to the human story. These regions are a reminder of the influences of evolutionary history and climate on our lives. We posit that among the most important factors influencing an individual's life expectancy and general fate is the disease biogeographic region in which they are born. But the influence of human history on the details of these regions is also a reminder that they can be changed.

ACKNOWLEDGMENTS
This work was supported by a grant from the USGS Southeast Regional Climate Science Center to R. R. Dunn. R. R. Dunn was also supported by Army Research Office Award W911NF-14-1-0556 and an NSF Career grant (0953390). We thank J. Osbourne, O. Ott, and M. Trautwein for their assistance with this project.
We also thank J.-P. Lessard, I. Kuznetsov, L. Bettine Kent, A.-S. Stensgaard, and anonymous reviewers for their comments on previous drafts of this manuscript.

LITERATURE CITED

SUPPLEMENTAL MATERIAL

APPENDIX
Fig. A1. Dendrograms from hierarchical clustering using Ward's distance for disease presence for countries (n = 229) from the GIDEON database. The vertical axis represents Ward's distance between clusters. Our analyses resulted in 7 clusters using vector-borne disease (n = 93) presence. Each differently colored cluster indicates a biogeographic region (colors match biogeographic regions from Fig. 1A).
Dendrogram from hierarchical clustering using Ward's distance for disease presence for countries from the GIDEON database. The vertical axis represents Ward's distance between clusters. Our analyses resulted in countries (n = 229) being placed into 2 clusters using the full suite of diseases (n = 301). Each differently colored cluster indicates a biogeographic region (colors match biogeographic regions from Appendix: Fig. A3).