Metagenomic assessment of the global diversity and distribution of bacteria and fungi

Summary Bacteria and fungi are of uttermost importance in determining environmental and host functioning. Despite close interactions between animals, plants, their associated microbiomes, and the environment they inhabit, the distribution and role of bacteria and especially fungi across host and environments as well as the cross‐habitat determinants of their community compositions remain little investigated. Using a uniquely broad global dataset of 13 483 metagenomes, we analysed the microbiome structure and function of 25 host‐associated and environmental habitats, focusing on potential interactions between bacteria and fungi. We found that the metagenomic relative abundance ratio of bacteria‐to‐fungi is a distinctive microbial feature of habitats. Compared with fungi, the cross‐habitat distribution pattern of bacteria was more strongly driven by habitat type. Fungal diversity was depleted in host‐associated communities compared with those in the environment, particularly terrestrial habitats, whereas this diversity pattern was less pronounced for bacteria. The relative gene functional potential of bacteria or fungi reflected their diversity patterns and appeared to depend on a balance between substrate availability and biotic interactions. Alongside helping to identify hotspots and sources of microbial diversity, our study provides support for differences in assembly patterns and processes between bacterial and fungal communities across different habitats.


Introduction
Bacteria and fungi contribute significantly to global biodiversity and biomass ( Bar-On et al., 2018), and are fundamentally important for global ecosystems and host health and functioning. Bacteria provide their hosts with vitamins and cofactors, act as plant-growth promoting rhizobacteria, and help digest otherwise indigestible fibres and contribute to immunity (Schmidt et al., 2018). While some fungi are known as important plant symbionts or comprise some of the most beneficial mutualists and detrimental pathogens (Fisher et al., 2012), the role of fungi in non-plant hosts is less known. In environmental habitats, both bacteria and fungi drive decomposition of organic material and nutrient cycling (Falkowski et al., 2008;Berendsen et al., 2012). Given that hosts exist within external environments, external host habitats are much more complex in terms of scale, abiotic heterogeneity and C resources, compared with internal host habitats. Thus, environmental microbiomesand to a lesser extent external host microbiomeslikely experience greater exposure to migrations of different organisms/genes, leading to a greater variety of niches suitable for the establishment of a larger variety of organisms compared with internal host habitats (Pent et al., 2017;Küngas et al., 2020). Different abiotic factors such as pH, climate and organic matter contents shape the environmental community compositions of bacteria and fungi (Tedersoo et al., 2014;Louca et al., 2016;Bahram et al., 2018;Delgado-Baquerizo et al., 2018).
Accumulating evidence hints at the niche specialization of bacteria and fungi, reflected in their global distribution patterns (Frey-Klett et al., 2011;Bahram et al., 2018;Crowther et al., 2019), suggesting that contrasting mechanisms underlie their community assembly processes. At the same time, there appears to be a significant functional overlap between bacteria and fungi for utilizing resources, hinting at the importance of bacterial-fungal interactions (de Boer, 2017). Yet, a simultaneous synthesis of bacterial and fungal community patterns across a wide array of host and environmental habitats is so far lacking.
We have profoundly increased our understanding of global patterns of both bacterial and fungal communities, but most studies tend to examine either bacterial or fungal communities in isolation, often without determining their associated functions. Given that microbial genes and taxa appear to be exchanged across different host and environmental habitats (Sokol et al., 2017;Bahram et al., 2018;Hannula et al., 2019), it remains an open question whether stochastic processes (geographical proximity, random dispersal), or deterministic processes (similarity in environmental conditions, or biotic interactions) determine microbiome structure across habitats. Aside from other abiotic factors, the composition of microbial communities appears to depend on the prevailing and dominant available source of organic carbon (C) in a given habitat (Hoffmann et al., 2013). In general, it is thought that fungi tend to outcompete bacteria in utilizing more complex and varied forms of C, whereas bacteria tend to outcompete fungi for more labile C sources in terrestrial habitats (Boer et al., 2005;Žifčáková et al., 2017). This is likely facilitated by physiological differences, including contrasting stoichiometry, carbon use efficiencies, enzymatic capabilities, and stress tolerance mechanisms between bacteria and fungi (Lynch and Walsh, 2007;Frey-Klett et al., 2011;Sokol et al., 2017;Bahram et al., 2018;Deveau et al., 2018;Naranjo-Ortiz and Gabaldón, 2019). Yet, the role of bacteria in the decomposition of more recalcitrant forms of C and the role of fungi in the utilization of more labile C may be underestimated, and both fungi and bacteria may be equally important in overall C decomposition patterns (Strickland and Rousk, 2010;Bugg et al., 2011;de Vries and Caruso, 2016;Wilhelm et al., 2019). There is also growing evidence that the microbial control of ecosystem processes as well as host health may be mediated by biotic interactions between bacteria and fungi (Mendes et al., 2011). These interactions span an antagonisticmutualistic spectrum to direct predation and parasitism, and range from free-living interactions to mixed biofilms, and intrahyphal colonization by bacteria (Frey-Klett et al., 2011). Collectively these growing insights into the specific properties of bacteria and fungi and their interactions suggest that habitat types may have distinct bacterial-fungal (B/F) ratios and alterations to the B/F balance in either direction may have consequences for ecosystem and host functioning, e.g. in C decomposition, and host health (Allison et al., 2013;Bahram et al., 2018).
Here, by leveraging a global dataset of 13 483 metagenomic samples, we examined the structure and function of fungal and bacterial communities, based on the relative abundance of taxonomic stable rRNA genes and orthologous groups (OGs) respectively ( Fig. S1; Table S1). Metagenomics approach allows us to compare fungal and bacterial compositions simultaneously, and in relation to each other . This may offer advantages over 16S or ITS amplicon sequencing that can only amplify bacteria and fungi, making relative comparisons between them riddled with biases (Handelsman, 2004;Hugenholtz and Tyson, 2008;Grice and Segre, 2011;Tedersoo et al., 2015;Quince et al., 2017). Our selection of publicly available samples included 25 habitat types, spanning internal and external host (including skin, oral and gastrointestinal of different animal species) and environmental (including soil, water and built) habitats. These habitats differ in environmental variation, scale, biotic interactions and dispersal processes. Therefore, we hypothesized that: (i) there is a greater diversity (Shannon diversity index) and relative abundance of both bacterial and fungal taxa and associated functions in environmental and external host compared with internal host microbiomes, where microbes and their functions are likely to be more specialized, and that (ii) certain host associated microbes are likely to be a subset of the microbiome metacommunity of the environment, i.e. displaying a nested community structure. More specifically, due to greater exposure to environmental microbes, external host habitats more likely harbour a subset of microbes from the environment, compared with internal host habitats. We also hypothesized that: (iii) bacteria show stronger habitat associations compared with fungi, due to the greater effect of deterministic than stochastic processes in shaping bacterial communities; and (iv) the relative abundance and diversity ratios of bacteria to fungi decrease in environments with more complex plant-derived C resources, i.e. soils, due to the greater affinity of fungi to plants and their associated resources.

Results and discussion
Not surprisingly, bacteria greatly exceeded fungi numerically across samples, being on average 700-fold relatively more abundant across all habitats ( Fig. 1; Table S1). Both bacteria and fungi, but even more so fungi, displayed higher diversity in terrestrial habitats, particularly in soils and rhizospheres than in marine habitats (Figs. 1, S2-S4). Overall, and partially confirming our first hypothesis, internal host-associated microbiomes were less diverse than most external host-associated and environmental microbiomes (Fig. 1C,D), corroborating the findings from a recent global bacterial metabarcoding analysis (Thompson et al., 2017).
In addition, in line with our first hypothesis, among host habitats, fungi showed the greatest relative abundance as well as diversity in those with direct external exposure such as human nose, oral and skin (Figs. 1, S2-S6). Such habitats are possibly exposed to a greater influence of external environmental factors and connectivity to other habitats leading to a higher exchange of fungal taxa. These effects are less pronounced for bacteria, likely due to their greater dispersal potential across habitats. In addition, the B/F ratio was greater in habitats with presumably higher nutrient availability and lower C/nutrient ratios, which might reflect the greater metabolic flexibility, carbon-use-efficiency, and competitive ability of bacteria compared with fungi in nutrient and soluble C rich environments (Averill and Hawkes, 2016;Bahram et al., 2018). For example, we found a strong association between fungi and the herbivore lifestyle in gastrointestinal tract (GI) communities: plant-fed mice GI and bovine rumen had on average twofold greater fungal relative abundance (mean B/F = 220) compared with omnivores (pig and human GI, mean B/F = 383) and sixfold more compared with carnivores (cat and dog GI, mean B/F = 1305) (Fig. 1). Despite having several orders of magnitude less abundance than bacteria ( Fig. 1), fungi may have a disproportionate ability to thrive in spatially and temporally heterogeneous conditions, largely airborne dispersal mechanisms and the production of hyphae (Fig. S7).
To determine whether similar habitats in terms of conditions and connectivity shape microbial communities, and to test our hypothesis as to whether bacteria have greater habitat associations compared with fungi and to explore the diversity patterns of different habitat groupings, we performed a hierarchical clustering of habitats based on the composition of bacterial communities. The results revealed These data show that the relative abundance and diversity of fungi is higher in terrestrial and aquatic habitats respectively, whereas bacteria show the highest relative abundance in nutritional habitats possibly because greater available nutrition resources for growth. Blue line specifies the median across all habitats. Diversity was calculated based on Shannon index using the genus level abundance matrix, whereas B/F ratio was calculated based on the abundance of SSU reads assigned to bacteria and fungi. [Color figure can be viewed at wileyonlinelibrary.com] three major microbial community clusters (cf. habitat clusters, HC; Fig. 2), that were dominated by environmental (HC1), host associated external (HC2) and host associated internal (HC3) habitats. HC1 included relatively more diverse bacterial communities, and Alpha, Beta and Deltaproteobacteria (Fig. 2). HC2 had increased Actinobacteria and Gammaproteobacteria, whereas HC3 was dominated by Clostridia and Bacteroidetes (Fig. 2).
We found support for our second hypothesis that communities of host-associated microbes are a subset of those in environmental habitats. Notably, dominant fungal, but not bacterial genera, present in the gut were a subset of external host habitats, which were in turn a subset of environmental habitats (Fig. S8). In addition, in line with our third hypothesis, bacteria showed stronger habitat associations and distinctive communities, compared with fungi (Figs. 2, 3, Table S3), perhaps because of their greater environmental associations, comparatively longer evolutionary history and/or greater genomic plasticity (Frey-Klett et al., 2011;Naranjo-Ortiz and Gabaldón, 2019). This was further supported by analysing habitat-specific genera, which revealed that 124 fungal genera representing 7.5% of all fungal genera were significantly associated with a specific habitat, compared with 36.3% in the case of 1360 bacterial genera. However, compared with fungi, the proportion of indicator bacterial genera was less contrasting between habitats, yet it was greater in environmental habitats (Fig. 4).
Overall, the clustering of bacterial communities appeared to depend more on habitat conditions, whereas for fungi it appeared to depend on either the interplay between habitat conditions, spatial dispersion and transfer between proximal environments or none of these . For example, fungal communities in human gut samples clustered with those from the built habitat, so as those from dairy with bovine rumen and human skin. Although this is not direct evidence of a frequent transfer of fungi between these habitats, and we cannot distinguish between transient and non-transient fungi in specific habitats, our results suggest that the interaction of habitat specificity and spatial proximity could affect fungal community composition across diverse habitats, as shown for dispersal as a main driver underlying stochastic community assembly of soil fungi , possibly reflecting the reliance of many fungi on airborne dispersal via spores (Huang and Hull, 2017).
The relatively higher abundance of fungi in terrestrial and particularly soil habitats, in line with our fourth hypothesis, may be related to the major diversification events of fungi being tightly linked to those of plants. The most diverse fungal lineages have affinities for symbiotic, pathogenic and saprotrophic interactions with plants and their resources (Lutzoni et al., 2018;Tedersoo et al., 2018). We suggest that this facilitates the formation of a stable and complex soil fungal communities, whereas stochastic and dispersal-related processes may Between sample Bray-Curtis distance was averaged between samples of biomes, to create a hierachical clustering (ward.d2). Median fungal and bacterial class abundances are shown next to environmental clustering. The three bacterial habitat clusters (HCs) roughly correspond to different balances of the B:F ratio. HC1 (environmental) seems to be driven by the presence and dominance of fungi and complex sources of C. In this cluster, bacteria could rely more on fungal-derived C as well as symbioses and hence more secondary metabolites for enhanced interactions, and these bacterial-fungal interactions drive the bacterial composition and diversity of environmental habitats. HC2 (external host and human influenced environmental habitats) seems to represent ecotones, transitions between more isolated habitats, whereas HC3 (anaerobic isolated habitat like guts) perhaps represents the most specialized habitat which favours bacteria due to more homeostatic conditions (pH, temperature and moisture). Fungi show more diffuse clustering which might be due to their large reliance on passive dispersal (largely airborne), strong affinity to plants and plant-derived C, and less habitat specificity. The top bar plot shows the out of-bag variance explained for each model with the dependent variables on the x-axis. [Color figure can be viewed at wileyonlinelibrary.com] have stronger impact on fungal communities in other habitats. This hypothesis is further strengthened by the observation that most of the habitat specific fungal genera were associated with soil, including multiple indicator genera from Agaricomycetes (Fig. 4, Table S3), which is composed mostly of litter decomposing and plantassociated fungi that thrive in the presence of plant hosts (Lutzoni et al., 2018). Archaeorhizomycetes, a recently discovered fungal class (Rosling et al., 2011), was also almost exclusively found in soil samples. Several Ascomycetes such as Penicillium and Mycosphaerella were among indicator genera for soil and rhizosphere habitats, known soil saprotrophs and plant pathogens respectively (Crous et al., 2009;Diao et al., 2019). Thus, fungi may possess a broad range of C-cycling enzymes in soils, perhaps due to diverse substrates provided by plants and plant-fungal symbioses.
Accordingly, compared with saprotrophic bacteria, saprotrophic fungi have developed higher and wider C/nutrient stoichiometries, greater C demands, higher carbon-use-efficiency with higher C/N resource ratios, and extracellular enzymatic specialization in degrading plant-derived C under aerobic conditions (Keiblinger et al., 2010;Tedersoo et al., 2014). This tight association with plants may provide fungi greater access to C in soils. In line with this, environmental habitats, especially soils, showed the highest relative abundance of carbohydrateactive enzyme genes (CAZymes), which are involved in  the construction and breakdown of complex carbohydrates, in fungi relative to bacteria (Fig. 1). The B/F ratio was negatively correlated to the relative abundance and diversity of fungal CAZymes (r = −0.449, P < 10 −15 , r = −0.419, P < 10 −15 respectively), and to lesser extent to those of bacteria (r = −0.109, P < 10 −15 ; r = −0.207, P < 10 −15 respectively). From specific CAZymes, those related to degradation of lignin showed the strongest correlation to the B/F rRNA gene ratio (Table S2). The more ubiquitous symbiotic interaction between plants and fungi may enhance fungal fitness in relation to bacteria in plant dominated habitats such as soils .
There are a number of other potential factors driving interactions as well assupporting or opposinghabitat associations between fungi and bacteria. Filamentous fungi are especially better suited to deal with more heterogeneous habitats and conditions with resource inequality due to their ability to expand and move towards nutrient rich patches and transfer nutrients through hyphae (Whiteside et al., 2019). By contrast, bacteria appear to be more sensitive to pH and nutrient availability than fungi . As reflected in the HC1 habitat cluster (Fig. 2), these are habitats in which fungi tend to reach their greatest diversity, yet bacteria also thrive here. The mostly hyphal forming fungi in these habitats can act as dispersal vectors (Deveau et al., 2018), provide and connect high quality nutrient resources, and alter localized pH levels, thus providing a mechanism for enhanced bacterial activity, which is especially important under stressful conditions (Worrich et al., 2017). This, together with diverse exchange of metabolites between bacteria and fungi, may be a driver of the composition and diversity of bacterial communities. In some circumstances fungi may even facilitate certain bacteria that are normally considered slow colonizers and poor competitors, enabling them to become locally abundant in symbiosis (Frey-Klett et al., 2011;Phelan et al., 2012;Deveau et al., 2018). While fungi are also influenced by these associations, both beneficially and antagonistically through bacterial products, their associations with plants and plant-derived C is perhaps a stronger factor in shaping their community composition, which is reflected in the relatively lower B/F ratios and higher abundance of fungal CAZymes in environmental habitats, particularly soils (plant dominated habitats) (Figs. 1, S9). Such habitats have a greater exposure to complex C forms and fungal airborne spore dispersal, that would increase the chances of such fungi meeting their preferred substrates (Huang and Hull, 2017).

Conclusions
Our results hint towards the strong role of environmental filtering in structuring cross-habitat bacterial but not fungal communities. The overall diversity and abundance patterns of bacterial and fungal taxa and associated functional genes reflect their mostly contrasting C acquisition strategies, dispersal strategies, morphologies, alongside their subsequent coevolution via complex biotic interactions with each other and macro-organisms under highly variable abiotic conditions. We suggest that future studies simultaneously investigate both bacterial and fungal communities, as well as their functional properties, as their interactions and domain-specific properties may be a strong factor determining the response of ecosystems to environmental change.

Raw read processing
In total, 31 287 samples classified as metagenomic from 405 projects were downloaded from the European Bioinformatics Institute (EMBL-EBI) as of 2018/06/19, using customized scripts available at the fetch-data/ directory of the Supplemental Software package of Coelho et al. (in revision). To confirm the annotation of samples as either metagenomics, the abundance of NOGs reads was considered, i.e. samples with very low NOG abundance relative to the total read were considered as metabarcoding regardless of the information provided by submitters. The basic filtering, functional and phylogenetic profiling are an adapted protocol of the methods used in . In brief, metagenomic and amplicon sequencing reads obtained from public sources were quality-filtered, if the observed accumulated error exceeded 2.5 with a probability of ≥ 0.01, or > 1 ambiguous position or a homonucleotide run > 15 bp was present. Reads were trimmed if base quality dropped below 20 in a window of 15 bases at the 3 0 end, or if the accumulated error exceeded 1 using the sdm read filtering software (Hildebrand et al., 2014). Furthermore, all reads shorter than 70% of the maximum expected read length (per sample) were removed. In total, 697 billion total reads were analysed of which 286 + 253 billion (read pair one and two respectively) passed quality filtering (Table S1). After quality filtering and exclusion of samples from underrepresented habitats and with short reads, we analysed 13 483 samples belonging to 25 habitats. The habitats human gut (n = 7732) and human oral (n = 1586) were the most often represented in our dataset, whereas hot springs were poorly represented (n = 4). Most samples originated from North America (24.0%) and Asia (37.0%). Several geographic regions were not represented in our datasets, including Pacific Ocean, North Sea, Arctic, Atlantic Ocean (Table S1).

Taxonomic annotations
We used a miTag approach implemented in MATAFILER (Hildebrand et al., 2019) to determine bacterial and fungal community composition from metagenome sequence data at the higher taxonomic level, detailed in . Briefly, SortMeRNA (Kopylova et al., 2012) was used to extract potential rRNA genes against the SILVA database version 128 (Quast et al., 2012). For this, we used SSU rRNA gene for taxonomic identification, which is a universal marker for both prokaryotes and eukaryotes. Reads approximately matching these databases with e-values < 10 −4 were further filtered with custom Perl and C++ scripts, using FLASH to attempt merging all matched read pairs. In case read pairs could not be merged, single reads were interleaved such that the second read pair was reverse complemented and then sequentially added to the first read. Lambda (Hauswedell et al., 2014) was used to fine-match candidate interleaved or merged reads to Silva 128 database. The lowest common ancestor (LCA) algorithm adapted from LotuS (Hildebrand et al., 2014) was used to determine the identity of filtered reads based on Lambda matches. This included a filtering step, where queries were only assigned to phyla and classes if they had at least 88% and 91% similarity to the best database hit respectively, thresholds adopted from literature (Yarza et al., 2014). We normalized each taxon by dividing by the total number of reads per sample to account for uneven sequencing depth across samples. Functional annotation of fungal taxa was done using FunGuild database (Nguyen et al., 2016).

Functional annotations
We used a direct a Blast search approach to estimate the functional gene composition of each sample. The quality-filtered reads pairs were first merged using FLASH (Magoč and Salzberg, 2011). In cases were read pairs were not available, singleton reads were used instead. These merged, unmerged and singleton reads were mapped against functional reference sequence databases using DIAMOND 0.9.4 (Buchfink et al., 2014) in blastx mode with the '-k 5 -e 1e-4sensitive' parameters. If two unmerged query reads mapped to the same target, the mapping scores were combined to avoid double counting dependent reads. In such cases, the hit scores were combined by selecting the lower of the two e-values and the sum of the bit scores from the two hits. Based on the highest bit score, longest alignment length and highest percentage identity to the subject sequence the best hit for a given query was selected. Finally, reads with an alignment identity < 50% and matching with an evalue >1 e-9 were excluded.
For functional annotations, we used in silico annotations of metagenomic reads based on a curated database of the orthologous gene family resource eggNOG 4.5 (Huerta-Cepas et al., 2015). eggNOG taxonomic information was used, as reads were mapped competitively against all domains and assigned into prokaryotic and eukaryotic origin, based on the best bit score in the alignment and the taxonomic annotation provided with the database at domain level. In order to estimate the potential of microbes in using C resources, we decided to use the extended orthology represented in eggnog, combined with the precise experimentally validated and functional specific carbohydrate-active enzyme (CAZyme) annotations (Cantarel et al., 2009). For this, we mapped all eggNOG 4.5 amino acid sequences onto the latest CAZy (2019) database (http://bcb.unl.edu/dbCAN2/download/) using DIAMOND. Only high-quality hits (%id > 90, eval < 1e-20, > 80% subject coverage) were accepted to ensure that only valid 'seeds' were retrieved. From these, the eggnog numbers corresponding to CAZymes based on homology searches to the CAZyme database were retrieved. We used our previously derived eggNOG abundance matrix to obtain a CAZyme profile per sample.
All functional abundance matrices were normalized by the total number of reads used for mapping in the statistical analysis, unless mentioned otherwise (e.g. rarefied in the case of diversity analysis, see below). This normalization was chosen as it considers differences in library size, since unmapped (that is functionally unclassified) reads are included. It is important to note that functional and taxonomic abundance estimates represent relative proportions of represented categories, because of biases of sequencing technologies in capturing every molecule in samples. This requires, as we have done, to choose statistical tests that do not assume absolute measurements, and centres analysis of this type on comparisons across the set of samples.

Data analysis
Of note, 13 452 samples were categorized into 25 habitats (after removing potential 16S amplicon sequencing runs and studies misclassified in EBI) using the annotation retrieved from submitters with some modifications; for example, soil and rhizosphere samples were combined as soil (Table S1). For analysing fungal and bacterial diversity, the differences in sequencing depth was accounted for by partial linear regression with diversity and sequence abundance as response and predictor respectively, as used previously in (Tedersoo et al., 2014). For analysing relative abundances of genes and taxa, the data were normalized by total sum of metagenomics reads per sample. For analysing taxa and gene compositions, abundance data were normalized using Hellinger transformation in vegan (Oksanen et al., 2007) of R (R Core Team 2015). The B/F ratio was calculated based on the abundance of SSU reads assigned to bacteria and fungi. Diversity (Shannon Index) was calculated and normalized by the total rRNA gene abundance per sample. To examine diversity, we relied on the diversity of genera, to minimize the misassignments at lower taxonomic level inherent to short reads. Based on our previous study , we found a strong correlation based on genus and OTU diversities in soils (r > 0.8). To test discrimination of the relative abundance of different taxa or functions across habitats, permutational multivariate analysis of variance followed by a generalized canonical discriminant analysis was performed using the candisc package (Friendly et al., 2017). To test the associations of taxa, we used a sparse partial least squares analysis, as implemented in the mixOmics package (Rohart et al., 2017).
To cluster habitats based on their fungal or bacterial composition, miTag tables were filtered for either bacterial or fungal taxa (class for fungi, families for bacteria). These tables were normalized by sum, to obtain a relative proportion of fungi or bacteria only within each sample, filtering taxa with < 1e-5 fractional abundances on average. To obtain scaled Hellinger matrices, we used the sqrt transformed relative abundances and calculated Bray-Curtis distances among samples. To calculate the distances between single habitats, we calculated the mean distance between the respective habitats, for all combinations of habitat pairs. These were then hierarchically clustered using a ward.D agglomerative clustering as implemented in hclust and visualized using the tanglegram function from dendextend (Galili, 2015), combined with itol's visualization of compositions (Letunic and Bork, 2016). The 'nestedness metric based on overlap and decreasing fill' (Almeida-Neto and Ulrich, 2011) was used to calculate nestedness among habitats as implemented in vegan. For this, samples were pooled per habitat and the pooled data were rarefied to the same number of reads per habitat.
Due to heterogeneity in our dataset, we used a machine learning technique (Random Forest) as implemented in RandomForest package (Liaw and Wiener, 2002) to determine whether the relative abundance of a given taxon can be predicted based on the effect of habitat and geography. In addition, indicator genera for each habitat were identified using a two-way indicator species analysis following multiple testing correction, as implemented in labdsv package (https://cran. r-project.org/package=labdsv).

Supporting Information
Additional Supporting Information may be found in the online version of this article at the publisher's web-site: Fig. S1. Distribution of the metagenomic samples used in this study. Colours and size of symbols correspond to habitat types and number of samples, as indicated in the legend. Fig. S2. Cross-habitat differences in relative abundance of fungal classes. Fig. S3. Cross-habitat differences in relative abundance of fungal growth forms. Fig. S4. Cross-habitat differences in relative abundance of fungal functional guilds. Fig. S5. Cross-habitat differences in relative abundance of bacterial classes. Fig. S6. The cross-habitat canonical discriminant analysis of the composition of the top 20 bacterial and fungal classes associating most strongly to habitat types. Fig. S7. Conceptual model showing patterns of fungal and bacterial diversity in relation to habitat type and gradients of associated factors that are proposed to be driving overall diversity patterns. The patterns of diversity of fungi appears to be associated with more spatially and temporally heterogeneous habitats with more available niches with higher and wider C:nutrient values and greater connectivity with other habitats, whereas bacteria show less discernible patterns with these factors, instead showing multiple diversity hotspots across a wide range of habitats, hinting that other more specific factors such as pH as the most important factor driving bacterial diversity, which fungi are less sensitive to. Fig. S8. Soil as a main potential source for fungi but not bacteria in other habitats. The figure shows the occurrences (shaded cells) of top 100 most abundant bacterial and fungal genera across habitats, ordered according to the nestedness measure based on overlap and decreasing fill (NODF). NODF 0 and 100 indicate total randomness and perfect nestedness respectively. To minimize the effect of heterogeneous sequencing depths, data were pooled per habitat and rarefied to the same level across habitats. Text colours represent environmental, external host and internal host habitats, as indicated in the right panel. Fig. S9. Boxplots of relative abundance of CAZyme substrates across three habitat clusters shown in Fig. 2. The boxplot uses only the median value for each category for the 25 habitats, the first p-value refers to n = 25, to account for sample size differences between single habitats (e.g. human gut contained many more samples than Marine). The second p-value (in parenthesis) indicates the p-value across 13,452 samples and their distribution in three habitat clusters. Table S1. Samples used in this study. The table presents a list of sample names, their sequencing and mapping statistics. Table S2. Spearman correlations between CAZyme gene categories and the B/F ratio. Correlations were not done by discriminating on biomes. Table S3. List of indicator bacterial and fungal genera associated with different biomes.