Minireview Modelling spatial patterns in host-associated microbial communities

Summary
Microbial communities exhibit spatial structure at different scales, due to constant interactions with their environment and dispersal limitation. While this spatial structure is often considered in studies focusing on free-living environmental communities, it has received less attention in the context of host-associated microbial communities or microbiota. The wider adoption of methods accounting for spatial variation in these communities will help to address open questions in basic microbial ecology as well as realize the full potential of microbiome-aided medicine. Here, we first overview known factors affecting the composition of microbiota across diverse host types and at different scales, with a focus on the human gut as one of the most actively studied microbiota. We outline a number of topical open questions in the field related to spatial variation and patterns. We then review the existing methodology for the spatial modelling of microbiota. We suggest that methodology from related fields, such as systems biology and macro-organismal ecology, could be adapted to obtain more accurate models of spatial structure. We further posit that methodological developments in the spatial modelling and analysis of microbiota could in turn broadly benefit theoretical and applied ecology and contribute to the development of novel industrial and clinical applications.


Introduction
In addition to playing a key role in global biogeochemical cycles, microbial communities occur in and on multicellular eukaryote hosts, such as plants and animals. Such host-associated microbial communities, which may include bacteria, protists, fungi and archaea, are referred to as 'microbiota' or, when taken together with their genomes, metabolites, viruses and physico-chemical environment, as 'microbiomes' (Berg et al., 2020). Microbiota contribute to the adaptation of the host to varying environments, for example by breaking down compounds for easier absorption and preventing the growth of pathogens. These symbiotic relationships are an important evolutionary force both for the hosts (McFall-Ngai et al., 2013) and their associated microbes (Garcia and Gerardo, 2014). Multicellular hosts and their persistent microbial symbionts are in fact increasingly recognized as unified biological entities (or 'holobionts') from an evolutionary perspective (Simon et al., 2019). Microbiota composition is, however, not solely dependent on the identity and phylogeny of the host organism but is shaped by many factors that boil down to four fundamental ecological processes: selection, drift, dispersal and speciation (Vellend, 2010). Together, these factors lead to temporally varying spatial patterns at different scales, from the level of individual cells to biogeographic patterns. Spatial patterns in microbiota composition have been observed, for example, on comparable tissue types across host individuals in plants (Bakker et al., 2014), and in animals, such as marine invertebrates (van de Water et al., 2018), insects, amphibians (Griffiths et al., 2018), ray-finned fish (Smith et al., 2015) and mammals (Tung et al., 2015).
Microbiota composition varies dynamically as individual microbes interact with their abiotic and biotic environment, such as physical forces, acidity, redox potential and a multitude of chemical compounds. Many of these factors are directly related to the host, such as the local physicochemical conditions, diet and degree of exposure to the host's environment, but interactions between microbes are also highly important, as are stochastic drift and dispersal (Harris et al., 2017). Spatial structure arises from these processes for two broad reasons. First, most environmental factors affecting the host and its microbiota are unevenly spatially distributed. Second, spatial distance combined with random drift can itself generate spatial heterogeneity in the microbiota by limiting the dispersal of microbes between hosts or between different parts of the host's body. Therefore, spatial patterns contain precious information on how host-associated microbial communities establish themselves, persist and change over time.
It has only recently become possible to reveal the complete taxonomic composition of microbial communities through high-throughput sequencing (Caporaso et al., 2011; Quince et al., 2017). The analysis and interpretation of these data, however, pose a series of technical challenges related to the compositional nature of the data (i.e. only relative microbe abundances are measured in each sample), the discretization of molecular observations into taxa, the accuracy of taxonomic and functional assignments and the large number of rare taxa. Analytical methods addressing these problems have been actively developed (Knight et al., 2018), but they often do not explicitly account for spatial variation, especially in the case of host-associated communities (Björk et al., 2018). Yet the development and wider application of spatial modelling techniques is key to providing answers to fundamental questions on the ecology of microbiota, such as the relative influence of dispersal and local environmental conditions on community composition, or how and why communities shift between alternative states (Gonze et al., 2018). Accounting for spatial variation also represents an important remaining challenge in medical microbiology, because its confounding effects can reduce the applicability of human microbiota analyses in diagnostics (Gaulke and Sharpton, 2018; He et al., 2018). Microbiota composition and function have been shown to substantially contribute to chronic diseases such as irritable bowel syndrome, colorectal cancer, fatty liver disease, asthma and dementia (Feng et al., 2018). Thus, the study and diagnosis of these and other diseases could greatly benefit from a better understanding of spatial variability in the microbiota, and of its causes and consequences.
In this review, we first provide an overview of factors influencing spatial patterns in host-associated microbial communities, outline practical considerations in their analysis and present topical questions whose investigation would benefit from spatial modelling techniques and tools. We then cover recent methodological developments in spatial machine learning and probabilistic modelling, which provide new means to harness the spatial information in the data. Finally, we detail several extensions and modifications that could significantly improve the applicability of such approaches in microbiome research. While the human gut is often used as an example in this review as one of the most studied microbiota, the spatial modelling approaches that we discuss have broad applicability, and we also provide illustrating examples from various other host organisms.

Axes of variation in microbiota
Host-associated microbial communities generally differ in composition from free-living environmental communities (Adair and Douglas, 2017). Various factors are known to exert selective pressures on the resident microbes, leading eventually to the establishment of communities with relatively stable compositions. Known stabilizing selective pressures include (i) the host immune system (Hooper et al., 2012) and other compounds produced by the host (Fischbach and Segre, 2016); (ii) metabolic products of other microbes such as antimicrobial toxins (Wexler et al., 2016), enzymes (Rakoff-Nahoum et al., 2016) and signalling compounds (Garcia, 2018); and (iii) physicochemical constraints such as temperature (Sepulveda and Moeller, 2020), pH (Sylvain et al., 2016), oxygen (Albenberg et al., 2014) and, particularly in gastrointestinal communities, host diet (O'Keefe et al., 2015; Riaz Rajoka et al., 2017). The relative strength of the different selective pressures on microbial communities depends strongly on the host and on the specific site (Adair and Douglas, 2017).
Microbial communities are considered to have a relatively stable composition within a single host at a specific host site (Coyte et al., 2015). However, they exhibit large variations along the following axes: (i) between host species at comparable host sites (i.e. on the same tissue type or at the same body site), (ii) across the different surfaces and compartments of a single host species, (iii) between individuals of the same host species at comparable host sites and (iv) along time in a given microbial community when the spatial location of the host individual changes (Fig. 1). A large part of this variation is spatially structured, and we briefly review below the spatial processes and spatially correlated factors that are known to contribute to these patterns. Because of the methodological focus of this review, only a small number of examples from a variety of microbiota are discussed here to outline possible spatially relevant factors in the context of the different axes of variation.

Variation between host species
One widely studied subject in microbial ecology has been the establishment and maintenance of unique microbial communities in different host species at roughly comparable host sites, like in the gastrointestinal tract of animals (Fig. 1A). In a single host individual of any one species, the development of the microbiota often proceeds through time in a predictable fashion through the primary succession of introduced microbes (Ortiz-Alvarez et al., 2018). The vertical transmission of specific microbes from the parent to the offspring is important in most multicellular eukaryotes in providing the offspring with an initial inoculum (Bright and Bulgheresi, 2010).
However, free-living environmental microbes can also establish stable communities in or on various hosts, often gaining fitness benefits in the mutualistic association (Garcia and Gerardo, 2014). Consequently, host species sharing the same spatial distribution tend to have more similar microbiota due to the incorporation of the same environmental microbial taxa. The direct horizontal transfer of microbes between host species further increases this similarity, for example between distantly related ground-dwelling mammals (Perofsky et al., 2019). Hence, the microbiota of spatially coexisting host species are linked together into a wider metacommunity (Adair and Douglas, 2017). Host-microbe associations can nevertheless also be extremely specific, such as when environmental Vibrio fischeri strains colonize the light organs of the squid Euprymna scolopes (McFall-Ngai and Ruby, 1991).
Fig. 1. A. Variation between different host species at a comparable host site, i.e. on the same tissue type or at the same body site. B. Variation across the different surfaces and compartments in a single host species, like the leaves, trunk and roots of a tree species. C. Variation between individuals of the same host species at a comparable host site. D. Variation of the microbiota over time at the same host site of the same individual depending on the individual's location. Parts of this figure were adapted from open source material (NuclearVacuum, 2008; Nordwestern, 2015; Silar, 2016; Rashidi, 2017; DataBase Center for Life Science, 2018).
Selective pressures on the microbiota vary between hosts, and these variations exhibit a phylogenetic signal. For example, the phylogenetic relatedness of mammalian hosts correlates with the similarity of their gut microbiota (Song et al., 2020). However, this is not a universal rule, as distantly related birds and bats have surprisingly similar gut communities. This similarity is likely caused by the reduced immune regulation in these hosts, perhaps attributable to flight physiology (see Song et al., 2020), which results in relaxed constraints on microbiota composition. We have only begun to chart how various mechanisms affect the interspecies differences in the microbiota, and disentangling the relative effects of the horizontal transfer of microbes, vertical transfer and host-specific selective pressure on microbiota composition across host species is an active area of research (Perez-Lamarque and Morlon, 2019; Leftwich et al., 2020). Our understanding of these mechanisms could greatly benefit from incorporating spatial information explicitly in the modelling frameworks.

Variation across surfaces and compartments in a single host species
The composition of microbial communities varies greatly between external surfaces and internal sites in individual host species (Fig. 1B). Indeed, most often the communities on external surfaces appear to be regulated by environmental variables such as temperature, and internal communities by host-related factors like the immune system and diet (Woodhams et al., 2020). Community composition also varies among the external or internal sites. For instance, communities demonstrate distinct spatial distributions along the mammalian gastrointestinal tract down to the scale of specific microhabitats, such as the lumen of the large intestine, mucus layers and colonic crypts (Zhang et al., 2014; Donaldson et al., 2016). The communities can be highly organized down to the micrometre scale on surfaces such as the human tongue dorsum (Wilbert et al., 2020). Distinct communities are also observed between the different compartments of plants such as the rhizosphere, phyllosphere, and leaf and root endospheres (Hacquard, 2016). Understanding the processes at play at these finest spatial scales would benefit from the sampling of communities along spatial gradients using a dedicated methodology, rather than at distinct sites in and on the host organism (see 'Accounting for scale' below). These types of analyses would represent a shift from thinking in terms of categorical host sites to a continuous landscape of host-associated microbiota (Proctor and Relman, 2017) and would require spatially explicit modelling approaches.

Variation between individuals of the same species
Another axis of variation can be observed in microbiota composition between individuals of the same host species, when sampling a community at the same host site (Fig. 1C). This type of variation is currently receiving much attention in humans due to its medical relevance. For example, comparing the gut microbiota of patients suffering from a range of diseases to those of healthy controls has led to a number of discoveries on the role played by the gut microbiota in disease pathogenesis (Feng et al., 2018). The differences in host site-specific microbiota in both diseased and healthy hosts relate back to the individual life histories of the hosts and include, for example, their genetic background (Benson et al., 2010), initial colonization with microbes (Callens et al., 2018) and related founder effects in the microbial community (Litvak and Bäumler, 2019), environmental exposures (Chiu et al., 2020), diet (Riaz Rajoka et al., 2017), ageing-related changes (Langille et al., 2014), medication and the diseases themselves (Malla et al., 2019). Many of these factors are unevenly distributed across space, thus also producing spatial patterns in microbiota composition. The factors affecting interindividual differences are also often unknown or unmeasured due to practical constraints. Because spatial information captures at least part of this variation, incorporating it in the analysis would be beneficial even when the source of the variation is unknown. Furthermore, as in the case of variations between host species, incorporating spatial information is an efficient means to account for the introduction of microbes from the environment or through horizontal transfer from other individuals.

Variation in the same community over time
The fourth axis of variation often examined in microbiota studies is between states of the microbial community in the same host individual at the same host site over time (Fig. 1D). The current state of a microbial community depends on its past states and on the influence of factors with uneven spatial distribution, which are described above. Thus, it is impossible to completely separate spatial patterns from temporal variation in the communities.
While host site-specific communities in the same individual can exhibit stable composition over time (Coyte et al., 2015), hourly to daily variations are common, for example in the mammalian gastrointestinal tract (David et al., 2014; Maurice et al., 2015; Voigt et al., 2016). It is likely that most host-associated communities have multiple stable configurations, which can provide the hosts with similar necessary functions and between which they can 'switch'. Functional redundancy between two communities does not mean that they are equivalent, however, and communities with different initial taxonomic compositions can be expected to react in different ways to stressors (Moya and Ferrer, 2016). Disease-associated (or 'dysbiotic') states of the microbiota are reported to be especially unstable through time, likely due to a reduction in the host's regulation ability (Zaneveld et al., 2017).
Longitudinal studies of the human gut microbiota have shown that the (geographical) relocation of individuals can have measurable long- and short-term impacts on their microbiota (David et al., 2014; Kaplan et al., 2019). Relocation-associated changes have also been observed over time in the microbiota of migratory birds and stingless bee colonies (Hall et al., 2021). While true temporal models are beyond the scope of this review, spatial approaches should thus not overlook the possible temporal aspects of the data. Microbiota time-series data with enough spatial coverage to allow the simultaneous investigation of spatial and temporal patterns are currently rare, but such studies will be crucial to establish a mechanistic understanding of the communities.

Spatial patterns and the importance of scale
As seen in the previous section, spatial structures are thought to stem from the horizontal dispersal of microbes between hosts (Antwis et al., 2018) and from the environmental filtering of communities by spatially correlated factors. In humans, for instance, such factors include diet for the gut microbiota (Filippo et al., 2010) or lifestyle and environment for the skin microbiota (Lehtimäki et al., 2018). However, the relative importance of the different spatially correlated factors is likely scale-dependent (Ladau and Eloe-Fadrosh, 2019). Hence, when designing studies on host-associated microbiota (along any axis of variation), one should carefully consider the spatial scale of the sampling and the possible processes affecting community composition at that scale.

Spatial patterns across scales
The scale-dependence of spatial patterns in microbiota composition is well illustrated by the known patterns of inter-individual differences in the human gut microbiota. Within a household, the horizontal dispersal of microbes increases the similarity between cohabiting individuals (Finnicum et al., 2019). At the neighbourhood scale, vegetation cover in the living environment affects inter-individual differences, likely due to the dispersal of environmental microbes (Parajuli et al., 2020). At the regional to country scale, spatial patterns in microbiota composition can be attributed to differences in ethnicity (Deschasaux et al., 2018) and lifestyle (Gupta et al., 2017), which both likely affect selection, through genetics and diet for example. At the global scale, the observed patterns are likely due to geographically variable microbial inputs from the environment and to selection through diet and cultural traditions (Gupta et al., 2017; Senghor et al., 2018). Although data on dispersal limitation are sparse for the human gut microbiota, it might play an important role in amplifying geospatial differences. Indeed, dispersal rates appear to differ between bacterial taxa in the human gut (Harris et al., 2017), and in other mammals, dispersal limitation has been shown to contribute to interspecies differences in gut microbiota composition (Moeller et al., 2017). Finally, if the differences in lifestyles and diets between industrialized and rural populations (O'Keefe et al., 2015) are maintained over the timescales of microbial evolution, speciation through adaptation of the gut microbiota to an industrialized lifestyle (Sonnenburg and Sonnenburg, 2019) might also amplify geospatial community differences.
In addition to the scale-dependent relative importance of different processes, the scale of sampling also likely affects the phylogenetic or taxonomic scale of the observed differences (Ladau and Eloe-Fadrosh, 2019). For example, global variability in human gut microbiota composition can be reduced to broad community types separable at the genus level (Costea et al., 2018), but diverging functional traits within populations may only be observable at the species level (Tett et al., 2019). Furthermore, cohabiting individuals can share microbial species at the strain level (Truong et al., 2017), and specific microbial strains can be stably present in the gut community of a host individual for decades (Koo et al., 2019).

Accounting for scale
A consequence of the above is that the spatial grain of the study should guide the choice of its design, sampling and taxonomic resolution (Fig. 2), and the possible integration of information on the function and metabolic activity from 'omics' data (Knight et al., 2018; Ladau and Eloe-Fadrosh, 2019). Indeed, while proper modelling techniques are instrumental in addressing the spatial aspects of microbiome research, a prerequisite is that the data enable these analyses. This point comes down to one of the basic principles of computer science, 'garbage in, garbage out', first noted well over a century ago (Babbage, 1864).
Identification of community members should be performed at the most accurate practically available resolution, as the lower units can always be hierarchically grouped at higher levels, for example to reduce the computational burden of the analysis. Strain-level identification of microbes is currently possible even from (deep) metagenomic sequencing of bulk samples (Anyansi et al., 2020). For species-level identification, shallow shotgun sequencing has emerged as a viable alternative to 16S rRNA gene metabarcoding, with a higher taxonomic coverage and accuracy at only slightly increased sequencing cost (Hillmann et al., 2018). Recent developments in long-read sequencing might facilitate the use of the full-length 16S rRNA gene in identifying microbes down to the species or even strain level (Johnson et al., 2019). Finally, single-cell isolation, amplification and sequencing of either DNA (Xu and Zhao, 2018) or RNA, which enables further functional characterization, can be applied to identify individual cells at high resolution.
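The hierarchical grouping of fine-resolution units into higher taxonomic levels mentioned above can be illustrated with a minimal Python sketch. The function name `aggregate_counts`, the strain identifiers and all counts are hypothetical and chosen only for illustration; real pipelines would take the lineage from a taxonomic assignment step.

```python
from collections import defaultdict


def aggregate_counts(counts, lineage, rank):
    """Sum the counts of all fine-grained units that share a label at `rank`.

    counts  -- dict mapping a unit identifier (e.g. a strain) to its read count
    lineage -- dict mapping each unit identifier to a dict of rank -> label
    rank    -- the taxonomic rank to aggregate to (e.g. "species" or "genus")
    """
    grouped = defaultdict(int)
    for unit, n in counts.items():
        grouped[lineage[unit][rank]] += n
    return dict(grouped)


# Invented toy data: three strain-level units from one sample.
strain_counts = {"strain_1": 120, "strain_2": 30, "strain_3": 50}
strain_lineage = {
    "strain_1": {"species": "species_A", "genus": "genus_X"},
    "strain_2": {"species": "species_A", "genus": "genus_X"},
    "strain_3": {"species": "species_B", "genus": "genus_X"},
}

species_counts = aggregate_counts(strain_counts, strain_lineage, "species")
genus_counts = aggregate_counts(strain_counts, strain_lineage, "genus")
```

Because aggregation only ever discards resolution, the same strain-level table can serve analyses at several taxonomic scales, whereas data collected at the genus level cannot be refined afterwards.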
Most microbiota studies currently use bulk samples, obtained for instance from faeces or by swabbing, to compare communities. While these approaches have proved useful in elucidating large-scale differences in microbiota composition between different hosts and even between sites within individual hosts, both faecal samples (Ingala et al., 2018) and swabs (Prast-Nielsen et al., 2019) are slightly biased proxies for the total community composition at the focal host sites. Thus, new minimally invasive methods and accurate sampling tools and technologies would be highly beneficial for this field of study (Tang et al., 2020). New methods with a high spatial resolution at the micrometre to millimetre scales (Fig. 2) are required to better understand the composition, function and organization of the communities on the host surfaces. Sampling communities at these scales while preserving their spatial organization is inherently difficult, but methods such as fluorescence spectral imaging (Wilbert et al., 2020) and sampling techniques such as cryofracturing (Sheth et al., 2019) have previously been shown to be highly applicable.

Open questions in spatial microbiota ecology
While microbiota have been intensively studied over the past decade, our understanding of the ecological processes governing their composition (i.e. selection, drift, dispersal and speciation) and of their relative importance in different contexts and at different scales is still limited (Woodhams et al., 2020). Key open questions in the field relate in particular to the importance of horizontal transfer between individuals, of functional redundancy between communities, of founder effects and stochasticity in community dynamics, and of rapid evolution within the host lifetime (Table 1). While much research effort currently focuses on humans, these are general questions that can (and should) also be addressed in other hosts, if only because they are easier to study experimentally. In addition to these general questions, it is of major interest for medical research to better understand geographical variation in the human microbiota and its significance to human health. Because these questions all involve a spatial aspect, the use of models that explicitly take spatial structure into account represents an important step towards addressing these goals.

Spatial modelling frameworks
Much of the methodological development in statistical microbiology has focused on between-sample comparisons assessing the effect of different conditions or treatments, disregarding the spatial and ecological contexts of  (Kurtz et al., 2015) but often without accounting for spatial structure. Macrobial (macro-organismal) community ecology and macroecology represent promising sources of inspiration for spatial models in microbial ecology. Community ecology and macroecology differ in that the former is concerned with the spatial and temporal scales associated with a community of locally coexisting organisms (these scales depend on the organisms considered), while macroecology is concerned with ecological patterns across scales, from the community scale to the global scale, for both macrobes and microbes. Despite the significant differences between micro- and macrobial ecology (Ladau and Eloe-Fadrosh, 2019), many of the methods developed to study communities of macrobes are potentially also applicable to microbial communities.
According to the conceptual synthesis in community ecology, 'species are added to communities via speciation and dispersal, and the relative abundances of these species are then shaped by drift and selection, as well as ongoing dispersal, to drive community dynamics' (Vellend, 2010). These processes are inherently the same regardless of the identity and size of the biological organisms. High-throughput sequencing methods now enable comprehensively assessing the composition of microbial communities, which provides microbial ecologists with community composition data increasingly similar to those analysed in traditional community ecology. A number of technical limitations remain, such as uneven DNA extraction efficiency, PCR and sequencing errors, uneven taxonomic resolution, incomplete reference databases and the compositionality of the data (Knight et al., 2018). While this may bias the ecological interpretation of the data (Sommeria-Klein et al., 2016), the uncertainty thus introduced has been steadily decreasing and has now become, in the case of host-associated microbial communities, comparable in magnitude to that of traditional community ecology data (Rocchini et al., 2011). Finally, the increasing use of DNA metabarcoding in plant and animal ecology further contributes to a convergence in data types and methodological approaches between microbial and macrobial ecology (Deiner et al., 2017). This provides the opportunity for microbial ecologists to tap into the rich body of models accounting for a spatial structure that has been developed for macrobial community ecology.
We first review below the classical statistical methods used in macrobial ecology to account for spatial structure and their limitations. In addition to these methods mainly based on dissimilarity metrics and linear models, both macrobial and microbial ecology have seen a rising use of 'predictive modelling' approaches over the last decade. These approaches can be divided into two broad categories, which we review in the subsequent two sections in a spatial context: classical machine learning approaches, for instance, based on decision/regression trees or neural networks, and probabilistic modelling approaches, sometimes referred to as 'probabilistic machine learning' (Ghahramani, 2015). Both types of approaches rely on optimizing, or fitting, a potentially high-dimensional model to the data. However, in machine learning the inference is based on a learning algorithm, while in probabilistic modelling it is based on an explicit probabilistic model (i.e. a mathematical model that predicts the probability distribution of outcomes), which can be more easily constrained by assumptions about the data. Both types of approaches have in common the ability to readily reveal non-linear dependencies and interactions in the data and to make predictions on new data.

Classical statistical ecology
Table 1. Key open questions related to spatial variation in host-associated microbial communities.
Processes shaping community composition: How does the relative importance of the four fundamental processes governing community composition, that is, selection, drift, speciation and dispersal, vary across spatial contexts (e.g. between host species, tissue types or body sites, environments, spatial scales)?
Horizontal dispersal: How important is the horizontal dispersal of microbes between host individuals and species, and how does it depend on the characteristics of the microbes (e.g. physiology, relative abundance and activity) and on environmental conditions?
Functional redundancy: How does the host selectively filter newly acquired microbes from the environment and how functionally similar are comparable microbiota (i.e. same host species and same host site) in different geographical locations?
Founder effects and ecological succession: How do founder effects and interactions between pre-established and introduced microbes affect community assembly?
Rapid microbial evolution: To what extent does microbial evolution taking place within the host influence its microbiota over short time scales (i.e. over the lifetime of the host or over a few host generations)?
Medical relevance of spatial patterns: What is the extent of geographical variation in dysbiotic (i.e. disease-associated) human microbiota, and how could this variation affect the pathophysiology, diagnosis, prognosis and treatment of different diseases?
A common approach for the analysis of spatial community composition data in both macrobial and microbial ecology is to normalize taxa abundances per sample, compute pairwise dissimilarities in composition between samples (β-diversity) and perform analyses on the resulting dissimilarity matrix. The advantage of this approach in a spatial setting is that it easily enables investigating the effect of spatial distance on the pairwise dissimilarities between samples. Classical analyses include simple statistical tests (e.g. Mantel tests against spatial distance or environmental dissimilarity), clustering (e.g. Hierarchical Clustering, Partitioning around Medoids) and ordination (e.g. Multidimensional Scaling; Legendre and Legendre, 2012). Until now, these methods have proved widely useful in analysing the high-dimensional data produced by high-throughput sequencing in microbial ecology (Paliy and Shankar, 2016). Despite their established usefulness, dissimilarity-based approaches can also produce misleading results and obscure data interpretation (Warton et al., 2012). This is particularly true in the case of microbial community data, which are characterized by compositionality, a high number of rare taxa leading to sparse composition matrices (i.e. with many zeros), and a strong heterogeneity in total read count across samples (Knight et al., 2018). Moreover, most dissimilarity-based statistical methods do not allow incorporating additional data after analysis or making predictions on new samples. This limits their use, for instance, in medical diagnostics and environmental monitoring (Cullen et al., 2020). Fully multivariate statistical approaches, in which the original composition of all samples is jointly analysed, are an alternative to dissimilarity-based methods. They are both less biased and more statistically powerful, especially when the samples are spatially distributed (Legendre et al., 2005). In a fully multivariate approach, the spatial structure of the data can be accounted for by decomposing the matrix of between-sample spatial distances into a set of eigenvectors, called Moran's Eigenvector Maps or Principal Coordinates of Neighbour Matrices, to be used as explanatory variables representing the possible patterns of spatial autocorrelation associated with the sample layout (Legendre and Legendre, 2012).
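The dissimilarity-based workflow described above (normalize abundances, compute pairwise dissimilarities, test them against spatial distance) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: Bray-Curtis is chosen here as one common β-diversity measure, the Mantel test is one-sided (testing for a positive correlation) and based on Pearson correlation with row/column permutations, and the transect coordinates and counts are invented toy data. All function names are hypothetical.

```python
import math
import random


def relative_abundances(counts):
    """Normalize one sample's raw counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))


def distance_matrix(items, metric):
    """Full symmetric matrix of pairwise distances under `metric`."""
    n = len(items)
    return [[metric(items[i], items[j]) for j in range(n)] for i in range(n)]


def upper_triangle(matrix):
    """Flatten the strictly upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """One-sided Mantel test: permute the samples of dist_a and compare
    the permuted correlations with the observed one."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented toy data: six samples along a transect, three taxa per sample.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
counts = [[90, 10, 0], [80, 15, 5], [60, 30, 10],
          [40, 40, 20], [20, 50, 30], [5, 45, 50]]

profiles = [relative_abundances(c) for c in counts]
community_dist = distance_matrix(profiles, bray_curtis)
spatial_dist = distance_matrix(coords, math.dist)
r, p_value = mantel_test(community_dist, spatial_dist)
print(f"Mantel r = {r:.3f}, p = {p_value:.3f}")
```

The sketch also makes the limitation noted above concrete: once the samples are reduced to a dissimilarity matrix, the result is a single correlation and p-value, with no fitted model that could be applied to a new sample.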
Standard multivariate statistical methods nevertheless assume linear relationships, which makes them inappropriate to model taxa distributions along spatial gradients when taxa abundances exhibit non-linear or even non-monotonous spatial trends (Austin, 2007;Paliy and Shankar, 2016). Furthermore, the commonly used multivariate methods cannot account for the multiple levels of spatial organization of hostassociated communities, forming a nested hierarchy (Björk et al., 2018).
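As an illustration of the dissimilarity-based analyses discussed above, a permutation-based Mantel test can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than a reference implementation: the function names (`mantel_test`, `upper_triangle`, `pearson`), the one-sided alternative, the Pearson statistic and the toy transect data are all chosen here for illustration and do not come from the studies cited.

```python
import random
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_test(dissim, spatial, n_perm=999, seed=0):
    """One-sided Mantel test: is community dissimilarity positively
    correlated with spatial distance? Returns (statistic, p-value)."""
    n = len(dissim)
    spatial_flat = upper_triangle(spatial)
    observed = pearson(upper_triangle(dissim), spatial_flat)
    rng = random.Random(seed)
    order = list(range(n))
    n_extreme = 0
    for _ in range(n_perm):
        # Permute sample labels: rows and columns are shuffled together.
        rng.shuffle(order)
        permuted = [[dissim[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(upper_triangle(permuted), spatial_flat) >= observed:
            n_extreme += 1
    # The observed arrangement counts as one of the permutations.
    p_value = (n_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (hypothetical): eight samples along a transect, with community
# dissimilarity strictly proportional to spatial distance.
positions = list(range(8))
spatial = [[abs(a - b) for b in positions] for a in positions]
dissim = [[0.1 * abs(a - b) for b in positions] for a in positions]
r, p = mantel_test(dissim, spatial)
```

Because the test operates on pairwise dissimilarities only, it inherits the limitations described above: it cannot be updated with new samples and gives no taxon-level prediction.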

Machine learning
Data-intensive research in microbial ecology often takes advantage of popular machine learning methods such as neural networks, decision trees, support vector machines, gradient boosting and ensembles of learners. These techniques have become increasingly popular due to their relatively easy adoption and the limited need for human intervention during the analysis (Cordier et al., 2019; Qu et al., 2019). They are highly flexible and require little prior parameterization, which makes them well suited for studies with a limited understanding of the mechanisms at play and of the relative importance of the different variables, as is often the case in microbial ecosystems. They are also well suited to data sets with complex structure that exhibit non-linear dependencies and interactions between many variables. They can be used to identify useful properties from the data, such as the dependency between the abundances of taxonomic or functional groups and biometric, environmental, and spatial variables. These properties can then be used, for instance, to optimize model performance for diagnosis or prognosis in medical applications, or for environmental monitoring.
Machine learning methods often feature a large number of parameters with respect to the number of data points, which makes them highly flexible but also prone to overfitting the training data and generalizing poorly. To remedy this, model performance and accuracy are typically evaluated through cross-validation, that is, by quantifying how well the model generalizes to new observations (known as 'out-of-sample' data), rather than through goodness-of-fit to a single dataset (as measured by R² or a P-value in classical statistics). Care should nevertheless be exercised when dealing with small sample sizes (Vabalas et al., 2019), or in the case of spatially autocorrelated data, in which case spatially disjoint training and test (validation) sets should be used to avoid overestimating model performance (Meyer et al., 2018; Schratz et al., 2019).
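One simple way to obtain spatially disjoint training and test sets is to group samples into spatial blocks and hold out one block at a time. The sketch below assumes two-dimensional sample coordinates and square blocks of a user-chosen size; the function name `spatial_block_folds`, the blocking scheme and the example coordinates are illustrative assumptions, not the procedure of the studies cited above.

```python
from collections import defaultdict
from math import floor


def spatial_block_folds(coords, block_size):
    """Build leave-one-block-out folds from (x, y) sample coordinates.

    Samples falling in the same square grid cell of side `block_size`
    always end up on the same side of the train/test split.
    Returns a list of (train_indices, test_indices) tuples.
    """
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        cell = (floor(x / block_size), floor(y / block_size))
        blocks[cell].append(i)
    folds = []
    for held_out in sorted(blocks):
        test = list(blocks[held_out])
        train = [i for cell in sorted(blocks) if cell != held_out
                 for i in blocks[cell]]
        folds.append((train, test))
    return folds


# Hypothetical sampling sites (arbitrary distance units).
coords = [(0.2, 0.1), (0.8, 0.4), (5.1, 0.3), (5.6, 0.9), (0.5, 5.2), (5.3, 5.7)]
folds = spatial_block_folds(coords, block_size=2.0)
```

The block size should be chosen with the expected range of spatial autocorrelation in mind. Samples lying close to a block boundary can still be near neighbours of test samples, so a buffer zone between training and test blocks may be needed in practice.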
Although these methods perform well in classification and regression tasks, their main limitation is that they function as 'black boxes': the fitted model has limited interpretability, and it does not usually account for the underlying mechanisms. The structure learned by the model can nevertheless be investigated. For instance, the relationship between input and output variables can be visualized through partial dependence plots, obtained by varying the input variables one at a time within the trained model (Greenwell, 2017). Such a posteriori investigations may help understand how specific variables and their interactions contribute to the final predictions. Importantly, they make it possible to estimate effect sizes for individual or multiple interacting variables.
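The computation behind a one-variable partial dependence curve can be sketched as follows. The trained model is represented here by a hypothetical stand-in function (`toy_model`) with made-up inputs named latitude and pH; any fitted model exposing a per-sample prediction function could be substituted.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average model prediction when column `feature` is set to each grid
    value in turn, all other columns keeping their observed values."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a trained model (not a fitted model)."""
    latitude, ph = row
    return 2.0 * latitude + ph ** 2


# Hypothetical observations: (latitude, pH) for three samples.
rows = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
curve = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 1.0, 2.0])
```

In this toy case, the curve increases by 2.0 per unit of the first variable, which recovers the effect size built into the stand-in model. With correlated inputs, forcing one variable to a grid value can create unrealistic combinations, so such curves should be read with care.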
Few studies on microbiota using machine learning have so far incorporated spatial information, which can often be attributed to an insufficient number of samples for reliably detecting spatial patterns. Yet, including spatial data and analyses in studies with an adequate sample size can lead to remarkable performance gains. A study using random forests for disease diagnosis in a Chinese province showed, for instance, that the classification accuracy improved as finer spatial scales were considered, from the regional to the neighbourhood scale (He et al., 2018). It also found that extrapolating locally trained models to larger geographic areas led to poorer performance. Variable selection with random forests and gradient boosting trees was also used in the analysis of gut microbiota from a Finnish population cohort to predict fatty liver disease across geographical regions (Ruuskanen et al., 2021), and ensemble logistic regression was used to trace the geographical origin of clams based on 16S rRNA metabarcoding data on their microbiota (Milan et al., 2019). These studies, however, incorporate spatial information as discrete location information rather than as continuous variables, and accounting for the spatial structure of the data more explicitly could further improve model performance.

Probabilistic modelling
The main limitations of classical machine learning methods are that they estimate uncertainty poorly and that the underlying models are either implicit or difficult to interpret. An alternative is to rely on an explicit probabilistic model, associated with a likelihood function. Inferences can then be made on the data through likelihood maximization, or through Bayesian inference provided that prior distributions have been specified for the inferred parameters. Probabilistic modelling makes it possible not only to provide rigorous uncertainty estimates but also to guide inference with a priori knowledge on data structure or on the mechanisms at play, and thus to give a clearer biological interpretation to the inferred parameters. While classical statistics also relies on fitting a (sometimes implicit) probabilistic model to data, increasing computing power is now allowing for more and more complex models, which may rely on non-normal distributions, accommodate nonlinear relationships between variables and be hierarchically structured (Gelman, 2014). As in non-probabilistic machine learning, it has become a common approach to fit highly flexible models and to assess their generalizability through cross-validation (Ghahramani, 2015). An alternative is to fit models that are more strongly constrained by hypotheses about the data, and to then compare either their goodness-of-fit or their out-of-sample predictive performance to reveal the hypothesis most consistent with the data.
The explicit probabilistic modelling of the spatial variation in host-associated microbial communities, and of their scale-dependent relationship with the host and the environment, can be achieved through Species Distribution Models (SDMs) borrowed from macrobial ecology. SDMs have long been used to predict the spatial distribution of species based on observed species occurrences and bioclimatic variables (Miller, 2010). Nevertheless, simple bioclimatic models cannot capture the effect of many factors affecting species distributions, such as biotic interactions and dispersal limitation (Pearson and Dawson, 2003). This has led to the introduction of hierarchical models able to incorporate these factors in a scale-dependent way while accounting for multiple sources of uncertainty in the data (Hefley and Hooten, 2016). One of the latest developments of this line of research is Joint Species Distribution Models (JSDMs), which enable the joint estimation of the distribution of multiple species based on both abiotic conditions and biotic interactions (Latimer et al., 2009; Ovaskainen and Abrego, 2020). From a technical standpoint, JSDMs are generalized linear mixed models, in which the spatial structure of the data can be accounted for through the covariance matrix of the residuals. JSDMs can be applied to both count and presence-absence data. They can account for environmental covariates, functional traits and phylogenetic relationships between the organisms, and produce model-based variance partitions, ordinations and co-occurrence networks as output.
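To make the generalized linear mixed model structure concrete, one possible latent-factor formulation of a spatial JSDM is written out below. It is a sketch under stated assumptions: the notation is introduced here for illustration, and the exponential form of the spatial covariance is only one of several possible choices, so the equations should not be read as the exact specification used in any of the cited works.

```latex
\begin{aligned}
y_{ij} &\sim F\bigl(g^{-1}(L_{ij}),\, \phi_j\bigr), \\
L_{ij} &= \mathbf{x}_i^{\top}\boldsymbol{\beta}_j
          + \sum_{h=1}^{H} \eta_{ih}\,\lambda_{hj}, \\
\boldsymbol{\eta}_{\cdot h} &\sim \mathcal{N}\bigl(\mathbf{0},\, \mathbf{K}_h\bigr),
\qquad
[\mathbf{K}_h]_{ii'} = \exp\!\left(-\frac{d_{ii'}}{\alpha_h}\right).
\end{aligned}
```

Here $y_{ij}$ is the occurrence or abundance of taxon $j$ in sample $i$, $F$ is the observation distribution with link function $g$ and optional dispersion parameter $\phi_j$ (e.g. a Bernoulli distribution with probit link for presence-absence data, or a Poisson distribution with log link for counts), $\mathbf{x}_i$ holds the covariates of sample $i$, and $\boldsymbol{\beta}_j$ the responses of taxon $j$ to these covariates. The $H$ latent factors $\eta_{ih}$, with taxon-specific loadings $\lambda_{hj}$, capture residual variation: their covariance across samples decays with the spatial distance $d_{ii'}$ at a scale $\alpha_h$, while the loadings induce a residual taxon-to-taxon covariance $\boldsymbol{\Lambda}^{\top}\boldsymbol{\Lambda}$ from which co-occurrence networks can be derived.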
While the computational costs of the earlier JSDMs were intractable for microbial data, it is now possible to handle hundreds of taxa and samples in a reasonable time (Tikhonov et al., 2020a; Tikhonov et al., 2020b). A few recent studies applied JSDMs to investigate spatial patterns in microbiota (Björk et al., 2018; Aivelo et al., 2019; Minard et al., 2019). In particular, a recent study adapted JSDMs to microbiota by incorporating host phylogeny and traits and illustrated this development on bird and sponge microbiota (Björk et al., 2018). JSDMs were also used to show that variation in the abundance of microbial taxa in tick-associated microbiota is mostly associated with host-specific factors, although environmental effects can be large for individual microbes, including human pathogens (Aivelo et al., 2019). This study demonstrates the use of JSDMs to partition variance between spatial effects and host-related factors and to obtain co-occurrence networks. A study conducted on caterpillar microbiota revealed phylogenetic structuring in the communities, with related microbial taxa exhibiting similar patterns (Minard et al., 2019). The communities displayed high variation between individual caterpillars, on which neither host- and host plant-related factors nor spatial structure appeared to have a significant influence. These studies illustrate the potential of JSDMs to model microbiota, including microbe-to-microbe interactions and the relative effect of different processes on the occurrence or abundance of microbial taxa at different scales.
The use of JSDMs for host-associated microbial ecology nevertheless has a number of limitations. First, microbial communities tend to comprise a higher share of rare taxa than communities of macrobes. Although latent variables and inter-taxa associations can be used to improve predictions on rare taxa (Tikhonov et al., 2017), most taxa are likely to occur too sparsely to be amenable to analysis with JSDMs unless it is performed at a coarse enough taxonomic resolution (by grouping at higher levels). Second, current JSDMs do not explicitly account for the compositionality of microbial community data, which may bias the inference (Björk et al., 2018). Third, they only allow host-associated factors to influence the microbiota but not the other way around (Aivelo et al., 2019), and evolutionary processes are not accounted for, which can limit their predictive potential (Cotto et al., 2020). The lack of evolutionary processes is a stronger concern when dealing with microbes compared with macrobes, as the timescale of their evolutionary adaptation is much shorter (Ferreiro et al., 2018).
Other probabilistic modelling approaches have been developed to account for the specificities of microbial data (compositionality, a highly heterogeneous read count across samples and many rare taxa), although they do not yet explicitly account for spatial structure. They usually model the sampling process explicitly using probability distributions belonging to the Dirichlet-multinomial family (La Rosa et al., 2012). This forms the mathematical foundation for various model-based analyses of microbiota: reconstruction of association networks (Kurtz et al., 2015), classification of microbiota into discrete categories based on their composition (Holmes et al., 2012; Ding and Schloss, 2014) and construction of assemblages of taxa based on their co-occurrence and covariance across samples (Hosoda et al., 2020). Assemblage models are a particularly interesting alternative to taxon-centric models for modelling high-diversity microbial datasets (Sommeria-Klein et al., 2020). The resulting assemblages may be interpreted as groups of microbes with the same ecological niche, and the decomposition into assemblages strongly reduces the dimensionality of the data for downstream analyses. Finally, neutral ecological models describe the stochastic dynamics of ecological communities (including dispersal, drift and speciation) under the assumption that all taxa are equivalent in their competitive abilities. They yield stationary distributions belonging to the Dirichlet-multinomial family for the composition of communities, and they have been used for the ecological interpretation of human gut microbiota data (Harris et al., 2017).
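As a concrete reference point for the Dirichlet-multinomial family mentioned above, the log-probability of a vector of read counts can be computed directly from log-gamma functions. The sketch below uses the standard form of the Dirichlet-multinomial probability mass function; the function name and the example counts are chosen here for illustration.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial
    distribution with concentration parameters `alpha` (all > 0)."""
    n = sum(counts)   # total read count of the sample
    a = sum(alpha)    # total concentration
    # Multinomial coefficient: n! / prod(x_k!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Ratio of gamma functions from integrating out the proportions
    log_norm = lgamma(a) - lgamma(n + a)
    log_terms = sum(lgamma(x + al) - lgamma(al)
                    for x, al in zip(counts, alpha))
    return log_coef + log_norm + log_terms


# With two taxa and alpha = (1, 1), every split of n reads is equally
# likely, so each of the n + 1 possible outcomes has probability 1 / (n + 1).
prob = exp(dirichlet_multinomial_logpmf([1, 3], [1.0, 1.0]))
```

Small values of the total concentration correspond to strong overdispersion relative to a multinomial distribution, which is one way such models accommodate heterogeneous and sparse count data.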

Perspectives
In the light of advances in other research fields, the potential of predictive modelling for the analysis of spatial data still appears largely underexploited in microbiota studies. For example, random forest approaches have been accurate in predicting regional lithology in Australia using continuous spatial information (Cracknell and Reading, 2014), or the spread of a forest disease (Sphaeropsis blight) in Spain using spatial cross-validation (Schratz et al., 2019). In another example, a geographically weighted ensemble of deep neural networks, gradient boosting trees and random forests accurately predicted temporal wind speeds over mainland China (Li, 2019). The application of frameworks such as these could possibly elucidate the drivers behind the spatial distribution of host-associated microbial community diversity or of individual taxa in these communities. In disease models where microbiota composition is used as a diagnostic tool, we posit that spatial structure should be better accounted for in study design. Merely incorporating spatial data in the current machine learning frameworks as a proxy for unmeasured spatially correlated variables could already improve their performance.
Likewise, a number of extensions and modifications to JSDMs could likely improve their performance for the high number of taxa that characterizes microbial studies. While a generalized linear modelling framework is usually at the core of JSDMs, their performance can be further improved by using Gaussian processes instead (Ingram et al., 2020; Vanhatalo et al., 2020). Advanced computational techniques such as Integrated Nested Laplace Approximation (Blangiardo et al., 2013) could also be used to enhance their computational efficiency. Furthermore, the use of log-ratio transforms to accommodate compositional data in an unbiased way (Gloor et al., 2017), and of Gaussian processes to quantify autocorrelation between hosts (as suggested by Björk et al., 2018) would increase the suitability of JSDMs to the study of microbiota. Other types of models, such as source tracking models aiming at identifying the origin of contaminants in microbial samples (Knights et al., 2011), could be used to model the effect of dispersal between hosts in a spatial context. Finally, further use of these models to assess spatial effects would rely on and benefit from more even and intensive sampling of communities at spatial scales relevant to the study questions, similarly to macrobial ecology studies (see, e.g. Tikhonov et al., 2020a).
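Among the log-ratio transforms referred to above, the centred log-ratio (CLR) transform is one commonly used option. The sketch below is illustrative only: the function name, the example counts and in particular the pseudocount used to handle zeros are assumptions made here, and zero replacement is a modelling choice that can affect downstream results.

```python
from math import log


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's taxon counts.

    A pseudocount (an assumption of this sketch) is added so that zero
    counts have a finite logarithm. Each output value is the log of a
    taxon's (shifted) count relative to the geometric mean of the sample.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical read counts for four taxa in one sample.
sample = [120, 0, 30, 850]
transformed = clr(sample)
```

The transformed values sum to zero within each sample, and, when no pseudocount is needed, they are unchanged if all counts are multiplied by the same factor, which removes the dependence on sequencing depth.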

Concluding remarks
Host-associated microbial communities vary greatly in space (and time), even at a single host site in a single host species. This variability can now be observed with the use of various high-throughput sequencing and single-cell sampling and imaging methods, but its causes remain largely unclear. It is likely that patterns in these communities could be better understood if their spatial structure were properly incorporated in the analyses. Spatial data in microbiota studies can both reflect the varying ability of the organisms to disperse between and within hosts and serve as a proxy for unknown or unmeasured spatially correlated variables. Recent developments in spatial analysis enable accounting for the scale-dependent hierarchical structure of microbiota and for non-linear interactions between variables, but these approaches are still greatly underused in microbial ecology. Indeed, the complexity of microbial community data, the limited scalability of the methods, and the lack of openly available implementations and benchmark case studies are slowing down the development of the field. Further improvements in computational efficiency, adjustment to the specific properties of microbiota profiling data and the incorporation of evolutionary processes would facilitate the use of these methods in the spatial modelling of microbiota. Their growing use in microbial ecology could in turn spur new methodological development and applications in macrobial ecology, as well as industrial and clinical applications.