Spatiotemporal clustering using Gaussian processes embedded in a mixture model

The categorization of multidimensional data into clusters is a common task in statistics. Many applications of clustering, including the majority of tasks in ecology, use data that is inherently spatial and is often also temporal. However, spatiotemporal dependence is typically ignored when clustering multivariate data. We present a finite mixture model for spatial and spatiotemporal clustering that incorporates spatial and spatiotemporal autocorrelation by including appropriate Gaussian processes (GP) into a model for the mixing proportions. We also allow for flexible and semiparametric dependence on environmental covariates, once again using GPs. We propose to use Bayesian inference through three tiers of approximate methods: a Laplace approximation that allows efficient analysis of large datasets, and both partial and full Markov chain Monte Carlo (MCMC) approaches that improve accuracy at the cost of increased computational time. Comparison of the methods shows that the Laplace approximation is a useful alternative to the MCMC methods. A decadal analysis of 253 species of teleost fish from 854 samples collected along the biodiverse northwestern continental shelf of Australia between 1986 and 1997 shows the added clarity provided by accounting for spatial autocorrelation. For these data, the temporal dependence is comparatively small, which is an important finding given the changing human pressures over this time.

the data. The ease of interpretation stems from the observation that humans are naturally predisposed to understanding categorizations (e.g., color labels, Linnaean system of taxonomy, and so forth).
Our motivation for studying clustering methods comes from ecology, and in particular the task of regionalization (biogeography)-where an analyst wants to find groups of sites that have similar assemblages of species (Pielou, 1984;Woolley et al. 2019). Here, we focus on the biogeographic patterns of bony (teleost) fish on the north-west shelf (NWS) of Australia, which is a productive and biodiverse ecosystem of long-term scientific interest (Considine, 1985;Nowara and Newman 2001). Biogeographic data, such as the fish data, are sampled in physical space and often through time. Spatial and temporal dependence may therefore arise among biological observations but nearly all cluster analyses ignores this possibility. In doing so, these studies inadvertently ignore the potential for spatiotemporal correlation to be confused with ecological groups.
In this work, we extend the model-based clustering method of Foster et al. (2013) to include spatial and spatiotemporal dependence into the model for observations. The model is a mixture-of-experts model (Jacobs et al., 1991;Jordan & Jacobs, 1994), where the probability of observing each cluster is allowed to vary with environmental covariates and we additionally allow for spatial or spatiotemporal correlation by including Gaussian processes (GPs; e.g., see Rasmussen & Williams, 2006). An appealing feature of this approach is that it allows predictions from the correlated-model to leverage off the spatial locations, and time, of the observed data as well as the inherent relationships of biology and the environment-even when predicting at locations with no direct observation.
Our approach is novel in that it is for multivariate observations (measurements on hundreds of species per site is not uncommon), and it is defined for continuous space and time using semiparametric GP response functions. Previous spatial clustering methods include, for example: (1) Spatial scan statistics (Kulldorff, 1997), which ignores environmental effects and focuses entirely on spatial properties in an algorithmic framework.
(2) Two-step approaches where a hard-coded label prediction is produced algorithmically and subsequently regressed on spatial covariates and possibly spatiotemporal coordinates (Anderson et al., 2014;Bilancia & Demarinis, 2014). This approach ignores uncertainty associated with the prediction of the label. (3) Clustering areal data (Alfó et al., 2009;Green & Richardson, 2002;Lawson et al., 2017;Neelon et al., 2014;Torabi, 2016;Wall & Liu, 2009), which requires data to be gridded. The gridding is an unnatural representation of most ecological data, which is best represented in continuous spatiotemporal domain. (4) Digital image analysis (Ambroise et al., 1997;Nguyen & Wu, 2012;Woolrich et al., 2005) and specification of priors that encourage neighboring sites to share cluster labels (Corander et al., 2008;Guillot et al., 2005). Whilst close to our representation, these models do not easily allow for inclusion of covariate effects. Unlike the previously introduced models, our approach allows for covariates and for spatiotemporal autocorrelation within the data. This is achieved in a single analysis, which avoids the problems of propagating uncertainty through multiple stages of an analysis. In addition to these desirable qualities, our approach utilizes semiparametric functions for modeling the responses of cluster probabilities along covariates. We also present novel methods to summarize these effects in an intuitively clear manner.
To address the complexities introduced by the inclusion of spatial and spatiotemporal dependence, we introduce novel methods to conduct approximate Bayesian inference that scale well with both the number of samples and the dimensionality of those observations. Our fastest, and crudest inference method, is based on the Laplace approach, which is also the basis of the integrated nested Laplace approximation (Rue et al., 2009) approach that has been shown to perform well for large number of latent Gaussian variable models. Even though the Laplace method has been used for spatial clustering by, for example, Bilancia and Demarinis (2014) and Anderson et al. (2014), these earlier approaches have utilized it only for the hierarchical model conditional on the predefined cluster structure. In our approach, however, clusters are probabilistic and we use the Laplace approximation to marginalize over the spatiotemporally varying cluster probabilities and to approximate their posterior distributions. This approach is technically similar to the Laplace approximation for multiclass and Multinomial GP models (Juntunen et al., 2011;Rasmussen & Williams, 2006;Riihimäki et al., 2013). We propose also to combine Laplace approximation with partial Markov chain Monte Carlo (MCMC) to improve the accuracy of inference for the key model parameters. This combined approach is similar in nature to the approaches of Vanhatalo et al. (2010), Vanhatalo et al. (2013) and Gómez-Rubio and Rue (2018) in that the Laplace approximation is used for approximately marginalizing over a set of model parameters within an MCMC algorithm. A full MCMC procedure is also examined for comparison.
We demonstrate and test our methods with a simulation study. We then analyze 854 samples of 253 teleost fish on the NWS of Australia (see Figure 1) to test our methods in large real-world data and illustrate the effects of including spatial and spatiotemporal effects by fitting models with and without them. The temporal component may be particularly important for this region that has been subject to differing exploitation rates of fish as well as different resource management paradigms (Considine, 1985;Sainsbury et al., 1993).

NWS region
The NWS region of Australia (see Figure 1(A)) is a remote but resource rich marine area in tropical north west Australia. The continental shelf along the NWS supports a productive ecosystem influenced by both tropical and subtropical systems. Since the mid-1960s, the NWS has supported fisheries and a number of different species have been targeted (Considine, 1985;Wallner & Phillips, 1988;Sainsbury et al., 1993;Nowara and Newman, 2001). At times, the total finfish catch from this area has been much greater than that from any other waters in Australia (Considine, 1985). To understand the effect that early fishing effort had on the composition and health of the finfish stocks, and to assess the possibility of developing a domestic fishery, a number of surveys were undertaken in the early 1980s (Sainsbury, 1979;Nowara and Newman, 2001). More surveys were then sporadically conducted until 1991 with the goal of investigating management options (Sainsbury et al., 1993) with further surveys until 1997. The data contain information about how fish biodiversity varies in space and time. In particular, we wish to uncover the patterns of variation in fish assemblages.

Biological data
The NWS data consists of 854 trawls spread from October 1986 to August 1997. The vast majority of these trawls are "community" focused with an object to describe the fish species community, where attributes of all species are recorded. For this analysis, we chose to exclude earlier data where there was ambiguity about the survey objective (e.g., see Sainsbury & Whitelaw, 1984;Thresher et al., 1986). The raw data is available from the CSIRO data trawler 1 and the data used in our analysis is available as Supplementary Material. The data spanned almost 200 m of depth, with 21 m being the shallowest (see Figure 1(B,C)). We analyze the presence-absence of the 253 species (reduced from 579) that were present in 15 or more trawls. Species with very few presences are unlikely to substantially contribute to the evidence-base for biogeographic patterns.

Physical environment data
The physical environment was delineated using climatologies (long-term averages), which are hence time invariant. These are the same sources of physical covariates as was used by Foster et al. (2013) and Figure 1(B) gives the example of depth. The climatological covariates used in this work are depth, intraannual standard deviation (SD) of nitrate (NO3 SD), intraannual SD of dissolved oxygen (O2 SD) and annual mean of salinity. Intraannual SD can be important to ecological systems as it measures the range of environmental conditions that a single location may encounter. A dense grid throughout the region-bounded by latitude, longitude and depth-of all these covariates is used for prediction. We delineate space with respect to orthogonal axes rotated relative to easting and northing coordinates. The rotation was undertaken since patterns of variation in the NWS region tend to be (approximately) north-east to south-west aligned (as is the sampling region itself, see Figure 1(B)). This was achieved by aligning the spatial coordinates to the first two principal directions of variation in the sampling locations.

Spatiotemporal clustering model
Clustering methods aim to partition the multivariate samples into K groups that are more similar to each other than they are to observations in different groups. Using mixture models (McLachlan & Peel, 2000), the clusters are found by encapsulating each multivariate observation's latent group label into the model. Formally, we define the latent label for the ith observation (i = 1 … n indexes the sampling sites) as the 1 × K vector z i = (z i1 , z i2 , … , z iK ) with the kth element equal to "1" if the observation belongs to group k and "0" otherwise; the groups are assumed mutually exclusive such that the observation is assigned to only one of the K groups. The variable z i is assumed to follow a categorical distribution with mean i = ( i1 , … , iK ).
We build upon the model of Foster et al. (2013) who assume that, conditional on the latent group label, z i , the expectation for the observed species data is constant among sampling sites. That is, the multivariate data for the J species at a sampling location y i = (y i1 , … , y iJ ) has conditional elementwise expectation E ( y ij |z ik = 1 ) = kj that is constant over sampling sites belonging to the same cluster. We follow the nomenclature of Foster et al. (2013) who call the vector of conditional expectations E ( y i |z ik = 1 ) = k "a profile," and the regions of covariate space that the groups occupy are "regions of common profile" (RCP). All observations within an RCP have the same species specific conditional expectations and thus satisfy the requirement that observations within an RCP are more similar than observations in different RCPs. Define p ( to be the conditional probability density of the ith observation, with being all of the groups' profiles (K × J matrix). Note that we have partitioned into its group-specific components k (a 1 × J vector). The unconditional distribution of the observation is the mixture distribution For the NWS data, the observations are binary (present/absent) and so we assume that the conditional distributions for each y ij are independent Bernoulli random variables and parameterize the conditional observation models through their mean, p ( y ij |z ik = 1, kj ) = Bernoulli(y ij | kj ). We extend this mixture model for spatial and temporal dependence by allowing the expectation of the group label, i , to vary with covariates and the spatiotemporal coordinates of the observation. That is i = (x i , s i , t i ) where s i is the vector of spatial coordinates of site i, t i is the sampling time and x i = x(s i ) are the covariates associated with site i. Here, we choose (x i , s i , t i ) to be the softmax function (Neelon et al., 2014) but other link functions (Aitchison, 1982;Daganzo, 1979) could be used as well. The softmax function gives the kth element as where k ∼ N(0, 2 ) are the groupwise constant terms, h k (x i ) are the groups' responses to covariates, and k (s i , t i ) are the residual spatiotemporal patterns. The groupwise constant terms, k , are mutually a priori independent and the responses to the covariates and the spatiotemporal patterns will be modeled using mutually a priori independent GPs (Cressie & Wikle, 2011;Gelfand et al., 2010;Rasmussen & Williams, 2006). We will denote by f k (x(s), s, t) = k + h k (x) + k (s, t) a latent function combining the covariate and spatiotemporal effects for the kth RCP group.
The spatiotemporal random effects, { k (s i , t i )} K k=1 , are the main distinguishing feature between our model and the model introduced by Foster et al. (2013), who only considered models with low-order polynomial basis expansions. Additionally we will perform Bayesian inference, whereas Foster et al. (2013) considered a maximum likelihood approach. The spatiotemporal GPs provide a way for observations to "borrow strength" from other observations nearby by capturing spatial and temporal correlations. Such correlations could arise, for example, from missing covariates or from inherent properties of the ecosystem (e.g., fish foraging behavior and reproduction strategies in our application).
We give the spatiotemporal random effects independent zero mean GP priors where is a separable spatiotemporal covariance function with hyperparameters ,k . The spatial covariance function is chosen to be the Matérn covariance function with 3/2 degrees of freedom (Rasmussen & Williams, 2006) and we use an exponential correlation function for the temporal process so that where r(s, s ′ ) = √ ∑ 2 q=1 (s q − s ′ q ) 2 ∕l ,k,q is a scaled Euclidean distance between the observation sites. The covariance function is parameterized by a variance and a "length-scale" parameter in both spatial (see Figure 1 for the spatial axes) and in time dimensions-giving hyperparameters ,k = { 2 ,k , l ,k,1 , l ,k,2 , l ,k,3 } for the kth spatiotemporal process. In the NWS data analysis, the spatiotemporal process is used in model M5 and it reduces to spatial process when the temporal covariance function is dropped out (corresponding to l ,k,3 = ∞) which is used in models M3 and M4 (see Section 4.2).
We model the functions of covariates with additive, mutually independent, GPs where is the covariance function for the response along the dth covariate (d = 1 … D), in the kth RCP and h,k,d are the corresponding hyperparameters. The GP formulation for predictive functions allows for the linear models, h k (x) = x T , with Gaussian distributed weights, ∼ N(0, I 2 ), as a special case with covariance func- , 2006). Incorporating polynomial regression through a linear model is straight-forward with a suitable basis-expansion of the covariates. In the NWS data analysis we test quadratic covariate responses (models M1 and M3 in Section 4.2). For more flexible models, we use the squared exponential covariance function In the NWS data analysis these semiparametric response functions are used in models M2, M4, and M5 (see Section 4.2). We denote all the covariance function parameters by = { ,k , h,k,1 , … , h,k,D } K k=1 .

Priors
We consider it a priori likely that the spatial variability in the class probabilities could be well explained by the environmental responses. To this end, we prefer priors for variance parameters of the spatial and spatiotemporal covariance functions (models M3-M5 in Section 4.2) which give small effects. We use a heavy-tailed prior to allow for departures from this prior assumption where it is supported by the data. Hence, we use a weakly informative half Student-t prior for the SD of the spatial random effects ,k (Gelman, 2006) with four degrees of freedom and scale 0.1. Moreover, we prefer models where RCPs change slowly in space and consider it a priori more likely that the spatial random effects capture nonlocal variation that is not explained by covariates. So, we specify a half Student-t priors with four degrees of freedom and scale 1 for the inverse length-scales 1∕l ,k,1 , 1∕l ,k,2 , 1∕l ,k,3 , for all k = 1, … , K, to give more weight for spatially and temporally smooth processes. These priors for the spatial model lead to a joint prior which shrinks the model towards a null model in which there is no spatial or temporal effects. We prefer relatively "stiff" functions for the effects of environmental covariates that do not cross their mean multiple times within the range of plausible covariate values since it is a priori reasonable that species assemblages have a unimodal relationship with each covariate. Hence, in the models with GP response along covariates (models M2, M4, and M5 in Section 4.2) we prefer length-scales that are of the same order as the range of the environmental covariates. To encode this we first scale the covariates to have a SD of 1 and then give a Student-t prior with scale one and four degrees of freedom for the inverse length-scales, 1/l h,k,d , of the covariate effects at the standardized scale. Moreover, we assume that it is plausible that the RCP probabilities do not respond to some of the covariates. To encode this, we give half Student-t priors for the SDs of the GP response functions, h,k,d with four degrees of freedom and scale one. The prior variances of the group specific constants, 2 , and the prior variances for linear weights, 2 , in the models that use linear model for covariate responses (models M1 and M3 in Section 4.2), are fixed to ten so that they correspond to fixed effects.
Independent prior information for most species is sparse or nonexistent. So, we specify mutually independent vague priors for the log odds ratio of the conditional observation probabilities, logit( kj ) ∼ N(0, 2 ) where 2 = 10.

Inferential methods
The parameter space of our model is large. For the NWS data with K = 5 (see Section 4.2), J = 253 species and D = 4 environmental covariates, the model includes K × J = 1265 conditional observation probabilities ( ) and a minimum of 4K + 2DK = 60 covariance function parameters ( ) for the full spatiotemporal model (55 for a spatial model). Additionally there are n = 854 multivariate spatial/spatiotemporal sampling sites leading to n × K = 4270 random latent variables, which correspond to the values of the latent function, , at these i = 1, … , n sampling sites. Note that n refers to the total number of unique spatiotemporal coordinates (s i , t i ) in our data. Moreover, there are no spatial replicates from exactly same sampling sites even though sites at different times may be located close to each other. Further complexity stems from the fact that a mixture model's likelihood and posterior densities are known to be "bumpy" with many local maxima (Foster et al., 2018;McLachlan & Peel, 2000). In order to make analyses feasible, we propose to use approximate Bayesian inference methods in three increasing levels of accuracy and computation time.
First, as the fastest and crudest approach, we utilize two Laplace approximations in combination (inference method 1): one for the marginal likelihood of model parameters; and another for the (conditional) posterior distribution of latent variables given the hyperparameters (Rasmussen & Williams, 2006;Tierney & Kadane, 1986;Vanhatalo et al., 2010). The former is for estimating hyperparameters and the latter for estimating the posterior of the latent variables. The key benefits are that the dimension of the parameter space is significantly reduced due to (approximate) marginalization over latent variables, and we can use optimization instead of sampling based MCMC approach. Second, as an approach of intermediate accuracy and computational complexity, we use MCMC either to estimate the conditional posterior of the species profiles at the approximate posterior mode of the latent variables (inference method 2a) or to estimate the joint posterior of latent variables and species profiles at the (approximate) posterior mode of the covariance function parameters (inference method 2b). The posterior modes of the latent variables and covariance function parameters are estimated using the Laplace approximation of the inference method 1. Third, as the most accurate but the computationally heaviest approach, we consider full MCMC for all model parameters (inference method 3). The full MCMC would, however, be computationally infeasible for our NWS data.
We introduce the inference methods in detail below. The performance of the proposed methods is examined with simulated data in Section 4.1. All inferential methods were implemented in Matlab by utilizing parts from the GPstuff toolbox (Vanhatalo et al. 2013). The code used for this work is made available at https://github.com/jpvanhat/SpatClustMixtures.

Parameter inference using Laplace approximations (inference method 1)
Denote by Y the n × J matrix of all outcome measurements and by X the respective n × D matrix of covariates. Let f k = [f 1k , … , f nk ] T be the n × 1 vector of latent variables at all observations corresponding to the kth RCP class. We stack f k to give the full set of latent variables f. The prior, conditional on hyperparameters, for the latent variables is a zero mean multivariate Gaussian f ∼ N(0, C) where C is a nK × nK block-diagonal matrix so that the kth block contains elements For notational clarity, we have omitted the conditional dependence on hyperparameters, the number of RCPs, the exact spatiotemporal coordinates and also environmental covariates.
The conditional posterior of latent variables, given the hyperparameters, is where collects the latent variables at the ith observation site. This posterior has no analytical form but we can use Laplace's method (Rasmussen & Williams, 2006;Rue et al., 2009;Tierney & Kadane, 1986;Vanhatalo et al., 2010) to approximate it with a normal density with meanf and covariance given by the inverse of −∇ 2 f log p (f|Y, , ) evaluated at f =f. Here,f is the mode of log p (f|Y, , ), ∇ f the gradient operator, and ∇ 2 f the Hessian operator with respect to f. We denote this approximation by q(f|Y, , ). The mode,f, is located using a Newton algorithm as described in the Web Appendix 1. After solving for q(f|Y, , ) we apply a Monte Carlo approximation for the RCP probabilities, k (x(s), s, t), by sampling from the multivariate Gaussian approximation for f, and employing the softmax transformation (2) to obtain samples of the RCP probabilities.
For posterior distribution of latent variables at unobserved locations (prediction), . Given the posterior approximation q(f|Y, , ), we can derive a Gaussian approximation for the posterior predictive distribution forf given by The mean and covariance of the posterior predictive distribution are given in Web Appendix 2. The posterior distribution for RCP probabilities at unobserved locations k ( x(s),s,t ) is again approximated by Monte Carlo by using posterior predictive samples off.
To estimate the hyperparameters and , we first transform them so that they are supported by the entire real line. That is, we log transform the elements of , and logit transform the elements of to generate a vector of transformed hyperparameters = [ , ] = [log , logit( )]. Then we use the Laplace method again to approximately integrate over the Gaussian latent variables f (Vanhatalo et al., 2010). This provides us with the approximate marginal likelihood for the hyperparameters q(Y| ). We estimate the transformed hyperparameters by their approximate maximum a posteriori (aMAP) value:̂= [̂,̂] = arg max log q(Y| ) + log p( ).
Note that the prior for the transformed parameters, log p( ), is induced by the priors for the original parameters, , , (see Section 3.1.1) after applying the multivariate Jacobian of the transformation. The gradients of log q(Y| ) with respect to can be solved analytically (Rasmussen & Williams, 2006;Vanhatalo et al., 2010), which allows gradient-based optimization for the hyperparameters. See Web Appendix I for specific details.
To guard against making inference at a local (not global) mode of parameters, we employ two strategies. The first is to seek a region of reasonable starting values-there is no point in searching for local modes far from the likely position of the maximum (Foster et al., 2013;Foster et al., 2017). The second is to perform several random starts from within this region. We implement this strategy as follows. First, we hard-clustered the observation vectors y i , i = 1, … , n by using the K-means clustering (see Kaufman & Rousseeuw, 1990, for example). Then we initialized logit( ) around the logit transformed observed prevalences of each species in each hard clustered group by adding random noise, N(0, 2 = 0.2 2 ), to these logit transformed prevalences. We tested alternative SDs for the Gaussian perturbation on starting values and found that a value of 0.2 worked reasonable well. To guard against forming parameter combinations that the optimization could not escape from, we set all initial values of smaller than 0.2 (greater than 0.8) to be 0.2 (0.8). We also routinely include the unperturbed starting values as these may reflect the RCP groups well, especially if there is only moderate spatial and covariate effects. The other hyperparameters (length-scales and variances) are initialized on the log scale using independent Gaussian realizations with SD 0.2. The mean for the log length-scale parameters was 0 and for the log variances 0.1. These initializations for covariate effects correspond to GP functions of moderate flexibility (recall that the covariates are standardized).

3.2.2
Parameter inference using Laplace method and partial MCMC (inference method 2a and 2b) Typically we are interested in RCP probabilities, k (x(s), s, t), and species profiles, k , but not on the covariance function parameters, , as such. Hence, for improved posterior inference we consider MCMC schemes with increasing level of accuracy for these key parameters ( k ) and increasing computational time demands. Our first partial MCMC approach (inference method 2a) is to sample from the conditional posterior for species profiles given the aMAP estimate for covariance function parameters and latent variables. That is, we sample from Iff is a good summary for the marginal posterior of f sampling from this distribution can provide good approximation for the posterior distribution for species profiles. Alternatively, we can sample conditional on̂= E[ |̂] which is also provided by the Laplace approximation as described in Section 3.2.1. This approach is fast since for each proposal of we need to recalculate only the prior density, p( ), and the likelihood terms p ( y i |z ik = 1, k ) which are computationally cheap. Moreover, given the latent variables, and our independent priors p( ) = ∏ k,j p( kj ), species profile parameters may be nearly independent in the conditional posterior (10), and so constructing an efficient sampler is easy. In practice, we sampled from Equation (10) so that we first sampled from the posterior distribution of the logit transformed species profiles, , using Hamiltonian Monte Carlo (HMC, Neal, 2011) as implemented in the hmc2 function of GPstuff package (Vanhatalo et al., (2013)) and then retransformed these samples back to the original parameters using = logit −1 ( ).
As a second partial MCMC option (inference method 2b), we consider sampling from the conditional posterior for latent variables and likelihood function parameters where f i = [f i,1 , … , f i,K ] T collects the groupwise latent variables at the ith sampling site, and̂is the aMAP estimate for log given in Equation (9). With large datasets, the posterior distribution for is narrow and the posterior for latent variables is rather insensitive to changes in covariance function parameters within the highest posterior probability region around̂(see, e.g., Vanhatalo et al., 2010). In many cases, the conditional posterior of Equation (11) can, thus, be good surrogate for the true marginal posterior p(f, |Y).
Sampling from Equation (11) is considerably faster than sampling from the full posterior since the Cholesky decomposition and inverse of C(̂) need to be calculated only once before the sampling. The time needed for these operations scale as O(n 3 ) with the number of sampling locations n after which the remaining calculations in (11) scale as O(n 2 ). Hence, this approach is feasible in many practical applications where approximating the full posterior (requiring multiple O(n 3 ) operations) would be infeasible. In practice, we do Gibbs sampling by sequentially sampling from the conditional distributions p( |Y,̂, f) and p(f|Y,̂, ). We again use HMC for the former conditional and an elliptical slice sampler (Murray et al., 2010) (implemented in esls function in the GPstuff package) to sample from the latter.

Parameter inference using full MCMC (inference method 3)
Asymptotically the most accurate, but also computationally most demanding, MCMC approach is to sample from the full posterior p(f, , |Y). We again apply Gibbs sampling and sample the latent variables, species profiles and covariance function parameters from their full conditionals. We use an elliptical slice sampler for latent variables (p(f|Y, , )), HMC for species profiles (p( |Y, , f)) and slice sampling for the covariance function parameters (p ( |Y, , f)). The latter is implemented as described by Neal (2003) and implemented in function sls in GPstuff. The species profiles, , are sampled at the logit transformed space and the covariance function parameters, , are again sampled in log transformed space. In full MCMC, the computational bottle necks are the covariance matrix operations, which have to be done for each proposal of covariance function parameters and latent variables. Moreover, strong posterior dependence can occur between the covariance function parameters and latent variables (Vanhatalo et al., 2010). As a result, the full MCMC approach was infeasible for the NWS fish data due to the number of sampling sites and the number of fish species. For that reason, we compared the Laplace approximation and the two above mentioned MCMC methods to the full MCMC using a smaller simulated dataset (Section 4.1), which showed that the Laplace approximation for the posterior of the latent variables and the conditional MCMC approaches agree well with the full MCMC approximation.

Identifiability of the parameters and inferring covariate effects
Our model has several components that impose identifiability considerations. First, the components of the latent function f k (x(s), s, t) = k + h k (x) + k (s, t) are unidentifiable since the overall level k can be absorbed by k (s, t) and the GP formulation of h k (x) (Knorr-Held, 2000). In our model, we note that this confounding is mitigated by the specification of priors for the parameters of the GP components; the effects are shrunk to zero in the absence of a spatial or temporal trend. A remedy leading to explicit identifiablity, as proposed by Knorr-Held (2000) and Hanks et al. (2015), would be to impose sum to zero constraints over the observation locations to the random effects k (s, t) and the GP based response functions h k (x). Second, due to the sum constraint of the probabilities, the softmax function, as defined in Equation (2), is not identifiable for the unnormalized latent functions f k (x(s), s, t) with k = 1, … , K. A common remedy to make the latent functions identifiable over groups (up to label-switching, see the discussion at the end of this section) is to fix the latent function of one of the groups to zero (see, e.g., Foster et al., 2013;Neelon et al., 2014;Foster et al., 2018). Apart from prior distributions, we did not apply any of the above proposed identifiability constraints for the latent function components. The reasons are the following. Due to the nonlinear softmax link function, the covariate effects or spatiotemporal effects to RCP areas are hard to interpret by looking at the latent responses, such as h k (x d ), only. Hence, we do not want to interpret the latent functions, f k , k = 1, … , K or their additive components directly. Rather, we want to interpret normalized RCP probabilities k (x, s, t) of Equation (2). Even though the latent functions or their additive components are not identifiable, the model is identifiable for these RCP probabilities. From the point of view of interpreting RCP probabilities ( k (x, s, t)) we would not gain anything from posing identifiability constraints to the latent variables. On the other hand, implementing these constraints in our inference methods (especially the Laplace approximation) would be cumbersome and lead to increase in computational demand. Hence, the implementation of our model was more straightforward by allowing the latent functions of all groups to vary.
We follow Hill et al. (2017) and Kallasvuo et al. (2017) and apply the posterior inference on the conditional responses of RCP probabilities k at different covariate combinations. We denote the conditional response by k (x d |x ⧵d , s, t) as the probability of the kth RCP as a function of the dth covariate only, conditioning on the remaining covariates (x ⧵d ) and spatial and temporal locations are fixed to (x ⧵d , s, t) (note that k is a function of x d , not a probability density function and | is a similar conditioning statement as used, for example, in likelihood function notation). The conditional response is a random function whose posterior distribution depends on the posterior distribution of the model parameters. We study the probability changes relative to the average probability within the covariate limits at (x ⧵d , s, t): Since the normalized RCP probabilities are identifiable (up to label switching) these conditional responses are identifiable as well.
In the NWS analysis, we plot the expectation of the conditional responses, , with covariates from 50 randomly chosen sampling sites from the original survey data, where the expectation is taken over the posterior distribution of k where the posterior distribution for k is estimated with one of the inference methods described above. We add to this plot an average response, which is taken over the 50 conditional responses in a pointwise manner. Note that we could also report the posterior uncertainty in individual Δ k (x d |x ⧵d , s, t) but we suppress this information to reduce clutter in our plots.
One potential additional complication to the posterior inference is label-switching (Neelon et al., 2014), which in our application means that the posterior inference is essentially identical if the order of the RCP class memberships change. Whilst a problem for interpretation, label switching forms little issue for numerical optimization. This is because any ordering is as good as any other, and the estimation goal is to find any one of the MAPs. Hence, label switching is not a problem for Laplace approximation nor inference method 2a in Section 3.2.2 since the Laplace approximation for the posterior of latent variables will be formed around a single class membership combination. Unlike optimization, label switching can be problematic in MCMC (Stephens, 2000) as the sampler can jump between different states. However, we did not encounter obvious problems in our MCMC routines. This is reasonable since the RCP profiles, k , are typically very long vectors so the posterior distributions conditional on different label combinations are naturally far from each other. If label-switching was a problem in the MCMC inference, we could follow the relabeling strategy proposed by Stephens (2000).

Model comparison via cross-validation
Even though comparison of models with different structures and the choice of number of RCPs is not our focus in this work, we suggest that this could be performed using cross-validation with the average log posterior predictive density as the performance measure (Vehtari & Ojanen, 2012). Cross-validation approach has been utilized in mixture models previously by Wall and Liu (2009). Briefly, we divide the data into 10 randomly chosen parts of equal size, and predict each of these hold-out subsets based on the remaining nine subsets. The predictive performance was assessed using the average log posterior predictive density of the hold-out datasets based on the model estimated from the remaining data. We hold out the same partitions for all the different models under consideration, which can help guard against stochastic noise.
As with any mixture model, detecting too many groups from predictive performance is problematic. In the context of RCPs, any RCP can be split into two or more RCPs, with the same profile and essentially the same log posterior predictive density for the hold out data. Moreover, finite data and randomness in CV splitting induce randomness to cross-validation log predictive densities through which overly complex model can perform best by chance. Hence, there is risk of overfitting if we keep increasing RCPs until a model's log predictive density starts decreasing. For this reason, we calculate also the standard error of the average log predictive density of the hold out data and interpret it as an estimate of the randomness in the cross-validation result (Vehtari & Ojanen, 2012, p. 191). We then choose the largest number of RCPs that increases the average log predictive density by more than 3/2 standard errors compared with the model with one RCP less. This model choice criterion corresponds, roughly, to choosing a model that is better than the alternative models with 90% confidence level.
As an additional, qualitative check, we follow Paci and Finazzi (2018, section 3.4) who proposed that the choice of the number of clusters (K) should not be solely based on model performance metric but should also consider if adding an extra cluster significantly changes interpretation of the model. Hence, if two models have similar cross-validation log predictive densities but the RCP distribution or species profiles do not differ significantly between models, we should prefer the model with the smaller number of RCPs.

Tests with simulated data
In order to test the goodness of our approximate inference methods (Sections 3.2.1-3.2.3) we conducted a simulation study. We used the same simulation set up as described by Foster et al. (2013) with presence/absence observations of J = 100 species at n = 150 randomly distributed observation locations throughout a spatial rectangular area of size [−10, 10] × [−10, 10]. The species data at simulated sampling sites were generated using the clustering model (1) and each sampling site was probabilistically assigned to one of K = 3 RCP clusters. The RCP probabilities were defined as cubic polynomial functions of spatial location. In addition to the simulation model in Foster et al. (2013) we add spatial variation using the Matérn covariance function in the GP prior. We conducted the posterior analysis using the Laplace approximation (Section 3.2.1) and the three MCMC schemes described in Sections 3.2.2 and 3.2.3. Figure 2 summarizes the posterior distribution of the RCP cluster parameters k (s) as approximated by the full MCMC and Laplace approximation. Both approximations give practically identical posterior mean estimates and very similar results for the 5% and 95% posterior quantiles. The Laplace approximation describes the center of the posterior predictive distribution well, but seems to overestimate the 5% quantile and underestimate the 95% quantile at a small number of locations. Figure 3 summarizes the posterior distribution of species profiles, k,j , as approximated by MCMC for species profiles atf (inference method 2a), MCMC for and f at̂(inference method 2b) and by full MCMC (inference method 3). There are no systematic differences between these approximations and all three methods provide close to identical F I G U R E 2 Visualization of the posterior distribution for k , k = 1, 2, 3 over the study region in the simulated data experiment as approximated by full MCMC (inference approach 3) and Laplace approximation (inference approach 1) for each RCP at all spatial locations. Each of the 18 subplots covers the simulated spatial study region of size [−10, 10] × [−10, 10]. The rows correspond to different RCPs so that the first row contains results for 1 , the second for 2 and the last row for 3 . The two left most columns show the 5% lower posterior quantile, the two middle columns show the posterior mean and the two right most columns show the 95% posterior quantiles for MCMC and Laplace approximation. MCMC, Markov chain Monte Carlo; RCP, regions of common profile. TA B L E 1 The models considered for analysis of the NWS fish data uncertainty estimates for the species profiles. Table S1 in the Supplementary Material shows the 10-fold CV comparison between models with 2-5 RCPs. The model with correct number of RCPs (three) has the highest mean log predictive density. However, the difference in posterior predictive performance of models with 2-4 RCPs is small compared with the standard error of the log predictive density estimates indicating that the choice of number of RCPs should not be based only on the average log predictive density comparisons as discussed in Section 3.3.

F I G U R E 3
In our experiments it took approximately 20 s on a Linux laptop with Intel(R) i7-6600U CPU @ 2.60 GHz processor to find the Laplace approximation for this simulation study. Sampling 10 4 samples from the conditional posterior of species profiles (Equation (10), inference method 2a) took another 30 s. Sampling 10 4 samples of latent variables and species profiles at the aMAP of covariance function parameters (Equation (11), inference method 2b) took approximately an hour and sampling 10 4 samples from the full posterior (inference method 3) took approximately 3 h. These performance statistics are naturally highly dependent on the sampler options and, hence, only reflect the relative performance differences between the methods. The reported values correspond to sampling after careful tuning.

4.2
Analyses of NWS data

Posterior inference and model comparison
To contrast different models and to show the real-world utility of the model, we analyze the NWS data with nonspatial, spatial and spatial-temporal models-see Table 1. For the model without spatial correlation, we use only environmental covariates to predict the RCP locations using either quadratic or GP responses (models M1 and M2). For the models that incorporate spatial correlation, we additionally employ GPs over space (models M3 and M4). Finally, we also analyze the data using a model with GP covariate effects and the spatiotemporal effects (model M5). We assessed a range of numbers of RCPs, from 2 to 6, which covers the K = 5 solution that Foster et al. (2013) found best for data collected in 1983 from the same region. The predictive performance of the models increases significantly until K = 5 RCPs since, for the first five RCPs, each additional RCP increases the average log predictive density by more than 3/2 standard errors of its estimate (see Table  S2 in the Supplementary Material). Increasing the number of RCP groups beyond K = 5 produced RCPs that are small (represented by few sites) and are only minor variations of existing ones. Decreasing the number of RCP groups produced models that have amalgamations of these five main RCPs. In particular, with four RCP groups the groups 3 and 4 (see Figure 4) would be merged. These two RCP groups are distributed in similar spatial locations along the depth gradient but they show opposing effects along the salinity gradient (see Figures 4 and 5) which is reflected by differences in the fine scale structure of the spatial distribution of these RCP regions. The difference in their species profiles are also significant (see Table 3 and Figure 4). Hence, even though these RCP regions show similar spatial patterns we concluded that they represent significantly different communities and used the models with K = 5 RCPs for the final analyses.
For the NWS data, we did all computations for this case study with a Linux desktop with Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz. Finding the aMAP estimate for hyperparameters took 4-7 h after which MCMC for conditional posterior for species profiles only was done in less than a minute and species profiles and latent variables within 3 h.

4.2.2
The effect of covariates and the spatial term The predicted maps of expected probability of each RCP from the models with GP (M2) and quadratic (M1) covariate effects are presented in Figures 4 and S1, respectively. In both model types (GP and quadratic), the addition of a spatial probabilities in the images sum to one over the rows in the first two columns. The left most column is for the model with spatial and GP covariate effects (model M4). The middle column does not have the spatial effects (model M2). The column on right shows species profiles (the aMAP estimate of the probability of observing species) in each RCP. Ordering, from most prevalent species to least prevalent is for visual appeal only and it does not alter the model in any way. GP, Gaussian processes; RCP, regions of common profile effect increases the contrast between areas of high and low probabilities of many RCPs (e.g., RCPs 3 and 4)-compare model M1 with M3, and M2 with M4. For quadratic covariate effects (M1), the spatial distribution of RCPs 3 and 4 also change noticeably with the addition of spatial effects (M3), presumably due to the effects being nonquadratic. The maps produced by the two models with both covariates and spatial effects (left columns of Figure 4 for M4 and Figure S1 for M3) are qualitatively similar implying that the combined effect of covariates and space is similar irrespective of the form of the covariate contributions. However, if the spatial effect is removed (giving models M2 and M1) then the models do differ, and substantially so for RCPs 3 and 4 (center columns of Figures 4 and S1). RCP 2 in particular is not as sharply defined using quadratic covariate effects (model M1 in Figure S1). This implies that, for these data, the importance of using flexible covariate effects (model M2) is lessened when the spatial effect is present (model M3). The reason for this is that the functional form of the covariates is approximately quadratic when spatial effects are present (model M4, see Figure 5). The addition of spatial random effects to the model increased the uncertainty estimates of some of the model components, mainly the estimates concerning the responses along environmental covariates.  Note: For a map of likely location of each RCP type, see Figure 4 (left column). Abbreviations: GP, Gaussian processes; NWS, north-west shelf; RCP, regions of common profile.

TA B L E 2
The posterior mean (and 95% central credible interval) for expected species-richness for each RCP for the NWS spatial model (using GP effects for the covariates, model M4) For these data, the major environmental driver is depth ( Figure 5). This matches ecological understanding of the region, and of marine ecosystems in general Koslow et al., 1997). In the NWS data, there appears to be: a shallow water group (RCP 1) located near the coast; two mid-depth groups (RCPs 3 and 4) that are spatially segregated but not in terms of their depth preference; one deeper water group (RCP 5), and; one group located near the start of the more rapid change in the depth gradient (RCP 2).
The species-richness patterns varied across RCPs and hence depth. Species-richness is defined as the number of different species at a sampling location, and here we quantify this as the expected richness within an RCP group: ∑ J j=1 kj (Table 2). We also calculated the total absolute difference in species profiles between two RCPs: (Table 3). Note: For a map of likely location of each RCP type, see Figure 4 (left column). Abbreviations: GP, Gaussian processes; NWS, north-west shelf; RCP, regions of common profile.
The shallow group (RCP 1) has a greater expected species-richness, while the deeper group (RCP 5) is less rich but tends to have quite a different set of species in the RCP assemblage. The RCPs with intermediate depth have richness that is in between these two (Table 3). RCP 4 arguably has less richness than one might expect given its intermediate depth. There could be a number of ecological reasons for this: (1) its location over a very particular habitat, which requires highly specialized traits; (2) the dominance of a small number of species monopolizing the resources at those sites (an uneven community), or; (3) this group is the result of the sampling process itself (e.g., the trawls were performed using slightly different protocols that affected catch diversity and/or rates. We are unable to investigate which of these three options is appropriate without extra information that is not available within the data themselves.

Spatiotemporal analysis of the NWS data
To investigate possible temporal, as well as spatial, heterogeneity in the fish data, we extended the model to have spatiotemporal dependence (as per Section 3.1). This is model M5 in Table 1. In particular, we fitted a model with covariate effects given by GPs, a Matérn spatial dependence and an exponential temporal dependence as in Equation (4). Figure 6 shows the predicted maps of posterior expected probability of each RCP for years 1986-1997. The spatiotemporal model produces in general similar results as the spatial only model. There is relatively little change in the RCP clustering through time. This is important from a natural resource management perspective, as it means that many management decisions (like zonation) were likely to be enduring and so do not have the need for frequent reassessment. Whilst remaining relatively static, there are some minor differences over the years for some of the RCP groups. For example, RCP 4 is becoming more common in a small patch near the center of the study area and RCP 1 may be retracting from its southern areas ( Figure 6). Conversely, RCP 5 seems to change little over the study period and RCP 3 also experiences only minor changes.

SUMMARY AND DISCUSSION
In this work we have developed a statistical model that groups samples according to their multivariate observations. The model extends the model of Foster et al. (2013) by introducing spatial, and spatiotemporal effects to deal with correlation. The methodology works by allowing the probability of any particular sample belonging to each group to depend on covariates and also their location in space and time. This is achieved by adding flexible spatial and spatiotemporal terms into the mixture-of-experts model (Foster et al., 2013;Jacobs et al., 1991;Jordan & Jacobs, 1994), see Section 3.1. An important benefit of incorporating spatial and spatiotemporal terms into the model as GP effects is that it changes the qualitative nature of posterior prediction at locations that have not previously been sampled. This is done by leveraging observed data from nearby (in space and time) locations. This enables cohesive spatial posterior prediction (even at the locations where data was sampled) and a probabilistic representation of uncertainty. In addition, we have introduced flexible Gaussian process methods to model dependence on environmental covariates. We develop efficient techniques to estimate the models using approximate Bayesian methods for inference. We used this novel methodology to analyze a dataset of tropical fish distribution on the continental shelf of northwestern Australia to show important relationships with physical covariates and spatial dependence as well as the temporal stability of species groups (RCPs). Our analysis suggests that depth is the major delineating variable between the RCP groups ( Figure 5), which agrees with ecological understanding of fish behavior elsewhere and also for other taxonomic groups F I G U R E 6 The posterior expected probability of each RCP at all spatial locations for years 1986-1997. These maps are created using the spatiotemporal model with covariates effects added using GPs (model M5). Only years when data were collected are presented. GP, Gaussian processes; RCP, regions of common profile Koslow et al., 1997). Interestingly, the analysis suggests that the temporal component of variation in the data is relatively minor. This is in spite of the changing human utilization in the area.
Our computational strategy is efficient enough to allow estimation for our example application, which has ∼850 samples and ∼250 measurements (species presence/absence) per sample given five RCP clusters. The size of this modeled dataset is large compared with earlier spatial clustering examples in the literature (Alfó et al., 2009;Green & Richardson, 2002;Lawson et al., 2017;Neelon et al., 2014;Torabi, 2016;Wall & Liu, 2009). To enable inference, we performed approximate Bayesian inference, as defined by a Laplace approximation and MCMC for conditional posterior of latent variables and species profiles (Sections 3.2.1 and 3.2.2). According to the simulation study, these methods provide good approximation for the posterior of latent variables, posterior probabilities of RCP regions and the posterior of species profiles. Full MCMC (Section 3.2.3), however, was infeasible for the NWS data in a reasonable time. The time requirement of Laplace approximation and the full MCMC increases as O(n 3 ) but the constant factor is considerably smaller for Laplace approximation (only tens of optimization steps) compared with full MCMC (thousands of sample proposals). The time requirement of sampling only latent variables and species profiles increase as O(n 2 ) whereas the time needed to sample only species profiles increases as O(J). Hence, full MCMC will become increasingly hard as the number of sampling sites increases but the two partial MCMC schemes (Sections 3.2.1 and 3.2.2) will remain attainable as long as constructing the Laplace approximation is feasible.
The model in Section 3.1 further extends the earlier spatial clustering models by allowing more flexible nonparametric responses to environmental covariates through the use of GPs. In the NWS study, this added flexibility was ultimately not needed as the expected latent response tended to be near quadratic. However, this was only the case for models with a spatial effect. Without the spatial effect, the resulting maps for the GP model and the quadratic model were noticeably different. Hence, it cannot be assumed in general that the relationship between a covariate and the response will necessarily be quadratic, or will even follow a less strict form of ecological niche theory such as unimodality. The GP approach provides added value compared with parametric response functions.