Evaluating and presenting uncertainty in model‐based unconstrained ordination

Abstract Variability in ecological community composition is often analyzed by recording the presence or abundance of taxa in sample units, calculating a symmetric matrix of pairwise distances or dissimilarities among sample units and then mapping the resulting matrix to a low‐dimensional representation through methods collectively called ordination. Unconstrained ordination only uses taxon composition data, without any environmental or experimental covariates, to infer latent compositional gradients associated with the sampling units. Commonly, such distance‐based methods have been used for ordination, but recently there has been a shift toward model‐based approaches. Model‐based unconstrained ordinations are commonly formulated using a Bayesian latent factor model that permits uncertainty assessment for parameters, including the latent factors that correspond to gradients in community composition. While model‐based methods have the additional benefit of addressing uncertainty in the estimated gradients, typically the current practice is to report point estimates without summarizing uncertainty. To demonstrate the uncertainty present in model‐based unconstrained ordination, the well‐known spider and dune data sets were analyzed and shown to have large uncertainty in the ordination projections. Hence to understand the factors that contribute to the uncertainty, simulation studies were conducted to assess the impact of additional sampling units or species to help inform future ordination studies that seek to minimize variability in the latent factors. Accurate reporting of uncertainty is an important part of transparency in the scientific process; thus, a model‐based approach that accounts for uncertainty is valuable. An R package, UncertainOrd, contains visualization tools that accurately represent estimates of the gradients in community composition in the presence of uncertainty.

the individual species themselves, but rather the underlying gradients that influence species distributions and ultimately community composition. However, many data sets do not contain sufficient information to discern precise estimates of the ecological gradients that influence community composition.
In contrast to constrained ordination methods where environmental features related to the sample units are used (Anderson & Willis, 2003;Birks, Peglar, & Austin, 1996;Økland, 1996), unconstrained ordination models do not include ancillary data about the sample units, but rather only use species (or other taxon) composition information to estimate the locations of sample units along compositional gradients. Traditionally, distance-based methods have been used to determine the compositional gradients and locations of sample units (Legendre & Gallagher, 2001;Roberts, 2016); however, these methods require resampling-based approaches for inference and uncertainty assessment (De Leeuw & Meulman, 1986;Heiser & Meulman, 1983;Jacoby & Armstrong, 2014;Smith & Gray, 2019).
In Hui et al. (2015), four benefits of model-based approaches to unconstrained ordination are detailed as follows: controlling spurious data properties, model checking, model selection and inference, and efficiency; however, little attention is given to assessing uncertainty in model parameters. In fact, the only mention of confidence intervals in this section states that "accuracy of such confidence intervals in this context is in need of evaluation." A later paper with a Bayesian implementation (Hui, 2016) largely resolves the issue of accuracy of the intervals and while other recent articles (Hui, Tanaka, & Warton, 2018;Hui, Warton, Ormerod, Haapaniemi, & Taskinen, 2017;Niku, Warton, Hui, & Taskinen, 2017) do touch on variability, understanding and assessing uncertainty in the latent factors is still not a point of emphasis. Walker (2015) does include an analysis of uncertainty in indirect gradient analysis, but the scope is limited to a single dimensional projection of presence/absence data. Ren, Bacallado, Favaro, Holmes, and Trippa (2017) details an approach using a dependent Dirichlet process that implements for understanding uncertainty in projections for bacteria counts, but is limited to count data. Understanding and evaluation of uncertainty is critical, in science in general and particularly in unconstrained ordination, as estimating latent gradients with precision from abundance or presence data is a major challenge. With unconstrained ordination, parameter estimation is challenging because, in addition to the latent factors, species exhibit significantly varying patterns of occurrence or abundance which also need to be estimated. Furthermore, many ecological data sets are sparse in that many species are rare, and the number of observations can be small compared to the number of parameters required to be estimated. The focus of this work is to assess uncertainty in latent factors; quantify the effects of the number of sample units or species and the relative width of uncertainty bounds; and encourage accurate reporting of the variability; additionally, tools are given for designing ordination studies in a way that reduces the variability in the final results.
In many statistical situations, increased precision in parameter estimates is attained by collecting additional data. In this framework, the latent factors are associated with the sample units, and hence, increased precision would be attained by collecting abundance or presence data for additional species at each sampling location. While there are sampling protocols in some ecological disciplines where a fixed number of organisms is identified for each sample unit, in many cases sampling is exhaustive and the number of species per sample unit cannot be increased. Furthermore, the inclusion of rare species results in sparse data where some species are rarely observed. The sparse data do not provide much information about the latent factors associated with the sample units. Fortunately, increasing the number of sample units does have an indirect impact on the latent factors associated with an community composition gradient as it allows the species parameters to be more precisely estimated, which, in turn, results in more precise site-level estimation. The trade-offs between collecting additional species or data at additional sites will be addressed in Section 4.
To determine the characteristics required to achieve a specified precision in estimating the underlying gradients, we present a framework for addressing the sample unit characteristics necessary to detect the latent factors with a specified precision. Precision is impacted by the number of sample units, the number of species, and the assumed generating distribution of the data, but also by the factors associated with the abundance distribution for each species and each sample unit. Simulation studies are presented for a variety of scenarios, and code is made available with our R package, UncertainOrd, that is available on github (Hoegh, 2018).
In this article, we detail the challenges in using model-based methods for unconstrained ordination and emphasize the importance of understanding and relaying the uncertainty in estimates.
Furthermore, in addition to providing tools to calculate the characteristics required to detect underlying gradients at a specified precision level, we also provide software for creating figures that display the uncertainty in the parameters.

| MATERIAL S AND ME THODS
There is a variety of ways to collect data for species response.
A common approach for model-based unconstrained ordination uses latent factor models. Following notation from Hui et al. (2015), the latent factor model with random effects for sample units can be formulated as where α i represents the effect for site i, β j is the effect for species j, each of which is assumed to be independent and standard normal in distribution, and θ j are the factor loadings associated with the latent factors. The function g() is a generic function used to link the mean term μ ij to the assumed generative distribution of the data. The z i 's are referred to as latent factors and are assumed to represent the underlying gradient that influences community composition. Typically, q, the dimension of the latent factors, is two or three for ordination. Without constraints on the vector of θ j values, the latent variables are not invariant to rotation. Hence with q = 2, for identifiability θ is constrained such that θ 11 > 0, θ 12 = 0, and θ 21 > 0, where θ jk is the coefficient, or factor loading, associated with the jth species and kth dimension of the latent factor.
All variables in Equation (1) are fit using a Bayesian paradigm.
Hence, the following model specifications and prior distributions are used where μ α and μ β are the mean parameters for the prior distributions for α and β, respectively, and are usually set to zero. Variance in the prior distributions is specified as V α , V β , or V θ , which can be a covariance matrix. To satisfy the constraints on θ, the upper diagonal element, θ 12 , is set to zero and the diagonal elements θ 11 and θ 22 are forced to be positive by using a truncated normal distribution as the prior. The latent factors are assumed to follow a standard normal distribution.
Unconstrained ordination approaches using only species abun-

| Models for binary data
Multivariate species data are often recorded in a presence/absence format. This results in binary data, and there are two common approaches for this data structure: logistic regression using the logit link and probit regression using a probit link. Generally, in a Bayesian framework, which we adopt in this article, the probit model specification is preferred due to efficient sampling. In particular, a Gibbs sampler can be implemented with the probit link function using a latent normal data augmentation approach (Albert & Chib, 1993;Chib & Greenberg, 1998). However, the recent advances in the Polya-Gamma random variables and efficient sampling also permit Gibbs sampling in the logistic case (Polson, Scott, & Windle, 2013). In this article, we will focus on the probit regression, but the results are quite similar using logistic regression.
The probit regression uses a data augmentation approach to facilitate efficient computation of the posterior distribution. In this case, we assume there is a latent, continuous variable such that if the latent variable is <0, then the observed binary response is equal to zero; otherwise, if the latent variable is >0, then the observed binary response is equal to one. Then, the model can be written as where y ij is the binary response for species i at location j and Φ() is the cumulative distribution function of a standard normal random variable.
The prior distributions from Equations (2-4) are used for parameter estimation.

| Models for count data
Multivariate species data are often observed as an abundance, where the recorded value is the count of the number of individuals by species. Count data are commonly modeled with the Poisson distribution; however, the Poisson distribution has a strict mean/ variance property that is not always realistic. Negative binomial models provide an alternative that has a more flexible structure to fit data where the mean and variance are not equal as in a Poisson model. (1)

| Poisson model
The Poisson model can be written as where y ij is the count response for species i at location j. The prior distributions from Equations (2-4) are used for parameter estimation.

| Negative binomial distribution
The negative binomial is quite similar to the Poisson distribution, but also includes an overdispersion parameter to account for extra variation in the counts. Specifically, the model can be written as note that this parameterization of the negative binomial distribution is different from many approaches that use the number of trials and a probability of success (or failure). With this parameterization, the variance of y ij = ij + 2 ij . This framework also requires a prior distribution on the overdispersion parameter ω. Common prior distributions for ω include a half-normal distribution or a half-Cauchy distribution.

| Ordinal data
In some cases rather than counts or presence, the response of interest is the percent cover of each species. In this case, rather than estimating a percent, values are often recorded in ordinal responses. When modeling ordinal data, a common approach uses the cumulative probit link function. Building off the probit model from Equation (7), and extending to k categories, the goal is estimating the probabilities of the outcome being in each of the k classes, To estimate these probabilities, a latent variable is constructed as where where c l is the lth cutoff point. The variance is set to be one for identifiability in this framework. Then for a given species and site combination, the cumulative probit model is used to determine the probability of each class.

| RE SULTS
In this section, we analyze two well-known data sets to illustrate the uncertainty inherent in model-based unconstrained ordination.

| Spider data set
The spider data set was first published in Van der Aart and Smeenk-Enserink (1974) and contains counts for 12 species of spiders at 28 sampling sites. The spider data set is generally considered information dense as more than 54% of the values are >0 with an average total abundance across all species and sites of nearly ten. In modelbased unconstrained ordination, the focus is the latent factors which are then used as coordinates in a low-dimensional projection.

| Spider presence analysis
We begin with treating the spider data set as a binary response by mapping the abundance data to presence data. Naturally, treating the data as binary provides less information for identifying the latent factors. Nevertheless, many data sets are collected as presence/absence data and this provides a direct measure of the effect of information loss when compared to the spider abundance data below.
Using the probit link function, we fit the model specified in Equations (6 and 7) using a two-dimensional latent variable Z = {z 1 , z 2 }. Priors were specified from the distributions in Equations (2-4) where the hyperparameters were set as = = 0, 2 = 2 = 1, and V = I. Our interest is the latent factors z 1 and z 2 which are used to interpret a latent gradient associated with the 28 sample units. Using

| Spider abundance analysis
The spider data set was also analyzed using the recorded counts of individuals by species. The abundance data set contains more information than the presence data set, as it better represents variability in species response to the latent factors. While the Poisson distribution could be used here, Hui et al. (2015) found the negative binomial to be a better fit to the spider data. Additionally, deviance information criteria (DIC; Spiegelhalter, Best, Carlin, & Linde, 2002) are used to compare the two models and we also find a negative binomial to be more appropriate for this data. Hence, the negative binomial model specified in Equation (10) is used in this example.
With this specification, standard normal priors are placed on α, β, Z, and θ. The dispersion term in the negative binomial distribution has a half-Cauchy prior with variance of 20.
Similar to the abundance data analysis, the uncertainty in the latent factors can be visualized. The parameters exhibit less uncertainty than do the parameters for the presence/absence data, but there is still a large amount of variability present. Figure 1 shows the point estimates of the latent factors using the negative binomial sampling model along with the variability for site 25.

| Dune data set
The dune data set is another famous data set that contains the abundances of 30 plant species collected across 20 sampling sites.
The data set was reported in Batterink and Wijffels (1983) in Dutch and later in Jongman, Braak, and Tongeren (1995). The response is collected using a Braun-Blanquet method with nine ordinal classes.

| Dune ordinal data analysis
The ordinal data analysis uses standard normal priors for z, θ, α, and β. The ordinal model specified in Equations (12 and 13) also requires a prior distribution for the cutoffs. A standard normal prior is used here too with the constraint that order of the cutoffs is appropriately preserved.
The data are analyzed, and a point estimate of latent factors is presented in the left panel of Figure 2. Additionally, the uncertainty for site 18 is presented in the right panel of Figure 2.

| Data analysis summary
The figures of the dune data set and the spider data set suggest there is a high level of uncertainty in the projections from model-based ordinations; however, they do not provide an easy way to quantify the uncertainty. Before describing ways to quantify and compare F I G U R E 1 Point estimates of latent factors for the spider data are shown in the left two panels, where the top row contains information from the spider presence analysis and the bottom row is from the spider abundance analysis. The right panels also contain the posterior distribution of site 25 as the point cloud. The uncertainty is smaller for the abundance data set than in the presence data analysis, but the variability is still relatively large Spider abundance uncertainty, we first note that with these ordination approaches, the interest is more in relative differences than absolute differences. In other words, the interest is not in the absolute value of the latent factors used for projection, but rather in the relative differences between the positions for sampling locations.  Another way to summarize the uncertainty in the estimated latent factors is to look at the width of the posterior credible intervals.
In particular, we considered the 95% HPD posterior intervals for the latent factors and report the average width of these intervals. The average width of the HPD intervals was 2.51 and 1.73 for presence and abundance, respectively, for the spider data set and 2.15 for the dune data set. The one-dimensional gradient used in Walker (2015) for the dune data set had similar levels of uncertainty. With the spider data set, abundance data models have substantially smaller levels of uncertainty than presence data. Hence to limit uncertainty, whenever possible, abundance data should be collected rather than absence or presence.
There are a few differences between the dune and spider data sets that explain the differences in HPD interval widths. To further explore the differences, the dune data set was converted to presence/absence and analyzed using the same priors as the spider analysis, resulting in a mean HPD width of 2.01. In comparing the average HPD width for the spider and dune data sets, in a binary setting, the dune data set has a narrower posterior interval. This is due to a combination of factors. The dune data set has a larger number of species, which likely leads to the increased precision in the latent factors; however, the dune data contains more zeros (67% of species/site combinations are zero) compared to the spider data set (45% of species/site combinations are zero). Comparing the results for the abundance data, the count data appear to contain more information than the ordinal percent cover data. The impact of the number of species observed as well as species rarity and data set sparsity will be further addressed in Section 4.
Unfortunately, ecologists have more control over the number of sample units than the number of species that occur in a sample unit.
In addition, including species that very rarely occur does not provide much information for estimating the latent factors. Fortunately, obtaining more sample units is helpful in that the species parameters are more precisely estimated which results in more precise estimates of site parameters as well.
Finally, it is not always possible to achieve a small amount of uncertainty in the latent factors; hence, we strongly urge reporting results in a way that reflects the uncertainty. In our R package, UncertainOrd, MCMC samples can be extracted to give two-dimensional HPD credible intervals around each latent factor. Using the dune analysis, Figure 3 contains examples of ways to show the projection of the latent factors that respect the uncertainty.

| S IMUL ATI ON S TUDY AND EFFEC T S IZE E S TIMATI ON
This section presents a set of simulation studies designed to assess the uncertainty in the latent factors as a function of the number of species, species frequency of occurrence, and the number of sites.
To simulate species responses, the framework in Equation (1)

| Presence/absence versus abundance simulations
To evaluate uncertainty in presence/absence and abundance ordinations, credible interval widths for latent factor estimates were assessed with a simulation analysis varying the number of species and sites. The goal is to understand the impact of n and p on credible interval width of the latent factors and compare across presence/absence and abundance data. Synthetic data were generated from a Poisson distribution with = = 0 and = = 1. Note that Section 4.3 describes an alternative scenario where these values can be estimated from a pilot study and specified. A factorial structure F I G U R E 4 Average HPD width for abundance data with specified n and p based on 50 replicates was used with 10, 25, 50, 75, and 100 units for species and sampling sites. Fifty replicates were included for each pair of species and sampling sites. The data were then analyzed as both a count response and a binary response, assuming only presence/absence is recorded.
The uncertainty in the latent factors for both scenarios can be seen in Figures 4 and 5. The abundance data allow more precise estimation of the latent factors. This is not surprising as useful information is discarded when counts are mapped to presence/absence. Thus, whenever possible, counts rather than presence should be recorded.

| Sparsity simulation
In ordination studies, utility of the species data is to learn about the sites. Accordingly, observing more species is one option for increasing the information content of the data. Unfortunately, these additional species may rarely occur; hence, we investigate the impact of adding rare species on the uncertainty in the estimates of the latent factors. To explore the effect of infrequently occurring species, we created a simulation using presence/absence data where rare species are included. Note, common heuristics suggest dropping species that occur less than m times, where m is generally 5 or less.
For all simulations, the number of sites was fixed at 20. This value for the number of sites is arbitrary, but puts the focus on impacts from changing the species data. We considered two scenarios: first a standard presence/absence data structure where the species effects are simulated from a standard normal distribution and secondly a framework where a subset of the species are extremely rare.
For both cases, standard normal distributions are used for simulating Z and θ. The site effects, α, are simulated from a normal distribution with mean of zero and standard deviation of 0.5 for all of the standard species. In general, this results in a hypothetical species being observed 50% of the time or at 10 sites, on average. Then, we consider a data set that is augmented with rare species that have effects simulated from a normal distribution with mean of −4 and standard deviation of 0.5. Approximately 75% of the simulated rare species had zero occurrences across the 20 sites. Across all combinations of site and species, the probability of occurrence is 0.027.
The width of the HPD intervals for the latent factors, calculated as the average width of the latent factors across the replicates, can be seen in Table 1.
Unsurprisingly, the data that include the rarer species has more uncertainty in the width of the credible intervals. However, note that the rare species do include some additional information such that 10 species plus 10 rare species contains more information than just 10 regular species. Furthermore, the information from 10 commonly occurring species and 10 rare species does lead to more uncertainty than 20 commonly occurring species.
F I G U R E 5 Average HPD width for presence data with specified n and p based on 50 replicates

| Interval width estimation
To better assist researchers designing ordination studies, we created a function, (HPD width) in the R package UncertainOrd, that allows users to enter various parameters and evaluate the anticipated HPD width both numerically and visually. The function takes the following arguments: family of the sampling distribution, n, p, μ α , μ β , σ α , and σ β .
When designing a study, a common question is how many sites/ species should be estimated. This simulation looks at the trade-off between n and p with parameters that control abundance and presence that relate to a specific data set. As an example, we revisited the spider data set to examine the impacts of changing n and p on the expected HPD width of the latent factors. Using the observed parameter values of the spiders data, we simulated data sets with additional species or sample units. The estimated parameters for all of these data sets are available in Table 2. In general, the results suggest that increasing the species count is slightly preferable to increasing the number of sites when the goal is to minimize the width of the credible intervals. For instance, with 40 sites and 50 species the HPD interval width of the latent factors is 0.97 and 2.04 which is smaller than 1.05 and 2.22 for 50 sites and 40 species. Given constraints on an experiment, this tool will allows researchers to use data-driven methods to choose the appropriate number of species or sites.

| D ISCUSS I ON
The goal of this research was to examine the uncertainty inherent in model-based unconstrained ordination and to provide tools for accurate reporting of the uncertainty in the latent factors associated with community composition gradients. The tools allow researchers to plan for uncertainty due to controllable factors, such as the number of sample units and species to observe. We strongly recommend that modelbased ordination results include some representation of uncertainty.
TA B L E 2 Estimated parameters for α and β for spider data set F I G U R E 6 Average HPD credible interval width for simulated scenarios using population characteristics from spider data set. With n = 28 and p = 12 the average width of the HPD intervals was 1.72 and 2.95 for abundance and presence respectively Model-based methods, using Bayesian credible intervals, provide a way to estimate uncertainty in model parameters associated with latent gradients. Large variability in gradients exhibited by these methods is not a reason to eschew model-based methods; rather on the contrary, accounting for uncertainty is an essential element in the scientific process.
In general, larger numbers of species and sampling locations help limit uncertainty in the underlying gradients. While observing a large number of total units at a site influences the uncertainty in the latent factors, the total number of sites also plays a role.
Where collecting species composition at additional sites also can help reduce variability in the latent factors. Collecting data as an abundance rather than presence/absence is recommended as this also helps limit uncertainty.
Given the complexity of latent factor models in solving for community composition gradients and the lack of additional information outside of species presence/absence or abundance, uncertainty is an unavoidable part of unconstrained ordination. Hence, we have created tools that permit projections in the estimates of the gradients that account for and express the uncertainty present. Accounting for and reporting uncertainty is an essential part of the scientific process and being transparent in the knowledge gained from a scientific study. These tools presented here can be used by ecologists to in the same manner as standard ordination projections, such as determining overlap between sampling units based on the species composition; however, credible regions account for uncertainty and provide valid statistical inferences.

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
AH and DWR jointly identified research idea. AH implemented methods/software, conducted data analysis, and wrote manuscript.
AH and DWR edited and approved manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data used in the manuscript are already publicly accessible. The dune data set is available in the vegan package in R (Oksanen et al., 2018). The spider data set is available in the mvabund package in R (Wang, Naumann, Eddelbuettel, & Warton, 2018).