A pathway for multivariate analysis of ecological communities using copulas

Abstract We describe a new pathway for multivariate analysis of data consisting of counts of species abundances that includes two key components: copulas, to provide a flexible joint model of individual species, and dissimilarity‐based methods, to integrate information across species and provide a holistic view of the community. Individual species are characterized using suitable (marginal) statistical distributions, with the mean, the degree of over‐dispersion, and/or zero‐inflation being allowed to vary among a priori groups of sampling units. Associations among species are then modeled using copulas, which allow any pair of disparate types of variables to be coupled through their cumulative distribution function, while maintaining entirely the separate individual marginal distributions appropriate for each species. A Gaussian copula smoothly captures changes in an index of association that excludes joint absences in the space of the original species variables. A permutation‐based filter with exact family‐wise error can optionally be used a priori to reduce the dimensionality of the copula estimation problem. We describe in detail a Monte Carlo expectation maximization algorithm for efficient estimation of the copula correlation matrix with discrete marginal distributions (counts). The resulting fully parameterized copula models can be used to simulate realistic ecological community data under fully specified null or alternative hypotheses. Distributions of community centroids derived from simulated data can then be visualized in ordinations of ecologically meaningful dissimilarity spaces. Multinomial mixtures of data drawn from copula models also yield smooth power curves in dissimilarity‐based settings. Our proposed analysis pathway provides new opportunities to combine model‐based approaches with dissimilarity‐based methods to enhance understanding of ecological systems. We demonstrate implementation of the pathway through an ecological example, where associations among fish species were found to increase after the establishment of a marine reserve.

Furthermore, relationships among species vary in time and space under changing biotic or abiotic conditions (Clark, Wells, & Lindberg, 2018); for example, species may only compete when resources are limiting (Perry, Mitchell, Zutter, Glover, & Gjerstad, 1994), in certain parts of their range (Pacala & Roughgarden, 1985), or in the absence of predation (Chase et al., 2002).
To analyze multi-species count data, many researchers have used methods based on an N × N matrix of dissimilarities, D, among sampling units (Anderson, 2001;Clarke, 1993). These approaches handle high-dimensional count data and reliably detect important changes in the structure of ecological communities (Anderson, 2001;Clarke, 1993;Legendre & De Cáceres, 2013). The focus here is to measure holistic changes in the identities of species and potentially also changes in species' relative or proportional abundances.
Dissimilarities, often calculated using measures that exclude joint absences, such as Bray-Curtis, Hellinger, or Jaccard (Legendre & Legendre, 2012), integrate information across all species to define ecological relationships among sampling units. Some measures (such as Gower's measure, 1971) also accommodate data having different types of variables (see Legendre & Legendre, 2012).
Most dissimilarity measures of interest to ecologists emphasize the extent to which two sampling units either share, or do not share, any species in common. Thus, important community-level concepts such as beta diversity (Anderson et al., 2011;Vellend, 2001) and turnover (Baselga, 2010), for example, are measured using dissimilarities. However, dissimilarity-based methods create no formal model of the original variables. Roles of individual species and relationships among them are not directly identifiable, as no species-specific parameters are estimated. Hence, one cannot predict the makeup of communities under defined scenarios, nor readily calculate power.
To characterize ecological communities and make species-level predictions under specified hypotheses, formal joint statistical models of the species variables are required. The multi-faceted challenge for developing such models is to deal simultaneously with (generally) over-dispersed, zero-inflated, high-dimensional, inter-related, mixed sets of (usually discrete) variables, including a host of rare species, for which no single multivariate statistical distribution can be readily articulated. A successful model of community data should allow a wide variety of species-specific marginal distributions that can flexibly change in time and space, while accounting for meaningful interspecific associations.
Models of multivariate abundance data include generalized linear models (GLMs) with (typically, for counts) a log link and either a Poisson or negative binomial (NB) error, and with correlations between species modeled using generalized estimating equations (GEEs, Wang, Naumann, Wright, & Warton, 2012;Warton, 2011;Warton & Guttorp, 2011). Alternatively, relationships among species can be modeled parsimoniously by including latent random variables as linear predictors in the GLM-called generalized linear latent variable models (GLLVMs, Hui et al., 2015;Niku et al., 2017;. The latent variables are intended to capture correlations due to unmeasured environmental drivers or biotic interactions, and one can use model selection to identify an appropriate number of latent variables to include in the model . More sophisticated GLLVMs can also accommodate spatial/temporal autocorrelation (Ovaskainen, Roy, Fox, & Anderson, 2016;Thorson et al., 2016Thorson et al., , 2015, the inclusion of species' traits/phylogenies (Ovaskainen et al., 2017), or variation in correlations at different hierarchical spatial scales (Ovaskainen, Abrego, Halme, & Dunson, 2016).
Generalized joint attribute models (GJAMs) have also been proposed for modeling multivariate ecological data (Clark et al., 2017).
GJAMs model covariances between mixed variable types (presenceabsence, ordinal, discrete, or continuous) on their original scales.
Discrete data (such as counts) are modeled via censoring, with partitions and weights chosen to allow linkages between different variable types (Clark et al., 2017). Partition widths and associated effort in interval censoring also can be chosen arbitrarily to accommodate mean-variance relationships. Imputation across censored intervals maps discrete variables into a multivariate normal (MVN) space to estimate covariances, with Bayesian analysis being used to estimate the latent (imputed) states.
We consider that copulas (Mai & Scherer, 2017) also hold great promise for flexible joint modeling of multivariate ecological count data (Popovic et al., 2018). A copula is a function representing a joint distribution as a mapping from the cumulative distribution functions (cdf) of its marginals, hence can be used to couple virtually any pair of variables (Mai & Scherer, 2017;Sklar, 1959). Copulas allow a tailored multivariate distribution to be constructed from two separate parts: (a) the univariate marginal distributions for each variable; and (b) the joint distributions of the variables in a multivariate copula space.
Our primary motivation for using copulas for ecological count data is that they allow any marginal distribution to be used for any variable. One does not need to forego the utility of the wide potential array of existing univariate statistical distributions, each with its own interpretable parameters, to build a joint model. Also, in copula models, associations among variables are modeled separately from their marginal distributions, making them easy to interpret, and the particular copula distribution used to model associations can also be flexibly chosen to fit a specific context. In contrast, latent variable models typically confound correlation structures with marginal distributions, as correlations among species are induced via latent variables that in turn alter the marginal distributions.
Here, we shall restrict our attention to Gaussian (MVN) copula distributions, and also to counts of species (abundance data) arising from one-way ANOVA-type designs, but the core ideas are readily extended to other types of mixed datasets, other copula distributions, and/or more complex sampling/experimental designs.
Gaussian copulas can draw on the rich statistical literature surrounding MVN distributions, while tailoring marginal distributions to nonnormal ecological variables. Fortunately, difficulties in estimating parameters for Gaussian copulas with discrete marginals (Faugeras, 2017;Genest & Nešlehová, 2007) have recently been surmounted (see Appendix 1).
The aim of this work is twofold. First, we provide an accessible description of copulas and show how they can work for ecological count data through a simple bivariate example. Second, we outline a pathway for the analysis of multivariate ecological count data that combines the use of copulas with dissimilarity-based methods. More specifically, copulas are first used to characterize the properties of individual variables and their associations in a formal parametric statistical model. Dissimilarity-based methods are then used to examine community-level patterns for whole assemblages of species that have been simulated from these copula models under defined scenarios.
Our analysis pathway (a) characterizes each individual species via estimation of marginal distributions and their associated parameters; (b) addresses high dimensionality (optionally) by screening data to identify significant pair-wise associations, using an index that excludes joint absences; (c) characterizes associations among species via estimation of a copula model and its associated parameters; and (d) proposes simulation from copula models to generate realistic ecological data under specified null or alternative hypotheses for model-based inference, ordination, and power analysis in dissimilarity-based settings. In the proposed pathway, we allowed both the marginal parameters and the copula parameters to vary across a priori groups, to maximize flexibility.
We demonstrate the analysis pathway with an example dataset: counts of fishes (p = 47 species) from the Poor Knights Islands, New Zealand (two of the 47 species are shown in Figure 1). Sampling occurred at three different times (September 1998: n 1 = 15, March 1999: n 2 = 21, and September 1999: n 3 = 20), spanning the establishment of a no-take marine reserve in October 1998 (Willis & Denny, 2000; data are provided in Supporting Information Table S1 F I G U R E 1 Two fish species found at the Poor Knights Islands, New Zealand: Chirodactylus spectabilis (top) and Parma alboscapularis (bottom). Photographs by Paul Caiger and are also available from the Dryad Digital Repository: https://doi. org/10.5061/dryad.3s6rm0f).

| A G AUSS IAN COPUL A FOR D ISCRE TE NON -NORMAL DATA
A copula is defined by Sklar's seminal theorem (1959). Let F be a pdimensional distribution function with margins F 1 , F 2 , …, F p . There exists a p-dimensional copula C such that for all (y 1 , y 2 , …, y p ) ∈ ℝ p the following holds: C is unique if F 1 , F 2 , …, F p are continuous. Conversely, if C is a p-dimensional copula and F 1 , F 2 , …, F p are univariate distribution functions, then the function F(y 1 , y 2 , …, y p ) is a p-dimensional distribution function (Mai & Scherer, 2017).
To understand how copulas work, consider that the position of an individual value y of a random variable Y having probability density function (pdf) f Y (y) is able to be expressed as a value along its cdf, denoted F Y (y), on the interval [0,1]. This provides a direct mapping of values from one pdf to another via their cdfs. This approach can also be used to map a value drawn from the pdf of a continuous random variable to the probability mass function (pmf) of a discrete random variable (Figure 2a). For example, consider a random variable Y ~ Poisson (μ = 2.5) and a standard normal variable Z ~ N(μ = 0, σ 2 = 1), whose cdf we will denote by C Z (z). A random value z drawn from Z can be mapped on to Y uniquely by taking the inverse function: . Thus, suppose we draw z = 0.524, then C Z (0.524) = 0.70; that is, we have drawn the 70th percentile of (1) F y 1 , y 2 , . . . , y p = C F 1 y 1 , F 2 y 2 , … , F p y p F I G U R E 2 Schematic diagram showing the mapping between the probability density function of a continuous variable and the probability mass function of a discrete variable via their respective cumulative distribution functions (cdfs); (a) the mapping from the continuous to the discrete yields a unique value; (b) the mapping from the discrete to the continuous is non-unique, but produces a range of values in the space of the continuous variable This simple mapping idea is exploited for modeling the joint distribution (association) between any pair of disparate types of variables in a copula. Here, we shall use a Gaussian copula to handle multiple associated variables simultaneously (but see Mai and Scherer (2017) A measure of association between two species that excludes joint absences is the index of association (Somerfield & Clarke, 2013;Whittaker, 1952): where y denotes the count for species in sampling unit i and I ℓ denotes the association between species and species ℓ. For red moki and black angelfish, there is a statistically significant positive association of I = 0.698 (p = 0.0001, 10,000 permutations).
To model this association, we can use a standard bivariate normal distribution for the copula function (with variables Z 1 and Z 2 ) having correlation parameter ρ = 0.574 (lower left panel, Figure 3). A random sample of z = (1.28, 0.67) in the copula space ( Figure 3 (Somerfield & Clarke, 2013).
One important complication, however, is the fact that a single point in the discrete data space corresponds to an entire region (a hyper-rectangle) in the copula space ( Figure 2b). Thus, estimation of the copula parameter(s) from discrete datasets is problematic, as the mapping of discrete values into the copula space is non-unique. We can address this by performing Monte Carlo integration over each discrete interval (Shi & Valdez, 2014). Computational efficiency is achieved through a Monte Carlo expectation maximization (MCEM) algorithm (Wei & Tanner, 1990). Estimation of the parameters of the correlation matrix for a Gaussian copula with discrete marginal distributions is described in greater detail in Appendix 1.
Standard bivariate normal (Gaussian) copula model with discrete marginal distributions for two fish species from the Poor Knights Islands. Points drawn from the copula space with correlation parameter ρ = 0.574 generate, through the specified negative binomial (NB) and zero-inflated NB marginal distributions, points in the discrete bivariate data space of counts that correspond to an expected inter-specific index of association of I = 0.698

| A COPUL A MODEL FOR ECOLOG I C AL CO U NT DATA
We propose the following steps to develop a full copula-based model for high-dimensional ecological count data:  ii. Identify significant associations among species to model. An index of association is calculated between every pair of species, the null hypothesis of no association is tested for each pair using P-values obtained by permutations (Somerfield & Clarke, 2013), and the significance level for the tests is suitably adjusted for multiple tests (see Supporting Information Data S2 for R code).
For efficiency, species flagged as "rare" may (optionally) be omitted. For the adjustment, one may use, for example, an exact family-wise error rate (FWER), empirically derived from the full set of permutation distributions (Wheldon, Anderson, & Johnson, 2007), or a more conservative per-comparison error F I G U R E 5 Heat maps of estimated copula correlation parameters for samples obtained at each of three different times (September 1998, March 1999, and September 1999 among fish species from the Poor Knights Islands that showed at least one statistically significant index of association (using a per-comparison error rate of 0.01 as a filter, e.g., see Figure 4c) Time 2 Mar 1999 Time 1 Sep 1998 Time 3 Sep 1999 rate (PCER). We used a PCER of 0.01 for the fish data. This was done separately for each group. We identified five significant associations (involving 10 species) in September 1998, followed by a sharp increase to 20 associations (involving 17 species) in March 1999 (Figure 4c), and subsequent decrease to five associations (involving eight species) in September 1999. This step acts as a filter to reduce the size of the estimation problem for building a copula model and omits joint absences in the assessment of species' associations. It is, however, optional; one could allow the copula model in step (iii) to include all inter-specific associations.
iii. Build a copula model and estimate its parameters. We may reduce the problem to a subset of species, m ≤ p, showing a significant association with any other species. Given this subset of m species identified in step (ii), and each of their marginal distributions from step (i), estimate the parameters of the Gaussian copula's correlation matrix (see Appendix 1 for details; R code is provided in Supporting Information Data S3). We condition the estimation of copula correlation parameters on the fixed marginal distributions, which is both practical (ensuring marginal distributions fit each individual variable well) and efficient (Joe, 2005). Here, we estimated a separate copula correlation matrix for each group. Note that we may have chosen not to implement step (ii) above, or perhaps, even though step (ii) may reduce dimensionality dramatically (m ≪ p), we may still have N < m, or m may begin to approach N such that some form of regularization is still desirable. In such cases, as we are using Gaussian copulas, a variety of methods for regularizing the inverse covariance matrix may be considered (Friedman, Hastie, & Tibshirani, 2008;Schäfer & Strimmer, 2005;Ullah & Jones, 2015;Yuan & Lin, 2007); for simplicity, we shall not pursue the topic of regularization further here. Furthermore, we hasten to add that neither rare species nor unassociated species are omitted from the copula models that follow, but they are presumed to be independent of other species. For the fish dataset, species' associations varied through time, and structured groups of associated species were easily seen in copula correlation matrices ( Figure 5). There was an increase in the strength of associations after the establishment of the F I G U R E 6 Schematic diagram of an overall pathway for analyzing ecological community data consisting of the following steps: (a) characterize marginal distributions for each species; (b) identify associations to model; (c) fit the copula;  1999). This is unlikely to have been a seasonal effect, as only three of the 10 species that showed significant associations in September 1998 did so in the following September 1999 ( Figure 5).

| S IMUL ATE DATA AND VISUALIZE RE SULTS
An overall pathway to model, simulate, and visualize ecological community data using copulas is shown schematically in Figure 6.
Once . Expand this to obtain a (p × p) correlation matrix ̂ i for group i by placing zeros in the remaining off-diagonal elements (corresponding to species that will be modeled as independent of one another), and 1s along the remaining diagonal elements.

Draw a random sample (a vector of length p) from the p-dimen-
sional Gaussian copula distribution whose correlation matrix is ̂ i .
Map the values obtained for each dimension in the copula space through the cdf of the individual marginal distribution for each species to obtain a simulated count value for each of the p species in the community.
Ordination plots of simulated data with original data can be used as a simple diagnostic to assess how sensible the model may be.
What is usually of greater interest, however, is to consider changes in community structure among groups in the high-dimensional space articulated by the model. For this, we desire an ordination of the group centroids, along with some measure of the expected variation in those centroids (based on the model). This can be examined on the basis of any dissimilarity measure of interest.
Suppose there are n i sampling units in group i and N = ∑ g i=1 n i . One generates a new (N × p) matrix of simulated data Y sim under the full copula model by drawing the n i sampling units, consisting of p-length vectors y sim , for each group in accordance with that group's estimated copula correlation matrix ̂ i , marginal distributions, and associated parameters. This is then repeated N sim times (where N sim is typically somewhat large, say N sim = 100). One can then construct a super-matrix Y super of dimension (N(N sim + 1) × p) which (row-wise) stacks the original matrix (Y) together with all of the Y sim matrices obtained via simulation under the model. From this, a chosen dissimilarity measure is calculated to yield D super . We wish to map the N sim × g centroids for all of the groups from every simulated dataset along with the g original centroids onto an ordination diagram.
We can do this by calculating the distances among the (N sim + 1) × g centroids from the D super matrix directly (Anderson, 2017) to obtain D cen . Metric multi-dimensional scaling (mMDS) can be used to visualize the distributions of centroids for each group under the model, along with the original group centroids. Kernel density contours (Duong, 2007) clarify the shapes of these distributions in the ordination space (Figure 6e).

F I G U R E 7
Ordinations based on Bray-Curtis dissimilarities of square-root-transformed abundances of fishes (47 species) from the Poor Knights Islands at three different times, obtained using: (a) metric multi-dimensional scaling (mMDS) of distances among centroids for the original data (black symbols), also showing centroids of 100 datasets (colored symbols) generated under the full copula model (parameters estimated separately for each group, shown as three different colors) along with kernel density contours; and (b) non-metric MDS plot of the original data, with replicate sites in each of three groups shown with three different colors Using this approach for the Poor Knights' dataset, the observed centroid for each group falls well within the distribution of centroids for that group generated under the copula-based model (Figure 7a). The dispersion/shape of centroid distributions in the reduced-space ordination also clearly varied among groups.
This graphic is far more informative than the (typically drawn) non-metric MDS plot of the original data (Figure 7b). The latter is dominated by large residual variation within each group and concomitant high stress, which masks group differences. Consider how, in the univariate analysis of data from ANOVA-type study designs, one typically plots the means for each group, along with their associated standard errors (which sensibly assumes normality for the distribution of means under the central limit theorem), to visualize both the relative positions and the variation in the group means. Similarly, the ordination in Figure 7a  (a) take the full set of N = 56 sampling units and estimate a single set of marginal and copula parameters from these data (acknowledging no a priori groups, so H 0 is true); (b) generate three groups of "mock" data (with sample sizes of n 1 = 15, n 2 = 21, and n 3 = 20) directly from that model, then treat this dataset as if it were our "observed" data, but here we know that H 0 is actually true; (c) estimate marginal and copula parameters separately for each of these "groups" in our mock dataset; and then (d) simulate data and draw distributions of centroids under H A and H 0 in the same way as was done for Figure 8a.
For the mock dataset (Figure 8b), the distributions of centroids for the three groups generated under H A (three different colors) do appear separate from one another. This is because they arose F I G U R E 8 Ordinations based on Bray-Curtis dissimilarities of square-root-transformed abundances of fishes (47 species) from the Poor Knights Islands at three different times, obtained using: (a) metric multi-dimensional scaling (mMDS) of distances among centroids for the original data (black symbols) along with centroids of 100 datasets generated under three separate copula models for each of the three groups (colored symbols, H A is true) and centroids of 100 datasets obtained by random permutation of the sampling units among the three groups (gray symbols, H 0 is true); (b) mMDS generated in the same manner as in (a), but where data consisted of "mock" observations where H 0 was known to be true (see text for details)

(b)
H 0 is known to be true (a) from three different sets of estimated parameters, even though the groups themselves, as we know, were, in this case, completely arbitrary. However, quite tellingly, the mock "observed" centroids (black symbols) also lie within the distribution of centroids under H 0 (gray symbols). The overlapping of contours for centroid distributions drawn under H A with those drawn under H 0 also suggests a lack of real difference among the groups. Note too that there is quite high stress here (>0.24)-yet another signal that there is no distinctive group structure to display (Figure 8b).
This can be contrasted with the clear group structure apparent in

| Model-based inference
We may generate data under a specified null hypothesis (H 0 ) to achieve model-based inference. One might consider a null hypothesis that asserts there are no groups and estimate a single set of parameters (marginals plus copula) for the full set of data. However, armed with a full copula model, having estimated separate parameters for each group, we may instead assume a simple null hypothesis that every sampling unit has an equal probability of arising from any group. Thus, we can generate Y sim under H 0 such that each vector y sim is drawn under a multinomial with probabilities of 1/g for each F I G U R E 9 Schematic diagram showing methods of simulation using multinomial mixtures from copula models (with given parameters for the marginal, M, and copula, C, distributions) under (a) a null hypothesis of every sampling unit having an equal probability of arising from any of the groups; or (b) a specific alternative hypothesis for the case of three a priori groups of sampling units and where the alternative hypothesis asserts that all three groups are different from one another Simultaneously alter the multinomial for each group in a series of steps to create a mixture between H 0 and H A . The usual permutation-based test of pseudo-F is distribution-free, so is preferable for robust inference (in this case, with 4,999 permutations, it yielded an identical p-value to the model-based p-value); however, a close match between the model-based distribution and the permutation distribution (e.g., Figure 10a) provides support for the validity of the model's assumptions.

| Power analysis
Copula-based models can be used to calculate power, thus to com- pseudo F θ, and/or π) between H 0 and H A separately for each species can produce unrealistic combinations of parameters. A smooth monotonic power curve is obtained, however, by modeling the continuum between H 0 and H A as a sliding scale of mixture probabilities.
Let denote a g × g matrix of elements ij , the probability of drawing a sample y sim for simulated group i (rows) from original group j (columns). For the Poor Knights' dataset, we may consider matrices of probabilities under H 0 ( 0 ) and under H A ( A ) as: respectively ( Figure 9b). Next, let f k be the fractional probabilistic distance from H 0 to H A in a chosen number of steps (k = 1, …, n steps ), beginning with f 1 = 0 when H 0 is true and f n steps = 1 when H A is true ( Figure 9b). Note that rejecting H 0 is logically distinguishable from the assertion that H A is true. Note also that a wide variety of H A may be specified. At a given step k, the probabilities k used to simulate data are as follows: To provide an example, we generated power curves with n steps = 20 and 1,000 simulated datasets per step for the Poor Knights' dataset and compared the empirical power of PERMANOVA with canonical analysis of principal coordinates (CAP; Anderson & Robinson, 2003;Anderson & Willis, 2003). Tests were done using 999 permutations for each simulated dataset, and CAP analyses were done using seven principal coordinates (which maximized allocation success). When all variables were included, PERMANOVA was more powerful than CAP ( Figure 10b); however, when only a subset of fish species having strong associations were included (namely, those having at least one association with an estimated ρ ≥ 0.70 in the copula model; there were 16 of these), then CAP was more powerful than PERMANOVA ( Figure 10b). As an aside, we noted that generating power curves using a slightly different null hypothesis (i.e., where data under H 0 were generated from a model having a single set of parameters estimated from the full set of N sampling units rather than being a multinomial mixture of equiprobable draws from three groups having three separate sets of parameters) made no substantive difference to any of the above results.

| D ISCUSS I ON
Copulas provide a rich and flexible approach for modeling associations among disparate types of variables. Recent advances in statistical methods to estimate parameters for copulas having discrete marginal distributions using MCEM (see Appendix 1 below) open new doors for modeling count data. By allowing marginal and copula parameters to vary over time, we uncovered a striking increase in the strengths of associations among fish species after the cessation of fishing at a no-take marine reserve ( Figure 5). Naturally, the ecological mechanisms responsible for generating these associations cannot be inferred from observational data alone, but would require additional investigations.
Generalized linear models, GLLVMs, and GJAMs all have tremendous potential for capably modeling count data, particularly if they are extended to allow for changes in over-dispersion or zero-inflation within a species, and changes in correlations among species in different habitats. They do, however, have a few natural limitations.
GJAMs avoid using classical statistical distributions, but this comes at a cost-the utility of explicit count distributions for characterizing individual species as univariate variables is lost, and how to choose partition widths to accommodate different mean-variance relationships remains unclear. In GLLVMs, the relationship between each species and each latent variable is effectively linear on a log-scale (when the default log link for count data is used), which may or may not be appropriate/desirable. Also, the latent variable mechanism for inducing correlations will affect estimation of individual species' over-dispersions (and vice versa), making these two conceptually distinct features difficult to disentangle.
Copula models can be readily extended to include other types of variables commonly encountered in ecology, such as biomass, percentage cover, ordinal data, or mixtures of these with counts. They share some of the desirable features of GLLVMs and GJAMs while presenting some distinct advantages. All three approaches can use an underlying MVN distribution to model associations, but copulas can also use other association models, accommodate a wider variety of parametric marginal distributions than GLMs, and do not entangle the association model with the marginal model. Although outside the scope of the present study, a natural next step would be to explore the predictive capabilities of GLLVMs, GJAMs and copula models across a broad range of ecological datasets.
A fundamental question remains: What are the limits of our approach? For multivariate data, the available degrees of freedom (df), provided sampling units are independent, are likely to be bounded such that N ≤ df < (N × p), and to depend on the level of species' interassociations. One alternative to our proposed preliminary screening for significant pair-wise associations would be to identify subsets consisting of coherent groups of associated species (Somerfield & Clarke, 2013). These practical permutation-based approaches may be complimented (or replaced) by direct regularization/shrinkage of the copula covariance matrix (Schäfer & Strimmer, 2005). One might also consider joint estimation of copula and marginal parameters.
In any case, how to assess parsimony/model complexity in the full framework is an open question.
We have focussed on an ANOVA-type study design. We chose to fit separate parameters (copula plus marginals) to data from each group. However, sampling units might, instead, occur along one or more measured environmental gradients. Continuous predictors can be included in marginal distributions for each variable (as in GLMs; Warton et al., 2015, Niku et al., 2017Popovic et al., 2018). However, responses of species to gradients are generally unimodal and may be modeled this way (Jamil & ter Braak, 2013;Yee, 2004Yee, , 2015, either along each margin or potentially inside the Gaussian copula space. Associations would remain constant using such an approach; however, copula correlations themselves might also be modeled as a function of environmental variables (Nikoloulopoulos & Karlis, 2010)-an idea worth pursuing.
Our approach catered well to varying zero-inflation, but did not optimize models of rare species. Rare taxa are difficult to model (Elith et al., 2006;Fithian, Elith, Hastie, & Keith, 2015) and may not occur randomly; some sites harbor greater coincidences of singletons (Ellingsen, Hewitt, & Thrush, 2007). SDMs can fail to capture the nature of inter-specific associations reliably, particularly for organisms having low probabilities of occurrence (Zurell et al., 2018).
Observational data are often too sparse to model rare taxa well individually, but richness (number of species per sampling unit) can be modeled as a Poisson (or Poisson-binomial) random variable (Calabrese, Certain, Kraan, & Dormann, 2014;Gavish et al., 2017).
Thus, future model developments could include richness as an additional response variable in a multivariate copula. Relationships between richness and abundances of prevalent species (or environmental variables) could be estimated, allowing potential clustering of rare taxa.
The proposed analysis pathway enables researchers to achieve a greater understanding of the roles and relationships among individual species, as well as providing a novel approach to ordination and power analysis for investigating community-level hypotheses. A unique feature of this framework is that we do not consider model-based methods (such as GLMs, GLLVMs, GJAMs, or copulas) as running counter to dissimilarity-based methods (such as ANOSIM, MDS, PERMANOVA or CAP). Rather, they are complementary: It is not a case of "either, or," but a case of "yes, and…." Probabilistic statistical models are essential for characterizing assemblages on a per-species basis, including estimation of useful interpretable parameters (e.g., Supporting Information  (Figures 9b and 10b). By using the latest modelbased approaches in tandem with evolving community-level approaches, as proposed here, we can draw the best from both worlds.
We consider that copula-based joint models of species count data, particularly when combined with dissimilarity-based tools, provide a rich new suite of flexible methods that will generate many new scientific insights in the analysis of ecological communities.

ACK N OWLED G M ENTS
MJA was generously supported by a James Cook Fellowship from the Royal Society of New Zealand. AEP was supported by a travel grant from PRIMER-e (Quest Research Limited).

AUTH O R CO NTR I B UTI O N S
MJA and PD conceived the ideas, PD contributed mathematical formulations for the MCEM and wrote Appendix 1, AP provided R code with contributions from PD and MJA, AEM proposed the idea of mixture models for power analyses, and MJA wrote the manuscript.
All authors approved the final draft of the manuscript.

DATA ACCE SS I B I LIT Y
Data consisting of counts of abundances of fishes from the Poor Knights Islands are provided as Supporting Information Table S1 and are also available from Dryad Digital Repository: https://doi.

APPENDIX 1
This appendix outlines the core idea of using a Monte Carlo Expectation Maximization (MCEM, Wei & Tanner, 1990) algorithm to estimate the covariance matrix for a Gaussian copula with discrete marginal distributions that is constrained to have unit variances along the diagonal (hence takes the form of a correlation matrix). By constraining the Gaussian copula covariance matrix in this way, our algorithm ensures that the variances of individual variables remain precisely as specified by their marginal probability mass functions. For other examples of Gaussian copulas for discrete data, see Dauwels, Yu, Xu, and Wang (2013) and, more recently, Popovic et al. (2018), who describe a general covariance modeling framework, including latent variable graphical models (Meng, Eriksson, & Hero, 2014) and sparse factor analysis (Carvalho et al., 2008).

BA S I C M CE M
The general setup for MCEM is as follows (see Dempster, Laird, & Rubin, 1977;Wei & Tanner, 1990). Let y be an observed data vector.
Let x be latent states or "missing data." Let θ be parameters. Define f(y|x; θ) and f(x; θ) to be the probability density (or mass) functions of y given x and x, respectively, as indicated by the arguments. The likelihood to maximize is as follows: Define the maximum likelihood estimate we seek as The MCEM algorithm works as follows: 1. Start with some initial value θ k = 0 of θ.

3.
Find θ k + 1 as: The Monte Carlo average is the approximation of: 1. Repeat the previous two steps, which are known as the "E"xpectation step and the "M"aximization step.
Without the Monte Carlo approximation, it can be proven that, provided θ k is not a stationary point, iterations will always yield L(θ k + 1 ) > L(θ k ), so that θ k converges to ̂, unless there are local maxima or other stationary points. With the Monte Carlo approximation, iterations will not converge to a single value, but will instead oscillate about ̂, with a precision that depends on the number of Monte Carlo samples, m. Thus, the value of m should be determined by the required precision of the results.
Operationally, the beauty of MCEM is that we maximize a simple average of log probabilities at each step. This is much easier to work with than the (log of the) expected value of the probabilities from the latent states.

M CE M FO R D I S CR E TE G AUSS I A N CO PU L A S
In what follows, we shall think of each discrete observation as having an unobserved fractional component. Let y i = (y i1 , …, y ip ) be counts of p species from sampling unit (or "site") i. Subscript i will be dropped in discussing the likelihood contribution for one site's data only. Let f j (y j ) and F j (y j ) be the marginal probability mass function and cumulative distribution function, respectively, for species j, assuming y j takes non-negative integer values. Dependence on parameters, θ, is implicit. Let f(y) = ∏ f j (y j ) and F(y) = [F 1 (y 1 ), …, F p (y p )]. Let g(y) be the joint density function defined using a copula. g (y) = (z) where ϕ Σ is the multivariate normal density with zero means and unit variances and correlation matrix Σ, I is the identity matrix, and z = (z 1 , …, z p ) is defined by z j = Φ −1 F j y j , j = 1,…, p where Φ is the standard normal CDF.
Let f j y j be a probability density function defined by spreading the probability mass f j (y j ) uniformly in the interval [y j , y j + 1); that means: Now the discrete copula likelihood can be written as the following integral over a hypercube: where i is for sites, and the integral is over u ∈ [0,1] p , where p is the number of dimensions.
To work in the space of the multivariate normal variables defined by z, we change variables to z = Φ −1 F (y + u) , so y + u =F −1 (Φ (z)) and This yields where the integral is over z ∈ Φ −1 F (y + u) for u ∈ [0,1] p . That is the region of z that corresponds to y + u falling in the interval [y j , y j + 1).
Another way to write this is by integrating over the entire range of z and using an indicator function to exclude z outside the range given above. This is written as where I condition is 1 if "condition" is true, and 0 otherwise. Now we are ready to view this in the MCEM framework: 1. The "latent variable" is z, whose probability density is ϕ Σ (z); 2. The parameters are the elements of Σ; 3. The "observed data" are the values of z falling in the valid range