Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions


  • Corresponding Editor (ad hoc): P. R. Peres-Neto.


Abstract

Signals of species interactions can be inferred from survey data by asking whether some species occur together more or less often than expected by chance, or more generally, whether any structural aspect of the community deviates from that expected from a set of independent species. However, a positive (or negative) association between two species does not necessarily signify a direct or indirect interaction, as it can result simply from the species having similar (or dissimilar) habitat requirements. We show how these two factors can be separated by multivariate logistic regression, with the regression part accounting for species-specific habitat requirements, and a correlation matrix for the positive or negative residual associations. We parameterize the model using Bayesian inference with data on 22 species of wood-decaying fungi acquired in 14 dissimilar forest sites. Our analyses reveal that some of the species commonly found to occur together in the same logs are likely to do so merely because of similar habitat requirements, whereas other species combinations are systematically over- or underrepresented even after, or only after, accounting for the habitat requirements. We use our results to derive hypotheses on species interactions that can be tested in future experimental work.


Introduction

A fundamental aim of population biological research is to develop a mechanistic understanding of the processes shaping the ecology, genetics, and evolution of natural populations. This aim is hard to reach because the underlying processes are difficult to observe directly and often too complex to be derived from first principles. It is much easier to acquire data on patterns resulting from the processes of interest, but an indirect approach poses challenges for data analysis and interpretation. A case in point is research on community assembly processes, where, e.g., the presence and magnitude of competitive or mutualistic interactions have been inferred by comparing patterns of species co-occurrence to those generated by null models (Diamond 1975, Connor and Simberloff 1979, Hastings 1987, Kelt et al. 1995, Gotelli 2000, Gotelli and McCabe 2002). As positive (negative) associations can result either from the species having similar (dissimilar) habitat requirements or from direct or indirect interactions, it is difficult to make conclusive inference on the underlying causal factors behind the observed associations.

The effects of species interactions and habitat requirements on co-occurrence patterns can be disentangled, at least to some extent, by accounting for the habitat requirements in the null model. Peres-Neto et al. (2001) modeled the presence-absence of each species first individually as a function of environmental variables, and then used the resulting models to construct randomized species distributions. As the simulated data were generated independently for each species, the simulated communities corresponded to an environmentally constrained null hypothesis. In this paper we combine the two above-mentioned steps of Peres-Neto et al. (2001) by constructing a multivariate logistic regression model for the entire species community. Using a multivariate approach for community modeling is conceptually appealing, as the data are clearly of multivariate nature: for each sampling unit, there is a vector of zeros and ones describing which particular species were absent and which were present in that sampling unit. A multivariate model of species co-occurrence describes the probability of obtaining a given combination of zeros and ones, accounting for possible dependences between the species occurrences. The multivariate logistic regression approach can be used for, e.g., generating simulated species distributions that account for non-independence among the species. As we will show, this provides rigorous means for assessing model fit, for quantifying effect sizes, and for comparing effect sizes among multiple data sets.

The idea of using a multivariate model for a species community is not new as such. For example, multivariate adaptive regression splines (e.g., Leathwick et al. 2006) examine if species occurrences are explained by a shared set of environmental covariates. However, these models do not account for non-random co-occurrence after accounting for the environmental covariates. In the hypothetical case of no variation in environmental conditions, multivariate adaptive regression splines would fail to give any information about species associations. In contrast, the multivariate logistic regression model that we present here would pinpoint which species co-occur more or less often than expected by chance.

We illustrate our approach in the context of wood-decaying fungi, with the aim of extracting signals of species interactions from survey data. In this species community, species interactions are likely to play a key role in determining which species are able to grow large enough mycelial biomass inside the dead tree and succeed in producing fruit bodies. The fungal community inside a single log typically consists of several tens of species, present as mycelia or as latent propagules, but only a small fraction of these fruit at a given time (Ovaskainen et al. 2010). Laboratory studies suggest that competition for space and nutrients is the most common type of interaction between wood-decaying higher fungi (Boddy 2000, Heilmann-Clausen and Boddy 2005), though many types of facilitative and even mutualistic interactions have been documented as well.

Many species of wood-decaying fungi have declined (Lonsdale et al. 2008) as a result of loss of natural forests and reduced amount of dead wood in managed forests (Siitonen 2001, Jonsson et al. 2005). Responses of individual species to habitat loss and fragmentation are likely to be modified by species interactions. Conversely, reduced availability of dead wood and forest fragmentation have led to changes in relative abundances of wood-decaying fungi and presumably to changes in species interactions. As the species can be readily observed in nature only at the fruit-body stage, it has remained difficult to link the laboratory-based results on interactive mechanisms to community dynamics observed under natural conditions.

In this paper, our aims are to (1) introduce the multivariate logistic model as a statistical tool for modeling species co-occurrence, (2) quantify species-to-species associations in wood-decaying fungi based on fruit-body data, (3) examine how robust these associations are across study sites with varying degrees of naturalness and isolation, and (4) relate species-to-species associations at the fruit-body stage to earlier results on this species community and use them to generate new hypotheses on interacting species pairs.

Material and Methods

Multivariate logistic regression

We start from the familiar univariate logistic regression model, with data on species occurrence and covariates on n sampling units. We use yi to denote the occurrence of the focal species in the unit i, yi = 1 corresponding to the presence and yi = 0 to the absence of the species. The logistic regression model reads as

$$\operatorname{logit}\left(\Pr(y_i = 1)\right) = \mu_i \tag{1}$$

$$\mu_i = \sum_{l=1}^{k} x_{l,i}\,\beta_l$$

Here Pr( ) stands for probability, the logit link function is defined as logit(x) = log(x/[1 − x]), k is the number of covariates used as explanatory variables, xl,i is the value of covariate l in a sampling unit i, and βl is the regression coefficient, i.e., the effect of covariate l on the occurrence of the species.
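As a minimal sketch of the univariate model, the occurrence probability can be computed from a design matrix and a coefficient vector as follows. The function name, the covariate values, and the coefficients are made up for illustration, and the intercept is assumed to enter as a column of ones in the design matrix.

```python
import numpy as np

def occurrence_probability(X, beta):
    """Pr(y_i = 1) when logit(Pr(y_i = 1)) = mu_i = sum_l x_{l,i} * beta_l."""
    mu = X @ beta                       # linear predictor, one value per sampling unit
    return 1.0 / (1.0 + np.exp(-mu))    # inverse of logit(x) = log(x / (1 - x))

# Hypothetical example: a column of ones (intercept) plus one made-up covariate.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta = np.array([-1.0, 0.5])            # illustrative values, not estimates
p = occurrence_probability(X, beta)
```

Applying the logit function to `p` recovers the linear predictor `X @ beta`, which is the defining property of the link function.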

In the case of presence–absence data on a community of m species, the response variable is a vector yi = (yi1, . . . , yim) describing the occurrences of all the m species in each sampling unit i. This vector has 2^m possible states, and thus the full solution of modeling the probability for each species combination separately becomes infeasible with more than a few species. A practical approach is to model only the pairwise associations, and let the higher-order associations be implicitly defined through structural model assumptions. To formulate such a model, we follow here O'Brien and Dunson (2004), who noted that Eq. 1 can be equivalently written as yi = 1(zi > 0) where the function 1(x > 0) is the indicator function with value 1 if x > 0 and with value 0 otherwise. The latent variable zi is defined as zi = μi + logit(F[ei]), where F is the cumulative distribution function of N(0, 1) (normal distribution with mean 0 and variance 1), and ei is a random deviate from N(0, 1). Because F(ei) is uniformly distributed on (0, 1), logit(F[ei]) follows the standard logistic distribution, which is why the two formulations coincide. This alternative formulation of Eq. 1 is a convenient starting point for expanding to the multivariate case. Using the index i for the sampling units and the index j for the species, the multivariate logistic regression model is defined by the following (O'Brien and Dunson 2004, Holmes and Leonhard 2006):

$$y_{ij} = 1(z_{ij} > 0)$$
$$z_{ij} = \mu_{ij} + \operatorname{logit}\left(F(e_{ij})\right)$$
$$\mu = X\beta$$
$$e_i \sim N(0, R) \tag{2}$$

Here the data Y (with elements yij) are arranged as an n × m matrix, X is the n × k design matrix, β is the k × m matrix of regression coefficients, and R is an m × m correlation matrix (a symmetric positive definite matrix with diagonal elements equal to 1). As each component of ei is distributed as N(0, 1), it follows that the species-specific marginal distributions of Eq. 2 are identical to univariate logistic regression (Eq. 1).
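The latent-variable construction can be illustrated by simulating presence–absence data for two species. The following is a sketch only: the function names, the values of μ and R, and the sample size are assumptions chosen for illustration, and the authors' own implementation is the one in the Supplement.

```python
import math
import numpy as np

def normal_cdf(e):
    """Standard normal cumulative distribution function F, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(e / math.sqrt(2.0)))

def simulate_community(mu, R, rng):
    """Draw presence-absence data Y (n x m) from the latent-variable model."""
    n, m = mu.shape
    e = rng.multivariate_normal(np.zeros(m), R, size=n)   # e_i ~ N(0, R)
    u = normal_cdf(e)                                      # F(e_ij), uniform on (0, 1)
    z = mu + np.log(u / (1.0 - u))                         # z_ij = mu_ij + logit(F(e_ij))
    return (z > 0).astype(int)                             # y_ij = 1(z_ij > 0)

rng = np.random.default_rng(1)
R = np.array([[1.0, 0.7],
              [0.7, 1.0]])                 # illustrative correlation matrix
mu = np.tile([0.0, -1.0], (20000, 1))      # same linear predictor in every unit
Y = simulate_community(mu, R, rng)

prevalence = Y.mean(axis=0)                     # should be close to logistic(mu)
co_occurrence = (Y[:, 0] * Y[:, 1]).mean()      # fraction of units with both species
```

In this sketch the simulated prevalences stay close to the univariate logistic probabilities, while the positive off-diagonal element of R raises the co-occurrence frequency above the product of the two prevalences.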

The elements of the matrix R describe whether a given species pair co-occurs in the same sampling unit more or less often than expected by chance, after accounting for the k explanatory variables included in the model. For two species (m = 2), there is only one free parameter, i.e., the matrix element r12 = r21. For m species, R has m × (m − 1)/2 degrees of freedom. This is much less than the full size of the state space (2^m), making parameter estimation feasible, but at the cost of making a fixed assumption on how higher-order associations relate to pairwise associations.
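The gap between the number of pairwise parameters and the size of the state space grows quickly with the number of species, as the short calculation below illustrates (the function names are ours). For 16 species there are 120 correlations against 65,536 possible species combinations, and for 22 species 231 against 4,194,304.

```python
def n_pairwise_parameters(m):
    """Free elements of an m x m correlation matrix: m * (m - 1) / 2."""
    return m * (m - 1) // 2

def n_community_states(m):
    """Number of distinct presence-absence vectors for m species: 2^m."""
    return 2 ** m

# Compare the two quantities for a few community sizes.
for m in (2, 8, 16, 22):
    print(m, n_pairwise_parameters(m), n_community_states(m))
```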

Fig. 1 illustrates how the mean vector μ determines the prevalence of each species, while the matrix R adjusts the number of co-occurrences. In terms of numbers of co-occurrences, the difference between the full model and the null model can be measured as

$$\Delta = \Pr(y_{i1} = 1,\, y_{i2} = 1 \mid \mu, R) - \Pr(y_{i1} = 1,\, y_{i2} = 1 \mid \mu, I)$$

The method of Peres-Neto et al. (2001) is based on comparing the number of observed co-occurrences to that predicted by a null model, and it can thus be viewed as using Δ as a measure of nonrandomness in co-occurrence. In contrast, the multivariate logistic regression model uses the matrix R as the measure of nonrandomness in co-occurrence. Table 1 shows that, with an equal matrix R, the value of Δ depends on the species' prevalences, and thus the two methods differ in their interpretation. With an equal matrix R, the value of Δ is highest when the prevalence of both species is 0.5, but decreases (eventually to zero) as the species' prevalences go to one or zero. This is consistent with intuition, as, e.g., with very low prevalences and thus very few occurrences, already a few co-occurrences indicate a highly nonrandom association between the species.
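The dependence of Δ on prevalence can be checked numerically. The sketch below assumes that Δ is taken as the difference in co-occurrence probability between a model with correlation r and the corresponding independent model; this definition, the function name, and the chosen prevalences are assumptions made for illustration. It uses the fact that z_j > 0 is equivalent to e_j exceeding the standard normal quantile at 1 − p_j, where p_j is the prevalence of species j.

```python
from statistics import NormalDist
import numpy as np

def delta(p1, p2, r, n_draws=100_000, seed=0):
    """Monte Carlo estimate of Pr(both present | r) - Pr(both present | r = 0)."""
    rng = np.random.default_rng(seed)
    R = np.array([[1.0, r], [r, 1.0]])
    e = rng.multivariate_normal(np.zeros(2), R, size=n_draws)   # e_i ~ N(0, R)
    # z_j > 0  <=>  F(e_j) > 1 - p_j  <=>  e_j > F^{-1}(1 - p_j)
    t1 = NormalDist().inv_cdf(1.0 - p1)
    t2 = NormalDist().inv_cdf(1.0 - p2)
    both = (e[:, 0] > t1) & (e[:, 1] > t2)
    return both.mean() - p1 * p2          # under independence, Pr(both) = p1 * p2

# Same correlation, three illustrative prevalence levels.
d50 = delta(0.50, 0.50, r=0.7)
d20 = delta(0.20, 0.20, r=0.7)
d05 = delta(0.05, 0.05, r=0.7)
```

With the correlation held fixed, the estimated Δ shrinks as the prevalences move away from 0.5, which matches the pattern described above.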

Figure 1.

An illustration of the multivariate logistic model (Eq. 2) with a positive association between two species (r12 = 0.7). The red, blue, and green colors correspond to the cases B, C, and D in Table 1, respectively (case A not shown). The centered circles correspond to the mean vectors μ, and the individual dots show 1000 randomizations of the latent variable z. The axes are shown both at the linear scale and at the probability scale.

Table 1. Translating the matrix R into a nonrandom pattern of species co-occurrence.

The field study on wood-decaying fungi

A polypore and dead-wood survey was conducted in southern Finland in 2000–2005 in 542 study sites, each visited once. As the tree species has a major structuring role for the species community, we consider the data separately for Norway spruce (Picea abies) and birch (Betula pubescens and B. pendula). To obtain sufficient statistical power, we selected for the present analyses, for each site and each tree species, only those species for which there were at least 15 occurrences. This resulted in a set of 16 species on spruce logs, with data from 14 sites, and eight species on birch logs, with data from four sites. Thus, we analyze the occurrence of 22 species (S1–S22) in 14 forest sites (F1–F14) in altogether 18 data sets (for details on sites and species, see Appendix A).

The study sites represent different kinds of managed and natural forests, with variation in age and the total volume of dead wood (Appendix A). Some of the stands are located in southernmost Finland with a long history of forest use, and others in eastern Finland where intensive large-scale forestry started relatively late and the proportion of natural and seminatural forests is higher than in southernmost Finland (Rouvinen et al. 2002, Lilja and Kuuluvainen 2005).

In the field, we measured log attributes that were expected to be of importance for the occurrence of wood-decaying fungal species (Renvall 1995): tree species, diameter, seven decay classes (from freshly decayed to almost completely decayed), ground contact (no ground contact, less than half of the log in ground contact, or more than half of the log in ground contact), fall type (uprooted, broken, or undefined if uprooted or broken), bark cover (0–100%), and epiphyte cover (0–100%). On each dead-wood object, we recorded the occurrences of the focal species, including all polypores (Aphyllophorales) and 16 other wood-decaying fungal species that are either threatened or indicator species. For more details on the field survey, see Appendix A.

Statistical inference

We fitted four variants of the multivariate logistic regression model to the data (Table 2). In the models M1(I) and M1(R), we did not include any covariates, and thus fitted only the intercepts β0 that correspond to the overall occurrence probabilities of each species. In the models M2(I) and M2(R), we included all measured substrate variables, i.e., diameter, decay class, ground contact, fall type, bark cover, and epiphyte cover. The models M1(R) and M2(R) include the correlation matrix R as a parameter to be estimated. In the models M1(I) and M2(I) the correlation matrix R was set to the identity matrix I (all off-diagonal elements equal to zero), and these models thus assume independence among the species. M1(I) can be considered as the simplest possible null-model, whereas M2(I) is an environmentally constrained null model in the sense of Peres-Neto et al. (2001). As we were interested in examining how robust the correlation matrix is among sites, we did not pool the data, but parameterized the four models independently for each of the 18 data sets.

Table 2. Model variants and their abbreviations.

We chose to fit the models to the data using a Bayesian approach, mainly because of its flexibility with hierarchical models (Clark 2005), and because of the possibility of bringing in structural prior information. Among possible choices for priors for covariance matrices (Barnard et al. 2000), we assumed the marginally uniform prior for R, so that the marginal prior for each matrix element is uniform in [−1, 1]. For each element of β we assigned an N(0, 10) prior with the following constraints. First, for each of the class variables, we set the maximal regression coefficient to zero, which can be done without loss of generality as the model includes the grand mean. Second, we allowed only for biologically plausible decay and ground contact profiles, meaning that the regression coefficients were not allowed to have a local minimum in these ordered categorical variables. We thus excluded the possibility that a species would prefer both extremes (e.g., freshly dead and almost completely decayed wood) more than the intermediate case (e.g., intermediately decayed wood). Third, we assumed that the regression coefficient for undefined fall type was between the regression coefficients for the classes uprooted and broken.
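The three constraints on the class-variable coefficients can be expressed as simple checks on a coefficient vector. The following sketch is illustrative only: the function names and the example values are hypothetical, and it does not show how the constraints are enforced within the MCMC sampler (see the Supplement for the source code).

```python
import numpy as np

def normalise_class_effects(coefs):
    """Shift class coefficients so that the largest one equals zero."""
    c = np.asarray(coefs, dtype=float)
    return c - c.max()

def has_local_minimum(coefs):
    """True if some class is lower than a class on each side of it."""
    c = np.asarray(coefs, dtype=float)
    for j in range(1, len(c) - 1):
        if c[:j].max() > c[j] and c[j + 1:].max() > c[j]:
            return True
    return False

def fall_type_ok(uprooted, broken, undefined):
    """True if the 'undefined' coefficient lies between the other two."""
    return min(uprooted, broken) <= undefined <= max(uprooted, broken)

# Hypothetical coefficients for seven ordered decay classes.
plausible = [-3.0, -1.0, 0.0, -0.5, -2.0, -4.0, -6.0]     # single peak: allowed
implausible = [0.0, -2.0, -1.0, -3.0, -4.0, -5.0, -6.0]   # dip at class 2: excluded
```

Here `has_local_minimum(plausible)` is false and `has_local_minimum(implausible)` is true, so only the first profile would be admissible under the second constraint.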

The posterior distributions were sampled through a Markov chain Monte Carlo (MCMC) approach (see Appendix B for the technical details and an illustration of how statistical power depends on sample size, and the Supplement for the source code). To examine model fit, we generated 1000 replicates of posterior predictive data for all data sets and for all four variants of the model, and compared the real data to the 95% quantiles of the posterior predictive data. As our main interest was in patterns of co-occurrence, we used as test variables the number of logs with a given species pair, and the number of logs with a given total number of species.
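The posterior predictive comparison can be sketched as follows, using the two test variables named above. The data here are made up: the "observed" matrix contains a built-in positive association, and the replicates are drawn from an independence model with matching prevalences as a stand-in for real posterior predictive draws. All function and variable names are hypothetical.

```python
import numpy as np

def pair_counts(Y):
    """Number of sampling units in which each species pair co-occurs (m x m)."""
    return Y.T @ Y

def richness_counts(Y, m):
    """Number of sampling units holding 0, 1, ..., m species."""
    return np.bincount(Y.sum(axis=1), minlength=m + 1)

def predictive_interval(replicates, statistic, level=0.95):
    """Central quantile interval of a test statistic over replicate data sets."""
    values = np.array([statistic(Y_rep) for Y_rep in replicates])
    lo, hi = np.quantile(values, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi

# Made-up stand-in data: 500 logs, 3 species sharing a common positive effect.
rng = np.random.default_rng(0)
n, m = 500, 3
shared = rng.random((n, 1)) < 0.3
Y_obs = ((rng.random((n, m)) < 0.2) | shared).astype(int)

# Replicates from an independence model with the observed prevalences.
prev = Y_obs.mean(axis=0)
replicates = [(rng.random((n, m)) < prev).astype(int) for _ in range(1000)]

lo, hi = predictive_interval(replicates, pair_counts)
outside = (pair_counts(Y_obs) < lo) | (pair_counts(Y_obs) > hi)
```

In this toy setting the off-diagonal entries of `outside` flag the species pairs whose observed co-occurrence counts fall outside the interval predicted under independence; the same function can be applied to `richness_counts` by passing `lambda Y: richness_counts(Y, m)` as the statistic.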


Results

The estimated correlation matrices revealed a high number of both positive and negative associations among the species (Fig. 2). Many of the species occurred especially on large logs with an intermediate decay stage (see Appendix C) and thus differences in the overall suitability of logs generated positive correlations among the species. In the model M1(R), which ignores the log variables, approximately half of the estimated correlations were found to be positive with at least 90% posterior probability (Fig. 2). However, some of these correlations vanished or even changed sign in the model M2(R) which accounts for the measured covariates. As shown by Fig. 3A, the correlations in the model M1(R) were generally somewhat larger than those in the model M2(R).

Figure 2.

The estimated species-to-species correlation coefficients on dead-wood fungi inhabiting (A) spruce and (B) birch logs. Upper diagonals correspond to the model M2(R) which accounts for the substrate variables; lower diagonals to the model M1(R) which assumes that all substrate units are equal. Red refers to positive and blue to negative correlations, and the size of the colored rectangle corresponds to the median estimate for the absolute value of the correlation coefficient. The estimates are averaged among the sites, including only cases in which the posterior probability for a positive (negative) correlation was at least 90%. Species pairs that were never included in the same analyses are shown by empty circles.

Figure 3.

Consistency of the estimated correlation coefficients among models and data sets. Panel A compares the correlation coefficients (for each site and species pair) between the models M2(R) and M1(R). Panel B compares the correlation coefficients in model M2(R) among different sites, including all cases in which a species pair was present in two different sites. The black dots pinpoint the cases in which the posterior probability for either positive or negative correlation was at least 90% in both models (panel A) or in both sites (panel B). The dots show the median estimates, continuous lines the fitted regression lines, dashed lines the line y = x, and the dotted line the line y = 0.

In cases where there were enough data to quantify the correlation coefficient for a given species pair in multiple sites, the results were robust among the sites (Fig. 3B). In particular, accounting only for the cases in which the correlations were identified as positive or negative with at least 90% posterior probability, the sign of the correlation coefficient for a given species pair was consistent among all sites.

We examined the effect size of nonrandom species associations by comparing the models in terms of the predicted numbers of co-occurrences. The examples shown in Fig. 4A illustrate that the models M1(R) and M2(R) were able to reproduce the numbers of pairwise co-occurrences essentially in an unbiased manner. This is expected, as in these multivariate models the estimated correlation matrix accounts for those co-occurrences that are not captured by the regression coefficients. In contrast, the model M2(I) and especially the model M1(I) led to biased predictions for the pairwise co-occurrences (Fig. 4A).

Figure 4.

Residual (predicted minus observed) numbers of species co-occurrences. In both panels, the four data points with error bars for each case correspond to (from left to right) the models M1(I), M2(I), M1(R), and M2(R). Panel A shows the number of pairwise co-occurrences for three common species, the observed numbers being 229 (S7 and S22), 32 (S16 and S22), and 53 (S16 and S7). Panel B shows the number of logs with a given number of species, the observed numbers being 2608 (0 species), 1856 (1), 640 (2), 147 (3), 50 (4), and 9 (5). The symbols show the median, and error bars show the 95% highest posterior density for the model residual. Data are pooled over all sites.

A comparison of Figs. 2A and 4A shows how ignoring a positive (negative) correlation leads to the underestimation (overestimation) of the number of co-occurrences. As an example, the observed number of co-occurrences for Fomitopsis pinicola (S7) and Trichaptum abietinum (S22; see Plate 1) was 229 in the data. The simplest null model M1(I), which accounts only for species-specific prevalence, gives the median prediction of 139 co-occurrences (Fig. 4A). The environmentally constrained null model M2(I), which accounts for the partly shared habitat requirements, leads to the median prediction of 179 co-occurrences. The remaining co-occurrences are not explained by the habitat variables included in this study, leading to a positive estimate for the correlation coefficient (r = 0.35 in the model M2(R) and r = 0.54 in the model M1(R); Fig. 2A).

Plate 1.

Fomitopsis pinicola (S7, large fruit bodies in the front) and Trichaptum abietinum (S22, many small fruit bodies in the back) often occur on the very same spruce logs. Photo credit: O. Ovaskainen.

As the model M1(I) ignores the overall quality differences among the logs, it spreads the observations evenly among all logs. This leads to underestimation for the number of empty logs, overestimation for the number of logs with one occurrence, and underestimation for the number of logs with three or more occurrences (Fig. 4B). The model M2(I) accounts at least partly for the relevant quality variation among the logs, leading to a somewhat better match than the model M1(I). As with the pairwise co-occurrences, including the correlation matrix in the model (M1(R) or M2(R)) brings degrees of freedom that help to match the frequency distribution of logs with a given number of species. However, even these models lead to slight biases, as the correlation matrix accounts only for pairwise associations, whereas the patterns of Fig. 4B depend also on higher-order associations among more than two species.


Discussion

Hierarchical Bayesian approaches are increasingly applied in ecology and evolutionary biology, largely because they provide a flexible framework for partitioning the variation in complex and structured data sets (Cressie et al. 2009). Often a hierarchical structure is used to account for the sampling design, the different sampling units being influenced, e.g., by their spatial location (e.g., Diez and Pulliam 2007). In the present case, we used a hierarchical structure to extend a species-specific model to a model of species community. The multivariate logistic model is relatively new in the statistical literature (O'Brien and Dunson 2004, Holmes and Leonhard 2006), and we believe it provides much potential for ecological analyses. To our knowledge, the only study that has so far assumed nonindependence among species in the context of habitat modeling is that of Latimer et al. (2009), who fitted a multivariate probit model to data on four invasive plant species, showing a negative residual correlation for four out of the six species pairs.

In this paper, we have demonstrated how the multivariate logistic model can be used to extract information about species interactions from co-occurrence data. Following Peres-Neto et al. (2001), we have accounted in our analyses for the different environmental affinities of the species, and thus asked whether some species pairs co-occur more or less often than expected solely by their environmental niches. Ferrier and Guisan (2006) classified community models into the three categories of (1) “assemble first, predict later,” (2) “predict first, assemble later,” and (3) “assemble and predict together.” As Peres-Neto et al. (2001) compared co-occurrence data to null predictions based on species-specific models, their approach belongs to category 2. As we have fitted a multivariate model to the community data, our approach belongs to category 3. The most fruitful choice among the three strategies depends on the purpose of the modeling study and the type of the data (Ferrier and Guisan 2006). In the present case, the multivariate approach helped in comparing co-occurrence patterns among sites with very different environmental conditions. Although our data had much variation in the prevalence of the species, the estimated correlation coefficients were robust among the sites (Fig. 3B). Thus the correlation coefficient provides a common currency that makes it possible to compare the effect size of nonrandom species associations among different environmental conditions. As we have demonstrated, the multivariate model can also be used to generate simulated species communities that repeat the nonrandom co-occurrence pattern in the original data.

Our analyses revealed that some of the wood-decaying species commonly found to occur together in the field actually do so merely because of similar habitat requirements. For example, Phellinus ferrugineofuscus (S12) is often found with Fomitopsis rosea (S8), but based on the model M2(R) this co-occurrence is explained solely by the shared habitat requirements (Fig. 2). Even more strikingly, Phellinus ferrugineofuscus apparently co-occurs often with Fomitopsis pinicola (S7) and Trichaptum abietinum (S22), but accounting for the similar habitat requirements suggests that these species actually co-occur less often than expected by chance.

Among the associations derived from the full model M2(R) there were more positive than negative ones (upper diagonals in Fig. 2), which may seem counterintuitive given that wood-decaying fungi compete largely for the same resources. This result may be partly explained by positive interactions such as facilitation or parasitism, or indirect interactions mediated through a third species. Further, some relevant host-tree quality variables may be missing from the model M2(R). For example, the chemical composition of the wood, as well as its water and gas contents and temperature conditions are likely to affect the establishment, growth and reproduction of wood-decaying fungi, but we did not have data on these variables. The co-occurrence of competitively exclusive species is also possible through spatial separation inside the log, either horizontally in different ends of the log, or vertically, in heartwood or in sapwood.

Many of the significant positive and negative correlations in Fig. 2A were obtained between an early-successional decayer (e.g., S7, S10, S22) and a mid- or later-successional decayer (e.g., S2, S11, S15, S16, S21), supporting the view that the succession of the species community is much affected by the primary decayer (Niemelä et al. 1995, Renvall 1995, Heilmann-Clausen and Boddy 2005). Some of the associations derived from the full model M2(R) are known from earlier studies, e.g., Skeletocutis carneogrisea (S19) being considered as a successor species of Trichaptum abietinum (S22) (Niemelä 2005). However, for most of the associations identified here we are not aware of an obvious explanation, hence these results provide rich material for deriving hypotheses on direct and indirect species interactions (Appendix D).

Research conducted at the interface between ecological theory and empirical data can be broadly classified into forward and inverse approaches. Forward approaches make assumptions about the underlying processes, and use mathematical modeling or simulations to understand how the resulting patterns depend on parameter regimes and structural model assumptions. In contrast, with inverse problems one attempts to gain information on the underlying processes based on data on the patterns. In this paper, we have addressed an inverse problem where the pattern is the occurrence of fungal species as fruit bodies, and the processes relate to species interactions and colonization–extinction dynamics affected by species niches. As many different processes can lead to an identical pattern, inverse problems are mathematically ill posed in the sense that they lack a unique solution. Thus, our result of the positive and negative associations among species pairs (Fig. 2) should not be considered as evidence of direct interactions, but as data-driven hypotheses that we propose to be tested with the help of experimental work.


Acknowledgments

Data collection was funded by the Finnish Ministry of Agriculture and Forestry, the Finnish Ministry of the Environment, and the EU Forest Focus research programme. Ilkka Hanski, Saskya van Nouhuys, and two anonymous reviewers provided valuable comments. This study was supported by the Academy of Finland (grant no. 124242 to O. Ovaskainen), the European Research Council (ERC Starting Grant no. 205905 to O. Ovaskainen), the Ella and Georg Ehrnrooth Foundation (grant to J. Hottola), and the Finnish Society of Forest Science (grant to J. Hottola).


Appendix A

Details on the data collection (Ecological Archives E091-184-A1).

Appendix B

Implementation and testing of the MCMC algorithm (Ecological Archives E091-184-A2).

Appendix C

Table showing the species responses to environmental covariates (Ecological Archives E091-184-A3).

Appendix D

Discussion on species-to-species interactions (Ecological Archives E091-184-A4).

Supplement

Source code for parameter estimation (Ecological Archives E091-184-S1).