The perils of measuring biodiversity responses to habitat change using mixed metrics

Existing quantitative syntheses on how biodiversity responds to anthropogenic habitat change appear to sometimes mix different biodiversity metrics in drawing inferences. This “mixing metrics” practice, if prevalent, would considerably bias our understanding of biodiversity responses and render conclusions uninterpretable. However, the prevalence of this practice remains unknown, and the bias it potentially introduces has not been empirically assessed. We fill this gap by conducting a systematic literature assessment of existing syntheses on biodiversity responses to habitat change, along with an analysis of a global database specifically on forest restoration. We found that the “mixing metrics” practice was used in almost a quarter of existing syntheses across a wide range of ecosystem and habitat change types. This practice predictably altered the quantitative, and frequently even the qualitative, inferences on biodiversity responses to forest restoration, in ways contingent on the composition of metrics mixed. We call on future syntheses to be cognizant of the differences in metric meaning and behavior, and to avoid mixing different metrics in studying biodiversity responses to habitat change.


INTRODUCTION
Large-scale quantitative syntheses have emerged as a prominent tool for understanding biodiversity responses to anthropogenic habitat change (Gibson et al., 2011; Gurevitch et al., 2018; Newbold et al., 2015). They are also frequently referred to by global assessments of biodiversity status and future trajectories (Newbold et al., 2016; IPBES, 2019; Almond et al., 2022) that in turn inform high-level conservation policy-making (e.g., Kunming-Montreal abundance among species). This inherent difference means that the inference of a synthesis is specific to the metrics used (Newbold et al., 2015; Vellend et al., 2013), and crucially, mixing different metrics for an overall inference on biodiversity responses to habitat change is not ecologically meaningful. Yet this "mixing metrics" practice seems common, including in influential syntheses (Crouzeilles et al., 2016, 2017; Florencia Miguel et al., 2020; Huang et al., 2019), highlighting a need to critically appraise if the unscrupulous mixed-use of biodiversity metrics may have led to spurious conclusions (Collen & Nicholson, 2017). Efforts toward such an appraisal have so far been indirect and hypothetical. For one thing, we do not yet know how prevalent the "mixing metrics" practice is in existing syntheses. In addition, regarding how measured biodiversity responses to habitat change may depend on the metrics used, our understanding is limited to a handful of simulation studies that use hypothetical species assemblages under presumed population changes to compare metric behaviors (Lamb et al., 2009; Dornelas, 2010; Santini et al., 2017).
While these studies provide important evidence on the varying inferences of alternative metrics, their hypothetical data lack the realism of true species assemblages, particularly in how the population of each species within the assemblage may change in response to realistic environmental changes (Lamb et al., 2009; Santini et al., 2017), and they do not emulate biodiversity responses to concrete forms of habitat change. Overcoming these shortfalls necessitates empirical data on entire species assemblages under clear contexts of habitat change, crucially with abundance information on the species level to allow the calculation of a range of biodiversity metrics. But this has proven difficult in part because of the intensive efforts needed in compiling such data (Feng et al., 2022).
Here we combine a systematic literature assessment of existing quantitative syntheses on a wide range of habitat change with the analysis of a large empirical dataset on forest restoration (as a specific form of habitat change), to assess the extent to which the "mixing metrics" practice may have biased our understanding. Using three mainstream search engines, we systematically searched for syntheses on the numerical responses of species assemblages to habitat change, and we assessed the prevalence of the "mixing metrics" practice in this collection. We then conducted meta-analyses and meta-regressions on a global dataset of species-level abundance records for entire species assemblages, using six biodiversity metrics frequently used in syntheses and two of their mixtures to compare the inferences drawn. With global momentum (Chazdon & Brancalion, 2019; Dave et al., 2019) and potentially large biodiversity impacts (Benayas et al., 2009), forest restoration provides a data-rich context of strong applied value to assess the potential biases caused by mixing different metrics.

Literature assessment of existing syntheses
We systematically searched for quantitative syntheses on biodiversity responses to habitat change using three mainstream scientific search engines on May 17, 2022: Web of Science (searching "Topic"), Scopus (searching "TITLE-ABS-KEY"), and Google Scholar (see Table S1 for search terms used). We screened all entries returned by Web of Science and Scopus, and the first 200 entries returned by Google Scholar. For a synthesis to qualify for assessment, it must have investigated (i) biodiversity numerical responses to (ii) at least one form of habitat change in terrestrial ecosystems, and it must (iii) have used at least one quantitative metric. From each qualified synthesis, we extracted the following information: (i) ecosystem type, (ii) form of habitat change, (iii) metric(s) used to assess biodiversity responses/changes, and (iv) geographical scope of study. We also recorded the publishing journal and year, along with the journal impact factor and the number of citations on Google Scholar (as of July 23, 2022) for each synthesis to gauge its influence. We tallied the number of syntheses that used the "mixing metrics" practice to assess its prevalence.

Empirical dataset on forest restoration
We used a previously compiled global dataset to compare the inferences (on biodiversity responses to forest restoration and their underlying factors) by six frequently used biodiversity metrics and two of their mixtures. Our dataset comprised paired species-level abundance records for three types of comparisons between restored and "benchmark" tree covers: tree plantations versus matching (i) reference native forests (native forests that had not been deforested in recent history) and (ii) restored native forests of similar age (age contrast ≤10 years; Hua et al., 2022), and (iii) restored native forests versus matching old-growth forests (Miao, 2022). In addition to species-level abundance for paired tree covers, the dataset also recorded the mean annual temperature (MAT; °C) of the study sites and the age of the restored tree covers (year), as variables potentially relevant to the abundance contrast between the tree-cover pairs (Crouzeilles et al., 2016; Hua et al., 2022; Miao, 2022). In total, this dataset consisted of 33,064 pairs of species-level records, representing 415 species assemblages from 42 countries (Figure 1).
FIGURE 1 Distribution and amount of data in our global dataset on forest restoration. Bubbles represent the study locations and numbers of species assemblages for each of the three tree-cover pairs involved. This dataset combined the two datasets reported in Hua et al. (2022) and Miao (2022).

Comparing inferences by different biodiversity metrics and their mixtures
The six biodiversity metrics we focused on included (i) observed and (ii) rarefied species richness (Gotelli & Colwell, 2001), (iii) arithmetic and (iv) geometric mean abundance (Buckland et al., 2005), and (v) Shannon (Shannon, 1948) and (vi) Simpson (Simpson, 1949) indices. All have been heavily used in existing syntheses except for geometric mean abundance (Marshall et al., 2020; Table S1), which we also included given its desirable quality of giving greater weight to rare species compared to arithmetic mean abundance (Buckland et al., 2011; Hua et al., 2022). We used each of these metrics to derive an assemblage-level response to forest restoration (see below). For analyses using the two metric mixtures, we randomly assigned each species assemblage to a given metric (in equal proportions among the metrics that composed a metric mixture), such that the assemblage-level response was measured in that metric. Metric mixture 1 involved the three most frequently used metrics in existing syntheses (observed species richness, arithmetic mean abundance, and Shannon index; Table S1), and metric mixture 2 involved all metrics except for geometric mean abundance.
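As an illustration of the random assignment described above, the following Python sketch allocates assemblages to the constituent metrics of a mixture in near-equal proportions. It is not the authors' R code: the function name `assign_metrics`, the metric labels, and the fixed seed are assumptions made for this example, and the authors' exact randomization procedure may differ.

```python
import random
from collections import Counter

# Metric labels are illustrative names, not identifiers from the authors' code.
MIXTURE_1 = ["observed_richness", "arithmetic_mean_abundance", "shannon"]
MIXTURE_2 = ["observed_richness", "rarefied_richness",
             "arithmetic_mean_abundance", "shannon", "simpson"]


def assign_metrics(assemblage_ids, metrics, seed=0):
    """Randomly assign each assemblage to one metric of a mixture.

    Assemblages are shuffled, then metrics are dealt out in turn, so the
    number of assemblages per metric differs by at most one.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    ids = list(assemblage_ids)
    rng.shuffle(ids)
    return {aid: metrics[i % len(metrics)] for i, aid in enumerate(ids)}


# 415 assemblages, as in the dataset described in the text.
assignment_1 = assign_metrics(range(415), MIXTURE_1)
assignment_2 = assign_metrics(range(415), MIXTURE_2)
counts_1 = Counter(assignment_1.values())
counts_2 = Counter(assignment_2.values())
```

Under this scheme, the response of each assemblage would be computed only in its assigned metric before the pooled meta-analysis, which is what makes the pooled estimate depend on the composition of the mixture.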
For all metrics except for geometric mean abundance, we first calculated their values for each tree cover, then the ratios between the restored and benchmark tree covers on a natural log scale to derive assemblage-level responses to forest restoration. We followed established methods in calculating metric values; for rarefied richness, we conducted rarefaction for the tree cover with larger individual counts within each tree-cover pair, by randomly subsampling individuals down to the count of the smaller-count tree cover. For geometric mean abundance, we first calculated, for each species, the abundance ratio between the restored tree cover and its benchmark on a natural log scale, and we took the average of all ratios within each assemblage to derive assemblage-level responses to forest restoration (Buckland et al., 2011; Roswell et al., 2021).
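The calculations described above can be sketched in Python for a single tree-cover pair. This is a minimal illustration under stated assumptions rather than the authors' implementation: the counts are invented toy data; the Simpson index is taken in its Gini-Simpson form (1 minus the sum of squared proportions); arithmetic mean abundance is averaged over the shared species list of the pair; rarefaction uses a single random draw; and the offset `eps` added before taking per-species log ratios is a hypothetical device so that zero abundances do not break the logarithm (the text does not state how zeros were handled).

```python
import math
import random


def richness(counts):
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def gini_simpson(counts):
    # Assumed form of the Simpson index: 1 - sum(p_i^2).
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def rarefy(counts, target, rng):
    """Subsample `target` individuals without replacement; return species counts."""
    individuals = [sp for sp, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for sp in rng.sample(individuals, target):
        out[sp] += 1
    return out


def log_ratio(restored_value, benchmark_value):
    """Assemblage-level response: ln(metric in restored / metric in benchmark)."""
    return math.log(restored_value / benchmark_value)


def geometric_mean_response(restored, benchmark, eps=1.0):
    # Mean of per-species log abundance ratios; eps is a hypothetical offset.
    ratios = [math.log((r + eps) / (b + eps)) for r, b in zip(restored, benchmark)]
    return sum(ratios) / len(ratios)


# Toy counts per species (same species order in both tree covers).
restored = [40, 5, 0, 1, 0]
benchmark = [10, 8, 6, 4, 2]

rng = random.Random(1)
target = min(sum(restored), sum(benchmark))  # size of the smaller-count cover
rarefied_restored = rarefy(restored, target, rng)
rarefied_benchmark = rarefy(benchmark, target, rng)

responses = {
    "observed_richness": log_ratio(richness(restored), richness(benchmark)),
    "rarefied_richness": log_ratio(richness(rarefied_restored),
                                   richness(rarefied_benchmark)),
    "arithmetic_mean_abundance": log_ratio(sum(restored) / len(restored),
                                           sum(benchmark) / len(benchmark)),
    "shannon": log_ratio(shannon(restored), shannon(benchmark)),
    "gini_simpson": log_ratio(gini_simpson(restored), gini_simpson(benchmark)),
    "geometric_mean_abundance": geometric_mean_response(restored, benchmark),
}
```

In this toy pair, a single dominant species inflates the arithmetic mean abundance of the restored cover, so the arithmetic response is positive while the richness and geometric mean responses are negative.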
We then conducted separate meta-analyses and meta-regressions on the assemblage-level responses derived from each of the six metrics and two of their mixtures in program R (version 4.1.1; R Core Team, 2020). We used linear-mixed models to conduct meta-analyses separately for each tree-cover comparison, using the lme function from package "nlme" (version 3.1-153; Pinheiro et al., 2021). The random-effect variables we included for all analyses were, in descending order of nestedness: (i) species' taxonomic group identity; (ii) the finer combination between tree plantations (monoculture, mixed culture, or abandoned plantations), reference native forests (old-growth forests or otherwise), or restored native forests (resulting from natural regeneration, assisted natural regeneration, or artificial planting); (iii) the identity of the primary study; (iv) the site identity of the benchmark forest.
FIGURE 2 The prevalence of the "mixing metrics" practice in existing syntheses on biodiversity responses to habitat change. (a) The number of existing syntheses that did or did not mix multiple metrics in assessing biodiversity responses to habitat change, for a range of terrestrial ecosystem types and habitat change forms. Barplots are further disaggregated according to the number of metrics used. (b) The combination of biodiversity metrics subject to mixed use in existing syntheses. Within each of the five metric categories (i.e., richness, abundance, diversity, similarity, and demography; displayed in different colors), only the metrics with the top three use frequencies are presented, and the remaining metrics are grouped as "Others"; more than three metrics may be displayed if multiple metrics had the same use frequency within a given metric category. Links indicate pairs of metrics subject to mixed use (cases of three metrics being used together would therefore appear in two separate links, and so on), in width proportional to the frequency of the corresponding pair. Accordingly, the outer rim length for each metric indicates its total frequency of being in mixed use with other metrics. (c) The influence of existing syntheses as represented by journal impact factors (in 2021 values) and article citation rates. Both axes are on the log10 scale. * indicates the two syntheses in which authors tested the sensitivity of their conclusions to the mixing-metrics practice (they concluded that this practice did not change their findings relative to individual metrics).
For meta-regressions, we used the same linear-mixed modeling approach by constructing global models, followed by model selection and averaging based on the Akaike Information Criterion corrected for small samples (AICc). Our global models used the same random-effect variables as in the above meta-analyses. Their fixed-effect variables included the age of the restored tree covers (this variable was not used for the tree-cover pair of tree plantations versus restored native forests of similar age), MAT of the study sites, and the interaction between age and MAT; we modeled both variables on the natural log scale to accommodate their potential nonlinear effects. From these global models, we constructed a full suite of sub-models to derive model-averaged inferences from the top model set, that is, the set of models with ΔAICc ≤2, using package "MuMIn" (version 1.43.17; Bartoń, 2020).
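The ΔAICc ≤2 rule for retaining the top model set can be illustrated with a short Python sketch. It reproduces only the standard AICc formula and Akaike weights, not the authors' MuMIn workflow; the candidate models, their log-likelihoods, parameter counts, and the sample size `n=150` are hypothetical values chosen for the example.

```python
import math


def aicc(log_lik, k, n):
    """AICc = -2 logL + 2k + 2k(k + 1) / (n - k - 1)."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def top_model_set(models, n, threshold=2.0):
    """Return (name, delta_AICc, weight) for models within `threshold` of the best.

    `models` is a list of (name, log_likelihood, n_parameters) tuples.
    Akaike weights are renormalized within the retained set.
    """
    scored = [(name, aicc(ll, k, n)) for name, ll, k in models]
    best = min(score for _, score in scored)
    top = [(name, score - best) for name, score in scored
           if score - best <= threshold]
    raw = [math.exp(-delta / 2.0) for _, delta in top]
    total = sum(raw)
    return [(name, delta, w / total) for (name, delta), w in zip(top, raw)]


# Hypothetical sub-models of a global model with age, MAT, and their interaction.
candidates = [
    ("null", -120.0, 6),
    ("MAT", -116.5, 7),
    ("age", -119.6, 7),
    ("age+MAT", -116.2, 8),
    ("age*MAT", -116.1, 9),
]

result = top_model_set(candidates, n=150)
top_names = [name for name, _, _ in result]
```

With these invented values, only the "MAT" and "age+MAT" models fall within two AICc units of the best model, so model-averaged coefficients would be derived from those two alone.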

Prevalence of the "mixing metrics" practice in existing syntheses
We found that the "mixing metrics" practice was present in almost a quarter of existing quantitative syntheses on biodiversity responses to habitat change. Out of the 193 syntheses we located and the 237 cases therein (a given synthesis can involve multiple cases of biodiversity responses to habitat change), 44 (22.7%) syntheses and 61 (25.6%) cases combined at least two metrics to make an overall inference (Figure 2a; Table S1). These syntheses concerned the vast majority of ecosystem types and habitat change forms in the full collection, but were particularly common for forest ecosystems, as well as the habitat change forms of degradation and restoration (Figure 2a). Together, these syntheses involved a wide range of metric combinations across four metric categories (richness, abundance, diversity, and similarity), with observed species richness, arithmetic mean abundance, and Shannon index being the most dominant metrics subject to mixed use (Figure 2b). These syntheses had a median journal impact factor of 6.4 (in 2021 values) and article citation rate of 13.8 times/year (SD: ±29.3 times/year), and included some highly influential studies (Figure 2c).

Implications of the "mixing metrics" practice for synthesis inferences
Across tree-cover comparisons, meta-analytic inferences on the assemblage-level biodiversity responses to forest restoration differed noticeably among the six metrics (Figure 3a-c, the six top rows in each panel). First, rarefied species richness and diversity metrics (Shannon and Simpson) consistently indicated more positive biodiversity responses than the other metrics, frequently finding no significant contrast between restored and benchmark tree covers despite other metrics suggesting otherwise. Second, observed species richness and geometric mean abundance consistently produced more conservative inferences than the other metrics, predominantly indicating significant biodiversity contrasts between restored and benchmark tree covers. Third, between the two abundance metrics, arithmetic mean abundance consistently indicated more positive biodiversity responses than geometric mean abundance, by a wide margin in two out of the three tree-cover comparisons. Finally, while different metrics may draw the same qualitative conclusions on the significance of biodiversity contrasts between tree covers, they always differed in quantitative estimates.
FIGURE 3 Different inferences on biodiversity responses to forest restoration and underlying factors based on different metrics and their mixtures. Inferences on the contrast between restored tree covers and their benchmark forests in assemblage-level biodiversity amount are displayed for six biodiversity metrics and two metric mixtures, for the comparisons between (a) tree plantations versus reference native forests, (b) tree plantations versus restored native forests of similar age, and (c) restored native forests versus old-growth forests. In panels (a)-(c), scattered dots represent assemblage-level responses to forest restoration derived from our database, and diamonds and associated error bars represent the mean and 95% confidence intervals (CI) resulting from meta-analyses. We considered metrics to produce different inferences on biodiversity responses to habitat change if the means and 95% CIs derived from them were different. (d)-(f) Inferences on which factors (in columns) explained the contrast between restored tree covers and their benchmark forests in assemblage-level biodiversity amount according to the top model sets, based on different biodiversity metrics and their mixtures (in rows). Inferences are displayed for the comparisons between (d) tree plantations versus reference native forests, (e) tree plantations versus restored native forests of similar age, and (f) restored native forests versus old-growth forests. Numbers in each cell represent the estimated coefficient (number outside of parentheses) and its 95% CI (numbers in parentheses) for the corresponding factor based on the corresponding biodiversity metric. Positive and negative relationships are indicated by orange and purple coloring of the corresponding cells, respectively, with statistically significant relationships indicated by dark colors. "-" indicates that the variable concerned was not included in the top model set. For the comparison between tree plantations and restored native forests of similar age, the variable "Age" (i.e., age of the restored tree cover) was not relevant and is indicated by "NA."
The above differences also meant that inferences drawn from mixing multiple metrics differed noticeably from those of any individual metric, and depended on the identity and proportion of metrics in the mixture (Figure 3a-c). As expected, inferences from the two metric mixtures were intermediate between those from their constituent metrics, and differed from each other (Figure 3a-c). Importantly, by increasing the proportions of "optimistic" metrics (via bringing in rarefied species richness and Simpson index) and lowering that of the more conservative metric (observed species richness), mixture 2 predictably indicated more positive biodiversity responses than mixture 1.
Inferences from meta-regressions on what factors underlay biodiversity responses to forest restoration also differed among metrics and their mixtures (Figure 3d-f). Within a given tree-cover comparison, the sets of predictors identified by model selection as relevant to biodiversity responses were generally inconsistent across metrics. Even where they were consistent for a subset of metrics (e.g., for the comparison between tree plantations and reference native forests, MAT was identified as the sole predictive variable for three metrics and both metric mixtures), their coefficients and statistical significance were always different (Figure 3d). Taken together, our findings, based on a large empirical dataset on forest restoration, demonstrated the very different inferences to be drawn from different metrics and their mixtures on how biodiversity responds to habitat change and the underlying factors.

DISCUSSION
Our critical appraisal highlighted the prevalent practice in existing syntheses of mixing multiple metrics in assessing biodiversity responses to habitat change, and the biases this practice can introduce to our understanding. In our analysis of an empirical dataset on forest restoration, mixing multiple metrics clearly altered the quantitative, and frequently even qualitative, conclusions on biodiversity responses and their relevant predictors in comparison with individual metrics. More problematically, this practice muddles the ecological interpretation of conclusions because of the inherent differences in what each metric is designed to measure. For example, it is unclear how the conclusion "restoration success was 34% to 56% higher in natural regeneration than in active restoration systems" (Crouzeilles et al., 2017) should be interpreted, when biodiversity was represented by up to five metric categories and more than ten individual metrics at unknown proportions. Given the conservation significance of accurately measuring how biodiversity responds to habitat change and what factors are relevant, and the scientific influence of conclusions from quantitative syntheses, the incomplete and potentially biased understanding risked by the "mixing metrics" practice should be recognized and avoided.
The prevalence of this practice in existing syntheses stems from a failure to recognize that different metrics, although complementing each other, are expected to yield different conclusions for the same scenario of biodiversity change (Blowes et al., 2022; Midolo et al., 2019). For example, richness metrics by design do not reflect changes in species population size unless the population completely disappears or emerges (Roswell et al., 2021), and therefore tend to be less sensitive than abundance metrics in reflecting biodiversity change (Santini et al., 2017). Similarly, most diversity indices by design would not change unless the relative abundance (or presence/absence) among species changes. They therefore would remain steady even if all species are declining in population size, as long as they have the same decline rate (Santini et al., 2017). On the other hand, while abundance metrics are generally considered the most informative (Giljohann et al., 2018; Midolo et al., 2019), they also differ in the nature of abundance information captured. Most notably, arithmetic mean abundance, while intuitive and most frequently used, is prone to the influence of more abundant species that tend to show large absolute population changes (Magurran & Henderson, 2003); this potential pitfall is largely avoided by geometric mean abundance (Buckland et al., 2005; Giljohann et al., 2018). These among-metric differences in expected sensitivity to biodiversity change have been illustrated by earlier simulation studies (Lamb et al., 2009; Santini et al., 2017; Van Strien et al., 2012), and are further supported by our findings from the empirical dataset on forest restoration.
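The contrasting sensitivities described above can be made concrete with a small Python sketch on invented abundances. The two scenarios (a uniform 50% decline, and a rise in the dominant species alongside halving of all others) and the function names are assumptions made for illustration; the responses are natural-log ratios of "after" to "before", analogous to the assemblage-level responses used in the analysis.

```python
import math


def richness(counts):
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def arithmetic_response(before, after):
    # ln ratio of mean abundance; equals ln ratio of totals for a shared species list.
    return math.log(sum(after) / sum(before))


def geometric_response(before, after):
    # Mean of per-species ln abundance ratios (all abundances > 0 in this toy data).
    return sum(math.log(a / b) for a, b in zip(after, before)) / len(before)


before = [100, 20, 10, 4]

# Scenario 1: every species declines by the same rate (50%).
halved = [50, 10, 5, 2]
uniform = {
    "richness_change": richness(halved) - richness(before),
    "shannon_change": shannon(halved) - shannon(before),
    "arithmetic": arithmetic_response(before, halved),
    "geometric": geometric_response(before, halved),
}

# Scenario 2: the dominant species increases while all others halve.
skewed_after = [150, 10, 5, 2]
skewed = {
    "richness_change": richness(skewed_after) - richness(before),
    "arithmetic": arithmetic_response(before, skewed_after),
    "geometric": geometric_response(before, skewed_after),
}
```

In the first scenario, richness and the Shannon index register no change although every population has halved, whereas both abundance metrics return ln(0.5). In the second, the arithmetic response is positive because the dominant species drives the total, while the geometric response is negative because three of the four species declined.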
It is worth noting that ∼75% of existing syntheses we identified on biodiversity responses to habitat change did not mix metrics in drawing inferences. Aside from the 71 cases that used only one metric, the other 76 cases reporting multiple metrics took care to conduct analyses and interpret findings based on each metric separately, with the vast majority of these "nonmixing" syntheses relying on two metrics: observed species richness and arithmetic mean abundance (Table S1). This pattern illustrates two issues pertaining to synthesizing existing data on biodiversity responses to habitat change. First, observed species richness and arithmetic mean abundance remain the most readily available primary data for synthesis. Given the inherent differences in the information content, behavior, and strength of different metrics as discussed above, there is a need to make data on other metrics more available for synthesis (see also below). Second, following the good practice of avoiding metric-mixing is feasible, and certainly should not be trumped by the appeal of an enlarged sample size from pooling primary data on multiple metrics, a likely reason for the prevalence of metric-mixing in existing syntheses. For example, the large sample size of the influential synthesis Crouzeilles et al. (2017) was at least partly an outcome of the synthesis admitting more than ten biodiversity metrics and combining them in analyses. However, in the absence of valid analyses and conclusions, as is the case with metric-mixing, the potential benefit of a large sample size and an apparently straightforward "overall" conclusion would be nonexistent.
Given the difference among biodiversity metrics in inherent ecological meaning and expected sensitivity to biodiversity change, we highlight a number of insights on how studies (syntheses as well as primary studies) should select and report metrics in measuring biodiversity responses to habitat change, in addition to the obvious note that syntheses must not mix metrics in making inferences. First, diversity indices, by virtue of being the most "optimistic" metrics, should be used with caution in view of the precautionary principle of conservation (Van Strien et al., 2012). Despite their unique information, studies should avoid using them as the sole measures of biodiversity responses. Second, whenever resources allow, it is desirable to collect species-level abundance information. In addition to its greater information content than richness metrics, it also offers the flexibility to derive richness measures where needed (Midolo et al., 2019). Finally, instead of reporting highly "reduced" assemblage-level metrics as is widely done, primary studies can contribute vastly more useful data by reporting abundance information at the species level and, ideally, also at the sampling-site level. This will give syntheses the flexibility to derive a range of complementary metrics for more informative inferences, and should become a standard practice.
As empirical evidence accumulates on biodiversity responses to anthropogenic impacts, the role of large-scale quantitative syntheses in generating broad-scale knowledge on the fate of global biodiversity and informing conservation decision-making will only become more prominent. The onus is on these syntheses to deliver robust science. By highlighting the prevalence and consequences of a heretofore overlooked practice in existing syntheses, we call on future syntheses to avoid mixing metrics in measuring biodiversity responses to habitat change. The scientific rigor of future syntheses will also benefit from an increased initiative of primary studies to collect species-level abundance data, and to more openly report such (raw) data; the latter will entail a wider cultural shift and reward system change in scientific research (de Lima et al., 2022; Reichman et al., 2011).

CONFLICT OF INTEREST STATEMENT
The authors declare that they have no competing interests.

DATA AVAILABILITY STATEMENT
All syntheses assessed and data extracted are provided in the Supporting Information. The previously compiled global dataset on forest restoration can be accessed from original publications. Code for analysis is available on GitHub at https://github.com/mingxinliu/assessing_biodiversity_metrics.