Quantifying reliability and data deficiency in global vertebrate population trends using the Living Planet Index

Global biodiversity is facing a crisis, which must be solved through effective policies and on‐the‐ground conservation. But governments, NGOs, and scientists need reliable indicators to guide research, conservation actions, and policy decisions. Developing reliable indicators is challenging because the data underlying those tools is incomplete and biased. For example, the Living Planet Index tracks the changing status of global vertebrate biodiversity, but taxonomic, geographic and temporal gaps and biases are present in the aggregated data used to calculate trends. However, without a basis for real‐world comparison, there is no way to directly assess an indicator's accuracy or reliability. Instead, a modelling approach can be used. We developed a model of trend reliability, using simulated datasets as stand‐ins for the “real world”, degraded samples as stand‐ins for indicator datasets (e.g., the Living Planet Database), and a distance measure to quantify reliability by comparing partially sampled to fully sampled trends. The model revealed that the proportion of species represented in the database is not always indicative of trend reliability. Important factors are the number and length of time series, as well as their mean growth rates and variance in their growth rates, both within and between time series. We found that many trends in the Living Planet Index need more data to be considered reliable, particularly trends across the global south. In general, bird trends are the most reliable, while reptile and amphibian trends are most in need of additional data. We simulated three different solutions for reducing data deficiency, and found that collating existing data (where available) is the most efficient way to improve trend reliability, whereas revisiting previously studied populations is a quick and efficient way to improve trend reliability until new long‐term studies can be completed and made available.


1 | INTRODUCTION
An urgent data crisis complicates the global biodiversity crisis (Turak et al., 2017). Attempts to assess global biodiversity (e.g., the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services, IPBES) and to set global policies and goals that will halt or reverse its loss (e.g., the Convention on Biological Diversity, CBD, and Sustainable Development Goals, SDGs) need reliable and up-to-date scientific information (Jetz et al., 2019). Yet most studies and tracking programs are either species- or region-focused, temporally limited, and inherently biased, all of which results in large geographic and taxonomic knowledge gaps (Hortal et al., 2015; Jetz et al., 2019; Meyer et al., 2015; Proença et al., 2017; Turak et al., 2017). Advances in technologies such as camera trapping, satellite sensors, digital image recognition, network speed and capacity, data access, and mobile devices are improving our ability to track and count populations of birds and mammals (Lausch et al., 2016; Nichols et al., 2011; Rose et al., 2015), but our datasets are far from complete. The situation is worse for amphibians, reptiles, insects, and other groups, for which many species have yet to even be described (Mora et al., 2011).
We need tools to improve our understanding of global biodiversity within the limitations imposed by biased and incomplete datasets. Biodiversity indicators are one such tool (Mace & Baillie, 2007). To what extent can we rely on them to present a true picture of the state of global biodiversity? Two of the best-known biodiversity indicators are the Living Planet Index (LPI), which tracks vertebrate population trends, and the Red List Index (RLI), which tracks extinction risk trends (Butchart et al., 2005). The RLI is based on extinction risk classifications at the species level, created by expert assessment using an objective set of criteria (IUCN, 2012). By contrast, the LPI uses continuous population data collected by scientific surveys.
However, as intensive global long-term studies do not exist for most species, the LPI calculates trends from data compiled from a variety of sources, including grey literature. This means there is a lack of standardization in study design (individual population time series are standardized, but there is no standardization between populations), monitoring strategy, frequency of assessment, monitoring intensity and effort, and even data type (densities, counts of individuals, breeding pairs, or even nests, and population size estimates are mixed together). The LPI has taxonomic and geographical imbalances (Collen et al., 2009; McRae et al., 2017), a problem also found in other global biodiversity datasets (Boakes et al., 2010; Collen et al., 2008; Yesson et al., 2007). Further, many included time series are short (McRae et al., 2016; Proença et al., 2017; Saha et al., 2018), and shorter trends tend to be less accurate than longer ones (Arkilanian et al., 2020; Wauchope et al., 2019). Recognizing these limitations, the LPI employs statistical techniques to increase the accuracy and precision of trends. Generalized additive modelling or log-linear interpolation is used (depending on the length of a given time series) to fill in missing values in time series, bootstrapping is used to generate confidence intervals (Collen et al., 2009), and a hierarchical weighting system is applied to account for geographical and taxonomic bias (Collen et al., 2009; McRae et al., 2017).
Nonetheless, the LPI's conclusions on biodiversity change have been questioned, with Buschke et al. (2021) finding an inherent negative bias in the calculation of LPI trends due to random population fluctuations and Leung et al. (2020) finding that the LPI is biased by clusters of extreme population declines. Further, Leung et al. (2020) used the Living Planet Database (LPD), on which the LPI is based, to argue that global biodiversity is not declining. While the analysis of Leung et al. (2020) has been contradicted by others (Loreau et al., 2022; Murali et al., 2022; Puurtinen et al., 2022), the controversy has placed a spotlight on the LPI and other global biodiversity indicators and increased the urgency of understanding how well we can rely on them.
Without a basis for real-world comparison, there is no way to directly assess an indicator's accuracy or reliability. However, there are ways to address this question indirectly. One solution was employed by the sampled approach to the Red List Index (sRLI), which uses the minimum representative sample size (sample size being the number of species represented in the index for a particular taxonomic group) needed to achieve less than a 5% probability of falsely detecting a positive slope when the Red List Index trend is negative (Baillie et al., 2008; Henriques et al., 2020). Minimum representative sample size was determined through sub-sampling of comprehensively assessed species groups on the IUCN Red List (e.g., mammals, birds, etc.; Baillie et al., 2008; Henriques et al., 2020).
Two challenges presented by the LPI require a different approach than that taken for the sRLI. First, LPI trends are based on population time series that are often short and/or infrequently measured, and there are no regional or taxonomic groups within the LPI where the data is comprehensive enough to be certain of the real-world trend. Therefore, comparing sampled trends to LPI trends would tell us little about how the sampled trends might compare to reality. Second, the LPI uses non-linear trends that change slope and direction over time, so trends should be compared in a way that reflects this. Here, we use a modelling approach to overcome these challenges. We generated thousands of datasets of synthetic population time series, varying the underlying properties of the data, to represent regional taxonomic groups in the real world, and drew samples from those datasets. We degraded the samples by randomly removing observations and adding observation error to resemble regional taxonomic groups in the Living Planet Database (LPD, the database underlying the LPI). We then compared the trends calculated from the samples with those from the complete datasets using the Jaccard distance metric (chosen using the distance measure selection method described in Dove et al., 2022) and constructed a multiple regression model to understand how the distance values are influenced by variations in properties of the data. Here, distance values can be thought of as a measure of trend accuracy. By selecting a threshold value for accuracy and applying the model to the LPI, we were able to quantify the reliability of disaggregated LPI trends and determine the number of additional time series needed to meet the threshold. Finally, we modelled and compared three different solutions for reducing data deficiency: (a) tracking unstudied populations for a decade to generate new time series for the LPD, (b) resampling previously-studied populations to update old time series in the LPD, and (c) gathering more time series from existing studies to add to the LPD. The results from this study can be used to focus data-gathering and data-collation efforts on the regions, taxa, and populations that would be of greatest benefit to improving our understanding of the state of global vertebrate biodiversity. Figure 1 shows an overview of our methods, with each numbered step corresponding to a numbered subheading in the text.

2 | METHODS

2.1 | Synthetic data generation
We first created simulated datasets to represent "real-world" regional vertebrate groups for which the LPI calculates biodiversity trends. The LPI is often represented as a single global index trend, but can also be disaggregated into hierarchical groups: first into systems (terrestrial, marine, freshwater), then geographical realms within each system, and finally taxonomic groups within each realm. It is this lowest level of the hierarchy, the regional taxonomic groups, which we simulated. From here on simulated regional taxonomic groups will be referred to as datasets. The base units of the LPI, and of our synthetic datasets, are population time series, which we will refer to simply as populations. These populations are grouped into species, and species are grouped into datasets.
Our procedure to simulate a dataset requires six parameters: (1) the total number of populations to simulate (set to 10,000), (2) the mean number of populations assigned to each species (set to 10), (3) the number of years (length of trend) to simulate (set to 50), (4) the mean of the population mean growth rates (μ_ds), (5) the standard deviation of the population mean growth rates (variation among populations, σ_ds), and (6) the mean of the population standard deviations of the growth rate (process error, μ_η, which determines annual variation in growth rates within time series).
Before determining parameter settings, we tested each parameter individually for effects on trend accuracy. We did this by generating test datasets with a range of settings for the parameter being tested and keeping all other parameters fixed (see supplementary figures for details), then followed the methods described in Sections 2.2-2.7 (below) to determine if and how trend accuracy was affected. The first parameter, total populations, affected trend accuracy only when greater than half of all populations in a dataset were sampled (see Figure S1), a situation that is unlikely for regional taxonomic groups in the LPD, as it is rare even at the species level (see taxonomic representativeness in McRae et al., 2017). The second parameter, the mean number of populations per species, had no effect on trend accuracy within the wide range of values we tested (see Figure S2). The third, trend length, did affect trend accuracy (see Figure S3) and would therefore need to be set appropriately if adapting the model for a different indicator. However, it is relatively constant across regional taxonomic groups in the LPD (all trends begin at 1970 and end at the most recent year for which there are observations in the database, e.g., 2020). Therefore, we set the first three parameters at fixed values for the "real-world" simulations. Parameters four through six are variable in the LPD and did affect trend accuracy in our test results, and were therefore set to vary in the simulations.
We modelled population time series using the stochastic exponential model with process error:

$$N_{t+1} = (1 + r_t)\,N_t \tag{1}$$

where N_t is population size at year t, 1 + r_t is the annual growth rate (often referred to as lambda, or λ) at year t, and r_t ~ N(μ_pop, σ²_pop) models uncorrelated process error (i.e., temporal variation in the growth rate that could be caused by, for example, uncorrelated environmental variation) by sampling each annual growth rate from a normal distribution. The population process error, η, is also sampled from a distribution (so different populations have different, but similar, levels of η), with σ_pop ~ Exp(1/μ_η), where 1/μ_η is the rate parameter.
Consequently, there is a tendency towards larger values for σ_pop, and therefore higher levels of process error, as μ_η, the expected value of the distribution, increases.
The mean of the normal distribution of population growth rates was itself drawn from a normal distribution, μ_pop ~ N(μ_spec, σ²_spec). Thus, populations from a species will have similar but not identical underlying mean population growth rates, representing perhaps differences in environmental conditions between geographically isolated populations of a given species. In turn, similar species were grouped together into datasets, and we assumed that species within taxonomic groups had underlying population growth rates that were drawn from an identical distribution, μ_spec ~ N(μ_ds, σ²_ds). Here, larger values for σ_ds lead to a broader range of underlying species growth rates, perhaps signifying broader species-specific variation in responses to drivers such as habitat change within a taxonomic group.
Using this hierarchical approach therefore captures the similarity of time series within a species, and the similarity of time series between species within a taxonomic group.
Growth for each population was modelled for 50 years, starting at a population size of 100. Populations were assigned to species by randomly sampling from a pool of 1000 species labels, with replacement, resulting in a normal distribution of populations per species, pps ~ N(μ_pps, σ²_pps), with μ_pps = 10 and σ_pps = 4.5. While populations are unlikely to be normally distributed across species in the real world (one would expect more rare species than common species), simulations confirmed that our modelling approach is robust against distributional assumptions for this parameter (see Figure S2).

FIGURE 1 Modelling trend accuracy in the LPI: an overview. This figure illustrates the methodology used in this study. The numbered boxes correspond with the numbered steps in the methods section below. Values given in the figure are for illustration and are not intended to represent actual inputs or results.
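To make the hierarchical data-generating process concrete, the following R sketch simulates one dataset under the model above. Function and argument names are ours, and σ_spec (the spread of population means within a species) is exposed as a separate argument because the text does not state its value; this is an illustrative sketch, not the code used in the study.

```r
# Minimal sketch of the "real-world" data generation described above.
simulate_dataset <- function(n_pop = 10000, n_species = 1000, n_years = 50,
                             mu_ds = 0, sigma_ds = 0.05, sigma_spec = 0.05,
                             mu_eta = 0.1, start_n = 100) {
  mu_spec <- rnorm(n_species, mu_ds, sigma_ds)             # mu_spec ~ N(mu_ds, sigma_ds^2)
  species <- sample.int(n_species, n_pop, replace = TRUE)  # ~10 populations per species
  N <- t(vapply(seq_len(n_pop), function(i) {
    mu_pop    <- rnorm(1, mu_spec[species[i]], sigma_spec) # mu_pop ~ N(mu_spec, sigma_spec^2)
    sigma_pop <- rexp(1, rate = 1 / mu_eta)                # sigma_pop ~ Exp(1/mu_eta)
    r <- rnorm(n_years - 1, mu_pop, sigma_pop)             # r_t ~ N(mu_pop, sigma_pop^2)
    cumprod(c(start_n, 1 + r))                             # N_{t+1} = (1 + r_t) * N_t
  }, numeric(n_years)))
  list(N = N, species = species)  # rows = populations, cols = years
}
```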

2.2 | Observation error
The variation in population growth rates modelled above assumes all variation is due to process error. However, time series in the LPD are based on population estimates, which can be assumed to include some level of observation error due to, for example, species misidentification, non-detection, and counting errors. This observation error is not accounted for in the LPI, but may affect trend reliability.
Observation error, ɛ, can be calculated using the coefficient of variation (cv), defined as

$$cv = \frac{\sigma_{ab}}{\mu_{ab}}$$

where μ_ab and σ_ab are the mean and standard deviation (respectively) of the abundance values. Since data in the LPD were collected using a variety of methods, and ɛ is not recorded in the database, we chose a range of ɛ consistent with values reported for other vertebrate surveys (Fryxell et al., 2014; Westcott et al., 2012; Zylstra et al., 2010). We determined through simulations that increasing observation error has no effect on trend accuracy.
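One plausible way to impose a target ɛ on a simulated time series is a mean-1 lognormal multiplier whose coefficient of variation equals ɛ; the choice of error distribution is our assumption, as the text does not specify one.

```r
# Sketch: add observation error with coefficient of variation = epsilon.
# A mean-1 lognormal multiplier is assumed here (not stated in the paper).
add_obs_error <- function(N_true, epsilon = 0.15) {
  sdlog <- sqrt(log(1 + epsilon^2))  # lognormal sdlog giving cv = epsilon
  N_true * rlnorm(length(N_true), meanlog = -sdlog^2 / 2, sdlog = sdlog)
}
```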

2.3 | Data degradation
Observed versions of the datasets were then randomly degraded to resemble the varied quality of sampled real-world data present within the LPD. The length (number of years from first to final observation) of each degraded time series within a dataset was randomly chosen by sampling from a Poisson distribution. We determined through simulations that varying the number of observations does not affect trend accuracy at a given time series length, so we fixed the mean number of observations at half of the mean time series length (rounded up to the nearest integer). The starting years for each time series were assigned randomly. Time series were then cut to their assigned length, and half of the remaining observations were randomly removed.
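A hedged sketch of this degradation step (names and the exact thinning rule are ours; we approximate "half the observations removed" by keeping the two endpoints plus roughly half of the interior points):

```r
# Degrade one time series: Poisson-distributed length, random start year,
# then thin interior observations; NAs mark removed values.
degrade_series <- function(N, mean_length = 14) {
  n_years <- length(N)
  len   <- min(max(rpois(1, mean_length), 2), n_years)  # degraded length, >= 2 points
  start <- sample.int(n_years - len + 1, 1)             # random start year
  span  <- start:(start + len - 1)
  interior <- setdiff(span, range(span))
  keep  <- c(range(span), sample(interior, floor(length(interior) / 2)))
  out   <- rep(NA_real_, n_years)
  out[keep] <- N[keep]
  out
}
```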

2.4 | Sampling
Populations were randomly sampled from each dataset, without replacement, until the desired sample size n was reached. This was repeated to obtain 20 random samples of the same size for each dataset. Values for four of the six dataset parameters described in Section 2.1 may differ between a sample and the dataset it is selected from, and may also vary between samples: the mean number of populations per species, the mean and standard deviation of population mean growth rates (μ_ds and σ_ds), and the mean of population standard deviations of the growth rate (μ_η).

2.5 | Calculation of sampled trends
Non-linear index trends were calculated from each sample, following the LPI method described in McRae et al. (2017). Time series with six or more data points were modelled using a generalized additive model (GAM), as described in Collen et al. (2009), with a Gaussian (normal) distribution, smoothed by a thin plate regression spline, with the number of knots set to half the number of observations (rounded down to the nearest integer). Time series with fewer than six data points were interpolated using the chain method (Loh et al., 2005), as described in Collen et al. (2009). The chain method imputes missing values using log-linear interpolation by

$$N_i = N_p \left( \frac{N_s}{N_p} \right)^{(i - p)/(s - p)}$$

where N is the population estimate, i is the year for which the value is to be interpolated, p is the preceding year with an observed value, and s is the subsequent year with an observed value. For all populations, whether interpolated or modelled by a GAM, species indices were formed by a three-step process. First, population sizes were converted to annual rates of change by

$$d_t = \log_{10}\left( \frac{N_t}{N_{t-1}} \right)$$

where N is the population estimate and t is the year. Second, average growth rates were calculated for each species by

$$\bar{d}_t = \frac{1}{n_t} \sum_{i=1}^{n_t} d_{it}$$

where n_t is the number of populations in a given species, d_it is the growth rate for population i at year t, and d̄_t is the average growth rate at year t. Growth rates were capped at [−1, 1]. Finally, index values were calculated by

$$I_t = I_{t-1} \times 10^{\bar{d}_t}$$

where I is the index value and t is the year.
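The following R sketch chains these steps together for a matrix of population time series. For brevity it interpolates every series with the chain method rather than fitting GAMs to the longer ones, and it assumes missing values lie between observed endpoints; names are ours.

```r
# Chain (log-linear) interpolation of interior missing values.
chain_interp <- function(N) {
  obs <- which(!is.na(N))
  for (i in which(is.na(N))) {
    p <- max(obs[obs < i]); s <- min(obs[obs > i])
    N[i] <- N[p] * (N[s] / N[p])^((i - p) / (s - p))
  }
  N
}

# Index from a matrix of series (rows = populations, cols = years):
# capped log10 growth rates, species averages, then the chained index.
lpi_index <- function(N_mat, species) {
  d <- t(apply(N_mat, 1, function(N) {
    pmin(pmax(diff(log10(chain_interp(N))), -1), 1)   # d_t, capped at [-1, 1]
  }))
  d_spec <- sapply(split(as.data.frame(d), species), colMeans)  # species means
  cumprod(c(1, 10^rowMeans(as.matrix(d_spec))))       # I_t = I_{t-1} * 10^(mean d_t)
}
```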

2.6 | Calculation of the 'true' trend
A non-linear index trend was calculated for each complete, undegraded dataset (without observation error), following McRae et al. (2017), as for the sampled trends. However, the undegraded datasets had no missing values, so modelling each time series using the chain method or a GAM was unnecessary, and that step was skipped.

2.7 | Comparison of trends
We selected an appropriate distance measure to compare sampled trends with 'true' trends using the selection process described in Dove et al. (2022). Of the distance measures deemed appropriate, we chose the Jaccard distance because it uses a 0-1 scale, making it easier to interpret. The Jaccard distance is calculated as (from Cha, 2007)

$$d_{Jac}(P, Q) = \frac{\sum_{t=1}^{n} (P_t - Q_t)^2}{\sum_{t=1}^{n} P_t^2 + \sum_{t=1}^{n} Q_t^2 - \sum_{t=1}^{n} P_t Q_t}$$

where P_t and Q_t are index values from two trends P and Q at time point t, and n is the number of time points. From here on, any value calculated by applying the Jaccard distance to compare sampled versus 'true' trends will be referred to as a trend deviation value, or TDV.
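In code, the calculation is a one-liner; this sketch follows the continuous form of the equation above.

```r
# Jaccard distance between two index trends (Cha, 2007): identical trends give 0.
jaccard_dist <- function(P, Q) {
  sum((P - Q)^2) / (sum(P^2) + sum(Q^2) - sum(P * Q))
}
jaccard_dist(c(1, 0.96, 1.03), c(1, 0.98, 0.99))  # small TDV for similar trends
```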
We used TDV here as a measure of trend accuracy, but it is in fact the complement of accuracy (a perfectly accurate trend would yield a TDV of zero); lower TDV means higher accuracy. Furthermore, when referring to TDVs of simulated trends, we used the term 'trend accuracy,' but when referring to TDVs of LPI trends, we used the term 'trend reliability.' This is because TDVs for simulated trends were measured, while TDVs for LPI trends were estimated based on a model. Trend reliability is thus a measure of expected accuracy based on underlying data sufficiency or deficiency, but should not be considered a proxy for accuracy. In other words, a data deficient trend may be accurate but we cannot rely on it to be so.

2.8 | Generation of datasets
We generated 3000 datasets (each consisting of 1000 species and 10,000 populations), with each dataset sampled 20 times, resulting in 60,000 samples. Values for mean time series length, μ_ds, σ_ds, and μ_η were randomly selected from uniform distributions, while sample size was randomly selected from a log-uniform distribution,

$$SS \sim e^{U(\ln a,\, \ln b)}$$

where SS is sample size and a and b are the minimum and maximum values, respectively (log-uniform was chosen to ensure the model would be robust at small sample sizes, as most datasets in the LPD are small). Ranges for the distributions were chosen to ensure that parameter ranges in the samples would be broader than the ranges present in the LPD (Table 1). Regional taxonomic groups from the LPD with fewer than 20 populations were excluded from parameter range calculations to avoid extreme outliers. We set the minimum sample size to 50 because smaller samples rarely generated a complete trend, and the maximum to 10,000 to improve predictions of the effects of sample size increases.
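A draw from this log-uniform distribution can be sketched as:

```r
# Log-uniform sample-size draw: SS = exp(U(ln a, ln b)), with a = 50, b = 10,000.
r_log_uniform <- function(n, a = 50, b = 10000) {
  round(exp(runif(n, log(a), log(b))))
}
hist(r_log_uniform(60000))  # most draws are small, as intended
```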

2.9 | Multiple regression model
We built a multiple linear regression model to understand how variables in the simulated data determine trend accuracy (TDV). First, we removed all simulated datasets in which the mean of the sample parameter values fell outside of LPD parameter ranges (individual replicates were allowed to fall outside of LPD ranges), leaving 2361 datasets, or 47,220 samples. We then randomly selected 67% of the remaining datasets (1581 datasets) to train the model, reserving the rest for testing.

2.10 | Model validation
The residuals of the combined data used to train the model were approximately normally distributed. Likewise, the residuals appeared homoscedastic when plotted against fitted values. We compared the actual TDV of each sample in the testing datasets to the predicted TDV for that sample calculated by the model, then calculated the RRMSEP (relative root mean squared error of prediction), defined as

$$\text{RRMSEP} = \text{RMSE} / \text{SD} \tag{10}$$

where SD is the standard deviation of the actual TDVs and RMSE is the root mean squared error,

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where y_i is the ith actual TDV, ŷ_i is the predicted TDV, and n is the number of samples.
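In R, the validation statistic reduces to:

```r
# RRMSEP on held-out test samples: RMSE of predictions scaled by the
# standard deviation of the observed TDVs.
rrmsep <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2)) / sd(actual)
}
```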

2.11 | Maximum trend deviation value
We set a maximum predicted TDV as a threshold that regional taxonomic group trends within the LPI should not exceed to be considered reliable. First, we built a linear regression model of the square root of TDV from our training datasets, with the natural log of sample size as the predictor variable, since sample size is the only user-controlled variable within the LPD. Every regional taxonomic group within the LPD represents a single sample from the real world; therefore, we were not interested in the mean TDV achieved by each dataset, but in the range of possible TDV values, especially the upper part of the range (the least accurate sample trends from each dataset).
We used 10,000 bootstrap estimations of the mean of the TDV from each dataset to calculate the 90% confidence intervals using the bias corrected and accelerated bootstrap interval (BCa) method, also known as the adjusted bootstrap percentile method. The BCa method is a non-parametric method that does not assume the data is normally distributed (the TDV values have a beta distribution) and corrects for bias and skewness in the distribution of the mean estimates. We plotted the curve of the square root-log model of the upper 90% confidence interval of TDV in relation to sample size on a (non-log) graph of TDV versus sample size (Figure 2).
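A minimal sketch of this bootstrap using the boot package, where tdv_values is a hypothetical stand-in for the 20 TDVs of one dataset:

```r
# BCa interval of the mean TDV for one dataset's 20 sample trends (R = 10,000).
library(boot)
tdv_values <- rbeta(20, 2, 8)  # hypothetical TDVs for one dataset's 20 samples
b <- boot(tdv_values, statistic = function(x, i) mean(x[i]), R = 10000)
boot.ci(b, conf = 0.90, type = "bca")$bca[4:5]  # lower and upper 90% limits
```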
Increasing sample size should naturally lead to a lower (more desirable) TDV, but increasing sample size is costly in terms of time and money, and it may be prudent to put the resources elsewhere. It was therefore important to choose a maximum TDV that reflects these trade-offs. To choose a maximum TDV, we used the concordance probability method (CZ) (Liu, 2012). We adapted CZ from the field of biomedical research, where it is often necessary to specify a cut-off value to discriminate between positive and negative results from screening or diagnostic tests (Liu, 2012).
First, a receiver operating characteristic (ROC) curve is built, plotting the rate of true positives (sensitivity) against the rate of false positives (1 − specificity). The idea is to find the point on the curve that maximizes both sensitivity and specificity. The CZ method simply finds the point where their product is maximized.
By considering the square root-log model of the upper 90% confidence interval of TDV versus sample size (Figure 2) as equivalent to an ROC curve, we applied the CZ method to find the point on the curve where TDV and sample size are jointly minimized. This is the point where we should achieve maximum value from the data; further right along the curve, increasing the sample size would give a smaller improvement in trend reliability and is therefore not cost- or resource-effective. Since an ROC curve is intended for binary classification, the CZ method assumes that both sensitivity and specificity are on a 0-1 scale. TDV already ranges from 0 to 1, so we set sensitivity as 1 − TDV. We normalized sample size to a 0-1 scale by converting it to a proportion of the complete dataset (dividing by the total number of time series in the dataset); since all datasets were the same size, the relationship between TDV and sample size was not altered by the conversion to a proportion. Specificity was then 1 − sample size. The optimal cut-point on the curve is defined as

$$c^* = \arg\max_{c} \{ Se(c) \times Sp(c) \}$$

where Se is sensitivity, Sp is specificity, and c is any cut-point.

FIGURE 2 Trend deviation value (TDV) versus sample size. This plot includes only the upper 90% confidence interval of TDV from each simulated dataset. The curved blue line is the square root-log model of the plotted values. The vertical red line intersects the square root-log curve at the optimal cut-point.
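Given a fitted curve evaluated at candidate cut-points, the CZ search reduces to maximizing the product above; input names here are ours (tdv_fit is the fitted upper-CI TDV at each sample-size proportion ss_prop):

```r
# Concordance probability (CZ) cut-point: maximize sensitivity x specificity.
cz_cutpoint <- function(ss_prop, tdv_fit) {
  cz <- (1 - tdv_fit) * (1 - ss_prop)  # Se = 1 - TDV, Sp = 1 - sample size
  c(ss_prop = ss_prop[which.max(cz)], tdv = tdv_fit[which.max(cz)])
}
```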
2.12 | Minimum sample size for regional taxonomic groups

Minimum sample size was calculated by rearranging the formula for the multiple regression model to solve for sample size and replacing the TDV variable in the formula with the cut-off value determined above. Values for the other variables in the formula were determined separately for each regional taxonomic group from the LPD: populations with fewer than two data points were removed, and missing data was interpolated using the chain method (Collen et al., 2009).

2.13 | Assigning reliability ratings to regional taxonomic groups

The actual number of populations in each regional taxonomic group was divided by the minimum sample size and multiplied by 100 to determine the percentage of the minimum sample size actually met by each group. Groups achieving 100% or greater were designated as reliable, those achieving between 50% and 100% were designated as data deficient, and those achieving <50% were designated as severely data deficient.

2.14 | Correlations between reliability rating and LPI relative weighting
The LPI uses a weighting system to account for the estimated number of species in each regional taxonomic group, to reduce representational bias. Each regional taxonomic group has a relative weighting assigned to it, which is used in the calculation of aggregated indices. We used the Pearson's product moment correlation coefficient test to determine if there was any significant correlation between the percentage of the minimum sample size achieved for each regional taxonomic group and the assigned relative weightings in the LPI for each group. The test was performed on the combined dataset as well as each individual system.
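A minimal sketch of this test (the vectors are hypothetical stand-ins for the per-group values):

```r
# Pearson correlation between percentage of minimum sample size achieved
# and LPI relative weighting (toy values for illustration only).
pct_met    <- c(120, 45, 80, 30, 150, 60)
rel_weight <- c(0.10, 0.30, 0.15, 0.25, 0.05, 0.15)
cor.test(pct_met, rel_weight, method = "pearson")
```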

2.15 | Modelling potential solutions
We simulated three potential solutions for reducing data deficiency and compared each against a control (a sample of 200 time series with a mean length of 14 years, with all other parameters fixed; see Figure 6). In solution A, a 10-year 'data blitz' tracking previously unstudied populations, we added 200 time series with observations only in the final 10 years of the index. In solution B, resampling previously-studied populations, we added back the final observation of every time series in the sample. In solution C, collating existing data, we doubled the sample size to 400. Trend accuracy (TDV) was then calculated for each group as described above.

2.16 | Coding and data
All trends for synthetic data were produced using original code designed to reproduce the functionality of the rlpi package (Freeman et al., 2021). All coding was done in R (R Core Team).

3 | RESULTS

3.1 | Regression model
The regression model contains five independent variables (Tables 1 and 2). Together they describe 62% of the variation in the TDV associated with sampled trends (adjusted r-squared: .62; F(5, 29,385) = 9686, p < .001). All independent variables are statistically significant predictors, with p < .001 (Table 2).
Interaction terms are also statistically significant but do not increase the adjusted r-squared of the model, so we left them out.
RRMSEP is 0.231. Sample size is the most important variable affecting trend accuracy. As expected, higher sampling leads to a lower TDV (higher accuracy). The other variables all have smaller effects. Much of the unexplained variance from the model is due to random sampling. We confirmed this by remaking the model using the sample means, which resulted in an adjusted r-squared of .87. Using the square root of TDV instead of the log further increased the adjusted r-squared to .93. This was not the case for the model using the individual samples, where the log resulted in a higher adjusted r-squared than the square root.

3.2 | Maximum trend deviation value
Using the concordance probability method to select a cut point on the square root-log model of the 90% upper confidence interval of TDV versus sample size, we found a maximum TDV value of 0.176. After placing this value into the model equation and rearranging to solve for sample size, we applied the model to the LPI to find the minimum number of populations needed for each regional taxonomic group.
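As an illustration of this rearrangement, assuming the fitted model is linear in log(TDV) with a log(sample size) term (the text reports log and square-root transforms but not the full functional form), the inversion would look like:

```r
# Sketch: invert log(TDV) = b0 + b_ss * log(SS) + sum(b_j * x_j) for SS,
# with TDV set to the 0.176 threshold. Coefficient names are illustrative.
min_sample_size <- function(fit_coefs, group_vars, tdv_max = 0.176) {
  others <- sum(fit_coefs[names(group_vars)] * unlist(group_vars))
  ceiling(exp((log(tdv_max) - fit_coefs["(Intercept)"] - others) /
              fit_coefs["log_ss"]))
}
```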

3.3 | Minimum sample size
The number of populations needed to achieve the TDV threshold for a reliable trend varies across taxonomic groups and realms (Table 3).

3.4 | Trend reliability
Reliability varies strongly across realms, taxonomic groups, and systems (Figures 3 and 4). Terrestrial trends are the most reliable and freshwater trends the least. Terrestrial and freshwater trends are more reliable in the global north than in the global south, except for terrestrial reptiles and amphibians. Marine bird trends are more reliable in temperate areas than in the tropics, while marine fish trends are more reliable in tropical waters than polar. Globally, bird trends are the most reliable but are nonetheless poor in the tropics, especially Africa. Reptile and amphibian trends are data deficient everywhere except the terrestrial Neotropical realm, and marine and freshwater mammal trends are data deficient everywhere (although marine Indo-Pacific mammals are very close to the threshold, at 97%).
The regional taxonomic groups with the greatest potential to affect the reliability of aggregated LPI trends are exclusively tropical (Figure 5), due to a combination of high relative weighting and low reliability scores. The eight groups of greatest concern include five freshwater and three terrestrial groups, but no marine groups; all are from the tropics. Fishes, birds, and reptiles and amphibians are represented, with mammals absent.

TABLE 3 The trend deviation value, the current number of populations in the LPD, the minimum number of populations that would meet the reliability threshold, and the number of additional populations that must be added to achieve the reliability threshold for each regional taxonomic group in the LPD. Note that the trend deviation values here were calculated using the model formula and therefore occasionally fall outside of the 0-1 range of the Jaccard distance the TDV is based on.

3.5 | Modelling potential solutions
Adding 200 existing time series (solution C, which doubled the sample size to 400) shows a statistically significant improvement in trend accuracy compared to the control group (p < .001), whereas the data blitz (solution A) and resampling (solution B) show slight but non-significant improvements (Figure 6).

4 | DISCUSSION
Understanding the changing global state of biodiversity is crucial to making good policy and conservation decisions and 'bending the curve' of biodiversity loss (Mace et al., 2018). Acquiring accurate and comprehensive data is crucial, but the first step is to answer the question: what do we actually know? The present study quantifies the reliability of trends for each regional taxonomic group in the Living Planet Index and estimates the number of population time series needed to meet a standard of expected accuracy.
We used synthetic population time series datasets to construct a multiple regression model of trend accuracy by comparing trends of degraded samples with the trends of the full, undegraded datasets using a distance measure (Figure 1). We applied the model to regional taxonomic groups in the Living Planet Database to reveal that the majority need additional data for their trends to be considered reliable. Data deficiency is a problem globally but is more pronounced in the tropics. This is consistent with the gaps and biases documented in other global biodiversity datasets (Boakes et al., 2010; Collen et al., 2008; Yesson et al., 2007).

Monitoring efforts have increased for reptiles and amphibians (Oliver et al., 2021; Scheele et al., 2019), especially with the rise of citizen science (Oliver et al., 2021). However, marine reptiles, which are overrepresented by species in all realms (except the South temperate realm, where they are not represented at all), were found to be data deficient in all realms. In contrast, marine fishes are underrepresented by species, yet their trends are among the more reliable. A weighting system is applied to the LPI, which accounts for the estimated number of species in each regional taxonomic group to reduce representational bias.

FIGURE 3 Proportion of the total amount of time series data needed to achieve the trend reliability threshold that each regional taxonomic group in the LPD currently contains. A score of 100% or greater means that group already has enough data to produce a reliable trend. A white box refers to a group that meets the reliability threshold, while a colored box means the threshold has not been met. The further the group is from meeting the threshold, the more intense the color. A grey box refers either to a group that could not be evaluated because there was too little data (South temperate marine reptiles) or to an invalid realm-taxon combination (there are no marine reptiles in the Arctic).

FIGURE 4 Reliability of regional taxonomic group trends in the LPI, grouped by system, realm, and taxon. Map (a) shows the terrestrial (top) and freshwater (bottom) results. Map (b) shows the marine results. The results box outlines are coloured to match their corresponding realms on each map. Reliability scores are binned into three categories, according to the number of time series in the LPD relative to the minimum sample size needed to achieve the TDV threshold. A check mark means that group has at least 100% of the minimum sample size and is considered reliable, a dash means it is data deficient (50%-99%), and an X mark means it is severely data deficient (<50%).
One problem with this is that most of the world's vertebrate species are located in the tropics (McRae et al., 2017), which are underrepresented in the LPD. Our concern was that if trends from these areas are the least reliable due to data deficiency, then the LPI could have simply replaced one problem, representation bias, with another: overreliance on data deficient trends. Indeed, our analysis shows that all regional taxonomic groups with a high relative weight and low reliability rating (bottom right of Figure 5) are tropical. Surprisingly, though, we did not find a statistically significant negative correlation between the reliability of trends and their relative weights in the LPI. This also holds true for the terrestrial and freshwater systems when considered separately (the marine system actually shows a positive correlation) and is consistent with Nori et al. (2020), who found that species richness and knowledge gaps are not always correlated.

FIGURE 5 Trend reliability of regional taxonomic groups in the LPD (measured as the percentage of populations in the LPD relative to the number required to achieve the TDV threshold) versus the relative weighting applied to each group when calculating aggregated LPI trends. Only groups with reliability ratings below the threshold (<100%) are included. To determine the groups having the strongest negative effect on the reliability of aggregated LPI trends, we calculated relative weight × (100 − reliability) and labelled the groups with a value higher than 1.

FIGURE 6 The effect on trend accuracy of potential solutions to data deficiency in LPI regional taxonomic groups. The control group has a sample size of 200 and a mean time series length of 14. Group A has an additional 200 time series with observations only in the final 10 years of the index, to simulate a 10-year data blitz. In group B, the final observation has been added back in for every time series, to simulate resampling of previously-studied populations. Group C is like the control group, but the sample size has been doubled to 400, to simulate adding additional pre-existing studies to the LPI. All other parameters are fixed: μ_ds = 0; σ_ds = 0.3; μ_η = 0.4; populations per species = 10; trend length = 50; μ_ɛ = 0.15; σ_ɛ = 0.1. Each box represents the mean values of 20 datasets, with 20 samples per dataset.
According to our model, the size of a dataset, that is, the number of species or populations existing in the real world for any regional taxonomic group, is unimportant to the calculation of trend reliability for a given sample, as long as the sample represents less than half of the time series in the dataset (Figure S1). In other words, it is the absolute number of populations represented in the sample that matters, regardless of whether that sample represents 1% or 50% of the total populations in a regional taxonomic group. A key principle working to cause this seemingly counterintuitive effect is that the relationship between population size and the sample size needed to reach a desired level of precision is logarithmic and becomes more extreme at lower levels of precision (Israel, 1992).

Both of the simulated field solutions, the data blitz (A) and resampling (B), had a slight but non-significant positive effect on trend accuracy but were far less effective than adding existing data (solution C, as is currently done for the LPD). It is likely that both solutions have a greater effect on the accuracy of the final portion of the trend than on the overall trend, but further study would be required to be certain. Either way, resampling would be more efficient than a data blitz, as the same improvement could be achieved in 1 year instead of 10. In the long term, tracking additional populations is essential to completing our picture of biodiversity change. Natural stochasticity means that short time series are of limited value in generating reliable trends (Wauchope et al., 2019), so tracking additional populations takes time to pay dividends. Nonetheless, overcoming indicator biases and data deficiencies will require a balanced global profile of populations, counted regularly to ensure changes can be detected quickly.
There is another limitation underlying the LPI, which cannot be solved by generating new data. All trends in the LPI begin in the year 1970, which is set as the base year for calculating the index values.
Past trends can only be determined by existing data; therefore, while there may be some currently inaccessible data that either could be shared or made available for confidential storage in the LPD (Saha et al., 2018), there are likely to be severe limitations to relieving data deficiency for the early years of the LPI. However, other potential solutions could be examined in future studies. One would be to begin the index at a later year in which there is more data available (e.g., 1990). Another would be to change the base year for calculating the index to a more data-rich year, thus increasing the uncertainty around the early years of LPI trends (Gregory et al., 2019). The downside is that the interpretation of trends would be different. The LPI would no longer measure change in global vertebrate biodiversity relative to 1970, but relative to another year, and much of the change currently recorded in the index would have already occurred before the base year. A different approach would be to use other kinds of data, such as log books and catch records (e.g., Josephson et al., 2008), genetics (e.g., Beaumont, 2003), trade records (e.g., Collins et al., 2020), and land use/climate change modelling (e.g., Visconti et al., 2016) to infer historical abundance estimates for populations where no monitoring took place.
Our regression analysis of the simulated data highlights some rather straightforward results: more data in terms of sampled populations and/or longer time series leads to higher reliability of trends, and more variation in population growth rates within and between populations leads to lower reliability. However, we also found that regional taxonomic groups that show positive trends might need more data (higher sampling, longer time series) than those that are declining. The corollary is that fewer samples might be required to obtain reliable trend estimates for declining groups, but this result also has implications for any biases in species selection. For example, monitoring efforts tend to focus on species at higher risk of extinction (Scheele et al., 2019). Many amphibian populations in the LPD were tracked because they were declining due to the devastating disease chytridiomycosis. This could negatively bias trends and falsely reduce variance in growth rates, leading the model to overestimate reliability because it assumes that tracked populations are randomly selected. On the other hand, Murali et al. (2022) found that population coverage in the LPD is biased towards protected areas, where species are less likely to be threatened, therefore potentially causing a positive bias in LPI trends. Our results also imply that any biases towards or against species that have high/low process error, that is, have very variable annual growth rates, could potentially also bias our estimates of trend reliability. However, our analysis of the simulated data suggests overall sampling intensity far outweighs the other factors included in our model, not least because as sampling number increases so does the coverage of the variability in the taxonomic group.
Other biases in the LPD could also have important effects on our estimates of reliability. Time series are non-randomly distributed across time and/or space in the LPD. For example, while some biodiversity hotspots (e.g., tropical Africa) are poorly sampled, others, especially islands (e.g., Madagascar), are well-studied (Nori et al., 2020), and this may bias entire realms. In the Afrotropical realm, six (12%) of the 51 terrestrial reptile and amphibian time series in the LPD are from Round Island (a tiny uninhabited island near Mauritius) and more than half (57%; 29/51) are from a single study that took place at a reserve in Madagascar over a nine-year period; only seven (14%) are from mainland Africa, and of the seven, four are from a single study at a reserve in Nigeria. In this case, the model likely severely underestimates the amount of data needed to get a reliable trend. Valdez et al. (2023) found that a coarser sampling resolution increases the ability to detect global biodiversity change by reducing the effects of outlier population trends.
Sampling resolution biases such as that in the Afrotropical realm will surely decrease trend reliability at a given sample size. While the Afrotropical realm may be an extreme example, it shows that there are important underlying aspects of the data that cannot be assessed by a model. Fortunately, these issues tend to diminish when more data is present, and thus should not have a large effect on trends assessed as reliable. Our model also assumes that adding additional time series to the LPD will maintain the parameters of the regional taxonomic group to which they are added (e.g., the mean and variation of growth rates).

Another limitation of our modelling approach is that we could not correct for the sizes of the 'real-world' datasets (the number of populations that exist) that the LPD 'samples' are drawing from, and therefore may overestimate the sample size needed to achieve a reliable trend for very small datasets. Although there are estimates of the number of species for each regional taxonomic group, our model uses populations as the base unit to measure sample size. We chose to base sample size on populations rather than species for two reasons. First, we found that mean growth rates within the LPD vary almost as much between populations within a species as they do between species. Therefore, we cannot assume that the trend of a population represents the trend of the species it belongs to any better than it represents the trend of its entire regional taxonomic group. Second, localized threats such as land-use change and habitat destruction are likely to affect some populations within a species disproportionately. Population extinctions also occur much more frequently than species extinctions, and may serve as an early warning (Ceballos et al., 2017).
However, a population is not a well-defined unit, and we do not have estimates of how many populations each species or regional taxonomic group is composed of. While our testing suggested we can assume the number of existing populations to be unimportant in determining trend reliability, this assumption breaks down when the sample comprises a large percentage of the dataset. It is unlikely that any regional taxonomic groups currently approach this level of representation within the LPD, but it is nonetheless an important caveat to be aware of.
Despite these caveats, the results of our study reveal the strengths and weaknesses in our understanding of global vertebrate biodiversity, highlighting the regional taxonomic groups for which we have enough data to make responsible decisions, as well as those on which future data gathering and collation efforts should focus.
Some underlying aspects of the data create biases that are not taken into account by our modelling approach, and more fine-scale studies on gaps in population trends should be performed to better understand these biases and where to divert scientific resources. We show that revisiting previously-studied populations is a quick and efficient way to improve trend reliability for data deficient groups until more long-term studies can be completed and made available.
The modelling approach we use to quantify trend reliability can also be generalized to assess other global and/or regional biodiversity indices that utilize population time series data. We are facing an urgent global biodiversity crisis made worse by biased and deficient data, but through careful study and cooperative global efforts we can solve the data problem and begin to 'bend the curve' of biodiversity toward a positive trend.

ACKNOWLEDGMENTS
We thank Sean Jellesmark, Gonzalo Albaladejo-Robles, and Bouwe Reijenga for their support. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No.

CONFLICT OF INTEREST STATEMENT
The authors declare no competing interests.