Global gaps in trait data for terrestrial vertebrates

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. Global Ecology and Biogeography published by John Wiley & Sons Ltd
Department of Genetics, Evolution and Environment, Centre for Biodiversity and Environment Research, University College London, London, UK


| INTRODUCTION
Species traits are fundamental to ecological and evolutionary research.
Comparative studies regularly use trait data across organisms to understand evolutionary processes and species coexistence (Escudero & Valladares, 2016; Zamudio et al., 2016), to investigate global patterns of life-forms and functions (Díaz et al., 2016) or to assess the vulnerability of species to environmental changes (Bohm et al., 2016; Pacifici et al., 2015; Pearson et al., 2014). Given that traits influence the ability of species to cope with environmental changes (Newbold et al., 2013) and underpin the contributions of species to ecosystem processes (Lavorel & Garnier, 2002; Violle et al., 2007; Wong et al., 2018), they play an increasingly important role in functional and conservation ecology.
Past and recent efforts to collate and release trait data in the public domain have facilitated the development of trait-based research. For instance, a global trait database has been published for plants (Kattge et al., 2011). As of May 2020, data from this database had been used in 305 publications since its release (activity report, 15 September 2020, https://www.try-db.org/TryWeb/Home.php).
Such databases constitute invaluable research tools and have the potential to advance the field greatly.
Nevertheless, despite the importance of vertebrate species in global research outputs, there is no single source for vertebrate ecological traits. Consequently, researchers wishing to conduct comparative studies across vertebrate groups might have to collate trait data from a range of sources (such as in the studies by González-Suárez et al., 2018), a time-consuming prerequisite that might be a limiting step of the research process. Indeed, collating data from heterogeneously formatted sources presents many challenges (Schneider et al., 2019), particularly when working across a large number of species. For instance, traits might be measured differently across datasets; units might be inconsistent; and taxonomic resolution and nomenclature might vary.
The lack of a curated, readily available global database for vertebrate ecological traits impedes our ability to conduct cross-taxon comparative studies at global scales. However, efforts to collate data into a single database are limited by the availability of underlying data. Given that there are important gaps in biodiversity knowledge (Hortal et al., 2015), trait datasets are often incomplete, with many species lacking estimates for many traits. The incompleteness of ecological trait data at the species level has been termed the "Raunkiaeran shortfall" by Hortal et al. (2015). Furthermore, incomplete trait data are likely to be biased. Biases in trait data can be the consequence of uneven taxonomic and spatial collection effort, with a set of charismatic or easily detectable species being more completely sampled. For instance, González-Suárez et al. (2012) investigated biases in global trait information in mammals. Notably, they found that the availability of mammalian trait data was geographically and phylogenetically biased, with larger and more widely distributed species being better sampled. In addition, data availability differed across IUCN Red List extinction risk categories, with threatened species (Critically Endangered, Endangered or Vulnerable) being less well sampled for traits than non-threatened species (Least Concern or Near Threatened).
A major issue with incomplete, biased data is the introduction of bias in subsequent analyses. Assessing the amount of missing data in addition to the so-called "missingness mechanism" (whether missing data are missing at random, as opposed to there being systematic biases in the way missing values are distributed; see Baraldi & Enders, 2010) is an important prerequisite. Indeed, there exist diverse techniques to deal with data missingness. The simplest one consists of retaining complete cases only by filtering out missing values (case deletion; see Nakagawa & Freckleton (2008)

BOX 1 Definitions
Trait: Sensu stricto, a characteristic measurable at the level of an individual and that influences organismal fitness or performance (Violle et al., 2007). In this paper, we broaden this definition to include "ecological" traits (e.g., number of habitats used by a species), where the relationship of a species to the surrounding environment needs to be considered. Ecological traits are estimated by aggregating data across multiple individuals.
Trait completeness: For a given species, the proportion of traits for which an estimate is available.
Trait coverage: For a given trait, the proportion of species for which an estimate is available.
conclusions when values are not missing at random (González-Suárez et al., 2012). Therefore, it is crucial to determine the most appropriate way to deal with data incompleteness. For instance, previous studies using terrestrial vertebrate trait data have implemented multiple imputation techniques to fill in the gaps (González-Suárez et al., 2012). Nevertheless, imputation techniques could be sensitive to non-randomness in trait data. Phylogenetic biases (where some clades are undersampled compared with other clades) could notably impact the performance of several imputation approaches. It is thus vital to characterize the gaps in trait data before any analysis. However, there has been no study to date investigating global patterns in the availability of trait data across terrestrial vertebrates.
Here, we aim to assess the global state of trait data in terrestrial vertebrates. We focus on a set of traits that are available across the four classes and that are commonly used by ecologists: body size; litter or clutch size; longevity; trophic level; activity time; habitat breadth; and a measure of habitat specialization.
We quantify and compare the gaps in trait data across classes by calculating the coverage of each trait across species and the completeness of trait estimates for each species (Box 1). We investigate taxonomic, spatial and phylogenetic biases in trait coverage and completeness.
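The two quantities defined in Box 1 can be illustrated with a minimal Python sketch. The species names, trait names and values below are hypothetical toy data, and missing estimates are represented by `None`; this is an illustration of the definitions, not the code used in the study.

```python
from typing import Dict, Optional

# Hypothetical toy table: species -> {trait: value, or None when no estimate exists}.
TraitTable = Dict[str, Dict[str, Optional[float]]]


def trait_completeness(table: TraitTable) -> Dict[str, float]:
    """For each species, the proportion of traits with an available estimate."""
    return {
        species: sum(value is not None for value in traits.values()) / len(traits)
        for species, traits in table.items()
    }


def trait_coverage(table: TraitTable) -> Dict[str, float]:
    """For each trait, the proportion of species with an available estimate."""
    trait_names = list(next(iter(table.values())).keys())
    n_species = len(table)
    return {
        trait: sum(table[sp][trait] is not None for sp in table) / n_species
        for trait in trait_names
    }


toy: TraitTable = {
    "Species A": {"body_size": 12.0, "longevity": 5.0, "litter_size": None},
    "Species B": {"body_size": 30.0, "longevity": None, "litter_size": None},
}

completeness = trait_completeness(toy)  # one value per species
coverage = trait_coverage(toy)          # one value per trait
```

Completeness is computed row-wise (per species) and coverage column-wise (per trait), so the same table yields both measures.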
Given that biodiversity research is biased globally towards birds and mammals (Titley et al., 2017), we hypothesize that herptiles are less well sampled for traits than mammals and birds, having both lower coverage and completeness. Furthermore, building upon previous studies conducted on mammals (González-Suárez et al., 2012), we hypothesize that species rarity influences completeness, focusing on the geographical range size of species as one aspect of rarity. Widely distributed species could be better sampled than narrowly distributed species because their ranges overlap with more study sites, regardless of their abundance.
As such, we test whether the geographical range size of species explains trait completeness.
It is well established that global research effort is distributed unequally (United Nations Educational, Scientific and Cultural Organization, 2015), with patterns underpinned by various geographical and socio-economic factors. For instance, countries with higher gross domestic product tend to host a larger number of research institutions (Martin, Blossey, & Ellis, 2012). The proximity of research infrastructures and the accessibility of survey sites play an important part in explaining the global distribution of knowledge (Hortal et al., 2015). As a result of these factors, biodiversity data gaps tend to be greater in tropical areas (Collen et al., 2008). Tropical areas have the greatest species richness, and therefore these data biases are of great concern for biodiversity conservation. It is thus important to assess whether species-rich regions are systematically undersampled for traits compared with species-poor regions, given the significance of species-rich areas for global conservation. Here, we investigate spatial biases in trait completeness, hypothesizing that species-rich areas are on average less well sampled than species-poor areas.
Finally, we investigate phylogenetic biases in the trait data. We assess whether particular clades have received more attention than others by looking for patterns in the distribution of trait completeness across the terminal branches of phylogenetic trees in each class.

| Sources and taxonomic matching
We used freely accessible secondary sources in our compilation (Table 1), selected for their broad taxonomic coverage and/or for their frequent use in macroecological studies. Across sources, the same species could appear under synonymous names. This was a potential problem for matching sources by binomial names. Indeed, synonymy can artefactually decrease trait coverage, when trait information is not available across all synonyms. Notably, difficulties arise when species have been divided into several subspecies or when different subspecies are clumped together. Systematic manual checks could not be applied considering the scale of the data collection (there were >39,000 unique binomial names across sources). We developed a procedure aiming at identifying one accepted name for each of the binomial names found across sources. When we could not find an accepted name, we used the original name (Figure 1).
Briefly, the procedure consisted of extracting synonyms from the IUCN (IUCN, 2020) or from the Integrated Taxonomic Information System (ITIS; https://www.itis.gov/), using the rredlist (Chamberlain, 2018) and taxize (Chamberlain & Szöcs, 2013) R packages. One accepted name was assigned to each synonym. We produced a "Synonym" dataset that we have also made available. We then normalized taxonomy across sources by replacing binomial names with their identified accepted name where applicable.
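The name-normalization step described above can be sketched as a simple lookup. The snippet below is a minimal illustration in Python, not the R workflow used in the study: the synonym table and the binomial names are hypothetical placeholders, and the fallback to the original name mirrors the rule stated in the text.

```python
from typing import Dict, List


def normalize_names(names: List[str], synonym_to_accepted: Dict[str, str]) -> List[str]:
    """Replace each binomial name with its accepted name when one is known;
    otherwise keep the original name unchanged."""
    return [synonym_to_accepted.get(name, name) for name in names]


# Hypothetical "Synonym" table: synonym -> accepted name.
synonyms = {"Genus oldname": "Genus newname"}

# Hypothetical names pooled from several sources.
source_names = ["Genus oldname", "Genus newname", "Genus unknown"]

normalized = normalize_names(source_names, synonyms)

# After normalization, the synonym and the accepted name collapse into one species.
unique_species = sorted(set(normalized))
```

Without the correction, the three names would be counted as three species; with it, two remain, which is how synonymy can inflate species counts and lower apparent trait coverage.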
Given that different taxonomic backbones could be used to correct for taxonomy, we make two versions of our trait compilations available (corrected and not corrected for taxonomy), meaning that users are free to apply their own corrections; for example, taxonomy could be aligned to that of class-specific sources. Datasets corrected for taxonomy contain 11,634 species of birds, 5,381 mammals, 10,612 reptiles and 6,990 amphibians. Where no taxonomic correction was applied when matching sources, the compiled datasets contain 13,501 birds, 5,791 mammals, 11,012 reptiles and 8,583 amphibians. For more information, see the Supporting Information (Appendix S1; Figure S1).

| Compilation methods
For continuous traits, we took the median value within species when multiple estimates were available from different sources, after removal of any repeated values, which were assumed to represent estimates duplicated across secondary compilations and derived from the same underlying primary sources. Although intraspecific variation is increasingly being recognized to have important effects on ecological systems (Bolnick et al., 2011; Des Roches et al., 2018; González-Suárez & Revilla, 2013; Siefert et al., 2015), it was not feasible to obtain measures of intraspecific variability from all sources; therefore, estimates were provided as a single measure for each species. For some species and some traits, measures were provided separately for females and males. In such cases, we first obtained the mean of these two measures.
Note. Data sources may contain more traits than shown here. Tick marks in parentheses indicate that the trait was present in the data source but that another closely related trait with a better coverage was used instead. The tilde character (∼) before a tick mark indicates that we derived trophic levels from species diet.
a http://datazone.birdlife.org/home
b https://www.iucnredlist.org/resources/spatial-data-download
c http://apiv3.iucnredlist.org/api/v3/docs#general
Across sources, there were multiple traits related to each of body size and life span. For instance, body mass and/or body length information could be provided. Different proxies were also available for life span, such as the age at sexual maturity or generation length. In such cases, we focused on the trait presenting the highest coverage.
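The aggregation rule for continuous traits described in this section (sex-specific measures averaged first, repeated values removed, then the within-species median taken) can be sketched as follows. This is a minimal Python illustration with hypothetical body mass values; treating only exactly equal numbers as "repeated" is an assumption of the sketch.

```python
from statistics import mean, median
from typing import Iterable


def sex_averaged(female: float, male: float) -> float:
    """Mean of the female and male measures reported by a single source."""
    return mean([female, male])


def species_estimate(estimates: Iterable[float]) -> float:
    """Median of the distinct values reported across sources.

    Repeated values are dropped first, on the assumption that they were
    copied from the same primary source into several compilations.
    """
    distinct = set(estimates)
    return median(distinct)


# Hypothetical body mass estimates (g) for one species from four sources;
# the first source reports females and males separately.
body_mass = [sex_averaged(20.0, 24.0), 22.0, 30.0, 25.0]

# Distinct values are {22.0, 25.0, 30.0}, so the species-level estimate is 25.0.
estimate = species_estimate(body_mass)
```

Dropping duplicates before taking the median prevents a value that was propagated through several secondary compilations from dominating the species-level estimate.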

Body size
Adult body mass estimates were compiled for mammals, birds and reptiles. Body length information was compiled for amphibians, because the coverage for body length was higher than that for body mass. Body mass and body length are known to scale allometrically, although the allometric relationship differs across amphibian clades (Santini et al., 2018). In our amphibian dataset, Pearson's correlation coefficient between log(Body mass) and log(Body length) was .71 (data points shown in Supporting Information Figure S2).

Longevity
We defined longevity as the life span of an individual and maximum longevity as the longest life span reported. We used closely related traits when longevity/maximum longevity was not available or when longevity/maximum longevity had a poorer coverage than a related trait. We selected the age at sexual maturity for amphibians; Pearson's correlation coefficient between log(Age at sexual maturity) and log(Maximum longevity) was .55 (Supporting Information Figure S2). We compiled the generation length for mammals and birds. The correlation between log(Generation length) and log(Longevity) was .74 for mammals and .70 for birds (Supporting Information Figure S3). Finally, we used maximum longevity directly for reptiles.

Litter or clutch size
The number of offspring (litter size) or eggs (clutch size) was compiled directly from the sources and treated as equivalent across classes. We reported measures of central tendencies provided by the sources where applicable; otherwise, we calculated range midpoints (mean of smallest and largest reported litter/clutch sizes).

Trophic level
In all classes, species were described as omnivores, carnivores or herbivores. For reptiles and mammals, this information was compiled directly from the sources. For amphibians and birds, trophic levels were not provided. For these two classes, we inferred trophic levels from dietary information (Table 1). For birds, we used the primary diet (based on food items recorded as composing ≥50% of the diet of a species). Diet for amphibians was described without respect to the percentage use of food items; simply as a binary record of whether or not food items were used. In both cases, species recorded to only consume plant-based resources (seeds, nectar, fruit or other plant material) were classified as herbivores. Species consuming only animal resources (invertebrates or vertebrates) were classified as carnivores. Species consuming a mixture of plant and animal resources were classified as omnivores.
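The diet-based classification rule described above can be written as a small function. This is a sketch in Python; the food item labels are illustrative names chosen here and do not reproduce the column names of any source dataset.

```python
from typing import Iterable, Optional

# Illustrative labels for the food item categories named in the text.
PLANT_ITEMS = {"seeds", "nectar", "fruit", "other_plant_material"}
ANIMAL_ITEMS = {"invertebrates", "vertebrates"}


def trophic_level(diet_items: Iterable[str]) -> Optional[str]:
    """Classify a species from the set of food items it is recorded to consume.

    For birds, the input would be the items forming the primary diet;
    for amphibians, every item recorded as used.
    """
    items = set(diet_items)
    eats_plants = bool(items & PLANT_ITEMS)
    eats_animals = bool(items & ANIMAL_ITEMS)
    if eats_plants and eats_animals:
        return "omnivore"
    if eats_plants:
        return "herbivore"
    if eats_animals:
        return "carnivore"
    return None  # no dietary information: trophic level stays missing


levels = [
    trophic_level(["seeds", "fruit"]),
    trophic_level(["invertebrates"]),
    trophic_level(["nectar", "vertebrates"]),
    trophic_level([]),
]
```

Returning `None` for species without dietary records keeps missing values explicit, so they count against trait coverage instead of being silently assigned a category.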

Activity time
Species were described as being either nocturnal or non-nocturnal.
Despite a higher resolution of activity time information in some of the sources (e.g., species being described as cathemeral, crepuscular or diurnal), we adopted the classification of the source with the lowest resolution (EltonTraits; Wilman et al., 2014, for birds), in order to have consistent information across classes. As such, all species defined as diurnal, cathemeral or crepuscular were classified as non-nocturnal, as opposed to species classified as strictly nocturnal.

Habitat breadth
We used IUCN habitat data (IUCN, 2020), which describe species habitat preferences and the suitability and importance of each habitat. We defined habitat breadth as the number of habitats a species was known to use, using level 2 of the IUCN Habitat

Use of artificial habitats
For a species, we recorded whether any artificial habitat was reported to be suitable in the IUCN habitat data.
Finally, our compiled datasets contain an additional column, "Note", where we reported species found to be extinct or extinct in the wild (EW). We used species Red List status and information from Meiri (2018) to flag such species. We reported 75 extinct/EW species for mammals, 160 for birds, 34 for amphibians and 53 for reptiles. It is likely that our datasets contain extinct species that we could not flag, because they were not recorded as extinct in the sources we used.
FIGURE 1 Procedure used to identify the accepted names of species. We extracted, where possible, the accepted names of species from either the Red List or the Integrated Taxonomic Information System (ITIS).

| Species distributions
We obtained extent-of-occurrence distribution maps for reptiles from Roll et al. (2017) Figure S4). Decreases in range sizes were observed after cutting distribution maps by the known elevational limits (Supporting Information Figure S5).

| Investigating gaps and biases in trait data
We used trait coverage and completeness to investigate taxonomic, phylogenetic and spatial biases in the trait data. Table 2 summarizes the sample sizes (number of species) in each of the following analyses. Note that species for which completeness was 0% were included in all analyses (for more details, see Figure 2). Also note that we did not filter out species identified as extinct or extinct in the wild, because they represented a small proportion of the datasets (.48% for amphibians, 1.4% for both birds and mammals, and .50% for reptiles) and also because we could not exclude such species systematically, because it is likely that we did not flag them all.

| Taxonomic biases
We tested whether completeness varied across taxonomic classes using pairwise Wilcoxon rank sum tests. We assessed the extent and performance of our taxonomic corrections by comparing trait coverage with and without taxonomic corrections applied (Supporting Information Figure S6).

| Phylogenetic biases
Initially, to assess whether more closely related species were more likely to be similar in trait completeness, we estimated the phylogenetic signal in completeness with Pagel's λ (Pagel, 1999) in each class.
We used a bootstrapping approach, calculating λ for each of 50 trees randomly sampled in each class (using the phylosig function of the phytools R package; Revell, 2012). We then estimated the mean and 95% confidence intervals (95% CIs) of λ. Sample sizes for computing λ (number of species represented in both the phylogenies and trait datasets) are shown in Table 2.
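Summarizing λ across the sampled trees can be sketched as below. This is a minimal Python illustration under explicit assumptions: the λ values are hypothetical, and the 95% CI is computed with a normal approximation (mean ± 1.96 × standard error), which is one common choice and not necessarily the calculation used in the study.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Mean and normal-approximation 95% confidence interval of the mean."""
    m = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    half_width = 1.96 * standard_error
    return m, m - half_width, m + half_width


# Hypothetical lambda estimates, one per sampled tree (the study used 50 trees).
lambdas = [0.70, 0.72, 0.71, 0.69, 0.73]

lambda_mean, ci_low, ci_high = mean_and_ci95(lambdas)
```

With 50 trees per class, the interval reflects uncertainty attributable to phylogenetic tree choice, not sampling uncertainty in the trait data themselves.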
We then plotted within-family median completeness in phylogenetic trees built at the family level, using the consensus trees.
Within-family median completeness was calculated using taxonomic information in the trait datasets (sample sizes shown in Table 2).
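Within-family median completeness amounts to a grouped median. The Python sketch below uses the two Allophrynidae completeness values reported in the Results (14% and 86%) together with a hypothetical second family; the record layout is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def family_median_completeness(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Median completeness across species, grouped by taxonomic family.

    Each record is a (family, species-level completeness) pair.
    """
    by_family: Dict[str, List[float]] = defaultdict(list)
    for family, completeness in records:
        by_family[family].append(completeness)
    return {family: median(values) for family, values in by_family.items()}


records = [
    ("Allophrynidae", 0.14),  # values reported in the Results
    ("Allophrynidae", 0.86),
    ("Hypothetical family", 1.00),
    ("Hypothetical family", 1.00),
    ("Hypothetical family", 0.57),
]

family_medians = family_median_completeness(records)
```

The Allophrynidae case shows why the median alone can mislead: a family-level value of about 50% hides two species with very different completeness.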

| Spatial biases
We first investigated whether wider-ranging species were more likely to be better sampled than narrow-ranging species. We tested for a relationship between species range size and trait completeness.
We fitted a generalized linear model with a Poisson error distribution [directly using the number of sampled traits, "N traits", rather than the proportion (completeness)]. Class was added as a predictor interacting with range size.
All species represented in the trait datasets were included in (a). All species from the class-specific phylogenetic trees or from the distribution maps that matched with species in the trait datasets were included in (b) and (c).
Second, we mapped assemblage-level median completeness.
Assemblages were characterized at the pixel level at 50 km² resolution. We determined pixel-level composition and richness by stacking species geographical distributions. We then calculated median completeness across species in each pixel. We show the resulting maps for herptiles in the main text, and for mammals and birds in Supporting Information Figure S7 (median completeness was very high across most pixels for mammals and birds). In addition, we provide maps of assemblage-level mean completeness and standard deviation for all classes in the Supporting Information (Figures S8 and S9 show maps; Figure S10 shows standard deviation against species richness).
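The stacking step can be sketched in a few lines of Python. The sketch assumes that each species range has already been rasterized to a set of pixel identifiers; the species, pixel identifiers and completeness values are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Set, Tuple


def assemblage_summary(
    ranges: Dict[str, Set[str]], completeness: Dict[str, float]
) -> Dict[str, Tuple[int, float]]:
    """Stack species ranges and return, for each pixel,
    (species richness, median completeness across the species present)."""
    per_pixel: Dict[str, List[float]] = defaultdict(list)
    for species, pixels in ranges.items():
        for pixel in pixels:
            per_pixel[pixel].append(completeness[species])
    return {pixel: (len(values), median(values)) for pixel, values in per_pixel.items()}


# Hypothetical rasterized ranges: species -> set of pixel identifiers.
ranges = {"sp1": {"p1", "p2"}, "sp2": {"p1"}, "sp3": {"p1"}}
completeness_by_species = {"sp1": 1.0, "sp2": 0.5, "sp3": 0.0}

summary = assemblage_summary(ranges, completeness_by_species)
```

Richness and median completeness come out of the same pass over the stacked ranges, which is what allows the two to be compared pixel by pixel.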
We then tested for a spatial correlation between species richness and median completeness. Given that median completeness was very high across most pixels for mammals and birds, we fitted such models for herptiles only. We fitted spatial autoregressive lag models to explain assemblage-level median completeness as a function of species richness. The spatial weights, W, were estimated using the functions tri2nb and nb2listw of the spdep package (Bivand, Pebesma, et al., 2013; Bivand & Wong, 2018). Fitting the model using all grid cells was computationally intractable; therefore, we randomly sampled cells for this analysis (using 30% of the grid cells in each realm). We selected grid cells where species richness was higher than three to avoid sampling issues.
We fitted separate models for amphibians and reptiles, because when adding class as an interacting predictor, the same cells (with the same coordinates) might be sampled for multiple classes, whereas the tri2nb function does not tolerate duplicated coordinates.
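For reference, the two model families named in this section can be written in generic textbook form. This is a sketch only: the symbols are introduced here for illustration, a log link is assumed for the Poisson model, and the exact specification used in the analysis (for example, any transformation of range size or additional predictors) is not stated in the text above.

```latex
% Poisson GLM: N_i is the number of sampled traits for species i,
% R_i its range size, and c(i) its class; class interacts with range size.
N_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 R_i + \gamma_{c(i)} + \delta_{c(i)} R_i

% Spatial autoregressive lag model: y is the vector of assemblage-level
% median completeness, s the vector of species richness, W the spatial
% weights matrix and rho the spatial autoregressive coefficient.
\mathbf{y} = \rho \mathbf{W} \mathbf{y} + \beta_0 \mathbf{1} + \beta_1 \mathbf{s} + \boldsymbol{\varepsilon}
```

In the lag model, the term ρWy lets completeness in a cell depend on completeness in neighbouring cells, so the richness coefficient describes the trend after accounting for spatial autocorrelation.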

| Taxonomic biases in trait information
Trait coverage for mammals and birds was high overall (Figure 2a; mean and median coverage across traits: 89% and 95% for mammals; 84% and 85% for birds). In both cases, litter/clutch size was the trait with the poorest coverage (61% for mammals and 59% for birds).
Coverage exceeded 80% for all other traits (except trophic level for birds, at 75% coverage).
Here, we show the distribution of completeness. Continuous lines represent the mean trait completeness for each class, whereas dashed lines represent the median trait completeness. Note that there were species with 0% completeness (230 species for amphibians, 3.3% of amphibian species in our trait data; nine for birds, .077% of species; seven for mammals, .13% of species; and 161 for reptiles, 1.5% of species). Species with 0% completeness were retained in the datasets when there was information for traits we did not select in the analyses but no known value for the traits we did select. For instance, the body mass of the amphibian species Rhinella centralis was known, but other trait values (including body length) were missing, meaning that Rhinella centralis had 0% completeness for the set of traits we considered.

| Phylogenetic biases in trait completeness
As expected from the distribution of trait completeness in mammals and birds (Figure 2), within-family median trait completeness was high across most tips of the phylogenetic trees (Supporting Information Figures S11 and S12; we present the avian and mammalian phylogenies in the Supporting Information because there was little variation in completeness across tips). For birds, λ was .71 (± .0053). For mammals, λ was .78 (± .0035). This indicated that, despite completeness generally being high across tips, the sampling was not evenly distributed across the phylogeny. It is important to underline that Figure 3 shows within-family median completeness, masking the considerable variation in species richness across families, hence masking potential important variation in completeness across species within families. For example, in the amphibian family Allophrynidae (three recognized species), the within-family median completeness was 50%; but our dataset comprised two species of completeness 14% and 86%, respectively. We present similar plots to those in Figure 3 showing the within-family standard deviation in completeness in the Supporting Information (Figure S13). Within-family standard deviation tended to increase with within-family species richness (Supporting Information Figure S14).

| Spatial biases in trait completeness
Range size was significantly correlated with the number of sampled traits. Larger range sizes were associated with a higher number of sampled traits (i.e., with higher completeness; Figure 4; Supporting Information Table S1). Similar results were obtained when using distribution maps not cut by elevational limits (Supporting Information Table S2; Figure S15). The rate of increase was steepest for reptiles, then for amphibians, then for birds and mammals (slope estimates for birds and mammals were not significantly different from each other; Supporting Information Table S1).
There were marked spatial variations in median trait completeness in herptiles (Figure 5). North America and Europe were well

| Discussion
Our work illustrates the taxonomic, spatial and phylogenetic dimensions of the knowledge gaps in trait data, termed the Raunkiaeran shortfall by Hortal et al. (2015). To the best of our knowledge, this study constitutes the first comparative assessment of global gaps for terrestrial vertebrate trait data, despite their use in numerous studies. We showed that the trait data present important taxonomic, spatial and phylogenetic biases, with contrasts in the availability of trait information between, on the one hand, herptiles and, on the other hand, birds and mammals.
Birds and mammals are globally well sampled for the set of traits we considered, even in the most species-rich assemblages.
By contrast, the availability of trait information for herptiles is lower overall and phylogenetically and geographically biased. Several factors could interplay to shape these patterns. For instance, species that are more easily detectable (for example, wider ranging) and more charismatic are likely to be better sampled. Diverse socio-economic predictors could also contribute to geographical biases in trait data sampling; global biases in primary data collection are likely to be one of the most important contributors to the patterns we observe.
FIGURE 3 Within-family median trait completeness in herptiles. The number next to each family name represents the number of species included in the calculation of the median.
Nevertheless, biases in the data could have been introduced at later stages, notably with the selection of sources and traits. Our global compilation reflects, in part, the interest and focus of the secondary data sources we used. It is possible that the addition of new sources from regional journals or other authorities could diminish spatial biases in the data by increasing coverage for certain areas.
Nevertheless, we argue that by focusing on widely used traits, our results are likely to reflect the "true" availability of the data in primary sources and that the shortfalls for other, less used traits would be more pronounced.
We believe that our results are robust to taxonomic uncertainty, although taxonomic matching might potentially be improved further using class-specific sources, such as the Reptile Database or AmphibiaWeb, for identification of synonyms (but see Supporting Information Appendix S9, Figure S16). We have made two versions of our data compilations available, one in which our own corrections were applied and one using the original binomial names of the sources, meaning that users are free to use their own taxonomic backbones and identify synonyms within the compilations.
We believe that taxonomic matching is a recurring issue when working across thousands of species. Taxonomic synonymy artefactually inflates the numbers of identified species, potentially lowering trait coverage (whereas clumping subspecies together can have the opposite effect). Tackling this problem is difficult (Isaac et al., 2004; Jones et al., 2012), notably because there is no global curated database recording the status of species names, and also because of the nature of taxonomy and the debates around the species concept (May, 2011). Nevertheless, taxonomic uncertainty can have important consequences. For instance, Cardoso et al. (2017) showed that inaccuracies and errors in species checklists contributed to the overestimation of plant diversity in the Amazon.

FIGURE 4 Relationship between number of sampled traits and geographical range size. Models were fitted using a Poisson error distribution. Class was added as a predictor interacting with range size. Rates of increase were not significantly different for mammals and birds but differed for reptiles and amphibians, with the steepest rates of increase for reptiles.
Narrow-ranging species, for which fewer trait data are available on average, have higher extinction risks (Collen et al., 2016; Purvis et al., 2000; Ripple et al., 2017) and are more negatively impacted by anthropogenic pressures (Newbold et al., 2018) than wider-ranging species. Trait information is also less available for herptiles in tropical regions such as the Congo basin, Southeast Asia and South America, which are some of the most diverse areas of crucial importance for worldwide conservation (Barlow et al., 2018). Consequently, trait information is on average less available where it is potentially more crucial to conservation planning. Indeed, trait information can be incorporated into vulnerability assessments and, as such, can help to prioritize conservation efforts. Species traits have been found to mediate species responses to environmental changes across diverse taxonomic groups, and thus can inform on the sensitivity of species to anthropogenic pressures (Flynn et al., 2009; Newbold et al., 2013; Nowakowski et al., 2017). Traits are now commonly used to estimate species vulnerability or extinction risks (Pacifici et al., 2015; Ramírez-Bautista et al., 2020). As opposed to trend-based approaches, which rely on historical population trends (changes in abundance or shifts in distributions) to predict species' vulnerability and extinction risks, trait-based approaches rely on species' intrinsic sensitivity to particular threats. The appeal of trait-based approaches to extinction risk estimation is that, by providing mechanistic insights, they diminish the amount of population information needed.
If the responses of species to a threat consistently relate to certain traits, it is possible to generalize patterns across species for which population data are less available (Verberk et al., 2013). Integrating traits into vulnerability assessments is hence of particular interest when field monitoring of species population sizes or distributions is difficult to achieve, but biases in the data could mean that such information is lacking for some of the most vulnerable species.
Traits that influence species responses to environmental changes have been termed "response traits" (or "response-mediating traits"; Luck et al., 2012), as opposed to "effect traits" that underpin ecosystem functioning (Lavorel & Garnier, 2002). For instance, relative brain size and longevity have been characterized as response traits in birds (Newbold et al., 2013; Sayol et al., 2020), whereas dietary characteristics (e.g., trophic levels or guilds) are both response and effect traits. Hortal et al. (2015) highlighted that, for plants, both response and effect traits have been investigated, whereas for vertebrates the research has been more focused on understanding species responses. This could be because the way vertebrate traits interact to shape some ecosystem processes has not yet been characterized well.
Ecosystem processes sustained by animals might be harder to quantify and might be influenced by a combination of traits. The traits compiled in this work are likely to have a role in diverse processes. Nevertheless, there was one important omission, in that we did not compile species diet, potentially the most straightforward trait to link with diverse processes, such as grazing, pollination, scavenging and seed dispersal. From a practical perspective, we chose traits that had been estimated at least for some of the species in each class, and that were readily available. Diet was excluded because although estimates were available for amphibians, birds and mammals, there was no readily available database for reptilian diet.
Movement or dispersal abilities were also excluded because information was not readily available for any class. Although we expect that species diet and dispersal abilities would present similar sampling biases to the ones presented in this work, the addition of such traits to the compilation would represent a valuable contribution.
FIGURE 6 Spatial model trends for herptiles. The lines represent in-sample predictions (± SE) for the trend components of the spatial models (trends after accounting for spatial autocorrelation).
For practical reasons, we did not consider intraspecific trait variation. Intraspecific variation has been shown to have important effects on ecological systems, and a growing body of literature encourages trait-based research to include intraspecific variability (Guralnick et al., 2016). There have been several calls to produce open-access, global trait datasets (Weiss & Ray, 2019), including a representation of intraspecific trait variation (Kissling et al., 2018).
Notably, Schneider et al. (2019) designed a framework to store and share inter- and intraspecific trait data, accompanied by an R package to standardize the data in a proposed format. Such a proposition could constitute an important step towards the unification of individual datasets into a single, comprehensive database for ecological trait data.
The current spatial and taxonomic gaps in trait data might limit our ability to scale studies up, whereas biases in the data can affect the validity of extrapolations to groups or areas that are undersampled. More generally, biases and gaps in biodiversity data can have important implications for ecological studies. Data gaps can hinder our ability to draw conclusions on observed macroecological patterns. For example, Chaudhary et al. (2016) proposed that marine species richness follows a bimodal distribution, peaking at mid-latitudinal locations, and argued that these patterns were not underpinned by knowledge gaps in species distributions. In contrast, Menegotto and Rangel (2018) attributed the tropical dip in marine species richness to a lack of species distribution data, explained by lower sampling efforts in tropical areas (the "Wallacean" shortfall; Hortal et al., 2015). Biases and gaps in trait data could also affect studies in closely related fields, such as functional ecology, where past studies have shown that functional diversity indices are sensitive to missing data (Májeková et al., 2016; Pakeman, 2014), or community assembly (Perronne et al., 2017).
Ecologists should, therefore, take particular care when designing trait-based studies, because both data quality and data gaps are likely to influence the results and the generality of the conclusions. There exist diverse methods to deal with missing trait values, should data missingness be problematic. Complete removal of missing values ("case deletion") is commonly used but presents several issues, because it reduces sample size and statistical power and introduces potential bias in data subsamples (Nakagawa & Freckleton, 2008). For example, retaining complete cases only from our trait datasets would generate trait data disproportionately representative of mammals and birds, which would be problematic for conducting cross-taxon analysis on terrestrial vertebrates. As such, it is recommended that case deletion be applied only when data are missing completely at random, which is rarely the case (Peugh & Enders, 2004). Alternatively, missing values can be imputed. Available techniques include PhyloPars (Bruggeman et al., 2009), random forest algorithms as implemented in R with missForest (Stekhoven, 2016; Stekhoven & Bühlmann, 2012), multivariate imputation by chained equations (MICE; Van Buuren & Groothuis-Oudshoorn, 2011) and k-nearest neighbour (kNN; Troyanskaya et al., 2001). Penone et al. (2014) introduced missing values (10%-80%) in a complete trait dataset of carnivorans and measured imputation performance in different scenarios. Given that phylogenetic non-randomness in missing trait values can impact imputation accuracy, Penone et al. (2014) removed values in three different ways (completely at random; with a phylogenetic bias; and with a body mass bias). Out of the four techniques, missForest and PhyloPars performed best when species phylogenetic position was included as a predictor of missing trait values. Such imputations appeared to be robust even when trait coverage was as low as 40%, which might be relevant for many reptilian and amphibian traits. The performance was not significantly affected by phylogenetic non-randomness of the data.
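The effect of the missingness mechanism on case deletion can be illustrated with a small simulation. The sketch below is not the procedure of Penone et al. (2014); it uses simulated log body masses for hypothetical species, and the function names (`remove_values`, `complete_case_mean`) and all parameter values are assumptions chosen for illustration. It removes 60% of values either completely at random or with a bias against small-bodied species, then compares the complete-case mean with the true mean.

```python
import math
import random
import statistics

def remove_values(values, fraction, weights=None, seed=0):
    """Return a copy of `values` with `fraction` of the entries set to None.

    weights=None  -> every entry is equally likely to go missing (MCAR).
    weights=[...] -> entries with larger (positive) weights are more likely
                     to go missing (biased removal).
    """
    rng = random.Random(seed)
    n_remove = int(round(fraction * len(values)))
    indices = list(range(len(values)))
    if weights is None:
        drop = set(rng.sample(indices, n_remove))
    else:
        # Weighted sampling without replacement: draw key u**(1/w) for each
        # entry and drop the entries with the largest keys.
        keys = {i: rng.random() ** (1.0 / weights[i]) for i in indices}
        drop = set(sorted(indices, key=keys.get, reverse=True)[:n_remove])
    return [None if i in drop else v for i, v in enumerate(values)]

def complete_case_mean(values):
    """Mean of the observed (non-missing) values, i.e. case deletion."""
    return statistics.fmean(v for v in values if v is not None)

# Simulated log body mass for 2000 hypothetical species (assumed values).
rng = random.Random(42)
log_mass = [rng.gauss(3.0, 1.5) for _ in range(2000)]

# Assumed bias: smaller-bodied species are more likely to lack data.
bias_weights = [math.exp(-m) for m in log_mass]

true_mean = statistics.fmean(log_mass)
mcar_mean = complete_case_mean(remove_values(log_mass, 0.6))
biased_mean = complete_case_mean(remove_values(log_mass, 0.6, bias_weights))

print(f"true mean:              {true_mean:.2f}")
print(f"complete cases, MCAR:   {mcar_mean:.2f}")
print(f"complete cases, biased: {biased_mean:.2f}")
```

Under these assumptions, the complete-case mean stays close to the true mean when values are removed completely at random, but is shifted upwards when small-bodied species are preferentially missing, which is the situation in which case deletion becomes problematic.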
Hence, missForest and PhyloPars appear to be well suited when traits are phylogenetically conserved, because they allow species phylogenetic position to be included as a predictor of missing trait values. The study by Penone et al. (2014) highlights that there are robust imputation techniques allowing us to deal with incomplete trait data where biases might otherwise be problematic. Nevertheless, it is important to highlight that some imputation techniques, such as single or mean imputation, can be problematic because they do not allow an estimation of uncertainty and suffer from a lack of accuracy (Nakagawa & Freckleton, 2008); indeed, imputation techniques sometimes perform no better than case deletion. We believe that more work should be conducted to assess imputation performance in various contexts, and our compiled datasets might provide an opportunity for such studies.
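One reason why mean imputation is problematic can be shown with a minimal sketch: filling every gap with the observed mean leaves the mean unchanged but shrinks the spread of the trait, so the imputed dataset understates variability and carries no measure of uncertainty. The data below are simulated and the helper `mean_impute` is hypothetical; this is an illustration under those assumptions rather than an analysis of the compiled datasets.

```python
import math
import random
import statistics

def mean_impute(values):
    """Replace every missing value (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# Simulated standardized trait for 1000 hypothetical species.
rng = random.Random(1)
complete = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Remove 60% of values completely at random (40% trait coverage).
missing_idx = set(rng.sample(range(1000), 600))
with_gaps = [None if i in missing_idx else v for i, v in enumerate(complete)]

imputed = mean_impute(with_gaps)
observed = [v for v in with_gaps if v is not None]

sd_true = statistics.pstdev(complete)
sd_observed = statistics.pstdev(observed)
sd_imputed = statistics.pstdev(imputed)

# With a fraction p of values observed, the standard deviation after mean
# imputation equals sqrt(p) times the standard deviation of observed values.
expected_sd = math.sqrt(len(observed) / len(complete)) * sd_observed

print(f"SD of complete data:      {sd_true:.3f}")
print(f"SD after mean imputation: {sd_imputed:.3f}")
```

With 40% coverage, the standard deviation after mean imputation is about sqrt(0.4), or roughly 63%, of the standard deviation of the observed values, even though the data here are missing completely at random.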
Although robust imputation techniques can be useful for filling gaps in trait datasets, they are no substitute for continued data collection efforts. Our results show that data are particularly lacking in herptiles, notably in the Afrotropics, the Neotropics and the Indo-Malayan realms. For these areas, incorporating regional databases into existing datasets could contribute to the reduction of global gaps. We believe that both primary research and subsequent efforts to integrate new data and existing databases are required if we are to strive towards the unification of trait databases.
To conclude, this work constitutes, to our knowledge, the first assessment of the global gaps and biases in terrestrial vertebrate trait information. We show that herptiles are undersampled compared with mammals and birds, with important spatial and phylogenetic variability in the availability of trait information. Imputation techniques are one possible solution to these problems. Nevertheless, we believe that primary research, combined with efforts to complete existing datasets, is the only way to fill the current data gaps genuinely and robustly. We hope that the compiled trait dataset and our findings can prove useful for guiding further data collection efforts and for conducting macroecological analyses.

ACKNOWLEDGMENTS
We thank Stuart Butchart for contributing data and for his feedback on an earlier version of the manuscript. Many thanks to Alex Pigot and Richard Pearson for their valuable input, which helped to improve this manuscript. This work was supported by a Royal Society University Research Fellowship to T.N. and a Royal Society Research Grant to T.N., which supports the PhD studentship of A.E.

DATA AVAILABILITY STATEMENT
The data used in this work were collected from freely accessible sources. Each of these sources is cited in the manuscript. The data compiled in this work were made available on figshare (https://doi.org/10.6084/m9.figshare.10075421).