Establishing macroecological trait datasets: digitalization, extrapolation, and validation of diet preferences in terrestrial mammals worldwide

Ecological trait data are essential for understanding the broad-scale distribution of biodiversity and its response to global change. For animals, diet represents a fundamental aspect of species’ evolutionary adaptations, ecological and functional roles, and trophic interactions. However, the importance of diet for macroevolutionary and macroecological dynamics remains little explored, partly because of the lack of comprehensive trait datasets. We compiled and evaluated a comprehensive global dataset of diet preferences of mammals (“MammalDIET”). Diet information was digitized from two global and cladewide data sources and errors of data entry by multiple data recorders were assessed. We then developed a hierarchical extrapolation procedure to fill-in diet information for species with missing information. Missing data were extrapolated with information from other taxonomic levels (genus, other species within the same genus, or family) and this extrapolation was subsequently validated both internally (with a jack-knife approach applied to the compiled species-level diet data) and externally (using independent species-level diet information from a comprehensive continentwide data source). Finally, we grouped mammal species into trophic levels and dietary guilds, and their species richness as well as their proportion of total richness were mapped at a global scale for those diet categories with good validation results. The success rate of correctly digitizing data was 94%, indicating that the consistency in data entry among multiple recorders was high. Data sources provided species-level diet information for a total of 2033 species (38% of all 5364 terrestrial mammal species, based on the IUCN taxonomy). For the remaining 3331 species, diet information was mostly extrapolated from genus-level diet information (48% of all terrestrial mammal species), and only rarely from other species within the same genus (6%) or from family level (8%). 
Internal and external validation showed that: (1) extrapolations were most reliable for primary food items; (2) several diet categories (“Animal”, “Mammal”, “Invertebrate”, “Plant”, “Seed”, “Fruit”, and “Leaf”) had high proportions of correctly predicted diet ranks; and (3) the potential of correctly extrapolating specific diet categories varied both within and among clades. Global maps of species richness and proportion showed congruence among trophic levels, but also substantial discrepancies between dietary guilds. MammalDIET provides a comprehensive, unique and freely available dataset on diet preferences for all terrestrial mammals worldwide. It enables broad-scale analyses for specific trophic levels and dietary guilds, and a first assessment of trait conservatism in mammalian diet preferences at a global scale. The digitalization, extrapolation and validation procedures could be transferable to other trait data and taxa.


Introduction
With the emergence of the macroecological research field (Brown and Maurer 1989), an increasing interest has developed in compiling comprehensive data on the geographic distribution of life on Earth. For instance, broad-scale datasets on species distributions, phylogenies, and ecological or life-history traits are now increasingly becoming electronically available, at least for some vertebrate groups such as birds and mammals (Bininda-Emonds et al. 2008; Jones et al. 2009; BirdLife International & NatureServe 2011; Jetz et al. 2012; IUCN 2013). However, compiling ecological trait data for species-rich clades is challenging and time-consuming, and many individual researchers lack the resources and time to compile such comprehensive datasets. Moreover, ecological trait data are often incomplete, even for well-known and well-studied clades (Jones et al. 2009) or for species in well-surveyed regions (Tyler et al. 2012). Yet these trait data are essential for better understanding macroecological patterns (MacArthur 1972; Kissling et al. 2012; Barnagaud et al. 2014), evolutionary history (Cantalapiedra et al. 2014; Morlon 2014), or biodiversity and ecosystem functioning (McGill et al. 2006; Safi et al. 2011). Hence, new approaches are needed to get a better coverage of missing trait data, e.g. by "filling in" missing data with predicted values based on species for which trait data are available (Shan et al. 2012).
Diet represents a fundamental aspect of a species' ecological niche (Simberloff and Dayan 1991). It constrains metabolic rates of organisms (Brown et al. 2004) and defines the functional roles and trophic interactions of species in ecosystems (Duffy 2002). Diet preferences can be important for understanding diversification (Price et al. 2012; Cantalapiedra et al. 2014), macroecological distributions (Kissling et al. 2012), as well as character displacement and evolutionary divergence of species (Grant and Grant 2006; Meiri et al. 2007). More generally, diet preferences have played an important role in understanding the ecology and evolution of communities (Hutchinson 1959; Cody and Diamond 1975; Burness et al. 2001). Nevertheless, only a few studies have examined latitudinal, environmental and biogeographic variation of diet preferences at a global scale (Hillebrand 2004; Primack and Corlett 2005; Kissling et al. 2009, 2012; Sandom et al. 2013; Barnagaud et al. 2014). Moreover, macroevolutionary studies have rarely integrated diet preferences or other trait data across species-rich clades (Morlon 2014). Hence, the importance of diet for macroevolutionary and macroecological dynamics and the structure and functioning of ecosystems worldwide remains little explored.
Mammals are a diverse vertebrate group whose species have colonized nearly all parts of the world. Mammalian species show a wide range of diet preferences (Fig. 1) which is partly related to their dental diversity (Price et al. 2012). Data on global species distributions (IUCN 2013) and phylogenetic relationships (e.g., Bininda-Emonds et al. 2008;Fritz et al. 2009) of mammals have recently become available and numerous ecological adaptations and life-history traits have been described in the literature (e.g., Nowak 1999;Smith et al. 2003;Jones et al. 2009;Qian et al. 2009;IUCN 2013). However, current datasets on ecological traits of mammals are incomplete and do not provide data for all mammals worldwide (e.g., Smith et al. 2003;Jones et al. 2009;Safi et al. 2011;Price et al. 2012). Available datasets on diet preferences of mammals are either restricted to small subsets of species (e.g., Cantalapiedra et al. 2014) or cover around 30-40% of the species (e.g., Jones et al. 2009;Price et al. 2012) and typically only allow categorizing species into three predefined trophic levels (carnivores, omnivores, herbivores). This forces researchers to limit their investigation to the best-known subset of taxa and to a few broad diet adaptations. Moreover, the deletion of missing values (or the use of incomplete datasets) reduces the power of statistical inference and might increase estimation bias (Nakagawa and Freckleton 2008). Hence, available data on key mammalian traits such as diet require additional efforts to achieve broader taxonomic coverage and finer ecological detail.
Here, we compiled and evaluated a global diet dataset for terrestrial mammals (referred to as "MammalDIET"; for general information see Table 1). We first digitized diet information from two comprehensive, global and cladewide data sources that provide a relatively standardized way of presenting mammalian diet information (Nowak 1999; IUCN 2013). We then quantified the consistency of data entry by multiple data recorders and developed an extrapolation procedure to fill in missing diet information at the species level. Extrapolation was performed by using available diet knowledge from other species or other taxonomic levels (genus, family). We then validated the extrapolation procedure (both internally and externally) to identify the most reliable diet categories for classifying mammal species into trophic levels and dietary guilds. Finally, the frequency of different trophic levels and dietary guilds within mammalian families and orders was quantified and their species richness and proportion were mapped at a global scale. With the developed methodological framework (summarized as a flowchart in Fig. 2), we estimated diet preferences for nearly all terrestrial mammal species worldwide. We further provide MammalDIET as a freely available resource to enable macroecological and macroevolutionary analyses, and we encourage researchers to use, test, apply, and refine this dataset in the future.

Digitalization of data
To compile diet information of mammals ("trait information" in Fig. 2), we used two key data sources on diet preferences of mammals worldwide (Nowak 1999; IUCN 2013). We focused on these two data sources because they contain global and cladewide knowledge on mammalian diets and because they allow a reasonably homogeneous and standardized way of recording summary knowledge of mammalian diets. Diet information was first digitized from Nowak (1999) during 2011-2012, and additional information was added from IUCN (2013) during 2013 for species that had no species-level data from Nowak (1999). In all cases, we used the IUCN taxonomy as the reference taxonomy (IUCN 2013) and searched for synonyms where names differed between sources. We excluded marine families, but included all terrestrial mammal species (n = 5364). We converted written text descriptions (for examples see Table 2) of diet preferences from the two literature sources into ordinal data (ranks 1-3). In cases where the text did not allow inferring the relative importance of diet categories, we entered rank 1 for each of them, assuming that these food items were equally important. In a few cases, the IUCN (2013) data source described species only in general terms (for example, as carnivorous, herbivorous, frugivorous, or omnivorous); for these, we recorded rank 1 in the corresponding diet categories. A zero (rank 0) was assigned if a specific diet category was not recorded in the literature for a given taxon. Moreover, we took a conservative approach and recorded diet information only at the specific taxonomic level of the original data source, that is, at species, genus, or family level. The majority of diet information from Nowak (1999) was available at the genus level, and hence, we digitized this information at the genus level, not at the species level, even if species within a genus are likely to have the same diets. In contrast, the IUCN (2013) data were almost exclusively available as species-level information.
[Table caption fragment, beginning lost in extraction: ... (Nowak 1999; IUCN 2013). For data entry into MammalDIET, text descriptions were converted into ordinal data (rank 1-3).]
For digitizing the data ("digitalization" in Fig. 2), a total of sixteen diet categories at four hierarchical levels were distinguished (Fig. 3). At the first and coarsest level, we distinguished between "Animal" and "Plant." At the second level, the animal category was subdivided into "Vertebrate" and "Invertebrate." At the third level, the plant category was subdivided into "Seed", "Fruit", "Nectar", "Root", "Leaf", and "Other" material, and the vertebrate category was subdivided into "Mammal", "Bird", "Herptile" (amphibians and reptiles), and "Fish". At the fourth and finest level, we subdivided the leaf category into leaves from woody plants ("Woody") and leaves from herbaceous plants ("Herbaceous"). This was carried out to allow the division of mammalian herbivores into browsers and grazers. For all data entry, we recorded diet preferences down to the finest diet categories possible.
At the beginning of the data digitalization process, we tested how well diet descriptions from the data sources could be converted into diet ranks in MammalDIET. To ensure the consistency of data entry by multiple data recorders (all authors except J.-C.S.), we randomly selected 20 mammal species from Nowak (1999) before compiling the data and each data recorder then ranked diet descriptions from the source for the same sample species ("calibration" in right-hand side of Fig. 2). Discrepancies in data entries between data recorders were subsequently discussed among all persons to minimize errors of the digitalization process, that is, when transferring written diet descriptions from the sources into an ordinal scale in MammalDIET. Such a calibration step was used to standardize the digitizing of data by multiple recorders. After the data from Nowak (1999) had been assembled, an additional test ("quality check" in right-hand side of Fig. 2) was performed based on 120 randomly selected species out of those species for which diet data from Nowak (1999) were available at the species level (n = 682). The recorders re-entered data by transferring written descriptions from Nowak (1999) to diet ranks in MammalDIET and then calculated the percentage of correctly classified diet ranks for all diet categories across the 120 species. This allowed assessing the error rate due to data entry via multiple data recorders.
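The quality check described above amounts to a per-category agreement rate between the originally entered ranks and the re-entered ranks. The following is a minimal Python sketch, assuming ranks are stored as nested dictionaries (species, then diet category, then rank). The function name, the data layout, and the toy species are illustrative assumptions and are not part of MammalDIET.

```python
def agreement_by_category(original, reentered):
    """Percentage of species whose re-entered rank equals the original rank,
    computed separately for each diet category."""
    categories = next(iter(original.values())).keys()
    rates = {}
    for cat in categories:
        matches = sum(
            1 for sp in original if original[sp][cat] == reentered[sp][cat]
        )
        rates[cat] = 100.0 * matches / len(original)
    return rates


# Toy data (hypothetical species and ranks, for illustration only).
original = {
    "sp_a": {"Animal": 1, "Plant": 2},
    "sp_b": {"Animal": 0, "Plant": 1},
    "sp_c": {"Animal": 1, "Plant": 0},
    "sp_d": {"Animal": 3, "Plant": 1},
}
reentered = {
    "sp_a": {"Animal": 1, "Plant": 2},
    "sp_b": {"Animal": 0, "Plant": 1},
    "sp_c": {"Animal": 1, "Plant": 3},  # one discrepancy in "Plant"
    "sp_d": {"Animal": 3, "Plant": 1},
}

rates = agreement_by_category(original, reentered)
print(rates)
```

Averaging the per-category rates would give an overall figure comparable in spirit to the overall mean reported in the results, although the exact averaging used by the authors is not specified here.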

Extrapolation of diet ranks
Not all diet information was available at the species level. Some information, especially in Nowak (1999), was only available at the genus or, more rarely, family level. We therefore developed a procedure to extrapolate diet information from other species or higher taxonomic levels (genus, family) to species without diet information ("extrapolation" in Fig. 2). This extrapolation procedure assumed some degree of phylogenetic conservatism in diet preferences, at least for the recorded diet categories and at the taxonomic levels applied. The different approaches to data extrapolation are explained in detail later. Information on how diet data were extrapolated is also provided for each species in Appendix Table S1 (cf. variable "FillCode").
No data extrapolation was necessary for those species that already had species-level information from the two data sources (FillCode = 0). For the other species, data extrapolation was performed hierarchically. First, diet data were filled from the genus level (FillCode = 1), then from other species within the same genus (FillCode = 2.1 or 2.2) and finally from the family level (FillCode = 3). We distinguished two ways of data filling from other species within the same genus. First, if only one species in the genus had data, we applied this information to our missing species (FillCode = 2.1). Second, if more than one species in the genus had data, we assigned the diet information for each category to the missing species if all species had the same information in that category (FillCode = 2.2), that is, we only extrapolated information that was consistent among congeneric species. Otherwise data were assigned as not available ("NA"). This ensured a rather conservative way of extrapolating diet information to the species level.
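The hierarchical filling order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the dictionary layout, the use of `None` for "NA", the restriction to two diet categories, and all taxon names are hypothetical.

```python
CATEGORIES = ("Animal", "Plant")  # reduced to two categories for brevity


def extrapolate(species, taxonomy, species_data, genus_data, family_data):
    """Return (fill_code, diet) for one species.

    Order: species-level data (0), genus-level data (1), congeneric
    species (2.1 or 2.2), family-level data (3). None stands for "NA".
    """
    if species in species_data:
        return "0", dict(species_data[species])
    genus, family = taxonomy[species]
    if genus in genus_data:
        return "1", dict(genus_data[genus])
    congeners = [
        s for s, (g, _) in taxonomy.items()
        if g == genus and s != species and s in species_data
    ]
    if len(congeners) == 1:
        return "2.1", dict(species_data[congeners[0]])
    if len(congeners) > 1:
        diet = {}
        for cat in CATEGORIES:
            ranks = {species_data[s][cat] for s in congeners}
            # Keep a rank only if all congeners agree; otherwise "NA".
            diet[cat] = ranks.pop() if len(ranks) == 1 else None
        return "2.2", diet
    if family in family_data:
        return "3", dict(family_data[family])
    return None, None  # nothing available at any level


# Hypothetical taxa: species -> (genus, family).
taxonomy = {
    "Aus one": ("Aus", "Aidae"),
    "Bus one": ("Bus", "Aidae"),
    "Bus two": ("Bus", "Aidae"),
    "Bus three": ("Bus", "Aidae"),
    "Cus one": ("Cus", "Aidae"),
    "Eus one": ("Eus", "Aidae"),
    "Eus two": ("Eus", "Aidae"),
    "Fus one": ("Fus", "Aidae"),
    "Dus one": ("Dus", "Didae"),
}
species_data = {
    "Aus one": {"Animal": 1, "Plant": 0},
    "Bus one": {"Animal": 1, "Plant": 0},
    "Bus two": {"Animal": 1, "Plant": 2},
    "Eus one": {"Animal": 0, "Plant": 1},
}
genus_data = {"Cus": {"Animal": 0, "Plant": 1}}
family_data = {"Aidae": {"Animal": 1, "Plant": 1}}


def fill(sp):
    return extrapolate(sp, taxonomy, species_data, genus_data, family_data)


for sp in taxonomy:
    print(sp, fill(sp))
```

In this toy example, "Bus three" receives FillCode 2.2 with a consistent "Animal" rank and an "NA" for "Plant", because its two congeners disagree in that category.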

Internal validation of extrapolation
To investigate the robustness of the extrapolation procedure, we applied two validation procedures ("validation" in Fig. 2): first an internal validation (using the species-level data from the compiled dataset) and second an external validation (using an independent data source, see below). For the internal validation, we evaluated how well each of the species with species-level diet information in the compiled dataset (i.e., FillCode = 0; n = 2033 species) would be filled if no diet data were available. We used a jack-knife approach where diet ranks were removed from one focal species at a time and then filled by the same extrapolation procedure as described above. This predicted the diet ranks for the focal species as if there was no diet information available for that species. The predicted diet ranks of the focal species were then compared with the empirical diet data as recorded from the original data sources. Across all species in the validation subset, we then calculated the proportion of correctly predicted diet ranks (including ranks 1-3 and 0). Species that were the only species with diet information in a genus had to be disregarded for this internal validation if no further diet information was available at the genus or family level.
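The jack-knife logic can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the predictor below fills only from congeneric species (a stand-in for the full hierarchical procedure), genus membership is read from the first word of the species name, "NA" predictions (`None`) are left unscored, and all names and ranks are hypothetical.

```python
def predict_from_congeners(focal, held_out):
    """Simplified stand-in predictor: consensus rank among congeners.

    Returns None if no congener with data remains, and uses None ("NA")
    for categories in which congeners disagree.
    """
    genus = focal.split()[0]
    congeners = [d for s, d in held_out.items() if s.split()[0] == genus]
    if not congeners:
        return None
    prediction = {}
    for cat in congeners[0]:
        ranks = {d[cat] for d in congeners}
        prediction[cat] = ranks.pop() if len(ranks) == 1 else None
    return prediction


def jackknife_accuracy(species_data, predict):
    """Leave one species out at a time, predict its ranks, and return the
    proportion of correctly predicted ranks per diet category."""
    correct, total = {}, {}
    for focal, observed in species_data.items():
        held_out = {s: d for s, d in species_data.items() if s != focal}
        predicted = predict(focal, held_out)
        if predicted is None:
            continue  # nothing left to fill from: species is disregarded
        for cat, obs_rank in observed.items():
            pred_rank = predicted.get(cat)
            if pred_rank is None:
                continue  # "NA" prediction is not scored (an assumption)
            total[cat] = total.get(cat, 0) + 1
            correct[cat] = correct.get(cat, 0) + int(pred_rank == obs_rank)
    return {cat: correct[cat] / total[cat] for cat in total}


# Hypothetical species-level data.
species_data = {
    "Aus one": {"Animal": 1, "Plant": 0},
    "Aus two": {"Animal": 1, "Plant": 2},
    "Aus three": {"Animal": 1, "Plant": 0},
    "Bus solo": {"Animal": 0, "Plant": 1},
}

accuracy = jackknife_accuracy(species_data, predict_from_congeners)
print(accuracy)
```

Passing the predictor as a function keeps the validation loop independent of how the filling is done, so the same loop could wrap a fuller hierarchical procedure.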

External validation of extrapolation
We used an external validation to test the accuracy of the extrapolated species-level diet data in MammalDIET relative to diet data extracted from an additional, independent data source (see validation in Fig. 2). For this additional data source ("external validation dataset"), we chose the species accounts from the new edition of Mammals of Africa (Happold 2013; Happold and Happold 2013; Kingdon and Hoffmann 2013a,b; Kingdon et al. 2013), a series of six volumes describing in detail every currently recognized species of terrestrial mammal in Africa. Although it has a regional focus (the African continent), this compilation of books is the most comprehensive, up-to-date species-level data source that is currently available for mammals in a specific biogeographic region. From the full list of all species with extrapolated diet information in our dataset (n = 3329), we first selected those occurring in Africa (n = 611) and then randomly selected species from this list to subsequently enter diet information from the Mammals of Africa. To aim for a reasonable sample size of species across different diet categories, we stratified the random selection by choosing 30 random species (if available) for each of the sixteen diet categories. Several species were selected more than once; after removing these duplicates, we ended up with a total of 289 randomly selected species. For each of these species, we checked the diet information in Mammals of Africa and entered species-level diet information in the same way as for MammalDIET (if available). For each of the 16 diet categories (cf. Fig. 3), we then compared how often the diet information from the external validation dataset (Mammals of Africa) was consistent with the extrapolated species-level knowledge in MammalDIET. We report the percentage of correctly extrapolated diet ranks (separately for rank 1 only, and for rank 1 and 2 combined) for the 16 diet categories. We performed this validation for all species in the external validation dataset as well as separately for mammal orders with ≥15 species.
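The stratified random selection with subsequent removal of duplicates can be sketched in Python as below. This is a hedged illustration: the rule that a species belongs to a category's pool when its rank in that category is greater than 0 is an assumption, as are the function name, the fixed seed, and the toy data.

```python
import random


def stratified_sample(diets, categories, per_category=30, seed=0):
    """Draw up to `per_category` random species for each diet category and
    return the union, so species drawn more than once appear only once."""
    rng = random.Random(seed)
    selected = set()
    for cat in categories:
        # Assumption: a species is eligible if its rank in `cat` is > 0.
        pool = sorted(sp for sp, d in diets.items() if (d.get(cat) or 0) > 0)
        k = min(per_category, len(pool))  # "if available"
        selected.update(rng.sample(pool, k))
    return selected


# Hypothetical toy data: 20 animal eaters, 3 plant eaters, 2 mixed feeders.
diets = {f"animal_eater_{i}": {"Animal": 1, "Plant": 0} for i in range(20)}
diets.update({f"plant_eater_{i}": {"Animal": 0, "Plant": 1} for i in range(3)})
diets.update({f"mixed_{i}": {"Animal": 2, "Plant": 2} for i in range(2)})

sample = stratified_sample(diets, ["Animal", "Plant"], per_category=5)
print(len(sample), sorted(sample))
```

Because the result is a set, the final sample can be smaller than the number of categories times `per_category`, which mirrors how 16 categories with up to 30 draws each reduced to 289 species in the text.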

Classification of trophic levels and dietary guilds
We used the information in MammalDIET to group mammal species into different trophic levels and dietary guilds. Based on the internal and external validation results, we identified various diet categories that were reliable for such a classification (for details see results). We applied two different types of classifications. First, we used the "Animal" and "Plant" categories to classify species into three trophic levels (carnivores, herbivores, and omnivores; see "TrophicLevels" in Appendix S1). This classification was coarse and mutually exclusive so that species in one trophic level could not be present in the other trophic level. Second, a few of the diet categories ("Mammal", "Invertebrate", "Seed", "Fruit", and "Leaf") were robust enough, given the validation procedures, to provide a finer classification into dietary guilds (mammal eaters, insectivores, granivores, frugivores, and folivores). This fine classification focused on the functional role of the species in the ecosystem, and categories were not mutually exclusive. After classification we examined how well trophic levels and dietary guilds were represented among mammal orders and families.

Spatial visualization
To illustrate potential applications of the presented data, we combined the trophic level and dietary guild classification with data on the global distribution of mammals. We used the global species distribution maps for terrestrial mammal species from IUCN (2013). We converted the polygon range maps to rasters on a Behrmann cylindrical equal-area projection and extracted species occurrences for grid cells at a resolution of 2° equivalents (~220 km). We chose 2° equivalents over 1° equivalents, but we note that statistical analyses with range maps at these two spatial resolutions usually give similar results (e.g., Hurlbert and Jetz 2007; Kissling et al. 2012). The data handling and extraction were similar to the procedure described by Sandom et al. (2013). We mapped the global distribution of species richness and proportions for each trophic level and dietary guild, excluding Antarctica and grid cells with <50% land area. We note that this mapping is only used for illustrative purposes and that more rigorous statistical analyses on potential drivers of these large-scale richness patterns need further scrutiny.
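Once species occurrences per grid cell are available, richness and proportion per trophic level or dietary guild reduce to simple set operations. The Python sketch below assumes that rasterization has already been done and that occurrences are held as a dictionary from cell identifier to a set of species. The cell identifiers, the land-fraction lookup, and the species names are hypothetical.

```python
def richness_and_proportion(cell_species, group, land_fraction, min_land=0.5):
    """For each grid cell with at least `min_land` land area, return
    (richness of `group`, proportion of `group` among all species)."""
    result = {}
    for cell, species in cell_species.items():
        if land_fraction.get(cell, 0.0) < min_land:
            continue  # mostly water: excluded from mapping
        n_group = len(species & group)
        proportion = n_group / len(species) if species else None
        result[cell] = (n_group, proportion)
    return result


# Hypothetical occurrences: cell id -> set of species present.
cell_species = {
    (0, 0): {"sp_a", "sp_b", "sp_c", "sp_d"},
    (0, 1): {"sp_a", "sp_e"},
    (1, 0): {"sp_b"},
}
land_fraction = {(0, 0): 1.0, (0, 1): 0.8, (1, 0): 0.2}
frugivores = {"sp_a", "sp_b"}  # hypothetical guild membership

maps = richness_and_proportion(cell_species, frugivores, land_fraction)
print(maps)
```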

Digitalization of data
Of the 5364 terrestrial mammal species, a total of 2033 species (38% of all terrestrial mammal species) had species-level information available from the two global data sources. Of those, 682 species were entered from Nowak (1999) and an additional 1351 species from IUCN (2013). Furthermore, information on diet preferences was also available for many genera (n = 453) and families (n = 32). These genus and family diet data were only used for the extrapolation procedure. The calibration step before the data digitalization confirmed that minimizing discrepancies between multiple data recorders can be important when transferring written text descriptions into semi-quantitative ranks. The subsequent additional quality check of 120 randomly selected species revealed a relatively low error rate due to data entry via multiple data recorders. In most cases, diet ranks were identically re-entered for a specific diet category (overall mean ± SD across all categories: 94.41% ± 0.04%, n = 120). The lowest success was obtained for the diet category "Other" plant material (86%), whereas the highest success was obtained for the diet categories "Root" and "Herbaceous" (>99%). All other diet categories had a high classification success of ≥90% ("Animal": 94%; "Vertebrate": 95%; "Mammal": 96%; "Bird": 96%; "Herptile": 96%; "Fish": 97%; "Invertebrate": 91%; "Plant": 90%; "Seed": 96%; "Fruit": 90%; "Nectar": 96%; "Leaf": 92%; "Woody": 95%).

Extrapolation of data
Among the 3331 species (62% of all terrestrial mammal species) with missing species-level diet data, a total of 2556 species (48%) were filled with diet information from the genus level (FillCode = 1). In addition, 337 species (6%) were filled from other species within the same genus (FillCode = 2.1 or 2.2). Of those, 266 species (5%) were filled with information available from one other species in the same genus (FillCode = 2.1), whereas 71 species (1%) were filled from more than one species in the genus (FillCode = 2.2). Finally, information from the family level was extrapolated to 436 species (8%, FillCode = 3). Hence, a total of 3329 species had extrapolated diet information, with only two species (Echinoprocta rufescens and Prolagus sardus) remaining without diet information after the extrapolation procedure. The former of these two species seems to be phylogenetically nested within the genus Coendou (Voss et al. 2013) and can therefore be considered as herbivorous, whereas the latter is extinct (IUCN 2013) and dental morphology suggests a predominantly herbaceous diet (Angelone 2005). Note that we did not enter this additional information into MammalDIET as it was not available from the two original data sources. In total, the original data together with the extrapolation procedure provided species-level data on diet preferences for 99.9% of the world's terrestrial mammals (n = 5362 species).
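The counts and percentages reported above can be cross-checked with a few lines of arithmetic. The short Python snippet below uses only the numbers given in the text; the variable names are illustrative.

```python
total_species = 5364

# Number of species per FillCode, as reported in the text.
counts = {"0": 2033, "1": 2556, "2.1": 266, "2.2": 71, "3": 436}

extrapolated = sum(n for code, n in counts.items() if code != "0")
covered = counts["0"] + extrapolated
missing = total_species - covered

# Percentages of all terrestrial mammal species, rounded to whole numbers.
percent = {code: round(100 * n / total_species) for code, n in counts.items()}
coverage = 100 * covered / total_species

print(extrapolated, covered, missing)
print(percent)
print(f"{coverage:.2f}%")
```

The sums reproduce the 3329 extrapolated species, the 5362 species with diet data, and the two species left without information.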
Across all terrestrial mammal families, the percentage of species with different filling codes varied widely (Fig. 4A). However, species-level diet information was typically available for half of the species within a given family (median: 54%). Some families had species-level diet information for all species while a few families had zero coverage (range: 0-100%). Most diet information was extrapolated from the genus level (median: 29% of species across families). Across families, filling from just one other species in the genus (FillCode = 2.1), from all other species with diet information within the same genus (FillCode = 2.2), or from family level (FillCode = 3) was generally very low (Fig. 4). The extrapolation of diet information was also not homogeneous across mammal orders (Table 3). Most diet information within orders was filled from genus-level information (FillCode = 1). Species-level information (FillCode = 0) was particularly well represented in the mammal orders Carnivora (82%), Cetartiodactyla (66%), and Primates (59%) (Table 3).

Internal validation of extrapolation
The internal validation with the compiled dataset showed that the extrapolation procedure performed best for primary food items (rank 1, Fig. 4), especially for coarse diet categories such as "Animal" and "Plant" (Table 4). More generally, diet categories at high hierarchical levels ("Animal", "Plant", "Vertebrate", and "Invertebrate") were on average better predicted than those at low hierarchical levels (all other categories) (Fig. 4). However, a few diet categories at low hierarchical levels ("Mammal", "Seed", "Fruit", and "Leaf") also showed good validation results for rank 1 (Table 4), whereas the "Vertebrate" category had the lowest predictive potential among the higher hierarchical levels (Table 4). The prediction of the absence of a diet category (rank 0) was generally very good (Fig. 4) and tended to be better for low hierarchical levels than for high hierarchical levels (opposite to ranks 1-3; Fig. 4).

External validation of extrapolation
Of 289 African species in the external validation dataset, 163 species (56%) had species-level diet information available from the six volumes of Mammals of Africa. The number of species for validating the sixteen different diet categories was generally good (>10 species) although two diet categories ("Fish", "Nectar") had insufficient sample sizes (3 and 0 species, respectively). For diet categories with sufficient sample sizes, the external validation showed that three diet categories at high hierarchical levels ("Animal", "Plant", and "Invertebrate") as well as four diet categories at low hierarchical levels ("Mammal", "Seed", "Fruit", "Leaf") correctly predicted the diet ranks in ≥60% of the cases (Fig. 4C). The same diet categories were also identified with good validation scores by the internal validation.
[Fig. 4 caption fragment, beginning lost in extraction; y-axis: Correctly predicted diet rank (%)] ... Table S2. In (B) each boxplot summarizes the proportion of correctly predicted diet ranks for high (grey boxes) and low (white boxes) hierarchical levels (compare Fig. 3). High hierarchical levels include the diet categories "Animal", "Plant", "Vertebrate", and "Invertebrate", whereas the low hierarchical levels include all other diet categories. Information on ranks 1-3 is provided in Table 2. The "0" indicates that a diet category was not used (i.e., assumed absence). In (C), extrapolated diet data are validated independently with an external validation dataset (Mammals of Africa, see text for details). The percentage of correctly predicted diet ranks is given for each of the sixteen diet categories for rank 1 data only (gray bars) and for rank 1 and 2 data combined (white bars). Numbers below diet categories give the sample size (number of species) for each validation. Boxes in (A) and (B) represent the interquartile range (IQR), horizontal lines within the boxes represent medians, whiskers extend to 1.5 times the IQR, and outliers are plotted as dots.
To explore taxonomic variation in extrapolating diet ranks, we examined the results from the external validation separately for each of five mammal orders with ≥15 species (Figs 5 and 6).
This revealed interesting differences in the potential to predict diet adaptations both within and among clades. Two orders (Rodentia and Carnivora) showed a broad range of diet categories, but their specialization on plants and animals differed. Rodents (Rodentia), being predominantly herbivorous and insectivorous and representing the most species-rich order in the external validation dataset (as well as globally, Table 3), showed a 100% prediction accuracy for the diet category "Plant", but a mixed picture with varying percentages of correctly predicted diet ranks for other categories (Fig. 5A). The order Carnivora (here mostly represented by genets and mongooses), predominantly feeding on animal material, showed very good predictions (usually >75% correctly predicted diet ranks) for the categories "Animal", "Vertebrate", "Mammal", and "Invertebrate", but lower values for other diet categories (Fig. 5B). In contrast to the broad range of diet categories in Rodentia and Carnivora, the three other mammal orders showed a stronger specialization on a few specific diet categories (Fig. 6). Primates showed excellent evaluation scores for "Plant" and "Fruit", but lower scores for "Animal" and "Invertebrate" (Fig. 6A). The herbivorous Cetartiodactyla (here mostly duikers, dik-diks, etc.) also showed excellent evaluation scores for "Plant" and "Fruit" (Fig. 6B), but whether species were browsers or grazers varied among species (i.e., lower scores for "Woody" and "Herbaceous" leaves). Finally, the highly insectivorous Eulipotyphla (shrews) showed excellent predictions for "Animal" and "Invertebrate", whereas other diet categories were only represented among a few species (Fig. 6C).

Classification of trophic levels and dietary guilds
Table 3. Summary information across mammal orders of how extrapolation of diet preferences was performed (FillCode = 0, 1, 2.1, 2.2, 3). Diet information was available for 2033 species at the species level from the original data sources (FillCode = 0). For the other species, diet data were first filled from the genus level (FillCode = 1, n = 2556 species), then from one other species (FillCode = 2.1, n = 266 species) or from more than one species within the same genus (FillCode = 2.2, n = 71 species), and finally from the family level (FillCode = 3, n = 436 species).
Based on the internal and external validation results above, two classification procedures were applied (for details see Table 5). First, each species was grouped into one of three trophic levels: carnivores, herbivores, and omnivores. These mutually exclusive trophic levels were based on the two coarsest diet categories ("Animal" and "Plant") because they defined the highest hierarchical level (Fig. 3) and were among the diet categories with the best validation scores (rank 1 in Table 4 and Fig. 4C). Only 13 species (0.24%) could not be allocated ("Not assigned" in Table 5) according to this classification. In a second classification, we used finer diet categories (i.e., all categories below "Animal" and "Plant", Fig. 3) to provide a more detailed classification for specific dietary guilds. For this second classification, we only used diet categories if they had well predicted diets in the internal validation (i.e., proportion predicted >0.60 for both rank 0 and rank 1, Table 4) as well as good validation scores in the external validation (≥60% correctly predicted diet ranks, compare Fig. 4C) for diet categories with sufficient sample sizes (>10 species). This included the diet categories "Mammal", "Invertebrate", "Seed", "Fruit", and "Leaf". Hence, for each of these diet categories, we classified species into dietary guilds (mammal eaters, insectivores, granivores, frugivores, and folivores) if the respective diet category had a rank 1 in a given species (Table 5). These dietary guilds were not mutually exclusive because a species could be classified into more than one dietary guild (e.g., granivore, frugivore) if it had a rank 1 in these diet categories ("Seed", "Fruit"). A detailed overview of the two classifications is provided in Table 5. The dietary guild assignment for each species is also provided with the dataset (Appendix Table S1, dataset available from the Dryad Digital Repository: http://doi.org/10.5061/dryad.6cd0v).
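The dietary guild rule stated above (a species joins a guild if the corresponding diet category has rank 1) can be written as a short Python sketch. The function name, the dictionary layout, and the example ranks are illustrative assumptions. The trophic-level rules of Table 5 are not reproduced in this text and are therefore not implemented here.

```python
# Diet category -> dietary guild, following the five categories named above.
GUILD_BY_CATEGORY = {
    "Mammal": "mammal eater",
    "Invertebrate": "insectivore",
    "Seed": "granivore",
    "Fruit": "frugivore",
    "Leaf": "folivore",
}


def dietary_guilds(diet):
    """Return all guilds for which the matching diet category has rank 1.

    Guilds are not mutually exclusive, so the list may hold several
    entries or none at all.
    """
    return [
        guild
        for category, guild in GUILD_BY_CATEGORY.items()
        if diet.get(category) == 1
    ]


# Hypothetical species with two primary food items (seeds and fruits).
example = {"Seed": 1, "Fruit": 1, "Leaf": 2, "Invertebrate": 0, "Mammal": 0}
guilds = dietary_guilds(example)
print(guilds)
```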
The percentage of species within trophic levels and dietary guilds varied considerably among mammal orders (Table 6). For instance, the largest proportions of carnivorous species (as defined in Table 5) were found in the mammal orders Dasyuromorphia (a group of Australian marsupials), Eulipotyphla (such as shrews), and Afrosoricida (tenrecs, otter-shrews and golden-moles). For herbivores, the orders Cetartiodactyla (such as bovids and deer) and Lagomorpha (such as hares and rabbits) contained the highest proportions of species. Omnivorous species were best represented within the orders Didelphimorphia (opossums) and Scandentia (treeshrews). Dietary guilds included mammal eaters (e.g., felids and canids), insectivores (e.g., microbats, tenrecs, shrews), frugivores (e.g., some groups of bats and primates), granivores (e.g., some groups of rodents), and folivores (e.g., bovids, kangaroos, and hares). A detailed overview of trophic levels and dietary guilds is provided for mammal orders in Table 6 and for mammal families in Appendix Table S3.

Spatial visualization
Peaks in species richness of trophic levels showed a surprising spatial overlap across the world (Fig. 7A-C). This indicated that the build-up of species richness in different trophic levels is possibly governed by similar drivers. In contrast to coarse trophic levels, dietary guilds showed more spatial heterogeneity in species richness at a global scale (Fig. 7D-H). For instance, mammal eaters, granivores, and folivores appeared to be particularly species-rich in mountain ranges such as the Andes, Himalayas, East African mountains, and the mountainous west of the USA (Fig. 7D, F, H). In contrast, species richness of frugivores and insectivores additionally peaked in lowland tropical rainforests on all continents (Fig. 7E, G).
Beyond species richness, we also spatially visualized the proportions of each trophic level and dietary guild (Fig. 8). For trophic levels, carnivores showed high proportions in most parts of the world (Fig. 8A), whereas herbivores dominated mostly at high latitudes (Fig. 8B). Omnivores seemed to be proportionally overrepresented in the Saharan desert region (Fig. 8C), but this region is generally species poor. Proportional maps for dietary guilds showed that insectivores had high proportions throughout the world (Fig. 8E), frugivores mostly had high proportions around the equator (Fig. 8G), and mammal eaters, granivores, and folivores were well represented outside the tropical belt (Fig. 8D, F, H).

Table 4. Internal validation of extrapolating diet information, illustrated by the proportions of correctly predicted diet ranks (rank 0-3) within a subset of species for which species-level diet information was available (n = 2033 species). Prediction of diet ranks was performed using a jack-knife approach that first removed the original diet information of a focal species and then predicted the diet ranks with a filling procedure as described in the main text. Proportions >0.60 are highlighted in bold. "NA" reflects missing diet rank data in a specific diet category.

Discussion
By digitizing, extrapolating, and validating diet preferences of terrestrial mammals worldwide, we compiled a comprehensive and unique, cladewide trait dataset (MammalDIET) relevant for macroecological and macroevolutionary analyses. In contrast to previous datasets that have been made available to the public (Jones et al. 2009; Price et al. 2012; Cantalapiedra et al. 2014), MammalDIET allows a finer dietary guild classification and a broader taxonomic coverage. This was achieved by a combination of original and extrapolated data, thus providing species-level diet estimates for >99% of all terrestrial mammals. Results from the internal and external validation steps confirmed the use of several diet categories as reliable information for subsequent classification of species into trophic levels and dietary guilds. The methodological approach used here (summarized in Fig. 2) could also be applied more widely when constructing global databases of species-specific traits.
Digitalization of available trait data represents an important step in the compilation of macroecological trait datasets. During this process, errors can occur, for example, when written text descriptions are converted into (semi)quantitative data. We used a calibration step with 20 randomly selected species before entering the data to ensure that diet information was digitized in the most consistent way among multiple data recorders. Furthermore, we tested the error rate due to data entry via multiple data recorders using 120 randomly selected species. This revealed that converting written diet descriptions from textbooks into (semi)quantitative diet ranks was not particularly prone to errors. We found that most diet ranks were entered in the same way by multiple recorders, with an accuracy of almost 95%. Nevertheless, some diet categories such as other plant material ("Other") had a lower success rate (86%) which demonstrates a larger uncertainty in the assigned importance score for such unspecific categories. We emphasize that initial calibrations and subsequent data quality tests were valuable steps to avoid discrepancies in data entries and to maintain the consistency of data entry by multiple data recorders. Other authors of mammalian diet datasets (e.g., Price et al. 2012) also verbally report such cross-validations of scoring by multiple recorders although quantitative assessments are usually not provided. We therefore suggest that explicit guidelines for how to convert diet descriptions into ranked importance scores are needed when many recorders are involved in building up macroecological trait datasets (Jones et al. 2009).
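A consistency check of this kind can be sketched in a few lines of Python. The sketch assumes that accuracy is measured as the proportion of species-by-category entries for which two recorders entered the same diet rank; the exact calculation is not spelled out in the text, and the function name `agreement_rate` and the toy data are hypothetical.

```python
from typing import Dict

Ranks = Dict[str, int]  # diet category -> entered rank


def agreement_rate(recorder_a: Dict[str, Ranks], recorder_b: Dict[str, Ranks]) -> float:
    """Proportion of shared (species, category) entries with identical ranks."""
    matches = 0
    total = 0
    for species in recorder_a.keys() & recorder_b.keys():
        ranks_a, ranks_b = recorder_a[species], recorder_b[species]
        for category in ranks_a.keys() & ranks_b.keys():
            total += 1
            matches += ranks_a[category] == ranks_b[category]
    if total == 0:
        raise ValueError("no shared species/category entries to compare")
    return matches / total


# Toy data: two recorders, two species, three diet categories each.
recorder_a = {
    "species_1": {"Animal": 1, "Plant": 0, "Fruit": 0},
    "species_2": {"Animal": 0, "Plant": 1, "Fruit": 1},
}
recorder_b = {
    "species_1": {"Animal": 1, "Plant": 0, "Fruit": 0},
    "species_2": {"Animal": 0, "Plant": 1, "Fruit": 2},  # one disagreement
}
rate = agreement_rate(recorder_a, recorder_b)
print(f"{rate:.3f}")
```

Restricting the comparison to a single diet category (for example "Other") would give a category-specific rate comparable to the lower value reported for unspecific categories.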
To accommodate the lack of species-level traits in sparse datasets, an extrapolation or prediction of missing trait data based on non-missing entries from other taxonomic or phylogenetic levels might often be the only way to compile macroecological trait datasets with a global coverage (Shan et al. 2012). Our hierarchical extrapolation procedure allowed us to fill in gaps of diet information when species-level information was not available from the two original data sources. For some taxonomic groups (e.g., Rodentia, Eulipotyphla), the missing data reflect the limited diet knowledge at the species level. This became evident in the external validation, which showed that for many extrapolated species additional species-level diet data were not available, not even from the most comprehensive regional data sources (Happold 2013; Happold and Happold 2013; Kingdon and Hoffmann 2013a,b; Kingdon et al. 2013). For instance, for Eulipotyphla (here mostly represented by shrews of the genus Crocidura in the family Soricidae) and Rodentia (various mice genera in the family Muridae), the external validation dataset based on the Mammals of Africa did not provide species-level diet information for 65% and 49% of the species, respectively. Nevertheless, we acknowledge that more species-level diet data could be extracted from additional data sources for some of the species which currently have extrapolated diets in MammalDIET. In such cases, MammalDIET could serve as a baseline source for adding additional data and the data coverage for such species could then be improved. Extrapolation will be most reliable if taxa show a high level of phylogenetic conservatism in their diets. An excellent example of such diet conservatism is the microbats (suborder Microchiroptera in the order Chiroptera), nearly all of which feed exclusively, as aerial insectivores, on insects and other arthropods. For such groups, extrapolating diet knowledge from suborder, family or genus level will be unproblematic.
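The order of the hierarchical filling steps (FillCode 0, 1, 2.1, 2.2, 3) and the jack-knife idea behind the internal validation can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to build MammalDIET: the data structures, the function names, and in particular the rule for pooling several congeners (most frequent rank per category) are hypothetical, and the validation returns one overall proportion per diet category rather than separate proportions for each rank.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Ranks = Dict[str, int]                   # diet category -> rank
Taxonomy = Dict[str, Tuple[str, str]]    # species -> (genus, family)


def combine(rank_sets: List[Ranks]) -> Ranks:
    """HYPOTHETICAL pooling rule: most frequent rank per category (ties: first seen)."""
    categories = sorted({c for ranks in rank_sets for c in ranks})
    return {
        c: Counter(ranks.get(c, 0) for ranks in rank_sets).most_common(1)[0][0]
        for c in categories
    }


def fill_diet(
    species: str,
    taxonomy: Taxonomy,
    species_diet: Dict[str, Ranks],
    genus_diet: Dict[str, Ranks],
    family_diet: Dict[str, Ranks],
) -> Tuple[Optional[Ranks], Optional[str]]:
    """Return (diet ranks, FillCode), trying species, genus, congeners, then family."""
    if species in species_diet:
        return species_diet[species], "0"
    genus, family = taxonomy[species]
    if genus in genus_diet:
        return genus_diet[genus], "1"
    congeners = [
        species_diet[s]
        for s, (g, _) in taxonomy.items()
        if g == genus and s != species and s in species_diet
    ]
    if len(congeners) == 1:
        return congeners[0], "2.1"
    if len(congeners) > 1:
        return combine(congeners), "2.2"
    if family in family_diet:
        return family_diet[family], "3"
    return None, None


def jackknife_accuracy(
    taxonomy: Taxonomy,
    species_diet: Dict[str, Ranks],
    genus_diet: Dict[str, Ranks],
    family_diet: Dict[str, Ranks],
    category: str,
) -> float:
    """Hide each species' own diet, refill it, and score one diet category."""
    hits = total = 0
    for focal, observed in species_diet.items():
        held_out = {s: r for s, r in species_diet.items() if s != focal}
        predicted, _ = fill_diet(focal, taxonomy, held_out, genus_diet, family_diet)
        if predicted is None:
            continue  # nothing available to extrapolate from
        total += 1
        hits += predicted.get(category, 0) == observed.get(category, 0)
    return hits / total if total else float("nan")


# Toy data with placeholder names.
taxonomy = {
    "sp_a": ("GenusA", "FamilyX"),
    "sp_b": ("GenusA", "FamilyX"),
    "sp_c": ("GenusA", "FamilyX"),
    "sp_d": ("GenusB", "FamilyX"),
    "sp_e": ("GenusC", "FamilyX"),
}
species_diet = {
    "sp_a": {"Animal": 1, "Invertebrate": 1},
    "sp_b": {"Animal": 1, "Invertebrate": 1},
}
genus_diet = {"GenusB": {"Plant": 1}}
family_diet = {"FamilyX": {"Animal": 1}}

for sp in taxonomy:
    print(sp, fill_diet(sp, taxonomy, species_diet, genus_diet, family_diet))
print(jackknife_accuracy(taxonomy, species_diet, genus_diet, family_diet, "Invertebrate"))
```

In the toy data all congeners share the same diet, so the jack-knife recovers every hidden entry; the more heterogeneous the diets within a genus or family, the lower this proportion becomes.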
Other mammal groups also show a high predictability for specific diet categories (Fig. 6). For instance, almost all species in the order Eulipotyphla feed primarily on invertebrates, including the shrews (family Soricidae) and the moles, shrew moles, and desmans (family Talpidae). Categorizing these species as insectivores (as defined in Table 5) is unproblematic even if diet knowledge at the species-level is absent. Nevertheless, several other diet categories are used by only a subset of Eulipotyphla species and an extrapolation in these cases is then less reliable (Fig. 6C). This similarly applies to primates (Primates) and even-toed ungulates (within Cetartiodactyla) which primarily feed on plant material (high phylogenetic conservatism and good predictability), but the specific type of plant material (fruits, seeds, leaves) can vary among species, genera and families, making predictions more difficult (Fig. 6A, B). More generally, the use of specific diet categories can be quite heterogeneous among species within several mammal orders, families and genera. Thus, uncertainty in extrapolating diet information across taxonomic levels depends on the level of diet generalization within taxonomic groups (Fig. 5) and on the hierarchical position of the diet categories (Fig. 3). For instance, some families in the order Rodentia (e.g., Cricetidae, to which true hamsters, voles, lemmings, and New World rats and mice belong) contain insectivorous, herbivorous and omnivorous species, and extrapolations from one species to another or from genus and family level will be less reliable. More generally, predictions across taxonomic levels will be more difficult if species within a certain taxonomic level (e.g., genus) use a large number of diet categories at low hierarchical levels. Despite this, our validations showed a surprisingly good predictive ability across the mammal clade for several diet categories, including the "Animal", "Mammal", "Invertebrate", "Plant", "Seed", "Fruit", and "Leaf" categories.

Table 5. Ecological and technical details of defining trophic levels and dietary guilds of mammals. Internal and external validations of correctly extrapolating diet ranks were used to guide which diet categories were reliable to group species into different trophic levels and dietary guilds (see text for details). The trophic levels represent three mutually exclusive groups (carnivores, herbivores, omnivores) based on diet categories at the highest hierarchical level ("Animal", "Plant"). The five dietary guilds (mammal eaters, insectivores, granivores, frugivores, folivores) are not mutually exclusive and were classified based on fine diet categories ("Mammal", "Invertebrate", "Seed", "Fruit", "Leaf") with good validation scores (compare Table 4).

Compared with previously published datasets, MammalDIET represents an improved classification of dietary guilds in terrestrial mammals worldwide because the diet data are more detailed and provided in a quantitative format that facilitates customized diet reclassifications. For instance, Price et al. (2012) assembled coarse mammalian diet data and classified species into three trophic levels (carnivores, omnivores, herbivores), covering only approximately one-third of the mammals (n = 1530 species). Jones et al. (2009) recorded eight diet categories and classified mammals into three trophic levels (carnivores, omnivores, herbivores), but only for around 40% of the species. Jetz et al. (2009) compiled diet data for >90% of the mammal species, but only distinguished two trophic levels (primary and secondary consumers), and the data were not made publicly available. MammalDIET provides data for 16 diet categories that can be combined in many ways to generate any kind of customized dietary guilds.
This enables a much more refined classification of dietary guilds than previously possible, and researchers are free to define diet guilds tailored to the question they are investigating. Our validation results further support previously applied classifications (e.g., Sandom et al. 2013) and suggest that results using 2-3 trophic levels based on similar data (Price et al. 2012) should be relatively robust and reliable.

Conclusions
The compilation of macroecological trait datasets such as MammalDIET is challenging and requires several methodological steps, from digitizing accessible information to extrapolating missing data and validating extrapolation procedures. The approach illustrated here provides an example of how to fill in data gaps in mammalian trait information and could be applicable more widely to other traits and taxa. Given the large knowledge gaps on traits of species-rich clades, we suggest that a comprehensive effort to compile and predict traits is needed to significantly advance macroecological and macroevolutionary research. Fundamental to this effort will be a deeper understanding of phylogenetic conservatism in traits, that is, when it matters and how it varies across taxonomic and phylogenetic scales.