What remains to be discovered: A global assessment of tree species inventory completeness

Recent unprecedented efforts to digitise and mobilise biodiversity data have resulted in the generation of ‘biodiversity big data’, enabling ecological research at scales previously not possible. However, gaps, biases and uncertainties in these data influence analytical outcomes and the validity of scientific research and conservation actions. Here, we estimated tree species inventory completeness globally and identified where future surveys should focus to maximise regional inventories.


| INTRODUCTION
Over the last 20 years, unprecedented international efforts to digitise, store and mobilise biodiversity data, including natural history collections, in online databases have proliferated, resulting in the generation of 'biodiversity big data' (Bisby, 2000; Devictor & Bensaude-Vincent, 2016). Many natural history museums and herbaria now store information from their specimen collections in electronic databases. Online databases have facilitated access to large quantities of digitised biodiversity data, mainly in the form of occurrence records (i.e. an observation of a species at a particular location and time), enabling the study of biodiversity at taxonomic, spatial and temporal scales previously not possible (Heberling et al., 2021).
Potential applications of digitised biodiversity data in ecology and conservation science are manifold, ranging from estimates of species richness and patterns of diversity (Ballesteros-Mejia et al., 2013; Brummitt et al., 2020; Pelayo-Villamil et al., 2018) to extrapolation of species' potential geographic distributions (Cunze et al., 2020; Elith & Leathwick, 2009; Franklin & Miller, 2009), bioinvasion studies (Beaumont et al., 2014; Cao et al., 2021; O'Donnell et al., 2012), biodiversity assessments (Guisan et al., 2013; Salinas-Rodríguez et al., 2018), estimating completeness and coverage of species distribution and diversity (Ballesteros-Mejia et al., 2013; Mora et al., 2008; Troia & McManamay, 2016, 2017) and monitoring progress towards global conservation goals (Brooks et al., 2015; Orr et al., 2022). Recently, following the UN Biodiversity Conference, the 15th Conference of Parties (www.cbd.int/cop) established 23 global targets under the Kunming-Montreal Global Biodiversity Framework (GBF), which aims to reverse nature loss and safeguard biodiversity by 2030, and to promote human and nature well-being by 2050 (www.cbd.int/gbf). GBF Target 21 aims to ensure that the best biodiversity-related data and knowledge are accessible and available to policymakers, practitioners and other relevant stakeholders, including the public. Increased availability and accessibility of biodiversity data will help countries monitor their progress towards national goals and take necessary actions with regard to biodiversity conservation and management. The Global Biodiversity Information Facility (GBIF; www.gbif.org) is currently the world's largest biodiversity data infrastructure, facilitating access to more than 2.6 billion occurrence records from over 92,000 data sets worldwide (as of January 2024).
However, concerns over data quality have been raised because gaps, uncertainties, inaccuracies and biases are known to exist in occurrence data (Maldonado et al., 2015; Meyer et al., 2016; Soberón & Peterson, 2004). Generally, uncertainties and biases in digitised biodiversity data can be classified into three dimensions: temporal, spatial and taxonomic. This set of problems has been described as 'biodiversity knowledge shortfalls' (Hortal et al., 2015; Meyer et al., 2016).
Spatio-temporal uncertainties and biases towards areas of particular interest can result in incomplete knowledge about the species' distribution, a phenomenon known as the 'Wallacean shortfall', which can cause biodiversity maps to resemble maps of sampling effort (Hortal et al., 2015; Lomolino, 2004) and limit understanding of species environmental niches (i.e. Hutchinsonian shortfall) (Cardoso et al., 2011; Hortal et al., 2008; Scherrer et al., 2021). Species misidentification or ambiguous scientific names can lead to uncertainties in their taxonomic identification, and biases in taxonomic coverage (i.e. preference for the collection of one taxon over another) can produce an under- or overestimation of species diversity (Meyer et al., 2016; Rodrigues et al., 2022; Stropp et al., 2022). This can result in a discrepancy between the number of described species and the actual species richness, a situation referred to as the 'Linnean shortfall' (Lomolino, 2004). Biases and uncertainties in species occurrence information and incompleteness in taxonomic inventories influence analytical outcomes; this can create concerns over the reliability of ecological studies and the effectiveness of conservation strategies (Beck et al., 2014; Hughes et al., 2021). Thus, for scientific research to be reliable and valid, uncertainties and biases in the underlying occurrence data must be understood and addressed.
In 2017, the world's first and most comprehensive list of tree species and their country-level distribution, the 'GlobalTreeSearch' (hereafter, 'GTS'), was published (Beech et al., 2017). According to Botanic Gardens Conservation International (BGCI, 2021a, 2021b), approximately 58,000 tree species are currently known to science and published in the GTS database. At least 30% of these have been identified as threatened, with forest clearance and habitat loss being the greatest threats (Newton et al., 2015). Moreover, Guo et al. (2022) found that about 83% of 46,000 tree species analysed were estimated to be subjected to moderate to very high human pressure. While many tree species are being exposed to increasing threats, at least 9000 additional species are estimated to remain to be discovered (Cazzolla Gatti et al., 2022).
Tree conservation and sustainable management of forests are recognised as key actions that help address climate change (Poorter et al., 2015). At the 2021 United Nations Climate Change Conference, COP 26, over 100 world leaders pledged to halt and reverse forest loss by 2030 (UNCC, 2021). Forest protection and restoration have also been identified as one of the nature-based solutions that will help achieve global biodiversity conservation targets, including the UN Decade on Ecosystem Restoration with the goal to halt and reverse ecosystem degradation globally by 2030 (www.decadeonrestoration.org), the GBF Target 3 aimed at ensuring 30% of land and sea areas are conserved and protected by 2030 (www.cbd.int/gbf/targets) and the 'Nature Needs Half' approach to protect 50% of the planet by 2030 (Dinerstein et al., 2017; Wilson, 2016). While actions to protect tree species from habitat loss and land clearance have been proposed, existing knowledge gaps in the global distribution of tree species make their implementation challenging. Thus, bridging these gaps is crucial for effective conservation.
In this study, we utilise tree occurrence records collated in GBIF to calculate sampling effort and estimate the completeness of global tree species inventory (i.e. the proportion of all tree species, known and unknown, that are observed from species occurrence records) for 100 × 100 km grid cells (sampling units) and for terrestrial ecoregions (large units of land containing ecologically distinct assemblages of natural communities and species, sensu Olson et al. (2001) and Dinerstein et al. (2017)). This estimate is crucial in helping to identify where current knowledge is reliable and where gaps and possible biases in the description of tree species richness and distribution exist (Rocchini et al., 2011). We expect that regions with higher sampling effort will have more complete tree species inventories and that sampling effort will be unevenly distributed globally. Additionally, since tree diversity is essential to forest ecosystems and forest conservation (BGCI, 2021b), we examine the spatial intersection between tree species inventory completeness in forested areas and (a) the global forest landscape integrity layer, which describes the degree of forest modification by human pressures (Grantham et al., 2020), for forested sampling units, and (b) the amount of protected land and remaining natural habitat for ecoregions in seven forested biomes (Dinerstein et al., 2017). By doing so, we identify areas of diminishing and future sampling opportunities in forested areas for future botanical surveys aiming at improving our existing knowledge of tree diversity and aiding forest conservation efforts. The greatest gain in knowledge would be achieved from areas where estimated tree species inventory completeness is low and predicted tree species richness is high. However, since tree species are threatened by habitat loss and land use change, we consider sampling opportunities to fill inventory gaps to be diminishing in those sampling units and ecoregions that have both low tree species inventory completeness and high losses of natural habitat and forest integrity. In contrast, sampling units and ecoregions with low tree species inventory completeness that still maintain much of their natural habitat and forest integrity are opportunity areas for future discovery.

| Data set
We obtained a list of 58,496 tree species from GTS version 1.5 (BGCI, 2021a; accessed on 29 October 2021). GTS uses the tree definition agreed on by the International Union for Conservation of Nature's Global Tree Specialist Group: 'a woody plant with usually a single stem growing to a height of at least two meters, or if multi-stemmed, then at least one vertical stem five centimeters in diameter at breast height' (Beech et al., 2017, p. 455). To identify and correct potential taxonomic issues, tree species names were checked and standardised against the GBIF Backbone Taxonomy using the 'rgbif' package (Chamberlain et al., 2022). Because this study requires species records with coordinates, we used the 'rgbif' package (Chamberlain et al., 2022) to download all tree species occurrence records from GBIF (GBIF.org, 02 March 2022; https://doi.org/10.15468/dl.aptyh5) that contained coordinates with no documented geospatial issues and that had a taxonomic status of either 'accepted' or 'synonym', including preserved specimens, living specimens and observations. The excluded records, which were unfit for purpose because they had no coordinates or had GBIF-documented spatial issues, amounted to ~10% of all tree records.
In total, 42,431,811 records for 52,065 tree species from 3796 data sets were downloaded.
This study focuses on species occurrence data collated within GBIF, as this facility represents the largest cooperation of the world's governments to mobilise biodiversity knowledge and it also shares tools, software and best practices for digitising biodiversity data (www.gbif.org). Furthermore, GBIF is the best indicator of worldwide 'shared biodiversity knowledge' and currently contains the greatest number of species occurrence records. Other valuable biodiversity data infrastructures can complement the GBIF database but may be limited in their geographic and taxonomic coverage and accessibility to public users, although regional biodiversity databases can be more appropriate for regional uses in research and conservation planning (Meyer et al., 2015). Thus, it is important to note that this study is more relevant for global research and that the results reported here relate only to GBIF-collated data. Downloaded records (hereafter, 'raw data') were flagged with the 'CoordinateCleaner' package (Zizka et al., 2019) to remove potentially erroneous records and those with incorrect or imprecise georeferencing, which included those assigned to GBIF headquarters or biodiversity institutions, or which had equal longitude and latitude, fell into the ocean or had coordinates containing only zeros.
Additionally, we excluded records that (1) had no decimals in longitude or latitude, (2) represented hybrid or cultivated species, (3) were not identified to species level or (4) were duplicates (defined here as two or more records with the same combination of species name, collection date and location). This filtering process retained 31,343,577 records (73.9% of raw data).
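The four exclusion rules above can be sketched in a few lines. The authors worked in R, so the following Python is only an illustration of the logic under stated assumptions: the `Record` fields, the `hybrid_or_cultivated` flag and the `rank` values are hypothetical names, not GBIF or package identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified occurrence record (field names are assumptions).
    species: str
    lon: float
    lat: float
    date: str                      # collection date as text; "" if unknown
    rank: str = "SPECIES"          # taxonomic rank of the identification
    hybrid_or_cultivated: bool = False


def has_decimals(value: float) -> bool:
    """True if the coordinate carries a fractional part (e.g. 10.25, not 10.0)."""
    return value != int(value)


def filter_records(records):
    """Apply the four exclusion rules in order and return the retained records."""
    seen = set()
    kept = []
    for r in records:
        # (1) drop records lacking decimals in longitude or latitude
        if not (has_decimals(r.lon) and has_decimals(r.lat)):
            continue
        # (2) drop hybrid/cultivated records, (3) drop records not at species level
        if r.hybrid_or_cultivated or r.rank != "SPECIES":
            continue
        # (4) drop duplicates: same species name, collection date and location
        key = (r.species, r.date, r.lon, r.lat)
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

Keeping the first occurrence of each duplicate group is a design choice of this sketch; the source does not state which copy was retained.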
Estimation of species richness and inventory completeness may be affected by non-native species' records. The World Checklist of Vascular Plants (powo.science.kew.org) is the most complete database on all known vascular plants and their Botanical Country distribution (Level 3 of the Biodiversity Information Standards, formerly known as the Taxonomic Databases Working Group [TDWG; www.tdwg.org/standards/wgsrpd]). Thus, we interrogated checklists of Botanical Countries for tree species from the World Checklist of Vascular Plants using the 'taxize' package (Chamberlain et al., 2020) and validated the remaining occurrence records against the acquired checklists. Species occurrence records that fell outside their native Botanical Country boundaries were removed. After assessing georeferencing quality and removing records with coordinate uncertainty, we retained 26,423,277 records (62.3% of raw data; hereafter, 'cleaned data'), representing 50,290 species from 284 plant families, which were considered fit for use at the 100 × 100 km grid and ecoregional scales (see details below). Most of the cleaned data (78.9%, N = 20,841,932) were obtained from human observations (e.g. occurrences recorded from field surveys, citizen science programmes and opportunistic observations), and only 18% (N = 4,701,716) were preserved specimens (e.g. preserved collections hosted in herbaria and natural history museums). Records were collected between circa 1600 and 2022, with <1% of records collected before 1900. Approximately 16.6% of the cleaned data lacked a full collection date (day, month, year).

| Data analysis
Estimates of species inventory completeness (C) provide a useful way to determine how complete and representative sampling has been across spatial and temporal scales (Soberón et al., 2007). We estimated C for tree species inventories at two scales: (i) 100 × 100 km sampling units and (ii) the ecoregional scale. Increasing the spatial resolution leads to substantial trade-offs in sample coverage, whereas reducing it makes practical application of the results challenging. For this global analysis, we identified sampling units of 100 × 100 km to be the best compromise between spatial resolution and deriving meaningful information about tree species inventory completeness. We included ecoregions in this study because ecoregions are biogeographic units representing distinct assemblages of biodiversity in particular regions (Dinerstein et al., 2017). They also provide a useful base map for conservation planning and management actions at global and regional scales because they draw on natural boundaries rather than political ones (see examples of countries that have adopted conservation planning for ecoregions in Dinerstein et al., 2017; Olson et al., 2001). We obtained the ecoregion shapefile from Ecoregions 2017 ©Resolve (https://ecoregions.appspot.com/), which consisted of 846 terrestrial ecoregions (excluding the Rock and Ice ecoregion) grouped into 14 biomes and eight realms (Dinerstein et al., 2017). Occurrence records, sampling units and ecoregion shapefiles were projected from latitude and longitude (World Geodetic System [WGS] 1984) to a Mollweide (equal area) projection (ESRI: 54009). We excluded areas located in the Antarctic, and the Rock and Ice ecoregion, thus retaining 13,259 sampling units within 846 ecoregions for analyses.
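To make the gridding step concrete, the sketch below projects a longitude/latitude pair to Mollweide coordinates and assigns it to a 100 × 100 km cell. It is an illustration only: it uses the spherical Mollweide formulas with the WGS84 semi-major axis as the radius (an assumption), and the grid origin at (0, 0) and the function names are hypothetical. The authors' actual grid layout is not described in the text.

```python
import math


def mollweide(lon_deg, lat_deg, radius=6378137.0):
    """Forward spherical Mollweide projection; returns (x, y) in metres."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    # Solve 2*theta + sin(2*theta) = pi*sin(phi) for theta by Newton iteration.
    theta = phi
    for _ in range(100):
        denom = 2.0 + 2.0 * math.cos(2.0 * theta)
        if abs(denom) < 1e-12:      # at (or extremely near) a pole
            break
        delta = (2.0 * theta + math.sin(2.0 * theta) - math.pi * math.sin(phi)) / denom
        theta -= delta
        if abs(delta) < 1e-12:
            break
    x = radius * (2.0 * math.sqrt(2.0) / math.pi) * lam * math.cos(theta)
    y = radius * math.sqrt(2.0) * math.sin(theta)
    return x, y


def grid_cell(lon_deg, lat_deg, cell_size=100_000.0):
    """Integer (column, row) index of the equal-area cell containing the point."""
    x, y = mollweide(lon_deg, lat_deg)
    return math.floor(x / cell_size), math.floor(y / cell_size)
```

Because Mollweide is an equal-area projection, every full square cell covers the same ground area (10,000 km²), which is what makes record densities comparable between cells.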
Among ecoregions, 476 were located in seven forested biomes and the rest were in non-forested biomes (Table S1; see the supplementary notes of Dinerstein et al., 2017 on how forested biomes were defined). To determine patterns in data collection, we calculated the number of occurrence records per species and for each sampling unit, ecoregion and biome. We also analysed the temporal pattern in data accumulation (i.e. the number of occurrence records collected each year and cumulatively) and sampling effort (i.e. the number of records per km²).
We calculated C for tree species in sampling units and ecoregions as follows:

$$C = \frac{S_{\mathrm{obs}(i)}}{S_{\mathrm{est}(i)}} \tag{1}$$

where $S_{\mathrm{obs}(i)}$ is the observed species richness and $S_{\mathrm{est}(i)}$ is the estimated species richness for sampling unit/ecoregion i, with C values ranging from zero to one. High C values suggest high degrees of species inventory completeness; this information can be used to validate the current state of knowledge about species richness in a particular region and identify areas with knowledge gaps (i.e. low C values) (Soberón et al., 2007; Sousa-Baena et al., 2014).
With the 'iNEXT' package (Hsieh et al., 2016), we calculated $S_{\mathrm{est}(i)}$ using the Chao1 non-parametric estimator (bias-corrected form), which estimates the number of species likely to be present in a region based upon the number of rare species observed in a sample (Chao, 1984, 1987; Colwell & Coddington, 1994). Unlike parametric approaches, the Chao1 estimator makes no assumption about the species abundance distribution and is based on the assumption that additional species are less likely to be found when all species in the samples are represented by at least two individuals (Chao & Chiu, 2016; Gotelli & Colwell, 2011). Thus, it is valid for all species abundance distributions, and it is not affected by environmental gradients and spatial autocorrelation (Gotelli & Colwell, 2011). Although different studies may recommend alternative models, the Chao1 estimator is perhaps one of the most common non-parametric methods used to estimate species richness because it is easy to calculate and performs well in landscapes with different bioclimatic conditions (Haque et al., 2017; Sousa-Baena et al., 2014; Stropp et al., 2016).
While this method is conservative and estimates the lower bound of species richness, Chao (1984) found that it performed well on test data sets, and it has become widely used in studies estimating richness using presence-only records (e.g. Ballesteros-Mejia et al., 2013; Haque et al., 2017).
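As a worked illustration of the estimator and of the completeness ratio, the Python sketch below uses the widely cited bias-corrected Chao1 form, S_est = S_obs + f1(f1 − 1) / (2(f2 + 1)), where f1 and f2 are the numbers of species represented by exactly one and exactly two records. This is an assumption about the formula: the 'iNEXT' implementation used by the authors may differ in detail (for example, by a sample-size correction factor), and the function names here are hypothetical.

```python
def chao1_bias_corrected(counts):
    """Bias-corrected Chao1 richness estimate from per-species record counts."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                       # observed species richness
    f1 = sum(1 for c in counts if c == 1)     # singletons
    f2 = sum(1 for c in counts if c == 2)     # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def completeness(counts):
    """Inventory completeness C = S_obs / S_est for one sampling unit or ecoregion."""
    s_obs = sum(1 for c in counts if c > 0)
    s_est = chao1_bias_corrected(counts)
    return s_obs / s_est if s_est > 0 else float("nan")


# Example: seven observed species, three singletons and two doubletons.
example_counts = [1, 1, 1, 2, 2, 5, 10]
print(chao1_bias_corrected(example_counts))   # 7 + 3*2 / (2*3) = 8.0
print(completeness(example_counts))           # 7 / 8 = 0.875
```

Note how the estimate is driven entirely by the rare species: with no singletons the estimate equals the observed richness and C is 1.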
Generally, areas with C ≥ 0.80 (i.e. at least 80% of the species present have been sampled) are regarded as having more complete inventories or 'well-surveyed' (Haque et al., 2017; Mora et al., 2008; Soberón et al., 2007). However, in sampling units and ecoregions where the number of records is low, the estimates may be inconsistent and yield false/artifactual C values, which are unreliable (Sousa-Baena et al., 2014). Using a similar approach as Stropp et al. (2016) and Haque et al. (2018), we evaluated the relationship between C and number of records to define a range at which C values are more reliable and found a monotonic relationship for sampling units with ~20 or more records and ecoregions with ~40 or more records (see Figure S1). Therefore, we present C estimates only for sampling units and ecoregions with at least 20 and 40 records, respectively. Additionally, we used a combination of minimum C value and median sample size as criteria to define well-surveyed areas. For sampling units, the median number of records was 100 (rounded to the nearest hundred), and for ecoregions, it was 2000 (rounded to the nearest thousand) (Figure S1).
Hence, we restricted well-surveyed areas to sampling units and ecoregions with C ≥ 0.80, and ≥100 records and ≥2000 records, respectively (note that these thresholds are relatively conservative, and the proportion of well-surveyed areas will vary with different completeness thresholds as shown in Table S2). Then, we estimated the number of unrecorded species as the difference between the estimated ($S_{\mathrm{est}(i)}$) and observed ($S_{\mathrm{obs}(i)}$) tree species richness. We also calculated the mean C value for ecoregions within 14 terrestrial biomes and statistically tested for differences in the mean C between forested and non-forested biomes using a two-sample t-test (results were considered significant if p < .05).
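The thresholds described above translate into a simple decision rule. The sketch below encodes them in Python for illustration; the function names and the category labels are hypothetical, while the numeric thresholds (20/40 minimum records, 100/2000 records and C ≥ 0.80 for 'well surveyed') are taken from the text.

```python
# (minimum records to report C, minimum records to count as well surveyed)
THRESHOLDS = {
    "sampling_unit": (20, 100),
    "ecoregion": (40, 2000),
}


def classify_unit(c, n_records, kind):
    """Classify one sampling unit or ecoregion by completeness and record count."""
    min_records, well_records = THRESHOLDS[kind]
    if n_records < min_records:
        return "insufficient records"     # C is not reported for these units
    if c >= 0.80 and n_records >= well_records:
        return "well surveyed"
    return "not well surveyed"


def unrecorded_species(s_est, s_obs):
    """Estimated number of species present but not yet recorded."""
    return s_est - s_obs
```

For example, `classify_unit(0.85, 1500, "ecoregion")` falls into "not well surveyed" despite its high C, because it has fewer than 2000 records.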
We then evaluated whether sampling effort was correlated with C using a linear model (results were considered significant if p < .05).
Simply sampling more will not necessarily lead to increasing knowledge or reducing species inventory gaps (Callaghan et al., 2021; Sousa-Baena et al., 2014). Thus, we identified areas of diminishing and increasing sampling opportunities by focusing on forested areas where most trees are found. Many definitions of forests have been published, varying from country to country or even among policymakers, depending on administrative units, land use and land cover types (Lund, 2002). Here, we used 2); 'Nature Could Recover' (the total area of remaining natural habitat and protected land is less than 50% but more than 20%, NNH 3); and 'Nature Imperilled' (the total area of remaining natural habitat and protected land is less than 20%, NNH 4). We divided C values into four quantiles (y-axis) and spatially intersected them with four NNH categories (x-axis). All data were analysed in R version 4.2.1 (R Core Team, 2022) and a summary of R packages used can be found in Table S3. A simple workflow of data processing is represented in Figure S2.

| Trends in data collection
Within the cleaned data, most species had few records: half had ≤20 records. Only 1860 (3.7%) species had ≥1000 records, yet they contributed 85.7% of records to the cleaned data set (Figure S3). The number of records has increased at an exponential rate since the late 1900s. The highest amount of data collected in a single year occurred in 2019 (1,359,037 records) and half of the cleaned data were collected from 2000 onwards (Figure 1).

| Inventory completeness and sampling effort
At both spatial scales analysed, sampling effort (number of records/km²) was globally unevenly distributed and more scattered for sampling units. Generally, the highest concentrations of records were in Europe, the United States, south-eastern Australia, New Zealand, Japan, Taiwan and South Korea (Figure S4).
Tree species inventory completeness was also not uniform across the world (Figure 2a,b).At the ecoregional scale, the estimated mean C was 0.77 (±0.17) and this reduced to 0.71 (±0.20) for sampling units (Figure S5).There were 5316 sampling units with ≥100 records and 417 ecoregions with ≥2000 records, but only 35% (293) of ecoregions and 18% (2394) of sampling units can be considered well surveyed (Figure S6 and Table S2)..22 for ecoregions and sampling units, respectively; Figure S7), indicating that higher sampling efforts tend to result in higher C.
However, there were exceptions with some areas having relatively low sampling effort but high C (e.g.most ecoregions in the boreal forests/taiga and tundra biomes).

| Future opportunity areas for sampling tree species
We plotted bivariate maps of tree species inventory completeness against NNH (for forested ecoregions) and forest integrity (for forested sampling units) to visualise the spatial distribution of areas with diminishing or future sampling opportunities (Figure 4). At the ecoregional scale, we found areas having low tree species inventory completeness (i.e. ecoregions with C < 0.8 and <2000 records) and retaining <50% of their natural habitat and protected area (NNH 3-4) concentrated in the north of Indomalaya, including most ecoregions in India, northeastern and southern China and the tropical regions of Southeast Asia (Figure S9a). Ecoregions within NNH 3-4 that have relatively high C (i.e. ecoregions with C ≥ 0.80 and ≥2000 records) were predominantly located in Europe, the east coast of Australia, the eastern United States, the east coast of South America, Madagascar, eastern China and South Korea (Figure S9b). Ecoregions that retain ≥50% of their total natural habitat and protected area (NNH 1-2) but that have low C included some islands in the Indo-Pacific regions and most ecoregions in southern Canada (Figure S9c). Central and northern Brazil, Mexico, central Russia and parts of Sumatra, Malaysia, Borneo and Sulawesi also fell within NNH 1-2 but have relatively high C (Figure S9d).
For sampling units, areas with low tree species inventory completeness (C < 0.80 and <100 records) and low forest integrity (i.e. highly modified and degraded forests) were concentrated in the New Guinea, Canada and Russia (Figure S10c).Lastly, areas with both high C and high forest integrity were scattered across New Zealand, east coast of northern Australia and western North America (Figure S10d).

| DISCUSSION
In this study, we analysed GBIF occurrence records of the world's tree species, characterised taxonomic and temporal trends in record accumulation and examined spatial patterns in sampling effort and tree species inventory completeness (C). We mapped the distribution of sampling effort and completeness of tree species inventories and analysed the relationship between sampling effort and C, illustrating that these two variables are positively, albeit weakly, correlated. We also extended our research to identify knowledge gaps and areas of diminishing and future sampling opportunities for future botanical surveys.

| Trends in data collection and species inventory completeness
The increase in the annual accumulation of digitised tree records, especially in the last 20 years (Figure 1), corresponds to a period of strenuous efforts to collect, digitise and mobilise biodiversity data, highlighting a growing interest in tree inventories and improved technology for accessing and sampling biodiversity (Heberling et al., 2021). Remarkably, the increased intensity of public participation in research (known as 'citizen science') coupled with the ease of data collection via mobile phone applications have led to the proliferation of observation-based records from citizen science programmes, such as eBird (Sullivan et al., 2014) and iNaturalist (Mesaglio & Callaghan, 2021). Exponential growth in digitised records over recent years has also been observed in global studies for other taxonomic groups, including butterflies and freshwater fish (Girardello et al., 2019; Mora et al., 2008).

FIGURE 2 Spatial distribution of tree species inventory completeness for (a) 747 ecoregions and (b) 6692 100 × 100 km sampling units. White areas represent regions with insufficient records (i.e. <20 and <40 records for sampling units and ecoregions, respectively) or areas excluded from this study (i.e. the Antarctic region and the Rock and Ice ecoregion). The mean tree species inventory completeness values were 0.77 (±0.17 SD) and 0.71 (±0.20 SD) for ecoregions and sampling units, respectively.
Despite the increasing number of digitised tree occurrence records available in the GBIF repository, spatial bias exists, with most areas of high record density being located in the Global North, a pattern also identified for other taxa including birds, mammals, amphibians and other vascular plants (Meyer et al., 2015, 2016; Serra-Diaz et al., 2018). There was also a potential taxonomic bias in digitised tree occurrence records, with over 85% of records coming from less than 4% of tree species. Previous studies have found that most plant species were represented by only a small number of records collected from a few sites, while a few broadly distributed species tend to dominate collections (Serra-Diaz et al., 2018; Tobler et al., 2007).

Tree species inventory completeness was also unevenly distributed globally. At the ecoregional scale, well-surveyed areas were concentrated in Europe, the United States and Central America, parts of South America and eastern Africa, Australasia and a few countries in east Asia. A similar pattern has been observed for inventory completeness of freshwater fish species at a country level (Pelayo-Villamil et al., 2018) and for butterfly inventories at a coarse spatial resolution of 880 km (Girardello et al., 2019).
Mean C for ecoregions within each biome revealed that tree inventories in tundra and boreal forests have high average completeness, contrasting with vertebrates (Meyer et al., 2015) and butterflies (Shirey et al., 2021), which were vastly under-inventoried in these biomes; this finding highlights some geographic differences in C among different taxa. For sampling units, well-surveyed regions were mainly located in the Global North, while low tree species inventory completeness was largely concentrated in the tropics, a pattern also common in other faunal and vascular plant taxa (García-Roselló et al., 2023; Girardello et al., 2019; Meyer et al., 2015, 2016).
Species inventory completeness can be driven by several variables, including geographic, socio-economic and environmental factors (Ballesteros-Mejia et al., 2013; Hughes et al., 2021; Meyer et al., 2015; Yang et al., 2014). Generally, regions with low species richness will have higher species inventory completeness because some species may be more difficult to detect in regions with high diversity (Ballesteros-Mejia et al., 2013; Riibak et al., 2020). Moreover, the observed tree species inventory completeness patterns are likely to be positively correlated with a country's gross domestic product per capita, financial and institutional resources, commitment to GBIF data sharing, as well as human population density and accessibility (Ballesteros-Mejia et al., 2013; Hughes et al., 2021; Meyer et al., 2015; Stropp et al., 2016; Tobler et al., 2007; Yang et al., 2014).
Human observations (particularly citizen science observations) are also closely linked to accessibility, collectors' preference and species appeal, which might bias our knowledge of species distributions to populated regions, protected areas and charismatic species (Girardello et al., 2019; Hughes et al., 2021). Spatial biases in sampling effort and species inventory completeness observed in this study could also be attributed to the fact that nearly 80% of the data came from human observations. Thus, efforts should be put into acquiring comparable data from less-accessible areas and poorly studied taxa where citizen science observations are insufficient (Heberling et al., 2021). Additionally, while botanical interests, research infrastructure availability and data mobilisation programmes contribute to high sampling coverage (Meyer et al., 2016), sociopolitical conflicts will draw away botanical interest and war-torn regions are unlikely to achieve high species inventory completeness (Meyer et al., 2015).

| Priorities for sampling tree diversity
We observed the highest estimated richness and the highest numbers of unrecorded tree species in the tropics. Cazzolla Gatti et al.'s (2022) global estimate of tree species richness also found a similar pattern of high tree species richness in the tropics, with South America containing the highest estimated richness and 40% endemism for undiscovered tree species. Here, we discuss areas of diminishing and increasing sampling opportunities and provide recommendations for future botanical surveys and conservation efforts at the ecoregional scale, because ecoregions provide a useful starting point for regional conservation planning (Dinerstein et al., 2017; Olson et al., 2001). Examples in which biodiversity conservation targets have been followed by practical and strategic actions are provided in Table 1.
TABLE 1 Recommendations for increasing biodiversity knowledge and conserving ecoregions based on the degree of tree species inventory completeness (C) and the amount of remaining natural habitat and protected land (NNH categories; Dinerstein et al., 2017).

| Degree of tree species inventory completeness and NNH category within ecoregions | Implication for ecoregions | Recommendation for appropriate ecoregional level management and sampling effort | Example of effective implementation guided by key principles of biodiversity conservation |
| --- | --- | --- | --- |
| Low completeness (i.e. C < 0.8 and <2000 records) and NNH 3-4 (NNH 3 = 'Nature Could Recover', NNH 4 = 'Nature Imperilled') | High anthropogenic impact means that opportunities for completing tree species inventories are diminishing. | Immediate action is required to identify and manage present threats to prevent further loss of natural habitat and forest integrity. Botanical surveys are required to document remaining tree species. | |

Ecoregions for which the area occupied by both natural habitat and protected land spans less than 50% (NNH 3-4) and C is low are areas likely to have diminishing sampling opportunity, especially in Southeast Asia. A recent study of >46,000 tree species reported that on average, about half of a tree species' range in a 110 × 110 km grid cell occurred outside protected areas and 13.6% of all tree species assessed lie totally outside protected areas (Guo et al., 2022). It is possible that many undiscovered tree species lie outside protected areas and are being threatened with high human pressure and environmental changes and that some species may become, or have already become, extinct without ever being documented. Thus, botanical surveys and biodiversity assessments are urgently needed in areas of diminishing sampling opportunities to document remaining tree species. We also recommend that immediate action be taken to assess and manage present threats to tree species in these regions and that limited financial resources be directed to conservation assessments within hotspots of tree endemism (Gallagher et al., 2023). Ecoregions with low C that contain ≥50% of natural habitat or protected land in their total area (NNH 1-2) are regions of future sampling opportunities where the greatest gain in new biodiversity knowledge can be acquired, especially in South America, central Africa, Borneo and New Guinea.
Moreover, these ecoregions also correspond to the top 17% of tree priority areas for conservation to increase the protected proportion of tree species ranges (Guo et al., 2022). Threats, including human pressures, in these areas should be limited and current and future conservation efforts should ensure that protected areas are effective in conserving existing tree species and forest biodiversity.
For well-surveyed ecoregions in which natural habitat and protected land together cover <50% of the area (NNH 3-4; as is the case for many developed regions such as Europe, the US and Australia), knowledge of tree species is relatively sufficient. However, the management of present and future threats will likely be required to protect threatened tree species. Botanical surveys should also be carried out periodically to confirm whether species collected historically are still representative of the areas in which they were originally recorded. This is important because spatial and temporal decay in occurrence data quality could affect their utility and reliability in ecological studies as well as their effectiveness in conservation actions (Tessarolo et al., 2017). Other well-surveyed ecoregions within NNH 1-2 have established tree species richness knowledge and should continue to be monitored and managed.
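The four combinations of completeness and NNH category discussed above can be summarised as a simple decision rule. The following is a minimal illustrative sketch, not part of our analysis: the function name and the wording of the returned labels are hypothetical, the threshold of C ≥ 0.8 is taken from the text, and the record-count criterion used to define well-surveyed ecoregions is deliberately left out for brevity.

```python
# Illustrative sketch only; names and label wording are assumptions.
C_WELL_SURVEYED = 0.8  # completeness threshold used in the text (C >= 0.8)


def ecoregion_recommendation(completeness: float, nnh_category: int) -> str:
    """Return the broad recommendation for an ecoregion.

    completeness  -- inventory completeness C, between 0 and 1
    nnh_category  -- Nature Needs Half category, 1-4 (Dinerstein et al., 2017)
    """
    if not 0.0 <= completeness <= 1.0:
        raise ValueError("completeness must lie between 0 and 1")
    if nnh_category not in (1, 2, 3, 4):
        raise ValueError("nnh_category must be 1, 2, 3 or 4")

    well_surveyed = completeness >= C_WELL_SURVEYED
    # NNH 1-2: at least half of the area is natural habitat or protected land.
    half_or_more_natural = nnh_category in (1, 2)

    if not well_surveyed and not half_or_more_natural:
        return "diminishing opportunity: survey urgently and manage present threats"
    if not well_surveyed and half_or_more_natural:
        return "future opportunity: prioritise surveys and limit threats"
    if well_surveyed and not half_or_more_natural:
        return "knowledge sufficient: manage threats and resurvey periodically"
    return "knowledge established: continue monitoring and management"


# Example: one ecoregion for each of the four combinations.
for c, nnh in [(0.3, 4), (0.3, 1), (0.9, 3), (0.9, 2)]:
    print(f"C = {c}, NNH {nnh}: {ecoregion_recommendation(c, nnh)}")
```

A rule of this kind only sorts ecoregions into broad groups; where an ecoregion sits close to the C threshold, the underlying C value and number of records should be inspected directly.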

| Limitation of study
At coarser spatial resolutions, caution must be taken when interpreting completeness because high C may result from sampling artefacts, where C is overestimated from a few well-surveyed sites. Analysis at finer resolutions reduces sampling artefacts (Sousa-Baena et al., 2014), but it may not always be achievable, especially for global analysis, because of existing gaps in sampling efforts. Although local conservation actions typically occur at a much smaller spatial scale (Meyer et al., 2015), the concept of ecoregions is becoming increasingly important as scientists realise that species-specific conservation methods may not allow for the conservation of ecological communities and ecosystems (Shreeve & Dennis, 2011). In fact, many countries, including Nepal and Namibia, have adopted ecoregion strategies for biodiversity management, which have involved Indigenous communities in managing large areas of land, and these strategies have proven successful, even in countries with low gross domestic product per capita (Dinerstein et al., 2017).
Therefore, our approach of using ecoregions achieved multiple goals.
Finer spatial scale analyses, however, should not be disregarded but viewed as complementary, since they provide additional information useful for local-scale conservation.
To account for the sensitivity of our estimator, we analysed completeness for sampling units and observed a pattern relatively consistent with the ecoregional scale, although C values were more scattered. Even at this resolution, many sites lacked high sampling effort and, if we were to define well-surveyed areas with stricter criteria, most of the world would be regarded as under-inventoried (Table S2). Additionally, because C is a ratio indicating how well a site has been surveyed, high C values (e.g. C ≥ 0.8) do not necessarily indicate that biodiversity knowledge is equal across regions. For example, an ecoregion in Brazil with 10,000 tree species might have a C of 0.8, but this also means that ~2000 species may not yet have been recorded. In contrast, an ecoregion in the Australian desert may have 10 tree species, and a C of 0.8 would mean that only two species may be unrecorded.
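The arithmetic behind this comparison can be made explicit. The sketch below is illustrative only and assumes that C is the ratio of observed to estimated species richness, so that the number of unrecorded species is the estimated richness minus the observed richness; the function name is hypothetical and the richness values are the two examples given above.

```python
# Illustrative sketch only; assumes C = observed richness / estimated richness.
def unrecorded_species(estimated_richness: float, completeness: float) -> float:
    """Number of species expected to be unrecorded for a given C."""
    if estimated_richness < 0:
        raise ValueError("estimated_richness must be non-negative")
    if not 0.0 <= completeness <= 1.0:
        raise ValueError("completeness must lie between 0 and 1")
    observed_richness = completeness * estimated_richness
    return estimated_richness - observed_richness


# Same completeness (C = 0.8), very different absolute knowledge gaps.
for label, s_est in [("species-rich ecoregion", 10_000), ("species-poor ecoregion", 10)]:
    gap = round(unrecorded_species(s_est, 0.8))
    print(f"{label}: estimated richness = {s_est}, unrecorded species = {gap}")
```

Reporting the absolute number of unrecorded species alongside C therefore helps to distinguish regions that share the same completeness value but differ greatly in how much remains to be documented.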
It is also important to note that the GTS is not a static database, and mistakes in taxonomy, distribution and lifeforms have been identified. For instance, non-tree species were erroneously included as tree lifeforms, hundreds of species names were synonyms and a few thousand tree species in national and regional botanical literature have not been included, particularly in developing countries (Mugal et al., 2023; Qian et al., 2019).
Nevertheless, the GTS is being regularly updated as more information becomes available. We recognise its value and use it in this study for evaluating the global completeness of tree species inventory. Moreover, our results were derived from digitised tree species occurrence records aggregated in GBIF, which is incomplete, and are therefore only relevant to GBIF data and not to all tree species that are known.
Due to this limitation, tree species inventory completeness may be under- or overestimated in some regions. GBIF was reported to have poor taxonomic coverage (with respect to the total reported species richness in an administrative unit) in many regions including islands, central and northeast Africa, the Middle East and central Asia (Keppel, Craven, et al., 2021). We acknowledge that in many regions of the world, inventory data may be available but not yet digitised and/or mobilised in GBIF or other electronic biodiversity databases, or specimens may have been collected but remain unidentified owing to a lack of resources (Girardello et al., 2019; Serra-Diaz et al., 2018). We also considered that tree species occurrence data may be stored in other major online biodiversity databases, such as the Botanical Information and Ecology Network and the Global Forest Biodiversity Initiative, and in region-specific electronic databases such as the Atlas of Living Australia, NeoTropTree, the Latin American Seasonally Dry Tropical Forest Floristic Network and the sub-Saharan tropical Africa database RAINBIO. Noteworthy efforts to survey global plant diversity include the sPlot database, which collated over 1 million vegetation plots and 23 million plant species records worldwide (Bruelheide et al., 2019). Also, as part of the BGCI initiative to assess the conservation status of global tree diversity, extensive field surveys have been conducted to collect tree species abundance data in many countries (Newton et al., 2015). Although biases and limitations exist in such expert assessments and targeted surveys, they have produced valuable additional information about tree species from around the world (Bruelheide et al., 2019; Keppel, Craven, et al., 2021; Keppel, Peters, et al., 2021). Nevertheless, these biodiversity databases may be limited in their spatial and taxonomic coverage, and it would be impractical to collect all the local biodiversity databases for use. Undoubtedly, GBIF still represents the best available indicator of worldwide 'shared biodiversity knowledge'.
We also recognise that our tree species occurrence data are not free from taxonomic, temporal and spatial biases and uncertainties despite having gone through rigorous data filtering procedures. Many of our records did not contain information on coordinate uncertainty although they fell within their native distributions. Inaccuracies and species misidentification have been reported for both experts and non-experts, and some citizen science data are comparable with data collected by professional scientists (Aceves-Bueno et al., 2017; Austen et al., 2016, 2018; Roman et al., 2017). The fact that nearly 80% of our tree species occurrence data came from human observations means there is likely some level of bias and uncertainty associated with them.
However, taxonomic misidentification and imprecise or incorrect sampling locations may not be fully eliminated as they are difficult to detect, especially for large databases (Meyer et al., 2016;Soberón & Peterson, 2004).

| Recommendation and direction for future research
For future studies, it is also crucial to investigate the temporal variation in occurrence data and evaluate when a site has achieved high species inventory completeness. This is important because the quality of spatial and temporal information in occurrence data is known to decay with time due to changes in species' distributions and range areas (Ladle & Hortal, 2013; Tessarolo et al., 2017). As many terrestrial regions and a quarter of global protected areas are predicted to be exposed to high rates of climate and land-use change by 2050 (Asamoah et al., 2021), changes in species ranges and distributions may cause many ecoregions to experience increasing species turnover. If most occurrence records are from historical periods, then, in view of species turnover and increasing land-use changes coupled with climate variability, it is uncertain whether the sets of species found several decades ago are still representative of what is found in those regions today (Stropp et al., 2016; Tessarolo et al., 2017).
This points to the risk of out-of-date knowledge in assuming that a site is well surveyed, while in reality, some species might have been extirpated from that area and new species might have colonised it (Stropp et al., 2016). Thus, regular and systematic botanical surveys are necessary to capture species and community-level changes. Future studies should also examine spatial patterns of C with climatic variability and land-use change data. These data can be used to evaluate the coverage of climatic and landscape conditions for well-surveyed regions and examine whether species occurrence data could be used to study species and community responses to future changes in climate and land use (Ronquillo et al., 2020; Sobral-Souza et al., 2021; Sousa-Baena et al., 2014). Poorly known regions (i.e. low C) with high species richness and climatic conditions distinct from those of well-surveyed regions are likely to contain many undiscovered endemic tree species and could also be prioritised for future botanical surveys to increase tree diversity knowledge (Sandel et al., 2020; Sousa-Baena et al., 2014).
Moreover, since digitised biodiversity data are known to contain uncertainties and biases, one approach to addressing data quality issues is to add or correct geographical coordinates, full collection dates and standardised scientific names for each record (Sousa-Baena et al., 2014), although this would have to overcome the fact that herbaria are often understaffed and underfunded. Another is to improve data digitisation processes to make it easier for citizen scientists to contribute to data collection. Systematic approaches to data cleaning and quality assessment will also be crucial for understanding the uncertainties and biases in occurrence data before they are applied in research, although there is no one-size-fits-all solution to data cleaning (Zizka et al., 2020). Other sources of valuable biodiversity data, many of which are housed in natural history collections waiting to be identified, should be made available in digital format and this effort should be supported sufficiently by funders (Sousa-Baena et al., 2014). Furthermore, with more emphasis on integrating local and Indigenous knowledge with western science, particularly under GBF Target 3, there is an opportunity to engage and train Indigenous communities to build their skills in collecting species occurrence information, especially in poorly sampled regions (Fauzi et al., 2016). Lastly, a commitment to data sharing should be encouraged, supported and sufficiently funded at regional and national levels to help improve the quality and coverage of digitised biodiversity data, which would greatly benefit future ecological studies, conservation planning and monitoring progress towards global biodiversity conservation targets (e.g. GBF Target 21 and Nature Needs Half) (Keppel, Craven, et al., 2021).

At the ecoregional scale, well-surveyed areas were concentrated around Europe, the United States, South America (mainly the northern region and Brazil), western and southeastern Africa, Russia, Australia and New Zealand. Even at this coarse scale, most ecoregions remain under-surveyed, including those located in northern Africa, the Middle East and Asia. For sampling units, inventories were more complete in eastern Europe, the eastern and western United States, Australia, New Zealand and Japan. In contrast, most regions in Asia, Africa and South America have low C (Figure S6). At both spatial scales, inventory completeness was only weakly predicted by sampling effort (coefficient of determination [r²] = .15 and […],b,d,e). The highest observed and estimated species richness was concentrated in the tropics, particularly South America, eastern China and Southeast Asia. The highest number of unrecorded species was located in the tropical regions of South America, central Africa and Southeast Asia, notably Borneo (Figure 3c,f).

FIGURE Trend in record accumulation (red line) and the number of digitised tree species occurrence records collected per year (blue bar lines) from 1900 to 2021. Records collected prior to 1900 and in early 2022 (<1% of total cleaned data) were excluded from the figure.

[…] tropics, particularly in Southeast Asia, west Africa and scattered across central Africa, northern and eastern South America, Cuba and China (Figure S10a). Areas with high tree species inventory completeness (C ≥ 0.80 and ≥100 records) and low forest integrity were mostly located in the Global North, particularly Europe, the eastern United States, the east coast of Australia, South Korea and Japan (Figure S10b). Areas with low C and high forest integrity (i.e. high forest connectivity and least impacted by human pressures) were predominantly located in central Brazil, central Africa, […]

FIGURE 3 Spatial distribution of the observed, estimated and unrecorded number of tree species for (a-c) 747 ecoregions and (d-f) 6692 100 × 100 km sampling units. White areas represent regions with insufficient records (i.e. <20 and <40 records for sampling units and ecoregions, respectively) or areas excluded from this study (i.e. the Antarctic region and the Rock and Ice ecoregion). The unrecorded number of species was calculated by subtracting the observed number of tree species from the estimated number. Values for the observed, estimated and unrecorded number of species are stretched on a natural logarithm scale.

FIGURE Bivariate maps showing the intersection of tree species inventory completeness (C) with (a) Nature Needs Half (NNH) categories for 448 forested ecoregions and (b) degrees of forest integrity for 5150 100 × 100 km sampling units. The NNH categories are numbered 1-4, representing ecoregions with different amounts of protected land and natural habitat remaining, as described by Dinerstein et al. (2017): 'Half Protected' (NNH 1), 'Nature Could Reach Half' (NNH 2), 'Nature Could Recover' (NNH 3) and 'Nature Imperilled' (NNH 4). The degrees of forest integrity were obtained from Grantham et al. (2020), with index values ranging from zero (i.e. areas most modified and degraded) to 10 (i.e. areas with high forest connectivity and least impacted by human pressures). Percentage values represent the proportion of ecoregions or sampling units within each quantile or NNH category. At the ecoregional scale, 1.6% and 1.8% of forested ecoregions fell within NNH 1 and had C ≤ 0.25 (lower left) and C > 0.75 (upper left), respectively. Moreover, 7.4% and 4.2% of ecoregions fell within NNH 4 and had C ≤ 0.25 (lower right) and C > 0.75 (upper right), respectively. For sampling units, 7.5% and 6.3% had the highest forest integrity and C ≤ 0.25 (lower left) and C > 0.75 (upper left), respectively. Additionally, 2.3% and 7.7% had the lowest forest integrity and C ≤ 0.25 (lower right) and C > 0.75 (upper right), respectively. White areas represent regions with insufficient data (i.e. <20 and <40 records for sampling units and ecoregions, respectively), regions not located within forested areas, or regions located in the Antarctic or the Rock and Ice ecoregion, which were excluded from this study.