Automated synthesis of biodiversity knowledge requires better tools and standardised research output

s (and associated metadata) to ensure key information is included in a format that facilitates machine-readability.


Introduction
Anthropogenic activity is negatively impacting the natural world (IPBES 2019); vertebrate populations are declining (WWF 2020) and species are being lost at rates reminiscent of mass extinction events (Ceballos et al. 2015). This biodiversity loss threatens ecosystem function (Rockström et al. 2009, Leclère et al. 2020), which humans rely on (Díaz et al. 2018), placing people and their livelihoods at risk. Much of our knowledge regarding environmental change impacts draws on global syntheses such as the PREDICTS (Hudson et al. 2017) and BioTIME (Dornelas et al. 2018) datasets, in addition to intergovernmental reports (IPCC 2014, IPBES 2019). The rapid growth in environmental literature over the last 30 years (Anderson et al. 2021) has been essential for facilitating these syntheses. However, as the literature continues to grow, syntheses become ever more challenging and time consuming (Ananiadou et al. 2009, Cohen et al. 2012). 'Big data' approaches, and associated computational tools, provide a means to wrangle the extensive ecological literature into usable information (Westgate et al. 2018). Much of the recent development in synthesis methods has been in expediting and automating the searching for (Grames et al. 2019), and screening of (Wallace et al. 2012, Shackelford et al. 2020, Cornford et al. 2021), papers to address research questions. Within the medical literature, some approaches have even managed to automate the entire systematic review procedure (Marshall and Wallace 2019, Gates et al. 2020, Marshall et al. 2020, Yang et al. 2020, Brassey et al. 2021). In the environmental sciences, automated topic models have provided insight into research trends (Hintzen et al. 2020) and the identification of knowledge gaps (Westgate et al. 2015), with text-classifiers allowing for the automated analysis of social media content to understand public opinions of nature (Johnson et al. 2021a).
Complementing these broader, summarisation approaches, direct extraction of ecologically valuable information from literature (e.g. species names and geographic locations) is a growing field, with recent examples including Akella et al. (2012), Millard et al. (2020) and Kulkarni and Di Minin (2021).
Harnessing big data approaches to automatically synthesise data found within individual publications could support the environmental sciences in capturing the abundance of primary literature for compilation projects (e.g. Hudson et al. 2017) and evidence reviews (e.g. the Conservation Evidence project; <www.conservationevidence.com>) within a fully reproducible pipeline. However, validating the outputs of automated approaches is crucial to ensure the tools are accurate and do not introduce unwanted biases (Westgate et al. 2018). Benchmarking also helps users compare between alternative approaches, and track performance gains as techniques improve. Unfortunately, collating data for such validation often requires extensive manual effort, making evaluation a challenge. An exception is the Living Planet Database (LPD; <https://livingplanetindex.org/data_portal>), a collection of vertebrate population time series, each tagged with a species name and monitoring location. The LPD is useful to test automated approaches for biodiversity assessment and ecological research for two core reasons. First, the LPD underpins the Living Planet Index, an aggregated index of changing vertebrate populations with important policy implications (WWF 2020). Second, the LPD is largely based upon research in the primary literature, meaning many records in the LPD can be traced back to a publication and, central to this work, an associated abstract. Here we use the LPD as a reference point to test the performance of automated synthesis approaches in an ecological context. Specifically, we evaluate the performance of automated approaches for three important tasks relating to the synthesis of biodiversity trends and provide recommendations on how to address detected limitations. While our analyses focus on ecological and biodiversity change data, the approach and identified issues are widely relevant to all environmental sciences. The three tasks are as follows:
1. Taxonomic entity extraction, i.e. finding which species were studied. Recent papers have used automated extraction approaches to identify species in text (Gerner et al. 2010, Akella et al. 2012) and general taxonomic patterns in ecological research (Millard et al. 2020). However, formal assessments of extraction accuracy and vulnerability to bias are still relatively scarce and warrant investigation.
2. Geographic location extraction, i.e. finding where the study was conducted (in this work, we group locations based on country borders, but other geopolitical/biogeographic boundaries could be specified). Whilst development in taxonomic extraction and application has accelerated in recent years (Gerner et al. 2010, Akella et al. 2012), geographic extraction has a far greater history and wealth of available methods (Buscaldi and Rosso 2008, Kitamoto and Sagara 2012, Speriosu and Baldridge 2013, Ding et al. 2018, Magge et al. 2018, Kokla and Guilbert 2020, Wang et al. 2020). Automated geographic extraction could be valuable for extracting countries in coarse spatial resolution synthesis projects, or in the pre-screening phase of fine resolution syntheses. However, automated geographic extraction has been rarely used in ecology and conservation, with very few examples of successful application (Fisher et al. 2011, Millard et al. 2020). As a result, whilst many methods have been developed (Kokla and Guilbert 2020), there is a general need to validate geographic extraction in the field of ecology.
3. Population trend extraction, i.e. summarising estimated population trends for studied species and locations. Developing methods that can synthesise ecological findings and data could help manage the ever-growing literature. Population trends, describing change in abundance over time, are amongst the most valuable types of data to compile as they meet the criteria of an essential biodiversity variable, and can thus directly support conservation management and policy (Pereira et al. 2013, Jetz et al. 2019).
In addressing these tasks we explicitly consider two potential 'leaks', which could limit the accuracy of automatically generated output. First, automated synthesis tools used to extract information from abstracts may be ineffective or biased, e.g. favouring the extraction of certain species or locations, and failing to detect population trends. Second, even if automated tools perform well in extracting information from abstracts, abstracts may not accurately represent full studies, e.g. the population trend in the abstract is over-emphasised or only example species are listed for a multi-species study.
To explore these tasks and leaks, we compiled 1556 English language abstracts from the LPD and assessed how well the outputs from the automated extraction aligned with that reported in the LPD (for species names and geographic locations). For 300 randomly sampled abstracts we also manually extracted species, locations and population trends, producing a dataset of publications with information extracted using three methods: 1) LPD estimates, manually extracted from full texts (full-text data); 2) information manually curated (by the authors) from abstracts (manually assessed abstracts); and 3) data automatically extracted from abstracts (automated). We compared alignment between these three extraction types to determine whether leaks were due to ineffective automated tools (automated estimates differ from both types of manual extraction) or abstracts lacking information present in the full text (manual extractions differ).

Taxonomic entity extraction
To automatically extract species names from abstracts, we used a two-step approach (detailed in the Supporting information and Millard et al. 2020). First, we used taxize::scrapenames (Chamberlain and Szocs 2013, Chamberlain et al. 2018) to extract potential taxonomic names from abstracts. We then applied string-matching to retain only Latin binomials also present in the 2017 Catalogue of Life (Roskov et al. 2017), ignoring non-vertebrates.
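As an illustration of the second, string-matching step, the following is a minimal Python sketch. It is not the taxize-based pipeline described above: the regular expression used to propose candidate names and the three-species `REFERENCE_BINOMIALS` set (a hypothetical stand-in for the Catalogue of Life) are assumptions made purely for demonstration.

```python
import re

# Hypothetical stand-in for a reference checklist; the study used the
# 2017 Catalogue of Life restricted to vertebrates.
REFERENCE_BINOMIALS = {"Panthera leo", "Loxodonta africana", "Gadus morhua"}

# Candidate pattern (an assumption): capitalised genus + lowercase epithet.
BINOMIAL_PATTERN = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def extract_binomials(text, reference=REFERENCE_BINOMIALS):
    """Return sorted candidate binomials that also occur in the reference list."""
    candidates = {
        f"{genus} {epithet}"
        for genus, epithet in BINOMIAL_PATTERN.findall(text)
    }
    # Exact string-matching discards false candidates such as "We monitored".
    return sorted(candidates & set(reference))


abstract = (
    "We monitored lions (Panthera leo) and African elephants "
    "(Loxodonta africana) in Kenya. Populations declined after 1990."
)
species = extract_binomials(abstract)
print(species)
```

The regular expression deliberately over-generates candidates; it is the intersection with the reference list that removes ordinary capitalised phrases, which mirrors the role of the checklist in the approach described above.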
We used three comparisons to evaluate the performance of automated taxonomic extraction: a) automated versus full-text data in the LPD; b) automated versus manually assessed abstracts; and c) manually assessed abstracts versus full-text data in the LPD. For each comparison we calculated recall (percentage of species in the latter present in the former, per publication) and bias (proportion of species within each order in the former divided by the proportion of species within each order in the latter). We investigated whether this bias had phylogenetic signal, whereby certain clades would be under- or over-represented, by measuring Pagel's λ across orders.
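The recall and bias definitions given here can be written out as a short Python sketch. The function names and the handling of empty inputs are assumptions for illustration; the Pagel's λ test is not covered.

```python
from collections import Counter


def recall(extracted, reference):
    """Percentage of reference items (e.g. species in one paper) also extracted."""
    reference = set(reference)
    if not reference:
        return None  # recall is undefined when the reference is empty
    return 100.0 * len(reference & set(extracted)) / len(reference)


def proportional_bias(extracted_groups, reference_groups):
    """Share of each group (e.g. order) in the extracted data divided by
    its share in the reference data; values below 1 mean under-detection."""
    ext, ref = Counter(extracted_groups), Counter(reference_groups)
    n_ext, n_ref = sum(ext.values()), sum(ref.values())
    if n_ext == 0 or n_ref == 0:
        raise ValueError("both inputs must contain at least one record")
    # Only groups present in the reference have a defined bias.
    return {g: (ext[g] / n_ext) / (ref[g] / n_ref) for g in ref}


# One publication: two of four reference species were extracted.
print(recall({"Panthera leo", "Gadus morhua"},
             {"Panthera leo", "Gadus morhua", "Lynx lynx", "Parus major"}))

# Orders of every species record pooled across publications.
reference_orders = ["Carnivora", "Carnivora", "Passeriformes", "Passeriformes"]
extracted_orders = ["Carnivora", "Carnivora", "Passeriformes"]
print(proportional_bias(extracted_orders, reference_orders))
```

Recall is computed per publication and then averaged, whereas bias is computed on pooled counts, so a tool can show high mean recall while still systematically under-detecting particular orders.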

Geographic location extraction
We used the CLIFF-CLAVIN geoparser (D'Ignazio et al. 2014) to extract focal geographic locations (countries and coordinates) from abstracts. As country strings can differ between the LPD records and those resolved by CLIFF-CLAVIN, we used the geographic coordinates from both to identify associated countries (see Supporting information for details). We measured the effectiveness of the automated geographic extraction in a comparable way to the taxonomic extraction, using recall based on country names, and bias as the proportional difference in country frequency between data extraction approaches.

Population trend extraction
We trained machine learning classifiers to predict aggregate, paper-level population trends using a paper's title and abstract (full details in the Supporting information). We assigned paper-level trend categories (increase, stable, decline or varied) based on the proportion and direction of significant population-level trends, which were themselves estimated from a log10-linear model of population time-series.
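To make the labelling step concrete, the sketch below fits a least-squares slope to log10-transformed abundances and aggregates population-level labels into a paper-level category. It simplifies the procedure in two ways that are assumptions rather than the rules used in this study (which are given in the Supporting information): a hypothetical slope `tolerance` replaces a significance test, and a paper is labelled 'varied' whenever its populations disagree.

```python
import math


def log10_slope(years, abundances):
    """Least-squares slope of log10(abundance) on year (abundances must be > 0)."""
    logs = [math.log10(a) for a in abundances]
    mean_x = sum(years) / len(years)
    mean_y = sum(logs) / len(logs)
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return sxy / sxx


def population_trend(years, abundances, tolerance=0.005):
    """Label one time series; `tolerance` is a hypothetical stand-in
    for a formal significance test on the slope."""
    slope = log10_slope(years, abundances)
    if slope > tolerance:
        return "increase"
    if slope < -tolerance:
        return "decline"
    return "stable"


def paper_trend(population_trends):
    """Paper-level label: the shared direction, else 'varied' (assumed rule)."""
    distinct = set(population_trends)
    if not distinct:
        raise ValueError("at least one population trend is required")
    return distinct.pop() if len(distinct) == 1 else "varied"


years = [2000, 2001, 2002, 2003]
trends = [
    population_trend(years, [100, 80, 64, 51.2]),  # shrinking by 20% per year
    population_trend(years, [10, 20, 40, 80]),     # doubling each year
]
print(trends, paper_trend(trends))
```

Working on the log10 scale means the slope describes proportional rather than absolute change, so large and small populations can be compared on the same footing.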
Both random forests and neural networks (constructed in Python; van Rossum 1995) were used to predict trends, representing two well-known and high-performing text classification techniques. The performance of these machine learning approaches is improved by larger amounts of training data and/or better data quality (Liu et al. 2019), which can be generated using data augmentation (Wei and Zou 2019). We explored the impact of text augmentation (e.g. randomly replacing words with a synonym) on the accuracy of our trend predictions, using the Python library EDA (easy data augmentation; Wei and Zou 2019).
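Synonym replacement, the augmentation example mentioned above, can be illustrated with a few lines of Python. This is a toy sketch and does not reproduce the EDA library or its interface: the `SYNONYMS` dictionary is hypothetical, and tokenisation is a plain whitespace split.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "decline": ["decrease", "drop"],
    "increase": ["rise", "growth"],
    "population": ["stock"],
}


def synonym_replace(text, n_replacements=1, seed=0):
    """Return a copy of `text` with up to `n_replacements` words swapped
    for a randomly chosen synonym; the word count is unchanged."""
    rng = random.Random(seed)  # seeded so augmented data are reproducible
    words = text.split()
    replaceable = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    n = min(n_replacements, len(replaceable))
    for i in rng.sample(replaceable, n):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


original = "the population showed a steady decline"
augmented = synonym_replace(original, n_replacements=2)
print(augmented)
```

Each augmented text keeps the trend label of its source text, so the training set grows without additional manual labelling; the replacement words must therefore preserve the direction of the described trend.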
Initial analyses (Supporting information), using the 1256 texts remaining after setting aside the 300 manually assessed abstracts, indicated that random forest classifiers incorporating data augmentation, but ignoring texts containing 'varied' population trends, performed best. We therefore tested a classifier of this specification on the 300 manually assessed abstracts, comparing the performance of our automated approach to both manual alternatives using accuracy and Cohen's kappa (Kuhn 2020).
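For readers unfamiliar with the two agreement measures, the sketch below computes accuracy and Cohen's kappa from first principles in Python. The six labels are invented for illustration and are not data from this study, which obtained these statistics through the software cited above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Proportion of items on which the two label sets agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal label frequencies of each rater.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use one label")
    return (p_o - p_e) / (1 - p_e)


# Invented labels for illustration only.
full_text = ["decline", "decline", "increase", "stable", "increase", "decline"]
automated = ["decline", "increase", "increase", "stable", "decline", "decline"]

print(round(accuracy(full_text, automated), 3))
print(round(cohens_kappa(full_text, automated), 3))
```

Kappa is lower than accuracy whenever some agreement would be expected by chance, which is why both statistics are informative when trend categories are unevenly represented.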

Taxonomic entity extraction
For all 1556 texts, the automatic taxonomic extraction recalled an average of 80.8% of species relative to the full text (SD = 35.0%). When considering only the 300 manually assessed texts, our automated approach achieved average recall of 83.7% (SD = 34.1%; Fig. 1a). Cases of low recall were primarily influenced by under-reporting of species within abstracts, as only 82.5% (SD = 35.2%; Fig. 1c) of species with population data in the full texts were recorded in the abstracts. In contrast, loss of information from the automated method was low, with an average of 93.6% recall (SD = 17.9%; Fig. 1b) when compared to data manually extracted from abstracts.
Our analysis also suggests taxonomic bias in both the automated tool and abstract content, when compared to the full text, with some orders substantially under- and over-detected (Fig. 1d). Despite these disparities, we did not detect any significant phylogenetic signal (Pagel's λ likelihood-test p-value > 0.05) in detection bias for any of our comparisons, although sample sizes of 16 and 17 orders may be a limitation here (Fig. 1d and Supporting information).

Geographic location extraction
The automated geographic extraction generally performed worse than the taxonomic extraction, accurately identifying an average of 69.1% (SD = 45.5%) of countries relative to the full-text extraction when considering the 1556 records. For the 300 manually assessed texts, average recall rose to 77.9% (SD = 40.4%; Fig. 2.1a). However, unlike the taxonomic extraction, this shortfall was driven by the poor performance of the automated geographic extraction relative to the manually assessed abstracts (mean recall = 82.1%, SD = 36.7%; Fig. 2.2a), as the manually assessed abstracts and full texts were well aligned (mean recall = 93.9%, SD = 22.6%; Fig. 2.3a).
The automated geographic extraction also showed bias, tending to over-assign records to countries with English as the first language (e.g. USA, UK and Australia; Fig. 2.1b and 2.2b) and under-assign records across South America and Southeast Asia. In contrast, comparing between full texts and manually assessed abstracts suggests more moderate over-/under-reporting of countries in abstracts (Fig. 2.3b). As an example, records labelled as France in the full texts were split between seven countries in automated extraction, nine countries when comparing the manually assessed abstracts to the automated extraction, and only two countries when comparing the full texts to the manually assessed abstracts (Fig. 2.1c, 2.2c and 2.3c).

Population trend extraction
Of the 300 manually assessed abstracts, 21 were classified as varied based on full-text data, and 180 as either varied or unclear by manual assessment. Here, we therefore present results based on the subset of these 300 texts where all classification approaches (automated, full-text data and manually assessed abstracts) produced categories of either increase, decline or stable (111 studies). These results allow for a fair comparison between approaches, but also likely over-estimate the performance of both our automated and manual approaches, as we focus on the simplest ecological (and textual) scenarios. Results based on all 300 manually assessed texts can be found in the Supporting information.
Figure 1. Recall and bias of automated taxonomic extraction using taxize and Catalogue of Life on our sample of 300 texts. (a-c) Distribution of recall (percentage of species successfully detected within each study) for the automated tool relative to the full-text data (a), automated relative to manually assessed abstract data (b) and the manually assessed abstracts relative to the full texts (c). The red line represents the mean recall within each of these comparisons. (d) Phylogenetic variability in detection bias of species within texts. Despite visible variation, we found no phylogenetic signal in detection bias (Pagel's λ likelihood-test p-value > 0.05). Each ring around the phylogenetic tree (a, b, c) relates to the comparisons between extraction approaches indicated in the left-hand column titles. Bias ranges from ×0.1 to ×10, where a value of 0.1 in ring a, for example, would indicate that a given order occurs 10 times less frequently in the automated extraction than in the data taken from the full text. The bias colouring is on the log10 scale. Grey indicates an absence of the order in the reference dataset.
Automated population trend prediction performed worse than either taxonomic or geographic data extraction, with accuracy of 64.9% compared to the full text (kappa: 0.473, 'moderate'; Fig. 3a). Interestingly, the accuracy of manual abstract categorisation compared to estimates based on data from the full text was lower still, at 57.7% (kappa: 0.387, 'fair'; Fig. 3c). Agreement between the manually assessed abstracts and automated classifications was also low (accuracy: 58.6%, kappa: 0.369, 'fair'; Fig. 3b), suggesting the automated and manual approaches made different mistakes.
Figure 2. Recall and bias of automated geoparsing using CLIFF-CLAVIN on our sample of 300 texts. (1a, 2a, 3a) Distribution of recall (percentage of countries successfully detected) for the automated tool relative to the full-text data in the Living Planet Database (1a), the automated tool relative to the manually assessed abstracts (2a) and the manually assessed abstracts relative to the full-text data (3a). The red line represents the mean recall within each of these comparisons. (1b, 2b, 3b) Spatial variability in detection bias of countries within texts, comparing full-text information versus automated, manual abstract assessment versus automated, and full texts versus manually assessed abstracts. Bias ranges from ×0.2 to ×5, where a value of 0.2 in 1b, for example, would indicate that a given country occurs five times less frequently in the automated extraction than in the data taken from the full text. The bias colouring is on the log10 scale. White countries indicate no representation in the reference dataset. (1c, 2c, 3c) Assignment of records in the comparison groups relative to France in the reference group. The grey line indicates a match between the reference and comparison group, whilst red indicates a mismatch. Line thickness describes its proportional frequency.

Discussion
Here we evaluated three tools to explore key challenges in automated synthesis of ecological and biodiversity knowledge. Our study explored their limitations and tested the potential sources of errors or information 'leaks'. The first two tools for automated taxonomic and geographic extraction delivered moderately successful performance. These approaches are already being used (Millard et al. 2020), and, compared to manual extraction, offer much faster and more easily reproducible data collation. However, we found that automated extraction of species and locations can introduce biases (e.g. over-/under-representation of certain taxonomic orders), and so should be used cautiously. The performance of taxonomic and geographic extraction was affected by the representativeness of abstracts (relative to the main text), and the biases inherent in automated algorithms. For example, the automatic taxonomic tool performed well in extracting Latin binomials from abstracts, but abstracts poorly represented main texts in terms of taxonomic coverage. On the other hand, the automatic geographic extraction tool performed poorly in extracting locations from abstracts, but abstracts represented main texts well in terms of geographic coverage. The third tool we developed and tested was a population trend extractor which delivered relatively poor performance driven by a lack of clarity regarding trend descriptions in abstracts (a problem concerning how research is presented in the literature) and by the complexity associated with summarising multiple trends into one value (an issue related to limitations of automated tools).
The relatively good performance of automated taxonomic and geographic extraction is promising for current/future synthesis projects, through application as a text prioritisation tool (an example of a project already using these approaches is EntoGEM; <https://entogem.github.io/>). One current issue with global synthesis projects is their substantial taxonomic (McRae et al. 2017, Troudet et al. 2017) and spatial biases (Gonzalez et al. 2016, Tydecks et al. 2018). These biases hinder inference and erode our ability to predict over space (Yates et al. 2018) and phylogeny (Johnson et al. 2021b). Automatically analysing the content of collated studies early in the synthesis pipeline could reveal imbalances/gaps in geographic and taxonomic coverage, which if not addressed would undermine subsequent analyses and conclusions. Prioritising the collation of studies that fill data gaps has already been used in some synthesis projects (Jones et al. 2009, Hudson et al. 2017), traditionally relying on manual searches. Using automated taxonomic and geographic identification tools to also identify such publications could speed up the collation of representative ecological data, enabling more rapid and accurate syntheses, thereby better informing conservation decisions.
While taxonomic and geographic tools can be recommended, the poorer performance of the trend extraction tool limits our ability to automate the entire trend synthesis process. Although the accuracy demonstrated here may be sufficient for providing a coarse, preliminary overview of population trend distributions in scoping searches, we suspect obtaining more reliable estimates per-study is currently unfeasible for a variety of reasons. First, nature can be complex, making the estimation of population abundance trends difficult (Humbert et al. 2009), and potentially inaccurate (Fournier et al. 2019). Descriptions of such trends are therefore likely to be linguistically complex and could vary depending on the trend estimation method used. Second, information is often not reported to facilitate synthesis, and abstracts can use polarising language, e.g. a population may be increasing, but this message could be confused if the text opens with negative or disaster-based language, or may describe only 'key results' that do not reflect the full content. Third, it is challenging to develop tools that can process complex texts or adequately capture information about multiple diverging trends.
The first two issues reflect nature itself and academic writing and are embedded in much of the published ecological literature. It seems unlikely that the way in which researchers write will change given its importance in the framing of research and the complex nature of some biodiversity patterns. Although future developments in machine-learning tools may enable accurate automated trend extraction, we think a more ambitious, short-term change is needed in the form of standardisation of abstract structures, language-use and inclusion of metadata. The importance of crafting titles, abstracts and keywords to ensure primary research is easily discoverable for use in syntheses has recently been highlighted (Hennessy et al. 2021), but automated synthesis would likely benefit from even more structure. Some journals, e.g. Global Ecology and Biogeography, structure abstracts into sections, where results and methods are isolated. These structures would limit the conflation of results with the disaster-based language often found in the introduction and discussion, thereby improving performance of metadata extraction.
Our study tackles some important challenges involved in automating ecological synthesis but there are limitations associated with the approaches we present. First, we focused solely on English language texts, and only explored tools designed for English. Although English is the main language of the scientific literature (Nuñez and Amano 2021) and of LPD articles, we recognise that considering publications in languages other than English is important for ensuring biodiversity knowledge/inference is unbiased (Konno et al. 2020). We therefore encourage future work to develop/evaluate similar automated synthesis tools for texts in a variety of languages. Second, we have only evaluated the automated extraction of data from article abstracts. Text-mining approaches are known to improve when full texts are used (Westergaard et al. 2018), with this work also finding that abstracts may not accurately represent the content of the full paper. However, we argue that as access to full-text articles is often restricted by paywalls, it is important that fast, accurate, automated syntheses can be performed using freely, and easily, available abstracts. Third, we assessed automated tools for extracting large-scale patterns of biodiversity change, i.e. qualitative population trends associated with countries and species. The collation of such information is vital for systematic maps/coarse resolution synthesis, but may struggle to capture the known nuances of biodiversity change, especially at local scales (Dornelas et al. 2019, Leung et al. 2020). Further work to enhance the granularity of data extraction and minimise identified biases is therefore needed before automated approaches are readily applied without caution. Finally, our analysis centres on a database of vertebrate population time-series. Previous research comparing automated and manual extraction of various animal pollinator species (mostly insects) from abstracts found recall of 79.5% (Millard et al. 2020), suggesting that the quality of automated taxon tagging may vary across phylogenetic groups. Further evaluation of automated species extraction across kingdoms and phyla is therefore required, especially as targeting data retrieval for less charismatic groups (i.e. not mammals and birds) is essential for furthering biodiversity knowledge (Guerra et al. 2020).

Conclusion
In this work we have explored the three broad tasks of extracting taxonomic names, geographic locations and population trends from article abstracts. We have shown that the species and country tagging tools perform sufficiently well for us to recommend their wider use, e.g. for study prioritisation and coarse-scale literature summarisation, but caution is needed, as these automated approaches can introduce biases, e.g. under-representing certain countries. Our trend extraction approach delivered poorer performance, being constrained by poor alignment between abstracts and the main text, poor text classifier performance, and the complexity of the population trend data and its descriptions. To facilitate improved automated synthesis within ecology, we recommend both the improvement of computational tools and better structuring of abstract text.