Fast, scalable, and automated identification of articles for biodiversity and macroecological datasets

Aim: Understanding broad-scale ecological patterns and processes is necessary if we are to mitigate the consequences of anthropogenically driven biodiversity degradation. However, such analyses require large datasets and current data collation methods can be slow, involving extensive human input. Given rapid and ever-increasing rates of scientific publication, manually identifying data sources among hundreds of thousands of articles is a significant challenge, which can create a bottleneck in the generation of ecological databases. Innovation: Here, we demonstrate the use of general, text-classification


| INTRODUCTION
Substantial anthropogenic change is degrading the natural world, creating an urgent need to understand the drivers and consequences of biodiversity loss to inform mitigation strategies (IPBES, 2019; WWF, 2020). For example, monitoring progress towards international conservation policy objectives, such as the Aichi Biodiversity Targets (CBD, 2010), requires the reliable, accurate and rapid tracking of changes in the state of nature (Collen & Nicholson, 2014; Walpole et al., 2009). Currently, collating data for such analyses is a time-consuming and largely manual process (Hudson et al., 2014), typically involving literature searches, manual screening of titles and abstracts for relevance, assessment of data quality, liaising with study authors to obtain data when necessary and entering usable data into the database. Estimates from other fields, such as medical systematic reviews, suggest that an experienced reviewer may take between 30 s and several minutes to assess an abstract (O'Mara-Eves et al., 2015). The number of scientific papers published annually is growing at 8-9% per year (Landhuis, 2016) and over 15,500 ecology-related papers were indexed in Web of Science in 2019. Manually creating and updating ecological databases will therefore become ever more laborious (Ananiadou et al., 2009; Cohen et al., 2012). If current data-collection techniques cannot keep pace, large portions of relevant, available data might not be incorporated, leading to suboptimal and potentially biased outputs that could not only hinder scientific progress (Nunez-Mir et al., 2016) but also misinform policy makers.
Combining text-mining and machine-learning approaches has the potential to substantially increase the rate of data discovery and database growth. These techniques have so far had relatively limited use in the biological sciences (Nunez-Mir et al., 2016), but evaluations in the context of producing medical systematic reviews show that they can classify accurately and save time (O'Mara-Eves et al., 2015), and can even correct human error (Bannach-Brown et al., 2019). Within ecology, Roll et al. (2018) recently used automated content analysis and artificial neural networks to accurately determine whether texts associated with the term 'reintroduction' were linked to conservation biology or another topic, and recommended further use of text mining and machine learning in conservation to better inform policy and management practices (Roll et al., 2018).
In this paper, we demonstrate how text classifiers trained through supervised machine-learning can identify papers containing ecological data, applying the approach to two high-profile biodiversity indicator databases as examples. The Living Planet Database (LPD: http://livingplanetindex.org/data_portal) contains population time-series data on over 4,000 vertebrate species collected from over 25,000 populations and is used to produce the Living Planet Index (LPI: Collen et al., 2009; Loh et al., 2005; McRae et al., 2017), one of the most widely used indicators of biodiversity (Mace & Baillie, 2007). The database of the PREDICTS (Projecting Responses of Ecological Diversity in Changing Terrestrial Systems) project (Hudson et al., 2017) collates ecological assemblage data from terrestrial sites worldwide that face different pressures relating to land-use change. More than 50,000 taxa and over 32,000 sites are included and from this the global status of a range of indicators, including the Biodiversity Intactness Index (BII: Scholes & Biggs, 2005), can be calculated (Newbold et al., 2016; Purvis et al., 2018). Both indicators have been used widely in high-profile reports, such as the IPBES (Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services) Global Assessment (IPBES, 2019) and the Living Planet Report (WWF, 2020). By tuning our workflow on LPD data and then applying this to PREDICTS, we illustrate the generality of our approach, which can be applied to any (ecological) database created using data extracted from literature sources.
Given the variety of text classifiers available (Khan et al., 2010; Kotsiantis et al., 2007), we compare the performance of easy-to-create logistic regression (LR) to 'black-box' neural networks, which can capture complex, nonlinear, non-additive relationships (LeCun et al., 2015). Furthermore, we identify aspects of text processing (e.g., 'stop word' removal) that influence the performance of classification models, a facet of methodology that can significantly affect performance but is often overlooked (Ananiadou et al., 2009; Uysal & Gunal, 2014). Specifically, we address the following questions:

| Data
The use of full-text articles can be preferable in terms of accuracy (Westergaard et al., 2018), but restricts the number of documents on which a classifier can be trained and subsequently applied. We therefore focus on the initial article-screening stage, and a method that uses only the titles and abstracts of scientific texts (see Supporting Information Appendix S1 for further details).

KEYWORDS
automated classification, biodiversity indicators, Biodiversity Intactness Index, ecological data, Living Planet Index, machine learning, text mining

Using our two example databases (LPD and PREDICTS), we define relevant texts as articles that have an English abstract, and have contributed data to, or have been identified as likely to contain data for that database. We identified 633 such records linked with the LPD and 536 with the PREDICTS database. Using these databases allowed us to test and explore our methods, but they could be applied to any such database.
We downloaded the top 125,000 'ecology' articles from the National Center for Biotechnology Information, using the Entrez Programming Utilities (Sayers, 2010); these served as irrelevant texts. For each database, we took a random sample of irrelevant records equal in size to the number of relevant records, with any papers known to contribute to the focal database being excluded from sampling (i.e., papers contributing data to the LPD could not be irrelevant for the LPD but could be for PREDICTS and vice versa).
Combining relevant and irrelevant records yielded 1,266 titles and abstracts for the LPD and 1,072 for the PREDICTS database that were used to train and test the classifiers.
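The balanced sampling described above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function name `build_balanced_dataset` and the record identifiers are hypothetical, and the sketch assumes only that each record has a unique identifier.

```python
import random

def build_balanced_dataset(relevant_ids, candidate_ids, seed=0):
    """Pair the relevant records with an equal-sized random sample of irrelevant ones.

    Hypothetical helper: identifiers stand in for title/abstract records.
    """
    relevant = set(relevant_ids)
    # Papers known to contribute to the focal database cannot serve as negatives.
    pool = sorted(set(candidate_ids) - relevant)
    rng = random.Random(seed)
    negatives = rng.sample(pool, k=len(relevant))
    labelled = [(doc_id, 1) for doc_id in sorted(relevant)]
    labelled += [(doc_id, 0) for doc_id in negatives]
    rng.shuffle(labelled)
    return labelled

# Toy example with made-up identifiers; two relevant papers also appear
# in the background download and must be excluded from the negatives.
relevant_lpd = ["lpd_%d" % i for i in range(5)]
background = ["bg_%d" % i for i in range(50)] + relevant_lpd[:2]
dataset = build_balanced_dataset(relevant_lpd, background)
```

With 633 relevant LPD records, the same procedure would give the 1,266 labelled texts reported above.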

| Text classifiers: construction, training, testing and analysis
We compared two binary classification techniques: logistic regression (LR), which represents a strong but easy-to-create baseline, and a convolutional neural network (CNN), which offers a leading-edge alternative (Zhang & Wallace, 2015). For each method, the specific text-processing stages were varied to assess how these factors affected the performance of the classifiers. Figure 1 summarizes the computational workflow, a detailed description of which can be found in Supporting Information Appendix S1.
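To make two of the text-processing stages in Figure 1 concrete, the sketch below applies stop-word removal and the tf-idf weighting given in the Figure 1 caption (idf = log(D/d)) to a toy corpus. It is an illustration under stated assumptions, not the workflow in Appendix S1: the stop-word list is a tiny made-up set rather than the NLTK list, stemming is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; not the NLTK list).
STOP_WORDS = {"the", "of", "in", "a", "and"}

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(corpus):
    """Return one {term: tf * log(D / d)} dictionary per document."""
    docs = [Counter(tokenize(text)) for text in corpus]
    n_docs = len(docs)                                   # D
    doc_freq = Counter(t for doc in docs for t in doc)   # d for each term
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in doc.items()}
        for doc in docs
    ]

corpus = [
    "population decline of the vultures",
    "population trends in forest birds",
    "evolution of beak shape in birds",
]
vectors = tf_idf(corpus)
```

Terms that occur in fewer documents receive larger weights: 'vultures' (one document) is weighted log(3), whereas 'population' (two documents) is weighted log(3/2).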
To assess classifier performance, we used 10-fold cross validation and average area under the receiver operating characteristic curve (AUC) scores (LeDell et al., 2015). Generalized linear models (GLMs) were used to determine the influence of different workflow choices on classifier performance and we retained the best models for subsequent testing and application (see Supporting Information Appendix S1.3 for details).
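The evaluation scheme can be sketched in a few lines of plain Python. This is a simplified illustration rather than the authors' implementation: `auc` uses the rank-based definition (the probability that a randomly chosen relevant text is scored above a randomly chosen irrelevant one, with ties counting one half), and `k_fold_indices` is a hypothetical helper that partitions record indices into 10 folds.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_records, k=10, seed=0):
    """Shuffle record indices and deal them into k near-equal folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Toy scores: one of the four positive-negative pairs is mis-ordered.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])

# Fold layout for a dataset the size of the LPD training set (1,266 texts).
folds = k_fold_indices(1266)
```

In each round, one fold is held out for scoring and the remaining nine are used for training; the ten resulting AUC values are then averaged.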
Although improving data discovery for a single database has value, the broader potential of a text classifier depends on how well and how readily it can be transferred to other biodiversity databases. We optimized our text-processing workflow using the LPD texts before applying the best procedures to the PREDICTS data, providing an example of how such techniques can be readily transferred to various databases.

| Comparing classifiers to search engines
To compare the data discovery rate of our workflow with a search engine, we conducted targeted literature searches for each of the two example databases (Supporting Information Appendix S1.5 and Table S1.3). Articles from each search were ordered separately according to their relevance, as predicted by the search engine (Scopus for LPD, Web of Science for PREDICTS) or our best classifiers (see Results). For each ranked list, RC spent 15 min manually classifying papers as relevant or not based on the content of their titles and abstracts. Binomial mixed-effects models were then used to compare the effect of ranking type (search engine or model) on the proportion of relevant articles found for each database. Search topic (the queries used to conduct each literature search) was specified as a random intercept and equivalent models were also fitted to the manual classifications generated after the first 10 and 5 min.

FIGURE 1
Graphical depiction of text-processing workflow and the classifier training. Orange boxes indicate stages of the workflow that were systematically varied; grey boxes represent processing that was constant across all indicated models. For details of the text-processing stages, see Supporting Information Appendix S1 and Table S1.1. Stemming reduces words to their root, for example, 'ecological' would be shortened to 'ecolog'. tf = term frequency (the number of times a term occurs in the text being considered); tf-idf = term frequency-inverse document frequency whereby the term frequency is multiplied by the term's inverse document frequency (idf = log(D/d), where D is the total number of documents in the training corpus and d the number of documents in which the term of interest occurs); LR = logistic regression; CNN = convolutional neural network.

Ten per cent of the manually classified papers were sampled at random and double-checked by experienced members of the LPD and PREDICTS teams to determine the level of agreement between the manual categorizations. Mixed-effects models were re-fitted using the expert classifications to assess how any re-classifications affected the coefficient estimates for ranking type.
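The comparison between ranking types can be illustrated with a small sketch. It makes a simplifying assumption that is not in the study design: a fixed number of screened papers stands in for the fixed time budget (15, 10 or 5 min). The labels and scores are made up, and the function names are hypothetical.

```python
def rank_by_score(items, scores):
    """Re-order items from highest to lowest classifier score."""
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order]

def proportion_relevant(ranked_labels, n_screened):
    """Share of the first n_screened papers that were judged relevant."""
    screened = ranked_labels[:n_screened]
    return sum(screened) / len(screened)

# Manual relevance labels (1 = relevant), listed in search-engine order.
engine_order = [0, 1, 0, 0, 1, 1, 0, 1]
# Hypothetical classifier scores for the same eight papers.
classifier_scores = [0.2, 0.9, 0.1, 0.3, 0.8, 0.7, 0.4, 0.6]

classifier_order = rank_by_score(engine_order, classifier_scores)
engine_rate = proportion_relevant(engine_order, 4)
classifier_rate = proportion_relevant(classifier_order, 4)
```

In this toy case, screening the first four papers finds one relevant paper under the search-engine ordering and four under the classifier ordering. The study compares such proportions across search topics with binomial mixed-effects models, which this sketch does not attempt.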

| Potential for iterative improvement of classifiers
Performance is expected to improve with the size of the training set (Liu et al., 2019), and the speed of improvement is an important determinant of the general usefulness of an approach.
To explore whether iteratively expanding the classifier training data has the potential to improve the classifiers, we used cross-validation and AUC scores to separately assess how the size of the training dataset and addition of new texts from literature searches influence the predictive performance of the models used in 2.3 (see Supporting Information Appendices S1.6 and S1.7 for details).

| Insight into classifier decision making
Within the LR models, each term in the training texts-for example, an individual word or word stem-is associated with a learned weight (see Supporting Information Appendix S1.1.1 for details).
To identify the terms having the most influence on predicted relevance, term weights were extracted from the best-performing LR models. The 50 most positively and 50 most negatively weighted terms were inspected to see if they could cause biases in the classifications.
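The inspection of term weights can be sketched as follows. The weights are invented for illustration (only some of the stems, such as 'pop', 'abund', 'landscap', 'forest', 'declin' and 'evolv', appear in the Results; 'gene' is a made-up example), and `top_terms` is a hypothetical helper rather than part of the published code. The study inspects the top 50 terms; the toy example uses the top 2.

```python
def top_terms(weights, n=50):
    """Return the n most positively and n most negatively weighted terms."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n], ranked[::-1][:n]

# Hypothetical learned weights from two dataset-specific LR models.
lpd_weights = {"pop": 2.1, "abund": 1.8, "declin": 1.2,
               "speci": 0.1, "gene": -1.5, "evolv": -1.9}
predicts_weights = {"landscap": 2.0, "forest": 1.7, "abund": 0.9,
                    "pop": -0.2, "gene": -1.1, "evolv": -2.2}

lpd_pos, lpd_neg = top_terms(lpd_weights, n=2)
pre_pos, pre_neg = top_terms(predicts_weights, n=2)

# Terms shared between the two models' top lists.
shared_pos = set(lpd_pos) & set(pre_pos)
shared_neg = set(lpd_neg) & set(pre_neg)
```

Comparing the sizes of `shared_pos` and `shared_neg` mirrors the overlap counts reported in the Results for the 50 most positive and 50 most negative terms.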

FIGURE 2
Cross-validation and test AUC scores for selected models. All models display strong performance on both the LPD and PREDICTS texts. The LR A and CNN A models make use of abstracts and a number of text-processing stages. Forgoing the text-processing stages (B models) causes model performance to drop only slightly, or improve in the case of the PREDICTS-trained CNN B. However, not using abstracts (C models) leads to a larger performance decrease, especially for the LPD texts. Note truncated y axis starts at .90. Circles and error bars show the mean and 95% confidence intervals, respectively, of the AUC scores from 10-fold cross-validation.
Note: A = models identified as best using AUC (area under the receiver operating characteristic curve); B = models equivalent to A but using simpler text-processing; C = models equivalent to A but not using abstract text; LR and CNN = logistic regression and convolutional neural network models, respectively; NLTK = the Natural Language Toolkit stop word list; df = document frequency; tf = term frequency; tf-idf = term frequency-inverse document frequency.

| RESULTS
Overall, AUC scores indicate that both the LR and neural network models performed very well, with little difference between the approaches. Among the text-processing choices tested (Figure 1), the type of text data is the most important factor influencing classification performance, for both models (Figure 2, Supporting Information Figure S5.4 and for metrics associated with the selected models).
Ranking search results using the LR A classifier led to a significantly higher proportion of potentially relevant papers being discovered after 15, 10 and (for the LPD searches) even 5 min than if the search engine rankings were used (Figure 3 and Supporting Information Table S4.7). For example, when manually screening LPD-related searches for 10 min, use of the classifier increased the average proportion of relevant papers found from .48 to .65.

Experienced database users (LPD and PREDICTS team members) agreed with RC's manual classifications in 87% (47/54) and 95% (37/40) of cases, respectively. The positive effects of the classifier on discovery rate increased slightly when using the expert classifications of the sampled texts in combination with the rest of RC's classifications (Supporting Information Figure S5.5), suggesting that the benefits of using the LR A models are not driven by any initial classification errors.
Larger training datasets enhance predictive performance of LR A-style classifiers but with diminishing returns. Furthermore, even models trained with just 200 texts achieve average AUC ≥ .98 (Figure 4a). Expanding the training data to include texts identified during the literature screening also substantially improves the performance of the LR A-style classifiers on real-world search results.
Up-weighting new negatives relative to the original negatives produces the best performance (Figure 4b).
Generally, the most positively weighted terms are associated with the respective indicator database; for example, 'pop' and 'abund' for the LPD and 'specy' (stemmed form of 'species') and 'landscap' for PREDICTS (Figure 5 and Supporting Information Table S4.8).
The most negatively weighted terms for both datasets represent topics in ecology less connected to either the LPD or PREDICTS, such as 'evolv'. Negative terms are more similar across the dataset-specific models than positive terms: 21 terms are shared between the 50 most negatively weighted features for the LPD and PREDICTS models, whereas only 9 are shared between the 50 most positively weighted. Interestingly, some terms stand out as potential artefacts of biases in the training texts, for example, 'declin' for the LPD and 'forest' for PREDICTS.

| DISCUSSION
Collating ecological data is essential for understanding the natural world and how it is affected by anthropogenic activity.
Macroecological datasets in particular are critical for exploring the extent to which impacts of such activity can be generalized across space and taxa. We have shown that by using text mining and automated classifiers we can speed up the identification of newly [...] (Pfeifer et al., 2014) and BioTIME (Dornelas et al., 2018). We therefore encourage others to use and improve upon our work to collate urgently needed biodiversity data.

FIGURE 3
A comparison of the proportion of papers that are manually classified as relevant, when working through papers according to the search engine or our best logistic regression model. Using the automated classifiers is beneficial across all timespans (positive β values and most points above dashed grey line). β values represent the impact of using the classifiers compared to the search engine ranking. Significance is indicated († p < .1, **p < .01, ***p < .001) from two-tailed tests. Residual degrees of freedom are 21 for LPD-related models and 27 for PREDICTS.

Biodiversity indicators represent an important output of ecological datasets whose usefulness is maximized if they are derived from up-to-date information (Collen et al., 2008; Walpole et al., 2009). The growing availability of dashboards and portals serving indicators and similar derived biodiversity data products (for example, The Biodiversity Indicators Dashboard (Han et al., 2014) and Map of Life (Powers & Jetz, 2019)) further attests to the need for a dynamic view of the state of nature. Our

FIGURE 4
The impact of training set size and sample weighting on classifier performance. (a) When using LR A specified models, larger training sets improve predictive performance. However, a plateau is apparent indicating that beyond approx. 1,000 texts, little additional improvement is made. Crucially, even when using training sets of 200 texts (100 positives) average AUC scores exceed .98 for LPD classifiers and are around .99 for PREDICTS classifiers. (b) Incorporating new texts demonstrably boosts classifier performance when compared to the original (dashed lines). Up-weighting new negatives relative to original negatives confers benefits, with optimal relative weighting being around .6-.7. Note different axis limits in (a) and (b). In (a), the thick lines show the range of mean AUC values calculated over 10 replicates of 10-fold cross-validation, the associated thin lines show the range of the 95% confidence intervals. The circles and error bars show the mean and 95% confidence intervals, respectively, of the AUC scores from 10-fold cross-validation of the complete datasets, as reported in

| Overall performance of classifiers
Our best classifiers have accuracy, precision and recall that compare favourably with studies using automated text classification to identify papers relevant to medical reviews, which typically report similar (Adeva et al., 2014; Ananiadou et al., 2009; Bannach-Brown et al., 2019) or lower values (Cohen et al., 2012). Although one might naively expect the more complex neural network to outperform the simple logistic classifier (Joulin et al., 2017), we did not find that to be the case here.
Researchers may therefore choose a classification technique based on other considerations, for example, bias assessment/mitigation where LR classifiers offer higher levels of transparency concerning model 'decision making' than do 'black-box' neural networks (see 4.3).
Incorporating abstracts, rather than just using titles, substantially improves classifier performance. Adeva et al. (2014) found the same qualitative pattern, and Westergaard et al. (2018) demonstrated that text mining of biomedical literature performs significantly better when using full-text articles rather than abstracts. These findings all suggest that text mining benefits from the greater information content of longer texts. While a classifier trained on full texts may display increased performance, the number of articles available for subsequent screening would be smaller due to current access limitations such as pay-walls and copyright issues, though the increasing prevalence of open-access publishing means that these limitations may be transient.
By optimizing our workflow using the LPD texts and then transferring the identified procedure to the PREDICTS data, we demonstrate the general applicability of our methods. Given that PREDICTS-trained models perform at least as well as those trained on the LPD data (Figures 2 and 3), applying our approach to databases like BioTIME (Dornelas et al., 2018) could quickly help identify additional relevant data sources.

| Limitations
Classifiers that we did not consider may perform differently and/or be more sensitive to text-processing procedures. We have also not addressed how the architecture of deep-learning networks could influence model performance. While a thorough exploration of CNN hyperparameters would be expected to improve performance (Zhang & Wallace, 2015), the principal aim of this paper has been to develop the use of text-mining techniques within ecological data collation workflows and demonstrate their potential benefits.
Having assessed both a strong baseline (LR using bag-of-features) and a leading-edge option (deep learning with word vector representation) (Joulin et al., 2017), our work shows the usefulness of such techniques within ecology.

| Future development
Concerns have been raised recently that machine-learning models may contain bias, primarily due to being trained on imperfect data (Bolukbasi et al., 2016; Tramer et al., 2017). Given the imbalances that exist within ecological datasets (Gonzalez et al., 2016; McRae et al., 2017), text classifiers like ours could propagate bias, as evidenced by the strong influence of certain terms in the logistic models, for example, 'forest' and 'fish'. The accumulation of biases within biodiversity datasets is detrimental to their scientific goals (Gonzalez et al., 2016). Consequently, there is a clear need to assess classifiers carefully for bias prior to their widespread application and to check the representativeness of any subsequently acquired data to mitigate this risk. Further research into this area, especially with regard to biodiversity data coverage, could provide substantial insight into these issues and how best to combat them effectively.
One potential solution could involve technical developments of the text-mining process to ignore specified 'bias terms' and/or preferentially return information associated with entities (e.g., taxa and locations) that are currently under-represented in the focal dataset.
Text-mining techniques could therefore not only increase the rate at which data are incorporated into biodiversity assessments but might also contribute to making ecological databases more representative of reality, to better inform conservation and policy decisions.
Although the methods discussed here can help researchers collate available biodiversity data, in the longer term, it is also critical that published/collected ecological data are made more accessible for use in syntheses (McMahon et al., 2011; Poisot et al., 2019).
Approaches to facilitate this include searchable, centralized repositories (similar to genetic sequence databases, e.g., GenBank; Benson et al., 2012), standardized data formats (Poisot et al., 2019), and/or the use of a machine-readable mark-up language within articles (Bourne et al., 2008). Crucially, such changes require strong incentives to ensure that the original data collectors receive appropriate recognition for their scientific contributions (Bourne, 2005; Ewers et al., 2019). Although a substantial challenge, these developments would enable the more complete and rapid synthesis of ecological data, which is essential for mitigating biodiversity loss and its associated challenges.

| CONCLUSION
We have shown that combining text mining and simple machine-learning classifiers is highly effective in identifying papers relevant to ecology datasets. We demonstrate this using two globally recognized biodiversity indicators, but our method is applicable to any dataset composed of data from literature sources. Interestingly, even relatively simple models based on LR perform very well, on a par with more complex neural networks. The wider adoption of these techniques could therefore rapidly increase the rates of data discovery and collation across a wide range of ecological datasets. Removing the discovery bottleneck would substantially help researchers to keep datasets up-to-date and representative of the natural world, both of which are critical for accurately monitoring conservation progress and informing policy. To facilitate further application and development we provide code for building and using such classifiers.

ACKNOWLEDGMENTS
This work was supported by the Natural Environment Research Council (grants NE/R012229/1 and NE/M014533/1). This paper is a contribution to the Grand Challenges in Ecosystems and the Environment initiative.