Large language models help facilitate the automated synthesis of information on potential pest controllers

The body of ecological literature, which informs much of our knowledge of the global loss of biodiversity, has been experiencing rapid growth in recent decades. The increasing difficulty of synthesising this literature manually has simultaneously resulted in a growing demand for automated text mining methods. Within the domain of deep learning, large language models (LLMs) have been the subject of considerable attention in recent years due to great leaps in progress and a wide range of potential applications; however, quantitative investigation into their potential in ecology has so far been lacking. In this work, we analyse the ability of GPT‐4 to extract information about invertebrate pests and pest controllers from abstracts of articles on biological pest control, using a bespoke, zero‐shot prompt. Our results show that the performance of GPT‐4 is highly competitive with other state‐of‐the‐art tools used for taxonomic named entity recognition and geographic location extraction tasks. On a held‐out test set, we show that species and geographic locations are extracted with F1‐scores of 99.8% and 95.3%, respectively, and highlight that the model can effectively distinguish between ecological roles of interest such as predators, parasitoids and pests. Moreover, we demonstrate the model's ability to effectively extract and predict taxonomic information across various taxonomic ranks. However, we do report a small number of cases of fabricated information (confabulations). Due to a lack of specialised, pre‐trained ecological language models, general‐purpose LLMs may provide a promising way forward in ecology. Combined with tailored prompt engineering, such models can be employed for a wide range of text mining tasks in ecology, with the potential to greatly reduce time spent on manual screening and labelling of the literature.


| INTRODUCTION
Much of our knowledge of the global loss of biodiversity stems from large-scale syntheses of the ecological literature, such as the WWF Living Planet Index (LPI, 2024), PREDICTS (Hudson et al., 2017), and BioTIME (Dornelas et al., 2018) databases, as well as global reports such as those of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES, 2019). The ecological literature has simultaneously seen rapid growth in recent decades (Anderson et al., 2021), making it increasingly difficult to synthesise this literature manually (Ananiadou et al., 2009). Thus, there is a growing demand for automated methods. Text mining and natural language processing (NLP) methods are expected to have significant potential in automating tasks such as document classification, named entity recognition (NER) and disambiguation, and the extraction of relations between entities (Farrell et al., 2022). Previous approaches in ecology have focused heavily on named entity recognition of species and taxonomy (Akella et al., 2012; Gerner et al., 2010; Le Guillarme & Thuiller, 2022; Millard et al., 2020) as well as geographical locations or population trends (Cornford et al., 2022), but also include document classification (Cornford et al., 2021) and relation extraction (Kaur et al., 2019). Furthermore, various gold standard databases of species names and taxonomy have been published to aid the evaluation of NER approaches in ecology (Abdelmageed et al., 2022; Nguyen et al., 2019).
Within the domain of deep learning (DL) for NLP tasks, large language models (LLMs) have been the subject of considerable attention in recent years by virtue of great leaps in progress and a wide range of potential applications (OpenAI, 2023a). This deep learning revolution in NLP is driven by the transformer architecture (Vaswani et al., 2023), which underlies many innovative DL tools in the natural sciences, such as the AlphaFold model for protein structure prediction (Jumper et al., 2021). Recent transformer-based LLMs are trained on large amounts of input data such that they are able to generate realistic text and facilitate human-computer interaction via natural language (Ouyang et al., 2022). Provided with the right prompts, these models can exhibit advanced reasoning and problem-solving capabilities (Kojima et al., 2023; Wei et al., 2023; Zhou et al., 2023). In particular, OpenAI's fourth-generation Generative Pre-trained Transformer (GPT-4) has seen major improvements over previous models on a variety of benchmarks (OpenAI, 2023a). GPT-based models have already been applied to multiple research domains, including finance (Wu et al., 2023) and medical science (Chen et al., 2023; Hu et al., 2023; Silva et al., 2024); however, quantitative investigation into their potential use in ecology is lacking.
GPT-4 is a current state-of-the-art large language model and is easily accessed using ChatGPT, making it an attractive choice to demonstrate the potential of LLMs for automated information extraction and knowledge synthesis from scientific texts. However, it is important to note that there exist many competitive open-source alternatives to closed-access models such as OpenAI's GPT series, and the landscape of open-source LLMs is rapidly evolving (e.g. Dey et al., 2023; Touvron et al., 2023; Zhang et al., 2022). This availability of open-source LLMs is crucial to fostering open, reproducible and equitable uses of AI in ecology. Thus, while GPT-4 does not currently adhere to open science standards, the approach outlined in this paper serves as a proof of concept for the use of prompt-based interaction with general-purpose LLMs for ecological information extraction.
Here we use GPT-4 to extract information from ecological literature on biological pest control. Natural enemies of pests, such as arthropod predators and parasitoids, can be used as biological control agents and provide an effective way to reduce pesticide usage, which is currently a major driver of insect declines (Wagner et al., 2021). Biological control has historically relied on the introduction of non-native species (classical biocontrol), which can be detrimental to local ecosystems and negatively affect biological diversity. Natural biological control, on the other hand, utilises native species as biological control agents and is typically achieved through the incorporation (and enhancement) of natural habitat in agricultural systems. Natural biological control helps directly regulate the frequency of pest outbreaks (Letourneau, 2012) and, indirectly, as a result of reduced pesticide usage and increased natural habitat, can result in improved soil quality (Gunstone et al., 2021), increased crop yields (Dainese et al., 2019) and increased abundances of other beneficial organisms such as pollinators (Wratten et al., 2012). Crucially, by strengthening the stability and resilience of ecosystem services such as pest control, pollination and nutrient cycling, food production systems can be better buffered against environmental and climatic changes (Brittain et al., 2013; Martin et al., 2019; Oliver et al., 2015), which is of growing importance as the impacts of climate change continue to intensify (IPCC, 2023).
Here we analyse the ability of GPT-4 to reliably extract information from scientific abstracts to identify pests and pest controllers. In addition to determining the ecological roles of species, the model is tasked with extracting taxonomy, geographic location and role-specific information such as pest type and the crop or plant that a pest affects.
As such, the task comprises multiple subtasks, which require both the capability to recognise and disambiguate entities (named entity recognition) and the capability to extract relations between entities (relation extraction).
Rather than fine-tuning the parameters of the model, we proceed to optimise model performance through the fine-tuning of the prompt itself. Finally, performance is analysed for each subtask of the query with the help of proportion-correct, precision, recall and F1-scores.
To our knowledge, this is the first instance of a general-purpose, large language model such as GPT-4 being used for the automation of information extraction and knowledge synthesis in ecology.

| Data collection
In order to obtain relevant literature on potential pest controllers and their hosts, we extracted a set of abstracts from the academic indexing tool Scopus up until the year 2020, using the following search term:

TITLE-ABS-KEY("pest control" OR "biological control" OR "pest management" OR "natural enem*") AND (LIMIT-TO(DOCTYPE, "ar")) AND (LIMIT-TO(SUBJAREA, "AGRI") OR LIMIT-TO(SUBJAREA, "ENVI")) AND (LIMIT-TO(LANGUAGE, "English"))

The usage of Scopus ensures that all of the extracted literature has undergone peer review, which is important given that the underlying motive of this work is to automate knowledge synthesis for peer-reviewed manuscripts. Our search yielded a corpus of 58,791 abstracts, from which we selected a subset of 100 abstracts to create a training set, used to fine-tune the prompt, and a further 100 abstracts to create a held-out test set, used to analyse the final performance of GPT-4 with the fine-tuned prompt. We highlight that the term 'training' thus does not refer to the training of model parameters, as the term is more commonly used in the machine learning literature. We populated both of these sets by manually labelling the 200 subset abstracts (including titles and keywords) using the columns shown in Table 1, including all species found at the species or genus level, but excluding plants, bacteria, fungi and pathogens.
Thus, we focus here on pest control services of, and provided by, animal species specifically. The manual labelling was carried out by one annotator (Daan Scheepens), based on the definitions shown in Table 1. Species were labelled based only on information in the abstract, with the only exception being biological control agents that lacked an explicit role but could easily be inferred to be either predators or parasitoids based on their taxonomy (e.g. ground beetles or braconid wasps).
For the training set, we selected a stratified sample of 100 abstracts containing predators and parasitoids, in addition to pests, in order to fine-tune the prompt on these roles.
Information on instances of predators, parasitoids and pests was available from a prescreening procedure covering a total of 1520 abstracts, which identified and described the genera present in these abstracts and could thus indicate which abstracts contained at least one genus of predator, parasitoid or pest. The 100 abstracts comprising the training set were drawn from this subset one by one, while continually attempting to balance the numbers of predators, parasitoids and pests (Figure S1a). The 100 abstracts comprising the held-out test set were selected randomly from the remaining abstracts in the corpus. Pest species are common in the corpus, which means that there is a relative increase in the proportion of species labelled as pests in the test set compared to the training set (Figure S1b).
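The one-by-one balancing of the training set can be illustrated with a simple greedy procedure; this is a hypothetical sketch, not the authors' actual selection code, and `roles_of` (mapping an abstract to the roles found in its prescreening) is an assumed helper:

```python
import random

def balanced_sample(abstracts, roles_of, k=100,
                    roles=("predator", "parasitoid", "pest"), seed=0):
    """Greedily pick k abstracts, each time choosing one that contains
    whichever role is currently least represented in the sample."""
    rng = random.Random(seed)
    pool = list(abstracts)
    rng.shuffle(pool)
    counts = {r: 0 for r in roles}
    chosen = []
    while pool and len(chosen) < k:
        rarest = min(counts, key=counts.get)
        # prefer an abstract containing the rarest role; fall back to any
        pick = next((a for a in pool if rarest in roles_of(a)), pool[0])
        pool.remove(pick)
        chosen.append(pick)
        for r in roles_of(pick):
            if r in counts:
                counts[r] += 1
    return chosen
```

Because pest-heavy abstracts dominate the corpus, a greedy rule of this kind tops up the scarcer predator and parasitoid roles first.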

| Prompt and objective
We applied GPT-4 to each individual abstract (title, abstract and author keywords) in the training set, with the instruction to find all taxa mentioned at the species or genus level, and to return a table with the columns shown in Table 1. To achieve this, we instructed GPT-4 with an initial prompt (Figure S7), which was subsequently fine-tuned against the training set (see Section D of the Supporting Information).
Analogous to the species filtering during manual labelling, we prompted GPT-4 to skip any mentions of plants, bacteria, fungi or pathogens. Although this does not exclude vertebrates per se, we analyse the performance of the model on invertebrates only, as no vertebrates were extracted during manual labelling. While there are cases of vertebrates and other animal groups mentioned only by their common names (e.g. the sheep in 'sheep scab mite'), we discarded these from the predictions, as they are typically not organisms of interest to the study and were indeed not manually extracted. The ability of GPT-4 to ignore certain entities in the text is particularly useful for cases where generated output should be limited to certain species of interest. For large-scale studies, the increased completion time of generating exhaustive tables may be substantial, and the generation of comprehensive output may be infeasible due to limitations on token lengths: at the time of this study, both the input and prompt completion of GPT-4 were limited to 2048 tokens (OpenAI, 2023b), corresponding to approximately 1536 words (OpenAI, 2023c).
Following fine-tuning, the prompt assumed the design shown in Figure 1. By splitting this complex task into three smaller prompts, we adopted what is known as least-to-most prompting (Zhou et al., 2023). The prompt design follows a 'zero-shot' approach, as it contains only instructions rather than any exemplar prompt completions (which would have drastically increased the token length of the prompt). We attempted to improve the reasoning ability of the model by including the phrase 'let's think through the following tasks step by step' (info-point 3; Kojima et al., 2023), and through the usage of chain-of-thought reasoning (info-points 'C'; Wei et al., 2023). These prompting techniques have all been demonstrated to improve the reasoning abilities of GPT-3.5 (Kojima et al., 2023; Wei et al., 2023; Zhou et al., 2023). The prompt also makes abundant use of specific examples and counterexamples to aid the correct identification of particular species roles. We applied this fine-tuned prompt to the training data once more without any further changes to obtain final results on the training set, and then applied it to the held-out test set.
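The three-prompt session described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt texts are placeholders (the full fine-tuned wording is given in Figure 1), and `complete` stands in for any chat-completion call, such as the OpenAI client:

```python
# Illustrative placeholders for the three least-to-most prompts.
PROMPT_1 = ("Let's think through the following tasks step by step. "
            "List every taxon mentioned at species or genus level in the "
            "abstract below, skipping plants, bacteria, fungi and pathogens.\n\n"
            "{abstract}")
PROMPT_2 = ("For each taxon listed above, return one table row with the "
            "required columns (taxonomy, role, location, ...) filled out.")
PROMPT_3 = ("Review the generated table against the abstract and correct any "
            "mistakes; reply 'No corrections' if none are needed.")

def extract_table(abstract: str, complete) -> list[str]:
    """Run the three prompts within one conversational session and return
    the model replies. `complete` maps a message history to a reply string."""
    history, replies = [], []
    for prompt in (PROMPT_1.format(abstract=abstract), PROMPT_2, PROMPT_3):
        history.append({"role": "user", "content": prompt})
        reply = complete(history)  # one chat completion over the full history
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Keeping all three prompts in one session is what lets the review step (PROMPT_3) see both the abstract and the table it is asked to check.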

| Evaluation
As shown in Table 1, the first five columns of the datasets contain the taxonomy of the species or genus. While higher-level taxonomic information can be extracted from databases such as the Global Biodiversity Information Facility (GBIF, 2023) for a given species or genus, species names are frequently misspelled in the literature and genus names occasionally appear across different phyla, making it difficult to automatically extract the correct taxonomy in all instances. We therefore deemed it of interest to investigate the ability of GPT-4 to (1) predict the missing higher-level taxonomy of a given species or genus from its training data and (2) correct any suspected misspellings.
We refer to the task of extracting taxonomy that is stated in the abstract as 'taxonomic named entity recognition'. Where the taxonomy was not available in the abstract, we searched for the species in GBIF and the Encyclopedia of Life (EOL) (Parr et al., 2014) using the genus and (if available) species name and filled this out accordingly. We refer to GPT-4's task of predicting this missing taxonomy as 'higher-level taxonomy prediction'. This taxonomy concerns only the class, order and family columns in the data sets, as the genus and species columns are needed to predict the missing taxonomy. In both cases, the obtained taxonomic information is then compared with the generated GPT-4 output. Since taxonomic named entity recognition and higher-level taxonomy prediction comprise two very different tasks, we evaluate performance of the model on these two groups of taxonomic terms individually. We do note that GPT-4 does not inherently distinguish between these two tasks, but rather generates taxonomic terms simply on the basis of the information available at any instance (which includes both the text at hand and the information learned from its training data). Therefore, given a generated table, it is not possible to say a posteriori which higher-level taxonomic terms have been extracted and which have been predicted from the model's training data.
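As an aside, this kind of higher-level taxonomy lookup can also be done programmatically against GBIF's public species-match endpoint (our reference labels were filled in manually from GBIF and EOL); the helper below is a hypothetical sketch of such a lookup:

```python
GBIF_MATCH = "https://api.gbif.org/v1/species/match"

def higher_taxonomy(name: str, fetch=None) -> dict:
    """Return the class/order/family recorded in the GBIF backbone for a
    species or genus name. `fetch` maps query parameters to a parsed JSON
    record and can be replaced by a stub for offline testing."""
    if fetch is None:  # default: live HTTP lookup (requires the requests package)
        import requests
        fetch = lambda params: requests.get(GBIF_MATCH, params=params,
                                            timeout=10).json()
    record = fetch({"name": name})
    return {rank: record.get(rank) for rank in ("class", "order", "family")}
```

A lookup of this kind only works when the genus and species names are correct, which is exactly why the species and genus ranks carry particular weight in the evaluation.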
Based on our definitions of exact and approximate matches (see Section E of the Supporting Information), we evaluate the extraction (and prediction) of taxonomy with the proportion correct (PC) score, which computes the ratio of correct predictions to the total number of predictions. We evaluate this score both against exact matches (i.e. counting all mismatches as mistakes) and against approximate matches (i.e. counting only major mistakes as mistakes):

PC_exact = n_exact / N,   PC_approx = n_approx / N,   (1)

where N is the total number of predictions, n_exact is the number of exact matches and n_approx is the number of approximate matches.
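In code, the PC score reduces to a ratio over prediction-label pairs; the `approx_match` below is only an illustrative stand-in for the fuller approximate-match definition in Section E of the Supporting Information:

```python
def proportion_correct(predicted, labelled, match) -> float:
    """PC: the ratio of matching (prediction, label) pairs to total predictions."""
    assert len(predicted) == len(labelled)
    return sum(match(p, l) for p, l in zip(predicted, labelled)) / len(predicted)

def exact_match(pred: str, label: str) -> bool:
    return pred == label

def approx_match(pred: str, label: str) -> bool:
    """Toy stand-in: treat case and whitespace differences as minor mistakes."""
    return pred.strip().lower() == label.strip().lower()
```

For example, predictions `['Coccinellidae', 'aphididae']` against labels `['Coccinellidae', 'Aphididae']` give PC_exact = 0.5 but PC_approx = 1.0, since the lowercase initial counts only as a minor mistake under this toy definition.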
TABLE 1 Columns of the data used in this study.

FIGURE 1 The final prompt design after fine-tuning against the training set. The prompt is divided into three parts. The first part prompts GPT-4 to detect all relevant species in the abstract, the second part prompts GPT-4 to return these in the table with the appropriate columns filled out, and the third part prompts GPT-4 to review the generated table and make any corrections if necessary. These three prompts were always executed within the same session. Points labelled 'C' (and highlighted in grey rather than yellow) designate clarifications; these were used to clarify specific terms such as pest and pest controller, and aimed to reduce persistent problems such as formatting errors and missing pest information in some columns.

We focus our analysis of the remaining species information (columns 6-14) on the species roles and the geographic locations. For these two subtasks, we measure the performance of GPT-4 by computing the precision, recall and F1-score of the model based only on approximate matches (Section E of the Supporting Information). The F1-score is computed as the harmonic mean (i.e. the reciprocal of the arithmetic mean of the reciprocals) of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall),   (2)

making it a function of both measures. Unlike the arithmetic mean, the harmonic mean is 0 when either recall or precision is 0. Although this is also true for the geometric mean, the harmonic mean further penalises uneven performance.
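A minimal sketch of these metrics (Equation 2), assuming counts of true positives, false positives and false negatives have already been tallied from the approximate matches:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and their harmonic mean (the F1-score).
    Each score defaults to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision = recall = F1 = 0.8; as soon as precision and recall diverge, the harmonic mean falls below their arithmetic mean.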

| Species extraction
We found that splitting the instructions into two consecutive prompts, in which the first prompt served to identify all relevant species in the text and the second served to fill out the table, led to improved entity extraction. Although the initial prompt already extracted 627 out of 649 species (96.6% recall) in the training set, the split-prompt approach ensured that all 649 species were extracted in the first part of the prompt. However, in the second part of the prompt, GPT-4 still struggled to return all previously found species in the final table; as a result, a small number of species in the training data were still missed in GPT-4's output, with 631 out of 649 species extracted (97.2% recall) (Table 2a). However, since these missing entries correspond to only 2 out of 100 abstracts in the training set, this yields a mean recall per abstract of 99.5% (standard deviation = 4.4%; Table 2b).
Importantly, we also observed 43 instances of fabricated species entries (i.e. 'confabulations') as rows in the GPT-4 output for the training set, which occurred in the same two abstracts where missing entries were observed. These false positives result in a precision of 93.6% and a mean precision per abstract of 99.1% (standard deviation = 6.8%; Table 2) for the training set. For the test set, GPT-4 extracted 244 of 245 species (99.6% recall), and we did not observe any confabulated species entries (100% precision) (Table 2). Moreover, we found that GPT-4 did not generate any entries for plants, bacteria, fungi or pathogens in either the training set or the test set, and thus abided by the constraint on species extraction very effectively.
We hypothesised that the third step in the final prompt design, in which we instructed GPT-4 to thoroughly review its output, would allow the model to pick up on its own mistakes; however, its success was found to be very limited. GPT-4 corrected its own mistakes only in the case of a single abstract in the training set, corresponding to a table with 12 species rows. In this case, the model correctly identified that it had left out columns 10, 11 and 12 for the pest species in the table and then proceeded to fill these out correctly. In one case in the test set, the model claimed that it had not identified a particular species as 'pest' and stated that it had corrected this, although the species had, in fact, already been identified as a pest. In two other cases, the model returned reassurances other than 'No corrections' (although conveying the same message). The model did not recognise missed species or confabulations.
The generated tables obtained from GPT-4 display varying degrees of accuracy and adherence to the prompted instructions. However, in cases where the information in the abstract is presented clearly and unambiguously, strong performance can be observed (Figure S2). In contrast, ambiguous usage of language (e.g. a predatory species that acts as a pest) is prone to produce erroneous results (Figure S3). In the remaining analysis, the species in the training and test sets that were missed by GPT-4 are omitted, as these species naturally cannot be compared to the manual labels. This refers to 18 out of 642 species in the training set and 1 out of 245 species in the test set. For the same reason, we also omit the 43 cases of confabulations in the training set from this further analysis, but discuss these at length in the discussion.

| Taxonomic named entity recognition
We observed only a small number of mismatches between manual labels and GPT-4 predictions for taxonomic named entity recognition, resulting in high PC_exact scores (Table 3a). Following comparison with the reference taxonomy, we found that the majority of mismatches were either cases where the GPT-4 prediction was, in fact, correct (23/44 in the training set and 23/26 in the test set), or where the mistake comprised only a minor, rather than a major, mistake (Table S2a). Indeed, we observed only one major mistake in the 2240 total taxonomic terms (0.04%) predicted by GPT-4 in the training set and three major mistakes in the 801 total taxonomic terms (0.37%) predicted by GPT-4 in the test set (Table S2a). We also note that PC_approx scores amount to 100% for the genus rank, and are in excess of 99.6% for the species rank, in both sets (Table S3a). These ranks bear particular importance, as any higher-level taxonomy obtained from databases such as GBIF depends on their correctness. GPT-4 does obtain similar performance for the class, order and family ranks, however, suggesting that when higher-level taxonomy is mentioned in the text, it is extracted reliably.
We note that mismatches where the GPT-4 prediction turned out to be correct refer primarily to cases of spelling corrections in the genus and species columns, and cases of corrected ranks (e.g. the correct family as opposed to the superfamily mentioned in the abstract) in the class, order and family columns, although we also find spelling corrections in the family column. A common case in the order column comprises the manually extracted taxa 'Homoptera' and 'Heteroptera', both of which are suborders of Hemiptera, although the former is now considered obsolete by many taxonomists (von Dohlen & Moran, 1995). In terms of spelling corrections, we observed 12 cases in the training set and 4 in the test set. In all cases, GPT-4 corrected the spelling mistake and returned the correct term. While this does not evidence that all misspellings in the data were corrected by the model, it does demonstrate that when a misspelling was addressed, it was done correctly.

| Higher-level taxonomy prediction
For the task of higher-level taxonomy prediction, we observed more mismatches than for the task of taxonomic NER (as proportions of the total predictions), resulting in generally lower PC_exact scores (Table 3b). Furthermore, fewer of these mismatches comprised cases where the GPT-4 prediction turned out to be correct, and more comprised major mistakes (Table S2b). We observed 15 major mistakes out of the 915 total taxonomic terms (1.64%) predicted by GPT-4 in the training set and 7 major mistakes out of the 419 total taxonomic terms (1.67%) predicted by GPT-4 in the test set (Table S2b). We find that while PC_approx scores for the class and order ranks fall between 98% and 100% (Table S3b), we observe a notable decline for the family rank in both the training set and the test set.
This decline, however, appears to be a result of bias toward a small number of abstracts in which many major mistakes were made, and is thus not observed in the scores per abstract (PC*_approx).

| Species roles
GPT-4 captured the roles of species with a high degree of precision and recall for most labels (Table 4). Crucially, while there are a small number of cases of manually labelled predators, parasitoids and pests that have been mislabelled by GPT-4 as something else (e.g. as 'competitor', 'prey', 'host' or 'unclear'), there is no confusion between these roles, either in the training set or in the test set (Figure S4).
Results for the test set are very comparable to results for the training set, although some differences in precision and recall can be observed (Table 4). In particular, the precision for predator roles is reduced from 100% on the training set to 80.4% on the test set. This reduction is a result of GPT-4 predictions of 'predator' for terms that were manually labelled as biological control agents, natural enemies or were left blank (Figure S4b). Recall of predator roles is only marginally reduced on the test set compared to the training set. In both cases, false negatives are the result of a relatively small percentage of abstracts (Table S4).
Performance for parasitoid roles on the test set closely matches performance on the training set, with precision and recall scores between 97.9% and 100% and F1-scores of 98.7%-99.0% (Table 4). In both sets, hyperparasitoids (parasitoids of other parasitoids) are accurately distinguished. For pest roles, we observe a substantial amount of confusion between pests and 'prey/host', 'other' and 'unclear', in both the training and the test set (Figure S4). As a result, we observe a relatively large variation in precision and recall for this role between the training set and the test set (Table 4).

| Geographic locations
GPT-4 is capable of effectively extracting geographic information, with locations predicted with 98.7% precision and 97.1% recall on the training set and 95.3% precision and recall on the test set (Table 5). Per abstract, an average of 96.3% of manually labelled locations are correctly returned by the GPT-4 predictions in both sets (Table S5). No-locations (i.e. the abstract mentioned no location for the respective species) are generally predicted only marginally worse, with the exception of a notably lower precision on the training set (Table 5a), which stems from a total of 14 false negative predictions (Figure S5). These false negatives occur in four abstracts, with most (9 out of 14) comprising empty predictions and the remainder comprising predictions of insufficient information originating from a single abstract (predictions of 'North Island' rather than 'Australia').

TABLE 3 Proportion correct (PC) scores of GPT-4 on (a) taxonomic named entity recognition, (b) higher-level taxonomy prediction and (c) all taxonomic terms, totalled over all ranks. Individual scores per rank can be found in Table S3. Note: PC scores are presented as measured against exact matches (PC_exact) and as measured against approximate matches (PC_approx). We also present the latter per abstract (PC*_approx) to avoid bias from abstracts with a large number of extracted species (presented as the mean across abstracts with a spread of one standard deviation).

TABLE 5 Precision and recall obtained by GPT-4 for geographic locations in the training set (a) and the test set (b). Note: Support: the number of locations and non-locations present in the manual labels. Weighted average: the average of the respective column, weighted by the support of each row. 'Location' refers to the location associated with the study of the species and 'No location' refers to the case where no location is mentioned in the abstract; if GPT-4 predicted a location that did not correspond to the manually labelled location, this is designated as a false positive, and if GPT-4 did not predict a location for a manually labelled location, this is designated as a false negative.

| DISCUSSION
The results presented here show that GPT-4 possesses an effective ability to (1) extract taxonomic and geographic entities from abstracts, (2) identify the roles of species from their descriptions in the abstract, and (3) extract relations between entities in the abstract, such as between pest controllers and pests, and pests and certain industries as well as their affected products. However, a number of observations and caveats merit discussion. In addition, a discussion on the problem of ambiguity and its impact on mismatches between species roles is provided in Section F, and a user guide for researchers interested in working with LLMs is provided in Section G of the Supporting Information.

| Confabulations
The vast pool of information that GPT-4 is able to synthesise appears to provide a notable strength, enabling the model to recognise common usages of language in ecology, such as the conventions regarding taxonomic ranks and the meaning of domain-specific vocabulary. Moreover, it allows the model to make predictions that go beyond merely the information provided in the text, allowing it to correct spelling mistakes, disentangle ambiguities and provide appropriate roles of species based on descriptions in the text.
However, the ability of GPT-4 to synthesise vast quantities of data can also be a source of confabulations. This refers to cases of believable but fabricated information that chat-based large language models such as ChatGPT have been observed to generate (Azamfirei et al., 2023). We observed a total of 43 confabulated species as rows in the output of GPT-4 (all of these in the training set). While the first part of the prompt (Figure 1) consistently returned the correct set of species, a subsequent mismatch was observed between this set and the species returned in the final table. This was observed exclusively in cases where GPT-4 was unable to complete the entire table in one prompt completion and thus required multiple completions to finish the table (using the 'continue generating' function available in the web interface). We suspect that this repeated generation of output may be the cause of these confabulations, possibly as a result of a diminishment of the model's internal memory (Gong et al., 2023).
Indeed, the only cases of confabulated species rows in GPT-4's output were detected in the two longest tables generated by GPT-4 (corresponding to two abstracts in the training set), consisting of 25 and 58 rows, respectively. Furthermore, all 18 species rows that were missed by GPT-4 in the training set also originate from these same two abstracts, and thus the same mechanism may be responsible for these false negatives. While confabulations are concerning, it is expected that usage of GPT-4 with higher token limits for prompt completion would alleviate this issue. For example, updated versions of GPT-4 are stated to have maximum token lengths of 8192 and 32,768 tokens, respectively, while web-interface usage was limited to 2048 tokens per generation at the time of the experiment (OpenAI, 2023b). The fact that the GPT-4 output for the test set contained no tables longer than 12 rows may thus explain why we did not observe instances of confabulated species in the test set.
Additionally, a number of apparent confabulations were observed for the 'Industry Type' column in the generated data sets, where associations of pests with agriculture or forestry were stated without any clear basis in the text. This was observed for 33 out of 631 species in the training set and for 20 out of 244 species in the test set. Upon closer inspection, however, we found that these associations could be corroborated by other internet sources in the vast majority of cases (Tables S7 and S8). Just seven cases in the training set, corresponding to two abstracts, could not be corroborated by other internet sources and thus appear to constitute actual confabulations.
The apparent inability of GPT-4 to state whether information was drawn from within the text or from an outside source, and if so, from which source, is a well-known caveat of the model.
However, new plug-ins are continuously being released to address this issue. Connecting GPT-4 to live internet sources, as exemplified by the GPT-4-powered Microsoft Bing Chat, may also provide an effective way to reduce confabulations through the reliable citing of sources. The proneness of large language models to fabricate erroneous but credible-sounding pieces of information is the subject of increasing discussion in the broader scientific community, with concerns over the accuracy, reliability and accountability of output obtained from LLMs (Birhane et al., 2023). Although AI models are increasingly being used in science and have begun delivering numerous scientific advances (Wang et al., 2023), it is important for scientists to be aware of the limitations of AI tools, such as LLMs and other black-box, deep-learning-based models, and of the potential impact of these models on reliability and reproducibility in science.

| The challenge of prompt design
The question of prompt design (or prompt engineering) poses a challenge to the adoption of large language models in scientific research.
Here, we proceeded to optimise the performance of the prompt design against a training set. Our optimisation approach had two main caveats. The first relates to the reliance on manual, trial-and-error-based optimisation, which is time-consuming and may be intractable for very large datasets. The second relates to the problem of overfitting, which, although not well defined in this context (as there is no proper loss function), may arise if the prompt is too tailored to the training set and fails to generalise to the rest of the data. Regardless of the optimisation strategy, however, designing the prompt to take into account the various intricacies of the task (definitions, examples, counterexamples, clarifications) is highly nontrivial and highly data-dependent, as exemplified by the extensive prompt utilised in this work. Furthermore, the performance of GPT-4 was observed to be a highly nonlinear function of the prompt design, with small changes in the prompt leading to large changes in the output (for better or worse). As opposed to extensive prompt engineering, a powerful alternative may be posed by fine-tuning the parameters of the model. Also known as transfer learning (Pan & Yang, 2010), fine-tuning allows users to customise a pretrained machine learning model to their own use case by retraining the model (or a subset of its parameters) on a much smaller, bespoke data set. With many applications in image and text classification (Weiss et al., 2016), transfer learning is already an active field in biomedicine and, to a lesser degree, ecology. OpenAI has released a fine-tuning option for GPT-3.5, which is stated to "match, or even outperform, base GPT-4-level capabilities on certain narrow tasks" (Peng et al., 2023). It is moreover stated to allow prompts to be shortened by as much as 90%, as instructions can be fine-tuned into the model itself. Fine-tuning may thus provide a promising alternative, or addition, to prompt engineering for large language models.
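As a hedged illustration of what such fine-tuning involves (this is our own sketch, not part of the present study), chat-model fine-tuning typically requires training examples in a chat-format JSONL file, pairing a shortened instruction with the desired structured output. The instruction text, file name and example content below are all hypothetical:

```python
import json

# Minimal sketch: build a chat-format JSONL training file of the kind used for
# fine-tuning chat models, pairing a short system instruction with the desired
# structured extraction. All content here is hypothetical.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract pest-control entities as a table."},
            {"role": "user", "content": "Coccinella septempunctata preyed on Aphis gossypii in cotton fields in China."},
            {"role": "assistant", "content": "Species | Role | Location\nCoccinella septempunctata | Predator | China\nAphis gossypii | Pest | China"},
        ]
    }
]

# Write one JSON object per line, as expected by JSONL-based fine-tuning workflows.
with open("finetune_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

In practice, many such abstract-to-table pairs drawn from the manually labelled training set would be needed, after which the lengthy instructions of the prompt could, in principle, be reduced to a short system message.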

| Comparison with other tools
To put some of the results obtained in this work into a broader perspective, we provide a comparison with previous studies that have utilised text mining tools for applications in ecology. Previous attempts to extract taxonomic terms from abstracts have resulted in mean recall scores per abstract of 79.5% (Millard et al., 2020) and 93.6% (Cornford et al., 2022) with the help of the R package Taxize (Chamberlain & Szöcs, 2013), while the extraction of geographic locations has been achieved with a mean recall of 82.1% per abstract (Cornford et al., 2022) with the help of the CLIFF-CLAVIN geoparser model (D'Ignazio et al., 2014). Furthermore, an extensive comparison of eight taxonomic NER models over four gold standard ecology corpora (Le Guillarme & Thuiller, 2022) reports scores for approximate matches ranging between 78% and 96% (precision), 74% and 93% (recall) and 76% and 91% (F1-score).
If the risk of confabulated species extractions is indeed minimised by sufficiently large token lengths, such entries should pose a relatively low risk for future endeavours, which are likely to incorporate longer token lengths. As such, our results on the test set, which did not suffer from confabulated species extractions, may offer a valuable demonstration of the model's potential performance. The extraction of species from abstracts in the test set was achieved with a total precision of 100%, recall of 99.6% and F1-score of 99.8% (Table 2a). Investigating only the taxonomy of the successfully extracted species, we report PC approx scores on the test set of 99.6% for taxonomy that was stated in the abstract (Table 3a), 98.3% for taxonomy that was not stated in the abstract (Table 3b) and 99.2% overall (Table 3c). Geographic locations (as approximate matches) were extracted from the test set with a precision, recall and F1-score of 95.3% (Table 5b), only marginally worse than the performance obtained on the training set. We highlight that GPT-4 achieved these results without prior training on these tasks, while simultaneously generating responses to multiple other tasks laid out in the prompt, such as extracting species roles, identifying pest controllers, pest names and associations, and providing detailed descriptions.
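The precision, recall and F1-score metrics used throughout this comparison can be computed directly from the sets of extracted and manually labelled entities. The following is a minimal sketch of such a computation (our own illustration, with hypothetical species names):

```python
# Sketch: precision, recall and F1-score for extracted entities against a set
# of manual labels, treating extraction as a set-matching problem. The species
# names below are hypothetical example data.
def precision_recall_f1(extracted, gold):
    ex, gd = set(extracted), set(gold)
    tp = len(ex & gd)  # true positives: entities present in both sets
    precision = tp / len(ex) if ex else 0.0
    recall = tp / len(gd) if gd else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(
    ["Aphis gossypii", "Coccinella septempunctata"],
    ["Aphis gossypii", "Coccinella septempunctata", "Harmonia axyridis"],
)
# p = 1.0, r = 2/3, f = 0.8
```

For approximate matching, as used in several of the comparisons above, the exact set intersection would be replaced by a fuzzier criterion (e.g. partial string overlap), which typically raises all three scores.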

| CONCLUSIONS
In this work, we explored the potential of the next generation of large language models for the automation of knowledge synthesis.
To this end, we investigated a set of abstracts on biological pest control and prompted GPT-4 to extract species, taxonomy and geographical locations, recognise roles, pest-controlling behaviour, pest types, pest associations and mutual relations.
Our results show that GPT-4 is highly capable of this task, with performance on all investigated subtasks largely congruent with the manual labels. Some of the discrepancy appears to be a result of a certain degree of ambiguity in the abstracts and in the task itself. We thus refrained from speaking of ground truth in this work, since even manual labels are subject to ambiguity and human error: indeed, we observed several cases where the predictions of GPT-4 were possibly more accurate than the manual labels. That being said, we also observed confabulations, that is, fabricated text. In the case of confabulated species rows in the tables, this appears to be a symptom of the token-length limits placed on GPT-4 at the time of the experiment. In the case of individual pieces of fabricated information, it is more difficult to identify root causes, as they may depend more on the model parameters and the original training data, and may, in fact, be corroborated by other sources upon closer inspection. We hope that this work makes a valuable contribution to the rapidly evolving domain of the automation of knowledge syntheses in ecology by demonstrating the potential of GPT-4 for this task. Indeed, general-purpose LLMs may provide an interesting way forward in ecology, since there is currently a lack of specialised, pretrained language models. Combined with tailored prompt engineering, LLMs could be used for a broad range of tasks and have the potential to save a large amount of time spent on manual labelling. Through their vast information base, these models can, in principle, be applied to literature spanning many different languages, helping to mitigate the widespread bias towards the English language in knowledge syntheses (Amano et al., 2021; Konno et al., 2023), although it should be noted that current LLM performance on underrepresented languages has been found to be poor (Laskar et al., 2023). Reliability may be enhanced through the integration of live internet access
and a growing number of plug-ins that address the traceability of information. Additionally, an ability to handle full-text papers is likely to become tractable in the near future and further improve the reliability of extracted information. Moreover, using full texts would …

TABLE 4 Precision and recall obtained by GPT-4 for each role in the training set (a) and the held-out test set (b).
Note: Support: the number of instances of the respective role in the manual labels. Weighted average: the average of the respective column, weighted by the support of each row. 'B.C.A.': biological control agent. 'H. Parasitoid': hyperparasitoid. 'Nat. Enemy': natural enemy. 'Other' includes terms such as 'Pollinator', 'Herbivore', 'Leaf miner', 'Scavenger', 'Ectoparasite' and 'Competitor'; 'Unclear' includes 'Not mentioned' as well as empty entries. The primary roles of interest (predator, parasitoid and pest) are emphasised in boldface.