Many unreported crop pests and pathogens are probably already present

Invasive species threaten global biodiversity, food security and ecosystem function. Such incursions present challenges to agriculture where invasive species cause significant crop damage and require major economic investment to control production losses. Pest risk analysis (PRA) is key to prioritize agricultural biosecurity efforts, but is hampered by incomplete knowledge of current crop pest and pathogen distributions. Here, we develop predictive models of current pest distributions and test these models using new observations at subnational resolution. We apply generalized linear models (GLM) to estimate presence probabilities for 1,739 crop pests in the CABI pest distribution database. We test model predictions for 100 unobserved pest occurrences in the People's Republic of China (PRC), against observations of these pests abstracted from the Chinese literature. This resource has hitherto been omitted from databases on global pest distributions. Finally, we predict occurrences of all unobserved pests globally. Presence probability increases with host presence, presence in neighbouring regions, per capita GDP and global prevalence. Presence probability decreases with mean distance from coast and known host number per pest. The models are good predictors of pest presence in provinces of the PRC, with area under the ROC curve (AUC) values of 0.75–0.76. Large numbers of currently unobserved, but probably present pests (defined here as unreported pests with a predicted presence probability >0.75), are predicted in China, India, southern Brazil and some countries of the former USSR. We show that GLMs can predict presences of pseudoabsent pests at subnational resolution. The Chinese literature has been largely inaccessible to Western academia but contains important information that can support PRA. Prior studies have often assumed that unreported pests in a global distribution database represent a true absence. Our analysis provides a method for quantifying pseudoabsences to enable improved PRA and species distribution modelling.


| INTRODUC TI ON
The spread of invasive species is homogenizing the biosphere, with wide-ranging implications for natural ecosystems (Baiser, Olden, Record, Lockwood, & McKinney, 2012;Santini et al., 2013) and agriculture (Bebber, 2015;Fisher et al., 2012). The number of first observations of crop pests and pathogens has accelerated in recent years, driven primarily by global trade (Bacon, Aebi, Calanca, & Bacher, 2014;Ding, Mack, Lu, Ren, & Huang, 2008), but also potentially by climate change and our improving ability to monitor and identify threats (Bebber, 2015;. Here, we use the term 'pest' to describe any herbivorous arthropod, pathogenic microbe or virus known to attack agricultural crops. Emerging pests can be extremely damaging to agricultural production and the economy, causing both preharvest and postharvest losses (Bebber & Gurr, 2015;Paini et al., 2016;Savary et al., 2017). Recently, for example, sub-Saharan Africa has suffered from the virulent Ug99 strain of the wheat stem rust fungus (Puccinia graminis tritici ;Patpour et al., 2015), the newly evolved maize lethal necrosis viral syndrome (Wangai et al., 2012), arrival of the Asian citrus psyllid (Diaphorina citri) which vectors citrus greening disease (Shimwela et al., 2016) and the appearance of Tropical Race 4 of Fusarium oxysporum f. sp. cubense attacking Cavendish bananas (Ordonez et al., 2015). Central America, Europe, East Africa and Australia have been identified as hotspots of new pest invasions, with maize, bananas, citrus and potato as the crops most likely to be affected (Bebber, 2015). Outbreaks of resident pests due to favourable weather conditions, virulence evolution or crop management factors add to the burden on farmers. For example, a major outbreak of coffee leaf rust (Hemileia vastatrix) in Latin America, likely to have been triggered by a failure in disease management, is reported to have caused large-scale unemployment and social upheaval in recent years (Avelino et al., 2015).
Despite the expanding ranges of many pests, complete occupation of their potential ranges has not yet occurred , and so there remains a strong impetus for implementation of biosecurity measures at international borders (Fears, Aro, Pais, & Meulen, 2014;Flood & Day, 2016;MacLeod, Jones, Anderson, & Mumford, 2016). Control of spread within countries is extremely difficult because of unhindered transport of plants and soils (Ward, 2016), and biosecurity measures focus on quarantine and inspections at borders (MacLeod et al., 2016).
A key component of international phytosanitary action is pest risk analysis (PRA), a suite of methods that allow countries to prioritize protective measures against those pests most likely to arrive and cause serious economic damage (Baker et al., 2014;Robinet et al., 2012). PRA involves assessment of the likelihood of pest arrival, the likelihood of establishment, the potential economic impact if uncontrolled and the prospect of successful control or eradication (Baker et al., 2014). To date, PRA has primarily been based upon expert opinion regarding the likelihood of arrival and potential impact of individual pests. For example, the UK's recently established Plant Health Risk Register (Baker et al., 2014) (Elith & Leathwick, 2009;Sutherst, 2014). The geographic distributions of species are non-random, determined by their biotic environment (e.g. hosts or prey), the abiotic environment (e.g. climate, edaphic factors) and migration (dispersal to suitable habitat; Soberón, 2007;Soberón & Nakamura, 2009;Soberón & Peterson, 2005). Thus, pest invasion risk is, in theory, quantifiable. Numerous modelling approaches are now available to predict the likely distributions and impacts of pests (Elith & Leathwick, 2009;Robinet et al., 2012;Venette et al., 2010), ranging from process-based, or mechanistic models, to statistical, or correlative approaches (Dormann et al., 2012). Regional and global databases on known pest distributions are commonly used to parameterize these models, either providing direct estimates of pests' ecological niches (Kriticos, 2012;Venette et al., 2010), or indirectly via shared geographic ranges (Eschen et al., 2014;Paini et al., 2016;Paini, Worner, Cook, Barro, & Thomas, 2010).
One seldom-acknowledged issue with pest distribution data in global databases is geographic bias in the likelihood that a pest will be detected, correctly identified, reported and recorded (Pyšek et al., 2008). Analysis of one of the most widely studied global pest distribution databases suggests that hundreds of pests already present in many developing countries have not been reported (Bebber, Holmes, Smith, & Gurr, 2014). The total number of observed pests in an administrative area (country, or administrative division for larger countries) can be largely explained by scientific capacity and agricultural production. Under a scenario of globally high scientific and technical capacity (i.e. where all countries have US-level per capita GDP and research expenditure), analysis predicts that many countries across the developing world would report hundreds more pests.
This suggests that a large fraction of the current agricultural pest burden is unreported and unknown, and that even the best global databases suffer from severe observational bias. This has potentially serious consequences for both plant biosecurity activities and for research based upon these databases. Such observational bias may have implications for SDM methods that infer environmental tolerances from observed distributions. Scientific capacity, economic development and the ability to detect, identify and report pests are strongly correlated with latitude, as is climate (Bebber, Holmes, Smith, et al., 2014). Under-reporting of pests at low latitudes will, therefore, bias estimation of climate tolerances, as occurrence is under-reported in warmer regions. Reducing this observational bias by strengthening pest identification efforts in the developing world is, therefore, critical in improving scientific understanding of pest distributions, and in PRA.
The People's Republic of China (henceforth referred to as China) has been predicted to harbour the largest number of pests (Bebber, Holmes, Smith, et al., 2014). China produces the largest quantity of crops by tonnage globally, and has the greatest diversity of production. Both factors are strong determinants of recorded pest numbers (Bebber, Holmes, Smith, et al., 2014). Yet, the actual recorded number of pests in China is much smaller than expected (Bebber, Holmes, Smith, et al., 2014). For many countries, under-reporting of agricultural pests is likely to be purely a function of the lack of institutional capacity to detect, identify and report incidences in the scientific and 'grey' literature used by CABI to populate the distribution data-  (Klayman, 1985). The Chinese research literature, having developed in isolation from the Western literature, therefore, provides a potentially independent data source for testing models of pest distributions.
Here, we develop statistical models of global pest presence using a database of known pest occurrences and confront the predictions of individual pest presence in Chinese provinces with observations from the Chinese literature. In addition, we compare models in which pest absences are treated as true absences with models in which absences are weighted according to estimates of scientific and technical capacity of a given country to report plant health risks, and to investigate the effect of observational bias and pseudoabsences in pest distribution modelling. We then apply our distribution models globally to all unreported pests in all regions, to give predicted probabilities of presence. Finally, we list those pests that are probably present, but as yet unreported, around the world.

| MATERIAL S AND ME THODS
We obtained pest distribution data from the CABI Knowledge Bank We developed two pest distribution models. The 'unweighted' model included geographical and biological predictors and treated all unobserved pests as absent from a region. The 'weighted' model treated unobserved pests as potentially pseudoabsent, using a function of the scientific and technical capacity of each country (Bebber, Holmes, Smith, et al., 2014). Presences were taken as being correct and unambiguous, and given a weighting of unity.
Absences were weighted by the logarithm of the agricultural and biological sciences publication output of each country from 1996 to 2016 (Scimago Lab, 2017), normalized to the logarithm of the output of the United States (the world's most scientifically productive country), such that the absence weight w 0 = log(s)/log(s USA ).
Thus, pests unreported from scientifically advanced nations were assumed not to be present (or, present at undetectable population density), while pests unreported from developing nations were less informative of absence. China, with the second largest research output, had w 0 = 0.93, suggesting that nonreporting of a pest should be relatively strong evidence of its physical absence.
However, we hypothesized that nonreporting in the CABI databases could be due to lack of translation from the Chinese literature, therefore, we set w 0 to zero for China, effectively omitting these pseudoabsences from the analysis. The weighted and unweighted models were compared with a null model assuming constant presence probability using Likelihood Ratio Tests.
To validate the models, we predicted the probability of presence for a random sample of 100 as-yet unobserved pests in all Chinese provinces, excluding Taiwan. The Chinese literature was searched for observations of these unobserved pests in China. We used the text mining methodology designed by CABI for their Plantwise Knowledge Bank. The following rules were followed to locate pest records in the Chinese literature: • Include only papers that are primarily on distribution data, not those where distribution is mentioned, but where something else is the primary focus. If this is unclear do not process the paper.
• Record country and location information given in the paper, including latitude/longitude. CABI uses five levels for location, from the largest scale (i.e. provincial) to the smallest (i.e. village/town).
• Record date of observation/collection (entering each year separately) and date of publication. Can be left blank if not given, or use the date of receipt in the diagnostic laboratory as a surrogate for date of collection.
• Record pest status-present/not found. Only record absence if pest absence is specifically stated in the paper.
• Record pest status using only the status terms defined by CABI, and only if used in the paper, for example, 'widespread', 'restricted' 'soil only' 'greenhouse only' (see CABI guidelines for complete list).
• Record if the paper was a first record of that pest or not and details of this (e.g. 'first record in <country/location>', 'first record on <host species name>').
• Only enter data where the pest/pathogen has been clearly identified, not just symptoms seen.
• Record only natural infections, not artificial inoculants. Publication titles were searched first, followed by full text interrogation. The first 50 search results were scanned before dismissing a search term. The first search term combination was pest name and location (province). If this yielded no result, then pest name and various distribution terms were tried. These distribution terms were: "catalogues" OR "checklists" OR "distribution" OR "inventories" OR "new records" OR "surveys" OR "geographical distribution" OR "new geographic records" OR "new host records". Searches included local names in Chinese which were known or could be identified from the literature, preferred scientific names and nonpreferred scientific names from CAB Thesaurus (https ://www.cabi.org/cabth esaur us/). Searches continued until one piece of literature was found for that pest in that region that fitted all of the requirements for CABI text mining.
If a pest was not found in any of these searches, it was assumed to be absent from the literature and thus effectively absent from the region. We cannot prove, however, that a pest is present at very low population density and has not yet been detected (Crooks, 2005).
Modelled probabilities of reported pest presence in the global dataset, P G , were obtained from the predictor variables for each pest-region combination, for each GLM (predict function in R). We then compared P G with the observed presence-absence data for our Chinese sample data using logistic regressions (glm function in R) and receiver operating characteristic (ROC) curves (pROC library for R). The logistic regression coefficients c and m determine the probability of pest presence in the Chinese sample as P C = 1/(1 + exp(−(c + mP G ))). ROC curves describe the relationship between the true-positive rate (sensitivity, the fraction of presences correctly identified as presences) and false-positive rate (1 − specificity, where specificity is the fraction of absences correctly classified as absences) as the threshold for a binary classifier is decreased from 1 (classifying any presence probability less than 1 as absent) to zero (classifying any positive probability as present). A good predictor will have a high true-positive rate and low false-positive rate for any classification threshold, whereas a poor predictor will have roughly equal true and false-positive rates (i.e. be uninformative). The area under the curve (AUC) for the ROC curves gives the probability that, for a random pair of presence and absence observations, the presence probability will be greater for the presence than the absence (Jiménez-Valverde, 2012). Models with good discrimination ability should have AUC significantly greater than half.
For illustration, we identified probably present pests (PPP) as those which are currently unreported from a particular region, but for which P G > 0.75 in our weighted model. This threshold was chosen based on the Kent scale which suggests a probability of 0.75 as an event that would generally be described as 'probable' (Kent, 1994). This is an arbitrary definition but allows us to suggest some of the pests that PRA and phytosanitary activities should be focused upon.

| RE SULTS
Globally, P G increased significantly with presence in neighbouring regions, the area of host crops, the global prevalence of the pest and per capita GDP in both models (Table 1). P G declined with mean distance from the coast and known host crop genera per pest. The models explained similar fractions of the deviance, and had very similar ROC curves with AUC around 88% (Table 1). P G was always higher for the weighted model, because absences were down-weighted (i.e. fewer true zeros), but predictions for the two models were very highly correlated (r = 0.98). The models found the highest P G for Hemiptera and Lepidoptera, and lowest for Nematoda, Bacteria and Acari, compared with other taxonomic groupings.
For illustration, we defined a 'probably present pest' (PPP) as one unreported from a region, but with P G > 0.75 (using the weighted model). Overall, only 4,702 of 585,955 (0.8%) of all unreported pest-region combinations fell into this class (Table S1). The number of PPPs per pest category was greatest for Fungi (2,052) and Hemiptera (859). Overall, 86% of unreported pest-region combinations were predicted to be unlikely (P G < 0.25). China, India, the United States and Eastern Europe had the largest numbers of    We validated our models using published pest observations from the Chinese literature. Both models were significant predictors of pest presence/absence for 100 randomly sampled pest-province combinations, of which 27 were found to be present (Figure 3,  Figure 3).
Our analysis revealed gaps in the CABI database, which is commonly used for analyses of global pest distributions. Taking one important potato pest, late blight Phytophthora infestans (Oomycota), as an example, we predicted high presence probabilities (>0.75) for 10 provinces listed as not reporting this pest in the CABI database.
However, this pathogen has been reported present throughout the potato-growing regions of China, including Guangdong (Guo, Zhu, Hu, & Ristaino, 2010). Thus, many of the most common PPPs in China were also common globally.

| D ISCUSS I ON
The Chinese literature provided strong and significant support for the predictions of pest distribution models based upon host distribution, pest prevalence and other socioeconomic factors. China's growing economy is expected to lead to large influxes of invasive species, including pests, in coming years (Ding et al., 2008). Analysis of temporal trends in CABI pest observations show a relatively smooth increase in pests from 1950 to 2000, but the pattern for China is more complex, with a slow increase from 1950 until the late 1970s, a step increase, and then a more rapid growth in pest numbers from 1980 onwards . One potential determinant of this sudden acceleration is the strong support We identified a number of pests that were very likely to be present, and the majority of these PPPs were globally distributed and had wide host ranges.  reporting from these nations was less likely when they were part of the USSR.
Predictors like host availability, presence in neighbouring territories and global prevalence were expected to have positive relations with presence probability. The negative relation with distance from coast is likely to be related to import via shipping ports (Huang, Zhang, Kim, & Suarez, 2012;Liebhold et al., 2013), and supports the observation that islands report more pests than countries with land borders (Bebber, Holmes, Smith, et al., 2014). Detailed modelling of individual pest climate responses (Bregaglio, Cappelli, & Donatelli, 2012;Kriticos, Morin, Leriche, Anderson, & Caley, 2013) for such a large number of pests was beyond the scope of this study. Implicitly, we can assume that the presence of the host crop indicates that the climate is suitable for the pest (Paini et al., 2016), though we acknowledge that this is not necessarily the case (Berzitis, Minigan, Hallett, & Newman, 2014). The negative relationship with number of host genera per pest might suggest that host specialists are more likely to invade and establish than host generalists, once host availability has been taken into account. For the practical purposes of PRA, our models provide reliable probability estimates for the presence of unreported pests at subnational resolution, and we have provided a global list of the unreported pests whose presence is most likely (Table S2).
We addressed the issue of pseudoabsences in the CABI data by statistically weighting missing pest observations in proportion to the scientific output of the reporting nation, since scientific output had been confirmed as a strong determinant of total reported pest numbers (Bebber, Holmes, Smith, et al., 2014). Often, unreported pests are treated as true absences in pest risk analyses (Paini et al., 2016). The positive relation of GDP with presence probability supports our hypothesis that wealthy countries are more likely to detect and report pests (Bebber, Holmes, Smith, et al., 2014). Once observational bias is controlled for using scientific capacity-based weighting, per capita GDP becomes a weaker determinant. Our weighted model has similar overall explanatory power to our unweighted model. Nevertheless, the issue of observational biases related to country-level socioeconomic variation has been raised several times for various classes of organism (Bebber, Holmes, Smith, et al., 2014;Bebber, Ramotowski, & Gurr, 2013;Boakes et al., 2010;Jones et al., 2008;Pyšek et al., 2008;Westphal, Browne, MacKinnon, & Noble, 2008), and we, therefore, recommend the application of appropriate statistical controls when analysing datasets produced from reports of species presences (as opposed to distributional datasets derived from rigorous sampling protocols).
Our SDM was statistical, fitting response functions for various predictors to the probability of pest presence. Many SDM approaches exist, from highly mechanistic models based on pest biology and ecology (Bregaglio et al., 2012;Skelsey, Cooke, Lynott, & Lees, 2016) to purely statistical models that utilize only patterns in known distributions (Paini et al., 2010). The rarity of quantitative model input into PRAs is partly due to the scarcity of empirical data available on pest biology and epidemiology required to parameterize mechanistic models, and so key biological parameters are often inferred from known distributions (Robinet et al., 2012). This is particularly the case for newly emergent pathogens for which experimental investigations have not yet been conducted. The European Food Safety Authority has developed quantitative PRA guidelines that recommend modelling approaches and data sources for assessing invasion and establishment risk (Jeger et al., 2018), and application of these methods was attempted for Diaporthe vaccinii, a pest of blueberries (Jeger et al., 2017). However, most of the epidemiological data required for this pest was unavailable, and the risk assessment was thus based on expert opinion or data from related pests (Jeger et al., 2017). Epidemiological parameters can be poorly constrained even for long-established pests. For example, coffee leaf rust fungus (Hemileia vastatrix) has affected coffee production for more than a century, but a recent infection model relied upon temperature response functions derived from the single available study published three decades previously (Bebber, Castillo, & Gurr, 2016).
Initiatives such as the EU-funded PRATIQUE project (2008-11) have attempted to fill this knowledge gap and enable modelling by collating available ecophysiological data for insect pests (Baker, 2012).
While the advantages and disadvantages of the many different pest distribution and impact models continue to be researched and debated (Dormann et al., 2012;Robinet et al., 2012;Sutherst, 2014;Venette et al., 2010), it is clear that practical application of these cabi.org/about-cabi/who-we-work-with/key-donor s/ for full details.

AUTH O R CO NTR I B UTI O N
DB conducted the analyses and wrote the manuscript. EF and GH searched the Chinese literature. TH assisted with CABI data acquisition. All authors contributed ideas and edited the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
Pest distribution data are available with permission from CABI, Nosworthy Way, Wallingford OX10 8DE, UK. Sources for other data sets used in the analysis are given in the text.