A gap analysis modelling framework to prioritize collecting for ex situ conservation of crop landraces

Aim: The conservation and effective use of crop genetic diversity are crucial to overcome challenges related to human nutrition and agricultural sustainability. Farmers’ traditional varieties (“landraces”) are major sources of genetic variation. The degree of representation of crop landrace diversity in ex situ conservation is poorly understood, partly due to a lack of methods that can negotiate both the anthropogenic and environmental


| INTRODUCTION
The effective use of crop genetic resources, including both traditional farmer varieties (or "landraces") and wild relatives, is important in efforts to overcome challenges related to human nutrition and agricultural sustainability (Burke, Lobell, & Guarino, 2009; Esquinas-Alcázar, 2005; Khoury et al., 2016). Progress in plant breeding and crop diversification depends on understanding and utilizing the available genetic resources (Glaszmann, Kilian, Upadhyaya, & Varshney, 2010; Hajjar & Hodgkin, 2007). The erosion of genetic diversity within many common crops has occurred over the last century through a combination of land use change, habitat degradation and the ongoing adoption of improved crop varieties or the substitution of crop species by farming communities (Hoisington et al., 1999; van de Wouw, Kik, Hintum, Treuren, & Visser, 2010). In some crops, only a fraction of the genetic diversity once present is still found today in farmers' fields, for example, wheat landraces in the Fertile Crescent (Gepts, 2006; Harlan, 1975).
Consequently, ex situ crop genebanks have become essential not only for distributing genetic resources to various users (e.g. breeders, other genebanks), but also for conserving such resources (Gepts, 2006; Hoisington et al., 1999).
Understanding the representation of crop diversity in ex situ repositories provides a foundation for conservation planning (Castañeda-Álvarez et al., 2016; García, Parra-Quijano, & Iriondo, 2017; van Treuren, Engels, Hoekstra, & Hintum, 2009). Methods to assess the current degree of representation, and to inform further collecting efforts, have been increasingly developed for more than a decade [e.g. Rodrigues et al. (2004); Maxted, Dulloo, Ford-Lloyd, Iriondo, and Jarvis (2008)]. Because genetic data are generally lacking, these methods typically rely on ecogeographic approaches as a proxy for assessments of genetic diversity (Khoury et al., 2019; Ramirez-Villegas, Khoury, Jarvis, Debouck, & Guarino, 2010). Such methods have proved useful in estimating the representation of wild relatives and other wild species in genebanks in comparison with extant diversity in their natural environments (Castañeda-Álvarez et al., 2016; Khoury et al., 2019; Syfert et al., 2016). However, their application to cultivated plants, whose spatial distributions are determined by anthropogenic factors as well as environmental drivers, is limited (Fuller, 2007; Hilbert et al., 2017; Morris et al., 2013). This represents a critical gap, since cultivated materials are generally preferred over wild relatives for use by plant breeders (Camacho Villa, Maxted, Scholten, & Ford-Lloyd, 2005; Hammer, Knüpffer, Xhuveli, & Perrino, 1996).
Here, we present a conservation gap analysis modelling framework for cultivated crop diversity that improves on current ecogeographic methods, using landraces of the common bean (Phaseolus vulgaris L.) as a case study. As opposed to previous analyses of the distributions of cultivated crop diversity [e.g. Upadhyaya, Reddy, Irshad Ahmed, and Gowda (2012), Upadhyaya et al. (2017)], our methods explicitly aim to include anthropogenic drivers in the modelling of the distributions of landraces. The results predict geographic areas that are likely gaps in ex situ landrace conservation collections and provide metrics that can be used to track conservation progress. These results are supplemented with expert knowledge, which is vital for elucidating spatial patterns and drivers of range change that are difficult to model.
Common bean is the grain legume most widely consumed by humans, playing an essential role in food and nutritional security, particularly in Latin America and Sub-Saharan Africa (Beebe, 2012; Broughton et al., 2003). Two independent domestication events of wild P. vulgaris have been identified: one in Mexico and Central America, and the second in the Andes mountains of South America (Gepts, Osborn, Rashka, & Bliss, 1986). Significant movement of genetic material and gene exchange between genepools has occurred since domestication, with considerable overlap in current geographic distributions, both in the Neotropics and across other major cultivation areas (Singh, 1989; Singh, Gepts, & Debouck, 1991). These processes have resulted in recognized secondary regions of diversity in Brazil, Europe, Africa and Asia (Escribano & De Ron, 1991; Lobo Burle et al., 2011; Logozzo et al., 2007).
Globally, there are some 250 ex situ collections of cultivated P. vulgaris, with the largest and most diverse maintained at the International Centre for Tropical Agriculture (CIAT) with ~40,000 accessions, and the United States Department of Agriculture (USDA) National Genetic Resources Program with ~15,000 accessions (Debouck, 2014). Here, we assess the representation of common bean landraces in such major genebank collections, including estimating overall conservation and identifying gaps.

| MATERIALS AND METHODS
Our modelling framework first requires defining the study area, gathering landrace occurrence and characterization data, and compiling environmental and socioeconomic spatial predictor information. The modelling and conservation gap analysis is then performed in five main steps: (a) determining relevant landrace groups using the literature to develop and test classification models; (b) modelling the potential geographic distributions of these groups using the occurrence and predictor data; (c) calculating geographic and environmental gap scores for current genebank collections; (d) mapping ex situ conservation gaps; and (e) compiling expert inputs. The overall process is depicted in Figure 1.

| Study area
Crop landraces have been defined as "dynamic population(s) of a cultivated plant that has historical origin, distinct identity and lacks formal crop improvement, as well as often being genetically diverse, locally adapted and associated with traditional farming systems" (Camacho Villa et al., 2005; Casañas, Simó, Casals, & Prohens, 2017). A landrace can be further classified as autochthonous when grown in the original location where it developed its unique genetic and socioeconomic characteristics through grower selection, and allochthonous when introduced from another region and then locally adapted. "Secondary" landraces may also be recognized, developed by the formal plant breeding sector but now maintained through repeated farmer selection and seed saving (Zeven, 1998).
While landraces cultivated over time in any given location may possess novel traits useful for plant breeding, our distribution modelling method rests on the premise that these varieties have distinct, local environmental adaptations (see 2.4.1-2.4.2). As adaptation to environment is developed over time, the geographic areas where landraces have occurred the longest (the origins and primary regions of diversity) would be considered to have the most significant association between environmental adaptation and genetic variation (Khoury et al., 2016). For this reason, landrace distribution modelling may focus foremost on autochthonous ranges.
For our case study, we focused on the Americas as the centre of domestication and primary region of diversity for P. vulgaris (Gepts et al., 1986). We included all areas extending from the southern United States to central Chile and northern Argentina, including the Caribbean, as this broadly includes the two reported domestication events and distributions of the progenitor and close relatives of the species (Chacon, Pickersgill, & Debouck, 2005; Gepts et al., 1986). We also included Brazil since it is geographically close to the putative regions of domestication and because existing evidence suggests clear relationships between Brazilian bean landraces and Andean and Mesoamerican types (Lobo Burle et al., 2011; Lobo Burle, Fonseca, Kami, & Gepts, 2010).

| Landrace occurrence and characterization data
Our distribution modelling and conservation gap analysis framework requires geographic occurrence (presence) data for landraces and information on the locations where these landraces have been previously collected for conservation ex situ, as well as characterization data on the landrace accessions. To assess the world's common bean landrace collections, we compiled available genebank accession-level passport (i.e. site where collected) data from major online germplasm databases, including the Genesys plant genetic resources portal (Global Crop Diversity Trust, 2019). Additional occurrences were gathered from the Global Biodiversity Information Facility (GBIF) (GBIF.org, 2019), which contained 25,670 observations from herbaria, botanic gardens and other plant repositories, to provide independent data from non-genebank sources. We compiled the datasets into a single database and performed a thorough quality check of all records.
Duplicated observations were eliminated, with preference given to the original data source; for example, copies of USDA-GRIN or CGIAR records included in Genesys or WIEWS were discarded. Coordinates were corrected or, where correction was not possible, eliminated when latitude and longitude were equal to zero, fell in inland water bodies or in the ocean, fell in the wrong country, had an inverted sign in the latitude and/or longitude, or had low coordinate precision (i.e. fewer than two decimal places). Our full occurrence dataset for P. vulgaris is available in Dataset S1.
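The two coordinate checks above that need no external map layers (zero coordinates and low precision) can be sketched in Python as follows. This is a minimal illustration only: the function names are hypothetical, coordinates are assumed to arrive as text so that trailing decimals can be counted, and the water-body, wrong-country and inverted-sign checks are left out because they require polygon data.

```python
from decimal import Decimal, InvalidOperation


def decimal_places(text):
    """Count the digits after the decimal point in a coordinate string."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return 0
    if not value.is_finite():
        return 0
    return max(0, -value.as_tuple().exponent)


def coordinate_issues(lat_text, lon_text, min_decimals=2):
    """Return a list of quality flags for one record (empty list = no flag)."""
    try:
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return ["not numeric"]
    issues = []
    # Latitude and longitude both equal to zero usually mean a missing value.
    if lat == 0 and lon == 0:
        issues.append("zero coordinates")
    # Fewer than two decimal places is treated as low precision.
    if min(decimal_places(lat_text), decimal_places(lon_text)) < min_decimals:
        issues.append("low precision")
    return issues


print(coordinate_issues("4.6097", "-74.0817"))  # no flags expected
print(coordinate_issues("4.6", "-74.0817"))     # flagged for low precision
```

Flagged records would then be corrected by hand where possible and dropped otherwise, as described in the text.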

| Spatial predictors
With the aim of compiling a robust global dataset of important environmental and anthropogenic drivers of the geographic distributions of crop landraces, we gathered and/or calculated spatially explicit (gridded) information for a total of 50 potential predictors, including climate, topography, diversity and domestication and socioeconomic variables (Table S2.1). For climate, we used a total of 40 variables, derived from a combination of the WorldClim version 2 (Fick & Hijmans, 2017) and the Environmental Rasters for Ecological Modelling (ENVIREM) (Title & Bemmels, 2018) databases. We included topography from the Shuttle Radar Topography Mission (SRTM) dataset of the CGIAR-Consortium on Geospatial Information (CSI) portal (Jarvis, Reuter, Nelson, & Guevara, 2008; Reuter, Nelson, & Jarvis, 2007). Two crop genetic diversity and domestication proxy variables were included, namely the distance to known common bean wild relative populations and the distance to human settlements before AD 1500. Regarding socioeconomic variables (8 in total), we included datasets on the geographic distribution of ethnic groups (Weidmann, Rød, & Cederman, 2010); crop yield, harvested area and crop production quantity (You et al., 2017); population density (CIESIN, 2018); population accessibility (Nelson, 2008); distance to navigable rivers (Natural Earth, 2019); and percentage of area under irrigation (Siebert, Henrich, Frenken, & Burke, 2013).
All spatial predictor data were scaled to or computed on a common 2.5 arc-min grid, using the geographic coordinate system (GCS) with WGS84 as datum. A complete description of these data sources and their justification for inclusion is provided in Text S2.1 and Table S2.1. The full dataset of ecogeographic and socioeconomic variables is available in Dataset S1.

| Determination of landrace groups
Crop landraces are domesticated, locally adapted varieties of crops, developed through farmer selection over time in specific agricultural ecosystems (Camacho Villa et al., 2005; Jones et al., 2008) and, for most crops, are considered to number in the thousands (Harlan, 1975; Jones et al., 2008). Crop landraces are associated with specific local adaptation traits and farmer preferences, and an understanding of these drivers is important to modelling their potential distributions. Given the large number of landraces and the knowledge necessary to distinguish their biocultural and ecological differences, our method seeks a compromise between the recognition of this complexity and performance of spatial modelling at scales which are feasible and permit comparison with existing genebank collections.
Therefore, the first step of our modelling method was to identify recognized groups within the crop that could be tested for whether they have distinct environmental and socioeconomic niches. We used Google Scholar™ to identify and review publications that, through morphological, physiological, chemical, genetic, nomenclatural or other characters, establish or propose groups of landraces (e.g. by identifying genepools, races, domestication centre(s), genetic clusters or other acknowledged groupings) (Table S2.2).
We then used classification models to test the significance of these classifications. The classification models allowed us to determine whether the classes identified could be predicted on the basis of the spatial predictors from Section 2.3. This process used data from the occurrence database (if the distinguishing characters of the identified landrace groups were reported in the database) or from training datasets containing both characters and geographic coordinates, compiled from the literature review. For this analysis, we used random forest (RF) (Pal, 2005), support vector machine (SVM) (Meyer, Leisch, & Hornik, 2003), K-nearest neighbour (KNN) (Guo, Wang, Bell, Bi, & Greer, 2003) and artificial neural networks (ANN) (Dreiseitl & Ohno-Machado, 2002). The response variable in all models was the group to which a given accession was assigned, whereas the explanatory variables were the spatial predictors.
Models were combined into an ensemble using the mode (i.e. the most frequent predicted value amongst models) and tested using 15-fold cross-validation (80% training, 20% testing). We accepted a given classification if each of its classes was predicted with an average cross-validated accuracy of at least 80% (i.e. 8 of every 10 accessions are predicted correctly). Finally, we used the trained models to predict the corresponding class for any records in the database missing such information.
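The mode-based ensemble and the 80% acceptance rule can be illustrated with a short Python sketch. It assumes each of the four classifiers has already produced one predicted label per accession; the function names, the toy labels and the tie-breaking rule (ties go to the label seen first, i.e. the earliest model in the list) are illustrative assumptions, not details given in the text.

```python
from collections import Counter


def ensemble_mode(predictions_per_model):
    """Combine models by taking the most frequent label for each accession.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties are resolved in favour of the label encountered first (assumption).
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of accessions predicted correctly, computed for each class."""
    correct, total = Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def classification_accepted(accuracy_by_class, minimum=0.80):
    """Accept the classification only if every class reaches the minimum."""
    return all(acc >= minimum for acc in accuracy_by_class.values())


# Toy predictions for five accessions from four models (RF, SVM, KNN, ANN).
models = [
    ["Andean", "Andean", "Mesoamerican", "Mesoamerican", "Mesoamerican"],
    ["Andean", "Mesoamerican", "Mesoamerican", "Mesoamerican", "Andean"],
    ["Andean", "Andean", "Mesoamerican", "Andean", "Mesoamerican"],
    ["Mesoamerican", "Andean", "Mesoamerican", "Mesoamerican", "Mesoamerican"],
]
truth = ["Andean", "Andean", "Mesoamerican", "Mesoamerican", "Mesoamerican"]

combined = ensemble_mode(models)
accuracy = per_class_accuracy(truth, combined)
print(combined, accuracy, classification_accepted(accuracy))
```

In the full procedure the per-class accuracies would be averaged over the cross-validation folds before applying the acceptance rule.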
Variables used in the model were sub-selected from the environmental and socioeconomic predictors using a combination of the variance inflation factor (VIF) and a principal component analysis (PCA) to control for unwarranted model complexity and collinearity between explanatory variables (Warren & Seifert, 2011). We first removed any variables that did not contribute significantly (defined as contributing <15% to the first component) to the variance in the PCA and then discarded any variables with a VIF greater than 10 (Braunisch et al., 2013). The list of variables selected (or alternatively eliminated) for use in modelling is available in Table S2
Background points (pseudo-absences) were generated based on the three-step method of Senay, Worner, and Ikeda (2013). In short, we took a random sample of pseudo-absences from areas that (a) were within the same ecological land units [as reported by Sayre et al. (2014)] as the occurrence points, (b) were deemed potentially suitable according to a support vector machine (SVM) classifier that uses all occurrences and predictor variables and (c) were further than 5 km from any occurrence. The number of pseudo-absences drawn was equivalent to 10 times the total number of unique occurrences for a given landrace group.
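Step (c) of the pseudo-absence procedure and the 10:1 sampling ratio can be sketched as below. The sketch assumes that the candidate points have already passed steps (a) and (b), measures distance with the haversine formula on a spherical Earth, and uses hypothetical function names and toy coordinates.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def draw_pseudo_absences(candidates, occurrences, min_km=5.0, ratio=10, seed=0):
    """Sample pseudo-absences further than min_km from every occurrence.

    candidates and occurrences are lists of (lat, lon) tuples. The target
    sample size is ratio times the number of unique occurrences; fewer
    points are returned if not enough candidates are eligible.
    """
    unique_occurrences = sorted(set(occurrences))
    eligible = [
        c for c in candidates
        if all(haversine_km(c[0], c[1], o[0], o[1]) > min_km
               for o in unique_occurrences)
    ]
    wanted = ratio * len(unique_occurrences)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(eligible, min(wanted, len(eligible)))


occurrences = [(4.60, -74.08)]
candidates = [(4.60, -74.08), (4.61, -74.08), (5.60, -74.08), (4.60, -75.08)]
sample = draw_pseudo_absences(candidates, occurrences)
print(sorted(sample))
```

The first two candidates lie within 5 km of the occurrence and are excluded; the remaining two are both returned because fewer than ten eligible points exist.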
MaxEnt models were fitted through a fivefold (K = 5) cross-validation process in which 80% of the occurrences (and pseudo-absences) were used to train the models, and the remaining 20% were used for testing. For each fold, we calculated the area under the receiver operating characteristic curve (AUC), sensitivity, specificity and Cohen's kappa as measures of model performance. To create a single prediction that represents the probability of occurrence for the landrace group, we computed the median across models. Finally, areas with a probability above the threshold that maximizes the sum of sensitivity and specificity were retained as the final Landrace Distribution Model (LDM).
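The median prediction across folds and the threshold that maximizes the sum of sensitivity and specificity can be sketched as follows. The function names and toy values are hypothetical, candidate thresholds are taken from the observed probabilities, and a pixel is counted as presence when its probability is greater than or equal to the threshold (an assumption about how ties at the threshold are handled).

```python
from statistics import median


def median_prediction(fold_predictions):
    """Per-pixel median across folds; each fold is a list of probabilities."""
    return [median(values) for values in zip(*fold_predictions)]


def max_sss_threshold(probabilities, labels):
    """Threshold maximizing sensitivity + specificity.

    labels: 1 for occurrence, 0 for pseudo-absence. If several thresholds
    tie, the lowest one is kept.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_score = None, -1.0
    for threshold in sorted(set(probabilities)):
        true_pos = sum(1 for p, y in zip(probabilities, labels)
                       if y == 1 and p >= threshold)
        true_neg = sum(1 for p, y in zip(probabilities, labels)
                       if y == 0 and p < threshold)
        score = true_pos / positives + true_neg / negatives
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.3, 0.2]
labels = [1, 1, 1, 0, 1, 0, 0]
threshold = max_sss_threshold(probs, labels)
ldm = [p >= threshold for p in probs]  # binary distribution map
print(threshold, ldm)
```

With these toy values the selected threshold is 0.7, where three of four occurrences and all three pseudo-absences are classified correctly.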

| Calculating geographic and environmental gap scores
We developed three scores that compare the geographic and environmental diversity in existing ex situ conservation collections against the LDM, revealing ex situ conservation gaps.
The accession connectivity score (S_CON) was formed with Delaunay triangulation (Lee & Schachter, 1980), that is, triangles linking every three (closest) accession occurrence locations, using the "deldir" R package (Turner, 2019). For each 2.5 arc-min pixel within each Delaunay triangle, we computed S_CON following Equation 1.
where A_T−i is the area (km²) of the triangle where the pixel is located (i.e. the i-th triangle); max(A_T−i, …, A_T−n) is the area of the largest triangle amongst all triangles; D_C−i is the Euclidean distance from the pixel to the centroid of the triangle where it is located, normalized by the longest such distance (using all pixels) within the given triangle; and D_NV−i is the Euclidean distance from the pixel to the nearest vertex of the triangle where it is located, normalized by the longest such distance (using all pixels) within the given triangle.
From Equation 1, it is clear that S_CON for any given pixel is largest (i.e. increases the likelihood of gaps) when the triangle is large (i.e. high area), when the pixel is close to the centroid of the triangle (i.e. where there are no accessions) and when the distance to the vertices (where the accessions are located) is high.
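The three ingredients of the connectivity score can be computed for the pixels of a single triangle as in the Python sketch below. Because the way Equation 1 combines the terms is not reproduced in this text, the sketch deliberately stops at the three normalized terms and does not return a final score. It assumes planar coordinates in km, and all names are hypothetical.

```python
import math


def triangle_area(a, b, c):
    """Area of a triangle from its three (x, y) vertices (shoelace formula)."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def connectivity_terms(triangle, pixels, largest_area):
    """Return (area_term, centroid_term, vertex_term) for each pixel.

    area_term:     triangle area divided by the largest triangle area
    centroid_term: distance to the centroid, scaled by the longest such
                   distance amongst the pixels of this triangle
    vertex_term:   distance to the nearest vertex, scaled the same way
    """
    cx = sum(v[0] for v in triangle) / 3.0
    cy = sum(v[1] for v in triangle) / 3.0
    d_centroid = [math.dist(p, (cx, cy)) for p in pixels]
    d_vertex = [min(math.dist(p, v) for v in triangle) for p in pixels]
    # Guard against division by zero when all distances are zero.
    max_centroid = max(d_centroid) or 1.0
    max_vertex = max(d_vertex) or 1.0
    area_term = triangle_area(*triangle) / largest_area
    return [(area_term, dc / max_centroid, dv / max_vertex)
            for dc, dv in zip(d_centroid, d_vertex)]


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
pixels = [(0.0, 0.0), (4.0 / 3.0, 1.0)]  # one vertex and the centroid
terms = connectivity_terms(triangle, pixels, largest_area=12.0)
print(terms)
```

In line with the description above, a pixel sitting on an accession (a vertex) has a vertex term of zero, whereas the pixel at the centroid has a centroid term of zero and the largest vertex term.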
The accession accessibility score (S_ACC) was calculated by computing travel time from each pixel within the LDM to the nearest genebank accession, following Weiss et al. (2018). Travel time was in this case estimated from the distance and the speed of travel (defined by a friction surface). Once the travel time from each location was computed, it was normalized by dividing pixel values by the longest travel time within the LDM, to derive a metric in the range 0-1, with high values reflecting long travel time.
The environmental score (S_ENV) measures how well the environments where the landraces are distributed are represented in ex situ collections. We first performed a hierarchical clustering analysis (Ward's method) for the pixels in the LDM using the predictor variables used to construct the LDM. On a per cluster basis, we computed the Mahalanobis distance between each pixel and the environmentally closest germplasm accession. The distance was finally normalized (0-1), with high values indicative of large distances to sites with similar environments that have previously been collected for ex situ conservation.

| Mapping ex situ conservation gaps
Spatial ex situ conservation gaps were calculated from the conservation gap scores using a cross-validation procedure to derive a threshold for each landrace group and each of the gap scores (S_CON, S_ACC, S_ENV). To do so, we created synthetic (artificial) gaps by removing genebank occurrences in five randomly chosen circular areas of 100 km radius within the LDM. We then tested whether these synthetic gaps could be predicted by our method and determined the threshold value of each gap score that would maximize the prediction of these synthetic gaps. Performance for each of the five synthetically created gaps was assessed using the AUC, sensitivity and specificity. Finally, the average threshold value of each gap score, cate gaps for all metrics (highest confidence gaps). We termed this 3-value area our "final gaps map." Once the final gaps map was calculated, we estimated the coverage of existing germplasm collections. The coverage is simply the area considered as gap divided by the total area of the LDM. We computed only the coverage resulting from the agreement of the three gap metrics, as an upper-level coverage estimation.
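The overlap of the three thresholded scores can be illustrated with the sketch below. It reads the passage above as follows: each score is turned into a binary gap map with its own threshold, the three maps are summed, and pixels with a value of 3 form the final gaps map. That reading, the use of "greater than or equal to" at the threshold, the equal-area treatment of pixels, the function names and the toy numbers are all assumptions made for illustration.

```python
def gap_agreement(scores, thresholds):
    """Count, for each LDM pixel, how many gap scores flag it as a gap.

    scores:     dict mapping score name to a list of per-pixel values
                (same pixel order in every list)
    thresholds: dict mapping score name to its cut-off value
    """
    names = sorted(scores)
    n_pixels = len(scores[names[0]])
    return [sum(scores[name][i] >= thresholds[name] for name in names)
            for i in range(n_pixels)]


def high_confidence_gap_share(agreement, n_scores=3):
    """Share of LDM pixels flagged by all scores (pixels of equal area)."""
    return sum(1 for a in agreement if a == n_scores) / len(agreement)


scores = {
    "S_CON": [0.9, 0.2, 0.7, 0.8],
    "S_ACC": [0.8, 0.9, 0.1, 0.6],
    "S_ENV": [0.7, 0.3, 0.2, 0.9],
}
thresholds = {"S_CON": 0.5, "S_ACC": 0.5, "S_ENV": 0.5}

agreement = gap_agreement(scores, thresholds)
share = high_confidence_gap_share(agreement)
print(agreement, share)
```

On a real grid the pixel counts would be replaced by pixel areas, since 2.5 arc-min cells shrink towards the poles.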

| Compilation of expert inputs
Gap analysis is a tool for assessing collection completeness as well as for planning collecting (García et al., 2017; Marinoni, Bortoluzzi, Parra-Quijano, Zabala, & Pensiero, 2015). Collecting based on model predictions may require extensive discussion with local institutions and crop experts including botanists, collectors, agronomists and breeders. This is because agricultural landscapes are highly dynamic, and areas predicted with gaps may have been subject to recent land use change, varietal replacement by improved or foreign material or significant genetic drift, resulting in loss of uncollected genetic material predicted to be of value (Hammer et al., 1996; van Heerwaarden, Hellin, Visser, & Eeuwijk, 2009; van de Wouw et al., 2010). This means that while the "final gaps map" resulting from Section 2.4.4 provides a detailed regional picture of collecting priorities, the planning of collecting missions will effectively require discussion with experts and further analysis (Greene et al., 1999a, 1999b; Jarvis et al., 2005). In this sense, gap analysis results are a discussion support tool that aims at guiding, rather than prescribing, where and how collecting may be done. Here, we illustrate this by conducting a semi-structured interview process with two relevant crop landrace experts.
These inputs were used to add value to the model results.

| Environmentally distinguishable groups of common bean landraces
Our literature review indicated that a single major classification system based on genetic, morphological and physiological characteristics has been accepted for common bean landraces. This system, first proposed by Singh et al. (1991), classifies beans into two genepools: Andean and Mesoamerican. The Andean genepool, derived from the domestication event proposed to have occurred around Peru, Chile and Bolivia, is composed of typically larger-seeded genotypes. The Mesoamerican genepool, derived from the domestication event in Mexico and Central America, is typically composed of smaller-seeded genotypes (Singh et al., 1991). These and subsequent authors divide these genepools into races according to morphological criteria, agro-ecological adaptation and genetic data (see Table S2.2 for a complete list of publications reviewed).
We tested a variety of accession-level data pertinent to common bean genepools, including seed protein type; seed weight, colour, shape and brightness; and landrace names. Based on degree of acceptance in published literature and availability of accession-level data with geographic coordinates, we ultimately based our training data on genepool designations given in the CIAT accessions dataset and specific accession numbers gathered from the reviewed literature (Table S2.2).
Our average classification accuracy at the genepool level was 86% (88.3% for Andean and 85% for Mesoamerican landraces), indicating that these two genepools have distinct environmental and socioeconomic signatures, with Mesoamerican beans being present in lower, drier and hotter places compared to Andean beans. Identified predictors (see Figure S2.1) for the classification models agree with previously reported predictors of domesticated and wild bean distributions (Cortes, Monserrate, Ramirez-Villegas, Madrinan, & Blair, 2013; Ramirez-Villegas et al., 2010). At the race level, classification accuracy was low (58.5% as a mean across all races), and this classification was hence deemed not informative. Based on these results, we concluded that the genepool level was the most appropriate for all subsequent distribution modelling and conservation gap analysis steps. Hence, in all following sections we show results separately for Andean and Mesoamerican common bean landrace groups.

| Geographic distributions of common bean landrace groups
Figure 2 shows the predicted geographic distributions of Andean (Figure 2a) and Mesoamerican (Figure 2b) landraces. Cross-validated MaxEnt models performed well, with mean AUC values of 0.973 (Andean) and 0.996 (Mesoamerican). The MaxEnt-based LDMs also indicated that 23 variables were important for the geographic prediction of landrace presence. Importantly, seven of these are nonclimatic variables (Table S2.1), and amongst these, we find that accessibility and the geographic distribution of ethnic groups contribute substantially to the model.
As expected, Andean landraces were predicted to be mostly distributed across the Andes mountains and to a lesser extent in Mexico and Central America. The converse was true for Mesoamerican landraces. Andean landraces were also predicted to occur in Brazil, which is considered a secondary diversity centre for common beans (Lobo Burle et al., 2010, 2011). Overlap was particularly evident in the geographic intermediate zone in Central America (Beebe, Rengifo, Gaitan, Duque, & Tohme, 2001; Beebe et al., 2000) and in some areas of Peru.

| Conservation gap maps for common bean landraces
Conservation gap maps, displaying the overlap of results for the three gap scores per pixel, are shown in Figure 3. In Mexico, where Andean bean variation is considered less diverse compared to South America (Becerra Velasquez & Gepts, 1994; Beebe et al., 2001), gaps were identified in the states of Oaxaca and, to a lesser extent, Chiapas. Gaps were also predicted for Andean beans in Guatemala and Panama.
For the Mesoamerican genepool, the largest overlapping predicted gap was found in the area around Belize-Guatemala-southern Mexico (state of Campeche). Smaller overlapping gap areas were predicted in the states of San Luis Potosi, Jalisco and Sinaloa in Mexico. Across South America, southern Peru is predicted to be a gap.

| Expert inputs for common bean landrace distributions and conservation gaps
To illustrate how gap analysis results may be used to discuss collecting priorities, semi-structured interviews were carried out with two national and international Phaseolus scientists from the study region. One expert, Daniel G. Debouck (DGD), has been a member of many collecting missions for the genus across many countries in the Americas and is an expert in bean taxonomy, ecology, domestication, diversity and conservation (both in situ and ex situ) (Freytag & Debouck, 2002). He discussed both Andean and Mesoamerican beans for the entire Americas. Many areas were also identified by the two experts as unlikely to be considered collecting priorities. There were many areas, especially for Andean beans, where the experts indicated that landraces have likely already been lost due to the replacement of traditional cropping practices. This is the case in northern Chile and in southern and coastal Peru, where beans have been replaced by grapes grown for wine and pisco. Other areas were considered by experts not to be collecting priorities since these are mostly "documentation" gaps (e.g. central Brazil for Andean beans); this is because these materials are mostly in national collections, and passport information (including coordinates of collection sites) from these collections was not available or was of insufficient quality for inclusion in our analyses.

| DISCUSSION
Here, we documented the development of a novel modelling framework to predict the distributions of crop landraces and to identify gaps in ex situ germplasm collections in relation to geographic and environmental variation in their distributions. We base our framework on the rationale that the distributions of landraces can be predicted using environmental and socioeconomic drivers, and that important conservation gaps can be identified by characterizing the geographic (accessibility and connectivity) and environmental space across which previous collecting has been carried out.
Our analysis suggests that both genepools of P. vulgaris are relatively well conserved and that progress towards comprehensive representation ex situ may be relatively fast if targeted collecting is performed in the areas outlined in the results. This contrasts with results for common bean wild relatives, for which research indicates that about two-thirds of the wild species in the genus need further conservation action, and about half are considered high priority for further collecting (Castañeda-Álvarez et al., 2016; Ramirez-Villegas et al., 2010).
For Andean beans, gaps were predicted throughout most bean-producing countries in South America, with the highest priority being Chile, Peru, Colombia and specific spots in the Venezuelan Andes. For Mesoamerican landraces, the results target regions of Mexico, Belize, Guatemala and to a lesser extent South America (mostly Peru) for further collecting. While current common bean collections already hold substantial diversity from across the Americas (Beebe et al., 2000, 2001), our results, supplemented by expert opinion, indicated that further collecting is warranted, especially where valuable traits such as phosphorus use efficiency (Beebe, Lynch, Galwey, Tohme, & Ochoa, 1997) or heat stress tolerance (CGIAR, 2015) may be found.
Our ongoing review of other crop landraces indicates that the classification approach, based on recognized groups, can be widely applicable to other crops (van Heerwaarden et al., 2011; Lasky et al., 2015; Ndjiondjop et al., 2018). Moreover, the continuous generation of new genetic diversity data and related knowledge (Crossa et al., 2016; Halewood et al., 2018) will facilitate the further application of our methods, which are ultimately dependent on the availability of robust classification, occurrence and characterization data.
While our framework contributes to revealing existing gaps in current germplasm collections and to highlighting geographic areas where novel diversity may be collected, the question remains as to the extent to which the results can support on-the-ground collecting work. Our discussion with experts indicates that priorities for collecting can be drawn using our predicted gap maps. Moreover, previous ecogeographic analyses have proven useful for collecting planning (García et al., 2017; Jarvis et al., 2005; Marinoni et al., 2015). To further translate our results for action, designing tools for real-time collecting mission support (e.g. route tracing) that combine the outputs with existing technologies for map visualization and navigation would be advantageous.

| Challenges and limitations to landrace distribution modelling and conservation gap analysis
Predicting the distributions of cultivated plants, whose ranges are determined by anthropogenic as well as environmental drivers, presents a challenge that has not been fully resolved in geospatial sciences. While we attempted to gather the widest range of quality input occurrence and predictor data and used state-of-the-art approaches to ensure high species distribution model (SDM) performance, several further improvements can be suggested.
With regard to occurrence information, particularly for genebank collections, we incorporated data from the two central global repositories for such information (Genesys and WIEWS) and in addition (due to our focus here on common bean) ensured the full compilation of data from the world's two largest P. vulgaris collections (CIAT and USDA). This said, these sources are not fully representative of all common bean collections worldwide, including collections such as the Agricultural Research Institute (CIAP) in Cuba. Ongoing initiatives such as Genesys, which lists passport and (eventually) characterization data for many genebanks in a single location (Global Crop Diversity Trust, 2019), may help resolve this data challenge in the future. On the other hand, national policies influencing germplasm distributions hinder the international accessibility of many such "low-visibility" collections (Castiñeiras, Esquive, Lioi, & Hammer, 1991; Lobo Burle et al., 2011).
We also note that coordinate information, which is an essential input into our methods, is missing for many current genebank accessions. Further efforts to georeference records missing coordinates but possessing locality information, and to make this information easily available online, will facilitate a more robust assessment of the state of conservation of crop landraces ex situ.
Distributions of crop landraces are influenced by factors beyond the environmental and socioeconomic predictors used here. These may include other abiotic (e.g. soil parent material and other edaphic characteristics), biotic (e.g. mycorrhizae, pathogens and pollinators), and agriculturally relevant socioeconomic (e.g. farm sizes and farming systems) factors. Further development of high-resolution global datasets will be needed to incorporate such information into our analyses. Similarly, we note that model uncertainty can be a challenge and highlight the need to use model results as a "discussion support" tool to prioritize collecting. Finally, while we employ a widely used distribution modelling algorithm, it is possible that incorporating other methods, or forming ensembles of multiple methods, could improve our prediction of gaps (Grenouillet, Buisson, Casajus, & Lek, 2011).

| Landrace conservation gap analysis for global targets
The high value of crop landrace diversity in breeding programmes and for farm-level resilience (Camacho Villa et al., 2005; van Etten et al., 2019; van de Wouw et al., 2010), and the evident erosion of these resources in their primary and secondary centres of diversity (van Heerwaarden et al., 2009; Mekbib, 2008), justify urgent action to secure ex situ the diversity of landraces still cultivated by farmers and, in addition (though not discussed in this article), to invest in farmer-based (i.e. in situ/on farm) conservation (Bellon, Dulloo, Sardos, Thormann, & Burdon, 2017). The United Nations Sustainable Development Goal (SDG) 2.5, the Convention on Biological Diversity (CBD) Strategic Plan for Biodiversity 2011-2020 Aichi Biodiversity Target 13 (CBD, 2010a) and Global Strategy for Plant Conservation (GSPC) Target 9 (CBD, 2010b), and Article 5 of the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA) (FAO, 2002) all discuss and/or establish targets for the maintenance of genetic diversity of cultivated plants and their wild relatives, both in situ and ex situ. Recently, Khoury et al. (2019) proposed an indicator to track the conservation of useful wild plants, which furthers tested gap analysis methodologies for wild flora (Ramirez-Villegas et al., 2010). Here, we developed a coverage metric that, if implemented for a sufficiently large number of crops, could be used to track progress towards the conservation of cultivated plants for SDG 2.5, Aichi 13 and other important international goals.

ACKNOWLEDGEMENTS
This work was carried out under the CGIAR Genebanks Platform (https://www.genebanks.org). The CGIAR Genebanks Platform enables CGIAR Research Centers to fulfil their legal obligation to conserve and make available accessions of crops and trees on behalf of the global community under the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA). Authors thank Andy Jarvis from CIAT, Chris Richards from USDA and Paul Evangelista from Colorado State University for input on the methodology during early stages of this project, and Angela M. Hernández from CIAT for providing bean passport and phenotypic characterization data for the analyses. Authors also thank CGIAR Genebank

FIGURE 1 Conservation gap analysis modelling framework implemented in this study

FIGURE 2 Predicted geographic distributions of Andean (a) and Mesoamerican (b) common bean (Phaseolus vulgaris L.) landrace groups

FIGURE 3 Final gaps map for Andean (a) and Mesoamerican (b) common bean (Phaseolus vulgaris L.) landrace groups. Red indicates areas where the three gap scores ($S_{CON}$, $S_{ACC}$, $S_{ENV}$) agree in identifying a gap

and the United Nations Food and Agriculture Organization World Information and Early Warning System on Plant Genetic Resources for Food and Agriculture (WIEWS) (FAO, 2019). To ensure inclusion of the crop's major germplasm collections, we specifically gathered occurrence and characterization data from the CIAT database (CIAT, 2018), freely available at and from the United States Department of Agriculture (USDA) Genetic Resources Information Network (GRIN)-Global (USDA ARS NPGS, 2018).

We tried different model configurations (i.e. only climate, only non-climate and both) but present only the best-performing one (i.e. where all variables are used). Other results are presented in Text S2.2.

$\max A_{T-i}, \cdots, A_{T-n} * 1 - D_{C-i} * D_{NV-i}$

maximizing the prediction of the synthetic gaps (balanced with minimizing false positives), was used to discretize the gap score datasets into areas with a high priority for further collecting (areas with gap score above the threshold, assigned a value of 1) as opposed to relatively well-conserved areas (areas with gap score below the threshold, assigned a value of 0). We then summed the three binary gap score maps, resulting in a map with values from 0 to 3. Areas with a value of 0 indicate that there are no accession connectivity, accessibility or environmental gaps (i.e. well-conserved areas); areas with a value of 1 indicate that gaps exist due to any one of accession connectivity, accessibility or environment (low confidence gaps); areas with a value of 2 indicate that gaps exist due to two metrics (medium confidence gaps), and values of three indi-

Figure S2.2 shows the individual gap scores, whereas Figure S2.3 shows model performance and coverage estimation. Overall gaps are larger for Andean compared to Mesoamerican beans, with representation of their distributions in genebanks estimated at 78.5% for the Andean and 98.2% for the Mesoamerican genepool. There is significant agreement amongst the gap areas identified by the accessibility, connectivity and environmental scores, and all performed well at predicting gaps. For Andean beans, overlapping gap areas were found in the northern Venezuelan Andes, the Santander department in Colombia, specific pockets in the Andean hillsides between the Central and East cordillera in Colombia, the highlands of Ecuador, several areas in western and southern Peru, a major area in northern and central Chile, and central Brazil. In Mexico and Central America, noting that
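The discretizing and summing of the three gap score maps described in the text can be illustrated with a minimal sketch. It assumes the connectivity, accessibility and environmental scores are available as equally shaped NumPy arrays and that one threshold per score has already been chosen. The function name `combine_gap_scores`, the toy values and the thresholds of 0.5 are hypothetical and are not taken from the study. How cells exactly equal to the threshold are treated is also an assumption (here they count as well conserved).

```python
import numpy as np

def combine_gap_scores(s_con, s_acc, s_env, thresholds):
    """Binarize each gap score at its threshold and sum into a 0-3 map.

    A cell scores 1 for a metric when its gap score is above that metric's
    threshold, and 0 otherwise (assumption: ties and NaN cells score 0).
    """
    scores = (s_con, s_acc, s_env)
    binary = [
        (np.asarray(score) > threshold).astype(np.uint8)
        for score, threshold in zip(scores, thresholds)
    ]
    # Sum across the three binary maps: 0 = no gap, 3 = all three scores agree.
    return np.sum(binary, axis=0)

# Toy 2 x 2 rasters (hypothetical values, for illustration only).
s_con = np.array([[0.90, 0.20], [0.60, 0.10]])
s_acc = np.array([[0.80, 0.70], [0.30, 0.10]])
s_env = np.array([[0.95, 0.40], [0.20, 0.05]])

gap_map = combine_gap_scores(s_con, s_acc, s_env, thresholds=(0.5, 0.5, 0.5))
print(gap_map)
```

In this toy example the top-left cell exceeds all three thresholds and therefore receives a value of 3, while the bottom-right cell exceeds none and receives 0. With real rasters, the per-score thresholds would come from the threshold selection step described in the text rather than from a fixed value.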