The future of legume genetic data resources: Challenges, opportunities, and priorities

Legumes, comprising one of the largest, most diverse, and most economically important plant families, are the subject of vibrant research and development worldwide. Continued improvement of legume crops will benefit from the recent proliferation of genetic (including genomic) resources, but the diversity, scale, and complexity of these resources present challenges to those managing and using them. A workshop held in March of 2019 addressed questions of data resources and priorities for the legumes. The workshop identified various needs and recommendations: (a) Develop strategies to effectively store, integrate, and relate genetic resources collected in different projects. (b) Leverage information collected across many legume species by standardizing data formats and ontologies, improving the state of metadata about datasets, and increasing use of the FAIR data principles. (c) Advocate for the critical role that curators exercise in integrating complex datasets into databases and adding high-value metadata that enable downstream analytics and facilitate practical applications. (d) Implement standardized software and database development practices to best leverage limited developer time and expertise gained from the various legume (and other) species. (e) Develop tools and databases that can manage genetic information for the world's plant genetic resources, enabling efficient incorporation of important traits into breeding programs. (f) Centralize information on databases, tools, and training materials and establish funding streams to support training and outreach.


| INTRODUCTION
Legumes (Fabaceae) comprise the third largest plant family, with more than 20,000 species (LPWG, 2017). Roughly two dozen crop legumes are critical within food systems worldwide, because of their overall (Meyer, DuVal, & Jensen, 2012; Smýkal et al., 2015; Smýkal, Nelson, Berger, & von Wettberg, 2018). Continued development of legume varieties is essential for addressing new environmental challenges due to climate change and needs for improved production and quality traits. The NSF and USDA funded a planning session in late 2004 to develop objectives and goals for cross-legume genomics research with the participation of approximately 50 legume researchers. The meeting was named "Cross-legume Advances Through Genomics" (CATG) and resulted in a white paper and meeting report (Gepts et al., 2005), which laid out strategic goals over a 3-, 5-, and 10-year timeline.
In March of 2019, 15 years after the initial 2004 meeting, a follow-up workshop was held, with the participants comprising the Legume Genomic Data Working Group, to assess the state of legume genomics research and to develop a strategic plan for the coming decade in this field, with support from the NSF (Legume Federation project, award 1444806), the USDA, and the Noble Research Institute. Below, we report the main recommendations from this meeting.
Data curation is critical for the quality and longevity of databases and repositories to ensure the maintenance of the research infrastructure for legume and plant biology research. Skilled data curation is critical for uploading datasets that meet thresholds for experimental design, adequate replication for statistical inference, and suitable protocols prior to broader use by the community.
Recommendations:
• Recognize data management and curation as a complex, high-level role, requiring specific and significant budgeting in proposals.
• Give specialized training to data curators, including in genomic technologies, data handling and analysis, and database management.
• Promote the usage of FAIR data standards. The data management component of grant proposals should include funding allocated to collecting and storing data according to FAIR standards, as well as the training of students and researchers in proper data handling and preparation methods.

| Training, outreach, and documentation to better integrate bioinformatics with plant biology and breeding
Recommendations:
• Foster increased communication between developers, researchers, collection curators, biologists, breeders, and other end-users, to ensure that resources and visualizations can address practical applications.
• Encourage annual meetings of PIs, postdoctoral fellows, and students from all plant disciplines and groups involved in breeding to discuss ideas, review available resources and identify research gaps and opportunities.
• Encourage video and static tutorials from bioinformatics developers when new tools are developed and released.

| General development needs and practices: Formats, standards, and computational resources
Challenges include the difficulty of making user interfaces both easy to use and sufficiently powerful in terms of the scale and speed of analysis. Further, web technologies and underlying frameworks continue to change, and require ongoing software and security updates for any online database or website. Facilitating interoperability among websites adds a further level of complexity to the software engineering challenge. Some of these objectives and challenges have been further described by the AgBioData consortium (Harper et al., 2018).
Recommendations:
• Promote greater use of standard APIs by genomic data portals to facilitate cross-site access and data sharing.
• Utilize common user interfaces and tools to ease the burden on users. Encourage standardization of visualization methods, data storage methods, and frameworks for making genetic and genomic data available.
• Increase access to computational resources with sufficient storage, memory, and processor capacities for tasks such as genome assembly and annotation, with the capacity to support large-scale queries or exploration of whole genome comparisons among multiple species.
• Make analysis tools that can be used in stand-alone installations, e.g., through containerization, to enable use in regions with limited or unreliable internet access, such as in remote field locations and developing countries.

| Overview
It is now possible, with modest funding, to generate a high-quality draft genome assembly for most species (the exceptions being unusually large genomes like faba bean, Vicia faba, or genomes with particular complexities, such as alfalfa, Medicago sativa, whose autotetraploid genome has made assembly extra challenging). Similarly, it has become relatively straightforward to generate gene predictions, transcriptome assemblies, RNA-seq atlases, large marker sets, and other next-generation sequencing-enabled datasets. After such a resource is generated, the best practice is to make it available in an accepted (and validated) format, with sufficient description, using established "minimum information standards" for the data, e.g., MIAME (Brazma et al., 2001) or MIAPA (Leebens-Mack et al., 2006), and then to deposit it in a permanent and public repository such as GenBank or a generalist repository with DOI-issuing capability such as Data Dryad or Zenodo. For most of these steps, format validators are useful but human curation is required to assess, describe, and often to correct the computational products.
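The role of a format or metadata validator, as distinct from human curation, can be illustrated with a minimal sketch. The list of required fields below is a hypothetical checklist invented for illustration; actual minimum information standards such as MIAME or MIAPA define their own required elements, and the record values are placeholders.

```python
# Hypothetical minimum-information checklist (an assumption for illustration);
# real standards such as MIAME or MIAPA define their own required fields.
REQUIRED_FIELDS = ("species", "accession", "data_type", "method", "repository", "license")


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not str(record.get(field, "")).strip()]


# Placeholder record describing a dataset prior to deposition.
record = {
    "species": "Medicago sativa",
    "accession": "example-accession-1",  # placeholder identifier
    "data_type": "genome assembly",
    "method": "",                        # present but empty, so it is reported
    "repository": "Zenodo",
}

print(missing_fields(record))  # ['method', 'license']
```

A check of this kind can only flag what is absent; judging whether the supplied descriptions are accurate and sufficient remains a curation task.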
The proliferation of datasets raises additional questions and challenges: How should multiple, related resources (e.g., multiple genome assemblies for a species) be handled? In what ways can they be usefully compared and integrated (e.g., into a pan-genome)? What kinds of metrics are most useful for describing the characteristics of a genomic resource? What standards of quality should be met before an analysis combining disparate resources is likely to yield insights into underlying biology rather than differences in technical approach?
What types of evidence and time points should be considered to evaluate gene-specific differences in genotypes (e.g., susceptible vs.

| Assembly and annotation quality and consistency
Quality metrics for reference genome assemblies and their annotations are often not provided by the sites housing them, and current assembly statistics for quality are generally insufficiently informative.
In particular, "reference" and "draft" are poorly defined, and measures like N50 and L50 are inconsistently or interchangeably used. Efforts to define lineage-appropriate core gene families for assessment of annotation completeness are useful, e.g., BUSCO (Waterhouse et al., 2017) and coreGF (Bel et al., 2012; Veeckman, Ruttink, & Vandepoele, 2016), but interpretation of results using these standards is complicated by the history of polyploidy in legumes. Alignment of genome assemblies with genetic linkage maps or optical maps can provide an additional resource for comparisons and can be used to identify contigs or scaffolds that correspond to the same chromosome or help identify chimeric assemblies.
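Because N50 and L50 are easily confused, a short sketch may help keep them apart: N50 is a length, whereas L50 is a count. The function name and the toy contig lengths below are illustrative assumptions, not taken from any particular assembly tool.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is a length: the size of the contig at which the running total,
    taken from longest to shortest, first reaches half of the assembly size.
    L50 is a count: the number of contigs needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy assembly of five contigs totalling 300 units.
n50, l50 = n50_l50([100, 80, 60, 40, 20])
print(n50, l50)  # 80 2
```

Reporting both values together with the total assembly size and the unit (contigs or scaffolds) avoids most of the ambiguity described above.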
Not every assembly or annotation needs to be of the highest quality; a fragmentary draft assembly may be sufficient to align sequencing reads to identify sequence variants. Nevertheless, low-quality assemblies and annotations can cause problems if inappropriately integrated into other analyses; for example, rough annotations may contain chimeric, fragmentary, haplotypic, or transposon-derived gene calls that "pollute" downstream analyses such as gene family construction.
Similarly, the characterization of genes associated with lineage specificity, or their status as a "core" or "dispensable" gene within the context of a pan-genomic analysis, will depend heavily on the completeness of the individual datasets used in the determination (Salzberg, 2019). There is also recognition that standard genome assembly and annotation methods do not capture all the features of interest, for example, methylation status, chromatin features, or recombination hotspots (Mei, Stetter, Gates, Stitzer, & Ross-Ibarra, 2018).
Recommendations:
• Improve metrics for assembly and annotation quality and standardize methods for applying the metrics across multiple legume species.
• Increase consistency and comparability for assembly and annotation tools. This will need to be an ongoing effort, as technologies continue to evolve.
• Increase support and standardization for "unusual features" such as small-RNAs, structural-genomic features such as chromatin accessibility, methylation, and epigenetic gene regulation. Further, consider distinct features of particular genomes such as heterozygosity, ploidy levels, disease resistance hotspots, and transposable elements in large genomes.

| Pan-genomes: Supporting multiple reference genomes for a single species
For species that have had multiple genomes sequenced and assembled, characterization of the pan-genome will be important for representing the gene complement and sequence diversity present in a species and to avoid biases that might be introduced in comparisons against a single accession. Similar to the problems inherent in constructing gene families or other types of cross-species comparative analysis from inputs of variable quality, pan-genome-based inferences will be dependent on the quality of the data, the diversity of genotypes sequenced, and the nature of the inputs used to construct them. For example, many aspects of diversity can be addressed by resequencing data and pan-genomes can be constructed from genotypes inferred from such data. However, low coverage resequencing would be of limited value to assess regions of complex structural variation, including those found in rapidly evolving resistance-gene clusters in plant genomes.
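One way to make the dependence on input quality concrete is a toy graph in which each accession is stored as a path through shared sequence nodes. The sketch below is an illustrative assumption, not the data model of any specific pan-genome tool; the accession names and node identifiers are invented. It shows how "core" and "dispensable" calls follow directly from which accessions traverse each node, so a missing node in one incomplete assembly changes the classification.

```python
from collections import defaultdict

# Each accession is a path through shared sequence nodes (toy identifiers).
paths = {
    "accession_A": ["n1", "n2", "n3", "n5"],
    "accession_B": ["n1", "n2", "n4", "n5"],
    "accession_C": ["n1", "n3", "n5"],
}


def build_graph(paths):
    """Return (edges, carriers): adjacency sets and the accessions visiting each node."""
    edges = defaultdict(set)
    carriers = defaultdict(set)
    for accession, nodes in paths.items():
        for node in nodes:
            carriers[node].add(accession)
        # Consecutive nodes on a path define a directed edge.
        for left, right in zip(nodes, nodes[1:]):
            edges[left].add(right)
    return edges, carriers


def classify(carriers, n_accessions):
    """Split nodes into core (present in every accession) and dispensable (the rest)."""
    core = sorted(n for n, acc in carriers.items() if len(acc) == n_accessions)
    dispensable = sorted(n for n, acc in carriers.items() if len(acc) < n_accessions)
    return core, dispensable


edges, carriers = build_graph(paths)
core, dispensable = classify(carriers, len(paths))
print(core)         # ['n1', 'n5']
print(dispensable)  # ['n2', 'n3', 'n4']
```

In this toy case, removing "n5" from a single fragmentary path would move it from core to dispensable, which is the kind of artifact that uneven input quality can introduce.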
Pan-genomes may be represented by a graph data structure, in which each accession is represented by a separate path (Eggertsson et al., 2017; Computational Pan-Genomics Consortium, 2018).
(… and Vigna/bambara groundnut), tubers (e.g., in Psophocarpus/winged bean, Tylosema/marama bean, and Apios/potato bean), or transitions between growth forms (e.g., tree and herb forms in many lineages). As more genomic resources become available across the legume family, comparative analysis of the repeated evolution of such traits will be increasingly tractable and powerful. For species with limited funding opportunities, the ability to leverage genetic information from better characterized plant relatives is invaluable.
Recommendations:
• Create comprehensive gene expression atlases for all crop legumes and make them available through websites with long-term support. These atlases are important for defining lists of conserved candidate genes across species or those unique to a certain species, and for generating gene-based markers associated with specific traits.
• Develop standards for replications, growing conditions, and tissue types to establish reference gene atlas databases for the various legume species.
• Improve documentation of annotation methods and evidence to support gene annotations and strategies to facilitate cross-species comparisons.
• Develop methods and standards for naming annotated genes that take into account gene families, haplotypes, and pan-genomes.
• Develop an encyclopedia of genes that underlie domestication and quality traits for important legume species.

| Overview
The ability to efficiently and densely genotype many accessions enables powerful analyses. The data can be used to determine genetic relationships among and within accessions, to determine population structure in plant genetic resource collections, and to make informed decisions about managing a collection based on genetic diversity. In breeding applications, markers with known associations can be used for marker-assisted selection, to identify novel genetic variants within genes, and for genomic selection (GS).
For many legume crops, gene banks contain thousands of accessions, which can be narrowed down to a subset of core accessions that represent the genetic diversity of the entire collection, to facilitate phenotyping for traits that are time-consuming or labor-intensive to evaluate. Genetic information can be used to help identify redundancies or gaps in the germplasm to inform plant exploration or new plant genetic resource collection initiatives. Combined with genotypic data, phenotypic information about a core set of accessions or, in some cases, the entire collection of a species can be used in genome-wide association studies (GWAS) or, if a key gene for a specific trait (e.g., disease resistance) exists, to identify novel alleles for the gene(s) of interest in the accessions evaluated.

| Data comparison between genotyping platforms
… (Sempéré et al., 2016) can be used to load and organize data with varying formats, so that data can then be extracted with a common data format and markers across platforms can be aligned to common variants. Imputation methodologies can be applied to align different genotyping methodologies to standard sets of markers, and variants and markers generated from differing genotyping methodologies can be aligned to common reference genome variants, although this requires significant bioinformatics resources and imputation platforms (Wang et al., 2018). Haplotype graphs can reduce this complexity and consolidate the information from different genotyping technologies into common haplotypes. The process of imputing SNPs may be easier to implement in some species than in others, depending on mode of reproduction (selfing vs. outcrossing) and ploidy level.
Recommendations:
• Increase comparability across genotyping platforms. Use common SNP/feature sets where possible, and make genotyping results available in well-established formats, with validation. Improve imputation methods for inferring missing data points when compared to common SNP databases.
• Increase the use of genotyping data management systems to help align data formats and outputs to facilitate comparison between experiments or genotyping runs. These systems should associate markers with their underlying variants wherever possible, and document design and reporting strands to enable transformation to a common allele strand output.
• Explore tools and procedures that facilitate consolidation of differ…

… (Cook et al., 2012). SVs can also be linked to domestication events (Lye & Purugganan, 2019). A comprehensive characterization of structural variation in legume genomes would require the use of long-read sequencing technologies and the development of de novo assemblies to resolve highly repetitive regions and to eliminate reference bias.
Recommendations:
• Expand variant analysis to include complex variations: presence-absence variations (PAVs), copy-number variations (CNVs), and larger structural variations (SVs).
• For particularly large or repeat-dense species, such as faba, pea, and lentil, utilize exome-targeted sequencing methods such as exome capture for identifying PAVs and CNVs.

| GWAS: Meta-analysis, resources, and repositories
To better facilitate the combination of GWAS studies in plants, phenotypic characterizations need to be comparable (Zhao et al., 2019). This is difficult if the terminology used to describe traits and trait variation differs. Use of consistent, common descriptors or ontologies is therefore necessary (Shrestha et al., 2010). … (Togninalli et al., 2018), AraPheno (Seren et al., 2017), easyGWAS. Furthermore, as GWAS data are often scattered across publications, there is a need for a centralized repository of GWAS results (e.g., the human GWAS Catalog) that follows data standards for studies, traits, variants, accessions, and base pair locations.
Recommendations:
• Facilitate cross-study comparisons by using consistent, common descriptors, from established ontologies.
• Extend ontologies from one legume to others, to facilitate the transferability of information from one legume to another.
• Encourage utilization of ontology descriptors described in other crops, and the use of standardized ontologies maintained at http://www.cropontology.org.
• Develop a centralized repository of GWAS results, with rigorous adherence to data standards.
• Develop tools to facilitate cross-species and within-species comparisons and meta-analyses across multiple studies, which can in turn enable comparison of GWAS features across studies and enable identification of causal genetic elements.
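As an illustration of what a cross-study meta-analysis tool might compute once traits and variants have been harmonized, the following minimal sketch combines per-study effect estimates for a single variant using a fixed-effect, inverse-variance weighted model. This is one common approach, named here as an assumption rather than a method prescribed by the workshop; it presumes that all studies report the same trait definition, the same variant position, and the same effect allele, and the input values are invented.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs for one variant across studies.

    Uses inverse-variance weights; returns (pooled_beta, pooled_se, z_score).
    Assumes every study reports the effect for the same trait and allele.
    """
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


# Two hypothetical studies reporting the same variant (illustrative numbers).
beta, se, z = fixed_effect_meta([(0.5, 0.1), (0.3, 0.2)])
print(round(beta, 3), round(se, 4), round(z, 2))  # 0.46 0.0894 5.14
```

The calculation itself is simple; the harder part, and the reason for the recommendations above, is ensuring that the inputs are actually comparable across studies.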

| Overview
Breeding progress can be measured through genetic gains per cycle of selection. Management of data for screening materials is critical to catalog the genetic diversity and to make selection decisions.
Easy access to a wide range of data, from sources across many disciplines, would facilitate decision-making at every step of a breeding program. Additionally, there is a potential to expand breeding from single crop selections to include breeding for the most favorable interactions with rhizobia and other microbes, for improved symbiotic performance and survival in different soils (Busby et al., 2017;Greenlon et al., 2019).
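Genetic gain, the measure of breeding progress noted at the start of this overview, is commonly expressed with the breeder's equation in its per-unit-time form: gain equals selection intensity times selection accuracy times additive genetic standard deviation, divided by cycle length. The sketch below simply evaluates that expression; the function name and the input values are illustrative assumptions.

```python
def genetic_gain_per_year(intensity, accuracy, sigma_a, cycle_years):
    """Expected genetic gain per year from the breeder's equation.

    intensity   : standardized selection intensity (i)
    accuracy    : correlation between selection criterion and breeding value (r)
    sigma_a     : additive genetic standard deviation, in trait units
    cycle_years : length of one breeding cycle in years (L)
    """
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return intensity * accuracy * sigma_a / cycle_years


# Illustrative values: selecting roughly the top 10% (i of about 1.755),
# accuracy 0.6, additive SD of 2.0 trait units, and a 4-year cycle.
gain = genetic_gain_per_year(1.755, 0.6, 2.0, 4)
print(round(gain, 4))  # 0.5265
```

Each term depends on data quality: accuracy relies on well-curated phenotypes and genotypes, and cycle length can shrink when decisions are supported by readily accessible data.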

| Data collection and integration for breeding needs
For genetic resource collections, users need access to information such as pedigree information, trait phenotyping, researcher attribution, links to genomic data, and a variety of other germplasm data.

Recommendations:
• Use standard ontologies, such as the Crop Ontologies (http://www.cropontology.org/) and Plant Trait Ontology (TO; http://www.obofoundry.org/ontology/to.html) to enable comparisons across datasets. Contribute terms as needed to facilitate their use for practical breeding applications.
• Expose and retrieve data through standard web service APIs. The Breeding API, BrAPI (https://brapi.org/) should be implemented and contributions made to keep standards current (Selby et al., 2019).
• Establish standard protocols for phenotypic traits and data collection standards to populate GRIN-Global as the main repository for germplasm and phenotypic descriptors. Use web services for integration with related data.
• Develop tools and standard interfaces that meet breeder use cases (haplotype mining, identification of potentially useful parents/alleles), and the ability to access data from multiple databases (germplasm maintenance and discovery research).
• Researchers should be encouraged to utilize appropriate long-term repositories with commitments to FAIR data principles to facilitate data integration.
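To illustrate what exposing and retrieving data through a standard web service API looks like in practice, the sketch below builds a BrAPI-style germplasm query and parses a response of the general shape used by BrAPI version 2 (a "metadata" block plus a "result" block holding a "data" list). The server address is hypothetical, the sample response is hand-written rather than returned by a real server, and the field and parameter names should be checked against the current BrAPI specification before use.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://example.org/brapi/v2"  # hypothetical server, for illustration only


def germplasm_url(base_url, crop, page=0, page_size=100):
    """Build a paginated germplasm search URL for one crop."""
    query = urlencode({"commonCropName": crop, "page": page, "pageSize": page_size})
    return f"{base_url}/germplasm?{query}"


def germplasm_names(response_text):
    """Extract germplasm names from a BrAPI-style JSON response body."""
    payload = json.loads(response_text)
    return [entry["germplasmName"] for entry in payload["result"]["data"]]


# Hand-written sample response (not from a real server).
sample_response = json.dumps({
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 100,
                                "totalCount": 2, "totalPages": 1}},
    "result": {"data": [
        {"germplasmDbId": "g1", "germplasmName": "Example line 1"},
        {"germplasmDbId": "g2", "germplasmName": "Example line 2"},
    ]},
})

print(germplasm_url(BASE_URL, "Soybean"))
print(germplasm_names(sample_response))  # ['Example line 1', 'Example line 2']
```

Because the request and response shapes are shared, the same client code could in principle be pointed at any compliant database, which is the practical benefit of adopting a common API.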

| Breeding data search, acquisition, browsing, visualization, and analysis tools
To enable better access to data, improved tools to find, view, analyze, and acquire data are required. Having to learn different tools and navigation at different websites, along with the need to combine multiple online and stand-alone applications, is inefficient and burdensome. To avoid duplicated development efforts and to provide more consistent website navigation, a number of open source frameworks are available.
Recommendations:
• Improve tools to easily find germplasm using a variety of search filters, including geographic origin, traits, marker alleles, and heterotic groups.
• Implement common frameworks to the extent possible and collaborate with others doing similar work; develop initiatives to foster collaborations and leveraging of resources to avoid operating in silos.
• Adapt tools developed by other research communities where possible to address new projects and needs in other legumes.
• Promote comparative analysis tools such as the Genome Context Viewer (Cleary & Farmer, 2018), available at https://legumeinfo.org, to evaluate legume gene families and phylogenies.

| Breeding management tools
Recommendations:
• Link and augment existing breeding management tools to include additional breeder-centric functionalities (support flexible plot designs and modes of reproduction).
• Integrate breeding management tools with data-rich informatics resources focused on climate, soil, and weather to better address GxE interactions.
• Improve breeding decision support tools that consider multiple traits simultaneously.

| Gene bank collections
Diverse collections of plant genetic resources are critical for breeders because they archive the diversity on which trait improvement depends. Maintenance and characterization of germplasm collections is essential to maximize their utility for addressing current and emerging threats to agricultural productivity.
Characterizing germplasm and gaining easy access to the data continue to be a challenge. As previously mentioned, some of these …
Recommendations:
• Archive collections in central gene banks and maintain them in sufficient quantity to fulfill seed requests.
• Prioritize genotyping of core germplasm collections for multiple legumes using standard sampling protocols, while also considering the challenge of maintaining and increasing seed lots. Successful examples of this are projects to genotype the entire barley collection (Milner et al., 2019) and the entire U.S. soybean collection (Song et al., 2015). Genotyping entire collections will likely result in the development of new core collections that better represent the genetic diversity of the collection.
• Perform detailed phenotyping of accessions in core collections using standardized protocols for plant growth, replications, locations, and trait ontologies.
• Develop tools to visualize and mine genomics datasets in gene bank collections to increase their application for trait development, pre-breeding, and breeding purposes.
• Link GRIN descriptors for all legume crops to reference ontologies, such as the Plant Trait Ontology (TO).
• Apply stable data identifiers (DOIs) to germplasm collections and develop recommendations for applying them to heterozygous and heterogeneous seed lots.

| Rhizobia and other symbionts
The capacity to actively select rhizobial strains using genomic information … (Ghimire, Charlton, & Craven, 2009). Effective interactions between legumes and symbionts can enhance establishment, drought and salinity tolerance, performance in poor soils, and disease resistance, which ultimately could result in higher yields.
Recommendations:
• Systematically improve rhizobial and other symbionts and determine legume breeding lines responsive to these improved symbionts.
• Expand research and development of tools to discover, modify and utilize knowledge about genome-genome interactions between rhizobia/mycorrhizae and legume hosts in agricultural settings.

| LEGUME-SPECIFIC DATA RESOURCE OPPORTUNITIES
Rhizobial data resources highlight the fact that many aspects of legume biology are distinctive, calling for either novel or taxon-specific approaches to genomic data management. In the area of annotation, legume genomes including Medicago truncatula have played a key role in the elucidation of small secreted protein genes (de Bang et al., 2017). These proteins play key roles in nodule and rhizobial development, and hundreds have been discovered across multiple legume genomes. Nevertheless, small proteins are routinely overlooked in genome sequencing and annotation projects. Work in legume genomics has encouraged the development of small secreted protein gene discovery software (Zhou et al., 2013) as well as the reexamination of sequenced plant genomes, including the genomes of nonlegumes (Silverstein et al., 2007). Beyond nodulation, legumes share important taxon-specific data opportunities that must ultimately be elucidated through the distinctive lens of legume species. First-generation pan-genomes for soybean and Medicago have already been developed (Li et al., 2014; Zhou et al., 2017). Soon, much deeper and more extensive pan-genomes will be publicly available in these and other legume species. Likewise, work to define the "ancestral" legume genome and its evolution into present-day species has pulled together genomic data across multiple legume sequencing projects (Kreplak et al., 2019; Ren, Huang, & Cannon, 2019; Wang et al., 2017). The data management and sharing standards, especially for annotation, orthology, genomic elements, complex variation, haplotypes, and more, are all incomplete and urgently needed to exploit these pan-species and pan-family genome resources. At the individual species level, unique or novel data challenges remain for legume species.
The recent publication of the peanut genome (Bertioli et al., 2019; Zhuang et al., 2019) highlights unusual features of subgenome evolution/domestication, while the pea genome (Kreplak et al., 2019) illustrates the impact of massive transposon expansion. For both, data descriptions and standards are in their infancy for their use in legume genomics. Even among long-studied legume traits such as yield, quality, and stress tolerance (including protein and oil in soybean and pulses, forage quality and winter-hardiness in alfalfa and clover, and pathogens targeting multiple legume species), strategies to enable actionable decision-making by breeding and germplasm researchers still require integration with sequence- and genome-level datasets to be completed.

| CONCLUSIONS
The state of legume genomics is characterized by an abundance of data, which offers many opportunities for comparison and combination of datasets. Productive integration and comparison require data management practices and methods that have yet to keep up with the pace at which data are being generated. Such practices should include use of consistent metadata, ontologies, and accepted and validated data formats, and deposition of data in well-supported and maintained repositories to facilitate their use. A critical need exists for data curators, improved computational tools and interfaces targeting human end-users, and computer access via APIs. Data generators, curators, software developers, and users of the data should approach the generation of these resources while being mindful of the long-term goals of these efforts: to improve our understanding of legume biology, including interactions with rhizobial symbionts and other biotic and abiotic factors, to promote the efficient stewardship and utilization of legume genetic resources, and to optimize legume improvement for the benefit of farmers and consumers.