The legume information system and associated online genomic resources

The Legume Information System (LIS; https://legumeinfo.org) houses genetic and genomic data, integrated in various online tools to allow comparative genomic analyses. The website and database maintain data for more than two dozen species, particularly focusing on crop and model species and holding data for other diverse species of taxonomic interest. Major analysis features include genome browsers, sequence‐search tools, legume‐focused gene families and a phylogenetic tree viewer, a gene annotation service (which places a submitted gene into a gene family and phylogenetic tree), an interactive microsynteny and pan‐genome viewer, a novel viewer of genetic variant data, genetic maps and viewers, a Data Store for data sets such as assemblies and annotations, InterMine instances for querying genetic and genomic data, and a tool for viewing geographic distributions of germplasm accessions. LIS also integrates with several other legume data resources and tools, including PeanutBase (https://peanutbase.org), SoyBase (https://soybase.org), Medicago Hapmap (https://medicagohapmap2.org), Alfalfa Breeder's Toolbox (https://alfalfatoolbox.org), and the Legume Federation (https://legumefederation.org).


1 | INTRODUCTION
The mission of the Legume Information System (LIS; https://legumeinfo.org) is to facilitate research and crop improvement for the many legume species that are important in global agriculture. LIS is a collaborative project between the USDA-Agricultural Research Service and the National Center for Genome Resources (Dash et al., 2016; Gonzales et al., 2005; Gonzales, Gajendran, Farmer, Archuleta, & Beavis, 2007). LIS houses genetic and genomic data for 30 legume species, with extensive support for 18 of these, as of late 2020.
Strengths of LIS include a Data Store, to serve as a one-stop shop for static, versioned legume genetic and genomic data sets such as genome assemblies and annotations; genome browser support for many species; a gene family and phylogenetic viewer; a micro- and macro-synteny viewer; a tool for viewing geographic distributions of germplasm accessions; InterMine instances for querying genetic and genomic data; tools for exploring gene expression information; and viewers for genetic variant data, including variants associated with traits through published genome-wide association study (GWAS) analyses. We will describe each of these features below.
We first give an overview of the holdings and main tools at LIS and then illustrate the use of some of these tools and resources with a case study to investigate growth habit in cowpea and soybean.

2 | MATERIALS AND METHODS
LIS holds, at time of writing, substantial genetic and genomic data sets for more than two dozen legume species, as well as some specialized cross-cutting data sets such as legume-focused gene families. Many of these are explorable via various graphical user interfaces (GUIs), such as genome browsers, genetic map viewers, sequence search tools, or query and report interfaces for features such as genes and markers. Genome browsers are provided for sequenced crops and models: crops ranging from Arachis hypogaea (peanut) to Vigna unguiculata (cowpea) and models including Medicago truncatula and Lotus japonicus.

2.1 | The Data Store
To provide access to the underlying data used in visualization and query tools, LIS also maintains a Data Store (https://legumeinfo.org/data/), which holds data sets for download. These data sets include genome assemblies, gene models and sequences, genetic markers, maps, quantitative trait loci (QTLs), GWAS results, and expression data.
To help organize the data, to make them Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al., 2016), and to aid in their curation and use, the LIS team has adopted a set of practices for consistently organizing, naming, formatting, describing, and tracking the data. These standardized patterns are helpful both for outside users and for LIS developers and curators. The standardization has the important additional benefit of making the data findable and directly usable by applications within LIS and by other projects. The main organizing principles are as follows. The Data Store web interface mirrors a file system of limited depth, with main Genus_species directories, which hold data "collections." Collections are groups of thematically related files from a major, versioned analysis or study. Examples are genome assemblies, annotations, maps, genotype collections, expression data, and QTLs.
gnm2.ann1.RVB5, and Wm82.gnm2.div.0SZD. In these examples, the accession_name is Wm82 (abbreviated from Williams 82); the type+version is gnm2, gnm2.ann1, and gnm2.div, indicating genome assembly version 2, genome annotation 1 (on assembly 2), and diversity set (on assembly 2), respectively. The random key for each of these is the four-character alphanumeric string at the end. This unique key is also used in each file name contained within collections and in the metadata that describes each collection. The key, then, serves to concisely tie the files and the metadata together.
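The naming pattern described above can be unpacked mechanically. The following is a minimal sketch, assuming only what the text states: the first dot-separated field is the accession name, the last is the four-character key, and everything in between is the type+version. The function name `parse_collection_name`, the `Collection` field names, and the restriction of keys to uppercase letters and digits are illustrative assumptions and are not part of any LIS software.

```python
import re
from typing import NamedTuple


class Collection(NamedTuple):
    accession: str      # e.g. "Wm82"
    type_version: str   # e.g. "gnm2.div"
    key: str            # e.g. "0SZD"


# Assumption: keys are exactly four characters drawn from A-Z and 0-9,
# as in the examples given in the text.
KEY_PATTERN = re.compile(r"[A-Z0-9]{4}")


def parse_collection_name(name: str) -> Collection:
    """Split a collection name into accession, type+version, and key."""
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected accession.type+version.key, got {name!r}")
    accession, *middle, key = parts
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError(f"unexpected key {key!r} in {name!r}")
    return Collection(accession, ".".join(middle), key)


example = parse_collection_name("Wm82.gnm2.div.0SZD")
```

Here `example.accession` is `"Wm82"`, `example.type_version` is `"gnm2.div"`, and `example.key` is `"0SZD"`. Because the key also appears in each file name and in the metadata, the same function could be used to group files by collection.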
Within each collection, there are four metadata files: a README, a MANIFEST file that describes each data file, a MANIFEST file that indicates correspondences between filenames in the Data Store and any prior filenames, and a CHECKSUM file. The metadata files all use the "YAML" format, which is both human-readable and sufficiently structured to be machine-readable. The README has 26 fields (entries), such as identifier (the collection key name), provenance (the data origin), source (the URL of the data, if appropriate), and subject (a brief description of the data). The README ends with the fields file_transformation (indicating modifications to files, if any) and changes (indicating modifications, if any, to the collection after its initial version). The Data Store interface is implemented using h5ai (https://larsjung.de/h5ai/).

2.2 | DSCensor: Tools for summarizing and conducting quality control on Data Store contents
The Data Store contents are indexed and described in "DSCensor" (https://dscensor.legumeinfo.org), so named for its dual role as a "sensor" of data in the Data Store and as a "censor" of data sets that may have problems needing evaluation or fixing by project curators, such as malformed general feature format (GFF) structure, noncanonical naming patterns, and unexpected assembly statistics or annotation results. DSCensor uses the MultiQC tools (Ewels, Magnusson, Lundin, & Kaller, 2016), which were developed to describe, analyze, and report on data from multiple bioinformatic analyses (for example, at a sequencing and annotation facility). Plots and tables provide a helpful overview of data characteristics, such as assembly sizes, scaffold N50s, proportion of gaps in assemblies, and BUSCO results (Seppey, Manni, & Zdobnov, 2019) for gene models, which indicate the completeness of assemblies and annotations.
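Two of the summary statistics named above have simple definitions that are worth making concrete. The sketch below is illustrative only and is not taken from DSCensor or MultiQC: it uses the standard definition of N50 (the length L such that sequences of length at least L contain half or more of all assembled bases) and assumes that gaps are represented by the character N. The function names `n50` and `gap_fraction` are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def gap_fraction(sequence: str) -> float:
    """Return the fraction of positions that are N or n (assumed gap symbol)."""
    if not sequence:
        return 0.0
    gaps = sum(1 for base in sequence if base in "Nn")
    return gaps / len(sequence)
```

For scaffold lengths 8, 8, 4, 3, 3, 2, 2, and 2 (30 bases in... total 32 bases), the two longest scaffolds already hold 16 bases, so the N50 is 8. An unexpectedly low N50 or a high gap fraction is the kind of signal that would prompt a curator to take a closer look at a data set.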
2.3 | Genome browsers, for exploring genes, expression information, synteny, and genomic organization
LIS holds GBrowse (Stein, 2013) and JBrowse (Buels et al., 2016) genome browsers for more than a dozen species, and close to two dozen if counting partner sites PeanutBase and SoyBase (Arachis and Glycine genomes are linked from LIS and integrated across these sites via queries and browser track links). It is worth noting that the files maintained in the Data Store are indexed for direct use by modern genome browsers, so that annotation tracks may be served from the Data Store by browsers at alternate sites (e.g., http://medicagohapmap2.org serving Medicago annotations whose curation has been centralized at the Data Store). Similarly, command line tools such as samtools can be used to access subsets of gene sequences or genomic regions via calls to the https-hosted indexed FASTA files.
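Remote access to a subregion works because a FASTA index records, for each sequence, where its bases start in the file and how its lines are laid out, so a client can request only the bytes it needs. The sketch below illustrates that arithmetic for the five-column `.fai` index format written by `samtools faidx`. It assumes an uncompressed FASTA file with uniform line lengths; bgzip-compressed files need an additional block index that is not handled here. The function name `fai_byte_range` and the example index record are hypothetical.

```python
from typing import Tuple


def fai_byte_range(fai_line: str, start: int, end: int) -> Tuple[int, int]:
    """Map 1-based inclusive bases start..end to first and last byte offsets.

    fai_line is one tab-separated index record with the columns:
    name, sequence length, byte offset of the first base,
    bases per line, bytes per line (bases plus line terminator).
    """
    fields = fai_line.rstrip("\n").split("\t")
    length, offset, line_bases, line_width = (int(x) for x in fields[1:5])
    if not 1 <= start <= end <= length:
        raise ValueError("region lies outside the sequence")

    def byte_of(pos0: int) -> int:
        # Each complete line before this base costs line_width bytes;
        # the remainder is the position within the current line.
        return offset + (pos0 // line_bases) * line_width + pos0 % line_bases

    return byte_of(start - 1), byte_of(end - 1)


# Hypothetical record: 100 bases starting at byte 6, 60 bases per 61-byte line.
first_byte, last_byte = fai_byte_range("chr1\t100\t6\t60\t61", 1, 61)
```

The returned byte range still contains the line terminators that fall inside the region, so a client would strip those after fetching the bytes (for example, with an HTTP range request).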
One strength of the browsers at LIS is the cross-species synteny tracks, which identify regions of correspondence between species represented in the present browser and other related species. These synteny tracks are calculated using comparisons between protein sequences in the various species. This allows for identification across significant evolutionary distances (e.g., Glycine to Medicago); it also allows for estimates of evolutionary distance in each synteny block, in the form of Ks values calculated by comparing corresponding coding sequences within the blocks. These are reported, for each synteny block, as median Ks values between the present species and the target species for a given synteny track. These precalculated synteny tracks are also loaded from the Data Store into the InterMine instances to facilitate cross-mine traversal by syntenic region.
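The per-block summary described above amounts to grouping gene-pair Ks values by synteny block and taking the median of each group. A minimal sketch follows, assuming the Ks values have already been calculated elsewhere and are supplied as (block identifier, Ks) records; the function name `median_ks_per_block` and the block identifiers are hypothetical, and this is not the pipeline used to build the LIS tracks.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def median_ks_per_block(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (block_id, Ks) records by block and return each block's median Ks."""
    by_block = defaultdict(list)
    for block_id, ks in pairs:
        by_block[block_id].append(ks)
    return {block_id: median(values) for block_id, values in by_block.items()}


# Hypothetical Ks values for gene pairs in two synteny blocks.
records = [
    ("block1", 0.5), ("block1", 0.75), ("block1", 0.625),
    ("block2", 0.25), ("block2", 0.75),
]
summary = median_ks_per_block(records)
```

The median is less sensitive than the mean to the occasional misaligned or saturated gene pair, which is one reason it is a reasonable per-block summary.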
2.4 | Gene families, phylogenetic viewer, and sequence annotation tool, for exploring ortholog and paralog relationships
LIS provides access to a set of legume-focused gene families, both for download in several formats and for query and visual exploration. The current families were calculated using 14 proteomes from sequenced legume species and five from selected outgroup species. Special effort was made to circumscribe the families to include evolutionary events (speciations and whole genome duplications) within the timeframe of the legume family (extending back about 65 million years) but not older events, thus avoiding the angiosperm triplication (Koenen et al., 2020). This tends to give families of reasonable size, typically including the whole genome duplication that predated the origin of the papilionoid subfamily (Koenen et al., 2020; Ren, Huang, & Cannon, 2019; Stai et al., 2019).
The gene families are available for download via the Data Store and for searching and browsing via https://legumeinfo.org/search/phylotree. In the latter instance, families can be filtered by gene content and count, key word, or family name. Families can also be searched by gene from the gene search page (https://legumeinfo.org/search/gene). Within the phylotree viewer, genes are colored by species and also identified using a five-letter prefix, which takes three letters from the genus and two from the species epithet, for example, glyma for Glycine max and medtr for M. truncatula.
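The five-letter prefix rule is simple enough to state as code. The sketch below assumes that the letters are the leading ones of each name and that the result is lowercased, which is consistent with the examples in the text; the function name `species_prefix` is hypothetical.

```python
def species_prefix(genus: str, epithet: str) -> str:
    """Build a five-letter prefix: three letters of genus, two of epithet."""
    if len(genus) < 3 or len(epithet) < 2:
        raise ValueError("genus needs at least 3 letters and epithet at least 2")
    return (genus[:3] + epithet[:2]).lower()


prefixes = {
    name: species_prefix(*name.split())
    for name in ("Glycine max", "Medicago truncatula", "Vigna unguiculata")
}
```

This yields glyma, medtr, and vigun for the three species shown.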
The tree visualizations (Figure 4) are interactive. Clicking on a gene name exposes links relevant to that gene (e.g., to the gene page within LIS or a comparable one at Phytozome). The link "Find similar genomic contexts" leads to the genome context viewer (GCV), described next. Interior nodes of the trees are also active elements, allowing sets of genes that are children of subtrees to be used as inputs to other tools (e.g., for list creation in the InterMine instances to be described later). Large trees can also be filtered down, either by collapsing interior nodes or by focusing on subsets of taxa through interaction with the taxon composition display. All such changes to the visible content of the tree are also reflected in the multiple sequence alignment (MSA) viewer, which is available through the MSA visualization button near the top of the tree display.
LIS also provides a tool (https://legumeinfo.org/annot) to annotate user-provided sequences and to place them into phylogenetic trees for viewing. For a supplied sequence, the tool translates the sequence (if a nucleotide sequence was provided), searches against selected legume proteome data sets, conducts an InterProScan analysis for domain homology (Jones et al., 2014), searches and assigns the sequence to the LIS gene families using HMMER (Finn, Clements, & Eddy, 2011), aligns the sequence, and recalculates and displays the gene tree. Functional descriptions are provided using the Automated Human Readable Descriptions (AHRD) pipeline (Jozashoori, Jozashoori, & Schoof, 2019).
The LIS phylogenetic viewer and annotation tool are implemented as Tripal modules (Sanderson et al., 2013; Spoor et al., 2019).

2.5 | The GCV, for interactive microsynteny and pan-genome exploration
The GCV (https://legumeinfo.org/gcv2) is a web app for visualizing related genomic regions across a set of species or taxa (Cleary & Farmer, 2018). The GCV operates at the level of genes in genomic regions, typically spanning several hundred kb, although this is con-
GCV makes extensive use of microservices internally and can provide these data via application programming interface (API) calls from other software. This is done, for example, by ZZBrowse, to identify syntenic regions between indicated species. This service architecture also helps enable GCV to function in federated contexts, drawing data from various nonlocal sources, as has been done, for example, for an Arabidopsis instance (https://gcv-arabidopsis.ncgr.org) (Pasha et al., 2020).
2.6 | InterMine interface and data warehouse for selected legume species
As part of the National Science Foundation (NSF) Legume Federation project (Bauchet et al., 2019), a set of InterMine instances has been populated for seven legume genera (as of 2020), as well as an integrating LegumeMine that provides access to the data in the genus-specific mines, linking among species via gene family relationships. The legume mines are accessible at https://mines.legumeinfo.org. InterMine is a "data warehouse," built originally to hold genomic data for fly, yeast, and nematode, but extended and made into a generic system for holding and providing access to genetic and genomic data (Kalderimis et al., 2014; Lyne et al., 2015). The InterMine interfaces provide a large collection of query templates to help users drill into genomic data sets, as well as methods for constructing custom queries. A particular strength of the mines is their ability to work with lists and genomic regions in nontrivial ways, for example, intersecting a list of genes that have particular expression profiles with the genes falling within selected GWAS regions.
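The list-and-region operation mentioned above can be pictured as an interval overlap followed by a set intersection. The sketch below is a plain-Python illustration of that logic and does not use the InterMine API or web services; the `Gene` and `Region` types, the function name `genes_in_regions`, and all identifiers and coordinates are made up for the example.

```python
from typing import Iterable, NamedTuple, Set


class Gene(NamedTuple):
    gene_id: str
    chromosome: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int


def genes_in_regions(genes: Iterable[Gene], regions: Iterable[Region]) -> Set[str]:
    """Return IDs of genes that overlap at least one region."""
    region_list = list(regions)
    return {
        gene.gene_id
        for gene in genes
        if any(
            gene.chromosome == region.chromosome
            and gene.start <= region.end
            and gene.end >= region.start
            for region in region_list
        )
    }


# Hypothetical inputs: gene models, one GWAS region, and an expression-based list.
genes = [
    Gene("geneA", "chr1", 1_000, 2_000),
    Gene("geneB", "chr1", 50_000, 51_000),
    Gene("geneC", "chr2", 1_500, 2_500),
]
gwas_regions = [Region("chr1", 1_500, 10_000)]
expressed_in_tissue = {"geneA", "geneC"}

candidates = genes_in_regions(genes, gwas_regions) & expressed_in_tissue
```

In this toy example only geneA both overlaps the GWAS region and appears in the expression list. In the mines, the equivalent steps would be carried out on saved lists through the query interface rather than in local code.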

2.7 | Tools for exploring expression data at LIS
LIS currently houses tissue and developmental expression atlas data for five species. These data can be accessed in several ways.
In a typical session with the Genotype Comparative Visualization Tool (GCViT), a user will select a genotyping project (which points to a VCF file [Danecek et al., 2011] held in the LIS Data Store) and then select one or more accessions within the project for display and comparison. Displays show counts of variants (single nucleotide polymorphisms [SNPs]) in bins across chromosomes, with counts either being absolute (i.e., number of SNPs in a bin at some location for an accession) or relative differences (i.e., the number of "same" or "different" SNPs between two selected accessions, in a bin at some location).
Counts (absolute or of similarities or differences) can be rendered in several ways: as histograms, intensity heat maps, or haplotypes. Haplotypes are rendered as colored lines or blank extents (i.e., binary indications of presence or absence), indicating that counts of SNP differences (or similarities) between two accessions are above a selected threshold.
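The binned comparison described above can be sketched in a few lines. This is an illustration of the counting logic only, not GCViT's implementation: it assumes that genotype calls for two accessions have already been extracted from a VCF file into parallel lists, that positions are 1-based, and that missing calls are represented as `None`. The function names and the example data are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Set


def count_differences_by_bin(
    positions: Sequence[int],
    calls_a: Sequence[Optional[str]],
    calls_b: Sequence[Optional[str]],
    bin_size: int,
) -> Dict[int, int]:
    """Count sites where two accessions differ, grouped into fixed-width bins."""
    counts: Counter = Counter()
    for pos, a, b in zip(positions, calls_a, calls_b):
        if a is None or b is None:
            continue  # skip sites with a missing call in either accession
        if a != b:
            counts[(pos - 1) // bin_size] += 1  # 1-based position to bin index
    return dict(counts)


def haplotype_bins(diff_counts: Dict[int, int], threshold: int) -> Set[int]:
    """Return the bins whose difference count exceeds the threshold."""
    return {bin_index for bin_index, n in diff_counts.items() if n > threshold}


# Hypothetical calls at five SNP positions on one chromosome.
positions = [100, 250, 900, 1200, 1300]
reference = ["A", "C", "G", "T", "A"]
accession = ["A", "T", "G", "C", "C"]

diffs = count_differences_by_bin(positions, reference, accession, bin_size=500)
marked = haplotype_bins(diffs, threshold=1)
```

With these inputs, bin 0 holds one difference and bin 2 holds two, so only bin 2 exceeds a threshold of 1. A histogram or heat map would display the counts directly, whereas the haplotype rendering reduces each bin to the presence or absence captured by `haplotype_bins`.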

3.1 | Are genetic/trait associations conserved across different legume crops?
The ZZBrowse tool (https://zzbrowse.legumeinfo.org) is designed to explore genetic association studies in two selected species, using genomic synteny as the basis for making the comparison between species.
Beginning with ZZBrowse, we first select two species to examine: cowpea and soybean. In Figure 1, cowpea is framed in blue and soybean in orange. Users may import their own GWAS results as well.
In the example shown in Figure 1, we first filter for growth-habit-related traits: "Growth habit" in cowpea (data from LIS) and "Lodging," "Plant height," and "Stem termination" in soybean (data from SoyBase). Starting with a region with several growth-habit
3.2 | What are the evolutionary histories of the candidate genes?
Looking at the gene descriptors (either with mouse rollover or in the "Annotations Table" view), the most likely candidate gene is the well-known determinacy (Dt1) gene (Tian et al., 2010; Zhong et al., 2017), Glyma.19G194300, annotated as "flowering locus protein T." Clicking on this gene brings up a menu of other links related to this gene. One of these, "Find similar genomic contexts at LIS for: Glyma.19G194300," links to the GCV at LIS (Figure 3).
This shows (with some filtering for species) synteny for four soybean chromosomal regions (Gm19, Gm03, Gm10, and Gm11) and two cowpea regions (Vu01 and Vu07). In this view, colored triangles represent genes from different gene families (e.g., left-facing pink representing genes in the flowering locus T protein family).

FIGURE 2 ZZBrowse, centered on a quantitative trait locus (QTL) region of interest, with genomic syntenic connections enabled between species. Once a genomic region is selected in the first species (on cowpea chromosome 1 in this case), then syntenic regions in the second species can be identified by selecting a gene symbol in the first species. This triggers a search (using the genome context viewer, when the "genomic linkage" option is ON) that identifies syntenic regions between the two species. Orthologous matches are indicated with rainbow colors of the matching genes

FIGURE 3 Genome context viewer, focused around the query gene Glyma.19G194300. Orthologous genes are indicated with triangles of one color. Gene orientation is indicated by triangle orientation. Solid white designates genes without matches in the view, called "singletons." Dotted white indicates genes that do not have an annotation, called "orphans." This view has been filtered to show just two species, using the expression "glyma|vigun," to focus on regions from these two species (Glycine max and Vigna unguiculata, respectively)
Clicking on a gene in the GCV again brings up a menu of links related to this gene. One of these links is to the gene family view at LIS: https://legumeinfo.org/chado_phylotree/legfed_v1_0.L_18BH5B (Figure 4).
The gene tree in Figure 4 shows the relationships of genes from the gene family that contains the soybean and cowpea determinacy (Dt1) genes (and orthologs from other species). The tree shows generally expected phylogenetic relationships: for example, the warm-season legume species (soybean, pigeonpea, bean, cowpea, and mung bean, indicated by prefixes glyma, cajca, phavu, vigun, and vigra, respectively) fall in one clade and the cool-season legume species (clover, Medicago, and chickpea, indicated by prefixes tripr, medtr, and cicar, respectively) fall in another. At greater evolutionary distances, we see Lupinus angustifolius (prefix lupan) and peanut and its diploid relatives (A. hypogaea, A. duranensis, and A. ipaensis; prefixes arahy, aradu, and araip). Also evident in the gene tree are several whole-genome duplications (WGDs): one affecting Glycine (Schmutz et al., 2010), a triplication affecting Lupinus (Hane et al., 2017; Kroc, Koczyk, Swiecicki, Kilian, & Nelson, 2014), and one affecting all of the papilionoid legume species shown here (Ren, Huang, & Cannon, 2019).

3.3 | Is there a signature of selection around the determinacy locus?
Because plant habit is a critical trait for soybean, we might expect to see evidence of selection around this locus, as has indeed been reported (Zhong et al., 2017). The expectation would be to see a wild-type (Glycine soja-like) haplotype in indeterminate accessions and an alternate haplotype in determinate accessions. We can check this using the GCViT tool at SoyBase (GCViT is available for several species, at the appropriate genomic database sites: peanut at peanutbase.org, bean and chickpea at legumeinfo.org, and soybean at soybase.org).
GCViT produces visualizations of variants from genetic variation collections. These are generally VCF files associated with genotyping projects, derived either from an SNP chip or from genomic resequencing or genotyping-by-sequencing methods. The first step is to select a genotype collection. For soybean, we select the "USB 481" set, from Valliyodan et al. (2020). This has good genome-wide SNP density, and contains a good mix of G. soja and G. max accessions and elite and landrace lines.
We then select genotypes to compare. We select one G. soja accession at random to use as a reference (PI597458); three short (<80 cm) and determinate accessions at random (PI549021, PI361087, and PI594012); and three tall (≥120 cm) and indeterminate accessions at random (PI407729, PI71465, and PI417500). These are plotted in Figure 5.
In Figure 5,

4 | SUMMARY
LIS provides numerous tools for accessing and exploring legume genetic and genomic data. A notable strength of LIS is the collection of comparative methods, linking together orthologs through their evolutionary histories, using gene trees, synteny, and homology search tools. The Data Store attempts to provide a one-stop shop for genomic data, organized in a systematic way, with standardized formats and file names, to enable efficient computation on many data sets at a time. The case study presented above illustrates how some of these tools and methods can be used to do "on-line biology" by comparing analyses carried out in different species. Investigating determinacy, we find similar genetic associations occurring in syntenic locations in cowpea and soybean and find a possible signature of selection for this trait around the determinacy locus in soybean.

ACKNOWLEDGMENTS
The authors thank legume researchers who have provided data and motivation for this work. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity provider and employer.

CONFLICT OF INTEREST
The authors declare no conflict of interest.
FIGURE 5 Genotype Comparative Visualization Tool (GCViT) view of selected soybean accessions from Valliyodan et al. (2020). On the left of each chromosome, single nucleotide polymorphism (SNP) density is shown (as a gray histogram), for reference accession PI597458, from Glycine soja. To the right of each chromosome, the density of SNP differences is shown, for each selected accession compared with the reference accession. Darker colors indicate more SNPs within a given bin. Thus, white areas indicate near-identity with the G. soja reference. The first three comparison accessions are determinate G. max lines. The last three comparison accessions are indeterminate G. max lines. Only selected chromosomes are shown, due to space limitations

AUTHOR CONTRIBUTIONS
SC, AF, SK, NW, and SR did the writing (original draft, and review and editing). SK, SH, AF, AB, CC, JC, WH, RN, and SC helped with data curation. AW, AC, and AB did the visualization. JB, SD, SH, RN, SR, NW, and AW helped with the software. SC and AF handled the project administration.

FUNDING INFORMATION
This research was funded in part by the NSF project "Federated Plant Database Initiative for the Legumes" (#1444806) and supported in part by the U.S. Department of Agriculture, Agricultural Research Service, project 5030-21000-069-00D.

DATA AVAILABILITY STATEMENT
All data sets described in this publication are freely available, as described in the manuscript.

ETHICAL STATEMENT
This study did not involve any human or animal testing.