Cyberinfrastructure and resources to enable an integrative approach to studying forest trees

Abstract Sequencing technologies and bioinformatic approaches are now available to resolve the challenges associated with complex and heterozygous genomes. Increased access to less expensive and more effective instrumentation will contribute to a wealth of high‐quality plant genomes in the next few years. In the meantime, more than 370 tree species are associated with public projects in primary repositories that are interrogating expression profiles, identifying variants, or analyzing targeted capture without a high‐quality reference genome. Genomic data from these projects generates sequences that represent intermediate assemblies for transcriptomes and genomes. These data contribute to forest tree biology, but the associated sequence remains trapped in supplemental files that are poorly integrated in plant community databases and comparative genomic platforms. Successful implementation of life science cyberinfrastructure is improving data standards, ontologies, analytic workflows, and integrated database platforms for both model and non‐model plant species. Unique to forest trees with large populations that are long‐lived, outcrossing, and genetically diverse, the phenotypic and environmental metrics associated with georeferenced populations are just as important as the genomic data sampled for each individual. To address questions related to forest health and productivity, cyberinfrastructure must keep pace with the magnitude of genomic and phenomic sampling of larger populations. This review examines the current landscape of cyberinfrastructure, with an emphasis on best practices and resources to align community data with the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines.


| INTRODUCTION
High-throughput technologies are enabling rapid data generation in the fields of genomics, proteomics, and phenomics. These technological achievements are coupled with substantial reductions in cost as well as an increase in instrument accessibility.
Genomics, in particular, exceeded predictions, and the cost of generating data is now less than the cost to store it (Stephens et al., includes genomics, is focused on more efficient compression algorithms, binary file formats, and improved data transfer protocols to meet current demands (Muir et al., 2016).
Despite the increases in sequence-based resources, fewer than 4,500 eukaryotic genomes are available in the NCBI Genome database. When examining the resources for vascular plants, just under 200 unique genomes are complete and 52 represent tree species (Figure 1). While full genomes are increasingly available, a significant amount of sequence data for forest trees remains associated with experiments that are not designed around a reference genome (Figure 2b). In contrast to the 52 species associated with over 6,100 NCBI BioProject studies, over 970 sequencing experiments represent 373 trees without a reference genome. The vast majority of these data are derived from genome sampling (i.e., GBS, RAD-Seq) or transcriptomic approaches (Figure 2a). This leaves most forest tree species categorized as non-model. The ability to achieve high-quality reference genomes in forest trees is hindered by characteristics shared by other plant groups, including high heterozygosity, ploidy, gene duplications, and repetitive sequences (Hirsch & Robin Buell, 2013).
This approach extends the analysis beyond the scope of allelic

FIGURE 1 Growth in number of published reference plant genomes in comparison with those of tree species sequenced since 2002. By 2018, there were 148 plant reference genomes (shown in brown) with only 52 tree species (green). The first forest tree species was sequenced in 2006 (Populus trichocarpa). The highlighted genus names denote the year the first reference was generated for a species in that genus.

Pangenomes are now available for maize (Hirsch et al., 2014), Oryza (Zhao et al., 2018), Brassica, and Brachypodium (Gordon et al., 2017), and in trees, several hybridizing Populus species (Pinosio et al., 2016). While most research efforts are focused on diving deeper into species with economic drivers, fewer than 1% of the estimated 400,000 diverse land plants are sequenced. This may soon change as several initiatives are proposing ambitious collaborations to characterize large sections of the tree of life. The Earth BioGenome Project is most notable and intends to sequence 10 to 15 million eukaryotes over the next 10 years (Lewin et al., 2018). Obtaining high-quality reference genomes for more forest tree species may improve our ability to conserve and manage forest populations.

FIGURE 2 (a) NCBI project data depicted for 52 species (10 orders) associated with 6,116 BioProject studies. BioProject data were organized into whole-genome shotgun (whole genome or resequencing), Transcriptome (RNA-Seq, sRNA), Epigenome (bisulfite), GBS (genotyping-by-sequencing, RAD-Seq, ddRAD-Seq, RAPTURE, and similar), and exome (targeted capture). (b) NCBI BioProject data depicted for 972 projects representing 373 unique tree species across 16 orders. BioProject data were organized into whole-genome shotgun (whole genome or resequencing), Transcriptome (RNA-Seq, sRNA), Epigenome (bisulfite), GBS (genotyping-by-sequencing, RAD-Seq, ddRAD-Seq, RAPTURE, and similar), and exome (targeted capture).
Forest trees are long-lived, predominantly outcrossing perennials with long generation times and tremendous genetic diversity.
As such, a significant body of literature is dedicated to interrogating forest tree populations spanning environmental gradients through genomics (Aitken & Bemmels, 2016). Population studies examine local adaptation through a range of techniques, with recent efforts focused on reduced representation genome sampling (Catchen et al., 2017). To date, at least 50 tree species have been assessed via genotyping-by-sequencing approaches, such as RAD-Seq, which is a reliable option for trees with and without a reference genome (Parchman, Jahner, Uckele, Galland, & Eckert, 2018). These approaches are typically paired with extensive phenotypic or environmental data to interrogate genotype-phenotype and/or genotype-environment associations for a large number of individuals (Sork et al., 2013). The associated phenotypic and environmental metrics add yet another dimension to the data challenge. High-throughput phenotyping, or phenomics, is extensively adopted in crop species to examine and monitor biomass, photosynthetic efficiency, disease status, growth traits, and root architecture (Fernandez, Bao, Tang, & Schnable, 2017; Shakoor, Lee, & Mockler, 2017; Thomas et al., 2016). Recent adoption of thermal imaging and LIDAR provides opportunities to assess biodiversity, response to drought, growth traits, and pest/pathogen spread across entire forest plots (Dungey et al., 2018; Ludovisi et al., 2017). This review will describe developments in cyberinfrastructure that enable integration across traditional domains to advance knowledge in the forest tree research community.

| CYBERINFRASTRUCTURE AND FAIR
The term cyberinfrastructure was first defined by the National Science Foundation (NSF) in 2003, and described a research network that supported all aspects of the data life cycle, from acquisition to storage, integration, analysis, and visualization.
Cyberinfrastructure includes both the software and hardware elements to support these endeavors, connected to the Internet and accessible to an audience beyond a single institution (Kim, Yu, & Park, 2016). There is agreement among many that these frame-  (Horvath et al., 2018). While CyVerse focuses primarily on genomics, the BIEN (Botanical Information and Ecology Network), a National Center for Ecological Analysis and Synthesis (NCEAS) working group, unifies disparate ecological datasets built from observations across regional plots, herbaria, and other collections (Enquist, Condit, Peet, Schildhauer, & Thiers, 2016). The challenges are immense with varying degrees of digitization, distributed nonintegrated databases, and a lack of universally adopted standards for recording observations. BIEN is targeting not only integration across these collections but also phylogenetic and 'omic data to fully assess the impact of climate change (Enquist et al., 2016). CyVerse and BIEN represent two coordinated efforts from which forest tree researchers could benefit. Integration of ecological, trait, and genomic data for georeferenced populations, alongside computational resources, is critical for questions surrounding forest health and productivity.
Cyberinfrastructure is only as powerful as the underlying data that it stores, transports, and analyzes. While this remains challenging for genetic and genomic data, it is even more so for field observations and measured traits. The FAIR (Findability, Accessibility, Interoperability, and Reusability) data reporting standards, published in 2016, emphasized that data should not only be stored, but also be accessible and usable by the greater research community (Wilkinson et al., 2016). These guidelines encourage individual researchers to seek out appropriate tools and cyberinfrastructure to support the viability of their digital products. The FAIR reporting standards ask that data be: (a) findable: requires that the information is both machine- and human-readable with relevant and persistent identifiers; (b) accessible: requires that information be indexed, searchable, and retrievable by both machines and humans through the use of open-source standard file formats; (c) interoperable: requires that information be exchanged across platforms and relies on standards and semantics to aid in this process; (d) reusable: requires that data are open and associated with appropriate metadata (Reiser, Harper, Freeling, Han, & Luan, 2018). In the era of high-throughput data, cyberinfrastructure and the associated databases are not yet fully compliant with FAIR standards.
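As an illustration only, the four criteria listed above could be approximated as automated checks on a dataset record. The sketch below is a simplification under stated assumptions: the `DatasetRecord` fields, the `OPEN_FORMATS` set, and the rule that a persistent identifier begins with `doi:` are all hypothetical choices made for this example and are not part of the FAIR guidelines themselves.

```python
from dataclasses import dataclass, field

# Hypothetical list of open, standard file formats (illustrative, not exhaustive).
OPEN_FORMATS = {"fasta", "fastq", "vcf", "gff3", "csv", "tsv"}


@dataclass
class DatasetRecord:
    """Hypothetical description of one deposited dataset."""
    identifier: str = ""                                  # persistent identifier, e.g. a DOI
    file_format: str = ""                                 # format of the deposited files
    ontology_terms: list = field(default_factory=list)    # standard vocabulary terms
    license: str = ""                                     # terms of reuse
    metadata: dict = field(default_factory=dict)          # study-level metadata


def fair_report(record):
    """Map each FAIR criterion to True/False for one record (a crude proxy)."""
    return {
        # Findable: assumes persistent identifiers are written as "doi:<...>".
        "findable": record.identifier.startswith("doi:"),
        # Accessible: the file is in one of the assumed open formats.
        "accessible": record.file_format.lower() in OPEN_FORMATS,
        # Interoperable: at least one ontology term is attached.
        "interoperable": len(record.ontology_terms) > 0,
        # Reusable: data are openly licensed and carry metadata.
        "reusable": bool(record.license) and bool(record.metadata),
    }
```

A record that lacks, for example, ontology terms would be reported as not interoperable, which is the kind of feedback a submission interface could return to a researcher before data are released.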

| WHERE ARE THE DATA?
Journals and funding agencies encourage the deposition of data in the appropriate archiving locations. Despite these guidelines, all life science fields are experiencing a decrease in well-connected datasets (Alexander, Johnson, & Brown, 2018). In non-model species, the

For plant biologists, the ELIXIR UK-supported Collaborative Open Plant Omics (COPO) initiative is improving the situation for plant genomic and phenomic data with standards-based integration, guided workflows, DOI generation, and connections to researcher profiles (ORCID) (Shaw et al., 2015). The COPO initiative aims to limit the variation across standards and provide access to analytics which can operate on more robust standards. COPO is just one of the registered services listed in the FAIRsharing.org portal, which provides a curated and queryable interface to four linked registries: data standards, databases, collections, and data policies (McQuilton et al., 2016). FAIRsharing is aligned with the FAIR principles and provides guidance on data sharing for numerous disciplines, including the life sciences, for individual researchers, journals, and funding agencies.

| DATABASES
Primary databases are long-term, federally funded entities that are capable of maintaining persistent identifiers. The major representatives in the genetic and genomic world include NCBI GenBank, EMBL-EBI, and DDBJ, which operate as mirrored repositories for several different sequence types with independent strengths (Meldal & Orchard, 2018; Miyazawa, 2018; Sayers et al., 2018). These repositories excel at providing unified access to a wide range of sequence data for an unlimited number of species.
They do not, however, have the capacity to identify specific community needs and provide organism-specific curation. They generally provide basic functionality for sequence search, sequence comparison (BLAST), and visualization (genome browsers). Since users are often seeking substantial volumes of data, they provide mechanisms for bulk download via FTP, command-line search, and Web-based searches, as well as rapid data transfer pipelines, such as Aspera. In addition, primary repositories must balance data volumes and perceived benefit to the research community. They implement policies that are typically driven by biomedical and model system concerns. This includes NCBI's recent decision to halt the collection of variant data from genomes that are not biomedical models and, within EMBL-EBI, the acceptance of only those variants associated with INSDC-registered genomes. For species with draft genomes or without a reference, these policies leave no mechanism for integrated sharing of population genomic data.
Secondary databases curate and provide specialized functionality

The international 1KP project generated de novo reference transcriptomes for over 1,000 Viridiplantae species with membership from all major lineages (Matasci et al., 2014). The successor to this project, in collaboration with Earth BioGenome, is the 10KP project, which will sequence 10,000 phylogenetically diverse plant species from major clades of embryophytes over the next five years.
Independently, these resources provide valuable contributions to forest tree genomic research; however, connections between these resources and non-model databases remain sparse.
Community databases also work in conjunction with primary databases and other secondary databases. Their origins are more ad hoc in that they are hosted by a variety of different organizations and funded through different mechanisms. The Arabidopsis Information Resource (TAIR) is a well-established community repository for plant biologists that provides a wealth of information that is also cross-linked across information resources (Berardini et al., 2015). Community and secondary databases for plants (and trees) continue to increase and often originate from a single transcriptome or genome project; however, dedicated funding for biocuration beyond the length of the initial project is limited.
Until recently, the majority of databases focused on interfaces for searching curated data, genome visualization via browser, and basic sequence similarity functions. The tree biologist's need for cyberinfrastructure that expands the basic search and BLAST functionality of community databases is responsible for recent and successful deployments of more robust frameworks.
Three Web-based forest tree repositories have persisted with independent specialties and a connection to data analytics: TreeGenes, Hardwood Genomics Web, and three PlantGenIE implementations (Table 1). Both TreeGenes and Hardwood Genomics Web serve as hubs for their respective research communities in addition to the role of data storage, access, and analysis (Chen et al., 2017;Falk et al., 2018). Combined, they host over 1,800 species with the goal of providing integrated resources for non-model forest trees. Hardwood Genomics Web provides expression and co-expression analysis support for model and non-model hardwood species (Chen et al., 2017).
TreeGenes supports population and landscape genomic analysis as well as a comparative genomics module for orthologous gene family analysis.
Recent development in both is focused on the Tripal framework. This open-source platform combines a content management system front end with an organism-agnostic relational database schema, known as Chado (Sanderson et al., 2013; Spoor et al., 2019). This web/database combination provides a set of modules that can load and provide public views for standard data types (genomes, transcripts, variants, etc.).
Two other prominent tree databases associated with horticultural species, CitrusDB and Genome Database for Rosaceae (GDR), utilize Tripal, as do 30 other plant-focused resources (Jung et al., 2017). Tripal-supported databases integrate with a community of developers that contribute modules to extend the functionality of the standard install (Zhou, Emmert, & Zhang, 2005 (Boekel et al., 2015). Galaxy exists as a publicly accessible Web framework with community-curated workflows for a wide range of bioinformatic analyses. It is also supported by an international consortium of developers who maintain local instances that can be further customized for a variety of community needs. Through Galaxy, community databases can manage user accounts, provide data storage, and expose custom workflows and associated datasets on their own sites.
For model organisms, data warehousing solutions can enable faster access through alternative (non-relational) storage designs. BioMART is widely adopted and provides an efficient storage method and standardized user interface to query genomic objects, including genes and functional data (Smedley et al., 2015). BioMART also pairs with an R package that allows one to integrate functional annotations directly into analytics (Drost & Paszkowski, 2017). Gramene, Phytozome, and Ensembl Plants provide data access via BioMART in addition to their independent interfaces. InterMine acts as a more robust framework that combines efficient storage with standard and custom data loaders and analytic tools (Lyne et al., 2015). Phytozome also implements an InterMine instance, PhytoMine. As demonstrated by Phytozome, databases have the option to share or expose data in different frameworks, which allows Application Programming Interfaces (APIs), other databases, and end users to integrate the data.
In alignment with FAIR guidelines, new tree (or plant) community databases should examine whether an independent Web resource is necessary or whether integration into existing cyberinfrastructure is more sustainable. The support of advanced analytics in community databases encourages frameworks to efficiently transfer data, such as raw reads, from primary repositories to local application servers for analysis. In the era of big data, it is not efficient or realistic to reinvent the functionality required for each new genome or transcriptome. If a new and independent database is required, researchers should consider how to share the data during the lifetime of the resource as well as a plan to disseminate the data in the event it can no longer be maintained as an independent resource. Fewer than half of the existing databases hosting tree-related data are considering aspects of FAIR, and just over half are providing access to basic analytics (Table 1, Figure 3). In the biomedical community, a pilot NIH Data Commons initiative is seeking to integrate independent genomic resources, including FlyBase, Mouse Genome Database (MGD), WormBase, and others, into a cloud-based data sharing platform to minimize redundancy and improve integration and data reuse (Mahurkar et al., 2018).
For species of agricultural interest, the AgBioData consortium, formed in 2015, represents more than 25 genetic, genomic, and breeding databases hosted in a range of platforms. The consortium recognizes the need for biocuration and encourages member databases to think about data sharing, reuse, and sustainability for existing resources.

| ONTOLOGIES AND STANDARDS
Datasets curated by biologists have concepts and measures associated with different definitions across disciplines, organisms, scales, and even researchers in the same field. This semantic heterogeneity impedes data integration. Standardized vocabularies, known as ontologies, are used to describe genetic, phenotypic, and environmental observations or products (Bard & Rhee, 2004). The Gene Ontology (  data integration challenges associated with ecological trait data by hosting over 6.9 million trait records for 148,000 plant taxa (Kattge et al., 2011).
The Planteome initiative provides a Web portal with interconnected reference ontologies for the annotation of genomes, expression data, germplasm, and traits for 95 taxa (Cooper et al., 2018). The reference ontologies include PO, TO, the Plant Experimental Conditions Ontology (PECO), and the in-development Plant Stress Ontology (PSO), which will describe abiotic and biotic stressors. These are integrated with additional terms from CO, GO, Chemical Entities of Biological Interest (ChEBI), the Evidence and Conclusion Ontology (ECO), and the Phenotypic Qualities Ontology (PATO) (Cooper et al., 2018). PATO unifies phenotype descriptions and makes them amenable to automated processing. It is both an ontology and a uniform way to express phenotype statements (Gkoutos, Schofield, & Hoehndorf, 2017). Planteome leverages the integrated platform to provide annotations, each of which connects an ontology term to a bioentity. A bioentity is defined as a QTL, gene, protein, germplasm, gene product, or similar.
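The idea of an annotation as a link between an ontology term and a bioentity can be sketched as a small data structure. The example below is a hypothetical illustration and does not reproduce Planteome's actual schema or API: the `Annotation` class, the `index_by_term` helper, and the `EX:` term identifiers are placeholders introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass

# Bioentity types named in the text; "or similar" is not modeled in this sketch.
BIOENTITY_TYPES = {"QTL", "gene", "protein", "germplasm", "gene product"}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical link between one ontology term and one bioentity."""
    term_id: str         # placeholder identifier such as "EX:0000001"
    bioentity_id: str    # database identifier of the annotated object
    bioentity_type: str  # must be one of BIOENTITY_TYPES

    def __post_init__(self):
        # Reject types outside the controlled list so the index stays consistent.
        if self.bioentity_type not in BIOENTITY_TYPES:
            raise ValueError(f"unknown bioentity type: {self.bioentity_type}")


def index_by_term(annotations):
    """Group bioentity identifiers by the ontology term that annotates them."""
    index = defaultdict(set)
    for annotation in annotations:
        index[annotation.term_id].add(annotation.bioentity_id)
    return dict(index)
```

With such an index, a query for a single trait term returns every gene, QTL, or germplasm record annotated with it, regardless of the species or study that contributed the record.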
Two independent efforts have brought tree biologists and computational teams to the same table to curate traits and structures.
A wood anatomy and development working group contributed to PO through the partial conversion of established vocabularies in the Glossary of Terms used in Wood Anatomy (Lens et al., 2012).
While this glossary is known to the research community, the term definitions lose meaning when adopted in other disciplines classifying the same structures. Within the CO, an INRA-sponsored effort curated a woody plant trait ontology specific to forest tree breeding and health with terms such as wood density, wood fiber length, tree diameter, and branching angles. This ontology provides a much-needed standard for structures and traits specific to forest trees that are not represented in other crop species.
The combination of plant-specific and reference ontologies utilized in databases leverages curated efforts and minimizes redundancy. The transition of community-specific vocabularies or natural language descriptions into ontology terms enables automation of aspects of the classification process. While biocuration is an important and fundamental activity for all life science databases, there are few reliable funding streams that will support it. Community databases must select the ontologies that are appropriate for the data they hold and consider workflows to assist in automating data annotation from high-throughput studies.

(Farley, Dawson, Goring, & Williams, 2018). Effective integration of metadata standards, data sharing implementations, and ontological frameworks provides the basis for tools such as CartograTree that enable meta-analysis across population studies for forest trees (Falk et al., 2018).

Schematic of recommended cyberinfrastructure to support and integrate non-model tree genomics, phenomics, and environmental data. Community databases housed within existing frameworks that utilize content management systems will ease the management of user accounts, data exchange, and content updates. Guided submission workflows will integrate community-curated ontologies, such as GO, SO, PO, TO, CO, and PATO. Regular imports from primary and secondary sources, as well as multi-institutional projects, will provide the basis for data that can be further curated. Registered users will have direct access to custom workflows with data housed in the database and raw data that can be transferred from primary databases to the local application server.

existing packages (Table 3).
For community databases and end users, workflows implemented in workbenches, such as Galaxy, Taverna, or SciApps, offer the ability for end users with less development experience to work within a graphical user interface to design modular workflows that wrap existing open-source bioinformatic tools (Boekel et al., 2015;Leipzig, 2017;Wang, Lu, Buren, & Ware, 2018;Wolstencroft et al., 2013). Community databases can integrate with local (or public) instances of these workbenches to expose workflows to their user community. The Tripal community provides this resource to all member databases and their users via Galaxy.

| WORKFLOWS AND ANALYTICS
Customization of the workflows in tools such as Galaxy allows database administrators to expose and update best practice workflows as well as provide HPC access. This is of tremendous importance for non-model plant databases where custom workflows that do not rely on reference genomes must communicate with curated, local genomic resources. For both models and non-models, integration of genomic selection workflows with management tools for breeding could produce robust infrastructure for the full life cycle in forestry.
Following the execution of a workflow hosted by a community database, tools such as CyVerse's Data Store can provide indexed and labeled storage with support for authentication, permissions management, and metadata associations (Schneider & Jimenez, 2019).

| CYBERINFRASTRUCTURE RECOMMENDATIONS FOR THE FUTURE
The era of high-throughput data necessitates efforts to minimize redundancy in storage and optimize methods for finding and reusing the information generated (Figure 4). While genomic data are better positioned for integration among model systems, the current state for non-models is less than ideal. With the upcoming increase in draft and complete genome references for species without a substantial research community, integration of these resources into established frameworks, such as Tripal, PlantGenIE, InterMine, or BioMART, is important in an era of limited public funds for computational resources. Community databases should consider exposing data in more than one semantic framework to maximize data sharing, and new databases should develop their resource in one of the community-supported open-source frameworks to minimize developer time and leverage established standards. Integration of ontological frameworks with guided submission workflows that can capture and label metadata (study design, geographic data, and analytical methods) is key to generating reusable and reproducible datasets. These guided workflows can enforce submission of both the raw data and derived objects to ensure they are well described and accessible. They should also be designed to provide at least partial automation for term assignments for sequence types, gene products, and phenotypes. Generation of persistent identifiers (DOIs) will also be required to provide lasting value to the associated digital objects.
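A guided submission workflow of the kind recommended above could, in its simplest form, refuse incomplete submissions and propose ontology terms from free-text descriptions. The following sketch is illustrative only: the dictionary layout of a submission, the `validate_submission` and `suggest_terms` helpers, and the `EX:` identifiers in `TERM_LOOKUP` are hypothetical and do not correspond to any existing database's interface or to real ontology accessions.

```python
# Metadata categories named in the text as key to reusable datasets.
REQUIRED_METADATA = ("study_design", "geographic_data", "analytical_methods")

# Hypothetical phrase-to-term lookup; the identifiers are placeholders.
TERM_LOOKUP = {
    "wood density": "EX:0000001",
    "tree diameter": "EX:0000002",
}


def validate_submission(submission):
    """Return a list of problems; an empty list means the submission is complete."""
    problems = []
    metadata = submission.get("metadata", {})
    for key in REQUIRED_METADATA:
        if not metadata.get(key):
            problems.append(f"missing metadata: {key}")
    # Both the raw data and the derived objects must accompany the submission.
    for key in ("raw_data", "derived_objects"):
        if not submission.get(key):
            problems.append(f"missing {key}")
    return problems


def suggest_terms(description):
    """Partial automation: propose terms whose phrase occurs in the description."""
    text = description.lower()
    return sorted(term for phrase, term in TERM_LOOKUP.items() if phrase in text)
```

Simple phrase matching is used here only to keep the sketch short; a production workflow would more plausibly draw on curated ontology synonyms and leave the final assignment to a curator or submitter.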
We expect that many of our forest tree species will have a reference genome in the next five to ten years. As such, the ability to integrate decades of population studies onto these genomes will be critical. In addition, we will want to leverage efforts, such as Planteome, to aid in the functional annotation of the gene space.
Journals and funding agencies, in collaboration with initiatives such as FAIRsharing, must continue their role as gatekeepers and determine best practices and preferred standards for specific data types. Reaching agreement on best practices and enforcing these standards for both publications and data management plans remain significant barriers. Researchers and funding agencies should look to existing cyberinfrastructure solutions to manage projects from start to finish, rather than only at the end of the project.
Metadata tagging, ontology term assignment, and raw data storage can be managed during small- or large-scale collaborations.
Many community databases and other information resources can support this activity and keep data accessible only to project members until public release (Pommier et al., 2019;Wegrzyn et al., 2019).
Finally, reproducible, documented, and custom analytic workflows should be accessible to researchers through the community databases that provide the curated datasets. These integrated platforms must be accessible in the field as a data and metadata collection tool and at the desktop to provide analysis, visualization, and submission. Mobile applications for data collection on the landscape as well as in tree plantations are a key element of cyberinfrastructure (Crocker et al., 2019). Machine learning supported workflows to distill information from high-throughput phenotyping via remote sensing, an increasingly important component of data collection for forest health and productivity, will be required (Kälin, Lang, Hug, Gessler, & Wegner, 2019). Forest tree research will benefit from well-connected and labeled datasets with access to analytics that can integrate across genomic, phenomic, and environmental data.

ACKNOWLEDGEMENTS
This work was supported by the National Science Foundation Plant Genome Research Program of the United States (Grant No. 1444573).

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available