(Re‐)use and (re‐)analysis of publicly available metabolomics data

Metabolomics, the systematic measurement of small molecules (<1000 Da) in a given biological sample, is a fast‐growing field with many different applications. In contrast to transcriptomics and proteomics, sharing of data is not as widespread in metabolomics, though more scientists are sharing their data nowadays. However, to improve data analysis tools and develop new data analytical approaches and to improve metabolite annotation and identification, sharing of reference data is crucial. Here, different possibilities to share (metabolomics) data are reviewed and some recent approaches and applications regarding the (re‐)use and (re‐)analysis are highlighted.


TYPES OF DATA FOR RE-USE AND DATABASES
Data sharing in metabolomics facilitates re-use across multiple levels, enabling the scientific community to derive greater benefits. Increased data sharing leads to enhanced collective knowledge. Machine learning and artificial intelligence are increasingly used in metabolomics, necessitating a larger pool of training data from different areas and application fields. Below, we provide a concise summary of the different types of data that can be shared in metabolomics, along with corresponding resources.

Metabolite structures
Although several metabolite structure databases exist, grow, and are curated, the further sharing of metabolite structures needs to be encouraged. This is particularly crucial for newly identified metabolites with novel structures. While these structures are often part of the articles or the associated supplementary information, they are frequently not available in machine-readable formats such as the Simplified Molecular-Input Line-Entry System (SMILES) or the International Chemical Identifier (InChI). A positive example is SMID-DB.org, which stores structures of and information on secondary metabolites from Caenorhabditis elegans and related nematodes that have been identified in different publications, including SMILES and, if available, reference spectra [3]. Additionally, sharing such structures in larger, more general databases such as Chemical Entities of Biological Interest (ChEBI) [4,5], PubChem [6], ChemSpider, or LipidMaps [7] makes them accessible to a broader audience.
LC-MS/MS often cannot resolve full structural details, such as the position of hydroxyl groups in complex metabolites like flavonoids or the position and stereochemistry of double bonds in lipids. To address this limitation, ChEBI, for example, allows the submission and storage of partial structures. Submitting partial structures together with the associated molecular formula increases the chemical space covered. Subsequently, when the full structures are identified, they can be linked to the partial structure via the ChEBI ontology (e.g., 1,2-dihexadecanoyl-sn-glycero-3-phosphocholine [CHEBI:72999] is a phosphatidylcholine 32:0 [CHEBI:66850]).
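To illustrate why a molecular formula is already useful when only a partial structure is known, the following minimal sketch computes a monoisotopic mass from a formula. The function name, the small isotope table, and the formula C40H80NO8P used for phosphatidylcholine 32:0 are assumptions added here for illustration and are not taken from ChEBI tooling.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "P": 30.97376199842,
    "S": 31.9720711744,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C40H80NO8P'.

    Only plain element/count pairs are supported (no brackets or charges).
    """
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


# Assumed formula for a phosphatidylcholine with 32 acyl carbons and no double bonds.
pc_32_0 = monoisotopic_mass("C40H80NO8P")
print(f"{pc_32_0:.4f}")
```

Every fully resolved isomer of the partial structure shares this mass, which is why the partial entry can serve as a common anchor for MS1-level annotation.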
Beyond the actual structure of metabolites, information on the organisms that produce them is crucial. This organism-specific information aids dereplication during metabolite identification and helps to filter out spurious identifications that are unlikely to be present in the studied organism. A recent example is the LOTUS database, which contains taxonomical information on organisms producing the respective natural product [8]. LOTUS is completely linked to Wikidata and is built entirely from open data. Other databases, such as ChEBI or LipidMaps, also store associations between molecules and the organisms that produce them. In ChEBI, specific entries, such as CHEBI:78804 (C. elegans metabolite) or CHEBI:75771 (mouse metabolite), have been generated, and metabolites can be linked to them. Such information on the presence of metabolites in specific species or taxa can be used for improved annotation of metabolites [8,9]. Furthermore, species-specific metabolite and reference spectra databases can be constructed from this information. Besides organism specificity, in the case of multicellular organisms the tissue or cell-type origin of a metabolite is of great importance for correct metabolite annotation, for example, having a role such as "mouse lung metabolite". However, this information is currently not part of most metabolite structure databases. The Human Metabolome Database (HMDB) is an exception, curating this information for several metabolites, for example, their location in biospecimens or tissues [10].
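The filtering idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Candidate` class, the compound names, the scores, and the organism sets are all hypothetical and do not reflect the data model of LOTUS, ChEBI, or LipidMaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float          # spectral or in-silico score, higher is better
    organisms: frozenset  # taxa reported to produce the compound


def filter_by_organism(candidates, organism):
    """Keep candidates reported in the organism, best score first."""
    kept = [c for c in candidates if organism in c.organisms]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical candidate list for one metabolite feature.
candidates = [
    Candidate("compound A", 0.91, frozenset({"Arabidopsis thaliana"})),
    Candidate("compound B", 0.85, frozenset({"Caenorhabditis elegans"})),
    Candidate("compound C", 0.60, frozenset({"Caenorhabditis elegans", "Mus musculus"})),
]
hits = filter_by_organism(candidates, "Caenorhabditis elegans")
print([c.name for c in hits])
```

A hard filter like this discards the top-scoring candidate when it has never been reported in the studied organism. In practice a softer re-weighting may be preferable, since organism annotations in databases are incomplete.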
A summary of all mentioned databases with their respective URL can be found in Table 1.

Reference mass spectra
Sharing of reference mass spectra is one of the most obvious and important factors in advancing LC-MS/MS-based metabolomics. Laboratories cannot hold a reference standard for each known metabolite, which limits the focus and size of their in-house reference libraries.
Therefore, to annotate metabolites beyond these libraries, it is essential to incorporate diverse reference spectra from different analytical platforms (e.g., different MS types such as Orbitrap, QToF, or IT). According to different identification schemes, these reference spectra do not provide definitive identifications (which require a reference standard to be measured under identical analytical conditions), but their availability dramatically helps to narrow the list of putative candidates [11,12].
Although more and more laboratories share their reference libraries in the public domain, only a small growth in novel compounds is observed.
In most cases, laboratories initially focus on constructing in-house libraries with common metabolites like amino acids, organic acids, and fatty acids or rely on commercially available chemical libraries. Also, to advance in-silico methods beyond their current state, more chemical diversity is required. Fragmentation spectra of newly identified compounds should be deposited in electronic databases (and not only included in the supplementary information of articles).
Different platforms for sharing MS data have evolved over the last years, and it is becoming more and more standard to upload reference spectra of substances measured for in-house libraries. MassBank [13], MassBank of North America, and GNPS are primary databases that can store MS data [14]. All of them offer different functionalities on top of storing the spectra. For example, the GNPS ecosystem offers the Mass Search Tool (MASST), which allows searching reference libraries and public datasets for similar spectra [14,15]. Different variants of this search tool now exist, for example, FoodMASST, microbeMASST, or plantMASST [16]. To enable such tools, a combination with taxonomically informed metabolite libraries is required (see above). Beyond the purpose of annotation, reference spectra can be used to develop and evaluate novel in-silico approaches for the analysis of MS2 data, for example, CSI:FingerID [17,18], CFM-ID [19,20], MetFrag, etc. [21]. Such tools enable advances beyond classical library and spectral matching for metabolite annotation, opening new avenues for analysis and interpretation.

Retention time and collisional cross section data
Retention times (RTs) and collisional cross sections (CCSs) are valuable orthogonal parameters that can be used to identify metabolites.
The above-mentioned identification schemes require such an orthogonal parameter of a chemical reference standard matched to a metabolite feature for the highest level of identification [11,12]. While CCS values are almost instrument-independent, RTs strongly depend on the employed chromatographic system and instrumentation. Even though approaches for the normalization of RTs have been suggested [22,23], substantial variations persist between different column brands, and sharing of retention data is not widespread. It is important to note that data sharing should encompass metadata sharing, as the (re-)use of RTs heavily relies on the available metadata [24]. RTs alone are practically useless without information on the employed column, eluents, flow rate, temperature, and other relevant parameters. Despite this, RT collections are becoming more available. One example is PredRet, which represents an RT collection but also offers a tool for projecting RTs across different chromatographic systems [25,26].
In the future, larger collections of RTs will enable the development of novel machine-learning models for the prediction of RTs to enhance metabolite identification [27].
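The idea of projecting RTs between chromatographic systems can be illustrated with a deliberately simple sketch. It uses piecewise-linear interpolation between compounds measured on both systems. This is an assumption made for illustration and is not the model used by PredRet; the function name and the anchor values are hypothetical.

```python
from bisect import bisect_right


def project_rt(rt, anchors):
    """Map an RT from system A to system B by piecewise-linear interpolation.

    anchors: (rt_in_A, rt_in_B) pairs for compounds measured on both systems,
    assumed to have unique rt_in_A values. RTs outside the anchor range are
    not extrapolated and return None.
    """
    pts = sorted(anchors)
    xs = [a for a, _ in pts]
    if len(pts) < 2 or rt < xs[0] or rt > xs[-1]:
        return None
    # Index of the right-hand anchor of the interval containing rt.
    i = min(bisect_right(xs, rt), len(xs) - 1)
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (rt - x0) / (x1 - x0)


# Hypothetical RTs (min) of three shared compounds on systems A and B.
anchors = [(1.0, 2.0), (5.0, 8.0), (10.0, 12.0)]
projected = project_rt(3.0, anchors)
print(projected)
```

Refusing to extrapolate is a conservative design choice: outside the range covered by shared compounds there is no evidence for how the two systems relate.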
With the advent of ion mobility spectrometry and its more widespread application in metabolomics and lipidomics, CCS databases are becoming more critical. Ion mobility separates ions based on their shape, which allows the potential separation of isobaric and isomeric structures. Since deviations between instruments are typically relatively small, CCS values obtained in different laboratories can be used for metabolite annotation [28,29]. One example of a CCS database is the CCS Compendium, which stores CCS values from different instruments (TWIMS, DTIMS, TIMS) [30]. Besides the CCS Compendium, different collections exist and enable the prediction of CCS values [31][32][33][34].
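How a CCS value from a public collection might be used as an additional filter can be sketched as follows. The 2% tolerance, the function name, and the CCS values are assumptions chosen for illustration; appropriate tolerances depend on the instrument type and calibration.

```python
def ccs_matches(measured, reference, tolerance_pct=2.0):
    """True if the measured CCS deviates from the reference by <= tolerance_pct percent."""
    deviation = abs(measured - reference) / reference * 100.0
    return deviation <= tolerance_pct


# Hypothetical isomeric candidates with identical m/z but different CCS values.
references = {"isomer 1": 201.5, "isomer 2": 212.0}
measured_ccs = 202.3
plausible = [name for name, ccs in references.items() if ccs_matches(measured_ccs, ccs)]
print(plausible)
```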

Entire datasets (raw data)
Besides sharing individual mass spectra, entire LC-MS/MS runs or datasets can be shared. They often include processed feature tables that provide information about metabolite quantities, peak intensities, or areas. While single feature tables are often included in the supplementary information of published articles or deposited on generic data-sharing platforms such as Zenodo, there are dedicated platforms for sharing metabolomics raw data, such as MetaboLights [35], Metabolomics Workbench [36], or MassIVE/GNPS [14].
Sharing of such raw data allows other scientists to evaluate the results of the specific study but also to develop new algorithms for peak picking, ion deconvolution, etc. This is especially important when new analytical methods or approaches are becoming available (such as data-independent acquisition [DIA] or ion mobility in the past). A recent check of the completeness of chromatographic metadata in public repositories found that 70% of all data was incomplete and missed important information [24]. Lastly, if data is stored in metabolomics-centric repositories, any information regarding identified metabolites can be easily retrieved without manually searching within articles or their supplementary information. Furthermore, most of these repositories allow specifying, for example, the organism and tissue of origin, which makes it possible to reconstruct specific metabolomes, even including unknown metabolites. A summary of all mentioned repositories can be found in Table 2.
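A basic completeness check for chromatographic metadata can be sketched as below. The required field names and the example records are hypothetical; they are not the schema of any repository and not the criteria used in [24].

```python
# Hypothetical minimum set of chromatographic metadata fields.
REQUIRED_FIELDS = ("column", "eluents", "flow_rate", "temperature")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def completeness(records):
    """Fraction of records in which every required field is filled in."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if not missing_fields(r))
    return complete / len(records)


# Two made-up dataset descriptions, one complete and one incomplete.
records = [
    {"column": "C18, 100 x 2.1 mm", "eluents": "water/acetonitrile",
     "flow_rate": 0.3, "temperature": 40},
    {"column": "C18", "eluents": "", "flow_rate": 0.4},
]
print(missing_fields(records[1]), completeness(records))
```

Running such a check at submission time, rather than retrospectively, would let depositors fix gaps while the information is still at hand.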

(Spatial) Distribution of metabolites
Certain metabolites are only produced in specific organs, tissues, or even cells. Information on the spatial distribution of metabolites is important for a better understanding of biological functions. The METASPACE project (https://metaspace2020.eu/) offers an annotation platform for spatial metabolomics based on MS imaging (MSI).
The web portal represents a repository for high-resolution MSI datasets. Annotation on the MS1 level can be performed using several of the mentioned metabolite structure databases [37].

Metabolite identification
The different types of data presented allow different levels of re-use and re-analysis. The most straightforward way to re-use public data is through mass spectral libraries for metabolite identification. Publicly shared spectra can be matched against spectra measured in one's own experiments to aid the annotation of metabolites not covered by in-house databases. This provides putative annotations and can help to narrow down potential candidates for further structural elucidation. Besides the actual library matching, high-quality reference spectra are required for the development of in-silico annotation tools, such as CSI:FingerID, CFM-ID, and others [17,19,38]. For a more detailed review, see [39].
Submission of novel structures to chemical reference databases such as ChEBI, PubChem, or others expands the search space for the aforementioned in-silico tools. Together with information on the organisms producing metabolites, this can narrow down potential candidates. However, great care needs to be taken. Ideally, manual curation and data verification should be performed, since automatic methods and meta-scores can potentially result in an artificially high increase in "true positive results" [40]. Furthermore, metabolite structure databases can serve as input for the annotation of metabolites in MSI experiments [37].
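Library matching is commonly based on a similarity score between a measured and a reference spectrum. The following minimal sketch implements a greedy cosine similarity for centroided spectra. It is one of several possible scoring schemes and is not the implementation of any specific library search tool; the function name, the 0.01 Da tolerance, and the peak lists are hypothetical.

```python
from math import sqrt


def cosine_score(query, reference, tol=0.01):
    """Greedy cosine similarity between two centroided spectra.

    query, reference: lists of (mz, intensity) tuples. Peaks are paired if
    their m/z values differ by at most tol (Da); each peak is used once,
    and pairs with the largest intensity product are assigned first.
    """
    candidates = sorted(
        (
            (q_int * r_int, qi, ri)
            for qi, (q_mz, q_int) in enumerate(query)
            for ri, (r_mz, r_int) in enumerate(reference)
            if abs(q_mz - r_mz) <= tol
        ),
        reverse=True,
    )
    used_q, used_r, dot = set(), set(), 0.0
    for product, qi, ri in candidates:
        if qi not in used_q and ri not in used_r:
            used_q.add(qi)
            used_r.add(ri)
            dot += product
    norm = sqrt(sum(i * i for _, i in query)) * sqrt(sum(i * i for _, i in reference))
    return dot / norm if norm else 0.0


# Made-up peak lists standing in for a measured and a library spectrum.
measured = [(85.03, 20.0), (130.07, 100.0), (188.07, 45.0)]
library = [(85.03, 25.0), (130.07, 100.0), (188.07, 40.0)]
score = cosine_score(measured, library)
print(round(score, 3))
```

A high score only supports a putative annotation. As noted above, definitive identification still requires a reference standard measured under identical conditions.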

Reference datasets for the development of new workflows
Entire datasets can be used to develop new bioinformatics tools and approaches. This includes every possible step, from peak picking to feature grouping and metabolite identification. Bioinformatics laboratories working on such tools often do not have the capacity to generate the required datasets on their own and rely on publicly available datasets.
For instance, the MetaboLights datasets MTBLS235 and MTBLS234 contain reference data for developing peak picking and the assembly of peaks into features [41]. Notably, they include a synthetic dataset for which the ground truth is known (a known number of metabolites or features and their identity, which is typically not the case for biological datasets). Another example is the dataset MTBLS1108 submitted to MetaboLights, which contains data from data-dependent acquisition (DDA) and DIA and was used for the development of the DIAMetAlyzer workflow [42]. In addition to the complete dataset, the workflow and associated code are also made available (https://openms.de/application/diametalyzer/ and https://github.com/oliveralka/DIAMetAlyzer_additional_code). This enables direct benchmarking of new processing methods for DIA data and comparison against an established workflow.
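When the ground truth of a dataset is known, a peak picker can be scored by precision and recall. The sketch below is a generic illustration under stated assumptions: the tolerances, the function name, and the feature lists are hypothetical and do not reproduce the evaluation in [41] or [42].

```python
def benchmark(detected, truth, mz_tol=0.005, rt_tol=5.0):
    """Precision and recall of detected features against ground-truth features.

    Features are (mz, rt_seconds) tuples. A detected feature counts as a true
    positive if an unused ground-truth feature lies within both tolerances;
    each ground-truth feature can be matched only once.
    """
    unmatched = list(truth)
    true_pos = 0
    for mz, rt in detected:
        for t in unmatched:
            if abs(mz - t[0]) <= mz_tol and abs(rt - t[1]) <= rt_tol:
                unmatched.remove(t)
                true_pos += 1
                break
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Made-up ground truth and peak-picking output.
truth = [(100.0, 60.0), (200.0, 120.0), (300.0, 180.0)]
detected = [(100.001, 61.0), (200.002, 119.0), (400.0, 50.0), (500.0, 70.0)]
precision, recall = benchmark(detected, truth)
print(precision, round(recall, 3))
```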

Reanalysis of metabolomics data at a repository scale
Publicly shared data can be used for reanalysis, including new statistical analyses, searches for newly described compounds, or comparison with other datasets. ReDU was developed for exactly this purpose, allowing the extraction of specific knowledge from public datasets [43]. ReDU allows establishing associations between compounds and different metadata, for example, sex, life stage, etc.
Advancements in computational power and improved algorithms allow metabolomics data analysis at a repository scale with hundreds to thousands of LC-MS/MS runs and spectra. One such analysis was performed to test a novel confidence score for metabolite annotation beyond spectral libraries [18]. Over 2500 LC-MS/MS runs from different human sources were annotated, including novel compounds not present in HMDB [44]. Another example is the creation of new suspect spectral libraries [45]. Spectra of new structures have been inferred from the nearest neighbors of spectra with reference matches in molecular networks, for example, for novel acylcarnitine species.
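The kind of association between compounds and sample metadata mentioned above can be sketched as a simple detection frequency per metadata group. This is not the ReDU implementation or its data format; the sample structure, the function name, and all values are hypothetical.

```python
from collections import defaultdict


def detection_frequency(samples, compound, key):
    """Fraction of samples per metadata group in which a compound was annotated."""
    totals, hits = defaultdict(int), defaultdict(int)
    for sample in samples:
        group = sample["metadata"].get(key, "not specified")
        totals[group] += 1
        if compound in sample["compounds"]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Made-up samples with annotated compounds and harmonized metadata.
samples = [
    {"metadata": {"sex": "female"}, "compounds": {"compound X", "compound Y"}},
    {"metadata": {"sex": "female"}, "compounds": {"compound X"}},
    {"metadata": {"sex": "male"}, "compounds": {"compound X"}},
    {"metadata": {"sex": "male"}, "compounds": {"compound Y"}},
]
freq = detection_frequency(samples, "compound X", "sex")
print(freq)
```

Such counts are only meaningful when the metadata vocabulary is controlled, which is why consistent metadata capture matters for repository-scale reanalysis.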

PROBLEMS AND OPPORTUNITIES
Metabolomics is generally still very much technology driven; as such, no universally accepted analysis method exists (if one is ever possible). Different laboratories use different types of equipment (e.g., Orbitraps vs. ToFs) and different chromatographic methods [46]. While the integration of targeted metabolomics data based on absolute concentrations or known and identified metabolites might be possible, it becomes more challenging for non-identified metabolites. Instrumentation variations, such as differences in dynamic range and ionization efficiencies due to variations in ionization sources, result in varying relative abundances of adducts and in-source fragments, which are compound-dependent. Furthermore, different chromatographic methods will result in different RTs. Approaches such as PredRet can partially help to establish correspondence between datasets [26]. The use of MS2 adds further information for mapping. However, differences in collision energy between instruments and experimental settings can lead to differences in fragmentation spectra. The use of merged or ramped spectra might overcome this in the future. More research is required to better understand how differences between analytical setups arise and whether there are ways to overcome and normalize them. In the case of lipidomics analysis, it has recently been shown that shared reference materials can improve the harmonization of different methods [47].
Besides the actual technical differences, several differences in the semantics of metabolites exist; for example, identifiers for metabolites are not harmonized. Metabolite names can often be ambiguous, and systematic IUPAC names are often not used because of their length and complexity, so trivial names are preferred (e.g., (2S)-2-amino-3-(1H-indol-3-yl)propanoic acid vs. L-tryptophan). The most unambiguous identifier for a metabolite is its structure, which can be reported using a SMILES, InChI, or InChIKey. Several approaches have been published to overcome this issue, for example, BridgeDb or RefMet [48,49]. However, metabolite nomenclature is a recurring issue [50].
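Merging results reported under different names becomes straightforward once names are resolved to a structure-based identifier. The sketch below uses a hand-written name-to-InChIKey lookup. This is an assumption for illustration, not the BridgeDb or RefMet interface; the InChIKey shown for L-tryptophan should be verified against a structure database before use, and the study tables are made up.

```python
from collections import defaultdict

# Hypothetical lookup; in practice this would come from a structure database.
NAME_TO_INCHIKEY = {
    "l-tryptophan": "QIVBCDIJIAJPQS-VIFPVBQESA-N",
    "(2s)-2-amino-3-(1h-indol-3-yl)propanoic acid": "QIVBCDIJIAJPQS-VIFPVBQESA-N",
}


def merge_by_structure(tables):
    """Merge name -> value tables from several studies on the InChIKey.

    Returns the merged table and the (study, name) pairs that could not be
    resolved and therefore need manual curation.
    """
    merged = defaultdict(dict)
    unresolved = []
    for study, table in tables.items():
        for name, value in table.items():
            key = NAME_TO_INCHIKEY.get(name.strip().lower())
            if key is None:
                unresolved.append((study, name))
            else:
                merged[key][study] = value
    return dict(merged), unresolved


# Made-up results from two studies using different names for the same compound.
tables = {
    "study_1": {"L-Tryptophan": 12.5},
    "study_2": {"(2S)-2-amino-3-(1H-indol-3-yl)propanoic acid": 10.1, "unknown 17": 3.3},
}
merged, unresolved = merge_by_structure(tables)
print(merged, unresolved)
```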
Nevertheless, sharing metabolomics data offers several opportunities. Different metabolomics datasets covering the same or similar biological questions can be combined to increase the statistical power of studies. However, since metabolomics is far from a standardized technology, integration of datasets might be complicated if they were collected on different platforms. Standardized targeted metabolomics methods and kits can help to generate data that can be easily merged to improve statistical power [51,52]. Results from such studies will represent the first line of large-scale integration of data for broader data analysis and enable new findings. However, the knowledge of the metabolism of different organisms is still scattered.
Public sharing of metabolomics datasets also allows the data-driven reconstruction of organism metabolomes. For example, the repository MetaboLights allows searching for organism-specific studies and compounds. Compounds are retrieved from the annotated and identified compounds in the deposited datasets and linked to a specific species.
Together with in-silico reconstructions of metabolism (also known as genome-scale metabolic models), the knowledge can be continuously updated and enhanced to create a more fine-grained picture.
Besides scientific questions, public datasets can be used to educate the next generation of metabolomics scientists.

CONCLUSION
Data sharing in metabolomics can be conducted on different levels, from submitting novel chemical structures to structural databases and sharing reference spectra and libraries to entire datasets. Such sharing is essential for the growth of the field of metabolomics. Though different obstacles and problems associated with metabolomics need to be solved (e.g., common identifiers, comparable methods), each new dataset, reference spectrum, or novel structure increases our knowledge of the metabolism of different organisms and biological systems and is therefore valuable and important.
However, the field of metabolomics is far from being standardized and requires more rigorous control of metadata related to experimentation and instrumentation. Without meaningful metadata, shared data is only of partial use. For example, an RT without a description of the employed chromatographic system represents just a single number, and a reference spectrum without information on the chemical structure cannot be used for training purposes.
As technological advancements highly influence metabolomics, it is crucial to make new types of data available to the community to keep up with these developments. With the introduction of ion mobility instruments, there has been a significant release of CCS databases and collections; the same is expected for novel and alternative fragmentation modes, such as electron-activated dissociation (EAD) or ultraviolet photodissociation (UVPD). Both have been shown to be valuable tools for the detailed analysis of lipids, allowing the determination of double bond and sn-positions in glycerophospholipids [53,54]. Furthermore, new data types are needed, such as for the prediction of quantities, ionization efficiency, or adduct formation [55,56].
It is important to acknowledge that metabolomics is still behind fields like genomics, transcriptomics, and proteomics in terms of data sharing, and new standards need to be established. Nevertheless, large parts of the metabolomics community have realized the value of sharing data on different scales, and data is becoming more available. Facilitating easy integration with and uploading to metabolomics repositories will help to streamline this process further. Current software tools often allow export to common open data formats, such as .mzML and .mzTab [57,58,59]. Once automatic upload and (re-)analysis of data become feasible, metabolomics will flourish and be used by a wider range of scientists, including non-experts. Until then: Share your data!

Furthermore, in theory, data from different sources can be fused and compared to increase statistical power. However, in reality, the diversity of data from different laboratories makes direct comparisons challenging, as different mass spectrometric setups might have different responses to a specific metabolite. An essential factor for the (re-)use of such datasets is the comprehensive capture of metadata, including information about the organism, experimental conditions, and other relevant details. As an example, Harrieder et al. recently checked metadata associated with different datasets in MetaboLights and Metabolomics Workbench for the completeness of chromatographic metadata [24].
Images are associated with rich metadata such as instrumentation and origin of samples. Besides images, other databases exist, for example, MetaboAtlas21 (https://metaboatlas21.metabolomics.fgu.cas.cz/), which allows to browse dis-
TABLE 2 Repositories for metabolomics (raw) data.