Data‐Driven Compound Identification in Atmospheric Mass Spectrometry

Abstract Aerosol particles found in the atmosphere affect the climate and worsen air quality. To mitigate these adverse impacts, aerosol particle formation and aerosol chemistry in the atmosphere need to be better mapped out and understood. Currently, mass spectrometry is the single most important analytical technique in atmospheric chemistry and is used to track and identify compounds and processes. Large amounts of data are collected in each measurement of current time‐of‐flight and orbitrap mass spectrometers using modern rapid data acquisition practices. However, compound identification remains a major bottleneck during data analysis due to lacking reference libraries and analysis tools. Data‐driven compound identification approaches could alleviate the problem, yet remain rare to non‐existent in atmospheric science. In this perspective, the authors review the current state of data‐driven compound identification with mass spectrometry in atmospheric science and discuss current challenges and possible future steps toward a digital era for atmospheric mass spectrometry.


Introduction
In this perspective article, we review the current state of data-driven mass spectrometry in atmospheric science. We focus on automated compound identification, which refers to the large-scale identification of molecules facilitated by digital tools, open knowledge, and data sharing practices. The past 50 years have seen the emergence of large mass spectral databases, which are filled with mass spectra for a variety of compounds. [1,2] Mass spectral databases are used during compound identification and the development of data-driven identification tools. As a result, many research fields that rely on high-throughput mass spectrometry have been able to improve, accelerate, and automate data analysis of mass spectrometry experiments. However, in atmospheric science, we believe that there is room for a broader application and more specific development of such tools. Here, we outline the potential of, and current barriers to, data-driven compound identification in atmospheric mass spectrometry.
DOI: 10.1002/advs.202306235
Atmospheric science includes the study of all chemical and physical processes that occur in the atmosphere. These processes drive a complex, interlinked system with global impact. The atmosphere consists mostly of nitrogen and oxygen gas (around 99%), followed by noble gases (about 1%), water vapor (≈ 0.01-4%), and carbon dioxide (0.04%). In addition, the atmospheric gas mixture contains a vast number of trace gases, including methane and carbon monoxide (around 2 ppm and 100 ppb, respectively); inorganic vapors, such as nitrogen and sulfur compounds (e.g., NO, NO2, and HNO3, and SO2, COS, and CS2); and a substantial number of organic compounds from either biogenic or anthropogenic emissions (e.g., terpenes and polyaromatics). 5] Trace gases can alter the atmospheric composition at any given time. Certain trace gases are very reactive and have short lifetimes, while others are practically nonreactive and persist for far longer periods, allowing them to be transported over long distances. Trace gas emissions of organic compounds enter the atmosphere mainly in reduced and poorly water-soluble forms. Through oxidation, the organic compounds increase their affinity for the condensed phase (see Figure 1). This means they can be scavenged by liquid droplets and airborne particles. One example of this complex multi-phase chemistry is secondary organic aerosol particle generation. [11] An autoxidation process drives this gas-to-particle conversion by generating a sequence of progressively more oxygenated, and often isomeric, reaction products from the same parent hydrocarbon. [12,13] With each oxygenation step, the reactant molecules become better at condensing into smaller nanoparticles. [14,15]
Figure 1. Particles in the atmosphere form through complex processes spanning multiple spatial scales. First, emissions of volatile compounds enter the atmosphere and oxidize into lower volatility compounds. These low-volatility compounds eventually form clusters which, in turn, can grow into atmospheric nanoparticles. Mass spectrometry has become the measurement method of choice to study atmospheric molecular processes like these. Introducing data-driven methods such as machine learning to the mass spectrometry workflow can help unlock the full analytical potential of mass spectrometry and provide unprecedented insight into atmospheric processes.
The volatility of a compound and its tendency to form atmospheric secondary organic aerosol particles can be described conceptually by the volatility basis set. [14,16,17] The basis set contains information on the vapor concentration and oxygen content (the oxygen to carbon ratio, O:C, or the average carbon oxidation state, OSc) and correlates the volatility evolution with structural changes. The most oxygenated, and generally also the most polar, compounds contribute most to aerosol particle formation and typically have the highest O:C ratios and lowest saturation vapor concentrations. The most extreme cases are the so-called ultralow volatile organic compounds (ULVOCs) with saturation vapor concentrations lower than 3 × 10⁻⁹ μg m⁻³. [10,14,16,17] At the opposite end of the volatility basis set scale, we find the most volatile, and the least polar, organic compound gases.
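The two oxygen-content descriptors named above can be computed directly from a molecular formula. The following is a minimal sketch, assuming compounds that contain only C, H, and O, for which the average carbon oxidation state is commonly approximated as OSc ≈ 2 O:C − H:C. The `Compound` class, the example formulas, and the saturation vapor concentrations are hypothetical illustrations and not measured values; only the ULVOC threshold is taken from the text.

```python
from dataclasses import dataclass

# ULVOC threshold quoted in the text, in micrograms per cubic meter.
ULVOC_LIMIT_UG_M3 = 3e-9


@dataclass
class Compound:
    """Hypothetical container for a CHO-only compound."""
    formula: str
    n_c: int
    n_h: int
    n_o: int
    c_sat: float  # saturation vapor concentration in ug m^-3 (made-up value)


def o_to_c(compound: Compound) -> float:
    """Oxygen-to-carbon ratio, O:C."""
    return compound.n_o / compound.n_c


def oxidation_state(compound: Compound) -> float:
    """Approximate average carbon oxidation state, OSc = 2 O:C - H:C.

    Assumption: valid only for compounds containing C, H, and O.
    """
    return 2 * compound.n_o / compound.n_c - compound.n_h / compound.n_c


def is_ulvoc(compound: Compound) -> bool:
    """True if the saturation vapor concentration is below the ULVOC limit."""
    return compound.c_sat < ULVOC_LIMIT_UG_M3


# Illustrative inputs: a parent hydrocarbon and a highly oxygenated product.
parent = Compound("C10H16", n_c=10, n_h=16, n_o=0, c_sat=1e6)
product = Compound("C10H14O7", n_c=10, n_h=14, n_o=7, c_sat=1e-10)

for c in (parent, product):
    print(c.formula, o_to_c(c), oxidation_state(c), is_ulvoc(c))
```

In this toy example, oxygenation raises O:C from 0 to 0.7 and OSc from −1.6 to 0, which mirrors the trend described for the volatility basis set.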
The sheer number of emitted volatile organic compounds, combined with the many aforementioned reaction schemes, leads to a combinatorial explosion of possible reaction products. The number of different emitted volatile organic molecules is estimated to lie in the thousands or even millions. [18,19] Through atmospheric reactions, each emitted volatile organic compound multiplies into thousands of reaction products. For example, a decane molecule (10-carbon alkane) with around 100 isomers could already yield over one million distinct compounds. [18] Understanding the complex atmospheric chemistry behind aerosol particle formation is an important and challenging task. Efforts to map atmospheric compounds and processes contribute to a better basic knowledge of the chemistry in one of Earth's largest and most complex systems. The atmospheric chemistry leading to particle formation also contributes to air pollution and climate change. Aerosol particle pollution has adverse effects on air quality and human health, [20] contributing to 7-9 million premature deaths annually. [21,22] Additionally, aerosol particles impact the climate by reflecting and absorbing solar radiation, an effect addressed in climate models used by the Intergovernmental Panel on Climate Change (IPCC) to inform and guide legislation and action plans for climate change mitigation. [23] In this context, compound identification could, for example, help to develop a better understanding of particle growth, an important factor in determining aerosol-cloud interactions. [24] Small changes in our understanding of aerosol particle growth could alter the number of cloud condensation nuclei by 50% and, thus, affect the outcome of climate models. [14] In this perspective, we propose merging experimental mass spectrometry techniques with data-driven approaches, such as machine learning, to accelerate identification of new atmospheric compounds (see Figure 1).
Atmospheric scientists utilize a combination of laboratory and field-campaign spectrometry experiments to map out the intricacies of atmospheric chemistry leading to particle formation (Figure 2). Field campaigns generate numerous experimental spectra of compound mixtures. Such mixtures often contain unknown compounds and have a composition that varies between measurement sites. [27][28] In a data-driven approach, existing experimental infrastructures would be coupled to data science frameworks. Reference compounds shared in data infrastructures can function as training data for automated compound identification tools. Such digitization of atmospheric mass spectrometry could then expedite compound identification in laboratories and field measurements and help us to gain basic knowledge of the chemistry guiding particle formation (Figure 2).

Mass Spectrometry as a Window into Molecular-Level Atmospheric Processes
Much of what is currently known about atmospheric molecular-level processes was obtained with mass spectrometry. While mass spectrometers primarily provide data on the molecular mass and formula, the molecular formula alone often cannot uniquely identify a compound. [29] To gain additional insight into molecular structures, mass spectrometry can be combined with techniques such as chromatographic separation, [30] induced fragmentation (MS/MS [31,32] and electron ionization [EI] mass spectrometry [33]), ion mobility spectrometry, [34,35] ionization characteristics, [36-38] and spectroscopy methods. [19] Such combined approaches have the potential to identify compounds and address a wide range of research questions, including those requiring high-throughput analysis. However, the use of mass spectrometry in atmospheric science faces many challenges, which we outline below.
Figure 2. Data-driven compound identification in atmospheric mass spectrometry requires an integration of experiments and data science frameworks. Laboratory experiments can be used to create reference spectra for atmospheric compounds (1). Field measurements produce large amounts of mass spectrometry data of unknown compounds (2). Reference spectra and field measurements can be collected in shared data repositories (3). Data-driven (e.g., machine learning-based) compound identification tools can be trained with reference spectra and be used to identify new compounds measured in field campaigns or laboratories, thereby increasing our basic knowledge of atmospheric processes (4).
Figure 3 shows examples of mass spectrometric techniques used to study different compounds in atmospheric chemistry. [39] In the introduction, we alluded to the fact that atmospheric chemistry (gas, molecular clusters, and particles) involves compounds with widely different volatilities. Since mass spectrometry is inherently a gas-phase detection method, any specimen must first be volatilized. For this purpose, specialized techniques have been developed to study low-volatility molecules with mass spectrometry.
The experimentally resolvable fraction of compounds, in terms of their volatility, has expanded steadily as techniques have improved. [31,40] For example, large biomolecules have been detected using several spray ionization sources (e.g., electrospray ionization [ESI] [41,42] and atmospheric pressure photoionization [APPI]), [43-45] and surface-bound species by desorption techniques such as matrix-assisted laser desorption ionization (MALDI). [31,46] Particle-bound targets, the constituents of nanoparticles, can be detected through direct aerosol sampling by, for example, using an aerodynamic lens with subsequent flash vaporization and EI ionization in aerosol mass spectrometry (AMS), [47] or by collecting the particles onto a filter (or wire) with subsequent rapid thermal desorption vaporization of the condensed-phase constituents. The latter is, for example, applied in chemical ionization mass spectrometry (CIMS) [48,49] detection (with, e.g., filter inlets for gas and aerosols [FIGAERO] [50] or thermal desorption multi-scheme chemical ionization inlet [TD-MION] [51]).
Of the atmospheric compounds, the volatile gas-phase organic molecules are commonly investigated with either gas chromatography mass spectrometry (GC-MS) [52] or proton transfer reaction mass spectrometry (PTRMS). [53] The least volatile fraction (corresponding to the lowest gas-phase concentrations) can generally only be measured by atmospheric pressure interface (Api) CIMS methods employing anion attachment. [7,10,54] Finding techniques that are applicable to the whole range of molecular species present in the atmosphere is a major challenge in atmospheric mass spectrometry, and multiple techniques are currently required to cover the whole volatility range (Figure 3).
Besides a broad compound coverage, the ideal mass spectrometric technique in atmospheric science should be able to analyze ambient gas-phase samples directly without the need for sample pre-treatment. [55] However, such techniques are rare and are often limited by, for example, sampling requirements (e.g., limited time resolution resulting from the necessary temporal spacing of compounds as they pass through a chromatographic column), sensitivity, and interference from background compounds (e.g., spectral overlaps in spectroscopic techniques). [56,57] Api-CIMS is popular because it can sample ambient air, usually through a differentially pumped interface (see, e.g., ref. [58]). Samples do not need to be pre-treated, which enables direct, online analysis. While various methods exist for analyzing aerosols in real time, such as resonance multiphoton ionization [59,60] and secondary electrospray ionization, [61] we will focus here on Api-CIMS due to its user-friendliness, reliability, and robustness. Api-CIMS can operate continuously for months, even in field conditions. Without sample pre-treatment, Api-CIMS can be coupled with other research methodologies that provide complementary information, such as ion mobility. [34,35][64][65][66] The atmospheric composition at a research site can be monitored for days, weeks, or sometimes even years. These time-consuming field campaigns are characteristic of atmospheric mass spectrometry and set atmospheric science apart from other research fields that use mass spectrometry (e.g., metabolomics or pharmaceutics). [67] Field instruments usually produce relatively long time series for a selected group of target ion signals. [36,37] At the opposite end of the time spectrum, specimens can also be collected on a filter or a filament and then analyzed within a few minutes in an Api-CIMS, [38,50,51] enabling high-throughput studies of aerosol particles. While early quadrupole-based Api-CIMS instruments were by necessity only monitoring selected target ions, modern mass spectrometric methods measure the whole mass spectrum continuously. [31] The field measurements are often performed up to a mass resolution of 200 000 (the higher the mass resolution, the smaller the resolvable changes in the target mass), which generates large amounts of data that make data analysis challenging. Currently, only a fraction of compounds in atmospheric mass spectrometry measurements are definitively identified due to the various challenges we will review in the next section. [19] Two possible mass spectrometry approaches exist that are suitable for compound identification following or during field campaigns. [70] Alternatively, current developments for improved compound identification by other mass spectrometry techniques used during field campaigns are ongoing and outlined below.
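The relation between mass resolution and the smallest resolvable mass difference can be made concrete with a short calculation. The sketch below assumes the common definition R = m/Δm, so that Δm = m/R; the function names and the example m/z values are illustrative choices and are not taken from the cited measurements. The example pair differs by about 0.0364 Da, which corresponds to exchanging CH4 for O in a molecular formula.

```python
def resolvable_mass_difference(mz: float, resolution: float) -> float:
    """Smallest resolvable mass difference at a given m/z, assuming R = m / dm."""
    return mz / resolution


def are_resolved(mz_a: float, mz_b: float, resolution: float) -> bool:
    """True if two peaks are separated by at least the resolvable difference."""
    mean_mz = 0.5 * (mz_a + mz_b)
    return abs(mz_a - mz_b) >= resolvable_mass_difference(mean_mz, resolution)


# Illustrative isobaric pair near m/z 300 (difference of about 0.0364 Da).
peak_a, peak_b = 300.0000, 300.0364

for resolution in (5_000, 200_000):
    dm = resolvable_mass_difference(300.0, resolution)
    print(resolution, dm, are_resolved(peak_a, peak_b, resolution))
```

At a resolution of 200 000, the resolvable difference near m/z 300 is 0.0015, so the two illustrative peaks are separated, whereas at a resolution of 5000 (resolvable difference 0.06) they are not.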
Field campaigns often employ soft ionization approaches such as Api-CIMS, which minimize ion fragmentation. In Api-CIMS, reagent ions attach to target molecules (adduction mode), revealing molecular formula information. Details on the molecular structure can be obtained by coupling Api-CIMS with molecular fragmentation techniques (MS/MS). [71] Varying the reagent ion increases sensitivity and selectivity, with detectable target ion concentrations ranging down to 10⁻⁴ cm⁻³. [15,54,72,73] New methods, for example, selected ion flow tube mass spectrometry (SIFT-MS) and specialized CIMS, [74] have been developed to improve compound identification by varying the ion-molecule interaction. Noteworthy is the 2019 development of the MION inlet platform, [55] facilitating rapid transitions between ionization modes (e.g., nitrate in anion mode [75] and aminium or proton transfer in the cation mode [76]). MION has already increased the number of detectable atmospheric molecules [55,77] and further methodological synergy promises even better compound identification in atmospheric mass spectrometry. [72,78]
Summarizing this section, atmospheric science is in a state of dichotomy. Field campaigns have produced large amounts of data, but these data are not labeled and have not been uploaded to mass spectral databases (see following sections). 0] The vast atmospheric compound space, the heterogeneity of studies (field vs laboratory), and the multiple mass spectrometric techniques have produced a data landscape that is difficult to navigate. Standardization procedures for data collection, processing, and analysis are still lacking. Combined, these challenges have hampered compound identification in atmospheric science.

Compound Identification with Mass Spectrometry
The identification of unknown compounds and processes is the holy grail of atmospheric mass spectrometry. Identifying unknown processes and compounds is challenging and requires suitable identification techniques and a high-accuracy identification method. Since only a few hundred atmospheric compounds out of potentially millions have been identified in aerosol samples, [68-70] the chemical space of atmospheric compounds remains largely uncharted. We also note that, while compound identification is important for gaining basic knowledge of atmospheric chemistry and for use in particle formation modeling, [79] atmospheric mass spectrometry studies are diverse in type and aim. Some studies do not require compound identification, such as: I) inventorying compounds based on their properties, II) real-time monitoring, or III) monitoring known sources or processes (for a review, see ref. [19]). In these example cases, it can be sufficient to track a molecular or elemental composition, or specific compounds and sources, which are easier objectives than compound identification.
In this perspective, we focus on compound identification. We have identified three factors that most affect the accuracy of compound identification in mass spectrometry, which we present in more detail below: the chosen experimental technique, the compound identification method (or tool), and the existence of reference standards.
Mass spectrometry methods are able to identify compounds to a varying degree. In 2015, Nozière et al. introduced the I-factor to quantify the identification accuracy of a mass spectrometry technique in terms of the ability to narrow down the number of plausible candidate structures. [19] In the best case, only one plausible structure is identified and the I-factor is equal to one. If the identification method is not able to discern between isomers of the molecular formula, the I-factor goes up to the number of isomers (two or higher). Uncertainties in the determination of the molecular formula can further increase the I-factor.
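As a rough illustration of this idea, the I-factor can be thought of as the number of candidate structures that remain consistent with all available measurements. The sketch below follows only the verbal description given here and is not the formal procedure of ref. [19]; the candidate list, the `loses_co2` flag, and the constraint functions are hypothetical.

```python
def i_factor(candidates, constraints):
    """Count the candidates that satisfy every experimental constraint."""
    remaining = [c for c in candidates if all(ok(c) for ok in constraints)]
    return len(remaining)


# Hypothetical candidate structures: three isomers plus one different formula.
candidates = [
    {"name": "isomer_1", "formula": "C10H16O3", "loses_co2": True},
    {"name": "isomer_2", "formula": "C10H16O3", "loses_co2": False},
    {"name": "isomer_3", "formula": "C10H16O3", "loses_co2": False},
    {"name": "other", "formula": "C9H12O4", "loses_co2": True},
]

# A formula-only measurement cannot distinguish the three isomers.
formula_only = [lambda c: c["formula"] == "C10H16O3"]

# Adding a (hypothetical) fragmentation observation narrows the list further.
with_fragmentation = formula_only + [lambda c: c["loses_co2"]]

print(i_factor(candidates, formula_only))        # 3
print(i_factor(candidates, with_fragmentation))  # 1
```

The second call shows how combining techniques, here a molecular formula plus a fragmentation pattern, can reduce the number of plausible structures to one.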
Nozière et al. used the I-factor to compare atmospheric mass spectrometric techniques in terms of their compound identification ability. [19] The best I-factors were achieved when two or more techniques, such as chromatography and mass spectrometry, were combined. Fragmentation mass spectrometry methods such as tandem mass spectrometry and EI mass spectrometry, coupled to chromatography methods, reached I-factors of 1-3. The I-factors of soft ionization techniques like CIMS were estimated at around 4-40 at the time of publication. The newly developed MION-CIMS method, which uses multiple ion chemistries (see Section 2), has the potential to achieve I-factors as low as those of the combined techniques described above. [55,56] The data produced by mass spectrometry techniques are used to isolate candidate structures with the help of a compound identification method.

www.advancedscience.com
The identification accuracy of compound identification methods and tools varies and is determined by their ability to match a recorded spectrum to a molecular structure. In Section 5, we summarize these tools and their principles. The performance of a compound identification tool is measured by the Top-k accuracy. Unlike the I-factor, which quantifies the ability of a mass spectrometry technique to resolve the identity of a compound, the Top-k accuracy gives the percentage of instances in which the correct compound is found among the k best matching compounds during a compound search. For example, a benchmark study in ref. [80] reported a Top-1 accuracy of 39.4% (and a Top-10 accuracy of 74.8%) for its highest-ranking identification tool. This means that the tool ranked the correct molecular structure first in roughly two out of five cases and found it among the ten best matches in about three fourths of all cases. Here, it should be noted that the absolute numbers are highly dependent on both the data size used in training and the molecular database used to retrieve candidate molecular structures. Moreover, the recorded mass spectrum's quality and type can limit the compound identification method's ability to provide reasonable candidate structure suggestions.
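The Top-k accuracy described above can be computed from ranked candidate lists in a few lines. This is a minimal sketch with hypothetical compound labels; it assumes that each query comes with one known correct compound and a list of candidates ordered from best to worst match.

```python
def top_k_accuracy(ranked_candidates, true_compounds, k):
    """Percentage of queries whose correct compound is among the k best matches."""
    hits = sum(
        1
        for ranking, truth in zip(ranked_candidates, true_compounds)
        if truth in ranking[:k]
    )
    return 100.0 * hits / len(true_compounds)


# Four hypothetical queries; the correct answer is "A" in every case.
rankings = [
    ["A", "B", "C"],  # correct compound ranked first
    ["B", "A", "C"],  # ranked second
    ["C", "B", "A"],  # ranked third
    ["B", "C", "A"],  # ranked third
]
truths = ["A", "A", "A", "A"]

print(top_k_accuracy(rankings, truths, k=1))  # 25.0
print(top_k_accuracy(rankings, truths, k=2))  # 50.0
print(top_k_accuracy(rankings, truths, k=3))  # 100.0
```

Because the Top-k accuracy can only increase with k, a Top-10 value is always at least as large as the corresponding Top-1 value, as in the benchmark numbers quoted above.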
The accuracy of a compound identification tool often depends on the existence of appropriate reference standards, that is, measured mass spectra of compounds that are either identical or similar to the unknown compound. In the compound identification process, most approaches search for the measured spectrum, or a very similar one, in a database. Even if the identification method does not employ a spectral database search, it has still likely been developed, parameterized, or trained with data from one or more such databases. In atmospheric science, the lack of reference standards is a large barrier for effective compound identification, [15,19,56] which we will return to later in this perspective.
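The database search mentioned above can be sketched as a similarity ranking between a query spectrum and reference spectra. The example below uses cosine similarity on binned peak lists, which is one common scoring choice and not necessarily the one used by any particular tool discussed here. The reference library, the compound names, and all peak values are hypothetical.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of fixed width."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def library_search(query_peaks, library, bin_width=1.0):
    """Return (score, name) pairs sorted from best to worst match."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical reference library of (m/z, intensity) peak lists.
library = {
    "compound_A": [(55.1, 100.0), (83.0, 40.0), (111.2, 10.0)],
    "compound_B": [(57.0, 80.0), (85.1, 100.0), (113.0, 20.0)],
}
query = [(55.0, 90.0), (83.1, 50.0), (111.0, 5.0)]

hits = library_search(query, library)
print(hits[0][1])  # compound_A
```

The search can only return compounds that are present in the library, which is why missing reference standards translate directly into missed or wrong identifications.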
In the digitization of compound identification in atmospheric mass spectrometry, machine learning will naturally play a large role. As we will detail in the next section, machine learning tools are already utilized to automate and improve analysis and processing of mass spectrometry data in other fields (see a recent review in ref. [81]). Figure 4 illustrates a typical mass spectrometry data acquisition process. In atmospheric mass spectrometry, machine learning is already applied to some, but not all, of the steps outlined in Figure 4. Machine learning models have been trained on different atmospheric mass spectrometry data (like AMS, PTRMS, ESI-mass spectrometry, single particle mass spectrometry, and inductively coupled plasma mass spectrometry) for aerosol classification and source apportionment [82-90] and for prediction of composition [91-94] and properties. [95,96] Moreover, a recent review highlighted the role of machine learning in data pre-processing during measurements of volatile organic compounds. [97] Thus, machine learning is being integrated into the data analysis of atmospheric mass spectrometry, but little attention is currently devoted to compound identification. GC-MS machine learning models for molecular formula annotation of atmospheric, halogenated compounds, [98] or for molecular property and quantification factor prediction, [69] are two notable exceptions.
We will next address the reasons for the gap between the perceived demand for, and utility of, smart, high-throughput compound identification tools for atmospheric mass spectrometry and the lack of such tools. We will also identify the major barriers to introducing compound identification techniques in atmospheric mass spectrometry. Key to both points are the currently available mass spectral databases and their link to the success story of machine learning for compound identification in the field of metabolomics.

Mass Spectral Databases
Digital mass spectrometry libraries with reference mass spectra, so-called mass spectral databases, have been used for compound identification since the 1960s. [1,2] Over time, mass spectral databases have grown in size and usage, partly as a result of increased data processing and storage capabilities as well as adoption of open science practices. Table 1 summarizes a selection of mass spectral databases that are hosted by research institutions, or distributed by companies and mass spectrometry vendors. The mass spectral data are either collected through research community contributions (e.g., refs. [99-105]), or curation of scientific publications, measurements, and computations (e.g., refs. [106-113]).
By design, mass spectral databases either cover a specific compound space or aim for some level of generality. However, in reality, the data in large mass spectral databases tend to reflect the interest of the primary users and contributors. This is evident in Table 1, which includes specific mass spectral databases created for and by the metabolomics community. These databases contain predominantly small molecules called metabolites, found in organisms, cells, or tissues. As in atmospheric science, mass spectrometry is used in metabolomics to identify and quantify molecules of interest. The plethora of mass spectral databases in metabolomics can be attributed to open science initiatives in the research field and the ensuing rapid growth over the past 25 years. As a result, large, general mass spectral databases contain mostly metabolites (see also Figure 5a), [102,103,110] despite no stated limitation or constraints on the compound coverage. For this reason, we have decided to highlight metabolomics in this perspective and to use it as a comparative example for developments in atmospheric science. Besides metabolites, other common compound classes in general databases include molecules found in drug or environmental samples (see an overview of NIST 2023 tandem mass spectral library in Figure 5a). Mass spectral databases provide data collected with a variety of mass spectrometric techniques. 03,106,109,110] The most common technique is LC-MS/MS, followed by GC-MS. For example, the MassBank of North America contains approximately 30 times fewer MS1 spectra (22 500) than tandem mass spectra (including all MSⁿ) (May, 2023). As expected, these most common mass spectrometric techniques found in mass spectral databases are those that facilitate compound identification (see Section 3).
Table 1 entries (name, web address, description, reference):
MassBank of North America (MoNA); mona.fiehnlab.ucdavis.edu; Auto-curated public database with experimental and computational mass spectra of > 650 292 compounds. Includes quality estimation of the mass spectra. [105]
RIKEN tandem mass spectral database (ReSpect) for phytochemicals; spectra.psc.riken.jp; A curated database with 8649 tandem mass spectra of 3595 plant metabolite compounds collected from scientific literature in 2011 and authentic standards. Has grown since and now contains 9017 (+368) spectra. [104]
Maurer/Wissenbach/Weber LC-MSⁿ Library of Drugs, Poisons, and their Metabolites (2nd edition); sciencesolutions.wiley.com; LC-MSⁿ library of over 2270 compounds and over 3600 of their metabolites curated for forensic use. [112,113]
Metlin Gen2 (Mass consortium); massconsortium.com; METLIN is a highly curated commercial database with experimental spectra on over 930 000 molecular standards (2023) (LC-MS/MS). All molecular standards were analyzed in positive and negative ionization modes and at four different collision energies (0, 10, 20, and 40 eV). [118]
The number of compounds in the mass spectral databases of Table 1 varies considerably, although a direct comparison of the database size is complicated by the non-standardized way in which the size is reported (e.g., number of ions, number of unique compounds, or number of spectra). The reported data volume of mass spectral libraries either increases continuously or with new versions. The data volumes listed in Table 1 reflect the state in August 2023. LipidSearch by Thermo Fisher is the largest mass spectral database with spectra for over 1.7 million lipid ions. MassBank of North America is the largest open access database with spectra for over 650 000 compounds. The smallest database reports spectra for only 200 compounds. [109] The median size of all databases reported in Table 1 is 26 485 (average > 290 000). However, the databases overlap in terms of the compounds they cover. [119] The total number of compounds offered by all databases together is therefore likely less than the sum of their individual compound counts. Synthetic (i.e., computational) mass spectra have been important for creating large mass spectral databases. Table 1 also lists mass spectral libraries with computationally predicted (so-called in silico) tandem mass spectra or GC-MS spectra. [101,103,106,108,111,116] For example, LipidBlast is a purely computational database, which also provides a tool for users to build their own tandem mass spectrometry database. [108] The motivation for generating computational databases, and sometimes combining them with experimental ones, is the need to accelerate data collection. The large number of predicted mass spectra can greatly increase the average mass spectral database size. For example, HMDB contains experimental LC-MS/MS spectra for approximately 4000 compounds, but computational spectra for more than 200 000 compounds. The quality and information content of in silico spectra is, however, still a subject of debate.
The retention time provides useful additional information and is often enough for correct compound annotation in LC- and GC-mass spectrometry. However, for certain isomeric compounds, even the simple chromatographic separation does not provide a positive compound identification and further separation can be necessary. [120] Retention times in GC-MS are collected in MassBanks, [102,103,121] GMD, [100] and NIST23, [110] among others. In addition, computationally predicted retention times are supplied in, for example, HMDB. [106] However, retention times tend to vary significantly between laboratories, which hampers their utility for compound identification. Machine learning techniques can help in alleviating this problem (see Section 5).
Vinaixa and colleagues reviewed features of mass spectral databases in 2016. [119] They identified beneficial features such as open access, downloadable data, large size, curation, data from different platforms, functionality to merge spectra, inclusion of chemical standards, and addition of unknown compounds. On the adverse side, they list commercial licenses, lack of curation and spectrum information, limited sample sources, only negative polarity mode, or only computational data. The review also surmises that there might be a trade-off between too many and too few instrument types as well as collision energies. Following Vinaixa et al., we summarize some features of the mass spectral databases in Tables 1 and 2.
Table 2. Features of the mass spectrometry databases. Open access, partial or full free access to mass spectral data; Data upload, users can contribute data; Comp. data, contains computationally (in silico) generated mass spectra; Exp. data, experimental mass spectrometry data; collects unknowns, collects and adds unknown spectral queries; machine learning tools, has associated machine learning tools.
[Table 2 body not recoverable from the source; surviving row label: Human Metabolome Database (HMDB), v5.]
a) For academic and non-commercial use; b) Download page contains non-redundant mass spectra that were calculated from available multiple replicate spectra; c) Provides a tool to make your own database with computational data; d) Stores spectra of tentatively identified compounds.
Mass spectrometry data pipelines and infrastructures are important to further grow mass spectral databases and to facilitate data management, curation, and reproducibility. [122] For example, Pedrioli and colleagues developed the open, vendor-independent data representation mzXML in 2004, which enables cross-platform data analysis and management. [123] In addition, a plethora of freely available software has been developed to facilitate mass spectrometry data processing and upload, such as OpenMS, [124] TidyMass, [125] XCMS, [126,127] metaboscape, [128] progenesis, [129] mztab-m, [130] mzMine, [131] and MS-DIAL. [132] Furthermore, the GNPS database offers a feature-based molecular networking tool, which connects feature processing to molecular network modeling. [133] Another important data management feature mitigates provenance variability. In LC-MS/MS (as in other soft ionization techniques), data collected at different experimental conditions can vary in appearance. To mitigate such spectral variability, certain database providers have developed the concept of spectral trees [114] and merged spectra [121] that combine spectra collected under different conditions for the same analyte.
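To illustrate the idea behind merged spectra, the following sketch combines peak lists recorded for the same analyte under different conditions into a single spectrum. It is a simplified assumption of how such merging could work and not the procedure of any specific database provider: each spectrum is normalized to its base peak, m/z values are grouped by rounding, and the highest relative intensity per group is kept. All peak values are hypothetical.

```python
from collections import defaultdict


def merge_spectra(spectra, decimals=2):
    """Merge several (m/z, intensity) peak lists into one normalized spectrum."""
    merged = defaultdict(float)
    for peaks in spectra:
        if not peaks:
            continue
        base_peak = max(intensity for _, intensity in peaks)
        if base_peak <= 0:
            continue
        for mz, intensity in peaks:
            key = round(mz, decimals)  # group peaks that share the rounded m/z
            merged[key] = max(merged[key], intensity / base_peak)
    return sorted(merged.items())


# Hypothetical spectra of one analyte at a low and a high collision energy.
low_energy = [(201.11, 1000.0), (183.10, 50.0)]
high_energy = [(183.10, 400.0), (155.07, 800.0)]

merged = merge_spectra([low_energy, high_energy])
for mz, relative_intensity in merged:
    print(mz, relative_intensity)
```

Grouping by rounding is a deliberate simplification, since peaks close to a rounding boundary can be split into two groups; a tolerance-based grouping would be more robust.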

Compound Identification: Approaches and Software
Compound identification is the primary purpose of mass spectral databases. Traditionally, compounds were identified by searching libraries or databases for matches.[144][145] In the traditional library search, the measured mass spectrum is compared to all spectra in a mass spectral database. The compound is identified (be it correctly or not) as the one with the most similar mass spectrum among those in the database. A mass spectral library search is inherently limited by the size of the database, which is typically some orders of magnitude smaller than the target compound space. [146] State-of-the-art compound identification methods also use database information but go significantly beyond library searches.136][137] During compound identification, spectral predictions are made for all entries in a compound database and compared to the measured spectrum to find the best match. In contrast to traditional mass spectral library searches, in silico fragmentation methods search through compound databases (e.g., PubChem) and not through mass spectral libraries. Compound databases cover a larger portion of chemical space than mass spectral databases and are thus less limited in content and size. Rule-based in silico fragmentation methods are limited by the available fragmentation models that rely on heuristic bond energies (measured or estimated), while combinatorial methods generally need to limit the amount of fragmentation allowed by the model. In a similar vein, fragmentation tree methods find the optimal fragmentation tree that matches a recorded spectrum. Fragmentation trees are used for de novo molecular formula annotation through Gibbs sampling and Bayesian statistics. [141,147]138]
The third category of compound identification algorithms is referred to as machine learning approaches, which are emerging as powerful property and structure inference tools in spectrometry. [148]][144][145] In the first step, a mass spectrum is mapped to a feature space represented by a so-called fingerprint. A fingerprint is a vector that encodes the presence or absence of certain molecular features or their counts. Molecular fingerprints can be calculated in different ways from a molecular representation, like a 2D molecular geometry (e.g., refs. [149, 150]). The mapping from spectra to molecular fingerprints requires a reference dataset of spectrum-molecule pairs. Supervised machine learning algorithms are then trained to assign fingerprints to spectra. Examples include kernel methods, such as support vector machines, [142] vector-valued kernel ridge regression, [143,151,152] and multiple kernel learning support vector machines, [80,125,139,144] or a combination of deep learning and multiple kernel learning. [145] In the second step, the fingerprint vector is compared to the molecular fingerprints of compounds in compound databases.][155] Additional information channels such as LC retention times, [154,156-159] pairwise retention orders, [160] or retention indices [154,156-159] (both relating to the retention order of compounds from LC), or collision cross sections [161] can further improve the identification success. For retention time data, the heterogeneity of data across different laboratories is a hindrance because the retention times depend on the configuration of the chromatograph. Machine learning techniques have been developed to standardize retention times across different laboratories [162] and to learn from the relative retention times of molecules, [163,164] which are known to be more invariant across laboratories than absolute retention times. [160]
Open access mass spectral databases containing high-quality reference mass spectra have been essential for the development of machine learning-based compound identification. For example, FingerID, [142] IOKR, [143] Adaptive, [145] CSI:FingerID 1.0, [139] and CSI:FingerID 1.1 [80] were all trained using different sets of compounds from different libraries (MassBank, GNPS, MassHunter Forensics/Toxicology PCDL library [Agilent Technologies, Inc.], and NIST17), with sizes ranging from approximately 1200 to 16 083 compounds. The increase in compound identification accuracy during the past decade can largely be attributed to the growth of the spectral databases. In these examples, the Agilent Technologies, Inc. library and the NIST mass spectral library are the only commercial datasets.
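The second step of the machine learning scheme described in this section, in which a predicted fingerprint is scored against the fingerprints of candidate compounds, can be illustrated with a minimal sketch. It is not the scoring function of any of the cited tools: the Tanimoto coefficient is assumed here as one common similarity measure for binary fingerprints, and the candidate names and bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_candidates(predicted_fp, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(predicted_fp, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprint predicted from a measured spectrum (step one).
predicted = {1, 4, 7, 9}

# Hypothetical compound database entries with precomputed fingerprints.
candidates = {
    "candidate_A": {1, 4, 7, 9, 12},
    "candidate_B": {2, 4, 9},
    "candidate_C": {3, 5, 8},
}

ranking = rank_candidates(predicted, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The top-ranked entry is only the best match among the candidates supplied, so the quality of the ranking depends on both the predicted fingerprint and the coverage of the compound database.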
In summary, a variety of approaches and software are now available for compound identification. Open access mass spectral databases have been integral to the development of machine learning approaches and have facilitated the emergence of data-driven mass spectrometry in metabolomics. In the next section, we review how these insights, concepts, tools, and infrastructures can be transferred to atmospheric science.
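As a concrete illustration of the traditional library search discussed in this section, the following sketch compares a measured spectrum with every library spectrum and returns the most similar entry. It assumes unit-width m/z binning and cosine similarity, which is one widely used spectral similarity measure; the library entries, peak lists, and function names are hypothetical.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * spec_b.get(key, 0.0) for key, value in spec_a.items())
    norm_a = math.sqrt(sum(value * value for value in spec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def library_search(query_peaks, library):
    """Score the query against all library spectra and return the best match."""
    query = bin_spectrum(query_peaks)
    scores = {
        name: cosine_similarity(query, bin_spectrum(peaks))
        for name, peaks in library.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical reference library of (m/z, intensity) peak lists.
library = {
    "compound_X": [(43.0, 100.0), (58.1, 40.0), (71.0, 10.0)],
    "compound_Y": [(45.0, 80.0), (60.0, 100.0)],
}
query = [(43.02, 90.0), (58.05, 50.0), (71.1, 5.0)]

best, scores = library_search(query, library)
print(best, round(scores[best], 3))
```

Note that the search always returns some best match, even when the true compound is absent from the library, which is why the identification can be incorrect.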

Toward Data-Driven Compound Identification in Atmospheric Mass Spectrometry
In principle, all compound identification approaches we reviewed in this perspective could be directly used in atmospheric science. Suitable training or reference data, however, might be a limiting factor. The identification success rate would strongly depend on the number of atmospheric compounds in available mass spectral databases, or at least on the similarity between these compounds and those in the databases. Furthermore, the preferred mass spectrometric techniques in atmospheric science may differ from those prevalent in current databases. While compound identification algorithms may be able to extrapolate to the chemical space of atmospheric compounds, such generalization would be algorithm-dependent and likely incur large uncertainties. We will address these points and propose an action plan to improve data-driven compound identification in atmospheric science. We start off by highlighting general challenges faced in the adoption of mass spectral databases for data-driven compound identification.

Data Heterogeneity in Mass Spectrometry Databases
The content coverage of current mass spectrometry databases is heterogeneous in terms of compounds, instruments, and experimental procedures. Tool and method developers therefore face the challenge of balancing the available data volume, more of which benefits, for example, machine learning methods, against the increased effort of handling the heterogeneity appropriately. Another challenge is the aforementioned coverage overlap, which could introduce biases in data-driven tools derived from more than one database. The current extent of this overlap is, unfortunately, not known, since the last investigation by Vinaixa et al. dates back to 2016. [119] The heterogeneity of available mass spectrometry techniques (see Figure 5b) presents a further challenge but also an opportunity. The characteristics of spectra produced by different mass spectrometry techniques differ, which necessitates dedicated tool and method development. In the long run, however, this technique diversity could be advantageous since different spectrometries could complement each other synergistically. With transfer learning, multivariate machine learning models could be trained to convert between techniques or operate directly on heterogeneous datasets.
In summary, in atmospheric science, much work is still required to assess the utility of existing databases, determine which training data to include in new models, and establish initial identification tools for atmospherically relevant compounds. Below, we provide a first assessment of the relevance of current mass spectral libraries for data-driven atmospheric mass spectrometry. Investments in improved compound identification for atmospheric science can be justified by the progress achieved in other application domains, such as metabolomics, which have been able to collect experimental data for tens of thousands of compounds (see Section 4).

Compound Coverage of Atmospheric Molecules
As alluded to in Section 4, atmospheric compounds are currently under-represented in mass spectral databases. Compound identification approaches that were developed for specific database compounds will almost certainly perform worse for atmospheric compounds than for compound classes in the databases. This is true for traditional library searches, which can only identify structures stored in a mass spectral database, as well as for algorithms built with database compounds and spectra.
How well compound identification algorithms perform for atmospheric compounds depends on the overlap of atmospheric compound space with available mass spectral databases. Figure 7 shows a first visualization of this overlap. The figure presents a t-distributed stochastic neighbor embedding (t-SNE) analysis for three atmospheric molecular datasets (here referred to as Gecko, [165,166] Wang, [167] and Quinones [168,169]) and two datasets of drug and metabolite compounds, representative of those in mass spectral databases (nablaDFT [170,171] and Massbank of North America [103]). t-SNE clustered the compounds according to the similarity of their (molecular) topological fingerprints. [149,150] Figure 7 shows that the atmospheric compounds cluster closer together and are therefore more similar to each other. Their clusters do not, however, overlap strongly, which indicates that these three datasets cover different parts of atmospheric compound space. The drug and metabolite compounds form their own clusters; most notable is the dense ring of MassBank molecules surrounding the clusters of the other datasets. The two drug and metabolite datasets share some similarity in the inside of the ring, but only MassBank has some small overlap with the three atmospheric datasets. The implications of Figure 7 are: i) most atmospheric compound classes are absent from mass spectral databases; ii) most atmospheric compounds therefore belong to a chemical space unknown to current compound identification algorithms; iii) the performance of compound identification algorithms in atmospheric science is unpredictable. Three traditional library searches report identification rates of only 2-35% for atmospheric molecules, [68-70] providing further evidence for our three suppositions.
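The overlap between two compound sets can also be summarized numerically. The sketch below is not the t-SNE analysis of Figure 7; it swaps in a simpler nearest-neighbor measure, in which each atmospheric compound is assigned its highest Tanimoto similarity to any reference compound and the fraction above a threshold is reported as coverage. The fingerprints, the threshold of 0.5, and all names are assumptions for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def nearest_neighbor_similarity(query_fps, reference_fps):
    """For each query fingerprint, the highest similarity to any reference fingerprint."""
    return [max(tanimoto(q, r) for r in reference_fps) for q in query_fps]


def coverage(query_fps, reference_fps, threshold=0.5):
    """Fraction of query compounds with at least one sufficiently similar reference."""
    similarities = nearest_neighbor_similarity(query_fps, reference_fps)
    return sum(s >= threshold for s in similarities) / len(similarities)


# Hypothetical fingerprints standing in for a mass spectral database ...
reference = [{1, 2, 3, 4}, {2, 3, 5, 6}, {7, 8, 9}]
# ... and for a set of atmospheric compounds.
atmospheric = [{1, 2, 3}, {10, 11, 12}, {7, 8, 13, 14}, {20, 21}]

nn_similarity = nearest_neighbor_similarity(atmospheric, reference)
covered = coverage(atmospheric, reference)
print(nn_similarity, covered)
```

A low coverage value would point to the same conclusion as sparse overlap in an embedding plot: most query compounds have no close analogue in the reference set.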
The fact that atmospheric compounds differ from those in available mass spectral databases implies that compound identification algorithms would have to be able to extrapolate to be applicable in atmospheric science in the short term. Yet, classical rule-based in silico fragmentation algorithms generalize poorly due to built-in rule sets for chemical bond fragmentation, [146] while in silico fragmentation methods based on combinatorial search (e.g., MetFrag, CFM-ID) are expected to do slightly better. On the other hand, generalization is a common challenge for machine learning models in chemistry. [172] For example, a machine learning model is forced to generalize when it evaluates a new elemental composition, [173] molecular size, [174] or functional group [175] that was not in the training data. Methods for quantifying uncertainty or confidence in a model's prediction have been developed through ensemble methods, [174,176] Bayesian neural networks, [177] Gaussian process regression, [178] support vector machines, [179] and Monte Carlo dropout. [180] In metabolomics, it has been shown that machine learning methods predicting molecular fingerprints from spectra outperform in silico fragmentation approaches. [80,164] However, it is not known if this also holds true in atmospheric science, where the coverage of the reference spectra of the relevant chemical space is significantly smaller.
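To illustrate how an ensemble can flag extrapolation, the following sketch trains a bootstrap ensemble and uses the spread of its predictions as an uncertainty estimate. It is a generic illustration rather than the method of any cited work: a straight-line fit stands in for a real machine learning model, and the synthetic data, noise level, and ensemble size are assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_ensemble(xs, ys, n_models=200, seed=0):
    """Fit one model per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in idx]
        if len(set(sample_x)) < 2:  # skip degenerate resamples
            continue
        models.append(fit_line(sample_x, [ys[i] for i in idx]))
    return models


def predict_with_uncertainty(models, x):
    """Ensemble mean and standard deviation of the predictions at x."""
    predictions = [slope * x + intercept for slope, intercept in models]
    return statistics.fmean(predictions), statistics.stdev(predictions)


# Synthetic training data on the interval [0, 1] (assumed for illustration).
data_rng = random.Random(1)
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.1) for x in xs]

models = bootstrap_ensemble(xs, ys)
inside = predict_with_uncertainty(models, 0.5)   # within the training range
outside = predict_with_uncertainty(models, 5.0)  # far outside the training range
print(inside, outside)
```

The ensemble spread grows for inputs far from the training data, which is the behavior one would want a compound identification model to report when it is applied to atmospheric compounds unlike those it was trained on.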
Until atmospheric data are available in large enough quantities in mass spectral databases, it would seem prudent not to develop new compound identification methods or workflows immediately for atmospheric science. Machine learning-based approaches, for example, could instead evolve from existing methods developed in other application domains by means of transfer learning. For mass spectrometric techniques commonly found in mass spectral databases, such as tandem mass spectrometry or EI-MS, transfer learning would be particularly well suited, as already developed models would likely only have to be retrained on atmospheric data. However, for under-represented techniques such as Api-CIMS, transfer learning would not be applicable and new approaches would have to be developed. Api-CIMS applications are currently flourishing in atmospheric science (see Section 2) [34,35,50,51,55,74,76,77,181] but are practically absent from current databases (e.g., less than 0.1% of the European MassBank [102] data, see Figure 5b). If atmospheric science is moving toward data-driven compound identification, this severe lack of data needs to be addressed. In the following, we outline an action plan to fill this data vacuum.

Action Plan
In this perspective, we reviewed the current challenges of implementing data-driven methods for mass spectrometry in atmospheric science. We next present practical strategies to overcome the identified barriers. Our recommendations are summarized in Figure 8 and expanded on in the following.

A1-Relevant Data
A paradigm shift toward data-driven mass spectrometry in atmospheric science could begin with access to relevant data (Section 6). For atmospheric mass spectrometry, reference spectra would have to be collected for the compounds taking part in atmospheric chemistry, including the atmospheric gas phase, small clusters, and nanoparticles (see Section 2). The collection could begin with representative compounds and expand from there. Finding such relevant molecules is no simple feat because the chemical space of atmospheric compounds is large and largely uncharted. We suggest using data-driven approaches, possibly based on the volatility basis set description of atmospheric compound space (see Section 1), to ensure data collection of compounds with varying properties of interest, such as volatility and O:C ratio. Data collection should furthermore include the multiple mass spectrometry techniques used in atmospheric science for compatibility with existing databases and compound identification tools, as well as for a holistic description of atmospheric chemistry. It is particularly important to include presently under-represented techniques (e.g., Api-CIMS, as addressed in Section 6.2) to improve their data coverage in the databases. The methodology portfolio could be augmented with synthetic data generated with computational tools, as discussed further in A4 below. For example, computational studies in atmospheric chemistry have shown that the binding energy between molecules and reagent ions can be used to predict the experimentally measured CIMS sensitivity (e.g., refs. [72, 78, 182]).

A2-Standardization
To utilize the collected data in atmospheric science to its full extent, standards and standardized practices for data collection, curation, management, and sharing need to be agreed on and implemented. For certain mass spectrometric techniques (e.g., EI-MS and MS/MS), such practices have already been developed in other fields (e.g., metabolomics, see Section 4) to ensure data standardization and reproducibility (e.g., platform-independent data formats, data analysis pipelines, and spectral trees or merged spectra). They could be directly applied to atmospheric mass spectral data and should be embraced by atmospheric scientists. Conversely, for techniques currently under-represented in mass spectral databases (e.g., Api-CIMS), appropriate standardization practices still need to be developed. Such practices also need to consider the specific use cases in atmospheric science (e.g., the lack of sample pre-treatment and separation by chromatography). For example, Api-CIMS data should be easy to standardize, because the number of different Api-CIMS instruments used in the field has stayed relatively small, with a dominant fraction of the data being acquired by similar methods, such as chemical ionization atmospheric interface time-of-flight (CI-Api-ToF) instrumentation, or the recently introduced orbitrap CIMS systems. [54,58,181,183,184] For Api-CIMS, the standardization of ion production and gas-phase sample introduction is crucial for ensuring fully reproducible measurements. The signal depends on specific ion-molecule reactions and interaction time. Gas-phase chemical ionization is typically linear and scalable, allowing for a wide range of ion concentrations for increased sensitivity. Normalizing measured signals with the number of charge carriers (i.e., reagent ions) is essential in Api-CIMS analysis to account for differences in the initial ion pool. Digital CI-Api-ToF twins can aid in the standardization. [185]
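The normalization by charge carriers mentioned above can be sketched as follows. This is a minimal illustration under the assumption that the normalized signal is the analyte count rate divided by the summed reagent ion count rate; which reagent ion species are included in the sum depends on the instrument and ionization scheme, and the function name and numbers are hypothetical.

```python
def normalize_signals(analyte_counts, reagent_ion_counts):
    """Divide each analyte signal by the total reagent ion signal of the same measurement."""
    total_reagent = sum(reagent_ion_counts)
    if total_reagent <= 0:
        raise ValueError("Total reagent ion signal must be positive.")
    return {name: counts / total_reagent for name, counts in analyte_counts.items()}


# Two hypothetical measurements of the same sample with different initial ion pools.
run_a = normalize_signals({"analyte_1": 200.0}, reagent_ion_counts=[1.0e5, 2.0e4])
run_b = normalize_signals({"analyte_1": 100.0}, reagent_ion_counts=[5.0e4, 1.0e4])

print(run_a["analyte_1"], run_b["analyte_1"])
```

Although the raw analyte counts differ by a factor of two, the normalized signals agree, which is the property that makes measurements with different reagent ion concentrations comparable.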

A3-Infrastructure
Data collection and sharing require dedicated infrastructures. En route toward data-driven science, atmospheric science could proceed in two different ways: i) establish dedicated mass spectral databases for atmospheric science data that are operated by the atmospheric science community, or ii) contribute atmospheric science data to existing mass spectral databases. A dedicated database in option (i) offers better control over the data (for example, data curation, labeling, and quality control) but requires concerted actions of key stakeholders and sustained funding. [186] Adopting existing mass spectral databases as in option (ii) is therefore easier in the short term. Contributing to an existing, interdisciplinary mass spectral database promotes data sharing with the broader mass spectrometry community, which expands the user base. We recommend a third option, which is an amalgamation of the two approaches above: curating dedicated databases that can be local to research groups or consortia, but are regularly uploaded to and synchronized with large open access databases (such as the MassBanks or GNPS). Dedicated databases could, for example, be linked to collections of reference spectra of atmospheric compounds (e.g., refs. [25-28]). Such collections need to grow to provide access to curated high-quality training data for data-driven method development. Meanwhile, data from field campaign repositories containing data of unknown compounds can be shared for compound identification. In addition, community datasets, such as refs. [68, 166, 187, 188], could complement data infrastructures. They offer distinct advantages such as having been purposefully curated with design criteria like similarity and balance in mind.

A4-Dedicated Machine Learning Methods
In Sections 5 and 6.2, we reviewed the potential and challenges of available machine learning-based compound identification tools in atmospheric science and observed that the identification performance depends strongly on the availability of relevant data (see A1). For tandem and EI mass spectrometry, data are available for other compounds, and we propose to begin applying existing machine learning techniques to atmospheric data and to then refine the models accordingly. Over time, such models could be improved through transfer learning, possibly coupled to active learning schemes, as new atmospheric data become available (Section 6.2). For mass spectrometric techniques that lack existing machine learning models but are used for compound identification in atmospheric science (e.g., MION-CIMS), new, dedicated models need to be developed. The development of such models could make use of computational mass spectral databases until experimental counterparts become available (see A1). To that end, machine learning could also assist in building computational databases by expediting calculations of the binding energies used to predict CIMS sensitivity.
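For a multi-reagent technique such as MION-CIMS, the set of CIMS sensitivities of a compound toward different reagent ions can serve as a molecule-specific fingerprint (Figure 9). The sketch below shows the simplest possible mapping from such a fingerprint to a compound: a nearest-neighbor lookup in a reference table, which stands in for a trained machine learning model. The reference sensitivities, compound names, number of reagent ions, and distance measure are all hypothetical.

```python
import math


def euclidean(vec_a, vec_b):
    """Euclidean distance between two sensitivity vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def nearest_reference(measured, reference):
    """Name of the reference compound whose sensitivity vector is closest to the measurement."""
    return min(reference, key=lambda name: euclidean(measured, reference[name]))


# Hypothetical reference table: relative sensitivities toward three reagent ions,
# for example taken from a computational database.
reference = {
    "compound_P": (0.9, 0.1, 0.4),
    "compound_Q": (0.2, 0.8, 0.3),
    "compound_R": (0.5, 0.5, 0.9),
}

# Hypothetical measured fingerprint, in the same reagent ion order as the reference table.
measured = (0.8, 0.2, 0.5)

match = nearest_reference(measured, reference)
print(match)
```

A dedicated model would replace the lookup with a learned mapping to a molecular representation, but the input and output of the workflow would remain the same.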

A5-Community Endorsement
Widespread adoption of standardized data practices requires a community-wide effort. Together, the atmospheric science community needs to commit to open data sharing and publishing. The data should preferably be shared through open access databases, or with FAIR sharing rights [189] if published with commercial parties. Adoption of community-wide data practices can be encouraged through education in data literacy and machine learning, for example, in summer schools, webinars, or workshops. Further dissemination at atmospheric science conferences and through research networks would create awareness and rally the community to endorse the new paradigm.

Take-Home Message
In this perspective, we reviewed the current state and potential for data-driven compound identification in atmospheric mass spectrometry. Although developments of experimental techniques now enable monitoring and tracking of atmospheric chemical processes, an accurate method for high-throughput compound identification is still missing. Community-wide efforts to improve data standardization and collection can support the transition toward reliable identification of atmospheric compounds with mass spectrometry. Integration of data-driven approaches, such as machine learning, into mass spectrometric data analysis will facilitate knowledge gain. Concomitantly, a true paradigm change requires community endorsement and a combined effort to collect, curate, and share data in a standardized manner.
Although the development of data-driven approaches requires an initial time and resource investment, data-driven approaches promise to be more efficient than the manual processing currently employed. Successful examples in parallel fields can be used to guide and inform this shift toward a digital era in atmospheric mass spectrometry.

Figure 3. Example overview of mass spectrometric techniques and complementary separation techniques (in italicized font), used to study atmospheric compounds ranging from molecules in the gas phase, clusters to aerosols, and aerosol surfaces. The arrows at the bottom of the figure indicate the inverse relation between measurable scale and detectable volatility. EI, electron ionization; DMA, differential mobility analysis; IMS, ion mobility spectrometry; CI, chemical ionization; ESI, electrospray ionization; EESI, extractive electrospray ionization; AMS, aerosol mass spectrometry; MALDI, matrix-assisted laser desorption ionization; FIGAERO, filter inlet for gas and aerosols; TDCI, thermal desorption chemical ionization; FAB, fast atom bombardment; BBI, bursting bubble ionization; ISAT, interfacial sampling with an acoustic transducer.

Figure 4. Data processing and analysis steps in a mass spectrometry experiment that have been performed using machine learning methods. Spectral information is extracted through data processing and analysis. Data processing serves to mitigate statistical effects such as batch-to-batch variations or missing data. Other processing steps include peak processing, alignment, integration, and annotation. Conversely, data analysis aids in the classification or detection of molecules and the identification of chemical pathways to the observed molecules. ANN, artificial neural network; CNN, convolutional neural network; RF, random forest model; SVM, support vector machine.

Figure 5. Example of listed contents in mass spectral databases. a) The reported compound coverage of the NIST 23 tandem mass spectral library. b) The different reported mass spectrometric techniques in the European MassBank. These two databases represent general mass spectral databases. E&L, extractables and leachables; CI, chemical ionization; B, bombardment; GC, gas chromatography; EI, electron ionization; TOF, time-of-flight; ESI, electrospray ionization; Q, QQ, QQQ, single, double, triple quadrupole instrument; LC, liquid chromatography; EI-B, electron bombardment ionization; QFT, quadrupole Fourier transform; ITFT, ion trap Fourier transform.

Figure 6. Schematic of the operating principle of most machine learning-based compound identification tools. A machine learning model learns to map a mass spectrum to a feature space, here represented by a molecular fingerprint vector. In a second step, the similarity is scored between the predicted fingerprint and the molecular fingerprints of a compound database. ML, machine learning; MS/MS, tandem mass spectrometry.

Figure 7. Similarity between molecular datasets containing drug molecules (nablaDFT), metabolites (Massbank of North America), and atmospheric molecules (Gecko, Wang, and Quinones) shown through t-SNE clustering. The molecules were compared based on their topological fingerprint.

Figure 8. Our proposed action plan is designed to overcome the challenges hindering a successful implementation of data-driven mass spectrometry in atmospheric science. The plan contains five steps A1-A5.

Figure 9 outlines our proposal for a machine learning-based compound identification scheme for MION-CIMS. The CIMS sensitivity for different reagent ions acts as the molecule-specific MION-CIMS fingerprint. The machine learning model learns how to map the MION-CIMS fingerprint to a molecular representation.

Figure 9. A proposed workflow for machine learning-based compound identification with MION-CIMS. The model learns how to map the molecule-specific MION-CIMS fingerprint (set of CIMS sensitivity values for different reagent ions) to a molecular representation.

Table 1. List of select mass spectrometry databases. The list is divided into open access (top) and commercial (bottom). Data volumes reflect the state in August 2023 (the data were taken from an associated webpage or publication). GC, gas chromatography; MS, mass spectrometry; FAB, fast atom bombardment; MS/MS, tandem MS; LC, liquid chromatography; MSn, tandem mass spectrometry done with n fragmentation stages.
The database contains 26 485 unique structures (when full structure is available).
The GNPS database contains data contributions from the public and other mass spectral libraries.
LipidBlast (fiehnlab.ucdavis.edu): an in silico tandem mass spectral library for lipid identification containing predicted spectra for 119 200 compounds. Provides a tool for users to predict new spectra for their molecules, available in MS-Dial software.
>48 169 lipid structures, 26 122 of which were determined experimentally and 22 047 of which were generated computationally. LMSD has links to in-house (500 lipid standards) and external (54 877 MS and MS/MS spectra for 7210 lipids from MassBank of North America) mass spectrometry resources.