A complex matrix characterization approach, applied to cigarette smoke, that integrates multiple analytical methods and compound identification strategies for non‐targeted liquid chromatography with high‐resolution mass spectrometry

Rationale For the characterization of the chemical composition of complex matrices such as tobacco smoke, containing more than 6000 constituents, several analytical approaches have to be combined to increase compound coverage across the chemical space. Furthermore, the identification of unknown molecules requiring the implementation of additional confirmatory tools in the absence of reference standards, such as tandem mass spectrometry spectra comparisons and in silico prediction of mass spectra, is a major bottleneck.
Methods We applied a combination of four chromatographic/ionization techniques (reversed‐phase (RP) – heated electrospray ionization (HESI) in both positive (+) and negative (−) modes, RP – atmospheric pressure chemical ionization (APCI) in positive mode, and hydrophilic interaction liquid chromatography (HILIC) – HESI positive) using a Thermo Q Exactive™ liquid chromatography/high‐resolution accurate mass spectrometry (LC/HRAM‐MS) platform for the analysis of 3R4F‐derived smoke. Compound identification was performed by using mass spectral libraries and in silico predicted fragments from multiple integrated databases.
Results A total of 331 compounds with semi‐quantitative estimates ≥100 ng per cigarette were identified, which were distributed within the known chemical space of tobacco smoke. The integration of multiple LC/HRAM‐MS‐based chromatographic/ionization approaches combined with complementary compound identification strategies was key for maximizing the number of amenable compounds and for strengthening the level of identification confidence. A total of 50 novel compounds were identified as being present in tobacco smoke. In the absence of reference MS2 spectra, in silico MS2 spectra prediction gave a good indication for compound class and was used as an additional confirmatory tool for our integrated non‐targeted screening (NTS) approach.
Conclusions This study presents a powerful chemical characterization approach that has been successfully applied for the identification of novel compounds in cigarette smoke. We believe that this innovative approach has general applicability and a huge potential benefit for the analysis of any complex matrices.


| INTRODUCTION
High-resolution accurate mass spectrometry (HRAM-MS)-based nontargeted screening (NTS) is a key methodology for characterizing the chemical composition of complex matrices. 1 One major part within such a workflow is compound identification that can be achieved by either matching compound features against spectral databases (suspect screening analysis [SSA]) or, without any prior knowledge, by comparing first-order fragmentation (MS/MS) derived information with in silico predicted fragments from compound databases (nontargeted analysis [NTA]). 2 NTS enables the simultaneous identification and semi-quantification of a large number of compounds using an unbiased approach. This approach also allows the performance of accurate mass measurements, tandem experiments to facilitate compound identification (ID), and retrospective targeted screening for compounds of interest. 3,4 Once interfaced with liquid chromatography (LC), it is also able to achieve isomeric separation of constituents and deliver information regarding the physicochemical properties of compounds.
Given the high number and structural diversity of small molecules in complex matrices such as biological specimens, natural products and tobacco smoke, the latter known to contain more than 6000 constituents, 5 a combination of analytical approaches is required to cover the broadest possible range of compound classes within these diverse chemical spaces. [6][7][8] Reversed-phase (RP) chromatography is a universal separation mode that has been employed most commonly in non-targeted LC/MS studies, 9,10 and hydrophilic interaction liquid chromatography (HILIC) has been shown to provide good retention for small and very polar molecules. 11 In addition to these separation modes, heated electrospray ionization (HESI) and atmospheric pressure chemical ionization (APCI), using both positively (+) and negatively (−) charged ionization, have provided complementary information 12,13 depending upon analyte polarity, size, and the presence or absence of heteroatoms and functional groups.
For the purposes of establishing a powerful analytical workflow, there is more to consider than simply a requirement for a set of complementary analytical methods that cover the broadest possible chemical space. The integration of each applied method into a standardized and automated data evaluation process, including structural ID with cheminformatics tools, is key for successful handling of the vast amounts of data produced by NTS approaches 14,15 in a time-efficient manner. Nevertheless, ID of organic molecules by LC/MS remains a major challenge, 16,17 with shortcomings including a lack of commercial mass spectral libraries and the unavailability of any standardized retention time (tR) or retention index (RI) systems. However, HRAM-MS measurements of ionized molecules can be used as a starting point to generate molecular formulae, with consideration for the isotopic pattern 18 and chemical and heuristic rules. 19 In order to enhance the degree of confidence in structural candidates, or to achieve de novo ID, tandem mass spectral (MS2) library searches are of high interest, concomitant with an increasing availability of publicly and commercially available MS2 libraries. 3,20 In the absence of any reference MS2 spectra, computational approaches, 21 including in silico fragmentation, 22,23 have emerged as additional sources of orthogonal information for successful compound ID.
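As an illustration of how an accurate mass measurement narrows down candidate molecular formulae, the following minimal Python sketch computes the theoretical m/z of [M+H]+ ions and the mass error in ppm. It is not the procedure used in this study: the function names, the 5 ppm tolerance, and the measured m/z value are assumptions chosen for illustration, and isotopic pattern and heuristic rule filtering are not included.

```python
# Monoisotopic masses (u) of the most abundant isotopes (standard values).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727647  # proton mass (u), used for the [M+H]+ adduct


def mz_protonated(formula):
    """Theoretical m/z of the singly charged [M+H]+ ion of a neutral formula."""
    neutral = sum(MONOISOTOPIC[element] * count for element, count in formula.items())
    return neutral + PROTON


def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def candidates_within(measured_mz, formulas, tol_ppm=5.0):
    """Names of candidate formulae whose [M+H]+ lies within tol_ppm (assumed tolerance)."""
    return [
        name
        for name, formula in formulas.items()
        if abs(ppm_error(measured_mz, mz_protonated(formula))) <= tol_ppm
    ]


# Two known smoke constituents used as candidate formulae.
CANDIDATES = {
    "nicotine": {"C": 10, "H": 14, "N": 2},
    "cotinine": {"C": 10, "H": 12, "N": 2, "O": 1},
}

# Hypothetical measured m/z, not a value from this study.
print(candidates_within(163.1233, CANDIDATES))
```

In practice, several formulae often fall within the mass tolerance, which is why isotopic pattern and MS2 information are needed as further filters.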
The primary aim of this work was to establish and provide a detailed evaluation of a comprehensive LC/HRAM-MS-based NTS strategy for the chemical characterization of tobacco smoke, which combined multiple complementary separation and ionization modes.
Evaluation of the output from these complementary analytical approaches was performed using a streamlined semi-automated data processing and compound ID workflow aimed at improving both specificity and confidence in the annotation of small molecules even in the absence of reference standards, and further enabling the discovery of novel compounds.

| Sample generation and preparation
Mainstream whole smoke derived from 3R4F cigarettes was generated according to the Health Canada Intense (HC) smoking regime 25 using a linear smoking machine (puff volume, 55 mL; duration, 2 s; puff interval, 30 s). Trapping of the particulate phase (total particulate matter; TPM) was performed using a 44 mm Cambridge glass fiber filter pad (CFP). The gas/vapor phase fraction of whole smoke was trapped using two consecutive microimpingers placed behind the CFP, each filled with 10 mL of extraction solution maintained at approximately −60°C using a dry ice/isopropanol mixture (Figure 1). The total mass of material trapped by the CFP is also referred to as the TPM, which was determined as the weight difference of the CFP before and after the smoke generation process. In this publication we have focused on the analysis of the particulate phase, since the majority of compounds of 3R4F-derived smoke amenable for analysis are present in this fraction, and the current manuscript is intended to focus on the developed methodology rather than a full characterization of all available smoke fractions. After TPM collection, the filter pad was crushed and extracted using two consecutive steps with either

| Instrumentation
LC/HRAM-MS analysis using full scan and data-dependent first-order fragmentation (MS2) modes with high-energy collision-induced dissociation (HCD) and stepped normalized collision energy (NCE) was performed using a Q Exactive™ Hybrid Quadrupole Orbitrap mass spectrometer (Thermo Fisher Scientific, Bremen, Germany). The calibration options provided within the Thermo Scientific software were used.

| Data processing
Combined full-scan and data-dependent fragmentation data were processed using Progenesis QI™ software (Nonlinear Dynamics, Newcastle upon Tyne, UK), comprising raw data import, alignment, feature extraction, deconvolution, normalization with ISTDs, Scientific) software. The structural proposals for each compound in the curated list were further reviewed in Progenesis QI™, and the most likely candidate structure was assigned in consideration of peak abundance, m/z (mass-to-charge ratio), detected adducts, molecular formula, overall score for mass/tR deviations and isotope similarities, and fragmentation score (FS). A list of the best structural proposals for the extracted compounds was exported as a csv file.

| Semi-quantification
Excel was used for the calculation of semi-quantitative levels and relative standard deviations (RSDs), which were based upon
The Venn diagram presented in Figure 3B and the full list of 3R4F-derived particulate phase smoke constituents presented in Table S1 (supporting information) clearly demonstrate the complementary characteristics of all applied analytical methods, which contributed, in varying degrees, to the identification of 331 major constituents.
Among the other chromatographic/ionization approaches, a total of 147 compounds were identified using RP-APCI(+), comprising 62 method-unique compounds and 81 compounds overlapping with RP-HESI(+). Given that the same analytical column and solvents were used for both RP methods with positive ionization, these numbers indicate a high degree of complementarity as well as many differences arising from the varying susceptibility of the HESI(+) and APCI(+) ionization mechanisms to matrix effects, as has been reported previously. 29,30 Such an overlap between RP-LC/HESI(+) and RP-LC/APCI(+) was desirable in order to minimize the possibility of analytical gaps in the chemical space amenable to LC/HRAM-MS.
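The counts of method-unique and overlapping compounds reported above are the kind of figures that underlie a Venn diagram, and they can be derived with plain set operations. The sketch below assumes that each method yields a set of compound identifiers; the function name and the toy identifiers are hypothetical and do not reproduce the study data.

```python
def overlap_summary(ids_by_method, method):
    """Count total, method-unique, and pairwise shared compound IDs for one method."""
    own = ids_by_method[method]
    # Union of the compounds identified by every other method.
    others = set().union(
        *(ids for name, ids in ids_by_method.items() if name != method)
    )
    summary = {"total": len(own), "unique": len(own - others)}
    for name, ids in ids_by_method.items():
        if name != method:
            summary["shared_with_" + name] = len(own & ids)
    return summary


# Toy identifiers for illustration only (not compounds from Table S1).
toy = {
    "RP-HESI(+)": {"a", "b", "c"},
    "RP-APCI(+)": {"b", "c", "d"},
    "HILIC-HESI(+)": {"c", "e"},
}
print(overlap_summary(toy, "RP-APCI(+)"))
```

Identifiers should be structure-based (for example, a database ID per compound) rather than feature-based, since the same compound elutes at different retention times in different methods.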

| Compound identification strategy
Integration of these analytical approaches with a semi-automated stepwise data processing workflow was achieved using Progenesis QI™ software, querying fragmentation information from multiple sources/databases, which has been shown to be preferable to using fewer or single resources. 20 High-quality experimental MS2 mass spectra were obtained by applying both HCD and collision-induced dissociation ion activation modes, which, amongst other MS parameters, can critically affect the matching of mass spectra. 3 Parallel usage of both ion activation modes was found to be a complementary approach that increased compound ID rates. 32
A limitation of in silico fragmentation for the differentiation of structural isomers is illustrated by scopoletin (7-hydroxy-6-methoxycoumarin, see Figure 4), whose identity was confirmed with a reference standard. Nevertheless, our overall workflow of combined ID strategies enabled differentiation between structural isomers based upon the recorded first-order fragmentation spectrum, provided that the fragmentation patterns differed. As shown in Figure 4A Figure 4D). The assigned fragments matched the structural features for scopoletin with a good fit (FS=41.7). However, higher-ranked hits that corresponded to other hydroxymethoxycoumarin isomers were found, including those proposed by the in silico approach (CSID4589551, CSID4475385, CSID4678041, Figure 4), all of which exhibited near-identical fragmentation and could not be distinguished based on scoring. In a second example, shown in Figure S2 (supporting information), cotinine (C10H12N2O) could be distinguished from N-formylnornicotine (C10H12N2O) due to the higher spectral match score with the first-order fragmentation spectrum of cotinine contained in UCSD.
In both examples, tR information strengthened the confidence for correct annotation of the isomeric compounds. The two pairs were baseline-separated in RP mode owing to the chromatographic resolution achieved with columns packed with sub-2-μm particles, which in this study also contributed to the successful ID of other isomeric pairs/groups that had identical or similar MS2 spectra.
To determine the reliability of our workflow, including sample generation and preparation as well as data acquisition and processing, a total of N=15 observations were made, derived from three sample replicates and five injections per replicate.

| Chemical constituents of 3R4F TPM
RSD values for identified compounds ranged between 1% and 12%, demonstrating good analytical performance as well as reliable compound extraction, alignment, and integration achieved by Progenesis QI™.
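For reference, the RSD of a compound across the N=15 observations (three sample replicates with five injections each) can be computed as in the following minimal Python sketch. The use of the sample standard deviation (n − 1 denominator) is an assumption, as the exact spreadsheet formula is not stated here, and the abundance values are invented for illustration.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent (sample standard deviation, n - 1)."""
    return stdev(values) / mean(values) * 100.0


def pooled_observations(replicates):
    """Flatten per-replicate injection lists into a single list of observations."""
    return [value for injections in replicates for value in injections]


# Hypothetical normalized abundances: 3 replicates x 5 injections each.
replicates = [
    [101.0, 99.0, 100.0, 102.0, 98.0],
    [97.0, 103.0, 100.0, 99.0, 101.0],
    [100.0, 100.0, 104.0, 96.0, 100.0],
]
observations = pooled_observations(replicates)
print(len(observations), round(rsd_percent(observations), 2))
```

Pooling all injections, as done here, mixes injection and sample preparation variability into a single figure; separating the two would require a nested analysis.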
The overall ID score was calculated between 0 and 80 as a combination of FS, tR, accurate mass match, and isotopic similarity, with each parameter equally weighted. 37 Higher scores up to 100 could not be achieved with our approach due to the absence of collision cross section information from ion mobility experiments, which were not performed. An isotope pattern filter has proven beneficial to further reduce the number of structural proposals for a given empirical formula in cases where high mass accuracy alone was insufficient for compound ID when querying elemental composition. 18,19,38,39 For evaluating whether isotope similarity values were negatively impacted by low signal intensity, as has already been reported in the literature, 38
Confidence for compound ID was assigned as "high" if the overall score was above 50 or between 45 and 50 in combination with a FS above 45, which indicated either correct compound ID or a similar structural isomer. Scores not matching these criteria were classified as "medium" confidence IDs, which typically point to at least the correct compound class. Figure 5A depicts the relative distributions of confirmed, high, and medium IDs for the four different analytical methods, which were independent of compound concentration as shown in the left panel of Figure S3 (supporting information). Details for the individual compounds are indicated by color code in Table 1 and Table S1 (supporting information) Table 1 and representation of the chemical space for tobacco smoke. Figure 5B shows the distribution of chemical constituents measured by LC/HRAM-MS (red dots) within the chemical space comprising these 4141 tobacco and/or smoke constituents (light grey dots). The 331
Figure 5. Identified chemical components in TPM derived from a 3R4F reference cigarette. A, Relative distribution of confidence levels for identification of compounds using the four analytical methods. Confidence levels are specified in Table 1
Querying multiple databases in SSA and NTA was key for optimizing not only confidence levels, but also absolute numbers of identified compounds. According to Vinaixa et al, 20 there is a relatively low overlap of compounds with MSn (n≥2) spectra in existing spectral databases, which explains why most users currently search multiple databases. 15 NIST MS/MS searches yielded 11 unique compound IDs in our study, whereas UCSD MS2 comparisons confirmed 126 chemical constituents in 3R4F-derived smoke (Figure 5C; "ID Basis" column in Table 1 and Table S1 (supporting information)). A total of 50 constituents identified as being present in tobacco smoke were not listed in UCSD or NIST 14 MS/MS libraries (namely compounds without a PMI code as identifier in Table S1, supporting information).
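The confidence assignment rule described earlier in this section (overall score above 50, or between 45 and 50 together with an FS above 45) can be written as a small decision function. The sketch below is illustrative only: the function name is hypothetical, treating the 45 and 50 boundaries as inclusive is an assumption, and the "confirmed" flag simply stands for verification with a reference standard.

```python
def id_confidence(overall_score, fragmentation_score, confirmed_by_standard=False):
    """Assign an ID confidence level from the overall score (0-80) and the FS."""
    if confirmed_by_standard:
        return "confirmed"
    if overall_score > 50:
        return "high"
    # Assumption: the 45-50 window is treated as inclusive at both ends.
    if 45 <= overall_score <= 50 and fragmentation_score > 45:
        return "high"
    return "medium"


# Hypothetical score pairs, not values taken from Table 1 or Table S1.
for overall, fs in [(55.0, 10.0), (47.0, 60.0), (47.0, 40.0), (30.0, 80.0)]:
    print(overall, fs, id_confidence(overall, fs))
```

Keeping the rule in a single function makes it straightforward to apply consistently to every row of an exported compound list.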