Molecular Origin of Blood‐Based Infrared Spectroscopic Fingerprints

Abstract. Infrared spectroscopy of liquid biopsies is a time‐ and cost‐effective approach that may advance biomedical diagnostics. However, the molecular nature of disease‐related changes in infrared molecular fingerprints (IMFs) remains poorly understood, impeding the method's applicability. Here we probe 148 human blood sera and reveal the origin of the variations in their IMFs. To that end, we supplemented infrared spectroscopy with biochemical fractionation and proteomic profiling, providing molecular information about serum composition. Using lung cancer as an example of a medical condition, we demonstrate that the disease‐related differences in IMFs are dominated by contributions from twelve highly abundant proteins that, if used as a pattern, may be instrumental for detecting malignancy. Tying proteomic to spectral information and machine learning advances our understanding of the infrared spectra of liquid biopsies, a framework that could be applied to probing any disease.

An overview of the most abundant proteins in human blood serum. The first three columns list the protein names (ordered by average concentration), their typical levels, and the average LFQ values measured for the healthy individuals. As expected, the LFQ values are not proportional to the levels in mg/dL.

Chemicals and reagents.
HPLC-grade methanol and ethanol, sodium chloride, and proteins of the highest available purity were purchased from Sigma Aldrich GmbH (Taufkirchen, Germany). Proteins purchased as powder were dissolved in 20 mM PBS buffer (Sigma Aldrich GmbH). If traces of additional salts were present in the protein solution, the buffer was exchanged to 20 mM PBS using Nanosep Omega centrifugal filters with a 3 kDa cutoff (VWR, Germany).

Clinical study participants.
We performed a clinical study on lung cancer, including subjects with benign conditions and non-symptomatic healthy volunteers as references. The following clinical centers were involved in subject recruitment and sample collection: Department of Internal Medicine V for Pneumology, Ludwig-Maximillian-University (LMU) of Munich; Urology Clinic, LMU; Department of Obstetrics and Gynaecology, LMU; Breast Cancer and Comprehensive Cancer Centre Munich (CCLMU), LMU; Asklepios clinic, Gauting; Comprehensive Pneumology Centre (CPC), Munich, all located in Germany. All participants signed a written informed consent form for the study under research Study Protocol # 17-182 or # 17-141, both approved by the Ethics Committee of the Ludwig-Maximillian-University (LMU) of Munich and performed in compliance with all relevant ethical regulations. Analyses focus on subjects with clinically confirmed carcinoma of the lung at TNM clinical stages II and III, with no metastases, prior to any cancer-related therapy, and without any other cancer occurrence. Healthy references were non-symptomatic individuals without any history of cancer, who were neither suffering from any cancer-related disease nor under any medical treatment. Lung cancer cases were compared to matched individuals from the following groups: chronic obstructive pulmonary disease (COPD), pulmonary hamartoma, and non-symptomatic healthy individuals, matched for gender, age and smoking status. A full breakdown of all participants is listed in SI Appendix Table S2.

Blood sample collection and preparation
Blood samples were collected, processed and stored using previously defined standard operating procedures. All blood draws were performed using 21G Safety-Multifly needles (Sarstedt AG & Co KG, Germany) into 4.9 ml or 7.5 ml serum tubes, centrifuged at 2,000 g for 10 minutes at 20 °C, aliquoted and frozen at -80 °C within 5 hours of sampling. All samples were thawed, further aliquoted for measurement and re-frozen at -80 °C to ensure a constant number of freeze-thaw cycles before analysis. Before any measurement, the aliquots were thawed at room temperature, shaken for 20 seconds, and spun down again.

Fourier-transform infrared spectroscopy measurements
Measurements of biofluids, their fractions and single proteins were all performed in a hydrated, fluid state using an automated FTIR device, the MIRA-Analyzer (Micro-biolytics GmbH, Germany), with a flow-through transmission cuvette (CaF2 with 8 μm path length), as demonstrated previously [8]. The spectra were acquired with a resolution of 4 cm⁻¹ and an averaging time of 45 s. After each sample exchange, a water reference spectrum was measured to reconstruct the infrared absorption spectra. Samples were measured in a random order to avoid systematic effects during data evaluation. To measure and track experimental errors during the measurement campaign, quality control samples from pooled human serum (BioWest, Nuaillé, France) were measured after every 5 samples.
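The randomized measurement order with interleaved quality controls can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name `build_run_order`, the sample identifiers, the `"QC"` label and the fixed seed are all hypothetical.

```python
import random

def build_run_order(sample_ids, qc_label="QC", qc_every=5, seed=0):
    """Shuffle the samples and insert one QC measurement after every `qc_every` samples."""
    rng = random.Random(seed)      # fixed seed only to make the sketch reproducible
    order = list(sample_ids)
    rng.shuffle(order)             # random order to avoid systematic effects
    run = []
    for position, sample_id in enumerate(order, start=1):
        run.append(sample_id)
        if position % qc_every == 0:
            run.append(qc_label)   # pooled-serum quality control
    return run

# Hypothetical identifiers for twelve samples.
samples = [f"S{i:03d}" for i in range(1, 13)]
run = build_run_order(samples)
print(run)
```

Tracking the spectra of the interleaved QC entries over the campaign then gives a direct handle on instrument drift.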
The pre-processing of the experimental spectra was performed using home-built software and relies on previous work [9,10]. The spectra were obtained in the range from 930 cm⁻¹ to 3050 cm⁻¹ and truncated to 1000-3000 cm⁻¹. Baseline correction was introduced to account for the water displaced by the blood constituents in the sample compared to the pure-water reference. In particular, a water absorption spectrum was added to the sample spectrum with a coefficient optimized such that the first derivative of the signal at 1800-2200 cm⁻¹ (2200-2400 cm⁻¹ for the metabolite fraction) is minimal [10]. Subsequently, the minimum of the absorption in this region was subtracted from the spectrum ("offset correction"). Finally, vector normalization was used to reduce experimental noise [11].
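The pre-processing chain described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the home-built software: the function name `preprocess` is hypothetical, the "minimal first derivative" criterion is assumed here to mean a minimal sum of squared first derivatives (which gives a closed-form coefficient), vector normalization is taken as scaling to unit Euclidean norm, and the Gaussian bands are synthetic stand-ins for real spectra.

```python
import numpy as np

def preprocess(wavenumbers, absorbance, water_spectrum,
               keep=(1000.0, 3000.0), flat=(1800.0, 2200.0)):
    """Truncate, water-correct, offset-correct and vector-normalize one spectrum."""
    wn = np.asarray(wavenumbers, dtype=float)
    a = np.asarray(absorbance, dtype=float)
    w = np.asarray(water_spectrum, dtype=float)

    # 1) Truncate to the analysed range.
    m = (wn >= keep[0]) & (wn <= keep[1])
    wn, a, w = wn[m], a[m], w[m]

    # 2) Water correction: pick c so that d/dv (a + c*w) has the smallest
    #    sum of squares in the flat region (assumed criterion, least squares).
    f = (wn >= flat[0]) & (wn <= flat[1])
    da = np.gradient(a[f], wn[f])
    dw = np.gradient(w[f], wn[f])
    c = -np.dot(da, dw) / np.dot(dw, dw)
    corrected = a + c * w

    # 3) Offset correction: subtract the minimum found in the same region.
    corrected = corrected - corrected[f].min()

    # 4) Vector normalization: scale to unit Euclidean norm.
    return wn, corrected / np.linalg.norm(corrected), c

# Synthetic example: one sample band minus a known share of a water band.
wn = np.arange(930.0, 3051.0, 2.0)
water = np.exp(-((wn - 2100.0) / 150.0) ** 2)
band = np.exp(-((wn - 1650.0) / 30.0) ** 2)
raw = band - 0.3 * water + 0.01
wn_out, spectrum, coeff = preprocess(wn, raw, water)
print(round(coeff, 3))
```

For the metabolite fraction, the same function would be called with `flat=(2200.0, 2400.0)`.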
The absorption spectra of single proteins were measured at concentrations of 1 to 5 mg/mL, depending on sample availability. The spectrum of the PBS buffer was subsequently measured and subtracted from the protein spectrum prior to further pre-processing. When the buffer was exchanged using centrifugal filters, which implies sample loss, the resulting protein concentration was determined using a BCA protein assay.

UPLC-MS proteomics measurements
Sample preparation was carried out according to our Plasma Proteome Profiling pipeline [12,13], employing an automated setup on an Agilent Bravo Liquid Handling Platform. In brief, 1 μl of each plasma sample or input from the HSA depletion method was aliquoted into 24 μl of lysis buffer (P.O. 00001, PreOmics GmbH) in a 96-well plate (Eppendorf twin.tec PCR LoBind). Reduction of disulfide bridges, cysteine alkylation, and protein denaturation were performed at 95°C for 10 min [14]. Trypsin and LysC were added to the mixture after a 5-min cooling step at room temperature, at a ratio of 1:100 micrograms of enzyme to micrograms of protein.
Digestion was performed at 37°C for 1 h. An amount of 0.5 μg of peptides was loaded onto Evotips (Evosep, Odense, Denmark) following the manufacturer's protocol. Samples were measured using LC-MS instrumentation consisting of an Evosep One (Evosep, Odense, Denmark) [15], which was coupled to a Q Exactive HF-X Orbitrap (Thermo Fisher Scientific) using a nano-electrospray ion source (Thermo Fisher Scientific). Purified peptides were separated on 15-cm HPLC columns [ID: 150 μm; in-house packed into the tip with ReproSil-Pur C18-AQ 1.9 μm resin (Dr. Maisch GmbH)]. For each LC-MS/MS analysis, about 0.5 μg of peptides was used for a 21-min run. Column temperature was kept at 60°C by an in-house-developed oven containing a Peltier element, and parameters were monitored in real time by the SprayQC software [16]. MS data were acquired with a Top12 data-dependent MS/MS scan method. Target values for the full-scan MS spectra were 3 × 10⁶.
MS raw files were analyzed by MaxQuant software, version 1.6.3.3 [17], and peptide lists were searched against the human UniProt FASTA database. A contaminant database generated by the Andromeda search engine [18] was configured with cysteine carbamidomethylation as a fixed modification and N-terminal acetylation and methionine oxidation as variable modifications. We set the false discovery rate (FDR) to 0.01 for protein and peptide levels with a minimum length of 7 amino acids for peptides, and the FDR was determined by searching a reverse database. Enzyme specificity was set as C-terminal to arginine and lysine, as expected when using trypsin and LysC as proteases. A maximum of two missed cleavages was allowed. Peptide identification was performed with an initial precursor mass deviation of up to 7 ppm and a fragment mass deviation of 20 ppm. All proteins and peptides matching the reversed database were filtered out. Label-free protein quantitation (LFQ) was performed with a minimum ratio count of 2 [19].
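To make the search settings concrete, the sketch below enumerates the peptides implied by cleavage C-terminal to arginine and lysine with up to two missed cleavages and a minimum length of 7 residues, and checks a mass against a ppm tolerance. It is a simplified illustration and not MaxQuant or Andromeda code: the function names `digest` and `within_ppm`, the toy sequence and the example masses are hypothetical, and no proline rule or modification handling is included.

```python
def digest(sequence, max_missed=2, min_length=7):
    """Return sorted peptides from cleavage after every K or R (simplified rule)."""
    # Cut the sequence into fully cleaved pieces, keeping K/R at the C-terminus.
    pieces, start = [], 0
    for index, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:index + 1])
            start = index + 1
    if start < len(sequence):
        pieces.append(sequence[start:])

    # Join up to `max_missed` + 1 neighbouring pieces to model missed cleavages.
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return sorted(peptides)

def within_ppm(observed_mass, theoretical_mass, tolerance_ppm):
    """True if the relative mass deviation is within the tolerance in ppm."""
    deviation = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return deviation <= tolerance_ppm

toy_sequence = "GASPVTLKAEFRMNQWYHGKDE"  # hypothetical sequence, not a real protein
peptides = digest(toy_sequence)
print(peptides)
print(within_ppm(1000.005, 1000.0, 7))   # 5 ppm deviation against a 7 ppm window
```

Raising `max_missed` enlarges the search space, which is one reason such searches cap the number of missed cleavages.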

Fractionation of blood serum
All samples and reagents were kept at 4 °C during the process. The samples were processed in batches of 24, including 4 quality control samples per batch (see above). A two-step fractionation of the biofluids was performed. The goal of the first step is to separate most of the proteins from the human serum albumin (HSA), following a previously proposed procedure [20]. To that end, 0.1 M NaCl and 42% of ethanol were added to the samples. They were vortexed for 1 hour, then centrifuged for 20 minutes at 16000 rcf, so that a pellet containing most of the proteins (HSA-depleted fraction) was formed. The supernatant, which contains HSA and metabolites, was transferred to a new tube, while the pellet was re-dissolved in water via vortexing for 1.5 hours. A small pellet was left in the tube. We have shown that if the HSA-depleted protein pellet is vortexed for longer, the left-over pellet is reduced, but the FTIR spectra of the supernatant are identical to those obtained after 1.5 hours. The HSA-depleted fraction was transferred to a new tube and placed into the vacuum concentrator (Concentrator plus, Eppendorf GmbH, Germany) for 3 hours. To avoid clogging of the automated measurement system, the HSA-depleted fraction was centrifuged for 15 minutes at 15000 rcf and frozen at -80 °C until further use.
To separate the HSA-enriched protein fraction from metabolites, we added 59% of pre-cooled methanol, vortexed the samples for 1 minute and centrifuged them for 15 minutes at 15000 rcf, so that a pellet containing HSA and other proteins was formed [21]. The supernatant was transferred to a new tube and fully dried in the concentrator within 3 hours. The metabolites were then re-dissolved in water via vortexing for 2 minutes and frozen at -80 °C until further use.
The HSA-enriched pellet was fully re-dissolved in water via vortexing for 2 minutes and placed into the vacuum concentrator for 3 hours. The resulting HSA-enriched protein fraction was frozen at -80 °C until further use. The total time required to process a sample batch was 8 hours.

Classification models
The data analysis was performed using the Scikit-Learn [22] (v. 0.23.2) module in Python (v. 3.7.6). We trained classification models based on the linear support vector machine (SVM) algorithm as implemented in the LinearSVC class with default parameters. We evaluated the models using stratified 10-fold cross-validation, repeated 10 times with a different randomization in each repetition. To visualize model performance, we use the receiver operating characteristic (ROC) curve. As an overall metric for model performance, we use the area under the ROC curve (AUC). For the evaluation of the mass-spectrometry data, we ranked individual proteins using forward feature selection based on the SVM classification performance.
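The evaluation scheme can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the analysis code of the study: the data are synthetic stand-ins generated with `make_classification`, the helper names `mean_cv_auc` and `forward_selection` are hypothetical, `random_state` is fixed only for reproducibility, and the selection is assumed to be a greedy procedure that adds, at each step, the feature maximizing the mean cross-validated AUC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a samples-by-features matrix (hypothetical data).
X, y = make_classification(n_samples=148, n_features=8, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=0)

# Stratified 10-fold cross-validation, repeated 10 times with new shuffles.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

def mean_cv_auc(columns):
    """Mean ROC AUC of a linear SVM restricted to the given feature columns."""
    model = LinearSVC(random_state=0)  # otherwise default parameters
    scores = cross_val_score(model, X[:, columns], y, cv=cv, scoring="roc_auc")
    return float(np.mean(scores))

def forward_selection(n_select):
    """Greedy forward selection: add the feature that gives the best AUC."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    for _ in range(n_select):
        scores = {f: mean_cv_auc(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append(scores[best])
    return selected, history

full_auc = mean_cv_auc(list(range(X.shape[1])))
selected, history = forward_selection(3)
print("AUC with all features:", round(full_auc, 3))
print("selected feature order:", selected)
```

The order in which features enter `selected` provides the ranking, and `history` shows how much each added feature contributes to the AUC.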