An optimized protocol for large‐scale in situ sampling and analysis of volatile organic compounds

Received: 27 October 2017 | Revised: 15 March 2018 | Accepted: 29 March 2018 | DOI: 10.1002/ece3.4138

Chemical communication is the oldest and most ubiquitous mode of communication in the living world (Wyatt, 2003). From chemotaxis in bacteria (Adler, 1975) to foraging trails in ants (Billen & Morgan, 1998), examples of organisms using chemical cues to find resources (Leonhardt, Menzel, Nehring, & Schmitt, 2016; Wilson, 1965) or conspecifics (Brennan & Kendrick, 2006; Dweck et al., 2015), or to avoid danger (Holopainen & Blande, 2012; Mathis & Smith, 1993), can be found across the tree of life. The sheer diversity and abundance of chemicals in the environment provide an opportunity for organisms to utilize them for survival and reproduction. Organisms often employ multiple chemical compounds as blends in specific ratios, from two to six compounds in moth pheromones (Roelofs, 1995) to mixtures of several compounds for individual recognition in mammals (Brennan & Kendrick, 2006). These chemicals are also released into a complex natural environment with thousands of chemicals emanating from every microbe, plant, and animal. This cacophony of information can make the collection, separation, and interpretation of chemical signals a daunting task, particularly for unknown analytes embedded in a natural matrix. While organisms such as plants and insects have been extensively studied and constitute a large proportion of our existing knowledge of chemical cues, systems such as mammals (Burger, 2005) and marine organisms (Hay, 2014) have been relatively less frequently explored and are largely restricted to zoo or captive individuals, due to difficulty in large-scale sampling under natural conditions. Large-scale sampling and analysis of infochemicals have become particularly relevant in recent years due to a remarkable shift in chemical ecology from an individual-centric (Vet, 1999) to a more population- and community-centric approach (Dicke, 2006).
Chemical ecologists are now interested not only in the chemical cues produced by organisms and the behavioral responses elicited, but also in the individual variations in these infochemicals and responses (Vet, 1999), their impact on populations and interspecies interactions (Hay, 2014), and the role of chemical communication in shaping communities and ecosystems (Dicke, 2006). Nevertheless, large-scale studies of complex chemical matrices pose a considerable challenge for sampling, analysis, and interpretation of volatiles.
In this study, we have attempted to address the major concerns associated with large-scale in situ sampling and analysis of volatile organic compounds (VOCs) and have developed an optimized protocol for such studies.
A wide variety of sampling techniques is available to extract and retain the volatile profile of a biological sample (Millar & Haynes, 1998). Large-scale volatile sampling requires a good balance of sensitivity, chemical retention and preservation, and sampling comprehensiveness. Biological samples release relatively small amounts of volatiles, which diffuse into a large volume of air, so sampling techniques must be highly sensitive to capture them.
Often, large-scale sampling requires collection over a long duration to obtain the required sample size, necessitating the storage of samples for lengthy periods before analysis and increasing the risk of microbial contamination and loss or degradation of the sample (e.g., Birkemeyer et al., 2016). In addition, the complete volatile profile of a biological sample is often difficult to obtain due to differences in adsorption/absorption efficiencies of the volatile constituents.
Different sampling techniques might thus capture different subsets of the same original volatile composition. When studying a novel organism with a large repertoire of unknown volatiles, it is therefore important to choose the most comprehensive sampling technique. In situ techniques can eliminate the need to collect and store samples but must also be optimized for sensitivity and comprehensiveness.
Identification is a major rate-limiting step in large-scale analysis of semiochemicals. Analysis of VOCs is commonly performed using gas chromatography for separation (Dewulf, Van Langenhove, & Wittmann, 2002) and mass spectrometry for identification and quantification. The mass spectrum of each extracted volatile is visually compared to reference spectra in mass spectral libraries to initiate the process of identification. This is a highly laborious process and becomes especially difficult when sample sizes are large and samples are rich in volatiles, such as studies that look at individ- [...] Skogerson, Wohlgemuth, Barupal, & Fiehn, 2011) are often ill-equipped for natural product identification, particularly rare compounds.
Finally, statistical analysis of volatiles should be relevant to the study system as well as adhere to the nature of volatile data.
Typically, volatile profiles of samples belonging to two or more distinct sample sets are compared using clustering or classification tools to identify distinguishing volatiles from each set. However, classifying a study system into explicit sets may not always be possible and/or justifiable. Researchers may, in some cases, be interested in the variation in volatile composition across the range of a particular factor or may be unsure about the classification categories. In such scenarios, regression and clustering analysis can be used, respectively, for functional inferences. A major limiting factor for statistical analysis of VOCs is high dimensionality, that is, a relatively larger number of variables/volatiles (p) than samples (n) (Johnstone & Titterington, 2009). Another important aspect of the analysis of volatiles, which is often ignored, is that volatile data are represented as relative proportions (instead of absolute concentrations) of extracted volatiles.
Such "compositional" data require transformations rendering them suitable for standard statistical tools (Aitchison, 1982) or tools that conform to the nonindependent nature of such data (Ranganathan & Borges, 2011).
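As an illustration of such a transformation, the centered log-ratio (CLR), one of the log-ratio transformations from Aitchison's compositional framework, can be sketched in a few lines of Python. This is a minimal sketch and not necessarily the transformation applied in this study; the function name `clr`, the pseudocount used to handle zero peak areas, and the example values are assumptions made for illustration.

```python
import math

def clr(peak_areas, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's peak areas.

    The pseudocount is an assumed way of handling undetected volatiles
    (zero areas); other zero-replacement strategies exist.
    """
    # Add a small pseudocount to every value so that zeros have a defined log.
    shifted = [area + pseudocount for area in peak_areas]
    # Close the data to proportions that sum to 1.
    total = sum(shifted)
    proportions = [value / total for value in shifted]
    # Subtract the mean log-proportion (the log of the geometric mean).
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical peak areas for four volatiles in one sample.
sample = [120.0, 30.0, 0.0, 850.0]
transformed = clr(sample)
```

The transformed values sum to zero within each sample and preserve the rank order of the original peak areas, which is what allows standard multivariate tools to be applied afterwards.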
We have addressed concerns particular to large-scale field studies of volatiles at each stage: sampling, analysis, and interpretation of VOCs. In an attempt to alleviate as many of these concerns as possible, we have developed a pipeline for large-scale in situ studies of VOCs. To develop our pipeline, as a case study, we have explored the role of chemical communication in the lek mating behavior of an antelope, the Indian blackbuck, Antilope cervicapra.

| Case study species
The blackbuck is an antelope endemic to the Indian subcontinent.
It is a near-threatened species, primarily found in grasslands and open woodlands in India (IUCN). Blackbuck are known to breed throughout the year with two annual mating peaks, March-April and August-October (Ranjitsinh, 1989). They have a wide variety of mating systems, including the rare lek mating, in which males aggregate and display to females on small, clustered, resource-less territories (Isvaran, 2003). There is a strong spatial skew in the mating success of males in a blackbuck lek (aggregation of males), with 90% of the matings occurring in central territories (Isvaran & Jhala, 1999). All territories are repeatedly marked by males with dung and urine that accumulate to form dung piles (Figure 1a). The dung piles are periodically evaluated by potential mates as well as competitors, possibly for olfactory cues about the age (as in white rhino, Marneweck, Jürgens, & Shrader, 2017), strength, virility, genetic compatibility, or identity (e.g., MHC and MUPs in mice, Cotton, 2007) of the defecator. A study on captive blackbuck (Rajagopal, Archunan, Geraldine, & Balasundaram, 2010) highlights variation in the volatile profile of male urine corresponding to dominance hierarchy. Similarly, in a lek, olfactory cues from dung piles are likely contributors to spatial variation in the mating success of males and, in that case, are hypothesized to vary spatially in correspondence with mating success.

| Case study site
Volatile sampling for the study was conducted in Tal Chhapar wildlife sanctuary [...] hosting a single lek occupied by 150 (±20) males during the mating peak. Before the onset of the study, all principal (large) dung piles on the lek were marked using GPS (Figure 1b) and the mean of their latitudes and longitudes was designated as the lek center (Isvaran & Jhala, 1999). All territories within 65 m of the lek center were defined as central territories (C), territories on the edge of the lek (120-250 m from the lek center) were defined as peripheral territories (P), and the rest (between 65 and 120 m) were defined as middle territories (M). To determine spatial variation in volatile profiles of dung piles, 13 territories from each zone were selected for volatile sampling (Figure 1b).
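The zone assignment described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the coordinates are hypothetical, the distance is computed with an equirectangular approximation (an assumption that is adequate over a few hundred metres), and the treatment of territories lying exactly at 65 m or 120 m is also an assumption, since the text does not specify it.

```python
import math

EARTH_RADIUS_M = 6371000.0

def lek_center(piles):
    """Mean latitude and longitude of the principal dung piles."""
    lats = [lat for lat, _ in piles]
    lons = [lon for _, lon in piles]
    return sum(lats) / len(lats), sum(lons) / len(lons)

def distance_m(p, q):
    """Approximate ground distance in metres between two (lat, lon) points."""
    lat1, lon1 = (math.radians(v) for v in p)
    lat2, lon2 = (math.radians(v) for v in q)
    # Equirectangular approximation: scale longitude by cos(mean latitude).
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return EARTH_RADIUS_M * math.hypot(x, y)

def zone(distance):
    """Map distance from the lek center to a zone label (C, M, or P)."""
    if distance <= 65.0:      # central territories (boundary handling assumed)
        return "C"
    if distance < 120.0:      # middle territories
        return "M"
    return "P"                # peripheral territories (120-250 m in the study)

# Hypothetical GPS fixes for three principal dung piles.
principal_piles = [(27.7980, 74.4340), (27.7984, 74.4346), (27.7976, 74.4343)]
center = lek_center(principal_piles)
zones = [zone(distance_m(center, pile)) for pile in principal_piles]
```

The same `distance_m` value can serve later as the continuous distance-from-center variable in a regression analysis.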

| Sample collection
Freshly defecated fecal pellets (10-15 pellets) were collected, whenever available, from each of the 39 selected territories in sterilized glass vials using sterilized gloves. Owing to dry (15% humidity), hot [...] were stored at −20°C until analysis to minimize microbial growth and loss of volatiles.

| Volatile extraction
Volatiles were extracted from fecal samples by three different techniques: solvent extraction, solid-phase extraction, and thermal desorption. Their relative yields (number of volatiles and contaminants) were assessed using gas chromatography-mass spectrometry (GC-MS).

| Solvent extraction
Solvent extraction was performed using the two most widely used solvents for volatile extraction (Millar & Haynes, 1998): hexane (nonpolar) and dichloromethane (DCM, midpolar). Three different exposure times (3, 6, and 9 hr) were used to extract volatiles from five samples collected in March 2015. Three to four blackbuck pellets from each sample collection vial were ground together to obtain six replicates of 2 g each, and each replicate was immersed in approximately 5 ml of solvent, one for each combination of solvent and exposure time. The extracts were filtered using Whatman filter paper. Traces of water in the filtrate were removed using anhydrous sodium sulfate, and the filtrate was concentrated by evaporating the solvent under a slow stream of ultra-high-purity nitrogen gas. The concentrate was directly subjected to GC-MS analysis.

| Solid-phase extraction
Solid-phase extraction (SPE) was performed using preconditioned polydimethylsiloxane (PDMS) tubes procured from Carl Roth (Rotilabo®-silicone tube). PDMS tubes of 1.5 mm inner diameter and 3.5 mm outer diameter were cut into 5 mm pieces and soaked for 4 hr in a 1:1 mixture of acetonitrile and methanol. They were then dried using ultra-high-purity nitrogen gas and conditioned in a Gerstel Tube Conditioner by heating under a stream of nitrogen gas at 4 bar constant pressure. The entire process was repeated twice before the tubes were used for extraction. For the extraction, two PDMS tubes were exposed to the fecal pellets collected in each glass vial for 4 hr.
On each day of sampling, an empty glass vial was used as an environmental control, and volatiles were extracted using two tubes from this vial as well. The tubes were then removed and stored in labeled, sterilized 0.5-ml amber glass vials.

FIGURE 1 Large-scale chemical analysis case study system, the blackbuck, Antilope cervicapra. (a) Male blackbuck on territorial dung pile on lek. (b) Topography of dung piles on lek in February 2016 with principal dung piles (filled circles: central, dark gray; middle, light gray; peripheral, white) and sampling dung piles (center, 1 to 13; middle, 14 to 26; periphery, 27 to 39)

| Thermal desorption
Thermal desorption was performed using a Gerstel Thermal Desorption Unit (TDU) and Cooled Injection System (CIS 4), controlled by a Gerstel Modular Analytical Systems Controller C506 and Gerstel Maestro 1 software; this system extracts volatiles and introduces them directly into a GC-MS for analysis. One fecal pellet from each collection vial was crushed using a sterilized spatula, and 2 g of the powder was added into the TDU liner, with both ends of the liner covered with glass wool. The samples were introduced into the TDU at the initial temperature of 30°C using a Gerstel MultiPurpose Sampler (MPS). After a delay time and an initial time of 1 min each at 30°C, the TDU temperature was increased to 200°C at a rate of 100°C/min and held at 200°C for 10 min. Volatiles were desorbed from the TDU, transferred to the CIS at 210°C, and trapped in the silanized glass wool liner of the CIS, which was maintained at −50°C using liquid nitrogen. After an equilibration time of 0.20 min, the CIS was ramped to 220°C at a rate of 12°C/s and held at that temperature for 5 min for optimal transfer of volatiles to the GC. The TDU-CIS was also used to desorb and introduce volatiles from solvent extracts and PDMS tubes into the GC-MS.

| GC-MS analysis
Volatiles extracted from samples were separated and identified using an Agilent 7890B gas chromatograph coupled with a 5977A MSD mass spectrometer. An HP-5 MS column (30 m × 0.25 mm id, 0.25 μm film thickness) was used with helium as the carrier gas at a flow rate of 1 ml/min. The column oven was kept at 40°C for 1 min, increased to 180°C at a rate of 5°C/min, and finally in- [...]
GC chromatograms of the different volatile extracts of replicate samples were compared to assess the best volatile sampling method.
Each GC chromatogram was also compared to a corresponding blank control to check for contaminants. Blank controls were empty sterilized glass vials exposed to similar environmental conditions and extraction procedures as the samples. The optimized laboratory-based volatile sampling protocol was then tested for feasibility for large-scale in situ sampling in season 2. Further optimization of downstream analysis was carried out using these samples.
Volatile analytes were identified by matching the mass spectral data of the peak with library spectra (NIST and personal libraries created from authentic standards), by comparing their relative retention index using a homologous series of n-alkanes (C6-C30 hydrocarbons, Sigma-Aldrich), by comparing their elution order, and/or by comparing their retention time with standards. The quantity of each volatile extracted was approximated by the area under its peak. A known peak (octamethylcyclotetrasiloxane) was taken as an internal standard, and for SPE extracts all peak areas were normalized by dividing by the peak area of the internal standard (Kallenbach et al., 2014).
Contaminants were removed by comparing the spectra of samples with those of the corresponding controls.
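The two calculations mentioned above, normalization to an internal standard and a retention index relative to the n-alkane series, can be sketched as follows. This is an illustrative sketch only: the function names and example values are hypothetical, and the retention index uses the linear interpolation form commonly applied to temperature-programmed runs, which is an assumption because the text does not state which formula was used.

```python
def normalize_to_internal_standard(peaks, standard="octamethylcyclotetrasiloxane"):
    """Divide every peak area in one chromatogram by the internal-standard area.

    peaks: dict mapping analyte name -> peak area for a single sample.
    """
    if standard not in peaks or peaks[standard] <= 0:
        raise ValueError("internal standard peak missing or zero")
    reference = peaks[standard]
    return {name: area / reference
            for name, area in peaks.items() if name != standard}

def linear_retention_index(rt, alkane_rts):
    """Retention index by linear interpolation between bracketing n-alkanes.

    alkane_rts: dict mapping carbon number -> retention time (min).
    """
    carbons = sorted(alkane_rts)
    for n, n_next in zip(carbons, carbons[1:]):
        t_n, t_next = alkane_rts[n], alkane_rts[n_next]
        if t_n <= rt <= t_next:
            return 100 * (n + (n_next - n) * (rt - t_n) / (t_next - t_n))
    raise ValueError("retention time outside the alkane ladder")

# Hypothetical peak areas and alkane retention times.
sample = {"octamethylcyclotetrasiloxane": 2000.0,
          "meta-cresol": 500.0,
          "analyte_12": 100.0}
normalized = normalize_to_internal_standard(sample)
index = linear_retention_index(4.0, {6: 3.0, 7: 5.0, 8: 8.0})
```

Here the hypothetical analyte eluting at 4.0 min falls halfway between the C6 and C7 alkanes and therefore receives an index of 650.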

| Statistical analysis
In this study, we were interested in understanding the spatial variation in volatiles from dung piles across the lek. This variation can be either discrete, with dung piles in the center of the lek being remarkably different from all other dung piles (like the mating behavior), or gradual, with a stepwise change in volatile composition from center to periphery. As a result, it was necessary to assess different statistical approaches for proper biological inference from the chemical data. It also gave us the opportunity to explore different statistical approaches used in chemical studies and arrive at the most suitable tools for large-scale studies in general. Three common statistical approaches are (1) clustering, (2) classification, and (3) regression.
Clustering is an unsupervised statistical approach and was used to look for natural clusters of territories with similar chemical composition. Classification is a supervised statistical approach for which we predefined our study system into zones (center, middle, and periphery, as described in sampling methods) and analyzed chemical variation between these zones. To observe gradual variation, we used regression approaches, with the distance of each territory from the center of the lek as the parameter along which variation was assessed. To optimize our protocol, we used two alternative statistical tools from each approach and compared their efficiencies using relevant statistics.
The tools used were principal component analysis (PCA) and hierarchical clustering for clustering, linear discriminant analysis (LDA) and random forest classification (RF) for classification, and principal component regression (PCR) and random forest regression for regression analysis.
Random forest is a machine learning algorithm which can be used to assess the importance of variables (volatiles) in classification and regression analysis (Breiman, 2001; Ehrlinger, 2015). It builds decision trees using bootstrapping from samples and creates a ranked variable importance list by running permutations of decision trees. Variables to be considered can be assessed based on the corresponding variable importance scores: mean decrease in accuracy (for classification) and increase in mean square error (for regression).
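The idea behind the mean-decrease-in-accuracy score can be illustrated with a short sketch of permutation importance. This is not a random forest and not the code used in this study: a random forest applies the permutation per tree on out-of-bag samples, whereas here a single hypothetical threshold rule (`toy_model`) stands in for a fitted model, and the data are invented for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=30, seed=0):
    """Mean decrease in accuracy when each variable is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j across samples, leaving other columns intact.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier: uses only variable 0."""
    return "C" if row[0] > 0.5 else "P"

# Invented relative abundances of two volatiles in eight territories.
rows = [[0.90, 0.1], [0.80, 0.7], [0.70, 0.3], [0.95, 0.6],
        [0.10, 0.2], [0.20, 0.8], [0.30, 0.4], [0.05, 0.5]]
labels = ["C"] * 4 + ["P"] * 4
scores = permutation_importance(toy_model, rows, labels)
```

Shuffling the informative variable lowers accuracy, whereas shuffling a variable the model ignores leaves it unchanged, so the first score is positive and the second is zero.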
PCR (Jolliffe, 1982) is a dimensionality reduction approach based on PCA used to circumvent problems of multidimensionality in data such as GC-MS data. PCA is performed on the observed data matrix for the explanatory variables to obtain the principal components (PCs), and then a subset of the PCs is selected, based on some appropriate criteria (e.g., variability explained), for the intended multivariate analysis (e.g., multivariate regression in PCR).
Principal component analysis, hierarchical clustering, and LDA were performed using PAST3 software (Hammer, Harper, & Ryan, 2001), an open source software for statistical analysis.
Random forest classification and regression models and PCR were optimized, cross-validated (10-fold CV), and compared using the "caret" package (Kuhn, 2015) in R. Classification models were compared based on classification accuracy, and regression models were compared based on root-mean-square error (RMSE).
[...] the instrument. Besides, TDU is a laboratory-based technique and therefore requires removal of biological samples from the field. This was not ideal for our study, as we were unable to analyze the samples immediately and fecal matter is subject to rapid microbial conversion, which alters the VOC profile from that present in situ.

| Volatile sampling
Solvent extraction was the least comprehensive method of extraction in this study. Both hexane and DCM extracted very few analytes despite extended periods of exposure (9 hr) to blackbuck fecal samples (Figure 2b,c). Even though DCM was successfully used to extract volatiles from urine in a previous study on captive blackbuck (Rajagopal et al., 2010), it was not appropriate for fecal samples in the current study. Solvent extraction was recognized as a highly sensitive technique that requires little equip- [...]
A thorough experimental comparison of three commonly used volatile sampling techniques in terms of comprehensiveness, sensitivity, volatile preservation, and suitability for large-scale in situ sampling concludes that solid-phase extraction using PDMS along with thermal desorption is the most efficient and practical sampling technique for large-scale study of volatiles. While other field-based sampling and analysis methods are available, including solid-phase microextraction and portable GC-MS (Kücklich et al., 2017; Marneweck et al., 2017), these methods are costly and cannot be applied for very large-scale concurrent sampling as required in this study. In particular, our method is effective for initial studies that can reveal major analytes within and between samples.
However, other methods could be more suitable for detailed and/or small-scale analyses requiring a comprehensive overview of the total VOC profile.
[...] These errors were low and supported the use of RT-BP (retention time and base peak) grouping as a useful automation tool for analysis. All contaminants were also successfully removed. Contaminant removal included removing analytes (RT-BP groups) from samples which were present in controls as well as contaminants with known BP values, like silicates (see Appendix S1). For further analysis, the number of samples in which each analyte was present was calculated, and analytes that were found in less than 5% of the total samples were eliminated to reduce noise. The mass spectrum and relative retention index (RRI) (Kováts, 1958) of one representative from each RT-BP group were used to identify the corresponding compound. In this study, more than 200 unique analytes were detected and quantified, out of which 100 (Table 1) were found in at least 5% of the samples. These analytes were sorted in decreasing order of their abundance, and their ranks in the sorted list were used as their identification number. Chemical names of 68 of these analytes were identified (Table 1) [...]
[...] needed for the script is a simple comma-separated file (.csv) consisting of peak area, base peak, and retention time (see Appendix S1). This method is less sophisticated than the several existing freely available and open source data preprocessing programs such as XCMS (Smith, Want, O'Maille, Abagyan, & Siuzdak, 2006) and CAMERA (Kuhl, Tautenhahn, Böttcher, Larson, & Neumann, 2012) but has an inherent advantage in terms of simultaneous removal of many known contaminants (like PDMS-derived volatiles in this study). While silica-based compounds have signature base peaks that can be fed into the Perl script for easy removal, other contaminants with a unique retention time and base peak can also be easily removed with user-provided information (Appendix S1).
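The grouping and filtering steps described above can be sketched in Python as follows. This is a hedged illustration and not the authors' Perl script (see Appendix S1 for that): the retention-time tolerance, the binning by rounding, the list of contaminant base peaks, the use of summed peak area as the measure of abundance, and all names and data are assumptions made for the example.

```python
from collections import defaultdict

RT_TOLERANCE_MIN = 0.05                         # assumed retention-time bin width
KNOWN_CONTAMINANT_BASE_PEAKS = {73, 207, 281}   # illustrative siloxane-type m/z values

def rt_bp_key(rt, base_peak, tol=RT_TOLERANCE_MIN):
    """Group peaks by binned retention time and base peak (RT-BP group)."""
    return (round(rt / tol), base_peak)

def filter_analytes(samples, controls, min_fraction=0.05):
    """Return {RT-BP group: rank} after contaminant and prevalence filtering.

    samples, controls: dict mapping name -> list of (rt, base_peak, area).
    """
    control_keys = {rt_bp_key(rt, bp)
                    for peaks in controls.values() for rt, bp, _ in peaks}
    presence = defaultdict(set)      # group -> samples in which it occurs
    total_area = defaultdict(float)  # group -> summed peak area
    for name, peaks in samples.items():
        for rt, bp, area in peaks:
            key = rt_bp_key(rt, bp)
            # Drop groups seen in controls and peaks with known contaminant BPs.
            if key in control_keys or bp in KNOWN_CONTAMINANT_BASE_PEAKS:
                continue
            presence[key].add(name)
            total_area[key] += area
    # Keep groups found in at least min_fraction of the samples.
    kept = [key for key, names in presence.items()
            if len(names) / len(samples) >= min_fraction]
    # Rank by decreasing abundance; the rank serves as the analyte number.
    kept.sort(key=lambda key: total_area[key], reverse=True)
    return {key: rank for rank, key in enumerate(kept, start=1)}

# Invented peak lists: (retention time in min, base peak m/z, peak area).
samples = {
    "s1": [(10.02, 107, 500.0), (12.50, 73, 900.0), (15.31, 91, 50.0)],
    "s2": [(10.01, 107, 300.0), (15.30, 91, 80.0), (20.00, 57, 10.0)],
}
controls = {"blank": [(20.01, 57, 12.0)]}
ranks = filter_analytes(samples, controls)
```

Binning by rounding is the simplest possible choice and can split peaks that fall close to a bin boundary; a tolerance-based match against a reference retention time would avoid that at the cost of a little more code.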

| Statistical analysis
Principal component analysis did not produce strong clusters with either unscaled data (Figure 3a) or scaled and log-transformed data (Figure S2), while hierarchical clustering produced three clusters, varying predominantly in the concentration of compound 8, meta-cresol (Figure 3b). High levels of meta-cresol were detected in a few samples from central and middle territories, whereas almost all samples from peripheral territories showed low concentrations of meta-cresol (Figure 3a,b). Overall, volatile profiles varied more among territories within a zone (C/M/P) than between zones. Clustering tools alone were thus not sufficient to convey meaningful information about spatial variation in the chemical signature of blackbuck dung piles, probably because individual variation was much higher than spatial variation.
Classification by LDA had a high accuracy of 92.8% in sorting samples to corresponding zones (Figure 4a). The random forest classification model had a lower accuracy (65.10%). Among the zones, classification accuracy was highest for central territories (Table 2). Compound 8 (meta-cresol) was the most important volatile distinguishing between the zones (Figure 4b). Both methods arrived at similar results, and at first glance, LDA appeared to be better than random forests in terms of classification accuracy.
Accuracy checks on a test dataset using 10-fold cross-validation revealed that the random forest model produced consistent results, and the model was optimized to use 1,000 trees (ntree) and 15 variables per try (mtry) to produce 65.10% accuracy. LDA, however, failed at 10-fold cross-validation, possibly due to the large number of zero values in the data, so the model could not be optimized. Jackknifing reduced the classification accuracy of LDA drastically to 17.42%. In this study, of these two methods, random forest was thus determined to be the better statistical approach in terms of accuracy and consistency for assessing discrete spatial variation in the chemical composition of the lek. The considerable difference in chemical composition between the three zones corroborated the hypothesis that spatial variation in the mating success of males is correlated with spatial variation in the chemical signature of the lek. Again, meta-cresol was determined to be the most important driver of this variation.
Of the two regression models, the optimized random forest regression model (ntree = 1,000, mtry = 5, 10-fold CV) explained 26.17% of the variation in volatile composition across territories (Figure 5a). The root-mean-square error (RMSE = 55.512 m) was very high (more than one-fifth of the distance between the center and the outermost territory on the lek), indicating a weak trend along distance from the center. For principal component regression, PCA was used for dimensionality reduction and model optimization. The optimized model with four principal components had a higher error (RMSE = 60.82273 m) than random forests (Figure 5b). Regression analyses detected a small but consistent variation in the chemical composition of the lek from center to periphery. Again, meta-cresol was predicted to be the most important compound varying with distance from the center (Figure 5c). Random forest regression was determined to be the better tool for assessing gradual spatial variation in the chemical composition of the lek.

| CONCLUSIONS
Here, we have developed a pipeline for large-scale in situ studies of VOCs employing solid-phase in situ extraction, thermal desorption coupled with GC-MS, semiautomated analysis using retention time and base peak, and statistical analysis using random forests. Each of the selected methods exhibited both advantages and drawbacks, but each step was selected to maximize efficiency, sensitivity, chemical retention and preservation, and comprehensiveness of analysis. Our methodology allowed about 200 samples to be analyzed in two weeks (not including the GC-MS runtime), as opposed to the 7-8 months (details explained in Section 2) that would have been required to do this work manually. While the TDU-GC-MS instrument is specialized and costly, the other techniques explored are inexpensive and require relatively little expertise to perform. In addition, our pipeline is not restricted to mammals, as used for the case study, but could be employed with nearly any living system on land or in the sea (PDMS can also absorb chemicals from aqueous solutions). Our method was also specifically developed for large-scale, population-level studies where neither compounds nor sample groups may be previously known. As such, it can be employed in new systems for which little prior knowledge of the natural products involved is available. Further studies that employ this method may incorporate additional steps that increase the effectiveness of our pipeline.

ACKNOWLEDGMENTS
Authors are thankful to the Rajasthan Forest Department, Jaipur, India, and Tal Chhapar wildlife sanctuary, Rajasthan, for permits to collect samples. Authors are grateful to Bhagwana Ram, Rupsy

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
JVN, SO, URK, and VSP conceived the ideas and designed methodology; JVN collected the data; VSP identified the compounds; SDK wrote the code for semiautomation analysis; JVN analyzed the data; JVN and SO led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

DATA ACCESSIBILITY
The data used in this study has been provided in Dryad Digital Repository: https://doi.org/10.5061/dryad.kp18283.