A priori, partial-ordering methodology treats the input data as exact and true values, denoted the “original data matrix”. As such, even minor differences between values are regarded as real. In real life, however, data typically carry a certain amount of noise or uncertainty, and introducing noise may change the overall ordering of objects. The present paper deals with the effects of data noise or uncertainty on the partial ordering of a series of objects, with a series of obsolete pesticides used as an illustrative example. The approach is fuzzy-like, and partially ordered sets are obtained as a function of noise. A main focus of the work is to identify the range of noise within which the original partial order, i.e., the one based on the original data matrix, is retained; we call this range the “stability range”. It is demonstrated that the stability range shrinks as data noise increases, and that significant changes in the partial ordering appear outside this range. The possible relation between data noise and the stability range is discussed on an empirical basis. Copyright © 2015 John Wiley & Sons, Ltd.
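The stability-range idea can be illustrated with a small Monte Carlo sketch (not the authors' implementation; the toy data matrix, noise model, and amplitudes below are hypothetical): repeatedly perturb the data matrix with uniform noise of growing amplitude and record the largest amplitude at which the induced partial order still equals the original one.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_order(X):
    """Ordered pairs (i, j) where object i is dominated by object j
    on every indicator (the comparability relation of the poset)."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and all(X[j, k] >= X[i, k] for k in range(X.shape[1]))}

def stability_range(X, deltas, trials=200):
    """Largest noise amplitude (scanned in ascending order) for which the
    partial order of the noisy matrix equals the original one in all trials."""
    original = partial_order(X)
    stable = 0.0
    for d in deltas:
        if all(partial_order(X + rng.uniform(-d, d, X.shape)) == original
               for _ in range(trials)):
            stable = d
        else:
            break
    return stable

# toy matrix: 4 objects x 2 indicators, comparable objects separated by >= 1
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 4.0]])
print(stability_range(X, deltas=[0.05, 0.1, 0.2, 0.5, 1.0]))
```

With well-separated indicator values the order survives moderate noise; once the noise amplitude approaches the smallest gap between comparable objects, the original order is lost.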

2-Acetyl-1-pyrroline (2AP) is known as a principal basmati aroma compound. The present study aims at discriminating rhizobacteria isolated from soils cultivated with basmati and non-basmati rice for a long duration. Volatile profiling was used as a marker to discriminate the rhizobacterial isolates. 2AP and other volatile compounds (VCs) produced by rhizobacteria were quantified using HS-SPME coupled with GC-MS. Chemometric tools such as hierarchical cluster analysis (HCA), principal component analysis (PCA) and multidimensional scaling (MDS) were applied for volatile profiling of the different isolates. Results showed significant discrimination of all 2AP-producing (AP-P) and non-producing (AP-NP) rhizobacterial isolates on the basis of their VC profiles, which was validated by bacterial identification data as well. The frequency distribution of 2AP levels indicates that basmati isolates had a higher frequency of 2AP production than the non-basmati control. AP-P and AP-NP isolates have different VC profiling patterns irrespective of their origin, and were found to belong to different groups when identified using 16S rDNA sequencing data. Chemometric analysis (PCA, HCA and MDS) helped to identify volatiles that could be used as biomarkers for discriminating the AP-P and AP-NP isolates; the VC pattern of rhizobacteria could thus serve as a volatile marker to distinguish between them. Copyright © 2015 John Wiley & Sons, Ltd.
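As a sketch of the chemometric step (PCA only; HCA and MDS are omitted), a toy VC intensity matrix with two synthetic isolate groups separates cleanly on the first principal component. All values below are invented for illustration, not measured data:

```python
import numpy as np

# toy VC intensity matrix: rows = isolates, columns = volatile compounds;
# first three rows mimic 2AP producers (high in the first two VCs),
# last three mimic non-producers (high in the last two VCs)
X = np.array([
    [9.0, 8.5, 1.0, 0.5],
    [8.7, 9.1, 0.8, 0.7],
    [9.3, 8.2, 1.2, 0.4],
    [1.1, 0.9, 7.8, 8.4],
    [0.8, 1.3, 8.1, 7.9],
    [1.0, 1.1, 7.5, 8.8],
])

Xc = X - X.mean(axis=0)                 # mean-center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                          # PC scores; column 0 is PC1
explained = s**2 / np.sum(s**2)         # fraction of variance per component

print(scores[:, 0])                     # the two groups separate on PC1
```

The sign of PC1 is arbitrary, but the two groups land on opposite sides of zero, which is the kind of separation the HCA/PCA/MDS analysis exploits.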

In developing partial least squares calibration models, selecting the number of latent variables used for their construction to minimize both model bias and model variance remains a challenge. Several metrics exist for incorporating these trade-offs, but the cost of model parsimony and the potential for underfitting on achievable prediction errors are difficult to anticipate. We propose a metric that penalizes growing model variance against decreasing bias as additional latent variables are added. The magnitude of the penalty is scaled by a user-defined parameter that is formulated to provide a constraint on the fractional increase in root mean square error of cross-validation (RMSECV) when selecting a parsimonious model over the conventional minimum-RMSECV solution. We evaluate this approach for quantification of four organic functional groups using 238 laboratory standards and 750 complex atmospheric organic aerosol mixtures with mid-infrared spectroscopy. Parametric variation of this penalty demonstrates that the increase in prediction errors due to underfitting is bounded by the magnitude of the penalty for samples similar to the laboratory standards used for model training and validation. Imposing an ensemble of penalties corresponding to a 0–30% allowable increase in RMSECV through sum of ranking differences leads to the selection of a model that increases the actual RMSECV by up to 20% for laboratory standards but achieves an 85% reduction in the mean error in predicted concentrations for environmental mixtures. Partial least squares models developed with laboratory mixtures can provide useful predictions in complex environmental samples but may benefit from protection against overfitting. © 2015 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.
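One simple reading of the user-defined constraint (a sketch, not the authors' exact metric, which additionally involves sum of ranking differences over an ensemble of penalties) is: accept the most parsimonious model whose RMSECV lies within a chosen fraction α of the global minimum. The RMSECV curve below is hypothetical:

```python
import numpy as np

def select_components(rmsecv, alpha):
    """Smallest number of latent variables whose RMSECV is within a
    fraction `alpha` of the global minimum (parsimony over raw minimum).
    `rmsecv[k]` is the error for k+1 latent variables."""
    rmsecv = np.asarray(rmsecv, dtype=float)
    threshold = (1.0 + alpha) * rmsecv.min()
    return int(np.argmax(rmsecv <= threshold)) + 1  # first index meeting it

# hypothetical RMSECV curve for 1..8 latent variables
curve = [2.10, 1.40, 1.05, 0.98, 0.95, 0.94, 0.96, 0.99]
print(select_components(curve, alpha=0.0))   # conventional minimum-RMSECV pick
print(select_components(curve, alpha=0.10))  # accept a 10% increase, fewer LVs
```

Larger α trades a bounded increase in cross-validation error for a simpler model, which is the protection against overfitting the abstract refers to.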

Advances in technology make it possible to collect massive amounts of information in the form of multiple variables per object. The use of multivariate approaches for modeling real-life phenomena is natural in such situations. There are numerous multivariate approaches in the literature, and it is a challenge to stay updated on the possibilities. Partial least squares (PLS) is one of many modeling approaches for high-throughput data, and its use in different fields to address a variety of problems has increased in recent years. We therefore present an overview of PLS applications. The objective of this paper is to give a comprehensive overview of advances in the PLS algorithm together with its applications to regression, classification, variable selection, and survival analysis problems, covering genomics, chemometrics, neuroinformatics, process control, computer vision, econometrics, environmental studies, and so on. We mainly present the different PLS approaches and their applications, so that readers can easily assess the possibility of using PLS in their own field. For further reading, literature references together with software availability are provided. Copyright © 2015 John Wiley & Sons, Ltd.

As a representative paradigm of evolutionary algorithms, particle swarm optimization (PSO) has been combined with partial least squares (PLS) (called PSO-PLS) to select informative descriptors in quantitative structure-activity/property relationship (QSAR/QSPR) modeling. However, one of the main limitations of PSO-PLS is that it ignores the PLS model information. In this paper, by incorporating the PLS model information into PSO-PLS, we present a novel weighted sampling method (called WS-PSO-PLS) to choose the optimal descriptor subset. Because the regression coefficients of the PLS model reflect the importance of descriptors in the model development, we first obtain the normalized regression coefficients by establishing the PLS model with all the descriptors. Weighted sampling is then used to generate individuals according to these normalized regression coefficients. Finally, we employ some dimensions of the generated individuals to replace the corresponding dimensions of the individuals with poor quality in the population at each generation. WS-PSO-PLS has been assessed on three QSAR/QSPR datasets, and the experimental results suggest that WS-PSO-PLS can effectively guide the search process by introducing the PLS model coefficients into PSO during the evolution and, therefore, performs better than PSO-PLS. WS-PSO-PLS can be considered a general and promising mechanism for introducing extra information to improve the performance of PSO for descriptor selection in QSAR/QSPR. Copyright © 2015 John Wiley & Sons, Ltd.
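The weighted-sampling step can be sketched as follows (a minimal illustration, not the authors' code): normalize the absolute PLS regression coefficients into sampling probabilities and draw binary descriptor-selection individuals from them. The coefficient values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def weighted_individuals(coef, n_individuals, n_select):
    """Binary descriptor-selection vectors sampled with probability
    proportional to the absolute PLS regression coefficients."""
    w = np.abs(np.asarray(coef, dtype=float))
    p = w / w.sum()
    pop = np.zeros((n_individuals, len(w)), dtype=int)
    for row in pop:
        chosen = rng.choice(len(w), size=n_select, replace=False, p=p)
        row[chosen] = 1
    return pop

# hypothetical coefficients from a PLS model fit on all descriptors
coef = [0.9, -0.7, 0.05, 0.02, 0.6, -0.01]
pop = weighted_individuals(coef, n_individuals=5, n_select=3)
print(pop)  # descriptors with large |coefficient| appear most often
```

In the full method, dimensions of such individuals replace the corresponding dimensions of poor-quality swarm members at each generation.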

After showing that plain covariance- or correlation-based criteria are generally not suitable for multiple-block component models in an exploratory framework, we propose an extended criterion: multiple co-structure (MCS). MCS combines the goodness-of-fit indicator of the component model with a flexible measure of the structural relevance of the components. It thus allows tracking various kinds of interpretable structures within the data, on top of variance-maximizing components: variable bundles, components close to satisfying relevant structural constraints, and so on. MCS is maximized under unit-norm constraints on the coefficient vectors, and we provide a dedicated ascent algorithm for this purpose. This algorithm is nested within a more general one, named THEME (thematic equation model explorator), which calculates several components per data array and extracts nested structural component models. The method is tested on simulated data and applied to physicochemical data. Copyright © 2015 John Wiley & Sons, Ltd.

No abstract is available for this article.

Large-scale process data in plant-wide process monitoring are characterized by two features: complex distributions and complex relevance. To handle both, this study proposes a double-step block-division plant-wide process monitoring method based on variable distributions and relevance features. First, the data distribution is considered: a normality test (the D-test) is applied to group variables with the same distribution (i.e., Gaussian or non-Gaussian) into a block. A second block division is then implemented on both blocks obtained in the previous step: the mutual information shared between two variables is used to generate relevance matrices for the Gaussian and non-Gaussian blocks, and the *K*-means method clusters the vectors of each relevance matrix. Principal component analysis is conducted to monitor each Gaussian subblock, whereas independent component analysis is conducted to monitor each non-Gaussian subblock. A composite statistic is eventually derived through Bayesian inference. The proposed method is applied to a numerical system and the Tennessee Eastman process data set, and the monitoring performance shows its superiority. Copyright © 2015 John Wiley & Sons, Ltd.
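A minimal sketch of the first block-division step (with a crude moment-based check standing in for the D-test, and the K-means clustering of the relevance matrix omitted): classify variables as Gaussian or non-Gaussian, then use histogram-estimated mutual information as the relevance measure between variables. All variables below are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

def looks_gaussian(x, skew_tol=0.5, kurt_tol=1.0):
    """Crude moment-based normality check (a stand-in for the D-test):
    'Gaussian' if sample skewness and excess kurtosis are near zero."""
    z = (x - x.mean()) / x.std()
    skew = np.mean(z**3)
    ex_kurt = np.mean(z**4) - 3.0
    return abs(skew) < skew_tol and abs(ex_kurt) < kurt_tol

def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information between two variables."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# simulated plant data: two Gaussian and two skewed (non-Gaussian) variables
n = 5000
g1 = rng.normal(size=n)
g2 = 0.8 * g1 + 0.2 * rng.normal(size=n)   # strongly relevant to g1
s1 = rng.exponential(size=n)
s2 = rng.exponential(size=n)

data = {"g1": g1, "g2": g2, "s1": s1, "s2": s2}
gaussian_block = [k for k, v in data.items() if looks_gaussian(v)]
print(gaussian_block)
```

Within each block, the mutual-information values form the relevance matrix whose row vectors would then be clustered into subblocks.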

The effects and regulatory actions of polychlorinated biphenyl (PCB) substituent characteristics on their relative retention times (RRTs) during gas chromatography were analyzed based on the known experimental RRTs of 209 PCB congeners and biphenyl. The substituent characteristics used for this analysis included the total number of substituents, the similarity between the two phenyl rings in a single PCB congener, the substituent distribution on a single phenyl ring, the main and second-order interaction effects at each position, and the combined effect of the two phenyl rings. A full factorial experimental design with 10 factors, one for each substituent position, at two levels (0, 1) was initially applied to the domains of the substituent characteristics. The results reveal that increasing the total number of substituents increases the RRTs of PCBs linearly, whereas the similarity between the two rings does not control the RRTs effectively. Moreover, the more compact the substituent distribution on a single phenyl ring, the larger the RRT of the PCB. Based on the full factorial design, the overall importance trend for each position is *para* > *meta* > *ortho*, and the main regulatory substituents for the second-order interaction effects are distributed in the same phenyl ring in the sequence *N*_{o} > *N*_{m} > *N*_{p}. A congener with two perpendicular phenyl rings exhibits a milder combined effect on the RRT and a relatively smaller RRT. Finally, the universality of these regularities was validated under other experimental conditions, revealing the dominant effect of substituent characteristics on the RRTs of PCBs. Copyright © 2015 John Wiley & Sons, Ltd.

The nonlinear, nonnegative single-mixture blind source separation problem consists of decomposing an observed, nonlinearly mixed multicomponent signal into nonnegative dependent component (source) signals. The problem is difficult and is a special case of the underdetermined blind source separation problem. However, it is practically relevant for contemporary metabolic profiling of biological samples when only one sample is available for acquiring mass spectra, after which the pure components are extracted. Herein, we present a method for the blind separation of nonnegative dependent sources from a single, nonlinear mixture. First, an explicit feature map is used to map the single mixture into a pseudo multi-mixture. Second, an empirical kernel map is used for implicit mapping of the pseudo multi-mixture into a high-dimensional reproducing kernel Hilbert space. Under sparse probabilistic conditions previously imposed on the sources, the single-mixture nonlinear problem is converted into an equivalent linear, multiple-mixture problem that consists of the original sources and their higher-order monomials. These monomials are suppressed by robust principal component analysis and hard, soft, and trimmed thresholding. Sparseness-constrained nonnegative matrix factorizations in the reproducing kernel Hilbert space yield sets of separated components, which are then annotated with pure components from a library using the maximal correlation criterion. The proposed method is illustrated with a numerical example involving the extraction of eight dependent components from one nonlinear mixture, and is further demonstrated on three nonlinear chemical reactions of peptide synthesis, in which 25, 19, and 28 dependent analytes are extracted from one nonlinear mixture mass spectrum. The intended application of the proposed method, in combination with other separation techniques, is mass spectrometry-based non-targeted metabolic profiling, such as biomarker identification studies. Copyright © 2015 John Wiley & Sons, Ltd.
