Samples with analyte concentrations outside a method's dynamic range are a reality of clinical chemistry and are of particular interest in method comparison studies. The most obvious remedy, ignoring such values, introduces bias and discards the information that censored data can add to the analysis. Extending conventional errors-in-variables methods to incorporate value-censored data recovers this information. The formulation presented uses a variance model more flexible than either the constant-variance or the constant-coefficient-of-variation model. Copyright © 2016 John Wiley & Sons, Ltd.
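The censored-data idea can be illustrated with a Tobit-style log-likelihood: observed values contribute a normal density term, while values censored at an assay limit contribute the tail probability beyond that limit. The sketch below is illustrative only; the linear-in-mean standard deviation (`c0 + c1*|mu|`) stands in for the paper's more flexible variance model, and all parameter names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_loglik(pairs, a, b, c0, c1):
    """Log-likelihood for y = a + b*x with sd = c0 + c1*|mean|.

    pairs: iterable of (x, y, censored); when censored is True, y is the
    upper assay limit and the true value is only known to exceed it.
    """
    ll = 0.0
    for x, y, censored in pairs:
        mu = a + b * x
        sd = c0 + c1 * abs(mu)
        if censored:
            # contribution of a value known only to lie above the limit y
            ll += math.log(max(1.0 - norm_cdf((y - mu) / sd), 1e-300))
        else:
            z = (y - mu) / sd
            ll += -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))
    return ll
```

Maximizing such a likelihood with a generic optimizer uses the censored observations instead of discarding them, which is what removes the bias the abstract describes.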

A simple and efficient approach is reported to estimate the sparsest Tucker3 model for a given linearly dependent multiway data array using PARAFAC profiles. By employing the smallest possible number of non-zero core elements, equal to the pseudo-rank of the data array, a better and easier interpretation of the data array becomes possible. The approach does not require any prior information. The type of rank deficiency, that is, rank overlap or closure in different modes, and the Tucker3 core size can be determined from a congruency factor while running the algorithm. The replacement method (RM) of optimization is applied to determine the pattern (positions and values) of non-zero elements in the sparsest core of the Tucker3 model. Full-rank and rank-deficient simulated data sets under different conditions, as well as an experimental 3D fluorescence data set from gold nanoparticle (AuNP) interaction with the HIV genome, are successfully used to evaluate the performance of the algorithm. Copyright © 2016 John Wiley & Sons, Ltd.
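The connection between core sparsity and PARAFAC can be made concrete: a superdiagonal Tucker3 core with R non-zero elements reproduces exactly an R-component PARAFAC model, which is the extreme of sparsity the core search exploits. A minimal numpy sketch of the Tucker3 reconstruction (not the RM optimization itself):

```python
import numpy as np

def tucker3_reconstruct(G, A, B, C):
    """Rebuild the array from core G and loadings A, B, C:
    D_ijk = sum_{pqr} G_pqr * A_ip * B_jq * C_kr."""
    return np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
```

With a superdiagonal core, this reduces to the familiar sum of triple outer products of the PARAFAC profiles.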

Accurate and reliable real-time prediction of melt index (MI) is indispensable for quality control of industrial propylene polymerization (PP) processes. This paper presents a real-time soft sensor based on an optimized least squares support vector machine (LSSVM) for MI prediction. First, the hybrid continuous ant colony differential evolution algorithm (HACDE) is proposed to optimize the parameters of the LSSVM. Then, considering the complexity and nondeterminacy of a PP plant, an online correcting strategy (OCS) is adopted to update the modeling data and to revise the model's parameters adaptively. The result is the real-time prediction model HACDE-OCS-LSSVM. Based on data from a real PP plant, HACDE-LSSVM, DE-LSSVM, and LSSVM models are also developed for comparison. The results show that the proposed real-time model achieves good performance in practical industrial MI prediction. Copyright © 2016 John Wiley & Sons, Ltd.
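For readers unfamiliar with LSSVM regression, the model is fitted by solving a single linear system in the dual variables rather than a quadratic program. A minimal numpy sketch with an RBF kernel; the HACDE parameter optimization and the OCS updating are not reproduced, and `gamma` and `sigma` here are hand-set stand-ins for the tuned parameters:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # pairwise squared distances, then Gaussian kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve the LSSVM KKT system [[0, 1'], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

On a toy curve the fitted model reproduces the training targets closely; in the soft-sensor setting, X would hold process variables and y the measured MI.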

Near infrared (NIR) spectroscopy is an efficient, low-cost analytical technique widely applied to identify the origin of food and pharmaceutical products. NIR spectra-based classification strategies typically use thousands of equally spaced wavelengths as input information, some of which may not carry relevant information for product classification. When that is the case, the performance of predictive and exploratory multivariate techniques may be undermined by such noisy information. In this paper, we propose an iterative framework for selecting subsets of NIR wavelengths aimed at classifying samples into categories. To that end, we integrate Principal Components Analysis (PCA) and three classification techniques: *k*-Nearest Neighbor (KNN), Probabilistic Neural Network (PNN) and Linear Discriminant Analysis (LDA). PCA is first applied to NIR data, and a wavelength importance index is derived based on the PCA loadings. Samples are then categorized using the wavelength with the highest index and the classification accuracy is calculated; next, the wavelength with the second highest index is inserted into the dataset and a new classification is performed. This forward-based iterative procedure is carried out until all original wavelengths are inserted into the dataset used for classification. The subset of wavelengths leading to the maximum accuracy is chosen as the recommended subset. The proposed framework performed remarkably well when applied to four datasets related to food and pharmaceutical products. Copyright © 2016 John Wiley & Sons, Ltd.
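The iterative framework can be sketched end-to-end in a few lines. The importance index below (absolute PCA loadings weighted by explained variance) and the 1-NN/leave-one-out evaluation are illustrative assumptions; the paper derives its own index and also uses PNN and LDA classifiers:

```python
import numpy as np

def loo_1nn_accuracy(X, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)  # a sample cannot be its own neighbour
    return np.mean(labels[D.argmin(axis=1)] == labels)

def forward_pca_selection(X, labels, n_components=2):
    """Rank wavelengths by a PCA-loading index, then grow the subset forward."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s[:n_components] ** 2
    # assumed index: variance-weighted sum of absolute loadings
    importance = (np.abs(Vt[:n_components]) * var[:, None]).sum(axis=0)
    order = np.argsort(importance)[::-1]
    best_acc, best_k = -1.0, 1
    for k in range(1, X.shape[1] + 1):
        acc = loo_1nn_accuracy(X[:, order[:k]], labels)
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k], best_acc
```

Because the loop tries every prefix size up to the full set, the recommended subset can never do worse than using all wavelengths under this accuracy measure.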

The identification of industrial chemicals that may cause developmental effects is of great importance for early detection of hazardous chemicals. Accordingly, categorical quantitative structure-activity relationship (QSAR) models were developed, based on developmental toxicity profile data for zebrafish from ToxCast Phase I testing, to predict the toxicity of a large set of high and low production volume chemicals (H/LPVCs). QSARs were created using linear discriminant analysis (LDA), quadratic discriminant analysis, and partial least squares-discriminant analysis with different chemical descriptors. The predictions of the best model (LDA) were compared with those obtained by the freely available QSAR model VEGA, which was built on a dataset with a different chemical domain. The results showed that despite the similar accuracy (AC) of both models, the LDA model is more specific than VEGA and shows better agreement between sensitivity (SE) and specificity (SP). Applying a 90% confidence level to the LDA model led to even better predictions, with an SE of 0.92, AC of 0.95, and geometric mean of SE and SP (G) of 0.96 for the prediction set. The LDA model predicted 608 H/LPVCs as toxicants, of which 123 chemicals fall inside the applicability domain (AD) of the VEGA model, which predicted 112 of those as toxicants. Among the 112 chemicals predicted as toxic H/LPVCs, 23 have previously been reported as developmental toxicants. The LDA model presented here could be used to identify and prioritize H/LPVCs for subsequent developmental toxicity assessment, as a screening tool for potential developmental effects of new chemicals, and to guide the synthesis of safer alternative chemicals. © 2016 The Authors Journal of Chemometrics Published by John Wiley & Sons Ltd
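The 90% confidence level amounts to withholding a prediction whenever the maximum class posterior falls below the threshold. A minimal two-step sketch with a textbook linear discriminant (pooled within-class covariance); this is a generic LDA, not the authors' fitted model:

```python
import numpy as np

def lda_fit(X, y):
    """Fit a Gaussian LDA with pooled within-class covariance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    S = sum(np.cov(X[y == c].T) * (np.sum(y == c) - 1) for c in classes)
    S /= (len(y) - len(classes))
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, np.linalg.inv(S), priors

def lda_posteriors(model, X):
    classes, means, Sinv, priors = model
    # linear discriminant g_c(x) = x'Sinv mu_c - 0.5 mu_c'Sinv mu_c + log prior
    g = X @ Sinv @ means.T \
        - 0.5 * np.einsum('ij,jk,ik->i', means, Sinv, means) \
        + np.log(priors)
    g -= g.max(axis=1, keepdims=True)       # stabilized softmax
    p = np.exp(g)
    return p / p.sum(axis=1, keepdims=True)

def classify_with_confidence(model, X, level=0.9):
    """Predict classes, flagging only samples whose posterior clears `level`."""
    p = lda_posteriors(model, X)
    return p.argmax(axis=1), p.max(axis=1) >= level
```

Restricting the reported predictions to the confident subset trades coverage for the improved SE/SP balance described in the abstract.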

In this article, we focus on adaptive linear regression methods and propose a new technique. The article begins with a review of the online passive aggressive algorithm (OPAA), an adaptive linear regression algorithm from the machine learning literature. We highlight the strengths and weaknesses of OPAA and compare it with other popular adaptive regression techniques such as moving window and recursive least squares, recursive partial least squares, and just-in-time or locally weighted regression. Modifications to OPAA are proposed to make it more robust and better suited for industrial soft-sensor applications. The new algorithm is called the smoothed passive aggressive algorithm (SPAA); like OPAA, it follows a cautious parameter update strategy but is more robust. The trade-off between SPAA's computational complexity and accuracy can be easily controlled by manipulating just two tuning parameters. We also demonstrate that the SPAA framework is quite flexible and that a number of variants are easily formulated. Application of SPAA to estimate the time-varying parameters of a numerically simulated autoregressive with exogenous terms (ARX) model and to predict the Reid vapor pressure of the bottoms flow from an industrial column demonstrates its superior performance over OPAA and performance comparable to the other popular algorithms. Copyright © 2016 John Wiley & Sons, Ltd.
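The core of OPAA is the classic passive-aggressive regression update (Crammer et al.): leave the weights untouched when the prediction error is within an insensitivity band ε, otherwise take the smallest step that brings the error back to the band. A minimal sketch of that base update (SPAA's smoothing modifications are not reproduced here):

```python
import numpy as np

def pa_regression_update(w, x, y, eps=0.1):
    """One classic passive-aggressive update for linear regression."""
    err = y - w @ x
    loss = max(0.0, abs(err) - eps)
    if loss == 0.0:
        return w  # passive: prediction already within the eps band
    tau = loss / (x @ x)  # aggressive: smallest step that zeroes the loss
    return w + np.sign(err) * tau * x
```

Run on a stream of samples, the update tracks the underlying parameters; the ε band is what gives the algorithm its cautious character.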

No abstract is available for this article.

Unsupervised random forest: a tutorial with case studies. Nelson Lee Afanador, Agnieszka Smolinska, Thanh Tran, Lionel Blanchet. Cover story, p. 231. doi:10.1002/cem.2793 (published online 24 May 2016). http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fcem.2793
Investigating the effect of flexible constraints on the accuracy of self-modeling curve resolution methods in the presence of perturbations. Nahal Rahimdoust Mojdehi, Mathias Sawall, Klaus Neymeyr, Hamid Abdollahi. Research article, pp. 252-267. doi:10.1002/cem.2787 (published online 20 April 2016). http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fcem.2787

Multidimensional data exploration often begins with some form of dimensionality reduction, of which principal component analysis is the most commonly used. This approach, in its traditional implementation, can only capture linear relations, which can hamper the ability of the data analyst to detect important non-linear structure in the data. In this tutorial, we present a relatively unknown and yet powerful alternative method known as Unsupervised Random Forest (URF). URF makes ingenious use of a simple assumption: if the data that we are modelling holds any structure, it should be distinguishable from a randomly generated dataset. URF does not rely on any distributional assumptions, data attributes (continuous or categorical), or scaling. Similar to its parent method Random Forest, it can model both linear and non-linear relationships. Another advantage of URF is the limited number of parameters to optimize. Low-dimensional visualisation, via the study of the proximity matrix, allows the user to discover patterns and clustering in the data. This tutorial describes not only the underlying theory but also the practical inner workings of URF. Two real data sets demonstrate the potential of URF and provide a basic framework for comparing its performance to other explorative methods. Further research opportunities are also presented. The corresponding codes in R and Matlab are available.
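The "randomly generated dataset" at the heart of URF is commonly built by permuting each variable independently: the marginal distributions are preserved while the joint structure is destroyed, so a random forest that can separate real from synthetic samples has found structure. A minimal sketch of that construction step (training the forest and extracting the proximity matrix are left to a standard random forest implementation):

```python
import numpy as np

def make_urf_training_set(X, rng):
    """Build the two-class training set for unsupervised random forest."""
    # synthetic samples: permute each column independently, destroying the
    # joint structure while keeping every marginal distribution intact
    X_synth = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_synth))])
    return X_all, y_all
```

A supervised random forest trained on `(X_all, y_all)` then yields the proximities used for the low-dimensional visualization discussed in the tutorial.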

Unsupervised random forest: a tutorial with case studies. Nelson Lee Afanador, Agnieszka Smolinska, Thanh N. Tran, Lionel Blanchet. Tutorial, pp. 232-241. doi:10.1002/cem.2790 (published online 14 March 2016). http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fcem.2790

Unsupervised methods, such as principal component analysis, have gained popularity and widespread acceptance in the chemometrics and applied statistics communities. Unsupervised random forest is an additional method capable of discovering underlying patterns in the data. However, the number of applications of unsupervised random forest in chemometrics has been limited. One possible cause for this is the belief that random forest can only be used in a supervised analysis setting. This tutorial introduces the basic concepts of unsupervised random forest and illustrates several applications in chemometrics through worked examples. Copyright © 2016 John Wiley & Sons, Ltd.


Post-transformation of PLS2 (ptPLS2) by orthogonal matrix: a new approach for generating predictive and orthogonal latent variables. Matteo Stocchero, Debora Paris. Research article, pp. 242-251. doi:10.1002/cem.2780 (published online 24 February 2016). http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fcem.2780

Partial Least Squares (PLS) is a wide class of regression methods aiming at modelling relationships between sets of observed variables by means of latent variables. Specifically, PLS2 was developed to correlate two blocks of data, the X-block representing the independent or explanatory variables and the Y-block representing the dependent or response variables. Later, OPLS was introduced to further reduce model complexity by removing Y-orthogonal sources of variation from X in the latent space, thus improving data interpretation through the generated predictive latent variables. Nevertheless, the relationships between PLS2 and OPLS in the case of multiple Y-responses have not yet been fully explored. With this perspective, and taking inspiration from some basic mathematical properties of PLS2, we here present a novel and general approach consisting of a post-transformation of PLS2 (ptPLS2), which results in a decomposition of the latent space into orthogonal and predictive components while preserving the goodness of fit and predictive ability of PLS2. Additionally, we discuss the application of the ptPLS2 approach to two metabolomic data sets extracted from earlier published studies and its advantages in model interpretation as compared with the ‘standard’ PLS approach. Copyright © 2016 John Wiley & Sons, Ltd.
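As background to ptPLS2, the PLS2 latent variables themselves can be obtained from the singular vectors of the cross-covariance matrix XᵀY (the SIMPLS-style view). The sketch below computes only these standard PLS2 weights and scores, not the post-transformation into predictive and orthogonal parts introduced in the paper:

```python
import numpy as np

def pls2_weights(X, Y, n_comp=1):
    """First PLS2 weight vectors from the SVD of the cross-covariance X'Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    W = U[:, :n_comp]   # X-weights: directions of maximal covariance with Y
    T = Xc @ W          # X-scores (latent variables)
    return W, T
```

The scores T are the latent variables that ptPLS2 subsequently rotates into predictive and Y-orthogonal components without changing the model fit.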

Self-modeling curve resolution methods have been continuously improved in recent years. Much effort has been devoted to reducing the rotational ambiguity by means of different types of constraints. Choosing proper constraints and cost functions is critically important for reducing the rotational ambiguity, because the constraints directly influence the accuracy of the area of feasible solutions (AFS).

In this work, we introduce a new, improved cost function that serves to apply nonnegativity, unimodality, equality, and closure constraints. We also investigate the reduction of the AFS under hard and soft constraints. A further aim of this work is to evaluate the accuracy and precision of the reduced AFS in the presence of noise and perturbations, under hard and soft implementations of the nonnegativity, unimodality, equality, and closure constraints. A comparison is given between the AFS reduced with soft constraints (small deviations from the constraints are accepted) and the AFS reduced under hard constraints (strictly enforced constraints). A graphical visualization of this comparison is presented for various model problems. The results show that AFS computation with soft constraints provides more reliable results, especially in the presence of noise. The test problems substantiate significant advantages of soft constraints over hard constraints, because the obtained profiles are closer to the potentially true noisy profiles, which contain small deviations from the ideal responses. The tunable parameters *ϵ*, *γ*, *ω*, and *δ* are one of the advantages of the soft-constrained cost function, allowing small deviations from the ideal responses. Ultimately, soft constraints can help to reduce the lack of fit, and they are a proper instrument for handling the effect of noise on the AFS. Copyright © 2016 John Wiley & Sons, Ltd.
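A soft constraint enters the cost function as a penalty that tolerates small violations instead of forbidding them outright. The sketch below is a schematic example for a bilinear model D ≈ CSᵀ with soft nonnegativity and closure on C; the weights `gamma` and `delta` play the role of the tunable parameters that admit small deviations, and the paper's full cost function also covers unimodality and equality constraints:

```python
import numpy as np

def soft_constrained_cost(D, C, S, gamma=10.0, delta=1.0):
    """Lack of fit plus soft penalties for a bilinear model D ~ C S'."""
    lof = np.linalg.norm(D - C @ S.T) ** 2
    # soft nonnegativity: quadratic penalty only on negative entries of C
    neg = np.minimum(C, 0.0) ** 2
    # soft closure: concentrations in each row should sum to ~1
    closure = (C.sum(axis=1) - 1.0) ** 2
    return lof + gamma * neg.sum() + delta * closure.sum()
```

Under a hard implementation, any violation is rejected outright; under this soft version, a profile with a slightly negative entry merely pays a penalty proportional to the squared deviation.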

In this contribution, a technique is proposed to create a data-driven interpretation of a given chemometric analysis of a Raman dataset. In real-world applications, the chemometric analysis is fixed by some external requirement, for example, a legal standard or a set of fixed goals, so the exact chemometric workflow is determined by those goals. However, further optimization, for example of the measurement itself, relies on an interpretation of the resulting chemometric analysis. For this purpose, a data-driven analysis of the chemometric analysis itself has to be carried out. This contribution achieves that goal by combining two methods. The first is the calculation of a so-called importance map, which quantifies the importance of every channel for a given model and dataset. The importance map is constructed from the complete result of an out-of-bag (OOB) validation, measuring the decrease in accuracy when individual channels are randomized. The second is the growing of an optimal decision tree based on the behavior of the model used for the chemometric analysis. In this way, a clustering is obtained, on which the optimal decision tree is grown using binary classifiers. This tree can be interpreted as dividing the whole dataset into meta-clusters. Combining these techniques yields a new way of interpreting datasets based on the chosen model, closing the gap between chemometric analysis and the need for interpretation. Copyright © 2016 John Wiley & Sons, Ltd.
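The importance map is in the spirit of permutation importance: randomize one channel at a time and record how much the model's accuracy drops. A generic sketch for any fitted classifier exposed as a `predict` function (the OOB bookkeeping of the original method is not reproduced here):

```python
import numpy as np

def importance_map(predict, X, y, rng, n_repeats=5):
    """Per-channel decrease in accuracy when that channel is permuted."""
    base = np.mean(predict(X) == y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # randomize channel j only
            drops.append(base - np.mean(predict(Xp) == y))
        imp[j] = np.mean(drops)
    return imp
```

Channels the model never uses show an importance near zero, while channels driving the classification show a clear accuracy drop; plotted against the Raman axis, this is the importance map.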

Leishmaniasis is a disease caused by protozoan parasites belonging to the genus Leishmania. It causes morbidity and mortality in tropical and subtropical regions. Current drugs are toxic, expensive, and require long-term treatment. Thus, the identification and development of novel, cheap, efficient, and safe antileishmanial compounds as drug candidates is important from a pharmaceutical point of view. Quantitative structure-activity relationship (QSAR) methods are used to predict the pharmaceutically relevant properties of drug candidates whenever applicable. The aim of this study was to use two different techniques, namely multiple linear regression (MLR) and artificial neural networks (ANNs), to predict the antileishmanial activity (i.e. pIC_{50}) of 5-(5-nitroheteroaryl-2-yl)-1,3,4-thiadiazole derivatives. To this end, genetic algorithm-coupled partial least squares and backward multiple regression methods were used to select a number of calculated molecular descriptors for the MLR- and ANN-based QSAR studies. The predictive power of the models was assessed using leave-one-out and leave-group-out cross-validation. Molecular modeling studies were also conducted on DNA topoisomerase I to identify the binding interactions responsible for the antileishmanial activity of these analogs. The results suggest that hydrogen bonding interactions and several hydrophobic interactions of the ligands with the active site of *Leishmania major* topoisomerase IB are responsible for their potent antileishmanial activity. These results can be exploited for structure-based computer-aided design of new and selective leishmanial topoisomerase inhibitors. Copyright © 2016 John Wiley & Sons, Ltd.
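The leave-one-out validation used for such QSAR models can be sketched for the MLR case as the cross-validated q² statistic: refit the regression with each compound held out, predict it, and compare the accumulated squared prediction errors (PRESS) with the total sum of squares:

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for multiple linear regression."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])  # add intercept column
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i           # hold compound i out
        beta, *_ = np.linalg.lstsq(X1[mask], y[mask], rcond=None)
        press += (y[i] - X1[i] @ beta) ** 2
    return 1.0 - press / ((y - y.mean()) ** 2).sum()
```

A q² near 1 indicates a genuinely predictive model, while a q² near or below zero indicates that the descriptors predict held-out activities no better than the mean.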