## INTRODUCTION

A common idea underlying the many statistical techniques available to industrial practitioners engaged in process improvement and modelling is information extraction. Obtaining useful information from data is not a new idea; it is an important part of scientific investigation and business decision making. To obtain information from data for process improvement, the data must first be collected and then analysed. With modern data acquisition and storage systems, ever larger volumes of data are becoming accessible. Both the volume of available data and the number of variables that can be used for model development can be difficult to manage.

The practitioner is often faced with a multitude of choices and no shortage of references on how to proceed with data analysis for a particular industrial project. For example, one may start with early references such as Shewhart (1980), who outlined a set of techniques that could be used to improve process operations and to better inform workers about how the underlying mechanisms of their processing plants operate. The empirical modelling methods described by Deming (1943) show how data and models can be used together to develop adjustments that account for measurement error. As a result, the adjusted data can better represent the underlying process from which they were measured. The improvement tools presented by Ishikawa (1976) and by the Western Electric Company (1956) can be applied to many industries and still serve as basic references for fault detection, monitoring, and process improvement. The idea of continually improving a process was also described by Box (1957), who gives a procedure for finding where the best operating point for a plant may lie. The list continues to the present day, with many similar references and descriptions of possible applications readily available (Mitra, 1998).

Although the methods described above are generally applicable, they do not deal directly with many practical aspects of data analysis. When analysing industrial data, it quickly becomes apparent that the large number of variables typically available can become problematic if not treated properly. Problems in analysis can arise from the condition and structure of the data themselves. Dependencies and collinearity in the data set can produce erroneous results if they are not accounted for. Many references, such as the works by Draper and Smith (1981) and Morrison (1990), discuss this topic in varying degrees, and a particularly lucid and informative description is given by Box et al. (1973). In that work the authors describe some possible types of dependencies in a multivariate data set, how the dependencies can arise, how they can be identified, and how they can be dealt with to improve a model. The diagnostic technique advocated by the authors involves computing the eigenvectors and eigenvalues of the covariance matrix of a data set. The information provided by this decomposition of the covariance matrix is the key to their analysis and to their suggested course of action. Although the examples presented focus on model fitting, the methods can be applied to data analysis in general, especially for process monitoring and improvement.
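To make the eigenvalue diagnostic concrete, the following is a minimal sketch and not the procedure of Box et al. (1973) itself. It assumes a synthetic data set in which one variable is an exact sum of two others. The function name `covariance_eigen_diagnostic` and the relative tolerance `rel_tol` are hypothetical choices made for illustration only.

```python
import numpy as np


def covariance_eigen_diagnostic(X, rel_tol=1e-8):
    """Eigen-decompose the sample covariance matrix of X (rows = observations).

    Returns the eigenvalues in ascending order, the matching eigenvectors as
    columns, and a boolean mask marking eigenvalues that are negligible
    relative to the largest one. rel_tol is an illustrative threshold, not a
    value taken from the cited literature.
    """
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)            # variables are in columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending order
    near_zero = eigvals < rel_tol * eigvals.max()
    return eigvals, eigvecs, near_zero


# Synthetic example (assumed data): the third variable is x1 + x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2
X = np.column_stack([x1, x2, x3])

eigvals, eigvecs, near_zero = covariance_eigen_diagnostic(X)

# The eigenvector paired with the negligible eigenvalue gives the
# coefficients of the linear relationship among the variables.
dependency = eigvecs[:, near_zero][:, 0]
```

In this sketch the eigenvector associated with the negligible eigenvalue is proportional to (1, 1, -1), which recovers the relation x1 + x2 - x3 = 0. With real plant data the dependencies are rarely exact, so the threshold would have to be judged against the measurement error in the variables.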

The concept of analysing the covariance structure to provide information on multivariate data sets has been used in many data-driven projects. Some useful references from Professor MacGregor and his colleagues at McMaster University are listed here. These references are significant because they focus on applications and provide practical means of dealing with many of the difficulties in analysis that practitioners may face. A description of how multivariate statistical methods can be used to develop process monitoring methods for steady-state processes is given by Kresta et al. (1991). The authors also show how predictive models can be developed using partial least squares (PLS). In their work, multivariate statistics are suggested for process monitoring and model prediction as a way to deal with correlations among the variables. The important topic of developing models with incomplete data sets is discussed by Nelson et al. (1996). The authors describe a means of using the decomposition of the covariance matrix to fill in missing data values, thereby making models used for on-line process monitoring more robust. The development of monitoring methods for batch processes was described by Nomikos and MacGregor (1994). Their article suggests how to form data matrices in a way that can be applied to process fault detection and monitoring. Once formed, the data matrix is decomposed using singular vectors and singular values derived from principal components analysis (PCA). In addition, the application of image analysis outlined by Yu et al. (2003) has provided useful insight into how multivariate analysis of images can be performed to diagnose operational problems and to find opportunities for improvement. These methods can be particularly useful in processes that are difficult to measure using traditional instrumentation. Lastly, summaries of various applications of multivariate statistical modelling techniques were given by Kourti (2002, 2005). These summaries provide insight into the many useful features of multivariate modelling methods, in particular those that are useful in industrial settings.
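As an illustration of how a data matrix can be decomposed with PCA, the sketch below computes loadings and scores from the singular value decomposition of a mean-centred matrix. It is a generic example under stated assumptions and does not reproduce the method of any of the cited articles. The data are synthetic, the function names `pca_svd` and `squared_prediction_error` are hypothetical, and the use of the squared prediction error as a monitoring statistic is an assumption added here for illustration. How the data matrix is formed for batch processes is not shown.

```python
import numpy as np


def pca_svd(X, n_components):
    """PCA of X (rows = observations) via SVD of the mean-centred matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T                    # (n_variables, k)
    scores = U[:, :n_components] * s[:n_components]   # (n_observations, k)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return mean, loadings, scores, explained


def squared_prediction_error(x_new, mean, loadings):
    """Squared norm of the part of x_new that the PCA model does not explain."""
    r = np.asarray(x_new, dtype=float) - mean
    residual = r - loadings @ (loadings.T @ r)
    return float(residual @ residual)


# Synthetic example (assumed data): five correlated variables driven by
# two latent factors plus a small amount of measurement noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

mean, loadings, scores, explained = pca_svd(X, n_components=2)

# An observation that breaks the correlation structure leaves a much
# larger residual than one that is consistent with the model.
spe_normal = squared_prediction_error(X[0], mean, loadings)
spe_fault = squared_prediction_error(
    X[0] + np.array([0.0, 0.0, 0.0, 0.0, 5.0]), mean, loadings
)
```

Here two components capture nearly all of the variance because the five variables were generated from two latent factors. In practice the number of components would need to be chosen, for example by cross-validation, and the variables would usually be scaled before the decomposition.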