Process data suffer from many different types of imperfections, for example, bad data due to sensor problems, multi-rate sampling, outliers, compressed data, etc. Since most modelling and data analysis methods are developed to analyze regularly sampled and well-conditioned data sets, there is a need for pre-treatment of data. Traditionally, data conditioning or pre-treatment has been done without taking into account the end use of the data; for example, univariate methods have been used to interpolate bad data even when the intended end use of the data is multivariate analysis. In this paper we consider pre-treatment and data analysis as a collective problem and propose data conditioning methods in a multivariate framework. We first review classical process data analysis methods and well-established missing data handling techniques used in statistical surveys and biostatistics. The applications of these missing data techniques are demonstrated in three different instances: (i) principal component analysis (PCA) is extended in a data augmentation (DA) framework for dealing with missing values, (ii) an iterative missing data technique is used to synchronize uneven-length batch process data, and (iii) a PCA-based iterative missing data technique is used to restore the correlation structure of compressed data.
Although the process industry is well equipped with sensors, measurements can get corrupted due to failure of the measurement device and/or errors in data management. Common causes that lead to bad data are: sensor breakdown, measurement outside the range of the sensor, data acquisition system malfunction, power blackouts, computer system crashes, interruption of transmission lines, wrong format in logged data, glitches in data management software, data recording errors, etc. Some of these phenomena are illustrated in Figure 1, which shows a pressure signal in which the measurement occasionally exceeded the range of the sensor, resulting in bad values. Sometimes data may not be available at the required time interval because of the nature of the sensors or the sampling strategy; for example, analyzer readings of composition have a lower sampling frequency than measurements such as temperature, flow rate, and pressure. Figure 2 shows a composition measurement from an analyzer with a 15 min sampling interval, where the rest of the variables are available at much higher frequency and logged every minute. In the distributed control system (DCS) a “zero order hold” is applied to the measurements and interpolated values are supplied to the controller at every 1 min interval. Measurements with different sampling rates are often collected in a single data matrix for monitoring or identification purposes and referred to as multi-rate data. If the interpolated points of the slowly sampled variable are taken out, the problem has a similar appearance to a missing data problem. In the case of lab analysis, where samples need to be collected manually, the measurement frequency may be completely irregular and asynchronous, and there will also be a time lag between the instant the measurement is available and the actual sampling time.
In addition to these problems, there are many other data problems which are conventionally not considered missing data problems but to which missing data handling techniques may be applied, for example, synchronizing uneven-length batch process data and restoring the correlation of compressed data (Figure 3).
In this paper we treat these diverse problems under the general framework of “missing data” problems. Missing data analysis is a well-established area in statistical surveys and has wide applicability. However, there are major differences between data analysis in statistical surveys and in the process industry, and these should be taken into consideration in the treatment of missing data. In statistical surveys the data collector and the analyst are two different entities and the collected data end up in a large database. Typically the data collector has access to more information than the analyst. In reconstructing or filling in missing values, the data collector uses this additional information and provides a complete made-up data set, so that a variety of potential users can reach correct conclusions irrespective of the analytical tools. In contrast, process data are automatically archived in the data historian and typically the engineer/analyst is the only user of the data. It is not important to give the data a superficial make-up; rather, it is important to treat missing data in a way that is commensurate with the data analysis and the intended use of the data. Unlike other methods, the proposed technique considers the missing data problem in a multivariate framework. Therefore, the objectives of this study are to: (i) establish the link between statistically rigorous missing data handling techniques and process data analysis, so that the process engineering community can make use of these methods, (ii) extend some of the commonly used process data analysis tools using these formal methods for building models from data matrices with missing values, and (iii) implement novel applications of missing data handling techniques towards solving problems which are apparently not missing data problems and yet can be treated as such.
The organization of the paper is as follows: the second section introduces the reader to the different types of missing data commonly encountered during process data analysis; process missing data are classified based on concepts developed in the statistics literature. The third section is a tutorial overview of likelihood-based advanced concepts of missing data treatment techniques. In the fourth section these advanced concepts are used to extend PCA-based methods for dealing with missing data. The fifth and sixth sections describe the application of missing data handling techniques to synchronize uneven-length data from batch processes and to restore the correlation structure of compressed data, respectively. The paper ends with concluding remarks in the seventh section.
CLASSIFICATION OF MISSING DATA
Over the last three decades a vast body of work has emerged in the statistics literature, with applications in biostatistics and chemometrics, to deal with missing data. Several important definitions and concepts have been developed to systematically analyze and treat missing data. In this section we explain these concepts from a process data analysis perspective.
Patterns of Missing Data
Sometimes it is beneficial to sort the columns containing missing values into orderly patterns. Historically, survey methodologists have classified different patterns of missing data encountered in surveys. For example, unit nonresponse, which occurs when the entire data collection procedure fails (because the sampled person is not at home or refuses to participate, etc.), and item nonresponse, which means partial data are available (e.g., the person participates but does not respond to certain individual items), are two such classifications. In longitudinal studies (e.g., for drug trials) there are dropouts towards the end for different reasons, and the data collected from subjects are of uneven length. Such data can be ordered in a monotone pattern (Little and Rubin, 2002). In process data analysis, missing values are encountered for reasons different than in surveys; however, the patterns can be very similar to those found in a survey. The different patterns commonly encountered in process data are shown graphically in Figure 4, where Y denotes an (N × k) rectangular data set without missing values. The ith row is denoted by yi and any element yij is the value of variable Yj for the instant or sample point i. Figure 4a is an example where all the missing values belong to one variable. In process industries this typically happens when a sensor breaks down for a long period of time. Figure 4b is an example of unit nonresponse; an analogous situation in process industries is when the process is down due to a fault condition (e.g., sheet-break in a pulp and paper mill) and the only available information is the time stamps. This poses a problem when building a dynamic process model. Figure 4c is a general pattern where the short missing stretches may be due to outliers removed for robust analysis and the long periods can be due to sensor downtime. The orderly pattern of missing values shown in Figure 4d is a unique signature of multi-rate data.
The variable with missing values may be a quality variable, such as a concentration, which is often measured less frequently in process industries, for example, through laboratory sampling or by an on-line gas chromatograph at a slower sampling rate.
Missing Data Mechanism
Missing values occur for reasons beyond our control. In statistical surveys the data analysts often do not know what may have caused the data to be missing. However, for analysis purposes assumptions are made about the reason for missing data. These assumptions are usually untestable. If the assumptions are good then similar conclusions will follow from a variety of realistic alternative assumptions. Rubin (1976) laid out a probabilistic framework for the missingness mechanism and obtained the weakest condition under which it is appropriate to ignore the mechanism that may have caused the missing data. The mechanisms that lead to missing values in process data are limited. Therefore, it is often not necessary to perform tests to determine the mechanism of missingness. The causes that lead to missing values in process data can often be traced back by looking at the pattern of missing data. Additional information can be found in the process operators' log book or by looking at other measurements of the control loop. For example, the valve positions of the control loop will indicate if there was a valve saturation. From a process data analysis perspective, the classification of missingness mechanisms can serve as a guideline to decide if an analysis based on the observed data is valid for the complete data set. Here we examine different causes of missing values in process data and classify them into different missingness mechanisms.
Let Y = (yij) denote an (N × k) rectangular data set in which some of the values are missing. Missing values are denoted by Ymis and observed values by Yobs. In concise form the data matrix is represented as Y = [Yobs, Ymis]. This is a general notation used in the missing-data literature, which is also followed in this paper. It does not mean that observed and missing data are in two separate blocks; rather, missing values are distributed all over the data matrix. For any data set, a matrix M = (mij), referred to as the missingness matrix, identifies what is known and what is missing. Each element of M is a binary indicator of whether yij is observed (mij = 1) or missing (mij = 0). In the statistics literature, missingness is treated as a random phenomenon. The distribution of M, called the missingness mechanism, is characterized by the conditional distribution of M given Y, p(M|Y, ϕ), where ϕ denotes parameters unrelated to Y. Based on this conditioning, missingness mechanisms have been classified into three classes (Rubin, 1976):
(1) Missing completely at random (MCAR): Missingness does not depend on any part of the data, either missing or observed, that is, p(M|Y, ϕ) = p(M|ϕ) for all Y, ϕ.
This does not mean that the pattern has to be random; rather, the pattern does not depend on the values. Under this condition missing data are not systematically different from observed data, and for reconstruction of the missing values it is not necessary to include the missingness mechanism. Some examples from process data would be regularly or irregularly sampled multi-rate data, missing data due to sensor failure, etc.
(2) Missing at random (MAR): This is a less restrictive assumption than MCAR and the weakest condition under which the missingness mechanism can be safely ignored. In this case missingness depends only on the observed component, Yobs, and not on the missing component, Ymis, that is, p(M|Y, ϕ) = p(M|Yobs, ϕ) for all Ymis, ϕ.
For example, in a case where quality measurements are costly and time consuming, the condition variables are measured regularly and the quality variables are measured only when these condition variables indicate that the process is closer to the specification limit. Thus missing values are not systematically different from normal observed values and model based on normal data can be used to estimate the quality variables.
(3) Non-ignorable mechanism (NI): If the mechanism of missingness depends on both the observed and the missing part of the data then the mechanism is non-ignorable. This is the most restrictive assumption. In this case the underlying reason that caused the missing data has an effect on the inference and has to be included in any analysis. For example, process data are sometimes not recorded because values are outside the range of the sensor. Most chemical processes are inherently nonlinear and the linearity assumption is valid only in a narrow region. Therefore, outside the sensor range process behaviour may be systematically different from normal behaviour, and models based on the observed data are not valid in that region.
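The missingness matrix M defined above can be constructed directly from a data matrix in which missing entries are marked, for example, as NaN. A minimal Python sketch (the 4 × 3 example values are hypothetical):

```python
import numpy as np

# Hypothetical 4x3 data matrix; NaN marks a missing entry
Y = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 11.0, 12.0]])

# Missingness matrix M: m_ij = 1 if y_ij is observed, 0 if missing
M = (~np.isnan(Y)).astype(int)
```

Here M plays exactly the role of the indicator matrix in the text: it records the pattern of missingness, while the mechanism is a separate modelling assumption.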
REVIEW OF MISSING DATA HANDLING METHODS
Whenever data analysts come across missing values they adopt different methods to give the data a complete look. Some of the historical methods for data analysis are complete case analysis, available case analysis, single imputation, etc. Details of these methods can be found in Little and Rubin (2002). Some of these methods have been formalized and appear in widely used statistical software. In this section we review two likelihood-based methods, expectation maximization and data augmentation, which have a stronger theoretical basis and more general applicability. We also describe another widely used method, multiple imputation, mostly used for sensitivity analysis of the missing data replacements.
Expectation Maximization
The expectation maximization (EM) algorithm is a general method for obtaining maximum likelihood estimates of parameters from incomplete data. The EM algorithm is based on the intuitive idea of estimating the missing values and iteratively re-estimating the parameters using the estimated missing values. The origin of the EM algorithm has been traced back to Fisher (1925) and McKendrick (1926). Some of the important contributions along the way were made by Hartley (1958), Baum et al. (1970), Orchard and Woodbury (1972) and Sundberg (1976). Baum et al. (1970) proved the monotone convergence of the EM algorithm, and Sundberg (1976) provided an easily understandable theory underlying the EM algorithm and illustrated its application with several iterative examples (Hartley, 1958; Baum et al., 1970). However, the popularity of the method is due to the seminal paper of Dempster et al. (1977), who also coined the term EM. As pointed out by VanDyk and Meng (2001), this paper made two main contributions which popularized the method. First, it gave the algorithm an informative title identifying the key steps, the Expectation step (E-step) and the Maximization step (M-step). Second, it demonstrated how the algorithm can be implemented to solve a wide class of problems, some of which had not previously been viewed as missing-data problems, for example, factor analysis (VanDyk and Meng, 2001). The implementation steps of the EM algorithm are described below. Let Y denote the complete data matrix with density p(Y|θ), where θ denotes the model parameters. If Y were observed completely, the objective would be to maximize the complete-data likelihood function of θ,

L(θ|Y) = p(Y|θ)
In the presence of missing data, however, only part of Y is observed. In a convenient but imprecise notation we write Y = (Yobs, Ymis), where Yobs is the observed part of the data and Ymis denotes the unobserved or missing part. For simplicity we assume that data are missing at random (MAR), so the likelihood for θ based on the observed data is

Lobs(θ|Yobs) = ∫ p(Yobs, Ymis|θ) dYmis
Because of the integration, maximizing Lobs can be difficult even when maximizing L is trivial. The EM algorithm maximizes Lobs by maximizing the expected value of the complete-data log-likelihood. The log-likelihood of the observed data increases with each iteration of the EM algorithm until converging to a local or global maximum (Dempster et al., 1977). The rate of convergence is directly related to the amount of unobserved information in the data matrix, that is, convergence becomes slow with a greater amount of missing data. The algorithm starts with some initial value of the parameters, θ(0), and iterates between the following two steps:
Expectation Step: In the E-step we find the expectation of the logarithm of the complete-data likelihood given the observed data and the current estimate of the parameters,

Q(θ|θ(t)) = E[log L(θ|Y) | Yobs, θ(t)]
Maximization Step: In the M-step we find the value θ(t+1) that maximizes Q(θ|θ(t)), such that

Q(θ(t+1)|θ(t)) ≥ Q(θ|θ(t))  for all θ
Although the general theory of EM applies to any model, it is particularly useful for data which come from an exponential family of density functions. In such a case the E-step reduces to finding the expected value of the sufficient statistics of the complete-data likelihood. If a closed-form solution for the parameters is not available, the M-step becomes complicated and the simplicity of the algorithm is compromised at the implementation stage. Here we explain the steps of the EM algorithm for the simplest case, parameter estimation of a univariate Gaussian distribution.
Example: Suppose y1, …, yn are random samples from a univariate normal distribution with mean µ and variance σ². We write y = (yobs, ymis), where y represents the random sample of size n, yobs is the set of observed values and ymis the missing data. The log-likelihood based on the complete data is:

l(µ, σ²|y) = −(n/2) log(2πσ²) − (1/(2σ²)) Σi (yi − µ)²
Expanding the quadratic term gives l(µ, σ²|y) = −(n/2) log(2πσ²) − (1/(2σ²)) (Σi yi² − 2µ Σi yi + nµ²), so the sufficient statistics of the log-likelihood function are s1 = Σi yi and s2 = Σi yi². The log-likelihood is linear in the sufficient statistics. If all the data values are available, the closed-form solutions for the parameters are µ̂ = s1/n and σ̂² = s2/n − µ̂². Following exactly the above description, the steps of the algorithm are:
E-step: The expected value of the log-likelihood function over the distribution of the missing values is calculated in this step,

E[l(µ, σ²|y) | yobs, θ(t)] = −(n/2) log(2πσ²) − (1/(2σ²)) (E[s2|yobs, θ(t)] − 2µ E[s1|yobs, θ(t)] + nµ²)    (9)
M-step: In this step the expectation of the log-likelihood is maximized with respect to the parameters,

(µ(t+1), σ²(t+1)) = arg max over (µ, σ²) of E[l(µ, σ²|y) | yobs, θ(t)]    (10)
Clearly it is evident from Equations (9) and (10) that calculation of the expectation of the log-likelihood is not necessary. Rather in the E-step we need to calculate the expectation of the sufficient statistics of the log-likelihood. Therefore the calculations become much simpler.
First, in the E-step the expected values of the sufficient statistics s1 = Σi yi and s2 = Σi yi² are calculated as follows:

E[s1|yobs, θ(t)] = Σobs yi + (n − nobs) µ(t)    (11)

E[s2|yobs, θ(t)] = Σobs yi² + (n − nobs) (µ(t)² + σ²(t))    (12)

where nobs is the number of observed values and the sums Σobs run over the observed data.
Substituting these expected values, we obtain the estimates of the parameters in the M-step,

µ(t+1) = E[s1|yobs, θ(t)]/n,  σ²(t+1) = E[s2|yobs, θ(t)]/n − (µ(t+1))²    (13)
In the EM algorithm, “missing data” are not directly replaced in the log-likelihood function; rather, the expected values of the “sufficient statistics of the likelihood” are substituted into the function. Simple substitution of µ(t) for the missing values would lead to omission of the term σ²(t) in the expectation of the sufficient statistic s2. This is the main difference between the EM algorithm and other naive methods such as substitution of “estimated missing values” and re-estimation of parameters. For each imputation of the missing value, a correction is also made in the covariance so that the error and covariance structures remain the same.
If the sufficient statistics of the log-likelihood function are linear in the data (e.g., multinomial distribution) then the E-step simply reduces to estimation of the conditional expectation of the missing values, and naive methods such as substitution of the estimated “missing value” and re-estimation of parameters are equivalent to the EM algorithm.
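The E- and M-steps for the univariate Gaussian example can be sketched in a few lines. The following Python sketch assumes MAR missing values; the function name and the small numerical example are our own, not from the paper:

```python
import numpy as np

def em_normal(y_obs, n_total, tol=1e-10, max_iter=500):
    """EM for the mean and variance of a univariate normal sample in
    which n_total - len(y_obs) values are missing (MAR assumed)."""
    n_obs = len(y_obs)
    n_mis = n_total - n_obs
    mu, sig2 = 0.0, 1.0                       # arbitrary starting values
    s1_obs = y_obs.sum()                      # observed part of sum y_i
    s2_obs = (y_obs ** 2).sum()               # observed part of sum y_i^2
    for _ in range(max_iter):
        # E-step: expected sufficient statistics given current parameters;
        # note the sigma^2 term in E[y_i^2] = mu^2 + sigma^2
        s1 = s1_obs + n_mis * mu
        s2 = s2_obs + n_mis * (mu ** 2 + sig2)
        # M-step: closed-form MLEs from the completed statistics
        mu_new = s1 / n_total
        sig2_new = s2 / n_total - mu_new ** 2
        converged = abs(mu_new - mu) < tol and abs(sig2_new - sig2) < tol
        mu, sig2 = mu_new, sig2_new
        if converged:
            break
    return mu, sig2

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
mu, sig2 = em_normal(y_obs, n_total=6)
# For MAR univariate data the iteration converges to the observed-data MLE
```

As expected from the discussion above, the fixed point reproduces the observed-data maximum likelihood estimates; the missing values contribute nothing beyond slowing convergence.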
The EM algorithm has two major limitations: (i) in some cases with a large fraction of missing values it can be very slow to converge; and (ii) in cases where the M-step is difficult (e.g., does not have a closed form), the theoretical simplicity of the algorithm does not translate into practical simplicity. Two types of extensions of the EM algorithm have been proposed to speed up convergence. The first type, which is more like the EM algorithm, retains the monotone convergence properties of EM by keeping the E-step unchanged and mostly modifying the M-step of the algorithm. The basic idea is to replace the M-step with several conditional maximization steps when a closed form for the M-step is not available. Several methods have been developed along this line; expectation conditional maximization (ECM) (Meng and Rubin, 1993), expectation conditional maximization either (ECME) (Liu and Rubin, 1994), alternating expectation conditional maximization (AECM) (Meng and van Dyk, 1997), and parameter-expanded EM (PX-EM) (Liu and Rubin, 1994) are some of the notable extensions. The other type is based on the idea of speeding up the algorithm by combining it with Newton–Raphson type updates, commonly known as the hybrid EM algorithm (Jennrich and Sampson, 1966; Laird and Ware, 1982). However, these algorithms can be categorized under ECME as well.
Data Augmentation
The term Data Augmentation refers to methods for iterative optimization or sampling algorithms via the introduction of unobserved data or latent variables. In the statistics literature data augmentation was made popular by Tanner and Wong for computing posterior distributions of parameters (Tanner and Wong, 1990). A similar algorithm was developed by Li (1985) but from a different perspective. Important methodological and theoretical papers on Data Augmentation include Damien (1999), Higdon (1998), Mira and Tierney (1997), Neal (1997), Roberts and Rosenthal (1997) and VanDyk and Meng (2001). In the physics literature Data Augmentation is referred to as the method of auxiliary variables (Swendsen and Wang, 1987). Auxiliary variables are adopted to improve the speed of simulation; important contributions include those by Edwards and Sokal (1988).
If the missing data mechanism is ignorable then all the relevant information about the parameters is contained in the observed-data likelihood L(θ|Yobs) or the observed-data posterior p(θ|Yobs). Except for some special cases, these tend to be complicated functions of θ, and extracting summaries like parameter estimates requires special computational tools; Data Augmentation (DA) can be very useful in this respect. The basis of Data Augmentation is Bayes' rule for the joint density:

p(θ, Ymis|Yobs) = p(θ|Ymis, Yobs) p(Ymis|Yobs)
Integrating both sides over the missing data space gives the desired posterior density:

p(θ|Yobs) = ∫ p(θ|Ymis, Yobs) p(Ymis|Yobs) dYmis    (15)
If m values Ymis(1), …, Ymis(m) are sampled from the posterior distribution p(Ymis|Yobs), then in discrete form Equation (15) can be written as,

p(θ|Yobs) ≈ (1/m) Σj p(θ|Ymis(j), Yobs)    (16)
Similarly, writing the joint density of Ymis and θ given Yobs as p(Ymis, θ|Yobs) = p(Ymis|θ, Yobs) p(θ|Yobs) and integrating both sides over the parameter space gives the posterior predictive density of the missing data:

p(Ymis|Yobs) = ∫ p(Ymis|θ, Yobs) p(θ|Yobs) dθ    (18)
Equations (16) and (18) suggest an iterative scheme. The key idea behind Data Augmentation is to solve the incomplete-data problem by repeatedly solving the tractable complete-data problem. In data augmentation Yobs is augmented by an assumed value of the Ymis. The resulting complete-data posterior becomes much easier to handle. The solution is further improved by the iterative implementation via the following two steps:
Imputation Step: Given a current guess θ(t) of the parameters, a value of the missing data is first drawn from the conditional predictive distribution of Ymis,

Ymis(t+1) ~ p(Ymis|Yobs, θ(t))
Posterior Step: Conditioned on Ymis(t+1), a new value of θ is drawn from its complete-data posterior,

θ(t+1) ~ p(θ|Yobs, Ymis(t+1))
Repeating the above steps from a starting value θ(0) yields a stochastic sequence {(θ(t), Ymis(t))} whose stationary distribution is p(θ, Ymis|Yobs), and the subsequences {θ(t)} and {Ymis(t)} have p(θ|Yobs) and p(Ymis|Yobs) as their respective stationary distributions. For a value of t that is sufficiently large, θ(t) can be regarded as an approximate draw from p(θ|Yobs) and Ymis(t) as an approximate draw from p(Ymis|Yobs). Data Augmentation may be viewed as a stochastic counterpart of EM, where the Imputation step is similar to the Expectation step and the Posterior step is equivalent to the Maximization step of the EM algorithm (VanDyk and Meng, 2001). The Data Augmentation method is very closely related to an iterative method known as the Gibbs' Sampler. The Gibbs' Sampler is a Markov chain Monte Carlo method used to generate samples from the joint distribution of a set of variables when it is difficult to sample from the joint distribution directly, but relatively easy to sample from the conditional distributions. In this respect, Data Augmentation may be viewed as a Gibbs' Sampler with two parameter vectors (i.e., one containing the model parameters and the other containing the missing values) (Gelman et al., 2004). The main advantages of Data Augmentation are that it is intuitive, the steps are easy to follow, and implementation is easy for a wide variety of problems. The method also has good convergence properties.
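The two DA steps can be sketched for the univariate normal model used in the EM example. This Python sketch assumes a diffuse (Jeffreys-type) prior, so that the complete-data posterior takes the standard scaled inverse chi-square / normal form; the function name, prior choice, and simulated data are our own, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def data_augmentation(y_obs, n_mis, n_draws=2000):
    """I-step / P-step sampler for a univariate normal with n_mis
    missing values, under a diffuse prior p(mu, sigma^2) ~ 1/sigma^2."""
    n = len(y_obs) + n_mis
    mu, sig2 = y_obs.mean(), y_obs.var() + 1e-6   # starting values
    draws = []
    for _ in range(n_draws):
        # I-step: draw missing values from their conditional predictive
        y_mis = rng.normal(mu, np.sqrt(sig2), size=n_mis)
        y = np.concatenate([y_obs, y_mis])
        # P-step: draw parameters from the complete-data posterior:
        # sigma^2 | y ~ scaled inverse chi-square, mu | sigma^2, y ~ normal
        ybar = y.mean()
        ss = ((y - ybar) ** 2).sum()
        sig2 = ss / rng.chisquare(n - 1)
        mu = rng.normal(ybar, np.sqrt(sig2 / n))
        draws.append((mu, sig2))
    return np.array(draws)

y_obs = rng.normal(5.0, 2.0, size=50)
draws = data_augmentation(y_obs, n_mis=10)
```

After a burn-in period, the draws of (mu, sig2) approximate the observed-data posterior; note that, unlike conditional-mean imputation, each imputed value carries random error, which is what preserves the variance.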
Remarks: The main idea of the EM and Data Augmentation algorithms is to impute values in such a way that the statistical properties of the data set do not change. For example, in the case of multivariate normal data the mean and covariance of the data matrix contain all the information in the data. Simple imputation of the conditional mean may lead to a distorted covariance matrix. The EM algorithm makes the correction to the covariance matrix directly in the expectation step. Data Augmentation, on the other hand, makes the correction in the imputed value: instead of imputing the conditional mean, appropriately scaled random errors are added to the conditional mean, and samples from the conditional distribution are used as imputed values. This indirectly makes the correction, that is, it prevents the deflation of the covariance matrix.
Multiple Imputation
The basic idea behind multiple imputation is to assess the additional variability introduced by the imputation of missing values (Rubin, 1977, 1978). The main feature of multiple imputation is that for each missing data point several values (for instance, m samples from the conditional distribution) are imputed, giving m complete data sets. Each complete data set is analyzed using a standard complete-data procedure just as if the imputed values were real data. In a survey setting this is most appropriate because the data collector and the analyst are often two different entities and the data collector has more information than is reported in the database. Based on the additional information the database constructor can consider different imputation models and use them to fill in the data. The analyst then has the opportunity to analyze all of these different data sets and use them for sensitivity analysis.
The simplest method for combining the results of the m analyses is Rubin's rule (Rubin, 1987). Suppose that Q represents a population quantity (e.g., a regression coefficient) to be estimated. Let Q̂ and √U denote the estimate of Q and its standard error that one would use if no data were missing. The method assumes that the sample is large enough that (Q − Q̂)/√U has approximately a standard normal distribution, so that Q̂ ± 1.96√U has approximately 95% coverage.
In the presence of missing data, m different data sets are created using multiple imputation (MI); subsequently there are m different estimates of Q and U, denoted (Q̂(j), U(j)), j = 1, …, m. Rubin's overall estimate is simply the average of the m estimates,

Q̄ = (1/m) Σj Q̂(j)
The uncertainty in Q̄ has two parts: the average within-imputation variance,

Ū = (1/m) Σj U(j)
and the between-imputation variance,

B = (1/(m − 1)) Σj (Q̂(j) − Q̄)²
The total variance is a modified sum of the two components,

T = Ū + (1 + m⁻¹) B
where the factor (1 + m⁻¹) is a correction for the finite number of imputations. The relative increase in variance due to nonresponse is given by the ratio

r = (1 + m⁻¹) B / Ū
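Rubin's combining rules translate directly into code. A minimal Python sketch (the function name and the numerical values of the m = 5 estimates are hypothetical):

```python
import numpy as np

def rubins_rules(q_hats, u_hats):
    """Combine m completed-data analyses: q_hats are the m point
    estimates of Q, u_hats the m squared standard errors U."""
    m = len(q_hats)
    q_bar = np.mean(q_hats)            # overall estimate: average of the m
    u_bar = np.mean(u_hats)            # average within-imputation variance
    b = np.var(q_hats, ddof=1)         # between-imputation variance
    t = u_bar + (1 + 1 / m) * b        # total variance
    r = (1 + 1 / m) * b / u_bar        # relative increase due to nonresponse
    return q_bar, t, r

q_hats = [2.1, 2.4, 1.9, 2.2, 2.0]     # hypothetical estimates of Q
u_hats = [0.10, 0.12, 0.09, 0.11, 0.10]
q_bar, t, r = rubins_rules(q_hats, u_hats)
```

A large value of r signals that much of the total uncertainty comes from the missing data rather than from ordinary sampling variability.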
One of the main requirements for proper multiple imputation is that the parameters used for estimating the missing values should themselves be sampled from their posterior distribution, to reflect the uncertainty about the parameters of the model. It is therefore natural to motivate multiple imputation from a Bayesian perspective, where estimating the posterior distribution of the parameters is an integral part of the analysis (Schafer and Schenker, 2002). As a result, it is widely accepted that multiple imputation using a Bayesian method of analysis is generally proper (Rubin, 1987). However, with a variety of examples Nielsen (2003) has shown that the Bayesian method does not generally lead to proper multiple imputation, and even in cases when it is proper the Bayesian method may estimate a variance that is either much higher or much smaller than the actual value. In response it has been argued that the examples were pathological cases and that multiple imputation has a self-correcting nature that leads to approximately valid statistical inference (Rubin, 2003; Zhang, 2003).
MISSING DATA HANDLING IN PCA
Missing values in the data matrix pose difficulties at two stages of PCA-based process monitoring: first, in building a model from historical data sets; second, during the monitoring phase, in calculating the scores and the residuals. Methods developed for score calculation in the presence of missing values include trimmed score (TRI), single component projection (SCP), conditional mean replacement (CMR), and projection to model plane (PMP). These are all single-step methods and essentially implementations of Equation (28) (Nelson et al., 1996; Arteaga and Ferrer, 2002). In this study we focus only on the issues that arise during the off-line modelling stage due to the presence of missing values in the data matrix. A brief review of these methods is in order here.
Originally the NIPALS algorithm was used for building principal component models. Christofferson (1970) extended the NIPALS algorithm to find the first and second principal components in the presence of missing values in the data matrix. The method has been generalized for finding multiple PCs in the presence of missing data (Grung and Manne, 1998). It uses a least-squares minimization criterion to estimate the scores t and the loadings p in successive steps. Let ZN×n be the data matrix which contains some missing values, where the missing values are represented by zeroes. If all the missing values were known and YN×n were the complete data matrix, the relation between Z and Y can be conveniently expressed with the help of the missingness indicator MN×n with elements mij = 1 for non-zero zij, and mij = 0 when zij = 0. Consequently the relationship between each element of Z and Y is zij = mij yij. In the latent variable model Y = TPᵀ + E, where T contains the scores and P the loadings, r is the number of retained latent variables. The objective function may be written in the following form:

min Σi Σj mij (zij − Σa tia pja)²    (25)
For the ith row the objective function is given by,
Defining the matrix S(i), z(i) as the ith row of Z, and t(i) as the ith row of T, Equation (26) can be written in the following form:
If the elements of the loadings matrix, pij, are known, the solution to the regression problem is:
Similarly, defining z(j) as the jth column of Z containing elements zij, p(j) as the jth column of P containing elements pij, and the matrix Q(j) analogously, the loadings can be found by ordinary least-squares regression as:
A robust way of extracting the PCs is to use the singular value decomposition (SVD) algorithm. In the presence of missing values in the data matrix, an iterative imputation approach is used to fill the missing values and estimate the PCs. In this study we refer to this algorithm as principal component analysis imputation algorithm (PCAIA) (Grung and Manne, 1998; Troyanskaya et al., 2001; Walczak and Massart, 2001). The algorithm is described below:
(1) Missing values of the data matrix are initially filled with the unconditional means of the variables, that is, with the column averages of Yobs; this gives the augmented data matrix.
(2) SVD is performed on the augmented data matrix. The loading matrix is used to predict the noise-free values.
(3) Missing values are filled with the predicted values, which gives the updated augmented data matrix.
(4) Convergence is monitored by observing the sum of squared errors between the observed values and the corresponding predicted values from step (2).
Steps (2) and (3) are repeated until convergence.
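The iteration described above can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code; the number of retained components r is assumed known, and the convergence tolerance is an arbitrary choice.

```python
import numpy as np

def pcaia(Y, mask, r, tol=1e-8, max_iter=500):
    """Iterative PCA imputation (PCAIA) sketch.
    Y: (N, n) data matrix with missing entries set to 0
    mask: (N, n) boolean, True where observed
    r: number of retained principal components (assumed known)"""
    Ya = Y.astype(float).copy()
    # Step 1: fill missing entries with the column means of observed values
    for j in range(Ya.shape[1]):
        Ya[~mask[:, j], j] = Ya[mask[:, j], j].mean()
    prev_sse = np.inf
    for _ in range(max_iter):
        mu = Ya.mean(axis=0)
        # Step 2: SVD of the augmented matrix; the loadings are the leading
        # right singular vectors; project to get the noise-free predictions
        Vt = np.linalg.svd(Ya - mu, full_matrices=False)[2]
        P = Vt[:r].T
        Yhat = (Ya - mu) @ P @ P.T + mu
        # Step 3: refill the missing entries with the predictions
        Ya[~mask] = Yhat[~mask]
        # Step 4: convergence monitored on the observed entries only
        sse = ((Yhat[mask] - Y[mask]) ** 2).sum()
        if abs(prev_sse - sse) < tol * (1 + sse):
            break
        prev_sse = sse
    return Ya, P
```

On noise-free low-rank data the imputed entries converge to values consistent with the retained-PC model plane.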
Treatments of missing data based on the NIPALS algorithm and on SVD essentially minimize the same least-squares objective function as in Equation (25). The NIPALS algorithm converges at a faster rate than the iterative imputation method; however, the estimated scores are not orthogonal to each other. To obtain orthogonal scores after convergence of the algorithm, SVD is performed on the data matrix with imputed values. The PCAIA algorithm is an iterative implementation of Equations (28) and (29).
Remarks: PCAIA is a pseudo version of the more general EM algorithm. The two major iterative steps of the algorithm, similar to the EM algorithm, are discussed next.
Parameter Estimation step is similar to the Maximization (M-step) of the EM algorithm. The loadings of the PCs, which are the parameters in this case, are calculated from the augmented data matrix in which missing values have been filled with conditional expected values. However, the method is optimal in the least-squares sense and in this respect differs from the maximum likelihood estimates obtained in EM.
Missing Value Estimation resembles the Expectation step of the EM algorithm. Using the estimated parameters, the missing values are estimated in this step. These values are used to fill the missing values and thus obtain a better augmented data matrix. In the Expectation step of the EM algorithm missing values are not directly estimated; rather, the expectations of the sufficient statistics of the log-likelihood function are calculated. Therefore, the two methods are only equivalent when the log-likelihood is linear in the data (e.g., binomial distribution), in other words when the sufficient statistics of the log-likelihood are linear functions of the data values.
Limitations of PCAIA
Some of the limitations of PCAIA are discussed below. Methods based on the NIPALS algorithm suffer from similar limitations in the presence of missing data.
Distortion of the covariance structure
Consider a measurement matrix Y. The measurement at sampling instant i, yi (1 × n), can be decomposed as follows:
where εi is the measurement error and xi is the noise-free true value. For building a latent variable model, measurements from a particular section or unit are collected in a data matrix. After collecting N samples we can write it in the following matrix form:
where X contains the noise-free true values and ε is the measurement error matrix. Therefore, the covariance of the measurement matrix Y can be divided into two parts, ΣY = ΣX + Ωε. In filling the missing values the method ignores the random error part of Y: missing values are filled by their conditional expectations. As a result Ωε is underestimated and the estimate of ΣY from such an imputed data matrix is distorted. Since the objective in PCA is to capture the covariance structure of the data, it is important that this information not be distorted. The degree of distortion depends on the percentage of missing data and on the relative magnitude of the measurement error. Moreover, this type of imputation overemphasizes the linear relationships between the variables, so the imputed data set is not suitable for analysis of variance.
Model order selection
The rank of the loading matrix P, or equivalently the number of major PCs in the model, is known as the order of the model. The loadings are given by the eigenvectors or the singular vectors, and the corresponding eigenvalues or singular values indicate the variances explained by the eigenvectors. Ideally one would like to include the minimum number of eigenvectors necessary to explain the total variance of the deterministic part, X. Methods commonly used for model order selection are the SCREE plot, the broken-stick method, cross-validation, significance tests, etc. In selecting the number of PCs, all these methods, with the exception of cross-validation, make use of the fraction of the variance explained by the major PCs relative to the total variance explained by all PCs.
Once the fraction has been calculated, the user decides on the percentage of variance that needs to be captured. The number of PCs necessary to capture the specified variance information will determine the model order.
Distortion of the covariance matrix has a direct impact on the selection of model order. The error variances are attenuated because of missing values in the data matrix, which leads to the shrinkage of the denominator term of the ratio defined in Equation (32). Therefore the percentage of total variance explained by the major PCs will no longer remain constant for a data set; rather, it will depend on the fraction of missing values present in the data matrix.
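The inflation of the explained-variance fraction can be demonstrated with a small numerical experiment. All names and settings below are illustrative, not taken from the paper's simulation: imputing entries with their conditional expectation on the PC plane removes the residual variance at those entries, shrinking the denominator of the ratio.

```python
import numpy as np

def frac_two_pcs(Y):
    """Fraction of total variance captured by the first two PCs."""
    s = np.linalg.svd(Y - Y.mean(0), compute_uv=False)
    return (s[:2] ** 2).sum() / (s ** 2).sum()

rng = np.random.default_rng(2)
T = rng.normal(size=(500, 2))
L = rng.normal(size=(6, 2))
Y = T @ L.T + 0.5 * rng.normal(size=(500, 6))   # rank-2 signal plus noise

f_full = frac_two_pcs(Y)                         # fraction on complete data

# treat 20% of entries as missing and fill them with their conditional
# expectation on the 2-PC plane (the PCAIA-style fill discussed above)
drop = rng.random(Y.shape) < 0.2
mu = Y.mean(0)
V = np.linalg.svd(Y - mu, full_matrices=False)[2][:2].T
Yhat = (Y - mu) @ V @ V.T + mu                   # noise-free reconstruction
Ym = np.where(drop, Yhat, Y)

f_imputed = frac_two_pcs(Ym)                     # inflated relative to f_full
```

The fraction computed from the imputed matrix exceeds the complete-data fraction, which is exactly the distortion that biases variance-based order selection.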
Extending PCA to the Data Augmentation Framework
The limitations of PCAIA arise because the missing values are replaced with their conditional expected values; that is, the measurement errors are ignored during imputation. Depending on the magnitude of the measurement errors, the covariance matrix of the imputed data set may be distorted relative to that of the original data set. It is therefore important to take the measurement error into consideration during the imputation phase, since it is precisely the covariance information that PCA attempts to model. In this section we propose an algorithm which combines PCAIA with the ideas of Bootstrap re-sampling and Data Augmentation strategies. The proposed algorithm is named PCA-Data Augmentation (PCADA).
The basic idea is to iteratively implement the imputation and posterior steps described by Equations (19) and (20), as discussed in the Data Augmentation Section. Suppose that the available data set is Yobs and the missing values are randomly distributed throughout the data matrix. The data set can be completed with some initial estimates of the missing values, for example by filling the missing values with the means of the observed values; the complete data set at this stage is the augmented data matrix Yaug. The parameters, or the loading matrix, can be calculated by applying SVD on the augmented data matrix Yaug. After the initial estimation, the imputation and the posterior steps are carried out as follows.
Imputation step requires that the missing values be sampled from the distribution conditioned on the observed values and the parameters. Using the estimate of the loading matrix and the augmented data matrix Yaug, the conditional expectations of the measurements are calculated by the following equation:
The differences between the observed measurements and the corresponding estimated values of X give the residuals:
These residuals are collected to form the residual matrix r. A residual term sampled from the residual matrix is added to the expected value of each missing data point. The imputation values for the missing data points are given by the following expression:
where k is a random integer between 1 and N and rkj is a residual term sampled randomly from the jth column of the residual matrix r. These estimated values are imputed in place of the missing values, which gives the augmented data matrix.
Model parameters are sampled from their posterior distributions at this stage. A “Bootstrap” re-sampling technique is used to create the posterior distributions of the model parameters; the parameters in this case are the elements of the loading matrix P.
Let us assume that after imputing the missing values the completed data matrix is Yaug. Using the “Bootstrap” re-sampling method, J Bootstrap data sets are created from the augmented data matrix. SVD is performed on each of these data sets, which results in a series of model parameters (i.e., loading matrices). Averages of the estimated model parameters are given by:
In the next iterative step the estimated loading matrix is used to calculate the conditional expectations of the missing values. The Imputation and the Posterior steps are repeated alternately until convergence. Convergence is monitored by observing the sum of squared errors between the observed values and the corresponding predicted values.
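One Imputation-plus-Posterior cycle can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the bootstrap count, the sign alignment of the loadings, and the re-orthonormalization of the averaged loadings are implementation choices not specified in the text.

```python
import numpy as np

def pcada_step(Y, mask, r, n_boot=20, rng=None):
    """One Imputation + Posterior cycle of the PCADA sketch.
    Y: (N, n) augmented data matrix (missing entries already filled)
    mask: (N, n) boolean, True where originally observed
    r: number of retained PCs"""
    rng = np.random.default_rng() if rng is None else rng
    N, n = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu

    # Posterior step: "Bootstrap" re-sampling of the rows approximates
    # sampling the loading matrix from its posterior; averages give P
    P_ref = np.linalg.svd(Yc, full_matrices=False)[2][:r].T
    P_sum = np.zeros((n, r))
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)
        Yb = Yc[idx] - Yc[idx].mean(0)
        Pb = np.linalg.svd(Yb, full_matrices=False)[2][:r].T
        Pb *= np.sign((Pb * P_ref).sum(axis=0))   # resolve sign ambiguity
        P_sum += Pb
    P = np.linalg.qr(P_sum / n_boot)[0]           # re-orthonormalized average

    # Imputation step: conditional expectation plus a randomly drawn residual
    Xhat = Yc @ P @ P.T + mu                      # expectation given the model
    resid = np.where(mask, Y - Xhat, 0.0)         # residuals of observed values
    miss_r, miss_c = np.where(~mask)
    k = rng.integers(0, N, size=miss_r.size)      # random row index per gap
    Y_new = Y.copy()
    Y_new[~mask] = Xhat[~mask] + resid[k, miss_c] # rkj from the jth column of r
    return Y_new, P
```

Iterating this step until the sum of squared errors on the observed entries stabilizes corresponds to the convergence check described above.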
The flow-network process shown in Figure 5 will be used to compare the relative advantages and disadvantages of the different methods. This is a benchmark example used by Narasimhan and Shah (2004), and a similar example was used by Nounou et al. (2002) to evaluate different properties of Bayesian PCA. It is assumed that the fluid flowing through the network is incompressible and that there is no time delay in the process. The constraint model A of the process can be obtained easily from the mass balance equations at the junctions. The following four mass balance equations apply for this flow-network system:
Thus the constraint model is:
The rank of the constraint matrix is 4, which is also known as the order of the constraint model. In the above example x1 and x2 were chosen as independent variables, with their time series being the autoregressive (AR) processes given by,
where γ ∼ N(0, 1). The rest of the flow rates, x3 to x6, were calculated from the mass balance equations. These variables are noise-free and satisfy the constraint,
where X1 to X6 are vectors containing the actual flow values at each sampling point. However, in the process industries the actual values of the variables are generally not available; only the noise-corrupted measurements Y are available,
where ε is a matrix containing the measurement noise. The measurement noises are assumed Gaussian, independent and identically distributed, and uncorrelated in the variable direction.
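The data generation just described can be sketched as follows. The constraint matrix A below is a hypothetical stand-in, since the actual junction equations depend on the network topology of Figure 5, which is not reproduced here; the AR coefficients a = 0.9 and b = 0.8 and the sample size of 200 follow the text, while the noise level is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
a, b = 0.9, 0.8                     # AR coefficients from the text

# independent flows x1, x2 as AR(1) processes driven by N(0, 1) noise
x1 = np.zeros(N); x2 = np.zeros(N)
for t in range(1, N):
    x1[t] = a * x1[t - 1] + rng.normal()
    x2[t] = b * x2[t - 1] + rng.normal()

# hypothetical mass balances: x3 = x1 + x2, x4 = x1, x5 = x2, x6 = x3
x3 = x1 + x2
X = np.column_stack([x1, x2, x3, x1, x2, x3])

# corresponding rank-4 constraint matrix (A X' = 0 for noise-free flows)
A = np.array([[1, 1, -1,  0,  0,  0],
              [1, 0,  0, -1,  0,  0],
              [0, 1,  0,  0, -1,  0],
              [0, 0,  1,  0,  0, -1]], dtype=float)
assert np.allclose(A @ X.T, 0)            # noise-free flows satisfy A
assert np.linalg.matrix_rank(A) == 4      # constraint model order is 4

# only noise-corrupted measurements are available in practice
Y = X + 0.1 * rng.normal(size=X.shape)    # iid Gaussian measurement noise
```

Any other incompressible network would simply change A; the structure of the experiment (rank-deficient X plus full-rank noise) is what matters for the comparisons that follow.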
The following two measures were used to quantify the performance of the proposed algorithms.
Principal component analysis (PCA) is carried out by applying SVD on the covariance matrix, where the loadings of the PCs are given by the eigenvectors. In a multidimensional problem the eigenvectors can be multiplied by any non-singular matrix to define the same hyperplane; the exact value of each element depends on how the basis vectors are selected. Therefore, direct comparison of the elements of the eigenvectors with the actual model parameters is not feasible. Instead one should examine whether the hyperplane defined by the estimated model is in agreement with the actual model hyperplane. In this study the subspace angle, θ, is used to measure such agreement.
Let F and G be given subspaces of real n-space, with dim F = p ≥ dim G = q ≥ 1 assumed for convenience. The smallest angle between F and G is defined by
Assume that the maximum is attained for u = u1 and v = v1. Continuing in this way until one of the subspaces is empty, we are led to the following definition. The principal angles between F and G are recursively defined for k = 1, ..., q by,
subject to the constraints
where σk is a singular value of FTG. Therefore the subspace angle, or principal angle, is the minimum angle between the subspaces (Bjorck and Golub, 1973). On the other hand, the similarity index is a combined index defined by,
where λi is an eigenvalue of FTGGTF. The value of the similarity index is between 0 and 1, where 1 means that the two subspaces are linearly dependent (Krzanowski, 1979). Clearly these two indicators have the same origin but differ in the way the result is reported. In the current study we use the subspace angle to quantify the model quality. The built-in function “subspace.m” from Matlab's “Data analysis and Fourier transforms” toolbox was used to calculate the subspace angle. The details of the algorithm can be found in Knyazev and Argentati (2002).
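Both indicators can be computed from the same SVD following the Bjorck-Golub construction. The sketch below is illustrative and is not Matlab's subspace.m itself: orthonormalizing both bases and taking the SVD of their cross-product yields the cosines of all principal angles, from which either the angle or the Krzanowski index follows.

```python
import numpy as np

def principal_angles(F, G):
    """Principal angles between the column spaces of F and G via the
    Bjorck-Golub approach: orthonormalize both bases, then the singular
    values of Qf' Qg are the cosines of the principal angles."""
    Qf = np.linalg.qr(F)[0]
    Qg = np.linalg.qr(G)[0]
    s = np.clip(np.linalg.svd(Qf.T @ Qg, compute_uv=False), 0.0, 1.0)
    return np.arccos(s)        # ascending; the first entry is the smallest

def similarity_index(F, G):
    """Krzanowski similarity index: average squared cosine of the
    principal angles; equals 1 when the two subspaces coincide."""
    theta = principal_angles(F, G)
    return float((np.cos(theta) ** 2).sum() / min(F.shape[1], G.shape[1]))
```

For example, rotating one basis vector of a plane in R^3 by an angle t leaves one principal angle at 0 and makes the other equal to t, and the similarity index becomes (1 + cos^2 t)/2.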
Total sum of squared error (TSE)
The main objective of process monitoring is to estimate the noise-free values of the signals. In order to evaluate the performance of the proposed algorithms we also calculated the total sum of squared errors (TSE) between the noise-free signals and the predicted signals. The total sum of squared errors is given by,
In addition to the prediction trend plots, TSE gives a quantitative way of comparing the performance of the algorithms. The process behaviour is simulated using the transfer functions given in Table 1 with a = 0.9 and b = 0.8. The missing data problem is most interesting when the sample size is small; in this case 200 samples were taken to estimate the model.
Table 1. Algorithm describing the steps of synchronizing data from batch processes with different completion time using a combined DTW and missing data technique
Steps of synchronizing the uneven length batch process data
Step 1: Collect data in a three-way data matrix. Select a reference batch from the collection of data
Step 2: Align data set from each batch with the reference data set using dynamic time warping
Step 3: Create a dynamic data matrix by including the lagged variables
Step 4: Use PCAIA to fill the missing values of the shorter data sets
Step 5: Unfold the data set into a two-dimensional data matrix and build a model using PCA
Results and Discussion
Model order selection
As explained in the Limitations of PCAIA Section, the covariance of the data matrix shrinks if conditional means are substituted for missing values. This directly affects the ratio of the variance explained by the major PCs to the total variance, as illustrated in Figure 6 for the flow-network system: with more data missing from the data matrix, the fraction of total variance explained by the first two principal components increases. Methods which use variance information for selecting the model order, such as the SCREE plot and the broken-stick method, are therefore affected by missing data. Figure 7 shows model order selection using the prediction error sum of squares (PRESS). It compares the PRESS of PCADA and PCAIA with 15% missing data against the PRESS of PCA without any missing values. Though the calculated PRESS values changed because of the missing values, PRESS is lowest at two PCs for both methods and clearly indicates that the first two PCs are sufficient to explain the systematic variability of the data. Also, the PRESS calculated using PCADA is closer to the true values than that calculated using PCAIA.
Convergence of PCA-Data Augmentation (PCADA) was monitored using the calculated sum of squared errors between the observed values and the corresponding predicted noise-free values. In the flow-network example the actual constraint model is exactly known, so the changes of the subspace angles with the iteration steps were also calculated to reaffirm the convergence properties. Both the sum of squared errors and the subspace angles decrease with each additional iteration step and reach their minimum values at convergence. It is also evident that both indices have similar trends and point towards convergence at around the same iteration steps. Therefore, when the true model is not known, the sum of squared errors between the observed values and the corresponding predicted values can be used to monitor the convergence of the algorithms. However, the convergence of PCADA is not as smooth as that of PCAIA, as shown in Figure 8, because at each iterative step residual error corrections are carried out by adding random noise to the conditional expected values. This randomness is also reflected in the convergence plot. Once the algorithm converges the indices vary around their minimum values (Figure 9); therefore one should check for a bounded value rather than a constant term to determine convergence. PCADA converges even with a very low signal-to-noise ratio and a high percentage (up to 25%) of missing data.
Comparison of model quality
In Figure 10 the estimated values of the missing measurements are plotted against the noise-free true values. The plot clearly shows that PCADA gives better predictions than PCAIA. In order to get a quantitative feel, we also calculated the sum of squared errors between the predicted values and the true values, shown in Figure 11 for different percentages of missing values. These values are averages of 20 Monte Carlo simulations and the error bars indicate the standard deviations of the estimates. It is evident from the plot that the proposed PCADA algorithm gives better estimates of the missing values, and the estimated models are of better quality than those estimated by PCAIA. However, the computational load of PCADA is substantially higher than that of PCAIA.
SYNCHRONIZING UNEVEN LENGTH BATCH PROCESS DATA
Until now we have shown the application of the iterative techniques to problems which are directly related to missing data. In addition, a wide range of problems can be formulated as missing data problems, and the iterative techniques may be used effectively to solve them. In this section we demonstrate one such example, where PCAIA is used to synchronize uneven length batch process data. Batch processes are used for producing high value-added products such as pharmaceuticals, speciality polymers, biomedical products, etc. Monitoring of batch processes is important because early detection of an anomaly can prompt corrective actions or altogether avoid further costly processing of the batch. Multi-way PCA (MPCA) is commonly used for monitoring batch processes (Nomikos and MacGregor, 1994). Batch process monitoring is essentially concerned with monitoring the trajectories. The allowable band for the trajectories is calculated from the trajectories of different batches; therefore, to take this variation into account, data from different batches have to be included in building the model. Typical batch process data have a three-way matrix structure as shown in Figure 12a.
In order to apply PCA the data first need to be unfolded into a two-way structure. Data can be unfolded in three distinct ways to a two-way data matrix (Westerhuis et al., 1999). However, the unfolding proposed by Nomikos and MacGregor (1994) is the most meaningful in this context as it provides a way to include the normal batch-to-batch variations in the model. This is illustrated in Figure 12. In a typical batch run, a set of variables is measured at successive time intervals. Here the end point of the measurements is itself a variable and varies from batch to batch; the variation of batch length may be due to a wide variety of reasons, including feed quality variation, poor control, etc. Similar data will be generated from several similar batch runs, and this vast amount of data can be organized in a three-way data matrix. The unfolding proposed by Nomikos and MacGregor (1994) slices the matrix in the vertical direction and arranges the time slices side by side. In the unfolded matrix each batch appears as an object. The data are then mean-centred and scaled prior to applying PCA. This unfolding is particularly meaningful because by subtracting the means of the columns of the unfolded matrix the main nonlinear and dynamic components of the data are removed. A PCA performed on these mean-centred data is therefore a study of the variation in the time trajectories of all the variables in all batches about their mean trajectories. However, if the batch lengths are uneven, the different batches need to be synchronized before carrying out the mean-centring operation. In this section we move directly to the various issues related to synchronization of batch data. The details of the MPCA technique for batch processes can be found in Nomikos and MacGregor (1995). A review and comparative study of different techniques of batch process MPCA has been carried out by van Sprang et al. (2002) and Westerhuis et al. (1999).
Several methods have been used to synchronize uneven length data from different batches. Nomikos and MacGregor (1994) proposed the use of an indicator variable instead of time: data from different batches are synchronized with respect to the indicator variable. In order to use this method the indicator variable has to have some specific properties, such as the same starting and ending points in different batches and a monotonically increasing or decreasing trend in time, and the signal has to be noise-free. A constant increment is selected along this indicator variable and the rest of the variables are synchronized with respect to it. This method was also used by Kourti et al. (1996) to synchronize a semi-batch polymerization process. The main critique of the method is that in many cases it is difficult to find a variable which meets all these criteria, so it may not be applicable in general. A simpler practical solution was used by Lakshminarayanan et al. (1996), where the shorter batches were padded with the last measurements so that all batch lengths were made equal. This essentially implies that all the time differences are at the last stage of the batch process; it is therefore not suitable for batch processes which have multiple stages, though the method works well in many situations where the batch lengths are not substantially different. Another option is to consider data from all batches only up to the shortest batch length, so that data collected during the later stages of the longer batches are not included in the model. Unfortunately the data collected towards the end are of great interest, as these measurements provide information on whether or not the reaction is complete or the cycle is finished. To estimate the end point of the batch process a two-stage method is proposed by Marjanovic et al. (2006); the method is particularly useful for predicting the batch completion time beforehand.
However, the most general and elegant solution to date for synchronizing batch trajectories is dynamic time warping (DTW). DTW is widely used in speech recognition, particularly in isolated word recognition (Myers et al., 1980; O'Shaughnessy, 1986; Silverman and Morgan, 1990). In chemical processes DTW was introduced by Gollmer and Postens (1995) to detect the onset of different growth phases and failures in batch fermentation processes. Nomikos and MacGregor (1994) used DTW for synchronizing uneven length batch trajectories. DTW is a flexible, deterministic pattern matching technique. It is able to translate, compress, and expand the patterns locally, which are very attractive features for multi-stage batch process data. The method uses the theory of dynamic programming, hence the name DTW. There are many versions of DTW; however, for batch data synchronization asymmetric DTW is commonly used, and in this study we focus on asymmetric DTW only.
Asymmetric DTW is a two-step procedure. In the first step, a reference batch is selected and the test batch data points are aligned along the time indices of the reference batch. Some measure of the total distance between the “reference” trajectory and the “test” trajectory is minimized in order to find the best indices for the test batch. The most commonly used local distance is the weighted quadratic distance. Several local and global constraints are also imposed in the algorithm. End point bounds are the most common and useful when the end points of both trajectories are known with certainty.
After aligning the trajectories the next step is to deal with the excess or missing data points in the synchronized data matrix. If the “test” batch is longer than the reference batch, some points of the “test” batch need to be discarded. For example, if two points from the “test” batch correspond to one point in the reference batch then, based on the distance measure, one point is discarded or the average of the two points is assigned to that position. On the other hand, if the “test” batches are shorter than the “reference” batch, after alignment there will be many gaps in the “test” data set (i.e., some of the rows of the test data set will be empty). Typically, a “zero order hold” or “first order hold” is used to fill these gaps.
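A minimal asymmetric DTW alignment in the spirit of the two-step procedure above can be sketched as follows. This is an illustrative implementation: the weighted quadratic local distance is simplified to an unweighted squared Euclidean distance, and the step pattern (test index repeats, advances by 1, or skips 1) and hard end-point bounds are one common choice, not necessarily the exact constraints used in this study.

```python
import numpy as np

def dtw_align(ref, test):
    """Asymmetric DTW sketch: each reference sample i is matched to a test
    sample idx[i] by dynamic programming on squared Euclidean distances.
    Assumes len(test) - 1 <= 2 * (len(ref) - 1) so the end point is
    reachable with the skip-at-most-one step pattern."""
    Nr, Nt = len(ref), len(test)
    d = ((ref[:, None, :] - test[None, :, :]) ** 2).sum(-1)  # local distances
    D = np.full((Nr, Nt), np.inf)
    D[0, 0] = d[0, 0]                                        # start-point bound
    for i in range(1, Nr):
        for j in range(Nt):
            prev = D[i - 1, j]                 # test index repeats
            if j > 0:
                prev = min(prev, D[i - 1, j - 1])   # advances by 1
            if j > 1:
                prev = min(prev, D[i - 1, j - 2])   # skips 1
            D[i, j] = d[i, j] + prev
    # backtrack from the enforced end point (Nr - 1, Nt - 1)
    idx = np.zeros(Nr, dtype=int)
    j = Nt - 1
    for i in range(Nr - 1, 0, -1):
        idx[i] = j
        cands = [(D[i - 1, j], j)]
        if j > 0:
            cands.append((D[i - 1, j - 1], j - 1))
        if j > 1:
            cands.append((D[i - 1, j - 2], j - 2))
        j = min(cands)[1]
    idx[0] = 0
    return idx
```

The returned index array is monotonically non-decreasing, so repeated indices mark where the test batch must be stretched and skipped indices mark where it must be compressed.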
However, none of these methods is deemed appropriate considering the fact that batch process data are dynamic and multivariate. By applying the “zero order hold,” though the spatial correlation between the variables is preserved, the temporal trends of the variables get distorted. On the other hand, the “first order hold” or linear interpolation takes care of the temporal correlation to some extent but destroys the spatial correlation between the variables. Therefore a method is needed which preserves both the temporal trend and the spatial correlation between the variables. In the current study we propose a method based on a missing data handling technique which attempts to conserve both the temporal and the spatial correlation of the batch data set. The proposed method combines DTW and the multivariate missing data handling technique, taking advantage of the data matching capability of the former and the multivariate nature of the latter.
Combined DTW and Missing Data Technique
The “test” data sets are first aligned along the time stamp of the “reference” data set using DTW. If a test data set is shorter than the reference data set, gaps will be created in the test data set. Subsequently these gaps are filled by applying the missing data handling technique. The overall methodology is described in Table 1. The pattern of data from a single batch after synchronization is shown in Figure 12b. In this pattern there is no way of using multivariate methods to fill the missing values, as the rows that contain missing values are completely empty; the model cannot be used to predict unique values for the missing data points. However, batch process variables are not only spatially correlated with each other at any given time but also correlated in the temporal direction. Therefore it is reasonable to include lagged variables in the data matrix; the pattern of the data matrix with lagged variables is shown in Figure 12c. From this pattern it is evident that iterative missing data techniques can be used to fill the missing values. Once the lagged data matrix has been created, PCAIA is applied to fill these missing values. PCAIA is a pseudo-EM algorithm which iterates between the model parameter estimation and missing value estimation steps; its application steps are described in the Missing Data Handling in PCA Section. Because of the time-shifted values in the rows containing missing values, the model yields unique predictions of the missing values. As the model is multivariate, the predicted values are also consistent with the correlation structure of the data. The procedure is repeated for each batch data set which is shorter than the reference batch. Once all batches have the same length and the missing values have been filled, the data matrix can be unfolded by any of the three unfolding techniques.
In the current study, we unfolded the matrix in the variable direction as proposed by Nomikos and MacGregor (1994). After arranging the data in a two-dimensional rectangular structure, ordinary PCA is applied to build the model for monitoring purposes. In the following section we demonstrate the technique using a simulated batch polymer reactor.
The proposed methodology is applied to a fed-batch polymer reactor process (Luyben, 1990; Chen and Liu, 2002). The reaction system involves two consecutive first-order reactions:
The schematic diagram of the reactor with the different measurement locations is shown in Figure 13. The reactor is operated under closed loop with an on/off control strategy. The reaction has three distinct stages. In the startup stage, the steam in the jacket initially heats up the reactor contents until the temperature reaches the desired operating level. In the second, cooling stage, the cooling water in the jacket is used to remove the exothermic heat of reaction. The third stage is more of a maintenance stage: the reaction temperature is self-sustained, and the cooling water valve is turned on occasionally to remove the excess heat generated by the reaction. The jacket temperature, the temperature of the metal wall between the reactor and the jacket, the reactor temperature, and the cooling water flow rate are the four measured process variables. Two quality variables, the concentrations CB and CC, are measured at the end of each batch run. The simulation conditions and relevant parameters remain the same as those of Luyben (1990), except the initial concentration CA.
Variation in initial feed concentration is common, as feed may be obtained from different sources or because of the presence of impurities in the feed. We consider a range of initial feed concentrations which affect the batch completion time. Data are collected from a total of 16 batch runs and the completion time of each batch run varies between 73 and 100 min. We select the batch with 95 min completion time as our reference data set since the distribution of the batch lengths has a peak around that point. In order to build the monitoring scheme the four process variables are included in the model.
Typical trends of the variables are shown in Figure 14. Since the process has three different stages and there is no monotonically increasing or decreasing variable, DTW is the most appropriate method for synchronizing the data. The data are synchronized using two different techniques.
First, we use DTW for aligning all the batches along the reference batch and apply “first order hold” to fill the gaps in the shorter batches.
Second, we use DTW for aligning the data with the reference batch and use the proposed methodology based on missing data handling technique to fill the missing values.
Subsequently, PCA is carried out on the data matrices synchronized by these two alternative methods and the results are illustrated in Figures 15–18. Figure 15 gives the cumulative percentage of variance explained by the PCs. We obtain a very compact model when the proposed missing data technique is used to synchronize the data. The first principal component, extracted from the data set synchronized using the missing data handling technique, explains 85% of the total variance, compared to 40% for the data synchronized using “DTW-first order hold.” In the proposed method only two PCs explain 90% of the cumulative variance, whereas eight PCs are required to explain the same amount of variance for the data set synchronized using “DTW-first order hold.” In MPCA a compact model has a special meaning, as we are looking at the variation between the batches: in the ideal case, if all batches behave normally and the variations are due to random measurement noise, then a single PC should be sufficient to capture most of the covariance information of the data. The model from the proposed methodology has a compact structure because PCAIA is a multivariate technique and the predicted values are therefore commensurate with the overall correlation structure of the data matrix. On the other hand, filling data with a “first order hold” creates points which may introduce extra variation in the data matrix, so more PCs are required to explain that additional variation.
The SPE and T-square plots for the two methods are given in Figures 16 and 17. The SPE plot from the traditional “DTW-first order hold” method only marginally singles out batch number four as abnormal compared to the rest of the batches. The T-square plot of the proposed missing data based method detects the 16th batch as abnormal. The fourth batch has a completion time of 85 min, while the 16th batch completes in 73 min. Since the feed of the 16th batch has the most impurities, it appears more justified to single out the 16th batch as the abnormal one.
Though the proposed method shows good promise in off-line analysis, DTW has several limitations and these also apply to the proposed methodology. The biggest criticism of DTW is that its on-line application is not straightforward. Our viewpoint is that monitoring does not have to be on-line throughout the processing. Rather, in systems like this one, where the processing goes through multiple stages, PCA can be applied after the completion of each stage to determine the process status. For example, in this case three models can be built using the normal data sets: the first using data from the heating phase only, the second using data from both the heating and cooling phases, and the third using data from all three stages. As soon as we detect the completion of a stage, the data can be synchronized using the proposed methodology and the respective model can be used to detect any abnormality in the batch. This is consistent with the overall objectives of batch process monitoring: without intervening in the operation, this procedure may help to detect abnormalities during processing. If an abnormality is detected, the subsequent processing may be abandoned to avoid additional cost.
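The staged strategy can be sketched as follows. Everything in this sketch is an illustrative assumption rather than the study's implementation: the stage names, the random stand-in training data, the toy PCA model, and the use of SPE as the monitoring statistic.

```python
import numpy as np

# Stage-wise monitoring sketch: fit one PCA model per completed stage
# from normal batches, then score a new batch against the model of
# whichever stage has just finished.
rng = np.random.default_rng(1)

def fit_pca(X, n_pc):
    """Return (mean, loadings) of an n_pc-component PCA model."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_pc].T

def spe(x, model):
    """Squared prediction error of one observation against the model."""
    mu, P = model
    r = (x - mu) - P @ (P.T @ (x - mu))
    return float(r @ r)

# One model per cumulative stage, trained on (synthetic) normal data.
stages = ["heating", "heating+cooling", "all_stages"]
models = {s: fit_pca(rng.normal(size=(30, 8)), n_pc=2) for s in stages}

# When a stage completes, synchronize the batch data and score it:
x_new = rng.normal(size=8)
scores = {s: spe(x_new, m) for s, m in models.items()}
print(scores)
```

In practice the synchronized batch data would replace the random vector scored here, and each SPE value would be compared against a control limit derived from the normal batches.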
FORMULATION OF COMPRESSION AS A MISSING DATA PROBLEM
Process data historians use compression algorithms to reduce file sizes and hence the storage requirement on the server. The data historians currently used in industry mostly use direct methods (e.g., Swinging Door data compression) for compressing data. Compressed data are usually reconstructed using univariate methods such as linear interpolation. These reconstruction methods do not take into account the changes that take place in other variables, and as such linear interpolation-based reconstruction algorithms may destroy the correlation between different signals. The reconstruction may therefore not be reliable, depending on the end use of the data. In particular, such techniques may be detrimental if the reconstructed data are used for multivariate analysis, since such analysis makes use of the correlation between the variables. The detrimental effect of data compression and its remedy have been studied extensively in Imtiaz et al. (2007). For the sake of completeness, in this section we briefly describe how the data compression problem can be formulated as a missing data problem and solved using missing data handling techniques.
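As a concrete illustration of the direct methods mentioned above, here is a simplified sketch of Swinging Door compression. It follows the common textbook description of the algorithm; vendor implementations differ in details such as tolerance handling and time limits between archived points.

```python
def swinging_door(t, y, dev):
    """Simplified Swinging Door compression: return the indices of the
    samples to archive so that a piece-wise linear reconstruction stays
    close to every discarded sample (within roughly +/- dev)."""
    kept = [0]
    s_max, s_min = float("inf"), float("-inf")   # slope "doors"
    for i in range(1, len(y)):
        dt = t[i] - t[kept[-1]]
        s_max = min(s_max, (y[i] + dev - y[kept[-1]]) / dt)
        s_min = max(s_min, (y[i] - dev - y[kept[-1]]) / dt)
        if s_min > s_max:                        # doors closed: archive
            kept.append(i - 1)
            dt = t[i] - t[kept[-1]]              # restart from new anchor
            s_max = (y[i] + dev - y[kept[-1]]) / dt
            s_min = (y[i] - dev - y[kept[-1]]) / dt
    if kept[-1] != len(y) - 1:
        kept.append(len(y) - 1)                  # always keep last sample
    return kept

# A ramp followed by a plateau compresses to three archived points:
t = list(range(50))
y = [float(min(i, 25)) for i in range(50)]
print(swinging_door(t, y, dev=0.01))             # → [0, 25, 49]
```

The historian stores only the archived samples; the intermediate values are later regenerated by linear interpolation, which is exactly the step that distorts cross-correlations between tags.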
Reconstruction of Compressed Data Using Missing Data Handling Technique
Almost all multivariate statistical data analysis methods, for example, pattern matching of historical data and fault detection and isolation using PCA, make use of the correlation between the variables. Therefore, in order to use compressed data, especially for multivariate analysis, alternative methods should be used for reconstruction. Instead of linear interpolation-based methods, we recommend that a PCA-based missing data handling technique such as PCAIA be used to reconstruct the compressed data set.
The first step of the reconstruction algorithm is to isolate the raw data points from the interpolated points. A compression detection algorithm is used to find these points (Thornhill et al., 2004). Since the reconstructed signal is piece-wise linear, its slope changes only at the locations of the raw values.
Therefore, the locations of the raw values are given by the locations of the non-zero second derivatives. Only the raw values are retained, and all interpolated points of the data matrix are considered missing. This is illustrated in Figure 20a, where the missing values are marked by “NaN.” At this stage, however, the percentage of missing data in the data matrix can be high, since in many situations we may encounter highly compressed data; for example, if data are compressed by a factor of 5, 80% of the data are missing. This poses a difficulty in reconstruction, as most iterative missing data handling techniques do not converge for more than 20% missing values in the data matrix. Therefore, a multistage procedure is applied to bring down the percentage of missing data. In the first step, all rows which do not contain any original points are removed from the data matrix. These rows clearly do not contain any information, and their removal has no impact on the process models since only steady-state models are of interest. This is illustrated in Figure 20a, which shows a data matrix obtained from a data set compressed by a factor of 3, so that 66% of the data are missing at this stage. All rows which do not contain a single raw value are shaded. After removing the shaded rows in Figure 20a, the new data matrix takes the form shown in Figure 20b. The ratio of raw values to missing values has improved, and the percentage of missing data in the new data matrix (Figure 20b) is reduced to 50%. In the next phase, rows which contain only one raw value are taken out of the data matrix, further reducing the percentage of missing values. The procedure is repeated (e.g., removing rows with only two raw values in the next step) until the percentage of missing values in the data matrix comes down to 30%.
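The detection and row-removal steps can be sketched as follows. The tolerance value, the helper names, and the toy data are illustrative assumptions; only the second-difference test and the staged row removal come from the description above.

```python
import numpy as np

# Step 1: flag interpolated samples via the second difference.
# Step 2: drop the sparsest rows until the missing fraction is workable.

def mark_interpolated(y, tol=1e-9):
    """Return y with interpolated samples replaced by NaN. Interior
    points on a straight line have zero second difference; the two
    endpoints are always treated as raw."""
    y = np.asarray(y, dtype=float)
    d2 = np.abs(np.diff(y, 2))                  # second difference
    raw = np.r_[True, d2 > tol, True]           # endpoints kept
    out = y.copy()
    out[~raw] = np.nan
    return out

def drop_sparse_rows(X, target=0.30):
    """Remove rows with the fewest raw values (first 0, then 1, then 2,
    ...) until at most `target` fraction of the matrix is missing."""
    X = np.asarray(X, dtype=float)
    min_raw = 0
    while np.isnan(X).mean() > target and min_raw <= X.shape[1]:
        X = X[np.sum(~np.isnan(X), axis=1) > min_raw]
        min_raw += 1
    return X

# Two linear segments: only the endpoints and the knee (index 3) are raw.
print(mark_interpolated([0.0, 1.0, 2.0, 3.0, 1.0, -1.0, -3.0]))
```

Applying `mark_interpolated` column by column and then `drop_sparse_rows` to the stacked matrix reproduces the transition from Figure 20a to Figure 20b.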
After the percentage of missing values in the data matrix is within 30%, PCAIA is applied to reconstruct the variables and restore the correlation structure. The details of the PCAIA algorithm have been described earlier in the Missing Data Handling in PCA Section.
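For orientation, a rough, minimal sketch of the iterative idea behind such PCA-based imputation is given below (alternating a mean-centred SVD with re-imputation of the missing entries). The rank, tolerances, and synthetic data are assumptions for illustration, not the PCAIA settings used in the study.

```python
import numpy as np

def pca_impute(X, n_pc, n_iter=200, tol=1e-8):
    """Iterative PCA imputation: fill missing entries with column means,
    then alternate PCA fitting and re-imputation until the imputed
    values stop changing. Observed entries are never modified."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xf = X.copy()
    col_mean = np.nanmean(X, axis=0)
    Xf[miss] = np.take(col_mean, np.where(miss)[1])  # initial guess
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        recon = mu + (U[:, :n_pc] * s[:n_pc]) @ Vt[:n_pc]
        delta = np.max(np.abs(recon[miss] - Xf[miss])) if miss.any() else 0.0
        Xf[miss] = recon[miss]
        if delta < tol:
            break
    return Xf

# Rank-1 data with ~14% of one column missing: the other columns carry
# the correlation needed to recover the deleted values.
rng = np.random.default_rng(2)
t = rng.normal(size=(100, 1))
X_true = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(100, 3))
X_miss = X_true.copy()
X_miss[::7, 1] = np.nan
X_hat = pca_impute(X_miss, n_pc=1)
```

Because the imputed values come from the latent-variable model rather than from a univariate interpolation, they are, by construction, consistent with the correlation structure of the remaining data.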
Results: Restoring Correlation of Compressed Data
We demonstrate the application of the missing data treatment technique in reconstructing compressed data with a data set from a refining process. All six variables are level measurements at different locations of a distillation column. The sampling frequency is 1 min and the total length of the data set is 20 000 samples. The data were obtained in uncompressed form and, for investigative purposes, were compressed to different compression levels. The trend plots of the raw signals and the linearly interpolated compressed signals are shown in Figure 19. The correlation matrix of the raw uncompressed data set is colour coded in Figure 21; the colour bar indicates the magnitude of the correlation. This data set was compressed using the Swinging Door algorithm to a compression factor of 10 and subsequently reconstructed using linear interpolation. It is evident from the colour map that the correlation between the variables is severely distorted at this level of compression. The correlation structure of the data reconstructed using PCAIA is compared with the original correlation in Figure 22. The colour coded correlation map clearly shows that PCAIA-based reconstruction can restore the true correlation between the variables.
CONCLUSIONS
Missing data handling techniques based on rigorous statistical theory can be used systematically for conditioning process data. In this paper we have demonstrated the application of missing data techniques in process data analysis. Missing data handling techniques have been used to extend PCA for dealing with missing values. In addition, two novel applications of missing data handling techniques, which are not conventionally considered missing data problems, have been shown. The contributions of this paper are summarized below:
This paper acts as a bridge between the statistical methods for dealing with missing data and process data analysis. It explores the similarities and dissimilarities of the missing data problem in process data analysis and related subjects (e.g., statistical surveys).
A tutorial introduction on different concepts and methods related to missing data problem has been presented. These concepts have been explained with examples related to process data analysis.
PCA has been extended to the Data Augmentation framework for building models from data with missing values. The models estimated using the proposed methods are of better quality than those obtained with the conventional methods.
A multivariate missing data handling technique is combined with DTW to synchronize uneven length batch process data. The proposed method conserves the correlation between the variables and leads to a compact latent variable model.
Data compression has been formulated as a missing data problem. A PCA based missing data handling technique has been used to restore the correlation between the compressed variables.
ACKNOWLEDGEMENTS
We would like to thank Prof. S. Narasimhan of Indian Institute of Technology (IIT), Madras and Dr. M.A.A. Shoukat Choudhury of Bangladesh University of Engineering and Technology (BUET) for their suggestions and many helpful discussions.