Principal component analysis for bar charts and metabins tables



In recent years, the analysis of symbolic data, where the units are categories, classes, or concepts described by intervals, distributions, sets of categories, and the like, has become a challenging task, since many application fields generate massive amounts of complex data that are difficult to analyze with traditional techniques. In this article, we propose a strategy for extending standard principal component analysis (PCA) to such data in the case where the variable values are ‘bar charts’ (i.e., sets of categories, called bins, with their relative frequencies). First, we introduce ‘metabins’, which mix together bins of the different bar charts and enhance interpretability. Standard PCA applied to the bins of such data tables can lose the bar chart constraints and assume independence between the bins. Therefore, we introduce a ‘Copular PCA’, as copulas take into account the probabilities and the underlying dependencies. Some theoretical results lead to the representation of the bar chart variables inside a hypercube covering the correlation sphere of a PCA applied to the bins. We give several ways of representing individuals and pathways of individuals × metabins or individuals × variables. Several tools for interpreting such representations, based on the ‘coherency’ of metabins (or variables) along a trajectory (i.e., an oriented pathway) of individuals and the ‘diversity’ of individuals along a trajectory of metabins (or variables), are illustrated by some simple examples. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
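
To make the baseline concrete, the following is a minimal sketch of a standard PCA applied to the bins of a bar chart table. It is not the Copular PCA proposed in the article. The function name `bins_pca`, the array layout (individuals × variables × bins), and the randomly generated bar charts are assumptions made only for this illustration. It shows two facts mentioned above: because the bins of each bar chart sum to one, the bin columns are linearly dependent, and each bin lies on the unit correlation sphere when all components are kept.

```python
import numpy as np


def bins_pca(bar_charts):
    """Standard (normalized) PCA on the bins of a bar chart table.

    bar_charts: array of shape (n_individuals, n_variables, n_bins);
    each bar chart (last axis) holds relative frequencies summing to 1.
    Illustrative baseline only, not the article's Copular PCA.
    """
    bar_charts = np.asarray(bar_charts, dtype=float)
    if not np.allclose(bar_charts.sum(axis=2), 1.0):
        raise ValueError("each bar chart must sum to 1")

    n, p, m = bar_charts.shape
    X = bar_charts.reshape(n, p * m)      # one column per bin

    # Center and scale each bin column (population standard deviation).
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0                   # guard against constant bins
    Z = Xc / std

    # PCA through the singular value decomposition of the standardized table.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                        # coordinates of the individuals
    explained = s**2 / np.sum(s**2)       # share of variance per component

    # Correlation between each bin and each component.
    corr = Vt.T * s / np.sqrt(n)
    return scores, explained, corr


# Hypothetical data: 20 individuals, 2 bar chart variables with 3 bins each.
rng = np.random.default_rng(0)
charts = rng.dirichlet(np.ones(3), size=(20, 2))
scores, explained, corr = bins_pca(charts)

# One component per variable carries no variance (sum-to-one constraint).
print(np.round(explained, 3))
# Each bin has unit squared correlation summed over all components.
print(np.round((corr**2).sum(axis=1), 3))
```

In this sketch the trailing components with zero variance reflect the sum-to-one constraint of each bar chart, which a bin-wise PCA does not model explicitly.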