Clustering large data sets described with discrete distributions and its application on TIMSS data set



Symbolic data analysis is based on special descriptions of data, called symbolic objects. Such descriptions preserve more detailed information about the data than standard representations with mean values. One special kind of symbolic object is a representation with distributions. In the clustering process, this representation enables us to consider variables of all types at the same time.
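As a rough illustration of how a description with discrete distributions can put variables of different types on a common footing, the following Python sketch turns a categorical variable and a binned numeric variable into vectors of relative frequencies. The function names, the category lists, and the bin edges are hypothetical choices made for this illustration; they are not taken from the paper.

```python
from bisect import bisect_right
from collections import Counter


def to_distribution(values, categories):
    """Relative frequencies of `values` over a fixed, ordered list of categories."""
    counts = Counter(values)
    total = sum(counts[c] for c in categories)
    if total == 0:
        # No observed value falls into any listed category.
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


def bin_numeric(values, edges):
    """Replace each numeric value by the index of its bin (edges must be sorted)."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical unit: one class, described by its students' answers and scores.
answers = ["agree", "agree", "disagree", "neutral"]
scores = [35, 62, 78, 90]

answer_dist = to_distribution(answers, ["agree", "neutral", "disagree"])
score_dist = to_distribution(bin_numeric(scores, [50, 75]), [0, 1, 2])

print(answer_dist)  # [0.5, 0.25, 0.25]
print(score_dist)   # [0.25, 0.25, 0.5]
```

Once every variable is expressed this way, a unit is simply a collection of frequency vectors, regardless of whether the original variable was nominal, ordinal, or numeric.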

We present two clustering methods based on data descriptions with discrete distributions: an adapted leaders method and an adapted version of Ward's agglomerative hierarchical clustering method. The two methods are compatible: they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering, a leader is assigned to each cluster. The descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method enables us to efficiently solve clustering problems with a large number of units, while the agglomerative method is applied to the obtained leaders and enables us to decide on an appropriate number of clusters on the basis of the corresponding dendrogram.
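The two-stage procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the dissimilarity, so the sketch assumes squared Euclidean distance between distribution vectors, takes each leader to be the mean of its cluster's vectors, and uses the classical size-weighted Ward criterion on the leaders. All function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two distribution vectors (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(rows):
    """Componentwise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def leaders_method(units, k, iterations=100, seed=0):
    """Alternate between assigning units to the nearest leader and updating leaders."""
    rng = random.Random(seed)
    leaders = [list(u) for u in rng.sample(units, k)]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(u, leaders[j])) for u in units
        ]
        if new_assignment == assignment:
            break  # the partition is stable
        assignment = new_assignment
        for j in range(k):
            members = [u for u, a in zip(units, assignment) if a == j]
            if members:  # an empty cluster keeps its previous leader
                leaders[j] = mean_vector(members)
    return leaders, assignment


def ward_merges(leaders, sizes):
    """Agglomerate leaders with the Ward criterion; sizes must all be positive.

    Returns a list of (cluster_a, cluster_b, merge_height) tuples, where new
    clusters receive consecutive ids starting at len(leaders).
    """
    clusters = {i: (list(l), n) for i, (l, n) in enumerate(zip(leaders, sizes))}
    merges = []
    next_id = len(leaders)
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                (ca, na), (cb, nb) = clusters[ids[x]], clusters[ids[y]]
                d = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        d, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        merged = [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)]
        clusters[next_id] = (merged, na + nb)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Hypothetical toy data: four units, each a distribution over two categories.
units = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
leaders, assignment = leaders_method(units, k=2)
sizes = [assignment.count(j) for j in range(2)]
print(assignment, ward_merges(leaders, sizes))
```

In this sketch the leaders method does the heavy lifting on the full set of units, and the quadratic-time agglomeration only ever sees the much smaller set of leaders; the merge heights it returns are the values one would plot in a dendrogram to choose the number of clusters.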

Both methods have been successfully applied in analyses of different data sets. In this paper, an application to the Trends in International Mathematics and Science Study (TIMSS) data set is presented. The descriptions with distributions enable us to combine two data sets, the answers of teachers and the answers of their students, into one data set. The descriptions of the obtained clusters make the results easier to interpret. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 199–215, 2011