Silhouette width using generalized mean—A flexible method for assessing clustering efficiency

Abstract Cluster analysis plays a vital role in pattern recognition in several fields of science. Silhouette width is a widely used index for assessing the fit of individual objects in the classification, as well as the quality of clusters and the entire classification. Silhouette combines two clustering criteria, compactness and separation, which imply that spherical cluster shapes are preferred over others, a property that can be seen as a disadvantage in the presence of complex, nonspherical clusters, which are common in real situations. We suggest a generalization of the silhouette width using the generalized mean. By changing the p parameter of the generalized mean between −∞ and +∞, several specific summary statistics, including the minimum, maximum, the arithmetic, harmonic, and geometric means, can be reproduced. Implementing the generalized mean in the calculation of silhouette width allows for changing the sensitivity of the index to compactness versus connectedness. With higher sensitivity to connectedness, the preference of silhouette width toward spherical clusters should decrease. We test the performance of the generalized silhouette width on artificial data sets and on the Iris data set. We examine how classifications with different numbers of clusters prepared by different algorithms are evaluated when p is set to different values. When p was negative, well-separated clusters achieved high silhouette widths despite their elongated or circular shapes. Positive values of p increased the importance of compactness; hence, the preference toward spherical clusters became even more detectable. With low p, single linkage clustering was deemed the most efficient clustering method, while with higher parameter values the performance of group average, complete linkage, and beta flexible with beta = −0.25 seemed better.
The generalized silhouette allows for adjusting the contribution of compactness and connectedness criteria, thus avoiding underestimation of clustering efficiency in the presence of clusters with high internal heterogeneity.


| INTRODUCTION
Cluster analysis is the method of grouping similar objects in order to simplify the structure of a data set. It is concerned with discontinuous variation in the data set that allows for separating and identifying "types" of objects. Clustering is a common exploratory tool for pattern recognition in large samples in various fields of science, like geoinformatics (e.g., Lu, Coops, & Hermosilla, 2016), genomics (e.g., Ramoni, Sebastiani, & Kohane, 2002), epidemiology (e.g., Kenyon, Buyze, & Colebunders, 2014), or psychology (e.g., Clatworthy, Buick, Hankins, Weinman, & Horne, 2005).
Moreover, classification is a prerequisite for naming abstract entities like biogeographical regions and habitat types; thus, it is a basic statistical approach in bioregionalization (e.g., González-Orozco, Laffan, Knerr, Miller, & Jetz, 2013; Lechner et al., 2016) and vegetation typology on different scales (e.g., De Cáceres et al., 2015; Lengyel et al., 2016; Marcenò et al., 2018). Clustering methods can be divided into two groups according to each of three independent aspects: (a) crisp versus fuzzy clustering, (b) hierarchical versus nonhierarchical clustering, and (c) model-based versus heuristic algorithms. Crisp clustering procedures provide unequivocal assignment of objects to groups, while fuzzy methods express degrees of membership as weights and allow for assigning an object to multiple groups at a time. The advantage of fuzzy classification over crisp methods is that it enables differentiation of typical, transitional, and outlier objects (De Cáceres, Font, & Oliva, 2010).
However, fuzzy algorithms are computationally much more intensive, and they require more subjective decisions from the user for the parameterization; therefore, crisp methods are still the most widespread. Hierarchical methods classify the objects into groups which are nested subsets of each other, while nonhierarchical methods produce a simple partition without nested structure. Model-based clustering fits a mixture of distributions to the observed data by optimizing the likelihood function, while heuristic methods optimize other (most often geometric) criteria.
Despite the fact that fuzzy classification and hierarchical methods offer additional information, the most common objective of numerical classification is to group the objects into mutually exclusive, exhaustive sets, that is, to produce a partition. In spite of the advantages of model-based methods, partitions are often created by heuristic methods. The reasons are that (a) model-based methods are much more computationally intensive, which limits their application to large data sets; (b) data do not always follow a simple distribution type, or there is no reasonable a priori information on the distribution; and (c) there may be cluster shapes that are hardly captured by fitting simple mixtures.
By their basically descriptive nature, clustering techniques, especially crisp algorithms, produce classifications even if there is no discontinuity in the data set, potentially leading to false conclusions about the within-sample variation. In model-based clustering, where finite mixtures of distributions are fitted, calculating information criteria, such as BIC (Fraley & Raftery, 1998) or the integrated complete-data likelihood criterion (ICL; Biernacki, Celeux, & Govaert, 2000), is the standard way of selecting the best classification. A plethora of methods is available for testing the quality (also called validity or efficiency) of classifications without fitting probability distributions, each applying more or less differently formalized criteria (Handl, Knowles, & Kell, 2005; Milligan & Cooper, 1985; Vendramin, Campello, & Hruschka, 2010). One of the most commonly applied methods for assessing cluster validity is silhouette width (Rousseeuw, 1987), which encompasses two clustering criteria: separation (i.e., average distance to the closest other cluster) and compactness (i.e., average within-cluster distance; Handl et al., 2005). It was originally defined for crisp classification, but Campello and Hruschka (2006) presented a generalization to fuzzy memberships. Silhouette width is calculated for each object of the classification, thus indicating how well each object fits into its cluster. The cluster-wise or the global mean of the objects' silhouette widths can be used to assess the distinctness of specific clusters or the validity of the total classification, respectively, with higher means suggesting more efficient classification. Because the compactness criterion is involved as the average within-cluster distance, silhouette prefers spherical cluster shapes (Rousseeuw, 1987); however, in practice clusters can possess different shapes according to their structure in the multidimensional space of the variables.
Moreover, each clustering algorithm has its own tendency to produce clusters with certain characteristics, including cluster shape, and evaluating them by validity indices following different shape criteria can bring misleading results (Handl et al., 2005). Hence, in the presence of nonspherical clusters, silhouette width may falsely suggest low classification efficiency. Indices that quantify the degree to which objects are assigned to the same cluster as their nearest neighbors, that is, those applying the connectedness criterion, are more suitable for elongated or irregular cluster shapes (Saha & Bandyopadhyay, 2012).
In this paper, we propose a generalization of the silhouette width. Applying the generalized mean, we derive a flexible formula which allows for scaling the sensitivity of the index between connectedness and compactness, thus allowing high values for nonspherical clusters. This enables users to optimize classifications for different cluster shapes depending on the relevance of connectedness versus compactness criteria for the research question. The generalized mean has a parameter (denoted by p) that determines the relative importance of connectedness and compactness. Parameter p is analogous to the scale parameter of Hill diversity (Hill, 1973), which determines the importance of rare and common species in determining the diversity of a community (in fact, Hill diversity can be regarded as a weighted generalized mean of rarity; see Leinster & Cobbold, 2011). The use of the new method is illustrated on artificial point patterns and a widely known real sample data set. Our goal is to show how generalized silhouette width with different p parameters evaluates typical classification patterns; we do not aim to nominate an optimal p parameter value (just as there is no optimal scaling parameter for Hill diversity) or an optimal classification method.

| The original silhouette width
The original definition of silhouette width according to Rousseeuw (1987) is as follows. Let i be a focal object belonging to cluster A. Denote by C a cluster not containing i. a(i) is defined as the average dissimilarity between i and all other objects in A, while c(i,C) is the average dissimilarity between i and all objects in C. Let b(i) be the minimum of c(i,C) over all clusters C other than A, that is, the average dissimilarity between i and the closest other cluster. The silhouette width, s(i), is defined as:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

s(i) ranges between −1 and 1. Values near 1 indicate that object i is much closer to the other objects in the same cluster than to objects of the closest other cluster, implying a correct classification. If s(i) is near 0, the correct classification of the focal object is doubtful, which suggests an intermediate position between two clusters. s(i) near −1 indicates obvious misclassification. Accordingly, averaging silhouette widths over a cluster gives an assessment of the "goodness" of that cluster, and a sample-wise average can be used as an index of the validity of the entire classification. Instead of cluster-wise or sample-wise averages of s(i), the number or proportion of objects with positive silhouette width can also be used as a validity measure. For a cluster containing a single object, s(i) takes the arbitrary value 0.
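The definition above can be illustrated with a minimal Python sketch. It is not the authors' code (their R scripts are in the Supporting Information); the function name `silhouette_widths`, the use of Euclidean distance as the dissimilarity, and the four-point toy data are assumptions made only for illustration. It also assumes at least two clusters are present.

```python
import math

def silhouette_widths(points, labels):
    """Original silhouette width s(i) for every object, using arithmetic means.

    Assumes Euclidean distance as the dissimilarity and at least two clusters.
    """
    n = len(points)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Distances from object i to the members of each cluster, excluding i itself.
        dists = {c: [] for c in clusters}
        for j in range(n):
            if j != i:
                dists[labels[j]].append(math.dist(points[i], points[j]))
        own = dists[labels[i]]
        if not own:
            # Cluster containing a single object: s(i) is set to 0 by convention.
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # a(i): mean distance within the own cluster
        # b(i): mean distance to the closest other cluster
        b = min(sum(d) / len(d) for c, d in dists.items() if c != labels[i])
        denom = max(a, b)
        # Returning 0 when all distances are 0 is an assumption of this sketch.
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Hypothetical toy data: two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = ["A", "A", "B", "B"]
s = silhouette_widths(points, labels)
print([round(x, 3) for x in s])
```

In this toy pattern every object is much closer to its own cluster than to the other one, so all four silhouette widths are equal and clearly positive.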

| Implementing the generalized mean
Applying the arithmetic mean to calculate average within- and between-cluster distances, as the index was introduced originally (Rousseeuw, 1987), implies that the ideal cluster shape is spherical. However, this preference can be relaxed by choosing other types of means. The generalized mean (also called Hölder or power mean) offers a flexible solution to calculate sample means ranging between the minimum and the maximum (Cantrell & Weisstein, 2019). Let X be a sample of positive real numbers $x_1, x_2, \ldots, x_n$ and p an element of the affinely extended real numbers. The generalized mean of degree p is as follows:

$$ M_p(X) = \left( \frac{1}{n} \sum_{k=1}^{n} x_k^{p} \right)^{1/p} $$

For p = 0 and |p| = ∞ the following exceptions are to be made:

$$ M_0(X) = \left( \prod_{k=1}^{n} x_k \right)^{1/n}, \qquad M_{-\infty}(X) = \min(x_1, \ldots, x_n), \qquad M_{+\infty}(X) = \max(x_1, \ldots, x_n) $$

The generalized mean takes the values of well-known summary statistics presented in Table 1. The original version of silhouette width is the special case when within- and between-group average distances are calculated with p = 1. By changing the p parameter, it is possible to emphasize lower or higher distances in the calculation of means. The lower the p parameter is, the more importance is attributed to objects in close proximity, while the effect of farther neighbor objects (including outliers) is reduced. In this way, the criterion of compactness is gradually replaced by connectedness, and clusters with irregular or elongated shape can also be considered "good". At p = −∞, a classification is ideal if each object is assigned to the same cluster as the most similar other object in the sample.
This procedure follows the logic of single linkage clustering, while the original version making use of arithmetic averages followed the logic of average linkage. In contrast, when p > 1, the compactness criterion is attributed higher weight; thus, the preference toward spherical clusters is further increased and the effect of outliers on the overall classification should become more significant. At p = +∞, the clustering criteria of complete linkage are applied.
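The effect of p can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the generalized mean simply replaces the arithmetic mean in a(i) and c(i,C), and the function names (`generalized_mean`, `generalized_silhouette`) and the distance values are hypothetical.

```python
import math

def generalized_mean(values, p):
    """Power (Hölder) mean of positive numbers; p may be -inf, 0, or +inf."""
    values = list(values)
    if p == -math.inf:
        return min(values)
    if p == math.inf:
        return max(values)
    if p == 0:
        # Geometric mean, computed through logarithms (requires positive values).
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def generalized_silhouette(own_dists, other_dists, p):
    """s(i) of one object, with the arithmetic mean replaced by the mean of degree p.

    own_dists: distances from i to the other members of its own cluster.
    other_dists: dict mapping each other cluster to the distances from i to its members.
    """
    a = generalized_mean(own_dists, p)
    b = min(generalized_mean(d, p) for d in other_dists.values())
    return (b - a) / max(a, b)

# Hypothetical object at the end of an elongated cluster: its nearest neighbor is
# close (distance 1), but most members of its own cluster are far away.
own = [1, 2, 3, 4, 5, 6, 7, 8]
others = {"B": [3, 3.5, 4]}  # a compact neighboring cluster

profile = {p: generalized_silhouette(own, others, p)
           for p in (-math.inf, -1, 0, 1, math.inf)}
for p, value in profile.items():
    print(p, round(value, 3))
```

For this object the silhouette width is positive at p = −∞, where only the nearest neighbors count, and becomes negative from p = 0 upward, where distant members of the elongated cluster dominate the within-cluster mean.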

| Data sets and tests
We test the performance of the generalized mean with different parameterization on artificial point patterns and a well-known public data set.
Artificial data sets containing 100 objects and two variables were generated. The data sets represented data structures, some of which were also applied by Podani (2000) to illustrate the behavior of different clustering methods: (a) completely random point pattern without true clustered structure, points on the two sides of the plane are assigned to different clusters (low separation, low compactness); (b) two clusters with few transitional elements between them (moderate separation and compactness); (c) two distinct point aggregations corresponding to two clusters (high separation, high compactness); (d) two overlapping clusters, both containing point duplicates with a little offset, thus each point has a "pair" (or close neighbor) belonging to the same cluster (low separation, high compactness, high local connectedness); (e) the same point pattern but […]

[…] and petal length, for the possibility of plotting the total variation in two dimensions. Species assignment was used as a priori classification. Data were accessed from the vegan (Oksanen et al., 2018) package of the R software (R Core Team, 2017); then, variables were standardized to mean = 0 and standard deviation = 1.
On these data sets, generalized silhouette widths with different p parameter values were calculated using the a priori classifications. p parameters were selected for the tests with the aim of representing the descriptive statistics which are special cases of the generalized mean (Table 1) and being spread evenly across values near zero. Patterns of misclassified objects (i.e., objects with negative silhouette width) on the point scatters were assessed visually.
Overall classification quality was measured by misclassification rate (MR; the number of misclassified objects in the sample divided by the total number of objects) and mean silhouette width (MSW; the sample-wise mean of s(i)).
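The two summary measures can be computed from a vector of silhouette widths as in the short Python sketch below. The function names and the five example values are hypothetical and serve only to illustrate the definitions of MR and MSW given above.

```python
def misclassification_rate(widths):
    """MR: share of objects with negative silhouette width."""
    return sum(1 for s in widths if s < 0) / len(widths)

def mean_silhouette_width(widths):
    """MSW: sample-wise mean of s(i)."""
    return sum(widths) / len(widths)

# Hypothetical silhouette widths of five objects.
widths = [0.8, 0.6, -0.2, 0.1, -0.4]
mr = misclassification_rate(widths)
msw = mean_silhouette_width(widths)
print(mr, round(msw, 3))
```

Two of the five hypothetical objects have negative silhouette widths, so MR is 0.4, while MSW is the plain average of all five values.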
We also evaluated the performance of different classification methods in view of the generalized silhouette width. For this purpose, we used a two-dimensional random point pattern of 1,000 points because we supposed that, in the absence of true cluster structure, the inherent characteristics of the methods would have the strongest influence on the classification. We classified this data set using single linkage, group average, complete linkage, and beta flexible (with beta = −0.25) methods. Silhouette width with different p parameters was calculated at each group number of the hierarchical classifications between 2 and 20; then, mean silhouette widths were compared across group numbers, p parameters, and classification methods. Given the nonclustered structure of this data set, we do not expect a peak in the change of MSW that would indicate an "optimal" cluster level.
Computations were carried out in R (R Core Team, 2017) using the cluster package (Maechler, Rousseeuw, Struyf, Hubert, & Hornik, 2018). Program code for silhouette width using the generalized mean and for generating the artificial data sets is available in the Supporting Information.

| RESULTS
In most of the cases we inspected, mean silhouette width (MSW) decreased with increasing p within each data set. With artificial data, when the point pattern was random, for p parameter values up to zero […]

FIGURE 1 Silhouette width patterns of objects grouped into two clusters with low separation and low compactness. MR, misclassification rate; MSW, mean silhouette width; misclassified objects are circled

[…] With p = −∞, there was no misclassification, and MSW was 0.92.
With increasing p, misclassified objects appeared gradually in the larger cluster near the border of the two clusters, but they were not abundant until p = 3. However, with p = ∞, as many as 33% of all objects were indicated as misclassified, all belonging to the larger group, and MSW was 0.202. With concentric groups, the inner, compact group was considered perfect regardless of the p parameter (Figure 8).
However, the assessment of the outer group varied greatly. With p = −∞, all objects were deemed correctly classified. As p increased, the number of misclassified objects in the outer group increased, too.
With p = 0, misclassified objects made up 23% of the total data set, which means 46% of the outer group. From p = 1 upward, all objects in the outer group were considered misclassified; thus, the data set consisted of a perfect and a totally bad cluster, together giving a 50% correct classification rate. Along the gradient in the parameter value, MSW decreased from 0.92 (p = −∞) to 0.153 (p = ∞).
Similarly to the simulated data, with the Iris data set, misclassification rate increased with increasing p parameter (Figures 9 & 10).

| DISCUSSION AND CONCLUSIONS
The results supported our expectation about the behavior of the silhouette method using the generalized mean. Both artificial data and the Iris data set showed that cluster compactness plays a decreasingly significant role in the assessment of classification validity with decreasing p parameter value. With p << 0, clusters are assessed mainly on the basis of the connectedness and separation criteria. In the extreme case (p = −∞), the index is the relativized difference between the minimal distance to objects belonging to the same cluster and the minimal distance to objects belonging to the closest other cluster, while distances from other members of the same and the neighbor cluster are completely disregarded. As we increase the p parameter, more importance is attributed to more distant objects within and between clusters, that is, to the compactness criterion.
In most cases, mean silhouette width decreased and misclassification rate increased with increasing p parameter value. In other words, these classifications tended to seem decreasingly efficient as the compactness criterion was attributed more and more importance […]

FIGURE 9 Silhouette width patterns of the Iris data set using sepal length and petal length variables after standardization to mean = 0 and standard deviation = 1 with p ranging from −∞ to 0. MR, misclassification rate; MSW, mean silhouette width; misclassified objects are circled

[…] such pitfalls, we advise calculating silhouette width with different values of p, which could be a logically similar procedure to calculating diversity profiles using scalable diversity measures (Tóthmérész, 1995 […]

Future research should explore the possibility of adapting the generalized mean into other developments of the silhouette width (e.g., for fuzzy classifications, Campello & Hruschka, 2006) and its applicability as a classification criterion (e.g., in the OPTSIL algorithm, Roberts, 2015).

ACKNOWLEDGMENTS
The

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
A.L. developed the idea and the methodology, wrote the scripts, conducted data analysis, and wrote the manuscript; Z.B.D. developed the idea, reviewed literature, commented on the results, and improved the manuscript.

DATA AVAILABILITY STATEMENT
Scripts for calculating the generalized mean and generating specific point patterns are enclosed in the Supporting Information. The Iris data set is available from the vegan package of R.