Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure

Abstract

Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing‐based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance‐based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.


| INTRODUCTION
In data-rich scientific studies, it is often necessary to apply a clustering algorithm to detect groups of homogeneous objects with respect to a set of descriptors (i.e., measured variables). Detection of groups is useful in ecology, economics, genetics, and other disciplines that analyze large, multidimensional datasets. Clustering techniques for multivariate datasets are diverse and can be drawn from methods derived from one or more of the following approaches: sequential versus simultaneous, agglomerative versus divisive, monothetic versus polythetic, hierarchical versus nonhierarchical, probabilistic versus nonprobabilistic, and constrained versus unconstrained (Legendre & Legendre, 2012). In many cases, these methods are sensitive to the sequence of the steps within the algorithm, to random decisions enforced by the algorithm, or to arbitrary assignment of stopping rules, numbers of clusters, or levels of resemblance that define homogeneity.

| Resemblance profiles and clustering criterion
Multivariate studies of complex datasets are often analyzed statistically using distance-based (db) methods. These db-methods begin with a series of pairwise comparisons between all objects to determine their relative resemblances with respect to a set of descriptors, and these resemblance values can be interpreted as either similarity or dissimilarity. The selection of a resemblance measure is discretionary and varies with the type of data being analyzed as well as the method of analysis (Batagelj & Bren, 1995; Clarke, Somerfield, & Chapman, 2006; Faith, Minchin, & Belbin, 1987). Clarke, Somerfield, and Gorley (2008) developed the SIMPROF routine based on the concept of a "similarity profile," which represents the matrix of pairwise similarity values between any set of objects.
SIMPROF-based studies have also been conducted on dinoflagellates and ciguatera poisoning (Parsons, Settlemier, & Ballauer, 2011), food webs (Kelly & Scheibling, 2012), habitat classifications (Gonzalez-Mirelis & Buhl-Mortensen, 2015; Valesini, Hourston, Wildsmith, Coen, & Potter, 2010), species/environment relationships (Travers, Potter, Clarke, & Newman, 2012), metagenomics (Khodakova, Smith, Burgoyne, Abarno, & Linacre, 2014), and otolith elemental microchemistry (Moore & Simpfendorfer, 2014). While the preceding literature review reflects the recent use of the algorithm in ecological applications, it is likely that the method has uses in other disciplines as well. Clarke et al. (2008) demonstrated the use of SIMPROF in conjunction with agglomerative hierarchical clustering via the unweighted pair group method with arithmetic mean (UPGMA; Figure 1), and they also described two theoretical corollaries to the functional dynamics of their algorithm. They proposed (1) that the test for multivariate structure would become more powerful as the number of descriptors increased and (2) that the resolution of any structure identified (i.e., number of groups, G) might be far finer (greater) than is meaningfully interpreted (Clarke et al., 2008). It is our understanding that these corollaries have yet to be tested empirically with numerical simulations, and given recent inconsistencies in the performance of other permutation- and distance-based hypothesis tests (e.g., ANOSIM and MANTEL tests; Anderson & Walsh, 2013; Legendre & Fortin, 2010), we felt this action was warranted.
FIGURE 1 Theoretical diagram of the process flow for DISPROF clustering with UPGMA: (1) Data are pretreated and configured. (2) An appropriate resemblance metric is applied to the pretreated dataset. (3) The UPGMA site-connection linkage is assembled. (4) DISPROF is employed in an iterative process to identify the grouping structure in the data and create breaks in the associated linkage tree. (5)

The present paper intends to improve our understanding of the proposed corollaries to the Clarke et al. (2008) approach, to help users of SIMPROF avoid potential pitfalls during analysis and interpretation, and to encourage use of the method outside of the ecological focus.
We tested the SIMPROF method by estimating and describing the type I and type II error rates for the hypothesis test for multivariate structure while varying the datasets' distribution type, dimensionality, data-cloud overlap between adjacent clusters, and data-cloud shape or overdispersion. We also elucidated the effects of dataset configuration variability on the quality of the solution achieved by examining the level of correspondence between the algorithm's clustering solutions and the known grouping partitions for datasets with structure.

| Review of the SIMPROF approach
For a set of objects, a similarity profile is created by plotting the rank-ordered similarity values versus each value's rank (Figure 2a). This profile is ultimately checked against the mean rank-ordered similarity values for many randomized profiles (i.e., ≥1,000) created via permuting the original descriptor measurements across objects. The π statistic is created by summing the absolute deviations of the observed profile from the mean of the set of permuted profiles. Intuitively, one can see that if an observed profile has many more high and/or low similarity values than would be expected under the null conditions, then multivariate structure would be deemed present (Figure 2b). The null hypothesis (H0) of "no multivariate structure among objects, with respect to the descriptors" in the original dataset is formally tested by examining the placement of the observed π statistic relative to the null distribution of all permuted π statistics. To model the null distribution of the π statistic, an additional set of permuted similarity profiles (i.e., ≥1,000 iterations) is created, and their associated π statistics are calculated with respect to the same mean profile used to calculate the original observed π statistic. The p-value for the observed π statistic is calculated as the proportion of permuted π statistics that are at least as large as the observed statistic, relative to the total number of π statistics calculated via permutation (Clarke et al., 2008).
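The profile-and-permutation procedure above can be summarized in code. The following Python sketch is illustrative only (it is not the PRIMER SIMPROF or Fathom Toolbox implementation); it assumes Euclidean distances, modest permutation counts, and function names of our own choosing.

```python
import numpy as np

def sorted_profile(X):
    """Rank-ordered vector of pairwise Euclidean distances (the 'profile')."""
    n = X.shape[0]
    d = [np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n)]
    return np.sort(d)

def disprof(X, n_mean=100, n_null=99, seed=0):
    """Pi statistic and permutation p-value for the test of structure.

    Each descriptor (column) is permuted independently across objects,
    breaking object-level associations while preserving every
    descriptor's marginal distribution.
    """
    rng = np.random.default_rng(seed)

    def permuted_profile():
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        return sorted_profile(Xp)

    # Mean rank-ordered profile expected under the null hypothesis
    mean_prof = np.mean([permuted_profile() for _ in range(n_mean)], axis=0)
    pi_obs = np.abs(sorted_profile(X) - mean_prof).sum()

    # Null distribution of pi, computed against the same mean profile
    pi_null = [np.abs(permuted_profile() - mean_prof).sum() for _ in range(n_null)]
    p_value = (1 + sum(v >= pi_obs for v in pi_null)) / (1 + n_null)
    return pi_obs, p_value
```

With two well-separated groups the observed profile holds many more extreme distances than the permuted mean profile, so π is large and the p-value small; for pure noise, π falls within the null distribution.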
Resemblance profile consideration is inserted into UPGMA clustering as a clustering decision criterion in an iterative process ( Figure 1). The data are required to be in [N × P] matrix format, where the N rows represent individual objects (sampling units) and the P columns of the matrix represent the descriptors (measured variables). In many real-world, large datasets, there are often some objects where certain descriptor measurements are missing due to either technical failure or human error. When compiling these data, we must remove objects that do not contain an accurate measurement for all descriptors of interest (zero-value measurements may be appropriate, but missing measurements are not). Once the data are assembled and checked for quality, user-defined pretreatments are applied (e.g., standardization and/or normalization) and an appropriate resemblance measure is employed. One advantage to the approach considered here is the use of distribution-free statistics, which releases the analyst from the often-unrealistic assumption of Gaussian data distributions, and decreases the need for data transformations to satisfy those assumptions. Another advantage to using distribution-free significance tests is that they are often generalized to accept any of the potential pool of resemblance measures available to researchers (Legendre & Legendre, 2012).
After a square, symmetric distance-matrix is produced, an UPGMA clustering solution is constructed to reflect the magnitude of apparent resemblance between the objects with respect to the descriptors.
SIMPROF can be used as an iterative decision criterion to assess each node of the UPGMA dendrogram to determine whether the objects connected by any node are clusters of relative homogeneity, or whether there is additional multivariate structure present in those remaining objects (Clarke et al., 2008).
Due to the multiple-testing aspect of the algorithm, a p-value correction method can be employed when determining significance for tests between sets of objects (Clarke et al., 2008). The primary output of UPGMA clustering with SIMPROF is a grouping partition containing a cluster assignment for each object. Using this decision framework creates immediate advantages when interpreting the clustering dendrogram in that (1) the researcher is no longer required to arbitrarily assign a single level of similarity that defines all clusters and (2) the clusters can be defined by varying levels of similarity. To obtain a two-dimensional ordination of the identified groups in hyperdimensional space, a Euclidean embedding can be produced via principal coordinate analysis (PCoA; Gower, 1966). This ordination is based on the same symmetric resemblance matrix used in the clustering process, and the group assignments can be overlain in place of the object labels to present a final clustering diagram.
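The iterative decision loop can be sketched by pairing SciPy's average-linkage (UPGMA) tree with a permutation test at each node. This is a deliberately reduced, illustrative version of the Clarke et al. (2008) scheme, with our own function names, Euclidean distances assumed, and no p-value correction applied.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

def _structure_pvalue(X, n_perm=49, seed=0):
    """Compact DISPROF-style permutation p-value (Euclidean distances)."""
    rng = np.random.default_rng(seed)
    prof = lambda A: np.sort(pdist(A))
    perm = lambda: prof(np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]))
    mean_prof = np.mean([perm() for _ in range(n_perm)], axis=0)
    pi_obs = np.abs(prof(X) - mean_prof).sum()
    pi_null = [np.abs(perm() - mean_prof).sum() for _ in range(n_perm)]
    return (1 + sum(v >= pi_obs for v in pi_null)) / (1 + n_perm)

def disprof_cluster(X, alpha=0.05):
    """Assign each object a group label: build the UPGMA (average linkage)
    tree, then descend from the root, splitting a node only while its
    members still test positive for multivariate structure."""
    root = to_tree(linkage(pdist(X), method='average'))
    labels = np.zeros(X.shape[0], dtype=int)
    next_label = [1]

    def visit(node):
        idx = np.array(node.pre_order())
        if node.is_leaf() or len(idx) < 3 or _structure_pvalue(X[idx]) > alpha:
            labels[idx] = next_label[0]   # homogeneous: one finished group
            next_label[0] += 1
        else:                             # structured: split at this node
            visit(node.left)
            visit(node.right)

    visit(root)
    return labels
```

Because each subtree is tested separately, the resulting clusters are defined at varying levels of resemblance, which is the advantage noted above.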

| Rationale
The only modification we made to the original Clarke et al. (2008) algorithm was to use dissimilarities (or distance) for the computation of the resemblance profile; this convention is consistent with the Fathom Toolbox for MATLAB (Jones, 2015), which was used for our testing and evaluations, and is advantageous because dissimilarity measures span a broad range of types (i.e., metric, nonmetric, or semi-metric) that can be applied to a diversity of potential research disciplines. These types of resemblance measures also allow ordination of the objects via multidimensional methods, which require db-resemblance measures, and are intuitively interpreted with two objects' spatial "closeness" in ordination space as being more similar (i.e., less dissimilar). Because similarity profiles and dissimilarity profiles are analogous, we refer to "DISPROF" hereafter.
To test the effectiveness of DISPROF at detecting the presence of multivariate structure among objects, we used simulated datasets with both unstructured and structured sets of descriptors, under four different simulation scenarios (Table 1). We attempted to simulate data that would be applicable to a range of numerical studies including, but not limited to, the ecological type of data that SIMPROF was initially developed for ( Table 2). The unstructured data were simulated with a single grouping structure present and were used for estimating type I error rates for DISPROF; the structured data were simulated with known groups among objects and were used to estimate type II error rates and the power of the hypothesis test. Structured data were also used to examine the effects of descriptor overdispersion in ecological count data, as well as the effects of increasing numbers of descriptors and the type of correlation structure among them. We retained the grouping partitions from the structured data simulations, and doing so allowed us to test the correspondence between the clustering solutions achieved by the UPGMA with DISPROF algorithm and these baseline partitions. The criterion for rejecting H o in this simulation study was set at α = .05, and we opted to use a progressive Bonferroni p-value correction (Legendre & Legendre, 2012) for instances where repeated hypothesis testing was conducted (i.e., simulated structured data testing).
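As we read the progressive Bonferroni procedure of Legendre and Legendre (2012), the k-th test in a sequence of repeated tests is evaluated against α/k, so later tests face a stricter criterion. A minimal sketch (the function name is ours):

```python
def progressive_bonferroni(pvalues, alpha=0.05):
    """Sequential correction: the k-th test (1-indexed) in the series is
    judged against alpha / k, so later tests face a stricter criterion."""
    results = []
    for k, p in enumerate(pvalues, start=1):
        results.append(p <= alpha / k)
    return results
```

For example, with α = .05 a p-value of .04 passes as the first test in a sequence but fails as the second, where the criterion tightens to .025.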
All data simulations were coded in MATLAB using the Fathom Toolbox (Jones, 2015), the OCLUS routine (Steinley & Henson, 2005), and the Darkside Toolbox (Kilborn, 2015). To complete the algorithm testing described below, we used the University of South Florida Research Computing high-performance computing hardware running MATLAB v. 2016 and used an experimental MATLAB module from the Fathom Toolbox called "ClustX."

| Data simulation methods
In all simulations, varying size conditions for the resultant data matrices were used, and this allowed us to investigate the effects of changing the numbers of objects (N) and dataset dimensionalities (P, number of descriptors) on DISPROF's performance, and also the quality of the clustering solutions achieved by the algorithm. S = 1,000 datasets were simulated for each combination of [N × P] under additional simulation scenarios described in Table 1. The simulation scenarios allowed further investigation of DISPROF's performance regarding variation in (1) the underlying probability distribution of the data; (2) the amount of overlap between groups' data clouds; (3) the location and dispersion among groups of objects representing ecological abundance data; and (4) correlation structures among descriptors within groups of objects.

| Unstructured data (Sim 1)
The first set of simulations was used to estimate type I error rates for the DISPROF routine for data drawn from eight different probability distributions (Table 1). Each probability distribution was used to simulate a specific data type, and the properties of the simulated data informed the choice of resemblance measure (Table 2). For each statistical distribution, S = 40,000 unstructured datasets were simulated across all combinations of [N × P]. A total of 320,000 independently generated unstructured datasets were used to complete the type I error rate estimations. Within each set of S = 1,000 equally sized datasets, the columns were individually parameterized at random from a range of values specific to the underlying probability distribution (Table 1).
Instances where random processes produced objects with all zero-value entries were allowed to persist in the data, and they were treated as a special case during the calculation of Bray-Curtis and Jaccard dissimilarity matrices. In this special case, any comparison of two objects with all zero-value entries was assigned a dissimilarity of one (i.e., perfectly dissimilar), as they share no common variability (Anderson & Walsh, 2013; Warton & Hudson, 2004).
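The all-zero special case matters because the Bray-Curtis denominator vanishes when both objects are entirely zero. A minimal NumPy sketch of the convention described above (illustrative, not the Fathom Toolbox implementation):

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity with the special case used here:
    two all-zero objects are defined as perfectly dissimilar (1.0)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = (x + y).sum()
    if denom == 0.0:          # both objects contain only zeros
        return 1.0
    return np.abs(x - y).sum() / denom
```

Identical nonzero objects score 0, fully disjoint objects score 1, and the double-zero pair is forced to 1 rather than left undefined.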

| Structured data-overlapping groups (Sim 2)
The second set of simulations was designed to examine the effects of dataset configuration, as well as the average amount of overlap per dimension between the data clouds that represent grouped objects, on the DISPROF routine and its grouping solutions. We used an established data simulation routine described by Steinley and Henson (2005), called OCLUS, to produce a total of 450,000 datasets with overlapping grouping structures.

TABLE 1 (note) P ∈ {3, 5, 10, 25, 50, 150, 225, 300}. For each scenario, S = 1,000 datasets were simulated, and mean dissimilarity profiles (DISPROF) were obtained with 1,000 permutations; the p-values for the test were calculated with 999 permutations (α = .05). Variables are as follows: G, total number of groups; N, total number of objects; P, total number of descriptors; T, number of successful trials; df, degrees of freedom; μi, mean for all descriptors in group i; λ, Poisson rate parameter; σ²i, variance for all descriptors in group i; q, probability of success for a trial; θi, overdispersion parameter for all descriptors in group i; Σi, correlation among descriptors in group i; Ov, average overlap per axis between data clouds for G1 and G2. (a) Where θ = 0, μ = σ², and the negative binomial distribution reduces to the Poisson.

TABLE 2 Probability distributions used in Sim 1-Sim 4: the representative data type and the resemblance measure used to determine the pairwise distance between objects. No data were transformed prior to applying the resemblance measure.

The OCLUS routine implementation in MATLAB allowed configuration of the probability distribution type, the number of groups (G) and whether or not they overlap, the number of objects per group (ni), and the average amount of group overlap across all dimensions (Ov) between groups of objects in hyperdimensional space. Note that Ov for the entire dataset is evenly distributed across all dimensions, and two major assumptions of the OCLUS routine are (1) that all dimensions are independent and (2) that all groups are independent (Steinley & Henson, 2005). When simulating all structured data with multiple groups (Sim 2-Sim 4), a simple design was employed in which two groups (G = 2) with n1 = n2 = 25 (N = 50) objects were simulated. In Sim 2, for each [N × P] configuration the average overlap between the two groups was increased progressively from Ov = 0.01 to 0.50, in 0.01 increments. S = 1,000 datasets were simulated for each [N × P × Ov] configuration. Descriptor data were drawn from the multivariate normal distribution with equal variances (σ²1 = σ²2 = 1) for both groups (Anderson & Walsh, 2013; Steinley & Henson, 2005). Normally distributed data were used to examine type II error because, with the data simulated in a known grouping configuration, the concern that the underlying probability distribution would impart some unknown structure was negligible. As cluster analysis falls into the category of "exploratory" data analysis, the amount of overlap between objects in a sampled dataset, or any inherent grouping structure, is unknown at the time of testing. Therefore, it is important to understand the empirical effects of group location and overlap on clustering solutions if we are to put any faith in the solutions provided by the algorithm.
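OCLUS controls overlap analytically from the component densities. For the special case of two unit-variance normal marginals separated by Δ, the shared area under the two densities is 2Φ(−Δ/2), which can be inverted to choose the separation for a target Ov. The sketch below is a rough stand-in for that one case, not the OCLUS routine itself; all names are ours.

```python
import numpy as np
from scipy.stats import norm

def mean_separation(ov):
    """Mean separation Delta giving overlap area `ov` between two
    unit-variance normal densities: overlap = 2 * Phi(-Delta / 2)."""
    return -2.0 * norm.ppf(ov / 2.0)

def simulate_two_groups(n_per_group, p, ov, seed=0):
    """Two clusters (G = 2) of N(., 1) descriptors with the same
    per-axis separation on every axis, targeting overlap `ov`."""
    rng = np.random.default_rng(seed)
    delta = mean_separation(ov)
    g1 = rng.normal(0.0, 1.0, size=(n_per_group, p))
    g2 = rng.normal(delta, 1.0, size=(n_per_group, p))
    return np.vstack([g1, g2])
```

For Ov = 0.5 the implied separation is about 1.35 standard deviations; as Ov → 0 the groups move arbitrarily far apart.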

| Structured data-overdispersed descriptors (Sim 3)
The third simulation scenario also indirectly dealt with group location, but the main focus of these simulations was determining the effect on DISPROF of increasing the overdispersion of one group while holding the other group constant, for ecological frequency data (i.e., abundances or counts). We used the Fathom Toolbox for MATLAB to implement ecological-data simulation scenarios similar to those used by Anderson and Walsh (2013), and in Sim 3, we simulated ecological abundance data drawn from the overdispersed negative binomial and/or Poisson distribution (Tables 1 and 2). These data were simulated such that the variance (σ²) greatly exceeded the mean (μ), with σ² related to μ by σ² = μ + θμ², where θ is the overdispersion parameter. In cases where σ² = μ, the data were drawn from the Poisson distribution; otherwise, the data were drawn from the negative binomial distribution. In Sim 3a, we simulated a total of 36,000 datasets with G = 2 and μ1 = μ2 = 10 (collocated groups), and we induced heterogeneity between the groups by increasing the overdispersion for the descriptors in G2. In Sim 3b, we maintained the group heterogeneity from increasing θ2 when we simulated an additional 36,000 datasets with G = 2, but in this scenario, we set μ1 = 10 and μ2 = 30 (separated groups). For all [N × P] configurations, four different combinations of θ1 and θ2 were used to simulate S = 1,000 datasets for each [N × P × (θ1 and θ2)] configuration (Table 1). In Sim 3, the simulated ecological count datasets had no overdispersion in G1 and increasing θ in G2, with the groups either collocated in hyperdimensional space (Sim 3a) or in separate locations (Sim 3b). It should be noted, however, that this method does not account for data-cloud overlap, and it is possible that two simulated groups that do not share a mean value could still overlap if the θ parameter were extremely high.
We tested values ranging from zero overdispersion, to low (θ = 0.1), to medium (θ = 0.4), to high (θ = 0.9).
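The μ-θ parameterization above maps onto NumPy's negative binomial sampler via n = 1/θ and p = n/(n + μ), which recovers mean μ and variance μ + θμ². A short sketch (our helper, not the Fathom Toolbox routine):

```python
import numpy as np

def sample_counts(mu, theta, size, rng=None):
    """Counts with mean mu and variance mu + theta * mu**2.

    theta == 0 reduces to the Poisson distribution; otherwise we use
    NumPy's negative binomial with n = 1/theta and p = n / (n + mu).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if theta == 0:
        return rng.poisson(mu, size)
    n = 1.0 / theta
    p = n / (n + mu)
    return rng.negative_binomial(n, p, size)
```

With μ = 10 and θ = 0.4 (the "medium" case above), the target variance is 10 + 0.4 × 100 = 50, five times the Poisson variance at the same mean.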

| Structured data-increasing correlation (Sim 4)
The fourth set of simulations was used to examine the effects of correlated descriptors within a group of objects on DISPROF and its clustering outputs. We simulated data with different correlation structures (Σ) between descriptors in G1 and G2, where Σ2 increased in G2 (Sim 4a), and also with Σ1 = Σ2 but still increasing Σ (Sim 4b; Table 1). In both cases, we simulated data drawn from the multivariate normal distribution with μ1 = 10, μ2 = 30, and σ²1 = σ²2 = 1. The square, symmetric correlation matrices Σ were built such that each descriptor was correlated with all other descriptors in the dataset by the proportion listed in Σ. Sim 4 examines data with correlated descriptors whose level of correlation varies from no correlation (Σ = 0), to medium (Σ = 0.6), to high correlation (Σ = 0.9).
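A group of this kind can be sketched by feeding a compound-symmetric correlation matrix, with every off-diagonal element equal to the chosen Σ, to a multivariate normal sampler (illustrative only; the function name is ours):

```python
import numpy as np

def corr_group(mu, rho, n, p, seed=0):
    """n objects by p descriptors: mean mu, unit variances, and a
    compound-symmetric correlation matrix with off-diagonals rho."""
    rng = np.random.default_rng(seed)
    sigma = np.full((p, p), rho)   # every pair of descriptors shares rho
    np.fill_diagonal(sigma, 1.0)   # unit variances on the diagonal
    return rng.multivariate_normal(np.full(p, mu), sigma, size=n)
```

Stacking `corr_group(10, 0.0, 25, p)` and `corr_group(30, 0.6, 25, p, seed=1)` would mimic one Sim 4a configuration with G = 2.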

| Power, resolution, and correspondence estimation
As all datasets in Sim 2-Sim 4 had G = 2, we estimated the proportion of type II errors for each [N × P × Ov], [N × P × (θ1 and θ2)], and [N × P × (Σ1 and Σ2)] configuration by counting the instances, per S = 1,000, where H0 was retained at α = .05 (i.e., no multivariate structure deemed present). Type II error estimates were converted to power, and values ≥0.80 were considered acceptable at our selected confidence level (Cohen, 2013). As our primary interest was in exploring the efficacy of using DISPROF as a clustering criterion, we examined the first iteration of sequential testing of H0 (to record type II error rates), but we also allowed all subsequent DISPROF iterations to run until the clustering implementation was completed. This unconstrained approach allowed the UPGMA clustering with DISPROF algorithm to settle on complete clustering solutions with the maximum number of groups that could be discovered. The final result of each DISPROF clustering attempt was a partition of the simulated objects that identified each object's group membership. In all cases, G and the generated grouping partition were retained for further analysis. The number of groups identified was used to examine the effective resolution of the clustering solution, with larger values of G indicating finer resolution and smaller values indicating coarser resolution. The grouping partitions were used to compare the computed results against the known reference partition for each structured dataset simulated. The correspondence between the clustering solutions' partitions and their reference partitions was calculated using the Hubert-Arabie adjusted Rand index (ARI HA). This effort was undertaken because of the importance of a clustering algorithm being able to find the "correct" structure in the data.
The absolute value of ARI HA ranges from 0 to 1, requires a probabilistic interpretation, and measures the likelihood of agreement between one randomly chosen pair of objects represented in both partitions, corrected for chance (Hubert & Arabie, 1985). Negative ARI HA values can be interpreted as a probability of agreement that is less than what would be expected by chance alone. We interpreted ARI HA values ≥0.80 as "good" correspondence with anything above 0.90 being "excellent." Likewise, ARI HA values <0.80 were interpreted as "moderate" correspondence, and values below 0.65 were interpreted as "poor" correspondence (Steinley, 2004).
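The index can be computed from the contingency table of the two partitions. A self-contained sketch following Hubert and Arabie (1985) (our implementation, not the one used in the study):

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Hubert-Arabie adjusted Rand index between two partitions, each
    given as one label per object; chance-corrected so that random
    labelings score near 0 and identical partitions score 1."""
    comb2 = lambda m: m * (m - 1) / 2.0
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    # Contingency table: objects shared by each pair of cluster labels
    table = np.array([[np.logical_and(a == i, b == j).sum() for j in ub]
                      for i in ua])
    sum_ij = sum(comb2(nij) for nij in table.ravel())
    sum_a = sum(comb2(n) for n in table.sum(axis=1))
    sum_b = sum(comb2(n) for n in table.sum(axis=0))
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:   # both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that the score depends only on which objects are grouped together, not on the label values themselves, so a relabeled copy of a partition still scores 1.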

| RESULTS

| Unstructured data (Sim 1)
The mean estimated type I error rates for DISPROF were within the confidence interval that would be expected for the chosen level of α = .05 for all simulated unstructured data, regardless of the base probability distribution that the data were drawn from (Table 3). There was also no apparent effect of the number of objects or descriptors on the type I error rates for DISPROF (Figure 3).

| Structured data-overlapping groups (Sim 2)
The mean power values for each P-dimension, calculated from the 50 proportions of type II errors estimated for each [N × P × Ov] configuration (S = 1,000), showed an increase in the power of DISPROF to detect the presence of multivariate structure as the overall dimensionality of the dataset increased (Table 4). A closer look at each P-dimension's power values (Figure 4) showed that, for P ≤ 10, as Ov decreased, the statistical power of DISPROF increased asymptotically from unacceptable levels toward 1. For all values of P ≥ 25, the power was estimated to equal 1 for all Ov. Furthermore, for any given Ov the power increased as P increased. The average number of groups (Ḡ) per S = 50,000 datasets from all [N × P] configurations across all 50 Ov levels was similar across all P, ranging from a minimum Ḡ = 1.81 (P = 2) to a maximum Ḡ = 2.16 (P = 5; Table 4). Closer inspection of each [P × Ov] combination (S = 1,000) revealed that DISPROF clustering solutions where P ≤ 3 displayed an increase in Ḡ as Ov decreased. Ḡ increased from a value of Ḡ < 2 and asymptotically approached the mean of Ḡ for all clustering solutions within a given [P × Ov] combination. For all P ≥ 5, Ḡ values remained above 2 for all Ov and were much more tightly bound around their respective means (Figure 5a, Table 4). The mean correspondence values (ARI HA) for each S = 50,000 datasets from all [N × P] configurations across all Ov increased as P increased (Table 4), and for any single Ov level, the ARI HA also increased with P (Figure 5b). A more detailed view of ARI HA within each P-dimension (Figure 5b) indicated that for P ≤ 5 the mean ARI HA values persisted below 0.8 for the majority of Ov scenarios, but had a generally increasing trend, eventually reaching high correspondence values at low levels of Ov. For all P ≥ 10, mean ARI HA values remained high across all Ov levels (Figure 5b).

| Structured data-overdispersed descriptors (Sim 3)
The power of DISPROF within all [P × (θ 1 and θ 2 )] configurations where θ 2 > 0 increased with P until a threshold value of P was met, and for the remaining dimensions where P ≥ P threshold , the power was 1. The value of P threshold decreased as θ 2 increased and the difference in spread of the two groups became more pronounced (Table S1). The mean number of groups identified in Sim 3b across all [P × (θ 1 and θ 2 )] configurations where θ 2 < 0.9 was approximately 2 (the correct number), and there was no apparent effect of increasing P or θ 2 when the two groups were sufficiently separated in hyperdimensional space (Table 5). For simulations where θ 2 = 0.9, Ḡ increased from ~2.5 groups identified per 1,000 datasets at P = 2, to ~4 groups at P = {5, 10}, after which the value of Ḡ tapered off to around 2 starting at P = 150 (Table 5). The mean correspondence values for scenarios where θ 2 = {0, 0.1} remained excellent for all P; where θ 2 ≥ 0.4, the ARI HA increased with P (Table 6). In Sim 3a, where μ 1 = μ 2 , DISPROF clustering, on average, never settled on the solution of G = 2. When θ 1 = θ 2 = 0, all P returned Ḡ = 1 (as the two groups were effectively identical), but for all other [P × (θ 1 and θ 2 )] configurations where θ 2 > 0, as P increased so did the value of Ḡ (max Ḡ = 28 groups, Table 5). The same pattern was observed in the ARI HA values for Sim 3a as was seen for Ḡ ; for all θ 1 = θ 2 = 0 scenarios, the ARI HA = 0, and for all other levels of θ 2 the ARI HA values increased along with P (Table 6), reaching their maximum values around 1 when P ≥ 25.

| Structured data-correlated descriptors (Sim 4)
For all P, when both groups had no correlation structure, Ḡ was consistently ~2 and ARI HA values were excellent; where at least one group had correlation structure among its descriptors, Ḡ increased and the ARI HA decreased as P increased (Table 7). For all P where the correlation structure for either group was Σ ≥ 0.6 (medium to high), DISPROF produced clustering solutions where Ḡ increased with P (Table 7). However, in those same scenarios, the ARI HA decreased as P increased, and it should be noted that none of the simulation scenarios in Sim 4a or 4b that included any amount of within-group descriptor correlation returned clustering solutions with an ARI HA ≥ 0.8 for any P ≥ 5.

| DISCUSSION
The DISPROF algorithm is designed to test the H0 that there is "no multivariate structure among objects, with respect to a set of descriptors" in a dataset. The utility of deploying the algorithm with a clustering technique such as UPGMA is in (1) the reduction of arbitrary decision criteria (i.e., dissimilarity thresholds for group identification); (2) the ability to assess multivariate structure at multiple levels of resemblance; (3) the inclusion of the frequentist approach to hypothesis testing; and (4) the application of db multivariate statistical techniques.
As such, it is important to determine where UPGMA clustering, with DISPROF implemented as a decision criterion, is affected by changes in data configuration, distribution, dispersion, and correlation. We were particularly interested in statistical error rates associated with DISPROF and the resolution and correspondence of the grouping solutions provided by DISPROF with UPGMA under a variety of potential data scenarios.

| Type I error
When assessing the DISPROF algorithm's H0, there appears to be no effect of distribution type or [N × P] configuration on type I error rates. The mean type I error rates for all [N × P] within each probability distribution type fell within acceptable ranges for the expected number of rejections (α = .05). As DISPROF correctly failed to reject H0 with acceptable levels of type I error, it is, therefore, reasonable to assume that there is a low likelihood that the underlying probability distribution will impart some sort of unknown grouping structure to the dataset (e.g., where some unwanted noise structure might elevate false positives). This is notable given that these techniques were developed for ecological datasets such as those tested in Sim 1f, but they appear to be applicable to many common data types collected by different lines of scientific inquiry (Tables 1 and 2). However, the activity displayed by DISPROF in Sim 3a and Sim 4 leads us to believe that further investigation may be required for datasets with high levels of overdispersion or correlation among descriptors. In these cases, misclassification appears to increase along with both θ and Σ, and is exacerbated by increases in P (Tables 6 and 7). These findings are also notable as overdispersion and correlation are two common qualities of ecological datasets.

| Power
The power of DISPROF to detect structure in data is generally poor with low-dimensional (P ≤ 5) multivariate normal data, and with low-dimensional (P ≤ 10) ecological count data where μ1 = μ2, the latter being expected as this configuration can be interpreted as G = 1. As DISPROF performed decidedly better when μ1 = 10 and μ2 = 30, it follows that the hypothesis test relies heavily on the location parameter when assigning group membership, and when heterogeneity of groups is defined only by overdispersion, the two are confounded by the algorithm. A similar response to collocated sets of heterogeneous objects was observed during empirical investigation of ANOSIM and the MANTEL test (Anderson & Walsh, 2013). The power of DISPROF improves dramatically once P ≥ 25, and increases with greater separation between groups in hyperdimensional space. With group separation in hyperspace, the power of DISPROF to evaluate H0 is unaffected by increasing the overdispersion in ecological data, and the test for structure is able to correctly identify the presence of groups in virtually all simulated datasets where μ1 = 10 and μ2 = 30. The presence of correlation structure among the descriptors within any group also has no noticeable effect on the power of DISPROF to detect structure.
The power of DISPROF is excellent in most cases and, as Clarke et al. (2008) predicted, its ability to detect structure becomes more powerful as the dimensionality of the descriptors increases; we therefore found their corollary (1) to be supported. A potential explanation for the increase in power observed with increasing P may be related to the idea of a group's identity, or the unique combination of numerical values that quantitatively represents a set of objects (i.e., their "fingerprint"). The more descriptors used to quantify an object, the less likely it is that the unique fingerprint describing a group of similar objects could be re-created by chance. Therefore, during the randomization process of the DISPROF test, and with a large enough P, breaking the structure in the original data to create the null distribution for the test statistic is relatively easy to do. This is essentially the overfitting problem in reverse (Babyak, 2004; Hawkins, 2004). This overfitting is appropriate because it essentially creates highly unique observed resemblance profiles against which to test for structure, and because no extrapolation or interpolation is based on the overfitted identity. Any unique group identity exposed in the dataset
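The randomization logic described above can be sketched in a few lines. The following is a minimal, illustrative SIMPROF/DISPROF-style permutation test, not the MATLAB implementation used in this study: it builds the sorted dissimilarity profile, permutes each descriptor independently to break multivariate structure, and compares the observed profile's departure (π) against the permuted null profiles.

```python
import numpy as np

def dissimilarity_profile(X):
    """Sorted vector of all pairwise Euclidean dissimilarities among rows."""
    n = X.shape[0]
    d = [np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n)]
    return np.sort(d)

def disprof_test(X, n_perm=199, rng=None):
    """Simplified DISPROF-style permutation test for multivariate structure.

    Permuting each descriptor (column) independently destroys any
    association among objects; pi measures the departure of the observed
    profile from the mean permuted (null) profile.
    """
    rng = np.random.default_rng(rng)
    observed = dissimilarity_profile(X)
    null_profiles = np.array([
        dissimilarity_profile(np.column_stack([rng.permutation(c) for c in X.T]))
        for _ in range(n_perm)
    ])
    mean_profile = null_profiles.mean(axis=0)
    pi_obs = np.abs(observed - mean_profile).sum()
    pi_null = np.abs(null_profiles - mean_profile).sum(axis=1)
    # permutation p-value (add 1 to count the observed arrangement itself)
    p = (1 + np.sum(pi_null >= pi_obs)) / (1 + n_perm)
    return pi_obs, p
```

With well-separated groups, the observed profile departs strongly from the mean permuted profile and the returned p-value is small; with unstructured data, the observed profile resembles the permuted ones and the test fails to reject.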

| Resolution and correspondence of DISPROF
If either of the theoretical corollaries presented by Clarke et al. (2008) were to be considered cautionary, it would be corollary (2), which regards the resolution of DISPROF solutions being finer than ecologists (or any professional) utilizing the method could interpret meaningfully. We further contend that the correspondence between these grouping partitions and any known grouping structure in the simulated datasets is informative and indicative of the DISPROF clustering method's ability to settle on "meaningful" solutions. Therefore, any discussion of the issues surrounding the resolution of the grouping solutions is incomplete without also discussing their correspondence with reality (i.e., "correctness").

FIGURE 5 The relationship of Ḡ and ARI_HA with Ov for DISPROF clustering: (a) The mean number of groups identified (Ḡ) versus the average data cloud overlap (Ov) for all P tested under Sim 2. Each line plot represents the 50 Ḡ values for S = 1,000 datasets at each Ov level for a given P. The optimal grouping solution (G = 2) is represented by the horizontal dashed line. (b) The mean correspondence of the grouping solution (ARI_HA) versus the average data cloud overlap (Ov) for all P tested under Sim 2. Each line plot is configured as in panel (a); the horizontal black dashed line represents the lower bound for excellent correspondence (ARI_HA = 0.9), and the red dashed line represents the lower bound for good correspondence (ARI_HA = 0.8). Boxplots to the right represent the distribution of standard errors for each estimate of Ḡ and ARI_HA for all Ov within a noted dimensionality P. The horizontal red line in each boxplot represents the median standard error, with the upper and lower edges of the box being the 25th and 75th percentiles. Whiskers extend to encompass the most extreme data points, and outliers are plotted individually as crosses.
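Correspondence throughout this study is quantified with the Hubert-Arabie adjusted Rand index. As an illustration (not the exact code used in this study), ARI_HA can be computed from the contingency table between a clustering solution and the reference partition:

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie adjusted Rand index (ARI_HA) between two partitions."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    # contingency table: rows = reference groups, columns = found clusters
    table = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                       for k in np.unique(labels_pred)]
                      for c in np.unique(labels_true)])
    sum_cells = sum(comb(int(m), 2) for m in table.ravel())
    sum_rows = sum(comb(int(a), 2) for a in table.sum(axis=1))
    sum_cols = sum(comb(int(b), 2) for b in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

ARI_HA = 1 indicates perfect agreement up to a relabeling of the groups; values near 0 are expected when partitions agree no better than chance.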

| Effect of group locations
The structured data were simulated as either two groups whose location in hyperspace was defined by a progressively decreasing amount of average overlap between the groups' data clouds (Sim 2), or as two stationary groups whose locations were predefined to be the same (Sim 3a) or different (Sim 3b, Sim 4). In all cases, we have demonstrated that when the two groups have higher overlap in hyperspace, the DISPROF algorithm has a tendency to underestimate the number of groups, often settling on solutions where only a single large group exists. When clustering multivariate normal data, as in Sim 2, the effects of the amount of overlap are overridden by increases in the dimensionality of the dataset (Figure 5a), potentially due to the increase in complexity of the groups' fingerprints that coincides with the extra dimensions. The result of this override is that even at levels of data overlap as high as 50%, DISPROF clustering is able to detect the correct number of groups in data with P ≥ 5. However, the correspondence values for those correct numbers of groups do not reach acceptable levels (ARI_HA ≥ 0.80) until P ≥ 10 (Figure 5b). Therefore, when clustering multivariate normal data with equal variances, the most reliable resolution and correspondence levels will be achieved with P ≥ 10.
The simulated ecological count data showed a profound effect of group location on the resolution and correspondence of the clustering solutions provided by DISPROF. Particularly in cases where the two sets of objects had the same central tendency but different overdispersion structures, and regardless of the number of descriptors in the dataset, DISPROF either underestimated the number of groups (e.g., G_mode = 1) or very greatly overestimated it (e.g., G_mode = 26). This directly contrasts with the performance of DISPROF on ecological count data whose groups are separated in hyperspace. In those cases, once again regardless of the number of descriptors, DISPROF performed optimally and, on average, identified the correct number of groups even with high levels of overdispersion. This finding is consistent with those for the multivariate normal data, in that low Ov improved DISPROF's performance as a clustering criterion. High group overlap may negatively affect DISPROF in the same manner as having low numbers of descriptors (P), where the high-overlap situation allows for group fingerprints that are not unique enough when compared to one another. In this case, the randomization process is unable to break the structure in the datasets, and the differences between the mean resemblance profile (representing H₀) and the observed profile are negligible (i.e., no structure present); thus, the routine returns a solution that identifies the entire data cloud as one group.

| Effects of overdispersion among descriptors within groups
The ecological count data used here were simulated so that we could examine the effects of increasing the overdispersion (θ) of G₂ while holding θ₁ = 0. The purpose of this exercise was to increase the relevance of the results to ecological data, as many species composition and abundance datasets are highly overdispersed. Our results indicate that when the groups do not overlap in hyperspace, the effects of the second group's overdispersion are negligible when considering the resolution of the clustering solutions, but the correspondence of those solutions with reality is unacceptable when P ≤ 10 for data with high overdispersion (θ₂ = 0.9). When the groups are defined by different levels of overdispersion and share a location, the effects of increasing overdispersion become more pronounced and are seemingly amplified by increasing the dimensionality of the dataset being tested. In these cases, the resolution of the solutions is as described previously, but the correspondence levels for the resultant partitions are all inadequate. The point of interest, however, is that the ARI_HA values tended to be around 0.5 for clustering scenarios where the overdispersion among descriptors is medium or high (i.e., θ₂ = {0.4, 0.9}) and P ≥ 25 (for θ₂ = 0.1, the threshold is P = 150). This indicates that one group is being identified fairly well while the other is being completely misrepresented by the grouping algorithm. We suspect that the increase in θ₂ causes the numerical fingerprints of the objects within the group to be too dissimilar even when compared only to one another, and the result is a series of singleton groups as the clustering algorithm iteratively works through the UPGMA connection of the overdispersed nodes. It seems as though the effects of overdispersion among ecological count data are secondary to the effects of group location in hyperspace, but supersede those of dataset dimensionality (dimension < overdispersion < location).
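For readers wishing to reproduce this style of simulation, overdispersed counts of the kind described above can be generated from a negative binomial distribution. The sketch below assumes the common quadratic parameterization Var(X) = μ + θμ², under which θ = 0 recovers the Poisson case; it is an illustration, not the exact simulation code used in this study.

```python
import numpy as np

def sim_counts(n_obj, n_desc, mu, theta, rng=None):
    """Ecological-style counts with mean mu and variance mu + theta * mu**2.

    theta = 0 gives Poisson counts; larger theta gives more overdispersion.
    (Quadratic negative binomial parameterization, assumed for illustration.)
    """
    rng = np.random.default_rng(rng)
    if theta == 0:
        return rng.poisson(mu, size=(n_obj, n_desc))
    shape = 1.0 / theta               # negative binomial shape parameter
    p = shape / (shape + mu)          # success probability giving mean mu
    return rng.negative_binomial(shape, p, size=(n_obj, n_desc))
```

Increasing theta leaves the mean unchanged but inflates the variance, which is precisely the property that blurs a group's fingerprint when two groups share a location.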

| Effects of correlation structure among descriptors within groups
Our simulation studies incorporating different correlation structures among descriptors within groups were also undertaken in an effort to relate our investigations to studies of ecological datasets, which often contain descriptors that are correlated with one another to some degree. We used multivariate normal data in these simulations to ensure that the observed effects of the different correlation scenarios were not confounded by other distributional assumptions. It appears that medium to high levels of correlation (Σ = {0.6, 0.9}) among descriptors within a group strongly affect the number of groups identified, with Ḡ tending to increase as Σ increases. Drawing inferences from these clustering results may be dubious, however, because for virtually all clustering solutions with medium or high correlation among descriptors, regardless of dimension, the mean correspondence was well below acceptable limits.
Correlation structure among descriptors within groups affects the shape of the data cloud in hyperspace. It is interesting to note that DISPROF seems better able to detect "correct" structure in data where the shapes (i.e., correlation structures) of the groups are the same (Σ₁ = Σ₂), as opposed to one group having no correlation structure (i.e., a spherical data cloud) and the second group having medium-to-large correlations among descriptors (i.e., a distorted data cloud). As our simulations only explore medium-to-high correlation among all descriptors, it would be of interest to examine low, negative, and mixed correlation structures to describe DISPROF's performance variability under a full range of correlation conditions. The control scenarios, where Σ₁ = Σ₂ = 0, were among the only scenarios that returned reasonable Ḡ or ARI_HA results; however, these scenarios effectively recreate a simplified version of the data simulated under Sim 2. The overall ARI_HA results suggest that increasing the correlation between descriptors in one group and not the other tends to produce increasingly unreliable grouping partitions, and these results are in line with those from Sim 2, where low P results in low ARI_HA. One explanation may be that as the level of correlation between descriptors increases, the effective size of P decreases: because the variability across all correlated descriptors in a group is essentially the same when considering the pairwise dissimilarity between objects, datasets with high P and high Σ tend to exhibit DISPROF clustering dynamics similar to those of datasets with low P and no correlation structure.
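The "effective size of P" argument can be made concrete with a compound-symmetric covariance matrix. This sketch (illustrative, not the study's simulation code) generates a correlated group and shows that the leading eigenvalue, 1 + (P − 1)Σ, increasingly dominates as the common correlation grows, so the data cloud collapses toward a lower effective dimensionality:

```python
import numpy as np

def cs_cov(p, rho):
    """Compound-symmetric covariance: unit variances, common correlation rho."""
    cov = np.full((p, p), float(rho))
    np.fill_diagonal(cov, 1.0)
    return cov

def sim_group(n, p, mu, rho, rng=None):
    """Multivariate normal group with common correlation rho among descriptors."""
    rng = np.random.default_rng(rng)
    return rng.multivariate_normal(np.full(p, float(mu)), cs_cov(p, rho), size=n)

# Eigenvalues of the compound-symmetric matrix: one value of 1 + (p - 1) * rho
# and (p - 1) values of 1 - rho, so one axis increasingly carries the variance.
evals = np.linalg.eigvalsh(cs_cov(10, 0.9))
```

For P = 10 and Σ = 0.9, a single eigenvalue of 9.1 stands against nine eigenvalues of 0.1: nearly all variability lies along one axis, mimicking a low-P dataset.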

| DISPROF as a clustering decision criterion
Strengths of using resemblance profiles as a hypothesis test for multivariate structure are that the type I error rates (1) are within the range of acceptability for α = .05, (2) tend to be binomially distributed around 5%, and are resistant to the effects of both (3) the underlying probability density function and (4) the [N × P] configuration of the data. Additional strengths include the facts that, when μ₁ ≠ μ₂, the power of DISPROF (5) is within the acceptable range for P ≥ 10 and is unaffected (6) by up to 50% average group overlap, (7) by increasing overdispersion among ecological count data, and (8) by increasing correlation structures among descriptors. Finally, (9) the first theoretical corollary proposed by Clarke et al. (2008), that the power of the test for multivariate structure increases as P increases, was confirmed.
From a traditional statistical error perspective, using resemblance profiles appears to be a very effective method for identifying multivariate structure; it rarely identifies structure that is not present, and it almost always identifies structure that is present. The weaknesses of using this hypothesis test are mostly related to the second Clarke et al. (2008) corollary, where the resolution of any grouping structure identified may be too fine to interpret meaningfully. The realized power of the resemblance profile hypothesis test comes when it is implemented as a clustering criterion, and success is based upon the partition returned by the algorithm. The resolution of the partition, and the solution's correspondence with interpretable multivariate structure in the dataset, are ultimately what researchers will use to explain their theories. The second Clarke et al. (2008) corollary appears to be valid, but it manifests differently depending on the type, configuration, and hyperdimensional structure of the dataset being considered. However, if we constrain our analysis to relatively high-dimensional, low-correlation datasets where the group locations are separated, then the resolution-versus-interpretability concern wanes greatly. The power to detect structure is very high, even with P as low as 10 descriptors, and so it follows that any additional resolution imparted on the solution (which may account for any reduction in correspondence) is likely the result of an actual numerical signal in the dataset, one that may arise from random (or unmeasured) processes, or from error. An alternative explanation may be related to the construction of the null distribution for the test statistic π, where group properties such as location and hyperdimensional shape may preclude the permutation procedure from accurately depicting the null scenario.

| Recommendations for using DISPROF (SIMPROF)
The results presented for type I error, power, resolution, and correspondence suggest that using resemblance profiles as a test for multivariate structure, and as a clustering decision criterion, has strengths and weaknesses. The results also highlight pitfalls that can be avoided if particular care is taken prior to implementation of these clustering techniques. The complex interactions between the data type/configuration and the hyperdimensional structure and overlap between groups strongly affect the results achieved when clustering with DISPROF. The method is nonetheless an improvement over traditional UPGMA clustering, most notably due to the removal of the arbitrary and static assignments of resemblance thresholds that define groups of objects. Because the realized power of using resemblance profiles as clustering decision criteria cannot be maximized without making tradeoffs between resolution and correspondence with interpretable structure, we make the following recommendations.
1. Exploratory analysis, such as principal coordinate analysis (PCoA), should be performed to determine, at a minimum, whether any hypothesized grouping structures might have high amounts of overlap (i.e., Ov > 50%) in hyperdimensional space, and DISPROF should be avoided in high-overlap situations. Data clouds that appear to overlap greatly could produce unreliable results and should not be clustered using these methods.

2. Medium-to-high correlation (i.e., ≥0.6) among all descriptors should be avoided, and efforts should be made to either reduce or remove the correlated descriptors in a dataset. In an effort to create more parsimonious models, priority should be given to descriptors that are indicative of independent processes, whenever possible. In the case of ecological abundance data, where many species are often both of interest and highly correlated, it may be of benefit to use a dimension-reduction technique (e.g., PCoA) that produces new orthogonal descriptors, with no correlation structure, prior to clustering with DISPROF.
3. The data dimensionality should be restricted to P ≥ 25 descriptors in order to achieve solutions with ideal resolution and "excellent" correspondence (ARI HA ≥ 0.90) to meaningfully interpretable structure.

4. A less conservative guideline would be to restrict the number of descriptors to P ≥ 10. This lower limit retains power and increases the potential for higher-resolution solutions, but reduces correspondence from "excellent" to "good" (0.80 ≤ ARI_HA < 0.90).
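As a practical note on recommendation 2, classical PCoA can be implemented directly via Gower double-centering of the squared distance matrix; the resulting coordinates are mutually orthogonal descriptors suitable for subsequent clustering. The sketch below is a generic implementation offered for illustration, not the code used in this study.

```python
import numpy as np

def pcoa(D):
    """Principal coordinate analysis (classical scaling) of a distance matrix.

    Gower double-centering of -0.5 * D**2, then eigendecomposition; the
    returned columns are orthogonal (uncorrelated) coordinate descriptors.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]               # sort axes by eigenvalue
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-10                          # drop null/negative axes
    return evecs[:, keep] * np.sqrt(evals[keep])
```

For Euclidean distances, the recovered coordinates reproduce the original inter-object distances exactly; for semimetric ecological dissimilarities, negative eigenvalues may arise and are simply dropped here.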
Since its initial development and inclusion in PRIMER-E (Clarke & Gorley, 2015), the use of resemblance profiles has been gaining traction as a clustering criterion, mostly in the ecological literature. Our results provide recommendations for ecologists applying these methods and demonstrate the methods' transferability to other numerical analyses, data types, and disciplines. With a better understanding of the dynamic performance of resemblance profiles as clustering criteria and the potential variability in the results they produce, researchers can more confidently deploy SIMPROF and interpret the results with respect to beta-diversity, species/environment relationships, or any other complex multivariate model and/or associated hypotheses. While there appear to be clear advantages imparted by the use of resemblance profiles as clustering criteria, many questions deserving additional attention were beyond the scope of this evaluation.

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY
All simulated datasets and analyses performed in MATLAB are publicly available upon request.