A treeless absolutely random forest with closed‐form estimators of expected proximities

We introduce a simple variant of a purely random forest, called an absolutely random forest (ARF), which is used for clustering. At every node, splits of units are determined by a randomly chosen feature and a random threshold drawn from a uniform distribution whose support, the range of the selected feature in the root node, does not change. This enables closed-form estimators of parameters, such as pairwise proximities, to be obtained without having to grow a forest. The probabilistic structure corresponding to an ARF is called a treeless absolute random forest (TARF). With high probability, the algorithm will split units whose feature vectors are far apart and keep together units whose feature vectors are similar. Thus, the underlying structure of the data drives the growth of the tree. The expected value of pairwise proximities is obtained for three pathway functions. One, a completely common pathway function, is an indicator of whether a pair of units follow the same path from the root to the leaf node. The properties of TARF-based proximity estimators for clustering and classification are compared to other methods in eight real-world datasets and in simulations. Results show substantial performance and computing efficiencies of particular value for large datasets.


INTRODUCTION
Clustering methods are used to search for the structure of a population based on sample data by partitioning units into groups according to their dissimilarities. Measures that characterize similarities or resemblances between pairs of units in an unsupervised setting are many and varied, and they are critically important in the search for structure in data. Similarities are also used for classification in supervised settings where, for example, the k-nearest neighbor (k-NN) algorithm groups units based on the similarity of their feature vectors to those of the labeled units in the learning sample.
Although Euclidean distance is mathematically a highly desirable way to measure resemblance, in many applications the data demonstrate that it is simply not a tenable choice. In such cases, it is instead appropriate to obtain data-driven measures, as we do here. In this report, we describe a method for obtaining closed-form, data-driven estimators of proximities based on a new, simple, absolutely random forest (ARF).
Random forest (RF) is a well-known, widely used ensemble method for obtaining data-driven estimates of a proximity matrix [4]. Briefly, trees are grown on bootstrap samples, resulting in a forest of strongly diversified models. The classic RF is a classification or regression algorithm based on labeled vectors of attribute or feature data from a learning sample. At each node in a tree, a random sample of features is obtained and membership in child nodes is based on optimizing a criterion of purity such as the Gini coefficient or entropy. Recursive partitioning continues until a leaf or terminal node is reached according to a specified stopping rule. The resulting forest can be used to obtain a measure of similarity based on what we call a pathway function. The most common pathway function is a binary indicator of whether a pair of units follow a common pathway in a tree from the root to the terminal node [4,24]. The similarity measure is the mean of the pathway functions, where the average is taken over the trees. It is exactly equivalent to Breiman's measure, the proportion of trees in the RF in which the pair appear together in the same terminal node. Conditional on the data and assuming convergence as the number of trees goes to infinity, the average of the indicator variables defined by a pathway function in an RF is an estimator of the corresponding population similarity value. If labels are known, the estimated similarity matrix can be used to classify a new subject using k-NN or, after a simple transformation, the matrix can be input to a clustering algorithm. When labels are not known, various devices, such as creating synthetic data [6] and random assignment of a unit to a class, are used to create pseudo labels. Then the RF tree-growing algorithm can produce a forest from which a proximity matrix can be computed based on a specified pathway function.
The statistical properties of RF parameters are extremely difficult to obtain, in part because splitting at each node is data dependent. This is why simplified versions, which some have called purely random forests, have been studied by Breiman [5], Genuer [14], Cutler and Zhao [9], Ciss [8], Ting et al. [27], Biau et al. [2], Zhu et al. [30], Bicego and Escolano [3], Aryal et al. [1], and Mantero and Ishwaran [24]. In all of these versions of RFs, units are randomly split at each node based on a sample of features from the learning data. These simplified versions of RF can provide intuition as to the properties of classical RF, and many are useful in their own right.
The main contribution of this report is the introduction of a new simple version of an RF for clustering that enables the computation of closed-form estimators of many population parameters. These can be evaluated rapidly even on large datasets, as will be seen below, and can be useful in applications. The improvement in computational costs over standard methods can be substantial. In this approach, at every node in a tree, a single feature is randomly chosen with replacement, and a threshold or cut point is determined randomly by sampling once from a uniform distribution whose support is the range, in the learning sample, of the chosen feature. Different from all other RF variants, the support of the uniform distribution does not change as units travel the path from root to terminal nodes. The entire sample is used to grow each tree. Because they are everywhere random, we call the resulting ensemble an absolutely random forest. For any pathway function, such as pairwise proximity, the parameter estimate can be computed from an ARF in the traditional way by growing the forest and counting the proportion of trees in which a pair of units meet the pathway function condition. But, most importantly, the simple structure of the ARF algorithm enables closed-form estimators to be obtained for many population parameters. In this communication, we obtain estimators of the expected value of pairwise similarities for three pathway functions. Although we do not pursue its properties here, we present in the Appendix a closed-form estimator of a pathway function for an individual unit, the number of nodes to isolation. This is possible because an ARF eliminates several dependencies induced by updating the support for choosing a threshold, which is a fundamental element in the construction of most RF versions. As a consequence, the need to grow the forest is eliminated. We call the conceptual probabilistic structure of an ARF a treeless absolute random forest (TARF).
In this report, we focus on features that are continuous, even though the methods considered are RF-based. Among the reasons for the popularity of RF is its ability to handle categorical or nominal variables. There are a variety of approaches that can be used to code them. For example, binary encoding transforms a discrete variable into a vector of binary digits, while one-hot encoding gives each category its own separate binary feature. Consideration of these and other alternatives would considerably expand the scope of the work presented here. This, together with the complexity involved in evaluating the population parameter estimators in sample studies with feature vectors comprised of both continuous and categorical variables, has led us to limit the presentation to features that are continuous.
The paper proceeds as follows. In Section 2, we introduce extremely randomized trees (ET) and unsupervised extremely randomized trees (UET), the two versions of purely random forests that are closest to an ARF. In Section 3, the properties that define an ARF are given, beginning with a procedure to normalize the distribution of features so that they each have support on the unit interval. The probability that two units split at a node in an ARF is obtained in Section 4. Pathway functions are introduced in Section 5, and the manner in which they define the proximity matrix is described. Three different examples of pathway functions for pairs of units are studied. In each case, the expected value and variance of the proximity are obtained without having to grow an ARF. In Section 7, we describe how to use TARF as a classifier if labeled data are available. In Section 8, the properties of TARF-produced proximity matrices for use in cluster analysis and classification are compared to three other methods for producing proximity matrices in real-world datasets. We conclude with some remarks in Section 9. Of particular importance is a discussion of the pros and cons of changing the support of the uniform distribution from which thresholds are obtained.

BACKGROUND: ET AND UET
In studying a simplified version of RF when labels are known, Geurts et al. [15] proposed an ensemble algorithm called extra trees (ET), an abbreviation for extremely randomized trees. To reduce bias, the full training sample is used to grow each tree. Specifically, at each node, a subset of k features is randomly chosen from the set of K features and a random split is performed on each of them.
The "best split," in terms of a specified impurity measure, is used to define membership in the child nodes.The process is repeated until all members of a child node have the same class label or the same feature value or the number of units in the node is less than a tunable parameter n min , sometimes called the smoothing strength.The smaller the value of k, the weaker the dependence of the constructed trees on the unit labels.In particular, when k = 1, the structure of the tree and the labels in the sample data are independent.Dalleau et al. [10,11] adapted ET to a setting in which labels are unknown, calling their method UET.The source code and installation instructions can be found on Gitlab (https://gitlab.inria.fr/kdalleau/uetcpp).At the root node of a UET, a single feature is randomly chosen.The splitting threshold is obtained by a random draw from a uniform distribution whose support is determined by the range of the chosen feature of the sample units in the node.At each subsequent node on the path to a leaf, the range of the support is recalculated based on the units remaining, a new uniform distribution is defined, and a new random threshold is obtained.The distribution of the random splitting variable is uniform with decreasing support as units move from node to node.This renders closed-form computation of estimates of population parameters such as the expected value of pairwise proximities, otherwise infeasible because of the immense number of splitting permutations even for a small number of features.We are particularly interested in UET because it is the closest version of an RF-type ensemble to an ARF, differing only in the changing support of the uniform distribution from which the splitting threshold is obtained.

DEFINING AN ARF
The construction of an ARF proceeds in the same manner regardless of whether the labels of units are known or unknown. We assume that the units have been independently drawn from a sample space, Ω, with underlying feature distribution function f(x). The sample is considered fixed, and stochastic properties are generated by the random elements of the random forest. The data we consider are N units u, each with a K-dimensional feature vector, $x_{uk}$, u = 1, 2, …, N and k = 1, 2, …, K.

For simplicity of notation, the first step in growing an ARF is to transform each of the continuous univariate components of the sample feature vectors so that their support is the unit interval. This does not alter the path of any unit in any tree. A simple unity-based transformation for feature k of unit u is $(x_{uk} - \min(x_k))/(\max(x_k) - \min(x_k))$. Without confusion, denote the normalized component k of a data vector $x_u$ for unit u by $x_{uk}$. A pair of units u and v with corresponding feature vector components $x_{uk}$ and $x_{vk}$ for all k satisfy the relationship $0 \le |x_{uk} - x_{vk}| \le 1$. Conceptually, two vectors have high proximity if $|x_{uk} - x_{vk}|$ is small for all of the K features.

Like a UET, an ARF ensemble is a forest of trees grown with no classification considerations. The entire dataset (as opposed to a bootstrap sample in classical RF) is used to grow each tree. At the root node, a single random feature k ∈ {1, 2, …, K} is selected with replacement so that all features are equally likely to be chosen. A cut point, $Z_k$, is randomly drawn from a continuous unit uniform distribution. Units u with a value of the selected feature $x_{uk} \le Z_k$ are passed to the left child node, while those with $x_{uk} > Z_k$ are passed to the right child node. The process is repeated at each node without adjusting the support of the uniform random variables (to the range of the features of the units in the node, as is done in other RFs). Recursive partitioning continues at every node until a stopping criterion is met.

In an ARF, the process stops when a unit reaches a prespecified tree depth, D. In particular, if D = 1, units in the root node are split into two child nodes and the process of growing the tree ends. If D is large, it is possible that at some node on the path down the tree the value of the random split point $Z_k$ is smaller or larger than the feature values of all of the units in the node. In this case, the units are all passed to the same child node, to the left or right depending on which side of the features in the node the random threshold falls. The process continues, even if a node contains only one unit, until it reaches a leaf at depth D, resulting in what are called balanced trees.
A large number of trees are grown in the same manner to create the ARF. It is comprised of trees with "pure" nodes in the sense that the partitions are not determined by artificial labels, measures of purity or entropy, nor by any concept related to classification. Also, the distribution of the random splitting variable is the same at every node in every tree.
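To make the construction concrete, the following is a minimal sketch, in Python with NumPy, of the normalization step and of growing a single balanced ARF tree of depth D. The function names and the bit-coded leaf labels are ours, not part of any published implementation.

```python
import numpy as np

def normalize_features(X):
    """Unity-based normalization so each feature has support [0, 1]."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant features
    return (X - mins) / span

def grow_arf_tree(X, depth, rng):
    """Grow one balanced ARF tree; return each unit's leaf code.

    Every node draws one feature uniformly from the K features and one
    threshold from Uniform(0, 1).  The support of the threshold is never
    rescaled to the units remaining in the node, which is what
    distinguishes an ARF from a UET.
    """
    leaf = np.zeros(X.shape[0], dtype=np.int64)

    def split(idx, d):
        if d == depth or idx.size == 0:
            return
        k = rng.integers(X.shape[1])   # random feature, chosen with replacement
        z = rng.uniform()              # threshold on the fixed unit support
        right = X[idx, k] > z          # may be all True or all False: no actual split
        leaf[idx[right]] |= 1 << d     # record the move at level d
        split(idx[~right], d + 1)
        split(idx[right], d + 1)

    split(np.arange(X.shape[0]), 0)
    return leaf

rng = np.random.default_rng(0)
X = normalize_features(rng.normal(size=(100, 5)))
codes = grow_arf_tree(X, depth=4, rng=rng)  # equal codes = common pathway
```

Repeating grow_arf_tree over many trees and counting how often a pair of units shares a leaf code reproduces the forest-grown proximity estimate that the closed-form results below make unnecessary.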

THE PROBABILITY THAT TWO UNITS SPLIT AT A NODE IN AN ARF
In this section, we obtain the probability of a fundamental event that takes place in the process of growing an ARF: a split between units u and v with corresponding feature vectors $x_u$ and $x_v$. Suppose that in the first step of growing a tree, feature k is randomly selected and $z_k$ is the result of a random draw from a unit uniform distribution. The probability that u and v are split at the root node, conditional on k and their feature values $x_{uk}$ and $x_{vk}$, is

$$p_{uvk} = P\left(\min(x_{uk}, x_{vk}) < Z_k \le \max(x_{uk}, x_{vk})\right) = |x_{uk} - x_{vk}|.$$

The right-hand side has the form of an $L_1$ norm, which is related to Manhattan or city-block distance, scaled by the range of the features. Because features are equally likely to be chosen with probability 1/K, given their feature values, the unconditional probability that u and v are split at the root node is

$$p_{uv} = \frac{1}{K} \sum_{k=1}^{K} |x_{uk} - x_{vk}|.$$

Similarly, the probability, $q_{uvk}$, that they will not be split, conditional on k and their feature values $x_{uk}$ and $x_{vk}$, is a measure of similarity. Its value is $q_{uvk} = 1 - p_{uvk} = 1 - |x_{uk} - x_{vk}|$. The unconditional probability that the two units will not be split at a node they share is

$$q_{uv} = 1 - p_{uv} = \frac{1}{K} \sum_{k=1}^{K} \left(1 - |x_{uk} - x_{vk}|\right). \tag{1}$$
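With the features normalized as in Section 3, Equation (1) can be evaluated for all pairs at once. The following sketch is ours, assuming NumPy; for large N, the (N, N, K) intermediate array may need to be processed in blocks.

```python
import numpy as np

def split_probabilities(X):
    """Pairwise splitting and non-splitting probabilities for normalized X.

    p[u, v] is the mean absolute coordinate difference, the scaled
    Manhattan distance; q[u, v] = 1 - p[u, v] is Equation (1), the
    probability that u and v are not split at a shared node.
    """
    p = np.abs(X[:, None, :] - X[None, :, :]).mean(axis=2)
    return p, 1.0 - p
```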

TREELESS PROXIMITIES DEFINED BY PATHWAY FUNCTIONS
Although less common than the ubiquitous completely common pathway, other pathway functions have been proposed in the literature. These are designed to represent the similarity of units with finer granularity than the rather demanding completely common pathway condition. The Zhu2 pathway function, the node count to divergence, was considered by Torkkola and Tuv [28] and Zhu et al. [30]. It is a count of the number of nodes in a tree that a pair of units jointly share until their paths split; that is, they follow the same pathway down the tree until they reach a node where they diverge and continue separately to their respective terminal nodes. Torkkola and Tuv [28] describe this as "the deepest common parent of two nodes" and the pathway function as "the average shared path length." Zhu3 [30] is a similar pathway function, differing only in that the count is weighted by the inverse of the number of sample units in the node at which they split. Torkkola and Tuv [28] normalized the function by dividing by the deeper of the levels of the terminal nodes of the two units. Further examples are given by Ting et al. [27] and Aryal et al. [1], who described mass-based dissimilarity, a data-dependent dissimilarity defined as the probability mass of the (smallest) region covering the two units. Proximity estimates based on a specific pathway function can be obtained from an ARF by counting the number of trees that meet the criteria divided by the number of trees in the forest. Based on a particular choice of pathway function, the proximity of all pairs of sample units can be determined, and a transformation, such as the square root of one minus the entries in the proximity matrix, can be input to a clustering algorithm.
In this section, we introduce the notion of an expected proximity matrix (EPM) whose (u, v)th element is the expected proximity of units u and v. Assuming convergence, it is the proximity matrix for a specific pathway function that would result in the limit as the number of trees in an ARF goes to infinity. Our goal is to obtain an estimate of the expected value and variance of the entries of the matrix based on the underlying nodal splitting probabilities, without having to grow the forest. In the following sections, we obtain estimators of the mean and variance of EPMs for three pathway functions for pairs of units. They are the completely common pathway, the node count to divergence, and the unit count at divergence.

Proximity based on the completely common pathway to a terminal node
The completely common pathway (cp) function is a binary indicator, $\mathbb{1}^{cp}_{uv}$, that takes the value 1 if units u and v share a common path from the root to the terminal node and zero otherwise. Because bootstrap sampling is not used in an ARF ensemble, every pair of units is present in the root node of every tree. For depth D, the expected value of the indicator of the common path proximity of units u and v is the probability that they will not be split at any node as they travel down the path of a tree to a common terminal node.
Theorem 1. The (u, v) entry in the EPM for a completely common pathway with depth D is

$$E\left(\mathbb{1}^{cp}_{uv}\right) = q_{uv}^{D}, \tag{2A}$$

and the variance of the (u, v) entry is

$$\operatorname{Var}\left(\mathbb{1}^{cp}_{uv}\right) = q_{uv}^{D}\left(1 - q_{uv}^{D}\right). \tag{2B}$$

Proof. Equation (2A) follows because the events that u and v do not split at the root and at each of the D − 1 subsequent nodes are mutually independent and identically distributed; Equation (2B) is the variance of the resulting binary indicator. ▪

The simplest case of the completely common pathway function corresponds to D = 1. In this case, the common pathway function has the distribution of a mixture random variable with equal mixing parameters and binomially distributed components. In the literature, a tree with one root and exactly two leaves has been called a tree stump. The left side of Figure 1 displays the expected proximity based on the common pathway condition as a function of q as it ranges from 0 to 1 for D = 1, 2, 3, 5, 7, and 10. It can be seen that as D increases and q decreases, the probability of splitting before reaching the terminal node increases monotonically. The right side of Figure 1 displays the variance of the proximity based on the common pathway condition as a function of q over the same range. As D increases, the peak of the variance shifts to the right.
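In code, Equations (2A) and (2B) are one line each. The sketch below is ours; it also indicates how the closed form can be checked against a literally grown ARF.

```python
def common_pathway_epm(q, depth):
    """EPM under the completely common pathway: mean q**D (Equation 2A)
    and the indicator's variance q**D * (1 - q**D) (Equation 2B)."""
    epm = q ** depth
    return epm, epm * (1.0 - epm)

# Check: with p, q = split_probabilities(X), the proportion of trees in
# which a pair shares a leaf code (grow_arf_tree above) converges to
# common_pathway_epm(q, depth)[0] as the number of trees grows.
```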
Separate and apart from RF or probability considerations, Equation (2A) with D = 1 is the definition of Gower's [16] generalized similarity coefficient (GSC). Gower showed that a similarity matrix whose components have the form $q_{uv}$ is positive semi-definite. Since the elements all fall between zero and one and the diagonal elements are all equal to one, a dissimilarity matrix whose (u, v) element is $(1 - q_{uv})^{1/2}$ is Euclidean. Although not discussed here, Equation (2A) is additionally appealing as a proximity measure because it can accommodate quantitative and qualitative features, and it can handle missing values, just as RFs can. The GSC has been used to define dissimilarities in hierarchical cluster analysis, principal coordinate analysis, and other ordination programs for more than 50 years. Here, we see that it is mathematically identical to the expected value of the proximities under a common pathway criterion with D = 1 computed for a theoretical TARF. There are several publicly available programs for computing Gower's distance/similarity measure. In the R library (R Core Team, 2021), it may be found in a package by Maechler et al. [23], where it can be used directly as input to a cluster analysis. Also, a package by Laliberté et al. [20] computes the GSC with several options for handling ordered categorical variables. Other packages include gower, introduced by van der Loo [29], and kmed, introduced by Budiaji [7], which may be found at https://www.R-project.org/. Recently, D'Orazio [13] suggested some modifications that would (1) improve the GSC with respect to interval and ratio scaled variables, (2) attenuate the impact of outliers, and (3) reduce the unbalanced contribution of different types of variables. These may be applicable in the TARF context as well.

Proximity based on the expected number of nodes in the common path to divergence
The node count to divergence pathway function for two units is a count $s_{uv}$ of the number of nodes that u and v share until either they split or they jointly reach their common terminal node [27,29]. If $s_{uv}$ is small, that is, u and v split near the root node, they are likely to be relatively dissimilar. If u and v share a common path over a large number of nodal split challenges, they are likely to have a relatively high level of similarity. In an ARF, all pathways have length D, so normalization is not necessary, although dividing by D makes all values lie in the unit interval.
FIGURE 1. Mean and variance of the proximity for the completely common pathway function to a terminal node. Left: the expected proximity based on the common pathway condition as a function of q as it ranges from 0 to 1 for D = 1, 2, 3, 5, 7, and 10. Right: the variance of the proximity based on the common pathway condition over the same range.

Let u and v be two units in the root node of a tree generated in an ARF of depth D. Let the random variable $S_{uv}$ be the number of common nodes that the units share up to and including the node at which they either separate or reach a common terminal node. An outcome s = 1 corresponds to the pair splitting at the root node, and s = D corresponds to the pair reaching the same leaf node.

Theorem 2. Conditional on the sample, the number of nodes in the common path to divergence of u and v, $S_{uv}$, is distributed as a censored geometric random variable with

$$P(S_{uv} = s) = p_{uv}\, q_{uv}^{\,s-1}, \quad s = 1, \ldots, D - 1, \qquad P(S_{uv} = D) = q_{uv}^{\,D-1},$$

with mean given by Equation (3A) and variance given by Equation (3B).
Proof. If D were infinite, the probability that u and v split at node s would be $P(S_{uv} = s) = p_{uv} q_{uv}^{\,s-1}$ for every s. Notice that this probability has the form of a geometric distribution, for which $E(S_{uv}) = 1/p_{uv}$. However, $S_{uv}$ has positive probability only for $S_{uv} \le D$, the depth of the tree, after which it is unobservable. Consistent with the completely common pathway function, the probability that u and v are in the same terminal node at tree depth D is $q_{uv}^{\,D}$. Also, $P(S_{uv} \le s) = 1 - q_{uv}^{\,s}$ for s < D. Since $S_{uv}$ is a censored geometric random variable, its expected value is

$$E(S_{uv}) = \sum_{s=1}^{D} q_{uv}^{\,s-1} = \frac{1 - q_{uv}^{\,D}}{p_{uv}}, \tag{3A}$$

and its variance is

$$\operatorname{Var}(S_{uv}) = \sum_{s=1}^{D} (2s - 1)\, q_{uv}^{\,s-1} - \left(\frac{1 - q_{uv}^{\,D}}{p_{uv}}\right)^{2}. \tag{3B}$$ ▪

A plot of the expected number of shared nodes and the variance over the range of q for D = 1, 2, 3, 5, 7, and 10 is given in Figure 2. It can be seen that as q and D increase, the expected number of nodes at which the two units split and the variance increase. Note that for any fixed D, $E(S_{uv})$ is a monotonically increasing function of q.
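Because the printed displays for (3A) and (3B) are not reproduced in this extraction, the sketch below (ours, in NumPy) evaluates the moments directly from the censored geometric distribution, using $E(S) = \sum_{s=1}^{D} q^{s-1}$ and $E(S^2) = \sum_{s=1}^{D} (2s-1)\, q^{s-1}$.

```python
import numpy as np

def nodes_to_divergence_moments(q, depth):
    """Mean and variance of S_uv, the censored geometric node count.

    P(S = s) = p q**(s-1) for s < D and P(S = D) = q**(D-1), so the
    mean is the tail sum of q**(s-1) and the second moment is the
    tail sum of (2s - 1) q**(s-1), for s = 1, ..., D.
    """
    q = np.asarray(q, dtype=float)
    s = np.arange(1, depth + 1)
    qs = q[..., None] ** (s - 1)
    mean = qs.sum(axis=-1)                    # equals (1 - q**D) / (1 - q)
    second = ((2 * s - 1) * qs).sum(axis=-1)
    return mean, second - mean**2
```

The function operates elementwise, so passing the whole matrix q of Equation (1) returns the corresponding EPM and variance matrices in one call.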

Proximity based on the number of units in the node where u and v diverge
Let $N(u^*, v^*)$ be the number of units in the node where units u* and v* diverge in a tree in an ARF. We want to compute the expected value, $E(N(u^*, v^*))$. Denote by $N_1(u^*, v^*)$ the number of units that split away from the pair as a result of the first feature draw. Continue growing the tree recursively following u* and v*, and at the jth node denote by $N_j(u^*, v^*)$ the cumulative number of units that have split away from them up to and including the jth node. For notational simplicity, suppose the sample size is N + 2 instead of N and, without loss of generality, let u* be unit $v_{N+1}$ and v* be unit $v_{N+2}$.
Theorem 3. The expected number of units at the node where the pathways of units u* and v* diverge is given by Equation (4A) and its variance is given by Equation (4B).
FIGURE 2. Mean and variance of the number of shared nodes to splitting pathway function. Left: the expected value as a function of q as it ranges from 0 to 1 for D = 1, 2, 3, 5, 7, and 10. Right: the corresponding variance over the same range.

Proof. As long as the pathways of u* and v* have not diverged, it suffices to consider the number of units that have split from u*, because u* is on the same path as v*; therefore, splitting from either one implies splitting from both. Denote by $N_j(u^*)$ the cumulative number of units that have split from u* up to and including the jth node. The maximum value that $N_j(u^*)$ can attain is N.
Lemma. For j = 1, 2, …, D − 1, conditional on u* and v* being members of the jth node of their common pathway, $N_j(u^*)$ is distributed as a Poisson Binomial random variable.

Proof of the lemma. The probability that an arbitrary unit v ≠ u*, v* does not diverge from u* at node j, conditional on u*, v*, and v all being members of the jth node, is $q_{u^*v}$. Denote by $N^*$ the set of integers from 1 to N; by $\mathcal{A}_n$ the set of all subsets of $N^*$ of size n; by A a set in $\mathcal{A}_n$; and by $A^c$ its complement. Note that the cardinalities are |A| = n and $|A^c| = N - n$. Denote by $r_{u^*v,j} = q_{u^*v}^{\,j}$ the probability that the paths of u* and v do not diverge up to and including the jth node. Then the probability that n units have split from u* and v*, given that the pair are members of the jth node in their common pathway, is

$$P\left(N_j(u^*) = n\right) = \sum_{A \in \mathcal{A}_n} \prod_{v \in A} \left(1 - r_{u^*v,j}\right) \prod_{v \in A^c} r_{u^*v,j},$$

which is the probability mass function of a Poisson Binomial random variable. ▪

Therefore, the conditional expected value of $N_j(u^*)$, given that u*, v*, and v are all in the jth node, is

$$E\left(N_j(u^*)\right) = \sum_{v=1}^{N} \left(1 - r_{u^*v,j}\right) = N\left(1 - \frac{1}{N}\sum_{v=1}^{N} q_{u^*v}^{\,j}\right).$$

The last term is the rather intuitive average, taken over all sample units v, of the probability that the paths of u* and v have not diverged through the jth node. The conditional variance is

$$\operatorname{Var}\left(N_j(u^*)\right) = \sum_{v=1}^{N} r_{u^*v,j}\left(1 - r_{u^*v,j}\right).$$
Thus, the desired expectation, Equation (4A), follows by averaging these conditional expectations over the distribution of the node at which u* and v* diverge. Using the law of total variance, $\operatorname{Var}(X) = E(\operatorname{Var}(X \mid Y)) + \operatorname{Var}(E(X \mid Y))$, we obtain Equation (4B).
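A sketch (ours) of the two conditional sums that drive Equations (4A) and (4B), for a single unit u* at node level j:

```python
import numpy as np

def units_split_away_moments(q_row, j):
    """Conditional mean and variance of N_j(u*), the cumulative number of
    units that have split from u* through node j, given that u* and v*
    are still on a common path.

    Each other unit v stays with u* through j nodes with probability
    r_v = q_{u*v}**j, so N_j(u*) is Poisson Binomial with success
    probabilities 1 - r_v; its moments are sums over v.
    """
    r = np.asarray(q_row, dtype=float) ** j   # q_{u*v} for all v != u*, v*
    return (1.0 - r).sum(), (r * (1.0 - r)).sum()
```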

CLASSIFICATION BASED ON A TREELESS RANDOM FOREST
In this section, we describe how to use TARF as a classifier if labeled data on n units are available. Although TARF is designed for problems involving unlabeled data, it is interesting to see how it might perform compared to methods specifically designed for classification. Let u* be a unit with a K-dimensional feature vector, $x_{u^*}$, whose class membership is to be determined. The expected proximity between u* and $u_i$, for i = 1, 2, …, n, can be obtained for a specified pathway function using TARF at a chosen level D. A classical k-NN approach can be used to assign a class label based on the majority vote, the most frequent label of the k closest units. However, it is sometimes beneficial to use the class information of all of the units. We illustrate the approach when there are two classes, writing the formula explicitly for the completely common pathway. Suppose that in the training sample there are $n_1$ individuals in Class 1, $n_2$ individuals in Class 2, and $n_1 + n_2 = n$. By using Equation (2A) at tree depth D, the proximity, $q_{*j}^{\,D}$, between u* and $u_j$ is obtained. Let $\delta(u_j) = 1$ if $u_j$ is a member of Class 1, and 0 otherwise. Then the probability that u* is a member of Class 1 is given by Equation (5). This follows from the fact that the random variables are distributed as a Poisson Binomial. The 50% voting classification rule is: classify u* as a member of Class 1 if $P(\delta(u^*) = 1) > 0.5$ and as a member of Class 2 otherwise. For more than two classes, u* is assigned to the class with the highest probability.
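The display for Equation (5) is not reproduced in this extraction. One natural reading of the all-units vote described above, offered here as an assumption rather than as the paper's exact formula, is a proximity-weighted proportion of Class 1 labels:

```python
import numpy as np

def tarf_class1_probability(q_star, labels, depth):
    """Proximity-weighted estimate that u* belongs to Class 1.

    q_star: per-node non-split probabilities q_{*j} between u* and each
    labeled unit j (so q_star**depth are the Equation (2A) proximities);
    labels: 1 for Class 1 members, 0 otherwise.
    """
    w = np.asarray(q_star, dtype=float) ** depth
    return (w * np.asarray(labels)).sum() / w.sum()

# 50% voting rule: assign u* to Class 1 when the returned value exceeds 0.5.
```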

EVALUATION OF TARF IN REAL-WORLD DATASETS
In this section, the properties of the proximity matrix produced by TARF for use in a cluster analysis are appraised in eight real-world studies. Dalleau et al. [10] conducted empirical evaluations of ET in these datasets, as did Lin et al. [21] in their evaluation of the performance of a new generalized iterative clustering algorithm (GICA) for obtaining a proximity matrix. We use all eight of the studies for appraising TARF for clustering and four of them for classification. They are part of the set of benchmark data widely used in cluster analysis research and are available on the University of California, Irvine (UCI) Machine Learning Repository website, http://archive.ics.uci.edu/ml.

Estimating the proximity matrix for use in clustering: the choice of the depth parameter
The depth, D, may be thought of as a tuning parameter. To gain some insight into its effect, we chose one of the eight public datasets, the Wisconsin Breast Cancer (Original) dataset, as a test bed. The dataset consists of 699 units, 458 with benign tumors and 241 with malignant tumors. There are 10 features of cytological characteristics derived from breast fine-needle aspirates. Using the data from this study, TARF was run to obtain a proximity matrix for each of the three pathway functions described in Section 5, for D = 2 through 10. The three methods for generating the proximity matrix are as follows:

1. TARFcp: The pathway function is the completely common pathway. Equation (2A) is applied to obtain the TARF proximity matrix.

2. TARFdiv: The pathway function is the number of nodes to divergence. Equation (3A) is applied to obtain the TARF proximity matrix.

3. TARFnu: The pathway function is the number of units in the node at divergence. Equation (4A) is applied to obtain the TARF proximity matrix.
The proximity matrices were converted to dissimilarity matrices, and clusters were obtained using the PAM algorithm introduced by Kaufman and Rousseeuw [19]. The resulting clusterings were appraised using the Silhouette score (SIL), Adjusted Rand Index (ARI), normalized mutual information (NMI), and accuracy (ACC). The SIL is the only one of the four measures concerned with the internal properties of the clustering; the three remaining measures are concerned with the similarity of the clustering to the ground truth. The results are shown in Figure 3A-D.

FIGURE 3. Four measures of the quality of the clustering as a function of depth. The green line is TARFcp, the orange line is TARFdiv, and the blue line is TARFnu. The y-axis is the SIL for Graph A, the ARI for Graph B, the NMI for Graph C, and the ACC for Graph D.

The SIL is the only measure for which there are meaningful differences among the pathway functions shown in Figure 3. In Figure 3A, TARFdiv appears best for values of D ≥ 2, although in this range it has smaller NMI and ARI values than the other two pathway functions. The ACCs of the three pathway functions in Figure 3D are almost identical across D.
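The pipeline from an EPM to an appraised clustering can be sketched as follows. The sketch is ours; the Euclidean transform $(1 - q)^{1/2}$ is the one justified by Gower's result above, and KMedoids from the scikit-learn-extra package is an assumed stand-in for the PAM implementation used in the paper.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids  # assumed stand-in for PAM

def cluster_from_epm(epm, n_clusters):
    """sqrt(1 - q) turns the EPM into a Euclidean dissimilarity matrix,
    which is clustered with PAM and scored with the Silhouette."""
    d = np.sqrt(np.clip(1.0 - epm, 0.0, None))
    np.fill_diagonal(d, 0.0)
    pam = KMedoids(n_clusters=n_clusters, metric="precomputed", method="pam")
    labels = pam.fit_predict(d)
    return labels, silhouette_score(d, labels, metric="precomputed")
```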

Evaluating the performance of TARF for clustering in eight real-world datasets
In each of the eight studies, the three TARF methods, TARFcp, TARFdiv, and TARFnu with D = 4, were run to obtain proximity matrices. As in Section 7.1, clusters were obtained using PAM. Three other methods for generating the proximity matrices were also run:

4. RFsupervised: This was run to provide a counterfactual "best case scenario" for the quality of a clustering derived from a Breiman RF ensemble of decision trees with class membership known [4]. RFsupervised is counterfactual in that the performance of TARF in this section is being studied in the unsupervised case; in any real-world clustering setting, the true unit labels are not known, so they would not be available as input to the algorithm. For the remaining methods, labels are not known. The number of trees, $n_{tree}$, was set to 4000; the number of features randomly selected at each split, $m_{try}$, was $\sqrt{K}$; the maximum depth was reached when the node was pure or when it contained fewer than two units; and all other tunable parameters were set to their default values.

5. RFunsupervised: To obtain a proximity matrix when labels are unknown, synthetic data of the same sample size as the real dataset were generated by random sampling from the product of the empirical marginal distributions of the features [6,25]. Using the RF algorithm, an ensemble of trees was generated with the classifier programmed to separate units that are members of the real dataset from those that are members of the synthetic dataset, based on bootstrap samples from the merged datasets. The number of trees was set to 4000, the number of features randomly selected at each node was $\sqrt{K}$, the maximum depth was reached when the node was pure or when it had fewer than two units, and all other tunable parameters were set to their default values.

6. UET: UET is based on ET with labels that are randomly generated [10]. A single feature was selected without replacement at each node, and $n_{tree}$ was set to 10,000. The maximum depth of a node was reached when either all units were members of one class or when the number of units was less than $\sqrt{N}$; all other tunable parameters were set to their default values.
As above, the results of the clusterings were compared based on SIL, ARI, NMI, and ACC. Computer time was measured in seconds. These, together with the sample size, number of features, and number of labels for each study, are given in Table 1. It can be seen that the relative merits of the six approaches depend on the measure. For each of them, the value that is best in the study is shown in bold italics. If the best was RFsupervised, the value is highlighted in gray; otherwise, if it was one of the other methods, it is highlighted in yellow. Methods that were numerically essentially equivalent to the best measures were also highlighted in yellow. We subjectively made an overall choice of best among the five methods for producing a proximity matrix for clustering, taking computer resource use into account. It is indicated by the time displayed in a bold red font in the row of the method.
When RFsupervised was run, labels were treated as known. Therefore, as expected, it performed best or equivalent to the best in five of the eight studies (Iris, heart disease, breast tissue, segmentation, ionosphere). As mentioned above, it is not a true competitor because it is supervised. None of the methods was best in the Parkinson study. In the Segmentation study, TARFcp and TARFdiv performed fairly similarly and both were considered best. RFunsupervised was never the best approach on any measure. Of the remaining four unsupervised methods, TARFnu performed best or nearly best with respect to the SIL in six of the eight studies. Taking all of the measures into account, the best clustering properties among the eight studies were produced by TARFcp in three, by TARFdiv in three, and by TARFnu and UET in one each.
In terms of computer resource utilization, TARFcp was usually slightly faster than TARFdiv, and TARFnu took about twice as long to run. TARFcp run times were roughly 10 times faster than those of UET in every one of the eight studies. This may overestimate the advantage of TARFcp because no effort was made to optimize the code we wrote to implement UET. TARFcp was slightly faster than RFunsupervised in seven of the studies. In six of the studies used in this evaluation, the sample sizes are relatively small. Applications in biology can have hundreds of thousands of features. In such cases, TARF methods can have substantial and meaningful run-time benefits and also perform well. The number of cores used to run the analyses should not alter the relative speed of these methods.

Estimating the proximity matrix from TARF for use in classification
In this section, the properties of a proximity matrix produced by the TARF methods for input to a nearest neighbor classifier are appraised and compared to RF ensemble-based methods. In this setting, labels are assumed to be known. The RFsupervised algorithm was run to provide a "best-case scenario." The four studies listed in Table 2 that had two classes were used. As in Section 7.2, the TARF algorithms studied are TARFcp, TARFdiv, and TARFnu with D = 4. These were run to obtain proximity matrices to which the nearest neighbor algorithm in Equation (5) was applied. Note that for classification, RFsupervised does not use the proximity matrix; subjects are classified by dropping the feature vector down the trees, and the majority vote determines class assignment. Finally, the proximity matrix produced by ET with $n_{tree}$ = 10,000 and $m_{try}$ = 1 was obtained and the nearest neighbor algorithm was applied. The results were examined in terms of ACC and the AUC of the ROC. These are based on the out-of-bag units for RFsupervised and for ET; because there are no trees in a TARF, there is no analogous approach for the TARF methods. The results for each study are given in Table 2 together with the sample size and number of features. The ROCs are presented in Figure 4A-D. Across the four studies, as expected, RFsupervised performed best. The four remaining methods performed well in the heart disease (Figure 4A) and Wisconsin (Figure 4C) studies, but in the Parkinson study, only TARFdiv did well. In the Parkinson and Ionosphere studies, TARFdiv was best, although ET performed nearly as well in the latter.

SIMULATIONS EVALUATING TARF FOR CLUSTERING
The purpose of the simulations is to investigate the relative properties of the six procedures when the underlying feature distributions of the data are given by a family of two-component mixtures of multivariate normal ($\mathcal{MN}$) random variables where the feature vector is four-dimensional. The cumulative distribution function of the feature vector is

$$F(\mathbf{x}) = \alpha\,\mathcal{MN}(\phi\mathbf{1}', \theta\Sigma) + (1 - \alpha)\,\mathcal{MN}(-\phi\mathbf{1}', \theta\Sigma), \tag{6}$$

where $\mathbf{1}'$ is a four-dimensional vector of ones, $\Sigma$ is a 4 × 4 covariance matrix, and the three scalar parameters, $\alpha$, $\phi$, and $\theta$, satisfy $0 \le \alpha \le 1$, $\phi \ge 0$, and $\theta \ge 0$. The form of the population mixture distribution implies that, except for a few parameter values, there are two clusters, one whose features are distributed as $\mathcal{MN}(\phi\mathbf{1}', \theta\Sigma)$ and the other as $\mathcal{MN}(-\phi\mathbf{1}', \theta\Sigma)$. In a random sample, the expected numbers of individuals in the two clusters will respectively be in the proportions $\alpha$ and $1 - \alpha$. The population does not have two clusters when $\alpha = 0$ or 1, nor when $\phi = 0$.
For the simulations, we obtained a randomly generated covariance matrix, $\Sigma$ [18]. With $\alpha = 0.5$, the expected value of the feature vector is (0, 0, 0, 0). We utilized a total sample size of 400 for both clusters. In the simulations, the distributions of the features were assumed to take various forms as a function of $\alpha$, $\phi$, and $\theta$, depending on the issue being considered. We studied variations in $\alpha$ ranging from 0.1 to 0.9 in Section 8.1, variations in $\phi$ from 0 to 2 in steps of 0.25 in Section 8.2, and variations in $\theta$ from 1 to 3 in steps of 0.25 in Section 8.3.
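A sketch (ours) of one draw from the mixture in Equation (6); because the covariance matrix of [18] is not reproduced here, a stand-in positive definite Σ is generated in its place.

```python
import numpy as np

def sample_mixture(n, alpha, phi, theta, sigma, rng):
    """Draw n units from alpha MN(phi 1', theta Sigma) +
    (1 - alpha) MN(-phi 1', theta Sigma); return features and true labels."""
    k = sigma.shape[0]
    labels = rng.uniform(size=n) < alpha           # cluster memberships
    means = np.where(labels[:, None], phi, -phi) * np.ones(k)
    X = means + rng.multivariate_normal(np.zeros(k), theta * sigma, size=n)
    return X, labels.astype(int)

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
sigma = A @ A.T + 4.0 * np.eye(4)                  # stand-in for the Sigma of [18]
X, y = sample_mixture(400, alpha=0.5, phi=1.0, theta=1.0, sigma=sigma, rng=rng)
```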
The Monte Carlo procedures described below were repeated 100 times, and the results were averaged. In each of the simulations, the three TARF methods, TARFcp, TARFdiv, and TARFnu with D = 4, were run to obtain a proximity matrix. Three other methods for generating the proximity matrices, RFsupervised, RFunsupervised, and UET, were also run. As above, RFsupervised was run to provide a "best-case scenario," but because it utilizes known labels, it is not a true competitor. The number of trees was set to 10,000 for RF but only 1000 for ET because of the length of time required for it to run. For RFsupervised and RFunsupervised, the number of randomly selected features at each node is the square root of the total number of features; for ET, it is set to 1. The maximum depth for RFsupervised and RFunsupervised is reached when all leaves are pure or when all leaves have fewer than two units, and for ET it is reached when the number of units in a node is one-fourth of the number of units in the sample. All other tunable parameters were set to their default values. As in Section 7.1, clusters were obtained from the estimated proximity matrices using PAM [19]. As above, the results of the clusterings were measured by the SIL, ARI, NMI, and ACC.

FIGURE 4. The receiver operating curves for four real-world studies. ACC, accuracy; AUC, area under the receiver operating curve; SE, sensitivity; SP, specificity.
Tables 3-5 display the results of the simulations for the population parameter being considered. The best value for each varying parameter value for each measure is given in bold italics, in gray highlights for RFsupervised, or in yellow highlights for the other methods. Multiple entries with yellow highlights indicate near equivalence with the best among the six methods. The best algorithm over all of the parameter values for each measure is listed in red font in the first column. Methods that were essentially equivalent to the best measure were also highlighted in yellow.

How do the methods compare as the cluster membership proportions vary?
To study the various possible cases, the characteristics of the features are assumed to follow the distribution function $\alpha\,\mathcal{MN}(\mathbf{1}', \Sigma) + (1 - \alpha)\,\mathcal{MN}(-\mathbf{1}', \Sigma)$, where $\alpha$ varies from 0.1 to 0.9 in steps of 0.1. The parameters $\phi$ and $\theta$ are set to 1 in Equation (6). These population parameters generate the same sample distributions but in different sample proportions, determined by the value of the parameter $\alpha$ in the mixture distribution. The distance between the centroids of the two clusters remains unchanged across the nine values of $\alpha$. At $\alpha = 0.5$, the population proportions in the two clusters are identical. The results are given in Table 3.

As can be seen, for every measure and method, the results are symmetric in that the values for $\alpha$ and $1 - \alpha$ are nearly the same. The RFsupervised method was the best as measured by the SIL but by no other measure. Except for ACC, ET was substantially worse than the rest for most values of $\alpha$. While the three TARFs all did well, if one method needs to be chosen to run, the best in the table is TARFnu.

How do the methods compare as the centers of the clusters vary?
To address this question, the features of the sample are assumed to follow the distribution function $0.5\,\mathcal{MN}(\phi\mathbf{1}', \Sigma) + 0.5\,\mathcal{MN}(-\phi\mathbf{1}', \Sigma)$, where the parameter $\phi$ varies between 0 and 2 in steps of 0.25. The parameter $\alpha$ is set to 0.5 and $\theta$ is set to 1 in Equation (6). The structure of the mixture distribution results in samples whose cluster distributions are the same when $\phi = 0$ and become increasingly dissimilar as $\phi$ increases from 0 to 2. The results are given in Table 4. The three TARFs all did very well, but once again, if one method needs to be chosen, TARFnu appears to be the best.

How do the methods compare as the variance-covariance matrix of the feature distributions vary?
To study how the measures perform as the covariance structure of the feature distribution varies, the features are assumed to follow the distribution function $0.5\,\mathcal{MN}(\mathbf{1}', \theta\Sigma) + 0.5\,\mathcal{MN}(-\mathbf{1}', \theta\Sigma)$ for $\theta$ running from 1.0 to 3.0 in steps of 0.25. The parameter $\alpha$ is set to 0.5 and $\phi$ is set to 1 in Equation (6). Table 5 examines the performance in the nine cases. ET performed substantially worse than the rest for most values of $\theta$. While the three TARFs all did well, if one method needs to be chosen, especially because of the SIL, the best in the table is TARFnu.

DISCUSSION
The eight real-world studies and the simulations attest to the potential value of TARF, particularly for use in clustering. When the dataset is very large, such as is commonly found in the search for biomarkers, the advantage in speed of execution, together with its performance, provides a compelling argument for using a TARF-based method. In this work, we have evaluated the properties of the proximity matrices produced by TARF for use in cluster analysis for three pathway functions. The resulting clusters were compared internally with the SIL and to ground truth by the ARI, NMI, and ACC. The properties were compared to the classical random forest algorithm and to UET. We found that the unsupervised RF method was never as good as the three TARF methods and UET. While all three TARF methods had good properties, TARFnu appears to be a slightly better choice than the rest, largely because of its superiority with respect to the SIL. This pathway function, the number of units in the node at divergence, is different in form from the other two. It is possible that other pathway functions will do comparatively better in other circumstances. Our examination of the relative merits of the methods merely scratched the surface in terms of the vast range of possible scenarios.
An RF-based ensemble of trees obtained for use in classification transforms a pair of feature vectors into a proximity measure based on a specified pathway function. The idea is clear: units with similar features proceed down the tree along similar pathways, split by split, as long as the demand of the pathway function is satisfied. To capitalize on the benefits of RF for use in clustering when no labeled data exist, many proposals have been presented to get around the limitation. At one extreme is the proposal of Siegel et al. [26], who searched for subtypes of individuals with Post Traumatic Stress Disorder (PTSD). These authors suggested that the RF algorithm should start with a relevant, carefully chosen classification objective. Calling the approach purposeful clustering, they proposed in their application to classify cases with PTSD versus healthy controls in an RF as the first step for obtaining a proximity matrix. Since healthy controls are of no interest in the clustering stage, their proximities are dropped from the matrix, leaving a proximity matrix comprised only of cases. This is the input to the clustering algorithm.
At the opposite extreme of purposeful clustering is the use of purely random splitting, as in an ARF. This strategy also leads to units with similar features proceeding together according to a pathway function. The simple rule, smaller to the left, larger to the right, keeps units with similar feature vectors together and splits those whose features are not similar. A random splitting threshold will, with high probability, split feature vectors that are far apart and leave together feature vectors that are similar. In this way, the underlying structure of the data drives the growth of the tree. This is at least part of the idea behind some of the recent suggestions for simplified RFs.
What is the reason for changing the support for selecting the threshold at each node? In classification with RF, sequential splitting of labeled units to segregate them into known classes is quite natural. Arguably, as long as greater purity can be achieved at a node, splitting should continue. This is facilitated by redefining the support so that even a random split is guaranteed to separate some units. However, in the unlabeled case, the justification for this strategy is not so clear. For example, although unknown to the analyst, suppose that all of the units in a node belong to the same true class and the differences in their feature vectors are very small. Redefining the range of the support means that every threshold choice is guaranteed to erroneously separate them. In contrast, leaving the range of the support unchanged from the range at the root node will not likely lead to a random draw that splits them, as is appropriate. The fixed support strategy seems more intuitive when trees are grown in the search for structure. It is hard to see how varying the probability that two units split as a function of the properties of the other units in the node as they move down the tree helps in obtaining good estimates of proximities.
On a technical note, if the range is updated, a split of at least one unit in both child directions is guaranteed. But in an ARF, sometimes the randomly chosen threshold will not split the units. Although this may seem computationally inefficient, as we have argued, it is informative regarding the dissimilarity of the members of the node. More importantly, the issue of computational efficiency is minimized when it is not necessary to literally grow the forest and count the trees that satisfy the pathway function condition. The simple change to the manner of splitting enables the exact calculation of the desired probabilities and expected values.
Although we have focused on pathway functions that involve pairs of units, obtaining exact estimators for other cases is possible. In theory, at least, pathway functions can be defined for three or even more units. We have studied a pathway function for individual units, the node count to isolation, which is widely used for the detection of anomalies. A closed-form parameter estimator was obtained and is given in the Appendix.
We recommend that when using TARF methods, the depth of the tree be considered a tuning parameter and that the user experiment with different pathway functions. As these methods are used, we will learn which TARF specifications prove to be most useful in different applications.

APPENDIX A. EXPECTED NUMBER OF NODES UNTIL A UNIT IS ISOLATED
Here, we depart from the estimation of parameters that involve pairs of units and give an example of a pathway function for a single unit. The pathway function considered here seeks to characterize how difficult it is to separate a unit from the rest of the sample. The node count at isolation pathway function is the expected length of the path of u* from the root until the first node containing no other units except those, if any, that are identical to u*, truncated at depth D. This value is an essential measure for the detection of outliers using RF. The intuition is that fewer nodes will be required to isolate anomalous data points because they are different. The Isolation Forest algorithm of Liu et al. [22] uses the same procedure as ARF to grow isolation trees, except that it uses a training subsample instead of all of the units.
TABLE 1. Comparative performance of six methods for producing a proximity matrix for use in a PAM cluster analysis of real-world data. Entries report the properties of the clusters (SIL, ARI, NMI, and ACC in %; computer time in seconds) together with the sample size, number of features, number of labels, and the ACC of RF as a classifier in %. For the Iris study, TARFcp and TARFdiv produced the same clustering and therefore identical ARI, NMI, and ACC; for the Parkinson study, RFsupervised and TARFdiv produced the same clustering and identical ARI, NMI, and ACC. The overall best value across methods per measure is given in bold italics, in gray highlights for RFsupervised or yellow highlights for the other five; multiple entries with yellow highlights indicate near equivalence with the best. Time in red font indicates the subjectively overall best among the five methods. Abbreviations: ACC, accuracy; ARI, Adjusted Rand Index; NMI, normalized mutual information; RF, random forest; SIL, Silhouette score; TARF, treeless absolute random forest; UET, unsupervised extremely randomized trees.

TABLE 2. Comparative accuracy (percent correct classification) and AUC of the ROC (area under the receiver operator curve) for classification in studies with two classes, together with the sample size and number of features. The overall best value is given in bold italics, in gray highlights for RFsupervised or yellow highlights for the other four; multiple entries with yellow highlights indicate near equivalence with the best.

TABLE 3. Property of the methods with variation in the cluster mixing proportions $\alpha$. Distribution of features: $\alpha\,\mathcal{MN}(\mathbf{1}', \Sigma) + (1 - \alpha)\,\mathcal{MN}(-\mathbf{1}', \Sigma)$. The overall best value in a column is given in bold italics, in gray highlights for RFsupervised or yellow highlights for the other five; multiple entries with yellow highlights indicate near equivalence with the best among the six methods; methods in red font indicate the subjectively overall best.

TABLE 4. Property of the methods with variation in the locations of the cluster centroids. Distribution of features: $0.5\,\mathcal{MN}(\phi\mathbf{1}', \Sigma) + 0.5\,\mathcal{MN}(-\phi\mathbf{1}', \Sigma)$. Highlighting conventions are as in Table 3. Abbreviations: ET, extremely randomized trees; NMI, normalized mutual information; RF, random forest; TARF, treeless absolute random forest.

TABLE 5. Property of the methods with variation in the covariance matrix. Distribution of features: $0.5\,\mathcal{MN}(\mathbf{1}', \theta\Sigma) + 0.5\,\mathcal{MN}(-\mathbf{1}', \theta\Sigma)$. Highlighting conventions and abbreviations are as in Tables 3 and 4.