Stochastic Estimation of the Total Number of Classes in Clustering when Extremely Large Samples Exceed the Input Capacity of the Clustering Engine

Numerous reports have elucidated the classification of large amounts of data using various clustering techniques. However, growth in data size hinders the applicability of these methods. Here, we investigate how to deal with an exploding number of possibilities to be sorted into irreducible classes by a clustering technique whose input capacity cannot accommodate the total number of possibilities. This situation is exemplified by atomic substitutions in the supercell modeling of alloys. The number of possibilities sometimes amounts to trillions, far too many to be accommodated in a cluster analysis at once. Thus, it is not practically feasible to identify directly how many irreducible classes exist, even though several techniques are available to perform the clustering. In this regard, a stochastic framework is developed to circumvent the capacity limitation, providing a method to estimate the total number of irreducible classes (the order of classes) as a statistical estimate. The main conclusion is that the statistical variation of the number of classes found at each sampling trial can serve as a promising measure to estimate the total number of irreducible classes. Characteristics of this approach are also discussed through a comparison with the conventional approach based on Polya's theorem.


Introduction
Classifying a large amount of data into groups by individual attributes is a universal task. Beyond handling based on clear classification rules, clustering driven by unsupervised learning techniques has been developed extensively. [1][2][3][4] A situation inherently associated with clustering is the large amount of data that needs to be classified. The larger the data, the more accurate, in general, the statistical quality of the estimations tends to be, thereby showing the strength of data science. However, the data size can be extremely large, rendering its accommodation into a single clustering process difficult. Let us set up the problem as follows: suppose we have a clustering tool that can process up to l max samples. The tool classifies l reducible samples (as input) into M(l) irreducible groups (as output) based on some attributes. Let the maximum possible number of samples be L (≥ l max), and let G = M(L) be the total number of irreducible attributes that we want to determine. However, we cannot identify G when l max ≪ L owing to the input capacity limitation.
To illustrate the problem more concretely, let us consider a magnetic alloy, (Nd 1−x Ce x−y La y ) 2 Fe 14 B, which is obtained from Nd 2 Fe 14 B by substituting a part of the Nd-sites with Ce or La. Supercell methods in ab initio calculations are used to describe such atomic substitutions, which are modeled as a periodic array of a cell structure. [5] The size of the cell (specified by how many Nd-sites it accommodates) is selected to ensure that the given concentration (x, y) is captured by the ratio of the numbers of atoms. [5] To describe (Nd 0.7 Ce 0.225 La 0.075 ) 2 Fe 14 B, nine of 40 Nd-sites are substituted by Ce and three by La, as shown in Figure 1. The number of possible atomic configurations in this case amounts to 40!/(28!9!3!) = 1,229,107,765,600 = L (the total number of reducible samples to be classified). These "raw" configurations can further be classified into subgroups of irreducible configurations based on the theoretical framework identifying structures equivalent under the symmetry operations. [6] The number of subgroups, G, is smaller by several orders of magnitude (typically G ≈ 100), even for L ≈ a trillion samples. Once we identify G, we can construct the G concrete representative configurations as irreducible structures by using a recursive method. Several packages in materials science provide tools to perform such classifications, [7,8] but their input capacity, l max, cannot handle such a large L (≈ a trillion).
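As a quick check, the multinomial count above can be reproduced with a few lines of Python using only the standard library:

```python
import math

# Number of distinct assignments of the 40 Nd-sites into
# 28 Nd, 9 Ce, and 3 La: the multinomial coefficient 40!/(28! 9! 3!)
n_config = math.factorial(40) // (
    math.factorial(28) * math.factorial(9) * math.factorial(3)
)
print(n_config)  # 1229107765600, i.e., about 1.2 trillion raw configurations
```

The same value is obtained by choosing the Ce sites first and the La sites from the remainder, `math.comb(40, 9) * math.comb(31, 3)`.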
The present study considers the problem of identifying G even for l max ≪ L. A stochastic method is proposed to estimate G = M(L) using a tool that computes M(l) for l ≤ l max ≪ L. The method repeats the sampling a number of times to calculate M(l), thereby obtaining the statistics {M(l)}, by which G is estimated. The estimation can be achieved in considerably fewer trials than L∕l, as shown in this study. For the above example of classifying crystal structures, we note that a more powerful deterministic framework exists (Polya's theorem); the present method, however, can properly handle the example as well. Furthermore, the method described in this paper can handle a broader category of problems. It only requires a technique that provides M(l), which can be a machine-learning technique performing the clustering, wherein the classification rule is not necessarily based on any clear principle such as solid-state physics or group theory, as in the above-mentioned example. Polya's theorem, the conventional solution to similar problems, relies on algebraic symmetry, and hence its applicability is limited to problems possessing that symmetry. We note again that the present framework has much wider applicability because it relies not on symmetry but only on the existence of some sorting tool, which need not even be deterministic, for example, machine-learning clustering. When combined with existing clustering techniques, the method can expand their scope beyond the limitation of input capacity (which depends on the available memory for computations).
If l max is several times greater than G, the convergence M(l) → G should be observable, allowing the number of irreducible groups G to be identified, as shown in Figure 2. Nevertheless, this strategy still incurs considerable computational cost because, in general, it is laborious to judge whether a monotonic behavior has achieved convergence or still requires larger l. The proposed method provides a powerful breakthrough, offering another way to identify G as the position of a peak: we derive that V[M(l)], the variance of M(l) with respect to the random selection of l reducible samples, becomes maximal when l ∼ G.
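A minimal Monte Carlo sketch of this procedure in Python, using a toy population whose class labels are already known so that counting distinct labels stands in for the clustering tool (all names and parameter values here are illustrative, not part of the original method):

```python
import random
from statistics import pvariance

def variance_curve(labels, l_values, trials, rng):
    """For each sample size l, draw `trials` random subsets of size l,
    count the classes M(l) found in each, and record the variance V[M(l)]."""
    curve = {}
    for l in l_values:
        counts = [len(set(rng.sample(labels, l))) for _ in range(trials)]
        curve[l] = pvariance(counts)
    return curve

rng = random.Random(0)
G = 20  # true number of irreducible classes (unknown in practice)
# population with a random multiplicity for each class (L >> G)
labels = [c for c in range(G) for _ in range(rng.randint(50, 200))]

curve = variance_curve(labels, range(1, 81, 2), trials=400, rng=rng)
l0 = max(curve, key=curve.get)  # position of the variance peak, l0 ~ G
```

In this sketch `len(set(...))` plays the role of M(l); in a real application it would be replaced by a call to the capacity-limited clustering tool.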

Outline of the Results
To formulate the variance of M(l), we introduce the probability P(l, M; G) for a set of l samples to be sorted into M irreducible classes. The expectation value, M̄(l), and the variance, V[M(l)], are then given as

M̄(l) = Σ_M M · P(l, M; G),  (1)

V[M(l)] = Σ_M [M − M̄(l)]² · P(l, M; G).  (2)

For l = 1, the sample surely corresponds to a single irreducible class, namely M̄(1) = 1, regardless of the sample selection; therefore, V[M(1)] = 0. In the limit of large l, we know the convergence M(l → L) = G; hence, V[M(l → L)] → 0. As such, we expect V[M(l)] to have a maximum as a function of l in the range 1 < l < L; we define this value of l as l 0.

Figure 3. For each G, we randomly generated ten different cases of the multiplicities of the irreducible structures. We then picked samples of size l to identify M(l) and evaluated the variance for each l, obtaining a curve with a peak at l = l 0. The plot is scaled assuming l 0 (G) = G (dashed black line in the figure), leading to the conclusion that G can be estimated from the position of the peak of V[M(l)].
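Under the equal-multiplicity assumption used in the Appendix (so that sampling is effectively with replacement from G equally likely classes), P(l, M; G) can be built up exactly from the standard occupancy recursion a(l+1, M) = M·a(l, M) + (G − M + 1)·a(l, M − 1) with P = a/G^l, and Equations (1) and (2) can then be evaluated directly. A minimal Python sketch (function names are illustrative):

```python
from fractions import Fraction

def probability_table(l, G):
    """P(l, M; G): probability that l samples drawn uniformly (with
    replacement) from G equally sized classes show exactly M classes.
    Occupancy recursion: a(l+1, M) = M*a(l, M) + (G-M+1)*a(l, M-1)."""
    a = [0] * (G + 1)
    a[1] = G  # a single sample always shows exactly one class
    for _ in range(1, l):
        a = [0] + [M * a[M] + (G - M + 1) * a[M - 1] for M in range(1, G + 1)]
    return {M: Fraction(a[M], G ** l) for M in range(1, G + 1)}

def mean_and_variance(l, G):
    """Evaluate Equations (1) and (2) exactly with rational arithmetic."""
    P = probability_table(l, G)
    mean = sum(M * p for M, p in P.items())
    var = sum((M - mean) ** 2 * p for M, p in P.items())
    return mean, var
```

For l = 1 this gives M̄ = 1 and V = 0, and for l ≫ G the mean approaches G while the variance vanishes, reproducing the limiting behaviors stated above.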
We can observe that V[M(l)] has a peak at around l ∼ G, as shown below.
With the simplifying assumption that the size of each irreducible class (i.e., the number of elements inside each class) is identical, we can derive an asymptotic behavior, as given in the Appendix, §A–§B, and shown in Figure B1. By substituting Equation (4) into Equation (2), we can show that the variance V[M(l)] has a peak approximately at l ∼ (G + 1) (§C). Even without the assumption used in the analytical treatment, numerical verification shows that this result remains unchanged, as shown in Figure 3.
A rough estimate that V[M(l)] has a peak at approximately l ∼ G can also be obtained from a schematic picture.

Discussion
Although the result remains within the extent of a heuristic finding, we note that 10 × V[M(l 0 )] provides a reasonably reliable estimate of G. As shown in Figure 5, this estimate appears considerably more robust than the other estimator, "l 0 ∼ G," shown in Figure 3.
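This heuristic can be checked against the closed-form variance of the number of distinct classes among l independent uniform draws from G equally likely classes, V[M(l)] = G q^l + G(G − 1) r^l − G² q^(2l) with q = 1 − 1/G and r = 1 − 2/G (a standard indicator-variable identity, used here as a stand-in for the equal-multiplicity case; not the general finite-population formula). A small Python sketch locating the peak:

```python
def variance_distinct(l, G):
    """Variance of the number of distinct classes among l i.i.d. uniform
    draws from G equally likely classes (indicator-variable identity)."""
    q, r = 1 - 1 / G, 1 - 2 / G
    return G * q**l + G * (G - 1) * r**l - G * G * q ** (2 * l)

G = 50
curve = {l: variance_distinct(l, G) for l in range(1, 301)}
l0 = max(curve, key=curve.get)   # peak position, of the order of G
estimate = 10 * curve[l0]        # heuristic estimate of G
```

For G = 50 this places the peak at l 0 of the order of G and yields an estimate close to G, consistent with the scaling V[M(l 0 )] ≈ 0.1 × G observed in Figure 5.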

Conclusion
We have formulated a stochastic framework for classifying a vast number of samples (L ≈ a trillion) with any clustering technique in order to find the total number of irreducible classes, G. The situation considered is when L is too large to be accommodated by the input capacity of the technique owing to memory- or file-size limitations. From the L samples, the framework randomly picks a set of size l (≪ L), affordable within the capacity, and sorts it with a clustering tool into M classes. By repeating the sampling while varying the size l, we obtain the statistics {M(l)}. We have derived that the variance of M(l) becomes maximal at l ∼ G. Although no rigorous mathematical proof has been developed, we heuristically found that G can be estimated as approximately ten times V[M(l 0 )]. In future work, we will focus on developing a mathematical verification of the proposed method.

Figure 5. For each G, we randomly generated ten different cases of the multiplicities of the irreducible structures. We then picked samples of size l to identify M(l) and evaluated the variance for each l, obtaining a curve with a peak at l = l 0. The plot is scaled to V = 0.1 × G (dashed blue line), leading to the conclusion that G can be estimated from the value of V[M(l 0 )] multiplied by ten.

Appendix A: Derivation of the Probability
We denote by P(l, M; G) the probability of obtaining M irreducible classes after a clustering technique has processed l reducible samples, where the l samples are randomly chosen from a population containing a total of G irreducible classes. The situation can be equivalently described as "distributing colored envelopes into l pigeonholes, one each, where an envelope has G color variations, but only M (< G) colors are found over the pigeonholes." Although the multiplicity of each color (i.e., the number of envelopes with each color) generally differs from color to color, we adopt the assumption that the multiplicity is the same for every color. We can then evaluate P(l, M; G) analytically in a straightforward way; we shall examine later whether the final result is substantially affected by this assumption. Letting a(l, M; G) be the number of cases, the probability can be written as

P(l, M; G) = a(l, M; G) / G^l,  (A1)

where a(l, M; G) obeys the recursion

a(l + 1, M; G) = G·a(l, M; G) − (G − M)·a(l, M; G) + (G − M + 1)·a(l, M − 1; G) = M·a(l, M; G) + (G − M + 1)·a(l, M − 1; G).

The recursion can be explained as follows: Imagine that the already existing l pigeonholes include a total of M colors, corresponding to the first term on the right-hand side. Multiplying a(l, M; G) by G counts all the possibilities for the (l + 1)-th pigeonhole to take any color; this includes the cases with a total of (M + 1) colors, which must be excluded because we are considering the cases with exactly M colors in total, giving the subtracted term. The last term counts the cases in which the l pigeonholes include (M − 1) colors and the (l + 1)-th pigeonhole introduces the M-th color, for which (G − M + 1) color choices remain.

Appendix B: Asymptotic Evaluation

Similarly, we derive an asymptotic form of D(l, M; G) using differential approximations, as given in Equation (B18), by considering D(l + 1, M; G), where the constant of integration is determined such that D(l, M = 1; G) = 1 is satisfied. Substituting Equations (B10) and (B18) into Equation (B1), we obtain the asymptotic form given in Equation (4).
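The counting recursion for a(l, M; G) can be verified by brute-force enumeration for small G and l: the snippet below counts, over all G^l color sequences, those showing exactly M colors, and compares the result with the recursion (an illustrative sketch; function names are not from the original):

```python
from itertools import product

def counts_recursive(l, G):
    """a(l, M; G) via a(l+1, M) = M*a(l, M) + (G-M+1)*a(l, M-1)."""
    a = [0] * (G + 1)
    a[1] = G  # one pigeonhole: any of the G colors, exactly one color shown
    for _ in range(1, l):
        a = [0] + [M * a[M] + (G - M + 1) * a[M - 1] for M in range(1, G + 1)]
    return a

def counts_bruteforce(l, G):
    """Direct enumeration of all G**l color sequences of length l."""
    a = [0] * (G + 1)
    for seq in product(range(G), repeat=l):
        a[len(set(seq))] += 1
    return a
```

Summing a(l, M; G) over M recovers G^l, consistent with the normalization P(l, M; G) = a(l, M; G)/G^l.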
Figure B1. Comparison between the asymptotic behavior (Approx., Equation (B24)) and the original form (Equation (A1)) of P(l, M; G) with G = 5, as functions of l for several different M.
This corresponds to counting the possible permutations of M colors selected from the G colors. As the asymptotic evaluations affect the original normalization of P, we again introduce a normalization factor so that the renormalized form becomes the probability to be applied to Equation (2) in the asymptotic region. As shown in Figure B1, the asymptotic evaluation reproduces the behavior of the original form fairly well.