Share density-based clustering of income data

The Lorenz curve is a fundamental tool for analyzing income and wealth distribution and inequality. Indeed, the Lorenz curve and its derivative, the so-called share density, provide valuable information regarding inequality. There is a widely recognized connection between the Lorenz curve and elements of information theory. Building on this connection, the aim of this work is to compare the income inequality of different subgroups by using a proper dissimilarity measure, borrowed from information theory, between parametric share densities. This measure is then used for clustering purposes: a dynamic clustering algorithm is employed to group unconventional data, such as density functions. Finally, an application to data from the Survey on Household Income and Wealth (SHIW) by the Bank of Italy is shown.


INTRODUCTION
In the economic and social literature, issues related to income and wealth inequality have attracted the attention of many researchers. Indeed, it is crucial to explore the root causes of income disparities and to understand the determinants of these differences across countries and regions, social classes, ethnic groups and so on (see, e.g., [20]). To this end, tools are required to compare various aspects of inequality among groups. The Lorenz curve is a well-known and widely used tool for analyzing income inequality. Since its proposal in 1905 (see [26]), many ideas have been suggested by statisticians and economists, generating a fertile field of study. Although some caution is needed when comparing different groups by means of Lorenz curves, especially when the curves intersect (see, e.g., [13]), this tool remains a reference point in this field of analysis, as demonstrated by the very wide literature dealing with it.
On the other hand, the Lorenz density is rarely mentioned explicitly. One of the few references to Lorenz density can be found in [17], where this function is called share density, to indicate the relation with the share of total income owned by a specific portion of a population. Subsequently, the concept of Lorenz density was taken up in [43], which studied its moments and furnished an expression for its variance; a detailed description of its properties and characteristics can be found in [21]. A more recent reference can be found in [38], where the share density is considered for the Gini index decomposition.
It is known that, given a positive continuous random variable, the corresponding Lorenz curve can be viewed as a distribution function on the unit interval (see [23]) and it is possible to consider its derivative as a density function.
If the Lorenz curve is differentiable, the share density is easily obtained as its first derivative, which allows us to infer the share owned by different segments of a population. From this perspective, it seems natural to compare groups of earners in terms of inequality by quantifying the dissimilarity between their corresponding share densities using a proper dissimilarity measure. Such a measure can then be employed to find groups whose behavior is similar in terms of income earned by the various percentile ranges of the population. In fact, it is important to consider the structure of inequality and to explore the various income brackets in greater depth than is possible with an aggregate coefficient, such as the Gini or Theil indexes ([27]).
To this end, in the present study, an unconventional cluster approach was considered. Currently, the concept of clustering has been extended to patterns described by unconventional data, often called symbolic data [3] or distribution-valued data [28]. In this context, complex data can be described by intervals or distributions and structured as a set E of objects described by multi-valued variables. A common task in this data analysis is the detection of homogeneous groups of objects in set E, such that objects belonging to the same group have a high degree of similarity, in contrast to objects from different groups. Various methodological proposals have been suggested, according to the specific kind of data, such as the work of [41] for histogram data, [39] for cumulative distribution functions, [28] for univariate probability density functions and [8,42] for multivariate extensions. In [4], a method to cluster univariate probability density functions was considered and extended to the multivariate context by performing density estimation and clustering simultaneously.
Here, set E is made up of groups of income earners, each described by its own share density function. As highlighted in [41], it is necessary to find a proper dissimilarity measure to evaluate the degree of proximity between the objects of set E. We propose using the Jensen-Shannon (JS) dissimilarity as a measure of discrepancy among groups of income earners and develop a dynamic clustering algorithm (DCA) based on this information. An advantage of this approach is to obtain, for each cluster, a prototype whose descriptor synthesizes the features of income inequality for the earners belonging to that cluster.
After introducing the preliminary concept and notation in Section 2, a DCA based on JS divergence among share densities will be described in Section 3. In Section 4, we consider some parametric models for income and the corresponding share densities, and then explore an application to real data, which is considered and discussed in Section 5. Finally, in the last section, some considerations conclude the paper.

INFORMATION THEORY AND INEQUALITY: CONCEPT AND NOTATION
Although information theory and income inequality analysis may appear to be very distant fields of study, they have many points of contact. Indeed, even if information theory was developed within the field of communications to explore the losses and errors in transmitting data through a channel [37], thanks to the work discussed in [40] these concepts were also applied to income and wealth analysis, as tools for constructing indicators of economic inequality. These concepts were later generalized by Cowell [9], who introduced Generalized Entropy. In recent years, there has been renewed interest in entropy-based inequality measurements, as shown, for example, by the papers of Rohde [34,35], who explores the connection between inequality indexes and the Lorenz curve, and proposes a symmetric information theoretic index, namely the J-divergence, to measure inequality.
A key concept is the notion of entropy. Recall that, given a continuous random variable (rv) $X$ with probability density function (pdf) $f(\cdot)$ and support $\mathcal{S}$, the corresponding Shannon entropy, also called continuous or differential entropy, is given by:

$$H(f) = -\int_{\mathcal{S}} f(x) \log f(x)\,dx.$$

Many of the properties of differential entropy are similar to those of its discrete counterpart, but important differences remain, such as the possibility of assuming negative values (for details, see, e.g., [30]).
One measure closely related to differential entropy is the Kullback-Leibler divergence, also called relative entropy, a non-symmetric measure of how one density function $g(x)$ differs from a second, $f(x)$, considered as reference:

$$D_{KL}(g \,\|\, f) = \int_{\mathcal{S}} g(x) \log \frac{g(x)}{f(x)}\,dx.$$

In the present study, we start from differential and relative entropy for share densities to quantify a measure of distance among different groups of income earners. In doing so, we assume that all incomes are nonnegative. This is the most common assumption in the literature, especially in the context of continuous random variables modeled by distributions with positive support (see [6,7,25]), although negative incomes are possible in real situations and their presence requires particular consideration (see, e.g., [31,32]).
Let us consider a non-negative continuous rv $Y$ with distribution function (df) $F(y)$ and quantile function $F^{-1}(t) = \sup\{y \mid F(y) \le t\}$, $t \in [0, 1]$. Then, the expectation of $Y$, if it exists and is finite, is given by $E(Y) = \int_0^1 F^{-1}(t)\,dt$, and it is possible to obtain the corresponding Lorenz curve as follows:

$$L(u) = \frac{1}{E(Y)} \int_0^u F^{-1}(t)\,dt,$$

where $u \in [0, 1]$. It is known (see, e.g., [23]) that both the quantile function and the Lorenz curve $L(u)$ can be considered as distribution functions of continuous rvs having support on the unit interval. Therefore, by construction, the derivative $l(u)$ with respect to $u$, given by

$$l(u) = \frac{dL(u)}{du} = \frac{F^{-1}(u)}{E(Y)},$$

can be viewed as a pdf for a rv $U$ with support on the unit interval. In particular, $l(u)$ is a special case of weighted distribution, belonging to the class of length-biased distributions ([36]). As proposed by [17,43], $l(u)$ defines the share of income owned by the portion of the population that falls in a given percentile range. It is related to the probability that a unit of amount, chosen at random, is earned by a specific percentile range of the population, which is exactly the fraction of the amount owned by that percentile range in proportion to the whole. This function provides information regarding income inequality. For example, Shao [38] highlighted the close relation between share density and the Gini index, while Rohde [34] has shown that the two well-known Theil inequality indexes, L and T (see [10] for details), can be directly obtained from $l(u)$. In particular, the Gini index is given by $G = 2 \cdot E(U) - 1$, while Theil's T index coincides with the Shannon entropy of the density $l(u)$, changed in sign:

$$T = \int_0^1 l(u) \log l(u)\,du = -H(l). \tag{1}$$

Therefore, the differential entropy $H$ of the share density $l(u)$, besides quantifying the randomness of the rv $U$, also gives information regarding the global inequality level present in a population.
Moreover, it is easy to verify that expression (1) coincides with the Kullback-Leibler divergence between the density $l(u)$ and the share density under the egalitarian condition, that is, the case where all income levels are equal. In this case, in fact, each unit of amount has an equal chance of being owned in all percentiles $u$ of the population, so that $L_{EG}(u) = u$ and $l_{EG}(u) = 1$.
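The relations $G = 2 \cdot E(U) - 1$ and $T = -H(l)$ can be checked numerically. The following minimal sketch is not part of the original analysis: it assumes, purely for illustration, an exponential income distribution with unit mean, whose quantile function is $F^{-1}(t) = -\log(1 - t)$, so that $l(u) = -\log(1 - u)$. The function names and the midpoint integration rule are choices made for this sketch.

```python
import math
from typing import Callable


def share_density_exponential(u: float) -> float:
    """Share density l(u) = F^{-1}(u) / E(Y) for a unit-mean exponential income."""
    return -math.log1p(-u)


def midpoint_integral(func: Callable[[float], float], n: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))


def gini_from_share_density(l: Callable[[float], float]) -> float:
    """Gini index as G = 2 E(U) - 1, where U has density l on (0, 1)."""
    return 2.0 * midpoint_integral(lambda u: u * l(u)) - 1.0


def theil_from_share_density(l: Callable[[float], float]) -> float:
    """Theil's T as the integral of l log l, i.e. minus the entropy of l."""
    return midpoint_integral(lambda u: l(u) * math.log(l(u)))


gini = gini_from_share_density(share_density_exponential)
theil = theil_from_share_density(share_density_exponential)
```

For the exponential case the known values are $G = 1/2$ and $T = 1 - \gamma \approx 0.4228$, where $\gamma$ is the Euler-Mascheroni constant, so `gini` and `theil` should approximate these figures.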

DCA FOR SHARE DENSITIES
To compare and cluster together earner groups with a similar income inequality structure, we consider the DCA [14,15]. This has been widely employed in multivalued analysis as a non-hierarchical iterative algorithm for finding clusters and, simultaneously, the most representative element of each cluster. The algorithm has two steps: the partitioning of objects belonging to set $E$ into $J$ clusters (allocation step) and the identification of a representative object, or prototype, for each cluster (representation step). More formally, these two steps aim to find a partition $P = \{C_1, \dots, C_J\}$ of the set $E = \{\omega_1, \dots, \omega_K\}$ and corresponding prototypes $L = \{G_1, \dots, G_J\}$ which locally optimize an adequacy criterion that measures the fit between clusters and their prototypes. The criterion function $\Delta$ can be generally expressed as follows:

$$\Delta(P, L) = \sum_{j=1}^{J} \sum_{\omega_k \in C_j} d(\omega_k, G_j), \tag{2}$$

where $d(\omega_k, G_j)$ is the dissimilarity measure between the generic object $\omega_k$ and the prototype $G_j$. To find the best partition according to criterion $\Delta$, a proper dissimilarity measure is required to quantify the degree of proximity between two objects and between objects and prototypes. This measure has to be consistent with the prototype, which is an element of the space of description of $E$.

Dissimilarities among share densities
The definition of a dissimilarity measure is crucial in performing DCA, because it directly defines the criterion in (2) and, as a consequence, affects the identification of prototypes. Generally, different measures of distance or dissimilarity are available depending on the type of object to be classified. Since the goal here is to analyze differences among groups of income earners and their inequality structure, the next step is to define the dissimilarity measure starting from share densities. Of all possible dissimilarity measures for quantifying the discrepancy between density functions, here we consider the JS divergence, also called the total divergence to the average, a well-known measure of dissimilarity among probability distributions.
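Given $K$ densities $f_1, \dots, f_K$, the JS divergence can be obtained as follows:

$$D_{JS}(f_1, \dots, f_K) = \sum_{k=1}^{K} \pi_k\, D_{KL}(f_k \,\|\, f_m),$$

where $D_{KL}(f_k \,\|\, f_m)$ is the Kullback-Leibler divergence between the $k$th density and the mixture $f_m = \sum_{k=1}^{K} \pi_k \cdot f_k$, with $\pi_k \in [0, 1]$ and $\sum_{k=1}^{K} \pi_k = 1$. Alternatively, it can be written in terms of Shannon entropy $H$, as

$$D_{JS}(f_1, \dots, f_K) = H(f_m) - \sum_{k=1}^{K} \pi_k H(f_k),$$

where $H(f_k)$ is the Shannon entropy for the $k$th density.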
It can be shown that $D_{JS}(f_1, \dots, f_K) \ge 0$, with equality when $f_1 = \dots = f_K$, so that it is a bona fide measure of dissimilarity among the densities $f_1(\cdot), \dots, f_K(\cdot)$. Now, let $L_1, \dots, L_K$ be the Lorenz curves for $K$ different groups of income earners and $l_1, \dots, l_K$ the corresponding derivatives with respect to $u$. Hence, the JS divergence among the densities $l_k$ ($k = 1, \dots, K$) is given by:

$$D_{JS}(l_1, \dots, l_K) = \sum_{k=1}^{K} \pi_k\, D_{KL}(l_k \,\|\, l_m) = H(l_m) - \sum_{k=1}^{K} \pi_k H(l_k), \tag{4}$$

where $l_m = \sum_{k=1}^{K} \pi_k l_k(u)$. To define the mixture density $l_m$, the decomposition of the Lorenz curve reported by [2] will be considered, so that $\pi_k$ represents the income share for the $k$th group.
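As a rough numerical illustration of the two equivalent forms of the JS divergence, the sketch below discretizes a few share densities on a grid and evaluates both the Kullback-Leibler form and the entropy form. The share densities $l(u) = c\,u^{c-1}$ (Lorenz curves $L(u) = u^c$, $c \ge 1$), the weights and all function names are assumptions of this sketch, chosen only because they are simple; in the paper the weights are income shares.

```python
import math
from typing import List, Sequence, Tuple

N = 2000
GRID = [(i + 0.5) / N for i in range(N)]  # midpoints of a uniform grid on (0, 1)


def power_share_density(c: float) -> List[float]:
    """Illustrative share density l(u) = c * u**(c - 1), evaluated on the grid."""
    return [c * u ** (c - 1.0) for u in GRID]


def entropy(l: Sequence[float]) -> float:
    """Midpoint-rule approximation of the differential entropy H(l)."""
    return -sum(v * math.log(v) for v in l if v > 0.0) / N


def kl(l: Sequence[float], ref: Sequence[float]) -> float:
    """Midpoint-rule approximation of the Kullback-Leibler divergence D_KL(l || ref)."""
    return sum(a * math.log(a / b) for a, b in zip(l, ref) if a > 0.0) / N


def js_divergence(densities: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> Tuple[float, float]:
    """Return the JS divergence computed in its KL form and in its entropy form."""
    mixture = [sum(w * d[i] for w, d in zip(weights, densities)) for i in range(N)]
    kl_form = sum(w * kl(d, mixture) for w, d in zip(weights, densities))
    entropy_form = entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, densities))
    return kl_form, entropy_form


densities = [power_share_density(c) for c in (1.0, 1.5, 3.0)]
weights = [0.2, 0.3, 0.5]  # hypothetical weights summing to one
js_kl, js_entropy = js_divergence(densities, weights)
js_identical, _ = js_divergence([densities[1]] * 3, weights)
```

The two returned values agree up to floating-point error, and the divergence is (numerically) zero when all densities coincide.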
In our context, the criterion function in (2) can be expressed in terms of the Kullback-Leibler divergence between each share density and the share density of the corresponding prototype, as follows:

$$\Delta(P, L) = \sum_{j=1}^{J} \sum_{\omega_k \in C_j} \pi_k\, D_{KL}\big(l_k \,\|\, l_m^{(j)}\big),$$

where $l_m^{(j)}$ denotes the share density of the prototype of the $j$th cluster. It can be shown that optimizing criterion $\Delta$ is equivalent to minimizing the JS divergence within clusters and, at the same time, maximizing the JS divergence between clusters. Indeed, one of the properties of JS divergence is that the total divergence, that is, the divergence of all considered objects, can be decomposed into two quantities: one relates to the dissimilarities within each cluster and the other reflects the dissimilarities across clusters, according to Huygens' theorem:

$$D_{JS}^{T} = D_{JS}^{W} + D_{JS}^{B}, \tag{5}$$

where $D_{JS}^{T}$ is the total JS divergence, $D_{JS}^{W}$ is the within-cluster component and $D_{JS}^{B}$ is the between-cluster component. This decomposition allows us to evaluate the quality of the obtained partition according to the classical criteria [5], using the index given by the ratio between $D_{JS}^{W}$ and the total JS divergence. Moreover, $l_m^{(j)}$ is such that it minimizes the JS divergence within the $j$th cluster. It is worth noting that $l_m^{(j)}$ is a convex mixture of share densities, coherent with the decomposition of the Lorenz curve given in [2]; therefore, it is still a share density, so that each prototype belongs to the space of description of $E$.
From (4), it is evident that the proposed approach, based on JS and consequently on Kullback-Leibler divergence, takes into account the whole behavior of the share density for each object $\omega_k$, so that any kind of existing difference, for example tail inequality or crowding around the centre of the income distribution, will affect the results. Therefore, clustering procedures based on a dissimilarity matrix whose elements are computed as JS divergences can exploit any discrepancies that may be present in different segments of the population.
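The allocation and representation steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: share densities are discretized on a grid, the illustrative densities $l(u) = c\,u^{c-1}$ and the equal weights are hypothetical, prototypes are taken as weighted mixtures of cluster members, and the within-to-total ratio is computed from the weighted Kullback-Leibler terms.

```python
import math
from typing import List, Sequence, Tuple

N = 1000
GRID = [(i + 0.5) / N for i in range(N)]  # midpoints of a uniform grid on (0, 1)


def power_share_density(c: float) -> List[float]:
    """Illustrative share density l(u) = c * u**(c - 1) (an assumption of this sketch)."""
    return [c * u ** (c - 1.0) for u in GRID]


def kl(l: Sequence[float], ref: Sequence[float]) -> float:
    """Midpoint-rule approximation of D_KL(l || ref)."""
    return sum(a * math.log(a / b) for a, b in zip(l, ref) if a > 0.0) / N


def mixture(densities: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Convex mixture of densities, with weights renormalized to sum to one."""
    total = sum(weights)
    return [sum(w * d[i] for w, d in zip(weights, densities)) / total for i in range(N)]


def build_prototypes(densities, weights, labels, n_clusters) -> List[List[float]]:
    """Representation step: each prototype is the weighted mixture of its members."""
    prototypes = []
    for j in range(n_clusters):
        members = [k for k, lab in enumerate(labels) if lab == j]
        prototypes.append(mixture([densities[k] for k in members],
                                  [weights[k] for k in members]))
    return prototypes


def dynamic_clustering(densities, weights, init_labels,
                       max_iter: int = 50) -> Tuple[List[int], List[List[float]]]:
    """Alternate allocation and representation until the partition is stable.

    init_labels must assign at least one object to every cluster.
    """
    labels = list(init_labels)
    n_clusters = max(labels) + 1
    prototypes = build_prototypes(densities, weights, labels, n_clusters)
    for _ in range(max_iter):
        # Allocation step: assign each object to the closest prototype in KL terms.
        new_labels = [min(range(n_clusters), key=lambda j: kl(d, prototypes[j]))
                      for d in densities]
        # Stop when stable, or when a cluster would become empty.
        if new_labels == labels or len(set(new_labels)) < n_clusters:
            break
        labels = new_labels
        prototypes = build_prototypes(densities, weights, labels, n_clusters)
    return labels, prototypes


shapes = [1.1, 1.2, 1.3, 3.0, 3.5, 4.0]  # hypothetical groups of earners
densities = [power_share_density(c) for c in shapes]
weights = [1.0 / len(shapes)] * len(shapes)
labels, prototypes = dynamic_clustering(densities, weights, [0, 1, 0, 1, 0, 1])

overall = mixture(densities, weights)
within = sum(w * kl(d, prototypes[lab]) for w, d, lab in zip(weights, densities, labels))
total = sum(w * kl(d, overall) for w, d in zip(weights, densities))
within_share = within / total  # share of total JS divergence found inside clusters
```

A small `within_share` indicates low heterogeneity inside clusters. In practice the algorithm is sensitive to the initial partition, so several starting points are usually tried.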

MODELING INCOME: SOME PARAMETRIC DISTRIBUTIONS
To model income data, different parametric distributions have been proposed (see [7,24], among others). In this section, some of them are briefly recalled, along with their principal characteristics. In particular, the associated share densities are obtained and the relations with the Gini and Theil indexes, that is, $G = 2 \cdot E(U) - 1$ and $T = -H(l)$, are examined.

Pareto distribution
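Starting from a pioneering work of 1895, the Pareto model and its many generalizations have been widely used in the context of income analysis [1]. The df of the classical Pareto distribution is given by:

$$F_{Par}(y; \sigma, \alpha) = 1 - \left(\frac{y}{\sigma}\right)^{-\alpha}, \quad y \ge \sigma,$$

where $\sigma > 0$ is a scale parameter and $\alpha > 0$ is a shape parameter. The $t$th quantile of the Pareto distribution is $y(t) = F_{Par}^{-1}(t; \sigma, \alpha) = \sigma (1 - t)^{-1/\alpha}$, and its $r$th moment is given by $E(Y^r) = \alpha \sigma^r / (\alpha - r)$ for $\alpha > r$. Moreover, for $\alpha > 1$, the Lorenz curve and Gini index are given, respectively, by:

$$L_{Par}(u) = 1 - (1 - u)^{1 - 1/\alpha} \tag{7}$$

and

$$G_{Par} = \frac{1}{2\alpha - 1}.$$

Proposition 1. If $Y \sim Par(\sigma, \alpha)$ and $\alpha > 1$, then the corresponding rv $U$ has a Beta distribution with parameters $1$ and $1 - 1/\alpha$, that is, $U \sim Beta(1, 1 - 1/\alpha)$.

Proof. By differentiating expression (7) with respect to $u$, it is easy to verify that the share density corresponding to $Y \sim Par(\sigma, \alpha)$ is given by:

$$l_{Par}(u) = \left(1 - \frac{1}{\alpha}\right)(1 - u)^{-1/\alpha}, \tag{8}$$

which is a Beta pdf with parameters $1$ and $1 - 1/\alpha$ or, alternatively, a Kumaraswamy pdf, i.e. $U \sim KW(1, 1 - 1/\alpha)$.

Corollary 2. If $Y \sim Par(\sigma, \alpha)$ and $\alpha > 1$, the corresponding Gini index is given by $G = 1/(2\alpha - 1)$.

Proof. Given (8) and recalling that, for a rv $U \sim Beta(p, q)$, $E(U) = p/(p + q)$, we obtain $E(U) = \alpha/(2\alpha - 1)$ and hence $G = 2 \cdot E(U) - 1 = 1/(2\alpha - 1)$.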
The entropy for the share density corresponding to a Pareto distribution can be equivalently obtained from the entropy of a Beta rv or from the result for the entropy of a Kumaraswamy rv, reported in [30], that is:

$$H(l_{Par}) = -\frac{1}{\alpha - 1} - \log\left(1 - \frac{1}{\alpha}\right).$$

It is worth noting that the classical Pareto distribution can be viewed as a Generalized Beta distribution of the first kind (GB1), and consequently, the previous expression coincides with the negative of Theil's index reported in [29] for a $GB1(-1, \sigma, \alpha, 1)$.
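The closed forms for the Pareto case can be checked against direct numerical integration of the share density. The sketch below assumes the share density $l(u) = (1 - 1/\alpha)(1 - u)^{-1/\alpha}$, the Gini index $1/(2\alpha - 1)$ and the entropy $-1/(\alpha - 1) - \log(1 - 1/\alpha)$; the value $\alpha = 3$, the function names and the midpoint rule are arbitrary choices made for illustration.

```python
import math
from typing import Callable


def pareto_share_density(u: float, alpha: float) -> float:
    """Share density of a Pareto income distribution with shape alpha > 1."""
    return (1.0 - 1.0 / alpha) * (1.0 - u) ** (-1.0 / alpha)


def midpoint_integral(func: Callable[[float], float], n: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))


alpha = 3.0  # hypothetical shape parameter


def l(u: float) -> float:
    return pareto_share_density(u, alpha)


# Gini index: numerical G = 2 E(U) - 1 versus the closed form 1 / (2 alpha - 1).
gini_numeric = 2.0 * midpoint_integral(lambda u: u * l(u)) - 1.0
gini_closed = 1.0 / (2.0 * alpha - 1.0)

# Entropy of the share density: numerical value versus the closed form.
entropy_numeric = -midpoint_integral(lambda u: l(u) * math.log(l(u)))
entropy_closed = -1.0 / (alpha - 1.0) - math.log(1.0 - 1.0 / alpha)
```

Because the integrand is unbounded near $u = 1$, the midpoint rule converges slowly here; the agreement is therefore only approximate, to about three decimal places with the grid size used.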

Dagum model
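The Dagum distribution (see [11]) is widely appreciated in economic and financial fields for its excellent fit to empirical income data (for an exhaustive review of this model and its applications, see [22,24]). The df of a Dagum rv $Y$ has the following expression:

$$F_{Da}(y; \beta, \lambda, \delta) = \left(1 + \lambda y^{-\delta}\right)^{-\beta}, \tag{10}$$

where $y > 0$, $\beta > 0$, $\lambda > 0$, and $\delta > 0$. Parameter $\lambda$ is a scale parameter, while $\beta$ and $\delta$ are shape parameters. The $t$th quantile of the Dagum distribution is:

$$y(t) = \lambda^{1/\delta}\left(t^{-1/\beta} - 1\right)^{-1/\delta}.$$

Moreover, the $r$th moment is given by

$$E(Y^r) = \beta \lambda^{r/\delta} B\left(\beta + \frac{r}{\delta},\, 1 - \frac{r}{\delta}\right)$$

for $\delta > r$, where $B(\cdot, \cdot)$ is the complete Beta function. It is known that, for the Dagum model, the Lorenz curve is given by the following regularized incomplete beta function $I_z(\cdot, \cdot)$ (see, e.g., [11,12]):

$$L_{Da}(u) = I_z\left(\beta + \frac{1}{\delta},\, 1 - \frac{1}{\delta}\right) = \frac{B\left(z;\, \beta + \frac{1}{\delta},\, 1 - \frac{1}{\delta}\right)}{B\left(\beta + \frac{1}{\delta},\, 1 - \frac{1}{\delta}\right)}, \tag{13}$$

where $z = u^{1/\beta}$, $\delta > 1$, $u \in (0, 1)$ and $B(x; \cdot, \cdot)$, with $x \in (0, 1)$, is the incomplete Beta function. The corresponding Gini index is

$$G_{Da} = \frac{\Gamma(\beta)\,\Gamma(2\beta + 1/\delta)}{\Gamma(2\beta)\,\Gamma(\beta + 1/\delta)} - 1, \tag{14}$$

where $\Gamma(\cdot)$ is the Gamma function.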
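Proposition 3. If $Y \sim Da(\beta, \lambda, \delta)$ and $\delta > 1$, then the corresponding rv $U$ has a Generalized Beta distribution of the first type (GB1), specifically $U \sim GB1(1/\beta,\, 1,\, \beta + 1/\delta,\, 1 - 1/\delta)$.

Proof. From (13), it is easy to verify that the corresponding share density is given by:

$$l_{Da}(u) = \frac{u^{1/(\beta\delta)}\left(1 - u^{1/\beta}\right)^{-1/\delta}}{\beta\, B\left(\beta + \frac{1}{\delta},\, 1 - \frac{1}{\delta}\right)}, \tag{15}$$

which is the expression of a GB1 density (see [33]) with parameters $a = 1/\beta$, $b = 1$, $p = \beta + 1/\delta$ and $q = 1 - 1/\delta$, namely $U \sim GB1(1/\beta,\, 1,\, \beta + 1/\delta,\, 1 - 1/\delta)$.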

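Corollary 4. If $Y \sim Da(\beta, \lambda, \delta)$, the corresponding Gini index is given by

$$G = \frac{\Gamma(\beta)\,\Gamma(2\beta + 1/\delta)}{\Gamma(2\beta)\,\Gamma(\beta + 1/\delta)} - 1.$$

Proof. Given Proposition 3, it is possible to obtain expression (14), recalling that, for $U \sim GB1(a, b, p, q)$, the first moment is $E(U) = b\, B(p + 1/a,\, q)/B(p, q)$, and that $G = 2 \cdot E(U) - 1$.

Dagum [11] showed that $\partial G/\partial \beta < 0$ and $\partial G/\partial \delta < 0$, thereby interpreting both $\beta$ and $\delta$ as inverse indicators of income inequality. Actually, besides the Gini index, they affect several inequality measures such as the Theil, Bonferroni and Zenga indices. However, [12] shows that they play different roles in describing inequality, as $\beta$ impacts inequality mainly for lower incomes, while $\delta$ shows a greater influence on the right tail of the distribution. For this reason, [16] consider the reciprocals of $\beta$ and $\delta$ to have direct indicators which can explain inequality in both tails of the Dagum distribution. Starting from (15), it is easy to obtain a closed expression for the entropy of $l_{Da}$, as follows:

$$H(l_{Da}) = \log\left[\beta\, B\left(\beta + \frac{1}{\delta},\, 1 - \frac{1}{\delta}\right)\right] - \frac{1}{\delta}\left[\psi\left(\beta + \frac{1}{\delta}\right) - \psi\left(1 - \frac{1}{\delta}\right)\right], \tag{16}$$

where $\psi(\cdot)$ is the digamma function and the following results have been used (see [18]):

$$\int_0^1 x^{p-1}(1 - x)^{q-1} \log x\,dx = B(p, q)\left[\psi(p) - \psi(p + q)\right], \qquad \int_0^1 x^{p-1}(1 - x)^{q-1} \log(1 - x)\,dx = B(p, q)\left[\psi(q) - \psi(p + q)\right].$$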
It is worth noting that the result in (16) agrees, except for the sign, with that reported in [6] for Theil's T index, confirming the relation between T and H(l).
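The Dagum results can be verified numerically. The sketch below assumes the share density $l_{Da}(u) = u^{1/(\beta\delta)}(1 - u^{1/\beta})^{-1/\delta} / [\beta B(\beta + 1/\delta, 1 - 1/\delta)]$ together with the Gamma-function expression for the Gini index and the digamma expression for the entropy. The parameter values, function names, and the finite-difference digamma (the Python standard library has none) are assumptions of this sketch.

```python
import math
from typing import Callable


def digamma(x: float, h: float = 1e-5) -> float:
    """Central-difference approximation of the digamma function via math.lgamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def beta_fn(a: float, b: float) -> float:
    """Complete Beta function B(a, b)."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def dagum_share_density(u: float, beta: float, delta: float) -> float:
    """Share density of a Dagum income distribution (shape parameters beta, delta > 1)."""
    norm = beta * beta_fn(beta + 1.0 / delta, 1.0 - 1.0 / delta)
    return u ** (1.0 / (beta * delta)) * (1.0 - u ** (1.0 / beta)) ** (-1.0 / delta) / norm


def dagum_gini(beta: float, delta: float) -> float:
    """Closed-form Gini index of the Dagum distribution."""
    log_ratio = (math.lgamma(beta) + math.lgamma(2.0 * beta + 1.0 / delta)
                 - math.lgamma(2.0 * beta) - math.lgamma(beta + 1.0 / delta))
    return math.exp(log_ratio) - 1.0


def dagum_share_entropy(beta: float, delta: float) -> float:
    """Closed-form entropy of the Dagum share density (minus Theil's T)."""
    p, q = beta + 1.0 / delta, 1.0 - 1.0 / delta
    return math.log(beta * beta_fn(p, q)) - (digamma(p) - digamma(q)) / delta


def midpoint_integral(func: Callable[[float], float], n: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of func over (0, 1)."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))


beta, delta = 0.7, 3.5  # hypothetical shape parameters


def l(u: float) -> float:
    return dagum_share_density(u, beta, delta)


mass = midpoint_integral(l)  # should be close to 1
gini_numeric = 2.0 * midpoint_integral(lambda u: u * l(u)) - 1.0
entropy_numeric = -midpoint_integral(lambda u: l(u) * math.log(l(u)))
theil = -dagum_share_entropy(beta, delta)  # Theil's T as minus the entropy
```

The numerical and closed-form values agree only approximately, since the share density is unbounded near $u = 1$ and the midpoint rule is a crude integrator there.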

Singh-Maddala
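The df of the Singh-Maddala distribution ([24]) is given by:

$$F_{SM}(y; a, b, q) = 1 - \left[1 + \left(\frac{y}{b}\right)^{a}\right]^{-q}, \quad y > 0, \tag{17}$$

where $b > 0$ is a scale parameter and $a, q > 0$ are shape parameters. The $t$th quantile is given by

$$y(t) = b\left[(1 - t)^{-1/q} - 1\right]^{1/a},$$

and the $r$th moment,

$$E(Y^r) = \frac{b^r\, \Gamma(1 + r/a)\, \Gamma(q - r/a)}{\Gamma(q)},$$

exists whenever $-a < r < aq$. The Lorenz curve can be expressed in terms of the regularized incomplete beta function $I_z$, as follows:

$$L_{SM}(u) = I_z\left(1 + \frac{1}{a},\, q - \frac{1}{a}\right), \tag{18}$$

where $z = 1 - (1 - u)^{1/q}$; the Gini index has the following expression:

$$G_{SM} = 1 - \frac{\Gamma(q)\, \Gamma(2q - 1/a)}{\Gamma(q - 1/a)\, \Gamma(2q)}.$$

Proposition 5. If $Y \sim SM(a, b, q)$, then the corresponding share density is the pdf of a rv $U$ such that $Z = 1 - U$ follows a GB1 distribution with power parameter $1/q$, unit scale and shape parameters $q - 1/a$ and $1 + 1/a$, that is, $Z \sim GB1(1/q,\, 1,\, q - 1/a,\, 1 + 1/a)$.

Proof. Given expression (18), it is easy to obtain the corresponding share density as:

$$l_{SM}(u) = \frac{(1 - u)^{-1/(aq)}\left[1 - (1 - u)^{1/q}\right]^{1/a}}{q\, B\left(1 + \frac{1}{a},\, q - \frac{1}{a}\right)}, \tag{19}$$

from which it is immediate to verify that the rv $Z = (1 - U) \sim GB1(1/q,\, 1,\, q - 1/a,\, 1 + 1/a)$.

Corollary 6. If $Y \sim SM(a, b, q)$, then the corresponding Gini index is given by

$$G = 1 - \frac{\Gamma(q)\, \Gamma(2q - 1/a)}{\Gamma(q - 1/a)\, \Gamma(2q)}.$$

Proof. It is known that, for $Z \sim GB1(\tilde a, \tilde b, \tilde p, \tilde q)$, the corresponding first moment is (see [33]):

$$E(Z) = \frac{\tilde b\, B(\tilde p + 1/\tilde a,\, \tilde q)}{B(\tilde p, \tilde q)}.$$

Recalling Proposition 5, we can write:

$$G = 2 \cdot E(U) - 1 = 1 - 2 \cdot E(Z) = 1 - \frac{2\, B\left(2q - \frac{1}{a},\, 1 + \frac{1}{a}\right)}{B\left(q - \frac{1}{a},\, 1 + \frac{1}{a}\right)} = 1 - \frac{\Gamma(q)\, \Gamma(2q - 1/a)}{\Gamma(q - 1/a)\, \Gamma(2q)}.$$

A closed expression for $H(l_{SM})$ can be obtained from the share density in (19), as follows:

$$H(l_{SM}) = \log\left[q\, B\left(1 + \frac{1}{a},\, q - \frac{1}{a}\right)\right] - \frac{1}{a}\left[\psi\left(1 + \frac{1}{a}\right) - \psi\left(q - \frac{1}{a}\right)\right], \tag{20}$$

where $\psi(\cdot)$ is the digamma function and the following results have been used (see [18]):

$$\int_0^1 x^{p-1}(1 - x)^{q-1} \log x\,dx = B(p, q)\left[\psi(p) - \psi(p + q)\right], \qquad \int_0^1 x^{p-1}(1 - x)^{q-1} \log(1 - x)\,dx = B(p, q)\left[\psi(q) - \psi(p + q)\right].$$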
It is worth noting that the result in (20) agrees, except for the sign, with that reported in [6] for Theil's T index.
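For the Singh-Maddala case, the share density can also be built directly from its definition, $l(u) = F^{-1}(u)/E(Y)$, and compared with the closed form. The sketch below does this pointwise and then checks the Gini index by numerical integration. It assumes the quantile function $b[(1 - t)^{-1/q} - 1]^{1/a}$, the mean $b\,\Gamma(1 + 1/a)\Gamma(q - 1/a)/\Gamma(q)$ and the Gini expression of the corollary; parameter values and function names are hypothetical.

```python
import math


def sm_quantile(t: float, a: float, b: float, q: float) -> float:
    """Quantile function of the Singh-Maddala distribution."""
    return b * ((1.0 - t) ** (-1.0 / q) - 1.0) ** (1.0 / a)


def sm_mean(a: float, b: float, q: float) -> float:
    """Mean of the Singh-Maddala distribution (requires a * q > 1)."""
    return b * math.exp(math.lgamma(1.0 + 1.0 / a) + math.lgamma(q - 1.0 / a)
                        - math.lgamma(q))


def sm_share_density(u: float, a: float, q: float) -> float:
    """Closed-form share density; note that it does not depend on the scale b."""
    log_beta = (math.lgamma(1.0 + 1.0 / a) + math.lgamma(q - 1.0 / a)
                - math.lgamma(q + 1.0))
    numerator = (1.0 - u) ** (-1.0 / (a * q)) * (1.0 - (1.0 - u) ** (1.0 / q)) ** (1.0 / a)
    return numerator / (q * math.exp(log_beta))


def sm_gini(a: float, q: float) -> float:
    """Closed-form Gini index of the Singh-Maddala distribution."""
    return 1.0 - math.exp(math.lgamma(q) + math.lgamma(2.0 * q - 1.0 / a)
                          - math.lgamma(q - 1.0 / a) - math.lgamma(2.0 * q))


a, b, q = 3.0, 2.0, 1.5  # hypothetical parameters with a * q > 1

# Pointwise comparison of the definition F^{-1}(u) / E(Y) with the closed form.
points = [0.05, 0.25, 0.5, 0.75, 0.95]
max_gap = max(abs(sm_quantile(u, a, b, q) / sm_mean(a, b, q) - sm_share_density(u, a, q))
              for u in points)

# Gini index via G = 2 E(U) - 1, with E(U) from a midpoint rule on (0, 1).
n = 200_000
mean_u = sum(((i + 0.5) / n) * sm_share_density((i + 0.5) / n, a, q) for i in range(n)) / n
gini_numeric = 2.0 * mean_u - 1.0
```

The pointwise gap is at the level of floating-point error, while the Gini comparison is limited by the accuracy of the midpoint rule near $u = 1$.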

Generalized Beta of second type
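A widely used distribution for modeling income is the Generalized Beta of the second kind (GB2) [6]. This is a very flexible four-parameter distribution, obtained by transforming a beta random variable. Indeed, its df is given by:

$$F_{GB2}(y; a, b, p, q) = I_z(p, q), \quad z = \frac{(y/b)^a}{1 + (y/b)^a}, \quad y > 0,$$

for $a, b, p, q > 0$. The corresponding pdf is:

$$f_{GB2}(y; a, b, p, q) = \frac{a\, y^{ap - 1}}{b^{ap}\, B(p, q)\left[1 + (y/b)^a\right]^{p + q}}.$$

The GB2 distribution includes, as special or limiting cases, many distributions, such as the Generalized Gamma, Gamma, Weibull, Dagum and Singh-Maddala. In particular, setting $(a, b, p, q = 1)$ and $(a, b, p = 1, q)$, expressions (10) and (17) are, respectively, obtained.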
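The $t$th quantile of the GB2 distribution is not available in closed form, but can be written as

$$y(t) = b\left[\frac{I_t^{-1}(p, q)}{1 - I_t^{-1}(p, q)}\right]^{1/a},$$

where $I_t^{-1}(p, q)$ denotes the inverse of the regularized incomplete beta function.

Figure 1: Empirical and fitted distributions of income (in tens of thousands) for Italian regions.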
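The $r$th moment exists for $-ap < r < aq$ and is given by

$$E(Y^r) = \frac{b^r\, B(p + r/a,\, q - r/a)}{B(p, q)}.$$

In addition, the Lorenz curve can be expressed in terms of the beta function (see [6]) as follows:

$$L_{GB2}(u) = I_w\left(p + \frac{1}{a},\, q - \frac{1}{a}\right), \quad w = I_u^{-1}(p, q),$$

and exists whenever $aq > 1$. After some algebra, it is possible to obtain the share density for the GB2 as:

$$l_{GB2}(u) = \frac{B(p, q)}{B\left(p + \frac{1}{a},\, q - \frac{1}{a}\right)}\left(\frac{w}{1 - w}\right)^{1/a}.$$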
The Gini index has a rather complex form involving the generalized hypergeometric function and is available in [24,29].
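Following steps similar to those used for the Dagum and SM distributions, it is possible to obtain a closed form for the entropy of $l_{GB2}$:

$$H(l_{GB2}) = \log\frac{B\left(p + \frac{1}{a},\, q - \frac{1}{a}\right)}{B(p, q)} - \frac{1}{a}\left[\psi\left(p + \frac{1}{a}\right) - \psi\left(q - \frac{1}{a}\right)\right],$$

where $\psi(\cdot)$ is the digamma function. Also in this case, the relation between the obtained entropy and the T index reported in [6] for a GB2 rv is confirmed.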

APPLICATION TO ITALIAN REGIONS
To demonstrate the potential of the proposed method, data from the Survey on Household Income and Wealth (SHIW), conducted by the Bank of Italy in 2016, are considered. This survey covers income, wealth, and other aspects of the economic and financial situation of about 8000 Italian households and refers to the head of the household, defined as the person with the highest income. To account for the actual composition of households, the equivalent income is obtained using the OECD-modified equivalence scale. This scale, first proposed by [19] and adopted by the Statistical Office of the European Union (EUROSTAT) in the late 1990s, assigns an equivalence value of 1 to the household head, 0.5 to each additional adult member and 0.3 to each child. Total household income is then divided by the sum of these weights.
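The equivalent income computation can be illustrated with a short sketch. The function names and the example household are hypothetical; the weights are those stated above (1 for the household head, 0.5 for each additional adult, 0.3 for each child).

```python
def oecd_modified_scale(n_adults: int, n_children: int) -> float:
    """Sum of equivalence weights: 1 for the head, 0.5 per extra adult, 0.3 per child."""
    if n_adults < 1 or n_children < 0:
        raise ValueError("a household needs at least one adult and no negative counts")
    return 1.0 + 0.5 * (n_adults - 1) + 0.3 * n_children


def equivalent_income(total_income: float, n_adults: int, n_children: int) -> float:
    """Total household income divided by the sum of the equivalence weights."""
    return total_income / oecd_modified_scale(n_adults, n_children)


# Hypothetical household: two adults and two children with a total income of 42000.
scale = oecd_modified_scale(2, 2)
income = equivalent_income(42_000.0, 2, 2)
```

For this household the weights sum to 2.1, so the equivalent income is 42000 / 2.1 = 20000.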
Income data for each region are described by the Dagum distribution. The maximum likelihood estimates of parameters and corresponding standard errors are reported in Table 1. From the comparison between fitted and empirical distributions, shown in Figure 1, it is evident that this model accurately describes the income distributions for the various groups of households.
Given the parameter estimates, the values of the Gini and Theil indexes are obtained, respectively, from equations (14) and (16), the latter changed in sign since $T = -H(l)$. Moreover, to have a direct indicator of tail inequality, the reciprocals of $\hat{\beta}$ and $\hat{\delta}$ are also reported in Table 2. From these results, it is evident that Italian regions have different levels of income concentration, as indicated by the Gini and Theil indexes. Interestingly, even when these two synthetic measures are similar, the structure of inequality can differ depending on the specific segment of the population. This seems to be the case, for example, for Liguria and Sardinia, two regions with very similar values of $\hat{T}$ and $\hat{G}$, but with different concentration levels in the distribution tails, namely for the poorest and richest, as $1/\hat{\beta}_k$ and $1/\hat{\delta}_k$ suggest. The last column of Table 2 also reports the cluster membership of each region. To obtain this specific partition, the DCA described in Section 3 and based on the JS divergence among share densities is considered, with the number of clusters set to 3. A preliminary hierarchical algorithm was run to determine a suitable number of clusters. Moreover, to compute the divergence in (4), we use expression (16) to obtain the regional share density entropies and numerical integration procedures for the entropies of mixtures.
The different behaviors of the Lorenz curves and share densities for regions belonging to the three clusters are depicted in Figure 2. Both panels show a notable departure from the equity situation in all regions. Indeed, all curves are far from the case of a perfectly equitable distribution, represented, for the share densities and Lorenz curves, respectively, by a horizontal line with an intercept equal to 1 and a diagonal line with a slope of 1.
As we can see from the results and from Figure 2, this algorithm allows us to obtain clearly characterized clusters, with regions having higher concentration levels included in cluster 2. For these regions, in fact, both Theil and Gini indexes are higher, respectively, than 0.1437 and 0.2883. Moreover, these regions also record high concentration levels in the tails of the distribution, as confirmed by generally higher values of both reciprocals, $1/\hat{\beta}$ and $1/\hat{\delta}$. On the other hand, lower inequality is generally present in regions belonging to clusters 1 and 3. The element that differentiates these two clusters is the behavior in the tails of the distribution: for regions included in cluster 3, higher inequality exists at lower values of $u$, that is, for low income households, so that the behavior in this first part of the distribution appears more similar to that of the curves in cluster 2 (see first panel of Figure 3). On the other hand, regions belonging to cluster 1 show, in the initial part of the graph, curves nearer to an equity condition. The reverse situation generally applies for higher values of $u$, where curves for regions belonging to cluster 3 are further from the curves in cluster 2.

Figure 3: Left tail (first panel) and right tail (second panel) of the Lorenz curve for the three clusters.
Finally, considering the decomposition of total divergence given in (5), we find that only 25.2% of total JS dissimilarity is due to differences in regions belonging to the same group, while the remaining 74.8% is because of differences between the three groups, indicating low heterogeneity inside clusters and, in general, a good quality of the partition obtained.

CONCLUSIONS
Among all the possible ways of drawing knowledge from the Lorenz curve, in this paper we propose to focus on its derivative, the so-called share density, a function that provides a good deal of information regarding inequality. Starting from the connection between share density and elements of information theory, we propose to compare income inequality among different groups using the JS dissimilarity. This measure can be usefully employed in a symbolic clustering context, where density functions represent the objects to be partitioned. Although many parametric and non-parametric approaches can be used to describe income distributions and the Lorenz curve, we focus here on the Dagum model owing to its ability to fit real data. With these premises and with the aim of showing the potential of the proposed method, a real dataset regarding the income and wealth distributions of Italian households at the regional level is considered. The findings show that our proposal can be used to identify groups that are homogeneous in terms of income concentration levels, and to uncover the differences in inequality across regions, not only globally, as synthetic indexes do, but also at particular levels, by exploring the different income brackets in a population. For these reasons, the proposed method opens new perspectives for future methodological development and could help policy makers understand the structure of inequality.

ACKNOWLEDGMENT
The author thanks the two anonymous reviewers for their constructive comments and suggestions. Open Access Funding provided by Universita della Calabria within the CRUI-CARE Agreement.

FUNDING INFORMATION
The author gratefully acknowledges research grant 'Fondo sostegno aree socio-umanistiche-Cda del 26.03.2021-quota DESF' from the University of Calabria.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in Bank of Italy at https://www.bancaditalia.it/. These data were derived from the following resources available