Parametric Representation of the Top of Income Distributions: Options, Historical Evidence and Model Selection

Approximating the top of income distributions with smooth parametric forms is valuable for descriptive purposes, as well as for correcting income distributions for various top-income measurement and sampling problems. The proliferation of distinct branches of modeling literature over the past decades has given rise to the need to survey the alternative modeling options and develop systematic tools to discriminate among them. This paper reviews the state of methodological and empirical knowledge regarding the adoptable distribution functions, and lists references and statistical programs allowing practitioners to apply these parametric models to microdata in household income surveys, administrative registers, or grouped-records data from national accounts statistics. Implications for modeling the distribution of other economic outcomes including consumption and wealth are drawn. For incomes, recent consensus shows that among the many candidate distribution functions, only a handful have proved to be consistently successful, namely the generalized Pareto, and the 3–4 parameter distributions in the generalized beta family, including the Singh-Maddala and the GB2 distributions. Understanding these functions in relation to other known alternatives is a contribution of this review.


I. Introduction
Since Vilfredo Pareto's (1895, 1896ab) seminal work, welfare economists have noted regularities in the shape of the top of various income distributions and the value of using parametric approximations in place of actual income distributions. The distributions tend to be smooth and monotonically declining with the level of incomes: systematically, a higher share of households have lower incomes while only a few households have higher incomes. The tails of the distributions are also known to be heavy, meaning that the dispersion of top incomes is high, and the richest households are significantly richer than those that follow.
It has been widely accepted that these distributions can fruitfully be described by a handful of parameters. These parameters vary modestly across countries, years, income concepts, or exact income ranges. The broadly similar shape of the distribution of top incomes in various settings means that estimates from one empirical distribution could be used to approximate or validate incomes from other distributions.
Recent academic literature has recognized the value of this tendency in light of measurement problems associated with top incomes. It is now well known that top ends of empirical income distributions suffer from various representation problems including data contamination (Cowell and Flachaire 2002, 2007), top coding (Hlásny and Verme 2018a), or sampling issues such as sampling errors, exclusion from sampling frame, and systematic nonresponse (Hlásny and Verme 2018c). Evidence from the up-to-date stock of modeling literature has shown that problematic or missing incomes can satisfactorily be imputed using parametric approaches (Hlásny and Verme 2015; Lustig 2019).
The implicit assumption motivating these analyses is that the observed top incomes are realizations from a stochastic distribution well approximated by one of the known families of parametric distribution functions. Vilfredo Pareto proposed a one-parameter distribution function (referred to as Pareto 1¹), but subsequently generalized it to a two-parameter form (generalized Pareto, or Pareto 2) providing better fit across income distributions. Benini (1897) provided an early confirmation of the relevance of the Pareto distribution for incomes, but also proposed 2- and 3-parameter variants as providing a better fit. Literature that [...]
The rest of the paper is organized as follows. Section II provides an overview of the families of distributions and individual functions considered in the existing literature. Section III then summarizes the evidence of their fit to empirical income distributions. Section IV reports the key lessons and implications for practitioners. Finally, three appendices review some of the methodological options for estimating and diagnosing the alternative parametric models: Appendix 1 lists a number of relevant user-written programs for Stata software, and Appendices 2 and 3 briefly review some alternative estimators and diagnostic tests for alternative models.

II. Distribution functions and their relations
Several families of smoothly decaying distribution functions have historically been considered for modeling top incomes. These include the generalized Pareto, beta (aka, Burr; see Burr 1942), gamma (Ammon 1895), and other distribution families. These families overlap, as some of their members can be derived or approximated as special cases of distribution functions in other families (Cowell 2011). Moreover, new distribution functions resembling the standard families continue to be proposed as providing better or more flexible fit in relation to top incomes. These distribution functions are generally well understood, and their moments and the degrees of inequality they imply can be computed analytically. (For the exponential, Pareto, lognormal, Burr and Dagum distributions, see references in Giorgi 1999:253-254.) A large arsenal of statistical programs has grown to estimate and diagnose them (see Appendix 1). This section provides a brief overview of a number of the promulgated distribution functions, some of their properties, and the relationships among them. Sarabia and Castillo (2005) study the stability of multiple distribution functions with respect to the maximum order statistics, and their use in the analysis of incomes. More technical detail for many of the distribution functions can be found in [...]
[...] are appropriate for modeling observations but not for prediction. Using them on contaminated data, or on one income bracket, may not allow consistent imputation of the misreported or out-of-sample values. Kernel density estimation may also perform poorly in the tails. Relatedly, performance of these methods depends crucially on sufficient sample sizes. Single-distribution methods are more robust to data issues, because each distribution is estimated on an effectively larger sample, and the methods allow imputation and extrapolation. Finally, mixture models often produce multi-modal composite distributions, and can yield unrealistic predictions in between modes.
The finite mixture model can be estimated directly in Stata, starting with Stata 15. For instance, fmm 2: streg, distribution(gamma) estimates a finite mixture model of 2 gamma distributions. The number and type of parametric distributions are user-selected and can be evaluated using AIC, BIC and other model-selection criteria.
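The model-selection step mentioned above can be sketched in a few lines. The example below is a minimal illustration in Python rather than Stata, and it compares two single-distribution candidates (exponential and lognormal) instead of mixture components, because both have closed-form maximum likelihood estimates. The function names and the simulated incomes are assumptions made for illustration only.

```python
import math
import random


def loglik_exponential(xs):
    """Maximized log-likelihood of an exponential fit (MLE rate = 1 / mean)."""
    n = len(xs)
    mean = sum(xs) / n
    return -n * math.log(mean) - n


def loglik_lognormal(xs):
    """Maximized log-likelihood of a 2-parameter lognormal fit."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE variance (divides by n)
    return -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * n


def aic(loglik, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2.0 * loglik


# Simulated (hypothetical) incomes drawn from a lognormal distribution.
random.seed(0)
incomes = [random.lognormvariate(10.0, 0.8) for _ in range(2000)]
n_obs = len(incomes)

aic_exp = aic(loglik_exponential(incomes), k=1)
aic_ln = aic(loglik_lognormal(incomes), k=2)
bic_exp = bic(loglik_exponential(incomes), k=1, n=n_obs)
bic_ln = bic(loglik_lognormal(incomes), k=2, n=n_obs)

# Lower values indicate the preferred model under each criterion.
print("AIC:", round(aic_exp, 1), round(aic_ln, 1))
print("BIC:", round(bic_exp, 1), round(bic_ln, 1))
```

The same two criteria apply unchanged to mixture models; only the log-likelihood and the parameter count (component parameters plus mixing weights) differ.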

(Generalized) Pareto family
The Pareto family of distribution functions, including those introduced by Pareto himself in the 1890s and those proposed over the following century (Arnold 2015), remains among the most commonly applied functions and the first candidates to turn to for top-income modeling. This is because of their theoretical justification (Jones 2015), good empirical fit, ease of fitting, and ease of computing moments and inequality indexes. Pareto distributions exhibit slow decay based on the power law, which makes them appropriate for modeling of the heavy upper-most tail of income distributions (Gabaix 2009).
There are several commonly used specifications of the Pareto distribution functions. They differ according to the rate of decay or shape in different income ranges, and their income scale.
Pareto 1 has one shape parameter, while Pareto 2 and Pareto 3 are 2-parameter models allowing for shape and scale parameters, and Pareto 4 has an additional shape parameter (Arnold 2015).
Pareto 2 and 3 can be viewed as special cases of the Pareto 4 and, in turn, Pareto 1 is a special case of theirs. Pareto 3 was characterized by Arnold et al. (1986), who showed that the logistic and Zipf distributions are mere transformations of the Pareto 3.
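The nesting of the Pareto 1 to 4 specifications can be made concrete with their survival functions. The sketch below assumes the commonly used parameterization with a location parameter mu, a scale sigma, an inequality parameter gamma and a tail parameter alpha, in which the Pareto 4 survival function is [1 + ((x - mu)/sigma)^(1/gamma)]^(-alpha). The location parameter and the function names are assumptions beyond what the text states.

```python
def pareto4_sf(x, mu, sigma, gamma, alpha):
    """Survival function P(X > x) of the Pareto 4 for x >= mu."""
    if x < mu:
        return 1.0
    return (1.0 + ((x - mu) / sigma) ** (1.0 / gamma)) ** (-alpha)


def pareto3_sf(x, mu, sigma, gamma):
    """Pareto 3: Pareto 4 with the tail parameter alpha fixed at 1."""
    return pareto4_sf(x, mu, sigma, gamma, alpha=1.0)


def pareto2_sf(x, mu, sigma, alpha):
    """Pareto 2: Pareto 4 with the inequality parameter gamma fixed at 1."""
    return pareto4_sf(x, mu, sigma, gamma=1.0, alpha=alpha)


def pareto1_sf(x, sigma, alpha):
    """Pareto 1: Pareto 2 with location equal to scale, i.e. (x/sigma)^(-alpha)."""
    return pareto2_sf(x, mu=sigma, sigma=sigma, alpha=alpha)


# Share of units with income above 200 when the lower bound is 100 and alpha = 2.
print(pareto1_sf(200.0, sigma=100.0, alpha=2.0))
```

Writing the lower-order forms as thin wrappers around the Pareto 4 keeps the restrictions explicit: each special case fixes exactly one parameter of the more general form.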
Several less commonly used forms are also worth mentioning. Benini (1905) proposed 2- and 3-parameter generalizations of the Pareto akin to the lognormal and log-Weibull, with a subexponential rate of decay (Kleiber 2013). Stoppa's (1990ab) two generalized Pareto 2 distributions, obtained by exponentiating the Pareto distribution, relied on up to 4 parameters. Generalized Pareto 3 (Arnold and Laguna 1977; reviewed by Cowell 2011:ch. A.3.3) has links to other Pareto distributions as well as the exponential, Fisk's sech-squared, logistic, reverse Pareto, and Weibull (for observations with a bounded support) distribution functions as special cases. Beirlant et al. (2009), and Papastathopoulos and Tawn (2013) proposed several re-parameterized variants of 3-parameter extended generalized Pareto distributions providing better fit to heavy upper tails. These were applied to incomes by Charpentier and Flachaire (2019).
A number of distribution functions have top tails decaying at a similar rate as the Pareto, such as the closely related Champernowne (1937, 1952, 1953, 1973) family of distributions. Colombi (1990) proposed a Pareto-lognormal distribution as an alternative distribution for fitting incomes beside the commonly used Singh-Maddala and Dagum models (described under generalized beta distributions below). Johnson et al. (1994) evaluate the similarly behaving generalized exponential, logistic and Gompertz distributions. Zandonatti (2001) proposed a generalized Pareto 1 (aka, generalized Fisk) distribution, which also serves as the generalization of the Lomax (aka, shifted Pareto) distribution.³ Lomax is a special case of the generalized Pareto with one parameter being fixed. Hamed et al. (2018) also introduced a class of transformed T-Pareto distributions, with Pareto distributions and those approximating them as special cases.

(Generalized) Gamma family
The generalized gamma family of distributions exhibits exponential decay that is faster than the power-function decay of the Pareto. The family encompasses all gamma distributions (Stacy 1962; Salem and Mount 1974), as well as the chi-square, exponential, lognormal, Rayleigh, and Weibull (Bartels and van Metelen 1975) distributions as special cases. The exponential distribution is a special case (shape parameter = 1) of the Weibull distribution. Amoroso (1925) proposed a 3-4 parameter generalized gamma distribution with gamma as a special case (Lee 1984; Esteban 1986).
³ Simon (1955) has referred to the class of Pareto-Yule-Zipf distributions as a class of functional forms with similar properties and rates of decay that enable them to serve as approximations for one another. Pareto distributions also belong in the class of Pareto-Lévy distributions, which satisfy the weak Pareto law on the rate of decay of top incomes, and have the property of 'stable distributions', meaning that if two income sources follow the same law (up to location and scale parameters), their sum will too (Mandelbrot 1960, 1961; Zolotarev 1986; Dagsvik et al. 2013).
As with the Pareto family, a number of distribution functions have been proposed that approximate the behavior of the gamma distributions. Gibrat (1931) proposed a 2-parameter lognormal (aka, Galton, McAlister, Cobb-Douglas, or Gibrat) distribution as providing a superior fit to high incomes that fall short of the topmost tail (see also Kalecki 1945; Aitchison and Brown 1954), and Metcalf (1969) enhanced this distribution with a third, shifting parameter. The Johnson SB distribution function (aka, 4-parameter lognormal), appropriate for outcomes with spread-out lower incomes and thin (platykurtic) tails, requires minimum and maximum limits on income (Johnson 1949). Dubey (1970) proposed a compound-gamma generalization that can be reparameterized as the beta 1 and 2 distributions described below and the F distribution. Benktander (1970) proposed two distributions, one approximating the lognormal (aka, Benktander 1), and one approximating the Weibull distribution (aka, Benktander 2) (Kleiber and Kotz 2003:247-252).
Finally worth noting here, Marshall and Olkin (1997) proposed a method for expanding the lognormal and Weibull distributions by another parameter with some useful properties.
These various distributions differ in their rate of decay, with lognormal and Weibull distributions decaying more gradually than gamma or exponential, a desirable property for modeling incomes. Lee (1984) reviews the relations among the lognormal, exponential, gamma and Weibull distributions. Chakraborti and Patriarca (2008) relate the exponential and gamma.

(Generalized) Beta family
The beta family of distributions has been used for incomes with bounded support, including a maximum limit on income (such as on an interval 0 to 1; Thurow 1970). Beta 1 and 2 distributions embrace as special cases the chi-squared, exponential, F, gamma, Lomax, power, uniform, and unit gamma distributions (Tadikamalla 1980; McDonald and Richards 1987; Johnson et al. 1994; McDonald and Xu 1995). Moreover, both the beta and gamma families can be said to belong to the generalized beta (henceforth GB) family of distribution functions.⁴ The GB distribution allows for up to 5 parameters, with GB1 and GB2 providing the most commonly used 4-parameter specifications.⁵ Their flexibility allows for thin or thick upper tails.
Following the exposition by McDonald and Butler (1990), most recent reviews of distributions for top-income modeling use the flexible 5-parameter GB distribution as their starting point (McDonald and Xu 1995). Figure A.1 in Cowell (2011) starts with GB2 and has the Singh-Maddala, a 3-parameter special case, at its center. The 5-parameter GB distribution has a probability density function
$$\mathrm{GB}(y; a, b, c, p, q) = \frac{|a|\, y^{ap-1} \left[1 - (1-c)(y/b)^{a}\right]^{q-1}}{b^{ap}\, B(p,q) \left[1 + c\,(y/b)^{a}\right]^{p+q}}, \qquad 0 < y^{a} < \frac{b^{a}}{1-c},$$
where $a$ is a 'peakedness of the density' parameter, $b$ and $c$ are 'scale' parameters, $p$ and $q$ are 'shape and skewness' parameters, and $B(p, q)$ is the beta function (McDonald and Xu 1995). The GB2 distribution results from setting c=1; the Dagum distribution results from setting c=1 and q=1; the Singh-Maddala distribution results from setting c=1 and p=1; and the generalized gamma is the limit of the GB as $q \to \infty$ when $b = \beta q^{1/a}$, c=1 (McDonald 1984).
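The special cases listed above can be checked numerically. The sketch below assumes the McDonald and Xu (1995) form of the GB density, |a| y^(ap-1) [1 - (1-c)(y/b)^a]^(q-1) / (b^(ap) B(p,q) [1 + c (y/b)^a]^(p+q)), restricted to a > 0, and compares it with the textbook Singh-Maddala and Dagum densities. The function names are hypothetical and only the Python standard library is used.

```python
import math


def beta_fn(p, q):
    """Beta function B(p, q), computed through log-gamma for stability."""
    return math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))


def gb_pdf(y, a, b, c, p, q):
    """Density of the 5-parameter GB, assuming a > 0 and 0 <= c <= 1."""
    if y <= 0.0:
        return 0.0
    t = (y / b) ** a
    if (1.0 - c) * t >= 1.0:  # outside the support when c < 1
        return 0.0
    num = a * y ** (a * p - 1.0) * (1.0 - (1.0 - c) * t) ** (q - 1.0)
    den = b ** (a * p) * beta_fn(p, q) * (1.0 + c * t) ** (p + q)
    return num / den


def singh_maddala_pdf(y, a, b, q):
    """Singh-Maddala density: a q y^(a-1) / (b^a [1 + (y/b)^a]^(q+1))."""
    return a * q * y ** (a - 1.0) / (b ** a * (1.0 + (y / b) ** a) ** (q + 1.0))


def dagum_pdf(y, a, b, p):
    """Dagum density: a p y^(ap-1) / (b^(ap) [1 + (y/b)^a]^(p+1))."""
    return a * p * y ** (a * p - 1.0) / (b ** (a * p) * (1.0 + (y / b) ** a) ** (p + 1.0))


# Evaluate the nested forms at an arbitrary (hypothetical) point and parameters.
y, a, b = 25000.0, 2.5, 30000.0
print(gb_pdf(y, a, b, 1.0, 1.0, 1.8), singh_maddala_pdf(y, a, b, 1.8))
print(gb_pdf(y, a, b, 1.0, 0.7, 1.0), dagum_pdf(y, a, b, 0.7))
```

Each printed pair should agree to floating-point precision, since B(1, q) = 1/q and B(p, 1) = 1/p reduce the GB2 normalizing constant to the Singh-Maddala and Dagum constants.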
GB nests all the distributions in the beta and gamma families, as well as the Cauchy, chi-square, exponential, F, Fisk's sech-squared, lognormal, Lomax, power, uniform and Weibull distributions as special or limiting parametric cases (McDonald 1984; Patil et al. 1984; Cummins et al. 1990; Johnson et al. 1994; McDonald and Xu 1995). Among the 3-parameter special cases of the GB2, the Singh-Maddala (aka, Burr, Burr 12, beta-P, Pareto IV, or generalized log-logistic distribution) and Dagum (aka, Burr 3, inverse Burr, 3-parameter kappa, beta-K) distributions have been adopted most widely. The Singh-Maddala distribution can be said to be a generalization of the Pareto, Weibull and sech-squared distributions, and behaves as Pareto among the highest incomes (Schluter and Trede 2002). The Dagum distribution itself has several commonly-used variants. Dagum 1 is the standard 3-parameter type that was promoted by Dagum (1999). Generalized 4-parameter Dagum 2 (aka, generalized logistic-Burr; evaluated by Jenkins and Jäntti 2005) and Dagum 3 distributions were also promoted by Dagum (1980, 1983).
As with the Pareto and gamma distributions, the family of GB distribution functions shares properties and even overlaps with other families. For instance, if a variable follows the power function distribution, then its inverse is distributed as Pareto. The Lomax (1954) distribution is a 2-parameter distribution that is a special case of the generalized Pareto distribution (special parameterization; aka, Pareto 2), the beta-prime distribution (scale parameter 1), the F distribution (shape parameter 1 and scale parameter 1), the log-logistic (special parameterization; shape parameter 1), the q-exponential distribution (special parameterization), or a mixture of exponential distributions (special parameterization) (Kleiber and Kotz 2003; Van Hauwermeiren et al. 2012).
The Hall (1982) class of distributions is also said to include the Singh-Maddala, Student's t, Fréchet (for unbounded variables with a lower limit and a heavy tail) and Cauchy distributions, or a mixture of two strict Pareto 1 distributions.

Exponential Generalized Beta family
A related family of distributions derived from the GB family is the exponential generalized beta (EGB) distribution. If a random variable Y is distributed as GB, then log(Y) is said to be distributed as EGB.
For income distributions with extremely heavy tails, EGB can provide a viable way of parametric modeling in these tails. The EGB family covers the generalized forms of the exponential (EGB1), logistic (EGB2), Gompertz and Gumbel distributions; the 'exponential' versions of the Singh-Maddala, Fisk, power and Weibull distributions; and the standard versions of the Burr 2 (aka, generalized logistic, log-Burr 3), exponential, logistic and normal distributions as special cases (Ragab et al. 1991; Johnson et al. 1994; McDonald and Xu 1995). Generalized [...]

Other distributions
Besides the above oft-used families of distributions and their reparameterizations, various other models have been considered to represent the top income distribution, although they have not been adopted widely or have not been retained in recent literature. Among these, we could mention D'Addario's (1949) generating system of income distributions that includes the Amoroso generalized gamma and the Davis (1941ab) distributions (including the Vinci distribution) as special cases (Kleiber and Kotz 2003:238-242; Dagum 1990a, 1996, 2012). The Topp-Leone (1955) class of distributions has been adopted by Van Dorp and Kotz (2006) in proposing a 2-parameter generalization for fitting lower-range incomes. Also, Barndorff-Nielsen (1978) proposes a class of generalized hyperbolic distributions for fitting log incomes.

III. Empirical properties
Some distributions fit better at the top tail than above the middle of the income distribution; some distributions are more robust to data irregularities and can provide better out-of-sample fit.
Parametric selection can be done through empirical testing, or using prior evidence or conceptual fit of each distribution. One important consideration is the nature of the data at hand. The incidence of income measurement errors (particularly at the uppermost tail), the presence of negative or zero incomes, and whether income data are available for individual units or as population densities in various income brackets (unit-record vs. grouped-records data) all affect the estimation. Alternative distribution functions differ in how easily their parameters can be estimated on various income distributions, and in the stability and precision of the estimated parameters (Metcalf 1972:ch. 2; Merkies 1987).
In prior decades, another important property of distribution functions was the ease of their estimation (Campano and Salvatore 2006). Each additional estimable parameter represented a challenge for estimation.
A "largely settled" modern consensus is that the vast bulk of incomes are best represented by lognormal, Singh-Maddala, Dagum, or GB2 distributions (Jäntti et al. 2015:310), while the uppermost 0.1-10 percent are distributed according to the extreme-value or Pareto distributions.
This has to do with a "transition that occurs in the shape of an empirical income distribution from the middle, which decays exponentially, to the upper tail, which decays with respect to power" (Nirei and Souma 2007:441). One advantage of relying on distributions in the GB family is that they have known properties, and inequality indexes under these distributions can be derived analytically. For instance, the Gini coefficient can be computed using the estimated coefficients for all GB class distributions, except for the 5-parameter GB itself (Dagum 1977; McDonald 1984; Bandourian et al. 2003).
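As an illustration of how inequality indexes follow analytically from estimated parameters, the sketch below evaluates the standard closed-form Gini coefficients of the Singh-Maddala and Dagum distributions, which depend only on the shape parameters and not on the scale. The function names are hypothetical, and the formulas are the textbook expressions rather than anything specific to the studies cited above.

```python
import math


def gini_singh_maddala(a, q):
    """Gini = 1 - G(q) G(2q - 1/a) / (G(q - 1/a) G(2q)), G = gamma function.

    Requires q > 1/a so that the mean exists.
    """
    if q <= 1.0 / a:
        raise ValueError("the mean does not exist unless q > 1/a")
    lg = math.lgamma
    return 1.0 - math.exp(lg(q) + lg(2.0 * q - 1.0 / a) - lg(q - 1.0 / a) - lg(2.0 * q))


def gini_dagum(a, p):
    """Gini = G(p) G(2p + 1/a) / (G(2p) G(p + 1/a)) - 1, G = gamma function.

    Requires a > 1 so that the mean exists.
    """
    if a <= 1.0:
        raise ValueError("the mean does not exist unless a > 1")
    lg = math.lgamma
    return math.exp(lg(p) + lg(2.0 * p + 1.0 / a) - lg(2.0 * p) - lg(p + 1.0 / a)) - 1.0


# With q = 1 or p = 1 both models reduce to the Fisk (log-logistic), whose Gini is 1/a.
print(gini_singh_maddala(a=2.0, q=1.0), gini_dagum(a=2.0, p=1.0))
```

Working with log-gamma values avoids overflow when the estimated shape parameters are large.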
This section reviews existing evidence of the empirical fit of alternative parametric models, and draws lessons for practitioners' future work. One aim is to show how the modeling literature has evolved from using simple distribution functions to more complex ones, how some classes of distributions have been sidelined in favor of better performing ones, and how the nature of data at hand affects the modeling choice.
Pareto 1
Pareto (1896) concluded that his one-parameter model provided a good fit for historical incomes in various places including England, Italian and Peruvian cities, German states, and Paris.
On the general nature of the Pareto coefficient, Davis (1941b) noted that the coefficient has been estimated at similar levels for years as far apart as 1471 and 1894. "There is no reason to believe that a significant change has occurred in [the Pareto] parameter during the depression or afterwards, although there has been a tendency for it to increase as the tax burden has grown since 1933" (Davis 1941b:31).
More recent literature has restricted the use of the Pareto to a narrower and narrower range of top incomes (Jenkins 2017; Charpentier and Flachaire 2019) and considered other parametric functions for lower incomes. Aitchison and Brown (1954, 1957) found that for lower earnings in homogeneous occupational groups (1950 UK data for nine agricultural occupations) a lognormal distribution provides a good fit, while for higher earnings a power law distribution, such as Pareto, fits better. Clementi and Gallegati (2005) [...]
As a final note, a number of studies take advantage of the power-law property, also known as van der Wijk's law, that the mean of a Pareto-distributed variable above a threshold $y_0$ can be approximated as $\bar{y} = E(y \mid y \ge y_0) = y_0\, \alpha / (\alpha - 1)$, where $\alpha$ is the Pareto coefficient. This property makes it easy to approximate the mass of top incomes above a specific threshold. Bernstein and Mishel (1997) offered an early application. This is followed in recent studies by the World Inequality Lab, which have adopted the Pareto approximation and van der Wijk's law extensively (https://wid.world/world-inequality-lab/; Piketty and Saez 2003; Blanchet et al. 2017, 2018b). We could also point out the related Fisher-Tippett-Gnedenko theorem, or the first EV theorem (Fisher 1930), pertaining to the distribution of extreme order statistics. To the extent that the distribution converges, it is said to converge to only one of 3 possible models, all in the class of generalized EV distributions: the Gumbel, the Fréchet, or the Weibull.

Lognormal
Aitchison and Brown (1954, 1957) promoted the lognormal distribution as offering good fit to a wide range of incomes. Incomes between the 10th and the 80th percentiles (Fournier 2015), or even between the 5th and the 95th percentiles (Montroll 1978) were found to follow the lognormal distribution well. Metcalf (1969, 1972) found that the lognormal distribution overcorrects for the positive skew of the distribution of US-CPS incomes in 1949-1965. He proposed a generalized 3-parameter shifted version (i.e., displaced lognormal) that fits better than the lognormal and the exponential distributions, particularly lower down in the distribution. Metcalf (1972) described a method for estimating the displaced lognormal distribution by anchoring it at three symmetrically-spaced quantiles in the empirical data. Hart (1983) estimated the moments of the bivariate lognormal distribution for British 1971-1978 male earnings accounting for household size and composition.
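The idea of anchoring a displaced lognormal at three symmetrically spaced quantiles can be illustrated with a short sketch. It is not a reproduction of Metcalf's (1972) procedure; it assumes that log(y + c) is normally distributed, in which case the quantiles at p, 0.5 and 1 - p satisfy (q_low + c)(q_high + c) = (q_mid + c)^2 and the displacement c has a closed form. The function and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def fit_displaced_lognormal(q_low, q_mid, q_high, p):
    """Fit log(y + c) ~ N(mu, sigma^2) from the quantiles at p, 0.5 and 1 - p."""
    denom = q_low + q_high - 2.0 * q_mid
    if denom <= 0.0:
        raise ValueError("the quantiles do not imply positive skew")
    # Solve (q_low + c)(q_high + c) = (q_mid + c)^2 for the displacement c.
    c = (q_mid ** 2 - q_low * q_high) / denom
    mu = math.log(q_mid + c)
    z = NormalDist().inv_cdf(1.0 - p)  # standard normal quantile at 1 - p
    sigma = (math.log(q_high + c) - mu) / z
    return c, mu, sigma


# Usage: build exact quantiles from known (hypothetical) parameters and recover them.
true_c, true_mu, true_sigma, p = 500.0, 10.0, 0.6, 0.10
z = NormalDist().inv_cdf(1.0 - p)
q_low = math.exp(true_mu - true_sigma * z) - true_c
q_mid = math.exp(true_mu) - true_c
q_high = math.exp(true_mu + true_sigma * z) - true_c
c_hat, mu_hat, sigma_hat = fit_displaced_lognormal(q_low, q_mid, q_high, p)
print(c_hat, mu_hat, sigma_hat)
```

With empirical rather than exact quantiles the recovered parameters are only approximate, and the choice of p trades off sensitivity to the tails against sampling noise.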
While Gibrat's and Metcalf's 2- and 3-parameter lognormal distributions have been shown to provide inferior fit at the very top, they continue to be used for incomes lower down in the distribution. In rural China, the lognormal distribution provided 'perfect' fit to income data (Kmietowicz and Ding 1993). Harrison (1981) [...]
Chung and Cox (1994) modeled the dispersion of superstar performers' earnings using the Yule distribution, and found that it fits very well. The Fisk distribution was successfully fitted to incomes in homogeneous occupations (Fisk 1961), in 17 Peruvian cities during 1971-1972 (Arnold and Laguna 1977), and to the 1972 gross weekly earnings for full-time male workers in seven occupational groups in the UK, where it outperformed the lognormal or the Pareto distributions (Harrison 1979, 1981).

Estimation specifications: Lower and upper cutoff points for Pareto models
The discussion in the preceding paragraphs suggests that the choice of functional form and the quality of fit of Pareto models depend on the income bracket evaluated. Various lower cutoff points have been used in existing studies, from the 80th (Piketty et al. 2017), 90th (Chancel and Piketty 2017), and 99th (Burkhauser et al. 2016) to as high as the 99.5th percentile of survey observations (Jenkins 2017). Coles (2001), and Gilleland and Katz (2005) explore the determination of the lower cutoff for the Pareto distribution. Cowell and Flachaire (2007), and Davidson and Flachaire (2007) choose the 90th percentile as the lower cutoff point based on the graphical Hill-plot approach (Drees et al. 2000).
The Hill plot is a line showing the Pareto 1 coefficients estimated with various lower cutoff points. Under the Pareto law, the line should be horizontal. Hence, departures from that shape are an indication that the original data are not Pareto-distributed, and 'Hill's horror plot' is an indication of a specific systematic type of deviation.
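The points of a Hill plot can be computed with a few lines of code. The sketch below uses the standard Hill estimator of the Pareto coefficient on the k largest observations, taking the (k+1)-th largest value as the lower cutoff, and applies it to simulated data that are exactly Pareto 1, so the estimates should be roughly flat across cutoffs. It also evaluates the tail mean y0 * alpha / (alpha - 1) implied by a given coefficient. The function names and the simulated sample are assumptions for illustration.

```python
import math
import random


def hill_alpha(incomes, k):
    """Hill estimate of the Pareto coefficient from the k largest observations."""
    xs = sorted(incomes, reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < number of observations")
    threshold = xs[k]  # the (k+1)-th largest value serves as the lower cutoff
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def hill_plot_points(incomes, ks):
    """(k, estimated coefficient) pairs, i.e. the points of a Hill plot."""
    return [(k, hill_alpha(incomes, k)) for k in ks]


def pareto_tail_mean(y0, alpha):
    """Mean income above y0 under a Pareto 1 tail (van der Wijk's law), alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("the tail mean is finite only for alpha > 1")
    return y0 * alpha / (alpha - 1.0)


# Simulated (hypothetical) incomes: Pareto 1 with coefficient 2 and lower bound 1.
random.seed(1)
sample = [random.paretovariate(2.0) for _ in range(5000)]
points = hill_plot_points(sample, ks=[500, 1000, 2000, 4000])
for k, alpha_hat in points:
    print(k, round(alpha_hat, 3))
```

On real survey data the estimates typically drift as k grows and the cutoff moves into the body of the distribution; the range of k over which the line is approximately horizontal is what informs the choice of the lower cutoff.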
Hlásny and Verme (2018abc) compare different combinations of lower and upper truncation points in survey data, and argue for a choice that maintains a large sample size for fitting and at the same time avoids using contaminated uppermost values for estimation. Jenkins (2017) evaluates various left-truncation points for the Pareto 1 and 2 distributions analytically, using alternative diagnostic tests. Using British administrative data, he identifies the optimal cutoff point at between the 90th and the 99th income percentile in various years, and typically at the 95th or higher percentile. Charpentier and Flachaire (2019) put the left cutoff point for the Pareto 1 model even higher, at the 99th or higher percentile.

(Generalized) Gamma
The gamma distribution was promoted as recently as the 1970s as offering good fit among 2-parameter models, and was evaluated positively on incomes in the US (Salem and Mount 1974) and the Netherlands (Bartels and van Metelen 1975). Salem and Mount (1974) showed that the gamma fitted better than the lognormal for the middle-income range of US family incomes during 1960-1969. In earlier decades, the 3-4 parameter generalized gamma distributions by Amoroso (1925) were successfully used for German and Italian incomes, even though the fit was weaker for US incomes (Davis 1941a:405-406).
More recent evidence suggests that distributions with more parameters, or other functions relying on the same number of parameters, provide a closer fit to income data. Kloek and Van Dijk (1978) found that the lognormal and gamma functions showed poor fit to Dutch earnings, instead favoring the log-t and the log Pearson 4 models, and in some applications also a 3-parameter generalized gamma and a 4-parameter Champernowne function. Hippel et al. (2012) found that the extended generalized gamma, power normal and newly-proposed power logistic distributions compared favorably to Dagum, GB2 and logspline distributions on grouped-records household or family data in the US. Most recently, however, studies have used functions in other distribution families, including the 2-parameter Pareto 2, or the 3-5 parameter generalized beta functions.

(Generalized) Beta
Beta distributions have been at the forefront of empirical modeling of the overall income distributions. A large body of research has focused on comparing individual members of the beta family. Among the early research, a 3-parameter beta distribution was successfully fitted to the 1949-1966 and 1952-1980 US income data (Thurow 1970; Slottje 1984).
Recent decades have seen a quest for representations with more parameters that offer superior flexibility. One conceptual argument for using distribution functions with 3+ parameters is that the Lorenz (1905) functions of any two such distributions can show any relationship with one another.
By contrast, Lorenz curves of two distributions from the same 1-2 parameter family (such as F, lognormal, Lomax, Pareto, Weibull etc.) cannot intersect. To allow two estimated distributions to have intersecting Lorenz curves, as is common with real-world distributions, distribution functions with 3+ parameters should be applied.
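The non-crossing property can be seen directly in the Pareto 1 case, where the Lorenz curve has the closed form L(p) = 1 - (1 - p)^(1 - 1/alpha) and the scale parameter drops out, so that a single shape parameter orders all curves in the family. The sketch below, with hypothetical function names, compares two such curves on a grid.

```python
def pareto_lorenz(p, alpha):
    """Lorenz curve of a Pareto 1 with alpha > 1: L(p) = 1 - (1 - p)^(1 - 1/alpha)."""
    if alpha <= 1.0:
        raise ValueError("the Lorenz curve requires a finite mean, i.e. alpha > 1")
    return 1.0 - (1.0 - p) ** (1.0 - 1.0 / alpha)


def pareto_gini(alpha):
    """Gini coefficient of a Pareto 1 with alpha > 1: 1 / (2 alpha - 1)."""
    if alpha <= 1.0:
        raise ValueError("the Gini coefficient requires alpha > 1")
    return 1.0 / (2.0 * alpha - 1.0)


# Compare a heavier-tailed (alpha = 1.5) and a lighter-tailed (alpha = 3) Pareto.
grid = [i / 100.0 for i in range(1, 100)]
never_cross = all(pareto_lorenz(p, 1.5) <= pareto_lorenz(p, 3.0) for p in grid)
print(never_cross, pareto_gini(1.5), pareto_gini(3.0))
```

The curve for the smaller coefficient lies below the other at every grid point, which is why two fitted Pareto 1 models can never reproduce a pair of empirical Lorenz curves that cross.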
Higher-parametric distributions also appear to offer better fit at the top. Majumder and Chakravarty (1990) proposed a four-parameter generalization of the Dagum and Singh-Maddala distributions that exhibited better fit for personal income than the Dagum, Singh-Maddala and Champernowne distributions. However, McDonald and Mantrala (1993, 1995) showed that [...] hypergeometric' distribution to household liquid assets in the US, and found that the added parameters were significant and useful, relative to the more restrictive GB and gamma models.
For 1970-1980 US family incomes, McDonald (1984) preferred the GB1, GB2, Singh-Maddala and generalized gamma distributions to the gamma and Weibull distributions, which themselves outperformed the Fisk distribution. For grouped-records Japanese income data for 1963-1971, Singh-Maddala slightly outperformed the Fisk, which outperformed beta, gamma, lognormal and Pareto 2 (Suruga 1982). Atoda et al. (1988) and Tachibanaki et al. (1997) found that more-parametric models were needed for Japanese incomes. Chotikapanich et al. (2007) applied the GB2 successfully to grouped-records data in China. Bandourian et al. (2003) found that the Weibull, Dagum and GB2 distributions provide the best fit for grouped-records pre-tax incomes among ten GB-class distributions with 2, 3, and 4 parameters, respectively (compared to gamma, lognormal, beta 1, beta 2, generalized gamma, Singh-Maddala, GB1, GB). They used 23 high and upper-middle income countries, and income data grouped into twenty 5-percentage-point groups. McDonald et al. (2013) found that the GB2 approximates well the skewness and kurtosis in a set of 78 income distributions, compared to the generalized gamma, Dagum, Singh-Maddala, beta 1 and 2, and GB1 distributions. Okamoto (2013) evaluated a newly proposed k-generalized beta distribution favorably compared to GB2, on 60 income distributions in the Luxembourg Income Study database. Brzezinski (2013) found that the GB2 model outperforms the Dagum and Singh-Maddala models on most distributions of disposable household income per adult equivalent in Central and East Europe during 1991-2010, with a few exceptions where Dagum and the GB2 perform similarly. The GB and GB2 models have been found to provide the best fit for US family income (grouped-records for 11 or 21 quantiles), while the lognormal distribution provides the worst fit (the other evaluated distributions being GB1, gamma, generalized gamma, beta, beta 1, beta 2, Singh-Maddala, and Dagum). The GB2, Dagum, and gamma are the best-fitting four-, three-, and two-parameter distributions.
The conventional wisdom from recent research is that the GB2 distribution can be used among middle and upper-middle incomes, but in the upper-most tail one might prefer a power-law distribution such as the Pareto 1 instead. Because the right tail of incomes may also be contaminated by nonresponse or income under-reporting, right-truncation of the GB2 distribution may be applied at the 95th-99th percentile (Hlásny and Verme 2018ab; Hlásny 2019). Finally, because bottom incomes may not be approximated well by the GB2 distribution, Jenkins et al. (2011:69) propose estimating the GB2 distribution on income data left-truncated at the 30th percentile.

Singh-Maddala
In response to Salem and Mount's (1974) [...]
Brachmann et al. (1996) concluded that the GB1 and B1 distributions did not provide a good fit to German net household incomes, while the Singh-Maddala and gamma fared well among the classes of 3- and 2-parameter distributions.
Comparing the lognormal, gamma, log-logistic, Weibull, generalized gamma and Singh-Maddala distributions, Tachibanaki et al. (1997) concluded that the generalized gamma fit better than the lognormal, the gamma, or the Weibull. The Singh-Maddala outperformed the Weibull and the log-logistic. Different estimations yielded vastly different parameters and inequality indexes, on account of 1) different estimators, and 2) data differences such as individual/grouped-records data. Tachibanaki et al. (1997) also compared income distributions estimated using individual-level data (maximum likelihood method) versus grouped-records data (minimum chi-square or maximum likelihood method). They found that grouped-records data allowed estimating a well-fitting distribution even when the values of individual incomes did not allow the estimation on individual-level data (i.e., the beta and the Johnson SB distributions). Biewen and Jenkins (2005) evaluated the fit of the Singh-Maddala and Dagum distributions for income distributions in the US, Great Britain and Germany, and concluded that the Singh-Maddala distribution performed better. Similarly, Jäntti et al. (2015) found that the Singh-Maddala distribution offered a good fit to the distributions of positive incomes in the US, Germany, Italy, Luxembourg and Spain.

Dagum and log-logistic models
The natural alternative to the Singh-Maddala model is the other 3-parameter Burr-family specification, the Dagum function. As the studies in the previous and this section show, the evidence of their relative performance is mixed. Dagum (1983) found that his distributions (Dagum 1, 2 and 3) outperformed the lognormal, gamma and Singh-Maddala models on US family incomes. Using the UK data, the Dagum 1 was also shown to outperform the gamma distribution.⁷ Dagum also found that a 2-parameter log-logistic distribution, as a special case of the Dagum model, fit Canadian data well. Suruga (1982) and Atoda et al. (1988) also evaluated the log-logistic model positively for Japanese incomes, relative to the lognormal, gamma, and beta distributions, but found it to be comparable to the 3-parameter Singh-Maddala distribution.
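For reference, the nesting among these Burr-family forms can be written out explicitly. The notation below (shape parameters a, p, q and scale b, all positive) follows a common convention and is an assumption here, not necessarily the notation used in the studies cited above.

```latex
% Cumulative distribution functions for x > 0
\begin{align*}
F_{\mathrm{SM}}(x)   &= 1 - \left[1 + (x/b)^{a}\right]^{-q}
    && \text{Singh-Maddala (Burr XII)} \\
F_{\mathrm{D}}(x)    &= \left[1 + (x/b)^{-a}\right]^{-p}
    && \text{Dagum (Burr III)} \\
F_{\mathrm{Fisk}}(x) &= \left[1 + (x/b)^{-a}\right]^{-1}
    && \text{log-logistic (Fisk)}
\end{align*}
```

Setting q = 1 in the Singh-Maddala form, or p = 1 in the Dagum form, yields the same 2-parameter log-logistic function, which is why the log-logistic can be benchmarked against either 3-parameter model as a restricted special case.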
Using French incomes, Espinguet and Terraza (1983) confirmed that the Dagum 2 distribution provided a superior fit compared to the Weibull, the Singh-Maddala, the Box-Cox-transformed logistic, the 3-parameter lognormal, and the 4-parameter beta 1 models. Dagum 1-3 showed good fit to Italian incomes (Dagum and Lemmi 1989), and incomes in the Buenos Aires region (Botargues and Petrecolla 1997, 1999ab), where they outperformed the lognormal and the Singh-Maddala. Bantilan et al. (1995) found that the Dagum 1 provided a good fit for family incomes in the Philippines. Campano and Salvatore (2006) reported that the United Nations had concluded that the 4-parameter Dagum distribution "was the model of choice for over 60 countries" (p. 51).
The Dagum fit as well as or better than the 5-parameter Champernowne distribution (Campano 1987). In fact, estimating "one serves as a check on [estimating] the other" (Campano and Salvatore 2006:53). Jäntti et al. (2015) also favored the Dagum distribution for positive incomes (and the exponential distribution for negatives).
In relation to other GB-family distributions, Bordley et al. (1996) concluded that the Dagum 1 worked better than other 3- and 4-parameter models, including the gamma and the GB1, but the GB2 remained the most preferred form. Kleiber (1996, 2008:110) listed all the known empirical applications of the Dagum distribution, and summarized them by noting that the Dagum outperformed other models with 3 or fewer parameters, but concurred that the 4-parameter GB2 could perform better.

Beyond beta: other distribution functions
A small number of studies have assessed the empirical fit of models outside the GB family, but their results are difficult to compare because of the lack of consistent benchmark specifications and test statistics. The limited evidence is as follows. Horsky (1990) used a 2-parameter EV distribution (Gumbel distribution in the EGB class) to model wages and total incomes in the US Census data.
The model fit wages better than the lognormal distribution, and as well as the best of the evaluated 3- and 4-parameter models. Chumacero and Paredes (2005) found that the EV distribution outperformed the generalized Cauchy, logistic and other functional forms in fitting the Chilean income distribution, with the lognormal distribution also having desirable properties.
Van Dorp and Kotz (2006; also Kotz and van Dorp 2004) applied the 'reflected' generalized Topp-Leone distribution to 2001 US household incomes, and found that it performed better at lower ranges of incomes than the beta family of distributions. Atoda et al. (1988) found that a 4-parameter J-shaped beta distribution, related to the Topp-Leone distributions, and a 2-parameter Weibull distribution provided a good fit to narrow income categories such as farmers' primary incomes. The Weibull performed as well as the 4-parameter Singh-Maddala. Barndorff-Nielsen (1978) successfully fitted the generalized hyperbolic distribution to the 1962/1963 personal log-income data for Australian residents.

IV. Summary and implications for practitioners
Fitting smooth parametric curves to the top of empirical income distributions is valuable for describing the general shape of the income density functions, as well as for correcting income distributions for various top-income measurement problems. Thanks to the extensive research undertaken in the past quarter-century, we now have access to a large body of empirical evidence and tools for estimating and diagnosing alternative models, and for replacing actual top incomes with model-generated ones. Alternative estimators have been developed, and a number of statistics exist to inform our model and estimator selection. This body of knowledge helps us understand the nature of the process underlying the distribution of incomes, and to some degree even the mechanisms driving the process. To the extent that observed incomes suffer from various top-tail measurement problems, the existing knowledge enables us to overcome them.
The proliferation of distinct branches of modeling literature over the past decades gave rise to the need to survey the alternative modeling approaches and be able to discriminate among them based on their fit to empirical data.To this end, we have reviewed the state of methodological and empirical knowledge regarding the adoptable distribution functions, and provided references to original sources and statistical programs (refer to Appendix 1) allowing practitioners to apply these parametric functions to microdata in household income surveys, administrative registers, or grouped-records data generated from national accounts statistics.
Many of the distribution functions reviewed in this study have theoretical or empirical merit.
With arbitrary income data at hand, it is a priori unclear which distribution function will provide the best fit for the incomes used in estimation, as well as for those outside of the estimation sample.
A typical confirmatory research path is to choose a set of theoretically or empirically comparable parametric forms, fit each of them, and then evaluate their degree of fit. Esteban (1986) proposes an alternative search path: formulating a set of hypotheses, or stylized facts, about the empirical distribution, and then identifying the (only) parametric distribution that satisfies them. For instance, hypothesizing that the income share elasticities have a constant rate of decline (plus two other assumptions), he shows that the 3-parameter generalized gamma is the only density function within its class that satisfies them.
The alternative approaches to fitting each distribution function are beyond the scope of this survey, but a brief, non-exhaustive review is provided in Appendix 2. Similarly, some guidelines on how to diagnose the alternative distribution functions and rank them are provided in Appendix 3.
Our review suggests that among the large number of candidate families of distribution functions, only a handful have proved to be consistently successful in terms of feasibility of fitting and their fit to the original data, and have been used over the years and to this day. The choice depends in part on the range of incomes that needs to be represented parametrically. Across many countries and years, 3-4 parameter distributions in the generalized beta family (namely the Singh-Maddala, the Dagum and the GB2 distributions) have proved successful at representing the vast mass of income distributions, approximately from the 30th to the 95th percentile. The lognormal distribution, itself a 2-parameter member of the generalized beta family, offers a weaker fit for that range of incomes, but may also be estimated due to degrees-of-freedom considerations, say on grouped-records data. For top incomes, flexible power function distributions, including 2-3 parameter members of the EV and Pareto families, exhibit a better-fitting rate of decay of income densities. Finally, in the uppermost tail of the top 0.5% or up to top 5% of income observations, due to data sparseness and heavy rate of decay, the Pareto 1 serves as an adequate parsimonious model.
Model selection clearly also depends on the properties of the data at hand. Unit-records data call for functional forms and estimators that are less sensitive to individual income values.
Grouped-records data, on the other hand, call for functional forms and estimators that are less demanding with regard to the degrees of freedom, such as distributions with fewer parameters.
The narrower the income concept (say, wages or primary income), and the more homogeneous the sample (say, farmers, or workers in the same occupation, or residents in one city), the better the fit of simpler distribution functions such as the lognormal, the Weibull, or the Pareto 1 models.
Finally, to the extent that multiple types of income must be added together (say, income sources) or juxtaposed (say, those of two subgroups), parametric functions that are additive or that allow the Lorenz curves to cross (e.g., 3+ parametric functions), respectively, would facilitate the most relevant comparisons. In this respect it should be noted that there is presently limited record of stability of distribution functions across disparate components of total incomes.
To summarize the evidence presented in this review, a large number of distribution functions have been considered over the past 120 years for modeling economic outcomes. We have introduced several classes of these models, including their theoretical properties and empirical record of fit. Among the dozens of models, only a handful have proved to be consistently [...]

pnorm: probability plot comparing a distribution, particularly its center, to a reference normal distribution.
extreme plot, pp qq return density: PP & QQ plots following extreme estimation.
hillp (Scotto 2001): creates the Hill plot, with estimates of the Pareto 1 coefficient on the vertical axis, and the share of top income observations on which the coefficient was estimated on the horizontal axis (with confidence interval).
hangroot (Buis 2007): hanging rootogram comparing an empirical distribution to the best-fitting reference parametric distribution.
dpplot (Cox 2002): plots density probability plots for an empirical distribution given a reference distribution.
ggtax (Gonzalez Rangel 2013): creates a graph of the estimated generalized-gamma shape and scale parameters, against other generalized-gamma family models (including lognormal, exponential and Weibull).
glcurve (Jenkins and Van Kerm 2004): draw the Generalised Lorenz curve, and generate GL ordinates
lrtest GEV: likelihood ratio test of Gumbel vs. unrestricted extreme-value model following extreme estimation.
compsta (Brzezinski 2014c): test of the computational stability of estimation using perturbation tests.

Appendix 2. An illustration of alternative estimators for distribution functions
One major reason for considering the parametric modeling of the top tail of income distributions is that the empirically observed distributions suffer from various measurement problems or incomparability across samples or survey waves. The highest incomes are to a large extent excluded from household surveys, because the households are not identified on the sampling frame (and therefore are not selectable) or have a probability of being selected, but for one reason or another end up excluded. This could be simply due to a low probability of selection, accounting for their rareness and the inadequate oversampling of the relevant demographic group. Secondly, unit and item nonresponse tend to be high in this group, as is underreporting and post-survey topcoding by statistical agencies.
The parametric models are applied to imperfect, empirical income distributions, and the specific functional forms and parameter values must be estimated and assessed. To this end, a number of estimators have been proposed to fit the distribution functions presented in the main text. The aim of this short and non-exhaustive review is to make readers aware of the variety of estimation approaches, and provide some key references. Hill (1975) proposed a maximum-likelihood estimator of the Pareto 1 coefficient (see also Quandt 1966; Schluter and Trede 2002:155). For fixed k, this estimator H_{k,n} converges in distribution to a gamma distribution as n → ∞. Huisman et al. (2001) proposed a weighted version of Hill's estimator correcting it for its small sample bias. A robust estimator of the Pareto coefficient was proposed by Cowell and Victoria-Feser (1996ab, 2007). The ML 'optimal b-robust estimator' (ML-OBRE) places an upper limit on the score function to make the estimation robust to the presence of high values (Victoria-Feser and Ronchetti 1994; Doğru and Arslan 2015). Brzezinski (2013, 2016) and Jenkins (2017) review alternative estimators for the Pareto coefficient. Mahdi and Cenac (2006) compared three methods for estimating the logistic and Rayleigh distributions: the maximum likelihood, the moment and the probability weighted moment methods. Hippel et al. (2012) proposed a 'best-of-breed' model selection criterion, which chooses the best fitting model among the extended generalized gamma, power normal and power logistic families. McDonald and Ransom (1979ab, 2008) identified and evaluated three estimators for the estimation of GB and gamma models on grouped-records data: Pearson minimum chi-squared estimators, the asymptotically equivalent method-of-scoring maximum-likelihood estimators, and generalizations of estimators minimizing the sum of squared differences between the predicted and observed probabilities. On decile-grouped data, the scoring and minimum chi-squared estimators were found to perform similarly, and outperformed the least-squares estimators in terms of sample bias, variance, and mean squared error.
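As an illustration of the simplest of these estimators, the following is a minimal sketch of Hill's estimator of the Pareto 1 coefficient, computed from the k largest observations. The function name, the synthetic sample and the choice of k are assumptions made for this example only; no weighting or small-sample bias correction is applied.

```python
import math
import random


def hill_estimator(incomes, k):
    """Hill-type estimate of the Pareto 1 coefficient from the k largest incomes."""
    if not 1 <= k < len(incomes):
        raise ValueError("k must satisfy 1 <= k < len(incomes)")
    ordered = sorted(incomes, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value acts as the lower cutoff
    # Average log-ratio of the k top incomes to the cutoff (the statistic H_{k,n}).
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic check on an exact Pareto 1 sample (illustrative assumption).
rng = random.Random(0)
true_alpha = 2.0
sample = [rng.paretovariate(true_alpha) for _ in range(20000)]
alpha_hat = hill_estimator(sample, k=2000)
print(f"Hill estimate of the Pareto coefficient: {alpha_hat:.3f}")
```

On survey data the estimate typically varies with k, so in practice it is usually inspected over a range of cutoffs rather than reported for a single one.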

Appendix 3. Diagnostic tests
There are three types of diagnostic tests available to distinguish parametric forms for the modeling of the top tail of income distributions. One, standard confidence-interval based tests can help assess goodness of fit of stand-alone models or estimation approaches. These tests can also assess the relative fit of multiple models nested in one another, as well as the relative fit of models estimated on different subsamples. Two, graphical tests can help to evaluate visually any deviations in model fit from known yardsticks. Three, simulations on well understood (synthetic or uncontaminated) data can help evaluate the relative performance of alternative models or estimators, and their robustness to various experimental designs of data contamination. The following sections present some of the commonly used tests and their results.

Confidence-interval based tests
To test whether empirically observed data are consistent with a particular smooth parametric function, we must assess the deviations of the empirical data from the predicted pattern of dispersion. The available model selection criteria include sum of squared errors, sum of absolute errors, log-likelihood value, likelihood ratios, the Pearson chi-squared, Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling statistics, and others. These are briefly introduced below. The likelihood ratio statistic, equal to twice the difference between the unrestricted and restricted model log-likelihoods, is asymptotically distributed as chi-squared with degrees of freedom equal to the number of parameter restrictions.

The Pearson chi-squared statistic is

$$\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i},$$

where O and E are the observed and expected shares of observations in each quantile i=1…m, respectively. This statistic provides a rejection criterion in large samples, but results in false non-rejections in small samples, and is sensitive to the delineations of quantiles i (Stephens 1986). Model selection criteria such as the Akaike information criterion (AIC = 2k − 2 log(L), where k is the number of estimated parameters and L is the maximized likelihood) can also be used for pairs of nested models. The Kolmogorov-Smirnov statistic compares the observed cumulative distribution of all order statistics with that estimated under a parametric form, the Cramér-von Mises statistic evaluates the squared deviations between these cumulative distributions, and the Anderson-Darling statistic assigns greater weight to the tails.
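The criteria described above can be sketched in a few lines. The function names are hypothetical, the chi-squared statistic is computed on shares as defined in the text (applications often rescale it by the sample size), and the likelihood ratio p-value is shown only for a single parameter restriction, where the chi-squared tail probability has a closed form.

```python
import math


def pearson_chi2(observed_shares, expected_shares):
    """Sum over groups of (O_i - E_i)^2 / E_i, computed on shares."""
    return sum((o - e) ** 2 / e for o, e in zip(observed_shares, expected_shares))


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log(L)."""
    return 2 * n_params - 2 * log_likelihood


def lr_test_one_restriction(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic and its chi-squared(1) p-value."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    # For one degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Toy inputs chosen for illustration only.
chi2 = pearson_chi2([0.6, 0.4], [0.5, 0.5])
ks = ks_statistic([0.25, 0.75], lambda x: x)  # uniform(0, 1) reference CDF
stat, p_value = lr_test_one_restriction(-1000.0, -1001.92)
print(chi2, ks, aic(-1000.0, 3), stat, p_value)
```

A restriction from a 3-parameter model to a 2-parameter special case, such as from the Singh-Maddala to the log-logistic, is an example of the single-restriction case handled by the last function.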
Regression-based tests rely on coefficient standard errors (t and F tests) and measures of model fit (R-squared, AIC, BIC, Wald F, likelihood ratio, and Lagrange multiplier tests).⁸ When the parametric distributions are fitted by maximum likelihood, the required test statistics are available. In other cases, such as when parameters are on a boundary of the allowed parameter space, testing can be done using standard errors obtained by bootstrapping the estimation procedures (Cowell and Flachaire 2015:50-51; McDonald and Xu 1995:144).
To discriminate among nested models, including between the Singh-Maddala, the Dagum and the GB2, the parameter restrictions can be tested simply by looking at the corresponding parameter standard errors. However, confidence-interval tests based on asymptotic behavior perform poorly in small samples or in distributions with heavy tails, as is common with income distributions (Cowell and Flachaire 2015:52-56). Clauset et al. (2009) proposed a statistical framework for testing whether the data are consistent with the Pareto behavior, relying on maximum-likelihood fitting methods and goodness-of-fit tests based on the Kolmogorov-Smirnov statistic and likelihood ratios.

Graphical tests
One problem with data-reduction approaches, such as relying on logarithmic transformations of distributions or replacing full distributions with distributional statistics, is that they lose sight of the true dispersion of original observations, particularly those in the extreme tails, where data are sparse and the fit is likely weakest. These extreme observations may also influence the estimates obtained under any fitted curves. As far back as a century ago, Lorenz (1905:217) criticized tests of Pareto's power-law property and complained that "logarithmic curves are more or less treacherous," suppressing any deviations in the original variables. Cirillo (2013) raised similar criticisms in regard to various graphs relying on logarithmic transformations, because they downplay the effect of outliers. These criticisms also apply to using distributional statistics in place of the full Lorenz curve.
This section reviews the existing methods for testing the shape of income distributions. As with distributional statistics, the selection depends on one's needs and on the part of income distribution of special interest. Dombos (1982) provided an early review of available graphical diagnostic tools. A good start is to review the density histograms or kernel density plots compared to reference parametric distributions. Chauvel (2016) proposed what he called isographs showing the diversity of local inequality along the income scale. He writes: "The isograph presents the slope of the 'Fisk Graph' (Fisk, 1961:176) that is indeed a logit-log transformation of the Pen's parade (Pen, 1971:49-59), a transformation of the cumulative distribution function graph." Chauvel also reviews the shape of the density functions of distributions (strobiloids), and the evolution in them over time.
Another set of graphical tools is based on the so-called Zipf law (Gabaix 1999), proposing that one should find a linear relationship between log(rank order) and log(frequency) among the relevant range of incomes. Under the Pareto distribution, we should find a linear relationship between the logarithm of the proportion of individuals with income greater than a threshold and the logarithm of the threshold itself. This suggests a graphical test of the validity of this property.
The plot of the log of incomes on the x-axis and the log of the survival function (log(1-CDF)) on the y-axis, known as the Pareto diagram (Cowell 2011) or Zipf plot (Cirillo 2013), is among the most commonly used diagnostics. The Pareto QQ-plot, {-log(1-F(x_i)) ; log x_i}, has also been commonly used, but this plot performs poorly at distinguishing related distributions such as the lognormal and the Pareto (Cirillo 2013). The PP-plot, also referred to as the probability plot, shows the predicted versus actual values of the cumulative densities under various lower thresholds. The tightness of the estimated curve around the 45° line indicates goodness of fit.
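The coordinates of the Pareto diagram are straightforward to compute, as the following sketch shows. The helper names and the synthetic Pareto sample are assumptions for illustration; the least-squares slope is included only as a rough visual aid and is not one of the estimators reviewed in Appendix 2.

```python
import math
import random


def pareto_diagram(incomes):
    """Points (log x, log empirical survival) for strictly positive incomes."""
    xs = sorted(incomes)
    n = len(xs)
    # Share of observations at or above the i-th smallest value: (n - i) / n.
    return [(math.log(x), math.log((n - i) / n)) for i, x in enumerate(xs)]


def ols_slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic Pareto 1 sample with coefficient 2 (illustrative assumption).
rng = random.Random(1)
sample = [rng.paretovariate(2.0) for _ in range(5000)]
points = pareto_diagram(sample)
slope = ols_slope(points)
print(f"slope of the Pareto diagram: {slope:.3f}")
```

Under an exact Pareto 1 distribution the points fall close to a straight line whose slope is the negative of the Pareto coefficient; systematic curvature in the plotted points is the visual signal of departure from the power law.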
Relative density plots juxtapose a best-fitting reference distribution function against the observed distribution. If the empirical distribution exactly follows the reference distribution, the plot will be a uniform density.
Jenkins (2017) used various graphical (and confidence-interval based) tools, namely the Zipf plots, mean excess plots, Zenga (1984) curves, and Hill plots (Drees et al. 2000), to test Pareto 1 against Pareto 2 on an empirical income distribution. Mean-excess plots show the mean income above a threshold for various thresholds. For Pareto, the graph is a positively-sloped straight line above a lower threshold. Ghosh and Resnick (2010) used the mean-excess plots to diagnose several candidate functions in the generalized Pareto family. Zenga curves show a transformation of the Lorenz curve above a certain threshold. For Pareto distributions the Zenga curve is positively sloped (and the more heavy-tailed the distribution, the higher the curve), while for a lognormal distribution, the curve is flat. Hill plots show the (Pareto) top-income index estimated across different delineations of top incomes, with various lower thresholds.
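A mean-excess curve can be sketched as follows. Here the mean excess at a threshold u is taken to be the average of (income − u) among incomes above u, which is the usual definition but an assumption relative to the wording above; the function name, the thresholds and the synthetic sample are likewise illustrative.

```python
import random


def mean_excess(incomes, threshold):
    """Average amount by which incomes above the threshold exceed it."""
    excesses = [x - threshold for x in incomes if x > threshold]
    if not excesses:
        raise ValueError("no incomes above the threshold")
    return sum(excesses) / len(excesses)


# Synthetic Pareto 1 sample with coefficient 4 and lower bound 1 (assumption).
rng = random.Random(2)
alpha = 4.0
sample = [rng.paretovariate(alpha) for _ in range(20000)]
thresholds = [1.0, 1.5, 2.0]
curve = [(u, mean_excess(sample, u)) for u in thresholds]
for u, e in curve:
    print(f"threshold {u:.1f}: mean excess {e:.3f}")
```

For a Pareto 1 distribution with coefficient above one, the mean excess equals u/(alpha − 1), so the plotted points should line up along a ray with positive slope, consistent with the straight-line pattern described above.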
Similarly, generalized Pareto curves (Fournier 2015) show how the inverted Pareto coefficients (the ratio between the average income above a threshold, and the threshold income) evolve with varying thresholds in terms of income percentiles. Pareto distributions, the Champernowne distribution, the Sech² distribution and the Singh-Maddala distribution have a generalized Pareto curve with b>1 as one approaches the top end. On the contrary, the lognormal and three-parameter lognormal distributions, the gamma and generalized gamma distributions, and the Weibull distribution have a generalized Pareto curve with b=1 as one approaches the top end.
Figure 1 illustrates the relations among the more common types of distribution functions taking the GB2 as the starting point. (Other illustrations of the relations among distribution functions appear in Song 2005; or Leemis and McQueston 2008.) Among the family of GB functions, GB2 is the most widely accepted today for modeling incomes, and various representations (and classifications) of its density function have been proposed, including as the generalized Feller-Pareto (Arnold and Laguna 1977; Arnold 2015), generalized beta prime (Patil et al. 1984), transformed beta (Venter 1984), and generalized F (in different parameterization; Kalbfleisch and Prentice 1980) distributions. The GB2 distribution as well as its 3-parameter restricted versions decay like a heavy-tailed power function. By contrast, the lognormal distribution decays exponentially with a mid-heavy upper tail, and the Pareto class of distributions has a polynomial decay.
Gumbel (for variables with an infinite support but light tails, like the exponential distribution) is a limiting case of the EGB (with p=q) (McDonald and Xu 1995). The 2-parameter (log-)Gompertz distribution is another limiting case (refer to figure 3 in McDonald and Xu 1995:142).

Pearson family
Many distribution functions in the generalized Pareto, beta and gamma families belong to the class of Pearson distributions (Lee 1984), even though the generalized gamma, Singh-Maddala and Dagum themselves do not (McDonald and Xu 1995:138). "Pearson system purports to account for all possible shapes of observed distributions in any field [...], a general purpose system" (Dagum 2012:3). There are 7 types of Pearson distributions, with the generalized Pareto and beta distributions covered by types 1 and 2, the inverse Gaussian and gamma covered in type 5, the beta prime covered in type 6, and the t and Cauchy distributions covered in type 7. Beta 1 and 2 can be referred to as Pearson 1, and the 2-parameter gamma as Pearson 3. Amoroso's (1925) 3-parameter generalized gamma distribution has the Pearson 5 distribution (aka, inverse gamma, inverted gamma, reciprocal gamma, or Vinci) as its special case. The Cauchy distribution is itself a special case of Pearson 7.
Finally, it is worth noting that transforming the distributions in the Pearson family can also lead to the so-called Gompertz-Verhulst distributions, with favorable properties for the fitting of extreme values (Ahuja and Nash 1967). Aiuppa (1988) provided calculations to estimate a number of distributions in the Pearson family.

Extreme value distributions
Another category of distributions that has been cited in the top-incomes literature is the extreme value (EV; aka, von Mises, or von Mises-Jenkinson type) distributions, based on the EV theory (Kotz and Nadarajah 2000; Coles 2001). There are 3 types of EV distribution commonly used: EV1 (aka, doubly exponential) covers the exponential Weibull (aka, log-Weibull), Gumbel and Gompertz distributions; EV2 is a Fréchet-type distribution; and EV3 is the Weibull distribution (either the ordinary type, or reversed Weibull type). The generalized Pareto distribution can be represented as an EV1 distribution related to the exponential Weibull.

Figure 1. Relations among univariate distribution functions in the generalized beta family

[...], using the US Panel Study of Income Dynamics (PSID) for 1980-2001, the British Household Panel Survey (BHPS) for 1991-2001, and the German Socio-Economic Panel (GSOEP) for 1990-2002, propose using a two-parameter lognormal (or an exponential) distribution for the low and middle income population (up to the 97th-99th percentile), and the Pareto distribution for the highest income group (1%-3% of the population). The parsimonious Pareto 1 model fares well even compared to more-parametric alternatives. Gabaix (2009) concluded that the 2-parameter lognormal distribution does not necessarily outperform the Pareto 1 on the topmost incomes, on account of the degrees of freedom lost. Hlásny and Verme (2018ab) similarly argued that Pareto 1 provides more consistent results than the Pareto 2 for data in the US Current Population Survey (CPS), and than the GB2 for data in the EU Surveys on Income and Living Conditions and the US CPS. Hlásny and Verme (2018c), Ibragimov and Ibragimov (2018), and Hlásny (2019) found that the Pareto 1 fit well the distribution of incomes in transitional economies (Egypt, Russia and Mexico, respectively) and that the results were surprisingly similar to those obtained in high-income countries. Hlásny and Intini (2015) evaluated the fit of the Pareto 1 model to the distributions of household consumption in 11 Arab-region surveys, using lower cutoff points between the 80th and the 99.9th percentiles. They concluded that the tops of the Egyptian, Palestinian and Tunisian distributions are adequately described by Pareto 1, but Jordan and Sudan suffer from influential observations, which affects the estimation of the Pareto coefficients.
With grouped-records data, an important consideration is the assumed distribution of incomes in each income range. Taking advantage of the ease of working with the Pareto distribution, and taking Pareto's (1896) lead, densities in each income bracket can be modeled as Pareto-distributed (Feenberg and Poterba 1993; Piketty 2001). Gastwirth (1972) considers the extremes of the least versus most inequality in each income range to compute bounds on the composite inequality indexes. Alfons et al. (2013) derive the parameters of the Pareto 1 distribution in the presence of sampling weights, and numerically show their desirable properties. Atkinson et al. (2011) summarize the literature contending that the distribution of top incomes is best approximated by the Pareto distribution.⁶ Atkinson (2017) promotes the Pareto 1 distribution as a valuable point of departure for modeling top incomes, but he favors more complex distributions including Pareto 2 for specific income distributions. Harrison (1981), using the British New Earnings Survey, concludes that the Pareto distribution fits the dispersion of the top 15-20 percent of incomes well, provided that an appropriate criterion of fit is used (i.e., Gastwirth-Smith rather than chi-squared test). Disaggregating data by occupational group leads to better estimation results than using all data jointly, but the Pareto model outperforms the lognormal model both in the aggregated and the disaggregated data. The estimated Pareto coefficients differ across occupational groups, and even across choices for the lower cutoff point for a single occupational group, while the coefficient appears more stable in the aggregate sample. The poor performance of the lognormal distribution appears to be due to the income differences across occupational groups.
Blanchet et al. (2018) estimated different Pareto distributions across income fractiles, and used a cubic spline to restrict the fitting and provide smoothness between fractile ranges.

Pareto 2
In recent years, understanding has shifted toward advocating the more flexible Pareto 2 distribution, particularly when a larger range of incomes is analyzed (Atkinson 2017). This shift in preference is based on the Pickands-Balkema-de Haan theorem (Balkema and de Haan 1974; Pickands 1975), also called the second theorem in the EV theory, giving the asymptotic tail distribution of values of a random variable above a certain threshold. For a large class of true distributions, as the lower threshold increases, the distribution is said to converge to the Pareto 2.
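The peaks-over-threshold logic behind this result can be illustrated with a small sketch that fits a generalized Pareto (Pareto 2) form to the excesses of incomes over a threshold. The estimator used here is a simple probability-weighted-moment one, swapped in because it has a closed form; it is not the method of the studies cited in this section, which typically rely on maximum likelihood. The function name, the plotting positions (i − 0.35)/n, the threshold and the synthetic sample are all assumptions for illustration.

```python
import random


def gpd_pwm(excesses):
    """Probability-weighted-moment estimates (shape xi, scale sigma) of a
    generalized Pareto distribution fitted to threshold excesses."""
    xs = sorted(excesses)
    n = len(xs)
    a0 = sum(xs) / n
    # a1 estimates E[X * (1 - F(X))] using plotting positions (i - 0.35) / n
    # for the i-th smallest excess, i = 1..n.
    a1 = sum(x * (1.0 - (i + 0.65) / n) for i, x in enumerate(xs)) / n
    xi = 2.0 - a0 / (a0 - 2.0 * a1)
    sigma = 2.0 * a0 * a1 / (a0 - 2.0 * a1)
    return xi, sigma


# Synthetic Pareto 1 sample with coefficient 4 and lower bound 1 (assumption).
rng = random.Random(3)
alpha = 4.0
threshold = 1.5
sample = [rng.paretovariate(alpha) for _ in range(20000)]
excesses = [x - threshold for x in sample if x > threshold]
xi_hat, sigma_hat = gpd_pwm(excesses)
print(f"shape {xi_hat:.3f}, scale {sigma_hat:.3f}")
```

For an exact Pareto 1 parent, the excesses over any threshold u follow a generalized Pareto distribution with shape 1/alpha and scale u/alpha, so the estimates in this example should land near 0.25 and 0.375. These moment-based estimates require a shape parameter below one.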
Burkhauser et al. (2009) used one of Stoppa's (1990a) exponentiated Pareto distributions to correct for top coding in US-CPS income data, because the distribution function has a heavy tail and can be made to mimic the mean as well as the variance of the raw top incomes. Departing from a baseline Pareto 1 specification, Jenkins (2017) found strong support for the Pareto 2 model for UK incomes, except at the very top of the distribution (top 0.5-1%, or even fewer topmost incomes depending on year), where the Pareto 1 performs as well as the Pareto 2 or even better. Based on Jenkins (2017), the most appropriate left-truncation threshold varies greatly across survey waves, from below the 85th percentile to as high as the 99.9th percentile for Pareto 1, whereas it lies more consistently between the 95th and the 99th percentile for Pareto 2.

Charpentier and Flachaire (2019) compared Pareto 1, Pareto 2, and Beirlant et al.'s (2009) 3-parameter extended Pareto distribution on the US CPS and the South African National Income Dynamics Study surveys. They found that the one-parameter Pareto 1 is clearly dominated by the two- and three-parameter alternatives, but the two-parameter Pareto 2 and the three-parameter extended Pareto cannot be ranked easily across various modeling specifications. Only among the topmost 1 percent of incomes or fewer do the Pareto 1 and 2 perform similarly, and there the extended Pareto distribution appears to outrank them in the precision of fitting the empirical distribution. Finally, Charpentier and Flachaire (2019) report that the extended Pareto distribution also outperforms the Pareto 4 model (Arnold 2015) and a variant of an extended generalized Pareto model (EGPD3 in Papastathopoulos and Tawn, 2013).

Charpentier and Flachaire (2019) also discuss three potential sources of bias in the estimation of a tail index (such as the Pareto coefficient): misspecification bias, estimation bias and sampling bias. Misspecification bias can be evaluated, with some success, using graphical methods such as the quantile-quantile (QQ) plot or the Pareto diagram. Increasing the sample size does not mitigate the misspecification bias; only a more appropriate parametric form or a different (i.e., higher) threshold can mitigate it. Estimation bias arises because the observed empirical upper tail differs from the underlying population upper tail, and the divergence can be systematic, on account of oversampled units (rare outliers) or undersampled units (underreporting, top coding). Having a larger sample size may partially mitigate these problems. On the other hand, sampling bias comes from survey nonresponse or incidental inclusion of units outside the sampling frame. This, once again, is unlikely to be mitigated by increasing the sample size.
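The role of the threshold in misspecification bias can be illustrated with simulated data. The sketch below draws incomes from a Pareto 2 law and applies the Hill estimator, which presumes a Pareto 1 tail, to progressively smaller top shares. The Hill estimator is used here only as a convenient stand-in for a Pareto 1 fit; the sample size, seed and parameter values are hypothetical and are not drawn from Charpentier and Flachaire (2019).

```python
import math
import random

def draw_pareto2(rng, n, mu, sigma, xi):
    """Inverse-transform draws from a Pareto 2 (generalized Pareto) law."""
    draws = []
    for _ in range(n):
        u = 1.0 - rng.random()              # uniform on (0, 1]
        draws.append(mu + sigma / xi * (u ** (-xi) - 1.0))
    return draws

def hill_alpha(incomes, k):
    """Hill estimate of the Pareto coefficient from the k largest incomes."""
    xs = sorted(incomes)
    threshold = xs[-k - 1]                  # the (k+1)-th largest income
    mean_log_ratio = sum(math.log(x / threshold) for x in xs[-k:]) / k
    return 1.0 / mean_log_ratio

# Hypothetical population: Pareto 2 with true tail coefficient alpha = 1 / xi = 2.
rng = random.Random(0)
true_alpha = 2.0
sample = draw_pareto2(rng, 50_000, mu=0.0, sigma=20_000.0, xi=1.0 / true_alpha)

# A Pareto 1 fit is misspecified at low thresholds; raising the threshold
# (smaller top share) should move the estimate toward the true coefficient.
for share in (0.50, 0.10, 0.01):
    k = int(share * len(sample))
    print(f"top {share:.0%}: Hill alpha = {hill_alpha(sample, k):.2f}")
```

In this setup the estimate from the top half of the sample should sit well below the true coefficient, and drawing more observations at the same top share would not remove that gap. Only moving the threshold up, or fitting the Pareto 2 itself, addresses it.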
concludes that the lognormal distribution fits well up to the 80th percentile of incomes. Applying the lognormal distribution to individual occupational groups, and adding the groups to one another, he fails to confirm the summation property of lognormal distributions, suggesting that not all occupational groups exhibit incomes distributed as lognormal. McDaniel et al. (1988) successfully fitted a tri-variate form of the Johnson SB (4-parameter lognormal) distribution to monthly income, financial assets, and a physical disability indicator. Finally, Cowell (2015:15) concludes that the lognormal model fits the distribution of wages, but not of broader income concepts.

Other Pareto-like models
Winkler (1950; as reported by Kleiber 2013) fitted a 3-parameter Benini distribution to US incomes of 1919. Thatcher (1968) used the Champernowne distribution to model UK earnings, and Campano (1987) used the 5-parameter Champernowne on 1969 US incomes. Colombi (1990) estimated that a Pareto-lognormal distribution fit the Italian 1984-1986 family incomes as closely as the Singh-Maddala and Dagum models.
Fisk distribution also fared well on the UK Family Expenditure Survey (Henniger and Schmitz 1989). Linear and quadratic special cases of conic distributions (evaluated positively by Kleiber and Kotz 2003) provided a better fit to US individual earnings than the Singh-Maddala (Houthakker 1992; Taylor 2010). More recently, Dagsvik et al. (2013) found that a 3-parameter Pareto-Lévy distribution outperforms the 4-parameter GB2 distribution on grouped-records income data from 5 out of 8 evaluated OECD countries. Fournier (2015) concluded that the Pareto 3 and 4, Champernowne, Fisk and Singh-Maddala distributions fit the observed distribution of top incomes in the French fiscal authority data for 2006 well.

Majumder and Chakravarty's model was a mere re-parameterization of the GB2 model. McDonald and Mantrala favored Dagum and Singh-Maddala models for 1970-1990 US family incomes. Only the four-parameter GB2 and the five-parameter GB distribution outperformed the Dagum 1 (McDonald and Xu 1995). Gordy (1998) proposed a 6-parameter 'compound confluent

advocating of the gamma distribution, Singh and Maddala (1976) compared the gamma, the lognormal and the Singh-Maddala distributions, and found that the latter provided a better fit to the 1960-1972 US family incomes. In regard to the degrees of freedom used up in estimation, particularly on grouped-records data, Cramer (1978) called for the development of more appropriate model-selection criteria recognizing the nonlinear nature of parameter estimation. Commenting on Singh and Maddala's (1976) assumptions and model-fit statistics, he questioned whether their distribution really outperformed the gamma distribution. Nevertheless, subsequent research has largely confirmed the superior fit of the Singh-Maddala model to income distributions worldwide, compared to lower-parametric specifications. McDonald and Ransom (1979ab) compared the fit of the lognormal, gamma, beta and Singh-Maddala distributions to decile-group data for 1960-1975 family incomes. The Singh-Maddala outperformed the beta distribution across most years, the beta in turn outperformed the gamma, and the gamma outperformed the lognormal. This ranking held when the same estimation technique was used across the competing distribution functions, but not necessarily otherwise.
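The trade-off between fit and the number of parameters on grouped-records data can be sketched as follows. The example assumes a standard multinomial log-likelihood over income brackets and uses the Akaike information criterion (AIC) as one familiar penalty for extra parameters; neither is claimed to be the estimator or criterion used in the studies cited above, and the bracket edges, parameter values and crude grid search are hypothetical choices made for illustration.

```python
import math

def singh_maddala_cdf(x, a, b, q):
    """Singh-Maddala CDF, F(x) = 1 - [1 + (x/b)^a]^(-q); q = 1 gives the Fisk."""
    if x <= 0:
        return 0.0
    return 1.0 - (1.0 + (x / b) ** a) ** (-q)

def grouped_loglik(edges, counts, cdf):
    """Multinomial log-likelihood (up to a constant) of bracket counts.

    edges are interior bracket boundaries: the first bracket starts at 0
    and the last bracket is open-ended.
    """
    cum = [0.0] + [cdf(e) for e in edges] + [1.0]
    return sum(n * math.log(cum[j + 1] - cum[j]) for j, n in enumerate(counts))

def aic(loglik, n_params):
    """Akaike information criterion: lower values are preferred."""
    return 2.0 * n_params - 2.0 * loglik

# Hypothetical brackets and "true" Singh-Maddala parameters.
edges = [10_000.0, 20_000.0, 30_000.0, 50_000.0, 80_000.0]
a0, b0, q0 = 2.5, 40_000.0, 1.5
cum0 = [0.0] + [singh_maddala_cdf(e, a0, b0, q0) for e in edges] + [1.0]
counts = [10_000 * (cum0[j + 1] - cum0[j]) for j in range(len(cum0) - 1)]  # expected counts

# Three-parameter Singh-Maddala evaluated at the generating parameters.
ll_sm = grouped_loglik(edges, counts, lambda x: singh_maddala_cdf(x, a0, b0, q0))

# Two-parameter Fisk (q fixed at 1), fitted by a crude grid search over a and b.
ll_fisk = max(
    grouped_loglik(edges, counts, lambda x, a=a, b=b: singh_maddala_cdf(x, a, b, 1.0))
    for a in [1.5 + 0.05 * i for i in range(51)]
    for b in [20_000.0 + 500.0 * i for i in range(81)]
)

print("Singh-Maddala AIC:", aic(ll_sm, 3))
print("Fisk AIC:", aic(ll_fisk, 2))
```

Because the counts are constructed as exact expected counts, the generating Singh-Maddala parameters maximize the grouped likelihood, so the Fisk cannot achieve a higher log-likelihood; whether it wins on AIC depends on how much fit the extra parameter buys. With real grouped data both models would need to be estimated, and the comparison is only meaningful when the same estimation technique is applied to each.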
successful, namely the generalized Pareto, and the 3-4 parameter distributions in the generalized beta family, including the Singh-Maddala, Dagum and GB2 distributions. Recent literature has focused on ascertaining the settings under which each model is appropriate, which estimator should be used in fitting each model, and how to compare or replace the observed values with the model-estimated ones. It should be re-emphasized that the distribution functions presented here can perform only as well as the income observations they are based on. In the presence of top-income measurement biases, the models should be estimated on right-truncated samples or on samples where the measurement errors have been mitigated.

The quest for modeling best practices does not stop here. Future academic research should focus on bridging the disparate branches of modeling literature and developing a one-stop tool for model selection, estimation and evaluation. Among practitioners, a consensus should be sought on how to use the findings most effectively to, say, track measurement issues such as tax evasion, report on developmental policies, and improve fiscal targeting and redistribution.