Correspondence Analysis Using the Cressie–Read Family of Divergence Statistics

The foundations of correspondence analysis rest with Pearson's chi-squared statistic. More recently, it has been shown that the Freeman–Tukey statistic plays an important role in correspondence analysis, a result that confirms the advantages of the Hellinger distance that have long been advocated in the literature. Pearson's and the Freeman–Tukey statistics are two of five commonly used special cases of the Cressie–Read family of divergence statistics. Therefore, this paper explores the features of correspondence analysis when its foundations lie with this family and shows that log-ratio analysis (an approach that has gained increasing attention in the correspondence analysis and compositional data analysis literature) and the method based on the Hellinger distance are special cases of this new framework.


Introduction
Correspondence analysis is a popular means of visually analysing the association between two or more categorical variables. Nishisato (1980, section 1.2), Lebart et al. (1984), Greenacre (1984), Beh (2004) and Beh & Lombardo (2012, 2014, 2019) collectively provide comprehensive historical, technical, computational and bibliographical reviews of this technique. The reason for its strong appeal is that the association is most commonly quantified using Pearson's chi-squared statistic, $X^2$ (Pearson, 1904).
While $X^2$ dominates much that has been written in the categorical data analysis (and related) literature, it is certainly not the only measure that one may use to assess whether there exists a statistically significant association between two or more categorical variables. For example, the likelihood ratio statistic, $G^2$ (Wilks, 1938), and the Freeman-Tukey statistic, $T^2$ (Freeman & Tukey, 1950), may also be considered and are two of many special cases of the Cressie-Read family of divergence statistics (Cressie & Read, 1984). This paper shows that a broader perspective can be gained by using this family to perform correspondence analysis, thereby extending the analysis beyond Pearson's statistic. We describe the framework of this technique in the following seven sections. Section 2 gives a brief overview of the Cressie-Read family of divergence statistics and its five most popular special cases. We also parameterise the family using a sum-of-squares measure involving a general residual, called the divergence residual, that relies on a parameter δ; changes in δ lead to special cases of the family of divergence statistics. Some further comments on the divergence residual are made in Section 3 while Section 4 demonstrates how correspondence analysis can be performed using this residual. In particular, we describe the application of singular value decomposition (SVD) and the various properties that stem from this approach, including the definition and interpretation of principal coordinates, their distance from each other and from the origin in a low-dimensional plot, and their role in modelling the association between the variables. In Section 5, we show how two values of δ lead to the log-ratio analysis (LRA) of Greenacre (2009) and to the Hellinger Distance Decomposition (HDD) method of Cuadras & Cuadras (2006) as special cases of this framework. Further insights into LRA and HDD, and into variants that have yet to be explored, are given in Section 6.
We demonstrate the application of this technique in Section 7 by studying the association in the data formed from the cross-classification of the Nobel Prize awarded between 1901 and 2022 and the laureates' country of affiliation. We compare and contrast the results obtained from performing specific special cases of the framework and discuss some issues concerned with the 'best' and 'worst' choice of δ. We provide some final remarks on the technique in Section 8, including the question of what value of δ is to be recommended.

The Cressie-Read Family
Consider an $I \times J$ two-way contingency table, $N$, where the $(i, j)$th cell entry has a frequency of $n_{ij}$ for $i = 1, 2, \dots, I$ and $j = 1, 2, \dots, J$. Let the grand total of $N$ be $n$ and let the matrix of relative frequencies be $P$ so that its $(i, j)$th cell entry is $p_{ij} = n_{ij}/n$ where $\sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij} = 1$. Define the $i$th row marginal proportion by $p_{i\bullet} = \sum_{j=1}^{J} p_{ij}$. Similarly, define the $j$th column marginal proportion as $p_{\bullet j} = \sum_{i=1}^{I} p_{ij}$. To determine whether there exists a statistically significant association between the row and column variables, one may calculate any number of measures. Five of the most common are (1) Pearson's chi-squared statistic, $X^2$ (Pearson, 1904), (2) the log-likelihood ratio statistic, $G^2$ (Wilks, 1938), (3) the Freeman-Tukey statistic, $T^2$ (Freeman & Tukey, 1950), (4) the modified chi-squared statistic, $N^2$ (Neyman, 1940, 1949) and (5) the modified log-likelihood ratio statistic, $M^2$ (Kullback, 1959), all of which are asymptotically chi-squared random variables with $(I - 1)(J - 1)$ degrees of freedom. They can all be expressed in terms of $p_{ij}/(p_{i\bullet}p_{\bullet j})$, which is the Pearson ratio of the $(i, j)$th cell of the contingency table; see Goodman (1996), Beh (2004), Greenacre (2009) and Beh & Lombardo (2014, p. 123) who used this ratio in the context of correspondence analysis.
Each of these measures of association is a special case of the Cressie-Read family of divergence statistics (Cressie & Read, 1984) which, for a two-way contingency table, is defined as
$$CR(\delta) = \frac{2n}{\delta(\delta + 1)} \sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij}\left[\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right)^{\delta} - 1\right]. \tag{1}$$
Here, δ is the parameter of interest and lies within the interval $\delta \in (-\infty, \infty)$. The general nature of (1) ensures that specific values of δ lead to the five measures of association we described earlier: $X^2 = CR(1)$, $G^2 = CR(0)$, $T^2 = CR(-1/2)$, $N^2 = CR(-2)$ and $M^2 = CR(-1)$, where the cases $\delta = 0$ and $\delta = -1$ are obtained as limits. In the context of goodness-of-fit testing for categorical variables, Cressie & Read (1984) suggest that the most appropriate values of δ to choose from are those that lie within the interval $[0, 3/2]$. Such an interval is advocated when no knowledge of the type of alternative [hypothesis] is available (Cressie & Read, 1984, p. 462). However, for a two-way contingency table where $n > 10$ and $\min(n p_{i\bullet}p_{\bullet j}) > 1$ for all $i = 1, 2, \dots, I$ and $j = 1, 2, \dots, J$, they advised that an appropriate choice of δ is 2/3 so that $CR = CR(2/3)$ is referred to as the Cressie-Read statistic.
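The relationship between the family and its five special cases can be checked numerically. The sketch below assumes the standard form of the Cressie-Read statistic written in terms of cell proportions; the function name `cressie_read` and the small table of counts are hypothetical and serve only as an illustration.

```python
import numpy as np

def cressie_read(N, delta):
    """Cressie-Read divergence statistic CR(delta) for a two-way table N."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n
    E = np.outer(P.sum(axis=1), P.sum(axis=0))   # p_i. * p_.j
    ratio = P / E                                 # Pearson ratios
    if np.isclose(delta, 0.0):                    # limiting case: G^2
        return 2 * n * np.sum(P * np.log(ratio))
    if np.isclose(delta, -1.0):                   # limiting case: M^2
        return 2 * n * np.sum(E * np.log(1 / ratio))
    return 2 * n / (delta * (delta + 1)) * np.sum(P * (ratio**delta - 1))

# Hypothetical 3 x 3 table of counts (no zero cells), for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17]])
O = N.astype(float)
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()   # expected counts

pearson_X2 = np.sum((O - E) ** 2 / E)
freeman_tukey_T2 = 4 * np.sum((np.sqrt(O) - np.sqrt(E)) ** 2)
neyman_N2 = np.sum((O - E) ** 2 / O)

for delta in (1, 2 / 3, 0, -0.5, -1, -2):
    print(f"delta = {delta:6.3f}  CR = {cressie_read(N, delta):.4f}")
```

Comparing the printed values for δ = 1, −1/2 and −2 with `pearson_X2`, `freeman_tukey_T2` and `neyman_N2` shows the agreement directly. Negative values of δ require strictly positive cell counts, which is why the example table has no zero cells.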
A second-order Taylor series approximation of the family is
$$CR^*(\delta) = \frac{n}{\delta^2} \sum_{i=1}^{I}\sum_{j=1}^{J} p_{i\bullet}p_{\bullet j}\left[\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right)^{\delta} - 1\right]^2. \tag{2}$$
By using (2), $M^2 = CR^*(0) = CR(-1)$; this result can be obtained by determining the limiting value of $CR^*(\delta)$ as $\delta \to 0$. Similarly, it can be verified that $X^2 = CR^*(1) = CR(1)$ and $T^2 = CR^*(1/2) = CR(-1/2)$. This approximation also preserves the double centring of the $p_{ij}^{\delta}$ values through its expected value under independence, $(p_{i\bullet}p_{\bullet j})^{\delta}$, for all δ. Equation (2) shows that the family of divergence statistics can be approximated by a weighted sum-of-squares of the power transformation of the ratio of the observed cell proportions compared with what is expected under the hypothesis of independence. Power transformations have been discussed in the correspondence analysis literature in the past; one may refer to Cuadras & Cuadras (2006), Cuadras et al. (2006) and Greenacre (2009, 2010) as key examples of this research. One may refer to Read & Cressie (1988, pp. 94-95) and Cressie & Read (1989) for an overview of (2). Cressie & Read (1984, p. 462) say of (2) that the second-order approximations for the first three moments under the simple null hypothesis indicate that the moments of [$CR^*(\delta)$] converge most rapidly to the asymptotic $\chi^2$ distribution for $\delta \in [0.3, 2.7]$. Therefore, if the aim of analysing the association between the variables of a contingency table is to focus on inferential issues such as assessing the nature of the association, this interval suggests that, for example, the Freeman-Tukey statistic and Pearson's statistic are the preferred measures. However, because the aim of correspondence analysis is to visualise the nature of the association (knowing, or assuming, that an association exists), such an analysis need not be confined to such a test of association and so allows for more flexibility in the choice of δ for visualising the association between the variables. We shall be considering $\delta \in [0, 3/2]$ in our application of the framework in Section 7, although Read & Cressie (1988, p. 96) point out for (1) that, when choosing the appropriate value of δ to use, departures involving large ratios of the alternative to null expected frequencies in one or two cells are best detected using large values of δ, say $\delta = 5$; the same observation can also be made for (2). Therefore, large values of δ may be chosen to identify those cells of a contingency table that exhibit very large, positive, Pearson ratios. They also advocate choosing a large negative value of δ (such as $\delta = -5$) when the Pearson ratios are close to zero.
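The stated special cases of the approximation can be verified with a short sketch. It assumes that the sum-of-squares approximation takes the form $CR^*(\delta) = (n/\delta^2)\sum_i\sum_j p_{i\bullet}p_{\bullet j}[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]^2$; this form is an assumption chosen because it reproduces $X^2$ at $\delta = 1$ and $T^2$ at $\delta = 1/2$, as the text requires. The function name `cr_star` and the table are hypothetical.

```python
import numpy as np

def cr_star(N, delta):
    """Assumed sum-of-squares approximation CR*(delta) of the Cressie-Read family."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n
    E = np.outer(P.sum(axis=1), P.sum(axis=0))    # p_i. * p_.j
    ratio = P / E
    # (ratio**delta - 1) / delta tends to log(ratio) as delta -> 0
    power = np.log(ratio) if np.isclose(delta, 0.0) else (ratio**delta - 1) / delta
    return n * np.sum(E * power**2)

# Hypothetical 3 x 3 table of counts, for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17]])
O = N.astype(float)
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()

pearson_X2 = np.sum((O - E) ** 2 / E)
freeman_tukey_T2 = 4 * np.sum((np.sqrt(O) - np.sqrt(E)) ** 2)

for delta in (0, 0.5, 2 / 3, 1, 5):
    print(f"delta = {delta:5.3f}  CR* = {cr_star(N, delta):.4f}")
```

Because the power term is squared, every cell contributes a non-negative amount for any δ, which is what allows the approximation to serve as a total inertia.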

The Divergence Residual
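The correspondence analysis of a two-way contingency table quantifies the association between the variables using the total inertia defined by $\phi^2 = X^2/n$ where $X^2$ is Pearson's chi-squared statistic; this analysis shall thus be referred to as classical correspondence analysis below. Because $X^2$ is a special case of $CR(\delta)$, this suggests that a more general framework of correspondence analysis can be adopted, one that quantifies the total inertia by $\phi^2(\delta) = CR^*(\delta)/n$ for any given value of δ. Therefore, from (2), a generalisation of the total inertia is defined such that
$$\phi^2(\delta) = \frac{1}{\delta^2} \sum_{i=1}^{I}\sum_{j=1}^{J} p_{i\bullet}p_{\bullet j}\left[\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right)^{\delta} - 1\right]^2.$$
If we define
$$r_{ij}(\delta) = \frac{\sqrt{p_{i\bullet}p_{\bullet j}}}{\delta}\left[\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right)^{\delta} - 1\right] \tag{3}$$
to be the divergence residual of the $(i, j)$th cell of the contingency table, then
$$\phi^2(\delta) = \sum_{i=1}^{I}\sum_{j=1}^{J} r_{ij}^2(\delta). \tag{4}$$
There are four special cases of the divergence residual that are now worth considering. They arise when $\delta = 0, 1/2, 2/3$ and 1 although other values of δ may also be considered. These values of δ lead to the following residuals:
• When $\delta = 1$, the divergence residual is the Pearson residual, $r_{ij}(1) = (p_{ij} - p_{i\bullet}p_{\bullet j})/\sqrt{p_{i\bullet}p_{\bullet j}}$, and leads to classical correspondence analysis where $\phi^2(1) = X^2/n$.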
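• When $\delta = 1/2$, the Freeman-Tukey residual is defined as
$$r_{ij}(1/2) = 2\left(\sqrt{p_{ij}} - \sqrt{p_{i\bullet}p_{\bullet j}}\right)$$
and was used as the basis of the Freeman-Tukey variant of correspondence analysis described by Beh et al. (2018) where $\phi^2(1/2) = T^2/n$.
• When $\delta = 2/3$, the Cressie-Read residual, $r_{ij}(2/3)$, can produce features that are more similar to the Freeman-Tukey variant of correspondence analysis than the classical variant. Using this residual, $\phi^2(2/3) = CR^*(2/3)/n$.

The Correspondence Analysis Framework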

SVD of the Divergence Residual
Recall that correspondence analysis quantifies the association between categorical variables using $X^2/n$. Because $X^2/n = \phi^2(1)$, this suggests that a more general framework of correspondence analysis can be considered. Such a framework can be formed by applying an SVD to the matrix of divergence residuals so that, for the $(i, j)$th cell, and some fixed value of δ,
$$r_{ij}(\delta) = \sum_{m=1}^{M^*} a_{im}(\delta)\,\lambda_m(\delta)\,b_{jm}(\delta) \tag{5}$$
where
$$\sum_{i=1}^{I} a_{im}(\delta)a_{im'}(\delta) = \sum_{j=1}^{J} b_{jm}(\delta)b_{jm'}(\delta) = \begin{cases} 1, & m = m' \\ 0, & m \neq m' \end{cases} \tag{6}$$
and $M^*$ is the maximum number of dimensions required to depict all of the association that exists between the variables of the contingency table. Its value depends on δ and more on this is discussed in Section 6.2. The quantities $a_{im}(\delta)$ and $b_{jm}(\delta)$ are the $i$th and $j$th element, respectively, of the $m$th left and right singular vectors of the matrix of divergence residuals for a fixed δ. The $m$th largest singular value is $\lambda_m(\delta)$ so that $1 > \lambda_1(\delta) > \lambda_2(\delta) > \dots > \lambda_{M^*}(\delta) > 0$.
We may also approach the correspondence analysis of N under this framework by considering a generalised SVD (GSVD) of the matrix of adjusted divergence residuals and These results can help to simplify our discussion of the features of correspondence analysis for a given value of δ.We now discuss some of the key properties of this framework.
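The decomposition described in this subsection can be sketched with NumPy. The residual form used here, $\sqrt{p_{i\bullet}p_{\bullet j}}\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]/\delta$, is an assumption consistent with the special cases named in the text, and the rescaling of the singular vectors by the square roots of the marginal proportions is one common way of writing a GSVD with marginal weights. The function name and the table are hypothetical.

```python
import numpy as np

def divergence_residuals(P, delta):
    """Assumed form: sqrt(p_i. p_.j) * (ratio**delta - 1) / delta (log-ratio at delta = 0)."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    ratio = P / E
    power = np.log(ratio) if delta == 0 else (ratio**delta - 1) / delta
    return np.sqrt(E) * power

# Hypothetical 4 x 3 table of counts, for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17], [11, 6, 14]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

R = divergence_residuals(P, delta=0.5)             # Freeman-Tukey case
A, lam, Bt = np.linalg.svd(R, full_matrices=False)

total_inertia = np.sum(R**2)                       # sum of squared residuals
A_star = A / np.sqrt(r)[:, None]                   # left vectors rescaled by row masses
B_star = Bt.T / np.sqrt(c)[:, None]                # right vectors rescaled by column masses
adjusted = R / np.sqrt(np.outer(r, c))             # adjusted residuals: (ratio**delta - 1) / delta

print("singular values:", np.round(lam, 4))
print("total inertia  :", round(float(total_inertia), 6))
```

The rescaled vectors are orthonormal with respect to the row and column marginal proportions, and `(A_star * lam) @ B_star.T` reproduces the matrix of adjusted residuals.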

Principal Coordinates
To visually summarise the association between the row and column variables of $N$, a low-dimensional correspondence plot can be constructed by plotting the $i$th row and $j$th column category along the $m$th dimension using their principal coordinates defined as
$$f_{im}(\delta) = \frac{a_{im}(\delta)}{\sqrt{p_{i\bullet}}}\,\lambda_m(\delta) \tag{10}$$
and
$$g_{jm}(\delta) = \frac{b_{jm}(\delta)}{\sqrt{p_{\bullet j}}}\,\lambda_m(\delta), \tag{11}$$
respectively. From (8) and (9), the principal coordinates have the property
$$\sum_{i=1}^{I} p_{i\bullet} f_{im}^2(\delta) = \sum_{j=1}^{J} p_{\bullet j}\, g_{jm}^2(\delta) = \lambda_m^2(\delta)$$
where $\lambda_m^2(\delta)$ is the principal inertia of the $m$th dimension of the correspondence plot. Therefore, irrespective of the choice of δ, the first dimension reflects (proportionally) more of the association than any other dimension in the correspondence plot. In fact, the total inertia, (4), can be expressed in terms of these squared singular values and the principal coordinates by
$$\phi^2(\delta) = \sum_{m=1}^{M^*} \lambda_m^2(\delta) = \sum_{i=1}^{I}\sum_{m=1}^{M^*} p_{i\bullet} f_{im}^2(\delta) = \sum_{j=1}^{J}\sum_{m=1}^{M^*} p_{\bullet j}\, g_{jm}^2(\delta).$$
While the principal coordinates, (10) and (11), are used to visually summarise the association, biplot coordinate systems may also be considered; see, for example, Greenacre (2010, chapters 7 and 8), Gower et al. (2011) and Beh & Lombardo (2014, section 4.5.3). One may also consider using the lambda-scaling discussed in Gower et al. (2011, section 2.3.1) and Beh & Lombardo (2014, p. 195). We shall confine our attention here to visualising the association using (10) and (11).
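A minimal sketch of how principal coordinates and principal inertias might be computed follows. It assumes the divergence residual has the form $\sqrt{p_{i\bullet}p_{\bullet j}}\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]/\delta$ and that principal coordinates are singular vectors rescaled by the marginal proportions and multiplied by the singular values; the function name and the data are hypothetical.

```python
import numpy as np

def principal_coordinates(P, delta):
    """Row and column principal coordinates from the SVD of assumed divergence residuals."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    R = np.sqrt(E) * ((P / E) ** delta - 1) / delta
    A, lam, Bt = np.linalg.svd(R, full_matrices=False)
    F = (A / np.sqrt(r)[:, None]) * lam       # row coordinates, one column per dimension
    G = (Bt.T / np.sqrt(c)[:, None]) * lam    # column coordinates
    return F, G, lam

# Hypothetical 4 x 3 table of counts, for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17], [11, 6, 14]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

F, G, lam = principal_coordinates(P, delta=2 / 3)
principal_inertia = lam**2
explained_2d = principal_inertia[:2].sum() / principal_inertia.sum()

print("principal inertias:", np.round(principal_inertia, 5))
print(f"share of inertia in two dimensions: {100 * explained_2d:.1f}%")
```

The first two columns of `F` and `G` are what would be plotted in a two-dimensional correspondence plot, and `explained_2d` is the quality measure quoted for such a plot.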

Distance from the origin
Consider the power transformation of the $i$th centred row profile such that The $j$th element of this profile can be expressed in terms of the power transformed Pearson ratio such that This distance measure can also be written as the squared Euclidean distance of a row point from the origin so that Because $\phi^2(\delta)$ can be expressed in terms of the Pearson ratios (see (3) and (4)), then This shows that, irrespective of the choice of δ, if all the principal coordinates along all dimensions lie at the origin of the correspondence plot then the total inertia, and hence the Cressie-Read family of divergence statistics, is zero. The further away from the origin a point lies, the greater the impact of its category on the association between the categorical variables.

Interpoint distances
Consider now the squared distance between the $i$th and $i'$th row profiles for some fixed value of δ. This distance is and may also be expressed in terms of the power transformation of the Pearson ratio so that This second distance measure is equivalent to Greenacre (2009, eq. (17)) in his description of power transformations in correspondence analysis. Substituting the right hand side of (7) into (15) and simplifying yields and is the squared Euclidean distance between the $i$th and $i'$th row profiles in the optimal correspondence plot. This result shows that if two row profiles (whose elements are raised to the power of δ) are equivalent then they will share identical positions in the correspondence plot. Similarly, two points that have an identical position in the correspondence plot means that their categories have identical profiles. This feature also means that two rows (say) with equivalent profiles can be combined without impacting on their position in the low-dimensional display so that the property of distributional equivalence is preserved for all δ; see Greenacre (1984, pp. 65-66) and Lebart et al. (1984, p. 35). It is worth noting that (15) is equivalent to Cuadras & Cuadras (2006, eq. (13)) where δ is akin to their power $1 - \alpha$. While Cuadras & Cuadras (2006) confined their attention to $\delta \in [1/2, 1]$, they also showed that the property of distributional equivalence is satisfied in their framework.
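The link between profile distances and distances in the full-dimensional plot can be illustrated numerically. The sketch assumes the divergence residual form $\sqrt{p_{i\bullet}p_{\bullet j}}\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]/\delta$, from which the squared distance between two rows works out as $\sum_j (p_{\bullet j}/\delta^2)\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - (p_{i'j}/(p_{i'\bullet}p_{\bullet j}))^{\delta}]^2$. Both functions and the table are hypothetical.

```python
import numpy as np

def row_coordinates(P, delta):
    """Row principal coordinates on all dimensions (assumed residual form)."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    R = np.sqrt(E) * ((P / E) ** delta - 1) / delta
    A, lam, _ = np.linalg.svd(R, full_matrices=False)
    return (A / np.sqrt(r)[:, None]) * lam

def profile_distance_sq(P, i, k, delta):
    """Weighted squared distance between power-transformed Pearson ratios of rows i and k."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    ratio = P / np.outer(r, c)
    diff = (ratio[i] ** delta - ratio[k] ** delta) / delta
    return np.sum(c * diff**2)

# Hypothetical 4 x 3 table of counts, for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17], [11, 6, 14]], dtype=float)
P = N / N.sum()

for delta in (0.5, 2 / 3, 1.0):
    F = row_coordinates(P, delta)
    plot_d2 = np.sum((F[0] - F[2]) ** 2)           # squared Euclidean distance in the plot
    prof_d2 = profile_distance_sq(P, 0, 2, delta)  # distance computed from the profiles
    print(f"delta = {delta:.3f}  plot: {plot_d2:.6f}  profiles: {prof_d2:.6f}")
```

At $\delta = 1/2$ the profile distance reduces to four times $\sum_j(\sqrt{p_{ij}/p_{i\bullet}} - \sqrt{p_{i'j}/p_{i'\bullet}})^2$, which does not involve the column marginal proportions, and at $\delta = 1$ it is the chi-squared distance.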

The Divergence Correlation Model
Any departure from independence for the $(i, j)$th cell of a contingency table can be considered by assessing how different $r_{ij}(\delta)$ is from zero. To do so, rearranging (7) leads to the saturated model where the left hand side is the reconstituted value of $p_{ij}$ for the chosen value of $\delta \neq 0$. Equation (16) shows that, irrespective of the choice of δ, when the row and column variables are completely independent (so that $\lambda_m(\delta) = 0$ for $m = 1, \dots, M^*$) then $p_{ij}(\delta) = p_{i\bullet}p_{\bullet j}$ as expected, for $i = 1, 2, \dots, I$ and $j = 1, 2, \dots, J$.
By using the row and column principal coordinates defined by (10) and (11), $p_{ij}$ may be estimated, for any $\delta \neq 0$, from and is the saturated correspondence model of the Cressie-Read family of divergence statistics. This model confirms that the origin of the correspondence plot coincides with complete independence between the row and column variables, irrespective of the choice of δ. It also shows that points far from the origin reflect a deviation from what we expect under independence for those row and/or column categories. We shall not discuss here how far from the origin a principal coordinate needs to be for determining the category's contribution to the association but the interested reader may refer to Alzahrani et al. (2023). There are a variety of additional options that are available in the literature and for more information one may consult Beh & Lombardo (2014, Chapter 8) and the references mentioned therein. While Cressie & Read (1984, p. 462) recommend that $\delta \in [0.3, 2.7]$ when using (2) for inferential purposes, a feature of (16) and (17) is that $p_{ij}(\pm\infty) \to p_{i\bullet}p_{\bullet j}$. Thus, the row and column points in an optimal (or even a low) dimensional correspondence plot will get closer to the origin as $\delta \to \pm\infty$. The exceptions to this are those row and/or column categories that play a very dominant role in defining the association structure between the variables. In the case when $\delta \to 0$, and is similar (but not identical) to the RC model of Goodman (1979) and Rom & Sarkar (1992, eq. (2.4)).
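The reconstitution of the cell proportions can be sketched as follows. The model used here, $p_{ij} \approx p_{i\bullet}p_{\bullet j}\,(1 + \delta x_{ij})^{1/\delta}$ with $x_{ij}$ the rank-$M$ SVD approximation of the adjusted residual $[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]/\delta$, is an assumption obtained by inverting that residual form; the function name `reconstitute` and the table are hypothetical.

```python
import numpy as np

def reconstitute(P, delta, M=None):
    """Reconstitute cell proportions from the first M dimensions (assumed model, delta != 0)."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    R = np.sqrt(E) * ((P / E) ** delta - 1) / delta   # assumed divergence residuals
    A, lam, Bt = np.linalg.svd(R, full_matrices=False)
    M = len(lam) if M is None else M
    # Rank-M approximation of the adjusted residual (ratio**delta - 1) / delta
    x = ((A[:, :M] * lam[:M]) @ Bt[:M]) / np.sqrt(E)
    return E * (1 + delta * x) ** (1 / delta)

# Hypothetical 4 x 3 table of counts, for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17], [11, 6, 14]], dtype=float)
P = N / N.sum()

full_half = reconstitute(P, delta=0.5)        # saturated model, delta = 1/2
low_half = reconstitute(P, delta=0.5, M=1)    # one-dimensional model, delta = 1/2
low_one = reconstitute(P, delta=1.0, M=1)     # one-dimensional classical model

print("max error, saturated delta = 1/2:", np.abs(full_half - P).max())
print("smallest reconstituted value, delta = 1/2, M = 1:", low_half.min())
print("smallest reconstituted value, delta = 1,   M = 1:", low_one.min())
```

With $\delta = 1/2$ the exponent is 2, so the reconstituted values are squares and cannot be negative however few dimensions are retained. With $\delta = 1$ the bracketed term is used as is, so nothing prevents a negative value in a reduced-rank model, although it need not occur for every table.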

Four Special Cases
We now turn our attention to examining the features of the aforementioned framework when $\delta = 0, 1/2, 2/3$ and 1. They lead to variants of correspondence analysis where the features underlying them are based on the modified log-likelihood ratio statistic ($M^2$), the Freeman-Tukey statistic ($T^2$), the second-order approximation to the Cressie-Read statistic ($CR^*$) and Pearson's statistic ($X^2$), respectively.

Correspondence Analysis and $M^2$
While (3) does not exist for $\delta = 0$, as $\delta \to 0$ the SVD of the matrix of divergence residuals can be performed so that, for the $(i, j)$th element, or, equivalently, from the GSVD of the matrix of adjusted divergence residuals, respectively. For a zero cell frequency, there are various strategies that can be considered, such as the formal procedures described by Ishii-Kuntz (1994). One may also simply replace the zero cell frequency with a small positive value such as 0.2 (Evers & Namboodiri, 1979), 0.05 (Beh & Lombardo, 2014) or $10^{-8}$ (Clogg & Eliason, 1987, p. 13). For the sake of simplicity, we shall replace the zero cell frequency with 0.05 in the application that is discussed in Section 7.
Determining the limiting value of (14) as $\delta \to 0$, and then simplifying, leads to the squared distance between the $i$th and $i'$th row profiles so that categories with identical profiles will have an identical position in the correspondence plot. For such a plot, the total inertia is measured in terms of $M^2$. Equation (20) was briefly discussed by Greenacre (2009, p. 3113) where $f^*_{im} = f_{im}(0)$. The proportion $p_{ij}$ can be reconstituted by determining the limiting value of (17) when $\delta \to 0$ leading to (18), so that, for $M < M^*$, a non-negative value will be guaranteed.

Correspondence Analysis and $T^2$
Substituting $\delta = 1/2$ into (5) yields the SVD of the matrix of Freeman-Tukey residuals where the $(i, j)$th element is Here $a_{im}(1/2)$ and $b_{jm}(1/2)$ are the $m$th elements of the $i$th left and $j$th right singular vectors of this matrix, respectively, where An equivalent approach to performing this variant of correspondence analysis is to substitute $\delta = 1/2$ into (7) leaving us with the GSVD of the centred square root of the Pearson ratios so that Beh et al. (2018) described the correspondence analysis of a two-way contingency table using $r_{ij}(1/2)$ and showed its link to the Freeman-Tukey statistic. Cuadras & Cuadras (2006) and Cuadras et al. (2006) also considered a rescaled version of this residual (to accommodate a different scaling of the matrix of singular vectors) in their discussion of HDD without linking their approach to the Freeman-Tukey statistic.
We can assess the difference between the square root of two row profiles, say, by substituting $\delta = 1/2$ into (14) yielding and is the Hellinger distance between the $i$th and $i'$th row profiles. This distance can also be expressed in terms of the squared Euclidean distance of their principal coordinates so that where This distance measure was described by Greenacre (2009, eq. (17)) and, in terms of $T^2$, by Beh et al. (2018). Beh et al. (2018, p. 79) also discussed the role of $T^2$ and $r_{ij}(1/2)$ in terms of row (and column) Hellinger distances in a low-dimensional correspondence plot. Their discussion supported the comments of Domenges & Volle (1979), Rao (1995) and Cuadras & Cuadras (2006) who strongly advocated the use of Hellinger distances in correspondence analysis.
The reconstituted value of $p_{ij}$ can be obtained by substituting $\delta = 1/2$ into (17) yielding which is equivalent to Beh et al. (2018, eq. (15)). Thus, this result will yield a positive reconstituted cell proportion for all cells of the contingency table (even when considering fewer than $M^*$ components from the SVD), a property not guaranteed with the classical correspondence analysis of $N$ (when $\delta = 1$).

Correspondence Analysis and $CR^*$
One variant of correspondence analysis that has not been discussed is when the association is described using the Cressie-Read statistic. While the framework cannot be applied directly to this measure, it can be expressed in terms of its second-order Taylor series approximation by substituting $\delta = 2/3$ into (2). Doing so means that a visual summary of the association can be performed by applying an SVD to the matrix of Cressie-Read residuals, $r_{ij}(2/3)$, so that This variant is equivalent to performing the GSVD on the matrix of centred power (of 2/3) transformed Pearson ratios so that The weighted squared Euclidean distance between the $i$th and $i'$th transformed row profiles can be determined by substituting $\delta = 2/3$ into (2) giving so that such a distance measure relies less on the column marginal proportions than if $\delta = 0$ or $\delta = 1$ were considered. This suggests that, while not sharing the distance property that is unique to the Hellinger distance, a correspondence analysis based on $CR^*$ might be preferable to a correspondence analysis that is based on the modified log-likelihood ratio statistic (LRA) or Pearson's statistic (classical correspondence analysis), given the symmetric nature that is assumed of the association between the categorical variables.

Correspondence Analysis and $X^2$
The classical approach to correspondence analysis can be performed by substituting $\delta = 1$ into (5) so that where $\lambda_m = \lambda_m(1)$ while $a_{im} = a_{im}(1)$ and $b_{jm} = b_{jm}(1)$ have the property Alternatively, substituting $\delta = 1$ into (7) means that a GSVD of the matrix of centred Pearson ratios can be performed so that where $a_{im}/\sqrt{p_{i\bullet}}$ and $b_{jm}/\sqrt{p_{\bullet j}}$ are constrained by (8) and (9), respectively, for $\delta = 1$; this classical approach to correspondence analysis has been described in great detail by many, including, but certainly not limited to, Greenacre (1984), Lebart et al. (1984), Beh (2004) and Beh & Lombardo (2014).
By substituting $\delta = 1$ into (14), one obtains where (25) is the same chi-squared distance between the $i$th and $i'$th row profiles that is discussed throughout the correspondence analysis literature so that $X^2$ is used as the basis for assessing departures from independence.
Because the first-order Taylor series expansion of the natural logarithm function is $\ln x \approx x - 1$, for $0 < x \le 2$, (24) can be viewed as a first-order Taylor series approximation of (19). Thus, LRA will produce increasingly similar results to those obtained using classical correspondence analysis as $p_{ij}/(p_{i\bullet}p_{\bullet j}) \to 1$; a feature discussed by Cuadras & Cuadras (2006) and Greenacre (2009). In this case, because $p_{ij}/(p_{i\bullet}p_{\bullet j}) \to 1$ implies that the categorical variables approach complete independence, the two configurations will approach the origin of their display so that $T^2$ and $X^2 \to 0$.
The $(i, j)$th cell proportion can be reconstituted by substituting $\delta = 1$ into (16) giving and is the saturated RC correlation model that has been reviewed extensively in the categorical data analysis literature; see, for example, Kateri (2014, eq. (7.5)). Unlike (18) and (22), (26) is prone to producing negative reconstituted values of $p_{ij}$ in the unsaturated case, especially if $M \ll M^*$.

Further Insights Into LRA and HDD
So far, we have compared and contrasted the results obtained by performing a SVD on the matrix of the various divergence residuals (or a GSVD on the adjusted version of these residuals).
In particular, we have described that the LRA method of Greenacre (2009) and the HDD method of Cuadras & Cuadras (2006) are special cases when δ ¼ 0 and δ ¼ 1=2, respectively.We now provide additional insight and perspectives into the features of these two methods.

Benefits of $\delta = 1/2$
Rao (1995) points out that there are benefits in using the Hellinger distance (when $\delta = 1/2$; see (21)) instead of the chi-squared distance ($\delta = 1$; see (25)). One key benefit is that the Hellinger distance of two row profiles (say) does not rely on any information contained in the column marginals; we described this feature in Section 5.2. In fact, (14) can be expressed as a weighted/power transformed version of the Hellinger distance as where $w_j(\delta) = p_{\bullet j}/(2\delta p_{\bullet j}^{\delta})^2$ so that the Hellinger distance of the row profiles is observed only when $\delta = 1/2$ and a chi-squared distance arises only when $\delta = 1$. Note that as $\delta \to \infty$ the impact of the column marginal proportions becomes more apparent, although it diminishes as $\delta \to -\infty$.
We can consider such extreme values of δ when identifying the row and column pairs whose expected values are more, or less, than what would be expected (in the extreme case) under the hypothesis of independence.

On the value of $M^*$
A well known property of classical correspondence analysis is that $M^* = \min(I, J) - 1$, whereas HDD requires $M^* = \min(I, J)$ dimensions (Cuadras & Cuadras, 2006; Cuadras et al., 2006). We have also identified this feature when $\delta = 1$ and $\delta = 1/2$, respectively, and observe that when $\delta = 0$ then $M^* = \min(I, J)$, as it is for LRA. Cuadras & Cuadras (2006) and Cuadras et al. (2006) also state that the additional dimension when using the Hellinger distance is a drawback. The inclusion of an additional dimension may be problematic unless one focuses instead on the quality of the low-dimensional plot used to visualise the association. While the Hellinger distance (for HDD, see (21)) and the logarithmic distance (for LRA, see (20)) require one more dimension than classical correspondence analysis, an assessment of the relative size of the principal inertia values can be made. That is, for a two-dimensional correspondence plot, the inclusion of an extra dimension may have a negligible effect on $(\lambda_1^2(\delta) + \lambda_2^2(\delta))/\phi^2(\delta)$ because the total inertia $\phi^2(\delta)$ may be significantly smaller than $\phi^2(1)$. As Cuadras et al. (2006) show in their example, as we do in ours in Section 7, the extra dimension needed for $\delta = 0$ or $\delta = 1/2$ is offset by the improvements in the quality of the two-dimensional correspondence plot yielded from performing a correspondence analysis using these values of δ.
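The dependence of the number of dimensions on δ can be checked by counting the non-zero singular values of the residual matrix. The sketch assumes the divergence residual form $\sqrt{p_{i\bullet}p_{\bullet j}}\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]/\delta$, with the log-ratio used at $\delta = 0$; the function name, the tolerance and the table are hypothetical choices.

```python
import numpy as np

def n_dimensions(P, delta, tol=1e-10):
    """Count the non-zero singular values of the assumed divergence residual matrix."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    E = np.outer(r, c)
    ratio = P / E
    power = np.log(ratio) if delta == 0 else (ratio**delta - 1) / delta
    singular_values = np.linalg.svd(np.sqrt(E) * power, compute_uv=False)
    return int(np.sum(singular_values > tol))

# Hypothetical 4 x 3 table of counts, so min(I, J) = 3; for illustration only.
N = np.array([[20, 10, 5], [8, 15, 12], [4, 9, 17], [11, 6, 14]], dtype=float)
P = N / N.sum()

for delta in (0, 0.5, 2 / 3, 1.0):
    print(f"delta = {delta:.3f}  dimensions = {n_dimensions(P, delta)}")
```

The loss of one dimension at $\delta = 1$ arises because the Pearson residuals are exactly centred with respect to the marginal proportions, which forces one singular value to be zero; for the other values of δ no such exact linear constraint holds in general.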

On the Choice of δ
Performing a correspondence analysis where (2) rests at its foundations means that virtually any value of δ can be considered. In contrast, for the LRA of Greenacre (2009), δ was confined to lie within the interval $[0, 1]$, choosing $\delta = 0, 0.25, 0.5, 0.75$ and 1, and (while not being a goal of his paper) doing so allows for direct comparisons to be made of the configuration of principal coordinates for these values. Thus, the framework described earlier shows that, not only are HDD and LRA special cases obtained using the Cressie-Read family of divergence statistics, but that virtually any value of δ can be considered. A caveat to this though is that, depending on the value, negative values of δ can lead to some computational problems due to the SVD of (5).

Affiliations and the Nobel Prize Data
Since its inception in 1901, the Nobel Prize has been awarded to successful recipients in the fields of physics, chemistry, physiology and medicine, peace, literature and economics. We shall be examining the association between the Prize awarded (between 1901 and 2022) and the Country of affiliation where such affiliations largely comprise universities and other internationally recognised research organisations. This data is given in Table 1 and comes directly from the URL https://www.nobelprize.org/prizes/facts/lists/affiliations.php. A total of 790 affiliations are listed on the website so that Table 1 represents over 95% of all affiliations. We point out that the row category 'NA' is not a country but represents recipients who, according to the website, are affiliated with an institution that is not centrally located in a single country. The data set also does not contain the full list of Peace prize winners but does include the multiple affiliations of some Nobel laureates.
We can conclude that there is a statistically significant association between the Country and Prize variables. To visually investigate the nature of this association, we shall perform correspondence analysis using the framework described earlier for $\delta = 0, 1/2, 2/3$ and 1.

A Correspondence Analysis: $\delta = 0, 1/2, 2/3, 1$

The correspondence plots
Figure 1 gives the two-dimensional correspondence plots where (a) $M^2$, (b) $T^2$, (c) $CR^*$ and (d) $X^2$ are used as the numerical basis for visualising the association between Country and Prize; all four plots depict at least 80% of the association between the variables and so are very good visual depictions of the association. Table 2 summarises the various numerical features for $\delta = 0, 1/2, 2/3$ and 1 that complement the plots in Figure 1. These are the first two squared singular values ($\lambda_1^2(\delta)$ and $\lambda_2^2(\delta)$), their cumulative percentage contribution to the total inertia and the test statistic, $CR^*(\delta)$. If one defines the 'best' correspondence plot to be the one that explains more of the association than the others, then Figure 1a ($\delta = 0$) is the 'best' and depicts slightly more than 90% of this association, while $\delta = 1/2$ accounts for nearly 85% and $\delta = 2/3$ for nearly 84% of the association; the 'worst' of these four options is $\delta = 1$, which visually depicts about 82% of the association. Thus, based on this criterion alone, one might assess LRA to be the most suitable analysis to perform while the classical analysis is the least preferable. There are alternative ways to define what might be considered the 'best' or 'worst' analysis, and we remain cautious about giving a preference for one over another at this stage. We discuss this issue further in Sections 7.4 and 8.
Suppose we now compare the configuration of the four correspondence plots of Figure 1. They all show that Austria and Belgium have virtually the same profile distribution (irrespective of the power that is applied to the elements of their profile). Thus, they have a similar impact on the association, as do Sweden and the United Kingdom. These findings are apparent as each pair has the same position along the first two dimensions. We can also see that Russia/USSR has a similar impact on the association to the Netherlands. The plots also suggest a similar relative distribution of prizes for the USA and Denmark. There are also some obvious differences in the four configurations. For example, $\delta = 0$ shows that, while both sets of categories are centred at the origin, the configuration of points for the Nobel Prize categories is located more on the left of the plot while the Country categories are located on the right. Comparing the four plots for the 'Economic' and 'Peace' prizes shows that when $\delta = 1/2$ and (while less apparent) $\delta = 2/3$, they appear to have similar profile distributions and so have a similar impact on the association in Table 1. For $\delta = 0$ and $\delta = 1$ they are located further from each other suggesting that, when compared with $\delta = 1/2$, there is a strong presence of over-dispersion for these categories. This should be of no surprise. Haberman (1973) shows Agresti (2013, p. 80) said such a variance averages $(1 - 1/I)(1 - 1/J) < 1.0$ so that in both cases the variance of $n_{ij}$ exceeds its expectation and so over-dispersion is present. We now turn our attention to discussing this issue.
Table 2. Some key numerical summaries from the SVD of the matrix of divergence residuals for Table 1.

On over-dispersion and related matters
A visual assessment of the extent to which over-dispersion is present in Table 1 can be made by following the procedure of Efron (1992). He recommended plotting $\sqrt{\mathrm{Var}(n_{ij})}$ against $\sqrt{n}\,r_{ij}(1)$, yielding what we refer to as the dispersion plot. Because the value of $\delta$ produces different variance quantities, we adapt his procedure as follows. Rather than considering $\sqrt{n}\,r_{ij}(1)$ (which is appropriate only if $\delta = 1$), we examine the dispersion plots using the residual $R_{ij}(\delta)$ [...] where $r_{ij}(\delta)$ is defined by (3). This residual is an asymptotically standard normal random variable so that the variance of $n_{ij}^{\delta}$ [...]; see Bishop et al. (1975, example 14.6-3).
Therefore, the emendation to the dispersion plot proposed by Efron (1992) is to consider plotting $\sqrt{\mathrm{Var}(n_{ij}^{\delta})}$ against $R_{ij}(\delta)$. For example, when $\delta = 1$ then the square root of the variance is [...] The limiting value of the variance as $\delta \to 0$ is zero.
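As a rough guide to how the variance of $n_{ij}^{\delta}$ depends on $\delta$, a first-order delta-method sketch is given here. It assumes that $n_{ij}$ is a Poisson random variable with mean $\mu_{ij}$ (under independence, $\mu_{ij} = n p_{i\bullet} p_{\bullet j}$); the symbol $\mu_{ij}$ and this approximation are our own illustration and may differ in form from the expression in the cited example.

```latex
\mathrm{Var}\!\left(n_{ij}^{\delta}\right)
  \approx \left(\delta\,\mu_{ij}^{\delta - 1}\right)^{2}\mathrm{Var}\!\left(n_{ij}\right)
  = \delta^{2}\,\mu_{ij}^{2\delta - 1},
\qquad\text{so that}\qquad
\begin{cases}
  \delta = 1:   & \mathrm{Var} \approx \mu_{ij}, \\
  \delta = 2/3: & \mathrm{Var} \approx \tfrac{4}{9}\,\mu_{ij}^{1/3}, \\
  \delta = 1/2: & \mathrm{Var} \approx \tfrac{1}{4}, \\
  \delta \to 0: & \mathrm{Var} \to 0 .
\end{cases}
```

On this approximation the variance is free of $\mu_{ij}$ only for $\delta = 1/2$ (and, trivially, in the limit $\delta \to 0$), which agrees with the limiting value of zero noted above.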
While Cressie & Read (1984) considered a general family of chi-squared random variables resulting in (1), the history of the distributional properties of $n_{ij}^{\delta}$, where $n_{ij}$ is a Poisson random variable, is lengthy and is not wholly independent of our discussion, as we shall briefly outline below.
Figure 2 provides a comparison of the dispersion plots for $\delta = 1$, $1/2$, $2/3$ and $\delta = 0$ where all the axes are consistently scaled except for the vertical axis of plot (d). Figure 2a shows that there is more variation (around zero) of the $R_{ij}(1)$ values than there is for the other three values of $\delta$ we have considered here. Because the $R_{ij}(1)$ values increase as the square root of the variance of $n_{ij}$ increases, Efron (1992) concludes this is evidence of over-dispersion in the contingency table. There is a constant variance for the $n_{ij}^{\delta}$ values when $\delta = 0$ and $1/2$, thereby providing a stabilised variance measure. However, there is far more variation in the residuals when $\delta = 0$ than there is when $\delta = 1/2$, and this is because $R_{ij}(\delta)$ involves dividing by $\delta \to 0$. It should be of no surprise that the $R_{ij}(2/3)$ residuals are similar to those obtained when $\delta = 1/2$; however, the square root of the variance of $n_{ij}^{2/3}$ is not perfectly constant. It is worth commenting here that Anscombe (1953) compared the power transformations $n_{ij}^{\delta}$ for $\delta = 1/2$ and $2/3$ and noted that, while the square root transformation does stabilise the variance, it only halves the skewness of the $R_{ij}(1/2)$ values (a feature not shown here). However, Anscombe (1953, p. 229) does say that the 2/3 transformation is highly successful in normalising the distribution of the residuals even for values of $n p_{i\bullet} p_{\bullet j}$ that are as low as 4; see also McCullagh & Nelder (1984, p. 38) for similar comments, who referred to $R_{ij}(2/3)$ as the Anscombe residual. Anscombe (1953, p. 229) goes on to point out that for $\delta = 2/3$ we have almost perfect normalisation, but at the cost of a non-constant variance; the latter feature is shown in Figure 2c.
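The contrast between a stabilised variance at $\delta = 1/2$ and a non-constant variance at $\delta = 2/3$ can be checked with a small simulation. The sketch below is illustrative only: it assumes Poisson counts with a hypothetical mean `mu`, and compares the simulated variance of $n^{\delta}$ with the first-order approximation $\delta^{2}\mu^{2\delta - 1}$ (an approximation we supply, not a formula quoted from the paper). The function name `simulated_vs_approx` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

def simulated_vs_approx(mu, delta, size=200_000):
    """Return (simulated, first-order approximate) variance of n**delta, n ~ Poisson(mu)."""
    n = rng.poisson(mu, size=size).astype(float)
    simulated = np.var(n**delta)
    approximate = delta**2 * mu ** (2 * delta - 1)
    return simulated, approximate

for delta in (1.0, 0.5, 2 / 3):
    for mu in (10, 50, 200):
        sim, approx = simulated_vs_approx(mu, delta)
        print(f"delta={delta:.3f}  mu={mu:4d}  simulated={sim:8.3f}  approx={approx:8.3f}")
```

For $\delta = 1/2$ the simulated variance should stay near $1/4$ whatever the mean, whereas for $\delta = 1$ and $\delta = 2/3$ it should grow with the mean, more slowly in the latter case.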
We direct the interested reader to Beh & Lombardo (2020) for more information on the various strategies that can be adopted for dealing with over-dispersion in a contingency table when performing correspondence analysis.
Figure 3b shows that the percentage of the total inertia explained by the first principal axis is 80.32% at $\delta = 0$ then increases to (asymptotically) 100% for values of $\delta$ that are slightly greater than zero. This percentage then decreases steadily for $0.2 < \delta < 1.5$. The second principal inertia increases steadily in this interval from 10.10% at $\delta = 0$ to 27.1% at $\delta = 1.5$. While this might suggest that the quality of a two-dimensional display for these $\delta$ values should be fairly similar, this is not the case because the total inertia drops quickly for values of $\delta$ less than about 0.8 and then stabilises; see Figure 3a. Thus, the quality of the first two dimensions varies for $\delta \in (0, 3/2]$. Suppose that one wishes to identify those categories that dominate the association between the variables of Table 1. Such dominance can be determined by choosing very large values of $\delta$. While not shown, the quality of the two-dimensional correspondence plot quickly approaches 100% for $\delta > 4$. For these $\delta$ values, the impact of the second dimension reduces to near zero so that all of the association can be depicted using a single (the first) dimension. As a result of this feature, all categories lie relatively close to the origin except for the 'Peace' and 'France' categories, which identifies these two categories as dominant contributors to the association. In fact, the largest of the 75 Pearson's ratios of Table 1

The 'worst' correspondence plot
We now turn our attention to describing the 'worst' quality display for $\delta \in [0, 3/2]$; by 'worst', we cautiously refer to the value of $\delta$ that gives the poorest quality two-dimensional correspondence plot used for visually describing the association between the Country and Prize variables of Table 1. In this case, $\delta = 1.5$ produces a correspondence plot that describes 76.31% of the association between the variables. For this value of $\delta$, the first dimension accounts for 49.21% of the association while 27.10% is depicted using the second dimension. Figure 4a gives the resulting two-dimensional correspondence plot and shows that the configuration looks similar to the plot obtained when $\delta = 1$ (Figure 1d).

The 'best' correspondence plot
For values of $\delta \in [0, 3/2]$ the 'best' quality display for Table 1 is achieved not when $\delta = 0$ but when $\delta$ is very close to zero. Figure 4b gives the two-dimensional correspondence plot when $\delta = 0.01$, which accounts for 99.96% of the association in Table 1; smaller positive values of $\delta$ will visually depict slightly more of the association although the configuration of the points is very similar. Note that for Figure 4b, the first dimension accounts for 90.81% of the total inertia while the second dimension accounts for 9.15%. It also highlights three clusters of countries that exhibit similar profile distributions under the transformation of their profile elements. They are (1) Switzerland, Japan, NA, Canada, Italy, Austria and Belgium (left of the origin), (2) UK, Germany, Sweden, Russia, the Netherlands and Denmark (above the origin), and (3) USA and France (closest to the origin).

A cautionary note
While we have used the terms 'worst' and 'best' to describe the correspondence plots in Figure 4, we use them cautiously. This also does not necessarily mean the values of $\delta$ that give the 'worst' or 'best' plots are the 'worst' or 'best' values for the data structure or for the analysis; hence, the reason for the quotation marks. We do concede that 'worst' and 'best' may be assessed using criteria different from how we have described them in Sections 7.4.2 and 7.4.3. For example, because the configuration of points in Figure 4a does not provide a clear distinction of some of the Country and Prize categories, one may wish to treat the correspondence plot for $\delta = 0$ as the 'best' plot because it does help to provide such distinctions. Also, of the four $\delta$ values considered earlier, it is the only option that displays more than 90% of the association. Additionally, the distance measures between the row (or column) points are interpretable (in terms of the differences in the natural logarithm of the profile elements), the total inertia is based on the modified log-likelihood ratio statistic and the variant of correspondence analysis on which it is based (LRA) is well documented. However, because more than a quarter (21 of the 75) of the cells of Table 1 contain a zero frequency, LRA may not be ideal unless a small quantity is added to each of them (as we have carried out earlier) or an alternative strategy for dealing with the zero cell frequencies is adopted. Adding to this issue is the presence of other small, and large, cell frequencies, which means that the data is subject to over-dispersion, as Section 7.3.2 has shown. To help deal with this issue one may set $\delta = 1/2$ as a more appropriate option to LRA so that $T^2$, and not $M^2$, is used as the basis for calculating the total inertia. Such a choice may be made despite the relatively low proportion of the total inertia depicted using two dimensions. However, using the Freeman-Tukey statistic to calculate the total inertia deals with over-dispersion issues and allows for comparisons of the profiles to be made using the Hellinger distance. As a result, one may also view $T^2$ as a preferable option because the position of one set of principal coordinates is not influenced by the marginal information of the other variable.

Modelling the Association
While the choice of $\delta$ may be made based on the structure of the data, the resulting correspondence plot and other criteria, its selection may also be made based on the quality of reconstituting the cell frequencies from the results obtained from the two-dimensional correspondence plots of Figure 1. Using (16), the reconstituted cell frequencies for $\delta = 0$, $1/2$ and $1$ are presented in Table 3 for $M^* = 2$; the first set of values are obtained for $\delta = 1$, the second set of values that appear in parentheses are for $\delta = 1/2$ and those in brackets are for $\delta = 0$ (where the zero cell frequencies have been replaced with 0.05 to avoid taking the natural logarithm of a zero count). The reconstituted cell frequencies for $\delta = 2/3$ can also be obtained and are similar to those obtained when $\delta = 1/2$ and so they have been omitted from Table 3.
An inspection of Table 3 for $\delta = 1$ shows that there are two reconstituted cell frequencies that are negative and they appear in bold text; ('Germany', 'Economics') has a value of $-2.06$ (to two decimal places) and ('Germany', 'Peace') has a value of $-0.01$. One may therefore consider the classic reconstitution formula of (26) to be an unsuitable model to reconstitute the cell frequencies for $M^* = 2$. Note that the reconstituted values for $\delta = 0$ and $\delta = 1/2$ in Table 3 are all non-negative.
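Why $\delta = 1$ can yield negative reconstituted frequencies while $\delta = 0$ and $\delta = 1/2$ cannot may be seen from a small sketch. It is not the paper's formula (16) or (26), which are not reproduced here; it assumes the divergence residual has the form $r_{ij}(\delta) = \delta^{-1}\sqrt{p_{i\bullet}p_{\bullet j}}\,[(p_{ij}/(p_{i\bullet}p_{\bullet j}))^{\delta} - 1]$, truncates its SVD to `ndim` dimensions and then inverts the transformation. The function name `reconstitute` and the table of counts are hypothetical.

```python
import numpy as np

def reconstitute(N, delta, ndim):
    """Rank-`ndim` reconstitution of the counts under an ASSUMED residual form.

    Only delta in {0, 1/2, 1} is handled, so that the inverse transformation
    is always real valued.
    """
    if delta not in (0, 0.5, 1):
        raise ValueError("this sketch handles delta = 0, 1/2 or 1 only")
    n = N.sum()
    P = N / n
    E = P.sum(axis=1, keepdims=True) * P.sum(axis=0, keepdims=True)
    ratio = P / E
    if delta == 0:
        R = np.sqrt(E) * np.log(ratio)
    else:
        R = np.sqrt(E) * (ratio**delta - 1.0) / delta
    # Best rank-`ndim` approximation of the residual matrix
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    R_hat = (U[:, :ndim] * s[:ndim]) @ Vt[:ndim]
    Z = R_hat / np.sqrt(E)
    if delta == 0:
        P_hat = E * np.exp(Z)                          # exponential: always positive
    else:
        P_hat = E * (1.0 + delta * Z) ** (1.0 / delta)  # delta = 1/2 squares; delta = 1 is linear
    return n * P_hat

# Hypothetical 5 x 4 table of counts (NOT Table 1); no zero cells
N = np.array([
    [20.0,  5.0, 10.0,  3.0],
    [ 8.0, 15.0,  6.0,  9.0],
    [ 4.0,  9.0, 18.0,  7.0],
    [12.0,  7.0,  3.0, 16.0],
    [ 6.0, 11.0,  5.0,  2.0],
])

for delta in (1, 0.5, 0):
    print(delta, reconstitute(N, delta, ndim=1).min())
```

Under this assumed form, the inverse map for $\delta = 1/2$ is a square and that for $\delta = 0$ is an exponential, so neither can produce a negative value, whereas for $\delta = 1$ the map is linear in the truncated residual and its sign is unconstrained.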

Discussion
The discussion of power transformations of profiles in correspondence analysis is not new. One only has to consult Greenacre (2009), Cuadras & Cuadras (2006) and Cuadras et al. (2006) for discussions of this issue. Related to these discussions is the importance of the role played by the Hellinger distance in the construction of a correspondence plot, a feature advocated by Rao (1995) and Cuadras & Cuadras (2006) and explored in the context of the Freeman-Tukey statistic by Beh et al. (2018). However, these features can all be incorporated into a single framework by using the Cressie-Read family of divergence statistics as the primary measure of the association between the variables. Applying an SVD to the matrix of divergence residuals defined by (3) leads to a power transformation of the profiles where the power, $\delta$, can take on any value, although we confined our attention to $\delta \in [0, 3/2]$ even though Cressie & Read (1984) recommend $\delta \in [0.3, 2.7]$ for inferential purposes when using (2). From this SVD, one obtains general association and distance measures, general statements for coordinate systems and association models. It also leads to well-known special cases including the HDD method of Cuadras & Cuadras (2006) and the related approach described by Beh et al. (2018) and LRA (Greenacre, 2009).
With such flexibility in the choice of $\delta$, an obvious question that requires answering is what value of $\delta$ is to be recommended? While we have briefly discussed the 'best' (and 'worst') choices of $\delta$ in Section 7.4 from a purely geometric perspective, its selection can be made based on a variety of different criteria. It may depend on the structure of the data, the output that is generated from the analysis or the ease and interpretability that a value of $\delta$ provides. It may also be chosen according to some other criteria. For example, the classical approach to correspondence analysis ($\delta = 1$) may be recommended simply because it is the most popular approach to correspondence analysis and the analysis that most practitioners are aware of, although familiarity with a technique does not guarantee its suitability. Also, if one is interested in reconstituting the cell frequencies then $\delta = 1$ is prone to producing negative values, as Table 3 and the analysis performed in Beh et al. (2018) show, and so should generally be overlooked for other values that guarantee non-negative values, such as $\delta = 0$ or $1/2$. From a practical perspective, such a reconstitution is rarely performed because the primary purpose of correspondence analysis is to provide a visual understanding of the association between the variables. One could select the value of $\delta$ that produces the correspondence plot that depicts the maximum amount of the association (when measured in terms of the total inertia) that exists between the variables that form the contingency table; we have briefly considered this option in Section 7.4.3. If $\delta$ is not bounded and so is allowed to approach infinity, the percentage of the total inertia that is accounted for in a two-dimensional correspondence plot increases to 100%. This is useful if the sole purpose of the analysis is to detect outlying categories but this is not generally recommended. If $\delta$ is confined to lie within the interval $[0, 3/2]$ (as we have carried out earlier) then some other value of $\delta$ may be adopted. For the analysis of Table 1, since nearly 100% of the association is displayed in a two-dimensional correspondence plot as $\delta \to 0$, choosing $\delta$ to be close to (but not equal to) zero may be preferred. As Figure 4b shows, such a choice does not provide a clear and practical view of the association structure of the variables of Table 1. Thus, one may feel tempted to consider $\delta = 0$ as the most appropriate choice. While its dispersion plot reveals a constant variance of $n_{ij}^{\delta}$ (at zero!), the residual terms artificially increase to infinity when $\delta = 0$ and this will happen irrespective of the data being analysed. Another challenge with using $\delta = 0$ is that because the total inertia has at its heart the modified log-likelihood ratio statistic, problems arise if there are multiple zero cell frequencies in the contingency table. While an ad-hoc solution is to add a small value (say, 0.05) to each zero cell frequency, such a problem is exacerbated when performing a correspondence analysis on sparse contingency tables.
There is one value of $\delta$ that may be considered and avoids many of the problems we have just described: $\delta = 1/2$, and this is our answer to the question we asked earlier. It is the only value of $\delta$ that adequately avoids the problem of over-dispersion by stabilising the variance of $n_{ij}^{\delta}$ and guarantees non-negative reconstituted cell frequencies for M << M*. Choosing $\delta = 1/2$ also ensures that the Freeman-Tukey statistic is the core measure of association. This is advantageous because it is well documented and generally very well understood, although we concede that other values of $\delta$ give measures of association that are also well documented. Because the Freeman-Tukey statistic results from considering $\delta = 1/2$, it ensures that differences between the profiles of a variable are assessed using the Hellinger distance, a feature that has been strongly advocated in the correspondence analysis literature by, for example, Domenges & Volle (1979), Rao (1995), Cuadras & Cuadras (2006) and Beh et al. (2018). Despite this, using the Freeman-Tukey statistic or, equivalently, Hellinger distances has not gained wide appeal in the correspondence analysis literature. Nor has it yet gained wide appeal from a practical perspective.
There are further issues that can be explored that are beyond the scope of this paper. One avenue for further investigation is to generalise this framework for more than two categorical variables so that it can be adopted in the context of multiple correspondence analysis. One could certainly consider applying it to the Burt matrix of a multi-way contingency table. While joint correspondence analysis (JCA) (Greenacre, 1988) is designed to improve the quality of a correspondence plot by removing redundant information from the Burt matrix, further improvements could be made by incorporating the aforementioned framework into JCA. It could also be adapted for multi-way correspondence analysis (Carlier & Kroonenberg, 1996; Lombardo et al., 2021) by incorporating the extensions made to the family of divergence statistics outlined in Pardo (1996), Pardo & Pardo (2003) and Pardo (2010).
Perhaps the greatest potential for any further development in correspondence analysis, at least for growth that aligns with the framework described in this paper, is by expanding its general nature beyond $CR(\delta)$; the Cressie-Read family of divergence statistics is a special case of the more general family of divergence statistics described by Csiszár (1967). For a two-way contingency table, this family takes the form [...] Here, $\mathbf{p}$ is the vector of $p_{ij}$ values and $\mathbf{e}$ is the vector of $p_{i\bullet}p_{\bullet j}$ values while $\phi(x)$, for $x > 0$, is a convex function where $\phi(1) = \phi'(1) = 0$, $\phi''(1) > 0$, $0 \times \phi(p/0) = p \lim_{x \to \infty} \phi(x)/x$ and $0 \times \phi(0/0) = 0$; see, for example, Cressie & Pardo (2000) and Kateri (2018). The $CR(\delta)$ family can be obtained when $\phi(x)$ [...] while the Kullback-Leibler divergence is obtained when $\phi(x) = x\log(x)$. Such a property has the potential to expand the framework on which correspondence analysis is built for visually exploring the association between categorical variables.
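For orientation, the usual way of writing a Csiszár $\phi$-divergence between $\mathbf{p}$ and $\mathbf{e}$ is sketched here, together with the Kullback-Leibler case as a worked check. The label $D_{\phi}$ is ours, and the paper's own display may use different notation or scaling.

```latex
D_{\phi}(\mathbf{p}, \mathbf{e})
  = \sum_{i=1}^{I}\sum_{j=1}^{J} p_{i\bullet}p_{\bullet j}\,
    \phi\!\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right),
\qquad
\phi(x) = x\log(x)
\;\Longrightarrow\;
D_{\phi}(\mathbf{p}, \mathbf{e})
  = \sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij}\log\!\left(\frac{p_{ij}}{p_{i\bullet}p_{\bullet j}}\right).
```

The right-hand expression is the Kullback-Leibler divergence of the observed proportions from the independence products, which is consistent with the special case named in the text.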

FIGURE 3. Solid line is the (a) percentage contribution to the total inertia and (b) principal inertia values of a two-dimensional correspondence plot for $\delta \in [0, 3/2]$. For both plots, the dashed lines refer to the first dimension and the dotted line refers to the second dimension.