A two‐step log‐linear procedure for graphical representation and inference of associations in cross‐classified data for disease diagnosis

The biometrical sciences, and disease diagnosis in particular, are often concerned with the analysis of associations in cross-classified data, for which distance association models provide a graphical interpretation for non-sparse matrices with a low number of categories. In this framework, binary explanatory and response variables are usually present, and analysis based on individual profiles is of great interest. For saturated models, we show that the usual linear relationship for log-linear models is preserved in full dimension under the distance association parameterization. This enables a two-step procedure that facilitates the analysis and the interpretation of associations in terms of unfolding after the overall and main effects are removed. The proposed procedure can deal with cross-classified data for profiles by binary variables, and it is easy to implement using traditional statistical software. For disease diagnosis, the problems of a degenerate solution in the unfolding representation, and of determining significant differences between the profile locations, are addressed. A hypothesis test of independence based on odds ratios is considered. Furthermore, a procedure is proposed to determine the causes of the significance of the test, avoiding the problem of error propagation. The equivalence between a test for equality of odds ratio pairs and the test for equality of location of two profiles in the unfolding representation in disease diagnosis is shown. The results have been applied to a real example on the diagnosis of coronary disease, relating the odds ratios to performance parameters of the diagnostic test.

The diagnosis of a disease is a fundamental phase in clinical practice, prior to the treatment and prognostic phases. The diagnosis of a disease is made through the application of diagnostic tests, also considering the symptomatology of the patient and the covariates related to the disease (eg, gender, the presence of risk factors, family history, etc.). For example, an adult male with fatigue may require an electrocardiogram for the diagnosis of coronary heart disease. However, if the male is young, then fatigue may be due to other causes (eg, anemia or diabetes), and therefore the patient will initially require a laboratory test. Therefore, knowing the covariates related to the disease is also of great help for its correct diagnosis.
In many situations, a preliminary step is taken to decide the most influential variables in a set, but often this selection may not be so clear. Instead of this kind of variable-oriented approach, a person-oriented approach to data analysis might be preferred.1 Any combination of the categories of the variables is called a profile. In the most general case, the data consist of a profile-by-response contingency table in which the profiles represent a single categorical response variable. In this situation, the analysis of associations between profiles of individuals based on diagnostic tests together with covariates (rows), and having or not having a disease (columns), is of great interest. In the clinical setting, this allows greater control over the profiles with the highest risk, which can favor personalized treatment and improve health outcomes.
Associations in cross-classified data can be represented graphically by means of the distance association (DA) model.2-4 It has been shown that the DA model produces the same expected frequencies as the RC(M) model,5 although the graphical interpretation of the associations is easier in the DA model in terms of points in a Euclidean space.2 This is achieved by considering a suitable parameterization of the association term that, together with a transformation for the identification of the parameters in the model, preserves the log-linear relation. In low dimensions, the distance association model is an approximation to the traditional log-linear model and, in general, it is more efficient than a two-step procedure that first performs a traditional log-linear analysis and then represents the associations in terms of Euclidean distances. In a combined procedure of clustering and distance association modeling for sparse data sets, the superiority of a simultaneous strategy over a two-step procedure has been explicitly shown by Vera, de Rooij, and Heiser3 for the latent class distance association model and by Vera and de Rooij4 for the latent block distance association model.
In DA models, associations are analyzed relative to an unconditional unfolding procedure after removing the effect of the overall mean and the main row and column effects, in a log-linear framework. However, when a binary variable is considered as a response, while the profiles are a categorical explanatory variable, a row-conditional representation may also be appropriate. For example, when one of the two variables is binary, a simple and easy-to-interpret conditional representation can be made when the associations are analyzed directly from the observed frequencies (without removing the overall and main effects) using the procedure proposed by Heiser.6 This exploratory procedure relates the position of the points (centroids and vertices) in a simplex with the distances between them and the observed proportions.
An important feature in DA models, and in particular for the diagnosis of a disease, is to statistically determine significant differences between the profile locations in the unfolding representation. This is a very important issue in this context, and one that is not traditionally covered by unfolding procedures. In general, unfolding is based on the relationships between the elements of two sets to be represented, in this case the profiles of the row variable and the two column categories. However, unfolding does not usually address the relationships within the categories of each variable separately, and this is also true of the distance association model. Stability analysis of point locations has been used to determine confidence regions in multidimensional scaling,7,8 but an inference-based procedure in this respect in unfolding, and in particular in the distance association model, would be desirable.
In general, when two qualitative variables are studied in an I × J table, the independence between them is usually studied with the classical Fisher's exact test or the chi-squared test.9 In a distance association model, it is also of interest to study the independence between an explanatory or response variable (such as the presence or absence of a disease) and profiles (eg, based on diagnostic tests and covariates), in particular in terms of estimated expected frequencies. Testing independence is equivalent to testing that all odds ratios between profiles are equal to one. This will be of special interest when the representation of associations is one-dimensional, as occurs in the diagnosis of a disease, given the relationship between odds ratios and distances. It is important to note that studying the independence of all the profiles through all combinations of individual hypothesis tests about the odds ratios at the α error (or, equivalently, through the respective confidence intervals at the level 1 − α) can lead to erroneous conclusions due to the propagation of the α error. In a distance association model for disease diagnosis, this can be related to the location of the points in the unfolding representation, which greatly complements the graphical interpretation.
In this article, we show that, in full dimension, a distance-based parameterization of the association terms leads to a linear decomposition of the expected frequencies equivalent to that given by the traditional log-linear model, which means the distance association model in full dimension is a log-linear model with a special interpretation of the parameters. This result is particularly relevant when the variable of interest is binary, as is usual in the diagnosis of a disease, or is determined by a categorical variable with at most four categories (eg, non-diseased, patients in phase 1, patients in phase 2, and patients in phase 3), which allows the associations to be visualized graphically in full dimension. In this situation, a two-step procedure is proposed that first estimates a log-linear model, for example, using any statistical software familiar to medical researchers, and then represents the estimated associations in full dimension using the parameterization of a distance association model. From a computational point of view, the expected frequencies in any saturated log-linear model coincide with the observed frequencies, so only the DA parameterization of the observed frequencies is needed for the graphical representation, which makes it even easier to use. However, since the model is a true log-linear model, it allows the use of hypothesis tests and confidence intervals for the validation of the traditional log-linear model, while also enabling clinical researchers to make decisions easily based on an exact graphical representation of the associations in low dimension.
Moreover, based on the relationship between the odds ratio and the Euclidean distances in the unfolding representation, we propose an inferential procedure to decide about a degenerate solution10 in the DA model, and also about significant differences between two locations in the unfolding representation in terms of the odds ratio. In this context, when the profiles correspond to results of binary diagnostic tests and categorical covariates, the odds that a profile has the disease can be expressed in terms of measures of accuracy of the diagnostic tests in the different patterns of the covariates. In the same way, the odds ratios between each pair of profiles can be expressed. Thus, if the profiles are obtained from a diagnostic test and two binary covariates, then the odds that a profile has the disease can be written in terms of the post-test odds of the diagnostic test, and the odds ratio between two profiles can be expressed in terms of the likelihood ratios (or post-test odds) of the diagnostic test in the different patterns of the covariates. Therefore, the Euclidean distances in the unfolding representation are related to measures of the quality of the binary diagnostic test in the different patterns of covariates.
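As a numerical illustration of this last point, the sketch below uses hypothetical accuracy values (the sensitivity, specificity, and prevalence are illustrative, not estimates from this article) to show that the odds ratio between the positive-test and negative-test profiles reduces to the ratio of the likelihood ratios, since the pre-test odds cancel:

```python
# Illustrative values only, not taken from the article's data
sens, spec = 0.80, 0.74   # Se and Sp of a binary diagnostic test
prevalence = 0.30

lr_pos = sens / (1 - spec)   # likelihood ratio of a positive result
lr_neg = (1 - sens) / spec   # likelihood ratio of a negative result

# post-test odds = pre-test odds x likelihood ratio
pretest_odds = prevalence / (1 - prevalence)
posttest_odds_pos = pretest_odds * lr_pos
posttest_odds_neg = pretest_odds * lr_neg

# the odds ratio between the two profiles is the ratio of their
# post-test odds, ie, the ratio of the likelihood ratios
odds_ratio = posttest_odds_pos / posttest_odds_neg
print(abs(odds_ratio - lr_pos / lr_neg) < 1e-9)  # True
```

The pre-test odds factor out of the ratio, which is why the odds ratio between two such profiles depends only on the accuracy parameters of the test.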
The rest of the article is organized as follows. In the next section, we formulate the two-step log-linear procedure for estimating and representing associations in cross-classified data, and the equivalence of the log-linear model to the full-dimensional distance association model is shown in terms of their parameterizations. Furthermore, for the case of association with a binary variable, the relationship between independence and equality of location of the points in the unfolding representation is shown in terms of odds ratios. In particular, the relationship between a degenerate solution and independence is also shown. Section 3 introduces the independence test in terms of odds ratios for a general distance association model. Additionally, in the case of association with a binary variable, the independence test is studied in terms of odds ratios, together with its properties, and a method is proposed to study the causes of significance when the independence test is significant. The results are also specified for the situation in which a diagnostic test and two binary covariates are available, relating the odds ratio between two profiles to measures of the accuracy of the binary diagnostic test. In Section 4, a Monte Carlo experiment is performed to study the asymptotic behavior of the hypothesis test introduced above, using eight profiles. In Section 5, the results are applied to the diagnosis of coronary heart disease. In the final section, we summarize the main conclusions drawn.

REPRESENTATION OF ASSOCIATIONS BASED ON DISTANCES
Let us denote by F = (f_ij) an I × J contingency table that collects the counts of combinations of row and column categories, which can represent profiles of variables. We will consider here the particular situation in which the rows represent profiles or categories of an explanatory variable and the columns represent possible stages or characteristics of a disease. Let us define the I × M matrix X and the J × M matrix Y, whose row vectors x_i, i = 1, … , I, and y_j, j = 1, … , J, are the coordinates of the row and column categories of F, respectively, in dimension M.
We use the well-known equivalence of the multinomial and Poisson distributions,9,11 and under the usual Poisson sampling model the counts are considered independent random variables.9 Denoting by μ_ij the expected values, the log-likelihood is given by

log L = Σ_i Σ_j (f_ij log μ_ij − μ_ij) + constant. (1)

In the usual multiplicative form in a log-linear framework for a two-way cross-classification, the expected frequencies can be written as

μ_ij = λ λ_i^R λ_j^C λ_ij, (2)

where λ is the overall scale parameter, λ_i^R is the row effect parameter, λ_j^C is the column effect parameter, and λ_ij is the interaction effect. In the distance association model, it is assumed that the association parameter λ_ij can be expressed in terms of the Euclidean distances, λ_ij = exp(−d²(x_i, y_j)), with d²(x_i, y_j) = Σ_m (x_im − y_jm)². Thus, the greater the association between a row and a column category, the smaller the distance between their corresponding points. Taking the logarithm of (2), we can express the model in log-linear terms (focusing on the squared distances) as

log μ_ij = λ* + λ_i^{R*} + λ_j^{C*} − d²(x_i, y_j), (3)

where λ* = log λ, λ_i^{R*} = log λ_i^R, and λ_j^{C*} = log λ_j^C (note that this model is the distance association model, which in low dimension is log-quadratic in the configuration parameters, but in full dimension is log-linear, as shown in Appendix A). As usual, for identification purposes, the mean of the row effects and the mean of the column effects will be assumed to be zero. To this end, the following parameterization of the expected frequencies2 can be considered.

Equivalence between log-linear and distance association models in full dimension
Given the parameter estimates from (3) obtained by the usual saturated log-linear model (see Appendix A) or by the distance association model in full dimension, let us denote by μ̂ = (μ̂_ij) the I × J matrix of estimated expected frequencies (μ̂ = F, since both are saturated models). To obtain an identified solution, the parameters are expressed as a function of singular values and singular vectors,2 since the singular value decomposition is unique and is characterized by M(M + 2) constraints, with M = min(I, J) − 1 the maximum dimension.10 Let us denote by G = (g_ij) the matrix of entries g_ij = log μ̂_ij, and denote by g the global mean of G, and by g_i· and g_·j the marginal means of the ith row and the jth column of G, respectively. Then, we define λ̂ = g, λ̂_i^R = g_i· − g, λ̂_j^C = g_·j − g, and Λ̂ the matrix of entries λ̂_ij = g_ij − g_i· − g_·j + g. The mean of the values of λ̂_i^R, i = 1, … , I, and of λ̂_j^C, j = 1, … , J, is equal to zero, and Σ_i λ̂_ij = Σ_j λ̂_ij = 0. Then, after this parameterization, the model is characterized by 2 + M(M + 2) further constraints.2 Thus, in full dimension M, two configurations X and Y can be estimated such that (3) holds. This results in a distance association model that is equivalent to the corresponding log-linear model, and therefore a two-step procedure for the parameter estimation is allowed: the parameters of the log-linear model can first be estimated using any statistical software for inferential purposes (or simply by taking the expected, here observed, frequencies), and then the above parameterization can be used to represent the associations in full dimension.
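The two-step procedure can be sketched as follows. This is a minimal illustration on a made-up 3 × 3 table (not real data), assuming the usual splitting of −d² in distance association models: the cross products are recovered from the SVD of the double-centered log frequencies, while the squared norms of the coordinates are absorbed into the main effects.

```python
import numpy as np

def two_step_da(F):
    """Two-step DA parameterization of a saturated log-linear model.

    F : I x J table of positive counts (observed = expected here).
    Returns the overall and main effects and full-dimensional row and
    column coordinates X, Y such that
        log mu_ij = lam + lamR_i + lamC_j - d^2(x_i, y_j).
    """
    G = np.log(F)                       # g_ij = log mu_ij
    g = G.mean()                        # global mean
    gi = G.mean(axis=1, keepdims=True)  # row marginal means
    gj = G.mean(axis=0, keepdims=True)  # column marginal means
    lam, lamR, lamC = g, gi - g, gj - g
    Lam = G - gi - gj + g               # double-centered associations

    # SVD-based identification: Lam / 2 = X Y' gives the cross products
    U, d, Vt = np.linalg.svd(Lam / 2, full_matrices=False)
    M = min(F.shape) - 1                # maximum dimension
    X = U[:, :M] * np.sqrt(d[:M])
    Y = Vt[:M].T * np.sqrt(d[:M])

    # squared norms of the coordinates are absorbed into the main effects
    lamR_star = lamR.ravel() + (X ** 2).sum(axis=1)
    lamC_star = lamC.ravel() + (Y ** 2).sum(axis=1)
    return lam, lamR_star, lamC_star, X, Y

# check that the full-dimensional DA model recovers the table exactly
F = np.array([[23., 7., 2.], [11., 14., 5.], [4., 9., 30.]])
lam, lamR, lamC, X, Y = two_step_da(F)
D2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
mu = np.exp(lam + lamR[:, None] + lamC[None, :] - D2)
print(np.allclose(mu, F))  # True: the saturated model reproduces the counts
```

Because Λ̂ is doubly centered, its rank is at most M = min(I, J) − 1, so the truncation to M dimensions loses nothing and the reconstruction is exact.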

Odds ratio and distances
The log odds of a response j against a response j′ for a given class i are given by

log(μ_ij ∕ μ_ij′) = λ_j^{C*} − λ_j′^{C*} + d²(x_i, y_j′) − d²(x_i, y_j).

The odds are a function of both the main effect parameters and the distances. Concerning the distances, the odds are in favor of the closest category. Concerning the main effects, the odds are in favor of the category with the largest λ^{C*} value. The odds ratio can be defined in terms of squared distances as2

log θ_ii′,jj′ = d²(x_i, y_j′) − d²(x_i, y_j) + d²(x_i′, y_j) − d²(x_i′, y_j′),

and if the row category I and the column category J are set as the reference, then the usual set of local odds ratios is obtained, denoted by θ_ij = (μ_ij μ_IJ)∕(μ_iJ μ_Ij), with θ_ij = 1 if and only if d²(x_i, y_j) − d²(x_i, y_J) = d²(x_I, y_j) − d²(x_I, y_J).
When the cross-classified data set involves a binary variable, the baseline category J represents one of the two choices, and the unfolding representation is in one dimension. Then, denoting by J and j the only two column categories in the table, it follows that, for any i, i′ = 1, … , I row categories, O_ii′ = (μ_ij μ_i′J)∕(μ_iJ μ_i′j) = 1 only occurs when both categories i and i′ are represented by the same point in the unfolding configuration (see Appendix B). Hence, from a statistical point of view, a hypothesis test contrasting the equality of the corresponding odds ratios allows us to decide whether the positions of two nearby points can be considered significantly different in terms of association. Therefore, when the response variable is binary, the hypothesis that two row profiles do not differ from each other with respect to their association with the response variable can be formulated in terms of the independence test by considering the null hypothesis H0 : O_ii′ = 1.
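A quick numerical check of this equivalence, with made-up one-dimensional coordinates (not estimates from any data set): expanding the squared distances shows the log odds ratio equals 2(x_i − x_i′)(y_j − y_J), so it vanishes exactly when the two row locations coincide.

```python
import numpy as np

# row locations (profiles) and the two column locations in one dimension;
# the values are purely illustrative
x = np.array([0.8, -0.3, -0.3, 1.1])
yj, yJ = -0.5, 0.9

# squared distance under the one-dimensional DA model
d2 = lambda a, b: (a - b) ** 2

def log_or(i, ip):
    # log odds ratio between rows i and i' for column j against baseline J
    return (d2(x[i], yJ) - d2(x[i], yj)) - (d2(x[ip], yJ) - d2(x[ip], yj))

# identical to 2 (x_i - x_i') (y_j - y_J) after expanding the squares
for i in range(4):
    for ip in range(4):
        assert np.isclose(log_or(i, ip), 2 * (x[i] - x[ip]) * (yj - yJ))

print(np.isclose(log_or(1, 2), 0.0))  # True: equal locations -> OR = 1
```

Since y_j ≠ y_J, the odds ratio equals one precisely when x_i = x_i′, which is the equivalence used above.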

TESTING DIFFERENCES IN DISTANCE ASSOCIATIONS BETWEEN PATIENT PROFILES
The study of patient profiles and their association with a certain disease is of great interest in clinical medicine and preventive medicine. Here we will focus on patient profiles defined from qualitative variables, such as the results of binary diagnostic tests and disease-related covariates. Likewise, we will also consider that the disease status of each patient (disease present or absent) is known by applying a gold standard (GS), which is the medical test that allows us to determine whether or not a patient has the disease. In this context, consider a random sample of n individuals to whom T binary diagnostic tests (BDTs) are applied, and K qualitative covariates A_1, … , A_K are observed in all individuals. Each covariate can take values a_1, a_2, … Let T_t be the binary random variable that models the result of the t-th BDT, with t = 1, … , T, such that T_t = 1 when the test is positive and T_t = 2 when it is negative, and let GS model the result of the gold standard (GS = 1 if the individual has the disease and GS = 2 if not). Let a_{a_1, … , a_K, t_1, … , t_T} and b_{a_1, … , a_K, t_1, … , t_T} be the numbers of diseased and non-diseased individuals, respectively, for which A_1 = a_1, … , A_K = a_K and T_1 = t_1, … , T_T = t_T, with t_t = 1, 2 and t = 1, … , T.
The observed frequencies are the realization of a multinomial distribution with probabilities p_{a_1, … , a_K, t_1, … , t_T} = P(GS = 1, A_1 = a_1, … , T_T = t_T) and q_{a_1, … , a_K, t_1, … , t_T} = P(GS = 2, A_1 = a_1, … , T_T = t_T). The maximum likelihood estimators of these probabilities are the corresponding sample proportions, a_{a_1, … , t_T} ∕ n and b_{a_1, … , t_T} ∕ n. Let π be a vector of dimension 2I whose components are the above probabilities, where I is the number of profiles; for brevity, write p_i and q_i for the probabilities corresponding to the ith profile among the diseased and non-diseased individuals, respectively. Applying the multivariate central limit theorem, it is verified that √n (π̂ − π) converges in distribution to a multivariate normal distribution with mean zero and variance-covariance matrix Σ_π. Since the sample of size n is the realization of a multinomial distribution, the variance-covariance matrix Σ_π is estimated as Σ̂_π = diag(π̂) − π̂ π̂′. Assuming GS = 1 as the baseline category, let O_ii′ denote the odds ratio between profiles i and i′; when O_ii′ = 1, the odds of having the disease is the same for both profiles. It is obvious that O_ii′ = 1 ∕ O_i′i. In terms of the probabilities of the vector π, the odds ratio between profile i and profile i′ is written as O_ii′ = (p_i q_i′) ∕ (q_i p_i′), and its estimator is Ô_ii′ = (p̂_i q̂_i′) ∕ (q̂_i p̂_i′). In this situation, it is of interest to study whether or not the profiles are associated with the disease, and for this reason the following hypothesis test is studied.
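A minimal sketch of these estimators, using illustrative diseased and non-diseased counts a_i, b_i (not the article's data); note that the sample size cancels in the odds ratio:

```python
import numpy as np

# diseased / non-diseased counts for I = 4 illustrative profiles
a = np.array([30, 12, 25, 5])   # a_i: diseased with profile i
b = np.array([10, 40, 15, 55])  # b_i: non-diseased with profile i
n = a.sum() + b.sum()

p_hat = a / n                   # MLE of P(GS = 1, profile i)
q_hat = b / n                   # MLE of P(GS = 2, profile i)

# odds of disease for each profile, and the odds ratio between profiles
odds = p_hat / q_hat
def odds_ratio(i, ip):
    return (p_hat[i] * q_hat[ip]) / (q_hat[i] * p_hat[ip])

print(round(odds_ratio(0, 1), 2))  # (30 * 40) / (10 * 12) = 10.0
```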

Testing independence and degenerate solutions in DA
In general, the distance association model in low dimension is not equivalent to the related log-linear model, except for M = min(I, J) − 1. In addition, the DA model represents associations between row and column categories after the overall, row, and column effects in the cross-classified data set are removed. Hence, a procedure for testing independence in terms of the estimated distances is of great interest, in particular when the DA model is estimated in a dimension lower than M. As noted above, testing the independence between the I profiles and the disease is equivalent to testing that all odds ratios are equal to 1, and therefore that all profiles have the same odds of having the disease. In this situation, independence is equivalent to the location of all profiles being the same in the unfolding representation, that is, to the solution being degenerate.10,12 Here, we focus on the situation in which the associations are represented in full dimension, in which case the log-linear and DA models produce the same expected frequencies, which coincide with the observed frequencies as they are saturated models. Although global independence can be tested using any classical test, or by comparing models in a log-linear framework, here we focus on odds ratio tests and their properties, in line with the identification of degenerate solutions. Having set a baseline profile i, an independence test, or global hypothesis test that all odds ratios are equal to 1, is defined as follows:

H0 : O_ii′ = 1 for all i′ ≠ i vs H1 : O_ii′ ≠ 1 for some i′. (11)

In this hypothesis test, only I − 1 odds ratios are involved, and the null hypothesis means that all odds ratios are equal to 1, regardless of the baseline profile. In this paper, we solve this test by applying a transformation of the odds ratio, specifically the natural logarithm, which is widely used to compare and estimate parameters and whose application to the odds ratio of a 2 × 2 table is well known. Therefore, the hypothesis test (11) is equivalent to the test

H0 : U_i = 0 vs H1 : U_i ≠ 0, (12)

where U_i = (log O_ii′ : i′ ≠ i)′ is the vector of the I − 1 log odds ratios against the baseline profile. As Û_i is a function of the probabilities of π̂, the estimated asymptotic variance-covariance matrix Σ̂_{U_i} is obtained by applying the delta method, that is, Σ̂_{U_i} = (∂U_i∕∂π)′ Σ̂_π (∂U_i∕∂π). Then, the test statistic Q_i = Û_i′ Σ̂_{U_i}^{−1} Û_i is distributed according to Hotelling's T-squared distribution with dimension I − 1 and n degrees of freedom, where I − 1 is the dimension of the vector Û_i. When n is large, the statistic Q_i is distributed according to a central chi-squared distribution with I − 1 degrees of freedom when the null hypothesis is true. Therefore, having set a profile, we will obtain a value for the test statistic of the global test and the corresponding P-value. It can be shown (see Appendix C) that, using the natural logarithm transformation, the test statistic for the global test is the same regardless of the baseline profile set, that is, Q_i = Q_i′ for any i, i′. Setting an α error, if the P-value of the hypothesis test (12) is higher than α, then we do not reject that all of the profiles have the same odds of having the disease, that is, the unfolding configuration is degenerate. For a P-value ≤ α, we accept that at least one odds ratio is different from 1, which means that at least one profile has odds of having the disease different from the rest of the profiles. Faced with this situation, it is necessary to investigate the causes of the significance of the test, for which the usual steps are:

1.
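The global statistic can be sketched as follows. As a simplification of the general delta-method expression, this assumes the familiar inverse-cell-count form of the covariance of log odds ratios under multinomial sampling, with the shared baseline cells giving the common off-diagonal term; the counts are illustrative.

```python
import numpy as np

def global_or_statistic(a, b, baseline=0):
    """Q statistic for the global test that all odds ratios against a
    baseline profile equal 1, on the log scale.

    a, b : diseased / non-diseased counts per profile. The P-value is
    obtained from a chi-squared distribution with I - 1 degrees of
    freedom when n is large.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    others = [k for k in range(len(a)) if k != baseline]
    # U_i: log odds ratios of the baseline profile against the others
    U = np.log((a[baseline] * b[others]) / (b[baseline] * a[others]))
    m = len(others)
    # delta-method covariance: shared baseline cells -> common term
    S = np.full((m, m), 1 / a[baseline] + 1 / b[baseline])
    S[np.diag_indices(m)] += 1 / a[others] + 1 / b[others]
    Q = U @ np.linalg.solve(S, U)
    return Q, m  # statistic and chi-squared degrees of freedom

a = [30, 12, 25, 5]    # diseased counts, illustrative only
b = [10, 40, 15, 55]   # non-diseased counts
Q0, df = global_or_statistic(a, b, baseline=0)
Q1, _ = global_or_statistic(a, b, baseline=1)
print(np.isclose(Q0, Q1))  # True: Q is invariant to the baseline choice
```

The invariance holds because the log odds ratio vectors relative to two different baselines are invertible linear transforms of one another, so the quadratic form is unchanged, in line with the result shown in Appendix C.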
Solving the individual hypothesis tests on the odds ratio between two profiles. This is of special interest in this framework since, as shown in Section 2, it is equivalent to testing whether the distance between the locations of two row profiles differs from zero in the DA model representation, that is, whether the location of the two profiles is the same in the unfolding representation. Hence, we propose to statistically determine whether there are differences between the associations of two row profiles with respect to having the disease or not. To this end, we study the test H0 : O_ii′ = 1 vs H1 : O_ii′ ≠ 1. A Wald test statistic for this hypothesis test is9 z = log Ô_ii′ ∕ √Var̂(log Ô_ii′), which is distributed according to a standard normal distribution when n is large, where Var̂(log Ô_ii′) is obtained by applying the delta method. This test statistic is the classical one used to test that the log-odds ratio is equal to zero.9 A confidence interval for the log-odds ratio can be obtained simply by inverting this test statistic.9 However, it is important to take into account that the formulation of hypothesis tests allows us to know the evidence against the null hypothesis.
The number of individual hypothesis tests that need to be solved is I(I − 1)∕2.
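The individual Wald test can be sketched with the standard delta-method standard error of the log odds ratio, that is, the square root of the summed inverse cell counts (the counts below are illustrative):

```python
import math

def wald_log_or(a_i, b_i, a_ip, b_ip):
    """Wald statistic and two-sided P-value for H0: O_ii' = 1,
    ie, for the log odds ratio being equal to zero."""
    log_or = math.log((a_i * b_ip) / (b_i * a_ip))
    se = math.sqrt(1 / a_i + 1 / b_i + 1 / a_ip + 1 / b_ip)
    z = log_or / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P-value
    return z, p

z, p = wald_log_or(30, 10, 12, 40)
print(p < 0.05)  # an odds ratio of 10 on these counts is significant
```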
2. Adjusting the P-values to control the α error. To adjust the I(I − 1)∕2 P-values obtained by solving the individual tests, we propose using Holm's method,13 which is easy to apply and is less conservative than the classic Bonferroni method.
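Holm's step-down method is straightforward to implement; a minimal sketch:

```python
def holm_adjust(pvals):
    """Holm step-down adjustment of a list of P-values.

    Sort ascending, multiply the k-th smallest by (m - k), and enforce
    monotonicity so adjusted values never decrease along the ordering.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted

print([round(v, 4) for v in holm_adjust([0.01, 0.02, 0.30])])
# [0.03, 0.04, 0.3]
```

An individual test is declared significant at level α when its adjusted P-value is at most α, which controls the familywise α error over the I(I − 1)∕2 comparisons.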
The development for the clinical situation in which the diagnosis of the disease is based on the application of a BDT and on the observation of two binary covariates is shown in Appendix D.

SIMULATION EXPERIMENTS
Monte Carlo simulation experiments have been carried out to study the type I errors and the powers of the independence test based on the odds ratios (Q statistic), together with the Pearson chi-squared test of independence (χ² statistic) and the likelihood ratio test (G² statistic).9 The value I = 8 was considered, as in the situation studied in Appendix D, for example, when considering two BDTs and a binary covariate, or three BDTs. As usual, the value 0.5 was added to the entire table when at least one cell had zero frequency (a situation in which none of the three independence tests can be applied). These experiments consisted of the generation of 10 000 multinomial random samples of size n = {125, 150, 200, 300, 400, 500, 1000, 2000, 5000} (for n ≤ 100 it was not possible to generate enough random samples under the conditions necessary14,15 to apply the classical tests), whose probabilities were calculated as follows: 1. For Σ p_rst, the values {0.20, 0.40, 0.60, 0.80} were considered, which also set the values of Σ q_rst (since Σ p_rst + Σ q_rst = 1), with p_rst = Σ p_rst ∕ 8 and q_rst = (Σ q_rst − q_111) ∕ 7, respectively. Therefore, it is considered that the p_rst (q_rst) are all equal, so that in each profile the odds ratios are all equal (for i′ > i), which considerably simplifies the dimension of the problem.
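A reduced version of this experiment (fewer replications, a single sample size, and an assumed chi-squared critical value of 14.067 for 7 degrees of freedom) can be sketched as follows; under independence, the empirical rejection rate of the Q test should sit near the 5% nominal level:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_statistic(a, b, baseline=0):
    # global log-odds-ratio statistic against a baseline profile
    a, b = np.asarray(a, float), np.asarray(b, float)
    others = [k for k in range(len(a)) if k != baseline]
    U = np.log((a[baseline] * b[others]) / (b[baseline] * a[others]))
    m = len(others)
    S = np.full((m, m), 1 / a[baseline] + 1 / b[baseline])
    S[np.diag_indices(m)] += 1 / a[others] + 1 / b[others]
    return U @ np.linalg.solve(S, U)

I, n, reps = 8, 1000, 2000
probs = np.full(2 * I, 1 / (2 * I))  # independence: all odds ratios are 1
crit = 14.067                        # approx. chi-squared 95th pct, 7 df
rejections = 0
for _ in range(reps):
    counts = rng.multinomial(n, probs).astype(float)
    if (counts == 0).any():          # continuity correction, as in the text
        counts += 0.5
    rejections += q_statistic(counts[:I], counts[I:]) > crit
rate = rejections / reps
print(0.01 < rate < 0.10)  # empirical type I error near the nominal 5%
```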
For the type I error, all of the odds ratios are equal to 1; for the power, we considered that in the first profile (i = (1, 1, 1)) all of the odds ratios are equal to each other and greater than 1 (O_1i′ > 1 with i′ = 2, … , 8), and that in the rest of the profiles O_ii′ = 1 with i = 2, … , 7 and i′ = i + 1, … , 8. As the nominal error, α = 5% has been considered. Table 1 shows the results obtained for some of the scenarios considered. The Q test presents type I error values that increase with the sample size and are very close to the nominal error for n ≥ 500 or n ≥ 1000, depending on the scenario. The Q test is somewhat more conservative than the χ² and G² tests (mainly for n ≤ 400), but it does not exceed the nominal error in excess for any sample size in all the scenarios considered. The χ² test and the G² test greatly exceed the nominal error in some situations, especially for n = 125-150 (also for n = 200 in the G² test), and therefore both hypothesis tests can give rise to too many false significances and should not be used for these small sample sizes, as expected. In general, the χ² and G² tests are somewhat less conservative than the Q test for 200 ≤ n ≤ 400, and the type I errors of all three are very similar when n ≥ 500.
Table 2 shows the results obtained for the powers, indicating the values of the odds ratio of the first profile and the values of the probabilities of the multinomial distribution. No results are shown for n = 125-150 in two of the scenarios because it was not possible to generate samples meeting the conditions required to apply the chi-squared test. In all the hypothesis tests and under the same sample size, the power increases as the values of the odds ratio of the first profile increase.
For the Q test, in general, a sample size between moderate (n = 125) and large (n ≥ 500) is needed for the power to be high (greater than 80%), depending on the scenario. The χ² and G² tests are a little more powerful than the Q test when n ≤ 150-200 (depending on the scenario) because their type I errors are also larger (they can greatly exceed the nominal error) than those of the Q test. In general terms, there is no important difference (less than 1% on average) between the powers of the three methods when n ≥ 300-400, and it is necessary to have a sample size between moderate (n = 125) and large (n ≥ 500) for the power to be high (over 80%), depending on the values of the odds ratios.
From the results of the Monte Carlo experiments, it follows that the Q test has adequate asymptotic behavior for practical application: its type I error does not exceed the nominal error, and its power is high when the sample size is not excessively large (depending on the values of the odds ratios). The χ² and G² tests should not be applied when n = 125-150 (moderate sizes), since their type I errors can exceed the nominal error, giving rise to too many false significances. When the sample size is large, the three methods have very similar asymptotic behavior.

ILLUSTRATIVE EXAMPLE
To illustrate the performance of the model, we have analyzed a data set from Weiner et al16 on the diagnosis of coronary disease. In particular, we focus here on the exercise stress test (EST) for the diagnosis of coronary disease, using coronary arteriography as the GS. The proposed model has been implemented in R, and the script and data set to reproduce this application are available as supplementary material. Sub-index r refers to gender (r = 1 for men and r = 2 for women), sub-index s refers to the RST-TW (s = 1 abnormal and s = 2 normal), and sub-index t refers to the result of the EST (t = 1 positive and t = 2 negative).
The estimated expected values obtained when applying the proposed two-step procedure in one dimension, along with the estimated values of the row, column, and overall effects, are shown in Table 3. The saturated model (3) perfectly recovers the observed values, as is to be expected. Figure 1 shows the one-dimensional representation of the associations between the profiles in terms of having the disease or not (displayed in two dimensions for easy viewing). As can be appreciated: (1) profiles (1,2,1) (man with normal RST-TW and positive EST) and (1,1,1) (man with abnormal RST-TW and positive EST) are the two profiles most closely associated with having coronary disease; (2) profiles (2,1,2) (woman with abnormal RST-TW and negative EST) and (2,2,2) (woman with normal RST-TW and negative EST) are the profiles most closely associated with not having coronary disease; (3) the rest of the profiles are associated in a similar way with having or not having the disease, and to a similar degree.
The test statistic of the independence test is Q = 524.5237, with P-value ≈ 0. If we set α = 5%, then we reject the null hypothesis that all of the log-odds ratios are equal to 0 (ie, that all of the odds ratios are equal to 1). Since the independence test is significant, the unfolding solution for the representation of the associations is not degenerate. Table 4 shows the estimates of the odds ratios between each pair of profiles and the results of the pairwise comparisons (test statistics, P-values, and adjusted P-values). For α = 5%, the adjusted P-values using Holm's method indicate that: 1. The odds of having coronary disease in profile 1 (man with abnormal RST-TW and positive EST) is significantly greater than in profiles 2 (man with abnormal RST-TW and negative EST), 4 (man with normal RST-TW and negative EST), 5 (woman with abnormal RST-TW and positive EST), 6 (woman with abnormal RST-TW and negative EST), 7 (woman with normal RST-TW and positive EST), and 8 (woman with normal RST-TW and negative EST). We do not reject that the odds of having coronary disease in profile 1 is equal to the odds of having the disease in profile 3 (man with normal RST-TW and positive EST). Therefore, a man with positive EST (whatever the status of the RST-TW) has odds of having coronary disease significantly greater than the rest of the profiles. 2. The odds of having the disease in profile 2 (man with abnormal RST-TW and negative EST) is significantly lower than the odds of having the disease in profile 3 (man with normal RST-TW and positive EST), and significantly higher than in profiles 6 (woman with abnormal RST-TW and negative EST) and 8 (woman with normal RST-TW and negative EST). We do not reject that the odds of having coronary disease in profile 2 is equal to the odds of having the disease in profiles 4 (man with normal RST-TW and negative EST), 5 (woman with abnormal RST-TW and positive EST), and 7 (woman with normal RST-TW and positive EST). 3.
The odds of having the disease in profile 3 are significantly higher than the odds of having the disease in profiles 2, 4, 5, 6, 7, and 8. 4. The odds of having the disease in profile 4 are significantly higher than the odds of having the disease in profiles 6 and 8. We do not reject that the odds of having coronary disease in profile 4 are equal to the odds of having the disease in profiles 5 and 7. 5. The odds of having the disease in profile 5 are significantly higher than the odds of having the disease in profiles 6 and 8. Therefore, a woman with abnormal RST-TW and positive EST has greater odds of having coronary disease than a woman with negative EST (whether or not the RST-TW is normal). We do not reject that the odds of having the disease in profile 5 are equal to the odds of having the disease in profile 7, and therefore we do not reject that a woman with positive EST has the same odds of having coronary disease whatever the status of the RST-TW. 6. The odds of having the disease in profile 6 are significantly lower than the odds of having the disease in profile 7, and therefore a woman with abnormal RST-TW and negative EST has lower odds than a woman with normal RST-TW and positive EST. We do not reject that the odds of having coronary disease are equal between profiles 6 and 8, and therefore we do not reject that a woman with negative EST has the same odds of having coronary disease whatever the status of the RST-TW. 7. The odds of having the disease in profile 7 are significantly higher than the odds of having the disease in profile 8, and therefore a woman with normal RST-TW and positive EST has greater odds of having the disease than a woman with normal RST-TW and negative EST.

TABLE 5
Therefore, the results obtained lead to the following conclusions: 1. Profiles 1 and 3 (non-significant test between them) both have greater odds of having the disease than the rest of the profiles, and therefore they are the profiles with the greatest risk of coronary disease. 2. Profiles 6 and 8 (non-significant test between them) are those with the lowest odds of having the disease, and therefore they are the profiles with the lowest risk of coronary disease. 3. For profiles 2, 4, 5, and 7, the individual tests between them are not significant and, therefore, there is no evidence of any differences regarding their positions in the unfolding representation, and thus in their associations with coronary disease.
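The pairwise comparisons above can be sketched numerically. The following minimal Python sketch (with hypothetical profile counts, not the Weiner et al. data) computes a Wald test for the log-odds ratio between two profiles, applies the 0.5 correction for zero cells mentioned later in the text, and adjusts the P-values with Holm's method:

```python
# Sketch of the pairwise odds-ratio tests with Holm adjustment.
# The profile counts below are hypothetical, not the Weiner et al. data.
from math import erf, log, sqrt

def log_or_test(a, b, c, d):
    """Wald test that the odds ratio of the 2x2 table [[a, b], [c, d]] is 1.
    If any cell is zero, 0.5 is added to every cell."""
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    u = log(a * d / (b * c))                  # log odds ratio
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # asymptotic standard error
    z = u / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal P-value
    return u, p

def holm(pvals):
    """Holm's step-down adjustment of a list of P-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted

# (diseased, non-diseased) counts for three hypothetical profiles
profiles = [(80, 20), (30, 70), (75, 25)]
pvals = [log_or_test(profiles[i][0], profiles[i][1],
                     profiles[j][0], profiles[j][1])[1]
         for i in range(len(profiles)) for j in range(i + 1, len(profiles))]
adjusted = holm(pvals)
```

Each adjusted P-value is then compared with α, exactly as in the seven-point discussion above.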
Regarding the disease prevalence and the parameters of the EST, Table 5 shows their estimations in the different patterns of the covariates. Applying the equations of Appendix D, we obtain the estimations of the odds ratios given in Table 4. For those profiles which are most associated with having the disease, it is obtained that: (a) the odds of a man with an abnormal RST-TW and a positive EST having the disease are 6.4, and (b) the odds of a man with a normal RST-TW and a positive EST having the disease are approximately 7.4. In both cases, the odds are calculated as PPV_rs / (1 − PPV_rs), since the result of the BDT is positive in the two profiles. The rest of the odds are interpreted in a similar way.

DISCUSSION
In this paper, a two-step log-linear procedure to estimate and represent associations in full dimension is proposed. The log-linear model is shown to be equivalent to a distance association model in full dimension, after an appropriate parameterization, which enables the direct representation of associations between rows and columns in a two-way contingency table resulting from cross-classified data sets. When one of the two variables involved in the contingency table is binary, as is usually the case in the diagnosis of a disease, the relation between independence and equality of location of the points in the unfolding representation is shown. Therefore, the interpretation of close positions in the representation can be made in terms of equal degrees of association from a statistical point of view. This also enables the statistical identification of a degenerate solution in unfolding in terms of the odds ratio.
In general, the low-dimensional distance association model is not equivalent to the related log-linear model. Therefore, a procedure is considered for testing independence in terms of the observed frequencies, these being the expected values estimated by the DA model in full dimension; it is based on testing that the values of the odds ratio are all equal to one. The test statistic Q is distributed asymptotically according to a chi-squared distribution with I − 1 degrees of freedom, since only I − 1 odds ratios must be considered, and its performance is compared with that of classical independence tests. From the results obtained in the simulation experiments, it can be appreciated that, in general, the proposed Q statistic has good asymptotic behavior, its performance being better than that of the classical tests for moderate samples.
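A quadratic-form statistic of this kind can be sketched for a 2 × I table. The sketch below assumes, for illustration, that Q = Û^T Σ̂^{-1} Û is formed from the I − 1 baseline log-odds ratios with the standard asymptotic (inverse-count) covariance matrix; the counts are hypothetical, and the paper's own estimation is via the saturated model:

```python
# Minimal sketch of a global Q statistic for a 2 x I table of
# (diseased, non-diseased) counts; profile 1 acts as baseline.
from math import log

def q_statistic(counts):
    """Q = U^T Sigma^{-1} U over the I - 1 baseline log odds ratios."""
    (a1, b1), rest = counts[0], counts[1:]
    u = [log(a1 * b / (b1 * a)) for a, b in rest]  # baseline log odds ratios
    base = 1 / a1 + 1 / b1                         # shared covariance term
    sigma = []
    for i, (ai, bi) in enumerate(rest):
        row = [base + (1 / ai + 1 / bi if i == j else 0.0)
               for j in range(len(rest))]
        sigma.append(row)
    x = solve(sigma, u)
    return sum(ui * xi for ui, xi in zip(u, x)), len(rest)  # (Q, df = I - 1)

def solve(mat, vec):
    """Gaussian elimination with partial pivoting (solves mat @ x = vec)."""
    n = len(vec)
    m = [row[:] + [vec[i]] for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

q, df = q_statistic([(45, 55), (70, 30), (20, 80), (50, 50)])  # hypothetical
```

Q is then referred to a chi-squared distribution with df = I − 1 degrees of freedom.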
Consideration of the natural logarithmic transformation of the odds ratio for the test statistic ensures its invariance in the face of a change in the baseline profile. Although other transformations have been proposed in this framework, such as the inverse hyperbolic sine or the inverse sine,17,18 it is easy to show that these are not invariant under a change in the baseline profile, which is a clear drawback.
To determine the causes of a significant test, individual hypothesis tests on each odds ratio have been proposed, using Holm's method to adjust the P-values obtained. The situation of independence is not precisely what is of interest here, since we are mainly focused on the analysis of associations. Indeed, to analyze global independence, the classic tests based on the chi-squared distribution (Pearson and likelihood ratio) can also be used, as well as Fisher's exact test, all of them based on the observed frequencies since the model is saturated. Nevertheless, there are certain aspects that should be highlighted in this regard. The hypothesis tests based on the chi-squared distribution, that is, the χ² test and the G² test, require certain conditions on the expected frequencies in order to be applied, which means that their use is somewhat limited for small or even moderate samples (see, for instance, Cochran 14 and McDonald 15). On the other hand, when the test is significant, the investigation of the causes of significance is done by partitioning the table, which is computationally expensive as the number of profiles increases. Although, when partitioning the table of observed frequencies into I − 1 subtables, the test statistic G² is the sum of the I − 1 subtable G² statistics, this property holds neither for the Q statistic nor for the χ² statistic.
The same occurs with Fisher's exact test, which may even be computationally unfeasible, as may the subsequent investigation of the causes of the lack of independence. The global test based on the odds ratio has advantages over the previous tests in this framework: (a) it can always be applied (for zero frequencies it is enough to simply add 0.5 to each cell), and (b) if the test is significant, the causes of the significance can be investigated simply and quickly, since all the parameters involved have already been estimated when solving the global test.
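The exact additivity of G² under partitioning can be checked numerically. The following minimal Python sketch uses a hypothetical I × 2 table and a Lancaster-style partition in which subtable k compares the first k rows (collapsed) against row k + 1:

```python
# Sketch: G^2 is exactly additive over a Lancaster-style partition of an
# I x 2 table into I - 1 subtables; the counts are hypothetical.
from math import log

def g2(table):
    """Likelihood-ratio statistic G^2 for a two-way table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return 2 * sum(o * log(o * n / (rows[i] * cols[j]))
                   for i, r in enumerate(table)
                   for j, o in enumerate(r) if o > 0)

table = [[30, 10], [20, 25], [15, 40], [5, 35]]

# subtable k: first k rows collapsed versus row k + 1
parts = [g2([[sum(t[c] for t in table[:k]) for c in (0, 1)], table[k]])
         for k in range(1, len(table))]
```

Summing `parts` recovers `g2(table)` exactly; as noted above, the analogous decomposition does not hold for the Q or χ² statistics.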
In particular, the clinical situation in which the profiles are composed of two binary covariates and a BDT has been described in detail, given its great interest in medical research. In this situation, the relation of the odds ratio for two profiles with measures of the quality of the BDT in the different patterns of covariates is established. In addition, simulation experiments have been carried out to study the asymptotic behavior of the independence test for the case of eight profiles, and the hypothesis test showed good performance both in terms of type I error and power.
An interesting topic to investigate is the relationship between equality of odds ratios and location of points when the model is estimated in low dimension. This is particularly interesting for tables in which the large number of profiles induces the use of a combined latent class distance association model such as the LCDA.3 In addition, given the relationship with the RC(M) model, the study of this methodology in the RC(M) framework in low dimension, and also in related models collapsing categories such as the Kateri and Iliopoulos model,19 is of great interest.
• The two column categories are located at the ends of the graph, for example, J, i, i′, j. Then,
• There is a column category located between the two row categories, for example, i, j, i′, J. Then,
• Row categories are placed together, as are column categories, for example, i, i′, j, J. Then,

APPENDIX C
To simplify the demonstration, let us suppose that the global hypothesis test is solved taking profile 1 as the baseline profile; then Û_1 = (Û_12, Û_13, …, Û_1I)^T and the test statistic is

Q_1 = Û_1^T Σ̂_1^{-1} Û_1,

where Σ̂_1 is the estimated variance-covariance matrix of Û_1. Let us now consider that the global hypothesis test is solved taking profile 2 as the baseline profile. As the odds ratios verify that O_ii = 1 and that O_{ii′} = O_{1i′} / O_{1i}, applying these properties, vector U_2 is written in terms of the components of vector U_1 as

U_2 = (U_21, U_23, …, U_2I)^T = (−U_12, U_13 − U_12, …, U_1I − U_12)^T.

Then, the variance-covariance matrix of Û_2 can be estimated from the variance-covariance matrix of Û_1 applying the delta method, that is,

Σ̂_2 = (∂U_2 / ∂U_1) Σ̂_1 (∂U_2 / ∂U_1)^T,

where ∂U_2 / ∂U_1 is a matrix of dimension (I − 1) × (I − 1) whose elements are constant. In this matrix, the elements of the first column are equal to −1, the rest of the elements in the main diagonal are equal to 1, and all of the other elements of the matrix are equal to 0. It is easy to verify that

(∂U_2 / ∂U_1)^{-1} = ∂U_2 / ∂U_1 and (∂U_2 / ∂U_1) U_1 = U_2.

Then the test statistic for the global test is

Q_2 = Û_2^T Σ̂_2^{-1} Û_2 = Û_1^T (∂U_2 / ∂U_1)^T [(∂U_2 / ∂U_1) Σ̂_1 (∂U_2 / ∂U_1)^T]^{-1} (∂U_2 / ∂U_1) Û_1 = Û_1^T Σ̂_1^{-1} Û_1 = Q_1.

The demonstration is similar for the rest of the profiles i ≥ 3. In general, for any other profiles i and i′, it holds that

O_ii = 1 and O_{i′j} = O_{ij} / O_{ii′},

and the elements of U_{i′} are written in terms of the elements of U_i as

U_{i′j} = U_{ij} − U_{ii′} for j ≠ i, i′, and U_{i′i} = −U_{ii′}.
The matrix of partial derivatives ∂U_{i′} / ∂U_i has the following elements: the elements of the column corresponding to U_{ii′} are all equal to −1, the rest of the elements in the main diagonal are all equal to 1, and the rest of the elements in the matrix are all equal to 0. A matrix of this type always verifies that ∂U_{i′} / ∂U_i = (∂U_{i′} / ∂U_i)^{-1} and that (∂U_{i′} / ∂U_i)^{-1} U_{i′} = U_i.
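The argument above can be illustrated numerically. A small Python sketch with hypothetical (diseased, non-diseased) counts checks that the baseline-change matrix described in this appendix is its own inverse and maps the baseline-1 log-odds ratios onto the baseline-2 ones:

```python
# Sketch: the baseline-change matrix dU2/dU1 of Appendix C is involutory
# and maps the baseline-1 log odds ratios onto the baseline-2 ones.
# The (diseased, non-diseased) counts are hypothetical.
from math import log

counts = [(40, 60), (25, 75), (55, 45), (10, 90)]
I = len(counts)

def lor(i, j):
    """Log odds ratio between profiles i and j (0-based indices)."""
    a, b = counts[i]
    c, d = counts[j]
    return log(a * d / (b * c))

u1 = [lor(0, j) for j in range(1, I)]          # baseline: profile 1
u2 = [lor(1, j) for j in range(I) if j != 1]   # baseline: profile 2

# first column all -1, remaining diagonal 1, everything else 0
A = [[-1.0 if c == 0 else (1.0 if r == c else 0.0)
      for c in range(I - 1)] for r in range(I - 1)]

u2_mapped = [sum(A[r][c] * u1[c] for c in range(I - 1)) for r in range(I - 1)]
AA = [[sum(A[r][k] * A[k][c] for k in range(I - 1)) for c in range(I - 1)]
      for r in range(I - 1)]  # A @ A should be the identity
```

Since A is its own inverse, the quadratic form Q is unchanged by the change of baseline, as the derivation shows.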

APPENDIX D
In clinical practice, it is common for the diagnosis of a disease to be made based on the result of a BDT and the observation of binary covariates (eg, sex, family history, the presence of a risk factor, etc.). Here, we will consider that two binary covariates are observed, although extending this to more than two covariates is simple. Moreover, the extension to two (or more) BDTs is also simple, and in that case it is necessary to consider the covariance between the two (or more than two) BDTs when calculating the probabilities of the profiles. In the situation of a single BDT and two binary covariates (and also with a single covariate), the odds ratio between two profiles is written, as detailed further on, in terms of measures of the quality of the BDT in each pattern of covariates (or of the single covariate in that situation).
TABLE D1
Profiles and observed frequencies when I = 8.

Let us consider that a gold standard and a BDT are applied to all of the n individuals in a random sample, and that for all of these individuals we observe two binary covariates. This situation leads to the profiles and the frequencies given in Table D1. The theoretical probabilities are defined as p_rst = P(GS = 1, A_1 = r, A_2 = s, T = t) and q_rst = P(GS = 2, A_1 = r, A_2 = s, T = t), with r, s, t = 1, 2, verifying that Σ_{r,s,t} p_rst + Σ_{r,s,t} q_rst = 1. These probabilities are expressed in terms of the sensitivity and the specificity of the BDT in each pattern of covariates. The above 16 probabilities are arranged in vector form as (p_111, p_112, …, q_221, q_222)^T, and the odds ratio between profile i = (r, s, t) and profile i′ = (r′, s′, t′) is expressed as follows:

O_{ii′} = (p_rst / q_rst) / (p_{r′s′t′} / q_{r′s′t′}).
The odds that profile i = (r, s, t) has the disease when the BDT is positive are PPV_rs / (1 − PPV_rs), and the odds that it has the disease when the BDT is negative are (1 − NPV_rs) / NPV_rs, where NPV_rs is the negative predictive value of the BDT when A_1 = r and A_2 = s. Based on these expressions, the odds ratio between profile i and profile i′ is written in terms of the parameters of performance of the BDT as follows:

FIGURE 1 Representation of associations of the profiles of individuals regarding coronary disease.

TABLE 4 Multiple comparisons in the study of Weiner et al.
where LR⁺_rs = Se_rs / (1 − Sp_rs) is the positive likelihood ratio of the BDT in the pattern of covariates A_1 = r and A_2 = s, and LR⁻_rs = (1 − Se_rs) / Sp_rs is the negative likelihood ratio of the BDT in the pattern of covariates A_1 = r and A_2 = s. Writing the disease prevalence in that pattern as π_rs, the product (π_rs / (1 − π_rs)) × LR⁺_rs is the post-test odds when the BDT is positive in the pattern of covariates A_1 = r and A_2 = s, that is,

(π_rs / (1 − π_rs)) × LR⁺_rs = PPV_rs / (1 − PPV_rs),

where PPV_rs is the positive predictive value of the BDT when A_1 = r and A_2 = s. In an analogous way, the product (π_rs / (1 − π_rs)) × LR⁻_rs is the post-test odds when the BDT is negative and A_1 = r and A_2 = s, that is,

(π_rs / (1 − π_rs)) × LR⁻_rs = (1 − NPV_rs) / NPV_rs,
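These post-test odds identities follow from Bayes' theorem and can be verified with a quick numerical check; the prevalence, sensitivity, and specificity values below are hypothetical:

```python
# Numerical check of the post-test odds identities: pre-test odds times the
# likelihood ratio equals the odds implied by the predictive values.
# The prevalence, sensitivity, and specificity below are hypothetical.
from math import isclose

pi, se, sp = 0.3, 0.85, 0.9          # prevalence, Se_rs, Sp_rs in one pattern
lr_pos = se / (1 - sp)               # LR+ = Se / (1 - Sp)
lr_neg = (1 - se) / sp               # LR- = (1 - Se) / Sp

# predictive values via Bayes' theorem
ppv = pi * se / (pi * se + (1 - pi) * (1 - sp))
npv = (1 - pi) * sp / ((1 - pi) * sp + pi * (1 - se))

pos_odds = pi / (1 - pi) * lr_pos    # post-test odds, positive BDT
neg_odds = pi / (1 - pi) * lr_neg    # post-test odds, negative BDT

assert isclose(pos_odds, ppv / (1 - ppv))
assert isclose(neg_odds, (1 - npv) / npv)
```

With these identities, the odds ratio between two profiles reduces to a ratio of the corresponding post-test odds, as used in the real example above.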
TABLE 2
Study of Weiner et al.

TABLE 3
Estimations of the parameters.

Table 2 shows the results obtained when applying the two medical tests to a sample of 2045 people, in which the row profiles are formed by the categories of the variables sex, Resting ST's & T-Waves (RST-TW), and EST, in this order. Therefore, the sub-index r of the table refers to sex (r = 1 for men and r = 2 for women).